Multiprocessing in Python: web scraping doesn't speed up
I use the multiprocessing module to speed up web scraping. The goal is to extract part of the HTML on each page and save it in a parent variable, and finally to write that variable to a file.
The problem I have is that it takes around 1 second to process each page.
My code works, but not the way I want:
    import urllib.request
    from bs4 import BeautifulSoup
    from multiprocessing.dummy import Pool  # thread-based Pool
    from multiprocessing import cpu_count

    def parseweb(url):
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page)
        h2_tag = soup.find('h2', class_='midashigo')
        return h2_tag

    if __name__ == '__main__':
        file = 'links.txt'  # each link is on a separate line
        pool = Pool(cpu_count() * 2)
        with open(file, 'r') as f:
            results = pool.map(parseweb, f)
        with open('output.txt', 'w', encoding='utf-8') as w:
            w.write(str(results))
How can this be modified to use the full power of multiprocessing? Thank you.
This process should be I/O bound, meaning the bottleneck should be how fast you can pull pages down over the connection before parsing, though in practice it may turn out to be CPU or memory bound.
The first thing you need to realise is that multithreading/multiprocessing is not going to speed up the parsing time of any individual page. If one page takes 1 second, then 420,000 pages take 420,000 seconds. If your number of threads is twice the number of cores your PC has, and your PC has 4 cores, you will have 8 threads each spending 1 second per page. You still end up with 420,000 / 8 seconds, which is 875 minutes (in practice this is not entirely true), i.e. 14.5 hours' worth of processing...
For the time span to become manageable you would need around 400 threads, which brings the processing time down to a theoretical 17-odd minutes.
With that many threads running, and that many pages being parsed, memory is going to become a problem as well.
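To tie those two points together, here is a minimal sketch (not your code verbatim) of how the pool from the question could be scaled up while keeping memory in check. The pool size of 400 is just the illustrative figure from the estimate above, and imap_unordered writes each result to disk as soon as it arrives instead of collecting all 420,000 results into one list. Error handling and rate limiting are omitted.

    import urllib.request
    from bs4 import BeautifulSoup
    from multiprocessing.dummy import Pool  # thread-based Pool

    def parseweb(url):
        # lines read from a file keep their trailing newline, so strip it
        page = urllib.request.urlopen(url.strip())
        soup = BeautifulSoup(page, 'html.parser')
        h2_tag = soup.find('h2', class_='midashigo')
        return str(h2_tag)

    if __name__ == '__main__':
        pool = Pool(400)  # illustrative size from the estimate above
        with open('links.txt', 'r') as f, \
             open('output.txt', 'w', encoding='utf-8') as w:
            # imap_unordered yields each result as soon as it is ready,
            # so the full result set never has to sit in memory at once
            for tag in pool.imap_unordered(parseweb, f):
                w.write(tag + '\n')
        pool.close()
        pool.join()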
I slapped together a little app to test the timings:
    from time import sleep
    from multiprocessing.dummy import Pool
    from multiprocessing import cpu_count

    def f(x):
        sleep(1)
        x = int(x)
        return x * x

    if __name__ == '__main__':
        pool = Pool(cpu_count() * 100)
        with open('input.txt', 'r') as i:
            results = pool.map(f, i)
        with open('output.txt', 'w') as w:
            w.write(str(results))
With an input file of the numbers 1 to 420,000, the process took 1053.39 seconds (roughly 17.5 minutes). That is not a reliable indicator of how long it will take you, since with the memory and I/O bound issues mentioned above you will most likely end up slower.
The bottom line is: if you are not maxing out your CPU, RAM or network I/O, your thread pool is too small.
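If you want to verify whether you are actually maxing anything out while the pool runs, a rough monitor like the following can help. It assumes the third-party psutil package is installed (pip install psutil); run it in a separate terminal alongside the scraper and stop it with Ctrl-C.

    import psutil  # third-party: pip install psutil

    # rough utilisation monitor: print CPU, RAM and network receive rate
    last_recv = psutil.net_io_counters().bytes_recv
    while True:
        cpu = psutil.cpu_percent(interval=1)  # blocks for 1 second
        ram = psutil.virtual_memory().percent
        recv = psutil.net_io_counters().bytes_recv
        print('cpu {:5.1f}%  ram {:5.1f}%  net {:8.1f} KiB/s'.format(
            cpu, ram, (recv - last_recv) / 1024))
        last_recv = recv

If the CPU figure sits near 100%, parsing is your bottleneck; if it idles while the network rate is low, the pool has room to grow.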