How to do it...

  1. Write the following script, speed_up_step1.py, which is based on crawling_web_step1.py. Only the most relevant parts are shown here; the full code is available on GitHub in the Chapter03 directory: https://github.com/PacktPublishing/Python-Automation-Cookbook/blob/master/Chapter03/speed_up_step1.py
...
def process_link(source_link, text):
    ...
    # Return the source link too, so the caller knows which page was processed
    return source_link, get_links(parsed_source, page)
...

def main(base_url, to_search, workers):
    checked_links = set()
    to_check = [base_url]
    max_checks = 10

    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        while to_check:
            # Submit every pending link to the pool of worker threads
            futures = [executor.submit(process_link, url, to_search)
                       for url in to_check]
            to_check = []
            # Handle each result as soon as its future completes
            for data in concurrent.futures.as_completed(futures):
                link, new_links = data.result()

                checked_links.add(link)
                for link in new_links:
                    if link not in checked_links and link not in to_check:
                        to_check.append(link)

                # Stop after a fixed number of processed pages
                max_checks -= 1
                if not max_checks:
                    return


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    ...
    parser.add_argument('-w', type=int, help='Number of workers',
                        default=4)
    args = parser.parse_args()

    main(args.u, args.p, args.w)
  2. Notice the differences in the main function. It takes an extra parameter, the number of concurrent workers, and the process_link function now returns the source link along with the links it found.
  3. Run the crawling_web_step1.py script to get a time baseline. The output has been removed here for clarity:
$ time python crawling_web_step1.py http://localhost:8000/
... REMOVED OUTPUT
real 0m12.221s
user 0m0.160s
sys 0m0.034s
  4. Run the new script with one worker. It is slower than the original:
$ time python speed_up_step1.py -w 1
... REMOVED OUTPUT
real 0m16.403s
user 0m0.181s
sys 0m0.068s
  5. Increase the number of workers:
$ time python speed_up_step1.py -w 2
... REMOVED OUTPUT
real 0m10.353s
user 0m0.199s
sys 0m0.068s
  6. Adding more workers decreases the time further:
$ time python speed_up_step1.py -w 5
... REMOVED OUTPUT
real 0m6.234s
user 0m0.171s
sys 0m0.040s
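
The trend in the timings above can be reproduced without a web server. The following is a minimal sketch, not the book's script: fake_process_link is a hypothetical stand-in for process_link that sleeps for an assumed 0.05 seconds instead of downloading a page, and timed_run is an illustrative helper that mirrors the submit and as_completed pattern used in main. The exact figures will vary from machine to machine.

```python
import concurrent.futures
import time


def fake_process_link(source_link, delay=0.05):
    """Hypothetical stand-in for process_link: sleeps instead of downloading."""
    time.sleep(delay)  # simulates waiting on the network
    # Return the source alongside the result, because as_completed() yields
    # futures in completion order, not in submission order.
    return source_link, len(source_link)


def timed_run(urls, workers):
    """Process every URL with the given number of threads.

    Returns a tuple of (elapsed seconds, {link: result}).
    """
    start = time.perf_counter()
    results = {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(fake_process_link, url) for url in urls]
        for future in concurrent.futures.as_completed(futures):
            link, value = future.result()
            results[link] = value
    return time.perf_counter() - start, results


if __name__ == '__main__':
    # Assumed example URLs; nothing is actually requested
    urls = [f'http://localhost:8000/page{n}' for n in range(10)]
    for workers in (1, 2, 5):
        elapsed, _ = timed_run(urls, workers)
        print(f'{workers} worker(s): {elapsed:.2f}s')
```

Because each simulated task spends its time waiting rather than computing, the threads overlap their waits and the elapsed time should drop as workers are added. The returned source link is what lets the caller match each result to its page when results arrive out of order.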