Performance

To better understand how increasing the number of threads and processes affects download time, here is a table of results for crawling 1000 web pages:

Script       Number of threads   Number of processes   Time         Comparison with sequential
Sequential   1                   1                     28m59.966s   1
Threaded     5                   1                     7m11.634s    4.03
Threaded     10                  1                     3m50.455s    7.55
Threaded     20                  1                     2m45.412s    10.52
Processes    5                   2                     4m2.624s     7.17
Processes    10                  2                     2m1.445s     14.33
Processes    20                  2                     1m47.663s    16.16

The last column shows the speedup relative to the base case of sequential downloading, that is, the sequential time divided by the time taken by that script. We can see that the increase in performance is not linearly proportional to the number of threads and processes; instead the gains diminish and appear roughly logarithmic. For example, one process with five threads gives about 4X better performance, but 20 threads gives only about 10X. Each extra thread helps, but is less effective than the previously added thread. This is to be expected, considering the process has to switch between more threads and can devote less time to each. Additionally, the bandwidth available for downloading is limited, so eventually adding more threads will not lead to a greater download rate. At that point, achieving greater performance would require distributing the crawl across multiple servers, all pointing to the same MongoDB queue instance.
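As a quick check on the last column, the following is a minimal sketch of how the comparison figures can be recomputed from the recorded times. The names to_seconds, RESULTS, and speedups are hypothetical helpers introduced only for this illustration and are not part of the crawler scripts; the timings are copied from the results above.

```python
import re


def to_seconds(timing):
    """Convert a time string such as '7m11.634s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", timing)
    if match is None:
        raise ValueError("unrecognised timing: %r" % timing)
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# (script, threads, processes, time) copied from the results above.
# The first row is the sequential baseline.
RESULTS = [
    ("Sequential", 1, 1, "28m59.966s"),
    ("Threaded", 5, 1, "7m11.634s"),
    ("Threaded", 10, 1, "3m50.455s"),
    ("Threaded", 20, 1, "2m45.412s"),
    ("Processes", 5, 2, "4m2.624s"),
    ("Processes", 10, 2, "2m1.445s"),
    ("Processes", 20, 2, "1m47.663s"),
]


def speedups(results):
    """Return each row with its speedup: baseline time / row time."""
    baseline = to_seconds(results[0][3])
    return [
        (script, threads, processes, round(baseline / to_seconds(timing), 2))
        for script, threads, processes, timing in results
    ]


if __name__ == "__main__":
    for script, threads, processes, speedup in speedups(RESULTS):
        print("%-10s threads=%-2d processes=%d speedup=%.2f"
              % (script, threads, processes, speedup))
```

Dividing the sequential time by each script's time reproduces the figures in the last column, which makes the diminishing returns easy to see: going from 5 to 20 threads in one process raises the speedup from 4.03 to 10.52 rather than quadrupling it.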
