When there is no API

As with API services, web pages have owners, who may or may not be open to having their data scraped. If an API is available, it is always preferable to scraping, for the following reasons:

  • First, an API is usually much simpler to use, and its owners generally commit to keeping its structure stable, or at least to announcing upcoming changes in advance. HTML pages come with no such guarantee: a website can change at any time without notice, so expect plenty of sudden breaking changes!
  • Second, serving raw data is substantially cheaper, computation-wise, than serving a full-blown HTML page, so using the API makes you a good citizen, and the service owners will be thankful.
  • Lastly, some data (for example, historical changes) will not be available via the web page.
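
To make the first point concrete, here is a minimal sketch using only Python's standard library. Everything in it is hypothetical and made up for illustration: the JSON payload, the two HTML snippets, the class names, and the helper functions. It is not taken from any real service; it only shows how a scraper tied to page markup can stop working after a redesign, while code reading a documented key keeps working.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: the price sits under a documented key.
API_RESPONSE = '{"product": "kettle", "price": 24.99}'

# Hypothetical HTML for the same product, before and after a site redesign.
HTML_V1 = '<div><span class="price">24.99</span></div>'
HTML_V2 = '<div><b class="product-cost">24.99</b></div>'


class PriceParser(HTMLParser):
    """Collect the text inside <span class="price">, the layout used by HTML_V1."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == "span" and ("class", "price") in attrs:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.price = float(data)


def price_from_api(payload):
    """Read the price from the (assumed) JSON structure."""
    return json.loads(payload)["price"]


def price_from_html(page):
    """Read the price from the page markup; returns None if the layout differs."""
    parser = PriceParser()
    parser.feed(page)
    parser.close()
    return parser.price
```

With these made-up inputs, price_from_api(API_RESPONSE) and price_from_html(HTML_V1) both return 24.99, but price_from_html(HTML_V2) returns None: the data is still on the page, yet the scraper no longer finds it because the tag and class name have changed.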

However, plenty of websites have no API. Some information is not intended for sharing (for example, online shops are not excited about their prices being tracked), and many organizations simply don't have the capacity to maintain one.

As a practical exercise, we will now scrape data from Wikipedia, which is, paradoxically, both an extremely good and an extremely bad example of a website to scrape. It is a great example because Wikipedia fully supports scraping; in fact, it openly shares its whole dataset, so we can download it and work with it if we want. At the same time, it is a bad example because, by its very nature, Wikipedia has no strict template or data structure: every page can be unique in its structure and layout, and even the raw dataset won't change that. Most other websites are generated from a shared set of templates, so scraping them is, on average, easier. Even so, scraping is always dirty, semi-manual, hard work.

But before we dive into coding, let's discuss what HTML is and how to work with it.
