Chapter 6. Web Scraping with Python Requests and BeautifulSoup

By now, we are comfortable communicating with the Web through Requests, and everything went smoothly while we were working with APIs. However, APIs have some limitations that we need to be aware of.

The first concern is that not all web services provide an API for their third-party users. Also, nothing guarantees that an API will be maintained well. Even tech giants such as Google, Facebook, and Twitter have changed their APIs abruptly, with little or no prior notice. So, it is better to accept that an API will not always come to the rescue when we are looking for some vital information from a web resource.

Web scraping comes to the rescue when we need to access information from a web resource that does not maintain an API. In this chapter, we will discuss the tricks of the trade for extracting information from web resources while following the principles of web scraping.

Before we begin, let's get to know some important concepts that will help us to reach our goal. Take a look at the response content format of a request, which will introduce us to a particular type of data:

>>> import requests
>>> r = requests.get("http://en.wikipedia.org/wiki/List_of_algorithms")
>>> r
<Response [200]>
>>> r.text
u'<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" />
<title>List of algorithms - Wikipedia, the free encyclopedia</title>
...

In the preceding example, the response content is rendered in the form of semistructured data, which is represented using HTML tags; this in turn helps us to access the information about the different sections of a web page individually.
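As a minimal sketch of what accessing one section individually can look like, the following example pulls the page title out of the markup. It relies on assumptions that are not part of the original example: it uses the standard library's html.parser module rather than any particular scraping library, SAMPLE_HTML is a shortened, hypothetical stand-in for r.text so that the code runs offline, and TitleParser is an illustrative name.

```python
from html.parser import HTMLParser

# Shortened, hypothetical stand-in for r.text so the sketch runs offline.
SAMPLE_HTML = """<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" />
<title>List of algorithms - Wikipedia, the free encyclopedia</title>
</head>
<body><h1>List of algorithms</h1></body>
</html>"""


class TitleParser(HTMLParser):
    """Collects the text found inside the <title> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Start recording once the opening <title> tag is seen.
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        # Stop recording at the closing </title> tag.
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text nodes are only kept while we are inside <title>.
        if self.in_title:
            self.title += data


parser = TitleParser()
parser.feed(SAMPLE_HTML)
print(parser.title)
```

Because the tags mark where each section starts and ends, the same approach could be pointed at headings, links, or tables by checking for a different tag name.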

Now, let's get to know the different types of data that the Web generally deals with.

Types of data

In most cases, we deal with three types of data when working with web sources. They are as follows:

  • Structured data
  • Unstructured data
  • Semistructured data

Structured data

Structured data is data that exists in an organized form. Normally, structured data has a predefined format and is machine readable. Because a specific format is imposed on it, each piece of structured data has a defined relation to the other pieces. This makes it easier and faster to access different parts of the data. Structured data also helps to reduce redundancy when dealing with huge amounts of data.

Relational databases typically contain structured data, and SQL can be used to access data from them. We can regard census records as an example of structured data. They contain information such as the date of birth, gender, place, and income of the people of a country.
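To make the census example concrete, here is a small sketch that stores a few records in an in-memory SQLite database and queries them with SQL. The table name census, its columns, and the sample rows are all hypothetical and chosen only for illustration.

```python
import sqlite3

# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")

# The predefined format: every record has the same typed columns.
conn.execute(
    "CREATE TABLE census ("
    "name TEXT, date_of_birth TEXT, gender TEXT, place TEXT, income INTEGER)"
)

# Made-up sample rows, for illustration only.
rows = [
    ("Person A", "1980-04-12", "F", "Springfield", 52000),
    ("Person B", "1975-09-30", "M", "Shelbyville", 47000),
    ("Person C", "1992-01-05", "F", "Springfield", 61000),
]
conn.executemany("INSERT INTO census VALUES (?, ?, ?, ?, ?)", rows)

# Because the format is fixed, one query reaches exactly the parts we need.
springfield = conn.execute(
    "SELECT name, income FROM census WHERE place = ? ORDER BY income DESC",
    ("Springfield",),
).fetchall()
print(springfield)

conn.close()
```

The fixed schema is what lets a single query select specific fields of specific records without inspecting each record by hand.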

Unstructured data

In contrast to structured data, unstructured data either lacks a standard format or stays unorganized even though a specific format is imposed on it. For this reason, it is difficult and tedious to work with the different parts of the data. To handle unstructured data, techniques such as text analytics, Natural Language Processing (NLP), and data mining are used. Images, scientific data, and text-heavy content (such as newspapers and health records) come under the unstructured data type.

Semistructured data

Semistructured data is data that has an irregular structure or a structure that changes frequently. This data can be self-describing: it uses tags and other markers to establish a semantic relationship among the elements of the data. Semistructured data may contain information that is gathered from different sources. Scraping is the technique that is used to extract information from this type of data. The information available on the Web is a perfect example of semistructured data.
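The self-describing nature of semistructured data can be sketched with a small, made-up XML document in which each record carries a different set of fields. The document content, the tag names, and the variable names below are assumptions for illustration; the parsing uses the standard library's xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: both records are tagged, but their fields differ.
DOC = """<people>
  <person><name>Ada</name><email>ada@example.com</email></person>
  <person><name>Grace</name><phone>555-0100</phone><city>Arlington</city></person>
</people>"""

root = ET.fromstring(DOC)

# The tags tell us what each value means, so no fixed schema is required.
records = [
    {child.tag: child.text for child in person}
    for person in root.findall("person")
]

for record in records:
    print(record)
```

Each record comes out as a dictionary with its own set of keys, which is the kind of irregularity a fixed table of columns would not accommodate directly.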
