Web content mining

This type of mining focuses on extracting information from the content of web pages. Each page is usually gathered and organized using a parsing technique, processed to remove the unimportant parts of the text (natural language processing), and then analyzed with an information retrieval system that matches the relevant documents to a given query. These three components are discussed in the following paragraphs.

Parsing

A web page is written in the HTML format, so the first operation is to extract the relevant pieces of information from it. An HTML parser builds a tree of tags from which the content can be extracted. Nowadays, there are many parsers available, but as an example, we use the Scrapy library (see Chapter 7, Movie Recommendation System Web Application), which provides a command-line parser. Let's say we want to parse the main page of Wikipedia, https://en.wikipedia.org/wiki/Main_Page. We simply type the following in a terminal:

scrapy shell 'https://en.wikipedia.org/wiki/Main_Page' 

A prompt will then be available to parse the page through the response object and the XPath language. For example, to obtain the page's title:

In [1]: response.xpath('//title/text()').extract()
Out[1]: [u'Wikipedia, the free encyclopedia']
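
The same query can also be run outside the interactive shell. The following is a minimal sketch (assuming the requests library is installed) that downloads the page and wraps the HTML in Scrapy's Selector class:

import requests
from scrapy.selector import Selector

# download the page and wrap the raw HTML in a Selector object
html = requests.get('https://en.wikipedia.org/wiki/Main_Page').text
sel = Selector(text=html)

# same XPath query as in the shell: the text of the <title> tag
print(sel.xpath('//title/text()').extract())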

Back in the Scrapy shell, we can also extract all the links embedded in the page (this operation is needed for the crawler to work); links are usually placed in <a> tags, with the URL stored in the href attribute:

In [2]: response.xpath("//a/@href").extract()
Out[2]:
[u'#mw-head',
 u'#p-search',
 u'/wiki/Wikipedia',
 u'/wiki/Free_content',
 u'/wiki/Encyclopedia',
 u'/wiki/Wikipedia:Introduction',
 ...
 u'//wikimediafoundation.org/',
 u'//www.mediawiki.org/']
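
Before the crawler can follow these links, the relative references (such as /wiki/Wikipedia) have to be converted into absolute URLs. A minimal sketch of this step, using the standard library's urljoin (from urllib.parse in Python 3) on a few of the extracted values, is as follows:

from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/Main_Page'
hrefs = ['#mw-head', '/wiki/Wikipedia', '//wikimediafoundation.org/']

# resolve each href against the page URL, skipping in-page anchors
# such as '#mw-head'
links = []
for href in hrefs:
    if href.startswith('#'):
        continue
    links.append(urljoin(base_url, href))

print(links)
# ['https://en.wikipedia.org/wiki/Wikipedia', 'https://wikimediafoundation.org/']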

Note that a more robust parsing approach may be needed, because web pages are often written by non-programmers, so the HTML may contain syntax errors that browsers silently repair. Note also that web pages may contain a large amount of extra content, such as advertisements, which complicates the extraction of the relevant information. Different algorithms have been proposed (for instance, tree matching) to identify the main content of a page, but no Python libraries implementing them are available at the moment, so we have decided not to discuss this topic further. However, note that a nice parsing implementation for extracting the body of a web article can be found in the newspaper library, and it will also be used in Chapter 7, Movie Recommendation System Web Application.
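
As an illustration of the newspaper library mentioned above, the following sketch downloads an article and extracts its title and body text (the URL is just a placeholder; any article page can be used):

from newspaper import Article

# placeholder URL: any web article can be used here
url = 'https://en.wikipedia.org/wiki/Web_mining'
article = Article(url)

# download the raw HTML and run the content extraction
article.download()
article.parse()

print(article.title)   # the detected title
print(article.text)    # the main body text, stripped of boilerplate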
