Searching for an article in Wikipedia

Wikipedia is a great site for gathering information about virtually anything: people, places, technology, and more. If you would like to search Wikipedia from your Python script, this recipe is for you.

Getting ready

You need to install the pyyaml third-party library from PyPI using pip or easy_install by entering $ pip install pyyaml or $ easy_install pyyaml.

How to do it...

Let us search for the keyword Islam in Wikipedia and print the title and snippet of each search result.

Listing 6.3 shows how to search for an article in Wikipedia:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python Network Programming Cookbook -- Chapter - 6
# This program is optimized for Python 2.7.
# It may run on any other version with/without modifications

import argparse
import re
import yaml
import urllib
import urllib2

SEARCH_URL = 'http://%s.wikipedia.org/w/api.php?action=query&list=search&srsearch=%s&sroffset=%d&srlimit=%d&format=yaml'

class Wikipedia:

  def __init__(self, lang='en'):
    self.lang = lang

  def _get_content(self, url):
    request = urllib2.Request(url)
    request.add_header('User-Agent', 'Mozilla/20.0')

    # Return None when the request fails so the caller can detect it
    result = None
    try:
      result = urllib2.urlopen(request)
    except urllib2.HTTPError, e:
      print "HTTP Error:%s" %(e.reason)
    except Exception, e:
      print "Error occurred: %s" %str(e)
    return result

  def search_content(self, query, page=1, limit=10):
    # Convert the 1-based page number into a result offset
    offset = (page - 1) * limit
    url = SEARCH_URL % (self.lang, urllib.quote_plus(query), offset, limit)
    response = self._get_content(url)
    if response is None:
      return []
    content = response.read()

    parsed = yaml.safe_load(content)
    search = parsed['query']['search']
    if not search:
      return []

    results = []
    for article in search:
      snippet = article['snippet']
      # Strip HTML tags, collapse whitespace, and tidy the punctuation
      snippet = re.sub(r'(?m)<.*?>', '', snippet)
      snippet = re.sub(r'\s+', ' ', snippet)
      snippet = snippet.replace(' . ', '. ')
      snippet = snippet.replace(' , ', ', ')
      snippet = snippet.strip()

      results.append({
        'title' : article['title'].strip(),
        'snippet' : snippet
      })

    return results

if __name__ == '__main__':
  parser = argparse.ArgumentParser(description='Wikipedia search')
  parser.add_argument('--query', action="store", dest="query", required=True)
  given_args = parser.parse_args()

  wikipedia = Wikipedia()
  search_term = given_args.query
  print "Searching Wikipedia for %s" %search_term
  results = wikipedia.search_content(search_term)
  print "Listing %s search results..." %len(results)
  for result in results:
    print "==%s== \n \t%s" %(result['title'], result['snippet'])
  print "---- End of search results ----"

Running this recipe to query Wikipedia about Islam shows the following output:

$ python 6_3_search_article_in_wikipedia.py --query='Islam' 
Searching Wikipedia for Islam 
Listing 10 search results... 
==Islam==
     Islam. (ˈ | ɪ | s | l | ɑː | m الإسلام, ar | ALA | al-ʾIslām  ælʔɪsˈːm | IPA | ar-al_islam. ...
==Sunni Islam==
     Sunni Islam (ˈ | s | uː | n | i or ˈ | s | ʊ | n | i |) is the largest branch of Islam ; its adherents are referred to in Arabic as ...
==Muslim== 
     A Muslim, also spelled Moslem is an adherent of Islam, a monotheistic Abrahamic religion based on the Qur'an —which Muslims consider the ... 
==Sharia==
     is the moral code and religious law of Islam. Sharia deals with many topics addressed by secular law, including crime, politics, and ...
==History of Islam==
     The history of Islam concerns the Islamic religion and its adherents, known as Muslim s. " "Muslim" is an Arabic word meaning "one who ...
==Caliphate==
     a successor to Islamic prophet Muhammad ) and all the Prophets of Islam. The term caliphate is often applied to successions of Muslim ...
==Islamic fundamentalism==
     Islamic ideology and is a group of religious ideologies seen as advocating a return to the "fundamentals" of Islam : the Quran and the Sunnah. ...
==Islamic architecture==
     Islamic architecture encompasses a wide range of both secular and religious styles from the foundation of Islam to the present day. ...
---- End of search results ---- 

How it works...

First, we define the Wikipedia URL template for searching articles. We then create a class called Wikipedia, which has two methods: _get_content() and search_content(). On initialization, the class sets its language attribute, lang, to en (English) by default.

The command-line query string is passed to the search_content() method, which constructs the actual search URL by inserting the language, the URL-encoded query string, the result offset, and the number of results to return into the template. The method optionally takes the page and limit parameters; the offset is computed by the (page - 1) * limit expression, so page 1 starts at offset 0.
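The URL construction can be tried on its own, without any network access. The following is a minimal sketch under a few assumptions: build_search_url() is a hypothetical helper that is not part of Listing 6.3, and the import fallback is added only so that the snippet runs on both Python 2.7 and Python 3.

```python
try:
    from urllib import quote_plus        # Python 2.7, as used in the listing
except ImportError:
    from urllib.parse import quote_plus  # Python 3 location of the same function

# Same URL template as in the listing, split across two lines for readability
SEARCH_URL = ('http://%s.wikipedia.org/w/api.php?action=query&list=search'
              '&srsearch=%s&sroffset=%d&srlimit=%d&format=yaml')

def build_search_url(query, lang='en', page=1, limit=10):
    """Hypothetical helper: return the search URL for a 1-based page number."""
    offset = (page - 1) * limit  # page 1 -> offset 0, page 2 -> offset limit, ...
    return SEARCH_URL % (lang, quote_plus(query), offset, limit)

first_page = build_search_url('Islam')
third_page = build_search_url('History of Islam', page=3, limit=5)
```

Here, first_page requests results 0 to 9, whereas third_page starts at offset 10 and encodes the spaces in the query as plus signs.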

The search results are fetched via the _get_content() method, which calls the urlopen() function of urllib2. In the search URL, we set the result format to yaml, a human-readable data serialization format. The YAML response is then parsed with the third-party pyyaml library.

Each result item is then cleaned up with regular expression substitutions. For example, the re.sub(r'(?m)<.*?>', '', snippet) expression takes the snippet string and replaces every match of the raw pattern (?m)<.*?>, that is, every HTML tag, with an empty string. The non-greedy .*? stops at the first closing angle bracket, so each tag is removed separately. To learn more about regular expressions, visit the Python document page, available at http://docs.python.org/2/howto/regex.html.
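The cleanup steps can be checked in isolation. The following sketch wraps them in a hypothetical clean_snippet() function (the listing performs the same steps inline), and the raw string is a hand-written example, not real API output.

```python
import re

def clean_snippet(snippet):
    """Hypothetical helper: strip markup and tidy whitespace in a snippet."""
    snippet = re.sub(r'(?m)<.*?>', '', snippet)  # remove HTML tags
    snippet = re.sub(r'\s+', ' ', snippet)       # collapse runs of whitespace
    snippet = snippet.replace(' . ', '. ')       # close the gap before periods
    snippet = snippet.replace(' , ', ', ')       # close the gap before commas
    return snippet.strip()

# Hand-written sample input, assumed to resemble a highlighted search snippet
raw = ('<span class="searchmatch">Islam</span> is a  monotheistic\n'
       'religion , based on the Quran . ')
cleaned = clean_snippet(raw)
```

After the call, cleaned holds the single line 'Islam is a monotheistic religion, based on the Quran.'.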

In the search response, each result has a title and a snippet, a short excerpt from the article. We create a list of dictionaries where each item contains the title and the snippet of one search result. The results are printed on the screen by looping through this list.
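The shape of the result list can be illustrated with a small sketch. The parsed dictionary below is a hand-written stand-in for the parsed response, assumed to follow the query/search layout that the listing reads; collect_results() and format_result() are hypothetical helpers introduced only for this illustration.

```python
# Hand-written stand-in for the parsed API response (not real API output)
parsed = {
    'query': {
        'search': [
            {'title': ' Islam ', 'snippet': 'Islam is a religion.'},
            {'title': 'Sunni Islam', 'snippet': 'Sunni Islam is a branch of Islam.'},
        ]
    }
}

def collect_results(parsed):
    """Hypothetical helper: build one title/snippet dictionary per result."""
    results = []
    for article in parsed['query']['search']:
        results.append({
            'title': article['title'].strip(),
            'snippet': article['snippet'].strip(),
        })
    return results

def format_result(result):
    """Format one result in the same layout that the listing prints."""
    return "==%s== \n \t%s" % (result['title'], result['snippet'])

results = collect_results(parsed)
lines = [format_result(result) for result in results]
```

Keeping the formatting separate from the collection step makes it easy to reuse the same list for other purposes, such as writing the results to a file.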
