Wikipedia is a great site for gathering information about virtually anything: people, places, technology, and more. If you would like to search Wikipedia from a Python script, this recipe is for you.
Here is an example:
You need to install the pyyaml third-party library from PyPI using pip or easy_install by entering $ pip install pyyaml or $ easy_install pyyaml.
Let us search for the keyword Islam in Wikipedia and print each search result on one line.
Listing 6.3 explains how to search for an article in Wikipedia, as shown:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python Network Programming Cookbook -- Chapter - 6
# This program is optimized for Python 2.7.
# It may run on any other version with/without modifications

import argparse
import re
import yaml
import urllib
import urllib2

SEARCH_URL = 'http://%s.wikipedia.org/w/api.php?action=query&list=search&srsearch=%s&sroffset=%d&srlimit=%d&format=yaml'


class Wikipedia:

    def __init__(self, lang='en'):
        self.lang = lang

    def _get_content(self, url):
        request = urllib2.Request(url)
        request.add_header('User-Agent', 'Mozilla/20.0')
        # Default to None so a failed request does not leave result unbound
        result = None
        try:
            result = urllib2.urlopen(request)
        except urllib2.HTTPError, e:
            print "HTTP Error:%s" %(e.reason)
        except Exception, e:
            print "Error occurred: %s" %str(e)
        return result

    def search_content(self, query, page=1, limit=10):
        offset = (page - 1) * limit
        url = SEARCH_URL % (self.lang, urllib.quote_plus(query), offset, limit)
        response = self._get_content(url)
        if response is None:
            return []
        content = response.read()
        # safe_load avoids constructing arbitrary objects from remote data
        parsed = yaml.safe_load(content)
        search = parsed['query']['search']
        if not search:
            return []

        results = []
        for article in search:
            snippet = article['snippet']
            # Strip HTML tags, then collapse runs of whitespace
            snippet = re.sub(r'(?m)<.*?>', '', snippet)
            snippet = re.sub(r'\s+', ' ', snippet)
            snippet = snippet.replace(' . ', '. ')
            snippet = snippet.replace(' , ', ', ')
            snippet = snippet.strip()
            results.append({
                'title' : article['title'].strip(),
                'snippet' : snippet
            })
        return results


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Wikipedia search')
    parser.add_argument('--query', action="store", dest="query", required=True)
    given_args = parser.parse_args()

    wikipedia = Wikipedia()
    search_term = given_args.query
    print "Searching Wikipedia for %s" %search_term
    results = wikipedia.search_content(search_term)
    print "Listing %s search results..." %len(results)
    for result in results:
        print "==%s== %s" %(result['title'], result['snippet'])
    print "---- End of search results ----"
Running this recipe to query Wikipedia about Islam shows the following output:
$ python 6_3_search_article_in_wikipedia.py --query='Islam'
Searching Wikipedia for Islam
Listing 10 search results...
==Islam== Islam. ( ˈ | ɪ | s | l | ɑː | m الإسلام , ar | ALA | al- ʾ Isl ā m æl ʔɪ s ˈ læ ː m | IPA | ar-al_islam. ...
==Sunni Islam== Sunni Islam ( ˈ | s | u ː | n | i or ˈ | s | ʊ | n | i |) is the largest branch of Islam ; its adherents are referred to in Arabic as ...
==Muslim== A Muslim, also spelled Moslem is an adherent of Islam, a monotheistic Abrahamic religion based on the Qur'an —which Muslims consider the ...
==Sharia== is the moral code and religious law of Islam. Sharia deals with many topics addressed by secular law, including crime, politics, and ...
==History of Islam== The history of Islam concerns the Islamic religion and its adherents, known as Muslim s. " "Muslim" is an Arabic word meaning "one who ...
==Caliphate== a successor to Islamic prophet Muhammad ) and all the Prophets of Islam. The term caliphate is often applied to successions of Muslim ...
==Islamic fundamentalism== Islamic ideology and is a group of religious ideologies seen as advocating a return to the "fundamentals" of Islam : the Quran and the Sunnah. ...
==Islamic architecture== Islamic architecture encompasses a wide range of both secular and religious styles from the foundation of Islam to the present day. ...
---- End of search results ----
First, we define the Wikipedia URL template for searching articles. We then create a class called Wikipedia, which has two methods: _get_content() and search_content(). On initialization, the class sets its language attribute, lang, to en (English) by default.
The command-line query string is passed to the search_content() method, which constructs the actual search URL by inserting the language, the URL-encoded query string, the result offset, and the number of results to return into the template. The search_content() method also accepts the optional page and limit parameters; the offset is determined by the (page - 1) * limit expression.
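The URL construction described above can be tried in isolation. The following is a minimal sketch, not the listing itself: it is written for Python 3, so quote_plus is imported from urllib.parse rather than urllib, and build_search_url is a hypothetical helper name introduced only for illustration. The URL template is the one from the listing.

```python
from urllib.parse import quote_plus

# Same template as the listing: language, query, offset, limit.
SEARCH_URL = ('http://%s.wikipedia.org/w/api.php?action=query&list=search'
              '&srsearch=%s&sroffset=%d&srlimit=%d&format=yaml')


def build_search_url(query, lang='en', page=1, limit=10):
    """Hypothetical helper: fill the template for a given results page."""
    # Page 1 starts at offset 0, page 2 at offset `limit`, and so on.
    offset = (page - 1) * limit
    return SEARCH_URL % (lang, quote_plus(query), offset, limit)


if __name__ == '__main__':
    # Third page of ten results per page, so the offset is 20.
    print(build_search_url('network programming', page=3, limit=10))
```

With page=3 and limit=10, the offset evaluates to (3 - 1) * 10 = 20, and the space in the query is encoded as a plus sign.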
The content of the search result is fetched via the _get_content() method, which calls the urlopen() function of urllib2. In the search URL, we set the result format to yaml, a human-readable, plain-text data serialization format. The yaml search result is then parsed with Python's pyyaml library.
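The fetch-and-parse step can be sketched for Python 3 as follows. This is an illustration under stated assumptions rather than the recipe's own code: it uses urllib.request in place of urllib2, and it assumes the API is asked for format=json so that the standard json module can stand in for pyyaml. The names get_content and parse_search and the User-Agent value are hypothetical.

```python
import json
import urllib.error
import urllib.request


def get_content(url, timeout=10):
    """Return the response body as text, or None if the request fails."""
    request = urllib.request.Request(url)
    # Placeholder User-Agent string; choose one that identifies your script.
    request.add_header('User-Agent', 'wikipedia-search-sketch/0.1')
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.HTTPError as e:
        print("HTTP Error: %s" % e.reason)
    except Exception as e:
        print("Error occurred: %s" % e)
    return None


def parse_search(content):
    """Extract the list of search hits from a JSON response body."""
    parsed = json.loads(content)
    # Fall back to an empty list if the expected keys are missing.
    return parsed.get('query', {}).get('search', [])
```

Returning None on failure lets the caller decide what to do when the request does not succeed, instead of attempting to read from a response that was never created.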
Each result item is then cleaned up with regular-expression substitutions. For example, the re.sub(r'(?m)<.*?>', '', snippet) expression takes the snippet string and replaces every match of the raw pattern (?m)<.*?>, that is, every HTML tag, with an empty string. To learn more about regular expressions, visit the Python documentation page, available at http://docs.python.org/2/howto/regex.html.
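The snippet clean-up can be exercised on its own. The sketch below collects the substitutions from the listing into a hypothetical clean_snippet function; the raw string is a made-up sample that imitates the HTML markup a search snippet may contain.

```python
import re


def clean_snippet(snippet):
    """Hypothetical helper: strip markup and tidy spacing in a snippet."""
    snippet = re.sub(r'(?m)<.*?>', '', snippet)  # remove HTML tags
    snippet = re.sub(r'\s+', ' ', snippet)       # collapse whitespace runs
    snippet = snippet.replace(' . ', '. ')       # fix space before a period
    snippet = snippet.replace(' , ', ', ')       # fix space before a comma
    return snippet.strip()


if __name__ == '__main__':
    # Illustrative input only, not real API output.
    raw = ('<span class="searchmatch">Islam</span> is a\n'
           ' monotheistic , Abrahamic religion . ')
    print(clean_snippet(raw))
    # prints: Islam is a monotheistic, Abrahamic religion.
```

The non-greedy .*? matters here: a greedy .* would match from the first < to the last > and delete the text between the tags as well.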
Each search result has a snippet, a short excerpt of the article text. We create a list of dictionaries, each containing the title and the snippet of one search result. The results are printed on the screen by looping through this list.
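The result structure and the printing loop can be sketched as below, in Python 3. The format_results name is hypothetical, and the two sample entries are shortened from the output shown earlier rather than fetched live.

```python
def format_results(results):
    """Hypothetical helper: build the output lines for a list of results."""
    lines = ["Listing %d search results..." % len(results)]
    for result in results:
        # One line per result: ==title== snippet
        lines.append("==%s== %s" % (result['title'], result['snippet']))
    lines.append("---- End of search results ----")
    return lines


if __name__ == '__main__':
    # Sample data shortened from the recipe's output; not a live query.
    results = [
        {'title': 'Sharia',
         'snippet': 'is the moral code and religious law of Islam.'},
        {'title': 'History of Islam',
         'snippet': 'The history of Islam concerns the Islamic religion '
                    'and its adherents'},
    ]
    for line in format_results(results):
        print(line)
```

Building the lines before printing keeps the formatting separate from the I/O, which makes the output easy to check without running a search.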