Exploring pyquery

Before exploring pyquery and its features, let's install it using pip:

C:> pip install pyquery
For more information on using pip and library installation, please refer to the Setting things up section in Chapter 2, Python and the Web – Using urllib and Requests.

The following libraries are installed on a successful installation of pyquery using pip:

  • cssselect-1.0.3
  • lxml-4.3.1
  • pyquery-1.4.0
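The versions in your environment will probably differ from those listed above. As a minimal sketch, assuming Python 3.8 or later (where importlib.metadata is part of the standard library), the installed versions can be checked as follows. The helper name installed_versions is hypothetical and is used here only for illustration:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent.

    Hypothetical helper for illustration; it is not part of pyquery.
    """
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            result[name] = None
    return result


# Report the three distributions that pip installs alongside pyquery.
for name, found in installed_versions(["pyquery", "lxml", "cssselect"]).items():
    print(name, found if found is not None else "not installed")
```

A value of "not installed" for any of the three names suggests that the pip command did not complete successfully.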

 

>>> in the code represents the interactive prompt of the Python IDE; it accepts code or instructions and displays the output on the next line.

Once the installation has completed successfully, we can import pyquery, as shown in the following code, to confirm the setup. We can list the attributes and methods it provides by using the dir() function:

>>> from pyquery import PyQuery as pq

>>> print(dir(pq))
['Fn', '__add__', '__call__', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '_filter_only', '_get_root', '_next_all', '_prev_all', '_translator_class', '_traverse', 'addClass', 'add_class', 'after', 'append', 'appendTo', 'append_to', 'attr', 'base_url', 'before', 'children', 'clear', 'clone', 'closest', 'contents', 'copy', 'count', 'css', 'each', 'empty', 'encoding', 'end', 'eq', 'extend', 'filter', 'find', 'fn', 'hasClass', 'has_class', 'height', 'hide', 'html', 'index', 'insert', 'insertAfter', 'insertBefore', 'insert_after', 'insert_before', 'is_', 'items', 'length', 'make_links_absolute',
'map', 'next', 'nextAll', 'next_all', 'not_', 'outerHtml', 'outer_html', 'parent', 'parents', 'pop', 'prepend', 'prependTo', 'prepend_to', 'prev', 'prevAll', 'prev_all', 'remove', 'removeAttr', 'removeClass', 'remove_attr', 'remove_class', 'remove_namespaces', 'replaceAll', 'replaceWith', 'replace_all', 'replace_with', 'reverse', 'root', 'show', 'siblings', 'size', 'sort', 'text', 'toggleClass', 'toggle_class', 'val', 'width', 'wrap', 'wrapAll', 'wrap_all', 'xhtml_to_html']
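The dir() listing mixes internal names (those starting with an underscore) with the methods we will actually call. As a small sketch, the listing can be narrowed to public names only. The helper public_names is hypothetical, and the import is guarded so that the snippet still runs where pyquery is not installed:

```python
def public_names(obj):
    """Return the attribute names of obj that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


try:
    from pyquery import PyQuery as pq
except ImportError:
    pq = None  # pyquery is not installed; skip the demonstration below

if pq is not None:
    names = public_names(pq)
    # The exact count and names depend on the installed pyquery version.
    print(len(names), "public attributes, for example:", names[:5])
```

The same helper works on any Python object, so it can also be applied to an instance created from parsed HTML.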

Next, we will explore the pyquery features that are most relevant to scraping. For this purpose, we will use the page source of https://www.python.org, saved locally as test.html, so that the examples work on real-world content:

Page source obtained from https://www.python.org
In Google Chrome, you can right-click on the web page and choose the View page source menu option or press Ctrl + U to obtain the page source.
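As an alternative to copying the page source from the browser, the file can be saved programmatically. The following is a minimal sketch using urllib from the standard library; the function name save_page_source and the 10-second timeout are assumptions made for illustration, and some sites may reject requests that do not come from a browser:

```python
from pathlib import Path
from urllib.request import urlopen


def save_page_source(url, path, timeout=10):
    """Download the raw bytes at url, write them to path, and return the byte count.

    Hypothetical helper for illustration; the bytes are written unchanged,
    so the file keeps whatever encoding the server sent.
    """
    with urlopen(url, timeout=timeout) as response:
        data = response.read()
    Path(path).write_bytes(data)
    return len(data)
```

Calling save_page_source("https://www.python.org", "test.html") would then create test.html in the current working directory. The content saved this way may differ slightly from what the browser shows, since no JavaScript is executed.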

Obtaining the page source (the HTML code) alone is not enough, though; we need to load this content into the library before we can explore it with the tools that pyquery provides. We'll do this in the upcoming section.

While testing or following the code, you might need to change the pyquery expressions in order to obtain the expected output, because the page source may have been updated since this was written. We suggest obtaining the latest page source from the source URL (https://www.python.org).