Loading documents

In most cases, a document's content is fetched with requests or urllib and then passed to pyquery, as follows:

>>> from pyquery import PyQuery as pq
>>> import requests
>>> response = requests.get('http://www.example.com').text  # content as str

>>> from urllib.request import urlopen
>>> response = urlopen('http://www.example.com').read()  # content as bytes
>>> docTree = pq(response)

pyquery can also load URLs directly. It uses Python's urllib by default, or requests if that library is installed, in which case most requests parameters can be passed along with the URL:

>>> pq("https://www.python.org")
[<html.no-js>]

>>> site = pq("https://www.python.org")
>>> print(type(site))
<class 'pyquery.pyquery.PyQuery'>

>>> pq("https://www.samsclub.com")
[<html>]

By default, pyquery first tries the XML parser available from lxml and, if that fails, falls back to the HTML parser from lxml.html. The parser can also be chosen explicitly by passing the parser argument:

>>> doc = pq('http://www.example.com', parser='xml')  # using the XML parser

>>> doc = pq('http://www.example.com', parser='html')  # using the HTML parser

Normally, HTML from a page source or another source, such as a file, is passed to pyquery as a string for further processing, as shown in the following code:

>>> doc = pq('<div><p>Testing block</p><p>Second block</p></div>')
>>> print(type(doc))
<class 'pyquery.pyquery.PyQuery'>

>>> with open('test.html', 'r') as f:  # reading locally saved HTML
...     pagesource = f.read()
...
>>> print(type(pagesource))
<class 'str'>

>>> page = pq(pagesource)
>>> print(type(page))
<class 'pyquery.pyquery.PyQuery'>

With the PyQuery object (pq) obtained from the loaded document or URL, we can now explore the features that pyquery provides.
