How to do it...

  1. Import BeautifulSoup and requests:
>>> import requests
>>> from bs4 import BeautifulSoup
  2. Set up the URL of the page to download and retrieve it:
>>> URL = 'http://www.columbia.edu/~fdc/sample.html'
>>> response = requests.get(URL)
>>> response
<Response [200]>
  3. Parse the downloaded page:
>>> page = BeautifulSoup(response.text, 'html.parser')
  4. Obtain the title of the page. Check that it matches the title displayed in the browser:
>>> page.title
<title>Sample Web Page</title>
>>> page.title.string
'Sample Web Page'
  5. Find all the h3 elements in the page to determine the existing sections:
>>> page.find_all('h3')
[<h3><a name="contents">CONTENTS</a></h3>, <h3><a name="basics">1. Creating a Web Page</a></h3>, <h3><a name="syntax">2. HTML Syntax</a></h3>, <h3><a name="chars">3. Special Characters</a></h3>, <h3><a name="convert">4. Converting Plain Text to HTML</a></h3>, <h3><a name="effects">5. Effects</a></h3>, <h3><a name="lists">6. Lists</a></h3>, <h3><a name="links">7. Links</a></h3>, <h3><a name="tables">8. Tables</a></h3>, <h3><a name="install">9. Installing Your Web Page on the Internet</a></h3>, <h3><a name="more">10. Where to go from here</a></h3>]
  6. Extract the text of the section on links. Stop when you reach the next <h3> tag:
>>> link_section = page.find('a', attrs={'name': 'links'})
>>> section = []
>>> for element in link_section.next_elements:
...     if element.name == 'h3':
...         break
...     section.append(element.string or '')
...
>>> result = ''.join(section)
>>> result
'7. Links Links can be internal within a Web page (like to the Table of ContentsTable of Contents at the top), or they can be to external web pages or pictures on the same website, or they can be to websites, pages, or pictures anywhere else in the world. Here is a link to the Kermit Project home pageKermit Project home page. Here is a link to Section 5Section 5 of this document. Here is a link to Section 4.0Section 4.0 of the C-Kermit for Unix Installation InstructionsC-Kermit for Unix Installation Instructions. Here is a link to a picture: CLICK HERECLICK HERE to see it. '

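Notice that there are no HTML tags; it's all raw text. The text of each link appears twice because next_elements yields both the <a> tag and the string inside it, and each contributes the same text.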
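
The extraction loop can also be written so that each piece of text is collected only once. The following is a minimal sketch, not part of the original recipe: it assumes it is acceptable to keep only the text nodes (NavigableString objects) and to use tags solely for detecting the next h3. The SAMPLE_HTML string and the section_text helper are hypothetical names introduced for illustration, so that the example runs without downloading anything.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical inline HTML standing in for the downloaded page,
# so the sketch runs without a network connection.
SAMPLE_HTML = """
<h3><a name="effects">5. Effects</a></h3>
<p>Some <b>bold</b> text.</p>
<h3><a name="links">7. Links</a></h3>
<p>Here is a link to the <a href="#contents">Table of Contents</a>.</p>
<h3><a name="tables">8. Tables</a></h3>
<p>Not part of the section.</p>
"""


def section_text(page, anchor_name):
    """Return the text from the named anchor up to the next h3 tag."""
    start = page.find('a', attrs={'name': anchor_name})
    pieces = []
    for element in start.next_elements:
        if isinstance(element, NavigableString):
            # Text node: collect it exactly once.
            pieces.append(str(element))
        elif element.name == 'h3':
            # The next section heading marks the end of this section.
            break
    return ''.join(pieces)


page = BeautifulSoup(SAMPLE_HTML, 'html.parser')
text = section_text(page, 'links')
print(text)
```

Because tags themselves are never appended, the text of a link is no longer repeated. The same helper could be called with the page parsed from response.text and the 'links' anchor used in the steps above.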