Chapter 1. Instant Jsoup How-to

Welcome to Instant Jsoup How-to. Many websites and services expose their information through RSS, Atom, or a web service API; however, plenty of sites provide no such facility. That gap is why HTML parsers are used for web scraping. Jsoup, one of the popular HTML parsers for Java developers, is a powerful framework that gives developers an easy way to extract and transform HTML content. This book collects the recipes needed to grab information from the web.

Giving input for parser (Must know)

HTML data for parsing can come from different sources, such as a local file, a string, or a URI. Let's have a look at how to handle each of these input types with Jsoup.

How to do it...

  1. Create a Document object with Jsoup, depending on the type of input.
    • If the input is a string, use:
      String html = "<html><head><title>jsoup: input with string</title></head><body>Such an easy task.</body></html>";
      Document doc = Jsoup.parse(html);
    • If the input is from a file, use:
      try {
          File file = new File("index.html");
          Document doc = Jsoup.parse(file, "utf-8");
      } catch (IOException ioEx) {
          ioEx.printStackTrace();
      }
    • If the input is from a URL, use:
      // get() throws IOException, so wrap this call in a try/catch block as well
      Document doc = Jsoup.connect("http://www.example.com").get();
  2. Import the required classes at the top of the file.
    import java.io.File;
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

Note

The complete example source code for this section is in sourceSection01.

The API reference for this section is available at the following location:

http://jsoup.org/apidocs/org/jsoup/Jsoup.html

How it works...

All three kinds of input are handed to the Jsoup class for parsing.

For an HTML string, you just need to pass the string as the parameter of the Jsoup.parse() method.

For an HTML file, two parameters are passed to Jsoup.parse(). The first one is the File object, which points to the specified HTML file; the second one is the character set of the file. There is an overload of this method with an additional third parameter, Jsoup.parse(File file, String charsetName, String baseUri). The baseUri is the URL from which the HTML file was retrieved; it is used to resolve relative paths or links.
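To make the role of baseUri concrete, here is a minimal sketch rather than the book's own sample code. The class name BaseUriExample, the helper firstAbsoluteLink(), the temporary file, and the link /docs/page.html are all made up for illustration; only Jsoup.parse(File, String, String), select(), and absUrl() come from the Jsoup API.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriExample {

    // Hypothetical helper: parses a local HTML file and resolves its first
    // link against baseUri. Returns an empty string if the file has no links.
    static String firstAbsoluteLink(File file, String baseUri) throws IOException {
        Document doc = Jsoup.parse(file, "UTF-8", baseUri);
        Element link = doc.select("a").first();
        return link == null ? "" : link.absUrl("href");
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample page to a temporary file (illustrative content).
        File file = File.createTempFile("jsoup-sample", ".html");
        file.deleteOnExit();
        String html = "<html><body><a href=\"/docs/page.html\">Docs</a></body></html>";
        Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));

        // The relative href is resolved against the base URI passed to parse().
        System.out.println(firstAbsoluteLink(file, "http://www.example.com/"));
    }
}
```

Without a base URI, absUrl() has nothing to resolve the relative link against and returns an empty string, which is why the three-parameter overload matters for pages saved to disk.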

For a URL, you need to use the Jsoup.connect() method. It returns an object that implements the Connection interface; no request is sent at this point. Calling the Connection.get() method then fetches the page at that URL and parses its content into a Document.
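The two steps, building the Connection and then executing it, can be sketched as follows. This is an illustrative example, not the book's sample code: the class name ConnectExample, the helper names, the user agent string, and the 10-second timeout are assumptions chosen for the demonstration.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectExample {

    // Hypothetical helper: builds and configures a Connection.
    // Nothing is sent over the network here.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // made-up agent string
                .timeout(10_000); // milliseconds; an arbitrary choice
    }

    // Executes the request with GET and returns the page title.
    static String fetchTitle(String url) throws IOException {
        Document doc = prepare(url).get();
        return doc.title();
    }

    public static void main(String[] args) {
        try {
            System.out.println(fetchTitle("http://www.example.com"));
        } catch (IOException ioEx) {
            // get() fails with IOException on network or HTTP errors.
            System.err.println("Could not fetch page: " + ioEx.getMessage());
        }
    }
}
```

Keeping the configuration in its own method makes it easy to reuse the same settings for several URLs.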

The previous example is easy and straightforward. In each case, parsing with the Jsoup class returns a Document object, which represents the DOM structure of an HTML page, with <html> as the root element.
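As a quick way to see that structure, the sketch below parses the HTML string from the recipe and reads a few parts of the resulting Document. The class name DocumentTour and the helper parseSample() are hypothetical; the Jsoup calls used are Jsoup.parse(), child(), tagName(), title(), body(), and text().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DocumentTour {

    // Hypothetical helper: parses the sample string used earlier in this recipe.
    static Document parseSample() {
        String html = "<html><head><title>jsoup: input with string</title></head>"
                + "<body>Such an easy task.</body></html>";
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = parseSample();
        System.out.println(doc.child(0).tagName()); // root element: html
        System.out.println(doc.title());            // jsoup: input with string
        System.out.println(doc.body().text());      // Such an easy task.
    }
}
```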

There's more...

Besides a complete HTML document, the Jsoup library also accepts a body fragment as input, through the Jsoup.parseBodyFragment() method. Its API reference is at the following location:

http://jsoup.org/apidocs/org/jsoup/Jsoup.html#parseBodyFragment(java.lang.String)
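A minimal sketch of fragment parsing, assuming a made-up fragment with an unclosed second paragraph; the class name FragmentExample and the helper parseFragment() are hypothetical. Jsoup wraps the fragment in a full document, so the parsed elements end up under body().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FragmentExample {

    // Hypothetical helper: parses an HTML fragment into a full Document.
    static Document parseFragment(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        // Illustrative fragment; note that the second <p> is never closed.
        Document doc = parseFragment("<p>First</p><p>Second");

        // The fragment's elements become children of <body>.
        System.out.println(doc.body().children().size());  // 2
        System.out.println(doc.select("p").last().text()); // Second
    }
}
```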

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
