Extracting web data from a URL using JSoup
A large amount of data can be found on the Web nowadays. This data may be structured, semi-structured, or even unstructured, so very different techniques are needed to extract it. There are many ways to extract web data, and one of the easiest and handiest is to use an external Java library named JSoup. This recipe uses a number of the methods offered by JSoup to extract web data.
Getting ready
In order to perform this recipe, we will require the following:
- Go to https://jsoup.org/download and download the jsoup-1.9.2.jar file. Add the JAR file to your Eclipse project as an external library.
- If you are a Maven fan, please follow the instructions on the download page to include the JAR file in your Eclipse project.
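If you want to confirm that the JAR is visible to your project before starting, you can compile and run a minimal check like the one below. This is an optional sketch of our own, not part of the recipe; it parses a hard-coded string, so it needs no network access:

    import org.jsoup.Jsoup;

    public class JsoupSetupCheck {
        public static void main(String[] args) {
            // If this compiles and runs, the jsoup JAR is on the classpath.
            String title = Jsoup.parse("<html><head><title>jsoup is ready</title></head></html>").title();
            System.out.println(title);   // prints: jsoup is ready
        }
    }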
How to do it...
- Create a method named extractDataWithJsoup(String href). The parameter is the URL of any web page you pass to the method; we will be extracting web data from this URL:

    public void extractDataWithJsoup(String href){
- Use the connect() method, passing it the URL we want to connect to (and extract data from). Then, we will chain a few more methods onto it. First, we will chain the timeout() method, which takes a timeout in milliseconds as its parameter. The methods after that set the user-agent name for this connection and whether HTTP errors should be ignored. The last method in the chain is the get() method, which eventually returns a Document object. Therefore, we will hold the returned object in doc, a variable of the Document class:

    doc = Jsoup.connect(href).timeout(10*1000).userAgent("Mozilla").ignoreHttpErrors(true).get();
- As this code throws IOException, we will be using a try...catch block as follows:

    Document doc = null;
    try {
        doc = Jsoup.connect(href).timeout(10*1000).userAgent("Mozilla").ignoreHttpErrors(true).get();
    } catch (IOException e) {
        // Your exception handling here
    }
Tip
We are not used to seeing times in milliseconds. Therefore, when the time unit in code is milliseconds, it is a nice practice to write 10*1000 to denote 10 seconds. This enhances the readability of the code.
- A large number of methods can be found for a Document object. If you want to extract the title of the page, you can use the title() method as follows:

    if(doc != null){
        String title = doc.title();
- To only extract the textual part of the web page, we can chain the body() method with the text() method of a Document object, as follows:

    String text = doc.body().text();
- If you want to extract all the hyperlinks in a URL, you can use the select() method of a Document object with the a[href] parameter. This gives you all the links at once:

    Elements links = doc.select("a[href]");
- Perhaps you want to process the links in a web page individually? That is easy, too. You need to iterate over all the links to get the individual links (note that the href values come verbatim from the page and may be relative; see the sketch after the complete listing):
    for (Element link : links) {
        String linkHref = link.attr("href");
        String linkText = link.text();
        String linkOuterHtml = link.outerHtml();
        String linkInnerHtml = link.html();
        System.out.println(linkHref + "\t" + linkText + "\t" + linkOuterHtml + "\t" + linkInnerHtml);
    }
- Finally, close the if block with a brace, and close the method with another brace (a quick, network-free way to exercise these calls is sketched below):

        }
    }
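Here is the quick test mentioned in the previous step: a small, self-contained sketch of our own (not part of the original recipe) that exercises the same calls on a hard-coded HTML string, so you can check the extraction logic without an internet connection. The page content and class name are invented for illustration:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class OfflineJsoupCheck {
        public static void main(String[] args) {
            // A hard-coded page stands in for a live URL.
            String html = "<html><head><title>Sample page</title></head>"
                    + "<body><p>Hello, jsoup.</p><a href='/download'>Download</a></body></html>";
            Document doc = Jsoup.parse(html);

            System.out.println(doc.title());        // Sample page
            System.out.println(doc.body().text());  // Hello, jsoup. Download
            for (Element link : doc.select("a[href]")) {
                System.out.println(link.attr("href") + "\t" + link.text());
            }
        }
    }

Because the document is parsed from a string rather than fetched over the network, there is no IOException to handle here.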
The complete method, its class, and the driver method are as follows:
    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class JsoupTesting {
        public static void main(String[] args){
            JsoupTesting test = new JsoupTesting();
            test.extractDataWithJsoup("Website address preceded by http://");
        }

        public void extractDataWithJsoup(String href){
            Document doc = null;
            try {
                doc = Jsoup.connect(href).timeout(10*1000).userAgent("Mozilla").ignoreHttpErrors(true).get();
            } catch (IOException e) {
                // Your exception handling here
            }
            if(doc != null){
                String title = doc.title();
                String text = doc.body().text();
                Elements links = doc.select("a[href]");
                for (Element link : links) {
                    String linkHref = link.attr("href");
                    String linkText = link.text();
                    String linkOuterHtml = link.outerHtml();
                    String linkInnerHtml = link.html();
                    System.out.println(linkHref + "\t" + linkText + "\t" + linkOuterHtml + "\t" + linkInnerHtml);
                }
            }
        }
    }
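One follow-up worth noting: link.attr("href") returns the attribute value exactly as it is written in the page, which is often a relative path. jsoup can resolve such a value against the page's own URL with absUrl("href"). The following sketch is our own illustration rather than part of the recipe; the URL is just an example, and you can point it at any page you like:

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class AbsoluteLinkExample {
        public static void main(String[] args) {
            try {
                // Example URL only; substitute any page you want to inspect.
                Document doc = Jsoup.connect("https://jsoup.org/").timeout(10*1000).userAgent("Mozilla").ignoreHttpErrors(true).get();
                for (Element link : doc.select("a[href]")) {
                    // attr("href") is the raw value; absUrl("href") resolves it to an absolute URL.
                    System.out.println(link.attr("href") + "\t" + link.absUrl("href"));
                }
            } catch (IOException e) {
                // Your exception handling here
            }
        }
    }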