In today’s web-driven world, data is the cornerstone of every major application and decision-making process. Web scraping provides developers with the tools to access and harness that data.
In this article, we’ll explore jsoup, a popular Java library for parsing and scraping web content. Whether you're a beginner or an experienced developer, this guide will provide the foundations and best practices for using jsoup effectively.
What is Jsoup?
At its core, jsoup is a Java library designed for parsing, manipulating, and extracting data from HTML documents. It allows developers to work with web content as if they were using a browser's developer tools. With its intuitive API, jsoup simplifies tasks like data extraction, HTML manipulation, and even cleanup, making it a go-to tool for many Java developers.
Installing Jsoup
Getting started with jsoup is straightforward. Add jsoup as a dependency to your project using a build tool like Maven or Gradle:
To install jsoup using Maven, add the following to your pom.xml file:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.16.1</version>
</dependency>
To install jsoup using Gradle, add the following to the dependencies block of your build.gradle.kts file:

implementation("org.jsoup:jsoup:1.16.1")
Once installed, you’re ready to explore jsoup’s powerful scraping and parsing capabilities.
Scraping with Jsoup
Jsoup doesn't just parse HTML; it also provides a simple and robust way to connect to web pages and fetch their HTML directly.
You can use the .connect() method that jsoup provides to get the HTML content of a website and parse it to extract data:

Document doc = Jsoup.connect("https://example.com").get();
The .connect() method also allows you to customize requests by specifying headers, cookies, and HTTP methods:

Connection connection = Jsoup.connect("https://example.com")
    .method(Connection.Method.GET)
    .userAgent("Mozilla/5.0")
    .header("Authorization", "Bearer token")
    .cookie("session_id", "abc123");
The .connect() method returns a Connection instance. Once the request has been executed, its .response() method gives you access to the HTTP response details, and .statusCode() lets you check the status to handle errors gracefully:

Connection connection = Jsoup.connect("https://example.com")
    .ignoreHttpErrors(true); // don't throw on non-2xx status codes
connection.execute();
if (connection.response().statusCode() == 200) {
    System.out.println("Success!");
} else {
    System.out.println("Failed to connect.");
}
By combining these methods, you can mimic browser behavior to scrape data effectively.
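To illustrate, here's a minimal sketch that combines these pieces into one request cycle: custom headers, a status check, then parsing. The URL, user agent, and timeout are placeholder values, not jsoup requirements:

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectExample {
    public static void main(String[] args) throws Exception {
        // Build a customized request; ignoreHttpErrors lets us inspect
        // non-2xx responses instead of having jsoup throw an exception
        Connection.Response response = Jsoup.connect("https://example.com")
                .userAgent("Mozilla/5.0")
                .timeout(10_000) // 10 second timeout, in milliseconds
                .ignoreHttpErrors(true)
                .execute();

        if (response.statusCode() == 200) {
            // Parse the response body into a Document only on success
            Document doc = response.parse();
            System.out.println(doc.title());
        } else {
            System.out.println("Request failed: " + response.statusCode());
        }
    }
}
```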
Jsoup Example Scraper
To illustrate jsoup’s capabilities, let’s scrape product data from the first page of web-scraping.dev/products.
For this example, we'll use Gradle to manage our dependencies. Create a new project and add the following to your build.gradle.kts file:

plugins {
    id("java")
    id("application")
}

repositories {
    mavenCentral()
}

dependencies {
    implementation("org.jsoup:jsoup:1.16.1")
}

application {
    mainClass.set("JsoupScraper")
}
Then let's create a small scraper in the src/main/java/JsoupScraper.java file that will:
- Scrape web-scraping.dev/products page and find all product URLs
- Scrape each product URL for product name and price
- Collect all results and display them
Our jsoup Java scraper should look something like this:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.HashMap;

public class JsoupScraper {
    public static HashMap<String, String> scrapeProduct(String url) throws Exception {
        // Scrape a single product page from web-scraping.dev
        Document doc = Jsoup.connect(url).get();
        HashMap<String, String> productData = new HashMap<>();
        productData.put("title", doc.select("h3").text());
        productData.put("price", doc.select(".product-price").text());
        productData.put("price_full", doc.select(".product-price-full").text());
        productData.put("url", url);
        return productData;
    }

    public static void main(String[] args) throws Exception {
        // Fetch the product directory page
        Document doc = Jsoup.connect("https://web-scraping.dev/products").get();

        // This is where we'll store our results
        ArrayList<HashMap<String, String>> products = new ArrayList<>();

        // Iterate through product elements, find the product URL and scrape each product
        Elements productElements = doc.select(".products .product");
        for (Element product : productElements) {
            // Get the product URL
            String url = product.select("h3 > a").attr("href");
            System.out.println("Scraping product: " + url);
            // Scrape each product and store the result
            HashMap<String, String> productData = scrapeProduct(url);
            products.add(productData);
        }

        // Pretty print the product data
        System.out.println("Product Data:");
        for (HashMap<String, String> product : products) {
            System.out.println(product);
        }
    }
}
Example Output
$ gradle run

> Task :run
Scraping product: https://web-scraping.dev/product/1
Scraping product: https://web-scraping.dev/product/2
Scraping product: https://web-scraping.dev/product/3
Scraping product: https://web-scraping.dev/product/4
Scraping product: https://web-scraping.dev/product/5
Product Data:
{price=$9.99, price_full=$12.99, title=Box of Chocolate Candy Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Red Energy Potion Hiking Boots for Outdoor Adventures Kids' Light-Up Sneakers Blue Energy Potion, url=https://web-scraping.dev/product/1}
{price=$4.99, price_full=, title=Dark Red Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Red Energy Potion Cat-Ear Beanie Running Shoes for Men Classic Leather Sneakers, url=https://web-scraping.dev/product/2}
{price=$4.99, price_full=, title=Teal Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Dragon Energy Potion Women's High Heel Sandals Dark Red Energy Potion Running Shoes for Men, url=https://web-scraping.dev/product/3}
{price=$4.99, price_full=, title=Red Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Women's High Heel Sandals Blue Energy Potion Dark Red Energy Potion Cat-Ear Beanie, url=https://web-scraping.dev/product/4}
{price=$4.99, price_full=, title=Blue Energy Potion Variants Features Vertical Table Packs Horizontal Table Reviews Similar Products Women's High Heel Sandals Blue Energy Potion Hiking Boots for Outdoor Adventures Classic Leather Sneakers, url=https://web-scraping.dev/product/5}
Above is our jsoup scraper, which scraped 5 products with their titles and prices. To break this down further, let's look at each of jsoup's HTML parsing capabilities.
Parsing HTML with jsoup
Jsoup's Java HTML parser can be used to parse and modify scraped HTML content.
Finding data with CSS Selectors
Jsoup's .select() method takes a CSS selector and returns the matching elements in the HTML document. To get only the first match, .first() can be called on the result:

Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
// find all images using a CSS selector matching elements with the "product-img" class
Elements images = doc.select(".product-img");
// print only the first one using first()
System.out.println(images.first());
// prints: <img src="https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp" class="img-responsive product-img active">
Selecting attributes and values
Extract inner text and attribute values using .text() and .attr().

To get the text content of an HTML element, the .text() method can be used:

Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
Elements variants = doc.select(".variants .variant");
// text of the first variant
System.out.println(variants.first().text());
// prints: orange, small
// or text of all variants
System.out.println(variants.text());
// prints: orange, small orange, medium orange, large cherry, small cherry, medium cherry, large

To get the value of an HTML attribute set on an element, the .attr() method can be used:

Document doc = Jsoup.connect("https://web-scraping.dev/product/1").get();
Elements images = doc.select(".product-img");
System.out.println(images.first().attr("src"));
// prints: https://web-scraping.dev/assets/products/orange-chocolate-box-small-1.webp
Changing the DOM
Jsoup also allows modifications of the DOM using .text() and .attr().

When given a string argument, the .text() method replaces the inner text of the HTML element:

doc.select("h1").first().text("Updated Title");

Similarly, the .attr() method takes a second string argument that sets the attribute's value in the HTML:

doc.select("img").first().attr("src", "new-image.jpg");
This versatility lets you work with HTML dynamically, much like in a browser.
Jsoup Utilities
Jsoup comes equipped with handy utilities to simplify common HTML tasks.
Cleanup HTML
Use Jsoup.clean() to sanitize HTML, removing unsafe tags and attributes:

String cleanHtml = Jsoup.clean("<script>alert(1)</script><p>Safe content</p>", Safelist.basic());
Prettify HTML
Format raw HTML for readability using:
doc.outputSettings().prettyPrint(true);
System.out.println(doc.html());
Escape and Unescape HTML
Handle special characters with Entities.escape() and Entities.unescape():

String escaped = Entities.escape("<div>Content</div>");
// escaped: &lt;div&gt;Content&lt;/div&gt;
String unescaped = Entities.unescape(escaped);
// unescaped: <div>Content</div>
These utilities enhance your ability to manage and present HTML effectively.
Jsoup Limitations
Despite its strengths, jsoup has some limitations that developers should consider:
- Lack of HTTP/2 support: Jsoup only supports basic HTTP/1.1 requests. For HTTP/2 and advanced networking capabilities, consider using libraries like OkHttp. OkHttp is a popular HTTP client for Java; check out our comprehensive guide on OkHttp to learn more about its capabilities.
- No headless browser functionality: Jsoup doesn’t execute JavaScript, which limits its ability to scrape dynamic web pages. Tools like Selenium or Puppeteer can help in these scenarios.
- Detectability: Jsoup’s requests can be easily identified as non-human by websites, making it less ideal for scraping heavily protected content.
For advanced use cases, combining jsoup with tools like OkHttp or Scrapfly can help overcome these challenges.
Power Up with Scrapfly
Jsoup is great for small to medium-scale scraping tasks on static pages. However, it falls short when it comes to JavaScript-rendered content and scraper blocking caused by IP bans or bot detection.
ScrapFly provides web scraping, screenshot, and extraction APIs for data collection at scale.
- Anti-bot protection bypass - extract web pages without blocking!
- Rotating residential proxies - prevent IP address and geographic blocks.
- LLM prompts - extract data or ask questions using LLMs
- Extraction models - automatically find objects like products, articles, jobs, and more.
- Extraction templates - extract data using your own specification.
- Python and Typescript SDKs, as well as Scrapy and no-code tool integrations.
Here is a simple example of how you can use OkHttp with Scrapfly's Scraping API:
import okhttp3.HttpUrl;
import okhttp3.OkHttpClient;
import okhttp3.Request;
import okhttp3.Response;

import java.io.IOException;

public class OkHttpExample {
    public static void main(String[] args) {
        OkHttpClient client = new OkHttpClient();
        HttpUrl.Builder urlBuilder = HttpUrl.parse("https://api.scrapfly.io/scrape")
                .newBuilder();
        // Required parameters: your API key and URL to scrape
        urlBuilder.addQueryParameter("key", "YOUR_API_KEY");
        urlBuilder.addQueryParameter("url", "https://web-scraping.dev/product/1");
        // Optional parameters:
        // enable anti scraping protection bypass
        urlBuilder.addQueryParameter("asp", "true");
        // use proxies from specific countries
        urlBuilder.addQueryParameter("country", "US,CA,DE");
        // enable headless browser
        urlBuilder.addQueryParameter("render_js", "true");
        // see more on scrapfly docs: https://scrapfly.io/docs/scrape-api/getting-started#spec

        // Build and send the request
        String url = urlBuilder.build().toString();
        Request request = new Request.Builder()
                .url(url)
                .build();
        try (Response response = client.newCall(request).execute()) {
            if (response.isSuccessful()) {
                System.out.println("Response Body: " + response.body().string());
                System.out.println("Status Code: " + response.code());
            } else {
                System.out.println("Request Failed: " + response.code());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
FAQ
Can jsoup capture website screenshots?
No, jsoup cannot capture website screenshots. For such needs, you’ll require a headless browser like Selenium or a specialized API. Consider using Scrapfly’s Screenshot API, which simplifies capturing full-page images with minimal setup.
Does jsoup handle JavaScript-rendered content?
No, jsoup cannot execute JavaScript or interact with dynamic content. It works only with static HTML. To scrape JavaScript-rendered pages, you’ll need tools like Selenium or Puppeteer, or services like Scrapfly, which offer JavaScript execution capabilities.
Does jsoup support multi-threaded scraping?
Jsoup itself doesn’t provide built-in multi-threading, but you can use Java’s concurrency utilities (e.g., ExecutorService) to scrape multiple pages simultaneously. Just ensure you manage thread safety and network limits to avoid being blocked by the target website.
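As an illustration, here's a minimal sketch of that pattern with ExecutorService. The jsoup fetch is replaced by a placeholder method so the concurrency structure stands on its own; in a real scraper the task body would call Jsoup.connect(url).get() and parse the result:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConcurrentScraper {
    // Placeholder for a real jsoup fetch: in an actual scraper this would
    // call Jsoup.connect(url).get() and return extracted data
    static String scrapePage(String url) {
        return "scraped:" + url;
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
            "https://web-scraping.dev/product/1",
            "https://web-scraping.dev/product/2",
            "https://web-scraping.dev/product/3"
        );

        // A small fixed pool keeps concurrency polite toward the target site
        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<String>> futures = new ArrayList<>();
        for (String url : urls) {
            futures.add(pool.submit(() -> scrapePage(url)));
        }

        // Collect results; Future.get() blocks until each task finishes
        for (Future<String> future : futures) {
            System.out.println(future.get());
        }
        pool.shutdown();
    }
}
```

Since the futures list preserves submission order, results print in the same order as the input URLs even though pages are fetched concurrently.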
Summary
Jsoup is a versatile and lightweight library for scraping and parsing web content. It excels at handling static HTML and provides utilities for cleaning, prettifying, and manipulating content. While it has limitations, combining it with tools like OkHttp or Scrapfly unlocks advanced capabilities, making it a powerful addition to any web scraping toolkit.
Whether you’re building a basic scraper or a robust data pipeline, jsoup provides the flexibility and functionality to get started quickly. Experiment with its features and extend its capabilities with complementary tools to suit your needs. Happy scraping!