Beautiful Soup in Python

5 Jan 2025 | 6 min read

Beautiful Soup is a Python library for web scraping that provides an efficient way to navigate, search, and modify the content of HTML and XML documents. It parses raw HTML or XML into a structured, tree-like representation from which data can be extracted and manipulated easily.

Main purpose

1. Data Extraction

Beautiful Soup is used in web scraping to extract data easily and efficiently from HTML and XML documents.

2. Parsing HTML and XML

Beautiful Soup transforms HTML or XML into a navigable tree structure.

3. Tree Modification

The tool supports modifying the parse tree, enabling users to add, remove, or manipulate elements per their requirements.

4. Handling Broken HTML

Beautiful Soup is designed to handle imperfect or poorly formatted HTML gracefully.

5. Navigating and Searching

Beautiful Soup offers methods for searching and navigating the parse tree. Users can locate tags based on criteria such as tag names, attributes, or hierarchical relationships.
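As a rough illustration of the points above, the following sketch parses a deliberately malformed fragment and then looks up tags by name, by attribute, and by their position in the tree. The markup, the `id` and `class` values, and the variable names are all made up for this example, and the built-in `html.parser` backend is assumed.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately broken markup: the <p> tag and the second
# <a> tag are never closed.
broken_html = """
<div id="main">
  <p class="intro">Welcome
  <a href="/docs">Docs</a>
  <a href="/blog" class="external">Blog
</div>
"""

soup = BeautifulSoup(broken_html, "html.parser")

# Search by tag name: both links are found despite the missing end tags.
links = soup.find_all("a")
hrefs = [a["href"] for a in links]

# Search by attribute value.
external = soup.find("a", class_="external")
container = soup.find(id="main")

# Navigate by hierarchy: the first link sits inside the unclosed <p>.
parent_name = links[0].parent.name

print(hrefs)
print(external.get_text(strip=True))
print(container.name, parent_name)
```

Other parser backends may repair the same broken markup slightly differently, so the resulting tree can vary with the parser chosen.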

Installation of Beautiful Soup

Command: `pip install beautifulsoup4`
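Once the package is installed, a quick way to confirm that it can be imported is to print its version and parse a trivial snippet. This is only a sanity check; the sample markup and variable names are placeholders.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version of the library.
print("Beautiful Soup version:", bs4.__version__)

# Parse a tiny document with the parser from the standard library.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
parsed_text = soup.p.get_text()
print(parsed_text)
```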

Basic Usage:

Let's walk through a simple example to illustrate the basic usage of Beautiful Soup. Consider a scenario where we want to extract the titles of articles from a sample webpage:

Example

Here is the example program:

Program

Output:

 Trending Technologies Preparation B.Tech / MCA Cloud Technologies Testing Tutorials Python Tutorials Java Technology Database Tutorials Web Technology PHP Tutorials Office Tools .Net Technologies Popular Tutorials Miscellaneous Topics Non-Technical Topics American India Author View Feedback 100+ Latest Updates Javatpoint Services Training For College Campus

Explanation

We imported two libraries: one for sending HTTP requests and the other for parsing HTML and XML documents. Next, we specified the URL of the target webpage and sent a request to it. If the request succeeds, the response is stored. We then used Beautiful Soup to create a parse tree from the HTML content of the response. Using the `find_all()` method, we located every `<h2>` element in the parsed HTML. Finally, we printed all the `h2` elements.
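The steps described above can be sketched roughly as follows. This is not the original listing: the function names and the sample markup are invented for illustration, and the use of the third-party `requests` package for the HTTP call is an assumption. The demonstration at the bottom parses an inline string so that it runs without network access.

```python
from bs4 import BeautifulSoup


def extract_h2_titles(html):
    """Return the text of every <h2> element in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


def fetch_h2_titles(url):
    """Download a page and extract its <h2> headings.

    Assumes the third-party 'requests' package is installed; it is
    imported here so the rest of the sketch works without it.
    """
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early if the request failed
    return extract_h2_titles(response.text)


# Hypothetical sample page used instead of a live request.
sample_html = """
<html><body>
  <h1>Site title</h1>
  <h2>Python Tutorials</h2>
  <p>Some text</p>
  <h2>Java Technology</h2>
</body></html>
"""

titles = extract_h2_titles(sample_html)
for title in titles:
    print(title)
```

To scrape a live page, `fetch_h2_titles()` would be called with the page URL; the headings returned depend entirely on the page at that moment.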

Parsing HTML and XML

HTML parsing analyses an HTML document to extract its structural components like tags, attributes, and content.

Example

Here is a simple program:

Program

Output:

Tag Name: html
Tag Name: head
Tag Name: title
Tag Name: body
Tag Name: h1
Tag Name: p
Tag Name: ul
Tag Name: li
Tag Name: li
Tag Name: li

Explanation

We imported BeautifulSoup from the bs4 library to parse HTML content in Python. We then created a parse tree from an HTML string by using BeautifulSoup. After creating the parse tree, we used the find_all() method to locate all HTML elements in the tree. Finally, we iterated through these elements and printed their tag names.
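A minimal sketch of such a program is shown here. The HTML string is an assumption, chosen only so that it contains the same sequence of tags as the output listed above; the text inside the tags is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical document with a title, a heading, a paragraph and a list.
html_doc = """<html>
<head><title>Sample Page</title></head>
<body>
<h1>Welcome</h1>
<p>A short paragraph.</p>
<ul><li>One</li><li>Two</li><li>Three</li></ul>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# find_all(True) matches every tag in the tree, in document order.
tag_names = [tag.name for tag in soup.find_all(True)]

for name in tag_names:
    print("Tag Name:", name)
```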

Tree modification

BeautifulSoup provides various methods to modify the elements in the parse tree. Here are some common methods you can use to modify the tree elements:

Changing tag names

Modifying attributes

Deleting attributes

Modifying text content

Appending and inserting elements

Modifying strings
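Each of the operations listed above can be illustrated in a line or two. The following is a small sketch on made-up markup; the tag names, attribute values, and text are placeholders.

```python
from bs4 import BeautifulSoup

html = '<div id="box" class="old"><b>bold text</b><p>first</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Changing tag names: turn <b> into <strong>.
soup.b.name = "strong"

# Modifying attributes: assign to the tag like a dictionary.
soup.div["class"] = "new"

# Deleting attributes: remove the id from the <div>.
del soup.div["id"]

# Modifying text content: replace the paragraph's text.
soup.p.string = "updated"

# Appending and inserting elements: add one tag at the end, one at the start.
last_p = soup.new_tag("p")
last_p.string = "last"
soup.div.append(last_p)

first_span = soup.new_tag("span")
first_span.string = "start"
soup.div.insert(0, first_span)

# Modifying strings: swap the text node inside <strong> for a new one.
soup.strong.string.replace_with("strong text")

result = str(soup)
print(result)
```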

Example

Here is an example program that uses Beautiful Soup to modify the tree.

Program

Output:

<html>
 <head>
  <title>
   Sample Page
  </title>
 </head>
 <body>
  <div class="modified-class" id="content">
   <p>
    This paragraph has been modified.
   </p>
   <ul>
    <li>
     C++
    </li>
    <li>
     Java
    </li>
    <li>
     Python
    </li>
    <li>
     C#
    </li>
    <li>
     HTML
    </li>
   </ul>
  </div>
 </body>
</html>

Explanation

We first imported the Beautiful Soup library, then defined some HTML content and converted it into a Beautiful Soup object. After that, we selected the paragraph element and replaced its string. We also modified the div element by adding a class, and we added a new element created with new_tag. Finally, we formatted the tree with prettify() to make the HTML more readable and organized.
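The steps in this explanation might look roughly like the sketch below. The starting markup is an assumption (including the original paragraph text and the guess that "HTML" is the list item being added), so details such as attribute order may differ from the output shown earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical starting document.
html_content = """
<html><head><title>Sample Page</title></head>
<body>
<div id="content">
<p>This is the original paragraph.</p>
<ul><li>C++</li><li>Java</li><li>Python</li><li>C#</li></ul>
</div>
</body></html>
"""

soup = BeautifulSoup(html_content, "html.parser")

# Replace the paragraph's text.
soup.p.string = "This paragraph has been modified."

# Add a class attribute to the <div>.
soup.div["class"] = "modified-class"

# Create a new list item and append it to the list.
new_item = soup.new_tag("li")
new_item.string = "HTML"
soup.ul.append(new_item)

# prettify() returns the tree as an indented string.
pretty = soup.prettify()
print(pretty)
```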

Navigating and searching

Navigating and searching are vital to using Beautiful Soup to extract information from HTML or XML documents.

Example

Here's an example program that demonstrates how to navigate and search within an XML file using Beautiful Soup:

Program

Output:

University Name: university
First Course Title: Computer Science
--------------------------------------------------
1. Course Title: Computer Science
Instructor: Dr. Smith
--------------------------------------------------
2. Course Title: Mathematics
Instructor: Prof. Johnson
--------------------------------------------------

Explanation

This script uses Beautiful Soup to parse XML content describing university courses. The `BeautifulSoup` instance is created with the 'xml' parser to build the parse tree. The program first navigates directly to specific elements in the XML structure, such as the name of the university element and the title of the first course. It then uses the `find_all` method to search for all instances of the `<title>` tag and iterates through them, printing each course title along with its instructor. The script shows how direct navigation and tag-based searching make it straightforward to extract structured data from XML documents.
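A sketch along these lines is given below. The XML content is an assumption reconstructed from the names in the output, not the original data. The 'xml' parser relies on the third-party lxml package, so the sketch falls back to the built-in `html.parser` when lxml is missing; that fallback is a convenience added here, not part of the original program.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical XML describing two courses.
xml_content = """<university>
  <course>
    <title>Computer Science</title>
    <instructor>Dr. Smith</instructor>
  </course>
  <course>
    <title>Mathematics</title>
    <instructor>Prof. Johnson</instructor>
  </course>
</university>"""

try:
    soup = BeautifulSoup(xml_content, "xml")  # needs lxml
except FeatureNotFound:
    # Assumption: for this simple lowercase markup the HTML parser
    # builds an equivalent tree.
    soup = BeautifulSoup(xml_content, "html.parser")

# Direct navigation using attribute-style access.
root_name = soup.university.name
first_title = soup.university.course.title.get_text(strip=True)
print("University Name:", root_name)
print("First Course Title:", first_title)

# Tag-based search: visit every <title> and look up its sibling <instructor>.
courses = []
for index, title in enumerate(soup.find_all("title"), start=1):
    instructor = title.find_next_sibling("instructor")
    pair = (title.get_text(strip=True), instructor.get_text(strip=True))
    courses.append(pair)
    print(f"{index}. Course Title: {pair[0]}")
    print(f"Instructor: {pair[1]}")
```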

Conclusion

Beautiful Soup is a versatile Python library for parsing and navigating HTML and XML documents. It simplifies the task of parsing raw HTML/XML content, making it easy to extract data. The library handles poorly formatted markup and provides a rich set of methods for searching, navigating, and modifying the parse tree. This makes it a useful component of web scraping projects, where developers need to extract information from websites, analyze XML data, and restructure content.
