Working with PDF files in Python
17 Mar 2025

PDF is one of the most widely used digital document formats. The name stands for "Portable Document Format", and files in this format use the ".pdf" extension. The format is independent of software, hardware, and operating system, so it can be used for presenting and exchanging documents reliably. PDF was invented by Adobe and is now an open standard maintained by the International Organization for Standardization (ISO). A PDF file can also contain links, buttons, form fields, audio, video, and business logic for better interaction with its viewers.

In this tutorial, we will discuss how we can perform the following operations on PDF files: extracting text, rotating pages, merging files, splitting a file, and adding a watermark.
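We can perform all these operations by using a simple Python script.

Installation

For interacting with PDF files, we will be using a third-party module, PyPDF2. PyPDF2 is not part of Python's standard library; it is a PDF toolkit that has to be installed separately. This module is capable of every operation covered in this tutorial: extracting text, rotating pages, merging and splitting files, and overlaying watermarks.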
To install PyPDF2, we can run pip install PyPDF2 from the command line. The name of this module is case-sensitive when it is imported, so we have to make sure that the "y" is in lowercase and every other letter in the name is in uppercase.

Operations on PDF File using PyPDF2 Module

In this section, we will discuss various operations that we can perform on PDF files by using the PyPDF2 module in Python.

1. How to Extract Text from PDF Document File

We can extract the text from a PDF file by using the PyPDF2 module in Python.

Approach: For extracting the text from the PDF file, we will follow these steps:

Step 1: We will open the PDF file named 'exp.pdf' in binary mode and save the file object as "pdf_File_Object".
Step 2: We will create an object "pdf_Reader" of the "PdfFileReader" class of the "PyPDF2" module by passing it the PDF file object. This gives us an object for reading the PDF.
Step 3: For getting the number of pages in the PDF document file, we will use the numPages property of the reader object.
Step 4: We will create an object "page_Object" of the PageObject class of the "PyPDF2" module. The PDF reader object has the function "getPage()", which takes the page number (counted from 0) as an argument and returns the object of that page.
Step 5: We will use extractText(), which is a function of the page object, for extracting text from the PDF page.
Step 6: At last, we will close the PDF document file object.

Output:
No. of pages in the given PDF file: 10
GUIDELINES * FOR RE - OPENING OF CAMPUS IN VIEW OF COVID - 19 PANDEMIC (FOR STUDENTS ) 2021 - 22

This is the page count followed by the text of the first page of the PDF file.

2. How to Rotate PDF File Pages

We can rotate the pages of a PDF file using the PyPDF2 module in Python.

Approach: For rotating the pages of the given PDF file, we will use the following steps:

Step 1: We will create a PDF reader object for the original PDF.
Step 2: We will write the rotated pages to the new PDF file.
For writing into the PDF file, we will use an object of the PdfFileWriter class of the PyPDF2 module.
Step 3: We will iterate over each page of the original PDF document file. We will get each page object with the getPage() function of the PDF reader class, and then we will rotate the page by using the rotateClockwise() function of the page object class.
Step 4: We will add the pages to the PDF writer object using the addPage() function of the PDF writer class, passing the rotated page object.
Step 5: Then, we will write the PDF pages to the newly created PDF file. We can do this by opening the new file object and writing the PDF pages with the write() function of the PDF writer object.
Step 6: We will close the original PDF file object and the newly created file object.

Output: The new PDF file contains the pages of the original file, rotated clockwise.

3. How to Merge Two PDF Files

We can merge two PDF files by using the PyPDF2 module in Python.

Approach: For merging two PDF files in Python, we will use the following steps:

Step 1: For merging two PDF files, we will use a pre-built class, PdfFileMerger, of the PyPDF2 module.
Step 2: Then, we will append the file object of each PDF to the PDF merger object using the append() function.
Step 3: At last, we will write the PDF pages to the output PDF file by using the write() method of the PDF merger object.

Output: The result is a combined PDF named combined_exp.pdf, which is obtained by merging the exp.pdf and rotate_exp.pdf files.

4. How to Split PDF File

We can split a PDF document file in Python using the PyPDF2 module according to our requirements. This does not need any new function or class, only simple logic and iteration. The splits of the PDF are created according to the list splits_1 that we pass.

Output: The script generates 3 new PDF files, which are the split parts of the main PDF. We can check the PDF folder: it contains 3 new PDF files.

5. How to Add Watermark to PDF Pages

We can add a watermark to the pages of PDF document files using the PyPDF2 module in Python.

Approach: We will follow the same steps as in the page rotation example. The only difference is that each page object is converted into a watermarked page object by using the add_watermark() function.

The add_watermark() function works as follows. First, it creates a PDF reader object for the water_mark.pdf file. Then, on the page object passed to it, it calls the mergePage() function, passing the page object of the first page of the water_mark PDF reader object. This overlays the water_mark PDF on the passed page object.

Output: The script generates a user_watermark.pdf file, which carries the watermark of the water_mark.pdf file.

Conclusion

In this tutorial, we have discussed how we can perform different operations on PDF files using Python and the functions and methods of the PyPDF2 module.
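The splitting logic from section 4 can be sketched roughly as below. This is only an illustration under a few assumptions: the split list is taken to hold the page indices at which a new file should begin, the helper names split_ranges() and split_pdf() and the output naming scheme are hypothetical, and the PyPDF2 class and method names are the ones used in this tutorial, which we assume requires a PyPDF2 release older than 3.0.

```python
import os


def split_ranges(num_pages, split_points):
    """Turn split points into (start, end) page ranges; end is exclusive."""
    # Keep only the split points that fall strictly inside the document.
    points = sorted(p for p in set(split_points) if 0 < p < num_pages)
    bounds = [0] + points + [num_pages]
    return list(zip(bounds[:-1], bounds[1:]))


def split_pdf(path, split_points):
    """Write one PDF per page range and return the output file names."""
    # Imported here so split_ranges() stays usable without PyPDF2 installed.
    import PyPDF2  # assumption: a release older than 3.0

    outputs = []
    with open(path, "rb") as source:
        reader = PyPDF2.PdfFileReader(source)
        ranges = split_ranges(reader.numPages, split_points)
        for index, (start, end) in enumerate(ranges):
            writer = PyPDF2.PdfFileWriter()
            for page_number in range(start, end):
                writer.addPage(reader.getPage(page_number))
            # Hypothetical naming scheme: exp.pdf -> exp_split_0.pdf, ...
            output_name = "{}_split_{}.pdf".format(os.path.splitext(path)[0], index)
            with open(output_name, "wb") as target:
                writer.write(target)
            outputs.append(output_name)
    return outputs


# A 10-page file split at pages 2 and 4 gives three ranges.
print(split_ranges(10, [2, 4]))  # [(0, 2), (2, 4), (4, 10)]
```

With these assumptions, calling split_pdf("exp.pdf", [2, 4]) on a 10-page file would write three files, which matches the three split files described in section 4.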
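Sections 2 and 5 share the same page loop and differ only in what is done to each page, so the two can be sketched with one helper. This is a minimal sketch, not the tutorial's original script: copy_pages(), rotate(), overlay(), and watermark_file() are hypothetical names, and the PyPDF2 calls (PdfFileReader, PdfFileWriter, getPage(), addPage(), rotateClockwise(), mergePage()) are the ones named in the steps above, which we assume are available only in PyPDF2 releases older than 3.0.

```python
def copy_pages(reader, writer, transform):
    """Run every page of reader through transform() and add it to writer."""
    for page_number in range(reader.numPages):
        writer.addPage(transform(reader.getPage(page_number)))
    return writer


def rotate(angle):
    """Build a transform that rotates a page clockwise (multiples of 90)."""
    def transform(page):
        page.rotateClockwise(angle)
        return page
    return transform


def overlay(stamp_page):
    """Build a transform that merges stamp_page on top of each page."""
    def transform(page):
        page.mergePage(stamp_page)
        return page
    return transform


def watermark_file(source_path, stamp_path, target_path):
    """Overlay the first page of stamp_path on every page of source_path."""
    # Imported here so the helpers above work without PyPDF2 installed.
    import PyPDF2  # assumption: a release older than 3.0

    with open(source_path, "rb") as source, open(stamp_path, "rb") as stamp:
        reader = PyPDF2.PdfFileReader(source)
        stamp_page = PyPDF2.PdfFileReader(stamp).getPage(0)
        writer = copy_pages(reader, PyPDF2.PdfFileWriter(), overlay(stamp_page))
        # Write while the source files are still open.
        with open(target_path, "wb") as target:
            writer.write(target)
```

Under these assumptions, watermark_file("exp.pdf", "water_mark.pdf", "user_watermark.pdf") would produce the watermarked file from section 5, and passing rotate(90) to copy_pages() instead of overlay(...) would give the rotation from section 2.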