Parsing and Processing URL using Python - Regex
Last Updated : 02 Sep, 2020
Prerequisite: Regular Expression in Python
A URL (Uniform Resource Locator) is made up of several parts, such as the protocol, the domain name, the port number and the path. Simple URLs can be parsed and processed with regular expressions, which Python provides through the built-in re module.
Example:
URL: https://www.geeksforgeeks.org/courses
When we parse the above URL, we find:
Hostname: geeksforgeeks.org
Protocol: https
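We use the re.findall() function of the re module to search for the required pattern in the URL.
Syntax: re.findall(regex, string)
Return: all non-overlapping matches of the pattern in the string, as a list. If the pattern contains one capture group, the list holds the text of that group; if it contains several groups, the list holds tuples with one item per group.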
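The shape of the returned list depends on how many capture groups the pattern contains. The following is a minimal sketch of that behaviour; the sample text and the variable names are made up for illustration.

```python
import re

# Made-up sample text containing two URLs.
text = 'http://a.example and https://b.example'

# No capture group: each item is the whole match.
whole = re.findall(r'\w+://', text)

# One capture group: each item is the text of that group.
one_group = re.findall(r'(\w+)://', text)

# Two capture groups: each item is a tuple with one entry per group.
two_groups = re.findall(r'(\w+)://(\w+)', text)

print(whole)       # ['http://', 'https://']
print(one_group)   # ['http', 'https']
print(two_groups)  # [('http', 'a'), ('https', 'b')]
```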
Now, let's see the examples:
Example 1: In this Example, we will be extracting the protocol and the hostname from the given URL.
- Regular expression for extracting protocol group: '(\w+)://'.
- Regular expression for extracting hostname group: '://www\.([\w\-\.]+)'. The dot after www is escaped so that it matches only a literal dot.
Meta characters Used:
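- \w: Matches any alphanumeric character or underscore. For ASCII text this is equivalent to the class [a-zA-Z0-9_]; with Python 3 string patterns it also matches Unicode word characters unless the re.ASCII flag is set.
- +: One or more occurrences of the preceding character or group.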
Code:
Python3
# import library
import re

# url link
s = 'https://www.geeksforgeeks.org/'

# finding the protocol
# (raw strings keep backslashes such as \w intact)
obj1 = re.findall(r'(\w+)://', s)
print(obj1)

# finding the hostname, which may
# contain dashes or dots
obj2 = re.findall(r'://www\.([\w\-\.]+)', s)
print(obj2)
Output:
['https']
['geeksforgeeks.org']
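One detail of the hostname pattern deserves attention: an unescaped '.' matches any character, not only a literal dot. The sketch below compares the two spellings on a made-up URL and then shows one possible way to make the 'www.' prefix optional with a non-capturing group. The variable names are illustrative only.

```python
import re

# Made-up URL in which 'www' is followed by 'X' instead of a dot.
s = 'https://wwwXgeeksforgeeks.org/'

# Unescaped dot: '.' matches the 'X', so the pattern still matches.
loose = re.findall(r'://www.([\w\-.]+)', s)

# Escaped dot: a literal '.' is required, so nothing matches.
strict = re.findall(r'://www\.([\w\-.]+)', s)

# Optional prefix: (?:www\.)? skips 'www.' when it is present.
optional = re.findall(r'://(?:www\.)?([\w\-.]+)', 'https://geeksforgeeks.org/')

print(loose)     # ['geeksforgeeks.org']
print(strict)    # []
print(optional)  # ['geeksforgeeks.org']
```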
Example 2: A URL of a different type, such as 'file://localhost:4040/abc_file', may include a port number. Here the port number '4040' follows the ':' sign and consists of digits, so it is matched with (:(\d+)). Since not every URL includes a port number, the group is made optional with the '?' notation: '(:(\d+))?'. Because the full pattern then contains three capture groups, re.findall() returns a tuple of three strings for each match.
Meta characters Used:
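- \d: Matches any decimal digit. For ASCII text this is equivalent to the class [0-9]; with Python 3 string patterns it also matches other Unicode decimal digits unless the re.ASCII flag is set.
- +: One or more occurrences of the preceding character or group.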
- ?: Matches zero or one occurrence.
Code:
Python3
# import library
import re

# url link
s = 'file://localhost:4040/abc_file'

# finding the protocol capture group
obj1 = re.findall(r'(\w+)://', s)
print(obj1)

# finding the hostname, which may
# contain dashes or dots
obj2 = re.findall(r'://([\w\-\.]+)', s)
print(obj2)

# finding the hostname together with
# the optional port number
obj3 = re.findall(r'://([\w\-\.]+)(:(\d+))?', s)
print(obj3)
Output:
['file']
['localhost']
[('localhost', ':4040', '4040')]
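The last result contains the port twice, once with the colon and once without, because both the outer and the inner parentheses capture. If only the digits are wanted, the outer group can be written as a non-capturing group (?:...). A minimal sketch, in which the name HOST_PORT is made up for illustration:

```python
import re

# Hypothetical pattern: (?:...) groups without capturing, so only the
# hostname and the port digits end up in the result.
HOST_PORT = re.compile(r'://([\w\-.]+)(?::(\d+))?')

with_port = HOST_PORT.findall('file://localhost:4040/abc_file')
without_port = HOST_PORT.findall('https://www.geeksforgeeks.org/')

print(with_port)     # [('localhost', '4040')]
print(without_port)  # [('www.geeksforgeeks.org', '')]
```

When the optional group does not take part in the match, re.findall() fills its slot with an empty string.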
Example 3: For a general URL, the following pattern can be used. It extracts the protocol and the hostname as well as the path elements, here a file name and its extension.
Python3
# import library
import re

# url
s = 'http://www.example.com/index.html'

# searching for all capture groups; the dot between
# the file name and the extension is escaped
obj = re.findall(r'(\w+)://([\w\-\.]+)/(\w+)\.(\w+)', s)
print(obj)
Output:
[('http', 'www.example.com', 'index', 'html')]
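The pieces from the three examples can be combined into one pattern with named groups, which makes the result easier to read than a positional tuple. The following is a sketch under stated assumptions, not a complete URL grammar: URL_PATTERN and parse_url are hypothetical names, and the path group simply takes everything up to a '?', '#' or whitespace.

```python
import re

# Hypothetical pattern: protocol, host, optional port and optional path,
# each exposed as a named group.
URL_PATTERN = re.compile(
    r'(?P<protocol>\w+)://'
    r'(?P<host>[\w\-.]+)'
    r'(?::(?P<port>\d+))?'
    r'(?P<path>/[^\s?#]*)?'
)

def parse_url(url):
    """Return a dict of URL parts, or None if the pattern does not match."""
    match = URL_PATTERN.match(url)
    return match.groupdict() if match else None

print(parse_url('http://www.example.com/index.html'))
# {'protocol': 'http', 'host': 'www.example.com', 'port': None, 'path': '/index.html'}
print(parse_url('file://localhost:4040/abc_file'))
# {'protocol': 'file', 'host': 'localhost', 'port': '4040', 'path': '/abc_file'}
```

Groups that do not take part in the match, such as a missing port, are reported as None by groupdict().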