Posted on Feb 17, 2024

Working with Regular Expressions in Python

In this post, we'll explore some common operations on regular expressions in Python, using examples from the world of astronomy.

Regular expressions are a powerful tool for pattern matching and text processing. Python's re module provides several functions for working with regular expressions, including search(), match(), findall(), and sub().

The search() function searches a string for a pattern and returns a match object if the pattern is found. The match() function is similar to search(), but only matches at the beginning of the string. The findall() function returns a list of all non-overlapping matches of a pattern in a string. The sub() function replaces all occurrences of a pattern in a string with a specified replacement string.

Here are some examples of using these functions:

import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Search for a pattern
match = re.search(pattern="spiral", string=text)
if match:
    print(f"Found: {match.group()}")  # Output: Found: spiral

# Match at the beginning of the string
match = re.match(pattern=r"The", string=text)
if match:
    print(f"Found: {match.group()}")  # Output: Found: The

# Find all occurrences of a pattern (here, words of exactly five characters)
matches = re.findall(pattern=r"\b\w{5}\b", string=text)
print(matches)  # Output: ['light', 'years', 'Earth']

# Replace all occurrences of a pattern
new_text = re.sub(pattern=r"\d", repl="#", string=text)
print(new_text)  # Output: The Andromeda Galaxy is a spiral galaxy approximately #.# million light-years away from Earth.
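The difference between search() and match() is easy to miss when the pattern happens to sit at the start of the string. The following minimal sketch uses a hypothetical sample string (not from the examples above) in which the pattern appears only in the middle:

```python
import re

# Hypothetical sample string, chosen so that "spiral" is not at the start
sample = "Andromeda is a spiral galaxy."

# match() only tries the pattern at position 0, so it finds nothing here
at_start = re.match(r"spiral", sample)
print(at_start)  # None

# search() scans the whole string and returns the first occurrence
found = re.search(r"spiral", sample)
print(found.group(), found.start())  # spiral 15
```

If the pattern must begin the string, match() is the natural choice; otherwise search() is usually what you want.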

Regular expressions can also be used to extract specific information from a text. Here are some examples:

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Extract the first two words from a text
match = re.search(pattern=r"^(\w+)\s+(\w+)", string=text)
if match:
    print(f"First word: {match.group(1)}")   # Output: First word: The
    print(f"Second word: {match.group(2)}")  # Output: Second word: Andromeda

# Extract a starting number as long as it has 10 digits
match = re.search(pattern=r"^\d{10}", string=text)
if match:
    print(f"Found: {match.group()}")
# Output: (nothing is printed, because the text does not start with 10 digits)
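Since the sample text does not begin with digits, the pattern above never matches. As a sketch of the matching case, here is the same pattern applied to a hypothetical record string that starts with a 10-digit identifier (the identifier and record format are made up for illustration):

```python
import re

# Hypothetical record: a 10-digit identifier followed by a name
record = "0031415926 Andromeda Galaxy"

match = re.search(r"^\d{10}", record)
if match:
    print(f"Found: {match.group()}")  # Found: 0031415926

# Note: ^\d{10} also matches the first 10 digits of a longer run of digits.
# Appending \b rejects an 11-digit prefix, because no word boundary
# exists between the 10th and 11th digit.
too_long = re.search(r"^\d{10}\b", "00314159265 Andromeda Galaxy")
print(too_long)  # None
```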

# Separate a number into units and decimals
match = re.search(pattern=r"(\d+)\.(\d+)", string=text)
if match:
    print(f"Units: {match.group(1)}")     # Output: Units: 2
    print(f"Decimals: {match.group(2)}")  # Output: Decimals: 5

# Split text into words, using runs of whitespace as the separator
words = re.split(pattern=r"\s+", string=text)
print(words)  # Output: ['The', 'Andromeda', 'Galaxy', 'is', 'a', 'spiral', 'galaxy', 'approximately', '2.5', 'million', 'light-years', 'away', 'from', 'Earth.']

# Use regex to mimic the strip() function (the sample text has no surrounding whitespace, so it is unchanged)
stripped_text = re.sub(pattern=r"^\s+|\s+$", repl="", string=text)
print(stripped_text)  # Output: The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.
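Because the sample text has no leading or trailing whitespace, the substitution above leaves it unchanged. A small sketch with a hypothetical padded string makes the effect visible and compares it with str.strip():

```python
import re

# Hypothetical input with leading spaces/tab and a trailing space/newline
padded = "  \tAndromeda Galaxy \n"

stripped = re.sub(r"^\s+|\s+$", "", padded)
print(repr(stripped))  # 'Andromeda Galaxy'

# For this input the result is the same as the built-in strip()
print(stripped == padded.strip())  # True
```

In everyday code strip() is simpler; the regex form is mainly useful when the trimming rule needs to be customized.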

# Remove symbols from a filename except the dot character
filename = "image-of-the-andromeda-galaxy.jpg"
new_filename = re.sub(pattern=r"[^\w\.]", repl="", string=filename)
print(new_filename)  # Output: imageoftheandromedagalaxy.jpg

# Use regex to split a text into a list of words and get the frequency for the list of words
words = re.findall(pattern=r"\b\w+\b", string=text)
word_counts = {}
for word in words:
    word_counts[word] = word_counts.get(word, 0) + 1
print(word_counts)  # Output: {'The': 1, 'Andromeda': 1, 'Galaxy': 1, 'is': 1, 'a': 1, 'spiral': 1, 'galaxy': 1, 'approximately': 1, '2': 1, '5': 1, 'million': 1, 'light': 1, 'years': 1, 'away': 1, 'from': 1, 'Earth': 1}
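Note that the count above treats 'Galaxy' and 'galaxy' as different words. One possible variation, sketched here with collections.Counter from the standard library and an added lowercasing step (both are choices made for this illustration, not part of the example above), folds case before counting:

```python
import re
from collections import Counter

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Lowercase first so that "Galaxy" and "galaxy" are counted together
words = re.findall(r"\b\w+\b", text.lower())
counts = Counter(words)

print(counts["galaxy"])       # 2
print(counts.most_common(1))  # [('galaxy', 2)]
```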

# Use regex to split the text into sentences and get the frequency for each sentence
# (the pattern needs whitespace after the period, so this one-sentence text is not split
# and the final period is kept)
sentences = re.split(pattern=r"\.\s+", string=text)
sentence_counts = {}
for sentence in sentences:
    sentence_counts[sentence] = sentence_counts.get(sentence, 0) + 1
print(sentence_counts)  # Output: {'The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.': 1}
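The sample text contains only one sentence, so the split above has nothing to separate. The sketch below uses a hypothetical two-sentence passage to show what the same pattern does, along with a lookbehind variant (an alternative chosen for this illustration) that keeps the period on every sentence:

```python
import re

# Hypothetical two-sentence passage
passage = "Andromeda is a spiral galaxy. It is 2.5 million light-years away."

# The period that triggers the split is consumed; the final one is kept,
# and "2.5" is left alone because no whitespace follows its period
plain = re.split(r"\.\s+", passage)
print(plain)  # ['Andromeda is a spiral galaxy', 'It is 2.5 million light-years away.']

# A lookbehind splits on the whitespace only, so every period survives
kept = re.split(r"(?<=\.)\s+", passage)
print(kept)  # ['Andromeda is a spiral galaxy.', 'It is 2.5 million light-years away.']
```

Neither pattern handles abbreviations or other punctuation; real sentence splitting generally needs more than a single regex.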

These are just a few examples of the many powerful ways that regular expressions can be used to process and manipulate text in Python. With a little practice, you'll be able to use regular expressions to solve a wide variety of text-processing problems.
