In this post, we'll explore some common operations on regular expressions in Python, using examples from the world of astronomy.
Regular expressions are a powerful tool for pattern matching and text processing. Python's `re` module provides several functions for working with them, including `search()`, `match()`, `findall()`, and `sub()`.

The `search()` function scans a string for a pattern and returns a match object for the first location where the pattern is found, or `None` if there is no match. The `match()` function is similar to `search()`, but only matches at the beginning of the string. The `findall()` function returns a list of all non-overlapping matches of a pattern in a string. The `sub()` function replaces all occurrences of a pattern in a string with a specified replacement string.
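The difference between `search()` and `match()` is easiest to see when the same pattern is run through both. The following is a minimal sketch; the sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample string, chosen only for this illustration.
text = "Andromeda is a spiral galaxy."

# search() scans the whole string, so it finds "spiral" in the middle.
found_anywhere = re.search(r"spiral", text)
print(found_anywhere.group(), found_anywhere.span())

# match() only tries position 0, so the same pattern returns None here.
found_at_start = re.match(r"spiral", text)
print(found_at_start)

# match() succeeds when the pattern sits at the very start of the string.
starts_with_name = re.match(r"Andromeda", text)
print(starts_with_name.group())
```

Because both functions return `None` on failure, it is usually safest to check the result before calling `group()` on it.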
Here are some examples of using these functions:

    import re

    text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

    # Search for a pattern anywhere in the string
    match = re.search(pattern="spiral", string=text)
    if match:
        print(f"Found: {match.group()}")  # Output: Found: spiral

    # Match at the beginning of the string
    match = re.match(pattern=r"The", string=text)
    if match:
        print(f"Found: {match.group()}")  # Output: Found: The

    # Find all words that are exactly five characters long
    matches = re.findall(pattern=r"\b\w{5}\b", string=text)
    print(matches)  # Output: ['light', 'years', 'Earth']

    # Replace every digit with "#"
    new_text = re.sub(pattern=r"\d", repl="#", string=text)
    print(new_text)  # Output: The Andromeda Galaxy is a spiral galaxy approximately #.# million light-years away from Earth.

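One detail of `findall()` that often surprises people: when the pattern contains capturing groups, the function returns the groups rather than the whole match. The sketch below shows both cases; the sample sentence and its numbers are invented for illustration only.

```python
import re

# Made-up sample text; the galaxies and distances are placeholders.
sample = "Galaxy A is 2.5 million light-years away; galaxy B is 2.7 million."

# No capturing groups: each list item is the full matched text.
whole = re.findall(r"\d+\.\d+", sample)
print(whole)

# Two capturing groups: each list item is a tuple of the captured parts.
parts = re.findall(r"(\d+)\.(\d+)", sample)
print(parts)
```

If you want grouping without this effect, a non-capturing group `(?:...)` keeps `findall()` returning full matches.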
Regular expressions can also be used to extract specific information from a text. Here are some examples:

    text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

    # Extract the first two words from a text
    match = re.search(pattern=r"^(\w+)\s+(\w+)", string=text)
    if match:
        print(f"First word: {match.group(1)}")   # Output: First word: The
        print(f"Second word: {match.group(2)}")  # Output: Second word: Andromeda

    # Extract a starting number as long as it has 10 digits
    match = re.search(pattern=r"^\d{10}", string=text)
    if match:
        print(f"Found: {match.group()}")
    # No output: the text does not start with 10 digits, so search() returns None

    # Separate a number into units and decimals
    match = re.search(pattern=r"(\d+)\.(\d+)", string=text)
    if match:
        print(f"Units: {match.group(1)}")     # Output: Units: 2
        print(f"Decimals: {match.group(2)}")  # Output: Decimals: 5

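Numbered groups get hard to read as patterns grow, so named groups are a common alternative. As a rough sketch, the distance in the sample sentence could be pulled out and converted to a number like this. The pattern, the `SCALES` table, and the variable names are assumptions made for this example, not part of the `re` module.

```python
import re

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Hypothetical pattern: a number, a scale word, then "light-years".
distance_pattern = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s+(?P<scale>million|billion)\s+light-years"
)

# Hypothetical lookup table for turning the scale word into a multiplier.
SCALES = {"million": 1e6, "billion": 1e9}

match = distance_pattern.search(text)
if match:
    # Named groups are read by name instead of by position.
    distance_ly = float(match.group("value")) * SCALES[match.group("scale")]
    print(distance_ly)  # 2500000.0
```

The `(?:...)` part is a non-capturing group, so the optional decimal portion does not create an extra numbered group.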

    # Separate text into words using whitespace as the delimiter
    words = re.split(pattern=r"\s+", string=text)
    print(words)  # Output: ['The', 'Andromeda', 'Galaxy', 'is', 'a', 'spiral', 'galaxy', 'approximately', '2.5', 'million', 'light-years', 'away', 'from', 'Earth.']

    # Use regex similar to the strip() function
    # (the sample text has no leading or trailing whitespace, so it comes back unchanged)
    stripped_text = re.sub(pattern=r"^\s+|\s+$", repl="", string=text)
    print(stripped_text)  # Output: The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.

    # Remove symbols from a filename except the dot character
    filename = "image-of-the-andromeda-galaxy.jpg"
    new_filename = re.sub(pattern=r"[^\w\.]", repl="", string=filename)
    print(new_filename)  # Output: imageoftheandromedagalaxy.jpg

    # Use regex to split a text into a list of words and count how often each word occurs
    words = re.findall(pattern=r"\b\w+\b", string=text)
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    print(word_counts)  # Output: {'The': 1, 'Andromeda': 1, 'Galaxy': 1, 'is': 1, 'a': 1, 'spiral': 1, 'galaxy': 1, 'approximately': 1, '2': 1, '5': 1, 'million': 1, 'light': 1, 'years': 1, 'away': 1, 'from': 1, 'Earth': 1}

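Note that the count above treats "Galaxy" and "galaxy" as different words because the match is case-sensitive. One possible variation, sketched here as an alternative rather than a correction, lowercases the text first and lets `collections.Counter` from the standard library do the tallying.

```python
import re
from collections import Counter

text = "The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth."

# Lowercase first so "Galaxy" and "galaxy" are counted as the same word.
words = re.findall(r"\b\w+\b", text.lower())

# Counter builds the word -> frequency mapping in one step.
word_counts = Counter(words)

print(word_counts["galaxy"])       # 2
print(word_counts.most_common(1))  # [('galaxy', 2)]
```

Whether to fold case depends on the task; for proper nouns such as galaxy names, the case-sensitive count may be the one you want.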

    # Use regex to split the text into sentences and count how often each sentence occurs
    # (the pattern needs whitespace after the period, so "2.5" and the final period are not split points)
    sentences = re.split(pattern=r"\.\s+", string=text)
    sentence_counts = {}
    for sentence in sentences:
        sentence_counts[sentence] = sentence_counts.get(sentence, 0) + 1
    print(sentence_counts)  # Output: {'The Andromeda Galaxy is a spiral galaxy approximately 2.5 million light-years away from Earth.': 1}

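The sample text holds only one sentence, so the split above has nothing to separate. To see the pattern actually at work, here is a small sketch using a made-up three-sentence paragraph. It also shows a lookbehind variant, which is one possible way to keep the closing punctuation attached to each sentence.

```python
import re

# Hypothetical multi-sentence text, written only for this illustration.
paragraph = (
    "Andromeda is about 2.5 million light-years away. "
    "It is a spiral galaxy. It is also called M31."
)

# Splitting on the period consumes it, except for the last sentence,
# whose period is not followed by whitespace.
consumed = re.split(r"\.\s+", paragraph)
print(consumed)

# A lookbehind splits on the whitespace only, so punctuation is kept.
kept = re.split(r"(?<=[.!?])\s+", paragraph)
print(kept)
```

Neither pattern splits inside "2.5", because the period there is followed by a digit rather than whitespace. Simple patterns like these will still trip over abbreviations such as "approx. 2.5", so treat them as a starting point.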
These are just a few examples of the many powerful ways that regular expressions can be used to process and manipulate text in Python. With a little practice, you'll be able to use regular expressions to solve a wide variety of text-processing problems.