Replace Regex in Python String

When replacing strings within a text, we sometimes need to replace patterns rather than exact strings. Consider a simple example. We have a document that contains a number of email IDs, and we want to mask them before sending the document. In other words, whenever we detect an email ID, we want to replace it with [[EMAIL Classified]]. To get this done we need to match patterns instead of exact strings. In programming, such patterns are written as regular expressions (RegEx). We can compile a RegEx to represent a text pattern.

Unfortunately, the replace method of Python strings doesn't accept a RegEx as input. So how do we replace a pattern, or RegEx, in a Python string?
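A quick sketch of that limitation, using a made-up sample sentence (the variable names here are only for illustration): str.replace treats its first argument as literal text, so a pattern such as \d+ is searched for character by character and nothing gets replaced.

```python
# str.replace() looks for the literal characters "\d+",
# not for "one or more digits".
sample = "Order 123 shipped on day 45"  # hypothetical sample text

literal_result = sample.replace(r"\d+", "#")

# The text does not contain the characters \d+ themselves,
# so the string comes back unchanged.
print(literal_result)
```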

RegEx replace in Python

Python comes with a built-in module for regex handling, called "re". The "re" module contains a function called "sub" that replaces every match of a regex pattern with something else.

When it comes to using RegEx, the hardest part is getting the right expression in place. In this blog, I am not explaining how to write a RegEx pattern. We could use a tool like RegExr to build and test one. We will instead look at a few use cases where replacing RegEx patterns can solve complex problems.
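As a minimal sketch of the idea, here is "sub" applied to a made-up sentence where every run of digits is replaced with "#". The sample text and variable names are assumptions for illustration. The second half shows the same replacement through a pattern compiled once with re.compile, which can be handy when one pattern is reused many times.

```python
import re

sample = "Order 123 shipped on day 45"  # hypothetical sample text

# Module-level function: pattern, replacement, text.
masked = re.sub(r"\d+", "#", sample)
print(masked)  # Order # shipped on day #

# Same replacement using a compiled pattern object.
digits = re.compile(r"\d+")
masked_again = digits.sub("#", sample)
print(masked_again)  # Order # shipped on day #
```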

RegEx sub to replace email IDs in a document

In the code below we will replace email IDs in a document with the phrase [[EMAIL Classified]].

The first step is to identify a RegEx pattern that matches email IDs. A bit of Googling reveals the following pattern:

'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''

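You may wonder why such a complex pattern is needed. It's because email IDs can appear in many forms: dotted names, special characters, quoted local parts, and even bracketed IP addresses in place of a domain. The above pattern tries to cover these cases. Note that it only lists lowercase letters, so it will not match uppercase addresses unless case-insensitive matching is turned on.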

Here’s the code for substituting the above regex in a text:

import re

# Raw string (r'''...''') so that backslashes reach the regex engine unchanged.
reg_email = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''

str1 = "My name is Anirban Sengupta. My email id is anirbansengupta31@gmail.com"

# Replace every match of the email pattern with the placeholder.
text = re.sub(reg_email, "[[EMAIL Classified]]", str1)

print(text)

Here’s the output:

My name is Anirban Sengupta. My email id is [[EMAIL Classified]]

You may have already noted that the sub function accepts the following arguments:

  1. The pattern to be replaced
  2. The replacement string
  3. The text to be acted upon

The function thus looks like:

re.sub(pattern_to_be_replaced, replacement_pattern, text)
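The replacement does not have to be a fixed string. As a sketch: re.sub also accepts backreferences such as \2 in the replacement, or a function that receives each match object and returns the replacement text. The short email pattern below is a deliberately simplified stand-in for illustration, not the full pattern used in this post, and the sample sentence and helper name are made up.

```python
import re

# Simplified, hypothetical email pattern with two capture groups:
# group 1 = local part, group 2 = domain.
simple_email = r"([\w.+-]+)@([\w-]+(?:\.[\w-]+)+)"

sample = "Write to jane.doe@example.com today"  # made-up sample text

# Backreference: keep the domain (group 2) in the placeholder.
tagged = re.sub(simple_email, r"[[EMAIL at \2]]", sample)
print(tagged)  # Write to [[EMAIL at example.com]] today


def star_local_part(match):
    """Replace each character of the local part with '*'."""
    return "*" * len(match.group(1)) + "@" + match.group(2)


# Function replacement: called once per match.
starred = re.sub(simple_email, star_local_part, sample)
print(starred)  # Write to ********@example.com today
```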

It also accepts an optional count argument that limits the changes to a certain number of occurrences. For example, consider the following code:

import re

# Raw string (r'''...''') so that backslashes reach the regex engine unchanged.
reg_email = r'''(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])'''

str1 = "My name is Anirban Sengupta. My email id is anirbansengupta31@gmail.com. My alternate email id is anirban.sengupta@gmail.com"

# count is passed by keyword; passing it positionally is deprecated in recent Python versions.
text = re.sub(reg_email, "[[EMAIL Classified]]", str1, count=1)

print(text)

The output is as follows:

My name is Anirban Sengupta. My email id is [[EMAIL Classified]]. My alternate email id is anirban.sengupta@gmail.com

Since we specified a count of 1, only the first instance of the pattern was replaced.
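If it is useful to know how many replacements were actually made, the "re" module also provides subn, which takes the same arguments as sub and returns the new string together with the number of substitutions. A small sketch with a made-up string:

```python
import re

sample = "a1 b22 c333"  # hypothetical sample text

# Replace at most two runs of digits and report how many were replaced.
result, n = re.subn(r"\d+", "#", sample, count=2)

print(result)  # a# b# c333
print(n)       # 2
```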

RegEx sub to remove footnote references from a document

When we copy text from Wikipedia, we often end up with footnote markers within the text. Let's check the example below:

text= '''Global warming is the long-term rise in the average temperature of the Earth's climate system. It is a major aspect of current climate change, and has been demonstrated by direct temperature measurements and by measurements of various effects of the warming.[1][2] The term commonly refers to the mainly human-caused increase in global surface temperatures and its projected continuation.[3][4] In this context, the terms global warming and climate change are often used interchangeably,[5] but climate change includes both global warming and its effects, such as changes in precipitation and impacts that differ by region.[6] There were prehistoric periods of global warming,[7] but observed changes since the mid-20th century have been much greater than those seen in previous records covering decades to thousands of years.[1][8]'''

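Can we remove these footnote references using RegEx sub? Yes, we can. Check the code below:

# Raw string so the backslashes reach the regex engine unchanged.
# \[ and \] match literal square brackets; \w+ matches the label between them.
clean_text = re.sub(r"\[\w+\]", "", text)
print(clean_text)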

The output is as follows:

Global warming is the long-term rise in the average temperature of the Earth's climate system. It is a major aspect of current climate change, and has been demonstrated by direct temperature measurements and by measurements of various effects of the warming. The term commonly refers to the mainly human-caused increase in global surface temperatures and its projected continuation. In this context, the terms global warming and climate change are often used interchangeably, but climate change includes both global warming and its effects, such as changes in precipitation and impacts that differ by region. There were prehistoric periods of global warming, but observed changes since the mid-20th century have been much greater than those seen in previous records covering decades to thousands of years.

In the output text, the footnote references have been removed.

The regex string in this case was short and simple: \[\w+\]
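One thing to keep in mind is that \w+ matches letters as well as digits, so any bracketed word is removed too. If only numeric footnote markers should go, a narrower pattern such as \[\d+\] may be a safer choice. The sketch below compares the two on a made-up sentence; the sample text and variable names are assumptions for illustration.

```python
import re

# Hypothetical sample with numeric footnotes and a bracketed word.
sample = "Water boils at 100 C.[1] The original reads teh [sic] here.[12]"

# \w+ removes every bracketed label, including "[sic]".
broad = re.sub(r"\[\w+\]", "", sample)

# \d+ removes only bracketed numbers and leaves "[sic]" in place.
narrow = re.sub(r"\[\d+\]", "", sample)

print(broad)
print(narrow)  # Water boils at 100 C. The original reads teh [sic] here.
```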
