A regular expression is a sequence of characters that describes a pattern to match in a piece of text or a text file. It is used for text mining in many programming languages. The pattern syntax is much the same across languages, but the functions for extracting, locating, detecting, and replacing matches differ from one language to another.
In this article, I will use R, but you can learn how to use regular expressions from it even if you plan to use another language. Regular expressions may look complicated when you do not know them, but they are easier than you think. I will explain as much as I can. You are welcome to ask questions in the comment section if any part is unclear.
Here we will learn by doing. I will start with very basic ideas and slowly move towards more complicated patterns.
I used RStudio for all the exercises in this article.
Here is a set of 7 strings that contain different patterns. We will use it to learn all the basics.
# str_extract_all() and the other str_* functions come from the stringr package
library(stringr)
ch = c('Nancy Smith',
'is there any solution?',
".[{(^$|?*+",
"coreyms.com",
"321-555-4321",
"123.555.1234",
"123*555*1234"
)
Extract all the dots or periods from those texts:
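The stringr package in R has a function called ‘str_extract_all’ that will extract all the dots from these strings. This function takes two arguments: first the texts of interest, and second the pattern to be extracted. A dot on its own matches any character, so a literal dot is written as ‘\\.’.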
str_extract_all(ch, "\\.")
Output:
[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "."

[[4]]
[1] "."

[[5]]
character(0)

[[6]]
[1] "." "."

[[7]]
character(0)
Look at the output carefully. The third string has one dot, the fourth string has one dot, and the sixth string has two dots.
There is another function, ‘str_extract’, that extracts only the first match from each string.
Try it yourself. I will use str_extract_all for most of the demonstrations in this article so that we find every match.
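If you want to check the difference before moving on, here is a minimal sketch. The string ‘x’ is a made-up example that is not part of ‘ch’, and the code assumes the stringr package is installed.

```r
library(stringr)

# Made-up example string with three dots (not from the article's data)
x <- "a.b.c."

# str_extract() returns only the first match, as a character vector
first_dot <- str_extract(x, "\\.")
print(first_dot)

# str_extract_all() returns every match, as a list with one element per input string
all_dots <- str_extract_all(x, "\\.")
print(all_dots[[1]])

# When nothing matches, str_extract() gives NA
no_match <- str_extract("no dots here", "\\.")
print(no_match)
```

The first call should give a single dot, the second all three dots, and the last one NA.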
Before going into more workouts, it will be good to see a list of patterns of regular expressions:
1. . = Matches any character (except a newline)
2. \d = Digit (0–9)
3. \D = Not a digit (0–9)
4. \w = Word character (a-z, A-Z, 0–9, _)
5. \W = Not a word character
6. \s = Whitespace (space, tab, newline)
7. \S = Not whitespace (space, tab, newline)
8. \b = Word boundary
9. \B = Not a word boundary
10. ^ = Beginning of a string
11. $ = End of a string
12. [] = Matches any one of the characters in the brackets
13. [^ ] = Matches any one character not in the brackets
14. | = Either or
15. ( ) = Group
16. * = 0 or more
17. + = 1 or more
18. ? = 0 or 1 (optional)
19. {x} = Exact number
20. {x,y} = Range of numbers (minimum, maximum)
We will keep referring to this list of expressions while working later.
We will work on all of them individually first and then in groups.
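The quantifiers in numbers 16 to 20 are easier to see side by side. Here is a small sketch on a made-up vector ‘words’ (my own example, not from the data above), assuming stringr is installed. ‘str_detect’ returns TRUE for each string that contains a match.

```r
library(stringr)

# Made-up strings: "c", then zero to four "o"s, then "l"
words <- c("cl", "col", "cool", "coool", "cooool")

zero_or_one  <- str_detect(words, "co?l")      # ? : zero or one "o"
zero_or_more <- str_detect(words, "co*l")      # * : zero or more "o"s
one_or_more  <- str_detect(words, "co+l")      # + : one or more "o"s
exactly_two  <- str_detect(words, "co{2}l")    # {2} : exactly two "o"s
two_to_three <- str_detect(words, "co{2,3}l")  # {2,3} : two or three "o"s

print(zero_or_one)   # TRUE TRUE FALSE FALSE FALSE
print(zero_or_more)  # TRUE TRUE TRUE TRUE TRUE
print(one_or_more)   # FALSE TRUE TRUE TRUE TRUE
print(exactly_two)   # FALSE FALSE TRUE FALSE FALSE
print(two_to_three)  # FALSE FALSE TRUE TRUE FALSE
```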
Starting With Basics
As per the list above, ‘\d’ catches the digits.
Extract all the digits from the ‘ch’:
str_extract_all(ch, "\\d")
Output:
[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
character(0)

[[5]]
[1] "3" "2" "1" "5" "5" "5" "4" "3" "2" "1"

[[6]]
[1] "1" "2" "3" "5" "5" "5" "1" "2" "3" "4"

[[7]]
[1] "1" "2" "3" "5" "5" "5" "1" "2" "3" "4"
The first four strings do not have any digits. The last three strings are phone numbers. The expression above could catch all the digits from the last three strings.
‘\D’, with a capital D, catches everything except the digits.
str_extract_all(ch, "\\D")
Output:
[[1]]
[1] "N" "a" "n" "c" "y" " " "S" "m" "i" "t" "h"

[[2]]
[1] "i" "s" " " "t" "h" "e" "r" "e" " " "a" "n" "y" " " "s" "o" "l" "u" "t" "i" "o" "n" "?"

[[3]]
[1] "." "[" "{" "(" "^" "$" "|" "?" "*" "+"

[[4]]
[1] "c" "o" "r" "e" "y" "m" "s" "." "c" "o" "m"

[[5]]
[1] "-" "-"

[[6]]
[1] "." "."

[[7]]
[1] "*" "*"
Look, it extracted letters, spaces, dots, and other special characters but did not extract any digits.
‘\w’ matches word characters, which include a-z, A-Z, 0–9, and ‘_’. Let’s check.
str_extract_all(ch, "\\w")
Output:
[[1]]
[1] "N" "a" "n" "c" "y" "S" "m" "i" "t" "h"

[[2]]
[1] "i" "s" "t" "h" "e" "r" "e" "a" "n" "y" "s" "o" "l" "u" "t" "i" "o" "n"

[[3]]
character(0)

[[4]]
[1] "c" "o" "r" "e" "y" "m" "s" "c" "o" "m"

[[5]]
[1] "3" "2" "1" "5" "5" "5" "4" "3" "2" "1"

[[6]]
[1] "1" "2" "3" "5" "5" "5" "1" "2" "3" "4"

[[7]]
[1] "1" "2" "3" "5" "5" "5" "1" "2" "3" "4"
It got everything except spaces, dots, and other special characters.
In contrast, ‘\W’ extracts everything but the word characters.
str_extract_all(ch, "\\W")
Output:
[[1]]
[1] " "

[[2]]
[1] " " " " " " "?"

[[3]]
[1] "." "[" "{" "(" "^" "$" "|" "?" "*" "+"

[[4]]
[1] "."

[[5]]
[1] "-" "-"

[[6]]
[1] "." "."

[[7]]
[1] "*" "*"
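Numbers 6 and 7 in the list, ‘\s’ and ‘\S’, work the same way as the pairs above. Here is a minimal sketch on a made-up string ‘x’ that contains one space and one tab, assuming stringr is installed. ‘str_count’ is another stringr function that counts the matches instead of returning them.

```r
library(stringr)

# Made-up string: "a", a space, "b", a tab, "c"
x <- "a b\tc"

# \s picks up the space and the tab
spaces <- str_extract_all(x, "\\s")[[1]]
print(spaces)

# \S picks up everything that is not whitespace
non_spaces <- str_extract_all(x, "\\S")[[1]]
print(non_spaces)

# Count the whitespace characters instead of extracting them
n_spaces <- str_count(x, "\\s")
print(n_spaces)
```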
Now I will show ‘\b’ and ‘\B’. ‘\b’ matches a word boundary. Here is an example:
st = "This is Bliss"
str_extract_all(st, "\\bis")
Output:
[[1]]
[1] "is"
Only one ‘is’ in the string starts at a word boundary: the standalone word ‘is’. So that is the one we caught here. Let’s see the use of ‘\B’:
st = "This is Bliss"
str_extract_all(st, "\\Bis")
Output:
[[1]]
[1] "is" "is"
In the string ‘st’ there are two other ‘is’s that do not start at a word boundary: the ones inside ‘This’ and ‘Bliss’. When you use capital B, you catch those.
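A common use of ‘\b’ is to put it on both sides of a word so that only the whole word matches. Here is a small sketch with the same string, assuming stringr is installed. ‘str_count’ and ‘str_replace_all’ are stringr functions that were not used above.

```r
library(stringr)

st <- "This is Bliss"

# Without boundaries, "is" is found three times: in "This", "is", and "Bliss"
n_all <- str_count(st, "is")
print(n_all)

# With \b on both sides, only the standalone word "is" matches
n_word <- str_count(st, "\\bis\\b")
print(n_word)

# Replace only the standalone word and leave "This" and "Bliss" alone
replaced <- str_replace_all(st, "\\bis\\b", "was")
print(replaced)
```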
Numbers 10 and 11 in the list of expressions above are ‘^’ and ‘$’, which indicate the beginning and the end of a string, respectively.
Here is an example:
sts = c("This is me",
"That my house",
"Hello, world!")
Find all the exclamation points that end a sentence.
str_extract_all(sts, "!$")
Output:
[[1]]
character(0)

[[2]]
character(0)

[[3]]
[1] "!"
Only one sentence ends with an exclamation point. To get the sentence itself rather than the matched character, use ‘str_detect’ to filter the vector:
sts[str_detect(sts, "!$")]
Output:
[1] "Hello, world!"
Find the sentences that start with ‘This’.
sts[str_detect(sts, "^This")]
Output:
[1] "This is me"
That is also only one.
Let’s find the sentences that start with “T”.
sts[str_detect(sts, "^T")]
Output:
[1] "This is me" "That my house"
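‘^’ and ‘$’ can also be used together to check that the whole string matches a pattern, not just a part of it. Here is a minimal sketch on a made-up vector ‘codes’ (my own example), assuming stringr is installed.

```r
library(stringr)

# Made-up strings: only the first one is exactly three digits, a dash, and four digits
codes <- c("555-1234", "call 555-1234 now", "555-12345")

# Without anchors, the pattern is found somewhere inside all three strings
anywhere <- str_detect(codes, "\\d{3}-\\d{4}")
print(anywhere)

# With ^ and $, the pattern has to cover the string from start to end
whole <- str_detect(codes, "^\\d{3}-\\d{4}$")
print(whole)
```

Only the first string passes the anchored check.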
‘[]’ matches any one of the characters or ranges listed inside it.
For this demonstration, let’s go back to ‘ch’. Extract every digit from 2 to 4.
str_extract_all(ch, "[2-4]")
Output:
[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
character(0)

[[5]]
[1] "3" "2" "4" "3" "2"

[[6]]
[1] "2" "3" "2" "3" "4"

[[7]]
[1] "2" "3" "2" "3" "4"
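Number 13 in the list, ‘[^ ]’, is the opposite of ‘[]’: a ‘^’ right after the opening bracket negates the set. Here is a small sketch on one of the phone numbers, assuming stringr is installed.

```r
library(stringr)

x <- "321-555-4321"

# Everything that is NOT a digit from 2 to 4
not_2_to_4 <- str_extract_all(x, "[^2-4]")[[1]]
print(not_2_to_4)

# Everything that is NOT a digit at all
not_digit <- str_extract_all(x, "[^0-9]")[[1]]
print(not_digit)

# A typical use: strip every non-digit to keep only the digits
digits_only <- str_replace_all(x, "[^0-9]", "")
print(digits_only)
```

Note that ‘^’ only means "not" when it is the first character inside the brackets. Outside the brackets it still means the beginning of a string.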
Let’s move on to some bigger experiments
Extract the phone numbers only from ‘ch’. I will explain the pattern after you see the output:
str_extract(ch, "\\d\\d\\d.\\d\\d\\d.\\d\\d\\d\\d")
Output:
[1] NA NA NA
[4] NA "321-555-4321" "123.555.1234"
[7] "123*555*1234"
In the regular expression above, each ‘\\d’ means a digit, and ‘.’ matches any single character in between (look at number 1 in the list of expressions at the beginning). So we ask for three digits, then any single character, three more digits, any single character again, and then four more digits. Anything that matches this pattern was extracted.
The regular expression for the phone number above can be written as follows as well.
str_extract(ch, "\\d{3}.\\d{3}.\\d{4}")
Output:
[1] NA NA NA
[4] NA "321-555-4321" "123.555.1234"
[7] "123*555*1234"
Look at number 19 of the expression list. {x} means an exact number of repetitions. Here we used {3}, which means exactly 3 times, so ‘\\d{3}’ means three digits.
But look, a ‘*’ between the digits is not a regular phone number format. Normally ‘-’ or ‘.’ is used as a separator in phone numbers, right? Let’s match those and exclude the number with ‘*’. It may look like a 10-digit phone number, but it may not be one. We want to stick to the regular phone number format.
str_extract(ch, "\\d{3}[-.]\\d{3}[-.]\\d{4}")
Output:
[1] NA NA NA
[4] NA "321-555-4321" "123.555.1234"
[7] NA
Look, this matches only the usual phone number format. In this expression, after three digits we explicitly wrote ‘[-.]’, which matches only a ‘-’ or a dot (‘.’). Inside square brackets the dot is a literal dot, so it does not need to be escaped.
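The three ways of writing the dot behave differently, so it is worth seeing them next to each other. This is a minimal sketch on two of the strings from ‘ch’, assuming stringr is installed.

```r
library(stringr)

x <- c("123.555.1234", "123*555*1234")

# A bare dot matches any character, so the "*" separator passes too
bare_dot <- str_detect(x, "\\d{3}.\\d{3}")
print(bare_dot)

# An escaped dot matches only a literal dot
escaped_dot <- str_detect(x, "\\d{3}\\.\\d{3}")
print(escaped_dot)

# A dot inside square brackets is also a literal dot
bracket_dot <- str_detect(x, "\\d{3}[.]\\d{3}")
print(bracket_dot)
```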
Here is a list of phone numbers:
ph = c("543-325-1278",
"900-123-7865",
"421.235.9845",
"453*2389*4567",
"800-565-1112",
"361 234 4356"
)
If we use the above expression on these phone numbers, this is what happens:
str_extract(ph, "\\d{3}[-.]\\d{3}[-.]\\d{4}")
Output:
[1] "543-325-1278" "900-123-7865" "421.235.9845"
[4] NA "800-565-1112" NA
Look! This pattern excluded “361 234 4356”. Sometimes we use a space as the separator, or no separator at all, right? Also, the first digit of a US phone number is not 0 or 1. It is a number from 2 to 9. All the other digits can be anything from 0 to 9. Let’s take care of that pattern.
p = "([2-9][0-9]{2})([- .]?)([0-9]{3})([- .])?([0-9]{4})"
str_extract(ph, p)
I saved the pattern separately here.
In a regular expression, ‘()’ is used to denote a group. Look at number 15 of the list of expressions.
Here is the breakdown of the expression above.
The first group is “([2-9][0-9]{2})”:
‘[2-9]’ represents one digit from 2 to 9
‘[0-9]{2}’ represents two digits from 0 to 9
The second group is “([- .]?)”:
‘[- .]’ means it can be ‘-’, a space, or ‘.’
using ‘?’ after that means the separator is optional. So, if there is no separator at all, that’s also ok.
The rest of the groups follow the same logic: three more digits, another optional separator, and four final digits.
Here is the output of the expression above:
[1] "543-325-1278" "900-123-7865" "421.235.9845"
[4] NA "800-565-1112" "361 234 4356"
It finds the phone number with ‘-’, ‘.’, and also with blanks as a separator.
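Groups are useful for more than readability, because each group can be pulled out or reused on its own. Here is a sketch under my own assumptions: I kept only the three digit groups in parentheses, and I used ‘str_match’ and ‘str_replace’, two stringr functions not used above. The vector ‘nums’ is a shortened copy of ‘ph’.

```r
library(stringr)

nums <- c("543-325-1278", "421.235.9845", "361 234 4356", "453*2389*4567")

# Same idea as the pattern above, but only the digit parts are grouped
p <- "([2-9][0-9]{2})[- .]?([0-9]{3})[- .]?([0-9]{4})"

# str_match() returns a matrix: column 1 is the full match,
# columns 2 to 4 are the three groups
m <- str_match(nums, p)
area_code <- as.vector(m[, 2])
print(area_code)

# In the replacement, \\1, \\2, and \\3 refer back to the groups,
# so every matched number is rewritten with "-" as the separator.
# Strings that do not match are returned unchanged.
normalized <- str_replace(nums, p, "\\1-\\2-\\3")
print(normalized)
```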
What if we need to find the phone numbers that start with 800 or 900?
p = "[89]00[-.]\\d{3}[-.]\\d{4}"
str_extract_all(ph, p)
Output:
[[1]]
character(0)

[[2]]
[1] "900-123-7865"

[[3]]
character(0)

[[4]]
character(0)

[[5]]
[1] "800-565-1112"

[[6]]
character(0)
Let’s understand the regular expression above: “[89]00[-.]\\d{3}[-.]\\d{4}”.
The first character should be 8 or 9. That can be achieved by [89].
The next two elements will be zeros. We explicitly mentioned that.
Then ‘-’ or ‘.’ which can be obtained by [-.].
Next three digits = \\d{3}
Again ‘-’ or ‘.’ = [-.]
Four more digits at the end = \\d{4}
Extract different formats of Email Addresses
Email addresses are a little more complicated than phone numbers, because an email address may contain upper case letters, lower case letters, digits, and special characters. Here is a set of email addresses:
email = c("RashNErel@gmail.com",
"rash.nerel@regen04.net",
"rash_48@uni.edu",
"rash_48_nerel@STB.org")
We will develop a regular expression that will extract all of those email addresses:
First work on the part before the ‘@’ symbol. This part may have lower case letters that can be detected using [a-z], upper case letters that can be detected using [A-Z], digits that can be found using [0-9], and special characters like ‘.’, ‘_’, and ‘-’. All of them can be packed like this:
“[a-zA-Z0-9._-]+”
The ‘+’ sign indicates one or more of those characters (look at number 17 of the list of expressions), because we do not know how many letters, digits, or special characters there will be. So this time we cannot use {x} the way we did for phone numbers. The ‘-’ is placed last inside the brackets so that it is read as a literal hyphen rather than as part of a range.
Now work on the part in between ‘@’ and ‘.’. This part may consist of upper case letters, lower case letters, and digits that can be detected as:
“[a-zA-Z0-9]+”
Finally, the part after ‘.’. Here we have four of them: ‘com’, ‘net’, ‘edu’, and ‘org’. These four can be caught using a group:
“(com|edu|net|org)”
Here the ‘|’ symbol is used to denote either-or. Look at number 14 of the list of expressions in the beginning.
Here is the full expression:
p = "[a-zA-Z0-9._-]+@[a-zA-Z0-9]+\\.(com|edu|net|org)"
str_extract_all(email, p)
Output:
[[1]]
[1] "RashNErel@gmail.com"

[[2]]
[1] "rash.nerel@regen04.net"

[[3]]
[1] "rash_48@uni.edu"

[[4]]
[1] "rash_48_nerel@STB.org"
It will also work if you do not spell out the part after the dot, as long as the dot is allowed in the second set of brackets. Because of the ‘+’ sign after that set, it will take any number of those characters, including the dot and the domain ending.
But if you need certain domain types like ‘com’ or ‘net’, you have to mention them explicitly as we did in the previous expression. Here is the looser pattern:
p = "[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+"
str_extract_all(email, p)
Output:
[[1]]
[1] "RashNErel@gmail.com"

[[2]]
[1] "rash.nerel@regen04.net"

[[3]]
[1] "rash_48@uni.edu"

[[4]]
[1] "rash_48_nerel@STB.org"
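Extracting an email address from a text and checking that a whole string is an email address are two different tasks. For the second one, the pattern can be wrapped in ‘^’ and ‘$’. The sketch below is only a rough check under my own assumptions: the pattern ‘p_whole’ and the vector ‘candidates’ are made up for illustration, the domain ending is taken to be two or more letters, and real email validation has many more rules than this.

```r
library(stringr)

# Made-up strings to test (not from the article's data)
candidates <- c("rash_48@uni.edu",
                "rash_48@uni",
                "not an email",
                "@gmail.com",
                "write to rash_48@uni.edu today")

# Unanchored: TRUE whenever an address-like piece appears anywhere in the string
p_part <- "[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}"
contains_email <- str_detect(candidates, p_part)
print(contains_email)

# Anchored: TRUE only when the whole string looks like an address
p_whole <- "^[a-zA-Z0-9._-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
is_email <- str_detect(candidates, p_whole)
print(is_email)
```

The last string contains an address, so the unanchored pattern finds it, but the anchored one rejects it because of the extra words around it.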
Another common complicated type is URLs
Here is a list of URLs:
urls = c("https://regenerativetoday.com",
"http://setf.ml",
"https://www.yahoo.com",
"http://studio_base.net"
)
A URL may start with ‘http’ or ‘https’. To detect that, this expression can be used:
‘https?’
That means ‘http’ stays intact. The ‘?’ sign after ‘s’ makes the ‘s’ optional. It may or may not be there.
Another optional part comes after the ‘://’ term: ‘www.’. We can define it using:
“(www\\.)?”
As before, ‘()’ is used to group some expressions. Here we are grouping ‘www’ and ‘.’. The ‘?’ after the parentheses means the whole term inside the parentheses is optional. It may or may not be there.
Then comes the domain name. In this set of URLs, we only have lower case letters and ‘_’. So, [a-z_] will work. But in general a domain name may contain upper case letters and digits as well. So we will use:
“\\w+”
Look at number 4 of the list of expressions. ‘\\w’ denotes a word character, which may be a lower case letter, an upper case letter, a digit, or ‘_’. The ‘+’ sign indicates that there are one or more of those characters.
After the domain, there is one more dot and then more characters. We will get them using:
“\\.\\w+”
Remember, if you use only a dot (.) to match a dot, it will not do what you want, because a single dot matches any character. To match only a literal dot, you need to write it as ‘\\.’
Here we used one dot denoted by “\\.”, then a word character “\\w” and a ‘+’ sign to indicate that there are one or more of those characters.
Let’s put it together:
p = "https?://(www\\.)?\\w+\\.\\w+"
str_extract_all(urls, p)
Output:
[[1]]
[1] "https://regenerativetoday.com"

[[2]]
[1] "http://setf.ml"

[[3]]
[1] "https://www.yahoo.com"

[[4]]
[1] "http://studio_base.net"
You may want to get only ‘.com’ or ‘.net’ domains. That can be explicitly mentioned.
p = "https?://(www\\.)?(\\w+)(\\.)+(com|net)"
str_extract_all(urls, p)
Output:
[[1]]
[1] "https://regenerativetoday.com"

[[2]]
character(0)

[[3]]
[1] "https://www.yahoo.com"

[[4]]
[1] "http://studio_base.net"
See, it only gets ‘.com’ or ‘.net’ domains and excludes the ‘.ml’ domain that we had.
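The parentheses also make it possible to pull single parts out of each URL. Here is a minimal sketch that uses ‘str_match’, a stringr function not used above, on a made-up vector ‘links’. The grouping of the domain name and the ending is my own choice for illustration.

```r
library(stringr)

links <- c("https://www.yahoo.com", "http://setf.ml")

# Group 1: optional "www.", group 2: domain name, group 3: ending
p <- "https?://(www\\.)?(\\w+)\\.(\\w+)"

# Column 1 of the matrix is the full match, columns 2 to 4 are the groups
m <- str_match(links, p)

domain_name <- as.vector(m[, 3])
ending <- as.vector(m[, 4])
print(domain_name)
print(ending)

# An optional group that did not take part in the match comes back as NA
print(m[, 2])
```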
Finally work on a set of names
That can be a bit tricky too. Here is a set of names:
name = c("Mr. Jon",
"Mrs. Jon",
"Mr Ron",
"Ms. Reene",
"Ms Julie")
Look, a name may start with Mr, Ms, or Mrs. Sometimes there is a dot after the title, sometimes not. Let’s work on this part first. In all of them ‘M’ is common. Keep it intact and make a group using the rest like this:
“M(r|s|rs)”
After ‘M’ it may be ‘r’ or ‘s’, or ‘rs’.
Then an optional dot that can be obtained by using:
“\\.?”
There is a space after that, which can be detected with:
“\\s”
After the space, the name starts with an upper case letter that can be matched using:
“[A-Z]”
After that upper case letter, there are some lower case letters and we do not know exactly how many. So, we will use this:
“\\w*”
Look at number 16 of the list of expressions. ‘*’ means 0 or more. So, we are saying there might be 0 or more word characters.
Putting it all together:
p = "M(r|s|rs)\\.?\\s[A-Z]\\w*"
str_extract_all(name, p)
Output:
[[1]]
[1] "Mr. Jon"

[[2]]
[1] "Mrs. Jon"

[[3]]
[1] "Mr Ron"

[[4]]
[1] "Ms. Reene"

[[5]]
[1] "Ms Julie"
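One detail about ‘|’ is worth knowing: the alternatives are tried from left to right, and the first one that lets the match succeed wins. That does not matter when the rest of the pattern forces the right choice, but it shows up when the title is extracted on its own. Here is a small sketch on a made-up vector ‘titles’, assuming stringr is installed. The last step uses ‘str_replace’, one of the replacing functions mentioned at the start of the article.

```r
library(stringr)

titles <- c("Mr. Jon", "Mrs. Jon", "Ms Julie")

# With "r" listed first, "Mrs" is cut short to "Mr"
short_first <- str_extract(titles, "M(r|s|rs)")
print(short_first)

# Listing the longest alternative first gives the full title
long_first <- str_extract(titles, "M(rs|r|s)")
print(long_first)

# Remove the title, the optional dot, and the space to keep only the name
names_only <- str_replace(titles, "^M(rs|r|s)\\.?\\s", "")
print(names_only)
```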
Congratulations! You worked on some complicated and cool patterns that should give you enough knowledge to start using regular expressions to match many common patterns.
Conclusion
This is not all. There is a lot more to regular expressions. But if you are a beginner, you should be proud that you have come a long way. I will make another tutorial later on advanced regular expressions. But you should be able to start using regular expressions now to do some cool things.