BYU IT&C 210

Regular Expressions Walkthrough

Known as “REs”, “regexes”, or “regex patterns”, regular expressions are a standard language for matching text patterns. They are supported in many programming languages, including Python, JavaScript, PHP, Perl, and C#. They are also built into text editors like Visual Studio Code and into command-line utilities like grep, sed, and awk.

In this walkthrough, we will use Regular Expressions in Visual Studio Code to find and update text.

Resources

Getting Started

  • Download SampleText.txt
  • Open the file in Visual Studio Code
  • Open the search bar (Ctrl+F or Edit > Find)
  • Turn on “Use Regular Expression” by clicking the .* button on the search bar or by pressing Alt+R
  • Turn on “Match Case” by clicking the Aa button on the search bar or by pressing Alt+C

Character Classes and Anchors

The following are the basic character classes, anchors, and related operators.

.         Any single character (usually excluding newline)
\w        Word character: any letter, digit, or underscore (\W is the opposite)
\b        Boundary between a word character and a non-word character
\d        Decimal digit
\s        Whitespace (space, tab, and in most engines newline)
\n        Newline
[aeiou]   Custom character set. This example matches any lower-case English vowel.
[a-z]     Custom character range. This example matches any lower-case letter from 'a' to 'z'.
[^q]      Negated character set. This example matches any single character except 'q'.
^         Beginning of a line
$         End of a line
\         Escape (treat the following special character as a literal)
|         Or: matches either the pattern on its left or the one on its right

Try the following expressions in the Visual Studio Code search box:

Expression   Matches
a            All occurrences of the letter 'a'.
ab           All occurrences of the letter 'a' immediately followed by 'b'.
^a           The letter 'a' at the beginning of a line.
b$           The letter 'b' at the end of a line.
[0-9ijk]     Any single digit or one of the letters 'i', 'j', or 'k'.
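
The same expressions can be tried outside the editor. The sketch below uses Python's re module, which is an assumption of this example (the walkthrough itself uses the Visual Studio Code search box, whose engine differs in a few details). The sample string is made up for illustration and is not the contents of SampleText.txt.

```python
import re

# Hypothetical sample text, four lines long.
sample = "abacus\nbanana 42\ncrab\njab"

# re.MULTILINE makes ^ and $ match at every line, as an editor search does.
starts_with_a = re.findall(r"^a", sample, flags=re.MULTILINE)
ends_with_b = re.findall(r"b$", sample, flags=re.MULTILINE)
a_then_b = re.findall(r"ab", sample)
digits_or_ijk = re.findall(r"[0-9ijk]", sample)

print(starts_with_a)   # ['a']
print(ends_with_b)     # ['b', 'b']
print(a_then_b)        # ['ab', 'ab', 'ab']
print(digits_or_ijk)   # ['4', '2', 'j']
```

Without re.MULTILINE, Python's ^ and $ match only at the start and end of the whole string, so the flag is needed to imitate line-by-line searching.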

Repetition

The following patterns let you match a specific number of repetitions.

*       Zero or more occurrences of the preceding pattern.
+       One or more occurrences of the preceding pattern.
?       Zero or one occurrence of the preceding pattern.
{6}     Exactly six occurrences of the preceding pattern.
{2,5}   Between two and five occurrences of the preceding pattern.
()      Parentheses group several symbols into one pattern, so that a repetition applies to the whole group.
?       When placed after a repetition (as in *? or +?), makes that repetition non-greedy (see below).

Try the following expressions in the Visual Studio Code search box:

Expression   Matches
a+           Any run of one or more 'a' characters.
(ab)+        Any run of 'ab' repeated one or more times.
r.*m         Any sequence of characters, on a single line, that starts with an 'r' and ends with an 'm'.

By default, repetitions are “greedy”: they match as many repetitions as possible. Following the repetition with a ? makes it non-greedy, so it matches as few repetitions as possible.

Try this variation on the last pattern from the previous table:

r.*?m   How are the matches different from before?
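
One way to see the difference between greedy and non-greedy repetition is to run both patterns on the same line. This is a minimal sketch in Python's re module (an assumption of this example, not part of the walkthrough), using a made-up line of text.

```python
import re

# Hypothetical single line of text.
line = "remember the room"

# Greedy: .* takes as much as it can, then backs up to the last 'm'.
greedy = re.findall(r"r.*m", line)

# Non-greedy: .*? stops at the first 'm' it can reach.
lazy = re.findall(r"r.*?m", line)

print(greedy)  # ['remember the room']
print(lazy)    # ['rem', 'r the room']
```

The greedy pattern produces one long match, while the non-greedy pattern produces several shorter ones.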

Putting things together

Try these patterns and make up your own:

Expression                          Matches
[\w.-]+@[\w.-]+                     Email addresses (but may also match other text).
\d[A-Z]{3}\d{3}                     California license plates (one digit, three letters, three digits).
^[A-Z][a-z]+ [A-Z][a-z]+$           Most two-word capitalized names.
^[A-Z][a-z]+ [A-Z]\. [A-Z][a-z]+$   Most general authority names.
(Fred)|(George)                     Fred or George.
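
As a rough illustration of how some of these combined patterns behave, here is a sketch in Python's re module (an assumption of this example). The text, the email address, the plate number, and the names are all invented for the demonstration.

```python
import re

# Hypothetical text containing one email address and one license plate.
text = "Contact fred@example.com or see plate 7ABC123."

emails = re.findall(r"[\w.-]+@[\w.-]+", text)
plates = re.findall(r"\d[A-Z]{3}\d{3}", text)

# The email pattern is loose: it also accepts strings that are not addresses.
not_really_email = re.findall(r"[\w.-]+@[\w.-]+", "..@..")

# The two-word name pattern, tested against whole lines.
name_pattern = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")
candidates = ["Fred Smith", "George", "isaac newton"]
names = [c for c in candidates if name_pattern.search(c)]

print(emails)            # ['fred@example.com']
print(plates)            # ['7ABC123']
print(not_really_email)  # ['..@..']
print(names)             # ['Fred Smith']
```

The third result shows why the table says the email pattern may match other text: it only requires runs of allowed characters on both sides of the @ sign.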

The general authority names pattern didn’t match names that start with an initial and then the middle name. How can you combine two patterns with the | operator to match both name styles?

Lookahead

Lookahead operators let you specify content that must or must not immediately follow a pattern.

(?=Joe)   The word 'Joe' must immediately follow the pattern, but it is not included in the match.
(?!Joe)   The word 'Joe' must NOT immediately follow the pattern.

Expression         Matches
Isaac(?= Asimov)   Isaac if it is immediately followed by " Asimov".
Isaac(?! Asimov)   Isaac if it is NOT immediately followed by " Asimov" (e.g. Isaac Newton).
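
The two lookahead expressions can be compared on a single sentence. The following is a minimal sketch using Python's re module (an assumption of this example), with a made-up sentence. It records where each match starts and what text was actually matched.

```python
import re

# Hypothetical sentence containing both names.
text = "Isaac Asimov wrote fiction; Isaac Newton studied gravity."

# Positive lookahead: only the Isaac that is followed by " Asimov".
followed = [(m.start(), m.group()) for m in re.finditer(r"Isaac(?= Asimov)", text)]

# Negative lookahead: only the Isaac that is NOT followed by " Asimov".
not_followed = [(m.start(), m.group()) for m in re.finditer(r"Isaac(?! Asimov)", text)]

print(followed)      # [(0, 'Isaac')]
print(not_followed)  # [(28, 'Isaac')]
```

In both cases the matched text is just Isaac: the lookahead checks what comes next without including it in the match.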

Replacement

A pattern in parentheses forms a “group” that may be referenced in the replacement text. In the replacement text, a dollar sign followed by a digit is replaced by the text that the corresponding group matched. For example, $1 refers to the first group in the match.

In Visual Studio Code, press Ctrl+H or choose Edit > Replace to open the replace bar. Make sure regular expressions are turned on.

Expression                      Replacement    Effect
^([A-Z][a-z]+) ([A-Z][a-z]+)$   $2 $1          Swaps first and last names.
([\w.-]+)@[\w.-]+               $1@gmail.com   Changes every email address to a gmail.com address.
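
The same two replacements can be sketched in Python with re.sub (an assumption of this example, not part of the walkthrough). Note one difference in syntax: Python's re module writes group references in the replacement as \1 and \2, where the Visual Studio Code replace bar uses $1 and $2. The names and addresses below are invented.

```python
import re

# Hypothetical list of names, one per line.
names = "Fred Smith\nGeorge Jones"

# Swap first and last names; re.MULTILINE lets ^ and $ match on each line.
swapped = re.sub(r"^([A-Z][a-z]+) ([A-Z][a-z]+)$", r"\2 \1", names, flags=re.MULTILINE)

# Hypothetical email addresses.
emails = "fred@example.com, george@school.edu"

# Keep the part before the @ (group 1) and replace the domain.
moved = re.sub(r"([\w.-]+)@[\w.-]+", r"\1@gmail.com", emails)

print(swapped)  # Smith Fred
                # Jones George
print(moved)    # fred@gmail.com, george@gmail.com
```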