Skip to content

Latest commit

 

History

History
205 lines (150 loc) · 13.6 KB

regex_1.md

File metadata and controls

205 lines (150 loc) · 13.6 KB

Regular Expressions

Regular Expressions (called REs, regex, regexps, regex patterns) describe a search pattern in a text and are supported by multiple programming languages (Python, Java, Javascript, and many more) and serve as an integral part in several tasks. There are subtle differences between each implementation of Regex in the different programming languages but they are all based on the same basic principles.

They are useful especially for cases where the dataset is complicated and "not-clean", as they support many important tasks in string processing (searching, replacing, validating, parsing, etc.), as they focus on defining the rules of the searched (sub-)string. Depending on the task and on the search pattern, they could spare a lot of work for programmers but can consume a lot of computational power as well. Regex have some limitations and they do not enable the performance of all possible string processing tasks.

The search pattern defines some generalized way to describe a possible string:
The pattern 3 digits and one following capital letter would match both: 123A and 456B, but not 123a.
The pattern possibly some digit(s) and a final exclamation mark would match: 10000!, 2022!, and also ! (why? later!).

Theoretical Part

The origin of regular expressions finds its way to the world of theoretical computer science. When we talk about some languages, we can define regular languages as those who are built from some Alphabet that is closed under union, concatenation and the Kleene star operator (arbitrary repetition(s) of some string, including 0 repetitions, so an empty string), as well as intersection, complement, difference, and some more several operators. By the term closed under we indicate that every time we apply one of the operators, we still have a regular language.

Since a RE language is relatively small and restricted not all possible string processing tasks can be done using REs. There are also tasks that can be done with REs, but the expressions turn out to be very complicated.

There are different notations for regular expressions:

  • Algebraic notion
  • EBNF notation
  • Syntax diagrams
  • POSIX

We will focus on POSIX notations which are mostly the standard in the different technologies.

Define Patterns

We want to use the Regex as a tool to define some generalized pattern, so that it could match a possible set of strings that have some common elements.
Some common examples:

  • e-mail addresses
  • phone numbers
  • clean out time stamps or (special notationed) notes in transcriptions

We will firstly treat basic characters and later see some possibilities to combine them on a higher level to enable more complicated patterns. Notice the multiple ways to define the same pattern along this introduction. It is important to compose the regular expression in a way that is readable and understandable for another person.

Characters

Special sequence Matches Equivalent Sets
\d any decimal digit [0-9]
\D any non-digit character [^0-9]
\s any whitespace character [ \t\n\r\f\v]
\S any non-whitespace character [^ \t\n\r\f\v]
\w any alphanumeric character [a-zA-Z0-9_]
\W any non-alphanumeric character [^a-zA-Z0-9_]

Besides using the predefined special sequences, also the exact text (as well single characters) can be used.
For the characters, either you specify them individually or use ranges by giving a hyphen - inbetween. We will see some examples later.

Exercises

Example: The pattern \d\d\s[A-Z][a-z][a-z][a-z][a-z]\s\d\d\d\d matches the string 10 March 2022.
It can also be expressed with the pattern: [0-9][0-9]\s[A-Z][a-z][a-z][a-z][a-z]\s[0-9][0-9][0-9][0-9].
We can, of course, just write the string and search it in our sentence, but in this case we lose generalization.
Is there a compacter way to write this Regex? Can we make it even more general to include other months as well? Later!

Metacharacters

Metacharacter Name Meaning
[] brackets a set of characters
() parenthesis create a group
. period or dot any character besides new line
^ caret anchor operator: starts with or negation, exclude (inside the brackets)
$ dollar sign anchor operator: ends with, final pattern
\ backslash indicate sequences or escape metacharacters
| pipe or bar logical OR

Some remarks

  • About sets of characters [ ]
    • Remember that one set corresponds to one character in the expression we want to match, i.e. even if we put five characters between the brackets (like [abcde]) this still means one single character among the five listed
    • You can specify individual characters between the brackets [abcde] or ranges of characters by using the hyphen symbol, e.g. [A-Z] (all characters between A and Z, i.e. the whole Latin alphabet) or [0-9] (all numbers). Be aware that ranges follow the order of characters in the ASCII standard.
    • You can also specify several sequences inside a set, for example [A-Za-z] to have the whole Latin alphabet both in capital and small letters. (This is better than the range [A-z], because the latter would include characters like [ and ]: see the ASCII table and the previous bullet point).
    • If you want to use the hyphen as an actual character inside a set (and not to specify a range), you can add it at the beginning or end of a set, for example [-A-Z] or [abcd-] (or between two ranges, but this might complicate things).
  • About groups of characters ( )
    • Groups can be used for a variety of purposes. We will see some uses of groups later, but in the meantime let's say that groups can be used to apply repeating qualifiers (which we will see in the next section) to multiple characters
    • Also, they can be backreferenced. For example, if we want to find all words of four letters where the first two letters are the inverse of the last two, like noon and peep, we can use the pattern \b(\w)(\w)\2\1\b. The expression \2 will match the second group identified before, while the expression \1 the first group (depending on the regex flavor, you might have to use different characters to backreference groups, e.g. \\1 or $1 instead of \1). More information about backreferences will be given in the section Substitution in the next lesson.
    • Groups are also very useful when you have to specify alternatives using the character |. We will see an example of this in exercise d below.

Exercises

Repeating Qualifiers

Metacharacter Name Meaning
* asterisk or star arbitrary many occurences (including zero)
+ plus sign arbitrary many occurences (at least one)
? question mark at most one time (including zero)
{} curly brackets range or number of repetitions

Having characters in a set, it means that each of the character is connected to the other with an OR logical connector. [abc] has the meaning of a or b or c.
A group refers to whole the characters in their sequence. (abc) has the meaning abc. Therefore, this pattern does not match the string ab, whereas [ab][ab] (or [ab]+ does).

Concrete examples:

The metacharacter star *, does not match the literal character *, and its meaning is matching zero or more times. To match the literal character *, we use the backslash for escaping the metacharacter: \*.

String RE Match
hello [hello]* Yes
hello h[ello]* Yes
hello H[ello]* No
hello [Hell]*o Yes
star [e]* Yes
staar st[a]*r Yes

The repeating metacharacter plus + means matching one or more times. This requires at least one occurrence compared to asterisk *.

String RE Match
plus pl[au]+s Yes
pluus plu+s Yes
plusplus pluss+ No
plussusch plus[cs]+ Yes

The third single repeating qualifiers is the question mark ? which matches either once or zero times.

String RE Match
question qu?estion Yes
question qa?estion No
question qa? Yes
question qa?e No
question quest?tion Yes
markka rk?a No
mark ma?r Yes

The fourth repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n repetitions.

String RE Match
complicated compli{1,1}cated Yes
appreciated ap{2,2} Yes
rain r[ai]{2,2} Yes
rain r(ai){1,2} Yes
rain r(ai){2,2} No
complicated compli[act]{3,}ed Yes
complicated compli[act]{3}ed Yes
complicated compli(act){3,}ed No

If either m or n is omitted it becomes for e.g. {3,} three or more and {,3} up to three repetitions. {x} it means exactly x times. With this qualifier you can express all the single repeating qualifiers, e.g. ? as {0,1} + as {1,}, and * as {0,} but the single versions are both easier on the eye and shorter to write.

We now go back to our example from above, we want to form some regex to match a date that looks like: DAY MONTH YEAR where the day and the year appear in digits and the month in letters.
(\d){2}\s[A-Z][a-z]+\s(\d){4}
We have here a set of groups, we firstly have 2 digits followed by a space, one upper case letter, some arbitrary number of lower case letters, another space and 4 final digits.
With this regex the following strings would match:

  • 30 January 2021
  • 01 February 1990
  • 15 April 2000

But also:

  • 90 Fix 9999
  • 00 January 2000
  • 29 February 2022
  • 30 Ja 2021

We cannot cover all edge cases, however, we can definitely make the specification more percise while still maintaining generalization:
(0-3)\d\s[A-Z][a-z]{2, 8}\s(0-2)(\d){3}
What invalid strings that matched before do not match anymore? Can we improve it even more?

Some Remarks:

  • ? is equal to {0, 1}
  • [0123456789] is equal to [0-9]
  • [012] is equal to (0|1|2). (012) does not represent the same, as the pattern would look after exactly "123" in the string.
  • Note: for German characters (or some other languages with Latin characters) you need to redefine the set of letters [a-zA-Z] to include also the extra letters: [a-zA-ZäöüÄÖÜß]. Don't forget that some alphabets (e.g. Greek) share similar letters with Latin characters (e.g., Greek "omikron", etc.); these characters are perceived differently in the regex system, therefore, you need to include them when dealing with multilingual documents.
  • Regex for simple punctuation: [\.\,!\?]

Exercises

Outlook

In the next lecture we will review: anchors, regex-based functions and go over more examples and use cases.

Some links to practice:

Links to more sources: