Regex For Starters

“What we call chaos is just patterns we haven’t recognized. What we call random is just patterns we can’t decipher.”     

Chuck Palahniuk

Introduction

Ah, regex. The random-looking gibberish that is magically able to match the pattern you want in a string. That was my impression of regex when I first encountered it while debugging another programmer’s code.

Though it might look random, a regular expression is a very powerful tool for matching patterns in a string, with just a line of pattern syntax (yeah, just one line in my experience so far).

While it is very useful, not many programmers use regex for pattern matching. Like me before, many of them feel regex is too hard. But with a bit of effort, learning regex will improve your skills and let you accomplish a lot more.

Why regex?

There are many use cases for regex. If you are a web developer, you have probably created an input form at some point, for example a login form that consists of a username and a password.

In the past, there were quite a lot of SQL injection cases, in which attackers exploited such forms to obtain data illegally, or even wipe out the database completely! You can read more about SQL injections.

One of the ways to guard against SQL injection, alongside parameterised queries, is to run pattern matching on the input string using regex. Here is an example:

/\w*((\%27)|(\'))((\%6F)|o|(\%4F))((\%72)|r|(\%52))/ix 

See, just one line. But it can flag many malicious inputs and save you the pain of recovery, lost time, and mental breakdowns.
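To see that pattern in action, here is a minimal Python sketch. It assumes Python's re module, where the /i flag is written as re.IGNORECASE and the /x flag as re.VERBOSE. The function name looks_suspicious and the sample inputs are made up for illustration.

```python
import re

# The pattern from the text; /i becomes re.IGNORECASE and /x becomes re.VERBOSE.
SQLI_OR = re.compile(
    r"\w*((\%27)|(\'))((\%6F)|o|(\%4F))((\%72)|r|(\%52))",
    re.IGNORECASE | re.VERBOSE,
)


def looks_suspicious(value: str) -> bool:
    """Hypothetical helper: True if a quote (or %27) is directly followed by 'or'."""
    return SQLI_OR.search(value) is not None


print(looks_suspicious("1'or'1'='1"))     # True
print(looks_suspicious("x%27OR 1=1"))     # True
print(looks_suspicious("alice"))          # False
print(looks_suspicious("admin' or 1=1"))  # False
```

The last input slips through because there is a space between the quote and "or", so a check like this works best as one layer among several rather than the only line of defence.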

Some other use cases include:

  • Cleaning user input
  • Checking format, such as email or password. Remember the “Your password must include a number, special character, uppercase and lowercase” or something like that?
  • Replacing part of a string
  • Parsing various text files
  • Examining web server logs
  • Examining test results
  • Reading configuration files

And many more.

You reached this post because you want to overcome your fear of regex. Fear not, for regex is learnable.

In this post, we’ll cover some of the regex basics. We will learn some simple syntax and examples you can test yourself. If you would like to quickly test your regex patterns, visit https://regex101.com/

The Very Basics

For starters, the regex syntax used in this post starts with a slash (/), followed by the string you want to find in a text, and ends with a slash (/). See the example below:

/lives/

Here, I would like to search for the string “lives” in a text. The search is case-sensitive, and the pattern is found in the text below.

Someone who lives by coding.
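If you want to try this outside regex101, here is a small Python sketch of the same search. In Python, the pattern is passed as a string, so the surrounding slashes are dropped.

```python
import re

text = "Someone who lives by coding."

# re.search scans the text and returns the first match, or None.
match = re.search(r"lives", text)
print(match.group())  # lives
print(match.span())   # (12, 17)

# The search is case-sensitive, so "Lives" is not found.
print(re.search(r"Lives", text))  # None
```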

What? It’s that simple?

It is. Now I’m very sure you are getting more confident and would like to learn some more. Nice!

Next, for special characters such as [ \ ^ $ . | ? * + ( ), you have to add a backslash (\) in front of the character to escape it. For example:

/coding\?/

Now, don’t get confused. There are two parts to this regex: first the literal text coding, then the special character ?, escaped as \?. It matches the text below:

Someone who lives by coding?

That wasn’t hard, was it? So, in summary for the first part, you have learned the basic syntax of regex:

Basic Regex Syntax

  • /coding/ matches a string that contains coding
  • /\?/ matches a string that contains the special character ?
  • /coding\?/ matches a string that contains coding?
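The difference the backslash makes is easier to see side by side. A quick Python sketch, using the same sample text (re.escape is a standard library helper that adds the backslashes for you):

```python
import re

text = "Someone who lives by coding?"

# Escaped: \? means a literal question mark.
print(re.search(r"coding\?", text).group())  # coding?

# Unescaped: ? makes the preceding "g" optional, so the question mark
# itself is never part of the match.
print(re.search(r"coding?", text).group())   # coding

# re.escape adds the backslashes when the text to find comes from elsewhere.
print(re.escape("coding?"))                  # coding\?
```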

Flags

Flags are another important part of regex. They are added after the closing slash (/) and can be combined in any order.

There were six flags at the time this post was written. But for starters, these three are the most used: g for global, i for case-insensitive, and m for multi-line.

Global (g)

With the global flag, the search does not stop after the first match. For example, /wh/g matches every “wh” in the text below: the ones in whoever, whatever, and why, but not the capitalised “Wh” in Who.

Who whoever whatever why

Case-insensitive (i)

With the case-insensitive flag, the search returns the first match regardless of whether the text is upper-case or lower-case. /wh/i matches the “Wh” in Who below:

Who whoever whatever why

A combined flag, for example /wh/gi or /wh/ig (the order does not matter), matches all four: the “Wh” in Who and the “wh” in whoever, whatever, and why:

Who whoever whatever why
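Here is a rough Python equivalent of the g and i examples. One assumption to keep in mind: Python has no g flag, so in this sketch re.search plays the role of "first match only" and re.findall plays the role of /g.

```python
import re

text = "Who whoever whatever why"

# No flags: stop at the first case-sensitive match.
print(re.search(r"wh", text).span())                  # (4, 6)

# Like /wh/g: collect every match.
print(re.findall(r"wh", text))                        # ['wh', 'wh', 'wh']

# Like /wh/i: first match, ignoring case.
print(re.search(r"wh", text, re.IGNORECASE).group())  # Wh

# Like /wh/gi: every match, ignoring case.
print(re.findall(r"wh", text, re.IGNORECASE))         # ['Wh', 'wh', 'wh', 'wh']
```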

Multi-line (m)

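The multi-line flag changes how the anchors ^ and $ (covered in the Quantifiers section below) behave: with m, they match at the start and end of every line, not just at the start and end of the whole text. A pattern without anchors, such as /queen/m, behaves the same as /queen/ and finds queen in the text below, while /^black/m matches only because of the m flag: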

red king
black queen
blue jack
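The m flag only shows its effect together with the ^ and $ anchors, which mark the start and end of a string. A small Python sketch on the three-line text above, assuming re.MULTILINE as the counterpart of m:

```python
import re

text = "red king\nblack queen\nblue jack"

# Without the flag, ^ only matches at the very start of the text.
print(re.search(r"^black", text))                        # None

# With re.MULTILINE, ^ also matches at the start of each line.
print(re.search(r"^black", text, re.MULTILINE).group())  # black

# Same idea for $: the last word of every line.
print(re.findall(r"\w+$", text, re.MULTILINE))           # ['king', 'queen', 'jack']
```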

In summary, the three flags can be simplified as below:

Flags You Should Know

  • g – Don’t stop after the first match.
  • i – Case-insensitive match.
  • m – ^ and $ match at the start and end of each line.

Character Classes

I am very sure you still have no problem keeping up, which is why we will now cover character classes.

Character classes are notations which match any symbol from a defined set. For example, you might want to match the characters known as digits using \d.

Let’s start from character sets.

Character set

A character set is a way to match one character from a defined group in a single position. It matches any character in the string which exists in the group. For example:

/[bst]ack/

This expression will match back, sack, or tack, but will not match hack or jack.

A character set can also be negated by adding ^ right after the opening bracket. Using the same example but with negation:

/[^bst]ack/

This expression will match hack or jack, but will not match back, sack, or tack. It accepts exactly the inverse of the characters inside the brackets.

In summary, character sets work as defined below:

Character Set

  • [xyz] – matches a single character in the list xyz
  • [^xyz] – matches a single character not present in list xyz
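The two sets can be checked against the same word list. A quick Python sketch (re.fullmatch requires the whole word to match the pattern):

```python
import re

words = ["back", "sack", "tack", "hack", "jack"]

# [bst] accepts exactly one of b, s, or t in that position.
print([w for w in words if re.fullmatch(r"[bst]ack", w)])   # ['back', 'sack', 'tack']

# [^bst] accepts exactly one character that is NOT b, s, or t.
print([w for w in words if re.fullmatch(r"[^bst]ack", w)])  # ['hack', 'jack']

# A negated set still needs a character to match, so plain "ack" fails.
print(re.fullmatch(r"[^bst]ack", "ack"))                    # None
```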

Ranges

A range matches any one character within a range, in a single position.

In the example before, we used the /[bst]ack/ expression to match only back, sack, and tack, but not hack and jack. If we want to match hack and jack as well, we can write it as /[bhjst]ack/. It’s still valid.

What if we want to match everything from b to t? /[bcdefghijklmnopqrst]ack/ is not beautiful to me. Instead, we can write the expression as

/[b-t]ack/

There, much simpler.

We can also match digits in a range such as [0-9], or capital letters like [A-Z].

Ranges

  • [b-t] – Matches a single character in the range between b and t.
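A short Python sketch of ranges; the word list and the "seat" sentence are made-up sample data:

```python
import re

words = ["back", "hack", "jack", "tack", "wack", "Back"]

# [b-t] covers every lower-case letter from b to t, so w and B fall outside.
print([w for w in words if re.fullmatch(r"[b-t]ack", w)])
# ['back', 'hack', 'jack', 'tack']

# Two ranges side by side: one capital letter followed by one digit.
print(re.findall(r"[A-Z][0-9]", "seat A7, row b2, gate C4"))  # ['A7', 'C4']
```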

Meta-characters

Meta-characters are short notations with special meanings. They are abbreviated so that they are easy to remember once you get used to them.

Meta-characters You Should Know

  • \d – Matches any digit character, same as [0-9]
  • \D – Matches any non-digit character, same as [^0-9]
  • \w – Matches any word character (equal to [a-zA-Z0-9_])
  • \W – Matches any non-word character (equal to [^a-zA-Z0-9_])
  • \s – Matches any whitespace character (spaces, tabs, etc.)
  • \S – Matches any non-whitespace character
  • \t – Matches a tab character only.
  • . – Matches any single character except a newline.
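Each meta-character can be tried on a tiny input. In this Python sketch I pass re.ASCII, because by default Python lets \d, \w, and \s match Unicode characters too; with the flag they behave like the ASCII sets in the list above. The sample strings are made up.

```python
import re

# re.ASCII keeps \d, \w, and \s limited to the ASCII sets listed above.
A = re.ASCII

print(re.findall(r"\d", "unit_7 b4", A))  # ['7', '4']
print(re.findall(r"\D", "a1-b2", A))      # ['a', '-', 'b']
print(re.findall(r"\w", "a_1 !", A))      # ['a', '_', '1']
print(re.findall(r"\W", "a_1 !", A))      # [' ', '!']
print(re.findall(r"\s", "a b\tc", A))     # [' ', '\t']
print(re.findall(r"\S", "a b\tc", A))     # ['a', 'b', 'c']
print(re.findall(r"\t", "a b\tc"))        # ['\t']
print(re.findall(r".", "ab\ncd"))         # ['a', 'b', 'c', 'd']
```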

Quantifiers

OK, I promise you this is the last part.

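A quantifier specifies how many times the character or expression before it should be matched. The list below also includes the anchors ^ and $ and alternation (|), which are not strictly quantifiers but are often used alongside them. Here are some of them with examples: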

  • + – One or more
    • /ac+k/ – acknowledge, ack, bracket, accck
    • /\d+/ – 0, 12, 1345, airstrip1
  • * – Zero or more
    • /ac*k/ – acknowledge, bracket, break, accck
    • /\d*/ – 0, 12, 374, airstrip3, node (the match may be empty, so this pattern matches any text)
  • ? – Zero or one
    • /ac?k/ – acknowledge, bracket, break. No match for acck or acccck.
    • /\d?apple/ – apple, 9apple, 7878apple (in the last one, only 8apple is matched)
  • ^ – Matches the beginning of a string
    • /^act/ – act, acting of act. Note that for “acting of act”, only the first act counts as a match.
    • /^\d/ – 9dragons, 86lion, 7th son of 7th son
  • $ – Matches the end of a string
    • /ack$/ – back, back of sack. Note that for “back of sack”, only the last ack counts as a match.
    • /\d$/ – dragon9, 9 dragon9
  • {N} – Matches the preceding character exactly N times.
    • /wha{3}t/ – whaaat
    • /a{1}nd/ – and, aaaand (in the last one, only and is matched)
  • {N,} – Matches the preceding character N times or more.
    • /wha{3,}t/ – whaaat, whaaaaaat
    • /a{1,}nd/ – and, aaaaand
  • {N,M} – Matches the preceding character at least N times and at most M times. M must not be less than N (M≥N).
    • /wha{1,2}t/ – what, whaat
    • /a{2,3}nd/ – aand, aaaaand (in the last one, only aaand is matched)
  • Alternation X|Y – Matches either X or Y.
    • /(red|white) queen/ – red queen, white queen
    • Alternation with more than two options is also possible: /(red|white|black|blue) queen/ – red queen, white queen, blue queen, black queen
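Several of the examples in the list match only part of the word, which is easy to miss. The Python sketch below prints what each pattern actually matched; hit is a hypothetical helper written just for this illustration.

```python
import re


def hit(pattern, text):
    """Hypothetical helper: the matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None


print(hit(r"ac+k", "accck"))         # accck
print(hit(r"ac+k", "break"))         # None (+ needs at least one c)
print(hit(r"ac*k", "break"))         # ak   (* allows zero c)
print(hit(r"ac?k", "bracket"))       # ack
print(hit(r"ac?k", "acck"))          # None (? allows at most one c)
print(hit(r"wha{3}t", "whaaat"))     # whaaat
print(hit(r"wha{3}t", "whaat"))      # None
print(hit(r"a{2,3}nd", "aaaaand"))   # aaand
print(repr(hit(r"\d*", "node")))     # '' (an empty match still counts)
```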

Phew, that’s a lot. Finally, the quantifiers we should know are summarised below:

Quantifiers You Should Know

  • a+ – Matches one or more of the character a.
  • a* – Matches zero or more of the character a.
  • a? – Matches zero or one of the character a.
  • ^ack – Matches ack at the beginning of a string.
  • ack$ – Matches ack at the end of a string.
  • a{3} – Matches the character a exactly 3 times.
  • a{3,} – Matches the character a 3 times or more.
  • a{3,5} – Matches the character a between 3 and 5 times.
  • Alternation red|white – Matches either red or white. Add more alternatives with |.
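The anchors and alternation can be sketched in Python the same way. The last example combines ^, $, and a quantifier to check a whole string; the 4-digit PIN rule is an invented example, not something from a real system.

```python
import re

# ^ pins the match to the start: only the first "act" counts.
print([m.start() for m in re.finditer(r"^act", "acting of act")])  # [0]

# $ pins the match to the end: only the "ack" in "sack" counts.
print([m.start() for m in re.finditer(r"ack$", "back of sack")])   # [9]

# Alternation: either colour is accepted, blue is not in the list.
text = "red queen, white queen, blue queen"
print([m.group() for m in re.finditer(r"(red|white) queen", text)])
# ['red queen', 'white queen']

# Anchors on both sides plus a quantifier: the whole string must be 4 digits.
print(bool(re.search(r"^\d{4}$", "1234")))   # True
print(bool(re.search(r"^\d{4}$", "12345")))  # False
```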

Summary

Congratulations! You have reached the end of the post. In summary, we have covered these topics:

  • The basic syntax of regex
  • Flags
  • Character classes:
    • Character set
    • Ranges
    • Meta-characters
  • Quantifiers

I hope you have learned a lot from this post, or at least now know that regex is not impossible to learn. Now, all that’s left is to practice, because practice makes perfect.

If you’d like to learn more, visit https://www.regular-expressions.info/. It covers a lot about regex and should help you a lot. Also, https://regex101.com/ lets you test your own expressions to check whether they work.

See you on the next post, good luck on living your code!

“Creativeness is finding patterns where none exist.”

– Thomas M. Disch

By Ericko Yap

Just a guy who is obsessed with improving himself. Working as a programmer in a digital banking company. Currently programming himself in calisthenics, reading books, and maintaining a blog.
