[TUT] Basics: Using Regular Expressions
#1
Sooner or later you're going to use regular expressions (regexp or regex for short). I'd say it's one of many prerequisites for serious development. A regular expression can be described as follows (taken directly from Mozilla Developer Network a.k.a MDN):

MDN wrote: Regular expressions are patterns used to match character combinations in strings.

The description is pretty self-explanatory, but to be 100% sure that you understand, I'm going to give you an example so it's easier to explain how and why regular expressions work. Let's say you scraped a website using whatever language you like. You scraped it because you want to search the HTML code for specific links, such as images hosted on TinyPic. One way to do that is to use a regular expression. Let's try to make one! First off, you need to analyze what you want your regex to match. Here are some typical TinyPic links:

Code:
http://oi63.tinypic.com/fld66s.jpg
http://oi67.tinypic.com/flbp6v.jpg
http://oi63.tinypic.com/flbv4g.jpg
http://oi65.tinypic.com/9zrslt.jpg
http://i41.tinypic.com/23jrfr9.jpg

If we look at typical TinyPic links, we can see that they consist of 6 parts:

Code:
http:// + xxx + tinypic + com + xxxxxxx + xxx

NOTE: The x's represent any char/number/sign.

All the parts are pretty straightforward. The important thing to notice here, though, is that the second part (xxx), fifth part (xxxxxxx) and sixth part (xxx) aren't static, meaning that their values change. If we look at the second part first, we can see that it can contain numbers and letters and that it is either 3 or 4 chars/digits long. Remember this. Now let's look at the fifth part. Again, it can contain both letters and numbers, and it is always 6 or 7 chars/digits long. The last part is the file extension. We know it's always going to be 3 letters.

So now that we know all of this, we can begin to construct our regex. All literal regexes have to start with / and end with /. Literal means that the pattern is written directly in your code between the two slashes and doesn't change while the program runs. Regular expressions can also be created using a constructor function, which builds the regex from a string at runtime. A constructor is described as follows: a constructor (abbreviation: ctor) in a class is a special type of subroutine called to create an object. It prepares the new object for use, often accepting arguments that the constructor uses to set required member variables (taken from Wikipedia). The difference between the literal and the one created with a constructor function can be read here (MDN).

Let's start with making the regex match the first part, which is the http://. To accomplish this, we can simply do the following:

Code:
/(http|https):\/\//

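Woah, hold on a second. What's the | for? If you know how to program, which I assume you do since learning about regular expressions wouldn't make much sense otherwise, then you can probably guess: it means OR. In most programming languages OR is written with two pipes, so if a regex followed the same convention, the above would look like this (it doesn't; this is only for comparison, and in a real regex the second pipe would add an empty alternative):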

Code:
/(http||https):\/\//

The first version, with a single |, basically means "match http or https". We are going to assume that TinyPic has secure hosting. It doesn't hurt to include support for it. The reason why you have to escape the two forward slashes (//) is that otherwise the regex thinks it should end right after the colon. Remember that it's using forward slashes to indicate the beginning and end of the regex, so not escaping any forward slashes IN the regex will throw an error and the regex will become useless.
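
To make the alternation and the escaped slashes concrete, here is a small sketch in JavaScript (the language MDN describes). The variable names are made up for illustration, and the constructor form is only shown for comparison:

```javascript
// Literal form: the two forward slashes inside the pattern are escaped as \/
const scheme = /(http|https):\/\//;

// Constructor form: the pattern is a plain string, so / needs no escaping here
const schemeFromCtor = new RegExp("(http|https)://");

console.log(scheme.test("http://oi63.tinypic.com/fld66s.jpg"));  // true
console.log(scheme.test("https://oi63.tinypic.com/fld66s.jpg")); // true
console.log(scheme.test("ftp://oi63.tinypic.com/fld66s.jpg"));   // false
console.log(schemeFromCtor.test("https://oi63.tinypic.com/fld66s.jpg")); // true
```

Both forms describe the same pattern; the literal is usually easier to read when the pattern is fixed.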

NOTE: Regular expressions are by default case-sensitive. You can change this by adding a modifier/flag at the end of the regex, after the last /. In this case, the modifier you want to add to make it case-insensitive would simply be i <- small I.
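
A quick sketch of what the i flag changes, assuming JavaScript and an upper-case link invented for the test:

```javascript
const caseSensitive = /(http|https):\/\//;
const caseInsensitive = /(http|https):\/\//i; // i flag after the closing slash

const shouted = "HTTP://oi63.tinypic.com/fld66s.jpg"; // made-up upper-case input

console.log(caseSensitive.test(shouted));   // false
console.log(caseInsensitive.test(shouted)); // true
```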

So now we have support for both http:// and https://. But wait, TinyPic links that don't include http:// or https:// are still valid, right? Yes indeed they are. With our current regex, it wouldn't match links without http:// or https://. What can we do? You can make the capturing group, (http|https), and :\/\/ optional. Start by including the :\/\/ in the capturing group like so:

Code:
/((http|https):\/\/)/

After you've enclosed the whole thing in parentheses, it's time to add in the "optional" part. You can do so by adding ?: right after the first (opening) parenthesis and ? right after the last (closing) parenthesis like so:

Code:
/(?:(http|https):\/\/)?/

This is basically interpreted as "this part is NOT necessary, but if it is found, it should not be disregarded". The ?: turns the outer group into a non-capturing group, meaning it only groups things together without remembering what it matched, and the ? after the closing parenthesis makes the whole group optional. Now TinyPic links with and without http:// or https:// will be matched. Now that's out of the way, let's add support for the second part. Remember that the values in the second part were NOT static? This means you can't do it like we've done so far by simply adding whatever we want to be matched. We need to somehow make the regex tell whatever language you are using that there can be anything between [a-z], [A-Z] and [0-9]. It just so happens that there exists a token which represents [a-z], [A-Z], [0-9] and an underscore, _. This token is called "\w". Let's try to add that to our regex:

Code:
/(?:(http|https):\/\/)?(\w)/
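
As a rough illustration of the optional scheme followed by \w, here is a JavaScript sketch. The variable names are hypothetical, and the pattern is the one with the trailing ? on the outer group:

```javascript
// Outer group: non-capturing (?:...) and optional (trailing ?).
// Inner groups: (http|https) is capture 1, (\w) is capture 2.
const optionalScheme = /(?:(http|https):\/\/)?(\w)/;

const withScheme = optionalScheme.exec("http://i41.tinypic.com/23jrfr9.jpg");
console.log(withScheme[0]); // "http://i"
console.log(withScheme[1]); // "http"
console.log(withScheme[2]); // "i"

const withoutScheme = optionalScheme.exec("i41.tinypic.com/23jrfr9.jpg");
console.log(withoutScheme[0]); // "i"
console.log(withoutScheme[1]); // undefined (the optional group did not take part)
console.log(withoutScheme[2]); // "i"
```

Note that \w on its own matches exactly one character, which is why only the first character after the scheme shows up.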

Parentheses used in a regular expression not only group elements of that expression together, but also designate any matches found for that group as tokens. You can use tokens to match other parts of the same text. One advantage of using tokens is that they remember what they matched, so you can recall and reuse matched text in the process of searching or replacing (taken from MathWorks). The ability to reuse previously matched text can be quite useful although we are not going to use that in our example as it's not necessary.
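
Even though the example doesn't need it, reusing matched text can be sketched briefly. This assumes JavaScript, where \1 refers back to the first group inside the pattern and $1, $2 refer to groups in a replacement string; the inputs are just pieces of the sample links:

```javascript
// \1 means "whatever group 1 matched", so this finds a doubled character.
const doubled = /(\w)\1/;
console.log(doubled.exec("fld66s")[0]); // "66"
console.log(doubled.test("flbp6v"));    // false

// $1 and $2 reuse the captured text when replacing.
const swapped = "fld66s.jpg".replace(/(\w+)\.(\w+)/, "$2:$1");
console.log(swapped); // "jpg:fld66s"
```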

You don't have to use the token. You could also have done it by writing the ranges out yourself in a single character class:

Code:
/(?:(http|https):\/\/)?([a-zA-Z0-9_])/

NOTE: It's best to enclose your capturing groups in parentheses. Otherwise the regex can be interpreted differently than you'd expect.
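
A small JavaScript sketch comparing \w with a hand-written character class. The ^ and $ anchors and the sample characters are my own additions so that each test looks at exactly one character. It also shows a common pitfall: several bracketed classes written back to back match a sequence of characters, not "any one of them":

```javascript
const shorthand = /^\w$/;
const explicit = /^[a-zA-Z0-9_]$/;

// Both patterns agree on every sample character.
const samples = ["a", "Z", "7", "_", "-", "."];
const agree = samples.every((ch) => shorthand.test(ch) === explicit.test(ch));
console.log(agree); // true

// Pitfall: this requires four characters in this exact order.
const sequence = /^[a-z][A-Z][0-9]_$/;
console.log(sequence.test("aZ7_")); // true
console.log(sequence.test("a"));    // false
```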

As you can see, this looks pretty clumsy and if you need a complex regex, this will make your regex nearly unreadable. You should generally avoid doing this unless the situation absolutely requires it. Ok great, so now we're done with that, right? False. You see, there is room for false positive detections now. We have added support for changing values, however, we have NOT specified how many chars/digits there are, which means that you can have a link that looks like this:

Code:
http://oi67jhsd9fe9s76fs87eyiusdhfk.tinypic.com/27he8uk.jpg

And it would still be totally fine. It would match. This is obviously an invalid link, so we can't have it match this. There's a thing we can do to fix this. We can add a quantifier. A quantifier allows you to specify the number of occurrences to match against. For instance, if we were to add support for the second part in the TinyPic links, we would do the following:

Code:
/(?:(http|https):\/\/)?(\w{3,4})/

The {3,4} means that the previous token (\w) will be matched 3 to 4 times, as many times as possible. It basically means that if the second part of a TinyPic link doesn't have 3 or 4 chars/digits, then it won't match. That's it. Now we've added support for the second part. Let's move on to the third and fourth part. They should be pretty easy and you should be able to add them by yourself by now, so I won't be explaining anything. I'll skip it and go on to the fifth part, which is the part with 6 or 7 non-static chars/digits.
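
The effect of the quantifier only becomes visible once something has to follow the second part. The sketch below (JavaScript) therefore appends \.tinypic\.com, which is my assumption of how the third and fourth parts could look, not necessarily how you would write them:

```javascript
// Assumed continuation: the host name right after the 3 to 4 word characters.
const subdomain = /(http|https):\/\/(\w{3,4})\.tinypic\.com/;

console.log(subdomain.test("http://oi63.tinypic.com/fld66s.jpg")); // true
console.log(subdomain.test("http://i41.tinypic.com/23jrfr9.jpg")); // true

// Far more than 4 characters before .tinypic.com, so this no longer matches.
const tooLong = "http://oi67jhsd9fe9s76fs87eyiusdhfk.tinypic.com/27he8uk.jpg";
console.log(subdomain.test(tooLong)); // false
```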

NOTE: Writing just "." would cause issues in your regex. "." in regular expressions means "any character" and that's not what you want in this example. To fix this, do the same thing you did with the two forward slashes. Escape the character i.e. "." -> "\.".
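
To see why the unescaped dot is a problem, compare the two patterns in this short JavaScript sketch (the input "tinypicXcom" is made up):

```javascript
const unescaped = /tinypic.com/;  // . matches any character
const escaped = /tinypic\.com/;   // \. matches only a literal dot

console.log(unescaped.test("tinypicXcom")); // true, which is a false positive
console.log(escaped.test("tinypicXcom"));   // false
console.log(escaped.test("tinypic.com"));   // true
```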

The fifth part is actually the same as the second part. You add a token and then a quantifier. You've hopefully understood everything so far and should be able to do this all by yourself, but for the sake of 100% understanding, here's what it should look like:
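
The code block that belonged here did not survive in this copy of the thread. Going only by the steps described so far (optional scheme, 3 to 4 chars, tinypic, com, 6 to 7 chars, a 3 letter extension, plus the g and i flags mentioned further down), a reconstruction might look like the JavaScript sketch below. Treat it as my assumption, not as the author's exact regex:

```javascript
// Hypothetical reconstruction based on the parts described in the thread.
const tinypic = /(?:(http|https):\/\/)?(\w{3,4})\.tinypic\.com\/(\w{6,7})\.(\w{3})/gi;

const links = [
  "http://oi63.tinypic.com/fld66s.jpg",
  "http://oi67.tinypic.com/flbp6v.jpg",
  "http://oi63.tinypic.com/flbv4g.jpg",
  "http://oi65.tinypic.com/9zrslt.jpg",
  "http://i41.tinypic.com/23jrfr9.jpg",
];

// With the g flag, String.prototype.match returns every full match.
const found = links.join("\n").match(tinypic);
console.log(found.length); // 5

// Caveat: the pattern is not anchored, so the tail of an invalid link still matches.
const invalid = "http://oi67jhsd9fe9s76fs87eyiusdhfk.tinypic.com/27he8uk.jpg";
const partial = invalid.match(tinypic);
console.log(partial[0]); // "dhfk.tinypic.com/27he8uk.jpg"

// Assumption: a word boundary (\b) before the second part rejects that case.
const stricter = /(?:(http|https):\/\/)?\b(\w{3,4})\.tinypic\.com\/(\w{6,7})\.(\w{3})/gi;
console.log(invalid.match(stricter));                 // null
console.log(links.join("\n").match(stricter).length); // 5
```

Whether you want the stricter variant depends on your input; when scanning scraped HTML, a partial match inside a longer host name is usually not what you are after.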


The sixth part is where the file extension is. As you can see above, I have added \w{3} (by now you should know what that means). However, this could raise an issue. Although it is unlikely, the user could add an extra character or change the extension completely, by accident or on purpose. What do you do then? Your program won't work as intended because the regex doesn't match the last part. You should never trust the user. You could do the same thing you did in the first part, which is to use the OR (|) sign and include every file extension you need, such as .jpg, .png, .gif and so on. I just stupidly choose to trust the user on this one though. ;)
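
If you'd rather not trust the user, the whitelist idea could be sketched like this in JavaScript. The $ anchor, the i flag and the file names are my own assumptions for the demonstration:

```javascript
const anyExtension = /\.(\w{3})$/;           // accepts any 3 word characters
const knownExtension = /\.(jpg|png|gif)$/i;  // accepts only the listed extensions

console.log(anyExtension.test("fld66s.exe"));    // true
console.log(knownExtension.test("fld66s.exe"));  // false
console.log(knownExtension.test("fld66s.JPG"));  // true
console.log(knownExtension.test("fld66s.jpeg")); // false
```

The list has to be kept up to date by hand, which is the price for the stricter check.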

With this out of the way, we can start learning about the last thing. Do you see the part I added after the last /? Do you remember what they're called? They're called modifiers or flags. Flags/modifiers change the way your regex behaves. For instance, the "i" modifier makes the regex case-insensitive (as described before) while the "g" modifier makes the regex global meaning it doesn't stop after the first match. The "g" modifier is often needed.
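
The difference the g flag makes can be shown with a short JavaScript sketch. The text and the shortened pattern are made up for the demonstration:

```javascript
const text = "http://oi63.tinypic.com/fld66s.jpg and http://i41.tinypic.com/23jrfr9.jpg";

// Without g: match() stops at the first match (and includes capture details).
const first = text.match(/\w{3,4}\.tinypic\.com/);
console.log(first[0]); // "oi63.tinypic.com"

// With g: match() returns every match in the string.
const all = text.match(/\w{3,4}\.tinypic\.com/g);
console.log(all); // [ 'oi63.tinypic.com', 'i41.tinypic.com' ]
```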

This is the end of the example. If you want to practice creating regexes or check if your regex works, you can use regex101. In my opinion, regex101 is an amazing resource that you must have in your library if you want to be productive. It helps you a lot with creating regexes: it explains everything in your regex, and you can see all tokens and quantifiers, what they do and how they work. It also has a regex formatter, a regex debugger and much more. They also have an IRC channel where you can ask for help, and a library with pre-made regexes submitted by their users for others to use. It's a wonderful resource, and if you want to learn how to use and create regexes, you can easily do so by experimenting on regex101. You can also share your regexes with others! Here's the one we've created (with the TinyPic links in the beginning as the test strings and with some invalid links included).

When you have practiced creating regexes for a while, you'll realize that they're in fact not as complex as one might think. At least not most of the time. You will probably end up with very grotesque looking regexes once in a while like these two (YouTube and SoundCloud respectively):

Code:
/^https?:\/\/(?:www\.youtube\.com\/watch[\?&#\w\d=]*[\?&]v=|youtu\.be\/)([a-zA-Z0-9-_]+)[\?&#\w\d=]*$/gi

/(http|https):\/\/(?:www\.)?soundcloud\.com\/((?:[^\W_]|-)){3,255}\/(sets|((?:[^\W_]|-){3,255}))(\/((?:[^\W_]|-){3,255})|(\?in=((?:[^\W_]|-){3,255})\/((?:[^\W_]|-){3,255})\/((?:[^\W_]{3,255}|-))|))/g

But if you break them up in bits or use regex101, it won't take too long to understand them nor will it be too hard. If you have any questions or concerns then feel free to PM me or post here.
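
One way to "break them up in bits" is to give each piece a name and glue the pieces back together. The sketch below does that for the YouTube regex in JavaScript; the piece names and the video ID "abc123XYZ_0" are invented, and I left out the g flag so that exec() always starts from the beginning:

```javascript
// Each piece is a small regex; .source gives back its pattern text.
const scheme = /^https?:\/\//;
const watchPage = /www\.youtube\.com\/watch[\?&#\w\d=]*[\?&]v=/;
const shortLink = /youtu\.be\//;
const videoId = /([a-zA-Z0-9-_]+)/;
const trailing = /[\?&#\w\d=]*$/;

const youtube = new RegExp(
  scheme.source +
    "(?:" + watchPage.source + "|" + shortLink.source + ")" +
    videoId.source +
    trailing.source,
  "i"
);

console.log(youtube.exec("https://www.youtube.com/watch?v=abc123XYZ_0")[1]); // "abc123XYZ_0"
console.log(youtube.exec("https://youtu.be/abc123XYZ_0")[1]);                // "abc123XYZ_0"
console.log(youtube.exec("https://example.com/watch?v=abc123XYZ_0"));        // null
```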

Enjoy making regular expressions.

05/08/16 - 15:08: Forgot to add a line at the end.
05/08/16 - 17:26: Grammar and spelling corrections.
05/08/16 - 18:14: Thread formatting.
#2
I hate/love/hate regexes. As you've mentioned, you can get some wickedly nasty looking ones once you start selecting for a very specific item. They look like gibberish, but they're such a key element in solid development, and I've seen really nasty code by those who didn't understand regexes but were trying to achieve the same functionality. I've even written some of that code before, so I can't stress enough to any readers that they need to take the time and learn how these work and start working with a few basic ones. Regarding the tutorial itself, this is well-written with a critical subject matter, an excellent flow, and you've linked to excellent resources. This will be of an immense use to new developers.

Well done.
#3
Jurij, what a good post filled with interesting and knowledgeable content, thanks for the share!
#4
Regex is the worst thing in the world to learn and to understand, but the usage of it is AMAZING and I love that. I will check this every time I'm about to do regex since I always forget it.

Nice Job!

