Tuesday, May 1, 2012

Veruca Salt – Understanding REGEX (in 5 minutes or less)

Hack the Gibson

I want it, and I want it now! In short, Regex makes sure that we get what we want. If I want letters, I use Regex to make sure that’s what I get. There are only a couple of basic principles you need to understand in order to start writing Regex.

The brackets:
There are two types of brackets to start with in Regex: the square brackets [ ] and the curly brackets { }. The square brackets define which characters we will accept at a single position. For example, [a-d] means that we are expecting one character that exists in the alphabet between a and d. Likewise, [a-x] means a through x, and [a-zA-Z] means any lowercase character a through z or any uppercase character A through Z.
Numbers work through the same concept: [0-9] means any single digit from 0 through 9 (to include 0 and 9).

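Thus, if every character of the input must come from [a-g0-4], which is written ^[a-g0-4]+$ (the + means “one or more”; the caret and dollar are explained further down), then -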
abc123 would evaluate true
zyx123 would evaluate false
ahy123 would evaluate false
abc876 would evaluate false
123 would evaluate true
abc would evaluate true
ab12 would evaluate true
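
To check the list above for yourself, here is a minimal sketch in Python (my choice for illustration; these basics should behave the same way in .NET). The helper name only_allowed is made up. It uses re.fullmatch so the whole input has to match, and + so that one or more characters from the set are accepted.

```python
import re

# Hypothetical helper: True when every character of `text` is in a-g or 0-4.
# re.fullmatch requires the entire string to match; + means "one or more".
def only_allowed(text):
    return re.fullmatch(r"[a-g0-4]+", text) is not None

for sample in ["abc123", "zyx123", "ahy123", "abc876", "123", "abc", "ab12"]:
    print(sample, only_allowed(sample))
```

Note that a bare [a-g0-4] with nothing around it only asks for one such character somewhere in the input, so it would also evaluate true for zyx123 (the 1 is enough).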

Now for the curly brackets. The curly brackets give us counts, and they apply to whatever comes right before them. For instance, {2} means exactly 2 characters, and {1024} means exactly 1024 characters (whew, lotsa typing). Since 1024 is a lot of counting while you are typing, you can also specify a range: {1,3} means you will accept at most 3 characters, but the minimum you will evaluate true is 1. If you only care about the minimum, leave the maximum off: {2,} means 2 or more.
Putting it together -
[a-z]{1,10} means I will accept up to 10 characters that exist between a and z, but I must have at least 1 in order to evaluate true.
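
A small Python sketch of how the counts behave, assuming the whole input has to match (that is what re.fullmatch does). The helper name whole_match is made up for this example.

```python
import re

# Hypothetical helper: True when `pattern` matches the entire `text`.
def whole_match(pattern, text):
    return re.fullmatch(pattern, text) is not None

print(whole_match(r"[a-z]{2}", "ab"))       # True: exactly two letters
print(whole_match(r"[a-z]{2}", "abc"))      # False: three is one too many
print(whole_match(r"[a-z]{1,3}", "abc"))    # True: between 1 and 3 letters
print(whole_match(r"[a-z]{1,3}", "abcd"))   # False: more than 3 letters
print(whole_match(r"[a-z]{1,10}", ""))      # False: at least 1 is required
print(whole_match(r"[a-z]{2,}", "abcdef"))  # True: {2,} means 2 or more
```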
Simple Example:
[0-9]{3}-[0-9]{2}-[0-9]{4}
What number do we expect here? Of course, it is a social security number. We must have 3 digits between 0 and 9, a dash, 2 digits between 0 and 9, another dash, and finally 4 digits between 0 and 9. Thus, we expect something like 123-45-6789.
If we get 12-34-5678 it will evaluate false, because the first group only has 2 digits.
12a-45-6789 will also evaluate false, because a is not a digit.
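
The same check as a short Python sketch. The names SSN_PATTERN and looks_like_ssn are made up, and this only checks the shape of the input, not whether the number is a real one.

```python
import re

# Three digits, a dash, two digits, a dash, four digits.
SSN_PATTERN = re.compile(r"[0-9]{3}-[0-9]{2}-[0-9]{4}")

# Hypothetical helper: fullmatch means nothing extra may come before or after.
def looks_like_ssn(text):
    return SSN_PATTERN.fullmatch(text) is not None

print(looks_like_ssn("123-45-6789"))  # True
print(looks_like_ssn("12-34-5678"))   # False
print(looks_like_ssn("12a-45-6789"))  # False
```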

The caret and dollar:
What the crap? The caret (^) at the beginning of the expression anchors the pattern to the start of the input. The dollar ($) anchors it to the end. Thus, ^[0-9]{3}$ only evaluates true when the entire input is exactly three digits. Without them, [0-9]{3} would also evaluate true for abc12345, because three digits in a row show up somewhere inside it.
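
To see what the caret and dollar buy us, here is a minimal Python sketch comparing the same pattern with and without them. It uses re.search, which looks for a match anywhere in the input; the variable names are made up.

```python
import re

unanchored = re.compile(r"[0-9]{3}")   # three digits anywhere in the input
anchored = re.compile(r"^[0-9]{3}$")   # the input must be exactly three digits

print(unanchored.search("abc12345") is not None)  # True: finds 123 inside
print(anchored.search("abc12345") is not None)    # False: extra characters
print(anchored.search("123") is not None)         # True
print(anchored.search("1234") is not None)        # False: one digit too many
```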

The OR statement:
An OR statement in Regex is represented by the pipe “|” symbol. The alternatives go inside parentheses ( ), not square brackets, because square brackets only ever match a single character. So if we were validating a true/false value it would look like:
^(true|false)$
If we wanted to make it even more complicated we could accept proper case as well:
^(true|false|True|False)$
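
Here is a quick Python sketch (variable names made up) comparing alternatives grouped in parentheses with the same words placed inside square brackets. Square brackets always match one character from a set, so only the parenthesized form matches whole words.

```python
import re

grouped = re.compile(r"^(true|false)$")                # the word true OR false
proper_case = re.compile(r"^(true|false|True|False)$")
char_class = re.compile(r"^[true|false]$")             # ONE char from t r u e | f a l s

print(grouped.search("true") is not None)      # True
print(grouped.search("True") is not None)      # False: capital T not listed
print(proper_case.search("True") is not None)  # True
print(char_class.search("true") is not None)   # False: four characters, not one
print(char_class.search("|") is not None)      # True: the pipe is just a set member
```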

Using It All:

So let’s get a web address and make sure it is a properly formatted address:

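^http://[a-z]+[.][a-z]+[.](com|net|org)$
This will only evaluate true if the input starts with “http://”, followed by one or more characters a-z (the + after a set means “one or more”), a period, one or more characters a-z, another period, and ends in com, net or org. The “http://” is written out as-is, since square brackets would only match one of its characters, and the period sits in square brackets because a bare . means “any character” in Regex. Notice we do not allow numbers at all in this evaluation. If we wanted to allow numbers, it would look like:
^http://[a-z0-9]+[.][a-z0-9]+[.](com|net|org)$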
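
A minimal Python sketch of the web address check, under the assumption that the literal text http:// is written out directly and that a + (“one or more”) follows each character set so every part of the name can be longer than one character. The variable names and sample addresses are made up.

```python
import re

LETTERS_ONLY = re.compile(r"^http://[a-z]+[.][a-z]+[.](com|net|org)$")
WITH_DIGITS = re.compile(r"^http://[a-z0-9]+[.][a-z0-9]+[.](com|net|org)$")

samples = [
    "http://www.example.com",    # letters only, ends in com
    "http://www.example2.com",   # contains a digit
    "https://www.example.com",   # https is not http://
    "http://example.com",        # only one period
]

for address in samples:
    print(address,
          LETTERS_ONLY.search(address) is not None,
          WITH_DIGITS.search(address) is not None)
```

This is deliberately strict: it expects exactly two periods and lowercase only, which matches the description above but will turn away plenty of real addresses.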

This was a very basic overview of Regex.  We’ll go deeper in later articles.

Happy .Netting…
Hack the Gibson
