
Saturday, July 14, 2012

Regular Expressions in JavaScript


Note: This regular expressions reference guide was first compiled from resources on the net for a presentation at work slightly over two years ago (May 5, 2010). I am just now posting it here, partly because I finally got around to adding proper regular expression highlighting to the JavaScript highlighter used on this blog. :)

Pattern Flags

Three pattern flags may be specified with regular expressions: g, i, and m.
  1. g = global, performs a global search
  2. i = ignore case, is case insensitive
  3. m = multiline, makes ^ and $ match at the beginning and end of each line, not just at the beginning and end of the whole string
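As a quick illustration of the three flags, here is a minimal sketch. The sample strings and variable names are made up for this example.

```javascript
// g: find every match, not just the first one
var allAin = 'No pain no gain'.match(/ain/g);
// allAin is ['ain', 'ain']

// i: ignore case
var caseInsensitive = /the/i.test('THE END');
// caseInsensitive is true

// m: ^ and $ also match at line breaks inside the string
var lines = 'first\nsecond';
var withoutM = /^second$/.test(lines);
// withoutM is false
var withM = /^second$/m.test(lines);
// withM is true
```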

Using Literal Notation

JavaScript regular expressions are commonly created using literal notation (via opening and closing forward slashes / /). Pattern flags can be specified after the second slash.
// match all 7-digit numbers globally
var phonenumber = /\d{7}/g;
Note: Normally you will want to use the literal notation if you know the pattern in advance, as it results in cleaner syntax and a slight performance gain. (A literal is compiled when the script is loaded; a pattern passed to the RegExp constructor is compiled at run time.)

Native RegExp Constructor:

Native JavaScript contains its own RegExp constructor, useful for building dynamic regular expressions when the pattern is not known ahead of time.

There are three main things to remember when using the RegExp object:
  1. The string portion of the pattern goes inside quotation marks.
  2. Because the pattern is a string, every backslash in it must itself be escaped with a backslash (\d becomes '\\d').
  3. An optional second parameter allows pattern flags to be passed.
For example:
// a simple digit-only RegEx with a global flag
var myRegExp = new RegExp('\\d', 'g');
// with a variable (can be concatenated with strings for complex expressions)
var myRegExp = new RegExp(someVar, 'g');
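When the variable holds arbitrary text, any special characters in it need escaping before it is concatenated into a pattern. The sketch below shows one way to do that; escapeRegExp is a hypothetical helper written for this example, not a built-in, and the sample strings are made up.

```javascript
// Hypothetical helper: backslash-escape every character that has a
// special meaning in a pattern, so the text is matched literally.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var userInput = '1+1';
// Note the doubled backslash in '\\b': the pattern is a string.
var dynamicRx = new RegExp('\\b' + escapeRegExp(userInput) + '=', 'g');

var found = 'Is 1+1=2? Yes, 1+1=2.'.match(dynamicRx);
// found is ['1+1=', '1+1=']
```

Without the escaping step, the + in the input would be read as a quantifier and the pattern would look for "11=" instead.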

Regular Expression Methods

test()
Format: RegExp.test(string);

Simplest, least costly method. Returns boolean true or false.
console.log(/vanna/i.test('Vanna'));
// returns true because "i" is specified
console.log(/vanna/.test('Vanna'));
// returns false
exec()
Format: RegExp.exec(string);

Similar to match(), except that the parameter is the string, not the regular expression. Returns array of matches, or null if no match is found. Note that the 0-item index will always be the full pattern match.
var match = /s(amp)le/i.exec('Sample text');
// returns ['Sample', 'amp']
As with test(), the regular expression is the refinement, the string the parameter.
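When the pattern carries the g flag, exec() can be called repeatedly to walk through every match together with its captured groups. A minimal sketch, with variable names and sample text chosen for this example:

```javascript
var wordRx = /(\w)(\w*)/g;
var text = 'red green blue';
var initials = [];
var positions = [];
var result;

// With the g flag, each exec() call resumes from wordRx.lastIndex
// and returns null once there are no more matches.
while ((result = wordRx.exec(text)) !== null) {
    initials.push(result[1]);      // first captured group
    positions.push(result.index);  // where this match starts
}
// initials is ['r', 'g', 'b']
// positions is [0, 4, 10]
```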
match()
Format: string.match(RegExp);

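Similar to exec(), except that the refinement and parameter are reversed: the string is the refinement, the regular expression the parameter. Without the g flag it returns the same array as exec(), with the 0-item being the full match. With the g flag it returns an array of every full match and omits the captured groups. Returns null if no match is found.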
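The return value of match() depends on whether the g flag is present. A short sketch with a made-up sentence:

```javascript
var sentence = 'Sample text, simple test';

// Without g: full match first, then the captured groups
var first = sentence.match(/s(.)mple/i);
// first[0] is 'Sample', first[1] is 'a'

// With g: every full match, but no captured groups
var all = sentence.match(/s(.)mple/gi);
// all is ['Sample', 'simple']

// No match returns null either way
var none = sentence.match(/xyz/);
// none is null
```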
search()
Format: string.search(RegExp);

Returns -1 if not found or index of match.

Note: Does NOT support global searches; the g pattern flag is ignored and only the index of the first match is returned.
'Amy and George were married'.search(/george/i); // returns 8
split()
Format: string.split(RegExp);

Converts a string into an array by splitting it on a literal or regex delimiter and putting the chunks in the array. Does NOT return the delimiter(s).
// returns the array ["1", "2", "3", "4", "5"] because
// the regular expression factors in discrepancy in spacing
var oldString = '1,2, 3,   4,    5';
var newString = oldString.split(/\s*,\s*/);
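Two further details of split() are worth a quick sketch: an optional second argument limits the number of pieces, and a capturing group in the delimiter pattern causes the captured text to be included in the result. The sample strings here are made up.

```javascript
var csv = '1,2, 3,   4,    5';

// A limit caps how many pieces are returned
var firstTwo = csv.split(/\s*,\s*/, 2);
// firstTwo is ['1', '2']

// A capturing group in the delimiter keeps the captured text
var withOperators = '3+4-5'.split(/([+-])/);
// withOperators is ['3', '+', '4', '-', '5']
```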
replace()
Format: string.replace(searchFor, replaceWith);

This particular method has a lot of flexibility. We can use string literals for the search and replace values:
'My car is hot'.replace('car', 'girl');
// returns "My girl is hot"
We can use regular expressions and refer back to the captured values with $1 through $9:
// Capture each of the two words, then swap them
// around the space that separates them.
var reorderName = 'Mary Jane'.replace(/(\w+) (\w+)/, "$2, $1");
// Returns "Jane, Mary"
But what if you need to replace multiple characters, not just re-order or replace single values? replace() also supports anonymous functions:
// Escape the characters that have special meaning in HTML.
// The callback receives each match and returns its replacement.
var testStr = 'He wrote, "2 < 3 is a true statement" on the board.';
// match any of these characters
var myRx = /[><"'&]/g;
var escapedString = testStr.replace(myRx, function(match) {
    switch (match) {
        case '<':
            return '&lt;';
        case '>':
            return '&gt;';
        case '"':
            return '&quot;';
        case "'":
            return '&#039;';
        case '&':
            return '&amp;';
    }
});
Or, we can simply pass in an existing function. replace() calls it with the matched text as its first argument (followed by any captured groups, the match offset, and the whole string). Note that we do not add the invocation () to the function name or pass it any variables: the replace() method will automatically call the passed function.
// Function: replaceChars
var replaceChars = function(match) {
    switch (match) {
        case '<':
            return '&lt;';
        case '>':
            return '&gt;';
        case '"':
            return '&quot;';
        case "'":
            return '&#039;';
        case '&':
            return '&amp;';
    }
};
// no invocation or passed values
var escapedStr2 = testStr.replace(myRx, replaceChars);

// The results are identical
console.log(escapedString);
// He wrote, &quot;2 &lt; 3 is a true statement&quot; on the board.
console.log(escapedStr2);
// He wrote, &quot;2 &lt; 3 is a true statement&quot; on the board.
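The callback also receives the captured groups as additional arguments after the full match. A minimal sketch; the input string and parameter names are invented for this example:

```javascript
var prices = 'apple: 3, pear: 12';

var doubled = prices.replace(/(\w+): (\d+)/g, function(match, name, amount) {
    // name and amount are the two captured groups, passed as strings
    return name + ': ' + (Number(amount) * 2);
});
// doubled is 'apple: 6, pear: 24'
```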
Pattern Flags (Switches)
Property Description Example
 i Ignore the case of characters. /The/i matches "the" and "The" and "tHe"
 g Global search for all occurrences of a pattern /ain/g matches both "ain"s in "No pain no gain", instead of just the first.
 gi Global search, ignore case. /it/gi matches all "it"s in "It is our IT department"
 m Multiline mode. Causes ^ to match beginning of line or beginning of string. Causes $ to match end of line or end of string. JavaScript 1.5+ only. /hip$/m matches "hip" as well as the "hip" in "hip\nhop"
Position Matching
Symbol Description Example
 ^ Only matches the beginning of a string. /^The/ matches "The" in "The night" but not "In The Night"
 $ Only matches the end of a string. /and$/ matches "and" in "Land" but not "landing"
 \b Matches any word boundary (test characters must exist at the beginning or end of a word within the string) /ly\b/ matches "ly" in "This is really cool."
 \B Matches any non-word boundary. /\Bor/ matches “or” in "normal" but not "origami."
(?=pattern) A positive lookahead. Requires that pattern matches at the current position in the input. Pattern is not included as part of the actual match. /\d+(?= chapters)/ matches any digits that are followed by " chapters", such as 2 in "2 chapters", though not in "I have 2 kids."
(?!pattern) A negative lookahead. Requires that pattern does not match at the current position in the input. Pattern is not included as part of the actual match. /JavaScript(?! Kit)/ matches any occurrence of the word "JavaScript" except when it's inside the phrase "JavaScript Kit"
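The position symbols above can be tried out directly. This is a small sketch using the table's own examples plus one made-up string:

```javascript
var atStart = /^The/.test('The night');
// atStart is true
var notAtStart = /^The/.test('In The Night');
// notAtStart is false
var atEnd = /and$/.test('Land');
// atEnd is true

// \b: "ly" must sit at the end of a word
var lyMatch = 'This is really cool.'.match(/ly\b/);
// lyMatch.index is 12

// Negative lookahead: skip "JavaScript" when " Kit" follows
var kitMatch = 'JavaScript Kit and JavaScript'.match(/JavaScript(?! Kit)/);
// kitMatch.index is 19, and kitMatch[0] is just 'JavaScript'
```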
Literals
Symbol Description
Alphanumeric All alphabetical and numerical characters match themselves literally. So /2 days/ will match "2 days" inside a string.
\0 Matches NUL character.
 \n Matches a new line character
 \f Matches a form feed character
 \r Matches carriage return character
 \t Matches a tab character
 \v Matches a vertical tab character
[\b] Matches a backspace.
 \xxx Matches the character whose code is the octal number xxx. \50 matches the left parenthesis character "("
 \xdd Matches the character whose code is the hex number dd. \x28 matches the left parenthesis character "("
 \uxxxx Matches the Unicode character whose code is the hex number xxxx. \u00A3 matches "£"
The backslash (\) is also used when you wish to match a special character literally. For example, if you wish to match the symbol $ literally instead of having it signal the end of the string, backslash it: \$
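A few of these escapes in action, as a rough sketch with made-up sample strings:

```javascript
var paren = /\x28/.test('(');
// paren is true: hex 28 is "("
var pound = /\u00A3/.test('£5');
// pound is true
var tab = /\t/.test('a\tb');
// tab is true

// [\b] matches a backspace; '\b' in a string literal is a backspace too
var backspace = /[\b]/.test('\b');
// backspace is true

// \$ matches a literal dollar sign
var price = /\$\d+/.exec('Cost: $42')[0];
// price is '$42'
```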
Character Classes
Symbol Description Example
 [xyz] Match any one character enclosed in the character set. You may use a hyphen to denote a range. For example, /[a-z]/ matches any lowercase letter, /[0-9]/ any single digit. /[AN]BC/ matches "ABC" and "NBC" but not "BBC" since the leading "B" is not in the set.
 [^xyz] Match any one character not enclosed in the character set. The caret indicates that none of the characters should match.

NOTE: the caret used within a character class is not to be confused with the caret that denotes the beginning of a string. Negation is only performed within the square brackets.
/[^AN]BC/ matches "BBC" but not "ABC" or "NBC".
 . (Dot). Match any character except newline or another Unicode line terminator. /b.t/ matches "bat", "bit", "bet" and so on.
 \w Match any alphanumeric character including the underscore. Equivalent to [a-zA-Z0-9_]. /\w/g matches "2", "0", and "0" in "200%"
 \W Match any single non-word character. Equivalent to [^a-zA-Z0-9_]. /\W/ matches "%" in "200%"
 \d Match any single digit. Equivalent to [0-9].
 \D Match any non-digit. Equivalent to [^0-9]. /\D/g matches "N", "o", and " " in "No 342222"
 \s Match any single whitespace character. Roughly equivalent to [ \t\r\n\v\f]; other Unicode whitespace characters also match.
 \S Match any single non-whitespace character. Roughly equivalent to [^ \t\r\n\v\f].
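The character classes can be checked with a few one-liners. This sketch reuses the table's examples; the variable names are made up.

```javascript
var inSet = ['ABC', 'NBC', 'BBC'].filter(function(name) {
    return /[AN]BC/.test(name);
});
// inSet is ['ABC', 'NBC']

var notInSet = ['ABC', 'NBC', 'BBC'].filter(function(name) {
    return /[^AN]BC/.test(name);
});
// notInSet is ['BBC']

var wordChars = '200%'.match(/\w/g);
// wordChars is ['2', '0', '0']
var nonDigits = 'No 342222'.match(/\D/g);
// nonDigits is ['N', 'o', ' ']

// The dot does not match a newline
var dotNewline = /b.t/.test('b\nt');
// dotNewline is false
```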
Repetition
Symbol Description Example
{x} Match exactly x occurrences of a regular expression. /\d{5}/ matches 5 digits.
{x,} Match x or more occurrences of a regular expression. /\s{2,}/ matches at least 2 whitespace characters.
{x,y} Matches x to y number of occurrences of a regular expression. /\d{2,4}/ matches at least 2 but no more than 4 digits.
? Match zero or one occurrences. Equivalent to {0,1}. /a\s?b/ matches "ab" or "a b".
* Match zero or more occurrences. Equivalent to {0,}. /we*/ matches "w" in "why" and "wee" in "between", but nothing in "bad"
+ Match one or more occurrences. Equivalent to {1,}. /fe+d/ matches both "fed" and "feed"
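The quantifiers can be exercised as follows. This is a sketch only; the anchors ^ and $ are added here so that the whole string has to fit the pattern, and the sample values are made up.

```javascript
var isFiveDigits = /^\d{5}$/.test('90210');
// isFiveDigits is true
var tooShort = /^\d{5}$/.test('9021');
// tooShort is false

// {2,}: collapse runs of two or more whitespace characters
var collapsed = 'a   b  c'.replace(/\s{2,}/g, ' ');
// collapsed is 'a b c'

// {2,4}: takes as many digits as it can, up to four
var upToFour = '123456'.match(/\d{2,4}/)[0];
// upToFour is '1234'

// ?: zero or one whitespace character between a and b
var optionalSpace = ['ab', 'a b', 'a  b'].filter(function(s) {
    return /^a\s?b$/.test(s);
});
// optionalSpace is ['ab', 'a b']

var star = 'between'.match(/we*/)[0];
// star is 'wee'
var plusNeedsOne = /fe+d/.test('fd');
// plusNeedsOne is false
```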
Alternation & Grouping
Symbol Description Example
( ) Grouping characters together to create a clause. May be nested. /(abc)+(def)/ matches one or more occurrences of "abc" followed by one occurrence of "def".
( ) Apart from grouping characters (see above), parentheses also capture the subpattern they enclose. The captured values are available in the array returned by match() or exec() (at index 1, 2, and so on), and also through the legacy static properties RegExp.$1, RegExp.$2, etc. after a successful match. For example, the following matches "2 chapters" in "We read 2 chapters in 3 days", and furthermore isolates the value "2":

var myString = "We read 2 chapters in 3 days";

var needle = /(\d+) chapters/;

// matches "2 chapters"
var result = myString.match(needle);

// alerts captured subpattern,
// or "2"
alert(result[1]);

The subpattern can also be back referenced later within the main pattern. See "Back References" below.
The following finds the text "John Doe" and swaps their positions, so it becomes "Doe John":

"John Doe"
.replace(/(John) (Doe)/, "$2 $1");
(?:x) Matches x but does not capture it. In other words, no numbered references are created for the items within the parentheses. /(?:.d){2}/ matches but doesn't capture "cdad".
x(?=y) Positive lookahead: Matches x only if it's followed by y. Note that y is not included as part of the match, acting only as a required condition. /George(?= Bush)/ matches "George" in "George Bush" but not "George Michael" or "George Orwell".

/Java(?=Script|Hut)/ matches "Java" in "JavaScript" or "JavaHut" but not "JavaLand".
x(?!y) Negative lookahead: Matches x only if it's NOT followed by y. Note that y is not included as part of the match, acting only as a required condition. /^\d+(?! years)/ matches "5" in "5 days" or "5 oranges", but not "5 years".
| Alternation combines clauses into one regular expression and then matches any of the individual clauses. Similar to OR statement. /(ab)|(cd)|(ef)/ matches "ab" or "cd" or "ef".
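To see how capturing, non-capturing groups, lookahead, and alternation differ in what they return, here is a small sketch built on the table's examples. The variable names and the longer sample string are made up.

```javascript
// Capturing groups feed $1 and $2 in the replacement
var swapped = 'John Doe'.replace(/(John) (Doe)/, '$2 $1');
// swapped is 'Doe John'

// A non-capturing group adds nothing to the result array
var nonCapturing = /(?:.d){2}/.exec('cdad');
// nonCapturing[0] is 'cdad', nonCapturing.length is 1

// Positive lookahead: only the "George" followed by " Bush" matches
var george = 'George Michael, George Bush'.match(/George(?= Bush)/);
// george.index is 16

// Alternation: groups that did not take part are undefined
var alt = /(ab)|(cd)|(ef)/.exec('xcd');
// alt[0] is 'cd', alt[1] is undefined, alt[2] is 'cd'
```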
Back References
Symbol Description
( )\n \n (where n is a number from 1 to 9), when placed after a parenthesized subpattern, back references that subpattern, so the text it matched is remembered and must appear again at that point. A subpattern is created by surrounding it with parentheses within the pattern.

Think of \n as a dynamic variable that is replaced with the value of the subpattern it references. For example:

/(hubba)\1/

is equivalent to the pattern /hubbahubba/, as \1 is replaced with the value of the first subpattern within the pattern, or (hubba), to form the final pattern.

Let's say you want to match any word that occurs twice in a row, such as "hubba hubba." The expression to use would be:

/(\w+)\s+\1/

\1 is replaced with the value of the first subpattern's match to essentially mean "match any word, followed by whitespace, followed by the same word again."

If there were more than one set of parentheses in the pattern string you would use \2 or \3 to match the desired subpattern based on the order of the left parenthesis for that subpattern.

In the example:

/(a (b (c)))/

\1 references (a (b (c))), \2 references (b (c)), and \3 references (c).
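A quick sketch of back references at work. The \b anchors in the second pattern are an addition for this example, not part of the pattern above; without them, the tail of one word can pair up with the next word.

```javascript
var repeated = /(\w+)\s+\1/;
var hit = repeated.exec('She said hubba hubba twice');
// hit[0] is 'hubba hubba', hit[1] is 'hubba'

// Without word boundaries, the "is" at the end of "this" counts
var loose = repeated.exec('this is fine');
// loose[0] is 'is is'

// Assumed variant: \b forces whole words on both sides
var wholeWords = /\b(\w+)\s+\1\b/;
var strict = wholeWords.test('this is fine');
// strict is false

// Groups are numbered by the order of their left parenthesis
var nested = /(a (b (c)))/.exec('a b c');
// nested is ['a b c', 'a b c', 'b c', 'c']
```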
Regular expressions to match JavaScript comments:
// match single-line comments (like this one) globally
/\/\/.*/g
// match multi-line comments (/* ... */) globally
/\/\*([^\*]|\*(?!\/))*\*\//g
// or combine both patterns in one using "|" (pipe)
/\/\*([^\*]|\*(?!\/))*\*\/|\/\/.*/g
// wrap in parens to capture and callback: "(pattern)|(pattern2)"
// as perhaps used in an auto syntax highlighter...
myJSString.replace(/(\/\*([^\*]|\*(?!\/))*\*\/)|(\/\/.*)/g, function(match) {
    return '<span class="comment">' + match + '</span>';
}); 
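Here is a rough sketch of the combined pattern run against a small, made-up source string. Note that the pattern knows nothing about string literals, so something like 'http://example.com' inside the code would also be treated as a comment.

```javascript
var commentRx = /(\/\*([^\*]|\*(?!\/))*\*\/)|(\/\/.*)/g;
var source = 'var a = 1; // counter\n/* block\n   comment */\nvar b = 2;';

var comments = source.match(commentRx);
// comments is ['// counter', '/* block\n   comment */']

var highlighted = source.replace(commentRx, function(match) {
    return '<span class="comment">' + match + '</span>';
});
// highlighted starts with
// 'var a = 1; <span class="comment">// counter</span>'
```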
Validate e-mail addresses:
// the i flag is required, since the character classes list only uppercase letters
/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i
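A short sketch of the e-mail pattern in use, assuming the i flag is applied because the character classes list only uppercase letters. The addresses are made up, and the {2,4} quantifier means top-level domains longer than four letters are not matched in full.

```javascript
var emailRx = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i;

var valid = emailRx.test('jane.doe@example.com');
// valid is true
var missingAt = emailRx.test('jane.doe.example.com');
// missingAt is false
var missingDomainDot = emailRx.test('jane@localhost');
// missingDomainDot is false

// Without the i flag, a lowercase address is rejected
var caseSensitive = /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/.test('jane.doe@example.com');
// caseSensitive is false
```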