Regular Expressions

Regular expressions are a concise and flexible notation for matching patterns of strings, for example, with string functions in SQL and with templates and expressions in the Select panel.  It may seem weird to find such a highly technical topic in the Basics chapter, but for some expert users regular expressions really are part of the basics.   Regular expressions are a very important standard in computing.  The mind-numbing complexity of regular expressions in some cases is not something that Manifold invented.   If we find them puzzling and difficult to learn, that indicates we are normal.

 

Tech tip: In many cases, using the LIKE operator in SQL or the Like template in the Select panel is far easier for most users.  The LIKE operator has only two wildcards: % to match zero or more characters and the underscore _ character to match any single character.
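To make the relationship between the two LIKE wildcards and regular expressions concrete, here is a minimal sketch in Python rather than Manifold SQL.  The helper name like_to_regex is hypothetical, it ignores any escape rules LIKE itself may have, and Python's re module is only a close relative of the dialect Manifold uses.

```python
import re

def like_to_regex(like_pattern):
    """Translate the two LIKE wildcards into regular expression syntax.

    Hypothetical helper for illustration only.
    """
    parts = []
    for ch in like_pattern:
        if ch == "%":
            parts.append(".*")           # % means zero or more characters
        elif ch == "_":
            parts.append(".")            # _ means exactly one character
        else:
            parts.append(re.escape(ch))  # anything else is taken literally
    return "".join(parts)

print(like_to_regex("T%"))    # the LIKE form of "begins with capital T"
print(like_to_regex("O%n"))
```

The sketch shows why LIKE is easier for simple jobs: each wildcard stands for one fixed regular expression fragment, and nothing else in the pattern has a special meaning.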

 

Beginners usually learn to use regular expressions in extremely simple ways.   For example, T.* is a match to any string of characters that begins with the capital letter T.  O.*n matches any string of characters that begins with a capital O and ends with a lower case n, so it matches the words Operation and Option but not Ohio.  
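The simple patterns above can be tried outside Manifold.  The following is a small sketch in Python, assuming that Python's re module behaves like Manifold's engine for these basic constructs; the helper name matches is made up for the illustration.

```python
import re

def matches(pattern, text):
    """True when the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# O.*n : capital O, then anything, then a lower case n at the end.
for word in ["Operation", "Option", "Ohio"]:
    print(word, matches(r"O.*n", word))

# T.* : capital T followed by anything at all.
print("Texas", matches(r"T.*", "Texas"))
```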

 

The following documentation provides a beginning guide to regular expressions.   Despite the nearly vertical learning curve involved, once we get our heads around regular expressions it is truly amazing how useful and powerful they can be, saving lots of time and, like magic, getting done right away exactly what we want.   

Syntax

Over many years various facilities for parsing and utilizing regular expressions have arisen, perhaps most famously in UNIX's grep facility.   Most regular expression engines are similar, but a variety of dialects of regular expression syntax are in use.  Experts take minor differences in dialects in stride but such minor differences can confuse beginners.

 

Manifold supports one of the broadest regular expression syntax standards in common use, the ECMAScript standard, which in various versions is the standard supported by C++, JavaScript, Perl, Python and other languages as well as in Microsoft's .NET.   Other regular expression dialects tend to be similar to or to be subsets of ECMA.  This guide describes how regular expressions in Manifold operate.

Using Regular Expressions

Special characters and sequences are used in writing patterns for regular expressions.  The following table describes special characters and sequences used within regular expressions.   The table often  includes short examples.

 

Characters

Description

\

Marks the next character as either a special character or a literal.  For example, n matches the character "n".  \n matches a newline character.  The sequence \\ matches the backslash character "\" and \( matches the left parenthesis character "(".

 

Important: Please carefully read the discussion and examples below for use of the backslash character "\" in SQL.

@'...'

A Manifold innovation that allows specifying string literals in expressions without using escape characters like \ in SQL.  Replace the three dots in the pattern at left with the string literal desired.  This notation applies in SQL expressions and not just regular expressions, but because it is especially useful in regular expressions it appears here.  The string literal cannot include a single ' quote.  To include a single quote within an expression that also uses this @ notation, we must build it with a concatenation expression in the form @'...' & '\'' & @'...'   

 

For example, if we want to search for patterns like either\or where there are characters on both sides of a backslash character, we can either escape the backslash character using .*\\.* or we can just use the unescaped  backslash character as a literal with  @'.*\.*'   

 

With short expressions, it obviously is easier just to use two backslashes, but with very long expressions involving many backslashes it is clearer to use @'...'.   Likewise, if we need to include many single quotes in an expression it probably will be clearer not to use @'...'.

^

Matches the beginning of input.

$

Matches the end of input.

*

A repetition count: Matches the preceding element zero or more times.  For example, zo* matches either "z" or "zoo".  .* matches any sequence of text of zero or more characters.  In the .*  expression the . matches any single character while the * repetition count indicates a repetition of any single character zero or more times.

+

A repetition count: Matches the preceding element one or more times.  For example, zo+ matches "zoo" but not "z".

?

A repetition count: Matches the preceding element zero or one time.  For example, a?ve? matches the "ve" in "never".

.

Matches any single character except a newline character.

(subexpr)

Parentheses enclosing a subexpression subexpr denote a capture group that is treated and remembered as a single unit, that is, as a single element. The matched substring can be retrieved using a label from 1 to 31 where each label matches the capture groups in the regular expression from first to the last, to a maximum of 31 capture groups.   To match parentheses characters "(" and ")" use \( and \).

(?:subexpr)

A non-capture group.   The same as a capture group but is not labeled and cannot be used in a back reference.   Used when we want to group the subexpression as a single unit or element but we do not want it numbered as part of the sequence of capture groups.

x|y

Matches either x or y.  For example, z|food matches "z" or "food".  (z|f)ood matches "zood" or "food" but not "zoo".

{n}

A repetition count: Matches the preceding element n times where n is a non-negative integer and the element is any element (ordinary character, wildcard, bracket expression, escape, etc.) other than a positive assert (?=subexpression), a negative assert (?!subexpression) or an anchor (^ or $).   a{2} matches "aa" but not "a" or "aaa".   a{2}a matches "aaa".

{n,}

A repetition count: Matches the preceding element at least n times where n is a non-negative integer and the element is any element (ordinary character, wildcard, bracket expression, escape, etc.) other than a positive assert (?=subexpression), a negative assert (?!subexpression) or an anchor (^ or $).   a{1,} matches "a",  "aa" and "aaa".   a{2,} matches "aa" and "aaa" but not "a".

{n,m}

A repetition count: Matches the preceding element at least n times and at most m times where n and m are non-negative integers and the element is any element (ordinary character, wildcard, bracket expression, escape, etc.) other than a positive assert (?=subexpression), a negative assert (?!subexpression) or an anchor (^ or $).   a{1,2} matches "a" and "aa" but not  "aaa".   a{2,3} matches "aa" and "aaa" but not "a" and not "aaaa".

[xyz]

A character set.  Matches any single character that is one of the enclosed characters in the set.  [abc] matches "a" or "b" or "c" but not "aa", "x", "M" or "A".  be[abc]t matches "beat" but not "beet" or "beact"

[^xyz]

A negative character set.  Matches any single character that is not one of the enclosed characters in the set.  [^abc] matches "x" or "M" or "A" but not "xx", "a", "b" or "c".  be[^abc]t matches "beet" but not "beat" or "beeet"

[a-z]

A range of characters within a character set.  Matches any single character in the specified range, inclusive. [a-z] matches any lowercase alphabetic character in the range "a" through "z".  Multiple ranges can be specified within brackets.   [a-gH-M0-5] matches any lowercase alphabetic character in the range "a" through "g" or any uppercase alphabetic character in the range "H" through "M" or any digit in the range "0" through "5" so it would match "a", "5", "d" and "K" but not "A", "w", "AA", "6" or "T".   Note that character sets such as [abc] can be combined within brackets with ranges such as [a-z] so that the expression [-0-24] is a set of characters consisting of the hyphen character "-" followed by a range from "0" to "2" followed by the character "4".   The expression contains the characters "-", "0", "1", "2" and "4".

[^a-z]

A negative range of characters within a character set.  Matches any single character not in the specified range, inclusive. [^a-z] matches any character except any lowercase alphabetic character in the range "a" through "z".  Multiple ranges can be specified within brackets.   [^a-gH-M0-5] matches any single character except those lowercase alphabetic characters in the range "a" through "g" and except those uppercase alphabetic characters in the range "H" through "M" and except any digit in the range "0" through "5" so it would match  "A", "w", "6" or "T" but not "AA", "a", "5", "d" or "K".

\b

Matches a word boundary, that is, the position between a word and a space.  er\b matches the "er" in "never early" but not the "er" in "verb"

\B

Matches a nonword boundary, that is, a position not between a word and a space.  ear\B matches the "ear" in "early" but not the "ear" in "ear lobe"

\d

Matches a digit.  Equivalent to [0-9]

\D

Matches any nondigit character.  Equivalent to [^0-9]

\f

Matches a form feed character

\n

Matches a newline character

\r

Matches a carriage return character

\s

Matches any white space character including space, form feed, newline, carriage return, tab and vertical tab.  Equivalent to [ \f\n\r\t\v]  Note the significance of the space character following the left bracket [ in [ \f\n\r\t\v] 

\S

Matches any non white space character.  Equivalent to [^ \f\n\r\t\v]  Note the significance of the space character following the caret ^ in [^ \f\n\r\t\v] 

\t

Matches a tab character

\v

Matches a vertical tab character

\w

Matches any word character, that is, lowercase letters, uppercase letters, digits and the underscore. Equivalent to [a-zA-Z0-9_]

\W

Matches any non-word character, that is, any character other than lowercase letters, uppercase letters, digits and the underscore. Equivalent to [^a-zA-Z0-9_]

\i

A back reference: matches the text remembered by capture group i, where i is a positive integer giving the number of that capture group.   For example, (.)\1 matches two consecutive identical characters.

\dd

A back reference to the capture group number dd, where dd is a decimal number N that labels the Nth capture group in the expression.   If there are four capture groups in a regular expression we can refer to each of them respectively using \1, \2, \3 and \4.  

\n

Matches n where n is an octal escape value.  Octal escape values must be 1, 2 or 3 digits long.   For example, \11 and \011 both match a tab.  \0071 matches two characters in a row, the octal character specified by \007 (a bell character, ringing the bell on a teletype terminal) followed by the digit character "1".    Octal escape values must not exceed 256.   If they do, only the first two digits comprise the expression.

\xnn

Matches nn, where nn is a two digit hexadecimal escape value.   Hexadecimal escape values must be exactly two digits long.   For example, \x41 matches "A".    \x091 matches two characters in a row, the hex character specified by \x09 (the tab character)  followed by the digit character "1".

[:name:]

For use within character sets.  Matches any single character in the character class named name.
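Several rows of the table can be checked with a few lines of code.  The sketch below uses Python's re module as a stand-in, on the assumption that it agrees with Manifold's engine for these particular constructs (it does not support [:name:] classes, so those are left out); the helper names full and found are made up for the illustration.

```python
import re

def full(pattern, text):
    """True when the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

def found(pattern, text):
    """Every non-overlapping piece of text that fits the pattern."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Repetition counts apply to the element just before them.
print(full(r"zo*", "z"), full(r"zo+", "z"))
print(full(r"a{2,3}", "aaa"), full(r"a{2,3}", "aaaa"))

# A character set, a negative set, and a set mixing a hyphen, a range and a digit.
print(full(r"be[abc]t", "beat"), full(r"be[^abc]t", "beet"))
print([c for c in "-01234" if full(r"[-0-24]", c)])

# Word boundaries and a back reference, searched for inside longer text.
print(found(r"er\b", "never early"), found(r"er\b", "verb"))
print(found(r"ear\B", "early"), found(r"ear\B", "ear lobe"))
print(found(r"(.)\1", "bookkeeper"))
```

Note the two different questions being asked: full checks whether an entire string fits a pattern, while found looks for the pattern anywhere inside a longer string.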

Character Class Names

The following class names may be used in [:name:] constructions:

Class name

Description

alnum

Alphanumeric characters: lowercase letters, uppercase letters, and digits

alpha

Alphabetic characters: lowercase letters and uppercase letters

blank

Space or tab

cntrl

Control characters: \f, \n, \r, \t or \v  (form feed, newline, carriage return, horizontal tab and vertical tab)

digit

Digits, that is the number characters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9

graph

Visible characters: lowercase letters, uppercase letters, digits, and punctuation but not spaces or control characters or other invisible characters

lower

Lowercase letters

print

Visible characters plus space: lowercase letters, uppercase letters, digits, punctuation, and space

punct

Punctuation and symbols

space

Space character

upper

Uppercase letters

xdigit

Hexadecimal characters: a, b, c, d, e, f, A, B, C, D, E, F or digits.

d

Same as digit

s

Same as space

w

Same as alnum

Escaping Special Characters in Regular Expressions

Because a regular expression itself may be used within some context where the special characters used within the regular expression also have a special meaning, we may have to escape those special characters.  There are two contexts requiring escapes, principally of the backslash character: string arguments passed to SQL functions, and patterns entered in Select panel templates.

 

 

For example, SQL functions within Manifold which use regular expressions are passed those regular expressions as a string argument.   Consider the SQL function

 

StringRegexpMatches(<string>, <regexp>, <flags>)

 

This function finds strings within the given field which match a given regular expression.    If we have a table that lists the names of various food products within the field Name and we want to find all products with a name of Dutch Chocolate we could use the function with the following arguments:

 

StringRegexpMatches([Name], 'Dutch Chocolate', 'c')

 

Note that the regular expression Dutch Chocolate is enclosed in single quotes ' as any string argument would be.   The 'c' flag at the end means that the regular expression match by the function will be case sensitive.

 

We could also write that same regular expression as Dutch\sChocolate using the regular expression \s escape sequence meaning a white space character such as a space or tab.    But in that case the argument would appear within the function as:

 

StringRegexpMatches([Name], 'Dutch\\sChocolate', 'c')

 

Note the use of two backslashes \\ instead of just one.   Why two backslashes?   Strings passed as arguments to functions are one of those contexts where the backslash character has special meaning.   To transmit the backslash character used in the \s escape to the regexp processor used by the function, we must escape the \ with another \ so that the string treats it as a literal backslash and not as an escape character that modifies what is sent into the function.

 

That behavior can add up to a lot of backslashes sometimes, especially when we want to look for backslash characters in a pattern, because the regexp engine itself uses the backslash character as a special escape character.  

 

Suppose for example we wanted to find a product with a name of Sweet\Sour Tart.    A regular expression that would find such a name would be Sweet\\Sour Tart, using two backslashes where the first backslash escapes the second one to mean literally a backslash character and not an escape character in a \S escape expression.   That's OK and easy to understand, but if we want to pass that regular expression through a string argument into a function we will have to escape both of those backslashes as well, making for four backslash characters when we use the function:

 

StringRegexpMatches([Name], 'Sweet\\\\Sour Tart', 'c')

 

Mercifully, backslashes are not that common in most language text, URLs or path names in civilized operating systems so we probably will not be writing many regular expressions requiring four backslashes in a row; but we will need to get used to writing two backslashes when using constructs such as \n to look for newlines or \t for tab characters.
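The same two layers of escaping exist in many programming languages, which makes them easy to experiment with.  The sketch below is Python, not Manifold SQL: Python's ordinary string literals play the role of Manifold's quoted strings, and Python's raw r'...' literals are used here only as a rough analogy for a notation that leaves backslashes alone.

```python
import re

# The text being searched contains exactly one backslash.
name = "Sweet\\Sour Tart"

# Ordinary literal: four backslashes are typed, the string layer halves
# them, and the regular expression engine receives two.
escaped = "Sweet\\\\Sour Tart"

# Raw literal: what is typed is what the engine receives.
raw = r"Sweet\\Sour Tart"

print(escaped == raw)
print(re.fullmatch(escaped, name) is not None)

# The same halving applies to escapes such as \s.
print("Dutch\\sChocolate" == r"Dutch\sChocolate")
print(re.fullmatch("Dutch\\sChocolate", "Dutch Chocolate") is not None)
```

Counting backslashes layer by layer, first the string and then the regular expression, is usually the quickest way to debug a pattern that refuses to match.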

 

For example, suppose we use the Select panel to find all products in a table with a name consisting of two words, such as Mint Tea or Malaysian Coffee.    A regular expression to match such names would be

 

\S+\s\S+

 

The \S escape matches any character that is not white space.    The + plus character is a repetition count that says to match any one or more of the preceding, which in this case means to match one or more characters that are not white space.   The \s escape matches any white space character.   Taken together the regular expression matches any product name made up of two words separated by a white space character where each word consists of one or more characters that are not white space.

 

To use the Select panel to find such matches, in the Template tab we would choose the template Text Matches Regexp and in the Pattern box we would enter:

 

'\\S+\\s\\S+'

 

To deconstruct the above, we first note the expression is within single quote ' characters because it is a string that is passed to the template.    Backslash characters have a special meaning within strings, where they act as an escape.   To transmit those backslash characters as part of the regular expression, and not have them stripped out as something which controls the string, we escape each backslash character with another backslash character just in front of it.

 

The rule of thumb for using regular expressions as patterns to Select panel templates: simply replace each backslash in the regular expression with two backslashes, and enclose the entire regular expression in single ' quote characters.
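The rule of thumb is mechanical enough to express as a tiny function.  This is a sketch in Python; the helper name to_quoted_pattern is hypothetical, and it assumes the regular expression contains no single quote characters, which would need separate handling.

```python
def to_quoted_pattern(regex):
    """Double every backslash, then wrap the result in single quotes.

    Hypothetical helper; single quotes inside regex are not handled.
    """
    return "'" + regex.replace("\\", "\\\\") + "'"

# The two-word product name pattern from the text above.
print(to_quoted_pattern(r"\S+\s\S+"))
```

Printing the result shows the text as it would be typed into the Pattern box.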

 

Avoiding the Use of Escape Characters

Manifold has introduced an innovation, the @'...' notation, which allows us to avoid using escape characters in a string literal.   Whatever is within the single ' quote characters in the notation need not be escaped.   A popular use of this notation is within template boxes to write regular expressions  that are not cluttered with very many backslash characters.   Examples using the above strings would be:

 

@'Sweet\Sour Tart'

 

and

 

@'\S+\s\S+'
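Since the @'...' notation cannot contain a single quote, a pattern that includes one has to be assembled with the concatenation form described in the table above.  The following Python sketch builds such text; the helper name to_at_literal is hypothetical and the output is only meant to illustrate the shape of the expression, not to be a tested Manifold utility.

```python
def to_at_literal(regex):
    """Write regex in @'...' notation, splicing in any single quotes.

    Hypothetical helper: pieces between single quotes become @'...'
    literals, joined by an escaped quote string using the & operator.
    """
    pieces = regex.split("'")
    return " & '\\'' & ".join("@'" + piece + "'" for piece in pieces)

print(to_at_literal(r"\S+\s\S+"))   # no quotes: a single @'...' literal
print(to_at_literal(r"\d+'\d+"))    # one quote: two literals joined together
```

With many single quotes the result grows quickly, which is why ordinary quoted strings may be clearer for such patterns.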

 

Capture Groups

A capture group is an expression within parentheses ( ), the contents of which the regular expression engine regards and remembers as a single unit, labeling each such capture group with a number, beginning with the number 1 and numbering in order each capture group that appears in a regular expression up to 31, the maximum number of capture groups that can appear in a regular expression.   

 

Because capture groups can appear nested within capture groups, that is, expressions within parentheses nested within other enclosing parentheses, the number of each capture group is determined by counting the opening parentheses ( of each capture group in order.

 

Let us first consider how capture groups tell the regular expression engine to treat their contents as a single unit.  Consider first a regular expression that does not have a capture group:

 

ab+

 

The above regular expression has two characters, each considered its own unit by the regular expression engine, followed by a character, the plus sign +,  that the regular expression engine knows is a special character which is a repetition count command that means to match the preceding element one or more times. In this regular expression the preceding element is a single character, the character b.  The expression will therefore hunt for sequences of characters that begin with a and then continue with one or more b characters.   The expression will match a sequence of characters like ab, abb or abbb but it will not match abab because the repetition count applies only to the element that precedes it.

 

Consider now a regular expression written using a capture group:

 

(ab)+

 

In this case the entire capture group, the pattern ab, is treated by the regular expression as a single unit, that is, a single element.   The plus sign + repetition count applies to that element, so it will look for one or more matches to that element.   This regular expression will hunt for sequences of characters that consist of one or more repetitions of the two character sequence ab.  The expression will match a sequence of characters like ab, abab or ababab but it will not match abb or abbb.
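The difference between the two expressions is easy to see by running both against the same strings.  This is a sketch in Python, assuming its re module treats grouping and repetition the same way Manifold's engine does; the helper name full is made up for the illustration.

```python
import re

def full(pattern, text):
    """True when the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

# Columns: the text, the result for ab+, the result for (ab)+
for text in ["ab", "abb", "abbb", "abab", "ababab"]:
    print(text, full(r"ab+", text), full(r"(ab)+", text))
```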

 

Capture Group Labels / Back References

Each capture group in a regular expression is assigned a numeric label from 1 to the maximum 31.   We can refer to the Nth capture group in a regular expression by using the expression \dd where dd is the number N referring to the Nth capture group.   Using a number for dd that is greater than the number of capture groups is an error.

 

Capture group labels are assigned by counting each capture group's opening ( parenthesis character in turn.  Consider the expression:

 

((a+)(b+))(c+)

 

The above expression contains four capture groups, two of which are nested within another capture group.   We can see what the number labels are for each capture group by counting the left, opening parenthesis ( characters of each capture group in order:

 

[Illustration: the opening parentheses of ((a+)(b+))(c+) numbered 1, 2, 3 and 4 from left to right.]

 

The first capture group is ((a+)(b+)), the second capture group is (a+), the third capture group is (b+) and the fourth capture group is (c+).     Just to make this fun for beginners, if we have a non-capture group in our regular expression we do not count the left parenthesis for the non-capture group.   See comments below.

 

The entire expression matches a target sequence of aabbbc.  The first capture group, which we can refer to as \1, matches the subsequence aabbb.   The second capture group, which we can refer to as \2, matches the subsequence aa.   The third capture group, \3, matches the subsequence bbb and the fourth capture group, \4 matches the subsequence c.
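The numbering can be confirmed by asking a regular expression engine what each group captured.  The sketch below uses Python's re module, which numbers groups by their opening parenthesis in the same way; whether Manifold exposes captured text through a comparable call is not covered here.

```python
import re

match = re.fullmatch(r"((a+)(b+))(c+)", "aabbbc")

# groups() lists the captured text for group 1, 2, 3 and 4 in order.
for number, text in enumerate(match.groups(), start=1):
    print(number, text)
```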

 

Using a capture group label in a regular expression to refer back to a capture group that appeared earlier in the regular expression is a back reference.   In the above example, we could use the back references \1, \2, \3, and \4 in a more complex regular expression that builds upon the capture groups in the expression above.   Consider for example:

 

((a+)(b+))(c+)\3

 

That expression matches a character sequence consisting of the sequence the original expression matched plus in addition a character sequence matched by the third capture group, that is, it will match aabbbc followed by bbb.    It will match a sequence of characters aabbbcbbb but it will not match a sequence such as aabbbcaabbbc or aabbbcaa or aabbbcc.     Consider the following:

 

((a+)(b+))(c+)\1

 

That will match the sequence of the original expression followed by characters matched by the first capture group, \1,  so it will match aabbbc followed by aabbb.   It therefore will match the sequence aabbbcaabbb.   Consider an example that uses more than one back reference:

 

((a+)(b+))(c+)\4\3\2

 

The above matches the original match followed by the fourth, third and second capture groups, so that it matches aabbbc followed by c and then bbb and then aa.   It matches the sequence aabbbccbbbaa.

 

Note that the following regular expression is an error:

 

((a+)(b+))(c+)\5

 

There are only four capture groups in the expression which precede the back reference \5 so trying to refer back to a fifth capture group number is an error.
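The back reference examples can be exercised in a few lines.  This is a sketch in Python, whose re module uses the same \1, \2 style for back references; the way an invalid reference is reported (an exception in Python) is specific to each engine and may look different in Manifold.

```python
import re

def full(pattern, text):
    """True when the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

base = r"((a+)(b+))(c+)"

print(full(base + r"\3", "aabbbcbbb"))         # group 3 (bbb) must repeat
print(full(base + r"\3", "aabbbcaa"))          # aa is not what group 3 captured
print(full(base + r"\1", "aabbbcaabbb"))       # group 1 (aabbb) must repeat
print(full(base + r"\4\3\2", "aabbbccbbbaa"))  # then c, bbb and aa

# Referring to a fifth group that does not exist is rejected outright.
try:
    re.compile(base + r"\5")
    invalid_reference_rejected = False
except re.error:
    invalid_reference_rejected = True
print(invalid_reference_rejected)
```

A back reference repeats the exact text a group captured, not the group's pattern, which is why aabbbcaa fails even though aa would fit a+.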

Non-Capture Groups

A non-capture group allows us to group a subexpression as would a capture group but without labelling that group and remembering the result in the numbered sequence of capture groups for later re-use as a back reference.   A non-capture group consists of a left parenthesis ( character followed by a question mark ? followed by a colon :  character, a subexpression and a right parenthesis ) character.   Beginning the group with a (?: sequence indicates a non-capture group instead of a capture group.

 

For example, the regular expression:

(a)(?:b)*(c)

 

Includes the capture groups (a) and (c), the non-capture group (?:b) and a repetition count *.   The entire expression matches the character sequence abbc.   The back reference \1 refers to a, the subsequence from the first capture group (a) and the back reference \2 refers to c, the subsequence from the second capture group (c).    The non-capture group has no back reference so we have no way to refer back to the sequence bb by using a back reference.
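A quick check shows that the non-capture group takes part in the match but is skipped in the numbering.  This sketch uses Python's re module, which supports the same (?: ) syntax.

```python
import re

match = re.fullmatch(r"(a)(?:b)*(c)", "abbc")

# Only the two capture groups are remembered; bb is matched but not kept.
print(match.groups())

# The same expression written with three capture groups numbers them 1 to 3.
print(re.fullmatch(r"(a)(b)*(c)", "abbc").groups())
```

In the second expression the repeated group keeps only its last repetition, so group 2 holds a single b rather than bb.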

 

We are not required to use back references and many people never use them.  If we do not use back references it seems there is no practical difference between capture groups and non-capture groups. We could always use capture groups even if we never intend to use back references.   Given that capture groups use a simple, intuitive syntax of ( )  while non-capture groups use a potentially confusing syntax of (?: ) to do less, why do non-capture groups exist at all?   There is an Essay for that.

 

Examples

The following examples each list a regular expression that can be used to search for patterns matching the subject given. Each regular expression is followed by sample strings, each marked OK if the string matches the regular expression or No if it fails the test posed by the regular expression.

 

Important: When using regular expressions to specify a Pattern in the Select panel Template tab, remember that we must enclose the regular expression in single quote characters, as in '(Carlos|Mario) .*'   and also that we must escape each backslash with a preceding backslash.    To simplify the examples shown the backslash characters are not doubled as they would be to escape each regexp backslash.   For example, the US style phone numbers example

 

(\+\d)?\s*(\(\d+\))?\s*\d[\s\d-]*

 

Would be entered into the Pattern box as

 

'(\\+\\d)?\\s*(\\(\\d+\\))?\\s*\\d[\\s\\d-]*'

 

Using Manifold's @'...' notation we could enter the above into the Pattern box as

 

@'(\+\d)?\s*(\(\d+\))?\s*\d[\s\d-]*'

 

These examples show how regular expressions can be used to find certain patterns with the names attached to each example not intended as rigorous definitions. They do not purport to be definitive filters for the various examples given.   For example, the "US style" phone number example picks out a variety of constructions, such as (8) 34-93-93, that do not look like US phone numbers.

 

A rigorous filter for URLs would be a more complex regular expression than that provided below since it would exclude characters such as # that are not allowed in domain names.  The last example, for email addresses, shows a more robust regular expression that can be used as a true validation filter for acceptable email addresses.

 

First name is Carlos or Mario:

 

Note the space between the right parenthesis ) and the dot . character.    

 

(Carlos|Mario) .*

 

OK    Carlos Cramer

OK    Carlos Hernandez

OK    Carlos Gonzalez

OK    Mario Hernandez

No    Paolo Accorti

 

 

Last name is Hernandez:

 

Note the space between the plus sign +  and the H.

 

.+ Hernandez

 

No   Carlos Cramer

OK   Carlos Hernandez

No   Carlos Gonzalez

OK   Mario Hernandez

No   Paolo Accorti

 

 

First name is Carlos and last name is Hernandez or Cramer:

 

Note the space between the s and the left parenthesis (.

 

Carlos (Hernandez|Cramer)

 

OK   Carlos Cramer

OK   Carlos Hernandez

No   Carlos Gonzalez

No   Mario Hernandez

No   Paolo Accorti

 

 

US style phone number:

 

(\+\d)?\s*(\(\d+\))?\s*\d[\s\d-]*

 

OK   +7(514)555-9931

OK   (514) 333-9931

OK   (617) 555-3267

OK   555-8787

No   (1) 03.83.00.68
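The phone number pattern can be tried against sample strings with a short script.  This is a sketch in Python, assuming its re module reads this pattern the same way Manifold's engine does; the name is_phone_like is hypothetical.

```python
import re

# The "US style" pattern, written once without any string-level escaping.
PHONE = re.compile(r"(\+\d)?\s*(\(\d+\))?\s*\d[\s\d-]*")

def is_phone_like(text):
    """True when the whole of text fits the pattern."""
    return PHONE.fullmatch(text) is not None

samples = [
    "+7(514)555-9931",
    "(617) 555-3267",
    "555-8787",
    "(1) 03.83.00.68",
    "(8) 34-93-93",
]
for text in samples:
    print(text, is_phone_like(text))
```

The last sample is accepted even though it does not look like a US phone number, which illustrates how loose this pattern is.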

 

 

US style phone number containing area code 514:

 

(\+\d)?\s*\(514\)\s*\d[\s\d-]*

 

OK   +7(514)555-9931

OK   (514) 333-9931

No   (617) 555-3267

No   555-8787

No   (1) 03.83.00.68

 

 

US style local phone number starting with 555:

 

(\+\d)?\s*(\(\d+\))?\s*555[\s\d-]*

 

OK   +7(514)555-9931

No   (514) 333-9931

OK   (617) 555-3267

OK   555-8787

No   (1) 03.83.00.68

 

 

URL:

 

(?:ftp\:\\\\|http\:\\\\|mailto\:\\\\)?(\w+\@)?(www\.)?\w+(\.\w+)+(\:\d+)?

 

OK   http:\\www.manifold.net:8080

No   http:\\manifold

OK   www.manifold.net

OK   ftp:\\microsoft.com

OK   mailto:\\john@manifold.com

 

 

Manifold URL:

 

(?:ftp\:\\\\|http\:\\\\|mailto\:\\\\)?(\w+\@)?(www\.)?manifold(\.\w+)+(\:\d+)?

 

OK  http:\\www.manifold.net:8080

No  http:\\manifold

OK  www.manifold.net

No  ftp:\\microsoft.com

OK  mailto:\\john@manifold.com

 

The long expression shown above should all be written on a single line.

 

 

FTP URL:

 

ftp\:\\\\(www\.)?\w+(\.\w+)+(\:\d+)?

 

No  http:\\www.manifold.net:8080

No  http:\\manifold

No  www.manifold.net

OK  ftp:\\microsoft.com

No  mailto:\\john@manifold.com

 

Latitude:

 

\d+\°\d+\'\d+(\.\d+)?\"\s*(N|S)?

 

OK  0°00'00.00"

OK  83°02'50.82" N

OK  0°05'43.14" S

No  9'

 

 

South latitude:

 

\d+\°\d+\'\d+(\.\d+)?\"\s*S

 

No  0°00'00.00"

No  83°02'50.82" N

OK  0°05'43.14" S

No  9'

 

 

Date style using forward slash or hyphen to separate month day and year:

 

\d+(\/|\-)\d+(\/|\-)\d+

 

OK  5/24/1985

OK  5/24/85

OK  5-24-1985

No  5.24.1985

 

Note the above does not check the elements for calendar consistency.  It is only a check for three numbers as a group with the numbers separated by a forward slash / or a hyphen -.

 

Date style using forward slash or hyphen ending in a 199x number:

 

\d+(\/|\-)\d+(\/|\-)199\d

 

No  5/24/1985

OK  5-24-1995

No  5.24.1995

 

 

Date style with 24 as the middle number:

 

\d+(\/|\-)24(\/|\-)\d+

 

OK  5/24/1985

No  5-29-1985

No  5.24.1985

 

 

Positive dollar currency amount:

 

(\$\s*\d+(\.\d+)?)|(\d+(\.\d+)?\s*\$)

 

OK  $5

OK  4.6 $

No  -7.3 $

No  7.3

 

 

Exponential number:

 

\-?\d+(\.\d+)?([ED]\-?\d+)?

 

OK  -3.5E2

OK  3D-56

OK  -7

No  .8

 

 

Rational fraction:

 

\-?(?:((\d+\s*)?\d+\/\d+)|(\d+(\s*\d+\/\d+)?))

 

OK  -2

OK  2/5

OK  -1 2/5

No  2/-5

 

 

HTML style tag:

 

<(.*)>.*<\/\1>

 

OK  <a>abc</a>

No  <a>abc</b>

No  <a>abc

 

 

US social security style number pattern:

 

\d{3}-\d{2}-\d{4}

 

OK  223-20-9898

No  22-20-9898

No  223-209898

No  223 20 9898

 

 

Credit card style number patterns:

 

\d{4}( \d{4}){3,4}

 

OK  2235 5656 4578 7890

OK  2235 5656 4578 7890 0010

No  2235 5656 4578 7890 00

No  2235-5656-4578-7890

 

 

Email address validation:

 

([\w\.!#\$%\-+.]+@[A-Za-z0-9\-]+(\.[A-Za-z0-9\-]+)+)

 

OK  john_smith@domain.com

OK  john.smith@domain.com.au

No  john_smith.domain.com.au

No  john smith@domain.com.au

No  john_smith@domain

No  john_smith@

 

The above is an example to show how regular expressions work.   For a real life use of regular expressions aimed at high efficiency, see the discussion in https://msdn.microsoft.com/en-us/library/01escwtf(v=vs.110).aspx
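When collecting patterns like the ones in this section, it helps to keep the sample strings next to each pattern and check them automatically.  The following is a minimal sketch of such a check in Python; the names EXAMPLES and mismatches are made up, only a few of the examples are included, and Python's re module is assumed to agree with Manifold's engine for them.

```python
import re

# Each pattern maps sample strings to the expected result (True means OK).
EXAMPLES = {
    r"\d{3}-\d{2}-\d{4}": {
        "223-20-9898": True, "22-20-9898": False,
        "223-209898": False, "223 20 9898": False,
    },
    r"\d{4}( \d{4}){3,4}": {
        "2235 5656 4578 7890": True, "2235 5656 4578 7890 0010": True,
        "2235 5656 4578 7890 00": False, "2235-5656-4578-7890": False,
    },
    r"\d+(\/|\-)\d+(\/|\-)\d+": {
        "5/24/1985": True, "5/24/85": True,
        "5-24-1985": True, "5.24.1985": False,
    },
    r"<(.*)>.*<\/\1>": {
        "<a>abc</a>": True, "<a>abc</b>": False, "<a>abc": False,
    },
    r"([\w\.!#\$%\-+.]+@[A-Za-z0-9\-]+(\.[A-Za-z0-9\-]+)+)": {
        "john_smith@domain.com": True, "john.smith@domain.com.au": True,
        "john_smith.domain.com.au": False, "john smith@domain.com.au": False,
        "john_smith@domain": False, "john_smith@": False,
    },
}

def mismatches(examples):
    """Return (pattern, text) pairs whose actual result differs from the expected one."""
    problems = []
    for pattern, cases in examples.items():
        for text, expected in cases.items():
            actual = re.fullmatch(pattern, text) is not None
            if actual != expected:
                problems.append((pattern, text))
    return problems

print(mismatches(EXAMPLES))
```

An empty list means every sample behaved as listed.  Adding a new sample is a one-line change, which makes it cheap to probe the edges of a pattern before relying on it.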

Notes

Grep first appeared in the 4th edition of UNIX in 1973 and has been a signature part of UNIX derived systems such as BSD and Linux ever since.   In modern times the name grep is said to be an acronym for "global regular expression print," but the name arose when Ken Thompson created grep by adapting a regular expression parser he wrote for the line editor, ed.   In that editor the command g/re/p prints all lines matching the regular expression re, hence the name of Thompson's new utility.

 

Why string escapes in the Select panel?  It might seem unnecessary to enter regular expressions as strings in the Pattern box when using the Select panel template Text Matches Regexp, and thus to have to enclose each such expression within single quote ' characters as well as escape each use of a backslash character with yet another backslash character preceding it.   Why does Manifold not simply know that whatever is in the Pattern box should be taken as straight regexp syntax with no need to enclose in single quotes or to escape backslash characters?

 

That could certainly be done but then there would be two different ways of using regular expressions, one within functions passed as string arguments as is universally done in many settings and the other a special case within a template's Pattern combo box.     That would also prevent simply copying and pasting between the two settings.  Instead, Manifold offers the option of using @'...' notation, which always works.

 

Limited Number of Capture Groups - Although the ECMA standard places no limit on the number of capture groups in a regular expression, as a practical matter most implementations have some upper limit.  The Microsoft regular expression engine utilized within Manifold limits the number of capture groups to 31.  If you need more than thirty-one capture groups for back references you are probably doing something gruesomely inefficient that should be better structured.

 

ECMA and Regular Expression Syntax - ECMA has had a big impact because ECMAScript is the formal standard from which various JavaScript dialects derive, including both Google's V8 as well as Microsoft implementations.     ECMA syntax for regular expressions is used in C++, .NET, JavaScript (of course), Perl, Python and by many other systems.    The ECMAScript standard has evolved over the years as newer editions of the standard have emerged.

 

Manifold simultaneously supports two different levels of ECMA standards.   For use within Manifold facilities such as Manifold SQL functions and the Select panel the Microsoft-based regular expression parser used by Manifold supports ECMA as it is currently tracked in C++ by the Microsoft libraries utilized within Manifold, at the present writing Edition 3 of the ECMA standard.   For use within JavaScript scripts using the V8 engine Manifold supports ECMA as it is currently tracked within V8, at the present writing Edition 6 of the ECMA standard.

 

The difference between the two is not normally an issue since regular expressions have not changed much over the years.   It is much more likely that users who have become accustomed to other dialects or subsets of ECMA regular expression syntax, such as GNU-style BRE or UNIX style grep, may have to adapt to contemporary ECMA / JavaScript / C++ / Python / .NET regular expression syntax.   This, too, should not be a big deal as anyone who is enough of a gear head to be using regular expressions in the first place will have no difficulty adjusting.

 

Beginners learning to use regular expressions for the first time with Manifold should be aware of differences between various regexp engines and dialects.   It may help to ensure any supplemental education they undertake discusses ECMA regular expression syntax, for example, text books or web sites discussing use of regular expressions in JavaScript, C++ or Python.  

 

A useful table comparing features of different regular expression engines is at https://msdn.microsoft.com/en-us/library/bb982727.aspx

 

The formal definition of ECMA standard regular expressions is at http://www.ecma-international.org/ecma-262/5.1/#sec-15.10

 

An example of an earlier ECMA standard definition is at http://www-archive.mozilla.org/js/language/E262-3.pdf

 

Regular Expressions and Performance

 

Like many things in computing, such as SQL queries, there are efficient and inefficient ways of writing regular expressions to accomplish exactly the same match.  In extreme cases the difference between an efficient regular expression and an exceptionally inefficient one can be the difference between a search for a match taking seconds or taking hours or even days, even when the data to be searched and matched is relatively small.    As a general rule it is a good idea whenever possible to write regular expressions which avoid backtracking and, especially, nested backtracking.
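A classic illustration of nested backtracking is a repetition placed inside another repetition.  The sketch below, in Python, compares two patterns that accept exactly the same strings; the names NESTED, FLAT and time_failed_match are made up, and the timings it reports are those of Python's engine, not Manifold's, so it only illustrates the general idea.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")   # a repetition inside a repetition
FLAT = re.compile(r"a+b")        # accepts the same strings without nesting

def time_failed_match(pattern, n):
    """Try to match n letters a with no final b; return (result, seconds)."""
    text = "a" * n               # the missing b forces the match to fail
    start = time.perf_counter()
    result = pattern.fullmatch(text)
    return result, time.perf_counter() - start

# Both patterns agree on what matches...
for text in ["ab", "aaab", "b", "aa", "aba"]:
    print(text, NESTED.fullmatch(text) is not None, FLAT.fullmatch(text) is not None)

# ...but they can differ greatly in how long a failure takes to discover.
for n in (10, 14, 18):
    print(n, time_failed_match(NESTED, n)[1], time_failed_match(FLAT, n)[1])
```

In backtracking engines the nested form typically tries every way of splitting the run of a characters between the inner and outer repetition before giving up, so each extra character can roughly double the work.  Keep n small when experimenting.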

 

A useful article written for developers that opens up the general notion that how well a regular expression performs can depend on how it is written appears at https://msdn.microsoft.com/en-us/library/gg578045(v=vs.110).aspx