Programmer Reference Guide

This section covers the syntax of regular expressions. It is designed as a programmer reference guide for localization or development engineers who wish to use advanced regular expressions in their text parsers.

Literals

A literal is a character that matches itself. All characters are literals except: ".", "|", "*", "?", "+", "(", ")", "{", "}", "[", "]", "^", "$" and "\". Each of these characters becomes a literal when preceded by a "\".

Wildcard

The dot character "." matches any single character except newline and NULL.

Repeats

A repeat is an expression that is repeated an arbitrary number of times. There are various types of repeats that can be used within the body of a regular expression.

An expression followed by "*" can be repeated any number of times including zero.
An expression followed by "+" can be repeated any number of times, but at least once.
An expression followed by "?" may be repeated zero or one times.

When it is necessary to specify the minimum and maximum number of repeats explicitly, the bound operator can be used.

All repeat expressions refer to the shortest possible previous sub-expression: for example, a single character, a character set, or a sub-expression grouped with "()".

"ba*" will match all of "b", "ba", "baaa" etc.
"ba+" will match "ba" or "baaaa" for example but not "b".
"ba?" will match "b" or "ba".
"ba{2,4}" will match "baa", "baaa" and "baaaa".

Non-greedy repeats

When the "extended" regular expression syntax is in use (the default), a repeat can be made non-greedy by appending "?" to it. A non-greedy repeat is one that matches the shortest possible string.

For example, to match HTML tag pairs one could use something like:

"<\s*tagname[^>]*>(.*?)<\s*/tagname\s*>"

Bound Operator

The bound operator "{}" is used when it is necessary to specify the minimum and maximum number of repeats.

"a{2}" matches the letter "a" repeated exactly twice.
"a{2,4}" matches the letter "a" repeated between 2 and 4 times.
"a{2,}" matches the letter "a" repeated at least twice with no upper limit.

Parentheses

Parentheses serve two purposes: to group items together into a sub-expression, and to mark what generated the match.

"(ab)*" would match all of the string "ababab".

Alternatives

Alternatives occur when the expression can match either one sub-expression or another; each alternative is separated by a "|".

"a(b|c)" could match "ab" or "ac".
"abc|def" could match "abc" or "def".

Sets

A set is a collection of characters that matches any single character that is a member of the set. Sets are delimited by "[" and "]" and can contain literals, character ranges, character classes, collating elements and equivalence classes. Set declarations that start with "^" contain the complement of the elements that follow.

Examples: Character literals

"[abc]" will match any of "a", "b", or "c".
"[^abc]" will match any character other than "a", "b", or "c".

Examples: Character ranges

"[a-z]" will match any character in the range "a" to "z".
"[^A-Z]" will match any character other than those in the range "A" to "Z".

Character Classes

Character classes are denoted using the syntax "[:classname:]" within a set declaration; for example, "[[:space:]]" is the set of all whitespace characters.

Character classes are only available if the flag regbase::char_classes is set. The available character classes are:

alnum	Any alphanumeric character.
alpha	Any alphabetical character a-z and A-Z. Other characters may also be included depending upon the locale.
blank	Any blank character, either a space or a tab.
cntrl	Any control character.
digit	Any digit 0-9.
graph	Any graphical character.
lower	Any lower case character a-z. Other characters may also be included depending upon the locale.
print	Any printable character.
punct	Any punctuation character.
space	Any whitespace character.
upper	Any upper case character A-Z. Other characters may also be included depending upon the locale.
xdigit	Any hexadecimal digit character, 0-9, a-f and A-F.
word	Any word character - all alphanumeric characters plus the underscore.
unicode	Any character whose code is greater than 255. This applies to the wide character traits classes only.

Line anchors

An anchor is something that matches the null string at the start or end of a line: "^" matches the null string at the start of a line, "$" matches the null string at the end of a line.

"^abc" will match "abc" only if it appears at the very start of a string.
"ing$" will match "ing" only if it appears at the very end of a string.

Characters by code

Character codes consist of the escape character followed by the digit "0" followed by the octal character code. For example "\023" represents the character whose octal code is 23.

Where ambiguity could occur, use parentheses to break the expression up: "\0103" represents the character whose octal code is 103, whereas "(\010)3" represents the character whose octal code is 10 followed by "3".

To match a character by its hexadecimal code, use \x followed by a string of hexadecimal digits, optionally enclosed in {}, for example \xf0 or \x{aff}. Note that the latter example is a Unicode character.

Word operators

The following operators are provided for compatibility with the GNU regular expression library.

"\w" matches any single character that is a member of the "word" character class, this is identical to the expression "[[:word:]]".
"\W" matches any single character that is not a member of the "word" character class, this is identical to the expression "[^[:word:]]".
"\<" matches the null string at the start of a word.
"\>" matches the null string at the end of the word.
"\b" matches the null string at either the start or the end of a word.
"\B" matches a null string within a word.

Escape operator

The escape character "\" has several meanings.

The escape operator may introduce an operator, for example a back reference or a word operator.

The escape operator may make the following character a literal; for example, "\*" represents a literal "*" rather than the repeat operator.
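
The effect of escaping a special character can be shown with "*". This is a sketch assuming Python's re module as a stand-in; re.escape is a Python convenience for escaping a whole string and is not part of the library described here.

```python
import re

# Escaped: "\*" is a literal asterisk, so only the text "a*" matches.
escaped_literal = bool(re.fullmatch(r"a\*", "a*"))
escaped_vs_run = bool(re.fullmatch(r"a\*", "aaa"))

# Unescaped: "*" is the repeat operator, so a run of "a" matches.
repeat_operator = bool(re.fullmatch(r"a*", "aaa"))

# Escaping every special character in a string makes it match itself.
literal_text = "1+1=2?"
round_trip = bool(re.fullmatch(re.escape(literal_text), literal_text))

print(escaped_literal, escaped_vs_run, repeat_operator, round_trip)
```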

Single character escape sequences

The following escape sequences are aliases for single characters:

\a	0x07	Bell character
\f	0x0C	Form feed
\n	0x0A	Newline character
\r	0x0D	Carriage return
\t	0x09	Tab character
\v	0x0B	Vertical tab
\e	0x1B	ASCII Escape character
\0dd	0dd	An octal character code, where dd is one or more octal digits
\xXX	0xXX	A hexadecimal character code, where XX is one or more hexadecimal digits
\x{XX}	0xXX	A hexadecimal character code, where XX is one or more hexadecimal digits, optionally a Unicode character
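
The character codes in the table can be cross-checked by matching each escape sequence against the character with the listed code. This is a sketch assuming Python's re module as a stand-in; "\e", the "\0dd" octal form and the "\x{XX}" form are left out because Python's re does not handle them as described here, and Python's "\x" takes exactly two hexadecimal digits.

```python
import re

# Escape sequence -> character code, copied from the table above.
expected_codes = {
    r"\a": 0x07,
    r"\f": 0x0C,
    r"\n": 0x0A,
    r"\r": 0x0D,
    r"\t": 0x09,
    r"\v": 0x0B,
}

# Each escape should match exactly the single character with that code.
results = {
    escape: re.fullmatch(escape, chr(code)) is not None
    for escape, code in expected_codes.items()
}

# A two-digit hexadecimal code: 0x41 is the letter "A".
hex_match = re.fullmatch(r"\x41", "A") is not None

print(results, hex_match)
```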