The regular expression is an algebraic expression for text search strings. It defines a representation to match a set of strings.
The basic format of regex is /<regex>/
e.g. /love/ matches lover, but not Love. Regex is case-sensitive.
Disjunction of A character
The []
is used to represent character disjunction (or). e.g. /l[ai]ke/
matches like, lack. The place at the second character is either i
or a
.
Match a range of characters
Inside the []
, using -
can represent a range of characters, for instance. [a-z] is a lower case letter from a to z. /[A-Za-z0-9]/
represents numerical digits, upper case letters, and lower case letters.
Indicate not some characters
Inside []
using, the caret ^
right after [
can be used to match characters that are not the specific set of characters. e.g. \[^a]\
matches all characters except for a. e.g. /[^0-9]/
matches 90 <u>k</u>ilogram
. (here only underscore the first matched character).
To be mentioned, if the caret ^
is not right after [
inside of square brackets, it is only a caret! When outside of square brackets, /\^/
matches ^
The counters and wildcard expression
Counters
The counter is being used to indicate how many times the pattern before them should repeat in a matched token.
Expression | Meaning |
---|---|
? | 0 or 1 times |
+ | 1 or more times |
* | 0 or more times |
{,} | at least l times, at most u times. If l and u leave empty, e.g.{2,} /{,2} indicates at least/most two times. |
{} | the pattern should repeat exact k times |
The counter should be placed at the end of the pattern. e.g. /a+/ matches apple, aaa.
Wildcard expression (通配符)
/./
is the wildcard expression, which represents any characters, except the carriage return.
Anchors
Expression | Meaning |
---|---|
^ | Start of the line anchor |
$ | End of the line |
\b | word boundary |
\B | not a word boundary |
Disjunction, Grouping, and Precedence
disjunction
|
is called the disjunction operator, also called the pipe symbol, which means “or”.
/cat|dog/
matches the word cat
,dog
.
group
Group is being indicated using ()
, which is a string of characters that behaves atomically like a single character.
For instance /the(re)?/ matches the and there.
precedence
- ()
- counters
- sequences and anchor
- disjunction
|