Showing content from https://docs.racket-lang.org/reference/regexp.html below:

4.8 Regular Expressions

Regular Expressions in The Racket Guide introduces regular expressions.

Regular expressions are specified as strings or byte strings, using the same pattern language as either the Unix utility egrep or Perl. A string-specified pattern produces a character regexp matcher, and a byte-string pattern produces a byte regexp matcher. If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see Encodings and Locales) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.
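The distinction between character and byte matchers can be sketched as follows. The names char-rx, byte-rx, on-string, on-bytes, and bytes-on-string are illustrative only, and the expected results in the comments assume the behavior of regexp-match described in this section.

```racket
#lang racket/base

;; A string pattern produces a character regexp; a byte-string pattern
;; produces a byte regexp.
(define char-rx #rx"λ")
(define byte-rx #rx#"\316\273") ; the two bytes that encode λ in UTF-8

(regexp? char-rx)       ; => #t
(byte-regexp? byte-rx)  ; => #t

;; Character regexp on a string: the match is reported as a string.
(define on-string (regexp-match char-rx "λx"))        ; => '("λ")

;; Character regexp on a byte string: it matches the UTF-8 encoding of λ,
;; and the match is reported as a byte string.
(define on-bytes (regexp-match char-rx #"\316\273x")) ; => '(#"\316\273")

;; Byte regexp on a string: it matches bytes of the string's UTF-8 encoding.
(define bytes-on-string (regexp-match byte-rx "λx"))  ; => '(#"\316\273")
```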

A regular expression that is represented as a string or byte string can be compiled to a regexp value, which functions such as regexp-match can use more efficiently than the string or byte string form. The regexp and byte-regexp procedures convert a string or byte string (respectively) into a regexp value using a syntax of regular expressions that is most compatible with egrep. The pregexp and byte-pregexp procedures produce a regexp value using a slightly different syntax of regular expressions that is more compatible with Perl.
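As a small illustration of how the two syntaxes diverge, the sketch below compiles the same source strings with regexp and pregexp. It relies on two differences stated in the grammar later in this section: regexp treats { and } as literals, and treats \ as a literal inside ranges. The variable names are illustrative only.

```racket
#lang racket/base

;; regexp: egrep-like syntax, where { and } are ordinary characters.
(define rx (regexp "a{2}"))
;; pregexp: Perl-like syntax, where {2} is a bounded repetition.
(define px (pregexp "a{2}"))

(define rx-literal (regexp-match rx "a{2}")) ; => '("a{2}")
(define rx-on-aa   (regexp-match rx "aa"))   ; => #f
(define px-on-aa   (regexp-match px "aa"))   ; => '("aa")

;; Inside a range, \ is a literal in regexp syntax, but it introduces a
;; class such as \d in pregexp syntax. The input is the 4 characters d \ d 7.
(define rx-range (regexp-match (regexp "[\\d]+") "d\\d7"))  ; => '("d\\d")
(define px-range (regexp-match (pregexp "[\\d]+") "d\\d7")) ; => '("7")
```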

Two regexp values are equal? if they have the same source, use the same pattern language, and are both character regexps or both byte regexps.
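A minimal sketch of this equality rule, assuming a Racket version that implements equal? on regexp values as stated above; the names are illustrative only.

```racket
#lang racket/base

;; Same source, same pattern language, both character regexps.
(define same-source      (equal? (regexp "a+") (regexp "a+")))       ; => #t
;; Same source, but different pattern languages.
(define different-syntax (equal? (regexp "a+") (pregexp "a+")))      ; => #f
;; Same source text, but one is a character regexp and one a byte regexp.
(define different-kind   (equal? (regexp "a+") (byte-regexp #"a+"))) ; => #f
;; Patterns that accept the same inputs still differ if their sources differ.
(define different-source (equal? (regexp "a+") (regexp "aa*")))      ; => #f
```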

A literal or printed regexp value starts with #rx or #px. See Reading Regular Expressions for information on reading regular expressions and Printing Regular Expressions for information on printing regular expressions. Regexp values produced by the default reader are interned in read-syntax mode.
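The literal and printed forms can be inspected directly. The sketch below uses only pregexp?, format, and object-name; the variable names are illustrative.

```racket
#lang racket/base

(define rx #rx"a+")    ; literal in regexp syntax
(define px #px"a{2}")  ; literal in pregexp syntax

(define rx-is-px (pregexp? rx)) ; => #f
(define px-is-px (pregexp? px)) ; => #t

;; The printed form starts with #rx or #px, followed by the source string.
(define printed-rx (format "~s" rx)) ; => "#rx\"a+\""
(define printed-px (format "~s" px)) ; => "#px\"a{2}\""

;; object-name recovers the source of a regexp value.
(define source (object-name rx))     ; => "a+"
```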

On the BC variant of Racket, the internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 operators.

4.8.1 Regexp Syntax

The following syntax specifications describe the content of a string that represents a regular expression. The syntax of the corresponding string may involve extra escape characters. For example, the regular expression (.*)\1 can be represented with the string "(.*)\\1" or the regexp constant #rx"(.*)\\1"; the \ in the regular expression must be escaped to include it in a string or regexp constant.
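The extra escaping can be checked by counting characters. In the sketch below, the pattern is compiled with pregexp, which is an assumption made here because the grammar that follows lists backreferences (\‹n›) among the pregexp extensions; the names are illustrative.

```racket
#lang racket/base

;; The regular expression (.*)\1 has six characters. The string literal
;; doubles the backslash so that the string contains a single \.
(define pattern-text "(.*)\\1")
(define pattern-length (string-length pattern-text)) ; => 6

;; Assumption: pregexp syntax, so that \1 is a backreference to group 1.
(define doubled (regexp-match (pregexp pattern-text) "abab")) ; => '("abab" "ab")
```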

The regexp and pregexp syntaxes share a common core:

‹regexp›  ::=  ‹pces›                   Match ‹pces›
           |   ‹regexp›|‹regexp›        Match either ‹regexp›, try left first         ex1

‹pces›    ::=                           Match empty
           |   ‹pce›‹pces›              Match ‹pce› followed by ‹pces›

‹pce›     ::=  ‹repeat›                 Match ‹repeat›, longest possible              ex3
           |   ‹repeat›?                Match ‹repeat›, shortest possible             ex6
           |   ‹atom›                   Match ‹atom› exactly once

‹repeat›  ::=  ‹atom›*                  Match ‹atom› 0 or more times                  ex3
           |   ‹atom›+                  Match ‹atom› 1 or more times                  ex4
           |   ‹atom›?                  Match ‹atom› 0 or 1 times                     ex5

‹atom›    ::=  (‹regexp›)               Match sub-expression ‹regexp› and report      ex11
           |   [‹rng›]                  Match any character in ‹rng›                  ex2
           |   [^‹crng›]                Match any character not in ‹crng›             ex12
           |   .                        Match any (except newline in multi mode)      ex13
           |   ^                        Match start (or after newline in multi mode)  ex14
           |   $                        Match end (or before newline in multi mode)   ex15
           |   ‹literal›                Match a single literal character              ex1
           |   (?‹mode›:‹regexp›)       Match ‹regexp› using ‹mode›                   ex35
           |   (?>‹regexp›)             Match ‹regexp›, only first possible
           |   ‹look›                   Match empty if ‹look› matches
           |   (?‹tst›‹pces›|‹pces›)    Match 1st ‹pces› if ‹tst›, else 2nd ‹pces›    ex36
           |   (?‹tst›‹pces›)           Match ‹pces› if ‹tst›, empty if not ‹tst›
           |   \ at end of pattern      Match the nul character (ASCII 0)

‹crng›    ::=  ‹rng›                    ‹crng› contains everything in ‹rng›
           |   ^‹crng›                  ‹crng› contains ^ and everything in ‹crng›    ex37

‹rng›     ::=  ]                        ‹rng› contains ] only                         ex27
           |   -                        ‹rng› contains - only                         ex28
           |   ‹mrng›                   ‹rng› contains everything in ‹mrng›
           |   ‹mrng›-                  ‹rng› contains - and everything in ‹mrng›

‹mrng›    ::=  ]‹lrng›                  ‹mrng› contains ] and everything in ‹lrng›    ex29
           |   -‹lrng›                  ‹mrng› contains - and everything in ‹lrng›    ex29
           |   ‹lirng›                  ‹mrng› contains everything in ‹lirng›

‹lirng›   ::=  ‹riliteral›              ‹lirng› contains a literal character
           |   ‹riliteral›-‹rliteral›   ‹lirng› contains Unicode range inclusive      ex22
           |   ‹lirng›‹lrng›            ‹lirng› contains everything in both

‹lrng›    ::=  ^                        ‹lrng› contains ^                             ex30
           |   ‹rliteral›-‹rliteral›    ‹lrng› contains Unicode range inclusive
           |   ^‹lrng›                  ‹lrng› contains ^ and more
           |   ‹lirng›                  ‹lrng› contains everything in ‹lirng›

‹look›    ::=  (?=‹regexp›)             Match if ‹regexp› matches                     ex31
           |   (?!‹regexp›)             Match if ‹regexp› doesn’t match               ex32
           |   (?<=‹regexp›)            Match if ‹regexp› matches preceding           ex33
           |   (?<!‹regexp›)            Match if ‹regexp› doesn’t match preceding     ex34

‹tst›     ::=  (‹n›)                    True if ‹n›th ( has a match
           |   ‹look›                   True if ‹look› matches                        ex36

‹mode›    ::=                           Like the enclosing mode
           |   ‹mode›i                  Like ‹mode›, but case-insensitive             ex35
           |   ‹mode›-i                 Like ‹mode›, but case-sensitive
           |   ‹mode›s                  Like ‹mode›, but not in multi mode
           |   ‹mode›-s                 Like ‹mode›, but in multi mode
           |   ‹mode›m                  Like ‹mode›, but in multi mode
           |   ‹mode›-m                 Like ‹mode›, but not in multi mode

The following completes the grammar for regexp, which treats { and } as literals, \ as a literal within ranges, and \ as a literal producer outside of ranges.

‹literal›    ::=  Any character except (, ), *, +, ?, [, ., ^, \, or |
              |   \‹aliteral›              Match ‹aliteral›                             ex21

‹aliteral›   ::=  Any character

‹riliteral›  ::=  Any character except ], -, or ^

‹rliteral›   ::=  Any character except ] or -

The following completes the grammar for pregexp, which uses { and } bounded repetition and uses \ for meta-characters both inside and outside of ranges.

‹repeat›     ::=  ...                      ...
              |   ‹atom›{‹n›}              Match ‹atom› exactly ‹n› times               ex7
              |   ‹atom›{‹n›,}             Match ‹atom› ‹n› or more times               ex8
              |   ‹atom›{,‹m›}             Match ‹atom› between 0 and ‹m› times         ex9
              |   ‹atom›{‹n›,‹m›}          Match ‹atom› between ‹n› and ‹m› times       ex10
              |   ‹atom›{}                 Match ‹atom› 0 or more times

‹atom›       ::=  ...                      ...
              |   \‹n›                     Match latest reported match for ‹n›th (      ex16
              |   ‹class›                  Match any character in ‹class›
              |   \b                       Match \w* boundary                           ex17
              |   \B                       Match where \b does not                      ex18
              |   \p{‹property›}           Match (UTF-8 encoded) in ‹property›          ex19
              |   \P{‹property›}           Match (UTF-8 encoded) not in ‹property›      ex20
              |   \X                       Match (UTF-8 encoded) grapheme cluster

‹literal›    ::=  Any character except (, ), *, +, ?, [, ], {, }, ., ^, \, or |
              |   \‹aliteral›              Match ‹aliteral›                             ex21

‹aliteral›   ::=  Any character except a-z, A-Z, 0-9

‹lirng›      ::=  ...                      ...
              |   ‹class›                  ‹lirng› contains all characters in ‹class›
              |   ‹posix›                  ‹lirng› contains all characters in ‹posix›   ex26
              |   \‹eliteral›              ‹lirng› contains ‹eliteral›

‹riliteral›  ::=  Any character except ], \, -, or ^

‹rliteral›   ::=  Any character except ], \, or -

‹eliteral›   ::=  Any character except a-z, A-Z

‹class›      ::=  \d                       Contains 0-9                                 ex23
              |   \D                       Contains characters not in \d
              |   \w                       Contains a-z, A-Z, 0-9, _                    ex24
              |   \W                       Contains characters not in \w
              |   \s                       Contains space, tab, newline, formfeed, return   ex25
              |   \S                       Contains characters not in \s

‹posix›      ::=  [:alpha:]                Contains a-z, A-Z
              |   [:upper:]                Contains A-Z
              |   [:lower:]                Contains a-z                                 ex26
              |   [:digit:]                Contains 0-9
              |   [:xdigit:]               Contains 0-9, a-f, A-F
              |   [:alnum:]                Contains a-z, A-Z, 0-9
              |   [:word:]                 Contains a-z, A-Z, 0-9, _
              |   [:blank:]                Contains space and tab
              |   [:space:]                Contains space, tab, newline, formfeed, return
              |   [:graph:]                Contains all ASCII characters that use ink
              |   [:print:]                Contains space, tab, and ASCII ink users
              |   [:cntrl:]                Contains all characters with scalar value < 32
              |   [:ascii:]                Contains all ASCII characters

‹property›   ::=  ‹category›               Includes all characters in ‹category›
              |   ^‹category›              Includes all characters not in ‹category›

In case-insensitive mode, a backreference of the form \‹n› matches case-insensitively only with respect to ASCII characters.
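A minimal sketch of the ASCII case, with illustrative names; it assumes only the behavior stated in the sentence above and makes no claim about non-ASCII letters.

```racket
#lang racket/base

;; Inside (?i:...), the backreference \1 matches the text of group 1
;; without regard to ASCII case, so "a" followed by "A" is accepted.
(define insensitive-backref (regexp-match #px"(?i:(a)\\1)" "aA")) ; => '("aA" "a")

;; Without the i mode, \1 must repeat group 1 exactly.
(define sensitive-backref (regexp-match #px"(a)\\1" "aA"))        ; => #f
```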

The Unicode categories follow.

‹category›  ::=  Ll    Letter, lowercase                          ex19
             |   Lu    Letter, uppercase
             |   Lt    Letter, titlecase
             |   Lm    Letter, modifier
             |   L&    Union of Ll, Lu, Lt, and Lm
             |   Lo    Letter, other
             |   L     Union of L& and Lo
             |   Nd    Number, decimal digit
             |   Nl    Number, letter
             |   No    Number, other
             |   N     Union of Nd, Nl, and No
             |   Ps    Punctuation, open
             |   Pe    Punctuation, close
             |   Pi    Punctuation, initial quote
             |   Pf    Punctuation, final quote
             |   Pc    Punctuation, connector
             |   Pd    Punctuation, dash
             |   Po    Punctuation, other
             |   P     Union of Ps, Pe, Pi, Pf, Pc, Pd, and Po
             |   Mn    Mark, non-spacing
             |   Mc    Mark, spacing combining
             |   Me    Mark, enclosing
             |   M     Union of Mn, Mc, and Me
             |   Sc    Symbol, currency
             |   Sk    Symbol, modifier
             |   Sm    Symbol, math
             |   So    Symbol, other
             |   S     Union of Sc, Sk, Sm, and So
             |   Zl    Separator, line
             |   Zp    Separator, paragraph
             |   Zs    Separator, space
             |   Z     Union of Zl, Zp, and Zs
             |   Cc    Other, control
             |   Cf    Other, format
             |   Cs    Other, surrogate
             |   Cn    Other, not assigned
             |   Co    Other, private use
             |   C     Union of Cc, Cf, Cs, Cn, and Co
             |   .     Union of all Unicode categories

When a character regexp with . is used with a byte string or input port, the . matches only a valid UTF-8 encoding in the input. A . in a byte regexp matches any byte (except a newline in multi mode). A property specified with \P or \p matches only a valid UTF-8 encoding, whether it is written in a character regexp or byte regexp. Similarly, \X matches only valid UTF-8 encoding sequences, and it will not match a prefix of a sequence (even if matching only a prefix would allow the rest of the pattern to match remaining input), but a grapheme-cluster sequence can be terminated by an invalid UTF-8 encoding.
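The difference between . in a character regexp and in a byte regexp shows up on input that is not valid UTF-8. The sketch below uses the byte \377, which never appears in a valid UTF-8 encoding; the names are illustrative, and the expected results follow from the paragraph above.

```racket
#lang racket/base

;; \377 (0xFF) is not part of any valid UTF-8 encoding.
(define input #"\377a")

;; Character regexp on bytes: . skips the invalid byte and matches "a".
(define char-dot (regexp-match #rx"." input))        ; => '(#"a")

;; Byte regexp: . matches any byte, including \377.
(define byte-dot (regexp-match #rx#"." input))       ; => '(#"\377")

;; \p matches only a valid UTF-8 encoding, even in a byte regexp.
(define byte-prop (regexp-match #px#"\\p{Ll}" input)) ; => '(#"a")
```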

Examples:

> (regexp-match #rx"a|b" "cat") ; ex1
'("a")
> (regexp-match #rx"[at]" "cat") ; ex2
'("a")
> (regexp-match #rx"ca*[at]" "caaat") ; ex3
'("caaat")
> (regexp-match #rx"ca+[at]" "caaat") ; ex4
'("caaat")
> (regexp-match #rx"ca?t?" "ct") ; ex5
'("ct")
> (regexp-match #rx"ca*?[at]" "caaat") ; ex6
'("ca")
> (regexp-match #px"ca{2}" "caaat") ; ex7, uses #px
'("caa")
> (regexp-match #px"ca{2,}t" "catcaat") ; ex8, uses #px
'("caat")
> (regexp-match #px"ca{,2}t" "caaatcat") ; ex9, uses #px
'("cat")
> (regexp-match #px"ca{1,2}t" "caaatcat") ; ex10, uses #px
'("cat")
> (regexp-match #rx"(c<*)(a*)" "caat") ; ex11
'("caa" "c" "aa")
> (regexp-match #rx"[^ca]" "caat") ; ex12
'("t")
> (regexp-match #rx".(.)." "cat") ; ex13
'("cat" "a")
> (regexp-match #rx"^a|^c" "cat") ; ex14
'("c")
> (regexp-match #rx"a$|t$" "cat") ; ex15
'("t")
> (regexp-match #px"c(.)\\1t" "caat") ; ex16, uses #px
'("caat" "a")
> (regexp-match #px".\\b." "cat in hat") ; ex17, uses #px
'("t ")
> (regexp-match #px".\\B." "cat in hat") ; ex18, uses #px
'("ca")
> (regexp-match #px"\\p{Ll}" "Cat") ; ex19, uses #px
'("a")
> (regexp-match #px"\\P{Ll}" "cat!") ; ex20, uses #px
'("!")
> (regexp-match #rx"\\|" "c|t") ; ex21
'("|")
> (regexp-match #rx"[a-f]*" "cat") ; ex22
'("ca")
> (regexp-match #px"[a-f\\d]*" "1cat") ; ex23, uses #px
'("1ca")
> (regexp-match #px" [\\w]" "cat hat") ; ex24, uses #px
'(" h")
> (regexp-match #px"t[\\s]" "cat\nhat") ; ex25, uses #px
'("t\n")
> (regexp-match #px"[[:lower:]]+" "Cat") ; ex26, uses #px
'("at")
> (regexp-match #rx"[]]" "c]t") ; ex27
'("]")
> (regexp-match #rx"[-]" "c-t") ; ex28
'("-")
> (regexp-match #rx"[]a[]+" "c[a]t") ; ex29
'("[a]")
> (regexp-match #rx"[a^]+" "ca^t") ; ex30
'("a^")
> (regexp-match #rx".a(?=p)" "cat nap") ; ex31
'("na")
> (regexp-match #rx".a(?!t)" "cat nap") ; ex32
'("na")
> (regexp-match #rx"(?<=n)a." "cat nap") ; ex33
'("ap")
> (regexp-match #rx"(?<!c)a." "cat nap") ; ex34
'("ap")
> (regexp-match #rx"(?i:a)[tp]" "cAT nAp") ; ex35
'("Ap")
> (regexp-match #rx"(?(?<=c)a|b)+" "cabal") ; ex36
'("ab")
> (regexp-match #rx"[^^]+" "^cat^") ; ex37
'("cat")
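A few grammar entries carry no example reference: the m mode, (?>‹regexp›), and a ‹tst› of the form (‹n›). The following is a hedged sketch of how they are expected to behave according to the grammar descriptions; the variable names are illustrative and are not part of the original examples.

```racket
#lang racket/base

;; Multi mode: ^ also matches just after a newline, and . no longer
;; matches a newline.
(define no-multi  (regexp-match #rx"^c" "a\ncat"))      ; => #f
(define multi     (regexp-match #rx"(?m:^c)" "a\ncat")) ; => '("c")
(define dot-any   (regexp-match #rx"a.c" "a\nc"))       ; => '("a\nc")
(define dot-multi (regexp-match #rx"(?m:a.c)" "a\nc"))  ; => #f

;; (?>...) keeps only the first possible match of its body, so a+ cannot
;; give back an "a" for the rest of the pattern to use.
(define backtracks (regexp-match #rx"a+ab" "aaab"))     ; => '("aaab")
(define atomic     (regexp-match #rx"(?>a+)ab" "aaab")) ; => #f

;; (?(1)b|c): match b if group 1 has a match, otherwise match c.
(define cond-yes (regexp-match #rx"(a)?(?(1)b|c)" "ab")) ; => '("ab" "a")
(define cond-no  (regexp-match #rx"(a)?(?(1)b|c)" "c"))  ; => '("c" #f)
```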

Changed in version 8.15.0.8 of package base: Added \X grapheme cluster pattern.

4.8.2 Additional Syntactic Constraints

In addition to matching a grammar, regular expressions must meet two syntactic restrictions:

In a ‹repeat› other than ‹atom›?, the ‹atom› must not match an empty sequence.
In a (?<=‹regexp›) or (?<!‹regexp›), the ‹regexp› must match a bounded sequence only.
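Both restrictions are enforced when a pattern is compiled, so they can be probed by catching the resulting error. The helper valid-pattern? below is hypothetical (it is not part of Racket) and simply reports whether pregexp accepts a source string.

```racket
#lang racket/base

;; Hypothetical helper: #t when str is accepted by pregexp, #f when
;; compiling it raises an error.
(define (valid-pattern? str)
  (with-handlers ([exn:fail? (lambda (e) #f)])
    (pregexp str)
    #t))

;; First restriction: the operand of * must not match an empty sequence.
(define empty-star (valid-pattern? "(a*)*")) ; => #f
;; ‹atom›? is exempt from that restriction.
(define empty-opt  (valid-pattern? "(a*)?")) ; => #t

;; Second restriction: a lookbehind body must match a bounded sequence.
(define unbounded-lookbehind (valid-pattern? "(?<=a*)b"))     ; => #f
(define bounded-lookbehind   (valid-pattern? "(?<=a{1,3})b")) ; => #t
```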

These constraints are checked syntactically by the following type system. A type [n, m] corresponds to an expression that matches between n and m characters. In the rule for (‹regexp›), ‹n› means the number such that the opening parenthesis is the ‹n›th opening parenthesis for collecting match reports. Non-emptiness is inferred for a backreference pattern, \‹n›, so that a backreference can be used for repetition patterns; in the case of mutual dependencies among backreferences, the inference chooses the fixpoint that maximizes non-emptiness. Finiteness is not inferred for backreferences (i.e., a backreference is assumed to match an arbitrarily large sequence). No syntactic constraint prohibits a backreference within the group that it references, although such self references might create a pattern with no possible matches (as in the case of (.\1), although (^.|\1){2} matches an input that starts with the same two characters).

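‹regexp›₁ : [n₁, m₁]    ‹regexp›₂ : [n₂, m₂]
------------------------------------------------
‹regexp›₁|‹regexp›₂ : [min(n₁, n₂), max(m₁, m₂)]

‹pce› : [n₁, m₁]    ‹pces› : [n₂, m₂]
-------------------------------------
‹pce›‹pces› : [n₁+n₂, m₁+m₂]

‹repeat› : [n, m]
------------------
‹repeat›? : [0, m]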