package pcre2
Module Pcre2
Perl Compatibility Regular Expressions for OCaml
Exceptions
type error =
| Partial
(* String only matched the pattern partially. *)
| BadPattern of string * int
(* BadPattern (msg, pos): regular expression is malformed. The reason is in msg, the position of the error in the pattern in pos. *)
| BadUTF
(* UTF string being matched is invalid. *)
| BadUTFOffset
(* Gets raised when a UTF string being matched with offset is invalid. *)
| MatchLimit
(* Maximum allowed number of match attempts with backtracking or recursion is reached during matching. ALL FUNCTIONS CALLING THE MATCHING ENGINE MAY RAISE IT!!! *)
| DepthLimit
| WorkspaceSize
(* Raised by pcre2_dfa_match when the provided workspace array is too small. See documentation on pcre2_dfa_match for details on workspace array sizing. *)
| InternalError of string
(* InternalError msg: C-library exhibits unknown/undefined behaviour. The reason is in msg. *)
Backtrack is used in callout functions to force backtracking.
Regexp_or (pat, error) gets raised for sub-pattern pat by regexp_or if it failed to compile.
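A minimal sketch of handling a malformed pattern. It assumes the errors above are delivered through an exception `Pcre2.Error of error`, whose declaration is not shown in this listing; `compile_result` is a hypothetical helper, not part of the library.

```ocaml
(* Sketch: turn a pattern compilation failure into a result value.
   Assumes errors are raised as [Pcre2.Error]; [compile_result] is a
   hypothetical helper introduced for illustration only. *)
let compile_result (pat : string) : (Pcre2.regexp, string) result =
  try Ok (Pcre2.regexp pat) with
  | Pcre2.Error (Pcre2.BadPattern (msg, pos)) ->
      Error (Printf.sprintf "bad pattern at offset %d: %s" pos msg)

let () =
  match compile_result "(unclosed" with
  | Ok _ -> print_endline "compiled"
  | Error e -> print_endline e
```

Only BadPattern is caught here, so any other error still propagates to the caller.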
Compilation and runtime flags and their conversion functions
Internal representation of compilation flags
Internal representation of runtime flags
and cflag = [
| `ALLOW_EMPTY_CLASS
(* Allow empty classes *)
| `ALT_BSUX
(* Alternative handling of \u, \U, and \x *)
| `ALT_CIRCUMFLEX
(* Alternative handling of ^ in multiline mode *)
| `ALT_VERBNAMES
(* Process backslashes in verb names *)
| `ANCHORED
(* Pattern matches only at start of string *)
| `AUTO_CALLOUT
(* Automatically inserts callouts with id 255 before each pattern item *)
| `CASELESS
(* Case insensitive matching *)
| `DOLLAR_ENDONLY
(* '$' in pattern matches only at end of string *)
| `DOTALL
(* '.' matches all characters (newlines, too) *)
| `DUPNAMES
(* Allow duplicate names for subpatterns *)
| `ENDANCHORED
(* Pattern can match only at end of subject *)
| `EXTENDED
(* Ignores whitespace and PERL-comments. Behaves like the '/x'-option in PERL *)
| `EXTENDED_MORE
| `FIRSTLINE
(* Unanchored patterns must match before/at first NL *)
| `LITERAL
(* Pattern characters are all literal *)
| `MATCH_INVALID_UTF
(* Enable support for matching invalid UTF *)
| `MATCH_UNSET_BACKREF
(* Match unset backreferences *)
| `MULTILINE
(* '^' and '$' match before/after newlines, not just at the beginning/end of a string *)
| `NEVER_BACKSLASH_C
(* Lock out the use of \C in patterns *)
| `NEVER_UCP
(* Lock out UCP, e.g. via (*UCP) *)
| `NEVER_UTF
(* Lock out UTF, e.g. via (*UTF) *)
| `NO_AUTO_CAPTURE
(* Disables the use of numbered capturing parentheses *)
| `NO_AUTO_POSSESS
(* Disable auto-possessification *)
| `NO_DOTSTAR_ANCHOR
(* Disable automatic anchoring for .* *)
| `NO_START_OPTIMIZE
(* Disable match-time start optimizations *)
| `NO_UTF_CHECK
(* Do not check the pattern for UTF validity (only relevant if UTF is set). WARNING: with this flag enabled, invalid UTF strings may cause a crash, loop, or give incorrect results *)
| `UCP
(* Use Unicode properties for \d, \w, etc. *)
| `UNGREEDY
(* Quantifiers are not greedy by default and become greedy only if followed by '?' *)
| `USE_OFFSET_LIMIT
(* Enable offset limit for unanchored matching *)
| `UTF
(* Treat pattern and subjects as UTF strings *)
]
Compilation flags
cflags cflag_list converts a list of compilation flags to their internal representation.
cflag_list cflags converts internal representation of compilation flags to a list.
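As a hedged illustration of the two ways to pass compilation flags, the sketch below compiles the same pattern once from a flag list and once from a precomputed internal representation. It relies on regexp and pmatch as declared later in this module; the pattern and subject are made up for the example.

```ocaml
(* Compile directly from a list of compilation flags. *)
let rex = Pcre2.regexp ~flags:[`CASELESS; `MULTILINE] "^hello$"

(* Convert the flag list once and reuse the internal representation. *)
let iflags = Pcre2.cflags [`CASELESS; `MULTILINE]
let rex' = Pcre2.regexp ~iflags "^hello$"

let subj = "first line\nHELLO\nlast line"

let () =
  Printf.printf "%b %b\n" (Pcre2.pmatch ~rex subj) (Pcre2.pmatch ~rex:rex' subj)
```

Converting once with cflags can be convenient when many patterns share the same flag set.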
type rflag = [
| `ANCHORED
(* Match only at the first position *)
| `COPY_MATCHED_SUBJECT
(* On success, make a private subject copy *)
| `DFA_RESTART
(* Causes matching to proceed presuming the subject string is further to one partially matched previously using the same int-array working set. May only be used with pcre2_dfa_match or unsafe_pcre2_dfa_match, and should always be paired with `PARTIAL. *)
| `DFA_SHORTEST
(* Return only the shortest match *)
| `ENDANCHORED
(* Pattern can match only at end of subject *)
| `NOTBOL
(* Beginning of string is not treated as beginning of line *)
| `NOTEOL
(* End of string is not treated as end of line *)
| `NOTEMPTY
(* An empty string is not a valid match *)
| `NOTEMPTY_ATSTART
(* An empty string at the start of the subject is not a valid match *)
| `NO_JIT
(* Do not use JIT matching *)
| `NO_UTF_CHECK
(* Do not check the subject for UTF validity (only relevant if PCRE2_UTF was set at compile time) *)
| `PARTIAL_HARD
(* Throw Pcre2.Partial for a partial match even if there is a full match *)
| `PARTIAL_SOFT
(* Throw Pcre2.Partial for a partial match if no full matches are found *)
]
Runtime flags
rflags rflag_list converts a list of runtime flags to their internal representation.
rflag_list rflags converts internal representation of runtime flags to a list.
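Runtime flags are supplied per match rather than at compile time. A small sketch using `ANCHORED with pmatch (declared later in this module); the pattern and subject are illustrative only.

```ocaml
(* Without runtime flags, "b" is found anywhere in the subject. *)
let unanchored = Pcre2.pmatch ~pat:"b" "ab"

(* With `ANCHORED, the match must begin at the first position. *)
let anchored = Pcre2.pmatch ~flags:[`ANCHORED] ~pat:"b" "ab"

(* The same flag passed in its internal representation via ?iflags. *)
let anchored' = Pcre2.pmatch ~iflags:(Pcre2.rflags [`ANCHORED]) ~pat:"b" "ab"

let () =
  Printf.printf "unanchored=%b anchored=%b anchored'=%b\n"
    unanchored anchored anchored'
```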
Information on the PCRE2-configuration (build-time options)
Version information
Version of the PCRE2-C-library
Indicates whether unicode support is enabled
Character used as newline
Number of bytes used for internal linkage of regular expressions
Default limit for calls to internal matching function
Default limit for depth of nested backtracking
Indicates use of stack recursion in matching function
Information on patterns
type firstcodeunit_info = [
| `Char of char
(* Fixed first character *)
| `Start_only
(* Pattern matches at beginning and end of newlines *)
| `ANCHORED
(* Pattern is anchored *)
]
Information on matching of "first chars" in patterns
Compiled regular expressions
firstcodeunit regexp
get_match_limit rex
get_depth_limit rex
Compilation of patterns
Alternative set of char tables for pattern matching
val regexp :
?limit:int ->
?depth_limit:int ->
?iflags:icflag ->
?flags:cflag list ->
?chtables:chtables ->
string ->
regexp
regexp ?limit ?depth_limit ?iflags ?flags ?chtables pattern compiles pattern with flags when given, with iflags otherwise, and with char tables chtables. If limit is specified, this sets a limit to the amount of recursion and backtracking (only lower than the builtin default!). If this limit is exceeded, MatchLimit will be raised during matching.
For detailed documentation on how you can specify PERL-style regular expressions (= patterns), please consult the PCRE2-documentation ("man pcre2pattern") or PERL-manuals.
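The effect of ?limit can be sketched with a pattern that is prone to heavy backtracking. This is an illustration under assumptions: the exception wrapper `Pcre2.Error` is not shown in this listing, the `outcome` type and `try_match` helper are hypothetical, and whether the limit is actually reached depends on the PCRE2 build and its optimizations.

```ocaml
(* A pattern prone to heavy backtracking, compiled with a small limit. *)
let rex = Pcre2.regexp ~limit:1000 "(a+)+$"

(* Hypothetical result type for this example. *)
type outcome = Matched | No_match | Limit_hit

let try_match subj =
  try if Pcre2.pmatch ~rex subj then Matched else No_match with
  | Pcre2.Error Pcre2.MatchLimit -> Limit_hit

let () =
  (* The trailing 'b' makes a match impossible, so the engine either
     gives up quickly or runs into the limit. *)
  match try_match (String.make 25 'a' ^ "b") with
  | Matched -> print_endline "matched"
  | No_match -> print_endline "no match"
  | Limit_hit -> print_endline "match limit reached"
```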
val regexp_or :
?limit:int ->
?depth_limit:int ->
?iflags:icflag ->
?flags:cflag list ->
?chtables:chtables ->
string list ->
regexp
regexp_or ?limit ?depth_limit ?iflags ?flags ?chtables patterns is like regexp, but combines patterns as alternatives (or-patterns) into one regular expression.
quote str
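A short sketch combining regexp_or with quote to match any of several literal strings. It assumes quote escapes regular-expression metacharacters so that its result matches the input literally, which this listing does not spell out; the literals are made up.

```ocaml
(* Literal strings that contain regular-expression metacharacters. *)
let literals = ["a.b"; "c+d"]

(* Quote each literal, then combine them as alternatives. *)
let rex = Pcre2.regexp_or (List.map Pcre2.quote literals)

let () =
  List.iter
    (fun s -> Printf.printf "%S -> %b\n" s (Pcre2.pmatch ~rex s))
    ["xa.by"; "c+d"; "aXb"; "ccd"]
```

Without quoting, "a.b" would also accept "aXb" and "c+d" would also accept "ccd".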
Subpattern extraction
Information on substrings after pattern matching
get_subject substrings
num_of_subs substrings
get_substring substrings n
get_substring_ofs substrings n
get_substrings ?full_match substrings
get_opt_substrings ?full_match substrings
get_named_substring rex name substrings
get_named_substring_ofs rex name substrings
Callouts
type callout_data = {
callout_number : int;
(* Callout number *)
substrings : substrings;
(* Substrings matched so far *)
start_match : int;
(* Subject start offset of current match attempt *)
current_position : int;
(* Subject offset of current match pointer *)
capture_top : int;
(* Number of the highest captured substring so far *)
capture_last : int;
(* Number of the most recently captured substring *)
pattern_position : int;
(* Offset of next match item in pattern string *)
next_item_length : int;
(* Length of next match item in pattern string *)
}
Type of callout functions
Callouts are referred to in patterns as "(?Cn)" where "n" is a callout_number ranging from 0 to 255. Substrings captured so far are accessible as usual via substrings. You will have to consider capture_top and capture_last to know about the current state of valid substrings.
By raising exception Backtrack within a callout function, the user can force the pattern matching engine to backtrack to other possible solutions. Other exceptions will terminate matching immediately and return control to OCaml.
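A hedged sketch of both behaviours described above. It assumes a callout function has type `callout_data -> unit` (the type's definition is not shown in this listing); `record` and `reject` are hypothetical names.

```ocaml
let seen = ref []

(* Record the number of every callout the engine reaches. *)
let record (d : Pcre2.callout_data) =
  seen := d.Pcre2.callout_number :: !seen

let matched = Pcre2.pmatch ~pat:"a(?C1)b(?C2)" ~callout:record "ab"

(* A callout that always raises Backtrack rejects every candidate, so
   the overall match is expected to fail. *)
let reject (_ : Pcre2.callout_data) = raise Pcre2.Backtrack
let rejected = not (Pcre2.pmatch ~pat:"a(?C1)" ~callout:reject "aa")

let () =
  Printf.printf "matched=%b callouts=%d rejected=%b\n"
    matched (List.length !seen) rejected
```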
Matching of patterns and subpattern extraction
val pcre2_match :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
string ->
int array
pcre2_match ?iflags ?flags ?rex ?pat ?pos ?callout subj
val pcre2_dfa_match :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
?workspace:int array ->
string ->
int array
pcre2_dfa_match ?iflags ?flags ?rex ?pat ?pos ?callout ?workspace subj invokes the "alternative" DFA matching function.
Note that the returned array of offsets is quite different from that returned by pcre2_match et al. The motivating use case for the DFA match function is to be able to restart a partial match with N additional input segments. Because the match function/workspace does not store segments seen previously, the offsets returned when a match completes will refer only to the matching portion of the last subject string provided. Thus, returned offsets from this function should not be used to support extracting captured submatches. If you need to capture submatches from a series of inputs incrementally matched with this function, you'll need to concatenate those inputs that yield a successful match here and re-run the same pattern against that single subject string.
Aside from an absolute minimum of 20, PCRE2 does not provide any guidance regarding the size of the workspace array needed by any given pattern. Therefore, it is wise to handle the possible WorkspaceSize error appropriately. If it is raised, you can allocate a new, larger workspace array and begin the DFA matching process again.
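The retry strategy described above can be sketched as follows. This is an illustration, not the library's own mechanism: `dfa_match_growing`, its `initial` and `max_size` parameters, and the doubling policy are hypothetical, and the `Pcre2.Error` wrapper is assumed since its declaration is not shown in this listing.

```ocaml
(* Hypothetical helper: retry DFA matching with a doubled workspace each
   time WorkspaceSize is reported, up to a size cap. Beyond the cap the
   error propagates to the caller. *)
let dfa_match_growing ?(initial = 20) ?(max_size = 1 lsl 20) ~rex subj =
  let rec go size =
    let workspace = Array.make size 0 in
    try Pcre2.pcre2_dfa_match ~rex ~workspace subj with
    | Pcre2.Error Pcre2.WorkspaceSize when size < max_size -> go (size * 2)
  in
  go initial

let () =
  let rex = Pcre2.regexp "a+b" in
  match dfa_match_growing ~rex "xxaaab" with
  | offsets ->
      Printf.printf "offset array of length %d\n" (Array.length offsets)
  | exception Not_found -> print_endline "no match"
```

Each retry starts the match from scratch, so this sketch does not cover restarting a partial match with `DFA_RESTART.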
val exec :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
string ->
substrings
exec ?iflags ?flags ?rex ?pat ?pos ?callout subj
val exec_all :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
string ->
substrings array
exec_all ?iflags ?flags ?rex ?pat ?pos ?callout subj
val next_match :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
substrings ->
substrings
next_match ?iflags ?flags ?rex ?pat ?pos ?callout substrs
val extract :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?full_match:bool ->
?callout:callout ->
string ->
string array
extract ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj
val extract_opt :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?full_match:bool ->
?callout:callout ->
string ->
string option array
extract_opt ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj
val extract_all :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?full_match:bool ->
?callout:callout ->
string ->
string array array
extract_all ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj
val extract_all_opt :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?full_match:bool ->
?callout:callout ->
string ->
string option array array
extract_all_opt ?iflags ?flags ?rex ?pat ?pos ?full_match ?callout subj
val pmatch :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
string ->
bool
pmatch ?iflags ?flags ?rex ?pat ?pos ?callout subj
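A brief sketch of how the matching functions above relate to each other, based on their signatures. The pattern and subject are made up; the comments assume that group 0 denotes the whole match and that ?full_match defaults to including it.

```ocaml
let rex = Pcre2.regexp {|(\w+)@(\w+)\.com|}
let subj = "alice@example.com, bob@test.com"

(* exec yields the substrings of the first match. *)
let subs = Pcre2.exec ~rex subj
let user = Pcre2.get_substring subs 1

(* extract yields the groups of the first match as a string array. *)
let first = Pcre2.extract ~rex subj

(* extract_all yields one such array per match. *)
let users = Array.map (fun groups -> groups.(1)) (Pcre2.extract_all ~rex subj)

(* pmatch only reports whether the subject matches at all. *)
let has_match = Pcre2.pmatch ~rex "no address here"

let () =
  Printf.printf "%s | %s | %s | %b\n" user
    (String.concat "," (Array.to_list first))
    (String.concat "," (Array.to_list users))
    has_match
```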
String substitution
Information on substitution patterns
subst str converts the string str representing a substitution pattern to the internal representation.
The contents of the substitution string str can be normal text mixed with any of the following (mostly as in PERL):
- $[0-9]+ - a "$" immediately followed by an arbitrary number. "$0" stands for the name of the executable, any other number for the n-th backreference.
- $& - the whole matched pattern
- $` - the text before the match
- $' - the text after the match
- $+ - the last group that matched
- $$ - a single "$"
- $! - delimiter which does not appear in the substitution. Can be used to separate "$[0-9]+" from an immediately following other number.
val replace :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?itempl:substitution ->
?templ:string ->
?callout:callout ->
string ->
string
replace ?iflags ?flags ?rex ?pat ?pos ?itempl ?templ ?callout subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the substitution string templ when given, itempl otherwise. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
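A small sketch of replace with substitution templates, using the backreference forms listed for subst. The patterns and subjects are illustrative only.

```ocaml
(* Swap two words using numbered backreferences. *)
let swapped =
  Pcre2.replace ~pat:{|(\w+) (\w+)|} ~templ:"$2 $1" "hello world"

(* Wrap every run of digits in brackets using the whole match ($&). *)
let bracketed = Pcre2.replace ~pat:{|\d+|} ~templ:"[$&]" "a1b22"

(* The same swap with a template converted once via subst. *)
let itempl = Pcre2.subst "$2 $1"
let swapped' = Pcre2.replace ~pat:{|(\w+) (\w+)|} ~itempl "hello world"

let () = Printf.printf "%s | %s | %s\n" swapped bracketed swapped'
```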
val qreplace :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?templ:string ->
?callout:callout ->
string ->
string
qreplace ?iflags ?flags ?rex ?pat ?pos ?templ ?callout subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the string templ. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
val substitute_substrings :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
subst:(substrings -> string) ->
string ->
string
substitute_substrings ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the substrings of the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
val substitute :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
subst:(string -> string) ->
string ->
string
substitute ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces all substrings of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
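A sketch contrasting the two function-based substitutions: substitute passes the matched text to the function, while substitute_substrings passes the substrings value so that individual groups can be read. The inputs are made up for the example.

```ocaml
(* Capitalise the first letter of every word. *)
let titled =
  Pcre2.substitute ~pat:{|\b\w|} ~subst:String.uppercase_ascii
    "hello big world"

(* Double every number, reading group 1 from the substrings value. *)
let doubled =
  Pcre2.substitute_substrings ~pat:{|(\d+)|}
    ~subst:(fun subs ->
      string_of_int (2 * int_of_string (Pcre2.get_substring subs 1)))
    "3 apples, 10 pears"

let () = Printf.printf "%s | %s\n" titled doubled
```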
val replace_first :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?itempl:substitution ->
?templ:string ->
?callout:callout ->
string ->
string
replace_first ?iflags ?flags ?rex ?pat ?pos ?itempl ?templ ?callout subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the substitution string templ when given, itempl otherwise. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
val qreplace_first :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?templ:string ->
?callout:callout ->
string ->
string
qreplace_first ?iflags ?flags ?rex ?pat ?pos ?templ ?callout subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the string templ. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
val substitute_substrings_first :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
subst:(substrings -> string) ->
string ->
string
substitute_substrings_first ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the substrings of the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
val substitute_first :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?callout:callout ->
subst:(string -> string) ->
string ->
string
substitute_first ?iflags ?flags ?rex ?pat ?pos ?callout ~subst subj replaces the first substring of subj matching pattern pat when given, regular expression rex otherwise, starting at position pos with the result of function subst applied to the match. Uses flags when given, the precompiled iflags otherwise. Callouts are handled by callout.
Splitting
val split :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?max:int ->
?callout:callout ->
string ->
string list
split ?iflags ?flags ?rex ?pat ?pos ?max ?callout subj splits subj into a list of at most max strings, using as delimiter pattern pat when given, regular expression rex otherwise, starting at position pos. Uses flags when given, the precompiled iflags otherwise. If max is zero, trailing empty fields are stripped. If it is negative, it is treated as arbitrarily large. If neither pat nor rex is specified, leading whitespace will be stripped! Should behave exactly as in PERL. Callouts are handled by callout.
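A minimal sketch of split with an explicit delimiter pattern and with no pattern at all. The second call assumes the default delimiter is whitespace, which the note above about stripped leading whitespace suggests but does not state outright.

```ocaml
(* Split on commas, tolerating surrounding whitespace. *)
let fields = Pcre2.split ~pat:{|\s*,\s*|} "a , b,c"

(* No pat or rex: assumed to split on whitespace, with leading
   whitespace stripped. *)
let words = Pcre2.split "  foo bar   baz"

let () =
  Printf.printf "%s | %s\n" (String.concat ";" fields) (String.concat ";" words)
```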
val asplit :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?max:int ->
?callout:callout ->
string ->
string array
asplit ?iflags ?flags ?rex ?pat ?pos ?max ?callout subj is the same as Pcre2.split but returns an array instead of a list.
Result of a Pcre2.full_split
val full_split :
?iflags:irflag ->
?flags:rflag list ->
?rex:regexp ->
?pat:string ->
?pos:int ->
?max:int ->
?callout:callout ->
string ->
split_result list
full_split ?iflags ?flags ?rex ?pat ?pos ?max ?callout subj splits subj into a list of at most max elements of type "split_result", using as delimiter pattern pat when given, regular expression rex otherwise, starting at position pos. Uses flags when given, the precompiled iflags otherwise. If max is zero, trailing empty fields are stripped. If it is negative, it is treated as arbitrarily large. Should behave exactly as in PERL. Callouts are handled by callout.
Additional convenience functions
foreach_line ?ic f applies f to each line in input channel ic until the end-of-file is reached.
foreach_file filenames f opens each file in the list filenames for input and applies f to each filename and the corresponding channel. Channels are closed after each operation (even when exceptions occur - they get reraised afterwards!).
UNSAFE STUFF - USE WITH CAUTION!
val unsafe_pcre2_match :
irflag ->
regexp ->
pos:int ->
subj_start:int ->
subj:string ->
int array ->
callout option ->
unit
unsafe_pcre2_match flags rex ~pos ~subj_start ~subj offset_vector callout. You should read the C-source to know what happens. If you do not understand it - don't use this function!
make_ovector regexp calculates the tuple (subgroups2, ovector) which is the number of subgroup offsets and the offset array.
val unsafe_pcre2_dfa_match :
irflag ->
regexp ->
pos:int ->
subj_start:int ->
subj:string ->
int array ->
callout option ->
workspace:int array ->
unit
unsafe_pcre2_dfa_match flags rex ~pos ~subj_start ~subj offset_vector callout ~workspace. You should read the C-source to know what happens. If you do not understand it - don't use this function!