package containers

  1. Overview
  2. Docs
Legend:
Page
Library
Module
Module type
Parameter
Class
Class type
Source

Module CCParseSource

Very Simple Parser Combinators

These combinators can be used to write very simple parsers, for example to extract data from a line-oriented file, or as a replacement to Scanf.

A few examples

Some more advanced example(s) can be found in the /examples directory.

Parse a tree
  open CCParse;;

  type tree = L of int | N of tree * tree;;

  let mk_leaf x = L x
  let mk_node x y = N(x,y)

  let ptree = fix @@ fun self ->
    skip_space *>
      ( (char '(' *> (pure mk_node <*> self <*> self) <* char ')')
        <|>
          (U.int >|= mk_leaf) )
  ;;

  parse_string_exn ptree "(1 (2 3))" ;;
  parse_string_exn ptree "((1 2) (3 (4 5)))" ;;
Parse a list of words
  open Containers.Parse;;
  let p = U.list ~sep:"," U.word;;
  parse_string_exn p "[abc , de, hello ,world  ]";;
Stress Test

This makes a list of 100_000 integers, prints it and parses it back.

  let p = CCParse.(U.list ~sep:"," U.int);;

  let l = CCList.(1 -- 100_000);;
  let l_printed =
    CCFormat.(to_string (within "[" "]" (list ~sep:(return ",@,") int))) l;;

  let l' = CCParse.parse_string_exn p l_printed;;

  assert (l=l');;

Stability guarantees

Some functions are marked "experimental" and are still subject to change.

Sourcetype position

A position in the input. Typically it'll point at the beginning of an error location.

Sourcemodule Position : sig ... end
Sourcemodule Error : sig ... end
Sourcetype +'a or_error = ('a, Error.t) result

'a or_error is either Ok x for some result x : 'a, or an error Error.t.

See stringify_result and Error.to_string to print the error message.

Sourceexception ParseError of Error.t

Input

Combinators

Sourcetype 'a t

The abstract type of parsers that return a value of type 'a (or fail).

  • since 3.6 the type is private.
Sourceval return : 'a -> 'a t

Always succeeds, without consuming its input.

Sourceval pure : 'a -> 'a t

Synonym to return.

Sourceval map : ('a -> 'b) -> 'a t -> 'b t
Sourceval map2 : ('a -> 'b -> 'c) -> 'a t -> 'b t -> 'c t
Sourceval map3 : ('a -> 'b -> 'c -> 'd) -> 'a t -> 'b t -> 'c t -> 'd t
Sourceval bind : ('a -> 'b t) -> 'a t -> 'b t

bind f p results in a new parser which behaves as p then, in case of success, applies f to the result.

  • since 3.6
Sourceval ap : ('a -> 'b) t -> 'a t -> 'b t

Applicative.

  • since 3.6
Sourceval eoi : unit t

Expect the end of input, fails otherwise.

Sourceval empty : unit t

Succeed with ().

  • since 3.6
Sourceval fail : string -> 'a t

fail msg fails with the given message. It can trigger a backtrack.

Sourceval failf : ('a, unit, string, 'b t) format4 -> 'a

Format.sprintf version of fail.

Sourceval fail_lazy : (unit -> string) -> 'a t

Like fail, but only produce an error message on demand.

  • since 3.6
Sourceval parsing : string -> 'a t -> 'a t

parsing s p behaves the same as p, with the information that we are parsing s, if p fails. The message s is added to the error, it does not replace it, not does the location change (the error still points to the same location as in p).

Sourceval set_error_message : string -> 'a t -> 'a t

set_error_message msg p behaves like p, but if p fails, set_error_message msg p fails with msg instead and at the current position. The internal error message of p is just discarded.

  • since 3.6
Sourceval pos : position t

pos returns the current position in the buffer.

EXPERIMENTAL

  • since 3.7
Sourceval with_pos : 'a t -> ('a * position) t

with_pos p behaves like p, but returns the (starting) position along with p's result.

EXPERIMENTAL

  • since 3.6
Sourceval any_char : char t

any_char parses any character. It still fails if the end of input was reached.

  • since 3.6
Sourceval any_char_n : int -> string t

any_char_n len parses exactly len characters from the input. Fails if the input doesn't contain at least len chars.

  • since 3.6
Sourceval char : char -> char t

char c parses the character c and nothing else.

Sourcetype slice

A slice of the input, as returned by some combinators such as split_1 or split_list or take.

The idea is that one can use some parsers to cut the input into slices, e.g. split into lines, or split a line into fields (think CSV or TSV). Then a variety of parsers can be used on each slice to extract data from it using recurse.

Slices contain enough information to make it possible for recurse slice p to report failures (if p fails) using locations from the original input, not relative to the slice. Therefore, even after splitting the input into lines using, say, each_line, a failure to parse the 500th line will be reported at line 500 and not at line 1.

EXPERIMENTAL

  • since 3.6
Sourcemodule Slice : sig ... end

Functions on slices.

Sourceval recurse : slice -> 'a t -> 'a t

recurse slice p parses the slice (most likely obtained via another combinator, such as split_1 or split_n), using p.

The slice contains a position which is used to relocate error messages to their position in the whole input, not just relative to the slice.

EXPERIMENTAL

  • since 3.6
Sourceval set_current_slice : slice -> unit t

set_current_slice slice replaces the parser's state with slice.

EXPERIMENTAL

  • since 3.6
Sourceval chars_fold : f: ('acc -> char -> [ `Continue of 'acc | `Consume_and_stop of 'acc | `Stop of 'acc | `Fail of string ]) -> 'acc -> ('acc * slice) t

chars_fold f acc0 folds over characters of the input. Each char c is passed, along with the current accumulator, to f; f can either:

  • stop, by returning `Stop acc. In this case the final accumulator acc is returned, and c is not consumed.
  • consume char and stop, by returning `Consume_and_stop acc.
  • fail, by returning `Fail msg. In this case the parser fails with the given message.
  • continue, by returning `Continue acc. The parser continues to the next char with the new accumulator.

This is a generalization of of chars_if that allows one to transform characters on the fly, skip some, handle escape sequences, etc. It can also be useful as a base component for a lexer.

  • returns

    a pair of the final accumular, and the slice matched by the fold.

  • since 3.6
Sourceval chars_fold_transduce : f: ('acc -> char -> [ `Continue of 'acc | `Yield of 'acc * char | `Consume_and_stop | `Stop | `Fail of string ]) -> 'acc -> ('acc * string) t

Same as char_fold but with the following differences:

  • returns a string along with the accumulator, rather than the slice of all the characters accepted by `Continue _. The string is built from characters returned by `Yield.
  • new case `Yield (acc, c) adds c to the returned string and continues parsing with acc.
  • since 3.6
Sourceval take : int -> slice t

take len parses exactly len characters from the input. Fails if the input doesn't contain at least len chars.

  • since 3.6
Sourceval take_if : (char -> bool) -> slice t

take_if f takes characters as long as they satisfy the predicate f.

  • since 3.6
Sourceval take1_if : ?descr:string -> (char -> bool) -> slice t

take1_if f takes characters as long as they satisfy the predicate f. Fails if no character satisfies f.

  • parameter descr

    describes what kind of character was expected, in case of error

  • since 3.6
Sourceval char_if : ?descr:string -> (char -> bool) -> char t

char_if f parses a character c if f c = true. Fails if the next char does not satisfy f.

  • parameter descr

    describes what kind of character was expected, in case of error

Sourceval chars_if : (char -> bool) -> string t

chars_if f parses a string of chars that satisfy f. Cannot fail.

Sourceval chars1_if : ?descr:string -> (char -> bool) -> string t

Like chars_if, but accepts only non-empty strings. chars1_if p fails if the string accepted by chars_if p is empty. chars1_if p is equivalent to take1_if p >|= Slice.to_string.

  • parameter descr

    describes what kind of character was expected, in case of error

Sourceval endline : char t

Parse '\n'.

Sourceval space : char t

Tab or space.

Sourceval white : char t

Tab or space or newline.

Sourceval skip_chars : (char -> bool) -> unit t

Skip 0 or more chars satisfying the predicate.

Sourceval skip_space : unit t

Skip ' ' and '\t'.

Sourceval skip_white : unit t

Skip ' ' and '\t' and '\n'.

Sourceval is_alpha : char -> bool

Is the char a letter?

Sourceval is_num : char -> bool

Is the char a digit?

Sourceval is_alpha_num : char -> bool

Is the char a letter or a digit?

Sourceval is_space : char -> bool

True on ' ' and '\t'.

Sourceval is_white : char -> bool

True on ' ' and '\t' and '\n'.

Sourceval suspend : (unit -> 'a t) -> 'a t

suspend f is the same as f (), but evaluates f () only when needed.

A practical use case is to implement recursive parsers manually, as described in fix. The parser is let rec p () = …, and suspend p can be used in the definition to use p.

Sourceval string : string -> string t

string s parses exactly the string s, and nothing else.

Sourceval exact : string -> string t

Alias to string.

  • since 3.6
Sourceval many : 'a t -> 'a list t

many p parses p repeatedly, until p fails, and collects the results into a list.

Sourceval optional : _ t -> unit t

optional p tries to parse p, and return () whether it succeeded or failed. Cannot fail itself. It consumes input if p succeeded (as much as p consumed), but consumes not input if p failed.

  • since 3.6
Sourceval try_ : 'a t -> 'a t

try_ p is just like p (it used to play a role in backtracking semantics but no more).

  • deprecated

    since 3.6 it can just be removed. See try_opt if you want to detect failure.

Sourceval try_opt : 'a t -> 'a option t

try_opt p tries to parse using p, and return Some x if p succeeded with x (and consumes what p consumed). Otherwise it returns None and consumes nothing. This cannot fail.

  • since 3.6
Sourceval many_until : until:_ t -> 'a t -> 'a list t

many_until ~until p parses as many p as it can until the until parser successfully returns. If p fails before that then many_until ~until p fails as well. Typically until can be a closing ')' or another termination condition, and what is consumed by until is also consumed by many_until ~until p.

EXPERIMENTAL

  • since 3.6
Sourceval try_or : 'a t -> f:('a -> 'b t) -> else_:'b t -> 'b t

try_or p1 ~f ~else_:p2 attempts to parse x using p1, and then becomes f x. If p1 fails, then it becomes p2. This can be useful if f is expensive but only ever works if p1 matches (e.g. after an opening parenthesis or some sort of prefix).

  • since 3.6
Sourceval try_or_l : ?msg:string -> ?else_:'a t -> (unit t * 'a t) list -> 'a t

try_or_l ?else_ l tries each pair (test, p) in order. If the n-th test succeeds, then try_or_l l behaves like n-th p, whether p fails or not. If test consumes input, the state is restored before calling p. If they all fail, and else_ is defined, then it behaves like else_. If all fail, and else_ is None, then it fails as well.

This is a performance optimization compared to (<|>). We commit to a branch if the test succeeds, without backtracking at all. It can also provide better error messages, because failures in the parser will not be reported as failures in try_or_l.

See lookahead_ignore for a convenient way of writing the test conditions.

  • parameter msg

    error message if all options fail

EXPERIMENTAL

  • since 3.6
Sourceval or_ : 'a t -> 'a t -> 'a t

or_ p1 p2 tries to parse p1, and if it fails, tries p2 from the same position.

  • since 3.6
Sourceval both : 'a t -> 'b t -> ('a * 'b) t

both a b parses a, then b, then returns the pair of their results.

  • since 3.6
Sourceval many1 : 'a t -> 'a list t

many1 p is like many p excepts it fails if the list is empty (i.e. it needs p to succeed at least once).

Sourceval skip : _ t -> unit t

skip p parses zero or more times p and ignores its result. It is eager, meaning it will continue as long as p succeeds. As soon as p fails, skip p stops consuming any input.

Sourceval sep : by:_ t -> 'a t -> 'a list t

sep ~by p parses a list of p separated by by.

Sourceval sep_until : until:_ t -> by:_ t -> 'a t -> 'a list t

Same as sep but stop when until parses successfully.

  • since 3.6
Sourceval sep1 : by:_ t -> 'a t -> 'a list t

sep1 ~by p parses a non empty list of p, separated by by.

Sourceval lookahead : 'a t -> 'a t

lookahead p behaves like p, except it doesn't consume any input.

EXPERIMENTAL

  • since 3.6
Sourceval lookahead_ignore : 'a t -> unit t

lookahead_ignore p tries to parse input with p, and succeeds if p succeeds. However it doesn't consume any input and returns (), so in effect its only use-case is to detect whether p succeeds, e.g. in try_or_l.

EXPERIMENTAL

  • since 3.6
Sourceval fix : ('a t -> 'a t) -> 'a t

Fixpoint combinator. fix (fun self -> p) is the parser p, in which self refers to the parser p itself (which is useful to parse recursive structures.

An alternative, manual implementation to let p = fix (fun self -> q) is:

 let rec p () =
    let self = suspend p in
    q
Sourceval line : slice t

Parse a line, '\n' excluded, and position the cursor after the '\n'.

  • since 3.6
Sourceval line_str : string t

line_str is line >|= Slice.to_string. It parses the next line and turns the slice into a string. The state points to the character immediately after the '\n' character.

  • since 3.6
Sourceval each_line : 'a t -> 'a list t

each_line p runs p on each line of the input. EXPERIMENTAL

  • since 3.6
Sourceval split_1 : on_char:char -> (slice * slice option) t

split_1 ~on_char looks for on_char in the input, and returns a pair sl1, sl2, where:

  • sl1 is the slice of the input the precedes the first occurrence of on_char, or the whole input if on_char cannot be found. It does not contain on_char.
  • sl2 is the slice that comes after on_char, or None if on_char couldn't be found. It doesn't contain the first occurrence of on_char (if any).

The parser is now positioned at the end of the input.

EXPERIMENTAL

  • since 3.6
Sourceval split_list : on_char:char -> slice list t

split_list ~on_char splits the input on all occurrences of on_char, returning a list of slices.

EXPERIMENTAL

  • since 3.6
Sourceval split_list_at_most : on_char:char -> int -> slice list t

split_list_at_most ~on_char n applies split_1 ~on_char at most n times, to get a list of n+1 elements. The last element might contain on_char. This is useful to limit the amount of work done by split_list.

EXPERIMENTAL

  • since 3.6
Sourceval split_2 : on_char:char -> (slice * slice) t

split_2 ~on_char splits the input into exactly 2 fields, and fails if the split yields less or more than 2 items. EXPERIMENTAL

  • since 3.6
Sourceval split_3 : on_char:char -> (slice * slice * slice) t

See split_2 EXPERIMENTAL

  • since 3.6
Sourceval split_4 : on_char:char -> (slice * slice * slice * slice) t

See split_2 EXPERIMENTAL

  • since 3.6
Sourceval each_split : on_char:char -> 'a t -> 'a list t

split_list_map ~on_char p uses split_list ~on_char to split the input, then parses each chunk of the input thus obtained using p.

The difference with sep ~by:(char on_char) p is that sep calls p first, and only tries to find on_char after p returns. While it is more flexible, this technique also means p has to be careful not to consume on_char by error.

A useful specialization of this is each_line, which is basically each_split ~on_char:'\n' p.

EXPERIMENTAL

  • since 3.6
Sourceval all : slice t

all returns all the unconsumed input as a slice, and consumes it. Use Slice.to_string to turn it into a string.

Note that lookahead all can be used to peek at the rest of the input without consuming anything.

  • since 3.6
Sourceval all_str : string t

all_str accepts all the remaining chars and extracts them into a string. Similar to all but with a string.

EXPERIMENTAL

  • since 3.6
Sourceval memo : 'a t -> 'a t

Memoize the parser. memo p will behave like p, but when called in a state (read: position in input) it has already processed, memo p returns a result directly. The implementation uses an underlying hashtable. This can be costly in memory, but improve the run time a lot if there is a lot of backtracking involving p.

Do not call memo inside other functions, especially with (>>=), map, etc. being so prevalent. Instead the correct way to use it is in a toplevel definition:

  let my_expensive_parser = memo (foo *> bar >>= fun i -> …)

This function is not thread-safe.

Sourceval fix_memo : ('a t -> 'a t) -> 'a t

Like fix, but the fixpoint is memoized.

Infix

Sourcemodule Infix : sig ... end
include module type of Infix
Sourceval (>|=) : 'a t -> ('a -> 'b) -> 'b t

Alias to map. p >|= f parses an item x using p, and returns f x.

Sourceval (>>=) : 'a t -> ('a -> 'b t) -> 'b t

Alias to bind. p >>= f results in a new parser which behaves as p then, in case of success, applies f to the result.

Sourceval (<*>) : ('a -> 'b) t -> 'a t -> 'b t

Applicative.

Sourceval (<*) : 'a t -> _ t -> 'a t

a <* b parses a into x, parses b and ignores its result, and returns x.

Sourceval (*>) : _ t -> 'a t -> 'a t

a *> b parses a, then parses b into x, and returns x. The result of a is ignored.

Sourceval (<|>) : 'a t -> 'a t -> 'a t

Alias to or_.

a <|> b tries to parse a, and if a fails without consuming any input, backtracks and tries to parse b, otherwise it fails as a.

Sourceval (<?>) : 'a t -> string -> 'a t

a <?> msg behaves like a, but if a fails, a <?> msg fails with msg instead. Useful as the last choice in a series of <|>. For example: a <|> b <|> c <?> "expected one of a, b, c".

Sourceval (|||) : 'a t -> 'b t -> ('a * 'b) t

Alias to both. a ||| b parses a, then b, then returns the pair of their results.

  • since 3.6

Let operators on OCaml >= 4.08.0, nothing otherwise

  • since 2.8
Sourceval (let+) : 'a t -> ('a -> 'b) -> 'b t
Sourceval (and+) : 'a t -> 'b t -> ('a * 'b) t
Sourceval (let*) : 'a t -> ('a -> 'b t) -> 'b t
Sourceval (and*) : 'a t -> 'b t -> ('a * 'b) t

Parse input

Sourceval stringify_result : 'a or_error -> ('a, string) result

Turn a Error.t-oriented result into a more basic string result.

  • since 3.6
Sourceval parse_string : 'a t -> string -> ('a, string) result

Parse a string using the parser.

Sourceval parse_string_e : 'a t -> string -> 'a or_error

Version of parse_string that returns a more detailed error.

Sourceval parse_string_exn : 'a t -> string -> 'a
Sourceval parse_file : 'a t -> string -> ('a, string) result

parse_file p filename parses file named filename with p by opening the file and reading it whole.

Sourceval parse_file_e : 'a t -> string -> 'a or_error

Version of parse_file that returns a more detailed error.

Sourceval parse_file_exn : 'a t -> string -> 'a

Same as parse_file, but

Sourcemodule U : sig ... end
Sourcemodule Debug_ : sig ... end

Debugging utils. EXPERIMENTAL

OCaml

Innovation. Community. Security.