package pacomb

  1. Overview
  2. Docs

Module Pacomb.LexSource

Lexing: grouping characters before parsing

It is traditionnal to do parsing in two phases (scanning/parsing). This is not necessary with combinators in general (scannerless). This is still true with Pacomb. However, this makes the grammar more readable to use a lexing phase.

Moreover, lexing is often done with a longuest match rule that is not semantically equivalent to the semantics of context free grammar.

This modules provide combinator to create terminals that the parser will call.

Types and exception

Position in a buffer is a Input.buffer together with an index Input.pos.

Sourcetype idx = Input.idx
Sourcetype 'a lexeme = buf -> idx -> 'a * buf * idx

Type of terminal function, similar to blank, but with a returned value

Sourcetype _ ast =
  1. | Any : char ast
  2. | Any_utf8 : Uchar.t ast
  3. | Any_grapheme : string ast
  4. | Eof : unit ast
  5. | Char : char -> unit ast
  6. | Grapheme : string -> unit ast
  7. | String : string -> unit ast
  8. | Nat : int ast
  9. | Int : int ast
  10. | Float : float ast
  11. | CharLit : char ast
  12. | StringLit : string ast
  13. | Test : (char -> bool) -> char ast
  14. | NotTest : (char -> bool) -> unit ast
  15. | Seq : 'a t * 'b t * ('a -> 'b -> 'c) * 'c Pacomb.Assoc.key -> 'c ast
  16. | Alt : 'a t * 'a t -> 'a ast
  17. | Save : 'a t * (string -> 'a -> 'b) * 'b Pacomb.Assoc.key -> 'b ast
  18. | Option : 'a * 'a t -> 'a ast
  19. | Appl : 'a t * ('a -> 'b) * 'b Pacomb.Assoc.key -> 'b ast
  20. | Star : 'a t * (unit -> 'b) * ('b -> 'a -> 'b) * 'b Pacomb.Assoc.key -> 'b ast
  21. | Plus : 'a t * (unit -> 'b) * ('b -> 'a -> 'b) * 'b Pacomb.Assoc.key -> 'b ast
  22. | Sub : 'a t * ('a -> bool) * 'a Pacomb.Assoc.key -> 'a ast
  23. | Keyword : string * int -> unit ast
  24. | Custom : 'a lexeme * 'a Pacomb.Assoc.key -> 'a ast

ast for terminals, needed for equality

Sourceand 'a terminal = {
  1. n : string;
    (*

    name

    *)
  2. f : 'a lexeme;
    (*

    the terminal itself

    *)
  3. a : 'a ast;
  4. c : Charset.t;
    (*

    the set of characters accepted at the beginning of input

    *)
}

The previous types encapsulated in a record

Sourceand 'a t = 'a terminal

Abbreviation

Sourceexception NoParse

exception when failing,

  • can be raised (but not captured) by terminals
  • can be raised (but not captured) by action code in the grammar, see Combinator.give_up
  • will be raised and captured by Combinator.parse_buffer that will give the most advanced position
Sourceexception Give_up of string

from action ony may give an error message when rejecting a rule

Sourceval give_up : ?msg:string -> unit -> 'a

give_up () rejects parsing from a corresponding semantic action. An error message can be provided. Can be used both in the semantics of terminals and parsing rules.

Combinators to create terminals

Sourceval any : ?name:string -> unit -> char t

accept any character, except eof

Sourceval eof : ?name:string -> unit -> unit t

Terminal accepting the end of a buffer only. remark: eof is automatically added at the end of a grammar by Combinator.parse_buffer. name default is "EOF"

Sourceval char : ?name:string -> char -> unit t

Terminal accepting a given char, remark: char '\255' is equivalent to eof. name default is the given charater.

Sourceval test : ?name:string -> (char -> bool) -> char t

Accept any character for which the test returns true. name default to the result of Charset.show.

Sourceval charset : ?name:string -> Charset.t -> char t

Accept a character in the given charset. name default as in test

Sourceval not_test : ?name:string -> (char -> bool) -> unit t

Reject the input (raises Noparse) if the first character of the input passed the test. Does not read the character if the test fails. name default to "^" prepended to the result of Charset.show.

Sourceval not_charset : ?name:string -> Charset.t -> unit t

Reject the input (raises Noparse) if the first character of the input is in the charset. Does not read the character if not in the charset. name default as in not_test

Sourceval sub : ?name:string -> ?charset:Charset.t -> 'a t -> ('a -> bool) -> 'a t

Does a test on the result of a given lexer and reject if it returns false. You may provide a restricted charset for the set of charaters accepted in the initial position.

Sourceval seq : ?name:string -> 'a t -> 'b t -> ('a -> 'b -> 'c) -> 'c t

Compose two terminals in sequence. name default is the concatenation of the two names.

Sourceval seq1 : ?name:string -> 'a t -> 'b t -> 'a t

variation on the above

Sourceval seq2 : ?name:string -> 'a t -> 'b t -> 'b t
Sourceval seqs : 'a t list -> ('a -> 'a -> 'a) -> 'a t
Sourceval save : ?name:string -> 'a t -> (string -> 'a -> 'b) -> 'b t

save t f save the part of the input parsed by the terminal t and combine it with its semantics using f

Sourceval alt : ?name:string -> 'a t -> 'a t -> 'a t

alt t1 t2 parses the input with t1 or t2. Contrary to grammars, terminals does not use continuations, if t1 succeds, no backtrack will be performed to try t2. For instance,

seq1 (alt (char 'a' ())
           (seq1 (char 'a' ()) (char 'b' ())))
      (char 'b' ())

will reject "ab". If both t1 and t2 accept the input, longuest match is selected. name default to sprintf "(%s)|(%s)" t1.n t2.n.

Sourceval alts : 'a t list -> 'a t
Sourceval option : ?name:string -> 'a -> 'a t -> 'a t

option x t parses the given terminal 0 or 1 time. x is returned if 0. name defaults to sprintf "(%s)?" t.n.

Sourceval appl : ?name:string -> ('a -> 'b) -> 'a t -> 'b t

Applies a function to the result of the given terminal. name defaults to the terminal name.

Sourceval star : ?name:string -> 'a t -> (unit -> 'b) -> ('b -> 'a -> 'b) -> 'b t

star t a f Repetition of a given terminal 0,1 or more times. The type of function to compose the action allows for 'b = Buffer.t for efficiency. The returned value is f ( ... (f(f (a ()) x_1) x_2) ...) x_n if t returns x_1 ... x_n. name defaults to sprintf "(%s)*" t.n

Sourceval plus : ?name:string -> 'a t -> (unit -> 'b) -> ('b -> 'a -> 'b) -> 'b t

Same as above but parses at least once.

Sourceval string : ?name:string -> string -> unit t

string s Accepts only the given string. Raises Invalid_argument if s = "". name defaults to sprintf "%S" s.

Sourceval nat : ?name:string -> unit -> int t

Parses an natural in base 10. "-42" and "-42" are not accepted. name defaults to "NAT"

Sourceval int : ?name:string -> unit -> int t

Parses an integer in base 10. "+42" is accepted. name defaults to "INT"

Sourceval float : ?name:string -> unit -> float t

Parses a float in base 10. ".1" is accepted as "0.1" name defaults to "FLOAT"

Sourceval char_lit : ?name:string -> unit -> char t

Parses a char litteral 'c' using ocaml escaping convention name defaults to "CHARLIT"

Sourceval string_lit : ?name:string -> unit -> string t

Parses a string litteral "cccc" using ocaml escaping convention name defaults to "STRINGLIT"

Sourceval any_utf8 : ?name:string -> unit -> Uchar.t t

Parses a unicode UTF8 char name defaults to "UTF8"

Sourceval utf8 : ?name:string -> Uchar.t -> unit t

utf8 c parses a specific unicode char and returns (), name defaults to the string representing the char

Sourceval any_grapheme : ?name:string -> unit -> string t

Parses any utf8 grapheme. name defaults to "GRAPHEME"

Sourceval grapheme : ?name:string -> string -> unit t

grapheme s parses the given utf8 grapheme and return (). The difference with string s x is that if the input starts with a grapheme s' such that s is a strict prefix of s', parsing will fail. name defaults to "GRAPHEME("^s^")"

Sourceval accept_empty : 'a t -> bool

Test wether a terminal accept the empty string. Such a terminal are illegal in a grammar, but may be used in combinator below to create terminals

Sourceval test_from_lex : bool t -> buf -> idx -> buf -> idx -> bool

Test constructor for the test constructor in Grammar

Sourceval blank_test_from_lex : bool t -> buf -> idx -> buf -> idx -> bool
Sourceval eq : 'a t -> 'b t -> ('a, 'b) Pacomb.Assoc.eq

equality, incomplete in particular for "alt"

Sourceval custom : 'a lexeme -> 'a ast

If you build custom lexeme, you need to use this to fill the a field of the record

Sourceval default : 'a -> 'a option -> 'a

where to put it ...

OCaml

Innovation. Community. Security.