This module is roughly equivalent to the module Lexing from the OCaml standard library, except that its lexbuffers handle Unicode code points (OCaml type: Uchar.t in the range 0..0x10ffff) instead of bytes (OCaml type: char).
It is possible to have sedlex-generated lexers work on a custom implementation of lex buffers. To do this, define a module L which implements the start, next, mark and backtrack functions (see the Internal interface section below for a specification). They need not work on a type named lexbuf: you can use any type name you want. Then, in your sedlex-processed source, bind this module to the name Sedlexing (for instance, with a local module definition: let module Sedlexing = L in ...).
Of course, you'll probably want to define functions like lexeme to be used in the lexers' semantic actions.
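As an illustration, here is a minimal sketch of such a module over an in-memory array of code points. The module name L comes from the text above; the record fields, of_array and the exact behaviour of mark (storing an integer together with the current position) are assumptions made for this example, not sedlex's own implementation.

```ocaml
(* A hypothetical custom lex buffer over an already-decoded input. *)
module L = struct
  type t = {
    input : Uchar.t array;      (* the whole input, as code points *)
    mutable start_pos : int;    (* where the current token begins *)
    mutable pos : int;          (* current position *)
    mutable marked_pos : int;   (* the "backtrack" position *)
    mutable marked_val : int;   (* the internal integer slot *)
  }

  let of_array input =
    { input; start_pos = 0; pos = 0; marked_pos = 0; marked_val = (-1) }

  (* Begin a new token at the current position and reset the slot. *)
  let start b =
    b.start_pos <- b.pos;
    b.marked_pos <- b.pos;
    b.marked_val <- (-1)

  (* Return the next code point, or None at end of input. *)
  let next b =
    if b.pos >= Array.length b.input then None
    else begin
      let c = b.input.(b.pos) in
      b.pos <- b.pos + 1;
      Some c
    end

  (* Assumed behaviour: remember i and the current position. *)
  let mark b i =
    b.marked_pos <- b.pos;
    b.marked_val <- i

  (* Go back to the marked position and report the stored integer. *)
  let backtrack b =
    b.pos <- b.marked_pos;
    b.marked_val

  (* Helper for semantic actions: the code points matched so far. *)
  let lexeme b = Array.sub b.input b.start_pos (b.pos - b.start_pos)
end
```

In a sedlex-processed file this module would then be bound with let module Sedlexing = L in ... as described above. Depending on the sedlex version, the generated code may call further functions (for example __private__next_int), so a real replacement module may need more than the four functions shown here.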
The type of lexer buffers. A lexer buffer is the argument passed to the scanning functions defined by the generated lexers. The lexer buffer holds the internal information for the scanners, including the code points of the token currently scanned, its position from the beginning of the input stream, and the current position of the lexer.
Create a generic lexer buffer. When the lexer needs more characters, it will call the given function, giving it an array of Uchars a, a position pos and a code point count n. The function should put n code points or fewer in a, starting at position pos, and return the number of code points provided. A return value of 0 means end of input. The bytes_per_char argument is optional. If unspecified, byte positions are the same as code point positions.
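The refill contract described above can be sketched as follows. The name make_refill and the fixed in-memory source are illustrative assumptions; a real refill function would typically decode code points from a channel or a socket instead.

```ocaml
(* Build a refill function that serves code points from a fixed array.
   make_refill is a hypothetical helper, not part of sedlex. *)
let make_refill (src : Uchar.t array) : Uchar.t array -> int -> int -> int =
  let consumed = ref 0 in
  fun buf pos n ->
    let remaining = Array.length src - !consumed in
    (* Never provide more than n code points, nor more than we have. *)
    let k = min n remaining in
    Array.blit src !consumed buf pos k;
    consumed := !consumed + k;
    (* 0 is returned once the source is exhausted: end of input. *)
    k
```

The resulting closure has the shape expected for the function argument of the constructor described above: it writes at most n code points into the array starting at pos and reports how many it wrote.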
Create a lexbuf from an array of Unicode code points. bytes_per_char is optional. If unspecified, byte positions are the same as code point positions.
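For input that was originally UTF-8, a plausible bytes_per_char is the UTF-8 width of each code point. The sketch below assumes that bytes_per_char receives one code point and returns the number of bytes it occupied in the source encoding; the signature is not stated in this text, so treat it as an assumption. The name utf8_width is hypothetical.

```ocaml
(* Number of bytes a code point occupies when encoded as UTF-8. *)
let utf8_width (u : Uchar.t) : int =
  let c = Uchar.to_int u in
  if c < 0x80 then 1          (* ASCII *)
  else if c < 0x800 then 2    (* e.g. U+00E9 *)
  else if c < 0x10000 then 3  (* e.g. U+20AC *)
  else 4                      (* supplementary planes *)
```

With such a function supplied, byte positions would track offsets in the original UTF-8 text rather than coinciding with code point positions.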
Interface for lexers' semantic actions
The following functions can be called from the semantic actions of lexer definitions. They give access to the character string matched by the regular expression associated with the semantic action.
Sedlexing.lexeme_start lexbuf returns the offset in the input stream of the first code point of the matched string. The first code point of the stream has offset 0.
Sedlexing.lexeme_bytes_start lexbuf returns the offset in the input stream of the first byte of the matched string. The first byte of the stream has offset 0.
Sedlexing.lexeme_end lexbuf returns the offset in the input stream of the code point following the last code point of the matched string. The first code point of the stream has offset 0.
Sedlexing.lexeme_bytes_end lexbuf returns the offset in the input stream of the byte following the last code point of the matched string. The first byte of the stream has offset 0.
Sedlexing.lexeme_length lexbuf returns the difference (Sedlexing.lexeme_end lexbuf) - (Sedlexing.lexeme_start lexbuf), that is, the length (in code points) of the matched string.
Sedlexing.lexeme_bytes_length lexbuf returns the difference (Sedlexing.lexeme_bytes_end lexbuf) - (Sedlexing.lexeme_bytes_start lexbuf), that is, the length (in bytes) of the matched string.
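Code point offsets and byte offsets diverge as soon as the input contains non-ASCII characters. The following stand-alone sketch models only the arithmetic and does not call sedlex; it assumes UTF-8 input, and all names in it are hypothetical.

```ocaml
(* UTF-8 width of one code point (assumed source encoding). *)
let utf8_width (u : Uchar.t) : int =
  let c = Uchar.to_int u in
  if c < 0x80 then 1 else if c < 0x800 then 2
  else if c < 0x10000 then 3 else 4

(* Byte offset of the code point at index i: sum of the widths before it. *)
let byte_offset (input : Uchar.t array) (i : int) : int =
  let total = ref 0 in
  for k = 0 to i - 1 do
    total := !total + utf8_width input.(k)
  done;
  !total

(* The input "a é € b" as code points; the token covers "é€". *)
let input = Array.map Uchar.of_int [| 0x61; 0xE9; 0x20AC; 0x62 |]
let tok_start = 1   (* code point offset of the first matched code point *)
let tok_end = 3     (* code point offset just past the match *)

let code_point_length = tok_end - tok_start    (* 2 code points *)
let bytes_start = byte_offset input tok_start  (* 1 byte precedes the match *)
let bytes_end = byte_offset input tok_end      (* 1 + 2 + 3 = 6 *)
let bytes_length = bytes_end - bytes_start     (* 5 bytes *)
```

The same token is therefore 2 code points long but 5 bytes long, and its end offset is 3 in code points but 6 in bytes.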
Sedlexing.lexing_positions lexbuf returns the start and end positions, in code points, of the current token, as a pair of records of type Lexing.position. This is intended for consumption by parsers like those generated by Menhir.
Sedlexing.lexing_bytes_positions lexbuf returns the start and end positions, in bytes, of the current token, as a pair of records of type Lexing.position. This is intended for consumption by parsers like those generated by Menhir.
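Since these positions use the standard Lexing.position record, they can be turned into error locations with the usual field arithmetic. The helper below is a sketch that assumes the result is a (start, end) pair; describe_span is a hypothetical name.

```ocaml
(* Render a (start, end) pair of positions as "file:line:col-line:col".
   The column is the offset from the beginning of the current line. *)
let describe_span ((s, e) : Lexing.position * Lexing.position) : string =
  Printf.sprintf "%s:%d:%d-%d:%d"
    s.Lexing.pos_fname
    s.Lexing.pos_lnum (s.Lexing.pos_cnum - s.Lexing.pos_bol)
    e.Lexing.pos_lnum (e.Lexing.pos_cnum - e.Lexing.pos_bol)
```

In a semantic action this could be applied as describe_span (Sedlexing.lexing_positions lexbuf); the columns are then counted in code points, or in bytes when the byte-based variant is used.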
Sedlexing.new_line lexbuf increments the line count and sets the beginning of line to the current position, as though a newline character had been encountered in the input.
Sedlexing.rollback lexbuf puts lexbuf back in its configuration before the last lexeme was matched. It is then possible to use another lexer to parse the same characters again. The other functions above in this section should not be used in the semantic action after a call to Sedlexing.rollback.
Internal interface
These functions are used internally by the lexers. They could be used to write lexers by hand, or with a lexer generator different from sedlex. The lexer buffers have a unique internal slot that can store an integer. They also store a "backtrack" position.
start lexbuf informs the lexer buffer that any code points up to the current position can be discarded. The current position becomes the "start" position as returned by Sedlexing.lexeme_start. Moreover, the internal slot is set to -1 and the backtrack position is set to the current position.
next lexbuf extracts the next code point from the lexer buffer and increments the current position. If the input stream is exhausted, the function returns None. If a '\n' is encountered, the tracked line number is incremented.
__private__next_int lexbuf extracts the next code point from the lexer buffer and increments the current position. If the input stream is exhausted, the function returns -1. If a '\n' is encountered, the tracked line number is incremented.
This is a private API: it should not be used by code using this module's API and can be removed at any time.
backtrack lexbuf returns the value stored in the internal slot of the buffer, and performs backtracking (the current position is set to the value of the backtrack position).
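To show how these functions cooperate in a hand-written lexer, here is a sketch of a longest-match scanner for integers and decimals. It uses a small stand-in buffer so that it runs without sedlex; the buffer type, of_ascii, scan_number and the behaviour of mark (store an integer and set the backtrack position to the current position) are assumptions made for this example.

```ocaml
(* Minimal stand-in buffer with the four internal operations. *)
type buf = {
  input : Uchar.t array;
  mutable pos : int;         (* current position *)
  mutable marked_pos : int;  (* backtrack position *)
  mutable slot : int;        (* internal integer slot *)
}

let of_ascii s =
  { input = Array.init (String.length s) (fun i -> Uchar.of_char s.[i]);
    pos = 0; marked_pos = 0; slot = (-1) }

let start b = b.marked_pos <- b.pos; b.slot <- (-1)
let next b =
  if b.pos >= Array.length b.input then None
  else (let c = b.input.(b.pos) in b.pos <- b.pos + 1; Some c)
let mark b i = b.marked_pos <- b.pos; b.slot <- i
let backtrack b = b.pos <- b.marked_pos; b.slot

let is_digit = function
  | Some u -> let c = Uchar.to_int u in c >= 0x30 && c <= 0x39
  | None -> false

let is_dot = function Some u -> Uchar.to_int u = 0x2E | None -> false

(* Returns 0 for an integer, 1 for a decimal, -1 if nothing matched.
   After the call, the current position is just past the longest match. *)
let scan_number b =
  start b;
  let rec int_part seen =
    (* Digits read so far form a valid integer: record action 0 here. *)
    if seen then mark b 0;
    let c = next b in
    if is_digit c then int_part true
    else if seen && is_dot c then frac_part false
    else backtrack b
  and frac_part seen =
    (* At least one fractional digit read: record action 1 here. *)
    if seen then mark b 1;
    if is_digit (next b) then frac_part true else backtrack b
  in
  int_part false
```

On the input 12.x the scanner reads past the dot, fails on x, and backtrack returns 0 with the position restored to just after 12. On 12.5 it returns 1 after consuming all four characters.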
with_tokenizer tokenizer lexbuf, given a lexer and a lexbuf, returns a generator of tokens annotated with positions. This generator can be used with the Menhir parser generator's incremental API.
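A stand-alone model of such a generator is sketched below. It assumes the generator is a function from unit to a triple of token, start position and end position, which this text does not state explicitly. with_positions and its positions parameter are hypothetical names, and 'buf stands in for the lexer buffer type.

```ocaml
(* Wrap a tokenizer so that each call yields the next token together
   with the positions reported for it. A model, not sedlex's code. *)
let with_positions
    (positions : 'buf -> Lexing.position * Lexing.position)
    (tokenizer : 'buf -> 'tok)
    (buf : 'buf) : unit -> 'tok * Lexing.position * Lexing.position =
  fun () ->
    (* Run the lexer first, then read where the token was matched. *)
    let tok = tokenizer buf in
    let (start_p, end_p) = positions buf in
    (tok, start_p, end_p)
```

The order matters: the positions are only meaningful after the tokenizer has matched a token, so they are read after the tokenizer call rather than before it.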