package nx-text
Module Nx_text
Fast tokenization for ML in OCaml
Core Types
type vocab
Vocabulary mapping between tokens and indices.
type tokenizer_method = [
  | `Words  (* Split on whitespace and punctuation *)
  | `Chars  (* Unicode character-level tokenization *)
  | `Regex of string  (* Custom regex pattern *)
]
Simple API - Common Use Cases
tokenize ?method_ text splits text into tokens. The default method is `Words.
tokenize "Hello world!" = [ "Hello"; "world!" ]
tokenize ~method_:`Chars "Hi!" = [ "H"; "i"; "!" ]
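The two built-in splitting strategies can be sketched with the standard library alone. The helpers below (`split_words`, `split_chars`) are hypothetical stand-ins, not Nx_text's implementation: `split_words` splits on whitespace only, which matches the example output above, and `split_chars` assumes OCaml 4.14 or later for `String.get_utf_8_uchar`.

```ocaml
(* Illustrative stand-ins; these are NOT the functions exported by Nx_text. *)

(* Split on runs of ASCII whitespace, dropping empty pieces.
   Assumption: punctuation stays attached to the neighbouring word. *)
let split_words (s : string) : string list =
  s
  |> String.map (function '\t' | '\n' | '\r' -> ' ' | c -> c)
  |> String.split_on_char ' '
  |> List.filter (fun tok -> tok <> "")

(* Split into one token per UTF-8 encoded code point.
   Malformed bytes are kept as-is, one decode step at a time. *)
let split_chars (s : string) : string list =
  let n = String.length s in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      let decoded = String.get_utf_8_uchar s i in
      let len = Uchar.utf_decode_length decoded in
      go (i + len) (String.sub s i len :: acc)
  in
  go 0 []
```

Decoding code points rather than bytes keeps multi-byte characters such as "é" in a single token.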
encode ?vocab text tokenizes text and encodes the tokens as indices.
If vocab is not provided, one is built automatically from the input.
encode "hello world hello" = [ 0; 1; 0 ]
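The example output is consistent with assigning indices in first-seen order when no vocabulary is supplied. That ordering is an assumption, and `encode_first_seen` below is a hypothetical helper written against the standard library, not the library's own code.

```ocaml
(* Hypothetical helper: give each distinct token the next free index,
   in order of first appearance. Not part of Nx_text. *)
let encode_first_seen (tokens : string list) : int list =
  let table : (string, int) Hashtbl.t = Hashtbl.create 16 in
  let index_of tok =
    match Hashtbl.find_opt table tok with
    | Some idx -> idx
    | None ->
        (* The table size is the next unused index. *)
        let idx = Hashtbl.length table in
        Hashtbl.add table tok idx;
        idx
  in
  (* Fold left to right so indices follow input order, then restore order. *)
  List.rev (List.fold_left (fun acc tok -> index_of tok :: acc) [] tokens)
```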
val encode_batch :
?vocab:vocab ->
?max_len:int ->
?pad:bool ->
string list ->
(int32, Bigarray.int32_elt) Nx.t
encode_batch ?vocab ?max_len ?pad texts encodes multiple texts into a single tensor.
- max_len: Maximum sequence length
- pad: Whether to pad sequences
- Auto-builds vocab from texts if not provided
encode_batch [ "hi"; "hello world" ]
(* Returns 2x3 tensor with padding *)
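Batching variable-length sequences into a rectangular tensor needs truncation and padding. The sketch below shows one way to do that on plain arrays; `pad_rows`, the `pad_id` parameter, and the choice of 0 as the padding index are all assumptions for illustration, and the result is an `int array array` rather than an `Nx.t`.

```ocaml
(* Hypothetical helper: make every row the same width by truncating long
   rows and right-padding short ones. Not part of Nx_text. *)
let pad_rows ?(pad_id = 0) ?max_len (rows : int list list) : int array array =
  (* Without max_len, use the longest row as the target width. *)
  let longest = List.fold_left (fun m r -> max m (List.length r)) 0 rows in
  let width = match max_len with Some n -> n | None -> longest in
  let pad_row row =
    Array.init width (fun i ->
        match List.nth_opt row i with
        | Some v -> v
        | None -> pad_id)
  in
  Array.of_list (List.map pad_row rows)
```

Padding on the right keeps token positions aligned with the original text; whether the library pads left or right is not stated in this page.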
decode vocab indices converts indices back to text.
decode vocab [ 0; 1; 0 ] = "hello world hello"
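Decoding is the inverse lookup followed by a join. A minimal sketch, assuming the vocabulary can be viewed as an array from index to token and that tokens are joined with single spaces; `decode_with` and the fallback to "<unk>" for out-of-range indices are hypothetical.

```ocaml
(* Hypothetical helper: map each index back to its token and join with
   spaces. Out-of-range indices fall back to "<unk>". Not part of Nx_text. *)
let decode_with (id_to_token : string array) (indices : int list) : string =
  let lookup i =
    if i >= 0 && i < Array.length id_to_token then id_to_token.(i)
    else "<unk>"
  in
  String.concat " " (List.map lookup indices)
```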
decode_batch vocab tensor decodes a tensor of indices back to a list of texts.
Vocabulary
vocab ?max_size ?min_freq texts builds a vocabulary from texts.
- max_size: Maximum vocabulary size
- min_freq: Minimum token frequency
Automatically includes special tokens: <pad>, <unk>, <bos>, <eos>
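How the two limits interact can be shown with a frequency-count sketch. `build_vocab` below is a hypothetical helper over an already tokenized list, and it makes several assumptions the page does not confirm: special tokens come first, they count toward `max_size`, and the remaining tokens are ranked by descending frequency with alphabetical tie-breaking.

```ocaml
(* Special tokens listed in the documentation. *)
let special_tokens = [ "<pad>"; "<unk>"; "<bos>"; "<eos>" ]

(* Hypothetical helper: returns tokens in index order. Not part of Nx_text. *)
let build_vocab ?max_size ?(min_freq = 1) (tokens : string list) : string list =
  (* Count occurrences of each token. *)
  let counts : (string, int) Hashtbl.t = Hashtbl.create 64 in
  List.iter
    (fun tok ->
      let c = Option.value (Hashtbl.find_opt counts tok) ~default:0 in
      Hashtbl.replace counts tok (c + 1))
    tokens;
  (* Drop rare tokens, then rank by frequency (ties broken alphabetically). *)
  let ranked =
    Hashtbl.fold
      (fun tok c acc -> if c >= min_freq then (tok, c) :: acc else acc)
      counts []
    |> List.sort (fun (t1, c1) (t2, c2) ->
           if c1 <> c2 then compare c2 c1 else compare t1 t2)
    |> List.map fst
  in
  let all = special_tokens @ ranked in
  (* Assumption: max_size caps the total, special tokens included. *)
  match max_size with
  | None -> all
  | Some n -> List.filteri (fun i _ -> i < n) all
```

With this layout, a token dropped by either limit would be encoded as `<unk>` at lookup time.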
Text Preprocessing
val normalize :
?lowercase:bool ->
?strip_accents:bool ->
?collapse_whitespace:bool ->
string ->
string
normalize ?lowercase ?strip_accents ?collapse_whitespace text applies the selected normalization steps to text.
All options default to false, so with no options set the text is returned unchanged.
normalize ~lowercase:true "Hello WORLD!" = "hello world!"
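The opt-in behaviour of the flags can be sketched as follows. `normalize_sketch` is a hypothetical ASCII-only approximation: it omits `strip_accents` (which needs Unicode decomposition data) and, as an assumption, also trims leading and trailing whitespace when collapsing.

```ocaml
(* Hypothetical ASCII-only approximation of two of the three options.
   Not part of Nx_text. *)
let normalize_sketch ?(lowercase = false) ?(collapse_whitespace = false)
    (s : string) : string =
  (* Step 1: optional ASCII lowercasing. *)
  let s = if lowercase then String.lowercase_ascii s else s in
  (* Step 2: optionally replace every whitespace run with one space. *)
  if not collapse_whitespace then s
  else
    s
    |> String.map (function '\t' | '\n' | '\r' -> ' ' | c -> c)
    |> String.split_on_char ' '
    |> List.filter (fun piece -> piece <> "")
    |> String.concat " "
```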