Module `Nx_text`

Fast tokenization for ML in OCaml

Core Types

type vocab

Vocabulary mapping between tokens and indices

type tokenizer_method = [
  | `Words  (* Split on whitespace and punctuation *)
  | `Chars  (* Unicode character-level tokenization *)
  | `Regex of string  (* Custom regex pattern *)
]

Simple API - Common Use Cases

val tokenize : ?method_:tokenizer_method -> string -> string list

tokenize ?method_ text splits text into tokens.

The default method is `Words.

  tokenize "Hello world!" = [ "Hello"; "world!" ]
  tokenize ~method_:`Chars "Hi!" = [ "H"; "i"; "!" ]
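To make the two documented results concrete, here is a minimal standard-library sketch that reproduces them. It is not the Nx_text implementation: the helper names `split_whitespace` and `split_chars` are hypothetical, the input is assumed to be ASCII, and the splitter only breaks on whitespace, which is what the documented output for "Hello world!" shows.

```ocaml
(* Illustrative only: not the Nx_text implementation. ASCII input assumed. *)

(* Split on runs of ASCII whitespace, dropping empty pieces. *)
let split_whitespace (s : string) : string list =
  let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r' in
  let n = String.length s in
  let rec go i start acc =
    if i = n then
      List.rev (if i > start then String.sub s start (i - start) :: acc else acc)
    else if is_space s.[i] then
      let acc = if i > start then String.sub s start (i - start) :: acc else acc in
      go (i + 1) (i + 1) acc
    else go (i + 1) start acc
  in
  go 0 0 []

(* One token per byte. This is only correct for ASCII text; a Unicode-aware
   character tokenizer would split on code points or grapheme clusters. *)
let split_chars (s : string) : string list =
  List.init (String.length s) (fun i -> String.make 1 s.[i])
```

A byte-level split is the simplest stand-in for character tokenization, but it would cut multi-byte UTF-8 characters apart, which is presumably why the library describes its own mode as Unicode-aware.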

val encode : ?vocab:vocab -> string -> int list

encode ?vocab text tokenizes and encodes text to indices.

If vocab is not provided, one is built automatically from the input.

  encode "hello world hello" = [ 0; 1; 0 ]
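The documented result is consistent with numbering tokens from zero in order of first appearance. The sketch below illustrates that scheme with the standard library only. It is an assumption about the numbering, not the library's code: `encode_first_seen` is a hypothetical name, tokens are split on single spaces, and special tokens are ignored.

```ocaml
(* Illustrative only: not the Nx_text implementation. *)

(* Give each distinct token the next free index, in order of first
   appearance, and return the resulting index sequence. *)
let encode_first_seen (text : string) : int list =
  let table : (string, int) Hashtbl.t = Hashtbl.create 16 in
  let id_of tok =
    match Hashtbl.find_opt table tok with
    | Some id -> id
    | None ->
        let id = Hashtbl.length table in
        Hashtbl.add table tok id;
        id
  in
  String.split_on_char ' ' text
  |> List.filter (fun tok -> tok <> "")
  |> List.fold_left (fun acc tok -> id_of tok :: acc) []
  |> List.rev
```

Because an automatically built vocabulary depends on the input, indices from two separate calls are not comparable; passing an explicit vocab keeps them stable across calls.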

val encode_batch :
  ?vocab:vocab ->
  ?max_len:int ->
  ?pad:bool ->
  string list ->
  (int32, Bigarray.int32_elt) Nx.t

encode_batch ?vocab ?max_len ?pad texts encodes multiple texts into a single tensor.

max_len: maximum sequence length.
pad: whether to pad sequences.
If vocab is not provided, it is built automatically from texts.

  encode_batch [ "hi"; "hello world" ]
  (* Returns 2x3 tensor with padding *)
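Batching sequences of different lengths into one rectangular tensor requires truncation, padding, or both. The following is a minimal sketch of that step on plain integer lists, not the Nx_text implementation. The name `pad_batch`, the `pad_id` parameter with its default of 0, and the choice to pad to the longest row are all assumptions made for illustration.

```ocaml
(* Illustrative only: not the Nx_text implementation. *)

(* Truncate each sequence to at most max_len items (when given), then
   right-pad every row with pad_id up to the longest remaining row. *)
let pad_batch ?max_len ?(pad_id = 0) (seqs : int list list) : int array array =
  let truncate row =
    match max_len with
    | Some m -> List.filteri (fun i _ -> i < m) row
    | None -> row
  in
  let rows = List.map truncate seqs in
  let width = List.fold_left (fun w r -> max w (List.length r)) 0 rows in
  rows
  |> List.map (fun row ->
         let out = Array.make width pad_id in
         List.iteri (fun i id -> out.(i) <- id) row;
         out)
  |> Array.of_list
```

For example, `pad_batch [ [ 1 ]; [ 2; 3; 4 ] ]` yields two rows of width three, with the first row padded by two copies of `pad_id`.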

val decode : vocab -> int list -> string

decode vocab indices converts indices back to text.

  decode vocab [ 0; 1; 0 ] = "hello world hello"
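Decoding is the inverse lookup: each index is mapped back to its token and the tokens are joined. Here is a minimal sketch using a plain string array as a stand-in for the abstract vocab type. The name `decode_ids`, the single-space separator, and the fallback to "<unk>" for out-of-range indices are assumptions, not documented behaviour.

```ocaml
(* Illustrative only: not the Nx_text implementation. *)

(* Look up each index in the array and join the tokens with single spaces.
   Indices outside the array map to the "<unk>" placeholder. *)
let decode_ids (vocab : string array) (ids : int list) : string =
  ids
  |> List.map (fun id ->
         if id >= 0 && id < Array.length vocab then vocab.(id) else "<unk>")
  |> String.concat " "
```

Joining with a single space means the original spacing is not recovered, so a round trip through encoding and decoding is only exact for text that was already separated by single spaces.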

val decode_batch : vocab -> (int32, Bigarray.int32_elt) Nx.t -> string list

decode_batch vocab tensor decodes a tensor of indices back into a list of texts.

Vocabulary

val vocab : ?max_size:int -> ?min_freq:int -> string list -> vocab

vocab ?max_size ?min_freq texts builds a vocabulary from texts.

max_size: maximum vocabulary size.
min_freq: minimum frequency a token needs to be included.

The special tokens <pad>, <unk>, <bos>, and <eos> are included automatically.
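The interaction between the frequency threshold, the size cap, and the special tokens can be sketched with the standard library. This is an illustration under stated assumptions, not the Nx_text implementation: `build_vocab` is a hypothetical name, it takes already-split tokens rather than raw texts, special tokens come first and count towards the cap, and the remaining tokens are ordered by descending frequency with alphabetical tie-breaking.

```ocaml
(* Illustrative only: not the Nx_text implementation. *)

let specials = [ "<pad>"; "<unk>"; "<bos>"; "<eos>" ]

(* Count tokens, drop those seen fewer than min_freq times, order the rest
   by descending count (ties broken alphabetically), and cap the total
   size, special tokens included, at max_size. *)
let build_vocab ?max_size ?(min_freq = 1) (tokens : string list) : string list =
  let counts : (string, int) Hashtbl.t = Hashtbl.create 64 in
  List.iter
    (fun tok ->
      let c = Option.value (Hashtbl.find_opt counts tok) ~default:0 in
      Hashtbl.replace counts tok (c + 1))
    tokens;
  let kept =
    Hashtbl.fold
      (fun tok c acc -> if c >= min_freq then (tok, c) :: acc else acc)
      counts []
    |> List.sort (fun (t1, c1) (t2, c2) ->
           match compare c2 c1 with 0 -> compare t1 t2 | n -> n)
    |> List.map fst
  in
  let all = specials @ kept in
  match max_size with
  | Some m -> List.filteri (fun i _ -> i < m) all
  | None -> all
```

The position of a token in the returned list serves as its index, so in this sketch the four special tokens occupy indices 0 to 3.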

val vocab_size : vocab -> int

vocab_size v returns the number of tokens in the vocabulary.

val vocab_save : vocab -> string -> unit

vocab_save v path saves the vocabulary to the file at path.

val vocab_load : string -> vocab

vocab_load path loads a vocabulary from the file at path.

Text Preprocessing

val normalize :
  ?lowercase:bool ->
  ?strip_accents:bool ->
  ?collapse_whitespace:bool ->
  string ->
  string

normalize ?lowercase ?strip_accents ?collapse_whitespace text applies normalization.

All options default to false.

  normalize ~lowercase:true ~collapse_whitespace:true "Hello  WORLD!" = "hello world!"
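Since every option defaults to false, each transformation runs only when it is explicitly enabled. The sketch below shows that behaviour for two of the three options using the standard library. It is not the Nx_text implementation: `normalize_ascii` and `collapse_ws` are hypothetical names, the input is assumed to be ASCII, and accent stripping is left out because it needs Unicode tables.

```ocaml
(* Illustrative only: not the Nx_text implementation. ASCII input assumed. *)

(* Replace every run of ASCII whitespace with a single space. *)
let collapse_ws (s : string) : string =
  let buf = Buffer.create (String.length s) in
  let in_space = ref false in
  String.iter
    (fun c ->
      if c = ' ' || c = '\t' || c = '\n' || c = '\r' then begin
        if not !in_space then Buffer.add_char buf ' ';
        in_space := true
      end else begin
        Buffer.add_char buf c;
        in_space := false
      end)
    s;
  Buffer.contents buf

(* Apply only the transformations that were requested. *)
let normalize_ascii ?(lowercase = false) ?(collapse_whitespace = false)
    (s : string) : string =
  let s = if lowercase then String.lowercase_ascii s else s in
  if collapse_whitespace then collapse_ws s else s
```

With only `~lowercase:true`, this sketch leaves a double space in the input untouched; collapsing it requires `~collapse_whitespace:true` as well.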

Advanced API - Custom Tokenizers

module Tokenizer : sig ... end

module Vocab : sig ... end

Unicode Processing

module Unicode : sig ... end

Unicode text processing utilities
