package nx-text

Text processing and NLP extensions for Nx

Sources

raven-1.0.0.alpha0.tbz
sha256=a9a8a9787f8250337187bb7b21cb317c41bfd2ecf08bcfe0ab407c7b6660764d
sha512=fe13cf257c487e41efe2967be147d80fa94bac8996d3aab2b8fd16f0bbbd108c15e0e58c025ec9bf294d4a0d220ca2ba00c3b1b42fa2143f758c5f0ee4c15782

Module Nx_text

Fast tokenization for ML in OCaml

Core Types

type vocab

Vocabulary mapping between tokens and indices

type tokenizer_method = [
  | `Words  (* Split on whitespace and punctuation *)
  | `Chars  (* Unicode character-level tokenization *)
  | `Regex of string  (* Custom regex pattern *)
]

Simple API - Common Use Cases

val tokenize : ?method_:tokenizer_method -> string -> string list

tokenize ?method_ text splits text into tokens.

The default method is `Words.

  tokenize "Hello world!" = [ "Hello"; "world!" ]
  tokenize ~method_:`Chars "Hi!" = [ "H"; "i"; "!" ]
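
To make the two examples above concrete, here is a minimal standard-library sketch that reproduces their outputs. It is not how Nx_text is implemented: split_whitespace and split_bytes are hypothetical names, and the byte-level split only matches character-level tokenization for ASCII input.

```ocaml
(* Illustrative only: hypothetical helpers, not Nx_text's implementation. *)

(* Split on ASCII space, tab and newline, dropping empty tokens. *)
let split_whitespace (s : string) : string list =
  String.split_on_char ' ' s
  |> List.concat_map (String.split_on_char '\t')
  |> List.concat_map (String.split_on_char '\n')
  |> List.filter (fun tok -> tok <> "")

(* Byte-level split; equals a character-level split for ASCII input only. *)
let split_bytes (s : string) : string list =
  List.init (String.length s) (fun i -> String.make 1 s.[i])

let () =
  assert (split_whitespace "Hello world!" = [ "Hello"; "world!" ]);
  assert (split_bytes "Hi!" = [ "H"; "i"; "!" ])
```
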
val encode : ?vocab:vocab -> string -> int list

encode ?vocab text tokenizes and encodes text to indices.

If vocab is not provided, one is built automatically from the input.

  encode "hello world hello" = [ 0; 1; 0 ]
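
One way to read the example above is that each new token receives the next free index in order of first appearance. The sketch below shows that idea with the standard library only. encode_first_seen is a hypothetical helper, and the first-seen ordering is an assumption inferred from the example, not a documented guarantee of Nx_text.

```ocaml
(* Illustrative only: hypothetical helper, not Nx_text's implementation.
   Assumes indices are handed out in order of first appearance. *)
let encode_first_seen (tokens : string list) : int list =
  let index = Hashtbl.create 16 in
  let lookup tok =
    match Hashtbl.find_opt index tok with
    | Some i -> i
    | None ->
        (* A token not seen before gets the next unused index. *)
        let i = Hashtbl.length index in
        Hashtbl.add index tok i;
        i
  in
  (* fold_left makes the left-to-right processing order explicit. *)
  List.rev (List.fold_left (fun acc tok -> lookup tok :: acc) [] tokens)

let () =
  let tokens = String.split_on_char ' ' "hello world hello" in
  assert (encode_first_seen tokens = [ 0; 1; 0 ])
```
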
val encode_batch : ?vocab:vocab -> ?max_len:int -> ?pad:bool -> string list -> (int32, Bigarray.int32_elt) Nx.t

encode_batch ?vocab ?max_len ?pad texts encodes multiple texts into a single tensor.

  • max_len: Maximum sequence length
  • pad: Whether to pad sequences
  • Auto-builds vocab from texts if not provided
  encode_batch [ "hi"; "hello world" ]
  (* Returns 2x3 tensor with padding *)
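
Padding turns sequences of different lengths into a rectangular batch. The sketch below illustrates the idea on plain arrays rather than Nx tensors. pad_batch and pad_id are hypothetical names, capping the width at max_len is an assumption, and no special tokens are added, so the resulting shape may differ from what encode_batch returns.

```ocaml
(* Illustrative only: hypothetical helper, not part of Nx_text.
   Pads (or truncates) each row of token ids to a common width. *)
let pad_batch ?max_len ~(pad_id : int) (rows : int list list) : int array array =
  let longest = List.fold_left (fun m r -> max m (List.length r)) 0 rows in
  (* Assumption: max_len only caps the width; it never widens the batch. *)
  let width = match max_len with Some n -> min n longest | None -> longest in
  rows
  |> List.map (fun row ->
         let out = Array.make width pad_id in
         List.iteri (fun i id -> if i < width then out.(i) <- id) row;
         out)
  |> Array.of_list

let () =
  let batch = pad_batch ~pad_id:0 [ [ 4 ]; [ 5; 6 ] ] in
  assert (batch = [| [| 4; 0 |]; [| 5; 6 |] |])
```
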
val decode : vocab -> int list -> string

decode vocab indices converts indices back to text.

  decode vocab [ 0; 1; 0 ] = "hello world hello"
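
Conceptually, decoding is a table lookup followed by a join. The sketch below models the vocabulary as a plain string array. decode_with is a hypothetical helper, and both the space separator and the "<unk>" fallback for out-of-range indices are assumptions rather than documented Nx_text behaviour.

```ocaml
(* Illustrative only: hypothetical helper, not Nx_text's implementation. *)
let decode_with (id_to_token : string array) (ids : int list) : string =
  ids
  |> List.map (fun i ->
         (* Assumption: out-of-range indices map to "<unk>". *)
         if i >= 0 && i < Array.length id_to_token then id_to_token.(i)
         else "<unk>")
  |> String.concat " "

let () =
  assert (decode_with [| "hello"; "world" |] [ 0; 1; 0 ] = "hello world hello")
```
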
val decode_batch : vocab -> (int32, Bigarray.int32_elt) Nx.t -> string list

decode_batch vocab tensor decodes a tensor back into a list of texts.

Vocabulary

val vocab : ?max_size:int -> ?min_freq:int -> string list -> vocab

vocab ?max_size ?min_freq texts builds a vocabulary from texts.

  • max_size: Maximum vocabulary size
  • min_freq: Minimum token frequency

Automatically includes special tokens: <pad>, <unk>, <bos>, <eos>
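
The interaction of min_freq, max_size and the special tokens can be sketched with a frequency table. This is a standard-library illustration, not Nx_text's implementation: build_vocab is a hypothetical name, it takes already-split tokens, and it assumes that special tokens come first, that they count toward max_size, and that ties are broken alphabetically.

```ocaml
(* Illustrative only: hypothetical helper, not Nx_text's implementation. *)
let specials = [ "<pad>"; "<unk>"; "<bos>"; "<eos>" ]

let build_vocab ?max_size ?(min_freq = 1) (tokens : string list) : string list =
  (* Count how often each token occurs. *)
  let counts = Hashtbl.create 64 in
  List.iter
    (fun tok ->
      let n = Option.value (Hashtbl.find_opt counts tok) ~default:0 in
      Hashtbl.replace counts tok (n + 1))
    tokens;
  (* Keep tokens seen at least min_freq times, most frequent first;
     ties are broken alphabetically so the result is deterministic. *)
  let kept =
    Hashtbl.fold
      (fun tok n acc -> if n >= min_freq then (tok, n) :: acc else acc)
      counts []
    |> List.sort (fun (t1, n1) (t2, n2) ->
           match compare n2 n1 with 0 -> compare t1 t2 | c -> c)
    |> List.map fst
  in
  let all = specials @ kept in
  (* Assumption: max_size limits the total size, special tokens included. *)
  match max_size with
  | None -> all
  | Some limit -> List.filteri (fun i _ -> i < limit) all

let () =
  let v = build_vocab ~min_freq:2 [ "a"; "b"; "a"; "c"; "b"; "a" ] in
  assert (v = [ "<pad>"; "<unk>"; "<bos>"; "<eos>"; "a"; "b" ])
```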

val vocab_size : vocab -> int

vocab_size v returns the number of tokens in the vocabulary.

val vocab_save : vocab -> string -> unit

vocab_save v path saves the vocabulary to the file at path.

val vocab_load : string -> vocab

vocab_load path loads a vocabulary from the file at path.

Text Preprocessing

val normalize : ?lowercase:bool -> ?strip_accents:bool -> ?collapse_whitespace:bool -> string -> string

normalize ?lowercase ?strip_accents ?collapse_whitespace text applies normalization.

All options default to false.

  normalize ~lowercase:true "Hello  WORLD!" = "hello world!"
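
As a rough illustration of what the lowercase and collapse_whitespace options might do, here is an ASCII-only sketch. normalize_ascii is a hypothetical helper, not the library function: it ignores accents and non-ASCII text, treats only the space character as whitespace, and also trims leading and trailing spaces when collapsing.

```ocaml
(* Illustrative only: hypothetical ASCII-only helper, not Nx_text.normalize. *)
let normalize_ascii ?(lowercase = false) ?(collapse_whitespace = false)
    (s : string) : string =
  let s = if lowercase then String.lowercase_ascii s else s in
  if not collapse_whitespace then s
  else
    (* Splitting on spaces and dropping empty pieces merges runs of spaces. *)
    String.split_on_char ' ' s
    |> List.filter (fun w -> w <> "")
    |> String.concat " "

let () =
  assert (
    normalize_ascii ~lowercase:true ~collapse_whitespace:true "Hello  WORLD!"
    = "hello world!");
  (* With both options left at false the input is returned unchanged. *)
  assert (normalize_ascii "Hello  WORLD!" = "Hello  WORLD!")
```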

Advanced API - Custom Tokenizers

module Tokenizer : sig ... end
module Vocab : sig ... end

Unicode Processing

module Unicode : sig ... end

Unicode text processing utilities
