package owl
Module Owl_nlp_corpus
NLP: Corpus module
Type definition
Type of a text corpus.
Query corpus
Return the file handle of the corpus in binary format.
Return the file handle of the tokenised corpus.
Return the vocabulary associated with the corpus.
Return the document ids, which map each document back to the original file from which the corpus was built.
Iteration functions
Iterate over all the documents in the corpus. The index (line number) is passed in.
Iterate over the tokenised documents in the corpus. The index (line number) is passed in.
Map all the documents in a corpus into another array. The index (line number) is passed in.
Map all the tokenised documents in a corpus into another array. The index (line number) is passed in.
Return the next batch of documents in a corpus as a string array. The default ``size`` is 100.
Return the next batch of tokenised documents in a corpus as a string array. The default ``size`` is 100.
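The batching behaviour described above can be pictured with a small in-memory sketch. This is not how the module does it; it only illustrates a cursor that hands out at most ``size`` documents per call. The name ``make_batcher`` is hypothetical, and returning an empty array once the documents run out is an assumption of this sketch.

```ocaml
(* Hypothetical helper, not part of Owl: returns a closure that yields the
   next batch of at most [size] documents on each call. *)
let make_batcher ?(size = 100) (docs : string array) : unit -> string array =
  let pos = ref 0 in
  fun () ->
    (* Number of documents still available, capped at [size]. *)
    let n = min size (Array.length docs - !pos) in
    let batch = Array.sub docs !pos n in
    pos := !pos + n;
    (* Assumption: an empty array signals that the corpus is exhausted. *)
    batch
```

Calling the closure repeatedly walks through the documents in order, so a training loop can stop when it receives an empty batch.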
Core functions
val build :
?docid:int array ->
?stopwords:(string, 'a) Hashtbl.t ->
?lo:float ->
?hi:float ->
?vocab:Owl_nlp_vocabulary.t ->
?minlen:int ->
string ->
t
This function builds up a corpus of type ``t`` from a given raw text corpus. We assume that each line in the raw text corpus represents a document.
Parameters:
* ``?docid``: optional document ids that can be used to trace documents back to the original corpus.
* ``?stopwords``: stopwords used in building the vocabulary.
* ``?lo``: any word whose frequency is below this lower bound is removed from the vocabulary.
* ``?hi``: any word whose frequency is above this upper bound is removed from the vocabulary.
* ``?vocab``: an optional vocabulary; if it is not passed, the vocabulary is built from the current corpus.
* ``?(minlen=10)``: threshold of the document length; any document shorter than this is removed from the corpus.
* ``fname``: the file name of the raw text corpus.
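To make the effect of ``minlen`` and ``docid`` concrete, here is a minimal standalone sketch. It is not the module's implementation. The helper name ``filter_by_minlen`` is hypothetical, and two things are assumptions of the sketch: that document length is counted in whitespace-separated tokens, and that the line index serves as the document id.

```ocaml
(* Hypothetical helper, not part of Owl: keep only documents with at least
   [minlen] tokens, pairing each kept document with its original line index
   so it can be traced back to the raw corpus. *)
let filter_by_minlen ?(minlen = 10) (docs : string array) : (int * string) array =
  (* Assumption: length is the number of non-empty space-separated tokens. *)
  let n_tokens s =
    String.split_on_char ' ' s
    |> List.filter (fun w -> w <> "")
    |> List.length
  in
  docs
  |> Array.mapi (fun i d -> (i, d))
  |> Array.to_list
  |> List.filter (fun (_, d) -> n_tokens d >= minlen)
  |> Array.of_list
```

Because the kept ids are no longer contiguous after filtering, carrying them alongside the documents is what makes it possible to map results back to lines in the original file.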
``tokenise corpus doc`` tokenises the document ``doc`` using the ``corpus`` and its associated vocabulary.
Remove the duplicates in a text corpus. The ids of the removed documents are returned.
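One way to picture duplicate removal is the following sketch, which works on an in-memory array rather than on files. The name ``dedup`` is hypothetical, and treating two documents as duplicates only when their text is exactly equal is an assumption; the module's own criterion is not stated here.

```ocaml
(* Hypothetical helper, not part of Owl: keep the first occurrence of each
   document and report the indices of the later, removed copies. *)
let dedup (docs : string array) : string array * int array =
  let seen = Hashtbl.create 1024 in
  let kept = ref [] and removed = ref [] in
  Array.iteri
    (fun i d ->
      if Hashtbl.mem seen d then
        (* Exact duplicate of an earlier document: record its id. *)
        removed := i :: !removed
      else begin
        Hashtbl.add seen d ();
        kept := d :: !kept
      end)
    docs;
  (Array.of_list (List.rev !kept), Array.of_list (List.rev !removed))
```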
Function for simple pre-processing of a given string.
``preprocess f input_file output_file`` pre-processes the file ``input_file`` with the function ``f`` and saves the output to ``output_file``.
E.g., you can plug in the ``simple_process`` function to clean up the text. Note that this function does not change the number of lines in a corpus.
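The line-preserving behaviour can be sketched with the standard library alone. This is an illustration, not the module's code: ``preprocess_lines`` is a hypothetical name, and the sketch assumes that ``f`` maps one line to one line, i.e. it never inserts a newline into its result.

```ocaml
(* Hypothetical helper, not part of Owl: apply [f] to every line of
   [input_file] and write the results, one per line, to [output_file]. *)
let preprocess_lines (f : string -> string) (input_file : string)
    (output_file : string) : unit =
  let ic = open_in input_file in
  let oc = open_out output_file in
  (try
     while true do
       let line = input_line ic in
       (* One output line per input line keeps line numbers aligned,
          provided [f] does not emit newline characters. *)
       output_string oc (f line);
       output_char oc '\n'
     done
   with End_of_file -> ());
  close_in ic;
  close_out oc
```

For instance, ``preprocess_lines String.lowercase_ascii "raw.txt" "clean.txt"`` would lowercase every document while leaving document indices unchanged (the file names are placeholders).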
I/O functions
The string representation of a corpus, which contains a summary of the corpus.
Helper functions
val create :
string ->
int array ->
int array ->
in_channel option ->
in_channel option ->
Owl_nlp_vocabulary.t option ->
int ->
int array ->
t
``create uri bin_ofs tok_ofs bin_fh tok_fh vocab minlen docid`` wraps up the corpus into a record of type ``t``.