Category: Java
-
Apache OpenNLP – Tokenization
Tokenization is a process of segmenting strings into smaller parts called tokens(say sub-strings). Usually, these tokens are words, numbers, or punctuation marks. There’re three types of tokenizers available in OpenNLP. TokenizerME – A maximum entropy tokenizer, detects token boundaries based on probability model WhitespaceTokenizer – A whitespace tokenizer, non whitespace sequences are identified as tokens…