Apache OpenNLP – Tokenization

Tokenization is the process of segmenting a string into smaller parts (substrings) called tokens.

Usually, these tokens are words, numbers, or punctuation marks.

OpenNLP provides three types of tokenizers:

  • TokenizerME – a maximum entropy tokenizer that detects token boundaries based on a probability model
  • WhitespaceTokenizer – a whitespace tokenizer that treats each sequence of non-whitespace characters as a token
  • SimpleTokenizer – a character class tokenizer that treats each sequence of characters of the same class as a token


TokenizerME

TokenizerME needs a model, so we first open the pre-trained model file as an input stream.

We can download the pre-trained model file, en-token.bin, from the OpenNLP models download page, put it in the /resources/models folder, and load it from the classpath.
This tokenizer can be used with a custom-trained model as well.

InputStream inputStream = getClass()
      .getResourceAsStream("/models/en-token.bin");

Next, we read the stream into a TokenizerModel. The constructor throws an IOException if the model can’t be read:

TokenizerModel model = new TokenizerModel(inputStream);

Then we initialize the tokenizer with the model:

TokenizerME tokenizer = new TokenizerME(model);

Finally, we use the TokenizerME.tokenize() method to extract the tokens into a String array:

String[] tokens = tokenizer.tokenize("John has a sister named Penny.");

WhitespaceTokenizer

As the name suggests, this tokenizer simply splits the sentence into tokens using whitespace characters as delimiters:

WhitespaceTokenizer tokenizer = WhitespaceTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize("John has a sister named Penny.");

Because only whitespace splits the sentence, we get “Penny.” with the trailing period as a single token, instead of two separate tokens for the word “Penny” and the period character.
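To make this behavior concrete, here is a minimal sketch in plain Java that approximates whitespace splitting without OpenNLP. The class name WhitespaceSplitSketch and its tokenize() helper are hypothetical names used only for illustration, and this is not how the library implements WhitespaceTokenizer:

```java
import java.util.Arrays;

// Illustration only: approximates whitespace tokenization with String.split().
// WhitespaceSplitSketch is a hypothetical name, not part of OpenNLP.
class WhitespaceSplitSketch {

    static String[] tokenize(String text) {
        // Drop leading and trailing whitespace so split() yields no empty tokens.
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        // Any run of one or more whitespace characters acts as a delimiter.
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String[] tokens = tokenize("John has a sister named Penny.");
        // prints [John, has, a, sister, named, Penny.]
        System.out.println(Arrays.toString(tokens));
    }
}
```

As with WhitespaceTokenizer, the period stays attached to “Penny.” because nothing but whitespace is treated as a boundary.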

SimpleTokenizer

This tokenizer is a little more sophisticated than WhitespaceTokenizer: it splits the sentence into words, numbers, and punctuation marks. Like WhitespaceTokenizer, it doesn’t require a model:

SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
String[] tokens = tokenizer.tokenize("John has a sister named Penny.");
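The idea of character classes can be sketched in plain Java as follows. This is a simplified illustration under our own assumptions, not OpenNLP’s implementation: the class CharClassTokenizerSketch, the CharClass enum, and the four classes it uses (whitespace, letter, digit, other) are hypothetical names chosen for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: splits text wherever the character class changes.
// CharClassTokenizerSketch is a hypothetical name, not part of OpenNLP.
class CharClassTokenizerSketch {

    // Assumed character classes for this sketch.
    enum CharClass { WHITESPACE, LETTER, DIGIT, OTHER }

    static CharClass classify(char c) {
        if (Character.isWhitespace(c)) return CharClass.WHITESPACE;
        if (Character.isLetter(c)) return CharClass.LETTER;
        if (Character.isDigit(c)) return CharClass.DIGIT;
        return CharClass.OTHER;
    }

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            // A run ends at the end of the input or where the class changes.
            boolean boundary = i == text.length()
                    || classify(text.charAt(i)) != classify(text.charAt(start));
            if (boundary) {
                // Whitespace runs only separate tokens; they are not tokens themselves.
                if (classify(text.charAt(start)) != CharClass.WHITESPACE) {
                    tokens.add(text.substring(start, i));
                }
                start = i;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [John, has, a, sister, named, Penny, .]
        System.out.println(tokenize("John has a sister named Penny."));
    }
}
```

Here the period becomes its own token because it belongs to a different class than the letters before it. Note that this sketch keeps any run of punctuation, such as “...”, as a single token; SimpleTokenizer’s exact rules for such runs may differ.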

Conclusion

In this article, we looked at the different tokenizers that the Apache OpenNLP Tokenizer API provides.