Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or headword form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.
In English, the base form for a verb is the simple infinitive. For example, the gerund “striking” and the past form “struck” are both forms of the lemma “(to) strike”. The base form for a noun is the singular form. For example, the plural “mice” is a form of the lemma “mouse.”
Most English spellings can be lemmatized using regular rules of English grammar, as long as the word class is known.
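To make the role of the word class concrete, here is a minimal sketch of a rule-based lemmatizer. Everything in it is an assumption made for illustration: the names `IRREGULAR`, `RULES`, `LEXICON`, and `lemmatize` are hypothetical, and the rule set and word list are deliberately tiny. A real lemmatizer needs many more rules and a full dictionary.

```python
# Hypothetical, deliberately tiny tables; not taken from any real library.
# Irregular forms are looked up directly, keyed by (spelling, word class).
IRREGULAR = {
    ("struck", "verb"): "strike",
    ("mice", "noun"): "mouse",
    ("saw", "verb"): "see",
}

# (suffix, replacement) pairs, tried in order for each word class.
RULES = {
    "noun": [("ies", "y"), ("ches", "ch"), ("shes", "sh"), ("xes", "x"), ("s", "")],
    "verb": [("ied", "y"), ("ies", "y"), ("ing", "e"), ("ing", ""),
             ("ed", "e"), ("ed", ""), ("s", "")],
}

# Known lemmas per word class, used to accept or reject a candidate.
LEXICON = {
    "noun": {"mouse", "dog", "city", "box", "saw"},
    "verb": {"strike", "walk", "organize", "carry", "see"},
}


def lemmatize(word, word_class):
    """Return the lemma of `word`, given its word class ("noun" or "verb")."""
    word = word.lower()
    if (word, word_class) in IRREGULAR:
        return IRREGULAR[(word, word_class)]
    known = LEXICON.get(word_class, set())
    if word in known:
        return word  # already a base form
    for suffix, replacement in RULES.get(word_class, []):
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in known:
                return candidate
    return word  # no rule applied; leave the spelling unchanged


print(lemmatize("striking", "verb"))  # strike
print(lemmatize("mice", "noun"))      # mouse
print(lemmatize("saw", "verb"))       # see
print(lemmatize("saw", "noun"))       # saw
```

The last two calls show why the word class has to be known: the same spelling “saw” maps to a different lemma depending on whether it is tagged as a verb or a noun.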
Less trivial is the disambiguation of homonyms such as “lie” or “bark”. There are at most a few hundred such pairs in English. In the future, we may be able to distinguish which homonym is meant in some situations using methods collectively called word sense disambiguation. That would allow more accurate lemmatization of homonyms.
Stemming and lemmatization
For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. (Cambridge University Press, 2008.)
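The search scenario described above can be sketched with a very crude suffix-stripping stemmer. This is an illustrative assumption, not a published stemming algorithm: the suffix list and the names `crude_stem`, `build_index`, and `search` are hypothetical, and the rules only cover the “organize” family used in the example.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "es", "s", "e")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[crude_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents sharing a stem with the query word."""
    return index.get(crude_stem(query), set())


docs = {
    1: "they organize events",
    2: "she organizes workshops",
    3: "organizing takes time",
}
index = build_index(docs)
print(crude_stem("organizing"))             # organiz
print(sorted(search(index, "organizing")))  # [1, 2, 3]
```

Note that the shared key “organiz” is not a dictionary word. A stemmer only needs the related forms to collapse to the same string, whereas a lemmatizer would return the headword “organize”.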
Lemmatization is one of the tasks that natural language processing (NLP) can perform.
Before lemmatizing, it is necessary to separate the sentence into tokens.
This process, called tokenization, segments strings into smaller parts. A lemmatizer then takes a token and its part-of-speech tag as input and returns the word’s lemma.
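As a rough illustration of tokenization, the sketch below compares plain whitespace splitting with a simple regular-expression tokenizer. It uses only Python's standard library and is not the OpenNLP API; the function names and the pattern are assumptions chosen for the example.

```python
import re

# Hypothetical pattern: a run of word characters, optionally followed by an
# apostrophe and more word characters (e.g. "weren't"), or a single
# punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def simple_tokenize(text):
    """Separate words from punctuation using TOKEN_PATTERN."""
    return TOKEN_PATTERN.findall(text)


sentence = "The mice weren't struck."
print(whitespace_tokenize(sentence))  # ['The', 'mice', "weren't", 'struck.']
print(simple_tokenize(sentence))      # ['The', 'mice', "weren't", 'struck', '.']
```

The difference matters for lemmatization: the whitespace version yields the token “struck.” with the period attached, which would not match any dictionary entry.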
In this post, we show the different ways of tokenizing text that the OpenNLP Tokenizer API provides.
Furthermore, the Word Frequency tool separates words into tokens and shows their part of speech.