English Lemmatizer

Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or headword form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme.

In English, the base form for a verb is the simple infinitive. For example, the gerund “striking” and the past form “struck” are both forms of the lemma “(to) strike”. The base form for a noun is the singular form. For example, the plural “mice” is a form of the lemma “mouse”.

Most English spellings can be lemmatized using regular rules of English grammar, as long as the word class is known.
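As a rough illustration of rule-based lemmatization with a known word class, here is a minimal Python sketch. It is not the implementation described by this document: the suffix rules, the tiny LEXICON of known lemmas, the IRREGULAR table, and the function names are all assumptions made up for the example. The idea is to strip or replace a suffix and accept the first candidate that appears in the lexicon for the given word class.

```python
# Hypothetical mini-lexicon of known lemmas per word class (assumption).
LEXICON = {
    "noun": {"cat", "box", "city", "mouse"},
    "verb": {"strike", "walk", "run", "carry"},
}

# Hypothetical table of irregular forms, keyed by (spelling, word class).
IRREGULAR = {
    ("mice", "noun"): "mouse",
    ("struck", "verb"): "strike",
}

# Illustrative (suffix, replacement) rules, tried in order.
RULES = {
    "noun": [("ies", "y"), ("es", ""), ("s", "")],
    "verb": [("ies", "y"), ("es", ""), ("s", ""),
             ("ing", "e"), ("ing", ""), ("ed", "e"), ("ed", "")],
}


def candidates(word, word_class):
    """Yield possible lemma spellings produced by the suffix rules."""
    for suffix, replacement in RULES.get(word_class, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            yield stem + replacement
            # Undo consonant doubling, as in "running" -> "run".
            if (suffix in ("ing", "ed") and replacement == ""
                    and len(stem) >= 2 and stem[-1] == stem[-2]):
                yield stem[:-1]


def lemmatize(spelling, word_class):
    """Return the lemma for a spelling, given its word class."""
    word = spelling.lower()
    irregular = IRREGULAR.get((word, word_class))
    if irregular is not None:
        return irregular
    known = LEXICON.get(word_class, set())
    if word in known:
        return word
    for candidate in candidates(word, word_class):
        if candidate in known:
            return candidate
    return word  # no rule matched a known lemma; leave the spelling as-is


print(lemmatize("striking", "verb"))
print(lemmatize("mice", "noun"))
```

Checking each candidate against a lexicon keeps a rule like "ing" to "e" from turning every "-ing" form into a spelling that ends in "e". Irregular forms such as "struck" and "mice" cannot be derived by rule, so the sketch looks them up in a table first.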

Less trivial is the disambiguation of homonyms like “lie” or “bark”. There are at most a few hundred such pairs in English. In the future, we may be able to distinguish which homonym is meant in some situations using methods collectively called word sense disambiguation. That would allow more accurate lemmatization for homonyms.
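To see why homonyms are a problem, consider this small Python sketch. It is only an illustration: the Lexeme class, its gloss field (a sense label added here purely to tell homonyms apart), the FORMS table, and the function names are hypothetical. A spelling plus word class can map to more than one lexeme, and without word sense disambiguation the lemmatizer cannot tell which one is meant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lexeme:
    lemma: str
    word_class: str
    gloss: str  # hypothetical sense label, not part of the lexeme definition


# Hypothetical lookup table from (spelling, word class) to candidate lexemes.
FORMS = {
    ("lying", "verb"): [
        Lexeme("lie", "verb", "recline"),
        Lexeme("lie", "verb", "tell an untruth"),
    ],
    ("lied", "verb"): [Lexeme("lie", "verb", "tell an untruth")],
    ("lay", "verb"): [
        Lexeme("lie", "verb", "recline"),
        Lexeme("lay", "verb", "put down"),
    ],
    ("barks", "noun"): [
        Lexeme("bark", "noun", "outer layer of a tree"),
        Lexeme("bark", "noun", "cry of a dog"),
    ],
}


def candidate_lexemes(spelling, word_class):
    """Return every lexeme the spelling could belong to."""
    return FORMS.get((spelling.lower(), word_class), [])


def is_ambiguous(spelling, word_class):
    """True when more than one lexeme matches the spelling."""
    return len(candidate_lexemes(spelling, word_class)) > 1


for lexeme in candidate_lexemes("lay", "verb"):
    print(lexeme.lemma, "-", lexeme.gloss)
```

In this sketch “lied” resolves to a single lexeme, while “lying” and “lay” each have two candidates. Choosing between them would require context.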

