Natural Language Processing (NLP) is broadly defined as the automatic manipulation of natural language, such as speech and text, by software.
One of the first steps in NLP is the extraction of tokens from text. Tokenization splits text into tokens – typically words. Usually, text is split at delimiters such as white space, which includes blanks, tabs, and carriage-return/line-feed characters. However, specialized tokenizers can split text according to other delimiters.
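As a minimal sketch of delimiter-based tokenization (not tied to any particular NLP library; the `tokenize` helper and its default pattern are illustrative assumptions), text can be split at a delimiter pattern:

```python
import re

def tokenize(text, delimiter=r"\s+"):
    """Split text into tokens at the given delimiter pattern.
    By default the delimiter is any run of white space
    (blanks, tabs, carriage returns, and line feeds)."""
    return [token for token in re.split(delimiter, text) if token]

# Default white-space tokenization:
print(tokenize("the quick\tbrown\nfox"))  # ['the', 'quick', 'brown', 'fox']

# A specialized tokenizer can use a different delimiter,
# here commas and white space together:
print(tokenize("one, two,three", delimiter=r"[,\s]+"))  # ['one', 'two', 'three']
```

Real tokenizers handle punctuation, contractions, and sentence boundaries with far more care, but the splitting idea is the same.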
Another important NLP task is determining a word's stem and lexical meaning. This is useful for deriving more information about the words being processed.
The stem of a word is its root. For example, the stem of the word antiquated is antiqu. While this may not seem to be the correct stem, it is the ultimate base of the word.
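A real stemmer, such as the Porter stemmer, applies several ordered passes of suffix-stripping rules. The following toy stemmer is only a sketch of the idea, using a small hand-picked suffix list chosen for illustration; it happens to reproduce the antiquated to antiqu example:

```python
# A hypothetical, minimal suffix list for illustration only;
# production stemmers apply many rules in ordered passes.
SUFFIXES = ("ational", "ated", "ing", "ed", "es", "s")

def stem(word):
    """Strip the longest matching suffix, keeping at least
    three characters as the base of the word."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("antiquated"))  # antiqu
print(stem("cats"))        # cat
```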
The lexical meaning of a word is its dictionary meaning, which is not concerned with the context in which the word is used.
Lemmatization is also concerned with finding the root of a word, but it uses a more detailed dictionary to do so. The stem of a word may vary depending on the word's form; with lemmatization, however, the root will always be the same. Stemming is often used when an approximate determination of the root is acceptable.
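To illustrate the difference, a lemmatizer can be sketched as a dictionary lookup (the word list below is a tiny hypothetical sample, not a real lexicon): irregular forms such as is and was have unrelated surface strings, yet both map to the same lemma, be.

```python
# A tiny hypothetical lemma dictionary for illustration;
# real lemmatizers draw on full lexicons such as WordNet.
LEMMAS = {
    "is": "be",
    "was": "be",
    "are": "be",
    "mice": "mouse",
}

def lemmatize(word):
    """Return the word's lemma, or the word itself if unknown."""
    return LEMMAS.get(word, word)

# Different forms, same lemma:
print(lemmatize("is"), lemmatize("was"))  # be be
```

The design choice is the trade-off named above: a suffix-stripping stemmer is fast and needs no lexicon, while a lemmatizer needs dictionary data but returns a consistent root for every form.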
There are many other tasks for processing natural language.
For more information, you can see:
Wikipedia. “Natural language processing”