-
NLP – Natural language processing
Natural Language Processing, or NLP, is broadly defined as the automatic manipulation of natural language, such as speech and text, by software. One of the first steps in NLP is extracting tokens from text. Tokenization splits text into tokens, which are typically words. Usually, tokens are split based upon […]
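To make the idea of tokenization concrete, here is a minimal sketch that assumes tokens are separated by runs of whitespace. The class name `WhitespaceTokenizer` and its `tokenize` method are illustrative, not part of any library, and real tokenizers usually also handle punctuation.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceTokenizer {

    // Splits text on runs of whitespace. Punctuation stays attached to
    // the neighbouring word (an assumption made to keep the sketch small).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return tokens; // no tokens in blank input
        }
        for (String token : trimmed.split("\\s+")) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [The, quick, brown, fox.]
        System.out.println(tokenize("The quick brown fox."));
    }
}
```

Note that the final token keeps its full stop, which is one reason whitespace splitting alone is rarely enough.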
-
English Lemmatizer
Lemmatization is the process of reducing an inflected spelling to its lexical root or lemma form. The lemma form is the base form or headword form you would find in a dictionary. The combination of the lemma form with its word class (noun, verb, etc.) is called the lexeme. In English, the base form for […]
-
Word list
A word list (or lexicon) is a list of a language’s words, generally sorted by frequency of occurrence (either by levels or as a ranked list) within some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency “provides a rational basis for making sure that learners get the best return […]
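A frequency-ranked word list of this kind can be sketched in a few lines. The example below is only an illustration under simple assumptions: words are runs of letters, case is ignored, and ties are broken alphabetically. The class name `FrequencyWordList` and the method `rank` are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class FrequencyWordList {

    // Counts each word in the corpus and returns the entries ranked by
    // descending frequency, then alphabetically for equal counts.
    static List<Map.Entry<String, Integer>> rank(String corpus) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: anything that is not a letter separates words.
        for (String word : corpus.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return ranked;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Integer> entry : rank("the cat and the hat and the bat")) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```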
-
Text corpus
The term language corpus is used to mean a number of rather different things. It may refer simply to any collection of linguistic data (for example, written, spoken, signed, or multimodal), although many practitioners prefer to reserve it for collections which have been organized or collected with a particular end in view, generally to characterize […]
-
Bound Morphemes
A morpheme is the smallest meaningful lexical item in a language. A bound morpheme is a morpheme that cannot stand alone as a word, such as a prefix or a suffix. A morpheme is not a word. The difference between a morpheme and a word is that a morpheme sometimes does not stand […]
-
Lexicon
A lexicon is the vocabulary of a language or branch of knowledge. The term can also mean a list of all the words used in a particular language or subject, or a dictionary. Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalogue of a language’s words; and a grammar, a system of rules […]
-
Counting characters in Java
There are many ways to count the number of characters in a String. Below is a simple, naive approach:
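As a minimal sketch of what a naive approach could look like (the class and method names here are illustrative assumptions, not the original listing), the example counts characters three ways: UTF-16 code units, Unicode code points, and non-whitespace characters.

```java
class CharCounter {

    // Number of UTF-16 code units; this is what String.length() reports.
    static int countAll(String text) {
        return text.length();
    }

    // Number of Unicode code points; differs from length() when the
    // string contains characters outside the Basic Multilingual Plane.
    static int countCodePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    // Naive loop that skips whitespace such as spaces, tabs and newlines.
    static int countNonWhitespace(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String sample = "hello world";
        System.out.println(countAll(sample));           // 11
        System.out.println(countNonWhitespace(sample)); // 10
    }
}
```

Which count is the right one depends on what "character" should mean for the text at hand.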
-
Counting words in Java
This is a simple way to count words in a string in Java. StringTokenizer automatically takes care of whitespace for us, including tabs and carriage returns. In some cases, such as “he-man”, we’d want “he” and “man” to be different words, but since there’s no whitespace between them, the defaults fail us. Fortunately, we can […]
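The following sketch shows word counting with `java.util.StringTokenizer`, first with the default whitespace delimiters and then with a custom delimiter set. Adding the hyphen to the delimiters is one possible way to split “he-man”; it is an assumption made for illustration, and the class name `WordCounter` is hypothetical.

```java
import java.util.StringTokenizer;

class WordCounter {

    // Counts words using the default delimiters: space, tab, newline,
    // carriage return and form feed.
    static int countWords(String text) {
        return new StringTokenizer(text).countTokens();
    }

    // Counts words using a caller-supplied set of delimiter characters.
    static int countWords(String text, String delimiters) {
        return new StringTokenizer(text, delimiters).countTokens();
    }

    public static void main(String[] args) {
        String sample = "he-man\tand  she-ra";
        // Default delimiters keep hyphenated words together.
        System.out.println(countWords(sample));                // 3
        // Treating '-' as a delimiter splits them apart.
        System.out.println(countWords(sample, " \t\n\r\f-"));  // 5
    }
}
```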