Basic Terminologies
Corpus - the entire collection of text we are working with, i.e., a set of documents.
Document - a single unit of text within the corpus, such as a sentence, a paragraph, or a whole file.
Vocabulary - the set of unique words that appear in the corpus.
Words - the individual tokens that make up a document.
Tokenization
Tokenization is the process of breaking text into smaller units called tokens, typically words and punctuation marks.
For example, let us take the most iconic dialogue of all time from Bram Stoker’s “Dracula”:
I have crossed oceans of time to find you.
Tokenizing the above using the Natural Language Toolkit (NLTK) library gives us the following list:
['I', 'have', 'crossed', 'oceans', 'of', 'time', 'to', 'find', 'you', '.']
Example Code
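A minimal sketch using NLTK’s word_tokenize (assuming the nltk package is installed and the punkt tokenizer data has been downloaded):

```python
import nltk

nltk.download('punkt')  # one-time download of the tokenizer models

from nltk.tokenize import word_tokenize

sentence = "I have crossed oceans of time to find you."
tokens = word_tokenize(sentence)  # split into word and punctuation tokens
print(tokens)
```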
Output
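Running the sketch above should print:

```
['I', 'have', 'crossed', 'oceans', 'of', 'time', 'to', 'find', 'you', '.']
```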
Stopwords
Stopwords are words that do not add much value to the overall meaning of a sentence, such as “this”, “of”, “and”, etc. They are usually removed from a dataset as a pre-processing step to reduce complexity, allowing our NLP model to focus on the more important words.
From our previous example, the tokens we are left with after removing the stopwords are:
['crossed', 'oceans', 'time', 'find', '.']
Example Code
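A minimal sketch using NLTK’s built-in English stopword list (assuming the stopwords corpus has been downloaded; the case-insensitive check mirrors the output above, where “I” is removed):

```python
import nltk

nltk.download('punkt')      # tokenizer models
nltk.download('stopwords')  # stopword lists

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sentence = "I have crossed oceans of time to find you."
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not stopwords (compared case-insensitively)
tokens = word_tokenize(sentence)
filtered = [token for token in tokens if token.lower() not in stop_words]
print(filtered)
```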
Output
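Running the sketch above should print:

```
['crossed', 'oceans', 'time', 'find', '.']
```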
Stemming
Stemming is a process that chops off the last few characters of a word to reduce it to a crude root form, often producing stems with incorrect meanings and spellings.
For example, the words “Historical” and “History” are stemmed to “histor” and “histori” respectively, thus losing their meaning.
Advantages
- Stemming is really fast, since it only applies simple rule-based suffix stripping
Disadvantages
- Stemming might remove the meaning of a word, since the resulting stem is often not a valid word
Example Code
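A minimal sketch using NLTK’s PorterStemmer (one of several stemmers NLTK provides; the exact stems can vary between algorithms):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Both words are crudely truncated rather than mapped to a dictionary form
for word in ["historical", "history"]:
    print(word, "->", stemmer.stem(word))
```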
Output
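Running the sketch above should print:

```
historical -> histor
history -> histori
```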
Lemmatization
Lemmatization considers the context (such as the part of speech) and converts the word to its meaningful base form, which is called a lemma.
Advantages
- Retains the meaning of words, since every lemma is a valid dictionary word
Disadvantages
- Computationally more expensive than stemming - it needs its own lexical corpus (such as WordNet) to look up the base/root word
Example Code
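A minimal sketch using NLTK’s WordNetLemmatizer (assuming the wordnet corpus has been downloaded; the pos argument tells the lemmatizer the part of speech and defaults to noun):

```python
import nltk

nltk.download('wordnet')  # lexical database used to look up lemmas

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("histories"))         # noun (default)
print(lemmatizer.lemmatize("crossed", pos="v"))  # verb
print(lemmatizer.lemmatize("oceans"))            # noun (default)
```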
Output
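Running the sketch above should print valid dictionary words:

```
history
cross
ocean
```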
Note
How to choose between stemming and lemmatization?
- Stemming can be used in cases where the output does not need to have meaningful words/sentences/paragraphs, such as Spam Classification, Review Classification, etc.
- Lemmatization can be used in cases where meaningful words/sentences/paragraphs need to be generated as output, such as Language Translation, ChatBots, Text Summarization, etc.
Word embeddings
Everything boils down to numbers and tensors when it comes to developing AI. This is where word embeddings come into play. A word embedding is a technique in which individual words are represented as real-valued vectors in a lower-dimensional space, capturing inter-word semantics.
Types of word embeddings
Word embeddings can be divided mainly into two categories, based on how the words are embedded:
- Count or frequency - This includes techniques such as One-Hot Encoding, Bag of Words, and TF-IDF (see the Bag of Words sketch after this list)
- Deep Learning Trained Models - Word2Vec (CBOW, Skip-gram)
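As a quick illustration of the count/frequency family, here is a minimal Bag of Words sketch using scikit-learn’s CountVectorizer (scikit-learn and the two-sentence toy corpus are assumptions for this example, not part of the original text):

```python
from sklearn.feature_extraction.text import CountVectorizer

# A tiny toy corpus of two documents
corpus = [
    "I have crossed oceans of time to find you.",
    "Time and tide wait for no one.",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)  # documents x vocabulary count matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bow.toarray())                       # each row is one document's word counts
```

Each row of the matrix is a word-count vector for one document; note that CountVectorizer lowercases the text and drops single-character tokens such as “I” by default.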