Basic Terminologies

Corpus

A corpus is the entire collection of text data we are working with, for example, all the reviews on a website or the full text of a novel.

Document

A document is a single unit of text within a corpus, such as one review, one article, or one sentence, depending on how the data is organized.

Vocabulary

The vocabulary is the set of all unique words that occur in a corpus.

Words

Words are the individual units that make up a document; after tokenization, they are often referred to as tokens.
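
A minimal sketch in plain Python may help tie these four terms together (the two-document corpus and the simple whitespace split below are illustrative assumptions, not part of any library):

# A minimal sketch of the four terms, using an illustrative two-document
# corpus and a simple whitespace split (real pipelines use a proper tokenizer).

corpus = [
    "I have crossed oceans of time to find you.",  # document 1
    "The dead travel fast.",                       # document 2
]

document = corpus[0]       # a single document from the corpus
words = document.split()   # the individual words of that document

# The vocabulary: every unique (lowercased) word across all documents.
vocabulary = {word.lower() for doc in corpus for word in doc.split()}

print("Words:", words)
print("Vocabulary size:", len(vocabulary))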

Tokenization

Tokenization is the process of breaking text into smaller units called tokens, typically individual words and punctuation marks.

For example, let us take the most iconic dialogue of all time from Bram Stoker’s “Dracula”:

I have crossed oceans of time to find you.

The above when tokenized using the Natural Language Toolkit (NLTK) library gives us a list as follows:

['I', 'have', 'crossed', 'oceans', 'of', 'time', 'to', 'find', 'you', '.']

Example Code

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # Download the necessary data for word_tokenize

sentence = "I have crossed oceans of time to find you."
tokens = word_tokenize(sentence)

print("Original Sentence:", sentence)
print("Tokenized Sentence:", tokens)

Output

 Original Sentence: I have crossed oceans of time to find you.
 Tokenized Sentence: ['I', 'have', 'crossed', 'oceans', 'of', 'time', 'to', 'find', 'you', '.']

Stopwords

Stopwords are words that do not add much value to the overall meaning of a sentence, such as “this”, “of”, and “and”. They are usually removed from a dataset as a pre-processing step to reduce complexity, allowing our NLP model to focus on the more important words.

From our previous example, after removing the stopwords we are left with the following tokens:

['crossed', 'oceans', 'time', 'find', '.']

Example Code

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")  # Download the necessary data for word_tokenize
nltk.download("stopwords")  # Download the stopwords dataset

sentence = "I have crossed oceans of time to find you."
tokens = word_tokenize(sentence)

stop_words = set(stopwords.words("english"))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Sentence:", sentence)
print("Tokens after removing stopwords:", filtered_tokens)

Output

 Original Sentence: I have crossed oceans of time to find you.
 Tokens after removing stopwords: ['crossed', 'oceans', 'time', 'find', '.']

Stemming

Stemming is a process that removes the last few characters of a word to reduce it to a base form (the stem), often producing stems with incorrect spellings that are not real words.

For example, the words “Historical” and “History” are stemmed to “histor” and “histori” respectively, thus losing their meaning.

Advantages

  • Stemming is computationally very fast

Disadvantages

  • Stemming may strip a word down to a stem that is not a real word, losing its meaning

Example Code

import nltk
from nltk.stem import PorterStemmer

# Create a PorterStemmer instance
porter_stemmer = PorterStemmer()

stemmed_word1 = porter_stemmer.stem("Historical")
stemmed_word2 = porter_stemmer.stem("History")

print(stemmed_word1)
print(stemmed_word2)

Output

 histor
 histori

Lemmatization

Lemmatization considers the word’s part of speech and converts the word to its meaningful base form, which is called the lemma.

Advantages

  • Retains meaning of words

Disadvantages

  • Computationally more expensive than stemming - it needs its own lexical database (such as WordNet) to look up the base/root word

Example Code

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")

# Create a WordNetLemmatizer instance
lemmatizer = WordNetLemmatizer()
lemmatized_word1 = lemmatizer.lemmatize("Historical")
lemmatized_word2 = lemmatizer.lemmatize("History")

print(lemmatized_word1)
print(lemmatized_word2)

Output

 Historical
 History
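
One detail worth keeping in mind: NLTK’s WordNetLemmatizer treats every word as a noun unless told otherwise, so supplying the part of speech through the pos argument can change the result. A small sketch:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

# With the default part of speech (noun), "crossed" comes back unchanged;
# telling the lemmatizer it is a verb yields the base form "cross".
print(lemmatizer.lemmatize("crossed"))           # crossed
print(lemmatizer.lemmatize("crossed", pos="v"))  # cross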

Note

How to choose between stemming and lemmatization?

  • Stemming can be used in cases where the output does not need to have meaningful words/sentences/paragraphs, such as Spam Classification, Review Classification, etc.
  • Lemmatization can be used in cases where meaningful words/sentences/paragraphs need to be generated as output, such as Language Translation, Chatbots, Text Summarization, etc.

Word embeddings

Everything boils down to numbers and tensors when it comes to developing AI, and this is where word embeddings come into play. Word embedding is a technique in which individual words are represented as real-valued vectors in a lower-dimensional space, in a way that captures the semantic relationships between words.
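
At its simplest, an embedding is just a lookup table that maps each word to a dense vector. A minimal sketch with NumPy (the vocabulary, the 4-dimensional vectors, and the random values below are illustrative placeholders, not a trained model):

import numpy as np

# Illustrative vocabulary and embedding dimension; a trained model would
# learn these vectors from data, here they are random placeholders.
vocabulary = ["oceans", "time", "find", "cross"]
embedding_dim = 4

rng = np.random.default_rng(42)
embedding_matrix = rng.normal(size=(len(vocabulary), embedding_dim))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# Looking up a word's embedding is just indexing into the matrix.
print("Embedding for 'oceans':", embedding_matrix[word_to_index["oceans"]])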

Types of word embeddings

Word embeddings can broadly be divided into two categories, based on how the word representations are built:

  1. Count or frequency - this includes techniques such as One-Hot Encoding, Bag of Words and TF-IDF (see the first sketch below)
  2. Deep learning trained models - Word2Vec (CBOW, Skip-gram), shown in the second sketch below
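
For the count/frequency-based family, scikit-learn ships ready-made vectorizers. A minimal sketch of Bag of Words and TF-IDF on an illustrative two-sentence corpus (assuming scikit-learn 1.0+ for get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Illustrative corpus; any list of strings will do.
corpus = [
    "I have crossed oceans of time to find you.",
    "The dead travel fast.",
]

# Bag of Words: each document becomes a vector of raw word counts.
bow = CountVectorizer()
bow_matrix = bow.fit_transform(corpus)
print("BoW vocabulary:", bow.get_feature_names_out())
print("BoW vectors:\n", bow_matrix.toarray())

# TF-IDF: counts are re-weighted so words shared by every document count less.
tfidf = TfidfVectorizer()
print("TF-IDF vectors:\n", tfidf.fit_transform(corpus).toarray())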
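
For the deep-learning-based family, the gensim library provides a Word2Vec implementation. A minimal sketch (the tiny pre-tokenized corpus and the hyperparameters are illustrative; assuming gensim 4.x, where the dimensionality parameter is named vector_size):

from gensim.models import Word2Vec

# Word2Vec expects a list of token lists; this tiny corpus is illustrative.
sentences = [
    ["i", "have", "crossed", "oceans", "of", "time", "to", "find", "you"],
    ["the", "dead", "travel", "fast"],
]

# sg=0 trains CBOW; sg=1 would train Skip-gram instead.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)

# Each word in the vocabulary now maps to a dense real-valued vector.
print(model.wv["oceans"])                     # the learned 50-d vector
print(model.wv.similarity("oceans", "time"))  # cosine similarity of two words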