One-Hot Encoding
What is One-Hot Encoding?
Imagine we have the following document:
From this document, we have a vocabulary (unique words and symbols) of size 7 as follows:
Notice how every word has been converted to lower case. Otherwise, 'The' and 'the' would be treated as separate unique words in our vocabulary, and the vocabulary size would then be 8.
After establishing the vocabulary, every word is represented as a binary vector of 0s and 1s. The vector's length equals the vocabulary size, and each position in the vector corresponds to a distinct word in the vocabulary. To encode a word, its position in the vector is set to 1 while all other positions are set to 0. Each word is therefore uniquely characterized by a binary vector in which exactly one element is 1 (the "hot" position) and all others are 0.
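This mechanism can be sketched in a few lines of plain Python. The sentence below is a hypothetical stand-in, not the document from the example above:

```python
# A minimal sketch of one-hot encoding a single word.
# The sentence is a hypothetical example, not the original document.
document = "the cat sat on the mat"

# Build the vocabulary: lower-case the text and keep unique tokens.
tokens = document.lower().split()
vocabulary = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']

def one_hot(word, vocabulary):
    """Return a binary vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("cat", vocabulary))  # [1, 0, 0, 0, 0]
```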
One-Hot Encoding of the above document would look like:
Advantages
- Easy to implement
- Intuitive
Disadvantages
- Creates sparse vectors which can be computationally expensive for large vocabularies
- Out-of-vocabulary problem - words not present in the training data have no position in the vocabulary and cannot be encoded
- Does not capture semantic relationships between words; every pair of one-hot vectors is equally distant (orthogonal)
Example Code
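A possible implementation, shown here as a sketch that encodes every token of a hypothetical sentence (the original example document is not reproduced here) into a binary matrix:

```python
# Hedged sketch: one-hot encode an entire (hypothetical) document.
document = "the quick brown fox jumps over the lazy dog"

tokens = document.lower().split()
vocabulary = sorted(set(tokens))
index = {word: i for i, word in enumerate(vocabulary)}

# Encode every token as one row of a binary matrix.
matrix = []
for token in tokens:
    row = [0] * len(vocabulary)
    row[index[token]] = 1
    matrix.append(row)

for token, row in zip(tokens, matrix):
    print(f"{token:>6} {row}")

# Note: looking up a word absent from the vocabulary, e.g. index["cat"],
# raises a KeyError, which illustrates the out-of-vocabulary problem.
```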
Output