By Elavarasan P

Natural Language Processing (Basics to SOTA models) – Part-1

Introduction

This article provides a complete overview of Natural Language Processing, starting from use cases, basic NLP libraries, preprocessing, embeddings, the deep learning techniques used, and Transformer (attention mechanism) based SOTA models, along with how to use these models from the Hugging Face libraries. Transformer-based models have revolutionized NLP, but building context is still a challenge as sentence length increases. My aim is to provide examples that make the concepts easy to understand rather than explaining everything in detail.

Embeddings are the inputs for most NLP tasks. There are different ways of converting text into numeric form, ranging from large sparse vectors to small dense vector representations.

  • Part-1 of this blog covers topics up to embeddings.
  • Part-2 will cover various deep learning architectures, pre-trained models and a sample Hugging Face implementation.

NLP use cases:

Category | Use case | Examples
--- | --- | ---
Generate a new sentence from input text | a) Machine translation b) Text summarization | a) Translation from English to French b) Inshorts news summary
Text classification | a) Spam detection b) Sentiment classification | a) Gmail spam folder b) Customer sentiment in terms of positive, negative and neutral
Classify each word in a sentence | a) NER b) POS tagging | a) Identifying person, organization and location in a sentence b) Noun and verb identification
Natural language inferencing | Recognizing textual entailment | Deciding whether a hypothesis sentence logically follows from a premise sentence

There are many other such use cases:

  • Question answering
  • Fill in the blanks
  • Sentence completion
  • Autocorrect
  • Auto-completion (Gmail)

To get an idea of industry-wise use cases, visit here.

Real-world examples: There are many real-world examples like news summaries, Grammarly, Alexa, Google Translate, chatbots, etc.

For the extraction of contractual terms use case, go here. Given a contract document, how do we extract the agreement name, start date, end date, renewal term, parties involved, termination for convenience, audit rights, insurance, liquidated damages, etc.? This becomes important for larger organizations that have many contracts with various vendors.

Approaches to NLP

There are three broad approaches to NLP:

  • Heuristic (Rule based)
    • Regular Expressions, wordnet
  • Machine Learning
    • Naïve Bayes, SVM, Logistic regression, LDA, Hidden Markov Model
  • Deep Learning (RNN, LSTM, GRU, Transformers)
    • Both machine learning and rule-based approaches work at the word level. They do not address sentence-level NLP tasks in terms of the context between two sentences and the sequence of words in a sentence. Even with deep learning methods, there is still room for improvement in handling context and ambiguity as sentence length increases.

Basic NLP Libraries

This section lists the basic libraries, the NLP tasks they can perform, whether they are based on neural networks, and the areas where they are most suitable.

Library | NLP tasks | Neural network based | Best for | Unique features
--- | --- | --- | --- | ---
NLTK | Tokenization, stemming, tagging, parsing | No | Training, education, research | Multilingual support, 50+ corpora
TextBlob | POS tagging, noun phrase extraction, sentiment analysis, classification, translation | No | NLP prototyping | Translation, spelling correction
Gensim | Text similarity, summarization, topic modelling | No | Converting words and sentences into vectors | Scalability, high performance
spaCy | Tokenization, CNN-based tagging, parsing, NER, classification, sentiment analysis | Yes | Business, production | 50+ languages available for tokenization
StanfordNLP / Stanza | Dependency parsing, sentiment analysis, NER | Yes | Production grade | Statistical models
Link: More Details

Comparison of NLP libraries in terms of when to use them

Text Preprocessing

Tokenization

Tokenization is a way of separating a piece of text into smaller units called tokens, such as sentences, words, sub-words and characters.

Type of tokenization | Meaning | Input | Output
--- | --- | --- | ---
Sentence | Splits a paragraph into individual sentences | Natural language processing is one of the fields in programming where the natural language is processed by the software. This has many applications like sentiment analysis, language translation, fake news detection, grammatical error detection etc. | 1) Natural language processing is one of the fields in programming where the natural language is processed by the software. 2) This has many applications like sentiment analysis, language translation, fake news detection, grammatical error detection etc.
Word | Splits a piece of text into individual words based on a certain delimiter | work hard. Never give up | work, hard, Never, give, up
Subword | Splits a piece of text into subwords (or n-gram characters) | smarter | smart, er
Character | Splits a piece of text into a set of characters | smarter | s, m, a, r, t, e, r
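As a quick illustration, sentence and word tokenization can be done with NLTK. This is a minimal sketch, assuming the punkt tokenizer data is downloaded on first use (resource names can vary slightly across NLTK versions):

```python
import nltk
nltk.download("punkt", quiet=True)  # sentence/word tokenizer models, needed once

from nltk.tokenize import sent_tokenize, word_tokenize

paragraph = ("Natural language processing is one of the fields in programming "
             "where the natural language is processed by the software. "
             "This has many applications like sentiment analysis and language translation.")

print(sent_tokenize(paragraph))                   # splits the paragraph into two sentences
print(word_tokenize("work hard. Never give up"))  # ['work', 'hard', '.', 'Never', 'give', 'up']
```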

Stemming:

Stemming simply removes, or "stems", the last few characters of a word, often leading to incorrect meanings and spellings.

Input | Output
--- | ---
Change, Changes, Changed, changing | Chang
studying, studied, studies | studi
was | wa
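A minimal sketch with NLTK's Porter stemmer reproduces the behaviour shown above:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["change", "changes", "changed", "changing",
             "studying", "studied", "studies", "was"]:
    print(word, "->", stemmer.stem(word))
# The change* forms collapse to "chang", the study* forms to "studi", and "was" to "wa"
```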

Lemmatization

Lemmatization considers the context and converts the word to its meaningful root form (lemma). This operation is slower than stemming.

Input | Output
--- | ---
Change, Changes, Changed, changing | Change
studying, studied, studies | study
was | be
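A minimal sketch with NLTK's WordNet lemmatizer; note that it needs the part of speech supplied (here pos="v" for verbs) to return the forms shown above:

```python
import nltk
nltk.download("wordnet", quiet=True)  # WordNet lemma dictionary, needed once

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["changes", "changed", "changing", "studying", "studied", "was"]:
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# changes/changed/changing -> change, studying/studied -> study, was -> be
```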

Text Normalization
The process of transforming text into a single canonical / standard form

Input | Output
--- | ---
2moro, 2mrrw, 2morrow, 2mrw, tomrw | tomorrow
b4 | before
otw | on the way
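Text normalization is often implemented with a simple lookup table. Here is a minimal sketch; the mapping below is hand-built purely for illustration:

```python
# Hand-built canonical forms (illustrative only; real systems use much larger dictionaries)
CANONICAL = {
    "2moro": "tomorrow", "2mrrw": "tomorrow", "2morrow": "tomorrow",
    "2mrw": "tomorrow", "tomrw": "tomorrow",
    "b4": "before", "otw": "on the way",
}

def normalize(text: str) -> str:
    # Replace every token that has a known canonical form, keep the rest unchanged
    return " ".join(CANONICAL.get(token.lower(), token) for token in text.split())

print(normalize("otw will reach b4 dinner 2moro"))
# "on the way will reach before dinner tomorrow"
```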

POS Tagging

The part of speech (POS) explains how a word is used in a sentence. There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions and interjections. In practice, taggers use a finer-grained set of 36 tags (the Penn Treebank tagset).

Input | Output
--- | ---
I like to read books | I – PRP, like – VBP, to – TO, read – VB (verb), books – NNS (noun, plural)
Link: List of POS
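A minimal sketch with NLTK's default tagger, which produces the Penn Treebank tags listed above (resource names may differ slightly across NLTK versions):

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)  # the default English POS tagger

from nltk import pos_tag, word_tokenize

print(pos_tag(word_tokenize("I like to read books")))
# [('I', 'PRP'), ('like', 'VBP'), ('to', 'TO'), ('read', 'VB'), ('books', 'NNS')]
```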

Chunking:

Chunking extracts phrases from unstructured text and gives structure to them. It takes POS tags as input and produces chunks (phrases) as output.
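A minimal sketch using NLTK's RegexpParser; the chunk grammar below is just an illustrative pattern that groups an optional determiner, any adjectives and a noun into a noun phrase (NP):

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

from nltk import pos_tag, word_tokenize, RegexpParser

grammar = "NP: {<DT>?<JJ>*<NN.*>}"     # illustrative noun-phrase pattern over POS tags
chunker = RegexpParser(grammar)

tags = pos_tag(word_tokenize("The little boy read an interesting book"))
print(chunker.parse(tags))             # an nltk.Tree with NP chunks such as (NP The/DT little/JJ boy/NN)
```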

Stop words:

Stop words are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.) and do not add much information to the text. Examples of stop words in English are "the", "a", "an", "so" and "what".
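A minimal sketch that removes NLTK's English stop words from a sentence:

```python
import nltk
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is an example showing off stop word filtration")
print([t for t in tokens if t.lower() not in stop_words])
# e.g. ['example', 'showing', 'stop', 'word', 'filtration']
```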

Embeddings

Embedding methods (also called encoding or vectorizing) convert symbolic representations (i.e. words, emojis, categorical items, dates, times, other features, etc.) into meaningful numbers. Word embeddings are a way to represent words, and even whole sentences, numerically.

a. Bag of words (BOW):

A bag of words is created based on the number of times each word occurs in each document. In the resulting table, each row represents a document ID and each column represents a word.
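A minimal sketch with scikit-learn's CountVectorizer (assuming scikit-learn 1.0+, where the vocabulary is read with get_feature_names_out; older versions use get_feature_names):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Text processing is necessary.",
        "Text processing is necessary and important."]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)       # sparse document-term matrix of raw counts

print(vectorizer.get_feature_names_out())  # the columns (words)
print(bow.toarray())                       # one row per document, one count per word
```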

b. TF-IDF:

Term frequency (TF) is calculated for every word in each document (word vs. document). Inverse document frequency (IDF) is calculated for each word across all the documents together. The TF-IDF value (TF × IDF) is then generated for every word in each document.

Document 1: Text processing is necessary.

Document 2: Text processing is necessary and important.

  • TF-IDF gives importance to uncommon words.
  • There is a chance of overfitting.
  • Both BOW and TF-IDF do not store semantic (meaning) information.
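A minimal sketch with scikit-learn's TfidfVectorizer on the two documents above; the words that appear in only one document ("and", "important") receive relatively higher weights:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["Text processing is necessary.",
        "Text processing is necessary and important."]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)        # TF-IDF weight for every (document, word) pair

print(tfidf.get_feature_names_out())
print(weights.toarray().round(2))          # uncommon words get higher weights than common ones
```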

c) Word2Vec:

  • In Word2Vec, each word is represented as a vector of 32 or more dimensions instead of a single number.
  • Semantic information and the relations between different words are preserved.

Word2Vec embeddings can be created using the continuous bag of words (CBOW) and skip-gram models. Both are neural network models.
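A minimal sketch with gensim (4.x API); sg=0 trains CBOW and sg=1 trains skip-gram, and the toy corpus here is only for illustration:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (a real corpus would be much larger)
sentences = [["word2vec", "has", "a", "deep", "learning", "model", "working", "in", "the", "backend"],
             ["text", "processing", "is", "necessary", "and", "important"]]

model = Word2Vec(sentences, vector_size=32, window=2, min_count=1, sg=0)  # sg=1 for skip-gram

print(model.wv["learning"].shape)                # (32,) dense vector for the word
print(model.wv.most_similar("learning", topn=3)) # nearest words in the embedding space
```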

i) CBOW:

Consider the sentence "Word2Vec has a deep learning model working in the backend." There can be pairs of context words and target (center) words. If we consider a context window size of 2, we will have pairs like ([deep, model], learning), ([model, in], working), ([a, learning], deep), etc., in the format ([2 context words], target word). The model then tries to predict the target word based on its context words.

  • The context words are first passed as input to an embedding layer (initialized with some random weights), as shown in the figure below.
  • The word embeddings are then passed to a lambda layer where we average them.
  • These averaged embeddings go to a dense softmax layer that predicts the target word. We compare the prediction with the actual target word, compute the loss, and perform backpropagation each epoch to update the embedding layer.
  • Once training is completed, we can extract the embeddings of the needed words from the embedding layer (a sketch of this architecture follows this list).
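A minimal Keras sketch of this CBOW architecture (embedding layer, averaging lambda layer, dense softmax), with a hypothetical vocabulary size and window chosen only for illustration:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embed_dim, window = 1000, 32, 2         # illustrative values
context_len = 2 * window                             # context words on both sides of the target

context_ids = layers.Input(shape=(context_len,), dtype="int32")
embedded = layers.Embedding(vocab_size, embed_dim, name="cbow_embedding")(context_ids)
averaged = layers.Lambda(lambda x: tf.reduce_mean(x, axis=1))(embedded)   # average the context embeddings
target_probs = layers.Dense(vocab_size, activation="softmax")(averaged)   # predict the target word

cbow = Model(context_ids, target_probs)
cbow.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
cbow.summary()
# After training, the word vectors live in cbow.get_layer("cbow_embedding").get_weights()[0]
```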

ii) Skip-gram model:

In the skip-gram model, given a target (center) word, the context words are predicted. So, considering the same sentence "Word2Vec has a deep learning model working in the backend." and a context window size of 2, given the center word 'learning', the model tries to predict ['deep', 'model'].

Since the skip-gram model has to predict multiple words from a single given word, we feed the model pairs of (X, Y), where X is our input and Y is our label. This is done by creating positive and negative input samples. Positive input samples have training data of the form [(target, context), 1], where target is the center word, context is one of its surrounding words, and the label 1 indicates a relevant pair. Negative input samples have the same form, [(target, random), 0]: instead of an actual surrounding word, a randomly selected word is paired with the target word, and the label 0 indicates an irrelevant pair.

  • Both the target and context words are passed to individual embedding layers, from which we get dense word embeddings for each of the two words.
  • We then use a merge layer to compute the dot product of these two embeddings.
  • The dot product value is sent to a dense sigmoid layer that outputs a value between 0 and 1.
  • The output is compared with the actual label, the loss is computed, and backpropagation is performed each epoch to update the embedding layers (see the sketch after this list).
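A minimal Keras sketch of the skip-gram architecture described above (two embedding layers, a dot product, and a sigmoid output), again with illustrative sizes:

```python
from tensorflow.keras import layers, Model

vocab_size, embed_dim = 1000, 32                    # illustrative values

target_id = layers.Input(shape=(1,), dtype="int32")
context_id = layers.Input(shape=(1,), dtype="int32")

# Separate embedding layers for the target word and the context word
target_vec = layers.Flatten()(layers.Embedding(vocab_size, embed_dim)(target_id))
context_vec = layers.Flatten()(layers.Embedding(vocab_size, embed_dim)(context_id))

dot = layers.Dot(axes=1)([target_vec, context_vec])      # similarity of the two embeddings
relevance = layers.Dense(1, activation="sigmoid")(dot)   # probability that the pair is a real one

skipgram = Model([target_id, context_id], relevance)
skipgram.compile(optimizer="adam", loss="binary_crossentropy")
# Training pairs: ([target, context], 1) for real pairs and ([target, random word], 0) for negative samples
```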

d) GloVe:

The GloVe model leverages global word-to-word co-occurrence counts over the entire corpus, whereas Word2Vec leverages co-occurrence within a local context (neighbouring words). GloVe aims to combine count-based matrix factorization and the context-based skip-gram model.

GloVe focuses on capturing the similarity in context between words. It is lighter and more efficient than word2vec.
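Pre-trained GloVe vectors can be loaded through gensim's downloader API. A minimal sketch, assuming the 50-dimensional Wikipedia/Gigaword model available under the name "glove-wiki-gigaword-50" (downloaded on first use):

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")      # 50-dimensional pre-trained GloVe vectors

print(glove["king"].shape)                      # (50,)
print(glove.most_similar("king", topn=3))       # semantically close words
```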

e) ELMo (Embeddings from Language Models, based on a bi-directional LSTM):

ELMo word vectors are computed on top of a two-layer bidirectional language model (biLM). The two layers are stacked together, and each layer has two passes: a forward pass and a backward pass.

ELMo – link to the research paper
  • The architecture above uses a character-level convolutional neural network (CNN) to represent the words of a text string as raw word vectors
  • These raw word vectors act as inputs to the first layer of the biLM
  • The forward pass contains information about a certain word and the context (other words) before that word
  • The backward pass contains information about the word and the context after it
  • This pair of information, from the forward and backward passes, forms the intermediate word vectors
  • These intermediate word vectors are fed into the next layer of the biLM
  • The final ELMo representation is the weighted sum of the raw word vectors and the two intermediate word vectors
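A rough sketch of obtaining contextual ELMo vectors, assuming the tensorflow_hub package and the google/elmo/3 module hosted on TensorFlow Hub (a TF1-style module, called here through its default signature):

```python
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.load("https://tfhub.dev/google/elmo/3")   # downloads the pre-trained biLM on first use

sentences = tf.constant(["The bank raised the interest rate",
                         "She sat on the river bank"])
outputs = elmo.signatures["default"](sentences)

# outputs["elmo"] holds the contextual embeddings, shape (batch, max_tokens, 1024);
# the two occurrences of "bank" get different vectors because their contexts differ
print(outputs["elmo"].shape)
```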

Summary of Embeddings:

Summary: context-independent and context-dependent word embeddings

Challenges with BOW and TFIDF

  • TF-IDF gives importance to uncommon words.
  • There is a chance of overfitting.
  • Both BOW and TF-IDF do not store semantic (meaning) information.

Word2Vec addresses the above challenges

Challenges with Word2Vec

  1. Word2Vec is a window-based model, so it does not benefit from the information in the whole document.
  2. It does not capture sub-word information, which could be useful since words derived from the same root (for example adjectives derived from nouns or verbs) hold information in common.
  3. Word2Vec cannot handle out-of-vocabulary (OOV) words, since words not seen during training cannot be vectorized.
  4. Finally, another problem that is not solved by Word2Vec is disambiguation: a word can have multiple senses, which depend on the context.

The first three problems are addressed by GloVe and FastText (whole-document statistics, sub-word information and OOV words), while the last one, disambiguation, is resolved by ELMo: the same word can carry different meanings in different contexts. Words like fair, lie, bat, lead, minute, project and second mean different things depending on the context.

I hope this blog provided an overview to get started. Part-2 of this NLP blog will cover various deep learning architectures, pre-trained models and a sample Hugging Face implementation.

Some useful references

NLP basics