NLP: Sentiment Analysis in the light of LSTM Recurrent Neural Networks

Usman Akhtar
8 min read · Sep 18, 2020


1. Introduction

Natural Language Processing (NLP) is the field of study that focuses on the interactions between human language and computers. One subproblem of NLP is Sentiment Analysis (SA), a computational method for categorizing the opinions expressed in a piece of text. Typical questions it answers include:

Products: What do people think about the new iPhone?
Public sentiment: How is consumer confidence?
Politics: What do people think about Trump?
Prediction: Predict market trends

2. Motivation

We humans do not fully understand how our own brains process language. So, is it possible to teach a machine to learn our language? Yes. The motivation for Sentiment Analysis is two-fold: both consumers and producers place a high value on customers' opinions about products and services.

3. Objective

Compare the performance of the different types of LSTM architectures for Sentiment Analysis.

Conventional LSTM network
Deep LSTM network
Bidirectional LSTM network

4. Challenges

Sentiment Analysis is a very challenging task. Reviews are collected from many different sources, such as blogs, tweets, hashtags, and product pages, and all of them matter when making a decision, so the task requires a deep understanding of the problem. Informal misspellings, emoticons, slang (often intentional, such as the many spellings of "cool": coool), and URLs lead to the problem of out-of-vocabulary words.
Models are trained to make a decision, positive or negative, from a limited amount of text.

Some of the challenges faced in Sentiment Analysis:

Identifying subjective portions of text
Associating sentiment with specific keywords
Domain Dependencies
Indirect negation of sentiment

5. Literature

Mihalcea et al. (2007) make use of English corpora to train sentence-level subjectivity classifiers for Romanian using two approaches, which they claim can be applied to any language, not only Romanian. In the first approach, they use a bilingual dictionary to translate an existing English lexicon into a target-language subjectivity lexicon. In the second, they generate a subjectivity-annotated corpus in the target language by projecting annotations from an automatically annotated English corpus. In Zhou et al. (2016), the authors propose an attention-based LSTM network that learns document representations of reviews in English and Chinese, exploring word vectors as the text representation.

Hochreiter and Schmidhuber (1997) pointed out that recurrent backpropagation in simple neural networks is extremely inefficient, or fails outright, at learning information that is spread over long time spans. They proposed long short-term memory networks, usually just called "LSTMs": a special kind of RNN capable of learning long-term dependencies. LSTMs work tremendously well on a large variety of problems and are now widely used.

In Socher et al. (2013), the authors propose the Recursive Neural Tensor Network (RNTN) architecture, which represents a phrase through word vectors and a parse tree, and then computes vectors for higher nodes in the tree using the same composition function. When trained on their new treebank, this model outperformed all previous methods on several metrics.

Santos and Gatti (2014) propose a new deep convolutional neural network that exploits information from the character level up to the sentence level to perform sentiment analysis of short texts. They evaluate their approach on two corpora from two different domains: the Stanford Sentiment Treebank (SSTb), which contains sentences from movie reviews, and the Stanford Twitter Sentiment corpus (STS), which contains Twitter messages.

a) Recurrent Neural Network

Recurrent Neural Networks, or RNNs, were designed to work with sequence prediction problems rather than isolated, fixed-size inputs.
Sequence prediction problems come in many forms and are best described by the types of inputs and outputs they support.

An RNN is a special type of neural network in which the connections between units form a directed cycle.
An RNN has an input layer, a variable number of hidden layers, and finally one output layer; a minimal sketch follows below.
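To make this concrete, here is a minimal sketch of a recurrent text classifier in Keras. The vocabulary size, embedding dimension, and layer width are illustrative assumptions, not values from this article.

```python
# Minimal recurrent text classifier (hyperparameters are illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=32),  # word indices -> dense vectors
    SimpleRNN(32),                              # hidden state carried across time steps
    Dense(1, activation="sigmoid"),             # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```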

Use RNNs For:

Text data
Speech data
Classification prediction problems
Regression prediction problems
Machine Translation

Don’t Use RNNs For:

Tabular data
Image data

b) Convolutional Neural Network

  • CNNs were designed to map image data to an output variable. They have the ability to develop an internal representation of a two-dimensional image.
  • The CNN input is traditionally two-dimensional, a field or matrix, but it can also be changed to one-dimensional, allowing the network to develop an internal representation of a one-dimensional sequence. This allows CNNs to be used more generally on other types of data that have a spatial relationship; a sketch follows below.
  • For example, there is an order relationship between words in a document of text, and there is an ordered relationship in the time steps of a time series.
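As a sketch of that one-dimensional use, the following Keras model slides convolutional filters over word windows. All hyperparameters here are assumptions for illustration.

```python
# One-dimensional CNN over word sequences (hyperparameters are illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=32),
    Conv1D(filters=64, kernel_size=5, activation="relu"),  # slide over 5-word windows
    GlobalMaxPooling1D(),                                  # keep strongest response per filter
    Dense(1, activation="sigmoid"),
])
```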

Use CNNs For:

Image data
Classification prediction problems
Regression prediction problems

Don’t Use CNNs For:

Text data
Time series data
Sequence input data

6. Proposed Methodology

The experimental evaluation comprises the following:

a) Dataset Description

b) Data Processing and Cleaning

c) LSTM Network

d) Deep LSTM Network

e) Bidirectional LSTM Network

f) Results and Discussion

a) Dataset Description

The dataset contains a total of 50,000 reviews, of which 25,000 are positive and 25,000 are negative. It is available at the following link:

https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews.
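One straightforward way to load it is sketched below; the file name and the review/sentiment column names follow the Kaggle dataset page, but treat them as assumptions to verify against your local copy.

```python
# Load the IMDB reviews CSV downloaded from Kaggle (path is an assumption).
import pandas as pd

df = pd.read_csv("IMDB Dataset.csv")   # columns: review, sentiment
print(df["sentiment"].value_counts())  # expect 25000 positive / 25000 negative
```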

b) Data Processing and Cleaning

  • First, HTML tags, punctuation marks, special symbols, single characters, and double spaces are removed.
  • Labels are digitized into 1 (positive) and 0 (negative).
  • The reviews are divided into a 50% test and 50% train split (a minimal sketch of these steps follows).
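A minimal cleaning pass matching these steps might look like the following. The article does not give the author's exact rules, so these regexes are illustrative; df is the DataFrame from the loading sketch above.

```python
import re
from sklearn.model_selection import train_test_split

def clean_review(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)     # drop punctuation and special symbols
    text = re.sub(r"\b[a-zA-Z]\b", " ", text)  # drop single characters
    return re.sub(r"\s+", " ", text).strip().lower()  # collapse extra whitespace

df["review"] = df["review"].apply(clean_review)
df["label"] = (df["sentiment"] == "positive").astype(int)  # digitize labels to 1/0
train_df, test_df = train_test_split(df, test_size=0.5, random_state=42)  # 50/50 split
```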

c) LSTM Network

  • LSTMs solve the vanishing gradient problem that affects simple RNNs.
  • Long short-term memory (LSTM) was designed on top of simple RNNs to model temporal sequences and their long-range dependencies more accurately.
  • In a simple RNN, the stored value is iteratively squashed by the activation function at every time step, which makes the backpropagated gradient essentially zero. Because of the "input gate", "output gate", "forget gate", and memory cell, the value stored in an LSTM is not squashed in this way.
  • An LSTM contains 3 gates that control information flow, each implemented using the logistic (sigmoid) function to compute a value between 0 and 1 (a model sketch follows this list):
  • The "input" gate controls the extent to which a new value flows into the memory.
  • The "forget" gate controls the extent to which a value remains in memory.
  • The "output" gate controls the extent to which the value in memory is used to compute the output activation of the block.
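Putting the pieces together, a conventional LSTM classifier for this task might look like the sketch below; the layer sizes are assumptions, since the article does not state its exact hyperparameters.

```python
# Conventional LSTM sentiment classifier (layer sizes are illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    LSTM(64),                        # input, forget, and output gates live inside this layer
    Dense(1, activation="sigmoid"),  # positive vs. negative review
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```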

Step-1: Decide how much of the past it should remember

The first step in the LSTM is the forget gate f_t, which decides which information from the previous time step should be omitted from the cell because it is no longer important. The decision is made by a sigmoid function that looks at the previous state h_(t-1) and the current input x_t.
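In the standard LSTM formulation, this is written as:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)

where \sigma is the sigmoid function and W_f, b_f are the forget gate's weights and bias.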

Step-2: Decide how much this unit should add to the current state

The second step has two parts: a sigmoid function and a tanh function. The sigmoid decides which values to let through (0 to 1), while the tanh assigns a weight to the values that are passed, deciding their level of importance (-1 to 1).
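In standard notation, the input gate i_t and the candidate values \tilde{C}_t combine with the forget gate to update the cell state:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)
\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t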

Step-3: Decide what part of the current cell state makes it to the output

The third step is to decide what the output will be. First, we run a sigmoid layer, which decides which parts of the cell state make it to the output. Then we put the cell state through tanh, to push the values between -1 and 1, and multiply it by the output of the sigmoid gate.
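In standard notation:

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)
h_t = o_t \odot \tanh(C_t)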

d) Deep LSTM Network

In a deep LSTM, the input at a given time step goes through multiple stacked LSTM layers, in addition to propagating through time.

The benefit of deep LSTM RNNs over a normal LSTM is that they can make better use of their parameters by distributing them over space, through multiple layers.
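In Keras this stacking is done by returning the full output sequence from every layer except the last; the depth and layer widths below are assumptions for illustration.

```python
# Stacked (deep) LSTM sketch: each layer feeds its full sequence to the next.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    LSTM(64, return_sequences=True),  # pass the whole output sequence downward
    LSTM(32),                         # final layer returns only the last hidden state
    Dense(1, activation="sigmoid"),
])
```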

e) Bidirectional LSTM Network

The principle of the Bidirectional LSTM (BiLSTM) is to split the neurons of a regular LSTM into two directions: one for the positive time direction (forward states) and one for the negative time direction (backward states). The outputs of one direction are not connected to the inputs of the opposite direction.
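A Keras sketch of this idea uses the Bidirectional wrapper, which runs one LSTM forward and one backward over the sequence and concatenates their outputs; as before, the sizes are illustrative assumptions.

```python
# Bidirectional LSTM sketch (layer sizes are illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Bidirectional, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=64),
    Bidirectional(LSTM(64)),         # forward and backward passes, outputs concatenated
    Dense(1, activation="sigmoid"),
])
```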

f) Results and Discussion

The models performed quite well on the test data set.
25,000 IMDB movie reviews were used for training and the same number for validation.
Overfitting was reduced, and the increase in accuracy is reflected in the results given below.

7. Conclusion and Future Work

  • For the Bidirectional Deep LSTM, the validation loss declined as the complexity of the network increased.
  • An increase in complexity also clearly increases the computational cost of the network.
  • In the future, varying the number of stacked LSTM layers and the input and output dimensions of the network might further increase the accuracy of the system.

References

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., & Potts, C. (2013). Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 1631–1642).

Nogueira dos Santos, C., & Gatti, M. (2014). Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of the International Conference on Computational Linguistics (pp. 69–78).

Mihalcea, R., Banea, C., & Wiebe, J. (2007). Learning multilingual subjective language via cross-lingual projections. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 976–983).

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. doi:10.1162/neco.1997.9.8.1735

