Twitter sentiment analysis using Logistic Regression

Kolamanvitha · Published in Nerd For Tech · 6 min read · May 4, 2021


Image Source: Reputationx

Sentiment Analysis:

Sentiment analysis is an NLP technique that allows us to classify a text, tweet, or comment as positive, neutral, or negative. Today's technology enables users to express their emotions and thoughts more openly on social platforms than ever before. Sentiment analysis has therefore become an essential tool for businesses to understand user sentiment, gauge their performance, and tailor their products and services to user needs, making their systems more efficient.

Logistic Regression:

Logistic regression is a supervised machine learning technique for classification problems. Supervised algorithms train on a labeled dataset, where each example comes with an answer key that the model uses to learn and to evaluate its accuracy. The goal of the model is to learn an approximate mapping function f(X) = Y from the input variables {x1, x2, ..., xn} to the output variable Y. It is called supervised because the model's predictions are iteratively evaluated and corrected against the known output values until an acceptable performance is achieved.

Sentiment Analysis using Logistic Regression:

As part of building a sentiment classifier using logistic regression, we train the model on a Twitter sample dataset. The dataset comes in its natural human format of tweets, which is not easy for a model to understand. We therefore have to do some pre-processing and cleaning to break the given text down into a format the model can easily work with.

Pre-processing of tweets includes the following steps:

  1. Removing punctuations, hyperlinks and hashtags
  2. Tokenization — Converting a sentence into list of words
  3. Converting words to lower cases
  4. Removing stop words
  5. Lemmatization/stemming — Transforming a word to its root word
import re
import string
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def process_tweet(tweet):
    # remove stock market tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove the hash sign from hashtags
    tweet = re.sub(r'#', '', tweet)
    # tokenize, lowercase, strip handles and shorten elongated words
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)
    stopwords_english = stopwords.words('english')
    stemmer = PorterStemmer()
    tweets_stem = []
    for word in tweet_tokens:
        # keep the stemmed form of every token that is not a stopword or punctuation
        if word not in stopwords_english and word not in string.punctuation:
            stem_word = stemmer.stem(word)
            tweets_stem.append(stem_word)
    return tweets_stem
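
As a quick sanity check on a made-up example tweet (the exact output depends on NLTK's stopword list and stemmer):

print(process_tweet("I am so #happy learning NLP :)"))
# expected output along the lines of: ['happi', 'learn', 'nlp', ':)']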

For this model, we will use NLTK’s twitter_samples corpus as our labeled training data. The twitter_samples corpus contains 3 files.

  1. negative_tweets.json: contains 5000 negative tweets
  2. positive_tweets.json: contains 5000 positive tweets
  3. tweets.20150430-223406.json: contains 20,000 tweets collected from the public Twitter stream

Let us consider the first two files for our analysis.

from nltk.corpus import twitter_samples   # requires nltk.download('twitter_samples')

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

We split the data into train and test sets in an 80:20 ratio, and combine the positive and negative portions so that the train set is used for training the model and the test set for evaluating its performance.

test_positive = positive_tweets[4000:]
train_positive = positive_tweets[:4000]
test_negative = negative_tweets[4000:]
train_negative = negative_tweets[:4000]
train_x = train_positive + train_negative
test_x = test_positive + test_negative

Combine positive and negative labels into an array for the target variable. Append 1’s for positive and 0’s for negative tweets.

import numpy as np

train_y = np.append(np.ones((len(train_positive), 1)), np.zeros((len(train_negative), 1)), axis=0)
test_y = np.append(np.ones((len(test_positive), 1)), np.zeros((len(test_negative), 1)), axis=0)

Build frequency dictionary:

The current scenario is a binary classification problem where each tweet can be either positive or negative. Some words occur more frequently in positive tweets, while others occur more frequently in negative ones. Knowing this helps us predict whether a tweet or sentence containing these words is positive or negative.

For example, if the word "Happy" occurs more frequently in tweets labeled as positive, then the next time a tweet contains the word "Happy", it increases the likelihood that the sentiment of the tweet is positive.

So, for each word, we calculate the number of times it appears in positive tweets and in negative tweets. We represent this using a dictionary where the key is the (word, sentiment) pair and the value is the frequency.

Suppose "happy" occurred 40 times in tweets with positive sentiment (1) and 8 times in negative tweets (0); this is represented as dictionary = {("happy", 1): 40, ("happy", 0): 8}.

Let us create a dictionary by mapping each (word, sentiment) pair to its frequency in the corpus.

freqs = {}
# count how many times each (word, sentiment) pair occurs in the training tweets
for y, tweet in zip(np.squeeze(train_y).tolist(), train_x):
    for word in process_tweet(tweet):
        pair = (word, y)
        freqs[pair] = freqs.get(pair, 0) + 1

Feature Extraction:

Machine learning models deal with numbers rather than raw text. Thus, we need to transform these tweets into vectors that can later be fed into our model for training. There are many ways to represent text as vectors, depending on the context.

We will use the frequency dictionary built in the previous section to convert each tweet into a 3-dimensional vector, as below:

tweet = [1, Σfreq of words in positive class, Σfreq of words in negative class]

Since each tweet is represented as a vector, we can combine all these vectors into a single matrix, as sketched after the feature-extraction code below.

def extract_features(tweet, freqs):
    # process the tweet into a list of stemmed words
    word_l = process_tweet(tweet)
    # feature vector: [bias, sum of positive frequencies, sum of negative frequencies]
    x = np.zeros((1, 3))
    x[0, 0] = 1
    # loop through each word in the list of words
    for word in word_l:
        # add the word's count under the positive label 1
        x[0, 1] += freqs.get((word, 1.0), 0)
        # add the word's count under the negative label 0
        x[0, 2] += freqs.get((word, 0.0), 0)
    return x
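
As a minimal sketch, the 3-dimensional feature vectors for all the training tweets can then be stacked into a single matrix X that feeds into gradient descent later (extract_features is the helper defined above):

# stack one feature vector per training tweet into an (m, 3) matrix
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :] = extract_features(train_x[i], freqs)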

Now that we have the dataset and features ready for training, let us look at logistic regression and how it works.

Sigmoid activation Function:

Logistic regression arrives at its best predictions using the maximum likelihood technique. The sigmoid is a mathematical function that can take any real value between -∞ and +∞ and map it to a value between 0 and 1. So if the output of the sigmoid function is greater than 0.5 we classify the example as the positive class, and if it is less than 0.5 we classify it as the negative class.
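
The snippets below call a sigmoid function; a minimal implementation looks like this:

def sigmoid(z):
    # map any real value z to the open interval (0, 1)
    return 1 / (1 + np.exp(-z))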

Cost Function and Gradient Descent:

We use a cross-entropy (log loss) cost function for logistic regression. The cross-entropy cost can be divided into two cost functions, one for each output.
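
For a single example with prediction h = sigmoid(z) and true label y, the two pieces are Cost(h, y) = -log(h) when y = 1, and Cost(h, y) = -log(1 - h) when y = 0.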

The vectorized form of the cost function is given by
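
J(theta) = -(1/m) · [ yᵀ · log(h) + (1 - y)ᵀ · log(1 - h) ], where h = sigmoid(X·theta) and m is the number of training examples; this is the same quantity J computed in the gradient descent code below.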

Multiplying by y and (1 - y) in the above equation is a trick that lets a single expression cover both cases: when y = 1 the second term vanishes, and when y = 0 the first term vanishes, so only the required operation is performed in each case.

Given the feature matrix X, the target variable Y, a learning rate alpha, and a number of iterations num_iters, theta is calculated iteratively using gradient descent:

def gradientDescent(x, y, theta, alpha, num_iters):
    # m = number of training examples
    m = x.shape[0]
    for i in range(0, num_iters):
        z = np.dot(x, theta)
        h = sigmoid(z)
        # cross-entropy cost, useful for monitoring convergence
        J = (-1/m) * (np.dot(np.transpose(y), np.log(h)) + np.dot(np.transpose(1 - y), np.log(1 - h)))
        # gradient step: move theta against the gradient of the cost
        theta = theta - (alpha/m) * np.dot(np.transpose(x), (h - y))
    return J, theta
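
As a quick sketch of how this function might be called, with X being the stacked feature matrix from the previous section and train_y the label vector (the learning rate and iteration count here are illustrative choices, not values reported in this post):

# theta starts at zero; alpha and num_iters are illustrative hyperparameters
J, theta = gradientDescent(X, train_y, np.zeros((3, 1)), alpha=1e-9, num_iters=1500)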

Training & Evaluating the Sentiment classifier:

Gradient descent is applied and the resulting theta vector of optimal weights is obtained. The sentiment of a new tweet is then predicted using the sigmoid of the dot product of its extracted feature vector x with the theta vector:

y_pred = sigmoid(np.dot(x, theta))

The threshold is set at 0.5: if y_pred > 0.5, the tweet is predicted as positive, otherwise as negative. Looping over the test set:

y_hat = []
for tweet in test_x:
    # predict the probability of positive sentiment for each test tweet
    y_pred = sigmoid(np.dot(extract_features(tweet, freqs), theta))
    if y_pred > 0.5:
        y_hat.append(1)
    else:
        y_hat.append(0)
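
The test accuracy is then just the fraction of predictions that match the true labels; a minimal sketch:

# compare the predicted labels against the true test labels
test_accuracy = np.mean(np.array(y_hat) == np.squeeze(test_y))
print(test_accuracy)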

With the given dataset, an accuracy of 99.5% is obtained using this model, which is almost perfect!

Note: This blog is based on the new NLP specialization at https://www.deeplearning.ai/
