To be honest, I didn’t expect to learn NLP in the first place. But its wide and intriguing applications keep alluring me to dig deeper and have more fun with it. Unlike many machine learning techniques, NLP lends itself especially well to visualization and is therefore easy to understand, interpret, and apply to real-life problems. In this article, I will introduce several domains in NLP and share the ideas behind them (as well as the code & visuals!). Here is what you should expect:
- Sentiment Analysis
- Word Cloud
- Named Entity Recognition
- Text Summarization
- Topic Analysis (LDA) and Similarities (LSI)
- Language Model (Text Generation)
The tweet dataset contains tweets from 2292 Twitter users. Though this dataset cannot represent the whole population on Twitter, our conclusions can still offer useful insights. All the code is uploaded on GitHub.
I always find sentiment analysis interesting because it can be embedded anywhere. We human beings are driven by emotions and shaped by opinions. More specifically, business is centered around customers: analyzing public sentiment towards a product or a company can help companies position themselves and make improvements. Thus, sentiment analysis can serve as a performance indicator or feedback channel in business.
- Naive Sentiment Analysis
Let’s start off with the naive method to learn about the ideas behind sentiment analysis. In naive sentiment analysis, we encode each word as positive or negative and then iterate over the entire text, counting positive words as well as negative words.
This snippet shows how to generate the word lists for positive/negative words, which later function as dictionaries where we can look words up. The full version of the code is included on GitHub.
import requests

def sentiment_words(url):
    # download a word list and drop the file header (everything before the first blank line)
    request = requests.get(url)
    print("retrieving data >>> status code: ", request.status_code)
    text = request.text
    word_list = text[text.find("\n\n")+2:].split("\n")
    return word_list

pos_url = 'http://ptrckprry.com/course/ssd/data/positive-words.txt'
neg_url = 'http://ptrckprry.com/course/ssd/data/negative-words.txt'
pos_list = sentiment_words(pos_url)[:-1]  # drop the trailing empty string
neg_list = sentiment_words(neg_url)[:-1]
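With the two word lists in hand, the counting step described above reduces to a simple loop. This is my own sketch rather than the author’s full GitHub code; it assumes the text holds one user’s concatenated tweets and that nltk’s tokenizer is available.

import nltk

def naive_sentiment(text, pos_list, neg_list):
    # count tokens that appear in the positive/negative word lists,
    # normalized by text length so users with different tweet volumes are comparable
    pos_words, neg_words = set(pos_list), set(neg_list)
    words = nltk.word_tokenize(text.lower())
    pos_score = sum(word in pos_words for word in words) / len(words)
    neg_score = sum(word in neg_words for word in words) / len(words)
    return pos_score, neg_score

# hypothetical usage on one user's concatenated tweets
pos_score, neg_score = naive_sentiment("what a great day ruined by terrible traffic", pos_list, neg_list)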
The graph below shows the sentiment distribution of Twitter users. Overall, Twitter users are more positive than negative; some users even have high positive scores with close-to-zero negative scores.
Looking closer at a single user, we can detect the changes in this user’s sentiment. Unfortunately, we do not have more precise time information; otherwise we might even find weekly or seasonal patterns in sentiment change.
- NRC Emotion Lexicon
Similar to the mechanism underlying naive sentiment analysis, the NRC Emotion Lexicon tags words with 8 more emotions such as joy, trust, and sadness. With this richer information, more visuals and functionality can be added; for example, radar plots are a natural choice for sentiment diagnosis. User 1 (blue) seems to be more positive than user 2 (red), with higher scores on positive emotions (joy, trust, anticipation) and lower scores on negative emotions (sadness, anger, disgust).
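The scoring code is not shown in the original post, so here is a minimal sketch of how the lookup could work, assuming you have downloaded the word-level NRC lexicon as a tab-separated file. Treat the filename and the word/emotion/flag format as assumptions on my part.

import nltk
from collections import Counter, defaultdict

def load_nrc_lexicon(path="NRC-Emotion-Lexicon-Wordlevel-v0.92.txt"):
    # assumed format: one "word <tab> emotion <tab> 0/1" entry per line
    lexicon = defaultdict(set)
    with open(path) as f:
        for line in f:
            word, emotion, flag = line.strip().split("\t")
            if flag == "1":
                lexicon[word].add(emotion)
    return lexicon

def nrc_emotions(text, lexicon):
    # tally how often each emotion (plus positive/negative) is triggered in the text
    scores = Counter()
    for word in nltk.word_tokenize(text.lower()):
        for emotion in lexicon.get(word, ()):
            scores[emotion] += 1
    return scores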
- VADER
VADER (Valence Aware Dictionary and sEntiment Reasoner) goes beyond the word level: it analyzes sentiment at the sentence/content level. Moreover, it provides both the polarity (positive/negative) and the intensity of emotions. In the Python vaderSentiment library, the analyzer returns 4 scores: positive, negative, neutral, and compound.
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
pos = neg = neu = compound = 0
sentences = nltk.sent_tokenize(text.lower())
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    # average the sentence-level scores over all sentences
    pos += vs["pos"]/len(sentences)
    neg += vs["neg"]/len(sentences)
    neu += vs["neu"]/len(sentences)
    compound += vs["compound"]/len(sentences)
“Neutral” is significantly and negatively correlated with “positive”, which makes sense: when we are expressing emotions, it is hard to stay neutral. However, most of the users are detected as highly neutral. Is our analyzer not sensitive enough, or is that actually the case in real life?
The word cloud is what many people think of when they hear “NLP”. It counts the frequency of words and resizes each word based on its frequency: the more frequent a word is, the more it stands out in the word cloud. The idea is simple but effective.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=3000, height=3000).generate(text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Named Entity Recognition enables us to understand what or who is being talked about, and it mirrors the way human beings parse sentences using grammar. The code snippet below scans a text sentence by sentence and labels each word that is recognized as an “entity”. Having done that, we search for trees in the chunked output and ask “Are you tagged with a label?” via hasattr(tree, "label"). If the tree says yes, we grab the entity text from tree.leaves() and also store the label from tree.label().
import nltk
import pandas as pd

def entity_identity(text):
    sentences = nltk.sent_tokenize(text)
    entity = []
    for sentence in sentences:
        words = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(words)        # part-of-speech tags
        chunked = nltk.ne_chunk(tagged)     # named-entity chunking
        for tree in chunked:
            if hasattr(tree, "label"):
                # join the chunk's words back into one entity string
                entity.append([tree.label(), " ".join(c[0] for c in tree.leaves())])
    entity = pd.DataFrame(entity, columns=['label', 'entity'])
    return entity
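As a quick follow-up (my own addition, not from the original post), the resulting DataFrame makes it easy to see which entity types dominate one user’s text:

entity = entity_identity(text)
# count how many entities fall into each label (PERSON, GPE, ORGANIZATION, ...)
print(entity['label'].value_counts())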
Combining named entities with sentiment analysis or word clouds makes things more interesting. Let’s first see an example with the word cloud. The function below takes any entity type, grabs all the entities of that type, and forms a word cloud. It can function as a monitor of what or who is being heavily talked about.
def wordcloud_entity(entity, label="PERSON"):
    text = " ".join(list(entity[entity["label"]==label]["entity"]))
    wordcloud = WordCloud(stopwords=STOPWORDS, background_color='white', width=3000, height=3000).generate(text)
    fig, ax = plt.subplots(1, 1, figsize=(8, 8))
    plt.imshow(wordcloud)
    plt.axis('off')
    plt.show()
OK, now we know that Twitter users are heavily talking about Hong Kong, America, Trump, Polina Shinkina…How about further analyzing the sentiment around these words? Next, I search for the sentences that contain these words and run the VADER sentiment analyzer on them.
def sentiment_entity(text, entity="New York"):
    sentences = nltk.sent_tokenize(text)
    analyzer = SentimentIntensityAnalyzer()
    pos = neg = neu = compound = count = 0
    for sentence in sentences:
        if entity.lower() in sentence.lower():
            vs = analyzer.polarity_scores(sentence)
            pos += vs["pos"]
            neg += vs["neg"]
            neu += vs["neu"]
            compound += vs["compound"]
            count += 1
    if count == 0:
        # the entity never appears in the text
        return 0, 0, 0, 0
    return pos/count, neg/count, neu/count, compound/count
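For example, calling it on one of the entities surfaced above (my own illustrative usage):

pos, neg, neu, compound = sentiment_entity(text, entity="Hong Kong")
print(f"Hong Kong -> pos: {pos:.3f}, neg: {neg:.3f}, neu: {neu:.3f}, compound: {compound:.3f}")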
This dataset is biased, both in the user pool it represents (e.g. college students) and in the time when I extracted it, but it at least shows how we can incorporate sentiment analysis into named entity recognition.
In the previous 3 parts, we mainly focused on word-level and sentence-level analyses. Now we are going to analyze texts at the paragraph level. Normally, an article has a beginning, a body, and an end, and some sentences at the beginning or end are key sentences summarizing the main topics. Text summarization ranks each sentence and picks the top-ranked ones. Done naively, we can count the frequency of each word, use it to rank sentences, and finally pick the sentences with the highest word frequencies (the most representative ones). A more sophisticated ranking, TextRank, is wrapped up in the gensim package; check out the snippet below (a sketch of the naive version follows right after it). Tweets are shorter than articles and thus may not be a good fit for summarization, so I set the ratio parameter to as small as 0.003 to squeeze the output size.
import gensim
gensim.summarization.summarize(text,ratio=0.003)
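For comparison, the naive frequency-based ranking described above could look roughly like the sketch below. This is my own illustration, not code from the original post; it reuses the STOPWORDS set from wordcloud and scores each sentence by the average frequency of its words.

import nltk
from collections import Counter
from wordcloud import STOPWORDS

def naive_summarize(text, n_sentences=3):
    # score each sentence by the average frequency of its (non-stopword) words
    sentences = nltk.sent_tokenize(text)
    words = [w for w in nltk.word_tokenize(text.lower()) if w.isalnum() and w not in STOPWORDS]
    freq = Counter(words)
    scores = []
    for i, sentence in enumerate(sentences):
        tokens = [w for w in nltk.word_tokenize(sentence.lower()) if w.isalnum()]
        if tokens:
            scores.append((sum(freq[w] for w in tokens) / len(tokens), i))
    # keep the top-scoring sentences, restored to their original order
    top = sorted(sorted(scores, reverse=True)[:n_sentences], key=lambda x: x[1])
    return " ".join(sentences[i] for _, i in top)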
Things are getting more and more abstract now. Topic analysis is an unsupervised learning technique where we try to extract latent dimensions (topics) from texts. The technique I introduce here is LDA (Latent Dirichlet Allocation). LDA needs a word dictionary and a corpus before topics can be extracted. The dictionary encodes every word in the text. The corpus is a list of lists, where the words of each text are stored in their own list, one per document (“bag of words”).
from gensim import corpora
from gensim.models import LdaModel

words_list = []
users = []
for user, text in tweets.items():
    users.append(user)
    words = nltk.word_tokenize(text.lower())
    # keep alphanumeric, non-stopword tokens of length >= 2
    words = [word for word in words if word not in STOPWORDS and word.isalnum() and len(word) >= 2]
    words_list.append(words)

num_topics = 3  # self-defined
dictionary = corpora.Dictionary(words_list)
corpus = [dictionary.doc2bow(words) for words in words_list]
lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
Now we get 3 topics with their corresponding representative words. To show what a specific topic is like, try this code:
# Topic is 0 indexed, 1 indicates the second topic
lda.show_topic(1)
To get the topic components of a user/document, try this code out:
# corpus[2] refers to the tweets of a user
sorted(lda.get_document_topics(corpus[2], minimum_probability=0, per_word_topics=False), key=lambda x: x[1], reverse=True)

[output] [(0, 0.51107097), (1, 0.48721585), (2, 0.0017131236)]
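To scan all users at once, one straightforward extension (my own, not in the original post) is to collect these distributions into a user-by-topic DataFrame:

import pandas as pd

# rows are users, columns are topic proportions
topic_matrix = pd.DataFrame(
    [[prob for _, prob in lda.get_document_topics(bow, minimum_probability=0)] for bow in corpus],
    index=users,
    columns=[f"topic_{i}" for i in range(num_topics)]
)
print(topic_matrix.head())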
When it comes to comparing topic similarities between different users, that is where LSI shines. LSI takes the same inputs as LDA (dictionary and corpus) and compares a new document against the existing corpus.
from gensim import models, similarities

lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=3)

# doc is the new document to compare against the corpus
words_new = nltk.word_tokenize(doc.lower())
words_new = [word for word in words_new if word not in STOPWORDS and word.isalnum() and len(word) >= 2]
vec_bow = dictionary.doc2bow(words_new)
vec_lsi = lsi[vec_bow]

index = similarities.MatrixSimilarity(lsi[corpus])
sims = index[vec_lsi]
sims = sorted(enumerate(sims), key=lambda item: -item[1])
I didn’t input new tweets; instead, I randomly chose an existing user as the new document. That’s why we get one similarity score of 1.
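Since sims holds (document index, similarity) pairs and the users list built earlier follows the same order, mapping the top matches back to user names is a small follow-up of my own:

# print the five most similar users and their cosine similarities
for idx, score in sims[:5]:
    print(users[idx], round(float(score), 3))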
The pyLDAvis package provides great visuals for LDA, through which we can see how topics correlate with keywords and how to interpret them. It takes quite a long time to run, so I only plot it for one user.
import pyLDAvis, pyLDAvis.gensim
lda_display = pyLDAvis.gensim.prepare(lda, corpus, dictionary, R=15, sort_topics=False)
pyLDAvis.display(lda_display)
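If you are not working in a notebook, the same display object can be written to a standalone HTML file with pyLDAvis.save_html (the filename here is my own choice):

pyLDAvis.save_html(lda_display, 'lda_topics.html')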
Through the steps above, we have gotten to know the sentiment, entities, keywords, and topics of a text. Now let’s teach our machine how to speak like a human being. We are going to build a simple RNN model for one Twitter user and mimic how this user speaks.
There are 2 main obstacles in applying machine learning to sequence data. First, the order of a sequence cannot be reflected in a traditional DNN. That is where RNNs come in: an RNN is a good fit for sequence data, just as a CNN is for images. A many-to-one RNN takes the first n-1 words of a text as inputs and the nth word as the output. Because an RNN passes information along from previous positions, it can preserve the order of the sequence. Second, sequences come in different lengths: tweets can be as short as 3 characters but also as long as several sentences. Padding is designed exactly for this problem.
To start with, we tokenize each word, since machines cannot directly recognize words, and each sentence is then encoded into a list of integers. For example, “now” is encoded as 198 in the sentence. word_index is the dictionary used for deciphering.
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()  # can set num_words to limit the vocabulary
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
word_index = tokenizer.word_index
index_word = {index: word for word, index in word_index.items()}
Next, to make the lengths uniform, each sentence is either padded with 0s to fill the blank positions or truncated to fit into the box. The first n-1 terms go into X, while the last term of each sentence goes into Y.
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 15
trunct_type = "post"
padding_type = "pre"
padded = pad_sequences(sequences, padding=padding_type, truncating=trunct_type, maxlen=max_length)
vocab_size = len(word_index) + 1
X = padded[:, :-1]
Y = padded[:, -1]
Y = tf.keras.utils.to_categorical(Y, num_classes=vocab_size)
When building the RNN model, we add an embedding layer. It is a way of representing words that has certain advantages over one-hot encoding. In one-hot encoding, each word is independent of (orthogonal to) every other word. Word embedding generates one vector per word and allows non-orthogonal relationships, so some words can be closer and more similar to each other; for example, “cat” is closer to “dog” than to “westside”. Therefore, we can transfer our knowledge to sentences we haven’t seen before: the machine can learn “the dog is running” from the original sentence “the cat is running”.
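To make the contrast with one-hot encoding concrete, here is a tiny illustration with made-up vectors (the numbers are purely toy values of my own, not learned embeddings):

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# one-hot vectors: every pair of distinct words has similarity 0
cat_onehot, dog_onehot = np.array([1, 0, 0]), np.array([0, 1, 0])
print(cosine(cat_onehot, dog_onehot))       # 0.0, no notion of relatedness

# toy embedding vectors: "cat" and "dog" can end up close together
cat_emb, dog_emb, westside_emb = np.array([0.9, 0.8]), np.array([0.85, 0.75]), np.array([-0.7, 0.2])
print(cosine(cat_emb, dog_emb))             # close to 1
print(cosine(cat_emb, westside_emb))        # much lower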
LSTM is a technique that mitigates vanishing gradients and strengthens long-term dependencies in a sequence.
embedding_dim = 100  # embedding size; the original value isn't shown, 100 is a reasonable choice

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length-1),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(512, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(vocab_size, activation="softmax")
])

model.compile(loss='categorical_crossentropy', optimizer="adam", metrics=["accuracy"])
model.fit(X, Y, epochs=100, verbose=1)
Let’s generate text to see how our talking robot works.
import random
import numpy as np

# pick a random tweet as the seed
choice = random.randint(0, len(padded)-1)
seed = padded[choice, 1:].reshape(1, max_length-1)
tweet_robot = sentences[choice]
for i in range(5):
    # predict the next word, then slide the window forward by one position
    # (in newer TensorFlow versions, np.argmax(model.predict(seed), axis=-1) replaces predict_classes)
    predicted = model.predict_classes(seed, verbose=0)
    seed = np.append(seed, [int(predicted)])[1:].reshape(1, max_length-1)
    tweet_robot = tweet_robot + " " + str(index_word[int(predicted)])
The original sentence is a complete tweet, and our talking robot extends it by picking out what it thinks are appropriate words. The result is not as satisfying as I expected: the robot seems to just throw out words rather than produce a well-organized sentence. However, the chosen words do show some alignment with the topic.
There is a lot more that could be done to improve performance. Text cleaning would be my priority: tweets are short and informal, which leads to a lot of sloppiness in word choice and sentence organization, so there is much noise in the dataset.
I am just standing at the gate of the magnificent palace of NLP, knocking on the door. There’s a long way ahead of me. I am writing this article to remind myself of the path I’ve taken and to encourage myself to keep adventuring.