German Twitter Sentiment Analysis

A field trial with R

Why this post and what is the aim of it?

This year I took a course on Social Network Analysis at my university. While looking for a research topic and a nice method for my final presentation, I heard of Twitter Sentiment Analysis. I thought that this would be a fancy method but that I could never implement it by myself. But then I noticed that it is possible to do Twitter Sentiment Analysis with R, which has one main advantage for me as a student: it is free!

With this post I would like to describe how I set up the script, where you can find a German word list for the sentiment analysis and how you can use it, and especially my experience with the method and its limitations in the way I used it. Furthermore, I will try to explain the method and its implementation in a way that is understandable for people who have as little experience as I had when I started to work on this little project.

Setting up the script for a German Twitter Sentiment Analysis

At the beginning I looked for tutorials that could help me write my script (before, I had used R only for descriptive and regression analysis of survey datasets). I found a lot of tutorials, but in my eyes the most helpful one was this wonderful post published by Sergey Bryl’ two years ago. In comparison to other tutorials, Sergey Bryl’ already implements some very helpful improvements of the sentiment analysis. That is why my script is mostly the same as Sergey’s script. A small part differs, which is necessary because of the format of the German word list that I used to classify the tweets. I will explain this when we come to that point.

If you want to do a sentiment analysis via Twitter, you will first need to create your Twitter App here, which is easy to do and free. After doing this you can click on your application and choose the menu ‘Keys and Access Tokens’. There you can find your individual consumer key and your consumer secret, which you will need later on. Once you have those keys you can start to write your own script.

First of all you will need to install some packages and load them into your R session.

#install required packages
install.packages("twitteR")
install.packages("ROAuth")
install.packages("plyr")
install.packages("dplyr")
install.packages("stringr")
install.packages("ggplot2")
#load required packages
library(twitteR)
library(ROAuth)
library(plyr)
library(dplyr)
library(stringr)
library(ggplot2)

In the next step R will connect to the Twitter API.

download.file(url='http://curl.haxx.se/ca/cacert.pem', destfile='cacert.pem')
reqURL <- 'https://api.twitter.com/oauth/request_token'
accessURL <- 'https://api.twitter.com/oauth/access_token'
authURL <- 'https://api.twitter.com/oauth/authorize'
consumerKey <- 'YOUR CONSUMER KEY HERE'
consumerSecret <- 'YOUR CONSUMER SECRET HERE'
Cred <- OAuthFactory$new(consumerKey=consumerKey,
                         consumerSecret=consumerSecret,
                         requestURL=reqURL,
                         accessURL=accessURL,
                         authURL=authURL)
Cred$handshake(cainfo = system.file('CurlSSL', 'cacert.pem', package = 'RCurl')) 

Now your browser will open and ask you whether you would like to allow your new application to use your account. After accepting this you will get a number, which you should enter in R to authorise the connection. Then you can set up the Twitter OAuth and choose ‘using direct authentication’ by entering a ‘1’.

'THE NUMBER YOU RECEIVED FROM YOUR BROWSER HERE'
setup_twitter_oauth('YOUR CONSUMER KEY HERE','YOUR CONSUMER SECRET HERE','YOUR ACCESS TOKEN HERE','YOUR ACCESS SECRET HERE')
1

Now we will begin with the sentiment analysis. Here, for example, we are looking for 100,000 tweets with the hashtag #R2G. R2G stands for a possible German coalition. I am using this example because I hope that most of the tweets that use this hashtag will be written in German. We will see later why this is important.

One of the main improvements by Sergey Bryl’ that we can notice here and later on is that his script always writes .csv files of the data you are capturing. This is a big plus, because Twitter gives you only a limited number of tweets per request, and it will not give you tweets that are older than seven or eight days. By saving .csv files you can run the sentiment analysis, for example, once a week and easily merge old and new files to get a long-term analysis later on. If you want to relate your work to the content of the tweets, it is also a big plus that you will be able to see who has written a tweet, how often it was retweeted and so on. It is also possible to get geographic information about the tweets, but it seems that only a few people are using this feature. After running the first line of the script, R will tell you how many tweets it got. In this example we got 3,464 tweets.

search.list <- searchTwitter('#R2G', n=100000)
df <- twListToDF(search.list)
df <- df[, order(names(df))]
df$created <- strftime(df$created, '%Y-%m-%d')
if (file.exists(paste('#R2G', '_stack.csv'))==FALSE) write.csv(df, file=paste('#R2G', '_stack.csv'), row.names=F)

In the next step we are going to save all the tweets, without the duplicates, and the additional information belonging to them in our storage .csv file, which is stored in your working directory.

stack <- read.csv(file=paste('#R2G', '_stack.csv'))
stack <- rbind(stack, df)
stack <- subset(stack, !duplicated(stack$text))
write.csv(stack, file=paste('#R2G', '_stack.csv'), row.names=F)
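
To see what this merge step does without calling the Twitter API, here is a minimal sketch with two made-up batches of tweets. The data frames old.batch and new.batch, their contents and the temporary file are assumptions for illustration only. The last two lines also show that paste() puts a space between its arguments by default, so the storage file in the script is named '#R2G _stack.csv'; paste0() would avoid the space.

```r
# Two made-up batches standing in for two runs of searchTwitter()
old.batch <- data.frame(text = c("tweet a", "tweet b"),
                        created = c("2016-11-01", "2016-11-02"),
                        stringsAsFactors = FALSE)
new.batch <- data.frame(text = c("tweet b", "tweet c"),
                        created = c("2016-11-02", "2016-11-08"),
                        stringsAsFactors = FALSE)

# Write the first batch to a temporary file (stands in for the stack file)
stack.file <- tempfile(fileext = ".csv")
write.csv(old.batch, file = stack.file, row.names = FALSE)

# Read the stored tweets, append the new batch and drop duplicated texts
stack <- read.csv(stack.file, stringsAsFactors = FALSE)
stack <- rbind(stack, new.batch)
stack <- subset(stack, !duplicated(stack$text))
write.csv(stack, file = stack.file, row.names = FALSE)

print(stack$text)

# paste() inserts a space between its arguments, paste0() does not
name.with.space <- paste('#R2G', '_stack.csv')
name.without.space <- paste0('#R2G', '_stack.csv')
print(c(name.with.space, name.without.space))
```

Because "tweet b" appears in both batches, it is kept only once in the merged stack.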

After this it is time to prepare the word list which you will need to score your tweets. For the German language you can find a nice word list called SentiWS here. SentiWS by Remus et al. is licensed under a Creative Commons license and contains 1,650 positive and 1,818 negative words. SentiWS actually consists of two lists, one for positive and one for negative words. These are supplemented by other forms of the same root word, so that you get 15,649 positive and 15,632 negative words in total. SentiWS also contains additional information, like a weight for each word, which we are not going to use in this case. Because of that we will first write a function which we can use to format the lists. For example, it transforms all capital letters to lowercase letters. The tweets will be converted to lowercase in the same way later on. When this is done, provided that you have saved both SentiWS files in your working directory, we will read them into R and run our function on them. As you can see, I renamed the SentiWS files. If you want to use the same script, it is important that the word list files are in your R working directory.

readAndflattenSentiWS <- function(filename) {
  words <- readLines(filename, encoding="UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
pos.words <- c(scan("positive-words.txt",what='character', comment.char=';', quiet=T),
               readAndflattenSentiWS("positive-words.txt"))
neg.words <- c(scan("negative-words.txt",what='character', comment.char=';', quiet=T),
               readAndflattenSentiWS("negative-words.txt"))
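
To make the reformatting step more concrete, here is a small sketch that applies the same regular expression to a few lines in the SentiWS layout (word|POS tag, tab, weight, tab, comma-separated inflected forms). The sample lines and their weights are invented for illustration and are not copied from SentiWS, and flatten.lines is a hypothetical helper that works on a character vector instead of a file.

```r
# Invented lines in the SentiWS layout: word|POS <tab> weight <tab> inflected forms
sample.lines <- c("Freude|NN\t0.5\tFreuden",
                  "gut|ADJX\t0.4\tbesser,beste,guten",
                  "Erfolg|NN\t0.3")

# Same steps as readAndflattenSentiWS(), but applied to a character vector
flatten.lines <- function(lines) {
  # Replace the "|POS<tab>weight<tab>" part with a comma
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", lines)
  # Split base form and inflected forms into single words
  words <- unlist(strsplit(words, ","))
  # Lowercase everything so that it matches the lowercased tweets
  tolower(words)
}

flat.words <- flatten.lines(sample.lines)
print(flat.words)
```

The POS tag and the weight are dropped, and every base form and inflected form ends up as its own lowercase entry in the vector.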

The following part of the script contains the function which formats the tweets and calculates the score per tweet. The result is saved in another .csv file.

score.sentiment <- function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  scores <- laply(sentences, function(sentence, pos.words, neg.words){
    sentence <- gsub('[[:punct:]]', "", sentence)
    sentence <- gsub('[[:cntrl:]]', "", sentence)
    sentence <- gsub('\\d+', "", sentence)
    sentence <- tolower(sentence)
    word.list <- str_split(sentence, '\\s+')
    words <- unlist(word.list)
    pos.matches <- match(words, pos.words)
    neg.matches <- match(words, neg.words)
    pos.matches <- !is.na(pos.matches)
    neg.matches <- !is.na(neg.matches)
    score <- sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  scores.df <- data.frame(score=scores, text=sentences)
  return(scores.df)
}
Dataset <- stack
Dataset$text <- as.factor(Dataset$text)
scores <- score.sentiment(Dataset$text, pos.words, neg.words, .progress='text')
write.csv(scores, file=paste('#R2G', '_scores.csv'), row.names=TRUE)

The last step before we can work with a result in the categories “positive”, “neutral” or “negative” is to classify the tweets into these categories. The result will be saved as a .csv file again.

stat <- scores
stat$created <- stack$created
stat$created <- as.Date(stat$created)
stat <- mutate(stat, tweet=ifelse(stat$score > 0, 'positive', ifelse(stat$score < 0, 'negative', 'neutral')))
by.tweet <- group_by(stat, tweet, created)
by.tweet <- summarise(by.tweet, number=n())
write.csv(by.tweet, file=paste('#R2G', '_opin.csv'), row.names=TRUE)
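
The classification rule and the counting per day can be tried out on a handful of made-up scores. This is only a sketch: toy.stat and its values are invented, and it uses base R's table() instead of dplyr so that it runs without any packages. Unlike the dplyr version, table() also lists combinations with a count of zero.

```r
# Made-up scores and dates standing in for the 'stat' data frame
toy.stat <- data.frame(score = c(2, 1, 0, 0, -1, -3),
                       created = as.Date(c("2016-11-01", "2016-11-01", "2016-11-01",
                                           "2016-11-02", "2016-11-02", "2016-11-02")))

# Same rule as in the script: > 0 positive, < 0 negative, otherwise neutral
toy.stat$tweet <- ifelse(toy.stat$score > 0, 'positive',
                         ifelse(toy.stat$score < 0, 'negative', 'neutral'))

# Count tweets per class and day; as.data.frame() turns the table into long format
counts <- as.data.frame(table(tweet = toy.stat$tweet, created = toy.stat$created),
                        responseName = "number")
print(counts)
```

On the first day the toy data contains two positive tweets and one neutral tweet, on the second day one neutral and two negative tweets.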

Finally we are able to plot the data and save the file. The stat_summary line of the plot adds another line, which shows the total number of tweets. If you do not want it, you can comment it out by putting a # sign in front of it.

ggplot(by.tweet, aes(created, number)) + geom_line(aes(group=tweet, color=tweet), size=1.0) +
geom_point(aes(group=tweet, color=tweet), size=1.9) +
theme(text = element_text(size=16), axis.text.x = element_text(angle=90, vjust=1)) +
stat_summary(fun.y = 'sum',fun.ymin='sum', fun.ymax='sum', colour = 'yellow', size=1, geom = 'line') +
ggtitle('#R2G')
ggsave(file=paste('#R2G', '_plot.jpeg'))

The result will look like this:

[Figure: r2g-_plot, the number of positive, neutral and negative #R2G tweets per day]

#If you want to, you can also get other statistics
table(stat$score)
mean(stat$score)

Conclusion: A critical review of the method used above

As you can see in the plot above, a lot of tweets remain classified as neutral. There are a few possible explanations for this:

  • they are written in a foreign language, which is, of course, not covered by a German word list,
  • they use irony, or words and phrases that SentiWS is not able to recognise as positive or negative,
  • or they really are neutral.

To illustrate that the scoring is not right in every single case, even if the tweets are written in German and SentiWS includes the words used, I will give you a short example.

The script below can be seen as a small test sample for our script. We are going to classify five German tweets and check their results. You can loosely translate them to English as follows:

  1. “ich liebe dich” -> “I love you”
  2. “Ich bin schlecht. Mist!” -> “I am poor/low. Shit!”
  3. “geil” -> “horny/nice”
  4. “scheiss” -> “shit”
  5. “geiler scheiß” -> “fancy shit”

sample <- c("ich liebe dich.","Ich bin schlecht. Mist!", "geil", "scheiss","geiler scheiss")
test.sample <- score.sentiment(sample, pos.words, neg.words)

If we have a look at the scores now, you can see that the first four tweets get the right score (1, -2, 1, -1). After the transformation for our plot they would be classified as two positive and two negative tweets. But what happens to the last tweet, which is clearly positive? It will be counted as neutral, because it consists of one word which is marked as positive in our setting and another which is marked as negative, so the result is going to be: -1+1=0
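
One way to make this cancellation visible is to keep the positive and negative hit counts separately instead of only their difference. The following is a sketch under assumptions: the two tiny word lists are made up and do not come from SentiWS, and count.hits is a hypothetical helper that uses the same cleaning and matching steps as score.sentiment, written in base R only.

```r
# Tiny made-up word lists, standing in for the SentiWS lists
toy.pos <- c("liebe", "geil", "geiler")
toy.neg <- c("schlecht", "mist", "scheiss")

# Hypothetical helper: like score.sentiment(), but keeps both hit counts
count.hits <- function(sentences, pos.words, neg.words) {
  rows <- lapply(sentences, function(sentence) {
    sentence <- gsub('[[:punct:]]', "", sentence)
    sentence <- gsub('[[:cntrl:]]', "", sentence)
    sentence <- gsub('\\d+', "", sentence)
    words <- unlist(strsplit(tolower(sentence), '\\s+'))
    pos <- sum(!is.na(match(words, pos.words)))
    neg <- sum(!is.na(match(words, neg.words)))
    data.frame(text = sentence, pos = pos, neg = neg, score = pos - neg)
  })
  do.call(rbind, rows)
}

hits <- count.hits(c("ich liebe dich.", "geiler scheiss", "heute ist montag"),
                   toy.pos, toy.neg)
# A score of 0 with hits on both sides is "cancelled", without any hit it is "no match"
hits$cancelled <- hits$score == 0 & hits$pos > 0
print(hits)
```

With this extra column, a tweet like “geiler scheiss” (one positive and one negative hit) can be told apart from a tweet that contains no word from the lists at all, although both get a score of 0.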

So all in all my conclusion is that implementing a Twitter Sentiment Analysis like I did here is a nice way to get in touch with R and to play with the program and its functions. But you have to be very careful when you are interpreting the results, because there are a lot of potential biases (e.g. tweets in a foreign language) that will distort your results.

Nevertheless, a nice side effect of this method is that you will also get .csv files which include the full text of the tweet, the author, the number of retweets, how often the tweet has been favorited etc.

If your aim is to implement a valid type of Twitter Sentiment Analysis, you should have a look at papers like the ones from Go et al. (2009), Pak & Paroubek (2010) or Kouloumpis et al. (2011).

References:

Go, Alec / Huang, Lei / Bhayani, Richa (2009): Twitter Sentiment Analysis.

Kouloumpis, Efthymios / Wilson, Theresa / Moore, Johanna (2011): Twitter Sentiment Analysis: The Good the Bad and the OMG!.

Pak, Alexander / Paroubek, Patrick (2010): Twitter as a Corpus for Sentiment Analysis and Opinion Mining.

Remus, R. / Quasthoff, U. / Heyer, G. (2010): SentiWS – a Publicly Available German-language Resource for Sentiment Analysis. In: Proceedings of the 7th International Language Resources and Evaluation (LREC’10), pp. 1168–1171.

 
