CEU - Data Science 4: Unstructured Text Analysis

Analysis of a WhatsApp group chat log using NLP

Lisa Lang (1902224)

14 May, 2020


Introduction and Hypothesis

I recently learned that it is possible to download WhatsApp conversation logs as .txt files, which makes them an interesting data source for exploring natural language processing techniques. For this project I chose my family-in-law's group chat, which I joined in August 2017. There are a total of 15 family members in the chat, spanning three generations.

I am going to test a few hypotheses:

There are times when chat intensity picks up

  • There are peaks of communication in the morning and early evening
  • There are peaks of communication on special occasions such as birthdays, holidays, …

There is a different communication style between family members

  • Different family members use different words more often
  • Different family members use a different set of emojis

The overall sentiment of the chats is very positive

Getting and cleaning the data

WhatsApp has a built-in function for exporting chat logs to a zip archive. On an iPhone it works like this: while in the chat overview, swipe the (group) chat of choice to the left and tap the three dots (… more), then select “Export Chat”. I saved the file directly to my iCloud, but there are alternative ways such as sharing the file via AirDrop or email.

Before reading the data into R, I opened the file with a text editor and replaced the (nick)names of each family member with proper first names for the analysis. I also removed the first three lines, which were not chat messages but the notification of me joining the group.

For proper analysis some data cleaning steps have to be taken:

Some rows don’t have a time stamp. This happens when a message has multiple paragraphs.

These rows are going to be shifted to the right, so that I can transfer the time stamp and name of the corresponding message. To do that, it is crucial to create empty columns at the end of the data frame, in case the stampless rows are very long and would otherwise exceed the dimensions of the data frame:
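A minimal sketch of the transfer step only, using tidyr::fill() to carry the time stamp and sender name down to continuation rows (the column shifting itself is not shown, and the names chat_raw, date, time and name are assumptions, not the report's actual code):

library(dplyr)
library(tidyr)

# continuation rows (no time stamp of their own) inherit the date, time and
# sender name of the message they belong to
chat_raw <- chat_raw %>%
  mutate(across(c(date, time, name), ~ na_if(.x, ""))) %>%
  fill(date, time, name, .direction = "down")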

Now I am going to format the time stamp in such a way that R can evaluate it. Currently the first column shows the date in the format %d.%m.%y and the second column shows the time as %H:%M:%S. I will convert these into POSIXct date-times using the lubridate package.

## Warning: 3 failed to parse.
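For reference, a minimal sketch of such a parsing step, assuming the two columns are called date and time (the actual code may differ):

library(dplyr)
library(lubridate)

# paste date ("%d.%m.%y") and time ("%H:%M:%S") together and parse them in
# day-month-year order; rows that cannot be parsed trigger a "failed to parse" warning
chat <- chat_raw %>%
  mutate(date_time = dmy_hms(paste(date, time)))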

In order to have an entire message in one row, as opposed to it being spread over several rows and columns of the table, I am going to concatenate them. I am also going to remove the placeholder messages informing that videos, pictures and stickers were omitted from the text file upon export.
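A rough sketch of this step, assuming the split-up message text sits in columns 4 onwards and that the media placeholders contain “omitted” or “weggelassen” (both the column layout and the placeholder wording are assumptions):

library(stringr)

# paste the text fragments of each row back into a single string
chat$text <- apply(chat[, 4:ncol(chat)], 1,
                   function(x) str_squish(paste(na.omit(x), collapse = " ")))
chat <- chat[, c("date_time", "name", "text")]   # assumed column names

# drop placeholder rows for exported media (exact wording is an assumption)
chat <- chat[!grepl("omitted|weggelassen", chat$text), ]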

The names have a colon at the end, which does not look nice in graphics.
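Removing it is a one-liner, for example:

# strip the trailing colon from the sender names, e.g. "Fabian:" -> "Fabian"
chat$name <- sub(":$", "", chat$name)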

##             date_time   name
## 1 2017-08-05 21:31:36 Fabian
## 2 2017-08-05 21:38:45 Gerald
##                                                                     text
## 1 Wir begrüßen unser neues Mitglied in der Selbsthilfegruppe. Hallo lisa
## 2                                        Hallo Lisa \U0001f38a\U0001f389

Next I am going to convert the data.frame into a long (or tidy) format, with one row for each word from each of the chat messages. By default, unnest_tokens() converts the tokens (words) to lowercase, which makes them easier to compare or combine with other datasets.
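A minimal sketch of this step, assuming the cleaned data frame is called chat:

library(dplyr)
library(tidytext)

# one row per word per message; unnest_tokens() lowercases the tokens by default
chat_words <- chat %>%
  unnest_tokens(word, text)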

##               date_time   name     word
## 1   2017-08-05 21:31:36 Fabian      wir
## 1.1 2017-08-05 21:31:36 Fabian begrüßen
## 1.2 2017-08-05 21:31:36 Fabian    unser
## 1.3 2017-08-05 21:31:36 Fabian    neues
## 1.4 2017-08-05 21:31:36 Fabian mitglied
## 1.5 2017-08-05 21:31:36 Fabian       in

Then I am going to remove common German stop words and some other words I would like to ignore.
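A sketch of the filtering, using the Snowball German stop word list via tidytext::get_stopwords() (the custom ignore list itself is not reproduced here):

library(dplyr)
library(tidytext)

german_stop_words <- get_stopwords(language = "de")   # Snowball German stop word list
custom_ignore <- tibble(word = character())           # add project-specific words to drop

chat_words <- chat_words %>%
  anti_join(german_stop_words, by = "word") %>%
  anti_join(custom_ignore, by = "word")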

Exploratory analysis

First I will have a look at the number and timing of messages.

Who messages the most?

Corinna is by far the most active person in the family chat. Robert and David are the second and third most active, respectively. Michael and Sophie contribute the least to the family chat.

What time of the day do messages peak?

Looking at an entire 24 hour day:

Conversations happen more frequently in the evening hours, peaking at 18:00.

During which months are family members most active in the family chat?

March is the month with the highest frequency of messages on average, followed by July, while February, October and December are rather calm months.

I do not have an explanation for the peak in March at this point, but will explore this further:

The heatmap shows that the message frequency is strongest around week 12, which is approximately mid-March.

The plot separated by year reveals that the peak in March is caused by a very intense exchange of messages in March 2020, most likely driven by the governmental reactions to the Covid-19 pandemic and their impact on our lives. All of us were affected in one way or another, and with in-person meetings prohibited, information exchange on platforms such as WhatsApp intensified during this period.

During pandemic-free years we can see a drop in messages in December as well as at the beginning of the year (January or February), and message peaks during the summer months.

The lower number of messages might be because in some months the entire family typically meets in person (Christmas gathering in December, ski vacation in early February, Mother's Day in May), so fewer messages have to be exchanged. July and August are typically the time to go on holiday and send messages and travel pictures.

Which words are most frequently used?

## # A tibble: 10 x 2
##    word       n
##    <chr>  <int>
##  1 ja       565
##  2 schon    494
##  3 danke    388
##  4 gute     369
##  5 liebe    304
##  6 gut      288
##  7 mal      257
##  8 na       227
##  9 wow      220
## 10 morgen   202

Authorship

To best predict authorship, I will create a wide data set again that has one row for each chat message and a column for each word. I am also going to include dummy variables for each family member, indicating who authored each message.
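A possible sketch of this reshaping, assuming chat_words holds one row per word per message (the actual preprocessing may differ); the author dummy columns end up as the last columns of chat_wide:

library(dplyr)
library(tidyr)

chat_wide <- chat_words %>%
  distinct(date_time, name, word) %>%       # date_time serves as a message id here
  mutate(used = 1L) %>%
  pivot_wider(names_from = word, values_from = used, values_fill = 0L) %>%    # one 0/1 column per word
  mutate(author = name, flag = 1L) %>%
  pivot_wider(names_from = author, values_from = flag, values_fill = 0L) %>%  # author dummies, last columns
  select(-date_time, -name)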

Lasso regression can help determine which words are most predictive of who authored a chat message. Let's see which words have the largest coefficients:

# required packages
library(caret)     # train(), trainControl()
library(glmnet)    # lasso via the "glmnet" method
library(dplyr)
library(ggplot2)
library(ggthemes)  # theme_economist_white()

# function to get coefficients for each word and plot the top 15 most likely
# and top 10 least likely used words for a given family member
whatsapp_word_coef <- function(name){

  lambda <- 10^seq(-1, -4, length = 5)
  train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
  grid <- expand.grid("alpha" = 1, lambda = lambda)        # alpha = 1 means lasso

  set.seed(570)

  fit <- train(
    # predictors: all word dummies, i.e. chat_wide without the author dummy
    # columns (which sit at the end of the data frame)
    x = data.frame(chat_wide[, -((ncol(chat_wide) - length(unique(chat$name)) + 1):ncol(chat_wide))]),
    # outcome: did this family member author the message?
    y = as.factor(unlist(chat_wide[, name, drop = FALSE])),
    method = "glmnet",
    family = "binomial",
    trControl = train_control,
    tuneGrid = grid
  )

  tuned_logit_lasso_model <- fit$finalModel
  best_lambda <- fit$bestTune$lambda

  lasso_coeffs <- data.frame(coef = row.names(as.matrix(coef(tuned_logit_lasso_model, best_lambda))),
                             as.matrix(coef(tuned_logit_lasso_model, best_lambda))) %>%
    mutate(beta = X1) %>%            # X1 is the coefficient column created by data.frame()
    filter(beta != 0) %>%
    filter(coef != "(Intercept)") %>%
    arrange(beta)

  if (nrow(lasso_coeffs) < 25) {
    df <- lasso_coeffs %>% mutate(i = row_number())
  } else {
    # keep the 10 most negative and the 15 most positive coefficients
    df <- lasso_coeffs[c(1:10, (nrow(lasso_coeffs) - 14):nrow(lasso_coeffs)), ] %>%
      mutate(i = row_number())
  }

  plot <- ggplot(df, aes(x = i, y = beta, fill = ifelse(beta > 0, "Likely", "Not likely"))) +
    geom_bar(stat = "identity", alpha = 0.75, width = 0.5) +
    scale_x_continuous(breaks = df$i, labels = df$coef, minor_breaks = NULL) +
    labs(title = "Authorship prediction",
         subtitle = "WhatsApp, Aug 2017 - April 2020") +
    xlab("") +
    ylab("Coefficient Estimate") +
    coord_flip() +
    scale_fill_manual(
      guide = guide_legend(title = paste0("Word most/least typically used by ", name, ":")),
      values = c("#446093", "#bc3939")) +
    theme_economist_white() +
    theme(legend.position = "top")

  ggsave(paste0("output/coefs_", name, ".png"), plot)
}

For some family members (Andrea, Fabian, Gerald, Gerhard, Lisa, Michael, Robert, Sophie, Tizian) no linear combination of any subset of the words is useful for predicting the author (this is when the lasso model shrinks all coefficients to zero). It seems the conversations among family members are pretty homogeneous and only a few use distinguishable language. Among the family members for whom lasso was able to identify words that are predictive of authorship, I have selected two visualisations:

“Omi” and “Opi” (grandma and grandpa) are words that point towards Chiara as the sender. Since she is one of the only two grandchildren in the chat, it makes sense that she uses these words most frequently. A unique characteristic of Chiara's messages is that she likes to express her feelings with a lot of “aaaaaaaawwwww”.

Gerhard uses the abbreviation “ad”, meaning “referring to the above”, which is unique to him. Also “schmackofatz”, a colloquial word for “tasty”, is an expression that only he uses.

Unique words (TF-IDF)

Another way to examine words that are unique to a family member but also frequently used by them is to calculate the TF-IDF score (Term Frequency-Inverse Document Frequency). It weights the frequency of a word within one author's messages by how rare that word is across the whole corpus.

The bind_tf_idf() function in the tidytext package takes a tidy text dataset as input, with one row per token (term) per document. One column (word here) contains the terms/tokens, one column contains the document (name, i.e. the author, in this case), and the last necessary column contains the counts, i.e. how many times each author used each term (n in this example).
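A sketch of this step, building the counts from chat_words and handing them to bind_tf_idf():

library(dplyr)
library(tidytext)

chat_tf_idf <- chat_words %>%
  count(name, word, sort = TRUE) %>%   # n = how often each author used each word
  bind_tf_idf(word, name, n)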

## # A tibble: 6 x 6
##   name    word         n      tf   idf  tf_idf
##   <chr>   <chr>    <int>   <dbl> <dbl>   <dbl>
## 1 Corinna schon      179 0.0133  0     0      
## 2 Corinna ja         166 0.0123  0     0      
## 3 Corinna happy      112 0.00831 0.310 0.00258
## 4 Corinna birthday   109 0.00809 0.511 0.00413
## 5 Corinna danke       96 0.00712 0     0      
## 6 Corinna gut         94 0.00697 0     0

Here we can see nicely how idf, and thus tf-idf, is zero for extremely common words like “schon” (already) or “ja” (yes). These words appear in chat messages from all family members, so the idf term (the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low (near zero) for words that occur in many of the chat messages; this is how this approach decreases the weight of common words. The inverse document frequency is higher for words that occur in fewer of the documents in the collection.

These are terms with high tf-idf:

## # A tibble: 6 x 6
##   name   word           n      tf   idf tf_idf
##   <chr>  <chr>      <int>   <dbl> <dbl>  <dbl>
## 1 Tizian cola          29 0.0229   2.01 0.0460
## 2 Tizian letztens      12 0.00946  2.71 0.0256
## 3 Chiara omi           15 0.0189   1.10 0.0207
## 4 Chiara opi            7 0.00881  2.01 0.0177
## 5 Andrea jedenfalls    12 0.0110   1.61 0.0177
## 6 Gerald forum         18 0.00584  2.71 0.0158

The list shows proper nouns, names and other words that are important in these chat messages. None of them are used by everyone, but they are characteristic words for the respective author's messages.

There are some obvious associations: one of Andrea's most important words is “Robert”, her husband, and one of Robert's most important words is “Andrea”, his wife. Anna talks about her husband “Leonid”, Viktoria about her partner “Fabi”, and Sophie about her partner “Michi”.

Some are less obvious but still plausible:

  • Corinna is often the first one to send out birthday wishes.
  • Gerald is often travelling to Sarajevo so it is among his most important words.
  • Gerhard loves to use abbreviations such as “vll” for “vielleicht” (maybe).
  • My most important words, which distinguish me from the rest of the family, are “grandma”, “urkunde” and “taufregister”. They refer to a time when I asked for help with some documents from my American host parents, who were trying to discover who their German ancestors were.
  • Among Robert's most important words is “Zoom”, which can be explained by the fact that he has a paid Zoom account and is the host of all our online Zoom meetings.
  • Silvia sends kisses (“Bussi”, “Bussal”) a lot.
  • Among Tizian's most important words are “Feuerwehr” (fire department) and “u17” (under 17), which makes sense since he is a volunteer firefighter and a teenager.
  • In Viktoria’s list we find “Ziehrer” which is the name of her music society.

Similarities

Are family members similar to each other in terms of text content? I will try to find out by computing the pairwise correlation of word frequencies between family members, using the pairwise_cor() function from the widyr package.
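A sketch of this computation, correlating word counts between authors (column names as before):

library(dplyr)
library(widyr)

word_cors <- chat_words %>%
  count(name, word) %>%
  pairwise_cor(name, word, n, sort = TRUE)   # correlation of word counts per pair of authors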

## # A tibble: 6 x 3
##   item1   item2   correlation
##   <chr>   <chr>         <dbl>
## 1 Silvia  Corinna       0.716
## 2 Corinna Silvia        0.716
## 3 Gerald  Silvia        0.699
## 4 Silvia  Gerald        0.699
## 5 David   Fabian        0.659
## 6 Fabian  David         0.659
## # A tibble: 6 x 3
##   item1   item2   correlation
##   <chr>   <chr>         <dbl>
## 1 Tizian  Gerhard       0.320
## 2 Gerhard Tizian        0.320
## 3 Chiara  Gerhard       0.315
## 4 Gerhard Chiara        0.315
## 5 Tizian  Anna          0.313
## 6 Anna    Tizian        0.313

Silvia and Corinna as well as Silvia and Gerald show similarities in the use and frequency of words. Tizian and Anna as well as Chiara and Gerhard show the least similarity in which words they use and how often.

For a better understanding of the correlations, I will visualise the strongest correlations in a network.
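One possible way to draw such a network with igraph and ggraph, keeping only correlations above an assumed threshold of 0.5:

library(dplyr)
library(igraph)
library(ggraph)

set.seed(570)
word_cors %>%
  filter(correlation > 0.5) %>%        # threshold is an assumption
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation)) +
  geom_node_point(color = "#446093", size = 4) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()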

It looks like there are strong correlations between Corinna and her children David, Silvia and Fabian. There is also a strong correlation between Silvia, her mother and her husband Gerald. There are only weak correlations between the two youngest family members, Chiara and Tizian, and most of the remaining family members. This may be an indicator of the different language younger and older generations tend to use.

Topic modeling

Although topic modeling might not be the most useful technique for an ongoing chat, I will still try to use it to explain the peak in message frequency in March 2020.

I am going to use the latent Dirichlet allocation (LDA) algorithm, which treats the chat history of each time window as a mixture of topics, and each topic as a mixture of words. This allows the chats of different time windows to “overlap” in terms of content, rather than being separated into discrete groups, in a way that mirrors typical use of natural language.

I am only going to use this year's chat messages and treat each month as a separate entity/document.
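A sketch of the preparation, counting words per month and casting them into a document-term matrix (chat_words_2020 is an assumed subset of this year's messages with a month column):

library(dplyr)
library(tidytext)

chat_dtm <- chat_words_2020 %>%
  count(month, word) %>%
  cast_dtm(month, word, n)   # one "document" per month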

## # A tibble: 6 x 3
##   month word      n
##   <ord> <chr> <int>
## 1 Mar   ja      101
## 2 Mar   schon    80
## 3 Jan   danke    54
## 4 Jan   gute     49
## 5 Mar   mal      44
## 6 Mar   grad     37
## <<DocumentTermMatrix (documents: 4, terms: 4653)>>
## Non-/sparse entries: 6188/12424
## Sparsity           : 67%
## Maximal term length: 37
## Weighting          : term frequency (tf)

I will use the LDA() function from the topicmodels package, setting k = 4, to create a four-topic LDA model.
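A minimal sketch of the model fit, using the document-term matrix from above (the seed is an assumption):

library(topicmodels)

chat_lda <- LDA(chat_dtm, k = 4, control = list(seed = 570))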

## A LDA_VEM topic model with 4 topics.

Now I will examine per-topic-per-word probabilities.
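These can be extracted with tidytext's tidy() method, for example:

library(tidytext)

chat_topics <- tidy(chat_lda, matrix = "beta")   # one row per topic-term combination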

## # A tibble: 6 x 3
##   topic term     beta
##   <int> <chr>   <dbl>
## 1     1 ja    0.0145 
## 2     2 ja    0.0126 
## 3     3 ja    0.0167 
## 4     4 ja    0.0130 
## 5     1 schon 0.0159 
## 6     2 schon 0.00168

The one-topic-per-term-per-row format shows, for each combination, the probability of that term being generated from that topic. For example, the term “ja” (“yes”) has an almost equal probability of being generated from topics 1, 2, 3 or 4.

With dplyr’s top_n() it is possible to display the top 15 terms within each topic.
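A sketch of this, grouping by topic and keeping the 15 terms with the highest beta:

library(dplyr)

top_terms <- chat_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)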

These topics are not clearly distinguishable from each other; however, topic 1 seems to include birthday congratulations, while topic 3 might be about well-being, about children and about doing ok. The terms with the highest beta coefficient are the same in all topics (ja, schon, gut(e), …).

Different values of k do not produce more clearly distinguishable topics.

Nonetheless, I want to know which topics are associated with each month by examining the per-month-per-topic probabilities, γ (“gamma”).
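These can be extracted in the same way as the betas, for example:

library(tidytext)

chat_gamma <- tidy(chat_lda, matrix = "gamma")   # one row per document (month) per topic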

## # A tibble: 6 x 3
##   document topic      gamma
##   <chr>    <int>      <dbl>
## 1 Mar          1 0.00000487
## 2 Jan          1 0.0000122 
## 3 Apr          1 0.0000146 
## 4 Feb          1 1.00      
## 5 Mar          2 0.268     
## 6 Jan          2 0.0000122

Each of these values is the estimated proportion of words from that month (here: document) that are generated from that topic. For example, the model estimates that a word typed in March has a probability of only about 0.0000049 (essentially zero) of coming from topic 1 (birthday).

During March, when communication peaks, topic 3 and to some extent topic 2 are more present than in the other months. Topic 1 was discussed mainly in February, topic 4 in April.

Let's check whether the birthday theme from topic 1 in February matches the family's birthday and name day calendar:

According to this chart, we would expect the birthday theme to be most present in January, not in February.

However, it makes sense that words showing affection, as in topic 3, were present in March, when the family was asking about everyone's well-being in these special times.

Looking at a greater selection of words from topic 2 and topic 3, it becomes more obvious that the pandemic played an important role in the conversations in March. Words like “hoffentlich” (hopefully), “masken” (masks), “idee” (idea), “arbeit” (work) and “kinder” (children) were used in these discussions.

Emojynalysis

As my family in law is very fond of expressing their thoughts and emotions through emojis I thought it was worth digging deeper into the frequency and preferences of these icons.

I extracted the emojis from the text and counted their occurrences. Then I added their respective R encoding and matched each one to its image.
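A rough sketch of the extraction and counting step, using the Unicode “Symbol, other” character class as an approximation for emojis (the author's actual extraction and image matching may differ):

library(dplyr)
library(stringr)
library(tidyr)

emoji_counts <- chat %>%
  mutate(emoji = str_extract_all(text, "\\p{So}")) %>%  # most emojis fall into \p{So}
  unnest(emoji) %>%
  count(name, emoji, sort = TRUE)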

In order to be able to display the emojis in ggplot I downloaded their images from the web, saved them as .png and made sure they are all roughly the same size.

The kissing heart is clearly the preferred emoji within my family in law. Family members also like to express their agreement, their delight and their enthusiasm with the respective emoticons.

Chiara favours hearts in different forms, while David mostly expresses his amusement and rarely uses hearts.

Sophie has a few smileys she regularly uses (heart-eyes, kiss-blowing, …), but other than that she only uses a few emojis associated with celebrations.

Sentiment

In this section I am going to use sentiment analysis techniques to examine how often positive and negative words occurred in the chat messages of the past years. Which family members were the most positive or negative overall?

I will use the SentimentWortschatz, or SentiWS for short, which lists positive and negative polarity-bearing words weighted within the interval [-1, 1].

Acknowledgement: SentiWS is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License (http://creativecommons.org/licenses/by-nc-sa/3.0/). R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis. In: Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC'10), 2010. This version of the data set was last updated in March 2012.
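A sketch of the scoring step, assuming the SentiWS lexicon has been read into a data frame sentiws with columns word and value (weights in [-1, 1]):

library(dplyr)

chat_sentiment <- chat_words %>%
  inner_join(sentiws, by = "word") %>%   # keep only words with a SentiWS entry
  group_by(name) %>%
  summarise(sentiment = sum(value), matched_words = n()) %>%
  arrange(desc(sentiment))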

In the family group chat positive contributions are clearly the dominant ones. Words like “leider” and “little” (unfortunately, small) are among the few words with a negative sentiment.

Summary

There are times when chat intensity picks up:

  • There are peaks of communication in the early evening, around 18:00.
  • Message exchange is slightly higher during the summer months; however, this year the month of March was extremely intense due to the frequent message exchange around Covid-19.

There is a different communication style between family members:

  • Corinna has been the most active member, while Sophie has contributed the least to the family chat.
  • Lasso regression could identify author-predictive words for only a few family members, but with the help of tf-idf scoring I was able to identify words for each family member that are characteristic of their chat messages.
  • Pairwise correlation of word frequencies revealed that the grandparent and parent generations show greater similarities in the use and frequency of words among themselves than the children do with the older generations.
  • Different family members use a different set of emojis; the overall favourite, however, is the kiss-blowing smiley.

Topic modeling did not produce very meaningful results; however, we were able to identify that the Covid-19 theme was indeed present in March 2020, when message intensity increased.

Last but not least, as expected, the overall sentiment of the chats is very positive.