Getting and cleaning the data
WhatsApp has a built-in function for exporting chat logs to a zip archive. On an iPhone it works like this: while in the chat overview, swipe the (group) chat of choice to the left, tap the three dots (… more), then select “Export Chat”. I saved the file directly to my iCloud, but there are alternatives such as sharing the file via AirDrop or email.
Before reading the data into R, I opened the file with a text editor and replaced the nicknames of each family member with proper first names for the analysis. I also removed the first three lines, which were not chat messages but the notification of me joining the group.
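The import itself is not shown; here is a minimal sketch, assuming the cleaned export is a plain text file (the path is hypothetical). Reading with a space separator puts each word of a line into its own column, which is the wide shape the cleaning steps below operate on.
library(dplyr)
library(stringr)
library(lubridate)

# hypothetical import: one chat line per row, one word per column
chat <- read.table("data/_chat.txt", sep = " ", quote = "",
                   fill = TRUE, stringsAsFactors = FALSE,
                   comment.char = "")
chat[chat == ""] <- NA  # treat empty cells as missing for the steps below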
Some data cleaning steps are needed before the analysis:
Some rows don’t have a timestamp. This happens when a message has multiple paragraphs.
These rows are shifted to the right so that the timestamp and name of the preceding message can be transferred into the freed columns. For this it is important to first append empty columns to the data frame, in case a stampless row is so long that shifting it would exceed the data frame’s dimensions:
# Add 3 empty columns to the end of the data frame to make space for shifting
chat <- cbind(chat, matrix(nrow = nrow(chat), ncol = 3))
# Shift rows without a datetime stamp 3 columns to the right
for (row in grep("\\d,$", chat[,1], invert = TRUE)) {
  end <- which(is.na(chat[row,]))[1]  # first empty column
  chat[row, 4:(end+2)] <- chat[row, 1:(end-1)]
  chat[row, 1:3] <- NA
}
Now I am going to format the timestamp so that R can evaluate it. Currently the first column shows the date in the format [%d.%m.%y, and the second column shows the time as %H:%M:%S]. I will merge the two and convert the result to a date-time (POSIXct) object, using the lubridate package.
## format date
# remove special characters from date: "[02.07.16", "16:35:09]" %d.%m.%y
chat[,1] <- gsub(".*[\\[]|[\\,].*", "", chat[,1])
chat[,2] <- gsub("\\]$", "", chat[,2])
# fill empty cells with the date and name of the previous row
for (row in which(is.na(chat[,1]))) {
  chat[row, 1:3] <- chat[(row-1), 1:3]
}
# Merge columns 1 and 2 (date and time) and convert to POSIXlt format
chat$V1 <- dmy_hms(paste(chat[,1], chat[,2]), tz = "")
## Warning: 3 failed to parse.
To have each entire message in one row, as opposed to spread over several rows and columns of the table, I am going to concatenate them. I am also going to remove the placeholder notices stating that videos, pictures and stickers were omitted from the text file upon export.
## format chat text
# create new column text
chat <- chat %>% mutate(text = NA)
# concatenate the strings from the word columns and save them in the text column
for (row in 1:nrow(chat)) {
  last <- which(is.na(chat[row,]))[1] - 1  # last non-empty column
  chat[row, ncol(chat)] <- str_c(unlist(chat[row, 3:last]), collapse = " ")
}
# remove the original columns containing the individual words
chat <- chat[, c(1:2, ncol(chat))]
# remove empty text rows
chat <- chat[!is.na(chat$text),]
# remove placeholder rows such as "Bild weggelassen" ("image omitted"), which
# indicate that someone sent a picture, video, GIF, audio, sticker or document.
# grepl() is used instead of chat[-grep(...),], which would empty the data
# frame if a pattern had no match.
chat <- chat[!grepl("^(Bild|Video|GIF|Audio|Sticker) weggelassen", chat[,3]),]
chat <- chat[!grepl("Dokument weggelassen", chat[,3]),]
# remove links
chat <- chat[!grepl("^(http|www)", chat[,3]),]
The sender names still have a colon at the end, which doesn’t look nice in graphics, so I strip it.
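A minimal sketch of that step, assuming the sender name ended up in a column called name:
# strip the trailing colon from the sender names
chat$name <- sub(":$", "", chat$name)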
## date_time name
## 1 2017-08-05 21:31:36 Fabian
## 2 2017-08-05 21:38:45 Gerald
## text
## 1 Wir begrüßen unser neues Mitglied in der Selbsthilfegruppe. Hallo lisa
## 2 Hallo Lisa \U0001f38a\U0001f389
Next I am going to convert the data frame into a long (or tidy) format, with one row for each word of each chat message. By default, unnest_tokens() converts the tokens (words) to lowercase, which makes them easier to compare or combine with other datasets.
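The tokenization itself is a one-liner with tidytext; a sketch, assuming the cleaned chat data frame from above:
library(tidytext)
# one row per word; unnest_tokens() also lowercases and strips punctuation
words <- chat %>%
  unnest_tokens(output = word, input = text)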
## date_time name word
## 1 2017-08-05 21:31:36 Fabian wir
## 1.1 2017-08-05 21:31:36 Fabian begrüßen
## 1.2 2017-08-05 21:31:36 Fabian unser
## 1.3 2017-08-05 21:31:36 Fabian neues
## 1.4 2017-08-05 21:31:36 Fabian mitglied
## 1.5 2017-08-05 21:31:36 Fabian in
Then I am going to remove common German stop words as well as some other words I would like to ignore.
# German stop words
stop_words_de <- data.frame(word = stopwords("german"))
words <- words %>%
  anti_join(stop_words_de, by = "word")
options(stringsAsFactors = FALSE)
words_to_ignore <- data.frame(word = c("name", "is", "word", "a", "ä", "dass"))
words <- words %>%
  anti_join(words_to_ignore, by = "word")
# remove words that start with a special character or a number
words <- words[!grepl("^[[:punct:]]|^\\d", words$word), ]
# remove words with a character length of 1
words$nchar <- nchar(words$word)
words <- words[words$nchar > 1,]
Exploratory analysis
First I will have a look at the number and timing of messages.
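The plots in this section use a timelogs data frame that is not created in the snippets above. Presumably it holds the timestamp and sender of every message plus some derived time fields; a hypothetical reconstruction:
# one row per message with its timestamp and sender, plus derived time fields
timelogs <- chat %>%
  select(date_time, name) %>%
  mutate(hour = hour(date_time),
         month = month(date_time, label = TRUE))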
Who messages the most?
ggplot(timelogs, aes(x = name, fill = name)) +
  stat_count(position = "dodge", show.legend = FALSE) +
  # geom_text(aes(label = Frequency), size = 3, hjust = 0.5, vjust = 3, position = "stack") +
  geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
  scale_fill_viridis_d() +
  labs(title = "Conversation frequency per family member",
       subtitle = "WhatsApp, Aug 2017 - April 2020",
       y = "# of messages",
       x = "") +
  theme_economist_white() +
  theme(axis.text.x = element_text(angle = 90))

Corinna is by far the most active person in the family chat, with Robert and David the second and third most active, respectively. Michael and Sophie contribute the least.
What time of the day do messages peak?
Looking at an entire 24-hour day:

Conversations happen more frequently in the evening hours, peaking at 18:00.
During which months are family members most active in the family chat?

March is the month with the highest frequency of messages on average, followed by July, while February, October and December are rather calm months.
I do not have an explanation for the peak in March at this point, but will explore this further:
ggplot(data = timelogs %>%
         mutate(week = week(date_time),
                day = lubridate::wday(date_time, label = TRUE)) %>%
         count(week, day),
       aes(x = as.factor(week), y = day, fill = n)) +
  geom_tile(show.legend = FALSE) +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Conversation frequency per day of the year",
       subtitle = "WhatsApp, Aug 2017 - March 2020",
       y = "weekday",
       x = "week of the year") +
  theme_economist_white() +
  theme(axis.text.x = element_text(angle = 90))

The heatmap shows that the message frequency is strongest around week 12, which is approximately mid-March.

The plot separated by year reveals that the peak in March is caused by a very intense message exchange in March 2020, most likely triggered by the governmental reactions to the COVID-19 pandemic and their impact on our lives. All of us were affected in one way or another, and since in-person meetings were prohibited, information exchange on platforms such as WhatsApp intensified during this period.
In pandemic-free years we see a drop in messages in December and at the beginning of the year (January or February), and message peaks during the summer months.
The lower numbers might be because in some months the entire family typically meets in person (the Christmas gathering in December, a ski vacation in early February, Mother's Day in May), so fewer messages need to be exchanged. July and August are typically the time to go on holiday and to send messages and travel pictures.
Which words are most frequently used?
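Counting is straightforward with dplyr, using the tidy words data frame:
# ten most frequent words after stop word removal
words %>%
  count(word, sort = TRUE) %>%
  head(10)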
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 ja 565
## 2 schon 494
## 3 danke 388
## 4 gute 369
## 5 liebe 304
## 6 gut 288
## 7 mal 257
## 8 na 227
## 9 wow 220
## 10 morgen 202
Authorship
To best predict authorship, I will create a wide data set again that has one row for each chat message and a column for each word. I am going to include dummy variables for each family member, indicating who used each word and when.
words[,3] <- make.names(words[,3])
chat_wide <- words %>%
  group_by(name, date_time, word) %>%
  summarise(contains = 1) %>%
  ungroup() %>%
  spread(key = word, value = contains, fill = 0) %>%
  mutate(Andrea   = as.integer(name == "Andrea"),
         Anna     = as.integer(name == "Anna"),
         Chiara   = as.integer(name == "Chiara"),
         Corinna  = as.integer(name == "Corinna"),
         David    = as.integer(name == "David"),
         Fabian   = as.integer(name == "Fabian"),
         Gerald   = as.integer(name == "Gerald"),
         Gerhard  = as.integer(name == "Gerhard"),
         Lisa     = as.integer(name == "Lisa"),
         Michael  = as.integer(name == "Michael"),
         Robert   = as.integer(name == "Robert"),
         Silvia   = as.integer(name == "Silvia"),
         Sophie   = as.integer(name == "Sophie"),
         Tizian   = as.integer(name == "Tizian"),
         Viktoria = as.integer(name == "Viktoria")) %>%
  select(-name, -date_time)
Lasso regression can help determine which words are most predictive of who authored a chat message. Let’s see which words get the largest coefficients:
# function to get the coefficients for each word and plot the top 15 most likely
# and top 10 least likely used words
whatsapp_word_coef <- function(name){
  lambda <- 10^seq(-1, -4, length = 5)
  train_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
  grid <- expand.grid("alpha" = 1, lambda = lambda)
  set.seed(570)
  fit <- train(
    x = data.frame(chat_wide[, -((ncol(chat_wide) - length(unique(chat$name)) + 1):ncol(chat_wide))]),
    y = as.factor(unlist(chat_wide[, name, drop = FALSE])),
    method = "glmnet",
    family = "binomial",
    trControl = train_control,
    tuneGrid = grid
  )
  tuned_logit_lasso_model <- fit$finalModel
  best_lambda <- fit$bestTune$lambda
  lasso_coeffs <- data.frame(coef = row.names(as.matrix(coef(tuned_logit_lasso_model, best_lambda))),
                             as.matrix(coef(tuned_logit_lasso_model, best_lambda))) %>%
    mutate(beta = X1) %>%
    filter(beta != 0) %>%
    filter(coef != "(Intercept)") %>%
    arrange(beta)
  # if/else instead of ifelse(): ifelse() is meant for vectors, not control flow
  if (nrow(lasso_coeffs) < 25) {
    df <- lasso_coeffs %>% mutate(i = row_number())
  } else {
    df <- lasso_coeffs[c(1:10, (nrow(lasso_coeffs) - 14):nrow(lasso_coeffs)), ] %>%
      mutate(i = row_number())
  }
  plot <- ggplot(df, aes(x = i, y = beta, fill = ifelse(beta > 0, "Likely", "Not likely"))) +
    geom_bar(stat = "identity", alpha = 0.75, width = 0.5) +
    scale_x_continuous(breaks = df$i, labels = df$coef, minor_breaks = NULL) +
    labs(title = "Authorship prediction",
         subtitle = "WhatsApp, Aug 2017 - April 2020") +
    xlab("") +
    ylab("Coefficient Estimate") +
    coord_flip() +
    scale_fill_manual(
      guide = guide_legend(title = paste0("Word most/least typically used by ", name, ":")),
      values = c("#446093", "#bc3939")) +
    theme_economist_white() +
    theme(legend.position = "top")
  ggsave(paste0("output/coefs_", name, ".png"), plot)
}
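The function is then called once per family member, e.g.:
# example call: fit the lasso for one member and save the coefficient plot
whatsapp_word_coef("Chiara")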
For some family members (Andrea, Fabian, Gerald, Gerhard, Lisa, Michael, Robert, Sophie, Tizian) no linear combination of any subset of the words is useful for predicting the author; this is the case when the lasso shrinks all coefficients to zero. It seems the conversations among family members are pretty homogeneous and only a few use distinguishable language. Among the family members for whom the lasso was able to identify words predictive of authorship, I have selected two visualisations:


“Omi” and “Opi” (grandma and grandpa) are words that point towards Chiara as the sender. Since she is one of only two grandchildren in the chat, it makes sense that she uses these words most frequently. A unique characteristic of Chiara's messages is that she likes to express her feelings with a lot of “aaaaaaaawwwww”.
Gerhard uses the abbreviation “ad”, meaning “referring to the above”, which is unique to him. Likewise “schmackofatz”, a colloquial word for “tasty”, is an expression only he uses.
Unique words (TF-IDF)
Another way to examine words that are unique to a family member yet frequently used is to calculate the TF-IDF score (Term Frequency-Inverse Document Frequency). It weights the frequency of a word within one document against how common that word is across the whole corpus.
The bind_tf_idf function in the tidytext package takes a tidy text dataset as input, with one row per token (term) per document. One column (word here) contains the terms/tokens, one column contains the document (the name/author in this case), and the last necessary column contains the counts: how many times each person used each term (n in this example).
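A sketch of that computation, building the per-person counts from the tidy words data frame created earlier:
# count how often each person used each word, then attach tf-idf scores
words_tfidf <- words %>%
  count(name, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = name, n = n)
head(words_tfidf)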
## # A tibble: 6 x 6
## name word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Corinna schon 179 0.0133 0 0
## 2 Corinna ja 166 0.0123 0 0
## 3 Corinna happy 112 0.00831 0.310 0.00258
## 4 Corinna birthday 109 0.00809 0.511 0.00413
## 5 Corinna danke 96 0.00712 0 0
## 6 Corinna gut 94 0.00697 0 0
Here we can nicely see how idf, and thus tf-idf, is zero for extremely common words like “schon” (already) or “ja” (yes). These words appear in the chat messages of all family members, so the idf term (the natural log of 1) is zero. The inverse document frequency (and thus tf-idf) is very low, near zero, for words that occur in many of the chat messages; this is how the approach down-weights common words. The inverse document frequency is higher for words that occur in fewer of the documents in the collection.
These are terms with high tf-idf:
## # A tibble: 6 x 6
## name word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 Tizian cola 29 0.0229 2.01 0.0460
## 2 Tizian letztens 12 0.00946 2.71 0.0256
## 3 Chiara omi 15 0.0189 1.10 0.0207
## 4 Chiara opi 7 0.00881 2.01 0.0177
## 5 Andrea jedenfalls 12 0.0110 1.61 0.0177
## 6 Gerald forum 18 0.00584 2.71 0.0158
The list shows mostly proper nouns, names and other words that are important in these chat messages. None of them are used by everyone, but they are characteristic words for the messages of their respective author.

There are some obvious associations: one of Andrea’s most important words is “Robert”, her husband, and one of Robert’s most important words is “Andrea”, his wife. Anna talks about her husband “Leonid”, Viktoria about her partner “Fabi”, and Sophie about her partner “Michi”.
Some are less obvious but still plausible:
- Corinna is often the first one to send out birthday wishes.
- Gerald often travels to Sarajevo, so it is among his most important words.
- Gerhard loves to use abbreviations such as “vll” for “vielleicht” (maybe).
- My most important words that distinguish me from the rest of the family are “grandma”, “urkunde” and “taufregister”, which refer to a time when I asked for help with some documents for my American host parents, who were trying to discover who their German ancestors were.
- Among Robert’s most important words is “Zoom”, which can be explained by the fact that he has a paid Zoom account and is the host of all our online Zoom meetings.
- Silvia sends kisses (“Bussi”, “Bussal”) a lot.
- Among Tizian’s most important words are “Feuerwehr” (fire department) and “u17” (under 17), which makes sense since he is a volunteer firefighter and a teenager.
- In Viktoria’s list we find “Ziehrer”, the name of her music society.
Similarities
Do family members resemble each other in the content of their messages? I will try to find out by computing the pairwise correlation of word frequencies between family members, using the pairwise_cor() function from the widyr package.
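A sketch of that computation, again starting from per-person word counts:
library(widyr)
# correlate word frequencies between every pair of family members
name_cors <- words %>%
  count(name, word, sort = TRUE) %>%
  pairwise_cor(item = name, feature = word, value = n, sort = TRUE)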
## # A tibble: 6 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 Silvia Corinna 0.716
## 2 Corinna Silvia 0.716
## 3 Gerald Silvia 0.699
## 4 Silvia Gerald 0.699
## 5 David Fabian 0.659
## 6 Fabian David 0.659
## # A tibble: 6 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 Tizian Gerhard 0.320
## 2 Gerhard Tizian 0.320
## 3 Chiara Gerhard 0.315
## 4 Gerhard Chiara 0.315
## 5 Tizian Anna 0.313
## 6 Anna Tizian 0.313
Silvia and Corinna, as well as Silvia and Gerald, have the most similar word use and frequency. Tizian and Anna, or Chiara and Gerhard, are the least similar in which words they use and how often.
For a better understanding of the correlations I will visualise the very strong correlations in a network.

It looks like there are strong correlations between Corinna and her children David, Silvia and Fabian. There is also a strong correlation between Silvia, her mother and her husband Gerald. The two youngest family members, Chiara and Tizian, correlate only weakly with most of the remaining family. This may indicate that younger and older generations tend to use different language.
Topic modeling
Although topic modeling might not be the most useful technique for an ongoing chat, I will still try it in order to explain the peak in message frequency in March 2020.
I am going to use the latent Dirichlet allocation (LDA) algorithm, which treats each chat history within a time window as a mixture of topics, and each topic as a mixture of words. This allows the chat in different time windows to “overlap” in content rather than being separated into discrete groups, in a way that mirrors typical use of natural language.
I am only going to use this year’s chat messages and treat each month as a separate entity/document.
## # A tibble: 6 x 3
## month word n
## <ord> <chr> <int>
## 1 Mar ja 101
## 2 Mar schon 80
## 3 Jan danke 54
## 4 Jan gute 49
## 5 Mar mal 44
## 6 Mar grad 37
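These per-month counts are then cast into a document-term matrix. A sketch of that step, assuming the count table above is called month_words (a hypothetical name):
# cast the tidy month/word counts into a DocumentTermMatrix for topicmodels
chat_dtm <- month_words %>%
  cast_dtm(document = month, term = word, value = n)
chat_dtm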
## <<DocumentTermMatrix (documents: 4, terms: 4653)>>
## Non-/sparse entries: 6188/12424
## Sparsity : 67%
## Maximal term length: 37
## Weighting : term frequency (tf)
I will use the LDA() function from the topicmodels package, setting k = 4, to create a four-topic LDA model.
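A sketch of the model fit (the seed value is a hypothetical choice for reproducibility):
library(topicmodels)
# four-topic LDA on the month-level document-term matrix
chat_lda <- LDA(chat_dtm, k = 4, control = list(seed = 1234))
chat_lda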
## A LDA_VEM topic model with 4 topics.
Now I will examine per-topic-per-word probabilities.
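tidytext’s tidy() method extracts these as a one-row-per-topic-per-term table; presumably this is how the chat_topics object used further below was created:
# per-topic-per-word probabilities (beta) in tidy form
chat_topics <- tidy(chat_lda, matrix = "beta")
head(chat_topics)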
## # A tibble: 6 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 ja 0.0145
## 2 2 ja 0.0126
## 3 3 ja 0.0167
## 4 4 ja 0.0130
## 5 1 schon 0.0159
## 6 2 schon 0.00168
The one-topic-per-term-per-row format shows, for each combination, the probability of that term being generated from that topic. For example, the term “ja” (“yes”) has an almost equal probability of being generated from topics 1, 2, 3 or 4.
With dplyr’s top_n() it is possible to display the top 15 terms within each topic.
top_terms <- chat_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  scale_fill_manual("legend", values = c("#0072B2", "#FF9999", "#999999", "#009E73")) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  theme_bw()

These topics are not clearly distinguishable from each other. However, topic 1 seems to include birthday congratulations, while topic 3 might be about well-being, about children and about doing OK. The terms with the highest beta coefficient are the same in all topics (ja, schon, gut(e), …).
Other values for k do not produce more distinguishable results.
Nonetheless I want to know which topics are associated with each month by examining the per-month-per-topic probabilities, γ (“gamma”).
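Extracting gamma works analogously to beta above:
# per-month-per-topic probabilities (gamma) in tidy form
chat_gamma <- tidy(chat_lda, matrix = "gamma")
head(chat_gamma)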
## # A tibble: 6 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 Mar 1 0.00000487
## 2 Jan 1 0.0000122
## 3 Apr 1 0.0000146
## 4 Feb 1 1.00
## 5 Mar 2 0.268
## 6 Jan 2 0.0000122
Each of these values is an estimated proportion of the words from that month (here: document) that are generated from that topic. For example, the model estimates that a word typed in March has only a probability of about 0.0000049, i.e. essentially zero, of coming from topic 1 (birthday).

During March, when communication peaks, topic 3 and possibly topic 2 are more present than in the other months. Topic 1 was discussed mainly in February, topic 4 in April.
Let’s check whether the birthday theme from topic 1 in February matches the family’s birthday and name day calendar:
# birthday and name day counts per month
BD <- data.frame(month = 1:12, n = c(3,2,2,1,1,3,3,4,3,2,4,2))
ggplot(data = BD, aes(x = as.factor(month), y = n)) +
  geom_bar(stat = "identity", fill = "darkred", alpha = 0.75, size = 0.5) +
  geom_image(aes(image = "emojis/1F382.png"), size = 0.1) +
  labs(title = "Birthdays and name days",
       x = "month",
       y = "") +
  theme_economist_white()

According to this chart, we would expect the birthday theme to be most present in January, not in February.
However, it makes sense that words showing affection, as in topic 3, were present in March, when the family was asking about everyone’s well-being in these special times.
Looking at a larger selection of words from topics 2 and 3 makes it more obvious that the pandemic played an important role in the March conversations. Words like “hoffentlich” (hopefully), “masken” (masks), “idee” (idea), “arbeit” (work) and “kinder” (children) were used in these discussions.
Emojynalysis
As my family-in-law is very fond of expressing their thoughts and emotions through emojis, I thought it was worth digging deeper into the frequency of and preferences for these icons.
I extracted the emojis from the text and counted their occurrences. Then I added each emoji’s R encoding and matched it to its image.
To be able to display the emojis in ggplot, I downloaded their images from the web, saved them as .png files and made sure they are all roughly the same size.
## ---- setup ----
# download the dictionary csv file from Jessica's GitHub profile:
# download.file("https://raw.githubusercontent.com/today-is-a-good-day/emojis/master/emojis.csv", "data/emojis.csv")
# read in the emoji dictionary
emDict_raw <- read.csv2("data/emojis.csv") %>%
  select(description = EN, r_encoding = utf8, unicode)
# convert unicode to printable unicode
emDict_raw <- emDict_raw %>%
  mutate(unicode2 = vapply(parse(text = gsub("U[+]([0-9A-Fa-f]{1,8})",
                                             "\\\\U\\1",
                                             encodeString(unicode, quote = '"')),
                                 keep.source = FALSE), eval, "")) %>%
  mutate(description = str_to_lower(description))
# plain skin tones
skin_tones <- c("light skin tone",
                "medium-light skin tone",
                "medium skin tone",
                "medium-dark skin tone",
                "dark skin tone")
# remove plain skin tones and remove skin tone info in description
emDict <- emDict_raw %>%
  # remove plain skin tone emojis
  filter(!description %in% skin_tones) %>%
  # remove emojis with skin tone info, e.g. remove "woman: light skin tone"
  # and only keep "woman"
  filter(!grepl(":", description)) %>%
  mutate(description = tolower(description)) %>%
  mutate(unicode = as.u_char(unicode))
# all emojis with more than one unicode codepoint become NA
matchto <- emDict$r_encoding
description <- emDict$description
# rank emojis by occurrence in the data
rank_emoji <- list()
familymembers <- sort(unique(chat$name))
for (i in seq_along(familymembers)) {
  # convert text to an ASCII-safe format (ASCII is a subset of UTF-8)
  chat2 <- chat[chat$name == familymembers[i],] %>%
    mutate(text2 = iconv(text, from = "latin1", to = "ascii", sub = "byte"))
  # emojis_matching() is a helper function (not shown here) that counts emoji matches per message
  rank_emoji[[familymembers[i]]] <- emojis_matching(chat2$text2, matchto, description) %>%
    filter(!is.na(description)) %>%
    group_by(description) %>%
    dplyr::summarise(n = sum(count)) %>%
    arrange(-n) %>%
    dplyr::mutate(name = familymembers[i]) %>%
    left_join(emDict_raw, by = "description") %>%
    mutate(image_path = paste0("emojis/", substr(unicode, 3, 7), ".png"))
  emoji_plot <- ggplot(data = rank_emoji[[familymembers[i]]] %>%
                         head(15) %>%
                         dplyr::mutate(row = row_number(),
                                       description = factor(description,
                                                            levels = description[order(-row)])),
                       aes(x = description, y = n)) +
    geom_bar(stat = "identity") +
    geom_image(aes(image = image_path), size = 0.04) +
    coord_flip() +
    labs(title = paste0("Emojis most frequently used by ", familymembers[i]),
         x = "", y = "") +
    theme_bw()
  ggsave(paste0("output/emojis_", familymembers[i], ".png"), emoji_plot)
}

The kissing heart is clearly the preferred emoji within my family-in-law. Family members also like to express their agreement, delight and enthusiasm with the respective emoticons.


Chiara favours hearts in different forms, while David mostly expresses his amusement and rarely uses hearts.

Sophie has a few smileys she regularly uses (heart-eyes, kiss-blowing, …), but other than that she only uses a few emojis associated with celebrations.
Sentiment
In this section I am going to use sentiment analysis techniques to examine how often positive and negative words occurred in the chat messages of the last years. Which family members were the most positive or negative overall?
I will use the SentimentWortschatz, or SentiWS for short, which lists positive and negative polarity bearing words weighted within the interval of [-1; 1].
Acknowledgement: SentiWS is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License (http://creativecommons.org/licenses/by-nc-sa/3.0/). R. Remus, U. Quasthoff & G. Heyer: SentiWS - a Publicly Available German-language Resource for Sentiment Analysis. In: Proceedings of the 7th International Language Resources and Evaluation (LREC’10), 2010. This version of the data set was last updated in March 2012.
readAndflattenSentiWS <- function(filename) {
  # each line looks like "word|POS<TAB>weight<TAB>inflections,..."; keep only the words
  words <- readLines(filename, encoding = "UTF-8")
  words <- sub("\\|[A-Z]+\t[0-9.-]+\t?", ",", words)
  words <- unlist(strsplit(words, ","))
  words <- tolower(words)
  return(words)
}
pos.words <- data.frame(word = readAndflattenSentiWS("data/SentiWS_v1.8c/SentiWS_v1.8c_Positive.txt")) %>%
  mutate(sentiment = 1)
neg.words <- data.frame(word = readAndflattenSentiWS("data/SentiWS_v1.8c/SentiWS_v1.8c_Negative.txt")) %>%
  mutate(sentiment = -1)
german_sentiment <- rbind(pos.words, neg.words)
words_sentiment <- words %>%
  inner_join(german_sentiment, by = "word") %>%
  count(word, sentiment) %>%
  mutate(word = reorder(word, n)) %>%
  mutate(contribution = n * sentiment)
ggplot(words_sentiment %>%
         filter(contribution > 40 | contribution < -40),
       aes(word, contribution, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(y = "Contribution to sentiment", x = NULL) +
  coord_flip() +
  theme_bw()

In the family group chat, positive contributions clearly dominate. Words like “leider” (unfortunately) and “little” (small) are among the few with a negative sentiment.