library(downloader) # data download
library(tm)         # text mining
library(gridExtra)  # arrange multiple plots
library(ggplot2)    # plotting
library(wordcloud)  # word cloud visualisation
library(tidytext)   # tokenisation
library(dplyr)      # data manipulation
library(kableExtra) # table formatting
Introduction
For this Capstone Project, as part of the Data Science Specialisation by Johns Hopkins University, I was provided with the SwiftKey company dataset, which contains several text files from different sources (blogs, news, Twitter) in different languages: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The text originally comes from HC Corpora, a collection of free text corpora.
The aim of the capstone project is to develop a web application and predictive model that suggests a word given a limited text input.
In this report I am going to describe the dataset in more detail, perform an exploratory analysis, and explain the goals for building a predictive model for text and a predictive text mining application.
Exploratory Analysis
The intention of this analysis is to understand the basic relationships observed in the data and prepare to build a first linguistic model. For that we need to understand the distribution and relationship between the words, tokens, and phrases in the text.
Obtaining the data
First I am going to download the dataset and unzip it:
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "dataset.zip"
outPath <- getwd()
if (!file.exists(dest)) {download(url, dest, mode = "wb")}
unzip(dest, exdir = outPath, overwrite = TRUE)
Next I will check which text files are in the folder that I downloaded:
# list text files in downloaded zip folder
textfiles <- list.files(outPath, pattern = "\\.txt",recursive = TRUE, include.dirs = TRUE)
textfiles
## [1] "final/de_DE/de_DE.blogs.txt" "final/de_DE/de_DE.news.txt"
## [3] "final/de_DE/de_DE.twitter.txt" "final/en_US/en_US.blogs.txt"
## [5] "final/en_US/en_US.news.txt" "final/en_US/en_US.twitter.txt"
## [7] "final/fi_FI/fi_FI.blogs.txt" "final/fi_FI/fi_FI.news.txt"
## [9] "final/fi_FI/fi_FI.twitter.txt" "final/ru_RU/ru_RU.blogs.txt"
## [11] "final/ru_RU/ru_RU.news.txt" "final/ru_RU/ru_RU.twitter.txt"
## [13] "output/textsample.txt"
There are 12 original text files - 3 text files each in 4 different languages (the additional output/textsample.txt is created later in this report).
I will start by importing the English text files only:
ENlist <- list.files(path = outPath, recursive = TRUE, pattern = ".*en_.*.txt")
blogsURL <- file(ENlist[1], open = "rb") # open for reading in binary mode
blogs <- readLines(blogsURL, encoding = "UTF-8", skipNul = TRUE)
close(blogsURL)
newsURL <- file(ENlist[2], open = "rb") # open for reading in binary mode
news <- readLines(newsURL, encoding = "UTF-8", skipNul = TRUE)
close(newsURL)
twitterURL <- file(ENlist[3], open = "rb") # open for reading in binary mode
twitter <- readLines(twitterURL, encoding = "UTF-8", skipNul = TRUE)
close(twitterURL)
Here are samples of lines from these files:
- en_US.blogs.txt
## [1] "I am en-route to Cornwall again. 3/4 months of slog and sun and sea too. Always a job needs doing in a tourist town. I’m bringing my stuff back from 97 on Friday. Will need to cancel bills before I retreat. Not such a bad thing. Moved to Leeds with promise from Millies. That went nowhere. Move on."
## [2] "Pure large leaf Assam. No waffling with the leaf, thank you. I want it strong and dark with no herby frills. And for goodness sake no fruit mixers and no sweetener. Why would you do that to tea?"
- en_US.news.txt
## [1] "\"The absurdity of attempting to bottle up news of such magnitude was too apparent,\" he would later write."
## [2] "GM labor relations vice president Diana Tremblay said the deal \"will enable GM to be fully competitive and has eliminated the gap with our competitors.\""
- en_US.twitter.txt
## [1] "Dammnnnnn what a catch"
## [2] "such a great picture! The green shirt totally brings out your eyes!"
Summary statistics
stats <- data.frame(file = sub("final/en_US/", "", ENlist))
stats$filesize <- round(sapply(ENlist, file.size)/(1024^2), 1) # in megabytes
stats$lines <- c(length(blogs), length(news), length(twitter))
stats$words <- c(sum(sapply(strsplit(blogs, " "), length)),
                 sum(sapply(strsplit(news, " "), length)),
                 sum(sapply(strsplit(twitter, " "), length))) # alternative: str_count(s, '\\w+'); library stringr
stats$longest_line <- c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
kableExtra::kable_styling(knitr::kable(stats),
                          bootstrap_options = "striped",
                          full_width = FALSE,
                          position = "center")
file | filesize (MB) | lines | words | longest_line (chars) |
---|---|---|---|---|
en_US.blogs.txt | 200.4 | 899288 | 37334131 | 40833 |
en_US.news.txt | 196.3 | 1010242 | 34372530 | 11384 |
en_US.twitter.txt | 159.4 | 2360148 | 30373583 | 140 |
- The English text files have a file size between 160 and 200 MB.
- The blogs file has the fewest lines (about 900,000), while the Twitter file has the most (about 2.36 million).
- The number of words exceeds 30 million per file.
- Since Twitter restricts tweet length, the longest line in the Twitter file is only 140 characters. Blogs and news are less restricted and contain long paragraphs: the longest blog entry has roughly 41,000 characters, the longest news entry roughly 11,000.
# plot filesize
plot_filesize <- ggplot(stats, aes(x = file, y = filesize, color = file, fill = file)) +
geom_bar(stat = "identity") +
theme(legend.position = "none",
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey")) +
coord_flip() +
scale_fill_brewer(palette = "Paired", direction = -1) +
scale_color_brewer(palette = "Paired", direction = -1) +
xlab("")
plot_lines <- ggplot(stats, aes(x = file, y = lines, color = file, fill = file)) +
geom_bar(stat = "identity") +
coord_flip()+
scale_fill_brewer(palette = "Paired", direction = -1) +
scale_color_brewer(palette = "Paired", direction = -1) +
scale_y_continuous(labels = scales::comma) +
theme(legend.position = "none",
axis.title.y=element_blank(),
axis.text.y=element_blank(),
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
plot_words <- ggplot(stats, aes(x = file, y = words, color = file, fill = file)) +
geom_bar(stat = "identity") +
coord_flip()+
scale_fill_brewer(palette = "Paired", direction = -1) +
scale_color_brewer(palette = "Paired", direction = -1)+
xlab("") +
scale_y_continuous(labels = scales::comma) +
theme(legend.position = "none",
axis.title.y=element_blank(),
axis.text.y=element_blank(),
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
grid.arrange(plot_filesize, plot_lines, plot_words, ncol = 3)
Data preparation
This process includes the following steps:
- Random sampling - To produce a representative sample of the population, the three datasets are combined into a corpus (a collection of documents) and 1% of the data is randomly extracted for further analysis.
- Cleaning - This includes, for example, converting text to lower case, removing punctuation, profanity filtering, etc.
- Tokenisation - The aim is to identify appropriate tokens such as words, punctuation, and numbers by writing a function that takes a file as input and returns a tokenized version of it.
Random sampling
To build models it is not necessary to use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data.
set.seed(190820)
textsample <- c(
  blogs[sample(1:length(blogs), length(blogs)/100, replace = FALSE)],
  news[sample(1:length(news), length(news)/100, replace = FALSE)],
  twitter[sample(1:length(twitter), length(twitter)/100, replace = FALSE)])
if (!dir.exists("output")) dir.create("output") # make sure the output folder exists
write(textsample, "output/textsample.txt")
length(textsample)
## [1] 42695
The random subsample consists of 42695 lines.
Cleaning
I am going to use the tm package to perform the data cleaning tasks.
Convert character vector between encodings
This conversion transforms the text to ASCII: diacritic/accent characters (âêîôûŷŵ äëïöüÿ àèìòù áéíóúý ãñõ ø) and all other characters that are not a letter, number or common symbol are removed.
textsample2 <- iconv(textsample, from = "", to = "ASCII", sub="")
# sub replaces any non-convertible bytes in the input.
# load data as a corpus
corpus <- VCorpus(VectorSource(textsample2))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 42695
Text sample (line 12):
## I then dealt with the insurance company directly, got them whatever they needed and kept getting stalled about the cheque. Finally I just lost it, told them I cant do business without that money, that Id make it my full time job to show up to their offices and call every TV station I could think of to cover this story if I didnt get my cheque! They seemed pretty alarmed.
Convert text to lower case
This facilitates further cleaning and analysis steps.
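The code for this step is not shown in the report; a minimal sketch of how it can be done with the tm package (assuming the corpus object created above):
# convert the whole corpus to lower case; tolower is wrapped so the corpus structure is preserved
corpus <- tm_map(corpus, content_transformer(tolower))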
Remove special characters, numbers, punctuation, URLs…
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+") # URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                     # Twitter handles
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
remURL <- function(x) gsub("http[[:alnum:]]*", "", x)             # remaining URL fragments
corpus <- tm_map(corpus, content_transformer(remURL))
Remove white space
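The corresponding chunk is hidden as well; a sketch using tm's built-in transformation:
# collapse runs of whitespace into a single space
corpus <- tm_map(corpus, stripWhitespace)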
Remove stop words
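Only the result of this step is used later; a sketch with tm's standard English stop word list:
# remove very common English words ("the", "and", "to", ...) that carry little predictive information
corpus <- tm_map(corpus, removeWords, stopwords("english"))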
Profanity filtering
Removes words that we do not want to predict, such as swear words.
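The chunk itself is not echoed; a sketch of the approach, based on the word list whose download produces the warning below:
# download a public list of swear words and remove them from the corpus
profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt")
corpus <- tm_map(corpus, removeWords, profanity)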
## Warning in readLines("http://www.bannedwordlist.com/lists/swearWords.txt"):
## incomplete final line found on 'http://www.bannedwordlist.com/lists/
## swearWords.txt'
Stemming
The process of stemming reduces words to the word stem or root, such as “growing” to “grow”.
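The stemming chunk is hidden too; a sketch using tm's stemDocument (which relies on the SnowballC package):
# reduce each word to its stem, e.g. "growing" -> "grow"
corpus <- tm_map(corpus, stemDocument)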
This is again the text sample line 12 after the cleaning process:
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 215
##
## dealt insur compani direct got whatev need kept get stall chequ final just lost told cant busi without money id make full time job show offic call everi tv station think cover stori didnt get chequ seem pretti alarm
Tokenisation
I am going to convert the corpus into a data.frame and then into a long (or tidy) format, with one row for each word from each of the text elements. I will use the unnest_tokens() function from the tidytext package, which splits a column into tokens (one token per row).
d <- data.frame(text=unlist(sapply(corpus, '[', "content")), stringsAsFactors=F)
unigram <- unnest_tokens(d, word, text, token = "words") # unnest_tokens(data.frame, output-column, input-column)
For n-grams the column is split into groups of n-words.
Example: dont worry be happy
- unigram: dont, worry, be, happy
- bigram: dont worry, worry be, be happy
- trigram: dont worry be, worry be happy
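The bigram and trigram tables used in the next section are not created in a visible chunk; a sketch of how they can be built with the same function (the column name ngram matches how the tables are used below):
# split the text into overlapping sequences of 2 and 3 words (one n-gram per row)
bigram <- unnest_tokens(d, ngram, text, token = "ngrams", n = 2)
trigram <- unnest_tokens(d, ngram, text, token = "ngrams", n = 3)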
Exploration of the word distribution and the relationships between words in the corpora
In this section I am going to:
- explore the distribution of word frequencies
- find out what the frequencies of 2-grams and 3-grams in the dataset are
- build figures and tables to understand the variation in the frequencies of words and word pairs in the data.
Word cloud
unigram_freq <- data.frame(table(unigram$word))
unigram_freq <- unigram_freq[order(unigram_freq$Freq, decreasing = TRUE),]
wordcloud(unigram_freq$Var1[1:30], unigram_freq$Freq[1:30]) # wordcloud(words,freq)
This word cloud shows the thirty most frequently used words in the corpus. Since this type of visualisation is not very clear, I am going to present the n-gram frequencies with tables and histograms.
Distribution of unigram, bigram and trigram frequencies
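The counts of unique n-grams reported in this section come from hidden chunks; presumably something like the following for the unigrams (and analogously for the bigrams and trigrams further down):
# number of distinct words in the subsample
nrow(unigram_freq)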
## [1] 41534
There are 41534 unique words in this subsample.
kable(unigram_freq[1:15,], col.names = c('word', 'frequency'), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)
word | frequency |
---|---|
will | 3298 |
get | 3143 |
like | 3123 |
one | 3104 |
just | 3069 |
said | 3050 |
go | 2801 |
time | 2564 |
can | 2519 |
im | 2448 |
day | 2263 |
year | 2189 |
make | 2117 |
love | 2018 |
know | 1827 |
Among the top 15 unigrams are the words “get”, “like”, “one”, “just” and “said”.
ggplot(unigram_freq[1:30,], aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity") +
labs(title = "30 most frequent unigrams",
subtitle = "SwiftKey company dataset, subsample, n = 41534 unigrams",
x = "unigrams",
y = "frequency") +
coord_flip() +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
bigram_freq <- data.frame(table(bigram$ngram))
bigram_freq <- bigram_freq[order(bigram_freq$Freq, decreasing = TRUE),]
## [1] 445891
We can find 445891 unique two-word combinations (bigrams) in this subsample.
kable(bigram_freq[1:15,], col.names = c('bigram', 'frequency'), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)
bigram | frequency |
---|---|
right now | 268 |
last year | 224 |
dont know | 216 |
cant wait | 194 |
look like | 194 |
feel like | 193 |
new york | 179 |
year ago | 171 |
look forward | 167 |
high school | 147 |
im go | 145 |
thank follow | 138 |
first time | 136 |
last night | 135 |
last week | 130 |
Among the top 15 bigrams are common word combinations such as “right now”, “last year”, “dont know” and “cant wait”.
ggplot(bigram_freq[1:30,], aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity") +
labs(title = "30 most frequent bigrams",
subtitle = "SwiftKey company dataset, subsample, n = 445891 bigrams",
x = "bigrams",
y = "frequency") +
coord_flip() +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
trigram_freq <- data.frame(table(trigram$ngram))
trigram_freq <- trigram_freq[order(trigram_freq$Freq, decreasing = TRUE),]
## [1] 562405
When combining three words there are 562405 unique combinations (trigrams).
kable(trigram_freq[1:15,], col.names = c('trigram', 'frequency'), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)
trigram | frequency |
---|---|
cant wait see | 38 |
happi mother day | 31 |
let us know | 28 |
new york citi | 26 |
happi new year | 22 |
presid barack obama | 22 |
im pretti sure | 19 |
look forward see | 19 |
new york time | 15 |
dont even know | 14 |
feel like im | 14 |
cent per share | 12 |
dont get wrong | 11 |
im look forward | 11 |
ive ever seen | 11 |
The most frequent trigrams are also very common combinations in everyday conversation, such as “let us know” and “cant wait see” (here the stopword “to” was removed in the data cleaning process).
ggplot(trigram_freq[1:30,], aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity") +
labs(title = "30 most frequent trigrams",
subtitle = "SwiftKey company dataset, subsample, n = 562405 trigrams",
x = "trigrams",
y = "frequency") +
coord_flip() +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
A deeper dive
For a better understanding of the data I am going to explore a further question:
- How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
unigram_freq_cum <- unigram_freq
unigram_freq_cum$cum_freq <- cumsum(unigram_freq$Freq)
unigram_freq_cum$cum_perc <- unigram_freq_cum$cum_freq / sum(unigram_freq_cum$Freq)
# head(unigram_freq_cum[unigram_freq_cum$cum_perc > 0.5,])
a <- nrow(unigram_freq_cum[unigram_freq_cum$cum_perc <= 0.5,]) # 583 words
b <- round((a / nrow(unigram_freq_cum))*100, 2) # 583 / 41534 = 0.01403669 --> 1.4 %
print(c(a, b))
## [1] 583.0 1.4
We need 583 unique words to cover 50% of all word instances in this English subsample. This accounts for only 1.4% of the 41534 unique words in the sample.
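The 90% part of the question can be answered with the same cumulative table; a sketch (the resulting number is not reported here):
# number of unique words needed to cover 90% of all word instances in the subsample
nrow(unigram_freq_cum[unigram_freq_cum$cum_perc <= 0.9, ])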
Next steps: Modeling, Shiny app
The next steps will include:
- Build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
- Build a model to handle cases where a particular n-gram isn't observed.
- Develop a Shiny web app that provides a simple user interface for entering text and displays the word prediction based on the prediction model.