library(downloader) # data download
library(tm)         # text mining
library(gridExtra)  # arrange multiple plots
library(ggplot2)    # plotting
library(wordcloud)  # word cloud visualisation
library(tidytext)   # tokenisation
library(dplyr)      # data manipulation
library(kableExtra) # table formatting
Introduction
For this Capstone Project, as part of the Data Science Specialisation by Johns Hopkins University, I was provided with the SwiftKey company dataset, which contains several text files from different sources (blogs, news, Twitter) in different languages: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The text originally comes from HC Corpora, a collection of free text corpora.
The aim of the capstone project is to develop a web application and predictive model that suggests a word given a limited text input.
In this report I am going to describe the dataset in more detail, perform an exploratory analysis, and explain the goals for building a predictive model for text and a predictive text mining application.
Exploratory Analysis
The intention of this analysis is to understand the basic relationships observed in the data and prepare to build a first linguistic model. For that we need to understand the distribution and relationship between the words, tokens, and phrases in the text.
Obtaining the data
First I am going to download the dataset and unzip it:
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
dest <- "dataset.zip"
outPath <- getwd()
if (!file.exists(dest)) {download(url, dest, mode = "wb")}
unzip(dest, exdir = outPath, overwrite = TRUE)
Next I will check which text files are in the folder that I downloaded:
# list text files in downloaded zip folder
textfiles <- list.files(outPath, pattern = "\\.txt",recursive = TRUE, include.dirs = TRUE)
textfiles
## [1] "final/de_DE/de_DE.blogs.txt" "final/de_DE/de_DE.news.txt"
## [3] "final/de_DE/de_DE.twitter.txt" "final/en_US/en_US.blogs.txt"
## [5] "final/en_US/en_US.news.txt" "final/en_US/en_US.twitter.txt"
## [7] "final/fi_FI/fi_FI.blogs.txt" "final/fi_FI/fi_FI.news.txt"
## [9] "final/fi_FI/fi_FI.twitter.txt" "final/ru_RU/ru_RU.blogs.txt"
## [11] "final/ru_RU/ru_RU.news.txt" "final/ru_RU/ru_RU.twitter.txt"
## [13] "output/textsample.txt"
There are 12 original text files - 3 text files each in 4 different languages (the additional output/textsample.txt is created later in this report).
I will start by importing the English text files only:
ENlist <- list.files(path = outPath, recursive = TRUE, pattern = ".*en_.*.txt")
blogsURL <- file(ENlist[1], open = "rb") # open for reading in binary mode
blogs <- readLines(blogsURL, encoding = "UTF-8", skipNul = TRUE)
close(blogsURL)
newsURL <- file(ENlist[2], open = "rb") # open for reading in binary mode
news <- readLines(newsURL, encoding = "UTF-8", skipNul = TRUE)
close(newsURL)
twitterURL <- file(ENlist[3], open = "rb") # open for reading in binary mode
twitter <- readLines(twitterURL, encoding = "UTF-8", skipNul = TRUE)
close(twitterURL)
Here are samples of lines from these files:
- en_US.blogs.txt
## [1] "I am en-route to Cornwall again. 3/4 months of slog and sun and sea too. Always a job needs doing in a tourist town. I’m bringing my stuff back from 97 on Friday. Will need to cancel bills before I retreat. Not such a bad thing. Moved to Leeds with promise from Millies. That went nowhere. Move on."
## [2] "Pure large leaf Assam. No waffling with the leaf, thank you. I want it strong and dark with no herby frills. And for goodness sake no fruit mixers and no sweetener. Why would you do that to tea?"
- en_US.news.txt
## [1] "\"The absurdity of attempting to bottle up news of such magnitude was too apparent,\" he would later write."
## [2] "GM labor relations vice president Diana Tremblay said the deal \"will enable GM to be fully competitive and has eliminated the gap with our competitors.\""
- en_US.twitter.txt
## [1] "Dammnnnnn what a catch"
## [2] "such a great picture! The green shirt totally brings out your eyes!"
Summary statistics
stats <- data.frame(file = sub("final/en_US/", "", ENlist))
stats$filesize <- round(sapply(ENlist, file.size)/(1024^2), 1) # in megabytes
stats$lines <- c(length(blogs), length(news), length(twitter))
stats$words <- c(sum(sapply(strsplit(blogs, " "), length)),
                 sum(sapply(strsplit(news, " "), length)),
                 sum(sapply(strsplit(twitter, " "), length))) # alternative: str_count(s, '\\w+'); library stringr
stats$longest_line <- c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
kableExtra::kable_styling(knitr::kable(stats),
                          bootstrap_options = "striped",
                          full_width = FALSE,
                          position = "center")
file | filesize (MB) | lines | words | longest_line (chars) |
---|---|---|---|---|
en_US.blogs.txt | 200.4 | 899288 | 37334131 | 40833 |
en_US.news.txt | 196.3 | 1010242 | 34372530 | 11384 |
en_US.twitter.txt | 159.4 | 2360148 | 30373583 | 140 |
- The English text files have a file size between 160 and 200 MB.
- The blogs file has the fewest lines (about 900,000), while the Twitter file has the most (about 2.36 million).
- The number of words exceeds 30 million per file.
- Since Twitter restricts tweet length, the longest line in the Twitter file is only 140 characters. Blogs and news are less restricted and contain long paragraphs: the longest blog entry has roughly 41,000 characters, the longest news entry roughly 11,000.
# plot filesize
plot_filesize <- ggplot(stats, aes(x = file, y = filesize, color = file, fill = file)) +
geom_bar(stat = "identity") +
theme(legend.position = "none",
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey")) +
coord_flip() +
scale_fill_brewer(palette = "Paired", direction = -1) +
scale_color_brewer(palette = "Paired", direction = -1) +
xlab("")
plot_lines <- ggplot(stats, aes(x = file, y = lines, color = file, fill = file)) +
geom_bar(stat = "identity") +
coord_flip()+
scale_fill_brewer(palette = "Paired", direction = -1) +
scale_color_brewer(palette = "Paired", direction = -1) +
scale_y_continuous(labels = scales::comma) +
theme(legend.position = "none",
axis.title.y=element_blank(),
axis.text.y=element_blank(),
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
plot_words <- ggplot(stats, aes(x = file, y = words, color = file, fill = file)) +
geom_bar(stat = "identity") +
coord_flip()+
scale_fill_brewer(palette = "Paired", direction = -1) +
scale_color_brewer(palette = "Paired", direction = -1)+
xlab("") +
scale_y_continuous(labels = scales::comma) +
theme(legend.position = "none",
axis.title.y=element_blank(),
axis.text.y=element_blank(),
panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
grid.arrange(plot_filesize, plot_lines, plot_words, ncol = 3)
Data preparation
This process includes the following steps:
- Random sampling - To produce a representative sample of the population, the three datasets are combined into a corpus (a collection of documents) and 1% of the data is randomly extracted for further analysis.
- Cleaning - This includes, for example, converting text to lower case, removing punctuation, profanity filtering, etc.
- Tokenisation - The aim is to identify appropriate tokens such as words, punctuation, and numbers by writing a function that takes a file as input and returns a tokenized version of it.
Random sampling
To build models it is not necessary to use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data.
set.seed(190820)
textsample <- c(
  blogs[sample(1:length(blogs), length(blogs)/100, replace = FALSE)],
  news[sample(1:length(news), length(news)/100, replace = FALSE)],
  twitter[sample(1:length(twitter), length(twitter)/100, replace = FALSE)])
if (!dir.exists("output")) dir.create("output") # make sure the output folder exists
write(textsample, "output/textsample.txt")
length(textsample)
## [1] 42695
The random subsample consists of 42695 lines.
Cleaning
I am going to use the tm package to perform the data cleaning tasks.
Convert character vector between encodings
This conversion transforms the text to ASCII: diacritic/accent characters (âêîôûŷŵ äëïöüÿ àèìòù áéíóúý ãñõ ø) and all other characters that are not a letter, number or common symbol are removed.
textsample2 <- iconv(textsample, from = "", to = "ASCII", sub="")
# sub replaces any non-convertible bytes in the input.
# load data as a corpus
corpus <- VCorpus(VectorSource(textsample2))
corpus
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 42695
Text sample (line 12):
## I then dealt with the insurance company directly, got them whatever they needed and kept getting stalled about the cheque. Finally I just lost it, told them I cant do business without that money, that Id make it my full time job to show up to their offices and call every TV station I could think of to cover this story if I didnt get my cheque! They seemed pretty alarmed.
Convert text to lower case
This facilitates further cleaning and analysis steps.
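The code for this step is not shown in the report; a minimal sketch of how it can be done with the tm package (assuming the corpus object created above):
# convert the whole corpus to lower case; tolower is wrapped so the corpus structure is preserved
corpus <- tm_map(corpus, content_transformer(tolower))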
Remove special characters, numbers, punctuation, URLs…
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+") # URLs
corpus <- tm_map(corpus, toSpace, "@[^\\s]+")                     # Twitter handles
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
remURL <- function(x) gsub("http[[:alnum:]]*", "", x)             # remaining URL fragments
corpus <- tm_map(corpus, content_transformer(remURL))
Remove white space
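The corresponding chunk is hidden as well; a sketch using tm's built-in transformation:
# collapse runs of whitespace into a single space
corpus <- tm_map(corpus, stripWhitespace)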
Remove stop words
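Only the result of this step is used later; a sketch with tm's standard English stop word list:
# remove very common English words ("the", "and", "to", ...) that carry little predictive information
corpus <- tm_map(corpus, removeWords, stopwords("english"))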
Profanity filtering
Removes words that we do not want to predict, such as swear words.
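The chunk itself is not echoed; a sketch of the approach, based on the word list whose download produces the warning below:
# download a public list of swear words and remove them from the corpus
profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt")
corpus <- tm_map(corpus, removeWords, profanity)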
## Warning in readLines("http://www.bannedwordlist.com/lists/swearWords.txt"):
## incomplete final line found on 'http://www.bannedwordlist.com/lists/
## swearWords.txt'
Stemming
The process of stemming reduces words to the word stem or root, such as “growing” to “grow”.
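The stemming chunk is hidden too; a sketch using tm's stemDocument (which relies on the SnowballC package):
# reduce each word to its stem, e.g. "growing" -> "grow"
corpus <- tm_map(corpus, stemDocument)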
This is again the text sample line 12 after the cleaning process:
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 215
##
## dealt insur compani direct got whatev need kept get stall chequ final just lost told cant busi without money id make full time job show offic call everi tv station think cover stori didnt get chequ seem pretti alarm
Tokenisation
I am going to convert the corpus into a data.frame and then into a long (or tidy) format, with one row for each word from each of the text elements. I will use the unnest_tokens() function from the tidytext package, which splits a column into tokens (one token per row).
d <- data.frame(text=unlist(sapply(corpus, '[', "content")), stringsAsFactors=F)
unigram <- unnest_tokens(d, word, text, token = "words") # unnest_tokens(data.frame, output-column, input-column)
For n-grams the column is split into groups of n-words.
Example: dont worry be happy
- unigram: dont, worry, be, happy
- bigram: dont worry, worry be, be happy
- trigram: dont worry be, worry be happy
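The bigram and trigram tables used in the next section are not created in a visible chunk; a sketch of how they can be built with the same function (the column name ngram matches how the tables are used below):
# split the text into overlapping sequences of 2 and 3 words (one n-gram per row)
bigram <- unnest_tokens(d, ngram, text, token = "ngrams", n = 2)
trigram <- unnest_tokens(d, ngram, text, token = "ngrams", n = 3)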
Exploration of the word distribution and the relationships between words in the corpora
In this section I am going to:
- explore the distribution of word frequencies
- find out what the frequencies of 2-grams and 3-grams in the dataset are
- build figures and tables to understand the variation in the frequencies of words and word pairs in the data.
Word cloud
unigram_freq <- data.frame(table(unigram$word))
unigram_freq <- unigram_freq[order(unigram_freq$Freq, decreasing = TRUE),]
wordcloud(unigram_freq$Var1[1:30], unigram_freq$Freq[1:30]) # wordcloud(words,freq)
This word cloud shows the thirty most frequently used words in the corpus. Since this type of visualisation is not very clear, I am going to present the n-gram frequencies with tables and histograms.
Distribution of unigram, bigram and trigram frequencies
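The counts of unique n-grams reported in this section come from hidden chunks; presumably something like the following for the unigrams (and analogously for the bigrams and trigrams further down):
# number of distinct words in the subsample
nrow(unigram_freq)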
## [1] 41534
There are 41534 unique words in this subsample.
kable(unigram_freq[1:15,], col.names = c('word', 'frequency'), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)
word | frequency |
---|---|
will | 3298 |
get | 3143 |
like | 3123 |
one | 3104 |
just | 3069 |
said | 3050 |
go | 2801 |
time | 2564 |
can | 2519 |
im | 2448 |
day | 2263 |
year | 2189 |
make | 2117 |
love | 2018 |
know | 1827 |
Among the top 15 unigrams are the words “get”, “like”, “one”, “just” and “said”.
ggplot(unigram_freq[1:30,], aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity") +
labs(title = "30 most frequent unigrams",
subtitle = "SwiftKey company dataset, subsample, n = 41534 unigrams",
x = "unigrams",
y = "frequency") +
coord_flip() +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
bigram_freq <- data.frame(table(bigram$ngram))
bigram_freq <- bigram_freq[order(bigram_freq$Freq, decreasing = TRUE),]
## [1] 445891
We can find 445891 unique two-word combinations (bigrams) in this subsample.
kable(bigram_freq[1:15,], col.names = c('bigram', 'frequency'), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)
bigram | frequency |
---|---|
right now | 268 |
last year | 224 |
dont know | 216 |
cant wait | 194 |
look like | 194 |
feel like | 193 |
new york | 179 |
year ago | 171 |
look forward | 167 |
high school | 147 |
im go | 145 |
thank follow | 138 |
first time | 136 |
last night | 135 |
last week | 130 |
Among the top 15 bigrams are common word combinations such as “right now”, “last year”, “dont know” and “cant wait”.
ggplot(bigram_freq[1:30,], aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity") +
labs(title = "30 most frequent bigrams",
subtitle = "SwiftKey company dataset, subsample, n = 445891 bigrams",
x = "bigrams",
y = "frequency") +
coord_flip() +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
trigram_freq <- data.frame(table(trigram$ngram))
trigram_freq <- trigram_freq[order(trigram_freq$Freq, decreasing = TRUE),]
## [1] 562405
When combining three words there are 562405 unique combinations (trigrams).
kable(trigram_freq[1:15,], col.names = c('trigram', 'frequency'), row.names = FALSE) %>%
kable_styling(bootstrap_options = c("striped"), full_width = FALSE)
trigram | frequency |
---|---|
cant wait see | 38 |
happi mother day | 31 |
let us know | 28 |
new york citi | 26 |
happi new year | 22 |
presid barack obama | 22 |
im pretti sure | 19 |
look forward see | 19 |
new york time | 15 |
dont even know | 14 |
feel like im | 14 |
cent per share | 12 |
dont get wrong | 11 |
im look forward | 11 |
ive ever seen | 11 |
The most frequent trigrams are also very common combinations in everyday conversation, such as “let us know” and “cant wait see” (here the stopword “to” was removed in the data cleaning process).
ggplot(trigram_freq[1:30,], aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity") +
labs(title = "30 most frequent trigrams",
subtitle = "SwiftKey company dataset, subsample, n = 562405 trigrams",
x = "trigrams",
y = "frequency") +
coord_flip() +
theme(panel.background = element_rect(fill = "white"),
panel.grid = element_line(colour = "grey"))
A deeper dive
For a better understanding of the data I am going to explore a further question:
- How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
unigram_freq_cum <- unigram_freq
unigram_freq_cum$cum_freq <- cumsum(unigram_freq$Freq)
unigram_freq_cum$cum_perc <- unigram_freq_cum$cum_freq / sum(unigram_freq_cum$Freq)
# head(unigram_freq_cum[unigram_freq_cum$cum_perc > 0.5,])
a <- nrow(unigram_freq_cum[unigram_freq_cum$cum_perc <= 0.5,]) # 583 words
b <- round((a / nrow(unigram_freq_cum))*100, 2) # 583 / 41534 = 0.01403669 --> 1.4 %
print(c(a, b))
## [1] 583.0 1.4
We need 583 unique words to cover 50% of all word instances in this English subsample. This accounts for only 1.4% of the 41534 unique words in the sample.
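The 90% part of the question can be answered with the same cumulative table; a sketch (the resulting number is not reported here):
# number of unique words needed to cover 90% of all word instances in the subsample
nrow(unigram_freq_cum[unigram_freq_cum$cum_perc <= 0.9, ])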
Next steps: Modeling, Shiny app
The next steps will include:
- Build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
- Build a model to handle cases where a particular n-gram isn't observed.
- Develop a Shiny web app that provides a simple user interface for entering text and displays the word prediction based on the prediction model.