Introduction

For this Capstone Project, as part of the Data Science Specialisation by Johns Hopkins University, I was provided with the following SwiftKey company dataset, which contains several text files from different sources (blogs, news, twitter) in different languages: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip The text originally comes from HC Corpora, a collection of free text corpora.

The aim of the capstone project is to develop a web application and predictive model that suggests a word given a limited text input.

In this report I am going to describe the dataset in more detail, perform an exploratory analysis, and explain the goals for building a predictive model for text and a predictive text mining application.

Exploratory Analysis

The intention of this analysis is to understand the basic relationships observed in the data and prepare to build a first linguistic model. For that we need to understand the distribution and relationship between the words, tokens, and phrases in the text.

Obtaining the data

First I am going to download the dataset and unzip it:
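The download code is not echoed in this report; a minimal sketch of this step (assuming the archive is saved into the working directory) could look like this:

```r
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

# download the zip archive once and extract it into the working directory
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
unzip("Coursera-SwiftKey.zip")
```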

Next I will check which text files are in the folder that I downloaded:
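A sketch of this check, assuming the archive was extracted into the working directory:

```r
# list all text files below the working directory
list.files(pattern = "\\.txt$", recursive = TRUE)
```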

##  [1] "final/de_DE/de_DE.blogs.txt"   "final/de_DE/de_DE.news.txt"   
##  [3] "final/de_DE/de_DE.twitter.txt" "final/en_US/en_US.blogs.txt"  
##  [5] "final/en_US/en_US.news.txt"    "final/en_US/en_US.twitter.txt"
##  [7] "final/fi_FI/fi_FI.blogs.txt"   "final/fi_FI/fi_FI.news.txt"   
##  [9] "final/fi_FI/fi_FI.twitter.txt" "final/ru_RU/ru_RU.blogs.txt"  
## [11] "final/ru_RU/ru_RU.news.txt"    "final/ru_RU/ru_RU.twitter.txt"
## [13] "output/textsample.txt"

The dataset contains 12 text files - 3 text files for each of 4 languages (German, English, Finnish, Russian). The additional file output/textsample.txt is not part of the original dataset but an output of this analysis.

I will start by importing the English text files only:
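A sketch of the import (the object names blogs, news and twitter are mine), assuming UTF-8 encoding and skipping embedded NUL characters:

```r
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```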

Here are samples of lines from these files:

  • en_US.blogs.txt
## [1] "I am en-route to Cornwall again. 3/4 months of slog and sun and sea too. Always a job needs doing in a tourist town. I’m bringing my stuff back from 97 on Friday. Will need to cancel bills before I retreat. Not such a bad thing. Moved to Leeds with promise from Millies. That went nowhere. Move on."
## [2] "Pure large leaf Assam. No waffling with the leaf, thank you. I want it strong and dark with no herby frills. And for goodness sake no fruit mixers and no sweetener. Why would you do that to tea?"
  • en_US.news.txt
## [1] "\"The absurdity of attempting to bottle up news of such magnitude was too apparent,\" he would later write."                                               
## [2] "GM labor relations vice president Diana Tremblay said the deal \"will enable GM to be fully competitive and has eliminated the gap with our competitors.\""
  • en_US.twitter.txt
## [1] "Dammnnnnn what a catch"                                             
## [2] "such a great picture! The green shirt totally brings out your eyes!"

Summary statistics

file                filesize (MB)   lines     words      longest line (chars)
en_US.blogs.txt     200.4           899288    37334131   40833
en_US.news.txt      196.3           1010242   34372530   11384
en_US.twitter.txt   159.4           2360148   30373583   140
  • The English text documents have a file size between 160 and 200 MB.
  • The blogs file has the fewest lines (just under 900,000), while the twitter file has the most (more than 2.3 million).
  • The number of words exceeds 30 million per file.
  • Since Twitter restricts how long tweets can be, the longest line in the twitter file is only 140 characters. Blogs and news are less restricted: the longest blog entry has roughly 41,000 characters, the longest news text about 11,000. The news file contains long paragraphs, while blogs are a sequence of sentences.
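The statistics above are not computed in the visible part of the report; a sketch of how they can be obtained with base R, using the objects blogs, news and twitter from the import step, is:

```r
file_stats <- function(path, lines) {
  data.frame(
    file         = basename(path),
    filesize_mb  = round(file.size(path) / 1024^2, 1),
    lines        = length(lines),
    words        = sum(lengths(strsplit(lines, "\\s+"))),
    longest_line = max(nchar(lines))
  )
}

rbind(
  file_stats("final/en_US/en_US.blogs.txt",   blogs),
  file_stats("final/en_US/en_US.news.txt",    news),
  file_stats("final/en_US/en_US.twitter.txt", twitter)
)
```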

Data preparation

This process includes the following steps:

  1. Random sampling - In order to produce a representative sample from the population, the three datasets will be combined into a corpus (collection of documents) and 1% of the data will be randomly extracted for further analysis.

  2. Cleaning - This includes, for example, converting the text to lower case, removing punctuation, profanity filtering, etc.

  3. Tokenization - The aim is to identify appropriate tokens such as words, punctuation, and numbers by writing a function that takes a file as input and returns a tokenized version of it.

Random sampling

To build models it is not necessary to use all of the data. Often relatively few randomly selected rows or chunks need to be included to get an accurate approximation to results that would be obtained using all the data.
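A sketch of the 1% sample, assuming the three English files were read into blogs, news and twitter and the sample is written to output/textsample.txt (the seed is a hypothetical choice for reproducibility):

```r
set.seed(1234)  # hypothetical seed

combined   <- c(blogs, news, twitter)
textsample <- sample(combined, size = round(length(combined) * 0.01))

dir.create("output", showWarnings = FALSE)
writeLines(textsample, "output/textsample.txt")

length(textsample)
```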

## [1] 42695

The random subsample consists of 42695 lines.

Cleaning

I am going to use the tm library to perform the data cleaning tasks.

Convert character vector between encodings

This conversion, which transliterates the text to Latin-ASCII, replaces diacritic/accent characters (âêîôûŷŵ äëïöüÿ àèìòù áéíóúý ãñõ ø) and removes all characters that are not letters, numbers or common symbols.
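A sketch of this step, using stringi for the Latin-ASCII transliteration and tm to build the corpus (the exact calls used in the report may differ):

```r
library(stringi)
library(tm)

# transliterate accented characters to their ASCII equivalents,
# then drop any remaining non-ASCII characters
textsample <- stri_trans_general(textsample, "Latin-ASCII")
textsample <- iconv(textsample, from = "UTF-8", to = "ASCII", sub = "")

# build a volatile corpus with one document per sampled line
corpus <- VCorpus(VectorSource(textsample))
corpus
```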

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 42695

Text sample (line 12):

## I then dealt with the insurance company directly, got them whatever they needed and kept getting stalled about the cheque. Finally I just lost it, told them I cant do business without that money, that Id make it my full time job to show up to their offices and call every TV station I could think of to cover this story if I didnt get my cheque! They seemed pretty alarmed.

Convert text to lower case

This facilitates further cleaning and analysis steps.

Remove special characters, numbers, punctuations, URLs…

Remove white space

Remove stop words

Profanity filtering

Removes words that we do not want to predict, such as swear words.
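A sketch of the cleaning steps listed above, based on standard tm transformations (the URL pattern is a hypothetical choice; the profanity list is the one whose download produces the warning below):

```r
library(tm)

corpus <- tm_map(corpus, content_transformer(tolower))          # lower case
corpus <- tm_map(corpus, content_transformer(function(x)
  gsub("http\\S+|www\\S+", " ", x)))                            # remove URLs
corpus <- tm_map(corpus, removePunctuation)                     # punctuation
corpus <- tm_map(corpus, removeNumbers)                         # numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))     # stop words

# profanity filtering with a public swear word list
profanity <- readLines("http://www.bannedwordlist.com/lists/swearWords.txt")
corpus    <- tm_map(corpus, removeWords, profanity)

corpus <- tm_map(corpus, stripWhitespace)                       # collapse white space
```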

## Warning in readLines("http://www.bannedwordlist.com/lists/swearWords.txt"):
## incomplete final line found on 'http://www.bannedwordlist.com/lists/
## swearWords.txt'

Stemming

The process of stemming reduces words to the word stem or root, such as “growing” to “grow”.
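A sketch of the stemming step with tm (the Porter stemmer from the SnowballC package), followed by a look at the same sample line as above:

```r
library(SnowballC)  # supplies the stemmer used by tm::stemDocument

corpus <- tm_map(corpus, stemDocument)  # e.g. "growing" -> "grow"

inspect(corpus[[12]])  # sample line 12 after all cleaning steps
```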

This is again the text sample line 12 after the cleaning process:

## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 215
## 
## dealt insur compani direct got whatev need kept get stall chequ final just lost told cant busi without money id make full time job show offic call everi tv station think cover stori didnt get chequ seem pretti alarm

Tokenisation

I am going to convert the corpus into a data.frame and then into a long (or tidy) format, with one row for each word from each of the text elements. I will use the unnest_tokens() function from the tidytext package, which splits a column into tokens (one-token-per-row).

For n-grams the column is split into groups of n-words.

Example: dont worry be happy

  • unigram: dont, worry, be, happy
  • bigram: dont worry, worry be, be happy
  • trigram: dont worry be, worry be happy
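A minimal sketch of this tokenisation with tidytext, assuming the cleaned corpus is first converted into a data frame textdf with one text column:

```r
library(dplyr)
library(tidytext)

# convert the cleaned corpus into a data frame with one row per document
textdf <- data.frame(text = sapply(corpus, as.character),
                     stringsAsFactors = FALSE)

unigrams <- textdf %>% unnest_tokens(word, text)                              # one word per row
bigrams  <- textdf %>% unnest_tokens(bigram,  text, token = "ngrams", n = 2)  # two-word groups
trigrams <- textdf %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)  # three-word groups
```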

Exploration of words distribution and relationship between the words in the corpora

In this section I am going to explore:

  • the distribution of word frequencies
  • n-grams: the frequencies of 2-grams and 3-grams in the dataset
  • figures and tables that show the variation in the frequencies of words and word pairs in the data.

Word cloud

This word cloud shows the thirty most frequently used words in the corpus. Since this type of visualisation is not very clear, I am going to present the n-gram frequencies with tables and histograms.
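A sketch of the word cloud call with the wordcloud package, assuming word_freq is the unigram frequency table built in the next section:

```r
library(wordcloud)
library(RColorBrewer)

wordcloud(words = word_freq$word, freq = word_freq$frequency,
          max.words = 30, colors = brewer.pal(8, "Dark2"))
```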

Distribution of unigram, bigram and trigram frequencies
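A sketch of how the frequency tables below can be derived from the tokenised data frames of the previous section:

```r
library(dplyr)

word_freq    <- unigrams %>% count(word,    sort = TRUE, name = "frequency")
bigram_freq  <- bigrams  %>% count(bigram,  sort = TRUE, name = "frequency")
trigram_freq <- trigrams %>% count(trigram, sort = TRUE, name = "frequency")

nrow(word_freq)       # number of unique words
head(word_freq, 15)   # top 15 unigrams
```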

## [1] 41534

There are 41534 unique words in this subsample.

word      frequency
will           3298
get            3143
like           3123
one            3104
just           3069
said           3050
go             2801
time           2564
can            2519
im             2448
day            2263
year           2189
make           2117
love           2018
know           1827

Among the top 15 unigrams are the words “get”, “like”, “one”, “just” and “said”.

## [1] 445891

We can find 445891 unique word combinations in this subsample.

bigram          frequency
right now             268
last year             224
dont know             216
cant wait             194
look like             194
feel like             193
new york              179
year ago              171
look forward          167
high school           147
im go                 145
thank follow          138
first time            136
last night            135
last week             130

Among the top 15 bigrams are common word combinations such as “right now”, “last year”, “dont know” and “cant wait”.

## [1] 562405

When combining three words, there are 562405 unique combinations in this subsample.

trigram               frequency
cant wait see                38
happi mother day             31
let us know                  28
new york citi                26
happi new year               22
presid barack obama          22
im pretti sure               19
look forward see             19
new york time                15
dont even know               14
feel like im                 14
cent per share               12
dont get wrong               11
im look forward              11
ive ever seen                11

The most frequent trigrams are also very common combinations in everyday conversation, such as “let us know” and “cant wait see” (here the stopword “to” was removed in the data cleaning process).

A deeper dive

For a better understanding of the data I am going to explore a further question:

  • How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?
## [1] 583.0   1.4

We need 583 unique words to cover 50% of all word instances in this English corpus; this accounts for 1.4% of all unique words (n = 41534).
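A sketch of this coverage calculation, based on the frequency-sorted unigram table word_freq:

```r
coverage <- cumsum(word_freq$frequency) / sum(word_freq$frequency)

n50 <- which(coverage >= 0.5)[1]                 # unique words needed for 50% coverage
c(n50, round(100 * n50 / nrow(word_freq), 1))    # count and share of all unique words
```

The same calculation with a 0.9 threshold would answer the 90% part of the question.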

Next steps: Modeling, Shiny app

The next steps will include:

  • Build a basic n-gram model for predicting the next word based on the previous 1, 2, or 3 words.
  • Build a model to handle cases where a particular n-gram isn’t observed.
  • Develop a Shiny web app that offers a simple user interface for entering text and displays the word prediction from the prediction model.