---
title: "Text mining with sparklyr"
output: html_notebook
---
## 12.1 - Data Import
1. Open a Spark session
```{r}
library(sparklyr)
library(dplyr)
# Configure a local Spark session: 4 cores, 8 GB of driver memory,
# and a larger share of the JVM heap for execution and storage
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
```
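If the connection succeeds, a quick sanity check (an addition, not part of the original exercise) is to ask Spark for its version:
```{r}
# Should print 2.0.0, matching the version requested above
spark_version(sc)
```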
2. The `spark_read_text()` function works like `readLines()`, but for Spark. Use it to read the *mark_twain.txt* file into Spark.
```{r}
twain_path <- "file:///usr/share/class/bonus/mark_twain.txt"
twain <- spark_read_text(sc, "twain", twain_path)
```
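To confirm the import, you can peek at the result (an added check, not part of the original exercise). `spark_read_text()` returns a table with a single *line* column:
```{r}
# Each row is one line of the source text file
twain %>%
  head(5)
```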
3. Read the *arthur_doyle.txt* file into Spark
```{r}
# Assumes arthur_doyle.txt sits in the same folder as mark_twain.txt
doyle_path <- "file:///usr/share/class/bonus/arthur_doyle.txt"
doyle <- spark_read_text(sc, "doyle", doyle_path)
```
## 12.2 - Tidying data
1. Add an identification column to each data set.
```{r}
twain_id <- twain %>%
mutate(author = "twain")
doyle_id <- doyle %>%
  mutate(author = "doyle")
```
2. Use `sdf_bind_rows()` to append the two files together
```{r}
both <- sdf_bind_rows(twain_id, doyle_id)
```
3. Filter out empty lines using `nchar()`
```{r}
# Keep only rows where the line contains at least one character
all_lines <- both %>%
  filter(nchar(line) > 0)
```
4. Use Hive's *regexp_replace* to remove punctuation. Use *"[_\"\'():;,.!?\\-]"* as the characters to be removed, and save the result into the *line* column.
```{r}
# Replace each punctuation character with a space (the replacement
# character is not specified in the exercise; a space is assumed)
all_lines <- all_lines %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
```
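A quick spot-check (an addition, not part of the original exercise) confirms the punctuation is gone:
```{r}
# The lines should now contain only words and spaces
all_lines %>%
  select(line) %>%
  head(5)
```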
## 12.3 - Transform the data
1. Use `ft_tokenizer()` to separate each line into words. It places the result in a list column.
```{r}
word_list <- all_lines %>%
  ft_tokenizer(input_col = "line", output_col = "word_list")
```
2. Remove "stop words" with the `ft_stop_words_remover()` transformer. The list is of stop words Spark uses is available here: https://github.com/apache/spark/blob/master/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
```{r}
wo_stop <- word_list %>%
  ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
```
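To see what the two transformers produced, you can compare the raw token list with the stop-word-free list side by side (an added check, not part of the original exercise):
```{r}
# word_list holds every token; wo_stop_words drops the stop words
wo_stop %>%
  select(word_list, wo_stop_words) %>%
  head(3)
```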
3. Un-nest the tokens inside *wo_stop_words* using `explode()`. This will create a row per word.
```{r}
exploded <- wo_stop %>%
  mutate(word = explode(wo_stop_words))
```
4. Select the *word* and *author* columns, and remove any word with fewer than three characters.
```{r}
all_words <- exploded %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
```
5. Cache the *all_words* variable using `compute()`
```{r}
all_words <- all_words %>%
  compute("all_words")
```
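As an added check (not in the original exercise), the cached table should now appear among the tables registered with the connection:
```{r}
# "all_words" should show up in the list of registered tables
src_tbls(sc)
```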
## 12.4 - Data Exploration
1. Words used the most by author
```{r}
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n))
```
2. Words most used by Twain
```{r}
twain_most <- word_count %>%
  filter(author == "twain")
```
3. Use `wordcloud` to visualize the top 50 words used by Twain
```{r}
twain_most %>%
head(50) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
```
4. Words most used by Doyle
```{r}
doyle_most <- word_count %>%
  filter(author == "doyle")
```
5. Use `wordcloud` to visualize the top 50 words used by Doyle that have more than 5 characters
```{r}
doyle_most %>%
  filter(nchar(word) > 5) %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
```
6. Use `anti_join()` to figure out which words are used by Doyle but not Twain. Order the results by the word count.
```{r}
doyle_unique <- doyle_most %>%
  anti_join(twain_most, by = "word") %>%
  arrange(desc(n))
```
7. Use `wordcloud` to visualize top 50 records in the previous step
```{r}
doyle_unique %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
```
8. Find out how many times Twain used the word "sherlock"
```{r}
all_words %>%
  filter(author == "twain", word == "sherlock") %>%
  tally()
```
9. Against the `twain` table, use Hive's *lower* to make every line lowercase, and then use *instr* to look for "sherlock" in the line
```{r}
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)
```
Most of these lines come from a short story by Mark Twain called [A Double Barrelled Detective Story](https://www.gutenberg.org/files/3180/3180-h/3180-h.htm#link2H_4_0008). According to the [Wikipedia](https://en.wikipedia.org/wiki/A_Double_Barrelled_Detective_Story) page, the story is Twain's satire of the mystery novel genre, published in 1902.
```{r}
spark_disconnect(sc)
```