---
title: "Text mining with sparklyr"
output: html_notebook
---
## 12.1 - Data Import
1. Open a Spark session
```{r}
library(sparklyr)
library(dplyr)
# Configure a local Spark session: 4 cores, 8 GB of driver memory,
# and a larger share of the JVM heap for execution and storage
conf <- spark_config()
conf$`sparklyr.cores.local` <- 4
conf$`sparklyr.shell.driver-memory` <- "8G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local", config = conf, version = "2.0.0")
```
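If the connection succeeds, a quick sanity check (an addition, not part of the original exercise) is to ask Spark for its version:
```{r}
# Should print 2.0.0, matching the version requested above
spark_version(sc)
```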
2. The `spark_read_text()` function works like `readLines()`, but for Spark. Use it to read the *mark_twain.txt* file into Spark.
```{r}
twain_path <- "file:///usr/share/class/bonus/mark_twain.txt"
twain <- spark_read_text(sc, "twain", twain_path)
```
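To confirm the import, you can peek at the result (an added check, not part of the original exercise). `spark_read_text()` returns a table with a single *line* column:
```{r}
# Each row is one line of the source text file
twain %>%
  head(5)
```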
3. Read the *arthur_doyle.txt* file into Spark
```{r}
# Assumes arthur_doyle.txt sits in the same folder as mark_twain.txt
doyle_path <- "file:///usr/share/class/bonus/arthur_doyle.txt"
doyle <- spark_read_text(sc, "doyle", doyle_path)
```
## 12.2 - Tidying data
1. Add an identification column to each data set.
```{r}
twain_id <- twain %>%
mutate(author = "twain")
doyle_id <- doyle %>%
  mutate(author = "doyle")
```
2. Use `sdf_bind_rows()` to append the two files together
```{r}
both <- sdf_bind_rows(twain_id, doyle_id)
```
3. Filter out empty lines using `nchar()`
```{r}
# Keep only rows where the line contains at least one character
all_lines <- both %>%
  filter(nchar(line) > 0)
```
4. Use Hive's *regexp_replace* to remove punctuation. Use *"[_\"\'():;,.!?\\-]"* as the characters to be removed, and save the result into the *line* column.
```{r}
# Replace each punctuation character with a space (the replacement
# character is not specified in the exercise; a space is assumed)
all_lines <- all_lines %>%
  mutate(line = regexp_replace(line, "[_\"\'():;,.!?\\-]", " "))
```
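A quick spot-check (an addition, not part of the original exercise) confirms the punctuation is gone:
```{r}
# The lines should now contain only words and spaces
all_lines %>%
  select(line) %>%
  head(5)
```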
## 12.3 - Transform the data
1. Use `ft_tokenizer()` to separate each line into words. It places the result in a list column.
```{r}
word_list <- all_lines %>%
  ft_tokenizer(input_col = "line", output_col = "word_list")
```
2. Remove "stop words" with the `ft_stop_words_remover()` transformer. The list is of stop words Spark uses is available here: https://github.com/apache/spark/blob/master/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
```{r}
wo_stop <- word_list %>%
  ft_stop_words_remover(input_col = "word_list", output_col = "wo_stop_words")
```
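To see what the two transformers produced, you can compare the raw token list with the stop-word-free list side by side (an added check, not part of the original exercise):
```{r}
# word_list holds every token; wo_stop_words drops the stop words
wo_stop %>%
  select(word_list, wo_stop_words) %>%
  head(3)
```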
3. Un-nest the tokens inside *wo_stop_words* using `explode()`. This will create a row per word.
```{r}
exploded <- wo_stop %>%
  mutate(word = explode(wo_stop_words))
```
4. Select the *word* and *author* columns, and remove any word with fewer than three characters.
```{r}
all_words <- exploded %>%
  select(word, author) %>%
  filter(nchar(word) > 2)
```
5. Cache the *all_words* variable using `compute()`
```{r}
all_words <- all_words %>%
  compute("all_words")
```
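As an added check (not in the original exercise), the cached table should now appear among the tables registered with the connection:
```{r}
# "all_words" should show up in the list of registered tables
src_tbls(sc)
```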
## 12.4 - Data Exploration
1. Words used the most by author
```{r}
word_count <- all_words %>%
  group_by(author, word) %>%
  tally() %>%
  arrange(desc(n))
```
2. Words most used by Twain
```{r}
twain_most <- word_count %>%
  filter(author == "twain")
```
3. Use `wordcloud` to visualize the top 50 words used by Twain
```{r}
twain_most %>%
head(50) %>%
collect() %>%
with(wordcloud::wordcloud(
word,
n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
```
4. Words most used by Doyle
```{r}
doyle_most <- word_count %>%
  filter(author == "doyle")
```
5. Use `wordcloud` to visualize the top 50 words used by Doyle that have more than 5 characters
```{r}
doyle_most %>%
  filter(nchar(word) > 5) %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
```
6. Use `anti_join()` to figure out which words are used by Doyle but not Twain. Order the results by the word count.
```{r}
doyle_unique <- doyle_most %>%
  anti_join(twain_most, by = "word") %>%
  arrange(desc(n))
```
7. Use `wordcloud` to visualize top 50 records in the previous step
```{r}
doyle_unique %>%
  head(50) %>%
  collect() %>%
  with(wordcloud::wordcloud(
    word,
    n,
    colors = c("#999999", "#E69F00", "#56B4E9", "#56B4E9")))
```
8. Find out how many times Twain used the word "sherlock"
```{r}
all_words %>%
  filter(author == "twain", word == "sherlock") %>%
  tally()
```
9. Against the `twain` table, use Hive's *lower* to make every line lowercase, and then use *instr* to look for "sherlock" in the line
```{r}
twain %>%
  mutate(line = lower(line)) %>%
  filter(instr(line, "sherlock") > 0) %>%
  pull(line)
```
Most of these lines come from a short story by Mark Twain called [A Double Barrelled Detective Story](https://www.gutenberg.org/files/3180/3180-h/3180-h.htm#link2H_4_0008). According to the [Wikipedia](https://en.wikipedia.org/wiki/A_Double_Barrelled_Detective_Story) page, the story is Twain's satire of the mystery novel genre, published in 1902.
```{r}
spark_disconnect(sc)
```