---
title: "Intro to sparklyr"
output: html_notebook
---
## 7.1 - New Spark session
1. Use `spark_connect()` to create a new local Spark session
```{r}
library(tidyverse)
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.0.0")
```
2. Click on the `SparkUI` button to view the current Spark session's UI
3. Click on the `Log` button to see the message history
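Both buttons have programmatic equivalents: `spark_web()` opens the Spark UI and `spark_log()` retrieves recent log entries.
```{r}
# Open the Spark UI and inspect the log from code rather than the buttons
spark_web(sc)
spark_log(sc, n = 20)
```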
## 7.2 - Data transfer
1. Copy the `mtcars` dataset into the session
```{r}
spark_mtcars <- sdf_copy_to(sc, mtcars, "my_mtcars")
```
2. In the **Connections** pane, expand the `my_mtcars` table
3. Go to the Spark UI, note the new jobs
4. In the UI, click the Storage button, note the new table
5. Click on the **In-memory table my_mtcars** link
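The copy can also be confirmed from code: `src_tbls()` lists the tables registered with the connection and should now include `my_mtcars`.
```{r}
# List the tables currently registered in the Spark session
src_tbls(sc)
```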
## 7.3 - Simple dplyr example
1. Run the following code snippet
```{r}
spark_mtcars %>%
  group_by(am) %>%
  summarise(avg_wt = mean(wt, na.rm = TRUE))
```
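Before visiting the UI, you can preview the SQL that dplyr generates by piping the same query into `show_query()`.
```{r}
# Preview the SQL translation of the dplyr pipeline
spark_mtcars %>%
  group_by(am) %>%
  summarise(avg_wt = mean(wt, na.rm = TRUE)) %>%
  show_query()
```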
2. Go to the Spark UI and click the **SQL** button
3. Click on the top item inside the **Completed Queries** table
4. At the bottom of the diagram, expand **Details**
## 7.4 - Map data
1. Examine the contents of the `/usr/share/class/flights/data` folder, for example with `list.files()` as shown below
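```{r}
# One way to inspect the folder contents from R
list.files("/usr/share/class/flights/data")
```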
2. Read the top 5 rows of the `flight_2008_1.csv` file, located under `/usr/share/class/flights/data`
```{r}
library(readr)
top_rows <- read_csv("/usr/share/class/flights/data/flight_2008_1.csv", n_max = 5)
```
3. Create a named list based on the column names, where each element has `"character"` as its value
```{r}
library(purrr)
file_columns <- top_rows %>%
  rename_all(tolower) %>%
  map(function(x) "character")
head(file_columns)
```
4. Use `spark_read_csv()` to "map" the file's structure and location to the Spark context
```{r}
spark_flights <- spark_read_csv(
  sc,
  name = "flights",
  path = "/usr/share/class/flights/data/",
  memory = FALSE,
  columns = file_columns,
  infer_schema = FALSE
)
```
5. In the Connections pane, click on the table icon by the `flights` variable
6. Verify that the new variable pointer works using `tally()`
```{r}
spark_flights %>%
  tally()
```
## 7.5 - Caching data
1. Create a subset of the *flights* table object
```{r}
cached_flights <- spark_flights %>%
  mutate(
    # Every column was mapped as character, so missing values arrive
    # as the literal string "NA"; replace them with 0 before casting
    arrdelay = ifelse(arrdelay == "NA", 0, arrdelay),
    depdelay = ifelse(depdelay == "NA", 0, depdelay)
  ) %>%
  select(
    month,
    dayofmonth,
    arrtime,
    arrdelay,
    depdelay,
    crsarrtime,
    crsdeptime,
    distance
  ) %>%
  mutate_all(as.numeric)
```
2. Use `compute()` to materialize the transformed data into Spark memory
```{r}
cached_flights <- compute(cached_flights, "sub_flights")
```
3. Confirm that the new variable pointer works
```{r}
cached_flights %>%
  tally()
```
## 7.6 - sdf Functions
http://spark.rstudio.com/reference/#section-spark-dataframes
1. Use `sdf_pivot()` to pivot the data, creating a row for each value in `month` and a column for each value in `dayofmonth`
```{r}
cached_flights %>%
  arrange(month) %>%
  sdf_pivot(month ~ dayofmonth)
```
2. Use `sdf_partition()` to separate the data into discrete groups
```{r}
partition <- cached_flights %>%
  sdf_partition(training = 0.01, testing = 0.09, other = 0.9)

tally(partition$training)
```
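Because the split is random, the count above will change between runs; `sdf_partition()` accepts a `seed` argument to make the split reproducible.
```{r}
# Fix the seed so the random split is reproducible across runs
partition <- cached_flights %>%
  sdf_partition(training = 0.01, testing = 0.09, other = 0.9, seed = 100)
```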
## 7.7 - Feature transformers
http://spark.rstudio.com/reference/#section-spark-feature-transformers
1. Use `ft_binarizer()` to identify "delayed" flights
```{r}
cached_flights %>%
  ft_binarizer(
    input.col = "depdelay",
    output.col = "delayed",
    threshold = 15
  ) %>%
  select(
    depdelay,
    delayed
  ) %>%
  head(100)
```
2. Use `ft_bucketizer()` to split the data into groups
```{r}
cached_flights %>%
  ft_bucketizer(
    input.col = "crsdeptime",
    output.col = "dephour",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  ) %>%
  select(
    crsdeptime,
    dephour
  ) %>%
  head(100)
```
## 7.8 - Fit a model with sparklyr
1. Combine the `ft_` and `sdf_` functions to prepare the data
```{r}
sample_data <- cached_flights %>%
  filter(!is.na(arrdelay)) %>%
  ft_binarizer(
    input.col = "arrdelay",
    output.col = "delayed",
    threshold = 15
  ) %>%
  ft_bucketizer(
    input.col = "crsdeptime",
    output.col = "dephour",
    splits = c(0, 400, 800, 1200, 1600, 2000, 2400)
  ) %>%
  mutate(dephour = paste0("h", as.integer(dephour))) %>%
  sdf_partition(training = 0.01, testing = 0.09, other = 0.9)
```
2. Cache the training data
```{r}
training <- sdf_register(sample_data$training, "training")
tbl_cache(sc, "training")
```
3. Run a logistic regression model in Spark
```{r}
delayed_model <- training %>%
  ml_logistic_regression(delayed ~ depdelay + dephour)
```
4. View the model results
```{r}
summary(delayed_model)
```
## 7.9 - Run predictions in Spark
1. Use `sdf_predict()` against the test dataset
```{r}
delayed_testing <- sdf_predict(delayed_model, sample_data$testing)
delayed_testing %>%
  head()
```
2. Use `group_by()` to see how effective the new model is
```{r}
delayed_testing %>%
  group_by(delayed, prediction) %>%
  tally()
```
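The counts above form a small confusion matrix. As a minimal sketch, the overall accuracy can also be computed inside Spark by comparing the two columns directly.
```{r}
# Share of test rows where the prediction matches the actual label
delayed_testing %>%
  summarise(accuracy = mean(as.numeric(delayed == prediction)))
```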
```{r, include = FALSE}
spark_disconnect(sc)
```