05-modeling.Rmd

---
Title: "Modeling"
output: html_notebook
---

## Class catchup

```{r}
library(tidyverse)
library(DBI)
library(dbplyr)
library(dbplot)
library(tidypredict)
con <- DBI::dbConnect(odbc::odbc(), "Postgres Dev")
airports <- tbl(con, in_schema("datawarehouse", "airport")) 
table_flights <- tbl(con, in_schema("datawarehouse", "flight"))
carriers <- tbl(con, in_schema("datawarehouse", "carrier"))
set.seed(100)
```

## 5.1 - SQL Native sampling 
*Use PostgreSQL TABLESAMPLE clause*

1. Find out the class of the object returned by `show_query()`.  Test with *table_flights*.
```{r}
```

2. Find out the class of the object returned by `remote_query()`. Test with *table_flights*.
```{r}
```

3. Run `remote_query()` again *table_flights*
```{r}
```

4. Use `build_sql()` to paste together the results of the `remote_query()` operation and *" TABLESAMPLE SYSTEM (0.1)"*
```{r}
```

5. Use `build_sql()` and `remote_query()` to combine a the `dplyr` command with a custom SQL statement
```{r}
sql_sample <-  
```

6. Preview the sample data
```{r}
sql_sample
```

6. Test the efficacy of the sampling using `dbplot_histogram()` and comparing to the histogram produced in the Visualization chapter.
```{r}
dbplot_histogram(sql_sample, distance)
```

## 5.2 - Sample with ID
*Use a record's unique ID to produce a sample*

1. Summarize with `max()` and `min()` to get the upper and lower bound of *flightid*
```{r}
limit <-
```

2. Use `sample()` to get 0.1% of IDs
```{r}
sampling <- 
```

3. Use `%in%` to match the sample IDs in *table_flights* table
```{r}
id_sample <- 
```

4. Test the efficacy of the sampling using `dbplot_histogram()` and comparing to the histogram produced in the Visualization chapter.
```{r}

```


## 5.3 - Sample manually
*Use `row_number()`, `sample()` and `map_df()` to create a sample data set*

1. Create a filtered data set with January's data
```{r}
db_month <- 
```

2. Get the row count, collect and save the results to a variable
```{r}
rows <- 
```

3. Use `row_number()` to create a new column to number each row
```{r}
db_month <- 
```

4. Create a random set of 600 numbers, limited by the number of rows
```{r}
sampling <- 
```

5. Use `%in%` to filter the matched sample row IDs with the random set
```{r}
db_month <- 
```

6. Verify number of rows
```{r}
tally(db_month)
```

7. Create a function with the previous steps, but replacing the month number with an argument.  Collect the data at the end
```{r}
sample_segment <- function(x, size = 600) {
  
}
```

8. Test the function
```{r}
head(sample_segment(3), 100)
```

9. Use `map_df()` to run the function for each month
```{r}
strat_sample <- 
```

10. Verify sample with a `dbplot_histogram()`
```{r}

```

## 5.4 - Create a model & test

1. Prepare a model data set.  Using `case_when()` create a field called *season* and assign based 
```{r}
model_data <- strat_sample %>%
    mutate(
    season = case_when(
      month >= 3 & month <= 5  ~ "Spring",
      month >= 6 & month <= 8  ~ "Summmer",
      month >= 9 & month <= 11 ~ "Fall",
      month == 12 | month <= 2  ~ "Winter"
    )
  ) %>%
  select(arrdelay, season, depdelay) 
  
```

2. Create a simple `lm()` model against *arrdelay*
```{r}
```

3. Create a test data set by combining the sampling and model data set routines
```{r}
test_sample <- 
```

4. Run a simple routine to check accuracy 
```{r}
test_sample %>%
  mutate(p = predict(model_lm, test_sample),
         over = abs(p - arrdelay) < 10) %>%
  group_by(over) %>% 
  tally() %>%
  mutate(percent = round(n / sum(n), 2))
```

## 5.5 - Score inside database
*Learn about tidypredict to run predictions inside the database*

1. Load the library, and see the results of passing the model as an argument to `tidypredict_fit()` 
```{r}
library(tidypredict)

tidypredict_fit(model_lm)
```

2. Use `tidypredict_sql()` to see the resulting SQL statement
```{r}

```

3. Run the prediction inside `dplyr` by piping the same transformations into `tidypredict_to_column()`, but starting with *table_flights*
```{r}

```

4. View the SQL behind the `dplyr` command with `remote_query()`
```{r}

```

5. Compare predictions to ensure results are within range using `tidypredict_test()`
```{r}
test <- 
```

6. View the records that exceeded the threshold
```{r}
test$raw_results %>%
  filter(fit_threshold)
```

## 5.6 - Parsed model
*Quick review of the model parser*

1. Use the `parse_model()` function to see how `tidypredict` interprets the model
```{r}
pm <- 
```

2. With `tidypredict_fit()`, verify that the resulting table can be used to get the fit formula
```{r}

```

3. Save the parsed model for later use using the `yaml` package
```{r}
library(yaml)

``` 

4. Reload model from the YAML file
```{r}
my_pm <-
```

5. Use the reloaded model to build the fit formula
```{r}
tidypredict_fit(my_pm)
```

## 5.7 -  Model inside the database
*Brief intro to modeldb*

1. Load `modeldb`
```{r}
library(modeldb)
```

2. Use the `sampling` variable to create a filtered table of `table_flights`
```{r}
sample <- table_flights %>%
  filter(flightid %in% sampling) 
```

3. Select *deptime*, *distance* and *arrdelay* from `sample` and pipe into `linear_regression_db()`, pass *arrdelay* as the only argument.
```{r}

```

4. Using the coefficients from the results, create a new column that multiplies each field against the corresponding coefficient, and adds them all together along with the intercept
```{r}
table_flights %>%
  head(1000) %>%
  mutate(pred = --0.6262197	 + (deptime * 0.01050023	) + (distance * -0.003834017	)) %>%
  select(arrdelay, pred) 
```

5. Add *dayofweek* to the variable selection.  Add `add_dummy_variables()` in between the selection and the linear regression function. Pass *dayofweek* and `c(1:7)` to represent the 7 days
```{r}

```

6. Replace *arrdelay* with *uniquecarrier*. Group by *uniquecarrier* and remove the `add_dummy_variable()`.
```{r}
```

7. Pipe the code into `ggplot2`.  Use the intercept for the plot's `x` and *uniquecarrier* for `y`.  Use `geom_point()`
```{r}
```

```{r, include = FALSE}
dbDisconnect(con)
```