-
Notifications
You must be signed in to change notification settings - Fork 5
/
README.Rmd
133 lines (87 loc) · 4.01 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
---
output: github_document
---
```{r setup, echo = FALSE, message = FALSE, results = 'hide'}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.width = 8,
fig.height = 6,
fig.path = "figs/",
echo = TRUE
)
```
# Welcome to the *linelist* package!
[![Travis build status](https://travis-ci.org/reconhub/linelist.svg?branch=master)](https://travis-ci.org/reconhub/linelist)
[![Codecov test coverage](https://codecov.io/gh/reconhub/linelist/branch/master/graph/badge.svg)](https://codecov.io/gh/reconhub/linelist?branch=master)
This package is dedicated to simplifying the cleaning and standardisation of
linelist data. Considering a case linelist `data.frame`, it aims to:
- standardise the variables names, replacing all non-ascii characters with their
closest latin equivalent, removing blank spaces and other separators,
enforcing lower case capitalisation, and using a single separator between
words
- standardise the labels used in all variables of type `character` and `factor`,
as above
- set `POSIXct` and `POSIXlt` to `Date` objects
- extract dates from a messy variable, automatically detecting formats, allowing
inconsistent formats, and dates flanked by other text
## Installing the package
To install the current stable, CRAN version of the package, type:
```{r install, eval = FALSE}
install.packages("linelist")
```
To benefit from the latest features and bug fixes, install the development, *github* version of the package using:
```{r install2, eval = FALSE}
devtools::install_github("reconhub/linelist")
```
Note that this requires the package *devtools* installed.
# What does it do?
## Data cleaning
Procedures to clean data, first and foremost aimed at `data.frame` formats,
include:
- `clean_data()`: the main function, taking a `data.frame` as input, and doing all
the variable names, internal labels, and date processing described above
- `clean_variable_names()`: like `clean_data`, but only the variable names
- `clean_variable_labels()`: like `clean_data`, but only the variable labels
- `clean_variable_spelling()`: provided with a dictionary, will correct the
spelling of values in a variable and can globally correct commonly mis-spelled
words.
- `clean_dates()`: like `clean_data`, but only the dates
- `guess_dates()`: find dates in various, unspecified formats in a messy
`character` vector
# Worked example
Let us consider some messy `data.frame` as a toy example:
```{r toy_data}
## make toy data
onsets <- as.Date("2018-01-01") + sample(1:10, 20, replace = TRUE)
discharge <- format(as.Date(onsets) + 10, "%d/%m/%Y")
genders <- c("male", "female", "FEMALE", "Male", "Female", "MALE")
gender <- sample(genders, 20, replace = TRUE)
case_types <- c("confirmed", "probable", "suspected", "not a case",
"Confirmed", "PROBABLE", "suspected ", "Not.a.Case")
messy_dates <- sample(
c("01-12-2001", "male", "female", "2018-10-18", "2018_10_17",
"2018 10 19", "// 24//12//1989", NA, "that's 24/12/1989!"),
20, replace = TRUE)
case <- factor(sample(case_types, 20, replace = TRUE))
toy_data <- data.frame("Date of Onset." = onsets,
"DisCharge.." = discharge,
"SeX_ " = gender,
"Épi.Case_définition" = case,
"messy/dates" = messy_dates)
## show data
toy_data
```
We start by cleaning these data:
```{r clean_data}
## load library
library(linelist)
## clean data with defaults
x <- clean_data(toy_data)
x
```
## Getting help online
Bug reports and feature requests should be posted on *github* using the [*issue*](http://github.com/reconhub/linelist/issues) system. All other questions should be posted on the **RECON forum**: <br>
[http://www.repidemicsconsortium.org/forum/](http://www.repidemicsconsortium.org/forum/)
Contributions are welcome via **pull requests**.
Please note that this project is released with a [Contributor Code of Conduct](CONDUCT.md). By participating in this project you agree to abide by its terms.