This package is dedicated to simplifying the cleaning and standardisation of linelist data. Given a case linelist stored as a data.frame, it aims to:
- standardise the variable names, replacing all non-ASCII characters with their closest Latin equivalent, removing blank spaces and other separators, enforcing lower case, and using a single separator between words
- standardise the labels used in all variables of type character and factor, as above
- convert POSIXct and POSIXlt objects to Date objects
- extract dates from a messy variable, automatically detecting formats, allowing inconsistent formats and dates flanked by other text
- support a data dictionary: linelist objects can store metadata indicating which columns correspond to standard epidemiological variables usually found in linelists, such as a unique identifier, gender, or dates of onset
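As a rough illustration of the first two aims, the standardisation rules can be sketched in base R. This is only a sketch under stated assumptions: standardise_text() is a hypothetical helper, not a function of the package, and it is not the package's implementation. The result of iconv() with "ASCII//TRANSLIT" can differ between platforms for accented characters, so only ASCII inputs are shown.

```r
## Hypothetical helper (NOT part of linelist): base R sketch of the rules above
standardise_text <- function(x) {
  ## replace non-ASCII characters with an ASCII equivalent where possible
  ## (platform-dependent for accented characters)
  x <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
  ## enforce lower case
  x <- tolower(x)
  ## turn any run of blanks, dots or other separators into a single "_"
  x <- gsub("[^a-z0-9]+", "_", x)
  ## drop separators left at the start or end
  gsub("^_+|_+$", "", x)
}

## variable names similar to those of a messy data.frame
standardise_text(c("Date.of.Onset.", "DisCharge..", "SeX_."))
#> [1] "date_of_onset" "discharge"     "sex"

## the same rules applied to labels
standardise_text(c("Not.a.Case", "suspected ", "PROBABLE"))
#> [1] "not_a_case" "suspected"  "probable"
```

The same function is applied to names and labels here only to keep the sketch short.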
To install the current stable, CRAN version of the package, type:
install.packages("linelist")
To benefit from the latest features and bug fixes, install the development version of the package from GitHub using:
devtools::install_github("reconhub/linelist")
Note that this requires the devtools package to be installed.
Procedures to clean data, first and foremost aimed at data.frame formats, include:
- clean_data(): the main function, taking a data.frame as input and doing all the variable names, internal labels, and date processing described above
- clean_variable_names(): like clean_data(), but only the variable names
- clean_variable_labels(): like clean_data(), but only the variable labels
- clean_variable_spelling(): provided with a dictionary, corrects the spelling of values in a variable and can globally correct commonly misspelled words
- clean_dates(): like clean_data(), but only the dates
- guess_dates(): find dates in various, unspecified formats in a messy character vector
- as_linelist(): create a new linelist object from a data.frame
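To give an idea of what extracting dates from a messy character vector involves, here is a minimal base R sketch. It is an assumption-laden illustration, not how guess_dates() works: extract_date() is a hypothetical helper that only recognises year-month-day and day-month-year patterns, and it fails if the surrounding text contains other digits.

```r
## Hypothetical helper (NOT part of linelist): find a date flanked by other text
extract_date <- function(x) {
  x <- as.character(x)
  ## any run of non-digit characters becomes a single "-"
  digits <- gsub("[^0-9]+", "-", x)
  ## drop separators left at the start or end
  digits <- gsub("^-+|-+$", "", digits)

  out <- rep(as.Date(NA), length(x))
  ## year-month-day, e.g. "2018_10_17"
  ymd <- grepl("^[0-9]{4}-[0-9]{1,2}-[0-9]{1,2}$", digits)
  ## day-month-year, e.g. "that's 24/12/1989!"
  dmy <- grepl("^[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}$", digits)

  out[ymd] <- as.Date(digits[ymd], format = "%Y-%m-%d")
  out[dmy] <- as.Date(digits[dmy], format = "%d-%m-%Y")
  out
}

messy <- c("01-12-2001", "male", "2018_10_17", "2018 10 19",
           "// 24//12//1989", NA, "that's 24/12/1989!")
extract_date(messy)
#> [1] "2001-12-01" NA           "2018-10-17" "2018-10-19" "1989-12-24"
#> [6] NA           "1989-12-24"
```

Entries with no recognisable date, such as "male", become NA rather than raising an error.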
linelist also handles a dictionary of pre-defined, standard epidemiological variables, referred to as epivars throughout the package. Meta-information can be attached to linelist objects to define which columns of the dataset correspond to specific epivars.
The main functions to handle the epivars of a linelist object include:
- list_epivars(): lists the epivars of a dataset, with options to have more or less information
- get_epivars(): extract columns of a dataset corresponding to epivars
- set_epivars(): set the epivars of a dataset
- as_linelist(): creates a new linelist object, and can define epivars as extra arguments
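The idea of attaching meta-information to a dataset can be illustrated with plain R attributes. The sketch below is an assumption about one possible design, not a description of how linelist objects are stored: set_meta() and get_meta() are hypothetical helpers, and the attribute name "epivars" is chosen only for this example.

```r
## Hypothetical helpers (NOT part of linelist): keep an epivar-to-column
## mapping as an attribute of the data.frame
set_meta <- function(df, ...) {
  mapping <- c(...)                        # e.g. c(gender = "sex")
  stopifnot(all(mapping %in% names(df)))   # mapped columns must exist
  attr(df, "epivars") <- mapping
  df
}

get_meta <- function(df, ...) {
  wanted <- c(...)                         # epivar names, e.g. "gender"
  columns <- unname(attr(df, "epivars")[wanted])
  df[, columns, drop = FALSE]
}

df <- data.frame(date_of_onset = as.Date("2018-01-01") + 0:2,
                 sex = c("male", "female", "female"))
df <- set_meta(df, date_onset = "date_of_onset", gender = "sex")

names(attr(df, "epivars"))
#> [1] "date_onset" "gender"
names(get_meta(df, "gender"))
#> [1] "sex"
```

Storing the mapping separately from the column names means the data keep their original (cleaned) names while still being addressable by a standard vocabulary.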
In addition, several functions allow interaction with the dictionary of recognised epivars, including:
- default_dictionary(): shows the default dictionary of epivars
- get_dictionary(): shows the current epivars dictionary
- set_dictionary(): set the current epivars dictionary; if arguments are empty, reset to the defaults
- reset_dictionary(): reset the current epivars dictionary to defaults
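A "current" dictionary that can be changed and reset is commonly kept in an environment. The following base R sketch shows that pattern under stated assumptions: the functions default_dict(), get_dict(), set_dict() and reset_dict() are hypothetical stand-ins written for this example, the dictionary is reduced to three entries, and nothing here is taken from the package's source.

```r
## Hypothetical sketch (NOT the package's code): a resettable dictionary
.dict_env <- new.env()

default_dict <- function() {
  data.frame(
    epivar = c("id", "date_onset", "gender"),
    description = c("unique individual identifier",
                    "date at which symptoms started",
                    "gender of the individual"),
    stringsAsFactors = FALSE
  )
}

set_dict <- function(dict = NULL) {
  ## an empty argument resets the dictionary to the defaults
  if (is.null(dict)) dict <- default_dict()
  assign("current", dict, envir = .dict_env)
  invisible(dict)
}

get_dict <- function() {
  ## fall back to the defaults if nothing has been set yet
  if (!exists("current", envir = .dict_env, inherits = FALSE)) set_dict()
  get("current", envir = .dict_env, inherits = FALSE)
}

reset_dict <- function() set_dict(NULL)

## add a custom entry, then go back to the defaults
custom <- rbind(default_dict(),
                data.frame(epivar = "occupation",
                           description = "occupation of the individual",
                           stringsAsFactors = FALSE))
set_dict(custom)
nrow(get_dict())
#> [1] 4
reset_dict()
nrow(get_dict())
#> [1] 3
```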
Let us consider a messy data.frame as a toy example:
## make toy data
onsets <- as.Date("2018-01-01") + sample(1:10, 20, replace = TRUE)
discharge <- format(as.Date(onsets) + 10, "%d/%m/%Y")
genders <- c("male", "female", "FEMALE", "Male", "Female", "MALE")
gender <- sample(genders, 20, replace = TRUE)
case_types <- c("confirmed", "probable", "suspected", "not a case",
                "Confirmed", "PROBABLE", "suspected ", "Not.a.Case")
messy_dates <- sample(
  c("01-12-2001", "male", "female", "2018-10-18", "2018_10_17",
    "2018 10 19", "// 24//12//1989", NA, "that's 24/12/1989!"),
  20, replace = TRUE)
case <- factor(sample(case_types, 20, replace = TRUE))
toy_data <- data.frame("Date of Onset." = onsets,
                       "DisCharge.." = discharge,
                       "SeX_ " = gender,
                       "Épi.Case_définition" = case,
                       "messy/dates" = messy_dates)
## show data
toy_data
#> Date.of.Onset. DisCharge.. SeX_. Épi.Case_définition
#> 1 2018-01-06 16/01/2018 Male suspected
#> 2 2018-01-05 15/01/2018 female probable
#> 3 2018-01-02 12/01/2018 Female probable
#> 4 2018-01-03 13/01/2018 Female not a case
#> 5 2018-01-10 20/01/2018 FEMALE probable
#> 6 2018-01-08 18/01/2018 FEMALE probable
#> 7 2018-01-03 13/01/2018 Male confirmed
#> 8 2018-01-05 15/01/2018 Male Not.a.Case
#> 9 2018-01-09 19/01/2018 FEMALE suspected
#> 10 2018-01-08 18/01/2018 MALE not a case
#> 11 2018-01-07 17/01/2018 Female probable
#> 12 2018-01-06 16/01/2018 MALE PROBABLE
#> 13 2018-01-11 21/01/2018 female PROBABLE
#> 14 2018-01-06 16/01/2018 female Confirmed
#> 15 2018-01-08 18/01/2018 Male Not.a.Case
#> 16 2018-01-02 12/01/2018 FEMALE suspected
#> 17 2018-01-07 17/01/2018 Female not a case
#> 18 2018-01-10 20/01/2018 Female probable
#> 19 2018-01-05 15/01/2018 Male Confirmed
#> 20 2018-01-09 19/01/2018 Male Not.a.Case
#> messy.dates
#> 1 that's 24/12/1989!
#> 2 01-12-2001
#> 3 male
#> 4 male
#> 5 2018 10 19
#> 6 that's 24/12/1989!
#> 7 2018_10_17
#> 8 2018_10_17
#> 9 female
#> 10 <NA>
#> 11 <NA>
#> 12 01-12-2001
#> 13 that's 24/12/1989!
#> 14 // 24//12//1989
#> 15 2018-10-18
#> 16 2018-10-18
#> 17 2018_10_17
#> 18 that's 24/12/1989!
#> 19 2018_10_17
#> 20 01-12-2001
We start by cleaning these data:
## load library
library(linelist)
## clean data with defaults
x <- clean_data(toy_data)
x
#> date_of_onset discharge sex epi_case_definition messy_dates
#> 1 2018-01-06 2018-01-16 male suspected 1989-12-24
#> 2 2018-01-05 2018-01-15 female probable 2001-12-01
#> 3 2018-01-02 2018-01-12 female probable <NA>
#> 4 2018-01-03 2018-01-13 female not_a_case <NA>
#> 5 2018-01-10 2018-01-20 female probable 2018-10-19
#> 6 2018-01-08 2018-01-18 female probable 1989-12-24
#> 7 2018-01-03 2018-01-13 male confirmed 2018-10-17
#> 8 2018-01-05 2018-01-15 male not_a_case 2018-10-17
#> 9 2018-01-09 2018-01-19 female suspected <NA>
#> 10 2018-01-08 2018-01-18 male not_a_case <NA>
#> 11 2018-01-07 2018-01-17 female probable <NA>
#> 12 2018-01-06 2018-01-16 male probable 2001-12-01
#> 13 2018-01-11 2018-01-21 female probable 1989-12-24
#> 14 2018-01-06 2018-01-16 female confirmed 1989-12-24
#> 15 2018-01-08 2018-01-18 male not_a_case 2018-10-18
#> 16 2018-01-02 2018-01-12 female suspected 2018-10-18
#> 17 2018-01-07 2018-01-17 female not_a_case 2018-10-17
#> 18 2018-01-10 2018-01-20 female probable 1989-12-24
#> 19 2018-01-05 2018-01-15 male confirmed 2018-10-17
#> 20 2018-01-09 2018-01-19 male not_a_case 2001-12-01
We can now define some epivars for x, i.e. identify which columns correspond to typical epidemiological variables:
## see what the dictionary is
get_dictionary()
#> epivar hxl
#> 1 id #respondee
#> 2 date_onset #date +start
#> 3 date_report #date +reported
#> 4 date_outcome #date +end
#> 5 case_definition #indicator +name
#> 6 outcome #indicator +type
#> 7 gender #indicator +type
#> 8 age #indicator +num
#> 9 age_group #indicator +type
#> 10 geo_lon #geo +lon
#> 11 geo_lat #geo +lat
#> description
#> 1 unique individual identifier
#> 2 date at which symptoms started
#> 3 date at which case was reported
#> 4 date of the outcome (recovery or death)
#> 5 case type: suspected, probable, confirmed, negative
#> 6 recovery or death
#> 7 gender of the individual
#> 8 age of the individual in years
#> 9 age group of the individual in years
#> 10 geographic coordinate: longitude
#> 11 geographic coordinate: latitude
## see current names of variables
names(x)
#> [1] "date_of_onset" "discharge" "sex"
#> [4] "epi_case_definition" "messy_dates"
## some variables are known epivars; let's create a linelist object and register
## this information at the same time
x <- as_linelist(x, date_onset = "date_of_onset", gender = "sex")
x
#> <linelist object>
#>
#> date_of_onset discharge sex epi_case_definition messy_dates
#> 1 2018-01-06 2018-01-16 male suspected 1989-12-24
#> 2 2018-01-05 2018-01-15 female probable 2001-12-01
#> 3 2018-01-02 2018-01-12 female probable <NA>
#> 4 2018-01-03 2018-01-13 female not_a_case <NA>
#> 5 2018-01-10 2018-01-20 female probable 2018-10-19
#> 6 2018-01-08 2018-01-18 female probable 1989-12-24
#> 7 2018-01-03 2018-01-13 male confirmed 2018-10-17
#> 8 2018-01-05 2018-01-15 male not_a_case 2018-10-17
#> 9 2018-01-09 2018-01-19 female suspected <NA>
#> 10 2018-01-08 2018-01-18 male not_a_case <NA>
#> 11 2018-01-07 2018-01-17 female probable <NA>
#> 12 2018-01-06 2018-01-16 male probable 2001-12-01
#> 13 2018-01-11 2018-01-21 female probable 1989-12-24
#> 14 2018-01-06 2018-01-16 female confirmed 1989-12-24
#> 15 2018-01-08 2018-01-18 male not_a_case 2018-10-18
#> 16 2018-01-02 2018-01-12 female suspected 2018-10-18
#> 17 2018-01-07 2018-01-17 female not_a_case 2018-10-17
#> 18 2018-01-10 2018-01-20 female probable 1989-12-24
#> 19 2018-01-05 2018-01-15 male confirmed 2018-10-17
#> 20 2018-01-09 2018-01-19 male not_a_case 2001-12-01
Note that the equivalent can be done using piping:
library(magrittr)
x <- toy_data %>%
  clean_data() %>%
  as_linelist(date_onset = "date_of_onset", gender = "sex")
x
#> <linelist object>
#>
#> date_of_onset discharge sex epi_case_definition messy_dates
#> 1 2018-01-06 2018-01-16 male suspected 1989-12-24
#> 2 2018-01-05 2018-01-15 female probable 2001-12-01
#> 3 2018-01-02 2018-01-12 female probable <NA>
#> 4 2018-01-03 2018-01-13 female not_a_case <NA>
#> 5 2018-01-10 2018-01-20 female probable 2018-10-19
#> 6 2018-01-08 2018-01-18 female probable 1989-12-24
#> 7 2018-01-03 2018-01-13 male confirmed 2018-10-17
#> 8 2018-01-05 2018-01-15 male not_a_case 2018-10-17
#> 9 2018-01-09 2018-01-19 female suspected <NA>
#> 10 2018-01-08 2018-01-18 male not_a_case <NA>
#> 11 2018-01-07 2018-01-17 female probable <NA>
#> 12 2018-01-06 2018-01-16 male probable 2001-12-01
#> 13 2018-01-11 2018-01-21 female probable 1989-12-24
#> 14 2018-01-06 2018-01-16 female confirmed 1989-12-24
#> 15 2018-01-08 2018-01-18 male not_a_case 2018-10-18
#> 16 2018-01-02 2018-01-12 female suspected 2018-10-18
#> 17 2018-01-07 2018-01-17 female not_a_case 2018-10-17
#> 18 2018-01-10 2018-01-20 female probable 1989-12-24
#> 19 2018-01-05 2018-01-15 male confirmed 2018-10-17
#> 20 2018-01-09 2018-01-19 male not_a_case 2001-12-01
We now have a clean dataset, with standardised labels and variable names, in which dates of onset and gender are formally identified:
## check available epivars
list_epivars(x, simple = TRUE) # simple
#> [1] "date_onset" "gender"
list_epivars(x) # more info
#> epivar column hxl description
#> 1 date_onset date_of_onset #date +start date at which symptoms started
#> 2 gender sex #indicator +type gender of the individual
get_epivars(x, "gender", "date_onset")
#> sex date_of_onset
#> 1 male 2018-01-06
#> 2 female 2018-01-05
#> 3 female 2018-01-02
#> 4 female 2018-01-03
#> 5 female 2018-01-10
#> 6 female 2018-01-08
#> 7 male 2018-01-03
#> 8 male 2018-01-05
#> 9 female 2018-01-09
#> 10 male 2018-01-08
#> 11 female 2018-01-07
#> 12 male 2018-01-06
#> 13 female 2018-01-11
#> 14 female 2018-01-06
#> 15 male 2018-01-08
#> 16 female 2018-01-02
#> 17 female 2018-01-07
#> 18 female 2018-01-10
#> 19 male 2018-01-05
#> 20 male 2018-01-09
Bug reports and feature requests should be posted on GitHub using the issue system. All other questions should be posted on the RECON forum:
http://www.repidemicsconsortium.org/forum/
Contributions are welcome via pull requests.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
The linelist package should have the following features in the future:
- A data dictionary that allows you to map standard variable names to columns
- Integration with the #hxl standard
- Validation of categorical values