Merge pull request #314 from NIEHS/kpm-readme
Kpm readme
kyle-messier authored Mar 29, 2024
2 parents 00d9043 + 979b266 commit 0967639
Showing 3 changed files with 147 additions and 58 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -10,6 +10,9 @@
output/
manuscript/

# README html
README.html

# User-specific files
.Ruserdata

@@ -78,6 +81,7 @@ code/mitchell_tests/
# Insang's negative value exploration script
tools/negative_exploration
tools/metalearner_test
tools/shiny_explore_pm/missingness_exploration_pm_quarto_shiny_data/

# NASA Earthdata login credentials
.dodsrc
175 changes: 117 additions & 58 deletions README.md
@@ -7,18 +7,18 @@ experimental](https://img.shields.io/badge/lifecycle-experimental-orange.svg)](h
# Building an Extensible, rEproducible, Test-driven, Harmonized, Open-source, Versioned, ENsemble model for air quality
Group Project for the Spatiotemporal Exposures and Toxicology group with help from friends :smiley: :cowboy_hat_face: :earth_americas:

## GitHub Push/Pull Workflow
1) Each collaborator has a local copy of the GitHub repo; the suggested location is ddn/gs1/username/home
2) Work locally
3) Push to remote
4) Kyle [or delegate] will pull to MAIN local copy on SET group ddn location

## Repo Rules
1) To PUSH changes to the repo, the changes must be made to a non-MAIN branch
2) Then a PULL request must be made
3) Then it requires the REVIEW of 1 person (can be anyone)
4) Then the change from the branch is MERGED to the MAIN branch
## Installation

```r
remotes::install_github("NIEHS/beethoven")
```

## Getting Started

```r
TODO
```
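Until this section is written, a minimal sketch of a likely entry point, assuming the pipeline is orchestrated with the `targets` package as described below (any beethoven-specific wrappers are not documented here):

```r
# Hypothetical quick start, run from the repository root.
# tar_make() and tar_visnetwork() are standard targets functions;
# that beethoven uses them directly is an assumption.
library(targets)

tar_make()        # build all pipeline targets
tar_visnetwork()  # inspect the target dependency graph
```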

## Overall Project Workflow

Targets: Make-like Reproducible Analysis Pipeline
@@ -28,74 +28,133 @@ Targets: Make-like Reproducible Analysis Pipeline
4) Fit Meta Learners
5) Predictions
6) Summary Stats

**Placeholder for up-to-date rendering of targets**

```r
targets::tar_visnetwork()
```
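As a sketch of how such a pipeline is declared with `targets` (the target and helper names here are illustrative, not the actual beethoven targets):

```r
# _targets.R sketch: each tar_target() declares one pipeline step, and
# targets rebuilds a step only when its upstream dependencies change.
# fit_meta_learners(), predict_grid(), and summarize_results() are
# hypothetical helpers used only for illustration.
library(targets)

list(
  tar_target(meta_model, fit_meta_learners(base_models)),  # 4) Fit Meta Learners
  tar_target(prediction, predict_grid(meta_model)),        # 5) Predictions
  tar_target(summary_stats, summarize_results(prediction)) # 6) Summary Stats
)
```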

## Unit and Integration Testing

We will utilize various testing approaches to ensure the functionality and quality of the code.
## Project Organization

Here, we describe the structure of the project and the naming conventions used. The most up-to-date file paths and names are recorded here for reference.

### File Structure

#### Folder Structure
Root Directory
- R/
- input/
- output/
- tests/
- inst/
- docs/
- tools/
- man/
- vignettes/

#### input/

#### output/

Currently, as of 3/29/24, the output folder contains .rds files for each
of the covariates/features used for model development, e.g.:

- NRTAP_Covars_NLCD.rds
- NRTAP_Covars_TRI.rds


#### tests/

##### testthat/

##### testdata/

#### Relevant files

- LICENSE
- DESCRIPTION
- NAMESPACE
- README.md

### Naming Conventions

For `tar_target` functions, we use the following naming conventions:

Example from CF conventions:
[surface] [component] standard_name [at surface] [in medium] [due to process] [assuming condition]

Naming conventions for tar objects:

Long Version:
[R object type]_[role]_[stage]_[source]_[spacetime]

- R object type: sf, datatable, tibble, SpatRaster, SpatVector

- role: detailed description of the role of the object in the pipeline. Allowable keywords:

  - PM25
  - feature (i.e. geographic covariate)
  - base_model
    - base_model suffix types: linear, random_forest, xgboost, neural_net, etc.
  - meta_model
  - prediction
  - plot
    - plot suffix types: scatter, map, time_series, histogram, density, etc.

- stage: the stage of the pipeline the object is used in. Object transformations
  are also articulated here. Allowable keywords:

  - raw
  - process
  - fit: ready for base/meta learner fitting
  - result: final result
  - log
  - log10

- source: the original data source

  - AQS
  - MODIS
  - GMTED
  - NLCD
  - NARR
  - GEOSCF
  - TRI

- spacetime: relevant spatial or temporal information

  - spatial:
    - site_id
    - census_tract
    - grid
  - time:
    - daily [optional YYYYMMDD]
    - annual [optional YYYY]

Short Version:

The cross-walk lives on the punchcard.

### Function Naming Conventions

TBD

- download
- calc

### Processes to test or check
1) data type
2) data name
3) data size
4) relative paths
5) output of one module is the expectation of the input of the next module

### Test Driven Development
Starting from the end product, we work backwards while articulating the tests needed at each stage.

#### Key Points of Unit and Integration Testing

File Type
1. NetCDF
2. Numeric, double precision
3. NA
4. Variable names exist
5. Naming convention

Stats
1. Non-negative variance ($\sigma^2$)
2. Mean is reasonable ($\mu$)
3. SI units

Domain
1. In the US (+ buffer)
2. In time range (2018-2022)

Geographic
1. Projections
2. Coordinate names (e.g. lat/lon)
3. Time in an acceptable format

### Frameworks for Testing of this project (with help from ChatGPT)

#### Test Driven Development (TDD): Key Steps

1. **Write a Test**: Before you start writing any code, write a test case for the functionality you want to implement. This test should fail initially because the code to make it pass does not exist yet. The test defines the expected behavior of your code.

2. **Run the Test**: Run the test to ensure it fails. This step confirms that your test is correctly assessing the functionality you want to implement.

3. **Write the Minimum Code**: Write the minimum amount of code required to make the test pass. Don't worry about writing perfect or complete code at this stage; the goal is just to make the test pass.

4. **Run the Test Again**: After writing the code, run the test again. If it passes, your code now meets the specified requirements.

5. **Refactor (if necessary)**: If your code works and the test passes, refactor it to improve its quality, readability, or performance. The key here is that test coverage ensures you don't introduce new bugs while refactoring.

6. **Repeat**: Continue this cycle of writing a test, making it fail, writing the code to make it pass, and refactoring as needed. Each cycle should be short and focused on a small piece of functionality.

7. **Complete the Feature**: Keep repeating the process until your code meets all the requirements for the feature you're working on.

TDD helps ensure that your code is reliable and that it remains functional as you make changes and updates. It also encourages a clear understanding of the requirements and promotes better code design.
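As a hypothetical illustration of the long-version tar object naming convention described above, a datatable of NLCD-derived features processed at monitoring sites might be declared as (`dt` for datatable; `calc_nlcd_features()` is an illustrative helper, not an actual beethoven function):

```r
# [R object type]_[role]_[stage]_[source]_[spacetime]
library(targets)

tar_target(
  dt_feature_process_nlcd_site_id,
  calc_nlcd_features(dt_pm25_raw_aqs_site_id)
)
```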
### Punchcard

The punchcard is a single list of paths, variables, and functions that are used throughout the pipeline.

If a path is used in multiple places, it is only listed once. Then updates only require changing the path in one place.

`tools/pipeline/punchcard.csv`
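For illustration only (the actual column schema of `punchcard.csv` is not shown here), a punchcard-style entry pairing a logical name with a single canonical path might look like:

```csv
name,path,description
covars_nlcd,output/NRTAP_Covars_NLCD.rds,NLCD-derived covariates
```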

## Base Learners
Potential base learners we can use:
1) PrestoGP (lasso + GP)
2) XGBOOST
3) RF
4) CNN
5) UMAP covariates
6) Encoder NN covariates



26 changes: 26 additions & 0 deletions contributing_guide.qmd
@@ -0,0 +1,26 @@
---
title: "Contributing Guide for beethoven"
format: html
---


## GitHub Push/Pull Workflow
1) Each collaborator has a local copy of the GitHub repo
2) Work locally
3) Push to remote
4) Admin will pull to MAIN local copy on SET group ddn location

## Repo Rules
1) To PUSH changes to the repo, the changes must be made to a non-MAIN branch
2) Then a PULL request must be made
3) Then it requires the REVIEW of 1 person (can be anyone)
4) Then the change from the branch is MERGED to the MAIN branch
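The rules above correspond to a standard branch-and-PR flow; as a sketch (the branch name and commit message are illustrative, and `gh` is the optional GitHub CLI):

```sh
git checkout -b fix-readme        # 1) work on a non-MAIN branch
git add -A && git commit -m "Fix README typos"
git push -u origin fix-readme     # push the branch to the remote
gh pr create --base main          # 2) open a pull request
# 3) one approving review is required, then
# 4) the branch is merged into MAIN
```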


## Areas for Contribution

1. **Documentation**: We are always looking for help in improving our documentation. This includes the README, contributing guide, and any other documentation that you think could be improved.
2. **Code**: We are always looking for help in improving our code. This includes fixing bugs, adding new features, and improving existing features.
3. **Testing**: We are always looking for help in improving our testing. This includes writing new tests, improving existing tests, and ensuring that all tests pass.
4. **Base Learners**: Novel and efficient base learners are always welcome and can be integrated into the overall model workflow
5. **Model Workflow**: Improvements to the model workflow are always welcome. This includes improving the efficiency of the workflow, adding new features, and improving existing features.
