jonjoncardoso/data-science-workflow

A Practical Workflow for Data Science Projects

This project is a template structure for data science projects. It is aimed mainly at teams within organizations, but it should be just as useful if you are coding solo, taking part in a data hackathon, or working in an academic group, whether you are doing exploratory analysis, statistical analysis, or algorithm modelling.

This is a standalone template that you can use as a starting point for any data science project. It is not a framework, a library, or a package, and it is not intended to be a one-size-fits-all solution: treat it as a base from which to build your own project structure.

If you like this project, please consider giving it a ⭐️!

👥 Team

  • Jon as the core developer

People who have contributed to this project in the past:

Initial repository setup

Follow the instructions below to make use of this template.

  1. Create a new repository on GitHub using this template. You can do this by clicking on the green "Use this template" button on the top right of this page.

    Illustration of how to use this template

  2. Give your project a name and description. You can also choose to make the repository private if you wish.

    • Leave "Include all branches" unchecked.
  3. GitHub will copy the files from this repository into your new repository and it will trigger an Actions workflow. This workflow will customize labels (to include emojis!) as well as Issues and Pull Request templates for your project.

    • If you are not familiar with GitHub Actions, you can read more about it here.
  4. Clone your new repository to your computer and start working on it!

First steps

Once you have cloned your new repository to your computer, you might want to do the following:

  1. Update the README.md file: remove everything related to this template and add information about your project.

  2. Update the LICENSE file to reflect the license you want to use for your project. You can find a list of open-source licenses here.

  3. Modify the name of the src/python/pkg_name folder to reflect the name of your project. You can also remove the pkg_name folder if you are not planning on using custom Python packages.
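
A minimal sketch of step 3, assuming you run it from the root of your clone. The helper name rename_pkg and the example package name my_project are hypothetical and not part of the template:

```shell
# Hypothetical helper: rename the placeholder package folder src/python/pkg_name.
# $1 = the new package name (assumed to be a valid Python package name).
rename_pkg() {
  if [ -z "${1:-}" ]; then
    echo "usage: rename_pkg <new_package_name>" >&2
    return 1
  fi
  if [ ! -d src/python/pkg_name ]; then
    echo "src/python/pkg_name not found; run this from the repository root" >&2
    return 1
  fi
  # Try `git mv` first so the move is staged in Git; fall back to a plain `mv`
  # when the folder is not tracked by Git.
  git mv src/python/pkg_name "src/python/$1" 2>/dev/null ||
    mv src/python/pkg_name "src/python/$1"
}

# Example usage (my_project is a placeholder name):
#   rename_pkg my_project
```

If you do not plan to use a custom Python package, you can delete the folder instead, as step 3 notes.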

More information

Click on the links below to learn how to best use this template, and how to contribute to it.

✋ How to contribute

If you want to propose changes to the template, follow the steps below:

  1. Set up your environment by following the instructions in the Dev Setup section.
  2. Create a new branch from develop and give it a meaningful name. A good practice is to use the format <your-username>/<issue-number>-<short-description>. For example, if you are working on issue #3, you could name your branch jonjoncardoso/3-update-github-action. Remember the GitFlow workflow!
  3. Make your changes and commit them to your branch. Remember to commit often and to write meaningful commit messages. If you are working on a specific issue, you can use the following format: <gitmoji> #<issue-number> <commit-message>. For example, if you are working on issue #3, you could write 📝 #3 Update GitHub Action.
    • To add emojis on Windows, press Win + . and then select the emoji you want. On Mac, press Ctrl + ⌘ + Space.
    • You can find a list of gitmojis here. If you are not sure what to write, you can use 📝 for documentation, 🐛 for bug fixes, 🌟 for new features, and ♻️ for refactoring. You can also use 🔧 for general changes. If you are not sure, just ask!
  4. When you are done, push all your commits and then open a pull request to merge your branch into develop. You can do this by clicking on the "Compare & pull request" button on GitHub. Make sure to add a meaningful title and description to your pull request. If you are working on a specific issue, you can use the following format: #<issue-number> <pull-request-title>. For example, if you are working on issue #3, you could write #3 Update GitHub Action. Mark @jonjoncardoso as a reviewer.
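
The naming conventions in steps 2 and 3 can be sketched as two small shell helpers. The function names branch_name and commit_message are hypothetical, and the slug rule (lowercase, hyphen-separated) is an assumption inferred from the example branch name above:

```shell
# Hypothetical helper: build <your-username>/<issue-number>-<short-description>.
# $1 = GitHub username, $2 = issue number, $3 = free-text description.
branch_name() {
  # Lowercase the description, collapse runs of other characters into single
  # hyphens, then trim any leading or trailing hyphen.
  slug=$(printf '%s' "$3" | tr '[:upper:]' '[:lower:]' |
    sed -E 's/[^a-z0-9]+/-/g; s/^-//; s/-$//')
  printf '%s/%s-%s\n' "$1" "$2" "$slug"
}

# Hypothetical helper: build <gitmoji> #<issue-number> <commit-message>.
# $1 = gitmoji, $2 = issue number, $3 = commit message.
commit_message() {
  printf '%s #%s %s\n' "$1" "$2" "$3"
}

# Example usage inside your clone (assumes a develop branch exists on origin):
#   git checkout develop && git pull
#   git checkout -b "$(branch_name jonjoncardoso 3 'Update GitHub Action')"
#   git commit -am "$(commit_message 📝 3 'Update GitHub Action')"
#   git push -u origin HEAD
```

Typing the branch name and commit message by hand works just as well; the helpers only make the two formats explicit.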

🧰 Dev Setup

🐍 The Python setup

  1. Install Python 3.9 or higher on your computer.

  2. Install Anaconda or Miniconda on your computer.

  3. Create a new conda environment:

    conda create -y -n=venv-ds-workflow python=3.10.8

  4. Activate the environment and make sure you have pip installed inside that environment:

    # the exact `activate` command will vary depending on your OS
    conda activate venv-ds-workflow

    💡 Remember to activate this particular conda environment whenever you reopen VSCode/the terminal.

  5. Install the required libraries:

    pip install -r requirements.txt

Now, whenever you open a Jupyter Notebook, you should see the venv-ds-workflow kernel available. You can also run jupyter kernelspec list to see all the kernels available on your computer.
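
If the kernel or your packages seem to be missing, a quick first check is whether the environment is actually active in the current terminal. A minimal sketch, relying on the CONDA_DEFAULT_ENV variable that conda sets on activation; the helper name check_env is hypothetical:

```shell
# Hypothetical helper: report whether the venv-ds-workflow environment is active.
# `conda activate` exports CONDA_DEFAULT_ENV with the name of the active environment.
check_env() {
  if [ "${CONDA_DEFAULT_ENV:-}" = "venv-ds-workflow" ]; then
    echo "venv-ds-workflow is active"
  else
    echo "venv-ds-workflow is NOT active; run: conda activate venv-ds-workflow"
    return 1
  fi
}

# Example usage: only list the Jupyter kernels once the environment is active.
#   check_env && jupyter kernelspec list
```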

📊 The R setup

  1. Clone this repository to your computer.
  2. Open a terminal and navigate to the root of this repository.
  3. Ensure you have R version 4.2.2 or higher.
  4. Open the R console in this same directory and install the renv package:
    install.packages("renv")
  5. Run renv::restore() to install all the packages needed for this project.

The Quarto setup

If using Quarto is not your thing, you can just ignore this section. If you want to use Quarto, follow the steps below:

  1. Install Quarto on your computer.
  2. Run the following command to start the website locally:
    quarto preview . --render all --no-browser
    This will read the instructions from _quarto.yml and render the website locally.
  3. Open your browser and navigate to http://localhost:<port>/. That's it!

⚒️ (Advanced) Jon's full setup

I, @jonjoncardoso, like to use R on VSCode (WSL Ubuntu) instead of RStudio. It is a weird setup if you come from R, but it's a good setup for when you need to switch between R and Python all the time. Feel free to just ignore this stuff but if you want to replicate my setup, just follow the steps below:

  1. Install VSCode
  2. Install WSL on Windows
  3. Install WSL extension on VSCode
  4. Open VSCode and open a new WSL window (press Ctrl+Shift+P and type WSL: New Window)
  5. Open the Ubuntu terminal on VSCode and install R

When doing R

  1. Install the R extension on VSCode
  2. Install Quarto
  3. Install the Quarto extension on VSCode
  4. When running R notebooks (either .Rmd or .qmd) manually, you will see that some plots do not render with adequate size. To fix this, follow these instructions.

When doing Python

  1. Install the Python extension on VSCode
  2. Install the Jupyter extension on VSCode

I also use the following VSCode Extensions:
