diff --git a/content/blog/last_semester.md b/content/blog/last_semester.md index 0789a15d4..b3015cb4a 100644 --- a/content/blog/last_semester.md +++ b/content/blog/last_semester.md @@ -6,9 +6,10 @@ draft: false aliases: - /blog/last_semester --- -## **Last semester: Diving into the World of Causal Inference and Virtual Machines** ## -During the previous semester at Tilburg Science Hub, we were involved in a time of learning, exploring, and innovating. We've been busy with developing new content and using new tools that we're excited to share with you. +## **Last semester: Diving into the World of Causal Inference and Virtual Machines** + +During the previous semester at Tilburg Science Hub, we were involved in a time of learning, exploring, and innovating. We've been busy with developing new content and using new tools that we're excited to share with you. We have mainly been busy creating new content in the field of causal inference and also delved into the world of virtual machines. Let's take a closer look at the resources we've created in these two areas. @@ -16,29 +17,28 @@ We have mainly been busy creating new content in the field of causal inference a Causal inference has been the most important subject of our exploration. We have put together a comprehensive set of tools to guide you through this complex but captivating subject: -- [Introduction to Instrumental Variables Estimation](https://tilburgsciencehub.com/topics/analyze-data/regressions/iv/): We start with a thorough introduction to IV (Instrumental Variable) estimation, which is a fundamental concept in the field of causal inference. - -- [Doing Calculations with Regression Coefficients Using deltaMethod](https://tilburgsciencehub.com/topics/analyze-data/regressions/deltamethod/): We show you how to handle regression coefficients with precision through the deltaMethod. +- [Introduction to Instrumental Variables Estimation](../topics/Analyze/causal-inference/instrumental-variables/iv.md): We start with a thorough introduction to IV (Instrumental Variable) estimation, which is a fundamental concept in the field of causal inference. -- [Impact evaluation](https://tilburgsciencehub.com/topics/analyze-data/regressions/impact-evaluation/): We take a closer look at impact evaluation through regressions, uncovering how interventions and policies can be rigorously analyzed to make informed decisions. +- [Doing Calculations with Regression Coefficients Using deltaMethod](../topics/Analyze/Regression/linear-regression/deltamethod.md): We show you how to handle regression coefficients with precision through the deltaMethod. -- [Synthetic Controls](https://tilburgsciencehub.com/topics/analyze-data/regressions/synth-control/): Discover the power of the Synthetic Control Method, which is an invaluable tool for causal inference in diverse research scenarios. +- [Impact evaluation](../topics/Analyze/causal-inference/did/impact-evaluation.md): We take a closer look at impact evaluation through regressions, uncovering how interventions and policies can be rigorously analyzed to make informed decisions. -- [Fixed-Effects Estimation in R with the fixest Package](https://tilburgsciencehub.com/topics/analyze-data/regressions/fixest/): Lastly, we dive into fixed-effects estimation using the `fixest`` package in R, which is particularly useful for panel data analyses (and, super super fast!). 
+- [Synthetic Controls](../topics/Analyze/causal-inference/synthetic-control/synth-control.md): Discover the power of the Synthetic Control Method, which is an invaluable tool for causal inference in diverse research scenarios.
+- [Fixed-Effects Estimation in R with the fixest Package](../topics/Analyze/causal-inference/panel-data/fixest.md): Lastly, we dive into fixed-effects estimation using the `fixest` package in R, which is particularly useful for panel data analyses (and, super super fast!).
### **Virtual Machines**
Besides our focus on causal inference, we've also explored virtual machines (VMs). Think of them like a supercomputer that you can rent on-demand. We've created some building blocks related to virtual machines to help you set them up and run environments in the cloud.
-- [Configure a VM with GPUs in Google Cloud](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/config-vm-gcp/): Learn how to use the computing power of GPUs on Google Cloud by setting up a customized VM to meet your research requirements.
+- [Configure a VM with GPUs in Google Cloud](../topics/Automation/Replicability/cloud-computing/config-VM-GCP.md): Learn how to use the computing power of GPUs on Google Cloud by setting up a customized VM to meet your research requirements.
-- [Import and run a Python environment on Google cloud with Docker](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/google_cloud_docker/): Explore the world of containerization and Docker to import and run Python environments on Google Cloud, enhancing the reproducibility and efficiency of your work.
-
-- [Export a Python environment with Docker and share it through Docker Hub](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/dockerhub/): Learn to export Python environments with Docker and streamline collaboration by sharing them on Docker Hub, ensuring easy access for fellow researchers.
+- [Import and run a Python environment on Google cloud with Docker](../topics/Automation/Replicability/cloud-computing/google_cloud_docker.md): Explore the world of containerization and Docker to import and run Python environments on Google Cloud, enhancing the reproducibility and efficiency of your work.
+- [Export a Python environment with Docker and share it through Docker Hub](../topics/Automation/Replicability/Docker/dockerhub.md): Learn to export Python environments with Docker and streamline collaboration by sharing them on Docker Hub, ensuring easy access for fellow researchers.
### **Enhancing Your Research Skills in Causal Inference and Virtual Machines**
-Our aim is to support your exploration of causal inference and virtual machines. These resources are designed to equip you with the knowledge and tools needed to excel in your research.
-Curious about what we will be working on next semester? Keep an eye on our blog!
\ No newline at end of file
+Our aim is to support your exploration of causal inference and virtual machines. These resources are designed to equip you with the knowledge and tools needed to excel in your research.
+
+Curious about what we will be working on next semester? Keep an eye on our blog!
diff --git a/content/examples/keywords-finder.md b/content/examples/keywords-finder.md index 257507441..3144d2b9f 100644 --- a/content/examples/keywords-finder.md +++ b/content/examples/keywords-finder.md @@ -11,19 +11,19 @@ aliases: ## Overview -This is a template for a reproducible [Dockerized](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/docker/) application, based on [R](https://tilburgsciencehub.com/topics/configure-your-computer/statistics-and-computation/r/), that **finds keywords and/or sentences in multiple `PDF` files**. +This is a template for a reproducible [Dockerized](../topics/Automation/Replicability/Docker/docker.md) application, based on [R](../topics/Computer-Setup/software-installation/RStudio), that **finds keywords and/or sentences in multiple `PDF` files**. {{% summary %}} -* We use `R` to first, convert the `PDF` files into plain text files (`.txt`). -* Then, a second `R` script searches the keywords and/or sentences that are previously defined in those converted text files. -* Matches are reported into an Excel file, reporting what keyword or sentence was found in which file. -{{% /summary %}} - -`Docker` is used to run the above mentioned process inside an isolated container (*see [this building block](https://tilburgsciencehub.com/topics/configure-your-computer/automation-and-workflows/docker/) to learn more about containers*). By doing so, you can run this application without even having to have `R` installed in your computer plus it will also run smoothly, regardless of what operating system (OS) you're using. +- We use `R` to first, convert the `PDF` files into plain text files (`.txt`). +- Then, a second `R` script searches the keywords and/or sentences that are previously defined in those converted text files. +- Matches are reported into an Excel file, reporting what keyword or sentence was found in which file. + {{% /summary %}} +`Docker` is used to run the above mentioned process inside an isolated container (_see [this building block](../topics/Automation/Replicability/Docker/) to learn more about containers_). By doing so, you can run this application without even having to have `R` installed in your computer plus it will also run smoothly, regardless of what operating system (OS) you're using. ## Motivating Example + In many situations, we make use of `crtl (or "Command" in Mac) + F` to find words or sentences in `PDF` files. However, this can be highly time-consuming, especially if needed to apply in multiple files and/or different keywords or sentences. For instance, we first applied this application in legal research, where we needed to check in over 10,000 court rulings, which ones made reference to a specific law. ## Get The Workflow diff --git a/content/examples/reproducible-workflow-airbnb.md b/content/examples/reproducible-workflow-airbnb.md index ee7422d37..5b4e14d70 100644 --- a/content/examples/reproducible-workflow-airbnb.md +++ b/content/examples/reproducible-workflow-airbnb.md @@ -20,16 +20,17 @@ We've crafted this project to run: - platform-independent (Mac, Linux, Windows) - across a diverse set of software programs (Stata, Python, R) - producing an entire (mock) paper, including modules that - - download data from Kaggle, - - prepare data for analysis, - - run a simple analysis, - - produce a paper with output tables and figures. + - download data from Kaggle, + - prepare data for analysis, + - run a simple analysis, + - produce a paper with output tables and figures. 
## How to run it + ### Dependencies - Install [Python](/get/python/). - - Anaconda is recommended. [Download Anaconda](https://www.anaconda.com/distribution/). + - Anaconda is recommended. [Download Anaconda](https://www.anaconda.com/download). - check availability: type `anaconda --version` in the command line. - Install Kaggle package. - [Kaggle API](https://github.com/Kaggle/kaggle-api) instruction for installation and setup. @@ -61,11 +62,9 @@ Open your command line tool: - if not, type `cd yourpath/airbnb-workflow` to change your directory to `airbnb-workflow` - Type `make` in the command line. - - ### Directory structure -Make sure `makefile` is put in the present working directory. The directory structure for the Airbnb project is shown below. +Make sure `makefile` is put in the present working directory. The directory structure for the Airbnb project is shown below. ```text ├── data @@ -108,7 +107,6 @@ Make sure `makefile` is put in the present working directory. The directory stru - **src**: all source codes. - Three parts: **data_preparation**, **analysis**, and **paper** (including TeX files). - {{% codeblock %}} + ```bash # Create new empty environment # (this will ask where to save the environment, default location is usually fine ) @@ -91,6 +92,7 @@ conda install r-data.table # Close environment (switches back to base) conda deactivate ``` + {{% /codeblock %}} ## Using the code @@ -102,10 +104,11 @@ If Conda is installed, the example can be run (line by line) in a terminal to se - If you've installed Python based on [these instructions](/topics/configure-your-computer/statistics-and-computation/python/), Conda should be already available - Alternatively, detailed instructions to install Conda are provided [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html). -Note that with the full Anaconda install environments can also be managed using a graphical interface. Here, we focus only on the command line to have clear steps. Instructions for the graphical interface are provided [here](https://docs.anaconda.com/anaconda/navigator/topics/manage-environments/). +Note that with the full Anaconda install environments can also be managed using a graphical interface. Here, we focus only on the command line to have clear steps. Instructions for the graphical interface are provided [here](https://docs.anaconda.com/navigator/tutorials/manage-environments/). ## Examples -1. Add *conda-forge* channel to get newer R version + +1. Add _conda-forge_ channel to get newer R version # Add 'conda-forge' channel (provides more recent versions, # and a lot of additional software) @@ -116,13 +119,12 @@ Note that with the full Anaconda install environments can also be managed using conda activate test_env conda install r-base=4.0.3 - -2. Search for available packages +2. Search for available packages # lists packages starting with r-data conda search r-data* -3. Export environment into file (to be installed on a different machine) +3. Export environment into file (to be installed on a different machine) # Activate environment to be saved conda activate test_env @@ -134,29 +136,28 @@ Note that with the full Anaconda install environments can also be managed using # (required if environment ported to different OS, see below) conda env export --from-history > test_env.yml -4. Import and set up environment from file +4. Import and set up environment from file # Create environment based on yml-file (created as above, in same directory) conda env create -f test_env.yml -5. 
Deactivate environment (gets back to base) +5. Deactivate environment (gets back to base) conda deactivate -6. Remove environment +6. Remove environment conda env remove --name test_env Note that there's a bug; software environments sometimes can only be removed if at least one package was installed. -7. List installed environments +7. List installed environments conda info --envs - ## OS dependence -Sometimes we work across different operating systems. For example, you may develop and test code on a Windows desktop, before then running it on a server that runs on Linux. As the Conda environment contains all dependencies, it will also list some low level tools that may not be available on a different OS. In this case, when trying to set up the same environment from a *.yml* file created on a different OS it will fail. +Sometimes we work across different operating systems. For example, you may develop and test code on a Windows desktop, before then running it on a server that runs on Linux. As the Conda environment contains all dependencies, it will also list some low level tools that may not be available on a different OS. In this case, when trying to set up the same environment from a _.yml_ file created on a different OS it will fail. {{% example %}} Check the output of `conda list`, which will list various dependencies you never explicitly requested. @@ -164,7 +165,7 @@ Check the output of `conda list`, which will list various dependencies you never In this case, it is possible to use the `--from-history` option (see Example 3 above). When creating the environment `.yml` file, this option makes it so that the generated `.yml` file will only contain the packages explicitly requested (i.e. those you at one point added through `conda install`), while the lower level dependencies (e.g. compiler, BLAS library) are not added. If the requested packages exist for the different OS, this usually should work as the low level dependencies will be automatically resolved when setting up the environment. -## Additional Resources +## Additional Resources 1. More information on Conda environments: [https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html](https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html) diff --git a/content/topics/Automation/Replicability/cloud-computing/config-VM-GCP.md b/content/topics/Automation/Replicability/cloud-computing/config-VM-GCP.md index 5026c92e1..9131242bb 100644 --- a/content/topics/Automation/Replicability/cloud-computing/config-VM-GCP.md +++ b/content/topics/Automation/Replicability/cloud-computing/config-VM-GCP.md @@ -1,5 +1,5 @@ --- -title: "Configure a VM with GPUs in Google Cloud" +title: "Configure a VM with GPUs in Google Cloud" description: "Learn how to configure code in a Google Cloud instance with GPUs within a Docker environment" keywords: "Docker, Environment, Python, Jupyter notebook, Google cloud, Cloud computing, GPU, Virtual Machine, Instance, Memory" weight: 3 @@ -7,7 +7,7 @@ author: "Fernando Iscar" authorlink: "https://www.linkedin.com/in/fernando-iscar/" draft: false date: 2023-06-05 #updated 2023-09-15 -aliases: +aliases: - /run/vm-on-google-cloud --- @@ -18,7 +18,7 @@ In this building block, you will discover how to create and configure a simple a After going through this guide, you'll get more familiar with: - Establishing a VM instance in Google Cloud with optimized configurations. 
-- The usefulness of [Docker](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/docker/) in combination with cloud virtual machines. +- The usefulness of [Docker](../Docker/docker.md) in combination with cloud virtual machines. - NVIDIA drivers to access GPU power. ## Initialize a new instance @@ -50,9 +50,9 @@ You'll encounter four primary machine categories to select from: {{% warning %}} -**Save on unnecessary costs!** +**Save on unnecessary costs!** -GPU-enabled VMs are vital for deep learning tasks like language models. However, for other uses, GPUs are redundant and increase expenses. +GPU-enabled VMs are vital for deep learning tasks like language models. However, for other uses, GPUs are redundant and increase expenses. See an example of how suboptimal GPU usage can slow compute time [here](https://rstudio-pubs-static.s3.amazonaws.com/15192_5965f6c170994ebb972deaf18f1ddf34.html). @@ -60,7 +60,6 @@ See an example of how suboptimal GPU usage can slow compute time [here](https:// In case you are unsure, a good choice to balance between price and performance would be to select an **NVIDIA T4 n1-standard-8** machine. It's packed with 30GB of RAM and a GPU. If we would need more vCPUs or memory, we can improve it by selecting a customized version, under the **"Machine type"** header. -

Adjust the machine configuration to your needs
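If you prefer the command line to the Cloud Console, the same machine can be sketched as a single `gcloud` call. Treat the snippet below as illustrative only: the instance name and zone are placeholders, and the machine type and GPU simply mirror the NVIDIA T4 n1-standard-8 suggestion above.

```bash
# Illustrative sketch -- instance name and zone are placeholders.
gcloud compute instances create my-research-vm \
  --zone=europe-west4-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE  # GPU instances cannot live-migrate, so they must terminate on host maintenance
```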
@@ -83,9 +82,9 @@ In the top right corner, you'll see a real-time **pricing summary**. As you adju #### Boot disk settings -As we scroll down through the configuration process, we'll skip to [Boot Disk settings](https://cloud.google.com/compute/docs/disks). +As we scroll down through the configuration process, we'll skip to [Boot Disk settings](https://cloud.google.com/compute/docs/disks). -Think of your boot disk as your instance's storage locker - here, you get to pick its type (standard or SSD), size, and the VM image (Operating System) you want to load on it. +Think of your boot disk as your instance's storage locker - here, you get to pick its type (standard or SSD), size, and the VM image (Operating System) you want to load on it. A bigger boot disk equals more space for data and apps. So, if you're playing with chunky datasets, you'll want to upgrade your storage. In our case, we'll crank up the boot disk size to 100GB, which is a good starting point. In fact, boosting your storage won't be very costly. @@ -94,14 +93,13 @@ A bigger boot disk equals more space for data and apps. So, if you're playing wi
Boot disk configuration summary

- If you're considering integrating GPUs into your instance, it's recommended to switch the default boot disk from **Debian** to **Ubuntu**. **Ubuntu** simplifies the installation of proprietary software drivers and firmware, making the process of installing necessary [NVIDIA drivers](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html#ubuntu-lts) for GPU utilization significantly smoother. This could save you time and effort in the long run. We will cover this topic later. #### Firewall Rules -As you scroll down, you will find the [Firewall Rules](https://cloud.google.com/compute/docs/samples/compute-firewall-create) section. Here, you will see the default settings that manage your network traffic flow. HTTP or HTTPS? Go for HTTPS whenever you can. It's the safer bet, wrapping your data transfers in an encryption layer for added security. +As you scroll down, you will find the [Firewall Rules](https://cloud.google.com/compute/docs/samples/compute-firewall-create) section. Here, you will see the default settings that manage your network traffic flow. HTTP or HTTPS? Go for HTTPS whenever you can. It's the safer bet, wrapping your data transfers in an encryption layer for added security. However, many developers activate both HTTP and HTTPS for broader compatibility, handling secure (HTTPS) and insecure (HTTP) requests alike. But let's keep it modern and secure, HTTPS should be your go-to, especially when handling sensitive data. @@ -123,12 +121,10 @@ After fine-tuning your instance's setup and firewall rules, you can go ahead and ## Establish your environment using Docker -At this point, we strongly recommend you [set up Docker](https://tilburgsciencehub.com/topics/configure-your-computer/automation-and-workflows/docker/) -as a great tool to easily [deploy your projects and environments within your newly created virtual machine](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/dockerhub/). If you are not familiar with the advantages that Docker offers in terms of productivity and open science value for your project, check out our building block on [Docker for reproducible research](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/docker/) - - -You can check Docker's setup process in a Google Cloud virtual machine by visiting [this building block](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/google_cloud_docker/), where you'll find more details as well as a setup script that will get you Docker up and running in your virtual machine in the blink of an eye. After you're done, come back here to move on to the next step. +At this point, we strongly recommend you [set up Docker](../Docker/docker.md) +as a great tool to easily [deploy your projects and environments within your newly created virtual machine](../Docker/dockerhub.md). If you are not familiar with the advantages that Docker offers in terms of productivity and open science value for your project, check out our building block on [Docker for reproducible research](../Docker/docker.md) +You can check Docker's setup process in a Google Cloud virtual machine by visiting [this building block](../Docker/google_cloud_docker.md), where you'll find more details as well as a setup script that will get you Docker up and running in your virtual machine in the blink of an eye. After you're done, come back here to move on to the next step. 
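The building block linked above ships a ready-made setup script, so you should not need to install Docker by hand. Purely as a sketch of what that setup boils down to on an Ubuntu VM (the steps below rely on Docker's convenience script and are an assumption, not the authoritative procedure):

```bash
# Minimal sketch of a Docker install on an Ubuntu VM -- prefer the linked setup script.
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Let the current user run docker without sudo (log out and back in afterwards)
sudo usermod -aG docker $USER

# Quick smoke test
docker run --rm hello-world
```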
## Install the NVIDIA drivers and container toolkit @@ -139,9 +135,11 @@ Besides the regular drivers, if you have your project containerized within Docke After completing the installation process of the NVIDIA container toolkit, you can run the following in your virtual machine terminal to check if the installation was successful. In that case, you will see in your command line something resembling the image below. {{% codeblock %}} + ```bash sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi ``` + {{% /codeblock %}}
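For reference, the toolkit installation that precedes this check usually comes down to a few commands on Ubuntu. This is only a sketch: it assumes the NVIDIA apt repository has already been added by following NVIDIA's installation guide, which remains the authoritative source.

```bash
# Sketch only -- assumes the NVIDIA container toolkit apt repository is already configured.
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```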

@@ -151,11 +149,12 @@ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi ## Confirm GPUs availability -To ensure that GPUs are accessible for your tasks, you can use specific commands depending on the framework you're using. +To ensure that GPUs are accessible for your tasks, you can use specific commands depending on the framework you're using. For instance, let's say you're working on a Python deep learning project. If you are using `PyTorch`, the following command can be used to check if CUDA (GPU acceleration) is currently available, returning a boolean value. {{% codeblock %}} + ```python import torch @@ -165,11 +164,13 @@ else: print("GPUs not available") ``` + {{% /codeblock %}} If you are working with other common deep learning libraries like `Tensorflow`, you could verify it this way: {{% codeblock %}} + ```python import tensorflow as tf @@ -180,6 +181,7 @@ else: print("GPUs not available") ``` + {{% /codeblock %}} Bear in mind that the particular framework you are using within your project, such as `Pytorch` or `Tensorflow` may have specific additional requirements to make use of your machine's GPUs on top of the ones already presented in this building block. @@ -187,27 +189,25 @@ Bear in mind that the particular framework you are using within your project, su {{% tip %}} **Working with heavy files or having memory issues?** -Your Virtual Machine can be monitored, this will be useful especially when the tasks you are running are memory-demanding. - -Also, oftentimes you'll be working with large files and you'll need to use the so-called "buckets" to access extra storage. The ways to establish the connection with them might not be that intuitive, but luckily for you, you'll learn these and more useful skills in our [next building block](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/mem-storage-gcp/) on the topic! +Your Virtual Machine can be monitored, this will be useful especially when the tasks you are running are memory-demanding. +Also, oftentimes you'll be working with large files and you'll need to use the so-called "buckets" to access extra storage. The ways to establish the connection with them might not be that intuitive, but luckily for you, you'll learn these and more useful skills in our [next building block](monitor-memory-vm.md) on the topic! {{% /tip %}} - {{% summary %}} - **Google Cloud VM Setup:** - - Register on Google Cloud. - - Create a Virtual Machine that satisfies your computational power needs. - - Select the most appropriate Boot Disk and Firewall Rules + - Register on Google Cloud. + - Create a Virtual Machine that satisfies your computational power needs. + - Select the most appropriate Boot Disk and Firewall Rules - **Enable reproducibility and access the GPU power** - - Install Docker on the VM to aim for reproducibility. - - Install NVIDIA drivers and the container toolkit for GPUs - - Confirm GPU availability + - Install Docker on the VM to aim for reproducibility. 
+ - Install NVIDIA drivers and the container toolkit for GPUs + - Confirm GPU availability {{% /summary %}} @@ -216,4 +216,4 @@ Also, oftentimes you'll be working with large files and you'll need to use the s - Google Cloud Compute Engine [guides](https://cloud.google.com/compute/docs/instances) - CUDA Toolkit [download](https://developer.nvidia.com/cuda-toolkit) - PyTorch [Documentation](https://pytorch.org/) -- Tensorflow [Documentation](https://www.tensorflow.org/api_docs/python/tf) \ No newline at end of file +- Tensorflow [Documentation](https://www.tensorflow.org/api_docs/python/tf) diff --git a/content/topics/Automation/Replicability/cloud-computing/configure-git.md b/content/topics/Automation/Replicability/cloud-computing/configure-git.md index 658ef5062..9380d63ef 100644 --- a/content/topics/Automation/Replicability/cloud-computing/configure-git.md +++ b/content/topics/Automation/Replicability/cloud-computing/configure-git.md @@ -12,7 +12,7 @@ aliases: # Overview -This building block takes you through the steps to configure Git on a new virtual machine (VM) on Research Cloud once you have finished setting up your account and creating a workspace following [these steps](https://tilburgsciencehub.com/topics/configure-your-computer/infrastructure-choice/getting-started-research-cloud/). +This building block takes you through the steps to configure Git on a new virtual machine (VM) on Research Cloud once you have finished setting up your account and creating a workspace following [these steps](getting-started-research-cloud.md). ## Setting up SSH key authentication @@ -24,17 +24,16 @@ You can access and write data in remote repositories on Github using SSH (Secure - Then, create your SSH keys with the `ssh-keygen` command. Click enter to save the key in the default directory specified or mention an alternative directory if you would like to save it elsewhere. Then you may enter a passphrase for your private key, providing an additional layer of security. - Now, we have generated two keys that are required for SSH authentication: private key (id_rsa) and the public key (id_rsa.pub). - - {{% tip %}} - You could also create additional storage drive and link it to the instance when creating one. Then make sure to save the SSH key files in this external drive. This will be helpful when for example, you delete your current VM and mount this external drive to a new VM. In this case, you need not configure git from scratch again. +You could also create additional storage drive and link it to the instance when creating one. Then make sure to save the SSH key files in this external drive. This will be helpful when for example, you delete your current VM and mount this external drive to a new VM. In this case, you need not configure git from scratch again. {{% /tip %}} ### Step 2: Configure SSH + If you configure multiple keys for an SSH client and connect to an SSH server, the client can try the keys one at a time until the server accepts one but this process doesn’t work because of how Git SSH URLs are structured. Hence, you must configure SSH to explicitly use a specific key file. To do this, edit your `~/.ssh/config` file using the `nano` command and copy-paste the following and press F3 to save. ``` @@ -48,13 +47,12 @@ If you configure multiple keys for an SSH client and connect to an SSH server, t Make sure to change the ‘IdentityFile’ to the directory where the id_rsa key is saved. 
{{% /warning %}} - - ### Step 3: Adding a New SSH key to your Github Account + - Open the public key using the `cat` command and copy the SSH public key to your clipboard.

Underlined in green: The token to access Jupyter Notebook.

-As you can see in the image above, the token consists of a string of alphanumeric characters comprising all that comes after the equal sign ("="). +As you can see in the image above, the token consists of a string of alphanumeric characters comprising all that comes after the equal sign ("="). {{% warning %}} @@ -258,8 +263,7 @@ Note that in the image above the token is not fully depicted. You must copy all Now simply introduce this token in the box at the top of the landing page and click on "Log in" to access your environment. - -You can also paste your token in the "Token" box at the bottom of the page and generate a password with it by creating a new password in the "New Password" box. +You can also paste your token in the "Token" box at the bottom of the page and generate a password with it by creating a new password in the "New Password" box. This way the next time you revisit your environment you will not need the full token, instead, you will be able to log in using your password. This is particularly useful in case you lose access to your token or this is not available to you at the moment. @@ -278,20 +282,3 @@ By the end of this building block, you should be able to: - Learn how to [Configure a VM with GPUs in Google Cloud](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/config-vm-gcp/). - Cheat-sheat on [Docker Commands](https://docs.docker.com/get-started/docker_cheatsheet.pdf) - - - - - - - - - - - - - - - - - diff --git a/content/topics/Automation/Replicability/cloud-computing/hpc_tiu.md b/content/topics/Automation/Replicability/cloud-computing/hpc_tiu.md index 25b877f58..44ed0e269 100644 --- a/content/topics/Automation/Replicability/cloud-computing/hpc_tiu.md +++ b/content/topics/Automation/Replicability/cloud-computing/hpc_tiu.md @@ -9,12 +9,13 @@ authorlink: "https://nl.linkedin.com/in/roshinisudhaharan" aliases: - /configure/research-cloud --- + # HPC TiU: Overview + HPC TiU Computing is a service by Tilburg University's Library and IT Services (LIS). The environment consists of powerful servers on which researchers can interactively work on data and run statistical computations. This is a Windows-based environment, which looks and feels like a regular Windows desktop. It's accessible through this [web portal](https://rdweb.campus.uvt.nl/RDWeb/webclient/). -

Pros and Cons of Using HPC TiU
@@ -35,7 +36,6 @@ Once your request is approved, you can access HPC TiU: - On a **macOS** computer use Microsoft Remote Desktop Client for Mac (install from App Store), use this web feed [URL](https://rdweb.campus.uvt.nl/RDWeb/feed/webfeed.aspx) and sign in with your TiU credentials (campus\TiU username) - {{% warning %}} **Scheduled maintenance** @@ -47,20 +47,25 @@ Moreover, each month during the maintenance period, **all personal data stored l {{% /warning %}} ### Where to Save your Files? + **M / S / T drive** + - Use M / S / T disc for storing data that must be retained. **D drive (Scratch)** + - Use only the D-disk (Scratch) for temporary data storage. -During standard (monthly) maintenance, the D drive is deleted. + During standard (monthly) maintenance, the D drive is deleted. **C drive** + - The C drive is only intended for the Operating System and Applications. **E drive** + - The E-disk has been used since 12-10-2018 to store the local user profiles. (This is because of possible filling up of the C-drive and thus undermining the entire server). - The E-disk contains the user profiles (with, for example, python packages). This E-disk is automatically deleted on restart. {{% tip %}} -Here are some additional cloud computing solutions you might want to check out to level up your research infrastructure: [SURFsara's LISA Cluster](http://tilburgsciencehub.com/topics/configure-your-computer/infrastructure-choice/lisa_cluster/) and [SURFsara's Cartesius Cluster](http://tilburgsciencehub.com/topics/configure-your-computer/infrastructure-choice/cartesius_cluster/) +Here are some additional cloud computing solutions you might want to check out to level up your research infrastructure: [SURFsara's LISA Cluster](lisa_cluster.md) and [SURFsara's Cartesius Cluster](cartesius_cluster.md) {{% /tip %}} diff --git a/content/topics/Automation/Replicability/cloud-computing/monitor-memory-vm.md b/content/topics/Automation/Replicability/cloud-computing/monitor-memory-vm.md index 6ad9fc6e4..d0fbd76c3 100644 --- a/content/topics/Automation/Replicability/cloud-computing/monitor-memory-vm.md +++ b/content/topics/Automation/Replicability/cloud-computing/monitor-memory-vm.md @@ -1,5 +1,5 @@ --- -title: "Monitor and solve memory constraints in your computational environment" +title: "Monitor and solve memory constraints in your computational environment" description: "After configuring a Google Cloud instance with GPUs, learn to monitor and handle memory issues" keywords: "Environment, Python, Jupyter notebook, Google cloud, Cloud computing, Cloud storage, GPU, Virtual Machine, Instance, Memory" weight: 11 @@ -7,13 +7,13 @@ author: "Fernando Iscar" authorlink: "https://www.linkedin.com/in/fernando-iscar/" draft: false date: 2023-10-02 -aliases: +aliases: - /handle/memory-issues --- ## Overview -In any local or virtual machine, monitoring and managing memory allocation is crucial. Regardless of how advanced or powerful your machine might be, there are always potential bottlenecks, especially when working with memory-intensive tasks. +In any local or virtual machine, monitoring and managing memory allocation is crucial. Regardless of how advanced or powerful your machine might be, there are always potential bottlenecks, especially when working with memory-intensive tasks. 
In this guide, we delve deep into: @@ -26,13 +26,13 @@ In this guide, we delve deep into: The commands `htop` and `nvtop` are designed for Linux-based environments (such as Ubuntu or Debian) given their widespread use in virtual machine contexts due to their open-source nature, robust security, and versatility. -If you wonder how to set-up a Virtual Machine with a Linux system, go through our [building block](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/config-vm-gcp/)! +If you wonder how to set-up a Virtual Machine with a Linux system, go through our [building block](config-VM-GCP.md)! {{% /tip %}} ## Handling memory allocation issues -It's not uncommon for systems to run out of memory, especially when dealing with large datasets or computation-heavy processes. When a system can't allocate required memory, it can result in runtime errors. +It's not uncommon for systems to run out of memory, especially when dealing with large datasets or computation-heavy processes. When a system can't allocate required memory, it can result in runtime errors. While the straightforward solution might seem to be upgrading hardware, it isn't always feasible. Hence, the necessity to monitor and manage memory efficiently. @@ -57,11 +57,13 @@ It allows us to sort by the task we're most interested in monitoring by pressing To install `htop` in your VM instance, you can use the following command: {{% codeblock %}} + ```bash $ sudo apt install htop # or: $ sudo apt-get install htop ``` + {{% /codeblock %}} You can then run `htop` by simply typing `htop` in your terminal. @@ -82,6 +84,7 @@ $ sudo apt install nvtop # or: $ sudo apt-get install nvtop ``` + {{% /codeblock %}} With `nvtop`, you can monitor GPU usage by typing `nvtop` into your terminal. @@ -95,7 +98,7 @@ Overloading system memory can lead to unsaved data loss. Regularly save your wor {{% /warning %}} -### Practical approaches +### Practical approaches There are several practical solutions to avoid running out of memory. These are some common strategies: @@ -151,9 +154,9 @@ An illustration of creating a `DataLoader` for a text dataset, using a tokenizer for batch in dataloader: input_ids = batch['input_ids'].squeeze() attention_mask = batch['attention_mask'].squeeze() - + output = bert_sc_pa(input_ids=input_ids, attention_mask=attention_mask) - + scores = output.logits predicted_pa = torch.argmax(scores, dim=1).cpu().numpy() predictions_pa.extend(predicted_pa) @@ -166,10 +169,10 @@ Adjusting the `batch_size` parameter balances memory usage against processing ti {{% /tip %}} -- **Efficient Data Structures and Algorithms:** A wise choice in data structures and algorithm design can substantially cut down memory usage. The selection depends on your data's nature and your go-to operations. +- **Efficient Data Structures and Algorithms:** A wise choice in data structures and algorithm design can substantially cut down memory usage. The selection depends on your data's nature and your go-to operations. {{% example %}} -Take hash tables as an example, they boast constant time complexity for search operations, becoming a superior option for substantial datasets. +Take hash tables as an example, they boast constant time complexity for search operations, becoming a superior option for substantial datasets. 
In Python, this translates to choosing dictionaries over lists when wrestling with large datasets: @@ -182,14 +185,14 @@ In Python, this translates to choosing dictionaries over lists when wrestling wi - **Parallelizing your Work:** Divide the task among multiple identical instances, each running a part of the code. This approach is particularly useful when your code involves training or using multiple machine-learning models. For instance, instead of running three BERT models sequentially on one instance, distribute them across three instances. -Remember that beyond these strategies, it's always possible to leverage the scalability and flexibility of cloud services such as Google Cloud. These services allow for a dynamic allocation of resources according to your needs. +Remember that beyond these strategies, it's always possible to leverage the scalability and flexibility of cloud services such as Google Cloud. These services allow for a dynamic allocation of resources according to your needs. {{% summary %}} - **Memory Management:** - - Monitor with `htop` (CPU) and `nvtop` (GPU). - - Implement batching, efficient data structures and algorithms, and use job parallelization to handle memory issues. + - Monitor with `htop` (CPU) and `nvtop` (GPU). + - Implement batching, efficient data structures and algorithms, and use job parallelization to handle memory issues. {{% /summary %}} @@ -198,4 +201,4 @@ Remember that beyond these strategies, it's always possible to leverage the scal - Google Cloud [Memory-Optimized](https://cloud.google.com/compute/docs/memory-optimized-machines) machines - Memory management [Python Documentation](https://docs.python.org/3/c-api/memory.html) - Machine type [recommendations for VM instances](https://cloud.google.com/compute/docs/instances/apply-machine-type-recommendations-for-instances) -- BERT [GitHub repository](https://github.com/google-research/bert) \ No newline at end of file +- BERT [GitHub repository](https://github.com/google-research/bert) diff --git a/content/topics/Automation/Replicability/cloud-computing/rstudio-aws.md b/content/topics/Automation/Replicability/cloud-computing/rstudio-aws.md index 55feee70e..56b34fe21 100644 --- a/content/topics/Automation/Replicability/cloud-computing/rstudio-aws.md +++ b/content/topics/Automation/Replicability/cloud-computing/rstudio-aws.md @@ -7,10 +7,9 @@ author: "Roshini Sudhaharan" authorlink: "https://nl.linkedin.com/in/roshinisudhaharan" draft: false aliases: -- /run/r-in-the-cloud -- /run/r-on-aws -- /reproduce/rstudio-on-aws - + - /run/r-in-the-cloud + - /run/r-on-aws + - /reproduce/rstudio-on-aws --- # Run RStudio on AWS using Docker @@ -19,9 +18,10 @@ Seeking to run R in the cloud? Use this easy step-by-step guide to get started! ## Overview -While cloud services like [Amazon Web Services (AWS)](https://tilburgsciencehub.com/topics/more-tutorials/running-computations-remotely/cloud-computing/) are great for big data processing and storage, it doesn’t give you immediate access to tools like R and RStudio. With [Docker](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/docker/), though, you can launch R and RStudio on AWS without trouble. +While cloud services like [Amazon Web Services (AWS)](https://tilburgsciencehub.com/topics/more-tutorials/running-computations-remotely/cloud-computing/) are great for big data processing and storage, it doesn’t give you immediate access to tools like R and RStudio. 
With [Docker](../Docker/docker.md), though, you can launch R and RStudio on AWS without trouble. What are the benefits? + - Avoid dependency issues and foster reproducible research - Specify software versions which always work, regardless of which operating system you use - Setup virtual computers for your colleagues - so that they can work as productively as you do! @@ -39,7 +39,8 @@ In this building block, we use DockerHub -- a hosted repository service like Git 1. [Launch and connect to an AWS instance](https://tilburgsciencehub.com/topics/more-tutorials/running-computations-remotely/launch-instance/) 2. Install `docker` on the instance: -{{% codeblock %}} + {{% codeblock %}} + ```bash # Update the packages on your instance $ sudo yum update -y @@ -50,12 +51,14 @@ $ sudo service docker start # Add the ec2-user to the docker group so you can execute Docker commands without using sudo. $ sudo usermod -a -G docker ec2-user ``` + {{% /codeblock %}} 3. Install `docker-decompose` on the instance: {{% codeblock %}} -```bash + +````bash # Copy the appropriate docker-compose binary from GitHub: sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose @@ -121,7 +124,8 @@ The docker-compose.yml file contains the rules required to configure the image a {{% codeblock %}} ```bash docker-compose -f docker-compose.yml run --name rstudio --service-ports rstudio -``` +```` + {{% /codeblock %}} {{% tip %}} @@ -129,6 +133,7 @@ Make sure to change directory to the location of the docker-compose.yml file bef {{% /tip %}} ### Steps 8-9: Configure security group and launch R + 8. Open the `8787` web server port on EC2 We're almost there! In order to get access to R on the web interface you need to configure the security group which controls the traffic that is allowed to enter and leave the resources (here: an EC2 instance) that it is associated with. @@ -136,15 +141,15 @@ We're almost there! In order to get access to R on the web interface you need to For each security group, you add rules that control the traffic based on protocols and port numbers. There are separate sets of rules for inbound traffic and outbound traffic. The RStudio server is fully accessible only when port `8787` is open. This involves adding inbound rules on the instance. - Go to AWS management console and security groups -![](../images/open-portal1.gif) + ![](../images/open-portal1.gif) - Edit inbound rules: IP version = IPv4; Type = Custom TCP; Port range = `8787` -![](../images/open-portal3.gif) - + ![](../images/open-portal3.gif) 9. Launch R + - Copy the Public IPv4 DNS address for instance IP - Open a new tab in your browser and paste the URL, including the port number, for example: `http://:8787` -![](../images/open-r.gif) + ![](../images/open-r.gif) - To login, enter `rstudio` as username and the password specified earlier in the docker-compose file. 
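If you just want to see the moving parts in one place: the compose service used above is roughly equivalent to the single `docker run` call sketched below. The password is a placeholder, the image is the public `rocker/rstudio` image, and the repository's own docker-compose.yml remains the recommended route.

```bash
# Rough docker-run equivalent of the compose service (password is a placeholder).
docker run -d --name rstudio \
  -p 8787:8787 \
  -e PASSWORD=yourpassword \
  rocker/rstudio
```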
diff --git a/content/topics/Automation/Replicability/cloud-computing/running-computations-remotely/run-scripts.md b/content/topics/Automation/Replicability/cloud-computing/running-computations-remotely/run-scripts.md index eea3f38c3..a8ade3615 100644 --- a/content/topics/Automation/Replicability/cloud-computing/running-computations-remotely/run-scripts.md +++ b/content/topics/Automation/Replicability/cloud-computing/running-computations-remotely/run-scripts.md @@ -17,6 +17,7 @@ aliases: If you opted for the default Amazon Linux 2 AMI, you probably need to install some additional software to be able to execute your code files. To install Python 3 and any other packages on your EC2 instance, run the following commands: {{% codeblock %}} + ```bash # install Python sudo yum install python3 @@ -27,14 +28,15 @@ sudo python3 -m pip install # install multiple packages from txt file sudo python3 -m pip install -r requirements.txt ``` + {{% /codeblock %}} Next, you can run your scripts from the command line like you're used to with `python3 .py`. {{% tip %}} -One of the key advantages of using a VM is that you can leave it running indefinitely (even if you shut down your laptop!). Take a look at the [task scheduling](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/automate-your-workflow/task-scheduling/) building block to learn how to use crontab to schedule recurring tasks (e.g., run a Python script that scrapes a website every day at 3 AM). Keep in mind that the time at your VM may differ from your local time because the data center that hosts the VM is located in another time zone (tip: run `date` to find out the UTC time). +One of the key advantages of using a VM is that you can leave it running indefinitely (even if you shut down your laptop!). Take a look at the [task scheduling](../../../automation-tools/task-automation/task-scheduling.md) building block to learn how to use crontab to schedule recurring tasks (e.g., run a Python script that scrapes a website every day at 3 AM). Keep in mind that the time at your VM may differ from your local time because the data center that hosts the VM is located in another time zone (tip: run `date` to find out the UTC time). {{% /tip %}} {{% tip %}} -Check out [this building block](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/rstudio-aws/) to learn how to use [Docker](https://tilburgsciencehub.com/topics/automate-and-execute-your-work/reproducible-work/docker/) to launch R on the instance to run your R scripts. +Check out [this building block](../rstudio-aws.md) to learn how to use [Docker](../../Docker/docker.md) to launch R on the instance to run your R scripts. {{% /tip %}} diff --git a/content/topics/Automation/Workflows/Auditing/workflow-checklist.md b/content/topics/Automation/Workflows/Auditing/workflow-checklist.md index 765d5ec3d..e59fdfced 100644 --- a/content/topics/Automation/Workflows/Auditing/workflow-checklist.md +++ b/content/topics/Automation/Workflows/Auditing/workflow-checklist.md @@ -16,7 +16,7 @@ aliases: ## Overview -As projects progress, they can become disorganized and difficult to navigate. A structured approach not only facilitates collaboration and understanding but also ensures that the project remains efficient and reproducible. +As projects progress, they can become disorganized and difficult to navigate. A structured approach not only facilitates collaboration and understanding but also ensures that the project remains efficient and reproducible. 
This building block offers a comprehensive checklist to guide you towards achieving this goal. @@ -26,38 +26,38 @@ This building block offers a comprehensive checklist to guide you towards achiev Foundational guidelines that are essential for setting up any project, ensuring clarity and effective organization from the outset: -* Implement a consistent [directory structure](/topics/project-management/principles-of-project-setup-and-workflow-management/directories/#working-example): `data/src/gen`. -* Include [readme](/topics/project-management/principles-of-project-setup-and-workflow-management/documenting-code/#main-project-documentation) with project description and technical instructions on how to run/build the project. -* Store any authentication credentials outside of the repository (e.g., in a JSON file), and **not** in clear-text within the source code. -* Mirror your `/data` folder to a secure backup location. Alternatively, store all raw data on a secure server and download the relevant files to `/data`. +- Implement a consistent [directory structure](/topics/project-management/principles-of-project-setup-and-workflow-management/directories/#working-example): `data/src/gen`. +- Include [readme](/topics/project-management/principles-of-project-setup-and-workflow-management/documenting-code/#main-project-documentation) with project description and technical instructions on how to run/build the project. +- Store any authentication credentials outside of the repository (e.g., in a JSON file), and **not** in clear-text within the source code. +- Mirror your `/data` folder to a secure backup location. Alternatively, store all raw data on a secure server and download the relevant files to `/data`. ### Throughout the Pipeline -#### File/directory structure +#### File/directory structure Ensuring that your data, code, and results are systematically arranged, makes it easier to track changes and debug issues. -* Create subdirectory for source code: `/src/[pipeline-stage-name]/`. -* Establish subdirectories within `/gen/[pipeline-stage-name]/` for generated files: `temp`, `output`, and `audit`. -* Ensure file names are relative and not absolute. For instance, avoid references like `C:/mydata/myproject`, and opt for relative paths such as `../output`. -* Structure directories using your source code or use [.gitkeep](https://www.freecodecamp.org/news/what-is-gitkeep/). +- Create subdirectory for source code: `/src/[pipeline-stage-name]/`. +- Establish subdirectories within `/gen/[pipeline-stage-name]/` for generated files: `temp`, `output`, and `audit`. +- Ensure file names are relative and not absolute. For instance, avoid references like `C:/mydata/myproject`, and opt for relative paths such as `../output`. +- Structure directories using your source code or use [.gitkeep](https://www.freecodecamp.org/news/what-is-gitkeep/). #### Automation & documentation Ensuring smooth automation alongside clear documentation streamlines project workflows and aids clarity. -* Make sure to have a [`makefile`](/automate/project-setup) to allow for automation. -* Alternatively, include a [readme](/topics/project-management/principles-of-project-setup-and-workflow-management/documenting-code/#main-project-documentation) with running instructions. -* Delineate dependencies between the source code and files-to-be-built explicitly. This allows `make` to automatically recognize when a rule is redundant, ensuring you define targets and source files properly. 
-* Include a function to delete `temp`, `output` files, and `audit` files in makefile when necessary. +- Make sure to have a [`makefile`](/automate/project-setup) to allow for automation. +- Alternatively, include a [readme](/topics/project-management/principles-of-project-setup-and-workflow-management/documenting-code/#main-project-documentation) with running instructions. +- Delineate dependencies between the source code and files-to-be-built explicitly. This allows `make` to automatically recognize when a rule is redundant, ensuring you define targets and source files properly. +- Include a function to delete `temp`, `output` files, and `audit` files in makefile when necessary. #### Versioning Versioning guarantees that changes in your project are trackable, providing a foundation for collaboration and recovery of previous work states. -* Track and version all source code stored in `/src` (e.g., add to Git/GitHub). -* Do not version any files in `/data` and `/gen`. They should **not** be added to Git/GitHub. -* If there are specific files or directories you wish to exclude, especially those unintentionally written to `/src`, utilize [.gitignore](https://www.freecodecamp.org/news/gitignore-file-how-to-ignore-files-and-folders-in-git/) to keep them unversioned. +- Track and version all source code stored in `/src` (e.g., add to Git/GitHub). +- Do not version any files in `/data` and `/gen`. They should **not** be added to Git/GitHub. +- If there are specific files or directories you wish to exclude, especially those unintentionally written to `/src`, utilize [.gitignore](https://www.freecodecamp.org/news/gitignore-file-how-to-ignore-files-and-folders-in-git/) to keep them unversioned. {{% warning %}} **Do not version sesitive data** @@ -66,31 +66,31 @@ Before making a GitHub repository public, we recommend you check that you have n You can use [GitHub credentials scanner](https://geekflare.com/github-credentials-scanner/) if you want to make sure. {{% /warning %}} - #### Housekeeping A tidy codebase is instrumental for collaborations and future adjustments. Proper housekeeping practices ensure code readability, maintainability, and efficient debugging. -* Opt for concise and descriptive variable names. -* Wherever possible, employ loops to reduce redundancy. -* Break down extensive source code into subprograms, functions, or divide them into smaller focused scripts. -* Prune unnecessary components such as redundant comments, outdated library calls, and unused variables. -* Implement asserts to stop program execution when encountering unhandled errors, ensuring robustness. +- Opt for concise and descriptive variable names. +- Wherever possible, employ loops to reduce redundancy. +- Break down extensive source code into subprograms, functions, or divide them into smaller focused scripts. +- Prune unnecessary components such as redundant comments, outdated library calls, and unused variables. +- Implement asserts to stop program execution when encountering unhandled errors, ensuring robustness. #### Testing for portability Ensuring your project works across different environments and systems is crucial for consistent results and wider usability. -* On Your Computer: +- On Your Computer: + - Rebuild Test: Clear `/gen` and rebuild using `make`. - Clone & Build: Clone to a new directory, then rebuild using `make`. -* Different Systems: +- Different Systems: - Confirm functionality on Windows OS, Mac setup and Linux. 
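As a sketch, the two local checks above can look like this in a shell; the repository URL and folder names are placeholders.

```bash
# 1) Rebuild test: wipe generated files and rebuild from scratch
rm -rf gen/
make

# 2) Clone & build test: start from a clean checkout in a temporary directory
cd "$(mktemp -d)"
git clone https://github.com/your-org/your-project.git
cd your-project
make
```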
#### Example of a well-organized project -[This tutorial](/topics/project-management/principles-of-project-setup-and-workflow-management/overview/) covers the fundamental principles of project setup and workflows underlying this checklist. Under the *Summary* section, you will find a visual example of a well-structure project display. +[This tutorial](/topics/project-management/principles-of-project-setup-and-workflow-management/overview/) covers the fundamental principles of project setup and workflows underlying this checklist. Under the _Summary_ section, you will find a visual example of a well-structure project display. {{% tip %}} To quickly visualize the structure of your project directories in a tree-like format, you can utilize the `tree` command in your terminal or command prompt. @@ -98,7 +98,6 @@ To quickly visualize the structure of your project directories in a tree-like fo ## Additional Resources -- Tutorial about [Pipeline Automation using Make](https://tilburgsciencehub.com/topics/reproducible-research-and-automation/practicing-pipeline-automation-make/pipeline-automation-overview/). +- Tutorial about [Pipeline Automation using Make](../../../Automation/automation-tools/Makefiles/practicing-pipeline-automation-make/pipeline-automation-overview.md). - Free open-source Master level course on [Data Preparation and Workflow Management](https://dprep.hannesdatta.com/). - Reading about an example of a [Digital Project Folder Structure](https://medium.com/@dcbryan/staying-organized-a-project-folder-structure-7764651ff89f). - diff --git a/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/automation.md b/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/automation.md index ba0115050..a47e571d8 100644 --- a/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/automation.md +++ b/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/automation.md @@ -15,13 +15,13 @@ aliases: ## Overview -Remember the [different stages of a project's pipeline](../pipeline/#project-pipelines)? Let's suppose +Remember the [different stages of a project's pipeline](pipeline.md/#project-pipelines)? Let's suppose we're in the process of preparing our data set for analysis. For example: 1. You wish to convert three raw data sets from Excel to CSV files. 2. You want to merge these three CSV files and apply some cleaning steps. 3. Finally, you want to save the final data set, so that it can be used in -other stages of your project pipeline (e.g., such as the analysis). + other stages of your project pipeline (e.g., such as the analysis). This workflow for your specific pipeline can be visualized as follows: @@ -35,9 +35,9 @@ Using so-called "build tools" such as [`make`](/topics/configure-your-computer/a Specifically, `make` introduces "recipes" that are used to tell `make` how to build certain `targets`, using a set of `source files` and `execution command(s)`. -- A *target* refers to **what** needs to be built (e.g., a file), -- *source(s)* specify what is **required** to execute the build, and -- the *execution command* specifies **how** to execute the build. +- A _target_ refers to **what** needs to be built (e.g., a file), +- _source(s)_ specify what is **required** to execute the build, and +- the _execution command_ specifies **how** to execute the build. 
In `make` code, this becomes: @@ -45,9 +45,10 @@ In `make` code, this becomes: target: source(s) execution command ``` + ## Translating the Pipeline into "Recipes" for `make` -In "`make` code," the workflow above - saved in a *makefile* (a file called `makefile`, without a file type ending) - becomes: +In "`make` code," the workflow above - saved in a _makefile_ (a file called `makefile`, without a file type ending) - becomes: ```make @@ -69,9 +70,8 @@ In "`make` code," the workflow above - saved in a *makefile* (a file called `mak R -e "rmarkdown::render('make_report.Rmd', output_file = '../output/report.pdf')" --> - {{% tip %}} -Pay attention to the subdirectory structure used here: the rules refer to files in different folders (src, gen, data, etc.), which are explained [earlier in this guide](../directories). +Pay attention to the subdirectory structure used here: the rules refer to files in different folders (src, gen, data, etc.), which are explained [earlier in this guide](directories.md). {{% /tip %}} ## Running `make` @@ -88,20 +88,21 @@ actually execute them. Great to preview how a workflow would be executed! ### Consider Source Code or Targets as Up-to-Date -By default, `make` runs each step in a workflow that *needs* to be -updated. However, sometimes you wish to only rebuild *some* but not all +By default, `make` runs each step in a workflow that _needs_ to be +updated. However, sometimes you wish to only rebuild _some_ but not all parts of your project. For example, consider the case where you have only added some comments to some R scripts, but re-running that part of the project would not change any of the resulting output files (e.g., let's say a dataset). There are two ways in which you can "skip" the re-builds, depending on whether you want to consider **file(s)**, or **targets** as up-to-date. -Recall that *targets* are higher-order recipes, whereas files are, well, +Recall that _targets_ are higher-order recipes, whereas files are, well, merely files. -**Considering a *target* as up-to-date:** +**Considering a _target_ as up-to-date:** Pass the parameter `-t targetname` to `make`, and press enter. For example, + ``` make -t targetname ``` @@ -109,14 +110,14 @@ make -t targetname The `targetname` is now "up-to-date". When you then run `make`, it will only run those files necessary to build the remainder of the workflow. -**Considering *source code* as up-to-date:** +**Considering _source code_ as up-to-date:** Pass the parameter `-o filename1 filename2` to `make`. In other words, `filename1` and `filename2` will be considered "infinitely old", and when rebuilding, that part of the project will not be executed. {{% warning %}} -Of course, using `-t` and `-o` should only be used for *prototyping* your +Of course, using `-t` and `-o` should only be used for _prototyping_ your code. When you're done editing (e.g., at the end of the day), make your temporary and output files, and re-run `make` to see whether everything works (and reproduces). @@ -131,13 +132,13 @@ This [book by O'Reilly Media](https://www.oreilly.com/openbook/make3/book/index. ### Why is `make` useful? - You may have a script that takes a very long time to build a dataset -(let's say a couple of hours), and another script that runs an analysis on it. -You only would like to produce a new dataset if actually code to make that dataset has changed. -Using `make`, your computer will figure out what code to execute to get you your final analysis. 
+  (let's say a couple of hours), and another script that runs an analysis on it.
+  You only want to produce a new dataset if the code that builds it has actually changed.
+  Using `make`, your computer will figure out what code to execute to get you your final analysis.

- You may want to run a robustness check on a larger sample, using a virtual computer you have rented in the cloud.
-To run your analysis, you would have to spend hours of executing script after script to make sure the project runs the way you want.
-Using `make `, you can simply ship your entire code off to the cluster, change the sample size, and wait for the job to be done.
+  To run your analysis, you would have to spend hours executing script after script to make sure the project runs the way you want.
+  Using `make`, you can simply ship your entire code off to the cluster, change the sample size, and wait for the job to be done.

### Are there alternatives to `make`?

@@ -163,7 +164,7 @@ you would have to execute that code manually.

What you see with other researchers is that they put the running instructions
into a script, for example a `.bat` file on Windows. Such a file is helpful because it makes the order of
-execution *explicit*. The downside is that such a file will always build *everything* - even those
+execution _explicit_. The downside is that such a file will always build _everything_ - even those
targets that are already up-to-date. Especially in data- and computation-intensive projects, though,
you would want to avoid exactly that to make quick progress.

@@ -180,6 +181,6 @@ With `make`, we:

- keep track of complicated file dependencies, and

-- are kept from *repeating* typos or mistakes - if we stick to using `make` everytime
-we want to run our project, then we *must* correct each mistake before we can continue.
-{{% /summary %}}
+- are kept from _repeating_ typos or mistakes - if we stick to using `make` every time
+  we want to run our project, then we _must_ correct each mistake before we can continue.
+  {{% /summary %}}
diff --git a/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/directories.md b/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/directories.md
index f38391e0c..f014d2ba6 100644
--- a/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/directories.md
+++ b/content/topics/Automation/Workflows/Starting/principles-of-project-setup-and-workflow-management/directories.md
@@ -19,12 +19,13 @@ aliases:

From the [previous section in this guide](../pipeline), recall that a project can be logically structured in the following **components**: raw data (e.g., data sets), source code (e.g., code to prepare your data and to analyze it), generated files (e.g., some diagnostic plots or the result of your analysis), a file exchange, and notes (e.g., literature, meeting summaries).

-Below, we give you some more guidance on how to *manage these components*, by casting a prototypical directory structure that you need to set up on your computer.
+Below, we give you some more guidance on how to _manage these components_, by laying out a prototypical directory structure that you need to set up on your computer.

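If you prefer to create such a directory skeleton programmatically rather than by hand, a short Python sketch along the following lines can do it. The stage names mirror the example pipeline used throughout this guide (`data-preparation`, `analysis`, `paper`); treat them as placeholders and rename them to fit your own project:

```python
from pathlib import Path

# Example pipeline stages; replace these with the stages of your own project.
stages = ["data-preparation", "analysis", "paper"]

project = Path("my_project")

# Raw data lives in /data, source code in /src, generated files in /gen.
(project / "data").mkdir(parents=True, exist_ok=True)

for stage in stages:
    # One source-code folder per pipeline stage.
    (project / "src" / stage).mkdir(parents=True, exist_ok=True)
    # Generated files per stage, split into input, temp, output, and audit.
    for sub in ("input", "temp", "output", "audit"):
        (project / "gen" / stage / sub).mkdir(parents=True, exist_ok=True)
```
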
{{% tip %}} **Principles** We adhere to the following principles in managing our project's data and code: + - others should understand the project and its pipeline, merely by looking at your directory and file structure, - each step in the pipeline is self-contained, i.e., it can be executed without actually having to run previous stages in the pipeline ("portability"), and @@ -60,9 +61,9 @@ Raw data gets downloaded into the `data` folder of your project (`my_project/dat #### Which data format to use - Raw data needs to be stored in text-based data files, such as CSV or JSON files. -This ensures highest portability across different platforms, and also stability -should certain software packages be discontinued or data formats being substantially -changed. + This ensures highest portability across different platforms, and also stability + should certain software packages be discontinued or data formats being substantially + changed. - Do you use databases to handle data storage (e.g., such as a MySQL or MongoDB server)? @@ -84,27 +85,27 @@ See below for an example directory structure for your raw data: /data/data_provider_C/2020-03-01/... - We also recommend you to use self-explanatory file names for your data, -[document the data yourself with a `readme`](../documenting-data), or include an overview about how the data was collected from your data provider. + [document the data yourself with a `readme`](../documenting-data), or include an overview about how the data was collected from your data provider. - Last, it may happen that you code up some data yourself, and that you -still wish to store multiple versions of that file. In that case, -please only keep *the head copy* (i.e., the most recent version) in your folder, -and move all outdated files to a subfolder (which you can call `/legacy` or `/outdated`). + still wish to store multiple versions of that file. In that case, + please only keep _the head copy_ (i.e., the most recent version) in your folder, + and move all outdated files to a subfolder (which you can call `/legacy` or `/outdated`). #### Where to store the data - Ideally, the folder with your raw data is stored on a secure server that (a) project members have access to, and (b) gets backed up frequently. Good locations are: - - a secure network folder in your organization - - file hosting services such as Amazon S3 - - public cloud services such as Dropbox or Google Drive + - a secure network folder in your organization + - file hosting services such as Amazon S3 + - public cloud services such as Dropbox or Google Drive - When working on the project, download copies of your raw data to your `my_project/data/` folder. - No time to think about remote storage locations right now? Then just store your -data in `my_project/data/` on your PC, and set yourself a reminder to move it -to a more secure location soon. Do make backups of this folder though! + data in `my_project/data/` on your PC, and set yourself a reminder to move it + to a more secure location soon. Do make backups of this folder though! ### 2. Source code @@ -120,22 +121,22 @@ Source code is made available in the `src` folder of your main project: `my_proj **Version your source code** -- Source code must be [*versioned*](../versioning) so that you can -roll back to previous versions of the same files, and engage into -"housekeeping" exercises to improve the readability of your code. 
+- Source code must be [_versioned_](../versioning) so that you can + roll back to previous versions of the same files, and engage into + "housekeeping" exercises to improve the readability of your code. - Of course, versioning also is a requirement when you work with multiple -team members on a project. + team members on a project. **Which directory structure to adhere to** - Create subfolders for each stage of your [project's pipeline](../pipeline), -and store source code pertaining to these stages in their respective directories. + and store source code pertaining to these stages in their respective directories. - For example, let's assume your pipeline consists of three stages (ordered from "upstream" to "downstream" stages): - - the pipeline stage `data-preparation` is used to prepare/clean a data set - - the pipeline stage `analysis` is used to analyze the data cleaned in the previous step, and - - the pipeline stage `paper` produces tables and figures for the final paper. + - the pipeline stage `data-preparation` is used to prepare/clean a data set + - the pipeline stage `analysis` is used to analyze the data cleaned in the previous step, and + - the pipeline stage `paper` produces tables and figures for the final paper. Your directory structure for the `/src` directory would then become: @@ -148,7 +149,7 @@ Your directory structure for the `/src` directory would then become: - First of all, the source code is available on your local computer. - Second, the source code is versioned using Git, and -synchronized with [GitHub](http://github.com) (so, automatic backups are guaranteed). + synchronized with [GitHub](http://github.com) (so, automatic backups are guaranteed). ### 3. Generated files @@ -163,26 +164,26 @@ synchronized with [GitHub](http://github.com) (so, automatic backups are guarant Each subdirectory contains the following subdirectories: - `input`: This subdirectory contains any required input files to run this -step of the pipeline. Think of this as a directory that holds files from -preceding modules (e.g., the analysis uses the *file exchange* to pull in the -dataset from its preceding stage in the pipeline, `/data-preparation`). + step of the pipeline. Think of this as a directory that holds files from + preceding modules (e.g., the analysis uses the _file exchange_ to pull in the + dataset from its preceding stage in the pipeline, `/data-preparation`). - `temp`: These are temporary files, like an Excel dataset that -needed to be converted to a CSV data set before reading it in -your statistical software package. + needed to be converted to a CSV data set before reading it in + your statistical software package. -- `output`: This subdirectory stores the *final* result of the module. -For example, in the case of a data preparation module, you would expect this -subdirectory to hold the final dataset. In the case of the analysis module, -you would expect this directory to contain a document with the results of -your analysis (e.g., some tables, or some figures).If necessary, -pass these to a file exchange for use in other stages of your pipeline. +- `output`: This subdirectory stores the _final_ result of the module. + For example, in the case of a data preparation module, you would expect this + subdirectory to hold the final dataset. 
In the case of the analysis module, + you would expect this directory to contain a document with the results of + your analysis (e.g., some tables, or some figures).If necessary, + pass these to a file exchange for use in other stages of your pipeline. - `audit`: Checking data and model quality is important. So use this -directory to generate diagnostic information on the performance of each -step in your pipeline. For example, for a data preparation, save a txt -file with information on missing observations in your final data set. -For an analysis, write down a txt file with some fit measures, etc. + directory to generate diagnostic information on the performance of each + step in your pipeline. For example, for a data preparation, save a txt + file with information on missing observations in your final data set. + For an analysis, write down a txt file with some fit measures, etc. Your directory structure for the generated data is: @@ -204,13 +205,13 @@ Your directory structure for the generated data is: - Since generated files are purely the result of running source code (which is versioned) on your raw data (which is backed up), it's sufficient to store generated files locally only. - If you are working with other team members that may need access to the `/output` of preceding -pipeline stages, you can make use of the file exchange (see next section). + pipeline stages, you can make use of the file exchange (see next section). ### 4. File exchange - File exchanges are used to easily "ping-pong" (upload or download) -generated temporary or output files between different stages of the pipeline. Review -the use situations for file exchanges [here](../pipeline/#project-components). + generated temporary or output files between different stages of the pipeline. Review + the use situations for file exchanges [here](../pipeline/#project-components). - In simple terms, a file exchange "mirrors" (parts of) your generated files in `/gen`, so that you or your team members can download the outputs of previous pipeline stages. @@ -218,46 +219,47 @@ the use situations for file exchanges [here](../pipeline/#project-components). **Use cases for file exchanges** - A file exchange to **collaborate with others**: + - A co-author builds a prototype model on his laptop. - He/she can work on a dataset that you have prepped using a high-performance -workstation (this part of the pipeline), without having to actually build -the dataset him/herself from scratch. + He/she can work on a dataset that you have prepped using a high-performance + workstation (this part of the pipeline), without having to actually build + the dataset him/herself from scratch. - A file exchange **without any collaboration**: + - You have built a data set on a high-performance workstation, - and would like to work with a sample dataset on your laptop, without - having to actually build the dataset from scratch. -{{% /example %}} + and would like to work with a sample dataset on your laptop, without + having to actually build the dataset from scratch. + {{% /example %}} - For hosting, you have the same options as for storing your raw data: - - a secure network folder in your organization, - - file hosting services such as Amazon S3, or - - public cloud services such as Dropbox or Google Drive. + + - a secure network folder in your organization, + - file hosting services such as Amazon S3, or + - public cloud services such as Dropbox or Google Drive. 
- To upload or download data from your file exchange, put scripts in the -`src` folder of the relevant pipeline stage: - - have a script in `src/data-preparation` that uploads the output of this pipeline stage from -`gen/data-preparation/output` to the file exchange (`put_output`) - - have a script in `src/analysis` which downloads the data from the file exchange to - `gen/analysis/input` (`get_input`) + `src` folder of the relevant pipeline stage: - have a script in `src/data-preparation` that uploads the output of this pipeline stage from + `gen/data-preparation/output` to the file exchange (`put_output`) - have a script in `src/analysis` which downloads the data from the file exchange to + `gen/analysis/input` (`get_input`) - For details on setting up and using a file exchange, follow our [Building Block here](/setup/file-exchanges). ### 5. Managing notes and other documents - Notes and other documents are NOT part of your `my_project` directory, but are kept on a conveniently accessible cloud provider. -The conditions are: files are accessible for all team members, and files are automatically backed up. Services that meet these requirements typically are Dropbox, Google Drive, or - if you're located in The Netherlands - Surfdrive. + The conditions are: files are accessible for all team members, and files are automatically backed up. Services that meet these requirements typically are Dropbox, Google Drive, or - if you're located in The Netherlands - Surfdrive. - Do create subdirectories for each type of files (e.g., notes, literature, etc.) - ## Summary ### Complete Project Directory Structure + {{% summary %}} - Below, we reproduce the resulting directory structure for your entire project, called `my_project` -(of course, feel free to relabel that project when you implement this!). + (of course, feel free to relabel that project when you implement this!). - This example project has three pipeline stages: - the pipeline stage `data-preparation` is used to prepare/clean a data set - the pipeline stage `analysis` is used to analyze the data cleaned in the previous step, and @@ -270,78 +272,78 @@ Contents of folder my_project │ ├───company_B │ │ coding.csv │ │ readme.txt -│ │ +│ │ │ ├───data_provider_C │ │ ├───2019-11-04 │ │ │ coding.csv │ │ │ readme.txt -│ │ │ +│ │ │ │ │ └───2020-03-01 │ │ coding.csv │ │ readme.txt -│ │ +│ │ │ └───website_A │ file1.csv │ file2.csv │ readme.txt -│ +│ ├───gen │ ├───analysis │ │ ├───audit │ │ │ audit.txt -│ │ │ +│ │ │ │ │ ├───input │ │ │ cleaned_data.csv -│ │ │ +│ │ │ │ │ ├───output │ │ │ model_results.RData -│ │ │ +│ │ │ │ │ └───temp │ │ imported_data.csv -│ │ +│ │ │ ├───data-preparation │ │ ├───audit │ │ │ checks.txt -│ │ │ +│ │ │ │ │ ├───input │ │ │ dataset1.csv │ │ │ dataset2.csv -│ │ │ +│ │ │ │ │ ├───output │ │ │ cleaned_data.csv -│ │ │ +│ │ │ │ │ └───temp │ │ tmpfile1.csv │ │ tmpfile2.RData -│ │ +│ │ │ └───paper │ ├───audit │ │ audit.txt -│ │ +│ │ │ ├───input │ │ model_results.RData -│ │ +│ │ │ ├───output │ │ figure1.png │ │ figure2.png │ │ tables.html -│ │ +│ │ │ └───temp │ table1.tex │ table2.tex -│ +│ └───src ├───analysis │ analyze.R │ get_input.txt │ put_output.txt - │ + │ ├───data-preparation │ clean_data.R │ get_input.txt │ load_data.R │ put_output.txt - │ + │ └───paper figures.R get_input.txt @@ -350,28 +352,31 @@ Contents of folder my_project ``` + {{% /summary %}} {{% summary %}} **Description of the workflow** 1. **Raw data** is stored in `my_project/data`, logically - structured into data sources and backed up securely. 
+ structured into data sources and backed up securely. 2. **Source code** is written in the `src` folder, with each step of - your pipeline getting its own subdirectory. + your pipeline getting its own subdirectory. 3. **Generated files** are written to the `gen` folder, again with subdirectories - for each step in your pipeline. Further, they have up to four subdirectories: - `/input` for input files, `/temp` for any temporary files, `/output` for any - output files, and `audit` for any auditing/checking files. + for each step in your pipeline. Further, they have up to four subdirectories: + `/input` for input files, `/temp` for any temporary files, `/output` for any + output files, and `audit` for any auditing/checking files. 4. **Notes** are kept on an easily accessible cloud provider, like -Dropbox, Google Drive, or - if you're located in The Netherlands - Surfdrive. -{{% /summary %}} + Dropbox, Google Drive, or - if you're located in The Netherlands - Surfdrive. + {{% /summary %}} + ### Download Example Directory Structure {{% tip %}} + - [Download our example directory structure here](../dir_structure.zip), so you can get started right away. - Remember that horrible [directory and file structure](../structure_phd_2013.html)? Check out how this tidier project on [how streaming services change music consumption](https://pubsonline.informs.org/doi/pdf/10.1287/mksc.2017.1051) looked [like](../structure_spotify_2018.html#spotify). - You've seen those readme.txt's?! These are super helpful to include to [describe your project](../documenting-code), and to [describe raw data](../documenting-data). -{{% /tip %}} + {{% /tip %}} ### Most Important Data Management Policies by Project Component @@ -379,31 +384,38 @@ Dropbox, Google Drive, or - if you're located in The Netherlands - Surfdrive. **Data management for your project's components** 1. Raw data - - Store where: On a secure server that project members have access to + +- Store where: On a secure server that project members have access to (e.g., could also be a network folder; later, we show to you how to use a variety of systems like Amazon S3, Dropbox, or Google Drive to "host" your raw data). No time to think about this much? Well, then just have your data available locally on your PC, and set yourself a reminder to move it to a more secure environment soon. - - Backup policy: Regular backups (e.g., weekly) - - Versioning (i.e., being able to roll back to prior versions): not necessary, although +- Backup policy: Regular backups (e.g., weekly) +- Versioning (i.e., being able to roll back to prior versions): not necessary, although you need to store different versions of the same data (e.g., a data set delivered in 2012, an updated dataset delivered in 2015) in different folders. + 2. Source code - - Store where: on Git/GitHub - which will allow you to collaborate efficiently on code - - Backup policy: Inherent (every change can be rolled back - good if you - want to roll back to previous versions of a model, for example) - - Versioning: complete versioning + +- Store where: on Git/GitHub - which will allow you to collaborate efficiently on code +- Backup policy: Inherent (every change can be rolled back - good if you + want to roll back to previous versions of a model, for example) +- Versioning: complete versioning + 3. Generated temp (temporary and output) files - - Store where: only on your local computer. - - Backup policy: None. 
These files are entirely produced on the basis of raw data and source code, + +- Store where: only on your local computer. +- Backup policy: None. These files are entirely produced on the basis of raw data and source code, and hence, can always be "reproduced" from 1. and 2. if they get wiped. + 4. Notes - - Store where: anywhere where it's convenient for you! Think about tools like + +- Store where: anywhere where it's convenient for you! Think about tools like Dropbox or Google Drive, which also offer great features that you may enjoy when you work in a team. - - Backup policy: Automatic backups (standard for these services) - - Versioning: not necessary (but typically offered for free for 1 month) +- Backup policy: Automatic backups (standard for these services) +- Versioning: not necessary (but typically offered for free for 1 month) {{% /tip %}} -* What minimum security levels do you have to ensure? Can you make your code public? +- What minimum security levels do you have to ensure? Can you make your code public? Are you working on a data consultancy project for a Fortune 500 client? Have you signed a NDA for the data that requires you to treat it with a great sense of responsibility? If so, then you better make sure that you have configured your systems securely (e.g., private Github repositories, 2-factor authentication, etc.). -* How will you manage your data? +- How will you manage your data? Can the raw data be managed in a cloud storage service like Dropbox or Google Drive, or does the sheer amount of data requires us to look for alternatives (e.g., database or object storage)? -* How long will it take to run the workflow? +- How long will it take to run the workflow? While importing a dataset and running a bunch of regression models typically happens with a matter of seconds, you may encounter circumstances in which you need to factor in the run time. For example, if you throttle API calls, experiment with a variety of hyperparameters, run a process repeatedly (e.g., web scraping). In these cases, the hardware of your machine may not suffice nor do you want to keep your machine running all day long. Creating a virtual instance (e.g., EC2) adjusted to your specific needs can overcome these hurdles. @@ -54,20 +54,22 @@ Before we dive right into the nitty gritty details, here are a couple of things - If you advocate for open science and strive for reproducibility, open sourcing your data and code online is almost a given. This in turn means you need to put in the extra effort to write comprehensive documentation and running instructions so that others - who may lack some prior knowledge - can still make sense of your repository. --> + Together, these considerations can guide your decision making in terms of (a) code versioning, (b) raw data storage, and (c) computation (local vs remote). ## 2. Set up computing environment Configure your software environment. The minimum requirements typically are + - programming languages (e.g., Python, R) - version control system (e.g., Git/GitHub) - automation tools (e.g., make). -Head over to the [software section ](../../../topics/configure-your-computer) section Tilburg Science Hub to view the installation guides. +Head over to the [software section ](../../../Computer-Setup/software-installation/) section Tilburg Science Hub to view the installation guides. ## 3. Setup the repository -Never worked with Git/GitHub before? 
Then follow our [onboarding for Git/GitHub first](../../../topics/collaborate-and-share-your-work/use-github/versioning-using-git). +Never worked with Git/GitHub before? Then follow our [onboarding for Git/GitHub first](../Starting/principles-of-project-setup-and-workflow-management/versioning.md). 1. Initialize a new Github repository and clone it to your local machine. Based on your project requirements, consider whether you need a public or private repository. @@ -99,10 +101,10 @@ Never worked with Git/GitHub before? Then follow our [onboarding for Git/GitHub 2. Create a script in the `src` folder that downloads the data from your cloud storage (or external website) and stores it in the `data` folder. ## 5. Automate your pipeline + - Create a makefile that handles the end-to-end process (e.g., download data, preprocess data, estimate linear model, generate regression table and plot). - Start automating your project early on - even if it's just downloading the data and producing a little overview of the data. Expand your `makefile` while you're working on the project. - ## Next steps Think you're done? No way! This is just the start of your reproducible research project. So take some time to go through our suggestions on how to continue your work on the project. @@ -111,7 +113,7 @@ Think you're done? No way! This is just the start of your reproducible research - Prepare data for analysis - Pull in changes from GitHub (and push your own changes to the remote) - Create issues, and assign team members to these issues -- Work on your repository's readme - the *first* thing users of your repository will view when visiting it on GitHub +- Work on your repository's readme - the _first_ thing users of your repository will view when visiting it on GitHub -### Adding images +### Adding images + You can add an image into your building block using html tags, please see an example below `

+

` ### Bear in mind breaks in front of lists -- correct 👍 (additional break before a list) : +- correct 👍 (additional break before a list) : ``` DiD works well whenever: @@ -80,50 +82,49 @@ DiD works well whenever: ### Use tables notations -- Correct 👍 (using $\texttt{table}$ notation, moreover, no space should be in between the table tags and the start and end of the table word and percentage sign): +- Correct 👍 (using $\texttt{table}$ notation, moreover, no space should be in between the table tags and the start and end of the table word and percentage sign): {{%table%}} -| | Before ($Y_i^0$) | After ($Y_i^1$) | +| | Before ($Y_i^0$) | After ($Y_i^1$) | | --------------- | ---------------------- | ---------------------- | -| Control ($D_i = 0$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \mid D_i = 0)$ | -| Treatment ($D_i=1$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \ mid D_i = 1)$ | +| Control ($D_i = 0$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \mid D_i = 0)$ | +| Treatment ($D_i=1$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \ mid D_i = 1)$ | {{%/table%}} - + - Incorrect 👎 (NOT using $\texttt{table}$ notation): -| | Before ($Y_i^0$) | After ($Y_i^1$) | -| --------------- | ---------------------- | ---------------------- | -| Control ($D_i = 0$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \mid D_i = 0)$ | -| Treatment ($D_i=1$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \ mid D_i = 1)$ | - + | | Before ($Y_i^0$) | After ($Y_i^1$) | + | --------------- | ---------------------- | ---------------------- | + | Control ($D_i = 0$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \mid D_i = 0)$ | + | Treatment ($D_i=1$) | $E(Y_i^0 \mid D_i = 0)$ | $E(Y_i^1 \ mid D_i = 1)$ | + ### Each article should have at least a title, description, keywords, weight, date and content - correct 👍 (all fields are present): `--- - title: "Software Setup Overview" - description: "Here is a guide to help you start setting up the computing environment on your machine ready." - keywords: software, setup, guide, configure, configuration" - weight: 4 - date: 2021-01-06T22:01:14+05:30 - draft: false - --- - Some content below....` - + title: "Software Setup Overview" + description: "Here is a guide to help you start setting up the computing environment on your machine ready." + keywords: software, setup, guide, configure, configuration" + weight: 4 + date: 2021-01-06T22:01:14+05:30 + draft: false + --- + Some content below....` - incorrect 👎 (missing some fields): `--- - title: "Software Setup Overview" - description: "Here is a guide to help you start setting up the computing environment on your machine ready." - keywords: software, setup, guide, configure, configuration" - --- - Some content below....` + title: "Software Setup Overview" + description: "Here is a guide to help you start setting up the computing environment on your machine ready." 
+ keywords: software, setup, guide, configure, configuration" + --- + Some content below....` +### Multiple authors -### Multiple authors -When you cooperate with someone on an article, in the author field please insert authors' names separated by coma but **without space** in between +When you cooperate with someone on an article, in the author field please insert authors' names separated by coma but **without space** in between -- correct 👍 (no space in between): `Author: Krzysztof Wiesniakowski,Thierry Lahaije` +- correct 👍 (no space in between): `Author: Krzysztof Wiesniakowski,Thierry Lahaije` - incorrect 👎 (there is space in between): `Author: Krzysztof Wiesniakowski, Thierry Lahaije` @@ -140,6 +141,7 @@ In addition to the standard way of formatting code in Markdown, code snippets ca Provide your code in all the relevant languages and/or operating systems and specify them after the three back ticks. Wrap all of your snippets (in different languages) inside a codeblock shortcode (see our templates for more info on this, or simply look at the Markdown file of this page to see how we render the codeblock below). {{% codeblock %}} + ```python # some Python code here print("Hello, world!") @@ -149,6 +151,7 @@ print("Hello, world!") # some R code here cat('Hello, world!') ``` + {{% /codeblock %}} {{% warning %}} @@ -190,15 +193,16 @@ In that case, you need to omit the language right after the three back ticks. Th ### LaTeX Integration & Math Formulas Simply place your math formulas: + - within single dollar signs for inline math: `$f(x)=x^2$` yields: -$f(x)=x^2$ + $f(x)=x^2$ - within double dollar signs for display: `$$f(x)=x^2$$` yields: $$f(x)=x^2$$ Our webiste currently does not support notation so please stick to using dollar signs. - correct 👍 (using dollar signs): $P(X_{i}) ≡ Pr(D_i = 1 | X_i)$ -- incorrect 👎 (using {{katex}} notation): {{}} P(X_{i}) ≡ Pr(D_i = 1 | X_i) {{}} +- incorrect 👎 (using {{katex}} notation): {{}} P(X\_{i}) ≡ Pr(D_i = 1 | X_i) {{}} ### Highlighting Boxes @@ -210,7 +214,6 @@ This is a tip {{% /tip %}} - {{% warning %}} This is a warning @@ -236,33 +239,37 @@ If you're using wide tables, they may appear broken on smaller screens like mobi This will make sure that the right edge of the table fades out on small screens and they become scrollable on the horizontal axis. 
The following is an example of a wide table: {{%table%}} {{% wide-table %}} -| | | | | | | +| | | | | | | |-------------|-------------|------------|------------|---------------|---------------| -| $\alpha$ | `\alpha` | $\beta$ | `\beta` | $\gamma$ | `\gamma` | -| $\delta$ | `\delta` | $\epsilon$ | `\epsilon` | $\varepsilon$ | `\varepsilon` | -| $\zeta$ | `\zeta` | $\eta$ | `\eta` | $\theta$ | `\theta` | -| $\vartheta$ | `\vartheta` | $\iota$ | `\iota` | $\kappa$ | `\kappa` | -| $\lambda$ | `\lambda` | $\mu$ | `\mu` | $\nu$ | `\nu` | -| $\xi$ | `\xi` | $\pi$ | `\pi` | $\varpi$ | `\varpi` | -| $\rho$ | `\rho` | $\varrho$ | `\varrho` | $\sigma$ | `\sigma` | -| $\tau$ | `\tau` | $\upsilon$ | `\upsilon` | $\phi$ | `\phi` | -| $\varphi$ | `\varphi` | $\chi$ | `\chi` | $\psi$ | `\psi` | -| $\omega$ | `\omega` | +| $\alpha$ | `\alpha` | $\beta$ | `\beta` | $\gamma$ | `\gamma` | +| $\delta$ | `\delta` | $\epsilon$ | `\epsilon` | $\varepsilon$ | `\varepsilon` | +| $\zeta$ | `\zeta` | $\eta$ | `\eta` | $\theta$ | `\theta` | +| $\vartheta$ | `\vartheta` | $\iota$ | `\iota` | $\kappa$ | `\kappa` | +| $\lambda$ | `\lambda` | $\mu$ | `\mu` | $\nu$ | `\nu` | +| $\xi$ | `\xi` | $\pi$ | `\pi` | $\varpi$ | `\varpi` | +| $\rho$ | `\rho` | $\varrho$ | `\varrho` | $\sigma$ | `\sigma` | +| $\tau$ | `\tau` | $\upsilon$ | `\upsilon` | $\phi$ | `\phi` | +| $\varphi$ | `\varphi` | $\chi$ | `\chi` | $\psi$ | `\psi` | +| $\omega$ | `\omega` | {{% /wide-table %}} {{%/table%}} {{% tip %}} -Click **[here](https://github.com/tilburgsciencehub/website/blob/master/content/topics/more-tutorials/contribute-to-tilburg-science-hub/style-guide.md)** to check out the Markdown file of this page to learn how these shortcodes are used in the text. +Click **[here](https://github.com/tilburgsciencehub/website/blob/main-flask/content/topics/Collaborate-share/Project-management/contribute-to-tilburg-science-hub/style-guide.md)** to check out the Markdown file of this page to learn how these shortcodes are used in the text. {{% /tip %}} ### Adding Footnotes + Footnotes [^1] let you reference relevant information without disrupting the flow of what you're trying to say. Need to make use of footnotes? Here is how to add them in Markdown: + ``` Footnotes[^1] let you (...) [^1]: My reference ``` + [^1]: [Footnotes in Markdown](https://github.blog/changelog/2021-09-30-footnotes-now-supported-in-markdown-fields/) + ### Writing Instructions - Use the second person ("you") when speaking to our or talking about our users @@ -301,18 +308,21 @@ Generally speaking, it's a good idea to name your file exactly like your main ti Title: Automation with GNU Make Filename: automation-with-gnu-make.md + {{% /warning %}} -### Add the correct tags -Every page has at the top of the code tags which, among others, help index the page. +### Add the correct tags + +Every page has at the top of the code tags which, among others, help index the page. + ``` --- -tutorialtitle: "Web Scraping and API Mining [a good title] " -type: "web-scraping [very short title ~2/3 words with no capitols nor spaces]" -title: "Web Scraping and API Mining [the same good title]" +tutorialtitle: "Web Scraping and API Mining [a good title] " +type: "web-scraping [very short title ~2/3 words with no capitols nor spaces]" +title: "Web Scraping and API Mining [the same good title]" description: "[a short, relevant summary of what a particular page is about]" -keywords: "scrape, webscraping, beautifulsoup .. 
[relevant keywords to the page]" -weight: 1 [to determine the position in the navigation 1 at the top 99 at the bottom] +keywords: "scrape, webscraping, beautifulsoup .. [relevant keywords to the page]" +weight: 1 [to determine the position in the navigation 1 at the top 99 at the bottom] draft: false [false when finished, true when it is still a draft] aliases: [other urls which lead to this page] - /learn/web-scraping-and-api-mining [first alias is the short link] @@ -324,26 +334,26 @@ aliases: [other urls which lead to this page] When contributing content to our platform, please address at least one of our target groups. -1. __Students and novice researchers__ who look for material developed, curated, and rigorously tested by professors and experienced users, ensuring them about quality and usability. Users may not know yet what (and why) they should learn the Tilburg Science Hub way of working. These users need to be onboarded first and guided through our content. +1. **Students and novice researchers** who look for material developed, curated, and rigorously tested by professors and experienced users, ensuring them about quality and usability. Users may not know yet what (and why) they should learn the Tilburg Science Hub way of working. These users need to be onboarded first and guided through our content. -2. __(Advanced) researchers and professors__ who look for useful material developed by their peers. These users have identified a problem, but need code snippets to use in their projects. Researchers seek to avoid mistakes that others have made before them, and look to get inspired by their colleagues. They may also use our platform to share their solutions to problems (e.g., in the form of building blocks or tutorials). +2. **(Advanced) researchers and professors** who look for useful material developed by their peers. These users have identified a problem, but need code snippets to use in their projects. Researchers seek to avoid mistakes that others have made before them, and look to get inspired by their colleagues. They may also use our platform to share their solutions to problems (e.g., in the form of building blocks or tutorials). -3. __Teams and small businesses__ who wish to get inspired by how scientists work on on data-intensive projects. Tilburg Science Hub offers tools and knowledge to help teams work together better, more efficiently. Our content can be adopted by individual team members (i.e., adoption doesn't disturb the way of working of other team members), or jointly by a team (i.e., the entire team commits to the Tilburg Science Hub way of working). +3. **Teams and small businesses** who wish to get inspired by how scientists work on on data-intensive projects. Tilburg Science Hub offers tools and knowledge to help teams work together better, more efficiently. Our content can be adopted by individual team members (i.e., adoption doesn't disturb the way of working of other team members), or jointly by a team (i.e., the entire team commits to the Tilburg Science Hub way of working). -4. __Data science enthusiasts__ who encounter our site when working on their projects. We strive to become a key learning resource in the data science community. +4. **Data science enthusiasts** who encounter our site when working on their projects. We strive to become a key learning resource in the data science community. 
{{% warning %}} **Keep in mind our key goals** -Our platform focuses on __usability__ (i.e., ability to quickly copy-paste code or download templates), __speed__ (i.e., loading time), and __attractiveness__ (i.e., visual appeal, “state-of-the-art” look, good writing). +Our platform focuses on **usability** (i.e., ability to quickly copy-paste code or download templates), **speed** (i.e., loading time), and **attractiveness** (i.e., visual appeal, “state-of-the-art” look, good writing). Our platform is a (classical) two-sided market. -1. We attract __novice and expert users__ that *use* our content in their work, and +1. We attract **novice and expert users** that _use_ our content in their work, and -2. We build a __community of content creators__ that have an outspoken vision on how to conduct science efficiently. -{{% /warning %}} +2. We build a **community of content creators** that have an outspoken vision on how to conduct science efficiently. + {{% /warning %}} {{% warning %}} **License** diff --git a/content/topics/Collaborate-share/Project-management/engage-open-science/pull-requests.md b/content/topics/Collaborate-share/Project-management/engage-open-science/pull-requests.md index 98cdf3d31..b980232fa 100644 --- a/content/topics/Collaborate-share/Project-management/engage-open-science/pull-requests.md +++ b/content/topics/Collaborate-share/Project-management/engage-open-science/pull-requests.md @@ -16,25 +16,26 @@ aliases: # Overview Ever wondered how to contribute to open source projects on GitHub? This building block will go through this process step-by-step to prepare you for your first meaningful open source contribution! - 1. Get familiar with the repository - 2. Run the code - 3. Make changes - 4. Submit your contribution + +1. Get familiar with the repository +2. Run the code +3. Make changes +4. Submit your contribution ## Step 1: Get familiar with the repository Familiarize yourself with the repository to which you want to contribute. - Typically, each repository has a README with general instructions on what the repository is about (and how to run the code). -- Also, new features and bugs are discussed at the repository’s *Issues* page. -- Finally, many repositories contain a discussion forum and project board (called *Projects*) in which you can learn about the roadmap of the project. +- Also, new features and bugs are discussed at the repository’s _Issues_ page. +- Finally, many repositories contain a discussion forum and project board (called _Projects_) in which you can learn about the roadmap of the project. {{% example %}} For example, visit the [repository of Tilburg Science Hub](https://github.com/tilburgsciencehub/tsh-website). - Browse through the repository’s README (the first file you see when you click on the link above) -- Head over to the issue page: click on the *Issues* tab. -- View the project/discussion boards: you can find this through the *Projects* tab. +- Head over to the issue page: click on the _Issues_ tab. +- View the project/discussion boards: you can find this through the _Projects_ tab. Can you identify ways in which to contribute to the project? @@ -44,14 +45,16 @@ Can you identify ways in which to contribute to the project? After installing the required software, you need to run the code to see whether you can test your changes to the project later. -- Open the repository on GitHub, and fork it (click on the *fork* button in the upper right corner on GitHub). This creates a copy of the code in your GitHub account. 
+- Open the repository on GitHub, and fork it (click on the _fork_ button in the upper right corner on GitHub). This creates a copy of the code in your GitHub account. - Clone your own forked repository to the disk. You can do this with `git clone`. The example code below shows how to clone a fork of the Tilburg Science Hub website. (Replace your user name.) {{% codeblock %}} + ```bash git clone https://github.com/[your-user-name]/tsh-website ``` + {{% /codeblock %}} - Enter the directory of the cloned repository, and run the code. @@ -60,22 +63,26 @@ After installing the required software, you need to run the code to see whether Find something that you want to add to the project, or fix! Each new feature that you introduce needs to be developed in a separate branch. This allows the repository owner to carefully screen which parts of the project to add to the public version of the main project. The websites are written in Markdown, and you can easily add or change the content. -1. Create a new branch +1. Create a new branch {{% codeblock %}} + ```bash git branch name-of-a-new-feature ``` + {{% /codeblock %}} -2. Work on your new feature. Throughout, apply the Git workflow: `git status`, `git add`, `git commit -m "commit message"`. For more information on Git Commands, read this [building block](/topics/contribute-and-share-your-work/most-important-git-commands/). +2. Work on your new feature. Throughout, apply the Git workflow: `git status`, `git add`, `git commit -m "commit message"`. For more information on Git Commands, read this [building block](/topics/contribute-and-share-your-work/most-important-git-commands/). -3. When you’re done with all of your changes, push your changes to your GitHub repository. +3. When you’re done with all of your changes, push your changes to your GitHub repository. {{% codeblock %}} + ```bash git push -u origin name-of-a-new-feature ``` + {{% /codeblock %}} At this stage, your changes are visible in your forked copy of the repository. The repository owner of the main project doesn't know about these changes yet. @@ -83,15 +90,15 @@ At this stage, your changes are visible in your forked copy of the repository. T {{% tip %}} **Preview the Tilburg Science Hub website** -When contributing to Tilburg Science Hub, it is very useful to preview changes and see how they will look on the website. +When contributing to Tilburg Science Hub, it is very useful to preview changes and see how they will look on the website. -- First, install Hugo. If you don't have Hugo installed, follow the tutorial [Get Hugo](https://tilburgsciencehub.com/topics/code-like-a-pro/hugo-website/get-hugo/). +- First, install Hugo. If you don't have Hugo installed, follow the tutorial [Get Hugo](../../Share-your-work/content-creation/Hugo-website/get-hugo.md). - Navigate to the repository cloned on your local machine. You can do this in two ways: navigate to this repository first and then open the Command Prompt here, or open the Command Prompt first and type `cd` followed by the Path to this repository. (The example uses the first option.) - Type `hugo server` in the Command Prompt. -- The website can now be run locally. Paste the localhost link (http://localhost:xxxx/) in your browser to view the website with your changes. +- The website can now be run locally. Paste the localhost link (http://localhost:xxxx/) in your browser to view the website with your changes.

@@ -112,10 +119,10 @@ Fully done and happy with your changes? Then let the project owner know about yo {{% summary %}} This guide takes you on a step-by-step explanation journey into open source contribution on GitHub: -1. **Get familiar**: Explore the repository, read the README page, and explore the *Issues* and *Projects* pages. You can find out how you can contribute. -2. **Run the code**: Fork the repository, clone it, and run the code. This sets the stage for your changes. +1. **Get familiar**: Explore the repository, read the README page, and explore the _Issues_ and _Projects_ pages. You can find out how you can contribute. +2. **Run the code**: Fork the repository, clone it, and run the code. This sets the stage for your changes. 3. **Make changes**: Create a new branch, and make your changes by adding new content or changing existing content. You can use Git commands to manage these changes. -4. **Submit your contribution**: When satisfied, initiate a pull request from your fork. Explain your changes, and the project owner might merge your work into the main project. +4. **Submit your contribution**: When satisfied, initiate a pull request from your fork. Explain your changes, and the project owner might merge your work into the main project. With these steps, you're set to make your first meaningful open source contribution! -{{% /summary %}} \ No newline at end of file +{{% /summary %}} diff --git a/content/topics/Collaborate-share/Project-management/project-management-GH/software-environments.md b/content/topics/Collaborate-share/Project-management/project-management-GH/software-environments.md index 22847cd63..6c60090ec 100644 --- a/content/topics/Collaborate-share/Project-management/project-management-GH/software-environments.md +++ b/content/topics/Collaborate-share/Project-management/project-management-GH/software-environments.md @@ -16,15 +16,14 @@ aliases: The **main advantages** of using virtual software environments are: 1. **Ensures replicability**: - - Environment specifies versions for each program and package used - - Ensures that specified versions are the right ones (environment does not forget to update specified version if the version is updated in a project) + - Environment specifies versions for each program and package used + - Ensures that specified versions are the right ones (environment does not forget to update specified version if the version is updated in a project) 2. **Easy set-up on a different machine or the cloud**: run environment setup to install all required software/packages. 3. **Keeps projects separate**: adding or updating packages for one project does not affect others. Here, we will explain how to use Conda to set up such virtual software environments. Conda is a package and environment manager and allows to install Python, R as well as additional software (e.g. pandoc) on Windows, Linux and MacOS. - ### Why is keeping track of software/package versions important? If software or packages are updated, code that was written for one, may not work on the updated version. An obvious example here is the difference between Python versions 2.7.x and 3.x, where code written for one will not work on the other. For example, just try the basic `print 'Hello World'` on Python 2.7.x and 3.x. @@ -40,6 +39,7 @@ One example of a subtle change is the update of R from version 3.5.1 to 3.6.0. 
I r = sample(1:10,1) # = 9 on 3.6.3 # = 3 on 3.5.1 + {{% /example %}} Given that such changes are bound to occur over time, it is important to keep track of which versions (of all packages!) are used for a project if it should still work and produce the same results in the future. @@ -51,30 +51,31 @@ This is where virtual software environments come in. The environment basically i Fortunately, there are software solutions that specify and keep track of software environments. The one we're using here is Conda, which is part of Anaconda which you will have already installed if you installed Python following the instructions [here](/topics/configure-your-computer/statistics-and-computation/python/). Of course, other solutions exist (e.g. in Python itself) to set up such environments. These are linked below. There are multiple advantages to using Conda (or similar setups): + 1. Installs software and packages directly (at least for R and Python) 2. Automatically keeps track of versions 3. Export environments and easily set them up on a different machine (or in the cloud) 4. Easy to use (both command line and graphical interface) 5. Part of Anaconda (which we used on Windows to install Python in the first place) - ### General setup + For each project, create a separate software environment. This ensures that if you're updating versions for one project, it won't affect your other projects. Moreover, it creates fewer dependencies in a given project, as its environment will not also contain dependencies from other projects. Note that now instead of using the familiar `pip install` or `install.packages()` commands in Python and R, it is necessary to install packages through Conda so that they are correctly added to the environment. - {{% tip %}} **Activate the right environment** Having multiple projects means having multiple environments. So when working on a project, always make sure the corresponding environment is activated. {{% /tip %}} - ## Code snippet + {{% codeblock %}} + ```bash # Create new empty environment # (this will ask where to save the environment, default location is usually fine ) @@ -91,6 +92,7 @@ conda install r-data.table # Close environment (switches back to base) conda deactivate ``` + {{% /codeblock %}} ## Using the code @@ -102,10 +104,11 @@ If Conda is installed, the example can be run (line by line) in a terminal to se - If you've installed Python based on [these instructions](/topics/configure-your-computer/statistics-and-computation/python/), Conda should be already available - Alternatively, detailed instructions to install Conda are provided [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html). -Note that with the full Anaconda install environments can also be managed using a graphical interface. Here, we focus only on the command line to have clear steps. Instructions for the graphical interface are provided [here](https://docs.anaconda.com/anaconda/navigator/topics/manage-environments/). +Note that with the full Anaconda install environments can also be managed using a graphical interface. Here, we focus only on the command line to have clear steps. Instructions for the graphical interface are provided [here](https://docs.anaconda.com/navigator/tutorials/manage-environments/). ## Examples -1. Add *conda-forge* channel to get newer R version + +1. 
Add _conda-forge_ channel to get newer R version # Add 'conda-forge' channel (provides more recent versions, # and a lot of additional software) @@ -116,13 +119,12 @@ Note that with the full Anaconda install environments can also be managed using conda activate test_env conda install r-base=4.0.3 - -2. Search for available packages +2. Search for available packages # lists packages starting with r-data conda search r-data* -3. Export environment into file (to be installed on a different machine) +3. Export environment into file (to be installed on a different machine) # Activate environment to be saved conda activate test_env @@ -134,29 +136,28 @@ Note that with the full Anaconda install environments can also be managed using # (required if environment ported to different OS, see below) conda env export --from-history > test_env.yml -4. Import and set up environment from file +4. Import and set up environment from file # Create environment based on yml-file (created as above, in same directory) conda env create -f test_env.yml -5. Deactivate environment (gets back to base) +5. Deactivate environment (gets back to base) conda deactivate -6. Remove environment +6. Remove environment conda env remove --name test_env Note that there's a bug; software environments sometimes can only be removed if at least one package was installed. -7. List installed environments +7. List installed environments conda info --envs - ## OS dependence -Sometimes we work across different operating systems. For example, you may develop and test code on a Windows desktop, before then running it on a server that runs on Linux. As the Conda environment contains all dependencies, it will also list some low level tools that may not be available on a different OS. In this case, when trying to set up the same environment from a *.yml* file created on a different OS it will fail. +Sometimes we work across different operating systems. For example, you may develop and test code on a Windows desktop, before then running it on a server that runs on Linux. As the Conda environment contains all dependencies, it will also list some low level tools that may not be available on a different OS. In this case, when trying to set up the same environment from a _.yml_ file created on a different OS it will fail. {{% example %}} Check the output of `conda list`, which will list various dependencies you never explicitly requested. @@ -164,7 +165,7 @@ Check the output of `conda list`, which will list various dependencies you never In this case, it is possible to use the `--from-history` option (see Example 3 above). When creating the environment `.yml` file, this option makes it so that the generated `.yml` file will only contain the packages explicitly requested (i.e. those you at one point added through `conda install`), while the lower level dependencies (e.g. compiler, BLAS library) are not added. If the requested packages exist for the different OS, this usually should work as the low level dependencies will be automatically resolved when setting up the environment. -## Additional Resources +## Additional Resources 1. 
More information on Conda environments: [https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html](https://conda.io/projects/conda/en/latest/user-guide/concepts/environments.html) diff --git a/content/topics/Collect-store/data-collection/APIs/extract-data-api.md b/content/topics/Collect-store/data-collection/APIs/extract-data-api.md index 95f61c8d0..e7075a768 100644 --- a/content/topics/Collect-store/data-collection/APIs/extract-data-api.md +++ b/content/topics/Collect-store/data-collection/APIs/extract-data-api.md @@ -13,8 +13,9 @@ aliases: --- ## Extracting Data From APIs -An Application Programming Interface (*API*) is a version of a website intended for computers, rather than humans, to talk to one another. APIs are everywhere, and most are used to *provide data* (e.g., retrieve a user name and demographics), *functions* (e.g., start playing music from Spotify, turn on your lamps in your "smart home"), or -*algorithms* (e.g., submit an image, retrieve a written text for what's on the image). + +An Application Programming Interface (_API_) is a version of a website intended for computers, rather than humans, to talk to one another. APIs are everywhere, and most are used to _provide data_ (e.g., retrieve a user name and demographics), _functions_ (e.g., start playing music from Spotify, turn on your lamps in your "smart home"), or +_algorithms_ (e.g., submit an image, retrieve a written text for what's on the image). APIs work similarly to websites. At the core, you obtain code that computers can easily understand to process the content of a website, instead of obtaining the source code of a rendered website. APIs provide you with simpler and more scalable ways to obtain data, so you must understand how they work. @@ -25,17 +26,18 @@ In practice, most APIs require user authentication to get started. Each API has {{% tip %}} One of the major advantages of APIs is that you can directly access the data you need without all the hassle of selecting the right HTML tags. Another advantage is that you can often customize your API request (e.g., the first 100 comments or only posts about science), which may not always be possible in the web interface. Last, using APIs is legitimized by a website (mostly, you will have to pay a license fee to use APIs!). So it's a more stable and legit way to retrieve web data compared to web scraping. That's also why we recommend using an API whenever possible. -In practice, though, APIs really can't give you all the data you possibly want, and web scraping allows you to access complementary data (e.g., viewable on a website or somewhere hidden in the source code). See our tutorial [web-scraping and api mining](https://tilburgsciencehub.com/topics/code-like-a-pro/web-scraping/web-scraping-tutorial/) for all the details. +In practice, though, APIs really can't give you all the data you possibly want, and web scraping allows you to access complementary data (e.g., viewable on a website or somewhere hidden in the source code). See our tutorial [web-scraping and api mining](../web-scraping/web-scraping-tutorial.md) for all the details. {{% /tip %}} - ## Examples ### icanhazdadjoke + Every time you visit the [site](https://icanhazdadjoke.com), the site shows a random joke. From a technical perspective, each time a user opens the site, a little software program on the server makes an API call to the daddy joke API to draw a new joke to be displayed. Note that this is one of the few APIs that does not require any authentication. 
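As a quick first call, the sketch below simply requests one random joke and asks for the response in JSON format. This is a hedged example: it assumes the API returns a JSON object with a `joke` field when the `Accept: application/json` header is sent.

{{% codeblock %}}

```Python
# Minimal sketch: fetch a single random joke as JSON
# (the Accept header asks the API for JSON instead of an HTML page)
import requests

response = requests.get("https://icanhazdadjoke.com/",
                        headers={"Accept": "application/json"})

random_joke = response.json()
print(random_joke["joke"])  # assumes the response contains a "joke" field
```

{{% /codeblock %}}

The block below queries the API's search endpoint instead, which returns jokes matching a search term.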
{{% codeblock %}} + ```Python import requests search_url = "https://icanhazdadjoke.com/search" @@ -46,12 +48,14 @@ response = requests.get(search_url, joke_request = response.json() print(joke_request) ``` + {{% /codeblock %}} -*Multiple seeds* +_Multiple seeds_ Rather than having a fixed endpoint like the one above (e.g., always search for cat jokes), you can restructure your code to allow for variable input. For example, you may want to create a function `search_api()` that takes an input parameter `search_term` that you can assign any value you want. {{% codeblock %}} + ```Python def search_api(search_term): search_url = "https://icanhazdadjoke.com/search" @@ -63,24 +67,22 @@ def search_api(search_term): search_api("dog") search_api("cow") ``` + {{% /codeblock %}} -*Pagination* +_Pagination_ Transferring data is costly - not strictly in a monetary sense, but in time. So - APIs are typically very greedy in returning data. Ideally, they only produce a very targeted data point that is needed for the user to see. It saves the website owner from paying for bandwidth and guarantees that the site responds fast to user input. By default, each page contains 20 jokes, where page 1 shows jokes 1 to 20, page 2 jokes 21 to 40, ..., and page 33 jokes 641 to 649. You can adjust the number of results on each page (max. 30) with the limit parameter (e.g., `params={"limit": 10}`). In practice, almost every API on the web limits the results of an API call (100 is also a common cap). As an alternative, you can specify the current page number with the `page` parameter (e.g., `params={"term": "", "page": 2}`). - - - ### Reddit + Reddit is a widespread American social news aggregation and discussion site. The service uses an API to generate the website's content and grants public access to the API. To request data from the Reddit API, we need to include headers in our request. HTTP headers are a vital part of any API request, containing meta-data associated with the request (e.g., type of browser, language, expected data format, etc.). - {{% codeblock %}} ```python @@ -92,9 +94,9 @@ headers = {'authority': 'www.reddit.com', 'cache-control': 'max-age=0', 'upgrade response = requests.get(url, headers=headers) json_response = response.json() ``` -{{% /codeblock %}} +{{% /codeblock %}} ## See Also -* The Spotify Web API [tutorial](https://github.com/hannesdatta/course-odcm/blob/master/content/docs/topics/apisadvanced/api-advanced.ipynb) used in the Online Data Management Collection course illustrates how to generate access tokens and use a variety of API endpoints. +- The Spotify Web API [tutorial](https://github.com/hannesdatta/course-odcm/blob/master/content/docs/project/resources/tutorials/apisadvanced/api-advanced.ipynb) used in the Online Data Management Collection course illustrates how to generate access tokens and use a variety of API endpoints. diff --git a/content/topics/Collect-store/data-collection/web-scraping/scrape-dynamic-websites.md b/content/topics/Collect-store/data-collection/web-scraping/scrape-dynamic-websites.md index e1c1e11c0..433c51697 100644 --- a/content/topics/Collect-store/data-collection/web-scraping/scrape-dynamic-websites.md +++ b/content/topics/Collect-store/data-collection/web-scraping/scrape-dynamic-websites.md @@ -11,66 +11,74 @@ aliases: --- ## Scrape Dynamic Websites -While it's easy to get started with Beautifulsoup, it has limitations when it comes to *scraping dynamic websites*. That is websites of which the content changes after each page refresh. 
Selenium can handle both static and dynamic websites and mimic user behavior (e.g., scrolling, clicking, logging in). It launches another web browser window in which all actions are visible which makes it feel more intuitive. Here we outline the basic commands and installation instructions to get you started. + +While it's easy to get started with Beautifulsoup, it has limitations when it comes to _scraping dynamic websites_. That is websites of which the content changes after each page refresh. Selenium can handle both static and dynamic websites and mimic user behavior (e.g., scrolling, clicking, logging in). It launches another web browser window in which all actions are visible which makes it feel more intuitive. Here we outline the basic commands and installation instructions to get you started. ## Code ### Install Selenium and make it work for Chromedriver + {{% warning %}} The first time you will need to: - 1. Install the Python package for **selenium** by typing in the command `pip install selenium`. Alternatively, open Anaconda Prompt (Windows) or the Terminal (Mac), type the command `conda install selenium`, and agree to whatever the package manager wants to install or update (usually by pressing `y` to confirm your choice). - 2. Download a web driver to interface with a web browser, we recommend the **Webdriver Manager for Python**. Install it by typing in the command `pip install webdriver-manager` +1. Install the Python package for **selenium** by typing in the command `pip install selenium`. Alternatively, open Anaconda Prompt (Windows) or the Terminal (Mac), type the command `conda install selenium`, and agree to whatever the package manager wants to install or update (usually by pressing `y` to confirm your choice). + +2. Download a web driver to interface with a web browser, we recommend the **Webdriver Manager for Python**. Install it by typing in the command `pip install webdriver-manager` - {{% /warning %}} +{{% /warning %}} Once Selenium and webdriver manager are installed, you can now install ChromeDriver as follows: - {{% codeblock %}} - ```Python - # Make Selenium and chromedriver work for a dynamic website (here: untappd.com) - from selenium import webdriver - from selenium.webdriver.chrome.service import Service as ChromeService - from webdriver_manager.chrome import ChromeDriverManager - - driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install())) +{{% codeblock %}} - url = "https://untappd.com/" - driver.get(url) +```Python +# Make Selenium and chromedriver work for a dynamic website (here: untappd.com) +from selenium import webdriver +from selenium.webdriver.chrome.service import Service as ChromeService +from webdriver_manager.chrome import ChromeDriverManager + +driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install())) + +url = "https://untappd.com/" +driver.get(url) + +``` - ``` - {{% /codeblock %}} +{{% /codeblock %}} If you want to block all pop-ups automatically (e.g. 
cookie pop-ups), you can use the following code: - {{% codeblock %}} - ```Python - # Make Selenium and chromedriver work for Untappd.com - - from selenium import webdriver - from selenium.webdriver.chrome.service import Service - from selenium.webdriver.chrome.options import Options - from webdriver_manager.chrome import ChromeDriverManager - - # Set Chrome Options - options = Options() - options.add_experimental_option("excludeSwitches", ["disable-popup-blocking"]) - - # Initialize the Chrome driver - driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options) - - ``` - {{% /codeblock %}} +{{% codeblock %}} + +```Python +# Make Selenium and chromedriver work for Untappd.com + +from selenium import webdriver +from selenium.webdriver.chrome.service import Service +from selenium.webdriver.chrome.options import Options +from webdriver_manager.chrome import ChromeDriverManager + +# Set Chrome Options +options = Options() +options.add_experimental_option("excludeSwitches", ["disable-popup-blocking"]) + +# Initialize the Chrome driver +driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options) + +``` + +{{% /codeblock %}} {{% tip %}} -Want to run Chromedriver manually? Check our [page](https://tilburgsciencehub.com/topics/configure-your-computer/task-specific-configurations/configuring-python-for-webscraping/) on Configuring Python for Web Scraping +Want to run Chromedriver manually? Check our [page](../../../Computer-Setup/software-installation/Python/configuring-python-for-webscraping.md) on Configuring Python for Web Scraping {{% /tip %}} - ### Finding content in a website's source code + Running the code snippet below starts a new Google Chrome browser (`driver`) and then navigates to the specified URL. In other words, you can follow along with what the computer does behind the screens. Next, you can obtain specific website elements by tag name (e.g., `h1` is a header), similar to BeautifulSoup. {{% codeblock %}} + ```Python import selenium.webdriver from selenium.webdriver.common.by import By @@ -89,16 +97,18 @@ soup = BeautifulSoup(driver.page_source) # ensure that the site has fully been l soup.find('h1').get_text() ``` -{{% /codeblock %}} +{{% /codeblock %}} #### Selectors + Alternatively, you can specify which elements to extract through attributes, classes, and identifiers: {{% codeblock %}} + ```Python # HTML classes -driver.find_element(By.CLASS_NAME, "").text +driver.find_element(By.CLASS_NAME, "").text # HTML identifiers () driver.find_element(By.ID,"").text @@ -107,34 +117,36 @@ driver.find_element(By.ID,"").text driver.find_element(By.XPATH,"").text ``` -{{% /codeblock %}} +{{% /codeblock %}} {{% tip %}} Within Google Inspector you can easily obtain the XPath by right-clicking on an element and selecting: "Copy" > "Copy XPath". {{% /tip %}} - #### User interactions + One of the distinguishable features of Selenium is the ability to mimic user interactions which can be vital to get to the data you are after. For example, older tweets are only loaded once you scroll down the page. 
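Because such content is loaded on the fly, it can take a moment before an element actually exists on the page. A common pattern is to wait for it explicitly before interacting with it. The sketch below uses Selenium's `WebDriverWait` and assumes the `driver` object created earlier; `"some-class"` is only a placeholder class name. The block that follows then shows the basic scrolling and clicking commands.

{{% codeblock %}}

```Python
# Sketch of an explicit wait: pause (up to 10 seconds) until an element
# with a given class is present before reading or clicking it.
# "some-class" is a placeholder; reuse the driver created above.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "some-class"))
)
print(element.text)
```

{{% /codeblock %}}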
{{% codeblock %}} + ```Python # scroll down the page driver.execute_script('window.scrollTo(0, document.body.scrollHeight);') # click on an element (e.g., button) -element_to_be_clicked = driver.find_element(By.CLASS_NAME, "").text +element_to_be_clicked = driver.find_element(By.CLASS_NAME, "").text element_to_be_clicked.click() ``` -{{% /codeblock %}} +{{% /codeblock %}} ## Advanced Use Case ### Task Scheduling -See the building block on [task automation](http://tilburgsciencehub.com/topics/automate-and-execute-your-work/automate-your-workflow/task-scheduling/) on how to schedule the execution of the web scraper (e.g., every day). Keep in mind that this only works with Python scripts, so if you're currently working in a Jupyter Notebook you need to transform it into a `.py` file first. +See the building block on [task automation](../../../Automation/automation-tools/task-automation/task-scheduling.md) on how to schedule the execution of the web scraper (e.g., every day). Keep in mind that this only works with Python scripts, so if you're currently working in a Jupyter Notebook you need to transform it into a `.py` file first. ## See Also -* Looking for a simple solution that does the job without any bells and whistles? Try out the BeautifulSoup package and follow our [web-scraping for static websites building block](/topics/collect-data/webscraping-apis/scrape-static-websites/). + +- Looking for a simple solution that does the job without any bells and whistles? Try out the BeautifulSoup package and follow our [web-scraping for static websites building block](/topics/collect-data/webscraping-apis/scrape-static-websites/). diff --git a/content/topics/Collect-store/data-collection/web-scraping/scrape-static-websites.md b/content/topics/Collect-store/data-collection/web-scraping/scrape-static-websites.md index 76ad31a6c..5eea13398 100644 --- a/content/topics/Collect-store/data-collection/web-scraping/scrape-static-websites.md +++ b/content/topics/Collect-store/data-collection/web-scraping/scrape-static-websites.md @@ -11,13 +11,15 @@ aliases: --- ## Scrape Static Websites + Say that you want to capture and analyze data from a website. Of course, you could simply copy-paste the data from each page but you would quickly run into issues. What if the data on the page gets updated (i.e., would you have time available to copy-paste the new data, too)? Or what if there are simply so many pages that you can't possibly do it all by hand (i.e., thousands of product pages)? -Web scraping can help you overcome these issues by *programmatically* grabbing data from the web. The tools best suited for the job depend on the type of website: static or dynamic. In this building block, we focus on static websites, which always return the same information. +Web scraping can help you overcome these issues by _programmatically_ grabbing data from the web. The tools best suited for the job depend on the type of website: static or dynamic. In this building block, we focus on static websites, which always return the same information. ## Code ### Requests + Each site is made up of HTML code that you can view with, for example, Google Inspector. This source code contains data about the look and feel and contents of a website. 
To store the source code of a website in Python, you can use the `get()` method from the `requests` package, while for R, the `rvest` package is employed for the same purpose: {{% codeblock %}} @@ -43,12 +45,15 @@ url = "https://www.abcdefhijklmnopqrstuvwxyz.nl" # read the html from the URL source_code = read_html(url) ``` + {{% /codeblock %}} ### Seed Generation + In practice, you typically scrape data from a multitude of links (also known as the seed). Each URL has a fixed and variable part. The latter may be a page number that increases by increments of 1 (e.g., `page-1`, `page-2`). The code snippet below provides an example of how you can quickly generate such a list of URLs. {{% codeblock %}} + ```Python base_url = # the fixed part of the URL num_pages = # the number of pages you want to scrape @@ -58,6 +63,7 @@ for counter in range(1, num_pages+1): full_url = base_url + "page-" + str(counter) + ".html" page_urls.append(full_url) ``` + ```R base_url = # the fixed part of the URL num_pages = # the number of pages you want to scrape @@ -66,24 +72,26 @@ page_urls = character(num_pages) for (i in 1:num_pages) { page_urls[i] = paste0(base_url,"page-",i)} ``` + {{% /codeblock %}} {{% tip %}} Rather than inserting a fixed number of pages (`num_pages`), you may want to leverage the page navigation menu instead. For example, you could extract the next page url (if it exists) from a "Next" button. {{% /tip %}} - ### Beautifulsoup Next, once we have imported the `source_code`, it is a matter of extracting specific elements. The `BeautifulSoup` package has several built-in methods that simplify this process significantly. -Moreover, the `rvest` package in R provides tools and functions that allow you to extract data from HTML and XML documents, making it easier to gather information from websites. +Moreover, the `rvest` package in R provides tools and functions that allow you to extract data from HTML and XML documents, making it easier to gather information from websites. #### Finding content in a website's source code + The `.find()` and `.find_all()` methods search for matching HTML elements (e.g., `h1` is a header). While `.find()` always prints out the first matching element, `find_all()` captures all of them and returns them as a list. Often, you are specifically interested in the text you have extracted and less so in the HTML tags. To get rid of them, you can use the `.get_text()` method. In R, when using the `rvest` package, an equivalent approach involves the `html_elements` method. To access the text content without the HTML tags, you can further apply the `html_text()` function. {{% codeblock %}} + ```Python from bs4 import BeautifulSoup @@ -101,6 +109,7 @@ print(soup.find_all('h2')[0]) # strip HTML tags from element print(soup.find_all('h2')[0].get_text()) ``` + ```R # the first matching

element html_element(source_code,'h1') @@ -114,17 +123,19 @@ html_elements(source_code,'h2')[1] # strip HTML tags from element html_text(html_elements(source_code,'h2')[1]) ``` -{{% /codeblock %}} +{{% /codeblock %}} {{% tip %}} In practice, you often find yourself in situations that require chaining one or more commands, for example `soup.find('table').find_all('tr')[1].find('td')` looks for the first column (`td`) of the second row (`tr`) in the `table`. {{% /tip %}} #### Selectors + Rather than searching by HTML tag, you can specify which elements to extract through attributes, classes, and identifiers: {{% codeblock %}} + ```Python # element attributes soup.find("a").attrs["href"] @@ -135,6 +146,7 @@ soup.find(class_ = "") # HTML identifiers soup.find(id = "") ``` + ```R # element attributes html_attr(html_elements(source_code,"a"),"href") @@ -149,16 +161,16 @@ html_elements(source_code,"#>") {{% /codeblock %}} {{% tip %}} -You can combine HTML tags and selectors like this: -```soup.find("h1", class_ = "menu-header"]``` +You can combine HTML tags and selectors like this: +`soup.find("h1", class_ = "menu-header"]` {{% /tip %}} - ## Advanced Use Case ### Task Scheduling -See the building block on [task automation](http://tilburgsciencehub.com/topics/automate-and-execute-your-work/automate-your-workflow/task-scheduling/) on how to schedule the execution of the web scraper (e.g., every day). Keep in mind that this only works with Python scripts, so if you're currently working in a Jupyter Notebook you need to transform it into a `.py` file first. +See the building block on [task automation](../../../Automation/automation-tools/task-automation/task-scheduling.md) on how to schedule the execution of the web scraper (e.g., every day). Keep in mind that this only works with Python scripts, so if you're currently working in a Jupyter Notebook you need to transform it into a `.py` file first. ## See Also -* If you're aiming to strive for a dynamic website, such as a social media site, please consult our building block [web-scraping dynamic websites](/topics/collect-data/webscraping-apis/scrape-dynamic-websites/) building block. + +- If you're aiming to strive for a dynamic website, such as a social media site, please consult our building block [web-scraping dynamic websites](/topics/collect-data/webscraping-apis/scrape-dynamic-websites/) building block. diff --git a/content/topics/Collect-store/data-storage/commercial-cloud/mem-storage-gcp.md b/content/topics/Collect-store/data-storage/commercial-cloud/mem-storage-gcp.md index 3c2fe3930..7b8e111f9 100644 --- a/content/topics/Collect-store/data-storage/commercial-cloud/mem-storage-gcp.md +++ b/content/topics/Collect-store/data-storage/commercial-cloud/mem-storage-gcp.md @@ -1,19 +1,19 @@ --- -title: "Handle storage within your Google Cloud instance" +title: "Handle storage within your Google Cloud instance" description: "After configuring a Google Cloud instance with GPUs, learn to import heavy files from Google Drive" keywords: "Docker, Environment, Python, Jupyter notebook, Google cloud, Cloud computing, Cloud storage, GPU, Virtual Machine, Instance, Bucket, Storage" weight: 2 author: "Fernando Iscar" authorlink: "https://www.linkedin.com/in/fernando-iscar/" draft: false -date: 2023-09-16 -aliases: +date: 2023-09-16 +aliases: - /manage/storage --- ## Overview -In this article, we build upon the foundational steps outlined in [Configure a VM with GPUs in Google Cloud](/topics/automation/replicability/cloud-computing/config-vm-gcp/). 
+In this article, we build upon the foundational steps outlined in [Configure a VM with GPUs in Google Cloud](../../../Automation/Replicability/cloud-computing/config-VM-GCP.md). After configuring our instance, we may find the necessity to manage large files proficiently. @@ -22,13 +22,14 @@ In this guide, you will learn how to: - Connect our Docker container with Google Cloud Storage using GCSFuse. - Use Google Colab for seamless file transfers between Google Drive and Google Cloud Storage. -## Connect the Docker container to a GCS bucket +## Connect the Docker container to a GCS bucket [Storage buckets](https://cloud.google.com/storage/docs/buckets) are the native storage units of the Google Cloud platform. Linking them to your virtual machine offers integrated and efficient data management, simplifying scaling, backup, and access from anywhere in the Google Cloud infrastructure, eliminating, for example, the need to perform manual uploads/downloads via the Google Cloud's browser interface. To do this, first of all, you will need to install [GCSFuse](https://github.com/GoogleCloudPlatform/gcsfuse) in your instance by running the following code: {{% codeblock %}} + ```bash #1 $ export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s` @@ -41,22 +42,24 @@ $ sudo apt-get update #3 $ sudo apt-get install gcsfuse ``` + {{% /codeblock %}} Now, create a new directory in your virtual machine, which will be the one connected to your bucket. After that, you can run the following, substituting 'YOUR_BUCKET' with the name of the storage bucket you wish to connect with your instance, and 'PATH_TO/NEW_DIRECTORY' to the path of the newly created directory that will host that connection: {{% codeblock %}} + ```bash $ sudo gcsfuse -o allow_other YOUR_BUCKET PATH_TO/NEW_DIRECTORY ``` + {{% /codeblock %}} This code will tell GCSFuse to synchronize your new directory with your bucket. Then, you would be able to store any output produced in your projects in this new directory that you just created in your instance and it will be immediately at your disposal within your GCS bucket, and vice versa. - ## Import large files: Use Google Colab as a bridge -The transfer and management of large files can be challenging due to extended upload/download durations. Yet, having connected your Google Cloud virtual machine to your storage buckets, you can swiftly transfer files that you already have hosted on **Google Drive** and **GCS Buckets** (And in turn to your virtual machine as well) using **Google Colab** as a bridge. +The transfer and management of large files can be challenging due to extended upload/download durations. Yet, having connected your Google Cloud virtual machine to your storage buckets, you can swiftly transfer files that you already have hosted on **Google Drive** and **GCS Buckets** (And in turn to your virtual machine as well) using **Google Colab** as a bridge. This bypasses the need to manually Download/Upload files from Drive to your local machine and then to your cloud instance, potentially saving you significant amounts of time in the process. @@ -65,44 +68,50 @@ It is suggested to utilize the same account for all Google tools. Doing so simpl {{% tip %}} **Copy all directories you need** -When synchronizing the bucket with your directory inside the container, if your bucket contains other directories as well, the command `$ sudo gcsfuse -o allow_other your-bucket volume_path/new_dir/` might not copy those directories. 
+When synchronizing the bucket with your directory inside the container, if your bucket contains other directories as well, the command `$ sudo gcsfuse -o allow_other your-bucket volume_path/new_dir/` might not copy those directories. Try `$ sudo gcsfuse --implicit-dirs -o allow_other your-bucket volume_path/new_dir/` to make sure the implicit directories are also included if you want them to. {{% /tip %}} -To start with, make sure your bucket is created. To do so, you can follow [these](https://cloud.google.com/storage/docs/creating-buckets) short guidelines. +To start with, make sure your bucket is created. To do so, you can follow [these](https://cloud.google.com/storage/docs/creating-buckets) short guidelines. -To continue, launch a notebook in **Google Colab**, preferably the one you plan to run on your instance later. This approach helps you maintain a record of the code you've used. +To continue, launch a notebook in **Google Colab**, preferably the one you plan to run on your instance later. This approach helps you maintain a record of the code you've used. Subsequently, to mount your **Google Drive** to this notebook, use the following code: {{% codeblock %}} + ```python from google.colab import drive drive.mount('/content/gdrive') ``` + {{% /codeblock %}} Next, we authenticate our user: {{% codeblock %}} + ```python from google.colab import auth auth.authenticate_user() ``` + {{% /codeblock %}} After, we set our **project ID**. This is the identifier corresponding to your **GCS bucket** that you intend to populate with files: {{% codeblock %}} + ```python project_id = 'your_project_id' !gcloud config set project {project_id} ``` + {{% /codeblock %}} The `gcloud config set project {project_id}` command configures the gcloud command-line tool to use the specified **project ID** by default. @@ -110,19 +119,20 @@ The `gcloud config set project {project_id}` command configures the gcloud comma Finally, we define our bucket's name and use the `gsutil -m cp -r` command to recursively copy the directory you're interested in sending from our mounted **Google Drive** to our specified **GCS bucket**: {{% codeblock %}} + ```python bucket_name = 'your_bucket_name' !gsutil -m cp -r /* gs://{bucket_name}/ ``` + {{% /codeblock %}} -The output of this last cell will say at the end *"Operation completed..."*. +The output of this last cell will say at the end _"Operation completed..."_. Now your data should be available in your **GCS bucket** and can be accessed from your Google Cloud instance. Refresh your bucket webpage and confirm it. After following the steps outlined so far, you will be able to import your files and execute your code within the Docker container. - {{% summary %}} **Efficiently handle large files in Google Cloud:** diff --git a/content/topics/Manage-manipulate/Loading/getting-started/spss-files-in-r.md b/content/topics/Manage-manipulate/Loading/getting-started/spss-files-in-r.md index d81e8b9c3..e024770ca 100644 --- a/content/topics/Manage-manipulate/Loading/getting-started/spss-files-in-r.md +++ b/content/topics/Manage-manipulate/Loading/getting-started/spss-files-in-r.md @@ -12,6 +12,7 @@ aliases: --- # Learning goals + - Read, process and manipulate SPSS (or other labelled) data in R while maintaining the labels - Learn the most common operations on labels - Enable you to seamlessly work with both R and SPSS so you can split your workflow between the two as you (or your teammates) wish @@ -51,7 +52,7 @@ For this tutorial, we'll work with the `efc` dataset on informal care. 
It comes {{% codeblock %}} -``` r +```r # attach the dataset and create a dataframe data(efc) df <- efc @@ -59,13 +60,13 @@ df <- efc {{% /codeblock %}} -SPSS works with both *variable labels* and *value labels*. *Value labels* can be very handy to store relevant metadata. For example, you can store the whole question that the participants got to see. +SPSS works with both _variable labels_ and _value labels_. _Value labels_ can be very handy to store relevant metadata. For example, you can store the whole question that the participants got to see. {{% tip %}} Unfortunately, R doesn't do the best job of communicating the labels to you as the user. When we print the dataset to the console, we don't see any indication that we are dealing with labelled data. -If we `View()` the dataframe, we do at least see the *variable labels* in the columns. +If we `View()` the dataframe, we do at least see the _variable labels_ in the columns. You can inspect the data labels with dedicated `sjlabelled` functions, though. `get_label()` returns a named vector of the variable labels, while `get_labels()` returns a named list of named vectors containing the value labels. @@ -135,8 +136,8 @@ This data frame does not carry any labels anymore, just like a "normal" R datafr {{% codeblock %}} ```R -df_processed |> - var_labels(!!!variable_labels) |> +df_processed |> + var_labels(!!!variable_labels) |> get_label() ``` @@ -147,8 +148,8 @@ Analogously, we can re-assign the value labels: {{% codeblock %}} ```R -df_processed |> - val_labels(!!!value_labels) |> +df_processed |> + val_labels(!!!value_labels) |> get_labels() ``` @@ -159,7 +160,7 @@ Now let's fully re-label the "processed" dataframe. As you saw before, both `var {{% codeblock %}} ```R -df_proc_relabelled <- df_processed |> +df_proc_relabelled <- df_processed |> select(order(colnames(df_processed))) |> # order the columns alphabetically var_labels(!!!variable_labels) |> val_labels(!!!value_labels) @@ -172,9 +173,9 @@ Both `var_labels()` and `val_labels()` automatically skip variables that are giv {{% codeblock %}} ```R -df_processed |> +df_processed |> select(!e15relat) |> # removes the `e15relat` variable - var_labels(!!!(variable_labels)) |> + var_labels(!!!(variable_labels)) |> get_label() # Warning: Following elements are no valid column names in `x`: e15relat @@ -187,8 +188,8 @@ If you don't want the warning, you can create a makeshift version of `any_of()` {{% codeblock %}} ```R -df_processed |> - select(!e15relat) |> +df_processed |> + select(!e15relat) |> var_labels(!!!(variable_labels[names(variable_labels) %in% colnames(.data)])) ``` @@ -203,14 +204,14 @@ which_exist <- function(variable_labels) { variable_labels[names(variable_labels) %in% colnames(.data)] } -df_processed |> - select(!e15relat) |> +df_processed |> + select(!e15relat) |> var_labels(!!!(which_exist(variable_labels))) ``` {{% /codeblock %}} -If you happen to find a better way to only consider variables that exist in the dataframe, please [let us know](https://tilburgsciencehub.com/topics/more-tutorials/contribute-to-tilburg-science-hub/contribute/)! +If you happen to find a better way to only consider variables that exist in the dataframe, please [let us know](../../../Collaborate-share/Project-management/contribute-to-tilburg-science-hub/contribute.md)! 
### Labeling data @@ -243,8 +244,8 @@ mtcars_var_labels <- c( gear = "Number of forward gears", carb = "Number of carburetors" ) -mtcars_labelled <- mtcars |> - var_labels(!!!(mtcars_var_labels)) +mtcars_labelled <- mtcars |> + var_labels(!!!(mtcars_var_labels)) ``` {{% /codeblock %}} @@ -283,8 +284,8 @@ We can then perform conditional operations based on the `ordered` flag: ```R # flag all variables related to coping ("cop") are ordinal data -df_processed_ordered <- df_processed |> - as_ordered_cols(contains("cop")) +df_processed_ordered <- df_processed |> + as_ordered_cols(contains("cop")) df_processed_ordered |> select(where(is.ordered)) @@ -296,7 +297,7 @@ To convert factors back to unordered (i.e. nominal) ones, you can use: {{% codeblock %}} -``` r +```r df$c82cop1 <- factor(df$c82cop1, ordered = FALSE) ``` @@ -310,8 +311,8 @@ Sometimes ordinal data is interpreted as numerical data. One example for this is ```R df_processed |> print() # text levels -df_processed_numeric <- df_processed |> - as_numeric() |> +df_processed_numeric <- df_processed |> + as_numeric() |> print() # numeric levels, the text levels are stored in the value labels ``` @@ -324,9 +325,9 @@ We can now perform statistical operations. By calling `rowwise()`, we can ensure {{% codeblock %}} ```R -df_processed_numeric |> +df_processed_numeric |> select(c(c82cop1, c89cop8, c90cop9)) |> # positive measures - rowwise() |> + rowwise() |> mutate( positive_score = mean(c(c82cop1, c89cop8, c90cop9)) ) @@ -336,13 +337,13 @@ df_processed_numeric |> ## Export data back to SPSS -After we imported the SPSS file, we can clean and analyze the data as we normally would. Afterwards, we can save it either as a `.csv` or we can go back to a `.sav` file. In the latter case we need to go back to the SPSS standard, i.e. keep the *variable labels* and replace factors with numerical IDs plus corresponding value labels. Conveniently, both can be done automatically in R! This is the only time that we will be using a function that is not part of `sjlabelled`, but the `haven` package. We don't recommend attaching `haven` with `library()`, though. Instead we will be calling it with the `::` notation. Factor columns will be treated as nominal measures by default, so make sure to flag ordered variables (e.g. with our custom function `as_ordered_cols()`) before exporting. +After we imported the SPSS file, we can clean and analyze the data as we normally would. Afterwards, we can save it either as a `.csv` or we can go back to a `.sav` file. In the latter case we need to go back to the SPSS standard, i.e. keep the _variable labels_ and replace factors with numerical IDs plus corresponding value labels. Conveniently, both can be done automatically in R! This is the only time that we will be using a function that is not part of `sjlabelled`, but the `haven` package. We don't recommend attaching `haven` with `library()`, though. Instead we will be calling it with the `::` notation. Factor columns will be treated as nominal measures by default, so make sure to flag ordered variables (e.g. with our custom function `as_ordered_cols()`) before exporting. 
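For the `.csv` route, keep in mind that a plain text file cannot store variable or value labels, so the factor columns are simply written out as their text levels. A minimal sketch with base R (the file path is a placeholder):

{{% codeblock %}}

```R
# Sketch of the .csv route: labels are not stored in a csv,
# so factor columns end up as their text levels.
# "data/df_processed.csv" is a placeholder path.
write.csv(df_factor, "data/df_processed.csv", row.names = FALSE)
```

{{% /codeblock %}}

The block below shows the `.sav` route with `haven::write_sav()`, flagging the ordered variables first.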
{{% codeblock %}} ```R -df_factor |> - as_ordered_cols(contains("cop")) |> +df_factor |> + as_ordered_cols(contains("cop")) |> haven::write_sav("data/df_processed.sav") ``` @@ -356,8 +357,8 @@ If you want to export the numeric version of the dataframe, make sure to convert ```R df_processed_numeric |> - as_ordered_cols(contains("cop")) |> - as_label() |> + as_ordered_cols(contains("cop")) |> + as_label() |> haven::write_sav("data/df_processed_num.sav") ``` @@ -367,16 +368,16 @@ df_processed_numeric |> Here are the key takeaways from this building block: -- You can seamlessly work with both SPSS and R for your labelled data -- We recommend the `sjlabelled` package to handle SPSS (and labelled data in general) in R. -- Labelled data can have **variable** and **value** labels (aka levels). - - Use `get_label()` to inspect the **variable** labels. - - Use `get_labels()` to inspect the **value** labels. -- Use `as_label()` to replace the numerical level IDs with the text levels. -- To assign labels to the data with `dplyr`-syntax use - - `var_labels(!!!variable_labels)` - - `val_lables(!!!value_labels)` - - `variable_labels` and `value_labels` are (nested) named lists with column name-label pairs -- Use `as.ordered()` to transfer categorical data to ordinal data. -- Use `as_numeric()` on ordinal data to before applying quantitative statistics. -- Use `haven::write_sav()` to export the data as SPSS files. +- You can seamlessly work with both SPSS and R for your labelled data +- We recommend the `sjlabelled` package to handle SPSS (and labelled data in general) in R. +- Labelled data can have **variable** and **value** labels (aka levels). + - Use `get_label()` to inspect the **variable** labels. + - Use `get_labels()` to inspect the **value** labels. +- Use `as_label()` to replace the numerical level IDs with the text levels. +- To assign labels to the data with `dplyr`-syntax use + - `var_labels(!!!variable_labels)` + - `val_lables(!!!value_labels)` + - `variable_labels` and `value_labels` are (nested) named lists with column name-label pairs +- Use `as.ordered()` to transfer categorical data to ordinal data. +- Use `as_numeric()` on ordinal data to before applying quantitative statistics. +- Use `haven::write_sav()` to export the data as SPSS files. diff --git a/content/topics/Manage-manipulate/Loading/large-datasets/dask-in-action.md b/content/topics/Manage-manipulate/Loading/large-datasets/dask-in-action.md index dc914d1c3..c3bfadb86 100644 --- a/content/topics/Manage-manipulate/Loading/large-datasets/dask-in-action.md +++ b/content/topics/Manage-manipulate/Loading/large-datasets/dask-in-action.md @@ -13,14 +13,14 @@ aliases: --- ## Using Dask to Describe Data -We learned how to [Handle Large Datasets in Python](https://tilburgsciencehub.com/topics/prepare-your-data-for-analysis/data-preparation/large-datasets-python/) in a general way, but now let's dive deeper into it by implementing a practical example. To illustrate how to use Dask, we perform simple descriptive and analytics operations on a large dataset. We use a flight dataset, available at [Kaggle.com](https://www.kaggle.com/datasets/usdot/flight-delays), containing over 5M flight delays and cancellations from 2015. When loaded with `pandas`, the dataframe occupies more than 1.3GB of memory, which would make operations rather slow. Let's load the data with `dask`. For all operations we also provide the equivalent `pandas` code for comparison. 
+We learned how to [Handle Large Datasets in Python](../large-datasets/large-datasets-python.md) in a general way, but now let's dive deeper into it by implementing a practical example. To illustrate how to use Dask, we perform simple descriptive and analytics operations on a large dataset. We use a flight dataset, available at [Kaggle.com](https://www.kaggle.com/datasets/usdot/flight-delays), containing over 5M flight delays and cancellations from 2015. When loaded with `pandas`, the dataframe occupies more than 1.3GB of memory, which would make operations rather slow. Let's load the data with `dask`. For all operations we also provide the equivalent `pandas` code for comparison. -### Loading the data +### Loading the data The first thing we need to do is import the `dask` library and load the data. Make sure that the downloaded data is in the same working directory as the Python file of the code. In Dask, we do this as follows: -{{% codeblock %}} +{{% codeblock %}} ```python # import the Dask DataFrame module @@ -29,12 +29,13 @@ import dask.dataframe as dd #read the downloaded csv file flights = dd.read_csv('flights.csv', assume_missing=True, dtype={'CANCELLATION_REASON': 'object'}) ``` -{{% /codeblock %}} +{{% /codeblock %}} The equivalent `pandas` code would be: -{{% codeblock %}} +{{% codeblock %}} + ```python # import the pandas module import pandas as pd @@ -43,13 +44,13 @@ import pandas as pd flights = pd.read_csv('flights.csv') ``` -{{% /codeblock %}} +{{% /codeblock %}} {{% tip %}} The arguments `assume_missing` and `dtype` are added because it is likely that `dask` will raise a _ValueError_ due to its attempts to infer the types of all columns by reading a sample of the data from the start of the file. This attempt is sometimes incorrect, thus it may require manually specifying the type each column should have (with `dtype`) or assuming all integer columns have missing values and request their conversion to float (with `assume_missing`). -If you don't mention these arguments when reading the data and it raises the error, Python will suggest a possible solution to fixing it. +If you don't mention these arguments when reading the data and it raises the error, Python will suggest a possible solution to fixing it. {{% /tip %}} @@ -67,14 +68,15 @@ flights.npartitions #display dataframe flights.compute() ``` + {{% /codeblock %}} Since `pandas` does not partition a dataset there are no equivalent commands. - We can see the column names and shape of the dataframe too. In `dask`: {{% codeblock %}} + ```python #display dataframe columns flights.columns @@ -83,29 +85,29 @@ flights.columns flights.shape #outputs delayed object flights.shape[0].compute() #calculates the number of rows ``` + {{% /codeblock %}} -Note that the output of the `.shape()` method in `dask` doesn't immediately return output. +Note that the output of the `.shape()` method in `dask` doesn't immediately return output. Instead, it gives this: `(Delayed('int-4b01ce40-f552-432c-b591-da8955b3ea9c'), 31)`. This is because `dask` uses lazy evaluation - it delays the evaluation of an expression until its value is needed. We use `.compute()` to evaluate the expression. -Why the delay? It's because to count the number of rows, `dask` needs to work through each partition and sum the number of the rows in each one. -More information on lazy evaluation is available in our building block [Handle Large Datasets in Python](https://tilburgsciencehub.com/topics/prepare-your-data-for-analysis/data-preparation/large-datasets-python/). 
+Why the delay? It's because to count the number of rows, `dask` needs to work through each partition and sum the number of the rows in each one. +More information on lazy evaluation is available in our building block [Handle Large Datasets in Python](../large-datasets/large-datasets-python.md). If we wanted the column names and shape via `pandas`: {{% codeblock %}} + ```python #display dataframe columns flights.columns #check the shape of dataframe -flights.shape +flights.shape ``` -{{% /codeblock %}} - - +{{% /codeblock %}} In `dask` the `info()` function doesn't give us the output we know from `pandas`, but instead only gives the number of columns, the first and last column name and dtype counts. Alternatively, we can do the following to get each dtype and non-null/null counts: @@ -127,6 +129,7 @@ flights.isna().sum().compute() To achieve the same thing in `pandas` we'd write the following: {{% codeblock %}} + ```python #show types of columns flights.dtypes @@ -138,7 +141,6 @@ flights.info() {{% /codeblock %}} - ### Descriptive statistics, Grouping, Filtering We want to investigate the delays, more specifically, what is the biggest delay? What airline has the largest amount of summed delays? Is there a difference in average arrival delay between certain periods of the year? @@ -155,11 +157,13 @@ flights.describe().compute() #alternatively, for just one column flights['ARRIVAL_DELAY'].describe().compute() ``` + {{% /codeblock %}} Which is similar to how we'd work in `pandas`. {{% codeblock %}} + ```python #compute descriptive statistics for the whole dataframe flights.describe() @@ -167,8 +171,8 @@ flights.describe() #alternatively, for just one column flights['ARRIVAL_DELAY'].describe() ``` -{{% /codeblock %}} +{{% /codeblock %}} If we wanted to find the largest departure delay, we'd need to use the `max()` command along the horizontal axis. In `dask`: @@ -178,6 +182,7 @@ If we wanted to find the largest departure delay, we'd need to use the `max()` c #compute the max value of a column flights['DEPARTURE_DELAY'].max(axis = 0).compute() ``` + {{% /codeblock %}} Or, equivalently in `pandas`: @@ -188,6 +193,7 @@ Or, equivalently in `pandas`: #compute the max value of a column flights['DEPARTURE_DELAY'].max(axis = 0) ``` + {{% /codeblock %}} Next, we may be interested in which airline has the largest amount of minutes delay in the data. @@ -199,22 +205,26 @@ We can find this by computing the sum of the airline delays grouped by airlines. #sum the total delay per airline with groupby flights.groupby(by = 'AIRLINE')['AIRLINE_DELAY'].sum().compute() ``` + {{% /codeblock %}} Or in `pandas`: {{% codeblock %}} + ```python #sum the total delay per airline with groupby flights.groupby(by = 'AIRLINE')['AIRLINE_DELAY'].sum() ``` + {{% /codeblock %}} -We might also want to know if the average delay is different across months. +We might also want to know if the average delay is different across months. For example, delays could be higher in winter due to snowstorms and inclement weather. We start by first checking if the data contains delays for all months of the year. Then, we compute the mean arrival delays by grouping them per month. 
{{% codeblock %}} + ```python #count the unique values for the "month" column to check we get 12 flights['MONTH'].nunique().compute() @@ -226,11 +236,13 @@ flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().compute() flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().max().compute() ``` + {{% /codeblock %}} The same operations would be executed in `pandas` as follows: {{% codeblock %}} + ```python #count the unique values for month column to check we get 12 flights['MONTH'].nunique() @@ -242,8 +254,8 @@ flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean() flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().max() ``` -{{% /codeblock %}} +{{% /codeblock %}} ### Variable creation and plotting @@ -252,8 +264,9 @@ When plotting large datasets, we can either use packages that can handle many da Thus, in our answer to the question "Is there a difference in average arrival delay between certain periods of the year?", we can visualize the average delays for each month with a plot: {{% codeblock %}} + ```python -#create new dataframe +#create new dataframe monthly_delay = flights.groupby(by = 'MONTH')['ARRIVAL_DELAY'].mean().reset_index() #converting the dataframe from dask to pandas @@ -266,12 +279,12 @@ monthly_delay['MONTH'] = monthly_delay['MONTH'].astype(int) import matplotlib.pyplot as plt fig = plt.figure(figsize = (10, 7)) - + # creating the bar plot plt.bar(monthly_delay["MONTH"], monthly_delay["ARRIVAL_DELAY"], color ='orange', width = 0.4) - -plt.xlabel("Month") + +plt.xlabel("Month") plt.xticks(range(monthly_delay['MONTH'].nunique()+1)) #display all values on x axis plt.ylabel("Average arrival delay") plt.title("Average arrival delay per month") @@ -283,6 +296,7 @@ plt.rcParams.update({'text.color': "white", 'font.size': 13}) #change text color and size for better readability plt.show() ``` + {{% /codeblock %}} _Output_: @@ -291,16 +305,16 @@ _Output_:

- -As such, we can see that the highest average arrival delay is registered in June, with the second highest in February, while September and October register negative values, which means that on average, arrivals were ahead of schedule. +As such, we can see that the highest average arrival delay is registered in June, with the second highest in February, while September and October register negative values, which means that on average, arrivals were ahead of schedule. {{% summary %}} -`dask` is a Python library suitable to use when dealing with larger datasets that don't fit the memory. It uses lazy evaluation, which means it doesn't execute operations or commands until actually necessary. Most of the functions are the same as in `pandas`, but remember to add `.compute()` after them if you want `dask` to compute the result at that point in your code. +`dask` is a Python library suitable to use when dealing with larger datasets that don't fit the memory. It uses lazy evaluation, which means it doesn't execute operations or commands until actually necessary. Most of the functions are the same as in `pandas`, but remember to add `.compute()` after them if you want `dask` to compute the result at that point in your code. {{% /summary %}} ## See also + For more resources on `dask` check out these links: - [Dask documentation](https://docs.dask.org/en/stable/) diff --git a/content/topics/Manage-manipulate/Loading/large-datasets/large-datasets-python.md b/content/topics/Manage-manipulate/Loading/large-datasets/large-datasets-python.md index 9c90ce79d..51c305e8e 100644 --- a/content/topics/Manage-manipulate/Loading/large-datasets/large-datasets-python.md +++ b/content/topics/Manage-manipulate/Loading/large-datasets/large-datasets-python.md @@ -11,11 +11,12 @@ aliases: - /import/large-datsets-python --- - ## Overview + Discover Dask, a valuable solution to handle large datasets in Python that provides parallel computing functionalities to popular libraries such as Pandas and Numpy. In this building block, you will be introduced to the package Dask. Moreover, you will learn some of Dask's fundamental operations that will allow you to handle and work with large datasets in Python in a much more effective and efficient manner. ## Memory Errors when working with large datasets + When trying to import a large dataset to dataframe format with `pandas` (for example, using the `read_csv` function), you are likely to run into `MemoryError`. This error indicates that you have run out of memory in your RAM. `Pandas` uses in-memory analytics, so larger-than-memory datasets won't load. Additionally, any operations performed on the dataframe require memory as well. Wes McKinney - the creator of the Python `pandas` project, noted in his [2017 blog post](https://wesmckinney.com/blog/apache-arrow-pandas-internals/): @@ -25,39 +26,45 @@ Wes McKinney - the creator of the Python `pandas` project, noted in his [2017 bl You can check the [memory usage](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.memory_usage.html) of each column of the `pandas` DataFrame (including the dataframe's index) with the following line of code: {{% codeblock %}} + ```python DataFrame.memory_usage(deep=True) ``` + {{% /codeblock %}} Moreover, `pandas` only uses a single CPU core to perform computations, so it is relatively slow, especially when working with larger datasets. - - ## Dask library + One of the solutions to memory errors is to use another library. Here `Dask` comes in handy. 
`Dask` is a Python library for parallel computing, which can perform computations on large datasets while scaling well-known Python libraries such as `pandas`, `NumPy`, and `scikit-learn`. `Dask` splits the dataset into a number of partitions. Unlike `pandas`, each `Dask` partition is sent to a separate CPU core. This feature allows us to work on a larger-than-memory dataset but also speeds up the computations on that dataset. ### Installation + `Dask` is included by default in the Anaconda distribution. Otherwise, you can also use pip to install everything required for the most common uses of `Dask` or choose to only install the `Dask` library: {{% codeblock %}} + ```shell python -m pip install "dask[complete]" # Install everything python -m pip install dask # Install only core parts of Dask ``` + {{% /codeblock %}} Alternatively, see other installing options [here](https://docs.dask.org/en/stable/install.html). ### Dask DataFrame + `Dask` DataFrame is a collection of smaller `pandas` DataFrames, split along the index. The following are the fundamental operations on the `Dask` DataFrame: {{% codeblock %}} -```python + +````python # import the Dask DataFrame module import dask.dataframe as dd @@ -69,7 +76,7 @@ df = dd.read_csv('/path/example-*.csv') # check the number of partitions df.npartitions # change the number of partitions -df = df.repartition(npartitions=10) +df = df.repartition(npartitions=10) # save the Dask DataFrame to CSV files (one file per partition) df.to_csv('/path/example-*.csv') @@ -93,12 +100,13 @@ df['column1'].mean() # get column mean in Dask df['column1'].mean().compute() -``` -{{% /codeblock %}} +```` +{{% /codeblock %}} To convert a `Dask` DataFrame to a `pandas` DataFrame, you call the `compute()` function on that dataframe: {{% codeblock %}} + ```python df = df.compute() {{% /codeblock %}} @@ -128,4 +136,5 @@ You can see the list of functionalities [here](https://docs.dask.org/en/stable/a ### See Also -[Practical Example Using Dask](https://tilburgsciencehub.com/topics/prepare-your-data-for-analysis/data-preparation/dask-in-action/) +[Practical Example Using Dask](dask-in-action.md) +``` diff --git a/content/topics/Research-skills/Writing/Citations/citations.md b/content/topics/Research-skills/Writing/Citations/citations.md index 05f4e5deb..056b573b6 100644 --- a/content/topics/Research-skills/Writing/Citations/citations.md +++ b/content/topics/Research-skills/Writing/Citations/citations.md @@ -10,6 +10,7 @@ aliases: --- ## Citation Styles + Many academic journals, publishers or university departments require a specific citation style. If that is the case, check their guidelines. However, if nothing is specified you still have to choose one and be consistent with it. Which style should be chosen? Usually your choice will depend on your field/discipline or even country of publication. For instance, APA is one of the most common styles for social sciences, MLA in humanities, AMA in medicine, OSCOLA for law in the UK mainly etc. @@ -17,122 +18,137 @@ Which style should be chosen? Usually your choice will depend on your field/disc Luckily enough if you use the right tools, changing from one style to another comes at no additional effort. ## Citing in LaTex + ### Store your references in a ".bib" file + To cite in LaTeX make sure you have a `.bib` file in the same folder as your`.tex` file. {{% tip %}} - Be efficient, export your references to a `.bib` file straight from a reference management tool. 
This way, you avoid manually typing in the references or downloading them one by one from the journal. +Be efficient, export your references to a `.bib` file straight from a reference management tool. This way, you avoid manually typing in the references or downloading them one by one from the journal. - - Not familiar with reference management tools? Check our [building block](https://tilburgsciencehub.com/topics/develop-your-research-skills/tips/reference-list/) +- Not familiar with reference management tools? Check our [building block](reference-list.md) {{% /tip %}} +### Cite in text +The basic command to add citation in text is `\cite{label}`. However, you may not like how the citation is printed out and might want to change it. For this, you need to add additional packagess to your ".tex" document which will give more options on how citations in text appear. Here we show 2 commonly used options, which allow for further flexibility when citing: -### Cite in text - The basic command to add citation in text is `\cite{label}`. However, you may not like how the citation is printed out and might want to change it. For this, you need to add additional packagess to your ".tex" document which will give more options on how citations in text appear. Here we show 2 commonly used options, which allow for further flexibility when citing: +#### A) Natbib -#### A) Natbib Natbib is a widely used, and very reliable package as it relies on the `bibtex` environment. To employ it, type in `\usepackage{natbib}` in the preamble of your document. The table below describes some examples of additional citation commands that come with the Natbib package:

-| Command | Description | Example |
-| ------------- |:----------------------------------------:| --------------:|
-| \citet{} | Textual citation | Jon Doe (2021) |
-| \citep{} | Parenthetical citation | (Jon Doe, 2021)|
-| \citeauthor{} | Prints only the name of the authors(s) | Jon Doe |
-| \citeyear{} | Prints only the year of the publication. | 2021 |

+| Command | Description | Example |
+| ------------- | :--------------------------------------: | --------------: |
+| \citet{} | Textual citation | Jon Doe (2021) |
+| \citep{} | Parenthetical citation | (Jon Doe, 2021) |
+| \citeauthor{} | Prints only the name of the author(s) | Jon Doe |
+| \citeyear{} | Prints only the year of the publication. | 2021 |

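To see these commands in action, here is a minimal sketch of a document that uses Natbib. The bibliography file `references.bib`, the citation key `doe2021`, and the `plainnat` style are placeholders for illustration, not part of the original example.

```LaTex
\documentclass{article}
\usepackage{natbib}           % load Natbib in the preamble
\bibliographystyle{plainnat}  % one of the predetermined Natbib styles

\begin{document}

\citet{doe2021} report a positive effect.   % textual citation: Jon Doe (2021)
The effect is positive \citep{doe2021}.     % parenthetical citation: (Jon Doe, 2021)

\bibliography{references}  % prints the reference list stored in references.bib

\end{document}
```
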
You can find more information on how to employ Natbib [here](https://gking.harvard.edu/files/natnotes2.pdf). -#### B) Biblatex -Biblatex has the advantage that it is also designed to use the `biber` environment, which allows for further flexibility when formatting. To employ Biblatex, in the preamble of your document make sure to include: +#### B) Biblatex - - `\usepackage{biblatex}` +Biblatex has the advantage that it is also designed to use the `biber` environment, which allows for further flexibility when formatting. To employ Biblatex, in the preamble of your document make sure to include: - - `\addbibresource{.bib}` +- `\usepackage{biblatex}` +- `\addbibresource{.bib}` To cite in text, using Biblatex you only need the command `\cite{label}`. How this appears in the text depends on the citation style that you choose. ### Choose your citation style + Each of the above mentioned packages contains standard citation styles and some journal-specific styles. Some will be better for the standard styles (according to your taste). However, for for the non-standard-styles, Biblatex, contains a wider range of options. #### Natbib + If using **Natbib**, in the preamble of your document type in: - ```LaTex - \bibliographystyle{stylename} - ``` -Where the predetermined *stylename* options for Natbib are: `dinat`, `plainnat`, `abbrvnat`, `unsrtnat`, `rusnat`, `rusnat` and `ksfh_nat`. Check how they look in the following [page](https://es.overleaf.com/learn/latex/Natbib_bibliography_styles) +```LaTex +\bibliographystyle{stylename} +``` + +Where the predetermined _stylename_ options for Natbib are: `dinat`, `plainnat`, `abbrvnat`, `unsrtnat`, `rusnat`, `rusnat` and `ksfh_nat`. Check how they look in the following [page](https://es.overleaf.com/learn/latex/Natbib_bibliography_styles) {{% tip %}} Want a specific style, for instance, APA? You can also use the bibliography style option `apalike` among others (e.g. jep, harvard, chicago, astron...). {{% /tip %}} #### Biblatex + If using **Biblatex**, in the preamble of your document type in: - ```LaTex - \usepackage[backend=biber, style=stylename,]{biblatex} - ``` +```LaTex +\usepackage[backend=biber, style=stylename,]{biblatex} +``` -Where there are many more predetermined *stylename* options than for Natbib. You can find these options in the following [link](https://es.overleaf.com/learn/latex/Biblatex_citation_styles). +Where there are many more predetermined _stylename_ options than for Natbib. You can find these options in the following [link](https://es.overleaf.com/learn/latex/Biblatex_citation_styles). Biblatex is especially good at non-standard citation styles, which are usually journal-specific. For instance, among others, it includes the following commonly used citation styles: - - | Citation style | biblatex "stylename" | - | ---------------------|:----------------------------------------:| - | Nature | nature | - | Chicago | chicago-authordate | - | MLA | mla | - | APA | apa | - +| Citation style | biblatex "stylename" | +| -------------- | :------------------: | +| Nature | nature | +| Chicago | chicago-authordate | +| MLA | mla | +| APA | apa | ### Print your formatted references + Again, it will change slightly depending on what package you're using. - For Natbib: + ```LaTex \bibliography{bibfile} % Wherever you want your references to be printed ``` + - For Biblatex: + ```LaTex \printbibliography % Wherever you want your references to be printed ``` + {{% warning %}} Don't forget the `\addbibresource` command if using Biblatex. 
{{% /warning %}} ## Citing in Lyx + Because Lyx is based on LaTex, adding citations straight from a `.bib` file is also possible in Lyx. {{% tip %}} **Citing in Lyx from a .bib file** - - Go to **"Insert" > "List/TOC" > "Bibliography"**. +- Go to **"Insert" > "List/TOC" > "Bibliography"**. + +- Browse for the `.bib` file and click **Add**. - - Browse for the `.bib` file and click **Add**. +- In the same window, under **Style**, choose the style you wish to use. - - In the same window, under **Style**, choose the style you wish to use. +- In the **Content** section, choose from the drop-down to select the references. - - In the **Content** section, choose from the drop-down to select the references. +- Find the locations for which the in-text citations shall appear: - - Find the locations for which the in-text citations shall appear: - - "Insert" > "Citations". + - "Insert" > "Citations". - - Select the matching references for each citation +- Select the matching references for each citation {{% /tip %}} + ## Citing in Word + Not comfortable with LaTeX and using Word? Well, though not as easy as in LaTeX, citing and printing the bibliography in Word can be quite efficient if combined with the Mendeley plug-in. {{% tip %}} + - In Mendeley Reference Manager, go to: + - "Tools" > "Install Mendeley Cite for Word" - Go to Word and click on "References". If correctly installed, the following should appear in the top right-hand corner of Word: @@ -150,5 +166,4 @@ Not comfortable with LaTeX and using Word? Well, though not as easy as in LaTeX, - Need more help or information? Go to the following [page](https://www.mendeley.com/guides/using-citation-editor) - {{% /tip %}} diff --git a/content/topics/Research-skills/Writing/Citations/reference-list.md b/content/topics/Research-skills/Writing/Citations/reference-list.md index e5f0b1b3f..22a0f6b12 100644 --- a/content/topics/Research-skills/Writing/Citations/reference-list.md +++ b/content/topics/Research-skills/Writing/Citations/reference-list.md @@ -9,20 +9,21 @@ aliases: - /reference/list - /build/maintain --- + # Overview + One of the most tedious tasks (if not the most) when writing a publication (e.g, article, thesis) is dealing with references. A good usage of reference management applications will save you a lot of time, which you can put to better use in your research. Here, we will go through the most widely used reference management applications: Mendeley, Zotero, and EndNote. - ## Mendeley - **Mendeley Reference Manager** is a free web and desktop reference management application that can be downloaded for [Windows](https://www.mendeley.com/download-reference-manager/windows), [MacOS](https://www.mendeley.com/download-reference-manager/macOS) and [Linux](https://www.mendeley.com/download-reference-manager/linux). With it, you can: +**Mendeley Reference Manager** is a free web and desktop reference management application that can be downloaded for [Windows](https://www.mendeley.com/download-reference-manager/windows), [MacOS](https://www.mendeley.com/download-reference-manager/macOS) and [Linux](https://www.mendeley.com/download-reference-manager/linux). With it, you can: - - Store, organize, and search all your references from just one library. - - Seamlessly insert references and bibliographies into your Microsoft® Word documents using Mendeley Cite. - - Read, highlight, and annotate PDFs, and keep all your thoughts across multiple documents in one place. - - Collaborate with others by sharing references and ideas. 
+- Store, organize, and search all your references from just one library. +- Seamlessly insert references and bibliographies into your Microsoft® Word documents using Mendeley Cite. +- Read, highlight, and annotate PDFs, and keep all your thoughts across multiple documents in one place. +- Collaborate with others by sharing references and ideas. ### Build your reference list @@ -33,7 +34,6 @@ From Mendeleys's Reference manager there are several ways to add references to y

- - **Browse for local files.** Do you have files already downloaded to your computer? Go to "+Add new" and select "File(s) from computer".

@@ -59,69 +59,72 @@ From Mendeleys's Reference manager there are several ways to add references to y Adding references manually is the most reliable way to add a reference because it relies on the metadata of publications. There is a much lower chance that you get incorrect information. {{% /tip %}} -- **Search for articles online.** Although the preferred search methods for literature on the web are the ones mentioned in this [topic](https://tilburgsciencehub.com/topics/develop-your-research-skills/tips/search-literature/), Mendeley Reference Manager also has its own search engine. To use it, go to "Tools" and select "Search for articles online". It will redirect you to the following [page](https://www.mendeley.com/search/?dgcid=refmandesktop). - - Having said this, the Mendeley searcher has an "Add to library" button for every search result, which allows you to instantly add the reference to your Mendeley Reference Manager's library. However, this is no significant advantage compared to other search engines once the Mendeley Web Importer extension is added (see below) - +- **Search for articles online.** Although the preferred search methods for literature on the web are the ones mentioned in this [topic](../literature-review/search-literature.md), Mendeley Reference Manager also has its own search engine. To use it, go to "Tools" and select "Search for articles online". It will redirect you to the following [page](https://www.mendeley.com/search/?dgcid=refmandesktop). + - Having said this, the Mendeley searcher has an "Add to library" button for every search result, which allows you to instantly add the reference to your Mendeley Reference Manager's library. However, this is no significant advantage compared to other search engines once the Mendeley Web Importer extension is added (see below) ### Mendeley Web Importer + If you happen to find an article/book/paper online that you wish to add to your reference list, with Mendeley Web Importer you can do so without even launching the Reference Manager, directly from the webpage. -Mendeley Web Importer is a browser extension that can be downloaded [here](https://chrome.google.com/webstore/detail/mendeley-web-importer/dagcmkpagjlhakfdhnbomgmjdpkdklff). Alternatively, from the Reference Manager go to "Tools > Install Mendeley Web Importer". +Mendeley Web Importer is a browser extension that can be downloaded [here](https://chrome.google.com/webstore/detail/mendeley-web-importer/dagcmkpagjlhakfdhnbomgmjdpkdklff). Alternatively, from the Reference Manager go to "Tools > Install Mendeley Web Importer". - Once added, check that the extension logo appears in your browser toolbar. If it appears, you are now ready to start importing references directly from the web page into your reference manager! +Once added, check that the extension logo appears in your browser toolbar. If it appears, you are now ready to start importing references directly from the web page into your reference manager! - 1. **Launch the importer:** Click on ![logo](../images/logo_importer.PNG). +1. **Launch the importer:** Click on ![logo](../images/logo_importer.PNG). - 2. **Sign in**: Use your existing Mendeley/Elsevier credentials to sign in. +2. **Sign in**: Use your existing Mendeley/Elsevier credentials to sign in. - 3. **Select the Reference you wish to import.** The Importer will display references that it can import from the web page you are viewing. -

- -

+3. **Select the Reference you wish to import.** The Importer will display references that it can import from the web page you are viewing. +

+ +

- 4. **Add to your library**. Click "Add to Mendeley" and the reference is added to your library! +4. **Add to your library**. Click "Add to Mendeley" and the reference is added to your library! {{% warning %}} - Double-check the import from Mendeley Web Importer. It sometimes retrieves incorrect information from the page. For instance, it might confuse the year of publication or change slightly the title. +Double-check the import from Mendeley Web Importer. It sometimes retrieves incorrect information from the page. For instance, it might confuse the year of publication or change slightly the title. - - For minor changes, you can click on the article in the Reference Manager and modify the information manually. You can additionally add any extra information (e.g., pages) or add annotations to the reference. -{{% /warning %}} +- For minor changes, you can click on the article in the Reference Manager and modify the information manually. You can additionally add any extra information (e.g., pages) or add annotations to the reference. + {{% /warning %}} ### Organize your reference list #### Smart Collections. - Automatically organized folders. These include: - - **Recently added.** +Automatically organized folders. These include: - - **Favorites**. Click on the star icon next to your reference to add it to the favorites folder. +- **Recently added.** - - **My Publications.** This collection displays the publications that you have authored and claimed through the Scopus Author Profile. +- **Favorites**. Click on the star icon next to your reference to add it to the favorites folder. - - **Trash** +- **My Publications.** This collection displays the publications that you have authored and claimed through the Scopus Author Profile. + +- **Trash** #### Custom Collections. - Organize your references into the folders of your choosing. For instance, organize by topic (e.g., Economics, Data Modelling) - - To create a new collection, select the "New Collection" button in the left-hand navigation panel. +Organize your references into the folders of your choosing. For instance, organize by topic (e.g., Economics, Data Modelling) + +- To create a new collection, select the "New Collection" button in the left-hand navigation panel. - - To add a reference to a collection, drop it onto a collection in the left-hand navigation panel. +- To add a reference to a collection, drop it onto a collection in the left-hand navigation panel. ### Export your references + You can export references directly into a BibTex, EndNote, Word, or RIS file. To do so: - 1. Select the references you want to export by clicking on the squared box next to the "Authors" column. - 2. Click on the export drop-down cell, and select the file format of your choosing. +1. Select the references you want to export by clicking on the squared box next to the "Authors" column. + +2. Click on the export drop-down cell, and select the file format of your choosing.

- If you require further information on Mendeley, go to the following [page](https://www.mendeley.com/guides/mendeley-reference-manager). - ## Zotero + - Zotero is an open-source tool that can be downloaded [here](https://www.zotero.org/download/). - Its usage is very similar to Mendeley's Reference Manager, though it does not have an integrated PDF viewer (it will open PDFs in your computer’s default PDF viewer). @@ -130,8 +133,8 @@ If you require further information on Mendeley, go to the following [page](https - Want to learn more about Zotero's usage? Check this [page](https://www.zotero.org/support/) - ## Endnote + The main reason to use EndNote over Mendeley or Zotero is that it offers more features and comes with unlimited reference storage locally and in Endnote Online. To download EndNote go [here](https://endnote.com/downloads) However, unlike those two, EndNote is **not free**. Therefore, it is not probably worth it unless your institution gives you free access to it. @@ -141,6 +144,7 @@ Don't know whether to go for Mendeley, Zotero or EndNote? Bottom line, they are {{% /tip %}} ## BibDesk + Are you a Mac user? You then have another top-notch bibliography manager in [BibDesk](https://sourceforge.net/projects/bibdesk/), which is particularly well suited for LaTeX and is also a free-open source project. - Want to learn how to use it? Check out BibDesk's [wiki](https://sourceforge.net/p/bibdesk/wiki/Main_Page/) diff --git a/content/topics/Research-skills/Writing/master-thesis-guide/data_coding.md b/content/topics/Research-skills/Writing/master-thesis-guide/data_coding.md index 407655741..8299d5cfd 100644 --- a/content/topics/Research-skills/Writing/master-thesis-guide/data_coding.md +++ b/content/topics/Research-skills/Writing/master-thesis-guide/data_coding.md @@ -17,36 +17,37 @@ aliases: - Data collection techniques - Data import - Inspecting raw data -- Building your project pipeline +- Building your project pipeline ## Code versioning: GitHub -GitHub is a powerful version control platform that allows you to manage and track changes in your code. This ensures transparency, and organization throughout the research process. +GitHub is a powerful version control platform that allows you to manage and track changes in your code. This ensures transparency, and organization throughout the research process. ### Why use GitHub for your course project or Master Thesis? Even if your project/thesis is not a collaborative project, keeping your work on GitHub still offers a lot of advantages: -- *Version control*: Easily track changes, view previous versions, and revert to earlier drafts if needed. No more confusion with multiple versions of the same document (e.g. saving the file as "Thesisdraft1", "Thesisdraft2" and so on). -
+- _Version control_: Easily track changes, view previous versions, and revert to earlier drafts if needed. No more confusion with multiple versions of the same document (e.g. saving the file as "Thesisdraft1", "Thesisdraft2" and so on). +
-- *Transparent documentation:* The commit history provides a clear record of all the edits made. Leaving clear commit messages helps in understanding what edits you made when revisiting your work after months of working on it. +- _Transparent documentation:_ The commit history provides a clear record of all the edits made. Leaving clear commit messages helps in understanding what edits you made when revisiting your work after months of working on it. -- *Branching:* You can create branches for different sections without impacting the main project draft in which you can experiment as much as you want. +- _Branching:_ You can create branches for different sections without impacting the main project draft in which you can experiment as much as you want. -- *Remote accessibility:* Access your project from anywhere as GitHub is a cloud-based platform. +- _Remote accessibility:_ Access your project from anywhere as GitHub is a cloud-based platform. -- *Backup:* GitHub serves as a secure backup, minimizing the risk of losing your work. +- _Backup:_ GitHub serves as a secure backup, minimizing the risk of losing your work. -- *Showcasing your work:* After finishing the project, you can keep it on GitHub, making it accessible for others to use and showcasing your research! +- _Showcasing your work:_ After finishing the project, you can keep it on GitHub, making it accessible for others to use and showcasing your research! Use [this topic](/share/data) to learn how to use GitHub for your project! ## Data collection techniques -Depending on your research, you might want to create your own data. Techniques for data collection include web scraping or APIs. -- The [Web scraping and API Mining tutorial](/learn/web-scraping-and-api-mining) discusses the difference between web scraping and API mining. It also shows you how to scrape static and dynamic websites and how to extract data from APIs. -- When you are new to web scraping and APIs, you can follow the open course [Online Data Collection and Management](https://odcm.hannesdatta.com/) of Tilburg University. +Depending on your research, you might want to create your own data. Techniques for data collection include web scraping or APIs. + +- The [Web scraping and API Mining tutorial](/learn/web-scraping-and-api-mining) discusses the difference between web scraping and API mining. It also shows you how to scrape static and dynamic websites and how to extract data from APIs. +- When you are new to web scraping and APIs, you can follow the open course [Online Data Collection and Management](https://odcm.hannesdatta.com/) of Tilburg University. ## Data import @@ -63,13 +64,12 @@ You might want to analyze data sets for your project that already exist somewher Upon importing your raw data, explore it right away to assess its quality. There are several things you can do: -- **Visually inspect your data:** Load the data into your software program (e.g. R, Excel) and visually inspect the data tables and their structure. +- **Visually inspect your data:** Load the data into your software program (e.g. R, Excel) and visually inspect the data tables and their structure. - **Cross-tabulate data:** Organize your data in a table format to see the frequency distribution of variables' categories in relation to each other. Do this with `table()` in R or pivot tables in Excel. - **Plot your data** to explore trends (e.g. with `plot()` in R), or distributions (e.g. with `hist()` in R). Aggregate data by a common unit before plotting (e.g. 
time periods, categories like countries or brands). - ### Important considerations Key considerations when assessing your raw data quality include the following points: @@ -78,33 +78,34 @@ Key considerations when assessing your raw data quality include the following po - **Variable names and columns**: Ensure you understand the meaning of each variable and column. Confirm if any columns are incomplete. -- **Data quality**: - - Check for feasibility and consistency in variable values. Are there any values that seem unrealistic? - - Detect missing values and identify if missing values truly represent NA (*Not Available*) data or if they imply zero values. - - Verify the consistency in data definitions across the dataset, including the measurement units of your variables. - - Pay attention to proper encoding of strings and headers. +- **Data quality**: + - Check for feasibility and consistency in variable values. Are there any values that seem unrealistic? + - Detect missing values and identify if missing values truly represent NA (_Not Available_) data or if they imply zero values. + - Verify the consistency in data definitions across the dataset, including the measurement units of your variables. + - Pay attention to proper encoding of strings and headers. {{% tip %}} **GDPR compliance**
-When dealing with data, ensuring compliance with your organization's data management policies is crucial. This often entails promptly anonymizing personally identifiable information during your data preparation phase. +When dealing with data, ensuring compliance with your organization's data management policies is crucial. This often entails promptly anonymizing personally identifiable information during your data preparation phase. -Tilburg University has provided [specific guidelines for students](https://www.tilburguniversity.edu/sites/default/files/download/Student%20research%20and%20personal%20data%20in%20your%20research.pdf). Moreover, the legal department at Tilburg offers valuable [guidance on handling personal data with care](https://www.tilburguniversity.edu/about/conduct-and-integrity/privacy-and-security/careful-handling-personal-data). +Tilburg University has provided [specific guidelines for students](https://www.tilburguniversity.edu/sites/default/files/download/Student%20research%20and%20personal%20data%20in%20your%20research.pdf). Moreover, the legal department at Tilburg offers valuable [guidance on handling personal data with care](https://www.tilburguniversity.edu/about/conduct-and-integrity/privacy-and-security/research-data). {{% /tip %}} -## Building your project pipeline +## Building your project pipeline -A project pipeline is a systematic framework that gives a structured sequence of the steps included in your research. It outlines the flow of tasks, from data collection and cleaning to analysis and visualization. +A project pipeline is a systematic framework that gives a structured sequence of the steps included in your research. It outlines the flow of tasks, from data collection and cleaning to analysis and visualization. -Your project pipeline might quickly become more complex, which makes a clear structure even more important. A good project pipeline ensures consistency and facilitates efficiency by breaking down the complex process into smaller manageable steps. +Your project pipeline might quickly become more complex, which makes a clear structure even more important. A good project pipeline ensures consistency and facilitates efficiency by breaking down the complex process into smaller manageable steps. It also ensures transparency. For example, a project where code files are separated into multiple smaller components (e.g. one for data cleaning, one for running the model, one for producing tables and figures) is way easier for others to read and understand. -Use [this tutorial that covers the principles of project setup and workflow management](/learn/project-setup) as a guide. +Use [this tutorial that covers the principles of project setup and workflow management](/learn/project-setup) as a guide. {{% tip %}} [A checklist to help you ensure you meet all the requirements when building your pipeline](/tutorials/project-management/principles-of-project-setup-and-workflow-management/checklist) {{% /tip %}} #### Pipeline automation + If your project is data-oriented, e.g. related to marketing analytics or data science, consider automating your pipeline. [This tutorial](/practice/pipeline-automation) is helpful to teach you how to automate your pipeline automation using `Make`. 
diff --git a/content/topics/Visualization/data-visualization/dashboarding/shiny-apps.md b/content/topics/Visualization/data-visualization/dashboarding/shiny-apps.md index d77de3ab2..8f77c28f2 100644 --- a/content/topics/Visualization/data-visualization/dashboarding/shiny-apps.md +++ b/content/topics/Visualization/data-visualization/dashboarding/shiny-apps.md @@ -15,7 +15,7 @@ aliases: ## Overview -In the world of data analysis, the ability to communicate insights effectively is a great skill to have. +In the world of data analysis, the ability to communicate insights effectively is a great skill to have. **[Shiny](https://shiny.rstudio.com)** is a package that helps you turn your analyses into interactive web applications all with R or Python without requiring HTML, CSS, or Javascript knowledge. Shiny empowers data analysts and scientists to create dynamic, user-friendly applications that allow non-technical stakeholders to explore and interact with data in real time. @@ -28,32 +28,35 @@ Whether you're looking to build an interactive dashboard for tracking key metric In this guide, we'll walk you through the fundamental concepts of creating Shiny apps, step by step. By the end, you'll have the skills to transform your static analyses into engaging web applications that empower your audience to conduct their own explorations. So, whether you're a data analyst, a scientist, or anyone who wants to convey insights in an interactive and impactful way, dive into the world of Shiny apps and unlock the potential of dynamic data visualization! The building block consists of the following sections -- Skeleton + +- Skeleton - Layout options - Define placeholders - Control widgets -- An example +- An example ### Skeleton + The skeleton of any Shiny app consists of a user interface (`ui`) and a `server`. The `ui` defines the visual elements, such as plots and widgets, while the `server` handles the reactive behavior and interactions. This is for example what happens once you click on the download button. And this is exactly where Shiny shines: combining inputs with outputs. {{% codeblock %}} + ```R library(shiny) -ui <- fluidPage( +ui <- fluidPage( # Define UI components here ) - + server <- function(input, output){ # Implement server logic here } shinyApp(ui = ui, server = server) ``` -{{% /codeblock %}} +{{% /codeblock %}} ### Layout options @@ -61,6 +64,7 @@ shinyApp(ui = ui, server = server) A common layout structure in Shiny apps is the sidebar layout. This layout divides the app into a small sidebar panel on the left and a main content panel on the right. {{% codeblock %}} + ```R ui <- fluidPage( sidebarLayout( @@ -73,13 +77,14 @@ ui <- fluidPage( ) ) ``` -{{% /codeblock %}} +{{% /codeblock %}} **Tabs** -Tabs are another effective way to organize content in your Shiny app. They provide a convenient means of switching between different sections of your app's content. +Tabs are another effective way to organize content in your Shiny app. They provide a convenient means of switching between different sections of your app's content. {{% codeblock %}} + ```R tabsetPanel( tabPanel(title = "tab 1", @@ -89,6 +94,7 @@ tabsetPanel( tabPanel(title = "tab 3", "The content of the third tab") ) ``` + {{% /codeblock %}} {{% example %}} @@ -101,12 +107,12 @@ tabsetPanel( Define a placeholder for plots, tables, and text in the user interface (`ui`) and server side (`server`). -* Text can be formatted as headers (e.g., `h1()`, `h2()`) or printed in bold (`strong()`) or italics (`em()`) format. 
-* The [`ggplotly()` function](https://www.rdocumentation.org/packages/plotly/versions/4.9.3/topics/ggplotly) can convert a `ggplot2` plot into an interactive one (e.g., move, zoom, export image features that are not available in the standard `renderPlot()` function). -* Similarly, the `DT::dataTableOutput("table")` (in the `ui`) and the `DT::renderDataTable()` (in the `server`) from the `DT` package enrich the `renderTable` function. See a live example [here](https://royklaassebos.shinyapps.io/dPrep_Demo_Google_Mobility/). - +- Text can be formatted as headers (e.g., `h1()`, `h2()`) or printed in bold (`strong()`) or italics (`em()`) format. +- The [`ggplotly()` function](https://www.rdocumentation.org/packages/plotly/versions/4.9.3/topics/ggplotly) can convert a `ggplot2` plot into an interactive one (e.g., move, zoom, export image features that are not available in the standard `renderPlot()` function). +- Similarly, the `DT::dataTableOutput("table")` (in the `ui`) and the `DT::renderDataTable()` (in the `server`) from the `DT` package enrich the `renderTable` function. See a live example [here](https://royklaassebos.shinyapps.io/dPrep_Demo_Google_Mobility/). {{% codeblock %}} + ```R # ui plotOutput(outputId = "plot"), @@ -128,21 +134,22 @@ output$text <- renderText({ "Some text" }) ``` -{{% /codeblock %}} - +{{% /codeblock %}} ### Control Widgets -Shiny provides a range of control widgets that allow users to interact with your app: +Shiny provides a range of control widgets that allow users to interact with your app: **Text box** {{% codeblock %}} + ```R # textbox that accepts both numeric and alphanumeric input textInput(inputId = "title", label="Text box title", value = "Text box content") ``` + {{% /codeblock %}} {{% example %}} @@ -154,10 +161,12 @@ textInput(inputId = "title", label="Text box title", value = "Text box content") **Numeric input** {{% codeblock %}} + ```R # text box that only accepts numeric data between 1 and 30 numericInput(inputId = "num", label = "Number of cars to show", value = 10, min = 1, max = 30) ``` + {{% /codeblock %}} {{% example %}} @@ -169,10 +178,12 @@ numericInput(inputId = "num", label = "Number of cars to show", value = 10, min **Slider** {{% codeblock %}} + ```R # slider that goes from 35 to 42 degrees with increments of 0.1 sliderInput(inputId = "temperature", label = "Body temperature", min = 35, max = 42, value = 37.5, step = 0.1) ``` + {{% /codeblock %}} {{% example %}} @@ -183,12 +194,13 @@ sliderInput(inputId = "temperature", label = "Body temperature", min = 35, max = **Range selector** - {{% codeblock %}} + ```R # slider that allows the user to set a range (rather than a single value) sliderInput(inputId = "price", label = "Price (€)", value = c(39, 69), min = 0, max = 99) ``` + {{% /codeblock %}} {{% example %}} @@ -200,10 +212,12 @@ sliderInput(inputId = "price", label = "Price (€)", value = c(39, 69), min = 0 **Radio buttons** {{% codeblock %}} + ```R # input field that allows for a single selection radioButtons(inputId = "radio", label = "Choose your preferred time slot", choices = c("09:00 - 09:30", "09:30 - 10:00", "10:00 - 10:30", "10:30 - 11:00", "11:00 - 11:30"), selected = "10:00 - 10:30") ``` + {{% /codeblock %}} {{% example %}} @@ -215,11 +229,13 @@ radioButtons(inputId = "radio", label = "Choose your preferred time slot", choic **Dropdown menu** {{% codeblock %}} + ```R # a dropdown menu is useful when you have plenty of options and you don't want to list them all below one another selectInput(inputId = "major", label = 
"Major", choices = c("Business Administration", "Data Science", "Econometrics & Operations Research", "Economics", "Liberal Arts", "Industrial Engineering", "Marketing Management", "Marketing Analytics", "Psychology"), selected = "Marketing Analytics") ``` + {{% /codeblock %}} {{% example %}} @@ -231,12 +247,14 @@ selectInput(inputId = "major", label = "Major", choices = c("Business Administra **Dropdown menu (multiple selections)** {{% codeblock %}} + ```R # dropdown menu that allows for multiple selections (e.g., both R and JavaScript) selectInput(inputId = "programming_language", label = "Programming Languages", choices = c("HTML", "CSS", "JavaScript", "Python", "R", "Stata"), selected = "R", multiple = TRUE) ``` + {{% /codeblock %}} {{% example %}} @@ -248,10 +266,12 @@ selectInput(inputId = "programming_language", label = "Programming Languages", **Checkbox** {{% codeblock %}} + ```R # often used to let the user confirm their agreement checkboxInput(inputId = "agree", label = "I agree to the terms and conditions", value=TRUE) ``` + {{% /codeblock %}} {{% example %}} @@ -263,11 +283,13 @@ checkboxInput(inputId = "agree", label = "I agree to the terms and conditions", **Colorpicker** {{% codeblock %}} + ```R # either insert a hexadecmial color code or use the interactive picker library(colourpicker) # you may first need to install the package colourInput(input = "colour", label = "Select a colour", value = "blue") ``` + {{% /codeblock %}} {{% example %}} @@ -275,9 +297,11 @@ colourInput(input = "colour", label = "Select a colour", value = "blue") {{% /example %}} ### Download Button + Add a download button to your Shiny app so that users can directly download their current data selection in csv-format and open the data in a spreadsheet program (e.g., Excel). {{% codeblock %}} + ```R ui <- fluidPage( downloadButton(outputId = "download_data", label = "Download") @@ -287,12 +311,13 @@ server <- function(input, output) { output$download_data <- downloadHandler( filename = "download_data.csv", content = function(file) { - data <- filtered_data() + data <- filtered_data() write.csv(data, file, row.names = FALSE) } ) } ``` + {{% /codeblock %}} ### An example @@ -303,14 +328,10 @@ The [Shiny app](https://royklaassebos.shinyapps.io/dPrep_Demo_Google_Mobility/) ### More resources -* [A course on learning Shiny](https://debruine.github.io/shinyintro/) and its necessary [package](https://github.com/debruine/shinyintro) -* [Interactive Web Apps with shiny Cheat Sheet](https://shiny.rstudio.com/images/shiny-cheatsheet.pdf) -* [Shiny User Showcase](https://shiny.rstudio.com/gallery/) +- [A course on learning Shiny](https://debruine.github.io/shinyintro/) and its necessary [package](https://github.com/debruine/shinyintro) +- [Interactive Web Apps with shiny Cheat Sheet](https://shiny.posit.co/r/articles/start/cheatsheet/) +- [Shiny User Showcase](https://shiny.rstudio.com/gallery/) {{% summary %}} You've now gained a solid understanding of how to create interactive web applications using Shiny. Feel free to experiment, explore the vast capabilities of Shiny, and tailor your apps to fit your specific needs. With the resources provided and your newfound knowledge, you're well-equipped to embark on your journey into the world of Shiny app development. Happy coding! 
{{% /summary %}} - - - - diff --git a/content/topics/Visualization/data-visualization/graphs-charts/matplotlib-seaborn.md b/content/topics/Visualization/data-visualization/graphs-charts/matplotlib-seaborn.md index 0899851e8..733e9dcc9 100644 --- a/content/topics/Visualization/data-visualization/graphs-charts/matplotlib-seaborn.md +++ b/content/topics/Visualization/data-visualization/graphs-charts/matplotlib-seaborn.md @@ -13,7 +13,7 @@ aliases: ## Overview -Python has a lot of libraries for visualizing data, out of which `matplotlib` and `seaborn` are the most common. In this building block we construct the plots defined in [Data Visualization Theory and Best Practices](https://tilburgsciencehub.com/topics/visualize-your-data/data-visualization/theory-best-practices/) with both `matplotlib` and `seaborn`. +Python has a lot of libraries for visualizing data, out of which `matplotlib` and `seaborn` are the most common. In this building block we construct the plots defined in [Data Visualization Theory and Best Practices](theory-best-practices.md) with both `matplotlib` and `seaborn`. ## Setup @@ -21,15 +21,16 @@ To install `matplotlib` follow this [guide](https://matplotlib.org/stable/users/ {{% tip %}} -You can also plot with `pandas`, which is built on top of `matplotlib`. +You can also plot with `pandas`, which is built on top of `matplotlib`. {{% /tip %}} -To install `seaborn` follow this [guide](https://seaborn.pydata.org/installing.html). This is also built on top of `matplotlib` to create statistical plots. +To install `seaborn` follow this [guide](https://seaborn.pydata.org/installing.html). This is also built on top of `matplotlib` to create statistical plots. Let's first import the libraries. {{% codeblock %}} + ```python import pandas as pd import matplotlib as mpl @@ -37,27 +38,31 @@ import matplotlib.pyplot as plt import seaborn as sns ``` + {{% /codeblock %}} We are going to use two datasets, the [Iris](https://www.kaggle.com/datasets/uciml/iris) dataset and the [Monthly stocks](stocks-monthly.csv) dataset, containing closing prices of 4 companies over time. Let's load the datasets. {{% codeblock %}} + ```python iris = pd.read_csv('iris.csv') stocks = pd.read_csv('stocks-monthly.csv',parse_dates=[0]) ``` + {{% /codeblock %}} ## Gallery of Plots ### 1. Scatterplot -*Matplotlib* +_Matplotlib_ Creating a scatterplot with `matplotlib` is simple, we just need to follow a simple syntax. For this plot type we use the Iris dataset. {{% codeblock %}} + ```python #create the scatterplot using two quantitative attributes plt.scatter(iris['sepal width'], iris['sepal length']) @@ -74,17 +79,19 @@ plt.title("Scatterplot") #add gridlines plt.grid() ``` + {{% /codeblock %}} -The scatterplot visualizes the sepal width on the X axis and the sepal length on the Y axis. The plot shows us that the majority of the points are concentrated around the center denoting that in general the flowers, regardless of their species, have a medium sepal length and width. +The scatterplot visualizes the sepal width on the X axis and the sepal length on the Y axis. The plot shows us that the majority of the points are concentrated around the center denoting that in general the flowers, regardless of their species, have a medium sepal length and width. _Output:_ +

{{% tip %}} -We can also change the color of the dots by adding the parameter `c = #some color`. We can see all supported colors in `matplotlib` by running `mpl.colors.cnames`. +We can also change the color of the dots by adding the parameter `c = #some color`. We can see all supported colors in `matplotlib` by running `mpl.colors.cnames`. Additionally, we can change the style of the markers (dots) by adding the parameter `marker = #some marker`. We can see all supported marker styles in `matplotlib` by running `mpl.markers.MarkerStyle.markers`. @@ -95,23 +102,27 @@ Additionally, we can change the style of the markers (dots) by adding the parame Creating the same scatterplot in `seaborn` is easy. Additionally, it can take the categorical variable of flower species as parameter for color hue. This way, each species has a different color and is easier to identify. {{% codeblock %}} + ```python sns.scatterplot(iris['sepal width'], iris['sepal length'], hue = iris['species']).set(title="Scatterplot") ``` + {{% /codeblock %}} _Output:_ +

- ### 2. Bar plot + #### Matplotlib For the bar plot we use the Monthly stock dataset. We visualize the months on the X axis and closing prices of one company on the Y axis. {{% codeblock %}} + ```python #create the bar plot using the months and closing prices of Google plt.bar(stocks['Date'].dt.month, stocks['GOOG']) @@ -132,9 +143,11 @@ plt.title("Bar plot") plt.grid() ``` + {{% /codeblock %}} _Output:_ +

@@ -144,12 +157,15 @@ _Output:_ When plotting with `seaborn` it automatically adds a different color for each bar, as well as add error bars. They represent the uncertainty or variation of the corresponding coordinate of the point. {{% codeblock %}} + ```python sns.barplot(stocks['Date'].dt.month, stocks['GOOG']).set(title="Bar plot") ``` + {{% /codeblock %}} _Output:_ +

@@ -159,6 +175,7 @@ _Output:_ #### Matplotlib {{% codeblock %}} + ```python #add each categorical variable (company) with a different color plt.bar(stocks['Date'].dt.month, stocks['GOOG'], color='r') @@ -181,9 +198,11 @@ plt.title("Closing prices of stocks in each month") #add limit for Y axis to better visualize all categories plt.ylim(0,800) ``` + {{% /codeblock %}} _Output:_ +

@@ -192,17 +211,20 @@ _Output:_ ### 4. Line chart -#### Seaborn +#### Seaborn When plotting line charts with `seaborn` we have to specify exactly what to visualize on the axes: {{% codeblock %}} + ```python sns.lineplot(data = stocks, x = 'Date', y = 'NASDAQ').set(title="Line plot") ``` + {{% /codeblock %}} _Output:_ +

@@ -212,6 +234,7 @@ _Output:_ We can use a simple command to plot all 4 companies in the same line plot: {{% codeblock %}} + ```python #we first set the date column as index stocks_d = stocks.set_index('Date') @@ -220,9 +243,11 @@ stocks_d = stocks.set_index('Date') stocks_d.plot() plt.title("Stock prices over time") ``` + {{% /codeblock %}} _Output:_ +

@@ -232,6 +257,7 @@ _Output:_ We can also create several subplots under the same figure. For instance, we create one line plot for each company. {{% codeblock %}} + ```python #create display of figure fig, ax = plt.subplots(nrows=2, ncols=2, squeeze=False, sharex=True, figsize=(10,10)) @@ -253,17 +279,21 @@ ax[1, 0].set_ylabel('Price (USD)') #set title of whole figure fig.suptitle("Development of stocks over time", size=18, weight='bold') ``` + {{% /codeblock %}} _Output:_ +

### 5. Heatmap -Before actually creating the heatmap, we need to rearrange the data to create a pivot table. We use the Iris dataset to create the pivot table after the petal length and width levels. + +Before actually creating the heatmap, we need to rearrange the data to create a pivot table. We use the Iris dataset to create the pivot table after the petal length and width levels. {{% codeblock %}} + ```python levels = ["tiny", "small", "medium", "big", "large"] iris["petal width level"] = pd.cut(iris["petal width"], len(levels), labels=levels) @@ -283,6 +313,7 @@ iris_matrix = iris_matrix.reindex(levels, axis=1); iris_matrix ``` + {{% /codeblock %}} We can now create the heatmap from the new matrix. @@ -290,14 +321,17 @@ We can now create the heatmap from the new matrix. #### Matplotlib {{% codeblock %}} + ```python plt.imshow(iris_matrix) plt.colorbar() plt.title("Heatmap with color bar") ``` + {{% /codeblock %}} _Output:_ +

@@ -305,12 +339,15 @@ _Output:_ #### Seaborn {{% codeblock %}} + ```python sns.heatmap(iris_matrix, square=True).set(title="Heatmap with color bar") ``` + {{% /codeblock %}} _Output:_ +

@@ -320,6 +357,7 @@ _Output:_ For the histogram we use `seaborn` since it is the best library for statistical plotting. {{% codeblock %}} + ```python #we can create a more complex chart that contains the histogram, the density plot and the normal distribution @@ -327,9 +365,11 @@ from scipy.stats import norm sns.distplot(iris['petal length'], fit=norm).set(title="Histogram with normal distribution") ``` + {{% /codeblock %}} _Output:_ +

@@ -341,13 +381,17 @@ The blue line represents the density plot and the black line is the fitted norma We can visualize the distribution of petal length for each iris species with the box plot. #### Matplotlib + {{% codeblock %}} + ```python iris.boxplot(column = 'petal length', by = 'species', figsize = (5,5)) ``` + {{% /codeblock %}} _Output:_ +

@@ -355,14 +399,15 @@ _Output:_ #### Seaborn {{% codeblock %}} + ```python sns.boxplot(data=iris, x='species', y='petal length').set(title="Box plot of petal length") ``` + {{% /codeblock %}} _Output:_ +

- - diff --git a/content/topics/Visualization/data-visualization/regression-results/model-summary.md b/content/topics/Visualization/data-visualization/regression-results/model-summary.md index 24e2610f4..18dcfac37 100644 --- a/content/topics/Visualization/data-visualization/regression-results/model-summary.md +++ b/content/topics/Visualization/data-visualization/regression-results/model-summary.md @@ -18,27 +18,30 @@ The `modelsummary` package is a powerful and user-friendly package for summarizi ## The Setting -Eichholtz et al. (2010) investigate the relationship between investments in energy efficiency in design and construction of commercial office buildings and the rents, the effective rents and the selling prices of these properties. Green ratings assess the energy footprint of buildings and can be used by building owners or tenants to evaluate the energy efficiency and sustainability of buildings. +Eichholtz et al. (2010) investigate the relationship between investments in energy efficiency in design and construction of commercial office buildings and the rents, the effective rents and the selling prices of these properties. Green ratings assess the energy footprint of buildings and can be used by building owners or tenants to evaluate the energy efficiency and sustainability of buildings. Their empirical approach boils down to regressing the logarithm of rent per square foot in commercial office buildings on a dummy variable (1 if rated as green) and other characteristics of the buildings. The regression equation is: - $log R_{in} = \alpha + \beta_i X_i + \sum\limits_{n=1}^{N}\gamma_n c_n + \delta g_i + \epsilon_{in}$ where -- $R_{in}$ is the rent per square foot in commercial office building `i` in cluster `n` + +- $R_{in}$ is the rent per square foot in commercial office building `i` in cluster `n` - $X_{i}$ is a vector of the hedonic characteristics of building `i` - $c_n$ is a dummy variable with a value of 1 if building i is located in cluster n and zero otherwise (fixed effects) - $g_{i}$ is a dummy variable with a value of 1 if building `i` is rated by Energy Star or USGBC and zero otherwise - $\epsilon_{in}$ is the error term -## Load packages and data +## Load packages and data + +The data set to replicate Table 1 comes from the [replication package of Eichholtz et al. (2010)](https://www.openicpsr.org/openicpsr/project/112392/version/V1/view). It contains data for 8,105 commercial office buildings in the US, both green buildings and control buildings. Green rated buildings are clustered to nearby commercial buildings in the same market. -The data set to replicate Table 1 comes from the [replication package of Eichholtz et al. (2010)](https://www.openicpsr.org/openicpsr/project/112392/version/V1/view). It contains data for 8,105 commercial office buildings in the US, both green buildings and control buildings. Green rated buildings are clustered to nearby commercial buildings in the same market. + We use a tidied version of the data in their replication package. {{% codeblock %}} + ```R # Load packages library(modelsummary) @@ -52,6 +55,7 @@ data_url <- "https://raw.githubusercontent.com/tilburgsciencehub/website/master/ load(url(data_url)) #data_rent is the cleaned data set ``` + {{% /codeblock %}} ## Regression equations @@ -59,97 +63,100 @@ load(url(data_url)) #data_rent is the cleaned data set Below, we are estimating regression 1 until 5 displayed in Table 1 of Eichholtz et al. (2010). 
To control for locational effects, each regression also includes 694 dummy variables, one for each locational cluster. Regression (5) also includes an additional 694 dummy variables, one for each green building in the sample. {{% codeblock %}} + ```R -reg1 <- feols(logrent ~ - green_rating + size_new + oocc_new + class_a + class_b + - net + empl_new | - id, +reg1 <- feols(logrent ~ + green_rating + size_new + oocc_new + class_a + class_b + + net + empl_new | + id, data = data_rent ) # Split "green rating" into two classifications: energystar and leed -reg2 <- feols(logrent ~ - energystar + leed + size_new + oocc_new + class_a + class_b + - net + empl_new | - id, +reg2 <- feols(logrent ~ + energystar + leed + size_new + oocc_new + class_a + class_b + + net + empl_new | + id, data = data_rent ) -reg3 <- feols(logrent ~ - green_rating + size_new + oocc_new + class_a + class_b + - net + empl_new + - age_0_10 + age_10_20 + age_20_30 + age_30_40 + renovated | - id, +reg3 <- feols(logrent ~ + green_rating + size_new + oocc_new + class_a + class_b + + net + empl_new + + age_0_10 + age_10_20 + age_20_30 + age_30_40 + renovated | + id, data = data_rent ) -reg4 <- feols(logrent ~ - green_rating + size_new + oocc_new + class_a + class_b + - net + empl_new + - age_0_10 + age_10_20 + age_20_30 + age_30_40 + - renovated + story_medium + story_high + amenities | +reg4 <- feols(logrent ~ + green_rating + size_new + oocc_new + class_a + class_b + + net + empl_new + + age_0_10 + age_10_20 + age_20_30 + age_30_40 + + renovated + story_medium + story_high + amenities | id, data = data_rent ) # add fixed effects for green rating -reg5 <- feols(logrent ~ - size_new + oocc_new + class_a + class_b + - net + empl_new + renovated + - age_0_10 + age_10_20 + age_20_30 + age_30_40 + - story_medium + story_high + amenities | - id + green_rating, +reg5 <- feols(logrent ~ + size_new + oocc_new + class_a + class_b + + net + empl_new + renovated + + age_0_10 + age_10_20 + age_20_30 + age_30_40 + + story_medium + story_high + amenities | + id + green_rating, data = data_rent ) ``` + {{% /codeblock %}} {{% tip %}} -- 78 observations are removed because of NA values in all 5 regressions. This results in a similar number of observations of 8105 as reported in Table 1 of Eichholtz et al. (2010). -- The variable `empl_new` is removed because of collinearity. +- 78 observations are removed because of NA values in all 5 regressions. This results in a similar number of observations of 8105 as reported in Table 1 of Eichholtz et al. (2010). + +- The variable `empl_new` is removed because of collinearity. - Note that Eichholtz et al. (2010) do report an estimate for `empl_new`. - -{{% /tip %}} +{{% /tip %}} ## Output of table with `modelsummary` -Now we have estimated the regression equations for our table, we can move on to applying `modelsummary`. `models` defines a list of regression 1 until 5. +Now we have estimated the regression equations for our table, we can move on to applying `modelsummary`. `models` defines a list of regression 1 until 5. {{% codeblock %}} + ```R models <- list( - "(1)" = reg1, - "(2)" = reg2, - "(3)" = reg3, - "(4)" = reg4, + "(1)" = reg1, + "(2)" = reg2, + "(3)" = reg3, + "(4)" = reg4, "(5)" = reg5) msummary(models) ``` -{{% /codeblock %}} +{{% /codeblock %}}
| | (1) | (2) | (3) | (4) | (5) |
| --- | --- | --- | --- | --- | --- |
| green_rating | 0.035 | | 0.033 | 0.028 | |
| | (0.009) | | (0.009) | (0.009) | |
| size_new | 0.113 | 0.113 | 0.102 | 0.110 | 0.110 |
| | (0.023) | (0.023) | (0.022) | (0.027) | (0.027) |
| oocc_new | 0.020 | 0.020 | 0.020 | 0.011 | 0.011 |
| | (0.018) | (0.018) | (0.018) | (0.017) | (0.017) |
| class_a | 0.231 | 0.231 | 0.192 | 0.173 | 0.173 |
| | (0.012) | (0.012) | (0.013) | (0.015) | (0.015) |
| class_b | 0.101 | 0.101 | 0.092 | 0.083 | 0.083 |
| | (0.011) | (0.011) | (0.011) | (0.011) | (0.011) |
| net | -0.047 | -0.047 | -0.050 | -0.051 | -0.051 |
| | (0.014) | (0.014) | (0.014) | (0.013) | (0.013) |
| energystar | | 0.033 | | | |
| | | (0.009) | | | |
| leed | | 0.052 | | | |
| | | (0.035) | | | |
| age_0_10 | | | 0.118 | 0.131 | 0.131 |
| | | | (0.019) | (0.019) | (0.019) |
| age_10_20 | | | 0.079 | 0.084 | 0.084 |
| | | | (0.017) | (0.016) | (0.016) |
| age_20_30 | | | 0.047 | 0.048 | 0.048 |
| | | | (0.013) | (0.013) | (0.013) |
| age_30_40 | | | 0.043 | 0.044 | 0.044 |
| | | | (0.012) | (0.012) | (0.012) |
| renovated | | | -0.008 | -0.008 | -0.008 |
| | | | (0.010) | (0.010) | (0.010) |
| story_medium | | | | 0.010 | 0.010 |
| | | | | (0.012) | (0.012) |
| story_high | | | | -0.027 | -0.027 |
| | | | | (0.019) | (0.019) |
| amenities | | | | 0.047 | 0.047 |
| | | | | (0.008) | (0.008) |
| Num.Obs. | 8105 | 8105 | 8105 | 8105 | 8105 |
| R2 | 0.715 | 0.715 | 0.718 | 0.720 | 0.720 |
| R2 Adj. | 0.688 | 0.688 | 0.691 | 0.693 | 0.693 |
| R2 Within | 0.131 | 0.131 | 0.140 | 0.146 | 0.133 |
| R2 Within Adj. | 0.130 | 0.130 | 0.138 | 0.144 | 0.131 |
| AIC | 1530.9 | 1532.5 | 1460.4 | 1409.2 | 1409.2 |
| BIC | 6389.1 | 6397.7 | 6353.6 | 6323.4 | 6323.4 |
| RMSE | 0.24 | 0.24 | 0.24 | 0.24 | 0.24 |
| Std.Errors | by: id | by: id | by: id | by: id | by: id |
| FE: id | X | X | X | X | X |
| FE: green_rating | | | | | X |
- This table provides a relatively clean, easy to read summary of the five regression models. In the rest of the post, we will extend and improve this table making it into something to that could be used in a research paper or presentation. We will: -- Report different standard errors -- Choosing which coefficients to report and renaming them -- Reformat estimates and statistics -- Add a caption title and table notes -- Add stars denoting statistical significance -- Format the number of decimals -- Report confidence interval rather than standard errora -- Export the output to a file +- Report different standard errors +- Choosing which coefficients to report and renaming them +- Reformat estimates and statistics +- Add a caption title and table notes +- Add stars denoting statistical significance +- Format the number of decimals +- Report confidence interval rather than standard errora +- Export the output to a file ## Report different standard errors @@ -158,18 +165,20 @@ Within the `modelsummary` function, it is possible to specify different types of On the other hand, clustered standard errors are appropriate when there are groups of observations that are likely to be correlated with each other. To specify standard errors clustered around the id values, we could use `cluster = ~id`. Note that the default behavior is to cluster standard errors by the variable used to estimate the fixed effects, which in this case is the id variable. {{% codeblock %}} + ```R #robust standard errors msummary(models, vcov = "HC1") ``` -{{% /codeblock %}} +{{% /codeblock %}} ## Selecting and formatting estimates `coef_map = cm` is used to specify the variable names in our output table, and to rearrange the order of the rows to match Table 1. {{% codeblock %}} + ```R cm = c('green_rating' = 'Green rating (1 = yes)', 'energystar' = 'Energystar (1 = yes)', @@ -184,13 +193,14 @@ cm = c('green_rating' = 'Green rating (1 = yes)', 'age_20_30' = 'Age: 20-30 years', 'age_30_40' = 'Age: 30-40 years', 'renovated' = 'Renovated (1 = yes)', - 'story_medium' = 'Stories: Intermediate (1 = yes)', - 'story_high' = 'Stories: High (1 = yes)', + 'story_medium' = 'Stories: Intermediate (1 = yes)', + 'story_high' = 'Stories: High (1 = yes)', 'amenities' = 'Amenities (1 = yes)') msummary(models, vcov="HC1", coef_map = cm) ``` + {{% /codeblock %}} ## Selecting and formatting statistics @@ -199,6 +209,7 @@ msummary(models, vcov="HC1", - `gof_map = gm` is used to specify which statistics we want to include in the bottom section, as well as their desired formatting. The names of the statistics are specified using `clean=` and the number of decimals places are specified using `fmt=`. {{% codeblock %}} + ```R gm <- list( list("raw" = "nobs", "clean" = "Sample size", "fmt" = 0), @@ -207,77 +218,83 @@ gm <- list( #get_gof(reg1) to see "raw" names of these statistics. msummary(models, vcov="HC1", - coef_map = cm, + coef_map = cm, gof_omit = 'AIC|BIC|RMSE|Within|Std.Errors|FE', gof_map = gm) ``` + {{% /codeblock %}} After applying the first few steps, the output table looks like this:
[Rendered table output: coefficient estimates for models (1)–(5) with standard errors in parentheses, followed by sample size, R2, and adjusted R2.]
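The robust-errors block above has a clustered counterpart that the text mentions only in passing. A minimal sketch of that variant, assuming `modelsummary` is loaded, `models` is the list of fitted regressions from earlier, and `id` is the cluster variable in the estimation data (in recent `modelsummary` versions a one-sided formula passed to `vcov` clusters the standard errors):

```R
# Clustered standard errors: a one-sided formula in `vcov` clusters on that variable.
# Here we cluster on the locational id that is also used for the fixed effects.
msummary(models, vcov = ~id)
```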
+## Add a caption to the table and include table notes -## Add a caption to the table and include table notes We can also add a title and note to our table. {{% codeblock %}} + ```R notetable1 <- c( - "Notes: Each regression also includes 694 dummy variables, one for each locational cluster. - Regression (5) also includes an additional 694 dummy variables, one for each green building in the sample. + "Notes: Each regression also includes 694 dummy variables, one for each locational cluster. + Regression (5) also includes an additional 694 dummy variables, one for each green building in the sample. Standard errors are in parentheses") -titletable1 <- 'Table 1—Regression Results, Commercial Office Rents and Green Ratings +titletable1 <- 'Table 1—Regression Results, Commercial Office Rents and Green Ratings (dependent variable: logarithm of effective rent in dollars per square foot)' msummary(models, vcov="HC1", - coef_map = cm, + coef_map = cm, gof_omit = 'AIC|BIC|RMSE|Within|Std.Errors|FE', - gof_map = gm, - notes = notetable1, + gof_map = gm, + notes = notetable1, title = titletable1) ``` + {{% /codeblock %}}
[Rendered table output: Table 1 ("Regression Results, Commercial Office Rents and Green Ratings"; dependent variable: logarithm of effective rent in dollars per square foot), now with the title above the table and the notes on the locational-cluster and green-building dummies below it; standard errors in parentheses.]
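The `gof_map` list used above identifies statistics by their "raw" names (`nobs`, `r.squared`, `adj.r.squared`), and a comment in the earlier block points to `get_gof()` for looking these up. A small sketch of that lookup, assuming `models` is the list of fitted regressions (indexing with `[[1]]` simply inspects the first model):

```R
# Discover the "raw" names that gof_map and coef_map refer to
get_gof(models[[1]])        # goodness-of-fit columns: nobs, r.squared, adj.r.squared, ...
get_estimates(models[[1]])  # coefficient-level columns: term, estimate, std.error, ...
```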
- ## Add stars denoting statistical significance The stars in a regression table are used to indicate the level of statistical significance of the coefficients in the regression model. They are based on the p-values, which measure the probability of obtaining the observed results when there is in fact no effect. ### Argument "stars = TRUE" -To add stars to our regression table, we can use the `stars = TRUE` argument. This will automatically add stars to the table based on a default threshold of statistical significance. By default, a note explaining the significance levels will be added at the bottom of the table: + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001. + +To add stars to our regression table, we can use the `stars = TRUE` argument. This will automatically add stars to the table based on a default threshold of statistical significance. By default, a note explaining the significance levels will be added at the bottom of the table: + p < 0.1, \* p < 0.05, ** p < 0.01, \*** p < 0.001. {{% codeblock %}} + ```R msummary(models, vcov = "HC1", stars=TRUE, - coef_map = cm, + coef_map = cm, gof_omit = 'AIC|BIC|RMSE|Within|Std.Errors|FE', - gof_map = gm, - notes = notetable1, + gof_map = gm, + notes = notetable1, title = titletable1) ``` + {{% /codeblock %}} ### Manually add stars -To replicate Table 1 of Eichholtz et al. (2010), we need to customize the output of the regression table to show the significance of coefficients using stars. +To replicate Table 1 of Eichholtz et al. (2010), we need to customize the output of the regression table to show the significance of coefficients using stars. - By default, the stars are printed next to the coefficient estimate, but we want the stars to be printed on the row of the standard error. This can be done by manually adding a list for stars and adding it to the `statistics` argument. Check the code block below! -- Additionally, we want a different note about the stars so we change our note and add the new note to the `notes` argument. +- Additionally, we want a different note about the stars so we change our note and add the new note to the `notes` argument. {{% codeblock %}} + ```R note2table1 <- c( - "Notes: Each regression also includes 694 dummy variables, one for each locational cluster. - Regression (5) also includes an additional 694 dummy variables, one for each green building in the sample. + "Notes: Each regression also includes 694 dummy variables, one for each locational cluster. + Regression (5) also includes an additional 694 dummy variables, one for each green building in the sample. Standard errors are in brackets.", - "***Significant at the 1 percent level.", + "***Significant at the 1 percent level.", "**Significant at the 5 percent level.", "*Significant at the 10 percent level.") @@ -286,26 +303,26 @@ msummary(models, stars = c('*' = .1, '**' = 0.05, '***' = .01), estimate = "{estimate}", statistic = "[{std.error}]{stars}", - coef_map = cm, + coef_map = cm, gof_omit = 'AIC|BIC|RMSE|Within|Std.Errors|FE', gof_map = gm, notes = note2table1, title = titletable1) ``` + {{% /codeblock %}}
[Rendered table output: the same Table 1 with standard errors in brackets, significance stars printed on the standard-error rows, and the customized significance notes below the table.]
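The `estimate` and `statistic` arguments used above are glue templates, which is what lets us move the stars from the estimate row down to the standard-error row. Two alternative layouts as a sketch, under the same assumptions about `models`, `cm`, and the significance thresholds:

```R
# Stars next to the estimate, standard error in parentheses on the row below
msummary(models,
         vcov = "HC1",
         stars = c('*' = .1, '**' = 0.05, '***' = .01),
         estimate  = "{estimate}{stars}",
         statistic = "({std.error})",
         coef_map = cm)

# Everything on a single row per coefficient; statistic = NULL drops the second row
msummary(models,
         vcov = "HC1",
         stars = c('*' = .1, '**' = 0.05, '***' = .01),
         estimate  = "{estimate}{stars} [{std.error}]",
         statistic = NULL,
         coef_map = cm)
```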
- ## Format the number of decimals -The `fmt` argument in the `modelsummary` functions allows us to control how numeric values are rounded and presented in the table. In order to match the formatting of Table 1 in Eichholtz et al. (2010), we set the number of decimal digits to 3. +The `fmt` argument in the `modelsummary` functions allows us to control how numeric values are rounded and presented in the table. In order to match the formatting of Table 1 in Eichholtz et al. (2010), we set the number of decimal digits to 3. There is various ways to get the desired number of decimals. For example, it is possible to give statistics a different number of decimals than estimates, or to display the values in scientific (exponential) notation by specifying `fmt = fmt_sprintf("%.3e")`. - {{% codeblock %}} + ```R msummary(models, vcov = "HC1", @@ -314,48 +331,51 @@ msummary(models, stars = c('*' = .1, '**' = 0.05, '***' = .01), estimate = "{estimate}", statistic = "[{std.error}]{stars}", - coef_map = cm, + coef_map = cm, gof_omit = 'AIC|BIC|RMSE|Within|Std.Errors|FE', gof_map = gm, notes = note2table1, title = titletable1) ``` + {{% /codeblock %}} ## Report confidence intervals instead of standard errors -Some scientific associations discourage the use of stars. For instance, one of the [guidelines](https://www.aeaweb.org/journals/aer/submissions/guidelines) of the American Economic Association is to report standard errors in parentheses but to not use *s to report significance levels. +Some scientific associations discourage the use of stars. For instance, one of the [guidelines](https://www.aeaweb.org/journals/aer/submissions) of the American Economic Association is to report standard errors in parentheses but to not use \*s to report significance levels. -In this step, we leave out the stars and report confidence intervals instead of standard errors. In some situations confidence intervals might be more informative, as they provide a range of plausible values that can be used to estimate the true value of a population parameter. +In this step, we leave out the stars and report confidence intervals instead of standard errors. In some situations confidence intervals might be more informative, as they provide a range of plausible values that can be used to estimate the true value of a population parameter. {{% codeblock %}} + ```R #Change the note to be correct -note3table1 <- c("Notes: Each regression also includes 694 dummy variables, one for -each locational cluster. Regression (5) also includes an additional 694 dummy variables, +note3table1 <- c("Notes: Each regression also includes 694 dummy variables, one for +each locational cluster. Regression (5) also includes an additional 694 dummy variables, one for each green building in the sample. Confidence intervals are in brackets.") msummary(models, vcov = "HC1", fmt = fmt_statistic(estimate = 3, conf.int = 3), statistic ='conf.int', - coef_map = cm, + coef_map = cm, gof_omit = 'AIC|BIC|RMSE|Within|Std.Errors|FE', gof_map = gm, notes = note3table1, title = titletable1) ``` + {{% /codeblock %}}
[Rendered table output: the same Table 1 reporting confidence intervals in brackets instead of standard errors, with the table note adjusted accordingly.]
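The decimals discussion above mentions two `fmt` options without showing them in code: giving statistics a different number of decimals than estimates, and scientific notation via `fmt_sprintf()`. A minimal sketch of both, using the same `models` and `cm` objects:

```R
# Estimates with 3 decimals, standard errors with 4
msummary(models,
         vcov = "HC1",
         fmt = fmt_statistic(estimate = 3, std.error = 4),
         coef_map = cm)

# All numbers in scientific (exponential) notation with 3 digits
msummary(models,
         vcov = "HC1",
         fmt = fmt_sprintf("%.3e"),
         coef_map = cm)
```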
- ## Exporting the table to a file -The `output` argument specifies the destination where the output can be exported. In the given code, the output is named `table1.html`. The extension `.html` specifies that the table will be printed in HTML format. Other possible extensions are for example `.docx`, `.md`, and `.txt`. If the `output` argument is not included, the output will be printed directly in the console. +The `output` argument specifies the destination where the output can be exported. In the given code, the output is named `table1.html`. The extension `.html` specifies that the table will be printed in HTML format. Other possible extensions are for example `.docx`, `.md`, and `.txt`. If the `output` argument is not included, the output will be printed directly in the console. {{% codeblock %}} + ```R msummary(models, output = "table1.html", vcov = "HC1", @@ -363,12 +383,13 @@ msummary(models, output = "table1.html", stars = c('*' = .1, '**' = 0.05, '***' = .01), estimate = "{estimate}", statistic = "[{std.error}]{stars}", - coef_map = cm, + coef_map = cm, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm, notes = note2table1, title = titletable1) ``` + {{% /codeblock %}} {{% summary %}} @@ -376,11 +397,11 @@ msummary(models, output = "table1.html", In this building block, we covered the most useful functions of the `modelsummary` package: - You can customize the standard errors printed in your model. For instance, `vcov = "HC1"` will give robust standard errors, and `cluster= ~id` will produce clustered standard errors by the variable `id`. -- You can customize the way estimates and statistics are presented, including the order of the estimates, variable names and which goodness-of-fit measures are printed. -- It is possible to add a title and note at the bottom of your table. +- You can customize the way estimates and statistics are presented, including the order of the estimates, variable names and which goodness-of-fit measures are printed. +- It is possible to add a title and note at the bottom of your table. - You can add stars to indicate the level of statistical significance of the coefficients with `stars = TRUE`, which provides default values and a note at the bottom. Alternatively, you can add stars manually to customize the threshold for significance or the note at the bottom. -- You can choose to include confidence intervals instead of standard errors for your estimates. -- The `fmt` argument allows you to specify the number of decimal places for estimates and statistics. +- You can choose to include confidence intervals instead of standard errors for your estimates. +- The `fmt` argument allows you to specify the number of decimal places for estimates and statistics. - You can export the table using the `output` argument. Adding extensions like `.html` and `.docx` will produce the table in that format. {{% /summary %}} diff --git a/content/topics/Visualization/reporting-tables/reportingtables/kableextra.md b/content/topics/Visualization/reporting-tables/reportingtables/kableextra.md index a0dac53d5..6360e8f90 100644 --- a/content/topics/Visualization/reporting-tables/reportingtables/kableextra.md +++ b/content/topics/Visualization/reporting-tables/reportingtables/kableextra.md @@ -12,12 +12,11 @@ aliases: - /run/kableextra --- - # Overview The main purpose of the `kableExtra` package is to simplify the process of creating tables with custom styles and formatting in R. 
In this building block, we will provide an example using some useful functions of `kableExtra` to improve the output of the `modelsummary` package and create a table in LaTeX format suitable for a publishable paper. -The starting point is the final output of our [`modelsummary` building block](https://tilburgsciencehub.com/topics/analyze-data/regressions/model-summary/), which is a replication of Table 1 of [Eiccholtz et al. (2010)](https://www.aeaweb.org/articles?id=10.1257/aer.100.5.2492). This table presents the results of a model investigating how green building certification impacts rental prices of commercial office buildings. +The starting point is the final output of our [`modelsummary` building block](../../data-visualization/regression-results/model-summary.md), which is a replication of Table 1 of [Eiccholtz et al. (2010)](https://www.aeaweb.org/articles?id=10.1257/aer.100.5.2492). This table presents the results of a model investigating how green building certification impacts rental prices of commercial office buildings. The following `kableExtra` functions will be covered to get to a formatted table that can be inserted into a research paper or slide deck that is built using LaTeX: @@ -26,13 +25,14 @@ The following `kableExtra` functions will be covered to get to a formatted table - Specifying the column alignment within the table - Add rows to the table to designate which fixed effects are included in each regression specification - Add a row that specifies the dependent variable in the regression -- Formatting and grouping rows of regression coefficients +- Formatting and grouping rows of regression coefficients ## Load packages and data To begin, let's load the necessary packages and data: {{% codeblock %}} + ```R # Load packages library(modelsummary) @@ -42,18 +42,20 @@ library(stringr) library(knitr) library(kableExtra) -# Load data +# Load data data_url <- "https://github.com/tilburgsciencehub/website/blob/master/content/topics/Visualization/Reporting_tables/ReportingTables/data_rent.Rda?raw=true" load(url(data_url)) #data_rent is loaded now ``` + {{% /codeblock %}} +## The `modelsummary` table -## The `modelsummary` table - 'models' is a list which consists of five regression models (regression 1 to 5). For a detailed overview and understanding of these regressions, please refer to the [`modelsummary` building block](https://tilburgsciencehub.com/topics/analyze-data/regressions/model-summary/). - `cm2` and `gm2` represent the variable names and statistics included in the regression table, respectively. {{% codeblock %}} + ```R cm2 = c('green_rating' = 'Green rating (1 $=$ yes)', 'energystar' = 'Energystar (1 $=$ yes)', @@ -68,8 +70,8 @@ cm2 = c('green_rating' = 'Green rating (1 $=$ yes)', 'age_20_30' = '20-30 years', 'age_30_40' = '30-40 years', 'renovated' = 'Renovated (1 $=$ yes)', - 'story_medium' = 'Intermediate (1 $=$ yes)', - 'story_high' = 'High (1 $=$ yes)', + 'story_medium' = 'Intermediate (1 $=$ yes)', + 'story_high' = 'High (1 $=$ yes)', 'amenities' = 'Amenities (1 $=$ yes)') gm2 <- list( @@ -83,54 +85,56 @@ msummary(models, #stars = c('*' = .1, '**' = 0.05, '***' = .01), estimate = "{estimate}", statistic = "[{std.error}]{stars}", - coef_map = cm2, + coef_map = cm2, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm2) ``` + {{% /codeblock %}}
[Rendered table output: the baseline `modelsummary` table for models (1)–(5) using the `cm2` labels and `gm2` statistics, before any `kableExtra` formatting is applied.]
- ## Output in LaTex format We want the table to be outputted in LaTeX format now. We set the `output` argument to `latex` in the `msummary()` function. This generates LaTeX code that can be directly copied and pasted into a LaTeX document. {{% codeblock %}} + ```R -msummary(models, +msummary(models, vcov = "HC1", fmt = 3, estimate = "{estimate}", statistic = "[{std.error}]", - coef_map = cm2, + coef_map = cm2, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm2, output = "latex", escape = FALSE ) ``` + {{% /codeblock %}} {{% tip %}} When `escape = FALSE`, any special characters that are present in the output of `msummary()` will not be modified or escaped. This means that they will be printed exactly as they appear in the output, without any changes or substitutions. On the other hand, if escape = TRUE, then any special characters that are present in the output will be replaced with the appropriate LaTeX commands to render them correctly in the final document. {{% /tip %}} - ## Export the table to a file -To save this LaTex code to a .tex file, we can use the `cat()` function. The `file` argument specifies the name of the file we want the output to be printed to: `my_table.tex`. Not including any `file` argument will print the output to the console. +To save this LaTex code to a .tex file, we can use the `cat()` function. The `file` argument specifies the name of the file we want the output to be printed to: `my_table.tex`. Not including any `file` argument will print the output to the console. {{% codeblock %}} + ```R -msummary(models, +msummary(models, vcov = "HC1", fmt = 3, stars = c('*' = .1, '**' = 0.05, '***' = .01), estimate = "{estimate}", statistic = "[{std.error}]", - coef_map = cm2, + coef_map = cm2, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm2, output = "latex", @@ -138,13 +142,13 @@ msummary(models, ) %>% cat(.,file="my_table.tex") ``` + {{% /codeblock %}} {{% tip %}} The %>% operator is being used to pipe the output of `msummary()` into the `cat()` function. Specifically, the dot (.) is a placeholder for the output of the previous function in the pipeline, which in this case is `msummary()`. {{% /tip %}} -

@@ -157,22 +161,25 @@ The LaTex file requires the following packages loaded in the header: \usepackage{siunitx} \usepackage{bbding} ``` + {{% /tip %}} ## Specify the column alignment You can customize the horizontal alignment of the table columns by including the `align` argument within the `msummary()` function. By setting `align = "lcccc"`, you can specify the alignment for each column as follows: + - The first column, containing the variable names, will be left-aligned (l). - The second to fifth columns, displaying the coefficients, will be centered (c). {{% codeblock %}} + ```R - msummary(models, + msummary(models, vcov = "HC1", fmt = 3, estimate = "{estimate}", statistic = "[{std.error}]", - coef_map = cm2, + coef_map = cm2, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm2, align="lccccc", @@ -181,6 +188,7 @@ You can customize the horizontal alignment of the table columns by including the ) ``` + {{% /codeblock %}} ## Add rows designating fixed-effects specification @@ -188,11 +196,14 @@ You can customize the horizontal alignment of the table columns by including the In this step, we will integrate additional rows in the regression table to indicate the inclusion of fixed effects in each model. ### Creating a tibble + First, the additional rows are specified in a tibble. Within the tibble: -- There are two columns: term and "(1)", "(2)", "(3)", "(4)", "(5)". + +- There are two columns: term and "(1)", "(2)", "(3)", "(4)", "(5)". - The two rows specify the two kind of Fixed Effects: Location Fixed Effect & Green Building Fixed Effect. In these rows, it is specified for each regression model whether these fixed effects are included: "Checkmark" indicates they are included, and "XSolidBrush" indicates they are not included. {{% codeblock %}} + ```R library(tibble) @@ -201,9 +212,11 @@ rows <- tribble(~term, ~"(1)", ~"(2)", ~"(3)",~"(4)", ~"(5)", 'Green Building Fixed Effect', '\\XSolidBrush', '\\XSolidBrush', '\\XSolidBrush', '\\XSolidBrush', '\\Checkmark' ) ``` + {{% /codeblock %}} ### Position of Fixed Effect rows in table + To insert the fixed effect rows at the appropriate location in the table (between the estimates and statistics), we use the `attr()` function. These rows should be placed at row 33 and 34. ```R @@ -211,31 +224,35 @@ attr(rows, 'position') <- c(33, 34) ``` ### add_rows + To include the tibble with the fixed effect rows in the final table, we use the `add_rows = rows` argument within the `msummary()` function. Note that `rows` refers to the name of our tibble. ### row_spec + To distinguish the fixed effect rows from the summary statistics, we can add a horizontal line below the fixed effect rows using the `row_spec()` function after `modelsummary()`, using the pipe operator. Here is the `msummary()` function incorporating the mentioned arguments to generate the regression table with fixed effect rows: {{% codeblock %}} + ```R -msummary(models, +msummary(models, vcov = "HC1", fmt = 3, estimate = "{estimate}", statistic = "[{std.error}]", - coef_map = cm2, + coef_map = cm2, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm2, add_rows = rows, - align="lccccc", + align="lccccc", output = "latex", escape = FALSE ) %>% row_spec(34, extra_latex_after = "\\midrule") %>% cat(., file = "my_table.tex") ``` + {{% /codeblock %}} The table generated using the previous code is displayed below. As you can see, fixed effect rows are added now! @@ -246,31 +263,34 @@ The table generated using the previous code is displayed below. 
As you can see, ## Add header -To include a header above the existing column headers of the regression table, we can use the `add_header_above()` function. -- The name of the header should be provided as a character string. +To include a header above the existing column headers of the regression table, we can use the `add_header_above()` function. + +- The name of the header should be provided as a character string. - To span the header across columns 2 to 6 while excluding the first column, we can set the header value to 5. {{% codeblock %}} + ```R -msummary(models, +msummary(models, vcov = "HC1", fmt = 3, estimate = "{estimate}", statistic = "[{std.error}]", - coef_map = cm2, + coef_map = cm2, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm2, add_rows = rows, - align="lccccc", + align="lccccc", output = "latex", escape = FALSE ) %>% add_header_above(c(" " = 1, "Dependent Variable: $log(rent)$" = 5), escape = FALSE - ) + ) cat(., file = "my_table.tex") ``` + {{% /codeblock %}} The LaTeX output displays the header as follows: @@ -282,27 +302,28 @@ The LaTeX output displays the header as follows: ## Group rows ### Add a labeling row -We can use the `pack_rows()` function to insert labeling rows and group variables into categories. In our case, we have three categories: Building Class, Age and Stories. -Within `pack_rows()`, we provide the category name along with the first and last row numbers for the variables that belong to that category. Additionally, we can specify the formatting for the category name, such as printing it in italic text instead of bold. +We can use the `pack_rows()` function to insert labeling rows and group variables into categories. In our case, we have three categories: Building Class, Age and Stories. -### Indent subgroups +Within `pack_rows()`, we provide the category name along with the first and last row numbers for the variables that belong to that category. Additionally, we can specify the formatting for the category name, such as printing it in italic text instead of bold. +### Indent subgroups To indicate that Energystar and LEED are subgroups of the variable Green rating (row 1), we can use the `add_intent()` function. This function allows us to indent specific rows. {{% codeblock %}} + ```R -msummary(models, +msummary(models, vcov = "HC1", fmt = 3, estimate = "{estimate}", statistic = "[{std.error}]", - coef_map = cm2, + coef_map = cm2, gof_omit = 'AIC|BIC|RMSE|Within|FE', gof_map = gm2, add_rows = rows, - align="lccccc", + align="lccccc", output = "latex", escape = FALSE ) %>% @@ -315,6 +336,7 @@ msummary(models, cat(., file = "my_table.tex") ``` + {{% /codeblock %}}
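The exact `pack_rows()` and `add_indent()` calls (kableExtra's grouping and indentation helpers referred to in the text) fall between the diff hunks above, so the following is only a hedged sketch of the pattern. `tbl` stands for the LaTeX table returned by the `msummary(..., output = "latex", escape = FALSE)` call shown earlier, and the row numbers are illustrative placeholders rather than the ones used in the source file (each variable occupies two rows, one for the estimate and one for the standard error):

```R
tbl %>%
  # Insert an italic (not bold) labeling row above each group of variables;
  # the start/end row numbers are placeholders
  pack_rows("Building class:", 11, 14, bold = FALSE, italic = TRUE) %>%
  pack_rows("Age:", 17, 24, bold = FALSE, italic = TRUE) %>%
  pack_rows("Stories:", 27, 30, bold = FALSE, italic = TRUE) %>%
  # Indent the Energystar and LEED rows to mark them as subgroups of Green rating
  add_indent(c(3, 4, 5, 6)) %>%
  cat(., file = "my_table.tex")
```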

@@ -322,8 +344,7 @@ msummary(models,

{{% summary %}} -The `kableExtra` package provides powerful tools for creating attractive and publication-ready tables in R. By combining it with the `modelsummary` package, we can generate informative and visually appealing regression tables for our research papers or reports. +The `kableExtra` package provides powerful tools for creating attractive and publication-ready tables in R. By combining it with the `modelsummary` package, we can generate informative and visually appealing regression tables for our research papers or reports. This example shows how to use some `kableExtra` functions to improve a standard `modelsummary` table and output it in LaTeX format. The functions covered show how to export the table in LaTeX format, specify column alignment, add fixed-effect rows, add a header, and group rows. {{% /summary %}} -