Appendix A — File Organization & Paths

File Organization

For small college projects, a well-organized file system might seem like overkill. However, for complex projects with multiple contributors, a clear file system and supporting documentation are crucial. While individual preferences for file organization vary, these essential rules apply to any data science project:

Minimum Folders. Include a data folder and a code folder named, eg, code, scripts, or src.
Raw Data Integrity. Original data files should remain unaltered and clearly labeled. Ideally, store them in a raw subfolder within data.
Data Provenance. Keep the original data source information alongside the data, eg, in a population-source.txt file next to population.csv.
Efficient Data Processing. If raw data cleaning is time-consuming, use a script, eg, population-clean.qmd, to process it once. Store the processed/cleaned data, eg, population-cleaned.csv, either alongside the raw data or in a dedicated processed or cleaned subfolder within data.
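
As a minimal sketch of the raw-data and processing rules above, the base R code below reads a raw file, cleans it, and writes the result to a separate processed folder without touching the original. The temporary project folder, the made-up values, and the cleaning step (dropping rows with missing values) are illustrative assumptions, not part of the course materials.

```r
# Hypothetical project root in a temporary folder so the sketch is self-contained
project <- file.path(tempdir(), "project_name")
dir.create(file.path(project, "data", "raw"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data", "processed"), recursive = TRUE, showWarnings = FALSE)

# Stand-in for the original data file (made-up values)
raw_path <- file.path(project, "data", "raw", "population.csv")
write.csv(
  data.frame(region = c("A", "B", "C"), population = c(100, NA, 300)),
  raw_path,
  row.names = FALSE
)

# Read the raw data, clean it, and save the result under a new name
population <- read.csv(raw_path)
population_cleaned <- population[!is.na(population$population), ]
processed_path <- file.path(project, "data", "processed", "population-cleaned.csv")
write.csv(population_cleaned, processed_path, row.names = FALSE)
```

Because the cleaned data is written to a new file, the raw file can always be reprocessed from scratch if the cleaning step changes.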

Folder vs Directory

Beyond technical distinctions, folder and directory are interchangeable terms (Wikipedia: https://en.wikipedia.org/wiki/Directory_(computing)).

Default File Download Location

By default, most internet browsers automatically save all files to the Downloads folder on your computer, which does not encourage good file organization. Change this setting so that your browser asks where to save each file before downloading it. This Online Tech Tips page (https://www.online-tech-tips.com/computer-tips/change-default-download-folder-location-on-any-web-browser/) explains how to do this for the most common browsers.

Courses

When starting a new semester and a new course, it is recommended to set up a directory structure such as the one below:

Documents
└─ course_work
   └─ stat212
      ├─ class_activities
      ├─ homework
      └─ project

For this class, you will be asked to clone multiple repositories onto your machine. You can store all of them under a parent folder called comp212 or stat212 to keep your file system organized. However, make sure not to store any of these repositories inside another one, because nesting repositories will break the Git tracking mechanism. For instructions on how to clone the repositories, see the course main page.

Data Science Projects

Below is one recommended directory structure when working on any data science project:

Documents
└─ project_name                     ← should be short but descriptive
   ├─ code                          ← all code files, eg, .R, .Rmd, .qmd go here
   │  ├─ wa                         ← a work area to try code
   │  │  ├─ viz.qmd                 ← trying different visualizations
   │  │  └─ model.qmd               ← trying different models
   │  ├─ viz.qmd                    ← final visualizations
   │  └─ model.qmd                  ← final models
   │
   ├─ data                          ← all data files go here
   │  ├─ raw                        ← original data files go here
   │  │  ├─ population.csv          ← original data
   │  │  └─ population-source.txt   ← information about the data source
   │  │
   │  └─ processed
   │     ├─ population-cleaned.csv  ← processed/cleaned data
   │     └─ population-clean.qmd    ← script used to clean the data
   │
   └─ results
      ├─ report.qmd                 ← written narrative
      │
      ├─ figures                    ← plots produced by ggsave()
      │  ├─ plot1.png
      │  ├─ plot2.tiff
      │  └─ plot3.pdf
      │
      ├─ tables                     ← for tabular results, eg, .csv
      │  ├─ results1.csv
      │  └─ results2.csv
      │
      └─ interactive                ← for interactive shiny apps
         ├─ app.R
         ├─ ...
         └─ ...
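
The folder skeleton above could be created from R rather than by hand. The following is a minimal sketch using base R's dir.create(); the temporary location stands in for Documents and is an assumption made so the example runs anywhere.

```r
# Hypothetical location: a temporary folder stands in for Documents
project <- file.path(tempdir(), "project_name_skeleton")

# Subfolders taken from the tree above
folders <- c(
  "code/wa",
  "data/raw",
  "data/processed",
  "results/figures",
  "results/tables",
  "results/interactive"
)

# recursive = TRUE also creates the parent folders (code, data, results)
for (folder in folders) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project folder
list.dirs(project, full.names = FALSE, recursive = TRUE)
```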

For this class, you will be provided with a structured file system that you will be asked to clone via a GitHub Classroom invite link. For instructions on how to clone the repositories, see the course main page.

File Paths

When you read in data from a source on your computer, you need to specify the file path correctly. A file path is the address of the place where the file is stored on your computer.

Example: Physical Address

Consider 1600 Grand Ave, St. Paul, MN 55105. Think about how different parts of the address give increasingly more specific information about the location.

St. Paul, MN 55105 tells us the city and smaller region within the city
Grand Ave tells us the street
1600 tells us the specific location on the street.

Example: File Address

Consider the following code:

my_data <- read_csv("~/Desktop/112/data/my_data.csv")

In this example, the file path tells us the location giving more and more specific information as we read it from left to right. In particular,

~ on an Apple computer tells us that we are looking in the user’s home directory.
Desktop tells us to go to the Desktop within that home directory.
112 tells us that we are looking in the 112 folder on the Desktop.
data tells us to next go in the data folder in the 112 folder.
my_data.csv tells us that we are looking for a file called my_data.csv located within the data folder.
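
The left-to-right breakdown above can be reproduced in R. This is a small sketch using base R functions only; the path is the one from the example and does not need to exist on your computer.

```r
path <- "~/Desktop/112/data/my_data.csv"

# Split the path into its parts, from least to most specific
parts <- strsplit(path, "/", fixed = TRUE)[[1]]
parts
# → "~" "Desktop" "112" "data" "my_data.csv"

# The last part is the file name; everything before it is the folder
basename(path)
dirname(path)

# path.expand() replaces ~ with the home directory on the current computer
path.expand(path)
```

The output of dirname() and path.expand() depends on the computer, because both may expand ~ to that machine's home directory.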

Types

There are two types of paths: absolute and relative. Absolute file paths start at the “root” directory of a computer system, eg, ~/Desktop/Course_Work/STAT212/ica/maps/us_states_hexgrid.geojson (on Mac, where ~ is shorthand for the absolute path of your home directory) and C:/Users/lesliemyint/Documents/Course_Work/STAT212/ica/maps/us_states_hexgrid.geojson (on Windows). Relative file paths, on the other hand, start from wherever you are right now: the working directory.

When referencing other files, absolute paths are NOT a good idea: they include folders specific to your computer, such as your user name, so if the code file is shared, the path will not work on a different computer. Relative paths, on the other hand, will still work as long as the project folder is shared with its internal structure intact. Except for resources hosted on the web (eg, https://mac-stat.github.io/data/sfo_weather.csv), always use relative paths.
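
To make the difference concrete, the sketch below resolves one relative path against the working directory and prints the absolute path it corresponds to on the current machine. The temporary project folder and the file contents are assumptions for illustration only.

```r
# Hypothetical project folder with one data file (temporary, for illustration)
project <- file.path(tempdir(), "shared_project")
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(project, "data", "my_data.csv"), row.names = FALSE)

# Make the project folder the working directory, remembering the old one
old_wd <- setwd(project)

# The relative path is resolved against the working directory
relative_path <- "data/my_data.csv"
file.exists(relative_path)

# Equivalent absolute path on this computer; it differs on every machine
absolute_path <- file.path(getwd(), relative_path)
absolute_path

# Restore the original working directory
setwd(old_wd)
```

The relative path stays the same wherever the project folder is copied, while the absolute path changes with the computer and user.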

~ vs / vs \

On a Mac the tilde ~ in a file path refers to the “Home” directory, which is typically a user-specific directory.
Windows accepts both / (forward slash) and \ (backslash) as separators between folders in a file path.
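
In R code, the two separators are not equally convenient. The sketch below, using base R only and hypothetical folder names, shows that file.path() builds paths with forward slashes and that backslashes must be doubled inside an R string.

```r
# file.path() joins folder names with a forward slash, which R accepts on
# Mac, Linux, and Windows
file.path("data", "raw", "population.csv")
# → "data/raw/population.csv"

# In an R string, a single backslash starts an escape sequence, so a
# Windows-style path needs each backslash doubled
windows_style <- "data\\raw\\population.csv"

# Converting backslashes to forward slashes gives the same path as above
gsub("\\", "/", windows_style, fixed = TRUE)
# → "data/raw/population.csv"
```

Sticking to forward slashes, or to file.path(), avoids the escaping issue entirely.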

Different Working Directories

Note that the working directory when you are working in a code file may be different from the working directory specified in the Console.
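
One way to see which folder relative paths start from is to print the working directory in both places. The sketch below uses base R's getwd(); run it once in the Console and once inside a code chunk, then compare. It assumes the default behavior of knitr-based documents (.Rmd and .qmd), which evaluate chunks with the document's own folder as the working directory.

```r
# Folder that relative paths are resolved against in the current session
getwd()

# Files visible from that folder; a relative path such as "data.csv"
# only works if the file shows up here
list.files()
```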

Examples

The table below illustrates sample directory setups for data files referenced by code files. Note the relative path options used to reference the data file.

Data Location: Same folder as code file
File Structure:
project_folder
├─ your_code_file.qmd
└─ data.csv
Relative Path Options: `./data.csv` or `data.csv`

Data Location: Within a subfolder called `data`
File Structure:
project_folder
├─ your_code_file.qmd
│
└─ data
   └─ data.csv
Relative Path Options: `./data/data.csv` or `data/data.csv`

Data Location: In a sibling folder called `data`
File Structure:
project_folder
├─ code
│  └─ your_code_file.qmd
│
└─ data
   └─ data.csv
Relative Path Options: `./../data/data.csv` ¹ or `../data/data.csv`

./ vs ../

./ refers to the current working directory
../ refers to the parent directory

¹ From your_code_file.qmd, you must first go “up” from the code folder to its parent, project_folder, and then back “down” into the data folder. To go “up” to a parent folder in a relative path, we use ../
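
The sibling-folder case can be tried out safely in a throwaway location. The sketch below recreates that layout under a temporary folder (an assumption for illustration), switches the working directory to the code folder, and reads the data file with both relative path options.

```r
# Recreate the sibling-folder layout in a temporary location (hypothetical)
project <- file.path(tempdir(), "project_folder")
dir.create(file.path(project, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(id = 1:2), file.path(project, "data", "data.csv"), row.names = FALSE)

# Pretend we are working from the code folder, as your_code_file.qmd would
old_wd <- setwd(file.path(project, "code"))

# Go up one level to project_folder, then down into data
from_parent <- read.csv("../data/data.csv")

# Same file; the leading ./ only restates "start from here"
from_parent_dot <- read.csv("./../data/data.csv")

identical(from_parent, from_parent_dot)
# → TRUE

# Restore the original working directory
setwd(old_wd)
```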
