Appendix A — File Organization & Paths
File Organization
For small college projects, a well-organized file system might seem like overkill. However, for complex projects with multiple contributors, a clear file system and supporting documentation are crucial. While individual preferences for file organization vary, these essential rules apply to any data science project:
- Minimum Folders. Include a data folder and a code folder named, eg, code, scripts, or src.
- Raw Data Integrity. Original data files should remain unaltered and clearly labeled. Ideally, store them in a raw subfolder within data.
- Data Provenance. Keep the original data source information alongside the data, eg, in a population-source.txt file next to population.csv.
- Efficient Data Processing. If raw data cleaning is time-consuming, use a script, eg, population-process.qmd, to process it. Store the processed/cleaned data, eg, population-processed.csv, either alongside the raw data or in a dedicated processed or cleaned subfolder within data.
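The rules above can be sketched in a few lines of base R. This is a minimal illustration rather than a prescribed workflow: the project lives in a temporary folder, the tiny population data set and the source note are made up, and only the file names come from the examples above.

```r
# Hypothetical project root in a temporary folder, so no real files are touched.
project <- file.path(tempdir(), "demo_project")
raw_dir <- file.path(project, "data", "raw")
processed_dir <- file.path(project, "data", "processed")
dir.create(file.path(project, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(processed_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up raw data, written once and never edited afterwards.
raw <- data.frame(city = c("A", "B", "C"), population = c(100, NA, 300))
write.csv(raw, file.path(raw_dir, "population.csv"), row.names = FALSE)

# Provenance note kept next to the raw file (placeholder text).
writeLines("Source: made-up example data",
           file.path(raw_dir, "population-source.txt"))

# Processing step: read the raw file, drop incomplete rows, save the result elsewhere.
population <- read.csv(file.path(raw_dir, "population.csv"))
population_processed <- population[!is.na(population$population), ]
write.csv(population_processed,
          file.path(processed_dir, "population-processed.csv"),
          row.names = FALSE)
```

Because the cleaned file is written to a separate folder, the raw file stays exactly as it was obtained, and the processing step can be rerun at any time.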
Beyond technical distinctions, folder and directory are interchangeable terms (Wikipedia).
By default, most web browsers save every file to the Downloads folder on your computer, which does not encourage good file organization. Change this setting so that your browser asks you where to save each file before downloading it. This Online Tech Tips page explains how to do this for the most common browsers.
Courses
When starting a new semester and a new course, it is recommended to set up a directory structure such as the one below:
For this class, you will be asked to clone multiple repositories onto your machine. You can store all of them under a parent folder called comp212 or stat212 to keep your file system organized. However, make sure not to store any of these repositories inside another, because nesting repositories breaks Git's tracking mechanism. For instructions on how to clone the repositories, see the course main page.
Data Science Projects
Below is one recommended directory structure when working on any data science project:
Documents
└─ project_name ← should be short but descriptive
   ├─ code ← all code files, eg, .R, .Rmd, .qmd go here
   │  ├─ wa ← a work area to try code
   │  │  ├─ viz.qmd ← trying different visualizations
   │  │  └─ model.qmd ← trying different models
   │  ├─ viz.qmd ← final visualizations
   │  └─ model.qmd ← final models
   │
   ├─ data ← all data files go here
   │  ├─ raw ← original data files go here
   │  │  ├─ population.csv ← original data
   │  │  └─ population-source.txt ← information about the data source
   │  │
   │  └─ processed
   │     ├─ population-cleaned.csv ← processed/cleaned data
   │     └─ population-clean.qmd ← script used to clean the data
   │
   └─ results
      ├─ report.qmd ← written narrative
      │
      ├─ figures ← plots produced by ggsave()
      │  ├─ plot1.png
      │  ├─ plot2.tiff
      │  └─ plot3.pdf
      │
      ├─ tables ← for tabular results, eg, .csv
      │  ├─ results1.csv
      │  └─ results2.csv
      │
      └─ interactive ← for interactive shiny apps
         ├─ app.R
         ├─ ...
         └─ ...

For this class, you will be provided with a structured file system that you will be asked to clone via a GitHub Classroom invite link. For instructions on how to clone the repositories, see the course main page.
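For projects outside this class, the folder skeleton in the tree above could be created from R instead of by hand. The sketch below is one possible approach using base R only. It assumes a temporary folder in place of Documents, and the project name is the placeholder from the tree.

```r
# Hypothetical project root; a temporary folder stands in for Documents.
project <- file.path(tempdir(), "project_name")

# Sub-folders taken from the tree above, written relative to the project root.
folders <- c(
  "code/wa",
  "data/raw",
  "data/processed",
  "results/figures",
  "results/tables",
  "results/interactive"
)

# recursive = TRUE also creates missing parents such as code/ and data/.
for (folder in folders) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root.
created <- list.dirs(project, full.names = FALSE, recursive = TRUE)
print(sort(created[created != ""]))
```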
File Paths
When you read in data from a source on your computer, you need to specify the file path correctly. A file path is the address of the location where the file is stored on your computer.
Example: Physical Address
Consider 1600 Grand Ave, St. Paul, MN 55105. Think about how different parts of the address give increasingly more specific information about the location.
- St. Paul, MN 55105 tells us the city and smaller region within the city
- Grand Ave tells us the street
- 1600 tells us the specific location on the street
Example: File Address
Consider code that reads in a data file using the file path ~/Desktop/112/data/my_data.csv.
In this example, the file path tells us the location giving more and more specific information as we read it from left to right. In particular,
- ~ on an Apple computer tells us that we are looking in the user’s home directory.
- Desktop tells us to go to the Desktop within that home directory.
- 112 tells us that we are looking in the 112 folder on the Desktop.
- data tells us to next go in the data folder in the 112 folder.
- my_data.csv tells us that we are looking for a file called my_data.csv located within the data folder.
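The left-to-right reading of the path can be reproduced in R. The sketch below only manipulates the path as text, so the file does not need to exist. It uses the base R functions strsplit(), basename(), and dirname().

```r
# The example path discussed above.
path <- "~/Desktop/112/data/my_data.csv"

# Split on "/" to see each piece, from least to most specific.
parts <- strsplit(path, "/", fixed = TRUE)[[1]]
print(parts)

# basename() keeps only the last piece: the file itself.
print(basename(path))

# dirname() keeps the folder that contains the file
# (R expands ~ to the home directory in this result).
print(dirname(path))
```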
Types
There are two types of paths: absolute and relative. Absolute file paths give the full location of a file starting from the “root” directory in a computer system, eg, ~/Desktop/Course_Work/STAT212/ica/maps/us_states_hexgrid.geojson (on Mac, where ~ is shorthand for the full path to the home directory) and C:/Users/lesliemyint/Documents/Course_Work/STAT212/ica/maps/us_states_hexgrid.geojson (on Windows). Relative file paths, on the other hand, start wherever you are right now, the working directory.
When referencing other files, absolute paths are NOT a good idea because if the code file is shared, the path will not work on a different computer. On the other hand, relative paths will still work. Except for resources hosted on the web (eg, https://mac-stat.github.io/data/sfo_weather.csv), always use relative paths.
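As a small illustration of the difference, the sketch below builds a relative path with file.path() and shows the absolute location it currently resolves to. The file name data/my_data.csv is hypothetical, and nothing is read from disk or from the web.

```r
# Relative path to a hypothetical data file, built from folder names only.
relative_path <- file.path("data", "my_data.csv")
print(relative_path)

# R resolves a relative path against the current working directory,
# so the resulting absolute location differs from computer to computer.
resolved <- file.path(getwd(), relative_path)
print(resolved)

# Web resources are the exception: the full URL works the same everywhere.
url <- "https://mac-stat.github.io/data/sfo_weather.csv"
print(url)
```

Only the relative version belongs in shared code: the resolved path contains machine-specific folders, while the relative one depends only on the layout of the project.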
~ vs / vs \
- On a Mac the tilde ~ in a file path refers to the “Home” directory, which is typically a user-specific directory.
- Windows uses both / (forward slash) and \ (backward slash) to separate folders in a file path.
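In R, these symbols can be explored with base functions. The sketch below assumes nothing beyond base R. The Windows-style folder is only an example string, not a path that needs to exist.

```r
# ~ expands to the current user's home directory.
print(path.expand("~"))

# Inside an R string a backslash must be doubled, so forward slashes
# are usually simpler to type and read.
windows_style <- "C:\\Users\\lesliemyint\\Documents"
forward_style <- gsub("\\", "/", windows_style, fixed = TRUE)
print(forward_style)
```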
Note that the working directory when you are working in a code file may be different from the working directory specified in the Console.
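One way to check which folder relative paths start from is getwd(). Running the lines below once inside a code file and once in the Console shows whether the two agree. The file name data/my_data.csv is hypothetical.

```r
# The folder that relative paths currently start from.
current_wd <- getwd()
print(current_wd)

# A relative path only works if the file exists relative to that folder.
data_found <- file.exists(file.path("data", "my_data.csv"))
print(data_found)
```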
Examples
The table below illustrates sample directory setups for data files referenced by code files. Note the relative path options used to reference the data file.
| Data Location | File Structure | Relative Path Options |
|---|---|---|
| Same folder as code file | | |
| Within a subfolder called data | | |
| In a sibling folder called data | | |
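The three data locations can be tried out in a throwaway folder. The sketch below is an illustration under stated assumptions: the layout is built in a temporary directory, my_data.csv is a hypothetical file name, and setting the working directory to the code folder stands in for running a code file stored there.

```r
# Hypothetical layout: project_folder/code and project_folder/data in a temp folder.
project <- file.path(tempdir(), "project_folder")
dir.create(file.path(project, "code", "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), recursive = TRUE, showWarnings = FALSE)

# Put an empty placeholder file in each of the three locations.
invisible(file.create(c(
  file.path(project, "code", "my_data.csv"),          # same folder as the code file
  file.path(project, "code", "data", "my_data.csv"),  # subfolder called data
  file.path(project, "data", "my_data.csv")           # sibling folder called data
)))

# Work from project_folder/code, as a code file stored there would.
old_wd <- setwd(file.path(project, "code"))

found <- file.exists(c(
  "my_data.csv",          # same folder
  "data/my_data.csv",     # subfolder
  "../data/my_data.csv"   # sibling folder: up one level, then down into data
))
print(found)

setwd(old_wd)  # restore the original working directory
```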
./ vs ../
- ./ refers to the current working directory
- ../ refers to the parent directory
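Both shortcuts can be checked with normalizePath(), which converts a relative path into an absolute one. This is a minimal sketch using base R only. The printed folders depend on where the code is run.

```r
# "." is the current working directory; ".." is its parent.
here <- normalizePath(".")
parent <- normalizePath("..")
print(here)
print(parent)

# The parent of the working directory should match "..".
print(parent == normalizePath(dirname(getwd())))
```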
From your_code_file.qmd, you must go “up” from code to its parent folder, project_folder, and then back “down” into the data folder. To go “up” to a parent folder in a relative path we use ../