Appendix A — File Organization & Paths
File Organization
For small college projects, a well-organized file system might seem like overkill. However, for complex projects with multiple contributors, a clear file system and supporting documentation are crucial. While individual preferences for file organization vary, these essential rules apply to any data science project:
- Minimum Folders. Include a `data` folder and a code folder named, eg, `code`, `scripts`, or `src`.
- Raw Data Integrity. Original data files should remain unaltered and clearly labeled. Ideally, store them in a `raw` subfolder within `data`.
- Data Provenance. Keep the original data source information alongside the data, perhaps, eg, in a `population-source.txt` file next to `population.csv`.
- Efficient Data Processing. If raw data cleaning is time-consuming, use a script, eg, `population-process.qmd`, to process it. Store the processed/cleaned data, eg, `population-processed.csv`, either alongside the raw data or in a dedicated `processed` or `cleaned` subfolder within `data`.
Beyond technical distinctions, folder and directory are interchangeable terms (Wikipedia).
By default, most web browsers save every downloaded file to the `Downloads` folder on your computer. This does not encourage good file organization. Change this setting so that your browser asks where to save each file before downloading it. This Online Tech Tips page has information on how to do this for the most common browsers.
Courses
When starting a new semester and a new course, it is recommended to set up a directory structure such as the one below:
For this class, you will be asked to clone multiple repositories onto your machine. You can store all of them under a parent folder called `comp212` or `stat212` to keep your file system organized. However, make sure not to store any of these repositories inside another one, because this will break the git tracking mechanism. For instructions on how to clone the repositories, see the course main page.
Data Science Projects
Below is one recommended directory structure when working on any data science project:
Documents
└─ project_name                      ← should be short but descriptive
   ├─ code                           ← all code files, eg, .R, .Rmd, .qmd go here
   │  ├─ wa                          ← a work area to try code
   │  │  ├─ viz.qmd                  ← trying different visualizations
   │  │  └─ model.qmd                ← trying different models
   │  ├─ viz.qmd                     ← final visualizations
   │  └─ model.qmd                   ← final models
   │
   ├─ data                           ← all data files go here
   │  ├─ raw                         ← original data files go here
   │  │  ├─ population.csv           ← original data
   │  │  └─ population-source.txt    ← information about the data source
   │  │
   │  └─ processed
   │     ├─ population-cleaned.csv   ← processed/cleaned data
   │     └─ population-clean.qmd     ← script used to clean the data
   │
   └─ results
      ├─ report.qmd                  ← written narrative
      │
      ├─ figures                     ← plots produced by ggsave()
      │  ├─ plot1.png
      │  ├─ plot2.tiff
      │  └─ plot3.pdf
      │
      ├─ tables                      ← for tabular results, eg, .csv
      │  ├─ results1.csv
      │  └─ results2.csv
      │
      └─ interactive                 ← for interactive shiny apps
         ├─ app.R
         ├─ ...
         └─ ...
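A folder skeleton like the one above can be created from R instead of by hand. The following is a minimal sketch using only base R; the helper name `create_project()` and its arguments are hypothetical, and the demo writes to a temporary directory rather than to `Documents`.

```r
# Hypothetical helper: create the folder skeleton for a new project.
create_project <- function(project_name, parent = ".") {
  root <- file.path(parent, project_name)
  folders <- c(
    "code/wa",
    "data/raw",
    "data/processed",
    "results/figures",
    "results/tables",
    "results/interactive"
  )
  for (folder in folders) {
    # recursive = TRUE also creates missing parents such as code/ and data/
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  invisible(root)
}

# Try it in a temporary location so no real folders are touched
demo_root <- create_project("project_name", parent = tempdir())
print(list.dirs(demo_root, full.names = FALSE, recursive = TRUE))
```

To use it for a real project, `parent` would point at the folder where the project should live.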
For this class, you will be provided with a structured file system that you will be asked to clone via a GitHub Classroom invite link. For instructions on how to clone the repositories, see the course main page.
File Paths
When you read in data from a source on your computer, you need to specify the file path correctly. A file path is the address of where a file is stored on your computer.
Example: Physical Address
Consider `1600 Grand Ave, St. Paul, MN 55105`. Think about how different parts of the address give increasingly more specific information about the location.
- `St. Paul, MN 55105` tells us the city and smaller region within the city
- `Grand Ave` tells us the street
- `1600` tells us the specific location on the street.
Example: File Address
Consider code that reads a data file stored at the path `~/Desktop/112/data/my_data.csv`.
In this example, the file path gives more and more specific information about the location as we read it from left to right. In particular,
- `~` on an Apple computer tells us that we are looking in the user’s home directory.
- `Desktop` tells us to go to the Desktop within that home directory.
- `112` tells us that we are looking in the 112 folder on the Desktop.
- `data` tells us to next go in the data folder in the 112 folder.
- `my_data.csv` tells us that we are looking for a file called my_data.csv located within the data folder.
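The left-to-right reading of a path can be checked in R. This is a small sketch with base R functions only; the path is the one described in the list above, and the variable names are illustrative.

```r
path <- "~/Desktop/112/data/my_data.csv"

# Split the path at each forward slash to see its parts, left to right
parts <- strsplit(path, "/", fixed = TRUE)[[1]]
print(parts)
# "~" "Desktop" "112" "data" "my_data.csv"

# basename() keeps only the last part, the file name
file_name <- basename(path)
print(file_name)
# "my_data.csv"

# path.expand() replaces ~ with the home directory of the current user,
# so the result differs from one computer to the next
print(path.expand("~"))
```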
Types
There are two types of paths: absolute and relative. Absolute file paths start at the “root” directory in a computer system, eg, `~/Desktop/Course_Work/STAT212/ica/maps/us_states_hexgrid.geojson` (on Mac) and `C:/Users/lesliemyint/Documents/Course_Work/STAT212/ica/maps/us_states_hexgrid.geojson` (on Windows). Relative file paths, on the other hand, start wherever you are right now, the working directory.
When referencing other files, absolute paths are NOT a good idea because if the code file is shared, the path will not work on a different computer. On the other hand, relative paths will still work. Except for resources hosted on the web (eg, `https://mac-stat.github.io/data/sfo_weather.csv`), always use relative paths.
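To see why relative paths travel well, the sketch below sets up two throwaway folders that stand in for two different computers (the names `laptop_a` and `laptop_b` are made up for illustration). The same relative path works in both, while the absolute path it resolves to differs.

```r
# Hypothetical helper: a folder holding its own copy of data/my_data.csv
make_copy <- function(name) {
  folder <- file.path(tempdir(), name)
  dir.create(file.path(folder, "data"), recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(x = 1:3), file.path(folder, "data", "my_data.csv"),
            row.names = FALSE)
  folder
}
laptop_a <- make_copy("laptop_a")
laptop_b <- make_copy("laptop_b")

relative_path <- "data/my_data.csv"

# setwd() returns the previous working directory, saved here to restore later
old_wd <- setwd(laptop_a)
print(getwd())                             # the current working directory
path_a <- normalizePath(relative_path)     # absolute path on "laptop A"
rows_a <- nrow(read.csv(relative_path))

setwd(laptop_b)
path_b <- normalizePath(relative_path)     # a different absolute path
rows_b <- nrow(read.csv(relative_path))    # same code, still works

setwd(old_wd)                              # put the working directory back
print(c(path_a, path_b))
```

In a real project you would rarely call `setwd()` yourself; it is used here only to imitate moving the project to another machine.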
`~` vs `/` vs `\`
- On a Mac the tilde `~` in a file path refers to the “Home” directory, which is typically a user-specific directory.
- Windows uses both `/` (forward slash) and `\` (backward slash) to separate folders in a file path.
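One practical consequence in R is that a backslash inside a string starts an escape sequence, so a Windows-style path has to be written with doubled backslashes. The sketch below, using base R only and illustrative variable names, shows this and how `file.path()` builds a path with forward slashes.

```r
# file.path() joins folder names with forward slashes
p <- file.path("data", "raw", "population.csv")
print(p)
# "data/raw/population.csv"

# A backslash must be doubled inside an R string
windows_style <- "data\\raw\\population.csv"
writeLines(windows_style)   # shows the string as it would be typed in a path

# Convert backslashes to forward slashes
converted <- gsub("\\", "/", windows_style, fixed = TRUE)
print(converted)
# "data/raw/population.csv"
```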
Note that the working directory when you are working in a code file may be different from the working directory specified in the Console.
Examples
The table below illustrates sample directory setups for data files referenced by code files. Note the relative path options used to reference the data file.
Data Location | File Structure | Relative Path Options |
---|---|---|
Same folder as code file | | |
Within a subfolder called data | | |
In a sibling folder called data | | |
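The three data locations named in the table can be tried out in R. The sketch below builds a throwaway `project_folder` in a temporary directory and assumes the code file lives in its `code` folder; the data file names (`same_folder.csv`, `subfolder.csv`, `sibling.csv`) are hypothetical and chosen only to label each case.

```r
project  <- file.path(tempdir(), "project_folder")
code_dir <- file.path(project, "code")
dir.create(file.path(code_dir, "data"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)

toy <- data.frame(x = 1:3)
write.csv(toy, file.path(code_dir, "same_folder.csv"), row.names = FALSE)
write.csv(toy, file.path(code_dir, "data", "subfolder.csv"), row.names = FALSE)
write.csv(toy, file.path(project, "data", "sibling.csv"), row.names = FALSE)

# Pretend we are working from the folder that holds the code file
old_wd <- setwd(code_dir)
found <- c(
  same_folder     = file.exists("same_folder.csv"),
  same_folder_dot = file.exists("./same_folder.csv"),
  subfolder       = file.exists("data/subfolder.csv"),
  subfolder_dot   = file.exists("./data/subfolder.csv"),
  sibling         = file.exists("../data/sibling.csv")
)
setwd(old_wd)   # restore the original working directory

print(found)    # every entry should be TRUE
```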
`./` vs `../`
- `./` refers to the current working directory
- `../` refers to the parent directory
From `your_code_file.qmd`, you must go “up” from `code` to its parent folder, `project_folder`, and then back “down” into the `data` folder. To go “up” to a parent folder in a relative path we use `../`.