Learning Goals

This course will help you build confidence in carrying out the entire data science pipeline.

Below are the general skills that will be targeted and the specific topics that will be covered.

General Skills

The general skills targeted by the course are:

Data Communication

In written and oral formats, explain and justify (a) the data cleaning and analysis process and (b) the resulting conclusions, using clear, organized, logical, and compelling details that are adapted to the background, values, and motivations of the audience and to the context in which the communication occurs.

Collaborative Learning

  • Understand and demonstrate characteristics of effective collaboration (team roles, interpersonal communication, self-reflection, awareness of social dynamics, advocating for self and others).
  • Develop a common purpose and agreement on goals.
  • Be able to contribute questions or concerns in a respectful way.
  • Share and contribute to the group’s learning in an equitable manner.
  • Develop familiarity and comfort with collaboration tools such as Git and GitHub.

Course Topics

The specific topics covered by the course are listed below. Note that the topics are listed in the order of the data science pipeline, not the order in which we will cover them in class. Note also that the material covered under each topic might be adjusted.

Foundation

Intro to R, RStudio, R Markdown, and Quarto

  • Download and install the necessary tools (R, RStudio, Git, GitHub Desktop)
  • Develop comfort in navigating the tools in RStudio
  • Develop comfort in writing and rendering Quarto documents/websites
  • Identify the characteristics of tidy data
  • Use R code as a calculator and to explore tidy data

Data Visualization

Introduction to Data Visualization

  • Convince ourselves of the importance of data viz
  • Understand the “grammar of graphics”
  • Use ggplot2 functions to create data viz

Univariate

  • Understand the different basic univariate visualizations for categorical and quantitative variables

Bivariate

  • Identify appropriate types of bivariate visualizations to visualize relationships between 2 variables, depending on the type of variables (categorical, quantitative)
  • Create basic bivariate visualizations based on real data with ggplot2 functions

Multivariate

  • Understand how to visualize relationships between more than 2 variables.
  • Add aesthetics such as color and size to incorporate a third variable (or more) into a bivariate plot with ggplot2 functions
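
As a rough illustration of adding variables through aesthetics, the sketch below maps two extra variables to color and size on a scatterplot. It assumes the built-in mtcars data set, which is chosen here only for convenience and is not necessarily the data used in class.

```r
library(ggplot2)

# mtcars ships with R; it is used here only for illustration
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), size = hp)) +
  geom_point(alpha = 0.7) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders", size = "Horsepower")

p  # print the plot
```

Here x and y carry the bivariate relationship, while color (a categorical variable) and size (a quantitative variable) add the third and fourth variables.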

Spatial

  • Plot data points on top of a map using ggplot()
  • Create choropleth maps using geom_map()
  • Understand the basics of creating a map using leaflet, including adding points and choropleths to a base map.

Effective Visualization

  • Understand and apply the guiding principles of effective visualizations

Data Wrangling

Wrangling

  • Understand and use the verbs select, mutate, filter, arrange, summarize, and group_by appropriately
  • Develop an understanding of what code will do conceptually without running it
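
The six verbs can be chained into a single pipeline. The following is a minimal sketch, assuming the built-in mtcars data set; the column name kpl and the cutoff of 4 are made up for illustration. The comments describe what each step does, which is the kind of reading-without-running this topic aims for.

```r
library(dplyr)

# mtcars ships with R; it is used here only for illustration
summary_by_cyl <- mtcars %>%
  select(mpg, cyl, wt) %>%                      # keep three columns
  filter(wt < 4) %>%                            # keep cars under 4000 lbs
  mutate(kpl = mpg * 0.425) %>%                 # add km per litre (approximate)
  group_by(cyl) %>%                             # one group per cylinder count
  summarize(mean_kpl = mean(kpl), n = n()) %>%  # one row per group
  arrange(desc(mean_kpl))                       # highest efficiency first

summary_by_cyl
```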

Dates

  • Develop an understanding of working with dates and lubridate functions
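
A few lubridate functions give a flavor of this topic. This is only a small sketch with a made-up date, not a complete tour of the package.

```r
library(lubridate)

d <- ymd("2024-02-28")    # parse a year-month-day string into a Date
next_day <- d + days(2)   # 2024 is a leap year, so this lands on March 1
month(next_day)           # month as a number
wday(d, label = TRUE)     # day of the week as a labeled factor
```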

Reshaping

  • Understand the difference between wide and long data format and distinguish the cases (units of observation) for a given data set under each format.
  • Be able to use pivot_wider and pivot_longer from the tidyr package
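
The wide/long distinction can be sketched with a tiny hypothetical data set of exam scores (the column names student, exam1, and exam2 are invented for this example). In the wide format a case is a student; in the long format a case is a student-exam pair.

```r
library(tidyr)

# Hypothetical wide data: one row per student, one column per exam
wide <- data.frame(
  student = c("A", "B"),
  exam1 = c(90, 75),
  exam2 = c(85, 80)
)

# Long format: one row per student-exam pair
long <- pivot_longer(wide, cols = c(exam1, exam2),
                     names_to = "exam", values_to = "score")

# Back to wide format: one row per student again
wide_again <- pivot_wider(long, names_from = exam, values_from = score)
```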

Joining

  • Understand the concept of variables that uniquely identify rows (aka cases or units of observation)
  • Understand the different types of joins, i.e., the different ways of combining two data frames
  • Be able to use mutating joins: left_join, inner_join and full_join from the dplyr package
  • Be able to use filtering joins: semi_join, anti_join from the dplyr package
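
The difference between the join types is easiest to see on two small data frames that share only some key values. The sketch below uses hypothetical students and grades tables keyed by id.

```r
library(dplyr)

# Hypothetical tables; id uniquely identifies rows in each
students <- data.frame(id = c(1, 2, 3), name = c("Ana", "Ben", "Cy"))
grades <- data.frame(id = c(2, 3, 4), grade = c("A", "B", "C"))

# Mutating joins add columns from grades to students
left_join(students, grades, by = "id")   # all students; grade is NA for id 1
inner_join(students, grades, by = "id")  # only ids present in both: 2 and 3
full_join(students, grades, by = "id")   # all ids from either table: 1 to 4

# Filtering joins keep or drop rows of students without adding columns
semi_join(students, grades, by = "id")   # students that have a grade
anti_join(students, grades, by = "id")   # students without a grade: id 1
```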

Factors

  • Understand the difference between a variable stored as a character vs. as a factor
  • Be able to convert a character variable to a factor
  • Be able to manipulate the order and values of a factor with the forcats package to improve summaries and visualizations.
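
As a small sketch of these ideas, the example below converts a made-up character vector to a factor and then changes the order and the values of its levels with forcats.

```r
library(forcats)

sizes <- c("small", "large", "medium", "small")  # a character vector
f <- factor(sizes)    # convert to a factor; levels default to alphabetical order
levels(f)

f_ordered <- fct_relevel(f, "small", "medium", "large")  # set the order by hand
f_by_freq <- fct_infreq(f)                               # most frequent level first
f_renamed <- fct_recode(f, S = "small", M = "medium", L = "large")  # change values
```

Level order matters because summaries and ggplot2 axes and legends follow it.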

Strings

  • Be able to work with strings of text data
  • Use regular expressions to search and replace, detect patterns, locate patterns, extract patterns, and separate text with the stringr package.
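
The stringr tasks named above each correspond to a function. The following is a minimal sketch on a made-up character vector; the pattern \\d+ matches one or more digits.

```r
library(stringr)

x <- c("apple 12", "banana 7", "cherry")

str_detect(x, "\\d+")        # detect: does each string contain digits?
str_extract(x, "\\d+")       # extract: first run of digits, NA if none
str_replace(x, "\\d+", "#")  # search and replace the first run of digits
str_locate(x, "an")          # locate: start and end of the first match
str_split(x, " ")            # separate each string at the space
```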

Starting a DS Project

Import

  • Find existing data sets
  • Save data sets locally
  • Load data into RStudio
  • Do some preliminary data checking and cleaning steps before further wrangling and visualization:
    • Make sure variables are properly formatted
    • Deal with missing values

Exploratory Data Analysis (EDA)

  • Understand the first steps that should be taken when you encounter a new data set
  • Develop comfort in knowing how to explore data to understand it
  • Develop comfort in formulating research questions