1 Introduction
🚩 Pre-Class Learnings
To prepare for this lesson, do the following:
- Work through the Preparing for class Instructions document (GDoc)
🧩 Learning Goals
By the end of this lesson, you should be able to:
- Use and evaluate different sources of information to find potential answers
- Distinguish between high-level and low-level understanding
- Determine which source of information is suitable for each level of understanding
- Determine which level of understanding is suitable for this class
- Review data wrangling verbs introduced in COMP/STAT 112
Sources of Information
Let’s practice how to use different sources of information to find potential answers. Use the sources below to find (a) how a data frame in R is defined and (b) how the sources differ in their definitions.
- Search Engines: Google Search
- LLMs: Gemini
- Intro Textbooks: R for Data Science
- Manuals: R Manuals
- Advanced Textbooks: Advanced R
- Documentation: data.frame() function documentation
- Documentation: tibble class documentation
A data frame in R is a named list whose elements all have the same length.
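To check this definition yourself, here is a minimal sketch in base R. The data frame df and its columns name and age are made up for illustration; they are not from any of the sources above.

```r
# A small data frame with two columns (hypothetical example data)
df <- data.frame(
  name = c("Ada", "Grace", "Alan"),
  age  = c(36, 45, 41)
)

typeof(df)    # the underlying storage type is "list"
is.list(df)   # TRUE: a data frame is a list
names(df)     # the list elements are named
lengths(df)   # every element has the same length
class(df)     # the class attribute "data.frame" sits on top of the list

# Because it is a list, list-style access works
df[["age"]]
df$name

# Elements whose lengths cannot be reconciled are rejected
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error: differing number of rows"
)
result
```

The last call fails because the two columns have 3 and 2 elements, so they cannot form a rectangular table.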
Levels of Understanding
There are two levels of understanding:
- high-level (intuitive) understanding necessary to implement common tasks
- low-level (foundational) understanding necessary to tackle new problems and come up with new solutions
In this class, we are going to strive for low-level (foundational) understanding. You should start with a high-level (intuitive) understanding, e.g., using an LLM, and then dig deeper to get the details, e.g., using documentation and more advanced textbooks.
Data Wrangling Verbs
Below are some of the wrangling verbs introduced in COMP/STAT 112. Please spend a few minutes reviewing them.
- mutate(): creates/changes columns/elements in a data frame/tibble
- select(): keeps subset of columns/elements in a data frame/tibble
- filter(): keeps subsets of rows in a data frame/tibble
- arrange(): sorts rows in a data frame/tibble
- group_by(): internally groups rows in data frame/tibble by values in 1 or more columns/elements
- summarize(): collapses/combines information across rows using functions such as n(), sum(), mean(), min(), max(), median(), sd()
- count(): shortcut for group_by() |> summarize(n = n())
- left_join(): mutating join of two data frames/tibbles keeping all rows in left data frame
- full_join(): mutating join of two data frames/tibbles keeping all rows in both data frames
- inner_join(): mutating join of two data frames/tibbles keeping rows in left data frame that find match in right
- semi_join(): filtering join of two data frames/tibbles keeping rows in left data frame that find match in right
- anti_join(): filtering join of two data frames/tibbles keeping rows in left data frame that do not find match in right
- pivot_wider(): rearrange values from two columns to many (one column becomes the names of new variables, one column becomes the values of the new variables)
- pivot_longer(): rearrange values from many columns to two (the names of the columns go to one new variable, the values of the columns go to a second new variable)
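As a quick refresher, the sketch below runs each group of verbs on two tiny tibbles. The tibbles scores and advisors, and all of their columns and values, are hypothetical and exist only for illustration. The code assumes the dplyr and tidyr packages are installed and that your R version supports the native pipe |> (R 4.1 or later).

```r
library(dplyr)
library(tidyr)

# Hypothetical example data
scores <- tibble(
  student = c("Ana", "Ben", "Cal", "Dee"),
  section = c("A", "A", "B", "B"),
  quiz1   = c(8, 6, 9, 7),
  quiz2   = c(10, 7, 5, 9)
)
advisors <- tibble(
  student = c("Ana", "Ben", "Eli"),
  advisor = c("Kim", "Lee", "Kim")
)

# mutate(), filter(), select(), arrange()
top <- scores |>
  mutate(total = quiz1 + quiz2) |>    # add a new column
  filter(total >= 14) |>              # keep rows meeting a condition
  select(student, section, total) |>  # keep a subset of columns
  arrange(desc(total))                # sort rows, largest total first

# group_by() + summarize(), and the count() shortcut
by_section <- scores |>
  group_by(section) |>
  summarize(n = n(), mean_quiz1 = mean(quiz1))
counted <- scores |> count(section)

# Mutating joins add columns from the right table
left  <- left_join(scores, advisors, by = "student")   # all rows of scores
full  <- full_join(scores, advisors, by = "student")   # all rows of both
inner <- inner_join(scores, advisors, by = "student")  # matched rows only

# Filtering joins only keep or drop rows of the left table
semi <- semi_join(scores, advisors, by = "student")    # rows with a match
anti <- anti_join(scores, advisors, by = "student")    # rows without a match

# pivot_longer(): many columns to two; pivot_wider(): two columns to many
long <- scores |>
  pivot_longer(cols = c(quiz1, quiz2), names_to = "quiz", values_to = "score")
wide <- long |>
  pivot_wider(names_from = quiz, values_from = score)
```

Comparing nrow() and names() across left, full, inner, semi, and anti is a simple way to see how the joins differ: the mutating joins gain the advisor column, while the filtering joins keep only the columns of scores.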