1 Introduction

🚩 Pre-Class Learnings

To prepare for this lesson, do the following:

Work through the Preparing for Class instructions document (GDoc)

🧩 Learning Goals

By the end of this lesson, you should be able to:

Use and evaluate different sources of information to find potential answers
Distinguish between high-level and low-level understanding
Determine which source of information is suitable for each level of understanding
Determine which level of understanding is suitable for this class
Review data wrangling verbs introduced in COMP/STAT 112

Sources of Information

Let’s practice using different sources of information to find potential answers. Use the sources below to find (a) how a data frame in R is defined and (b) how the sources differ in their definitions.

Search Engines: Google Search
LLMs: Gemini
Intro Textbooks: R for Data Science
Manuals: R Manuals
Advanced Textbooks: Advanced R
Documentation: data.frame() function documentation
Documentation: tibble class documentation

Our Definition of Data Frame in R

A data frame in R is a named list whose elements all have the same length.
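
The definition above can be checked directly in base R. This is a minimal sketch; the data frame `df` and its columns `x` and `y` are made up for illustration and do not come from the course materials.

```r
# Hypothetical example data frame (names and values invented for illustration).
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

typeof(df)    # "list": a data frame is stored as a list
names(df)     # "x" "y": the list elements (columns) are named
lengths(df)   # each element has the same length (3)
class(df)     # "data.frame": the class attribute layered on top of the list

# Because a data frame is a list, list tools work on it directly.
is.list(df)   # TRUE
df[["y"]]     # extracts one element (column) as a vector
```

Note that `typeof()` reports how the object is stored, while `class()` reports how R treats it, which is why the two answers differ.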

Levels of Understanding

There are two levels of understanding:

high-level (intuitive) understanding necessary to implement common tasks
low-level (foundational) understanding necessary to tackle new problems and come up with new solutions

In this class, we are going to strive for low-level (foundational) understanding. You should start with a high-level (intuitive) understanding, e.g., by using an LLM, and then dig deeper to get the details, e.g., by using documentation and more advanced textbooks.

Data Wrangling Verbs

Below are some of the wrangling verbs introduced in COMP/STAT 112. Please spend a few minutes reviewing them.

mutate(): creates/changes columns/elements in a data frame/tibble
select(): keeps a subset of columns/elements in a data frame/tibble
filter(): keeps a subset of rows in a data frame/tibble
arrange(): sorts rows in a data frame/tibble
group_by(): internally groups rows in a data frame/tibble by the values in one or more columns/elements
summarize(): collapses/combines information across rows using functions such as n(), sum(), mean(), min(), max(), median(), sd()
count(): shortcut for group_by() |> summarize(n = n())
left_join(): mutating join of two data frames/tibbles, keeping all rows in the left data frame
full_join(): mutating join of two data frames/tibbles, keeping all rows in both data frames
inner_join(): mutating join of two data frames/tibbles, keeping rows in the left data frame that have a match in the right
semi_join(): filtering join of two data frames/tibbles, keeping rows in the left data frame that have a match in the right
anti_join(): filtering join of two data frames/tibbles, keeping rows in the left data frame that do not have a match in the right
pivot_wider(): rearranges values from two columns into many (one column becomes the names of the new variables, one column becomes the values of the new variables)
pivot_longer(): rearranges values from many columns into two (the names of the columns go to one new variable, the values of the columns go to a second new variable)
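
As a quick refresher, the verbs above can be tried on a small example. This is only a sketch: the tibbles `scores` and `majors` and all of their values are hypothetical, invented for illustration, and the code assumes the dplyr and tidyr packages are installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data, invented for illustration only.
scores <- tibble(
  student = c("Ana", "Ana", "Ben", "Ben", "Cy"),
  exam    = c("midterm", "final", "midterm", "final", "midterm"),
  score   = c(80, 90, 70, 75, 85)
)
majors <- tibble(
  student = c("Ana", "Ben", "Dee"),
  major   = c("STAT", "COMP", "MATH")
)

# Row and column verbs chained with the native pipe
midterms <- scores |>
  filter(exam == "midterm") |>    # keep only the midterm rows
  mutate(pct = score / 100) |>    # create a new column
  select(student, pct) |>         # keep two columns
  arrange(desc(pct))              # sort from highest to lowest

# Grouped summary, and the count() shortcut for the row counts
by_student <- scores |>
  group_by(student) |>
  summarize(n = n(), mean_score = mean(score))
counts <- scores |> count(student)

# Mutating joins add columns from the right data frame
left  <- left_join(by_student, majors, by = "student")   # Ana, Ben, Cy (Cy's major is NA)
inner <- inner_join(by_student, majors, by = "student")  # Ana, Ben
full  <- full_join(by_student, majors, by = "student")   # Ana, Ben, Cy, Dee

# Filtering joins only keep or drop rows of the left data frame
semi <- semi_join(by_student, majors, by = "student")    # Ana, Ben (no major column added)
anti <- anti_join(by_student, majors, by = "student")    # Cy

# Pivoting: one row per student, then back to one row per student-exam pair
wide <- scores |>
  pivot_wider(names_from = exam, values_from = score)
long <- wide |>
  pivot_longer(cols = c(midterm, final), names_to = "exam", values_to = "score")
```

Comparing `inner` with `semi` shows the difference between mutating and filtering joins: both keep the same rows here, but only `inner` gains the `major` column.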
