Description
Goal
The course project is an opportunity to:
- Collaborate with peers to demonstrate mastery of course goals by crafting a complete data story from research questions and datasets of your choosing.
- Seek out a challenge, e.g., explore new territory or delve deeper into a concept you’ve already encountered.
Expectations
Below are a few expectations for your project:
- Professionalism. You are expected to produce a professional and sophisticated data-based report of comparable quality to New York Times and FiveThirtyEight reports. Below are some examples:
- Readability. To be accessible to a large audience, especially individuals with limited data literacy, your data story should be written in plain language.
- Data. To build your data wrangling skills, the datasets should be raw and come from a minimum of two different sources. Data hosted on Kaggle and similar sites are usually not raw, i.e., they have already been processed. You are encouraged to use data acquisition tools such as APIs and web scraping to obtain one of your datasets.
- Effectiveness. All visualizations must be effective.
- Code Styling. To ensure readability and maintainability, your code base should follow the COMP/STAT 112/212: Code Styling Guide.
- Self-Contained. All project files, such as code, data, presentation, and video, should be stored in the team project repository, and relative file paths should be used. If a file cannot be stored on GitHub, then you should link to it using an absolute file path.
- Availability. Your data story should be published online on GitHub Pages using the procedure described in class. All links to external resources should be active, i.e., clickable.
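As a rough illustration of the Self-Contained expectation, the sketch below builds a relative path in R instead of hard-coding an absolute one. The folder and file names (`data/raw/trips.csv`) are hypothetical placeholders, not part of the project template. Keep in mind that Quarto, by default, resolves relative paths from the folder that contains the .qmd file being rendered.

```r
# Hypothetical layout (an assumption, not the course template):
#   <project>/data/raw/trips.csv

# Relative path: works on every teammate's machine after cloning the repo.
data_path <- file.path("data", "raw", "trips.csv")
print(data_path)

# Absolute paths such as "C:/Users/me/project/data/raw/trips.csv" break
# as soon as the repository is cloned somewhere else, so avoid them.

# Read the file only if it exists, so this sketch runs anywhere.
if (file.exists(data_path)) {
  trips <- read.csv(data_path)
  str(trips)
} else {
  message("No file at ", data_path, "; adjust the placeholder names.")
}
```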
Team Size
You are expected to work in a team of 3 students (or 2 if a team of 3 is not possible).
Domain Area
Each team is free to select any domain area for their project, ranging from agriculture and biology to education and e-sport.
Report Structure
Your report should include the following sections, all within context.
Motivation
- What is the story behind the project?
- Why were you interested in the project?
- What was your motivation?
- Why is the project important?
Research Question
- What research question(s) are you trying to answer?
- What are the key goals of the project?
- What ideas do you most want to communicate?
Background
- What background information is necessary for the target audience to understand?
- What assumptions, terms, and/or acronyms need to be clarified?
Data
- Data collection
  - What was collected?
  - When was it collected?
  - Why was it collected?
  - How was it collected originally?
  - Who collected it?
- Data acquisition
  - Where / how did you get the data?
  - What is the source?
- Data understanding
  - How much data do you have?
  - What types of measurements?
  - Anything you needed to clean before getting started?
Data Insights
It’s your job to explicitly identify and discuss key insights. Don’t simply present the audience with some code and output and expect them to do that work. Specifically address the following questions:
- What are the important takeaways from the data? What was interesting?
- Why do these takeaways matter?
- Was there anything surprising?
- Overall, what do you want the audience to walk away with? What do you want them to understand about your data and research questions?
Conclusions / Big Picture
- How do the insights connect to answer your research question(s)?
Limitations and Future Work
- What limitations or weaknesses does your data / analysis suffer from?
- What improvements might someone make to your analysis?
Due Dates
Check the class schedule linked on the main page.
Where to Submit
Most deliverables are submitted via the project GitHub Classroom assignment.
The project is a group (not individual) GitHub Classroom assignment, meaning that all team members share the same (online) GitHub repository. As a result, conflicts will arise whenever multiple members edit the same files and try to push their changes to GitHub. To avoid conflicts, each team member should use a different Quarto file for their own analysis; a few sample files have already been created. Afterward, the team members should come together and decide which pieces of each member’s analysis should go into the final report.
I do not expect the GitHub collaboration process to go smoothly. If you encounter an issue that you cannot resolve on your own after a short attempt, let the instructor know as soon as possible to avoid delaying your data analysis.
Milestones
Milestone 0.0: Brainstorming Project Ideas
At their tables, students brainstorm project ideas and add them to the Project Notebook. After that, students categorize the brainstormed ideas and assign themselves to categories of interest.
Milestone 0.1: Team Formation
Students review the categories of brainstormed ideas in the Project Notebook to determine which one interests them the most, then reach out to classmates who share the same interest about forming a team. Upon agreeing, the team should add basic information about their team and project to the corresponding sheet in the Project Notebook.
Milestone 1: Proposal + Dataset
Each team should submit a proposal for their project in the form of a rendered HTML Quarto page added as an appendix to their project website. The proposal should include the following:
- a title
- the names of the team members
- a short description of the project
- 2-3 broad research questions the team is interested in exploring
- the reasons/inspirations behind choosing this project
- one dataset that can be used to answer the research questions
- a rough implementation and responsibility plan, i.e., what tasks need to be accomplished, who will do each, and when. Think about the list of deliverables when building your list of tasks. The plan should be added as GitHub Issues within the project repository; see the callout box below for details.
To track your project progress, use the GitHub Issues feature by creating a set of milestones, each containing tasks (issues in GitHub terminology). Below are the instructions for how to add a task to a milestone.
- Go to your project GitHub repository on the web
- Select the Issues tab
- Click Milestones -> click New milestone -> type a title, pick a due date, and type a description, then hit Create milestone
- Click New issue -> type a title, add a description, assign members (from the right list), and hit Create
Repeat the steps above to add new milestones and new tasks.
Milestone 2: Effective Teamwork
As individuals, not as a team, each member should read an article, listen to a podcast, or watch a video about effective teamwork, then summarize it in the corresponding sheet in their progress tracker.
Milestone 3: Case Study
As a team, members should study one of the professional reports linked in the Expectations section (or something of comparable quality), then record their observations about its data story structure, i.e., how the report unfolds the story. The analysis should be added to the appx/case-study.qmd file. Make sure to cite the referenced resources.
Milestone 4: EDA
As individuals, not as a team, each member should perform an exploratory data analysis (EDA) on one of the project datasets and add the results to their respective Quarto file under the eda folder of the project repository.
In a blog post, Michael Clark demos eight R packages that can help explore data quickly and effectively. The demoed packages are: arsenal, DataExplorer, dataReporter (dataMaid previously), gtsummary, janitor, SmartEDA, summarytools, and visdat.
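Before trying those packages, a first pass in base R can already answer basic questions such as how much data there is, what types the variables have, and what needs cleaning. The sketch below is only an illustration: it uses the built-in `airquality` dataset as a stand-in for a project dataset, and the variable names `df` and `missing_per_column` are arbitrary.

```r
# `airquality` ships with base R (datasets package); it stands in here
# for one of your own project datasets.
df <- airquality

dim(df)        # number of rows and columns
str(df)        # type of each variable
summary(df)    # per-variable summaries, including NA counts

# Count missing values per column to see what may need cleaning.
missing_per_column <- colSums(is.na(df))
print(missing_per_column)
```

Columns with many missing values are natural candidates to discuss in the Data and Limitations sections of the report.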
Milestone 5: Preliminary Findings Presentation
As a team, members should present their preliminary findings to the class to solicit feedback.
Milestone 6: Data Story Presentation
Before recording the final presentation as described below, teams should present their complete data story to the class to solicit feedback.
Milestone 7: Deliverables + Evaluation
Code
Each team should push the code of their project to GitHub. Each repository should include a README.md file that answers the following questions:
- What is this GitHub repository all about?
- What software (with version numbers) needs to be installed to run the code contained in this GitHub repository, e.g., R version 4.4.3+ and RStudio 2024.12.1 Build 563?
- What steps need to be taken to run the code contained in this GitHub repository? Think about the steps you did at the beginning of the semester to prepare your machine for class.
- What does the expected output look like? You can use one or more screenshots of the main features.
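To collect the version numbers for the README, something like the following sketch may help. It relies only on base R; the package names in `pkgs` are placeholders that ship with R and should be replaced with the packages your analysis actually loads.

```r
# Version of R itself, e.g. "R version 4.4.3 (...)".
r_version <- R.version.string
cat(r_version, "\n")

# Versions of the packages your analysis loads. The names below are
# placeholders that ship with R; replace them with your own packages.
pkgs <- c("stats", "utils")
pkg_versions <- vapply(
  pkgs,
  function(p) as.character(packageVersion(p)),
  character(1)
)
print(pkg_versions)
```

Calling `sessionInfo()` after loading your packages gives a fuller picture, including the operating system.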
The awesome README GitHub page lists examples of GitHub repositories with well-structured README files. Please check some of them for inspiration; the Aimeos TYPO3 extension project repository is a good example.
Data Story Report
The data story report should be the rendered version of the landing page of the project website, i.e., the file named index.qmd. When rendering the data story report, make sure to:
- Hide the code chunks
- Hide the warning and error messages
- Avoid printing any unnecessary output
- Format printed tables, if any, nicely. There are many packages such as gt and kableExtra that can help achieve this.
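The rendering checklist above could be approached with knitr chunk options. The following is a minimal sketch, assuming the report uses the knitr engine (R code chunks). `knitr::opts_chunk$set()` and `knitr::kable()` are existing knitr functions, while `mtcars` is used only as example data. Quarto's document-level `execute` options in the YAML header (for example `echo: false` and `warning: false`) are an alternative way to get a similar effect.

```r
# Place in a setup chunk near the top of index.qmd (knitr engine assumed).
knitr::opts_chunk$set(
  echo    = FALSE,  # hide the code chunks
  warning = FALSE,  # hide warnings
  message = FALSE   # hide messages, e.g. package start-up notes
)

# A plain but tidy table; gt or kableExtra can style tables further.
tbl <- knitr::kable(
  head(mtcars[, 1:3]),
  digits  = 1,
  caption = "First rows of mtcars (example data)"
)
print(tbl)
```

These options affect only chunks that run after the call, so the setup chunk itself still needs `#| include: false` to stay out of the rendered page.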
See the Report Structure section for details about what to include in the data story report.
Presentation Video
After soliciting feedback from peers, teams are expected to adjust their data story report accordingly and then record a video presentation (5 minutes max). The video should explain the main findings of the data story and point interested viewers to the full report for more information. The presentation video should be uploaded to the project GitHub repository (if possible) or to a video streaming service such as YouTube or Loom, then embedded in the report.
See this Quarto webpage for how to embed videos.
Presentation Slides
The presentation slides used in the presentation video (or an altered version) should be embedded in the report, preferably after uploading them to the project GitHub repository.
- See this Quarto webpage for how to embed figures; this also works for PDF files.
- See this Stack Overflow page for how to embed a webpage.
- See this Google Help page for how to embed a Google presentation.
Evaluation and Reflection
Each team member will be asked to evaluate and reflect on their own performance, as well as the performance of each of their teammates. Additionally, each student will be asked to evaluate the projects of other teams.
Important Notes
Teamwork
Working in a team is a chance to improve your communication skills and get to know your teammates better. However, working in a team sometimes poses challenges. To ensure a successful project outcome, below are a few expectations:
- Active participation: be present, attend classes, do your work, and keep your team informed about any unexpected events
- Active listening: show interest in other team members’ ideas
- Inclusive environment: invite teammates to participate
- Each member should be involved in all main aspects of the project, including coding, reporting, and presentation. It is not acceptable for one member to handle only the coding while another focuses only on the report and another only on the presentation.
Code Backup
When working on your data analysis, ensure that you commit your code frequently to GitHub, accompanied by meaningful commit messages, and push your changes regularly. This practice helps prevent any unforeseen issues or loss of progress.
Inspirations
For inspirational data science projects, see the Data Engineering Zoomcamp lists of projects for 2022 (cohort #1 and cohort #2) and 2023 (cohort #1 and cohort #2).