Description
Goals
- The course project offers you the opportunity to showcase your mastery of the course learning goals, encompassing both soft and hard skills, by collaborating with other students to tell a data story by formulating and answering a research question of your choosing based on a suitable dataset.
- The course project is a chance to challenge yourself and work with minimal guidance to explore new territories such as interactive plots, text analysis, and web apps.
Expectations
Below are a few expectations for your project:
- Professionalism: you are expected to produce a professional data-based report comparable in quality to New York Times and FiveThirtyEight reports. Below are some examples:
- Accessible to individuals with little data literacy and written in plain language
- All applicable principles of effective visualization discussed in class should be followed.
- The COMP112: Code Styling Guidelines should be followed to ensure a more readable and maintainable code base.
- Self-contained, i.e., all project files such as code, data, presentation, and video are (downloaded if possible and) stored in the team project repository shared by the instructor, and relative file paths are used. If the files cannot be downloaded, then absolute file paths should be used instead.
- Published online on GitHub Pages using the procedure used in class.
- Links to external resources should be active, i.e., clickable.
Team Size
You are expected to work in a team of 2-3 students.
Domain Area
Each team is free to select any domain area for their project, ranging from agriculture and biology to education and e-sports. The sole requirement is that the data analysis must be sophisticated enough to support a meaningful discussion with someone, e.g., a prospective employer.
Structure
Your data story should include the following sections, all within context.
Motivation
- What is the story behind the project?
- Why were you interested in the project?
- What was your motivation?
- Why is the project important?
Research Question
- What research question(s) are you trying to answer?
- What are the key goals of the project?
- What ideas do you most want to communicate?
Background
- What background information is necessary for the target audience to understand?
- What assumptions, terms, and/or acronyms need to be clarified?
Data
- Data collection
- What was collected?
- When was it collected?
- Why was it collected?
- How was it collected originally?
- Who collected it?
- Data acquisition
- Where / how did you get the data?
- What is the source?
- Data understanding
- How much data do you have?
- What types of measurements does it contain?
- Was there anything you needed to clean before getting started?
Data Insights
It’s your job to explicitly identify and discuss key insights. Don’t simply present the audience with some code and output and expect them to do that work. Specifically address the following questions:
- What are the important takeaways from the data? What was interesting?
- Why do these takeaways matter?
- Was there anything surprising?
- Overall, what do you want the audience to walk away with? What do you want them to understand about your data and research questions?
Conclusions / Big Picture
- How do the insights connect to answer your research question(s)?
Limitations and Future Work
- What limitations or weaknesses does your data / analysis suffer from?
- What improvements might someone make to your analysis?
Due Dates
Check the class Moodle page.
Where to Submit
Most project-related deliverables are submitted via the provided GitHub Classroom Assignment linked in Moodle.
The project is a group (not individual) GitHub Classroom Assignment, meaning that all your team members will share the same (online) GitHub repository. This means that conflicts will arise whenever multiple members edit the same files and try to push their changes to GitHub. To avoid conflicts, each team member should create a separate Quarto file for their own analysis. Afterward, the team members should come together and decide which pieces from each member’s analysis should go into the final report.
I am not expecting the GitHub collaboration process to go smoothly. So, if you encounter an issue and cannot resolve it on your own after a short period of trying, let the instructor know as soon as possible to avoid any delay in your data analysis task.
Deliverables
Proposal
Each team should submit a proposal for their project in the form of a rendered HTML Quarto page added as an appendix to their project website. The proposal should include the following:
- a title
- the names of the team members
- a short description of the project
- the reasons/inspirations behind choosing this project
- a rough implementation and responsibility plan, i.e., what needs to be accomplished and who will do what and when. Think about the list of deliverables when building the plan. The plan should be presented in a table format.
Case Study
All the members, as a team, should study one of the professional reports linked in the Expectations section (or something of comparable quality) and then record their observations about its data story structure, i.e., how the report unfolds the story, in a rendered HTML Quarto page added as an appendix to their project website. Remember to cite the referenced resource.
Effective Teamwork
Each member of the team should read an article, listen to a podcast, or watch a video about effective teamwork and then summarize it in a Quarto file added to the project website under a section called “Effective Teamwork”. Remember to cite the referenced resource. This is individual work, not teamwork.
EDA
Each member of the team should do exploratory data analysis (EDA) on the project datasets and add the results to a Quarto file added to the project website under a section called “EDA”. This is individual work, not teamwork. In a blog post, Michael Clark demos eight R packages that can help explore data quickly and effectively. The demoed packages are: arsenal, DataExplorer, dataReporter (previously dataMaid), gtsummary, janitor, SmartEDA, summarytools, and visdat.
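The website sections named so far (the “Effective Teamwork” and “EDA” sections, plus the proposal and case study appendices) could be wired into the site navigation roughly as follows. This is only a minimal sketch of a `_quarto.yml` file, assuming a Quarto website project with a sidebar; every file and folder name below is hypothetical, and the procedure used in class takes precedence where it differs.

```yaml
# _quarto.yml (sketch; all file and folder names are hypothetical)
project:
  type: website

website:
  title: "Project Title"
  sidebar:
    contents:
      - index.qmd                        # landing page = final report
      - section: "Effective Teamwork"
        contents:
          - teamwork/member-a.qmd        # one file per team member
          - teamwork/member-b.qmd
      - section: "EDA"
        contents:
          - eda/member-a.qmd             # one file per team member
          - eda/member-b.qmd
      - section: "Appendix"
        contents:
          - appendix/proposal.qmd
          - appendix/case-study.qmd
```

Keeping one file per member in the individual sections also fits the advice in Where to Submit: members who edit different files are less likely to run into conflicts when pushing.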
In-class Presentation
Each team will present their project to the class to solicit feedback before recording their final presentation. See the next section for details.
Code, Report, Video, Presentation
Each team should push the code of their project to GitHub. Each repository should include a README.md file that answers the following questions:
- What is this GitHub repository all about?
- What software (with version numbers) needs to be installed to run the code contained in this GitHub repository, e.g., R version 4.4.3+ and RStudio 2024.12.1 Build 563?
- What steps need to be taken to run the code contained in this GitHub repository? Think about the steps you did at the beginning of the semester to prepare your machine for class.
- What does the expected output look like? You can use one or more screenshots of the main features.
The awesome README GitHub page lists examples of GitHub repositories with well-structured README files. Please check some of them for inspiration; the Aimeos TYPO3 extension project repository is a good example.
The report should be the rendered version of the landing page of the project website, i.e., the file named index.qmd. When rendering the report, make sure to:
- Hide the code chunks
- Hide the warning and error messages
- Avoid printing any unnecessary output
- Format printed tables, if any, nicely. There are many packages such as gt and kableExtra that can help achieve this.
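The first three items in the list above can be handled with Quarto execution options. The fragment below is a minimal sketch, assuming the options are set in the YAML front matter of index.qmd (they could equally be placed in `_quarto.yml`); the title is a placeholder.

```yaml
---
title: "Project Title"   # placeholder title
execute:
  echo: false      # do not show the source code of the chunks
  warning: false   # do not include warnings in the rendered output
  message: false   # do not include messages (e.g., package start-up notes)
  error: false     # do not include errors in the output; rendering stops on an error instead
---
```

For an individual chunk that prints output the reader does not need, a chunk-level option such as `#| output: false` (or `#| include: false` to drop both code and output) can be used instead of changing the document-wide defaults.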
See the Structure section for details about what to include in the report.
The presentation video should be no longer than 5 minutes. It should be an alternative version of the written report and a refined version of the presentation done in class. If the report is long, the team can focus on one aspect in the presentation video and direct the viewer to the project website for the full report. The presentation video should be uploaded to the project GitHub repository or a video streaming service such as YouTube or Loom and then embedded in the report.
The presentation slides used in the presentation video (or an altered version) should be embedded in the report, preferably after uploading them to the project GitHub repository.
- See this Quarto webpage for how to embed figures, which also works for PDF files.
- See this Stack Overflow page for how to embed a webpage.
- See this Google Help page for how to embed a Google presentation.
Evaluation and Reflection
Each team member will be required to evaluate and reflect on their own performance, as well as the performance of each of their teammates. Additionally, each student will be asked to evaluate the projects of other teams.
Important Notes
Teamwork
Working in a team is a chance to improve your communication skills as well as to get to know your teammates better. However, working in a team sometimes poses challenges. To ensure a successful project outcome, below are a few expectations:
- Active participation–be present, attend classes, do your work, keep your team informed about any unexpected events
- Active listening–show interest in other team members’ ideas
- Inclusive environment–invite teammates to participate
- Each member should be involved in all main aspects of the project, including coding, reporting, and presentation. It is not acceptable for one member to handle only the coding while another focuses solely on the report and another solely on the presentation.
Code Backup
When working on your data analysis, ensure that you commit your code frequently to GitHub, accompanied by meaningful commit messages, and push your changes regularly. This practice helps prevent any unforeseen issues or loss of progress.
Inspirations
For inspirational data science projects, see the Data Engineering Zoomcamp list of projects for 2022 (cohort #1 and cohort #2) and 2023 (cohort #1 and cohort #2).