DATASCI 112: Principles of Data Science

Dennis Sun, Stanford University, Winter 2023


Final Project

In the final project for this course, you will apply the techniques learned in this class to analyze a data set of personal interest to you. Your goal should be to create an original project that you would be proud to show off to a potential employer. You are encouraged to upload your project to GitHub after the course is over.

Requirements

The outline below is only a suggestion. If you have a completely different idea for a data science project that does not fit neatly with the following requirements, please come talk to me.

  • The data that you analyze should be complex to collect or to clean in some way.
  • You should create informative visualizations to explain the data and your findings.
  • You should use machine learning to build a model to predict some outcome variable in your data set.

You are expected to work on this project in pairs, although you can work individually if you prefer. Please note that if you work alone, you are still expected to produce an amount of work comparable to that of a two-person team.

Deliverables

  • Create a directory containing the following files:
    • Data Collection and Cleaning: This notebook should contain all code that you wrote to collect and clean the data.
    • Data Exploration: This notebook should contain all code that you wrote as you were exploring the data. It should contain results and visualizations.
    • Machine Learning: This notebook should contain the code that you wrote to determine what model to fit and to analyze the performance of the model you ultimately chose.
    • Presentation: Either slides or a notebook containing the highlights from your analyses above. It should not contain much code, unless the code is important for understanding the analysis. You will be presenting this during the in-class presentation session.
  • Submit this directory by Wednesday 3/22 in one of two ways:
    • Upload the project to GitHub. (Make sure the repository is public.) Provide the link to this GitHub repository on Canvas.
    • Create a .zip file out of the directory. Upload this .zip file to Canvas.
    In general, we will not be re-running your notebooks unless we are confused about something, so it is not necessary to include data files in your submission. However, if the data files are small and it is convenient to include them, please do so.
  • Give a 5-7 minute presentation in one of the following sessions:
    • Monday 3/20, 3:30 - 6:30 PM in 380-380F (This is the scheduled final exam time for this class.)
    • Wednesday 3/22, 3:30 - 6:30 PM in Sequoia 200
    Sign up for a session here.
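
If you choose the .zip option listed above, one way to build the archive is sketched below using Python's standard `shutil.make_archive`. The directory name, notebook file names, and the `zip_project` helper are hypothetical placeholders; the demo builds a throwaway directory so that the snippet runs on its own. Substitute your real project directory.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_project(project_dir, output_stem):
    """Create <output_stem>.zip containing the project directory itself."""
    project_dir = Path(project_dir).resolve()
    archive = shutil.make_archive(
        base_name=str(output_stem),        # ".zip" is appended automatically
        format="zip",
        root_dir=str(project_dir.parent),  # paths in the archive are relative to this
        base_dir=project_dir.name,         # only this directory is archived
    )
    return Path(archive)


# Demo on a temporary directory with placeholder (hypothetical) file names.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "final_project"
    project.mkdir()
    for name in [
        "data_collection_and_cleaning.ipynb",
        "data_exploration.ipynb",
        "machine_learning.ipynb",
        "presentation.ipynb",
    ]:
        (project / name).write_text("{}")

    archive_path = zip_project(project, Path(tmp) / "final_project_submission")

    # List the archive contents to confirm what would be uploaded.
    with zipfile.ZipFile(archive_path) as zf:
        archive_names = sorted(zf.namelist())
    print(archive_path.name)
    print("\n".join(archive_names))
```

Archiving relative to the parent directory keeps the project folder as the top-level entry in the .zip, so the files stay grouped when the archive is extracted.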

Grading Rubric

You will be graded on a 0-4 scale in each of the categories below. A 0 means that the component is missing from the project altogether, while the meanings of scores of 1-4 are summarized below. In general, you get a 3 for meeting expectations and doing solid work; to earn a 4, you have to go above and beyond in some way.

Research Question and Motivation
  • 1: Research question is unclear.
  • 2: Research question is reasonable, but perhaps lacking in focus and motivation.
  • 3: Research question is focused and well-motivated.
  • 4: Research question is exceptionally creative and/or clever, in addition to being focused and well-motivated.

Data Collection and Cleaning
  • 1: Data is simple and already clean.
  • 2: Data is straightforward (e.g., a CSV file), but still nontrivial (e.g., quite large).
  • 3: Data is complex to collect (e.g., from a REST API) and/or to clean (e.g., in JSON format).
  • 4: Data collection and cleaning is unusually complex.

Data Visualization
  • 1: Limited attempts to visualize the data.
  • 2: Visualizations are straightforward and provide some insight into the data.
  • 3: Visualizations are informative, insightful, and visually appealing.
  • 4: Outstanding visualizations that are unusually artistic, insightful, and/or challenging to produce.

Machine Learning
  • 1: Missing either a comparison of different models or an analysis of the performance of the model.
  • 2: Some comparison of different models and some analysis of the performance of the model.
  • 3: Solid comparison of several models and thorough analysis of the performance of the final model.
  • 4: Creative and/or sophisticated application of machine learning, in addition to solid analysis.

Correctness of Results
  • 1: Many flaws in the analysis; results are unreliable.
  • 2: Results are generally correct, with one major flaw or many minor flaws.
  • 3: Results are correct with no flaws (or only trivial flaws).
  • 4: Results are not only correct, but the analyses were also technically challenging.

Submission Organization
  • 1: Submission did not match requirements stated in the "Deliverables" section.
  • 2: Submission met the requirements but was difficult to follow and understand.
  • 3: Submission was well-organized and easy to follow.
  • 4: Submission exceeded all expectations in terms of documentation and was exceptionally organized.

Presentation
  • 1: Presentation failed to communicate the results adequately.
  • 2: Presentation was acceptable, but not engaging or difficult to follow.
  • 3: Solid presentation that was engaging and easy to understand.
  • 4: Exceptional presentation that captivated the audience.

Participation (Presentation Session)
  • 1: Attended presentation session, but failed to complete the peer feedback.
  • 2: Completed peer feedback, but perfunctorily.
  • 3: Provided solid peer feedback for at least two-thirds of presentations.
  • 4: Provided unusually detailed peer feedback for at least two-thirds of presentations and/or participated actively throughout the presentation session.

Where to Find Datasets

The best data set is one that you are passionate about. I recommend that you start by finding a question you want to answer and then finding data to answer that question, rather than starting with a data set. That said, here are some helpful websites with large collections of data.

Example Projects