Dennis Sun, Stanford University, Winter 2023
In the final project for this course, you will apply the techniques learned in this class to analyze a data set of personal interest to you. Your goal should be to create an original project that you would be proud to show off to a potential employer. You are encouraged to upload your project to GitHub after the course is over.
The outline below is only a suggestion. If you have a completely different idea for a data science project that does not fit neatly with the following requirements, please come talk to me.
You are expected to work on this project in pairs, although you can work individually if you prefer. Please note that if you work alone, you are still expected to produce an amount of work comparable to that of a two-person team.
You will be graded on a 0-4 scale in each of the categories below. A 0 means that the component is missing from the project altogether, while the meanings of scores of 1-4 are summarized below. In general, you get a 3 for meeting expectations and doing solid work; to earn a 4, you have to go above and beyond in some way.
Category | 1 | 2 | 3 | 4 |
---|---|---|---|---|
Research Question and Motivation | Research question is unclear. | Research question is reasonable, but perhaps lacking in focus and motivation. | Research question is focused and well-motivated. | Research question is exceptionally creative and/or clever, in addition to being focused and well-motivated. |
Data Collection and Cleaning | Data is simple and already clean. | Data is straightforward (e.g., a CSV file), but still nontrivial (e.g., quite large). | Data is complex to collect (e.g., from a REST API) and/or to clean (e.g., in JSON format). | Data collection and cleaning are unusually complex. |
Data Visualization | Limited attempts to visualize the data. | Visualizations are straightforward and provide some insight into the data. | Visualizations are informative, insightful, and visually appealing. | Outstanding visualizations that are unusually artistic, insightful, and/or challenging to produce. |
Machine Learning | Missing either a comparison of different models or an analysis of the performance of the model. | Some comparison of different models and some analysis of the performance of the model. | Solid comparison of several models and thorough analysis of the performance of the final model. | Creative and/or sophisticated application of machine learning, in addition to solid analysis. |
Correctness of Results | Many flaws in the analysis; results are unreliable. | Results are generally correct, with one major flaw or many minor flaws. | Results are correct with no flaws (or only trivial flaws). | Results are not only correct, but the analyses were also technically challenging. |
Submission Organization | Submission did not match requirements stated in the "Deliverables" section. | Submission met the requirements but was difficult to follow and understand. | Submission was well-organized and easy to follow. | Submission exceeded all expectations in terms of documentation and was exceptionally organized. |
Presentation | Presentation failed to communicate the results adequately. | Presentation was acceptable, but was either unengaging or difficult to follow. | Solid presentation that was engaging and easy to understand. | Exceptional presentation that captivated the audience. |
Participation (Presentation Session) | Attended presentation session, but failed to complete the peer feedback. | Completed peer feedback, but perfunctorily. | Provided solid peer feedback for at least two-thirds of presentations. | Provided unusually detailed peer feedback for at least two-thirds of presentations and/or participated actively throughout the presentation session. |
The best data set is one that you are passionate about. I recommend that you start by finding a question you want to answer and then finding data to answer that question, rather than starting with a data set. That said, here are some helpful websites with large collections of data.