
Data Science Seminar

Grading

Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.

To earn a CR grade, you are expected to:

Readings

Spring 2023

Week 10: June 7th

Wrap-up and celebration!

Don’t forget to send in your artwork (Week 7 homework) by June 6th.

Week 9: May 31

Topic: Recommender Systems

One thing missing from our Data Science discussions in DATA NNN coursework and beyond is recommender systems and recommendation engines. This is in part because much of the methodology that goes into recommendation engines comes from classification, regression and clustering. But this is also somewhat unfortunate, because issues that are specific to recommender engines get ignored. This week you are reading two surveys on recommender systems.

Recommender Systems Survey (Bobadilla et al., 2013)

This is a general survey of recommender systems. It is somewhat old given that “future” is the general theme, but it is a good overview of the current state of non-deep-learning systems.

Deep Learning based Recommender System: A Survey and New Perspectives (Zhang et al., 2018)

This is a newer and more in-depth technical survey documenting how deep learning techniques are used in the guts of recommender engines.

How to read these?

Please come to class prepared to discuss the first paper, and bring at least one short written paragraph summarizing one takeaway or idea from the second paper that you understood reasonably well or would like to learn more about.

Week 8: May 24

Topic: Towards Fair, Transparent and Accountable AI

Many of our discussions this year have considered the ethical implications of AI and the many pitfalls inherent in the machine learning process. This week we will consider strategies to ensure that future AI systems are fair, transparent, and accountable, and ask whether this is even a realistic possibility.

Readings

  1. Hutson, M. It’s Too Easy to Hide Bias in Deep-Learning Systems. IEEE Spectrum, January 2021.
  2. Mitchell et al., Model Cards for Model Reporting. FAT* ’19: Proceedings of the Conference on Fairness, Accountability, and Transparency, January 2019.

The first reading points out the many ways in which an organization might try to make their AI system appear fair and unbiased, while actually operating in a biased manner. It also discusses some possible strategies to address this problem. The second reading proposes a way to address this problem head on, by providing a performance “report card” along with the AI model when it is distributed.

Your groups will present a Hippocratic Oath of AI. Each of your statements in the Oath should be supported by examples or quotes from these or other readings from this year.

Week 7: May 17

Topic: Generative Art

In-Class Video: Generating Basic Tiling with turtle

In-Class Activity: Make your first art piece. Each person should produce their own unique piece, but your group’s collection should have something in common, such as similar colors, similar movement principles, etc.

Homework: Watch the remaining videos in the series. By Week 10, produce a more complex piece or series of pieces, including titles. We will have a “gallery” at the Week 10 meeting/party.

Week 6: May 10

Topic: Data Feminism

Guest speaker: Dr. Allison Theobold

Week 5: May 3

Topic: Under the Covers of GPT-3

Guest speaker: Jim Bodwin

Week 4: April 26

Topic: Reproducibility

Job security through code obscurity

Have you ever inherited code so impossible to follow, you decided it was easier to rewrite it yourself?

In this class, we’ll learn by counterexample, by implementing the most difficult to read and inefficient code possible, then trying to figure out what’s going on in each other’s messy code.

Rules:

Week 3: April 19

Topic: Examples of Explainable Boosting

Readings/Discussion

Teams will give a 10-15 minute lecture on their assigned topic. You will NOT have access to a projector - please plan to use the blackboard and/or handouts to explain your topic. You should include some mathematical/conceptual explanation of the topic, but also some discussion of its place in ML progress: Do you see this method as a solution to any problems that currently exist?

Team 1: AdaBoost (https://www.youtube.com/watch?v=LsK-xG1cLYA)

Team 2: Gradient Boosting (https://www.youtube.com/watch?v=3CC4N4z3GJc)

Team 3: Explainable Boosting (https://towardsdatascience.com/the-explainable-boosting-machine-f24152509ebb)

Week 2: April 12

Topic: Predicting the future of the field

“I have traveled the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won’t last out the year.”

It’s easy to look back and laugh at predictions of the future that turned out to be dead wrong. As we move into speculative readings that suggest hypothetical future directions for Data Science, how do we know which ones are plausible? This week you’ll read papers from the past, and compare them to a recent paper that looks ahead to the next decades of data.

Readings

  1. On the Future of Statistics (1942)
  2. The Future of Statistics as a Discipline (1981)
  3. The Future of Statistics and Data Science (2018)

In your groups, one person should read each paper.

Each person should find at least 2 quotes or ideas that are, in your opinion, accurate or true today, and at least 2 that do not seem to hold up. As a group, choose a few of your favorites to present, along with references to support the accuracy/inaccuracy of the prediction.

Week 1: April 5

In-Class activity: Cards Against Data Science

Winter 2023

Week 9: March 8

We’ll finish off this quarter with another discussion about ChatGPT. Please complete the following reading and prepare for our discussion this Wednesday via the following assignments.

All teams should prepare the following from two perspectives:

  1. Us as an ongoing data science fellowship

  2. A lay audience. That is, summarize this reading for your friends, parents, or other contacts who might know less about AI and data science than you.

Finally, last week we talked about the idea of using and citing ChatGPT…and its limitations. Please come prepared to discuss when it would be appropriate to cite ChatGPT and how you would do so.

Week 8: March 1

We’re going to begin discussing ChatGPT, and other similar tools, over the next couple of weeks and likely next quarter too! If you’re unfamiliar with ChatGPT, please read the following short synopsis:

https://www.digitaltrends.com/computing/how-to-use-openai-chatgpt-text-generation-chatbot/

It is extremely exciting and powerful. Your first assignment concerning ChatGPT involves the following:

Week 7: February 22

In this next exercise you’ll be working with a team, again, to revisit the kidney disease dataset. Please read the instructions for the second assignment and access the data at the following link, and come prepared for next week’s class:

https://drive.google.com/drive/folders/1vXFloq65bmXzwWfB-K6VMJtnWbRy52I3

Week 6: February 15

One of the dangers of AI and Machine Learning technologies becoming ubiquitous is that they inevitably become tools in product marketing, fundraising, and attention-grabbing campaigns. As data scientists we do not always control how our work is presented to others and in what form. As consumers we may want to recognize when claims of AI use exceed plausibility or are spurious.

AI/ML researchers and sociologists started using the term “AI snake oil” to discuss situations when AI technology is being sold for unsuitable purposes, or when the marketing claims of AI input into a company’s product are dubious.

Looking at the problem from another angle, we as data scientists, machine learning experts and AI researchers must understand the limitations of the tools we use to analyze data, make predictions, and obtain insight from data. We must be able to separate the real contributions and capabilities of AI/ML systems from the claims made in public space.

So, let’s take a look at how AI snake oil works and how to recognize it.

For next week’s assignment, watch Arvind Narayanan’s talk “How to Recognize AI Snake Oil” available here: https://www.cs.princeton.edu/news/how-recognize-ai-snake-oil (1 hour 37 minutes)

The slides for the talk can be found here: https://www.cs.princeton.edu/~arvindn/talks/MIT-STS-AI-snakeoil.pdf

For some additional insight and takes, you can also watch this University of Oxford virtual panel on Fake AI, snake oil, pseudoscience and hype by a group of authors of the book “Fake AI, pseudoscience, snake oil and hype”: https://www.youtube.com/watch?v=w1f0Sj-st9g&ab_channel=OxfordInternetInstitute%2CUniversityofOxford (1 hour 4 minutes)

In preparation for the discussion, break into groups of 4 people each and discuss the following questions:

In class, we will discuss each question in turn. Each group is expected to have a summary of thoughts on each question prepared for sharing with the rest of the Fellowship.

Week 5: February 8

In this next exercise you’ll be working with a team, again, to analyze some data in an attempt to do some classification. Please read the instructions and access the data at the following link, and come prepared for next week’s class:

https://drive.google.com/drive/folders/1vXFloq65bmXzwWfB-K6VMJtnWbRy52I3

Week 4: February 1

In the next week, we will discuss tooling.

You need to pick a partner to research one of two technologies for data science: R or Python. Before class make sure you know which tool you’re researching and that the assignments are balanced across all pairs of students.

The R group will research the advantages of:

The Python group will research the advantages of:

Come prepared to discuss and debate the advantages of your assigned technology.

Week 3: January 25

By this class, in your same groups of three from the activity on January 18, you should pick two data science tools from the following list to compare and contrast with respect to reproducibility. What are the pros? What are the cons? Is one of them clearly better than the other? Why?

Come to class prepared to give a 5-minute presentation on what you come up with.

Week 2: January 18

Our first unit is on reproducibility. Please read the following before this class and come prepared to discuss them both in terms of their content and how you feel they relate to your experiences and knowledge of data science:

Please don’t look at the following docs until class today!

Group 1: https://docs.google.com/document/d/1zlgi44LB8L8-5-gCTknCy54pKT_BjaqAVf7CoNvBIIU/edit?usp=sharing

Group 2: https://docs.google.com/document/d/1PsQwkE8OoAg9h7Q8LP_DZ135U5Z29hEd7lk7fUJCfd0/edit?usp=sharing

Week 1: January 11

Fall 2022

The theme of the readings for Fall 2022 is the past.

Week 1: September 21

Week 2: September 28

Hypothesis testing is a core part of many statistics classes. But where did ideas such as the p-value, Type I error, and power come from? This reading reviews the turbulent history of hypothesis testing in the 20th century.

Required Reading:

Additional Resources:

Assignment:

In your assigned group of 6 students:

Week 3: October 5

In the early to mid 1900s, the field of eugenics - the idea that some groups or people are inherently genetically inferior - was a mainstream and well-respected scientific pursuit. Many of the foundational ideas of classic statistics were developed in conjunction with eugenics applications. In the modern era, now that these ideas have been rejected as racist/classist/etc, how should we regard the influential people and ideas that came out of that movement?

Required Reading:

Assignment:

We ask you to think carefully about the practice of re-contextualizing scientific contributions in light of modern ethics. Questions to consider:

In your assigned group of 6 students:

Week 4: October 12

We will look at the Frequentist vs. Bayes debate! Not only does this debate pertain to how you think about and do your statistics and data science, but also to some of your everyday ways of thinking! Please come prepared to defend both, although you will be asked to take a side in class.

Required Reading:

Here are some additional readings that might be of interest.

Assignment:

We ask you to think carefully about these two sides: Frequentist and Bayes. Questions to consider:

In your assigned group of 6 students:

Week 5: October 19

We will read a classic paper by Leo Breiman entitled “Statistical Modeling: The Two Cultures.” The PDF linked below also contains comments by leading statisticians and data scientists which will give you more ideas as you prepare your presentations.

Required Reading:

This paper by Efron can be seen as an update to Breiman’s paper 20 years later – it is not required reading but might be interesting for you to read over:

Assignment:

In your assigned group of 6 students:

Week 6: October 26

We will discuss the past and future of Artificial Intelligence (AI).

Required Reading:

Assignment:

In your assigned group of 6 students:

Week 7: November 2

We are starting a three-week stretch devoted to the ideas and technologies behind working with big data. Our first discussion is about relational databases, their history and their role in bringing forth our ability to work with large quantities of data. To that end, you will read two sets of articles.

The first set of articles gives you some historic perspective on the development of the relational data model and relational databases. The articles in this set are:

The Codd papers provide historic descriptions of the ideas behind the modern relational databases. The blog post puts some of the information contained in these papers in the overall context.

The second set of papers comes from a sequence of meetings conducted by the database research and industry community over the course of the late 20th and early 21st centuries. The meetings served as the community’s reflection points on the progress of the field of relational databases (and the field of databases in general) over the years. They also attempted to identify future challenges that database technology and the database community needed to meet. The papers are co-authored by a who’s who of the area of database management systems. The papers in this series are:

This is a lot of reading. Please, read the instructions below carefully.

The roles for this week’s assignment are:

Week 8: November 9

We will discuss the rise and “fall” of Hadoop… and future of Hadoop.

Required Reading:

Everyone should read in detail the following:

Assignment:

The “decline” of Hadoop is documented, and so it is not my intention to have you discuss why this happened. It is an interesting topic but not our focus. I’ll share a bit of my own experiences with Hadoop at the beginning of seminar since it came about during my graduate school days, and I’ve been along for the ride ever since.

In your assigned group of 6 students:

Week 9: November 16

Everyone needs to read the following:

In your assigned group of 6:

Everyone: be sure to look up terms and products in the readings that you may not be familiar with (because they are too old or discontinued). Teams need to be able to explain the terms in the papers that have to do with their subject, if asked during their presentations.