Data Science Seminar
Grading
Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.
To earn a CR grade, you are expected to:
- attend all seminars
- do the reading
- complete the assignment associated with each reading
Readings
Winter 2022
Week 10 assignment: Requirements Spec
The Week 10 assignment comes from the Spring 2015 CSC 366 course I taught. Our customer, Dmytro Marushkevich, heads a data science team at a marketing company that at the time was called Rosetta and is now called Sapient. Dmytro shared with us a real use case that he and his team had to work on, complete with real (although sanitized) data that CSC 366 students used.
For this week we are splitting you into two teams of six people each. Both teams will have the same task:
- Read the requirements specification. It presents the data in service of a specific marketing goal that Rosetta has. It also presents the kinds of reports/data analysis the company wants to perform with the provided data.
- Build an Entity-Relationship model for the Rosetta dataset based on the provided descriptions. (Understandably, you may not get it completely right, and the document itself omits some potentially important information, but we want you to get a reasonable start. A small, purely illustrative sketch appears after this list.)
- On Wednesday, March 9, each team will present its proposed E-R model. We will have a short breakout discussion, followed by a general discussion of what you have learned from this exercise and some comments from Alex Dekhtyar and Lubomir Stanchev.
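To make the deliverable concrete, here is a minimal sketch of one way a first-pass E-R design can be written down in code, using SQLAlchemy's declarative style. The entities, attributes, and relationship below are invented for illustration only; they are not taken from the Rosetta requirements spec, which you should read for the actual content.

```python
# Hypothetical E-R sketch for a generic marketing dataset (NOT the Rosetta spec).
from sqlalchemy import Column, ForeignKey, Integer, Numeric, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Customer(Base):                          # entity
    __tablename__ = "customer"
    id = Column(Integer, primary_key=True)     # key attribute
    segment = Column(String)                   # descriptive attribute

class Campaign(Base):                          # entity
    __tablename__ = "campaign"
    id = Column(Integer, primary_key=True)
    channel = Column(String)

class Response(Base):                          # many-to-many relationship with its own attribute
    __tablename__ = "response"
    customer_id = Column(ForeignKey("customer.id"), primary_key=True)
    campaign_id = Column(ForeignKey("campaign.id"), primary_key=True)
    amount_spent = Column(Numeric)
    customer = relationship("Customer")
    campaign = relationship("Campaign")
```

Most teams will present their model as a diagram; the sketch is only meant to show the level of detail we expect: named entities, key attributes, and explicitly modeled relationships.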
Week 9: March 2nd
In the next two weeks, we will discuss entity-relationship diagrams.
Readings:
- Slides on ER modeling
- Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom, Database Systems: The Complete Book, Chapter 4. Available online, but the link is not provided because of copyright.
- Practice Problem 1
- Alex’s E-R modeling lectures: Part 1
- Alex’s E-R modeling lectures: Part 2
- Alex’s E-R modeling lectures: Part 3
- Alex’s E-R modeling lectures: Part 4
Weeks 6 and 7: February 9 and 16
In the next two weeks, we will discuss transfer learning.
For week 6, read the following two summaries about transfer learning and come prepared to discuss. Also, bring a laptop as we will go through a transfer learning example in Google Colab together.
Be prepared to discuss the following simple questions:
- What is transfer learning?
- Why is it important?
- Why is it difficult?
- What are some common methods for transfer learning?
We prepared a Google Colab notebook to demonstrate transfer learning using CNN features.
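For orientation before seminar, here is a minimal transfer-learning sketch along the same lines. This is not the prepared Colab notebook, just an illustration; the base model choice and the directory layout are assumptions.

```python
import tensorflow as tf

# Pretrained CNN, used as a frozen feature extractor (ImageNet weights).
base = tf.keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False  # freeze ImageNet features; only the new head is trained

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1,
                              input_shape=(224, 224, 3)),  # MobileNetV2 expects inputs in [-1, 1]
    base,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),        # binary: sandwich vs. not sandwich
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hypothetical folder layout: train/sandwich/*.jpg and train/other/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "train", image_size=(224, 224), batch_size=32, label_mode="binary")
model.fit(train_ds, epochs=5)
```

The key idea is that the pretrained CNN is reused as a fixed feature extractor, so only the small classification head on top is trained on the new data.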
For week 7, your task is to prepare a new transfer learning experiment. Our suggested experiment is to investigate the ridiculous question, “Is a hot dog a sandwich?” If you wish to investigate this question, do the following:
- Use transfer learning to train a classifier to distinguish sandwiches from other foods using this dataset, which we extracted from the Food-101 dataset on Kaggle.
- Test the classifier on images of hot dogs, to allow a disinterested party (the classifier) to decide once and for all whether a hot dog is truly a sandwich. (A small scoring sketch appears at the end of this section.)
You are also welcome to design your own experiment, perhaps addressing a more serious question :)
Come to seminar in week 7 ready to present and discuss your experiment!
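If you go with the suggested experiment, the final step might look something like the sketch below. It assumes the classifier from the earlier sketch was saved to a file and that the hot dog photos live in a hypothetical hotdog_test folder; both names are placeholders.

```python
import numpy as np
import tensorflow as tf

# Reload the trained sandwich classifier (hypothetical path), or reuse the
# `model` object from the training sketch above.
model = tf.keras.models.load_model("sandwich_classifier.keras")

# Hypothetical folder of hot dog photos, loaded without labels.
hotdog_ds = tf.keras.utils.image_dataset_from_directory(
    "hotdog_test", labels=None, image_size=(224, 224),
    batch_size=32, shuffle=False)

probs = model.predict(hotdog_ds)       # predicted P(sandwich) for each image
verdict = float(np.mean(probs > 0.5))  # fraction classified as "sandwich"
print(f"The classifier calls {verdict:.0%} of the hot dogs sandwiches.")
```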
Weeks 4 and 5: January 26, February 2
In the next two weeks, we will discuss tooling.
You and a partner will be assigned one of two technologies for data science to research: R or Python.
The R group will research the advantages of:
- R
- R Markdown
The Python group will research the advantages of:
- Python
- Jupyter notebooks
Come prepared to discuss and debate the advantages of your assigned technology.
Week 3: January 19
By this class, in your same groups of three from the activity on January 12, you should pick two data science tools from the following list to compare and contrast with respect to reproducibility. What are the pros? What are the cons? Is one of them clearly better than the other? Why?
Come to class prepared to give a 5-minute presentation on what you come up with.
- RStudio
- PyCharm
- Jupyter
- Google Colab
- Amazon Web Services
- Google Docs
- Github
- GitLab
- Slack
- Dropbox
- Tableau
- LaTeX
- Microsoft Office
- RMarkdown
Week 2: January 12
Our first unit is on reproducibility. Please read the following before this class and come prepared to discuss them both in terms of their content and how you feel they relate to your experiences and knowledge of data science:
Week 1: January 5
Fall 2021
The theme of the readings for Fall 2021 is the past.
Week 1: September 22
Week 2: September 29
Hypothesis testing is a core part of many statistics classes. But where did ideas such as the p-value, Type I error, and power come from? This reading reviews the turbulent history of hypothesis testing in the 20th century.
Required Reading:
- Chapters 10 and 11 from David Salsburg (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. Macmillan.
Additional Resources:
- Erich Lehmann (1993). The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two?. (You should be able to download a PDF when you are on the Cal Poly network or VPN.)
- Erich Lehmann (2011). Fisher, Neyman, and the Creation of Classical Statistics.
- Oral History Interviews with David Blackwell (YouTube playlist).
Assignment:
In your assigned group of 6 students:
- Pair 1 should prepare slides discussing Fisher’s contributions to hypothesis testing.
- Pair 2 should prepare slides discussing Neyman’s (and Pearson’s) contributions to hypothesis testing.
- Pair 3 should prepare slides comparing and contrasting the two approaches.
Week 3: October 6
In the early to mid 1900s, the field of eugenics - the idea that some groups or people are inherently genetically inferior - was a mainstream and well-respected scientific pursuit. Many of the foundational ideas of classical statistics were developed in conjunction with eugenics applications. In the modern era, now that these ideas have been rejected as racist, classist, and worse, how should we regard the influential people and ideas that came out of that movement?
Required Reading:
- This Twitter thread by famous statistician Daniela Witten, which initiated a change to a major statistics award.
- Scientific Priestcraft, Chapter 3 of “Superior: The Return of Race Science” by Angela Saini.
Assignment:
We ask you to think carefully about the practice of re-contextualizing scientific contributions in light of modern ethics. Questions to consider:
- Does removing accolades like the Fisher Prize harm the target and/or his family? Should the scientific contributions be considered in isolation, and honored, regardless of what else the individual did?
- Does this approach truly create a more welcoming/diverse scientific community? Or does it discourage or hinder others from contributing to scientific progress, for fear of personal scrutiny?
- Does it remove objectivity from statistical methodology?
- Does it help us prevent similar mistakes in the future?
In your assigned group of 6 students:
- Pair 1 should track down more examples of events like the Fisher Prize renaming: Can you find historical figures related to data science whose ethics are currently in question? Can you find examples of honors or accolades being removed from these people (or discussions suggesting such)?
- Pair 2 should collect, summarize, and present arguments against re-contextualizing scientific contributions in light of modern ethics.
- Pair 3 should collect, summarize, and present arguments for re-contextualizing scientific contributions in light of modern ethics.
Week 4: October 13
We will look at the Frequentist vs. Bayesian debate! Not only does this debate pertain to how you think about and do your statistics and data science, but also to some of your ways of thinking every day! Please come prepared to defend both sides, although you will be asked to take a side in class.
Required Reading:
- New York Times Article: The Odds, Continually Updated
- Frequentist and Bayesian Approaches in Statistics
- Efron, B. (2005). Bayesians, frequentists, and scientists. Journal of the American Statistical Association, 100(469), 1-5.
- Gelman, A. (2006). The Boxer, the Wrestler, and the Coin Flip: A Paradox of Robust Bayesian Inference and Belief Functions. The American Statistician, 60(2).
- xkcd comic on Frequentists vs. Bayesians
Here are some additional readings that might be of interest.
- MIT Lecture Notes: Comparison of frequentist and Bayesian inference.
- Efron, B. (1986). Why isn’t everyone a Bayesian? The American Statistician, 40(1), 1-5.
- Andrew Gelman’s blog post: Why I Don’t Like Bayesian Statistics. (This is an April Fools’ Joke written by a famous Bayesian statistician. However, it contains some good ideas.)
Assignment:
We ask you to think carefully about these two sides: Frequentist and Bayes. Questions to consider:
- What are the strengths and weaknesses of each?
- Is it possible to be a little bit of both? Why or why not?
- Think about your own beliefs about how probability and the likelihood of events work. In your day-to-day life (when not doing statistics or data science), do you tend to be more frequentist or Bayesian?
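If a concrete anchor helps the discussion, here is a tiny, self-contained comparison on invented data (60 heads in 100 coin flips): a frequentist point estimate with an approximate confidence interval next to a Bayesian posterior with a credible interval.

```python
from scipy import stats

n, heads = 100, 60  # invented data

# Frequentist: point estimate and approximate 95% confidence interval.
p_hat = heads / n
se = (p_hat * (1 - p_hat) / n) ** 0.5
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian: uniform Beta(1, 1) prior updated to a Beta(61, 41) posterior.
posterior = stats.beta(1 + heads, 1 + n - heads)
cred = posterior.interval(0.95)

print(f"frequentist: estimate {p_hat:.2f}, 95% CI ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"Bayesian:    posterior mean {posterior.mean():.2f}, "
      f"95% credible interval ({cred[0]:.2f}, {cred[1]:.2f})")
```

The two intervals come out numerically similar here, which is part of why the debate is as much about interpretation (long-run error rates vs. degrees of belief) as about the numbers themselves.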
In your assigned group of 6 students:
- Pair 1 should organize and present the strengths of the Frequentist side and the weaknesses of the Bayesian side, as they pertain to statistics and data science work.
- Pair 2 should organize and present the strengths of the Bayesian side and the weaknesses of the Frequentist side, as they pertain to statistics and data science work.
- Pair 3 should comment on how these two sides pertain to day-to-day life, and whether your position on them in this context should be related to your position on them in the context of your work. Be sure to think carefully and thoroughly here. Your comments might also include similarities between the two sides in certain situations.
Week 5: October 20
We will read a classic paper by Leo Breiman entitled “Statistical Modeling: The Two Cultures.” The PDF linked below also contains comments by leading statisticians and data scientists which will give you more ideas as you prepare your presentations.
Required Reading:
This paper by Efron can be seen as an update to Breiman’s paper 20 years later; it is not required reading, but it might be interesting for you to read over:
- Efron, B. (2020). Prediction, Estimation, and Attribution. Journal of the American Statistical Association.
Assignment:
In your assigned group of 6 students:
- Pair 1 should organize and present the strengths of the “data modeling” culture.
- Pair 2 should organize and present the strengths of the “algorithmic modeling” culture.
- Pair 3 should compare and contrast the two cultures.
Week 6: October 27
We will discuss the past and future of Artificial Intelligence (AI).
Required Reading:
Assignment:
In your assigned group of 6 students:
- Pair 1 should review the two papers.
- Pair 2 should make the argument why AI is the best thing since peanut butter and jelly and why investment in AI will keep growing.
- Pair 3 should make the argument that history will repeat itself, reality will crush expectations, and we will head for another AI winter.
Week 7: November 3
We are starting a three-week stretch devoted to the ideas and technologies behind working with big data. Our first discussion is about relational databases, their history and their role in bringing forth our ability to work with large quantities of data. To that end, you will read two sets of articles.
The first set of articles gives you some historic perspective on the development of the relational data model and relational databases. The articles in this set are:
- [E.F. Codd. A Relational Model of Data For Large Shared Data Banks, Communications of the ACM, June 1970](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf)
- E.F. Codd. The 12 Rules (excerpt from “Is your database fully relational?”, Computerworld, Oct 1985, reprinted)
- Important Papers: Codd and the Relational Model. Two Bit History blog, Dec 2017
The Codd papers provide historic descriptions of the ideas behind modern relational databases. The blog post puts some of the information contained in these papers into overall context.
The second set of papers comes from a sequence of meetings conducted by the database research and industry community over the late 20th and early 21st centuries. The meetings served as the community’s reflection points on the progress of the field of relational databases (and of databases in general) over the years. They also attempted to identify future challenges that database technology and the database community needed to meet. The papers are co-authored by a who’s who of the database management systems area. The papers in this series are:
- Erich Neuhold and Michael Stonebraker, Future Directions in DBMS Research, 1988
- A. Silberschatz, M. Stonebraker, J. Ullman (Eds.), Database Research: Achievements and Opportunities into the 21st Century, 1995
- A. Silberschatz, S. Zdonik, et al., Strategic Directions in Database Systems - Breaking out of the Box, 1996
- P. Bernstein et al., The Asilomar Report on Database Research, 1998
- S. Abiteboul et al., The Lowell Database Research Self-Assessment, 2003
- R. Agrawal et al., The Claremont Report on Database Research, Communications of the ACM 2009
- D. Agrawal, Challenges and Opportunities with Big Data, 2012
This is a lot of reading. Please read the instructions below carefully.
The roles for this week’s assignment are:
- Role 1: Archeologists. Students in this role will present information about the development of the ideas behind modern (relational) database management systems and will discuss the importance of these ideas to today’s data science.
- Role 2: Supporters. Students in this role will discuss what the database community got right about its challenges and the problems it needs to tackle in the modern world.
- Role 3: Detractors. Students in this role will discuss where the database community missed the mark. What errors of omission and commission did the DB community commit? (Errors of omission: something important about today’s world of working with data that the community missed. Errors of commission: something the DB community thought would be a big challenge but was not.)
Week 8: November 10
We will discuss the rise and “fall” of Hadoop… and the future of Hadoop.
Required Reading:
Everyone should read in detail the following:
Assignment:
The “decline” of Hadoop is documented, and so it is not my intention to have you discuss why this happened. It is an interesting topic but not our focus. I’ll share a bit of my own experiences with Hadoop at the beginning of seminar since it came about during my graduate school days, and I’ve been along for the ride ever since.
In your assigned group of 6 students:
- Pair 1 should provide summaries of the CAP Theorem, the MapReduce paper, and the blog-style article. (A toy MapReduce sketch appears after this list.)
- Pair 2 should discuss how Hadoop can thrive as expectations and the technology mature (i.e., in a “plateau of productivity” phase). Specifically, they should focus on explaining and expanding on the third article’s optimism for Hadoop. For example, the authors mention technologies maturing around Hadoop: optimized data formats (ORC, Parquet) and query engines (Impala, Presto, Dremel). This pair should also discuss and expand upon the emerging best practices, and should make a general case for the long-term viability of the Hadoop ecosystem.
- Pair 3 should discuss and expand upon why the third article’s optimism for Hadoop is incorrect. They likewise should go through the article and prepare specific counterpoints to the author’s claims of a “plateau of productivity” future for Hadoop.
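For everyone’s benefit, here is a toy, single-machine illustration of the MapReduce programming model that Pair 1 will be summarizing: word count, with the shuffle step written out explicitly, and no Hadoop involved.

```python
from collections import defaultdict

def map_phase(doc):
    # emit (word, 1) pairs for every word in the document
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # group values by key, as the framework would between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # combine all values emitted for one key
    return key, sum(values)

docs = ["hadoop rose", "hadoop fell", "hadoop remains"]
mapped = [pair for doc in docs for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'hadoop': 3, 'rose': 1, 'fell': 1, 'remains': 1}
```

In Hadoop proper, the map and reduce functions look much the same, but the framework distributes the map tasks, the shuffle, and the reduce tasks across a cluster and handles failures along the way.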
Week 9: November 17
Everyone needs to read the following:
- A Brief History of the Internet
- What is Web 2.0? (make sure to read all the pages)
- Semantic web
- Chicken Farms on the Semantic Web
In your assigned group of 6:
- Pair 1 presents the fundamental forces (social, economic, political) that were crucial to the development of the World Wide Web.
- Pair 2 explains the differences between Web 1.0 and Web 2.0. Use examples and products from your own life and interactions. Name the ways in which current technologies extend the enabling concepts of Web 2.0.
- Pair 3 explains the Semantic Web. How are things likely to change because of it in the near future? Use examples.
Everyone: be sure to look up any terms and products you’re reading about that you may not be familiar with (too old or discontinued). Teams need to be able to explain the terms in the papers that relate to their subject, if asked during their presentations.