Readings for the Data Science Seminar

Project maintained by dlsun Hosted on GitHub Pages — Theme by mattgraham

This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on MWF.

Mondays:

- Reading discussions
- Discussions will be held in three breakout rooms, 6 students per room.
- Each student receives a designated role for each reading: “presenter”, “supporter”, “opponent”. Each breakout room will have two presenters, two supporters and two opponents.
- Discussions start with presenters giving a succinct and fact-based summary of the reading material. Then supporters will discuss what they liked in the reading material. After that, opponents will critique the work. Following this scripted discussion, there will be additional time for back-and-forth between all students, and for faculty reflections on the reading material.

Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.

To earn a CR grade, you are expected to:

- attend all seminars
- complete readings and post a question or comment about the reading to the Slack channel

The theme of the readings for Winter 2021 is **the present**.

Topic: Dealing With Small Data

On the heels of some of our previous conversations about data science hype often going hand-in-hand with big data hype, it seemed appropriate to spend a little more explicit time on non-big data situations. This is a popular blog and probably worth an exploration beyond this article. There could actually be quite a bit to unpack with this reading. I (Glanz) challenge the “supporters” and “opponents” to really go at it with this one… Here are some starting questions, but don’t feel restricted to just these.

- What, if any, explicit and/or implicit assumptions is the author making when proposing each tip?
- What are the pros and cons of each tip?
- If your dataset is small and there’s not much to do about the size, can you really call your work with it “data science”?

Reading:

- 7 Tips for Dealing With Small Data (also available on GitHub)

Topic: Underspecification of Neural Network Models

A machine learning pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. The paper discusses how this results in diminished accuracy on different real-world verification sets and possible solutions to address the problem.
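To make the definition concrete before the discussion, here is a minimal toy sketch. It is not taken from the paper. It assumes a made-up setup with two features, `x1` and `x2`, that are identical in the training domain but unrelated under a hypothetical distribution shift. The names `make_data`, `predictor_a` and `predictor_b` are illustrative only.

```python
import random


def make_data(n, correlated, rng):
    """Return a list of ((x1, x2), label) pairs; the label depends only on x1."""
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1, 1)
        # Training domain: x2 is a copy of x1. Shifted domain: x2 is independent noise.
        x2 = x1 if correlated else rng.uniform(-1, 1)
        rows.append(((x1, x2), int(x1 > 0)))
    return rows


def predictor_a(x):
    """Relies on x1, the feature that actually determines the label."""
    return int(x[0] > 0)


def predictor_b(x):
    """Relies on x2, which only matches the label while x2 tracks x1."""
    return int(x[1] > 0)


def accuracy(predict, rows):
    """Fraction of rows on which the predictor matches the label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
held_out = make_data(1000, correlated=True, rng=rng)   # same domain as training
shifted = make_data(1000, correlated=False, rng=rng)   # hypothetical deployment domain

for name, predict in [("A", predictor_a), ("B", predictor_b)]:
    print(name, "held-out:", accuracy(predict, held_out),
          "shifted:", accuracy(predict, shifted))
```

Both predictors are indistinguishable on held-out data from the training domain, so a pipeline that selects on held-out accuracy has no reason to prefer one over the other. Only the shifted data reveals that they behave very differently.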

Reading:

- D’Amour, A. et al. (2020). Underspecification Presents Challenges for Credibility in Modern Machine Learning.

Topic: Data vs. Theory

Intro to reading: In 2008, Chris Anderson, then editor-in-chief of Wired Magazine, published a provocative editorial entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” He essentially argues that traditional methods of scientific discovery and business strategy are inferior in comparison to what can be achieved with data mining on large-scale datasets. This is a fundamentally attractive idea – that through big data, we can leave behind human-constructed models and theories about the way the world works, and instead discover directly from the data all the answers we need. There are clearly many examples of where data-driven methods have prevailed over traditional hand-crafted alternatives, and all of you must “believe in data” somewhat based on your choice of academic field to study! But, what is missing in Anderson’s discussion of the promise and successes of big data? Is data really all we need? And when might data lead us astray?

Note that this article has almost 2,500 citations on Google Scholar and is widely discussed in follow-up blog posts and magazine articles – you might want to search a bit to find some different perspectives on this bold idea.

Reading:

- Anderson, C. (2008). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine.

Topic: Bias and Fairness of Rankings and Recommendations

Introduction to Readings: Rankings and recommendation systems are everywhere. We take them for granted. We base a lot of big and small decisions on the results presented to us. The readings below share several themes. At their heart they are examples of models and systems that have unintended and possibly detrimental impacts on society. When you are reading them, please think about things the original designers could/should have done differently. Put yourselves in their position, and think about whether you would have foreseen or anticipated any of the results or problems. Think about how you can design and study the many different models that recommend and rank items for you. Did the paper (third link) present a reasonable approach to studying a black box? What faults do you find, if any? What could be done with their findings? What did the designers of both the studies and the original systems do well (i.e., think about the positive as well as the negative)?

Readings:

- Weapons of Math Destruction (Chapter)
- Questioning the Fairness of Targeting Ads Online (Brief Summary)
- Original article from the above brief summary

No reading

This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on Thursdays. There are two sessions: at 10 AM and at 3 PM. You only need to attend one of these sessions. Each week there will be some readings relevant to data science, which we will discuss for 30 minutes. Afterwards, individual fellows will give updates on their research projects.

Please enroll in STAT 400 for 1 unit, on a CR/NC grading basis.

To earn a CR grade, you are expected to:

- attend all seminars
- complete readings and post a question or comment about the reading to the Slack channel

The theme of the readings for Fall 2020 is **the past**.

No reading

Hypothesis testing is a core part of many statistics classes. But where did the ideas such as the p-value, Type I error, and power come from? This reading reviews the chaotic history of hypothesis testing in the 20th century.

- Chapters 10 and 11 from Salsburg, D. (2001). *The lady tasting tea: How statistics revolutionized science in the twentieth century*. Macmillan.

We will look at how hypothesis testing has influenced other disciplines and the controversy this has caused.

- Cohen, J. (1994). The earth is round (p < .05). *American Psychologist*, 49(12), 997.
- Gill, J. (1999). The insignificance of null hypothesis significance testing. *Political Research Quarterly*, 52(3), 647-674.

We will look at the Frequentist vs. Bayes debate. Next week, you will be randomly assigned to defend either frequentism or Bayesianism. Please come prepared to defend both, although we want you to take a side in your Slack post this week.

- New York Times Article: The Odds, Continually Updated
- Frequentist and Bayesian Approaches in Statistics
- Efron, B. (2005). Bayesians, frequentists, and scientists. *Journal of the American Statistical Association*, 100(469), 1-5.
- xkcd comic on Frequentists vs. Bayesians

Here are some additional readings that might be of interest.

- MIT Lecture Notes: Comparison of frequentist and Bayesian inference.
- Efron, B. (1986). Why isn’t everyone a Bayesian? *The American Statistician*, 40(1), 1-5.
- Andrew Gelman’s blog post: Why I Don’t Like Bayesian Statistics. (This is an April Fools’ joke written by a famous Bayesian statistician. However, it contains some good ideas.)

A survey of the history of the connection between early statistics and the eugenics movement.

- “Scientific Priestcraft”, Chapter 3 of *Superior: The Return of Race Science*

Bonus reading from last week: A metaphor for the difference between randomness and unknown-ness, and the consequences in Bayesian analysis.

- The Boxer, the Wrestler, and the Coin Flip: A Paradox of Robust Bayesian Inference and Belief Functions. Andrew Gelman, *The American Statistician*, May 2006, Vol. 60, No. 2

We will examine the history of AI winters.

- Analyzing the Prospect of an Approaching AI Winter. Sebastian Schuchmann, May 3, 2019

We will discuss the Turing Test and whether machines can be intelligent.

- Computing Machinery and Intelligence. Alan Turing, *Mind*, October 1950, Vol. 59, No. 236
- Is the Brain a Good Model for Machine Intelligence? Rodney Brooks, Demis Hassabis, Dennis Bray, and Amnon Shashua, *Nature*, February 2012, Vol. 482

Questions to consider:

- Is the Turing Test a good test of whether a machine is intelligent?
- Are today’s systems (like IBM Watson, Google’s image recognition systems, etc.) intelligent? Could a convolutional neural network be intelligent?
- In trying to achieve machine intelligence, should we try to mimic the brain or should we apply a pure engineering approach?
- Is the brain just a computer – could its functionality be replicated by a machine?

No reading. Focus group with Lubi.

Main reading: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States, 2012. http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/bigdatawhitepaper.pdf

and The Claremont Report on Database Research, Communications of the ACM, Vol. 52, No. 8, 2009 http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/claremont-report2009.pdf

Supplemental reading:

- 2005: Lowell self-assessment. http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/lowell-report2005.pdf
- 1998: Asilomar report. PDF.
- 1996: Strategic directions in database systems. PDF.
- 1995: Achievements and Opportunities. PDF.
- 1988: Future Directions in DBMS Research. PDF

Given the timeframe there is no expectation that any supplemental materials will be read, but I want them available in a single place.

Reading: D. I. Holmes and R. S. Forsyth, “The Federalist Revisited: New Directions in Authorship Attribution” PDF

In 1964, Mosteller and Wallace showed how statistics (and Bayesian analysis) can be applied to the problem of authorship attribution, and by implication stylometry (understanding, measuring and detecting “style”). Holmes and Forsyth built on that work in the 1990s, laying the foundations for more computationally intensive methods of behavioral analysis that have major implications for our lives in the 2020s. Here are some motivating questions for the discussion this week:

- What are the parameters for the authorship attribution problem?
- Do you think there’s a definitive style of writing associated with each person? Can that style be faked?
- How has computational stylistics changed in recent decades?
- How does the 1964 work on the Federalist Papers relate to your personal life today?