Readings for the Data Science Seminar


Data Science Seminar

Winter 2021


This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on MWF.



Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.

To earn a CR grade, you are expected to:


The theme of the readings for Winter 2021 is the present.

Week 5

Topic: Dealing With Small Data

On the heels of some of our previous conversations about data science hype often going hand-in-hand with big data hype, it seemed appropriate to spend a little more explicit time on situations where the data are not big. This is a popular blog and probably worth exploring beyond this one article. There could actually be quite a bit to unpack in this reading. I (Glanz) challenge the “supporters” and “opposers” to really go at it with this one… Here are some starting questions, but don’t feel restricted to just these.


Week 4

Topic: Underspecification of Neural Network Models

A machine learning pipeline is underspecified when it can return many different predictors, all with equivalently strong held-out performance in the training domain. The paper discusses how this leads to diminished accuracy when those predictors are evaluated on different real-world data, and surveys possible ways to address the problem.
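To make the idea concrete before you read, here is a minimal (entirely synthetic) sketch of underspecification: two predictors that are indistinguishable in the training domain, because two features happen to be perfectly correlated there, but that behave very differently once that correlation breaks. All names and numbers below are made up for illustration; the paper's experiments are far more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Training domain: feature x2 is an exact copy of x1 (a spurious correlation)
x1 = rng.normal(size=n)
x2_train = x1.copy()
y = (x1 > 0).astype(int)

# Two predictors the pipeline could equally well return:
pred_a = lambda f1, f2: (f1 > 0).astype(int)  # relies on the stable feature
pred_b = lambda f1, f2: (f2 > 0).astype(int)  # relies on the spurious one

acc = lambda p, t: float((p == t).mean())
train_a = acc(pred_a(x1, x2_train), y)  # 1.0 -- indistinguishable
train_b = acc(pred_b(x1, x2_train), y)  # 1.0 -- on held-out training data

# Deployment domain: the correlation breaks (x2 is now independent noise)
x2_test = rng.normal(size=n)
test_a = acc(pred_a(x1, x2_test), y)  # still 1.0
test_b = acc(pred_b(x1, x2_test), y)  # near chance (~0.5)
```

The training data alone cannot distinguish `pred_a` from `pred_b`; only the deployment domain reveals which one the pipeline should have returned.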


Week 3

Topic: Data vs. Theory

Intro to reading: In 2008, Chris Anderson, then editor-in-chief of Wired Magazine, published a provocative editorial entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” He essentially argues that traditional methods of scientific discovery and business strategy are inferior to what can be achieved with data mining on large-scale datasets. This is a fundamentally attractive idea – that through big data, we can leave behind human-constructed models and theories about the way the world works, and instead discover directly from the data all the answers we need. There are clearly many examples of where data-driven methods have prevailed over traditional hand-crafted alternatives, and all of you must “believe in data” somewhat based on your choice of academic field to study! But what is missing in Anderson’s discussion of the promise and successes of big data? Is data really all we need? And when might data lead us astray?

Note that this article has almost 2,500 citations on Google Scholar and is widely discussed in follow-up blog posts and magazine articles – you might want to search a bit to find some different perspectives on this bold idea.


Week 2

Topic: Bias and Fairness of Rankings and Recommendations

Introduction to Readings: Rankings and recommendation systems are everywhere. We take them for granted, and we base a lot of big and small decisions on the results presented to us. The readings below share several themes. At their heart, they are examples of models and systems that have unintended and possibly detrimental impacts on society. When you are reading them, please think about things the original designers could or should have done differently. Put yourselves in their position, and think about whether you would have foreseen or anticipated any of the results or problems. Think about how you can design and study the many different models that recommend and rank items for you. Did the paper (third link) present a reasonable approach to studying a black box? What faults do you find, if any? What could be done with their findings? What did the designers of both the studies and the original systems do well (i.e., think about the positive as well as the negative)?


Week 1

No reading

Fall 2020


This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on Thursdays. There are two sessions: at 10 AM and at 3 PM. You only need to attend one of these sessions. Each week there will be some readings relevant to data science, which we will discuss for 30 minutes. Afterwards, individual fellows will give updates on their research projects.


Please enroll in STAT 400 for 1 unit, on a CR/NC grading basis.

To earn a CR grade, you are expected to:


The theme of the readings for Fall 2020 is the past.

Week 1: September 17

No reading

Week 2: September 24

Hypothesis testing is a core part of many statistics classes. But where did the ideas such as the p-value, Type I error, and power come from? This reading reviews the chaotic history of hypothesis testing in the 20th century.
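As a quick refresher on how these ideas fit together before the reading, here is a small sketch (with illustrative numbers, not taken from the reading) computing the power of a one-sided z-test – that is, the probability of correctly rejecting the null when a given alternative is true, at a chosen Type I error rate α:

```python
from math import sqrt
from statistics import NormalDist

def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = 0 vs. mu = delta > 0,
    with known sigma and sample size n."""
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection threshold (Type I error = alpha)
    shift = delta * sqrt(n) / sigma            # standardized effect size
    return NormalDist().cdf(shift - z_crit)    # P(reject H0 | mu = delta)

power = z_test_power(delta=0.5, sigma=1.0, n=30)  # about 0.86
```

Note how the p-value threshold (α), the Type I error, and the power are all tied together through the same sampling distribution – which is part of why the historical disputes in the reading got so tangled.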

Week 3: October 1

We will look at how hypothesis testing has influenced other disciplines and the controversy this has caused.

Week 4: October 8

We will look at the Frequentist vs. Bayes debate. Next week, you will be randomly assigned to defend either frequentism or Bayesianism. Please come prepared to defend both, although we want you to take a side in your Slack post this week.
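To have a concrete example in hand for the debate, here is a minimal coin-flip comparison of the two camps (the numbers are made up for illustration): the frequentist maximum-likelihood estimate of the heads probability versus the Bayesian posterior mean under a uniform prior.

```python
# 7 heads observed in 10 flips (illustrative data)
heads, n = 7, 10

# Frequentist: the maximum-likelihood estimate
mle = heads / n  # 0.7

# Bayesian: with a uniform Beta(1, 1) prior, the posterior is
# Beta(1 + heads, 1 + tails); its mean shrinks the estimate toward 1/2
alpha, beta = 1 + heads, 1 + (n - heads)  # Beta(8, 4)
post_mean = alpha / (alpha + beta)        # 8/12, about 0.667
```

The two answers differ because the Bayesian one encodes prior information – exactly the modeling choice the two sides will be arguing about.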

Here are some additional readings that might be of interest.

Week 5: October 15

A survey of the history of the connection between early statistics and the eugenics movement.

Bonus reading from last week: A metaphor for the difference between randomness and unknown-ness, and the consequences in Bayesian analysis.

Week 6: October 22

We will examine the history of AI winters.

Week 7: October 29

We will discuss the Turing Test and whether machines can be intelligent.

Questions to consider:

Week 8

No reading. Focus group with Lubi.

Week 9

Main reading: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States, 2012. http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/bigdatawhitepaper.pdf

and The Claremont Report on Database Research, Communications of the ACM, Vol. 52, No. 8, 2009 http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/claremont-report2009.pdf

Supplemental reading:

Given the timeframe, there is no expectation that any of the supplemental materials will be read, but I want them available in a single place.

Week 10

Reading: D. I. Holmes and R. S. Forsyth, “The Federalist Revisited: New Directions in Authorship Attribution” PDF

In 1964, Mosteller and Wallace showed how statistics (and Bayesian analysis) can be applied to the problem of authorship attribution, and by implication to stylometry (understanding, measuring, and detecting “style”). Holmes and Forsyth built on that work in the 1990s, laying the foundations for more computationally intensive methods of behavioral analysis that have major implications in our lives in the 2020s. Here are some motivating questions for the discussion this week:
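To give a feel for the core idea before you read, here is a minimal sketch of function-word stylometry in the spirit of Mosteller and Wallace: compare relative frequencies of common function words (their famous discriminator was “upon”) and attribute a disputed text to the nearest candidate profile. The texts and the word list below are made up for illustration – they are not the actual Federalist data.

```python
from collections import Counter

# A tiny, hypothetical set of function words (not the real feature set)
FUNCTION_WORDS = ["upon", "on", "while", "whilst", "by", "to"]

def profile(text):
    """Relative frequencies of the function words in a text."""
    counts = Counter(w for w in text.lower().split() if w in FUNCTION_WORDS)
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    """L1 distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Made-up snippets: author A favors "upon", author B favors "on"
author_a = "upon the whole we rely upon reason and upon evidence"
author_b = "on the whole we rely on reason and on evidence"
disputed = "it depends upon the facts and upon the law"

profiles = {"A": profile(author_a), "B": profile(author_b)}
guess = min(profiles, key=lambda k: distance(profiles[k], profile(disputed)))
# The disputed snippet favors "upon", so it is attributed to author A
```

Real stylometric methods use many more features and a proper probabilistic model (Mosteller and Wallace used Bayesian inference over word rates), but the nearest-profile idea above is the basic intuition.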