Readings for the Data Science Seminar
This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on MWF.
Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.
To earn a CR grade, you are expected to:
The theme of the readings for Winter 2021 is the present.
Topic: Dealing With Small Data
On the heels of some of our previous conversations about data science hype often going hand in hand with big data hype, it seemed appropriate to spend a little more explicit time on non-big-data situations. The article comes from a popular blog that is probably worth exploring beyond this one piece. There could actually be quite a bit to unpack with this reading. I (Glanz) challenge the “supporters” and “opposers” to really go at it with this one… Here are some starting questions, but don’t feel restricted to just these.
Topic: Underspecification of Neural Network Models
A machine learning pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. The paper discusses how this leads to diminished accuracy on different real-world verification sets, and it discusses possible solutions to the problem.
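To make the definition concrete, here is a minimal Python sketch. It is not taken from the paper; the toy data, the two predictors, and every name in it (`make_rows`, `predict_from_x1`, `predict_from_x2`) are assumptions made up for illustration. The label depends only on feature `x1`, but in the training domain a second feature `x2` is a perfect copy of `x1`, so a predictor built on either feature looks equally good on held-out data from that domain.

```python
import random

def make_rows(n, x2_copies_x1, seed):
    """Return n ((x1, x2), label) rows; the label always equals x1."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.randint(0, 1)
        # In the training domain x2 is a perfect copy of x1;
        # after the shift it is an independent coin flip.
        x2 = x1 if x2_copies_x1 else rng.randint(0, 1)
        rows.append(((x1, x2), x1))
    return rows

def accuracy(predict, rows):
    """Fraction of rows whose label the predictor gets right."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

# Two hypothetical predictors that the same pipeline could return.
def predict_from_x1(x):
    return x[0]

def predict_from_x2(x):
    return x[1]

held_out = make_rows(1000, x2_copies_x1=True, seed=0)   # training domain
shifted = make_rows(1000, x2_copies_x1=False, seed=1)   # toy deployment domain

for name, predictor in [("x1", predict_from_x1), ("x2", predict_from_x2)]:
    print(name, accuracy(predictor, held_out), accuracy(predictor, shifted))
```

Both predictors are perfect on the held-out set, so held-out performance alone cannot tell them apart. On the shifted set only the `x1` predictor stays perfect, while the `x2` predictor drops to roughly chance. This is a deliberately extreme toy case, not a claim about how the effect looks in real neural network pipelines.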
Topic: Data vs. Theory
Intro to reading: In 2008, Chris Anderson, then editor-in-chief of Wired Magazine, published a provocative editorial entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” He essentially argues that traditional methods of scientific discovery and business strategy are inferior to what can be achieved with data mining on large-scale datasets. This is a fundamentally attractive idea: that through big data, we can leave behind human-constructed models and theories about the way the world works, and instead discover directly from the data all the answers we need. There are clearly many examples where data-driven methods have prevailed over traditional hand-crafted alternatives, and all of you must “believe in data” to some degree, given your choice of academic field! But what is missing in Anderson’s discussion of the promise and successes of big data? Is data really all we need? And when might data lead us astray?
Note that this article has almost 2,500 citations on Google Scholar and is widely discussed in follow-up blog posts and magazine articles, so you might want to search a bit to find some different perspectives on this bold idea.
Topic: Bias and Fairness of Rankings and Recommendations
Introduction to Readings: Rankings and recommendation systems are everywhere. We take them for granted. We base a lot of big and small decisions on the results presented to us. The readings below share several themes. At their heart they are examples of models and systems that have unintended and possibly detrimental impacts on society. When you are reading them, please think about things the original designers could/should have done differently. Put yourselves in their position, and think about whether you would have foreseen or anticipated any of the results or problems. Think about how you can design and study the many different models that recommend and rank items for you. Did the paper (third link) present a reasonable approach to studying a black box? What faults do you find, if any? What could be done with their findings? What did the designers of both the studies and the original systems do well (i.e., think about the positive as well as the negative)?
This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on Thursdays. There are two sessions: at 10 AM and at 3 PM. You only need to attend one of these sessions. Each week there will be some readings relevant to data science, which we will discuss for 30 minutes. Afterwards, individual fellows will give updates on their research projects.
Please enroll in STAT 400 for 1 unit, on a CR/NC grading basis.
To earn a CR grade, you are expected to:
The theme of the readings for Fall 2020 is the past.
Hypothesis testing is a core part of many statistics classes. But where did ideas such as the p-value, Type I error, and power come from? This reading reviews the chaotic history of hypothesis testing in the 20th century.
We will look at how hypothesis testing has influenced other disciplines and the controversy this has caused.
We will look at the Frequentist vs. Bayes debate. Next week, you will be randomly assigned to defend either frequentism or Bayesianism. Please come prepared to defend both, although we want you to take a side in your Slack post this week.
Here are some additional readings that might be of interest.
A survey of the history of the connection between early statistics and the eugenics movement.
Bonus reading from last week: A metaphor for the difference between randomness and unknown-ness, and the consequences in Bayesian analysis.
We will examine the history of AI winters.
We will discuss the Turing Test and whether machines can be intelligent.
Questions to consider:
No reading. Focus group with Lubi.
Main reading: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States, 2012. http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/bigdatawhitepaper.pdf
and The Claremont Report on Database Research, Communications of the ACM, Vol. 52, No. 8, 2009 http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/claremont-report2009.pdf
Given the timeframe there is no expectation that any supplemental materials will be read, but I want them available in a single place.
Reading: D. I. Holmes and R. S. Forsyth, “The Federalist Revisited: New Directions in Authorship Attribution” (PDF)
In 1964, Mosteller and Wallace showed how statistics (and Bayesian analysis) can be applied to the problem of authorship attribution and, by implication, stylometry (understanding, measuring, and detecting “style”). Holmes and Forsyth built on that work in the 1990s, laying the foundations for more computationally intensive methods of behavioral analysis that have major implications for our lives in the 2020s. Here are some motivating questions for the discussion this week: