Readings for the Data Science Seminar
This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on MWF.
Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.
To earn a CR grade, you are expected to:
The theme of the readings for Spring 2021 is the future.
We will do an in-class activity in the seminar. To prepare for this activity, please take the Pew Political Typology Quiz and then fill out this form. Your answers to the form are anonymous. Please remember your political typology when you come to class.
One thing missing from our Data Science discussions in DATA NNN coursework and beyond is recommender systems and recommendation engines. This is in part because quite a bit of the methodology that goes into recommendation engines comes from classification, regression and clustering. But this is also somewhat unfortunate, because there are issues specific to recommender engines that get ignored. This week you are reading two surveys on recommender systems.
How to read these? The first survey is broad and not overly technical. Read it fully. This will give you the background and the setup for reading the second paper. The second paper has a lot of technical content. This may be useful for you in the future, should you be tasked with building a recommendation system. In the meantime, you can internalize the methodology that is being discussed, but you do not have to get too deep into the math.
A paper discussing claims of p-hacking in clinical trials. Do you believe the authors' claims that p-hacking is (for the most part) not a problem when linking trials across phases? In reference to COVID-19 vaccinations and the recent problems with Johnson and Johnson’s vaccine, do you have any questions or concerns you wish the companies would address? From a data science perspective, do companies need to do anything differently in the future?
Readings (main paper to discuss)
Optional readings: There are many readings that can be considered relevant to this discussion. Here is a list:
A position paper exploring DS computing technologies and devices of the Internet of Things. Many aspects are discussed in the paper; you can pick just a few, or one area, to discuss. IOT has been one of the most hyped concepts for over a decade now. Is this warranted? Is IOT Data Science really different from regular DS? In what ways does it need “integration”?
Topic: Examples of Explainable Boosting
The paper covers the tradeoffs between accuracy and explainability in a prediction model. It claims that explainable boosting can give us the best of both worlds. Read the paper critically and try to answer the question: “Is explainable boosting really the solution that we have all been waiting for?”
Topic: Towards Fair, Transparent and Accountable AI
Many of our discussions this year have considered the ethical implications of AI and the many pitfalls inherent in the machine learning process. This week we will consider strategies to ensure that future AI systems are fair, transparent, and accountable, and ask whether this is even a realistic possibility.
The first reading points out the many ways in which an organization might try to make its AI system appear fair and unbiased, while actually operating in a biased manner. It also discusses some possible strategies to address this problem. The second reading proposes a way to address this problem head-on, by providing a performance “report card” along with the AI model when it is distributed. In preparing your responses, consider whether model cards are a solid strategy to address issues of fairness and transparency, and whether you think these issues can and will be addressed effectively in the future.
Topic: Predicting the future of the field
“I have traveled the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won’t last out the year.”
- Editor of Prentice Hall business books, 1957
It’s easy to look back and laugh at predictions of the future that turned out to be dead wrong. As we move into speculative readings that suggest hypothetical future directions for Data Science, how do we know which ones are plausible? This week you’ll read papers from the past, and compare them to a recent paper that looks ahead to the next decades of data.
One partner should read Reading 1, and one partner should read Reading 2. Both should read Reading 3. The “pro” group should identify historical predictions that turned out to be true, and argue in favor of predictions in the 2018 paper. The “con” group should identify historical predictions that were off-base, and argue against predictions in the 2018 paper.
The theme of the readings for Winter 2021 is the present.
Topic: Computational Psychiatry
We’ve come a long way since ELIZA! ELIZA was an AI psychiatrist that took great advantage of natural language generation and Rogerian psychological methods. While it revolutionized computational assistants, it was very limited and had no knowledge base. Today there is a resurgence of fully data-driven / AI-driven methods and computationally mediated methods in psychotherapy. Two persistent problems exist, each at a different level of abstraction. First: how do we use data-driven natural language generation but with high-quality output and coherence (unlike GPT-3!)? Second: do we have evidence that we can adapt psychotherapy methods in a deeper, more fundamental way than ELIZA did?
The pro side should argue that computers can solve these problems now. They can adopt the theory and solve the means of communication. The con side should be saying “hold on a minute!” We may not want to do this, and there are major reasons we don’t have good enough technology anyway.
Topic: Learning Bayesian Networks/Learning with Bayesian Networks
Bayesian Networks (Bayes Nets) are compact representations of joint probability distributions over a set of discrete random variables (each with a finite number of values). In a variety of settings Bayesian Networks can be used to describe relationships between the random variables in a succinct and easy-to-manipulate way. Bayesian Networks can be learned from data. In turn, a Bayesian Network can be used to simulate/synthesize a dataset.
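To make the "compact representation" and "simulate a dataset" ideas concrete, here is a minimal sketch in Python. The three-variable network (Rain -> WetGrass <- Sprinkler) and all of its probabilities are made up for illustration and do not come from the assigned paper. The joint distribution is factorized along the network structure, and a synthetic dataset is drawn by sampling parents before children.

```python
import random
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler.
# All numbers below are illustrative assumptions.
P_RAIN = 0.2
P_SPRINKLER = 0.3
# P(WetGrass = True | Rain, Sprinkler)
P_WET = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.80,
    (False, False): 0.05,
}


def bernoulli(p, value):
    """Probability that a binary variable with P(True) = p takes `value`."""
    return p if value else 1.0 - p


def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet), factorized along the network structure."""
    return (bernoulli(P_RAIN, rain)
            * bernoulli(P_SPRINKLER, sprinkler)
            * bernoulli(P_WET[(rain, sprinkler)], wet))


def sample(rng):
    """Ancestral sampling: draw the parents first, then the child."""
    rain = rng.random() < P_RAIN
    sprinkler = rng.random() < P_SPRINKLER
    wet = rng.random() < P_WET[(rain, sprinkler)]
    return rain, sprinkler, wet


# The factorized joint is a proper distribution over all 8 assignments.
total = sum(joint(*assignment) for assignment in product([True, False], repeat=3))

# Synthesize a small dataset from the network.
rng = random.Random(0)
dataset = [sample(rng) for _ in range(5)]
```

This toy network needs 6 numbers (1 + 1 + 4) where a full joint table over three binary variables needs 7 free parameters. The gap grows quickly as variables are added, provided each variable has only a few parents.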
Your assignment for Week 8 is to read one of the most influential papers on the topic of learning Bayesian Networks, and to gain some insight about the potential uses of Bayesian Networks in Machine Learning/Data Science. The paper is more technical than the conversation that we want to have, but it is a foundational paper on the subject.
The “pro” side should discuss the uses of Bayesian Networks for data science applications - situations when building Bayesian Networks from data is a good way to proceed.
The “con” side should discuss the limitations of the use of Bayesian Networks, and the difficulties that are associated with their use.
Note: only one paper is assigned. The paper was written in 1996, but was updated in 2020. Unfortunately, the 2020 version I found has no bibliography. So, two links are provided below. I recommend that you read the 2020 version, but use the 1996 version for bibliographic references.
Topic: Unsupervised Clustering
In Machine Learning, we talk a lot about prediction methods, where the relative “success” of any given algorithm is easily measurable. What happens when we start applying unsupervised methods, such as clustering, where there is no clear “right” answer?
The “pro” side should argue in favor of the central thesis of the paper; namely, that there is no way to evaluate a clustering method except in the context of how the results will be used. The “con” side should argue against this thesis, in support of some of the evaluation metrics that are critiqued in the paper.
Topic: The Reproducibility Crisis
Results in science and social science are published, then turn out to be false. How serious is this problem? The “pro” side will argue that this is a crisis that undermines public trust in science, while the “opposition” will argue that the reproducibility crisis is overblown.
Topic: Dealing With Small Data
On the heels of some of our previous conversations about data science hype often going hand-in-hand with big data hype, it seemed appropriate to spend a little more explicit time on non-big data situations. This is a popular blog and probably worth an exploration beyond this article. There could actually be quite a bit to unpack with this reading. I (Glanz) challenge the “supporters” and “opposers” to really go at it with this one…Here are some starting questions, but don’t feel restricted to just these.
Topic: Underspecification of Neural Network Models
A machine learning pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. The paper discusses how this leads to diminished accuracy on various real-world verification sets, and presents possible solutions to the problem.
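A toy illustration of the definition, as a sketch under strong simplifying assumptions (it is not an experiment from the paper): in the training domain two binary features are identical copies of each other, so a predictor that relies on either one scores equally well on held-out data. Once the second feature stops tracking the first in a hypothetical deployment domain, the two predictors diverge. All names and the data-generating process are invented for this example.

```python
import random


def make_data(n, rng, shifted):
    """Rows of (x1, x2, y). The label y always equals x1.

    In the training domain x2 is a copy of x1; in the shifted
    domain x2 is an independent coin flip.
    """
    rows = []
    for _ in range(n):
        x1 = rng.randint(0, 1)
        x2 = rng.randint(0, 1) if shifted else x1
        rows.append((x1, x2, x1))
    return rows


def predictor_a(x1, x2):
    """Relies on the feature that actually determines the label."""
    return x1


def predictor_b(x1, x2):
    """Relies on the feature that is only correlated with the label in training."""
    return x2


def accuracy(predictor, rows):
    return sum(predictor(x1, x2) == y for x1, x2, y in rows) / len(rows)


rng = random.Random(0)
held_out = make_data(1000, rng, shifted=False)   # same domain as training
deployment = make_data(1000, rng, shifted=True)  # hypothetical shifted domain

acc_a_in = accuracy(predictor_a, held_out)
acc_b_in = accuracy(predictor_b, held_out)
acc_a_shift = accuracy(predictor_a, deployment)
acc_b_shift = accuracy(predictor_b, deployment)

print("held-out:  ", acc_a_in, acc_b_in)
print("deployment:", acc_a_shift, acc_b_shift)
```

Held-out accuracy alone cannot distinguish the two predictors here, which is the sense in which the pipeline that produced them would be underspecified.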
Topic: Data vs. Theory
Intro to reading: In 2008, Chris Anderson, then editor-in-chief of Wired Magazine, published a provocative editorial entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” He essentially argues that traditional methods of scientific discovery and business strategy are inferior in comparison to what can be achieved with data mining on large-scale datasets. This is a fundamentally attractive idea – that through big data, we can leave behind human-constructed models and theories about the way the world works, and instead discover directly from the data all the answers we need. There are clearly many examples of where data-driven methods have prevailed over traditional hand-crafted alternatives, and all of you must “believe in data” somewhat based on your choice of academic field to study! But, what is missing in Anderson’s discussion of the promise and successes of big data? Is data really all we need? And when might data lead us astray?
Note that this article has almost 2,500 citations on Google Scholar and is widely discussed in followup blog posts and magazine articles – you might want to search a bit to find some different perspectives on this bold idea.
Topic: Bias and Fairness of Rankings and Recommendations
Introduction to Readings: Rankings and recommendation systems are everywhere. We take them for granted. We base a lot of big and small decisions on the results presented to us. The readings below share several themes. At their heart they are examples of models and systems that have unintended and possibly detrimental impacts on society. When you are reading them, please think about things the original designers could/should have done differently. Put yourselves in their position, and think about whether you would have foreseen or anticipated any of the results or problems. Think about how you can design and study the many different models that recommend and rank items for you. Did the paper (third link) present a reasonable approach to studying a black box? What faults do you find, if any? What could be done with their findings? What did the designers of both the studies and the original systems do well (i.e., think about the positive as well as the negative)?
This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on Thursdays. There are two sessions: at 10 AM and at 3 PM. You only need to attend one of these sessions. Each week there will be some readings relevant to data science, which we will discuss for 30 minutes. Afterwards, individual fellows will give updates on their research projects.
Please enroll in STAT 400 for 1 unit, on a CR/NC grading basis.
To earn a CR grade, you are expected to:
The theme of the readings for Fall 2020 is the past.
Hypothesis testing is a core part of many statistics classes. But where did the ideas such as the p-value, Type I error, and power come from? This reading reviews the chaotic history of hypothesis testing in the 20th century.
We will look at how hypothesis testing has influenced other disciplines and the controversy this has caused.
We will look at the Frequentist vs. Bayes debate. Next week, you will be randomly assigned to defend either frequentism or Bayesianism. Please come prepared to defend both, although we want you to take a side in your Slack post this week.
Here are some additional readings that might be of interest.
A survey of the history of the connection between early statistics and the eugenics movement.
Bonus reading from last week: A metaphor for the difference between randomness and unknown-ness, and the consequences in Bayesian analysis.
We will examine the history of AI winters.
We will discuss the Turing Test and whether machines can be intelligent.
Questions to consider:
No reading. Focus group with Lubi.
Main reading: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States, 2012. http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/bigdatawhitepaper.pdf
and The Claremont Report on Database Research, Communications of the ACM, Vol. 52, No. 8, 2009 http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/claremont-report2009.pdf
Given the timeframe there is no expectation that any supplemental materials will be read, but I want them available in a single place.
reading: D. I. HOLMES and R. S. FORSYTH, “The Federalist Revisited: New Directions in Authorship Attribution” PDF
In 1964 Mosteller and Wallace showed how statistics (and Bayesian analysis) can be applied to the problem of authorship attribution, and by implication stylometry (understanding, measuring and detecting “style”). Holmes and Forsyth built on that work in the 1990s, laying the foundations for more computationally intensive methods of behavioral analysis that have major implications for our lives in the 2020s. Here are some motivating questions for the discussion this week: