Data Science Seminar (2020-2021)
This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on MWF.
- Reading discussions
- Discussions will be held in three breakout rooms, 6 students per room.
- Each student receives a designated role for each reading: “presenter”, “supporter”, or “opponent”. Each breakout room will have two presenters, two supporters, and two opponents.
- Discussions start with the presenters giving a succinct, fact-based summary of the reading material. Then the supporters will discuss what they liked in the reading material. After that, the opponents will critique the work. Following this scripted discussion, there will be additional time for back-and-forth between all students, and for faculty reflections on the reading material.
Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.
To earn a CR grade, you are expected to:
- attend all seminars and events
- complete readings and post a question or comment about the reading to the Slack channel
The theme of the readings for Spring 2021 is the future.
We will do an in-class activity in the seminar. To prepare for this activity, please take the Pew Political Typology Quiz and then fill out this form. Your answers to the form are anonymous. Please remember your political typology when you come to class.
One of the things missing from our Data Science discussions in DATA NNN coursework and beyond is recommender systems and recommendation engines. This is in part because quite a bit of the methodology that goes into recommendation engines comes from classification, regression, and clustering. But it is also somewhat unfortunate, because issues specific to recommender engines get ignored. This week you are reading two surveys on recommender systems.
- Recommender Systems Survey (Bobadilla et al., 2013), local copy here. The first survey is a general recommendation systems survey. It is somewhat old given that “future” is the general theme, but it is the most recent general survey that looked appropriate to me.
- [Deep Learning based Recommender System: A Survey and New Perspectives (Zhang et al., 2018)](https://arxiv.org/pdf/1707.07435.pdf), [local copy here](http://www.csc.calpoly.edu/~dekhtyar/ds/RecSys-DeepLearning-Survey.pdf). This is a newer and more in-depth technical survey documenting how deep learning techniques are used in the guts of recommender engines.
How to read these? The first survey is broad and not overly technical. Read it fully; it will give you the background and the setup for reading the second paper. The second paper has a lot of technical content, which may be useful to you in the future, should you be tasked with building a recommendation system. In the meantime, you can internalize the methodology being discussed without getting too deep into the math.
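If you want a concrete anchor before diving into the surveys, here is a minimal sketch of item-based collaborative filtering, one of the classic techniques both surveys cover. The ratings table and names are made up for illustration; this is not code from either paper.

```python
# Item-based collaborative filtering on a toy user -> item ratings table.
# All users, items, and ratings below are made up for illustration.
from math import sqrt

ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 2},
    "eve": {"B": 5, "C": 2, "D": 1},
}

def cosine(i, j):
    """Cosine similarity of items i and j over users who rated both."""
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    if not users:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in users)
    ni = sqrt(sum(ratings[u][i] ** 2 for u in users))
    nj = sqrt(sum(ratings[u][j] ** 2 for u in users))
    return dot / (ni * nj)

def predict(user, item):
    """Score an unseen item by a similarity-weighted average of the
    user's existing ratings."""
    sims = [(cosine(item, j), r) for j, r in ratings[user].items()]
    den = sum(s for s, _ in sims if s > 0)
    num = sum(s * r for s, r in sims if s > 0)
    return num / den if den else 0.0
```

Much of what the surveys discuss is what this simple scheme ignores: cold-start users and items, implicit feedback, scalability, and evaluation.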
A paper discussing claims of p-hacking in clinical trials. Do you believe the authors’ claim that p-hacking is (for the most part) not a problem when linking trials across phases? In reference to COVID-19 vaccinations and the recent problems with Johnson & Johnson’s vaccine, do you have any questions or concerns you wish the companies would address? From a data science perspective, do companies need to do anything different in the future?
Readings (main paper to discuss)
- P-hacking in clinical trials and how incentives shape the distribution of results across phases
Optional readings: There are many readings that can be considered relevant to this discussion. Here is a list:
- Looking beyond COVID-19 vaccine phase 3 trials
A position paper exploring DS computing technologies and devices of the Internet of Things. There are many aspects discussed in the paper; you can pick just one or a few areas to discuss. IoT has been one of the most hyped concepts for over a decade now. Is this warranted? Is IoT Data Science really different from regular DS? In what ways does it need “integration”?
- Next Grand Challenges: Integrating the Internet of Things and Data Science
Topic: Examples of Explainable Boosting
The paper covers the tradeoffs between accuracy and explainability in a prediction model. It claims that explainable boosting can give us the best of both worlds. Read the paper critically and try to answer the question: “Is explainable boosting really the solution that we have all been waiting for?”
- Robert Kubler, The Explainable Boosting Machine: As accurate as gradient boosting, as interpretable as linear regression.
Topic: Towards Fair, Transparent and Accountable AI
Many of our discussions this year have considered the ethical implications of AI and the many pitfalls inherent in the machine learning process. This week we will consider strategies to ensure that future AI systems are fair, transparent, and accountable, and ask whether this is even a realistic possibility.
- Hutson, M. It’s Too Easy to Hide Bias in Deep-Learning Systems. IEEE Spectrum, January 2021.
- Mitchell et al., Model Cards for Model Reporting. FAT* ‘19: Proceedings of the Conference on Fairness, Accountability, and Transparency, January 2019.
The first reading points out the many ways in which an organization might try to make their AI system appear fair and unbiased, while actually operating in a biased manner. It also discusses some possible strategies to address this problem. The second reading proposes a way to address this problem head on, by providing a performance “report card” along with the AI model when it is distributed. In preparing your responses, consider whether model cards are a solid strategy to address issues of fairness and transparency, and whether you think these issues can and will be addressed effectively in the future.
Topic: Predicting the future of the field
“I have traveled the length and breadth of this country and talked with the best people, and I can assure you that data processing is a fad that won’t last out the year.”
- Editor of Prentice Hall business books, 1957
It’s easy to look back and laugh at predictions of the future that turned out to be dead wrong. As we move into speculative readings that suggest hypothetical future directions for Data Science, how do we know which ones are plausible? This week you’ll read papers from the past and compare them to a recent paper that looks ahead to the next decades of data.
- On the Future of Statistics (1942)
- The Future of Statistics as a Discipline (1981)
- The Future of Statistics and Data Science (2018)
One partner should read Reading 1, and one partner should read Reading 2. Both should read Reading 3. The “pro” group should identify historical predictions that turned out to be true, and argue in favor of predictions in the 2018 paper. The “con” group should identify historical predictions that were off-base, and argue against predictions in the 2018 paper.
The theme of the readings for Winter 2021 is the present.
Topic: Computational Psychiatry
We’ve come a long way since ELIZA! ELIZA was an AI psychiatrist that took great advantage of natural language generation and Rogerian psychological methods. While it revolutionized computational assistants, it was very limited and had no knowledge base. Today there is a resurgence of fully data-driven / AI-driven methods and computationally mediated methods in psychotherapy. Two persistent problems exist, each at a different level of abstraction. First, how do we use data-driven natural language generation while maintaining high-quality output and coherence (unlike GPT-3!)? Second, do we have evidence that we can adapt psychotherapy methods in a deeper, more fundamental way than ELIZA did?
The pro side should argue that computers can solve these problems now: they can adopt the therapeutic theory and master the means of communication. The con side should be saying “hold on a minute!” We may not want to do this, and there are major reasons we don’t have good enough technology anyway.
- Self-Learning Architecture for Natural Language Generation
- Towards a Neural Model of Bonding in Self-Attachment
Topic: Learning Bayesian Networks/Learning with Bayesian Networks
Bayesian Networks (Bayes Nets) are compact representations of joint probability distributions over a set of discrete random variables (each with a finite number of values). In a variety of settings, Bayesian Networks can be used to describe relationships between the random variables in a succinct, easy-to-manipulate way. Bayesian Networks can be learned from data. In turn, a Bayesian Network can be used to simulate/synthesize a dataset.
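As a warm-up, here is a minimal sketch of the “simulate/synthesize a dataset” direction (the three-node network and all probabilities are toy numbers of my own, not an example from the tutorial): forward sampling from a hand-built network, then checking the sampled marginal against the analytic one.

```python
# Forward sampling from a tiny hand-built Bayesian network:
# Rain and Sprinkler are independent parents of WetGrass.
# All probabilities are toy numbers chosen for illustration.
import random

P_RAIN = 0.2
P_SPRINKLER = 0.4
P_WET = {  # P(WetGrass = True | Rain, Sprinkler)
    (True, True): 0.99,
    (True, False): 0.80,
    (False, True): 0.90,
    (False, False): 0.00,
}

def sample(rng):
    """Draw one joint sample in topological order: parents before children."""
    rain = rng.random() < P_RAIN
    sprinkler = rng.random() < P_SPRINKLER
    wet = rng.random() < P_WET[(rain, sprinkler)]
    return rain, sprinkler, wet

rng = random.Random(0)
draws = [sample(rng) for _ in range(20000)]
est = sum(w for _, _, w in draws) / len(draws)
# Analytic marginal: P(Wet) = .2*.4*.99 + .2*.6*.8 + .8*.4*.9 = 0.4632,
# which the Monte Carlo estimate should approach.
```

Learning goes in the opposite direction: given a table of such samples, estimate the structure and the conditional probability tables, which is exactly what the Heckerman tutorial covers.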
Your assignment for Week 8 is to read one of the most influential papers on the topic of learning Bayesian Networks, and to gain some insight about the potential uses of Bayesian Networks in Machine Learning/Data Science. The paper is more technical than the conversation that we want to have, but it is a foundational paper on the subject.
The “pro” side should discuss the uses of Bayesian Networks for data science applications - situations in which building Bayesian Networks from data is a good way to proceed.
The “con” side should discuss the limitations of the use of Bayesian Networks, and the difficulties that are associated with their use.
Note: only one paper is assigned. The paper was written in 1996, but was updated in 2020. Unfortunately, the 2020 version I found has no bibliography. So, two links are provided below. I recommend that you read the 2020 version, but use the 1996 version for bibliographic references.
- David Heckerman: A Tutorial on Learning With Bayesian Networks - 2020 Version
- David Heckerman: A Tutorial on Learning With Bayesian Networks - 1996 Version (07280548.pdf)
Topic: Unsupervised Clustering
In Machine Learning, we talk a lot about prediction methods, where the relative “success” of any given algorithm is easily measurable. What happens when we start applying unsupervised methods, such as clustering, where there is no clear “right” answer?
The “pro” side should argue in favor of the central thesis of the paper; namely, that there is no way to evaluate a clustering method except in the context of how the results will be used. The “con” side should argue against this thesis, in support of some of the evaluation metrics that are critiqued in the paper.
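One concrete way to see the evaluation problem (a toy illustration of mine, not an example from the paper): internal fit metrics such as within-cluster sum of squares always improve as clusters are split further, whether or not the extra clusters mean anything.

```python
# Within-cluster sum of squares (WCSS) for a fixed partition of 1-D points.
# Toy data: two obvious groups near 1.1 and 5.1.
def wcss(points, clusters):
    total = 0.0
    for cluster in clusters:
        vals = [points[i] for i in cluster]
        mean = sum(vals) / len(vals)
        total += sum((v - mean) ** 2 for v in vals)
    return total

points = [1.0, 1.2, 1.1, 5.0, 5.2, 5.1]
two = [[0, 1, 2], [3, 4, 5]]      # the "natural" two-group structure
three = [[0], [1, 2], [3, 4, 5]]  # an arbitrary refinement of it
# wcss(points, three) < wcss(points, two): the refinement scores
# strictly better, yet tells us nothing new about the data.
```

Splitting a cluster can never increase WCSS, so the metric alone always favors “more clusters”; deciding whether the extra clusters are useful requires knowing how the results will be used, which is the thesis the “pro” side is defending.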
Topic: The Reproducibility Crisis
Results in science and social science are published, then turn out to be false. How serious is this problem? The “pro” side will argue that this is a crisis that undermines public trust in science, while the “opposition” will argue that the reproducibility crisis is overblown.
- Baker. Is there a reproducibility crisis?
- Simmons et al. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant
Topic: Dealing With Small Data
On the heels of some of our previous conversations about data science hype often going hand-in-hand with big data hype, it seemed appropriate to spend a little more explicit time on non-big-data situations. This is a popular blog and probably worth an exploration beyond this article. There could actually be quite a bit to unpack with this reading. I (Glanz) challenge the “supporters” and “opponents” to really go at it with this one… Here are some starting questions, but don’t feel restricted to just these.
- What, if any, explicit and/or implicit assumptions is the author making when proposing each tip?
- What are the pros and cons of each tip?
- If your dataset is small and there’s not much to do about the size, can you really call your work with it “data science”?
Topic: Underspecification of Neural Network Models
A machine learning pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. The paper discusses how this results in diminished accuracy on different real-world verification sets and possible solutions to address the problem.
- D’Amour, A. et al. (2020). Underspecification Presents Challenges for Credibility in Modern Machine Learning.
Topic: Data vs. Theory
Intro to reading: In 2008, Chris Anderson, then editor-in-chief of Wired Magazine, published a provocative editorial entitled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete.” He essentially argues that traditional methods of scientific discovery and business strategy are inferior to what can be achieved with data mining on large-scale datasets. This is a fundamentally attractive idea – that through big data, we can leave behind human-constructed models and theories about the way the world works, and instead discover directly from the data all the answers we need. There are clearly many examples of where data-driven methods have prevailed over traditional hand-crafted alternatives, and all of you must “believe in data” somewhat based on your choice of academic field to study! But, what is missing in Anderson’s discussion of the promise and successes of big data? Is data really all we need? And when might data lead us astray?
Note that this article has almost 2,500 citations on Google Scholar and is widely discussed in followup blog posts and magazine articles – you might want to search a bit to find some different perspectives on this bold idea.
- Anderson, C. (2008). The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine.
Topic: Bias and Fairness of Rankings and Recommendations
Introduction to Readings: Rankings and recommendation systems are everywhere. We take them for granted. We base a lot of big and small decisions on the results presented to us. The readings below share several themes. At their heart, they are examples of models and systems that have unintended and possibly detrimental impacts on society. When you are reading them, please think about things the original designers could/should have done differently. Put yourselves in their position, and think about whether you would have foreseen or anticipated any of the results or problems. Think about how you can design and study the many different models that recommend and rank items for you. Did the paper (third link) present a reasonable approach to studying a black box? What faults do you find, if any? What could be done with their findings? What did the designers of both the studies and the original systems do well (i.e., think about the positive as well as the negative)?
- Weapons of Math Destruction (Chapter)
- Questioning the Fairness of Targeting Ads Online (Brief Summary)
- Original article from the above brief summary
This is a seminar for the Cal Poly Data Science Fellows. We will meet weekly over Zoom on Thursdays. There are two sessions: at 10 AM and at 3 PM. You only need to attend one of these sessions. Each week there will be some readings relevant to data science, which we will discuss for 30 minutes. Afterwards, individual fellows will give updates on their research projects.
Please enroll in STAT 400 for 1 unit, on a CR/NC grading basis.
To earn a CR grade, you are expected to:
- attend all seminars
- complete readings and post a question or comment about the reading to the Slack channel
The theme of the readings for Fall 2020 is the past.
Week 1: September 17
Week 2: September 24
Hypothesis testing is a core part of many statistics classes. But where did the ideas such as the p-value, Type I error, and power come from? This reading reviews the chaotic history of hypothesis testing in the 20th century.
- Chapters 10 and 11 from Salsburg, D. (2001). The lady tasting tea: How statistics revolutionized science in the twentieth century. Macmillan.
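The book’s title experiment also makes a nice two-line computation (my sketch of the standard setup Fisher described: 8 cups, exactly 4 with milk poured first, and the lady must identify all 4 by taste):

```python
# Fisher's lady-tasting-tea p-value. Under the null hypothesis of pure
# guessing, every choice of 4 cups out of 8 is equally likely, and
# exactly one choice is fully correct.
from math import comb

p_all_correct = 1 / comb(8, 4)  # comb(8, 4) = 70 equally likely guesses
print(round(p_all_correct, 4))  # 0.0143
```

A chance of about 1.4% of guessing perfectly is the prototype of the p-value: the probability of a result at least this extreme if the null hypothesis (no tasting ability) is true.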
Week 3: October 1
We will look at how hypothesis testing has influenced other disciplines and the controversy this has caused.
- Cohen, J. (1994). The earth is round (p < .05). American psychologist, 49(12), 997.
- Gill, J. (1999). The insignificance of null hypothesis significance testing. Political research quarterly, 52(3), 647-674.
Week 4: October 8
We will look at the Frequentist vs. Bayes debate. Next week, you will be randomly assigned to defend either frequentism or Bayesianism. Please come prepared to defend both, although we want you to take a side in your Slack post this week.
- New York Times Article: The Odds, Continually Updated
- Frequentist and Bayesian Approaches in Statistics
- Efron, B. (2005). Bayesians, frequentists, and scientists. Journal of the American Statistical Association, 100(469), 1-5.
- xkcd comic on Frequentists vs. Bayesians
Here are some additional readings that might be of interest.
- MIT Lecture Notes: Comparison of frequentist and Bayesian inference.
- Efron, B. (1986). Why isn’t everyone a Bayesian? The American Statistician, 40(1), 1-5.
- Andrew Gelman’s blog post: Why I Don’t Like Bayesian Statistics. (This is an April Fools’ Joke written by a famous Bayesian statistician. However, it contains some good ideas.)
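If you want a tiny numerical contrast to bring to the debate (toy coin-flip numbers of my own, not tied to any of the readings): the frequentist reports a point estimate and a confidence interval; the Bayesian reports a posterior, which under a uniform prior shrinks the estimate toward 1/2.

```python
# The same coin-flip data summarized two ways (toy numbers).
from math import sqrt

heads, n = 7, 10

# Frequentist: maximum-likelihood estimate and approximate 95% Wald interval.
p_hat = heads / n
se = sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)

# Bayesian: a uniform Beta(1, 1) prior gives a Beta(heads + 1, n - heads + 1)
# posterior; its mean (heads + 1) / (n + 2) is shrunk from 0.7 toward 0.5.
post_mean = (heads + 1) / (n + 2)
```

The two answers are numerically close here, which is part of the debate: when does the philosophical difference actually matter?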
Week 5: October 15
A survey of the history of the connection between early statistics and the eugenics movement.
- Scientific Priestcraft, from “Superior: The Return of Race Science”, chapter 3
Bonus reading from last week: A metaphor for the difference between randomness and unknown-ness, and the consequences in Bayesian analysis.
- The Boxer, the Wrestler, and the Coin Flip: A Paradox of Robust Bayesian Inference and Belief Functions Andrew Gelman, The American Statistician, May 2006, Vol. 60, No. 2
Week 6: October 22
We will examine the history of AI winters.
- Analyzing the Prospect of an Approaching AI Winter Sebastian Schuchman, May 3, 2019
Week 7: October 29
We will discuss the Turing Test and whether machines can be intelligent.
- Computing Machinery and Intelligence Alan Turing, Mind, October 1950, Vol. 59, No. 236
- Is the Brain a Good Model for Machine Intelligence? Rodney Brooks, Demis Hassabis, Dennis Bray, and Amnon Shashua, Nature, February 2012, Vol 482
Questions to consider:
- Is the Turing Test a good test of whether a machine is intelligent?
- Are today’s systems (like IBM Watson, Google’s image recognition systems, etc.) intelligent? Could a convolutional neural network be intelligent?
- In trying to achieve machine intelligence, should we try to mimic the brain or should we apply a pure engineering approach?
- Is the brain just a computer – could its functionality be replicated by a machine?
No reading. Focus group with Lubi.
Main reading: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States, 2012. http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/bigdatawhitepaper.pdf
and The Claremont Report on Database Research, Communications of the ACM, Vol. 52, No. 8, 2009 http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/claremont-report2009.pdf
- 2005: Lowell self-assessment. http://users.csc.calpoly.edu/~dekhtyar/560-Fall2014/papers/lowell-report2005.pdf
- 1998: Asilomar report. PDF.
- 1996: Strategic directions in database systems. PDF.
- 1995: Achievements and Opportunities. PDF.
- 1988: Future Directions in DBMS Research. PDF
Given the timeframe there is no expectation that any supplemental materials will be read, but I want them available in a single place.
Reading: D. I. Holmes and R. S. Forsyth, “The Federalist Revisited: New Directions in Authorship Attribution” PDF
In 1964, Mosteller and Wallace showed how statistics (and Bayesian analysis) can be applied to the problem of authorship attribution, and by implication to stylometry (understanding, measuring, and detecting “style”). Holmes and Forsyth built on that work in the 1990s, laying the foundations for more computationally intensive methods of behavioral analysis that have major implications for our lives in the 2020s. Here are some motivating questions for the discussion this week:
- What are the parameters for the authorship attribution problem?
- Do you think there’s a definitive style of writing associated with each person? Can that style be faked?
- How has computational stylistics changed in the recent decades?
- How does the 1964 work on Federalist Papers relate to your personal life today?
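To ground the discussion, here is a minimal sketch in the spirit of Mosteller and Wallace’s function-word approach (the two snippets are toy sentences of mine; the real study used long texts and marker words such as “upon”, which Hamilton used far more often than Madison): profile each text by the rate per 1000 words of a few function words.

```python
# Function-word profiles: rates per 1000 words of a few "content-free"
# marker words, which authors tend to use at stable, habitual rates.
from collections import Counter
import re

FUNCTION_WORDS = ["the", "of", "to", "and", "upon"]

def profile(text):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {w: 1000 * counts[w] / len(words) for w in FUNCTION_WORDS}

# Toy snippets invented for illustration, not actual Federalist text.
text_a = "the powers of the union are delegated to the people upon ratification"
text_b = "commerce and taxation fall to the states and the people and the courts"
# text_a uses "upon" while text_b never does; at scale, such rate
# differences are the evidence behind the attribution.
```

Modern computational stylistics replaces a handful of hand-picked words with large feature sets (character n-grams, syntax, embeddings), but the underlying idea of a stable stylistic fingerprint is the same.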