
Data Science Seminar

Grading

Please enroll in DATA 472 for 1 unit, on a CR/NC grading basis.

To earn a CR grade, you are expected to:

Readings

Winter 2022

Week 10 assignment Requirements Spec

The Week 10 assignment comes from the Spring 2015 CSC 366 course I taught. Our customer, Dmytro Marushkevich, heads a data science team at a marketing company that was called Rosetta at the time and is now called Sapient. Dmytro shared with us a real use case that he and his team had to work on, complete with real (although sanitized) data that the CSC 366 students used.

For this week we are splitting you into two teams of six people each. Both teams will have the same task:

  1. Read the requirements specification. It presents the data in service of a specific marketing goal that Rosetta has. It also presents the kinds of reports/data analysis the company wants to perform with the provided data.
  2. Build an Entity-Relationship model for the Rosetta dataset based on the provided descriptions. (Understandably, you may not get it completely right, and the document itself omits some potentially important information, but we want you to get a reasonable start.)
  3. On Wednesday, March 9, each team will present its proposed E-R model. We will have a short breakout discussion, followed by a general discussion of what you have learned from this exercise and some comments from Alex Dekhtyar and Lubomir Stanchev.

Week 9: March 2nd

In the next two weeks, we will discuss entity-relationship diagrams.

Readings:

Weeks 6 and 7: February 9 and 16

In the next two weeks, we will discuss transfer learning.

For week 6, read the following two summaries about transfer learning and come prepared to discuss. Also, bring a laptop as we will go through a transfer learning example in Google Colab together.

Be prepared to discuss the following simple questions:

We prepared a Google Colab notebook to demonstrate transfer learning using CNN features.

For week 7, your task is to prepare a new transfer learning experiment. Our suggested experiment is to investigate the ridiculous question, “Is a hot dog a sandwich?” If you wish to investigate this question, do the following:

You are also welcome to design your own experiment, perhaps addressing a more serious question :)

Come to seminar in week 7 ready to present and discuss your experiment!

Weeks 4 and 5: January 26, February 2

In the next two weeks, we will discuss tooling.

You will be paired with a partner to research one of two technologies for data science: R or Python.

The R group will research the advantages of:

The Python group will research the advantages of:

Come prepared to discuss and debate the advantages of your assigned technology.

Week 3: January 19

Before this class, working in the same groups of three as in the January 12 activity, pick two data science tools from the following list and compare and contrast them with respect to reproducibility. What are the pros? What are the cons? Is one of them clearly better than the other? Why?

Come to class prepared to give a 5-minute presentation on what you come up with.

Week 2: January 12

Our first unit is on reproducibility. Please read the following before this class and come prepared to discuss them both in terms of their content and how you feel they relate to your experiences and knowledge of data science:

Week 1: January 5

Fall 2021

The theme of the readings for Fall 2021 is the past.

Week 1: September 22

Week 2: September 29

Hypothesis testing is a core part of many statistics classes. But where did ideas such as the p-value, Type I error, and power come from? This reading reviews the turbulent history of hypothesis testing in the 20th century.

Required Reading:

Additional Resources:

Assignment:

In your assigned group of 6 students:

Week 3: October 6

In the early to mid 1900s, the field of eugenics - the idea that some groups or people are inherently genetically inferior - was a mainstream and well-respected scientific pursuit. Many of the foundational ideas of classic statistics were developed in conjunction with eugenics applications. In the modern era, now that these ideas have been rejected as racist/classist/etc, how should we regard the influential people and ideas that came out of that movement?

Required Reading:

Assignment:

We ask you to think carefully about the practice of re-contextualizing scientific contributions in light of modern ethics. Questions to consider:

In your assigned group of 6 students:

Week 4: October 13

We will look at the Frequentist vs. Bayes debate! This debate pertains not only to how you think about and do statistics and data science, but also to some of your everyday ways of thinking. Please come prepared to defend both sides, although you will be asked to take a side in class.

Required Reading:

Here are some additional readings that might be of interest.

Assignment:

We ask you to think carefully about these two sides: Frequentist and Bayes. Questions to consider:

In your assigned group of 6 students:

Week 5: October 20

We will read a classic paper by Leo Breiman entitled “Statistical Modeling: The Two Cultures.” The PDF linked below also contains comments by leading statisticians and data scientists which will give you more ideas as you prepare your presentations.

Required Reading:

This paper by Efron can be seen as an update to Breiman’s paper 20 years later – it is not required reading but might be interesting for you to read over:

Assignment:

In your assigned group of 6 students:

Week 6: October 27

We will discuss the past and future of Artificial Intelligence (AI).

Required Reading:

Assignment:

In your assigned group of 6 students:

Week 7: November 3

We are starting a three-week stretch devoted to the ideas and technologies behind working with big data. Our first discussion is about relational databases, their history and their role in bringing forth our ability to work with large quantities of data. To that end, you will read two sets of articles.

The first set of articles gives you some historic perspective on the development of the relational data model and relational databases. The articles in this set are:

The Codd papers provide historic descriptions of the ideas behind modern relational databases. The blog post puts some of the information contained in these papers into the overall context.

The second set of papers comes from a sequence of meetings held by the database research and industry community over the late 20th and early 21st centuries. The meetings served as the community's reflection points on the progress of the field of relational databases (and the field of databases in general) over the years. They also attempted to identify future challenges that database technology and the database community needed to meet. The papers are co-authored by a who’s who of the database management systems area. The papers in this series are:

This is a lot of reading. Please read the instructions below carefully.

The roles for this week’s assignment are:

Week 8: November 10

We will discuss the rise and “fall” of Hadoop… and the future of Hadoop.

Required Reading:

Everyone should read in detail the following:

Assignment:

The “decline” of Hadoop is documented, and so it is not my intention to have you discuss why this happened. It is an interesting topic but not our focus. I’ll share a bit of my own experiences with Hadoop at the beginning of seminar since it came about during my graduate school days, and I’ve been along for the ride ever since.

In your assigned group of 6 students:

Week 9: November 17

Everyone needs to read the following:

In your assigned group of 6:

Everyone: be sure to look up any terms and products you read about that you may not be familiar with (because they are too old or discontinued). Teams need to be able to explain the terms in the papers that relate to their subject, if asked during their presentations.