Content
for learners: a stub notebook to get you started can be obtained from the lesson01 repo.
for instructors: the video script is available here.
define regression as a very general concept in data science and statistics
describe the 3 steps in regression: data, model optimisation/fit and prediction on new data
explain that regression provides a result with uncertainty (inductive inference)
use the mean squared error (MSE) to quantify the uncertainty/error of a regression result
use a least squares regression for a linear function with sklearn
describe the PPDAC cycle of data science
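The three steps named above (data, fit, prediction on new data) can be sketched with sklearn roughly as follows. This is a minimal illustration, not the lesson's reference solution: the data are synthetic, and the variable names (`x`, `y`, `x_new`) and the underlying line `y = 2x + 1` are assumptions made only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: data. Synthetic points scattered around y = 2x + 1 (illustrative only).
rng = np.random.default_rng(seed=42)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# sklearn expects a 2D feature array of shape (n_samples, n_features).
X = x.reshape(-1, 1)

# Step 2: fit the model (ordinary least squares for a linear function).
model = LinearRegression()
model.fit(X, y)

# Quantify the error of the fit with the mean squared error (MSE).
y_hat = model.predict(X)
mse = mean_squared_error(y, y_hat)

# Step 3: predict on new data that was not part of the fit.
x_new = np.array([[11.0], [12.0]])
y_new = model.predict(x_new)

print(f"slope={model.coef_[0]:.3f}, intercept={model.intercept_:.3f}, MSE={mse:.3f}")
print("predictions:", y_new)
```

The fitted slope and intercept will be close to, but not exactly, the values used to generate the data, which is the uncertainty the MSE summarises.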
The following questions help learners reflect on the content of the videos. Answer at least one question. Ideally, answer these questions as a team.
Question 1
The order of the steps below has been mixed up. Which option shows the correct order?
collect training data, compute accuracy, predict new data, fit training data
compute accuracy, collect training data, predict new data, fit training data
collect training data, fit training data, compute accuracy, predict new data
collect training data, predict new data, fit training data, compute accuracy
Question 2
The least squares method for an input data pair `x` and `y` derives its name from the fact that it …
Minimizes the sum of the product of x*y
Minimizes the sum of the absolute difference between y and the predicted y_hat
Minimizes the sum of the squared difference between y and the predicted y_hat
Minimizes the sum of y**2 and x**2
Question 3
NaN stands for not-a-number. When loading a data set with `pandas`, NaN values occur in the loaded data because …
Input files contain string values in a column
Computational problems occurred, such as computing the square root of a negative number
Data could not be parsed correctly when reading input files into memory
There was no internet connection
Inspired by the sustainability math project, perform a linear regression on the following data sets:
Arctic Ice Data
Data source: http://sustainabilitymath.org/excel/ArcticIceDataMonth-R.csv
Content: average Arctic Ice Extent (in millions of km^2) from 1979 to present by month.
Task: perform a linear regression for the months March (winter peak month) and September (summer low month) for the entire given time period (40 years)
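One possible way to structure this task is a small helper that fits one column against the year. The sketch below is hedged: the helper `fit_trend` is hypothetical, the column names `year`, `March`, and `September` are assumptions about the table layout (inspect the real file with `df.head()` first), and the numbers are synthetic placeholders, not real ice-extent measurements.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_trend(df, x_col, y_col):
    """Fit y = slope * x + intercept by least squares; return (slope, intercept)."""
    # Drop rows with missing values so the fit does not fail on NaN.
    clean = df[[x_col, y_col]].dropna()
    model = LinearRegression()
    model.fit(clean[[x_col]].to_numpy(), clean[y_col].to_numpy())
    return float(model.coef_[0]), float(model.intercept_)


# Synthetic placeholder table (NOT real data); column names are assumptions.
years = np.arange(1979, 2019)
df = pd.DataFrame(
    {
        "year": years,
        "March": 16.0 - 0.05 * (years - 1979),
        "September": 8.0 - 0.10 * (years - 1979),
    }
)

# One regression per month of interest.
trends = {month: fit_trend(df, "year", month) for month in ["March", "September"]}
for month, (slope, intercept) in trends.items():
    print(f"{month}: slope={slope:.3f} per year, intercept={intercept:.1f}")
```

To work on the real data, replace the placeholder table with `pd.read_csv(...)` pointed at the data source above and adapt the column names to what the file actually contains.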
World Grain
Data source: data on the production, end-of-year stocks, and consumption of different grains in the US can be downloaded here:
Content: grain production, consumption, and ending stocks, in total and per capita.
Task: conduct a linear regression of grain production/consumption/stock versus time for a grain of your choice (60 years)
Hourly Wage by Race and Gender
Data source: http://sustainabilitymath.org/excel/EPI-Wages-R.csv
Content: median and average hourly wages (in 2019 dollars) with categories of men and women by White, Black, and Hispanic.
Task: conduct a linear regression for wages earned versus time (47 years) for both men and women on this subset: reduced wages data set
In which year will equal pay be achieved? At what wage?
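Once two regression lines have been fitted (one for men, one for women), the equal-pay question reduces to finding where the two lines cross. For lines $y = m_a x + b_a$ and $y = m_b x + b_b$, setting them equal gives $x = (b_b - b_a) / (m_a - m_b)$. The sketch below assumes both trends stay linear beyond the observed data; the helper `line_intersection` and the coefficients are hypothetical placeholders, not fitted results.

```python
def line_intersection(slope_a, intercept_a, slope_b, intercept_b):
    """Return (x, y) where y = slope_a*x + intercept_a meets y = slope_b*x + intercept_b."""
    if slope_a == slope_b:
        # Parallel lines never cross (or coincide everywhere).
        raise ValueError("lines are parallel; no unique intersection")
    x = (intercept_b - intercept_a) / (slope_a - slope_b)
    y = slope_a * x + intercept_a
    return x, y


# Placeholder coefficients (NOT fitted values); x is measured in years
# since the first observation, y in dollars per hour.
x_equal, wage_equal = line_intersection(0.25, 20.0, 0.5, 15.0)
print(f"lines cross after {x_equal:.1f} years at a wage of {wage_equal:.2f}")
```

With coefficients taken from fitted models, `slope` corresponds to `model.coef_[0]` and `intercept` to `model.intercept_`. Because the crossing point usually lies outside the observed time range, it is an extrapolation and carries more uncertainty than predictions inside the data.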