Homework 03 Due on 04/04/2022
Regression Analysis BA222 | Spring 2022
Instructions
1. Download the collegeDistance.csv data from QuestromTools/Resources/Data. There you should also find
the data dictionary.
2. All your work should be done in Python. Submit a Jupyter notebook (.ipynb file) as your solution.
3. Label all your answers using comments. For instance, use #Q1PC to denote question 1 part c.
4. All cells of your notebook should run without errors.
5. Don’t forget to name the file using the convention HW03–nameLastName where name is your first name
and LastName is your last name.
1. Hypothesis Test
(a) Use the distance variable to create a categorical variable with the following categories:
i. near: individuals with distance less than or equal to the 25th percentile.
ii. average: individuals with distance between the 25th and 75th percentiles.
iii. far: individuals with distance greater than or equal to the 75th percentile.
(b) Produce a bar chart of the average years of education for each category of the categorized
distance variable. Before conducting any formal test, do you think the differences in years of
education among the groups are the result of random sampling or are statistically significant? (Explain)
(c) Test the hypothesis that individuals who live "near" a university have different average years of
education.
(d) Test the hypothesis that individuals who live "far" from a university have different average years of
education.
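The categorization in part (a) and the comparison in part (c) can be sketched as follows. This is a minimal illustration on synthetic data, not the homework dataset: the column names `distance` and `education` are assumptions to be checked against the data dictionary, the name `dist_cat` and the helper `welch_t` are hypothetical, and comparing "near" against everyone else is only one possible reading of the hypothesis. The t statistic is computed by hand (Welch form, unequal variances) rather than with a library test.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for collegeDistance.csv (all values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({"distance": rng.exponential(2.0, size=500)})
df["education"] = 14.0 - 0.1 * df["distance"] + rng.normal(0.0, 1.5, size=500)

# Q1PA: near <= 25th percentile, far >= 75th percentile, average otherwise.
q25, q75 = df["distance"].quantile([0.25, 0.75])
df["dist_cat"] = np.select(
    [df["distance"] <= q25, df["distance"] >= q75],
    ["near", "far"],
    default="average",
)

# Q1PB: the group means that a bar chart would display.
group_means = df.groupby("dist_cat")["education"].mean()


def welch_t(a, b):
    """Welch t statistic for a difference in means (unequal variances)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return (a.mean() - b.mean()) / se


# Q1PC: "near" versus everyone else (one possible comparison group).
is_near = df["dist_cat"] == "near"
t_near = welch_t(df.loc[is_near, "education"], df.loc[~is_near, "education"])

print(group_means)
print("t statistic, near vs. rest:", round(float(t_near), 2))
```

In a large sample, an absolute t statistic above roughly 2 is the usual rule of thumb for rejecting equal means at the 5% level; an exact p-value requires the t distribution.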
2. Univariate Regression
(a) Produce a scatter plot with the education variable on the y-axis and the distance variable on the
x-axis. Describe the statistical relation between the two variables.
(b) Run a regression of education on distance. Interpret the intercept and slope parameters.
(c) Are the estimated parameters statistically significant? How can you tell?
(d) What percentage of the variation in years of education can be explained by the variation in distance
alone?
(e) Produce a graph of the regression fit.
(f) Calculate the fitted value for years of education assuming a distance of 10.
(g) Do you think there is a causal relation between distance and years of education? (Explain)
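The quantities asked for in parts (b), (c), (d), and (f) can be sketched with NumPy alone. This is an illustration on synthetic data under the classical OLS assumptions, not a prescribed method; the variable names and the simulated coefficients are made up, and a statistics library would report the same numbers in its regression summary.

```python
import numpy as np

# Synthetic data for illustration only (not the homework dataset).
rng = np.random.default_rng(1)
distance = rng.exponential(2.0, size=400)
education = 14.0 - 0.08 * distance + rng.normal(0.0, 1.5, size=400)

# Q2PB: OLS of education on distance; for deg=1, polyfit returns [slope, intercept].
slope, intercept = np.polyfit(distance, education, deg=1)

# Q2PC: classical standard error of the slope and its t statistic.
fitted = intercept + slope * distance
resid = education - fitted
n = distance.size
sigma2 = resid @ resid / (n - 2)  # residual variance with n - 2 degrees of freedom
se_slope = np.sqrt(sigma2 / np.sum((distance - distance.mean()) ** 2))
t_slope = slope / se_slope

# Q2PD: share of the variation in education explained by distance.
r_squared = 1.0 - (resid @ resid) / np.sum((education - education.mean()) ** 2)

# Q2PF: fitted value at distance = 10.
pred_at_10 = intercept + slope * 10.0

print("intercept:", round(float(intercept), 3), "slope:", round(float(slope), 3))
print("t statistic of slope:", round(float(t_slope), 2))
print("R-squared:", round(float(r_squared), 4))
print("fitted education at distance 10:", round(float(pred_at_10), 2))
```

With a single regressor, the R-squared equals the squared correlation between the two variables, which gives a quick consistency check on the output.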
3. Multivariate Regression (Part I)
(a) Produce a table with two columns. The first column should include the correlation between education
and all the other variables in the dataset and the second column should include the correlation
between distance and all the other variables in the dataset.
(b) Produce at least three alternative visualizations of the statistical relation between distance and the
rest of the variables in the dataset. Do you think some relations may be non-linear? If yes, can
you rely solely on the correlation coefficient to establish a statistical relation among the variables?
(Explain)
(c) Which omitted variables do you think are causing the estimated regression in problem 2 to be biased?
Rank them from most likely to least likely.
(d) Introduce these variables to the regression model sequentially. That is, you are going to estimate
many models. The first one is simply the original model plus one additional variable, call it model
2. The next regression is model 2 plus one additional variable. Continue this process until you run
out of regressors.
(e) Compare the estimated slope parameter for distance to the one estimated in the univariate model.
Why are the estimated parameters different?
(f) Are there any variables not affecting the sign of the estimated slope parameter for distance? Should
you keep them in the model or remove them? (Explain)
(g) Which variables are not statistically significant? Should you keep them in the model or remove them?
(Explain)
(h) Calculate the average years of education for two individuals:
i. Mario: male, other ethnicity, score of 45, neither parent went to college, the family owns
a home, the location is urban, with an unemployment rate of 6%, average industrial wage of $9.5,
tuition of 0.7, low income, from the west region.
ii. Ruby: female, afam ethnicity, score of 55, both parents went to college, the family does not
own a home, the location is not urban, with an unemployment rate of 7%, average industrial wage
of $8.5, tuition of 0.85, high income, from the other region.
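The correlation table in part (a), the sequential models in part (d), and a prediction of the kind asked for in part (h) can be sketched as follows. This is a minimal illustration on synthetic data with only two extra regressors; the column names (`distance`, `education`, `score`, `urban`), the helper `ols`, and the example profile (including its distance value) are assumptions made for the sketch, not values taken from the dataset or the data dictionary.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only; distance is correlated with score
# and urban by construction, so omitting them biases the distance slope.
rng = np.random.default_rng(2)
n = 1000
score = rng.normal(50.0, 9.0, size=n)
urban = rng.integers(0, 2, size=n).astype(float)
distance = np.clip(
    3.0 - 0.05 * (score - 50.0) - 1.5 * urban + rng.normal(0.0, 1.0, size=n),
    0.0,
    None,
)
education = (
    9.0 + 0.09 * score - 0.03 * distance + 0.1 * urban
    + rng.normal(0.0, 1.4, size=n)
)
df = pd.DataFrame(
    {"education": education, "distance": distance, "score": score, "urban": urban}
)

# Q3PA: correlations of every variable with education and with distance.
corr_table = df.corr()[["education", "distance"]]


def ols(y, *columns):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones_like(y), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Q3PD: add one regressor at a time and track the slope on distance.
model1 = ols(education, distance)
model2 = ols(education, distance, score)
model3 = ols(education, distance, score, urban)
for name, beta in [("model 1", model1), ("model 2", model2), ("model 3", model3)]:
    print(name, "slope on distance:", round(float(beta[1]), 3))

# Q3PH-style prediction for a hypothetical profile:
# [intercept term, distance, score, urban], with a made-up distance of 2.0.
profile = np.array([1.0, 2.0, 45.0, 1.0])
predicted = float(profile @ model3)

print(corr_table)
print("predicted years of education:", round(predicted, 2))
```

Categorical variables such as ethnicity or region would first need to be converted to 0/1 indicator columns (for example with `pandas.get_dummies`), leaving one category out as the baseline.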
4. Multivariate Regression (Part II)
For this part I want you to use your best judgment and critical-thinking skills to estimate the relation
between score and years of education. A good answer should include a graphical representation of the
relation, both univariate and multivariate regression models, a correct interpretation of the
regression output, and a short conclusion summarizing all the evidence.