COMP4702/COMP7703 – Machine Learning
Homework W2 – Introduction and Exploratory Data Analysis
Marcus Gallagher
Core Questions
1. (2 marks) Find the (sample) average and (sample) standard deviation of the body mass of tiger
snakes, based on the data available at:
https://datadryad.org/stash/dataset/doi:10.5061/dryad.14cr5345
(Correct to 4 decimal places).
2. (2 marks) Imagine we record the maximum temperature in Brisbane for the month of February, but
we forget to make the recording on the 6th and the 16th ($y_6$ and $y_{16}$). We decide to predict the
maximum temperature on the missing days according to the following rule:
$y_t = \frac{1}{2}\left(y_{t-1} + y_{t-2}\right)$
(a) Is this performing classification or regression?
(b) If the rule is used to predict the maximum temperature on the 1st of March, is this performing
extrapolation or interpolation?
3. (3 marks) Write a function, sum_to_n(), which takes an unordered array of unique integers and an
integer, n, and returns all unique pairs which sum to n.
Examples:
arr                n    output
[1, 2, 3, 4]       5    [1, 4; 2, 3]
[1, 4, 5, 3, 2]    6    [1, 5; 4, 2]
[1, 2, 5, 6, 3]    7    [1, 6; 2, 5]
Supply your code (MATLAB or Python) for this question. Important: you must write this code yourself!
4. (4 marks) Perform some exploratory data analysis on the hwW2mystery.csv dataset. Answer the
following questions in AT MOST 3 sentences. Use correct and specific statistical language:
(a) Discuss an interesting feature in the dataset.
(b) Discuss an interesting relationship that you observe between a pair of features that are not the
ones from the previous question.
(c) From looking at the scatterplots between pairs of numerical features, make a guess as to which
pair of features has the highest correlation. Then plot or calculate the correlations between
feature pairs to see if you were correct or not!
(d) Compare and summarize the distributions (histograms) of the numerical features in the dataset.
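Note on question 2: the prediction rule (each missing value is the average of the two preceding days) can be sketched in Python as below. This is only an illustration under stated assumptions: the function name fill_missing and the temperature values are made up, missing days are represented by None, and the first two entries are assumed to be recorded.

```python
def fill_missing(temps):
    """Fill None entries using y_t = (y_{t-1} + y_{t-2}) / 2.

    `temps` is a list of daily maximum temperatures, with None
    marking a day that was not recorded (an assumed convention).
    """
    filled = list(temps)  # work on a copy, leave the input untouched
    for t, value in enumerate(filled):
        if value is None:
            if t < 2:
                # The rule needs two earlier days to average.
                raise ValueError("need two preceding days to apply the rule")
            # Earlier gaps are already filled because we move left to right.
            filled[t] = 0.5 * (filled[t - 1] + filled[t - 2])
    return filled


# Made-up values, not real Brisbane data.
temps = [30.0, 32.0, None, 29.0]
print(fill_missing(temps))
```

The sketch only shows how the rule is applied to a gap in a series; deciding what kind of prediction this is remains part of the question.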
Extension Questions
5. (2 marks) Non-parametric statistics are commonly used in machine learning. They are useful to
describe data that does not necessarily follow a known distribution. Find and read an explanation
of a box-and-whisker plot. Using the tiger snake data from Q1 above, what is the value of the data
point that lies closest to (but not exactly on) the boundary of the inter-quartile range?
6. (3 marks) Using the data from question 4 and in 5 sentences or less, what do you think this dataset
might relate to and why? Use properties of the data to support your answer. You don’t have to
guess correctly to get marks for your answer!
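Note on question 5: the quantities that a box-and-whisker plot is built from (the quartiles and the inter-quartile range) can be computed as in the sketch below. The function name box_summary and the data values are made up for illustration, and the "inclusive" quartile method is an assumed choice: different tools use slightly different quartile conventions, so results on real data may differ a little between packages.

```python
import statistics


def box_summary(values):
    """Return the quartiles and inter-quartile range of `values`.

    Uses statistics.quantiles (Python 3.8+) with the "inclusive"
    method, which is one of several common quartile conventions.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Made-up values, not the tiger snake data.
masses = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
print(box_summary(masses))
```

The box in the plot spans q1 to q3, so the "boundary of the inter-quartile range" refers to those two values.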