CSE 250A FINAL EXAM
FALL QUARTER 2020
OUT: Sat Dec 12, 3:00 PM PST (Canvas)
DUE: Sun Dec 13, 3:00 PM PST (Gradescope)
Question                                      Points
Academic integrity statement                    *
1. d-separation                                10
2. Polytree inference                          10
3. Naïve Bayes vs logistic regression          10
4. Nonnegative random variables                 5
5. EM algorithm                                10
6. Inference in HMMs                            8
7. Most likely hidden states                    7
8. Gaussian random variables                    6
9. Policy improvement                           7
10. The simplest MDP                            3
11. Noisy parity model                          9
12. Gamer rating engine                        15
Total: 100
The exam is expected to take one full afternoon, but you may work as long as needed until the deadline. Be sure to
sign and submit the statement of academic integrity with your completed exam. Exams without signed statements
will not be graded.
You are not required to typeset your solutions. We do expect your writing to be legible and your final answers clearly
indicated. Also, please allow sufficient time to upload your solutions.
You are allowed to check your answers with programs in Matlab, Mathematica, Maple, NumPy, etc. But this should
not be necessary, and be aware that these programs may not produce the intermediate steps needed to receive credit.
The later parts of problems often do not depend on the results of earlier ones; therefore it is a good strategy to attempt
all parts of every problem.
An asterisk may indicate the need for a somewhat more laborious calculation. If you do not see the solution quickly,
you may wish to return to these parts later.
If something is unclear, state the assumptions that seem most natural to you and proceed under those assumptions. Out
of fairness, we will not be answering questions about the technical content of the exam on Piazza or by email.
Academic Integrity Statement
This take-home, open-book exam represents work that I have completed on my own. In particular,
during the twenty-four hour period of the exam, I hereby affirm¹ the following:
(i) I have not communicated with other students in the class about these problems.
(ii) I have not consulted other knowledgeable researchers or acquaintances.
(iii) I have not contracted for help on any parts of the exam over the Internet.
I understand that any of the above actions, if proven, will result in a failing grade on the exam. In
addition, I understand the following:
(i) I am obligated to report any violations of academic integrity by others that come to my
attention during the exam.
(ii) All other students in the course (past and present) are under the same obligation.
Signature Date
¹If you do not have a printer, you may simply copy, sign, and date the following statement with your solutions: I affirm the statement of academic integrity on page 2 of the exam.
1. d-separation (10 pts)
Consider the following statements of marginal or conditional independence in the belief network
shown below. For each statement, indicate whether it is true or false, and then justify your answer
as shown in the provided examples. Note that credit will only be awarded to answers that are
supported by further explanation.
[Figure: belief network over nodes A, B, C, D, E, F, and G.]
Examples:
(i) P(F|A,C) = P(F|C)
True. There are four paths from node A to node F. They are
A → C → F,
A → E ← C → F,
A → C → G ← D → F,
A → E ← C → G ← D → F,
and each of these paths is blocked by node C. (Some of these paths are also blocked by other
nodes, but the answer is already complete as written.)
(ii) P(C,D|G) = P(C|G)P(D|G)
False. The path C → G ← D is not blocked.
Problems:
(a) P(B|F) = P(B|E,F)
(b) P(B,E) = P(B)P(E)
(c) P(A|B,F) = P(A|F)
(d) P(A,B|E) = P(A|E)P(B|E)
(e) P(C|A,E,F,G) = P(C|A,B,E,F,G)
2. Polytree inference (10 pts)
[Figure: polytree belief network over nodes A, B, C, D, E, F, G, and H.]
For the belief network shown above, consider how to efficiently compute the conditional probability
P (G|A,C, F,H). This can be done in four consecutive steps in which some later steps rely on the
results from earlier ones.
Complete the procedure below for this inference by showing how to compute the following probabilities. Make each computation as efficient as possible, and briefly justify each step in your
solutions for full credit. Your answers should be expressed in terms of the CPTs of the belief
network and (as needed) the results of previous steps.
(a) Compute P (E|H).
(2 pts)
(b) Compute P (D|H).
(2 pts)
(c) Compute P (B|A,F ).
(3 pts)
(d) Compute P (G|A,C, F,H).
(3 pts)
3. Naïve Bayes versus logistic regression (10 pts)
[Figure: belief network with parent Y and children X1, X2, ..., Xd.]
(a) Naïve Bayes model (2 pts)
Consider the belief network of discrete random variables shown above. Show how to compute the conditional probability

    P(y|x_1, x_2, \ldots, x_d)

in terms of the belief network's probability tables for P(y) and P(x_i|y). Justify your steps to receive full credit.
(b) Log-odds (2 pts)
Consider the special case where all the variables in this belief network are binary-valued. For this special case, compute the log-odds

    \log \frac{P(Y=1|x_1, x_2, \ldots, x_d)}{P(Y=0|x_1, x_2, \ldots, x_d)}

in terms of the belief network's probability tables. Simplify your answer as much as possible; this will be helpful for the next parts of the problem.
(c*) Linear decision boundary (3 pts)
Show that the log-odds from part (b) is a linear function of the values of x_1, x_2, \ldots, x_d. In particular, show that it can be written in the form:

    \log \frac{P(Y=1|x_1, x_2, \ldots, x_d)}{P(Y=0|x_1, x_2, \ldots, x_d)} = a_0 + \sum_{i=1}^d a_i x_i

for appropriately chosen values of a_0, a_1, \ldots, a_d. Your solution should express these values in terms of the belief network's probability tables.
Hint: since each x_i is equal to either 0 or 1, it may be a useful notation to write

    \log \frac{P(x_i|Y=1)}{P(x_i|Y=0)} = x_i \log \frac{P(X_i=1|Y=1)}{P(X_i=1|Y=0)} + (1 - x_i) \log \frac{P(X_i=0|Y=1)}{P(X_i=0|Y=0)}.
(d) Logistic regression (2 pts)
Consider whether it is possible to express your result for P(Y=1|x_1, x_2, \ldots, x_d) in the form of a logistic regression. In particular, are there parameters w_0, w_1, \ldots, w_d such that

    P(Y=1|x_1, x_2, \ldots, x_d) = \sigma\left( w_0 + \sum_{i=1}^d w_i x_i \right),

where σ(z) = (1 + e^{-z})^{-1} is the sigmoid function? If yes, show how to choose these parameters so that this is true. If not, explain why not.
Hint: What is the inverse of the sigmoid function?
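The exam instructions above permit checking answers numerically. The following is a minimal NumPy sketch of such a check for parts (b)-(d): it tabulates P(Y=1|x) by direct application of Bayes' rule over all binary inputs and then fits an affine function to the log-odds, so a near-zero residual is consistent with the claim in part (c). The CPT values and variable names (p_y1, p_x1_given_y) are arbitrary illustrative choices, not part of the problem.

import numpy as np
from itertools import product

# Minimal self-check for parts (b)-(d); all CPT values are arbitrary.
rng = np.random.default_rng(0)
d = 4
p_y1 = 0.3                                    # P(Y=1)
p_x1_given_y = rng.uniform(0.1, 0.9, (2, d))  # p_x1_given_y[y, i] = P(X_i=1 | Y=y)

def posterior_y1(x):
    """P(Y=1 | x_1, ..., x_d) by Bayes' rule and the naive Bayes factorization."""
    joint = np.array([1.0 - p_y1, p_y1])      # joint[y] accumulates P(y) * prod_i P(x_i | y)
    for y in (0, 1):
        for i in range(d):
            joint[y] *= p_x1_given_y[y, i] if x[i] == 1 else 1.0 - p_x1_given_y[y, i]
    return joint[1] / joint.sum()

# Tabulate the log-odds over all 2^d binary inputs and fit an affine function a0 + sum_i a_i x_i.
X = np.array(list(product([0, 1], repeat=d)))
log_odds = np.array([np.log(posterior_y1(x) / (1.0 - posterior_y1(x))) for x in X])
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, log_odds, rcond=None)
print("max |residual| of affine fit:", np.abs(A @ coef - log_odds).max())  # ~1e-15 if the log-odds is linear in x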
4. Nonnegative random variables (5 pts)
Let μ > 0. In this problem you will derive some elementary but useful properties of the exponential distribution

    P(z) = \frac{1}{\mu} \exp\left( -\frac{z}{\mu} \right)

for a continuous, nonnegative random variable with mean μ. (You will be exploiting these properties to solve the last problem on the exam.)
(a) Log-likelihood (1 pt)
Let {z1, z2, . . . , zT} be an i.i.d. data set of nonnegative values. Assuming each value was
drawn from an exponential distribution, compute the log-likelihood
    \mathcal{L}(\mu) = \sum_{t=1}^T \log P(z_t)
in terms of the distribution’s parameter μ.
(b) Maximum likelihood estimation (1 pt)
Show that the maximum likelihood estimate for μ is given by the sample mean of the data.
(c) Cumulative distribution (1 pt)
Calculate the cumulative distribution, given by
    P(Z \le a) = \int_0^a dz\, P(z),
when Z is exponentially distributed with mean μ.
(d) Comparison (2 pts)
Suppose that Z1 and Z2 are independent, exponentially distributed random variables with
means μ1 and μ2, respectively. Show that
    P(Z_1 > Z_2) = \frac{\mu_1}{\mu_1 + \mu_2},

where the left side denotes the probability that Z_1 exceeds Z_2 in value.
Hint: note that P(Z_1 > Z_2) = \int_0^\infty da\, P(Z_1 = a)\, P(Z_2 < a), and use your result from the previous part of this problem.
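Since the exam instructions allow numerical checks, here is a minimal Monte Carlo sketch for parts (c) and (d). The values of mu1, mu2, a, and N are arbitrary choices, not part of the problem.

import numpy as np

# Monte Carlo self-check of parts (c) and (d); mu1, mu2, and a are arbitrary.
rng = np.random.default_rng(0)
mu1, mu2, a, N = 2.0, 5.0, 1.5, 1_000_000
z1 = rng.exponential(mu1, N)   # NumPy's scale parameter is the mean
z2 = rng.exponential(mu2, N)

# Part (c): the empirical CDF at a should match your closed-form expression for P(Z <= a).
print((z1 <= a).mean())

# Part (d): the empirical frequency should match mu1 / (mu1 + mu2).
print((z1 > z2).mean(), mu1 / (mu1 + mu2))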
5. EM algorithm (10 pts)
[Figure: belief network with observed nodes A, B, E, F and hidden nodes C, D.]
Consider the belief network shown above,
with observed nodes {A,B,E,F} and hidden nodes {C,D}.
On the problems below, simplify your answers as much as
possible, and briefly justify your steps to receive full credit.
(a) Hidden node C (2 pts)
Show how to compute the posterior probability P (C|A,B,E, F ) in terms of the conditional
probability tables (CPTs) of the belief network.
(b) Hidden node D (2 pts)
Show how to compute the posterior probability P (D|A,B,E, F ) in terms of the CPTs of
the belief network.
(c) Both hidden nodes (1 pt)
Show how to compute the posterior probability P(C,D|A,B,E,F) in terms of your previous results.
(d) Log-likelihood (2 pts)
Consider a data set of T partially labeled examples \{a_t, b_t, e_t, f_t\}_{t=1}^T over the observed nodes of the network. The log-likelihood of the data set is given by:

    \mathcal{L} = \sum_t \log P(A=a_t, B=b_t, E=e_t, F=f_t)

Compute this expression in terms of the CPTs of the belief network.
(e) EM algorithm (3 pts)
Consider the EM updates for the CPTs of this belief network that maximize the log-likelihood
in part (d). Provide these updates for the following:
(i) P (C=c)
(ii) P (D=d|A=a,B=b)
(iii) P (E=e|B=b, C=c)
For this part of the problem, it is not necessary to justify your steps; it is only necessary to
state the correct updates.
6. Inference in HMMs (8 pts)
Consider a discrete HMM with the belief network shown below. As usual, let st ∈ {1, 2, . . . , n}
and ot ∈ {1, 2, . . . ,m} denote, respectively, the hidden state and observation at time t; also, let
    \pi_i = P(S_1 = i),
    a_{ij} = P(S_{t+1} = j | S_t = i),
    b_{ik} = P(O_t = k | S_t = i),
denote the initial distribution over hidden states, the transition matrix, and the emission matrix. In
your answers you may also use bi(k) to denote the matrix element bik.
[Figure: HMM belief network with hidden states S1, S2, ..., ST and observations O1, O2, ..., OT.]
(a) Inference (4 pts)
Let a^t denote the t-th power (via matrix multiplication) of the transition matrix, and assume that a^0 denotes the identity matrix. Prove by induction or otherwise that

    P(S_t = i) = \sum_{k=1}^n \pi_k \,(a^{t-1})_{ki}.
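As a quick numerical check of this identity (the exam instructions allow this), the sketch below builds an arbitrary small chain, propagates the marginal distribution forward by direct recursion, and compares it with the matrix-power formula. The chain size and parameters are arbitrary choices.

import numpy as np

# Numerical check of the identity in part (a); the 3-state chain below is arbitrary.
rng = np.random.default_rng(0)
n, t = 3, 5
pi = rng.dirichlet(np.ones(n))           # pi[i]   = P(S_1 = i)
a = rng.dirichlet(np.ones(n), size=n)    # a[i, j] = P(S_{t+1} = j | S_t = i); rows sum to one

p = pi.copy()
for _ in range(t - 1):                   # direct recursion: P(S_{u+1}) = P(S_u) a
    p = p @ a

print(np.allclose(p, pi @ np.linalg.matrix_power(a, t - 1)))  # True: matches sum_k pi_k (a^{t-1})_{ki}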
(b) More inference (4 pts)
The forward-backward algorithm in discrete HMMs computes the probabilities
    \alpha_{it} = P(o_1, o_2, \ldots, o_t, S_t = i),
    \beta_{it} = P(o_{t+1}, o_{t+2}, \ldots, o_T | S_t = i).

In terms of these probabilities (which you may assume to be given) and the parameters (a_{ij}, b_{ik}, \pi_i) of the HMM, show how to compute the conditional probability

    P(o_1, o_2, \ldots, o_T | S_t = i, S_{t+1} = j)

for times 1 ≤ t < T. Show your work for full credit, justifying each step in your derivation, and simplifying your answer as much as possible. Hint: your result from part (a) may be useful.
7. Most likely hidden states (7 pts)
[Figure: belief network with hidden states S1, S2, ..., ST and observations O1, O2, ..., OT−1, where each observation Ot has parents St and St+1.]
Consider the belief network shown above, where st ∈ {1, 2, . . . , n} and ot ∈ {1, 2, . . . ,m} denote,
respectively, the hidden state and observation at time t. In this problem you will derive an efficient
algorithm for computing the most likely sequence of hidden states
    \{s_1^*, s_2^*, \ldots, s_T^*\} = \arg\max_{s_1, s_2, \ldots, s_T} P(s_1, s_2, \ldots, s_T | o_1, o_2, \ldots, o_{T-1}).

Note that this is not a hidden Markov model: in particular, every observation node has two parents, every hidden node has none, and the joint distribution is given by

    P(S_1, S_2, \ldots, S_T, O_1, O_2, \ldots, O_{T-1}) = \prod_{t=1}^T P(S_t) \prod_{t=1}^{T-1} P(O_t | S_t, S_{t+1}).
(a) Forward pass (4 pts)
Given a sequence of T − 1 observations, the most likely hidden states are most easily found by computing the T-column matrix with elements

    \ell_{it} = \max_{s_1, s_2, \ldots, s_{t-1}} \log P(s_1, s_2, \ldots, s_{t-1}, S_t = i, o_1, o_2, \ldots, o_{t-1}).

Note that the log probability for ℓ_{it} in this network only includes observations up to time t − 1. Give an efficient algorithm to compute these matrix elements for 1 ≤ i ≤ n and 1 ≤ t ≤ T.
(b) Backward pass (3 pts)
Show how to efficiently derive the sequence of most likely hidden states from the results of
the forward pass in part (a).
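A brute-force baseline can be useful for checking an efficient implementation of parts (a) and (b) on a tiny instance: it enumerates all n^T state sequences and maximizes the joint probability, which has the same argmax as the conditional probability above. The sizes and probability tables below are arbitrary illustrative choices.

import numpy as np
from itertools import product

# Brute-force baseline for a tiny instance of this model; n, m, T and the tables are arbitrary.
rng = np.random.default_rng(0)
n, m, T = 2, 3, 4
p_s = rng.dirichlet(np.ones(n), size=T)              # p_s[t, i]       = P(S_{t+1} = i), 0-indexed in t
p_o = rng.dirichlet(np.ones(m), size=(T - 1, n, n))  # p_o[t, i, j, k] = P(O_{t+1} = k | S_{t+1} = i, S_{t+2} = j)
obs = rng.integers(m, size=T - 1)                    # a fixed observation sequence o_1, ..., o_{T-1}

def log_joint(states):
    """log P(s_1, ..., s_T, o_1, ..., o_{T-1}) under the factorization above."""
    lp = sum(np.log(p_s[t, states[t]]) for t in range(T))
    lp += sum(np.log(p_o[t, states[t], states[t + 1], obs[t]]) for t in range(T - 1))
    return lp

best = max(product(range(n), repeat=T), key=log_joint)
print("most likely hidden states (brute force):", best)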
8. Gaussian random variables (6 pts)
[Figure: two-node belief network z → x, with z ∼ N(0, I) and x ∼ N(Λz, I).]
Consider the belief network of multivariate Gaussian random variables shown above, where z ∈ R^d is hidden and x ∈ R^D is observed, with

    P(z) = \frac{1}{(2\pi)^{d/2}} \exp\left\{ -\frac{1}{2} z^\top z \right\},

    P(x|z) = \frac{1}{(2\pi)^{D/2}} \exp\left\{ -\frac{1}{2} (x - \Lambda z)^\top (x - \Lambda z) \right\}.
Here, the parameter Λ is a D × d matrix. Note that both these multivariate Gaussian distributions
have an identity covariance matrix.
(a) Posterior mode (2 pts)
Let \hat{z} = \arg\max_z P(z|x) denote the mode of the posterior distribution (i.e., the location of its global maximum). Prove that we may also compute this mode from

    \hat{z} = \arg\max_z \log\left[ P(z)\, P(x|z) \right].
(b*) Maximization (3 pts)
Compute \hat{z} by explicitly maximizing \log\left[ P(z)\, P(x|z) \right] as suggested in part (a). Your answer should express \hat{z} in terms of the observed value of x and the matrix parameter Λ.
(c) Posterior mean (1 pt)
Consider the following assertion:
    The posterior mean E[z|x], defined by the multidimensional integral

        E[z|x] = \int_{z \in R^d} dz\, P(z|x)\, z,

    is equal to the posterior mode \hat{z} = \arg\max_z P(z|x) that was obtained (much more simply) by differentiation in part (b).

Is the above statement true or false? If true, explain by appealing to the properties of normal distributions; if false, provide a counterexample. (You are not meant to evaluate the integral.)
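If you want to sanity-check your answers to parts (b) and (c) numerically, one option is self-normalized importance sampling with the prior as the proposal, since E[z|x] is proportional to the prior expectation of z weighted by P(x|z). This is only a rough sketch; the dimensions, Lambda, and x below are arbitrary choices.

import numpy as np

# Self-check for parts (b)-(c): estimate E[z|x] by importance sampling from the prior N(0, I).
# The dimensions, Lambda, and x are arbitrary.
rng = np.random.default_rng(0)
D, d, N = 3, 2, 1_000_000
Lam = rng.normal(size=(D, d))
x = rng.normal(size=D)

z = rng.normal(size=(N, d))                          # samples from the prior P(z) = N(0, I)
log_w = -0.5 * np.sum((x - z @ Lam.T) ** 2, axis=1)  # log P(x|z) up to an additive constant
w = np.exp(log_w - log_w.max())                      # self-normalized importance weights
print((w[:, None] * z).sum(axis=0) / w.sum())        # ≈ E[z|x]; compare with your mode from part (b)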
9. Policy improvement (7 pts)
Consider the Markov decision process (MDP) with two states s ∈ {0, 1}, two actions a ∈ {↓, ↑}, discount factor γ = 2/3, and the reward function and transition matrices as shown below:
    s    R(s)
    0     -4
    1      8

    s    s'    P(s'|s, a=↓)
    0    0         3/4
    0    1         1/4
    1    0         1/4
    1    1         3/4

    s    s'    P(s'|s, a=↑)
    0    0         1/2
    0    1         1/2
    1    0         1/2
    1    1         1/2
(a) State value function (4 pts)
Consider the policy π that chooses the action a = ↓ in each state, and compute the state value function V^π(s) for s ∈ {0, 1}. Show your work for full credit.
Hint: the elements of the state value function are integers.
(b) Action value function (2 pts)
Compute the action value function Q^π(s, a) for this same policy, where s ∈ {0, 1} and a ∈ {↓, ↑}. Show your work for full credit.
(c) Greedy policy (1 pt)
Compute the greedy policy π′ with respect to these value functions. Your final answer should clearly specify π′(s) ∈ {↓, ↑} for s ∈ {0, 1}. Show your work for full credit.
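If you would like to check your arithmetic (as the exam instructions permit), a minimal sketch is to solve the linear Bellman system for the fixed policy and then take one greedy step. The numbers are copied from the tables above; the dictionary keys 'down' and 'up' are just illustrative names for the two actions.

import numpy as np

# Self-check for parts (a)-(c): policy evaluation by solving V = R + gamma * P_pi V, then one greedy step.
gamma = 2.0 / 3.0
R = np.array([-4.0, 8.0])
P = {'down': np.array([[0.75, 0.25], [0.25, 0.75]]),   # P(s'|s, a=down)
     'up':   np.array([[0.50, 0.50], [0.50, 0.50]])}   # P(s'|s, a=up)

V = np.linalg.solve(np.eye(2) - gamma * P['down'], R)              # part (a): V^pi for the always-down policy
Q = {a: R + gamma * P[a] @ V for a in P}                           # part (b): Q^pi(s, a)
greedy = ['down' if Q['down'][s] >= Q['up'][s] else 'up' for s in range(2)]  # part (c)
print(V, Q, greedy)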
10. The simplest MDP (3 pts)
Consider a Markov decision process (MDP) in which the reward function is constant: i.e., there is
a scalar reward r such that
R(s) = r
for every state s in the state space. In such an MDP, prove that every policy is optimal.
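A small numerical illustration (not a substitute for the proof) is to evaluate several arbitrary policies in a randomly generated MDP with constant reward and observe that they all have the same value function, r/(1 − γ), in every state. All sizes and settings below are arbitrary.

import numpy as np

# Numerical illustration only (not a proof): with constant reward, every policy evaluates to r/(1-gamma).
rng = np.random.default_rng(0)
n_states, n_actions, gamma, r = 4, 3, 0.9, 2.0
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :] = P(s'|s, a)

for _ in range(3):
    policy = rng.integers(n_actions, size=n_states)               # an arbitrary deterministic policy
    P_pi = np.array([P[policy[s], s] for s in range(n_states)])
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r * np.ones(n_states))
    print(V)                                                      # always r/(1-gamma) = 20 in each state here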
11. Noisy parity model (9 pts)
[Figures: (left) a belief network with parents X1, X2, ..., Xn and child Y; (right) a belief network in which each Xi has a child Zi, and Z1, Z2, ..., Zn are the parents of Y.]
(a) Noisy parity (2 pts)
Consider the belief network of binary (0/1) random variables shown on the left. Furthermore,
suppose that the conditional probability table (CPT) at node Y takes the form
    P(Y=1|x_1, x_2, \ldots, x_n) = \frac{1}{2}\left[ 1 - \prod_{i=1}^n (1 - 2p_i)^{x_i} \right],

where p_i ∈ [0, 1] are parameters in the unit interval. Show that when all the parameters p_i are equal to one, the node Y deterministically computes the parity of its parents' bit vector:

    P(Y=1|x_1, x_2, \ldots, x_n) = \begin{cases} 1 & \text{if } \sum_i x_i \text{ is odd}, \\ 0 & \text{if } \sum_i x_i \text{ is even}. \end{cases}
(b) Noisy copy (1 pt)
Consider the belief network of binary (0/1) random variables shown on the right. Suppose that each hidden variable Z_i is a noisy copy of its parent X_i in the following sense:

    P(Z_i=0|X_i=0) = 1,
    P(Z_i=1|X_i=1) = p_i, \quad \text{where } p_i ∈ [0, 1].

Use these probabilities to derive the following simple identity (which may be useful for the next part of this problem); in particular, for x_i ∈ {0, 1}, show that

    P(Z_i=0|X_i=x_i) - P(Z_i=1|X_i=x_i) = (1 - 2p_i)^{x_i},
where in this context it is understood that any real number (including zero) raised to the
zeroth power is equal to one.
(c*) Latent variable model (6 pts)
In the belief network on the right, suppose that the node Y deterministically computes the
parity of its parents:
    P(Y=1|z_1, z_2, \ldots, z_n) = \frac{1}{2}\left[ 1 - \prod_{i=1}^n (-1)^{z_i} \right].
Show in this case that the belief network on the right yields the same conditional probability
P (Y = 1|x1, x2, . . . , xn) as given in part (a).
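For small n, the equivalence claimed in part (c) can be checked numerically by marginalizing over the hidden Z's in the right-hand network and comparing with the closed form from part (a). In the sketch below the p_i values are arbitrary, and the function names are just illustrative.

import numpy as np
from itertools import product

# Numerical check of part (c) for small n; the parameters p_i are arbitrary.
rng = np.random.default_rng(0)
n = 4
p = rng.uniform(0.0, 1.0, n)

def p_y1_left(x):
    """Closed form from part (a)."""
    return 0.5 * (1.0 - np.prod((1.0 - 2.0 * p) ** np.array(x)))

def p_y1_right(x):
    """Sum over z of P(z|x) * P(Y=1|z) in the latent variable model of part (c)."""
    total = 0.0
    for z in product([0, 1], repeat=n):
        pz = 1.0
        for i in range(n):
            if x[i] == 0:
                pz *= 1.0 if z[i] == 0 else 0.0           # P(Z_i=0 | X_i=0) = 1
            else:
                pz *= p[i] if z[i] == 1 else 1.0 - p[i]   # P(Z_i=1 | X_i=1) = p_i
        total += pz * 0.5 * (1.0 - (-1.0) ** sum(z))      # parity CPT at node Y
    return total

assert all(np.isclose(p_y1_left(x), p_y1_right(x)) for x in product([0, 1], repeat=n))
print("the two networks agree on P(Y=1|x) for every x")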
12. Gamer rating engine (15 pts)
A gaming company pits n users on the Internet against each other in a popular two-player game.
The game involves a combination of skill and chance, so that in each game the more skilled of the
two players is likely but not guaranteed to win.
Over time the company records the wins and losses of these games in an n × n matrix with
elements Gij . Specifically, the element Gij records the number of games in which user i beat
user j. Note that Gii = 0, because users cannot play against themselves, and also that in general
Gij ≠ Gji, because in each matchup the more skilled player is more likely to win than lose.
From this matrix, the company wants to assign each user a rating so that it can suggest new
matches between players of comparable skill. Let r_i ∈ R denote this rating for the ith user, and let σ(z) = (1 + e^{-z})^{-1} denote the sigmoid function. In a game between players with ratings r_i and r_j, the company hypothesizes that the player with rating r_i should win with probability σ(r_i − r_j). The log-likelihood of their data in this model is therefore given by

    \mathcal{L} = \sum_{i,j=1}^n G_{ij} \log \sigma(r_i - r_j),
and the best model of this form is found by choosing the user ratings to maximize this expression.
This is where they turn to you for help.
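Before deriving anything, it can help to have a small simulator for testing whatever update rule you end up with (gradient ascent in part (a) or EM in parts (d)-(f)). The sketch below generates a win-count matrix G from arbitrary ground-truth ratings and evaluates the log-likelihood defined above; the values of n, games_per_pair, and r_true are illustrative choices only.

import numpy as np

# Minimal simulator for self-checking later parts; n, games_per_pair, and r_true are arbitrary.
rng = np.random.default_rng(0)
n, games_per_pair = 5, 200
r_true = rng.normal(0.0, 1.0, n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

G = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        wins_i = rng.binomial(games_per_pair, sigmoid(r_true[i] - r_true[j]))
        G[i, j], G[j, i] = wins_i, games_per_pair - wins_i

def log_likelihood(r):
    """L = sum_{i,j} G_ij log sigma(r_i - r_j); the diagonal terms vanish since G_ii = 0."""
    return sum(G[i, j] * np.log(sigmoid(r[i] - r[j]))
               for i in range(n) for j in range(n) if i != j)

print(log_likelihood(r_true))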
(a) Gradient ascent (3 pts)
The simplest approach to maximize the log-likelihood is gradient ascent. In this case, the
user ratings are adapted by
    r_k \leftarrow r_k + \eta \left( \frac{\partial \mathcal{L}}{\partial r_k} \right)

for some small learning rate η > 0. Compute the partial derivative ∂L/∂r_k that appears in this learning rule, and simplify your final expression as much as possible. Be sure to show your work for full credit.
(b) Self-check (2 pts)
Intuitively, if the user ratings have been properly estimated, we might expect that the number
of expected wins for each user is equal to the number of observed wins. More precisely, for
the kth user, we might expect that
    \sum_{j=1}^n (G_{kj} + G_{jk})\, \sigma(r_k - r_j) = \sum_{j=1}^n G_{kj},
where the left and right sides compute, respectively, the number of expected and observed
wins. Using your answer from part (a), show that the elements of the gradient vanish when
the above condition is satisfied.
(c) Nonnegative ratings (1 pt)
Unfortunately, the marketing department of the company objects to this model; it refuses on
principle to assign any user a zero or negative skill rating. (Too demoralizing, they say.)
To this end, let μ_i = e^{r_i} be a new (strictly positive) rating for each user with original (real-valued) rating r_i. Consider how the probability of a win is expressed in terms of these new ratings. In particular, show that

    \sigma(r_i - r_j) = \frac{\mu_i}{\mu_i + \mu_j}.
Note how this expression matches the form of the probability you computed in problem 4(d).
We can thus interpret these transformed ratings as parameters of a latent variable model.
In this model, each game between the ith and jth user is simulated by the following experiment:
For the ith user, sample an exponentially distributed random variable Zi with mean μi.
For the jth user, sample an exponentially distributed random variable Zj with mean μj .
Award victory to the user with the larger sampled variable.
This latent variable model is depicted by the belief network shown below, where the binary random
variable Vij indicates a victory by player i (for Vij=1) or player j (for Vij=0).
[Figure: belief network with parents Z_i and Z_j and child V_{ij}, annotated with the CPTs:]

    P(Z_i = z) = \frac{1}{\mu_i} \exp(-z/\mu_i),
    P(Z_j = z) = \frac{1}{\mu_j} \exp(-z/\mu_j),
    P(V_{ij} = 1 | Z_i, Z_j) = \begin{cases} 1 & \text{if } Z_i > Z_j \\ 0 & \text{otherwise} \end{cases}
Not surprisingly, the EM algorithm for this model takes a familiar form. The E-step computes
posterior means, such as E[Zi|Vij], in terms of the current user ratings, and the M-step uses these
expected values to derive new user ratings. The rest of this problem guides you through these steps.
(d) M-step (3 pts)
The EM updates in this model are derived as usual from an auxiliary function. In this case,
the auxiliary function is given by
    Q(\mu, \mu^{\rm old}) = -\sum_{i,j=1}^n \left\{ G_{ij}\left( \log \mu_i + \frac{E[Z_i|V_{ij}=1]}{\mu_i} \right) + G_{ji}\left( \log \mu_i + \frac{E[Z_i|V_{ij}=0]}{\mu_i} \right) \right\},

where μ is shorthand for (μ_1, μ_2, \ldots, μ_n), and where the posterior means on the right side are computed in terms of the current user ratings μ^{old}. From the above expression (which you are not asked to prove), derive the EM updates by maximizing the auxiliary function with respect to its first argument:

    \mu^{\rm new} = \arg\max_{\mu} Q(\mu, \mu^{\rm old}).

Your answer should show how to re-estimate the rating μ_i for the ith user in terms of the counts G_{ij} and G_{ji} as well as the posterior means E[Z_i|V_{ij}=1] and E[Z_i|V_{ij}=0] (which you will compute next).
(e) Posterior mean (when the player wins) (3 pts)
To complete the EM algorithm, you must compute the posterior means that appear in your
answer to part (d). For the latent variable model described above, it can be shown that

    E[Z_i|V_{ij}=1] = \frac{1}{P(Z_i > Z_j)} \int_0^\infty dz\; z\, P(Z_i = z)\, P(Z_j < z).

You do not need to prove² this result, but assuming it to be true, use it to calculate E[Z_i|V_{ij}=1] in terms of the current user ratings.
Hint: substitute your result from 4(d) for the term P(Z_i > Z_j) and your result from 4(c) for the term P(Z_j < z).
²It follows from Bayes rule and marginalization—nothing magical or mysterious—but you already have enough to do on this exam.
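Your closed-form answer here can also be checked by a small Monte Carlo estimate: average Z_i over simulated games that player i wins. The ratings mu_i and mu_j below are arbitrary choices, not values from the problem.

import numpy as np

# Monte Carlo self-check for part (e): average Z_i over the simulated games that player i wins.
# The ratings mu_i and mu_j are arbitrary.
rng = np.random.default_rng(0)
mu_i, mu_j, N = 3.0, 1.5, 1_000_000
z_i = rng.exponential(mu_i, N)
z_j = rng.exponential(mu_j, N)
print(z_i[z_i > z_j].mean())   # compare with your closed-form expression in terms of mu_i and mu_j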
(f) Posterior mean (when the player loses) (3 pts)
Here’s some good news for the last problem: you can compute the other posterior mean
E[Zi|Vij=0] without evaluating yet another integral.
To do this, you only need a result that follows from the most basic rules of probability. Note
that by definition, the prior and posterior means of Zi are given by
    E[Z_i] = \int_0^\infty dz\; z\, P(Z_i = z),

    E[Z_i|V_{ij}=0] = \int_0^\infty dz\; z\, P(Z_i = z | V_{ij}=0),

    E[Z_i|V_{ij}=1] = \int_0^\infty dz\; z\, P(Z_i = z | V_{ij}=1).
It turns out that two of these expected values, along with the marginal probabilities P (Vij=0)
and P (Vij=1), are sufficient to determine the third. To prove this fact, show that
E[Zi] = P (Vij=0)E[Zi|Vij=0] + P (Vij=1)E[Zi|Vij=1].
Hint: Write out the expected values on the right side as integrals, then use the product rule.
________________________________________________________________________

The line across the page is to make clear that no more work is required! But for completeness, here is how this identity yields the other posterior mean. Rearranging the identity:

    E[Z_i|V_{ij}=0] = \frac{E[Z_i] - P(V_{ij}=1)\, E[Z_i|V_{ij}=1]}{P(V_{ij}=0)}.
Now simply note that in the course of this exam, you have already computed every term on
the right side. In particular,
    E[Z_i] = \mu_i,
    P(V_{ij}=1) = P(Z_i > Z_j),
    P(V_{ij}=0) = 1 - P(V_{ij}=1),
and E[Zi|Vij = 1] was computed in part (e). Again, for emphasis, you do not need to
substitute these previous results into the identity; after proving the identity, you are done.
Notes from the instructors
From Lawrence:
Congratulations on your hard work in CSE 250A. We covered a lot of material during the
quarter, often at a brisk pace. Nevertheless I hope that you enjoyed the course. One of my
favorite moments in this class is to congratulate students when they hand in their final exam.
Unfortunately that is not possible this quarter, so this brief note will have to suffice.
I will post solutions on Canvas once the TAs and I verify that all the exams have been submitted. It will also be possible to inspect your exams after they are graded. In the meantime,
good luck on your remaining exams, and enjoy the winter break.
From Aditi:
I hope you found this course helpful, and that some/all of you want to explore the concepts
introduced in this course further! Happy holidays, and stay safe!
From Jennifer:
Congratulations on completing CSE250A! You worked extremely hard and learned a lot this
quarter! Please take a moment to reflect on that and appreciate the incredible journey you
have been on this quarter. You are certainly better off for it. I hope you enjoyed the course
as much as I did, and I look forward to seeing you around campus. Please don’t hesitate
to reach out if you ever want to chat about more ML topics or anything GradWIC related.
Thank you for being such wonderful, engaged, and curious students!
From Jessica:
Congratulations on completing CSE 250A, especially in these weird times! I’m sure you
enjoyed it and did wonderfully on the exam as well. Good luck for the rest of your exams
and happy holidays!
From Udayan:
Congratulations on completing CSE250A. Hope you all had a great time. This was my first
time being a TA, and I enjoyed it a lot. If you don’t remember me, I am the TA who liked
giving blunt answers on Piazza XD. Remember the fundamentals you learned in this course,
they are very useful. We tried our best to keep the course as authentic and close to previous
offerings as possible, despite being online. Hoping to meet a lot of you in person someday.
Wishing all of you best of luck in your future at UC San Diego.
From Xinghan:
Time flies, and you are all about to complete CSE 250A! Each one of you deserves a round
of applause for doing all the hard work, and I believe for sure that you have something to
take away from this course. I would also like to thank you for your encouragement, kind
notes, and support for my OH and discussion sessions, which make me a better TA. Best of
luck on all your exams, and have a nice winter break!