Content
The slides shown in the video can be obtained from the lesson repo.
This content is based on the discussion in the Mathematics for Machine Learning book by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, as well as the wonderful presentation in Sebastian Raschka's lecture L6.2 Understanding Automatic Differentiation via Computation Graphs.
Check your Learning
Question 1
The advantage of mini-batch-based optimisation compared to online gradient descent or full-data-set gradient descent is …
a mini-batch represents the entire data set and hence is enough to optimize on
the optimisation converges faster
the optimisation can be performed in memory independent of the data set size
the optimisation will always converge to a global optimum
Solution
1. no, a mini-batch is only a small random subset of the data; the optimisation proceeds step-by-step on randomly shuffled mini-batches
2. no, there is no general guarantee that mini-batch optimisation converges faster
3. yes, only one mini-batch at a time has to be held in memory, so the approach scales independently of the data set size
4. no, there is no such guarantee; for non-convex losses, gradient descent may end up in a local optimum
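To make answer 3 more concrete, here is a minimal NumPy sketch (the names ``grad_mse`` and ``minibatch_gradient_descent`` are made up for illustration and not part of the lesson code): only one mini-batch has to be resident in memory per update step, regardless of how large the full data set is.

```python
import numpy as np

# hypothetical linear model: gradient of a mean-squared-error loss
# with respect to the weights, evaluated on one mini-batch only
def grad_mse(weights, x_batch, y_batch):
    residual = x_batch @ weights - y_batch
    return 2.0 * x_batch.T @ residual / len(x_batch)

def minibatch_gradient_descent(x, y, batch_size=32, lr=0.01, epochs=5):
    rng = np.random.default_rng(seed=42)
    weights = np.zeros(x.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(x))  # reshuffle the data set once per epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            # only this mini-batch is needed in memory; x[idx] and y[idx]
            # could equally well be streamed from disk for huge data sets
            weights -= lr * grad_mse(weights, x[idx], y[idx])
    return weights
```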
Question 2
Categorical Cross-Entropy is based in part on a well-known divergence from statistics. A divergence is a way to compare two probability density functions: it returns a large value if the two distributions differ and a small value if they are similar. The well-known divergence that underlies Categorical Cross-Entropy is …
Mean-Squared-Error divergence
Negative-Log-Likelihood divergence
Kullback-Leibler divergence
Maximum-Mean-Discrepancy divergence
Solution
1. no, Categorical Cross-Entropy is used for classification. The MSE is often used for regression.
2. no, Negative-Log-Likelihood is a general estimation principle rather than a divergence between two distributions
3. yes!
4. no, but Maximum Mean Discrepancy is indeed yet another way to compare two distributions
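To illustrate the connection behind answer 3, here is a small NumPy sketch (not part of the lesson code) of the identity ``H(p, q) = H(p) + D_KL(p || q)``: for a one-hot label distribution ``p`` the entropy ``H(p)`` is zero, so the categorical cross-entropy equals the Kullback-Leibler divergence between the label distribution and the predicted distribution.

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

# one-hot ground-truth label (class 2 of 3) and predicted class probabilities
p = np.array([0.0, 0.0, 1.0])
q = np.array([0.1, 0.2, 0.7])

# for a one-hot p the entropy H(p) is zero, so H(p, q) == D_KL(p || q)
print(cross_entropy(p, q), kl_divergence(p, q))  # both ~0.3567
```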
Question 3
The gradient that is required for gradient descent is the gradient …
of the loss function ``L`` with respect to the test-set input data, ``dL/dx``, given the network parameters ``theta``
of the network ``f`` with respect to the input data, ``df/dx``, given the network parameters ``theta``
of the network ``f`` with respect to the network parameters, ``df/dtheta``, given the training data ``x``
of the loss function ``L`` with respect to the network parameters, ``dL/dtheta``, given the training data ``x``
Solution
1. no, this would lead us astray for multiple reasons: the shape of ``dL/dx`` matches the data rather than ``theta``, and changes with respect to the data do not help us to optimise ``theta``
2. no, the output of ``f`` has an arbitrary scale and, without the loss ``L``, tells us nothing about how well the task (classification/regression) is solved
3. no, the output of ``f`` has an arbitrary scale and, without the loss ``L``, tells us nothing about how well the task (classification/regression) is solved
4. yes! (we are interested in the slope of the loss ``L`` with respect to the parameters ``theta`` so that we know how to alter them accordingly)
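As a small illustration of answer 4, the sketch below uses PyTorch's automatic differentiation to compute exactly ``dL/dtheta`` and take one gradient-descent step with it; the tiny linear "network" and the data are made up for illustration, since the lesson does not fix a particular framework here.

```python
import torch

# hypothetical mini-batch of training data and class labels
x = torch.randn(8, 4)                            # 8 samples, 4 features
y = torch.randint(0, 3, (8,))                    # labels for 3 classes
theta = torch.randn(4, 3, requires_grad=True)    # network parameters

logits = x @ theta                               # network output f(x; theta)
loss = torch.nn.functional.cross_entropy(logits, y)  # loss L

loss.backward()                # backpropagation through the computation graph
print(theta.grad.shape)        # dL/dtheta has the same shape as theta: (4, 3)

# a single gradient-descent step uses exactly this quantity
with torch.no_grad():
    theta -= 0.1 * theta.grad
```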