Content
The slides shown in the video can be obtained from the lesson repo.
This content is based on discussions in the book Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, as well as the wonderful presentation in Sebastian Raschka’s lecture L6.2 Understanding Automatic Differentiation via Computation Graphs.
 
Check your Learning
Question 1
The advantage of mini-batch based optimisation compared to online gradient descent or full data set gradient descent is …
a mini-batch represents the entire data set and hence is enough to optimise on
 
the optimisation converges faster
 
the optimisation can be performed with a memory footprint independent of the data set size
 
the optimisation will always converge to a global optimum
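
For reference, here is a minimal NumPy sketch of mini-batch gradient descent on a toy linear regression problem (the data, model, and hyperparameters are made up for illustration); setting batch_size to 1 corresponds to online gradient descent, and setting it to the full data set size corresponds to full data set gradient descent.

```python
import numpy as np

# Toy data set and a linear model y = X @ w (illustrative assumption, not from the lesson)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)      # parameters theta
lr = 0.01            # learning rate
batch_size = 32      # 1 -> online GD, len(X) -> full data set GD

for epoch in range(5):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]                        # only one mini-batch is held per update step
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)     # gradient of the mean squared error on the batch
        w -= lr * grad

print(w)
```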
 
 
Question 2
Categorical Cross-Entropy is closely related to a well-known divergence from statistics. A divergence is a way to compare two probability distributions: it yields a large value if the two distributions are different and a small value if they are similar. The well-known divergence that gives rise to Categorical Cross-Entropy is …
Mean-Squared-Error divergence
 
Negative-Log-Likelihood divergence
 
Kullback-Leibler divergence
 
Maximum-Mean-Discrepancy divergence
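
To illustrate the behaviour described in the question, the following NumPy sketch (with made-up target and prediction vectors) evaluates Categorical Cross-Entropy for a prediction that is similar to the target distribution and for one that is very different.

```python
import numpy as np

def categorical_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy H(p, q) = -sum_i p_i * log(q_i) for discrete distributions."""
    q = np.clip(q, eps, 1.0)
    return -np.sum(p * np.log(q))

# One-hot target distribution over 3 classes (made-up example values)
target = np.array([0.0, 1.0, 0.0])

similar   = np.array([0.05, 0.90, 0.05])   # prediction close to the target
different = np.array([0.80, 0.10, 0.10])   # prediction far from the target

print(categorical_cross_entropy(target, similar))    # small value (~0.105)
print(categorical_cross_entropy(target, different))  # large value (~2.303)
```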
 
 
Question 3
The gradient that is required for gradient descent is the gradient …
of the loss function L with respect to the test set input data, dL/dx, given the network parameters theta
 
of the network f with respect to the input data, df/dx, given the network parameters theta
 
of the network f with respect to the network parameters, df/dtheta, given the training data x
 
of the loss function L with respect to the network parameters, dL/dtheta, given the training data x
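
The answer options distinguish gradients taken with respect to the input data from gradients taken with respect to the network parameters. As a sketch of how both quantities can be obtained, the following snippet assumes PyTorch and a tiny made-up linear network and runs a single backward pass of automatic differentiation.

```python
import torch

# Made-up training data and a tiny network f (illustrative assumptions only)
x = torch.randn(8, 4, requires_grad=True)     # input data x (requires_grad only to expose dL/dx)
y = torch.randn(8, 1)                         # targets
f = torch.nn.Linear(4, 1)                     # network parameters theta = (weight, bias)

loss = torch.nn.functional.mse_loss(f(x), y)  # loss function L(f(x), y)
loss.backward()                               # reverse-mode automatic differentiation

print(x.grad.shape)         # gradient of L with respect to the input data x
print(f.weight.grad.shape)  # gradient of L with respect to the network parameters theta
```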