Content
The slides shown in the video can be obtained from the lesson repo.
This content is based on the discussion in the Mathematics for Machine Learning book by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, as well as the wonderful presentation in Sebastian Raschka's lecture L6.2 Understanding Automatic Differentiation via Computation Graphs.
Check your Learning
Question 1
The advantage of mini-batch-based optimisation compared to online gradient descent or full-data-set gradient descent is …
a mini-batch represents the entire data set and hence is enough to optimize on
the optimisation converges faster
the optimisation can be performed in memory independent of the data set size
the optimisation will always converge to a global optimum
Solution
1. no, a mini-batch is only a small random subset of the data; the optimisation proceeds step-by-step on randomly shuffled mini-batches
2. no, there is no general guarantee that mini-batch optimisation converges faster
3. yes, only one mini-batch at a time has to be held in memory, so the approach scales independently of the data set size
4. no, there is no such guarantee; for non-convex losses, gradient descent may end up in a local optimum
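To make answer 3 more concrete, here is a minimal NumPy sketch (the names ``grad_mse`` and ``minibatch_gradient_descent`` are made up for illustration and not part of the lesson code): only one mini-batch has to be resident in memory per update step, regardless of how large the full data set is.

```python
import numpy as np

# hypothetical linear model: gradient of a mean-squared-error loss
# with respect to the weights, evaluated on one mini-batch only
def grad_mse(weights, x_batch, y_batch):
    residual = x_batch @ weights - y_batch
    return 2.0 * x_batch.T @ residual / len(x_batch)

def minibatch_gradient_descent(x, y, batch_size=32, lr=0.01, epochs=5):
    rng = np.random.default_rng(seed=42)
    weights = np.zeros(x.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(x))  # reshuffle the data set once per epoch
        for start in range(0, len(x), batch_size):
            idx = order[start:start + batch_size]
            # only this mini-batch is needed in memory; x[idx] and y[idx]
            # could equally well be streamed from disk for huge data sets
            weights -= lr * grad_mse(weights, x[idx], y[idx])
    return weights
```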
Question 2
Categorical Cross-Entropy is based in part on a well-known divergence from statistics. A divergence is a way to compare two probability density functions: it returns a large value if the two distributions differ and a small value if they are similar. The well-known divergence that underlies Categorical Cross-Entropy is …
Mean-Squared-Error divergence
Negative-Log-Likelihood divergence
Kullback-Leibler divergence
Maximum-Mean-Discrepancy divergence
Solution
1. no, Categorical Cross-Entropy is used for classification. The MSE is often used for regression.
2. no, Negative-Log-Likelihood is a general estimation principle rather than a divergence between two distributions
3. yes!
4. no, but Maximum Mean Discrepancy is indeed yet another way to compare two distributions
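To illustrate the connection behind answer 3, here is a small NumPy sketch (not part of the lesson code) of the identity ``H(p, q) = H(p) + D_KL(p || q)``: for a one-hot label distribution ``p`` the entropy ``H(p)`` is zero, so the categorical cross-entropy equals the Kullback-Leibler divergence between the label distribution and the predicted distribution.

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

# one-hot ground-truth label (class 2 of 3) and predicted class probabilities
p = np.array([0.0, 0.0, 1.0])
q = np.array([0.1, 0.2, 0.7])

# for a one-hot p the entropy H(p) is zero, so H(p, q) == D_KL(p || q)
print(cross_entropy(p, q), kl_divergence(p, q))  # both ~0.3567
```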
Question 3
The gradient that is required for gradient descent is the gradient …
of the loss function ``L`` with respect to the test-set input data, ``dL/dx``, given the network parameters ``theta``
of the network ``f`` with respect to the input data, ``df/dx``, given the network parameters ``theta``
of the network ``f`` with respect to the network parameters, ``df/dtheta``, given the training data ``x``
of the loss function ``L`` with respect to the network parameters, ``dL/dtheta``, given the training data ``x``
Solution
1. no, this would lead us astray for multiple reasons: the shape of ``dL/dx`` matches the data rather than ``theta``, and changes with respect to the data do not help us to optimise ``theta``
2. no, the output of ``f`` has an arbitrary scale and, without the loss ``L``, tells us nothing about how well the task (classification/regression) is solved
3. no, the output of ``f`` has an arbitrary scale and, without the loss ``L``, tells us nothing about how well the task (classification/regression) is solved
4. yes! (we are interested in the slope of the loss ``L`` with respect to the parameters ``theta`` so that we know how to alter them accordingly)
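As a small illustration of answer 4, the sketch below uses PyTorch's automatic differentiation to compute exactly ``dL/dtheta`` and take one gradient-descent step with it; the tiny linear "network" and the data are made up for illustration, since the lesson does not fix a particular framework here.

```python
import torch

# hypothetical mini-batch of training data and class labels
x = torch.randn(8, 4)                            # 8 samples, 4 features
y = torch.randint(0, 3, (8,))                    # labels for 3 classes
theta = torch.randn(4, 3, requires_grad=True)    # network parameters

logits = x @ theta                               # network output f(x; theta)
loss = torch.nn.functional.cross_entropy(logits, y)  # loss L

loss.backward()                # backpropagation through the computation graph
print(theta.grad.shape)        # dL/dtheta has the same shape as theta: (4, 3)

# a single gradient-descent step uses exactly this quantity
with torch.no_grad():
    theta -= 0.1 * theta.grad
```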