Deprecated Lesson 06: How did we train

Learning Objectives

This lesson refers to lesson 06.

  • list/repeat the three ingredients of a feed-forward network: input, hidden layers, output

  • describe training with a mini-batch-based weight-update rule

  • describe a softmax layer

  • motivate the categorical cross-entropy loss function (information theory vs. statistics)

  • define backpropagation

  • calculate example backpropagation

  • generalize to full network

Content

The slides shown in the video can be obtained from the lesson repo.

This content is based on discussions in the Mathematics for Machine Learning book by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong, as well as the wonderful presentation in Sebastian Raschka’s lecture L6.2 Understanding Automatic Differentiation via Computation Graphs.

Check your Learning

Question 1

The advantage of mini-batch-based optimisation compared to online gradient descent or full-data-set gradient descent (see the sketch after the answer options) is …

  1. a mini-batch represents the entire data set and hence is enough to optimize on

  2. the optimisation converges faster

  3. the optimisation can be performed with a memory footprint independent of the data set size

  4. the optimisation will always converge to a global optimum
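
For reference, a minimal NumPy sketch of a mini-batch weight-update rule is shown below. The linear model, the squared-error loss, the learning rate and all names and sizes are illustrative assumptions, not the lesson’s reference implementation.

```python
import numpy as np

# hypothetical toy setup: a linear model y_hat = X @ w trained with a
# mean-squared-error loss; all names and sizes are illustrative only
rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w = np.zeros(5)
lr, batch_size = 0.01, 32

for epoch in range(10):
    perm = rng.permutation(len(X))                    # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]          # indices of one mini-batch
        Xb, yb = X[idx], y[idx]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # dL/dw on this mini-batch only
        w -= lr * grad                                # the weight-update rule
```

Each update uses the gradient computed on the current mini-batch only; online gradient descent corresponds to batch_size = 1, full-data-set gradient descent to batch_size = len(X).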

Question 2

Categorical Cross-Entropy is part of a well-known divergence in statistics. A divergence is a method to compare two probability density functions: it yields a large value if the two distributions differ and a small value if they are similar (the definition of categorical cross-entropy itself is restated after the answer options). The well-known divergence underlying Categorical Cross-Entropy is …

  1. Mean-Squared-Error divergence

  2. Negative-Log-Likelihood divergence

  3. Kullback-Leibler divergence

  4. Maximum-Mean-Discrepancy divergence
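
As a reminder of the quantity the question refers to, the standard definition of categorical cross-entropy between a target distribution p (e.g. a one-hot label) and a predicted distribution q over K classes reads (symbols chosen here for illustration):

```latex
H(p, q) = -\sum_{k=1}^{K} p_k \log q_k
```

For a one-hot target with true class c, this reduces to -log q_c.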

Question 3

The gradient that is required for gradient descent is the gradient …

  1. of the loss function L with respect to the test-set input data, dL/dx, given the network parameters theta

  2. of the network f with respect to the input data, df/dx, given the network parameters theta

  3. of the network f with respect to the network parameters, df/dtheta, given the training data x

  4. of the loss function L with respect to the network parameters, dL/dtheta, given the training data x

Exercises

  • perform the same single-neuron backpropagation as in the video, but using the sigmoid activation function. Use b=1, x=0.5 and w=3 for the inputs (a sketch for checking your result follows this list).

  • perform the same single-neuron backpropagation as in the video, but include the loss function for the two sample pairs (a) {y=1, y_hat=1} and (b) {y=1, y_hat=0}. How would your single weight change for (b) compared to (a)?
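
A minimal sketch for checking the first exercise numerically is given below; the neuron structure a = sigmoid(w*x + b) and the derivative formulas are assumptions based on the exercise inputs, so verify them against the video before comparing numbers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# values from the first exercise; the neuron a = sigmoid(w*x + b) is assumed
w, x, b = 3.0, 0.5, 1.0

z = w * x + b                 # forward pass: affine part
a = sigmoid(z)                # forward pass: activation
da_dz = a * (1.0 - a)         # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))

# backward pass (chain rule) with respect to each input of the neuron
da_dw = da_dz * x             # dz/dw = x
da_db = da_dz * 1.0           # dz/db = 1
da_dx = da_dz * w             # dz/dx = w

print(f"a = {a:.4f}, da/dw = {da_dw:.4f}, da/db = {da_db:.4f}, da/dx = {da_dx:.4f}")
```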