General tips

Know how to do the things in the recent quizzes and homeworks.

Note that the quizzes are necessarily very short, so the questions are broad (e.g., describe in a few sentences how you would implement something with a computational graph), whereas on an exam you may be asked to actually implement the solution with a computational graph.

The exam will overwhelmingly favor recent material (after the first midterm).

You will not need to memorize code; the APIs will be given if needed (as on the last quiz).

Backprop...

Know how to perform, by hand, the full machine learning optimization of a model via computational graphs and stochastic gradient descent. I.e.:

  • Draw a graph of the model and loss calculation.
  • Compute its gradient with respect to the parameters you need to optimize, using backpropagation.
  • Give the resulting SGD update equation.

For those still unclear about what I am looking for regarding backpropagation, I basically want to see three things:

  1. Use the chain rule to break the gradient down into steps along a path.
  2. Use the elementary gradients of the functions each node implements to fill in the chain rule parts.
  3. Use the forward computation values to fill in unknowns.

Example(ish):

  1. The start node in the path represents the function we are taking the derivative of, and the end node represents the variable we are taking the derivative with respect to. For example, if you have a path through a graph $n_1 \to n_2 \to n_3$ where $n_1$ is $x$ and $n_3$ gives $f(x)$, then the gradient of $f$ with respect to $x$ can be broken down as
$$ \frac{df}{dx} = \frac{df}{dn_3}\frac{dn_3}{dn_2}\frac{dn_2}{dn_1} $$

  2. For example, suppose node $n_3$ implements a multiplication, e.g. $n_3 = n_2 n_5$. We want $\frac{dn_3}{dn_2}$ since it is one of the chain rule factors above, and we calculate it to be $n_5$.
  3. Continuing the example, we plug in the number for $n_5$ from our forward computation step using the current value of $x$.
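
To make the three steps concrete, here is a minimal sketch in plain Python (the toy graph $n_1 = x$, $n_2 = w n_1$, $n_3 = n_2^2$ and all numeric values are my own illustration, not from the course):

```python
# Toy graph: n1 = x, n2 = w * n1, n3 = n2**2, where n3 is the loss f.
x, w = 2.0, 0.5   # input and the parameter we optimize (arbitrary values)
lr = 0.1          # SGD learning rate (arbitrary)

# Step 3's ingredients: the forward computation values at every node.
n1 = x
n2 = w * n1       # multiplication node
n3 = n2 ** 2      # squaring node; this is f

# Steps 1-2: chain rule along the path n3 -> n2 -> w, with each factor
# filled in from the elementary gradients: d(n2**2)/dn2 = 2*n2, d(w*n1)/dw = n1.
dn3_dn2 = 2 * n2
dn2_dw = n1
df_dw = dn3_dn2 * dn2_dw   # = 4.0 for these values

# The resulting SGD update equation: w <- w - lr * df/dw
w = w - lr * df_dw         # w is now ~0.1
```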

Convolution

You should understand fundamentally what is calculated by convolution and how it is implemented by a layer in a neural network.

Q1: Give Keras code for a “1x1” Conv2D layer (i.e. the kernel size is (1,1)), where the input is a color image (i.e. 3 channels). Is this useless, or does it calculate something nontrivial?

A1: At each output node you get a combination of the three colors from a pixel in the input image: $y[i,j] = \sigma(w_1 x[i,j,1] + w_2 x[i,j,2] + w_3 x[i,j,3] + b)$. It is not trivial; it learns to combine the colors in a way that is useful for the task.
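
For the code part of Q1, something like the following sketch would do (the input size, filter count, and sigmoid activation are my illustrative choices):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),   # any HxW color image; 32x32 is assumed here
    # kernel_size (1,1): each output pixel mixes the 3 channels of one input pixel
    keras.layers.Conv2D(filters=1, kernel_size=(1, 1), activation="sigmoid"),
])
model.summary()   # output shape is 32x32x1
```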

Keras

Be able to write the math function and draw the graph implemented by simple Keras layers, including the simple recurrent layer. You will be given the relevant API pages. You may need to figure out code you haven't used before to test your understanding.

E.g. draw the neural network layers that the following lines implement, assuming an input size of 6.

keras.layers.Conv1D(5,2)

keras.layers.AveragePooling1D(3)
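
A quick way to check the shapes you would draw for the two lines above (assuming a length-6 input with 1 channel and stacking the layers):

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(6, 1)),         # input size 6, 1 channel
    keras.layers.Conv1D(5, 2),         # 5 filters, kernel size 2 -> output (5, 5)
    keras.layers.AveragePooling1D(3),  # pool size 3 (stride 3)   -> output (1, 5)
])
model.summary()
```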

Sequences

Understand what a language model is and how to implement it with a network. I.e. it's just a function like any other we model with a neural network, except its inputs are usually (one or more) words and its output is a softmax that is interpreted as a probability distribution over the vocabulary.

Q1: Generally, how do you turn words into vectors using one-hot encoding?

Q2: How can an embedding matrix be implemented with a neural network?

Q3: How would you convert a single long sequence into a collection of samples for training a language model?

  1. using a feedforward network
  2. using a recurrent network

A1: For a vocabulary of V unique words, make a dictionary where the key is the word and the value is its unique index $k$ in the vocabulary. Then, for each word, produce a vector of length V which has a one in the $k$th element and zeros elsewhere.
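
A minimal sketch of A1 (the three-word vocabulary is just for illustration):

```python
vocab = ["the", "cat", "sat"]                      # V = 3 unique words
index = {word: k for k, word in enumerate(vocab)}  # word -> unique index k

def one_hot(word):
    v = [0] * len(vocab)     # vector of length V, all zeros
    v[index[word]] = 1       # one in the kth element
    return v

print(one_hot("cat"))        # [0, 1, 0]
```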

A2: Directly use the embedding matrix as the weight matrix for a dense layer (be able to draw this). I.e. the dense layer should compute $y = \sigma(\mathbf E^T \mathbf x + \mathbf b) = \mathbf E^T \mathbf x$, because we set the layer to use no activation and no bias.
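
A sketch of A2 in Keras (the sizes and the random stand-in for $\mathbf E$ are assumptions):

```python
import numpy as np
from tensorflow import keras

V, d = 3, 2
E = np.random.rand(V, d)         # stand-in V x d embedding matrix

# Dense layer with no activation and no bias, so it computes E^T x exactly.
layer = keras.layers.Dense(d, activation=None, use_bias=False)
layer.build((None, V))
layer.set_weights([E])           # the Dense kernel has shape (V, d)

x = np.eye(V)[[1]]               # one-hot row vector for word index 1
print(layer(x).numpy(), E[1])    # both are row 1 of E (up to float precision)
```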

A3.1: For a feedforward network, take each window of N consecutive words (i.e. for sample i, use the ith word through the (i+N-1)th word) as a sequence; then use the last (one-hot encoded) word of the window as the target $\mathbf y^{(i)}$ and the previous (one-hot encoded) N-1 words as the sample input vector $\mathbf x^{(i)}$. The output layer is a softmax over possible words. There are other ways too.

A3.2: For a recurrent network you can simply use the ith word for the target $\mathbf y^{(i)}$ and the (i-1)th word for the input $\mathbf x^{(i)}$.
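
A sketch of A3.1 and A3.2 on a toy sequence of word indices (the indices stand in for one-hot vectors, and N = 3 is an arbitrary window size):

```python
seq = [0, 1, 2, 3, 4, 5]   # one long sequence of word indices
N = 3                      # window size for the feedforward case

# A3.1: input = previous N-1 words, target = the last word of each window.
ff = [(seq[i:i + N - 1], seq[i + N - 1]) for i in range(len(seq) - N + 1)]
print(ff)    # [([0, 1], 2), ([1, 2], 3), ([2, 3], 4), ([3, 4], 5)]

# A3.2: input = the (i-1)th word, target = the ith word.
rnn = [(seq[i - 1], seq[i]) for i in range(1, len(seq))]
print(rnn)   # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
```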

Data Augmentation

Q1: Suppose you have a dataset consisting of faces from mugshots. When you train on it you get a very accurate model for recognizing people in mugshots. However, when you test it on pictures from street cameras it is very inaccurate. How would you use data augmentation to train a better model?

A1: Augment your dataset with many variations of the mugshots. For each variation apply random backgrounds, lighting, and translations, but use the same true label in training.
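
A hedged sketch of A1 with Keras preprocessing layers (the particular transforms and ranges are my choices; random backgrounds would need custom code, and RandomBrightness needs a fairly recent Keras):

```python
from tensorflow import keras

augment = keras.Sequential([
    keras.layers.RandomTranslation(0.1, 0.1),  # random shifts
    keras.layers.RandomBrightness(0.3),        # random lighting
    keras.layers.RandomContrast(0.3),
])
# During training, feed augment(mugshot) with the SAME true label as the
# original mugshot, so the model learns to ignore these variations.
```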

General questions

Q1: When is a softmax typically used?

Q2: When is a ReLU typically used?

Q3: When is a sigmoid used?

A1: As the final-layer activation function for a multi-category classifier; the output can be interpreted as a probability estimate.

A2: As hidden-layer activations.

A3: As the final-layer activation function for a two-category classifier; the output can be interpreted as a probability estimate.
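
A tiny sketch tying A1-A3 together (layer sizes are arbitrary):

```python
from tensorflow import keras

multi = keras.Sequential([
    keras.layers.Dense(16, activation="relu"),     # A2: ReLU in hidden layers
    keras.layers.Dense(10, activation="softmax"),  # A1: multi-category output
])
binary = keras.Sequential([
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # A3: two-category output
])
```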

HW 8 answers (more or less)

Q1:

x=[1,2,3,4,6,7,7,6]

When convolving with [-1,2,-1] your code should have returned [0,0,-1,1,1,1].

The math operation here is a finite-difference second derivative (up to sign).

When convolving with [1,1,1] your code should have returned [6,9,13,17,20,20].

The math operation here is a smoothing of the sequence (a moving sum; divide by 3 for a moving average).

When convolving with [-1,1] your code should have returned [-1,-1,-1,-2,-1,0,1].

The math operation here is a finite-difference first derivative.

Your answer may differ slightly if you implemented correlation versus convolution.

When convolving with [2] your code should have returned [2,4,6,8,12,14,14,12].

The math operation here is just a scaling by 2.

Q2:

After implementing a stride of 2, your code should have output every other number from above, e.g., [2,6,12,14] for the last one.
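
One possible implementation (your HW 8 code may have differed, e.g. by flipping the kernel for true convolution) that reproduces the numbers above:

```python
def conv1d(x, w, stride=1):
    # Correlation-style "convolution": slide w over x without flipping it.
    n = len(w)
    return [sum(x[i + j] * w[j] for j in range(n))
            for i in range(0, len(x) - n + 1, stride)]

x = [1, 2, 3, 4, 6, 7, 7, 6]
print(conv1d(x, [-1, 2, -1]))    # [0, 0, -1, 1, 1, 1]
print(conv1d(x, [2], stride=2))  # [2, 6, 12, 14]
```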

Q3:

Using python-like notation...

$$h[m,n] = \sigma\Big(\sum_{i=-1}^{+1}\sum_{j=-1}^{+1}\sum_{k=0}^{1} x[m-i,\,n-j,\,k]\,w[i,j,k] + b \Big) $$

The actual ranges of the indices are arbitrary.
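
A direct (slow) transcription of the Q3 formula as a sanity check; the 5x5x2 input, tanh nonlinearity, and 'valid' border handling are my assumptions:

```python
import numpy as np

def conv2d(x, w, b, sigma=np.tanh):
    H, W, _ = x.shape
    h = np.zeros((H - 2, W - 2))
    for m in range(1, H - 1):          # skip the borders ('valid' convolution)
        for n in range(1, W - 1):
            s = sum(x[m - i, n - j, k] * w[i + 1, j + 1, k]  # shift i,j into 0..2
                    for i in (-1, 0, 1) for j in (-1, 0, 1) for k in (0, 1))
            h[m - 1, n - 1] = sigma(s + b)
    return h

x = np.random.rand(5, 5, 2)        # toy 5x5 image with 2 channels
w = np.random.rand(3, 3, 2)        # 3x3 kernel over both channels
print(conv2d(x, w, b=0.0).shape)   # (3, 3)
```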