- How do you stack the training examples (x(i), y(i)) to form X and Y? (X.shape = (nx, m) and Y.shape = (1, m) -> makes Python coding easier)
- Formulate the logistic regression problem.
- Should the bias term be kept separate or added as one of the weight parameters?
- What is the loss function? (defined on one training example)
- What is the cost function? (average of the losses over all training examples)
- Why is the squared error not used in the case of logistic regression? (1. it makes the optimization problem non-convex -> difficult for GD 2. as a surrogate for the 0/1 (accuracy) loss, MSE is not a close fit. Cross-entropy and hinge loss are the closest.)
- What is the logistic regression loss function? Why use this form?
- What is the logistic regression cost function?
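A minimal NumPy sketch of the forward pass and cross-entropy cost under the shape conventions above (the names `sigmoid`, `cost`, `w`, `b` are mine, not from the course):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Cross-entropy cost. X: (nx, m), Y: (1, m), w: (nx, 1), b: scalar."""
    A = sigmoid(w.T @ X + b)  # predictions, shape (1, m)
    # average the per-example losses over the whole training set
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
```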
- On a high level, how does GD work? (talk about the derivative, slope, and direction of steepest descent)
- Mathematically, what is a derivative? (Slope, i.e. how a small increment in x affects y)
- What is derivative of log(a)?
- What is a computation graph? (Defining the forward path of a network in terms of basic computational blocks. Comes in handy when we are trying to optimize a function, for example J: a left-to-right pass calculates the value of the function J and a right-to-left pass calculates the derivatives.)
- How to find the derivatives of J with respect to the intermediate variables in a computation graph?
- What is a proper way to denote the derivative of J with respect to some intermediate variable var in Python? (dvar - since we are almost always interested in the derivative of J, we drop it from the name)
- What is the computation graph for logistic regression?
- Derive the derivatives at each point of the computation graph. From this derive the gradient descent formulae for logistic regression.
- How do you extend the weight update formulae obtained from one training example to multiple? (Average of all dws)
- What is vectorization and why is it necessary? (Art of getting rid of explicit for loops in the code; increases computational efficiency)
- What is SIMD? (Single instruction, Multiple Data -> helps GPUs/CPUs in parallelization)
- How do you vectorize the parameter updates in logistic regression?
- Write the vectorization equation for the forward path of logistic regression.
- Derive the vectorized gradient update formulae for logistic regression.
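A sketch of one vectorized gradient descent step, assuming the same shape conventions (illustrative, not the course's exact code):

```python
import numpy as np

def gradient_step(w, b, X, Y, lr=0.01):
    """One vectorized GD step. X: (nx, m), Y: (1, m), w: (nx, 1), b: scalar."""
    m = X.shape[1]
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))  # sigmoid forward pass, (1, m)
    dZ = A - Y                                # dL/dz per example, (1, m)
    dw = (X @ dZ.T) / m                       # (nx, 1), averaged over examples
    db = np.sum(dZ) / m                       # scalar
    return w - lr * dw, b - lr * db
```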
- What is broadcasting? What are the advantages? (The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. )
- What is the general principle of broadcasting?
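A tiny illustration of the principle (trailing dimensions are compared; a dimension of 1, or a missing dimension, is stretched to match):

```python
import numpy as np

A = np.random.randn(3, 4)              # shape (3, 4)
mu = A.mean(axis=0)                    # shape (4,): broadcast across rows
col = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1): stretched across columns
print((A - mu).shape)                  # (3, 4)
print((A + col).shape)                 # (3, 4)
```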
- Should you use (n,) i.e. rank 1 arrays in Python? (No - bugs might creep in. Better to use reshape to get (n,1). Matrix multiplications are easier to predict.)
- What is the need of assert statement during coding neural networks? (Assert to double-check shape of matrices)
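A quick sketch of the rank-1 pitfall and the assert habit:

```python
import numpy as np

a = np.random.randn(5)     # rank-1 array, shape (5,)
print(a.T.shape)           # still (5,): transposing a rank-1 array is a no-op
a = a.reshape(5, 1)        # proper column vector
assert a.shape == (5, 1)   # cheap sanity check; fails loudly on shape bugs
print((a @ a.T).shape)     # (5, 5): outer product behaves as expected
```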
- Standard stuff
- How do you get the logistic regression cost function?
- Why take the log? (Simplification: it converts the product into a summation, and log is monotonically increasing)
- How does the minus sign come into the loss function?
- What is maximum likelihood estimation?
- What is the assumption for doing this? (Input training examples are IID)
- What are train/dev/test sets?
- How do you choose train, test, and dev sets? (Previous era vs. this era; smaller dataset vs. larger dataset)
- What happens when the data distributions of the train, dev, and test sets are different? (the dev and test data distributions have to be the same)
- Describe bias and variance intuitively. What is the bias variance trade-off? (Trick question since people don't talk about the bias variance trade-off nowadays. All 4 scenarios can be seen.)
- Give examples for each scenario.
- How do you decide which of the 4 possible scenarios you are in? (Check train error, compare with Bayes error - decide high bias/low bias; then check dev error, compare with train error)
- How to approach reducing bias and variance? (Talk about the basic ML recipe)
- Why is the bias-variance tradeoff not important anymore in the era of DL? (Because in the earlier era of ML, almost all techniques that increased one decreased the other. Now we have techniques that can almost solely affect one of them. For example, bigger network solely reduces bias and more data solely reduces variance)
- What is L2 regularization?
- Do we have to regularize both weights and biases? (For each neuron, most parameters are in the weight vector since it's high dimensional. So regularizing the bias is not required. Can be omitted.)
- What is L1 regularization? What are the consequences? (sparse network - most weights zero) When is it typically used? (compressing neural networks, but doesn't help much in practice.)
- What is the regularization parameter? How do you set it? (another hyperparameter - so use dev set)
- What is the formula for L2 regularization for a multi-layer neural network?
- What is the Frobenius norm? (the L2 norm generalized to a matrix; the square root of the sum of squares of all elements)
- How does the weight update rule get changed when L2 regularization is added?
- Why is L2 regularization called the weight decay?
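A short sketch that makes the decay factor explicit (`lr`, `lambd`, and the shapes are made-up example values):

```python
import numpy as np

m, lr, lambd = 64, 0.01, 0.7   # batch size, learning rate, reg. strength
W = np.random.randn(4, 3)      # some layer's weights
dW = np.random.randn(4, 3)     # gradient of the unregularized cost
# L2 adds (lambd/m)*W to the gradient; factoring W out exposes the decay:
#   W <- (1 - lr*lambd/m) * W - lr*dW, where (1 - lr*lambd/m) < 1
W = (1 - lr * lambd / m) * W - lr * dW
```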
- Intuitively, why does regularization prevent overfitting? (automatic pruning of the network; the classical reason of keeping activations in the near-linear region of the activation function)
- Difference between cost function and loss. (loss is a part of CF)
- What is dropout? Why does it help in regularization?
- How do you implement dropout? (Inverted dropout implementation)
- How does backprop happen in a dropout layer?
- Why is the inversion with keep_prob required during training? (Keeps the expected activations unchanged, so no extra scaling is needed at test time - the plain no-dropout network is fine)
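A sketch of inverted dropout for one layer's activations (illustrative shapes; `keep_prob` as in the course):

```python
import numpy as np

a = np.random.randn(4, 5)                  # activations of some hidden layer
keep_prob = 0.8
d = np.random.rand(*a.shape) < keep_prob   # boolean mask, ~80% ones
a = a * d                                  # drop ~20% of the units
a = a / keep_prob                          # "inverted" step: keeps E[a] unchanged
```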
- Why does dropout work? (From the network perspective - a smaller network is trained at each iteration; from a neuron's perspective - it spreads out its weights since any input can randomly go away -> the effect is shrinking the L2 norm of the weights, i.e. an adaptive form of L2 regularization)
- How do you choose the dropout factor for layers? (keep_prob smaller for layers with more parameters, i.e. layers that have more chance of overfitting)
- What is a downside of dropout? (the cost function J is no longer well-defined)
- What are other regularization tricks? (data augmentation, early stopping)
- What is data augmentation and why is it used? (horizontal flips, random zooming, random rotations, distortions depending on the problem; doesn't add much information but might regularize)
- What is early stopping and why is it used? Why does it work? (weights start close to zero and then grow; early stopping chooses weights in the mid range)
- What is the advantage and disadvantage of early stopping? (adv. = unlike the L2 norm, which requires multiple passes of SGD to find the hyperparameter lambda, early stopping requires only 1 pass. disadv. = couples the optimization and regularization tasks)
- What is orthogonalization in the context of ML?
- How to normalize input? (for each dimension subtract the mean and divide by the std. dev.; remember that the test set has to see the same transformation)
- Why do we normalize the input data? (otherwise the cost function is elongated, since the magnitudes of the weights for different dimensions are very dissimilar -> slower to optimize)
- When do we normalize the input data? (When input features come from very different ranges, although doing it never hurts)
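A sketch of the normalization, with the key point that the test set reuses the training statistics (the example data is made up):

```python
import numpy as np

X_train = np.random.randn(3, 100) * 5 + 2   # (nx, m), arbitrary scale/offset
X_test = np.random.randn(3, 20) * 5 + 2
mu = X_train.mean(axis=1, keepdims=True)    # per-feature mean, (nx, 1)
sigma = X_train.std(axis=1, keepdims=True)  # per-feature std, (nx, 1)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma              # same mu/sigma, NOT recomputed
```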
- Explain the phenomenon of vanishing/exploding gradients. (explain the phenomenon and talk about why it is not very important anymore for feed-forward networks, but still important for RNNs.) (Very interesting read - https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b for the case of sigmoids and tanh) (It is better to talk about the classical problems first, then come to modern problems and talk about the random block initialization paper and why it is less of an issue.) (Networks facing this problem in order of severity -> RNN, FFN w/ sigmoid/tanh, modern ReLU nets).
- How do you initialize the weights of any layer of a deep network? Why? Discuss the changes for different activation functions (ReLU and tanh).
- What is He initialization? Xavier initialization? Glorot and Bengio initialization? (Also an interesting paper: https://arxiv.org/pdf/1704.08863.pdf)
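A sketch of the two variance-scaling rules for a single dense layer (`fan_in`/`fan_out` are my names for the layer's input/output sizes):

```python
import numpy as np

fan_in, fan_out = 256, 128
# He initialization (suits ReLU): variance 2 / fan_in
W_he = np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
# Xavier initialization (suits tanh): variance 1 / fan_in
W_xavier = np.random.randn(fan_out, fan_in) * np.sqrt(1.0 / fan_in)
```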
- Should you take a two-sided difference or a one-sided difference while calculating gradients? (Accuracy: the two-sided error is O(epsilon^2) whereas the one-sided error is O(epsilon). Although two-sided is much more accurate, one-sided is faster.)
- Why is gradient checking necessary?
- How to perform gradient checking for a neural network? (Give formula)
- What is a good ballpark value of epsilon that can be used for calculating gradients?
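A minimal two-sided grad check sketch on a flattened parameter vector (`J` is the cost as a function of the parameters, `grad` the backprop gradient; the names are mine):

```python
import numpy as np

def grad_check(J, theta, grad, eps=1e-7):
    """Relative difference between backprop gradient and numerical estimate."""
    num_grad = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        num_grad[i] = (J(tp) - J(tm)) / (2 * eps)  # two-sided: O(eps^2) error
    # roughly: ~1e-7 is great, ~1e-5 is okay, ~1e-3 suggests a bug
    return (np.linalg.norm(num_grad - grad)
            / (np.linalg.norm(num_grad) + np.linalg.norm(grad)))
```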
- What to do if your algorithm fails gradient checking?
- Should you use grad check throughout training? If not, why not?
- How do you handle regularization during grad check? (If L2 or another additive regularization, just add that term; but it doesn't work with dropout)
- How do you do grad check for a neural network that has dropout? (Just turn off dropout, check, and if the algorithm passes grad check, turn dropout back on)
- Do we do grad check before training? (Yes, and also after some training epochs have passed.)
- What is mini-batch gradient descent? Compare it with batch gradient descent (stable but too long per iteration). Also compare with stochastic gradient descent (very noisy, doesn't converge, and loses the speed-up from vectorization). (Also remember this paper - https://arxiv.org/pdf/1609.04836.pdf)
- What are the advantages of mini-batch GD? (Can exploit vectorized implementations, more gradient descent steps than batch GD, less noisy than Stochastic GD)
- How to choose the size of the mini-batch? (In between 1 and m. Guidelines: if the training set is small (<2000) just use m, i.e. batch GD; for bigger training sets, typical mini-batch sizes are 64, 128, 256, 512 (powers of 2, since that suits the computer architecture). Make sure each mini-batch fits in GPU/CPU memory. The mini-batch size can also be treated as a hyperparameter - explore powers of 2.)
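A sketch of building shuffled mini-batches under the column convention X: (nx, m):

```python
import numpy as np

def make_minibatches(X, Y, batch_size=64, seed=0):
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)  # reshuffle every epoch
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, k:k + batch_size], Y[:, k:k + batch_size])
            for k in range(0, m, batch_size)]  # last batch may be smaller
```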
- Explain the general exponential moving average formula. (V_t = beta * V_{t-1} + (1 - beta) * theta_t)
- Approximately, how many previous steps does it average upon? (1/(1-beta))
- Why the name "exponentially"?
- What is the advantage of this technique? (Computationally efficient)
- Why is bias correction required? (Starts with zero, the estimate for initial phases are inaccurate)
- How is bias correction done?
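A sketch of the EMA with bias correction (the observations are made-up stand-ins for, e.g., per-step gradients):

```python
beta, v = 0.9, 0.0
observations = [1.0, 2.0, 3.0, 4.0]
for t, theta_t in enumerate(observations, start=1):
    v = beta * v + (1 - beta) * theta_t  # raw EMA, biased toward 0 early on
    v_corrected = v / (1 - beta ** t)    # bias correction: divide by 1 - beta^t
    print(t, v, v_corrected)
```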
- What is the core idea of this algorithm? (calculate exponentially weighted moving averages of gradients and use this to update weights)
- Why might larger learning rates sometimes create problems in GD? (If oscillations happen in some directions, they would get amplified)
- What is the formula for this algorithm?
- Intuitively, why does momentum work? (oscillations are averaged out to some extent; so larger learning rates can be used -> bigger steps towards the minimum -> speeding up the optimization process.)
- What is a common value of beta that is used?
- What is the RMSprop formula? Explain it intuitively.
- How to add numerical stability to the RMSprop algorithm?
- What is the advantage of RMSprop? (Averaging out oscillations -> allows us to use a larger learning rate -> speeding up the optimization process)
- What is the core idea of Adam? (Combining gradient descent with momentum and RMSprop)
- Write the formula.
- How are the hyperparameters selected for Adam? (alpha = needs to be tuned, beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-8; typically only alpha is tuned)
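A sketch of one Adam update for a single parameter array, combining the momentum and RMSprop pieces above (defaults as listed; `t` must start at 1 for the bias correction):

```python
import numpy as np

def adam_step(w, dw, v, s, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    v = b1 * v + (1 - b1) * dw        # momentum: EMA of gradients
    s = b2 * s + (1 - b2) * dw ** 2   # RMSprop: EMA of squared gradients
    v_hat = v / (1 - b1 ** t)         # bias corrections
    s_hat = s / (1 - b2 ** t)
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)  # eps for numerical stability
    return w, v, s
```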
- What is the intuition behind slowly reducing the learning rate (learning rate decay)?
- How to implement learning rate decay? (staircase, or a formula as a function of the epoch - some of them introduce new hyperparameters)
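Sketches of three common schedules (`decay_rate` and `k` are the new hyperparameters they introduce; the three lines are alternatives, not a sequence):

```python
alpha0, decay_rate, k, epoch = 0.1, 1.0, 10, 25  # example values
alpha = alpha0 / (1 + decay_rate * epoch)        # 1/t decay
alpha = alpha0 * 0.95 ** epoch                   # exponential decay
alpha = alpha0 * 0.5 ** (epoch // k)             # staircase: halve every k epochs
```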
- How has the concept of local minima evolved in modern deep learning? (lots of intuitions from lower-dimensional spaces don't translate to higher dimensions; for example, local optima rarely occur in high-dimensional spaces, saddle points are much more likely. Derivatives are also zero at a saddle point.)
- What is the challenge faced by optimization in deep learning? (Problem of plateau - makes learning pretty slow - Adam, RMSprop can really help speed up learning here.)
- What are some of the hyperparameters used in deep learning? (learning rate alpha, momentum beta, Adam optimization hyperparameters, # layers, # units in layers, learning rate decay, mini-batch size; According to Andrew Ng the most important parameter to tune is alpha, followed by beta, mini-batch size, and # hidden units; next come # layers and learning rate decay. The Adam hyperparameters beta_1, beta_2, and epsilon are almost never tuned)
- How to select values of hyperparameters to explore?
- Why is random search better than grid search? (Consider a situation with two hyperparameters -> if one of them doesn't affect performance much, then in an nxn grid search only n values of the other one get explored, so effectively only n points in total; random search explores n^2 distinct values of the important one.)
- What is coarse-to-fine sampling scheme?
- How do you choose the range of hyperparameter exploration? (Depends on the sensitivity of performance to the hyperparameter at hand; # units in hidden layers can be searched on a linear scale, learning rate is searched on a logarithmic scale)
- What can be a range to search the # hidden units or # hidden layers and how to search in that range?
- What can be a range to search the learning rate alpha and how to search in that range?
- What can be a range to search the momentum term beta and how to search in that range? (Change variables to 1 - beta and repeat the log-scale process above)
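A sketch of log-scale sampling for alpha in [1e-4, 1] and beta in [0.9, 0.999]:

```python
import numpy as np

r = -4 * np.random.rand()      # r uniform in [-4, 0]
alpha = 10 ** r                # alpha log-uniform over [1e-4, 1]

r = -2 * np.random.rand() - 1  # r uniform in [-3, -1]
beta = 1 - 10 ** r             # beta in [0.9, 0.999], sampled densely near 1
```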
- In general, what are two ways to choose hyperparameters? (babysit one model or try multiple models in parallel)
- What are the advantages of BNORM? (makes hyperparameter search much easier, makes the neural network much more robust to the choice of hyperparameters -> networks tend to perform well for a wide range of hyperparameters, makes training of deeper networks easier)
- What is the high-level idea of BNORM? (We know that normalizing inputs makes learning faster because it changes the contours to a more round-ish shape by enforcing input dimensions to be in similar ranges. BNORM extends this idea to all layers.)
- Should you normalize the output of the activation or normalize before the activation? (Debate - but normalizing before the activation is much more common)
- Write the BNORM formula. (First convert to zero mean and std. dev. of 1, then apply learnable parameters gamma and beta -> this makes the mean and std. dev. learnable) (Automatically learns the range the data should be put into)
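A sketch of the BNORM forward computation on a layer's pre-activations (gamma and beta are the learnable per-unit parameters; eps is for numerical stability):

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z: (n_units, m) mini-batch of pre-activations; gamma, beta: (n_units, 1)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)  # zero mean, unit std. dev. per unit
    return gamma * Z_norm + beta            # learnable mean and std. dev.
```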
- Draw the signal flow graph of a NN with BNORM along with proper hyperparameters.
- Can we remove the biases for layers where we are using BNORM? (Yes. The first step of BNORM is to subtract the mean, which automatically cancels any constant (bias) that was added. So while using BNORM, biases can be permanently set to zero.)
- How do you implement gradient descent with BNORM?
- Why does batch norm work? (Reason 1: scaling of different dimensions to the same range, thereby making the optimization contours more rounded. Reason 2: makes later layers more robust to weight changes in the earlier layers. The later layers face covariate shift: the values of their inputs change during training. BNORM at least keeps their mean and std. dev. constant. Reason 3: regularization effect. Since the mean and std. dev. are calculated on a mini-batch, they are noisy. This adds some noise to the hidden layer outputs. BNORM adds both multiplicative and additive noise. Note that in contrast, dropout only adds multiplicative noise, but it is much more powerful. Increasing the mini-batch size decreases the regularizing effect.)
- What is covariate shift? (If we learn X-> y, then if the input or output changes then the classifier has to be retrained.)
- Why can't we employ the same mini-batch method at test time?
- How do you handle batch norm during test time? (Get an estimate of mean and variance by keeping an exponentially weighted average of the means and variance calculated from all the mini-batches)
- What is softmax? (Generalization of logistic regression for multi-classes; converts to probabilities)
- What is the softmax formula?
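A numerically stable softmax sketch (subtracting the per-column max changes nothing mathematically but avoids overflow in exp):

```python
import numpy as np

def softmax(z):
    """z: (n_classes, m) logits -> (n_classes, m) column-wise probabilities."""
    z = z - z.max(axis=0, keepdims=True)  # stability shift
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)
```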
- What is the difference of softmax compared to other activation functions? (Softmax works on a layer whereas other activations work point- or neuron-wise)
- Softmax vs hardmax?
- During training a softmax layer, what is the loss function used? What is the cost function?
- How does backprop work in the softmax layer? (https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function)
- What are the criteria for choosing a deep learning framework? (ease of development and deployment, how well large models are handled, truly open source, i.e. open source with good governance)
- What is a placeholder in TensorFlow? What is its difference with Variable?
- Why is the "with" command used for running a session? (Better handles exceptions in the inner loop)
- What is a computation graph?
- What are different types of Computer Vision problems? (Image classification, object detection, neural style transfer, etc.)
- What is the primary advantage of a convolution layer? (for bigger real-world images -> fewer parameters -> reduces the chance of overfitting and is computationally feasible)
- Intuitively explain what features progressive layers of a neural network extract. (gradually higher-level features: edges, parts of objects, etc.)
- Describe the forward convolution operation.
- Can you explain how a convolution layer works as an edge detector? (vertical edge detector)
- How to detect dark-to-bright and bright-to-dark vertical edges?
- Construct a simple vertical edge detector. Now construct a simple horizontal edge detector.
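Sketches of the simple 3x3 detectors (the signs pick the bright-to-dark direction; this also previews the Sobel variant below):

```python
import numpy as np

vertical = np.array([[1, 0, -1],
                     [1, 0, -1],
                     [1, 0, -1]])  # bright-left / dark-right vertical edges
horizontal = vertical.T            # bright-top / dark-bottom horizontal edges
sobel = np.array([[1, 0, -1],
                  [2, 0, -2],
                  [1, 0, -1]])     # Sobel: extra weight on the central row
```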
- What is a Sobel filter? (more weight on the central elements of the first and third columns)
- What is a Scharr filter?
- Why do we not use hand-coded filters anymore? (train them using back-propagation, better chance of capturing the statistics of the data; can also learn slanted edge detectors)
- If image size is nxn and filter size is fxf, what is the dimension of the output image?
- What are the problems of the shrinking output size? (pixels around the edges of the input don't get convolved much i.e. some information is thrown away; for deeper layers it is difficult to keep track)
- How to solve the above problem? (Padding)
- How to calculate how many pixels to pad on each side of the input?
- What is the meaning of "valid" and "same" convolution in padding?
- Why is the filter dimension f typically odd in Computer Vision? (1. the padding p = (f-1)/2 is symmetric; 2. filter has a central position)
- When input size is nxn, filter size fxf, padding of p is done on all sides, and stride is s, what is the size of the output image?
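The standard answer is floor((n + 2p - f)/s) + 1; a quick sketch:

```python
def conv_output_size(n, f, p=0, s=1):
    """Output side length for an nxn input, fxf filter, padding p, stride s."""
    return (n + 2 * p - f) // s + 1

print(conv_output_size(6, 3))            # 4 (matches the 6x6, 3x3 example)
print(conv_output_size(7, 3, p=1, s=2))  # 4
```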
- Cross correlation vs. convolution in math vs. deep learning? (No horizontal and vertical flipping before point-wise multiplication)
- How does convolution over volume work? If your input is 6x6x3 and filter is 3x3x3, what is the output size? (4x4)
- What does each filter learn? (a particular type of edge detector feature)
- Where and how is bias added to a CNN?
- Why is a convolution layer less prone to overfitting?
- For a convolutional layer l write all the relevant equations for forward pass.
- What are the different types of layers in a convolutional neural network?
- What are the advantages of pooling layers? (1. reduces size of representation to speed up computation 2. makes some of the features it detects a bit more robust)
- What is max-pooling? What is average-pooling? Which one is more used? In what context is average-pooling still used?
- What is the size of the output of a max-pooling layer?
- How does backprop happen in the pooling layers?
- How to count the number of layers in a CNN? (two schools of thought: number of trainable layers, or all layers)
- What are the advantages of convolutional layers? (1. parameter sharing - a feature detector that is useful in one part of the image might also be useful in other parts 2. sparsity of connections - each output value depends on only a small number of inputs, less prone to overfitting 3. Translation invariance - image shifted a few pixels give rise to the same features)
- What are some classic (LeNet, AlexNet, VGG) and modern (ResNet, Inception) important networks?
- What is local response normalization? (Not used today)
- What was the key contribution of LeNet-5?
- What was the key contribution of AlexNet?
- What was the key contribution of VGG? (All filters 3x3. nH and nW decrease by a factor of 2 and nC increases by a factor of 2 from block to block)
- What is the key idea of ResNets?
- Describe the concept of skip connection. What is a residual block?
- Write the equations for a residual block?
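A sketch of the residual block equations for the fully connected case (W1, b1, W2, b2 are the two skipped layers' parameters; convolutional blocks follow the same pattern):

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def residual_block(a_l, W1, b1, W2, b2):
    """a[l+2] = relu(z[l+2] + a[l]): the shortcut skips two layers."""
    a1 = relu(W1 @ a_l + b1)
    z2 = W2 @ a1 + b2
    return relu(z2 + a_l)  # identity shortcut; shapes must match
```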
- From where does the "short-cut" path start and where does it connect? (After a ReLU and before another ReLU layer in the later part of the network)
- What is the primary advantage of ResNets? (Allows training of very deep architectures without any gradient explosion/vanishing. For a "plain network" the training error decreases and then increases with the number of layers.)
- Read this paper.
- Why does ResNet work? (If the activation is ReLU, then it is easy for the ResNet to learn the identity function. Therefore adding more layers doesn't hurt: either a layer helps, or just the identity function is learned. In contrast, it is difficult for "plain networks" to learn identity functions.)
- In context of ResNets why is typically "same" convolution used? (To avoid dimension mismatches)
- What happens when there is a dimension mismatch during the addition operation in ResNets? (Typically a matrix pre-multiplies a[l] - it can be learned, or fixed to implement zero padding)
- How do we handle the presence of pooling layer in a residual block since it will lead to a dimension mismatch? (same as above)
- Explain the concept of NIN or 1x1 convolutions.
- How is it different from a fully connected layer?
- Where is it typically used? (1. If the network depth has become huge then NIN can be used to shrink it. In contrast, the height and width of the volume is reduced by pooling layers or convs with strides >1. 2. Even if the output volume has same or more depth, it adds another level of nonlinearities to be learnt thereby increasing the model complexity.)
- What is the key motivation behind inception networks? (Allows exploring different filter configurations at once and then concatenating the results)
- What is a "bottleneck layer"? (Describe the problem of computation cost (Eg. 28x28x192 -> 28x28x32 for 64@5x5) and then say how 1x1 convolutions (Eg. 28x28x192 -> 28x28x16 for 16@1x1 -> 28x28x32 for 32@5x5) can be used to reduce it)
- Does the "bottlenect layer" hurt performance of a network? (No, if shrinking is done withing reason)
- What is an inception module? Describe the reason for each element of the blocks.
- Why are the 1x1 convolution layers present in the module? (bottleneck layers to reduce computation cost)
- Why does the max-pool layer have a stride of 1? (To keep dimensions consistent with its parallel layers for channel concatenation of output volumes)
- What is an inception network? (Stacks a bunch of inception modules)
- Why are there some side branches in the inception network? (They try to perform prediction based on intermediate features -> can be used to detect overfitting)
- Use GitHub
- How to do transfer learning on smaller data? (Get a model trained on similar data. Freeze the conv layers i.e. the feature detector layers and train the fully connected layer)
- How to make transfer learning faster? (Since the first series of layers is fixed, pre-compute its output on the input data and save it to disk. Then use this to train a softmax regression.)
- How to do transfer learning on mid-sized datasets? (Freeze fewer layers. The trainable layers can be trained from scratch, or the trained weights can be used as the initializer. With more data, the number of layers to freeze will decrease.)
- How to do transfer learning on large datasets? (Unfreeze all layers. Treat trained weights as initializer and then do gradient descent. The final softmax layer has to be adjusted based on the number of classes.)
- What is the motivation behind data augmentation?
- What are some common data augmentation techniques? (Mirroring and random cropping are the two most common. Less common are rotation, shearing, local warping, etc. Another technique is color shifting.)
- What is color shifting? (Changing R, G, and B color randomly taken from a narrow distribution. Making robust to changes in the colors such as due to sunlight, night, etc.)
- How to do color augmentation? (PCA color augmentation - keeps the overall color/tint the same. If the image has more R and B than G, then it will more change R and B.)
- How to implement distortions during training? (Data is kept on disk. While loading, distortions are done by a few CPU threads and the results are passed on as a batch. Training on one batch and performing distortions on another can be done in parallel to increase computational efficiency.)
- When to do hand-engineering features?
- What to do when dataset is smaller? (transfer learning or feature engineering)
- What are some techniques to do well on benchmarks (that probably can't be used in a production context)? (Ensembles, multi-crop at test time, etc.)
- What is ensembling? (Train several networks independently and average their outputs)
- What is the 10-crop technique? (Run classifier on multiple versions of test images and average results)