Exercise week 47-48#

November 17-28, 2025

Date: Deadline is Friday November 28 at midnight

Overarching aims of the exercises this week#

The exercise set this week is meant as a summary of many of the central elements of the various machine learning algorithms we have discussed throughout the semester. You don’t need to answer all questions.

Linear and logistic regression methods#

Question 1:#

Which of the following is not an assumption of ordinary least squares linear regression?

  • The relationship between the predictors/features and the target/output is linear

  • The inputs/features are distributed according to a normal/Gaussian distribution

Question 2:#

The mean squared error cost function for linear regression is convex in the parameters, guaranteeing a unique global minimum. True or False? Motivate your answer.

Question 3:#

Which statement about logistic regression is false?

  • Logistic regression is used for binary classification.

  • It uses the sigmoid function to map linear scores to probabilities.

  • It has an analytical closed-form solution.

  • Its log-loss (cross-entropy) is convex.

Question 4:#

Logistic regression produces a linear decision boundary in the input space. True or False? Explain.

Question 5:#

Give two reasons why logistic regression is preferred over linear regression for binary classification.

Neural networks#

Question 6:#

Which statement is not true for fully-connected neural networks?

  • Without nonlinear activation functions they reduce to a single linear model.

  • Training relies on backpropagation using the chain rule.

  • A single hidden layer can approximate any continuous function on a compact set.

  • The loss surface of a deep neural network is convex.

Question 7:#

Using sigmoid activations in many layers of a deep neural network can cause vanishing gradients. True or False? Explain.

Question 8:#

Describe the vanishing gradient problem: Why does it occur? Mention one technique to mitigate it and explain briefly.
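As a starting point for your explanation, here is a minimal NumPy sketch (an illustration, not part of the exercise) that assumes pre-activations of order one at every layer and multiplies the successive sigmoid derivatives picked up during backpropagation, showing how the gradient factor shrinks with depth.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assume pre-activations of order one at every layer (an illustrative choice).
z = 1.0
local_grad = sigmoid(z) * (1.0 - sigmoid(z))  # sigmoid'(z) is at most 0.25

# The backpropagated signal picks up one such factor per layer.
for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: gradient factor ~ {local_grad**depth:.2e}")
```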

Question 9:#

Consider a fully-connected network with layer sizes \(n_0\) (the input layer), \(n_1\) (the first hidden layer), \(\dots, n_L\), where \(n_L\) is the output layer. Derive a general formula for the total number of trainable parameters (weights + biases).
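As a sanity check for your derived formula, the sketch below uses an arbitrary example architecture (not taken from the exercise) and counts weights and biases layer by layer by enumerating the shapes of the weight matrices and bias vectors.

```python
# Count trainable parameters of a fully-connected network by enumerating
# the shape of each layer's weight matrix and bias vector.
# The layer sizes below are an arbitrary example, chosen only for illustration.
layer_sizes = [3, 4, 4, 2]  # n_0 (input), n_1, ..., n_L (output)

total = 0
for n_prev, n_next in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_prev * n_next   # one weight per connection
    biases = n_next             # one bias per neuron in the receiving layer
    total += weights + biases

print("total trainable parameters:", total)  # compare with your formula
```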

Convolutional Neural Networks#

Question 10:#

Which of the following is not a typical property or advantage of CNNs?

  • Local receptive fields

  • Weight sharing

  • More parameters than fully-connected layers

  • Pooling layers offering some translation invariance

Question 11:#

Using zero-padding in convolutional layers can preserve the input spatial dimensions when using a \(3 \times 3\) kernel/filter, stride 1, and padding \(P = 1\). True or False?

Question 12:#

Given input width \(W\), kernel size \(K\), stride \(S\), and padding \(P\), derive the formula for the output width \(W_{\text{out}} = \frac{W - K + 2P}{S} + 1\).
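If you want to check your derivation numerically, the sketch below uses arbitrary example values for \(W\), \(K\), \(S\), and \(P\), counts the valid kernel positions directly by sliding over the padded input, and compares the count with the closed-form expression.

```python
def output_width(W, K, S, P):
    """Closed-form output width for a 1D convolution."""
    return (W - K + 2 * P) // S + 1

def count_positions(W, K, S, P):
    """Count valid kernel placements by sliding over the padded input."""
    padded = W + 2 * P
    return sum(1 for start in range(0, padded - K + 1, S))

# Arbitrary example values, chosen only for illustration.
for W, K, S, P in [(32, 3, 1, 1), (28, 5, 2, 0), (64, 7, 2, 3)]:
    assert output_width(W, K, S, P) == count_positions(W, K, S, P)
    print(W, K, S, P, "->", output_width(W, K, S, P))
```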

Question 13:#

A convolutional layer has \(C_{\text{in}}\) input channels, \(C_{\text{out}}\) output channels (filters), and a kernel of size \(K_h \times K_w\). Compute the number of trainable parameters, including biases.
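As with the previous question, you can verify your expression by explicitly enumerating the filter weights and biases; the sketch below uses arbitrary example channel counts and kernel size.

```python
import numpy as np

# Arbitrary example values, chosen only for illustration.
C_in, C_out, K_h, K_w = 3, 16, 3, 3

# Each of the C_out filters has a weight tensor of shape (C_in, K_h, K_w)
# plus a single bias term.
weights = np.zeros((C_out, C_in, K_h, K_w))
biases = np.zeros(C_out)

n_params = weights.size + biases.size
print("trainable parameters:", n_params)  # compare with your formula
```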

Recurrent Neural Networks#

Question 14:#

Which statement about simple RNNs is false?

  • They maintain a hidden state updated each time step.

  • They use the same weight matrices at every time step.

  • They handle sequences of arbitrary length.

  • They eliminate the vanishing gradient problem.

Question 15:#

LSTMs mitigate the vanishing gradient problem by using gating mechanisms (input, forget, output gates). True or False? Explain.

Question 16:#

What is Backpropagation Through Time (BPTT) and why is it required for training RNNs?

Question 17:#

What does a sliding window do, and why would we use it?
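To make the question concrete, here is a minimal sketch, using a toy one-dimensional series chosen only for illustration, of how a sliding window turns a single long sequence into fixed-length input/target pairs of the kind an RNN can be trained on.

```python
import numpy as np

def sliding_windows(series, window_size):
    """Split a 1D series into overlapping windows and one-step-ahead targets."""
    X, y = [], []
    for start in range(len(series) - window_size):
        X.append(series[start:start + window_size])  # input window
        y.append(series[start + window_size])        # next value as target
    return np.array(X), np.array(y)

# Toy example series, used only for illustration.
series = np.sin(np.linspace(0, 4 * np.pi, 50))
X, y = sliding_windows(series, window_size=5)
print(X.shape, y.shape)  # (45, 5) and (45,)
```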