# [Freshman Summer] Deep Learning for Computer Vision | CS231N

• Since I could only find the 2017 videos, I am following the 2017 version of the course.

Course Website: 🔗

Course Video: 🔗

Course Description

• Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into the details of deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement and train their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. Additionally, the final assignment will give them the opportunity to train and apply multi-million parameter networks on real-world vision problems of their choice. Through multiple hands-on assignments and the final course project, students will acquire the toolset for setting up deep learning tasks and practical engineering tricks for training and fine-tuning deep neural networks.

Assignments

• A senior student's notes: 🔗

## Lecture 1: Introduction to Convolutional Neural Networks for Visual Recognition

• Some early thinking about computer vision

• Early attempts at representing objects, moving beyond the "Block World" representation

• Insight: visual recognition can start from identifying an object's discriminative, relatively invariant parts

• Overall, object recognition is a central topic in computer vision
• The course focuses on Convolutional Neural Networks (CNN)
• Its concrete focus is the image classification problem
• It will also touch on object detection, image captioning, and related problems

## Lecture 2: Image Classification Pipeline

• Video: 🔗
• Slides: 🔗
• CIFAR-10 & CIFAR-100: 🔗
• Prerequisite: Python Numpy Tutorial: 🔗
• The semantic gap (between raw pixel values and semantic meaning)

### Data-Driven Approach

1. Collect a dataset of images and labels;
2. Use Machine Learning to train a classifier;
3. Evaluate the classifier on new images;
4. The resulting model has two core APIs: one for training (train) and one for prediction (predict)
5. Usually we don't mind a long training time, but we want prediction to be fast.
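
The two-API contract above can be sketched as a minimal classifier skeleton (the `NearestNeighbor` class below is an illustrative stand-in, not the assignment's exact code):

```python
import numpy as np

class NearestNeighbor:
    """Minimal classifier exposing the two core APIs: train() and predict()."""

    def train(self, X, y):
        # "Training" just memorizes all the data -- O(1), which is why
        # nearest neighbor is fast to train but slow to predict.
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        # For each test image, copy the label of the closest training image
        # under L1 distance.
        preds = np.empty(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            dists = np.sum(np.abs(self.X_train - X[i]), axis=1)
            preds[i] = self.y_train[np.argmin(dists)]
        return preds

# Tiny sanity check with fake 4-pixel "images"
clf = NearestNeighbor()
clf.train(np.array([[0, 0, 0, 0], [9, 9, 9, 9]]), np.array([0, 1]))
print(clf.predict(np.array([[1, 0, 0, 0], [8, 9, 9, 9]])))  # -> [0 1]
```

Note how the cost profile is exactly backwards from what item 5 asks for: training is instant, prediction scans the whole training set.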

• $$d_1(I_1,I_2)=\sum_{p}|I_1^{p}-I_2^{p}|$$

• If the image is rotated, the prediction will change.
• $$d_2(I_1,I_2)=\sqrt{\sum_{p}(I_1^p-I_2^p)^2}$$
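
Both distances are one line in NumPy (a small sketch; the pixel values are made up):

```python
import numpy as np

I1 = np.array([[56, 32], [10, 18]], dtype=np.float64)
I2 = np.array([[10, 20], [24, 17]], dtype=np.float64)

d1 = np.sum(np.abs(I1 - I2))          # L1 (Manhattan) distance -> 73.0
d2 = np.sqrt(np.sum((I1 - I2) ** 2))  # L2 (Euclidean) distance: sqrt OUTSIDE the sum
print(d1, d2)
```

The placement of the square root matters: taking it per pixel, as the garbled formula suggested, would just reproduce L1.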

K-Nearest Neighbors: 🔗

Interactive Demo: 🔗

Instead of copying label from nearest neighbor, take majority vote from K closest points.
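
The majority vote can be sketched like this (`knn_predict` is a hypothetical helper, not the assignment's code):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    dists = np.sum(np.abs(X_train - x), axis=1)  # L1 distance to every training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    votes = np.bincount(y_train[nearest])        # count the labels among them
    return np.argmax(votes)                      # most common label wins

X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1, 1]), k=3))  # -> 0
```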

• Of course, this distance-based classification scheme is not limited to images. It works for any data that needs classifying, e.g. text, as long as you can define a quantifiable "distance" and the corresponding rules.
• However, k-nearest neighbors is almost never used for image classification, mainly because prediction is slow in practice; because pixel-level distance metrics are a poor fit for images (they cannot fully capture the differences between images); and because of a more serious problem, the curse of dimensionality 🔗 (k-NN only works well when the training samples cover the space densely).

### Hyperparameters

• Choices about the algorithm that we set rather than learn.

• eg: best k to use; best distance to use (L1/L2);

Setting Hyperparameters (tuning)
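
The standard recipe: split off a validation set, pick the hyperparameter that does best on it, and touch the test set only once at the end. A sketch on synthetic data (`accuracy_for_k` is a made-up helper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Split into train / validation (never tune on the test set!)
X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

def accuracy_for_k(k):
    """Validation accuracy of k-NN (L1 distance, majority vote) for a given k."""
    correct = 0
    for xv, yv in zip(X_val, y_val):
        nearest = np.argsort(np.sum(np.abs(X_train - xv), axis=1))[:k]
        if np.argmax(np.bincount(y_train[nearest])) == yv:
            correct += 1
    return correct / len(y_val)

scores = {k: accuracy_for_k(k) for k in [1, 3, 5, 7]}
best_k = max(scores, key=scores.get)  # hyperparameter chosen on validation data
print(scores, best_k)
```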

### Linear Classification

Parametric Approach

$f(x,W)=Wx+b$

• That is, we build a function whose inputs are the image data $$x$$ and the weight parameters $$W$$, and whose output is a predicted score for each class
• Each class ends up with a template that tries to match as many samples of that class as possible

• From this viewpoint it is easy to see the limitation of a single linear classifier: for multimodal data, for example, a single linear template will struggle.
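
Concretely, for CIFAR-10 the image is stretched into a 3072-dimensional vector, $W$ is $10\times3072$, and $b$ has one entry per class; the output is one score per class. A tiny-scale sketch (dimensions shrunk for readability):

```python
import numpy as np

num_classes, num_pixels = 3, 4  # tiny stand-ins (CIFAR-10 uses 10 and 3072)
rng = np.random.default_rng(42)

W = rng.normal(size=(num_classes, num_pixels))  # one row (template) per class
b = rng.normal(size=num_classes)                # per-class bias
x = rng.uniform(0, 255, size=num_pixels)        # flattened image

scores = W @ x + b                              # f(x, W) = Wx + b
pred_class = int(np.argmax(scores))             # highest score wins
print(scores.shape, pred_class)
```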

## Lecture 3: Loss Functions and Optimization

• Linear classifier is an example of parametric classifier.
• One way to interpret $$W$$ in linear classification: each entry of the matrix is the weight of one pixel's contribution (in a single color channel) toward recognizing a particular class.

• A loss function that quantifies our unhappiness with the scores across the training data, tells how good our current classifier is.

• Given a dataset of examples $$\{(x_i,y_i)\}_{i=1}^N$$, where $$x_i$$ is an image and $$y_i$$ is its label, the loss over the dataset is the average of per-example losses: $$L = \frac{1}{N}\sum_i L_i(f(x_i,W),y_i)$$.
• The loss function measures how suitable a given $$W$$ is; we find the $$W$$ we consider best by searching the space of $$W$$ for the one that minimizes the loss.

### Multi-class SVM Loss

Let $$s = f(x_i,W)$$, then SVM loss is:

$$L_i = \sum_{j \neq y_i} \begin{cases} 0 & \text{if } s_{y_i} \geq s_j + 1\\ s_j-s_{y_i}+1 & \text{otherwise} \end{cases} = \sum_{j \neq y_i} \max(0,\, s_j-s_{y_i}+1)$$
• That is, for a given sample, if the score of its true class exceeds the score of some other class by at least a margin (the $$+1$$ above), that other class contributes $$0$$ to the loss; otherwise, the shortfall is added as that class's contribution to the loss.

```python
import numpy as np

def L_i_Vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0  # the true class should not contribute to its own loss
    loss_i = np.sum(margins)
    return loss_i
```

• Overview
$$f(x,W) = Wx,\qquad L = \frac{1}{N}\sum_{i=1}^{N}\sum_{j\neq y_i}\max(0,\,f(x_i;W)_j-f(x_i;W)_{y_i}+1)$$

• However, if we only compute the loss in terms of the training data, problems like overfitting will appear.

• The blue points are the model's training data; the green ones are validation or real-world data.
• The blue curve is the trend an overfit model learns: it matches the training data perfectly, yet the green curve is the trend we actually want.
• This behavior of fitting the training data too closely is called overfitting.

To solve it, we use regularization.

The regularization term.

Occam's Razor

• The regularization term encourages the model to pick a simpler $$W$$, where "simpler" depends on the task and the model.

So your standard loss function usually has two terms, a data loss and a regularization loss, and there's a hyperparameter (mentioned already in Lecture 2) here, lambda, which trades off between the two.

### Regularization

There are many regularizations used in practice, and the most common one is probably L2 regularization or weight decay.

In common use:

• L2 regularization: $$R(W) = \sum_{k}\sum_lW_{k,l}^2$$;
• L1 regularization: $$R(W) = \sum_k\sum_l |W_{k,l}|$$;
• Elastic net (L1+L2): $$R(W) = \sum_k\sum_l (\beta W_{k,l}^2+|W_{k,l}|)$$;
• Max norm regularization
• Dropout
• Fancier: Batch normalization, stochastic depth...

The whole idea of regularization is that anything you do to your model that penalizes its complexity, rather than explicitly trying to fit the training data, counts. Each regularizer has its own character; choose among them depending on the problem.
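
Putting the data loss and L2 regularization together in one function (a sketch; `svm_loss_regularized` and `lam`, standing in for λ, are illustrative names):

```python
import numpy as np

def svm_loss_regularized(W, X, y, lam=0.1):
    """Average multiclass SVM loss over a batch plus L2 regularization."""
    N = X.shape[0]
    scores = X @ W.T                              # (N, num_classes)
    correct = scores[np.arange(N), y][:, None]    # score of each true class
    margins = np.maximum(0, scores - correct + 1) # hinge margins
    margins[np.arange(N), y] = 0                  # true class contributes 0
    data_loss = margins.sum() / N                 # first term: fit the data
    reg_loss = lam * np.sum(W * W)                # second term: penalize complexity
    return data_loss + reg_loss

W = np.zeros((3, 4))
X = np.ones((2, 4))
y = np.array([0, 1])
print(svm_loss_regularized(W, X, y))  # all-zero W: every margin is 1 -> loss = 2.0
```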

### Softmax Classifier (Multinomial Logistic Regression)

• It normalizes the scores to a probability distribution.
• Then we just want the probability of the true class to be high, as close to one as possible.
• scores = unnormalized log probabilities of the classes
• $$P(Y=k|X=x_i)=\frac{e^{s_k}}{\sum_j e^{s_j}}\;,\;\;\text{where}\;s=f(x_i;W)$$;
• That is $$L_i=-\log P(Y=y_i|X=x_i)=-\log\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}$$;
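
A numerically stable sketch of this loss (shifting the scores by their max before `exp` avoids overflow without changing the probabilities; the score vector below is just an illustration):

```python
import numpy as np

def softmax_loss(s, y):
    """Cross-entropy loss -log P(Y = y) for one example's score vector s."""
    s = s - np.max(s)                  # stability shift: probabilities are unchanged
    p = np.exp(s) / np.sum(np.exp(s))  # softmax: normalize scores to a distribution
    return -np.log(p[y])

s = np.array([3.2, 5.1, -1.7])  # illustrative scores for three classes
print(softmax_loss(s, 0))       # loss is large when the true class's probability is low
print(softmax_loss(s, 1))       # and small when it is close to one
```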

eg for Softmax Classifier.

Compare the two.

• In practice, the SVM loss is satisfied once the true class's score exceeds the others by the margin, i.e. there is a hard cutoff; Softmax, by contrast, keeps pushing the probability of the correct class toward $$1$$, continually improving.

### Optimization

• Strategy #1: A first *very bad* idea solution: Random search
• Strategy #2: Gradient Descent / Follow the slope(gradient$$\nabla$$)
• Always use analytic gradient, but check implementation with numerical gradient. This is called a gradient check.
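
A gradient check compares the analytic gradient against a centered numerical estimate; a sketch on a toy loss where the analytic gradient is known:

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)  # toy loss

def analytic_grad(w):
    return 2 * w           # known gradient of sum(w^2)

def numerical_grad(f, w, h=1e-5):
    """Centered-difference estimate of the gradient, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w[i]
        w[i] = old + h; fp = f(w)
        w[i] = old - h; fm = f(w)
        w[i] = old
        grad[i] = (fp - fm) / (2 * h)
    return grad

w = np.array([0.5, -1.2, 3.0])
num = numerical_grad(f, w)
ana = analytic_grad(w)
rel_error = np.max(np.abs(num - ana) / (np.abs(num) + np.abs(ana)))
print(rel_error)  # a correct implementation gives a tiny relative error
```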

```python
# Vanilla Gradient Descent
while True:
    weights_grad = evaluate_gradient(loss_fun, data, weights)
    weights += - step_size * weights_grad  # perform parameter update
```

step_size is a hyperparameter here; it's also called the learning rate.

Stochastic Gradient Descent (SGD)

• The basic Gradient Descent gets messy when $$N$$ is large!
• So we approximate the sum using a minibatch of examples instead.
• Minibatch sizes of 32 / 64 / 128 are common.

```python
# Vanilla Minibatch Gradient Descent
while True:
    data_batch = sample_training_data(data, 256)  # sample 256 examples
    weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
    weights += - step_size * weights_grad  # perform parameter update
```


## Lecture 4: Introduction to Neural Networks

### Computational graphs

eg. for the linear classifier.

### Backpropagation

$$\left\{ \begin{aligned} &\frac{\partial{f}}{\partial{f}}=1,\\ &\frac{\partial{f}}{\partial{z}}=q=3,\\ &\frac{\partial{f}}{\partial{q}}=z=-4,\\ &\frac{\partial{f}}{\partial{x}}=\frac{\partial{f}}{\partial{q}}\frac{\partial{q}}{\partial{x}}=-4,\\ &\frac{\partial{f}}{\partial{y}}=\frac{\partial{f}}{\partial{q}}\frac{\partial{q}}{\partial{y}}=-4 \end{aligned} \right.$$

backprop (red lines)
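
The gradients above (for $f(x,y,z)=(x+y)z$ with $x=-2$, $y=5$, $z=-4$) can be replayed step by step in code:

```python
# Forward pass: f(x, y, z) = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y  # intermediate node: q = 3
f = q * z  # output: f = -12

# Backward pass: walk the graph in reverse, multiplying local gradients
df_df = 1.0          # base case
df_dz = q            # d(q*z)/dz = q = 3
df_dq = z            # d(q*z)/dq = z = -4
df_dx = df_dq * 1.0  # chain rule: dq/dx = 1
df_dy = df_dq * 1.0  # chain rule: dq/dy = 1
print(df_dx, df_dy, df_dz)  # -> -4.0 -4.0 3.0
```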

### Patterns in backward flow

• Always check: The gradient with respect to a variable should have the same shape as the variable.
• Because each element of your gradient is quantifying how much that element is contributing to your final output.

Summary

• Neural nets will be very large: impractical to write down gradient formula by hand for all parameters.
• Backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates.
• Implementations maintain a graph structure, where the nodes implement the forward() / backward() API.
• Forward: compute result of an operation and save any intermediates needed for gradient computation in memory.
• Backward: apply the chain rule to compute the *gradient* of the loss function with respect to the inputs.
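
A minimal node implementing that forward()/backward() contract (a multiply gate; the class name is illustrative):

```python
class MultiplyGate:
    """One computational-graph node: caches its inputs in forward,
    applies the chain rule in backward."""

    def forward(self, x, y):
        self.x, self.y = x, y  # save intermediates needed for the gradient
        return x * y

    def backward(self, dz):
        # Chain rule: upstream gradient dz times the local gradients
        dx = dz * self.y
        dy = dz * self.x
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)  # forward: compute the result
dx, dy = gate.backward(1.0)    # backward: route the upstream gradient to each input
print(out, dx, dy)
```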

### Neural Networks

Summary

• We arrange neurons into fully-connected layers.
• The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies).
• Neural networks are not really neural.
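
Stacking two fully-connected layers with a nonlinearity between them, vectorized so each layer is one matrix multiply (the dimensions below are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 5, 8, 16, 3  # batch size, input dim, hidden dim, classes

x = rng.normal(size=(N, D))
W1, b1 = rng.normal(size=(D, H)) * 0.01, np.zeros(H)
W2, b2 = rng.normal(size=(H, C)) * 0.01, np.zeros(C)

h = np.maximum(0, x @ W1 + b1)  # first layer + ReLU nonlinearity
scores = h @ W2 + b2            # second layer: class scores
print(scores.shape)             # (5, 3): one score per class per example
```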

## Lecture 5: Convolutional Neural Networks

### Convolutional Neural Networks

• Convolve the filter with the image. (i.e. "slide over the image spatially, computing dot products")
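
A naive single-channel version spelling out the "slide and take dot products" idea (as in CNNs, this is technically cross-correlation; `conv2d_naive` is an illustrative name):

```python
import numpy as np

def conv2d_naive(image, filt, stride=1):
    """Slide filt over image, computing a dot product at each spatial location."""
    H, W = image.shape
    f = filt.shape[0]
    out_h = (H - f) // stride + 1
    out_w = (W - f) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * filt)  # the dot product
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
filt = np.ones((3, 3)) / 9.0                # 3x3 averaging filter
print(conv2d_naive(image, filt).shape)      # 4x4 input, 3x3 filter -> 2x2 output
```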

• Zero padding does add some extraneous values at the borders, but it lets us process the border regions of the image and get sensible outputs there. It is not the only option: you can also mirror or extend the edge values instead. In practice, zero padding is one approach that works reasonably well.

Ans

• Each filter has $$5\times5\times3_{\text{(depth)}} + 1_{\text{(for bias)}} = 76$$ params.
• So the total number is $$76\times 10 = 760$$.
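
The count generalizes to filters × (f · f · depth + 1); a quick check of the arithmetic (`conv_params` is a throwaway helper):

```python
def conv_params(num_filters, f, depth):
    """Parameters in a conv layer: each filter has f*f*depth weights plus one bias."""
    return num_filters * (f * f * depth + 1)

print(conv_params(10, 5, 3))  # 10 filters of 5x5x3 -> 760
```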

### Pooling layer

• makes the representations smaller and more manageable
• operates over each activation map independently:

just downsample

A common way to do this is max pooling:

just take the max value
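
Max pooling with a 2×2 filter and stride 2 (the common setting) can be written in NumPy with a reshape trick:

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling, stride 2, on one activation map (H and W must be even)."""
    H, W = x.shape
    # Group pixels into 2x2 blocks, then take the max within each block
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool_2x2(x))  # -> [[6. 8.]
                        #     [3. 4.]]
```

Each 2×2 block collapses to its largest activation, halving both spatial dimensions while keeping the depth unchanged.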

### Fully Connected Layer (FC layer)

• Contains neurons that connect to the entire input volume, as in ordinary Neural Networks.