Hello, everyone.
Today I would like to share a really impressive open source project: a deep learning framework built entirely with NumPy, whose syntax is basically the same as PyTorch's.
Today we will take a simple convolutional neural network as an example and analyze, at the source-code level, the core steps of the training process: forward propagation, backpropagation, and parameter optimization.
The dataset and code used here have been packaged; you can find how to obtain them at the end of the article.
First, prepare the data and code.
Download the framework source code from https://github.com/duma-repo/PyDyNet:
git clone https://github.com/duma-repo/PyDyNet.git
Then build the LeNet convolutional neural network and train a three-class classification model.
Just create the code file directly in the PyDyNet directory.
from pydynet import nn

class LeNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5)
        self.avg_pool = nn.AvgPool2d(kernel_size=2, stride=2, padding=0)
        self.sigmoid = nn.Sigmoid()
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 3)

    def forward(self, x):
        x = self.conv1(x)
        x = self.sigmoid(x)
        x = self.avg_pool(x)
        x = self.conv2(x)
        x = self.sigmoid(x)
        x = self.avg_pool(x)
        x = x.reshape(x.shape[0], -1)
        x = self.fc1(x)
        x = self.sigmoid(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        x = self.fc3(x)
        return x
As you can see, the network is defined with exactly the same syntax as in PyTorch.
The source code I provide also includes a summary function for printing the network structure.
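As a quick sanity check (my own sketch, not part of the packaged code, and assuming Tensor is importable from the top-level pydynet package), you can push a dummy single-channel 28x28 batch through the network and confirm the output shape is (batch, 3):

import numpy as np
from pydynet import Tensor   # assumption: Tensor is exported at the package level

net = LeNet()
dummy = Tensor(np.random.randn(4, 1, 28, 28).astype(np.float32))
logits = net(dummy)
print(logits.shape)   # expected: (4, 3)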
The training data comes from the Fashion-MNIST dataset, which contains 10 categories of images with 6,000 training images per category.
To speed up training, I extracted only the first 3 categories, 18,000 training images in total, to build a three-class model.
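The article does not show the filtering step, but it boils down to a simple NumPy mask. A minimal sketch (my own, with random stand-in data shaped like Fashion-MNIST; the real loader is not shown here):

import numpy as np

def keep_first_k_classes(images: np.ndarray, labels: np.ndarray, k: int = 3):
    """Keep only the samples whose label is 0 .. k-1 (here: the first 3 classes)."""
    mask = labels < k
    return images[mask], labels[mask]

# Stand-in arrays with the same shapes as Fashion-MNIST training data:
images = np.random.rand(60000, 1, 28, 28).astype(np.float32)
labels = np.random.randint(0, 10, size=60000)
sub_images, sub_labels = keep_first_k_classes(images, labels, k=3)
print(sub_images.shape[0])   # about 18,000 on the real dataset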
import pydynet
from pydynet import nn
from pydynet import optim

lr, num_epochs = 0.9, 10
optimizer = optim.SGD(net.parameters(), lr=lr)
loss = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    net.train()
    for i, (X, y) in enumerate(train_iter):
        optimizer.zero_grad()
        y_hat = net(X)
        l = loss(y_hat, y)
        l.backward()
        optimizer.step()

        with pydynet.no_grad():
            metric.add(l.numpy() * X.shape[0], accuracy(y_hat, y), X.shape[0])
The training code is also identical to PyTorch's.
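The loop above references a couple of helpers (train_iter, metric, accuracy) that are not shown. As an illustration only, the accuracy counter might look like this (my sketch, assuming the labels are also tensors with a .numpy() method; the packaged code may differ):

import numpy as np

def accuracy(y_hat, y):
    """Count of correct predictions in the batch (y_hat: logits tensor, y: label tensor)."""
    preds = y_hat.numpy().argmax(axis=1)
    return float((preds == y.numpy()).sum())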
The key part comes next: digging into the framework's source code to understand the principles behind model training.
Before training starts, net.train() is called.
def train(self, mode: bool = True):
    set_grad_enabled(mode)
    self.set_module_state(mode)
As you can see, it sets grad (gradient tracking) to True, so tensors created afterward can carry gradients. Once a tensor carries a gradient, it is placed into the computation graph and waits for the backward pass to compute its gradient.
Here is the code behind with no_grad():
class no_grad:
    def __enter__(self) -> None:
        self.prev = is_grad_enable()
        set_grad_enabled(False)
It sets grad (gradient tracking) to False, so tensors created inside the block are not placed in the computation graph and no gradients need to be computed for them, which speeds up inference.
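To make the mechanism concrete, here is a minimal self-contained stand-in for this pattern (my own sketch, not the framework's exact code): a context manager that flips a global "grad enabled" flag on entry and restores the previous value on exit.

_grad_enabled = True   # stand-in for the framework's global switch

class no_grad_demo:
    def __enter__(self):
        global _grad_enabled
        self.prev = _grad_enabled   # remember the previous state
        _grad_enabled = False       # disable gradient tracking inside the block

    def __exit__(self, exc_type, exc_value, traceback):
        global _grad_enabled
        _grad_enabled = self.prev   # restore the previous state on exit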
We often see net.eval() used in PyTorch, so let's look at its source code as well.
def eval(self):
    return self.train(False)
As you can see, it simply calls train(False) to turn off gradient tracking, with an effect similar to no_grad().
So the usual pattern is: call train() before training to enable gradients, and call eval() after training to disable them so inference runs fast.
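In practice the pattern looks like this (a sketch based on the calls already shown above; X_test is a hypothetical held-out batch):

net.train()                # enable gradient tracking for the training loop
# ... run the training loop shown above ...

net.eval()                 # switch off gradient tracking for inference
with pydynet.no_grad():    # an explicit no_grad block works as well
    y_pred = net(X_test)   # X_test: hypothetical test batch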
Besides computing the class probabilities, the most important job of forward propagation is to organize the network's tensors into a computation graph, in forward order, so that the gradient of each tensor can be computed during backpropagation.
In a neural network, a tensor not only stores data but also computes and stores gradients.
Let's take the first convolution layer as an example to see how the computation graph is generated.
def conv2d(x: tensor.Tensor,
           kernel: tensor.Tensor,
           padding: int = 0,
           stride: int = 1):
    '''2D convolution function'''
    N, _, _, _ = x.shape
    out_channels, _, kernel_size, _ = kernel.shape
    pad_x = __pad2d(x, padding)
    col = __im2col2d(pad_x, kernel_size, stride)
    out_h, out_w = col.shape[-2:]
    col = col.transpose(0, 4, 5, 1, 2, 3).reshape(N * out_h * out_w, -1)
    col_filter = kernel.reshape(out_channels, -1).T
    out = col @ col_filter
    return out.reshape(N, out_h, out_w, -1).transpose(0, 3, 1, 2)
Here x is the input image and does not need its gradient recorded; kernel holds the convolution kernel's weights and does need its gradient computed.
So the new tensor produced by pad_x = __pad2d(x, padding) also carries no gradient and does not need to be added to the computation graph.
The tensor produced by kernel.reshape(out_channels, -1), however, needs its gradient computed and must be added to the computation graph.
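Before moving on to how tensors join the graph, a quick aside on the im2col trick used in conv2d above: it unfolds every sliding window into one row, so the whole convolution becomes a single matrix multiplication. A standalone NumPy sketch of the idea (my own simplified version, not the framework's __im2col2d):

import numpy as np

def im2col_conv2d(x, kernel, stride=1, padding=0):
    """Naive im2col convolution. x: (N, C, H, W), kernel: (OC, C, K, K) -> (N, OC, OH, OW)."""
    N, C, H, W = x.shape
    OC, _, K, _ = kernel.shape
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    out_h = (H + 2 * padding - K) // stride + 1
    out_w = (W + 2 * padding - K) // stride + 1
    # Unfold every K x K window into one row of length C*K*K.
    cols = np.empty((N, out_h * out_w, C * K * K), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, :, i * stride:i * stride + K, j * stride:j * stride + K]
            cols[:, i * out_w + j, :] = patch.reshape(N, -1)
    # One matrix multiplication replaces the whole sliding-window loop.
    out = cols @ kernel.reshape(OC, -1).T            # (N, out_h*out_w, OC)
    return out.transpose(0, 2, 1).reshape(N, OC, out_h, out_w)

# Example: a 1x1x6x6 image with a single 3x3 kernel gives a 1x1x4x4 output.
img = np.arange(36, dtype=np.float64).reshape(1, 1, 6, 6)
w = np.ones((1, 1, 3, 3))
print(im2col_conv2d(img, w).shape)   # (1, 1, 4, 4)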
Let's take a look at how a tensor joins the graph:
def reshape(self, *new_shape):
    return reshape(self, new_shape)

class reshape(UnaryOperator):
    '''
    Tensor reshape operator, overloaded in Tensor.

    Parameters
    ----------
    new_shape : tuple
        The target shape, used the same way as in NumPy.
    '''
    def __init__(self, x: Tensor, new_shape: tuple) -> None:
        self.new_shape = new_shape
        super().__init__(x)

    def forward(self, x: Tensor):
        return x.data.reshape(self.new_shape)

    def grad_fn(self, x: Tensor, grad: np.ndarray):
        return grad.reshape(x.shape)
The reshape function returns a reshape object. The reshape class inherits from UnaryOperator, and its __init__ calls the parent class's initializer.
class UnaryOperator(Tensor):
    def __init__(self, x: Tensor) -> None:
        if not isinstance(x, Tensor):
            x = Tensor(x)
        self.device = x.device
        super().__init__(
            data=self.forward(x),
            device=x.device,
            # here requires_grad evaluates to True
            requires_grad=is_grad_enable() and x.requires_grad,
        )
The UnaryOperator class inherits from Tensor, so the reshape object is itself a tensor.
In UnaryOperator's __init__, the Tensor initializer is called with requires_grad=True, which means this tensor's gradient needs to be computed.
requires_grad is computed as is_grad_enable() and x.requires_grad. is_grad_enable() was set to True by train(), and x here is the convolution kernel, whose requires_grad is also True.
class Tensor:
    def __init__(
        self,
        data: Any,
        dtype=None,
        device: Union[Device, int, str, None] = None,
        requires_grad: bool = False,
    ) -> None:
        if self.requires_grad:
            # Nodes that do not need gradients do not appear in the dynamic computation graph
            Graph.add_node(self)
Finally, in the Tensor class's initializer, Graph.add_node(self) is called to add the current tensor to the computation graph.
Similarly, any new tensor produced by operating on a tensor whose requires_grad=True will also be placed in the computation graph.
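A tiny illustration of that propagation (my own sketch; it assumes Tensor is importable from the top-level pydynet package, and uses only the reshape overload shown above):

import numpy as np
from pydynet import Tensor   # assumption: Tensor is exported at the package level

w = Tensor(np.ones((2, 3)), requires_grad=True)
y = w.reshape(3, 2)        # reshape returns an operator object, which is itself a Tensor
print(y.requires_grad)     # True -> y was added to the computation graph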
One convolution operation adds 6 nodes to the computation graph.
After one forward pass is complete, backpropagation starts from the last node in the computation graph and works from back to front.
l = loss(y_hat, y)
l.backward()
After propagating layer by layer through the forward network, the computation finally reaches the loss tensor l.
Taking l as the starting point and propagating backward through the graph, the gradient of every node in the computation graph can be computed.
The core code of backward is as follows:
def backward(self, retain_graph: bool = False):
    for node in Graph.node_list[y_id::-1]:
        grad = node.grad
        for last in [l for l in node.last if l.requires_grad]:
            add_grad = node.grad_fn(last, grad)
            last.grad += add_grad
Graph.node_list[y_id::-1] traverses the computation graph in reverse order, starting from the position of l (index y_id).
node is each tensor that was placed into the computation graph during forward propagation.
node.last holds the direct parent nodes from which the current tensor was produced.
node.grad_fn is called to compute the gradient, which is then passed back to the node's parents.
grad_fn is simply the tensor's derivative formula, for example:
class pow(BinaryOperator):
    '''
    Power operator, overloaded in the Tensor class.

    See also
    --------
    add : addition operator
    '''
    def grad_fn(self, node: Tensor, grad: np.ndarray):
        if node is self.last[0]:
            return (self.data * self.last[1].data / node.data) * grad
The expression after return is just the power-function derivative formula.
For example, if y = x^2, the derivative with respect to x is 2x.
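A quick NumPy check that the factor in grad_fn matches the textbook rule (my own sketch): for y = x**n, the value y * n / x equals n * x**(n-1).

import numpy as np

x, n = np.array(3.0), 2.0
y = x ** n                    # y = x^n = 9
print(y * n / x)              # the grad_fn factor: 9 * 2 / 3 = 6.0
print(n * x ** (n - 1))       # the textbook derivative n * x^(n-1) = 6.0, same value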
Once backpropagation has computed the gradients, the optimizer can be called to update the model parameters.
l.backward()
optimizer.step()
In this training run we optimize the parameters with the SGD (stochastic gradient descent) algorithm; the update process is as follows:
def step(self):
    for i in range(len(self.params)):
        grad = self.params[i].grad + self.weight_decay * self.params[i].data
        self.v[i] *= self.momentum
        self.v[i] += self.lr * grad
        self.params[i].data -= self.v[i]
        if self.nesterov:
            self.params[i].data -= self.lr * grad
self.params holds the weights of the entire network; it was passed in when SGD was initialized.
The two core lines of the step function are self.v[i] += self.lr * grad and self.params[i].data -= self.v[i]: the parameter is updated as current parameter minus learning rate times gradient.
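Stripped of momentum, weight decay, and Nesterov, step() reduces to the classic rule below (a sketch relying only on the .data and .grad attributes seen above, and on the net and lr from the earlier training code):

for p in net.parameters():      # the same parameter list that was passed to optim.SGD
    p.data -= lr * p.grad       # parameter = parameter - learning_rate * gradient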
This is basic machine-learning material that we should already be familiar with.
That completes a full walkthrough of one training pass. You can add print statements, or step through the code with a debugger, to trace how each line executes and get a deeper feel for the training process.