Principles for Building Regular Equivariant CNNs

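One of the principles is simply stated as "Let the kernel rotate", and this article focuses on how to apply it in your architectures. Equivariant architectures allow us to train models that are indifferent to certain group actions. To understand what this means exactly, let us train this simple CNN model on the MNIST dataset (a dataset of handwritten digits 0-9).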

import torch
from torch import nn


class SimpleCNN(nn.Module):

    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.cl1 = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
        self.max_1 = nn.MaxPool2d(kernel_size=2)
        self.cl2 = nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3, padding=1)
        self.max_2 = nn.MaxPool2d(kernel_size=2)
        self.cl3 = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=7)
        self.dense = nn.Linear(in_features=16, out_features=10)

    def forward(self, x: torch.Tensor):
        x = nn.functional.silu(self.cl1(x))
        x = self.max_1(x)
        x = nn.functional.silu(self.cl2(x))
        x = self.max_2(x)
        x = nn.functional.silu(self.cl3(x))
        x = x.view(len(x), -1)
        logits = self.dense(x)
        return logits


Accuracy on test	Accuracy on 90-degree rotated test
97.3%	15.1%

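_{Table 1: Test accuracy of the SimpleCNN model}

As expected, we get over 95% accuracy on the test dataset, but what if we rotate the images by 90 degrees? Without any countermeasures, the results are only slightly better than guessing. Such a model would be useless for general applications. In contrast, let us train a similar equivariant architecture with the same number of parameters, where the group actions are exactly the 90-degree rotations.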

Accuracy on test	Accuracy on 90-degree rotated test
96.5%	96.5%

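_{Table 2: Test accuracy of the EqCNN model with the same number of parameters as the SimpleCNN model}

The accuracy remains unchanged, and we did not even use data augmentation. These models become even more impressive with 3D data, but we will stick with this example to explore the core idea. If you want to test it yourself, all code, written in both PyTorch and JAX, is freely available under Github-Repo, and training with Docker or Podman takes just two commands. Have fun!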

So What is Equivariance?

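Equivariant architectures guarantee stability of features under certain group actions. Groups are simple structures whose elements can be combined, reversed, or do nothing. You can look up the formal definition on Wikipedia if you are interested. For our purposes, think of the group of 90-degree rotations acting on square images. We can rotate an image by 90, 180, 270, or 360 degrees. To reverse the action, we apply a 270, 180, 90, or 0-degree rotation, respectively. It is easy to see that we can combine, reverse, or do nothing with the elements of this group, which is denoted $C_4$. The following image visualizes all of these actions on an image.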

Figure 1: Rotated MNIST image by 90°, 180°, 270°, 360°, respectively
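
The three group properties (combine, reverse, do nothing) can be checked directly on a tiny image. The following is a minimal sketch in plain Python; `rot90` and `rotate` are hypothetical helper names chosen for illustration and are not taken from the article's repository.

```python
def rot90(img):
    """Rotate a square image (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rotate(img, k):
    """Apply the 90-degree rotation k times; k is taken modulo 4."""
    for _ in range(k % 4):
        img = rot90(img)
    return img


img = [[1, 2],
       [3, 4]]

# Combine: rotating by 90 and then by 180 degrees equals rotating by 270.
print(rotate(rotate(img, 1), 2) == rotate(img, 3))                    # True
# Reverse: k quarter turns are undone by 4 - k quarter turns.
print(all(rotate(rotate(img, k), 4 - k) == img for k in range(4)))    # True
# Do nothing: four quarter turns (360 degrees) return the original image.
print(rotate(img, 4) == img)                                          # True
```

Because every combination of quarter turns lands on one of only four orientations, a model needs to handle just four versions of each input to cover the whole group.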

Now, given an input image $x$, our CNN model classifier $f_\theta$, and an arbitrary 90-degree rotation $g$, the equivariant property can be expressed as
$f_\theta(\text{rotate } x \text{ by } g) = f_\theta(x)$

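Generally speaking, we want our image-based model to produce the same outputs when its input is rotated. Strictly speaking, this is invariance, the special case of equivariance in which the rotation leaves the output unchanged.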

As such, equivariant models promise us architectures with baked-in symmetries. In the following section, we will see how our principle can be applied to achieve this property.
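
The property can be tested numerically for any function of an image batch. The sketch below is illustrative only: `is_rotation_invariant`, `pixel_sum`, and `top_left_pixel` are hypothetical names, and the two toy functions stand in for a real classifier.

```python
import torch


def is_rotation_invariant(f, x, atol=1e-6):
    """Check f(rotate x by g) == f(x) for the three non-trivial 90-degree rotations g."""
    reference = f(x)
    return all(
        torch.allclose(f(torch.rot90(x, k=k, dims=(2, 3))), reference, atol=atol)
        for k in (1, 2, 3)
    )


# A toy batch with layout (batch, channel, height, width).
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)


def pixel_sum(t):
    # Summing all pixels ignores their positions, so rotations cannot change it.
    return t.sum(dim=(1, 2, 3))


def top_left_pixel(t):
    # Reading a fixed position depends on the orientation of the image.
    return t[:, 0, 0, 0]


print(is_rotation_invariant(pixel_sum, x))       # True
print(is_rotation_invariant(top_left_pixel, x))  # False
```

The same helper could be pointed at a trained model's forward pass to see whether it satisfies the property.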

How to Make Our CNN Equivariant

The problem is the following: When the image rotates, the features rotate too. But as already hinted, we could also compute the features for each rotation upfront by rotating the kernel.
We could actually rotate the kernel, but it is much easier to rotate the feature map itself, thus avoiding interference with PyTorch's autodifferentiation algorithm altogether.
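
That rotating the input and rotating the kernel are two views of the same computation can be verified numerically. This is a minimal sketch under stated assumptions: the tensors are random stand-ins for an image batch and a kernel, and `rot` is a hypothetical helper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 1, 28, 28, dtype=torch.float64)  # (batch, channel, height, width)
w = torch.randn(8, 1, 3, 3, dtype=torch.float64)    # (out, in, height, width)


def rot(t, k):
    """Rotate the two spatial axes by k quarter turns (negative k reverses)."""
    return torch.rot90(t, k=k, dims=(2, 3))


# Convolving the rotated image with the original kernel ...
via_rotated_input = F.conv2d(rot(x, 1), w)
# ... matches convolving the original image with the inversely rotated kernel
# and then rotating the resulting feature map.
via_rotated_kernel = rot(F.conv2d(x, rot(w, -1)), 1)

print(torch.allclose(via_rotated_input, via_rotated_kernel))  # True
```

Rotating the input keeps the kernel a plain learnable parameter, which is why the article takes that route.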

So, in code, our CNN kernel

x = nn.functional.silu(self.cl1(x))


now acts on all four rotated images:

x_0 = x
x_90 = torch.rot90(x, k=1, dims=(2, 3))
x_180 = torch.rot90(x, k=2, dims=(2, 3))
x_270 = torch.rot90(x, k=3, dims=(2, 3))

x_0 = nn.functional.silu(self.cl1(x_0))
x_90 = nn.functional.silu(self.cl1(x_90))
x_180 = nn.functional.silu(self.cl1(x_180))
x_270 = nn.functional.silu(self.cl1(x_270))


Or more compactly written as a 3D convolution:

self.cl1 = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(1, 3, 3))
...
x = torch.stack([x_0, x_90, x_180, x_270], dim=-3)
x = nn.functional.silu(self.cl1(x))
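
To see why the 3D convolution is only a compact notation, the sketch below compares it with a 2D convolution that shares the same weights. The variable names and the random input are assumptions for illustration; the kernel depth of 1 is what keeps the four rotations from mixing.

```python
import torch
from torch import nn

torch.manual_seed(0)
conv3d = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(1, 3, 3))
conv2d = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3)

# Share parameters: dropping the size-1 depth axis turns the 3D kernel into a 2D one.
with torch.no_grad():
    conv2d.weight.copy_(conv3d.weight.squeeze(2))
    conv2d.bias.copy_(conv3d.bias)

x = torch.randn(2, 1, 28, 28)
rotations = [torch.rot90(x, k=k, dims=(2, 3)) for k in range(4)]
stacked = torch.stack(rotations, dim=-3)  # (batch, channel, rotation, height, width)

with torch.no_grad():
    out3d = conv3d(stacked)
    # Each slice along the rotation axis equals the 2D convolution of that rotation.
    same = all(
        torch.allclose(out3d[:, :, k], conv2d(rotations[k]), atol=1e-4)
        for k in range(4)
    )

print(out3d.shape)  # torch.Size([2, 8, 4, 26, 26])
print(same)         # True
```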


The resulting equivariant model is only a few lines longer than SimpleCNN:

class EqCNN(nn.Module):

    def __init__(self):
        super(EqCNN, self).__init__()
        self.cl1 = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=(1, 3, 3))
        self.max_1 = nn.MaxPool3d(kernel_size=(1, 2, 2))
        self.cl2 = nn.Conv3d(in_channels=8, out_channels=16, kernel_size=(1, 3, 3))
        self.max_2 = nn.MaxPool3d(kernel_size=(1, 2, 2))
        self.cl3 = nn.Conv3d(in_channels=16, out_channels=16, kernel_size=(1, 5, 5))
        self.dense = nn.Linear(in_features=16, out_features=10)

    def forward(self, x: torch.Tensor):
        x_0 = x
        x_90 = torch.rot90(x, k=1, dims=(2, 3))
        x_180 = torch.rot90(x, k=2, dims=(2, 3))
        x_270 = torch.rot90(x, k=3, dims=(2, 3))

        x = torch.stack([x_0, x_90, x_180, x_270], dim=-3)
        x = nn.functional.silu(self.cl1(x))
        x = self.max_1(x)

        x = nn.functional.silu(self.cl2(x))
        x = self.max_2(x)

        x = nn.functional.silu(self.cl3(x))

        x = x.flatten(start_dim=2)  # (B, 16, 4, 1, 1) -> (B, 16, 4); unlike squeeze(), keeps the batch axis when B == 1
        x = torch.max(x, dim=-1).values
        logits = self.dense(x)
        return logits


But why is this equivariant to rotations?
First, observe that we get four copies of each feature map at each stage. At the end of the pipeline, we combine all of them with a max operation.

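This is key: the max operation is indifferent to where the rotated version of the feature ends up. Rotating the input merely reorders the four copies, and the maximum of a set of values does not depend on their order.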
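
The argument can be reproduced on a toy example in plain Python. All names here (`orbit`, `feature`, `pooled_feature`) are hypothetical, and the hand-written feature stands in for a learned feature map.

```python
def rot90(img):
    """Rotate a square image (list of rows) by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*img)][::-1]


def orbit(img):
    """Return the four rotated copies of img (0, 90, 180, 270 degrees)."""
    copies = [img]
    for _ in range(3):
        copies.append(rot90(copies[-1]))
    return copies


def feature(img):
    """A deliberately orientation-dependent feature: top row minus bottom row."""
    return sum(img[0]) - sum(img[-1])


def pooled_feature(img):
    """Evaluate the feature on every rotated copy and keep the maximum."""
    return max(feature(copy) for copy in orbit(img))


img = [[9, 1],
       [2, 3]]

print([feature(r) for r in orbit(img)])         # [5, -7, -5, 7]
print([pooled_feature(r) for r in orbit(img)])  # [7, 7, 7, 7]
```

The raw feature changes with every rotation, while the pooled value is the same for all four orientations because each orientation produces the same four responses in a different order.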

To understand what is happening, let us plot the feature maps after the first convolution stage.

_{Figure 2: Feature maps for all four rotations}