Lecture 1.3

Regular Group Convolutions

Now that we have the group-theoretical basics in place, we can construct the Group Convolution operator. We do this by generalizing the concept of "template matching" from simple translations to general groups.

1. Template Matching as Inner Products

Recall that the standard cross-correlation (often called convolution) of a feature map $f$ with a kernel $k$ at position $\mathbf{x}$ is an inner product between the signal $f$ and a copy of the kernel translated to $\mathbf{x}$:

Standard Cross-Correlation $$ (k \star f)(\mathbf{x}) = \langle \mathcal{T}_{\mathbf{x}} k, f \rangle_{\mathbb{L}_2(\mathbb{R}^2)} = \int_{\mathbb{R}^2} k(\tilde{\mathbf{x}} - \mathbf{x}) f(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} $$

Here $\mathcal{T}_{\mathbf{x}}$ is the translation operator (a representation of the translation group), acting as $(\mathcal{T}_{\mathbf{x}} k)(\tilde{\mathbf{x}}) = k(\tilde{\mathbf{x}} - \mathbf{x})$. The value at $\mathbf{x}$ answers: "how well does the kernel match the signal at location $\mathbf{x}$?"
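As a sanity check, here is a minimal NumPy sketch of this inner-product view (the function name `cross_correlate` is our choice, not a fixed API):

```python
import numpy as np

def cross_correlate(k, f):
    """Cross-correlation (k ★ f)(x) = <T_x k, f> on a discrete grid.

    k: (kh, kw) kernel, f: (H, W) image. 'Valid' positions only, so
    every translated copy of k fits inside f.
    """
    kh, kw = k.shape
    H, W = f.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Inner product of the kernel with the patch under it,
            # i.e. <T_{(x,y)} k, f> restricted to the kernel's support.
            out[y, x] = np.sum(k * f[y:y + kh, x:x + kw])
    return out
```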

2. Lifting Convolution

To detect features not just at every position, but at every pose (e.g., rotation), we simply replace the translation operator with a general group action. This creates the Lifting Convolution.

Let $G = SE(2)$ be the roto-translation group. A kernel $k$ (e.g., an edge detector) can be transformed by any $g = (\mathbf{x}, \theta) \in G$. We then compute the match score for every such transformation:

Lifting Correlation $$ [k \star f](g) = \langle \mathcal{L}_g k, f \rangle_{\mathbb{L}_2(\mathbb{R}^2)} = \int_{\mathbb{R}^2} k(g^{-1} \odot \tilde{\mathbf{x}}) f(\tilde{\mathbf{x}}) d\tilde{\mathbf{x}} $$

Input: $f: \mathbb{R}^2 \to \mathbb{R}$ (2D Image)
Output: $F: G \to \mathbb{R}$ (Function on the group SE(2))

The output is a 3D feature map (position $x, y$ plus orientation $\theta$). Because responses at different orientations land in different $\theta$-slices, it can express statements like "there is an edge at (10, 10) oriented vertically" or "there is an edge at (20, 20) oriented horizontally".

Implementation Trick for Affine Groups

Since $SE(2) \cong \mathbb{R}^2 \rtimes SO(2)$, a group element $g$ splits into a rotation $h$ and a translation $\mathbf{x}$. The operator $\mathcal{L}_g$ can be applied in two steps:

  1. Rotate the kernel by $h$.
  2. Translate the rotated kernel by $\mathbf{x}$.

This means a lifting convolution can be implemented efficiently as a standard 2D convolution using a filter bank of rotated kernels!
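To see why, write the action of $g = (\mathbf{x}, \theta)$ on the plane as $g \odot \tilde{\mathbf{x}} = R_\theta \tilde{\mathbf{x}} + \mathbf{x}$, so that $g^{-1} \odot \tilde{\mathbf{x}} = R_\theta^{-1}(\tilde{\mathbf{x}} - \mathbf{x})$. Then

$$ (\mathcal{L}_g k)(\tilde{\mathbf{x}}) = k(g^{-1} \odot \tilde{\mathbf{x}}) = k(R_\theta^{-1}(\tilde{\mathbf{x}} - \mathbf{x})) = (\mathcal{T}_{\mathbf{x}} k_\theta)(\tilde{\mathbf{x}}), \qquad k_\theta(\tilde{\mathbf{x}}) := k(R_\theta^{-1} \tilde{\mathbf{x}}) $$

Substituting this into the lifting correlation gives: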

$$ [k \star f](\mathbf{x}, \theta) = (k_\theta \star_{\mathbb{R}^2} f)(\mathbf{x}) $$
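A minimal NumPy/SciPy sketch of this trick (the function name `lifting_conv`, the angle grid, and the interpolation settings are our choices, not a fixed API):

```python
import numpy as np
from scipy.ndimage import rotate
from scipy.signal import correlate2d

def lifting_conv(k, f, num_angles=8):
    """Lifting correlation: returns F(x, theta_i) of shape
    (num_angles, H, W) for theta_i = i * 360 / num_angles degrees."""
    out = []
    for i in range(num_angles):
        theta = i * 360.0 / num_angles
        # k_theta(x) = k(R_theta^{-1} x): spatially rotate the kernel.
        # (The sign convention depends on the image axes orientation.)
        k_theta = rotate(k, angle=theta, reshape=False, order=1)
        # Standard planar cross-correlation with the rotated kernel.
        out.append(correlate2d(f, k_theta, mode="same"))
    return np.stack(out)  # position (H, W) + orientation axis
```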

3. Group Convolution

Once we have a feature map on the group (e.g., after the first lifting layer), subsequent layers must process this group-structured data. This is the Group Convolution.

Group Convolution $$ [K \star F](g) = \langle \mathcal{L}_g K, F \rangle_{\mathbb{L}_2(G)} = \int_{G} K(g^{-1} \cdot \tilde{g}) F(\tilde{g}) d\mu(\tilde{g}) $$

Input: $F: G \to \mathbb{R}$ (Function on Group)
Output: $F': G \to \mathbb{R}$ (Function on Group)

Here, the kernel $K$ is also a function on the group. It defines a pattern of relative poses. For example, "a face" is a pattern where "eyes" (at a certain relative pose to the nose) and "mouth" (at another relative pose) co-occur.
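For intuition, here is a minimal sketch of a discrete group convolution on $p4$ (translations plus $90°$ rotations), assuming feature maps and kernels stored as `(4, H, W)` and `(4, kh, kw)` arrays; the function names and rotation sign conventions are our choices:

```python
import numpy as np
from scipy.signal import correlate2d

def rotate_group_kernel(K, r):
    """Apply L_h, h = rotation by r*90 degrees, to a kernel K on p4.
    Using h^{-1}(s, x) = (s - r mod 4, R_r^{-1} x): cyclically shift
    the orientation axis and spatially rotate each orientation plane."""
    return np.stack([np.rot90(K[(s - r) % 4], k=r) for s in range(4)])

def group_conv(K, F):
    """Group cross-correlation [K ★ F](g) on p4. The integral over G
    becomes a sum over the 4 rotations plus a planar cross-correlation
    over translations. F: (4, H, W), K: (4, kh, kw) -> (4, H, W)."""
    out = np.zeros_like(F)
    for r in range(4):
        Kr = rotate_group_kernel(K, r)
        out[r] = sum(correlate2d(F[s], Kr[s], mode="same")
                     for s in range(4))
    return out
```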

4. Regular G-CNN Architecture

A typical Regular Group CNN consists of:

  • Lifting Layer: $\mathbb{R}^2 \to G$. Lifts the image to a higher-dimensional group space (e.g., position + orientation).
  • Group Convolution Layers: $G \to G$. Processes features while preserving the group structure (equivariance). Matches patterns of patterns.
  • Projection Layer: $G \to \mathbb{R}^2$ or global pooling. Max-pooling over the rotation subgroup gives rotation-invariant features at every position (see the sketch below).
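Putting the pieces together, a toy forward pass that reuses the hypothetical `lifting_conv` and `group_conv` sketches from above, with `num_angles=4` so the orientation axes match:

```python
import numpy as np

f = np.random.randn(32, 32)           # input image on R^2
k = np.random.randn(5, 5)             # lifting kernel on R^2
K = np.random.randn(4, 5, 5)          # group kernel on C4 x Z^2

F = lifting_conv(k, f, num_angles=4)  # lift: R^2 -> G, shape (4, 32, 32)
F = np.maximum(F, 0.0)                # pointwise ReLU preserves equivariance
F = group_conv(K, F)                  # group conv: G -> G, shape (4, 32, 32)
invariant = F.max(axis=0)             # project: max over rotations, (32, 32)
```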

Key Result

A linear layer is equivariant if and only if it is a group convolution.

This means group convolutions are, under some technical conditions, the only way to process data both linearly and equivariantly. This result is often summarized as "Group Convolutions are All You Need" for equivariant deep learning.
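The easy direction (a group convolution is equivariant) can be checked directly. With $(\mathcal{L}_h F)(\tilde{g}) = F(h^{-1}\tilde{g})$, substitute $\tilde{g} \mapsto h\tilde{g}'$ and use left-invariance of the Haar measure $\mu$:

$$ [K \star \mathcal{L}_h F](g) = \int_{G} K(g^{-1}\tilde{g}) F(h^{-1}\tilde{g}) \, d\mu(\tilde{g}) = \int_{G} K((h^{-1}g)^{-1}\tilde{g}') F(\tilde{g}') \, d\mu(\tilde{g}') = \mathcal{L}_h [K \star F](g) $$

That is, transforming the input by $h$ and then convolving gives the same result as convolving first and then transforming the output.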