A residual connection adds a layer's input directly to its output, so the block computes y = x + F(x) and the layer only has to learn the residual function F(x) rather than the complete transformation from x to y. The identity path also gives gradients a direct route back to earlier layers, which is what makes very deep networks trainable.
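As a minimal sketch of the idea, the block below wraps a small dense layer in a residual connection using only the Python standard library. The names `layer` and `residual_block`, the tanh activation, and the example weights are illustrative assumptions, not taken from any particular library or architecture.

```python
import math

def layer(x, weights, bias):
    """Hypothetical transformation F: a small dense layer followed by tanh."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]

def residual_block(x, weights, bias):
    """Return x + F(x): the layer output is added to the unchanged input."""
    fx = layer(x, weights, bias)
    return [xi + fi for xi, fi in zip(x, fx)]

x = [1.0, -2.0, 0.5]

# With all-zero parameters F(x) = 0, so the block reduces to the identity.
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3
y_identity = residual_block(x, zero_w, zero_b)

# With non-zero (arbitrary, illustrative) parameters the layer contributes
# only the difference between output and input.
w = [[0.1, 0.0, 0.2], [0.0, 0.3, 0.0], [0.5, 0.0, -0.1]]
b = [0.0, 0.1, -0.2]
y = residual_block(x, w, b)
residual = [yi - xi for yi, xi in zip(y, x)]
```

Note that the addition requires F(x) to have the same shape as x. When a layer starts near zero, the block starts near the identity function, which is one common intuition for why such blocks are easy to stack.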
Residual connections are essential for training transformers with dozens or hundreds of layers.
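To illustrate why depth is the difficulty, the following toy sketch compares the end-to-end gradient of a 50-layer chain of scalar layers with and without residual connections. It assumes a hypothetical layer f(h) = tanh(w * h) with a fixed weight w = 0.5. Real transformer layers are far more complex, so this shows only the multiplicative effect on gradients, not actual training behaviour.

```python
import math

def end_to_end_gradient(depth, w, x0, residual):
    """Return d(h_depth)/d(h_0) for a chain of scalar layers f(h) = tanh(w * h)."""
    h, grad = x0, 1.0
    for _ in range(depth):
        t = math.tanh(w * h)
        local = w * (1.0 - t * t)   # f'(h), derivative of tanh(w * h)
        if residual:
            grad *= 1.0 + local     # derivative of h + f(h)
            h = h + t
        else:
            grad *= local           # derivative of f(h) alone
            h = t
    return grad

plain_grad = end_to_end_gradient(depth=50, w=0.5, x0=1.0, residual=False)
residual_grad = end_to_end_gradient(depth=50, w=0.5, x0=1.0, residual=True)
print(plain_grad, residual_grad)
```

In the plain chain every factor is at most 0.5, so the product shrinks towards zero as depth grows. With the residual connection every factor is 1 + f'(h), which here is at least 1, so the gradient reaching the first layer does not vanish.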