Layer normalization normalizes the activations across the feature dimension for each sample independently. Unlike batch normalization, it doesn't depend on batch statistics, making it ideal for transformers and variable-length sequences.
LayerNorm stabilizes training by holding each sample's activations near zero mean and unit variance at every layer, so the scale of the signal stays consistent throughout the network.
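As a rough illustration of the per-sample computation described above, here is a minimal NumPy sketch. The function name `layer_norm`, the `(batch, sequence, features)` shape, and `eps=1e-5` are illustrative choices, not taken from the text; framework implementations such as `torch.nn.LayerNorm` follow the same recipe with learnable `gamma` and `beta`.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each sample over its feature (last) axis,
    # independent of every other sample in the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learnable per-feature scale and shift restore representational capacity.
    return gamma * x_hat + beta

x = np.random.randn(2, 4, 8)           # (batch, sequence, features)
gamma, beta = np.ones(8), np.zeros(8)  # per-feature parameters
y = layer_norm(x, gamma, beta)

print(y.mean(axis=-1))  # ~0 for every token in every sample
print(y.std(axis=-1))   # ~1 for every token in every sample
```

Because the statistics are computed per sample rather than per batch, the result is identical whether the batch holds one sequence or a thousand, which is why the batch-size and sequence-length independence mentioned above holds.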