Multi-head attention runs multiple attention operations in parallel, each with its own learned query, key, and value projections. The per-head outputs are concatenated and passed through a final output projection, allowing the model to attend to information from different representation subspaces simultaneously.
Different heads can learn to focus on different aspects: one might capture syntax, another semantics, another positional patterns.
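A minimal NumPy sketch of this computation, assuming single-sequence self-attention and square projection matrices whose width is divisible by the number of heads; the function name `multi_head_attention` and the weight names `w_q`, `w_k`, `w_v`, `w_o` are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Multi-head self-attention over x of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project inputs to queries, keys, values, then split into heads:
    # (seq_len, d_model) -> (num_heads, seq_len, d_head).
    q = (x @ w_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    k = (x @ w_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    v = (x @ w_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Example usage with random weights.
rng = np.random.default_rng(0)
d_model, num_heads, seq_len = 64, 8, 10
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)  # (10, 64)
```

Each head attends over the full sequence but only sees a `d_head`-dimensional slice of the projected representation, which is what lets different heads specialize in different patterns.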