Causal masking is the technique of preventing tokens from attending to future positions in the sequence during self-attention. This ensures the model can only "see" past context when predicting each token, maintaining the autoregressive property.
Without causal masking, the model could "cheat" during training by attending to the very tokens it is asked to predict, making the next-token objective trivial and the learned weights useless for autoregressive generation.
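A minimal NumPy sketch of the idea: build an upper-triangular mask over the attention-score matrix and set future positions to negative infinity before the softmax, so they receive zero weight. The function name and shapes here are illustrative, not from any particular library.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to raw attention scores, then softmax.

    scores: (seq_len, seq_len) array where scores[i, j] is the raw
    score of query position i attending to key position j.
    (Illustrative helper; not part of any framework's API.)
    """
    seq_len = scores.shape[0]
    # True above the diagonal marks future positions (j > i).
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    # -inf at masked positions means softmax assigns them zero weight.
    masked = np.where(mask, -np.inf, scores)
    # Row-wise softmax over the key dimension, stabilized by the row max.
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

weights = causal_attention_weights(np.zeros((4, 4)))
# Each row i attends only to positions 0..i; entries above the
# diagonal are exactly zero, and each row still sums to 1.
```

With uniform (all-zero) scores, row `i` spreads its weight evenly over positions `0..i`, which makes the triangular structure of the mask easy to inspect.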