Compression refers to techniques that reduce the token count of prompts while preserving essential information. This is crucial for working within context window limits, reducing costs, and improving inference speed.
Compression can be lossy (some information is removed) or lossless (all information is preserved in fewer tokens), and can operate at the lexical level (deleting redundant or low-information words), the semantic level (paraphrasing or summarizing), or the level of learned representations (e.g., soft prompts or trained compression models).
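As a minimal illustration of lossy, lexical-level compression, the sketch below drops common filler words and collapses whitespace. The filler-word list, function names, and the whitespace-based token count are illustrative assumptions, not a standard API; real systems would use a model's actual tokenizer and more principled selection criteria.

```python
import re

# Illustrative (not exhaustive) set of filler words whose removal
# rarely changes the meaning of an instruction-style prompt.
FILLER_WORDS = {"please", "kindly", "basically", "actually",
                "really", "very", "just", "simply"}

def compress_prompt(prompt: str) -> str:
    """Lossy lexical compression: drop filler words, collapse whitespace."""
    words = re.split(r"\s+", prompt.strip())
    kept = [w for w in words if w.lower().strip(".,!?") not in FILLER_WORDS]
    return " ".join(kept)

def token_count(text: str) -> int:
    """Whitespace word count as a rough proxy for model token count."""
    return len(text.split())

original = ("Could you please just summarize this really long article, "
            "basically focusing on the very main points?")
compressed = compress_prompt(original)

print(token_count(original), "->", token_count(compressed))
print(compressed)
```

Lexical filtering like this is cheap but crude; semantic compression (summarization) or learned approaches can achieve much higher compression ratios at the cost of more computation.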