Latency is the time delay in an LLM system between sending a request and receiving a response (or, in streaming settings, the first token). Low latency is critical for interactive applications where users expect quick responses.
For LLMs, total latency comprises prompt processing time, token-by-token model inference time, and network overhead; time to first token is often tracked separately because it dominates perceived responsiveness in streaming interfaces.
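To make these two measurements concrete, here is a minimal sketch of measuring time to first token (TTFT) and total latency with a streaming request. It assumes the official OpenAI Python SDK and an `OPENAI_API_KEY` in the environment; the model name and the `measure_latency` helper are illustrative, and the same timing pattern applies to any streaming LLM API.

```python
import time

from openai import OpenAI  # assumes the official OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def measure_latency(prompt: str, model: str = "gpt-4o-mini") -> None:
    """Measure time to first token (TTFT) and total latency for one request."""
    start = time.perf_counter()
    first_token_at = None

    # Streaming lets us observe when the first token arrives, separating
    # prompt-processing delay from the rest of generation.
    stream = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()

    end = time.perf_counter()
    if first_token_at is not None:
        print(f"TTFT:          {first_token_at - start:.3f}s")
    print(f"Total latency: {end - start:.3f}s")


measure_latency("Explain latency in one sentence.")
```

Because the timer starts before the request is sent and the first-token timestamp is captured client-side, both numbers include network overhead, matching what an end user actually experiences.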