Introduction

Bolna provides detailed latency metrics for every Voice AI execution, allowing you to monitor and optimize the performance of your conversational AI agents. These metrics help you understand where time is being spent in the conversation pipeline and identify potential bottlenecks that may affect user experience. The latency data is included in the execution payload when you fetch execution details via the Get Execution API. This data provides granular timing information across all major components of the voice AI pipeline.
This feature is in beta and is being gradually rolled out to all users.
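As a starting point, the latency payload can be pulled with a plain HTTP call. This is a minimal sketch, assuming a `GET /executions/{id}` endpoint with bearer-token auth; the base URL, path, and auth header shown here are assumptions, so verify them against the Get Execution API reference.

```python
import json
import urllib.request

API_BASE = "https://api.bolna.ai"  # assumed base URL; check your API reference


def extract_latency_data(payload: dict) -> dict:
    """Pull the latency_data object out of a Get Execution response."""
    return payload.get("latency_data", {})


def fetch_latency_data(execution_id: str, api_key: str) -> dict:
    """Fetch one execution and return its latency metrics."""
    req = urllib.request.Request(
        f"{API_BASE}/executions/{execution_id}",  # assumed path
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_latency_data(json.load(resp))
```

Keeping the extraction step as a separate pure function makes it easy to unit-test your analysis code against saved payloads without hitting the API.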

Latency Data Structure

The latency_data object in the execution payload contains timing information organized by component. Here’s the complete structure:

Top-Level Metrics

{
  "latency_data": {
    "stream_id": 129.56982,
    "time_to_first_audio": 130.84888,
    "region": "in",
    "transcriber": {
      "time_to_connect": 226,
      "turns": [...]
    },
    "llm": {
      "time_to_connect": null,
      "turns": [...]
    },
    "synthesizer": {
      "time_to_connect": 271,
      "turns": [...]
    }
  }
}
| Field | Type | Description |
| --- | --- | --- |
| `stream_id` | float | Time in milliseconds to establish the audio stream connection |
| `time_to_first_audio` | float | Total time in milliseconds from call start until the first audio is played to the user |
| `region` | string | Geographic region code where the execution occurred (e.g., `in` for India, `us` for the United States) |

Transcriber Metrics

The transcriber component converts spoken audio into text. This section tracks how quickly speech is being transcribed.
{
  "transcriber": {
    "time_to_connect": 226,
    "turns": [
      {
        "turn": 1,
        "turn_latency": [
          {
            "sequence_id": 1,
            "audio_to_text_latency": 20.12128,
            "text": "hello who is there"
          },
          {
            "sequence_id": 2,
            "audio_to_text_latency": 19.96126,
            "text": "hello who is this"
          }
        ]
      }
    ]
  }
}

Field Definitions

| Field | Type | Description |
| --- | --- | --- |
| `time_to_connect` | integer | Time (ms) to establish connection with the transcriber provider. |
| `turns` | array&lt;object&gt; | Array of conversation turns, each representing one user speaking segment. |
| `turns[].turn` | integer | Sequential number of this conversation turn (starting at 1). |
| `turns[].turn_latency` | array&lt;object&gt; | List of transcription updates for this turn (incremental updates). |
| `turns[].turn_latency[].sequence_id` | integer | Sequential ID for transcription updates (increments as text is refined). |
| `turns[].turn_latency[].audio_to_text_latency` | float | Time (ms) from receiving audio to producing transcribed text. |
| `turns[].turn_latency[].text` | string | The transcribed text for this sequence. |

Understanding Transcriber Turns

Each turn represents a segment where the user is speaking. Within a turn, you may see multiple sequence_id entries. This happens because modern speech recognition systems provide incremental updates as they process audio:
  • Sequence 1: Initial transcription (may be partial or less accurate)
  • Sequence 2: Refined transcription (more complete and accurate)
  • Sequence N: Final transcription (most accurate version)
The latency for each sequence shows how quickly the transcriber is providing updates, which affects how responsive your agent feels.
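To make this concrete, the incremental updates above can be summarized per turn: take the last sequence as the final transcription and average the latencies to gauge responsiveness. A minimal sketch, operating on the `transcriber` object shown earlier (the function name is illustrative):

```python
def summarize_transcriber_turns(transcriber: dict) -> list[dict]:
    """Summarize each transcriber turn: final text, final and average latency."""
    summaries = []
    for turn in transcriber.get("turns", []):
        updates = turn.get("turn_latency", [])
        if not updates:
            continue  # skip turns with no transcription updates
        latencies = [u["audio_to_text_latency"] for u in updates]
        summaries.append({
            "turn": turn["turn"],
            # The last sequence is the most refined transcription.
            "final_text": updates[-1]["text"],
            "final_latency_ms": latencies[-1],
            "avg_latency_ms": sum(latencies) / len(latencies),
        })
    return summaries
```

Running this over the example payload above yields one summary for turn 1 with final text "hello who is this" and an average latency of about 20 ms.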

LLM Metrics

The LLM (Large Language Model) component generates the AI agent’s responses based on the transcribed user input.
{
  "llm": {
    "time_to_connect": null,
    "turns": [
      {
        "time_to_first_token": 1633.04442,
        "time_to_last_token": 1691.53098,
        "turn": 1
      },
      {
        "time_to_first_token": 737.80586,
        "time_to_last_token": 777.49842,
        "turn": 2
      }
    ]
  }
}

Field Definitions

| Field | Type | Description |
| --- | --- | --- |
| `time_to_connect` | integer \| null | Time (ms) to establish connection with the LLM provider. May be `null` if not applicable. |
| `turns` | array&lt;object&gt; | Array of conversation turns, each representing one LLM response generation. |
| `turns[].turn` | integer | Sequential turn number (starting at 1). |
| `turns[].time_to_first_token` | float | Time (ms) between sending the request and receiving the first token from the LLM. |
| `turns[].time_to_last_token` | float | Time (ms) between sending the request and receiving the last token from the LLM (total generation time). |

Understanding LLM Timing

  • Time to First Token (TTFT): This is critical for perceived responsiveness. Lower TTFT means the agent starts responding faster, even if the full response takes time to generate.
  • Time to Last Token: The total generation time. The difference between time_to_last_token and time_to_first_token indicates how long the streaming response took.
  • Streaming Benefit: When using streaming, the synthesizer can start converting text to speech as soon as the first tokens arrive, reducing overall latency.
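The streaming-duration calculation described above is just a subtraction per turn. A short sketch, operating on the `llm` object from the payload (the function name is illustrative):

```python
def llm_turn_stats(llm: dict) -> list[dict]:
    """For each LLM turn, report TTFT and the streaming time after the first token."""
    stats = []
    for t in llm.get("turns", []):
        ttft = t["time_to_first_token"]
        ttlt = t["time_to_last_token"]
        stats.append({
            "turn": t["turn"],
            "ttft_ms": ttft,
            # Time spent streaming the rest of the response after the first token.
            "stream_ms": ttlt - ttft,
        })
    return stats
```

For turn 1 in the example above, TTFT is about 1633 ms while the remaining stream takes only about 58 ms, which shows why TTFT dominates perceived responsiveness.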

Synthesizer Metrics

The synthesizer component converts the LLM’s text response into spoken audio.
{
  "synthesizer": {
    "time_to_connect": 271,
    "turns": [
      {
        "time_to_first_token": 599,
        "time_to_last_token": 800,
        "turn": 1
      },
      {
        "time_to_first_token": 317,
        "time_to_last_token": 518,
        "turn": 2
      }
    ]
  }
}

Field Definitions

| Field | Type | Description |
| --- | --- | --- |
| `time_to_connect` | integer | Time (ms) to establish connection with the text-to-speech service. |
| `turns` | array&lt;object&gt; | Array of synthesis operations, one per agent response. |
| `turns[].turn` | integer | Sequential turn number in the conversation (starting at 1). |
| `turns[].time_to_first_token` | integer | Time (ms) from receiving input text to generating the first audio chunk. |
| `turns[].time_to_last_token` | integer | Time (ms) from receiving input text to completing audio generation. |

Understanding Synthesizer Timing

  • Time to First Token: How quickly the synthesizer starts producing audio. This is crucial for maintaining conversation flow.
  • Time to Last Token: Total time to generate all audio for the response.
  • Streaming Synthesis: Modern TTS systems stream audio, so playback can begin before the entire response is synthesized.
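Combining the LLM and synthesizer metrics gives a rough per-turn estimate of how long the user waits before hearing the agent speak. This is a sketch of one way to approximate it by summing the two time-to-first-token values for matching turns; it ignores buffering and transport overhead, so treat the result as a lower bound, not an exact measurement.

```python
def response_start_estimate(llm: dict, synthesizer: dict) -> dict:
    """Approximate, per turn, the time before the agent starts speaking (ms)."""
    llm_ttft = {t["turn"]: t["time_to_first_token"] for t in llm.get("turns", [])}
    syn_ttft = {t["turn"]: t["time_to_first_token"] for t in synthesizer.get("turns", [])}
    # Only turns present in both components can be estimated.
    return {
        turn: llm_ttft[turn] + syn_ttft[turn]
        for turn in llm_ttft.keys() & syn_ttft.keys()
    }
```

With the example payloads above, turn 1 comes out to roughly 2232 ms and turn 2 to roughly 1055 ms, matching the intuition that later turns benefit from warm connections.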

Analyzing Latency Data

Identifying Bottlenecks

To identify which component is causing delays, compare the latencies across components:
  1. High Transcriber Latency (>100ms per sequence):
    • May indicate network issues with the transcription service
    • Could suggest the need for a different transcription provider
    • Might be affected by audio quality or background noise
  2. High LLM Time to First Token (>1000ms):
    • Could indicate the LLM model is too large or complex
    • May suggest the need for prompt optimization
    • Might benefit from using a faster model or provider
    • Could indicate high load on the LLM service
  3. High Synthesizer Latency (>500ms to first token):
    • May indicate network issues with the TTS service
    • Could suggest trying a different voice or synthesis provider
    • Might benefit from using a faster TTS model
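The rule-of-thumb thresholds above can be turned into a simple automated check over a `latency_data` payload. A minimal sketch (thresholds and function name are illustrative; tune them for your workload):

```python
# Rule-of-thumb thresholds from the guidance above, in milliseconds.
TRANSCRIBER_MS = 100   # per-sequence audio_to_text_latency
LLM_TTFT_MS = 1000     # LLM time_to_first_token
SYNTH_TTFT_MS = 500    # synthesizer time_to_first_token


def find_bottlenecks(latency_data: dict) -> list[str]:
    """Return human-readable labels for every metric exceeding its threshold."""
    issues = []
    for turn in latency_data.get("transcriber", {}).get("turns", []):
        for upd in turn.get("turn_latency", []):
            if upd["audio_to_text_latency"] > TRANSCRIBER_MS:
                issues.append(f"transcriber turn {turn['turn']} seq {upd['sequence_id']}")
    for t in latency_data.get("llm", {}).get("turns", []):
        if t["time_to_first_token"] > LLM_TTFT_MS:
            issues.append(f"llm turn {t['turn']}")
    for t in latency_data.get("synthesizer", {}).get("turns", []):
        if t["time_to_first_token"] > SYNTH_TTFT_MS:
            issues.append(f"synthesizer turn {t['turn']}")
    return issues
```

Applied to the example execution above, this flags the LLM and synthesizer on turn 1 (1633 ms and 599 ms to first token) while the transcriber, at roughly 20 ms per sequence, passes cleanly.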

By understanding and monitoring these latency metrics, you can ensure your Bolna Voice AI agents deliver fast, natural, and responsive conversations as you scale.