Introduction

Bolna provides detailed latency metrics for every Voice AI execution, allowing you to monitor and optimize the performance of your conversational AI agents. These metrics help you understand where time is being spent in the conversation pipeline and identify potential bottlenecks that may affect user experience. The latency data is included in the execution payload when you fetch execution details via the Get Execution API. This data provides granular timing information across all major components of the voice AI pipeline.
This feature is in beta and is being gradually rolled out to all users.
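As a starting point, the latency payload can be pulled with a plain HTTP call. This is a minimal sketch, assuming a `GET /executions/{id}` endpoint with bearer-token auth; the base URL, path, and auth header shown here are assumptions, so verify them against the Get Execution API reference.

```python
import json
import urllib.request

API_BASE = "https://api.bolna.ai"  # assumed base URL; check your API reference


def extract_latency_data(payload: dict) -> dict:
    """Pull the latency_data object out of a Get Execution response."""
    return payload.get("latency_data", {})


def fetch_latency_data(execution_id: str, api_key: str) -> dict:
    """Fetch one execution and return its latency metrics."""
    req = urllib.request.Request(
        f"{API_BASE}/executions/{execution_id}",  # assumed path
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_latency_data(json.load(resp))
```

Keeping the extraction step as a separate pure function makes it easy to unit-test your analysis code against saved payloads without hitting the API.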

Latency Data Structure

The latency_data object in the execution payload contains timing information organized by component. Here’s the complete structure:

Top-Level Metrics

{
  "latency_data": {
    "stream_id": 129.56982,
    "time_to_first_audio": 130.84888,
    "region": "in",
    "transcriber": {
      "time_to_connect": 226,
      "turns": [...]
    },
    "llm": {
      "time_to_connect": null,
      "turns": [...]
    },
    "synthesizer": {
      "time_to_connect": 271,
      "turns": [...]
    }
  }
}
| Field | Type | Description |
| --- | --- | --- |
| `stream_id` | float | Time in milliseconds to establish the audio stream connection |
| `time_to_first_audio` | float | Total time in milliseconds from call start until the first audio is played to the user |
| `region` | string | Geographic region code where the execution occurred (e.g., `in` for India, `us` for the United States) |

Transcriber Metrics

The transcriber component converts spoken audio into text. This section tracks how quickly speech is being transcribed.
{
  "transcriber": {
    "time_to_connect": 226,
    "turns": [
      {
        "turn": 1,
        "turn_latency": [
          {
            "sequence_id": 1,
            "audio_to_text_latency": 20.12128,
            "text": "hello who is there"
          },
          {
            "sequence_id": 2,
            "audio_to_text_latency": 19.96126,
            "text": "hello who is this"
          }
        ]
      }
    ]
  }
}

Field Definitions

| Field | Type | Description |
| --- | --- | --- |
| `time_to_connect` | integer | Time (ms) to establish connection with the transcriber provider. |
| `turns` | array&lt;object&gt; | Array of conversation turns, each representing one user speaking segment. |
| `turns[].turn` | integer | Sequential number of this conversation turn (starting at 1). |
| `turns[].turn_latency` | array&lt;object&gt; | List of transcription updates for this turn (incremental updates). |
| `turns[].turn_latency[].sequence_id` | integer | Sequential ID for transcription updates (increments as text is refined). |
| `turns[].turn_latency[].audio_to_text_latency` | float | Time (ms) from receiving audio to producing transcribed text. |
| `turns[].turn_latency[].text` | string | The transcribed text for this sequence. |

Understanding Transcriber Turns

Each turn represents a segment where the user is speaking. Within a turn, you may see multiple sequence_id entries. This happens because modern speech recognition systems provide incremental updates as they process audio:
  • Sequence 1: Initial transcription (may be partial or less accurate)
  • Sequence 2: Refined transcription (more complete and accurate)
  • Sequence N: Final transcription (most accurate version)
The latency for each sequence shows how quickly the transcriber is providing updates, which affects how responsive your agent feels.
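To make this concrete, the incremental updates above can be summarized per turn: take the last sequence as the final transcription and average the latencies to gauge responsiveness. A minimal sketch, operating on the `transcriber` object shown earlier (the function name is illustrative):

```python
def summarize_transcriber_turns(transcriber: dict) -> list[dict]:
    """Summarize each transcriber turn: final text, final and average latency."""
    summaries = []
    for turn in transcriber.get("turns", []):
        updates = turn.get("turn_latency", [])
        if not updates:
            continue  # skip turns with no transcription updates
        latencies = [u["audio_to_text_latency"] for u in updates]
        summaries.append({
            "turn": turn["turn"],
            # The last sequence is the most refined transcription.
            "final_text": updates[-1]["text"],
            "final_latency_ms": latencies[-1],
            "avg_latency_ms": sum(latencies) / len(latencies),
        })
    return summaries
```

Running this over the example payload above yields one summary for turn 1 with final text "hello who is this" and an average latency of about 20 ms.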

LLM Metrics

The LLM (Large Language Model) component generates the AI agent’s responses based on the transcribed user input.
{
  "llm": {
    "time_to_connect": null,
    "turns": [
      {
        "time_to_first_token": 1633.04442,
        "time_to_last_token": 1691.53098,
        "turn": 1
      },
      {
        "time_to_first_token": 737.80586,
        "time_to_last_token": 777.49842,
        "turn": 2
      }
    ]
  }
}

Field Definitions

| Field | Type | Description |
| --- | --- | --- |
| `time_to_connect` | integer \| null | Time (ms) to establish connection with the LLM provider. May be `null` if not applicable. |
| `turns` | array&lt;object&gt; | Array of conversation turns, each representing one LLM response generation. |
| `turns[].turn` | integer | Sequential turn number (starting at 1). |
| `turns[].time_to_first_token` | float | Time (ms) between sending the request and receiving the first token from the LLM. |
| `turns[].time_to_last_token` | float | Time (ms) between sending the request and receiving the last token from the LLM (total generation time). |

Understanding LLM Timing

  • Time to First Token (TTFT): This is critical for perceived responsiveness. Lower TTFT means the agent starts responding faster, even if the full response takes time to generate.
  • Time to Last Token: The total generation time. The difference between time_to_last_token and time_to_first_token indicates how long the streaming response took.
  • Streaming Benefit: When using streaming, the synthesizer can start converting text to speech as soon as the first tokens arrive, reducing overall latency.
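The streaming-duration calculation described above is just a subtraction per turn. A short sketch, operating on the `llm` object from the payload (the function name is illustrative):

```python
def llm_turn_stats(llm: dict) -> list[dict]:
    """For each LLM turn, report TTFT and the streaming time after the first token."""
    stats = []
    for t in llm.get("turns", []):
        ttft = t["time_to_first_token"]
        ttlt = t["time_to_last_token"]
        stats.append({
            "turn": t["turn"],
            "ttft_ms": ttft,
            # Time spent streaming the rest of the response after the first token.
            "stream_ms": ttlt - ttft,
        })
    return stats
```

For turn 1 in the example above, TTFT is about 1633 ms while the remaining stream takes only about 58 ms, which shows why TTFT dominates perceived responsiveness.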

Synthesizer Metrics

The synthesizer component converts the LLM’s text response into spoken audio.
{
  "synthesizer": {
    "time_to_connect": 271,
    "turns": [
      {
        "time_to_first_token": 599,
        "time_to_last_token": 800,
        "turn": 1
      },
      {
        "time_to_first_token": 317,
        "time_to_last_token": 518,
        "turn": 2
      }
    ]
  }
}

Field Definitions

| Field | Type | Description |
| --- | --- | --- |
| `time_to_connect` | integer | Time (ms) to establish connection with the text-to-speech service. |
| `turns` | array&lt;object&gt; | Array of synthesis operations, one per agent response. |
| `turns[].turn` | integer | Sequential turn number in the conversation (starting at 1). |
| `turns[].time_to_first_token` | integer | Time (ms) from receiving input text to generating the first audio chunk. |
| `turns[].time_to_last_token` | integer | Time (ms) from receiving input text to completing audio generation. |

Understanding Synthesizer Timing

  • Time to First Token: How quickly the synthesizer starts producing audio. This is crucial for maintaining conversation flow.
  • Time to Last Token: Total time to generate all audio for the response.
  • Streaming Synthesis: Modern TTS systems stream audio, so playback can begin before the entire response is synthesized.
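Combining the LLM and synthesizer metrics gives a rough per-turn estimate of how long the user waits before hearing the agent speak. This is a sketch of one way to approximate it by summing the two time-to-first-token values for matching turns; it ignores buffering and transport overhead, so treat the result as a lower bound, not an exact measurement.

```python
def response_start_estimate(llm: dict, synthesizer: dict) -> dict:
    """Approximate, per turn, the time before the agent starts speaking (ms)."""
    llm_ttft = {t["turn"]: t["time_to_first_token"] for t in llm.get("turns", [])}
    syn_ttft = {t["turn"]: t["time_to_first_token"] for t in synthesizer.get("turns", [])}
    # Only turns present in both components can be estimated.
    return {
        turn: llm_ttft[turn] + syn_ttft[turn]
        for turn in llm_ttft.keys() & syn_ttft.keys()
    }
```

With the example payloads above, turn 1 comes out to roughly 2232 ms and turn 2 to roughly 1055 ms, matching the intuition that later turns benefit from warm connections.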

Analyzing Latency Data

Identifying Bottlenecks

To identify which component is causing delays, compare the latencies across components:
  1. High Transcriber Latency (>100ms per sequence):
    • May indicate network issues with the transcription service
    • Could suggest the need for a different transcription provider
    • Might be affected by audio quality or background noise
  2. High LLM Time to First Token (>1000ms):
    • Could indicate the LLM model is too large or complex
    • May suggest the need for prompt optimization
    • Might benefit from using a faster model or provider
    • Could indicate high load on the LLM service
  3. High Synthesizer Latency (>500ms to first token):
    • May indicate network issues with the TTS service
    • Could suggest trying a different voice or synthesis provider
    • Might benefit from using a faster TTS model
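The rule-of-thumb thresholds above can be turned into a simple automated check over a `latency_data` payload. A minimal sketch (thresholds and function name are illustrative; tune them for your workload):

```python
# Rule-of-thumb thresholds from the guidance above, in milliseconds.
TRANSCRIBER_MS = 100   # per-sequence audio_to_text_latency
LLM_TTFT_MS = 1000     # LLM time_to_first_token
SYNTH_TTFT_MS = 500    # synthesizer time_to_first_token


def find_bottlenecks(latency_data: dict) -> list[str]:
    """Return human-readable labels for every metric exceeding its threshold."""
    issues = []
    for turn in latency_data.get("transcriber", {}).get("turns", []):
        for upd in turn.get("turn_latency", []):
            if upd["audio_to_text_latency"] > TRANSCRIBER_MS:
                issues.append(f"transcriber turn {turn['turn']} seq {upd['sequence_id']}")
    for t in latency_data.get("llm", {}).get("turns", []):
        if t["time_to_first_token"] > LLM_TTFT_MS:
            issues.append(f"llm turn {t['turn']}")
    for t in latency_data.get("synthesizer", {}).get("turns", []):
        if t["time_to_first_token"] > SYNTH_TTFT_MS:
            issues.append(f"synthesizer turn {t['turn']}")
    return issues
```

Applied to the example execution above, this flags the LLM and synthesizer on turn 1 (1633 ms and 599 ms to first token) while the transcriber, at roughly 20 ms per sequence, passes cleanly.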

By understanding and monitoring these latency metrics, you can ensure your Bolna Voice AI agents deliver fast, natural, and responsive conversations as you scale.