Text-to-Speech (TTS)

The Text-to-Speech Layer is the last stage in the /audio-chat pipeline. After generating and optionally refining the answer, Billx-Agent uses ElevenLabs to convert that text into a natural-sounding voice response.

🎯 What It Does

Accepts a text string (e.g. "Top 5 products are...")
Sends it to the ElevenLabs TTS API
Receives an audio file in return
Encodes the audio in Base64 format
Includes the audio in the API response so the client can play it
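
The steps above can be sketched server-side roughly as follows. This is a minimal illustration, not Billx-Agent's actual code: the function names, the `apiKey`/`voiceId` options, and the injectable `fetchImpl` are hypothetical, and the ElevenLabs endpoint and `xi-api-key` header should be checked against the current ElevenLabs API reference. It assumes Node 18+ for the built-in `fetch`.

```javascript
// Sketch of the text -> ElevenLabs -> Base64 flow (Node 18+).
// synthesizeToBase64, apiKey, voiceId and fetchImpl are illustrative names.
async function synthesizeToBase64(text, { apiKey, voiceId, fetchImpl = fetch }) {
  const url = `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`;
  const res = await fetchImpl(url, {
    method: "POST",
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) {
    throw new Error(`TTS request failed with status ${res.status}`);
  }
  // The response body is the raw audio; encode it for JSON transport.
  const audioBytes = Buffer.from(await res.arrayBuffer());
  return audioBytes.toString("base64");
}

// Shape the payload like the response shown under "Audio Response Example".
function buildAudioChatResponse(refinedAnswer, audioBase64) {
  return { refined_answer: refinedAnswer, audio_content: audioBase64 };
}
```

Passing `fetchImpl` as an option is only a convenience for testing the function without network access; in production the default global `fetch` is used.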

🔉 Where It's Used

Primarily in POST /audio-chat
Can also be used standalone via POST /tts
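
A standalone call to POST /tts might look like the sketch below. The request and response shapes are assumptions, since this page does not document them: the body is assumed to be `{ text }`, the response is assumed to carry `audio_content` as in /audio-chat, and `baseUrl` and `requestSpeech` are hypothetical names. Check your deployment before relying on any of them.

```javascript
// Assumptions: baseUrl is a placeholder, the request body is { text }, and the
// response carries audio_content like /audio-chat does.
async function requestSpeech(baseUrl, text, fetchImpl = fetch) {
  const res = await fetchImpl(`${baseUrl}/tts`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) {
    throw new Error(`POST /tts failed with status ${res.status}`);
  }
  const data = await res.json();
  return data.audio_content; // Base64 string (assumed field name)
}
```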

🔁 Audio Response Example

{
  "refined_answer": "The top 3 selling products are A, B, and C.",
  "audio_content": "<base64-encoded-audio>"
}

You can play the audio in your frontend using JavaScript like this:

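// `response` is the parsed JSON body shown above.
const audio = new Audio(`data:audio/mpeg;base64,${response.audio_content}`);
audio.play(); // returns a Promise; browsers may reject it without a prior user gesture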
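
For longer responses, a data URI can get unwieldy. One alternative is to decode the Base64 string into bytes and play it from an object URL, as sketched below. The helper names are hypothetical; `base64ToBytes` runs in browsers and in Node 16+, while `playBase64Audio` relies on the browser-only `Audio` constructor.

```javascript
// Decode a Base64 string into raw bytes (atob exists in browsers and Node 16+).
function base64ToBytes(base64) {
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Browser-only: wrap the bytes in a Blob and play it through an object URL.
function playBase64Audio(base64, mimeType = "audio/mpeg") {
  const blob = new Blob([base64ToBytes(base64)], { type: mimeType });
  const url = URL.createObjectURL(blob);
  const audio = new Audio(url);
  // Free the object URL once playback finishes.
  audio.addEventListener("ended", () => URL.revokeObjectURL(url));
  return audio.play(); // Promise; may reject if autoplay is blocked
}
```

In a click handler this could be called as `playBase64Audio(response.audio_content)`, which also satisfies the user-gesture requirement most browsers place on audio playback.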

⚙️ Output Format

Default format: MP3 (via ElevenLabs)
Returned as a Base64 string in the API response
Compatible with web, mobile, and desktop audio players
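
If a client wants to confirm that the decoded payload really is MP3 before handing it to a player, a quick header check is usually enough. The sketch below is an optional sanity check with hypothetical helper names, not part of Billx-Agent: it looks for either an ID3v2 tag or an MPEG audio frame sync at the start of the data. `isMp3Base64` uses Node's `Buffer`.

```javascript
// Returns true when the bytes start with an ID3v2 tag or an MPEG frame sync.
function looksLikeMp3(bytes) {
  if (bytes.length < 3) return false;
  // "ID3" in ASCII marks an ID3v2 metadata tag placed before the audio frames.
  const hasId3Tag = bytes[0] === 0x49 && bytes[1] === 0x44 && bytes[2] === 0x33;
  // An MPEG audio frame header begins with 11 set bits.
  const hasFrameSync = bytes[0] === 0xff && (bytes[1] & 0xe0) === 0xe0;
  return hasId3Tag || hasFrameSync;
}

// Node usage: decode audio_content first, then check it.
function isMp3Base64(audioContent) {
  return looksLikeMp3(Buffer.from(audioContent, "base64"));
}
```

This only inspects the first bytes, so it catches an obviously wrong format (for example a JSON error body encoded by mistake) but does not validate the whole file.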

🎙️ Voice Customization (Advanced)

If supported by your ElevenLabs plan, you can:

Choose different voices
Adjust speech rate and pitch
Localize for different languages (future-ready)
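
As a rough illustration of how a voice choice and tuning options could be passed through to ElevenLabs, the sketch below builds the request without sending it. `buildTtsRequest` and its option names are hypothetical. The JSON fields (`model_id`, `voice_settings.stability`, `voice_settings.similarity_boost`) are assumed from the public ElevenLabs API and should be verified against its current reference. Rate and pitch controls are not shown, since their availability depends on the model and plan.

```javascript
// Sketch: voice choice and tuning expressed as an ElevenLabs request.
// All option names here are illustrative; verify the JSON fields before use.
function buildTtsRequest(text, { apiKey, voiceId, modelId, stability, similarityBoost }) {
  const body = { text };
  if (modelId !== undefined) body.model_id = modelId;

  // Only send voice_settings when at least one setting was provided.
  const voiceSettings = {};
  if (stability !== undefined) voiceSettings.stability = stability;
  if (similarityBoost !== undefined) voiceSettings.similarity_boost = similarityBoost;
  if (Object.keys(voiceSettings).length > 0) body.voice_settings = voiceSettings;

  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${encodeURIComponent(voiceId)}`,
    init: {
      method: "POST",
      headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}
```

A call such as `buildTtsRequest("Hola", { apiKey, voiceId, modelId: "eleven_multilingual_v2" })` would then select a multilingual model, assuming that model ID is available on your account; the result can be passed to `fetch(url, init)`.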

📌 TTS is optional. If your users prefer reading results, you can ignore the audio_content field.

