Text-to-Speech (TTS)

The Text-to-Speech Layer is the last stage in the /audio-chat pipeline. After generating and optionally refining the answer, Billx-Agent uses ElevenLabs to convert that text into a natural-sounding voice response.


🎯 What It Does

  • Accepts a text string (e.g. "Top 5 products are...")

  • Sends it to the ElevenLabs TTS API

  • Receives an audio file in return

  • Encodes the audio in Base64 format

  • Includes the audio in the API response so the client can play it
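
The steps above can be sketched roughly as follows. This is a minimal illustration, not Billx-Agent's actual implementation: the function names, the `apiKey` and `voiceId` parameters, and the injectable `fetchImpl` are assumptions made for the example, and the endpoint shape follows the public ElevenLabs text-to-speech API as commonly documented (verify it against the current ElevenLabs reference before relying on it).

```javascript
// Sketch only: endpoint shape as publicly documented by ElevenLabs;
// apiKey, voiceId, and all function names here are illustrative assumptions.
const ELEVENLABS_URL = "https://api.elevenlabs.io/v1/text-to-speech";

// Wrap raw audio bytes in the response shape shown in this document.
function buildAudioResponse(refinedAnswer, audioBytes) {
  return {
    refined_answer: refinedAnswer,
    audio_content: Buffer.from(audioBytes).toString("base64"),
  };
}

// Send text to the TTS API and return the Base64-wrapped result.
// fetchImpl is injectable so the function can be exercised without network access.
async function synthesize(text, { apiKey, voiceId, fetchImpl = fetch }) {
  const res = await fetchImpl(`${ELEVENLABS_URL}/${voiceId}`, {
    method: "POST",
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) {
    throw new Error(`TTS request failed with status ${res.status}`);
  }
  const audioBytes = new Uint8Array(await res.arrayBuffer());
  return buildAudioResponse(text, audioBytes);
}
```

Keeping the Base64 wrapping in its own helper means the response shape can be checked without calling the external service.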


🔉 Where It's Used

  • Primarily in POST /audio-chat

  • Can also be used standalone via POST /tts
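
As a rough illustration of the standalone route, a client call to POST /tts might look like the sketch below. The request body shape (`{ text }`), the base URL, and the assumption that /tts returns the same `audio_content` field as /audio-chat are not confirmed by this page. Treat them as placeholders and check the endpoint reference.

```javascript
// Hypothetical request builder: the { text } body field and baseUrl are
// assumptions, not confirmed by the API documentation.
function buildTtsRequest(baseUrl, text) {
  return {
    url: `${baseUrl.replace(/\/+$/, "")}/tts`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ text }),
    },
  };
}

// Call POST /tts and return the Base64 audio string from the response.
// Assumes the response carries an audio_content field, as /audio-chat does.
async function requestSpeech(baseUrl, text, fetchImpl = fetch) {
  const { url, options } = buildTtsRequest(baseUrl, text);
  const res = await fetchImpl(url, options);
  if (!res.ok) throw new Error(`POST /tts failed with status ${res.status}`);
  const data = await res.json();
  return data.audio_content;
}
```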


πŸ” Audio Response Example

{
  "refined_answer": "The top 3 selling products are A, B, and C.",
  "audio_content": "<base64-encoded-audio>"
}

You can play the audio in your frontend using JavaScript like this:
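
A minimal playback sketch, assuming the response has the shape shown above and the audio is MP3. The helper names `toAudioDataUri` and `playAudioResponse` are illustrative and not part of any Billx-Agent client library.

```javascript
// Build a data URI that an <audio> element or Audio object can load directly.
// The "audio/mpeg" MIME type assumes the default MP3 output.
function toAudioDataUri(base64Audio, mimeType = "audio/mpeg") {
  return `data:${mimeType};base64,${base64Audio}`;
}

// Play the audio_content field of an API response in the browser.
// Returns the Audio object so the caller can pause or stop playback.
function playAudioResponse(response) {
  if (!response.audio_content) return null; // no audio in the response, nothing to play
  const audio = new Audio(toAudioDataUri(response.audio_content));
  audio.play().catch((err) => console.error("Playback failed:", err));
  return audio;
}
```

Browsers may block playback that is not triggered by a user gesture, so it is usually safest to call `playAudioResponse` from a click or tap handler.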


βš™οΈ Output Format

  • Default format: MP3 (via ElevenLabs)

  • Returned as a Base64 string in the API response

  • Compatible with web, mobile, and desktop audio players
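
To hand the audio to a player that expects binary data rather than a string, the Base64 payload can be decoded first. The sketch below is one way to do that, assuming the default MP3 format. The helper names are illustrative.

```javascript
// Decode a Base64 string into raw bytes. atob is available in browsers and Node 16+.
function base64ToBytes(base64Audio) {
  const binary = atob(base64Audio);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }
  return bytes;
}

// Wrap the decoded bytes in a Blob tagged as MP3, the default format listed above.
function toMp3Blob(base64Audio) {
  return new Blob([base64ToBytes(base64Audio)], { type: "audio/mpeg" });
}
```

In a browser, `URL.createObjectURL(toMp3Blob(response.audio_content))` yields a URL that can be assigned to an `<audio>` element or a download link. Note that Base64 encoding makes the payload roughly one third larger than the raw audio.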


πŸŽ™οΈ Voice Customization (Advanced)

If supported by your ElevenLabs plan, you can:

  • Choose different voices

  • Adjust speech rate and pitch

  • Localize for different languages (future-ready)
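
One way to organize these options is a small table of voice profiles that is turned into a TTS request. The sketch below is hypothetical: the profile names and voice IDs are placeholders, and the `model_id` and `voice_settings` fields (`stability`, `similarity_boost`) reflect the ElevenLabs request format as commonly documented. Which controls are available, including any for speech rate, depends on your plan and API version, so check the ElevenLabs reference.

```javascript
// Hypothetical voice profiles: the profile names and voice IDs are placeholders.
const VOICE_PROFILES = {
  default: { voiceId: "YOUR_DEFAULT_VOICE_ID", stability: 0.5, similarityBoost: 0.75 },
  narrator: { voiceId: "YOUR_NARRATOR_VOICE_ID", stability: 0.7, similarityBoost: 0.6 },
};

// Build the path and JSON body for a TTS request using a named profile.
// Unknown profile names fall back to the default profile.
function buildVoiceRequest(text, profileName = "default") {
  const profile = VOICE_PROFILES[profileName] ?? VOICE_PROFILES.default;
  return {
    path: `/v1/text-to-speech/${profile.voiceId}`,
    body: {
      text,
      model_id: "eleven_multilingual_v2", // multilingual model, assumed available on the plan
      voice_settings: {
        stability: profile.stability,
        similarity_boost: profile.similarityBoost,
      },
    },
  };
}
```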


📌 TTS is optional. If your users prefer reading results, you can ignore the audio_content field.

