Text-to-Speech for Developers: Getting Started

Published May 15, 2026

Adding Voice to Your Application Is Easier Than It Was Three Years Ago

If you've been putting off integrating TTS into your application because you expected it to be complex, you may be working from outdated assumptions. The major cloud TTS APIs have invested heavily in developer experience over the last few years. Documentation is comprehensive, SDKs exist for every mainstream language, and free tiers give you room to experiment without a billing commitment.

This guide gets you from zero to working TTS output with three different APIs — Amazon Polly, Google Cloud TTS, and ElevenLabs — then covers the concepts you need to build on that foundation.

Core Concepts Before You Write Any Code

How TTS APIs Work

All major TTS APIs follow the same basic pattern: you send a request containing your text, your preferred voice, and optional configuration parameters; the API returns audio data (as a stream or a file). Your application receives that audio and plays it, stores it, or serves it to end users.

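The key parameters you'll configure in almost every TTS API call are the input text (plain text or SSML), the voice, the synthesis engine or model, and the output audio format.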
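As a rough illustration of that request pattern, the sketch below models a provider-neutral request and maps it onto the keyword arguments that Polly's synthesize_speech call takes later in this guide. The names TTSRequest and to_polly_kwargs are hypothetical helpers invented for this example, not part of any SDK.

```python
from dataclasses import dataclass, field


@dataclass
class TTSRequest:
    """Hypothetical provider-neutral description of one synthesis request."""
    text: str
    voice: str
    output_format: str = "mp3"
    options: dict = field(default_factory=dict)  # provider-specific extras


def to_polly_kwargs(req: TTSRequest) -> dict:
    """Translate the generic request into Polly-style keyword arguments."""
    return {
        "Text": req.text,
        "VoiceId": req.voice,
        "OutputFormat": req.output_format,
        **req.options,
    }
```

The same request object could be translated for another provider by writing a second mapping function, which keeps the rest of the application independent of any one API.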

Streaming vs. Batch

TTS APIs typically offer two delivery modes:

Synchronous / batch: You send a request, wait for the complete audio to be generated, and receive it all at once. Simple to implement. Best for pre-generating audio files, background processing, and non-time-sensitive applications.

Streaming: The API begins returning audio as soon as the first chunk is generated, before the full output is complete. Essential for low-latency applications like voice assistants, conversational AI, and real-time narration. Slightly more complex to implement (you need to handle a streaming response).
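To make the difference concrete, here is a minimal sketch of consuming a streaming response chunk by chunk instead of waiting for the whole payload. The function fake_streaming_response is a stand-in invented for this example; with a real SDK you would iterate over whatever chunk iterator the provider's streaming call returns.

```python
import io
from typing import BinaryIO, Iterable, Iterator


def fake_streaming_response(audio: bytes, chunk_size: int = 4) -> Iterator[bytes]:
    """Stand-in for a streaming API response: yields audio a few bytes at a time."""
    for start in range(0, len(audio), chunk_size):
        yield audio[start:start + chunk_size]


def write_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write each chunk to the sink as it arrives; return total bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # ignore empty chunks
            sink.write(chunk)
            total += len(chunk)
    return total


if __name__ == "__main__":
    buffer = io.BytesIO()
    written = write_stream(fake_streaming_response(b"0123456789"), buffer)
    print(written, buffer.getvalue())
```

In a voice assistant the sink would typically be an audio player or an HTTP response body rather than an in-memory buffer, so playback can start before synthesis finishes.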

Getting Started with Amazon Polly (Python)

Amazon Polly is a good starting point for most developers: it's reliable and well-documented, and for the first 12 months the free tier includes 5 million characters per month for standard voices and 1 million characters per month for neural voices.

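Prerequisites

You need an AWS account, credentials for an IAM user or role that is allowed to call Polly, and the boto3 package (pip install boto3).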

Basic Example

import boto3

# boto3 picks up credentials from environment variables, ~/.aws/credentials,
# or an attached IAM role. Avoid hardcoding access keys in source files.
polly = boto3.client('polly', region_name='us-east-1')

response = polly.synthesize_speech(
    Text='Hello from Amazon Polly. This is a test of neural text-to-speech.',
    OutputFormat='mp3',
    VoiceId='Joanna',  # Neural voice
    Engine='neural'
)

with open('output.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())

Run this and you'll find an output.mp3 file in your working directory. Open it — that's Joanna (neural) reading your text.

Using SSML with Polly

# The <emphasis> tag is not supported by Polly's neural voices,
# so this example uses the standard engine.
response = polly.synthesize_speech(
    Text='<speak>This is <emphasis level="strong">very important</emphasis>. <break time="500ms"/> Please pay attention.</speak>',
    TextType='ssml',
    OutputFormat='mp3',
    VoiceId='Joanna',
    Engine='standard'
)
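When the text comes from users or a database, characters such as & and < have to be escaped before they are embedded in SSML, or the request will be rejected as malformed XML. The following is a minimal sketch of a hypothetical helper, build_ssml, that escapes each sentence and joins them with break tags. It only uses speak and break, and it is not part of boto3.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences: list[str], pause_ms: int = 500) -> str:
    """Wrap plain-text sentences in <speak>, separated by <break> tags."""
    pause = f'<break time="{pause_ms}ms"/>'
    # escape() replaces &, < and > with XML entities
    body = f" {pause} ".join(escape(s) for s in sentences)
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(build_ssml(["Tom & Jerry", "1 < 2"], pause_ms=300))
```

The resulting string can be passed as Text with TextType='ssml' in the same way as the hand-written SSML above.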

Getting Started with Google Cloud TTS (Python)

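Prerequisites

You need a Google Cloud project with the Text-to-Speech API enabled, a service account key file, and the google-cloud-texttospeech package (pip install google-cloud-texttospeech).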

Basic Example

from google.cloud import texttospeech
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-service-account-key.json"

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(
    text="Hello from Google Cloud TTS. This uses a Neural2 voice."
)

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-F",  # Neural2 voice
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.95
)

response = client.synthesize_speech(
    input=synthesis_input,
    voice=voice,
    audio_config=audio_config
)

with open("output.mp3", "wb") as f:
    f.write(response.audio_content)

Getting Started with ElevenLabs API (Python)

ElevenLabs is widely regarded as producing some of the most natural-sounding voices of any TTS API available today. Its free tier includes 10,000 characters per month.

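Prerequisites

You need an ElevenLabs account, an API key from your account settings, and the elevenlabs package (pip install elevenlabs).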

Basic Example

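from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Recent SDK versions expose synthesis as text_to_speech.convert().
# Older 1.x releases used client.generate(text=..., voice="Rachel", model=...).
audio = client.text_to_speech.convert(
    text="Hello from ElevenLabs. This is a neural AI voice.",
    voice_id="YOUR_VOICE_ID",            # ID of a voice from the ElevenLabs library
    model_id="eleven_multilingual_v2",
    output_format="mp3_44100_128"
)

# convert() returns the audio as an iterator of byte chunks
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)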

Key Decisions When Building a TTS Feature

Pre-Generate or Generate on Demand?

For content that's known in advance (articles, documentation, fixed scripts), pre-generating audio files and caching them is almost always the right approach. It's cheaper (you only pay for generation once), faster for the user (no generation latency), and more reliable. Store generated files in an object store (S3, GCS, R2) and serve them via CDN.
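One way to implement the generate-once behavior is to derive a cache key from everything that affects the audio (text, voice, engine) and synthesize only when no file exists for that key. The sketch below uses a local directory as a stand-in for an object store, and the synthesize argument is a hypothetical callable that wraps whichever API you use. Both are assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str, voice: str, engine: str) -> str:
    """Stable key: the same text, voice, and engine always map to the same file."""
    payload = "\x1f".join([voice, engine, text]).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def get_or_create_audio(
    text: str,
    voice: str,
    engine: str,
    synthesize: Callable[[str, str, str], bytes],
    cache_dir: Path,
) -> Path:
    """Return the cached audio file, calling synthesize() only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text, voice, engine)}.mp3"
    if not path.exists():
        path.write_bytes(synthesize(text, voice, engine))
    return path
```

Because the key includes the voice and engine, changing either produces a new file instead of silently serving stale audio.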

For dynamic or user-generated content where text is not known until request time, on-demand generation is necessary. Use streaming mode to minimize perceived latency.

Handling Character Limits

All APIs have per-request limits. For long documents, chunk the text at natural boundaries — paragraph breaks, sentence ends, or section headers — and process each chunk separately. Concatenate the audio files using a library like pydub in Python before serving.

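from pydub import AudioSegment  # pydub needs ffmpeg installed to read and write MP3

# Paths of the per-chunk audio files, in reading order (example names)
chunk_files = ["chunk_000.mp3", "chunk_001.mp3", "chunk_002.mp3"]

segments = [AudioSegment.from_mp3(f) for f in chunk_files]
combined = sum(segments)  # AudioSegment supports +, so sum() concatenates in order
combined.export("full_article.mp3", format="mp3")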
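The chunking step itself can be sketched as follows. This hypothetical chunk_text helper splits at sentence ends and packs sentences into chunks no longer than max_chars. The default of 3000 is only a placeholder, so check the per-request limit documented by your provider.

```python
import re


def chunk_text(text: str, max_chars: int = 3000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is hard-split as a last resort.
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can be sent as its own synthesis request, and the resulting files joined in order.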

Error Handling and Retry Logic

TTS API calls can fail due to network issues, rate limiting, or temporary service unavailability. Implement exponential backoff for retries and graceful fallback behavior (informing the user that audio is temporarily unavailable rather than silently failing).
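A minimal sketch of that pattern is shown below. The with_retries helper is hypothetical. It retries a zero-argument callable with exponentially growing, jittered delays and returns None once all attempts fail, so the caller can show an "audio temporarily unavailable" message. For brevity it catches every exception; in real code you would catch only the errors your SDK documents as retryable.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Run call(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # assumption: narrow this to retryable SDK errors
            if attempt == max_attempts - 1:
                return None  # signal the caller to fall back gracefully
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    return None
```

A call site might look like audio = with_retries(lambda: synthesize(text)), followed by a check for None before trying to play or store the result. Here synthesize stands for whichever API wrapper you have written.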

What to Explore Next

Once you have basic synthesis working, explore these areas:

The APIs covered here all have excellent documentation and active developer communities. The fastest way to learn is to get something working first — which you now can — and then explore the edges from there.
