Text-to-Speech for Developers: Getting Started

Published May 15, 2026

Adding Voice to Your Application Is Easier Than It Was Three Years Ago

If you've been putting off integrating TTS into your application because you expected it to be complex, you may be working from outdated assumptions. The major cloud TTS APIs have invested heavily in developer experience over the last few years. Documentation is comprehensive, official SDKs cover the most widely used languages, and free tiers give you room to experiment without a billing commitment.

This guide gets you from zero to working TTS output with three different APIs — Amazon Polly, Google Cloud TTS, and ElevenLabs — then covers the concepts you need to build on that foundation.

Core Concepts Before You Write Any Code

How TTS APIs Work

All major TTS APIs follow the same basic pattern: you send a request containing your text, your preferred voice, and optional configuration parameters; the API returns audio data (as a stream or a file). Your application receives that audio and plays it, stores it, or serves it to end users.

The key parameters you'll configure in almost every TTS API call:

Text input — plain text or SSML-formatted text
Voice ID — which voice to use (each platform has its own naming conventions)
Output format — MP3, OGG, PCM, or others depending on the API
Language code — e.g., "en-US", "en-GB", "es-ES"
Speaking rate / pitch / volume — optional prosody controls
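
To make the parameter list concrete, here is a minimal sketch of a provider-neutral request object. The TTSRequest class and its to_polly_kwargs method are hypothetical helpers invented for illustration; only the keyword names it emits (Text, TextType, VoiceId, OutputFormat, LanguageCode) come from Polly's synthesize_speech call. Prosody is left out on purpose, since Polly expresses rate and pitch through SSML rather than request parameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TTSRequest:
    """Provider-neutral description of one synthesis call (hypothetical helper)."""
    text: str
    voice_id: str
    output_format: str = "mp3"
    language_code: Optional[str] = None
    is_ssml: bool = False

    def to_polly_kwargs(self) -> dict:
        """Translate to keyword arguments for boto3's polly.synthesize_speech."""
        kwargs = {
            "Text": self.text,
            "TextType": "ssml" if self.is_ssml else "text",
            "VoiceId": self.voice_id,
            "OutputFormat": self.output_format,
        }
        if self.language_code:
            kwargs["LanguageCode"] = self.language_code
        return kwargs
```

With a Polly client in hand, a call would look like polly.synthesize_speech(**TTSRequest(text="Hi", voice_id="Joanna").to_polly_kwargs()). Other providers would need their own translation method.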

Streaming vs. Batch

TTS APIs typically offer two delivery modes:

Synchronous / batch: You send a request, wait for the complete audio to be generated, and receive it all at once. Simple to implement. Best for pre-generating audio files, background processing, and non-time-sensitive applications.

Streaming: The API begins returning audio as soon as the first chunk is generated, before the full output is complete. Essential for low-latency applications like voice assistants, conversational AI, and real-time narration. Slightly more complex to implement (you need to handle a streaming response).
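
The difference between the two modes shows up in how your code consumes the response. The sketch below assumes the audio arrives as an iterable of byte chunks, which is how most SDKs expose a streamed response. The function names collect_batch and forward_stream are illustrative, not part of any SDK.

```python
from typing import BinaryIO, Iterable


def collect_batch(chunks: Iterable[bytes]) -> bytes:
    """Batch style: wait for every chunk, then hand back the whole clip."""
    return b"".join(chunks)


def forward_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Streaming style: pass each chunk on as soon as it arrives.

    Returns the total number of bytes written to the sink.
    """
    total = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks
        sink.write(chunk)
        total += len(chunk)
    return total
```

With Polly, for example, response['AudioStream'].iter_chunks() could serve as the chunks argument, and the sink could be an HTTP response body or an audio player's input pipe.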

Getting Started with Amazon Polly (Python)

Amazon Polly is a good starting point for most developers: it's reliable and well-documented. At the time of writing, the free tier for the first 12 months covers 5 million characters per month for standard voices and 1 million characters per month for neural voices.

Prerequisites

An AWS account (free at aws.amazon.com)
An IAM user with the AmazonPollyReadOnlyAccess policy (or more permissive) and programmatic access credentials
Python 3.x and boto3 installed (pip install boto3)

Basic Example

import boto3

# boto3 reads credentials from the standard AWS chain (environment variables,
# ~/.aws/credentials, or an attached IAM role), so no keys are hardcoded here.
polly = boto3.client('polly', region_name='us-east-1')

response = polly.synthesize_speech(
    Text='Hello from Amazon Polly. This is a test of neural text-to-speech.',
    OutputFormat='mp3',
    VoiceId='Joanna',  # Neural voice
    Engine='neural'
)

with open('output.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())

Run this and you'll find an output.mp3 file in your working directory. Open it — that's Joanna (neural) reading your text.

Using SSML with Polly

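# <emphasis> is supported by Polly's standard engine but not by neural voices,
# so this request uses Engine='standard'.
response = polly.synthesize_speech(
    Text='<speak>This is <emphasis level="strong">very important</emphasis>. <break time="500ms"/> Please pay attention.</speak>',
    TextType='ssml',
    OutputFormat='mp3',
    VoiceId='Joanna',
    Engine='standard'
)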
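
One practical detail when sending SSML: it is XML, so a stray ampersand or angle bracket in your text can make the request fail. A minimal sketch of a wrapper that escapes plain text before wrapping it in speak tags follows. The to_ssml helper is hypothetical; escape comes from Python's standard library.

```python
from xml.sax.saxutils import escape


def to_ssml(text: str, pause_ms: int = 0) -> str:
    """Wrap plain text in <speak>, escaping XML special characters."""
    body = escape(text)  # & becomes &amp;, < becomes &lt;, > becomes &gt;
    if pause_ms > 0:
        # Optional trailing pause, using the same <break> tag as the example above
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"
```

Pass the result as Text together with TextType='ssml'.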

Getting Started with Google Cloud TTS (Python)

Prerequisites

A Google Cloud account with a project created
The Cloud Text-to-Speech API enabled for your project
A service account key JSON file with TTS permissions
pip install google-cloud-texttospeech

Basic Example

from google.cloud import texttospeech
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-service-account-key.json"

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(
    text="Hello from Google Cloud TTS. This uses a Neural2 voice."
)

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-F",  # Neural2 voice
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.95
)

response = client.synthesize_speech(
    input=synthesis_input,
    voice=voice,
    audio_config=audio_config
)

with open("output.mp3", "wb") as f:
    f.write(response.audio_content)

Getting Started with ElevenLabs API (Python)

ElevenLabs is widely regarded for highly natural, expressive voices. Its free tier includes roughly 10,000 characters per month.

Prerequisites

An ElevenLabs account (free at elevenlabs.io)
Your API key from your profile settings
pip install elevenlabs

Basic Example

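from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")

# Recent SDK releases expose synthesis as text_to_speech.convert;
# older 1.x releases used client.generate(text=..., voice="Rachel", ...).
audio = client.text_to_speech.convert(
    text="Hello from ElevenLabs. This is a neural AI voice.",
    voice_id="YOUR_VOICE_ID",  # ID of a voice from the ElevenLabs library, e.g. Rachel
    model_id="eleven_multilingual_v2"
)

# convert() returns the audio as an iterator of byte chunks
with open("output.mp3", "wb") as f:
    for chunk in audio:
        f.write(chunk)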

Key Decisions When Building a TTS Feature

Pre-Generate or Generate on Demand?

For content that's known in advance (articles, documentation, fixed scripts), pre-generating audio files and caching them is almost always the right approach. It's cheaper (you only pay for generation once), faster for the user (no generation latency), and more reliable. Store generated files in an object store (S3, GCS, R2) and serve them via CDN.

For dynamic or user-generated content where text is not known until request time, on-demand generation is necessary. Use streaming mode to minimize perceived latency.
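
The pre-generation approach can be sketched as a small cache keyed by everything that affects the audio. This is an illustration under stated assumptions: a local directory stands in for the object store, and synthesize is a hypothetical callable wrapping whichever TTS API you use.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str, voice_id: str, output_format: str = "mp3") -> str:
    """Derive a stable filename from the inputs that determine the audio."""
    raw = "\x1f".join([voice_id, output_format, text]).encode("utf-8")
    return hashlib.sha256(raw).hexdigest() + "." + output_format


def get_or_generate(text: str, voice_id: str, cache_dir: Path,
                    synthesize: Callable[[str, str], bytes]) -> Path:
    """Return the cached audio file, calling synthesize() only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / cache_key(text, voice_id)
    if not path.exists():
        path.write_bytes(synthesize(text, voice_id))
    return path
```

Because the key includes the voice and format, changing either produces a new file instead of serving stale audio. If you also vary speaking rate or model, those belong in the key too.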

Handling Character Limits

All APIs have per-request limits. For long documents, chunk the text at natural boundaries — paragraph breaks, sentence ends, or section headers — and process each chunk separately. Concatenate the audio files using a library like pydub in Python before serving.

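from pydub import AudioSegment  # MP3 support requires ffmpeg on the system

# chunk_files: paths to the per-chunk MP3s, in reading order
segments = [AudioSegment.from_mp3(f) for f in chunk_files]
combined = sum(segments)  # AudioSegment supports sum() by concatenation
combined.export("full_article.mp3", format="mp3")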
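
The chunking step itself might look like the following sketch. It splits on sentence-ending punctuation and packs sentences into chunks up to a limit. The chunk_text function is illustrative, and the 3000-character default is an assumption; check your provider's documented per-request limit.

```python
import re


def chunk_text(text: str, max_chars: int = 3000) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Split after ., ! or ? followed by whitespace (paragraph breaks count as whitespace)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # A single oversized sentence is hard-split as a last resort
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can be synthesized separately and written to its own file, giving you the list of files to concatenate.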

Error Handling and Retry Logic

TTS API calls can fail due to network issues, rate limiting, or temporary service unavailability. Implement exponential backoff for retries and graceful fallback behavior (informing the user that audio is temporarily unavailable rather than silently failing).
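
A minimal sketch of exponential backoff with jitter is shown below. The with_backoff helper and its parameter names are illustrative, and in practice you would narrow retry_on to the transient error types your SDK raises rather than retrying on every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(call: Callable[[], T], max_attempts: int = 5,
                 base_delay: float = 0.5, max_delay: float = 8.0,
                 retry_on: tuple = (Exception,),
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller show a fallback
            # Delay ceiling doubles each attempt (0.5s, 1s, 2s, ...), capped at max_delay
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be at least 1")
```

Wrap the synthesis call in a lambda, for example with_backoff(lambda: polly.synthesize_speech(...)), and catch the final exception at the call site to show the "audio temporarily unavailable" message.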

What to Explore Next

Once you have basic synthesis working, explore these areas:

SSML markup for fine-grained control — covered in our article on How to Make TTS Sound More Natural
Voice selection strategy — Tips for Choosing the Right TTS Voice
API comparison for when your requirements outgrow your initial choice — Comparing the Top Text-to-Speech APIs in 2026
Website integration — How to Integrate TTS into Your Website

The APIs covered here all have excellent documentation and active developer communities. The fastest way to learn is to get something working first — which you now can — and then explore the edges from there.

