Text-to-Speech for Developers: Getting Started
Adding Voice to Your Application Is Easier Than It Was Three Years Ago
If you've been putting off integrating TTS into your application because you expected it to be complex, you may be working from outdated assumptions. The major cloud TTS APIs have invested heavily in developer experience over the last few years. Documentation is comprehensive, SDKs exist for every mainstream language, and free tiers give you room to experiment without a billing commitment.
This guide gets you from zero to working TTS output with three different APIs (Amazon Polly, Google Cloud TTS, and ElevenLabs), then covers the concepts you need to build on that foundation.
Core Concepts Before You Write Any Code
How TTS APIs Work
All major TTS APIs follow the same basic pattern: you send a request containing your text, your preferred voice, and optional configuration parameters; the API returns audio data (as a stream or a file). Your application receives that audio and plays it, stores it, or serves it to end users.
The key parameters you'll configure in almost every TTS API call:
- Text input: plain text or SSML-formatted text
- Voice ID: which voice to use (each platform has its own naming conventions)
- Output format: MP3, OGG, PCM, or others depending on the API
- Language code: e.g., "en-US", "en-GB", "es-ES"
- Speaking rate / pitch / volume: optional prosody controls
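To make the parameter list concrete, here is a minimal sketch of a provider-neutral request object. The `TTSRequest` class and its field names are hypothetical and not part of any SDK; each provider spells these parameters differently, so treat it as a way to organize your own code rather than as an API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TTSRequest:
    """Hypothetical container for the parameters most TTS calls share."""
    text: str                              # plain text or SSML
    voice_id: str                          # provider-specific voice name or ID
    output_format: str = "mp3"             # e.g. "mp3", "ogg", "pcm"
    language_code: str = "en-US"
    is_ssml: bool = False
    speaking_rate: Optional[float] = None  # None means "use the provider default"

    def __post_init__(self):
        # Catch the two most common mistakes before spending an API call.
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.is_ssml and not self.text.lstrip().startswith("<speak"):
            raise ValueError("SSML input must be wrapped in <speak>...</speak>")
```

Validating up front like this keeps malformed input from turning into a billed or rate-limited request.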
Streaming vs. Batch
TTS APIs typically offer two delivery modes:
Synchronous / batch: You send a request, wait for the complete audio to be generated, and receive it all at once. Simple to implement. Best for pre-generating audio files, background processing, and non-time-sensitive applications.
Streaming: The API begins returning audio as soon as the first chunk is generated, before the full output is complete. Essential for low-latency applications like voice assistants, conversational AI, and real-time narration. Slightly more complex to implement (you need to handle a streaming response).
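As a rough illustration of what handling a streaming response involves, the sketch below writes audio to a file-like object chunk by chunk instead of waiting for the whole payload. The `write_audio_stream` helper is hypothetical; it only assumes the SDK hands you an iterable of byte chunks.

```python
from typing import BinaryIO, Iterable


def write_audio_stream(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write audio chunks to `sink` as they arrive; return total bytes written."""
    total = 0
    for chunk in chunks:
        if not chunk:
            # Skip empty chunks rather than treating them as end-of-stream.
            continue
        sink.write(chunk)
        total += len(chunk)
    return total
```

With Amazon Polly, for example, the `AudioStream` body returned by boto3 can be read incrementally with `iter_chunks()`, so something like `write_audio_stream(response['AudioStream'].iter_chunks(), f)` should work. In a real low-latency application the sink would be an audio player or an HTTP response rather than a file.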
Getting Started with Amazon Polly (Python)
Amazon Polly is a good starting point for most developers: it's reliable, well-documented, and the free tier includes 5 million characters per month for standard voices (1 million for neural voices) for the first 12 months.
Prerequisites
- An AWS account (free at aws.amazon.com)
- An IAM user with the AmazonPollyReadOnlyAccess policy (or more permissive) and programmatic access credentials
- Python 3.x and boto3 installed (pip install boto3)
Basic Example
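import boto3

# Keys are inlined here only to keep the example short. If you omit them,
# boto3 falls back to environment variables or your AWS credentials file.
polly = boto3.client(
    service_name='polly',
    region_name='us-east-1',
    aws_access_key_id='YOUR_ACCESS_KEY',
    aws_secret_access_key='YOUR_SECRET_KEY'
)

response = polly.synthesize_speech(
    Text='Hello from Amazon Polly. This is a test of neural text-to-speech.',
    OutputFormat='mp3',
    VoiceId='Joanna',  # Neural voice
    Engine='neural'
)

# AudioStream is a streaming body; read() returns the complete MP3 as bytes
with open('output.mp3', 'wb') as f:
    f.write(response['AudioStream'].read())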
Run this and you'll find an output.mp3 file in your working directory. Open it and you'll hear Joanna (neural) reading your text.
Using SSML with Polly
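# Polly's neural voices do not support the <emphasis> tag, so this example
# uses the standard engine. <break> works with both engines.
response = polly.synthesize_speech(
    Text='<speak>This is <emphasis level="strong">very important</emphasis>. <break time="500ms"/> Please pay attention.</speak>',
    TextType='ssml',
    OutputFormat='mp3',
    VoiceId='Joanna',
    Engine='standard'
)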
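When the SSML is built from arbitrary text rather than typed by hand, characters such as `&` and `<` need to be escaped, or the service is likely to reject the request as malformed XML. The following is a minimal sketch; the `to_ssml` helper is hypothetical and only uses the `<speak>` and `<break>` tags shown above.

```python
from xml.sax.saxutils import escape


def to_ssml(paragraphs, pause_ms=500):
    """Wrap plain-text paragraphs in <speak>, escaping XML and pausing between them."""
    pause = f'<break time="{int(pause_ms)}ms"/>'
    # escape() converts &, < and > so user text cannot break the markup.
    body = pause.join(escape(p) for p in paragraphs)
    return f"<speak>{body}</speak>"
```

The result can be passed as `Text` together with `TextType='ssml'`, as in the Polly call above.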
Getting Started with Google Cloud TTS (Python)
Prerequisites
- A Google Cloud account with a project created
- The Cloud Text-to-Speech API enabled for your project
- A service account key JSON file with TTS permissions
pip install google-cloud-texttospeech
Basic Example
import os

from google.cloud import texttospeech

# Point the client library at your service account key
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "path/to/your-service-account-key.json"

client = texttospeech.TextToSpeechClient()

synthesis_input = texttospeech.SynthesisInput(
    text="Hello from Google Cloud TTS. This uses a Neural2 voice."
)

voice = texttospeech.VoiceSelectionParams(
    language_code="en-US",
    name="en-US-Neural2-F",  # Neural2 voice
    ssml_gender=texttospeech.SsmlVoiceGender.FEMALE
)

audio_config = texttospeech.AudioConfig(
    audio_encoding=texttospeech.AudioEncoding.MP3,
    speaking_rate=0.95  # slightly slower than the default of 1.0
)

response = client.synthesize_speech(
    input=synthesis_input,
    voice=voice,
    audio_config=audio_config
)

# audio_content holds the complete MP3 as bytes
with open("output.mp3", "wb") as f:
    f.write(response.audio_content)
Getting Started with ElevenLabs API (Python)
ElevenLabs is widely regarded as offering some of the most natural-sounding voices available through an API. Its free tier includes 10,000 characters per month.
Prerequisites
- An ElevenLabs account (free at elevenlabs.io)
- Your API key from your profile settings
pip install elevenlabs
Basic Example
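from elevenlabs.client import ElevenLabs
from elevenlabs import save

client = ElevenLabs(api_key="YOUR_API_KEY")

# generate() is the convenience helper from the 1.x SDK. Newer SDK releases
# may expose synthesis as client.text_to_speech.convert() instead, so check
# the version you have installed.
audio = client.generate(
    text="Hello from ElevenLabs. This is a neural AI voice.",
    voice="Rachel",  # Voice name from ElevenLabs library
    model="eleven_multilingual_v2"
)

save(audio, "output.mp3")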
Key Decisions When Building a TTS Feature
Pre-Generate or Generate on Demand?
For content that's known in advance (articles, documentation, fixed scripts), pre-generating audio files and caching them is almost always the right approach. It's cheaper (you only pay for generation once), faster for the user (no generation latency), and more reliable. Store generated files in an object store (S3, GCS, R2) and serve them via CDN.
For dynamic or user-generated content where text is not known until request time, on-demand generation is necessary. Use streaming mode to minimize perceived latency.
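The pre-generate-and-cache approach can be sketched as a small content-addressed cache. Everything here is an assumption for illustration: `cached_audio` is a hypothetical helper, `synthesize` stands in for whichever API call you use, and a local directory stands in for the object store.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_audio(
    text: str,
    voice: str,
    synthesize: Callable[[str, str], bytes],
    cache_dir: Path,
) -> Path:
    """Return the path to the audio for (text, voice), synthesizing only on a cache miss."""
    # The key covers both text and voice, so changing either produces a new file.
    key = hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.mp3"
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_bytes(synthesize(text, voice))
    return path
```

Because the file name is derived from the content, edited articles get fresh audio automatically and unchanged ones are never billed twice. The same key could serve as the object name in S3, GCS, or R2.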
Handling Character Limits
All APIs have per-request limits. For long documents, chunk the text at natural boundaries (paragraph breaks, sentence ends, or section headers) and process each chunk separately. Concatenate the audio files using a library like pydub in Python before serving. Note that pydub needs ffmpeg installed to read and write MP3.
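from pydub import AudioSegment

# Hypothetical file names: the per-chunk MP3s you generated, in reading order
chunk_files = ["chunk_000.mp3", "chunk_001.mp3", "chunk_002.mp3"]

segments = [AudioSegment.from_mp3(f) for f in chunk_files]
combined = sum(segments)  # pydub lets sum() concatenate segments end to end
combined.export("full_article.mp3", format="mp3")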
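The chunking step itself might look like the sketch below. It is one possible approach, not a provider-recommended algorithm: the hypothetical `chunk_text` helper splits on sentence-ending punctuation with a simple regular expression and packs sentences greedily. Set `max_chars` to your provider's documented per-request limit.

```python
import re
from typing import List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends where possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit has to be hard-split.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

The regular expression treats every period as a sentence end, so abbreviations like "e.g." can cause an early break. For narration-quality output you may prefer to split on paragraph boundaries first and fall back to sentences only when a paragraph is too long.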
Error Handling and Retry Logic
TTS API calls can fail due to network issues, rate limiting, or temporary service unavailability. Implement exponential backoff for retries and graceful fallback behavior (informing the user that audio is temporarily unavailable rather than silently failing).
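A minimal sketch of exponential backoff with jitter follows. The `retry_with_backoff` helper, its defaults, and the choice of which exceptions to retry are all assumptions; in practice you would pass the specific throttling or connection exceptions your SDK raises instead of retrying on every `Exception`.

```python
import random
import time


def retry_with_backoff(
    call,
    max_attempts=4,
    base_delay=0.5,
    retry_on=(Exception,),
    sleep=time.sleep,
):
    """Run call(); on failure wait base_delay * 2**attempt (plus jitter) and try again."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                # Out of attempts: let the caller decide on the fallback.
                raise
            delay = base_delay * (2 ** attempt)
            # Jitter keeps many clients from retrying at the same instant.
            sleep(delay + random.uniform(0, delay / 2))
```

A caller can wrap any of the synthesis calls above in a zero-argument function or lambda and pass it as `call`. When the final attempt fails the exception propagates, and that is the point at which to show the user an "audio temporarily unavailable" message.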
What to Explore Next
Once you have basic synthesis working, explore these areas:
- SSML markup for fine-grained control: covered in our article on How to Make TTS Sound More Natural
- Voice selection strategy: Tips for Choosing the Right TTS Voice
- API comparison for when your requirements outgrow your initial choice: Comparing the Top Text-to-Speech APIs in 2026
- Website integration: How to Integrate TTS into Your Website
The APIs covered here all have excellent documentation and active developer communities. The fastest way to learn is to get something working first, which you now can, and then explore the edges from there.