Intro

I’ve been testing a lot of text-to-speech options this month for my new project, MyPod AI.

Here are my notes - from cloud to local server, CPU to GPU, and open source to proprietary.

I’m ignoring voice cloning and other extras for now - this is just about basic text to speech.
To keep things comparable, I’m quoting the price plans that line up roughly across all options (e.g. the price for 1 million chars).

I will update this page as I try out more options and learn more.

Cloud Based

ElevenLabs

ElevenLabs is the current winner of the war for high-quality, realistic voices. It’s easy to use, and the voices are amazing.

But the cost is extremely high for my particular use case.

$99/month gets you about 10 hours of audio (500,000 chars).
$330/month gets you about 40 hours of audio (2,000,000 chars).

Azure TTS, AWS Polly, Google TTS

These are the big players of cloud services, so it’s often easier to use the one that you’re already using for other services. They are all priced pretty much the same.

$16 gets you ~20 hours of audio (1,000,000 chars).
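To make the gap easier to see, here is a small sketch that puts the figures quoted on this page side by side. The numbers are just the ones listed above (treating the monthly plans as a flat price for their quota), and the names PLANS, cost_per_hour and cost_per_million_chars are my own, made up for illustration.

```python
# Prices, character quotas and approximate hours as quoted on this page.
# Each entry is (price in USD, characters included, approx. hours of audio).
PLANS = {
    "ElevenLabs $99/month": (99.0, 500_000, 10),
    "ElevenLabs $330/month": (330.0, 2_000_000, 40),
    "Azure / AWS / Google": (16.0, 1_000_000, 20),
}


def cost_per_hour(price, hours):
    """Price divided by the approximate hours of audio it buys."""
    return price / hours


def cost_per_million_chars(price, chars):
    """Price scaled to a common quota of 1,000,000 characters."""
    return price / chars * 1_000_000


for name, (price, chars, hours) in PLANS.items():
    print(
        f"{name}: ${cost_per_hour(price, hours):.2f}/hour, "
        f"${cost_per_million_chars(price, chars):.2f} per 1M chars"
    )
# ElevenLabs $99/month: $9.90/hour, $198.00 per 1M chars
# ElevenLabs $330/month: $8.25/hour, $165.00 per 1M chars
# Azure / AWS / Google: $0.80/hour, $16.00 per 1M chars
```

On these rough numbers the big cloud providers come out at about a tenth of the ElevenLabs price per hour of audio.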

The documentation ranges from good to passable with some googling, since these services have been around a while.

I have only used Azure so far - and found it easy to use, and the Neural voices are very good.

Not as realistic and high quality as ElevenLabs, but still very good if you choose voices carefully. They can be a little monotonous and hard to listen to for long periods.

Lots of options to choose from, and languages too.

Azure supports SSML, which is an XML-based language for controlling various speech parameters (like tone, speed, etc.).
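To give a feel for what SSML looks like, here is a minimal sketch that builds a small document with Python's standard library. The speak, voice and prosody elements are standard SSML; the build_ssml helper, the default voice name en-US-JennyNeural and the rate and pitch values are my own assumptions for illustration, so check them against the voices and attributes your provider actually supports.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", rate="-10%", pitch="+0%", lang="en-US"):
    """Wrap plain text in a minimal SSML document (hypothetical helper).

    The voice name and prosody values are assumed defaults, not recommendations.
    """
    speak = ET.Element(
        "speak",
        {"version": "1.0", "xmlns": SSML_NS, "xml:lang": lang},
    )
    voice_el = ET.SubElement(speak, "voice", {"name": voice})
    # prosody adjusts how the enclosed text is spoken (speed and pitch here)
    prosody = ET.SubElement(voice_el, "prosody", {"rate": rate, "pitch": pitch})
    prosody.text = text  # ElementTree escapes &, < and > for us
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("Hello & welcome to the show.")
print(ssml)
```

The resulting string is what you would hand to the provider's SDK or REST API in place of plain text.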

Open Source

There are quite a few open source and local server options available, from hosting your own models and inference endpoints to using pre-built API solutions.

Some of these need GPUs, but many can be run on a plain old CPU, even on a laptop. I’ll keep updating this page as I try out different options.

Tortoise-TTS

https://github.com/neonbjb/tortoise-tts

Fast fork: https://github.com/152334H/tortoise-tts-fast

Tortoise seems to get as close to ElevenLabs quality as you can with open source. But it’s named after one of the slowest animals for a reason: it’s extremely slow, and it requires a GPU.

Fun to play with, but not feasible for my needs in a production app currently.

Piper TTS

https://github.com/rhasspy/piper

Excellent open source project that has many voices.

Coqui TTS

https://github.com/coqui-ai/TTS

All-round API for using TTS models. Has some really good voices.

Discussion on best voices: https://github.com/coqui-ai/TTS/discussions/1891

Running CPU version in Docker:

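# On the host: start the container and open a shell inside it
docker run --rm -it -p 5002:5002 --entrypoint /bin/bash ghcr.io/coqui-ai/tts-cpu
# Inside the container: list the available models
python3 TTS/server/server.py --list_models
# Inside the container: start the server with a chosen model
python3 TTS/server/server.py --model_name tts_models/en/vctk/vits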
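Once the server is running on port 5002, you can request audio over HTTP. Below is a rough sketch using only the Python standard library. It assumes the demo server exposes a GET endpoint at /api/tts that takes text and speaker_id query parameters and returns WAV audio, and that p225 is a valid speaker for the VCTK model; that is my understanding rather than something I have verified against every version, so treat the endpoint, the parameter names and the speaker as assumptions. The helper names are made up for illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_tts_url(text, speaker_id="p225", base_url="http://localhost:5002"):
    """Build the request URL (assumed endpoint: /api/tts with text and speaker_id)."""
    query = urlencode({"text": text, "speaker_id": speaker_id})
    return f"{base_url}/api/tts?{query}"


def synthesize_to_file(text, out_path="output.wav", **url_options):
    """Fetch audio from the local server and save it to out_path.

    Needs the Docker container above to be running; the response is
    assumed to be WAV audio.
    """
    with urlopen(build_tts_url(text, **url_options), timeout=60) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path


print(build_tts_url("Hello world"))
# http://localhost:5002/api/tts?text=Hello+world&speaker_id=p225
```

With the container up, calling synthesize_to_file("Hello from my laptop.") should write output.wav to the current directory, provided the assumptions above hold for your version of the server.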

Text to speech for the average dev
