
Top 11 Open Source Speech Recognition/Speech-to-Text Systems

M.Hanny Sabbagh

Last Updated on: June 19, 2024


A speech-to-text (STT) system, sometimes called automatic speech recognition (ASR), is what its name implies: a way of transforming spoken words into textual data that can be used later for any purpose.

Speech recognition technology is extremely useful. It can be used for a lot of applications such as the automation of transcription, writing books/texts using sound only, enabling complicated analysis on information using the generated textual files and a lot of other things.

In the past, speech-to-text technology was dominated by proprietary software and libraries. Open source speech recognition alternatives either didn’t exist or came with severe limitations and no community around them.

This is changing: today there are many open source speech-to-text tools and libraries that you can use right now.

What is a Speech Recognition Library/System?

It is the software engine responsible for transforming voice into text.

It is not meant to be used by end users. Developers first have to adapt these libraries and use them to create computer programs that offer speech recognition to users.

Some of them come with pre-trained models that recognize speech in one language and generate the corresponding text, while others provide only the engine without any models, and developers have to train the models themselves. This can be a complex task, as it requires a deep understanding of machine learning and data handling.

You can think of them as the underlying engines of speech recognition programs.

If you are an ordinary user looking for speech recognition, then none of these will be suitable for you, as they are meant for development use only.

What is an Open Source Speech Recognition Library?

The difference between proprietary and open source speech recognition is that the library used to process the audio must be licensed under one of the known open source licenses, such as the GPL, MIT and others.

Microsoft and IBM, for example, have their own speech recognition toolkits that they offer to developers, but these are not open source, simply because they are not released under an open source license.

What are the Benefits of Using Open Source Speech Recognition?

Mainly, you get few or no restrictions on commercial usage in your application, as open source speech recognition libraries allow you to use them for whatever use case you may need.

Also, most – if not all – open source speech recognition toolkits on the market are free of charge, which can save you a lot of money compared to the proprietary ones.

The benefits of using open source speech recognition toolkits are too many to summarize in one article.

Top Open Source Speech Recognition Systems

In this article we’ll look at several of them, their pros and cons, and when they should be used.

1. Project DeepSpeech

This project is made by Mozilla, the organization behind the Firefox browser.

It’s a 100% free and open source speech-to-text library that employs machine learning, using the TensorFlow framework, to fulfill its mission. In other words, you can use it to train models yourself to improve the underlying speech-to-text technology and get better results, or even to bring it to other languages if you want.

You can also easily integrate it into other machine learning projects that you have on TensorFlow. Sadly, the project currently seems to support only English by default. It also has bindings for several programming languages, such as Python (3.6).

However, after the recent Mozilla restructure, the future of the project is unknown, as it may or may not be shut down depending on what Mozilla decides.

You may visit its Project DeepSpeech homepage to learn more.

2. Kaldi

Kaldi is open source speech recognition software written in C++ and released under the Apache license.

It works on Windows, macOS and Linux. Its development started back in 2009. Kaldi’s main advantage over some other speech recognition software is that it’s extensible and modular: the community provides many third-party modules that you can use for your tasks.

Kaldi also supports deep neural networks, and offers excellent documentation on its website. While the code is mainly written in C++, it’s “wrapped” by Bash and Python scripts.

So if you are looking just for the basic usage of converting speech to text, you’ll find it easy to accomplish via either Python or Bash. You may also wish to check Kaldi Active Grammar, a pre-built Python engine with trained English models ready for use.

Learn more about Kaldi speech recognition from its official website.

3. Julius

Probably one of the oldest speech recognition programs ever: its development started in 1991 at the University of Kyoto, and its ownership was transferred to an independent project in 2005. A lot of open source applications use it as their engine (think of KDE Simon).

Julius’s main features include the ability to perform real-time STT, low memory usage (less than 64MB for 20,000 words), the ability to produce N-best/word-graph output, the ability to work as a server unit and a lot more.

This software was mainly built for academic and research purposes. It is written in C, and works on Linux, Windows, macOS and even Android (on smartphones). Currently it supports English and Japanese only.

The software can probably be installed easily from your Linux distribution’s repository; just search for the julius package in your package manager.

You can access the Julius source code from GitHub.

4. Flashlight ASR (formerly Wav2Letter++)

If you are looking for something modern, then this one is worth considering.

Flashlight ASR is open source speech recognition software released by Facebook’s AI Research team. The code is written in C++ and released under the MIT license.

Up to 2018, Facebook described the library as “the fastest state-of-the-art speech recognition system available”.

The concepts on which this tool is built make it optimized for performance by default. Facebook’s machine learning library Flashlight is used as the underlying core of Flashlight ASR. The software requires that you first train a model for the language you want before you can run the speech recognition process.

No pre-built support for any language (including English) is available. It’s just a machine-learning-driven tool to convert speech to text.

You can learn more about it from the following link.

5. PaddleSpeech (formerly DeepSpeech2)

Researchers at the Chinese giant Baidu are also working on their own speech recognition toolkit, called PaddleSpeech.

The speech toolkit is built on the PaddlePaddle deep learning framework, and provides many features such as:

  • Speech-to-Text support.
  • Text-to-Speech support.
  • State-of-the-art performance in audio transcription; it even won the NAACL2022 Best Demo Award.
  • Support for many large language models (LLMs), mainly for the English and Chinese languages.

The engine can be trained on any model and for any language you desire.

PaddleSpeech’s source code is written in Python, so it should be easy for you to get familiar with it if that’s the language you use.

6. OpenSeq2Seq

Developed by NVIDIA for training sequence-to-sequence models.

While it can be used for much more than speech recognition, it is nonetheless a good engine for this use case. You can either train your own models for it, or use the models that ship with it by default. It supports parallel processing using multiple GPUs or multiple CPUs, and has strong support for NVIDIA technologies such as CUDA and NVIDIA’s graphics cards.

As of 2021 the project is archived; it can still be used, but it is no longer under active development.

Check its speech recognition documentation page for more information, or visit its official source code page.

7. Vosk

One of the newest open source speech recognition systems, as its development started only in 2020.

Unlike other systems in this list, Vosk is ready to use right after installation, as it supports 10 languages (English, German, French, Turkish…) with portable 50MB-sized models already available for users (larger models of up to 1.4GB are also available if you need them).

It also works on Raspberry Pi, iOS and Android devices, and provides a streaming API which allows you to connect to it to do your speech recognition tasks online. Vosk has bindings for Java, Python, JavaScript, C# and NodeJS.
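As a rough illustration of the Python bindings mentioned above, the sketch below feeds a WAV file to Vosk in chunks and collects the recognized text. It assumes the `vosk` package is installed, that a model has already been downloaded and unpacked, and that the audio is 16-bit mono PCM; the function name and both path arguments are hypothetical and chosen only for this example.

```python
import json
import wave


def transcribe_with_vosk(wav_path: str, model_dir: str) -> str:
    """Transcribe a 16-bit mono PCM WAV file with a local Vosk model."""
    # Imported lazily so this module still loads when vosk is not installed.
    from vosk import KaldiRecognizer, Model

    pieces = []
    with wave.open(wav_path, "rb") as wav:
        recognizer = KaldiRecognizer(Model(model_dir), wav.getframerate())
        while True:
            chunk = wav.readframes(4000)  # read the audio a few thousand frames at a time
            if not chunk:
                break
            # AcceptWaveform returns True when an utterance has been finalized.
            if recognizer.AcceptWaveform(chunk):
                pieces.append(json.loads(recognizer.Result())["text"])
        # Flush whatever is still buffered at the end of the file.
        pieces.append(json.loads(recognizer.FinalResult())["text"])
    return " ".join(piece for piece in pieces if piece)
```

Reading in chunks mirrors how audio would arrive from a live stream, which is why the same loop shape tends to work for both files and microphones.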

Learn more about Vosk from its official website.

8. Athena

An end-to-end automatic speech recognition (ASR) engine.

It is written in Python and licensed under the Apache 2.0 license. It supports unsupervised pre-training and multi-GPU training on either a single machine or multiple machines, and is built on top of TensorFlow.

It has a large model available for both the English and Chinese languages.

Visit the Athena source code.

9. ESPnet

Written in Python on top of PyTorch.

It also supports end-to-end ASR. It follows the Kaldi style for data processing, so it should be easier to migrate from Kaldi to ESPnet. The main selling point of ESPnet is the state-of-the-art performance it gives in many benchmarks, and its support for other speech processing tasks such as text-to-speech (TTS), machine translation (MT) and speech translation (ST).

Licensed under the Apache 2.0 license.

You can access ESPnet from the following link.

10. Whisper

The newest speech recognition toolkit in the family, developed by OpenAI (the same company behind ChatGPT).

The main selling point of Whisper is that it is not specialized for a set of training datasets in specific languages only; instead, it is a general-purpose multilingual model. It was trained on 680 thousand hours of audio, one third of which was non-English.

It supports multilingual speech-to-text, speech translation and language identification. The company claims that Whisper makes 50% fewer errors in its output than other toolkits on the market.

Learn more about Whisper from its official website.

11. StyleTTS2

The newest library on the list, released in the middle of November 2023. Note that, as the quote below shows, it targets text-to-speech (TTS) synthesis rather than speech recognition. It employs diffusion techniques together with training against large speech language models (SLMs) in order to achieve more advanced results than other models.

The makers of the model published it along with a research paper, where they make the following claim about their work:

This work achieves the first human-level TTS synthesis on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs.

It is written in Python, and ships with some Jupyter notebooks that demonstrate how to use it. The model is licensed under the MIT license.

There is an online demo where you can see different benchmarks of the model.

What is the Best Open Source Speech Recognition System?

If you are building a small application that you want to be portable everywhere, then Vosk is your best option, as it has Python bindings, works on iOS, Android and Raspberry Pi too, and supports up to 10 languages. It also provides a large model if you need it, and a smaller one for portable applications.

If, however, you want to train and build your own models for more complex tasks, then any of PaddleSpeech, Whisper and Athena should be more than enough for your needs, as they are the most modern state-of-the-art toolkits.

As for Mozilla’s DeepSpeech, it lacks many features that its competitors in this list have, and isn’t cited in academic speech recognition research as much as the others. Its future is also uncertain after the recent Mozilla restructure, so you may want to stay away from it for now.

Traditionally, Julius and Kaldi are also very much cited in the academic literature.

Alternatively, you may try these open source speech recognition libraries to see how they work for you in your use case.

The speech recognition category is starting to become mainly driven by open source technologies, a situation that seemed very far-fetched a few years ago.

Current open source speech recognition software is modern and bleeding-edge, and can be used for many purposes instead of depending on Microsoft’s or IBM’s toolkits.

If you have any other recommendations for this list, or comments in general, we’d love to hear them below!


M.Hanny Sabbagh

With a B.Sc and M.Sc in Computer Science & Engineering, Hanny brings more than a decade of experience with Linux and open-source software. He has developed Linux distributions, desktop programs, web applications and much more. All of which attracted tens of thousands of users over many years. He additionally maintains other open-source related platforms to promote it in his local communities.

Hanny is the founder of FOSS Post.


Originally published on August 23, 2020, Last Updated on June 19, 2024 by M.Hanny Sabbagh


The top free Speech-to-Text APIs, AI Models, and Open Source Engines



Choosing the best Speech-to-Text API, AI model, or open-source engine to build with can be challenging. You need to compare accuracy, model design, features, support options, documentation, security, and more.

This post examines the best free Speech-to-Text APIs and AI models on the market today, including ones that have a free tier, to help you make an informed decision. We’ll also look at several free open-source Speech-to-Text engines and explore why you might choose an API or AI model vs. an open-source library, or vice versa.
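Accuracy comparisons between speech-to-text systems are commonly expressed as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system's output into a reference transcript, divided by the number of reference words. The post does not say which metric it has in mind, so treat the following as a minimal sketch under that assumption; the function name and the lowercase, whitespace-split normalization are choices made for this example only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Running the same set of recordings through each candidate and comparing the resulting rates gives a like-for-like basis for the accuracy claims made by vendors and projects.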


Free Speech-to-Text APIs and AI Models

APIs and AI models are more accurate, easier to integrate, and come with more out-of-the-box features than open-source options. However, large-scale use of APIs and AI models can come with a higher cost than open-source options.

If you’re looking to use an API or AI model for a small project or a trial run, many of today’s Speech-to-Text APIs and AI models have a free tier. This means that the API or model is free for anyone to use up to a certain volume per day, per month, or per year.

Let’s compare three of the most popular Speech-to-Text APIs and AI models with a free tier: AssemblyAI, Google, and AWS Transcribe.

AssemblyAI is an API platform that offers AI models that accurately transcribe and understand speech, and enable users to extract insights from voice data. AssemblyAI offers cutting-edge AI models such as Speaker Diarization, Topic Detection, Entity Detection, Automated Punctuation and Casing, Content Moderation, Sentiment Analysis, Text Summarization, and more. These AI models help users get more out of voice data, with continuous improvements being made to accuracy.

AssemblyAI also offers LeMUR, which enables users to leverage Large Language Models (LLMs) to pull valuable information from their voice data, including answering questions, generating summaries and action items, and more.

The company offers up to 100 free transcription hours for audio files or video streams, with a concurrency limit of 5, before transitioning to an affordable paid tier.

Its high accuracy and diverse collection of AI models built by AI experts make AssemblyAI a sound option for developers looking for a free Speech-to-Text API. The API also supports virtually every audio and video file format out-of-the-box for easier transcription.

AssemblyAI has expanded the languages it supports to include English, Spanish, French, German, Japanese, Korean, and many more, with additional languages being released monthly. See the full list here.

AssemblyAI’s easy-to-use models also allow for quick set-up and transcription in any programming language. You can copy/paste code examples in your preferred language directly from the AssemblyAI Docs or use the AssemblyAI Python SDK or another one of its ready-to-use integrations.

Pricing:

  • Free to test in the AI playground, plus 100 free hours of asynchronous transcription with an API sign-up
  • Speech-to-Text – $0.37 per hour
  • Real-time Transcription – $0.47 per hour
  • Audio Intelligence – varies, $0.01 to $0.15 per hour
  • LeMUR – varies
  • Enterprise pricing is also available

See the full pricing list here.
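To make the per-hour figures above concrete, here is a small sketch that estimates a bill from an hourly rate and a free allowance. It assumes the free hours are simply deducted once before billing starts, which is a simplification for illustration; the function name and the 250-hour workload are hypothetical, and actual billing rules should be checked against the provider's pricing page.

```python
def estimate_cost(hours: float, rate_per_hour: float, free_hours: float = 0.0) -> float:
    """Estimated cost in dollars after subtracting a one-time free allowance."""
    billable_hours = max(0.0, hours - free_hours)
    return round(billable_hours * rate_per_hour, 2)


# Rates and free allowance taken from the list above; 250 hours is an example workload.
async_cost = estimate_cost(250, 0.37, free_hours=100)  # 150 billable hours
realtime_cost = estimate_cost(250, 0.47)               # no free allowance assumed
print(f"Asynchronous: ${async_cost:.2f}, real-time: ${realtime_cost:.2f}")
```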

Pros:

  • High accuracy
  • Breadth of AI models available, built by AI experts
  • Continuous model iteration and improvement
  • Developer-friendly documentation and SDKs
  • Enterprise-grade support and security

Cons:

  • Models are not open-source

Google Speech-to-Text is a well-known speech transcription API. Google gives users 60 minutes of free transcription, with $300 in free credits for Google Cloud hosting.

Google only supports transcribing files already in a Google Cloud Bucket, so the free credits won’t get you very far. Google also requires you to sign up for a GCP account and project — whether you're using the free tier or paid.

With good accuracy and 125+ languages supported, Google is a decent choice if you’re willing to put in some initial work.

Pricing:

  • 60 minutes of free transcription
  • $300 in free credits for Google Cloud hosting

Pros:

  • Decent accuracy
  • Multi-language support

Cons:

  • Only supports transcription of files in a Google Cloud Bucket
  • Difficult to get started
  • Lower accuracy than other similarly-priced APIs

AWS Transcribe

AWS Transcribe offers one hour free per month for the first 12 months of use.

Like Google, you must create an AWS account first if you don’t already have one. AWS also has lower accuracy compared to alternative APIs and only supports transcribing files already in an Amazon S3 bucket.

However, if you’re looking for a specific feature, like medical transcription, AWS has some options. Its Transcribe Medical API is a medical-focused ASR option that is available today.

Pricing:

  • One hour free per month for the first 12 months of use
  • Tiered pricing, based on usage, ranges from $0.02400 to $0.00780

Pros:

  • Integrates into existing AWS ecosystem
  • Medical language transcription

Cons:

  • Difficult to get started from scratch
  • Only supports transcribing files already in an Amazon S3 bucket

Open-Source Speech Transcription engines

An alternative to APIs and AI models, open-source Speech-to-Text libraries are completely free--with no limits on use. Some developers also see data security as a plus, since your data doesn’t have to be sent to a third party or the cloud.

There is work involved with open-source engines, so you must be comfortable putting in a lot of time and effort to get the results you want, especially if you are trying to use these libraries at scale. Open-source Speech-to-Text engines are typically less accurate than the APIs discussed above.

If you want to go the open-source route, here are some options worth exploring:

DeepSpeech is an open-source embedded Speech-to-Text engine designed to run in real-time on a range of devices, from high-powered GPUs to a Raspberry Pi 4. The DeepSpeech library uses end-to-end model architecture pioneered by Baidu.

DeepSpeech also has decent out-of-the-box accuracy for an open-source option and is easy to fine-tune and train on your own data.

Pros:

  • Easy to customize
  • Can use it to train your own model
  • Can be used on a wide range of devices

Cons:

  • Lack of support
  • No model improvement outside of individual custom training
  • Heavy lift to integrate into production-ready applications

Kaldi is a speech recognition toolkit that has been widely popular in the research community for many years.

Like DeepSpeech, Kaldi has good out-of-the-box accuracy and supports the ability to train your own models. It’s also been thoroughly tested—a lot of companies currently use Kaldi in production and have used it for a while—making more developers confident in its application.

Pros:

  • Can use it to train your own models
  • Active user base

Cons:

  • Can be complex and expensive to use
  • Uses a command-line interface

Flashlight ASR (formerly Wav2Letter)

Flashlight ASR, formerly Wav2Letter, is Facebook AI Research’s Automatic Speech Recognition (ASR) Toolkit. It is also written in C++ and uses the ArrayFire tensor library.

Like DeepSpeech, Flashlight ASR is decently accurate for an open-source library and is easy to work with on a small project.

Pros:

  • Customizable
  • Easier to modify than other open-source options
  • Processing speed

Cons:

  • Very complex to use
  • No pre-trained libraries available
  • Need to continuously source datasets for training and model updates, which can be difficult and costly

SpeechBrain

SpeechBrain is a PyTorch-based transcription toolkit. The platform releases open implementations of popular research works and offers a tight integration with Hugging Face for easy access.

Overall, the platform is well-defined and constantly updated, making it a straightforward tool for training and finetuning.

Pros:

  • Integration with PyTorch and Hugging Face
  • Pre-trained models are available
  • Supports a variety of tasks

Cons:

  • Even its pre-trained models take a lot of customization to make them usable
  • Lack of extensive docs makes it not as user-friendly, except for those with extensive experience

Coqui is another deep learning toolkit for Speech-to-Text transcription. Coqui is used in over twenty languages for projects and also offers a variety of essential inference and productionization features.

The platform also releases custom-trained models and has bindings for various programming languages for easier deployment.

Pros:

  • Generates confidence scores for transcripts
  • Large support community

Cons:

  • No longer updated and maintained by Coqui

Whisper by OpenAI, released in September 2022, is comparable to other current state-of-the-art open-source options.

Whisper can be used either in Python or from the command line and can also be used for multilingual translation.
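For the Python route, a minimal sketch might look like the following. It assumes the `openai-whisper` package is installed (it is imported as `whisper`) and that ffmpeg is available for decoding audio; the wrapper function and its parameter names are hypothetical, while `load_model` and `transcribe` are the package's own calls.

```python
def transcribe_with_whisper(audio_path: str, model_size: str = "base",
                            translate_to_english: bool = False) -> str:
    """Transcribe (or translate into English) an audio file with a local Whisper model."""
    # Imported lazily so this module still loads when openai-whisper is not installed.
    import whisper

    model = whisper.load_model(model_size)  # downloads the weights on first use
    task = "translate" if translate_to_english else "transcribe"
    result = model.transcribe(audio_path, task=task)
    return result["text"].strip()
```

Larger model sizes generally trade speed and memory for accuracy, so starting with a small model and moving up only if the output is not good enough is a reasonable default.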

Whisper has five different models of varying sizes and capabilities, depending on the use case, including v3 released in November 2023.

However, you’ll need substantial computing power and access to an in-house team to maintain, scale, update, and monitor the model to run Whisper at a large scale, making the total cost of ownership higher compared to other options.

As of March 2023, Whisper is also available via API. On-demand pricing starts at $0.006/minute.
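Because this price is quoted per minute while the hosted APIs earlier in this post are quoted per hour, a quick unit conversion helps when comparing them. The snippet below is only arithmetic on the figures stated in this post; the helper name is made up for the example.

```python
WHISPER_API_PER_MINUTE = 0.006  # on-demand price quoted above
ASSEMBLYAI_PER_HOUR = 0.37      # asynchronous Speech-to-Text price quoted earlier


def per_minute_to_per_hour(price_per_minute: float) -> float:
    """Convert a per-minute price to the equivalent per-hour price."""
    return price_per_minute * 60


whisper_api_per_hour = per_minute_to_per_hour(WHISPER_API_PER_MINUTE)
print(f"Whisper API: ${whisper_api_per_hour:.2f} per hour")
print(f"AssemblyAI:  ${ASSEMBLYAI_PER_HOUR:.2f} per hour")
```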

Pros:

  • Multilingual transcription
  • Can be used in Python
  • Five models are available, each with different sizes and capabilities

Cons:

  • Need an in-house research team to maintain and update
  • Costly to run

Which free Speech-to-Text API, AI model, or Open Source engine is right for your project?

The best free Speech-to-Text API, AI model, or open-source engine will depend on your project. Do you want something that is easy to use, has high accuracy, and has additional out-of-the-box features? If so, one of the APIs covered above (AssemblyAI, Google Speech-to-Text, or AWS Transcribe) might be right for you.

Alternatively, you might want a completely free option with no data limits, if you don’t mind the extra work it will take to tailor a toolkit to your needs. If so, you might choose one of the open-source libraries covered above: DeepSpeech, Kaldi, Flashlight ASR, SpeechBrain, Coqui, or Whisper.

Whichever you choose, make sure you find a product that can continually meet the needs of your project now and what your project may develop into in the future.


DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.



Project DeepSpeech


DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper . Project DeepSpeech uses Google's TensorFlow to make the implementation easier.

Documentation covering installation, usage, and model training is available online.

For the latest release, including pre-trained models and checkpoints, see the latest release on GitHub .

For contribution guidelines, see CONTRIBUTING.rst .

For contact and support information, see SUPPORT.rst .


SpeechRecognition 3.10.4

pip install SpeechRecognition

Released: May 5, 2024

Library for performing speech recognition, with support for several engines and APIs, online and offline.


License: BSD License (BSD)

Author: Anthony Zhang (Uberi)

Tags speech, recognition, voice, sphinx, google, wit, bing, api, houndify, ibm, snowboy

Requires: Python >=3.8


  • 5 - Production/Stable
  • OSI Approved :: BSD License
  • MacOS :: MacOS X
  • Microsoft :: Windows
  • POSIX :: Linux
  • Python :: 3
  • Python :: 3.8
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Multimedia :: Sound/Audio :: Speech
  • Software Development :: Libraries :: Python Modules


UPDATE 2022-02-09 : Hey everyone! This project started as a tech demo, but these days it needs more time than I have to keep up with all the PRs and issues. Therefore, I’d like to put out an open invite for collaborators - just reach out at me @ anthonyz . ca if you’re interested!

Speech recognition engine/API support:

Quickstart: pip install SpeechRecognition . See the “Installing” section for more details.

To quickly try it out, run python -m speech_recognition after installing.

Project links:

Library Reference

The library reference documents every publicly accessible object in the library. This document is also included under reference/library-reference.rst .

See Notes on using PocketSphinx for information about installing languages, compiling PocketSphinx, and building language packs from online resources. This document is also included under reference/pocketsphinx.rst .

You have to install Vosk models to use Vosk. The available models are listed here. Place them in the models folder of your project, like “your-project-folder/models/your-vosk-model”.

See the examples/ directory in the repository root for usage examples:

First, make sure you have all the requirements listed in the “Requirements” section.

The easiest way to install this is using pip install SpeechRecognition .

Otherwise, download the source distribution from PyPI , and extract the archive.

In the folder, run python setup.py install.


To use all of the functionality of the library, you should have:

The following requirements are optional, but can improve or extend functionality in some situations:

The following sections go over the details of each requirement.

The first software requirement is Python 3.8+ . This is required to use the library.

PyAudio (for microphone users)

PyAudio is required if and only if you want to use microphone input ( Microphone ). PyAudio version 0.2.11+ is required, as earlier versions have known memory management bugs when recording from microphones in certain situations.

If not installed, everything in the library will still work, except attempting to instantiate a Microphone object will raise an AttributeError .
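The behavior described above can be used to probe for microphone support before relying on it. The sketch below is a hedged example: the helper name is hypothetical, and catching `OSError` as well as `AttributeError` is an assumption made here to also cover machines where PyAudio is present but no input device can be opened.

```python
def microphone_available() -> bool:
    """Return True if a speech_recognition Microphone can be instantiated."""
    try:
        import speech_recognition as sr
    except ImportError:
        return False  # the SpeechRecognition library itself is missing
    try:
        sr.Microphone()
    except AttributeError:
        return False  # PyAudio is not installed, as described above
    except OSError:
        return False  # assumption: PyAudio is present but no usable input device
    return True


print("Microphone input available:", microphone_available())
```

A program could call this once at start-up and fall back to file-based input when it returns False.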

The installation instructions on the PyAudio website are quite good - for convenience, they are summarized below:

PyAudio wheel packages for common 64-bit Python versions on Windows and Linux are included for convenience, under the third-party/ directory in the repository root. To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the repository root directory .

PocketSphinx-Python (for Sphinx users)

PocketSphinx-Python is required if and only if you want to use the Sphinx recognizer ( recognizer_instance.recognize_sphinx ).
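As an illustration of how the Sphinx recognizer is typically called, the following sketch transcribes a WAV file offline. It assumes both SpeechRecognition and PocketSphinx-Python are installed; the wrapper function and its argument are hypothetical, while `Recognizer`, `AudioFile`, `record`, `recognize_sphinx`, `UnknownValueError` and `RequestError` come from the library.

```python
from typing import Optional


def transcribe_wav_offline(wav_path: str) -> Optional[str]:
    """Transcribe a WAV file with the offline Sphinx recognizer.

    Returns None when the speech could not be understood.
    """
    # Imported lazily so this module still loads without SpeechRecognition.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the entire file into memory
    try:
        return recognizer.recognize_sphinx(audio)
    except sr.UnknownValueError:
        return None  # audio was processed but nothing intelligible was found
    except sr.RequestError as exc:
        # Raised, for example, when PocketSphinx is missing or misconfigured.
        raise RuntimeError(f"Sphinx recognizer unavailable: {exc}") from exc
```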

PocketSphinx-Python wheel packages for 64-bit Python 3.4, and 3.5 on Windows are included for convenience, under the third-party/ directory . To install, simply run pip install wheel followed by pip install ./third-party/WHEEL_FILENAME (replace pip with pip3 if using Python 3) in the SpeechRecognition folder.

On Linux and other POSIX systems (such as OS X), follow the instructions under “Building PocketSphinx-Python from source” in Notes on using PocketSphinx for installation instructions.

Note that the versions available in most package repositories are outdated and will not work with the bundled language data. Using the bundled wheel packages or building from source is recommended.

Vosk (for Vosk users)

Vosk API is required if and only if you want to use Vosk recognizer ( recognizer_instance.recognize_vosk ).

You can install it with python3 -m pip install vosk .

You also have to install Vosk Models:

Here are models available for download. You have to place them in the models folder of your project, like “your-project-folder/models/your-vosk-model”.

Google Cloud Speech Library for Python (for Google Cloud Speech API users)

Google Cloud Speech library for Python is required if and only if you want to use the Google Cloud Speech API ( recognizer_instance.recognize_google_cloud ).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_google_cloud will raise a RequestError .

According to the official installation instructions , the recommended way to install this is using Pip : execute pip install google-cloud-speech (replace pip with pip3 if using Python 3).

FLAC (for some systems)

A FLAC encoder is required to encode the audio data to send to the API. If using Windows (x86 or x86-64), OS X (Intel Macs only, OS X 10.6 or higher), or Linux (x86 or x86-64), this is already bundled with this library - you do not need to install anything .

Otherwise, ensure that you have the flac command line tool, which is often available through the system package manager. For example, this would usually be sudo apt-get install flac on Debian-derivatives, or brew install flac on OS X with Homebrew.

Whisper (for Whisper users)

Whisper is required if and only if you want to use the Whisper recognizer ( recognizer_instance.recognize_whisper ).

You can install it with python3 -m pip install SpeechRecognition[whisper-local] .

Whisper API (for Whisper API users)

The library openai is required if and only if you want to use Whisper API ( recognizer_instance.recognize_whisper_api ).

If not installed, everything in the library will still work, except calling recognizer_instance.recognize_whisper_api will raise a RequestError .

You can install it with python3 -m pip install SpeechRecognition[whisper-api] .


The recognizer tries to recognize speech even when I’m not speaking, or after I’m done speaking.

Try increasing the recognizer_instance.energy_threshold property. This is basically how sensitive the recognizer is to when recognition should start. Higher values mean that it will be less sensitive, which is useful if you are in a loud room.

This value depends entirely on your microphone or audio data. There is no one-size-fits-all value, but good values typically range from 50 to 4000.

Also, check on your microphone volume settings. If it is too sensitive, the microphone may be picking up a lot of ambient noise. If it is too insensitive, the microphone may be rejecting speech as just noise.

The recognizer can’t recognize speech right after it starts listening for the first time.

The recognizer_instance.energy_threshold property is probably set to a value that is too high to start off with, and then being adjusted lower automatically by dynamic energy threshold adjustment. Before it is at a good level, the energy threshold is so high that speech is just considered ambient noise.

The solution is to decrease this threshold, or call recognizer_instance.adjust_for_ambient_noise beforehand, which will set the threshold to a good value automatically.
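Both of the fixes above come down to how the energy threshold is set. A minimal sketch, assuming a Recognizer-like object is passed in: the helper name set_energy_threshold is illustrative and not part of the library, while energy_threshold, dynamic_energy_threshold and adjust_for_ambient_noise are the library's own names.

```python
def set_energy_threshold(recognizer, source=None, fixed_value=None, calibration_seconds=1.0):
    """Set the recognizer's energy threshold.

    With fixed_value, pin the threshold and switch off dynamic adjustment.
    Otherwise, measure ambient noise from an already-opened audio source.
    """
    if fixed_value is not None:
        recognizer.energy_threshold = fixed_value
        # Stop the library from drifting away from the pinned value.
        recognizer.dynamic_energy_threshold = False
    elif source is not None:
        # Samples the source for calibration_seconds and sets the threshold.
        recognizer.adjust_for_ambient_noise(source, duration=calibration_seconds)
    else:
        raise ValueError("pass either fixed_value or an opened audio source")
    return recognizer.energy_threshold


# Usage (assumes SpeechRecognition and PyAudio are installed):
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   with sr.Microphone() as source:
#       set_energy_threshold(r, source=source)   # automatic calibration
#   set_energy_threshold(r, fixed_value=4000)    # loud room: less sensitive
```

Pinning a value is the more predictable choice when the environment is known in advance; automatic calibration is usually the safer default otherwise.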

The recognizer doesn’t understand my particular language/dialect.

Try setting the recognition language to your language/dialect. To do this, see the documentation for recognizer_instance.recognize_sphinx , recognizer_instance.recognize_google , recognizer_instance.recognize_wit , recognizer_instance.recognize_bing , recognizer_instance.recognize_api , recognizer_instance.recognize_houndify , and recognizer_instance.recognize_ibm .

For example, if your language/dialect is British English, it is better to use "en-GB" as the language rather than "en-US" .
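As a rough illustration for the Google Web Speech recognizer, the language tag is passed as a keyword argument. The wrapper name transcribe_with_language is hypothetical; recognize_google and its language parameter belong to the library.

```python
def transcribe_with_language(recognizer, audio, language="en-GB"):
    """Transcribe audio with the Google Web Speech API using an explicit language tag."""
    # recognize_google() falls back to "en-US" when no language is given.
    return recognizer.recognize_google(audio, language=language)


# Usage (assumes `r` is an sr.Recognizer and `audio` an AudioData instance):
#   text = transcribe_with_language(r, audio, language="en-GB")
```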

The recognizer hangs on recognizer_instance.listen ; specifically, when it’s calling .

This usually happens when you’re using a Raspberry Pi board, which doesn’t have audio input capabilities by itself. This causes the default microphone used by PyAudio to simply block when we try to read it. If you happen to be using a Raspberry Pi, you’ll need a USB sound card (or USB microphone).

Once you do this, change all instances of Microphone() to Microphone(device_index=MICROPHONE_INDEX) , where MICROPHONE_INDEX is the hardware-specific index of the microphone.

To figure out what the value of MICROPHONE_INDEX should be, run the following code:
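A minimal sketch of such a listing, assuming SpeechRecognition and PyAudio are installed. The two helper names are illustrative; Microphone.list_microphone_names() is the library call doing the actual work.

```python
def describe_microphones(names):
    """Return one line per microphone, pairing each name with its device index."""
    return [
        'Microphone with name "{}" found for `Microphone(device_index={})`'.format(name, index)
        for index, name in enumerate(names)
    ]


def print_available_microphones():
    """Print every input device reported by PyAudio."""
    import speech_recognition as sr  # imported here so the formatter works without it

    for line in describe_microphones(sr.Microphone.list_microphone_names()):
        print(line)
```

Calling print_available_microphones() on the target machine prints one line per device.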

This will print out something like the following:

Now, to use the Snowball microphone, you would change Microphone() to Microphone(device_index=3) .

Calling Microphone() gives the error IOError: No Default Input Device Available .

As the error says, the program doesn’t know which microphone to use.

To proceed, either use Microphone(device_index=MICROPHONE_INDEX, ...) instead of Microphone(...) , or set a default microphone in your OS. You can obtain possible values of MICROPHONE_INDEX using the code in the troubleshooting entry right above this one.

The program doesn’t run when compiled with PyInstaller .

As of PyInstaller version 3.0, SpeechRecognition is supported out of the box. If you’re getting weird issues when compiling your program using PyInstaller, simply update PyInstaller.

You can easily do this by running pip install --upgrade pyinstaller .

On Ubuntu/Debian, I get annoying output in the terminal saying things like “bt_audio_service_open: […] Connection refused” and various others.

The “bt_audio_service_open” error means that you have a Bluetooth audio device, but as a physical device is not currently connected, we can’t actually use it - if you’re not using a Bluetooth microphone, then this can be safely ignored. If you are, and audio isn’t working, then double check to make sure your microphone is actually connected. There does not seem to be a simple way to disable these messages.

For errors of the form “ALSA lib […] Unknown PCM”, see this StackOverflow answer . Basically, to get rid of an error of the form “Unknown PCM cards.pcm.rear”, simply comment out pcm.rear cards.pcm.rear in /usr/share/alsa/alsa.conf , ~/.asoundrc , and /etc/asound.conf .

For “jack server is not running or cannot be started” or “connect(2) call to /dev/shm/jack-1000/default/jack_0 failed (err=No such file or directory)” or “attempt to connect to server failed”, these are caused by ALSA trying to connect to JACK, and can be safely ignored. I’m not aware of any simple way to turn those messages off at this time, besides entirely disabling printing while starting the microphone .

On OS X, I get a ChildProcessError saying that it couldn’t find the system FLAC converter, even though it’s installed.

Installing FLAC for OS X directly from the source code will not work, since it doesn’t correctly add the executables to the search path.

Installing FLAC using Homebrew ensures that the search path is correctly updated. First, ensure you have Homebrew, then run brew install flac to install the necessary files.

To hack on this library, first make sure you have all the requirements listed in the “Requirements” section.

To install/reinstall the library locally, run python -m pip install -e .[dev] in the project root directory .

Before a release, the version number is bumped in README.rst and speech_recognition/ . Version tags are then created using git config gpg.program gpg2 && git config user.signingkey DB45F6C431DE7C2DCD99FF7904882258A4063489 && git tag -s VERSION_GOES_HERE -m "Version VERSION_GOES_HERE" .

Releases are done by running VERSION_GOES_HERE to build the Python source packages, sign them, and upload them to PyPI.

To run all the tests:

To run static analysis:

To ensure RST is well-formed:

Testing is also done automatically by GitHub Actions, upon every push.

FLAC Executables

The included flac-win32 executable is the official FLAC 1.3.2 32-bit Windows binary .

The included flac-linux-x86 and flac-linux-x86_64 executables are built from the FLAC 1.3.2 source code with Manylinux to ensure that it’s compatible with a wide variety of distributions.

The built FLAC executables should be bit-for-bit reproducible. To rebuild them, run the following inside the project directory on a Debian-like system:

The included flac-mac executable is extracted from xACT 2.39 , which is a frontend for FLAC 1.3.2 that conveniently includes binaries for all of its encoders. Specifically, it is a copy of xACT 2.39/ in .

Please report bugs and suggestions at the issue tracker !

How to cite this library (APA style):

Zhang, A. (2017). Speech Recognition (Version 3.8) [Software]. Available from .

How to cite this library (Chicago style):

Zhang, Anthony. 2017. Speech Recognition (version 3.8).

Also check out the Python Baidu Yuyin API , which is based on an older version of this project, and adds support for Baidu Yuyin . Note that Baidu Yuyin is only available inside China.

Copyright 2014-2017 Anthony Zhang (Uberi) . The source code for this library is available online at GitHub .

SpeechRecognition is made available under the 3-clause BSD license. See LICENSE.txt in the project’s root directory for more information.

For convenience, all the official distributions of SpeechRecognition already include a copy of the necessary copyright notices and licenses. In your project, you can simply say that licensing information for SpeechRecognition can be found within the SpeechRecognition README, and make sure SpeechRecognition is visible to users if they wish to see it .

SpeechRecognition distributes source code, binaries, and language files from CMU Sphinx . These files are BSD-licensed and redistributable as long as copyright notices are correctly retained. See speech_recognition/pocketsphinx-data/*/LICENSE*.txt and third-party/LICENSE-Sphinx.txt for license details for individual parts.

SpeechRecognition distributes source code and binaries from PyAudio . These files are MIT-licensed and redistributable as long as copyright notices are correctly retained. See third-party/LICENSE-PyAudio.txt for license details.

SpeechRecognition distributes binaries from FLAC - speech_recognition/flac-win32.exe , speech_recognition/flac-linux-x86 , and speech_recognition/flac-mac . These files are GPLv2-licensed and redistributable, as long as the terms of the GPL are satisfied. The FLAC binaries are an aggregate of separate programs , so these GPL restrictions do not apply to the library or your programs that use the library, only to FLAC itself. See LICENSE-FLAC.txt for license details.

speech to text library

Python Speech Recognition

The Ultimate Guide To Speech Recognition With Python

Table of Contents

  • How Speech Recognition Works – An Overview
  • Picking a Python Speech Recognition Package
  • Installing SpeechRecognition
  • The Recognizer Class
  • Supported File Types
  • Using record() to Capture Data From a File
  • Capturing Segments With offset and duration
  • The Effect of Noise on Speech Recognition
  • Installing PyAudio
  • The Microphone Class
  • Using listen() to Capture Microphone Input
  • Handling Unrecognizable Speech
  • Putting It All Together: A “Guess the Word” Game
  • Recap and Additional Resources
  • Appendix: Recognizing Speech in Languages Other Than English


Have you ever wondered how to add speech recognition to your Python project? If so, then keep reading! It’s easier than you might think.

Far from being a fad, the overwhelming success of speech-enabled products like Amazon Alexa has proven that some degree of speech support will be an essential aspect of household tech for the foreseeable future. If you think about it, the reasons why are pretty obvious. Incorporating speech recognition into your Python application offers a level of interactivity and accessibility that few technologies can match.

The accessibility improvements alone are worth considering. Speech recognition allows the elderly and the physically and visually impaired to interact with state-of-the-art products and services quickly and naturally—no GUI needed!

Best of all, including speech recognition in a Python project is really simple. In this guide, you’ll find out how. You’ll learn:

  • How speech recognition works
  • What packages are available on PyPI
  • How to install and use the SpeechRecognition package, a full-featured and easy-to-use Python speech recognition library

In the end, you’ll apply what you’ve learned to a simple “Guess the Word” game and see how it all comes together.


Before we get to the nitty-gritty of doing speech recognition in Python, let’s take a moment to talk about how speech recognition works. A full discussion would fill a book, so I won’t bore you with all of the technical details here. In fact, this section is not a prerequisite for the rest of the tutorial. If you’d like to get straight to the point, then feel free to skip ahead.

Speech recognition has its roots in research done at Bell Labs in the early 1950s. Early systems were limited to a single speaker and had limited vocabularies of about a dozen words. Modern speech recognition systems have come a long way since their ancient counterparts. They can recognize speech from multiple speakers and have enormous vocabularies in numerous languages.

The first component of speech recognition is, of course, speech. Speech must be converted from physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter. Once digitized, several models can be used to transcribe the audio to text.

Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM). This approach works on the assumption that a speech signal, when viewed on a short enough timescale (say, ten milliseconds), can be reasonably approximated as a stationary process—that is, a process in which statistical properties do not change over time.

In a typical HMM, the speech signal is divided into 10-millisecond fragments. The power spectrum of each fragment, which is essentially a plot of the signal’s power as a function of frequency, is mapped to a vector of real numbers known as cepstral coefficients. The dimension of this vector is usually small—sometimes as low as 10, although more accurate systems may have dimension 32 or more. The final output of the HMM is a sequence of these vectors.
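The framing and power-spectrum steps described above can be sketched in plain Python. This is a toy illustration under stated assumptions: frames do not overlap, a naive DFT stands in for an FFT, and the windowing and cepstral steps are left out. All function names are illustrative.

```python
import cmath
import math


def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut a list of samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    # Drop the trailing partial frame so every frame has the same length.
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


def power_spectrum(frame):
    """Return the power at each frequency bin up to the Nyquist bin (naive O(n^2) DFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


# One second of a 1 kHz tone sampled at 16 kHz.
SAMPLE_RATE = 16000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
frames = split_into_frames(tone, SAMPLE_RATE)

# Locate the strongest frequency bin in the first frame and convert it to Hz.
spectrum = power_spectrum(frames[0])
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / len(frames[0])
```

At 16 kHz a 10-millisecond frame holds 160 samples, so each bin spans 100 Hz and the tone's energy lands in a single bin. A real front end would go on to compress each spectrum into a short vector of cepstral coefficients.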

To decode the speech into text, groups of vectors are matched to one or more phonemes —a fundamental unit of speech. This calculation requires training, since the sound of a phoneme varies from speaker to speaker, and even varies from one utterance to another by the same speaker. A special algorithm is then applied to determine the most likely word (or words) that produce the given sequence of phonemes.

One can imagine that this whole process may be computationally expensive. In many modern speech recognition systems, neural networks are used to simplify the speech signal using techniques for feature transformation and dimensionality reduction before HMM recognition. Voice activity detectors (VADs) are also used to reduce an audio signal to only the portions that are likely to contain speech. This prevents the recognizer from wasting time analyzing unnecessary parts of the signal.
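To make the idea of a voice activity detector concrete, here is the simplest possible variant: keep only the frames whose energy exceeds a threshold. Production VADs use more robust features, so treat this as a sketch; the function names and the toy sample values are assumptions made for illustration.

```python
import math


def rms(frame):
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def speech_frames(frames, energy_threshold):
    """Return the indices of frames whose RMS energy exceeds the threshold."""
    return [i for i, frame in enumerate(frames) if rms(frame) > energy_threshold]


# Three toy frames: near-silence, a loud burst, near-silence again.
quiet = [0.01, -0.01, 0.01, -0.01]
loud = [0.8, -0.7, 0.9, -0.8]
active = speech_frames([quiet, loud, quiet], energy_threshold=0.1)
```

Only the frames listed in active would be passed on to the recognizer.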

Fortunately, as a Python programmer, you don’t have to worry about any of this. A number of speech recognition services are available for use online through an API, and many of these services offer Python SDKs .

A handful of packages for speech recognition exist on PyPI. A few of them include:

  • apiai
  • google-cloud-speech
  • pocketsphinx
  • SpeechRecognition
  • watson-developer-cloud
  • wit

Some of these packages—such as wit and apiai—offer built-in features, like natural language processing for identifying a speaker’s intent, which go beyond basic speech recognition. Others, like google-cloud-speech, focus solely on speech-to-text conversion.

There is one package that stands out in terms of ease-of-use: SpeechRecognition.

Recognizing speech requires audio input, and SpeechRecognition makes retrieving this input really easy. Instead of having to build scripts for accessing microphones and processing audio files from scratch, SpeechRecognition will have you up and running in just a few minutes.

The SpeechRecognition library acts as a wrapper for several popular speech APIs and is thus extremely flexible. One of these, the Google Web Speech API, supports a default API key that is hard-coded into the SpeechRecognition library. That means you can get started without having to sign up for a service.

The flexibility and ease-of-use of the SpeechRecognition package make it an excellent choice for any Python project. However, support for every feature of each API it wraps is not guaranteed. You will need to spend some time researching the available options to find out if SpeechRecognition will work in your particular case.

So, now that you’re convinced you should try out SpeechRecognition, the next step is getting it installed in your environment.

At the time of writing, SpeechRecognition was compatible with Python 2.6, 2.7 and 3.3+, with some additional installation steps for Python 2 . Current releases require Python 3.8+, so for this tutorial I’ll assume you are using Python 3.

You can install SpeechRecognition from a terminal with pip by running pip install SpeechRecognition .

Once installed, you should verify the installation by opening an interpreter session and typing:
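A small check along these lines, written as a function so it also reports a missing installation instead of failing. The helper name installed_version is illustrative; the package exposes its version as __version__.

```python
def installed_version():
    """Return the installed SpeechRecognition version string, or None if it cannot be imported."""
    try:
        import speech_recognition as sr
    except ImportError:
        return None
    return sr.__version__


print(installed_version())
```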

Note: The version number you get might vary. Version 3.8.1 was the latest at the time of writing.

Go ahead and keep this session open. You’ll start to work with it in just a bit.

SpeechRecognition will work out of the box if all you need to do is work with existing audio files. Specific use cases, however, require a few dependencies. Notably, the PyAudio package is needed for capturing microphone input.

You’ll see which dependencies you need as you read further. For now, let’s dive in and explore the basics of the package.

All of the magic in SpeechRecognition happens with the Recognizer class.

The primary purpose of a Recognizer instance is, of course, to recognize speech. Each instance comes with a variety of settings and functionality for recognizing speech from an audio source.

Creating a Recognizer instance is easy. In your current interpreter session, just type:

Each Recognizer instance has seven methods for recognizing speech from an audio source using various APIs. These are:

  • recognize_bing() : Microsoft Bing Speech
  • recognize_google() : Google Web Speech API
  • recognize_google_cloud() : Google Cloud Speech - requires installation of the google-cloud-speech package
  • recognize_houndify() : Houndify by SoundHound
  • recognize_ibm() : IBM Speech to Text
  • recognize_sphinx() : CMU Sphinx - requires installing PocketSphinx
  • recognize_wit() : Wit.ai

Of the seven, only recognize_sphinx() works offline with the CMU Sphinx engine. The other six all require an internet connection.

A full discussion of the features and benefits of each API is beyond the scope of this tutorial. Since SpeechRecognition ships with a default API key for the Google Web Speech API, you can get started with it right away. For this reason, we’ll use the Web Speech API in this guide. The other six APIs all require authentication with either an API key or a username/password combination. For more information, consult the SpeechRecognition docs .

Caution: The default key provided by SpeechRecognition is for testing purposes only, and Google may revoke it at any time . It is not a good idea to use the Google Web Speech API in production. Even with a valid API key, you’ll be limited to only 50 requests per day, and there is no way to raise this quota . Fortunately, SpeechRecognition’s interface is nearly identical for each API, so what you learn today will be easy to translate to a real-world project.

Each recognize_*() method will throw a speech_recognition.RequestError exception if the API is unreachable. For recognize_sphinx() , this could happen as the result of a missing, corrupt or incompatible Sphinx installation. For the other six methods, RequestError may be thrown if quota limits are met, the server is unavailable, or there is no internet connection.
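One way to keep a failed request from crashing a program is to catch the exception and return a message instead. This is a sketch rather than the tutorial's own code: the function name is hypothetical, and it also catches UnknownValueError, which the library raises when the audio cannot be understood.

```python
def transcribe_or_explain(recognizer, audio):
    """Return (text, error); exactly one of the two is None."""
    import speech_recognition as sr  # needed only for the exception classes

    try:
        return recognizer.recognize_google(audio), None
    except sr.RequestError as exc:
        # API unreachable, quota exhausted, or no internet connection.
        return None, "API unavailable: {}".format(exc)
    except sr.UnknownValueError:
        # The API answered but could not make sense of the audio.
        return None, "speech was unintelligible"


# Usage (assumes `r` is an sr.Recognizer and `audio` an AudioData instance):
#   text, error = transcribe_or_explain(r, audio)
```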

Ok, enough chit-chat. Let’s get our hands dirty. Go ahead and try to call recognize_google() in your interpreter session.

What happened?

You probably got something that looks like this:

You might have guessed this would happen. How could something be recognized from nothing?

All seven recognize_*() methods of the Recognizer class require an audio_data argument. In each case, audio_data must be an instance of SpeechRecognition’s AudioData class.

There are two ways to create an AudioData instance: from an audio file or audio recorded by a microphone. Audio files are a little easier to get started with, so let’s take a look at that first.

Working With Audio Files

Before you continue, you’ll need to download an audio file. The one I used to get started, “harvard.wav,” can be found here . Make sure you save it to the same directory in which your Python interpreter session is running.

SpeechRecognition makes working with audio files easy thanks to its handy AudioFile class. This class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the file’s contents.

Currently, SpeechRecognition supports the following file formats:

  • WAV: must be in PCM/LPCM format
  • FLAC: must be native FLAC format; OGG-FLAC is not supported

If you are working on x86-based Linux, macOS or Windows, you should be able to work with FLAC files without a problem. On other platforms, you will need to install a FLAC encoder and ensure you have access to the flac command line tool. You can find more information here if this applies to you.

Type the following into your interpreter session to process the contents of the “harvard.wav” file:
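A minimal sketch of that step, wrapped in a function so the recognizer and file are passed in explicitly. The name read_audio_file is illustrative; AudioFile and record() are the library's own.

```python
def read_audio_file(recognizer, audio_file):
    """Read an entire audio file into an AudioData object."""
    # The with block opens the file; record() with no duration reads to the end.
    with audio_file as source:
        return recognizer.record(source)


# Usage (assumes the package is installed and harvard.wav is in the working directory):
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   audio = read_audio_file(r, sr.AudioFile("harvard.wav"))
```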

The context manager opens the file and reads its contents, storing the data in an AudioFile instance called source. Then the record() method records the data from the entire file into an AudioData instance. You can confirm this by checking the type of audio :

You can now invoke recognize_google() to attempt to recognize any speech in the audio. Depending on your internet connection speed, you may have to wait several seconds before seeing the result.

Congratulations! You’ve just transcribed your first audio file!

If you’re wondering where the phrases in the “harvard.wav” file come from, they are examples of Harvard Sentences. These phrases were published by the IEEE in 1965 for use in speech intelligibility testing of telephone lines. They are still used in VoIP and cellular testing today.

The Harvard Sentences comprise 72 lists of ten phrases. You can find freely available recordings of these phrases on the Open Speech Repository website. Recordings are available in English, Mandarin Chinese, French, and Hindi. They provide an excellent source of free material for testing your code.

What if you only want to capture a portion of the speech in a file? The record() method accepts a duration keyword argument that stops the recording after a specified number of seconds.

For example, the following captures any speech in the first four seconds of the file:

The record() method, when used inside a with block, always moves ahead in the file stream. This means that if you record once for four seconds and then record again for four seconds, the second time returns the four seconds of audio after the first four seconds.
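The stream-advancing behavior can be sketched as follows. The helper name record_consecutive is hypothetical; record() and its duration argument are from the library.

```python
def record_consecutive(recognizer, audio_file, seconds=4, count=2):
    """Record `count` back-to-back chunks of `seconds` seconds from one open file."""
    chunks = []
    with audio_file as source:
        for _ in range(count):
            # Each call resumes where the previous one stopped in the stream.
            chunks.append(recognizer.record(source, duration=seconds))
    return chunks


# Usage (assumes the package is installed and harvard.wav is in the working directory):
#   import speech_recognition as sr
#   audio1, audio2 = record_consecutive(sr.Recognizer(), sr.AudioFile("harvard.wav"))
```

Here audio1 holds the first four seconds of the file and audio2 the four seconds after that.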

Notice that audio2 contains a portion of the third phrase in the file. When specifying a duration, the recording might stop mid-phrase—or even mid-word—which can hurt the accuracy of the transcription. More on this in a bit.

In addition to specifying a recording duration, the record() method can be given a specific starting point using the offset keyword argument. This value represents the number of seconds from the beginning of the file to ignore before starting to record.

To capture only the second phrase in the file, you could start with an offset of four seconds and record for, say, three seconds.
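Combining the two keyword arguments might look like this. The wrapper name record_segment is illustrative, while offset and duration are the arguments described above.

```python
def record_segment(recognizer, audio_file, offset, duration):
    """Record `duration` seconds starting `offset` seconds into the file."""
    with audio_file as source:
        return recognizer.record(source, offset=offset, duration=duration)


# Usage (assumes the package is installed and harvard.wav is in the working directory):
#   import speech_recognition as sr
#   audio = record_segment(sr.Recognizer(), sr.AudioFile("harvard.wav"), offset=4, duration=3)
```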

The offset and duration keyword arguments are useful for segmenting an audio file if you have prior knowledge of the structure of the speech in the file. However, using them hastily can result in poor transcriptions. To see this effect, try the following in your interpreter:

By starting the recording at 4.7 seconds, you miss the “it t” portion at the beginning of the phrase “it takes heat to bring out the odor,” so the API only got “akes heat,” which it matched to “Mesquite.”

Similarly, at the end of the recording, you captured “a co,” which is the beginning of the third phrase “a cold dip restores health and zest.” This was matched to “Aiko” by the API.

There is another reason you may get inaccurate transcriptions. Noise! The above examples worked well because the audio file is reasonably clean. In the real world, unless you have the opportunity to process audio files beforehand, you cannot expect the audio to be noise-free.

Noise is a fact of life. All audio recordings have some degree of noise in them, and unhandled noise can wreck the accuracy of speech recognition apps.

To get a feel for how noise can affect speech recognition, download the “jackhammer.wav” file here . As always, make sure you save this to your interpreter session’s working directory.

This file has the phrase “the stale smell of old beer lingers” spoken with a loud jackhammer in the background.

What happens when you try to transcribe this file?

So how do you deal with this? One thing you can try is using the adjust_for_ambient_noise() method of the Recognizer class.

That got you a little closer to the actual phrase, but it still isn’t perfect. Also, “the” is missing from the beginning of the phrase. Why is that?

The adjust_for_ambient_noise() method reads the first second of the file stream and calibrates the recognizer to the noise level of the audio. Hence, that portion of the stream is consumed before you call record() to capture the data.

You can adjust the time-frame that adjust_for_ambient_noise() uses for analysis with the duration keyword argument. This argument takes a numerical value in seconds and is set to 1 by default. Try lowering this value to 0.5.
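A sketch of calibrating on a shorter stretch of the file before recording the rest. The function name is hypothetical, and the 0.5-second default simply mirrors the value suggested above.

```python
def record_after_calibration(recognizer, audio_file, calibration_seconds=0.5):
    """Calibrate on the start of the file, then record what is left of it."""
    with audio_file as source:
        # Consumes the first calibration_seconds of the stream.
        recognizer.adjust_for_ambient_noise(source, duration=calibration_seconds)
        # record() therefore starts after the calibrated portion.
        return recognizer.record(source)


# Usage (assumes the package is installed and jackhammer.wav is in the working directory):
#   import speech_recognition as sr
#   r = sr.Recognizer()
#   audio = record_after_calibration(r, sr.AudioFile("jackhammer.wav"))
#   r.recognize_google(audio)
```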

Well, that got you “the” at the beginning of the phrase, but now you have some new issues! Sometimes it isn’t possible to remove the effect of the noise—the signal is just too noisy to be dealt with successfully. That’s the case with this file.

If you find yourself running up against these issues frequently, you may have to resort to some pre-processing of the audio. This can be done with audio editing software or a Python package (such as SciPy ) that can apply filters to the files. A detailed discussion of this is beyond the scope of this tutorial—check out Allen Downey’s Think DSP book if you are interested. For now, just be aware that ambient noise in an audio file can cause problems and must be addressed in order to maximize the accuracy of speech recognition.

When working with noisy files, it can be helpful to see the actual API response. Most APIs return a JSON string containing many possible transcriptions. The recognize_google() method will always return the most likely transcription unless you force it to give you the full response.

You can do this by setting the show_all keyword argument of the recognize_google() method to True.

As you can see, recognize_google() returns a dictionary with the key 'alternative' that points to a list of possible transcripts. The structure of this response may vary from API to API and is mainly useful for debugging.
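When debugging, it can help to flatten that dictionary into a list of candidate transcripts. The sketch below assumes each entry under 'alternative' carries a 'transcript' key and, optionally, a 'confidence' key; the sample response is invented for illustration and is not real API output.

```python
def list_transcripts(response):
    """Return (transcript, confidence) pairs from a show_all=True response."""
    if not isinstance(response, dict):
        # Assumption: an empty or non-dict response means nothing was recognized.
        return []
    return [
        (alt["transcript"], alt.get("confidence"))
        for alt in response.get("alternative", [])
    ]


# Hypothetical response, shaped like the dictionary described above.
sample = {
    "alternative": [
        {"transcript": "the snail smell like old beer", "confidence": 0.7},
        {"transcript": "the stale smell of old beer"},
    ],
    "final": True,
}
pairs = list_transcripts(sample)
```

Alternatives without a confidence value come back paired with None.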

By now, you have a pretty good idea of the basics of the SpeechRecognition package. You’ve seen how to create an AudioFile instance from an audio file and use the record() method to capture data from the file. You learned how to record segments of a file using the offset and duration keyword arguments of record() , and you experienced the detrimental effect noise can have on transcription accuracy.

Now for the fun part. Let’s transition from transcribing static audio files to making your project interactive by accepting input from a microphone.

Working With Microphones

To access your microphone with SpeechRecognition, you’ll have to install the PyAudio package . Go ahead and close your current interpreter session, and let’s do that.

The process for installing PyAudio will vary depending on your operating system.

Debian Linux

If you’re on Debian-based Linux (like Ubuntu) you can install PyAudio with apt by running sudo apt-get install python3-pyaudio .

Once installed, you may still need to run pip install pyaudio , especially if you are working in a virtual environment .

For macOS, first you will need to install PortAudio with Homebrew ( brew install portaudio ), and then install PyAudio with pip ( pip install pyaudio ).

On Windows, you can install PyAudio with pip by running pip install pyaudio .

Testing the Installation

Once you’ve got PyAudio installed, you can test the installation from the console by running python -m speech_recognition .

Make sure your default microphone is on and unmuted. If the installation worked, you should see something like this:

A moment of silence, please...
Set minimum energy threshold to 600.4452854381937
Say something!

Go ahead and play around with it a little bit by speaking into your microphone and seeing how well SpeechRecognition transcribes your speech.

Note: If you are on Ubuntu and get some funky output like ‘ALSA lib … Unknown PCM’, refer to this page for tips on suppressing these messages. This output comes from the ALSA package installed with Ubuntu, not from SpeechRecognition or PyAudio. These messages may indicate a problem with your ALSA configuration, but in my experience, they do not impact the functionality of your code. They are mostly a nuisance.

Open up another interpreter session and create an instance of the Recognizer class.

Now, instead of using an audio file as the source, you will use the default system microphone. You can access this by creating an instance of the Microphone class.

If your system has no default microphone (such as on a Raspberry Pi ), or you want to use a microphone other than the default, you will need to specify which one to use by supplying a device index. You can get a list of microphone names by calling the list_microphone_names() static method of the Microphone class.

Note that your output may differ from the above example.

The device index of the microphone is the index of its name in the list returned by list_microphone_names(). For example, given the above output, if you want to use the microphone called “front,” which has index 3 in the list, you would create a microphone instance like this:
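The lookup described above can be sketched as a small helper. The list of names below is a made-up sample rather than real output, and find_device_index() is a hypothetical helper, not part of SpeechRecognition. With the package installed, the result would be passed to the Microphone constructor through its device_index argument.

```python
def find_device_index(names, wanted):
    """Return the position of `wanted` in `names`, or None if it is absent."""
    for index, name in enumerate(names):
        if name == wanted:
            return index
    return None


# Made-up sample of what Microphone.list_microphone_names() might return.
sample_names = [
    "HDA Intel PCH: ALC272 Analog (hw:0,0)",
    "HDA Intel PCH: HDMI 0 (hw:0,3)",
    "sysdefault",
    "front",
    "default",
]

index = find_device_index(sample_names, "front")
print(index)  # 3 for this sample list

# With SpeechRecognition installed, the index selects the device:
#     import speech_recognition as sr
#     mic = sr.Microphone(device_index=index)
```

Returning None for a missing name, instead of raising, lets the caller fall back to the default microphone.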

For most projects, though, you’ll probably want to use the default system microphone.

Now that you’ve got a Microphone instance ready to go, it’s time to capture some input.

Just like the AudioFile class, Microphone is a context manager. You can capture input from the microphone using the listen() method of the Recognizer class inside of the with block. This method takes an audio source as its first argument and records input from the source until silence is detected.

Once you execute the with block, try speaking “hello” into your microphone. Wait a moment for the interpreter prompt to display again. Once the “>>>” prompt returns, you’re ready to recognize the speech.

If the prompt never returns, your microphone is most likely picking up too much ambient noise. You can interrupt the process with Ctrl + C to get your prompt back.

To handle ambient noise, you’ll need to use the adjust_for_ambient_noise() method of the Recognizer class, just like you did when trying to make sense of the noisy audio file. Since input from a microphone is far less predictable than input from an audio file, it is a good idea to do this anytime you listen for microphone input.

After running the above code, wait a second for adjust_for_ambient_noise() to do its thing, then try speaking “hello” into the microphone. Again, you will have to wait a moment for the interpreter prompt to return before trying to recognize the speech.

Recall that adjust_for_ambient_noise() analyzes the audio source for one second. If this seems too long to you, feel free to adjust this with the duration keyword argument.

The SpeechRecognition documentation recommends using a duration no less than 0.5 seconds. In some cases, you may find that durations longer than the default of one second generate better results. The minimum value you need depends on the microphone’s ambient environment. Unfortunately, this information is typically unknown during development. In my experience, the default duration of one second is adequate for most applications.
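As a rough sketch of how calibration and listening fit together, the hypothetical helper below takes a Recognizer and a Microphone and forwards a calibration time to adjust_for_ambient_noise(). The name capture_phrase() and the decision to reject values under 0.5 seconds are assumptions made for this illustration, not behavior of the library itself.

```python
def capture_phrase(recognizer, microphone, calibration_seconds=1.0):
    """Calibrate for ambient noise, then record one phrase from the microphone."""
    # Assumption of this sketch: enforce the recommended 0.5 second minimum.
    if calibration_seconds < 0.5:
        raise ValueError("calibration_seconds should be at least 0.5")

    with microphone as source:
        # Sample the background noise for the requested number of seconds.
        recognizer.adjust_for_ambient_noise(source, duration=calibration_seconds)
        # Record until the recognizer detects silence.
        return recognizer.listen(source)


# With SpeechRecognition and PyAudio installed:
#     import speech_recognition as sr
#     audio = capture_phrase(sr.Recognizer(), sr.Microphone(), 0.5)
```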

Try typing the previous code example into the interpreter and making some unintelligible noises into the microphone. You should get something like this in response:

Audio that cannot be matched to text by the API raises an UnknownValueError exception. You should always wrap calls to the API with try and except blocks to handle this exception.

Note : You may have to try harder than you expect to get the exception thrown. The API works very hard to transcribe any vocal sounds. Even short grunts were transcribed as words like “how” for me. Coughing, hand claps, and tongue clicks would consistently raise the exception.

Now that you’ve seen the basics of recognizing speech with the SpeechRecognition package, let’s put your newfound knowledge to use and write a small game that picks a random word from a list and gives the user three attempts to guess the word.

Here is the full script:

Let’s break that down a little bit.

The recognize_speech_from_mic() function takes a Recognizer and Microphone instance as arguments and returns a dictionary with three keys. The first key, "success" , is a boolean that indicates whether or not the API request was successful. The second key, "error" , is either None or an error message indicating that the API is unavailable or the speech was unintelligible. Finally, the "transcription" key contains the transcription of the audio recorded by the microphone.

The function first checks that the recognizer and microphone arguments are of the correct type, and raises a TypeError if either is invalid:

The listen() method is then used to record microphone input:

The adjust_for_ambient_noise() method is used to calibrate the recognizer for changing noise conditions each time the recognize_speech_from_mic() function is called.

Next, recognize_google() is called to transcribe any speech in the recording. A try...except block is used to catch the RequestError and UnknownValueError exceptions and handle them accordingly. The success of the API request, any error messages, and the transcribed speech are stored in the success , error and transcription keys of the response dictionary, which is returned by the recognize_speech_from_mic() function.
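The description above can be sketched roughly as follows. This is a reconstruction from the prose, not necessarily the script as published: the error message strings are assumptions, and the import sits inside the function only so the sketch can be loaded on a machine without the package.

```python
def recognize_speech_from_mic(recognizer, microphone):
    """Record one phrase from `microphone` and try to transcribe it.

    Returns a dict with the keys "success", "error" and "transcription".
    """
    # Imported here so the sketch loads even where the package is missing.
    import speech_recognition as sr

    # Reject arguments of the wrong type early.
    if not isinstance(recognizer, sr.Recognizer):
        raise TypeError("`recognizer` must be a `Recognizer` instance")
    if not isinstance(microphone, sr.Microphone):
        raise TypeError("`microphone` must be a `Microphone` instance")

    # Recalibrate for ambient noise on every call, then record.
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)

    response = {"success": True, "error": None, "transcription": None}
    try:
        response["transcription"] = recognizer.recognize_google(audio)
    except sr.RequestError:
        # The API could not be reached or did not respond.
        response["success"] = False
        response["error"] = "API unavailable"
    except sr.UnknownValueError:
        # The request worked, but the speech was unintelligible.
        response["error"] = "Unable to recognize speech"
    return response
```

Note that unintelligible speech leaves "success" set to True: the request itself worked, which is what lets a caller tell the two failure modes apart.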

You can test the recognize_speech_from_mic() function by saving the above script to a file called “” and running the following in an interpreter session:

The game itself is pretty simple. First, a list of words, a maximum number of allowed guesses and a prompt limit are declared:

Next, a Recognizer and Microphone instance is created and a random word is chosen from WORDS :

After printing some instructions and waiting three seconds, a for loop is used to manage each user attempt at guessing the chosen word. The first thing inside the for loop is another for loop that prompts the user at most PROMPT_LIMIT times for a guess, attempting to recognize the input each time with the recognize_speech_from_mic() function and storing the returned dictionary in the local variable guess.

If the "transcription" key of guess is not None , then the user’s speech was transcribed and the inner loop is terminated with break . If the speech was not transcribed and the "success" key is set to False , then an API error occurred and the loop is again terminated with break . Otherwise, the API request was successful but the speech was unrecognizable. The user is warned and the for loop repeats, giving the user another chance at the current attempt.

Once the inner for loop terminates, the guess dictionary is checked for errors. If any occurred, the error message is displayed and the outer for loop is terminated with break , which will end the program execution.

If there weren’t any errors, the transcription is compared to the randomly selected word. The lower() method for string objects is used to ensure better matching of the guess to the chosen word. The API may return speech matched to the word “apple” as “Apple” or “apple,” and either response should count as a correct answer.

If the guess was correct, the user wins and the game is terminated. If the user was incorrect and has any remaining attempts, the outer for loop repeats and a new guess is retrieved. Otherwise, the user loses the game.
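The control flow of the two nested loops can be sketched without a microphone by passing in a function that supplies each guess. Everything named here is an assumption made for illustration: play(), the get_guess parameter, the printed messages, and the PROMPT_LIMIT value of 5. Only the shape of the guess dictionary comes from the description of recognize_speech_from_mic().

```python
NUM_GUESSES = 3   # attempts at the word, as in the description above
PROMPT_LIMIT = 5  # assumed cap on re-prompts per attempt


def play(word, get_guess, num_guesses=NUM_GUESSES, prompt_limit=PROMPT_LIMIT):
    """Run the guessing loop and return "win", "lose", or "error"."""
    for attempt in range(num_guesses):
        # Re-prompt until speech is transcribed or the API request fails.
        for _ in range(prompt_limit):
            guess = get_guess()
            if guess["transcription"] is not None:
                break
            if not guess["success"]:
                break
            print("I didn't catch that. What did you say?")

        # Any remaining error ends the game.
        if guess["error"]:
            print("ERROR: {}".format(guess["error"]))
            return "error"

        print("You said: {}".format(guess["transcription"]))
        # Compare case-insensitively so "Apple" matches "apple".
        if guess["transcription"].lower() == word.lower():
            return "win"
        if attempt < num_guesses - 1:
            print("Incorrect. Try again.")
    return "lose"


# With a real recognizer and microphone, the guess source would be:
#     play(word, lambda: recognize_speech_from_mic(recognizer, microphone))
```

Injecting get_guess keeps the game logic testable with canned responses, which is handy because live recognition results vary from run to run.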

When run, the output will look something like this:

In this tutorial, you’ve seen how to install the SpeechRecognition package and use its Recognizer class to easily recognize speech from both a file—using record() —and microphone input—using listen(). You also saw how to process segments of an audio file using the offset and duration keyword arguments of the record() method.

You’ve seen the effect noise can have on the accuracy of transcriptions, and have learned how to adjust a Recognizer instance’s sensitivity to ambient noise with adjust_for_ambient_noise(). You have also learned which exceptions a Recognizer instance may throw (RequestError for bad API requests and UnknownValueError for unintelligible speech) and how to handle these with try...except blocks.

Speech recognition is a deep subject, and what you have learned here barely scratches the surface. If you’re interested in learning more, here are some additional resources.

For more information on the SpeechRecognition package:

  • Library reference
  • Troubleshooting page

A few interesting internet resources:

  • Behind the Mic: The Science of Talking with Computers . A short film about speech processing by Google.
  • A Historical Perspective of Speech Recognition by Huang, Baker and Reddy. Communications of the ACM (2014). This article provides an in-depth and scholarly look at the evolution of speech recognition technology.
  • The Past, Present and Future of Speech Recognition Technology by Clark Boyd at The Startup. This blog post presents an overview of speech recognition technology, with some thoughts about the future.

Some good books about speech recognition:

  • The Voice in the Machine: Building Computers That Understand Speech , Pieraccini, MIT Press (2012). An accessible general-audience book covering the history of, as well as modern advances in, speech processing.
  • Fundamentals of Speech Recognition , Rabiner and Juang, Prentice Hall (1993). Rabiner, a researcher at Bell Labs, was instrumental in designing some of the first commercially viable speech recognizers. This book is now over 20 years old, but a lot of the fundamentals remain the same.
  • Automatic Speech Recognition: A Deep Learning Approach , Yu and Deng, Springer (2014). Yu and Deng are researchers at Microsoft and both very active in the field of speech processing. This book covers a lot of modern approaches and cutting-edge research but is not for the mathematically faint-of-heart.

Throughout this tutorial, we’ve been recognizing speech in English, which is the default language for each recognize_*() method of the SpeechRecognition package. However, it is absolutely possible to recognize speech in other languages, and is quite simple to accomplish.

To recognize speech in a different language, set the language keyword argument of the recognize_*() method to a string corresponding to the desired language. Most of the methods accept a BCP-47 language tag, such as 'en-US' for American English, or 'fr-FR' for French. For example, the following recognizes French speech in an audio file:
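A minimal sketch of that idea, assuming the SpeechRecognition package is installed and the file is in a format AudioFile can read. The helper name transcribe_file() and the example path are hypothetical, and the import sits inside the function only so the sketch loads without the package.

```python
def transcribe_file(path, language="fr-FR"):
    """Transcribe the audio file at `path`, treating the speech as `language`."""
    # Imported here so the sketch loads even where the package is missing.
    import speech_recognition as sr

    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)  # read the whole file

    # The language tag is passed through the `language` keyword argument.
    return recognizer.recognize_google(audio, language=language)


# Hypothetical usage:
#     print(transcribe_file("path/to/audiofile.wav"))
```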

Only the following methods accept a language keyword argument:

  • recognize_bing()
  • recognize_google()
  • recognize_google_cloud()
  • recognize_ibm()
  • recognize_sphinx()

To find out which language tags are supported by the API you are using, you’ll have to consult the corresponding documentation . A list of tags accepted by recognize_google() can be found in this Stack Overflow answer .


About David Amos

David Amos

David is a writer, programmer, and mathematician passionate about exploring mathematics through code.

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

Aldren Santos


Using the Speech-to-Text API with Node.js

1. Overview

Google Cloud Speech-to-Text API enables developers to convert audio to text in 120 languages and variants, by applying powerful neural network models in an easy to use API.

In this codelab, you will focus on using the Speech-to-Text API with Node.js. You will learn how to send an audio file in English and other languages to the Cloud Speech-to-Text API for transcription.

What you'll learn

  • How to enable the Speech-to-Text API
  • How to Authenticate API requests
  • How to install the Google Cloud client library for Node.js
  • How to transcribe audio files in English
  • How to transcribe audio files with word timestamps
  • How to transcribe audio files in different languages

What you'll need

  • A Google Cloud Platform Project
  • A browser, such as Chrome or Firefox
  • Familiarity with JavaScript/Node.js

2. Setup and requirements

Self-paced environment setup

  • Sign in to Cloud Console and create a new project or reuse an existing one. (If you don't already have a Gmail or G Suite account, you must create one .)


Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID .

  • Next, you'll need to enable billing in Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost much, if anything at all. Be sure to follow any instructions in the "Cleaning up" section, which advises you how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell , a command line environment running in the Cloud.

Activate Cloud Shell


If you've never started Cloud Shell before, you'll be presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again).


It should only take a few moments to provision and connect to Cloud Shell.


This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

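  • Run gcloud auth list in Cloud Shell to confirm that you are authenticated.

  • Run gcloud config list project to confirm that gcloud knows about your project. If the project is not set, you can set it with gcloud config set project <PROJECT_ID>.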

3. Enable the Speech-to-Text API

Before you can begin using the Speech-to-Text API, you must enable it. You can do so by running gcloud services enable speech.googleapis.com in Cloud Shell.

4. Authenticate API requests

In order to make requests to the Speech-to-Text API, you need to use a Service Account . A Service Account belongs to your project and it is used by the Google Client Node.js library to make Speech-to-Text API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create credentials you will need to authenticate as the service account.

First, set an environment variable with your PROJECT_ID, which you will use throughout this codelab. If you are using Cloud Shell, this will be set for you.

Next, create a new service account to access the Speech-to-Text API by using:

Next, create credentials that your Node.js code will use to log in as your new service account. Create these credentials and save them as a JSON file ~/key.json by using the following command:

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Speech-to-Text API Node.js library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created, by using:

You can read more about authenticating the Speech-to-Text API .

5. Install the Google Cloud Speech-to-Text API client library for Node.js

First, create a project that you will use to run this Speech-to-Text API lab by initializing a new Node.js package in a folder of your choice:

NPM asks several questions about the project configuration, such as name and version. For each question, press ENTER to accept the default values. The default entry point is a file named index.js .

Next, install the Google Cloud Speech library to the project by running npm install --save @google-cloud/speech.

For more instructions on how to set up a Node.js development for Google Cloud please see the Setup Guide .

Now, you're ready to use the Speech-to-Text API!

6. Transcribe Audio Files

In this section, you will transcribe a pre-recorded audio file in English. The audio file is available on Google Cloud Storage.

Navigate to the index.js file inside your project folder and replace the code with the following:

Take a minute or two to study the code and see how it is used to transcribe an audio file.

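The Encoding parameter tells the API which type of audio encoding you're using for the audio file. It must match the file's actual format: for example, FLAC for .flac files, or LINEAR16 for uncompressed 16-bit PCM audio, which is what most headerless .raw files contain (see the doc for encoding type for more details).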

In the RecognitionAudio object, you can pass the API either the uri of our audio file in Cloud Storage or the local file path for the audio file. Here, we're using a Cloud Storage uri.

Run the program:

You should see the following output:

7. Transcribe with word timestamps

Speech-to-Text can detect time offset (timestamp) for the transcribed audio. Time offsets show the beginning and end of each spoken word in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.

Take a minute or two to study the code and see how it is used to transcribe an audio file with word timestamps. The EnableWordTimeOffsets parameter tells the API to enable time offsets (see the doc for more details).

Run your program again:

8. Transcribe different languages

Speech-to-Text API supports transcription in over 100 languages! You can find a list of supported languages here .

In this section, you will transcribe a pre-recorded audio file in French. The audio file is available on Google Cloud Storage.

Run your program again and you should see the following output:

This is a sentence from a popular French children's tale .

For the full list of supported languages and language codes, see the documentation here .

9. Congratulations!

You learned how to use the Speech-to-Text API using Node.js to perform different kinds of transcription on audio files!

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  • Go to the Cloud Platform Console .
  • Select the project you want to shut down, then click ‘Delete' at the top: this schedules the project for deletion.

This work is licensed under a Creative Commons Attribution 2.0 Generic License.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.


Using the Web Speech API

Speech recognition

Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app.) When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further actions can be initiated as a result.

The Web Speech API has a main controller interface for this — SpeechRecognition — plus a number of closely-related interfaces for representing grammar, results, etc. Generally, the default speech recognition system available on the device will be used for the speech recognition — most modern OSes have a speech recognition system for issuing voice commands. Think about Dictation on macOS, Siri on iOS, Cortana on Windows 10, Android Speech, etc.

Note: On some browsers, such as Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.

To show simple usage of Web speech recognition, we've written a demo called Speech color changer . When the screen is tapped/clicked, you can say an HTML color keyword, and the app's background color will change to that color.

The UI of an app titled Speech Color changer. It invites the user to tap the screen and say a color, and then it turns the background of the app that color. In this case it has turned the background red.

To run the demo, navigate to the live demo URL in a supporting mobile browser (such as Chrome).


The HTML and CSS for the app is really trivial. We have a title, instructions paragraph, and a div into which we output diagnostic messages.

The CSS provides a very simple responsive styling so that it looks OK across devices.

Let's look at the JavaScript in a bit more detail.

Prefixed properties

Browsers currently support speech recognition with prefixed properties. Therefore at the start of our code we include these lines to allow for both prefixed properties and unprefixed versions that may be supported in future:

The grammar

The next part of our code defines the grammar we want our app to recognize. The following variable is defined to hold our grammar:

The grammar format used is JSpeech Grammar Format ( JSGF ) — you can find a lot more about it at the previous link to its spec. However, for now let's just run through it quickly:

  • The lines are separated by semicolons, just like in JavaScript.
  • The first line — #JSGF V1.0; — states the format and version used. This always needs to be included first.
  • The second line indicates a type of term that we want to recognize. public declares that it is a public rule, the string in angle brackets defines the recognized name for this term ( color ), and the list of items that follow the equals sign are the alternative values that will be recognized and accepted as appropriate values for the term. Note how each is separated by a pipe character.
  • You can have as many terms defined as you want on separate lines following the above structure, and include fairly complex grammar definitions. For this basic demo, we are just keeping things simple.
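Because a JSGF grammar is plain text, its shape can be sketched independently of the browser code. The Python helper below is hypothetical and only assembles a one-rule grammar string in the format described above. The "grammar colors;" declaration and the three sample colors are assumptions added for illustration.

```python
def build_jsgf(rule_name, alternatives, grammar_name="colors"):
    """Build a one-rule JSGF grammar listing `alternatives` for `rule_name`."""
    if not alternatives:
        raise ValueError("a rule needs at least one alternative")
    # Alternatives are separated by a pipe character.
    choices = " | ".join(alternatives)
    # Header first, then the grammar name, then the public rule.
    return f"#JSGF V1.0; grammar {grammar_name}; public <{rule_name}> = {choices} ;"


print(build_jsgf("color", ["aqua", "azure", "beige"]))
# #JSGF V1.0; grammar colors; public <color> = aqua | azure | beige ;
```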

Plugging the grammar into our speech recognition

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.

We add our grammar to the list using the SpeechGrammarList.addFromString() method. This accepts as parameters the string we want to add, plus optionally a weight value that specifies the importance of this grammar in relation to other grammars available in the list (can be from 0 to 1 inclusive). The added grammar is available in the list as a SpeechGrammar object instance.

We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition.grammars property. We also set a few other properties of the recognition instance before we move on:

  • SpeechRecognition.continuous : Controls whether continuous results are captured ( true ), or just a single result each time recognition is started ( false ).
  • SpeechRecognition.lang : Sets the language of the recognition. Setting this is good practice, and therefore recommended.
  • SpeechRecognition.interimResults : Defines whether the speech recognition system should return interim results, or just final results. Final results are good enough for this simple demo.
  • SpeechRecognition.maxAlternatives : Sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose the correct one from. But it is not needed for this simple demo, so we are just specifying one (which is actually the default anyway).

Starting the speech recognition

After grabbing references to the output <div> and the HTML element (so we can output diagnostic messages and update the app background color later on), we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start() . The forEach() method is used to output colored indicators showing what colors to try saying.

Receiving and handling results

Once the speech recognition is started, there are many event handlers that can be used to retrieve results, and other pieces of surrounding information (see the SpeechRecognition events .) The most common one you'll probably use is the result event, which is fired once a successful result is received:

The second line here is a bit complex-looking, so let's explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0. We then return its transcript property to get a string containing the individual recognized result as a string, set the background color to that color, and report the color recognized as a diagnostic message in the UI.

We also use the speechend event to stop the speech recognition service from running (using SpeechRecognition.stop() ) once a single word has been recognized and it has finished being spoken:

Handling errors and unrecognized speech

The last two handlers are there to handle cases where speech was recognized that wasn't in the defined grammar, or an error occurred. The nomatch event is meant to handle the first case, although note that at the moment it doesn't seem to fire correctly; it just returns whatever was recognized anyway:

The error event handles cases where an actual error prevents recognition from completing successfully. The SpeechRecognitionErrorEvent.error property contains the actual error returned:

Speech synthesis

Speech synthesis (aka text-to-speech, or TTS) involves synthesizing text contained within an app into speech, and playing it out of a device's speaker or audio output connection.

The Web Speech API has a main controller interface for this — SpeechSynthesis — plus a number of closely-related interfaces for representing text to be synthesized (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.

To show simple usage of Web speech synthesis, we've provided a demo called Speak easy synthesis . This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter / Return to hear it spoken.

UI of an app called speak easy synthesis. It has an input field in which to input text to be synthesized, slider controls to change the rate and pitch of the speech, and a drop down menu to choose between different voices.

To run the demo, navigate to the live demo URL in a supporting mobile browser.

The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some simple controls. The <select> element is initially empty, but is populated with <option> s via JavaScript (see later on.)

Let's investigate the JavaScript that powers this app.

Setting variables

First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis. This is the API's entry point: it returns an instance of SpeechSynthesis, the controller interface for web speech synthesis.

Populating the select element

To populate the <select> element with the different voice options the device has available, we've written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices() , which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list — for each voice we create an <option> element, set its text content to display the name of the voice (grabbed from ), the language of the voice (grabbed from SpeechSynthesisVoice.lang ), and -- DEFAULT if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true .)

We also create data- attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.

Older browsers don't support the voiceschanged event, and just return a list of voices when SpeechSynthesis.getVoices() is called. On others, such as Chrome, you have to wait for the event to fire before populating the list. To allow for both cases, we run the function as shown below:

Speaking the entered text

Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter / Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.

Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.

Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak() , passing it the SpeechSynthesisUtterance instance as a parameter.

In the final part of the handler, we include a pause event to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this returns a message reporting the character number and name that the speech was paused at.

Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.

Updating the displayed pitch and rate values

The last part of the code updates the pitch / rate values displayed in the UI, each time the slider positions are moved.

The Best Speech-to-Text APIs in 2024


If you've been shopping for a speech-to-text (STT) solution for your business, you're not alone. In our recent State of Voice Technology report, 82% of respondents confirmed their current utilization of voice-enabled technology, a 6% increase from last year.

The vast number of options for speech transcription can be overwhelming, especially if you're unfamiliar with the space. From Big Tech to open source options, there are many choices, each with different price points and feature sets. While this diversity is great, it can also be confusing when you're trying to compare options and pick the right solution.

This article breaks down the leading speech-to-text APIs available today, outlining their pros and cons and providing a ranking that reflects the current STT landscape. Before getting to the ranking, we explain what an STT API is, the core features you can expect one to have, and some key use cases for speech-to-text APIs.

What is a speech-to-text API?

At its core, a speech-to-text (also known as automatic speech recognition, or ASR) application programming interface (API) is an interface for calling a service that transcribes audio containing speech into written text. The STT service takes the provided audio data, processes it using either machine learning or legacy techniques (e.g. Hidden Markov Models), and then returns a transcript of what it has inferred was said.

What are the most important things to consider when choosing a speech-to-text API?

What makes the best speech-to-text API? Is the fastest speech-to-text API the best? Is the most accurate speech-to-text API the best? Is the most affordable speech-to-text API the best? The answers to these questions depend on your specific project and are thus certainly different for everybody. There are a number of aspects to carefully consider in the evaluation and selection of a transcription service and the order of importance is dependent on your target use case and end user needs.

Accuracy - A speech-to-text API should produce highly accurate transcripts, even under varying speaking conditions (e.g. background noise, dialects, accents, etc.). “Garbage in, garbage out,” as the saying goes. The vast majority of voice applications require highly accurate results from their transcription service to deliver value and a good customer experience to their users.

Speed - Many applications require quick turnaround times and high throughput. A responsive STT solution will deliver value with low latency and fast processing speeds.

Cost - Speech-to-text is a foundational capability in the application stack, and cost efficiency is essential. Solutions that fail to deliver adequate ROI and a good price-to-performance ratio will be a barrier to the overall utility of the end user application.

Modality - Important input modes include support for pre-recorded or real-time audio:

Batch or pre-recorded transcription capabilities - Batch transcription won't be needed by everyone, but for many use cases, you'll want a service that accepts batches of files for transcription, rather than having to submit them one by one on your end.

Real-time streaming - Again, not everyone will need real-time streaming. However, if you want to use STT to create, for example, truly conversational AI that can respond to customer inquiries in real time, you'll need an STT API that returns its results as quickly as possible.

Features & Capabilities - Developers and companies seeking speech processing solutions require more than a bare transcript. They also need rich features that help them build scalable products with their voice data, including sophisticated formatting and speech understanding capabilities to improve readability and utility by downstream tasks.

Scalability and Reliability - A good speech-to-text solution will accommodate varying throughput needs, adequately handling a range of audio data volumes from small startups to large enterprises. Similarly, reliable operation is a hard requirement for many applications, where frequent or lengthy service interruptions could hurt revenue and damage brand reputation.

Customization, Flexibility, and Adaptability - One size fits few. The ability to customize STT models for specific vocabulary or jargon, as well as flexible deployment options to meet project-specific privacy, security, and compliance needs, are important, often overlooked considerations in the selection process.

Ease of Adoption and Use - A speech-to-text API only has value if it can be integrated into an application. Flexible pricing and packaging options are critical, including usage-based pricing with volume discounts. Some vendors do a better job than others of providing a good developer experience by offering frictionless self-onboarding and even free tiers with an adequate volume of credits, so developers can test the API and prototype their applications before choosing a subscription option.

Support and Subject Matter Expertise - Domain experts in AI, machine learning, and spoken language understanding are an invaluable resource when issues arise. Many solution providers outsource their model development or offer STT as a value-add to their core offering. Vendors for whom speech AI is their core focus are better equipped to diagnose and resolve challenging issues in a timely fashion. They are also more inclined to make continuous improvements to their STT service and avoid issues with stagnating performance over time.

What are the most important features of a speech-to-text API?

In this section, we'll survey some of the most common features that STT APIs offer. The key features that are offered by each API differ, and your use cases will dictate your priorities and needs in terms of which features to focus on.

Multi-language support - If you're planning to handle multiple languages or dialects, this should be a key concern. And even if you aren't planning on multilingual support now, if there's any chance that you would in the future, you're best off starting with a service that offers many languages and is always expanding to more.

Formatting - Formatting options such as punctuation, numeral formatting, paragraphing, speaker labeling (or speaker diarization), word-level timestamping, and profanity filtering all improve readability and utility for data science.

Automatic punctuation & capitalization - Depending on what you're planning to do with your transcripts, you might not care if they're formatted nicely. But if you're planning on surfacing them publicly, having this included in what the STT API provides can save you time.

Profanity filtering or redaction - If you're using STT as part of an effort for community moderation, you're going to want a tool that can automatically detect profanity in its output and censor it or flag it for review.

Understanding - A primary motivation for employing a speech-to-text API is to gain understanding of who said what and why they said it. Many applications employ natural language and spoken language understanding tasks to accurately identify, extract, and summarize conversational audio to deliver amazing customer experiences. 

Topic detection - Automatically identify the main topics and themes in your audio to improve categorization, organization, and understanding of large volumes of spoken language content.

Intent detection - Similarly, intent detection is used to determine the purpose or intention behind the interactions between speakers, enabling more efficient handling by downstream agents or tasks in a system in order to determine the next best action to take or response to provide.

Sentiment analysis - Understand the interactions, attitudes, views, and emotions in conversational audio by quantitatively scoring the overall and component sections as being positive, neutral, or negative. 

Summarization - Deliver a concise summary of the content in your audio, retaining the most relevant and important information and overall meaning, for responsive understanding, analysis, and efficient archival.

Keywords (a.k.a. Keyword Boosting) - Being able to include an extended, custom vocabulary is helpful if your audio has lots of specialized terminology, uncommon proper nouns, abbreviations, and acronyms that an off-the-shelf model wouldn't have been exposed to. This allows the model to incorporate these custom terms as possible predictions.

Custom models - While keywords provide inclusion of a small set of specialized, out-of-vocabulary words, a custom model trained on representative data will always give the best performance. Vendors that allow you to tailor a model for your specific needs, fine-tuned on your own data, give you the ability to boost accuracy beyond what an out-of-the-box solution alone provides.

Accepts multiple audio formats - Another concern that won't be present for everyone is whether or not the STT API can process audio in different formats. If you have audio coming from multiple sources that aren't encoded in the same format, having an STT API that removes the need for converting between audio formats can save you time and money.

What are the top speech-to-text use cases?

As noted at the outset, voice technology that's built on the back of STT APIs is a critical part of the future of business. So what are some of the most common use cases for speech-to-text APIs? Let's take a look.

Smart assistants - Smart assistants like Siri and Alexa are perhaps the most frequently encountered use case for speech-to-text, taking spoken commands, converting them to text, and then acting on them.

Conversational AI - Voicebots let humans speak and, in real time, get answers from an AI. Converting speech to text is the first step in this process, and it has to happen quickly for the interaction to truly feel like a conversation.

Sales and support enablement - Sales and support digital assistants provide tips, hints, and solutions to agents by transcribing, analyzing, and pulling up information in real time. STT can also be used to gauge sales pitches or sales calls with a customer.

Contact centers - Contact centers can use STT to create transcripts of their calls, providing more ways to evaluate their agents, understand what customers are asking about, and provide insight into different aspects of their business that are typically hard to assess.

Speech analytics - Broadly speaking, speech analytics is any attempt to process spoken audio to extract insights. This might be done in a call center, as above, but it could also be done in other environments, like meetings or even speeches and talks.

Accessibility - Providing transcriptions of spoken speech can be a huge win for accessibility, whether it's providing captions for classroom lectures or creating badges that transcribe speech on the fly.

How do you evaluate performance of a speech-to-text API?

All speech-to-text solutions aim to produce highly accurate transcripts in a user-friendly format. We advise performing side-by-side accuracy testing using files that resemble the audio you will be processing in production to determine the best speech solution for your needs. The best evaluation regimes employ a holistic approach that includes a mix of quantitative benchmarking and qualitative human preference evaluation across the most important dimensions of quality and performance, including accuracy and speed.

The generally accepted industry metric for measuring transcription quality is Word Error Rate (WER). Consider WER in relation to the following equation:

WER + Accuracy Rate = 100%

Thus, an 80% accurate transcript corresponds to a WER of 20%.

The industry standard focuses on error rate rather than accuracy because the error rate can be subdivided into distinct error categories: insertions, deletions, and substitutions. These categories provide valuable insight into the nature of the errors present in a transcript. Consequently, WER can also be defined using the formula:

WER = (# of words inserted + # of words deleted + # of words substituted) / total # of words in the reference transcript.
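The formula above can be sketched in code. The following is a minimal illustration, not any vendor's scoring tool: it assumes whitespace tokenization and case-sensitive matching with no text normalization, and the function name `word_error_rate` is hypothetical. It finds the minimum number of insertions, deletions, and substitutions with a word-level edit distance and divides by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (insertions + deletions + substitutions) / reference words."""
    ref = reference.split()   # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" gives one substitution and one deletion over six reference words, a WER of about 33%. Because insertions are counted, WER can exceed 100% on very poor transcripts, so the WER + Accuracy = 100% relation is best read as an approximation.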

We suggest a degree of skepticism towards vendor claims about accuracy. This includes the qualitative claim that OpenAI’s model “approaches human level robustness on accuracy in English,” and the WER statistics published in Whisper’s documentation.


There you have it: the top 10 speech-to-text APIs in 2024. We hope that this helps demystify the proliferation of options that exist in this space, and gives you a better sense of which provider might be the best for your particular use case. If you'd like to give Deepgram a try for yourself, you can sign up for a free API key or contact us if you have questions about how you might use Deepgram for your transcription needs. If you have any feedback about this post, or anything else around Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions or contact us to talk to one of our product experts.



IBM Watson® Speech to Text technology enables fast and accurate speech transcription in multiple languages for a variety of use cases, including but not limited to customer self-service, agent assistance and speech analytics. Get started fast with our advanced machine learning models out-of-the-box or customize them for your use case.

IBM Watson Speech to Text is now available as a containerized library for IBM partners to embed AI technology in their commercial applications.


Our best-in-class AI, embedded within Watson Speech to Text, truly understands your customers.

Train Watson Speech to Text on your unique domain language and specific audio characteristics.

Enjoy the security of IBM’s world-class data governance practices.

Built to support global languages and deployable on any cloud — public, private, hybrid, multicloud, or on-premises.

Enable your voice applications using neural technologies for speech recognition powered by IBM Watson.

Improve speech recognition accuracy for your use case with language and acoustic training options.

Activate your voice application with speech models tuned for the customer care domain.

Improve speech recognition accuracy for extracting phrases, words, letters, numbers or lists.

Use our models optimized for low latency in real-time speech applications.

Analyze and correct weak audio signals before transcription begins.

Improve application response times by using speech transcription as it is generated and throughout the finalization process.

Transcribe dates, times, numbers, currency values, email and website addresses in your final transcripts by converting them into conventional forms.

Recognize who said what in a multi-participant voice exchange. Currently optimized for two-way call center conversations but can detect up to 6 different speakers.

Filter for specific words or inappropriate content by using our keyword spotting and profanity filtering features. (US English only)

Accelerate your business growth as an Independent Software Vendor (ISV) by innovating with IBM. Partner with us to deliver enhanced commercial solutions embedded with AI to better address clients’ needs.

Build AI-based solutions faster with IBM embeddable AI

Hear how a large call center transformed its operations with AI. (2:31)

Get started for free or view a demo.

500 minutes of free speech recognition a month and 38 pre-trained speech models.

As low as USD 0.01 per minute

Tune your speech models to improve accuracy in recognition as well as transcription. Plus version includes unlimited minutes per month and 100 concurrent transcriptions.

Contact us for pricing

Provides large and security-sensitive firms with more capacity and data protection. Premium includes unlimited minutes per month and unlimited concurrent transcriptions.

Deploy Anywhere

Deploy behind your firewall or on any cloud with the flexibility of IBM Cloud Pak for Data. The Deploy Anywhere version includes unlimited minutes per month and unlimited concurrent transcriptions, along with noise detection, speech customization and data isolation.

Technical API specifications for all of your development needs.

The Watson SDK repository in GitHub.

See documentation about our enhanced security features that ensure your data is isolated and encrypted end-to-end, while in transit and at rest.

Learn how to create custom speech models using IBM Watson quickly — without knowing how to code.

Read about Watson Speech to Text requirements, the methodology and some best practices inspired by actual clients.

Guidelines on how to add a new or existing virtual assistant to your brand-new Watson IVR.

Improve customer engagement by interacting with users in their own language using any written text.

Solve customer issues the first time using an AI virtual assistant across any application, device, or channel.

Infuse powerful natural language AI into commercial applications with a containerized library designed to empower IBM partners with greater flexibility.

See Watson Speech to Text capabilities in action.

Top Free Text-to-speech (TTS) libraries for python


With more artificial intelligence applications being built, we need text-to-speech (TTS) engine APIs. The good news is that there are a lot of open-source modules for text-to-speech (TTS). This story covers Python's top text-to-speech (TTS) libraries.

gTTS (Google Text-to-Speech) is a Python library that allows you to convert text to speech using Google Translate's text-to-speech API. It's designed to be easy to use and provides options for controlling the speech output, such as setting the language and choosing between normal and slow speech.

When I wrote this post, the project had 1.7k stars on GitHub.

To use gTTS, you will need to install the library using pip:

Then, you can use the gTTS class to create an instance of the text-to-speech converter. You can pass the text you want to convert to speech as a string to the gTTS constructor. For example:

Once you have an instance of the gTTS class, you can use the save method to save the speech to a file. For example:

You can also use the gTTS class to change the speech output’s language and speech speed. For example:

Complete code and output
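A minimal sketch of the steps above, assuming the gTTS package is installed (`pip install gTTS`) and network access is available, since gTTS sends the text to Google's service. The helper names `gtts_options` and `text_to_mp3` are hypothetical; only `gTTS(...)` with its `lang` and `slow` arguments and the `save` method come from the library.

```python
from pathlib import Path


def gtts_options(lang: str = "en", slow: bool = False) -> dict:
    """Collect the keyword arguments passed to the gTTS constructor."""
    return {"lang": lang, "slow": slow}


def text_to_mp3(text: str, out_path: str, lang: str = "en", slow: bool = False) -> Path:
    """Synthesize `text` with gTTS and save the result as an MP3 file."""
    # Imported lazily so this module loads even where gTTS is not installed.
    from gtts import gTTS

    tts = gTTS(text=text, **gtts_options(lang, slow))
    tts.save(out_path)  # performs the network request and writes the MP3
    return Path(out_path)


# Example (needs gTTS and network access):
# text_to_mp3("Bonjour tout le monde", "bonjour.mp3", lang="fr", slow=True)
```

Keeping the options in a small helper makes it easy to see that language and slow speech are the main knobs; playback of the saved file is left to whatever audio player is available.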

Other options are available for controlling the speech output, such as the tld parameter, which selects the Google Translate domain and with it the regional accent. You can find more information about these options in the gTTS documentation.

I already have a series of videos and posts about CoquiTTS that you can find here.

CoquiTTS is a neural text-to-speech (TTS) library developed in PyTorch. It is designed to be easy to use and provides a range of options for controlling the speech output, such as setting the language, the pitch, and the duration of the speech.

It is the most popular package, with 7.4k stars on GitHub.

To use CoquiTTS, you will need to install the library using pip:

Once you have installed the library, you can use the Synthesizer class to create an instance of the text-to-speech converter. You pass the paths of the model checkpoint and its configuration file to the Synthesizer constructor. For example:

Once you have an instance of the Synthesizer class, you can use the tts method to generate speech from a text string. You can save the speech to a file using the save_wav method. For example:

You can find the documentation here.


TensorFlowTTS (TensorFlow Text-to-Speech) is a deep learning-based text-to-speech (TTS) library built on TensorFlow, an open-source platform for machine learning and artificial intelligence. It is designed to be easy to use and provides a range of features for building TTS systems, including support for multiple languages and customizable models.

It has 3k stars on GitHub.

To use TensorFlowTTS, you will need to install the library using pip:

Sample code

TensorFlowTTS also provides pre-trained models for various languages, including English, Chinese, and Japanese. You can use these models to perform speech synthesis without the need to train your own model. You can find more information about how to use TensorFlowTTS and the available options on GitHub.

pyttsx3 is a Python text-to-speech (TTS) library that works offline and converts text to speech using the TTS engine available on your platform, including Microsoft SAPI5 on Windows, NSSpeechSynthesizer on macOS, and eSpeak on Linux. pyttsx3 is designed to be easy to use and provides a range of options for controlling speech output.

It has 1.3k stars on GitHub.

To use pyttsx3, you will need to install the library using pip:

Once you have installed the library, you can use the pyttsx3.init function to create an instance of the text-to-speech converter. You can pass the TTS engine you want to use as an argument to the init function. For example:

Once you have an instance of the TTS engine, you can use the say method to generate speech from text. The say method takes the text you want to synthesize as an argument. For example:
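The init and say steps described above might look like the following. This is a sketch under stated assumptions: pyttsx3 is installed (`pip install pyttsx3`) and an audio device is present, and the helper names `clamp_volume` and `speak` are hypothetical. The library calls used are `pyttsx3.init`, `setProperty`, `say`, and `runAndWait`.

```python
from typing import Optional


def clamp_volume(volume: float) -> float:
    """pyttsx3 expects volume between 0.0 and 1.0; clamp anything outside it."""
    return max(0.0, min(1.0, volume))


def speak(text: str, rate: int = 150, volume: float = 1.0,
          driver: Optional[str] = None) -> None:
    """Speak `text` aloud through the platform's TTS engine."""
    # Imported lazily so this module loads even where pyttsx3 is not installed.
    import pyttsx3

    # driver=None lets pyttsx3 pick the default engine for the platform;
    # "sapi5", "nsss", or "espeak" can be passed explicitly instead.
    engine = pyttsx3.init(driver)
    engine.setProperty("rate", rate)                    # words per minute
    engine.setProperty("volume", clamp_volume(volume))  # 0.0 to 1.0
    engine.say(text)       # queue the utterance
    engine.runAndWait()    # block until the queued speech has finished


# Example (needs pyttsx3 and an audio device):
# speak("Hello from pyttsx3", rate=170, volume=0.8)
```

Note that `say` only queues the text; nothing is heard until `runAndWait` processes the queue.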

Larynx is a text-to-speech (TTS) library written in Python that runs neural TTS models locally to convert text to speech.

To use Larynx, you will need to install the library using pip:

Once you have installed the library, you can use the text_to_speech function for the text-to-speech converter. You can pass many parameters like:

You can save the speech to a file using the **wavfile** function. For example:


Options for free (and preferably open source) speech to text library [closed]

Looking for a library (with Java or Python APIs) that converts speech to text. 100% accuracy is not an absolute requirement because I just need to run some experiments for a prototype. Ideally it should accept an input file (e.g., .wav) and return the output as text.

  • speech-recognition
  • speech-to-text

Soumya Simanta's user avatar

2 Answers

You can use Sphinx, as kdazzle has suggested, or you can also check out another Java implementation here.

For a Python library, check out pyspeech or dragonfly. If the library can output the text, it should be possible to write that text to a file.

Matt Swain's user avatar

Sphinx is pretty good. It's made by the folks at Carnegie Mellon.

kdazzle's user avatar

  • 1 Thanks. I've already looked at and implemented this option. I'm looking for other options as well. – Soumya Simanta Commented Apr 12, 2012 at 2:00


This AI Startup Wants You to Read Audiobooks to Yourself

The text-to-speech reading tool is challenging a $2 billion industry.


AI startup Speechify is putting its own spin on audiobooks and giving you, the listener, a leading role. You get to be the star if you want.

You can import your own voice to make an AI clone and then listen to text with your voice or your girlfriend's, as in the case of CEO Cliff Weitzman.

You can also choose from celebrities like Snoop Dogg and Gwyneth Paltrow, who have signed on to add their voices as options. The twist is that these voices are AI-generated, not the celebrities themselves reading.

"You can just pick your own voices and that's a great experience," Weitzman said.

It's this ability to choose whatever voice you want and to turn any book into an audiobook that Weitzman argued sets Speechify apart from titans of industry like Apple Books, Audible and Spotify.

Audiobooks are a hot commodity. According to the Audio Publishers Association, 2023 marked the 12th consecutive year of sales growth, with a total of about $2 billion for the year. The APA also found 52% of US adults have listened to audiobooks at some point, which is equivalent to about 150 million people.


With artificial intelligence, and especially generative AI, exerting its influence far and wide, we're seeing entrepreneurs seek to harness the technology to challenge the status quo in a variety of industries, from law to medicine and even generative AI itself. Seven-year-old Speechify is positioning its text-to-speech reading tool as an alternative to traditional audiobooks through the use of AI-generated human voices.

As a child with dyslexia, Weitzman relied on his parents to read books aloud to him. But when he got to college, he couldn't find audio versions of his textbooks, so he built a program to read to him using deep learning, an AI technique that teaches computers to process data like the human brain does, and what's known as concatenative text-to-speech, a form of speech generation that taps into pre-recorded samples of speech.

The native Hebrew speaker also included the ability to change the speed -- a feature Speechify retains today.

"When I started out, I didn't speak English, so I would listen to everything at 0.75x speed and then with time I increased to 1x, 1.25x, 1.5x, 2x, 3x," he said. "If a sentence was easy to understand, I'd make it really fast. If the sentence was hard, I'd make it really slow."

Weitzman's brother Tyler joined as a co-founder in 2018 and has served as head of AI and president since 2022. Tyler Weitzman helped develop the algorithm that eventually became the first version of Speechify. It was trained on 100,000 hours of audio so the reading voice sounded human. As the product improved, the startup signed partnerships with celebrities to use their voices as well.

Speechify can read books, documents and articles on a mobile device. To use it, you can upload a PDF to the web app, which adds the audio to your mobile app, or you can download the Chrome extension to listen to text from Google Drive, iCloud or Dropbox.

A limited version of Speechify is free. It includes six reading voices to start and you can listen at speeds up to 1x. These voice options include computer-generated US males named Nate and John, as well as Stephanie, a female voice from the UK, along with Snoop Dogg, Gwyneth Paltrow and US YouTuber MrBeast.

I picked Stephanie, and then the app told me more than 100 voices would also be available in the app. (You then have to listen to a roughly minute-long sales pitch in your chosen voice before proceeding.)

Speechify Premium, which costs $11.67 per month per user, has 250-plus reading voices and 50-plus languages and you can listen at up to 4.5x.

Speechify has 40 million users, according to Cliff Weitzman. (However, during signup the app itself says 23-plus million people use Speechify.)

The startup is reportedly backed by $4.5 million from an early-stage venture capital round in 2020. The company declined to comment on funding.

This is one of a series of short profiles of AI startups, to help you get a handle on the landscape of artificial intelligence activity going on. For more on AI, see our new AI Atlas hub, which includes product reviews, news, tips and explainers.


Regularizing cross-attention learning for end-to-end speech translation with ASR and MT attention matrices


Recommendations

Lattice-Based ASR-MT Interface for Speech Translation

The usual approach to improve the interface between automatic speech recognition (ASR) and machine translation (MT) is to use ASR word lattices for translation. In comparison with the previous research along this line, this paper presents an efficient ...

End-to-End Speech Translation With Transcoding by Multi-Task Learning for Distant Language Pairs

Directly translating spoken utterances from a source language to a target language is challenging because it requires a fundamental transformation in both linguistic and para/non-linguistic features. Traditional speech-to-speech translation approaches ...

Automatic quality estimation for speech translation using joint ASR and MT features

This paper addresses the automatic quality estimation of spoken language translation (SLT). This relatively new task is defined and formalized as a sequence-labeling problem where each word in the SLT hypothesis is tagged as good or bad according to a ...


Published in

Pergamon Press, Inc.

United States

Author tags

  • End-to-end speech-to-text translation
  • Transformer
  • Multitask learning
  • Cross-attention learning
  • Research-article



Transcribe Speech to Text Live 4+

AI Voice Memo, Audio Dictation

Take Agency, LLC

Designed for iPad

  • 4.9 • 28 Ratings
  • Offers In-App Purchases



Transcribe Speech To Text: Convert Voice to Text Effortlessly!

Unlock the power of speech-to-text conversion with our cutting-edge app, Transcribe Speech To Text. Experience the convenience of converting spoken words into accurate text in a flash. Whether you're a student, professional, or simply need to jot down important notes, our app is your go-to transcription solution.

Key Features:

  1. Instant & Accurate Transcripts: Witness the magic of real-time transcription with lightning-fast accuracy. Our advanced algorithms ensure precise conversion, saving you valuable time and effort.

  2. Record Speech to Text: Never miss a single word! Effortlessly record your speech and let Transcribe Speech To Text transform it into written text. Perfect for meetings, interviews, lectures, or personal memos.

  3. Import Files from Anywhere: Bring in existing audio files from your device or cloud storage and let our app do the rest. Seamlessly transcribe speeches, podcasts, and recorded content with ease.

  4. Edit and Playback: Refine your transcripts effortlessly! Edit the converted text to correct any errors or make improvements. Better yet, sync the text with the audio and play back the recording to review your content accurately.

  5. Export in 10+ Different Formats: Flexibility at its best! With Transcribe Speech To Text, you can export your transcriptions in a wide range of document formats, including TXT, PDF, DOCX, JPEG, and PNG. Additionally, our app supports subtitle formats like SRT, VTT, and SBV, as well as data formats like CSV and JSON.

  6. Enhanced Audio Download: Enjoy the option to download the enhanced audio alongside the transcriptions. This feature ensures you have access to clear and high-quality recordings for future reference.

  7. 90+ Languages Supported: Break the language barrier with our extensive language support. Transcribe Speech To Text can handle over 90 languages, enabling you to transcribe content from around the world effortlessly.
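For readers curious what a subtitle export such as SRT looks like in practice, here is a minimal sketch, not the app's actual implementation. It assumes transcript segments arrive as hypothetical `(start_seconds, end_seconds, text)` tuples; the helper names `srt_timestamp` and `to_srt` are made up for illustration. SRT itself uses numbered cues with `HH:MM:SS,mmm --> HH:MM:SS,mmm` time ranges.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms_total = int(round(seconds * 1000))
    hours, rem = divmod(ms_total, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Turn (start, end, text) tuples into the text of an SRT file.

    The tuple layout is an assumption for this sketch; real STT engines
    return segments in their own structures.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        # Each cue: running number, time range, then the caption text.
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 1.5, "Hello"), (1.5, 3.0, "world")]
    print(to_srt(demo))
```

The same segment list could feed a VTT or CSV writer; only the timestamp separator and framing differ between those formats.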
With Transcribe Speech To Text, taking notes, creating transcripts, and converting voice to text has never been easier. Empower yourself with a versatile and reliable transcription tool that simplifies your work and enhances productivity. Download Transcribe Speech To Text now and join countless users who have embraced the power of speech-to-text technology. Make your voice heard and your words written, anytime, anywhere!

Subscriptions & Terms

  • No charge during the Free Trial period.
  • Free trial automatically converts to a paid subscription unless canceled at least 24 hours before the end of the trial period. From that point onwards, the subscription automatically renews unless canceled at least 24 hours before the end of the current period.
  • The payment will be charged to your iTunes Account when you confirm the purchase.
  • The subscription automatically renews for the same price and duration, depending on the selected plan (monthly, half-annual or annual), unless canceled at least 24 hours before the end of the current period.
  • You can disable the automatic renewal function at any time by adjusting your account settings.
  • Any unused portion of a free trial period will be forfeited when the user purchases a subscription.

Version 1.0.6

Bug fixes. Enjoy!

Ratings and Reviews

I was a little worried that it wouldn’t give me a lot of free time for interviews but it worked out pretty good. I imported a 13 minute interview and it was ready in 5 minutes. Also love the pink and orange app color

To the point

Clean interface, reliable recording, and excellent transcription + audio cleanup. I use it for all of my interviews, and it works seamlessly with past ones from Voice Memos.

Excellent app

Was great for transcribing audio within a video file!

App Privacy

The developer, Take Agency, LLC , indicated that the app’s privacy practices may include handling of data as described below. For more information, see the developer’s privacy policy .

Data Not Linked to You

The following data may be collected but it is not linked to your identity:

  • Identifiers
  • Diagnostics

Privacy practices may vary, for example, based on the features you use or your age. Learn More


  • Transcribe Pro $2.99
  • Transcribe Pro $9.99
  • Transcribe Pro $49.99
  • Transcribe Pro $19.99
  • Transcribe Pro $4.99
  • Transcribe Pro $29.99
  • Unlimited Saves $5.99


Cloud Speech-to-Text On-Device

Get started

Contact Google sales to get access to Google Cloud Storage buckets with Speech models and the Speech runtime.

The Speech runtime is a C library. Browse usage examples of the Speech runtime. The individual git repositories will have documentation for how to set up and try out the Speech models with the runtime.

  • Python bindings
  • Rust bindings
  • Java/Android bindings
  • Go bindings

If an example with a favorite language is missing, reach out and we will add it.

  • Model Adaptation

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License , and code samples are licensed under the Apache 2.0 License . For details, see the Google Developers Site Policies . Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2024-07-09 UTC.

LW - Robin Hanson & Liron Shapira Debate AI X-Risk by Liron The Nonlinear Library

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robin Hanson & Liron Shapira Debate AI X-Risk, published by Liron on July 10, 2024 on LessWrong.

Robin and I just had an interesting 2-hour AI doom debate. We picked up where the Hanson-Yudkowsky Foom Debate left off in 2008, revisiting key arguments in the light of recent AI advances. My position is similar to Eliezer's: P(doom) on the order of 50%. Robin's position remains shockingly different: P(doom) 1%. I think we managed to illuminate some of our cruxes of disagreement, though by no means all. Let us know your thoughts and feedback!

Topics

  • AI timelines
  • The "outside view" of economic growth trends
  • Future economic doubling times
  • The role of culture in human intelligence
  • Lessons from human evolution and brain size
  • Intelligence increase gradient near human level
  • Bostrom's Vulnerable World hypothesis
  • The optimization-power view
  • Feasibility of AI alignment
  • Will AI be "above the law" relative to humans

Where To Watch/Listen/Read

  • YouTube video
  • Podcast audio
  • Transcript

About Doom Debates

My podcast, Doom Debates, hosts high-quality debates between people who don't see eye-to-eye on the urgent issue of AI extinction risk. All kinds of guests are welcome, from luminaries to curious randos. If you're interested to be part of an episode, DM me here or contact me via Twitter or email. If you're interested in the content, please subscribe and share it to help grow its reach.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit

© 2024 The Nonlinear Fund


  1. 14 Best Text to Speech Solutions for Business and Personal Use

  2. OpenAI Whisper

  3. Top 10 Free Speech-to-Text APIs that you can use in your next IoT Project

  4. Simple Example of Speech To Text

  5. Google Text To Speech: Read Texts On Your Screen Aloud

  6. Text To Speech (App)

  1. RealtimeSTT: A low-latency speech-to-text library with advanced voice activity detection

  2. The Best Text to Speech Tool Powered by AI 2024 (Free Access Link Below)

  3. How to convert your text to speech using Opensource tools?

  4. Java Google Text To Speech : Tutorial [ 1 ]

  5. Text Library

  6. Text Library Pro


  1. Speech to Text in 1min

    Simple, high-powered transcription. Upload audio or video and get accurate transcripts in 1 min. Powered by AI. Unlimited minutes.

  2. Top 11 Open Source Speech Recognition/Speech-to-Text Systems

    It supports speech-to-text, text-to-speech, and speech translation. The company claims that its toolkit produces 50% fewer errors in the output than other toolkits on the market. Learn more about Whisper from its official website. 11. StyleTTS2. The newest speech recognition library on the list, which was just released in the middle of November ...

  3. GitHub

    Realtime Transcription: Transforms speech to text in real-time. Wake Word Activation: Can activate upon detecting a designated wake word. Hint: Check out RealtimeTTS, the output counterpart of this library, for text-to-voice capabilities. Together, they form a powerful realtime audio wrapper around large language models.

  4. The Top Free Speech-to-Text APIs, AI Models, and Open ...

    Choosing the best Speech-to-Text API, AI model, or open-source engine to build with can be challenging. You need to compare accuracy, model design, features, support options, documentation, security, and more. This post examines the best free Speech-to-Text APIs and AI models on the market today, including ones that have a free tier, to help you make an informed decision.

  5. 13 Best Free Speech-to-Text Open Source Engines, APIs, and AI Models

    Best 13 speech-to-text open-source engine · 1 Whisper · 2 Project DeepSpeech · 3 Kaldi · 4 SpeechBrain · 5 Coqui · 6 Julius · 7 Flashlight ASR (Formerly Wav2Letter++) · 8 PaddleSpeech (Formerly DeepSpeech2) · 9 OpenSeq2Seq · 10 Vosk · 11 Athena · 12 ESPnet · 13 Tensorflow ASR.

  6. Speech to text

    The Audio API provides two speech to text endpoints, transcriptions and translations, based on our state-of-the-art open source large-v2 Whisper model. They can be used to: Transcribe audio into whatever language the audio is in. Translate and transcribe the audio into English.

  7. TTS · PyPI

    🐸TTS is a library for advanced Text-to-Speech generation. 🚀 Pretrained models in +1100 languages. 🛠️ Tools for training new models and fine-tuning existing models in any language. 📚 Utilities for dataset analysis and curation. 💬 Where to ask questions.

  8. speech-to-text · GitHub Topics · GitHub

    DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers. machine-learning embedded deep-learning offline tensorflow speech-recognition neural-networks speech-to-text deepspeech on-device.

  9. pyttsx3 · PyPI

    pyttsx3 is a text-to-speech conversion library in Python. Unlike alternative libraries, it works offline, and is compatible with both Python 2 and 3. Installation: pip install pyttsx3. If you receive errors such as No module named win32com.client, No module named win32, or No module named win32api, you will need to additionally install pypiwin32. Usage:

  10. DeepSpeech is an open source embedded (offline, on-device) speech-to

    DeepSpeech is an open-source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper. Project DeepSpeech uses Google's TensorFlow to make the implementation easier. Documentation for installation, usage, and training models are available on For the latest release, including pre-trained models and ...

  11. Speech-to-Text Client Libraries

    Install the client library. If you are using Visual Studio 2017 or higher, open nuget package manager window and type the following: Install-Package Google.Apis. If you are using .NET Core command-line interface tools to install your dependencies, run the following command: dotnet add package Google.Apis.

  12. SpeechRecognition · PyPI

    Library for performing speech recognition, with support for several engines and APIs, online and offline. ... IBM Speech to Text; Snowboy Hotword Detection (works offline) Tensorflow; Vosk API (works offline) OpenAI whisper (works offline) Whisper API; Quickstart: pip install SpeechRecognition. See the "Installing" section for more details.

  13. Text-to-Speech client libraries

    This page shows how to get started with the Cloud Client Libraries for the Text-to-Speech API. Client libraries make it easier to access Google Cloud APIs from a supported language. Although you can use Google Cloud APIs directly by making raw requests to the server, client libraries provide simplifications that significantly reduce the amount ...

  14. Top 5 Speech Recognition Open-Source Projects and Libraries ...

    DeepSpeech is an open-source speech-to-text engine which can run in real-time using a model trained by machine learning techniques based on Baidu's Deep Speech research paper and is implemented ...

  15. The Ultimate Guide To Speech Recognition With Python

    Speech must be converted from physical sound to an electrical signal with a microphone, and then to digital data with an analog-to-digital converter. Once digitized, several models can be used to transcribe the audio to text. Most modern speech recognition systems rely on what is known as a Hidden Markov Model (HMM). This approach works on the ...

  16. Using the Speech-to-Text API with Node.js

    5. Install the Google Cloud Speech-to-Text API client library for Node.js. First, create a project that you will use to run this Speech-to-Text API lab, initialize a new Node.js package in a folder of your choice: NPM asks several questions about the project configuration, such as name and version.

  17. Using the Web Speech API

    Using the Web Speech API. The Web Speech API provides two distinct areas of functionality — speech recognition, and speech synthesis (also known as text to speech, or tts) — which open up interesting new possibilities for accessibility, and control mechanisms. This article provides a simple introduction to both areas, along with demos.

  18. Best Speech-to-Text APIs in 2024

    8. Amazon Transcribe. Amazon Transcribe is offered as a part of the overall Amazon Web Services (AWS) platform. With similar features as Google and Microsoft's speech-to-text solutions, Amazon Transcribe offers good accuracy for pre-recorded audio, but poor accuracy for real-time streaming use cases.

  19. Client libraries

    While you can use Speech-to-Text by making direct requests, we provide client libraries for several popular languages. Speech-to-Text client libraries are built on Google Cloud Client Libraries. This common infrastructure provides functionality for API-specific library implementations, but it also provides types and methods that you may use directly when using any Cloud API.

  20. IBM Watson Speech to Text

    Train Watson Speech to Text on your unique domain language and specific audio characteristics. Protects your data. Enjoy the security of IBM's world-class data governance practices. Truly runs anywhere. Built to support global languages and deployable on any cloud — public, private, hybrid, multicloud, or on-premises.

  21. Top Free Text-to-speech (TTS) libraries for python

    pyttsx3 is a Python text-to-speech (TTS) library that allows you to convert text to speech using a range of TTS engines, including the Microsoft Text-to-Speech API, the Festival, and the eSpeak TTS engine. pyttsx3 is designed to be easy to use and provides a range of options for controlling speech output. It has 1.3k stars on github.

  22. Options for free (and preferably open source) speech to text library

    You can use Sphinx, as kdazzle suggested, or you can check out other Java implementations here. For a Python library, check out pyspeech or dragonfly. If the library can output the text, it should be possible to write that text to a file.

  23. This AI Startup Wants You to Read Audiobooks to Yourself

    The text-to-speech reading tool is challenging a $2 billion industry. Lisa Lacy Lead AI Writer Lisa joined CNET after more than 20 years as a reporter and editor.

  24. Regularizing cross-attention learning for end-to-end speech translation

    The cross-attention mechanism enables Transformer to capture correspondences between the input and output. However, in the domain of end-to-end (E2E) speech-to-text translation (ST), the learned cross-attention weights often struggle to accurately correspond with actual alignments, given the need to align speech and text across different modalities and languages.

  25. Transcribe Speech to Text Live 4+

    ‎Transcribe Speech To Text: Convert Voice to Text Effortlessly! Unlock the power of speech-to-text conversion with our cutting-edge app, Transcribe Speech To Text. Experience the convenience of converting spoken words into accurate text in a flash. Whether you're a student, professional, or simply n…

  26. Speech-to-Text client libraries

    This page shows how to get started with the Cloud Client Libraries for the Speech-to-Text API. Client libraries make it easier to access Google Cloud APIs from a supported language. Although you can use Google Cloud APIs directly by making raw requests to the server, client libraries provide simplifications that significantly reduce the amount ...

  27. Can You Have Your Kindle Read To You? Yep, Here's How

    Scroll to the "Text-to-Speech" option and move the toggle from "Off" to "On." Return to your e-book and tap the middle of the screen to bring up the progress bar at the bottom. Press play beside ...

  28. Get started

    Guides, examples, and references for Cloud Speech-to-Text V1 public features. Cloud Speech-to-Text V2 Guides, examples, and references for Cloud Speech-to-Text V2 public features. ... The Speech runtime is a C library. Browse usage examples of the Speech runtime. The individual git repositories will have documentation for how to set up and try ...

  29. The Nonlinear Library: LW

    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robin Hanson & Liron Shapira Debate AI X-Risk, published by Liron on July 10, 2024 on LessWrong. Robin and I just had an interesting 2-hour AI doom debate.
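The digitization step mentioned in the Python speech recognition guide snippet above (sound to electrical signal to digital data) can be illustrated with a small sketch. It is a simplified assumption-laden example, not how any listed library works internally: a pure sine tone stands in for the analog signal, and the function name `sample_and_quantize`, the 16 kHz sample rate, and the 16-bit depth are choices made here for illustration.

```python
import math


def sample_and_quantize(freq_hz, duration_s, sample_rate=16000, bits=16):
    """Sample a sine tone and quantize each sample to a signed integer.

    This mimics what an analog-to-digital converter does: read the signal
    at fixed time steps, then round each reading to a fixed number of levels.
    """
    max_amp = 2 ** (bits - 1) - 1              # 32767 for 16-bit audio
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                    # time of the n-th sample (s)
        analog = math.sin(2 * math.pi * freq_hz * t)   # value in [-1, 1]
        samples.append(int(round(analog * max_amp)))   # quantization
    return samples


if __name__ == "__main__":
    pcm = sample_and_quantize(freq_hz=440.0, duration_s=0.01)
    print(len(pcm), "samples; first five:", pcm[:5])
```

A list of integers like `pcm` is the kind of raw input a speech-to-text engine ultimately consumes, usually after it has been read from a WAV file or a microphone stream.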