research papers on speech recognition

Subscribe to the PwC Newsletter

Join the community, add a new evaluation result row, speech recognition.

1245 papers with code • 184 benchmarks • 89 datasets

Speech Recognition is the task of converting spoken language into text. It involves recognizing the words spoken in an audio recording and transcribing them into a written format. The goal is to accurately transcribe the speech in real-time or from recorded audio, taking into account factors such as accents, speaking speed, and background noise.

( Image credit: SpecAugment )

Benchmarks Add a Result

Most implemented papers

Listen, attend and spell.

Unlike traditional DNN-HMM models, this model learns all the components of a speech recognizer jointly.

Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages.

Communication-Efficient Learning of Deep Networks from Decentralized Data

Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device.

Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition

Describes an audio dataset of spoken words designed to help train and evaluate keyword spotting systems.

SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

On LibriSpeech, we achieve 6. 8% WER on test-other without the use of a language model, and 5. 8% WER with shallow fusion with a language model.

wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can outperform the best semi-supervised methods while being conceptually simpler.

Deep Speech: Scaling up end-to-end speech recognition

We present a state-of-the-art speech recognition system developed using end-to-end deep learning.

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs).

Recurrent Neural Network Regularization

wojzaremba/lstm • 8 Sep 2014

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units.

Split Computing and Early Exiting for Deep Learning Applications: Survey and Research Challenges

Mobile devices such as smartphones and autonomous vehicles increasingly rely on deep neural networks (DNNs) to execute complex inference tasks such as image classification and speech recognition, among others.

Speech-to-Text and Text-to-Speech Recognition Using Deep Learning

Ieee account.

Change Username/Password
Update Address

Purchase Details

Payment Options
Order History
View Purchased Documents

Profile Information

Communications Preferences
Profession and Education
Technical Interests
US & Canada: +1 800 678 4333
Worldwide: +1 732 981 0060
Contact & Support
About IEEE Xplore
Accessibility
Terms of Use
Nondiscrimination Policy
Privacy & Opting Out of Cookies

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2024 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.

Help | Advanced Search

Computer Science > Computation and Language

Title: deep speech: scaling up end-to-end speech recognition.

Abstract: We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, or speaker variation, but instead directly learns a function that is robust to such effects. We do not need a phoneme dictionary, nor even the concept of a "phoneme." Key to our approach is a well-optimized RNN training system that uses multiple GPUs, as well as a set of novel data synthesis techniques that allow us to efficiently obtain a large amount of varied data for training. Our system, called Deep Speech, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set. Deep Speech also handles challenging noisy environments better than widely used, state-of-the-art commercial speech systems.

Submission history

Access paper:.

Other Formats

References & Citations

Google Scholar
Semantic Scholar

1 blog link

Dblp - cs bibliography, bibtex formatted citation.

Bibliographic and Citation Tools

Code, data and media associated with this article, recommenders and search tools.

Institution

arXivLabs: experimental projects with community collaborators

arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.

Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.

Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs .

IMAGES

(PDF) Speech Recognition Using Deep Neural Networks: A Systematic Review
(PDF) Evaluation of Automated Speech Recognition Systems for
(PDF) Speech Recognition Techniques: A Review
Robust Speech Recognition
(PDF) A Review on Speech Recognition Technique
(PDF) Speech Emotion Recognition Using Deep Learning Techniques: A Review

VIDEO

Automatic Speech Recognition: An Overview
Class 01 l Geomorphology l Overview of the topic l UPSC 2025 l Geography Optional
5 Months Plan l Geography Optional Classes for UPSC 2025
A Neuroprosthesis for Speech Decoding and Avatar Control
Large scale visual speech recognition (LSVSR)
Class 02 l Geomorphology l History of development of Geomorphology l UPSC 2025 l Geography Optional