Speech-to-Text (STT) technology allows you to turn any audio content into written text. It is also called Automatic Speech Recognition (ASR) or computer speech recognition, and it is based on acoustic modeling and language modeling. Note that it is commonly confused with voice recognition: Speech-to-Text translates speech from a verbal format to a written one, whereas voice recognition seeks to identify an individual speaker's voice.
Speech-to-Text API use cases
You can use Speech Recognition in numerous fields, and some STT APIs are built especially for those fields. Here are some common use cases:
Call centers: data collected and recorded by speech recognition software can be studied and analysed to identify trends in customer interactions.
Banking: make communications with customers more secure and efficient.
Automation: fully automate tasks such as appointment booking or order-status enquiries.
Governance and security: completing an identification and verification (I&V) process, with the customer speaking their details such as account number, date of birth and address.
Medical: voice-driven medical report generation, voice-driven form filling for medical procedures, patient identity verification, etc.
Media: automated conversion of TV, radio, social network videos, and other speech-based content into fully searchable text.
Top Speech-to-Text APIs
Speech experts at Eden AI have tested, compared, and used many of the Speech-to-Text APIs on the market. There are many players; here are some that perform well (in alphabetical order):
Assembly AI
AWS Transcribe
Deepgram
Gladia
Google Cloud Speech
IBM Watson Speech-to-Text
Microsoft Azure Speech-to-Text
NeuralSpace
One AI
OpenAI
Rev AI
Speechmatics
Symbl
Voci
Performance variations of STT APIs
For all the companies that use voice technology in their software and for their customers, cost and performance are real concerns. The voice market is dense, and all these providers have their strengths and weaknesses.
Performance variations according to the languages
Speech-to-Text APIs perform differently depending on the language of the audio. In fact, some providers specialize in specific languages. Several specialties exist:
Accent specialty: some providers tune their speech-to-text APIs to be accurate for audio from specific regions. For example: English (US, UK, Canada, South Africa, Singapore, Hong Kong, Ghana, Ireland, Australia, India, etc.), Spanish (Spain, Argentina, Bolivia, Chile, Cuba, Equatorial Guinea, Peru, US, etc.). The same goes for Portuguese, Chinese, Arabic, etc.
Rare language specialty: some speech-to-text providers focus on rare languages and dialects. You can find providers that let you process audio in Gujarati, Marathi, Burmese, Pashto, Zulu, Swahili, etc.
Performance variations according to audio data quality
When testing multiple speech-to-text APIs, you will find that a provider's accuracy varies with audio format and quality. Formats such as .wav, .mp3, and .m4a impact performance, as does the sample rate, which is most often 8,000 Hz, 16,000 Hz, or higher. Some providers perform better with low-quality audio, others with high-quality audio.
Performance variations according to fields
Some STT providers train their engines on domain-specific data. This means some speech-to-text APIs will perform better on audio from the medical field, others in the automotive field, others on generic content, etc. If your customers come from different fields, you must take this into account and optimize your choice.
Using multiple speech-to-text APIs is the key
Any company that offers a speech recognition feature in its product or deals with voice technology for its customers should use multiple speech-to-text APIs. This is essential to reach high performance, optimize cost, and cover all customer needs. There are many reasons to use multiple APIs:
Fallback provider is the ABCs. You need to set up a provider API that is called if and only if the main speech-to-text provider performs poorly (or is down). You can use the returned confidence score, or other methods, to check provider accuracy.
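The fallback pattern above can be sketched as follows. The provider functions and the confidence threshold are illustrative assumptions, not any specific vendor's API:

```python
# Minimal fallback sketch: call the primary STT provider, and only
# call the backup when the primary fails or returns low confidence.

CONFIDENCE_THRESHOLD = 0.85  # tune this on your own test data

def transcribe_with_fallback(audio_bytes, primary, backup):
    """primary/backup are callables returning (text, confidence)."""
    try:
        text, confidence = primary(audio_bytes)
        if confidence >= CONFIDENCE_THRESHOLD:
            return text
    except Exception:
        pass  # primary is down or errored: fall through to the backup
    text, _ = backup(audio_bytes)
    return text

# Dummy providers standing in for real STT API clients
def main_provider(audio):
    return "hello world", 0.60  # low confidence -> triggers fallback

def backup_provider(audio):
    return "hello, world!", 0.95

print(transcribe_with_fallback(b"...", main_provider, backup_provider))
```

In production, the confidence threshold should come from your own benchmark, since providers compute confidence scores differently.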
Performance optimization. After a testing phase, you will be able to build a mapping of provider performance based on the criteria you chose (languages, fields, etc.). Each audio file you need to process is then sent to the best provider.
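Such a performance mapping can be as simple as a lookup table keyed on the criteria you measured. The provider names and profiles below are made up for illustration:

```python
# Route each audio file to the provider that scored best on your
# internal benchmark for its (language, field) profile.

# Hypothetical benchmark results: (language, field) -> ranked providers
PERFORMANCE_MAP = {
    ("en", "medical"): ["provider_a", "provider_b"],
    ("en", "generic"): ["provider_b", "provider_c"],
    ("fr", "generic"): ["provider_c", "provider_a"],
}

DEFAULT_PROVIDERS = ["provider_b"]  # used for unbenchmarked profiles

def pick_provider(language, field):
    """Return the best-ranked provider for this audio profile."""
    ranked = PERFORMANCE_MAP.get((language, field), DEFAULT_PROVIDERS)
    return ranked[0]

print(pick_provider("en", "medical"))  # provider_a
print(pick_provider("de", "legal"))    # provider_b (default)
```

Keeping the full ranked list (not just the winner) also gives you fallback candidates for free.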
Cost-performance ratio optimization. This method lets you choose the cheapest provider that performs well on your data. For example, imagine you choose the Google Cloud API for customer "A" because all providers perform well there and it is the cheapest, but you choose Microsoft Azure, a more expensive API, for customer "B" because Google's performance is not satisfactory for that customer. (This is a random example.)
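One simple way to implement this trade-off is to record, per customer profile, each provider's measured accuracy and price, then pick the cheapest provider above an accuracy bar. All numbers below are invented for illustration:

```python
# Pick the cheapest provider whose measured accuracy on this
# customer's test set meets a minimum quality bar.

PROVIDERS = [
    # (name, accuracy on customer's test set, price per audio minute, $)
    ("google_cloud",   0.92, 0.016),
    ("azure_speech",   0.95, 0.017),
    ("cheap_provider", 0.78, 0.009),
]

def cheapest_good_enough(providers, min_accuracy=0.90):
    qualified = [p for p in providers if p[1] >= min_accuracy]
    if not qualified:
        raise ValueError("no provider meets the accuracy bar")
    return min(qualified, key=lambda p: p[2])[0]

print(cheapest_good_enough(PROVIDERS))  # google_cloud
```

With a higher accuracy bar (e.g. 0.94), the same function would select the more expensive provider instead.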
Combine multiple STT API transcriptions. This approach is required if you are looking for extremely high accuracy. The combination leads to higher costs but makes your transcription service robust and accurate, because the speech-to-text providers validate or invalidate each other word by word and sentence by sentence.
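A very simplified version of this combination is a word-by-word majority vote. Real systems first align the transcripts (e.g. with ROVER-style alignment); the sketch below assumes the transcripts already line up word for word:

```python
from collections import Counter

def majority_vote(transcripts):
    """Combine equal-length, pre-aligned transcripts word by word."""
    split = [t.split() for t in transcripts]
    assert len({len(words) for words in split}) == 1, "transcripts must align"
    combined = []
    for words_at_position in zip(*split):
        # Keep the word most providers agree on at this position
        word, _count = Counter(words_at_position).most_common(1)[0]
        combined.append(word)
    return " ".join(combined)

# Three providers disagree on one word; the majority wins.
print(majority_vote([
    "the quick brown fox",
    "the quick crown fox",
    "the quick brown fox",
]))  # the quick brown fox
```

An odd number of providers avoids ties; with two providers you would instead fall back on confidence scores to break disagreements.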
Eden AI is a must-have
Eden AI was built for using multiple speech-to-text APIs. Eden AI is the future of speech recognition usage in companies. The Eden AI speech-to-text API allows you to call multiple speech-to-text providers and handle all your voice needs:
Centralized and fully monitored billing on Eden AI for all speech-to-text APIs providers
Unified API for all providers: simple and standard to use, quick switching between providers, access to the specific features of each provider
Standardised response format: the JSON output format is the same for all providers thanks to Eden AI's standardisation work. The response elements are also standardised thanks to Eden AI's powerful matching algorithms.
The best speech-to-text APIs on the market are available: specialized engines for different languages, such as English (US, UK, etc.), Chinese, Spanish, Portuguese, other European languages, African languages, and Asian languages, plus special engines for rare languages.
Data protection: Eden AI does not store or use any of your data. You can also filter to use only GDPR-compliant engines.
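A unified call then looks roughly like the sketch below. This is a hypothetical illustration using only the Python standard library; the endpoint URL, parameter names, and provider identifiers are assumptions that should be checked against Eden AI's current documentation:

```python
import json
from urllib import request

# Assumed endpoint for illustration; verify against the official docs.
EDEN_AI_URL = "https://api.edenai.run/v2/audio/speech_to_text_async"

def build_stt_request(api_key, file_url, providers, language="en"):
    """Build one request that fans out to several STT providers."""
    payload = {
        "providers": ",".join(providers),  # e.g. "google,amazon"
        "file_url": file_url,
        "language": language,
    }
    return request.Request(
        EDEN_AI_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_stt_request("MY_API_KEY", "https://example.com/call.mp3",
                        ["google", "amazon"])
print(req.get_full_url())
```

Sending the request (e.g. with `request.urlopen(req)`) would return one standardised response covering every provider listed, which is what makes switching or comparing providers a one-line change.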
Next step in your project
The Eden AI team can help you with your speech recognition integration project. This can be done by:
Organizing a product demo and a discussion to better understand your needs. You can book a time slot on this link: Contact
By testing the public version of Eden AI for free: however, not all providers are available on this version. Some are only available on the Enterprise version.
By benefiting from the support and advice of a team of experts to find the optimal combination of providers according to the specifics of your needs
Having the possibility to integrate on a third-party platform: we can quickly develop connectors