In this article, we test several pre-trained Speech-to-Text APIs on a range of representative use cases.
In recent years, speech recognition has become one of the most popular applications of Artificial Intelligence (AI). This popularity is due to the wide diversity of applications and needs: call centers, broadcasting, translation, health care, banking, voice assistants, etc.
Speech recognition includes various functionalities:
This list is not exhaustive. Many solutions combine several of these functionalities.
This article gives a brief overview of pre-trained Speech-to-Text APIs. The aim is to show which problems this kind of API can solve, who the main providers on the market are, and what the optimal process is when using pre-trained APIs.
For our study of pre-trained Speech-to-Text APIs, we selected 6 provider APIs that perform well according to many blog articles and rankings.
This is the list of provider APIs we are going to test. Note that other solutions, including open source ones, also exist.
As mentioned above, Speech-to-Text APIs are used in hundreds of fields, for a wide variety of use cases. In this article, we test different Speech-to-Text APIs on different types of audio representing common use cases.
We chose 3 use cases with different speakers and types of speech. For each use case, we tested the Speech-to-Text APIs of the 6 providers, with one audio file per use case. Of course, for a real project you will need to test on a representative sample of your data (not just one audio file) to get a reliable view of each provider's performance.
The API returns only a text response. This response (often in JSON format) is then used to develop applications. For our example, the way to proceed is:
A benchmark is the best and fastest way to measure and visualize the performance of different solutions and to see which one best fits the type of audio you have. Performance depends on many parameters, such as language, type of voice, punctuation, processing speed, speed of speech, length of audio, etc.
Google, IBM, AWS, Azure, Rev.ai and Assembly AI all provide high-performing Speech-to-Text APIs. Each exposes its own specific parameters, so it is worth comparing their performance on different audio files to quickly identify the strengths and weaknesses of each API.
The first audio file to transcribe is an interview with a young man. Here is the exact speech:
“I am not sure the exact date. It's for Comic relief, a big televised event, where a lot of comedians come together and try to do something funny for money, which is the slogan. And people also go around wearing red noses and trying to raise money like that. It can genuinely be anyone yeah. It is usually students and school children mainly but it can be anyone.”
The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:
Google Cloud response:
“I’m not sure the exact date. It’s for comic relief a big televised event where a lot of comedians come together and try to do something funny for money which is the second and people sick around wearing red noses and trying to raise money like that students in school children mainly but it can be any. I’m not sure the exact date for comic relief a big televised event where lots of comedians come together and try to do something funny for money which is the second and people sick around wearing red noses. I’m trying to raise money like that usually students in school children mainly but they have me”
AWS response:
“I’m not sure the exact date. It’s for Comic relief, a big televised event where a lot of comedians come together on and try to do something funny for money, which is the slogan Andi. People also go around wearing red noses and try and raise. Money like that can generally be anyone. It’s usually students and schoolchildren, mainly, but it can be anyone.”
Microsoft Azure response:
“I’m not sure the exact date it’s for Comic Relief a big televised event where a lot of comedians come together and try to do something funny for money, which is the slogan and people also go around wearing red noses and try and raise money like that. Can generally be anyone. Yeah, it’s usually students and school children mainly, but it can be anyone.”
IBM response:
“%HESITATION I’m not sure the exact date it’s %HESITATION for comic relief a big televised event %HESITATION relative comedians come together and I try to do something funny for money which is the second %HESITATION and people to go around wearing red noses and trying to raise money like that can generally be anyway it’s usually students and school children mainly but it can be anyone”
Rev.ai response:
“Um, I’m not sure of the exact date it’s for comic relief, a big televised event, um, where a lot of comedians come together and try to do something funny for money, which is the slogan. Um, and people also go around wearing red noses and try and raise money like that. It can genuinely be anyone. Yeah. It’s usually students in school, children mainly, but it can be anyone.”
Assembly AI response:
“I’m not sure the exact date. It’s for comic release. I’m not sure the exact date. It’s for comic relief. A big televised event. A big televised event. Where a lot of comedians come together and try to do something funny for money, which is the slogan. Where a lot of comedians come together and try to do something funny for money, which is the slogan and people ought to go around wearing red noses and try and raise money like that. I can generally read anyone. it’s usually students in school children mainly, but it can be anyone. And people also go round wearing red noses and try and raise money like that can generally be anyone? Yeah it’s usually students in school children mainly, but it can be anyone.”
Use case n°1 review:
For this use case, some difficulties in the speech lead to errors for every provider, but Rev.ai clearly provides the best performance. Assembly AI's punctuation handling is also impressive. However, Google and Assembly AI both repeat portions of the text, which can be a problem when integrating the results into a project. Combining results from different APIs, according to their respective strengths, is one way to reach very high performance.
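The text repetition seen in some responses can be flagged automatically before a transcript is integrated into a project. The following is a minimal sketch, not the method used in this benchmark: it assumes that the share of word n-grams occurring more than once is a reasonable signal of repetition. The function name `repeated_ngram_ratio`, the n-gram size and the 0.2 threshold are all illustrative assumptions. The excerpts are taken from the Assembly AI and Rev.ai responses above.

```python
import re
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Return the fraction of word n-grams that occur more than once in text."""
    # Lowercase, unify apostrophes, and keep only word-like tokens.
    words = re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Count every occurrence of an n-gram that appears at least twice.
    repeated = sum(count for count in counts.values() if count > 1)
    return repeated / len(ngrams)


# Excerpts from the responses quoted in this article.
excerpts = {
    "Assembly AI": "I’m not sure the exact date. It’s for comic release. "
                   "I’m not sure the exact date. It’s for comic relief.",
    "Rev.ai": "Um, I’m not sure of the exact date it’s for comic relief, "
              "a big televised event",
}

THRESHOLD = 0.2  # arbitrary illustrative cut-off, to be tuned on real data

for provider, excerpt in excerpts.items():
    ratio = repeated_ngram_ratio(excerpt)
    status = "possible repetition" if ratio > THRESHOLD else "ok"
    print(f"{provider}: {ratio:.2f} ({status})")
```

A check like this only detects repetition; deciding which copy to keep (the repeated passages are not always identical) still requires a comparison with the audio or with another provider's output.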
The second audio file is a 27-second recording of a woman talking about the means of transport she uses:
“In England, we use cars a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. To go on holiday, I go by plane or by boat. However, I do not like flying because I’m scared of heights. And I do not like going by boat because I feel seasick.”
The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:
Google response:
“in England we use cause a lot to travel I go to school on foot or by bike however to go further I would go in the car or on the bus to go on holiday I go by plane go by boat however I do not like flying because I’m scared of heights and I do not like going by boat because I feel seasick in England we use cause a lot to travel I go to school on foot or by bike however to go further I would go in the car or on the bus to go on holiday I go by plane go by boat however I do not like flying because I’m scared of heights and I do not like going by boat because I feel seasick”
AWS response:
“In England, we use cars a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to go on holiday. I go by plane or by boat. However, I do not like flying because I’m scared of heights on. And I do not like going by boat because I feel seasick.”
Microsoft Azure response:
“In England we use cars allowed to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus . to go on holiday. I go by plane or by boat. However, I do not like flying because I’m scared of Heights and I do not like going by boat because I feel seasick.”
IBM response:
“in England we use because a lot to travel I go to school on foot all bye bye however it to go fed that I would go in the call or on the bus to go on holiday I go by plane or by boat however I do not like flying because I’m scared of heights and I do not like going by both because I feel seasick”
Rev.ai response:
“In England, we cause a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to go on holiday. I go by plane or by boat. However, I do not like flying because I am scared of Heights. And I do not like going by boat because I feel seasick.”
Assembly AI response:
“In England, we use cars a lot to travel. I go to school on foot or by bike. However, to go further, I would go in the car or on the bus. to go on holiday, I go by plane or by boat. However, I do not like flying because I am scared of heights and I do not like going by boat because I feel seasick.”
Use case n°2 review:
For this second use case, we can see a huge performance gap between providers. Assembly AI provides a very high level of performance, followed by Rev.ai, which is slightly less accurate but still performs very well. AWS comes next, ahead of Microsoft, Google and IBM, which give weak results compared to Assembly AI and Rev.ai.
The third use case is a phone message left by a man talking about his new phone. It gives a brief look at performance on a phone-quality audio file. Here is the speech:
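A gap like this can be quantified with the word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below is an illustration and not necessarily the metric behind the judgments in this article. It assumes that case and punctuation are ignored, and the helper names `normalize` and `word_error_rate` are our own. The excerpts are the first two sentences of the reference and of some responses above.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, unify apostrophes and drop punctuation (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # reference word missing (deletion)
                current[j - 1] + 1,      # extra hypothesis word (insertion)
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1] / len(ref)


reference = ("In England, we use cars a lot to travel. "
             "I go to school on foot or by bike.")
responses = {
    "AWS": "In England, we use cars a lot to travel. "
           "I go to school on foot or by bike.",
    "Microsoft Azure": "In England we use cars allowed to travel. "
                       "I go to school on foot or by bike.",
    "IBM": "in England we use because a lot to travel "
           "I go to school on foot all bye bye",
}

for provider, response in responses.items():
    print(f"{provider}: WER = {word_error_rate(reference, response):.2f}")
```

Because punctuation is stripped during normalization, this metric says nothing about punctuation quality, which has to be assessed separately.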
“Hi it’s Paul again, I’m very excited I went and got my new IPhone today with the new software. It’s a very very good phone, everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my web browsing. It’s a phone very very neat. Talk to you soon. Bye !”
The Eden AI API returns the responses of the AWS, GCP, IBM and Azure APIs:
Google response:
“Hi it’s Paul again I’m very excited I went and got my new iPhone today with the new software. to very very good phone everyone should get one I love it it does many wonderful things it allows me to do my email on my web browsing it’s a phone very very neat talk to you soon bye”
AWS response:
“Hi It’s Paul again. I’m very excited. I went and got my new iPhone today with the new software. It’s a very, very good phone. Everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my Web browsing. It’s a phone. Very, very neat. Talk to you soon bye.”
Microsoft Azure response:
“Hi it’s Paul again. I’m very excited. I would went and got my new iPhone today with the new software. It’s a very very good phone. Everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my web browsing. It’s a phone. Very very neat. Talk to you soon bye.”
IBM response:
“hi it’s Paul again %HESITATION I’m very excited I went and got my new iPhone today with the new software it’s a very very good phone everyone should get one I love it it does many wonderful things it allows me to do my email on my web browsing it’s a phone very very needs talk to you soon bye”
Rev.ai response:
“Hi, it’s Paul. Again, I’m very excited. I went and got my new iPhone today with the new software. It’s a very, very good phone. Everyone should get one. I love it. It does many wonderful things. It allows me to do my email, my web browsing. It’s a phone. It’s very, very neat. Talk to you soon. Bye.”
AssemblyAI response:
“Hi it’s Paul again I’m very excited. I went and got my new iPhone today with a new software it’s a very, very good phone. Everyone should get one I love it, it does many wonderful things. It allows me to do my email. My web browsing it’s a phone it’s very, very neat. Talk to you soon. Bye.”
Use case n°3 review:
For this third use case, all the providers perform well. Interestingly, each provider handles some difficulties correctly and fails on others, and the difficulties in question differ from one provider to another. For this kind of case, the choice of API is therefore often based on processing speed or pricing.
API costs are defined according to duration thresholds, with decreasing (degressive) prices:
Prices are displayed in dollars per second. There are significant price differences between providers, and 3 price ranges stand out. Google and Rev.ai are the most expensive: for volumes higher than 1M minutes, Google is 360% more expensive than IBM, and Rev.ai 350%. Next come Microsoft and AWS, with similar prices. IBM and AssemblyAI are the least expensive of the panel. Moreover, the pricing presented in this table corresponds to standard offers and may change for requests with specific parameters. For example, Google charges higher prices for models dedicated to videos and phone calls but, conversely, lower prices when users agree to share their data in order to improve Google’s models.
Please note that providers may have changed the prices displayed in this table since this article was written.
We chose these 3 use cases at random. They show that the way to manage a project can differ for each kind of data:
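To make the threshold-based pricing concrete, here is a small sketch of how a bill could be computed under degressive pricing. All numbers are hypothetical placeholders and are not the providers' actual prices. The sketch also assumes that each price applies only to the volume falling inside its tier (marginal pricing), which may differ from a given provider's billing rules. The names `tiered_cost` and `percent_more_expensive` are illustrative.

```python
from typing import Optional

# (upper bound of the tier, price per unit of audio); None means "no upper bound".
# Thresholds and prices must use the same duration unit (seconds or minutes).
Tier = tuple[Optional[float], float]


def tiered_cost(volume: float, tiers: list[Tier]) -> float:
    """Cost of `volume` units of audio, each tier priced on its own slice."""
    cost = 0.0
    lower = 0.0
    for upper, price in tiers:
        if volume <= lower:
            break
        top = volume if upper is None else min(volume, upper)
        cost += (top - lower) * price
        if upper is None:
            break
        lower = upper
    return cost


def percent_more_expensive(price_a: float, price_b: float) -> float:
    """How much more expensive A is than B, in percent of B's price."""
    return (price_a - price_b) / price_b * 100


# Hypothetical price grids for two made-up providers (NOT real prices).
provider_a: list[Tier] = [(1_000_000, 0.0004), (None, 0.0002)]
provider_b: list[Tier] = [(250_000, 0.0003), (None, 0.00025)]

volume = 2_000_000  # hypothetical monthly volume
cost_a = tiered_cost(volume, provider_a)
cost_b = tiered_cost(volume, provider_b)
print(f"Provider A: ${cost_a:,.2f}")
print(f"Provider B: ${cost_b:,.2f}")
print(f"A vs B: {percent_more_expensive(cost_a, cost_b):+.0f}%")
```

Running the same volume through every provider's grid shows where the ranking changes as volume grows, which a comparison of per-unit prices alone can hide.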
Depending on the use case, the best way to obtain the highest performance differs. Google, AWS, IBM and Microsoft support speech-to-text for many languages. In comparison, Assembly AI and Rev.ai currently support only English (in several regional variants), but they are working on models for other languages. Another important point: unlike IBM and Google, the APIs from Amazon, Microsoft, Rev.ai and Assembly AI handle punctuation, which is a very important feature. Of course, other provider-specific features can make the difference depending on your project. We highly recommend checking each provider's optional parameters, as they can change your choice!
For GCP, AWS, Azure and Watson, we do not need to call each API directly: the Eden AI Speech-to-Text API returns the results of all 4 providers with a single request, in a few lines of code. Rev.ai and Assembly AI are not yet implemented on Eden AI, so we use their APIs directly.
With Eden AI, you get fast access to results from various providers, which gives you a better idea of which solution fits your needs best. Other providers will be added to Eden AI in the future.
The decision-making process is as follows:
First, you run your data on Eden AI to benchmark the solutions available on the market. Then you have 3 options:
This process helps ensure that you make the right choice for your project. Eden AI is simply a tool that lets you run a benchmark easily and quickly. Finally, you can also use the Eden AI API for the entire project, which avoids managing accounts and billing with many providers and keeps the flexibility of not being tied to a single provider.
For Speech-to-Text solutions, pricing is an important element in decision making because there are large differences between providers. This is especially true for large volumes.
You can directly start building now. If you have any questions, feel free to chat with us!