Help Center

When choosing an online solution for converting text to speech, it is recommended that you pay attention to the number of languages and realistic voices offered by the service; the ability to add special effects; conversion cost; and the ability to download and/or share the converted audio files.

Experiments show that we can distinguish an artificial voice if it speaks for more than one minute. The reason is that the voice stays the same throughout: one speed, one timbre, one tonality. An actor does not behave like that. An actor changes intonation, speed, and pitch, and uses pauses of different lengths. These effects are what make voice acting realistic.

It’s hard to say who will be the first to create a truly realistic synthesized voice. But we can be sure of one thing: within a few years we will not be able to tell who is talking to us on the phone, a real person or a computer. And the realism of the voices will be achieved primarily through the organic use of voice effects.

Do you wonder how to pick the right accent for your message? Here are three recommendations to help you get started:

  • Define your audience: In which country and in which city do they live? What language do they speak?

  • Find out which accent is popular in everyday life and which one is used for business communication (it may be the same accent, or there may be two or more).

  • Decide if you want your message to sound official or casual.

Ask yourself these simple questions and it is likely that the answers will give you a new perspective on your audience and help you find common ground more easily.

With Kukarella, you get easy backdoor access to voice transcription software from Google that supports more than 120 languages and accents. According to Google, the accuracy of its online transcription service is between 90% and 95%.

Google has the best voice recognition technology. But it’s hard to use if you are not a developer!

To start with, you would need to create a developer account, enable an API, and configure its settings. Then you would have to work with unfriendly interfaces that aren’t optimized for end users.

With Kukarella, you get easy backdoor access to all languages and transcription services in the Google databases. Instantly!

And you will also get access to lots of cool features and tools which you won’t find anywhere else. Even on Google!

How long does it take to transcribe audio manually? At least as long as the audio itself. With our audio to text converter you can transcribe audio and video twice as fast while keeping costs under $0.20 per minute of audio. That’s instead of the industry standard of $1 or more per minute for manual transcription.
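As a rough illustration of the arithmetic above, the sketch below compares the two quoted rates for a recording of a given length. The function and constant names are made up for this example, and the rates are the bounds quoted above (at most $0.20 per minute for automatic, at least $1 per minute for manual), not an actual price list.

```python
# Rates quoted in the paragraph above. They are bounds, not exact prices.
AUTOMATIC_MAX_USD_PER_MIN = 0.20  # "under $0.20 per minute of audio"
MANUAL_MIN_USD_PER_MIN = 1.00     # "$1 or more per minute"


def transcription_costs(audio_minutes: float) -> dict:
    """Return rough cost bounds in USD for a recording of the given length."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    automatic = audio_minutes * AUTOMATIC_MAX_USD_PER_MIN
    manual = audio_minutes * MANUAL_MIN_USD_PER_MIN
    return {
        "automatic_max_usd": round(automatic, 2),
        "manual_min_usd": round(manual, 2),
        "minimum_saving_usd": round(manual - automatic, 2),
    }


# A 45-minute interview: at most $9 automatic versus at least $45 manual.
print(transcription_costs(45))
```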

Whether you want to transcribe a Zoom call, podcast, film, lecture, meeting note, or YouTube or Vimeo video (you name it), you can do that with the Kukarella audio to text converter in a few simple steps. The software can transcribe from over 120 languages and accents.

You can be sure that the transcription will be done quickly and accurately. Well, as long as you speak loudly and clearly enough and avoid heavy accents.

And what is also good: no one will listen to your conversation or read its transcript. It stays strictly confidential.

Audio recognition can sometimes be inaccurate or even fail to detect any speech. This can happen for a number of reasons, such as low volume, high background noise, or audio distortion. A good rule of thumb is this: if Siri, Alexa, or Google Assistant on your phone can understand what is being said in an audio recording, then the transcription service should understand it too.

To get the best transcription results, please try the following:

  • Use a high-quality recording with little background noise and a clear voice;
  • Try editing the recording to reduce noise and increase its speaking volume using programs such as Audacity (free) or Adobe Audition (paid);
  • You can also try selecting “1 Speaker”, as it sometimes gives better results than multiple speakers.
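If you would rather script the volume step than use an editor, the sketch below shows one simple approach, peak normalization, using only Python's standard library. It is a minimal illustration, not how Kukarella or Audacity process audio. It assumes a 16-bit PCM WAV file, the function name is made up for this example, and it does not reduce noise: any background noise is amplified along with the voice.

```python
import array
import sys
import wave


def normalize_peak(src_path: str, dst_path: str, target_peak: float = 0.9) -> float:
    """Scale a 16-bit PCM WAV so its loudest sample reaches target_peak of full scale.

    Returns the gain factor that was applied.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM audio")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    peak = max((abs(s) for s in samples), default=0)
    # Silent or empty files are left unchanged.
    gain = 1.0 if peak == 0 else (target_peak * 32767) / peak

    # Apply the gain and clamp to the valid 16-bit range.
    scaled = array.array(
        "h", (max(-32768, min(32767, round(s * gain))) for s in samples)
    )
    if sys.byteorder == "big":
        scaled.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(scaled.tobytes())
    return gain
```

For example, `normalize_peak("quiet.wav", "louder.wav")` would write a louder copy of a hypothetical `quiet.wav` and return the gain it applied.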

Do you want to transcribe all types of audio and video files? Kukarella allows you to convert MP3, WAV, MP4, MOV, and more to text. You can also convert soundtracks from YouTube and Vimeo videos. Just add their URLs.

Popular audio and video formats that the Kukarella online audio to text converter can transcribe include: *.aac, *.aif, *.aifc, *.au, *.avi, *.flac, *.m4a, *.m4v, *.mov, *.mp2, *.mp3, *.mp4, *.mpa, *.mpe, *.mpg, *.mpga, *.mts, *.oga, *.ogg, *.ovg, *.opus, *.ts, *.wav, *.webm, *.wma, *.wmv
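If you are preparing many files at once, a quick local check against this list can save a failed upload. The sketch below is a hypothetical helper, not part of Kukarella. The extensions are copied verbatim from the list above, and Kukarella may accept other formats as well.

```python
from pathlib import Path

# Extensions copied from the list above; the service may accept others too.
LISTED_EXTENSIONS = frozenset(
    "aac aif aifc au avi flac m4a m4v mov mp2 mp3 mp4 mpa mpe mpg mpga "
    "mts oga ogg ovg opus ts wav webm wma wmv".split()
)


def is_listed_format(filename: str) -> bool:
    """Return True if the file's extension appears in the list above."""
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in LISTED_EXTENSIONS


print(is_listed_format("Interview.MP3"))  # True
print(is_listed_format("notes.docx"))     # False
```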

Upload any type of audio or video and get your transcription in minutes.

Option one: do it yourself. That is very tedious and will take loads of time, which you could spend more efficiently.

Option two: hire someone. It can be your employee, a freelancer from Upwork or Fiverr, or an agency. The transcription will take at least twice as long as the audio recording itself. The cost will be from $15 to $60 per hour of audio. On top of that, forget about privacy.

Option three: online audio to text converters. The best technologies, such as Google’s, support around 120 languages and accents. The accuracy is between 90% and 95%. The problem is that, until now, only large companies could use these technologies. We decided that ordinary users should also have access to the most advanced audio transcription tools. And so we created Kukarella.

Automatic transcription allows you to convert voice to text with high accuracy (usually 90-95%), in real time (for live recordings) or twice as fast as manual transcription (for uploaded files), and for a really competitive fee ($5 per hour of audio).

It also guarantees full privacy, since nobody except you will see the text or listen to your audio. When working with an agency or remote freelancer, you never know who can see your audio or video files, or how they will be used in the future.

With online transcription software, you can transcribe personal notes, business conversations, interviews, or meeting notes, and nobody except you will have access to your audio or read your text.

Our service provides an interface to our text to speech providers, which include Google Text-to-Speech, Amazon Polly, IBM Watson Text-to-Speech, and Azure Text-to-Speech. We add some extra functionality on top of their services and abstract away much of the difficulty of using them. However, we are still bound by their limitations.

Google's voices have a sample rate of 24,000 Hz (with three exceptions, which are lower at 22,050 Hz).

IBM's voices have a sample rate of 22,050 Hz. We can specify different values, but IBM states that its service will only upsample or downsample the audio, not truly generate it at a different sample rate.

Amazon has several sample rates available, but the highest is also 24,000 Hz. We use the highest sample rate available for each type of voice.

Finally, Azure's voices support output at a 48,000 Hz sample rate, which is the rate we use for these voices. Specifically: 48,000 Hz sample rate, 16-bit depth, single channel (mono).

Please note that stereo (2-channel) audio does not make sense for a synthesized voice, as there is no spatial information to capture in the stereo field :)
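To put these numbers in perspective, the uncompressed data rate of PCM audio is sample rate × bit depth × channels. The sketch below applies that formula to the Azure settings above and shows that a stereo file would simply be twice the size with no extra information. The function name is illustrative, and the 16-bit depth on the 24,000 Hz line is an assumption, since only Azure's bit depth is stated here.

```python
def pcm_bitrate(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Return the uncompressed PCM data rate in bits per second."""
    return sample_rate_hz * bit_depth * channels


azure_mono = pcm_bitrate(48_000, 16, channels=1)
azure_stereo = pcm_bitrate(48_000, 16, channels=2)
assumed_24k = pcm_bitrate(24_000, 16)  # assumption: 16-bit depth

print(azure_mono)    # 768000
print(azure_stereo)  # 1536000
print(assumed_24k)   # 384000
```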

For all of our voices, we use the highest sample rate and the highest bit depth that each provider offers. Unfortunately, a higher sample rate is simply not available on most platforms.

As a final step, our service encodes the audio as MP3 for storage, with settings chosen to preserve the highest possible quality in these files.

Of course, if you are creating a voice-over for a complex, emotional text or dialogue, it is not yet possible to replace a professional actor. In this case, the actor not only reads the text but also creates a character, an image. So if you have a challenging dramatic task and a sufficient budget, a professional actor can handle the task better than a computer voice.

However, today many businesses use synthesized voices to convert text to speech. The fear of using computer TTS voices is fading into the background, giving way to advantages such as the ability to get a high-quality voice-over quickly and inexpensively.

Computer voices have become more realistic, and the choice among them is growing. If you come across robotic voices among the synthesized ones, such as the voice used by Stephen Hawking, they are the exception.

Today you can choose not only the type of voice, but also the accent and intonation, and sometimes the temperament of the narrator. Computer voices are so natural and realistic that we often don’t notice them in YouTube or TikTok videos or when we hear announcements at airports and train stations.

Want to be heard by your audience and avoid the unnecessary cost of reworking your messages? Then you need to ask yourself these questions regularly:

  • Who is my audience? You’re going to talk differently to a forty-year-old mother of three than you will to a six-year-old boy.

  • What information do I want to deliver to them?

  • Where will they see or hear my message?

  • What reaction do I expect from them?

A voice cannot be good or bad. It can only be SUITABLE or NOT SUITABLE: this is the main criterion. A beautiful, lush voice that is not appropriate in the specific circumstances will destroy the message. And another voice, perhaps even one that resonates unpleasantly, may be remembered and even bring incredible success to your company.

There is no magic to this. It’s just a puzzle. The main thing is to match theme and voice.

On Kukarella you can conduct a quick experiment that will take a few days and cost you about as much as a cup of coffee. By doing that you can get a response from your audience, which will save a lot of time and prevent errors.

The standard terms of service of Google, Amazon, Microsoft, and IBM give ownership of the resulting sound recording copyrights to the user of the application, as long as the text is an original text created or otherwise legally used by that user.

That means that if you hold the copyright to the text, you hold the copyright to the recordings as well. The crucial consideration governing that right is that you ‘Created’ the work, meaning that there is a modicum of creativity.

Kukarella allows commercial use of audio created with any paid plan.

The highest-quality text to speech online software, predictably, turned out to be from giants like Google, Microsoft, IBM and Amazon. But these platforms aren’t designed for end users. Rather, they’re meant as a B2B solution.

To start converting text to speech, users need to create an account with each platform, adjust lots of settings, and even write code. That’s why we created Kukarella, which gives users easy access to the most realistic voices from the most popular providers.