Should you use automated text-to-audio conversions for on-line narration?
It's the holy grail of narration recording: automated narration generated directly from written scripts. It can reduce development time, improve consistency of recording, and greatly reduce the cost of recording translated versions, but does it work for learning material? The answer depends on whether you look at it from a development-resources perspective or a learning-efficiency perspective.
First, let's define what we are discussing. Text-to-speech software attempts to match sounds to the arrangement of linguistic symbols and connect those sounds together, producing the illusion of speech. These packages use an engine to process text files, and one or more language files that define which sounds apply to each element of the written text. This separation means that you can apply a different language, or even just a different-sounding voice, to the same file. This is the great appeal of using a text-to-speech application for recording narration: the re-recording time for different languages is almost negligible, and each translation gets a consistent reading.
A learning development group producing audio material can write a single script and translate it into multiple languages. Changes can be made and implemented by anyone who has access to the text-to-speech engine. These options are very attractive from a project management viewpoint.
However, from a quality-of-learning perspective, there are significant limitations. The primary one is that while the engine software can interpret the linguistic symbols, and the voice file(s) can present sounds that represent words, neither imparts meaning to the words or sentences. When humans communicate verbally, they produce consistent and recognizable units of sound known as phonemes. Speakers then shape those sounds with tone, stress, and rhythm (collectively known as prosody) according to what they mean to say.
These prosodic patterns become a standard way of saying and hearing words, and allow the listener not only to recognize a word but also to anticipate what type of words may follow. In a tonal language such as Mandarin Chinese, a wrongly rendered tone changes or destroys the meaning of a word, but even non-tonal languages such as English are affected by improperly rendered stress and intonation. Such errors increase the cognitive effort the listener needs to comprehend a word in context, and distract from the listening material.
This is the major failing of the publicly available text-to-speech engines: even if the voice quality is excellent, the engine does not comprehend the meaning of each phrase, so the sentences come out flat and monotone. The listener must struggle to understand the meaning, and the difficulty increases with the length of the material. Instead of improving comprehension by engaging another sense, the narration distracts the learner. Comprehension is reduced compared to using a live voice narrator.
There is also a rather significant licensing cost associated with using the voices produced for text-to-speech engines in released work. Depending upon the producing company, the quality of the voice, the intended distribution method, and the usage agreements, it may cost tens of thousands of dollars to use one voice for all of your projects. Though this is probably less than the long-term cost of using a professional narrator, it is a cost to consider.
What does this mean for your team as you evaluate the feasibility of an automated narration application? Ultimately, it comes down to your service to your learners and their reactions to the narration. If you produce multimedia learning material using automated narration and the response is favorable, then you have another tool at your disposal.
Here are a few things to keep in mind to produce the best quality narration using a typical text-to-speech engine:
Use short sentences
The longer the sentence, the more likely that the lack of proper emphasis on words and phrases becomes apparent.
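One way to act on this tip is to check a script for long sentences before it goes to the engine. The sketch below is only an illustration: the function name `long_sentences` and the 20-word default limit are assumptions, not values from any particular text-to-speech product, and the sentence splitting is deliberately naive.

```python
import re


def long_sentences(script, max_words=20):
    """Return (word_count, sentence) pairs for sentences over max_words.

    The 20-word default is an arbitrary starting point, not a standard.
    """
    # Naive split: break after ., ! or ? when followed by whitespace.
    # Abbreviations such as "e.g." will be split incorrectly.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    flagged = []
    for sentence in sentences:
        count = len(sentence.split())
        if count > max_words:
            flagged.append((count, sentence))
    return flagged


if __name__ == "__main__":
    sample = "Click Next. This sentence has exactly six words."
    for count, sentence in long_sentences(sample, max_words=5):
        print(f"{count} words: {sentence}")
```

Anything the check flags is a candidate for splitting into two sentences; the right limit is best found by listening to the rendered audio.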
Use phonetic replacements
When a text-to-speech engine cannot recognize a word, it typically spells it out. Many words may also be read incorrectly for their context. Pronunciation errors are common with technical terms and with words such as “live”, which is spoken differently depending on its meaning. Most text-to-speech engines allow you to add phonetic spellings that replace such words and improve the accuracy of the reading.
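Where an engine has no built-in pronunciation dictionary, the same idea can be approximated by respelling problem words in the script before it is synthesized. The following is a minimal sketch under that assumption: the `REPLACEMENTS` table and the `apply_replacements` function are hypothetical, and the respellings would need to be tuned by ear for the engine and voice in use.

```python
import re

# Hypothetical respellings for illustration only.
REPLACEMENTS = {
    "SQL": "sequel",
    "GUI": "gooey",
    "cache": "cash",
}


def apply_replacements(text, replacements=REPLACEMENTS):
    """Swap whole words for phonetic respellings before synthesis."""
    if not replacements:
        return text
    # Try longer keys first so a longer entry wins over a shorter one
    # that starts the same way.
    keys = sorted(replacements, key=len, reverse=True)
    # \b limits matches to whole words, so "caches" is left untouched.
    # Matching is case-sensitive.
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: replacements[m.group(1)], text)


if __name__ == "__main__":
    print(apply_replacements("Clear the cache, then open the SQL editor."))
```

A whole-word table like this cannot tell apart the two readings of a word such as “live”, because it does not look at context. Those cases still need a respelling chosen by hand for each occurrence.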