You might have used text to speech services and noticed a distinctly robotic sound that doesn’t sound like a real person at all. How do we go from that to something that sounds like a living breathing person? Is it truly possible to bridge the gap and create fully synthetic audio that not only sounds believable, but is able to deliver the proper emotion, nuance, and annunciations that a real person would be able to identify? Can it be done in a feasible way that will let creators make thousands of unique voiced characters come to life, without heavy costs? We think it’s not only possible but closer to reality than you might think.
Text to speech has always been a popular option because it’s been away for those with visual or reading disabilities to absorb content and have it read to them instead of having to struggle with comprehension. Many industries however are seeking to take it a step further and are attempting to use this technology to not only read out words but act them out with accuracy in tone with a unique voice each use.
Where Text to Speech Comes In
Video games are one primary user of this technology. For example, for voice acting in video games, large scripts with thousands of lines of dialogue are commonplace all over the industry, and there are significant resources devoted to hiring actors, voicing these lines, editing them, and implementing them into the final product. Sometimes, this phase can last through the entire development cycle and leaves little room for error as games grow larger and larger with more dialogue needed than ever before.
If a smaller developer is planning a game with 2,000 npcs, but only has the budget to hire 10 actors, their only options are to keep a majority of npcs silent or reuse the same actors for hundreds of characters. This might all be changing, as this new technology can help them give every single of those 2000 characters a voice of their own.
How This Helps Developers
With speech synthesis, game developers are aiming to cut back on costs, while expanding the potential for their games. Having natural voices generated instantly is already possible thanks to advancements in machine learning and artificial intelligence helping the voices sound more genuine to listeners’ ears. As more text has been transcribed over time, things like context and usage have been learned by these services to deliver words and sentences the correct way, with the proper feelings and emotions attached depending on where they need to be. As the cycle of sentences being analyzed and transcribed continues, these will only improve.
Text to speech can be a massive time saver for developers who need to fill large worlds with immersive characters. For main characters, extra attention can always be given by attaching real voices, but for extras such as NPCs or one-off roles, the need to provide as much attention might just be a thing of the past without players being able to tell the difference at all.