Optimizing Japanese text-to-speech with Amazon Polly
Amazon Polly is a cloud service that converts text input into lifelike speech (text-to-speech, or TTS) in a range of 61 voices across 29 languages by using advanced deep learning technologies. The Amazon Polly service supports companies in developing digital products that use speech synthesis for a variety of use cases, including automated contact centers, language learning platforms, translation apps, and article narration.
Amazon Polly currently offers two Japanese voices in its portfolio. Japanese is a language that poses many challenges for a TTS system because of the complexities of its writing system.
This post provides an overview of the challenges the Japanese language presents to TTS, how Amazon Polly addresses those challenges, and what is available to developers to optimize the customer experience.
Japanese is a challenging language for TTS
The Japanese writing system consists mainly of three scripts—kanji, hiragana, and katakana—that you can use interchangeably in many cases. For example, you can write the word for “candle” in kanji (蝋燭), hiragana (ろうそく), or katakana (ロウソク). Kanji are logographic characters, and hiragana and katakana (together called kana) are syllabic characters that more closely represent pronunciation. Japanese sentences almost always contain a mixture of both kanji and kana.
The multitude of scripts has allowed Japanese speakers to be creative with their writing system, and kanji compounds are sometimes read differently than what you might expect from the component characters (ateji). This is even more pronounced in person names, in which you cannot always predict the pronunciation of the name based on the sequence of characters.
One of the first steps in the TTS front end is to split the sentence into words, which presents another challenge in Japanese. In English, you can deterministically separate words with spaces, so the task is straightforward. Japanese strings words together without spaces in between, so you need models to predict where one word ends and the next begins. Imagine, in English, separating a sequence of letters such as Applesonatable into individual words—your knowledge of the language tells you that it should be “Apples on a table” rather than “Apple son at able.” You have to teach the model to do this.
Furthermore, the pronunciation of Japanese words depends heavily on the surrounding context. The same sequence of kanji characters may be pronounced in different ways with different meanings (homographs) depending on the context. Homographs present some of the greatest challenges for TTS. The following examples illustrate this difference in pronunciation:
東京都 pronounced “Tokyo-to,” meaning “Tokyo metropolitan city”
東京都 pronounced “Higashi-Kyoto,” meaning “East Kyoto”
行った pronounced “itta,” meaning “went”
行った pronounced “okonatta,” meaning “performed”
You could split 東京都に行った into 東京/都/に/行った, in which case you would pronounce it “Tokyo-to ni itta,” but in the case of 東/京都/に/行った, the pronunciation is “Higashi-Kyoto ni itta.” In both of these cases 行った is pronounced “itta”. But in a different context, such as 東京都に行った事業の報告をする, it could take the second meaning (“performed”), and you would pronounce it “okonatta” rather than “itta.”
Additionally, Japanese is a pitch accent language, which means that a difference in pitch accent can make a difference in the meaning of a word. For example, 雨 (pronounced “ámè,” meaning “rain”) vs. 飴 (pronounced “àmé,” meaning “sweets”). The written form of the word in Japanese doesn’t indicate the pitch accent.
To deal with these difficulties, Amazon Polly employs several machine learning (ML) models in its Japanese TTS system. The ML models use information about the surrounding words and their syntactic (grammatical) and morphological (word structure) information to predict the pronunciation of a word or the pitch accent and phrasing. These models are useful in generalizing patterns in the language and allow you to predict the pronunciation and intonation for sentences that have never been synthesized.
Although Amazon Polly’s models are continually improved, there are cases in which the service can’t predict the correct pronunciation. Humans often infer contextual information from broader cultural or situational knowledge, enabling them to understand written text even when the written context is sparse. Some of these types of information are not available to the current TTS model, or the model can’t yet make use of the information available to make an accurate prediction. There are even cases in which native speakers struggle to predict the correct pronunciation because they lack background knowledge. This is particularly common in the case of person names and place names. For example, you can read a popular first name such as 愛 in at least 28 different ways, including “Ai” (あい), “Megumi” (めぐみ), “Manami” (まなみ), and “Mana” (まな).
To work around these difficulties, there are several ways in which you can control the pronunciation of Japanese text.
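All of the control methods described in this post are expressed as SSML sent to the Amazon Polly SynthesizeSpeech API. As a minimal sketch of how that request is made with the AWS SDK for Python (the helper names here are hypothetical, and running the actual call requires AWS credentials):

```python
def wrap_ssml(body: str) -> str:
    """Wrap an SSML fragment in the required <speak> root element."""
    return f"<speak>{body}</speak>"

def synthesize_ssml(ssml_text: str, voice_id: str = "Mizuki",
                    output_path: str = "speech.mp3") -> None:
    """Send SSML to Amazon Polly and save the resulting audio as MP3."""
    import boto3  # AWS SDK for Python; requires configured credentials

    polly = boto3.client("polly")
    response = polly.synthesize_speech(
        Text=ssml_text,
        TextType="ssml",   # tell Polly the input is SSML, not plain text
        VoiceId=voice_id,  # Mizuki and Takumi are Polly's Japanese voices
        OutputFormat="mp3",
    )
    with open(output_path, "wb") as f:
        f.write(response["AudioStream"].read())

ssml = wrap_ssml("日本橋に行った。")
# synthesize_ssml(ssml)  # uncomment to call Polly with AWS credentials
```

The `TextType="ssml"` parameter is what enables the tags discussed in the following sections.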
Controlling pronunciation by specifying the word boundary
Sometimes specifying the word boundary is sufficient to indicate the desired pronunciation. For example, 東京都 (Tokyo Metropolitan City) normally reads “Tokyo-to.” Here, you can interpret the word as consisting of two words: 東京 (Tokyo) and 都 (“metropolis,” pronounced “to”). But there is an alternative interpretation that is equally valid, where the pronunciation is “Higashi-Kyoto,” meaning “East Kyoto.” In that case, the word boundary falls between 東 (“east,” pronounced “higashi”) and 京都 (Kyoto). By applying an explicit word boundary marker, you can obtain the desired pronunciation.
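The original example markup did not survive extraction; as a sketch, assuming the standard SSML word element `<w>` is the boundary marker in question, the input would look like this:

```xml
<speak>東<w>京都</w>に行った</speak>
```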
This overrides the default pronunciation for 東京都 and results in the pronunciation “Higashi-Kyoto” rather than “Tokyo-to.” The word boundary tag is a form of SSML (Speech Synthesis Markup Language) that you can use to control pronunciation in TTS by surrounding individual words.
Controlling pronunciation with furigana
Specifying word boundaries is useful, but only works if the sequence of characters consists of multiple words. There are many cases in which a single word (or a single character) has multiple possible pronunciations. For example, the place name 日本橋 in Tokyo is pronounced にほんばし (“Nihonbashi”), but if you’re referring to the same place name in Osaka, it’s pronounced にっぽんばし (“Nipponbashi”). Without context, you do not know which pronunciation is required.
The simplest way to specify pronunciation in such cases is to add furigana in parentheses after the word.
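The original audio examples are not reproduced here; the input text would look like the following (the sentence framing is assumed for illustration, using the two readings of 日本橋 described above):

```text
日本橋(にほんばし)を歩いた。
日本橋(にっぽんばし)を歩いた。
```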
Furigana is a pronunciation aid that disambiguates the pronunciation of kanji words. You can see this format of indicating furigana in parentheses after kanji in books and newspapers. Furigana is easy to use because Japanese speakers are familiar with it.
However, the furigana string is visible in the text, which may not be desirable in some use cases. It also does not work well when the furigana does not match one of the pronunciations Amazon Polly recognizes, for example, 海(やま). This is often the case with person names that use non-standard readings of characters, for example, 七音(どれみ).
To overcome this problem, Amazon Polly has a special SSML type attribute for Japanese, which allows customers to specify the pronunciation using furigana. This method is recommended for controlling pronunciation whenever possible.
You can use it as shown in the following example to enforce the pronunciation “Nipponbashi” (にっぽんばし) for 日本橋, where the default pronunciation is “Nihonbashi” (にほんばし):
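The audio example is omitted from this extraction; the corresponding SSML input (sentence framing assumed) would be:

```xml
<speak><phoneme type="ruby" ph="にっぽんばし">日本橋</phoneme>に行った</speak>
```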
The phoneme type is called ruby, referring to the small annotations usually placed above or next to kanji characters to show furigana in Japanese text.
Using this syntax, you can apply any furigana (pronunciation) to any sequence of characters. The phoneme type ruby works irrespective of word boundaries. You can use it for parts of words, for example, to tag only the kanji part of a word without the okurigana, as in the following example:
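The original audio examples are not reproduced here; a sketch of such inputs, tagging only the kanji 行 and leaving the okurigana った outside the tag (the surrounding sentences are assumed, based on the 行った homograph discussed earlier):

```xml
<speak>東京都に<phoneme type="ruby" ph="い">行</phoneme>った</speak>
<speak>事業を<phoneme type="ruby" ph="おこな">行</phoneme>った</speak>
```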
If the furigana does not match one of the pronunciations that Amazon Polly recognizes for the word, the service predicts the pitch accent, which may not be correct. For example, you can enforce the pronunciation うみ for the character 山, but because this is not one of the recognized pronunciations for that character, the pitch accent may not be what you expect.
Furigana specified with the ph attribute can be in hiragana or katakana. However, if the furigana is not in a valid format, the specified pronunciation doesn’t have an effect, and the system falls back to the default pronunciation predicted for the tagged string. Furigana must not contain non-kana characters such as digits or symbols, and unconventional furigana usage such as double choonpu (prolonged vowel mark) is not considered valid.
The following examples have furigana in an incorrect format, which doesn’t take effect:
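The specific examples did not survive extraction; as illustrative reconstructions (the tagged words are assumed), a ph value containing a digit and one containing a double choonpu would both be rejected:

```xml
<speak><phoneme type="ruby" ph="にほん2ばし">日本橋</phoneme></speak>
<speak><phoneme type="ruby" ph="コーーシ">格子</phoneme></speak>
```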
If you use the ruby tag without any string inside it, nothing is synthesized. The following example shows an empty phoneme tag:
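A sketch of such an input (the ph value is assumed for illustration), where the tag encloses no text and therefore produces no speech:

```xml
<speak><phoneme type="ruby" ph="にほんばし"></phoneme></speak>
```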
Controlling pronunciation with pronunciation kana
The phoneme type ruby is sometimes not powerful enough because it does not provide fine control over the pitch accent. Not only does a change in pitch accent change the meaning of a word, but the same word can also have different pitch accents depending on the surrounding context. For example, the word “speech” is 音声, pronounced オ’ンセー with the pitch accent on the first mora (a timing unit similar to a syllable) when pronounced on its own. Here, the apostrophe denotes the pitch accent: the mora after which the pitch of the word goes down. The word “synthesis” is 合成, pronounced as unaccented ゴーセー on its own.
However, when the two words combine to form the compound word 音声合成 (“speech synthesis”), the pitch accent position (notated by the apostrophe) shifts, and the compound word is pronounced オンセーゴ’ーセー, not オ’ンセーゴーセー. Predicting the pitch accent of words is therefore a challenge for TTS; ML models can resolve it, but they sometimes make mistakes. The SSML alphabet attribute x-amazon-pron-kana, which uses a pronunciation notation called pronunciation kana, can help you specify the pitch accent directly and explicitly.
You can distinguish two ways of saying 毎日新聞を読む by using the SSML alphabet attribute x-amazon-pron-kana: one meaning “I read the newspaper every day” and the other meaning “I read the Mainichi newspaper.” The first pattern is the default pitch accent that Amazon Polly currently predicts.
The following example shows how you can enforce the second pitch accent pattern:
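The original markup is not preserved in this extraction; as a sketch (the exact kana and accent position in the ph value are assumptions for illustration), the second reading could be enforced like this:

```xml
<speak><phoneme alphabet="x-amazon-pron-kana" ph="マイニチシ'ンブンオ">毎日新聞を</phoneme>読む</speak>
```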
The phoneme alphabet x-amazon-pron-kana (pronunciation kana) is notated in the katakana character set and is similar to furigana, but differs in several ways. Ignoring these differences can lead to unnatural pronunciation.
Furigana doesn’t always reflect pronunciation accurately, but pronunciation kana does. For example, the particle は is transcribed ワ in pronunciation kana, not ハ. The particle へ is transcribed エ, not ヘ, and the particle を is transcribed オ, not ヲ. 格子 and 仔牛 are both written こうし in furigana, but コーシ and コウシ respectively in pronunciation kana.
In pronunciation kana, the apostrophe indicates the location of the pitch fall. If there is no apostrophe, the word or phrase is assumed to be unaccented (平板アクセント).
The following examples illustrate the difference in pitch accent:
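The audio examples are omitted here; using the 雨/飴 pair from earlier (the carrier sentences are assumed), the accented and unaccented patterns can be written as:

```xml
<speak><phoneme alphabet="x-amazon-pron-kana" ph="ア'メ">雨</phoneme>が降る</speak>
<speak><phoneme alphabet="x-amazon-pron-kana" ph="アメ">飴</phoneme>を買う</speak>
```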
Each word can have a maximum of one pitch accent. The term word here means a prosodic word, which has at most one pitch fall. If the word or phrase you are tagging has multiple pitch accents (multiple peaks in pitch), you must separate the pronunciation kana for each word with a space, or place them in separate phoneme tags.
The following examples illustrate the difference in intonation depending on the prosodic word grouping. They all have the same pronunciation kana, but the intonation is different because you can split the phrase into prosodic words in different ways.
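The audio examples did not survive extraction; reusing the 音声合成 compound discussed above, the same kana grouped as one prosodic word or as two produces different intonation (the markup framing is assumed):

```xml
<speak><phoneme alphabet="x-amazon-pron-kana" ph="オンセーゴ'ーセー">音声合成</phoneme></speak>
<speak><phoneme alphabet="x-amazon-pron-kana" ph="オ'ンセー ゴーセー">音声合成</phoneme></speak>
```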
Unlike phoneme type="ruby", phoneme alphabet="x-amazon-pron-kana" enforces word boundaries around the tagged string. When tagging words with okurigana, you need to tag the entire word, including the okurigana.
The following table summarizes some of the differences between furigana (in katakana) and pronunciation kana.
Sentence: 彼は北欧神話に登場する神を紹介した。
Furigana (in katakana): カレハホクオウシンワニトウジョウスルカミヲショウカイシタ
Pronunciation kana (x-amazon-pron-kana): カ'レワ ホクオーシ'ンワニ トージョースル カ'ミオ ショーカイシタ
Summary
This post explained some of the challenges in developing a Japanese TTS system due to the complexities of its writing system. The service employs several ML models to predict the pronunciation and pitch accent, and also offers several Japanese-specific SSML tags to help you create natural-sounding content with Japanese Amazon Polly voices more easily. This post aims to help speed up your development process and make your experience working with Amazon Polly more pleasant. For the full list of languages and voices that Amazon Polly offers, see Voices in Amazon Polly.
About the Author
Kayoko Yanagisawa is a Senior Research Scientist in the Amazon Text-to-Speech (TTS) team. She worked on the launch of Amazon’s Japanese TTS voices for Amazon Polly and Amazon Alexa. When not working on making machines speak, she enjoys playing the viola she made herself.