Commit 4da3cf3

feat: [google-cloud-texttospeech] Support HD voice custom pronunciations (#13721)
- [ ] Regenerate this pull request now.

BEGIN_COMMIT_OVERRIDE
feat: Support HD voice custom pronunciations
docs: A comment for method `StreamingSynthesize` in service `TextToSpeech` is changed
docs: A comment for enum value `OGG_OPUS` in enum `AudioEncoding` is changed
docs: A comment for enum value `PCM` in enum `AudioEncoding` is changed
docs: A comment for field `low_latency_journey_synthesis` in message `.google.cloud.texttospeech.v1.AdvancedVoiceOptions` is changed
docs: A comment for enum value `PHONETIC_ENCODING_IPA` in enum `PhoneticEncoding` is changed
docs: A comment for enum value `PHONETIC_ENCODING_X_SAMPA` in enum `PhoneticEncoding` is changed
docs: A comment for field `phrase` in message `.google.cloud.texttospeech.v1.CustomPronunciationParams` is changed
docs: A comment for field `pronunciations` in message `.google.cloud.texttospeech.v1.CustomPronunciations` is changed
docs: A comment for message `MultiSpeakerMarkup` is changed
docs: A comment for field `custom_pronunciations` in message `.google.cloud.texttospeech.v1.SynthesisInput` is changed
docs: A comment for field `voice_clone` in message `.google.cloud.texttospeech.v1.VoiceSelectionParams` is changed
docs: A comment for field `audio_encoding` in message `.google.cloud.texttospeech.v1.StreamingAudioConfig` is changed
docs: A comment for field `text` in message `.google.cloud.texttospeech.v1.StreamingSynthesisInput` is changed
END_COMMIT_OVERRIDE

PiperOrigin-RevId: 742280480
Source-Link: googleapis/googleapis@2059f0f
Source-Link: googleapis/googleapis-gen@1308d49
Copy-Tag: eyJwIjoicGFja2FnZXMvZ29vZ2xlLWNsb3VkLXRleHR0b3NwZWVjaC8uT3dsQm90LnlhbWwiLCJoIjoiMTMwOGQ0OTUyMWNkOGFmZTg2ZjIxZTJiNWMxYzQzNzM4NzFlMTM1ZCJ9

---------

Co-authored-by: Owl Bot <gcf-owl-bot[bot]@users.noreply.github.com>
1 parent a22aac0 commit 4da3cf3

5 files changed, +65 -40 lines changed

packages/google-cloud-texttospeech/google/cloud/texttospeech_v1/services/text_to_speech/async_client.py

Lines changed: 1 addition & 1 deletion
@@ -546,7 +546,7 @@ def streaming_synthesize(
 metadata: Sequence[Tuple[str, Union[str, bytes]]] = (),
 ) -> Awaitable[AsyncIterable[cloud_tts.StreamingSynthesizeResponse]]:
 r"""Performs bidirectional streaming speech synthesis:
-receive audio while sending text.
+receives audio while sending text.

 .. code-block:: python


packages/google-cloud-texttospeech/google/cloud/texttospeech_v1/services/text_to_speech/client.py

Lines changed: 1 addition & 1 deletion
@@ -961,7 +961,7 @@ def streaming_synthesize(
 metadata: Sequence[Tuple[str, Union[str, bytes]]] = (),
 ) -> Iterable[cloud_tts.StreamingSynthesizeResponse]:
 r"""Performs bidirectional streaming speech synthesis:
-receive audio while sending text.
+receives audio while sending text.

 .. code-block:: python


packages/google-cloud-texttospeech/google/cloud/texttospeech_v1/services/text_to_speech/transports/grpc.py

Lines changed: 1 addition & 1 deletion
@@ -385,7 +385,7 @@ def streaming_synthesize(
 r"""Return a callable for the streaming synthesize method over gRPC.

 Performs bidirectional streaming speech synthesis:
-receive audio while sending text.
+receives audio while sending text.

 Returns:
 Callable[[~.StreamingSynthesizeRequest],

packages/google-cloud-texttospeech/google/cloud/texttospeech_v1/services/text_to_speech/transports/grpc_asyncio.py

Lines changed: 1 addition & 1 deletion
@@ -396,7 +396,7 @@ def streaming_synthesize(
 r"""Return a callable for the streaming synthesize method over gRPC.

 Performs bidirectional streaming speech synthesis:
-receive audio while sending text.
+receives audio while sending text.

 Returns:
 Callable[[~.StreamingSynthesizeRequest],

packages/google-cloud-texttospeech/google/cloud/texttospeech_v1/types/cloud_tts.py

Lines changed: 61 additions & 36 deletions
@@ -91,11 +91,11 @@ class AudioEncoding(proto.Enum):
 MP3 audio at 32kbps.
 OGG_OPUS (3):
 Opus encoded audio wrapped in an ogg
-container. The result will be a file which can
-be played natively on Android, and in browsers
-(at least Chrome and Firefox). The quality of
-the encoding is considerably higher than MP3
-while using approximately the same bitrate.
+container. The result is a file which can be
+played natively on Android, and in browsers (at
+least Chrome and Firefox). The quality of the
+encoding is considerably higher than MP3 while
+using approximately the same bitrate.
 MULAW (5):
 8-bit samples that compand 14-bit audio
 samples using G.711 PCMU/mu-law. Audio content
@@ -107,7 +107,7 @@ class AudioEncoding(proto.Enum):
 PCM (7):
 Uncompressed 16-bit signed little-endian
 samples (Linear PCM). Note that as opposed to
-LINEAR16, audio will not be wrapped in a WAV (or
+LINEAR16, audio won't be wrapped in a WAV (or
 any other) header.
 """
 AUDIO_ENCODING_UNSPECIFIED = 0
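
The `PCM` comment above says the samples arrive without a WAV (or any other) header. A minimal sketch of what that implies for a caller who wants a playable file, using only the standard-library `wave` module: the helper name `wrap_pcm_in_wav`, the mono channel count, and the 24000 Hz default are assumptions for illustration and are not stated in this commit.

```python
import io
import wave


def wrap_pcm_in_wav(pcm_bytes: bytes, sample_rate_hertz: int = 24000) -> bytes:
    """Wrap headerless 16-bit PCM samples in a WAV container.

    Hypothetical helper: the sample rate default and mono layout are
    assumptions, so pass the rate that was actually requested.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)  # assumption: single-channel audio
        wav_file.setsampwidth(2)  # 16-bit samples are 2 bytes wide
        wav_file.setframerate(sample_rate_hertz)
        wav_file.writeframes(pcm_bytes)
    return buffer.getvalue()


# 100 samples of silence stand in for synthesized audio.
silence = b"\x00\x00" * 100
wav_bytes = wrap_pcm_in_wav(silence)
```

The result starts with a RIFF/WAVE header followed by the unchanged sample bytes, which is the difference between the `PCM` and `LINEAR16` descriptions in the docstring.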
@@ -202,7 +202,7 @@ class AdvancedVoiceOptions(proto.Message):
 Attributes:
 low_latency_journey_synthesis (bool):
 Only for Journey voices. If false, the
-synthesis will be context aware and have higher
+synthesis is context aware and has a higher
 latency.

 This field is a member of `oneof`_ ``_low_latency_journey_synthesis``.
@@ -268,10 +268,10 @@ class CustomPronunciationParams(proto.Message):

 Attributes:
 phrase (str):
-The phrase to which the customization will be
-applied. The phrase can be multiple words (in
-the case of proper nouns etc), but should not
-span to a whole sentence.
+The phrase to which the customization is
+applied. The phrase can be multiple words, such
+as proper nouns, but shouldn't span the length
+of the sentence.

 This field is a member of `oneof`_ ``_phrase``.
 phonetic_encoding (google.cloud.texttospeech_v1.types.CustomPronunciationParams.PhoneticEncoding):
@@ -292,10 +292,10 @@ class PhoneticEncoding(proto.Enum):
 PHONETIC_ENCODING_UNSPECIFIED (0):
 Not specified.
 PHONETIC_ENCODING_IPA (1):
-IPA. (e.g. apple -> ˈæpəl )
+IPA, such as apple -> ˈæpəl.
 https://en.wikipedia.org/wiki/International_Phonetic_Alphabet
 PHONETIC_ENCODING_X_SAMPA (2):
-X-SAMPA (e.g. apple -> "{p@l" )
+X-SAMPA, such as apple -> "{p@l".
 https://en.wikipedia.org/wiki/X-SAMPA
 """
 PHONETIC_ENCODING_UNSPECIFIED = 0
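
The two enum comments above give the same word in both encodings. As a rough illustration of how the notations relate, the sketch below maps the IPA example symbol by symbol to X-SAMPA. It reads the leading `"` in the docstring's X-SAMPA example as the primary-stress mark, which is an interpretation (the docstring may simply be quoting the string). The table is deliberately partial and the helper `ipa_to_xsampa` is hypothetical, not part of the library.

```python
# Partial IPA -> X-SAMPA table covering only the symbols in the "apple" example.
_IPA_TO_XSAMPA = {
    "ˈ": '"',  # primary stress
    "æ": "{",  # near-open front unrounded vowel
    "ə": "@",  # schwa
    "p": "p",
    "l": "l",
}


def ipa_to_xsampa(ipa: str) -> str:
    """Translate an IPA string using the partial table above.

    Raises ValueError for any symbol the table does not cover, rather
    than guessing a mapping.
    """
    try:
        return "".join(_IPA_TO_XSAMPA[symbol] for symbol in ipa)
    except KeyError as error:
        raise ValueError(f"no mapping for IPA symbol {error.args[0]!r}") from error


apple_xsampa = ipa_to_xsampa("ˈæpəl")
```

A production conversion would need the full X-SAMPA chart; this only shows that the two enum values describe the same pronunciation in different alphabets.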
@@ -325,8 +325,7 @@ class CustomPronunciations(proto.Message):

 Attributes:
 pronunciations (MutableSequence[google.cloud.texttospeech_v1.types.CustomPronunciationParams]):
-The pronunciation customizations to be
-applied.
+The pronunciation customizations are applied.
 """

 pronunciations: MutableSequence["CustomPronunciationParams"] = proto.RepeatedField(
@@ -345,7 +344,7 @@ class MultiSpeakerMarkup(proto.Message):
 """

 class Turn(proto.Message):
-r"""A Multi-speaker turn.
+r"""A multi-speaker turn.

 Attributes:
 speaker (str):
@@ -405,21 +404,19 @@ class SynthesisInput(proto.Message):

 This field is a member of `oneof`_ ``input_source``.
 custom_pronunciations (google.cloud.texttospeech_v1.types.CustomPronunciations):
-Optional. The pronunciation customizations to
-be applied to the input. If this is set, the
-input will be synthesized using the given
+Optional. The pronunciation customizations
+are applied to the input. If this is set, the
+input is synthesized using the given
 pronunciation customizations.

-The initial support will be for EFIGS (English,
-French, Italian, German, Spanish) languages, as
-provided in VoiceSelectionParams. Journey and
-Instant Clone voices are not supported yet.
+The initial support is for en-us, with plans to
+expand to other locales in the future. Instant
+Clone voices aren't supported.

 In order to customize the pronunciation of a
 phrase, there must be an exact match of the
 phrase in the input types. If using SSML, the
-phrase must not be inside a phoneme tag
-(entirely or partially).
+phrase must not be inside a phoneme tag.
 """

 text: str = proto.Field(
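
The `custom_pronunciations` comment above requires an exact match of each phrase in the input. A minimal sketch of a client-side pre-check that builds a plain-dict `SynthesisInput` payload follows. The helper `build_synthesis_input` is hypothetical; the `pronunciation` key is taken from the library's `CustomPronunciationParams` message rather than from this diff; treating "exact match" as a case-sensitive substring test is an assumption; and whether a given client call accepts this dict unchanged should be checked against the library documentation.

```python
def build_synthesis_input(text: str, ipa_by_phrase: dict) -> dict:
    """Build a dict-shaped SynthesisInput with IPA custom pronunciations.

    Hypothetical helper. Rejects phrases that do not occur verbatim in
    the text, since the service documentation requires an exact match.
    """
    missing = [phrase for phrase in ipa_by_phrase if phrase not in text]
    if missing:
        raise ValueError(f"phrases not found verbatim in input: {missing}")
    return {
        "text": text,
        "custom_pronunciations": {
            "pronunciations": [
                {
                    "phrase": phrase,
                    "phonetic_encoding": "PHONETIC_ENCODING_IPA",
                    "pronunciation": ipa,
                }
                for phrase, ipa in ipa_by_phrase.items()
            ]
        },
    }


synthesis_input = build_synthesis_input("I ate an apple.", {"apple": "ˈæpəl"})
```

Failing early on a missing phrase is a design choice for this sketch; the diff does not say how the service itself treats a phrase that has no match.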
@@ -481,8 +478,9 @@ class VoiceSelectionParams(proto.Message):
 the custom voice matching the specified configuration.
 voice_clone (google.cloud.texttospeech_v1.types.VoiceCloneParams):
 Optional. The configuration for a voice clone. If
-[VoiceCloneParams.voice_clone_key] is set, the service will
-choose the voice clone matching the specified configuration.
+[VoiceCloneParams.voice_clone_key] is set, the service
+chooses the voice clone matching the specified
+configuration.
 """

 language_code: str = proto.Field(
@@ -519,10 +517,10 @@ class AudioConfig(proto.Message):
 stream.
 speaking_rate (float):
 Optional. Input only. Speaking rate/speed, in the range
-[0.25, 4.0]. 1.0 is the normal native speed supported by the
+[0.25, 2.0]. 1.0 is the normal native speed supported by the
 specific voice. 2.0 is twice as fast, and 0.5 is half as
 fast. If unset(0.0), defaults to the native 1.0 speed. Any
-other values < 0.25 or > 4.0 will return an error.
+other values < 0.25 or > 2.0 will return an error.
 pitch (float):
 Optional. Input only. Speaking pitch, in the range [-20.0,
 20.0]. 20 means increase 20 semitones from the original
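
The hunk above narrows the documented `speaking_rate` range from [0.25, 4.0] to [0.25, 2.0]. A small client-side sketch of the documented rule follows, useful for catching values that were valid under the old upper bound. The function `effective_speaking_rate` and the constant names are hypothetical, and the check mirrors the docstring only; it is not the service's own validation.

```python
MIN_SPEAKING_RATE = 0.25
MAX_SPEAKING_RATE = 2.0  # documented upper bound after this commit (was 4.0)


def effective_speaking_rate(speaking_rate: float = 0.0) -> float:
    """Return the rate the docstring says applies, or raise if out of range.

    Hypothetical helper: 0.0 means "unset" and maps to the native 1.0 speed.
    """
    if speaking_rate == 0.0:
        return 1.0
    if not MIN_SPEAKING_RATE <= speaking_rate <= MAX_SPEAKING_RATE:
        raise ValueError(
            f"speaking_rate {speaking_rate} is outside "
            f"[{MIN_SPEAKING_RATE}, {MAX_SPEAKING_RATE}]"
        )
    return speaking_rate
```

For example, `effective_speaking_rate(4.0)` raises under the new bound, while `effective_speaking_rate()` falls back to 1.0.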
@@ -669,12 +667,18 @@ class StreamingAudioConfig(proto.Message):

 Attributes:
 audio_encoding (google.cloud.texttospeech_v1.types.AudioEncoding):
-Required. The format of the audio byte stream. For now,
-streaming only supports PCM and OGG_OPUS. All other
-encodings will return an error.
+Required. The format of the audio byte stream. Streaming
+supports PCM, ALAW, MULAW and OGG_OPUS. All other encodings
+return an error.
 sample_rate_hertz (int):
 Optional. The synthesis sample rate (in
 hertz) for this audio.
+speaking_rate (float):
+Optional. Input only. Speaking rate/speed, in the range
+[0.25, 2.0]. 1.0 is the normal native speed supported by the
+specific voice. 2.0 is twice as fast, and 0.5 is half as
+fast. If unset(0.0), defaults to the native 1.0 speed. Any
+other values < 0.25 or > 2.0 will return an error.
 """

 audio_encoding: "AudioEncoding" = proto.Field(
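
The updated `audio_encoding` comment lists four encodings that streaming accepts. A minimal sketch of building a dict-shaped `StreamingAudioConfig` that rejects anything else before a request is sent follows. The helper `build_streaming_audio_config` and the `STREAMING_ENCODINGS` constant are hypothetical, and representing the enum by its name as a string is an assumption about how the payload is later converted.

```python
from typing import Optional

# Encodings the updated docstring lists as supported for streaming.
STREAMING_ENCODINGS = frozenset({"PCM", "ALAW", "MULAW", "OGG_OPUS"})


def build_streaming_audio_config(
    audio_encoding: str, sample_rate_hertz: Optional[int] = None
) -> dict:
    """Build a dict-shaped StreamingAudioConfig (hypothetical helper)."""
    if audio_encoding not in STREAMING_ENCODINGS:
        raise ValueError(
            f"{audio_encoding} is not documented as a streaming encoding; "
            f"expected one of {sorted(STREAMING_ENCODINGS)}"
        )
    config = {"audio_encoding": audio_encoding}
    if sample_rate_hertz is not None:
        # sample_rate_hertz is optional, so it is only set when provided.
        config["sample_rate_hertz"] = sample_rate_hertz
    return config


pcm_config = build_streaming_audio_config("PCM", sample_rate_hertz=24000)
```

Requesting `"MP3"` here raises locally, which matches the docstring's statement that all other encodings return an error.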
@@ -686,6 +690,10 @@ class StreamingAudioConfig(proto.Message):
 proto.INT32,
 number=2,
 )
+speaking_rate: float = proto.Field(
+proto.DOUBLE,
+number=3,
+)


 class StreamingSynthesizeConfig(proto.Message):
@@ -699,6 +707,20 @@ class StreamingSynthesizeConfig(proto.Message):
 streaming_audio_config (google.cloud.texttospeech_v1.types.StreamingAudioConfig):
 Optional. The configuration of the
 synthesized audio.
+custom_pronunciations (google.cloud.texttospeech_v1.types.CustomPronunciations):
+Optional. The pronunciation customizations
+are applied to the input. If this is set, the
+input is synthesized using the given
+pronunciation customizations.
+
+The initial support is for en-us, with plans to
+expand to other locales in the future. Instant
+Clone voices aren't supported.
+
+In order to customize the pronunciation of a
+phrase, there must be an exact match of the
+phrase in the input types. If using SSML, the
+phrase must not be inside a phoneme tag.
 """

 voice: "VoiceSelectionParams" = proto.Field(
@@ -711,6 +733,11 @@ class StreamingSynthesizeConfig(proto.Message):
 number=4,
 message="StreamingAudioConfig",
 )
+custom_pronunciations: "CustomPronunciations" = proto.Field(
+proto.MESSAGE,
+number=5,
+message="CustomPronunciations",
+)


 class StreamingSynthesisInput(proto.Message):
@@ -722,10 +749,8 @@ class StreamingSynthesisInput(proto.Message):
 text (str):
 The raw text to be synthesized. It is
 recommended that each input contains complete,
-terminating sentences, as this will likely
-result in better prosody in the output audio.
-That being said, users are free to input text
-however they please.
+terminating sentences, which results in better
+prosody in the output audio.

 This field is a member of `oneof`_ ``input_source``.
 """

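The `StreamingSynthesisInput.text` comment recommends that each input contain complete, terminating sentences. One way to follow that advice when text arrives in arbitrary fragments (for example, from another streaming source) is to buffer until a sentence boundary appears. The sketch below is illustrative only: `sentences_from_fragments` is a hypothetical helper, and the boundary rule (terminal punctuation followed by whitespace) is a simplifying assumption that does not handle abbreviations.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? when followed by whitespace, so "3.14" stays whole.
_SENTENCE_END = re.compile(r"[.!?]+(?=\s)")


def sentences_from_fragments(fragments: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences from arbitrarily split text fragments."""
    buffer = ""
    for fragment in fragments:
        buffer += fragment
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            if sentence:
                yield sentence
            buffer = buffer[match.end():]
    # Flush whatever remains once the input is exhausted.
    tail = buffer.strip()
    if tail:
        yield tail


chunks = list(sentences_from_fragments(["Hello th", "ere. How are", " you? Fi", "ne"]))
```

Each yielded sentence could then be sent as the `text` of one streaming input message; how the requests are framed around it depends on the client and is outside this sketch.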