Total found: 3049. Displayed: 100.
Publication date: 12-04-2012

Speech synthesizer, speech synthesizing method and program product

Number: US20120089402A1
Assignee: Toshiba Corp

According to one embodiment, a speech synthesizer includes an analyzer, a first estimator, a selector, a generator, a second estimator, and a synthesizer. The analyzer analyzes text and extracts a linguistic feature. The first estimator selects a first prosody model adapted to the linguistic feature and estimates prosody information that maximizes a first likelihood representing probability of the selected first prosody model. The selector selects speech units that minimize a cost function determined in accordance with the prosody information. The generator generates a second prosody model that is a model of the prosody information of the speech units. The second estimator estimates prosody information that maximizes a third likelihood calculated on the basis of the first likelihood and a second likelihood representing probability of the second prosody model. The synthesizer generates synthetic speech by concatenating the speech units on the basis of the prosody information estimated by the second estimator.
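
The two-stage estimation described above can be illustrated with a small sketch: the prosody candidate finally used is the one maximizing a combined ("third") likelihood built from the linguistic prosody model and the model of the selected units. This is a minimal sketch under assumed interfaces; `log_likelihood` and the candidate list are hypothetical stand-ins, not the patent's implementation.

```python
import math

def reestimate_prosody(candidates, first_model, second_model, weight=0.5):
    """Pick the prosody candidate that maximizes a combined likelihood of the
    linguistic prosody model (first) and the model built from the selected
    speech units (second)."""
    best, best_score = None, -math.inf
    for prosody in candidates:
        log_l1 = first_model.log_likelihood(prosody)   # first likelihood
        log_l2 = second_model.log_likelihood(prosody)  # second likelihood
        log_l3 = (1 - weight) * log_l1 + weight * log_l2  # combined score
        if log_l3 > best_score:
            best, best_score = prosody, log_l3
    return best
```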

Publication date: 05-07-2012

Multi-lingual text-to-speech system and method

Number: US20120173241A1

A multi-lingual text-to-speech system and method processes a text to be synthesized via an acoustic-prosodic model selection module and an acoustic-prosodic model mergence module, and obtains a phonetic unit transformation table. In an online phase, the acoustic-prosodic model selection module, according to the text and a phonetic unit transcription corresponding to the text, uses at least a set controllable accent weighting parameter to select a transformation combination and find a second and a first acoustic-prosodic models. The acoustic-prosodic model mergence module merges the two acoustic-prosodic models into a merged acoustic-prosodic model, according to the at least a controllable accent weighting parameter, processes all transformations in the transformation combination and generates a merged acoustic-prosodic model sequence. A speech synthesizer and the merged acoustic-prosodic model sequence are further applied to synthesize the text into an L1-accent L2 speech.
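
A rough sketch of the model mergence step with a controllable accent weight follows. Plain per-parameter linear interpolation between the two acoustic-prosodic models is an assumption made for illustration; the patent's actual mergence rule may differ.

```python
def merge_models(model_l1, model_l2, accent_weight):
    """accent_weight = 0.0 -> pure L2 model, 1.0 -> fully L1-accented model."""
    merged = {}
    for name, value_l2 in model_l2.items():
        value_l1 = model_l1.get(name, value_l2)
        merged[name] = accent_weight * value_l1 + (1 - accent_weight) * value_l2
    return merged

# Example: synthesize L2 speech with a moderate L1 accent.
merged_model = merge_models({"f0_mean": 130.0}, {"f0_mean": 110.0}, accent_weight=0.5)
```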

Publication date: 23-08-2012

Hearing assistance system for providing consistent human speech

Number: US20120215532A1
Assignee: Apple Inc

Broadly speaking, the embodiments disclosed herein describe an apparatus, system, and method that allows a user of a hearing assistance system to perceive consistent human speech. The consistent human speech can be based upon user specific preferences.

Publication date: 20-09-2012

Apparatus and method for supporting reading of document, and computer readable medium

Number: US20120239390A1
Assignee: Toshiba Corp

According to one embodiment, an apparatus for supporting reading of a document includes a model storage unit, a document acquisition unit, a feature information extraction unit, and an utterance style estimation unit. The model storage unit is configured to store a model which has been trained on a correspondence relationship between first feature information and an utterance style. The first feature information is extracted from a plurality of sentences in a training document. The document acquisition unit is configured to acquire a document to be read. The feature information extraction unit is configured to extract second feature information from each sentence in the document to be read. The utterance style estimation unit is configured to compare the second feature information of a plurality of sentences in the document to be read with the model, and to estimate an utterance style of each sentence of the document to be read.

Publication date: 27-12-2012

Speech synthesizer, navigation apparatus and speech synthesizing method

Number: US20120330667A1
Assignee: HITACHI LTD

Included in a speech synthesizer, a natural language processing unit divides text data, input from a text input unit, into a plurality of components (particularly, words). An importance prediction unit estimates an importance level of each component according to how much each component contributes to understanding when a listener hears the synthesized speech. The speech synthesizer then determines a processing load based on the device state during synthesis processing and on the importance level. Included in the speech synthesizer, a synthesizing control unit and a wave generation unit reduce the processing time for a phoneme with a low importance level by curtailing its processing load (relatively degrading its sound quality), allocate a part of the processing time made available by this reduction to the processing time of a phoneme with a high importance level, and generate synthesized speech in which important words are easily audible.
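
A minimal sketch of this importance-driven load allocation: low-importance words get a cheaper synthesis pass and the time saved is redistributed to the most important words. The thresholds and quality tiers below are assumptions for illustration only.

```python
def plan_synthesis(words, importance, budget_ms_per_word=10.0):
    """words: list of strings; importance: list of scores in [0, 1]."""
    plan = []
    saved = 0.0
    for word, imp in zip(words, importance):
        if imp < 0.3:                       # low importance: curtail processing
            plan.append((word, "fast"))
            saved += 0.5 * budget_ms_per_word
        else:
            plan.append((word, "high_quality"))
    # redistribute the saved time to the words that matter most
    bonus_ms = saved / max(1, sum(1 for i in importance if i >= 0.7))
    return plan, bonus_ms
```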

Publication date: 28-03-2013

Methods and Apparatus for Rapid Acoustic Unit Selection From a Large Speech Corpus

Number: US20130080176A1
Assignee: AT&T INTELLECTUAL PROPERTY II, L.P.

A speech synthesis system can select recorded speech fragments, or acoustic units, from a very large database of acoustic units to produce artificial speech. The selected acoustic units are chosen to minimize a combination of target and concatenation costs for a given sentence. However, as concatenation costs, which are measures of the mismatch between sequential pairs of acoustic units, are expensive to compute, processing can be greatly reduced by pre-computing and caching the concatenation costs. The number of possible sequential pairs of acoustic units makes such caching prohibitive. Statistical experiments reveal that while about 85% of the acoustic units are typically used in common speech, less than 1% of the possible sequential pairs of acoustic units occur in practice. The system synthesizes a large body of speech, identifies the acoustic unit sequential pairs generated and their respective concatenation costs, and stores those concatenation costs likely to occur. 1. A method comprising: when, while synthesizing speech, an acoustic unit sequential pair does not have an associated concatenation cost in a concatenation cost database: assigning a default value as the associated concatenation cost; and updating the concatenation cost database by synthesizing a body of speech, identifying the acoustic unit sequential pair in the body of speech, and recording a respective concatenation cost in the concatenation cost database. 2. The method of claim 1, further comprising synthesizing the speech using the respective concatenation cost. 3. The method of claim 1, wherein recording the respective concatenation cost comprises: assigning a value to each acoustic unit in the acoustic unit sequential pair; and determining a difference associated with the value assigned to each acoustic unit, to yield the respective concatenation cost. 4. The method of claim 1, wherein the concatenation cost database contains a portion of all possible concatenation costs associated with ...
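
The caching scheme in the claims can be sketched in a few lines: unseen unit pairs fall back to a default cost, and the cache is populated offline by synthesizing a large body of speech. `join_cost` and the default value are hypothetical placeholders, not the patent's actual mismatch measure.

```python
DEFAULT_COST = 1.0e6          # assumed penalty for unit pairs never seen in training
concat_cost_db = {}           # (unit_a, unit_b) -> cached concatenation cost

def concatenation_cost(unit_a, unit_b):
    pair = (unit_a, unit_b)
    if pair in concat_cost_db:
        return concat_cost_db[pair]
    return DEFAULT_COST       # fall back when the pair is not in the database

def update_cache_from_corpus(unit_pairs, join_cost):
    """Synthesize a large body of speech offline, then record the costs of the
    sequential pairs that actually occurred (typically under 1% of all pairs)."""
    for pair in unit_pairs:
        concat_cost_db[pair] = join_cost(*pair)
```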

Publication date: 04-04-2013

SPEECH SAMPLES LIBRARY FOR TEXT-TO-SPEECH AND METHODS AND APPARATUS FOR GENERATING AND USING SAME

Number: US20130085759A1
Assignee: VIVOTEXT LTD.

A method for converting text into speech with a speech sample library is provided. The method comprises converting an input text to a sequence of triphones; determining musical parameters of each phoneme in the sequence of triphones; detecting, in the speech sample library, speech segments having at least the determined musical parameters; and concatenating the detected speech segments. 1. A method for converting text into speech with a speech sample library, comprising: converting an input text to a sequence of triphones; determining musical parameters of each phoneme in the sequence of triphones; detecting, in the speech sample library, speech segments having at least the determined musical parameters; and concatenating the detected speech segments. 2. The method of claim 1, further comprising: adjusting the musical parameters of speech segments prior to concatenating the speech segments. 3. The method of claim 1, wherein the at least one musical parameter is any one of: a pitch curve, a pitch perception, duration, and a volume. 4. The method of claim 3, wherein a value of a musical vector is an index indicative of a sub range in which its respective at least one musical parameter lies. 5. The method of claim 1, wherein the sequence of triphones includes overlapping triphones. 6. The method of claim 2, wherein determining the musical parameters of each phoneme in the sequence of triphones further includes: providing a set of numerical targets for each of the musical parameters. 7. The method of claim 6, wherein detecting the speech segments having at least the determined musical parameters further includes: searching the speech sample library for at least one of a central phoneme, phonemic context, and a musical index indicating at least one range of at least one of the musical parameters within which at least one of the numerical targets lies. 8. The method of claim 1, wherein each of the speech segments comprises at ...
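
As a sketch of the triphone lookup described above, phonemes can be turned into overlapping triphones and each triphone matched against library segments whose musical parameters fall closest to the targets. The library structure and the distance used for matching are assumptions for illustration.

```python
def to_triphones(phonemes):
    """Overlapping triphones with silence padding at both ends."""
    padded = ["sil"] + phonemes + ["sil"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

def select_segments(phonemes, targets, library):
    """targets: per-phoneme dicts of musical parameters; library: triphone -> segments."""
    segments = []
    for tri, tgt in zip(to_triphones(phonemes), targets):
        candidates = library.get(tri, [])
        best = min(candidates,
                   key=lambda seg: abs(seg["pitch"] - tgt["pitch"])
                                 + abs(seg["duration"] - tgt["duration"]),
                   default=None)
        segments.append(best)
    return segments
```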

Publication date: 04-04-2013

TRAINING AND APPLYING PROSODY MODELS

Number: US20130085760A1
Author: James H. Stephens, Jr.
Assignee: MORPHISM LLC

Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles. 126-. (canceled)27. A computer-implementable method for synthesizing audible speech , with varying prosody , from textual content , the method comprising:maintaining an inventory of prosody models with lexicons,selecting a subset of multiple prosody models from the inventory of prosody models;associating prosody models in the subset of multiple prosody models with different segments of a text based on phrases in the text statistically associated with the lexicons of the prosody models;applying the associated prosody models to the different segments of the text to produce prosody annotations for the text;considering annotations of the prosody annotations to reconcile conflicting prosody annotations due to multiple prosody models associated with a segment of the text; andsynthesizing audible speech from the text and the reconciled prosody annotations.28. The method of claim 27 , wherein the reconciling is based on a reconciliation policy.29. The method of claim 28 , wherein the reconciliation policy considers the annotations of the prosody annotations that comprise a prosody model identifier and a prosody model confidence for the prosody annotation.30. The method of claim 29 , wherein annotations of the prosody annotations are represented by markup elements that indicate the scope of the tagged text.31. The method of claim 30 , wherein the reconciliation eliminates conflicting annotations that result from applications of multiple models.32. The method of claim 31 , wherein the selecting is based on input parameters.33. The ...

Publication date: 02-05-2013

FACILITATING TEXT-TO-SPEECH CONVERSION OF A DOMAIN NAME OR A NETWORK ADDRESS CONTAINING A DOMAIN NAME

Number: US20130110512A1
Assignee: RESEARCH IN MOTION LIMITED

A method and apparatus of facilitating text-to-speech conversion of a domain name are provided. At a processor of a computing device, a pronunciation of a top level domain of a network address is determined by one or more of: generating a phonetic representation of each character in the top level domain pronounced individually; and generating a tokenized representation of each individual character of the top level domain suitable for interpretation by a text-to-speech engine. For each other level domain of the network address, at the processor, a pronunciation of the other level domain is determined based on one or more recognized words within the other level domain. 1. A method comprising: determining, at a processor of a computing device, a pronunciation of a top level domain of a network address by one or more of: generating a phonetic representation of each character in the top level domain pronounced individually; and generating a tokenized representation of each individual character of the top level domain suitable for interpretation by a text-to-speech engine; and, for each other level domain of the network address, determining, at the processor, a pronunciation of the other level domain based on one or more recognized words within the other level domain. 2. The method of claim 1, wherein the determining the pronunciation of a top level domain of a network address further comprises determining whether said top level domain is one of a set of top level domains. 3. The method of claim 2, wherein the set represents top level domains that are pronounced as a whole. 4. The method of claim 1, wherein the determining the pronunciation of a top level domain of a network address further comprises one or more of: generating a phonetic representation of the top level domain pronounced as a whole; and generating a tokenized representation of the top level domain as a whole suitable for interpretation by a text-to-speech engine. 5. The method of claim 1, wherein the ...
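
A toy sketch of the pronunciation rule: the top level domain is either spoken as a whole word or spelled letter by letter, while the other levels are read as recognized words where possible. The word list and the "whole word" TLD set below are illustrative assumptions.

```python
SPOKEN_AS_WHOLE = {"com", "org", "net"}       # assumed set of TLDs read as words
KNOWN_WORDS = {"research", "in", "motion"}    # assumed lexicon of recognized words

def pronounce_address(host):
    labels = host.lower().split(".")
    tld, rest = labels[-1], labels[:-1]
    # TLD: whole word if recognized, otherwise each character individually
    tld_tokens = [tld] if tld in SPOKEN_AS_WHOLE else list(tld)
    other_tokens = []
    for label in rest:
        # each other level: read as a recognized word, else spell it out
        other_tokens.extend([label] if label in KNOWN_WORDS else list(label))
    return other_tokens + ["dot"] + tld_tokens

print(pronounce_address("research.ca"))  # ['research', 'dot', 'c', 'a']
```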

Publication date: 09-05-2013

SPEECH SYNTHESIZER, SPEECH SYNTHESIS METHOD, AND SPEECH SYNTHESIS PROGRAM

Number: US20130117026A1
Author: Kato Masanori
Assignee: NEC Corporation

State duration creation means creates a state duration indicating a duration of each state in a hidden Markov model, based on linguistic information and a model parameter of prosody information. Duration correction degree computing means derives a speech feature from the linguistic information, and computes a duration correction degree which is an index indicating a degree of correcting the state duration, based on the derived speech feature. State duration correction means corrects the state duration based on a phonological duration correction parameter and the duration correction degree, the phonological duration correction parameter indicating a correction ratio of correcting a phonological duration. 1-10. (canceled) 11. A speech synthesizer comprising: a state duration creation unit for creating a state duration indicating a duration of each state in a hidden Markov model, based on linguistic information and a model parameter of prosody information; a duration correction degree computing unit for deriving a speech feature from the linguistic information, and computing a duration correction degree based on the derived speech feature, the duration correction degree being an index indicating a degree of correcting the state duration; and a state duration correction unit for correcting the state duration based on a phonological duration correction parameter and the duration correction degree, the phonological duration correction parameter indicating a correction ratio of correcting a phonological duration. 12. The speech synthesizer according to claim 11, wherein the duration correction degree computing unit estimates a temporal change degree of the speech feature derived from the linguistic information, and computes the duration correction degree based on the estimated temporal change degree. 13. The speech synthesizer according to claim 12, wherein the duration correction degree computing unit estimates a temporal change degree of a spectrum or a pitch from ...
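
One way to picture the correction step: each HMM state duration is scaled by the phonological correction ratio, attenuated by the correction degree derived from the speech feature. The linear blend below is an assumed combination rule for illustration, not the patent's formula.

```python
def correct_state_durations(durations, correction_ratio, correction_degree):
    """correction_degree in [0, 1]: 0 leaves durations unchanged,
    1 applies the full correction ratio."""
    corrected = []
    for d in durations:
        effective_ratio = 1.0 + correction_degree * (correction_ratio - 1.0)
        corrected.append(d * effective_ratio)
    return corrected

# e.g. lengthen by 20%, but only to the degree the speech feature allows
print(correct_state_durations([3, 5, 4], correction_ratio=1.2, correction_degree=0.5))
```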

Publication date: 16-05-2013

VIDEO GENERATION BASED ON TEXT

Number: US20130124206A1
Author: Rezvani Behrooz, ROUHI Ali
Assignee: Seyyer, Inc.

Techniques for generating a video sequence of a person based on a text sequence, are disclosed herein. Based on the received text sequence, a processing device generates the video sequence of a person to simulate visual and audible emotional expressions of the person, including using an audio model of the person's voice to generate an audio portion of the video sequence. The emotional expressions in the visual portion of the video sequence are simulated based a priori knowledge about the person. For instance, the a priori knowledge can include photos or videos of the person captured in real life. 1. A method comprising:inputting a text sequence at a processing device; andgenerating, by the processing device, a video sequence of a person based on the text sequence to simulate visual and audible emotional expressions of the person, including using an audio model of the person's voice to generate an audio portion of the video sequence.2. The method of claim 1 , wherein the processing device is a mobile device claim 1 , the text sequence is inputted from a second mobile device via a Short Message Service (SMS) channel claim 1 , and said generating a video sequence of a person comprises generating claim 1 , by the mobile device claim 1 , a video sequence of a person based on shared information stored on the mobile device and the second mobile device.3. The method of claim 1 , wherein the text sequence includes a set of words including at least one word claim 1 , and wherein the video sequence is generated such that the person appears to utter the words in the video sequence.4. The method of claim 1 , wherein the text sequence includes a text representing an utterance claim 1 , and wherein the video sequence is generated such that the person appears to utter the utterance in the video sequence.5. The method of claim 1 , wherein the text sequence includes a word and an indicator for the word claim 1 , the indicator indicates an emotional expression of the person at a time ...

Publication date: 23-05-2013

System and Method for Generating Challenge Items for CAPTCHAs

Number: US20130132093A1
Author: GROSS JOHN NICHOLAS

Challenge items for an audible based electronic challenge system are generated using a variety of techniques to identify optimal candidates. The challenge items are intended for use in a computing system that discriminates between humans and text to speech (TTS) systems. 1-19. (canceled) 20. A method embodied in a computer readable medium of selecting challenge data to be used for accessing data and/or resources of a computing system comprising: (a) providing data identifying a first set of diphones to be assessed by a computing system, wherein each of said first set of diphones represents a sound associated with an articulation of a pair of phonemes in a natural language; (b) generating a plurality of articulation scores using the computing system based on measuring acoustical characteristics of a machine text to speech (TTS) system articulation of each of said first set of diphones; and (c) selecting challenge text including words and phrases from the natural language using the computing system based on said plurality of articulation scores; wherein said challenge text is useable by an utterance-based challenge system for discriminating between humans and machines. 21. The method of further including a step: processing input speech by an entity using said challenge item database to distinguish between a human and a machine synthesized voice. 22. A method embodied in a computer readable medium of selecting challenge data to be used for accessing data and/or resources of a computing system comprising: a) selecting a candidate challenge item which includes text words and/or visual images; b) measuring first acoustical characteristics of a computer synthesized utterance when articulating challenge content associated with said candidate challenge item; c) measuring second acoustical characteristics of a human utterance when articulating said challenge content; d) generating a challenge item score based on measuring a difference in said first and second acoustical ...
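
The scoring idea lends itself to a short sketch: items whose machine articulation differs most from human articulation make the strongest audible challenges. The feature extraction and the distance used below are illustrative assumptions.

```python
def articulation_score(tts_features, human_features):
    # larger score = harder for a TTS system to imitate convincingly
    return sum(abs(t - h) for t, h in zip(tts_features, human_features))

def select_challenges(candidates, top_n=10):
    """candidates: list of (text, tts_features, human_features) tuples."""
    scored = [(articulation_score(t, h), text) for text, t, h in candidates]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_n]]
```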

Publication date: 06-06-2013

SYSTEMS AND METHODS DOCUMENT NARRATION

Number: US20130144625A1
Assignee: K-NFB READING TECHNOLOGY, INC.

Disclosed are techniques and systems to provide a narration of a text in multiple different voices. In some aspects, systems and methods described herein can include receiving a user-based selection of a first portion of words in a document where the document has a pre-associated first voice model and overwriting the association of the first voice model, by the one or more computers, with a second voice model for the first portion of words. 1. A computer implemented method , comprising:receiving a user-based selection of a first portion of words in a document, at least of portion of the document being displayed on a user interface on a display device, the document being pre-associated with a first voice model;applying, by the one or more computers, in response to the user-based selection of the first portion of words, a first set of indicia to the user-selected first portion of words in the document; andoverwriting the association of the first voice model, by the one or more computers, with a second voice model for the first portion of words.2. The method of wherein the words in the first portion of words are narrated using the second voice model and at least some of the other words in the document are narrated using the first voice model.3. The method of claim 1 , wherein the method further comprises:associating, by the one or more computers, the first voice model with the document, prior to receiving the user-based selection of the first portion of words.4. The method of claim 1 , wherein the words in the first portion of words are narrated using the second voice model and remaining words in the document are narrated using the first voice model.5. The method of claim 1 , wherein the first voice model comprises a default voice model.6. The method of claim 1 , further comprising:applying, in response to a user-based selection of a second portion of words in the document, a second highlighting indicium to the user-selected second portion of words in the document; ...
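
The per-selection voice override described above can be sketched as a document-level default model plus word ranges that overwrite it. The range representation and class shape are assumptions for illustration.

```python
class NarrationPlan:
    """Document narration with a pre-associated default voice model and
    user-selected word ranges overwritten by a second voice model."""

    def __init__(self, default_voice):
        self.default_voice = default_voice
        self.overrides = []               # list of (start_word, end_word, voice)

    def overwrite(self, start, end, voice):
        self.overrides.append((start, end, voice))

    def voice_for(self, word_index):
        for start, end, voice in self.overrides:
            if start <= word_index <= end:
                return voice              # the second voice model wins here
        return self.default_voice
```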

Publication date: 08-08-2013

CONTEXTUAL CONVERSION PLATFORM FOR GENERATING PRIORITIZED REPLACEMENT TEXT FOR SPOKEN CONTENT OUTPUT

Number: US20130204624A1
Author: Ben-Ezri Daniel

A contextual conversion platform, and method for converting text-to-speech, are described that can convert content of a target to spoken content. Embodiments of the contextual conversion platform can identify certain contextual characteristics of the content, from which can be generated a spoken content input. This spoken content input can include tokens, e.g., words and abbreviations, to be converted to the spoken content, as well as substitution tokens that are selected from contextual repositories based on the context identified by the contextual conversion platform. 1. A method, comprising: at a computer comprising a computer program to implement processing operations: receiving data related to content of a target; filtering the data to locate a target term; accessing one or more tables in a repository, the one or more tables comprising entries with a substitution unit corresponding to the target term, the entries arranged according to a prioritized scheme that defines a position for the substitution unit in the tables; and generating an output comprising data that represents the substitution unit to be utilized by a text-to-speech generator to generate spoken content, wherein the position of the substitution unit in the one or more tables is assigned based on a specificity characteristic that describes the relative inclusivity of the substitution unit as compared to other substitution units in the one or more tables. 2. The method of claim 1, further comprising: breaking the content into at least one contextual unit that includes the target term; and inserting the substitution unit in the contextual unit in place of the target term. 3. The method of claim 1, further comprising: identifying a context cue in the data, the context cue identifying characteristics of the target; and selecting a table from the one or more tables in which the substitution unit is compatible with the characteristics of the target. 4. The method of claim 1, further comprising: ...
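
A compact sketch of the prioritized substitution lookup: candidate substitution units are ordered by a specificity score, and the first entry compatible with the detected context cue wins. The table layout and example entries are assumptions.

```python
SUBSTITUTIONS = {
    "Dr.": [
        {"text": "Drive", "specificity": 2, "context": "street_address"},
        {"text": "Doctor", "specificity": 1, "context": None},  # generic fallback
    ],
}

def substitute(target_term, context_cue):
    entries = sorted(SUBSTITUTIONS.get(target_term, []),
                     key=lambda e: e["specificity"], reverse=True)
    for entry in entries:
        if entry["context"] in (context_cue, None):
            return entry["text"]
    return target_term

print(substitute("Dr.", "street_address"))  # Drive
```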

Publication date: 15-08-2013

Apparatus and method for emotional voice synthesis

Number: US20130211838A1
Assignee: ACRIIL Inc

The present disclosure provides an emotional voice synthesis apparatus and an emotional voice synthesis method. The emotional voice synthesis apparatus includes a word dictionary storage unit for storing emotional words in an emotional word dictionary after classifying the emotional words into items each containing at least one of an emotion class, similarity, positive or negative valence, and sentiment strength; voice DB storage unit for storing voices in a database after classifying the voices according to at least one of emotion class, similarity, positive or negative valence and sentiment strength in correspondence to the emotional words; emotion reasoning unit for inferring an emotion matched with the emotional word dictionary with respect to at least one of each word, phrase, and sentence of document including text and e-book; and voice output unit for selecting and outputting a voice corresponding to the document from the database according to the inferred emotion.

Publication date: 15-08-2013

Feature sequence generating device, feature sequence generating method, and feature sequence generating program

Number: US20130211839A1
Author: Masanori Kato
Assignee: NEC Corp

Spread level parameter correcting means 501 receives a contour parameter as information representing the contour of a feature sequence (a sequence of features of a signal considered as the object of generation) and a spread level parameter as information representing the level of a spread of the distribution of the features in the feature sequence. The spread level parameter correcting means 501 corrects the spread level parameter based on a variation of the contour parameter represented by a sequence of the contour parameters. Feature sequence generating means 502 generates the feature sequence based on the contour parameters and the corrected spread level parameters.

Publication date: 15-08-2013

METHOD AND DEVICE FOR PROCESSING VOCAL MESSAGES

Number: US20130211845A1
Author: IMPARATO Ciro
Assignee: LA VOCE.NET DI CIRO IMPARATO

A method for automatically generating at least one voice message with the desired voice expression, starting from a prestored voice message, including assigning a vocal category to one word or to groups of words of the prestored message, computing, based on a vocal category/vocal parameter correlation table, a predetermined level of each one of the vocal parameters, emitting said voice message, with the vocal parameter levels computed for each word or group of words. 1. Method for automatically generating at least one voice message with the desired vocal expression, starting from a prestored voice message, comprising the steps of: assigning a vocal category to one word or to groups of words of the prestored message, computing, based on a vocal category/vocal parameter correlation table, a predetermined level of each one of the vocal parameters, emitting said voice message, with the vocal parameter levels computed for each word or group of words. 2. The method according to claim 1, wherein such vocal categories are chosen among friendship, trust, confidence, passion, apathy and anger. 3. The method according to claim 1, wherein such vocal parameters are chosen among volume, tone, time, rhythm. 4. Method for automatically decoding a message being listened to, in order to detect its vocal expression and the emotion of the person who recorded the voice message, comprising the steps of: assigning a level of each one of the vocal parameters to each word or group of words of the message being listened to, extracting, based on a vocal category/vocal parameter correlation table, the vocal categories of such words or groups of words starting from such vocal parameters assigned in the preceding step, determining the vocal expression of said voice message, based on the analysis of such extracted vocal categories. 5. The method according to claim 4, wherein such vocal categories are chosen among ...
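
The vocal category / vocal parameter correlation table named in the claims (volume, tone, time, rhythm per category) can be sketched as a plain lookup. The numeric levels below are illustrative assumptions, not values from the patent.

```python
CORRELATION_TABLE = {
    "friendship": {"volume": 0.6, "tone": 0.7, "time": 0.5, "rhythm": 0.6},
    "passion":    {"volume": 0.9, "tone": 0.8, "time": 0.7, "rhythm": 0.8},
    "apathy":     {"volume": 0.3, "tone": 0.2, "time": 0.4, "rhythm": 0.3},
    "anger":      {"volume": 1.0, "tone": 0.9, "time": 0.9, "rhythm": 1.0},
}

def annotate_message(words_with_category):
    """words_with_category: list of (word_group, vocal_category) pairs."""
    return [(group, CORRELATION_TABLE[cat]) for group, cat in words_with_category]

plan = annotate_message([("I told you", "anger"), ("never mind", "apathy")])
```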

Publication date: 22-08-2013

SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND COMPUTER PROGRAM PRODUCT

Number: US20130218568A1
Assignee: KABUSHIKI KAISHA TOSHIBA

According to an embodiment, a speech synthesis device includes a first storage, a second storage, a first generator, a second generator, a third generator, and a fourth generator. The first storage is configured to store therein first information obtained from a target uttered voice. The second storage is configured to store therein second information obtained from an arbitrary uttered voice. The first generator is configured to generate third information by converting the second information so as to be close to a target voice quality or prosody. The second generator is configured to generate an information set including the first information and the third information. The third generator is configured to generate fourth information used to generate a synthesized speech, based on the information set. The fourth generator configured to generate the synthesized speech corresponding to input text using the fourth information. 1. A speech synthesis device comprising:a first storage configured to store therein first information obtained from a target uttered voice;a second storage configured to store therein second information obtained from an arbitrary uttered voice;a first generator configured to generate third information by converting the second information so as to be close to a target voice quality or prosody;a second generator configured to generate an information set including the first information and the third information;a third generator configured to generate fourth information used to generate a synthesized speech, based on the information set; anda fourth generator configured to generate the synthesized speech corresponding to input text using the fourth information.2. The device according to claim 1 ,wherein the first information and the second information are stored together with attribute information thereof, andthe second generator generates the information set by adding the first information and the entire or a portion of the third information, the ...

Publication date: 29-08-2013

Methods employing phase state analysis for use in speech synthesis and recognition

Number: US20130226569A1
Assignee: Lessac Tech Inc

A computer-implemented method for automatically analyzing, predicting, and/or modifying acoustic units of prosodic human speech utterances for use in speech synthesis or speech recognition. Possible steps include: initiating analysis of acoustic wave data representing the human speech utterances, via the phase state of the acoustic wave data; using one or more phase state defined acoustic wave metrics as common elements for analyzing, and optionally modifying, pitch, amplitude, duration, and other measurable acoustic parameters of the acoustic wave data, at predetermined time intervals; analyzing acoustic wave data representing a selected acoustic unit to determine the phase state of the acoustic unit; and analyzing the acoustic wave data representing the selected acoustic unit to determine at least one acoustic parameter of the acoustic unit with reference to the determined phase state of the selected acoustic unit. Also included are systems for implementing the described and related methods.

Publication date: 05-09-2013

Automatic Sound Level Control

Number: US20130231921A1
Assignee: AT&T Intellectual Property I, L.P.

A method includes identifying, at a computing device, a plurality of words in data. Each of the plurality of words corresponds to a particular word of a written language. The method includes determining a sound output level based on a location of the computing device. The method includes generating sound data based on the sound output level and the plurality of words identified in the data. 1. A method comprising:identifying, at a computing device, a plurality of words in data, wherein each of the plurality of words corresponds to a particular word of a written language;determining a sound output level based on at least in part on a location of the computing device; andgenerating sound data based on the sound output level and the plurality of words identified in the data.2. The method of claim 1 , further comprising determining a noise level external to the computing device claim 1 , wherein the sound output level is based on the noise level external to the computing device.3. The method of claim 2 , wherein determining the noise level external to the computing device includes receiving sound data from one or more sound input devices of the computing device.4. The method of claim 1 , wherein the data includes image data claim 1 , and wherein at least one of the plurality of words is identified in the image data.5. The method of claim 4 , wherein the at least one of the plurality of words is identified in the image data using optical character recognition.6. The method of claim 1 , wherein the data is accessed from a data file.7. The method of claim 6 , wherein the data file is in a portable document format.8. The method of claim 1 , further comprising outputting one or more sounds from the computing device based on the sound data.9. The method of claim 1 , further comprising accessing sound configuration data from a memory of the computing device claim 1 , the sound configuration data including sound data corresponding to one or more locations.10. The method of ...
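
A minimal sketch of location- and noise-aware output level selection, following the claim language. The location table and the noise adjustment rule are assumptions for illustration.

```python
LOCATION_LEVELS = {"library": 0.2, "office": 0.5, "street": 0.8}

def sound_output_level(location, external_noise_db=None):
    level = LOCATION_LEVELS.get(location, 0.5)     # default for unknown places
    if external_noise_db is not None and external_noise_db > 60:
        level = min(1.0, level + 0.2)              # raise output in noisy surroundings
    return level

print(sound_output_level("street", external_noise_db=72))  # 1.0
```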

Publication date: 03-10-2013

Text to speech method and system

Number: US20130262109A1
Assignee: Toshiba Corp

A text-to-speech method for simulating a plurality of different voice characteristics includes dividing inputted text into a sequence of acoustic units; selecting voice characteristics for the inputted text; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model having a plurality of model parameters provided in clusters each having at least one sub-cluster and describing probability distributions which relate an acoustic unit to a speech vector; and outputting the sequence of speech vectors as audio with the selected voice characteristics. A parameter of a predetermined type of each probability distribution is expressed as a weighted sum of parameters of the same type using voice characteristic dependent weighting. In converting the sequence of acoustic units to a sequence of speech vectors, the voice characteristic dependent weights for the selected voice characteristics are retrieved for each cluster such that there is one weight per sub-cluster.

Publication date: 03-10-2013

PLAYBACK CONTROL APPARATUS, PLAYBACK CONTROL METHOD, AND PROGRAM

Number: US20130262118A1
Assignee: SONY CORPORATION

A playback control apparatus includes a playback controller configured to control playback of first content and second content. The first content is to output first sound which is generated based on text information using speech synthesis processing. The second content is to output second sound which is generated not using the speech synthesis processing. The playback controller causes an attribute of content to be played back to be displayed on the screen, the attribute indicating whether or not the content is to output sound which is generated based on text information using speech synthesis processing. 1. A playback control apparatus comprising:a playback controller configured to control playback of first content and second content, the first content is to output first sound which is generated based on text information using speech synthesis processing, the second content is to output second sound which is generated not using the speech synthesis processing,wherein the playback controller causes an attribute of content to be played back to be displayed on the screen, the attribute indicating whether or not the content is to output sound which is generated based on text information using speech synthesis processing.2. The playback control apparatus according to claim 1 , wherein the playback controller further causes a display portion claim 1 , associated with sound output at that time claim 1 , to be displayed in a highlighted state.3. The playback control apparatus according to claim 1 , wherein the playback controller further changes a speaker or background music claim 1 , which is in part of the sound claim 1 , in accordance with content of the text information used in generating sound.4. The playback control apparatus according to claim 1 , wherein a text-to-speech function for generating sound based on the text information using the speech synthesis processing is configured to be turned on or off claim 1 , andthe playback controller causes the first content ...

Publication date: 03-10-2013

TEXT TO SPEECH SYSTEM

Number: US20130262119A1
Assignee: KABUSHIKI KAISHA TOSHIBA

A text-to-speech method configured to output speech having a selected speaker voice and a selected speaker attribute, including: inputting text; dividing the inputted text into a sequence of acoustic units; selecting a speaker for the inputted text; selecting a speaker attribute for the inputted text; converting the sequence of acoustic units to a sequence of speech vectors using an acoustic model; and outputting the sequence of speech vectors as audio with the selected speaker voice and a selected speaker attribute. The acoustic model includes a first set of parameters relating to speaker voice and a second set of parameters relating to speaker attributes, which parameters do not overlap. The selecting a speaker voice includes selecting parameters from the first set of parameters and the selecting the speaker attribute includes selecting the parameters from the second set of parameters. 1. A text-to-speech method configured to output speech having a selected speaker voice and a selected speaker attribute ,said method comprising:inputting text;dividing said inputted text into a sequence of acoustic units;selecting a speaker for the inputted text;selecting a speaker attribute for the inputted text;converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model; andoutputting said sequence of speech vectors as audio with said selected speaker voice and a selected speaker attribute,wherein said acoustic model comprises a first set of parameters relating to speaker voice and a second set of parameters relating to speaker attributes, wherein the first and second set of parameters do not overlap, and wherein selecting a speaker voice comprises selecting parameters from the first set of parameters which give the speaker voice and selecting the speaker attribute comprises selecting the parameters from the second set which give the selected speaker attribute.2. A method according to claim 1 , wherein there are a plurality of sets of ...

Publication date: 31-10-2013

Realistic Speech Synthesis System

Number: US20130289998A1
Assignee: SRC Inc

A system and method for realistic speech synthesis which converts text into synthetic human speech with qualities appropriate to the context such as the language and dialect of the speaker, as well as expanding a speaker's phonetic inventory to produce more natural sounding speech.

Publication date: 14-11-2013

SYSTEM AND METHOD FOR AUDIBLY PRESENTING SELECTED TEXT

Number: US20130304474A1

Disclosed herein are methods for presenting speech from a selected text that is on a computing device. This method includes presenting text on a touch-sensitive display and having that text size within a threshold level so that the computing device can accurately determine the intent of the user when the user touches the touch screen. Once the user touch has been received, the computing device identifies and interprets the portion of text that is to be selected, and subsequently presents the text audibly to the user. 1. A method comprising:displaying, via a processor, text via a touch-sensitive display;receiving, from the touch-sensitive display, input identifying a portion of the text; andaudibly presenting the portion of the text.2. The method of claim 1 , wherein receiving the input further comprises receiving non-contiguous separate touches on the touch-sensitive display claim 1 , wherein the non-contiguous separate touches indicate a number of paragraphs of the text to be audibly presented as the portion of the text.3. The method of claim 1 , wherein the input comprises data associated with a first tap at a first location and a second tap at a second location claim 1 , and the portion of the text is identified as text displayed between the first location and the second location.4. The method of claim 1 , wherein audibly presenting the portion of the text occurs via a speaker associated with the touch-sensitive display.5. The method of claim 1 , wherein the touch-sensitive display is part of a mobile phone.6. The method of claim 1 , wherein audibly presenting the portion of the text comprises communicating pre-recorded phonemes combined together.7. The method of claim 1 , wherein the input further comprises an area of the touch-sensitive display indicated by user touch.8. A system comprising:a processor; anda computer-readable storage medium having instructions stored which, when executed by the processor, result in the processor performing operations comprising ...

Publication date: 21-11-2013

Voice processing apparatus

Number: US20130311189A1
Assignee: Yamaha Corp

In a voice processing apparatus, a processor performs generating a converted feature by applying a source feature of source voice to a conversion function, generating an estimated feature based on a probability that the source feature belongs to each element distribution of a mixture distribution model that approximates distribution of features of voices having different characteristics, generating a first conversion filter based on a difference between a first spectrum corresponding to the converted feature and an estimated spectrum corresponding to the estimated feature, generating a second spectrum by applying the first conversion filter to a source spectrum corresponding to the source feature, generating a second conversion filter based on a difference between the first spectrum and the second spectrum, and generating target voice by applying the first conversion filter and the second conversion filter to the source spectrum.

Publication date: 12-12-2013

Method and System for Enhancing a Speech Database

Number: US20130332169A1
Assignee: AT&T INTELLECTUAL PROPERTY II, L.P.

A system, method and computer readable medium that enhances a speech database for speech synthesis is disclosed. The method may include labeling audio files in a primary speech database, identifying segments in the labeled audio files that have varying pronunciations based on language differences, identifying replacement segments in a secondary speech database, enhancing the primary speech database by substituting the identified secondary speech database segments for the corresponding identified segments in the primary speech database, and storing the enhanced primary speech database for use in speech synthesis. 1. A method comprising: receiving text as part of a text-to-speech process; selecting a speech segment associated with the text, wherein the speech segment is selected from a primary speech database which has been modified by: identifying primary speech segments in the primary speech database which do not meet a need of the text-to-speech process, wherein the primary speech segments comprise one of half-phones, half-phonemes, demi-syllables, and polyphones; identifying replacement speech segments which satisfy the need in a secondary speech database; and enhancing the primary speech database by substituting, in the primary database, the primary speech segments with the replacement speech segments; and generating speech corresponding to the text using the speech segment. 2. The method of claim 1, wherein the need is based on one of dialect differences, geographic language differences, regional language differences, accent differences, national language differences, idiosyncratic speech differences, and database coverage differences. 3. The method of claim 1, wherein the primary speech segments are one of diphones, triphones, and phonemes. 4. The method of claim 1, wherein the primary speech database has been further modified by identifying boundaries of the primary speech segments. 5. ...

Publication date: 16-01-2014

Training and Applying Prosody Models

Number: US20140019138A1
Author: James H. Stephens, Jr.
Assignee: MORPHISM LLC

Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles. 114-. (canceled)15. A computer-implementable method for synthesizing audible speech , with varying prosody , from textual content , the method comprising:generating texts annotated with prosody information generated from audio using a speech recognition engine that performs the annotation during its operation;training prosody models with lexicons based on first segments of the texts with the prosody information;maintaining an inventory of the prosody models with lexicons, selecting a subset of multiple prosody models from the inventory of prosody models;associating prosody models in the subset of multiple prosody models with second segments of a text based on phrases in the text statistically associated with the lexicons of the prosody models;applying the associated prosody models to one of the second segments of the text to produce prosody annotations for the text;updating the associated prosody models' lexicons based on the phrases in the second segments of text;analyzing annotations of the prosody annotations to reconcile conflicting prosody annotations previously produced by multiple prosody models associated with the second segments of text; andsynthesizing audible speech from the second segments of text and the reconciled prosody annotations.16. The method of claim 15 , wherein the prosody information comprises directives related to pitch claim 15 , rate claim 15 , and volume of the audio as measured by the speech recognition engine.17. The method of claim 16 , wherein the reconciliation of conflicting prosody ...

Publication date: 13-02-2014

SYSTEM FOR CREATING MUSICAL CONTENT USING A CLIENT TERMINAL

Number: US20140046667A1
Author: KANG Won Mo, Yeom Jong Hak
Assignee: TGENS CO., LTD

A system for creating musical content using a client terminal, wherein diverse musical information such as a desired lyric and musical scale, duration and singing technique is input from an online or cloud computer, an embedded terminal or other such client terminal by means of technology for generating musical vocal content by using computer speech synthesis technology, and then speech in which cadence is expressed in accordance with the musical scale is synthesized as speech run by being produced for the applicable duration and is transmitted to the client terminal is provided. 1. A system for creating musical content using a client terminal , comprising:a client terminal for editing lyrics and a sound source, reproducing a sound corresponding to a location of a piano key, and editing a vocal effect or transmitting music information to the voice synthesis server to reproduce music synthesized and processed by the voice synthesis server, the music information being obtained by editing a singer sound source and a track corresponding to a vocal part;a voice synthesis server for obtaining the music information transmitted from the client terminal to extract, synthesize, and process a sound source corresponding to the lyrics; anda voice synthesis transmission server for transmitting the music created by the voice synthesis server to the client terminal.2. The system according to claim 1 , wherein the client terminal comprises:a lyrics editing unit for editing lyrics;a sound source editing unit for editing a sound source;a vocal effect editing unit for editing a vocal effect;a singer and track editing unit for selecting a singer sound source corresponding to a vocal part and editing a plurality of tracks; anda reproduction unit for receiving and reproducing a signal synthesized by the voice synthesis server from the voice synthesis transmission server.3. The system according to claim 1 , wherein the client terminal comprises:a lyrics editing unit for editing lyrics;a ...

Publication date: 20-02-2014

PROSODY EDITING APPARATUS AND METHOD

Number: US20140052446A1
Assignee: KABUSHIKI KAISHA TOSHIBA

According to one embodiment, a prosody editing apparatus includes a storage, a first selection unit, a search unit, a normalization unit, a mapping unit, a display, a second selection unit, a restoring unit and a replacing unit. The search unit searches the storage for one or more second prosodic patterns corresponding to attribute information that matches attribute information of the selected phrase. The mapping maps each of the normalized second prosodic patterns on a low-dimensional space. The restoring unit restores a restored prosodic pattern according to the selected coordinates. The replacing unit replaces prosody of synthetic speech generated based on the selected phrase by the restored prosodic pattern. 1. A prosody editing apparatus comprising:a storage configured to store attribute information items of phrases and one or more first prosodic patterns corresponding to each of the attribute information items of the phrases, the attribute information items each indicating an attribute associated with a phrase, the first prosodic patterns each including parameters which indicate a prosody type of the phrase and expresses prosody of the phrase, the parameters each including elements not less than the number of phonemes of the phrase;a first selection unit configured to select a phrase including phonemes from text to obtain a selected phrase;a search unit configured to search the storage for one or more second prosodic patterns corresponding to an attribute information item that matches an attribute information item of the selected phrase to obtain as a prosodic pattern set, the second prosodic patterns being included in the first prosodic patterns;a normalization unit configured to normalize the second prosodic patterns respectively;a mapping unit configured to map each of the normalized second prosodic patterns on a low-dimensional space represented by one or more coordinates smaller than the number of the elements to generate mapping coordinates;a display ...

Publication date: 20-02-2014

SPEECH SYNTHESIS APPARATUS, METHOD, AND COMPUTER-READABLE MEDIUM

Number: US20140052447A1
Assignee: KABUSHIKI KAISHA TOSHIBA

According to one embodiment, a speech synthesis apparatus is provided with generation, normalization, interpolation and synthesis units. The generation unit generates a first parameter using a prosodic control dictionary of a target speaker and one or more second parameters using a prosodic control dictionary of one or more standard speakers based on language information for an input text. The normalization unit normalizes the one or more second parameters based on a normalization parameter. The interpolation unit interpolates the first parameter and the one or more normalized second parameters based on weight information to generate a third parameter and the synthesis unit generates synthesized speech using the third parameter. 1. A speech synthesis apparatus comprising: a text analysis unit configured to analyze an input text and output language information; a dictionary storage unit configured to store a first prosodic control dictionary of a target speaker and a second prosodic control dictionary of one standard speaker or each of a plurality of standard speakers; a prosodic parameter generation unit configured to generate a first prosodic parameter using the first prosodic control dictionary and generate one or a plurality of second prosodic parameters using the second prosodic control dictionary, based on the language information; a normalization unit configured to normalize the one or the plurality of second prosodic parameters based on a normalization parameter; a prosodic parameter interpolation unit configured to interpolate the first prosodic parameter and the one or the plurality of normalized second prosodic parameters based on weight information to generate a third prosodic parameter; and a speech synthesis unit configured to generate synthesized speech in accordance with the third prosodic parameter. 2. The apparatus according to claim 1, further comprising a normalization parameter generation unit configured to generate the normalization parameter based on the ...

Publication date: 27-02-2014

SYSTEM FOR TUNING SYNTHESIZED SPEECH

Number: US20140058734A1

An embodiment of the invention is a software tool used to convert text, speech synthesis markup language (SSML), and/or extended SSML to synthesized audio. Provisions are provided to create, view, play, and edit the synthesized speech, including editing pitch and duration targets, speaking type, paralinguistic events, and prosody. Prosody can be provided by way of a sample recording. Users can interact with the software tool by way of a graphical user interface (GUI). The software tool can produce synthesized audio file output in many file formats. 1. A method of tuning synthesized speech , comprising:synthesizing, by a text-to-speech engine, user supplied text to produce synthesized speech;receiving, by the text-to-speech engine, a user indication of segments of the user supplied text and/or the synthesized speech to skip during re-synthesis of the speech; andre-synthesizing, by the text-to-speech engine, the speech based on the user indicated segments to skip.2. A method of tuning synthesized speech as defined in claim 1 , further comprising receiving a user modification of duration cost factors associated with the synthesized speech to change the duration of the synthesized speech claim 1 , wherein re-synthesizing the speech includes re-synthesizing the speech based on the user modified duration cost factors.3. A method of tuning synthesized speech as defined in claim 2 , wherein receiving a user modification of duration cost factors includes modifying a search of speech units when the user supplied text is re-synthesized to favor shorter speech units in response to user marking of any speech units in the synthesized speech as too long and modifying the search of speech units to favor longer speech units in response to user marking of any speech units in the synthesized speech as too short.4. A method of tuning synthesized speech as defined in claim 1 , further comprising receiving a user modification of pitch cost factors associated with the synthesized speech ...
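
The duration-cost adjustment in the claims can be sketched as a biased unit-selection cost: units the user marked "too long" bias the re-synthesis search toward shorter candidates, and "too short" marks bias it toward longer ones. The cost form and bias factor are assumptions for illustration.

```python
def duration_cost(candidate_ms, target_ms, user_mark=None):
    bias = 0.0
    if user_mark == "too_long":
        bias = -0.2 * target_ms           # favor shorter units on re-synthesis
    elif user_mark == "too_short":
        bias = 0.2 * target_ms            # favor longer units on re-synthesis
    return abs(candidate_ms - (target_ms + bias))

# the candidate closest to the biased target wins during unit selection
print(min([80, 100, 120], key=lambda c: duration_cost(c, 100, "too_long")))  # 80
```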

06-03-2014 publication date

METHOD AND SYSTEM FOR REPRODUCTION OF DIGITAL CONTENT

Number: US20140067399A1
Assignee: MATOPY LIMITED

The present invention relates to a method and system of aurally reproducing visually structured content by associating specific audio formatting elements with visual formatting elements of the content. A method and system for reproducing visually structured content by associating abstract visual elements with visual formatting elements of the content is also described. 1. A method of aurally reproducing visually structured content by associating specific audio formatting elements with visual formatting elements of the content.2. A method as claimed in including the step of aurally reproducing the content using the associated audio formatting elements.3. A method as claimed in wherein aural reproduction of the content includes layering of audio related to multiple audio formatting element types.4. A method as claimed in wherein the audio formatting element types include background music claim 3 , voice claim 3 , sound effect claim 3 , and audio effect.5. A method as claimed in wherein a processor associates the audio formatting elements with visual formatting elements in accordance with a set of rules.6. A method as claimed in wherein audio formatting elements are associated with visual formatting elements in accordance with a scoring method.7. A method as claimed in wherein elements of content are ordered in accordance with a score assigned to each element using a scoring method. 8. A method as claimed in wherein the scoring method includes the step of calculating a score for each element of content using attributes of one or more visual formatting elements associated with that element of content.9. A method as claimed in including the step of receiving input during aural reproduction to navigate within the content.10. A method as claimed in wherein the input specifies navigation to different portions of the aurally reproduced content based upon visual formatting elements.11. A method as claimed in wherein the input is a single user action.12. A method as claimed in ...

27-03-2014 publication date

METHOD AND DEVICE FOR USER INTERFACE

Number: US20140088970A1
Author: KANG Donghyun
Assignee: LG ELECTRONICS INC.

A method for user interface according to one embodiment of the present invention comprises the steps of: displaying text on a screen; receiving a character selection command of a user who selects at least one character included in a text, receiving a speech command of a user who designates a selected range in the text including at least one character, specifying the selected range according to the character selection command and the speech command; and a step for receiving an editing command of a user for the selected range. 1. A user interface method comprising:displaying a text on a screen;receiving a character selection command for selecting at least one character included in the text from a user;receiving a speech command for designating a selected range of the text including the at least one character from the user;specifying the selected range according to the character selection command and the speech command; andreceiving an editing command for the selected range from the user.2. The user interface method according to claim 1 , wherein the selected range corresponds to a word claim 1 , phrase claim 1 , sentence claim 1 , paragraph or page including the at least one character.3. The user interface method according to claim 1 , wherein the editing command corresponds to one of a copy command claim 1 , a cut command claim 1 , an edit command claim 1 , a transmit command and a search command for the selected range of the text.4. The user interface method according to claim 1 , wherein the character selection command is received through a touch gesture of the user claim 1 , applied to the at least one character.5. The user interface method according to claim 1 , wherein the character selection command is received through movement of a cursor displayed on the screen.6. The user interface method according to claim 5 , wherein the cursor is moved by user input using a gesture claim 5 , keyboard claim 5 , mouse or wireless remote controller.7. The user interface ...

06-01-2022 publication date

METHODS AND SYSTEMS FOR SYNTHESIZING SPEECH AUDIO

Number: US20220005460A1
Assignee: TOBROX COMPUTING LIMITED

A computer-implemented method for synthesizing speech audio includes obtaining a grammatical profile defining an input text of actual words as a function of at least syllable-occurrence rates and syllable-count-per-word rates; generating a dictionary of pseudo-words having the syllable-count-per-word rates, each pseudo-word consisting of one syllable or concatenated syllables selected from the input text, wherein substantially all of the pseudo-words are not actual words; constructing an output text product having the grammatical profile, the output text product comprising at least one sentence consisting of one or more pseudo-words selected from the dictionary; and synthesizing speech audio using the output text product. Related systems and computer-readable media are also provided. 1. A computer-implemented method for synthesizing speech audio comprising:obtaining a grammatical profile defining an input text of actual words as a function of at least syllable-occurrence rates and syllable-count-per-word rates;generating a dictionary of pseudo-words having the syllable-count-per-word rates, each pseudo-word consisting of one syllable or concatenated syllables selected from the input text, wherein substantially all of the pseudo-words are not actual words;constructing an output text product having the grammatical profile, the output text product comprising at least one sentence consisting of one or more pseudo-words selected from the dictionary;andsynthesizing speech audio using the output text product.2. The method of claim 1 , wherein constructing the output text product comprises:constructing multiple sentences each consisting of one or more pseudo-words selected from the dictionary.3. The method of claim 2 , wherein constructing multiple sentences further comprises:associating each of the multiple sentences with an intended one of a plurality of different speakers.4. The method of claim 1 , wherein the generating comprises:generating a potential pseudo-word using ...
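
The pseudo-word construction can be pictured with the toy sketch below; it assumes the syllables have already been segmented from the input text and that collisions with actual words are checked against a small vocabulary set, both of which are illustrative assumptions rather than details given in the entry.

    import random

    def make_pseudo_words(syllables, counts_per_word, vocabulary, n_words, seed=0):
        """Concatenate randomly chosen syllables, keeping only strings that are not real words."""
        rng = random.Random(seed)
        out = []
        while len(out) < n_words:
            k = rng.choice(counts_per_word)              # draw a syllable count for this word
            word = "".join(rng.choice(syllables) for _ in range(k))
            if word.lower() not in vocabulary:           # discard accidental actual words
                out.append(word)
        return out

    syllables = ["ka", "to", "rin", "se", "lu"]
    print(make_pseudo_words(syllables, counts_per_word=[1, 2, 2, 3],
                            vocabulary={"karin", "lose"}, n_words=5))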

05-01-2017 publication date

VOICE SYNTHESIZER, VOICE SYNTHESIS METHOD, AND COMPUTER PROGRAM PRODUCT

Number: US20170004821A1
Assignee:

According to an embodiment, a voice synthesizer includes a content selection unit, a content generation unit, and a content registration unit. The content selection unit determines selected content among a plurality of pieces of content registered in a content storage unit. The content includes tagged text in which tag information for controlling voice synthesis is added to text serving as a target of the voice synthesis. The content generation unit applies the tag information in the tagged text included in the selected content to designated text to generate new content. The content registration unit registers the generated new content in the content storage unit. 1. A voice synthesizer , comprising:a content selection unit configured to determine selected content among a plurality of pieces of content registered in a content storage unit, the content including tagged text in which tag information for controlling voice synthesis is added to text serving as a target of the voice synthesis;a content generation unit configured to apply the tag information in the tagged text included in the selected content to designated text to generate new content; anda content registration unit configured to register the generated new content in the content storage unit.2. The synthesizer according to claim 1 , wherein the content includes the tagged text and a voice waveform of a synthesized voice corresponding to the tagged text claim 1 , a tag information extraction unit configured to extract the tag information from the tagged text included in the selected content;', 'a tagged text generation unit configured to apply the tag information extracted by the tag information extraction unit to designated text to generate the tagged text; and', 'a voice waveform generation unit configured to generate a voice waveform of a synthesized voice corresponding to the tagged text generated by the tagged text generation unit using a voice synthesis dictionary, and, 'the content generation unit ...
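
One way to picture "applying the tag information in the tagged text to designated text" is the sketch below; the simple <prosody> wrapper format is assumed purely for illustration and is not the markup used by the patent.

    import re

    def extract_tag(tagged_text):
        """Return the opening/closing tag pair wrapping the tagged text, if any."""
        m = re.match(r"^(<prosody[^>]*>)(.*)(</prosody>)$", tagged_text, re.S)
        return (m.group(1), m.group(3)) if m else ("", "")

    def apply_tag(tagged_text, new_text):
        """Reuse the existing tag information around newly designated text."""
        open_tag, close_tag = extract_tag(tagged_text)
        return f"{open_tag}{new_text}{close_tag}"

    existing = '<prosody rate="slow" pitch="+2st">Doors are closing.</prosody>'
    print(apply_tag(existing, "The next stop is Midtown."))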

05-01-2017 publication date

Transliteration work support device, transliteration work support method, and computer program product

Number: US20170004822A1
Assignee: Toshiba Corp

According to an embodiment, a transliteration work support apparatus includes an input unit, an extraction unit, a presentation unit, a reception unit, and a correction unit. The input unit receives document information. The extraction unit extracts, as a correction part, a surface expression of the document information that matches a correction pattern expressing a plurality of surface expressions having the same regularity in way of correction in one form. The presentation unit presents a way of correction defined in accordance with the correction pattern used in the extraction of the correction part. The reception unit receives selection of the way of correction. The correction unit corrects the correction part based on the selected way of correction.

04-01-2018 publication date

SYSTEM AND METHODS FOR NUTRITION MONITORING

Number: US20180004913A1
Assignee:

An apparatus comprising a natural language processor, a mapper, a string comparator, a nutrient calculator, and a diet planning module, the diet planning module configured to generate a diet action control, the diet action control comprising instructions to operate the client device to perform a diet change recommendation on the client device, and apply the diet action control to the client device. 1. An apparatus comprising:a natural language processor to receive text from a client device and transform the text into a generated entity;a mapper to transform the generated entity into mapped data lists;a string comparator to transform the mapped data lists into a verified diet-specific control utilizing a nutrition control memory structure;a nutrient calculator to determine nutrition content from the verified diet-specific control;a diet planning module to generate a diet action control, the diet action control comprising instructions to operate the client device to perform a diet change recommendation on the client device and apply the diet action control to the client device; and receive a prompt activation signal from the natural language processor;', 'generate a prompt comprising instructions to operate the client device to display on a machine display of the client device an indication of a prompt item, the prompt item comprising an intent signal or a required entity;', 'receive an unstructured input, the unstructured input enabling the natural language processor to transform the text into the generated entity; and', 'send the prompt to the client device., 'a prompting module to2. The apparatus of claim 1 , further comprising a speech recognition module to:receive an audio from the client device;generate the text from the audio; andsend the text to the natural language processor.3. The apparatus of claim 1 , wherein the natural language processor comprises: compare the text to the intent signal in an intent signal control memory structure; and', 'generate the ...

04-01-2018 publication date

SOUND CONTROL DEVICE, SOUND CONTROL METHOD, AND SOUND CONTROL PROGRAM

Number: US20180005617A1
Assignee:

A sound control device includes: a reception unit that receives a start instruction indicating a start of output of a sound; a reading unit that reads a control parameter that determines an output mode of the sound, in response to the start instruction being received; and a control unit that causes the sound to be output in a mode according to the read control parameter. 1. A sound control device comprising:a reception unit that receives a start instruction indicating a start of output of a sound;a reading unit that reads a control parameter that determines an output mode of the sound, in response to the start instruction being received; anda control unit that causes the sound to be output in a mode according to the read control parameter.2. The sound control device according to claim 1 , further comprising:a storage unit that stores syllable information indicating a syllable and the control parameter associated with the syllable information,wherein the reading unit reads the syllable information and the control parameter from the storage unit, andthe control unit causes a singing sound indicating the syllable to be output as the sound, in a mode according to the read control parameter.3. The sound control device according to claim 2 , wherein the control unit causes the singing sound to be output in the mode according to the control parameter and at a certain pitch.4. The sound control device according to claim 2 , wherein the syllable is represented by or corresponding to one or more characters.5. The sound control device according to claim 4 , wherein the one or more characters are Japanese kana.6. The sound control device according to claim 1 , further comprising:a storage unit that stores a plurality of control parameters respectively associated with a plurality of mutually different orders,wherein the receiving unit sequentially accepts a plurality of start instructions including the start instruction, andthe reading unit reads from the storage unit a control ...

02-01-2020 publication date

ARTIFICIAL INTELLIGENCE (AI)-BASED VOICE SAMPLING APPARATUS AND METHOD FOR PROVIDING SPEECH STYLE IN HETEROGENEOUS LABEL

Number: US20200005764A1
Author: CHAE Jonghoon
Assignee: LG ELECTRONICS INC.

Disclosed is an artificial intelligence (AI)-based voice sampling apparatus for providing a speech style in a heterogeneous label, including a rhyme encoder configured to receive a user's voice, extract a voice sample, and analyze a vocal feature included in the voice sample, a text encoder configured to receive text for reflecting the vocal feature, a processor configured to classify the voice sample input to the rhythm encoder into a label according to the vocal feature, provide a weight by measuring a distance between a voice sample corresponding to the label and a voice sample corresponding to a heterogeneous label as a label other than the label and provide a weight by measuring similarity between the label and the heterogeneous label, extract an embedding vector representing the vocal feature, generate a speech style from the embedding vector, and apply the generated speech style to the text, and a rhyme decoder configured to output synthesized voice data in which the speech style is applied to the text by the processor. 1. An artificial intelligence (AI)-based voice sampling apparatus for providing a speech style in a heterogeneous label , the apparatus comprising:a rhyme encoder configured to receive a user's voice, extract a voice sample, and analyze a vocal feature included in the voice sample;a text encoder configured to receive an input of text for reflecting the vocal feature;a processor configured to classify the voice sample input to the rhythm encoder into a label according to the vocal feature, provide a weight by measuring a distance between a voice sample corresponding to the label and a voice sample corresponding to a heterogeneous label as a label other than the label or provide a weight by measuring similarity between the label and the heterogeneous label, extract an embedding vector representing the vocal feature, generate a speech style from the embedding vector, and apply the generated speech style to the text; anda rhyme decoder configured ...
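
The cross-label weighting can be sketched roughly as below, under assumptions the entry does not spell out: voice samples are reduced to small feature vectors, each label groups several samples, and the weight for a heterogeneous label is derived from the distance between a sample and that label's centroid.

    import math

    def centroid(samples):
        dims = len(samples[0])
        return [sum(s[d] for s in samples) / len(samples) for d in range(dims)]

    def cross_label_weight(sample, other_label_samples):
        """Larger weight when the sample lies close to the heterogeneous label's samples."""
        c = centroid(other_label_samples)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(sample, c)))
        return 1.0 / (1.0 + dist)

    calm_samples = [[0.1, 0.2], [0.2, 0.1]]
    happy_samples = [[0.8, 0.9], [0.9, 0.8]]
    print(cross_label_weight(calm_samples[0], happy_samples))   # weight of "happy" for a "calm" sample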

03-01-2019 publication date

SYSTEM AND METHOD FOR THE CREATION AND PLAYBACK OF SOUNDTRACK-ENHANCED AUDIOBOOKS

Number: US20190005959A1
Assignee:

A synchronised soundtrack for an audiobook. The soundtrack has a soundtrack timeline having one or more audio regions that are configured for synchronised playback with corresponding narration regions in the audiobook playback timeline. Each audio region has a position along the soundtrack timeline that is dynamically adjustable to maintain synchronization of the audio regions of the soundtrack with their respective narration regions in the audiobook based on a narration speed variable indicative of the playback narration speed of the audiobook. 1. A synchronised soundtrack for an audiobook , the soundtrack comprising a soundtrack timeline having one or more audio regions that are configured for synchronised playback with corresponding narration regions in the audiobook playback timeline , each audio region having a position along the soundtrack timeline that is dynamically adjustable to maintain synchronization of the audio regions of the soundtrack with their respective narration regions in the audiobook based on a narration speed variable indicative of the playback narration speed of the audiobook.2. A synchronised soundtrack according to wherein each audio region of the soundtrack is defined by a start position and stop position along the audiobook playback timeline.3. A synchronised soundtrack according to wherein the start position and stop position defining each audio region comprise start and stop time values defined along the audiobook playback timeline.4. A synchronised soundtrack according to wherein the start position and stop position defining each audio region comprise start and stop proportional commencement values in relation to the overall length of the audiobook or preset time markers along the audiobook playback timeline.5. A synchronised soundtrack according to any one of - wherein the start and stop positions defining each audio region of the soundtrack are defined or configured based on a nominal narration speed or nominal audiobook ...
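
A compact sketch of the dynamic adjustment follows; it assumes region boundaries are authored against a nominal narration speed and simply rescaled to the current playback speed, which is one plausible reading of the narration speed variable rather than the patent's stated formula.

    def adjust_region(start_s, stop_s, nominal_wpm, current_wpm):
        """Rescale an audio region authored at nominal_wpm for playback at current_wpm."""
        scale = nominal_wpm / current_wpm     # faster narration => regions start and end sooner
        return start_s * scale, stop_s * scale

    # A region authored at 150 words per minute, played back at 200 words per minute:
    print(adjust_region(30.0, 75.0, nominal_wpm=150, current_wpm=200))   # (22.5, 56.25)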

08-01-2015 publication date

Training and Applying Prosody Models

Number: US20150012277A1
Author: Jr. James H., Stephens
Assignee: MORPHISM LLC

Techniques for training and applying prosody models for speech synthesis are provided. A speech recognition engine processes audible speech to produce text annotated with prosody information. A prosody model is trained with this annotated text. After initial training, the model is applied during speech synthesis to generate speech with non-standard prosody from input text. Multiple prosody models can be used to represent different prosody styles. 114-. (canceled)15. A computer-implementable method for synthesizing audible speech , with varying prosody , from textual content , the method comprising:building an inventory of prosody models with designated characteristics;selecting a target prosody model for training based on input parameters and first keywords related to a first text segment of a first text with prosody annotations;training the target prosody models based on the prosody annotations of the first text segment;maintaining associations between the first keywords, the designated characteristics, and the input characteristics;selecting multiple prosody models for application for a second text segment of a second text based on second keywords related to the second text and the associations;applying the multiple application prosody models to the second text segment;reconciling conflicts from the application of the multiple application prosody models to generate reconciled prosody information; andgenerating audible speech for the second text segment based on the reconciled prosody information using a text-to-speech synthesis engine.16. The method of claim 15 , wherein the input parameters comprise the identity claim 15 , type claim 15 , or role of a speaker of first text segment.17. The method of claim 16 , wherein the input parameters indicate geographical information related to a speaker of the first text segment.18. The method of claim 17 , wherein the input parameters comprise emotion designators.19. The method of claim 18 , wherein the input parameters ...

27-01-2022 publication date

System for Communication Skills Training Using Juxtaposition of Recorded Takes

Number: US20220028369A1
Assignee:

An Internet-based application allows a trainee to record a performance of a scene containing roles A and B with performers for the scene's roles alternately speaking their respective lines. The system displays the lines in a teleprompter style, and based on the experience level of the trainee, may blank out increasing portions of the teleprompter-style lines. If the trainee is assigned role A, the system will present each role A line to be spoken by the trainee with a time progress bar indicating the speed/timing or time remaining for that line. The trainee's performance is recorded by a computer. The teleprompter timer ensures that the trainee's performance is coordinated with a take of role B, even though the trainee's take and the role B take are actually recorded at different times. The takes are played in tandem for evaluating effectiveness of the training. 1. A method for training a trainee employing juxtaposable and interchangeable takes , the method comprising the steps of: all takes of a first role of the script are interchangeable, and', 'all takes of a second role of the script are juxtaposable with all takes of the first role of the script;, 'selecting a training scenario from an internet-connected server, wherein a script is associated with the scenario, wherein the script contains at least two roles, wherein an audio or audiovisual take of each role is associated with the script and wherein a duration of each line of each take of the at least two roles of the script is governed by timing information built into the script, such thatassigning the trainee to make a performance and a recording of one of the at least two roles of said selected training scenario using an internet connected computing device;playing the recording in juxtaposition with a take of a role not assigned to the trainee; andevaluating a level of the trainee's training based on the playing in juxtaposition.2. The method of claim 1 , wherein the recording is recorded while displaying ...

12-01-2017 publication date

METHODS EMPLOYING PHASE STATE ANALYSIS FOR USE IN SPEECH SYNTHESIS AND RECOGNITION

Number: US20170011733A1
Assignee:

A computer-implemented method for automatically analyzing, predicting, and/or modifying acoustic units of prosodic human speech utterances for use in speech synthesis or speech recognition. Possible steps include: initiating analysis of acoustic wave data representing the human speech utterances, via the phase state of the acoustic wave data; using one or more phase state defined acoustic wave metrics as common elements for analyzing, and optionally modifying, pitch, amplitude, duration, and other measurable acoustic parameters of the acoustic wave data, at predetermined time intervals; analyzing acoustic wave data representing a selected acoustic unit to determine the phase state of the acoustic unit; and analyzing the acoustic wave data representing the selected acoustic unit to determine at least one acoustic parameter of the acoustic unit with reference to the determined phase state of the selected acoustic unit. Also included are systems for implementing the described and related methods. 110-. (canceled)11. A method for categorically mapping the relationship of at least one text unit in a sequence of text to at least one corresponding prosodic phonetic unit , to at least one linguistic feature category in the sequence of text , and to at least one speech utterance represented in a synthesized speech signal , the method comprising:(a) identifying, and optionally modifying, acoustic data representing the at least one speech utterance, to provide the synthesized speech signal;(b) identifying, and optionally modifying, the acoustic data representing the at least one utterance to provide the at least one speech utterance with an expressive prosody determined according to prosodic rules; and(c) identifying acoustic unit feature vectors for each of the at least one prosodic phonetic units, each acoustic unit feature vector comprising a bundle of feature values selected according to proximity to a statistical mean of the values of acoustic unit candidates available ...

09-01-2020 publication date

ADAPTIVE TEXT-TO-SPEECH OUTPUTS

Number: US20200013387A1
Assignee: Google LLC

In some implementations, a language proficiency of a user of a client device is determined by one or more computers. The one or more computers then determines a text segment for output by a text-to-speech module based on the determined language proficiency of the user. After determining the text segment for output, the one or more computers generates audio data including a synthesized utterance of the text segment. The audio data including the synthesized utterance of the text segment is then provided to the client device for output. 1. A method comprising: a voice query was input to the client device by the user; and', 'an indication of a language proficiency designated to the user, the language proficiency designated to the user comprising one of a first level of language proficiency or a second level of language proficiency different than the first level of language proficiency;, 'receiving, at data processing hardware, from a client device associated with a user, data indicating that a first text segment comprising first information responsive to the voice query when the language proficiency designated to the user comprises the first level of language proficiency; or', 'a second text segment comprising second information responsive to the voice query when the language proficiency designated to the user comprises the second level of language proficiency, wherein at least a portion of the second information of the second text segment is different than the first information of the first test segment; and, 'generating, by the data processing hardware, audio data comprising a synthesized utterance of a particular text segment responsive to the voice query and based on the language proficiency designated to the user, the particular text segment comprising one ofproviding, by the data processing hardware, the audio data to the client device associated with the user.2. The method of claim 1 , further comprising claim 1 , prior to generating the audio data comprising the ...
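
A toy sketch of choosing between response segments by designated proficiency level is shown below; the two-level mapping and the example wording are assumptions made only for illustration.

    RESPONSES = {
        "first":  "Rain is likely today. Take an umbrella.",
        "second": ("Expect intermittent showers through the afternoon, with a 70% "
                   "chance of precipitation; carrying an umbrella is advisable."),
    }

    def pick_response(designated_level):
        """Return the text segment matching the user's designated proficiency level."""
        return RESPONSES["second" if designated_level == "second" else "first"]

    print(pick_response("first"))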

19-01-2017 publication date

HYBRID PREDICTIVE MODEL FOR ENHANCING PROSODIC EXPRESSIVENESS

Number: US20170018267A1
Assignee:

Systems and methods for prosody prediction include extracting features from runtime data using a parametric model. The features from runtime data are compared with features from training data using an exemplar-based model to predict prosody of the runtime data. The features from the training data are paired with exemplars from the training data and stored on a computer readable storage medium. 1. A computer-implemented method for prosody prediction , comprising:extracting features from runtime data using a parametric model;comparing the features from runtime data with features from training data using an exemplar-based model to predict prosody of the runtime data, the features from the training data being paired with exemplars from the training data and stored on a computer-readable storage medium; andsynthesizing speech, by a speech synthesizer, using the predicted prosody.2. The computer-implemented method as recited in claim 1 , wherein the parametric model includes a neural network model and the exemplar-based model includes a Gaussian Process model.3. The computer-implemented method as recited in claim 1 , wherein the features include deep layer features.4. The computer-implemented method as recited in claim 3 , wherein deep layer features include features after data has been transformed up to a layer of the parametric model before an output layer.5. The computer-implemented method as recited in claim 1 , further comprising training the parametric model to transform the training data to reproduce training targets.6. The computer-implemented method as recited in claim 1 , further comprising training the exemplar-based model to determine exemplars from the training data.7. The computer-implemented method as recited in claim 1 , further comprising constraining a number of nodes in a deepest layer of the parametric model before an output layer to reduce dimensionality.8. The computer-implemented method as recited in claim 1 , wherein a hybrid model including the ...
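
The hybrid idea can be pictured with the sketch below, which makes two simplifying assumptions: the parametric model's deep-layer features are taken as given vectors, and the exemplar-based step is approximated by kernel-weighted regression over stored (feature, prosody target) pairs in place of a full Gaussian Process.

    import math

    def kernel(a, b, length_scale=1.0):
        """Squared-exponential similarity between two feature vectors."""
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-d2 / (2 * length_scale ** 2))

    def predict_prosody(runtime_feature, training_pairs):
        """Kernel-weighted average of the stored exemplar targets."""
        weights = [kernel(runtime_feature, f) for f, _ in training_pairs]
        total = sum(weights)
        return sum(w * target for w, (_, target) in zip(weights, training_pairs)) / total

    # (deep-layer feature vector, stored pitch target in Hz)
    training = [([0.1, 0.9], 110.0), ([0.8, 0.2], 180.0), ([0.5, 0.5], 140.0)]
    print(round(predict_prosody([0.45, 0.55], training), 1))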

18-01-2018 publication date

Systems and methods for dynamically improving user intelligibility of synthesized speech in a work environment

Number: US20180018955A1
Assignee: Vocollect Inc

A method and apparatus that dynamically adjust operational parameters of a text-to-speech engine in a speech-based system are disclosed. A voice engine or other application of a device provides a mechanism to alter the adjustable operational parameters of the text-to-speech engine. In response to one or more environmental conditions, the adjustable operational parameters of the text-to-speech engine are modified to increase the intelligibility of synthesized speech.
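
The parameter-adjustment loop can be pictured with the sketch below; the noise thresholds, parameter names, and values are assumptions for illustration and not taken from the patent.

    def adjust_tts_params(ambient_noise_db, params=None):
        """Return TTS engine settings adapted to the measured ambient noise level."""
        params = dict(params or {"rate_wpm": 180, "volume": 0.7})
        if ambient_noise_db > 80:        # loud work environment
            params["rate_wpm"] = 150     # slow down for intelligibility
            params["volume"] = 1.0
        elif ambient_noise_db > 60:
            params["rate_wpm"] = 165
            params["volume"] = 0.85
        return params

    print(adjust_tts_params(85))   # {'rate_wpm': 150, 'volume': 1.0}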

18-01-2018 publication date

SPEECH SYNTHESIS APPARATUS, SPEECH SYNTHESIS METHOD, SPEECH SYNTHESIS PROGRAM, PORTABLE INFORMATION TERMINAL, AND SPEECH SYNTHESIS SYSTEM

Number: US20180018956A1
Author: TAKATSUKA Susumu
Assignee: SONY MOBILE COMMUNICATIONS INC.

A speech synthesis apparatus includes a content selection unit that selects a text content item to be converted into speech; a related information selection unit that selects related information which can be at least converted into text and which is related to the text content item selected by the content selection unit; a data addition unit that converts the related information selected by the related information selection unit into text and adds text data of the text to text data of the text content item selected by the content selection unit; a text-to-speech conversion unit that converts the text data supplied from the data addition unit into a speech signal; and a speech output unit that outputs the speech signal supplied from the text-to-speech conversion unit. 1a content selection unit that selects a text content item to be converted into speech;a related information selection unit that selects related information which can be at least converted into text and which is related to the text content item selected by the content selection unit;a data addition unit that converts the related information selected by the related information selection unit into text and adds text data of the text to text data of the text content item selected by the content selection. unit;a text-to-speech conversion unit that converts the text data supplied from the data addition unit into a speech signal; anda speech output unit that outputs the speech signal supplied from the text-to-speech conversion unit.. A speech synthesis apparatus comprising: The present application is a continuation of and claims the benefit of priority under 35 U.S.C. §120 from U.S. application Ser. No. 12/411,031, filed Mar. 25, 2009, which contains subject matter related to that disclosed in Japanese Priority Patent Application JP 2008-113202 filed in the Japan Patent Office on Apr. 23, 2008, the entire content of both of which are hereby incorporated herein by reference.The present invention relates to a ...

17-01-2019 publication date

ADAPTIVE TEXT-TO-SPEECH OUTPUTS

Number: US20190019501A1
Assignee: Google LLC

In some implementations, a language proficiency of a user of a client device is determined by one or more computers. The one or more computers then determines a text segment for output by a text-to-speech module based on the determined language proficiency of the user. After determining the text segment for output, the one or more computers generates audio data including a synthesized utterance of the text segment. The audio data including the synthesized utterance of the text segment is then provided to the client device for output. 1. A method comprising:determining, by data processing hardware, a user context of a user of a client device, the user context indicating a level of complexity of speech that the user is likely able to comprehend;determining, by the data processing hardware, a particular text segment for text-to-speech output to the user, the particular text segment having a complexity score indicating a corresponding level of complexity associated with the particular text segment;modifying, by the data processing hardware, the particular text segment for the text-to-speech output to the user based on the complexity score of the particular text segment and the selected user context;generating, by the data processing hardware, audio data comprising a synthesized utterance of the modified particular text segment; andproviding, by the data processing hardware, the audio data comprising the synthesized utterance of the modified particular text segment to the client device.2. The method of claim 1 , further comprising claim 1 , prior to determining the particular text segment for text-to-speech output to the user claim 1 , receiving claim 1 , at the data processing hardware claim 1 , data indicating a voice query detected by the client device claim 1 ,wherein determining the particular text segment for text-to-speech output comprises generating the particular text segment as a response to the voice query, andwherein providing the audio data comprises ...

26-01-2017 publication date

Method and Device for Editing Singing Voice Synthesis Data, and Method for Analyzing Singing

Number: US20170025115A1
Assignee:

A singing voice synthesis data editing method includes adding, to singing voice synthesis data, a piece of virtual note data placed immediately before a piece of note data having no contiguous preceding piece of note data, the singing voice synthesis data including: multiple pieces of note data for specifying a duration and a pitch at which each note that is in a time series, representative of a melody to be sung, is voiced; multiple pieces of lyric data associated with at least one of the multiple pieces of note data; and a sequence of sound control data that directs sound control over a singing voice synthesized from the multiple pieces of lyric data, and obtaining the sound control data that directs sound control over the singing voice synthesized from the multiple pieces of lyric data, and that is associated with the piece of virtual note data. 1. A singing voice synthesis data editing method comprising:adding to singing voice synthesis data a piece of virtual note data placed immediately before a piece of note data having no contiguous preceding piece of note data, the singing voice synthesis data including: multiple pieces of note data for specifying a duration and a pitch at which each note that is in a time series, representative of a melody to be sung, is voiced; multiple pieces of lyrics data associated with at least one of the multiple pieces of note data; and a sequence of sound control data that directs sound control over a singing voice synthesized from the multiple pieces of lyrics data; andobtaining sound control data that directs sound control over the singing voice synthesized from the multiple pieces of lyrics data, and that is associated with the piece of virtual note data.2. The singing voice synthesis data editing method according to claim 1 ,wherein the adding a piece of virtual note data includes adding, as the piece of virtual note data, a piece of note data having a time length corresponding to a time difference between the note-on timing ...

28-01-2016 publication date

METHOD FOR FORMING THE EXCITATION SIGNAL FOR A GLOTTAL PULSE MODEL BASED PARAMETRIC SPEECH SYNTHESIS SYSTEM

Number: US20160027430A1
Assignee:

A system and method are presented for forming the excitation signal for a glottal pulse model based parametric speech synthesis system. The excitation signal may be formed by using a plurality of sub-band templates instead of a single one. The plurality of sub-band templates may be combined to form the excitation signal wherein the proportion in which the templates are added is dynamically based on determined energy coefficients. These coefficients vary from frame to frame and are learned, along with the spectral parameters, during feature training. The coefficients are appended to the feature vector, which comprises spectral parameters and is modeled using HMMs, and the excitation signal is determined. 1. A method for creating parametric models for use in training a speech synthesis system , wherein the system comprises at least a training text corpus , a speech database , and a model training module , the method comprising:a. obtaining, by the model training module, speech data for the training text corpus, wherein the speech data comprises recorded speech signals and corresponding transcriptions;b. converting, by the model training module, the training text corpus into context dependent phone labels;c. extracting, by the model training module, for each frame of speech in the speech signal from the speech training database, at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;d. forming, by the model training module, a feature vector stream for each frame of speech using the at least one of: spectral features, a plurality of band excitation energy coefficients, and fundamental frequency values;e. labeling speech with context dependent phones;f. extracting durations of each context dependent phone from the labelled speech;g. performing parameter estimation of the speech signal, wherein the parameter estimation is performed comprising the features, HMM, and decision trees; andh. identifying a ...
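
The combination step can be pictured with the sketch below; the toy templates and coefficients are illustrative stand-ins for the trained sub-band templates and per-frame band excitation energy coefficients mentioned in the entry.

    def build_excitation(band_templates, energy_coeffs):
        """Sum the sub-band templates, each scaled by its per-frame energy coefficient."""
        assert len(band_templates) == len(energy_coeffs)
        length = len(band_templates[0])
        frame = [0.0] * length
        for template, coeff in zip(band_templates, energy_coeffs):
            for i in range(length):
                frame[i] += coeff * template[i]
        return frame

    low_band = [0.0, 0.9, 0.4, 0.1]
    high_band = [0.2, -0.3, 0.5, -0.2]
    print(build_excitation([low_band, high_band], energy_coeffs=[0.7, 0.3]))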

28-01-2021 publication date

SPEECH PROCESSING APPARATUS, AND PROGRAM

Number: US20210027760A1
Author: KAINUMA Ken-ichi
Assignee:

A speech processing apparatus is provided in which, while face feature points are extracted from moving image data obtained by imaging a speaker's face, for each frame, a first generation network for generating face feature points of the corresponding frame based on speech feature data extracted from uttered speech of the speaker for each frame is generated, and whether the first generation network is appropriate is evaluated using an identification network, then, a second generation network for generating the uttered speech from a plurality of uncertain settings including at least text representing utterance content of the uttered speech and information indicating emotions included in the uttered speech, a plurality of types of fixed settings which define speech quality, and the face feature points generated by the first generation network evaluated as appropriate, is generated, and whether the second generation network is appropriate is evaluated using the identification network. 1. A speech processing apparatus comprising:an extracting means configured to separate moving image data obtained by imaging a face of a speaker in an utterance period into frames having a predetermined time length, and extract face feature point data indicating positions of face feature points determined in advance, for each frame;a first generating means configured to separate speech data representing uttered speech of the speaker in the utterance period into the frames and generate a first generation network for generating face feature points of each frame from speech feature data of the corresponding frame;a first evaluating means configured to evaluate whether or not the first generation network is appropriate using the face feature point data extracted from each frame using a first identification network;a second generating means configured to cause a user to designate a plurality of types of uncertain settings including at least text representing utterance content of the uttered ...

17-02-2022 publication date

Two-Level Speech Prosody Transfer

Number: US20220051654A1
Assignee: Google LLC

A method includes receiving an input text utterance to be synthesized into expressive speech having an intended prosody and a target voice and generating, using a first text-to-speech (TTS) model, an intermediate synthesized speech representation for the input text utterance. The intermediate synthesized speech representation possesses the intended prosody. The method also includes providing the intermediate synthesized speech representation to a second TTS model that includes an encoder portion and a decoder portion. The encoder portion is configured to encode the intermediate synthesized speech representation into an utterance embedding that specifies the intended prosody. The decoder portion is configured to process the input text utterance and the utterance embedding to generate an output audio signal of expressive speech that has the intended prosody specified by the utterance embedding and speaker characteristics of the target voice.

31-01-2019 publication date

INFORMATION PROCESSING SYSTEM, CLIENT TERMINAL, INFORMATION PROCESSING METHOD, AND RECORDING MEDIUM

Number: US20190033957A1
Author: Shionozaki Atsushi
Assignee:

[Object] To provide an information processing system, a client terminal, an information processing method, and a recording medium capable of selecting an appropriate agent from among multiple agents according to a user emotion, and providing more comfortable dialogue. 1. An information processing system comprising:a storage section that holds a plurality of agent programs with different attributes;a communication section that provides an agent service by the agent programs to a client terminal of a user; anda control section that selects, from among the plurality of agent programs, one agent program suited to an emotion of a user who can use the agent service.2. The information processing system according to claim 1 , whereinwhen providing an agent service by a first agent program to a client terminal of the user, if an emotion change of the user is detected, the control section controls a switching to an agent service by a second agent program appropriate to the emotion change.3. The information processing system according to claim 1 , whereinwhen the user owns a usage right to the agent service by the selected agent program, the control section starts providing the agent service by the selected agent program to the user.4. The information processing system according to claim 3 , whereinthe control section stores, in the storage section, preference information of the user according to feedback received from the user enjoying the agent service.5. The information processing system according to claim 4 , whereinthe control section selects, from among the plurality of agent programs, one agent program suited to an emotion of the user who can use the agent service, on a basis of the preference information.6. The information processing system according to claim 4 , whereinthe feedback is input through text or speech by the user on the client terminal.7. The information processing system according to claim 4 , whereinthe feedback is at least one of biological information, ...

04-02-2021 publication date

Controlling Expressivity In End-to-End Speech Synthesis Systems

Number: US20210035551A1
Assignee: Google LLC

A system for generating an output audio signal includes a context encoder, a text-prediction network, and a text-to-speech (TTS) model. The context encoder is configured to receive one or more context features associated with current input text and process the one or more context features to generate a context embedding associated with the current input text. The text-prediction network is configured to process the current input text and the context embedding to predict, as output, a style embedding for the current input text. The style embedding specifies a specific prosody and/or style for synthesizing the current input text into expressive speech The TTS model is configured to process the current input text and the style embedding to generate an output audio signal of expressive speech of the current input text. The output audio signal has the specific prosody and/or style specified by the style embedding. 1. A system comprising: receive one or more context features associated with current input text to be synthesized into expressive speech, each context feature derived from a text source of the current input text; and', 'process the one or more context features to generate a context embedding associated with the current input text;, 'a context encoder configured to receive the current input text from the text source, the text source comprising sequences of text to be synthesized into expressive speech;', 'receive the context embedding associated with the current input text from the context encoder; and', 'process the current input text and the context embedding associated with the current input text to predict, as output, a style embedding for the current input text, the style embedding specifying a specific prosody and/or style for synthesizing the current input text into expressive speech; and, 'a text-prediction network in communication with the context encoder and configured to receive the current input text from the text source;', 'receive the style ...

31-01-2019 publication date

VOICE CONTROLLED INFORMATION DELIVERY IN PAIRED DEVICES

Number: US20190037642A1
Assignee:

A voice-commanded common computing device may be selectively paired other computing devices in a shared network. The common computing device may detect co-presence of paired devices on the shared network, and may determine when audio and/or video content may be cast to devices on the shared network based on the detected co-presence. Audio content may include messages composed by a first user of a first device in the shared network, to be audibly output to a second user of a second device in the shared network. Casting of personal information may include positive authentication and verification prior to audible output of the personal information. 1. A method , comprising:receiving, by a common computing device in a shared network, from a user of a first computing device in a shared network, at a first point in time, message content to be transmitted to a user of a second computing device;detecting, by the common computing device at a second point in time, the second computing device in the shared network; andgenerating, by the common computing device, an audio signal corresponding to the message content, in response to the detecting of the second computing device in the shared network.2. The method of claim 1 , wherein receiving the message content includes receiving the message content in text form claim 1 , to be output to the user of the second computing device in response to detection of the second computing device in the shared network.3. The method of claim 2 , wherein generating the audio signal corresponding to the message content includes:transcribing the text form of the message content from text into speech; andaudibly outputting the transcribed message content.4. The method of claim 1 , wherein generating the audio signal corresponding to the message content includes:outputting the message content in audible form after a delay interval has elapsed from the detection of the second computing device in the shared network.5. The method of claim 1 , wherein ...

12-02-2015 publication date

Machine And Method To Assist User In Selecting Clothing

Number: US20150043822A1
Assignee: K-NFB READING TECHNOLOGY, INC.

A device to convey information to a user regarding clothing. The device receives data that specifies a clothing mode to use for processing an image, accesses a knowledge base to provide data to configure the computer program product for the clothing mode, the data including data specific to the clothing mode and receives an image or images of an article of clothing. The device processes the image or images to identify patterns in the image corresponding to items of clothing based on information obtained from the knowledge base. 1-27. (canceled) 28. A method of operating a portable electronic device , the method comprising:receiving by a processor device in the portable electronic device an image that captures a scene;retrieving by the processor device a template that includes a layout of a machine;processing by the processor device the image of the scene to recognize a pattern of controls by comparing the layout in the template to the recognized pattern in the scene, and to recognize a gesturing item in the image that indicates a user-initiated gesture pointing to a portion of the pattern in the image;determining by the processing device the control pointed to by the user; andcausing by the processor device, the portable electronic device to operate in a transaction mode.29. The method of wherein a directed reading mode is selected by the user according to the command determined from the gesturing.30. The method of further comprising:applying by the processor the pattern-recognition processing to the image to detect the gesturing over a control in the image and applying optical character recognition processing to determine text in the image; andapplying the text to speech processing to announce the text to the user.31. The method of further comprising:processing by the processor the retrieved stored template that has a stored layout of controls on the machine; andprocessing by the processor the image according to the template to navigate the user through use of the ...

01-05-2014 publication date

APPARATUS AND METHOD FOR GENERATION OF PROSODY ADJUSTED SOUND RESPECTIVE OF A SENSORY SIGNAL AND TEXT-TO-SPEECH SYNTHESIS

Number: US20140122082A1
Assignee: VIVOTEXT LTD.

A method for generation of a prosody adjusted digital sound. The method comprises receiving at least a sensory signal from at least one sensor; generating a digital sound respective of an input text content and a text-to-speech content retrieved from a memory unit; and modifying the generated digital sound respective of the at least the sensory signal to create the prosody adjusted digital sound. 1. An apparatus for generating prosody adjusted sound , comprising:a memory unit for maintaining at least a library that contains information to be used for text-to-speech conversion, the memory unit further maintains executable instructions;at least one sensor; anda processing unit connected to the memory unit and to the at least one sensor, the processing unit is configured to execute the instructions, thereby causing the apparatus to: convert a text content into speech content respective of the library, and generate a prosody adjusted digital sound respective of the speech content and at least a sensory signal received from the at least one sensor.2. The apparatus of claim 1 , further comprises:a digital-to-analog converter (DAC) configured to receive the prosody adjusted digital sound and to generate an analog signal therefrom.3. The apparatus of claim 1 , wherein the at least one sensor is any one of: a physical sensor claim 1 , a virtual sensor.4. The apparatus of claim 3 , wherein the physical sensor is any one of: a temperature sensor claim 3 , a global positioning system (GPS) claim 3 , a pressure sensor claim 3 , a light intensity claim 3 , an image analyzer claim 3 , a sound sensor claim 3 , an ultrasound sensor claim 3 , a speech recognizer claim 3 , a moistness sensor.5. The apparatus of claim 3 , wherein the virtual sensor is a data receiving component communicatively connected to a global network through an interface.6. The apparatus of claim 5 , wherein the interface is further configured to provide connectivity through a local network between the apparatus ...

07-02-2019 publication date

AUTOMATIC SPEECH IMITATION

Number: US20190043472A1
Author: Garcia Jason
Assignee:

Embodiments of systems, apparatuses, and/or methods are disclosed for automatic speech imitation. An apparatus may include a machine learner to perform an analysis of tagged data that is to be generated based on a speech pattern and/or a speech context behavior in media content. The machine learner may further generate, based on the analysis, a trained speech model that is to be applied to the media content to transform speech data to mimic data. The apparatus may further include a data analyzer to perform an analysis of the speech pattern, the speech context behavior, and/or the tagged data. The data analyzer may further generate, based on the analysis, a programmed speech rule that is to be applied to transform the speech data to the mimic data. 1. A computer system comprising: [ perform a first analysis of tagged data that is to be generated based on one or more of a speech pattern or a speech context behavior in media content; and', 'generate, based on the first analysis, a trained speech model that is to be applied to transform speech data to mimic data; or, 'a machine learner to, perform a second analysis of one or more of the speech pattern, the speech context behavior, or the tagged data; and', 'generate, based on the second analysis, a programmed speech rule that is to be applied to transform the speech data to the mimic data; and, 'a data analyzer to], 'a training data provider to provide training data including one or more ofa speech device to output imitated speech based on the mimic data.2. The system of claim 1 , further including claim 1 ,a speech pattern identifier to identify one or more of an ordered speech pattern, a literary point of view, or a disordered speech pattern in the media content; anda context behavior identifier to identify one or more of a trained behavior, a replacement behavior, or an additive behavior in the media content.3. The system of claim 1 , further including a media content tagger to:modify the media content with a speech ...

07-02-2019 publication date

GENERATING AUDIO RENDERING FROM TEXTUAL CONTENT BASED ON CHARACTER MODELS

Number: US20190043474A1
Assignee:

A computer implemented method, device and computer program product are provided. The method, device and computer program product utilize textual machine learning to analyze textual content to identify narratives for associated content segments of the textual content. The method, device and computer program further utilize textual machine learning to designate character models for the corresponding narratives and generate an audio rendering of the textual content utilizing the character models in connection with the corresponding narratives for the associated content segments. 1. A computer implemented method , comprising:under control of one or more processors configured with specific executable program instructions,analyzing textual content to identify narratives for associated content segments of the textual content;designating character models for the corresponding narratives; andgenerating an audio rendering of the textual content utilizing the character models in connection with the corresponding narratives for the associated content segments.2. The method of claim 1 , wherein the designating utilizes textual machine learning to designate the character models to simulate different characters within a book claim 1 , the characters differing from one another by one or more of age claim 1 , sex claim 1 , race claim 1 , nationality or personality traits.3. The method of claim 1 , wherein the content segments represent sections of spoken dialogue by different individual speakers claim 1 , and wherein the machine learning assigns different character models for the individual speakers.4. The method of claim 1 , wherein the designating utilizes textual machine learning to designate different intonations to be utilized by at least one of the character models in connection with sections of the corresponding narrative.5. The method of claim 4 , further comprising identifying a dialogue tag associated with the content segment and assigning at least one of the intonations ...

06-02-2020 publication date

METHOD FOR AUDIO SYNTHESIS ADAPTED TO VIDEO CHARACTERISTICS

Number: US20200043465A1
Assignee: KOREA ELECTRONICS TECHNOLOGY INSTITUTE

An audio synthesis method adapted to video characteristics is provided. The audio synthesis method according to an embodiment includes: extracting characteristics x from a video in a time-series way; extracting characteristics p of phonemes from a text; and generating an audio spectrum characteristic Sused to generate an audio to be synthesized with a video at a time t, based on correlations between an audio spectrum characteristic S, which is used to generate an audio to be synthesized with a video at a time t−1, and the characteristics x. Accordingly, an audio can be synthesized according to video characteristics, and speech according to a video can be easily added. 1. An audio synthesis method comprising:receiving an input of a video;receiving an input of a text;extracting characteristics x from the video in a time-series way;extracting characteristics p of phonemes from the text; and{'sub': t', 't-1, 'generating an audio spectrum characteristic Sused to generate an audio to be synthesized with a video at a time t, based on correlations between an audio spectrum characteristic S, which is used to generate an audio to be synthesized with a video at a time t−1, and the characteristics x.'}2. The method of claim 1 , wherein the generating comprises:{'sub': 't-1', 'a first calculation step of calculating scores e based on the correlations between the audio spectrum characteristic Sused to generate the audio to be synthesized with the video at the time t−1, and the respective characteristics x; and'}{'sub': 't', 'a first generation step of generating the audio spectrum characteristic Sby using the calculated scores e.'}3. The method of claim 2 , wherein the first calculation step is performed by using an AI model which is trained to receive the audio spectrum characteristic Sand the respective characteristics x claim 2 , and to calculate and output the scores e based on the correlations therebetween.4. The method of claim 2 , wherein the first generation step ...
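The correlation step described above can be illustrated with a small sketch. This is a minimal Python/NumPy approximation, assuming the video characteristics x are per-frame feature vectors and the audio spectrum characteristic S is a vector of the same width; the dot-product scoring, the softmax normalisation, and the single tanh layer are stand-ins for the trained AI model mentioned in the abstract, not details taken from the patent.

```python
import numpy as np

def attention_scores(s_prev, x_frames):
    """Score each video-frame characteristic by its correlation with S_{t-1} (dot product here)."""
    e = x_frames @ s_prev                      # one score e per video frame
    e = np.exp(e - e.max())
    return e / e.sum()                         # softmax-normalised scores

def next_spectrum(s_prev, x_frames, w, b):
    """Generate S_t from S_{t-1} and the score-weighted video context."""
    scores = attention_scores(s_prev, x_frames)
    context = scores @ x_frames                # weighted sum of video characteristics
    return np.tanh(w @ np.concatenate([s_prev, context]) + b)

rng = np.random.default_rng(0)
dim = 8
x_frames = rng.normal(size=(20, dim))          # characteristics x extracted in a time-series way
s_prev = rng.normal(size=dim)                  # audio spectrum characteristic S_{t-1}
w = rng.normal(size=(dim, 2 * dim)) * 0.1
b = np.zeros(dim)
print(next_spectrum(s_prev, x_frames, w, b))   # toy S_t
```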

15-02-2018 publication date

IMMERSIVE ELECTRONIC READING

Number: US20180046331A1
Assignee: Microsoft Technology Licensing, LLC

Electronic reading devices provide readers with text on a display, and enhancements to their functionality and efficiency are discussed herein. Text is provided to the reader in an enhanced contrast mode that highlights the active word and line of the text as well as words of interest in the text so as to improve the functionality of the electronic reading device itself as a provider of textual content. 1. A method for improving functionality and efficiency of an electronic reading device , comprising:receiving a text to be presented to a reader using the electronic reading device, the text comprising lines, the lines comprising words; determining an active word in the text;', 'determining, based on the active word, an active line in the text, the active line including the active word;', 'displaying words comprising the active line according to a first text luminance;', 'displaying words not comprising the active line according to a second text luminance;', 'displaying a background behind the text according to a first background luminance;', 'displaying a background behind the active word according to a second background luminance, wherein a difference between the first text luminance and first background luminance defines a first contrast, wherein a difference between the first text luminance and the second background luminance defines a second contrast, wherein a difference between the second text luminance and the first background luminance defines a third contrast, and wherein the second contrast is greater than the first contrast and the first contrast is greater than the third contrast;', 'in response to the text including a next word after the active word, selecting the next word as the active word for reading back next; and', 'wherein a delay between displaying the active word and selecting the next word is set by a pace defined by the reader., 'reading back the text to the reader, wherein reading back comprises recursively2. The method of claim 1 , further ...
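A minimal sketch of the claimed luminance ordering, with illustrative 8-bit luminance values that are not taken from the patent; it only checks that the second contrast (active word) exceeds the first contrast (active line), which in turn exceeds the third contrast (inactive lines).

```python
# Illustrative luminance values on a 0-255 scale; not taken from the patent.
active_text = 255      # first text luminance: words on the active line
inactive_text = 140    # second text luminance: words not on the active line
page_bg = 60           # first background luminance: background behind the text
active_word_bg = 10    # second background luminance: background behind the active word

first_contrast = abs(active_text - page_bg)          # active-line text vs page background
second_contrast = abs(active_text - active_word_bg)  # active word vs its highlight background
third_contrast = abs(inactive_text - page_bg)        # inactive lines vs page background

# The claimed ordering: the active word stands out most, inactive lines least.
assert second_contrast > first_contrast > third_contrast
print(first_contrast, second_contrast, third_contrast)
```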

16-02-2017 publication date

TEXT-TO-SPEECH METHOD AND MULTI-LINGUAL SPEECH SYNTHESIZER USING THE METHOD

Number: US20170047060A1
Assignee:

A text-to-speech method and a multi-lingual speech synthesizer using the method are disclosed. The multi-lingual speech synthesizer and the method executed by a processor are applied for processing a multi-lingual text message in a mixture of a first language and a second language into a multi-lingual voice message. The multi-lingual speech synthesizer comprises a storage device configured to store a first language model database and a second language model database, a broadcasting device configured to broadcast the multi-lingual voice message, and a processor, connected to the storage device and the broadcasting device, configured to execute the method disclosed herein. 2. The text-to-speech method of claim 1, wherein when the one of the at least one first language phoneme label sequence is in front of the one of the at least one second language phoneme label sequence, the step of producing the inter-lingual connection tone information comprises: replacing a first phoneme label of the at least one second language phoneme label sequence with a corresponding phoneme label of the first language phoneme labels which has a closest pronunciation to the first phoneme label of the at least one second language phoneme label sequence; and looking up the first language model database using the corresponding phoneme label of the first language phoneme labels, thereby obtaining a corresponding cognate connection tone information of the first language model database between a last phoneme label of the at least one first language phoneme label sequence and the corresponding phoneme label of the first language phoneme labels, wherein the corresponding cognate connection tone information of the first language model database serves as the inter-lingual connection tone information at the boundary between the one of the at least one first language phoneme label sequence and the one of the at least one second language phoneme label sequence. 3. The text-to-speech method of claim 1, ...

16-02-2017 publication date

SPEECH INTELLIGIBILITY IMPROVING APPARATUS AND COMPUTER PROGRAM THEREFOR

Number: US20170047080A1
Author: SHIGA Yoshinori

[Object] To provide a speech intelligibility improving apparatus capable of generating highly intelligible speech in various environments without unnecessarily amplifying sound volume. 1. A speech intelligibility improving apparatus for generating an intelligible speech , comprising:peak general outline extracting means for extracting, from a spectrum of a speech signal as an object, a general outline of peaks represented by a curve along a plurality of local peaks of a spectral envelope of the spectrum;spectrum modifying means for modifying the spectrum of said speech signal based on the general outline of peaks extracted by the peak general outline extracting means; andspeech synthesizing means for generating a speech based on the spectrum modified by said spectrum modifying means.2. The speech intelligibility improving apparatus according to claim 1 , wherein said peak general outline extracting means extracts claim 1 , from the spectrogram of a speech signal as an object claim 1 , a curved surface along a plurality of local peaks of an envelope of the spectrogram in time/frequency domain claim 1 , and obtains said general outline of peaks at each time from the extracted curved surface.3. The speech intelligibility improving apparatus according to claim 1 , wherein said peak general outline extracting means extracts said general outline of peaks based on perceptual or psycho-acoustic scale of frequency.4. The speech intelligibility improving apparatus according to claim 1 , wherein said spectrum modifying means includes spectrum peak emphasizing means for emphasizing a peak of said speech signal claim 1 , based on said general outline of peaks extracted by said peak general outline extracting means.5. The speech intelligibility improving apparatus according to claim 1 , whereinsaid spectrum modifying means includesambient sound spectrum extracting means for extracting a spectrum from an ambient sound collected in an environment to which the speech is to be ...

15-02-2018 publication date

System and method for low-latency web-based text-to-speech without plugins

Number: US20180047384A1
Assignee: Nuance Communications Inc

Disclosed herein are systems, methods, and non-transitory computer-readable storage media for reducing latency in web-browsing TTS systems without the use of a plug-in or Flash® module. A system configured according to the disclosed methods allows the browser to send prosodically meaningful sections of text to a web server. A TTS server then converts intonational phrases of the text into audio and responds to the browser with the audio file. The system saves the audio file in a cache, with the file indexed by a unique identifier. As the system continues converting text into speech, when identical text appears the system uses the cached audio corresponding to the identical text without the need for re-synthesis via the TTS server.
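A minimal sketch of the caching idea, assuming each prosodically meaningful section of text is keyed by a hash of its characters; the class and function names are hypothetical and the TTS back end is replaced by a stand-in callable.

```python
import hashlib

class PhraseAudioCache:
    """Cache synthesized audio per intonational phrase, keyed by a unique identifier."""

    def __init__(self, synthesize):
        self.synthesize = synthesize   # callable: text -> audio bytes (the TTS back end)
        self.store = {}

    def get_audio(self, phrase):
        key = hashlib.sha256(phrase.encode("utf-8")).hexdigest()
        if key not in self.store:      # only unseen phrases are sent for synthesis
            self.store[key] = self.synthesize(phrase)
        return self.store[key]

# Stand-in back end: a real deployment would call the TTS server here.
fake_tts = lambda text: f"<audio for: {text}>".encode()
cache = PhraseAudioCache(fake_tts)
cache.get_audio("Welcome back.")       # synthesized once
cache.get_audio("Welcome back.")       # identical text served from the cache, no re-synthesis
print(len(cache.store))                # -> 1
```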

03-03-2022 publication date

Building a Text-to-Speech System from a Small Amount of Speech Data

Number: US20220068256A1
Assignee: Google LLC

A method of building a text-to-speech (TTS) system from a small amount of speech data includes receiving a first plurality of recorded speech samples from an assortment of speakers and a second plurality of recorded speech samples from a target speaker where the assortment of speakers does not include the target speaker. The method further includes training a TTS model using the first plurality of recorded speech samples from the assortment of speakers. Here, the trained TTS model is configured to output synthetic speech as an audible representation of a text input. The method also includes re-training the trained TTS model using the second plurality of recorded speech samples from the target speaker combined with the first plurality of recorded speech samples from the assortment of speakers. Here, the re-trained TTS model is configured to output synthetic speech resembling speaking characteristics of the target speaker.
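A minimal sketch of the two-stage procedure, with a placeholder model and placeholder update step; the data, epoch counts, and learning rates are illustrative assumptions, not values from the disclosure.

```python
import random

def train(model, samples, epochs, lr):
    """Placeholder training loop; a real system would update TTS model weights here using lr."""
    for _ in range(epochs):
        random.shuffle(samples)
        for text, audio in samples:
            model["steps"] += 1        # stand-in for one parameter update on (text, audio)
    return model

# Stage 1: train on the assortment of speakers (the target speaker is not included).
assorted = [("hello", "wav_a1"), ("goodbye", "wav_b1"), ("thanks", "wav_c1")]
model = train({"steps": 0}, assorted, epochs=10, lr=1e-3)

# Stage 2: re-train on the few target-speaker samples combined with the assortment,
# so the small target set is not the only data the model sees during adaptation.
target = [("hello", "wav_t1"), ("see you", "wav_t2")]
model = train(model, assorted + target, epochs=3, lr=1e-4)
print(model["steps"])
```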

03-03-2022 publication date

SYSTEM AND METHOD FOR CROSS-SPEAKER STYLE TRANSFER IN TEXT-TO-SPEECH AND TRAINING DATA GENERATION

Number: US20220068259A1
Assignee:

Systems are configured for generating spectrogram data characterized by a voice timbre of a target speaker and a prosody style of source speaker by converting a waveform of source speaker data to phonetic posterior gram (PPG) data, extracting additional prosody features from the source speaker data, and generating a spectrogram based on the PPG data and the extracted prosody features. The systems are configured to utilize/train a machine learning model for generating spectrogram data and for training a neural text-to-speech model with the generated spectrogram data.
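A minimal NumPy sketch of the data flow described above, assuming frame-aligned PPG, pitch, and energy features and a single linear stand-in for the trained spectrogram generator; the dimensions and the speaker-embedding conditioning are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
T, ppg_dim, n_mels = 50, 40, 80

ppg = rng.random((T, ppg_dim))          # phonetic posteriorgrams from the source-speaker waveform
f0 = rng.random((T, 1))                 # extracted prosody features: pitch ...
energy = rng.random((T, 1))             # ... and energy, frame-aligned with the PPG

decoder_in = np.concatenate([ppg, f0, energy], axis=1)

# Stand-in for the spectrogram generator: one linear layer conditioned on a
# target-speaker embedding; the real model would be a trained neural network.
speaker_embedding = rng.random(16)
w = rng.normal(size=(decoder_in.shape[1] + 16, n_mels)) * 0.01
mel = np.tanh(np.concatenate([decoder_in, np.tile(speaker_embedding, (T, 1))], axis=1) @ w)
print(mel.shape)    # (50, 80): prosody from the source, timbre steered by the speaker embedding
```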

08-05-2014 publication date

Method, System, and Relevant Devices for Playing Sent Message

Number: US20140129228A1
Author: Lai Yizhe
Assignee: Huawei Technologies Co., Ltd.

A method and a system for playing a message that are applicable to the field of communications technologies. The message playing method includes: receiving, by a receiving terminal, a message that includes a user identifier and text information, obtaining a speech identifier and an image identifier corresponding to the user identifier, generating or obtaining a speech animation stream according to a speech characteristic parameter indicated by the speech identifier, an image characteristic parameter indicated by the image identifier, and the text information, and playing the speech animation stream. In this way, the text information in the message can be played as a speech animation stream according to the user identifier, the text information in the message can be presented vividly, and the message can be presented in a personalized manner according to the speech identifier and the image identifier corresponding to the user identifier. 1. A message playing method , applicable to a terminal device , comprising:receiving a message that comprises a user identifier and a text information;obtaining a speech identifier and an image identifier corresponding to the user identifier, wherein the speech identifier is used to indicate a speech characteristic parameter and the image identifier is used to indicate an image characteristic parameter; andgenerating or obtaining a speech animation stream according to the speech characteristic parameter indicated by the speech identifier, the image characteristic parameter indicated by the image identifier, and the text information; andplaying the speech animation stream.2. The method according to claim 1 , wherein before receiving the message claim 1 , the method further comprises:providing a setup interface used to receive a correspondence between the user identifier, and the speech identifier and the image identifier;receiving the correspondence between the user identifier, and the speech identifier and the image identifier from ...

03-03-2022 publication date

GENERATING VIDEOS WITH A CHARACTER INDICATING A REGION OF AN IMAGE

Number: US20220070550A1
Author: Ingel Ben Avi, Zass Ron
Assignee: VIDUBLY LTD

Methods, systems, and computer-readable media for generating videos with characters indicating regions of images are provided. For example, an image containing a first region may be received. At least one characteristic of a character may be obtained. A script containing a first segment of the script may be received. The first segment of the script may be related to the first region of the image. The at least one characteristic of a character and the script may be used to generate a video of the character presenting the script and at least part of the image, where the character visually indicates the first region of the image while presenting the first segment of the script. 120-. (canceled)21. A non-transitory computer readable medium storing data and computer implementable instructions for carrying out a method for generating videos , the method comprising:receiving an input including at least one image;obtaining a personalized profile associated with a user;obtaining a script related to the input;using the personalized profile to determine at least one visual characteristic of an avatar; andusing the determined at least one visual characteristic of the avatar to artificially generate an output video of the avatar presenting the script and at least part of the at least one image, wherein the user is a prospective viewer of the output video.22. The non-transitory computer readable medium of claim 21 , wherein the at least one image further comprises a first region and a second region claim 21 , the script contains a first segment and a second segment claim 21 , the first segment of the script is related to the first region and the second segment of the script is related to the second region claim 21 , and wherein in the artificially generated video the avatar visually indicates the first region while presenting the first segment of the script and visually indicates the second region while presenting the second segment of the script.23. The non-transitory computer ...

25-02-2021 publication date

Duration informed attention network (durian) for audio-visual synthesis

Number: US20210056949A1
Author: Chengzhu YU, Dong Yu, Heng Lu
Assignee: Tencent America LLC

A method and apparatus include receiving a text input that includes a sequence of text components. Respective temporal durations of the text components are determined using a duration model. A spectrogram frame is generated based on the duration model. An audio waveform is generated based on the spectrogram frame. Video information is generated based on the audio waveform. The audio waveform is provided as an output along with a corresponding video.
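The duration-model step can be illustrated by a simple length regulator: each text-component encoding is repeated for its predicted number of spectrogram frames. This NumPy sketch uses toy encodings and durations and is not the patent's model.

```python
import numpy as np

def length_regulate(text_encodings, durations):
    """Repeat each text-component encoding for its predicted number of frames."""
    return np.repeat(text_encodings, durations, axis=0)

encodings = np.eye(4)                    # one encoding per text component (toy values)
durations = np.array([3, 1, 4, 2])       # temporal durations predicted by the duration model
frames = length_regulate(encodings, durations)
print(frames.shape)                      # (10, 4): one row per spectrogram frame to generate
```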

25-02-2021 publication date

SYSTEMS AND METHODS FOR TRANSPOSING SPOKEN OR TEXTUAL INPUT TO MUSIC

Number: US20210056952A1
Assignee:

Described herein are real-time musical translation devices (RETM) and methods of use thereof. Exemplary uses of RETMs include optimizing the understanding and/or recall of an input message for a user and improving a cognitive process in a user. 1. A method of transforming textual input to a musical score comprising:receiving text input;transliterating the text input into a standardized phonemic representation of the text input;determining for the phonemic text input, a plurality of spoken pause lengths and a plurality of spoken phoneme lengths;mapping the plurality of spoken pause lengths to a respective plurality of sung pause lengths;mapping the plurality of spoken phoneme lengths to a respective plurality of sung phoneme lengths;generating, from the plurality of sung pause lengths and the plurality of sung phoneme lengths, a timed text input;generating a plurality of matching metrics for each of a respective plurality of portions of the timed text input against a plurality of melody segments; andgenerating a patterned musical message from the timed text input and the plurality of melody segments based at least in part on the plurality of matching metrics.2. The method of claim 1 , wherein the method is performed in real-time or in near-real-time claim 1 , and further comprises causing the patterned musical message to be played audibly on a transducer.3. The method of claim 1 , wherein the patterned musical message is expected to optimize claim 1 , for a user claim 1 , at least one of an understanding of the input message and a recall of the input message.4. The method of claim 1 , further comprising providing to a user a visual image relating to the patterned musical message aimed at enhancing comprehension and learning.5. The method of claim 1 , wherein the patterned musical message is presented to a user having a cognitive impairment claim 1 , a behavioral impairment claim 1 , or a learning impairment.6. The method of claim 5 , wherein the user has a ...
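A minimal sketch of two of the steps above: mapping spoken durations to sung durations on a beat grid and scoring candidate melody segments with a matching metric. The quantisation rule and the absolute-difference metric are assumptions made for the sketch.

```python
def to_sung_lengths(spoken_lengths, beat=0.5):
    """Map spoken phoneme/pause durations (seconds) to sung durations quantized to a beat grid."""
    return [max(beat, round(d / beat) * beat) for d in spoken_lengths]

def matching_metric(sung_lengths, melody_segment):
    """Lower is better: total mismatch between sung durations and a melody's note durations."""
    if len(melody_segment) < len(sung_lengths):
        return float("inf")
    return sum(abs(a - b) for a, b in zip(sung_lengths, melody_segment))

spoken = [0.21, 0.40, 0.18, 0.95]                  # lengths taken from the timed text input
sung = to_sung_lengths(spoken)
melodies = {"seg_a": [0.5, 0.5, 0.5, 1.0], "seg_b": [0.25, 1.0, 0.5, 0.5]}
best = min(melodies, key=lambda name: matching_metric(sung, melodies[name]))
print(sung, best)                                   # the best-matching melody segment is chosen
```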

22-02-2018 publication date

Social Networking with Assistive Technology Device

Number: US20180053498A1
Assignee:

An approach is provided that assists visually impaired users. The approach analyzes a document that is being utilized by the visually impaired user. The analysis derives a sensitivity of the document. A vocal characteristic corresponding to the derived sensitivity is retrieved. Text from the document is audibly read to the visually impaired user with a text to speech process that utilizes the retrieved vocal characteristic. The retrieved vocal characteristic conveys the derived sensitivity of the document to the visually impaired user. 1. A method implemented by an information handling system that includes a processor and a memory accessible by the processor , the method comprising:analyzing, by the processor, a document that is being composed by a visually impaired user, wherein the analysis derives a sensitivity of the document;retrieving, from the memory, a vocal characteristic corresponding to the derived sensitivity based on one or more predefined settings;retrieving, from the memory, an additional vocal characteristic corresponding to an audience size of the document; andaudibly reading text from the document to the visually impaired user with a text to speech process utilizing the retrieved vocal characteristic and the additional vocal characteristic.2. (canceled)3. The method of wherein the vocal characteristic is a speaker gender and the additional vocal characteristic is a speaker volume.4. The method of wherein the analyzing further detects an intended message that includes the document and an audience relationship between the visually impaired user and one or more members of the audience claim 1 , and wherein the additional voice characteristic also corresponds to the audience relationship.5. The method of wherein the derived sensitivity is based on one or more aspects from a group that includes confidentiality claim 1 , offensiveness claim 1 , social convention claim 1 , ethnicity claim 1 , and age.6. The method of wherein the document is a message ...

22-02-2018 publication date

SYSTEMS AND TECHNIQUES FOR PRODUCING SPOKEN VOICE PROMPTS

Number: US20180053499A1
Assignee:

Methods and systems are described in which spoken voice prompts can be produced in a manner such that they will most likely have the desired effect, for example to indicate empathy, or produce a desired follow-up action from a call recipient. The prompts can be produced with specific optimized speech parameters, including duration, gender of speaker, and pitch, so as to encourage participation and promote comprehension among a wide range of patients or listeners. Upon hearing such voice prompts, patients/listeners can know immediately when they are being asked questions that they are expected to answer, and when they are being given information, as well as the information that considered sensitive. 1. A method of producing spoken voice prompts for telephony-based informational interaction , the method comprising:for one or more voice prompts, determining words that receive an optimized speech parameter, based on context and/or meaning of the text of the one or more voice prompts;recording the one or more voice prompts, producing one or more spoken voice prompts; andconveying the one or more spoken voice prompts to a listener over a telephone system.2. The method of claim 1 , further comprising determining the number of words that receive an optimized speech parameter based on context and/or meaning of the one or more voice prompts;3. The method of claim 1 , wherein the optimized speech parameter comprises one or more pitch accents.4. The method of claim 3 , wherein the one or more pitch accents yield a pause lengthening pattern.5. The method of claim 3 , wherein the one or more pitch accents comprise a phrase-final lengthening pattern.6. The method of claim 3 , further comprising one or more boundary tones claim 3 , wherein the one or more pitch accents and boundary tones comprise a defined intonation pattern.7. The method of claim 6 , wherein the defined intonation pattern comprises specific rises or falls of the fundamental frequency of a spoken prompt.8. The ...

26-02-2015 publication date

SPEECH PROCESSING SYSTEM AND METHOD

Number: US20150058019A1
Author: CHEN Langzhou
Assignee: KABUSHIKI KAISHA TOSHIBA

A method of training an acoustic model for a text-to-speech system, 1. A method of training an acoustic model for a text-to-speech system ,the method comprising:receiving speech data,said speech data comprising data corresponding to different values of a first speech factor,and wherein said speech data is unlabelled, such that for a given item of speech data, the value of said first speech factor is unknown;clustering said speech data according to the value of said first speech factor into a first set of clusters; andestimating a first set of parameters to enable the acoustic model to accommodate speech for the different values of the first speech factor,wherein said clustering and said first parameter estimation are jointly performed according to a common maximum likelihood criterion.2. A method according to claim 1 , wherein each of the first set of clusters comprises at least one sub-cluster claim 1 , and wherein said first set of parameters are weights to be applied such there is one weight per sub-cluster claim 1 , and wherein said weights are dependent on said first speech factor.3. A method according to claim 1 , wherein said first set of parameters are constrained likelihood linear regression transforms which are dependent on said first speech factor.4. A method according to claim 1 , wherein the first speech factor is speaker and said speech data further comprises speech data from one or more speakers speaking with neutral speech.5. A method according to claim 1 , wherein the first speech factor is expression.6. A method according to claim 5 , further comprisingreceiving text data corresponding to said received speech data; extracting expressive features from the speech data and forming an expressive feature synthesis vector constructed in a second space; and', 'training a machine learning algorithm, the training input of the machine learning algorithm being an expressive linguistic feature vector and the training output the expressive feature synthesis ...

23-02-2017 publication date

Information Processing Method and Information Processing Device

Number: US20170053642A1
Assignee:

An information processing method includes receiving a change instruction to change a voice parameter used in synthesizing a voice for a set of texts, changing the voice parameter in accordance with the change instruction to change the voice parameter, changing, in accordance with the change instruction, an image parameter used in synthesizing an image of a virtual object, the virtual object indicating a character that vocalizes the voice that has been synthesized, synthesizing the voice using the changed voice parameter, and synthesizing the image using the changed image parameter. 1. An information processing method comprising:receiving a change instruction to change a voice parameter used in synthesizing a voice for a set of texts;changing the voice parameter in accordance with the change instruction;changing, in accordance with the change instruction, an image parameter used in synthesizing an image of a virtual object, the virtual object indicating a character that vocalizes the voice that has been synthesized;synthesizing the voice using the changed voice parameter; andsynthesizing the image using the changed image parameter.2. The information processing method according to further comprising:synchronizing a synthetic voice and a synthetic image with each other and playing the synchronized synthetic voice and synthetic image,wherein the changing of a voice parameter and the changing of an image parameter include changing the voice parameter and the image parameter while the voice and the image are being played.3. The information processing method according to claim 2 ,wherein the synthesizing of a voice includes:synthesizing a voice using a set of texts in a section that has been sequentially specified as a target section among multiple sections obtained by segmenting the set of texts; andsynthesizing a voice for a second section using the voice parameter that has been changed in accordance with a change instruction, received between a start of voice synthesis ...

13-02-2020 publication date

SYSTEM AND METHOD FOR ACOUSTIC ACTIVITY RECOGNITION

Number: US20200051544A1
Assignee:

Embodiments are provided to recognize features and activities from an audio signal. In one embodiment, a model is generated from sound effect data, which is augmented and projected into an audio domain to form a training dataset efficiently. Sound effect data is data that has been artificially created or from enhanced sounds or sound processes to provide a more accurate baseline of sound data than traditional training data. The sound effect data is augmented to create multiple variants to broaden the sound effect data. The augmented sound effects are projected into various audio domains, such as indoor, outdoor, urban, based on mixing background sounds consistent with these audio domains. The model is installed on any computing device, such as a laptop, smartphone, or other device. Features and activities from an audio signal are then recognized by the computing device based on the model without the need for in-situ training. 1. A method comprising:accessing a first set of sound effects from a sound effects database for a selected audio context;augmenting the sound effects to generate an augmented set of sound effects;projecting the augmented set of sound effects into an acoustic domain based on mixing the augmented set of sound effects with one or more background sounds that are consistent with the acoustic domain;assembling a training data set based on the augmented set of sound effects and the projections of the augmented set of sound effects; andgenerating a model that recognizes at least one feature in an audio signal based on the training dataset.2. The method of claim 1 , further comprising standardizing the first set of sound effects into a file format.3. The method of claim 2 , further comprising removing silences greater than a threshold duration.4. The method of claim 1 , wherein augmenting the first set of sound effects comprises applying an amplitude augmentation to generate a plurality of variants for each of the sound effects.5. The method of claim 4 ...
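A minimal NumPy sketch of the augmentation-and-projection idea, assuming amplitude scaling as the augmentation and SNR-controlled mixing with a background clip as the projection into an acoustic domain; the signals and the SNR value are illustrative.

```python
import numpy as np

def amplitude_variants(effect, gains=(0.5, 1.0, 2.0)):
    """Augment one sound effect into several amplitude-scaled variants."""
    return [g * effect for g in gains]

def project_to_domain(effect, background, snr_db=10.0):
    """Mix an (augmented) effect with a domain background at a target signal-to-noise ratio."""
    p_sig = np.mean(effect ** 2)
    p_bg = np.mean(background ** 2) + 1e-12
    scale = np.sqrt(p_sig / (p_bg * 10 ** (snr_db / 10)))
    return effect + scale * background

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
effect = np.sin(2 * np.pi * 440 * t)                    # stand-in "sound effect" clip
street = np.random.default_rng(2).normal(0, 0.3, sr)    # stand-in urban background sound

training_set = [project_to_domain(v, street) for v in amplitude_variants(effect)]
print(len(training_set), training_set[0].shape)          # augmented, domain-projected examples
```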

20-02-2020 publication date

Artificial intelligence apparatus for correcting synthesized speech and method thereof

Number: US20200058290A1
Assignee: LG ELECTRONICS INC

Disclosed herein is an artificial intelligence apparatus includes a memory configured to store learning target text and human speech of a person who pronounces the text, a processor configured to generate synthesized speech in which the text is pronounced by synthesized sound and extract a synthesized speech feature set including information on a feature pronounced in the synthesized speech and a human speech feature set including information on a feature pronounced in the human speech, and a learning processor configured to train a speech correction model for outputting a corrected speech feature set to allow predetermined synthesized speech to be corrected based on a human pronunciation feature when a synthesized speech feature set extracted from predetermined synthesized speech is input, based on the synthesized speech feature set and the human speech feature set.

05-03-2015 publication date

VARIABLE-DEPTH AUDIO PRESENTATION OF TEXTUAL INFORMATION

Number: US20150066510A1
Assignee:

A respective sequence of tracks of Internet content of common subject matter is queued to each of a plurality of stations, where each of the tracks of Internet content resides on a respective Internet resource in textual form. In response to receiving a sample input, snippets of each of multiple tracks queued to a selected station among the plurality of stations is transmitted for audible presentation as synthesized human speech, where each of the snippets includes only a subset of a corresponding track. Thereafter, one or more complete tracks among the multiple tracks for which snippets were previously transmitted are transmitted for audio presentation as synthesized human speech. 1. A method of supporting a variable-depth presentation of Internet content in a data processing system including a processor , the method comprising:a processor queuing a respective sequence of tracks of Internet content of common subject matter to each of a plurality of stations, wherein each of the tracks of Internet content resides on a respective Internet resource in textual form;in response to receiving a sample input, the processor transmitting snippets of each of multiple tracks queued to a selected station among the plurality of stations for audible presentation as synthesized human speech, wherein each of the snippets includes only a subset of a corresponding track; andthereafter, transmitting, for audio presentation as synthesized human speech, one or more complete tracks among the multiple tracks for which snippets were previously transmitted.2. The method of claim 1 , wherein the transmitting includes beginning transmission of the one or more complete tracks beginning with a track from which a snippet was being presented when the sample input was received.3. The method of claim 1 , wherein the transmitting includes transmitting the one or more complete tracks beginning with a track from which a snippet was first presented.4. The method of claim 1 , wherein:a selected track ...

04-03-2021 publication date

SPEECH SYNTHESIS METHOD AND APPARATUS

Number: US20210065678A1
Assignee: SAMSUNG ELECTRONICS CO., LTD.

A speech synthesis method performed by an electronic apparatus to synthesize speech from text and includes: obtaining text input to the electronic apparatus; obtaining a text representation by encoding the text using a text encoder of the electronic apparatus; obtaining an audio representation of a first audio frame set from an audio encoder of the electronic apparatus, based on the text representation; obtaining an audio representation of a second audio frame set based on the text representation and the audio representation of the first audio frame set; obtaining an audio feature of the second audio frame set by decoding the audio representation of the second audio frame set; and synthesizing speech based on an audio feature of the first audio frame set and the audio feature of the second audio frame set. 1. A method , performed by an electronic apparatus , of synthesizing speech from text , the method comprising:obtaining text input to the electronic apparatus;obtaining a text representation of the text by encoding the text using a text encoder of the electronic apparatus;obtaining a first audio representation of a first audio frame set of the text from an audio encoder of the electronic apparatus, based on the text representation;obtaining a first audio feature of the first audio frame set by decoding the first audio representation of the first audio frame set;obtaining a second audio representation of a second audio frame set of the text based on the text representation and the first audio representation of the first audio frame set;obtaining a second audio feature of the second audio frame set by decoding the second audio representation of the second audio frame set; andsynthesizing speech corresponding to the text based on at least one of the first audio frame set or the second audio feature of the second audio frame set.2. The method of claim 1 , wherein the second audio frame set includes at least one audio frame succeeding a last audio frame of the first ...
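A minimal NumPy sketch of the frame-set recursion described above, in which each new audio representation is produced from the text representation together with the previous audio representation and is then decoded to per-frame features; the linear/tanh layers and dimensions are stand-ins, not the patent's encoder or decoder.

```python
import numpy as np

rng = np.random.default_rng(3)
txt_dim, audio_dim, frames_per_set, n_sets = 16, 8, 5, 4

text_repr = rng.normal(size=txt_dim)                     # output of the text encoder
w_enc = rng.normal(size=(txt_dim + audio_dim, audio_dim)) * 0.1
w_dec = rng.normal(size=(audio_dim, frames_per_set)) * 0.1

audio_repr = np.zeros(audio_dim)                         # representation of the previous frame set
features = []
for _ in range(n_sets):
    # Next frame-set representation from the text representation and the previous audio representation.
    audio_repr = np.tanh(np.concatenate([text_repr, audio_repr]) @ w_enc)
    features.append(audio_repr @ w_dec)                  # decode to per-frame audio features

print(np.stack(features).shape)                          # (4, 5): n_sets x frames_per_set features
```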

01-03-2018 publication date

TRANSMISSION DEVICE, TRANSMISSION METHOD, RECEPTION DEVICE, AND RECEPTION METHOD

Number: US20180062777A1
Assignee: SONY CORPORATION

There is provided a transmission device, including circuitry configured to receive alert information including metadata related to a predetermined pronunciation of a message. The circuitry is configured to generate vocal information for the message based on the metadata included in the alert information. The circuitry is further configured to transmit emergency information that includes the message and the generated vocal information for the message. 1. A transmission device , comprising:circuitry configured toreceive alert information including metadata related to a predetermined pronunciation of a message;generate vocal information for the message based on the metadata included in the alert information; andtransmit emergency information that includes the message and the generated vocal information for the message.2. The transmission device according to claim 1 ,wherein the metadata indicates the predetermined pronunciation of a character string which is readable in different ways or is spoken in a manner that differs from a way a word included in the character string is spelled.3. The transmission device according to claim 1 ,wherein the alert information includes the message, andwherein a reception device that receives the emergency information displays the message, and outputs a sound according to the predetermined pronunciation of the message based on the vocal information.4. The transmission device according to claim 1 , wherein the circuitry is further configured to:receive content,transmit a digital broadcast signal that includes the content, andtransmit the emergency information.5. The transmission device according to claim 1 ,wherein the alert information is CAP information that is compliant with a Common Alerting Protocol (CAP) specified by the Organization for the Advancement of Structured Information Standards (OASIS), andwherein the CAP information includes the metadata or address information indicating a location of a file of the metadata.6. The ...

08-03-2018 publication date

TRANSLATION OF VERBAL DIRECTIONS INTO A LIST OF MANEUVERS

Number: US20180066949A1
Assignee:

Natural language directions are received and a set of maneuver/context pairs are generated based upon the natural language directions. The set of maneuver/context pairs are provided to a routing engine to obtain route information based upon the set of maneuver/context pairs. The route information is provided to an output system for surfacing to a user. 1. A computing system , comprising:maneuver identification logic that receives a natural language instruction and matches it with a pre-defined navigation maneuver;context identifier logic that identifies context information corresponding to the pre-defined navigation maneuver, the context information being indicative of a location at which the pre-defined navigation maneuver is performed; 'an output system that surfaces the trigger.', 'trigger output logic that outputs, as a trigger, the pre-defined navigation maneuver and the corresponding context information; and'}2. The computing system of and further comprising:limiter identifier logic that identifies any geographic limiter in the natural language instruction, the trigger output logic being configured to output the geographic limiter, along with the pre-defined navigation maneuver and the corresponding context information, as the trigger.3. The computing system of and further comprising:a disambiguation system that performs disambiguation of the context information and the geographic limiter to identify geographic locations corresponding to the context information and the geographic limiter.4. The computing system of and further comprising:a routing engine that identifies a route based on a first point and a second point.5. The computing system of wherein the maneuver identification logic is configured to receive a plurality of natural language instructions and match each with a pre-defined navigation maneuver claim 4 , the context identifier logic being configured to identify context information corresponding to each of the pre-defined navigation maneuvers claim ...

17-03-2022 publication date

MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS

Number: US20220084500A1
Author: Kim Taesu, Lee Younggun
Assignee: NEOSAPIENCE, INC.

A multilingual text-to-speech synthesis method and system are disclosed. The method includes receiving an articulatory feature of a speaker regarding a first language, receiving an input text of a second language, and generating output speech data for the input text of the second language that simulates the speaker's speech by inputting the input text of the second language and the articulatory feature of the speaker regarding the first language to a single artificial neural network multilingual text-to-speech synthesis model. The single artificial neural network multilingual text-to-speech synthesis model is generated by learning similarity information between phonemes of the first language and phonemes of the second language based on a first learning data of the first language and a second learning data of the second language. 1. A method for multilingual text-to-speech synthesis , comprising:receiving an articulatory feature of a speaker regarding a first language;receiving an input text of a second language; andgenerating output speech data for the input text of the second language that simulates the speaker's speech by inputting the input text of the second language and the articulatory feature of the speaker regarding the first language to a single artificial neural network multilingual text-to-speech synthesis model,wherein the single artificial neural network multilingual text-to-speech synthesis model is generated by learning similarity information between phonemes of the first language and phonemes of the second language based on a first learning data of the first language and a second learning data of the second language.2. The method of claim 1 , wherein the first learning data of the first language includes a learning text of the first language and learning speech data of the first language corresponding to the learning text of the first language claim 1 , andthe second learning data of the second language includes a learning text of the second language ...

10-03-2016 publication date

MULTILINGUAL PROSODY GENERATION

Number: US20160071512A1
Assignee:

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multilingual prosody generation. In some implementations, data indicating a set of linguistic features corresponding to a text is obtained. Data indicating the linguistic features and data indicating the language of the text are provided as input to a neural network that has been trained to provide output indicating prosody information for multiple languages. The neural network can be a neural network having been trained using speech in multiple languages. Output indicating prosody information for the linguistic features is received from the neural network. Audio data representing the text is generated using the output of the neural network. 1. A method performed by data processing apparatus , the method comprising:obtaining data indicating a set of linguistic features corresponding to a text;providing (i) data indicating the linguistic features and (ii) data indicating the language of the text as input to a neural network that has been trained to provide output indicating prosody information for multiple languages, the neural network having been trained using speech in multiple languages;receiving, from the neural network, output indicating prosody information for the linguistic features; andgenerating audio data representing the text using the output of the neural network.2. The method of claim 1 , wherein the text is a first text in a first language; obtaining data indicating a set of second linguistic features corresponding to a second text in a second language that is different from the first language;', 'providing (i) data indicating the second linguistic features and (ii) data indicating the language of the second text as input to the neural network that has been trained to provide output indicating prosody information for multiple languages;', 'receiving, from the neural network, second output indicating prosody information for the second linguistic ...
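A minimal sketch of feeding linguistic features plus a language indicator to one shared network, assuming a one-hot language code and a tiny two-layer network with random weights in place of the trained multilingual model; the output is read as a toy prosody vector.

```python
import numpy as np

rng = np.random.default_rng(4)
languages = {"en": 0, "fr": 1, "de": 2}

def prosody_from_features(linguistic, language, w1, w2):
    """One shared network for all languages: input = linguistic features + language one-hot."""
    lang_onehot = np.zeros(len(languages))
    lang_onehot[languages[language]] = 1.0
    h = np.tanh(np.concatenate([linguistic, lang_onehot]) @ w1)
    return h @ w2            # e.g. [duration, log-F0, energy] for one linguistic unit

feat_dim, hidden, out = 12, 16, 3
w1 = rng.normal(size=(feat_dim + len(languages), hidden)) * 0.1
w2 = rng.normal(size=(hidden, out)) * 0.1
print(prosody_from_features(rng.random(feat_dim), "fr", w1, w2))
```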

28-02-2019 publication date

SPEECH SYNTHESIS DICTIONARY DELIVERY DEVICE, SPEECH SYNTHESIS SYSTEM, AND PROGRAM STORAGE MEDIUM

Number: US20190066656A1
Assignee:

A speech synthesis dictionary delivery device that delivers a dictionary for performing speech synthesis to terminals, comprises a storage device for speech synthesis dictionary database that stores a first dictionary which includes an acoustic model of a speaker and is associated with identification information of the speaker, that stores a second dictionary which includes an acoustic model generated using voice data of a plurality of speakers, and that stores parameter sets of the speakers to be used with the second dictionary and which are associated with identification information of the speakers, a processor that determines one of the first dictionary and the second dictionary, which should be used in the terminal for a specified speaker, and an input output interface (I/F) that receives the identification information of a speaker transmitted from the terminal and then delivers at least one of a first dictionary, the second dictionary, and a parameter set of the second dictionary, on the basis of the received identification information of the speaker and a result of the determination by the processor. 1. A speech synthesis dictionary delivery device that delivers a dictionary for performing speech synthesis to a terminal , comprising:a storage device for a speech synthesis dictionary database, that stores a first dictionary which includes an acoustic model of a speaker and is associated with identification information of the speaker, that stores a second dictionary which includes an acoustic model generated using voice data of a plurality of speakers, and that stores parameter sets of the speakers to be used with the second dictionary and which are associated with identification information of the speakers;a processor that determines one of the first dictionary and the second dictionary, which should be used in the terminal for a specified speaker; andan input output interface (I/F) that receives identification information of a speaker transmitted from the ...

12-03-2015 publication date

MULTILINGUAL SPEECH SYSTEM AND METHOD OF CHARACTER

Number: US20150073772A1
Assignee: FUTURE ROBOT CO., LTD.

The present invention relates to multilingual speech system and method, wherein a two-dimensional or three-dimensional character speaks to express messages in multiple languages according to surroundings whereby messages such as consultations or guide services or the like can be precisely delivered through the character. To accomplish the objective, the multilingual speech system of the character according to the present invention includes a context-aware unit to recognize the surroundings; a conversation selection unit to select spoken words in accordance with the recognized surroundings; a Unicode multilingual database in which the spoken words are stored in Unicode-based multiple languages according to languages of respective nations; a behavior expression unit to express behaviors in accordance with the selected spoken words; and a work processing unit to synchronize and express the selected spoken words and the behaviors according to the spoken words. 1. A multilingual speech system of a character , comprising:a context-aware unit to recognize surroundings;a conversation selection unit to select spoken words in accordance with the recognized surroundings;a Unicode multilingual database in which the spoken words are stored in Unicode-based multiple languages according to languages for each nation;a behavior expression unit to express behaviors in accordance with the selected spoken words; anda work processing unit to synchronize and express the selected spoken words and the behaviors according to the spoken words.2. The multilingual speech system of the character according to claim 1 , further comprising a feeling production unit to select feelings in accordance with the recognized surroundings claim 1 ,wherein the work processing unit synchronizes and expresses the selected feelings and the behaviors according to the spoken words.3. The multilingual speech system of the character according to claim 1 , wherein the Unicode multilingual database additionally ...

27-02-2020 publication date

SPEECH SYNTHESIS DEVICE, SPEECH SYNTHESIS METHOD, AND COMPUTER PROGRAM PRODUCT

Number: US20200066250A1
Assignee: TOSHIBA DIGITAL SOLUTIONS CORPORATION

A speech synthesis device according to an embodiment includes a speech synthesizing unit, a speaker parameter storing unit, an availability determining unit, and a speaker parameter control unit. Based on a speaker parameter value representing a set of values of parameters related to the speaker individuality, the speech synthesizing unit is capable of controlling the speaker individuality of synthesized speech. The speaker parameter storing unit is used to store already-registered speaker parameter values. Based on the result of comparing an input speaker parameter value with each already-registered speaker parameter value, the availability determining unit determines the availability of the input speaker parameter value. The speaker parameter control unit prohibits or restricts the use of the input speaker parameter value that is determined to be unavailable by the availability determining unit. 1. A speech synthesis device comprising:a speech synthesizing unit that, based on a speaker parameter value representing a set of values of parameters related to speaker individuality, is capable of controlling the speaker individuality of synthesized speech;a speaker parameter storing unit that is used to store an already-registered speaker parameter value;an availability determining unit that, based on a result of comparing an input speaker parameter value with each already-registered speaker parameter value, determines availability of the input speaker parameter value; anda speaker parameter control unit that prohibits or restricts use of the input speaker parameter value that is determined to be unavailable by the availability determining unit.2. The speech synthesis device according to claim 1 , further comprising a speech synthesis model storing unit that is used to store a speech synthesis model including a base model obtained by modeling base speaker individuality and a speaker individuality control model obtained by modeling features of factors of speaker ...
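A minimal sketch of the availability check, assuming speaker parameter values are numeric vectors and that "too close to an already-registered value" is decided by a Euclidean-distance threshold; the threshold and the vectors are illustrative.

```python
import numpy as np

registered = {                       # already-registered speaker parameter values
    "speaker_A": np.array([0.2, 0.9, 0.4, 0.1]),
    "speaker_B": np.array([0.8, 0.3, 0.6, 0.7]),
}

def is_available(candidate, threshold=0.15):
    """Reject an input parameter value that is too close to a registered speaker's value."""
    for name, params in registered.items():
        if np.linalg.norm(candidate - params) < threshold:
            return False, name
    return True, None

ok, clash = is_available(np.array([0.21, 0.88, 0.41, 0.12]))
print(ok, clash)     # (False, 'speaker_A'): use would be prohibited or restricted
```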

11-03-2021 publication date

Generation of Speech with a Prosodic Characteristic

Number: US20210074260A1

A computer system that generates output speech is described. During operation, the computer system may receive an input associated with a type of interaction. Then, the computer system may generate, using a voice synthesis engine, the output speech corresponding to an individual based at least in part on the input, where the voice synthesis engine predicts positions and duration of a prosodic characteristic of speech by the individual, and selectively adds the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction. Note that the prosodic characteristic may include: pauses in the speech by the individual, and/or disfluences in the speech by the individual. 1. A computer system , comprising:a computation device;memory configured to store program instructions, wherein, when executed by the computation device, the program instructions cause the computer system to perform one or more operations comprising:receiving an input associated with a type of interaction; andgenerating, using a voice synthesis engine, output speech corresponding to an individual based at least in part on the input, wherein the voice synthesis engine is configured to predict positions and duration of a prosodic characteristic of speech by the individual, and to selectively add the prosodic characteristic of the speech by the individual in the output speech based at least in part on the prediction, andwherein the prosodic characteristic comprises one of: pauses in the speech by the individual, or disfluences in the speech by the individual.2. The computer system of claim 1 , wherein the input comprises one of: text; or speech of a second individual claim 1 , who is different from the individual.3. The computer system of claim 1 , wherein one or more operations comprise generating claim 1 , using a rendering engine claim 1 , video of a visual representation corresponding to the individual based at least in part on the output speech; ...

11-03-2021 publication date

METHOD FOR SYNTHESIZED SPEECH GENERATION USING EMOTION INFORMATION CORRECTION AND APPARATUS

Number: US20210074261A1
Assignee:

A method includes generating first synthesized speech by using text and a first emotion vector configured for the text, extracting a second emotion vector included in the first synthesized speech, determining whether correction of the second emotion information vector is needed by comparing a loss value calculated by using the first emotion information vector and the second emotion information vector with a preconfigured threshold, re-performing speech synthesis by using a third emotion information vector generated by correcting the second emotion information vector, and outputting the generated synthesized speech, thereby configuring emotion information of speech in a more effective manner. A speech synthesis apparatus may be associated with an artificial intelligence module, drone (unmanned aerial vehicle, UAV), robot, augmented reality (AR) devices, virtual reality (VR) devices, devices related to 5G services, and the like. 1. A method for generating synthesized speech , the method comprising:generating first synthesized speech by using text and a first emotion vector configured for the text;extracting a second emotion vector included in the first synthesized speech;determining whether correction of the second emotion information vector is needed by comparing a loss value calculated by using the first emotion information vector and the second emotion information vector with a preconfigured threshold;based on the loss value calculated by using the first emotion information vector and the second emotion information vector exceeding a preconfigured threshold, generating a third emotion information vector by correcting the second emotion information vector based on the first emotion information vector and generating second synthesized speech by using the third emotion information vector; andoutputting the second synthesized speech,wherein a loss value calculated by using the first emotion information vector and an emotion information vector included in the second ...
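A minimal sketch of the correction loop, with a stand-in synthesizer that deliberately under-expresses the requested emotion; the mean-squared loss, the threshold, and the additive correction are assumptions for the sketch rather than the disclosed formulas.

```python
import numpy as np

def synthesize(text, emotion):
    """Stand-in for the TTS engine; returns (audio, emotion vector actually realised in the audio)."""
    realised = 0.8 * emotion              # pretend the synthesizer under-expresses the emotion
    return f"<speech:{text}>", realised

def synthesize_with_correction(text, target_emotion, threshold=0.05):
    audio, realised = synthesize(text, target_emotion)           # first synthesized speech
    loss = float(np.mean((target_emotion - realised) ** 2))      # compare first and second vectors
    if loss <= threshold:
        return audio                                             # no correction needed
    corrected = target_emotion + (target_emotion - realised)     # third (corrected) emotion vector
    audio, _ = synthesize(text, corrected)                       # second synthesized speech
    return audio

target = np.array([0.9, 0.1, 0.0])        # e.g. (happy, sad, angry) weights
print(synthesize_with_correction("See you soon!", target))
```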

19-03-2015 publication date

ARRANGEMENT AND A METHOD FOR CREATING A SYNTHESIS FROM NUMERICAL DATA AND TEXTUAL INFORMATION

Number: US20150081307A1
Assignee: VIVAGO OY

An arrangement and a method for creating a synthesis from numerical data and textual information, and more particularly, relating to wellbeing of an individual. The arrangement includes a first information in a numerical form provided by a wellbeing device, for example, a second information including free-form text in natural language format provided by care-giving personnel, for example, and a control entity for obtaining the first information and the second information, wherein the control entity is arranged to semantically analyze the free-form text in natural language format of the second information in order to create a synthesis from the first information and the second information. 2. An arrangement for creating a synthesis according to further comprising one or more external measurement devices arranged to provide a third information claim 1 , which third information is included in the obtained information in control entity for creating said synthesis.3. An arrangement for creating a synthesis according to further comprising one or more external databases claim 1 , from which a fourth information is obtained and included in the other obtained information in control entity for creating said synthesis.4. An arrangement for creating a synthesis according to claim 1 , wherein said synthesis is provided in a textual form.5. An arrangement for creating a synthesis according to claim 1 , wherein said first information is provided by a wellbeing device claim 1 , which is arranged to monitor said individual's wellbeing.6. An arrangement for creating a synthesis according to claim 1 , wherein said second information is provided by caregiver personnel.7. An arrangement for creating a synthesis according to claim 1 , wherein a synthesis is created from information of a selectable a time period.8. A method for creating a synthesis from numerical data and textual information by using an arrangement according to comprising at least following steps:obtaining information ...

05-06-2014 publication date

SPEECH PROCESSING SYSTEM

Number: US20140156280A1
Author: Ranniery Maia
Assignee: KABUSHIKI KAISHA TOSHIBA

A method of deriving speech synthesis parameters from an audio signal, the method comprising: 1. A method of deriving speech synthesis parameters from an audio signal , the method comprising:receiving an input speech signal;estimating the position of glottal closure incidents from said audio signal;deriving a pulsed excitation signal from the position of the glottal closure incidents;segmenting said audio signal on the basis of said glottal closure incidents, to obtain segments of said audio signal;processing the segments of the audio signal to obtain the complex cepstrum and deriving a synthesis filter from said complex cepstrum;reconstructing said speech audio signal to produce a reconstructed speech signal using an excitation model where the pulsed excitation signal is passed through said synthesis filter;comparing said reconstructed speech signal with said input speech signal; andcalculating the difference between the reconstructed speech signal and the input speech signal and modifying either the pulsed excitation signal or the complex cepstrum to reduce the difference between the reconstructed speech signal and the input speech.2. A method according to claim 1 , comprising modifying both the pulsed excitation signal and the complex cepstrum to reduce the difference between the reconstructed speech and the input speech.3. A method according to claim 2 , wherein modifying the pulsed excitation signal and the complex cepstrum comprises the process of:optimising the position of the pulses in said excitation signal to reduce the mean squared error between reconstructed speech and the input speech; andrecalculating the complex cepstrum using the optimised pulse positions, wherein the process is repeated until the position of the pulses and the complex cepstrum results in a minimum difference between the reconstructed speech and the input speech.4. A method according to claim 2 , wherein the difference between the reconstructed speech and the input speech is ...

More details
16-03-2017 publication date

VOICE SYNTHESIZING DEVICE, VOICE SYNTHESIZING METHOD, AND COMPUTER PROGRAM PRODUCT

Number: US20170076714A1
Assignee:

According to one embodiment, a voice synthesizing device includes a first operation receiving unit, a score transforming unit, and a voice synthesizing unit. The first operation receiving unit is configured to receive a first operation specifying voice quality of a desired voice based on one or more upper level expressions indicating the voice quality. The score transforming unit is configured to transform, based on a score transformation model that transforms a score of the upper level expression into a score of a lower level expression which is less abstract than the upper level expression, the score of the upper level expression corresponding to the first operation into a score of one or more lower level expressions. The voice synthesizing unit is configured to generate a synthetic sound corresponding to a certain text based on the score of the lower level expression. 1. A voice synthesizing device comprising: a first operation receiving unit configured to receive a first operation specifying voice quality of a desired voice based on one or more upper level expressions indicating the voice quality; a score transforming unit configured to transform, based on a score transformation model that transforms a score of the upper level expression into a score of a lower level expression which is less abstract than the upper level expression, a score of the upper level expression corresponding to the first operation into a score of one or more lower level expressions; and a voice synthesizing unit configured to generate a synthetic sound corresponding to a certain text based on the score of the lower level expression. 2. The voice synthesizing device according to claim 1, further comprising a second operation receiving unit configured to receive a second operation to change the score of the lower level expression resulting from transformation, wherein the voice synthesizing unit generates the synthetic sound based on the score of the lower level expression changed based on the second ...
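An illustrative sketch of the score transformation step: abstract "upper level" voice-quality scores are mapped to concrete "lower level" scores before synthesis. The linear model, the expression names, and the 0..1 score range are assumptions; the publication does not specify this particular model.

```python
# Sketch of a score transformation model from upper level to lower level expressions.
import numpy as np

UPPER = ["gentle", "energetic"]
LOWER = ["pitch_height", "speaking_rate", "breathiness"]

# Rows = lower level expressions, columns = upper level expressions (invented weights).
W = np.array([[ 0.3,  0.6],
              [-0.2,  0.8],
              [ 0.7, -0.4]])

def transform(upper_scores):
    """Map a dict of upper level scores (assumed 0..1) to lower level scores."""
    x = np.array([upper_scores[name] for name in UPPER])
    return dict(zip(LOWER, W @ x))

print(transform({"gentle": 0.9, "energetic": 0.2}))   # first operation -> lower scores
# A "second operation" could simply overwrite entries of the returned dict
# before they are handed to the voice synthesizing unit.
```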

More details
16-03-2017 publication date

TRAINING APPARATUS FOR SPEECH SYNTHESIS, SPEECH SYNTHESIS APPARATUS AND TRAINING METHOD FOR TRAINING APPARATUS

Number: US20170076715A1
Assignee:

According to one embodiment, a training apparatus for speech synthesis includes a storage device and a hardware processor in communication with the storage device. The storage device stores an average voice model, training speaker information representing a feature of speech of a training speaker, and perception representation information represented by scores of one or more perception representations related to voice quality of the training speaker, the average voice model constructed by utilizing acoustic data extracted from speech waveforms of a plurality of speakers and language data. The hardware processor, based at least in part on the average voice model, the training speaker information, and the perception representation score, trains one or more perception representation acoustic models corresponding to the one or more perception representations. 1. A training apparatus for speech synthesis, the training apparatus comprising: a storage device that stores an average voice model, training speaker information representing a feature of speech of a training speaker and perception representation score information represented by scores of one or more perception representations related to voice quality of the training speaker, the average voice model constructed by utilizing acoustic data extracted from speech waveforms of a plurality of speakers and language data; and a hardware processor in communication with the storage device and configured to, based at least in part on the average voice model, the training speaker information, and the perception representation score, train one or more perception representation acoustic models corresponding to the one or more perception representations. 2. The training apparatus according to claim 1, wherein the one or more perception representations comprise at least one of gender, age, brightness, deepness, or clearness of speech. 3. The training apparatus according to claim 1, wherein the training ...
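A rough sketch of the training idea: for each perception representation (gender, brightness, and so on), fit a model that maps its score to an offset from the average voice model's acoustic parameters. Plain least squares stands in for whatever estimation the apparatus actually uses; all names and shapes are illustrative assumptions.

```python
# Sketch: one regression per perception representation, trained against offsets
# of the training speakers' acoustic parameters from the average voice model.
import numpy as np

def train_perception_models(avg_params, speaker_params, scores):
    """avg_params: (D,); speaker_params: (N, D); scores: (N, K) perception scores."""
    offsets = speaker_params - avg_params            # deviation of each training speaker
    models = []
    for k in range(scores.shape[1]):                  # one model per perception axis
        s = scores[:, k:k + 1]                        # (N, 1)
        w, *_ = np.linalg.lstsq(s, offsets, rcond=None)
        models.append(w[0])                           # (D,) direction for this axis
    return np.stack(models)                           # (K, D)

def synthesize_params(avg_params, models, target_scores):
    # Target perception scores weight the per-axis directions around the average voice.
    return avg_params + target_scores @ models
```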

More details
18-03-2021 publication date

A HIGHLY EMPATHETIC TTS PROCESSING

Number: US20210082396A1
Author: LIU Shihui, Luan Jian
Assignee:

The present disclosure provides a technical solution of highly empathetic TTS processing, which not only takes a semantic feature and a linguistic feature into consideration, but also assigns a sentence ID to each sentence in a training text to distinguish sentences in the training text. Such sentence IDs may be introduced as training features into a process of training a machine learning model, so as to enable the machine learning model to learn a changing rule for the changing of acoustic codes of sentences with the context of a sentence. A speech naturally changed in rhythm and tone may be output to make TTS more empathetic by performing TTS processing with the trained model. A highly empathetic audio book may be generated using the TTS processing provided herein, and an online system for generating a highly empathetic audio book may be established with the TTS processing as a core technology. 1. An electronic apparatus, comprising: a processing unit; and a memory, coupled to the processing unit and containing instructions stored thereon, the instructions causing the electronic apparatus to perform operations upon being executed by the processing unit, the operations comprising: extracting a text feature from each sentence in an input text, to acquire a semantic code of sentence and a linguistic feature of sentence of the each sentence in the input text; performing searching of similarity matching in a dictionary of acoustic codes of sentences according to the semantic code of sentence of the each sentence in the input text, to acquire an acoustic code of sentence matched with the semantic code of sentence of the each sentence, wherein the dictionary of acoustic codes of sentences comprises a plurality of items consisting of semantic codes of sentence, IDs of sentence, and acoustic codes of sentence, which have a mapping relationship therebetween; and inputting the acoustic code of sentence and the linguistic feature of sentence of the each sentence in the input text into ...
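A small sketch of the dictionary lookup step described in the claim: given a sentence's semantic code, find the most similar semantic code in the dictionary and return the acoustic code stored with it. Cosine similarity is assumed as the matching criterion; the triple layout and names are illustrative.

```python
# Sketch of similarity matching in a dictionary of acoustic codes of sentences.
import numpy as np

def lookup_acoustic_code(semantic_code, dictionary):
    """dictionary: list of (sentence_id, semantic_code, acoustic_code) triples."""
    q = semantic_code / np.linalg.norm(semantic_code)
    best = max(dictionary,
               key=lambda item: float(np.dot(q, item[1] / np.linalg.norm(item[1]))))
    sentence_id, _, acoustic_code = best
    # The acoustic code is then fed, with the sentence's linguistic feature,
    # into the trained model to produce speech.
    return sentence_id, acoustic_code
```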

More details
18-03-2021 publication date

System and method for accent classification

Number: US20210082402A1
Assignee: Cerence Operating Co

A system and/or method receives speech input including an accent. The accent is classified with an accent classifier to yield an accent classification. Automatic speech recognition is performed based on the speech input and the accent classification to yield an automatic speech recognition output. Natural language understanding is performed on the speech recognition output and the accent classification, determining an intent of the speech recognition output. Natural language generation generates an output based on the speech recognition output, the intent, and the accent classification. An output is rendered using text-to-speech based on the natural language generation and the accent classification.
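A schematic sketch of the pipeline above, in which a single accent classification conditions every stage (ASR, NLU, NLG, TTS). The stage objects and method names are placeholders standing in for real models; only the data flow is meant to mirror the abstract.

```python
# Sketch: one accent classification threaded through ASR, NLU, NLG, and TTS.
from dataclasses import dataclass

@dataclass
class AccentConditionedPipeline:
    classifier: object
    asr: object
    nlu: object
    nlg: object
    tts: object

    def respond(self, speech_input):
        accent = self.classifier.classify(speech_input)
        text = self.asr.recognize(speech_input, accent=accent)
        intent = self.nlu.understand(text, accent=accent)
        reply = self.nlg.generate(text, intent, accent=accent)
        return self.tts.render(reply, accent=accent)   # spoken output matches the accent
```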

More details
22-03-2018 publication date

TEXT-TO-SPEECH METHOD AND SYSTEM

Number: US20180082675A1
Author: Wang Sung-Wen
Assignee:

A text-to-speech method includes: receiving a text series, and generating a plurality of phonemes corresponding to the text series, wherein the phonemes form a phoneme series; inserting a pause phoneme into the phoneme series; dividing the phoneme series and the pause phoneme into a plurality of phoneme sub-series by using the pause phoneme as a dividing point, and generating a plurality of speech segments according to the phoneme sub-series; and performing a speech synthesis operation individually on the speech segments to generate a plurality of speech outputs corresponding to the plurality of speech segments. The pause phoneme is a last phoneme of the phoneme sub-series in which the pause phoneme locates. 1. A text-to-speech method, comprising: receiving a text series, and generating a plurality of phonemes corresponding to the text series, wherein the plurality of phonemes form a phoneme series; inserting at least one pause phoneme into the phoneme series; and dividing the phoneme series and the at least one pause phoneme into a plurality of phoneme sub-series by using the at least one pause phoneme as a dividing point, and generating a plurality of speech segments, wherein each of the speech segments comprises a plurality of text labels that comprise relationships of the plurality of phonemes; wherein the at least one pause phoneme is a last phoneme of the phoneme sub-series in which the at least one pause phoneme locates. 2. The text-to-speech method according to claim 1, wherein the step of inserting the at least one pause phoneme into the phoneme series comprises: inserting a pause phoneme of the at least one pause phoneme at a corresponding punctuation in the text series. 3. The text-to-speech method according to claim 1, wherein the step of inserting the at least one pause phoneme into the phoneme series comprises: determining a pause position of a pause phoneme of the at least one pause phoneme according to a length of a buffer memory; and inserting the pause ...
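A minimal sketch of the splitting rule stated above: the phoneme series is divided at pause phonemes so that each pause is the last phoneme of its sub-series, and each sub-series can then be synthesized on its own. The "pau" symbol used as the pause-phoneme label is an assumption.

```python
# Sketch: split a phoneme series at pause phonemes; each pause closes a sub-series.
def split_at_pauses(phonemes, pause="pau"):
    sub_series, current = [], []
    for p in phonemes:
        current.append(p)
        if p == pause:                 # pause phoneme is the last phoneme of the sub-series
            sub_series.append(current)
            current = []
    if current:
        sub_series.append(current)
    return sub_series

print(split_at_pauses(["h", "e", "l", "o", "pau", "w", "r", "l", "d"]))
# [['h', 'e', 'l', 'o', 'pau'], ['w', 'r', 'l', 'd']]
```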

More details
22-03-2018 publication date

OPTIMAL HUMAN-MACHINE CONVERSATIONS USING EMOTION-ENHANCED NATURAL SPEECH USING HIERARCHICAL NEURAL NETWORKS AND REINFORCEMENT LEARNING

Number: US20180082679A1
Assignee:

A system and method for emotion-enhanced natural speech using dilated convolutional neural networks, wherein an audio processing server receives a raw audio waveform from a dilated convolutional artificial neural network, associates text-based emotion content markers with portions of the raw audio waveform to produce an emotion-enhanced audio waveform, and provides the emotion-enhanced audio waveform to the dilated convolutional artificial neural network for use as a new input data set. 1. A system for emotion-enhanced natural speech audio generation using dilated convolutional neural networks, comprising: an automated emotion engine comprising at least a plurality of programming instructions stored in a memory and operating on a processor of a network-connected computing device and configured to provide a plurality of input data to, and receive a plurality of output data from, a dilated convolutional artificial neural network; wherein the automated emotion engine is configured to receive at least a raw audio waveform from the dilated convolutional artificial neural network; and wherein the automated emotion engine is configured to recognize a plurality of emotional states within the raw audio waveform. 2. The system of claim 1, wherein the automated emotion engine is configured to produce an emotion-enhanced audio waveform by associating a plurality of emotion content markers, each comprising at least a text label describing an emotional state, with at least a portion of the audio waveform. 3. The system of claim 2, wherein the automated emotion engine is configured to provide the emotion-enhanced audio waveform to the dilated convolutional artificial neural network as an input data set. 4. The system of claim 1, wherein at least a portion of the emotion content markers are based on a text-to-speech script that was used in the generation of the raw audio waveform. 5. A method for emotion-enhanced natural speech audio generation using dilated ...
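A small sketch of the marker-association step: text-based emotion content markers are attached to portions of a raw waveform, producing an "emotion-enhanced" structure that could be fed back to the network as input data. The marker fields and sample-index convention are assumptions for illustration.

```python
# Sketch: associate text-based emotion content markers with waveform portions.
from dataclasses import dataclass

@dataclass
class EmotionMarker:
    start_sample: int
    end_sample: int
    label: str            # e.g. "calm"; could come from the TTS script or a recognizer

def enhance(waveform, markers):
    """Pair each marker's text label with its slice of the raw waveform."""
    return [(m.label, waveform[m.start_sample:m.end_sample]) for m in markers]

audio = list(range(16000))             # stand-in for one second of samples
enhanced = enhance(audio, [EmotionMarker(0, 8000, "calm"),
                           EmotionMarker(8000, 16000, "excited")])
print([label for label, _ in enhanced])   # ['calm', 'excited']
```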

More details
31-03-2022 publication date

Variational Embedding Capacity in Expressive End-to-End Speech Synthesis

Number: US20220101826A1
Assignee: Google LLC

A method for estimating an embedding capacity includes receiving, at a deterministic reference encoder, a reference audio signal, and determining a reference embedding corresponding to the reference audio signal, the reference embedding having a corresponding embedding dimensionality. The method also includes measuring a first reconstruction loss as a function of the corresponding embedding dimensionality of the reference embedding and obtaining a variational embedding from a variational posterior. The variational embedding has a corresponding embedding dimensionality and a specified capacity. The method also includes measuring a second reconstruction loss as a function of the corresponding embedding dimensionality of the variational embedding and estimating a capacity of the reference embedding by comparing the first measured reconstruction loss for the reference embedding relative to the second measured reconstruction loss for the variational embedding having the specified capacity.
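A sketch of the capacity-estimation idea stated above: measure the reconstruction loss of the deterministic reference embedding, compare it against losses obtained from variational embeddings whose capacity is known, and read off the capacity whose loss is closest. The curve values and the nearest-loss rule are made-up illustrations, not results from the publication.

```python
# Sketch: estimate reference-embedding capacity by matching reconstruction losses.
def estimate_capacity(reference_loss, variational_curve):
    """variational_curve: list of (capacity_in_nats, reconstruction_loss) pairs."""
    return min(variational_curve, key=lambda c: abs(c[1] - reference_loss))[0]

curve = [(10, 0.52), (50, 0.41), (100, 0.35), (300, 0.30)]   # hypothetical measurements
print(estimate_capacity(0.36, curve))   # -> 100: reference embedding behaves like ~100 nats
```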

More details
12-03-2020 publication date

COMPUTATIONALLY-ASSISTED MUSICAL SEQUENCING AND/OR COMPOSITION TECHNIQUES FOR SOCIAL MUSIC CHALLENGE OR COMPETITION

Number: US20200082802A1
Assignee:

In an application that manipulates audio (or audiovisual) content, automated music creation technologies may be employed to generate new musical content using digital signal processing software hosted on handheld and/or server (or cloud-based) compute platforms to intelligently process and combine a set of audio content captured and submitted by users of modern mobile phones or other handheld compute platforms. The user-submitted recordings may contain speech, singing, musical instruments, or a wide variety of other sound sources, and the recordings may optionally be preprocessed by the handheld devices prior to submission. 1. A social music method comprising: capturing, using an audio interface of a portable computing device, a vocal performance by a user of the portable computing device; in an audio processing pipeline, computationally segmenting the captured vocal performance, temporally remapping segments of the segmented vocal performance, and generating from the temporal remapping a derived musical composition; audibly rendering the derived musical composition at the portable computing device; coordinating, in a video pipeline, video in accordance with the temporal remapping; and responsive to a selection by the user, causing a challenge to be transmitted to a remote second user, the challenge including an encoding of the derived musical composition and a seed corresponding to the user's captured vocal performance. 2. The method of claim 1, wherein the seed includes at least a portion of the user's captured vocal performance. 3. The method of claim 1, wherein the seed encodes the segmentation of the user's captured vocal performance. 4. The method of claim 1, wherein the coordinated video includes coordinated video effects. 5. The method of claim 1, further comprising: receiving, for a vocal contribution of the remote second user captured in response to the challenge, a segmentation of the remote second user's vocal contribution; and from a combined segment set ...
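A toy sketch of "temporal remapping": segments of a captured vocal take are placed onto a beat grid to derive a new composition. The beat grid and the round-robin placement rule are purely illustrative assumptions; the publication's actual remapping is not specified here.

```python
# Sketch: place captured vocal segments onto an assumed beat grid.
def remap_segments(segments, beat_grid):
    """segments: list of audio chunks; beat_grid: list of start times in seconds."""
    placement = []
    for i, start in enumerate(beat_grid):
        seg = segments[i % len(segments)]          # cycle through captured segments
        placement.append({"start_s": start, "segment": seg})
    return placement                               # rendered later by the audio pipeline

print(remap_segments(["seg_a", "seg_b", "seg_c"], [0.0, 0.5, 1.0, 1.5]))
```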

More details
12-03-2020 publication date

MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS

Number: US20200082806A1
Author: Kim Taesu, Lee Younggun
Assignee: NEOSAPIENCE, INC.

A multilingual text-to-speech synthesis method and system are disclosed. The method includes receiving first learning data including a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language, receiving second learning data including a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language, and generating a single artificial neural network text-to-speech synthesis model by learning similarity information between phonemes of the first language and phonemes of the second language based on the first learning data and the second learning data. 1. A multilingual text-to-speech synthesis method comprising: receiving first learning data including a learning text of a first language and learning speech data of the first language corresponding to the learning text of the first language; receiving second learning data including a learning text of a second language and learning speech data of the second language corresponding to the learning text of the second language; and generating a single artificial neural network text-to-speech synthesis model by learning similarity information between phonemes of the first language and phonemes of the second language based on the first learning data and the second learning data. 2. The multilingual text-to-speech synthesis method of claim 1, further comprising: receiving an articulatory feature of a speaker regarding the first language; receiving an input text of the second language; and generating output speech data for the input text of the second language that simulates the speaker's speech by inputting the input text of the second language and the articulatory feature of the speaker regarding the first language to the single artificial neural network text-to-speech synthesis model. 3. The multilingual text-to-speech synthesis method of claim 2, wherein the articulatory ...
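A small sketch of the "phoneme similarity" idea: once the phonemes of both languages are embedded in one shared space (as a single multilingual model would learn), cross-language similarity falls out as distance between embeddings. The two-dimensional example vectors and the phoneme pairs are invented purely for illustration.

```python
# Sketch: cosine similarity between phoneme embeddings of two languages.
import numpy as np

emb = {("en", "s"): np.array([0.9, 0.1]),
       ("ko", "ㅅ"): np.array([0.8, 0.2]),
       ("ko", "ㅁ"): np.array([-0.5, 0.7])}

def most_similar(lang_phone, candidates):
    v = emb[lang_phone]
    return max(candidates,
               key=lambda c: float(np.dot(v, emb[c]) /
                                   (np.linalg.norm(v) * np.linalg.norm(emb[c]))))

print(most_similar(("en", "s"), [("ko", "ㅅ"), ("ko", "ㅁ")]))   # -> ('ko', 'ㅅ')
```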

More details
12-03-2020 publication date

Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium

Number: US20200082807A1
Author: Taesu Kim, Younggun Lee
Assignee: Neosapience Inc

A text-to-speech synthesis method using machine learning is disclosed. The method includes generating a single artificial neural network text-to-speech synthesis model by performing machine learning based on a plurality of learning texts and speech data corresponding to the plurality of learning texts, receiving an input text, receiving an articulatory feature of a speaker, and generating output speech data for the input text reflecting the articulatory feature of the speaker by inputting the articulatory feature of the speaker to the single artificial neural network text-to-speech synthesis model.
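A schematic of the inference path described above: an articulatory (speaker) feature is obtained and fed, together with the input text, to a single trained text-to-speech model. Both classes and the reference-audio extraction step are placeholders I am assuming; only the call structure mirrors the abstract.

```python
# Sketch: condition a single TTS model on an articulatory feature of a speaker.
class SpeakerEncoder:
    def extract(self, reference_audio):
        return [0.12, -0.4, 0.7]                  # stand-in articulatory feature vector

class TTSModel:
    def synthesize(self, text, articulatory_feature):
        return f"<speech for {text!r} conditioned on {articulatory_feature}>"

encoder, tts = SpeakerEncoder(), TTSModel()
feature = encoder.extract("reference.wav")        # hypothetical reference recording
print(tts.synthesize("Hello there", feature))     # output reflects the speaker's voice
```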

More details