Total found: 5558. Showing 200.

10-05-2012 publication date

DEVICE AND METHOD FOR ENCODING A MULTICHANNEL AUDIO SIGNAL

Number: RU2450369C2

The invention relates to the encoding of a multichannel audio signal, in particular to downmixing a stereophonic speech signal to a monophonic signal for encoding by a monophonic coder, such as a linear-prediction coder. The technical result is improved coding quality and efficiency. This is achieved in that the device for encoding a multichannel audio signal comprises a receiver for a multichannel audio signal containing first and second audio signals from first and second microphones, and a time-difference module that determines the inter-time difference between the first and second audio signals by combining successive observations of cross-correlations between them; the cross-correlations are normalized to yield state probabilities, which are accumulated using a Viterbi algorithm to obtain an inter-time difference with built-in hysteresis, and the Viterbi algorithm ...
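
The frame-wise correlation step described above can be sketched compactly. In the following Python fragment the function and parameter names are mine, and the Viterbi-style accumulation with hysteresis is simplified to a plain sum of normalized correlations over successive frames; it is a sketch of the idea, not the patented algorithm:

```python
import numpy as np

def inter_time_difference(x, y, max_lag, frame_len=1024):
    """Estimate the inter-time difference (in samples) between two
    microphone signals by accumulating normalized cross-correlation
    evidence over successive frames."""
    scores = np.zeros(2 * max_lag + 1)
    for start in range(0, len(x) - frame_len + 1, frame_len):
        fx = x[start:start + frame_len]
        fy = y[start:start + frame_len]
        for i, lag in enumerate(range(-max_lag, max_lag + 1)):
            if lag >= 0:
                a, b = fx[lag:], fy[:frame_len - lag]
            else:
                a, b = fx[:frame_len + lag], fy[-lag:]
            denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
            scores[i] += np.dot(a, b) / denom   # normalized correlation
    return int(np.argmax(scores)) - max_lag    # lag with the most evidence
```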

16-05-2019 publication date

DEVICE AND METHOD FOR PROCESSING AN ENCODED AUDIO SIGNAL

Number: RU2687872C1

The invention relates to the processing of audio signals, in particular to decoders. The device comprises a demultiplexer that forms, from frames of audio information, a core signal and a set of parameters, and an upsampler for upsampling the core signal and outputting a first upsampled spectrum and a second, temporally following upsampled spectrum. Both the first and the second upsampled spectrum have the same content as the core signal and have a second spectral width greater than the first spectral width of the core spectrum. A parameter converter converts parameters of said set of parameters of said access unit to obtain converted parameters, and a spectral-gap-filling processor processes said first upsampled spectrum and said second upsampled spectrum using said converted ...

27-03-2003 publication date

SPEECH CODING SYSTEM

Number: DE0069529672D1
Assignee: SONY CORP, SONY CORP., TOKIO/TOKYO

10-09-1997 publication date

Method and apparatus for speech enhancement in a speech communication system

Number: GB0009714001D0

22-03-2017 publication date

A spoken dialogue system, a spoken dialogue method and a method of adapting a spoken dialogue system

Number: GB0201701918D0

09-10-2013 publication date

Audio-Visual Dialogue System and Method

Number: GB0201315142D0

25-10-2000 publication date

Speech processing apparatus

Number: GB0002349259A

An input speech signal is processed to compensate for the effects of noise. The input speech signal is divided into a plurality of sequential time frames, and a set of spectral parameters is extracted for each time frame. The parameters for each frame are scaled in dependence upon a measure of the signal-to-noise ratio for the input frame. In this way, the effects of additive noise on the input signal can be reduced.
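
A minimal sketch of the scaling idea, assuming per-frame spectral parameters in a numpy array and an external noise-power estimate; the Wiener-style weighting function is my choice for illustration, not the patented rule:

```python
import numpy as np

def scale_by_snr(frame_params, noise_power, eps=1e-12):
    """Scale one frame's spectral parameters by a factor derived from
    the frame's estimated signal-to-noise ratio, so noisy frames
    contribute less."""
    signal_power = np.mean(frame_params ** 2)
    snr = signal_power / (noise_power + eps)
    weight = snr / (1.0 + snr)          # Wiener-style gain in [0, 1)
    return weight * frame_params
```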

03-11-1999 publication date

Improving speech intelligibility in presence of noise

Number: GB0002336978A

A speech communication system comprises a receiving unit 14 which receives speech data and uses that data to output speech 15. The characteristics of the speech are altered by processing unit 10 based upon the listener's current background noise before the speech is output to enhance its intelligibility to a listener. An analysis unit 12 determines the type and level of the background noise by a microphone 13 and decision unit 11 determines whether the speech currently received would be intelligible to an average listener in the current background noise. If unit 11 determines that the speech would be unintelligible, then processing unit 10 alters the speech before passing it to the output to make the speech more intelligible. Preferably the speech characteristics are altered by altering line spectral pair data representing the speech.

31-10-2007 publication date

Apparatus and method for encoding a multi channel audio signal

Number: GB0000718682D0

08-02-2017 publication date

A speech processing system

Number: GB0002508417B

02-07-1997 publication date

Speech synthesizer apparatus

Number: GB0009709696D0

15-10-2010 publication date

PROCEDURE FOR SPECTRAL SMOOTHING OF NOISY SIGNALS

Number: AT0000484822T

04-11-2021 publication date

Method for extracting speech from degraded signals by predicting the inputs to a speech vocoder

Number: AU2020242078A1

A method for parametric resynthesis (PR) that produces an audible signal. A degraded audio signal is received which includes a distorted target audio signal. A prediction model predicts parameters of the audible signal from the degraded signal. The prediction model was trained to minimize a loss function between the target audio signal and the predicted audible signal. The predicted parameters are provided to a waveform generator, which synthesizes the audible signal.

16-05-2019 publication date

A robust speaker recognition system based on dynamic time warping

Number: AU2019100372A4
Assignee: Gloria Li

The application relates to a robust speaker recognition system based on dynamic time warping (DTW). The invention lies in the field of voice recognition, specifically speaker recognition, and illustrates the basic principles and key technology, including MFCC and DTW. The invention consists of the following steps: to begin with, data are collected and divided into a training set and a test set. Secondly, in the training procedure, the training data are preprocessed and converted into MFCCs, then stored in a database. Thirdly, in the test procedure, the test data are processed in the same way to obtain their MFCCs, which are compared with those in the database to get the result. With some improvement to the traditional endpoint detection method, this invention is more robust against noise when extracting effective speech segments and achieves 100% accuracy on the authors' dataset, with 311.0524 seconds spent on recognition of each sample on average. An implementation in MATLAB is given ...
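
The DTW comparison at the heart of such a system is standard and easy to sketch. A minimal Python implementation over MFCC sequences, with array shapes assumed and no path constraints or pruning:

```python
import numpy as np

def dtw_distance(mfcc_a, mfcc_b):
    """Dynamic time warping distance between two MFCC sequences of
    shape (frames, coeffs); smaller means more similar speakers."""
    n, m = len(mfcc_a), len(mfcc_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(mfcc_a[i - 1] - mfcc_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

Identification then simply picks the enrolled template with the smallest distance to the test utterance.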

22-01-2001 publication date

Encoding and decoding with harmonic components and minimum phase

Number: AU0006292300A

28-04-1992 publication date

A METHOD OF, AND SYSTEM FOR, CODING ANALOGUE SIGNALS

Number: AU0008856791A

29-08-2019 publication date

Apparatus and method for processing an encoded audio signal

Number: AU2016373990B2
Assignee: Spruson & Ferguson

The invention refers to an apparatus for processing an encoded audio signal (100). The audio signal (100) comprises a sequence of access units (100'), each access unit comprising a core signal (101) with a first spectral width and parameters describing a spectrum above the first spectral width. The apparatus comprises: a demultiplexer (1) for generating, from an access unit (100') of the encoded audio signal (100), said core signal (101) and a set of said parameters (102); an upsampler (2) for upsampling said core signal (101) of said access unit (100') and outputting a first upsampled spectrum (103) and a timely consecutive second upsampled spectrum (103'), the first upsampled spectrum (103) and the second upsampled spectrum (103') both having the same content as the core signal (101) and having a second spectral width greater than the first spectral width of the core spectrum (101); a parameter converter (3) for converting parameters of said set of parameters (102) of said access ...

02-12-1993 publication date

Voice signal processing device

Number: AU0000644124B2

05-09-1995 publication date

REJECTION METHOD FOR SPEECH RECOGNITION

Number: CA0002013263C

A speech recognizer for recognizing unknown utterances in isolated-word, small-vocabulary speech has improved rejection of out-of-vocabulary utterances. Both a usual spectral representation including a dynamic component and an equalized representation are used to match unknown utterances to templates for in-vocabulary words. In a preferred embodiment, the representations are mel-based cepstra, with dynamic components being signed vector differences between pairs of primary cepstra. The equalized representation is the signed difference of each cepstral coefficient less an average value of the coefficients. Factors are generated from the ordered lists of templates to determine the probability of the top choice being a correct acceptance, with different methods being applied when the usual and equalized representations yield a different match. For additional enhancement, the rejection method may use templates corresponding to non-vocabulary utterances or decoys. If the top choice corresponds ...
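
The two representations can be illustrated in a few lines. A sketch assuming cepstra stacked as a (frames, coefficients) numpy array; the frame spacing used for the signed differences is an assumption:

```python
import numpy as np

def dynamic_and_equalized(cepstra, delta_span=2):
    """Return the dynamic component (signed differences between frame
    pairs) and the equalized representation (each coefficient less its
    long-term average), mirroring the two representations above."""
    dynamic = cepstra[delta_span:] - cepstra[:-delta_span]
    equalized = cepstra - cepstra.mean(axis=0, keepdims=True)
    return dynamic, equalized
```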

29-01-2002 publication date

METHOD OF, AND SYSTEM FOR, CODING ANALOGUE SIGNALS

Number: CA0002091754C
Assignee: KONINKLIJKE PHILIPS ELECTRONICS N.V.

In a Code Excited Linear Prediction (CELP) analogue signal coding system, sequences from a master codebook (40), which may be a one-dimensional codebook, are filtered (42) and then stored in slave codebooks (70, 72). Input analogue signals (20) are filtered (34, 36) and compared orthogonally (66, 78, 80) with sequences from the slave codebooks, and an optimum pair of sequences is selected. Since the comparisons are orthogonal, sequences can be selected from the codebooks (70, 72) and compared (78, 80) with the filtered incoming analogue signals entirely independently. Reduced-length sequences from the master codebook may be compared with orthogonalised analogue signals, since orthogonalised signals contain some redundancy. The master codebook may not need to be orthogonalised in some circumstances. Various means of orthogonalisation of the sequences are possible, including separation into odd and even sequences. Further orthogonalisations are possible, for example to give four comparisons ...

11-07-1996 publication date

SPEECH CODING METHOD USING ANALYSIS BY SYNTHESIS

Number: CA0002209623A1

A speech signal linear prediction analysis is performed for each frame of a speech signal to determine the coefficients of a short-term synthesis filter, and an open-loop analysis is performed to determine a degree of frame voicing. At least one closed-loop analysis is performed for each sub-frame to determine an excitation sequence which, when applied to the short-term synthesis filter, generates a synthetic signal representative of the speech signal. Each closed-loop analysis uses the impulse response of a filter consisting of the short-term synthesis filter and a perceptual weighting filter, truncating said impulse response to a truncation length that is no greater than the number of samples per sub-frame and is dependent on the energy distribution of said response and the degree of voicing of the frame.
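
The truncation idea can be shown in isolation. A sketch of synthesizing one candidate excitation through a truncated impulse response; the gain search, codebook loop and perceptual weighting are omitted and the names are mine:

```python
import numpy as np

def synthesize_candidate(excitation, impulse_response, trunc_len):
    """Pass a candidate excitation through the combined
    synthesis/weighting filter, keeping only the first trunc_len
    samples of its impulse response."""
    h = impulse_response[:trunc_len]              # truncated response
    return np.convolve(excitation, h)[:len(excitation)]

# Per the abstract, trunc_len is bounded by the sub-frame length and
# chosen from the response's energy distribution and the voicing degree.
```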

30-11-2018 publication date

Voice model training method, voice recognition method, device, equipment and medium

Number: CN0108922515A
Author: TU HONG

25-05-2018 publication date

Identity verification method and device based on recurrent neural network

Number: CN0108074575A
Author: CHEN SHUDONG

19-02-2019 publication date

Deep neural network-based language recognition method

Number: CN0109360554A
Author: HONG CHUANGBO

16-04-2019 publication date

Self-adaptation method of a DNN acoustic model based on personal identity characteristics

Number: CN0109637526A
Author: LI YING, YAN BEIBEI, GUO XUDONG

10-07-2013 publication date

Frequency-axis warping coefficient estimation device, system and method

Number: CN101809652B
Author: EMORI TADASHI

12-01-2001 publication date

AUDIO CODING AND DECODING METHODS AND DEVICES

Number: FR0002796189A1

The coder estimates a fundamental frequency (F0) of the audio signal, determines a spectrum of the audio signal by a frequency-domain transform of a frame of the audio signal, and includes, in the digital stream transmitted to the decoder, coding data for a harmonic component of the audio signal comprising data representative of spectral amplitudes associated with frequencies that are multiples of the estimated fundamental frequency. The coding data for the harmonic component further comprise, for at least one of the multiples of the estimated fundamental frequency, data (iΔφ) relating to the phase of the spectrum of the audio signal in the vicinity of that multiple frequency.

01-10-2004 publication date

METHOD FOR ANALYSING FUNDAMENTAL FREQUENCY INFORMATION, AND VOICE CONVERSION METHOD AND SYSTEM IMPLEMENTING SUCH AN ANALYSIS METHOD

Number: FR0002853125A1

A method for analysing fundamental frequency information contained in voice samples, characterised in that it comprises at least: a step (2) of analysing the voice samples, grouped into frames, to obtain, for each frame of samples, information relating to the spectrum and information relating to the fundamental frequency; a step (20) of determining a model representing the common spectrum and fundamental frequency characteristics of all the samples; and a step (30) of determining, from this model and the voice samples, a function that predicts the fundamental frequency from spectrum-related information alone.

29-07-2021 publication date

DETECTION AND CLASSIFICATION OF SIREN SIGNALS AND LOCALIZATION OF SIREN SIGNAL SOURCES

Number: US20210233554A1

In an embodiment, a method comprises: capturing, by one or more microphone arrays of a vehicle, sound signals in an environment; extracting frequency spectrum features from the sound signals; predicting, using an acoustic scene classifier and the frequency spectrum features, one or more siren signal classifications; converting the one or more siren signal classifications into one or more siren signal event detections; computing time delay of arrival estimates for the one or more detected siren signals; estimating one or more bearing angles to one or more sources of the one or more detected siren signals using the time delay of arrival estimates and a known geometry of the microphone array; and tracking, using a Bayesian filter, the one or more bearing angles. If a siren is detected, actions are performed by the vehicle depending on the location of the emergency vehicle and whether the emergency vehicle is active or inactive.
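
The bearing-angle step can be sketched for a single microphone pair under a far-field assumption; a real array would fuse several pairs and track the angles with the Bayesian filter mentioned above:

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at roughly 20 degrees C

def bearing_from_tdoa(tdoa_s, mic_spacing_m):
    """Convert a time-delay-of-arrival estimate (seconds) between one
    microphone pair into a bearing angle in radians, using the
    far-field approximation theta = arcsin(c * tau / d)."""
    sin_theta = np.clip(SPEED_OF_SOUND * tdoa_s / mic_spacing_m, -1.0, 1.0)
    return np.arcsin(sin_theta)
```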

06-10-2020 publication date

Method and system for diagnosing coronary artery disease (CAD) using a voice signal

Number: US0010796714B2
Assignee: VOCALIS HEALTH LTD.

The present invention extends to methods and systems for diagnosing coronary artery disease (CAD) in patients by using their voice signal, comprising receiving voice signal data indicative of speech from the patient.

26-09-2019 publication date

AUDIO INTERVAL DETECTION APPARATUS, METHOD, AND RECORDING MEDIUM

Number: US20190295529A1
Assignee: CASIO COMPUTER CO., LTD.

An audio interval detection apparatus comprising a processor and a storage storing instructions that, when executed by the processor, control the processor to: detect from a target audio signal a specified audio interval including a specified audio signal representing a state of a phoneme of a same consonant produced continuously over a period longer than a specified time, and by eliminating, from the target audio signal at least the detected specified audio interval, detect from the target audio signal an utterance audio interval that includes a speech utterance signal representing a speech utterance uttered by a speaker.

02-08-2016 publication date

Adaptive microphone sampling rate techniques

Number: US0009406313B2
Assignee: Intel Corporation

An apparatus for adjusting a microphone sampling rate, the apparatus including an input to receive an audio signal from a microphone and a front-end processing module. The front-end processing module is to generate a plurality of frames from the audio signal received by the microphone, determine a noise profile using the plurality of frames, and adjust a sampling rate of the microphone based on the determined noise profile.

12-01-2021 publication date

Uncertainty measure of a mixture-model based pattern classifier

Number: US0010891942B2

There are provided mechanisms for determining an uncertainty measure of a mixture-model based parametric classifier. A method is performed by a classification device. The method includes obtaining a short-term frequency representation of a multimedia signal. The short-term frequency representation defines an input sequence. The method includes classifying the input sequence to belong to one class of at least two available classes using the parametric classifier. The parametric classifier has been trained with a training sequence. The method includes determining an uncertainty measure of the classified input sequence based on a relation between posterior probabilities of the input sequence and posterior probabilities of the training sequence.

16-03-2021 publication date

Audio fingerprint extraction method and device

Number: US0010950255B2

An audio fingerprint extraction method and device are provided. The method includes: converting an audio signal to a spectrogram; determining one or more characteristic points in the spectrogram; in the spectrogram, determining one or more masks for the characteristic points; determining mean energy of each of the spectrum regions; determining one or more audio fingerprint bits according to mean energy of the plurality of spectrum regions in the one or more masks; judging credibility of the audio fingerprint bits to determine one or more weight bits; and combining the audio fingerprint bits and the weight bits to obtain an audio fingerprint. Each of the one or more masks includes a plurality of spectrum regions.
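
One plausible reading of the fingerprint-bit step, sketched below: bits come from comparing mean energies of consecutive spectrum regions inside a mask, and weight bits from how decisive each comparison is. The comparison rule and the credibility threshold are my assumptions, not the patented definitions:

```python
import numpy as np

def fingerprint_bits(region_energies):
    """Derive fingerprint bits and weight bits from the mean energies
    of a mask's spectrum regions."""
    e = np.asarray(region_energies, dtype=float)
    bits = (np.diff(e) > 0).astype(np.uint8)   # 1 when energy rises
    margins = np.abs(np.diff(e))               # credibility of each bit
    weights = (margins > np.median(margins)).astype(np.uint8)
    return bits, weights
```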

21-04-2020 publication date

Cepstral variance normalization for audio feature extraction

Number: US0010629184B2
Assignee: Intel Corporation

Cepstral variance normalization is described for audio feature extraction. In some embodiments a method includes receiving a sequence of frames of digitized audio from a microphone, determining a feature vector for a first frame of the sequence of frames, the feature vector being determined using an initial mean and an initial variance, updating the initial mean to a current mean using the determined feature vector for the first frame, updating the variance to a current variance using the current mean and the determined feature vector for the first frame, determining a next feature vector for each of subsequent frames of the sequence of frames, after determining a next feature vector for each subsequent frame, updating the current mean to a next current mean and updating the current variance to a next current variance and wherein determining a feature vector for a subsequent frame comprises using the next current mean and the next current variance, and sending the determined feature vectors ...
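
A sketch of the frame-by-frame update loop the abstract outlines; the exponential-average update rule and its smoothing constant are assumptions, not taken from the patent:

```python
import numpy as np

class RunningCMVN:
    """Frame-by-frame cepstral mean and variance normalization: the
    mean and variance start from initial values and are updated after
    every frame."""

    def __init__(self, dim, alpha=0.995):
        self.mean = np.zeros(dim)   # initial mean
        self.var = np.ones(dim)     # initial variance
        self.alpha = alpha

    def normalize(self, frame):
        normalized = (frame - self.mean) / np.sqrt(self.var + 1e-12)
        # update running statistics for the next frame
        self.mean = self.alpha * self.mean + (1 - self.alpha) * frame
        diff = frame - self.mean
        self.var = self.alpha * self.var + (1 - self.alpha) * diff ** 2
        return normalized
```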

22-04-2021 publication date

METHOD AND APPARATUS FOR EMOTION RECOGNITION FROM SPEECH

Number: US20210118464A1

Embodiments of the present invention relate to a method and apparatus for emotion recognition from speech. According to one embodiment of the invention, a method for emotion recognition from speech may include: receiving an audio signal; performing data cleaning on the received audio signal; slicing the cleaned audio signal into at least one segment; performing feature extraction on the at least one segment to extract a plurality of Mel frequency cepstral coefficients and a plurality of Bark frequency cepstral coefficients from the at least one segment; performing feature padding by padding the plurality of Mel frequency cepstral coefficients and the plurality of Bark frequency cepstral coefficients into a feature matrix based on a length threshold; and performing machine learning inference on the feature matrix to recognize the emotion indicated in the audio signal. Embodiments of the present invention can adapt to an audio signal of almost any size and can recognize emotions in real time ...

03-09-2019 publication date

Systems and methods for identifying speech based on cepstral coefficients and support vector machines

Number: US0010403303B1
Assignee: GoPro, Inc.

Audio content may have a duration. The audio content may be segmented into audio segments. Individual audio segments may correspond to a portion of the duration. Mel frequency spectral power features, Mel frequency cepstral coefficient features, and energy features of the audio segments may be determined. Feature vectors of the audio segments may be determined based on the Mel frequency spectral power features, the Mel frequency cepstral coefficient features, and the energy features. The feature vectors may be processed through a support vector machine. The support vector machine may output predictions on whether the audio segments contain speech. One or more of the audio segments may be identified as containing speech based on filtering the predictions and comparing the filtered predictions to a threshold. Storage of the identification of the one or more of the audio segments as containing speech in one or more storage media may be effectuated.
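
The final "filter the predictions, then threshold" step can be sketched simply; the moving-average window and the threshold below are assumptions:

```python
import numpy as np

def speech_segments(predictions, win=5, threshold=0.5):
    """Smooth per-segment SVM speech scores with a moving average and
    return the indices of segments judged to contain speech."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(predictions, kernel, mode="same")
    return np.flatnonzero(smoothed > threshold)
```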

10-07-2013 publication date

Wideband audio signal coding/decoding device and method

Number: EP2051245A3

Disclosed is a wideband audio signal coding/decoding device and method that may code a wideband audio signal while maintaining a low bit rate. The wideband audio signal coding device includes an enhancement layer (200) that extracts a first spectrum parameter from an inputted wideband signal having a first bandwidth, quantizes the extracted first spectrum parameter, and converts the extracted first spectrum parameter into a second spectrum parameter; and a coding unit (130) that extracts a narrowband signal from the inputted wideband signal and codes the narrowband signal based on the second spectrum parameter provided from the enhancement layer, wherein the narrowband signal has a second bandwidth smaller than the first bandwidth. The wideband audio signal coding/decoding device and method may code a wideband audio signal while maintaining a low bit rate.

12-02-1997 publication date

METHODS AND APPARATUS FOR AUTOMATING TELEPHONE DIRECTORY ASSISTANCE FUNCTIONS

Number: EP0000757868A1

In methods and apparatus for at least partially automating a telephone directory assistance function, directory assistance callers are prompted to speak locality or called entity names associated with desired directory listings. A speech recognition algorithm is applied to speech signals received in response to prompting to determine spoken locality or called entity names. Desired telephone numbers are released to callers, and released telephone numbers are used to confirm or correct at least some of the recognized locality or called entity names. Speech signal representations labelled with the confirmed or corrected names are used as labelled speech tokens to refine prior training of the speech recognition algorithm. The training refinement automatically adjusts for deficiencies in prior training of the speech recognition algorithm and to long term changes in the speech patterns of directory assistance callers served by a particular directory assistance installation. The methods can be ...

05-02-2003 publication date

Determination of line spectrum frequencies for use in a radiotelephone

Number: EP0000774750B1
Author: Ruoppila, Vesa T.
Assignee: Nokia Corporation

02-05-2002 publication date

Synthesis of speech signals in the absence of coded parameters

Number: EP0000764939B1
Author: Chen, Juin-Hwey
Assignee: AT&T Corp.

28-10-2004 publication date

VOICE CONVERSION

Number: DE0069826446D1
Assignee: MICROSOFT CORP., REDMOND

04-04-2012 publication date

AUDIO FEATURE EXTRACTING APPARATUS, AUDIO FEATURE EXTRACTING METHOD, AND AUDIO FEATURE EXTRACTING PROGRAM

Number: GB0201202741D0

28-07-2021 publication date

Detection and classification of siren signals and localization of siren signal sources

Number: GB2591329A

A vehicle microphone array 1301 captures ambient sound signals. Frequency spectrum features are extracted 1303 from the sound signals, e.g. by generating a spectrogram, mel-spectrogram or mel-frequency cepstral coefficients (MFCC). An acoustic scene classifier (e.g. a convolutional neural network 1304) uses the frequency spectrum features to predict siren signal classifications which are converted into signal event detections. Time delay of arrival estimates are computed for the detected siren signals and used (with known microphone array geometry) to estimate 1305 bearing angles to sources of the detected siren signals. The bearing angles are tracked using a Bayesian filter (e.g. Kalman or particle filter) and may be used to estimate corresponding ranges. The vehicle may be autonomous and perform actions depending on the emergency vehicle’s location and whether it is active or inactive. E.g. if the AV has crossed a stop line, it may stop immediately or after traversing the intersection ...

15-03-1995 publication date

Methods and apparatus for detecting harmonic structure in a waveform

Number: GB0009501417D0

15-07-2015 publication date

Feature extraction

Number: GB0201509483D0

15-05-1999 publication date

METHOD FOR SPEECH CODING BY MEANS OF ANALYSIS BY SYNTHESIS

Number: AT0000180092T

15-05-2011 publication date

DISTINCTION BETWEEN FOREGROUND SPEECH AND BACKGROUND NOISE

Number: AT0000508452T

15-05-2006 publication date

METHOD AND DEVICE FOR SPEAKER RECOGNITION AND VERIFICATION

Number: AT0000323933T

11-03-1999 publication date

Sound encoding system

Number: AU0000703046B2

24-07-1996 publication date

Speech coding method using analysis by synthesis

Number: AU0004490396A

08-10-2020 publication date

A speaker identification method based on deep learning

Number: AU2020102038A4
Assignee: Qian Wang

This invention lies in the field of digital audio processing and is a speech recognition system for identifying different identities based on deep learning. The invention consists of the following steps: first of all, sufficient data were prepared and split into training and testing sets. Secondly, the data were preprocessed using Voice Activity Detection (VAD) to extract the effective audio segments and Mel-frequency cepstral coefficients (MFCC) for feature extraction. Then, the training data were fed in batches into the Convolutional Neural Network (CNN) that had already been set up. Simultaneously, the parameters of the CNN, namely the dropout rate, base learning rate and loss rate, were adjusted in order to optimize the performance of the model. Eventually, the optimal CNN can be applied to the testing data, and identities can be recognized with an accuracy of 92.6%. In brief, the identity of the speaker can be recognized automatically without human involvement ...

02-04-2009 publication date

APPARATUS AND METHOD FOR ENCODING A MULTI CHANNEL AUDIO SIGNAL

Number: CA0002698600A1

An encoding apparatus comprises a frame processor (105) which receives a multi channel audio signal comprising at least a first audio signal from a first microphone (101) and a second audio signal from a second microphone (103). An ITD processor (107) then determines an inter time difference between the first audio signal and the second audio signal and a set of delays (109, 111) generates a compensated multi channel audio signal from the multi channel audio signal by delaying at least one of the first and second audio signals in response to the inter time difference signal. A combiner (113) then generates a mono signal by combining channels of the compensated multi channel audio signal and a mono signal encoder (115) encodes the mono signal. The inter time difference may specifically be determined by an algorithm based on determining cross correlations between the first and second audio signals.

30-05-1991 publication date

NEAR-TOLL QUALITY 4.8 KBPS SPEECH CODEC

Number: CA0002031006A1
Author: TZENG, FORREST F.

08-06-1995 publication date

Transmitted Noise Reduction in Communications Systems

Number: CA0002153170A1

22-10-2002 publication date

LINEAR TRAJECTORY MODELS INCORPORATING PREPROCESSING PARAMETERS FOR SPEECH RECOGNITION

Number: CA0002260685C

The proposed model aims at finding an optimal linear transformation on the Mel-warped DFT features according to the minimum classification error (MCE) criterion. This linear transformation, along with the NSHMM parameters, is automatically trained using the gradient ascent method. An advantageous error rate reduction can be realized on a standard 39-class TIMIT phone classification task in comparison with the MCE-trained NSHMM using conventional preprocessing techniques.

23-04-2019 publication date

Non-parallel text voice conversion method under the condition of limited training data

Number: CN0109671423A
Author: LI YANPING, XU JILIANG

23-10-2018 publication date

Electronic device, authentication method and storage medium

Number: CN0108694952A

03-09-2019 publication date

Electronic device, identity verification method, and computer-readable storage medium

Number: CN0108564955B

06-07-2004 publication date

Method and device for vector quantization

Number: SE0000524202C2

30-01-2014 publication date

SPEECH CONVERTER AND SPEECH CONVERSION PROGRAM

Number: WO2014016892A1

Provided is a speech converter (10) that includes: a codebook storage part (122) that stores a representative codebook that has a plurality of representative vectors that represent features of speeches spoken in a state in which equipment that limits pronunciation is used, and a corresponding codebook that has a plurality of corresponding vectors that represent features of speeches spoken in a state in which the equipment is not used, the corresponding vectors being made to correspond to the representative vectors; a feature vector calculation part (111) that determines, for each frame, a feature vector of a user that represents a feature of a speech spoken by the user in the state in which the equipment is used; a representative vector selection part (112) that selects a representative vector that is closest to the feature vector of the user from the representative codebook; a corresponding vector selection part (113) that selects a corresponding vector that is made to correspond to the ...

25-11-1997 publication date

Method and apparatus for detecting end points of speech activity

Number: US0005692104A1
Assignee: Apple Computer, Inc.

A method and apparatus for detecting end points of speech activity in an input signal using spectral representation vectors performs beginning point detection using spectral representation vectors for the spectrum of each sample of the input signal and a spectral representation vector for the steady state portion of the input signal. The beginning point of speech is detected when the spectrum diverges from the steady state portion of the input signal. Once the beginning point has been detected, the spectral representation vectors of the input signal are used to determine the ending point of the sound in the signal. The ending point of speech is detected when the spectrum converges towards the steady state portion of the input signal. After both the beginning and ending of the sound are detected, vector quantization distortion can be used to classify the sound as speech or noise.

06-03-2008 publication date

MULTI-CHANNEL CODEBOOK DEPENDENT COMPENSATION

Number: US20080059180A1

Methods and apparatus, in the context of speech recognition, for compensating in the cepstral domain for the effect of an interfering signal by using a reference signal.

19-07-2012 publication date

SPEECH FEATURE EXTRACTION APPARATUS, SPEECH FEATURE EXTRACTION METHOD, AND SPEECH FEATURE EXTRACTION PROGRAM

Number: US20120185243A1
Assignee: International Business Machines Corp.

A speech feature extraction apparatus includes: a first difference calculation module to (i) receive, as an input, a spectrum of a speech signal segmented into frames for each frequency bin, and (ii) calculate a delta spectrum for each of the frames, where the delta spectrum is the difference of the spectrum within continuous frames for the frequency bin; and a first normalization module to normalize the delta spectrum of the frame for the frequency bin by dividing the delta spectrum by a function of an average spectrum, where the average spectrum is an average of spectra through all frames of the overall speech for the frequency bin, and where an output of the first normalization module is defined as a first delta feature.
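
The first delta feature reduces to a few array operations. A sketch assuming spectra stacked as a (frames, bins) numpy array, and taking the "function of an average spectrum" to be the identity (an assumption):

```python
import numpy as np

def first_delta_feature(spectra):
    """Per-bin delta spectrum, normalized by the average spectrum over
    all frames."""
    delta = spectra[1:] - spectra[:-1]           # frame-to-frame difference
    avg = spectra.mean(axis=0, keepdims=True)    # average over all frames
    return delta / (avg + 1e-12)
```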

27-04-2017 publication date

APPARATUS AND METHOD FOR ENCODING A MULTI CHANNEL AUDIO SIGNAL

Number: US20170116997A1

An encoding apparatus comprises a frame processor (105) which receives a multi channel audio signal comprising at least a first audio signal from a first microphone (101) and a second audio signal from a second microphone (103). An ITD processor 107 then determines an inter time difference between the first audio signal and the second audio signal and a set of delays (109, 111) generates a compensated multi channel audio signal from the multi channel audio signal by delaying at least one of the first and second audio signals in response to the inter time difference signal. A combiner (113) then generates a mono signal by combining channels of the compensated multi channel audio signal and a mono signal encoder (115) encodes the mono signal. The inter time difference may specifically be determined by an algorithm based on determining cross correlations between the first and second audio signals.

17-12-2002 publication date

Modulated complex lapped transform for integrated signal enhancement and coding

Number: US0006496795B1

The present invention is embodied in a system and method for performing spectral analysis of a digital signal having a discrete duration by spectrally decomposing the digital signal at predefined frequencies uniformly distributed over a sampling frequency interval into complex frequency coefficients so that magnitude and phase information at each frequency is immediately available to produce a modulated complex lapped transform (MCLT). The present invention includes a MCLT processor, an acoustic echo cancellation device and a noise reducer integrated with an encoder/decoder device.

27-06-2019 publication date

METHOD FOR ASSOCIATING A DEVICE WITH A SPEAKER IN A GATEWAY, CORRESPONDING COMPUTER PROGRAM, COMPUTER AND APPARATUS

Number: US20190198023A1

The present disclosure proposes a solution to associate a device with a user by capturing a voice of a speaker by a microphone connected to the network device (e.g. a residential or home gateway), monitoring the IP traffic of the network device and detecting the device contributing to this IP traffic in order to establish a link between the speaker and his device(s) and associate the device with the speaker.

18-10-2022 publication date

Multi-modal emotion recognition device, method, and storage medium using artificial intelligence

Number: US0011475710B2
Author: Dae-Hun Yoo, Young-Bok Lee
Assignee: Genesis Lab, Inc.

A multi-modal emotion recognition system is disclosed. The system includes a data input unit for receiving video data and voice data of a user, a data pre-processing unit including a voice pre-processing unit for generating voice feature data from the voice data and a video pre-processing unit for generating one or more face feature data from the video data, a preliminary inference unit for generating situation determination data as to whether or not the user's situation changes according to a temporal sequence based on the video data. The system further comprises a main inference unit for generating at least one sub feature map based on the voice feature data or the face feature data, and inferring the user's emotion state based on the sub feature map and the situation determination data.

23-04-2014 publication date

IDENTIFICATION OF A LOCAL SPEAKER

Number: EP2721609A1

25-06-1997 publication date

Speaker verification system

Number: EP0000780830A2

In a speaker verification system, a method of generating reference data includes analyzing a plurality of example utterances of a speaker to be enrolled in the system to provide a matrix of averaged cepstral coefficients representing the utterances. This matrix is compared with matrices of cepstral coefficients derived from test utterances of the speaker and of other speakers to determine, by use of a genetic algorithm process, the coefficients of a weighting vector to be used in the comparison. The reference data, including the matrix of average cepstral coefficients and the coefficients of the genetic algorithm derived weighting vector are recorded on a user's card for subsequent use by the speaker when desiring service from a self-service terminal.

10-01-2006 publication date

METHOD AND APPARATUS FOR SPEECH RECONSTRUCTION WITHIN A DISTRIBUTED SPEECH RECOGNITION SYSTEM

Number: RU2005125737A

... 1. A speech reconstruction method comprising the steps of receiving a first plurality of Mel-frequency cepstral coefficients (MFCC), calculating a second plurality of MFCC coefficients, and using the received and calculated MFCC coefficients to reconstruct speech. 2. The method of claim 1, wherein the step of using the received and calculated MFCC coefficients to reconstruct speech comprises the steps of transforming the received and calculated MFCC coefficients into harmonic amplitudes and using the harmonic amplitudes to reconstruct speech. 3. The method of claim 1, wherein the step of receiving the first plurality of MFCC coefficients comprises receiving coefficients C0 through C12. 4. The method of claim 3, wherein the step of calculating the second plurality of MFCC coefficients comprises calculating coefficients C13 through C22. 5. The method of claim 4, wherein the step of using the received and calculated MFCC coefficients to reconstruct speech comprises transforming coefficients C0 through C22 into harmonic amplitudes and using the harmonic amplitudes ...

10-11-2011 publication date

DEVICE AND METHOD FOR ENCODING A MULTICHANNEL AUDIO SIGNAL

Number: RU2010116295A

... 1. A device for encoding a multichannel audio signal, the device comprising: a receiver for receiving a multichannel audio signal containing at least a first audio signal from a first microphone and a second audio signal from a second microphone; a time-difference module for determining the inter-time difference between the first audio signal and the second audio signal by combining successive observations of cross-correlations between the first audio signal and the second audio signal, the cross-correlations being processed so as to yield probabilities that are accumulated using a Viterbi-like algorithm; a delay module for generating a compensated multichannel audio signal from the multichannel audio signal by delaying at least one of the first audio signal and the second audio signal in response to the inter-time difference signal; a mono module for generating a monophonic signal by ...

03-06-1998 publication date

Vector quantizer method and apparatus

Number: GB0002282943B
Assignee: MOTOROLA INC

30-08-2017 publication date

Microphone unit comprising integrated speech analysis

Number: GB0201711576D0

02-09-1998 publication date

Method and apparatus for speech enhancement in a speech communication system

Number: GB0009814279D0

03-11-1999 publication date

Method and apparatus for speech enhancement in a speech communication system

Number: GB0009920667D0

15-11-2001 publication date

START/END POINT DETECTION FOR WORD RECOGNITION

Number: AT0000208081T

15-03-2003 publication date

SPEECH CODING SYSTEM

Number: AT0000233008T

10-10-1996 publication date

Vector quantizer method and apparatus

Number: AU0006084396A

21-02-2013 publication date

Restoration of high-order Mel Frequency Cepstral Coefficients

Number: US20130046540A9
Author: Alexander Sorin
Assignee: Individual

A method for estimating high-order Mel Frequency Cepstral Coefficients, the method comprising initializing any of N-L high-order coefficients (HOC) of an MFCC vector of length N having L low-order coefficients (LOC) to a predetermined value, thereby forming a candidate MFCC vector, synthesizing a speech signal frame from the candidate MFCC vector and a pitch value, and computing an N-dimensional MFCC vector from the synthesized frame, thereby producing an output MFCC vector.

07-01-2021 publication date

MULTISTREAM ACOUSTIC MODELS WITH DILATIONS

Number: US20210005182A1

Audio signals of speech may be processed using an acoustic model. An acoustic model may be implemented with multiple streams of processing where different streams perform processing using different dilation rates. For example, a first stream may process features of the audio signal with one or more convolutional neural network layers having a first dilation rate, and a second stream may process features of the audio signal with one or more convolutional neural network layers having a second dilation rate. Each stream may compute a stream vector, and the stream vectors may be combined into a vector of speech unit scores, where the vector of speech unit scores provides information about the acoustic content of the audio signal. The vector of speech unit scores may be used for any appropriate application of speech, such as automatic speech recognition. The claims cover a computer-implemented method that receives a sequence of feature vectors computed from an audio signal (for example, vectors of Mel-frequency cepstral coefficients), computes a first stream vector with a convolutional layer at a first dilation rate and a second stream vector with a convolutional layer at a different, second dilation rate, computes a vector of speech unit scores from the two stream vectors, and may process that vector to determine one or more words spoken in the audio signal ...
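
The temporal mechanics of the two streams can be sketched with a plain 1-D dilated convolution, hand-rolled in numpy over a single feature dimension; the model's learned combination of stream vectors is stood in for by concatenation:

```python
import numpy as np

def dilated_conv(seq, kernel, dilation):
    """1-D dilated convolution over a feature sequence."""
    k = len(kernel)
    span = (k - 1) * dilation
    out = np.zeros(len(seq) - span)
    for t in range(len(out)):
        taps = seq[t : t + span + 1 : dilation]   # dilated receptive field
        out[t] = np.dot(taps, kernel)
    return out

def two_stream_scores(features, kernel_1, kernel_2):
    """Run one stream at dilation 1 and one at dilation 2, then
    combine the stream outputs into a single score vector."""
    s1 = dilated_conv(features, kernel_1, dilation=1)
    s2 = dilated_conv(features, kernel_2, dilation=2)
    n = min(len(s1), len(s2))
    return np.concatenate([s1[:n], s2[:n]])
```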

04-01-2018 publication date

MICROPHONE UNIT COMPRISING INTEGRATED SPEECH ANALYSIS

Number: US20180005636A1

A microphone unit has a transducer, for generating an electrical audio signal from a received acoustic signal; a speech coder, for obtaining compressed speech data from the audio signal; and a digital output, for supplying digital signals representing said compressed speech data. The speech coder may be a lossy speech coder, and may contain a bank of filters with centre frequencies that are non-uniformly spaced, for example mel frequencies. The claims add that the outputs of the filter bank may be coupled to a log weighting block and a discrete cosine transform block to provide cepstral coefficients; that the unit may be operable in a mode in which the digital output supplies uncompressed speech data; that the unit may comprise a compressive sampling coder whose sampling instants are distributed randomly in time at a rate below the input signal bandwidth; and that the lossy speech coder may use at least one coding technique selected from ADPCM, MDCT, MDCT-Hybrid subband, CELP, ACELP, Two-Stage Noise Feedback Coding (TSNFC), VSELP, RPE-LTP, LPC ...

02-01-2020 publication date

Sound processing apparatus

Number: US20200005770A1
Assignee: Oticon AS

There is provided a speech classification apparatus for hearing devices with electroencephalography (EEG) dependent sound processing. It comprises a sound processing unit configured to capture sound input signals from at least one external microphone and segment the captured sound input signals into segmented sound signals; a speech classification unit comprising a speech cepstrum calculation unit configured to calculate a speech cepstrum for each segmented sound signal; an EEG cepstrum calculation unit configured to calculate an EEG cepstrum for an EEG signal of the user's brain; a mapping unit configured to select a predetermined number of coefficients from each calculated sound cepstrum and from the calculated EEG cepstrum; and a correlation unit configured to calculate a correlation value for each captured sound input signal, based on a correlation of the selected coefficients from the respective sound cepstrum with the selected coefficients from the EEG cepstrum. An attended speech source is classified based on the obtained correlation values.
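
The selection step reduces to correlating truncated cepstra. A sketch assuming one EEG cepstrum and a list of candidate speech cepstra as numpy arrays; the number of coefficients kept is an assumption:

```python
import numpy as np

def attended_source(speech_cepstra, eeg_cepstrum, n_coeffs=8):
    """Pick the attended speech source as the one whose cepstrum
    correlates best with the EEG cepstrum, using the first n_coeffs
    coefficients of each."""
    e = eeg_cepstrum[:n_coeffs]
    scores = [np.corrcoef(c[:n_coeffs], e)[0, 1] for c in speech_cepstra]
    return int(np.argmax(scores)), scores
```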

07-01-2021 publication date

AUDIO PROCESSING FOR VOICE SIMULATED NOISE EFFECTS

Number: US20210006516A1

Systems and methods may be used to process and output information related to a non-speech vocalization, for example from a user attempting to mimic a non-speech sound. A method may include determining a mimic quality value associated with an audio file by comparing a non-speech vocalization to a prerecorded audio file. For example, the method may include determining an edit distance between the non-speech vocalization and the prerecorded audio file, assigning a mimic quality value to the audio file based on the edit distance, and outputting the mimic quality value. The claims describe a device comprising a display providing a user interface for interacting with a chat bot, memory, and a processor that initiates a chat session within the user interface, receives an audio file or streamed audio containing a non-speech vocalization from the user, identifies a prerecorded non-speech audio file stored in a database, determines a similarity characteristic by comparing the vocalization to the prerecorded file, and outputs a chat-bot response based on that characteristic. The similarity characteristic may be based on an edit distance determined using a threshold, a machine learning classifier, dynamic time warping, or Mel frequency ...

20-01-2022 publication date

AUDIO MODIFYING CONFERENCING SYSTEM

Number: US20220020388A1

A computer-implemented method for modifying audio-based communications produced during a conference call is disclosed. The method can include monitoring a plurality of utterances transmitted via an audio feed of a device connected to the conference call, identifying a first unwanted audio component transmitted via the audio feed, and actively modifying the audio feed by removing the first unwanted audio component from it. The claims add that the modification can be based on determining that the unwanted audio component is generated by a person who has not opted in to the conference call; that, once the person opts in, the component is permitted to be transmitted via the audio feed; and that deciding whether the person should be added to the call can be based, at least in part, on sending a prompt to a person who has already opted in and receiving, in response, information indicating that the person should be added ...

27-01-2022 publication date

Method and system for correcting infant crying identification

Number: US20220028409A1

A method for correcting infant crying identification includes the following steps: a detecting step provides an audio unit to detect a sound around an infant to generate a plurality of audio samples. A converting step provides a processing unit to convert the audio samples to generate a plurality of audio spectrograms. An extracting step provides a common model to extract the audio spectrograms to generate a plurality of infant crying features. An incremental training step provides an incremental model to train the infant crying features to generate an identification result. A judging step provides the processing unit to judge whether the identification result is correct according to a real result of the infant. When the identification result is different from the real result, an incorrect result is generated. A correcting step provides the processing unit to correct the incremental model according to the incorrect result.

14-01-2016 publication date

AUDIO MATCHING WITH SUPPLEMENTAL SEMANTIC AUDIO RECOGNITION AND REPORT GENERATION

Number: US20160012807A1

System, apparatus and method for determining semantic information from audio, where incoming audio is sampled and processed to extract audio features, including temporal, spectral, harmonic and rhythmic features. The extracted audio features are compared to stored audio templates that include ranges and/or values for certain features and are tagged for specific ranges and/or values. The semantic information may be associated with audio codes to determine changing characteristics of identified media during a time period. The claims describe a processor-based method for producing supplemental information for media containing an embedded audio code: obtaining, during a first time period, an audio code read from the audio portion of the media and representing a first characteristic of the media; obtaining, during the same period, first semantic audio signature data measuring at least one temporal, spectral, harmonic or rhythmic feature relating to a second characteristic of the media; and associating the audio code of the first time period with a second time period when a second semantic audio signature substantially matches the first. The listed features include amplitude, power and zero crossings (temporal); spectral centroid, rolloff, flux, flatness measure, crest factor, Mel-frequency cepstral coefficients, Daubechies wavelet coefficients, spectral dissonance, irregularity and inharmonicity (spectral); and harmonic features such as ...

11-01-2018 publication date

PHONETIC POSTERIORGRAMS FOR MANY-TO-ONE VOICE CONVERSION

Number: US20180012613A1

A method for converting speech using phonetic posteriorgrams (PPGs). A target speech is obtained and a PPG is generated based on acoustic features of the target speech. Generating the PPG may include using a speaker-independent automatic speech recognition (SI-ASR) system for equalizing different speakers. The PPG includes a set of values corresponding to a range of times and a range of phonetic classes, the phonetic classes corresponding to senones. A mapping between the PPG and one or more segments of the target speech is generated. A source speech is obtained and converted into a converted speech based on the PPG and the mapping. The claims add that the values correspond to posterior probabilities of each phonetic class at each time, so that the PPG forms a matrix; that the source speech may differ from the target speech; and that the SI-ASR system is trained for PPG generation on a multi-speaker ASR corpus, with an MFCC feature vector per frame as input ...

10-01-2019 publication date

UNCERTAINTY MEASURE OF A MIXTURE-MODEL BASED PATTERN CLASSIFIER

Number: US20190013014A1
Assignee:

There are provided mechanisms for determining an uncertainty measure of a mixture-model based parametric classifier. A method is performed by a classification device. The method includes obtaining a short-term frequency representation of a multimedia signal. The short-term frequency representation defines an input sequence. The method includes classifying the input sequence to belong to one class of at least two available classes using the parametric classifier. The parametric classifier has been trained with a training sequence. The method includes determining an uncertainty measure of the classified input sequence based on a relation between posterior probabilities of the input sequence and posterior probabilities of the training sequence.

1. A method for determining an uncertainty measure of a mixture-model based parametric classifier, the method being performed by a classification device, the method comprising: obtaining a short-term frequency representation of a multimedia signal, the short-term frequency representation defining an input sequence x; classifying the input sequence x to belong to one class ω*_m of at least two available classes ω_1, ω_2 using the parametric classifier, the parametric classifier having been trained with a training sequence y; and determining an uncertainty measure of the classified input sequence x based on a relation between posterior probabilities of the input sequence x and posterior probabilities of the training sequence y.

2. The method according to claim 1, wherein the uncertainty measure describes a deviation from an optimal performance of the parametric classifier.

3. The method according to claim 2, wherein the optimal performance is based on the posterior probabilities of the training sequence y.

4. The method according to claim 1, wherein the uncertainty measure is defined as minimum of 1 and a ratio between the posterior probabilities of the input sequence x and the posterior probabilities of the ...
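
A hedged sketch of the measure named in claim 4, min(1, ratio of posteriors), using a two-component Gaussian mixture as the parametric classifier; the numbers and function names are illustrative assumptions, not the patent's system.

import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def max_posterior(x, mus, variances, priors):
    # Posterior probability of the winning class for a scalar observation x.
    likelihoods = np.array([p * gaussian_pdf(x, m, v)
                            for m, v, p in zip(mus, variances, priors)])
    return likelihoods.max() / likelihoods.sum()

mus, variances, priors = [-1.0, 2.0], [1.0, 1.0], [0.5, 0.5]
post_train = max_posterior(1.9, mus, variances, priors)   # near a training mode
post_input = max_posterior(0.5, mus, variances, priors)   # ambiguous input

# Uncertainty: minimum of 1 and the ratio of input to training posteriors.
uncertainty = min(1.0, post_input / post_train)
print(round(uncertainty, 3))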

10-01-2019 publication date

INTERFACE TO LEAKY SPIKING NEURONS

Number: US20190013037A1
Author: Haiut Moshe
Assignee:

A processor that may include at least one neural network that comprises at least one leaky spiking neuron; wherein the at least one leaky spiking neuron is configured to directly receive an input pulse density modulation (PDM) signal from a sensor; wherein the input PDM signal represents a detected signal that was detected by the sensor; and wherein the at least one neural network is configured to process the input PDM signal to provide an indication about the detected input signal.

1. A processor, that comprises: at least one neural network that comprises at least one leaky spiking neuron; wherein the at least one leaky spiking neuron is configured to directly receive an input pulse density modulation (PDM) signal from a sensor; wherein the input PDM signal represents a detected signal that was detected by the sensor; and wherein the at least one neural network is configured to process the input PDM signal to provide an indication about the detected input signal.

2. The processor according to claim 1, wherein the sensor is an audio sensor and the detected signal is an audio signal.

3. The processor according to claim 2, wherein the at least one neural network comprises a first group of leaky spiking neurons that is configured to pre-process the PDM signal to provide approximations of output signals of a bank of Mel filters, the approximations representing the detected signal.

4. The processor according to claim 3, wherein the at least one neural network comprises a second group of leaky spiking neurons that is coupled to the first group of leaky spiking neurons and is configured to process the approximations to provide an audio process result.

5. The processor according to claim 3, wherein the first group of leaky spiking neurons comprises multiple resonators, wherein different resonators are configured to output approximations of different Mel filters of the bank of Mel filters.

6. The processor according to claim 5, wherein the multiple resonators are multiple tunable resonators.

7. The processor ...
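
A hedged sketch of a single leaky integrate-and-fire neuron driven directly by a PDM bit stream, loosely illustrating the idea of feeding 1-bit pulses into a leaky spiking neuron; the leak and threshold constants are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
pdm = (rng.random(2000) < 0.3).astype(float)   # toy PDM stream, pulse density 0.3

leak = 0.95          # membrane leak per PDM clock tick
threshold = 4.0      # firing threshold
membrane = 0.0
spikes = []

for bit in pdm:
    membrane = leak * membrane + bit           # leaky integration of pulses
    if membrane >= threshold:                  # fire and reset
        spikes.append(1)
        membrane = 0.0
    else:
        spikes.append(0)

print(sum(spikes), "output spikes for", int(pdm.sum()), "input pulses")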

17-01-2019 publication date

Expressive control of text-to-speech content

Number: US20190019497A1
Assignee: I Am Plus Electronics Inc

Methods and systems for audio content production are provided whereby expressive speech is generated via acoustic elements extracted from human input. One input to the system is a human-designated intonation to be applied to text-to-speech (TTS) generated synthetic speech. The human intonation includes the pitch contour and other acoustic features extracted from the speech. The system is designed for, and is capable of, speech generation, speech analysis, speech transformation, and speech re-synthesis at the acoustic level.

10-02-2022 publication date

ACOUSTIC EVENT DETECTION SYSTEM AND METHOD

Number: US20220044698A1
Author: HUANG HUNG-PIN
Assignee:

An acoustic event detection system and a method are provided. The system includes a voice activity detection subsystem, a database, and an acoustic event detection subsystem. The voice activity detection subsystem includes a voice receiving module, a feature extraction module, and a first determination module. The voice receiving module receives an original sound signal, the feature extraction module extracts a plurality of features from the original sound signal, and the first determination module executes a first classification process to determine whether or not the plurality of features match a start-up voice. The acoustic event detection subsystem includes a second determination module and a function response module. The second determination module executes a second classification process to determine whether the features match at least one of a plurality of predetermined voices. The function response module executes one of the functions corresponding to the matched predetermined voice.

1. An acoustic event detection system, comprising: a voice activity detection subsystem, including a voice receiving module configured to receive an original sound signal, a feature extraction module configured to extract a plurality of features from the original sound signal, and a first determination module configured to execute a first classification process to determine whether or not the plurality of features match a start-up voice; a database configured to store the plurality of extracted features; and an acoustic event detection subsystem, including a second determination module configured to, in response to the first determination module determining that the plurality of features match the start-up voice, execute a second classification process to determine whether or not the plurality of features match at least one of a plurality of predetermined voices, and a function response module configured to, in response to the second determination module determining that the plurality of features ...
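
A minimal sketch of the two-stage flow (a start-up voice gate, then event classification), assuming toy threshold classifiers; extract_features, first_classification and second_classification are illustrative stand-ins, not the patent's modules.

import numpy as np

def extract_features(signal):
    # Toy features: frame energy and zero-crossing rate.
    energy = float(np.mean(signal ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal)))) / 2)
    return np.array([energy, zcr])

def first_classification(features, startup_template, tol=0.2):
    return np.linalg.norm(features - startup_template) < tol

def second_classification(features, predetermined_templates):
    dists = {name: np.linalg.norm(features - t)
             for name, t in predetermined_templates.items()}
    return min(dists, key=dists.get)

signal = np.sin(2 * np.pi * np.linspace(0, 20, 1600))
feats = extract_features(signal)
if first_classification(feats, startup_template=feats):   # stage 1: start-up voice?
    label = second_classification(feats, {"door_knock": feats + 0.05,
                                          "glass_break": feats + 1.0})
    print("execute function for:", label)                  # stage 2: event type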

28-01-2021 publication date

Systems and Methods for Animation Generation

Number: US20210027511A1
Assignee: LoomAi, Inc.

Systems and methods for animating from audio in accordance with embodiments of the invention are illustrated. One embodiment includes a method for generating animation from audio. The method includes steps for receiving input audio data, generating an embedding for the input audio data, and generating several predictions for several tasks from the generated embedding. The several predictions include at least one of blendshape weights, event detection, and/or voice activity detection. The method includes steps for generating a final prediction from the several predictions, where the final prediction includes a set of blendshape weights, and generating an output based on the generated final prediction.

1. A method for generating animation from audio, the method comprising: receiving input audio data; generating an embedding for the input audio data; generating a plurality of predictions for a plurality of tasks from the generated embedding, wherein the plurality of predictions comprises at least one of blendshape weights, event detection, and voice activity detection; generating a final prediction from the plurality of predictions, wherein the final prediction comprises a set of blendshape weights; and generating an output based on the generated final prediction.

2. The method of claim 1, wherein the input audio data comprises mel-frequency cepstral coefficient (MFCC) features.

3. The method of claim 2, wherein generating the embedding comprises utilizing at least one of a recurrent neural network and a convolutional neural network to generate the embedding based on the MFCC features.

4. The method of claim 1, wherein generating the plurality of predictions comprises utilizing a multi-branch decoder, wherein the multi-branch decoder comprises a plurality of Long Short Term Memory networks (LSTMs) that generate predictions for the plurality of tasks based on the generated embedding.

5. The method of claim 1, wherein generating the plurality of predictions ...

28-01-2021 publication date

PULMONARY FUNCTION ESTIMATION

Number: US20210027893A1
Assignee:

Pulmonary function estimation can include detecting one or more cough events from a time series of audio signals generated by an electronic device of a user. Based on the one or more cough events, one or more lung function metrics of the user can be determined.

1. A method, comprising: detecting, with a processor, one or more cough events from a time series of audio signals generated by an electronic device of a user; and determining, with the processor, one or more lung function metrics of the user based on the one or more cough events.

2. The method of claim 1, further comprising: generating a plurality of features relevant to cough characteristics associated with the one or more cough events; selecting one or more features from the plurality of features based on contextual information relating to the user; and determining the lung function metrics based on the one or more features selected from the plurality of features.

3. The method of claim 2, wherein the determining comprises using a context-based regression model to determine the lung function metrics based on the one or more features selected from the plurality of features.

4. The method of claim 2, further comprising: segmenting the one or more cough events into a plurality of segments; determining a cough burst segment from the plurality of segments; and selecting the one or more features from the plurality of features based on the cough burst segment.

5. The method of claim 1, wherein the one or more cough events comprise a plurality of cough events, the method further comprising: identifying an initial cough event from the plurality of cough events; and determining the lung function metrics based on the initial cough event.

6. The method of claim 2, wherein the plurality of features includes at least one of cough duration, cough force, mel-frequency cepstral coefficients, and/or cough waveform skewness.

7. The method of claim 1, further comprising: determining a quality of one ...
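
A hedged sketch of three of the features listed in claim 6 (cough duration, cough force, waveform skewness) computed for a synthetic cough burst; the burst model and constants are illustrative assumptions.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(2)
t = np.linspace(0, 0.4, 6400)                      # 400 ms cough at 16 kHz
burst = np.exp(-12 * t) * rng.normal(size=t.size)  # sharp attack, long decay

cough_duration = t[-1]                             # seconds (toy value)
cough_force = float(np.max(np.abs(burst)))
waveform_skewness = float(skew(burst))

print(f"duration={cough_duration:.2f}s force={cough_force:.2f} "
      f"skewness={waveform_skewness:.2f}")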

02-02-2017 publication date

INFORMATION PRESENTATION METHOD, INFORMATION PRESENTATION APPARATUS, AND COMPUTER READABLE STORAGE MEDIUM

Number: US20170034086A1
Assignee: FUJITSU LIMITED

An information presentation method including: monitoring specified information in contents of interactive communications between a first apparatus and a second apparatus, the contents from the first apparatus being outputted to the second apparatus when a first condition is satisfied, the contents from the second apparatus being outputted to the first apparatus when a second condition is satisfied; and changing, when the specified information is detected in the contents, at least one of the first condition and the second condition so that the contents are outputted more easily.

1. An information presentation method comprising: monitoring specified information in contents of interactive communications between a first apparatus and a second apparatus, the contents from the first apparatus being outputted to the second apparatus when a first condition is satisfied, the contents from the second apparatus being outputted to the first apparatus when a second condition is satisfied; and changing, when the specified information is detected in the contents, at least one of the first condition and the second condition so that the contents are outputted more easily.

2. The information presentation method according to claim 1, wherein the contents are audio data, and the specified information is a specified sound.

3. The information presentation method according to claim 2, wherein the monitoring includes: dividing the audio data for every specified period; calculating each first feature of each frequency component of each of the divided audio data; and calculating each second feature of each of the divided audio data by applying a specified function to at least one component of each first feature, the specified function being a function of x that corresponds to each frequency component, a derivative or subderivative function of the specified function by x being monotonically decreasing within an interval a_b ≦ x ≦ a_t (0 ≦ a_b ...

17-02-2022 publication date

SOUND PROCESSING METHOD

Number: US20220051687A1
Author: SENDODA Mitsuru
Assignee: NEC Corporation

A sound processing apparatus includes a feature value extractor configured to perform a Fourier transform and then a cepstral analysis of a sound signal and to extract, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.

1. A sound processing method comprising: performing a Fourier transform and then a cepstral analysis of a sound signal; and extracting, as feature values of the sound signal, values including frequency components obtained by the Fourier transform of the sound signal and a value based on a result obtained by the cepstral analysis of the sound signal.

2. The sound processing method according to claim 1, wherein the extracting comprises extracting, as the feature values of the sound signal, values including the frequency components obtained by the Fourier transform of the sound signal and the zero-th-order component of the result obtained by the cepstral analysis of the sound signal.

3. The sound processing method according to claim 2, wherein the extracting comprises extracting, as the feature values of the sound signal, values including the frequency components obtained by the Fourier transform of the sound signal, the zero-th-order component of the result obtained by the cepstral analysis of the sound signal, and a differential component of the result obtained by the cepstral analysis of the sound signal.

4. The sound processing method according to claim 1, wherein the cepstral analysis is a mel-frequency cepstral coefficient analysis.

5. The sound processing method according to claim 1, wherein a model is generated by learning the sound signal based on the feature values extracted from the sound signal and identification information identifying the sound signal.

6. The sound processing method according to claim 5, wherein the feature ...
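
A hedged sketch of the claimed feature vector: per-frame FFT magnitudes plus the zero-th-order cepstral component and its frame-to-frame delta. A plain real cepstrum is used here for simplicity; claim 4 also allows an MFCC-style analysis. Frame sizes are illustrative assumptions.

import numpy as np

def frame_features(frames):
    feats = []
    prev_c0 = 0.0
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame))    # frequency components
        log_spec = np.log(spectrum + 1e-10)
        cepstrum = np.fft.irfft(log_spec)        # real cepstrum
        c0 = cepstrum[0]                         # zero-th-order component
        delta_c0 = c0 - prev_c0                  # differential component
        prev_c0 = c0
        feats.append(np.concatenate([spectrum, [c0, delta_c0]]))
    return np.array(feats)

sr = 16000
signal = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)
frames = signal.reshape(-1, 400)                 # 25 ms frames, no overlap
print(frame_features(frames).shape)              # (frames, 201 FFT bins + 2)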

30-01-2020 publication date

END-TO-END NEURAL NETWORKS FOR SPEECH RECOGNITION AND CLASSIFICATION

Number: US20200035222A1
Assignee: Deepgram, Inc.

Systems and methods are disclosed for end-to-end neural networks for speech recognition and classification, and additional machine learning techniques that may be used in conjunction or separately. Some embodiments comprise multiple neural networks directly connected to each other to form an end-to-end neural network. One embodiment comprises a convolutional network, a first fully-connected network, a recurrent network, a second fully-connected network, and an output network. Some embodiments are related to generating speech transcriptions, and some embodiments relate to classifying speech into a number of classifications.

1. A speech recognition system, comprising: a processor; a memory, the memory comprising instructions for an end-to-end speech recognition neural network comprising: a convolutional neural network configured to receive acoustic features of an utterance and output a first representation of the utterance; a first fully-connected neural network configured to receive a first portion of the first representation of the utterance and a plurality of copies of the first fully-connected neural network for receiving additional portions of the first representation of the utterance, the first fully-connected neural network and the copies configured to collectively output a second representation of the utterance; a recurrent neural network configured to receive the second representation of the utterance from the fully-connected neural network and output a third representation of the utterance; a second fully-connected neural network configured to receive the third representation of the utterance and output a fourth representation of the utterance, wherein the fourth representation of the utterance comprises a word embedding; an output neural network configured to receive the fourth representation of the utterance from the second fully-connected neural network and output an indication of one or more words corresponding to the utterance; wherein the output neural network ...
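
A hedged sketch of the described stack (convolutional, fully connected applied per time step, recurrent, fully connected, output) as a minimal PyTorch module; layer sizes and activations are illustrative assumptions, not Deepgram's architecture.

import torch
import torch.nn as nn

class EndToEndASR(nn.Module):
    def __init__(self, n_feats=80, hidden=256, vocab=29):
        super().__init__()
        self.conv = nn.Conv1d(n_feats, hidden, kernel_size=5, padding=2)
        self.fc1 = nn.Linear(hidden, hidden)   # shared across time (the "copies")
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.fc2 = nn.Linear(hidden, hidden)   # fourth representation / embedding
        self.out = nn.Linear(hidden, vocab)    # output network over characters

    def forward(self, x):                      # x: (batch, time, features)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h = torch.relu(self.fc1(h))            # same weights for every portion
        h, _ = self.rnn(h)
        h = torch.relu(self.fc2(h))
        return self.out(h)                     # per-frame logits

logits = EndToEndASR()(torch.randn(2, 100, 80))
print(logits.shape)                            # torch.Size([2, 100, 29])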

04-02-2021 publication date

DEEP LEARNING INTERNAL STATE INDEX-BASED SEARCH AND CLASSIFICATION

Number: US20210035565A1
Assignee:

Systems and methods are disclosed for generating internal state representations of a neural network during processing and using the internal state representations for classification or search. In some embodiments, the internal state representations are generated from the output activation functions of a subset of nodes of the neural network. The internal state representations may be used for classification by training a classification model using internal state representations and corresponding classifications. The internal state representations may be used for search by producing a search feature from a search input and comparing the search feature with one or more feature representations to find the feature representation with the highest degree of similarity.

1. A system comprising one or more processors, and a non-transitory computer-readable medium including one or more sequences of instructions that, when executed by the one or more processors, cause the system to perform operations comprising: providing a trained speech recognition neural network, the speech recognition neural network including a plurality of layers each having a plurality of nodes; transcribing speech audio by the speech recognition neural network; generating one or more feature representations from a subset of the nodes; receiving a first set of classifications for a first portion of the speech audio; providing a trained classification model, the classification model trained on a first set of feature representations corresponding to the first portion of the speech audio and the first set of classifications; and determining a second set of classifications for a second portion of the speech audio by inputting a second set of feature representations corresponding to the second portion of the speech audio into the trained classification model, the second set of feature representations comprising a second subset of the feature representations generated during the speech audio transcription.

2. ...
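
A hedged sketch of capturing internal activations from a subset of nodes with PyTorch forward hooks, the way such internal state representations could feed a downstream classifier or search index; the toy network and names are assumptions.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(40, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 29))

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()   # store the layer's activations
    return hook

net[2].register_forward_hook(make_hook("hidden_layer"))  # subset of nodes

_ = net(torch.randn(8, 40))                # run a "transcription" forward pass
feature_repr = captured["hidden_layer"]    # (8, 64) internal state features
print(feature_repr.shape)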

11-02-2016 publication date

Relative excitation features for speech recognition

Number: US20160042734A1
Author: Cetin CETINTURK
Assignee: Cetin CETINTURKC

Relative Excitation features, in all conditions, are far superior to conventional acoustic features like Mel-Frequency Cepstrum (MFC) and Perceptual Linear Prediction (PLP), and provide much more speaker independence, channel independence, and noise immunity. Relative Excitation features are radically different from conventional acoustic features. The Relative Excitation method doesn't try to model speech production or vocal tract shape, doesn't try to do deconvolution, and doesn't utilize LP (Linear Prediction) and Cepstrum techniques. This new feature set is entirely grounded in human hearing. The present invention is inspired by the fact that human auditory perception analyzes and tracks the relations between spectral frequency component amplitudes; the name "Relative Excitation" refers to the relative excitation levels of human auditory neurons. Described herein is a major breakthrough for explaining and simulating human auditory perception and its robustness.

06-02-2020 publication date

SYSTEMS AND METHODS FOR A TRIPLET NETWORK WITH ATTENTION FOR SPEAKER DIARIZATION

Number: US20200043508A1

Various embodiments of systems and methods for a triplet network with attention for speaker diarization are disclosed.

1. A method for the diarization of speakers in a speech audio sample, comprising: segmenting an audio recording featuring one or more speakers in the time domain into a plurality of audio samples, wherein a set of temporal sequence features or a set of mel-frequency cepstral coefficients are extracted from each of the plurality of audio samples; learning a set of hidden representations from a set of hidden embeddings using an attention model, wherein the attention model is implemented using a neural network, wherein a set of encoded ordering information is used to learn the set of hidden representations, and wherein the set of mel-frequency cepstral coefficients is used to learn the set of hidden embeddings; learning a similarity metric between each of the plurality of audio samples using a metric learner, wherein the metric learner computes the triplet ranking loss using the set of hidden representations and the set of temporal sequence features, and wherein the metric learner is implemented using the neural network; and performing speaker clustering using the similarity metric, wherein the result of speaker clustering diarizes the one or more speakers featured in the audio recording.

2. The method of claim 1, wherein the neural network is iteratively trained using a labeled corpus.

3. The method of claim 1, wherein the mel-frequency cepstral coefficients are extracted from each of the plurality of audio samples using Hamming windows.

4. The method of claim 3, wherein a set of delta and double-delta coefficients are added to the set of mel-frequency cepstral coefficients and the set of mel-frequency cepstral coefficients are embodied in the form of 60-dimensional feature vectors for every one of a plurality of frames.

5. The method of claim 4, further comprising: encoding a set of ordering information contained in the audio recording by mapping a plurality ...
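
A hedged sketch of the 60-dimensional descriptors from claims 3 and 4: 20 MFCCs over Hamming windows with delta and double-delta coefficients appended. librosa defaults are assumed where the claims are silent, and the tone input is a stand-in for real speech.

import numpy as np
import librosa

sr = 16000
y = librosa.tone(440, sr=sr, duration=1.0)        # stand-in audio signal

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                            n_fft=400, hop_length=160,
                            window="hamming")
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])       # 60-dim vector per frame
print(features.shape)                             # (60, n_frames)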

15-02-2018 publication date

DENOISING A SIGNAL

Number: US20180047409A1
Assignee:

A computer-implemented method according to one embodiment includes creating a clean dictionary utilizing a clean signal, creating a noisy dictionary utilizing a first noisy signal, determining a time-varying projection utilizing the clean dictionary and the noisy dictionary, and denoising a second noisy signal utilizing the time-varying projection.

1. A computer-implemented method, comprising: creating a clean dictionary, utilizing a clean signal; creating a noisy dictionary, utilizing a first noisy signal; determining a time varying projection, utilizing the clean dictionary and the noisy dictionary; and denoising a second noisy signal, utilizing the time varying projection.

2. The computer-implemented method of claim 1, wherein creating the noisy dictionary includes creating a noisy spectrogram, converting the noisy spectrogram into a plurality of noisy spectro-temporal building blocks by applying a convolutive non-negative matrix factorization (CNMF) algorithm to the noisy spectrogram, and adding the plurality of noisy spectro-temporal building blocks to the noisy dictionary.

3. The computer-implemented method of claim 1, wherein determining the time varying projection includes: generating a time activation matrix for the clean signal, utilizing the clean dictionary; generating a time activation matrix for the first noisy signal, utilizing the noisy dictionary; and comparing the time activation matrix for the clean signal and the time activation matrix for the first noisy signal to create the time varying projection.

4. The computer-implemented method of claim 1, further comprising expanding the clean dictionary and the noisy dictionary by updating the clean dictionary and the noisy dictionary to include new clean spectro-temporal building blocks and new noisy spectro-temporal building blocks created utilizing additional clean and noisy signals.

5. The computer-implemented method of claim 1, wherein creating the clean dictionary includes ...

06-02-2020 publication date

ACOUSTIC SIGNAL PROCESSING DEVICE, ACOUSTIC SIGNAL PROCESSING METHOD, AND HANDS-FREE COMMUNICATION DEVICE

Number: US20200045166A1
Author: Furuta Satoru
Assignee: Mitsubishi Electric Corporation

An acoustic signal processing device includes an acoustic signal analysis unit that analyzes an acoustic feature of a reception signal from a far end side and thereby generates an appropriate control signal, an echo canceller that cancels an acoustic echo mixed into an input acoustic signal, a noise canceller that cancels noise mixed into the input acoustic signal, and a speech enhancement unit that enhances a feature of speech included in the input acoustic signal. High speech quality can thus be maintained irrespective of the type of mobile phone or communication network, and a high-quality hands-free voice call and high-accuracy speech recognition become possible.

1. An acoustic signal processing device comprising: a first storage unit storing first reference data; a second storage unit storing second reference data; an acoustic parameter calculation unit to analyze a first acoustic signal of reception voice inputted from a far end side and to generate an analytic acoustic parameter; an acoustic parameter analysis unit to analyze the analytic acoustic parameter by using the first reference data and thereby generate a parameter analysis result; a control signal generation unit to generate a control signal for correcting a second acoustic signal of transmission voice inputted from a near end side based on the parameter analysis result by using the second reference data; and an acoustic signal correction unit to make a correction of the second acoustic signal based on the control signal.

2. The acoustic signal processing device according to claim 1, wherein the acoustic signal correction unit includes an echo canceller that performs an echo cancellation process, as the correction for removing an acoustic echo included in the second acoustic signal, based on the control signal.

3. The acoustic signal processing device according to claim 1, wherein the acoustic signal correction unit includes a noise canceller that performs a noise cancellation ...
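
A hedged sketch of one classic way to implement the echo-cancellation correction, a normalized LMS adaptive filter; this is not Mitsubishi's actual algorithm, and the echo path, tap count, and step size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)
far_end = rng.normal(size=5000)                    # reception signal (loudspeaker)
true_echo_path = np.array([0.6, 0.3, 0.1])         # unknown room response
mic = np.convolve(far_end, true_echo_path)[:5000]  # near-end mic picks up echo

taps, mu, eps = 8, 0.5, 1e-6
w = np.zeros(taps)                                 # adaptive echo-path estimate
out = np.zeros_like(mic)

for n in range(taps, len(mic)):
    x = far_end[n - taps + 1:n + 1][::-1]          # recent far-end samples
    echo_hat = w @ x                               # predicted acoustic echo
    e = mic[n] - echo_hat                          # residual after cancellation
    w += mu * e * x / (x @ x + eps)                # NLMS weight update
    out[n] = e

print("residual echo power:", float(np.mean(out[-1000:] ** 2)))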

14-02-2019 publication date

IDENTIFICATION OF AUDIO SIGNALS IN SURROUNDING SOUNDS AND GUIDANCE OF AN AUTONOMOUS VEHICLE IN RESPONSE TO THE SAME

Number: US20190049989A1
Assignee:

Embodiments include apparatuses, systems, and methods for a computer-aided or autonomous driving (CA/AD) system to identify and respond to an audio signal, e.g., an emergency alarm signal. In embodiments, the CA/AD driving system may include a plurality of microphones disposed to capture the audio signal included in sounds surrounding a semi-autonomous or autonomous (SA/AD) vehicle. In embodiments, an audio analysis unit may receive the audio signal and extract audio features from the audio signal. In embodiments, a neural network such as a Deep Neural Network (DNN) may receive the extracted audio features from the audio analysis unit and generate a probability score to allow identification of the audio signal. In embodiments, the CA/AD driving system may control driving elements of the SA/AD vehicle to autonomously or semi-autonomously drive the SA/AD vehicle in response to the identification. Other embodiments may also be described and claimed.

1. A computer-aided or autonomous driving (CA/AD) apparatus to identify an audio signal that originates outside of or proximate to a semi-autonomous or autonomous driving (SA/AD) vehicle and is included in surrounding sounds proximate to the SA/AD vehicle, the CA/AD driving apparatus comprising: a communication interface to receive, from a plurality of microphones coupled to the SA/AD vehicle, the audio signal; an audio analysis unit coupled to receive the audio signal from the communication interface and to divide the audio signal into a plurality of frames and extract audio features from one or more of the plurality of frames; and a neural network classifier coupled to receive the extracted audio features from the audio analysis unit and to generate a probability score for one or more of the plurality of frames to classify the one or more of the plurality of frames to allow identification of the audio signal.

2. The CA/AD driving apparatus of claim 1, further comprising the plurality of microphones and to ...

10-03-2022 publication date

Homomorphic encryption of communications involving voice-enabled devices in a distributed computing environment

Number: US20220075880A1
Assignee: Toronto Dominion Bank

The disclosed exemplary embodiments include computer-implemented systems, devices, apparatuses, and processes that maintain data confidentiality in communications involving voice-enabled devices in a distributed computing environment using homomorphic encryption. By way of example, an apparatus may receive encrypted command data from a computing system, decrypt the encrypted command data using a homomorphic private key, and perform operations that associate the decrypted command data with a request for an element of data. Using a public cryptographic key associated with a device, the apparatus may generate an encrypted response that includes the requested data element and transmit the encrypted response to the device. The device may decrypt the encrypted response using a private cryptographic key and perform operations that present first audio content representative of the requested data element through an acoustic interface.

10-03-2022 publication date

AUDIO PROCESSING METHOD AND APPARATUS, ELECTRONIC DEVICE, AND STORAGE MEDIUM

Number: US20220076692A1
Author: DENG Shuo

The embodiments of this application disclose an audio processing method and apparatus, an electronic device, and a storage medium. In the embodiments of this application, a current playback environment of audio may be obtained; audio recognition may be performed on ambient sound of the current playback environment in a case that the current playback environment is in a foreground state; foreground sound in the ambient sound may be determined according to an audio recognition result; the foreground sound in the ambient sound may be classified to determine a type of the foreground sound; and audio mixing may be performed on the foreground sound and the audio to obtain mixed playback sound based on the type of the foreground sound.

1. An audio processing method, executed by an electronic device, and comprising: obtaining a current playback environment of audio; performing audio recognition on ambient sound of the current playback environment in a case that the current playback environment is in a foreground state; determining foreground sound in the ambient sound according to an audio recognition result; classifying the foreground sound in the ambient sound to determine a type of the foreground sound; and performing audio mixing on the foreground sound and the audio to obtain a mixed playback sound based on the type of the foreground sound.

2. The method according to claim 1, wherein the performing audio recognition on ambient sound of the current playback environment in a case that the current playback environment is in a foreground state comprises: sampling the ambient sound of the current playback environment in a case that the current playback environment is in the foreground state; extracting a Mel-frequency cepstrum coefficient feature of the ambient sound obtained by the sampling, to obtain a Mel feature of the ambient sound; and performing audio recognition on the Mel feature of the ambient sound by using an adaptive discriminant network.

3. The method according to ...

20-02-2020 publication date

SOUND IDENTIFICATION UTILIZING PERIODIC INDICATIONS

Number: US20200058297A1
Assignee:

A computer-implemented method is provided. The computer-implemented method is performed by a speech recognition system having at least a processor. The method includes performing a speech recognition operation on the audio signal data to decode the audio signal data into a textual representation based on the estimated sound identification information from a neural network having periodic indications and components of a frequency spectrum of the audio signal data inputted thereto. The neural network includes a plurality of fully-connected network layers having a first layer that includes a plurality of first nodes and a plurality of second nodes. The method further comprises training the neural network by initially isolating the periodic indications from the components of the frequency spectrum in the first layer by setting weights between the first nodes and a plurality of input nodes corresponding to the periodic indications to 0.

1. A computer-implemented method performed by a speech recognition system having at least a processor, the method comprising: performing, by a processor, a speech recognition operation on audio signal data to decode the audio signal data into a textual representation based on estimated sound identification information from a neural network having periodic indications and components of a frequency spectrum of the audio signal data inputted thereto, wherein the neural network includes a plurality of fully-connected network layers having a first layer that includes a plurality of first nodes and a plurality of second nodes, and wherein the method further comprises training the neural network by initially isolating the periodic indications from the components of the frequency spectrum in the first layer by setting weights between the first nodes and a plurality of input nodes corresponding to the periodic indications to 0.

2. The computer-implemented method of claim 1, wherein the estimating sound identification includes identifying ...
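
A hedged sketch of the training detail in the claims: in the first fully-connected layer, the weights connecting the periodic-indication input nodes to the "first nodes" start at zero, so the two input groups are initially isolated. The group sizes are illustrative assumptions.

import torch
import torch.nn as nn

n_spectrum, n_periodic = 64, 16          # two groups of input nodes
n_first, n_second = 32, 32               # first/second nodes of layer 1

layer1 = nn.Linear(n_spectrum + n_periodic, n_first + n_second)

with torch.no_grad():
    # Rows 0..n_first-1 are the first nodes; columns n_spectrum.. are the
    # periodic-indication inputs. Zero that block before training begins.
    layer1.weight[:n_first, n_spectrum:] = 0.0

x = torch.randn(4, n_spectrum + n_periodic)
print(layer1(x).shape)                   # forward pass still works: (4, 64)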

17-03-2022 publication date

MEASUREMENT OF NEUROMOTOR COORDINATION FROM SPEECH

Number: US20220079511A1
Assignee:

A system for measuring neuromotor disorders from speech is configured to receive an audio recording that includes spoken speech and compute feature coefficients from at least a portion of the spoken speech in the audio recording. The feature coefficients represent at least one characteristic of the spoken speech in the audio recording. One or more vocal tract variables may be computed from the feature coefficients. The vocal tract variables may represent a physical configuration of a vocal tract associated with at least one of the one or more sounds. The vocal tract variables and/or the feature coefficients are used to determine if a disorder that affects neuromotor speech is present.

1. A method for measuring neuromotor coordination from speech: receiving an audio recording that includes spoken speech; computing time-varying feature coefficients from at least a portion of the spoken speech in the audio recording, the feature coefficients representing at least one characteristic of the at least a portion of the spoken speech in the audio recording; computing, from the feature coefficients, one or more time-varying vocal tract variables representing time variation of physical configuration of a vocal tract, the time-varying vocal tract variables associated with at least one of the one or more sounds; and determining a measurement of a disorder based at least in part on a degree of correlation between at least two of the vocal tract variables.

2. The method of claim 1, wherein the feature coefficients represent characteristics of an audio power spectrum of the portion of the speech.

3. The method of claim 1, wherein the feature coefficients comprise cepstral coefficients.

4. The method of claim 1, wherein computing the vocal tract variables comprises providing the feature coefficients as inputs to a neural network, and using the neural network to compute the vocal tract variables from the feature coefficients.

5. The method of claim 4, wherein the neural network comprises stored parameters determined ...
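
A minimal sketch of the "degree of correlation" quantity from claim 1, here simply the Pearson correlation of two toy vocal tract variable trajectories; the trajectories themselves are illustrative stand-ins.

import numpy as np

t = np.linspace(0, 2, 200)
lip_aperture = np.sin(2 * np.pi * 3 * t)         # vocal tract variable 1
tongue_body = np.sin(2 * np.pi * 3 * t + 0.4)    # vocal tract variable 2

degree_of_correlation = float(np.corrcoef(lip_aperture, tongue_body)[0, 1])
print(round(degree_of_correlation, 3))           # feeds the disorder measurement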

04-03-2021 publication date

TRIGGER TO KEYWORD SPOTTING SYSTEM (KWS)

Number: US20210065689A1
Assignee:

In accordance with embodiments, methods and systems for a trigger to the KWS are provided. The computing device converts an audio signal into a plurality of audio frames. The computing device generates a Mel Frequency Cepstral Coefficients (MFCC) matrix. The MFCC matrix includes N columns. Each column of the N columns comprises coefficients associated with audio features corresponding to a different audio frame of the plurality of audio frames. The computing device determines that a trigger condition is satisfied based on an MFCC_0 buffer. The MFCC_0 buffer comprises a first row of the MFCC matrix. The computing device then provides the MFCC matrix to a neural network for the neural network to use the MFCC matrix to make keyword inference based on the determining that the trigger condition is satisfied.

1. A method comprising: converting, by a computing device, an audio signal into a plurality of audio frames; generating, by the computing device, a Mel Frequency Cepstral Coefficients (MFCC) matrix based on the plurality of audio frames, the MFCC matrix including N columns, each column of the N columns comprising coefficients associated with audio features corresponding to a different audio frame of the plurality of audio frames; determining, by the computing device, that a trigger condition is satisfied based on an MFCC_0 buffer, the MFCC_0 buffer comprising a first row of the MFCC matrix; and providing, by the computing device, the MFCC matrix to a neural network for the neural network to use the MFCC matrix to make keyword inference based on the determining.

2. The method of claim 1, the determining comprising: receiving a new audio frame; performing a left shift to update the MFCC matrix such that an i-th column of the MFCC matrix becomes an (i−1)-th column of the MFCC matrix and such that a last column of the MFCC matrix includes new coefficients associated with new audio features corresponding to the new audio frame; updating the MFCC_0 buffer based on the MFCC matrix ...
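
A hedged sketch of the MFCC matrix bookkeeping from the claims: N columns of per-frame coefficients, a left shift on each new frame, and a trigger test on the MFCC_0 buffer (the first row). The energy-like threshold is an assumption; the patent does not specify the trigger condition here.

import numpy as np

n_coeffs, n_cols = 13, 49
mfcc_matrix = np.zeros((n_coeffs, n_cols))

def push_frame(mfcc_matrix, new_column):
    mfcc_matrix[:, :-1] = mfcc_matrix[:, 1:]   # column i becomes column i-1
    mfcc_matrix[:, -1] = new_column            # newest frame enters on the right
    return mfcc_matrix

def trigger(mfcc_matrix, threshold=5.0):
    mfcc_0_buffer = mfcc_matrix[0, :]          # first row of the MFCC matrix
    return float(mfcc_0_buffer.mean()) > threshold

rng = np.random.default_rng(4)
for _ in range(n_cols):
    mfcc_matrix = push_frame(mfcc_matrix, rng.normal(loc=6.0, size=n_coeffs))

if trigger(mfcc_matrix):
    print("condition satisfied: hand MFCC matrix to the KWS neural network")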

04-03-2021 publication date

AUDIO SCENE RECOGNITION USING TIME SERIES ANALYSIS

Number: US20210065734A1
Assignee:

A method is provided. Intermediate audio features are generated from respective segments of an input acoustic time series for a same scene. Using a nearest neighbor search, respective segments of the input acoustic time series are classified based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series. Each respective segment corresponds to a respective different acoustic window. The generating step includes learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series, dividing the same scene into the different windows having varying MFCC features, and feeding the MFCC features of each window into respective LSTM units such that a hidden state of each respective LSTM unit is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different windows.

1. A computer-implemented method for audio scene classification in an information retrieval system, comprising: generating intermediate audio features from respective segments of an input acoustic time series for a same scene captured by a sensor device; and classifying, using a nearest neighbor search, the respective segments of the input acoustic time series based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic time series, each of the respective segments corresponding to a respective different one of different acoustic windows; wherein said generating step comprises: learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic time series; dividing the same scene into the different acoustic windows having varying ones of the MFCC features; and feeding the MFCC features of each of the different acoustic windows into respective LSTM units ...

04-03-2021 publication date

SEQUENCE MODELS FOR AUDIO SCENE RECOGNITION

Number: US20210065735A1
Assignee:

A method is provided. Intermediate audio features are generated from an input acoustic sequence. Using a nearest neighbor search, segments of the input acoustic sequence are classified based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic sequence. Each segment corresponds to a respective different acoustic window. The generating step includes learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic sequence. The generating step includes dividing the same scene into the different acoustic windows having varying MFCC features. The generating step includes feeding the MFCC features of each of the different acoustic windows into respective LSTM units such that a hidden state of each respective LSTM unit is passed through an attention layer to identify feature correlations between hidden states at different time steps corresponding to different ones of the different acoustic windows.

1. A computer-implemented method for audio scene classification, comprising: generating intermediate audio features from an input acoustic sequence; and classifying, using a nearest neighbor search, segments of the input acoustic sequence based on the intermediate audio features to generate a final intermediate feature as a classification for the input acoustic sequence, each of the segments corresponding to a respective different one of different acoustic windows; wherein said generating step comprises: learning the intermediate audio features from Multi-Frequency Cepstral Component (MFCC) features extracted from the input acoustic sequence; dividing the same scene into the different acoustic windows having varying ones of the MFCC features; and feeding the MFCC features of each of the different acoustic windows into respective LSTM units such that a hidden state of each of the respective LSTM units is passed through an attention layer to identify ...

01-03-2018 publication date

User authentication using audiovisual synchrony detection

Number: US20180063106A1
Assignee: International Business Machines Corp

Methods, computing systems and computer program products implement embodiments of the present invention that include receiving, at a first time, first video and first audio signals generated in response to a user uttering a passphrase, and receiving, at a second time subsequent to the first time, second video and second audio signals generated in response to the user uttering the passphrase. Upon computing an audio temporal alignment between the first and the second audio signals and computing a video temporal alignment between the first and the second video signals, the user can be authenticated by comparing the audio temporal alignment to the video temporal alignment.

09-03-2017 publication date

COVARIANCE MATRIX ESTIMATION WITH STRUCTURAL-BASED PRIORS FOR SPEECH PROCESSING

Number: US20170069313A1
Author: Aronowitz Hagai
Assignee:

According to some embodiments of the present invention there is provided a computerized method for speech processing using a Gaussian Mixture Model. The method comprises the action of receiving by hardware processor(s) two or more covariance values representing relationships between distributions of speech coefficient values that represent two or more audible input speech signals recorded by a microphone. The method comprises the action of computing two or more eigenvectors and eigenvalues using a principal component analysis of the covariance values, transforming the speech coefficient values using the eigenvectors, and computing two or more second covariance values from the transformed speech coefficient values. The method comprises the action of modifying some of the second covariance values according to the eigenvalues, the covariance values, and two or more indices of the speech coefficient values. The second covariance values are provided to the speech processor comprising the Gaussian Mixture Model.

1. A computerized method for speech processing using a Gaussian Mixture Model, comprising: receiving by at least one hardware processor a plurality of covariance values representing relationships between distributions of a plurality of speech coefficient values from a speech processor comprising a Gaussian Mixture Model, wherein said plurality of speech coefficient values represent a plurality of audible input speech signals recorded by at least one microphone; computing, using said at least one hardware processor, a plurality of eigenvectors and eigenvalues using a principal component analysis of said plurality of covariance values; transforming, using said at least one hardware processor, said plurality of speech coefficient values using said plurality of eigenvectors and computing a plurality of second covariance values from said transformed plurality of speech coefficient values; modifying, using said at least one hardware processor, some of said plurality of second ...
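
A hedged sketch of the eigen-transform step: PCA on a covariance estimate, projection of the speech coefficients, and re-estimation of covariances in the transformed space. The patent's prior-based modification of the second covariance values is not shown, and the data is synthetic.

import numpy as np

rng = np.random.default_rng(5)
coeffs = rng.normal(size=(500, 12))              # speech coefficient vectors

cov = np.cov(coeffs, rowvar=False)               # first covariance values
eigvals, eigvecs = np.linalg.eigh(cov)           # principal component analysis

transformed = (coeffs - coeffs.mean(axis=0)) @ eigvecs
second_cov = np.cov(transformed, rowvar=False)   # second covariance values

# In the PCA basis the covariance is diagonal and recovers the eigenvalues.
print(np.allclose(np.diag(second_cov), eigvals, atol=1e-8))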

11-03-2021 publication date

KEYWORD SPOTTING APPARATUS, METHOD, AND COMPUTER-READABLE RECORDING MEDIUM THEREOF

Number: US20210074270A1
Assignee:

A keyword spotting apparatus, method, and computer-readable recording medium are disclosed. The keyword spotting method using an artificial neural network according to an embodiment of the disclosure may include obtaining an input feature map from an input voice; performing a first convolution operation on the input feature map for each of n different filters having the same channel length as the input feature map, wherein a width of each of the filters is w and the width w is less than a width of the input feature map; performing a second convolution operation on a result of the first convolution operation for each of different filters having the same channel length as the input feature map; storing a result of the second convolution operation as an output feature map; and extracting a voice keyword by applying the output feature map to a learned machine learning model.

1. A keyword spotting method using an artificial neural network, the method comprising: obtaining an input feature map from an input voice; performing a first convolution operation on the input feature map for each of n_1 different filters having the same channel length as the input feature map, wherein a width of each of the filters is w_1 and the width w_1 is less than a width of the input feature map; performing a second convolution operation on a result of the first convolution operation for each of different filters having the same channel length as the input feature map; storing a result of the second convolution operation as an output feature map; and extracting a voice keyword by applying the output feature map to a learned machine learning model.

2. The method of claim 1, wherein a stride value of the first convolution operation is 1.

3. The method of claim 1, wherein the performing of a second convolution operation includes: performing a first sub-convolution operation on m_2 different filters having a width of w_2 and the result of the first convolution operation; ...
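
A hedged sketch of the claimed convolution shape: filters that span the full channel depth of the input feature map but have a width smaller than the feature map's width, expressed as a PyTorch Conv1d with stride 1 as in claim 2. All sizes are illustrative assumptions.

import torch
import torch.nn as nn

channels, width = 40, 101        # input feature map: (channels, width)
n1, w1 = 64, 9                   # n_1 filters of width w_1 < width, full depth

conv1 = nn.Conv1d(in_channels=channels, out_channels=n1,
                  kernel_size=w1, stride=1)   # stride 1 per claim 2

feature_map = torch.randn(1, channels, width)     # e.g. MFCCs over time
out = conv1(feature_map)
print(out.shape)                 # torch.Size([1, 64, 93]) = (n1, width - w1 + 1)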

11-03-2021 publication date

PRIVACY-PRESERVING VOICEPRINT AUTHENTICATION APPARATUS AND METHOD

Number: US20210075787A1
Author: Yan Zheng, Zhang Rui
Assignee:

A voiceprint authentication apparatus is provided, comprising: a voice receiving module configured to receive a user's voices in different speaking modes; a feature extraction module configured to extract respective sets of voice features from the user's voices in different speaking modes; a synthesis module configured to generate a first voiceprint template by synthesizing the respective sets of voice features; and a first communication module configured to send the first voiceprint template to a server to authenticate the user, wherein the user's voices and the respective sets of voice features are not sent to the server. A corresponding voice authentication method, as well as a computer readable medium, are also provided.

1.-23. (canceled)

24. An apparatus, comprising: at least one processing core, at least one memory including computer program code, the at least one memory and the computer program code being configured to, with the at least one processing core, cause the apparatus at least to: receive a user's voices in different speaking modes; extract respective sets of voice features from the user's voices in the different speaking modes; generate a first voiceprint template by synthesizing the respective sets of the voice features; and send the first voiceprint template to a server to authenticate the user, wherein the user's voices and the respective sets of the voice features are not sent to the server.

25. The apparatus of claim 24, wherein the apparatus is further configured to: extract a first set of features from the user's voice in a first speaking mode based on a linear prediction cepstrum coefficient algorithm; and extract a second set of features from the user's voice in a second speaking mode based on a mel frequency cepstral coefficient algorithm.

26. The apparatus of claim 24, wherein the apparatus is further configured to: synthesize the respective sets of the voice features with a voice synthesis algorithm based on a log magnitude approximate vocal tract ...

07-03-2019 publication date

REAL-TIME VOCAL FEATURES EXTRACTION FOR AUTOMATED EMOTIONAL OR MENTAL STATE ASSESSMENT

Number: US20190074028A1
Author: Howard Newton
Assignee:

Embodiments of the present systems and methods may provide techniques for extracting vocal features from voice signals to determine an emotional or mental state of one or more persons, such as to determine a risk of suicide and other mental health issues. For example, as a person's mental state may indirectly alter his or her speech, suicidal risk in, for example, hotline calls may be determined through speech analysis. In embodiments, such techniques may include preprocessing of the original recording, vocal feature extraction, and prediction processing. For example, in an embodiment, a computer-implemented method of determining an emotional or mental state of a person may comprise acquiring an audio signal relating to a conversation including the person, extracting signal components relating to an emotional or mental state of at least the person, and outputting information characterizing the extracted emotional or mental state of the person.

1. A computer-implemented method of determining an emotional or mental state of a person, the method comprising: acquiring an audio signal relating to a conversation including the person; extracting signal components relating to an emotional or mental state of at least the person; and outputting information characterizing the extracted emotional or mental state of the person.

2. The method of claim 1, wherein acquiring the audio signal relating to a conversation comprises: recording a conversation between a caller to a suicide help line and a counselor of the suicide help line.

3. The method of claim 1, wherein the signal components relating to emotional intent of at least one party comprise: extracting signal features from the audio signal comprising discriminative speech indicators, which differentiate between speech and silence; determining which extracted signal features to use; and enhancing the robustness of the determination against background noise.

4. The method of claim 3, wherein: determining which extracted ...

05-03-2020 publication date

Clockwork Hierarchical Variational Encoder

Number: US20200074985A1
Assignee: Google LLC

A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme, each fixed-length predicted mel-frequency spectrogram frame representing mel-spectral information of the corresponding phoneme.

1. A method comprising: receiving, at data processing hardware, a text utterance having at least one word, each word having at least one syllable, each syllable having at least one phoneme; selecting, by the data processing hardware, a mel spectral embedding for the text utterance; and for each phoneme, using the selected mel spectral embedding: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme, each fixed-length predicted mel-frequency spectrogram frame representing mel-spectral information of the corresponding phoneme.

2. The method of claim 1, wherein a network representing a hierarchical linguistic structure of the text utterance comprises: a first level including each syllable of the text utterance; a second level including each phoneme of the text utterance; and a third level including each fixed- ...
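
A hedged sketch of the "fixed-length frames from a predicted duration" step: a predicted phoneme duration is quantized into fixed 12.5 ms mel-spectrogram frames. The frame length, mel dimension, and random frame values are toy stand-ins, not the patent's model outputs.

import numpy as np

frame_hop_s = 0.0125            # fixed frame length (12.5 ms), an assumption
n_mel = 80

def frames_for_phoneme(predicted_duration_s, rng):
    n_frames = max(1, round(predicted_duration_s / frame_hop_s))
    # Each row is one fixed-length predicted mel-frequency spectrogram frame.
    return rng.normal(size=(n_frames, n_mel))

rng = np.random.default_rng(6)
for phoneme, dur in [("h", 0.04), ("e", 0.11), ("l", 0.05), ("o", 0.16)]:
    frames = frames_for_phoneme(dur, rng)
    print(phoneme, "->", frames.shape[0], "frames of", n_mel, "mel bins")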

05-03-2020 publication date

LOW ENERGY DEEP-LEARNING NETWORKS FOR GENERATING AUDITORY FEATURES FOR AUDIO PROCESSING PIPELINES

Number: US20200074989A1
Assignee:

Low energy deep-learning networks for generating auditory features such as mel frequency cepstral coefficients in audio processing pipelines are provided. In various embodiments, a first neural network is trained to output auditory features such as mel-frequency cepstral coefficients, linear predictive coding coefficients, perceptual linear predictive coefficients, spectral coefficients, filter bank coefficients, and/or spectro-temporal receptive fields based on input audio samples. A second neural network is trained to output a classification based on input auditory features such as mel-frequency cepstral coefficients. An input audio sample is provided to the first neural network. Auditory features such as mel-frequency cepstral coefficients are received from the first neural network. The auditory features such as mel-frequency cepstral coefficients are provided to the second neural network. A classification of the input audio sample is received from the second neural network.

1. A neurosynaptic chip comprising: a first artificial neural network, the first artificial neural network being trained to output auditory features based on input audio samples; and a second artificial neural network, the second artificial neural network being operatively coupled to the first artificial neural network and receiving therefrom the auditory features, the second artificial neural network being trained to output a classification of the input audio samples based on the auditory features.

2. The neurosynaptic chip of claim 1, wherein the auditory features comprise mel-frequency cepstral coefficients.

3. The neurosynaptic chip of claim 1, wherein the auditory features comprise linear predictive coding coefficients, perceptual linear predictive coefficients, spectral coefficients, filter bank coefficients, and/or spectro-temporal receptive fields.

4. The neurosynaptic chip of claim 1, wherein the auditory features comprise a combination of ...

Подробнее
05-03-2020 дата публикации

System and Method for Relative Enhancement of Vocal Utterances in an Acoustically Cluttered Environment

Номер: US20200074995A1
Принадлежит:

The invention discloses systems and methods for enhancing the sound of vocal utterances of interest in an acoustically cluttered environment. The system generates canceling signals (sound suppression signals) for an ambient audio environment and identifies and characterizes desired vocal signals and hence a vocal stream or multiple streams of interest. Each canceling signal, or collectively, the noise canceling stream, is processed so that signals associated with the desired audio stream or streams are dynamically removed from the canceling stream. This modified noise canceling stream is combined (electronically or acoustically) with the ambient to effectuate a destructive interference of all ambient sound except for the removed audio streams, thus "enhancing" the vocal streams with respect to the unwanted ambient sound. Cepstral analysis may be used to identify a fundamental frequency associated with a voiced human utterance. Filtering derived from that analysis removes the voiced utterance from the canceling signal.

1. A method of generating a modified electronic noise canceling signal from an electronic noise canceling signal generated from an electronic representation of an acoustically-cluttered audio environment to destructively interfere with said audio environment, said method for enhancing the discernibility of vocal streams of interest made in the audio environment, comprising:
identifying from within said electronic representation time-varying, audio-frequency components associated with vocal utterances;
identifying from among said vocal utterances at least one sequence of vocal utterances comprising a target vocal stream;
characterizing audio-frequency components of said target vocal stream; and
removing from the electronic noise canceling signal audio-frequency components associated with said target vocal stream.
2. The method of claim 1, wherein the step of characterizing includes tracking a fundamental frequency associated with said target vocal stream. ...
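For readers unfamiliar with the cepstral analysis mentioned above, the following minimal NumPy sketch (the function name and parameter choices are mine, not the patent's) estimates a fundamental frequency as the peak of the real cepstrum; a filter built around multiples of this frequency could then remove the voiced utterance from the canceling signal:

import numpy as np

def fundamental_frequency(frame, sr, fmin=60.0, fmax=400.0):
    # Real cepstrum: inverse FFT of the log magnitude spectrum. A voiced
    # frame of 30-40 ms shows a peak at the quefrency (in samples) equal
    # to sr / f0.
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
    cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
    q_lo, q_hi = int(sr / fmax), int(sr / fmin)   # plausible pitch range
    peak = q_lo + np.argmax(cepstrum[q_lo:q_hi])
    return sr / peak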

Подробнее
05-03-2020 дата публикации

ATTENTION MECHANISM FOR COPING WITH ACOUSTIC-LIPS TIMING MISMATCH IN AUDIOVISUAL PROCESSING

Номер: US20200076988A1
Принадлежит:

Embodiments of the present systems and methods may provide techniques for handling acoustic-lips timing mismatch in audiovisual processing. In embodiments, the context-dependent time shift between the audio and visual streams may be explicitly modeled using an attention mechanism. For example, in an embodiment, a computer-implemented method for determining a context-dependent time shift of audio and video features in an audiovisual stream or file may comprise receiving audio information and video information of the audiovisual stream or file, processing the audio information and video information separately to generate a new representation of the audio information, including information relating to features of the audio information, and a new representation of the video information, including information relating to features of the video information, and mapping features of the audio information and features of the video information using an attention mechanism to identify synchronized pairs of audio and video features.

1. A computer-implemented method for determining a context-dependent time shift of audio and video features in an audiovisual stream or file, the method comprising:
receiving audio information and video information of the audiovisual stream or file;
processing the audio information and video information separately to generate a new representation of the audio information, including information relating to features of the audio information, and a new representation of the video information, including information relating to features of the video information; and
mapping features of the audio information and features of the video information using an attention mechanism to identify pairs of audio and video features, wherein the pairs of audio and video features are identified as being synchronized (true) features that contain a recording of a speaker in which the audio information of the recording and the video information of lips of the speaker are ...
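The abstract does not fix a particular attention variant; as one hedged illustration, a plain scaled dot-product attention over per-frame embeddings (assuming both streams have already been projected to a common dimension) maps each audio frame to the video frames it best matches:

import numpy as np

def attend(audio_feats, video_feats):
    # audio_feats: (Ta, d), video_feats: (Tv, d). The attention weights
    # act as a soft, context-dependent alignment between the two streams.
    d = audio_feats.shape[-1]
    scores = audio_feats @ video_feats.T / np.sqrt(d)         # (Ta, Tv)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)             # softmax per audio frame
    return weights @ video_feats, weights                     # aligned video, alignment map

The argmax over each row of the alignment map gives the video frame paired with each audio frame, from which a context-dependent time shift can be read off.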

Подробнее
14-03-2019 дата публикации

Assessment of a Pulmonary Condition by Speech Analysis

Номер: US20190080803A1
Принадлежит: Cordio Medical Ltd.

Described embodiments include apparatus that includes a network interface and a processor. The processor is configured to receive, via the network interface, speech of a subject who suffers from a pulmonary condition related to accumulation of excess fluid, to identify, by analyzing the speech, one or more speech-related parameters of the speech, to assess, in response to the speech-related parameters, a status of the pulmonary condition, and to generate, in response thereto, an output indicative of the status of the pulmonary condition. Other embodiments are also described.

1. Apparatus, comprising:
a network interface; and
a processor, configured:
to receive, via the network interface, speech of a subject who suffers from a pulmonary condition related to accumulation of excess fluid,
to identify, by analyzing the speech, one or more speech-related parameters of the speech,
to assess, in response to the speech-related parameters, a status of the pulmonary condition, and
to generate, in response thereto, an output indicative of the status of the pulmonary condition.
2. The apparatus according to claim 1, wherein the processor is configured to analyze the speech by performing a spectral analysis of the speech.
3. The apparatus according to claim 1, wherein the processor is configured to analyze the speech by performing a cepstral analysis of the speech.
4. The apparatus according to claim 1, wherein the processor is further configured to identify, by analyzing the speech, a meaning of the speech, and wherein the processor is configured to assess the status in response to the meaning.
5. The apparatus according to claim 1, wherein the processor is further configured to prompt the subject to provide, by orally responding to a question, the speech.
6. The apparatus according to claim 5, wherein the processor is configured to prompt the subject to provide the speech by: placing a call to the subject, and upon the ...

Подробнее
31-03-2022 дата публикации

System and Method for Hierarchical Audio Source Separation

Номер: US20220101869A1

The audio processing system includes a memory to store a neural network trained to process an audio mixture to output estimation of at least a subset of a set of audio sources present in the audio mixture. The audio sources are subject to hierarchical constraints enforcing a parent-children hierarchy on the set of audio sources, such that a parent audio source in includes a mixture of its one or multiple children audio sources. The subset includes a parent audio source and at least one of its children audio sources. The system further comprises a processor to process a received input audio mixture using the neural network to estimate the subset of audio sources and their mutual relationships according to the parent-children hierarchy. The system further includes an output interface configured to render the extracted audio sources and their mutual relationships. 1. An audio processing system , comprising:a memory configured to store a neural network trained to process an audio mixture to output estimation of at least a subset of a set of audio sources present in the audio mixture, wherein the audio sources are subject to hierarchical constraints enforcing a parent-children hierarchy on the set of audio sources, such that a parent audio source in the parent-children hierarchy includes a mixture of its one or multiple children audio sources, and wherein the subset includes at least one parent audio source and at least one of its children audio sources;an input interface configured to receive an input audio mixture;a processor configured to process the input audio mixture using the neural network to extract estimates of the subset of audio sources and their mutual relationships according to the parent-children hierarchy; andan output interface configured to render the extracted audio sources and their mutual relationships.2. The audio processing system of claim 1 , wherein the subset of audio sources corresponds to a path on the parent-children hierarchy of the set of ...

Подробнее
25-03-2021 дата публикации

COMPUTER-IMPLEMENT VOICE COMMAND AUTHENTICATION METHOD AND ELECTRONIC DEVICE

Номер: US20210090577A1
Принадлежит: Merry Electronics(Shenzhen) Co., Ltd.

A computer-implemented voice command authentication method is provided. The method includes obtaining a sound signal stream; calculating a Signal-to-Noise Ratio (SNR) value of the sound signal stream; converting the sound signal stream into a Mel-Frequency Cepstral Coefficients (MFCC) stream; calculating a Dynamic Time Warping (DTW) distance corresponding to the MFCC stream according to the MFCC stream and one of a plurality of sample streams generated by the Gaussian Mixture Model with Universal Background Model (GMM-UBM); calculating, according to the MFCC stream and the sample streams, a Log-likelihood Ratio value corresponding to the MFCC stream as a GMM-UBM score; determining whether the sound signal stream passes a voice command authentication according to the GMM-UBM score, the DTW distance and the SNR value; and, in response to determining that the sound signal stream passes the voice command authentication, determining that the sound signal stream is a voice stream spoken by a legitimate user.

1. A computer-implemented voice command authentication method, comprising:
obtaining a sound signal stream;
calculating a Signal-to-Noise Ratio (SNR) value of the sound signal stream;
converting the sound signal stream into a Mel-Frequency Cepstral Coefficients (MFCC) stream;
calculating a Dynamic Time Warping (DTW) distance corresponding to the MFCC stream according to the MFCC stream and one of a plurality of sample streams generated by the Gaussian Mixture Model with Universal Background Model (GMM-UBM);
calculating, according to the MFCC stream and the sample streams, a Log-likelihood ratio (LLR) value corresponding to the MFCC stream as a GMM-UBM score corresponding to the sound signal stream;
determining whether the sound signal stream passes a voice command authentication according to the GMM-UBM score, the DTW distance and the SNR value;
in response to determining that the sound signal stream passes the voice command authentication, determining that the sound signal stream is a ...
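A minimal NumPy sketch of the DTW distance between two MFCC streams (frames x coefficients); the function name and the Euclidean frame cost are my choices, not taken from the patent:

import numpy as np

def dtw_distance(a, b):
    # Classic O(n*m) dynamic programme over frame-to-frame costs.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

The decision rule in the abstract would then combine three tests, e.g. accept only if the GMM-UBM score and the SNR exceed their thresholds while the DTW distance stays below its own.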

Подробнее
05-05-2022 дата публикации

METHOD AND DEVICE FOR FUSING VOICEPRINT FEATURES, VOICE RECOGNITION METHOD AND SYSTEM, AND STORAGE MEDIUM

Номер: US20220139401A1
Принадлежит: SOUNDAI TECHNOLOGY CO., LTD.

A method and device for fusing voiceprint features. The method includes: obtaining at least two voiceprint features of a voice sample of a target speaker; and fusing the at least two voiceprint features on the basis of linear discriminant analysis. The present method introduces a technique employing linear discriminant analysis to fuse various voiceprint features, so as to improve complementarities between the various voiceprint features and distinctions between the fused features, thereby increasing the recognition rate for target speakers and reducing the misrecognition rate for non-target speakers in voiceprint authentication scenarios, and providing a personalized and improved user experience.
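As a sketch of the fusion idea (the synthetic data, shapes and label counts are placeholders; the patent does not specify them), two per-utterance voiceprint feature vectors can be concatenated and projected with scikit-learn's LDA, whose discriminant directions maximize between-speaker separation:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(200, 64))      # stand-in for one voiceprint feature type
feats_b = rng.normal(size=(200, 32))      # stand-in for a second type
speakers = rng.integers(0, 10, size=200)  # development-set speaker labels

stacked = np.hstack([feats_a, feats_b])           # concatenate the voiceprint features
lda = LinearDiscriminantAnalysis(n_components=9)  # at most n_classes - 1 components
fused = lda.fit_transform(stacked, speakers)      # the fused, more discriminative feature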

Подробнее
30-03-2017 дата публикации

Coherent Pitch and Intensity Modification of Speech Signals

Номер: US20170092285A1
Автор: Sorin Alexander
Принадлежит:

A method comprising: receiving an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame; calculating an original intensity contour of said utterance; generating a pitch-modified utterance based on the target pitch contour; calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames; calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour; and generating a coherently modified speech signal by time-dependent scaling of the intensity of said pitch-modified utterance according to said final intensity contour.

1. A method comprising:
operating one or more hardware processors for receiving an utterance embodied as a digitized speech signal, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame;
operating at least one of said one or more hardware processors for calculating an original intensity contour of said utterance;
operating at least one of said one or more hardware processors for generating a pitch-modified utterance based on the target pitch contour;
operating at least one of said one or more hardware processors for calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames, wherein each of the intensity modification factors is ten in the power of the twentieth of the ratio of average ...
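Claim 1 ends with "ten in the power of the twentieth of the ratio of average ...", i.e. the usual decibel-to-amplitude conversion 10^(x/20). A tiny sketch (names mine) of how such factors would scale the pitch-modified frames:

import numpy as np

def intensity_factor(level_difference_db):
    # "Ten in the power of the twentieth" of x is 10 ** (x / 20): a level
    # difference in decibels mapped to a linear amplitude scaling factor.
    return 10.0 ** (level_difference_db / 20.0)

def apply_intensity_contour(frames, factors):
    # Time-dependent scaling of the pitch-modified utterance, frame by frame.
    return [frame * g for frame, g in zip(frames, factors)]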

Подробнее
30-03-2017 дата публикации

Coherent Pitch and Intensity Modification of Speech Signals

Номер: US20170092286A1
Автор: Sorin Alexander
Принадлежит:

A method comprising: receiving an utterance, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame; calculating an original intensity contour of said utterance; generating a pitch modified utterance based on the target pitch contour; calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames; calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity contour; and generating a coherently modified speech signal by time dependent scaling of the intensity of said pitch modified utterance according to said final intensity contour. 1. A method comprising:operating one or more hardware processors for receiving an utterance embodied as digitized speech signal, an original pitch contour of the utterance, and a target pitch contour for the utterance, wherein the utterance comprises a plurality of consecutive frames, and wherein at least one of said frames is a voiced frame;operating one or more hardware processors for calculating an original intensity contour of said utterance;operating one or more hardware processors for generating a pitch-modified utterance based on the target pitch contour;operating one or more hardware processors for calculating an intensity modification factor for each of said frames, based on said original pitch contour and on said target pitch contour, to produce a sequence of intensity modification factors corresponding to said plurality of consecutive frames;operating one or more hardware processors for calculating a final intensity contour for said utterance by applying said intensity modification factors to said original intensity ...

Подробнее
09-04-2015 дата публикации

System and method of using neural transforms of robust audio features for speech processing

Номер: US20150100312A1
Принадлежит: AT&T INTELLECTUAL PROPERTY I LP

A system and method for processing speech includes receiving a first information stream associated with speech, the first information stream comprising micro-modulation features and receiving a second information stream associated with the speech, the second information stream comprising features. The method includes combining, via a non-linear multilayer perceptron, the first information stream and the second information stream to yield a third information stream. The system performs automatic speech recognition on the third information stream. The third information stream can also be used for training HMMs.

Подробнее
01-04-2021 дата публикации

ELECTRONIC DEVICE, METHOD AND SYSTEM OF IDENTITY VERIFICATION AND COMPUTER READABLE STORAGE MEDIUM

Номер: US20210097159A1
Принадлежит: PING AN TECHNOLOGY (SHENZHEN) CO., LTD.

An electronic device for identity verification includes a memory and a processor; the system of identity verification is stored in the memory, and executed by the processor to implement: after receiving current voice data of a target user, carrying out framing processing on the current voice data according to preset framing parameters to obtain multiple voice frames; extracting preset types of acoustic features in all the voice frames by using a predetermined filter, and generating multiple observed feature units corresponding to the current voice data according to the extracted acoustic features; pairwise coupling all the observed feature units with pre-stored observed feature units respectively to obtain multiple groups of coupled observed feature units; inputting the multiple groups of coupled observed feature units into a preset type of identity verification model generated by pre-training to carry out the identity verification on the target user.

1. An electronic device, comprising a memory and a processor connected with the memory, wherein the memory stores a system of identity verification which may be operated in the processor, and the system of identity verification is executed by the processor to implement the following steps:
S1, after current voice data of a target user to be subjected to identity verification are received, carrying out framing processing on the current voice data according to preset framing parameters to obtain multiple voice frames;
S2, extracting preset types of acoustic features in all the voice frames by using a predetermined filter, and generating multiple observed feature units corresponding to the current voice data according to the extracted acoustic features;
S3, pairwise coupling all the observed feature units with pre-stored observed feature units respectively to obtain multiple groups of coupled observed feature units;
S4, inputting the multiple groups of coupled observed ...

Подробнее
19-03-2020 дата публикации

Systems And Methods For Sensing Emotion In Voice Signals And Dynamically Changing Suggestions In A Call Center

Номер: US20200092419A1
Принадлежит:

Systems and methods for sensing emotion in voice signals and dynamically changing suggestions in a call center are disclosed. According to one embodiment, a computer-implemented method comprises receiving a call from a customer at a call center. A sample of the call is recorded and the Mel-Frequency Cepstral Coefficient is extracted from the sample. A machine learning model predicts an emotion of the customer and generates a confidence score for the emotion. A script of a call center agent is modified based on the emotion and the confidence score.

1. A computer-implemented method, comprising:
receiving a call from a customer at a call center;
recording a sample of the call;
extracting the Mel-Frequency Cepstral Coefficient (MFCC) from the sample;
using a machine learning model to predict an emotion of the customer using the MFCC;
generating a confidence score for the emotion; and
modifying a script of a call center agent based on the emotion and the confidence score.
2. The computer-implemented method of claim 1, wherein extracting the MFCC from the sample further comprises extracting an audio feature from the sample.
3. The computer-implemented method of claim 1, wherein the machine learning model was trained using an audio file database, comprising running the machine learning model across the audio file database to identify patterns from the database that match the emotion, and storing those patterns for future predictions.
4. The computer-implemented method of claim 1, wherein extracting the MFCC from the sample further comprises framing the sample into short frames.
5. The computer-implemented method of claim 4, further comprising calculating a periodogram estimate of a power spectrum for each frame.
6. The computer-implemented method of claim 5, further comprising generating DCT coefficients and using the DCT coefficients in the machine learning model.
7. The computer-implemented method of claim 1, wherein modifying the script of the call center agent ...
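Claims 4-6 spell out the classic MFCC recipe: short frames, a periodogram power-spectrum estimate, a mel filter bank, and DCT coefficients. A self-contained NumPy sketch of that pipeline (frame sizes, FFT length and filter count are conventional defaults, not values from the patent):

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(sr, n_fft=512, n_mels=26):
    # Triangular filters with centres equally spaced on the mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def mfcc(signal, sr, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    # Framing -> periodogram -> mel filter bank -> log -> DCT-II.
    frames = np.stack([signal[i:i + frame_len] * np.hamming(frame_len)
                       for i in range(0, len(signal) - frame_len, hop)])
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2 / frame_len   # periodogram estimate
    logmel = np.log(power @ mel_filter_bank(sr, 512, n_mels).T + 1e-10)
    k, n = np.arange(n_ceps)[:, None], np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))    # DCT-II basis
    return logmel @ dct_basis.T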

Подробнее
01-04-2021 дата публикации

AUTHENTICATION METHOD, AUTHENTICATION DEVICE, ELECTRONIC DEVICE AND STORAGE MEDIUM

Номер: US20210099303A1
Автор: Wang Ran
Принадлежит: BOE Technology Group Co., Ltd.

The present disclosure provides an authentication method, an authentication device, an electronic device and a storage medium. The authentication method includes: receiving target voice data; obtaining a first voiceprint feature parameter corresponding to the target voice data from a device voiceprint model library; performing a first encryption process on the first voiceprint feature parameter with a locally stored private key to generate to-be-verified data; transmitting the to-be-verified data to a server, so that the server uses a public key which matches the private key to decrypt the to-be-verified data to obtain the first voiceprint feature parameter, and performs authentication on the first voiceprint feature parameter to obtain an authentication result; receiving the authentication result returned by the server. 1. An authentication method applied to an electronic device , comprising:receiving target voice data;obtaining a first voiceprint feature parameter corresponding to the target voice data from a device voiceprint model library;performing a first encryption process on the first voiceprint feature parameter with a locally stored private key to generate to-be-verified data;transmitting the to-be-verified data to a server, so that the server uses a public key which matches the private key to decrypt the to-be-verified data to obtain the first voiceprint feature parameter, and performs authentication on the first voiceprint feature parameter to obtain an authentication result;receiving the authentication result returned by the server.2. The method according to claim 1 , wherein the obtaining a first voiceprint feature parameter corresponding to the target voice data from a device voiceprint model library claim 1 , includes:extracting a target voiceprint feature parameter in the target voice data;matching the target voiceprint feature parameter with a plurality of voiceprint feature parameters pre-stored in the device voiceprint model library, thereby ...

Подробнее
28-03-2019 дата публикации

Feature extraction of acoustic signals

Номер: US20190096389A1
Принадлежит: International Business Machines Corp

Embodiments of the present disclosure relate to a new approach for adaptively selecting an acoustic feature extractor in an Artificial Intelligence system. The method comprises: acquiring a frame of an acoustic signal; checking the status of a flag used to indicate which acoustic feature extractor to select, where a first status of the flag indicates a low-cost feature extractor associated with quasi-stationary acoustic signals and a second status of the flag indicates a high-cost feature extractor associated with non-stationary acoustic signals; processing the frame of the acoustic signal with the feature extractor indicated by the checked status; determining, based on data generated in the processing of the frame, the actual status of the frame; and updating the status of the flag according to the actual status.

Подробнее
13-04-2017 дата публикации

Sound Detection Method for Recognizing Hazard Situation

Номер: US20170103776A1
Принадлежит:

A method of detecting a particular abnormal sound in an environment with background noise is provided. The method includes acquiring a sound from a microphone, separating abnormal sounds from the input sound based on non-negative matrix factorization (NMF), extracting Mel-frequency cepstral coefficient (MFCC) parameters according to the separated abnormal sounds, calculating hidden Markov model (HMM) likelihoods according to the separated abnormal sounds, and comparing the likelihoods of the separated abnormal sounds with a reference value to determine whether or not an abnormal sound has occurred. According to the method, based on NMF, a sound to be detected is compared with ambient noise on a one-to-one basis and classified, so that the sound may be stably detected even in an actual environment with multiple noises.

1. A method of detecting a particular abnormal sound in an environment with mixed background noise, the method comprising:
acquiring a sound signal from a microphone;
separating abnormal sounds from the input sound signal through non-negative matrix factorization (NMF);
extracting Mel-frequency cepstral coefficient (MFCC) parameters according to the separated abnormal sounds;
calculating hidden Markov model (HMM) likelihoods according to the separated abnormal sounds; and
comparing the HMM likelihoods of the separated abnormal sounds with a reference value to determine whether or not an abnormal sound has occurred.
2. The method according to claim 1, wherein the separating of the abnormal sounds based on NMF comprises decomposing the input sound into a linear combination of several vectors using a background noise base and a plurality of abnormal sound bases and determining degrees of similarity with a pre-trained abnormal sound signal.
3. The method according to claim 2, wherein the background noise base and the plurality of abnormal sound bases are obtained through NMF training in an offline environment using corresponding signals.
4. The method according ...
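A minimal sketch of the NMF separation step under the setup claims 2-3 describe: basis matrices for background noise and abnormal sounds are pre-trained offline, and only the activations are estimated on the input spectrogram (multiplicative updates for the KL objective; all names are mine):

import numpy as np

def nmf_separate(V, W_noise, W_abnormal, n_iter=100, eps=1e-10):
    # V: magnitude spectrogram (freq x time); W_*: fixed, pre-trained bases.
    W = np.hstack([W_noise, W_abnormal])
    H = np.random.rand(W.shape[1], V.shape[1])        # activations to estimate
    for _ in range(n_iter):
        H *= (W.T @ (V / (W @ H + eps))) / (W.T.sum(axis=1, keepdims=True) + eps)
    k = W_noise.shape[1]
    return W_abnormal @ H[k:]                          # abnormal-sound reconstruction

The reconstruction is what the MFCC/HMM stage then scores against the reference value.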

Подробнее
02-06-2022 дата публикации

Clockwork Hierarchal Variational Encoder

Номер: US20220172705A1
Принадлежит: Google LLC

A method for providing a frame-based mel spectral representation of speech includes receiving a text utterance having at least one word, and selecting a mel spectral embedding for the text utterance. Each word in the text utterance has at least one syllable and each syllable has at least one phoneme. For each phoneme, using the selected mel spectral embedding, the method also includes: predicting a duration of the corresponding phoneme by encoding linguistic features of the corresponding phoneme with a corresponding syllable embedding for the syllable that includes the corresponding phoneme; and generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme. Each fixed-length predicted mel-frequency spectrogram frame represents mel-spectral information of the corresponding phoneme.

1. A computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations comprising:
receiving a text utterance having at least one word, each word having at least one syllable, each syllable having at least one phoneme;
selecting a mel spectral embedding for the text utterance; and
for each phoneme, using the selected mel spectral embedding:
predicting a duration of the corresponding phoneme based on corresponding linguistic features associated with the word that includes the corresponding phoneme and corresponding linguistic features associated with the syllable that includes the corresponding phoneme; and
generating a plurality of fixed-length predicted mel-frequency spectrogram frames based on the predicted duration for the corresponding phoneme, each fixed-length predicted mel-frequency spectrogram frame representing mel-spectral information of the corresponding phoneme.
2. The computer-implemented method of claim 1, wherein, for each phoneme, using the selected mel spectral embedding, predicting the duration ...
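The duration-to-frames step is essentially a length regulator: each phoneme's encoding is given one slot per predicted fixed-length frame. A short NumPy sketch (the decoder that would turn each slot into an 80-bin mel frame is omitted, and the shapes are illustrative):

import numpy as np

def regulate_length(phoneme_encodings, predicted_durations):
    # Repeat each phoneme encoding once per predicted mel-spectrogram frame.
    return np.repeat(phoneme_encodings, predicted_durations, axis=0)

# e.g. three phonemes predicted to span 4, 7 and 5 frames:
slots = regulate_length(np.random.randn(3, 256), [4, 7, 5])   # shape (16, 256)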

Подробнее
19-04-2018 дата публикации

Vehicle Ambient Audio Classification Via Neural Network Machine Learning

Номер: US20180108369A1
Автор: Gross Ethan
Принадлежит:

A method and an apparatus for detecting and classifying sounds around a vehicle via neural network machine learning are described. The method involves an audio recognition system that may determine the origin of the sounds being inside or outside of a vehicle and classify the sounds into different categories such as adult, child, or animal sounds. The audio recognition system may communicate with a plurality of sensors in and around the vehicle to obtain information of conditions of the vehicle. Based on information of the sounds and conditions of the vehicles, the audio recognition system may determine whether an occupant or the vehicle is at risk and send alert messages or issue warning signals. 1. A method , comprising:receiving audible information from one or more microphones;receiving vehicle information of one or more conditions of a vehicle from one or more sensors;determining whether the vehicle is at risk of theft or whether an occupant of the vehicle is at risk of danger based on the vehicle information and the audible information; andtriggering one or more actions upon determining that the occupant or the vehicle is at risk.2. The method of claim 1 , wherein the receiving of the audible information from the one or more microphones comprises determining whether the occupant is inside the vehicle based on information of a first neural network and a second neural network.3. The method of claim 2 , wherein the determining of whether the occupant is inside the vehicle based on information of the first neural network and the second neural network by performing operations comprising:detecting a plurality of sounds in and around the vehicle;recording the sounds into a plurality of audio files;determining whether the sounds are originated from inside or outside of the vehicle based on information of the first neural network; andclassifying the sounds into a plurality of categories based on information of the second neural network.4. The method of claim 3 , wherein ...

Подробнее
02-04-2020 дата публикации

Translation processing method, translation processing device, and device

Номер: US20200104372A1

The present disclosure provides a translation processing method, a translation processing device, and a device. A first speech signal of a first language is obtained, and a speech feature vector of the first speech signal is extracted based on a preset algorithm. Further, the speech feature vector is input into a pre-trained end-to-end translation model for converting speech of the first language into text of a second language, and text information of the second language corresponding to the first speech signal is obtained. Moreover, speech synthesis is performed on the text information of the second language, and a corresponding second speech signal is obtained and played.

Подробнее
02-04-2020 дата публикации

METHODS AND SYSTEMS FOR MANAGING CHATBOTS WITH DATA ACCESS

Номер: US20200105257A1

Embodiments for managing a chatbot by one or more processors are described. A communication from an individual is received. At least one data source associated with the individual is selected based on the received communication. A response to the received communication is generated based on the at least one selected data source. 1. A method , by one or more processors , for managing a chatbot comprising:receiving a communication from an individual;selecting at least one data source associated with the individual based on the received communication; andgenerating a response to the received communication based on the at least one selected data source.2. The method of claim 1 , wherein the generated response is indicative of at least one of a request for authorization to access the at least one selected data source and a request for information associated with the received communication.3. The method of claim 1 , wherein the generated response is indicative of a request for authorization to access the at least one selected data source claim 1 , and further comprising:receiving an indication of an authorization from the individual to access the at least one selected data source;analyzing the at least one selected data source based on the received communication; andgenerating a second response to the received communication based on said analysis of the at least one selected data source.4. The method of claim 1 , wherein the received communication is at least one of a voice communication or a text-based communication.5. The method of claim 1 , wherein at least one of the selecting of the at least one data source and the generating of the response is performed utilizing an analysis of the received communication claim 1 , wherein the analysis is performed utilizing at least one of natural language processing (NLP) and a Mel-frequency cepstral coefficients (MFCC) algorithm.6. The method of claim 1 , wherein the received communication includes a query.7. The method of claim 1 ...

Подробнее
02-04-2020 дата публикации

METHODS AND SYSTEMS FOR SUPPRESSING VOCAL TRACKS

Номер: US20200105286A1
Принадлежит:

The methods and systems described herein aid users by modifying the presentation of content to users. For example, the methods and systems suppress the dialogue track of a movie when the user engages with the content by reciting a line of the movie as it is presented to the user. Words spoken by the user are detected and compared with the words in the movie. When the user is not engaging with the movie by reciting the lines or humming tunes while watching the movie, the audio track of the movie is not modified. Content can be modified in response to engagement by a single user or by multiple users (e.g., each reciting lines of a different character in a movie). Accordingly, the methods and systems described herein provide increased interest in and engagement with content.

1. A method for suppressing vocal tracks in content upon detection of corresponding words, the method comprising:
detecting, during output of content, an utterance, wherein the content comprises a vocal track and at least one additional audio track;
determining at least one first word in the detected utterance;
determining at least one second word included in a portion of the content that was output at a time when the at least one first word was uttered;
comparing the at least one first word with the at least one second word;
determining, based on the comparing, that the at least one first word matches the at least one second word; and
in response to determining that the at least one first word matches the at least one second word, suppressing output of the vocal track of the content.
2. The method of claim 1, further comprising, subsequent to determining that the at least one first ...:
determining at least one additional word included in a portion of the content that was output;
comparing the at least one additional word with the utterance;
determining, based on comparing the at least one additional word with the utterance, that the at least one additional word does not match the utterance; and ...

Подробнее
11-04-2019 дата публикации

AUDIO PROCESSING FOR VOICE SIMULATED NOISE EFFECTS

Номер: US20190109804A1
Принадлежит:

Systems and methods may be used to process and output information related to a non-speech vocalization, for example from a user attempting to mimic a non-speech sound. A method may include determining a mimic quality value associated with an audio file by comparing a non-speech vocalization to a prerecorded audio file. For example, the method may include determining an edit distance between the non-speech vocalization and the prerecorded audio file. The method may include assigning a mimic quality value to the audio file based on the edit distance. The method may include outputting the mimic quality value.

1. A device comprising:
a display to provide a user interface for interacting with a social bot;
memory; and
a processor in communication with the memory, the processor to:
provide an indication initiating an impression game within the user interface with the social bot, the indication indicating a non-speech sound to be mimicked;
receive an audio file or streamed audio including a non-speech vocalization from a user attempting to mimic the non-speech sound via the user interface;
determine a mimic quality value associated with the audio file or the streamed audio by comparing the non-speech vocalization to a prerecorded audio file in a database; and
output a response to the received audio file or the streamed audio from the social bot for display on the user interface based on the mimic quality value.
2. The device of claim 1, wherein the prerecorded audio file is a recording of the non-speech sound to be mimicked or a recording of a person mimicking the non-speech sound.
3. The device of claim 1, wherein the processor is further to provide a token via the user interface in response to the mimic quality value exceeding a threshold, the token used to unlock digital content.
4. The device of claim 1, wherein to determine the mimic quality value, the processor is further to determine whether the non-speech vocalization is within a ...

Подробнее
09-06-2022 дата публикации

LEARNABLE SPEED CONTROL OF SPEECH SYNTHESIS

Номер: US20220180856A1
Автор: YU Chengzhu, Yu Dong
Принадлежит: Tencent America LLC

A method, computer program, and computer system is provided for synthesizing speech at one or more speeds. A context associated with one or more phonemes corresponding to a speaking voice is encoded, and the one or more phonemes are aligned to one or more target acoustic frames based on the encoded context. One or more mel-spectrogram features are recursively generated from the aligned phonemes and target acoustic frames, and a voice sample corresponding to the speaking voice is synthesized using the generated mel-spectrogram features. 1. A method of synthesizing speech at one or more speeds , comprising:receiving, by a computer, a sequence of one or more phonemes, and outputting a sequence of one or more hidden states containing a sequential representation associated with the received sequence of phonemes;aligning, by the computer, the one or more phonemes to one or more target acoustic frames based on the encoded context, based on generating one or more frame-aligned hidden states according to a rate associated with each phoneme;recursively generating, by the computer, one or more mel-spectrogram features from the aligned phonemes and the target acoustic frames; andsynthesizing, by the computer, a voice sample at a given speed corresponding to the speaking voice using the generated mel-spectrogram features.2. The method of claim 1 , wherein the aligning the one or more phonemes to one or more target acoustic frames further comprises:concatenating the output sequence of one or more hidden states with information corresponding to the speaking voice; andapplying dimension reduction to the concatenated output sequence using a fully connected layer.3. The method of claim 2 , wherein the aligning the one or more phonemes to one or more target acoustic frames further comprises:expanding the dimension-reduced output sequence based on the rate associated with each phoneme; andaligning the expanded output sequence to the target acoustic frames.4. The method of claim 3 , ...

Подробнее
05-05-2016 дата публикации

METHOD AND SYSTEM FOR IDENTIFYING LOCATION ASSOCIATED WITH VOICE COMMAND TO CONTROL HOME APPLIANCE

Номер: US20160125880A1
Принадлежит:

The present invention relates to a method for controlling a home appliance located in an assigned room with voice commands in a home environment. The method comprises the steps of: receiving a voice command by a user; recording the received voice command; sampling the recorded voice command and extracting features from the recorded voice command; determining a room label by comparing the extracted features of the voice command with feature references, wherein the room label is associated with the feature references; assigning the room label to the voice command; and controlling the home appliance located in the assigned room in accordance with the voice command.

1.-8. (canceled)
9. A method for controlling an appliance located in an environment corresponding to a voice command, the method comprising the steps of:
recording a received voice command by a user;
sampling the recorded voice command and extracting features from the recorded voice command, the features including voice-related features and non-voice-related features; and
controlling the appliance located in the environment corresponding to an assigned environment label which is associated with feature references, wherein the environment label is assigned to the voice command by comparing the features extracted from the voice command with the feature references, and the feature references are accumulated by the sampling.
10. The method according to claim 9, wherein the feature references are accumulated by sampling that includes a training phase.
11. The method according to claim 9, wherein the step of determining the environment label is performed on the basis of a K-nearest neighbor algorithm.
12. The method according to claim 9, wherein the voice features are MFCCs (Mel-Frequency Cepstral Coefficients) and a reverberation effect coefficient, and the non-voice feature is the time when the voice command is recorded.
13. A system for controlling an appliance located in an environment corresponding to a voice command, the system ...
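Claim 11 names a K-nearest neighbor rule. A compact sketch of how a room label could be assigned from a feature vector (MFCCs, reverberation coefficient, time of day) against the accumulated references; the names and the value of k are illustrative:

import numpy as np

def assign_room_label(features, ref_features, ref_labels, k=5):
    # Majority vote among the k closest reference vectors.
    dists = np.linalg.norm(ref_features - features, axis=1)
    nearest = ref_labels[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]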

Подробнее
16-04-2020 дата публикации

SPEAKING CLASSIFICATION USING AUDIO-VISUAL DATA

Номер: US20200117887A1
Принадлежит:

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating predictions for whether a target person is speaking during a portion of a video. In one aspect, a method includes obtaining one or more images which each depict a mouth of a given person at a respective time point. The images are processed using an image embedding neural network to generate a latent representation of the images. Audio data corresponding to the images is processed using an audio embedding neural network to generate a latent representation of the audio data. The latent representation of the images and the latent representation of the audio data is processed using a recurrent neural network to generate a prediction for whether the given person is speaking. 1. A method performed by one or more data processing apparatus , the method comprising:obtaining one or more images which each depict a mouth of a given person at a respective time point, wherein each of the respective time points are different;processing the one or more images using an image embedding neural network to generate a latent representation of the one or more images;obtaining audio data corresponding to the one or more images;processing a representation of the audio data using an audio embedding neural network to generate a latent representation of the audio data; andprocessing the latent representation of the one or more images and the latent representation of the audio data using a recurrent neural network to generate an output defining a prediction for whether the given person is speaking at one or more of the respective time points;wherein the image embedding neural network, the audio embedding neural network, and the recurrent neural network are trained by an end-to-end optimization procedure.2. The method of claim 1 , wherein obtaining one or more images which each depict a mouth of a given person at a respective time point comprises:obtaining one or more video frames ...

Подробнее
25-08-2022 дата публикации

ALWAYS-ON WAKE ON MULTI-DIMENSIONAL PATTERN DETECTION (WOMPD) FROM A SENSOR FUSION CIRCUITRY

Номер: US20220270592A1
Принадлежит:

A device wake-up system has one or more sensors each receptive to an external input. The respective external inputs are translatable to corresponding signals. One or more feature extractors connected to a respective one of the one or more sensors are receptive to the signals outputted from the sensors, and the feature data is associated with the signals being generated by the corresponding one of the one or more feature extractors. One or more inference circuits are connected to a respective one of the one or more feature extractors, and inference decisions are generated from patterns of the feature data generated by a corresponding one of the one or more feature extractors. A decision combiner is connected to each of the one or more inference circuits, and a wake signal is be generated based upon an aggregate of the inference decisions provided by the one or more inference circuits. 1. A device wake-up system comprising:one or more sensors each receptive to an external input, the respective external inputs being translatable to corresponding signals;one or more feature extractors connected to a respective one of the one or more sensors and receptive to the signals outputted therefrom, feature data associated with the signals being generated by the corresponding one of the one or more feature extractors;one or more inference circuits connected to a respective one of the one or more feature extractors, inference decisions being generated from patterns of the feature data generated by a corresponding one of the one or more feature extractors; anda decision combiner connected to each of the one or more inference circuits, a wake signal being generated based upon an aggregate of the inference decisions provided by the one or more inference circuits.2. The system of claim 1 , wherein the wake signal is output to an application processor.3. The system of claim 1 , wherein one of the one or more sensors is a microphone and the external input is an audio wave.4. The system ...

Подробнее
25-04-2019 дата публикации

DIGITAL ASSISTANT PROVIDING WHISPERED SPEECH

Номер: US20190122666A1
Принадлежит:

Systems and processes for detecting and/or providing a whispered speech response are provided. In one example process, speech is received from a user and, based on the speech input, it is determined that a whispered speech response is to be provided. Upon determining that a whispered speech response is to be provided, the whispered speech response is generated and provided to the user.

1.-26. (canceled)
27. A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of an electronic device, cause the electronic device to:
receive a speech input from a user;
determine whether providing a whispered speech response is disabled;
in accordance with a determination that providing the whispered speech response is not disabled: determine, based on the speech input, that a whispered speech response is to be provided; upon determining that a whispered speech response is to be provided, generate the whispered speech response; and provide the whispered speech response to the user; and
in accordance with a determination that providing the whispered speech response is disabled: generate a non-whispered speech response; and provide the non-whispered speech response to the user in lieu of the whispered speech response.
28. The non-transitory computer-readable storage medium of claim 27, wherein the speech input comprises at least one of an informational request or a request to perform a task.
29. The non-transitory computer-readable storage medium of claim 28, wherein the whispered speech response comprises at least one of a response to the informational request or a response associated with performing the task.
30. The non-transitory computer-readable storage medium of claim 27, wherein determining whether providing the whispered speech response is disabled is based on a time of day.
31. The non-transitory computer-readable storage medium of claim 27, wherein ...

Подробнее
27-05-2021 дата публикации

METHOD AND APPARATUS FOR VOICE INTERACTION, DEVICE AND COMPUTER READABLE STORATE MEDIUM

Номер: US20210158816A1
Принадлежит:

A method, apparatus, device, and storage medium for voice interaction. A specific embodiment of the method includes: extracting an acoustic feature from received voice data, the acoustic feature indicating a short-term amplitude spectrum characteristic of the voice data; applying the acoustic feature to a type recognition model to determine an intention type of the voice data, the intention type being one of an interaction intention type and a non-interaction intention type, and the type recognition model being constructed based on the acoustic feature of training voice data; and performing an interaction operation indicated by the voice data, based on determining that the intention type is the interaction intention type. 1. A method for voice interaction , the method comprising:extracting an acoustic feature from received voice data, the acoustic feature indicating a short-term amplitude spectrum characteristic of the voice data;applying the acoustic feature to a type recognition model to determine an intention type of the voice data, the intention type being one of an interaction intention type and a non-interaction intention type, and the type recognition model being constructed based on acoustic feature of training voice data; andperforming, based on determining that the intention type is the interaction intention type, an interaction operation indicated by the voice data.2. The method according to claim 1 , further comprising:labeling the training voice data, the labeled training voice data being positive training voice data indicating an interaction intention or negative training voice data indicating a non-interaction intention; andconstructing the type recognition model by using the labeled training voice data.3. The method according to claim 2 , wherein the labeling the training voice data comprises:labeling the training voice data as the positive training voice data, based on determining at least one of:semantics of the training voice data is correctly ...

Подробнее
27-05-2021 дата публикации

Sound event detection learning

Номер: US20210158837A1
Принадлежит: Qualcomm Inc

A device includes a processor configured to receive audio data samples and provide the audio data samples to a first neural network to generate a first output corresponding to a first set of sound classes. The processor is further configured to provide the audio data samples to a second neural network to generate a second output corresponding to a second set of sound classes. A second count of classes of the second set of sound classes is greater than a first count of classes of the first set of sound classes. The processor is also configured to provide the first output to a neural adapter to generate a third output corresponding to the second set of sound classes. The processor is further configured to provide the second output and the third output to a merger adapter to generate sound event identification data based on the audio data samples.

Подробнее
16-04-2020 дата публикации

CALL RECORDING SYSTEM FOR AUTOMATICALLY STORING A CALL CANDIDATE AND CALL RECORDING METHOD

Номер: US20200120207A1
Принадлежит: i2x GmbH

Embodiments of the present disclosure describe a call recording system and a call recording method for automatically recording, i.e. storing, a call candidate when an active call is detected. The call recording system comprises a sound receiver to receive sound data and to convert sound data to audio representations of sound, a buffer to buffer the audio representations of sound for a predetermined time duration, a call candidate determination unit to determine if the buffered audio representations comprise a call candidate, a call analyzer to analyze the call candidate, wherein the call analyzer determines if the call candidate is a call to be stored, and a storage to store the call candidate as a call. Hence, a reliable system can be provided for automatically storing a call.

1. A call recording system for automatically storing a call, the call recording system comprising:
a sound receiver configured to receive sound data and to convert sound data to audio representations of sound;
a buffer configured to buffer the audio representations of sound for a predetermined time duration;
a call candidate determination unit configured to determine if the buffered audio representations comprise a call candidate, wherein the buffered audio representations comprise a call candidate if a characteristic of a buffered audio representation exceeds a first predetermined threshold;
a call analyzer configured to analyze the call candidate, wherein the call analyzer is configured to output a value of the call candidate and to determine from the audio representations of the call candidate if the output value corresponding to the call candidate exceeds a second predetermined threshold; and
a storage configured to store the call candidate as a call if the value of the call candidate exceeds the second predetermined threshold.
2. The call recording system of claim 1, wherein the call analyzer comprises a trained Machine Learning (ML) model configured to output a probability as the value of ...
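A rough sketch of the buffer-then-decide flow in claim 1 (the frame rate, both thresholds, and the hooks score_call and store_call are all hypothetical placeholders for the trained ML model and the storage back end):

import collections
import numpy as np

BUFFER_FRAMES = 30 * 100                      # assumed: 30 s at 100 frames/s
buffer = collections.deque(maxlen=BUFFER_FRAMES)

def on_audio_frame(frame, first_threshold=0.01, second_threshold=0.5):
    # Keep a rolling window of audio representations; flag a call candidate
    # when frame energy exceeds the first threshold, then let the model decide.
    buffer.append(frame)
    if np.mean(frame ** 2) > first_threshold:            # call candidate?
        candidate = np.concatenate(buffer)
        if score_call(candidate) > second_threshold:     # hypothetical ML model
            store_call(candidate)                        # hypothetical storage hook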

Подробнее
01-09-2022 дата публикации

IMPRESSION ESTIMATION APPARATUS, LEARNING APPARATUS, METHODS AND PROGRAMS FOR THE SAME

Номер: US20220277761A1

An impression estimation technique without the need for voice recognition is provided. An impression estimation device includes an estimation unit configured to estimate an impression of a voice signal s by defining p ...

Подробнее

02-05-2019 дата публикации

Processing of speech signals

Номер: US20190130932A1
Принадлежит: International Business Machines Corp

A method for processing a speech signal. The method comprises obtaining a logmel feature of a speech signal. The method further includes one or more processors processing the logmel feature so that the logmel feature is normalized under a constraint that a power level of the logmel feature is kept as originally obtained. The method further includes inputting the processed logmel feature into a speech-to-text system to generate corresponding text data.
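The abstract leaves the normalization unspecified; one plausible reading of the constraint (my interpretation, not the patent's definition) removes per-band means while restoring each frame's original average log-power:

import numpy as np

def normalize_keep_power(logmel):
    # logmel: (frames, bands). Zero-mean each band over time, then re-centre
    # every frame so its mean log-power is kept as originally obtained.
    frame_power = logmel.mean(axis=1, keepdims=True)
    normalized = logmel - logmel.mean(axis=0, keepdims=True)
    return normalized - normalized.mean(axis=1, keepdims=True) + frame_power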

Подробнее
23-04-2020 дата публикации

ROBUST START-END POINT DETECTION ALGORITHM USING NEURAL NETWORK

Номер: US20200126556A1
Принадлежит:

An end detector configured to receive the feature data and detect an end point of a keyword, and a start detector configured to receive an indication of the detected end point and process the feature data associated with corresponding input frames to detect a start point of the keyword. The start detector and end detector comprise neural networks trained through a process using a cross-entropy cost function for non-Region of Target (ROT) frames and a One-Spike Connectionist Temporal Classification cost function for ROT frames. 1. A method comprising:receiving an audio input stream including a plurality of audio frames;extracting from each audio frame features representative of the audio frame;detecting an end point in the extracted features using a first neural network;providing an indication to a second neural network when an end point is detected; anddetecting a start point in the extracted features from audio frames preceding the detected end point in the audio input stream using the second neural network.2. The method of claim 1 , wherein extracting from each audio frame features representative of the audio frame includes generating Mel-frequency cepstral coefficients (MFCCs) for each frame of the audio input stream.3. The method of claim 1 , further comprising training the first neural network for end point detection.4. The method of claim 1 , further comprising training the second neural network for start point detection.5. The method of claim 1 , further comprising receiving claim 1 , at a computing device claim 1 , a stream of training data including a plurality of input samples having segmented labeled data;computing, by the first neural network, a network output for each input sample in a forward pass through the training data; andupdating, by the first neural network, weights and biases through a backward pass through the first neural network, including determining whether an input frame is in a Region of Target (ROT).6. The method of claim 5 , further ...

More
18-05-2017 publication date

ADAPTIVE VOICE AUTHENTICATION SYSTEM AND METHOD

Number: US20170140760A1
Author: SACHDEV Umesh
Assignee:

An adaptive voice authentication system is provided. The adaptive voice authentication system includes an adaptive module configured to compare a feature quality index of the plurality of authentication features and the plurality of enrolment features and dynamically replace and store one or more enrolment features with one or more authentication features to form a plurality of updated enrolment features. The adaptive module is configured to generate an updated enrolment voice print model from the plurality of the updated enrolment features. The adaptive module is further configured to compare the updated enrolment voice print model with the previously stored enrolment voice print model and dynamically update the previously stored enrolment voice print model with the updated enrolment voice print model based on a model quality index.

1. An adaptive voice authentication system comprising:
a feature extractor configured to receive a user's enrolment voice sample and a user's authentication voice sample and configured to extract a plurality of enrolment features from the user's enrolment voice sample and a plurality of authentication features from the user's authentication voice sample, wherein the user's enrolment voice sample is an initial voice sample and the user's authentication voice sample is a plurality of subsequent voice samples;
a voice print model generator configured to generate an enrolment voice print model from the plurality of enrolment features and an authentication voice print model from the plurality of authentication features;
an authentication module configured to receive the authentication voice print model and authenticate the user based on the enrolment voice print model;
a storage module configured to store the plurality of enrolment features, the plurality of authentication features, the enrolment voice print model and the authentication voice print model; and
compare a feature quality index of the plurality of authentication features and ...

More
08-09-2022 publication date

DETECTION AND CLASSIFICATION OF SIREN SIGNALS AND LOCALIZATION OF SIREN SIGNAL SOURCES

Number: US20220284919A1
Assignee:

In an embodiment, a method comprises: capturing, by one or more microphone arrays of a vehicle, sound signals in an environment; extracting frequency spectrum features from the sound signals; predicting, using an acoustic scene classifier and the frequency spectrum features, one or more siren signal classifications; converting the one or more siren signal classifications into one or more siren signal event detections; computing time delay of arrival estimates for the one or more detected siren signals; estimating one or more bearing angles to one or more sources of the one or more detected siren signals using the time delay of arrival estimates and a known geometry of the microphone array; and tracking, using a Bayesian filter, the one or more bearing angles. If a siren is detected, actions are performed by the vehicle depending on the location of the emergency vehicle and whether the emergency vehicle is active or inactive.

1. A method comprising:
capturing, by one or more microphone arrays of a vehicle, sound signals in an environment;
extracting, using one or more processors, frequency spectrum features from the sound signals;
predicting, using an acoustic scene classifier and the frequency spectrum features, one or more siren signal classifications, wherein the acoustic scene classifier predicts labels that indicate the presence of one or more of a plurality of different types of siren signals;
converting, using the one or more processors, the one or more siren signal classifications into one or more siren signal event detections;
computing time delay of arrival estimates for the one or more detected siren signals;
estimating, using the one or more processors, one or more bearing angles to one or more sources of the one or more detected siren signals using the time delay of arrival estimates and a known geometry of the microphone array; and
tracking, using a Bayesian filter, the one or more bearing angles.

2. The method of claim 1, wherein the time delay of arrival ...
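
The TDOA-and-bearing steps can be sketched for a single microphone pair. GCC-PHAT is one standard delay estimator (the claims only say "time delay of arrival estimates"), and the 0.2 m spacing, far-field geometry, and 343 m/s sound speed below are assumptions:

```python
# Hedged sketch: estimate the inter-microphone delay with GCC-PHAT, then
# convert it to a bearing angle using the known pair geometry (far-field).
import numpy as np

def gcc_phat_delay(x, y, sr):
    n = 2 * len(x)
    X, Y = np.fft.rfft(x, n), np.fft.rfft(y, n)
    R = X * np.conj(Y)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)
    cc = np.concatenate((cc[-len(x) + 1:], cc[:len(x)]))  # lags -(N-1)..N-1
    lag = np.argmax(cc) - (len(x) - 1)
    return lag / sr

def bearing(tdoa, d, c=343.0):
    # Far-field model: tdoa = d * cos(theta) / c  ->  theta in radians.
    return np.arccos(np.clip(tdoa * c / d, -1.0, 1.0))

sr, d = 16000, 0.2                          # assumed rate and mic spacing
sig = np.random.randn(sr)
delayed = np.roll(sig, 5)                   # 5-sample delay between mics
tau = gcc_phat_delay(delayed, sig, sr)
print(tau, np.degrees(bearing(tau, d)))     # ~3.1e-4 s, ~58 degrees
```

The per-frame bearings would then feed the Bayesian tracking filter named in the last step.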

More
10-06-2021 publication date

TEXT-BASED SPEECH SYNTHESIS METHOD, COMPUTER DEVICE, AND NON-TRANSITORY COMPUTER-READABLE STORAGE MEDIUM

Number: US20210174781A1
Assignee: PING AN TECHNOLOGY (SHENZHEN) CO., LTD.

A text-based speech synthesis method, a computer device, and a non-transitory computer-readable storage medium are provided. The text-based speech synthesis method includes: a target text to be recognized is obtained; each character in the target text is discretely characterized to generate a feature vector corresponding to each character; the feature vector is input into a pre-trained spectrum conversion model, to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and the Mel-spectrum is converted to speech to obtain speech corresponding to the target text.

1. A text-based speech synthesis method, comprising:
obtaining target text to be recognized;
discretely characterizing each character in the target text to generate a feature vector corresponding to each character;
inputting the feature vector into a pre-trained spectrum conversion model to obtain a Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model; and
converting the Mel-spectrum into speech to obtain speech corresponding to the target text.

2. The method as claimed in claim 1, further comprising, before inputting the feature vector into the pre-trained spectrum conversion model to obtain the Mel-spectrum corresponding to each character in the target text output by the spectrum conversion model:
obtaining a preset number of training text and matching speech corresponding to the training text;
discretely characterizing the training text to obtain a feature vector corresponding to each character in the training text;
inputting the feature vector corresponding to each character in the training text into a spectrum conversion model to be trained to obtain a Mel-spectrum output by the spectrum conversion model to be trained; and
when an error between the Mel-spectrum output by the spectrum conversion model to be trained and a Mel-spectrum corresponding to the matching speech is less than or equal to a preset ...
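
The "discretely characterizing each character" step can be read as mapping each character to a discrete feature vector; a one-hot encoding over an assumed alphabet is the simplest illustration (the patent does not specify its vocabulary or encoding):

```python
# Hedged sketch: one-hot character featurization over an assumed alphabet.
# The resulting (chars, alphabet) matrix is what a spectrum conversion
# model could consume; the alphabet below is purely illustrative.
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz ,.")
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def characterize(text: str) -> np.ndarray:
    vecs = np.zeros((len(text), len(ALPHABET)))
    for t, ch in enumerate(text.lower()):
        if ch in INDEX:
            vecs[t, INDEX[ch]] = 1.0          # discrete one-hot feature
    return vecs

print(characterize("Hello, world.").shape)    # (13, 29)
```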

More
02-06-2016 publication date

METHOD FOR IMPROVING ACOUSTIC MODEL, COMPUTER FOR IMPROVING ACOUSTIC MODEL AND COMPUTER PROGRAM THEREOF

Number: US20160155438A1
Assignee:

Embodiments include methods and systems for improving an acoustic model. Aspects include acquiring a first standard deviation value by calculating standard deviation of a feature from first training data and acquiring a second standard deviation value by calculating standard deviation of a feature from second training data acquired in a different environment from an environment of the first training data. Aspects also include creating a feature adapted to an environment where the first training data is recorded, by multiplying the feature acquired from the second training data by a ratio obtained by dividing the first standard deviation value by the second standard deviation value. Aspects further include reconstructing an acoustic model constructed using training data acquired in the same environment as the environment of the first training data using the feature adapted to the environment where the first training data is recorded.

1.–8. (canceled)

9. A computer for improving an acoustic model, comprising:
a standard deviation value calculating unit for calculating standard deviation of a first feature from first training data to acquire a first standard deviation value and calculating standard deviation of a second feature from second training data acquired in a different environment from an environment of the first training data to acquire a second standard deviation value;
a feature creating unit for creating a modified feature adapted to the environment where the first training data is recorded, by multiplying the second feature acquired from the second training data by a ratio obtained by dividing the first standard deviation value by the second standard deviation value; and
an acoustic model reconstructing unit for reconstructing an acoustic model constructed using training data acquired in the same environment as the environment of the first training data, using the modified ...
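
The adaptation rule itself is a per-dimension rescaling: multiply the out-of-environment feature by σ₁/σ₂ so its spread matches the target environment. A minimal sketch with synthetic data (the environments, sizes, and dimensions below are illustrative):

```python
# Direct sketch of the abstract's rule: scale second-environment features
# by sigma1/sigma2 so their standard deviation matches the first
# (target) environment before rebuilding the acoustic model.
import numpy as np

rng = np.random.default_rng(0)
feat_env1 = rng.normal(0.0, 2.0, size=(200, 13))    # small in-domain set
feat_env2 = rng.normal(0.0, 5.0, size=(5000, 13))   # large out-of-domain set

sigma1 = feat_env1.std(axis=0)
sigma2 = feat_env2.std(axis=0)
adapted = feat_env2 * (sigma1 / sigma2)              # feature "moved" to env 1

print(adapted.std(axis=0).round(2))                  # ~2.0 per dimension
```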

More
01-06-2017 publication date

METHOD AND ELECTRONIC DEVICE FOR VOICE RECOGNITION BASED ON DYNAMIC VOICE MODEL SELECTION

Number: US20170154640A1
Author: WANG Yongqing
Assignee:

The embodiments of the present disclosure provide a method and a device for voice recognition based on dynamic voice model selection. The method includes: obtaining a first voice packet of a voice to be detected and extracting the basic frequency of the first voice packet, wherein the basic frequency is the vibration frequency of a vocal cord; classifying the sources of the voice to be detected according to the basic frequency and selecting a pre-trained voice model in a corresponding category; and performing front-end processing on the voice to be detected to obtain the values of the characteristic parameters of the voice to be detected, and matching the processed voice to be detected with the voice model and scoring, thus obtaining a voice recognition result.

1. A method for voice recognition based on dynamic voice model selection, comprising the following steps:
obtaining a first voice packet of a voice to be detected and extracting the basic frequency of the first voice packet, wherein the basic frequency is the vibration frequency of a vocal cord;
classifying the sources of the voice to be detected according to the basic frequency and selecting a pre-trained voice model in a corresponding category; and
performing front-end processing on the voice to be detected to obtain the values of the characteristic parameters of the voice to be detected, and matching the processed voice to be detected with the voice model and scoring, thus obtaining a voice recognition result.

2. The method according to claim 1, wherein the obtaining the first voice packet of the voice to be detected further comprises:
performing voice activity detection on the voice to be detected to obtain the initial point of the voice to be detected; and
serving a voice signal with a certain time range after the initial point as the first voice packet.

3. The method according to claim 2, wherein the serving the voice signal with the certain time range after the initial ...
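
The "basic frequency" (vocal-cord vibration rate, i.e. F0) of the first voice packet can be estimated by autocorrelation, one common estimator; the patent does not name its method, and the 60–400 Hz search range below is an assumption:

```python
# Hedged sketch: autocorrelation-based F0 estimate of the first voice
# packet. The search band (60-400 Hz) is an assumed speaking range.
import numpy as np

def estimate_f0(packet, sr=16000, fmin=60.0, fmax=400.0):
    packet = packet - packet.mean()
    ac = np.correlate(packet, packet, mode="full")[len(packet) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)    # lag range for the band
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(sr // 2) / sr
packet = np.sin(2 * np.pi * 120.0 * t)         # synthetic 120 Hz "voice"
print(round(estimate_f0(packet, sr), 1))       # ~120
```

The resulting F0 can then be compared against per-category ranges to select the matching pre-trained voice model.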

More
17-06-2021 publication date

Keyword Spotting Using Machine Learning

Number: US20210183380A1
Assignee: Silicon Laboratories Inc

A system and method of keyword spotting using two neural networks is disclosed. The system is in sleep mode most of the time, and wakes up periodically. Upon waking, a limited duration of audio is examined. This may be performed using an auxiliary neural network. If any audio activity is detected in this duration, the system fully wakes and examines a longer duration of audio for keywords. The keyword spotting is also performed by the main neural network, which may be a convolutional neural network (CNN).

More
22-09-2022 publication date

SYSTEM AND METHOD FOR GENERATING MULTILINGUAL TRANSCRIPT FROM MULTILINGUAL AUDIO INPUT

Number: US20220300719A1
Author: RAINA Vishay
Assignee:

The present disclosure relates to a system for generating a multilingual transcript from a multilingual audio input. The system includes a processor configured to: receive, from a source, a set of first signals pertaining to the multilingual audio input; extract, based on the set of first signals, one or more attributes of the multilingual audio input, and correspondingly generate a set of second signals; convert, based on the set of second signals, the multilingual audio input into a plurality of monolingual transcripts having a respective plurality of segments, where the plurality of segments is associated with a plurality of languages present in the multilingual audio input; and generate, using a machine learning technique, the multilingual transcript from the plurality of monolingual transcripts. The transcript comprises the one or more segments from each of the plurality of segments associated with the plurality of monolingual transcripts.

1. A system for generating a multilingual transcript from a multilingual audio input, the system comprising:
a processor being configured to execute a set of instructions stored in a memory, which ...
receive, from a source, a set of first signals pertaining to the multilingual audio input;
extract, based on the set of first signals, one or more attributes of the multilingual audio input, and correspondingly generate a set of second signals;
convert, based on the set of second signals, the multilingual audio input into a plurality of monolingual transcripts having respective plurality of segments, wherein the plurality of monolingual transcripts is associated with a plurality of languages present in the multilingual audio input; and
generate, using machine learning technique, the multilingual transcript corresponding to the plurality of monolingual transcripts, wherein the multilingual transcript comprises the one or more segments from each of the plurality of segments associated with the plurality of monolingual transcripts.

More
22-09-2022 publication date

Method of Contextual Speech Decoding from the Brain

Number: US20220301563A1
Assignee: UNIVERSITY OF CALIFORNIA

Provided are methods of contextual decoding and/or speech decoding from the brain of a subject. The methods include decoding neural or optical signals from the cortical region of an individual, extracting context-related features and/or speech-related features from the neural or optical signals, and decoding the context-related features and/or speech-related features from the neural or optical signals. Contextual decoding and speech decoding systems and devices for practicing the subject methods are also provided.

More
08-06-2017 publication date

METHOD AND APPARATUS FOR A LOW POWER VOICE TRIGGER DEVICE

Number: US20170162205A1

Aspects of the present disclosure involve a method for a voice trigger device that can be used to interrupt an externally connected system. The current disclosure also presents the architecture for the voice trigger device used for searching and matching an audio signature with a reference signature. In one embodiment a reverse matching mechanism is performed. In another embodiment, the reverse search and match operation is performed using an exponential normalization technique.

1. A method, comprising:
detecting receipt of an audio signal that is sampled into blocks;
determining a plurality of energy values of the sampled audio signal blocks;
performing energy binning of the plurality of energy values to determine whether speech is present in the sampled audio signal blocks;
determining that speech is present in the sampled audio signal blocks received;
storing a representation of the sampled audio signal block in a trigger buffer;
matching the representation of the sampled audio signal blocks stored in the trigger buffer to a training sequence stored in a training buffer in reverse order; and
enabling a wake up pin in a voice trigger device upon matching the representation to the training sequence.

2. The method of claim 1, wherein the audio signal is received by a mixed signal component in the voice trigger device.

3. The method of claim 1, wherein the sampled audio signal blocks are a trigger sequence.

4. The method of claim 1, wherein the plurality of energy values are determined using a frequency domain translated block of samples.

5. The method of claim 1, wherein the energy binning reduces a number of bins used to process each of the blocks of samples received.

6. The method of claim 1, wherein Mel-Frequency Cepstrum Coefficients (MFCCs) are determined using at least the energy binning.

7. The method of claim 6, wherein the MFCCs are exponentially normalized.

8. The method of claim 6, wherein the MFCCs are used at least in part to represent the training sequence.

9. ...
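
Two of the named steps can be illustrated directly: energy binning (claim 5: reduce the number of spectral bins by grouping) and an exponential running normalization of the resulting coefficients. The latter is only one plausible reading of "exponentially normalized" (claim 7); the patent leaves the formula unspecified:

```python
# Hedged sketch: group frequency-domain energies into fewer bins, then
# apply an exponential running-mean normalization over blocks. The bin
# count and smoothing factor are assumptions.
import numpy as np

def energy_bin(power_spectrum, n_bins=16):
    groups = np.array_split(power_spectrum, n_bins)
    return np.array([g.sum() for g in groups])     # fewer bins per block

def exp_normalize(blocks, alpha=0.9):
    out, mean = [], np.zeros(blocks.shape[1])
    for b in blocks:
        mean = alpha * mean + (1.0 - alpha) * b    # exponential running mean
        out.append(b - mean)
    return np.array(out)

spec = np.abs(np.fft.rfft(np.random.randn(256))) ** 2
binned = energy_bin(spec)                          # 129 bins -> 16 energy bins
print(exp_normalize(np.tile(binned, (10, 1))).shape)   # (10, 16)
```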

More
14-05-2020 publication date

Speaker Identification with Ultra-Short Speech Segments for Far and Near Field Voice Assistance Applications

Number: US20200152206A1
Assignee: ROBERT BOSCH GMBH

A speaker recognition device includes a memory and a processor. The memory stores enrolled key phrase data corresponding to utterances of a key phrase by enrolled users, and text-dependent and text-independent acoustic speaker models of the enrolled users. The processor is operatively connected to the memory, and executes instructions to authenticate a speaker as an enrolled user, which includes detecting input key phrase data corresponding to a key phrase uttered by the speaker, computing text-dependent and text-independent scores for the speaker using speech models of the enrolled user, computing a confidence score, and authenticating or rejecting the speaker as the enrolled user based on whether the confidence score indicates that the input key phrase data corresponds to speech from the enrolled user.

More
24-06-2021 publication date

VIDEO CLASSIFICATION METHOD AND APPARATUS, COMPUTER DEVICE, AND STORAGE MEDIUM

Number: US20210192220A1
Author: QU Bing Xin, ZHENG Mao

Video classification accuracy can be improved by utilizing multiple features: classification is based on a combination of an image classification model, an audio classification model, and a textual description classification model. The image classification result is based on an image feature of the image frame; the audio classification result is based on an audio feature of the audio; and the textual description classification result is based on a text feature of the textual description information. A target classification result of the target video is determined based on the image classification result, the audio classification result, and the textual classification result.

1. A video classification method, performed by a computer device, the method comprising:
obtaining a target video;
classifying an image frame in the target video by using a first classification model, to obtain an image classification result, the first classification model being configured to perform classification based on an image feature of the image frame;
classifying an audio in the target video by using a second classification model, to obtain an audio classification result, the second classification model being configured to perform classification based on an audio feature of the audio;
classifying textual description information corresponding to the target video by using a third classification model, to obtain a textual classification result, the third classification model being configured to perform classification based on a text feature of the textual description information; and
determining a target classification result of the target video according to the image classification result, the audio classification result, and the textual classification result.

2. The method according to claim 1, wherein the image classification result comprises a first image classification result, and wherein the classifying an image frame in the target video by using a ...
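
The final step (deriving the target result from the three per-modality results) is a late-fusion decision. A weighted average of class probabilities is one common choice, sketched below; the weights are assumptions, since the patent does not publish its combination rule here:

```python
# Hedged sketch of late fusion: combine per-modality class probabilities
# into one target classification with assumed, illustrative weights.
import numpy as np

def fuse(image_probs, audio_probs, text_probs, w=(0.5, 0.25, 0.25)):
    p = w[0] * image_probs + w[1] * audio_probs + w[2] * text_probs
    return int(np.argmax(p)), p

img = np.array([0.7, 0.2, 0.1])   # image model: per-class probabilities
aud = np.array([0.4, 0.5, 0.1])   # audio model
txt = np.array([0.6, 0.3, 0.1])   # text-description model
label, probs = fuse(img, aud, txt)
print(label, probs.round(3))      # 0 [0.6 0.3 0.1]
```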

More
24-06-2021 publication date

SPEECH SYNTHESIS METHOD AND APPARATUS AND COMPUTER READABLE STORAGE MEDIUM USING THE SAME

Number: US20210193113A1
Assignee:

The present disclosure provides a speech synthesis method as well as an apparatus and a computer readable storage medium using the same. The method includes: obtaining a to-be-synthesized text, and extracting to-be-processed Mel spectrum features of the to-be-synthesized text through a preset speech feature extraction algorithm; inputting the to-be-processed Mel spectrum features into a preset ResUnet network model to obtain first intermediate features; performing an average pooling and a first down sampling on the to-be-processed Mel spectrum features to obtain second intermediate features; taking the second intermediate features and the first intermediate features output by the ResUnet network model as an input to perform a deconvolution and a first up sampling so as to obtain target Mel spectrum features corresponding to the to-be-processed Mel spectrum features; and converting the target Mel spectrum features into a target speech corresponding to the to-be-synthesized text.

1. A computer-implemented speech synthesis method, comprising steps of:
obtaining a to-be-synthesized text, and extracting one or more to-be-processed Mel spectrum features of the to-be-synthesized text through a preset speech feature extraction algorithm;
inputting the to-be-processed Mel spectrum features into a preset ResUnet network model to obtain one or more first intermediate features;
performing an average pooling and a first down sampling on the to-be-processed Mel spectrum features to obtain one or more second intermediate features;
taking the second intermediate features and the first intermediate features output by the ResUnet network model as an input to perform a deconvolution and a first up sampling so as to obtain one or more target Mel spectrum features corresponding to the to-be-processed Mel spectrum features; and
converting the target Mel spectrum features into a target speech corresponding to the to-be-synthesized text.

2. The method of claim 1, wherein the step of inputting ...

More
24-06-2021 publication date

REAL-TIME VOICE PHISHING DETECTION

Number: US20210193174A1
Assignee: Eduworks Corporation

Methods and systems are disclosed for detecting threats in voice communications such as telephone calls. Various voice phishing (vishing) detectors detect respective types of threats and can be used or activated individually or in various combinations. A tampering detector utilizes deep scattering spectra and shifted delta cepstra features to detect tampering in the form of voice conversion, speech synthesis, or splicing. A content detector predicts a likelihood that word patterns in an incoming voice signal are indicative of a vishing threat. A spoofing detector authenticates or repudiates a purported speaker based on comparison of voice profiles. The vishing detectors can be provided as an authentication service or embedded in communication equipment. Machine learning and signal processing aspects are disclosed, along with applications to mobile telephony and call centers.

1. A computer-implemented method of notifying a user of a vishing threat in an incoming voice signal received by the user, comprising:
(a) monitoring the incoming voice signal;
(b) determining a measure of likelihood that the incoming voice signal has been tampered with, by evaluating deep scattering spectra (DSS) features and shifted delta cepstra (SDC) features of the incoming voice signal;
(c) based at least partly on the measure of likelihood, issuing a real-time indication to the user that the incoming voice signal is the vishing threat.

2. The computer-implemented method of claim 1, wherein the measure of likelihood is a first measure of likelihood, and the method further comprises:
(d) determining a second measure of likelihood that one or more detected words in the incoming voice signal are indicative of the vishing threat;
wherein the issuing is further based at least partly on the second measure of likelihood.

3. The computer-implemented method of claim 2, wherein the computer-implemented method is performed on one or more servers as part of an authentication service configured to ...

More
29-09-2022 publication date

Voice conversion system and training method therefor

Number: US20220310063A1
Assignee:

The present disclosure proposes a voice conversion scheme for non-parallel corpus training, removing the dependence on parallel text and addressing the technical problem that voice conversion is difficult to achieve when resources and equipment are limited. A voice conversion system and a training method therefor are provided. Compared with the prior art, according to the embodiments of the present disclosure: a trained speaker-independent automatic speech recognition model can be used for any source speaker, that is, it is speaker independent; and bottleneck features of audio are more abstract than phonetic posteriorgram features, reflect the decoupling of spoken content from the timbre of the speaker, and at the same time are not closely bound to a phoneme class and are not in a clear one-to-one correspondence with it. In this way, the problem of inaccurate pronunciation caused by recognition errors in ASR is relieved to some extent. The pronunciation accuracy of audio obtained by performing voice conversion on the bottleneck features is clearly higher than that of a phonetic posteriorgram based method, while the timbre is not significantly different. By means of transfer learning, the dependence on training corpus can be greatly reduced.

1. A voice conversion system, comprising:
a speaker-independent automatic speech recognition model, comprising at least a bottleneck layer, configured to: convert a mel-scale frequency cepstral coefficients feature of an inputted source speech into a bottleneck feature of the source speech through the bottleneck layer, and output the bottleneck feature of the source speech to an Attention voice conversion network through the bottleneck layer;
where a training method for the speaker-independent automatic speech recognition model comprises:
inputting a number of a character encoding to which a word in a multi-speaker speech recognition training corpus is converted, together with a mel-scale frequency cepstral ...

More
21-05-2020 publication date

RECOGNITION SYSTEM AND RECOGNITION METHOD

Number: US20200160179A1
Assignee:

A recognition method includes: receiving a training voice or a training image; and extracting a plurality of voice features in the training voice, or extracting a plurality of image features in the training image; wherein when extracting the voice features, a specific number of voice parameters are generated according to the voice features, and the voice parameters are input into a deep neural network (DNN) to generate a recognition model. When extracting the image features, the specific number of image parameters are generated according to the image features, and the image parameters are input into the deep neural network to generate the recognition model.

1. A recognition system, comprising:
a voice receiving device, configured to receive a training voice;
a camera, configured to receive a training image; and
a first processor, configured to extract a plurality of voice features in the training voice, or extract a plurality of image features in the training image;
wherein when the first processor extracts the voice features, the first processor generates a specific number of voice parameters according to the voice features, and the voice parameters are input into a deep neural network (DNN) to generate a recognition model;
wherein when the first processor extracts the image features, the first processor generates the specific number of image parameters according to the image features, and the image parameters are input into the deep neural network to generate the recognition model.

2. The recognition system of claim 1, further comprising:
a second processor, configured to extract a plurality of newly added features of the newly added data, select the specific number of newly added features as a plurality of new parameters, and substitute the new parameters into the recognition model to recognize the new data and generate a prediction result.

3. The recognition system of claim 1, wherein the first processor performs a Mel-scale Frequency Cepstral Coefficients (MFCC) ...

More
21-05-2020 publication date

METHOD AND SYSTEM FOR GENERATING ADVANCED FEATURE DISCRIMINATION VECTORS FOR USE IN SPEECH RECOGNITION

Number: US20200160839A1
Author: Hone Brian, Short Kevin M.
Assignee:

A method of renormalizing high-resolution oscillator peaks, extracted from windowed samples of an audio signal, is disclosed. Feature vectors are generated for which variations in both fundamental frequency and time duration of speech are substantially mitigated. The feature vectors may be aligned within a common coordinate space, free of those variations in frequency and time duration that occur between speakers, and even over speech by a single speaker, to facilitate a simple and accurate determination of matches between those AFDVs generated from a sample of the audio signal and corpus AFDVs generated for known speech at the phoneme and sub-phoneme level. The renormalized feature vectors can be combined with traditional feature vectors such as MFCCs, or they can be used exclusively to identify voiced, semi-voiced and unvoiced sounds.

1. A method of generating advanced feature discrimination vectors (AFDVs) representing sounds forming at least part of an input audio ..., comprising:
taking a plurality of samples of the input audio signal, the plurality of samples being a portion of the input audio signal as it evolves over a window of predetermined time;
for each portion of the input audio signal taken:
performing a signal analysis on the portion to extract one or more high resolution oscillator peaks therefrom, the extracted oscillator peaks forming a spectral representation of the portion;
renormalizing the extracted oscillator peaks to eliminate variations in a fundamental frequency and a time duration for each portion occurring over the window;
normalizing a power of the renormalized extracted oscillator peaks;
forming the renormalized and power normalized extracted oscillator peaks into an AFDV for the sample;
collecting a set of audio samples from a specific individual to form a model of a voice of the individual comprising a database in an AFDV format; and
creating an audio fingerprint in the AFDV format of an individual comprising an aggregation of the collected audio samples.

More
01-07-2021 publication date

VOICE CONVERSION TRAINING METHOD AND SERVER AND COMPUTER READABLE STORAGE MEDIUM

Number: US20210201890A1
Assignee:

The present disclosure discloses a voice conversion training method. The method includes: forming a first training data set including a plurality of training voice data groups; selecting two of the training voice data groups from the first training data set to input into a voice conversion neural network for training; forming a second training data set including the first training data set and a first source speaker voice data group; inputting one of the training voice data groups selected from the first training data set and the first source speaker voice data group into the network for training; forming a third training data set including a second source speaker voice data group and a personalized voice data group that are parallel corpus with respect to each other; and inputting the second source speaker voice data group and the personalized voice data group into the network for training.

1. A voice conversion training method, comprising steps of:
forming a first training data set, wherein the first training data set comprises a plurality of training voice data groups;
selecting two of the training voice data groups from the first training data set to input into a voice conversion neural network for training;
forming a second training data set, wherein the second training set comprises the first training data set and a first source speaker voice data group;
inputting one of the training voice data groups selected from the first training data set and the first source speaker voice data group into the voice conversion neural network for training;
forming a third training data set, wherein the third training data set comprises a second source speaker voice data group and a personalized voice data group, the second source speaker voice data group comprises a second quantity of second source speaker voice data and corresponds to a same speaker with the first source speaker voice data group, and the personalized voice data group comprises the second quantity of ...

More
23-06-2016 publication date

METHOD FOR IMPROVING ACOUSTIC MODEL, COMPUTER FOR IMPROVING ACOUSTIC MODEL AND COMPUTER PROGRAM THEREOF

Number: US20160180836A1
Assignee:

Embodiments include methods and systems for improving an acoustic model. Aspects include acquiring a first standard deviation value by calculating standard deviation of a feature from first training data and acquiring a second standard deviation value by calculating standard deviation of a feature from second training data acquired in a different environment from an environment of the first training data. Aspects also include creating a feature adapted to an environment where the first training data is recorded, by multiplying the feature acquired from the second training data by a ratio obtained by dividing the first standard deviation value by the second standard deviation value. Aspects further include reconstructing an acoustic model constructed using training data acquired in the same environment as the environment of the first training data using the feature adapted to the environment where the first training data is recorded.

1. A method for improving an acoustic model, comprising:
acquiring, by a computer, a first standard deviation value by calculating standard deviation of a first feature from first training data acquired in a first environment;
acquiring, by the computer, a second standard deviation value by calculating standard deviation of a second feature from second training data acquired in a second environment;
calculating, by the computer, a modified first feature, by multiplying the second feature acquired from the second training data by a ratio obtained by dividing the first standard deviation value by the second standard deviation value; and
reconstructing, by the computer, an acoustic model constructed using training data acquired in the first environment, using the modified first feature.

2. The method according to claim 1, wherein the first feature is one of a cepstrum and log mel filter bank output.

3. The method according to claim 1, wherein the amount of the first training data is smaller than an amount of the second training data.

4. The ...

More
23-06-2016 publication date

SYSTEM AND METHOD OF USING NEURAL TRANSFORMS OF ROBUST AUDIO FEATURES FOR SPEECH PROCESSING

Number: US20160180843A1
Assignee:

A system and method for processing speech includes receiving a first information stream associated with speech, the first information stream comprising micro-modulation features, and receiving a second information stream associated with the speech, the second information stream comprising features. The method includes combining, via a non-linear multilayer perceptron, the first information stream and the second information stream to yield a third information stream. The system performs automatic speech recognition on the third information stream. The third information stream can also be used for training HMMs.

1. A method comprising:
receiving, via a communication network, a first information stream associated with a formant frequency of speech;
receiving, via the communication network, a second information stream associated with the speech; and
performing, via at least one hardware processor, automatic speech recognition on a third information stream formed by combining, via a non-linear multilayer perceptron, the first information stream and the second information stream.

2. The method of claim 1, wherein the first information stream comprises micro-modulation features modeled in a first time scale.

3. The method of claim 2, wherein the second information stream comprises cepstral features modeled in a second time scale.

4. The method of claim 3, wherein the first time scale is distinct from the second time scale.

5. The method of claim 1, further comprising filtering out noise from the third information stream prior to performing automatic speech recognition.

6. The method of claim 1, wherein the third information stream comprises fewer features than the raw features in the first information stream and the second information stream.

7. The method of claim 1, further comprising training a Hidden Markov model using the third information stream.

8. A system comprising:
a processor; and
receiving, via a communication network, a first information stream associated with a formant ...

More
23-06-2016 publication date

SPEAKER IDENTIFICATION USING SPATIAL INFORMATION

Number: US20160180852A1
Author: Huang Shen, Sun Xuejing

Embodiments of the present invention relate to speaker identification using spatial information. A method of speaker identification for audio content being of a format based on multiple channels is disclosed. The method comprises extracting, from a first audio clip in the format, a plurality of spatial acoustic features across the multiple channels and location information, the first audio clip containing voices from a speaker, and constructing a first model for the speaker based on the spatial acoustic features and the location information, the first model indicating a characteristic of the voices from the speaker. The method further comprises identifying whether the audio content contains voices from the speaker based on the first model. Corresponding system and computer program product are also disclosed.

1. A method of speaker identification for audio content, the audio content being of a format based on multiple channels, the method comprising:
extracting, from a first audio clip in the format, a plurality of spatial acoustic features across the multiple channels and location information, the first audio clip containing voices from a speaker;
constructing a first model for the speaker based on the spatial acoustic features and the location information, the first model indicating a characteristic of the voices from the speaker; and
identifying whether the audio content contains voices from the speaker based on the first model.

2. The method according to claim 1, wherein the spatial acoustic features include an intra-channel shifted delta cepstrum (SDC) feature and an inter-channel SDC feature, and wherein extracting the spatial acoustic features from the first audio clip comprises:
for each of the multiple channels, extracting a cepstrum coefficient for each frame of the first audio clip in a frequency domain;
determining an intra-channel SDC feature for each of the multiple channels based on difference between the cepstrum coefficients for the channel over ...
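
A standard shifted delta cepstrum stacks k delta blocks taken at shifts of P frames, each delta computed over a ±d-frame spread; the intra-/inter-channel variants in claim 2 build on this per-channel computation. A sketch with the common N-d-P-k parameterization (the 7-1-3-7 values are conventional defaults, not the patent's):

```python
# Sketch of a standard shifted delta cepstrum (SDC) stack: for frame t,
# concatenate c(t + i*P + d) - c(t + i*P - d) for i = 0..k-1.
import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    """cepstra: (frames, N). Returns (frames, N*k) SDC features."""
    T, N = cepstra.shape
    off = d + (k - 1) * P                       # padding needed on each side
    pad = np.pad(cepstra, ((off, off), (0, 0)), mode="edge")
    out = np.zeros((T, N * k))
    for t in range(T):
        blocks = [pad[off + t + i * P + d] - pad[off + t + i * P - d]
                  for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out

print(sdc(np.random.randn(100, 7)).shape)       # (100, 49)
```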

More
22-06-2017 publication date

TECHNOLOGIES FOR ROBUST CRYING DETECTION USING TEMPORAL CHARACTERISTICS OF ACOUSTIC FEATURES

Number: US20170178667A1
Assignee:

Technologies for identifying sounds are disclosed. A sound identification device may capture sound data, and split the sound data into frames. The sound identification device may then determine an acoustic feature vector for each frame, and determine parameters based on how each acoustic feature varies over the duration of time corresponding to the frames. The sound identification device may then determine if the sound matches a pre-defined sound based on the parameters. In one embodiment, the sound identification device may be a baby monitor, and the pre-defined sound may be a baby crying.

1. A sound identification device for identifying sounds, the sound identification device comprising:
a sound data capture module to acquire sound data;
a sound frame determination module to determine a plurality of frames of sound data based on the sound data; and
a sound identification module to determine an acoustic feature matrix having two dimensions and comprising a plurality of first-dimension vectors and a plurality of second-dimension vectors, wherein each first-dimension vector of the plurality of first-dimension vectors corresponds to a corresponding frame of the plurality of frames and each second-dimension vector of the plurality of second-dimension vectors comprises an acoustic feature vector associated with the corresponding frame, and wherein each first-dimension vector of the plurality of first-dimension vectors is associated with a different acoustic feature; determine a plurality of temporal parameters for each first-dimension vector of the plurality of first-dimension vectors; and determine, based on the pluralities of temporal parameters, whether the sound data corresponds to a pre-defined sound.

2. The sound identification device of claim 1, wherein each acoustic feature vector of the acoustic feature matrix comprises mel-frequency cepstrum coefficients.

3. The sound identification device of claim 1, wherein the pre-defined sound is a cry of an infant.

4. The ...
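
The "temporal parameters" step can be illustrated by summarizing how each acoustic feature (one per-feature time series in the matrix) varies across frames. Mean, standard deviation, and linear slope are assumed example parameters; the patent does not enumerate its set:

```python
# Hedged sketch: per-feature temporal parameters over an acoustic feature
# matrix (rows = features, columns = frames). The chosen statistics are
# illustrative assumptions.
import numpy as np

def temporal_parameters(feature_matrix):
    """feature_matrix: (n_features, n_frames)."""
    t = np.arange(feature_matrix.shape[1])
    params = []
    for row in feature_matrix:
        slope = np.polyfit(t, row, 1)[0]        # linear trend over time
        params.append((row.mean(), row.std(), slope))
    return np.array(params)                      # (n_features, 3)

mfcc_matrix = np.random.randn(13, 120)           # 13 MFCCs over 120 frames
print(temporal_parameters(mfcc_matrix).shape)    # (13, 3)
```

A classifier (e.g., for infant crying) would then decide from these parameters whether the sound matches the pre-defined sound.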

More
28-05-2020 publication date

SYSTEMS AND METHODS FOR SPEECH RECOGNITION IN UNSEEN AND NOISY CHANNEL CONDITIONS

Number: US20200168208A1
Assignee:

Systems and methods for speech recognition are provided. In some aspects, the method comprises receiving, using an input, an audio signal. The method further comprises splitting the audio signal into auditory test segments. The method further comprises extracting, from each of the auditory test segments, a set of acoustic features. The method further comprises applying the set of acoustic features to a deep neural network to produce a hypothesis for the corresponding auditory test segment. The method further comprises selectively performing one or more of: indirect adaptation of the deep neural network and direct adaptation of the deep neural network.

1. A method for speech recognition comprising:
receiving, using an input, an audio signal;
splitting the audio signal into auditory test segments;
extracting, from each of the auditory test segments, a set of acoustic features;
applying the set of acoustic features to a deep neural network to produce a hypothesis for the corresponding auditory test segment; and
selectively performing one or more of: indirect adaptation of the deep neural network and direct adaptation of the deep neural network.

2. The method of claim 1, wherein performing indirect adaptation of the deep neural network comprises:
extracting, from each of the auditory test segments, two distinct sets of acoustic features; and
applying the two distinct sets of acoustic features to the deep neural network simultaneously.

3. The method of claim 2, further comprising:
performing a feature-space transformation on each of the two distinct sets of acoustic features prior to applying the two distinct sets of acoustic features to the deep neural network simultaneously.

4. The method of claim 3, wherein the feature-space transformation is a feature space maximum likelihood linear regression transformation.

5. The method of claim 1, wherein the set of acoustic features comprises a set of feature vectors, each of the set of feature vectors comprising quantitative ...

More
13-06-2019 publication date

PORTABLE DEVICE FOR CONTROLLING EXTERNAL DEVICE, AND AUDIO SIGNAL PROCESSING METHOD THEREFOR

Number: US20190180738A1
Author: KIM Dong-Wan
Assignee: SAMSUNG ELECTRONICS CO., LTD.

Disclosed is a portable device for controlling an external device. The portable device comprises: a first microphone, disposed on one surface of the portable device, for receiving an audio signal including a user voice uttered by a user; a second microphone, disposed on the other surface of the portable device opposite to the one surface, for receiving the audio signal including the user voice; a signal processing unit for processing the audio signal; a communication unit for communicating with the external device; and a processor which determines the user utterance distance between the portable device and the user on the basis of the audio signals received through the first and second microphones and, if it is determined that the user utterance distance is a short-distance utterance, controls the signal processing unit to process only the audio signal received through the microphone disposed at a relatively greater distance from the user from among the first and second microphones, and controls the communication unit to transmit the processed audio signal to the external device.

1. A portable device for controlling an external device, the portable device comprising:
a first microphone configured to be disposed in one surface of the portable device and configured to receive an audio signal including a user voice uttered by a user;
a second microphone configured to be disposed on the other surface opposite to the one surface of the portable device and configured to receive an audio signal including the user voice;
a signal processing unit configured to process the audio signals;
a communication unit configured to communicate with the external device; and
a processor configured to determine a user utterance distance between the portable device and the user based on the audio signals received through the first and second microphones, control the signal processing unit to process only the audio signal received through one of the first and second microphones ...

More
29-06-2017 publication date

ACOUSTIC FEEDBACK CANCELLATION BASED ON CEPSTRAL ANALYSIS

Number: US20170188147A1
Assignee: UNIVERSIDADE DO PORTO

The present disclosure relates to a circuit and method for cancelling the acoustic feedback in public address systems, sound reinforcement systems, hearing aids, teleconference systems or hands-free communication systems, comprising providing a filter for tracking the acoustic feedback path between the radiator device broadcasting and the receiver device, the input of said filter being the signal applied to the radiator device; updating the filter for tracking the acoustic feedback path based on time-domain information contained in the cepstrum of the receiver device signal, or updating the filter for tracking the acoustic feedback path based on time-domain information contained in the cepstrum of the signal applied to the radiator device, or updating the filter for tracking the acoustic feedback path based on time-domain information contained in the cepstrum of the difference between the receiver device signal and the signal applied to the radiator device filtered by the filter.

1. Method for cancelling acoustic feedback from a radiator device broadcasting to a receiver device in an environment, comprising:
providing a filter H(z,n) for tracking the acoustic feedback path between the radiator device broadcasting and the receiver device, the input of said filter being the signal x(n) applied to the radiator device;
updating the filter H(z,n) for tracking the acoustic feedback path based on time-domain information contained in the cepstrum c_y(τ,n) of the receiver device signal y(n), or
updating the filter H(z,n) for tracking the acoustic feedback path based on time-domain information contained in the cepstrum c_x(τ,n) of the signal x(n) applied to the radiator device, or
updating the filter H(z,n) for tracking the acoustic feedback path based on time-domain information contained in the cepstrum c_e(τ,n) of the difference between the receiver device signal and the signal x(n) applied to the radiator device filtered by the filter H ...
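
The real cepstrum used throughout these updates is the inverse FFT of the log magnitude spectrum; a feedback-path delay shows up as a peak at the corresponding quefrency (time lag). A minimal sketch, with the echo delay and gain below chosen for illustration:

```python
# Sketch: real cepstrum via irfft(log|rfft|). An echo (feedback-like copy)
# delayed by 200 samples produces a cepstral peak near quefrency 200.
import numpy as np

def real_cepstrum(x):
    spectrum = np.fft.rfft(x)
    return np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))

sr = 8000
sig = np.random.randn(sr)
echo = sig + 0.5 * np.roll(sig, 200)        # feedback-like copy, 25 ms late
c = real_cepstrum(echo)
print(np.argmax(np.abs(c[50:400])) + 50)    # ~200: the echo lag in samples
```

This time-domain (quefrency) information is what the filter update in the claims draws on.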

More
15-07-2021 publication date

REDUCED MISS RATE IN SOUND TO TEXT CONVERSION USING BANACH SPACES

Number: US20210217421A1
Assignee:

A computer-implemented method includes: comparing features extracted from a first document of a plurality of documents that include a sound to features extracted from acoustic files related to the sound; designating the sound in a document of the plurality of documents as a true positive; designating the sound in the first document as a false negative; generating a first sound vector for the sound in the first document in response to the sound in the first document being designated a false negative; generating a sound vector for each of the documents designated as a true positive; creating a centroid vector for the sound vectors of the documents designated as a true positive; and redesignating the sound in the first document from a false negative to a true positive in response to the first sound vector and the centroid vector being a Banach space.

1. A computer-implemented method comprising:
comparing, by a computer device, features extracted from a first document of a plurality of documents that include a sound, to features extracted from acoustic files related to the sound;
designating, by the computer device, the sound in a document of the plurality of documents as a true positive in response to correctly identifying the sound;
designating, by the computer device, the sound in the first document as a false negative in response to failing to identify the sound, and the sound existing in an acoustic library;
generating, by the computer device, a first sound vector for the sound in the first document in response to the sound in the first document being designated a false negative;
generating, by the computer device, a sound vector for each of the documents designated as a true positive;
creating, by the computer device, a centroid vector for the sound vectors of the documents designated as a true positive; and
redesignating, by the computer device, the sound in the first document from a false negative to a true positive in response to the first sound vector and the centroid vector being a Banach space.

2. The ...

More
11-06-2020 publication date

METHOD AND SYSTEM FOR ARTICULATION EVALUATION BY FUSING ACOUSTIC FEATURES AND ARTICULATORY MOVEMENT FEATURES

Number: US20200178883A1
Assignee:

The present invention provides an articulation evaluation method and system combining acoustic features and articulation motion features. According to the articulation evaluation method and system, audio data and articulation motion data are acquired, acoustic features are extracted from the audio data, articulation motion features are extracted from the articulation motion data, and feature fusion and policy fusion are performed on the acoustic features and the articulation motion features according to a time correspondence, which effectively utilizes the complementarity of the two types of features to ensure the objectivity and comprehensiveness of evaluation, so that a more accurate and reliable feature fusion evaluation result and decision fusion evaluation result are obtained, making the articulation evaluation more objective and accurate.

1. An articulation evaluation method combining acoustic features and articulation motion features, comprising the following steps:
step (10): acquiring audio data and articulation motion data, extracting acoustic features from the audio data, and extracting articulation motion features from the articulation motion data, wherein the audio data and the articulation motion data correspond in time;
step (20): performing feature fusion processing on the acoustic features and the articulation motion features according to a time correspondence to obtain fusion features;
step (30): performing training according to the fusion features to obtain a fusion feature intelligibility discrimination model; and
step (40): obtaining a feature fusion evaluation result by using the fusion feature intelligibility discrimination model.

2. The articulation evaluation method combining acoustic features and articulation motion features according to claim 1, wherein further, training is respectively performed according to the acoustic features and the articulation motion features to obtain ...

More
22-07-2021 publication date

SYSTEM AND METHOD FOR MEASUREMENT OF VOCAL BIOMARKERS OF VITALITY AND BIOLOGICAL AGING

Number: US20210219893A1
Assignee:

A system and method for screening and monitoring progression of subjects' health conditions and wellbeing, by the analysis of their voice signal. According to one embodiment, a system is provided that records voice samples of subjects and evaluates, in real time, the severity of their health condition based on vitality biomarkers. The vitality biomarkers are the construct of machine learning and deep learning models trained in an offline procedure. The offline training procedure is optimized to associate between (a) acoustic features and/or image representations of training cohort subjects' pre-recorded voices; and (b) their vitality score, extracted from their medical records. In the training procedure, the vitality scores of the training cohort subjects are heuristically defined as a function of the speaker age at the time of recording and the duration elapsed between the time of recording and available clinical events, with emphasis on the time of death when available.

1.–36. (canceled)

37. A computer-based system, comprising a measuring unit (100) for estimating a vitality score of a subject based on voice and a training unit (150) for training said measuring unit, said system comprising one or more processors and non-transitory computer-readable media (CRM), said CRMs storing instructions to said processors for operation of modules of said measuring unit (100) and said training unit (150),
i. one or more recording devices (105), configured to record a voice sample of a subject;
ii. an acoustic processing module (110), configured to a) compute temporal sequences of a set of low-level acoustic features of said voice sample; and b) convert said low-level sequences of acoustic features to image representations;
iii. a vocal biomarker model file (115), configured to store parameters of a vocal biomarker model;
iv. a vocal biomarker evaluation module (120), configured to evaluate a ...

More
05-07-2018 publication date

SYSTEM AND METHOD FOR NEURAL NETWORK BASED FEATURE EXTRACTION FOR ACOUSTIC MODEL DEVELOPMENT

Number: US20180190267A1
Assignee:

A system and method are presented for neural network based feature extraction for acoustic model development. A neural network may be used to extract acoustic features from raw MFCCs or the spectrum, which are then used for training acoustic models for speech recognition systems. Feature extraction may be performed by optimizing a cost function used in linear discriminant analysis. General non-linear functions generated by the neural network are used for feature extraction. The transformation may be performed using a cost function from linear discriminant analysis methods which perform linear operations on the MFCCs and generate lower dimensional features for speech recognition. The extracted acoustic features may then be used for training acoustic models for speech recognition systems.

1. A method for training acoustic models in speech recognition systems, wherein the speech recognition system comprises a neural network, the method comprising the steps of:
a. extracting acoustic features from a speech signal using the neural network; and
b. processing the acoustic features into an acoustic model by the speech recognition system.

2.–24. (canceled)

The present invention generally relates to telecommunications systems and methods, as well as automatic speech recognition systems. More particularly, the present invention pertains to the development of acoustic models used in automatic speech recognition systems.

More
06-07-2017 publication date

SYSTEM AND METHOD FOR NEURAL NETWORK BASED FEATURE EXTRACTION FOR ACOUSTIC MODEL DEVELOPMENT

Number: US20170193988A1
Assignee:

A system and method are presented for neural network based feature extraction for acoustic model development. A neural network may be used to extract acoustic features from raw MFCCs or the spectrum, which are then used for training acoustic models for speech recognition systems. Feature extraction may be performed by optimizing a cost function used in linear discriminant analysis. General non-linear functions generated by the neural network are used for feature extraction. The transformation may be performed using a cost function from linear discriminant analysis methods which perform linear operations on the MFCCs and generate lower dimensional features for speech recognition. The extracted acoustic features may then be used for training acoustic models for speech recognition systems.

1. A method for training acoustic models in speech recognition systems, wherein the speech recognition system comprises a neural network, the method comprising the steps of:
a. extracting acoustic features from a speech signal using the neural network; and
b. processing the acoustic features into an acoustic model by the speech recognition system.

2. The method of claim 1, wherein the acoustic features are extracted from Mel Frequency Cepstral Coefficients.

3. The method of claim 1, wherein the features are extracted from a speech signal spectrum.

4. The method of claim 1, wherein the neural network comprises at least one of: activation functions with parameters, prealigned feature data, and training.

5. The method of claim 4, wherein the training is performed using a stochastic gradient descent method on a cost function.

6. The method of claim 5, wherein the cost function is a linear discriminant analysis cost function.

7. The method of claim 1, wherein the extracting of step (a) further comprises the step of optimizing a cost function, wherein the cost function is capable of transforming general non-linear functions generated by the neural network.

8. The ...
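
Claims 5–6 name the training objective: a linear discriminant analysis cost optimized by stochastic gradient descent. The classic LDA criterion is the ratio of between-class to within-class scatter of the extracted features; the trace-of-ratio form below is one common variant, assumed here for illustration:

```python
# Hedged sketch: evaluate an LDA cost (between-class scatter over
# within-class scatter) on extracted features; training would maximize it.
import numpy as np

def lda_cost(features, labels):
    mu = features.mean(axis=0)
    sb = np.zeros((features.shape[1],) * 2)    # between-class scatter
    sw = np.zeros_like(sb)                     # within-class scatter
    for c in np.unique(labels):
        fc = features[labels == c]
        d = (fc.mean(axis=0) - mu)[:, None]
        sb += len(fc) * (d @ d.T)
        sw += (fc - fc.mean(axis=0)).T @ (fc - fc.mean(axis=0))
    return np.trace(np.linalg.pinv(sw) @ sb)   # higher = better separated

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])
y = np.repeat([0, 1], 50)
print(round(lda_cost(x, y), 2))                # large for well-separated classes
```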

Publication date: 20-06-2019

METHOD AND SYSTEM FOR DIAGNOSING CORONARY ARTERY DISEASE (CAD) USING A VOICE SIGNAL

Number: US20190189146A1
Authors: LEVANON Yoram, Luz Yotam
Assignee:

The present invention extends to methods and systems for diagnosing coronary artery disease (CAD) in patients by using their voice signal, comprising receiving voice signal data indicative of speech from the patient.

Claims (excerpt):
2. The method of claim 1, wherein the step of computing MFCC is performed by computing a Cepstral representation using any degree of freedom.
3. The method of claim 1, wherein the Cepstral representation comprises a time series used for statistical feature extraction.
4. The method of claim 1, wherein the step of segmenting the voice signal data into frames further provides a power spectrum density (PSD) and/or its Root Mean Square (RMS) spectrogram with any resolution between 1 and 200 frames per second.
5. The method of claim 1, wherein the step of computing Mel Frequency Cepstral Coefficients (MFCC) from a log scaling function that resembles the human acoustic perception of sounds is achieved by using any number of Mel frequency triangular filter banks.
6. The method of claim 1, wherein the step of computing Mel Frequency Cepstral Coefficients (MFCC) from log scaling functions that resemble the human acoustic perception of sound pressure levels is achieved by converting to decibels (dB).
7. The method of claim 1, wherein for each of the two or more frequency bands the intensity ratio values are manifested at a given time period.
8. The method of claim 1, wherein the voice signal data has a finite duration and each time period separating the respective plurality of intensity ratio values is essentially evenly distributed within the duration of the speech.
9. The method of claim 1, wherein the existence of at least one coronary artery disease symptom associated with the patient is determined based at least in part upon the type of statistical operator function, including at least one decay feature.
10. The method of claim 9, wherein the zero-crossing type operator measure provides an indicator of the severity of the coronary artery disease ...
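The claims above enumerate standard signal-processing steps: framing, Mel triangular filter banks, dB scaling, and band-intensity ratios tracked over time. A minimal sketch of those steps with librosa follows; the file name, frame rate, band edges, and summary statistics are illustrative assumptions, and nothing here implements the diagnostic classifier itself.

```python
import numpy as np
import librosa

# Hypothetical input recording of patient speech.
y, sr = librosa.load("patient_voice.wav", sr=16000)

# ~100 frames per second (hop = sr / 100), within the 1-200 fps range cited.
hop = sr // 100
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=40, hop_length=hop)

# Power spectrogram converted to decibels (the dB scaling in claim 6).
S = np.abs(librosa.stft(y, hop_length=hop)) ** 2
S_db = librosa.power_to_db(S)

# Intensity ratio between two frequency bands, tracked frame by frame
# (band edges are assumed for illustration).
freqs = librosa.fft_frequencies(sr=sr)
low = S[(freqs >= 100) & (freqs < 1000)].sum(axis=0)
high = S[(freqs >= 1000) & (freqs < 4000)].sum(axis=0)
ratio = low / (high + 1e-10)

# Simple per-recording statistics of the MFCC time series (claim 3).
stats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```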

Publication date: 11-06-2020

IMAGE-BASED APPROACHES TO IDENTIFYING THE SOURCE OF AUDIO DATA

Number: US20200184955A1
Assignee: VocaliD, INC.

Image-based machine learning approaches are used to classify audio data, such as speech data, as authentic or otherwise. For example, audio data can be obtained and a visual representation of the audio data can be generated. The visual representation can include, for example, an image such as a spectrogram or other visual or electronic representation of the audio data. Before processing the image, the audio data and/or image may undergo various preprocessing techniques. Thereafter, the image representation of the audio data can be analyzed using a trained model to classify the audio data as authentic or otherwise.

Claims (excerpt):
1. A computing system, comprising:
at least one computing processor; and
memory including instructions that, when executed by the at least one computing processor, enable the computing system to:
obtain voice data, the voice data associated with a first identifier for a source of the voice data;
generate an image that includes a visual representation of the voice data;
use a trained model to analyze the image to determine classification information, the classification information associated with a second identifier, the second identifier corresponding to an identity of a speaker of the voice data; and
generate information for one of the source or the identity of the speaker.
2. The computing system of claim 1, wherein the classification information includes a confidence score quantifying a likelihood that the voice data is authentic and a similarity score quantifying a level of similarity of the voice data to stored voice data.
3. The computing system of claim 1, wherein the voice data is obtained from at least one source of a plurality of sources, the plurality of sources including radio content providers, video content providers, audio content providers, television content providers, or live performance content providers, and wherein at least a portion of the voice data is generated from one of a ...
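Claim 1 describes the core pipeline: render the voice data as an image, then score that image with a trained model. The sketch below shows one plausible reading using a log-Mel spectrogram; the image size, normalization, and the commented-out classifier call are assumptions for illustration rather than the system claimed.

```python
import numpy as np
import librosa
import torch

def voice_to_image(path, size=(128, 128)):
    """Render a log-Mel spectrogram as a fixed-size single-channel image."""
    y, sr = librosa.load(path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=size[0])
    img = librosa.power_to_db(mel, ref=np.max)
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # scale to [0, 1]
    img = img[:, :size[1]]                     # trim long clips in time
    pad = size[1] - img.shape[1]
    if pad > 0:
        img = np.pad(img, ((0, 0), (0, pad)))  # zero-pad short clips
    return torch.from_numpy(img).float()[None, None]  # NCHW tensor

# Hypothetical scoring step with a previously trained classifier:
# model = torch.load("voice_classifier.pt")   # assumed model file
# logits = model(voice_to_image("clip.wav"))
# confidence = torch.softmax(logits, dim=1)   # e.g. [authentic, synthetic]
```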
