METHOD FOR SPEECH RECOGNITION

Publication date: 15-06-2011
Number: AT0000510278T
Application number: 46-72-0816
Application date: 22-10-2008

Technical Field

[1]

The present invention relates to a method for speech recognition that requires only little computing power to execute.

State of the Art

[2]

Speech recognition systems require considerable computing power and memory. Processors fast enough to offer the fast multiply functions that conventional speech recognition algorithms need have been available for use in personal computers (PCs) and stand-alone computer systems (embedded systems) only for a number of years. Even small battery-powered devices (e.g. mobile phones) can already be equipped with simple speech recognition systems, for example for voice dialling, since the computing power of the processors used in these devices is sufficient for this purpose and enough memory is available.

[3]

The computing power required for speech recognition is considerable even for very simple tasks. For example, a speech recognition system with a vocabulary of 10 words requires a computing power of 5 to 20 million instructions per second (MIPS), a program memory of 5 to 50 kbytes and a working memory of 1 to 5 kbytes. These requirements pose no difficulty for modern mobile phones, since their processors offer several hundred MIPS and several MBytes of program and working memory.

[4]

If, however, only very little computing power is available, speech recognition systems according to the state of the art cannot be used. For instance, the processors used in digital hearing aids cannot implement conventional speech recognition algorithms, because their computing power is typically 1 MIPS and only about 1 kbyte of program memory and about 100 bytes of working memory are available.

[5]

It is not possible to use other, more powerful processors for the voice control of hearing aids, since these would drain the available energy supply (usually miniature batteries) too quickly. The processors used in digital hearing aids do not offer the fast multiply functions required by speech recognition systems according to the state of the art, since these functions are not needed for the functionality required in digital hearing aids.

Disclosure of the invention

[6]

The invention is based on the problem of specifying a method for speech recognition that requires only little computing power to carry out.

[7]

The problem is solved by a method in which a pattern recognizer creates, from an input pattern and by comparison with stored patterns of all recognizable words, a ranking of the words according to their hit probability, the input pattern being created in that

  • the digitized speech input signal is divided into a plurality of frequency channels m by means of a band filter bank, and
  • pattern elements pfb are determined from the digital speech samples of each frequency range b, consisting of W adjacent frequency channels m, over a temporal summation range f, consisting of L adjacent points in time t, according to
    $$p_{fb} = \begin{cases} 0 & \text{for } u_{fb} \le u_{min} \\ \ln_2\left(u_{fb}/u_{min}\right) & \text{for } u_{fb} > u_{min} \end{cases}$$
    wherein
    $$u_{fb} = \sum_{t=fL+1}^{fL+L} \; \sum_{m=bW+1}^{bW+W} |s_{tm}|$$
    and wherein |stm| represents the absolute value of the signal sample of channel m at time t, and wherein umin represents an arbitrarily selectable value.

[8]

The pattern recognizer used in this method can be chosen arbitrarily; the advantageous properties of the present invention are independent of the pattern recognizer selected. Naturally, existing speech recognizers can also be equipped with the present method.

[9]

The method of the present invention achieves the advantage that a speech recognizer can be built even with processors of low computing power.

[10]

A further important aspect of the invention is that the processor executing the method according to the invention need not have a multiply function.
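The source does not state how the logarithm in the pattern-element formula is evaluated. The following is a minimal sketch of one multiplication-free possibility, under two assumptions that are not in the source: that "ln2" denotes the base-2 logarithm, and that an integer approximation of it is acceptable. The function names `floor_log2` and `pattern_value` are hypothetical.

```python
def floor_log2(x: int) -> int:
    """Integer part of log2(x) for x >= 1, read off the bit length (no multiply)."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    return x.bit_length() - 1


def pattern_value(u: int, u_min: int) -> int:
    """Coarse log2(u / u_min): 0 if u <= u_min, else a difference of integer logs.

    Assumption: u_min is a power of two, so the difference equals
    floor(log2(u / u_min)) exactly. For other u_min it can be off by one.
    """
    if u <= u_min:
        return 0
    return floor_log2(u) - floor_log2(u_min)


print(pattern_value(64, 8))   # 64 / 8 = 8 = 2**3
print(pattern_value(100, 8))  # 100 / 8 = 12.5, integer part of log2 is 3
```

With this approach only additions, comparisons and a bit-position lookup are needed, which is in line with the stated goal of avoiding a hardware multiplier.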

[11]

The use of the invention is particularly appropriate for battery-powered devices, e.g. hearing aids, since the processors used in such applications can often carry out multiplications only with great programming effort, and the time required for these multiplications is too long to implement a conventional speech recognizer.

Brief description of the drawings

[12]

The drawings show, by way of example and schematically:

[13]

Fig. 1 Block diagram of a speech recognizer

[14]

Fig. 2 Determination of the pattern elements pfb

[15]

Fig. 3 Block diagram of a pattern end point detection PSD

Embodiment of the invention

[16]

Fig. 1 shows, by way of example and schematically, a speech recognizer SE. This speech recognizer comprises a feature extraction VU, a pattern end point detection PSD and a pattern classification PC. As the output of each speech recognition procedure, the speech recognizer SE provides a ranking r of all recognizable words, ranked according to their respective probability. This ranking r serves as the input signal of a control logic SL, which carries out further processing on the basis of the ranking r or an analysis of the recognized words.

[17]

The input signal of the speech recognizer is formed by the audio signal st. From this audio signal st, the feature extraction VU determines pattern elements pf over temporal summation ranges, which serve as the input variable of the pattern classification PC (pattern recognizer). Furthermore, the feature extraction VU determines from the audio signal st an energy equivalent ef, which serves as the input variable of the pattern end point detection PSD (word end detector). After it has recognized the end of a word, the pattern end point detection PSD passes the word end signal d to the pattern classification PC.

[18]

Fig. 2 shows, by way of example and schematically, the determination of the pattern elements pfb and of the input variable pf that the pattern elements of one temporal summation range constitute. The horizontal axis represents time, the vertical axis the frequency. A time-discrete (digitized) speech input signal is divided into frequency channels m by means of a band filter bank (not shown). In each case, a number W of adjacent frequency channels m is combined into a frequency range b. In the present example the number of frequency channels is 10 and the number W of adjacent frequency channels combined into a frequency range is 2, so that 5 frequency ranges b are formed in this example.
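The grouping in this example can be sketched as follows. The channel numbering m = bW+1 ... bW+W with b starting at 0 is taken from the summation limits in claim 1; the function name `frequency_ranges` is a hypothetical helper for illustration only.

```python
def frequency_ranges(num_channels: int, w: int) -> list[list[int]]:
    """Group channels 1..num_channels into ranges of w adjacent channels.

    Range b (counted from 0) holds the channels b*w + 1 ... b*w + w.
    """
    if w < 1 or num_channels % w != 0:
        raise ValueError("num_channels must be a positive multiple of w")
    num_ranges = num_channels // w
    return [list(range(b * w + 1, b * w + w + 1)) for b in range(num_ranges)]


# The example from the description: 10 channels, W = 2, hence 5 ranges.
for b, channels in enumerate(frequency_ranges(10, 2)):
    print(b, channels)
```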

[19]

Furthermore, the signals ([...]) of all frequency channels are combined over a number L of samples, and the pattern elements pfb are thus determined according to

$$p_{fb} = \begin{cases} 0 & \text{for } u_{fb} \le u_{min} \\ \ln_2\left(u_{fb}/u_{min}\right) & \text{for } u_{fb} > u_{min} \end{cases}$$

The quantity umin represents an arbitrarily selectable value; concrete values of umin are set in the course of implementing the method according to the invention and when adapting the speech recognizer to specific fields of application.

[20]

The values of ufb necessary for determining the pattern elements pfb are determined by

$$u_{fb} = \sum_{t=fL+1}^{fL+L} \; \sum_{m=bW+1}^{bW+W} |s_{tm}|$$

wherein |stm| represents the absolute value of the signal sample of channel m at time t.
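As a rough illustration, the computation of the pattern elements for one temporal summation range might look like the sketch below. It assumes that "ln2" in the claims denotes the base-2 logarithm, maps the 1-based indices of the formulas to 0-based Python indices, and uses floating-point `math.log2` for readability rather than a multiplication-free routine. The function name and the sample data are hypothetical.

```python
import math


def pattern_elements(samples, f, L, W, u_min):
    """Return [p_f0, p_f1, ...] for temporal summation range f.

    samples[t][m] is the filter-bank output of channel m at time t
    (0-based here; the formulas in the text count from 1).
    """
    block = samples[f * L:(f + 1) * L]  # the L adjacent points in time
    if len(block) < L:
        raise ValueError("not enough samples for summation range f")
    num_ranges = len(block[0]) // W
    p_f = []
    for b in range(num_ranges):
        # u_fb: sum of absolute sample values over L instants and W channels
        u_fb = sum(abs(row[m]) for row in block for m in range(b * W, (b + 1) * W))
        p_f.append(math.log2(u_fb / u_min) if u_fb > u_min else 0.0)
    return p_f


# Two instants (L = 2), four channels grouped in pairs (W = 2).
samples = [[1, -1, 4, 4],
           [1, 1, -4, 4]]
print(pattern_elements(samples, f=0, L=2, W=2, u_min=1))  # u_f0 = 4, u_f1 = 16
```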

[21]

The pattern elements pfb, each belonging to one frequency range b and one temporal summation range f, are passed together for each temporal summation range f to the pattern recognizer PC (pattern classification) and constitute its input variable pf.

[22]

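The feature extraction VU further determines the energy equivalent ef of a temporal summation range f according to

$$e_f = \begin{cases} 0 & \text{for } u_f \le u_{min} \\ \ln_2\left(u_f/u_{min}\right) & \text{for } u_f > u_{min} \end{cases}$$

wherein

$$u_f = \sum_{b=0}^{B-1} u_{fb}$$

and wherein umin represents an arbitrarily selectable value.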
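A minimal sketch of the energy equivalent, assuming the sums ufb of all B frequency ranges of one temporal summation range are already available and that "ln2" denotes the base-2 logarithm. The function name and the numbers are illustrative only.

```python
import math


def energy_equivalent(u_fb_values, u_min):
    """e_f from the per-range sums u_f0 ... u_f(B-1) of one summation range."""
    u_f = sum(u_fb_values)  # u_f = sum of u_fb over all frequency ranges b
    return math.log2(u_f / u_min) if u_f > u_min else 0.0


print(energy_equivalent([4, 16, 12], u_min=1))  # u_f = 32
print(energy_equivalent([1, 1], u_min=4))       # u_f = 2 is not above u_min
```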

[23]

This energy equivalent ef of a temporal summation range f serves as the input variable of the pattern end point detection PSD (word end detector).

[24]

Fig. 3 shows the block diagram of a pattern end point detection PSD (word end detector). The pattern end point detection PSD comprises two word end detectors D1 and D2, both of which receive as input the energy equivalent ef of a temporal summation range f and, when a word end is recognized, each deliver a word end signal d1, d2 with the value "true" to the logical OR link OD. By means of a logical OR operation, this logical OR link OD forms from the two word end signals d1, d2 the word end signal d, which is passed to the pattern classification PC.

[25]

Each word end detector D1, D2 executes an identical method for detecting a word end, wherein the parameters of the method, consisting of a first threshold SWE11, SWE12, a second threshold SWE21, SWE22, a first period T11, T12 and a second period T21, T22, differ between the two word end detectors D1, D2.

[26]

The word end detectors D1 and D2 detect a word end by comparing the value of the energy equivalent ef with the thresholds SWE11, SWE21, SWE12 and SWE22: a word end signal d1, d2 is generated once the value of the energy equivalent ef has been greater than the first threshold SWE11 or SWE12 for longer than a first period T11 or T12 and is thereafter lower than the second threshold SWE21 or SWE22 for at least the duration of a second period T21 or T22.
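The detection rule and the OR combination can be sketched as a small state machine. Several details below are assumptions, since the source does not fix them: periods are counted in temporal summation ranges, the threshold conditions must hold for consecutive ranges, and a detector resets after signalling a word end. The class name, field names and parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordEndDetector:
    swe1: float  # first threshold (SWE11 or SWE12)
    swe2: float  # second threshold (SWE21 or SWE22)
    t1: int      # first period in summation ranges (T11 or T12)
    t2: int      # second period in summation ranges (T21 or T22)
    above: int = 0        # consecutive ranges with e_f > swe1
    below: int = 0        # consecutive ranges with e_f < swe2 inside a word
    in_word: bool = False

    def step(self, e_f: float) -> bool:
        """Feed one energy equivalent; return True when a word end is detected."""
        if not self.in_word:
            self.above = self.above + 1 if e_f > self.swe1 else 0
            if self.above > self.t1:  # above the threshold for longer than t1
                self.in_word = True
                self.below = 0
            return False
        self.below = self.below + 1 if e_f < self.swe2 else 0
        if self.below >= self.t2:     # below the threshold for at least t2
            self.in_word = False
            self.above = 0
            return True
        return False


def word_end(d1: WordEndDetector, d2: WordEndDetector, e_f: float) -> bool:
    """Word end signal d: logical OR of the two detector outputs d1, d2."""
    s1 = d1.step(e_f)  # both detectors are always stepped, so no short-circuit
    s2 = d2.step(e_f)
    return s1 or s2


d1 = WordEndDetector(swe1=5, swe2=2, t1=2, t2=2)
d2 = WordEndDetector(swe1=8, swe2=1, t1=1, t2=3)
print([word_end(d1, d2, e) for e in [6, 6, 6, 1.5, 1.5, 0]])
```

Using two detectors with different parameters lets either of them report the word end, whichever parameter set fits the spoken word.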

[27]

SE: Speech recognizer
VU: Feature extraction
PC: Pattern classification
PSD: Pattern end point detection
SL: Control logic
st: Audio signal at time t
[...]: Pattern element of the temporal summation range f
ef: Energy equivalent of the temporal summation range f
r: Ranking
d: Word end signal
d1: Word end signal of the word end detector D1
d2: Word end signal of the word end detector D2
b: Frequency range
B: Number of frequency ranges
m: Frequency channel
W: Number of adjacent frequency channels
f: Temporal summation range
L: Number of samples in the temporal summation range f
[...]: Pattern element of the temporal summation range f and frequency range b
t: Point in time
D1, D2: Word end detectors
SWE11: First threshold value of the word end detector D1
SWE21: Second threshold value of the word end detector D1
SWE12: First threshold value of the word end detector D2
SWE22: Second threshold value of the word end detector D2
T11: First period of the word end detector D1
T21: Second period of the word end detector D1
T12: First period of the word end detector D2
T22: Second period of the word end detector D2
OD: Logical OR link



[28]

The method involves generating a ranking of words from an input pattern, according to hit probability, through comparison with stored patterns of all recognizable words by using a pattern recognizer. A digital speech input signal is divided into frequency channels by using a band filter bank, and pattern elements are determined from the digital speech samples of a frequency range comprising several frequency channels over a temporal summation range. A word end detection is provided for detecting the ends of words through the logical disjunction (OR) of the binary output signals of two word end detectors.



Speech recognition method which creates a sequence of words on a probability basis from an input pattern by means of a pattern recogniser by comparison with stored patterns of all the recognisable words, said input pattern being created by

- splitting the digitised speech input signal into a plurality of frequency channels m by means of a bandpass filter bank, and

- pattern elements pfb which are determined from the digital speech samples of each frequency range b, consisting of W adjacent frequency channels m, over a time summation range f, consisting of L adjacent instants, according to

$$p_{fb} = \begin{cases} 0 & \text{for } u_{fb} \le u_{min} \\ \ln_2\left(u_{fb}/u_{min}\right) & \text{for } u_{fb} > u_{min} \end{cases}$$

where

$$u_{fb} = \sum_{t=fL+1}^{fL+L} \; \sum_{m=bW+1}^{bW+W} |s_{tm}|$$

and where |stm| represents the absolute value of the signal sample of a channel m at instant t and where umin is an arbitrarily selectable value.

Method according to claim 1, characterised in that word end detection is provided which determines the end of a word by logical ORing of the binary output signals of two word end detectors D1, D2, the energy equivalent ef being determined according to

$$e_f = \begin{cases} 0 & \text{for } u_f \le u_{min} \\ \ln_2\left(u_f/u_{min}\right) & \text{for } u_f > u_{min} \end{cases}$$

and used as the input variable of the word end detectors D1, D2, where

$$u_f = \sum_{b=0}^{B-1} u_{fb}$$

and where umin is an arbitrarily selectable value and b represents a frequency range, consisting of W adjacent frequency channels m, and B the number of frequency ranges, and wherein the word end detectors D1, D2 perform word end detection by comparing the energy equivalent ef with threshold values SWE11, SWE12, SWE21 and SWE22, and recognise a word end once the value of the energy equivalent ef has been greater than the first threshold value SWE11, SWE12 for longer than a first time period T11, T12 and is then lower than the second threshold value SWE21, SWE22 for at least the duration of a second time period T21, T22.

Method according to claim 2, characterised in that the first time periods T11, T12, the second time periods T21, T22 and the threshold values SWE11, SWE12, SWE21 and SWE22 of the two word end detectors D1, D2 are different.

Method according to one of claims 1 to 3, characterised in that the method is used in a hearing aid.