PROCEDURE FOR SPEECH RECOGNITION

Publication date: 15-06-2011

Number: AT0000510278T

Author: TSCHIRK, WOLFGANG

Assignee:

Contacts:

Application number: 46-72-0816

Application date: 22-10-2008

Technical Field

[1]

The present invention relates to a method for speech recognition that requires only little computing power to execute.

State of the Art

[2]

Speech recognition systems require considerable computing power and memory. Only for a few years have sufficiently fast processors been available for implementation in personal computers (PCs) and stand-alone computer systems (embedded systems) that offer the fast multiply functions needed by the usual speech recognition algorithms. Even small battery-powered devices (e.g. mobile phones) can already be equipped with simple speech recognition systems, for example for voice dialling, since the computing power of the processors used in these devices is sufficient for this purpose and enough memory is available.

[3]

The computing power required for speech recognition is considerable even for very simple tasks. For example, a speech recognition system with a vocabulary of 10 words needs a computing power of 5 to 20 million instructions per second (MIPS), 5 to 50 kbytes of program memory and 1 to 5 kbytes of working memory. These requirements pose no difficulty for modern mobile phones, since their processors deliver several hundred MIPS and include several MBytes of program and working memory.

[4]

If, however, only very little computing power is available, speech recognition systems according to the state of the art cannot be used. For instance, the processors used in digital hearing aids cannot run conventional speech recognition algorithms, because their computing power is typically 1 MIPS and only about 1 kbyte of program memory and about 100 bytes of working memory are available.

[5]

It is not possible to use other, more powerful processors for the voice control of hearing aids, since these would drain the available energy supply (usually miniature batteries) too quickly. The processors used in digital hearing aids do not offer the fast multiply functions required by state-of-the-art speech recognition systems, because these functions are not needed for the tasks a digital hearing aid has to perform.

Representation of the invention

[6]

The invention is based on the problem of specifying a method for speech recognition that requires only little computing power to carry out.

[7]

The problem is solved by a method which, by means of a pattern recognizer, creates from an input pattern, by comparison with stored patterns of all recognizable words, a ranking of the words according to their hit probability, the input pattern being created in that

- the digitized speech input signal is divided into a plurality of frequency channels m by means of a band filter bank, and
- pattern elements p_fb are determined from the digital speech samples of each frequency range b, consisting of W adjacent frequency channels m, over a temporal summation range f, consisting of L adjacent points in time t, according to

  $$p_{fb} = \begin{cases} 0 & \text{for } u_{fb} \le u_{\min} \\ \log_2\left(u_{fb}/u_{\min}\right) & \text{for } u_{fb} > u_{\min} \end{cases}$$

  where

  $$u_{fb} = \sum_{t=fL+1}^{fL+L} \; \sum_{m=bW+1}^{bW+W} |s_{tm}|$$

  and where |s_tm| represents the absolute value of the signal sample of channel m at time t and u_min represents an arbitrarily selectable value.

[8]

Any pattern recognizer may be used in this method; the advantageous properties of the invention are independent of the pattern recognizer selected. Naturally, existing speech recognizers can also be equipped with the method.
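
Since the text leaves the choice of pattern recognizer open, the following is only an illustrative sketch of one possibility, not the recognizer of the invention. It ranks words by the sum of absolute differences between the input pattern and each stored pattern, which needs no multiplication. The function name `rank_words`, the dictionary layout of the templates and the use of a distance as a stand-in for the hit probability are all assumptions made for this example.

```python
def rank_words(input_pattern, templates):
    """Return all words ordered from best to worst match.

    input_pattern: list of pattern elements (hypothetical layout).
    templates: dict mapping each recognizable word to a stored pattern
               of the same length as input_pattern.
    A smaller sum of absolute differences is treated as a higher hit
    probability; this proxy is an assumption, not taken from the text.
    """
    def distance(word):
        stored = templates[word]
        # Only subtraction, absolute value and addition are used.
        return sum(abs(x - y) for x, y in zip(input_pattern, stored))

    return sorted(templates, key=distance)


templates = {"yes": [1, 2, 3], "no": [5, 5, 5], "stop": [1, 2, 4]}
ranking = rank_words([1, 2, 3], templates)
print(ranking)
```

Any other classifier could take the place of `rank_words`, provided it outputs a ranking of all recognizable words.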

[9]

The method of the invention achieves the advantage that a speech recognizer can be built even with processors of low computing power.

[10]

A further important aspect of the invention is that the processor executing the method according to the invention does not need to have a multiply function.
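
The text does not say how the logarithm in the pattern elements is evaluated on such a processor. As a hedged illustration, the sketch below approximates the base-2 logarithm of a ratio u / u_min with comparisons and bit shifts only. The function name `floor_log2_ratio`, the restriction to integers and the rounding down to a whole number are assumptions made for this example.

```python
def floor_log2_ratio(u: int, u_min: int) -> int:
    """Integer part of log2(u / u_min), computed without multiplication.

    Returns 0 when u <= u_min, mirroring the piecewise definition of the
    pattern elements. u_min is assumed to be a positive integer.
    """
    if u_min <= 0:
        raise ValueError("u_min must be positive")
    result = 0
    threshold = u_min << 1      # u_min doubled by a left shift
    while threshold <= u:       # largest k with u_min * 2**k <= u
        result += 1
        threshold <<= 1
    return result


print(floor_log2_ratio(40, 10))
```

A finer resolution than whole numbers would need an extra step, for example a small lookup table, which is not shown here.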

[11]

The invention is particularly suitable for battery-powered devices, e.g. hearing aids, since the processors used in such applications can often perform multiplications only with great programming effort, and the time required for these multiplications is too long to implement a conventional speech recognizer.

Brief description of the drawings

[12]

The figures show, by way of example and schematically:

[13]

Fig. 1 Block diagram of a speech recognizer

[14]

Fig. 2 Determination of the pattern elements p_fb

[15]

Fig. 3 Block diagram of a pattern end point detection PSD

Embodiment of the invention

[16]

Fig. 1 shows, by way of example and schematically, a speech recognizer SE. This speech recognizer comprises a feature extraction VU, a pattern end point detection PSD and a pattern classification PC. As the output of each speech recognition operation, the speech recognizer SE delivers a ranking r of all recognizable words, ordered by their respective probability. This ranking r serves as the input signal of a control logic SL, which carries out the further processing or analysis of the recognized words on the basis of the ranking r.

[17]

The input signal of the speech recognizer is the audio signal s_t. From this audio signal s_t, the feature extraction VU determines the pattern elements p_f over temporal summation ranges, which serve as the input variable of the pattern classification PC (pattern recognizer). The feature extraction VU also determines from the audio signal s_t an energy equivalent e_f, which serves as the input variable of the pattern end point detection PSD (word end detector). After it has recognized the end of a word, the pattern end point detection PSD delivers the word end signal d to the pattern classification PC.

[18]

Fig. 2 shows, by way of example and schematically, the determination of the pattern elements p_fb and of the pattern p_f that the pattern elements form over a temporal summation range. The horizontal axis represents time, the vertical axis frequency. A time-discrete (digitized) speech input signal is divided into frequency channels m by means of a band filter bank (not shown). In each case W adjacent frequency channels m are combined into a frequency range b. In the present example the number of frequency channels is 10, the number W of adjacent frequency channels combined into a frequency range is 2, and thus 5 frequency ranges b are formed.
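
The grouping in this example can be written out as a short sketch. The helper name `group_channels` is hypothetical; the 1-based channel numbering m = bW+1 ... bW+W follows the summation limits given in the claims, with the frequency ranges b counted from 0.

```python
def group_channels(num_channels: int, w: int) -> list[list[int]]:
    """Combine W adjacent frequency channels m into frequency ranges b.

    Range b (counted from 0) contains the channels b*W+1 ... b*W+W.
    The channel count is assumed to be a multiple of W.
    """
    if w <= 0 or num_channels % w != 0:
        raise ValueError("num_channels must be a positive multiple of w")
    return [list(range(b * w + 1, b * w + w + 1))
            for b in range(num_channels // w)]


ranges = group_channels(10, 2)
print(len(ranges), ranges)
```

With 10 channels and W = 2 this yields the 5 frequency ranges of the example.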

[19]

Furthermore, the signals ([...]) of all frequency channels are combined over a number L of samples, and the pattern elements p_fb are thus determined according to

$$p_{fb} = \begin{cases} 0 & \text{for } u_{fb} \le u_{\min} \\ \log_2\left(u_{fb}/u_{\min}\right) & \text{for } u_{fb} > u_{\min} \end{cases}$$

The quantity u_min represents an arbitrarily selectable value; concrete values of u_min are set when the method according to the invention is implemented and when the speech recognizer is adapted to specific fields of application.

[20]

The values u_fb required for determining the pattern elements p_fb are given by

$$u_{fb} = \sum_{t=fL+1}^{fL+L} \; \sum_{m=bW+1}^{bW+W} |s_{tm}|$$

where |s_tm| represents the absolute value of the signal sample of channel m at time t.

[21]

The pattern elements p_fb of a frequency range b and a temporal summation range f are passed, together for each temporal summation range f, to the pattern recognizer PC (pattern classification), for which they form the input variable p_f.
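
The determination of the pattern elements for one temporal summation range can be sketched as follows. This is a readable floating-point illustration under stated assumptions, not the multiplication-free implementation the invention aims at: the function name `pattern_elements`, the row-per-sample data layout and the use of `math.log2` (reading the logarithm in the claims as base 2) are choices made for this example.

```python
import math


def pattern_elements(frame, w, u_min):
    """Compute the pattern p_f = [p_f0, p_f1, ...] for one summation range f.

    frame: list of L rows, one per point in time t; each row holds the
           samples s_tm of all frequency channels m (hypothetical layout).
    w:     number W of adjacent channels combined into one frequency range b.
    u_min: freely chosen positive floor value.
    """
    num_channels = len(frame[0])
    if num_channels % w != 0:
        raise ValueError("channel count must be a multiple of w")
    elements = []
    for b in range(num_channels // w):
        # u_fb: sum of |s_tm| over the L samples and the W channels of range b
        u_fb = sum(abs(row[m]) for row in frame
                   for m in range(b * w, b * w + w))
        elements.append(math.log2(u_fb / u_min) if u_fb > u_min else 0.0)
    return elements


frame = [[1, -1, 4, 4],
         [1, 1, -4, 4]]   # L = 2 samples, 4 channels
p_f = pattern_elements(frame, w=2, u_min=1)
print(p_f)
```

The list returned for each summation range f is what would be handed to the pattern classification PC as its input variable.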

[22]

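The feature extraction VU further determines the energy equivalent e_f of a temporal summation range f according to

$$e_f = \begin{cases} 0 & \text{for } u_f \le u_{\min} \\ \log_2\left(u_f/u_{\min}\right) & \text{for } u_f > u_{\min} \end{cases}$$

where

$$u_f = \sum_{b=0}^{B-1} u_{fb}$$

and where u_min represents an arbitrarily selectable value.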

[23]

This energy equivalent e_f of a temporal summation range f serves as the input variable of the pattern end point detection PSD (word end detector).
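
As a hedged illustration of the energy equivalent, the sketch below sums the band values u_fb of one temporal summation range over all B frequency ranges and applies the same piecewise logarithm as for the pattern elements, following the definition in claim 2. The function name `energy_equivalent` and the floating-point base-2 logarithm are assumptions made for this example.

```python
import math


def energy_equivalent(u_fb_values, u_min):
    """Energy equivalent e_f of one temporal summation range f.

    u_fb_values: the B values u_fb of the summation range, one per
                 frequency range b (b = 0 ... B-1).
    u_min:       freely chosen positive floor value.
    """
    u_f = sum(u_fb_values)          # u_f = sum over b of u_fb
    return math.log2(u_f / u_min) if u_f > u_min else 0.0


e_f = energy_equivalent([4, 4, 8], u_min=1)
print(e_f)
```

Because u_f reuses the sums u_fb already formed for the pattern elements, no additional pass over the signal samples is needed.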

[24]

Fig. 3 shows the block diagram of a pattern end point detection PSD (word end detector). The pattern end point detection PSD comprises two word end detectors D1 and D2, both of which receive the energy equivalent e_f of a temporal summation range f as input and, when a word end is recognized, each deliver a word end signal d1, d2 with the value "true" to the logical OR link OD. By ORing the two word end signals d1, d2, this logical OR link OD forms the word end signal d, which is passed to the pattern classification PC.

[25]

Each word end detector D1, D2 performs an identical method for detecting a word end, but the parameters of the method, consisting of a first threshold SWE11, SWE12, a second threshold SWE21, SWE22, a first period T11, T12 and a second period T21, T22, are different in the two word end detectors D1, D2.

[26]

The word end detectors D1 and D2 detect a word end by comparing the value of the energy equivalent e_f with the thresholds SWE11, SWE21, SWE12 and SWE22. A word end signal d1, d2 is generated once the value of the energy equivalent e_f has been greater than the first threshold SWE11 or SWE12 for longer than a first period T11 or T12 and is thereafter lower than the second threshold SWE21 or SWE22 for at least the duration of a second period T21 or T22.
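
The two-threshold rule and the OR link can be sketched as a small state machine. Several details are assumptions made for this example and are not stated in the text: the periods are counted in temporal summation ranges (frames), the threshold conditions must hold for consecutive frames, and a detector resets itself after signalling a word end. The names `WordEndDetector` and `word_end` and all parameter values are hypothetical.

```python
class WordEndDetector:
    """One word end detector with thresholds (swe1, swe2) and periods (t1, t2)."""

    def __init__(self, swe1, swe2, t1, t2):
        self.swe1, self.swe2, self.t1, self.t2 = swe1, swe2, t1, t2
        self.reset()

    def reset(self):
        self.above = 0            # consecutive frames with e_f > swe1
        self.below = 0            # consecutive frames with e_f < swe2
        self.word_seen = False    # True once e_f stayed above swe1 longer than t1

    def step(self, e_f) -> bool:
        """Feed one energy equivalent e_f; return True when a word end is found."""
        if not self.word_seen:
            self.above = self.above + 1 if e_f > self.swe1 else 0
            if self.above > self.t1:
                self.word_seen = True
            return False
        self.below = self.below + 1 if e_f < self.swe2 else 0
        if self.below >= self.t2:
            self.reset()
            return True
        return False


def word_end(detectors, e_f) -> bool:
    """Logical OR of the detector outputs (the role of the OR link OD)."""
    results = [d.step(e_f) for d in detectors]   # every detector sees every frame
    return any(results)


d1 = WordEndDetector(swe1=5, swe2=2, t1=2, t2=2)
d2 = WordEndDetector(swe1=3, swe2=1, t1=1, t2=3)
signal_d = [word_end([d1, d2], e_f) for e_f in [6, 6, 6, 1, 1]]
print(signal_d)
```

Running two detectors with different thresholds and periods lets whichever parameter set fits the current word trigger the word end signal d.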

[27]

SE: Speech recognizer
VU: Feature extraction
PC: Pattern classification
PSD: Pattern end point detection
SL: Control logic
s_t: Audio signal at time t
[...]: Pattern element of the temporal summation range f
e_f: Energy equivalent of the temporal summation range f
r: Ranking
d: Word end signal
d1: Word end signal of the word end detector D1
d2: Word end signal of the word end detector D2
b: Frequency range
B: Number of frequency ranges
m: Frequency channel
W: Number of adjacent frequency channels
f: Temporal summation range
L: Number of samples in the temporal summation range f
[...]: Pattern element of the temporal summation range f and frequency range b
t: Point in time
D1, D2: Word end detectors
SWE11: First threshold value of the word end detector D1
SWE21: Second threshold value of the word end detector D1
SWE12: First threshold value of the word end detector D2
SWE22: Second threshold value of the word end detector D2
T11: First period of the word end detector D1
T21: Second period of the word end detector D1
T12: First period of the word end detector D2
T22: Second period of the word end detector D2
OD: Logical OR link

[28]

The method involves generating a ranking of words, ordered by hit probability, from an input pattern through comparison with stored patterns of recognizable words by using a pattern recognizer. A digital speech input signal is divided into frequency channels by using a band filter bank, and pattern elements are determined from the digital speech samples of a frequency range comprising several frequency channels over a temporal summation range. A word end detection is provided for detecting the ends of the words through logical disjunction of the binary output signals of word end detectors.

Speech recognition method which creates a sequence of words on a probability basis from an input pattern by means of a pattern recogniser by comparison with stored patterns of all the recognisable words, said input pattern being created by

- splitting the digitised speech input signal into a plurality of frequency channels m by means of a bandpass filter bank, and

- pattern elements p_fb which are determined from the digital speech samples of each frequency range b, consisting of W adjacent frequency channels m, over a time summation range f, consisting of L adjacent instants, according to

  $$p_{fb} = \begin{cases} 0 & \text{for } u_{fb} \le u_{\min} \\ \log_2\left(u_{fb}/u_{\min}\right) & \text{for } u_{fb} > u_{\min} \end{cases}$$

  where

  $$u_{fb} = \sum_{t=fL+1}^{fL+L} \; \sum_{m=bW+1}^{bW+W} |s_{tm}|$$

  and where |s_tm| represents the absolute value of the signal sample of a channel m at instant t and where u_min is an arbitrarily selectable value.

Method according to claim 1, characterised in that word end detection is provided which determines the end of a word by logical ORing of the binary output signals of two word end detectors D1, D2, the energy equivalent e_f being determined according to

$$e_f = \begin{cases} 0 & \text{for } u_f \le u_{\min} \\ \log_2\left(u_f/u_{\min}\right) & \text{for } u_f > u_{\min} \end{cases}$$

and used as the input variable of the word end detectors D1, D2, where

$$u_f = \sum_{b=0}^{B-1} u_{fb}$$

and where u_min is an arbitrarily selectable number and b represents a frequency range, consisting of W adjacent frequency channels m, and B the number of frequency ranges, and wherein the word end detectors D1, D2 perform word end detection by comparing the energy equivalent e_f with threshold values SWE11, SWE12, SWE21 and SWE22, and recognise a word end once the value of the energy equivalent e_f has been greater than the first threshold value SWE11, SWE12 for longer than a first time period T11, T12 and is then lower than the second threshold value SWE21, SWE22 for at least the duration of a second time period T21, T22.

Method according to claim 2, characterised in that the first time periods T11, T12, the second time periods T21, T22 and the threshold values SWE11, SWE12, SWE21 and SWE22 of the two word end detectors D1, D2 are different.

Method according to one of claims 1 to 3, characterised in that the method is used in a hearing aid.

CPC classification

G10L15/02, G10L25/18, H04R25/558

IPC classification

G10L15/02, G10L25/18