METHOD FOR MANAGING SOUND SIGNAL CODING AND RENDERING IN AN ASYNCHRONOUS TRANSMISSION SYSTEM

15-02-2005 Publication date
Number:
AT0000287625T
Assigned to:
Contacts:
Application number: 82-59-0094
Application date: 22-06-2000

[1]

The present invention relates to a method of managing asynchronism in audio transmission.

THE GENERAL FIELD OF THE INVENTION

[2]

Generally, the invention relates to transmission systems using reduced-bit-rate speech coders, wherein the signals do not carry the reference clock of the coding system of the source (sampling rate of the encoder). This is the case, for example, for IP-based (Internet protocol) transmissions or for discontinuous transmissions, etc.

[3]

A general purpose of the invention is to overcome the problems, encountered with such systems, of producing a continuous flow of decoded speech or sound.

[4]

Traditionally, telephone communication networks and sound channels have used analog transmission systems and frequency-division multiplexing (primary group, amplitude and frequency modulation). Under these conditions, the speech signal (or music; the term speech will be used generically in the remainder of this document) is converted into an electrical signal by a microphone, and it is this analog signal which is filtered and modulated before being presented to the receiver, which amplifies it prior to its presentation to the reproduction system (earpiece, loudspeaker, and so on).

[5]

For a number of years, digital transmission and switching techniques have gradually replaced analog techniques. In these systems, called PCM (pulse code modulation), the speech signal is sampled and converted into digital form by an analog-to-digital converter (ADC in the remainder of this document) driven by a fixed sampling frequency derived from a master network clock and also known to the receiving system. This is the case for the ARU and URN (analog and digital subscriber terminal units) of the telecommunications network. The digital signal received by the recipient (in the broad sense) is converted back to analog for listening using a digital-to-analog converter (DAC in the remainder of this document) driven by a clock of the same frequency as that used by the ADC of the source. Under these conditions, the entire system is perfectly synchronous, as is generally the case with present switching and transmission systems. These may include rate-reduction systems (for example for the telephone signal, from 64 kb/s to 32, 16 or 8 kb/s). It is the network (or the end systems, as for example in the case of ISDN) which takes over the operations of ADC, encoding, decoding (encoding and decoding taken here in the sense of rate reduction) and DAC. The clocks are always distributed, and the chain ADC, speech coder, transmission and switching, speech decoder and finally DAC is perfectly isochronous. There are no losses or repetitions of speech samples at the decoder.

[6]

The synchronous transmission techniques described above require the presence of a reference clock throughout the network. Increasingly, transmission techniques (for data at first) are turning to asynchronous packet techniques (IP, ATM). In this new situation, the decoder has no reference as to the sampling frequency used by the encoder and must reconstitute, by its own means, a decoding clock that attempts to follow the reference of the encoder. The present invention is thus particularly attractive for telephony systems over frame relay, for telephony over ATM, or for IP-based telephony. The disclosed technique can easily be used in other fields of speech or sound transmission for which there is no effective transmission of the reference clock of the encoder to the decoder.

STATE OF THE ART

Statement of the general problem

[7]

The general problem posed by the transmission systems to which the invention applies is to overcome the fact that the speech or sound decoder does not have a reference clock related to the encoding of the source.

[8]

Two cases may be distinguished in this regard: those corresponding to "weak asynchronism" and those corresponding to "strong asynchronism".

" The asynchronism low "

[9]

By way of illustration, consider the case of a transmission system that has, as schematically illustrated in Figure 1:

  • a source coding block 1 comprising an analog-to-digital converter driven by a reference clock of frequency fADC equal to 8 kHz (to fix the numerical values in the rest of the presentation) and a speech encoder (more or less complex, reducing to a greater or lesser degree the bit rate to be transmitted);
  • an asynchronous transmission system (shown schematically by the link 2) that transmits the information output by the source coding block using its own transmission clock and its own protocols (for example, one can imagine that the speech encoder produces a bit rate of 8 kb/s and that the transmission system is an asynchronous RS-232 link at 9600 bits/s);
  • a receiving and decoding system 3 which receives the information transmitted over the asynchronous link (whose bit rate must obviously be somewhat higher than the raw coded bit rate, for example 9600 bits/s instead of 8000 bits/s), generates the signal after decoding (decompression) and sends the generated signal to a digital-to-analog converter connected to a sound transducer such as a loudspeaker, a telephone handset, headphones or a sound card installed in a PC.

[10]

It is understood that since the receiving and decoding system 3 has no reference clock, it must apply a strategy to overcome this lack of synchronization between the encoder and the decoder.

[11]

Whatever the encoding technique used, and for any type of transmission that does not directly convey a clock, time markers in the transmitted frames or information on the transmission timing, the problem discussed previously can be reduced (abstracting away the speech coder, the asynchronous transmission system and the speech decoder) to a system comprising, as shown in Figure 2:

  • an analog-to-digital converter 4 responsible for digitizing the sound or speech signals at a sampling rate fixed by a local oscillator;
  • a digital-to-analog converter 5 responsible for reproducing the sounds on a transducer suited to the intended application, and which operates at a sampling rate given by a local oscillator of a priori the same frequency, but never exactly the same frequency at tolerable construction costs (very stable and highly accurate sources exist, but they must be temperature-compensated and their cost is prohibitive for large-volume industrial implementations);
  • a digital register 6 into which the converter 4 writes at its sampling frequency (fADC), this register being read at the sampling frequency (fDAC) of the reproduction system by the digital-to-analog converter (DAC).

[12]

It is understood that since the two clocks (frequencies fADC and fDAC) are different, from time to time the DAC will replay the same information twice (if fDAC is greater than fADC) or, conversely (if fDAC is less than fADC), information will be overwritten by the ADC before the DAC has been able to replay it.
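This slip can be illustrated with a small simulation, a sketch not taken from the patent: an ADC writes successive sample indices at its own rate while a DAC reads at a slightly different rate, and one counts the re-read (repeated) and overwritten (skipped) samples. The rates used below are illustrative assumptions.

```python
# Event simulation of the shared register of Figure 2: the ADC writes and
# the DAC reads at slightly different rates; when the DAC is faster it
# re-reads a value, when slower some values are overwritten unread.

def simulate(f_adc, f_dac, duration_s):
    writes = [n / f_adc for n in range(int(duration_s * f_adc))]
    reads = [n / f_dac for n in range(int(duration_s * f_dac))]
    # each read returns the index of the last write that occurred before it
    read_indices = []
    w = 0
    for t in reads:
        while w < len(writes) and writes[w] <= t:
            w += 1
        read_indices.append(w - 1)
    repeats = sum(1 for a, b in zip(read_indices, read_indices[1:]) if a == b)
    skipped = sum(b - a - 1 for a, b in zip(read_indices, read_indices[1:]) if b > a + 1)
    return repeats, skipped

# DAC 0.1% faster than the ADC: roughly one repeated sample per second
print(simulate(1000.0, 1001.0, 2.0))   # (2, 0)
```

The repeat rate grows linearly with the relative clock offset, which is exactly the "slip" mechanism quantified in the paragraphs that follow.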

[13]

The oscillators commonly found commercially are characterized by a frequency accuracy (within a certain temperature range).

[14]

Oscillators accurate to 50 ppm (parts per million) are quite common and are used as the basis for the following calculation, which indicates the rate of sample skips or repetitions for a sampling frequency of 8 kHz (the reader will easily verify that for higher sampling frequencies the skips and repetitions scale with the ratio of the sampling frequencies; the higher the sampling frequency, the more frequent the skips or repetitions).

[15]

In the least favorable conditions, one then has an ADC operating at 8000 × (1 + 50e-6) Hz and a DAC operating at 8000 × (1 − 50e-6) Hz. In this particular example, the skip period (removal of samples at the DAC, since fDAC is less than fADC) is simply calculated by counting the number of DAC periods (each longer than an ADC period) needed for the accumulated difference of the periods to equal one whole DAC period.

[16]

Let PDAC be the period of the DAC (here 1/(8000 × (1 − 50e-6))) and PADC the period of the ADC (here 1/(8000 × (1 + 50e-6))); one then notes that n × (PDAC − PADC) = PDAC, where n is the number of elementary periods over which the difference of the periods accumulates to one whole period. Setting ε = 50e-6 and applying the usual simplification rules for small numbers, one obtains n = 1/(2ε). In our example this immediately gives a skip period close to 1.25 seconds. If the accuracy of the local oscillators is improved (e.g. from 50e-6 to 5e-6), then the skip period increases (here to every 12.5 seconds).
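The calculation above can be checked numerically; the following sketch reproduces the worked example (8 kHz, ±50 ppm, worst case in opposite directions) and recovers the skip period of about 1.25 seconds.

```python
# Skip-period calculation from the text: nominal 8 kHz rate, +/-50 ppm
# oscillator tolerances pulling in opposite directions (worst case).
f_nominal = 8000.0
eps = 50e-6                      # oscillator tolerance (50 ppm)

f_adc = f_nominal * (1 + eps)    # fastest ADC
f_dac = f_nominal * (1 - eps)    # slowest DAC

p_adc = 1.0 / f_adc              # ADC sample period
p_dac = 1.0 / f_dac              # DAC sample period

# Number of DAC periods for the accumulated difference to reach one whole
# period: n * (p_dac - p_adc) = p_dac, i.e. n ~= 1/(2*eps) for small eps.
n = p_dac / (p_dac - p_adc)
skip_period_s = n * p_dac        # time between two sample skips

print(round(n))                  # 10000, i.e. 1/(2*eps)
print(round(skip_period_s, 2))   # 1.25 seconds between skips
```

With eps = 5e-6 the same formula gives a skip period of 12.5 seconds, matching the text.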

[17]

This phenomenon of "slip" of one clock relative to the other results, when one is within a complete transmission system with audio coders operating on signal frames, in missing speech frames (no frame to decode within the time allotted for decoding) or in an overabundance of frames (two frames to decode instead of one within the allotted time). Indeed, taking the example of a speech coder operating with 30 ms frames at 8 kHz, i.e. 240 samples, the receiver, and more particularly the decoder, expects to receive one frame to decode in each 30 ms time slot in order to ensure continuous rendering of the speech signal. Now if, for example, fADC is less than fDAC, we will have, under the preceding assumptions, an absence of a frame of samples to decode for the sound rendering system every 240 × 1.25 = 300 seconds, and conversely (fADC greater than fDAC) two frames instead of one (i.e. one frame to "suppress") to decode every 300 seconds. In this case, the offending phenomenon of skipping or repeating samples becomes truly unpleasant, since it is a whole signal block that is skipped or repeated, and it therefore requires proper management.

" The asynchronism strong "

[18]

Certain types of transmission amplify this asynchronism problem beyond the clock "slip" phenomenon described above. This is what is meant here by "strong asynchronism".

[19]

Indeed, when the transmission is not perfect and introduces losses of samples or of frames of samples, and also when the transmission generates jitter on the arriving samples, unrelated to the transmit clock or the receive clock but related to other mechanisms of the transmission chain having their own clocks, the receiving system may then be confronted with the absence of multiple frames, or with an overabundance of several frames. This is for example the case for IP-based networks, with the phenomenon of packet loss and that of the jitter introduced when routing packets. These phenomena greatly disturb the continuity of the acoustic rendering of the audio signal. Indeed, in the case of a packet loss, or of jitter having delayed one or more packets, the rendering system will be without any samples (one frame of samples or none) to send to the DAC to ensure the continuity of the acoustic rendering. Conversely, in the case of pronounced jitter, the rendering system can end up with too many frames or samples to be sent to the DAC at the same time. Indeed, with high jitter, the transmission of sound-signal frames may take the form of bursts, thereby creating strong phenomena of holes and of overabundance of frames of samples.

[20]

It should be noted that when using speech coders operating with a transmission system of the VAD/DTX/CNG type (voice activity detection / discontinuous transmission / comfort noise generation, according to the usual terminology), a mechanism similar to the case of packet loss is also introduced, since in the event of silence the transmitter stops transmitting frames of samples. This stoppage of sample transmission may in effect be likened, at the receiver, to the phenomenon of packet loss, or to the case where the clock of the ADC is slower than that of the DAC, which causes, as has been shown above, holes in the signal at the receiver.

[21]

"Dyssynchrony strong" is distinguished therefore "dyssynchrony low" in involving instead only jumps and/or repetitions of cyclic ways, but also holes in the signal and/or of the overabundance of signal and this non cyclic and multiple.

Description of the different existing methods.

[22]

Two methods are mainly known for overcoming the disadvantages due to the fact that the speech or sound decoder does not have a reference clock.

[23]

The first is simply to proceed as indicated in the paragraph describing "weak asynchronism", i.e. by skipping or repeating samples. The decoding system renders the samples at a rate approximately equal to that of the encoder and presents them to the digital-to-analog converter at this rhythm (the means for performing such a reconstruction are known to the skilled artisan). In some cases, for example in the case of "strong asynchronism" with framed transmission, it is preferred, in the absence of a frame to play, to send frames of zero samples to the DAC rather than to repeat the previous frame. Conversely, when there is a surplus of samples, they are not deleted directly, but a FIFO memory of a certain size may be used to absorb part of the jitter. Overfilling of the FIFO then leads to a complete or partial emptying of the FIFO, thereby again creating jumps in the sound reproduction.
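As an illustration of this first method, here is a minimal sketch (not taken from the patent; the frame size follows the 30 ms / 8 kHz example and the FIFO depth is an arbitrary assumption) of a receive FIFO that sends zero frames on underrun and partially empties itself on overfill:

```python
from collections import deque

FRAME = 240          # samples per 30 ms frame at 8 kHz (from the text)
MAX_DEPTH = 6        # assumed FIFO capacity, in frames

fifo = deque()

def on_frame_received(frame):
    """Store a decoded frame; on overfill, partially empty the FIFO,
    which creates an audible jump (the defect noted in the text)."""
    fifo.append(frame)
    if len(fifo) > MAX_DEPTH:
        for _ in range(len(fifo) // 2):   # drop the oldest half
            fifo.popleft()

def next_frame_for_dac():
    """Feed the DAC; with no frame available, send zeros rather than
    repeating the previous frame (the variant described in the text)."""
    if fifo:
        return fifo.popleft()
    return [0] * FRAME

# a burst of 8 frames arrives at once (strong jitter)
for i in range(8):
    on_frame_received([i] * FRAME)
print(len(fifo))                 # 5: overfill triggered a partial emptying
print(next_frame_for_dac()[0])   # 3: oldest surviving frame
fifo.clear()
print(next_frame_for_dac()[0])   # 0: underrun -> zero frame
```

Both defects the text attributes to this method are visible: the partial emptying discards signal, and the zero frame creates a hole in the rendering.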

[24]

The second method, more elaborate and with better performance, requires the implementation of a hardware clock-recovery loop slaved to the filling level of a buffer of the signal to be decoded (or transmitted, as for instance in AAL1 of ATM). This servo method attempts, through the clock-recovery loop, to recover the sampling frequency of the source. The filling level of the receiver's buffer generates a control signal to slave a PLL (digital or analog).

[25]

The first method is extremely simple but has a serious defect related to the quality of the reproduced sound. Indeed, a skip or deletion every 1.25 seconds can be highly objectionable to listen to, even in the case of "weak asynchronism" with correction of the asynchronism at the sample level. Similarly, in the case of a system operating with frames of samples, the repetitions or blanks that are introduced, and the discontinuities in the signal caused by suppressing frames, amplify the degradation of the rendered quality, which becomes highly noticeable and highly disruptive to the listener.

[26]

In addition, the use of a first-in/first-out (FIFO) memory risks introducing delay into the transmission, which also affects the overall quality of the communication.

[27]

The second method is, for its part, much more complex and requires clock servo-control and thus specific hardware. However, it ensures a partial synchronism and hence avoids the problems of managing asynchronism. This method is nevertheless ill-suited to discontinuous transmission systems, to systems with frame loss, and also to systems with high jitter. In these cases, the synchronization information is not available. Further, this method is not feasible on terminal platforms where clock servo-control is not possible, as is particularly the case with PC-type terminals, for example, where the acoustic renderer used would be the sound card.

[28]

Document WO 99/17584 already discloses devices implementing a method according to the preamble of claim 1, these devices having only one buffer.

[29]

Document US-A-4 703 477 facilitates the reading of voice information by implementing a method of placing end to end frames relating to the same voice information.

PRESENTATION OF THE INVENTION

[30]

A general purpose of the invention is to propose a solution to the problems of continuity of the rendering of the speech signal in the presence of an asynchronous transmission, by acting at the receiver, i.e. at the end of the transmission chain.

[31]

To this end, the invention provides a method according to claim 1.

[32]

The method is simple to implement and makes it possible to guarantee quality of service while avoiding excessively increasing the transmission delay, and to manage the holes in the speech signal efficiently. In addition, it does not involve any specific hardware such as a servo circuit, and can therefore be adapted quickly to different platforms, terminals and asynchronous networks.

[33]

This method is advantageously supplemented by the following various characteristics, taken alone or in any of their technically possible combinations:

  • voice activity detection is implemented and the frames considered non-active by this detection are deleted when the filling ratio is between a first and a second threshold, and concatenation processing is carried out on two successive frames when the filling ratio is between the second and a third threshold;
  • the first and second thresholds are merged;
  • a possibly missing or erroneous frame, or a possible absence of samples to be rendered, is detected at the input or output of a decoding block comprising a first input buffer and/or an output buffer, and a dummy frame maintaining the continuity of the audio rendering is generated when such a missing or erroneous frame, or such an absence of samples to be reproduced, is detected;
  • in the case where the decoding block applies its decoding processing cyclically to the contents of the first buffer, the detection of a possibly missing or erroneous frame or of a possible absence of samples to be restored is implemented with the same cycle, this detection occurring sufficiently far in advance of the decoding processing to generate the dummy frame in time;
  • no dummy frame is generated when the detection of a missing or erroneous frame occurs for a frame for which an absence of samples has already been detected;
  • in the case where the system is of a type which can voluntarily stop transmitting frames, the type of the frame previously generated is stored from one frame to the next, and it is determined on the basis of this information whether to generate dummy frames or silence frames;
  • the concatenation processing of two successive frames weights the samples so as to give more importance to the first samples of the first frame and to the last samples of the second;
  • the threshold(s) is (are) adaptive;
  • a threshold is a function of the time spent with the filling ratio above a given threshold.
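The concatenation characteristic listed above can be sketched as a weighted merge of two successive frames into one; the linear weighting ramp below is an illustrative assumption, as the patent does not specify the exact weights:

```python
# Weighted concatenation of two successive N-sample frames into a single
# N-sample frame: the weights favour the start of the first frame and the
# end of the second (a linear cross-fade, assumed here for illustration).

def concatenate(frame_a, frame_b):
    n = len(frame_a)
    out = []
    for i in range(n):
        w = i / (n - 1)                      # 0 at start, 1 at end
        out.append((1 - w) * frame_a[i] + w * frame_b[i])
    return out

a = [100] * 4
b = [-100] * 4
print(concatenate(a, b))   # [100.0, ~33.3, ~-33.3, -100.0]
```

Two frames thus collapse into one, draining the buffer by one frame while avoiding the hard discontinuity that simply deleting a frame would produce.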

[34]

The invention also relates to a device for rendering a speech signal, comprising a first buffer memory which receives coded frames, means for performing decoding processing on the frames stored in said first buffer, a second buffer memory which receives the decoded frames output by the decoding means, and sound rendering means receiving said frames output from the second buffer memory, characterized in that it further comprises means for implementing the above method.

[35]

As will be understood on reading the description that follows, these means are essentially computer means.

THE FIGURES

[36]

Other features and advantages of the invention will become apparent from the description that follows, which is purely illustrative and non-limiting and must be read with reference to the attached drawings, in which:

  • figure 1 is a schematic representation of an asynchronous transmission chain;
  • figure 2 is a graph illustrating a modeling of such a transmission chain;
  • figure 3 is a schematic diagram of a receiving device;
  • figure 4 illustrates signals obtained by implementing the concatenation processing proposed by the invention.

DETAILED DESCRIPTION OF ONE OR MORE EMBODIMENTS

[37]

The method of managing the asynchronism of the transmission according to the invention implements two processing operations corresponding to the handling of the two phenomena previously described, namely the absence of samples and the surplus of samples.

1. Presentation of the sound reproduction chain of a typical transmission application.

[38]

As shown in Figure 3, the chain rendering the speech signal consists of three elements:

  • A block 10 which receives the samples or frames of codes from the network. This block 10 contains a FIFO or circular buffer memory 11 (referred to as "FIFO 1" in the remainder of the document) for temporarily storing the frames before decoding.
  • A decoding block 12 which takes the frames from block 10, decodes them, and stores them in turn in a FIFO memory 13 (referred to as "FIFO 2" in the remainder of the document).
  • A rendering block 14 which takes the frames of decoded samples and sends them to the sound rendering system 15, whatever it may be.

[39]

Depending on the terminals and on the organization of this chain, the frequency of the audio reproduction clock (that of the digital-to-analog converter, fDAC) is not necessarily directly related to all of these blocks. Block 14, being in direct connection with the rendering system, is directly tied to fDAC. The other blocks may instead be tied to the arrival rate of the frames from the network rather than to fDAC. Indeed, taking the example of a terminal with a multitasking system in which each block is performed by a specific task, tasks 10 and 12 could be bound together by the reception of frames: task 10 waits for a frame from the network, which is then decoded by task 12 and placed in "FIFO 2".

[40]

As for task 14, clocked by fDAC, it continuously draws samples from the memory "FIFO 2" and sends them to the sound rendering system.

[41]

It can therefore be seen that, in the case of "strong or weak asynchronism", it is the management of the memory "FIFO 2" that will require special care. Similarly, if task 12 had been strongly bound to task 14, it would instead have been the memory "FIFO 1" that required particular attention.

[42]

The mechanism according to an embodiment of the invention is presented by applying the management to the memory "FIFO 2", but it will be seen in the course of the explanations how to transpose it, where appropriate, to manage the memory "FIFO 1".

2. Absence of samples

[43]

To ensure continuous sound rendering in the absence of samples, the two potential sources of absence of samples to be rendered are treated in parallel. The first corresponds to the packet-loss information, and the second corresponds to the absence of samples to be rendered (e.g. "FIFO 2" empty) when it is necessary to supply samples to the sound rendering system.

2.1 Loss of frames, or erroneous frames

[44]

The processing of frame losses or erroneous frames presupposes a transmission system providing access to information on the loss of frames or on the reception of erroneous frames. This is often the case in transmission systems.

[45]

For example, for IP networks, it is possible to use the marking of the data packets originating from the RTP layer, which gives the number of samples lost between two receptions of coded audio packets. This information on lost frames, or in the IP case on lost packets (each containing one or more speech frames), will generally only be known on receipt of the packet following the lost packet(s).
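As a hedged illustration of such marking, the loss count between two received packets can be derived from RTP sequence numbers (16-bit, with wrap-around); the helper below is an assumption about how a receiver might exploit this information, not a construct from the patent:

```python
# Loss detection from RTP sequence numbers: the loss is only known on
# receipt of the packet that follows it, as noted in the text.

def lost_between(prev_seq, new_seq):
    """Number of packets lost between two consecutively received packets."""
    gap = (new_seq - prev_seq) % 65536   # RTP sequence numbers are 16-bit
    return max(gap - 1, 0)

print(lost_between(1000, 1001))   # 0: consecutive, nothing lost
print(lost_between(1000, 1004))   # 3: packets 1001-1003 were lost
print(lost_between(65535, 2))     # 2: wrap-around, packets 0 and 1 lost
```

Multiplying the packet gap by the number of speech frames per packet gives the number of frames to conceal.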

[46]

It is not necessary to act immediately, as long as one or more valid frames can be decoded. In fact, with recent speech codecs (CELP coders, transform coders, ...), in order to maintain the acoustic rendering quality, it is often necessary to maintain a certain synchronism between the encoder and the decoder. The loss of this codec synchronism may be compensated for by using lost-frame correction algorithms associated with the speech coder used. Such algorithms are for example provided in the standard of some speech codecs (e.g. ITU-T G.723.1). With the use of simpler coders, this mechanism is not necessarily required.

[47]

When a large number of frames has been lost, it is possible to limit the number of "dummy" frames of samples to be generated, in order to avoid filling the memory "FIFO 2" unnecessarily. The purpose of the dummy-frame generation processing is to fill in the holes so as to provide a continuous signal, but also to smooth the internal variables of the decoder in order to avoid excessive differences when decoding the valid frames that follow the invalid or lost frames, and thus to avoid an audible discontinuity. After the generation of a few frames, the variables can be considered smoothed, and the dummy frames can thus be restricted to a small number (e.g. 4 to 6) when a large number of frames has been lost.

[48]

As will have been understood, this processing is thus slaved to the lost-frame information.

[49]

A similar process is performed with respect to the information on frames called "invalid frames". This information is transmitted to the decoder by the network portion of the receiver, early enough to allow the implementation of a frame-correction algorithm which, taking this invalid frame into account, ensures the continuity of the signal, and thus to avoid another source of absence of samples in the memory "FIFO 2".

[50]

In short, the first processing corresponds to the management of information of the type "n frames lost" or "invalid frame received" coming from the network layer of the receiver. This management is characterized by the implementation of a lost-frame correction algorithm (also called in this document the "dummy"-frame generation algorithm). The first processing therefore acts at the level of the decoding task and supplies the memory "FIFO 2".

2.2 Absence of samples to render

[51]

This second processing is associated with the clock of task 14, i.e. with the clock fDAC. Indeed, as mentioned above, the memory "FIFO 2" (or "FIFO 1" if task 12 is interleaved with task 14) may no longer contain any samples at a time when it is necessary to provide samples to the sound rendering system. Samples must therefore be provided to the rendering system, while if possible avoiding rendering zeros (which strongly degrades the sound signal).

[52]

This second processing can be analyzed as a feedback loop on the decoding of the frames. This loop initiates the call to the lost-frame correction algorithm and must therefore be activated sufficiently early to enable the execution of the algorithm and the sending of samples to the sound rendering system. Depending on the platform, the call to this feedback can take different forms.

[53]

This loop can be implemented in two ways, which will now be described.

[54]

In the case of a single-task receiver (e.g. on a DSP without a multitasking operating system), the audio decoder is completely tied to the clock of the DAC (fDAC), and is thus permanently and cyclically waiting for a frame to be decoded. For example, with a speech coder using 30 ms frames, waiting loops with a period that is a multiple of 30 ms are constructed.

[55]

Thus, in the case of a 30 ms loop, the decoder will, every 30 ms, wait for a frame to be decoded in "FIFO 1" (which may simply correspond to the passage of a frame from the network layer to task 12). Upon arrival of a frame, the decoder decodes it and fills the memory "FIFO 2" for sending to the DAC. The feedback processing will be implemented if, at time t = T0 + 30 ms − Tc, there is no frame to decode in the memory "FIFO 1", where T0 = start time of the 30 ms wait loop, and Tc = execution time of the dummy-frame generation algorithm plus an additional margin corresponding to the interruptions and/or other ancillary processing which may occur before the end of the loop.

[56]

The processing is thus implemented with a wait timeout equal to Tb (loop time) − Tc (computation time + margin).
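The wait-with-timeout mechanism can be sketched as follows; this is a simplified illustration rather than the patent's implementation, where FRAME_MS and TC_MS are assumed values and queue.Queue stands in for "FIFO 1":

```python
import queue

FRAME_MS = 30          # codec frame duration (from the text)
TC_MS = 5              # assumed concealment run time + margin (Tc)

fifo1 = queue.Queue()  # frames arriving from the network layer

def decode(frame):
    # stands in for the real speech decoder
    return ("decoded", frame)

def generate_dummy_frame():
    # stands in for the lost-frame correction algorithm of the codec
    return ("concealed", None)

def one_loop_iteration():
    """Wait at most Tb - Tc for a frame; on timeout, conceal instead,
    so that samples still reach the DAC in time."""
    try:
        frame = fifo1.get(timeout=(FRAME_MS - TC_MS) / 1000.0)
        return decode(frame)
    except queue.Empty:
        return generate_dummy_frame()

fifo1.put(b"\x01\x02")
print(one_loop_iteration())   # ('decoded', b'\x01\x02')
print(one_loop_iteration())   # ('concealed', None), after the ~25 ms timeout
```

The margin TC_MS reserves enough of the 30 ms slot for the concealment algorithm to run before the DAC needs its next frame.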

[57]

In the case of a multitasking receiver (a PC terminal, for example), time is not managed as exactly, and a somewhat different processing must then be implemented. (Remark: this processing remains fairly close to the previous one, because it also seeks to take the time Tc into account.)

[58]

In such a case, often only waiting loops tied to events are available, for example the fact that packets have been received from the network, or the fact that the "nth" buffer (containing one or more frames of samples), previously sent to the sound rendering system, has been read by the DAC and is thus available again for resending to the DAC.

[59]

Depending on the implementation and on whether or not a quick response to the event is needed, it is possible to set a time delay prior to the filling of the buffer for re-transmission to the DAC. This time delay is chosen so as to allow sufficient time for the execution of the "dummy"-frame generation algorithm (if necessary).

[60]

Then, after this optional time delay, the processing checks for the presence of sufficient samples in "FIFO 2" (remark: this can be "FIFO 1" if the management is placed at that level), and otherwise requests the generation of the number of dummy frames needed to fill the buffer.

[61]

In the case where the system is such that the buffer must be refilled "immediately", the monitoring of the availability of samples, and potentially the call to the "dummy frame" generation processing, are implemented directly after each sending of the buffer to the DAC, so that the generated samples are already in the memory "FIFO 2" upon the event "buffer n available".

[62]

Thus, regardless of the receiver, the processing determines the absence of samples to be sent to the sound rendering system by implementing a controlled filling of the buffer "FIFO 2" (or "FIFO 1", depending on the management of the audio reproduction chain) and activates the "dummy"-frame generation algorithm to generate the missing samples.

[63]

As will have been understood, the second processing responds primarily to the clock-"slip" problem, and more precisely to the case where the receiver clock (fDAC) is faster than the transmit clock (fADC). It is also actuated with respect to the phenomenon of lost frames, because a loss may cause an absence of samples to be sent to the DAC before the frame loss is detected, this detection occurring only upon reception of the frame following the loss.

[64]

To tie together the actions of the first and second processing, the first processing is forbidden from generating "dummy" frames upon detection of lost frames if corresponding frames have just been generated by the second processing.

[65]

For this purpose, flags are used, as well as counters determining the number of samples generated by the second processing.

2.3 Specific actions in the case of speech coders using the VAD/DTX/CNG functionality.

[66]

Coders using a VAD/DTX/CNG system may voluntarily stop transmitting frames; in this case, the absence of samples should not be treated exactly as a loss of frames, but as a moment of silence. The only means of determining whether the frame to be generated should be silence or corresponds to a lost frame is to know the type of the frame previously generated (either a signal frame or a frame corresponding to a lost frame, or a noise-update frame (SID), or a silence frame). For this purpose, the type of the generated frame is stored, and when a frame must be generated for an absence of frame or a frame loss, it is determined whether to generate dummy frames using the lost-frame correction algorithm (case where the previous frame was a signal or lost frame), or silence frames by activating the decoder appropriately (case where the previous frame was a noise-update or silence frame).
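The decision described above can be sketched as a small state check; the type names SPEECH, SID and NO_DATA are illustrative assumptions standing in for the frame types of the text (signal/lost frame, noise-update frame, silence frame):

```python
# Decide between loss concealment and comfort-noise silence, based on the
# type of the previously generated frame (stored from frame to frame).

SPEECH, SID, NO_DATA = "SPEECH", "SID", "NO_DATA"

last_generated_type = SPEECH   # stored from one frame to the next

def on_missing_frame():
    """No frame arrived: was the transmitter in DTX silence, or was a
    speech frame really lost?"""
    if last_generated_type == SPEECH:
        return "dummy_frame"    # run the lost-frame correction algorithm
    return "silence_frame"      # drive the decoder as in DTX silence

print(on_missing_frame())       # dummy_frame: previous frame was speech
last_generated_type = SID
print(on_missing_frame())       # silence_frame: transmitter is in DTX
```

Without this stored type, a DTX silence would be wrongly concealed as speech, producing audible artefacts instead of comfort noise.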

3. Overabundance of samples to be rendered.

[67]

To treat the case of an overabundance of samples to be rendered, a frame-draining processing is used, with partial or complete suppression of certain frames before they can be taken into account by the sound rendering system.

[68]

This processing relies on a storage of the frames in the buffers, up to certain thresholds which trigger actions to minimize this storage and the additional delay it introduces into the communication chain. This storage allows a limited absorption of the phenomena of jitter, of burst frame reception and of clock slip, while limiting the transmission delay.

3.2 Draining processing

[69]

The accumulation of frames is first detectable at the memory "FIFO 1", and is then transferred to the memory "FIFO 2".

[70]

The proposed method manages the fill information of the reference buffer, i.e. "FIFO 1" or "FIFO 2", depending on how tasks 10, 12, and 14 (detailed previously) are implemented in the receiver. Indeed, if tasks 12 and 14 are tied together, the fill information used by the method is that provided by "FIFO 1", which buffers between the network and the sound playback system. Similarly, if tasks 10 and 12 are tied together, it is memory "FIFO 2" that buffers, and it is therefore its fill level that is taken into consideration by the management processing.

[71]

The process will now be described for the second case. The first case is deduced immediately by transposition.

[72]

To best maintain synchronization between coder and decoder, and thus optimal sound reproduction, it is chosen to decode all frames arriving from the network. Based on the fill information, the processing decides which action to apply to the decoded frame. This action is detailed below. To activate the processing, fill thresholds are used. These thresholds define alarm levels for the filling of the "FIFO". In order to operate as inaudibly as possible (that is, to limit quality degradation), two action levels are selected. A first level (alarm level 1) corresponds to a non-critical level of overfilling (far from the maximum tolerated fill); the second (alarm level 2) corresponds to an action on each frame (moderately close to the maximum tolerated fill). A third level (alarm level 3), called the safety level (to avoid memory overflows or other problems), has also been defined. It corresponds to a fill level close to the maximum tolerated. This alarm level is never actually reached if the actions of the two previous thresholds are performed correctly and if the thresholds are properly defined.
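The three alarm levels can be sketched as a simple threshold comparison. The function and the threshold values below are illustrative assumptions (thresholds expressed in multiples of a hypothetical frame size T), not part of the original method:

```python
def alarm_level(fill: int, thresholds: dict) -> int:
    """Map a buffer fill level to an alarm level.

    thresholds: {1: t1, 2: t2, 3: t3} with t1 < t2 < t3; level 0 means
    no alarm. Values below are hypothetical, in frames of size T.
    """
    if fill > thresholds[3]:
        return 3   # safety level: fill close to the tolerated maximum
    if fill > thresholds[2]:
        return 2   # critical: act on every frame
    if fill > thresholds[1]:
        return 1   # non-critical overfilling: drop low-information frames
    return 0       # normal operation

T = 160  # hypothetical frame size in samples
th = {1: 8 * T, 2: 12 * T, 3: 24 * T}
print(alarm_level(5 * T, th))   # 0: no alarm
print(alarm_level(10 * T, th))  # 1
print(alarm_level(30 * T, th))  # 3
```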

[73]

At each decoding, the fill information is compared with the thresholds to determine the status of the "FIFO" (in alarm or not) and, where appropriate, the alarm level.

[74]

If the state obtained is not an alarm state, no action is performed, and the decoded frame is stored in "FIFO 2".

[75]

In the first alarm state, it is considered that at least 50% of the signal of a conversation is unnecessary; at this alarm level, all frames carrying only very little information are therefore removed. For this purpose, the processing may implement a simple VAD (voice activity detector) which examines all frames of samples after decoding and decides whether or not to write them into "FIFO 2". The processing can also decide, from information present directly in the coded frame, whether the information it contains is important. In this alarm state, any frame thought to contain only noise is not stored in "FIFO 2" for future sound playback.
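A minimal sketch of this frame-dropping step, assuming a crude energy-based VAD in place of whatever detector a real implementation would use (the threshold value and all function names are invented for illustration):

```python
import numpy as np

def is_silent(frame: np.ndarray, threshold_db: float = -40.0) -> bool:
    """Very crude energy-based VAD: a frame whose mean power falls below
    threshold_db (relative to full scale, signal in [-1, 1]) is treated
    as noise-only. A real system would use a proper VAD."""
    power = np.mean(frame ** 2) + 1e-12
    return 10.0 * np.log10(power) < threshold_db

def store_if_active(frame: np.ndarray, fifo2: list) -> None:
    # In alarm state 1, frames judged to carry only noise are not
    # written into "FIFO 2" for future playback.
    if not is_silent(frame):
        fifo2.append(frame)

fifo2 = []
store_if_active(0.3 * np.sin(np.linspace(0, 20 * np.pi, 160)), fifo2)
store_if_active(1e-4 * np.random.randn(160), fifo2)
print(len(fifo2))  # 1: only the sine frame was kept
```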

[76]

In the second alarm state (critical level), it is necessary to act on each frame to limit very strongly any increase in the fill of buffer "FIFO 2". At this point, the preceding processing (i.e. that implemented for alarm level 1) remains active. But this time, two consecutive frames must be reduced to one frame or less. A decision is therefore taken from two frames of samples that are not "silent" (indeed, if a frame is "silent", it is simply not written into "FIFO 2" (the case of alarm state 1, which is included in alarm state 2)). The action on two consecutive frames is thus engaged only when a frame is detected as not "silent". This frame is first stored; then, if the second frame is "silent", it is the first frame that is written into "FIFO 2".

[77]

In the case where both frames contain important information, they are to be replaced by a single frame that minimizes information loss and quality degradation. It is this resulting frame that is stored in "FIFO 2". Any effective solution able to perform this task can be used and activated under these conditions (second alarm state, non-"silent" frames). Two example algorithms performing this task are presented below.

[78]

According to a first algorithmic solution, the two frames placed end to end are replaced by a single frame in which each coefficient Y(j) (j from 0 to N-1, N being the number of samples per frame) takes the value Y(j) = (X(2j) + X(2j+1))/2, where X(i), i ranging from 0 to 2N-1, are the samples of the two original frames. This solution amounts, roughly, to downsampling with smoothing. The frequencies of the signal are then doubled over that frame on playback. However, the inventors have found that when alarm state 2 is not very frequent, this solution is sufficient to maintain the quality of sound reproduction.
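This first solution fits in a few lines. The sketch below assumes frames represented as NumPy arrays; it simply averages adjacent sample pairs as described:

```python
import numpy as np

def merge_by_averaging(x: np.ndarray) -> np.ndarray:
    """Replace two end-to-end frames (2N samples) by one frame of N
    samples, Y[j] = (X[2j] + X[2j+1]) / 2: averaging adjacent pairs,
    i.e. a smoothed 2:1 downsampling. Playback at the unchanged sample
    rate doubles the signal's frequencies over that frame."""
    assert len(x) % 2 == 0
    return 0.5 * (x[0::2] + x[1::2])

two_frames = np.arange(8, dtype=float)   # stand-in for 2 frames end to end
print(merge_by_averaging(two_frames))    # [0.5 2.5 4.5 6.5]
```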

[79]

According to a second solution, pitch detection of the signal is used to compact the two frames into a pseudo-frame of length less than or equal to that of one frame. The number of samples contained in this pseudo-frame is determined by the fundamental-frequency information ("pitch" in the usual terminology), but it is always less than or equal to the length of a normal frame, and close to that length. The algorithm used ensures the continuity of the played-back signal, with neither holes nor frequency doubling, while dividing the stored signal by a factor greater than or equal to 2. It is described in more detail in paragraph 3.4 below. In addition, it minimizes the loss of sound information, even though at least 50% of the samples are removed.

[80]

It should be noted that in the case where the receiver carries out its processing from an analysis of "FIFO 1", the decoder being directly connected to the sound playback system, it must generate a sufficient number of samples, i.e. in our case ensure the provision of at least one frame of samples for playback. The frame-concatenation algorithm is then calibrated so as always to generate a minimum number of samples, at least one frame. Another solution may consist in activating it several times, instead of only once, when a sufficient number of samples is desired.

[81]

In the third alarm level, normally never reached, no frame is stored in "FIFO 2". Alternatively, the system may also decide on an abrupt draining of part of the buffer (this will be the case if it is the management of "FIFO 1" that is activated).

[82]

It should also be noted that, depending on the networks and the kinds of asynchronism problems encountered, options allow certain alarm levels to be disabled. For example, in a case of weak asynchronism, alarm levels 1 and 2 may be grouped together, and the simple solution of replacing two frames by one can then be the only active process.

3.2 Alarm thresholds

[83]

We now describe in more detail the alarm thresholds and their management.

[84]

As explained previously, the reference memory is declared to be in alarm state 1 when its fill level exceeds threshold 1; this state remains active until the fill level drops below threshold 0. Operation thus follows a hysteresis.

[85]

The memory is declared to be in alarm state 2 if the fill level exceeds threshold 2, and in alarm state 3 if the fill level exceeds threshold 3. Managing these alarm states with a hysteresis can also be envisaged.
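A hedged sketch of such a hysteresis for alarm state 1 (the function name and the fill values are illustrative only): entry and exit use two different thresholds, which avoids rapid toggling around a single one.

```python
def update_alarm1(in_alarm: bool, fill: int, th0: int, th1: int) -> bool:
    """Hysteresis for alarm state 1: enter when the fill exceeds
    threshold 1, leave only when it drops below threshold 0 (th0 < th1)."""
    if not in_alarm and fill > th1:
        return True       # fill rose above threshold 1: enter alarm
    if in_alarm and fill < th0:
        return False      # fill fell below threshold 0: leave alarm
    return in_alarm       # between th0 and th1: keep current state

state = False
for fill in (700, 900, 850, 750, 600, 900):
    state = update_alarm1(state, fill, th0=640, th1=800)
print(state)  # True (re-entered alarm on the last fill)
```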

[86]

Thresholds 0, 1, and 2 are adaptive. Threshold 3, which is directly related to the maximum tolerated size, is fixed. Adapting these thresholds is necessary to accommodate the various communication contexts and their fluctuations over time. Indeed, more delay must be permitted when there is a lot of jitter in the communication (delayed playback remains the best means of ensuring good quality in the presence of jitter). In a context of greater jitter, thresholds 0, 1, and 2 should therefore be set to sufficiently high levels.

[87]

To facilitate processing, the threshold positions may correspond to an integer multiple of the size of the frames exchanged between the different tasks of the receiver. Let T be this frame size.

[88]

The initial values of these thresholds can be for example the following:

  1. Threshold 0: 5 × T
  2. Threshold 1: 8 × T
  3. Threshold 2: 12 × T
  4. Threshold 3: 24 × T (fixed value)

[89]

Thresholds 0, 1, and 2 can be adapted together in steps of size T. The extreme values of this adaptation can, for example, be set at -1 to +8 steps.

[90]

Thus, for example, threshold 1 can take the values 7 × T, 8 × T, 9 × T, 10 × T, ..., 16 × T. The actual adaptation of the thresholds is performed from a fitness criterion, namely the time spent in an alarm state. To this end, an alarm-state percentage is evaluated approximately every n seconds (e.g. n = 10). When this percentage is higher than a given threshold (5%), the alarm thresholds are increased; conversely, when it is below a given minimum threshold (0.5%), the alarm thresholds are decreased. To avoid excessive oscillation of the system due to too-frequent threshold adaptation, a hysteresis is applied to the adaptation decision. Indeed, the thresholds are actually increased by one step only after two consecutive increase decisions, and decreased by one step only after three consecutive decrease decisions. At least 2 × n seconds thus elapse between two threshold increments, and at least 3 × n seconds between two decrements. The threshold-increase procedure may be accelerated if a large percentage of frames is in alarm: accelerating the procedure consists in incrementing the thresholds directly if, for example, the alarm percentage is greater than 50%.
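The adaptation scheme just described might be sketched as follows; the vote counters implement the two-consecutive-increase / three-consecutive-decrease damping, the 5% / 0.5% / 50% figures come from the text, and all names are hypothetical:

```python
def adapt_thresholds(offset: int, pct_alarm: float, up_votes: int,
                     down_votes: int) -> tuple:
    """One adaptation decision, evaluated every n seconds.

    Returns (new_offset, up_votes, down_votes); offset is applied in
    steps of the frame size T to thresholds 0, 1 and 2, and is clamped
    to the example range [-1, +8]."""
    if pct_alarm > 50.0:                      # accelerated procedure
        return min(offset + 1, 8), 0, 0
    if pct_alarm > 5.0:
        up_votes += 1
        if up_votes >= 2:                     # two consecutive increases
            return min(offset + 1, 8), 0, 0
        return offset, up_votes, 0
    if pct_alarm < 0.5:
        down_votes += 1
        if down_votes >= 3:                   # three consecutive decreases
            return max(offset - 1, -1), 0, 0
        return offset, 0, down_votes
    return offset, 0, 0                       # dead zone: no change

off, up, down = 0, 0, 0
for pct in (10.0, 10.0):       # two consecutive "too much alarm" votes
    off, up, down = adapt_thresholds(off, pct, up, down)
print(off)  # 1: thresholds raised by one step of T
```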

[91]

Of course, the threshold values and alarm percentages given above are provided only as a guide.

3.3 Interaction with the first process

[92]

The first process is the one that triggers the generation of "fake" frames upon loss of frames or upon erroneous frames. When the system is in alarm (overabundance of frames), it becomes unnecessary to generate these "fake" frames, which would only aggravate the overabundance. However, for acoustic playback quality, it is important to maintain coder-decoder synchronization by informing the decoder of the frame loss concealment (by starting, for example, one or two fake-frame generations, but no more). The processing of the third alarm state will therefore act on the first process to substantially limit the generation of "fake" frames.

3.4 Frame concatenation

[93]

The concatenation processing aims to shorten the duration of a digital audio signal containing speech or music while introducing the least audible degradation possible. The sampling rate being given and fixed, this amounts to decreasing the number of samples sent to the playback apparatus. An obvious way to shorten a sequence of N samples is to remove M samples evenly spaced over the sequence in question. But this increases the fundamental frequency, which can be disturbing to the listener, especially when the ratio M/N is high. In addition, there is a risk of failing to comply with the sampling theorem. The processing presented below can shorten an audio sequence without changing the fundamental frequency and without introducing audible degradation due to a discontinuity of the signal. It is based on detection of the pitch period. The number of samples removed by this algorithm cannot be selected freely: it is a multiple of the pitch value P. It is nevertheless possible to define a minimum number of samples to remove, Nmin, which must satisfy Nmin ≥ N/2. Since, in the context of managing asynchronism in audio transmission, the object is to remove at least 50% of the samples, Nmin = N/2 is advantageously chosen. It is also assumed that the maximum value of the pitch P is less than the length N of the sequence to be shortened. The number Ne of samples removed by the algorithm is the smallest multiple of the pitch value P that is greater than or equal to Nmin: Ne = K × P, where K is a positive integer such that Ne − P < Nmin ≤ Ne. The length of the output signal is then Nr = N − Ne. Denoting the input signal to be shortened by s(n), n = 1, ..., N, the output signal is sr(n), n = 1, ..., Nr. To ensure the continuity of the output signal, a gradual cross-fade of the first and last Nr samples of the signal s(n) is performed:

sr(n) = s(n + Ne) × w(n) + s(n) × (1 − w(n)), n = 1, ..., Nr,

where w is a weighting function such that 0 ≤ w(n) ≤ 1 for n = 1, ..., Nr, and w(n) ≤ w(n + 1) for n = 1, ..., Nr − 1. For example, w(n) may simply be the linear function w(n) = n/Nr. For unvoiced signals, where the pitch cannot be determined, P can be fixed freely.
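Under the assumptions above (a known pitch P and a linear ramp w(n) = n/Nr), the shortening algorithm can be sketched as follows. This is an illustrative reconstruction, not the patent's reference code:

```python
import numpy as np

def shorten(s: np.ndarray, pitch: int, n_min: int = None) -> np.ndarray:
    """Pitch-synchronous shortening: remove the smallest multiple
    Ne = K*P of the pitch P with Ne >= Nmin (Nmin = N/2 by default),
    then cross-fade the first and last Nr = N - Ne samples with a
    linear ramp w(n) = n / Nr:
        sr(n) = s(n + Ne) * w(n) + s(n) * (1 - w(n)),  n = 1..Nr
    """
    n = len(s)
    if n_min is None:
        n_min = n // 2
    k = -(-n_min // pitch)            # ceil(n_min / pitch)
    ne = k * pitch                    # Ne: samples removed
    nr = n - ne                       # Nr: output length
    w = np.arange(1, nr + 1) / nr     # increasing linear weight
    return s[ne:ne + nr] * w + s[:nr] * (1.0 - w)

# A strictly periodic signal is shortened by an exact number of periods,
# so the output is (nearly) a continuation of the same waveform.
t = np.arange(640)
s = np.sin(2 * np.pi * t / 47)        # pitch P = 47
out = shorten(s, pitch=47)
print(len(out))  # 311 = 640 - 7*47
```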

[94]

Figure 4, comprising sequences a, b, c, and d of signals, illustrates the processing on a specific example. The first sequence (a) shows, in full, the piece of signal s(n) of N = 640 samples to be shortened. The goal is to shorten this sequence by at least 320 samples, without changing the fundamental frequency and without introducing a discontinuity or other audible impairment. The pitch of s(n) changes slowly; its value is 49 at the beginning of the sequence and 45 at the end. The pitch detected using a correlation method is P = 47. The sequence s(n) will therefore be shortened by K = 7 periods, i.e. Ne = K × P = 7 × 47 = 329 samples.

[95]

In this example, linear weighting is chosen. Sequences b and c illustrate the two pieces of signal of length Nr = N − Ne = 311, already weighted, that are then fused together. The fusion is performed by adding these two signals. In sequence c, it can be seen that, because of the slight variation of the pitch, the two pieces of s(n) are slightly out of phase. Thanks to the merging technique used, this does not introduce a discontinuity into the output signal sr(n) (solid line in sequence d). It can also be seen in sequence d that the shortened signal sr(n) remains perfectly in phase with the preceding and following signals (dashed lines).



[96]

The invention concerns a method for decoding and playing back a sound signal in an asynchronous transmission system, which consists in detecting a filling overload in a first buffer memory and/or a second buffer memory at the input or the output of the decoding unit, and in comparing the filling level with at least one threshold. The invention is characterized in that, depending on the value of the filling level, it consists in using voice activity detection and eliminating the frames considered as non-active by said detection, and in carrying out concatenation processing on two successive frames.



A method of managing the decoding and playback (14) of a sound signal in an asynchronous transmission system, in which any overabundance of filling of a first buffer memory (11) and/or of a second buffer memory (13) situated at the inlet and/or at the outlet of a decoding block (12) is detected by comparing the filling level with at least one threshold, the method being characterized in that, depending on the value of the filling level:

voice activity detection is implemented and frames considered by said detection as being non-active are eliminated; and

concatenation processing is implemented on two successive frames to compact them into a pseudo-frame of length less than or equal to one frame, the length reduction ratio of the pseudo-frame relative to the length of the two frames being greater than or equal to two.

A method according to claim 1, characterized in that voice activity detection is implemented and frames considered by said detection as being not active are eliminated whenever the filling level lies between a first threshold and a second threshold, and in that concatenation processing is implemented on two successive frames whenever the filling level lies between a second threshold and a third threshold.

A method according to claim 2, characterized in that the first and second thresholds are the same.

A method according to any preceding claim, characterized in that detection is performed at the inlet or the outlet of a decoding block (12) having a first buffer memory (11) at its inlet and/or its outlet to determine whether any frame is missing or erroneous or whether any samples to be played back are absent, and a fake frame is generated to ensure continuity in the audio playback on detecting such a missing or erroneous frame, or on detecting such an absence of samples for playback.

A method according to claim 4, characterized in that when the decoding block (12) implements its decoding processing in cyclical manner relative to the content of the first buffer memory (11), detection of any missing or erroneous frame or of any absence of samples to play back is implemented at the same cyclical frequency, said detection taking place far enough in advance relative to the decoding process to make it possible to generate a fake frame in good time.

A method according to claim 4 or claim 5, characterized in that a fake frame is not generated when a missing or erroneous frame is detected for a frame on which an absence of samples has already been detected.

A method according to any one of claims 4 to 6, characterized in that, for a system of the type which can voluntarily stop sending frames, the type of the previously-generated frame is stored from one frame to the next, and this information is used to determine whether to generate fake frames or to generate frames of silence.

A method according to any preceding claim, characterized in that in processing for concatenating two successive frames, the samples are weighted in such a manner as to give more importance to the first samples of the first frame and to the last samples of the second frame.

A method according to any preceding claim, characterized in that the threshold(s) is/are adaptive.

A method according to claim 9, characterized in that a threshold is adapted as a function of the length of time passed with a filling level above a given threshold.

A device for playing back a speech signal, the device comprising a first buffer memory (11) receiving coded frames, means implementing decoding processing (12) on the frames stored in said first buffer memory (11), a second buffer memory (13) receiving decoded frames output by the decoding means, and sound playback means (14) receiving the frames output by the second buffer memory (13), the device being characterized in that it further comprises means capable of detecting any overabundance of filling of the first and/or of the second buffer memory by comparing the filling level with at least one threshold, and means which, as a function of the value of the filling level, implement voice activity detection and eliminate frames considered as being non-active, and implement concatenation processing on two successive frames to compact them into a pseudo-frame of length less than or equal to one frame, the reduction ratio of the pseudo-frame relative to the length of the two frames being greater than or equal to two.