Sound source localization apparatus and method
The present invention relates generally to sound source localization. More specifically, embodiments of the present invention relate to apparatuses and methods for performing sound source localization through an array of microphones. For example, a method (the SRP-PHAT algorithm) of performing sound source localization based on the time difference (phase difference) between the signals of different microphones has been proposed.

According to the present invention, there is provided a method of performing sound source localization according to claim 1, an apparatus for performing sound source localization according to claim 8, and a computer-readable medium according to claim 15. The dependent claims relate to preferred embodiments of the present invention.

According to an embodiment of the present invention, a method of performing sound source localization is provided. The method includes calculating a frame amplitude difference vector based on short time frame data acquired through an array of microphones, the frame amplitude difference vector reflecting differences between amplitudes captured by microphones of the array while recording the short time frame data; evaluating similarity between the frame amplitude difference vector and each of a plurality of reference frame amplitude difference vectors, each of the plurality of reference frame amplitude difference vectors reflecting differences between amplitudes captured by microphones of the array while recording sound from one of a plurality of candidate locations; and estimating a desired location of a sound source based at least on the candidate locations and the associated similarities.

According to another embodiment of the present invention, an apparatus for performing sound source localization is provided.
The apparatus includes a vector calculator which calculates a frame amplitude difference vector based on short time frame data acquired through an array of microphones, the frame amplitude difference vector reflecting differences between amplitudes captured by microphones of the array while recording the short time frame data; a similarity evaluator which evaluates similarity between the frame amplitude difference vector and each of a plurality of reference frame amplitude difference vectors, each of the plurality of reference frame amplitude difference vectors reflecting differences between amplitudes captured by microphones of the array while recording sound from one of a plurality of candidate locations; and an estimator which estimates a desired location of a sound source based at least on the candidate locations and the associated similarities.

According to another embodiment of the present invention, a computer-readable medium having computer program instructions recorded thereon for enabling a processor to perform sound source localization is provided. The computer program instructions include means for calculating a frame amplitude difference vector based on short time frame data acquired through an array of microphones, the frame amplitude difference vector reflecting differences between amplitudes captured by microphones of the array while recording the short time frame data; means for evaluating similarity between the frame amplitude difference vector and each of a plurality of reference frame amplitude difference vectors, each of the plurality of reference frame amplitude difference vectors reflecting differences between amplitudes captured by microphones of the array while recording sound from one of a plurality of candidate locations; and means for estimating a desired location of a sound source based at least on the candidate locations and the associated similarities.
Further features and advantages of the invention, as well as the structure and operation of various embodiments of the invention, are described in detail below with reference to the accompanying drawings. It is noted that the invention is not limited to the specific embodiments described herein. Such embodiments are presented herein for illustrative purposes only. Additional embodiments will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein.

The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements.

The embodiments of the present invention are described below by referring to the drawings. It is to be noted that, for the purpose of clarity, representations and descriptions of those components and processes known by those skilled in the art but unrelated to the present invention are omitted in the drawings and the description.

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon. Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium.
A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired line, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

Apparatus 100 may be useful in various applications. In one application, apparatus 100 may detect speakers' location information in a meeting. This location information can be used to separate the speakers in a meeting recording, or it can be used for spatial audio coding of the meeting.

As illustrated in the figures, vector calculator 101 is configured to calculate a frame amplitude difference vector (FADV) based on short time frame data acquired through an array of microphones. The frame amplitude difference vector reflects differences between amplitudes captured by microphones of the array while recording the short time frame data.

In general, due to diversity in the distances from a sound source to different microphones of the array, or diversity in the sensitivities of the microphones to sound signals from the sound source, the amplitudes of the sound signals captured by the microphones from the same sound source are different. For different sound source locations, the distributions of amplitude differences between the microphones can be different.
For example, in the case of an array of unidirectional microphones, or in the case that the sound source is close to an array of omni-directional microphones, the distributions of the amplitude differences between the microphones can be significantly different. Based on this observation, the distributions of the amplitude differences between the microphones may be associated with different sound locations, at least those locations exhibiting this diversity. In this regard, it is possible to estimate whether a sound source is located at one of these locations according to this association, based on the amplitude differences between the microphones introduced by the sound source.

In the following, an array of unidirectional microphones will be adopted as an example to describe the embodiments of the present invention. The unidirectional microphones may be cardioid microphones.

In general, the location of a sound source may refer to the direction of arrival (DOA) or the position of the sound source. In some cases, the distributions of amplitude differences between microphones for different positions along the DOA are substantially similar, and therefore the DOA may be employed to measure the location. Depending on specific applications, the DOA may be represented with the azimuth angle of the sound source in a plane (named the horizontal plane) where the cardioid microphone array (CMA) is located. In this case, the audio localization problem is simplified to an angle detection problem. However, it is possible to detect both the azimuth angle in the horizontal plane and the elevation angle of the source in a vertical plane by adding one microphone facing upwards. In the following, the azimuth angle of the sound source in the horizontal plane will be adopted as an example of the location.
It should be noted that various microphone arrays may be applied to the embodiments of the present invention as long as the distributions of amplitude differences between microphones for different locations can exhibit significant diversity. The FADV reflects amplitude differences between microphones. In the following, the FADV is represented as a vector.

The short time frame data may be extracted from an audio data stream pre-recorded through the array or recorded through the array in real time. Further, a window may be multiplied on the short time frame data. The window may be a Hamming window, a Hanning window, etc. Assuming that the short time frame contains N samples and the number of microphones is M, the short time frame data can be stored as a matrix. The value of N may be determined based on the sampling rate and the expected time length of a short time frame.

Adjacent short time frames may or may not overlap with each other. The value of N and whether to use overlapped short time frames depend on the application's requirements on time resolution and computation complexity. A larger N usually means more accurate estimation with more data, but lower time resolution if there is no overlapped data between adjacent short time frames. The time resolution can be increased by using overlapped data between adjacent short time frames, but doing so may increase the computation complexity. If a Fast Fourier transform (FFT) is performed, N preferably belongs to the set {2^k, k = 1, 2, ...}, i.e., N is preferably a power of two.

There are multiple methods to calculate the FADV. The methods may be classified into energy-based and eigenvector-based. The methods may also be classified into ones based on time domain sample values and ones based on frequency domain parameters.
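The framing and windowing described above can be sketched as follows; the sampling rate, the frame length N = 1024, and the 50% overlap are illustrative assumptions, not values taken from the text:

```python
import numpy as np

def frame_signal(audio, frame_len, hop):
    """Split multi-channel audio (shape M x total_samples) into windowed
    short time frames of frame_len samples, advancing by hop samples."""
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, audio.shape[1] - frame_len + 1, hop):
        frames.append(audio[:, start:start + frame_len] * window)
    return frames  # each frame is an M x N matrix

# Assumed setup: 3 microphones, 1 second of audio at 16 kHz, N = 1024, 50% overlap
rng = np.random.default_rng(0)
audio = rng.standard_normal((3, 16000))
frames = frame_signal(audio, frame_len=1024, hop=512)
```

With hop = frame_len the frames do not overlap; a smaller hop trades computation for time resolution, as discussed above.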
Specifically, the FADV may be calculated based on the average amplitude of respective channels in the short time frame data, or based on eigenvector analysis on a covariance matrix including time domain sample values of respective channels in the short time frame data (i.e., portions corresponding to respective microphones in the short time frame data), or based on the average amplitude on the frequency domain of respective channels in the short time frame data, or based on eigenvector analysis on spectral parameters of respective channels in the short time frame data.

The first method calculates the FADV directly based on the short-time average amplitudes of the channels (i.e., the audio data recorded through the respective microphones). That is to say, the FADV is calculated based on the average amplitude of respective channels in the short time frame data. First, the root mean square (RMS) amplitude of each channel is calculated, and these amplitudes are collected into a vector. To facilitate comparison with the reference frame amplitude difference vectors (RFADVs) (to be described later), preferably, an RMS normalization is performed on this vector to obtain the FADV.

According to the second method, the FADV is calculated based on eigenvector analysis on a covariance matrix including time domain sample values of respective channels in the short time frame data. First, the covariance matrix of the time domain sample values of the channels is calculated. Then, the largest eigenvector of the covariance matrix is calculated as the FADV by eigendecomposition.

According to the third method, the FADV is calculated as an average amplitude on the frequency domain of respective channels in the short time frame data.
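The two time domain variants above can be sketched as follows; the RMS normalization and the sign convention for the eigenvector are assumptions made so that both functions return vectors in the same format:

```python
import numpy as np

def fadv_rms(frame):
    """Energy-based FADV: per-channel RMS amplitude, then RMS normalization
    so the result is comparable with the RFADVs."""
    v = np.sqrt(np.mean(frame ** 2, axis=1))
    return v / np.sqrt(np.mean(v ** 2))

def fadv_eig(frame):
    """Eigenvector-based FADV: largest eigenvector of the M x M covariance
    matrix of the time domain samples, RMS-normalized for comparability."""
    cov = frame @ frame.T / frame.shape[1]
    _, vecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    v = vecs[:, -1]                        # eigenvector of largest eigenvalue
    v = v if v.sum() >= 0 else -v          # resolve the sign ambiguity
    return v / np.sqrt(np.mean(v ** 2))
```

For a single strong source, both variants recover essentially the same amplitude ratios between channels; the eigenvector variant is more robust when uncorrelated noise is spread across channels.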
The frequency domain of the FFT can be divided into a number of frequency bins, and it is assumed that a set of these frequency bins is used in the calculation. In an example, a frequency range from a lower frequency to an upper frequency is determined, and the frequency bins falling within this range are used. In one implementation, it is possible to set the used frequency range according to the application.

First, the N samples of each channel m are grouped into S sub-frames, and each sub-frame contains N/S samples. As a special case, the number S of the sub-frames can be set to 1, which means the spectral analysis is performed directly on all N samples of one short time frame. Then, spectral analysis is performed on each sub-frame to obtain its spectral parameters. Then, the frequency domain parameters of all the channels in each used frequency bin are collected, and the amplitude of each channel is averaged over the used frequency bins and the sub-frames. Then a vector of these average amplitudes is formed. Then, to facilitate comparison with the RFADVs (to be described later), preferably, an RMS normalization is performed on this vector to obtain the FADV.

According to the fourth method, the FADV is calculated based on eigenvector analysis on spectral parameters of respective channels in the short time frame data. As described in the foregoing, the N samples of each channel m are grouped into S sub-frames, and each sub-frame contains N/S samples. In one implementation, it is possible to obtain a first covariance matrix as a sum of second covariance matrices. Each of the second covariance matrices corresponds to a respective one of the used frequency bins and includes spectral parameters of all the sub-frames of all the channels for the respective used frequency bin. Accordingly, it is possible to calculate the FADV based on the eigenvector analysis on the first covariance matrix. Specifically, the FADV can be calculated by finding the largest eigenvector based on the covariance matrices for the frequency bins; for example, it is possible to calculate a sum of the second covariance matrices over the used frequency bins and take the largest eigenvector of this sum as the FADV. As a special case, the number S of sub-frames for one short time frame equals 1, that is to say, the grouping is not performed.
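The summed-covariance variant above can be sketched as follows; the sub-frame length n_fft and the default bin set (all bins except DC) are assumptions for illustration:

```python
import numpy as np

def fadv_spectral_eig(frame, n_fft=256, used_bins=None):
    """FADV from eigenvector analysis on spectral parameters: each channel is
    grouped into S sub-frames of n_fft samples, a covariance matrix is formed
    for each used frequency bin and summed, and the largest eigenvector of
    the sum is taken as the FADV."""
    M, N = frame.shape
    S = N // n_fft                                     # number of sub-frames
    X = np.fft.rfft(frame[:, :S * n_fft].reshape(M, S, n_fft), axis=2)
    if used_bins is None:
        used_bins = range(1, X.shape[2])               # skip DC (assumption)
    cov = np.zeros((M, M), dtype=complex)
    for k in used_bins:                                # sum per-bin covariances
        Xk = X[:, :, k]                                # M x S parameters, bin k
        cov += Xk @ Xk.conj().T
    _, vecs = np.linalg.eigh(cov)
    v = np.abs(vecs[:, -1])                            # amplitude of eigenvector
    return v / np.sqrt(np.mean(v ** 2))                # RMS normalization
```

Restricting used_bins to the frequency range of the expected sources lets the calculation ignore bins dominated by noise.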
This means that the spectral analysis is performed directly on all N samples of one short time frame for each channel to obtain the spectral parameters of the channel. That is, the frequency domain parameters of all the channels in all the used frequency bins are obtained directly from the whole short time frame. In this case, the FADV is calculated as the largest eigenvector of a covariance matrix which includes spectral parameters of all the used frequency bins of all the channels.

In another implementation, it is possible to calculate the FADV by averaging the largest eigenvectors of covariance matrices. Each of the covariance matrices corresponds to a respective one of the used frequency bins and includes spectral parameters of all the sub-frames of all the channels for the respective used frequency bin. For example, it is possible to calculate the largest eigenvector of each of these covariance matrices and average the largest eigenvectors to obtain the FADV.

Furthermore, the FADV may be calculated in an adaptive way. That is to say, vector calculator 101 may be configured to calculate the FADV adaptively based on the short time frame data and its previous short time frame data. In one implementation, it is possible to calculate the desired FADV adaptively by calculating a current FADV based on the short time frame data, and smoothing the current FADV and a historic FADV based on the previous short time frame data as the desired FADV. The desired FADV or the current FADV may be used as a historic FADV for the next desired FADV. For example, having calculated the current FADV, the desired FADV may be obtained as a weighted sum of the current FADV and the historic FADV.

In another implementation, it is possible to calculate the FADV according to an eigenvector-based method (based on time domain sample values or based on frequency domain parameters).
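The adaptive smoothing of the FADV described above can be sketched as follows; the one-pole weighted sum and the smoothing factor alpha are assumptions, since the text does not give the exact smoothing rule:

```python
import numpy as np

def smooth_fadv(current, historic, alpha=0.8):
    """Adaptive FADV: weighted sum of the current FADV and the historic FADV,
    re-normalized to unit RMS (alpha is an assumed smoothing factor)."""
    v = current if historic is None else alpha * current + (1.0 - alpha) * historic
    return v / np.sqrt(np.mean(v ** 2))
```

The returned vector (or the unsmoothed current one) would then be kept as the historic FADV for the next frame, so the estimate varies smoothly over time.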
In this case, it is possible to obtain the final covariance matrix (the summed covariance matrix, the covariance matrix for a used frequency bin, or the covariance matrix for all the used frequency bins) for calculating an eigenvector based on the short time frame data by deriving a current covariance matrix based on the short time frame data, and smoothing the current covariance matrix and a historic covariance matrix based on the previous short time frame data as the final covariance matrix. The final covariance matrix for calculating an eigenvector based on the short time frame data or the current covariance matrix may be used as a historic covariance matrix for the next final covariance matrix. For example, it is possible to calculate the final covariance matrix as a weighted sum of the current covariance matrix and the historic covariance matrix.

Returning to the apparatus, the similarity evaluator is configured to evaluate similarity between the FADV and each of a plurality of RFADVs. For comparison with the FADV, the RFADVs have the same format as the FADV. Because each RFADV reflects the amplitude differences associated with one of the candidate locations, the RFADVs are associated with the candidate locations. The term "candidate location" means that the sound source may be located at the location and may have originated the sound for the current short time frame at the location. It is possible to assume an even probability distribution for all locations, and thus the candidate locations may include all the locations spaced at an even interval depending on the localizing resolution. Preferably, to reduce the computation complexity, the candidate locations may be a subset of all the locations. The subset may be different in different scenarios based on prior knowledge of the source location's probability distribution.

Various methods can be adopted to calculate the similarity between the FADV and each of the RFADVs; for example, the similarity may be measured through the distance between the two vectors. The distances can be implemented as Euclidean distances, with a smaller distance corresponding to a higher similarity.

Estimator 103 is configured to estimate a desired location of a sound source based at least on the candidate locations and the associated similarities.
For example, an RFADV having the highest similarity to the FADV may be found, and the candidate location associated with that RFADV may be estimated as the desired location of the sound source. Further, the desired location may be estimated with reference to an estimation result obtained through another sound source localization method, for example, a method based on time difference.

As an example of calculating the FADV based on eigenvector analysis on spectral parameters of respective channels in the short time frame data, as described in the foregoing, the spectral parameters of each of the channels may be obtained by performing spectral analysis on a plurality of sub-frames of the channel, wherein all the samples of the channel are grouped into the sub-frames. In this case, a first covariance matrix may be obtained as a sum of second covariance matrices. Each of the second covariance matrices corresponds to a respective one of the used frequency bins and includes spectral parameters of all the sub-frames of all the channels for the respective used frequency bin. The FADV may be calculated based on the eigenvector analysis on the first covariance matrix. Alternatively, the FADV may be calculated by averaging the largest eigenvectors of covariance matrices, where each of the covariance matrices corresponds to a respective one of the used frequency bins and includes spectral parameters of all the sub-frames of all the channels for the respective used frequency bin.

As another example of calculating the FADV based on eigenvector analysis on spectral parameters of respective channels in the short time frame data, as described in the foregoing, the spectral parameters of each of the channels may be obtained by performing spectral analysis directly on all the samples of the channel. In this case, the FADV may be calculated as the largest eigenvector of a covariance matrix.
The covariance matrix includes spectral parameters of all the used frequency bins of all the channels.

Furthermore, the FADV may be calculated adaptively based on the short time frame data and its previous short time frame data. As an example, the FADV may be calculated adaptively by calculating a current frame amplitude difference vector based on the short time frame data, and smoothing the current frame amplitude difference vector and a historic frame amplitude difference vector calculated adaptively based on the previous short time frame data as the frame amplitude difference vector. The frame amplitude difference vector or the current frame amplitude difference vector may be used as a historic frame amplitude difference vector for the next frame amplitude difference vector. As another example, the FADV may be calculated according to an eigenvector-based method, and the final covariance matrix for calculating an eigenvector based on the short time frame data may be obtained by deriving a current covariance matrix based on the short time frame data, and smoothing the current covariance matrix and a historic covariance matrix for calculating an eigenvector based on the previous short time frame data as the final covariance matrix. The final covariance matrix for calculating an eigenvector based on the short time frame data or the current covariance matrix may be used as the historic covariance matrix for the next final covariance matrix.

At step 305, similarity between the FADV and each of a plurality of RFADVs is evaluated. Each of the plurality of RFADVs reflects differences between amplitudes captured by microphones of the array while recording sound from one of a plurality of candidate locations. At step 307, a desired location of a sound source is estimated based at least on the candidate locations and the associated similarities. At step 309, the method ends.
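The sequence of steps 303 through 307 can be sketched end to end as follows; the candidate angles, the reference vectors, and the choice of the RMS-based FADV with Euclidean distance are illustrative assumptions:

```python
import numpy as np

def rms_norm(v):
    """Normalize a vector to unit RMS, the format shared by FADV and RFADVs."""
    return v / np.sqrt(np.mean(v ** 2))

def estimate_location(frame, rfadvs, candidates):
    """Steps 303-307 in sequence, using the RMS-based FADV and Euclidean
    distance as the (dis)similarity measure."""
    fadv = rms_norm(np.sqrt(np.mean(frame ** 2, axis=1)))   # step 303: FADV
    dists = np.linalg.norm(rfadvs - fadv, axis=1)           # step 305: similarity
    return candidates[int(np.argmin(dists))]                # step 307: estimate

# Hypothetical candidate angles and reference vectors (illustrative values)
candidates = [0, 120, 240]
rfadvs = np.array([rms_norm(np.array(a)) for a in
                   ([1.0, 0.25, 0.25], [0.25, 1.0, 0.25], [0.25, 0.25, 1.0])])
```

A frame whose channel amplitudes follow one of the reference patterns is mapped back to the corresponding candidate angle.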
As illustrated in the figures, reference vector calculator 405 may be configured to calculate the RFADVs based on audio data obtained by capturing sound originated from the candidate locations respectively through the array. In this case, for each candidate location θ, the reference amplitudes captured by the microphones are collected and normalized to form the corresponding RFADV.

Alternatively, reference vector calculator 405 may also be configured to calculate the RFADVs based on the sensitivity of the microphones of the array to sound originated from the candidate locations. For example, the sensitivity of a microphone to sound originated from various locations can be defined through the directionality pattern of the microphone. Accordingly, the RFADVs can be estimated according to the directionality pattern of the microphones. For example, in the case of the CMA, when the sound source is placed at location θ and no noise is present, the theoretical amplitudes of microphones 201, 202 and 203 can be derived from the cardioid directionality pattern, and the RFADV for location θ is then obtained by normalizing the vector of these theoretical amplitudes. Further, by considering the influence of noise, the reference amplitudes of microphones 201, 202 and 203 may be calculated by adding a noise term to the theoretical amplitudes. Assuming that the noise is independent of the sound source's location and the noise level of the three microphones is the same, the noise term can be represented by a single parameter, and various methods can be used to estimate this parameter.

As illustrated in the figures, steps 503, 505, 507 and 509 have the same functions as steps 303, 305, 307 and 309, and will not be described in detail herein.

As illustrated in the figures, possibility evaluator 606 is configured to evaluate the possibility that each of a plurality of possible locations is the desired location according to an audio localization method based on time difference. The term "possible locations" is used only for the purpose of distinguishing from the candidate locations in the above embodiments based on amplitude difference. The possible locations are dependent on the method based on time difference.
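The directionality-pattern approach to building RFADVs for the CMA, described above, can be sketched as follows; the standard cardioid response 0.5 * (1 + cos(theta - phi_m)), the 120-degree microphone spacing, and the 5-degree candidate grid are modeling assumptions rather than values taken from the text:

```python
import numpy as np

def cardioid_rfadv(theta, mic_angles, noise=0.0):
    """RFADV for source direction theta (radians) derived from cardioid
    directionality patterns, with an optional location-independent noise
    term, RMS-normalized."""
    amps = 0.5 * (1.0 + np.cos(theta - mic_angles)) + noise
    return amps / np.sqrt(np.mean(amps ** 2))

mic_angles = np.deg2rad([0.0, 120.0, 240.0])     # assumed microphone orientations
reference_vectors = np.array([cardioid_rfadv(np.deg2rad(a), mic_angles)
                              for a in range(0, 360, 5)])  # 5-degree candidate grid
```

A larger noise term flattens the reference vectors, reflecting the assumption that the noise contributes equally to every microphone regardless of the source location.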
The term "possibility" is dependent on the measurement adopted by the method based on time difference to evaluate the closeness of possible locations to the desired location. Estimator 603 is configured to estimate the desired location based on the candidate locations, their similarity, the possible locations and their possibility. Estimator 603 has two kinds of information to estimate the desired location. One is the candidate locations and their similarity, another is the possible locations and their possibility. Considering that one kind of information is a refinement to another, various policies may be adopted to estimate the desired location. For example, the estimation may be performed in similar to a voting problem. As illustrated in After step 705, method 700 proceeds to step 706. At step 706, possibility that each of a plurality of possible locations is the desired location is evaluated according to an audio localization method based on time difference. At step 707, the desired location is estimated based on the candidate locations, their similarity, the possible locations and their possibility. Estimator 603 has two kinds of information to estimate the desired location. Method 700 ends at step 709. It should be noted that step 706 may be performed before step 705, or in parallel to step 705. As illustrated in First function generator 807 is configured to derive a first probability function for estimating probability that all locations are the desired location based on the possible locations and their possibility. The first probability function may estimate the probability that the possible locations are the desired location. Furthermore, the first probability function may also estimate the probability that other locations are the desired location. Various functions can be used to derive the first probability function of different locations based on the possibility. For example, the possibility is measured by steered response power (SRP). 
One method is to directly use the steered response power, normalized over all locations, as the first probability function. For another example, it is possible to derive the first probability function by applying a monotonically increasing mapping to the steered response power and then normalizing.

Second function generator 808 is configured to derive a second probability function for estimating the probability that each location is the desired location, based on the candidate locations and their similarities. The second probability function may estimate the probability that the candidate locations are the desired location. Furthermore, the second probability function may also estimate the probability that other locations are the desired location. The second probability function can be estimated with various methods. For example, the second probability function may be derived from the similarities of the candidate locations, normalized over all locations. For another example, the second probability function may be derived from the distances between the FADV and the RFADVs, with smaller distances mapped to larger probabilities.

Third function generator 809 is configured to derive a combined probability function for estimating the probability that each location is the desired location, based on the first probability function and the second probability function. The combined probability function may estimate the probability that the possible locations and the candidate locations are the desired location. Furthermore, the combined probability function may also estimate the probability that other locations are the desired location. Various methods can be used to derive the combined probability function based on the two probability functions. For example, it is possible to derive the combined probability function by multiplying the first and the second probability functions.

Estimator 803 is configured to estimate the location having the largest combined probability as the desired location. Preferably, estimator 803 is further configured to choose the one closest to the location having the largest combined probability from one or more peak locations in the first probability function, or from one or more possible locations having the higher possibility.
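The multiplicative combination performed by third function generator 809 can be sketched as follows; the three-location probability values are illustrative, not from the text:

```python
import numpy as np

def combine(p_time, p_amp):
    """Combined probability function: pointwise product of the time-difference
    based and amplitude-difference based probability functions, renormalized."""
    p = p_time * p_amp
    return p / p.sum()

p_time = np.array([0.6, 0.3, 0.1])   # e.g., normalized SRP values (illustrative)
p_amp = np.array([0.1, 0.3, 0.6])    # e.g., from FADV similarities (illustrative)
p = combine(p_time, p_amp)
```

Note that the combined function can favor a location that neither input ranks first, since a location must score reasonably well under both cues to obtain a large product.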
For example, if the combined estimation result is close to the location estimated by the time difference based algorithm (i.e., a possible location having the higher possibility), the combined estimation result can be adjusted to that estimated location. Similarly, if the combined estimation result is close to one potential location, i.e., one local maximum (peak) of the SRP curve, the combined estimation result can be adjusted to that location. As an example, it is possible to estimate a location θ having the largest combined probability, and then adjust θ to the closest peak location if the distance between them is small enough; otherwise θ is kept unchanged. As another example, it is possible to calculate all the local maxima of the SRP curve as peak locations, and adjust θ to the closest peak location if the distance between them is small enough. Alternatively, the refinement can be performed by comparing θ with the possible locations having the higher possibility.

As illustrated in the figures, after step 906, method 900 proceeds to step 907. At step 907, a first probability function for estimating the probability that each location is the desired location is derived based on the possible locations and their possibilities. At step 908, a second probability function for estimating the probability that each location is the desired location is derived based on the candidate locations and their similarities. At step 909, a combined probability function for estimating the probability that each location is the desired location is calculated based on the first probability function and the second probability function. At step 910, the location having the highest combined probability is estimated as the desired location, based on the combined probability function. Method 900 ends at step 911. It should be noted that step 907 may be executed at any time between steps 905 and 909, and step 908 may be executed at any time between steps 906 and 909.

Further, the first probability function may be derived by incorporating a first factor, and the second probability function may be derived by incorporating a second factor. The first factor and the second factor enable the combined probability function to be more sensitive to the similarity.
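The peak-refinement step described above, where the combined estimate is snapped to a nearby SRP peak, can be sketched as follows; the threshold is an assumed tuning parameter, and the wrap-around angular distance is an assumption appropriate for azimuth angles:

```python
import numpy as np

def refine(theta, peak_locations, threshold):
    """Snap the combined estimate theta (degrees) to the closest SRP peak if
    their angular distance is below threshold; otherwise keep theta."""
    peaks = np.asarray(peak_locations, dtype=float)
    diffs = np.abs((peaks - theta + 180.0) % 360.0 - 180.0)  # wrap-aware distance
    i = int(np.argmin(diffs))
    return float(peaks[i]) if diffs[i] < threshold else float(theta)
```

The wrap-aware distance treats 359 degrees and 2 degrees as 3 degrees apart, so refinement behaves correctly across the 0/360 boundary.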
For example, Equations (1), (3), (5) and (6) may be adjusted to incorporate such a factor; a larger factor makes the combined probability function more sensitive to the similarity, and similarly a smaller factor reduces the sensitivity. In a modification to the embodiments described above, a time difference based algorithm (such as SRP) returns all the angles having a local maximum value in the steered response power curve over all angles, while an amplitude difference based algorithm returns a probability function. Then the probability function's values at the angles returned by SRP-PHAT are compared, and the angle with the largest probability is chosen as the final estimated angle. In a further example of these embodiments, the method may be implemented on a computer. The CPU 1101, the ROM 1102 and the RAM 1103 are connected to one another via a bus 1104. An input/output interface 1105 is also connected to the bus 1104. The following components are connected to the input/output interface 1105: an input section 1106 including a keyboard, a mouse, or the like; an output section 1107 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a loudspeaker or the like; the storage section 1108 including a hard disk or the like; and a communication section 1109 including a network interface card such as a LAN card, a modem, or the like. The communication section 1109 performs a communication process via a network such as the Internet. A drive 1110 is also connected to the input/output interface 1105 as required. A removable medium 1111, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 1110 as required, so that a computer program read therefrom is installed into the storage section 1108 as required.
In the case where the above-described steps and processes are implemented by software, the program that constitutes the software is installed from a network such as the Internet or from a storage medium such as the removable medium 1111. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. A method of performing sound source localization, comprising:
calculating (303; 503; 703; 903) a frame amplitude difference vector (FADV) based on short time frame data acquired through an array of microphones (201, 202, 203), the frame amplitude difference vector (FADV) reflecting differences between amplitudes captured by microphones (201, 202, 203) of the array during recording the short time frame data; evaluating (305; 505; 705; 905) similarity between the frame amplitude difference vector (FADV) and each of a plurality of reference frame amplitude difference vectors (RFADV), each of the plurality of reference frame amplitude difference vectors (RFADV) reflecting differences between amplitudes captured by microphones (201, 202, 203) of the array during recording sound from one of a plurality of candidate locations; estimating (307; 507; 707; 907) a desired location of sound source based at least on the candidate locations and associated similarity; wherein the frame amplitude difference vector (FADV) is calculated based on eigenvector analysis on spectral parameters of respective channels in the short time frame data or on a covariance matrix including time domain sample values of respective channels in the short time frame data. The method according to claim 1, further comprising:
evaluating (706; 906) a possibility that each of a plurality of possible locations is the desired location according to an audio localization method based on time difference, and wherein the estimating comprises estimating the desired location based on the candidate locations, the similarity, the possible locations and the possibility. The method according to claim 1, wherein the frame amplitude difference vector (FADV) is calculated according to one of the following methods: a method based on time domain sample values, and a method based on frequency domain parameters. The method according to claim 1, wherein the frame amplitude difference vector (FADV) is calculated adaptively based on the short time frame data and its previous short time frame data. The method according to claim 1, further comprising:
acquiring the plurality of reference frame amplitude difference vectors (RFADV) by capturing sound originated from the candidate locations respectively through the array. The method according to claim 1, further comprising:
calculating (502) the plurality of reference frame amplitude difference vectors (RFADV) based on sensitivity of the microphones (201, 202, 203) of the array to sound originated from the candidate locations. The method according to claim 1, wherein the array comprises three cardioid microphones (201, 202, 203) which are orientated in directions of 0 degrees, -120 degrees and -240 degrees respectively in a plane. An apparatus for performing sound source localization, comprising:
a vector calculator (101; 401; 601; 801) that calculates a frame amplitude difference vector (FADV) based on short time frame data acquired through an array of microphones (201, 202, 203), the frame amplitude difference vector (FADV) reflecting differences between amplitudes captured by microphones (201, 202, 203) of the array during recording the short time frame data; a similarity evaluator (102; 402; 602; 802) which evaluates similarity between the frame amplitude difference vector (FADV) and each of a plurality of reference frame amplitude difference vectors (RFADV), each of the plurality of reference frame amplitude difference vectors (RFADV) reflecting differences between amplitudes captured by microphones (201, 202, 203) of the array during recording sound from one of a plurality of candidate locations; an estimator (103; 403; 603; 803) which estimates a desired location of sound source based at least on the candidate locations and associated similarity; wherein the vector calculator (101; 401; 601; 801) is configured to calculate the frame amplitude difference vector (FADV) based on eigenvector analysis on spectral parameters of respective channels in the short time frame data or on a covariance matrix including time domain sample values of respective channels in the short time frame data. The apparatus according to claim 8, further comprising:
a possibility evaluator (606; 806) which evaluates possibility that each of a plurality of possible locations is the desired location according to an audio localization method based on time difference, and wherein the estimator is further configured to estimate the desired location based on the candidate locations, the similarity, the possible locations and the possibility. The apparatus according to claim 8, wherein the vector calculator (101; 401; 601; 801) is configured to calculate the frame amplitude difference vector (FADV) according to one of the following methods: a method based on time domain sample values, and a method based on frequency domain parameters. The apparatus according to claim 8, wherein the vector calculator (101; 401; 601; 801) is configured to calculate the frame amplitude difference vector (FADV) adaptively based on the short time frame data and its previous short time frame data. The apparatus according to claim 8, further comprising:
a reference vector calculator (405) which calculates the plurality of reference frame amplitude difference vectors (RFADV) based on audio data obtained by capturing sound originated from the candidate locations respectively through the array. The apparatus according to claim 8, further comprising:
a reference vector calculator (405) which calculates the plurality of reference frame amplitude difference vectors (RFADV) based on sensitivity of the microphones (201, 202, 203) of the array to sound originated from the candidate locations. The apparatus according to claim 8, wherein the array comprises three cardioid microphones (201, 202, 203) orientated in directions of 0 degrees, -120 degrees and -240 degrees respectively in a plane. A computer-readable medium having computer program instructions recorded thereon for enabling a processor to perform the steps of a method of sound source localization according to any one of claims 1 to 7.
Technical Field
Background
Summary
Brief Description of Drawings
Detailed Description
Calculating the FADV
Method based on energy and time domain sample values
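A minimal sketch of this variant, under the assumption (not fixed by the text here) that the FADV entry for each channel is derived from the frame energy of that channel's time-domain samples and then normalized so that only the relative amplitude differences between microphones remain:

```python
import numpy as np

def fadv_energy_time_domain(frame):
    """Hypothetical energy-based FADV: one entry per channel, taken
    from the RMS energy of that channel's time-domain samples in the
    frame, normalized to unit length. `frame` has shape
    (n_channels, n_samples)."""
    energies = np.sqrt(np.mean(frame ** 2, axis=1))  # per-channel RMS
    return energies / np.linalg.norm(energies)       # unit-norm FADV

# Toy usage: channel 0 records the same signal at twice the amplitude.
t = np.linspace(0.0, 1.0, 1024, endpoint=False)
sig = np.sin(2 * np.pi * 5 * t)
frame = np.vstack([2.0 * sig, sig, sig])   # shape (3 channels, 1024)
fadv = fadv_energy_time_domain(frame)
```

The unit normalization makes the vector independent of the overall loudness of the source, which is what lets it be compared against location-dependent reference vectors.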
Method based on eigenvector and time domain sample values
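Claim 1 describes calculating the FADV from a covariance matrix of the time-domain sample values of the channels, via eigenvector analysis. A minimal sketch of that idea (function names and the sign convention are assumptions):

```python
import numpy as np

def fadv_eigenvector_time_domain(frame):
    """Hypothetical covariance/eigenvector FADV: build the
    channel-by-channel covariance matrix from the time-domain samples
    and take the eigenvector of the largest eigenvalue as the FADV.
    `frame` has shape (n_channels, n_samples)."""
    cov = frame @ frame.T / frame.shape[1]   # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues ascending
    v = eigvecs[:, -1]                       # largest eigenvector
    return v if v.sum() >= 0 else -v         # resolve sign ambiguity

# Toy usage: a rank-one frame whose channel gains are (2, 1, 1).
frame = np.outer(np.array([2.0, 1.0, 1.0]),
                 np.sin(np.linspace(0.0, 20.0, 512)))
fadv = fadv_eigenvector_time_domain(frame)
```

For a single dominant source the covariance matrix is close to rank one, so its largest eigenvector recovers the relative channel amplitudes while averaging out uncorrelated noise.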
Method based on energy and frequency domain parameters
Method based on eigenvector and frequency domain parameters
Calculating the FADV adaptively
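Claim 4 states that the FADV may be calculated adaptively from the current short time frame and its previous frames. One plausible sketch, assuming recursive smoothing of the channel covariance matrix with a forgetting factor (the factor and the update rule are assumptions, not taken from the text):

```python
import numpy as np

def update_covariance(prev_cov, frame, forget=0.9):
    """Hypothetical adaptive update: smooth the channel covariance
    matrix recursively over successive short time frames using an
    assumed forgetting factor."""
    cov = frame @ frame.T / frame.shape[1]
    if prev_cov is None:
        return cov
    return forget * prev_cov + (1.0 - forget) * cov

def fadv_from_covariance(cov):
    """FADV as the largest eigenvector of the smoothed covariance."""
    _, eigvecs = np.linalg.eigh(cov)   # eigenvalues ascending
    v = eigvecs[:, -1]                 # largest eigenvector
    return v if v.sum() >= 0 else -v   # resolve sign ambiguity

# Toy usage: the same rank-one frame arriving in two successive frames.
frame = np.outer(np.array([2.0, 1.0, 1.0]),
                 np.sin(np.linspace(0.0, 20.0, 512)))
cov = update_covariance(None, frame)
cov = update_covariance(cov, frame)
fadv = fadv_from_covariance(cov)
```

Smoothing across frames stabilizes the estimate when individual short frames are noisy, at the cost of slower reaction to a moving source.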
Here θ represents a candidate location.
Generation of RFADVs
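The claims describe generating RFADVs from the sensitivity of the microphones to sound from each candidate location, for an array of three cardioid microphones oriented at 0, -120 and -240 degrees. A sketch under the assumption of an idealized cardioid pattern 0.5 * (1 + cos(theta - mic_angle)) (the pattern model and function names are illustrative, not from the text):

```python
import numpy as np

MIC_ANGLES = np.deg2rad([0.0, -120.0, -240.0])   # cardioid orientations

def rfadv_for_angle(theta_deg):
    """Hypothetical reference FADV for one candidate angle, from an
    idealized cardioid sensitivity per microphone, normalized like
    the frame FADV."""
    theta = np.deg2rad(theta_deg)
    sens = 0.5 * (1.0 + np.cos(theta - MIC_ANGLES))
    return sens / np.linalg.norm(sens)

def best_candidate(fadv, candidate_angles_deg):
    """Pick the candidate whose RFADV has the highest cosine
    similarity to the frame FADV; both are unit-norm, so a dot
    product suffices."""
    refs = np.stack([rfadv_for_angle(a) for a in candidate_angles_deg])
    return candidate_angles_deg[int(np.argmax(refs @ fadv))]

# Toy usage: a frame FADV synthesized from the 40-degree reference.
candidates = np.arange(0, 360)
estimated = best_candidate(rfadv_for_angle(40.0), candidates)
```

Because the three cardioid responses vary differently with angle, the normalized sensitivity vector is distinct for every direction in the plane, so the similarity search has a unique best match.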