The present invention discloses a binaural sound source localization method and device based on weighted template matching. In the training stage, binaural cross-correlation functions and binaural intensity differences in different directions are first extracted from training data, and templates are established for the extracted features in each direction; the weight values of the different features and frequency bands are then trained by gradient descent. In the online localization stage, features are likewise first extracted from the signal, and the extracted features are then matched for similarity against the templates of each direction over the different features and frequency bands. Finally, the final sound source direction similarity is obtained by weighted fusion of the similarities of the different features and frequency bands, and the direction with the maximum similarity is taken as the sound source direction. Experiments carried out in different types of noise environments show that the present invention can resist noise interference to a certain extent and achieve angular localization of the sound source.
Description

A binaural sound source localization method and device based on weighted template matching

Technical Field
The invention belongs to the field of information technology, and relates to a binaural sound source localization method applied in speech perception and speech enhancement, and in particular to a binaural sound source localization method and device based on weighted template matching.
Technical Background
Human-computer interaction plays an increasingly important role in the field of robotics, as it makes communication between people and machines more convenient, efficient and friendly. In daily life, people perceive external information mainly through vision, hearing, touch, smell and taste. Humans obtain about 70%-80% of their information through vision and about 10%-20% through hearing. Auditory perception is one of the most natural, convenient and effective ways for people to communicate with the outside world. Moreover, compared with visual signals, auditory signals cover a 360-degree field, are not affected by lighting, and do not require an unobstructed path between the sound source and the microphone. Robot hearing is therefore one of the important routes to human-computer interaction. Robot hearing mainly includes sound source localization and tracking, speech denoising, speech enhancement, speech separation, speaker recognition, speech recognition, and speech emotion recognition. Among these, sound source localization, as a front-end task of robot hearing, can provide spatial position information of speech to assist other speech tasks. Robot sound source localization has thus become an important component of robot auditory systems.
Speech separation stems from the famous "cocktail party" problem, i.e., the human ability to focus on one person's voice amid many conversations and background noise, which has long been considered a challenging problem in speech separation. Combining sound source localization with speech separation to obtain the direction of the sound source helps separate aliased speech and improves recognition accuracy for speech from the direction of interest. In video conferencing, the camera can be steered in real time toward the speaker according to the microphone localization result. In video surveillance, the camera angle can be adjusted according to the sound source direction, expanding the monitoring range and improving surveillance coverage.
According to the number of microphones and whether an artificial head with pinna structures is used, sound source localization technology can be roughly divided into localization based on microphone arrays and localization based on binaural microphones. Binaural localization plays an important role in the field of humanoid robots: it makes full use of the diffraction of sound by the pinna structures and simulates human auditory characteristics. Robot binaural sound source localization uses only two microphones, mounted on the left and right sides of the robot's head. Compared with localization using a plain two-microphone array, binaural localization better simulates human hearing because of the diffraction of sound signals by the pinnae and artificial head, is better suited to humanoid robots, hearing-aid speech enhancement, virtual reality and similar scenarios, and eliminates the front-back ambiguity of two-microphone localization.
Binaural sound source localization mainly includes the following steps:
1. Simulation and recording of binaural signals. Convolve binaural impulse responses with clean source signals to obtain simulated binaural signals, or directly record binaural signals as real signals.
2. Analog-to-digital conversion and pre-filtering of the signal. The analog signal is first pre-filtered: a high-pass filter removes the 50 Hz mains noise, and a low-pass filter removes frequency components above half the sampling frequency to prevent aliasing. The analog signal is then sampled and quantized to obtain a digital signal.
3. Pre-emphasis. The signal is passed through a high-frequency emphasis filter with transfer function H(z) = 1 - 0.95z^-1 to compensate for the high-frequency attenuation caused by lip radiation (illustrated in the code sketch following step 6 below).
4. Framing and windowing. Speech signals are time-varying, but the articulatory muscles move relatively slowly, so speech is generally considered stationary over short intervals of about 10 ms-30 ms. The signal is therefore split into frames at such intervals, for example one frame every 20 ms. To reduce the spectral leakage introduced by framing, each frame is windowed; common windows include the rectangular, Hanning and Hamming windows, of which the Hamming window is the most widely used (also illustrated in the sketch following step 6).
5. Feature extraction. Binaural features containing sound source direction information are extracted from each frame. Features commonly used in binaural localization include the interaural cross-correlation function (CCF), the interaural time difference (ITD) and the interaural intensity difference (IID). Since many methods for estimating the interaural time difference are themselves based on the cross-correlation function, the present invention uses the cross-correlation function and the interaural intensity difference as features.
6. Localization. The extracted features are mapped to candidate directions such that the posterior probability of the true source direction is maximized. Many mapping methods exist, such as Gaussian mixture models and neural network models; the present invention uses a method based on weighted template matching.
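As a concrete illustration of steps 3 and 4 above, the following minimal sketch implements pre-emphasis and Hamming-windowed framing; the 20 ms frame length and 10 ms hop are example values consistent with the stated 10 ms-30 ms stationarity assumption:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.95) -> np.ndarray:
    """Apply the pre-emphasis filter H(z) = 1 - alpha * z^-1."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def frame_and_window(x: np.ndarray, fs: int, frame_ms: float = 20.0,
                     hop_ms: float = 10.0) -> np.ndarray:
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes len(x) >= one frame length.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])  # (n_frames, frame_len)
```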
Traditional methods based on Gaussian mixture models and on neural network models compute the sound source direction in each frequency band separately and simply sum the results, without considering the reliability of different frequency bands or of different features. In addition, methods based on neural network models lack interpretability.
Summary of the Invention
In view of the above problems, the purpose of the present invention is to provide an interpretable binaural sound source localization method and device based on weighted template matching, which calculates the likelihood of the sound source lying in each direction in each frequency band separately, integrates the results through the weights of different frequency bands and different features, and obtains the final sound source direction.
In order to achieve the above object, the present invention adopts the following technical solutions:
A binaural sound source localization method based on weighted template matching comprises the following steps:
Extract binaural cross-correlation functions and binaural intensity differences in different directions from the training data;
Establish templates for the extracted binaural cross-correlation functions and binaural intensity differences in each direction;
Train weights for the different binaural localization features and different frequency bands;
During online localization, extract the binaural cross-correlation function and binaural intensity difference of the sound source signal, match them for similarity against the templates of each direction, and fuse the similarities of different features and frequency bands through the trained weights to localize the sound source.
Furthermore, the binaural localization features in different directions are extracted from the training data by convolving binaural impulse responses with clean speech signals or by directly using recorded sound signals, and the cross-correlation function and binaural intensity difference are calculated for all directions; here the different directions are different horizontal steering angles, which are divided in a non-uniform manner.
Furthermore, the steering angles are divided as follows: [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°].
Furthermore, the templates for the extracted binaural cross-correlation function and binaural intensity difference in each direction are established by taking the average of the binaural localization features extracted from multiple noise-free speech frames emanating from the same direction as the template for that direction.
Furthermore, the weights of the different binaural localization features and frequency bands are trained by back propagation with the loss function set to a squared loss, so that the similarity with the template of the true direction is maximized and the similarity with the templates of other directions is as small as possible.
Furthermore, the similarity is calculated using the following formula:

sim(θ) = ∑_i [ ω_ccf,i · sim_ccf,i(θ) + ω_iid,i · sim_iid,i(θ) ]
where sim(θ) denotes the weighted similarity, ω_ccf,i denotes the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) denotes the cosine similarity between the cross-correlation function in the i-th frequency band and the template of direction θ, ω_iid,i denotes the weight of the binaural intensity difference in the i-th frequency band, and sim_iid,i(θ) denotes the similarity between the binaural intensity difference in the i-th frequency band and the template of direction θ.
A binaural sound source localization device based on weighted template matching using the above method comprises:
A training module, which extracts binaural cross-correlation functions and binaural intensity differences in different directions from training data, establishes templates for the extracted features in each direction, and then trains the weights of the different binaural localization features and frequency bands;
An online localization module, which extracts the binaural cross-correlation function and binaural intensity difference of the sound source signal, matches them for similarity against the templates of each direction, and fuses the similarities of different features and frequency bands through the trained weights to localize the sound source.
The beneficial effects of the present invention are:
The present invention calculates the likelihood of the sound source lying in each direction in each frequency band separately, integrates the results through the weights of different frequency bands and different features, and obtains the final sound source direction; it can resist noise interference to a certain extent and achieve angular localization of the sound source.
Brief Description of the Drawings
FIG. 1 is an overall flow chart of the method of the present invention.
FIG. 2 is an example of the features extracted by the present invention, where (a) shows the extracted binaural cross-correlation function features and (b) shows the extracted binaural intensity difference features.
FIG. 3 is an example of similarity calculation between a sound source signal and the templates of each direction. The upper part shows the similarity between the cross-correlation function of the sound source signal and each direction template; the lower part shows the similarity between the binaural intensity difference of the sound source signal and each direction template. The horizontal axis shows the different directions.
FIG. 4 shows the final trained weights of the present invention. The two broken lines contain 64 points in total, representing the weights of the binaural cross-correlation function and of the binaural intensity difference in each frequency band.
Detailed Description of the Embodiments
The technical solution of the present invention will be clearly and completely described below in conjunction with the embodiments and drawings.
FIG. 1 is a flow chart of the binaural sound source localization method based on weighted template matching of the present invention, which comprises the following steps:
1) Data preparation stage: simulate binaural signals from each direction and provide the original sound source signals.
1.1) The front half-plane of the artificial head is divided into 25 different horizontal steering angles. For example, the steering angles are divided in a non-uniform manner: [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°], where -45°:5°:45° means one angle every 5 degrees.
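For illustration, this 25-angle grid can be constructed as follows (a minimal sketch mirroring the non-uniform division above):

```python
import numpy as np

# Non-uniform horizontal steering angles in degrees: coarse steps at
# the sides, 5-degree steps across the frontal -45..45 degree range.
angles = np.concatenate(([-80, -65, -55],
                         np.arange(-45, 50, 5),   # -45:5:45
                         [55, 65, 80]))
assert len(angles) == 25
```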
1.2) Training and test data are constructed by combining the clean speech signals provided by the TIMIT database, the binaural impulse responses provided by the CIPIC database, and the different types of noise signals provided by the NOISEX-92 database. The training data use no noise; the test data use noise at different signal-to-noise ratios, with test signals from -10 dB to 35 dB used in the experiments.
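The data construction can be sketched as follows (a minimal illustration; the function and variable names are assumptions, and the HRIR arrays would come from the CIPIC set):

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_binaural(speech: np.ndarray, hrir_l: np.ndarray,
                      hrir_r: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Render clean speech at one direction via HRIR convolution."""
    return fftconvolve(speech, hrir_l), fftconvolve(speech, hrir_r)

def add_noise(x: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into a signal at a given SNR in dB (test data only)."""
    noise = noise[:len(x)]
    p_x = np.mean(x ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise
```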
2) Training stage: binaural cross-correlation function and binaural intensity difference data are extracted, templates are established for the cross-correlation function (CCF) and binaural intensity difference (IID), and the weights corresponding to the different features and frequency bands are trained so that the similarity with the template of the true source direction is maximized and the similarity with the templates of other directions is minimized. The cross-correlation function and binaural intensity difference templates for all directions can be calculated by convolving binaural impulse responses (the time-domain form of the HRTF) with clean speech signals or by directly using recorded sound signals.
2.1) A 4th-order, 32-channel gammatone filterbank is used to split the signal carrying direction information into frequency bands, with the maximum center frequency set to 7200 Hz.
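Gammatone channels are commonly placed on the ERB-rate scale (Glasberg and Moore); the sketch below computes 32 ERB-spaced center frequencies, assuming the 80 Hz lower limit given in the embodiment further down:

```python
import numpy as np

def erb_space(f_low: float, f_high: float, n: int) -> np.ndarray:
    """ERB-rate-spaced center frequencies (Glasberg & Moore formulation)."""
    ear_q, min_bw = 9.26449, 24.7
    # Map the limits onto the ERB-rate scale, space n points uniformly
    # there, then map back to Hz.
    erb = lambda f: ear_q * np.log(1.0 + f / (ear_q * min_bw))
    inv = lambda e: (np.exp(e / ear_q) - 1.0) * ear_q * min_bw
    return inv(np.linspace(erb(f_low), erb(f_high), n))

center_freqs = erb_space(80.0, 7200.0, 32)  # one per gammatone channel
```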
2.2)仿 åªå£°çè®ç»æ°æ®ä¸æåäºç¸å ³å½æ°(CCF)ååè³å¼ºåº¦å·®(IID)ï¼ç»¼åå¤å¸§æ°æ®çå¹³åå¼å»ºç«æ¨¡æ¿ï¼å³å°å¤å¸§ä»å䏿¹åååºçæ åªå£°è¯é³å¸§ä¸æåçåè³å®ä½ç¹å¾å¹³åå¼ä½ä¸ºè¯¥æ¹åçæ¨¡æ¿ã2.2) Extract the cross-correlation function (CCF) and interaural intensity difference (IID) from the noise-free training data, and establish a template by combining the average values of multiple frames of data. That is, the average value of binaural localization features extracted from multiple frames of noise-free speech frames emitted from the same direction is used as the template for that direction.
The normalized cross-correlation function in each band is calculated as:

CCF(i, τ) = G_l,r(i, τ) / sqrt( G_l,l(i, τ_0) · G_r,r(i, τ_0) )

where

G_p,q(i, τ) = ∑_n x_p(i, n) · x_q(i, n + τ),   p, q ∈ {l, r}
where l and r denote the left and right ears respectively, i denotes the frequency band, n indexes the samples within a frame, and τ denotes the time lag; when p and q take the value l or r, x_p and x_q denote the signal received by the left or right ear; τ_0 denotes the lag 0.
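A minimal sketch of this computation for one band-filtered frame (the plus or minus 1.1 ms lag range and 16 kHz sampling rate are the values used in the embodiment below, giving a 37-point CCF):

```python
import numpy as np

def normalized_ccf(xl: np.ndarray, xr: np.ndarray,
                   max_lag: int = 18) -> np.ndarray:
    """Normalized interaural cross-correlation over lags -max_lag..max_lag."""
    n = len(xl)
    # G_ll(i, 0) and G_rr(i, 0) form the normalization denominator.
    norm = np.sqrt(np.dot(xl, xl) * np.dot(xr, xr)) + 1e-12
    ccf = np.empty(2 * max_lag + 1)
    for k, tau in enumerate(range(-max_lag, max_lag + 1)):
        if tau >= 0:
            ccf[k] = np.dot(xl[:n - tau], xr[tau:])   # G_lr(i, tau)
        else:
            ccf[k] = np.dot(xl[-tau:], xr[:n + tau])
    return ccf / norm
```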
The binaural intensity difference is calculated as:

IID(i) = 10 · log10( ∑_n x_l(i, n)² / ∑_n x_r(i, n)² )
where x_l denotes the signal received by the left ear and x_r the signal received by the right ear.
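A corresponding sketch, assuming the energy-ratio form reconstructed above:

```python
import numpy as np

def iid_db(xl: np.ndarray, xr: np.ndarray) -> float:
    """Interaural intensity difference of one band-filtered frame, in dB."""
    eps = 1e-12  # keeps the logarithm finite for silent frames
    return float(10.0 * np.log10((np.dot(xl, xl) + eps) /
                                 (np.dot(xr, xr) + eps)))
```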
2.3) One-hot labels are assigned to the signals of the 25 directions; for example, the label of the -80 degree direction is set to [1, 0, ..., 0], with 24 zeros. For each training frame, the similarity to the templates is computed in every frequency band and for every feature, yielding a (2×32)×25 similarity matrix ((number of features × number of bands) × number of candidate directions). The aim is to weight this matrix so that the similarity of the candidate direction corresponding to the true source is maximized. The weight matrix is 1×64, the similarity matrix is 64×25, and their product is the 1×25 matrix sim(θ):
sim(θ) = ∑_i [ ω_ccf,i · sim_ccf,i(θ) + ω_iid,i · sim_iid,i(θ) ]

where sim(θ) denotes the weighted similarity, ω_ccf,i the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) the cosine similarity between the cross-correlation function in the i-th frequency band and the template of direction θ, ω_iid,i the weight of the binaural intensity difference in the i-th frequency band, and sim_iid,i(θ) the similarity between the binaural intensity difference in the i-th frequency band and the template of direction θ. The cosine similarity is calculated as:

sim_ccf,i(θ) = ∑_τ R_temp(θ, i, τ) · R_l,r(i, τ) / ( ‖R_temp(θ, i, ·)‖ · ‖R_l,r(i, ·)‖ )
where R_l,r(i, τ) denotes the target cross-correlation function computed from the received signal in frequency band i, and R_temp(θ, i, τ) denotes the cross-correlation function template of frequency band i at angle θ.
The binaural intensity difference similarity sim_iid,i(θ) is obtained by comparing the IID of the current frame against the template IID of direction θ in band i, using the following quantities:
where i denotes the frequency band index, temp denotes the template, θ denotes the direction, iid_temp,θ,i denotes the binaural intensity difference template of the i-th frequency band for direction θ, and iid_i denotes the binaural intensity difference of the i-th frequency band computed from the current test signal.
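The matching step can be sketched as follows. The CCF part implements the cosine similarity above; since the text does not reproduce the IID similarity formula, the IID part uses a Gaussian kernel on the dB difference purely as an illustrative stand-in, with an assumed bandwidth parameter:

```python
import numpy as np

def ccf_similarity(ccf: np.ndarray, ccf_temp: np.ndarray) -> float:
    """Cosine similarity between a frame CCF and a direction template."""
    denom = np.linalg.norm(ccf) * np.linalg.norm(ccf_temp) + 1e-12
    return float(np.dot(ccf, ccf_temp) / denom)

def iid_similarity(iid: float, iid_temp: float, sigma: float = 5.0) -> float:
    """Illustrative IID similarity: Gaussian kernel on the dB difference.

    The patent does not reproduce its IID similarity formula; this form
    and the sigma value are assumptions, not values from the source.
    """
    return float(np.exp(-((iid - iid_temp) ** 2) / (2.0 * sigma ** 2)))
```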
2.4) The weights ω_ccf,i and ω_iid,i are trained by back propagation.
The loss function is set to the squared loss L = ∑_θ ( y(θ) - sim(θ) )², where y is the one-hot label above. The weights for the two binaural features and all frequency bands are trained simultaneously, and the trained weights are intuitively interpretable.
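Because the fused similarity is linear in the weights (sim = w·S, with S the 64×25 per-frame similarity matrix), the squared loss has a simple closed-form gradient; a minimal gradient-descent sketch under that reading, with assumed learning rate and initialization:

```python
import numpy as np

def train_weights(S: np.ndarray, Y: np.ndarray, lr: float = 0.01,
                  epochs: int = 100) -> np.ndarray:
    """Train the 64 fusion weights by gradient descent on the squared loss.

    S: (n_frames, 64, 25) similarities of each frame's 2x32 feature/band
       combinations to the 25 direction templates.
    Y: (n_frames, 25) one-hot direction labels.
    """
    w = np.full(64, 1.0 / 64)                # uniform init (an assumption)
    for _ in range(epochs):
        pred = np.einsum('k,nkd->nd', w, S)  # sim(theta) = w . S per frame
        grad = 2.0 * np.einsum('nd,nkd->k', pred - Y, S) / len(S)
        w -= lr * grad
    return w
```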
3) Testing stage: the collected signal is first split into frequency bands by the gammatone filterbank; the cross-correlation function and binaural intensity difference features are then extracted from each band signal; the similarity between these features and the template features of all directions is computed for every feature and frequency band; finally, the weights obtained in the training stage are applied to obtain the likelihood of the sound source coming from each direction. That is, the final sound source direction similarity is obtained by weighted fusion of the similarities of the different features and frequency bands, and the direction with the maximum similarity is taken as the sound source direction.
A specific application example is provided below. This example uses the binaural impulse responses recorded with artificial head 003 of the CIPIC database, which divides the horizontal angle into 25 different angles and the elevation angle into 50 different angles, and can thus simulate signals from different directions in a real environment. This example uses the 25 binaural impulse responses in the horizontal plane to localize the horizontal angle. The sound source signals are real human speech taken from the TIMIT database. Convolving a speech signal with a binaural impulse response realistically simulates the noise-free signal received by the human ears; adding noise from the different environments recorded in the NOISEX-92 database to the binaural signal realistically simulates the signal received by the human ears in different types of noise environments.
In the training stage, the prepared data are first pre-emphasized, framed and windowed, and then passed through the 4th-order, 32-band gammatone filterbank (minimum center frequency 80 Hz, maximum center frequency 7200 Hz) to obtain signals in 32 frequency bands. The cross-correlation function is then extracted with the formula above; considering that the maximum interaural time difference does not exceed plus or minus 1.1 milliseconds, and given the 16 kHz sampling rate, only 37 lags of the cross-correlation function are retained. At the same time, the binaural intensity difference is extracted with its formula, completing the feature extraction for the frame (as shown in FIG. 2). The average of the binaural localization features extracted from multiple noise-free speech frames emanating from the same direction is used as the template for that direction. Finally, the similarity between the localization features of each frame and the templates of each direction is computed, giving 64 similarities per candidate direction (as shown in FIG. 3), which are weighted to obtain the final direction similarity. Using the given similarity labels (i.e., the one-hot labels), the weights are adjusted by back propagation (as shown in FIG. 4).
In the testing stage, the prepared data are framed and windowed, and passed through the same 4th-order, 32-band gammatone filterbank (minimum center frequency 80 Hz, maximum center frequency 7200 Hz) to obtain signals in 32 frequency bands. The cross-correlation function is extracted as above, again retaining only 37 lags given the plus or minus 1.1 millisecond maximum interaural time difference and the 16 kHz sampling rate; the binaural intensity difference is extracted at the same time, completing the feature extraction for the frame (as shown in FIG. 2). The similarity between the localization features of the test signal and the templates of each direction is then computed, giving 64 similarities per candidate direction (as shown in FIG. 3), which are weighted to obtain the final direction similarity. The direction with the maximum similarity is selected as the sound source direction.
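Putting the pieces together, the per-frame decision reduces to weighting the 64×25 similarity matrix and taking the argmax; a sketch assuming the helpers above:

```python
import numpy as np

def localize(S_frame: np.ndarray, w: np.ndarray, angles: np.ndarray) -> float:
    """Pick the sound source direction for one frame.

    S_frame: (64, 25) similarities of the frame's 2x32 feature/band
             combinations to each direction template.
    w:       (64,) trained fusion weights.
    angles:  (25,) candidate steering angles in degrees.
    """
    sim = w @ S_frame                 # 1x64 @ 64x25 -> 25 fused similarities
    return float(angles[np.argmax(sim)])
```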
The training stage uses noise-free signals; the testing stage uses different types of noise at different signal-to-noise ratios, from -10 dB to 35 dB in 5 dB steps.
Experimental results show that the method of the present invention can resist noise interference to a certain extent and achieve angular localization of the sound source.
Based on the same inventive concept, another embodiment of the present invention provides a binaural sound source localization device based on weighted template matching using the above method, which includes:
A training module, which extracts binaural cross-correlation functions and binaural intensity differences in different directions from training data, establishes templates for the extracted features in each direction, and then trains the weights of the different binaural localization features and frequency bands;
An online localization module, which extracts the binaural cross-correlation function and binaural intensity difference of the sound source signal, matches them for similarity against the templates of each direction, and fuses the similarities of different features and frequency bands through the trained weights to localize the sound source.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.), which includes a memory and a processor; the memory stores a computer program configured to be executed by the processor, and the computer program includes instructions for executing the steps of the method of the present invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) storing a computer program which, when executed by a computer, implements the steps of the method of the present invention.
It should be understood that the embodiments described above are only some embodiments of the present invention, rather than all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present invention.
Claims

1. A binaural sound source localization method based on weighted template matching, characterized by comprising the following steps:
extracting binaural cross-correlation functions and binaural intensity differences in different directions from the training data;
establishing templates for the extracted binaural cross-correlation functions and binaural intensity differences in all directions;
training weights of different binaural localization features and different frequency bands;
during on-line positioning, extracting a binaural cross-correlation function and a binaural intensity difference of a sound source signal, performing similarity matching against the templates in all directions, and fusing the similarities of different features and different frequency bands through the weights obtained through training, to realize sound source positioning;
wherein the weights of the different binaural positioning features and different frequency bands are trained by a back propagation method, with the loss function set as a squared loss, so that the similarity with the template of the same direction is maximized and the similarity with templates of different directions is as small as possible;
wherein the similarity is calculated using the following formula:

sim(θ) = ∑_i [ ω_ccf,i · sim_ccf,i(θ) + ω_iid,i · sim_iid,i(θ) ]
wherein sim(θ) represents a weighted similarity matrix, ω_ccf,i represents the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) represents the cosine similarity of the cross-correlation function in the i-th frequency band to the template in the direction θ, ω_iid,i represents the weight of the binaural intensity difference in the i-th frequency band, and sim_iid,i(θ) represents the similarity of the binaural intensity difference in the i-th frequency band to the template in the direction θ;
wherein the calculation formulas of sim_ccf,i(θ) and sim_iid,i(θ) are:

sim_ccf,i(θ) = ∑_τ R_temp(θ, i, τ) · R_l,r(i, τ) / ( ‖R_temp(θ, i, ·)‖ · ‖R_l,r(i, ·)‖ )
where R_l,r(i, τ) represents the target cross-correlation function calculated from the received signal in frequency band i, and R_temp(θ, i, τ) represents the cross-correlation function template for frequency band i at angle θ;
where i denotes the frequency band index, temp denotes the template, θ denotes the direction, iid_temp,θ,i denotes the binaural intensity difference template corresponding to the i-th frequency band in the direction of angle θ, and iid_i denotes the binaural intensity difference of the i-th frequency band currently calculated from the test signal.
2. The method of claim 1, wherein the binaural localization features in different directions are extracted from the training data by convolving binaural impulse responses with clean speech signals or by directly using recorded sound signals, and calculating the cross-correlation function and binaural intensity difference in all directions; wherein the different directions are different horizontal steering angles, and the steering angles are divided in a non-uniform way.
3. The method according to claim 1, wherein the steering angles are divided in the following manner: [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°].
4. The method of claim 1, wherein establishing templates for the extracted binaural cross-correlation function and binaural intensity difference in each direction comprises taking the average of the binaural localization features extracted from a plurality of noiseless speech frames emanating from the same direction as the template for that direction.
5. A weighted template matching based binaural sound source localization device employing the method of any one of claims 1-4, comprising:
a training module, used to extract binaural cross-correlation functions and binaural intensity differences in different directions from the training data, establish templates for the extracted binaural cross-correlation functions and binaural intensity differences in all directions, and then train weights of different binaural positioning features and different frequency bands;
and an on-line positioning module, used to extract the binaural cross-correlation function and the binaural intensity difference of the sound source signal, match them for similarity against the templates in all directions, and fuse the similarities of different features and different frequency bands through the weights obtained through training, to realize sound source positioning.
6. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-4.
7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-4.
Publication: CN112731289B (application CN202011456914.0A, filed 2020-12-10, status: Active).