The present invention discloses a binaural sound source localization method and device based on weighted template matching. In the training stage, binaural cross-correlation functions and binaural intensity differences in different directions are first extracted from training data, and templates are established for the extracted features in each direction; the weight values of the different features and frequency bands are then trained by gradient descent. In the online localization stage, features are likewise first extracted from the signal, and the extracted features are then matched for similarity against the templates of each direction over the different features and frequency bands. Finally, the final sound source direction similarity is obtained by weighted fusion of the similarities of the different features and frequency bands, and the direction with the maximum similarity is taken as the sound source direction. Experiments carried out in different types of noise environments show that the present invention can resist noise interference to a certain extent and achieve angular localization of the sound source.
Description

A binaural sound source localization method and device based on weighted template matching

Technical Field
The invention belongs to the field of information technology, and relates to a binaural sound source localization method applied in speech perception and speech enhancement, and in particular to a binaural sound source localization method and device based on weighted template matching.
Technical Background
Human-computer interaction plays an increasingly important role in the field of robotics, as it makes communication between people and machines more convenient, efficient and friendly. In daily life, people perceive external information mainly through vision, hearing, touch, smell and taste. Humans obtain about 70%-80% of their information through vision and about 10%-20% through hearing. Auditory perception is one of the most natural, convenient and effective ways for people to communicate with the outside world. Moreover, compared with visual signals, auditory signals cover a 360-degree field, are not affected by lighting, and do not require an unobstructed path between the sound source and the microphone. Robot hearing is therefore one of the important routes to human-computer interaction. Robot hearing mainly includes sound source localization and tracking, speech denoising, speech enhancement, speech separation, speaker recognition, speech recognition, and speech emotion recognition. Among these, sound source localization, as a front-end task of robot hearing, can provide spatial position information of speech to assist other speech tasks. Robot sound source localization has thus become an important component of robot auditory systems.
Speech separation stems from the famous "cocktail party" problem, i.e., the human ability to focus on one person's voice amid many conversations and background noise, which has long been considered a challenging problem in speech separation. Combining sound source localization with speech separation to obtain the direction of the sound source helps separate aliased speech and improves recognition accuracy for speech from the direction of interest. In video conferencing, the camera can be steered in real time toward the speaker according to the microphone localization result. In video surveillance, the camera angle can be adjusted according to the sound source direction, expanding the monitoring range and improving surveillance coverage.
According to the number of microphones and whether an artificial head with pinna structures is used, sound source localization technology can be roughly divided into localization based on microphone arrays and localization based on binaural microphones. Binaural localization plays an important role in the field of humanoid robots: it makes full use of the diffraction of sound by the pinna structures and simulates human auditory characteristics. Robot binaural sound source localization uses only two microphones, mounted on the left and right sides of the robot's head. Compared with localization using a plain two-microphone array, binaural localization better simulates human hearing because of the diffraction of sound signals by the pinnae and artificial head, is better suited to humanoid robots, hearing-aid speech enhancement, virtual reality and similar scenarios, and eliminates the front-back ambiguity of two-microphone localization.
Binaural sound source localization mainly includes the following steps:
1. Simulation and recording of binaural signals. Convolve binaural impulse responses with clean source signals to obtain simulated binaural signals, or directly record binaural signals as real signals.
2. Analog-to-digital conversion and pre-filtering of the signal. The analog signal is first pre-filtered: a high-pass filter removes the 50 Hz mains noise, and a low-pass filter removes frequency components above half the sampling frequency to prevent aliasing. The analog signal is then sampled and quantized to obtain a digital signal.
3. Pre-emphasis. The signal is passed through a high-frequency emphasis filter with transfer function H(z) = 1 - 0.95z^-1 to compensate for the high-frequency attenuation caused by lip radiation (illustrated in the code sketch following step 6 below).
4. Framing and windowing. Speech signals are time-varying, but the articulatory muscles move relatively slowly, so speech is generally considered stationary over short intervals of about 10 ms-30 ms. The signal is therefore split into frames at such intervals, for example one frame every 20 ms. To reduce the spectral leakage introduced by framing, each frame is windowed; common windows include the rectangular, Hanning and Hamming windows, of which the Hamming window is the most widely used (also illustrated in the sketch following step 6).
5. Feature extraction. Binaural features containing sound source direction information are extracted from each frame. Features commonly used in binaural localization include the interaural cross-correlation function (CCF), the interaural time difference (ITD) and the interaural intensity difference (IID). Since many methods for estimating the interaural time difference are themselves based on the cross-correlation function, the present invention uses the cross-correlation function and the interaural intensity difference as features.
6. Localization. The extracted features are mapped to candidate directions such that the posterior probability of the true source direction is maximized. Many mapping methods exist, such as Gaussian mixture models and neural network models; the present invention uses a method based on weighted template matching.
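As a concrete illustration of steps 3 and 4 above, the following minimal sketch implements pre-emphasis and Hamming-windowed framing; the 20 ms frame length and 10 ms hop are example values consistent with the stated 10 ms-30 ms stationarity assumption:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.95) -> np.ndarray:
    """Apply the pre-emphasis filter H(z) = 1 - alpha * z^-1."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y

def frame_and_window(x: np.ndarray, fs: int, frame_ms: float = 20.0,
                     hop_ms: float = 10.0) -> np.ndarray:
    """Split a signal into overlapping frames and apply a Hamming window.

    Assumes len(x) >= one frame length.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * window
                     for i in range(n_frames)])  # (n_frames, frame_len)
```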
Traditional methods based on Gaussian mixture models and on neural network models compute the sound source direction in each frequency band separately and simply sum the results, without considering the reliability of different frequency bands or of different features. In addition, methods based on neural network models lack interpretability.
Summary of the Invention
In view of the above problems, the purpose of the present invention is to provide an interpretable binaural sound source localization method and device based on weighted template matching, which calculates the likelihood of the sound source lying in each direction in each frequency band separately, integrates the results through the weights of different frequency bands and different features, and obtains the final sound source direction.
In order to achieve the above object, the present invention adopts the following technical solutions:
A binaural sound source localization method based on weighted template matching comprises the following steps:
Extract binaural cross-correlation functions and binaural intensity differences in different directions from the training data;
Establish templates for the extracted binaural cross-correlation functions and binaural intensity differences in each direction;
Train weights for the different binaural localization features and different frequency bands;
During online localization, extract the binaural cross-correlation function and binaural intensity difference of the sound source signal, match them for similarity against the templates of each direction, and fuse the similarities of different features and frequency bands through the trained weights to localize the sound source.
Furthermore, the binaural localization features in different directions are extracted from the training data by convolving binaural impulse responses with clean speech signals or by directly using recorded sound signals, and the cross-correlation function and binaural intensity difference are calculated for all directions; here the different directions are different horizontal steering angles, which are divided in a non-uniform manner.
Furthermore, the steering angles are divided as follows: [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°].
Furthermore, the templates for the extracted binaural cross-correlation function and binaural intensity difference in each direction are established by taking the average of the binaural localization features extracted from multiple noise-free speech frames emanating from the same direction as the template for that direction.
Furthermore, the weights of the different binaural localization features and frequency bands are trained by back propagation with the loss function set to a squared loss, so that the similarity with the template of the true direction is maximized and the similarity with the templates of other directions is as small as possible.
Furthermore, the similarity is calculated using the following formula:

sim(θ) = ∑_i [ ω_ccf,i · sim_ccf,i(θ) + ω_iid,i · sim_iid,i(θ) ]
where sim(θ) denotes the weighted similarity, ω_ccf,i denotes the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) denotes the cosine similarity between the cross-correlation function in the i-th frequency band and the template of direction θ, ω_iid,i denotes the weight of the binaural intensity difference in the i-th frequency band, and sim_iid,i(θ) denotes the similarity between the binaural intensity difference in the i-th frequency band and the template of direction θ.
A binaural sound source localization device based on weighted template matching using the above method comprises:
A training module, which extracts binaural cross-correlation functions and binaural intensity differences in different directions from training data, establishes templates for the extracted features in each direction, and then trains the weights of the different binaural localization features and frequency bands;
An online localization module, which extracts the binaural cross-correlation function and binaural intensity difference of the sound source signal, matches them for similarity against the templates of each direction, and fuses the similarities of different features and frequency bands through the trained weights to localize the sound source.
The beneficial effects of the present invention are:
The present invention calculates the likelihood of the sound source lying in each direction in each frequency band separately, integrates the results through the weights of different frequency bands and different features, and obtains the final sound source direction; it can resist noise interference to a certain extent and achieve angular localization of the sound source.
Brief Description of the Drawings
FIG. 1 is an overall flow chart of the method of the present invention.
FIG. 2 is an example of the features extracted by the present invention, where (a) shows the extracted binaural cross-correlation function features and (b) shows the extracted binaural intensity difference features.
FIG. 3 is an example of similarity calculation between a sound source signal and the templates of each direction. The upper part shows the similarity between the cross-correlation function of the sound source signal and each direction template; the lower part shows the similarity between the binaural intensity difference of the sound source signal and each direction template. The horizontal axis shows the different directions.
FIG. 4 shows the final trained weights of the present invention. The two broken lines contain 64 points in total, representing the weights of the binaural cross-correlation function and of the binaural intensity difference in each frequency band.
Detailed Description of the Embodiments
The technical solution of the present invention will be clearly and completely described below in conjunction with the embodiments and drawings.
FIG. 1 is a flow chart of the binaural sound source localization method based on weighted template matching of the present invention, which comprises the following steps:
1) Data preparation stage: simulate binaural signals from each direction and provide the original sound source signals.
1.1) The front half-plane of the artificial head is divided into 25 different horizontal steering angles. For example, the steering angles are divided in a non-uniform manner: [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°], where -45°:5°:45° means one angle every 5 degrees.
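For illustration, this 25-angle grid can be constructed as follows (a minimal sketch mirroring the non-uniform division above):

```python
import numpy as np

# Non-uniform horizontal steering angles in degrees: coarse steps at
# the sides, 5-degree steps across the frontal -45..45 degree range.
angles = np.concatenate(([-80, -65, -55],
                         np.arange(-45, 50, 5),   # -45:5:45
                         [55, 65, 80]))
assert len(angles) == 25
```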
1.2) Training and test data are constructed by combining the clean speech signals provided by the TIMIT database, the binaural impulse responses provided by the CIPIC database, and the different types of noise signals provided by the NOISEX-92 database. The training data use no noise; the test data use noise at different signal-to-noise ratios, with test signals from -10 dB to 35 dB used in the experiments.
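The data construction can be sketched as follows (a minimal illustration; the function and variable names are assumptions, and the HRIR arrays would come from the CIPIC set):

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_binaural(speech: np.ndarray, hrir_l: np.ndarray,
                      hrir_r: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Render clean speech at one direction via HRIR convolution."""
    return fftconvolve(speech, hrir_l), fftconvolve(speech, hrir_r)

def add_noise(x: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into a signal at a given SNR in dB (test data only)."""
    noise = noise[:len(x)]
    p_x = np.mean(x ** 2)
    p_n = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_x / (p_n * 10.0 ** (snr_db / 10.0)))
    return x + scale * noise
```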
2) Training stage: binaural cross-correlation function and binaural intensity difference data are extracted, templates are established for the cross-correlation function (CCF) and binaural intensity difference (IID), and the weights corresponding to the different features and frequency bands are trained so that the similarity with the template of the true source direction is maximized and the similarity with the templates of other directions is minimized. The cross-correlation function and binaural intensity difference templates for all directions can be calculated by convolving binaural impulse responses (the time-domain form of the HRTF) with clean speech signals or by directly using recorded sound signals.
2.1) A 4th-order, 32-channel gammatone filterbank is used to split the signal carrying direction information into frequency bands, with the maximum center frequency set to 7200 Hz.
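Gammatone channels are commonly placed on the ERB-rate scale (Glasberg and Moore); the sketch below computes 32 ERB-spaced center frequencies, assuming the 80 Hz lower limit given in the embodiment further down:

```python
import numpy as np

def erb_space(f_low: float, f_high: float, n: int) -> np.ndarray:
    """ERB-rate-spaced center frequencies (Glasberg & Moore formulation)."""
    ear_q, min_bw = 9.26449, 24.7
    # Map the limits onto the ERB-rate scale, space n points uniformly
    # there, then map back to Hz.
    erb = lambda f: ear_q * np.log(1.0 + f / (ear_q * min_bw))
    inv = lambda e: (np.exp(e / ear_q) - 1.0) * ear_q * min_bw
    return inv(np.linspace(erb(f_low), erb(f_high), n))

center_freqs = erb_space(80.0, 7200.0, 32)  # one per gammatone channel
```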
2.2)仿 åªå£°çè®ç»æ°æ®ä¸æåäºç¸å ³å½æ°(CCF)ååè³å¼ºåº¦å·®(IID)ï¼ç»¼åå¤å¸§æ°æ®çå¹³åå¼å»ºç«æ¨¡æ¿ï¼å³å°å¤å¸§ä»å䏿¹åååºçæ åªå£°è¯é³å¸§ä¸æåçåè³å®ä½ç¹å¾å¹³åå¼ä½ä¸ºè¯¥æ¹åçæ¨¡æ¿ã2.2) Extract the cross-correlation function (CCF) and interaural intensity difference (IID) from the noise-free training data, and establish a template by combining the average values of multiple frames of data. That is, the average value of binaural localization features extracted from multiple frames of noise-free speech frames emitted from the same direction is used as the template for that direction.
The normalized cross-correlation function in each band is calculated as:

CCF(i, τ) = G_l,r(i, τ) / sqrt( G_l,l(i, τ_0) · G_r,r(i, τ_0) )

where

G_p,q(i, τ) = ∑_n x_p(i, n) · x_q(i, n + τ),   p, q ∈ {l, r}
where l and r denote the left and right ears respectively, i denotes the frequency band, n indexes the samples within a frame, and τ denotes the time lag; when p and q take the value l or r, x_p and x_q denote the signal received by the left or right ear; τ_0 denotes the lag 0.
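A minimal sketch of this computation for one band-filtered frame (the plus or minus 1.1 ms lag range and 16 kHz sampling rate are the values used in the embodiment below, giving a 37-point CCF):

```python
import numpy as np

def normalized_ccf(xl: np.ndarray, xr: np.ndarray,
                   max_lag: int = 18) -> np.ndarray:
    """Normalized interaural cross-correlation over lags -max_lag..max_lag."""
    n = len(xl)
    # G_ll(i, 0) and G_rr(i, 0) form the normalization denominator.
    norm = np.sqrt(np.dot(xl, xl) * np.dot(xr, xr)) + 1e-12
    ccf = np.empty(2 * max_lag + 1)
    for k, tau in enumerate(range(-max_lag, max_lag + 1)):
        if tau >= 0:
            ccf[k] = np.dot(xl[:n - tau], xr[tau:])   # G_lr(i, tau)
        else:
            ccf[k] = np.dot(xl[-tau:], xr[:n + tau])
    return ccf / norm
```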
The binaural intensity difference is calculated as:

IID(i) = 10 · log10( ∑_n x_l(i, n)² / ∑_n x_r(i, n)² )
where x_l denotes the signal received by the left ear and x_r the signal received by the right ear.
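A corresponding sketch, assuming the energy-ratio form reconstructed above:

```python
import numpy as np

def iid_db(xl: np.ndarray, xr: np.ndarray) -> float:
    """Interaural intensity difference of one band-filtered frame, in dB."""
    eps = 1e-12  # keeps the logarithm finite for silent frames
    return float(10.0 * np.log10((np.dot(xl, xl) + eps) /
                                 (np.dot(xr, xr) + eps)))
```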
2.3) One-hot labels are assigned to the signals of the 25 directions; for example, the label of the -80 degree direction is set to [1, 0, ..., 0], with 24 zeros. For each training frame, the similarity to the templates is computed in every frequency band and for every feature, yielding a (2×32)×25 similarity matrix ((number of features × number of bands) × number of candidate directions). The aim is to weight this matrix so that the similarity of the candidate direction corresponding to the true source is maximized. The weight matrix is 1×64, the similarity matrix is 64×25, and their product is the 1×25 matrix sim(θ):
sim(θ) = ∑_i [ ω_ccf,i · sim_ccf,i(θ) + ω_iid,i · sim_iid,i(θ) ]

where sim(θ) denotes the weighted similarity, ω_ccf,i the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) the cosine similarity between the cross-correlation function in the i-th frequency band and the template of direction θ, ω_iid,i the weight of the binaural intensity difference in the i-th frequency band, and sim_iid,i(θ) the similarity between the binaural intensity difference in the i-th frequency band and the template of direction θ. The cosine similarity is calculated as:

sim_ccf,i(θ) = ∑_τ R_temp(θ, i, τ) · R_l,r(i, τ) / ( ‖R_temp(θ, i, ·)‖ · ‖R_l,r(i, ·)‖ )
where R_l,r(i, τ) denotes the target cross-correlation function computed from the received signal in frequency band i, and R_temp(θ, i, τ) denotes the cross-correlation function template of frequency band i at angle θ.
The binaural intensity difference similarity sim_iid,i(θ) is obtained by comparing the IID of the current frame against the template IID of direction θ in band i, using the following quantities:
where i denotes the frequency band index, temp denotes the template, θ denotes the direction, iid_temp,θ,i denotes the binaural intensity difference template of the i-th frequency band for direction θ, and iid_i denotes the binaural intensity difference of the i-th frequency band computed from the current test signal.
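The matching step can be sketched as follows. The CCF part implements the cosine similarity above; since the text does not reproduce the IID similarity formula, the IID part uses a Gaussian kernel on the dB difference purely as an illustrative stand-in, with an assumed bandwidth parameter:

```python
import numpy as np

def ccf_similarity(ccf: np.ndarray, ccf_temp: np.ndarray) -> float:
    """Cosine similarity between a frame CCF and a direction template."""
    denom = np.linalg.norm(ccf) * np.linalg.norm(ccf_temp) + 1e-12
    return float(np.dot(ccf, ccf_temp) / denom)

def iid_similarity(iid: float, iid_temp: float, sigma: float = 5.0) -> float:
    """Illustrative IID similarity: Gaussian kernel on the dB difference.

    The patent does not reproduce its IID similarity formula; this form
    and the sigma value are assumptions, not values from the source.
    """
    return float(np.exp(-((iid - iid_temp) ** 2) / (2.0 * sigma ** 2)))
```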
2.4) The weights ω_ccf,i and ω_iid,i are trained by back propagation.
The loss function is set to the squared loss L = ∑_θ ( y(θ) - sim(θ) )², where y is the one-hot label above. The weights for the two binaural features and all frequency bands are trained simultaneously, and the trained weights are intuitively interpretable.
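Because the fused similarity is linear in the weights (sim = w·S, with S the 64×25 per-frame similarity matrix), the squared loss has a simple closed-form gradient; a minimal gradient-descent sketch under that reading, with assumed learning rate and initialization:

```python
import numpy as np

def train_weights(S: np.ndarray, Y: np.ndarray, lr: float = 0.01,
                  epochs: int = 100) -> np.ndarray:
    """Train the 64 fusion weights by gradient descent on the squared loss.

    S: (n_frames, 64, 25) similarities of each frame's 2x32 feature/band
       combinations to the 25 direction templates.
    Y: (n_frames, 25) one-hot direction labels.
    """
    w = np.full(64, 1.0 / 64)                # uniform init (an assumption)
    for _ in range(epochs):
        pred = np.einsum('k,nkd->nd', w, S)  # sim(theta) = w . S per frame
        grad = 2.0 * np.einsum('nd,nkd->k', pred - Y, S) / len(S)
        w -= lr * grad
    return w
```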
3) Testing stage: the collected signal is first split into frequency bands by the gammatone filterbank; the cross-correlation function and binaural intensity difference features are then extracted from each band signal; the similarity between these features and the template features of all directions is computed for every feature and frequency band; finally, the weights obtained in the training stage are applied to obtain the likelihood of the sound source coming from each direction. That is, the final sound source direction similarity is obtained by weighted fusion of the similarities of the different features and frequency bands, and the direction with the maximum similarity is taken as the sound source direction.
A specific application example is provided below. This example uses the binaural impulse responses recorded with artificial head 003 of the CIPIC database, which divides the horizontal angle into 25 different angles and the elevation angle into 50 different angles, and can thus simulate signals from different directions in a real environment. This example uses the 25 binaural impulse responses in the horizontal plane to localize the horizontal angle. The sound source signals are real human speech taken from the TIMIT database. Convolving a speech signal with a binaural impulse response realistically simulates the noise-free signal received by the human ears; adding noise from the different environments recorded in the NOISEX-92 database to the binaural signal realistically simulates the signal received by the human ears in different types of noise environments.
In the training stage, the prepared data are first pre-emphasized, framed and windowed, and then passed through the 4th-order, 32-band gammatone filterbank (minimum center frequency 80 Hz, maximum center frequency 7200 Hz) to obtain signals in 32 frequency bands. The cross-correlation function is then extracted with the formula above; considering that the maximum interaural time difference does not exceed plus or minus 1.1 milliseconds, and given the 16 kHz sampling rate, only 37 lags of the cross-correlation function are retained. At the same time, the binaural intensity difference is extracted with its formula, completing the feature extraction for the frame (as shown in FIG. 2). The average of the binaural localization features extracted from multiple noise-free speech frames emanating from the same direction is used as the template for that direction. Finally, the similarity between the localization features of each frame and the templates of each direction is computed, giving 64 similarities per candidate direction (as shown in FIG. 3), which are weighted to obtain the final direction similarity. Using the given similarity labels (i.e., the one-hot labels), the weights are adjusted by back propagation (as shown in FIG. 4).
In the testing stage, the prepared data are framed and windowed, and passed through the same 4th-order, 32-band gammatone filterbank (minimum center frequency 80 Hz, maximum center frequency 7200 Hz) to obtain signals in 32 frequency bands. The cross-correlation function is extracted as above, again retaining only 37 lags given the plus or minus 1.1 millisecond maximum interaural time difference and the 16 kHz sampling rate; the binaural intensity difference is extracted at the same time, completing the feature extraction for the frame (as shown in FIG. 2). The similarity between the localization features of the test signal and the templates of each direction is then computed, giving 64 similarities per candidate direction (as shown in FIG. 3), which are weighted to obtain the final direction similarity. The direction with the maximum similarity is selected as the sound source direction.
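Putting the pieces together, the per-frame decision reduces to weighting the 64×25 similarity matrix and taking the argmax; a sketch assuming the helpers above:

```python
import numpy as np

def localize(S_frame: np.ndarray, w: np.ndarray, angles: np.ndarray) -> float:
    """Pick the sound source direction for one frame.

    S_frame: (64, 25) similarities of the frame's 2x32 feature/band
             combinations to each direction template.
    w:       (64,) trained fusion weights.
    angles:  (25,) candidate steering angles in degrees.
    """
    sim = w @ S_frame                 # 1x64 @ 64x25 -> 25 fused similarities
    return float(angles[np.argmax(sim)])
```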
The training stage uses noise-free signals; the testing stage uses different types of noise at different signal-to-noise ratios, from -10 dB to 35 dB in 5 dB steps.
Experimental results show that the method of the present invention can resist noise interference to a certain extent and achieve angular localization of the sound source.
Based on the same inventive concept, another embodiment of the present invention provides a binaural sound source localization device based on weighted template matching using the above method, which includes:
A training module, which extracts binaural cross-correlation functions and binaural intensity differences in different directions from training data, establishes templates for the extracted features in each direction, and then trains the weights of the different binaural localization features and frequency bands;
An online localization module, which extracts the binaural cross-correlation function and binaural intensity difference of the sound source signal, matches them for similarity against the templates of each direction, and fuses the similarities of different features and frequency bands through the trained weights to localize the sound source.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.), which includes a memory and a processor; the memory stores a computer program configured to be executed by the processor, and the computer program includes instructions for executing the steps of the method of the present invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) storing a computer program which, when executed by a computer, implements the steps of the method of the present invention.
It should be understood that the embodiments described above are only some embodiments of the present invention, rather than all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present invention.
Claims

1. A binaural sound source localization method based on weighted template matching, characterized by comprising the following steps:
extracting binaural cross-correlation functions and binaural intensity differences in different directions from the training data;
establishing templates for the extracted binaural cross-correlation functions and binaural intensity differences in all directions;
training weights of different binaural localization features and different frequency bands;
during on-line positioning, extracting a binaural cross-correlation function and a binaural intensity difference of a sound source signal, performing similarity matching against the templates in all directions, and fusing the similarities of different features and different frequency bands through the weights obtained through training, to realize sound source positioning;
wherein the weights of the different binaural positioning features and different frequency bands are trained by a back propagation method, with the loss function set as a squared loss, so that the similarity with the template of the same direction is maximized and the similarity with templates of different directions is as small as possible;
wherein the similarity is calculated using the following formula:

sim(θ) = ∑_i [ ω_ccf,i · sim_ccf,i(θ) + ω_iid,i · sim_iid,i(θ) ]
wherein sim(θ) represents a weighted similarity matrix, ω_ccf,i represents the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) represents the cosine similarity of the cross-correlation function in the i-th frequency band to the template in the direction θ, ω_iid,i represents the weight of the binaural intensity difference in the i-th frequency band, and sim_iid,i(θ) represents the similarity of the binaural intensity difference in the i-th frequency band to the template in the direction θ;
wherein the calculation formulas of sim_ccf,i(θ) and sim_iid,i(θ) are:

sim_ccf,i(θ) = ∑_τ R_temp(θ, i, τ) · R_l,r(i, τ) / ( ‖R_temp(θ, i, ·)‖ · ‖R_l,r(i, ·)‖ )
where R_l,r(i, τ) represents the target cross-correlation function calculated from the received signal in frequency band i, and R_temp(θ, i, τ) represents the cross-correlation function template for frequency band i at angle θ;
where i denotes the frequency band index, temp denotes the template, θ denotes the direction, iid_temp,θ,i denotes the binaural intensity difference template corresponding to the i-th frequency band in the direction of angle θ, and iid_i denotes the binaural intensity difference of the i-th frequency band currently calculated from the test signal.
2. The method of claim 1, wherein the binaural localization features in different directions are extracted from the training data by convolving binaural impulse responses with clean speech signals or by directly using recorded sound signals, and calculating the cross-correlation function and binaural intensity difference in all directions; wherein the different directions are different horizontal steering angles, and the steering angles are divided in a non-uniform way.
3. The method according to claim 1, wherein the steering angles are divided in the following manner: [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°].
4. The method of claim 1, wherein establishing templates for the extracted binaural cross-correlation function and binaural intensity difference in each direction comprises taking the average of the binaural localization features extracted from a plurality of noiseless speech frames emanating from the same direction as the template for that direction.
5. A weighted template matching based binaural sound source localization device employing the method of any one of claims 1-4, comprising:
a training module, used to extract binaural cross-correlation functions and binaural intensity differences in different directions from the training data, establish templates for the extracted binaural cross-correlation functions and binaural intensity differences in all directions, and then train weights of different binaural positioning features and different frequency bands;
and an on-line positioning module, used to extract the binaural cross-correlation function and the binaural intensity difference of the sound source signal, match them for similarity against the templates in all directions, and fuse the similarities of different features and different frequency bands through the weights obtained through training, to realize sound source positioning.
6. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-4.
7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-4.
Publication: CN112731289B (application CN202011456914.0A, filed 2020-12-10, status: Active).