Showing content from https://patents.google.com/patent/CN112731289B/en below:

CN112731289B - A binaural sound source localization method and device based on weighted template matching

Info

Publication number: CN112731289B
Application number: CN202011456914.0A
Authority: CN (China)
Prior art keywords: binaural, cross, different, sound source, similarity
Prior art date: 2020-12-10
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112731289A (en)
Inventors: 丁润伟, 孙永恒, 杨冰, 刘宏
Current assignee: PKU-HKUST SHENZHEN-HONGKONG INSTITUTION; Peking University Shenzhen Graduate School (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: PKU-HKUST SHENZHEN-HONGKONG INSTITUTION; Peking University Shenzhen Graduate School
Priority date: 2020-12-10 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2020-12-10
Publication date: 2024-05-07

Events:
2020-12-10  Application filed by PKU-HKUST SHENZHEN-HONGKONG INSTITUTION and Peking University Shenzhen Graduate School
2020-12-10  Priority to CN202011456914.0A
2021-04-30  Publication of CN112731289A
2024-05-07  Application granted
2024-05-07  Publication of CN112731289B
2040-12-10  Anticipated expiration

Status: Active
Abstract (translated from Chinese)

The present invention discloses a binaural sound source localization method and device based on weighted template matching. In the training stage, binaural cross-correlation functions and binaural intensity differences for different directions are first extracted from training data, and a template is built for each direction's cross-correlation function and intensity difference; weight values for the different directions and frequency bands are then trained by gradient descent. In the online localization stage, features are likewise first extracted from the signal; the extracted features are then matched for similarity against the templates of every direction, separately per feature and per frequency band; finally, the per-feature, per-band similarities are fused by weighting to obtain the overall direction similarity, and the direction with the maximum similarity is taken as the sound source direction. Experiments carried out under different kinds of noise show that the invention can resist noise interference to a certain extent and achieve angular localization of the sound source.

Description (translated from Chinese)

Technical Field

The invention belongs to the field of information technology and relates to a binaural sound source localization method applied in speech perception and speech enhancement, and in particular to a binaural sound source localization method and device based on weighted template matching.

Technical Background

Human-computer interaction plays an increasingly important role in robotics: it makes communication between people and machines more convenient, efficient, and friendly. In daily life, people perceive external information mainly through vision, hearing, touch, smell, and taste. Humans obtain roughly 70%-80% of their information through vision and roughly 10%-20% through hearing. Auditory perception is one of the most natural, convenient, and effective ways for people to exchange information with the outside world. Moreover, compared with visual signals, auditory signals cover a 360-degree field, are unaffected by lighting, and do not require an unobstructed path between the sound source and the microphone. Robot hearing is therefore one of the important routes to human-computer interaction. It mainly comprises sound source localization and tracking, speech denoising, speech enhancement, speech separation, speaker recognition, speech recognition, and speech emotion recognition. Among these, sound source localization, as a front-end task of robot hearing, can supply the spatial position of the speech to assist the other speech tasks. Robot sound source localization has become an important component of the robot auditory system.

Speech separation stems from the famous "cocktail party" problem: people's ability to focus on one person's voice amidst many conversations and noise, long regarded as a challenging problem in speech separation. Combining sound source localization with speech separation to obtain the direction of the sound source helps separate overlapping speech and improves recognition accuracy for speech from the direction of interest. In video conferencing, the camera can be steered toward the speaker in real time according to the microphone-based localization result. In video surveillance, the camera angle can be adjusted according to the sound source direction to widen the monitored area and achieve better coverage.

According to the number of microphones and whether the cochlear structure of a robot artificial head is present, sound source localization technology can be roughly divided into localization based on microphone arrays and localization based on binaural microphones. Binaural microphone localization plays an important role in humanoid robots: it fully exploits the diffraction of sound by the cochlear structure to mimic human auditory characteristics. Robot binaural localization uses only two microphones, mounted on the left and right sides of the robot's head. Compared with localization by a plain two-microphone array, binaural localization, thanks to the diffraction of sound by the cochlea and artificial head, better simulates human hearing and is better suited to humanoid robots, hearing-aid speech enhancement, virtual reality, and similar scenarios. It also resolves the front-back ambiguity of two-microphone localization.

Binaural sound source localization mainly includes the following steps:

1. Simulation and recording of binaural signals. Convolve a clean sound signal with binaural impulse responses to obtain simulated binaural signals, or record binaural signals directly as real signals.

2. Analog-to-digital conversion and pre-filtering of the signal. The analog signal is first pre-filtered: a high-pass filter removes the 50 Hz mains-hum component, and a low-pass filter removes frequency components above half the sampling frequency to prevent aliasing. The analog signal is then sampled and quantized into a digital signal.

3. Pre-emphasis. The signal is passed through a high-frequency emphasis filter with impulse response H(z) = 1 - 0.95z^(-1) to compensate for the high-frequency attenuation caused by lip radiation.
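
The pre-emphasis step above can be sketched as a one-line difference filter (a minimal illustration; the 0.95 coefficient comes from the text, everything else is an assumed setup):

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """Apply H(z) = 1 - alpha*z^-1, i.e. y[n] = x[n] - alpha*x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]               # first sample has no predecessor
    y[1:] = x[1:] - alpha * x[:-1]
    return y

# a constant signal is strongly attenuated, as expected for a high-pass filter
y = pre_emphasis(np.array([1.0, 1.0, 1.0, 1.0]))
# y = [1.0, 0.05, 0.05, 0.05]
```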

4. Framing and windowing. Speech is time-varying, but the mouth muscles move relatively slowly, so speech is generally considered stationary over short intervals of about 10 ms-30 ms. The signal is therefore split into frames on that time scale, for example one frame every 20 ms. To mitigate artifacts introduced by framing, each frame is usually windowed; common windows include the rectangular, Hanning, and Hamming windows, of which the Hamming window is the most widely used.
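
The framing-and-windowing step can be sketched as follows (a hedged illustration; the 20 ms frame length follows the text, while the 10 ms hop and the 16 kHz sampling rate are assumptions):

```python
import numpy as np

def frame_signal(x, frame_len, hop_len):
    """Split x into overlapping frames and apply a Hamming window to each."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    win = np.hamming(frame_len)
    return np.stack([x[i * hop_len : i * hop_len + frame_len] * win
                     for i in range(n_frames)])

fs = 16000                                   # assumed sampling rate
frames = frame_signal(np.ones(16000),        # one second of dummy signal
                      frame_len=int(0.02 * fs),   # 20 ms frames
                      hop_len=int(0.01 * fs))     # 10 ms hop (assumed)
```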

5. Feature extraction. Binaural features carrying sound source direction information can be extracted from each frame. The features commonly used in binaural localization include the interaural cross-correlation function (CCF), the interaural time difference (ITD), and the interaural intensity difference (IID). Since many methods for extracting the ITD are based on the cross-correlation function, the present invention uses the CCF and IID features.

6. Localization. The extracted features are mapped to the corresponding direction so that the posterior probability of the true source direction is maximized. Many mapping methods exist, such as Gaussian mixture models and neural network models; the present invention uses a method based on weighted template matching.

Traditional methods based on Gaussian mixture models or neural networks compute the source direction separately in each frequency band and simply sum the results, without accounting for the reliability of different frequency bands or of different features. Neural-network-based methods are, in addition, hard to interpret.

Summary of the Invention

In view of the above problems, the purpose of the present invention is to provide an interpretable binaural sound source localization method and device based on weighted template matching, which computes the likelihood of every candidate source direction in each frequency band separately, integrates the results through weights over the different frequency bands and features, and obtains the final source direction.

To achieve the above object, the present invention adopts the following technical solution.

A binaural sound source localization method based on weighted template matching comprises the following steps:

extracting binaural cross-correlation functions and binaural intensity differences for different directions from training data;

building templates for the extracted cross-correlation functions and intensity differences of each direction;

training weights for the different binaural localization features and the different frequency bands;

during online localization, extracting the cross-correlation function and intensity difference of the sound source signal, matching them for similarity against the templates of each direction, and fusing the per-feature, per-band similarities with the trained weights to localize the source.

Further, the binaural localization features for different directions are extracted from the training data by convolving binaural impulse responses with clean speech, or directly from recorded sound signals, yielding the cross-correlation function and intensity difference for every direction; "different directions" refers to a set of horizontal steering angles, divided non-uniformly.

Further, the steering angles are divided as: [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°].

Further, the template for each direction's cross-correlation function and intensity difference is the average of the binaural localization features extracted from multiple noise-free speech frames emitted from that direction.

Further, the weights for the different binaural features and frequency bands are trained by back-propagation with a squared loss, so that the similarity between templates of the same direction is maximized and the similarity between templates of different directions is as small as possible.

Further, the similarity is calculated with the following formula:

sim(θ) = Σ_i ω_ccf,i · sim_ccf,i(θ) + Σ_i ω_iid,i · sim_iid,i(θ)

where sim(θ) is the weighted similarity, ω_ccf,i is the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) is the cosine similarity between the cross-correlation function in the i-th band and the template for direction θ, ω_iid,i is the weight of the intensity difference in the i-th band, and sim_iid,i(θ) is the similarity between the intensity difference in the i-th band and the template for direction θ.

A binaural sound source localization device based on weighted template matching that adopts the above method comprises:

a training module for extracting binaural cross-correlation functions and intensity differences for different directions from training data, building a template for each direction's cross-correlation function and intensity difference, and then training the weights of the different binaural features and frequency bands;

an online localization module for extracting the cross-correlation function and intensity difference of the sound source signal, matching them for similarity against the templates of each direction, and fusing the per-feature, per-band similarities with the trained weights to localize the source.

The beneficial effects of the present invention are as follows:

The present invention computes the likelihood of every candidate direction in each frequency band separately and integrates the results through weights over the different frequency bands and features to obtain the final source direction; it can resist noise interference to a certain extent and achieve angular localization of the sound source.

Brief Description of the Drawings

FIG. 1 is an overall flow chart of the method of the present invention.

FIG. 2 is an example of the extracted features, where (a) shows the extracted interaural cross-correlation features and (b) shows the extracted interaural intensity difference features.

FIG. 3 is an example of computing the similarity between a sound source signal and the templates of every direction. The upper half shows the similarity between the signal's cross-correlation function and each direction's template; the lower half shows the similarity between the signal's intensity difference and each direction's template. The horizontal axis denotes direction.

FIG. 4 shows the final trained weights. The two polylines contain 64 points in total, representing the per-band weights of the cross-correlation function and of the intensity difference.

Detailed Description

The technical solution of the present invention is described clearly and completely below in conjunction with the embodiments and drawings.

FIG. 1 is a flow chart of the binaural sound source localization method based on weighted template matching of the present invention, which comprises the following steps.

1) Data preparation stage: simulate binaural signals from every direction, providing the raw source signals.

1.1) The front half-plane of the artificial head is divided into 25 horizontal steering angles; for example, the angles are divided non-uniformly as [-80°, -65°, -55°, -45°:5°:45°, 55°, 65°, 80°], where -45°:5°:45° denotes one angle every 5 degrees from -45° to 45°.
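
The non-uniform grid can be expanded programmatically (a small sketch; -45°:5°:45° is range notation for every 5° between -45° and 45°):

```python
# 25 candidate azimuths: coarse steps at the sides, 5-degree steps in the middle
angles = [-80, -65, -55] + list(range(-45, 46, 5)) + [55, 65, 80]
# 3 + 19 + 3 = 25 directions
```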

1.2) Training and test data are constructed from the clean speech of the TIMIT database, the binaural impulse responses of the CIPIC database, and the various noise types of the NOISEX-92 database. No noise is added to the training data. The test data use noise at various signal-to-noise ratios; the experiments use test signals from -10 dB to 35 dB.

2) Training stage: extract the cross-correlation function and intensity difference data, build templates for the cross-correlation function (CCF) and interaural intensity difference (IID), and train the weights for the different features and frequency bands so that the similarity between templates of the same source direction is maximized while the similarity between templates of different directions is as small as possible. The CCF and IID templates for all directions can be computed by convolving binaural impulse responses (HRTF) with clean speech, or directly from recorded signals.

2.1) A 4th-order, 32-channel gammatone filterbank splits the direction-bearing signal into frequency bands, with the maximum frequency set to 7200 Hz.

2.2) The CCF and IID are extracted from the noise-free training data, and the template is built as the average over multiple frames: the mean of the binaural localization features extracted from multiple noise-free speech frames emitted from the same direction serves as the template for that direction.
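
Template building in step 2.2 is simply a per-direction average of frame features; a sketch with hypothetical feature vectors:

```python
import numpy as np

def build_template(frame_features):
    """Average the per-frame localization features (CCF or IID vectors)
    extracted from noise-free frames of one direction."""
    return np.mean(np.asarray(frame_features, dtype=float), axis=0)

# three hypothetical per-frame feature vectors for one direction
template = build_template([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]])
# template = [2.0, 2.0]
```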

The cross-correlation function is calculated as follows:

R_l,r(i,τ) = G_l,r(i,τ) / sqrt( G_l,l(i,τ0) · G_r,r(i,τ0) )

where

G_p,q(i,τ) = Σ_n x_p(i,n) · x_q(i,n+τ),  p,q ∈ {l,r}

Here l and r denote the left and right ear respectively, i indexes the frequency band, n indexes the samples within a frame, and τ is the time lag; when p and q take the value l or r, x_p and x_q denote the signal received at the left or right ear; τ0 denotes lag zero.
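
The formulas of step 2.2 can be sketched directly (a hedged illustration; normalizing by the zero-lag autocorrelations follows the τ0 = 0 reading of the text, and the test signal is an arbitrary assumption):

```python
import numpy as np

def cross_corr(xl, xr, max_lag):
    """Normalized interaural cross-correlation for one band:
    G_pq(tau) = sum_n x_p[n] * x_q[n+tau], normalized by the
    zero-lag autocorrelations (tau0 = 0, an assumption)."""
    n = len(xl)
    g_ll = np.dot(xl, xl)                 # G_ll(0)
    g_rr = np.dot(xr, xr)                 # G_rr(0)
    ccf = []
    for tau in range(-max_lag, max_lag + 1):
        if tau >= 0:
            g = np.dot(xl[:n - tau], xr[tau:])
        else:
            g = np.dot(xl[-tau:], xr[:n + tau])
        ccf.append(g / np.sqrt(g_ll * g_rr))
    return np.array(ccf)

# identical left/right signals should peak (value 1.0) at zero lag
sig = np.sin(np.linspace(0, 10, 200))
c = cross_corr(sig, sig, max_lag=18)      # 37 lags, as in the embodiment below
```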

The interaural intensity difference is calculated as follows:

IID(i) = 10 · log10( Σ_n x_l(i,n)² / Σ_n x_r(i,n)² )

where x_l denotes the signal received at the left ear and x_r the signal received at the right ear.
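
The intensity difference can be sketched as an energy ratio in decibels (the 10·log10 energy-ratio form is an assumption; the source gives the formula only in translated prose):

```python
import numpy as np

def iid_db(xl, xr, eps=1e-12):
    """Interaural intensity difference in dB for one band,
    as the left-to-right energy ratio (eps guards against log(0))."""
    return 10.0 * np.log10((np.sum(xl**2) + eps) / (np.sum(xr**2) + eps))

left = np.ones(100)
right = 0.5 * np.ones(100)   # right ear at half the amplitude
# iid_db(left, right) is about 6.02 dB
```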

2.3) The signals of the 25 directions are given one-hot labels; for example, the label of the -80° direction is [1, 0, ..., 0], with 24 zeros in total. For each training frame, the similarity to the templates is computed per feature and per frequency band, yielding a (2×32)×25 similarity matrix ((features × bands) × candidate directions). The goal is to weight this matrix so that the similarity of the candidate direction matching the true source is largest. The weight matrix is 1×64, the similarity matrix is 64×25, and the result is the 1×25 vector sim(θ).
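
The dimensions described above can be checked with a small sketch (random placeholder similarities; only the shapes follow the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_feat, n_band, n_dir = 2, 32, 25            # features, gammatone bands, directions

w = rng.random((1, n_feat * n_band))         # 1x64 weight vector
S = rng.random((n_feat * n_band, n_dir))     # 64x25 per-feature, per-band similarities
sim = w @ S                                  # 1x25 weighted direction similarity

est_dir = int(np.argmax(sim))                # index of the estimated direction
one_hot = np.zeros(n_dir)
one_hot[0] = 1.0                             # label for the first direction (-80 deg)
```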

Here sim(θ) is the weighted similarity, ω_ccf,i is the weight of the cross-correlation function in the i-th frequency band, sim_ccf,i(θ) is the cosine similarity between the cross-correlation function in the i-th band and the template for direction θ, ω_iid,i is the weight of the intensity difference in the i-th band, and sim_iid,i(θ) is the similarity between the intensity difference in the i-th band and the template for direction θ. The cosine similarity is computed as:

sim_ccf,i(θ) = Σ_τ R_l,r(i,τ) · R_temp(θ,i,τ) / ( sqrt(Σ_τ R_l,r(i,τ)²) · sqrt(Σ_τ R_temp(θ,i,τ)²) )

where R_temp(θ,i,τ) is the cross-correlation template for direction θ and band i, and R_l,r(i,τ) is the cross-correlation function of the received signal computed in band i.

The similarity of the interaural intensity difference is computed from iid_temp,θ,i and iid_i, where i is the frequency band index, temp denotes the template, θ denotes the direction, iid_temp,θ,i is the intensity difference template of the i-th band for direction θ, and iid_i is the intensity difference of the i-th band computed from the current test signal.

2.4) The weights ω_ccf,i and ω_iid,i are trained by back-propagation.

The loss function is the squared loss L = Σ_θ (sim(θ) - y(θ))², where y is the one-hot ground-truth label described above. The weights over the two binaural features and all frequency bands are trained jointly, and the trained weights are intuitively interpretable.
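
Since sim(θ) = w·S is linear in the weights, squared-loss gradient descent has a closed-form gradient; a minimal sketch (the learning rate, epoch count, and uniform initialization are assumptions):

```python
import numpy as np

def train_weights(S_frames, labels, lr=0.1, epochs=200):
    """Gradient descent on the squared loss L = ||w @ S - y||^2.
    S_frames: per-frame 64x25 similarity matrices; labels: matching
    one-hot 1x25 row vectors. A sketch of the idea only."""
    dim = S_frames[0].shape[0]
    w = np.full((1, dim), 1.0 / dim)       # uniform initial weights
    for _ in range(epochs):
        for S, y in zip(S_frames, labels):
            err = w @ S - y                # 1x25 residual
            w -= lr * (err @ S.T)          # dL/dw (constant factor folded into lr)
    return w
```

On a toy problem where the similarity matrix is the identity, the weights converge to the target label itself, which makes the behavior easy to verify.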

3) Test stage. The captured signal is first split into frequency bands by the gammatone filterbank; the cross-correlation function and intensity difference features are then extracted from each band signal; the features are compared for similarity against the templates of all directions, per feature and per band; finally, the similarities are weighted with the weights obtained in the training stage to yield the likelihood of the source lying in each direction, giving the source direction. In other words, the per-feature, per-band similarities are fused by weighting into the final direction similarity, and the direction with the maximum similarity is taken as the source direction.

A concrete application example follows. This example uses the binaural impulse responses recorded with artificial head 003 of the CIPIC database, which divides the horizontal angle into 25 values and the elevation angle into 50 values and can simulate signals from different directions in a real environment. This example uses the 25 binaural impulse responses in the horizontal plane to localize the horizontal angle. The source signals are real human speech taken from the TIMIT database. Convolving a speech signal with a binaural impulse response realistically simulates the noise-free signal received at the ears. Adding environmental noise recorded in the NOISEX-92 database to the binaural signal realistically simulates the signal received by the ears in different kinds of noisy environments.

In the training stage, the prepared data are first pre-emphasized, framed, and windowed, then passed through a 4th-order, 32-band gammatone filterbank with lowest center frequency 80 Hz and highest center frequency 7200 Hz, yielding signals in 32 frequency bands. The cross-correlation function is then extracted with the formula above; since the interaural time difference of binaural signals cannot exceed ±1.1 ms, at a 16 kHz sampling rate only 37 cross-correlation values are kept. The intensity difference is extracted with its formula at the same time, completing feature extraction for the frame (as shown in FIG. 2). The mean of the binaural localization features extracted from multiple noise-free frames of the same direction serves as that direction's template. Finally, the similarity between each frame's localization features and every direction's template is computed, yielding 64 similarities per candidate direction (as shown in FIG. 3), which are weighted into the final direction similarity. Using the given similarity labels (the one-hot labels), the weights are adjusted by back-propagation (as shown in FIG. 4).
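
The 37-value figure follows from the ±1.1 ms maximum interaural delay at 16 kHz (a quick check of the arithmetic above):

```python
fs = 16000                     # sampling rate in Hz
max_itd = 1.1e-3               # maximum interaural time difference in seconds
max_lag = round(max_itd * fs)  # 17.6 samples, rounded to 18
n_lags = 2 * max_lag + 1       # lags -18..18 -> 37 cross-correlation values
```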

测试阶段,首先将以上准备的数据分帧、加窗,通过4阶32频带,最低中心频率80Hz,最高中心频率7200Hz的gammatone滤波器,得到32个不同频带的信号。然后,分别利用互相关函数计算公式提取互相关函数,这里我们考虑到双耳信号的最大时间差不会超过正负1.1毫秒,并且结合16k采样率,仅取长度为37的互相关函数的互相关值;同时使用计算双耳强度差的公式提取双耳强度差,完成该帧信号的特征提取工作(如图2所示)。然后计算测试信号的定位特征与各个方向模板的相似度,在每个候选方向上共得到64个相似度(如图3所示),将他们进行加权得到最终的方向相似度。选取最大相似度方向作为声源方向。In the test phase, the data prepared above are first framed and windowed, and then passed through a 4th-order 32-band gammatone filter with a minimum center frequency of 80Hz and a maximum center frequency of 7200Hz to obtain signals in 32 different frequency bands. Then, the cross-correlation function calculation formula is used to extract the cross-correlation function. Here, we consider that the maximum time difference of binaural signals will not exceed plus or minus 1.1 milliseconds, and combined with the 16k sampling rate, only the cross-correlation value of the cross-correlation function with a length of 37 is taken; at the same time, the formula for calculating binaural intensity difference is used to extract the binaural intensity difference to complete the feature extraction of the frame signal (as shown in Figure 2). Then, the similarity between the positioning features of the test signal and the templates in each direction is calculated, and a total of 64 similarities are obtained in each candidate direction (as shown in Figure 3), and they are weighted to obtain the final direction similarity. The direction with the maximum similarity is selected as the sound source direction.

其中训练阶段使用无噪声的信号,测试阶段使用不同种类不同信噪比的噪声,从-10dB到35dB,间隔5dB。The training phase uses a noise-free signal, and the test phase uses different types of noise with different signal-to-noise ratios, ranging from -10dB to 35dB, with an interval of 5dB.

Experimental results show that the method of the present invention can resist noise interference to a certain extent and achieve angular localization of the sound source.

Based on the same inventive concept, another embodiment of the present invention provides a binaural sound source localization device based on weighted template matching that employs the above method, comprising:

a training module for extracting binaural cross-correlation functions and binaural intensity differences in different directions from training data, establishing templates for the extracted binaural cross-correlation functions and binaural intensity differences of each direction, and then training the weights of the different binaural localization features and the different frequency bands;

an online localization module for extracting the binaural cross-correlation function and binaural intensity difference of a sound source signal, matching them against the template of each direction for similarity, and fusing the similarities of the different features and different frequency bands with the trained weights to achieve sound source localization.

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (a computer, server, smartphone, etc.) comprising a memory and a processor, wherein the memory stores a computer program configured to be executed by the processor, and the computer program comprises instructions for executing each step of the method of the present invention.

Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) storing a computer program which, when executed by a computer, implements the steps of the method of the present invention.

It should be understood that the embodiments described above are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present invention without creative work fall within the protection scope of the present invention.

Claims (7)

1. A binaural sound source localization method based on weighted template matching, characterized by comprising the following steps:

extracting binaural cross-correlation functions and binaural intensity differences in different directions from training data;

establishing templates for the extracted binaural cross-correlation functions and binaural intensity differences in all directions;

training weights of different binaural localization features and different frequency bands;

during on-line localization, extracting a binaural cross-correlation function and a binaural intensity difference of a sound source signal, performing similarity matching between them and the templates in all directions, and fusing the similarities of different features and different frequency bands through the weights obtained through training, so as to realize sound source localization;

wherein the weights of the different binaural localization features and the different frequency bands are trained by a back-propagation method, with the loss function set as the squared loss, so that the similarity with the template of the true direction is maximized and the similarity with templates of other directions is as small as possible;

the similarity is calculated using the following formula:

sim(θ) = Σ_i [ ω_ccf,i · sim_ccf,i(θ) + ω_iid,i · sim_iid,i(θ) ]

wherein sim(θ) represents the weighted similarity for direction θ; ω_ccf,i represents the weight of the cross-correlation function on the i-th frequency band; sim_ccf,i(θ) represents the cosine similarity of the cross-correlation function on the i-th frequency band to the template in direction θ; ω_iid,i represents the weight of the binaural intensity difference in the i-th frequency band; and sim_iid,i(θ) represents the similarity of the binaural intensity difference in the i-th frequency band to the template in direction θ;

wherein sim_ccf,i(θ) is the cosine similarity between the received signal's cross-correlation function and the template:

sim_ccf,i(θ) = Σ_τ R_temp(θ, i, τ) · R_l,r(i, τ) / ( ‖R_temp(θ, i, ·)‖ · ‖R_l,r(i, ·)‖ )

where R_temp(θ, i, τ) represents the cross-correlation function template of frequency band i in direction θ, and R_l,r(i, τ) represents the cross-correlation function calculated from the received signal in frequency band i;

and sim_iid,i(θ) is calculated from iid_temp,θ,i and iid_i, where i denotes the band index, temp denotes the template, θ denotes the direction, iid_temp,θ,i denotes the binaural intensity difference template of the i-th band in the θ angle direction, and iid_i denotes the binaural intensity difference of the i-th band currently calculated from the test signal.

2. The method of claim 1, wherein extracting the binaural localization features in different directions from the training data is performed by convolving binaural impulse responses with a clean speech signal, or by directly using recorded sound signals, and calculating the cross-correlation function and the binaural intensity difference for all directions; wherein the different directions are divided into different horizontal steering angles, and the steering angles are divided non-uniformly.

3. The method according to claim 1, wherein the steering angles are divided in the following manner: [-80, -65, -55, -45:5:45, 55, 65, 80], i.e., -45° to 45° in 5° steps, together with ±55°, ±65°, and ±80°.

4. The method of claim 1, wherein establishing templates for the extracted binaural cross-correlation function and binaural intensity difference of each direction comprises taking, as the template of a direction, the average of the binaural localization features extracted from multiple noiseless speech frames emitted from that direction.

5. A binaural sound source localization device based on weighted template matching, employing the method of any one of claims 1-4, comprising:

a training module for extracting the binaural cross-correlation functions and the binaural intensity differences in different directions from the training data, establishing templates for the extracted binaural cross-correlation functions and binaural intensity differences of each direction, and then training weights of different binaural localization features and different frequency bands;

and an on-line localization module for extracting the binaural cross-correlation function and the binaural intensity difference of the sound source signal, matching them against the templates in all directions, and fusing the similarities of different features and different frequency bands through the weights obtained through training to realize sound source localization.

6. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-4.

7. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-4.
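Under the assumption that the 64 similarities per candidate direction are stacked into one vector and fused linearly, the weight training of claim 1 (squared loss against one-hot direction labels, adjusted by back-propagation, here plain gradient descent) could be sketched as:

```python
import numpy as np

def train_weights(S, Y, lr=1e-3, epochs=500):
    """Learn the fusion weights by gradient descent on a squared loss.

    S: (N, D, F) similarities of N training frames to each of D direction
       templates (F = 64: 32 CCF + 32 IID similarities per pair);
    Y: (N, D) one-hot direction labels.
    Shapes, learning rate, and epoch count are illustrative assumptions.
    """
    n_frames, n_dirs, n_feats = S.shape
    w = np.full(n_feats, 1.0 / n_feats)   # start from uniform weights
    for _ in range(epochs):
        pred = S @ w                      # (N, D) weighted direction scores
        err = pred - Y                    # residual of the squared loss
        grad = 2.0 * np.einsum('nd,ndf->f', err, S) / n_frames
        w -= lr * grad
    return w
```

Training pushes the weighted score toward 1 for the labeled direction and toward 0 elsewhere, so bands and features that discriminate directions well receive larger weights.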

CN112731289B: A binaural sound source localization method and device based on weighted template matching. Application CN202011456914.0A, filed 2020-12-10; granted 2024-05-07. Status: Active.
