The present invention relates to a synthesizer implementation that changes pitch, an important feature parameter of speech, to synthesize a single voice input through a microphone into multiple voices. Speech is generally modeled as an excitation signal (sound source) passing through a filter (the vocal tract); the excitation can be modeled by the pitch component and the filter by the formant components. The formants depend on the geometric shape of the vocal tract. For example, the sounds "ah" and "er" are produced by different configurations of the human vocal tract, and their formant frequencies differ. Formants therefore carry phonological information and play an important role in speech signal modeling. Pitch is produced by the periodic vibration of the vocal cords and is a parameter to which human hearing is very sensitive; it is used to distinguish the speaker of a speech signal and has a strong influence on the naturalness of the signal. Accurate pitch analysis is therefore an important factor determining the quality of speech synthesis, and accurate pitch extraction and reconstruction play a decisive role in the quality of speech coding. Pitch information is also used as a parameter for deciding whether a portion of the speech signal is voiced or unvoiced. Pitch is constrained by the structure of the human vocal cords, lying generally in the range 50-250 Hz for men and 120-150 Hz for women, and it varies with the individual speaker, stress, intonation, and emotion. By changing a pitch with these characteristics, one person's voice can be synthesized to sound like the voices of several people. The fields of application of the invention are diverse: a cheering synthesizer that makes one person sound like many people cheering at a sports stadium, a celebration synthesizer for birthday or party venues, singing toys, sound effects for films and plays, and a burglary-prevention system for homes left empty for long periods. It can also be applied to voice modulators that imitate the voices of celebrities such as Jolaman, who is currently popular.
Description (translated from Korean)

Multiple Speech Synthesizer using Pitch Alteration Method

The present invention synthesizes a single voice into multiple voices by changing its pitch, and can be classified in the field of voice communication technology or audio signal processing. The technology currently in use receives one person's voice and, after changing its pitch, synthesizes it back into a single voice rather than into multiple voices. It therefore has the disadvantage that it cannot synthesize a variety of voices.
The present invention remedies this shortcoming and synthesizes a variety of voices.
The present invention proposes a synthesizer that synthesizes a single voice into multiple voices by changing pitch, an important parameter of speech. Fig. 1 shows a general speech production model. The input that travels from the lungs through the vocal cords can be divided into two types: voiced sound is modeled as an impulse train based on the pitch period, and unvoiced sound as random noise. The signal selected by switching between these two sources is multiplied by a gain according to the energy of the input signal, and a speech signal is produced by passing the result through the vocal-tract filter. Analyzing a speech signal according to this speech production model shows that it consists of excitation information, which conveys the speaker's individuality and emotion, and the formant information of the vocal-tract filter, which conveys the spoken content. Pitch, which represents the excitation information, is produced by the periodic vibration of the vocal cords and is a parameter to which human hearing is very sensitive; it is used to distinguish the speaker of a speech signal and strongly influences the naturalness of the signal. By changing a pitch that carries such prosodic information, a variety of synthesized sounds can be produced.
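For illustration only, a minimal sketch of this source-filter production model is given below (this code is not part of the original disclosure; the sampling rate, resonance frequency, and gain values are arbitrary assumptions). Voiced sound is generated as an impulse train at the pitch period and unvoiced sound as random noise, each passed through a simple all-pole filter standing in for the vocal tract.

```python
import numpy as np
from scipy.signal import lfilter

def synthesize_voiced(f0=120.0, fs=8000, duration=0.5, gain=0.3):
    """Voiced source-filter model: impulse train (pitch) -> gain -> vocal-tract filter."""
    n = int(fs * duration)
    period = int(round(fs / f0))               # pitch period in samples
    excitation = np.zeros(n)
    excitation[::period] = 1.0                 # impulse train at the pitch period

    # Toy "vocal tract": one resonance near 500 Hz standing in for a formant.
    r, theta = 0.97, 2 * np.pi * 500.0 / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]   # all-pole (IIR) filter coefficients
    return lfilter([1.0], a, gain * excitation)

def synthesize_unvoiced(fs=8000, duration=0.2, gain=0.05):
    """Unvoiced sound: random noise through a crude spectral-shaping filter."""
    noise = np.random.randn(int(fs * duration))
    return lfilter([1.0], [1.0, -0.5], gain * noise)
```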
Fig. 1 is a block diagram illustrating a conventional speech production model.
Fig. 2 is a block diagram of a typical pitch change system.
Fig. 3 is a block diagram of the pitch change system applied in the present invention.
Fig. 4 is a block diagram of the pitch-point detection method applied in the present invention.
Fig. 5 shows the pitch change method (PSOLA synthesis) applied in the present invention.
Fig. 6 is a hardware configuration diagram of the multi-voice synthesis system.
Fig. 7 is a software flow chart of the multi-voice synthesis system.
The pitch change system is configured as shown in Fig. 2. In the analysis stage, the pitch of the original signal input through the microphone and of the target signal is detected and passed to the change-rule generation stage. The change-rule generation stage uses this information to determine the pitch change rate and a suitable pitch change method. This pitch change rule is supplied to the actual pitch change stage, where the pitch of the original signal is changed by the predetermined method and rate, and the synthesis stage uses the result to produce a synthesized sound with an altered voice. This process requires an accurate pitch detector together with a pitch change technique that introduces little distortion. Numerous pitch detection methods for speech signals have been proposed over the last 40 years (see references). The autocorrelation method is commonly used for pitch detection: the correlation between adjacent segments of the speech waveform is computed to detect the period of the repetitive waveform (see references). For pitch change, the pitch must first be detected accurately, and the pitch is then changed on that basis. Many pitch change methods have also been proposed (see references). One example is the PSOLA (Pitch Synchronous Overlap and Add) method, which segments the speech waveform in the time domain in units of the pitch period and then reconstructs the waveform by overlapping the segments at the changed pitch period (see references).
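A minimal autocorrelation pitch detector of the kind referred to above might be sketched as follows (illustrative only; the sampling rate, search range, and peak threshold are assumed values, not figures from the disclosure).

```python
import numpy as np

def detect_pitch_autocorr(frame, fs=8000, f0_min=50.0, f0_max=400.0):
    """Estimate the fundamental frequency of one frame via autocorrelation.

    Returns the estimated F0 in Hz, or 0.0 if no clear periodicity is found
    (e.g. for an unvoiced frame).
    """
    frame = np.asarray(frame, dtype=float)
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                               # normalize by frame energy

    lag_min = int(fs / f0_max)                    # shortest allowed period
    lag_max = min(int(fs / f0_min), len(ac) - 1)  # longest allowed period
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))

    # Require a reasonably strong correlation peak before calling the frame periodic.
    return fs / lag if ac[lag] > 0.3 else 0.0
```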
Fig. 3 is a block diagram of the pitch change system used in the present invention. For pitch detection, the present invention uses the detection method required for prosody control shown in Fig. 4. First, the signal whose high-frequency region has been emphasized by a pre-emphasis filter is passed inversely through the filter represented by the linear prediction coefficients, and the pitch detection process is performed by applying the amplitude and periodicity characteristics of the glottal waveform obtained for each analysis section (see references). After the pitch is detected in this way, pitches stretched to 140% and 120% and pitches compressed to 80% and 60% are produced with the PSOLA pitch change method shown in Fig. 5 and combined with slight delays, yielding a multi-voice synthesized sound.
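The combination of pitch-shifted copies described above could be expressed roughly as in the sketch below (not taken from the disclosure): `pitch_shift` is a placeholder for any pitch changer such as the PSOLA method of Fig. 5, and the per-voice delay values are illustrative assumptions.

```python
import numpy as np

def synthesize_multi_voice(voice, fs, pitch_shift,
                           ratios=(1.4, 1.2, 0.8, 0.6),
                           delays_ms=(10, 20, 30, 40)):
    """Mix the original voice with pitch-shifted, slightly delayed copies.

    `pitch_shift(signal, ratio)` must return a pitch-changed copy of `signal`
    (ratio > 1 raises the pitch, ratio < 1 lowers it).
    """
    out = np.asarray(voice, dtype=float).copy()
    for ratio, delay_ms in zip(ratios, delays_ms):
        shifted = pitch_shift(voice, ratio)           # 140%, 120%, 80%, 60% pitch
        delay = int(fs * delay_ms / 1000.0)           # small per-voice delay
        end = min(len(out), delay + len(shifted))
        out[delay:end] += shifted[:end - delay]
    return out / (len(ratios) + 1)                    # keep amplitude in range
```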
[Configuration of the Hardware Device]
Fig. 6 shows an apparatus for synthesizing multiple voices by changing the pitch of an analog voice signal 600 input from a microphone. The voice signal waveform 600, input in analog form, is amplified by the amplifier 601 and then passed through the low-pass filter 602 to remove aliasing. It then passes through the analog-to-digital converter 603, which performs quantization and coding and converts it into a digital signal in linear pulse code modulation (PCM) form, after which it is processed (604) by software or firmware on a general-purpose CPU or digital signal processor (DSP).
While processing the signal, the processor 604 may refer to peripheral devices 609 installed internally or externally, and may also use the peripheral memory 605 to store input digital signals or processing results.
The digital signal, synthesized into multiple voices by the pitch-changing software running on the CPU, is converted into a sampled analog waveform by the digital-to-analog converter 608. Passing this signal through the low-pass filter 607 yields an analog signal from which the quantization noise has been removed; after suitable amplification (606), it becomes an analog signal 610 that can again be heard through a speaker or similar device.
[Software Process]
The multi-voice synthesizer using the pitch change method adds software or firmware that applies multiple pitch changes instead of the conventional single pitch change. Fig. 7 shows the software flow chart of the multi-voice synthesizer used in the present invention.
Data samples 701 input from the analog-to-digital converter (ADC) are processed one frame at a time. First, it is determined whether the data in the current frame belong to a voiced section; if not (703), the occupancy ratio (buffer rate, BR) of the ring buffer is calculated. The memory buffer in which processed data wait is called the ring buffer (710).
The ring buffer occupancy ratio (BR) represents the proportion of time that processed data wait in the ring buffer. If the current frame is a non-voiced section and the waiting time in the ring buffer exceeds a predetermined time (e.g., BT = 1.5 seconds or more), the voice processing time is shortened (708) to speed up processing. This makes it possible to remove the processing delay that accumulates when multiple pitch changes are performed. In other words, in voiced sections the data are output slowly so that the pitch change is performed smoothly, while in non-voiced sections the data are output quickly, eliminating the overall time delay.
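The buffer-occupancy check described above might be organized as in the following sketch (an assumption for illustration; the threshold BT and the `shorten_unvoiced` callback are placeholders rather than values or routines from the disclosure).

```python
from collections import deque

class RingBufferScheduler:
    """Track how long processed frames wait before playback and shorten
    unvoiced frames when the backlog exceeds the threshold BT."""

    def __init__(self, fs=8000, bt_seconds=1.5):
        self.fs = fs
        self.bt_seconds = bt_seconds      # waiting-time threshold (BT)
        self.buffer = deque()             # processed frames waiting for the DAC

    def buffer_rate(self):
        """Waiting time (seconds) represented by the frames queued in the buffer."""
        return sum(len(f) for f in self.buffer) / self.fs

    def push(self, frame, is_voiced, shorten_unvoiced):
        # If we are falling behind and the frame is unvoiced, compress it in time
        # so that the overall delay introduced by the multi-pitch change shrinks.
        if not is_voiced and self.buffer_rate() > self.bt_seconds:
            frame = shorten_unvoiced(frame)
        self.buffer.append(frame)

    def pop(self):
        """Next frame for the digital-to-analog converter, or None if empty."""
        return self.buffer.popleft() if self.buffer else None
```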
Many methods (702) of determining whether the current frame is a voiced or non-voiced section are described in speech processing textbooks (see references); for example, the decision can easily be made by measuring the energy level. That is, if the average energy of the current frame is below a predetermined threshold, the section is classified as unvoiced.
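Such an energy-threshold decision can be as simple as the sketch below (illustrative only; the threshold value is an arbitrary placeholder and would in practice be tuned or made adaptive).

```python
import numpy as np

def is_voiced_frame(frame, energy_threshold=1e-3):
    """Classify a frame as voiced if its average energy exceeds a fixed threshold."""
    mean_energy = float(np.mean(np.asarray(frame, dtype=float) ** 2))
    return mean_energy > energy_threshold
```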
If the input data belong to a voiced section, the pitch period must be detected using the pitch-point detection method (705). Numerous pitch period detection methods for speech signals have been proposed over the last 40 years (see references). The autocorrelation method is commonly used: the correlation between adjacent segments of the speech waveform is computed to detect the period of the repetitive waveform (see references).
The present invention uses the detection method required for prosody control described above.
In addition, to limit the change of intonation within a voiced section to a certain range (e.g., within a factor of 1.5), the pitch period of the continuous voiced section is detected, the rate of change per frame is computed, and if the change is large the pitch period is modified to stabilize the voice (706). The pitch period is changed on the basis of an accurate pitch period detection. Many methods of changing the pitch period have been proposed (see references). The present invention performs the multiple pitch changes using the PSOLA (Pitch Synchronous Overlap and Add) method (see references), which segments the speech waveform in the time domain in units of the pitch period and then reconstructs the waveform by overlapping the segments at the changed pitch period.
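A heavily simplified time-domain, pitch-synchronous overlap-add pitch change, assuming the pitch period (in samples) has already been detected for the section, might look like the sketch below. Real PSOLA implementations place analysis marks on glottal closure instants and duplicate or drop segments to preserve duration; this naive version changes duration along with pitch and is illustrative only.

```python
import numpy as np

def psola_pitch_shift(signal, pitch_period, ratio):
    """Shift pitch by `ratio` (>1 raises pitch) with pitch-synchronous overlap-add.

    `pitch_period` is the detected period in samples. Hann-windowed segments,
    two periods long, are cut at the analysis marks and re-overlapped at the
    modified period. Note: this sketch also stretches/shrinks the signal;
    a full PSOLA repeats or drops segments to keep the duration unchanged.
    """
    signal = np.asarray(signal, dtype=float)
    out_period = int(round(pitch_period / ratio))        # new spacing of pitch pulses
    win_len = 2 * pitch_period
    window = np.hanning(win_len)

    # Pitch-synchronous, windowed segments centred on each analysis mark.
    marks = np.arange(pitch_period, len(signal) - pitch_period, pitch_period)
    segments = [window * signal[m - pitch_period:m + pitch_period] for m in marks]

    # Overlap-add the segments at the changed pitch period.
    out = np.zeros(len(signal) + win_len)
    pos = pitch_period
    for seg in segments:
        out[pos - pitch_period:pos + pitch_period] += seg
        pos += out_period
        if pos + pitch_period > len(out):
            break
    return out[:len(signal)]
```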
The processed voice data are stored in the ring buffer (709) and output through the speaker, one block of voice samples at a time, via the digital-to-analog converter (DAC) in the order in which they were stored (710). The multi-voice synthesizer operates in real time: the processing (709) of one frame of data received from the analog-to-digital converter (ADC) (701) must finish before the next frame of data arrives.
[References]
[1] Myung-Jin Bae and Sang-Hyo Lee, Digital Speech Analysis, Dong Young Publishing Co., 1998.
[2] Myung-Jin Bae, Digital Speech Synthesis, Dong Young Publishing Co., 1999.
[3] Myung-Jin Bae, Digital Speech Coding, Dong Young Publishing Co., 2000.
[4] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
[5] Hyung-Bin Park and Myung-Jin Bae, "A Study on Pitch-Point Detection for Tone Change," Proceedings of the Acoustical Society of Korea Summer Conference, Vol. 19, No. 1(s), pp. 149-152, July 7-8, 2000.
As described above, the present invention synthesizes a single voice into multiple voices by changing pitch, an important parameter carrying the prosodic information of speech. Voice information technology has been selected as one of the ten key technologies of the 21st century designated by MIT and one of the ten most promising technologies chosen by the Samsung Economic Research Institute. Beyond the importance of the technology itself, the market for voice technology is expected to grow rapidly. The domestic voice technology market is currently in its initial stage, estimated at about 20 billion won last year, but with annual growth of more than 50% it is expected to reach about 100 billion won by 2005. In this growing voice technology market, the present invention can be applied in various fields: a cheering synthesizer that makes one person sound like many people cheering at a sports stadium, a celebration synthesizer for birthday or party venues, singing toys, sound effects for films and plays, and a burglary-prevention system for homes left empty for long periods. It can also be applied to voice modulators that imitate the voices of celebrities such as Jolaman, who is currently popular. As it can be applied in such diverse fields, its ripple effect is expected to be very large.
Claims (1) (translated from Korean)

A method of implementing a synthesizer that synthesizes a single voice into multiple voices by changing the pitch, an important speech parameter carrying prosodic information, while keeping the formant components unchanged, wherein: the pitch change is applied in the time domain in a manner that allows the prosody to be controlled in real time; to preserve the individuality and clarity of the speaker during the time-domain pitch change, the pitch change is made with respect to the pitch central to the speaker's phonation; to perform the pitch change, a pitch-point detection method based on linear prediction analysis, capable of detecting the pitch points of the phonation, is used; and for real-time pitch change in the time domain the PSOLA synthesis method is applied, so that several pitch-changed voices are synthesized simultaneously into multiple voices.
KR1020030009198A (publication KR20030031936A, ceased), filed 2003-02-13; PCT/KR2003/001238, published as WO2004072951A1.