End-to-End Neural Speaker Diarization with Non-Autoregressive Attractors

What is an attractor?

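In machine learning and neural networks, an attractor is a kind of "center point" or "point of attraction" that represents a particular pattern or feature. In speaker diarization, attractors are learned from the speech of different speakers so that each one captures the characteristics of a single speaker, and those characteristics are then used to tell the speakers apart.
Specifically, in EEND (End-to-End Neural Diarization) models, attractors capture the voice characteristics of each speaker when the speech of several speakers is mixed together. Each speaker is identified by its own attractor, and the model uses these attractors to assign the audio to individual speakers.
An attractor is a kind of embedding vector, and the model is trained so that speech from the same speaker is "pulled" toward that vector. This lets the model decide who is speaking, or which segments belong to which speaker.
By analogy, attractors act like separate magnets, each pulling in the features of one speaker. The speech frames gather around each magnet, which is what makes it possible to split the recording by speaker.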
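
The "pulling" described above can be made concrete with a small NumPy sketch. It assumes the usual attractor-based formulation, in which the activity of speaker s at frame t is the sigmoid of the dot product between the frame embedding and the attractor. The function names and the toy values are made up for illustration and are not taken from the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def speaker_activity(frame_emb, attractors, threshold=0.5):
    """frame_emb: (T, D) frame embeddings, attractors: (S, D).

    Returns per-frame, per-speaker posteriors (T, S) and binary decisions.
    """
    posteriors = sigmoid(frame_emb @ attractors.T)
    return posteriors, posteriors > threshold

# Toy setup (illustrative values): D = 2, S = 2, T = 4.
attractors = np.array([[4.0, 0.0],    # attractor for speaker 0
                       [0.0, 4.0]])   # attractor for speaker 1
frame_emb = np.array([[ 1.0, -1.0],   # close to speaker 0 only
                      [-1.0,  1.0],   # close to speaker 1 only
                      [ 1.0,  1.0],   # close to both: overlapped speech
                      [-1.0, -1.0]])  # close to neither: silence

posteriors, decisions = speaker_activity(frame_emb, attractors)
print(decisions)
```

Because every speaker gets an independent sigmoid rather than one softmax across speakers, a frame can be assigned to several speakers at once, which is how overlapped speech is represented.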

1. End-to-End Neural Diarization with Encoder-Decoder Attractors (EEND-EDA)

F : feature dimension

D : embedding dimension

S : number of speakers in the recording

T : number of time frames

2. End-to-End Neural Diarization with Non-Autoregressive Attractors (EEND-NAA)

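When generating attractors, the model does not use information from previous decoding steps; it generates all attractors at the same time.
This speeds up processing and gives stable performance even when there is complex overlap between speakers.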

F : feature dimension

D : embedding dimension

K-means : applied only in the first iteration. From the next iteration onward,
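
As a rough illustration of this section, the sketch below initializes the attractors with K-means once and then updates all of them in a single parallel step. The update is a plain single-head dot-product attention without learned projections; it is an assumed stand-in, since the actual refinement module is not described in these notes. `kmeans_init`, `refine_attractors`, the evenly spaced initialization, and the toy data are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kmeans_init(frame_emb, k, n_iter=10):
    """Plain Lloyd's K-means over frame embeddings (T, D) -> centers (k, D).

    Initial centers are evenly spaced frames so the toy run is deterministic.
    """
    idx = np.linspace(0, len(frame_emb) - 1, k).astype(int)
    centers = frame_emb[idx].copy()
    for _ in range(n_iter):
        # Squared distance from every frame to every center: (T, k)
        dists = ((frame_emb[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = frame_emb[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def refine_attractors(attractors, frame_emb):
    """One parallel update: every attractor attends to all frames at once.

    No attractor depends on another attractor produced earlier, which is the
    non-autoregressive property. Assumed attention form, no learned weights.
    """
    d = frame_emb.shape[1]
    weights = softmax(attractors @ frame_emb.T / np.sqrt(d), axis=1)  # (k, T)
    return weights @ frame_emb                                        # (k, D)

# Toy data (illustrative): three frames per speaker, D = 2.
frame_emb = np.array([[2.0, 0.0], [2.2, 0.1], [1.8, -0.1],
                      [0.0, 2.0], [0.1, 2.2], [-0.1, 1.8]])

initial = kmeans_init(frame_emb, k=2)            # first iteration only
refined = refine_attractors(initial, frame_emb)  # later iterations start here
print(initial)
print(refined)
```

An autoregressive decoder would instead emit attractor s+1 only after attractor s, so its running time grows with the number of speakers; here the cost of one update is a single matrix product regardless of k.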

3. EEND-NAA-Fixed

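A variant of EEND-NAA in which the number of attractors is set to a fixed value.
Defining the number of attractors in advance makes it clear what the system has to estimate, which simplifies training.
Effective when the number of speakers is fixed.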

4. EEND-NAA-Overest

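Similar to EEND-NAA, but the model generates more attractors than the actual number of speakers (overestimation).
This lets it respond more flexibly when the boundaries between speakers are unclear or when the set of speakers may change.
In practice some attractors may end up inactive, but the approach raises the chance of capturing every speaker.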

F : feature dimension

D : embedding dimension

S : number of speakers in the recording

K : chosen number of clusters
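
To illustrate what "inactive" attractors could mean when K is larger than S, the sketch below drops every attractor whose posterior never crosses the decision threshold. This pruning rule, the `prune_inactive` helper, and the numbers are assumptions made for the example; the notes do not say how the paper handles the surplus attractors.

```python
import numpy as np

def prune_inactive(posteriors, threshold=0.5, min_frames=1):
    """posteriors: (T, K) speaker-activity probabilities for K attractors.

    Keeps an attractor only if it is active in at least `min_frames` frames.
    Returns the pruned posteriors (T, K') and the boolean keep mask (K,).
    """
    active = posteriors > threshold              # (T, K)
    keep = active.sum(axis=0) >= min_frames      # (K,)
    return posteriors[:, keep], keep

# Toy case (illustrative): T = 4 frames, K = 3 attractors, S = 2 real speakers.
posteriors = np.array([[0.9, 0.1, 0.05],
                       [0.8, 0.2, 0.10],
                       [0.1, 0.9, 0.02],
                       [0.7, 0.8, 0.03]])

pruned, keep = prune_inactive(posteriors)
print(keep)             # which attractors survive
print(int(keep.sum()))  # estimated number of speakers
```

Raising `min_frames` trades robustness against spurious short activations for the risk of discarding a speaker who talks only briefly.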

5. EEND-NAA-2step system

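Attractors are generated and applied in two steps.
In the first step, initial attractors are generated.
In the second step, more refined final attractors are generated from these initial attractors and then applied.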

F : feature dimension

D : embedding dimension

S : number of speakers in the recording

K : chosen number of clusters

6. EEND-NAA-1step system

F : feature dimension

D : embedding dimension

S : number of speakers in the recording

K : chosen number of clusters

Reference

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10629182&tag=1 [paper]
