diart.blocks.embedding

Module Contents

Classes

SpeakerEmbedding

OverlappedSpeechPenalty

Applies a penalty on overlapping speech and low-confidence regions to speaker segmentation scores.

EmbeddingNormalization

OverlapAwareSpeakerEmbedding

Extract overlap-aware speaker embeddings given an audio chunk and its segmentation.

class diart.blocks.embedding.SpeakerEmbedding(model, device=None)
Parameters:
  • model (EmbeddingModel) –

  • device (Optional[torch.device]) –

static from_pretrained(model, use_hf_token=True, device=None)
Parameters:
  • use_hf_token (Union[Text, bool, None]) –

  • device (Optional[torch.device]) –

Return type:

SpeakerEmbedding

__call__(waveform, weights=None)

Calculate speaker embeddings of the input audio. If weights are given, calculate one embedding per speaker from the same waveform, using each speaker's per-frame weights.

Parameters:
  • waveform (TemporalFeatures, shape (samples, channels) or (batch, samples, channels)) –

  • weights (Optional[TemporalFeatures], shape (frames, speakers) or (batch, frames, speakers)) – Per-speaker and per-frame weights. Defaults to no weights.

Returns:

embeddings – If weights are provided, the shape is (batch, speakers, embedding_dim); otherwise it is (batch, embedding_dim). If the batch size is 1, the batch dimension is omitted.

Return type:

torch.Tensor
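
The shape rule described above can be summarized in a few lines of plain Python. This is only a sketch of the documented behavior, not part of diart: `expected_embedding_shape` is a hypothetical helper, and the sizes used in the usage note (80000 samples, 293 frames, 512 dimensions) are illustrative assumptions.

```python
from typing import Optional, Tuple


def expected_embedding_shape(
    waveform_shape: Tuple[int, ...],
    weights_shape: Optional[Tuple[int, ...]],
    embedding_dim: int,
) -> Tuple[int, ...]:
    """Hypothetical helper: output shape according to the documented rule."""
    # A 2-D waveform (samples, channels) is treated as a batch of size 1.
    batch = waveform_shape[0] if len(waveform_shape) == 3 else 1
    if weights_shape is None:
        shape = (batch, embedding_dim)
    else:
        # The last axis of the weights is the speaker axis.
        speakers = weights_shape[-1]
        shape = (batch, speakers, embedding_dim)
    # The batch dimension is omitted when the batch size is 1.
    return shape[1:] if batch == 1 else shape
```

For instance, a single chunk of shape (80000, 1) with weights of shape (293, 3) would be expected to give embeddings of shape (3, 512), while a batch of four such chunks without weights would give (4, 512).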

class diart.blocks.embedding.OverlappedSpeechPenalty(gamma=3, beta=10, normalize=False)

Applies a penalty on overlapping speech and low-confidence regions to speaker segmentation scores.

Note

For more information, see “Overlap-Aware Low-Latency Online Speaker Diarization based on End-to-End Local Segmentation” (Section 2.2.1 Segmentation-driven speaker embedding). This block implements Equation 2.

Parameters:
  • gamma (float, optional) – Exponent used to down-weight low-confidence predictions. Defaults to 3.

  • beta (float, optional) – Softmax temperature parameter (the temperature is actually 1/beta) used to down-weight joint speaker activations. Defaults to 10.

  • normalize (bool, optional) – Whether to min-max normalize weights to be in the range [0, 1]. Defaults to False.

__call__(segmentation)
Parameters:

segmentation (diart.features.TemporalFeatures) –

Return type:

diart.features.TemporalFeatures
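
To make the roles of gamma, beta and normalize concrete, here is a minimal pure-Python sketch of one reading of Equation 2: each score is multiplied by its softmax probability across speakers (scores scaled by beta) and the product is raised to the power gamma. It is not diart's implementation. It assumes scores in [0, 1] stored as a list of frames, each a list of per-speaker scores, and it assumes that min-max normalization is applied per speaker over time, which the parameter description does not state.

```python
import math
from typing import List


def overlapped_speech_penalty(
    segmentation: List[List[float]],
    gamma: float = 3,
    beta: float = 10,
    normalize: bool = False,
) -> List[List[float]]:
    """Sketch: weights[f][s] = (seg[f][s] * softmax(beta * seg[f])[s]) ** gamma."""
    weights = []
    for frame in segmentation:
        # Softmax over speakers of the scaled scores (temperature 1/beta).
        peak = max(frame)
        exps = [math.exp(beta * (score - peak)) for score in frame]
        total = sum(exps)
        weights.append(
            [(score * e / total) ** gamma for score, e in zip(frame, exps)]
        )
    if normalize:
        # Assumption: min-max normalize each speaker's weights over time.
        for s in range(len(weights[0])):
            column = [row[s] for row in weights]
            low, high = min(column), max(column)
            span = high - low
            for row in weights:
                row[s] = (row[s] - low) / span if span > 0 else 0.0
    return weights
```

With the default parameters, a frame where one speaker is confidently active alone, such as [0.9, 0.1], gets a much larger weight for that speaker than a frame where two speakers overlap, such as [0.9, 0.9], or a low-confidence frame such as [0.2, 0.1].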

class diart.blocks.embedding.EmbeddingNormalization(norm=1)
Parameters:

norm (Union[float, torch.Tensor]) –

__call__(embeddings)
Parameters:

embeddings (torch.Tensor) –

Return type:

torch.Tensor
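
As an illustration of what rescaling an embedding to a target norm means, the following sketch works on a single vector in plain Python. It assumes the L2 (Euclidean) norm, which this page does not state explicitly; `normalize_embedding` is a hypothetical name and not part of diart.

```python
import math
from typing import List


def normalize_embedding(embedding: List[float], norm: float = 1.0) -> List[float]:
    """Rescale a vector so that its L2 norm equals `norm` (assumed L2)."""
    length = math.sqrt(sum(x * x for x in embedding))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    # Divide by the current length, then scale to the target norm.
    return [norm * x / length for x in embedding]
```

For example, the vector [3.0, 4.0] has length 5, so with the default target norm it becomes approximately [0.6, 0.8]; with norm=2.0 the result has length 2.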

class diart.blocks.embedding.OverlapAwareSpeakerEmbedding(model, gamma=3, beta=10, norm=1, normalize_weights=False, device=None)

Extract overlap-aware speaker embeddings given an audio chunk and its segmentation.

Parameters:
  • model (EmbeddingModel) – A pre-trained embedding model.

  • gamma (float, optional) – Exponent used to down-weight low-confidence predictions. Defaults to 3.

  • beta (float, optional) – Softmax temperature parameter (the temperature is actually 1/beta) used to down-weight joint speaker activations. Defaults to 10.

  • norm (float or torch.Tensor of shape (batch, speakers, 1) where batch is optional) – The target norm for the embeddings. It can be different for each speaker. Defaults to 1.

  • normalize_weights (bool, optional) – Whether to min-max normalize embedding weights to be in the range [0, 1]. Defaults to False.

  • device (Optional[torch.device]) – The device on which to run the embedding model. Defaults to GPU if available or CPU if not.

static from_pretrained(model, gamma=3, beta=10, norm=1, use_hf_token=True, normalize_weights=False, device=None)
Parameters:
  • gamma (float) –

  • beta (float) –

  • norm (Union[float, torch.Tensor]) –

  • use_hf_token (Union[Text, bool, None]) –

  • normalize_weights (bool) –

  • device (Optional[torch.device]) –

__call__(waveform, segmentation)
Parameters:
  • waveform (diart.features.TemporalFeatures) –

  • segmentation (diart.features.TemporalFeatures) –

Return type:

torch.Tensor
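
The constructor parameters suggest that this block chains three steps: penalize the segmentation scores, use the result as per-frame weights when embedding each speaker, and rescale each embedding to the target norm. The toy sketch below illustrates that flow under strong assumptions. The "embedding model" is replaced by a weighted mean over hypothetical frame-level features, the norm is assumed to be L2, and the 1e-8 floor on the weights is an illustrative choice. None of the function names belong to diart.

```python
import math
from typing import List

Matrix = List[List[float]]


def penalty(segmentation: Matrix, gamma: float = 3, beta: float = 10) -> Matrix:
    """Sketch of the penalty: (score * softmax(beta * scores)) ** gamma."""
    weights = []
    for frame in segmentation:
        peak = max(frame)
        exps = [math.exp(beta * (s - peak)) for s in frame]
        total = sum(exps)
        # Floor the weights so a speaker never ends up with all-zero weights.
        weights.append(
            [max((s * e / total) ** gamma, 1e-8) for s, e in zip(frame, exps)]
        )
    return weights


def overlap_aware_embeddings(
    features: Matrix,       # (frames, dim), stand-in for the model's internals
    segmentation: Matrix,   # (frames, speakers), scores assumed in [0, 1]
    gamma: float = 3,
    beta: float = 10,
    norm: float = 1.0,
) -> Matrix:
    """Toy pipeline: penalty -> weighted mean per speaker -> L2 rescaling."""
    weights = penalty(segmentation, gamma, beta)
    dim = len(features[0])
    embeddings = []
    for s in range(len(segmentation[0])):
        speaker_weights = [row[s] for row in weights]
        total = sum(speaker_weights)
        # Weighted mean of the frame features for this speaker.
        pooled = [
            sum(w * f[d] for w, f in zip(speaker_weights, features)) / total
            for d in range(dim)
        ]
        # Rescale to the target norm (assumes a non-zero pooled vector).
        length = math.sqrt(sum(x * x for x in pooled))
        embeddings.append([norm * x / length for x in pooled])
    return embeddings
```

In this toy setting, frames where a speaker talks alone dominate that speaker's embedding, while overlapped frames contribute comparatively little.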