Sortformer Diarizer 4spk v1 model PR Part 1: models, modules and dataloaders #11282

Open: wants to merge 30 commits into base: main

Collaborator

@tango4j tango4j commented Nov 14, 2024

What does this PR do ?

Sortformer Diarizer Model, 4 speaker limit, v1

Sortformer Paper Link

In this PR, we are adding model files, module files, and the corresponding dataloader and evaluation files.

Collection: ASR/speaker_tasks

Changelog

  • model files
    nemo/collections/asr/models/sortformer_diar_models.py

  • module files
    nemo/collections/asr/modules/sortformer_modules.py

  • evaluation files
    nemo/collections/asr/metrics/der.py
    nemo/collections/asr/metrics/multi_binary_acc.py

  • dataloader files
    NeMo/nemo/collections/asr/data/audio_to_diar_label.py
    NeMo/nemo/collections/asr/data/audio_to_diar_label_lhotse.py

  • training yaml
    examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml

  • post-processing yaml files
    NeMo/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml
    NeMo/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard-dev.yaml

  • util files
    NeMo/nemo/collections/asr/parts/utils/speaker_utils.py
    NeMo/nemo/collections/asr/parts/utils/vad_utils.py

*Changed the file names of these yaml files

examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py
nemo/collections/asr/data/audio_to_diar_label.py

nemo/collections/asr/models/__init__.py

nemo/collections/asr/modules/sortformer_modules.py
nemo/collections/asr/parts/utils/asr_multispeaker_utils.py
nemo/collections/asr/parts/utils/speaker_utils.py
nemo/collections/asr/parts/utils/vad_utils.py
nemo/collections/common/parts/preprocessing/collections.py

Usage

  • You can potentially add a usage example below
python ${NEMO_ROOT}/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py \
     model_path=/path/to/diar_sortformer_4spk-v1.nemo \
     dataset_manifest=/path/to/eval_dataset.json
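
The `dataset_manifest` argument in the command above points to a JSON file. As a rough sketch only: assuming it uses a one-JSON-object-per-line layout with keys such as `audio_filepath`, `offset`, `duration`, `label`, `text` and `rttm_filepath` (these key names, the paths and the `write_manifest` helper are assumptions for illustration and are not confirmed by this PR), such a file could be generated like this:

```python
import json
import os
import tempfile

# Hypothetical entry: the key names and paths are assumptions for illustration,
# not something this PR confirms.
entries = [
    {
        "audio_filepath": "/path/to/session1.wav",
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "rttm_filepath": "/path/to/session1.rttm",
    },
]


def write_manifest(items, path):
    """Write one JSON object per line (JSON-lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


manifest_path = os.path.join(tempfile.mkdtemp(), "eval_dataset.json")
write_manifest(entries, manifest_path)
```

The resulting path would then be passed as `dataset_manifest=...` to the script.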

GitHub Actions CI

CI tests will be added in the second PR.
The third PR will include documentation and tutorials.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the ASR and speaker_tasks

Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
@github-advanced-security github-advanced-security bot left a comment

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@tango4j tango4j marked this pull request as ready for review November 14, 2024 09:07
Signed-off-by: tango4j <tango4j@users.noreply.github.com>
num_spks: ${model.max_num_of_spks}
session_len_sec: ${model.session_len_sec}
soft_label_thres: 0.5
soft_targets: False
Collaborator

Maybe add some short explanation for soft_label_thres and soft_targets?

Collaborator Author

Added some comments
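
As a rough illustration of what these two options could control (an assumption based on their names and on the values shown above, not a description of the PR's dataloader): suppose each frame's target is the fraction of that frame during which a speaker is active. With `soft_targets: False` the fraction would be binarized at `soft_label_thres`; with `soft_targets: True` it would be kept as a continuous value. The helper `make_targets` below is hypothetical.

```python
from typing import List


def make_targets(
    overlap_ratios: List[List[float]],
    soft_label_thres: float = 0.5,
    soft_targets: bool = False,
) -> List[List[float]]:
    """Turn per-frame speaker-activity ratios into training targets.

    overlap_ratios[t][k] is the fraction of frame t in which speaker k
    is active (0.0 to 1.0). Hypothetical helper, for illustration only.
    """
    if soft_targets:
        # Keep the continuous ratios as targets.
        return [list(frame) for frame in overlap_ratios]
    # Binarize: a speaker counts as active once the ratio reaches the threshold.
    return [
        [1.0 if ratio >= soft_label_thres else 0.0 for ratio in frame]
        for frame in overlap_ratios
    ]


ratios = [[0.9, 0.1], [0.5, 0.4], [0.0, 1.0]]
hard = make_targets(ratios, soft_label_thres=0.5, soft_targets=False)
soft = make_targets(ratios, soft_targets=True)
```

Under this reading, raising `soft_label_thres` marks fewer frames as active in the hard targets, while `soft_targets: True` bypasses the threshold entirely.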

frame_splicing: 1
dither: 0.00001

sortformer_modules:
Collaborator

@stevehuang52 stevehuang52 Nov 14, 2024

Does sortformer_modules mean that we can have several different components under this section? If not, maybe just use sortformer_module to align with other fields (e.g., encoder). Also the SortformerModules name could get rid of the s in my opinion.

Collaborator Author

Ivan and I decided to put all of the Sortformer modules (trainable weights or functions for streaming) in this section, which is why it is plural with an "s".
To be more precise it should actually be "SortformerAuxiliaryModules", but for brevity it is "sortformer_modules".

):
"""
Convert subsegment timestamps to scale timestamps by multiplying with the feature rate and rounding.
All `ts` related tensors are dimensioned as (N, 2), where N is the number of subsegments.
Collaborator

Maybe add a simple example to showcase what this function does? I'm a bit confused about the relation between segment and subsegment

Collaborator Author

Yes, this could be very confusing, because for end-to-end models subsegments are equivalent to frames.

Modular models (VAD + clustering, etc.): subsegments (usually 0.5 to 1.0 s) within speech segments (usually 2 to 15 s)
End-to-end models: frames within a session audio (the whole session length is one segment)

Added definitions and examples to clarify this.
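
A minimal sketch of the conversion the docstring describes (multiply by the feature rate and round), assuming timestamps are (start, end) pairs in seconds. The function name `timestamps_to_frames`, the parameter `feat_per_sec` and the value of 100 frames per second are hypothetical, and Python's built-in `round` (ties to even) may differ from the rounding used in the PR.

```python
from typing import List, Tuple


def timestamps_to_frames(
    ts_sec: List[Tuple[float, float]],
    feat_per_sec: float,
) -> List[Tuple[int, int]]:
    """Map N (start, end) timestamps in seconds to N (start, end) frame indices."""
    frames = []
    for start, end in ts_sec:
        # Scale seconds by the feature rate, then round to the nearest frame.
        frames.append((round(start * feat_per_sec), round(end * feat_per_sec)))
    return frames


# Two subsegments, assuming 100 feature frames per second (10 ms hop).
subsegments = [(0.0, 0.5), (2.37, 3.12)]
frame_ts = timestamps_to_frames(subsegments, feat_per_sec=100)
```

In the modular setting each pair would be a subsegment inside a longer speech segment; in the end-to-end setting the "subsegments" are the frames themselves, so the same mapping is applied across the whole session.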

tango4j and others added 3 commits November 14, 2024 16:56
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: tango4j <tango4j@users.noreply.github.com>
tango4j and others added 7 commits November 14, 2024 17:53
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: tango4j <tango4j@users.noreply.github.com>
@@ -0,0 +1,213 @@
sortformer_diarizer_hybrid_loss_4spk-v1.yaml# Sortformer Diarizer is an end-to-end speaker diarization model that is solely based on Transformer-encoder type of architecture.
Collaborator Author

There is some sneezed text pasted here.

Collaborator Author

Fixed. removed

Collaborator Author

fixed

from nemo.utils import logging
from nemo.utils.exp_manager import exp_manager

"""
Collaborator Author

This example is NOT fixed. Fix this.

Collaborator Author

updated this.

@@ -1245,6 +1243,187 @@ def __parse_item_rttm(self, line: str, manifest_file: str) -> Dict[str, Any]:
return item


class EndtoEndDiarizationLabel(_Collection):
"""List of diarization audio-label correspondence with preprocessing."""
Collaborator Author

These one-liner docstrings were not updated when copied from the original source. Update them.

Collaborator Author

Fixed and updated



class EndtoEndDiarizationSpeechLabel(EndtoEndDiarizationLabel):
"""`DiarizationLabel` diarization data sample collector from structured json files."""
Collaborator Author

Not updated, and out of context. Fix it.

Collaborator Author

Fixed

tango4j and others added 4 commits November 15, 2024 12:30
Signed-off-by: taejinp <tango4j@gmail.com>
Signed-off-by: tango4j <tango4j@users.noreply.github.com>
Contributor

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.


Your code was analyzed with PyLint. The following annotations have been identified:


------------------------------------
Your code has been rated at 10.00/10

Thank you for improving NeMo's documentation!

Contributor

beep boop 🤖: 🚨 The following files must be fixed before merge!


Your code was analyzed with PyLint. The following annotations have been identified:


------------------------------------
Your code has been rated at 10.00/10

Thank you for improving NeMo's documentation!
