Special tokens masking candidates #212

@jihobak

Based on the masking implementation in the transformers library, special tokens (e.g., [CLS], [SEP]) should be excluded from the masking process. However, upon reviewing the `mlm_masking` function in sequence_packer.py, it appears that these tokens are currently being treated as valid masking candidates.

Could you please confirm if this behavior is intentional? If not, I suggest updating the masking logic to explicitly exclude special tokens. For instance, adding a condition to filter out these tokens before applying the mask would ensure consistency with the transformers library's approach. Additionally, incorporating unit tests to verify that special tokens remain unmasked would improve code reliability.
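
To make the suggestion concrete, here is a minimal, framework-free sketch of the kind of filter I have in mind. It is not the code from sequence_packer.py: the function name `mask_excluding_special`, its parameters, and the token IDs used below are all hypothetical, and it simplifies masking to "always replace with the mask token" (no random-token or keep-original branches). The label value -100 is assumed because it is the default ignore index of the PyTorch cross-entropy loss.

```python
import random
from typing import List, Sequence, Set, Tuple

# Assumed convention: positions labelled -100 are ignored by the loss.
IGNORE_INDEX = -100


def mask_excluding_special(
    token_ids: Sequence[int],
    special_ids: Set[int],
    mask_token_id: int,
    mask_prob: float = 0.15,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: mask tokens for MLM, never touching special tokens."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Special tokens are removed from the candidate set up front.
        if tok in special_ids:
            continue
        if rng.random() < mask_prob:
            labels[i] = tok            # the model must predict the original token
            inputs[i] = mask_token_id  # simplified: always substitute the mask token
    return inputs, labels
```

A unit test along the following lines would then pin the behavior down. The IDs 101, 102, and 103 are only illustrative stand-ins for [CLS], [SEP], and [MASK]:

```python
ids = [101, 7, 8, 9, 102]
inputs, labels = mask_excluding_special(ids, {101, 102}, mask_token_id=103, mask_prob=1.0)
# Even when every candidate is masked, the special tokens stay intact and unlabelled.
assert inputs[0] == 101 and inputs[-1] == 102
assert labels[0] == IGNORE_INDEX and labels[-1] == IGNORE_INDEX
```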

Am I correct in my understanding, or is there something I might be missing?

Thank you for looking into this.
