Race condition when constructing the Encoding #35

@chris-monardo

Hello,

I ran into this issue while running unit tests in parallel, with each test driven by a separate process. Constructing multiple encoders at the same time causes a race condition on the downloaded tokenizer file. I believe a file lock may be needed in public_encodings.rs. Also, the file should not be downloaded again if it is already cached, since the repeated download adds latency.
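
To illustrate the kind of behavior being asked for, here is a minimal Python sketch. It is not the library's code: `fetch_cached` and its `download` callback are hypothetical names. It also uses a different technique from the file lock suggested above: each process writes to its own temporary file and then renames it into place atomically, so a reader never sees a half-written file, and an existing cache entry is returned without any download.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def fetch_cached(cache_path: Path, download: Callable[[], bytes]) -> Path:
    """Return cache_path, calling download() only when the file is missing.

    Hypothetical helper for illustration; not part of any real library.
    """
    if cache_path.exists():
        # Cache hit: skip the download entirely.
        return cache_path

    cache_path.parent.mkdir(parents=True, exist_ok=True)

    # Each process gets a unique temp file in the same directory, so the
    # final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(download())
        # os.replace swaps the complete file into place in one step.
        os.replace(tmp_name, cache_path)
    except BaseException:
        # Do not leave partial temp files behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return cache_path
```

With this approach two processes that start at the same time may both download, but neither can corrupt the cached file. A file lock would additionally avoid the duplicate download.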

Additionally, I have a request. Some of our systems have no internet access, so we need to pass the tokenizer file to the Encoding directly instead of having it downloaded. Would it be possible to expose a Python API for this purpose? For example, load_harmony_encoding could get a counterpart such as load_harmony_encoding_from_file.
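
As a rough sketch of the offline path, the helper below reads a tokenizer file from local disk and optionally verifies it against a known SHA-256 digest. `read_tokenizer_file` and `expected_sha256` are hypothetical names, and load_harmony_encoding_from_file does not exist today; this only shows the file-handling step such an API might build on, without any network access.

```python
import hashlib
from pathlib import Path
from typing import Optional


def read_tokenizer_file(path: str, expected_sha256: Optional[str] = None) -> bytes:
    """Read a tokenizer file from local disk, with an optional checksum check.

    Hypothetical helper for illustration; not an existing library function.
    """
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"tokenizer file not found: {p}")

    data = p.read_bytes()

    if expected_sha256 is not None:
        # Guard against a truncated or wrong file being passed in.
        actual = hashlib.sha256(data).hexdigest()
        if actual != expected_sha256.lower():
            raise ValueError(
                f"checksum mismatch for {p}: expected {expected_sha256}, got {actual}"
            )
    return data
```

A from-file loader could take the returned bytes and build the encoding from them, so air-gapped systems would only need the file to be shipped alongside the application.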

Thank you.
