Update push_data annotations to use JsonSerializable type #1191

@vdusek

Description

Currently, we use the following annotations for data / user_data in many places:

data: list[dict[str, Any]] | dict[str, Any]
data: dict[str, Any]

This works, but it isn't precise: Any admits every value, while we only accept JSON-serializable types.

We've got this recursive alias:

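# TypeAlias is in typing on Python 3.10+; use typing_extensions on older versions.
from typing import TypeAlias, TypeVar, Union

# J is bound to the alias itself, which makes the alias recursive.
J = TypeVar('J', bound='JsonSerializable')
JsonSerializable: TypeAlias = Union[
    list[J],
    dict[str, J],
    str,
    bool,
    int,
    float,
    None,
]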

But if we use it for these variables:

data: list[dict[str, JsonSerializable]] | dict[str, JsonSerializable]
data: dict[str, JsonSerializable]

We run into variance-related errors, like this:

tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py:450: error: Argument 1 to "__call__" of "PushDataFunction" has incompatible type "dict[str, str]"; expected "Union[list[dict[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]], dict[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]]"  [arg-type]
tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py:450: note: "Dict" is invariant -- see https://mypy.readthedocs.io/en/stable/common_issues.html#variance
tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py:450: note: Consider using "Mapping" instead, which is covariant in the value type
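To illustrate why mypy rejects dict[str, str] here, below is a minimal sketch of the invariance problem. JsonValue and add_count are hypothetical names used only for this example; JsonValue is a simplified, non-recursive stand-in for JsonSerializable. A function that accepts dict[str, JsonValue] is allowed to write any JSON value into the dict, which would break a caller that still treats it as dict[str, str].

```python
from typing import Union

# Hypothetical, simplified (non-recursive) stand-in for JsonSerializable.
JsonValue = Union[str, bool, int, float, None]


def add_count(data: dict[str, JsonValue]) -> None:
    # Valid for dict[str, JsonValue]: an int is a JSON value.
    data['count'] = 1


headers: dict[str, str] = {'key': 'value'}

# A type checker flags this call because dict is invariant in its value type.
# At runtime it executes and silently breaks the dict[str, str] annotation.
add_count(headers)  # type: ignore[arg-type]
```

After the call, headers holds an int value even though it is annotated as dict[str, str]. A read-only Mapping parameter cannot perform such a write, which is why Mapping can be covariant in its value type.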

If we follow the suggestion and use Mapping and Sequence instead:

data: Sequence[Mapping[str, JsonSerializable]] | Mapping[str, JsonSerializable]

We end up with even more errors on the usage side, e.g.

item = {'key': 'value', 'number': 42}
await dataset_client.push_data(item)

Error (dict[str, object] vs. Mapping[str, JsonSerializable])

Argument 1 to "push_data" of "MemoryDatasetClient" has incompatible type "dict[str, object]"; expected "Union[Sequence[Mapping[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]], Mapping[str, Union[list[Any], dict[str, Any], str, bool, int, float, None]]]" Mypy[arg-type](https://mypy.readthedocs.io/en/latest/_refs.html#code-arg-type)

Is using the JsonSerializable alias the right choice in this context, or should we adopt something different? If so, how? The goal is to get precise JSON-serializable typing while avoiding both the variance errors and the usage-side errors.
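One possible direction, as a sketch only and not a confirmed fix: the dict[str, object] in the last error comes from mypy inferring the variable's type from a literal with mixed str and int values before the call is checked. Annotating the variable, or passing the literal directly, gives the checker the expected type as context. The names JsonValue and push_data below are hypothetical stand-ins, and the alias assumes a type checker that supports recursive aliases written with a string forward reference.

```python
import json
from collections.abc import Mapping, Sequence
from typing import Union

# Hypothetical recursive alias built on the covariant, read-only containers.
JsonValue = Union[
    Sequence['JsonValue'],
    Mapping[str, 'JsonValue'],
    str,
    bool,
    int,
    float,
    None,
]


def push_data(
    data: Union[Sequence[Mapping[str, JsonValue]], Mapping[str, JsonValue]],
) -> str:
    # Hypothetical stand-in for the real push_data; it only serializes the input.
    return json.dumps(data)


# Option 1: annotate the variable so the checker does not infer dict[str, object].
item: dict[str, JsonValue] = {'key': 'value', 'number': 42}
serialized = push_data(item)

# Option 2: pass the literal inline so it is checked against the parameter type.
serialized_inline = push_data({'key': 'value', 'number': 42})
```

Both options still require a change at the call site whenever a mixed-value dict is first stored in an unannotated variable, so this narrows the usage-side errors rather than removing them.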

Labels: t-tooling (issues with this label are in the ownership of the tooling team)
