Skip to content

Conversation

sueoglu
Copy link
Collaborator

@sueoglu sueoglu commented Sep 24, 2025

fixes #809

Implemented and wrote basic tests for:

  • ep.pp.filter_observations
  • ep.pp.filter_features

They also now generalize to 3D data structure

* `'proportion'`: The feature must pass the filtering criteria in at least a proportion `prop` of time points. For example, with `prop=0.3`,
the feature must pass the filtering criteria in at least 30% of the time points.
prop: Proportion of time points in which the feature must pass the filtering criteria. Only relevant if `time_mode='proportion'`.
inplace: Performs the operation inplace or returns a copy.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inplace: Performs the operation inplace or returns a copy.

)


def filter_features(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your nice new functions are not yet visible in the docs; See https://github.com/theislab/ehrapy/blob/main/docs/api/preprocessing_index.md how to add them, then you can check in the readthedocs build if everything looks as it should



def filter_features(
edata: EHRData,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
edata: EHRData,
edata: EHRData | AnnData

with pytest.raises(ValueError):
ep.pp.filter_features(edata, min_obs=185, time_mode="proportion", prop=2, copy=False)

# min_obs filtering
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be split into a separate test - its not doing what the test name says it is doing

with pytest.raises(ValueError):
ep.pp.filter_observations(edata, min_vars=10, time_mode="proportion", prop=2, copy=False)

# min_vars filtering
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here



def test_filter_features_invalid_args_min_max_obs():
edata = ed.dt.ehrdata_blobs(n_variables=45, n_observations=500, base_timepoints=15, missing_values=0.6)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When layers are allowed, check that this works for layers. You can check in other test cases how this can be done with pytest.mark.parametrize

Also, it make the test more robust if we check that this works in the 2D as well as in the 3D

)


def filter_features(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we try to use single-dispatch more consistently across our functions, see the normalization methods. This is helpful to enforce that our functions work regardless of whether a numpy array, a sparse array, or a dask array is passed.

In this case, its enough to make it work for the numpy arrays as you do already - but adding a single-dispatching that raises NotImplementedErrors, should the passed array in edata be a scipy.sparse or a dask array, will help users to immediately notice that they'll have to convert their data to a numpy array if they want to use this function

prop: float | None = None,
copy: bool = False,
) -> EHRData | None:
"""Filter observations based on numbers of variables (features/measurements).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many comments from above also apply to this function :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement ep.pp.filter_observations, ep.pp.filter_features
3 participants