-
Notifications
You must be signed in to change notification settings - Fork 36
3D enabled implementation of ep.pp.filter_observations, ep.pp.filter_features #953
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ly after review (without layers arg)
* `'proportion'`: The feature must pass the filtering criteria in at least a proportion `prop` of time points. For example, with `prop=0.3`, | ||
the feature must pass the filtering criteria in at least 30% of the time points. | ||
prop: Proportion of time points in which the feature must pass the filtering criteria. Only relevant if `time_mode='proportion'`. | ||
inplace: Performs the operation inplace or returns a copy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inplace: Performs the operation inplace or returns a copy. |
) | ||
|
||
|
||
def filter_features( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your nice new functions are not yet visible in the docs; See https://github.com/theislab/ehrapy/blob/main/docs/api/preprocessing_index.md how to add them, then you can check in the readthedocs build if everything looks as it should
|
||
|
||
def filter_features( | ||
edata: EHRData, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edata: EHRData, | |
edata: EHRData | AnnData |
with pytest.raises(ValueError): | ||
ep.pp.filter_features(edata, min_obs=185, time_mode="proportion", prop=2, copy=False) | ||
|
||
# min_obs filtering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be split into a separate test - its not doing what the test name says it is doing
with pytest.raises(ValueError): | ||
ep.pp.filter_observations(edata, min_vars=10, time_mode="proportion", prop=2, copy=False) | ||
|
||
# min_vars filtering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
|
||
|
||
def test_filter_features_invalid_args_min_max_obs(): | ||
edata = ed.dt.ehrdata_blobs(n_variables=45, n_observations=500, base_timepoints=15, missing_values=0.6) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When layers are allowed, check that this works for layers. You can check in other test cases how this can be done with pytest.mark.parametrize
Also, it make the test more robust if we check that this works in the 2D as well as in the 3D
) | ||
|
||
|
||
def filter_features( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we try to use single-dispatch more consistently across our functions, see the normalization methods. This is helpful to enforce that our functions work regardless of whether a numpy array, a sparse array, or a dask array is passed.
In this case, its enough to make it work for the numpy arrays as you do already - but adding a single-dispatching that raises NotImplementedErrors, should the passed array in edata
be a scipy.sparse or a dask array, will help users to immediately notice that they'll have to convert their data to a numpy array if they want to use this function
prop: float | None = None, | ||
copy: bool = False, | ||
) -> EHRData | None: | ||
"""Filter observations based on numbers of variables (features/measurements). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many comments from above also apply to this function :)
fixes #809
Implemented and wrote basic tests for:
They also now generalize to 3D data structure