-
Notifications
You must be signed in to change notification settings - Fork 335
[BUG ] fix performance warning in AddMissingIndicator #887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Fix UnboundLocalError in _variable_type_checks.py by initializing is_cat/is_dt - Add robust dtype checking using both is_object_dtype and is_string_dtype - Update find_variables.py with same robust logic for consistency - Fix warning count assertions in encoder tests (Pandas 3 adds extra deprecation warnings) - Fix floating point precision assertion in recursive feature elimination test - Apply ruff formatting and fix linting errors - All 1900 tests passing
|
I merged #885 Into codebase to fix the issue with pandas 3.0 |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #887 +/- ##
==========================================
+ Coverage 98.11% 98.20% +0.09%
==========================================
Files 113 113
Lines 4829 4857 +28
Branches 768 775 +7
==========================================
+ Hits 4738 4770 +32
+ Misses 56 55 -1
+ Partials 35 32 -3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Hi @mo1998 Thank you very much for the fix to the fragmentation warning. The branch that fixes pandas warning (#885) has many files that shouldn't be changed. That PR is still WIP (work in progress). Could you please, momentarily, just commit the changes that fix the fragmentation warning? Ideally, we'd also like to add a test that would trigger the warning with the old version of the code, and does not trigger it with the new version. I asked @jose-cano for an example. I believe the warning is triggered when trying to add too many variables. After we merge #885, we can then rebase main to resolve the pandas issues. Thanks a lot! |
|
Hi @solegalli, I’ve updated the PR to focus strictly on the fragmentation warning fix (Issue #886). Summary of changes:
The PR should now be ready for review. |
| # Test for issue #886: PerformanceWarning due to fragmentation | ||
| import numpy as np | ||
| import pandas as pd | ||
| import warnings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about this:
import warnings
import numpy as np
import pandas as pd
def test_no_performance_warning_with_many_variables():
n_cols = 101
df = pd.DataFrame(
np.random.randn(10, n_cols),
columns=[f"col_{i}" for i in range(n_cols)],
)
# Introduce missing values
df.iloc[0, :] = np.nan
ami = AddMissingIndicator(missing_only=False)
ami.fit(df)
with warnings.catch_warnings(record=True) as captured:
warnings.simplefilter("always")
ami.transform(df)
assert not any(
issubclass(w.category, pd.errors.PerformanceWarning)
for w in captured
), "PerformanceWarning was raised during transform"
|
|
||
| indicator_names = [f"{feature}_na" for feature in self.variables_] | ||
| X[indicator_names] = X[self.variables_].isna().astype(int) | ||
| X_indicators = X[self.variables_].isna().astype(int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about replacing the 2 lines with this?
X_indicators = (
X[self.variables_]
.isna()
.astype("int8")
.add_suffix("_na")
)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add_suffix.html
Summary
This PR resolves the Pandas
PerformanceWarning: DataFrame is highly fragmentedthat was triggered when using theAddMissingIndicatortransformer.The warning occurred because missing-indicator columns were previously added to the DataFrame one at a time via direct assignment, which is inefficient when handling many columns.
What Changed
The
transformmethod infeature_engine/imputation/missing_indicator.pyhas been refactored to improve performance and avoid DataFrame fragmentation:pd.concatin one operation, instead of repeated column assignments.Impact
Related Issue
Verification
tests/test_imputation/pass successfully.