Conversation

@mo1998
@mo1998 mo1998 commented Feb 1, 2026

Summary

This PR resolves the Pandas PerformanceWarning: DataFrame is highly fragmented that was triggered when using the AddMissingIndicator transformer.
The warning occurred because missing-indicator columns were previously added to the DataFrame one at a time via direct assignment, which is inefficient when handling many columns.
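The column-by-column pattern described above can be reproduced outside the transformer. The following is a minimal sketch, not the library's code: the data, the column names, and the `_na` loop are assumptions for illustration. On pandas versions that emit the fragmentation warning, inserting more than 100 new columns one at a time is expected to trigger it.

```python
import warnings

import numpy as np
import pandas as pd

# Hypothetical data: 10 rows and 150 float columns.
df = pd.DataFrame(
    np.random.randn(10, 150),
    columns=[f"col_{i}" for i in range(150)],
)

with warnings.catch_warnings(record=True) as captured:
    warnings.simplefilter("always")
    out = df.copy()
    # One assignment per new column: each insert adds a separate internal block.
    for col in df.columns:
        out[f"{col}_na"] = df[col].isna().astype(int)

# True if pandas reported fragmentation during the loop.
fragmented = any(
    issubclass(w.category, pd.errors.PerformanceWarning) for w in captured
)
print(fragmented)
```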

What Changed

The transform method in feature_engine/imputation/missing_indicator.py has been refactored to improve performance and avoid DataFrame fragmentation:

  • All missing indicators are now computed in a single step.
  • A dedicated DataFrame is created to hold the indicator columns.
  • The original DataFrame and the indicators DataFrame are merged using pd.concat in one operation, instead of repeated column assignments.
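The three steps above can be sketched as follows. This is an illustration under stated assumptions rather than the actual code in missing_indicator.py: `add_indicators` is a hypothetical helper, and `variables` stands in for the transformer's fitted variable list.

```python
import warnings

import numpy as np
import pandas as pd


def add_indicators(X: pd.DataFrame, variables: list[str]) -> pd.DataFrame:
    """Hypothetical helper: append one 0/1 missing indicator per variable."""
    # Compute every indicator in a single vectorized step.
    indicators = X[variables].isna().astype(int)
    indicators.columns = [f"{var}_na" for var in variables]
    # Join the original data and the indicators in one operation.
    return pd.concat([X, indicators], axis=1)


df = pd.DataFrame(
    np.random.randn(10, 150),
    columns=[f"col_{i}" for i in range(150)],
)
df.iloc[0, :] = np.nan  # make the first row missing in every column

with warnings.catch_warnings(record=True) as captured:
    warnings.simplefilter("always")
    result = add_indicators(df, list(df.columns))

print(result.shape)
```

Because the indicator columns are built as one DataFrame and joined once, no per-column insert takes place, which is the step that produces the fragmentation warning.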

Impact

  • Eliminates the Pandas fragmentation warning.
  • Improves performance and memory efficiency when generating many missing-indicator columns.
  • No changes to the public API or transformer behavior.

Related Issue

Verification

  • Reproduction: Confirmed that the warning no longer appears when running a script that previously triggered it.
  • Regression Testing: All existing tests in tests/test_imputation/ pass successfully.

solegalli and others added 18 commits January 27, 2026 21:35
- Fix UnboundLocalError in _variable_type_checks.py by initializing is_cat/is_dt
- Add robust dtype checking using both is_object_dtype and is_string_dtype
- Update find_variables.py with same robust logic for consistency
- Fix warning count assertions in encoder tests (Pandas 3 adds extra deprecation warnings)
- Fix floating point precision assertion in recursive feature elimination test
- Apply ruff formatting and fix linting errors
- All 1900 tests passing
@mo1998
Author

mo1998 commented Feb 2, 2026

I merged #885 into the codebase to fix the issue with pandas 3.0.

@codecov

codecov bot commented Feb 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.20%. Comparing base (8282fd4) to head (6c41b96).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #887      +/-   ##
==========================================
+ Coverage   98.11%   98.20%   +0.09%     
==========================================
  Files         113      113              
  Lines        4829     4857      +28     
  Branches      768      775       +7     
==========================================
+ Hits         4738     4770      +32     
+ Misses         56       55       -1     
+ Partials       35       32       -3     

@solegalli
Collaborator

Hi @mo1998

Thank you very much for the fix to the fragmentation warning.

The branch that fixes the pandas warnings (#885) changes many files that shouldn't be changed here. That PR is still a work in progress (WIP).

Could you please, for now, commit only the changes that fix the fragmentation warning? Ideally, we'd also like to add a test that triggers the warning with the old version of the code and does not trigger it with the new version.

I asked @jose-cano for an example. I believe the warning is triggered when trying to add too many variables.

After we merge #885, we can then rebase main to resolve the pandas issues.

Thanks a lot!

@mo1998
Author

mo1998 commented Feb 3, 2026

Hi @solegalli,

I’ve updated the PR to focus strictly on the fragmentation warning fix (Issue #886).

Summary of changes:

  • Fragmentation fix: Refactored AddMissingIndicator.transform() to use pd.concat when adding indicator variables. This replaces the iterative column assignment that was triggering the PerformanceWarning.

  • New test case: Added test_no_performance_warning_with_many_variables in tests/test_imputation/test_missing_indicator.py. The test uses a DataFrame with 101 columns (the threshold at which the warning is raised) and verifies that transform runs without triggering pd.errors.PerformanceWarning.

  • Cleanup: Reverted all unrelated changes from the WIP Pandas 3.0 compatibility work. Only the two files directly related to the fragmentation fix are now modified.

The PR should now be ready for review.
Thanks a lot!

# Test for issue #886: PerformanceWarning due to fragmentation
import numpy as np
import pandas as pd
import warnings
Collaborator

How about this:

import warnings

import numpy as np
import pandas as pd

from feature_engine.imputation import AddMissingIndicator


def test_no_performance_warning_with_many_variables():
    n_cols = 101
    df = pd.DataFrame(
        np.random.randn(10, n_cols),
        columns=[f"col_{i}" for i in range(n_cols)],
    )

    # Introduce missing values
    df.iloc[0, :] = np.nan

    ami = AddMissingIndicator(missing_only=False)
    ami.fit(df)

    with warnings.catch_warnings(record=True) as captured:
        warnings.simplefilter("always")
        ami.transform(df)

    assert not any(
        issubclass(w.category, pd.errors.PerformanceWarning)
        for w in captured
    ), "PerformanceWarning was raised during transform"


indicator_names = [f"{feature}_na" for feature in self.variables_]
X[indicator_names] = X[self.variables_].isna().astype(int)
X_indicators = X[self.variables_].isna().astype(int)
Collaborator

How about replacing the 2 lines with this?

X_indicators = (
    X[self.variables_]
    .isna()
    .astype("int8")
    .add_suffix("_na")
)

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add_suffix.html
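To show what the suggested chain produces, here is a small sketch on made-up data. The DataFrame and the `variables` list are assumptions for illustration; `variables` stands in for `self.variables_`.

```python
import numpy as np
import pandas as pd

# Hypothetical input with missing values in both columns.
X = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0],
        "income": [np.nan, 3000.0, np.nan],
    }
)
variables = ["age", "income"]  # stands in for self.variables_

# Build all indicators at once; add_suffix renames every column in one call.
X_indicators = X[variables].isna().astype("int8").add_suffix("_na")

print(list(X_indicators.columns))
print(X_indicators.dtypes.to_dict())
```

One design point to weigh: `astype("int8")` yields a smaller dtype than `astype(int)`, so the indicator columns would no longer have the default integer dtype, and any downstream check that compares dtypes may need adjusting.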

@solegalli solegalli changed the title from "fix: correct missing indicator creation in AddMissingIndicator class" to "[BUG ] fix performance warning in AddMissingIndicator" Feb 3, 2026
Development

Successfully merging this pull request may close these issues.

Pandas fragmentation warning with AddMissingIndicator

3 participants