
Conversation

@evertlammerts commented Feb 10, 2026

Unify Arrow stream scanning via arrow_c_stream

Problem

Related: #70

DuckDB's Python client had separate code paths for every Arrow-flavored object type: PyArrow Table, RecordBatchReader, Scanner, Dataset, PyCapsule, and PyCapsuleInterface. Many of these did the same thing through different routes — materialize to a PyArrow Table, then scan it. This made the codebase harder to extend, and objects implementing the PyCapsule Interface (__arrow_c_stream__) couldn't get projection/filter pushdown unless pyarrow.dataset was installed.

Approach

The core design decision is to prefer __arrow_c_stream__ as the universal entry point rather than maintaining isinstance checks for PyArrow Table and RecordBatchReader. Both types implement __arrow_c_stream__, so they don't need dedicated branches — they fall through to the same PyCapsuleInterface path that handles any third-party Arrow producer (Polars, ADBC, etc.).
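A quick illustration of why the dedicated branches are redundant (a minimal check, assuming pyarrow >= 14, where the PyCapsule protocol landed):

```python
import pyarrow as pa

tbl = pa.table({"x": [1, 2, 3]})
reader = pa.RecordBatchReader.from_batches(tbl.schema, tbl.to_batches())

# Both built-in types already speak the PyCapsule stream protocol,
# so a single __arrow_c_stream__ path covers them.
assert hasattr(tbl, "__arrow_c_stream__")
assert hasattr(reader, "__arrow_c_stream__")
```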

This collapses the type detection in GetArrowType() from six isinstance checks down to three (the types that don't implement __arrow_c_stream__):

  • Scanner
  • Dataset
  • MessageReader

...followed by a single hasattr(obj, "__arrow_c_stream__") catch-all, as sketched below.
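In Python terms the dispatch now looks roughly like this (a sketch only; the real GetArrowType() lives in the C++ bindings, and the return tags are made up for illustration):

```python
import pyarrow.dataset as ds
import pyarrow.ipc as ipc

def get_arrow_type(obj):
    # Only the types without __arrow_c_stream__ still need isinstance checks.
    if isinstance(obj, ds.Scanner):
        return "scanner"
    if isinstance(obj, ds.Dataset):
        return "dataset"
    if isinstance(obj, ipc.MessageReader):
        return "message_reader"
    # Everything else (Table, RecordBatchReader, Polars, ADBC, ...) funnels
    # through the PyCapsule interface path.
    if hasattr(obj, "__arrow_c_stream__"):
        return "pycapsule_interface"
    raise TypeError(f"unsupported Arrow object: {type(obj)!r}")
```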

The PyCapsuleInterface path now has "tiered" pushdown (see the sketch after this list):

  • if pyarrow.dataset is available: import the stream as a RecordBatchReader, feed through Scanner.from_batches for projection/filter pushdown, export back to C stream.
  • otherwise: return the raw C stream directly. DuckDB handles projection/filter post-scan via arrow_scan_dumb.
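The tiering could be expressed roughly like this (a sketch, assuming pyarrow >= 15 for RecordBatchReader.from_stream; the real path also forwards DuckDB's column projections and filter expressions to the Scanner):

```python
import pyarrow as pa

def export_stream(obj, requested_schema=None):
    try:
        import pyarrow.dataset as ds
    except ImportError:
        # Tier 2: no pyarrow.dataset. Hand back the raw C stream and let
        # DuckDB apply projection/filter after the scan (arrow_scan_dumb).
        return obj.__arrow_c_stream__(requested_schema)
    # Tier 1: import the stream, route it through a Scanner so projection
    # and filter pushdown happen on the producer side, then re-export.
    reader = pa.RecordBatchReader.from_stream(obj)
    scanner = ds.Scanner.from_batches(reader, schema=reader.schema)
    return scanner.to_reader().__arrow_c_stream__(requested_schema)
```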

For schema extraction we try __arrow_c_schema__ first, fall back to schema._export_to_c, and only consume the stream as a last resort. This should prevent single-use streams from being consumed during schema extraction.
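Schematically, the fallback chain looks like this (a sketch; pa.schema() accepting __arrow_c_schema__ objects requires pyarrow >= 14, and the plain .schema attribute stands in for the _export_to_c tier):

```python
import pyarrow as pa

def extract_schema(obj):
    # 1. Preferred: the PyCapsule schema protocol; never touches the stream.
    if hasattr(obj, "__arrow_c_schema__"):
        return pa.schema(obj)
    # 2. Fallback: a PyArrow schema attribute, exported without reading any
    #    batches (this is where the C++ side calls schema._export_to_c).
    schema = getattr(obj, "schema", None)
    if isinstance(schema, pa.Schema):
        return schema
    # 3. Last resort: open the stream itself, which consumes single-use
    #    producers -- exactly what the earlier tiers try to avoid.
    return pa.RecordBatchReader.from_stream(obj).schema
```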

Polars DataFrames with __arrow_c_stream__ (v1.4+) now fall through to the unified path instead of going through .to_arrow(). We keep a fallback for Polars < 1.4. Edit: this caused a big performance regression. Polars doesn't seem to do a zero-copy conversion and re-converts on every new scan, so I've reverted this for now.

I'll post some benchmarks tomorrow. First results look good.

@evertlammerts (Collaborator, Author)

I ran a number of benchmarks (testing scanning, projection pushdown, and filter pushdown) with Arrow-backed pandas DataFrames, Arrow Tables, RecordBatchReaders, and Polars. There is no significant gain or regression in memory consumption or processing time.
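For reference, the shape of query these benchmarks exercise looks like this (a hypothetical reconstruction, not the actual benchmark script):

```python
import duckdb
import pyarrow as pa

tbl = pa.table({"id": pa.array(range(1_000_000)), "val": pa.array(range(1_000_000))})
con = duckdb.connect()

# Projection (only "id") and filter (id < 10) get pushed into the Arrow scan;
# DuckDB picks up `tbl` from local scope via a replacement scan.
print(con.execute("SELECT id FROM tbl WHERE id < 10").fetchall())
```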

I'll still merge this, both for the slight simplification and because it adds support for any object that implements the PyCapsule interface (__arrow_c_stream__).

@evertlammerts merged commit a4a4208 into duckdb:v1.5-variegata Feb 11, 2026
29 of 30 checks passed