Unify Arrow stream scanning via __arrow_c_stream__ #307
Problem
Related: #70
DuckDB's Python client had separate code paths for every Arrow-flavored object type: PyArrow Table, RecordBatchReader, Scanner, Dataset, PyCapsule, and PyCapsuleInterface. Many of these did the same thing through different routes — materialize to a PyArrow Table, then scan it. This made the codebase harder to extend, and objects implementing the PyCapsule Interface (__arrow_c_stream__) couldn't get projection/filter pushdown unless pyarrow.dataset was installed.

Approach
The core design decision is to prefer __arrow_c_stream__ as the universal entry point rather than maintaining isinstance checks for PyArrow Table and RecordBatchReader. Both types implement __arrow_c_stream__, so they don't need dedicated branches — they fall through to the same PyCapsuleInterface path that handles any third-party Arrow producer (Polars, ADBC, etc.).

This collapses the type detection in GetArrowType() from six isinstance checks down to three (the types that don't have __arrow_c_stream__), followed by a single hasattr(obj, "__arrow_c_stream__") catch-all.
The PyCapsuleInterface path now has "tiered" pushdown: if pyarrow.dataset is available, the stream is imported as a RecordBatchReader, fed through Scanner.from_batches for projection/filter pushdown, and exported back to a C stream. For schema extraction we use schema._export_to_c as a fallback between __arrow_c_schema__ and the stream-consuming fallback. This hopefully prevents single-use streams from being consumed during schema extraction.
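A minimal sketch of that tier and of the schema-extraction ordering, assuming a recent pyarrow with the PyCapsule helpers (RecordBatchReader.from_stream); the function names and parameters here are illustrative, not the PR's actual code:

```python
import pyarrow as pa
import pyarrow.dataset as ds

def scan_capsule_interface(obj, columns=None, filter_expr=None):
    """Tier with pyarrow.dataset available: stream in, push down, stream out."""
    # Import the producer's __arrow_c_stream__ capsule as a RecordBatchReader.
    reader = pa.RecordBatchReader.from_stream(obj)
    # Scanner.from_batches applies projection/filter pushdown while streaming,
    # without materializing a full Table first.
    scanner = ds.Scanner.from_batches(
        reader, schema=reader.schema, columns=columns, filter=filter_expr
    )
    # Hand the already projected/filtered result back as a C stream capsule.
    return scanner.to_reader().__arrow_c_stream__()

def extract_schema(obj):
    """Prefer non-consuming hooks so a single-use stream isn't exhausted
    just to learn its schema."""
    if hasattr(obj, "__arrow_c_schema__"):
        return pa.schema(obj)   # schema capsule hook; doesn't touch the stream
    if hasattr(obj, "schema"):
        return obj.schema       # e.g. a RecordBatchReader's .schema attribute;
                                # the PR exports this via schema._export_to_c
    # Last resort: open the stream itself, which may consume a one-shot producer.
    return pa.RecordBatchReader.from_stream(obj).schema
```

Any producer that exposes __arrow_c_stream__ (Polars, ADBC, etc.) gets the same pushdown for free when pyarrow.dataset is present.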
Polars DataFrames with __arrow_c_stream__ (v1.4+) now fall through to the unified path instead of going through .to_arrow(). We keep a fallback for Polars < 1.4.

Edit: this resulted in a big performance degradation. Polars doesn't seem to do zero-copy conversion and will re-convert for every new scan. I've reverted for now.

I'll post some benchmarks tomorrow. First results look good.
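For reference, a minimal sketch of the (since-reverted) Polars gate described above; the helper name is hypothetical:

```python
import polars as pl

def to_scannable(df: pl.DataFrame):
    # Polars >= 1.4 exposes __arrow_c_stream__, so the DataFrame can take the
    # unified PyCapsuleInterface path directly (this is the part that was
    # reverted due to the re-conversion cost noted above).
    if hasattr(df, "__arrow_c_stream__"):
        return df
    # Older Polars: materialize to a PyArrow Table first.
    return df.to_arrow()
```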