Unify Arrow stream scanning via __arrow_c_stream__ #307
Problem
Related: #70
DuckDB's Python client had separate code paths for every Arrow-flavored object type: PyArrow Table, RecordBatchReader, Scanner, Dataset, PyCapsule, and PyCapsuleInterface. Many of these did the same thing through different routes — materialize to a PyArrow Table, then scan it. This made the codebase harder to extend, and objects implementing the PyCapsule Interface (__arrow_c_stream__) couldn't get projection/filter pushdown unless pyarrow.dataset was installed.

Approach
The core design decision is to prefer __arrow_c_stream__ as the universal entry point rather than maintaining isinstance checks for PyArrow Table and RecordBatchReader. Both types implement __arrow_c_stream__, so they don't need dedicated branches — they fall through to the same PyCapsuleInterface path that handles any third-party Arrow producer (Polars, ADBC, etc.).

This collapses the type detection in GetArrowType() from six isinstance checks down to three (the types that don't have __arrow_c_stream__), followed by a single hasattr(obj, "__arrow_c_stream__") catch-all.
The PyCapsuleInterface path now has "tiered" pushdown: if pyarrow.dataset is available, the stream is imported as a RecordBatchReader, fed through Scanner.from_batches for projection/filter pushdown, and exported back to a C stream. For schema extraction we use schema._export_to_c as a fallback between __arrow_c_schema__ and the stream-consuming fallback. This hopefully prevents single-use streams from being consumed during schema extraction.
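A minimal sketch of that tier and of the schema-extraction ordering, assuming a recent pyarrow with the PyCapsule helpers (RecordBatchReader.from_stream); the function names and parameters here are illustrative, not the PR's actual code:

```python
import pyarrow as pa
import pyarrow.dataset as ds

def scan_capsule_interface(obj, columns=None, filter_expr=None):
    """Tier with pyarrow.dataset available: stream in, push down, stream out."""
    # Import the producer's __arrow_c_stream__ capsule as a RecordBatchReader.
    reader = pa.RecordBatchReader.from_stream(obj)
    # Scanner.from_batches applies projection/filter pushdown while streaming,
    # without materializing a full Table first.
    scanner = ds.Scanner.from_batches(
        reader, schema=reader.schema, columns=columns, filter=filter_expr
    )
    # Hand the already projected/filtered result back as a C stream capsule.
    return scanner.to_reader().__arrow_c_stream__()

def extract_schema(obj):
    """Prefer non-consuming hooks so a single-use stream isn't exhausted
    just to learn its schema."""
    if hasattr(obj, "__arrow_c_schema__"):
        return pa.schema(obj)   # schema capsule hook; doesn't touch the stream
    if hasattr(obj, "schema"):
        return obj.schema       # e.g. a RecordBatchReader's .schema attribute;
                                # the PR exports this via schema._export_to_c
    # Last resort: open the stream itself, which may consume a one-shot producer.
    return pa.RecordBatchReader.from_stream(obj).schema
```

Any producer that exposes __arrow_c_stream__ (Polars, ADBC, etc.) gets the same pushdown for free when pyarrow.dataset is present.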
Polars DataFrames with __arrow_c_stream__ (v1.4+) now fall through to the unified path instead of going through .to_arrow(). We keep a fallback for Polars < 1.4.

Edit: this resulted in a big performance degradation. Polars doesn't seem to do zero-copy conversion and will re-convert for every new scan. I've reverted for now.

I'll post some benchmarks tomorrow. First results look good.
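For reference, a minimal sketch of the (since-reverted) Polars gate described above; the helper name is hypothetical:

```python
import polars as pl

def to_scannable(df: pl.DataFrame):
    # Polars >= 1.4 exposes __arrow_c_stream__, so the DataFrame can take the
    # unified PyCapsuleInterface path directly (this is the part that was
    # reverted due to the re-conversion cost noted above).
    if hasattr(df, "__arrow_c_stream__"):
        return df
    # Older Polars: materialize to a PyArrow Table first.
    return df.to_arrow()
```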