UnicodeDecodeError on CSV that is successfully read by pandas and polars #93

@Hillard28

What happens?

Issue

I am attempting to create a database from ~380 CSVs, but at a certain point I hit an error with one of the files:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[5], line 1
----> 1 con.sql(
      2     f"CREATE TABLE company AS SELECT * FROM '{path}/rhetorik/companies/csv/*.csv.gz'"
      3 )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 10061: invalid continuation byte

The above was using a glob pattern, but the issue also occurs if I attempt to create the table in a loop:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[19], line 1
----> 1 con.sql(f"INSERT INTO company SELECT * FROM '{path}/rhetorik/companies/csv/{file}'")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 10061: invalid continuation byte
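
For completeness, the loop variant was along these lines (the iteration over files is paraphrased; only the INSERT statement above is verbatim):

import os
import duckdb

path = "/run/media/rgilland/T9/data"
csv_dir = f"{path}/rhetorik/companies/csv"
files = sorted(f for f in os.listdir(csv_dir) if f.endswith(".csv.gz"))

with duckdb.connect(f"{path}/rhetorik/rhetorik.db") as con:
    # Create the table from the first file, then append the rest one at a time.
    con.sql(f"CREATE TABLE company AS SELECT * FROM '{csv_dir}/{files[0]}'")
    for file in files[1:]:
        con.sql(f"INSERT INTO company SELECT * FROM '{csv_dir}/{file}'")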

To Reproduce

Reproduce

Oddly, I do not have this issue when reading the same file with either polars or pandas. The code below reads the file of interest successfully with both, but fails with duckdb:

import duckdb
import polars as pl
import pandas as pd


path = "/run/media/rgilland/T9/data"
file = f"{path}/rhetorik/companies/csv/company-data_15_7_0.csv.gz"

# SUCCEEDS
pl.read_csv(file, encoding='utf8')
pd.read_csv(file, encoding='utf-8')

# FAILS
duckdb.sql(f"SELECT count(*) FROM '{file}'")
duckdb.sql(f"SELECT count(*) FROM read_csv_auto('{file}')")

Code

The actual code I am attempting to run is this (with minor adjustments for readability):

import os
import duckdb

path = "/run/media/rgilland/T9/data"

if os.path.exists(f"{path}/rhetorik/rhetorik.db"):
    os.remove(f"{path}/rhetorik/rhetorik.db")

with duckdb.connect(f"{path}/rhetorik/rhetorik.db") as con:
    con.sql(
        f"CREATE TABLE company AS SELECT * FROM '{path}/rhetorik/companies/csv/*.csv.gz'"
    )

It doesn't appear to be related to the gzip compression, as the same error occurs on the uncompressed file.
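
The uncompressed check was roughly the following (paraphrased; the exact commands may have differed):

import gzip
import shutil

# Write an uncompressed copy of company-data_15_7_0.csv.gz next to the original.
with gzip.open(file, "rb") as src, open(file[:-3], "wb") as dst:
    shutil.copyfileobj(src, dst)

duckdb.sql(f"SELECT count(*) FROM read_csv_auto('{file[:-3]}')")  # same UnicodeDecodeError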

I'm unable to share the data itself, as it is both large and licensed, but I find the varying success across duckdb, polars, and pandas odd.

OS:

EndeavourOS [6.16.8-arch3-1] (64-bit)

DuckDB Package Version:

1.4.0

Python Version:

3.13.7

Full Name:

Ryan Gilland

Affiliation:

New York University

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
