UnicodeDecodeError on CSV that is successfully read by pandas and polars #93

@Hillard28

What happens?

Issue

I am attempting to create a database from ~380 CSVs, but at a certain point I hit an error with one of the files:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[5], line 1
----> 1 con.sql(
      2     f"CREATE TABLE company AS SELECT * FROM '{path}/rhetorik/companies/csv/*.csv.gz'"
      3 )

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 10061: invalid continuation byte

The above was using a glob pattern, but the issue also occurs if I attempt to create the table in a loop:

UnicodeDecodeError                        Traceback (most recent call last)
Cell In[19], line 1
----> 1 con.sql(f"INSERT INTO company SELECT * FROM '{path}/rhetorik/companies/csv/{file}'")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 10061: invalid continuation byte
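
For completeness, the loop variant was along these lines (the iteration over files is paraphrased; only the INSERT statement above is verbatim):

import os
import duckdb

path = "/run/media/rgilland/T9/data"
csv_dir = f"{path}/rhetorik/companies/csv"
files = sorted(f for f in os.listdir(csv_dir) if f.endswith(".csv.gz"))

with duckdb.connect(f"{path}/rhetorik/rhetorik.db") as con:
    # Create the table from the first file, then append the rest one at a time.
    con.sql(f"CREATE TABLE company AS SELECT * FROM '{csv_dir}/{files[0]}'")
    for file in files[1:]:
        con.sql(f"INSERT INTO company SELECT * FROM '{csv_dir}/{file}'")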

To Reproduce

Reproduce

Oddly, I do not have this issue when reading the same file with either polars or pandas. The code below reads the file of interest successfully with both, but fails with duckdb:

import duckdb
import polars as pl
import pandas as pd


path = "/run/media/rgilland/T9/data"
file = f"{path}/rhetorik/companies/csv/company-data_15_7_0.csv.gz"

# SUCCEEDS
pl.read_csv(file, encoding='utf8')
pd.read_csv(file, encoding='utf-8')

# FAILS
duckdb.sql(f"SELECT count(*) FROM '{file}'")
duckdb.sql(f"SELECT count(*) FROM read_csv_auto('{file}')")

Code

The actual code I am attempting to run is this (with minor adjustments for readability):

import os
import duckdb

path = "/run/media/rgilland/T9/data"

if os.path.exists(f"{path}/rhetorik/rhetorik.db"):
    os.remove(f"{path}/rhetorik/rhetorik.db")

with duckdb.connect(f"{path}/rhetorik/rhetorik.db") as con:
    con.sql(
        f"CREATE TABLE company AS SELECT * FROM '{path}/rhetorik/companies/csv/*.csv.gz'"
    )

It doesn't appear to be related to the gzip compression, as the same error occurs on the uncompressed file.
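
The uncompressed check was roughly the following (paraphrased; the exact commands may have differed):

import gzip
import shutil

# Write an uncompressed copy of company-data_15_7_0.csv.gz next to the original.
with gzip.open(file, "rb") as src, open(file[:-3], "wb") as dst:
    shutil.copyfileobj(src, dst)

duckdb.sql(f"SELECT count(*) FROM read_csv_auto('{file[:-3]}')")  # same UnicodeDecodeError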

I'm unable to share the data itself, as it is both large and licensed, but I find the varying success across duckdb, polars, and pandas odd.

OS:

EndeavourOS [6.16.8-arch3-1] (64-bit)

DuckDB Package Version:

1.4.0

Python Version:

3.13.7

Full Name:

Ryan Gilland

Affiliation:

New York University

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
