Description
What happens?
Issue
I am attempting to create a database from ~380 CSVs, but at a certain point I hit an error with one of the files:
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[5], line 1
----> 1 con.sql(
      2     f"CREATE TABLE company AS SELECT * FROM '{path}/rhetorik/companies/csv/*.csv.gz'"
      3 )
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 10061: invalid continuation byte
The above was using a glob pattern, but the issue also occurs if I attempt to create the table in a loop:
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[19], line 1
----> 1 con.sql(f"INSERT INTO company SELECT * FROM '{path}/rhetorik/companies/csv/{file}'")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 10061: invalid continuation byte
To Reproduce
Reproduce
What is odd is that I do not have this issue when reading the same file with either polars or pandas. The code below reads the file of interest successfully with polars and pandas, but fails with duckdb:
import duckdb
import polars as pl
import pandas as pd
path = "/run/media/rgilland/T9/data"
file = f"{path}/rhetorik/companies/csv/company-data_15_7_0.csv.gz"
# SUCCEEDS
pl.read_csv(file, encoding='utf8')
pd.read_csv(file, encoding='utf-8')
# FAILS
duckdb.sql(f"SELECT count(*) FROM '{file}'")
duckdb.sql(f"SELECT count(*) FROM read_csv_auto('{file}')")
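Since the data cannot be shared, one way to narrow this down is to check whether the file really contains invalid UTF-8 near the reported position. Below is a minimal sketch using only the standard library. The helper names `read_maybe_gzip` and `find_invalid_utf8` are hypothetical and not part of duckdb, polars, or pandas. It assumes the file fits in memory, and it does not assume that position 10061 from the error message is an offset into the file; that is exactly what it would help confirm or rule out.

```python
import gzip


def read_maybe_gzip(path: str) -> bytes:
    """Return the raw bytes of a file, decompressing it if the name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rb") as fh:
        return fh.read()


def find_invalid_utf8(data: bytes, limit: int = 10, context: int = 20):
    """Return up to `limit` (offset, surrounding_bytes) pairs, one per invalid
    UTF-8 sequence found in `data`. An empty list means `data` is valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            # Strict decoding raises at the first invalid sequence after `pos`.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as exc:
            bad = pos + exc.start  # absolute offset of the offending byte
            problems.append((bad, data[max(0, bad - context): bad + context]))
            pos += exc.end  # resume scanning just past the invalid sequence
    return problems
```

Calling `find_invalid_utf8(read_maybe_gzip(file))` on the failing file and getting an empty list would suggest that the file itself is valid UTF-8 and that the error originates somewhere other than the file contents. A non-empty list gives byte offsets and context that can be reported without sharing the full data set.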
Code
The actual code I am attempting to run is this (with minor adjustments for readability):
import os
import duckdb
path = "/run/media/rgilland/T9/data"
# Start from a clean database file
if os.path.exists(f"{path}/rhetorik/rhetorik.db"):
    os.remove(f"{path}/rhetorik/rhetorik.db")
with duckdb.connect(f"{path}/rhetorik/rhetorik.db") as con:
    con.sql(
        f"CREATE TABLE company AS SELECT * FROM '{path}/rhetorik/companies/csv/*.csv.gz'"
    )
The error does not appear to be related to the gzip compression, as the same error occurs on the uncompressed file.
I'm unable to share the data itself, as it is both large and licensed, but I find it odd that duckdb fails on a file that polars and pandas both read successfully.
OS:
EndeavourOS [6.16.8-arch3-1] (64-bit)
DuckDB Package Version:
1.4.0
Python Version:
3.13.7
Full Name:
Ryan Gilland
Affiliation:
New York University
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration to reproduce the issue?
Yes, I have