CSV reader tries to convert quoted string column to BIGINT? #15376
Description
What happens?
As discussed in posit-dev/positron#5707, a large tab-delimited text file has a column of quoted strings that contain all numeric digits for the first couple million lines and then start including non-numeric digits after that. Since the values in the column are all quoted, DuckDB should consider them strings without trying to infer a numeric data type.
To Reproduce
Download and unzip this file: https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip
The following error is generated when trying to scan this file:
create table patent as from 'g_patent.tsv';
Conversion Error: CSV Error on Line: 8123649
Original Line: "D1000263" "design" "2023-10-03" "Connector for expanded cell confinement web with polygon handle" "S1" 1 0 "ipg231003.xml"
Error when converting column "patent_id". Could not convert string "D1000263" to 'BIGINT'
Column patent_id is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'patent_id': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.
file = /home/wesm/Downloads/g_patent.tsv
delimiter = (Auto-Detected)
quote = " (Auto-Detected)
escape = " (Auto-Detected)
new_line = \n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
comment = \0 (Auto-Detected)
date_format = (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding = 0
sample_size = 20480
ignore_errors = false
all_varchar = 0
This works however:
create table patent as from read_csv('g_patent.tsv', delim='\t', sample_size=-1);
The nightly build (dev2261) has a different error:
create table patent as from 'g_patent.tsv';
Invalid Input Error:
CSV Error on Line: 2156093
Original Line: "4001958" "utility" "1977-01-11" """""""Take-one"""" display card""" "A" 2 0 "pftaps19770111_wk02.zip"
Value with unterminated quote found.
Possible fixes:
* Enable ignore errors (ignore_errors=true) to skip this row
* Set quote to empty or to a different value (e.g., quote='')
file = /home/wesm/Downloads/g_patent.tsv
delimiter = (Auto-Detected)
quote = " (Auto-Detected)
escape = \ (Auto-Detected)
new_line = \n (Auto-Detected)
header = true (Auto-Detected)
skip_rows = 0 (Auto-Detected)
comment = \0 (Auto-Detected)
rfc_4180 = true (Auto-Detected)
date_format = (Auto-Detected)
timestamp_format = (Auto-Detected)
null_padding = 0
sample_size = 20480
ignore_errors = false
all_varchar = 0
This may be the intended behavior, but I couldn't find an explanation for how quoting is used/not used in the documentation (it seems that quoting is only being used for delimiting, but has no bearing on the type inference logic). It may be that a new option could limit type inference for quoted fields only to non-numeric data types.
OS:
duckdb-wasm, duckdb-cli
DuckDB Version:
1.1.3
DuckDB Client:
duckdb-cli, duckdb-wasm
Hardware:
No response
Full Name:
Wes McKinney
Affiliation:
Posit
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have