Skip to content

CSV reader tries to convert quoted string column to BIGINT? #15376

Closed
@wesm

Description

What happens?

As discussed in posit-dev/positron#5707, a large tab-delimited text file has a column of quoted strings that contain all numeric digits for the first couple million lines and then start including non-numeric digits after that. Since the values in the column are all quoted, DuckDB should consider them strings without trying to infer a numeric data type.

To Reproduce

Download and unzip this file: https://s3.amazonaws.com/data.patentsview.org/download/g_patent.tsv.zip

The following error is generated when trying to scan this file:

create table patent as from 'g_patent.tsv';
Conversion Error: CSV Error on Line: 8123649
Original Line: "D1000263"	"design"	"2023-10-03"	"Connector for expanded cell confinement web with polygon handle"	"S1"	1	0	"ipg231003.xml"
Error when converting column "patent_id". Could not convert string "D1000263" to 'BIGINT'

Column patent_id is being converted as type BIGINT
This type was auto-detected from the CSV file.
Possible solutions:
* Override the type for this column manually by setting the type explicitly, e.g. types={'patent_id': 'VARCHAR'}
* Set the sample size to a larger value to enable the auto-detection to scan more values, e.g. sample_size=-1
* Use a COPY statement to automatically derive types from an existing table.

  file = /home/wesm/Downloads/g_patent.tsv
  delimiter = 	 (Auto-Detected)
  quote = " (Auto-Detected)
  escape = " (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  comment = \0 (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding = 0
  sample_size = 20480
  ignore_errors = false
  all_varchar = 0

This works however:

create table patent as from read_csv('g_patent.tsv', delim='\t', sample_size=-1);

The nightly build (dev2261) has a different error:

create table patent as from 'g_patent.tsv';
Invalid Input Error:
CSV Error on Line: 2156093
Original Line: "4001958"	"utility"	"1977-01-11"	"""""""Take-one"""" display card"""	"A"	2	0	"pftaps19770111_wk02.zip"
Value with unterminated quote found.

Possible fixes:
* Enable ignore errors (ignore_errors=true) to skip this row
* Set quote to empty or to a different value (e.g., quote='')

  file = /home/wesm/Downloads/g_patent.tsv
  delimiter = 	 (Auto-Detected)
  quote = " (Auto-Detected)
  escape = \ (Auto-Detected)
  new_line = \n (Auto-Detected)
  header = true (Auto-Detected)
  skip_rows = 0 (Auto-Detected)
  comment = \0 (Auto-Detected)
  rfc_4180 = true (Auto-Detected)
  date_format =  (Auto-Detected)
  timestamp_format =  (Auto-Detected)
  null_padding = 0
  sample_size = 20480
  ignore_errors = false
  all_varchar = 0

This may be the intended behavior, but I couldn't find an explanation for how quoting is used/not used in the documentation (it seems that quoting is only being used for delimiting, but has no bearing on the type inference logic). It may be that a new option could limit type inference for quoted fields only to non-numeric data types.

OS:

duckdb-wasm, duckdb-cli

DuckDB Version:

1.1.3

DuckDB Client:

duckdb-cli, duckdb-wasm

Hardware:

No response

Full Name:

Wes McKinney

Affiliation:

Posit

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Assignees

No one assigned

    Labels

    Needs DocumentationUse for issues or PRs that require changes in the documentationexpected behaviorThe behavior described in the issue is expectedreproduced

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions