
Conversation

@KKould (Member) commented Aug 18, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

close: #17943

# CSV
#  If `skip_header` is not 0, the first line is treated as the header row (column names)
create or replace file format head_csv_format type = 'CSV' field_delimiter = ',' skip_header = 1;

select * from infer_schema(location => '@data/csv/numbers_with_headers.csv', file_format => 'head_csv_format');

# NDJSON
select * from infer_schema(location => '@data/ndjson/max_records.ndjson', file_format => 'NDJSON', max_records_pre_file => 5);

infer_schema currently supports these file types:

  • parquet
  • csv
  • ndjson

Added max_records_pre_file to infer_schema; it limits schema inference to the first few records of the file.
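
A minimal sketch of what this looks like on the Rust side, assuming arrow-csv's Format API (with_header, with_delimiter, infer_schema); the file name and the record limit of 100 are illustrative, not values from the PR:

use std::fs::File;
use std::io::BufReader;

use arrow_csv::reader::Format;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("numbers_with_headers.csv")?;

    // skip_header = 1 in the SQL example maps to "the first line holds column names".
    let format = Format::default().with_header(true).with_delimiter(b',');

    // Only the first 100 records are scanned; the rest of the file is never parsed.
    let (schema, records_read) = format.infer_schema(BufReader::new(file), Some(100))?;
    println!("{schema:?} (inferred from {records_read} records)");
    Ok(())
}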

The maximum size of a single file is 100 MB.

Example: test.csv is 150 MB

select * from infer_schema(location => '@data/csv/large/test.csv', file_format => 'head_csv_format');
error: APIError: QueryFailed: [2004]The file 'csv/large/test.csv' is too large(maximum allowed: 100.00 MB)
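
For reference, a toy sketch of the size guard; only the MAX_SINGLE_FILE_BYTES constant mirrors the PR (see the diff excerpt further down), the function and error type here are made up for illustration:

// 100 MB cap on a single file sampled for schema inference.
const MAX_SINGLE_FILE_BYTES: usize = 100 * 1024 * 1024;

fn check_single_file_size(path: &str, bytes_read: usize) -> Result<(), String> {
    if bytes_read > MAX_SINGLE_FILE_BYTES {
        return Err(format!(
            "The file '{path}' is too large (maximum allowed: 100.00 MB)"
        ));
    }
    Ok(())
}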

ref: https://docs.snowflake.com/en/sql-reference/functions/infer_schema

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Aug 18, 2025
@KKould KKould requested review from sundy-li and youngsofun and removed request for sundy-li August 18, 2025 09:38
@KKould KKould self-assigned this Aug 18, 2025
@KKould KKould marked this pull request as ready for review August 18, 2025 09:39

@youngsofun (Member) commented Aug 19, 2025

@KKould

For the current implementation, maybe we should return an error earlier and more clearly when:

  1. the file is too large (or alternatively: 1. choose the smallest file when listing, 2. read only the first max_bytes, which will not error as long as that is enough for max_records_pre_file (if not, read more bytes; a sketch follows this list), 3. for NDJSON it is easy to find the first N rows)
  2. the file is compressed
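
A minimal sketch of the bounded-read idea from item 1, using std::io::Read::take; max_bytes is an illustrative parameter, not something in the PR:

use std::io::{BufRead, BufReader, Read};

// Cap how many bytes can be pulled from the underlying reader, so schema
// inference never buffers more than `max_bytes` in memory. If inference later
// fails on a partial row, the cap can be raised and the read retried.
fn bounded_reader<R: Read>(inner: R, max_bytes: u64) -> impl BufRead {
    BufReader::new(inner.take(max_bytes))
}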

@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from 792238a to 48cd72c Compare August 19, 2025 05:03

@KKould (Member, Author) commented Aug 19, 2025

@youngsofun

  1. the file is too large (or alternatively: 1. choose the smallest file when listing, 2. read only the first max_bytes, which will not error as long as that is enough for max_records_pre_file (if not, read more bytes), 3. for NDJSON it is easy to find the first N rows)

Currently, max_records_pre_file exists to avoid reading overly large files (it is supported by both arrow-json and arrow-csv), but the reader built on the operator cannot implement std::io::Read, so it can only read the entire file, which may cause high memory usage.

I think adding max_bytes to determine in advance whether the file is too large would solve this problem, but there is no similar parameter in Snowflake. Even so, should we still add max_bytes?
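
A hedged sketch of what such a max_bytes pre-check could look like, assuming an opendal-style Operator (op, path and max_bytes are illustrative, not the PR's code):

use opendal::Operator;

// stat() only fetches object metadata, so an oversized file can be rejected
// before any of its data is downloaded.
async fn file_fits(op: &Operator, path: &str, max_bytes: u64) -> opendal::Result<bool> {
    let meta = op.stat(path).await?;
    Ok(meta.content_length() <= max_bytes)
}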

Ok((schema, _)) => {
    return Ok(schema);
}
Err(err) => {

Member:

Should we check the error type? For example, if the JSON file is bad, we should return an error at once instead of reading until the end of the file.

Member Author:

Errors other than CsvError and JsonError are now thrown directly.
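
Roughly, the classification described here amounts to something like the following, assuming arrow's ArrowError enum (the helper name is hypothetical):

use arrow_schema::ArrowError;

// Only CSV/JSON parse errors are treated as "maybe we just need more data";
// every other error is propagated immediately.
fn is_retryable(err: &ArrowError) -> bool {
    matches!(err, ArrowError::CsvError(_) | ArrowError::JsonError(_))
}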

Member:

I checked the code of infer_json_schema:

pub fn infer_json_schema<R: BufRead>(
    reader: R,
    max_read_records: Option<usize>,
) -> Result<(Schema, usize), ArrowError> {
    let mut values = ValueIter::new(reader, max_read_records);
    let schema = infer_json_schema_from_iterator(&mut values)?;
    Ok((schema, values.record_count))
}

We should:

  1. First try to actually read the whole max_read_records with arrow. Not having enough data may lead to an error because the buffer ends with part of a row; in that case read more data, and if the error stays the same, return a bad-file error.
  2. Then use infer_json_schema_from_iterator; if the error is a conflicting-type error, return it directly. (A rough sketch of this retry loop follows below.)
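
A rough sketch of that two-step approach, reusing the infer_json_schema function quoted above (assuming the arrow_json crate paths); read_more and the retry policy are assumptions for illustration, not the PR's code:

use std::io::Cursor;

use arrow_json::reader::infer_json_schema;
use arrow_schema::{ArrowError, Schema};

fn infer_with_retry(
    mut buf: Vec<u8>,
    max_records: usize,
    mut read_more: impl FnMut(&mut Vec<u8>) -> bool,
) -> Result<Schema, ArrowError> {
    loop {
        // Re-run inference over the bytes buffered so far.
        let attempt = infer_json_schema(Cursor::new(&buf), Some(max_records));
        match attempt {
            // Enough complete rows were read: the inferred schema is trustworthy.
            Ok((schema, _records)) => return Ok(schema),
            // A JSON error may only mean the buffer ends in the middle of a row;
            // fetch more bytes and retry. If no more data can be fetched, the
            // error is surfaced by the arm below.
            Err(ArrowError::JsonError(_)) if read_more(&mut buf) => continue,
            Err(err) => return Err(err),
        }
    }
}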

@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from 18bf009 to c307c9c Compare September 3, 2025 06:39
@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from c307c9c to c66aae7 Compare September 3, 2025 06:40
@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from 65abd3e to fc6ce4b Compare September 9, 2025 10:47
bytes.extend(batch.data);

if bytes.len() > MAX_SINGLE_FILE_BYTES {
    return Err(ErrorCode::InvalidArgument(format!(

Member:

If max_records is not used, we could recommend that the user set it?

@sundy-li sundy-li merged commit 92b7161 into databendlabs:main Sep 13, 2025
87 checks passed


Development

Successfully merging this pull request may close these issues.

feat: infer_schema support more file types

3 participants