
Conversation

@KKould (Member) commented Aug 18, 2025

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

close: #17943

# CSV
#  If `skip_header` is not 0, the first line is treated as the header row (column names)
create or replace file format head_csv_format type = 'CSV' field_delimiter = ',' skip_header = 1;

select * from infer_schema(location => '@data/csv/numbers_with_headers.csv', file_format => 'head_csv_format');

# NDJSON
select * from infer_schema(location => '@data/ndjson/max_records.ndjson', file_format => 'NDJSON', max_records_pre_file => 5);

infer_schema currently supports these file types:

  • parquet
  • csv
  • ndjson

Added max_records_pre_file to infer_schema; it limits schema inference to the first few records of the file.
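
A minimal sketch of what this looks like on the Rust side, assuming arrow-csv's Format API (with_header, with_delimiter, infer_schema); the file name and the record limit of 100 are illustrative, not values from the PR:

use std::fs::File;
use std::io::BufReader;

use arrow_csv::reader::Format;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("numbers_with_headers.csv")?;

    // skip_header = 1 in the SQL example maps to "the first line holds column names".
    let format = Format::default().with_header(true).with_delimiter(b',');

    // Only the first 100 records are scanned; the rest of the file is never parsed.
    let (schema, records_read) = format.infer_schema(BufReader::new(file), Some(100))?;
    println!("{schema:?} (inferred from {records_read} records)");
    Ok(())
}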

The maximum size of a single file is 100 MB.

Example: test.csv is 150 MB

select * from infer_schema(location => '@data/csv/large/test.csv', file_format => 'head_csv_format');
error: APIError: QueryFailed: [2004]The file 'csv/large/test.csv' is too large(maximum allowed: 100.00 MB)
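
For reference, a toy sketch of the size guard; only the MAX_SINGLE_FILE_BYTES constant mirrors the PR (see the diff excerpt further down), the function and error type here are made up for illustration:

// 100 MB cap on a single file sampled for schema inference.
const MAX_SINGLE_FILE_BYTES: usize = 100 * 1024 * 1024;

fn check_single_file_size(path: &str, bytes_read: usize) -> Result<(), String> {
    if bytes_read > MAX_SINGLE_FILE_BYTES {
        return Err(format!(
            "The file '{path}' is too large (maximum allowed: 100.00 MB)"
        ));
    }
    Ok(())
}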

ref: https://docs.snowflake.com/en/sql-reference/functions/infer_schema

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Aug 18, 2025
@KKould KKould requested review from sundy-li and youngsofun and removed request for sundy-li August 18, 2025 09:38
@KKould KKould self-assigned this Aug 18, 2025
@KKould KKould marked this pull request as ready for review August 18, 2025 09:39

@youngsofun (Member) commented Aug 19, 2025

@KKould

For the current implementation, maybe we should return an error earlier and more clearly when:

  1. the file is too large (or alternatively: 1. choose the smallest file when listing, 2. read only the first max_bytes, which will not error as long as that is enough for max_records_pre_file (if not, read more bytes; a sketch follows this list), 3. for NDJSON it is easy to find the first N rows)
  2. the file is compressed
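
A minimal sketch of the bounded-read idea from item 1, using std::io::Read::take; max_bytes is an illustrative parameter, not something in the PR:

use std::io::{BufRead, BufReader, Read};

// Cap how many bytes can be pulled from the underlying reader, so schema
// inference never buffers more than `max_bytes` in memory. If inference later
// fails on a partial row, the cap can be raised and the read retried.
fn bounded_reader<R: Read>(inner: R, max_bytes: u64) -> impl BufRead {
    BufReader::new(inner.take(max_bytes))
}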

@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from 792238a to 48cd72c Compare August 19, 2025 05:03

@KKould (Member, Author) commented Aug 19, 2025

@youngsofun

  1. the file is too large (or alternatively: 1. choose the smallest file when listing, 2. read only the first max_bytes, which will not error as long as that is enough for max_records_pre_file (if not, read more bytes), 3. for NDJSON it is easy to find the first N rows)

Currently, max_records_pre_file exists to avoid reading overly large files (it is supported by both arrow-json and arrow-csv), but the reader built on the operator cannot implement std::io::Read, so it can only read the entire file, which may cause high memory usage.

I think adding max_bytes to determine in advance whether the file is too large would solve this problem, but there is no similar parameter in Snowflake. Even so, should we still add max_bytes?
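
A hedged sketch of what such a max_bytes pre-check could look like, assuming an opendal-style Operator (op, path and max_bytes are illustrative, not the PR's code):

use opendal::Operator;

// stat() only fetches object metadata, so an oversized file can be rejected
// before any of its data is downloaded.
async fn file_fits(op: &Operator, path: &str, max_bytes: u64) -> opendal::Result<bool> {
    let meta = op.stat(path).await?;
    Ok(meta.content_length() <= max_bytes)
}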

Ok((schema, _)) => {
    return Ok(schema);
}
Err(err) => {

Member:

Should we check the error type? For example, if the JSON file is bad, we should return an error at once instead of reading until the end of the file.

Member Author:

Errors other than CsvError and JsonError are now thrown directly.
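
Roughly, the classification described here amounts to something like the following, assuming arrow's ArrowError enum (the helper name is hypothetical):

use arrow_schema::ArrowError;

// Only CSV/JSON parse errors are treated as "maybe we just need more data";
// every other error is propagated immediately.
fn is_retryable(err: &ArrowError) -> bool {
    matches!(err, ArrowError::CsvError(_) | ArrowError::JsonError(_))
}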

Member:

I checked the code of infer_json_schema:

pub fn infer_json_schema<R: BufRead>(
    reader: R,
    max_read_records: Option<usize>,
) -> Result<(Schema, usize), ArrowError> {
    let mut values = ValueIter::new(reader, max_read_records);
    let schema = infer_json_schema_from_iterator(&mut values)?;
    Ok((schema, values.record_count))
}

We should:

  1. First try to actually read the whole max_read_records with arrow. Not having enough data may lead to an error because the buffer ends with part of a row; in that case read more data, and if the error stays the same, return a bad-file error.
  2. Then use infer_json_schema_from_iterator; if the error is a conflicting-type error, return it directly. (A rough sketch of this retry loop follows below.)
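
A rough sketch of that two-step approach, reusing the infer_json_schema function quoted above (assuming the arrow_json crate paths); read_more and the retry policy are assumptions for illustration, not the PR's code:

use std::io::Cursor;

use arrow_json::reader::infer_json_schema;
use arrow_schema::{ArrowError, Schema};

fn infer_with_retry(
    mut buf: Vec<u8>,
    max_records: usize,
    mut read_more: impl FnMut(&mut Vec<u8>) -> bool,
) -> Result<Schema, ArrowError> {
    loop {
        // Re-run inference over the bytes buffered so far.
        let attempt = infer_json_schema(Cursor::new(&buf), Some(max_records));
        match attempt {
            // Enough complete rows were read: the inferred schema is trustworthy.
            Ok((schema, _records)) => return Ok(schema),
            // A JSON error may only mean the buffer ends in the middle of a row;
            // fetch more bytes and retry. If no more data can be fetched, the
            // error is surfaced by the arm below.
            Err(ArrowError::JsonError(_)) if read_more(&mut buf) => continue,
            Err(err) => return Err(err),
        }
    }
}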

@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from 18bf009 to c307c9c Compare September 3, 2025 06:39
@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from c307c9c to c66aae7 Compare September 3, 2025 06:40
@KKould KKould force-pushed the feat/infer_schema_for_csv_ndjson branch from 65abd3e to fc6ce4b Compare September 9, 2025 10:47
bytes.extend(batch.data);

if bytes.len() > MAX_SINGLE_FILE_BYTES {
    return Err(ErrorCode::InvalidArgument(format!(

Member:

If max_records is not used, we could recommend that the user set it?

@sundy-li sundy-li merged commit 92b7161 into databendlabs:main Sep 13, 2025
87 checks passed


Development

Successfully merging this pull request may close these issues.

feat: infer_schema support more file types

3 participants