feat: infer_schema expands csv and ndjson support
#18552
Conversation
For the current implementation, maybe we should return an error earlier and with a clearer message when:
Force-pushed 792238a to 48cd72c
Currently, max_records_per_file exists to avoid scanning overly large files (it is supported by both arrow-json and arrow-csv), but the reader built on the operator cannot implement std::io::Read, so it can only read the entire file, which may cause high memory usage. I think adding a max_bytes parameter to determine in advance whether the file is too large would solve this, but Snowflake has no similar parameter. Even so, do we still need to add max_bytes?
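A minimal sketch of that idea, assuming Databend's ErrorCode/Result types and that the file size is available from the storage operator's metadata (e.g. a stat call) before any content is read; the max_bytes parameter itself is hypothetical and not part of this PR:

use databend_common_exception::{ErrorCode, Result};

// Hypothetical pre-check for the proposed max_bytes parameter: reject a file
// based on its size from metadata (e.g. the operator's stat call) before
// buffering any of its content in memory.
fn check_file_size(path: &str, file_size: u64, max_bytes: u64) -> Result<()> {
    if file_size > max_bytes {
        return Err(ErrorCode::InvalidArgument(format!(
            "file {path} is {file_size} bytes, which exceeds max_bytes = {max_bytes}"
        )));
    }
    Ok(())
}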
Ok((schema, _)) => {
    return Ok(schema);
}
Err(err) => {
Should we check the error type? For example, if the JSON file is bad, we should return an error at once instead of reading until the end of the file.
Errors other than CsvError and JsonError are now returned directly.
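A minimal sketch of that distinction, assuming the inference loop receives an arrow ArrowError; the helper name is hypothetical:

use arrow_schema::ArrowError;

// CsvError / JsonError can simply mean the sampled chunk ends in the middle
// of a row, so they are candidates for "read more data and retry"; any other
// error (IO, schema, ...) is returned to the caller immediately.
fn is_retryable(err: &ArrowError) -> bool {
    matches!(err, ArrowError::CsvError(_) | ArrowError::JsonError(_))
}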
I checked the code of infer_json_schema:
pub fn infer_json_schema<R: BufRead>(
reader: R,
max_read_records: Option<usize>,
) -> Result<(Schema, usize), ArrowError> {
let mut values = ValueIter::new(reader, max_read_records);
let schema = infer_json_schema_from_iterator(&mut values)?;
Ok((schema, values.record_count))
}
We should:
- first try a real read of up to max_read_records rows using arrow; not having enough data may lead to an error because the chunk ends with part of a row, so read more data, and if the error stays the same, return a bad-file error (see the sketch below);
- then use infer_json_schema_from_iterator; if the error is a type conflict, return it directly.
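A rough sketch of that flow, assuming a hypothetical read_prefix helper that returns the first n bytes of the file; this is not the PR's actual code, only an illustration of the retry-then-fail idea:

use arrow_json::reader::infer_json_schema;
use arrow_schema::{ArrowError, Schema};

// Infer on a prefix of the file and grow the prefix when parsing fails,
// since a JsonError may only mean the prefix ends with part of a row.
// If the whole file has been read and the error persists, return it.
fn infer_with_growing_prefix(
    read_prefix: impl Fn(usize) -> Vec<u8>,
    file_size: usize,
    max_records: Option<usize>,
) -> Result<Schema, ArrowError> {
    let mut prefix_len = 64 * 1024;
    loop {
        let bytes = read_prefix(prefix_len.min(file_size));
        // &[u8] implements BufRead, so the slice can be passed directly.
        match infer_json_schema(bytes.as_slice(), max_records) {
            Ok((schema, _)) => return Ok(schema),
            // Possibly just a truncated last row: retry with a larger prefix
            // as long as there is unread data left in the file.
            Err(ArrowError::JsonError(_)) if prefix_len < file_size => prefix_len *= 2,
            // Full file already read, or a type-conflict / other error:
            // return it directly instead of silently scanning to the end.
            Err(err) => return Err(err),
        }
    }
}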
…th when max_records is present
Force-pushed 18bf009 to c307c9c
Force-pushed c307c9c to c66aae7
src/query/service/src/table_functions/infer_schema/separator.rs (outdated, resolved)
Force-pushed 65abd3e to fc6ce4b
bytes.extend(batch.data);

if bytes.len() > MAX_SINGLE_FILE_BYTES {
    return Err(ErrorCode::InvalidArgument(format!(
If max_records is not used, can we recommend that the user set it?
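One way to act on that suggestion, sketched under the assumption that the limit and parameter name match the summary below (max_records_per_file, 100 MB); the message wording is illustrative, not the PR's final text:

use databend_common_exception::{ErrorCode, Result};

const MAX_SINGLE_FILE_BYTES: usize = 100 * 1024 * 1024; // 100 MB, per the PR summary

// Sketch: when the buffered bytes exceed the per-file limit and max_records
// was not provided, point the user at the parameter instead of only failing.
fn check_buffered_size(path: &str, len: usize, max_records: Option<usize>) -> Result<()> {
    if len > MAX_SINGLE_FILE_BYTES {
        let hint = if max_records.is_none() {
            " Consider setting max_records_per_file to infer from a sample of the file."
        } else {
            ""
        };
        return Err(ErrorCode::InvalidArgument(format!(
            "file {path} exceeds the {MAX_SINGLE_FILE_BYTES}-byte limit for schema inference.{hint}"
        )));
    }
    Ok(())
}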
I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/
Summary
close: #17943
- infer_schema currently supports only some file types; this PR adds CSV and NDJSON support
- add max_records_per_file for infer_schema, used to read only the first few rows of each file when inferring the schema
- a single file may be at most 100 MB
Example: test.csv is 150 MB, which exceeds the 100 MB single-file limit.
ref: https://docs.snowflake.com/en/sql-reference/functions/infer_schema
Tests
Type of change
This change is