Describe the bug
There appear to be two issues in the `arrow-avro` `RecordEncoder`:
- Nullable struct with non-nullable child field + row-wise sliced encoding

  When row-wise encoding a `RecordBatch` with a nullable `Struct` field containing a non-nullable child field, the Avro writer fails with:

  ```
  Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls
  ```

  This happens even though the struct value is null at that row (i.e., the parent struct is null, so the child field's value should be ignored).
- Dense `UnionArray` with non-zero & non-consecutive type ids

  When encoding a dense `UnionArray` whose `UnionFields` use non-consecutive type IDs (e.g. `2` and `5`), the Avro `Writer` fails with:

  ```
  Error: SchemaError("Binding and field mismatch")
  ```

  The same data layout should be valid per Arrow's semantics for dense unions, where type ids are not required to be contiguous or start at 0, as long as they are listed in `UnionFields`.
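For Bug 1, the masking the encoder seems to skip can be sketched without any arrow types (a minimal illustration with made-up names, not the crate's actual code): a null in a non-nullable child slot is only a real violation at rows where the parent struct itself is valid.

```rust
/// Returns true only if a non-nullable child slot is null at a row where
/// the parent struct is itself valid. Rows where the parent is null must
/// not count against the child's non-nullability.
fn has_invalid_nulls(parent_valid: &[bool], child_valid: &[bool]) -> bool {
    parent_valid
        .iter()
        .zip(child_valid)
        .any(|(&parent, &child)| parent && !child)
}

fn main() {
    // Parent null, child slot null -> not a violation (the repro's case).
    assert!(!has_invalid_nulls(&[false, false], &[false, false]));
    // Parent valid, non-nullable child null -> genuine violation.
    assert!(has_invalid_nulls(&[true], &[false]));
    println!("ok");
}
```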
To Reproduce
- Reproducing the Nullable Child Field Encoder Bug
Running this test in `arrow-avro/src/writer/mod.rs`:

```rust
#[test]
fn test_nullable_struct_with_nonnullable_field_sliced_encoding() {
    use arrow_array::{ArrayRef, Int32Array, StringArray, StructArray};
    use arrow_buffer::NullBuffer;
    use arrow_schema::{DataType, Field, Fields, Schema};
    use std::sync::Arc;

    let inner_fields = Fields::from(vec![
        Field::new("id", DataType::Int32, false), // non-nullable
        Field::new("name", DataType::Utf8, true), // nullable
    ]);
    let inner_struct_type = DataType::Struct(inner_fields.clone());
    let schema = Schema::new(vec![
        Field::new("before", inner_struct_type.clone(), true), // nullable struct
        Field::new("after", inner_struct_type.clone(), true),  // nullable struct
        Field::new("op", DataType::Utf8, false),               // non-nullable
    ]);
    let before_ids = Int32Array::from(vec![None, None]);
    let before_names = StringArray::from(vec![None::<&str>, None]);
    let before_struct = StructArray::new(
        inner_fields.clone(),
        vec![
            Arc::new(before_ids) as ArrayRef,
            Arc::new(before_names) as ArrayRef,
        ],
        Some(NullBuffer::from(vec![false, false])),
    );
    let after_ids = Int32Array::from(vec![1, 2]); // non-nullable, no nulls
    let after_names = StringArray::from(vec![Some("Alice"), Some("Bob")]);
    let after_struct = StructArray::new(
        inner_fields.clone(),
        vec![
            Arc::new(after_ids) as ArrayRef,
            Arc::new(after_names) as ArrayRef,
        ],
        Some(NullBuffer::from(vec![true, true])),
    );
    let op_col = StringArray::from(vec!["r", "r"]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![
            Arc::new(before_struct) as ArrayRef,
            Arc::new(after_struct) as ArrayRef,
            Arc::new(op_col) as ArrayRef,
        ],
    )
    .expect("failed to create test batch");

    let mut sink = Vec::new();
    let mut writer = WriterBuilder::new(schema)
        .with_fingerprint_strategy(FingerprintStrategy::Id(1))
        .build::<_, AvroSoeFormat>(&mut sink)
        .expect("failed to create writer");
    for row_idx in 0..batch.num_rows() {
        let single_row = batch.slice(row_idx, 1);
        let after_col = single_row.column(1);
        assert_eq!(
            after_col.null_count(),
            0,
            "after column should have no nulls in sliced row"
        );
        writer
            .write(&single_row)
            .unwrap_or_else(|e| panic!("Failed to encode row {row_idx}: {e}"));
    }
    writer.finish().expect("failed to finish writer");
    assert!(!sink.is_empty(), "encoded output should not be empty");
}
```

Results in this error:

```
thread 'writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding' (769056) panicked at arrow-avro/src/writer/mod.rs:553:37:
Failed to encode row 0: Invalid argument error: Avro site 'id' is non-nullable, but array contains nulls
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
   2: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
             at ./src/writer/mod.rs:553:37
   3: core::result::Result<T,E>::unwrap_or_else
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1615:23
   4: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding
             at ./src/writer/mod.rs:553:18
   5: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
             at ./src/writer/mod.rs:493:69
   6: core::ops::function::FnOnce::call_once
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
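One plausible ingredient in the failure above: `RecordBatch::slice` is zero-copy, so a slice shares the parent's buffers and is described by an offset and length. A plain-Rust sketch (hypothetical helper, no arrow types) of why any per-field validity check has to be read through that window rather than over the whole backing buffer:

```rust
/// Count nulls only inside the sliced window; scanning the full backing
/// validity buffer can report nulls that the slice never covers.
fn nulls_in_window(validity: &[bool], offset: usize, len: usize) -> usize {
    validity[offset..offset + len].iter().filter(|&&v| !v).count()
}

fn main() {
    let validity = [false, true, true]; // row 0 is null in the full array
    assert_eq!(nulls_in_window(&validity, 1, 2), 0); // the slice [1, 3) is clean
    assert_eq!(nulls_in_window(&validity, 0, 3), 1); // the full array is not
    println!("ok");
}
```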
- Reproducing the Union Type Id Encoder Bug
Running this test in `arrow-avro/src/writer/mod.rs` (note the original draft called `writer.finish()` twice; the redundant call is dropped here):

```rust
#[test]
fn test_union_nonzero_type_ids() -> Result<(), ArrowError> {
    use arrow_array::UnionArray;
    use arrow_buffer::Buffer;
    use arrow_schema::UnionFields;

    let union_fields = UnionFields::new(
        vec![2, 5],
        vec![
            Field::new("v_str", DataType::Utf8, true),
            Field::new("v_int", DataType::Int32, true),
        ],
    );
    let strings = StringArray::from(vec!["hello", "world"]);
    let ints = Int32Array::from(vec![10, 20, 30]);
    let type_ids = Buffer::from_slice_ref([2_i8, 5, 5, 2, 5]);
    let offsets = Buffer::from_slice_ref([0_i32, 0, 1, 1, 2]);
    let union_array = UnionArray::try_new(
        union_fields.clone(),
        type_ids.into(),
        Some(offsets.into()),
        vec![Arc::new(strings) as ArrayRef, Arc::new(ints) as ArrayRef],
    )?;
    let schema = Schema::new(vec![Field::new(
        "union_col",
        DataType::Union(union_fields, UnionMode::Dense),
        false,
    )]);
    let batch = RecordBatch::try_new(
        Arc::new(schema.clone()),
        vec![Arc::new(union_array) as ArrayRef],
    )?;

    let mut writer = AvroWriter::new(Vec::<u8>::new(), schema.clone())?;
    assert!(writer.write(&batch).is_ok(), "Expected no error from writing");
    assert!(
        writer.finish().is_ok(),
        "Expected no error from finishing writer"
    );
    Ok(())
}
```

Results in this error:

```
thread 'writer::tests::test_union_nonzero_type_ids' (831962) panicked at arrow-avro/src/writer/mod.rs:707:9:
Expected no error from writing
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
   2: arrow_avro::writer::tests::test_union_nonzero_type_ids
             at ./src/writer/mod.rs:707:9
   3: arrow_avro::writer::tests::test_union_nonzero_type_ids::{{closure}}
             at ./src/writer/mod.rs:676:41
   4: core::ops::function::FnOnce::call_once
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
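The binding the union encoder appears to assume (child index == type id) can be contrasted with a lookup built from the declared type ids, sketched here in plain Rust with illustrative names (not the crate's internals):

```rust
use std::collections::HashMap;

/// Map each declared type id to the index of its child array, in declaration
/// order. With ids [2, 5], a row tagged 5 binds to the second child (index 1),
/// not to children[5].
fn build_type_id_map(declared_type_ids: &[i8]) -> HashMap<i8, usize> {
    declared_type_ids
        .iter()
        .enumerate()
        .map(|(child_idx, &type_id)| (type_id, child_idx))
        .collect()
}

fn main() {
    let map = build_type_id_map(&[2, 5]);
    assert_eq!(map[&2], 0);
    assert_eq!(map[&5], 1);
    println!("ok");
}
```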
Expected behavior
- Nullable struct / sliced encoding
  - Encoding each `RecordBatch` slice of length 1 should succeed without error.
  - When the parent `Struct` value is null at a given row, its non-nullable child fields should not be validated/encoded for that row.
  - The writer should complete successfully and produce non-empty Avro output for each row.
- Dense union with non-zero type ids
  - A dense `UnionArray` whose `UnionFields` use non-zero and non-consecutive type ids (e.g. `2` and `5`) should be encoded without error.
Additional context
- Environment:
  - Rust toolchain: `1.91.0` aarch64-apple-darwin
- Both issues arise when using the new `arrow-avro` writer APIs (`WriterBuilder` with `AvroSoeFormat`, and `AvroWriter`).
- For Bug 1, it looks like the encoder validates the nullability of a non-nullable child field without masking by the parent struct's null bitmap (or performs a column-level check that doesn't consider that the parent struct is null).
- For Bug 2, the union encoder/decoder expects a 0-based, contiguous mapping between union type IDs and underlying child bindings, and fails when the type IDs are `2` and `5`, even though the `UnionFields` and `type_ids` buffer are consistent with Arrow's union semantics.
- Both bugs surfaced while testing row-wise / streaming-style encodes, which rely heavily on correct handling of nested structures and unions in the encoder.
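For context on why the mapping matters on the Avro side: the Avro spec serializes a union value as the zero-based position of its branch within the declared union schema (a zigzag-encoded long), followed by the value itself, so the writer ultimately needs Arrow type id → declared branch position, never the raw type id. A minimal sketch of the position encoding:

```rust
/// Zigzag-encode a signed value the way Avro's long encoding does before
/// varint packing: small magnitudes map to small unsigned values.
fn zigzag(n: i64) -> u64 {
    ((n << 1) ^ (n >> 63)) as u64
}

fn main() {
    // Arrow type id 5 is the second declared branch -> union position 1,
    // which zigzag-encodes to 2 before varint packing.
    assert_eq!(zigzag(1), 2);
    assert_eq!(zigzag(0), 0);
    assert_eq!(zigzag(-1), 1);
    println!("ok");
}
```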