[arrow-avro] RecordEncoder Bugs #8934

@jecsand838

Description

Describe the bug

There appear to be two issues in the arrow-avro RecordEncoder:

  1. Nullable struct with non-nullable child field + row-wise sliced encoding

    When row-wise encoding a RecordBatch with a nullable Struct field containing a non-nullable child field, the Avro writer fails with:

    Invalid argument error: Avro site '{field}' is non-nullable, but array contains nulls
    

    This happens even though the struct value is null at that row (i.e., the parent struct is null, so the child field’s value should be ignored).

  2. Dense UnionArray with non-zero & non-consecutive type ids

    When encoding a dense UnionArray whose UnionFields use non-consecutive type IDs (e.g. 2 and 5), the Avro writer fails with:

    Error: SchemaError("Binding and field mismatch")
    

    The same data layout should be valid per Arrow’s semantics for dense unions, where type ids are not required to be contiguous or start at 0, as long as they are listed in UnionFields.
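For the first issue, the intended rule can be sketched without any arrow dependency: a null slot in a non-nullable child column is only a violation at rows where the parent struct itself is valid. The function and names below are illustrative, not arrow-avro APIs.

```rust
// Sketch (std only) of the nullability check the encoder should apply:
// a null in a non-nullable child column is only an error at rows where
// the parent struct itself is valid. Names are illustrative.
fn has_invalid_nulls(parent_valid: &[bool], child_valid: &[bool]) -> bool {
    parent_valid
        .iter()
        .zip(child_valid)
        .any(|(&p, &c)| p && !c) // only flag child nulls under a valid parent
}

fn main() {
    // Row where the parent struct is null: the child's null slot is ignored.
    assert!(!has_invalid_nulls(&[false, false], &[false, false]));
    // Row where the parent is valid but the non-nullable child is null: error.
    assert!(has_invalid_nulls(&[true], &[false]));
}
```

A column-level `null_count()` check on the child array cannot express this; the mask has to be intersected with the parent's null buffer per row.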

To Reproduce

  1. Reproducing Nullable Child Field Encoder Bug

Running this test in arrow-avro/src/writer/mod.rs:

    #[test]
    fn test_nullable_struct_with_nonnullable_field_sliced_encoding() {
        use arrow_array::{ArrayRef, Int32Array, RecordBatch, StringArray, StructArray};
        use arrow_buffer::NullBuffer;
        use arrow_schema::{DataType, Field, Fields, Schema};
        use std::sync::Arc;
        let inner_fields = Fields::from(vec![
            Field::new("id", DataType::Int32, false), // non-nullable
            Field::new("name", DataType::Utf8, true), // nullable
        ]);
        let inner_struct_type = DataType::Struct(inner_fields.clone());
        let schema = Schema::new(vec![
            Field::new("before", inner_struct_type.clone(), true), // nullable struct
            Field::new("after", inner_struct_type.clone(), true),  // nullable struct
            Field::new("op", DataType::Utf8, false),               // non-nullable
        ]);
        let before_ids = Int32Array::from(vec![None, None]);
        let before_names = StringArray::from(vec![None::<&str>, None]);
        let before_struct = StructArray::new(
            inner_fields.clone(),
            vec![
                Arc::new(before_ids) as ArrayRef,
                Arc::new(before_names) as ArrayRef,
            ],
            Some(NullBuffer::from(vec![false, false])),
        );
        let after_ids = Int32Array::from(vec![1, 2]); // non-nullable, no nulls
        let after_names = StringArray::from(vec![Some("Alice"), Some("Bob")]);
        let after_struct = StructArray::new(
            inner_fields.clone(),
            vec![
                Arc::new(after_ids) as ArrayRef,
                Arc::new(after_names) as ArrayRef,
            ],
            Some(NullBuffer::from(vec![true, true])),
        );
        let op_col = StringArray::from(vec!["r", "r"]);
        let batch = RecordBatch::try_new(
            Arc::new(schema.clone()),
            vec![
                Arc::new(before_struct) as ArrayRef,
                Arc::new(after_struct) as ArrayRef,
                Arc::new(op_col) as ArrayRef,
            ],
        )
        .expect("failed to create test batch");
        let mut sink = Vec::new();
        let mut writer = WriterBuilder::new(schema)
            .with_fingerprint_strategy(FingerprintStrategy::Id(1))
            .build::<_, AvroSoeFormat>(&mut sink)
            .expect("failed to create writer");
        for row_idx in 0..batch.num_rows() {
            let single_row = batch.slice(row_idx, 1);
            let after_col = single_row.column(1);
            assert_eq!(
                after_col.null_count(),
                0,
                "after column should have no nulls in sliced row"
            );
            writer
                .write(&single_row)
                .unwrap_or_else(|e| panic!("Failed to encode row {row_idx}: {e}"));
        }
        writer.finish().expect("failed to finish writer");
        assert!(!sink.is_empty(), "encoded output should not be empty");
    }

Results in this error:

thread 'writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding' (769056) panicked at arrow-avro/src/writer/mod.rs:553:37:
Failed to encode row 0: Invalid argument error: Avro site 'id' is non-nullable, but array contains nulls
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
   2: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
             at ./src/writer/mod.rs:553:37
   3: core::result::Result<T,E>::unwrap_or_else
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/result.rs:1615:23
   4: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding
             at ./src/writer/mod.rs:553:18
   5: arrow_avro::writer::tests::test_nullable_struct_with_nonnullable_field_sliced_encoding::{{closure}}
             at ./src/writer/mod.rs:493:69
   6: core::ops::function::FnOnce::call_once
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   7: core::ops::function::FnOnce::call_once
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

  2. Reproducing Union Type Id Encoder Bug

Running this test in arrow-avro/src/writer/mod.rs:

    #[test]
    fn test_union_nonzero_type_ids() -> Result<(), ArrowError> {
        use arrow_array::UnionArray;
        use arrow_buffer::Buffer;
        use arrow_schema::{UnionFields, UnionMode};
        let union_fields = UnionFields::new(
            vec![2, 5],
            vec![
                Field::new("v_str", DataType::Utf8, true),
                Field::new("v_int", DataType::Int32, true),
            ],
        );
        let strings = StringArray::from(vec!["hello", "world"]);
        let ints = Int32Array::from(vec![10, 20, 30]);
        let type_ids = Buffer::from_slice_ref([2_i8, 5, 5, 2, 5]);
        let offsets = Buffer::from_slice_ref([0_i32, 0, 1, 1, 2]);
        let union_array = UnionArray::try_new(
            union_fields.clone(),
            type_ids.into(),
            Some(offsets.into()),
            vec![Arc::new(strings) as ArrayRef, Arc::new(ints) as ArrayRef],
        )?;
        let schema = Schema::new(vec![Field::new(
            "union_col",
            DataType::Union(union_fields, UnionMode::Dense),
            false,
        )]);
        let batch = RecordBatch::try_new(
            Arc::new(schema.clone()),
            vec![Arc::new(union_array) as ArrayRef],
        )?;
        let mut writer = AvroWriter::new(Vec::<u8>::new(), schema.clone())?;
        assert!(writer.write(&batch).is_ok(), "Expected no error from writing");
        writer.finish()?;
        Ok(())
    }

Results in this error:

thread 'writer::tests::test_union_nonzero_type_ids' (831962) panicked at arrow-avro/src/writer/mod.rs:707:9:
Expected no error from writing
stack backtrace:
   0: __rustc::rust_begin_unwind
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/std/src/panicking.rs:698:5
   1: core::panicking::panic_fmt
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/panicking.rs:75:14
   2: arrow_avro::writer::tests::test_union_nonzero_type_ids
             at ./src/writer/mod.rs:707:9
   3: arrow_avro::writer::tests::test_union_nonzero_type_ids::{{closure}}
             at ./src/writer/mod.rs:676:41
   4: core::ops::function::FnOnce::call_once
             at /Users/connorsanders/.rustup/toolchains/1.91-aarch64-apple-darwin/lib/rustlib/src/rust/library/core/src/ops/function.rs:250:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/ed61e7d7e242494fb7057f2657300d9e77bb4fcb/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Expected behavior

  1. Nullable struct / sliced encoding
    • Encoding each RecordBatch slice of length 1 should succeed without error.
    • The fact that the parent Struct value is null at a given row should mean its non-nullable child fields are not validated/encoded for that row.
    • The writer should complete successfully and produce non-empty Avro output for each row.
  2. Dense union with non-zero type ids
    • A dense UnionArray whose UnionFields use non-zero and non-consecutive type ids (e.g. 2 and 5) should be encoded without error.
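For context on the second expectation: per the Avro specification, a union value is encoded as a long (zigzag varint) giving the zero-based branch position within the Avro union schema, followed by the value itself. So Arrow type id 5 (the second declared branch) should encode as Avro branch 1, independent of the numeric type id. A minimal, std-only sketch of that varint encoding (illustrative, not the arrow-avro implementation):

```rust
// Encode an i64 as an Avro long: zigzag transform, then LEB128 varint.
// Per the Avro spec, a union value is prefixed by this encoding of the
// zero-based branch position in the union schema.
fn zigzag_varint(n: i64) -> Vec<u8> {
    let mut z = ((n << 1) ^ (n >> 63)) as u64; // zigzag: maps sign into low bit
    let mut out = Vec::new();
    loop {
        let byte = (z & 0x7f) as u8;
        z >>= 7;
        if z == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80); // continuation bit for remaining groups
    }
    out
}

fn main() {
    // Branch positions, not Arrow type ids, go on the wire:
    assert_eq!(zigzag_varint(0), vec![0x00]); // first branch (type id 2)
    assert_eq!(zigzag_varint(1), vec![0x02]); // second branch (type id 5)
}
```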

Additional context

  • Environment:
    • Rust toolchain: 1.91.0 aarch64-apple-darwin
  • Both issues arise when using the new arrow-avro writer APIs (WriterBuilder with AvroSoeFormat and AvroWriter).
  • For Bug 1, it looks like the encoder is validating the nullability of a non-nullable child field without masking by the parent struct's null bitmap (or doing a column-level check that doesn't consider that the parent struct is null).
  • For Bug 2, the union encoder/decoder expects a 0-based, contiguous mapping between union type IDs and underlying child bindings, and fails when the type IDs are 2 and 5 even though the UnionFields and type_ids buffer are consistent with Arrow's union semantics.
  • Both bugs surfaced while testing row-wise / streaming-style encodes, which rely heavily on correct handling of nested structures and unions in the encoder.
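One plausible shape for a fix to Bug 2 is to resolve bindings through a map from declared type ids to child positions, rather than indexing children by the type id directly. A std-only sketch under that assumption (names are hypothetical, not arrow-avro internals):

```rust
use std::collections::HashMap;

// Build a lookup from each declared union type id to its position in the
// children list, mirroring the declaration order in UnionFields. This
// tolerates non-contiguous, non-zero-based ids such as [2, 5].
fn build_binding_map(declared_type_ids: &[i8]) -> HashMap<i8, usize> {
    declared_type_ids
        .iter()
        .enumerate()
        .map(|(child_idx, &tid)| (tid, child_idx))
        .collect()
}

fn main() {
    // Declared ids 2 and 5, as in the failing test.
    let map = build_binding_map(&[2, 5]);
    // The per-row type_ids buffer from the test: [2, 5, 5, 2, 5].
    let row_type_ids = [2_i8, 5, 5, 2, 5];
    let child_indices: Vec<usize> = row_type_ids.iter().map(|t| map[t]).collect();
    assert_eq!(child_indices, vec![0, 1, 1, 0, 1]);
}
```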
