
Enable LargeListArray support in Parquet reader schema validation #513

@callmepandey

Summary

Follow-up to #502. The data conversion layer now supports LargeListArray (64-bit offsets) via ProjectRecordBatch, but the Parquet reader's schema validation still rejects LARGE_LIST types. Additionally, the reader needs to expose Arrow's list_type property to allow users to request LargeListArray output.

Problem

  1. ValidateParquetSchemaEvolution in parquet_schema_util.cc:177-180 only accepts ::arrow::Type::LIST:
case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST) {
    return {};
  }
  break;
  2. Arrow's Parquet reader defaults to Type::LIST output. Without exposing ArrowReaderProperties::set_list_type(), users cannot request LargeListArray output.

Proposed Solution

1. Update schema validation to accept both list types

case TypeId::kList:
  if (arrow_type->id() == ::arrow::Type::LIST ||
      arrow_type->id() == ::arrow::Type::LARGE_LIST) {
    return {};
  }
  break;
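
To make the intended behavior concrete, here is a minimal self-contained sketch of the relaxed check. The enums and the function name `ValidateListCompat` are hypothetical stand-ins so the snippet compiles without Arrow or Iceberg headers; they are not the real `::arrow::Type` or `TypeId` definitions, and the real function returns a different type.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for ::arrow::Type::type and iceberg::TypeId.
enum class ArrowTypeId { kList, kLargeList, kInt32 };
enum class IcebergTypeId { kList, kInt };

// Returns std::nullopt when the Arrow type is acceptable for the expected
// Iceberg type, or an error message otherwise.
std::optional<std::string> ValidateListCompat(IcebergTypeId expected,
                                              ArrowTypeId actual) {
  switch (expected) {
    case IcebergTypeId::kList:
      // Accept both 32-bit-offset and 64-bit-offset list layouts.
      if (actual == ArrowTypeId::kList || actual == ArrowTypeId::kLargeList) {
        return std::nullopt;
      }
      break;
    case IcebergTypeId::kInt:
      if (actual == ArrowTypeId::kInt32) {
        return std::nullopt;
      }
      break;
  }
  return std::string("incompatible Arrow type for Iceberg field");
}
```

With this shape, a LARGE_LIST column passes validation for an Iceberg list field, while a non-list Arrow type is still rejected.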

2. Add kListType to ReaderProperties

Expose a property to configure the Arrow list type preference.

3. Pass through to Arrow reader

In ParquetReader::Impl::Open(), call arrow_reader_properties.set_list_type() with the configured value.
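
Steps 2 and 3 together could look roughly like the sketch below. Everything here is an assumption for illustration: `ArrowListType`, `FakeArrowReaderProperties`, `ReaderOptionsSketch`, and `MakeArrowReaderProperties` are hypothetical stand-ins, not the actual `ReaderProperties` or `ArrowReaderProperties` classes. Only the `set_list_type()` method name comes from this issue, and its exact parameter type should be checked against the Arrow version in use.

```cpp
// Hypothetical stand-in for the Arrow list type selector.
enum class ArrowListType { kList, kLargeList };

// Hypothetical stand-in for parquet::ArrowReaderProperties. It models only
// the list-type setting discussed in this issue.
class FakeArrowReaderProperties {
 public:
  void set_list_type(ArrowListType type) { list_type_ = type; }
  ArrowListType list_type() const { return list_type_; }

 private:
  // Mirrors the default described above: plain LIST output.
  ArrowListType list_type_ = ArrowListType::kList;
};

// Hypothetical user-facing option, playing the role of kListType.
struct ReaderOptionsSketch {
  ArrowListType list_type = ArrowListType::kList;
};

// Sketch of the pass-through that Open() would perform.
FakeArrowReaderProperties MakeArrowReaderProperties(
    const ReaderOptionsSketch& options) {
  FakeArrowReaderProperties props;
  props.set_list_type(options.list_type);
  return props;
}
```

Keeping LIST as the default means existing callers see no change; only users who opt in get LargeListArray output.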

Why This Is Safe

  1. Iceberg's ListType doesn't distinguish between LIST and LARGE_LIST
  2. The projection layer (ProjectRecordBatch) already handles both via templated ProjectListArrayImpl<>
  3. Both represent the same logical "list" concept, just with different offset sizes
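
The third point can be illustrated with a small sketch in which one template handles both offset widths. `ListLengths` is a hypothetical helper written for this example, and offsets are modeled as plain vectors rather than Arrow buffers; it is not the actual `ProjectListArrayImpl<>` code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Computes the length of each list from an offsets buffer. OffsetT is
// int32_t for a LIST-style layout and int64_t for a LARGE_LIST-style layout.
template <typename OffsetT>
std::vector<std::int64_t> ListLengths(const std::vector<OffsetT>& offsets) {
  std::vector<std::int64_t> lengths;
  for (std::size_t i = 1; i < offsets.size(); ++i) {
    // Element i-1 spans [offsets[i-1], offsets[i]) in the child values.
    lengths.push_back(static_cast<std::int64_t>(offsets[i]) -
                      static_cast<std::int64_t>(offsets[i - 1]));
  }
  return lengths;
}
```

Calling `ListLengths` on `std::vector<std::int32_t>{0, 2, 2, 5}` and on `std::vector<std::int64_t>{0, 2, 2, 5}` describes the same three lists; only the offset width differs.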

Files to Change

  • src/iceberg/parquet/parquet_schema_util.cc - Update ValidateParquetSchemaEvolution
  • src/iceberg/parquet/parquet_reader.cc - Pass list_type to ArrowReaderProperties
  • src/iceberg/reader.h - Add kListType to ReaderProperties
  • src/iceberg/test/parquet_test.cc - Add integration tests

Related
