fix: allow reading pyarrow timestamp as iceberg timestamptz #2333
Rationale for this change
This PR fixes reading a pyarrow timestamp as an Iceberg timestamptz type. It mirrors the pyarrow logic for dealing with pyarrow timestamp types here.
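Two changes were made to `ArrowProjectionVisitor._cast_if_needed`:
- Handle timestamps before the promotion step. Otherwise it will try to `promote()` timestamp to timestamptz and fail.
- Allow the cast when the pyarrow value has a `None` timezone. This is allowed because we gate on the target type having a "UTC" timezone. It mirrors the Java logic for reading with a default UTC timezone (1, 2).

Context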
I ran into an interesting edge case while testing metadata virtualization between delta and iceberg.
Delta has both TIMESTAMP and TIMESTAMP_NTZ data types. TIMESTAMP has a timezone while TIMESTAMP_NTZ has no timezone.
Iceberg, in turn, has timestamp and timestamptz: timestamp has no timezone and timestamptz has a timezone.
So Delta's TIMESTAMP -> Iceberg timestamptz and Delta's TIMESTAMP_NTZ -> Iceberg timestamp.
Regardless of Delta or Iceberg, the parquet file stores the timestamp without the timezone information.
So I end up with a parquet file with a timestamp column and an Iceberg table with a timestamptz column, and pyiceberg cannot read this table.
It's hard to recreate the scenario, but I did trace it to the `_to_requested_schema` function. I added a unit test for this case.

The issue is that `ArrowProjectionVisitor._cast_if_needed` will try to promote `timestamp` to `timestamptz`, and this is not a valid promotion.

iceberg-python/pyiceberg/io/pyarrow.py
Lines 1779 to 1782 in 640c592

The `elif` case below that one can handle this case:

iceberg-python/pyiceberg/io/pyarrow.py
Lines 1800 to 1806 in 640c592
So maybe we just need to switch the order of execution...
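As a rough illustration of why the order matters, the sketch below uses hypothetical stand-ins rather than the real pyiceberg code: `TimestampLike`, `promote`, and `cast_if_needed` are names invented here, and the real `_cast_if_needed` has more branches. It only models the idea that a timestamp branch, gated on a UTC target, runs before the generic promotion step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimestampLike:
    """Hypothetical stand-in for an Arrow timestamp type: unit plus optional tz."""

    unit: str
    tz: Optional[str] = None


def promote(source: TimestampLike, target: TimestampLike) -> str:
    # Stand-in for the generic promotion step, which rejects
    # timestamp -> timestamptz because it is not a valid promotion.
    raise ValueError(f"Cannot promote {source} to {target}")


def cast_if_needed(source: TimestampLike, target: TimestampLike) -> str:
    if source == target:
        return "unchanged"
    # Timestamp handling runs first. The cast is only allowed when the
    # target is UTC and the source is either naive (None) or already UTC.
    if target.tz == "UTC" and source.tz in (None, "UTC") and source.unit == target.unit:
        return "cast"
    # Everything else falls through to promotion, which may fail.
    return promote(source, target)


print(cast_if_needed(TimestampLike("us"), TimestampLike("us", "UTC")))
```

With the branches in the opposite order, the naive-to-UTC case in this sketch would reach `promote` and raise instead of being cast.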
This was also an interesting read: https://arrow.apache.org/docs/python/timestamps.html
Are these changes tested?
Are there any user-facing changes?