perf: optimize octet_length for string arrays #19581

Brijesh-Thakkar · 2025-12-31T13:39:39Z

Which issue does this PR close?

Addresses apache/datafusion-comet#2986 ([EPIC] Optimize performance for slow expressions)

Rationale for this change

The octet_length scalar function showed significant performance degradation in
Spark workloads when executed via Comet, as reported in the Comet performance EPIC.

The existing implementation relied on the generic Arrow length kernel for array
inputs, which introduces unnecessary overhead in vectorized execution. Since
octet_length semantics require computing the number of bytes in UTF-8 strings,
this can be implemented more efficiently using Arrow’s concrete string array APIs.

Optimizing this function in DataFusion improves performance for downstream projects
such as Comet and Spark without changing behavior or semantics.

What changes are included in this PR?

- Replaced the use of the generic Arrow length kernel for array inputs in octet_length
- Added a specialized implementation for:
  - StringArray
  - LargeStringArray
  - StringViewArray
- Computed byte lengths directly using value_length, avoiding unnecessary indirection and overhead
- Left the scalar execution path unchanged
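
For readers unfamiliar with the layout, here is a minimal std-only sketch of why a per-element byte length can be read straight from the offsets of a variable-length string array. It is a simplified model, not the code in this PR and not the Arrow API: `StringColumn`, `from_options`, and the `octet_length` method are hypothetical names, and only `value_length` mirrors the name of the real accessor.

```rust
/// Simplified model of a variable-length UTF-8 array (hypothetical type).
/// A real array also owns a contiguous byte buffer; it is omitted here
/// because computing lengths never has to read it.
struct StringColumn {
    /// `len + 1` offsets into the (omitted) byte buffer.
    offsets: Vec<i32>,
    /// `true` where the slot holds a value, `false` where it is NULL.
    valid: Vec<bool>,
}

impl StringColumn {
    fn from_options(values: &[Option<&str>]) -> Self {
        let mut offsets = vec![0i32];
        let mut valid = Vec::with_capacity(values.len());
        let mut end = 0i32;
        for v in values {
            if let Some(s) = v {
                // `str::len` is the UTF-8 byte length.
                end += s.len() as i32;
            }
            // NULL slots repeat the previous offset (zero-length value).
            offsets.push(end);
            valid.push(v.is_some());
        }
        Self { offsets, valid }
    }

    /// Byte length of slot `i`: the difference of two adjacent offsets.
    fn value_length(&self, i: usize) -> i32 {
        self.offsets[i + 1] - self.offsets[i]
    }

    /// Byte length per slot, keeping NULL as `None`.
    fn octet_length(&self) -> Vec<Option<i32>> {
        (0..self.valid.len())
            .map(|i| self.valid[i].then(|| self.value_length(i)))
            .collect()
    }
}

fn main() {
    let col = StringColumn::from_options(&[Some("abc"), None, Some("héllo"), Some("")]);
    println!("{:?}", col.octet_length());
}
```

Whether a hand-written loop of this shape beats the generic kernel in practice is exactly what a benchmark would need to show; the sketch only illustrates the data access pattern.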

Are these changes tested?

Yes.

- Existing unit tests for octet_length were executed and pass successfully
- Core integration tests exercising octet_length also pass
- No new tests were required, as existing coverage already validates correctness across scalar and array inputs, including UTF-8 and null handling
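
As an illustration of what "UTF-8 and null handling" means for this function, the following std-only sketch contrasts byte length with character count and shows NULL propagating. The helper names are hypothetical, and `usize` is used for simplicity rather than the function's actual SQL return type.

```rust
/// Byte length of an optional string; `None` (SQL NULL) stays `None`.
fn octet_length(value: Option<&str>) -> Option<usize> {
    value.map(str::len)
}

/// Number of Unicode scalar values, shown only for contrast.
fn char_length(value: Option<&str>) -> Option<usize> {
    value.map(|s| s.chars().count())
}

fn main() {
    for v in [Some("abc"), Some("héllo"), Some("日本"), Some(""), None] {
        println!("{:?}: bytes={:?} chars={:?}", v, octet_length(v), char_length(v));
    }
}
```

Multi-byte inputs such as "héllo" (6 bytes, 5 characters) are the cases where a byte-length implementation and a character-length implementation diverge, so they make useful regression inputs.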

Are there any user-facing changes?

No.

This change is purely a performance optimization and does not affect:

- SQL syntax
- Function semantics
- Return types
- Error behavior

Copilot

Pull request overview

This PR optimizes the octet_length function for string arrays by replacing the generic Arrow length kernel with specialized implementations that directly compute byte lengths using Arrow's concrete string array APIs.

Key changes:

- Removed dependency on Arrow's generic length kernel for array inputs
- Added specialized manual loop implementations for StringArray, LargeStringArray, and StringViewArray
- Used value_length() method for StringArray/LargeStringArray and value().len() for StringViewArray
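
For context on the StringViewArray case: in the Arrow string-view layout each element is a 16-byte view whose first four bytes hold the value's byte length, with values of at most 12 bytes stored inline in the remaining bytes. The std-only sketch below models just that inline case to show that the length can be read without touching the string data. `make_inline_view` and `view_length` are hypothetical helpers, not arrow-rs functions, and this is not what the PR's code does.

```rust
/// Pack a string of at most 12 bytes into a 16-byte inline view,
/// represented as a little-endian `u128`. Returns `None` for longer
/// strings, which a real view would store out of line.
fn make_inline_view(s: &str) -> Option<u128> {
    let bytes = s.as_bytes();
    if bytes.len() > 12 {
        return None;
    }
    let mut raw = [0u8; 16];
    // Bytes 0..4: length as a little-endian 32-bit integer.
    raw[..4].copy_from_slice(&(bytes.len() as u32).to_le_bytes());
    // Bytes 4..16: the inlined string data, zero-padded.
    raw[4..4 + bytes.len()].copy_from_slice(bytes);
    Some(u128::from_le_bytes(raw))
}

/// The byte length lives in the low 32 bits of the view.
fn view_length(view: u128) -> u32 {
    view as u32
}

fn main() {
    let view = make_inline_view("héllo").expect("fits inline");
    println!("byte length = {}", view_length(view));
}
```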

datafusion/functions/src/string/octet_length.rs

Brijesh-Thakkar · 2025-12-31T14:35:21Z

@andygrove Please review this PR

Brijesh-Thakkar · 2025-12-31T18:15:31Z

@andygrove Hi Andy, I’ve addressed the CI failures (rustfmt + clippy warnings) and pushed the fixes.
This PR optimizes octet_length by avoiding generic kernels for string arrays and using direct length access.
All tests are passing locally. Would appreciate your review when you have time.

Jefffrey · 2026-01-02T02:39:21Z

> The existing implementation relied on the generic Arrow length kernel for array inputs, which introduces unnecessary overhead in vectorized execution. Since octet_length semantics require computing the number of bytes in UTF-8 strings, this can be implemented more efficiently using Arrow’s concrete string array APIs.

I'm curious about this claim; do you have benchmarks to show a speedup here?

Brijesh-Thakkar · 2026-01-02T07:02:12Z

I don’t have concrete benchmarks yet — the motivation here was based on reducing obvious overhead (type dispatch + generic kernel) and aligning with patterns used in other string functions in the codebase that operate directly on concrete string arrays.

Happy to add a small micro-benchmark comparing the previous implementation vs this one for StringArray / StringViewArray if that would be useful, or adjust the PR if you’d prefer to keep the generic kernel without measured data.

Jefffrey · 2026-01-02T07:12:39Z

Performance improvement PRs should come with benchmark results (unless the improvement is clearly obvious in code). Would you be able to provide some benchmark results to see if this PR gives us a performance benefit?

Brijesh-Thakkar · 2026-01-02T07:18:04Z

Yes, that makes sense — I’ll add a small benchmark comparing the previous implementation and this one for StringArray and StringViewArray, and update the PR with the results.

Thanks for the guidance.

Brijesh-Thakkar · 2026-01-04T13:19:51Z

@Jefffrey How can I run benchmarks locally??

Jefffrey · 2026-01-04T14:53:44Z

See some examples of microbenchmarks here: #19551

They can be run with the cargo bench command (make sure you select the correct benchmark).

perf: optimize octet_length for string arrays
ae3ebcb

Copilot AI review requested due to automatic review settings December 31, 2025 13:39

github-actions bot added the functions (Changes to functions implementation) label Dec 31, 2025

Copilot started reviewing on behalf of Brijesh-Thakkar December 31, 2025 13:40

Brijesh-Thakkar mentioned this pull request Dec 31, 2025
[EPIC] Optimize performance for slow expressions apache/datafusion-comet#2986
Open

Copilot AI reviewed Dec 31, 2025

Jefffrey marked this pull request as draft January 3, 2026 07:57

Brijesh-Thakkar force-pushed the perf-octet-length branch from 08b9a3c to ae3ebcb January 4, 2026 13:13

Merge branch 'main' into perf-octet-length
006d00f

Brijesh-Thakkar closed this Jan 5, 2026
