- Notifications
You must be signed in to change notification settings - Fork 1.9k
perf: optimize octet_length for string arrays#19581
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR optimizes the octet_length function for string arrays by replacing the generic Arrow length kernel with specialized implementations that directly compute byte lengths using Arrow's concrete string array APIs.
Key changes:
- Removed dependency on Arrow's generic
lengthkernel for array inputs - Added specialized manual loop implementations for StringArray, LargeStringArray, and StringViewArray
- Used
value_length()method for StringArray/LargeStringArray andvalue().len()for StringViewArray
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
Brijesh-Thakkar commented Dec 31, 2025
@andygrove Please review this PR |
Brijesh-Thakkar commented Dec 31, 2025
@andygrove Hi Andy, I’ve addressed the CI failures (rustfmt + clippy warnings) and pushed the fixes. |
Jefffrey commented Jan 2, 2026
I'm curious about this claim; do you have benchmarks to show a speedup here? |
Brijesh-Thakkar commented Jan 2, 2026
I don’t have concrete benchmarks yet — the motivation here was based on reducing obvious overhead (type dispatch + generic kernel) and aligning with patterns used in other string functions in the codebase that operate directly on concrete string arrays. Happy to add a small micro-benchmark comparing the previous implementation vs this one for StringArray / StringViewArray if that would be useful, or adjust the PR if you’d prefer to keep the generic kernel without measured data. |
Jefffrey commented Jan 2, 2026
Performance improvement PRs should come with benchmark results (unless the improvement is clearly obvious in code). Would you be able to provide some benchmark results to see if this PR gives us a performance benefit? |
Brijesh-Thakkar commented Jan 2, 2026
Yes, that makes sense — I’ll add a small benchmark comparing the previous implementation and this one for StringArray and StringViewArray, and update the PR with the results. Thanks for the guidance. |
08b9a3c to ae3ebcbCompareBrijesh-Thakkar commented Jan 4, 2026
@Jefffrey How can I run benchmarks locally?? |
Which issue does this PR close?
Rationale for this change
The
octet_lengthscalar function showed significant performance degradation inSpark workloads when executed via Comet, as reported in the Comet performance EPIC.
The existing implementation relied on the generic Arrow
lengthkernel for arrayinputs, which introduces unnecessary overhead in vectorized execution. Since
octet_lengthsemantics require computing the number of bytes in UTF-8 strings,this can be implemented more efficiently using Arrow’s concrete string array APIs.
Optimizing this function in DataFusion improves performance for downstream projects
such as Comet and Spark without changing behavior or semantics.
What changes are included in this PR?
lengthkernel for array inputs inoctet_lengthStringArrayLargeStringArrayStringViewArrayvalue_length, avoiding unnecessaryindirection and overhead
Are these changes tested?
Yes.
octet_lengthwere executed and pass successfullyoctet_lengthalso passacross scalar and array inputs, including UTF-8 and null handling
Are there any user-facing changes?
No.
This change is purely a performance optimization and does not affect: