- Notifications
You must be signed in to change notification settings - Fork 1.7k
AVRO-4134: [C++] Allocate std::vector block by block when decoding#3546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base:main
Are you sure you want to change the base?
AVRO-4134: [C++] Allocate std::vector block by block when decoding #3546
Uh oh!
There was an error while loading. Please reload this page.
Conversation
stephanlachnit commented Nov 17, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
Have you actually tested if this is actually faster? Calling |
philippeVerney commented Nov 17, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
The short and rigorous answer is : no. However, it would be hard to make a clear answer about if it is faster or slower because, as often, it depends... Now, if we try to guess what AVRO users will do (knowing that they can do exactly what they want in the end), I think the current AVRO C++ code encodes an array as a single block which is the "perfect" case for this change. See avro/lang/c++/include/avro/Specific.hh Lines 238 to 249 in 9110c69
e.setItemCount(b.size());Consequently, when using the current AVRO code to encode arrays, you end up with a single block array which will make the "problematic" decoding loop irrelevant. I think that this change is clearly faster for the default usage of the AVRO C++ library. However, it has indeed drawbacks and can be worst for some exotic and/or proprietary and/or future array encodings. |
stephanlachnit commented Nov 17, 2025
Right, given the encoder this then makes sense. Alternative you could iterate twice: s.clear(); std::size_t new_size = 0; for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()){new_size += n} s.reserve(new_size); for (std::size_t n = d.arrayStart(); n != 0; n = d.arrayNext()){for (std::size_t i = 0; i < n; ++i){T t; avro::decode(d, t); s.emplace_back(std::move(t))} }But since I don't know how costly the |
philippeVerney commented Nov 17, 2025
Indeed, this is a more general solution but it would need either to read twice the entire array or to "seekg" (i.e. to set the input position indicator of the input stream) several times which may in both cases impact performance. In both cases, we would need to seekg to the beginning of the array before the second iteration. The change would look more heavy, further from the corresponding JIRA issue and, I don't know on top of my head if it would generally be faster or not. |
RyanSkraba commented Dec 2, 2025
Thanks for this contribution :D I'm not a C++ dev! Can you tell me if this change should be cherry-picked to 1.12.x and/or 1.11.x for a new release, or is it for 1.13.0? @stephanlachnit I see a 👍 -- are you happy with the PR? |
marlowa commented Dec 2, 2025
First, many thanks for putting in the effort on the PR. But I have to ask, was this change analysed for performance? I hope so since that was the whole point of the issue I raised. |
philippeVerney commented Dec 3, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
Mentally "analysed" : yes on my side as discussed in my above post. |
philippeVerney commented Dec 3, 2025 • edited
Loading Uh oh!
There was an error while loading. Please reload this page.
edited
Uh oh!
There was an error while loading. Please reload this page.
IMHO, it would be ideal to be cherry picked on other dev branches since it generally provide only advantages. But this is not required. |
KalleOlaviNiemitalo commented Dec 3, 2025
IMO it should be 1.13.0 only. Preallocating like this makes it more vulnerable to malicious data that claims to have a huge number of array elements and then causes the library to allocate a lot of memory, despite the attacker not spending any resources to actually send that much data. #3403 disclaims responsibility on using the library on untrusted data but users of older branches, which were published before that text was added, may still be doing that. Moreover, https://issues.apache.org/jira/plugins/servlet/mobile#issue/AVRO-4134 is categorised as a trivial improvement, rather than a bug. |
marlowa commented Dec 4, 2025
As the person who originally raised this issue, I want to chip in to agree with KalleOlavNiemitalo. It makes sense to do the change in or beyond the version where that documentation was changed, rather than earlier. I can wait for 1.13.0, that's fine. |
What is the purpose of the change
This pull request improves std::vector decoding by allocating memory block by block instead of a default memory allocation strategy.
Verifying this change
This change is already covered by existing tests, such as testArray() in test/SpecificTests.cc
Documentation