Skip to content

Conversation

@picnixz
Copy link
Member

@picnixzpicnixz commented May 30, 2025

This is a work in progress (I need to go now) but I'll continue tomorrow. I want to add tests, and some other places are still not fixed because I didn't find a straightforward fix.

@picnixzpicnixz changed the title gh-134873: fix various quadratic worst-time complexity in _header_value_parser.py [WIP]gh-134873: fix various quadratic worst-time complexities in _header_value_parser.py [WIP]May 30, 2025
@picnixzpicnixz added type-security A security issue needs backport to 3.9 needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels May 30, 2025
@picnixz
Copy link
MemberAuthor

picnixz commented May 31, 2025

Urgh, so there is still a quadratic complexity that I need to think about, it's when we alternate if-branches. For instance in get_phrase:

try: token, value=get_word(value) phrase.append(token) excepterrors.HeaderParseError: phrase.defects.append(errors.InvalidHeaderDefect( "phrase does not start with word")) whilevalueandvalue[0] notinPHRASE_ENDS: ifvalue[0]=='.': phrase.append(DOT) phrase.defects.append(errors.ObsoleteHeaderDefect( "period in 'phrase'")) value=value[1:] else: try: token, value=get_word(value) excepterrors.HeaderParseError: ifvalue[0] inCFWS_LEADER: token, value=get_cfws(value) phrase.defects.append(errors.ObsoleteHeaderDefect( "comment found without atom")) else: raisephrase.append(token) returnphrase, value

The

ifvalue[0]=='.': phrase.append(DOT) phrase.defects.append(errors.ObsoleteHeaderDefect( "period in 'phrase'")) value=value[1:]

is quadratic if we're processing multiple times .. However, if I have something like 'a + '.a' * 100, then the if branch still requires a copy every two iterations, whatever I put inside. Even if the length reduces at each iteration, it doesn't sufficiently reduce. I'll need to think a bit more.

One idea would have been to use a deque to prevent a copy when stripping parts, but then this requires to reconstruct a deque everytime.

Maybe switch to a stateful parser? That way, we shouldn't have high complexity and we should be fine. But this requires a complete rewrite of this module.

@picnixz
Copy link
MemberAuthor

Ok, this patch still fixes some cases but not all. Cases where two branches alternate would still be subject to O(n²) complexities (unless we avoid the copy in .lstrip() or in [1:], it's not possible to avoid this with .lstrip() or slices since they still need O(n) to copy the rest of the string).

The advantage of .lstrip() over slices is essentially when the if branch is executed more than once before going to the else branch (namely, we can batch-process some characters). For instance, "a" + "." * 50000 is parsed using lstrip() in O(n) instead of O(n²). However, "a" + ".a" * 50000 is still subject to O(n²) parsing.

@picnixz
Copy link
MemberAuthor

picnixz commented May 31, 2025

@serhiy-storchaka I'm a bit stuck here. I don't really have a better idea than to rewrite the module where we would use a deque to hold the current value. That way, I can call .popleft() to drop prefixed chars. Unfortunately, this also means that cannot really return a string anymore as the module is used and signatures are actually in _typeshed: https://github.com/python/typeshed/blob/main/stdlib/email/_header_value_parser.pyi.

So, to ensure backward compatibility, I think I'll need a new module... I can't think of another solution instead of entirely rewriting the logic so that we don't have un-necessary slice so your help would be appreciated, TiA!


I thought about holding the current "index" where the parser stopped but again, I don't think it'll be sufficient as I'll still need to make slices at some point to extract some values to hold (OTOH, using a deque allows me to move some characters to elsewhere without having to copy the string twice, though I'll still need a ''.join() on the part that is being stored).

defget_something(value): storage=Storage() head= [] whilecond(value): head.append(value.popleft()) storage.value=''.join(head) returnstorage, value

@serhiy-storchakaserhiy-storchaka self-requested a review June 1, 2025 06:06
johnzhou721

This comment was marked as resolved.

@serhiy-storchakaserhiy-storchaka changed the title gh-134873: fix various quadratic worst-time complexities in _header_value_parser.py [WIP]gh-136063: fix various quadratic worst-time complexities in _header_value_parser.py [WIP]Jun 28, 2025
@serhiy-storchaka
Copy link
Member

This approach doesn't work, because the code has inherently quadratic complexity. Every function splits the input string on two parts -- the parsed part and the rest. This operation (creating a new still unparsed string) has linear complexity. Repeated, the total complexity is quadratic.

The only way to avoid quadratic complexity is to change interface of all functions. To rewrite the code in a style similar to the re parser or HTMLParser. I'll take care of this when I'm done with HTMLParser.

@picnixz
Copy link
MemberAuthor

The only way to avoid quadratic complexity is to change interface of all functions. To rewrite the code in a style similar to the re parser or HTMLParser. I'll take care of this when I'm done with HTMLParser.

Yeah that's what I feared and thus I didn't start this work until you confirmed it. Thanks for taking care of this as it would take me too much time.

@picnixzpicnixz closed this Jun 28, 2025
@picnixzpicnixz deleted the fix/email/dos-134873 branch June 28, 2025 12:02
@picnixzpicnixz removed type-security A security issue needs backport to 3.9 needs backport to 3.10 only security fixes needs backport to 3.11 only security fixes needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes needs backport to 3.14 bugs and security fixes labels Jun 28, 2025
Sign up for freeto join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

@picnixz@serhiy-storchaka@johnzhou721