Hi Greg,
Thanks for sharing with the community!
I believe your examples are correctly detected as polynomial time regexes by SonarQube. There are more details in the description of rule S5852:
If you have multiple non-possessive repetitions that can match the same contents and are consecutive or are only separated by an optional separator or a separator that can be matched by both of the repetitions, the worst case matching time can be polynomial (O(n^c) where c is the number of problematic repetitions). For example a*b*
is not a problem because a*
and b*
match different things and a*_a*
is not a problem because the repetitions are separated by a '_'
and can’t match that '_'
. However, a*a*
and .*_.*
have quadratic runtime.
This applies to the first example (quadratic running time in the size of the input) and the third example (cubic in the size of the input). You can probably prevent this issue by using a possessive quantifier.
The case of your second example is also described:
If you’re performing a partial match (such as by using Matcher.find
, String.split
, String.replaceAll
etc.) and the regex is not anchored to the beginning of the string, quadratic runtime is especially hard to avoid because whenever a match fails, the regex engine will try again starting at the next index. This means that any unbounded repetition (even a possessive one), if it’s followed by a pattern that can fail, can cause quadratic runtime on some inputs. For example str.split("\\s*,")
will run in quadratic time on strings that consist entirely of spaces (or at least contain large sequences of spaces, not followed by a comma).
Therefore, your second example is quadratic in the size of the input. I believe first replacing sequences of multiple whitespaces with replaceAll("\\s+", " ")
(which should run in linear time, because there is no character after the repetition), and then using split("\\s,\\s")
instead would fix the problem.
When conditions you cited from OWASP are fullfilled, the time complexity is exponential, not polynomial. This is of course, much worse.
In your case, you can decide to ignore the warnings if the input is not user-controlled or its size is reasonably bounded, as polynomial time regexes need to be run on large inputs before they become problematic.
I hope this helps!