Why is an email validation regular expression flagged for having control characters?

Version: SonarQube 9.3 (build 51899)

We have observed that a regular expression used to validate emails is being flagged as a bug by SonarQube. I’ve done some research and, from what I can tell, control characters are non-printable, so it sort of makes sense that they should not be present. But I’ve yet to find out why a top-ranking website is advocating for this regex: https://emailregex.com/

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Here’s an example of one of the “bugs”:

Remove this control character: \x0e.

And here is the full list of control characters which are flagged by SonarQube:

‘.\x0e…\x0e…\x0c…\x0c…\x0b…\x0c…\x1f…\x01…\x1f…\x01…\x01…\x09…\x08…\x0b…\x0b…\x0e…\x0b…\x08…\x0c…\x0e…\x09…\x01.’

I do know that email addresses can now contain special characters from different languages, but my experience with this is severely limited. Hoping someone can help shed some light on this. Thank you!

Hey there.

Can you specify which rule is raising the issue, and for which programming language? This helps us direct the question to the right team.

Hi Colin,

Thanks for the quick reply. The programming language is Node.js. The rule is L27.

Hope this helps,
James

I think that’s your line number. :wink: You can find the Rule ID by clicking on “Why is this an issue?”

Sorry. I’m hoping this is it: javascript:S6324. From the top right part of the “Why is this an issue?” panel.

Hi,

I was wondering if there were any updates on this issue?

I’ve also created a similar Stack Overflow post to gather more information on these characters: node.js - What is the purpose of non-printable control characters in this email validation regular expression? - Stack Overflow

Thank you,
James

Your question:

Why is an email validation regular expression flagged for having control characters?

Is answerable by a broader super-question:

Why is a regular expression flagged for having control characters?

The docs for S6324 state:

Entries in the ASCII table below code 32 are known as control characters or non-printing characters. As they are not common in JavaScript strings, using these invisible characters in regular expressions is most likely a mistake.
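To make the rule's wording concrete, here is a minimal JavaScript sketch of what "below code 32" means in practice. The helper name `hasControlCharacter` is hypothetical (it is not part of SonarQube or any library), and the character class is only a fragment in the style of the regex from the question, not the full pattern.

```javascript
// Hypothetical helper: true when the string contains an ASCII
// control character, i.e. a code point below 32.
function hasControlCharacter(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) < 32) {
      return true;
    }
  }
  return false;
}

// A character class written with hex escapes, similar to the ranges
// in the email regex. A pattern like this is what S6324 reports.
const controlClass = /[\x01-\x08\x0b\x0c\x0e-\x1f]/;

const bell = String.fromCharCode(7); // BEL, code 7, non-printable

console.log(hasControlCharacter("plain text")); // false
console.log(controlClass.test("plain text"));   // false
console.log(controlClass.test(`a${bell}b`));    // true
```

Ordinary text never matches the class, which is why the rule assumes such escapes are usually accidental.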

Is your problem really that you think SonarQube should have a special case to detect that a regex is for email validation? If so:

  • say so.
  • While I agree that it would be nice, I don’t think it adds much value (how many people have had the same question as you out of all the people who use this lint rule?), and this kind of special-casing doesn’t scale well. How should it be decided what to special-case and what not to? It could be a big can of worms, and the problem can equally be addressed on your end by using the provided linting escape hatch mechanisms.

I’m trying to figure out what we’re supposed to do with the conflicting information from emailregex’s site and SonarQube. I’m looking for an answer from someone who has dealt with this specific issue firsthand. Thank you.

Colin - Does SonarQube have a way to ignore the rule? We’re OK with doing that at this point.

You can use NOSONAR comments to ignore warnings on a case-by-case basis. I’ve also updated my answer on stackoverflow.com accordingly (along with an edit pointing out the part of the RFC spec that shows that addresses can have these characters). Please mark it as accepted if it answers your question.
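As a rough sketch of what that could look like in JavaScript: the comment goes on the same line as the flagged regex. The pattern below is a hypothetical, much smaller stand-in for the full email regex (a quoted local part followed by a simple domain), and `isQuotedAddress` is an illustrative name, so treat this as an assumption-laden example rather than a drop-in replacement.

```javascript
// Hypothetical, simplified stand-in for the full email regex:
// a quoted local part that may contain control characters,
// then "@" and a simple lowercase domain.
const quotedLocalPart = /^"[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]*"@[a-z0-9-]+(?:\.[a-z0-9-]+)+$/; // NOSONAR control characters are intentional here

// Illustrative helper name, not from the thread.
function isQuotedAddress(address) {
  return quotedLocalPart.test(address);
}

console.log(isQuotedAddress('"abc"@example.com'));     // true
console.log(isQuotedAddress('"a\x07b"@example.com'));  // true, BEL is inside an allowed range
console.log(isQuotedAddress('"a b"@example.com'));     // false, space (\x20) is not in the class
```

Keeping a short reason after NOSONAR makes it easier for a later reader to see why the warning was silenced.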

hello @jamesmortensen,

Yes, you can mark individual issues as “False positive” or “Won’t fix” in the SonarQube UI, and/or you can remove the rule from your quality profile.

As for your original question: I think this is a very complicated regex, and perhaps the use of control characters is justified there. (To be honest, I am not really an expert, but I wonder why control characters should be allowed in an email address. Are you able to configure such an address and send or receive emails on it?) However, I think such complex regexes are going to be the exception, and the rule will work fine and provide value in 99% of other cases.