sonar.sourceEncoding

I’m using SonarScanner 4.4.0.2170, and the scanner’s properties file contains “sonar.sourceEncoding=ISO-8859-1”.

Yet, in the scanner log I find that it still reads some sources as UTF-8 and then complains that they aren’t UTF-8:

For example (the first line just shows that it correctly treats other files as ISO-8859-1):

17:17:05.960 DEBUG: 'xxx/Foo.java' generated metadata with charset 'ISO-8859-1'
17:17:05.964 WARN: Invalid character encountered in file /abs/zzz/Bar.java at line 17 for encoding UTF-8. Please fix file content or configure the encoding to be used using property 'sonar.sourceEncoding'.
17:17:05.965 DEBUG: 'zzz/Bar.java' generated metadata with charset 'UTF-8'

Did I already mention that “sonar.sourceEncoding” is set to “ISO-8859-1”?

What other magic might kick in to override sourceEncoding for just some sources?

PS: Line 17 really is a comment, and it does contain a few “�” (hex: ef bf bd) sequences. I think that is the UTF-8 encoding of U+FFFD, the Unicode replacement character used for a broken char, but those three bytes should still qualify as valid ISO-8859-1 characters, at least if I explicitly tell the scanner to treat the file as ISO-8859-1.

Hi,

First, can you upgrade the scanner to the latest version, 4.8, and see if this is still replicable?

If it is, could you add -Dsonar.scanner.dumpToFile=[path to file] to the analysis command line of one of the projects where you’re seeing this behavior? Then we can see the encoding value the analysis is actually getting. If it’s what you expect (ISO-8859-1), then we can pursue this as a scanner bug. And if it’s not, we’ll (you’ll) need to track down where the override is coming from.

 
Ann

I’ll be back to it tomorrow…

In the meantime (I don’t know your timezone), it might be worth trying to reproduce it directly in your labs by creating a plain Java file with a // � comment containing the mentioned sequence, and setting sonar.sourceEncoding to ISO-8859-1.
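
For anyone who wants to try that reproduction, here is a minimal sketch of how such a file could be generated. The class name EncodingRepro and the output file name Sample.java are made up for illustration; only the byte sequence ef bf bd comes from the file discussed above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class EncodingRepro {
    // Bytes quoted in the original post: the UTF-8 encoding of U+FFFD.
    static final byte[] SEQUENCE = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

    // Builds a tiny Java source whose only non-ASCII content is the
    // sequence inside a line comment. "Sample" is a hypothetical class name.
    static byte[] sampleSource() {
        byte[] head = "// ".getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "\npublic class Sample {}\n".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(SEQUENCE, 0, SEQUENCE.length);
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Write the raw bytes; no charset conversion happens here.
        Path target = Paths.get(args.length > 0 ? args[0] : "Sample.java");
        Files.write(target, sampleSource());

        // The same three bytes are decodable under both charsets.
        String asLatin1 = new String(SEQUENCE, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(SEQUENCE, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 chars: " + asLatin1.length());
        System.out.println("UTF-8 chars: " + asUtf8.length());
    }
}
```

Analyzing the generated file with sonar.sourceEncoding=ISO-8859-1 should then show whether the warning depends on anything else in the project.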

Back with some more data…

  • I upgraded to scanner 4.8.0.2856
  • I still get the same “WARN” for the same file and line.
  • With sonar.scanner.dumpToFile, I got a dump of all system properties, and it contained all the charset-related properties I had configured.
  • As a test, I changed ISO-8859-1 to ISO-8859-15 (the two are very similar). Most of the Java files were then identified as ISO-8859-15, but others were still identified as UTF-8 (probably based on their content).
  • What I didn’t write before: quite a few Java files get identified as UTF-8, and maybe they even are. The problem is that if the scanner identifies a file as UTF-8 against the given property, it shouldn’t then complain about the file not being UTF-8 :wink:

Hi,

Thanks for the followup. I’ve flagged this for more expert eyes.

 
Ann

Hi Andreas,

Thank you for raising this issue.
The way the scanner works is that it tries to detect the encoding automatically, and falls back to the one specified by the property if it can’t. There seems to be a bug in the detection method, in that it assumes the encoding is UTF-8 when this character sequence is in the file.
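
To illustrate the general "detect first, fall back to the property" idea, here is a rough sketch. It is not the scanner's actual code: the class and method names are hypothetical, and the detection rule (non-ASCII content that is strictly well-formed UTF-8) is only an assumption chosen to keep the example small.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class CharsetFallbackSketch {
    // Returns true only if the whole content is well-formed UTF-8.
    static boolean isStrictlyValidUtf8(byte[] content) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical stand-in for "detect automatically, else use the property".
    // Assumption: pure ASCII content gives no evidence, so the configured
    // charset is used; non-ASCII content that decodes cleanly counts as UTF-8.
    static Charset chooseCharset(byte[] content, Charset configured) {
        boolean hasNonAscii = false;
        for (byte b : content) {
            if (b < 0) { // bytes >= 0x80 are negative in Java
                hasNonAscii = true;
                break;
            }
        }
        if (hasNonAscii && isStrictlyValidUtf8(content)) {
            return StandardCharsets.UTF_8;
        }
        return configured;
    }

    public static void main(String[] args) {
        byte[] comment = {'/', '/', ' ', (byte) 0xEF, (byte) 0xBF, (byte) 0xBD};
        System.out.println(chooseCharset(comment, StandardCharsets.ISO_8859_1));
    }
}
```

Under this sketch, a file whose only non-ASCII bytes are ef bf bd is well-formed UTF-8, so detection would pick UTF-8 no matter what the property says. That would at least be consistent with the mixed results reported above.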

I created the ticket SONAR-20012 to track the bug, but for now I would recommend fixing the source files so that they no longer contain these characters.
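
If it helps to locate the affected files, below is a small sketch that walks a directory and reports the lines containing the byte sequence ef bf bd. The class name and the restriction to .java files are assumptions for illustration; adjust them to your project.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FindReplacementBytes {
    // The UTF-8 encoding of U+FFFD, as quoted in this thread.
    static final byte[] NEEDLE = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD};

    // Returns the 1-based line numbers on which the byte sequence occurs.
    // Works on raw bytes, so it does not depend on any charset guess.
    static List<Integer> linesWithSequence(byte[] content) {
        List<Integer> hits = new ArrayList<>();
        int line = 1;
        for (int i = 0; i < content.length; i++) {
            if (content[i] == '\n') {
                line++;
                continue;
            }
            boolean match = i + 2 < content.length
                    && content[i] == NEEDLE[0]
                    && content[i + 1] == NEEDLE[1]
                    && content[i + 2] == NEEDLE[2];
            // Record each line only once, even with several occurrences.
            if (match && (hits.isEmpty() || hits.get(hits.size() - 1) != line)) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(root)) {
            sources = walk
                    .filter(Files::isRegularFile)
                    .filter(p -> p.toString().endsWith(".java"))
                    .collect(Collectors.toList());
        }
        for (Path source : sources) {
            List<Integer> hits = linesWithSequence(Files.readAllBytes(source));
            if (!hits.isEmpty()) {
                System.out.println(source + ": lines " + hits);
            }
        }
    }
}
```

Run it with the source root as the first argument; each reported line is a candidate for replacing the broken character with the intended one.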

Cheers,
Eric