Mixed file encodings in repos

  • Manual scan from command line
  • sonar-scanner-cli-8.0.1.6346-windows-x64
  • OpenJDK Runtime Environment Temurin-21.0.11+10 (build 21.0.11+10-LTS)
  • Languages of the repository: Legacy repo with VB6, C++, Python, …

  • Different file encodings in the repo: ANSI, UTF-8

  • Errors observed

    • File encodings are not recognized automatically

    • Invalid characters recognized

      10:29:07.146 WARN Invalid character encountered in file C:/_Repos/…/Paketwhl.bas at line 26 for encoding UTF-8. Please fix file content or configure the encoding to be used using property ‘sonar.sourceEncoding’.

    • Parse errors (1)

      10:32:59.530 ERROR Unable to parse file: file:///C:/_Repos/…/clsPiwmLists.cls. Parse error at position 55:11
      10:32:59.530 ERROR Cannot parse ‘../clsPiwmLists.cls’: ParseException: Lexical(error = UnrecognizedSymbol(loc = (55, 12, 1431, 1431), symbol = "))

    • Parse errors (2)

      10:30:49.389 ERROR Unable to parse file: entwickl/Paket56/bas/DefaultCalc.bas
      10:30:49.389 ERROR Parse error at line 799 column 28:
      795: : preci_D1 = 0.758 * d
      796: End If
      797: If preci_X = 0# Then
      798: If d < 5# Then preci_X = 0.2
      → ElseIf d < 10# Then preci_X = 0.3
      800: ElseIf d < 13.5 Then preci_X = 0.4

  • Steps to reproduce
    If needed, files with parse errors can be provided via private message.

  • Potential workaround
    Setting sonar.sourceEncoding=windows-1252 improved things, as most files are encoded as ANSI.
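
To get an overview of how the encodings are mixed, the files can be split into those that decode as strict UTF-8 and those that do not. The following is only a minimal sketch, assuming Python 3.9+ is available; the extension list and the function names `is_valid_utf8` and `classify` are hypothetical and not part of any Sonar tooling. Note that pure ASCII files count as valid UTF-8 here, although they are equally valid windows-1252.

```python
from pathlib import Path

# Hypothetical set of extensions to check; adjust to the repository.
EXTENSIONS = {".bas", ".cls", ".frm"}


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def classify(root: str) -> dict[str, list[Path]]:
    """Group matching files under root by whether they decode as UTF-8."""
    groups: dict[str, list[Path]] = {"utf-8": [], "not-utf-8": []}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            key = "utf-8" if is_valid_utf8(path.read_bytes()) else "not-utf-8"
            groups[key].append(path)
    return groups


if __name__ == "__main__":
    import sys

    # Scan the directory given on the command line, or the current one.
    result = classify(sys.argv[1] if len(sys.argv) > 1 else ".")
    for key, paths in result.items():
        print(f"{key}: {len(paths)} file(s)")
        for p in paths:
            print(f"  {p}")
```

Files listed under "not-utf-8" are the likely ANSI candidates. The check cannot tell windows-1252 apart from other single-byte code pages, so the result is a hint rather than a reliable detection.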

Hi,

Would it be possible for you to provide a reproducer project with a few files with different encodings so that we can see how encoding is declared within the files and what we should be picking up on?

Because yes, our default assumption has always been that all the files in a project will have the same encoding. But you’re not the first person to raise the topic of individual files needing special treatment.

 
Thx,
Ann

I reduced the repo to 2 subfolders and sent you a private message with the zip. Even within one folder there are .bas files encoded as UTF-8 and as ANSI, according to Notepad++.

Some of the parse errors I mentioned in the original post are also included; I picked the corresponding folders for the example.

Hi,

Thanks for the reproducer. I’ve flagged this for the team.

 
Ann