Directories ignored by scm are still scanned

TL;DR: Sonar-scanner spends a lot of time scanning ignored directories.

We are trying to scan a massive project with millions of lines of code. We have some build directories that are listed in our .gitignore file and hence should also be ignored by the sonar-scanner. However, when scanning the code, I can see from the logs that it spends a lot of time (10+ minutes) scanning those directories. The output of those scans seems to be empty, so the results don’t appear to be used for anything, but the scanner spends time on them anyway. Is this a bug, or is it intended for a purpose I just don’t understand?

Here is an example of the log output, where you can see it always says “Analyzed 0 file(s)”. I assume that is because it recognises the files are supposed to be ignored, but it still spends time on them.

INFO: Creating TypeScript program
INFO: TypeScript configuration file /this/directory/is/ignored/by/gitignore/tsconfig.json
INFO: Creating TypeScript program (done) | time=980ms
INFO: Starting analysis with current program
INFO: Analyzed 0 file(s) with current program

Another hint that it’s actually spending time traversing all the files is that when I run the scanner with -X, I see thousands of lines like this:

09:24:56.829 DEBUG: File '/this/directory/is/ignored/by/gitignore/file-a' is excluded by the scm ignore settings.
09:24:56.832 DEBUG: File '/this/directory/is/ignored/by/gitignore/file-b' is excluded by the scm ignore settings.
09:24:56.834 DEBUG: File '/this/directory/is/ignored/by/gitignore/file-c' is excluded by the scm ignore settings.
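To get a rough idea of how many ignored files the scanner still visits, and where they live, the debug lines above can be tallied per directory. This is only a sketch: it assumes the debug output was saved to a file and that every relevant line has exactly the format shown above; the function name excluded_per_directory is made up for illustration.

```python
import re
from collections import Counter
from posixpath import dirname

# Pattern copied from the DEBUG lines quoted above (assumed to be stable).
EXCLUDED_RE = re.compile(r"File '(.+)' is excluded by the scm ignore settings\.")

def excluded_per_directory(lines):
    """Count 'excluded by the scm ignore settings' messages per parent directory."""
    counts = Counter()
    for line in lines:
        match = EXCLUDED_RE.search(line)
        if match:
            counts[dirname(match.group(1))] += 1
    return counts

# Hypothetical usage, assuming the -X output was saved as scanner.log:
#   with open("scanner.log", encoding="utf-8", errors="replace") as log:
#       for directory, n in excluded_per_directory(log).most_common(20):
#           print(n, directory)
```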

(I’m using SonarScanner 5.0.1.3006 on macOS, installed via Homebrew)

Hey there.

I’m not sure why this happens if all the files are excluded (and it makes sense to look into it), but 980ms = 1 second. Let’s not miss the forest for the trees!

I think a good first step would be identifying what is taking the longest in your scanning process. 5-10 minutes is quite a lot, but of course, it depends on the size of your codebase.

A grep command like this might be useful:

grep -E 'time=[0-9]{6,}ms' scanner.log

This finds any duration with 6 or more digits (that is, 100 seconds or longer). If nothing takes that long, the 6 can be lowered to 5, 4, etc.
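As a complement to the grep, a small script can rank every timed step instead of relying on a digit-count threshold. This is a minimal sketch, assuming the log was saved as scanner.log and that durations always appear as time=<n>ms, as in the excerpt quoted earlier; the function name slowest_steps is hypothetical.

```python
import re

# Matches the "time=<n>ms" suffix seen in the scanner log excerpt.
TIME_RE = re.compile(r"time=(\d+)ms")

def slowest_steps(lines, top=10):
    """Return up to `top` (milliseconds, line) pairs, slowest first."""
    timed = []
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            timed.append((int(match.group(1)), line.rstrip("\n")))
    # Sort on the duration only, so ties keep their order in the log.
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return timed[:top]

# Hypothetical usage, assuming the scanner output was saved as scanner.log:
#   with open("scanner.log", encoding="utf-8", errors="replace") as log:
#       for ms, line in slowest_steps(log):
#           print(f"{ms / 1000:8.1f}s  {line}")
```

Sorting all timings avoids having to guess the right number of digits up front, at the cost of reading the whole log into memory.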