Best way to scan multiple projects in large code base?

Tool versions:

  • SonarQube Server Enterprise Edition v10.7 (96327)
  • sonar-scanner-cli-6.2.1.4610-linux-x64, deployed from the zip archive

What am I trying to achieve:

In my repository of about 5M LOC, I have 10+ products. These products use different toolchains and share a great deal of common code. I’m trying to find an optimal way to scan all of these products, with the following criteria:

  • No duplicate issues (or as few as possible)
  • No duplicate counting of LOC (or as little as possible), so as not to exhaust our license
  • Keep scan time as short as possible

Sample repository structure:

REPO_ROOT/
├── Code/
│   ├── LibraryOne/
│   │   ├── CommonCode
│   │   ├── ProductSpecificCode/
│   │   │   ├── ProductOne
│   │   │   │   └── config.cmake
│   │   │   └── ProductTwo
│   │   │       └── config.cmake
│   │   └── CMakeLists.txt (include $PRODUCT/config.cmake)
│   └── ... Many more libraries following the LibraryOne pattern
└── Projects/
    ├── ProductOne/
    │   └── CMakeLists.txt (PRODUCT=ProductOne)
    └── ProductTwo/
        └── CMakeLists.txt (PRODUCT=ProductTwo)

What I tried:

1. Scan individual products into individual projects in SonarQube

Upsides: Can parallelize the scans on multiple machines, taking only about 1.5 hours for all products.
Downsides: I get duplicate issues from common code, and common code is counted multiple times, exhausting the company’s license. The code base is about 5M LOC, but if I scanned all products individually, I’d get to 15+M LOC.

2. Scan the whole repository using Code variants for each product

Upsides: Issues are not duplicated and LOC is counted only once. Nice unified display of all issues in one place.
Downsides: A fresh scan takes almost 6 hours.
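
For reference, this setup is driven by the cfamily code-variants properties. A minimal sketch with illustrative names and paths (double-check the exact property names against the documentation for your server version):

sonar.projectKey=repo-root
sonar.cfamily.variants.names=ProductOne,ProductTwo
sonar.cfamily.variants.ProductOne.compileCommands=build/ProductOne/compile_commands.json
sonar.cfamily.variants.ProductTwo.compileCommands=build/ProductTwo/compile_commands.json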

3. Scan products into a single SonarQube project with multiple branches

Upsides: LOC is counted only for the largest branch (= product).
Downsides: Issues remain duplicated. The display is not very user-friendly. This is just bending the tool’s workflow to do what I want; see the sketch below.
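
For completeness, the bending amounts to scanning every product into the same project key under a different branch name (illustrative values; sonar.branch.name changes on each product’s scan):

# one scan per product, same project key, different branch name
sonar.projectKey=repo-root
sonar.branch.name=ProductOne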

4. Other attempts

  • Tried the Applications feature to aggregate the results of all the projects - neither issues nor LOC are de-duplicated.
  • Tried the Monorepo feature, but it appears the main purpose of that feature is to better integrate projects, not to de-duplicate issues and LOC.
  • The recommended usage of the sonar.exclusions parameter is not very feasible due to the structure of my repository. There are hundreds of libraries, and some include product-specific code in addition to common code. Getting the sonar.exclusions parameter to work would take a lot (and I mean a lot) of scanner-configuration clutter.
  • I didn’t find any option to scan without sending the results to the server. The workflow would be: scan all products without sending results, then combine all the results manually and send them at once. I know about the possibility of using 3rd-party scanners, but I’d like to keep SonarQube’s scanner, as it works really well and I don’t need to invest more time configuring a second tool.

Final notes

The most usable solution appears to be number 2 - use Code variants, even with the long scanning time. One addition that would solve my problem is an option, or a flag, that would prevent a new scan from overwriting whatever is already on the server. Instead of overwriting, the tool would “add” the new findings, cross-detect duplicates, and remove them.

I found a suggestion, and I hope I understood the proposition correctly. It proposes allowing code variants to be scanned on multiple hosts, which would also solve my case: https://portal.productboard.com/sonarsource/3-sonarqube-server/c/444-analyze-multiple-code-variants-built-on-distinct-hosts.

I’d like to get your opinion on which option is best, or maybe a proposal for a completely different alternative that I have missed.

Thanks,
Michal

Hi @mletavk,

Here is what I would do in your case:

  • Scan each project independently. During each of these analyses, exclude all files that are common.
  • Also analyze the common code once, as a separate project, possibly including the unit tests of this common code, along the lines of the sketch below.
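
Something like this, as a rough sketch (all keys and path patterns are placeholders to adapt):

# per-product analysis: skip the shared code
sonar.projectKey=ProductOne
sonar.sources=Code,Projects/ProductOne
sonar.exclusions=Code/**/CommonCode/**

# one extra analysis covering only the shared code
sonar.projectKey=CommonCode
sonar.sources=Code
sonar.inclusions=Code/**/CommonCode/**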

What do you think?

Hey Loïc, and thanks for the reply!

I did consider such a solution as well, but I think it won’t scale well with the number of libraries in my repository. The biggest obstacle is getting a list of which files to include and exclude.

Unfortunately, the folder structure of the libraries is not standardized - sometimes the product-specific part is in a separate folder, sometimes it’s just a file named differently, sometimes pairs of products share a file…

I could craft the include/exclude lists manually, but that’s unsustainable over time, as the structure of the libraries may change. I could place an include/exclude configuration file in each library, but that would be a lot of noise (I’m speaking of upper tens to lower hundreds of libraries). Or I could automate the list creation (see the sketch below), but then I’d be back to figuring out which files are common in each individual library. Or perhaps generate the lists from the CMakeLists.txt files.
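
For illustration, the folder-based automation might look roughly like the sketch below. It optimistically assumes that product-specific code always lives in directories named after the products, which, as said, is not reliably true here:

#!/usr/bin/env python3
# Rough sketch: derive a sonar.exclusions value for one product's scan.
# Optimistic assumption: product-specific code always lives in directories
# named after the products (not reliably true in our repository).
from pathlib import Path

PRODUCTS = {"ProductOne", "ProductTwo"}  # all known product names

def exclusions_for(repo_root: str, product: str) -> str:
    # Exclude every other product's directories from this product's scan.
    root = Path(repo_root).resolve()
    patterns = [
        path.relative_to(root).as_posix() + "/**"
        for path in (root / "Code").rglob("*")
        if path.is_dir() and path.name in PRODUCTS - {product}
    ]
    return ",".join(sorted(patterns))

print("sonar.exclusions=" + exclusions_for(".", "ProductOne"))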

Either way, given the volume of files and libraries, this would take an extensive amount of work, and I’d prefer a more effective or simpler solution, if such a solution is possible.

If the discrepancies in project structure make it difficult to be clever, what about using brute force? A single analysis for all projects and libraries, on a very large machine with enough computing power.

You mention that a fresh scan takes almost 6 hours with code variants. I expect a single analysis to be faster: with code variants, a common file is analyzed several times and the results are aggregated, while in a single analysis it is analyzed only once. On what kind of machine did that run?

To understand completely: do you mean creating an artificial project that builds/includes everything? Or combining the multiple compile_commands from all projects into one super compile_commands? Could you please elaborate on the solution you have in mind?

I run scans locally on a Windows machine, in WSL, in a Docker container. My CPU is an Intel Xeon E3-1270 v5 with 4 physical (8 logical) cores @ 3.6 GHz. Later the scans will run in CI.

Hi @mletavk,

Yes, I meant a single compile_commands to rule them all. In that case, files that are mentioned several times will be analyzed only once.
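
The merge itself can be a short script. A minimal sketch (input paths are placeholders) that keeps the first entry seen for each source file:

#!/usr/bin/env python3
# Minimal sketch: merge several compile_commands.json files into one,
# keeping the first entry seen for each source file.
import json
from pathlib import Path

inputs = [
    "build/ProductOne/compile_commands.json",  # placeholder paths
    "build/ProductTwo/compile_commands.json",
]

merged, seen = [], set()
for db in inputs:
    for entry in json.loads(Path(db).read_text()):
        # "file" may be relative; normalize it against "directory"
        key = str((Path(entry["directory"]) / entry["file"]).resolve())
        if key not in seen:
            seen.add(key)
            merged.append(entry)

Path("compile_commands.json").write_text(json.dumps(merged, indent=2))
print(f"kept {len(merged)} unique entries")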

For large projects, large machines are very helpful to reduce the analysis time. We commonly use 32-core machines on CI (with enough memory to handle the parallel tasks). A rough approximation is that analysis speed scales linearly with the number of cores (assuming enough memory is available).

Using the single-scan approach, the scan time is about 2 hours.

I realised, though, that this approach has a trade-off in the form of possible file misses; see the example below.

Library/
├── common.cpp // #include "product_specific.hpp"
├── ProductOne/
│   └── product_specific.hpp
└── ProductTwo/
    └── product_specific.hpp

The first product builds with include path -ILibrary/ProductOne and the second with -ILibrary/ProductTwo. When I combine the compile_commands, the scanner sees common.cpp as a duplicate and keeps only the first occurrence, the one that includes ProductOne/product_specific.hpp. The file ProductTwo/product_specific.hpp therefore remains unscanned.
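
In compile-database terms, the collision looks like this (simplified, absolute paths): two entries for the same file, and whichever one survives deduplication decides which product_specific.hpp the analyzer ever sees.

[
  {
    "directory": "/repo/build/ProductOne",
    "command": "g++ -I/repo/Library/ProductOne -c /repo/Library/common.cpp",
    "file": "/repo/Library/common.cpp"
  },
  {
    "directory": "/repo/build/ProductTwo",
    "command": "g++ -I/repo/Library/ProductTwo -c /repo/Library/common.cpp",
    "file": "/repo/Library/common.cpp"
  }
]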

But thanks, this is at least one more alternative I can consider when scanning the repository.

Please let me know if you have any more ideas. I’ll run this by my colleagues and let you know which approach we ultimately choose.

Yes, the whole point of code variants is to analyze common.cpp twice in different contexts, which takes more time.

I don’t have more ideas for now, except:

  • Using a bigger machine to analyze the code base.
  • Totally changing the architecture of your code so that each library becomes more self-contained, isolated from the products that use it, and can be analyzed independently. But that is several orders of magnitude more complex, and I don’t know enough about your domain to know if it even makes sense.

The team decided to go with the code variants approach, accepting the extended scanning time for the benefit of having precise results together in a single SonarQube project, without LOC being counted multiple times and without issues being duplicated.

Loïc, thanks for all the information and consultation!
