[Tech Story] Takeaways from building a SAST product, and why OWASP benchmark is not enough

(5 min read)

In 2018, SonarSource decided to enter more seriously into the Static Application Security Testing (SAST) market. On top of our existing rules for detecting vulnerabilities, we developed a new security analysis engine to detect complex injection vulnerabilities, first targeting Java, C# and PHP.

Because we were new to the security domain, we looked for test cases to help us better understand which types of problems we should be able to detect. We quickly found the famous OWASP Benchmark, which is composed of 2,740 Java test cases. This benchmark really helped us bootstrap our development and measure our progress during initial development.

In this post, I’m going to explain how we used the OWASP Benchmark to improve our taint analyzer (in a SAST context, data that comes from users is considered “tainted”) and most importantly why we decided that getting a score of :100: on the OWASP Benchmark is not our goal.

The First 10 Months

We started implementation of our taint analyzer by targeting SQL Injection, which is probably the most famous vulnerability. We identified the basic sources of user input (e.g. HttpServletRequest.getParameter("ParamKey")) and the APIs where the vulnerability can be exploited (the “sinks” in SAST jargon). We were pretty proud of our initial results, but wanted to confirm them on more concrete projects, which is why we started using the OWASP Benchmark.
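To make the jargon concrete, here is a minimal, hypothetical servlet sketch (the class and parameter names are made up; this is not benchmark code) showing the kind of source-to-sink flow we wanted to detect:

// Illustrative only: a hypothetical example of a source-to-sink SQL injection flow.
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.servlet.http.HttpServletRequest;

public class UserLookup {

    public ResultSet findUser(HttpServletRequest request, Connection connection) throws SQLException {
        String name = request.getParameter("name");                          // "source": tainted user input
        String query = "SELECT * FROM users WHERE name = '" + name + "'";    // taint propagates through concatenation
        Statement statement = connection.createStatement();
        return statement.executeQuery(query);                                // "sink": tainted data reaches the SQL API
    }
}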

You can split the benchmark test cases into two main sets: Injection Vulnerabilities and Non-Injection Vulnerabilities (see Appendix 2 for more details). Because we were developing a taint analyzer, we focused only on the 1572 Injection Vulnerabilities test cases.

Our first analyses of these cases were promising, with 27% true-positives. But there were also 21% false-positives, so we knew we had more work to do.

Digging into the false-positives made us realize we needed to track values in arrays and collections. We also had to add support for Objects and field access to our existing String variable support. Finally, we needed to add additional sources, sinks, and the standard methods that are used to make the data flow safe (i.e. sanitizers).
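As an illustration of what that meant in practice, here is a simplified sketch (again with made-up names, not benchmark code) of taint flowing through a collection and an object field before reaching a SQL sink, next to the parameterized-query variant that acts as the safe pattern:

// Illustrative only: taint tracked through a collection and a field, plus the sanitized variant.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import javax.servlet.http.HttpServletRequest;

public class OrderLookup {

    static class Filter {
        String value; // taint must be tracked into and out of this field
    }

    public ResultSet unsafe(HttpServletRequest request, Connection connection) throws SQLException {
        List<String> values = new ArrayList<>();
        values.add(request.getParameter("id"));                   // tainted value stored in a collection
        Filter filter = new Filter();
        filter.value = values.get(0);                              // ...then in an object field
        String query = "SELECT * FROM orders WHERE id = " + filter.value;
        return connection.createStatement().executeQuery(query);  // still tainted at the sink: issue expected
    }

    public ResultSet safe(HttpServletRequest request, Connection connection) throws SQLException {
        PreparedStatement statement =
            connection.prepareStatement("SELECT * FROM orders WHERE id = ?");
        statement.setString(1, request.getParameter("id"));       // bound as a parameter: no injection possible
        return statement.executeQuery();                           // no issue expected
    }
}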

All along the way, we released new versions to our community, and it was pretty exciting to see feedback coming in and results improving after each release. We ultimately reached 74% true-positives with pretty good performance (see Appendix 3), but weren’t satisfied with the 30% false-positive rate (false-positives are the worst; they can kill trust in a product). Digging into that, we started to understand some of the benchmark’s limits. :thinking:

Reaching The Limits

Weird Test Cases

Execution Path Sensitivity

While digging into the issues generated by our analyzer, we discovered that a lot of the test cases were related to Execution Path Sensitivity, which is the ability to detect that an execution path won’t be reachable under certain conditions. All the Execution Path Sensitivity test cases contain an if statement that is either always true or always false, and the vulnerability is somewhere in the code that is not actually reachable. See BenchmarkTest00104 for a concrete example.
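Here is a simplified illustration of the pattern (not the actual BenchmarkTest00104 code): the condition is always false, so the vulnerable call is never executed, and a path-sensitive tool would be expected to stay silent.

// Illustrative only: a dead branch containing an injection, in the spirit of these test cases.
import java.sql.Connection;
import java.sql.SQLException;
import javax.servlet.http.HttpServletRequest;

public class DeadBranch {

    public void process(HttpServletRequest request, Connection connection) throws SQLException {
        String param = request.getParameter("input");   // tainted
        boolean flag = false;
        if (flag) {
            // never executed: the benchmark expects no issue to be raised here
            connection.createStatement().executeQuery("SELECT * FROM t WHERE c = '" + param + "'");
        }
    }
}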

At SonarSource, we believe that in real life, if a vulnerability exists in a branch of the code, it will eventually be called. Also, if you really have dead code and conditions that are always true or always false, that’s not a security issue, that’s a bug, and we have other rules (RSPEC-2589, RSPEC-2583, RSPEC-1850, RSPEC-1145) for that.

So we decided to discard Execution Path Sensitivity test cases and not consider them at all regardless of what weakness they were supposed to demonstrate. To be clear, our taint analyzer is not Execution Path Sensitive and probably never will be. If there’s a vulnerability in unreachable code, we’re going to raise an issue because we still think it should be fixed.

Trust Boundary

There are 126 test cases in the benchmark related to something called Trust Boundary. As it turns out, they aren’t actually exploitable, and Dave Wichers, the primary author of the OWASP Benchmark, proposed dropping them in a future version of the benchmark. We feel it would be incorrect to raise issues on these cases because in real life nothing bad can happen. So we also removed those cases from the scope of what we want to cover.
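For context, these cases roughly have the following shape (my simplified reconstruction, not actual benchmark code): user-controlled data is written into the HTTP session, which CWE-501 considers a trust boundary violation, even though nothing exploitable happens on its own.

// Illustrative only: writing user input to the session, flagged by the benchmark but not exploitable by itself.
import javax.servlet.http.HttpServletRequest;

public class SessionWrite {

    public void remember(HttpServletRequest request) {
        String value = request.getParameter("pref");
        request.getSession().setAttribute("userPref", value); // "trust boundary" case: no concrete attack here
    }
}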

Path Traversal

Finally, 78% of the Path Traversal test cases consider that merely instantiating a java.io.File from user-provided data is a problem, which we don’t think is correct (the same applies to java.io.FileOutputStream, java.io.FileInputStream, and checking whether a file exists with File.exists()). It’s not instantiating the class that’s the problem; you are only at risk if you perform actions (read/write) on these objects. Therefore, we also excluded those test cases.
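A simplified sketch of the distinction we make (the names and paths are made up for illustration):

// Illustrative only: instantiating a File vs. actually reading from it.
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import javax.servlet.http.HttpServletRequest;

public class FileAccess {

    public boolean exists(HttpServletRequest request) {
        File file = new File("/data/" + request.getParameter("name"));
        return file.exists();                                     // no read/write performed: we don't raise an issue here
    }

    public byte[] read(HttpServletRequest request) throws IOException {
        File file = new File("/data/" + request.getParameter("name"));
        try (FileInputStream in = new FileInputStream(file)) {    // the content is actually read: this is the risky action
            return in.readAllBytes();
        }
    }
}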

Where did that lead us?

In total, from the 1572 injection vulnerabilities test cases of the OWASP Benchmark, we discarded 372 and retained the 1200 that we feel are relevant to security testing. If you consider only these 1200 test cases, the SonarQube Developer Edition (as of Sept 2019) gets an OWASP Score of 84 with a True-Positive Rate of 85% and False-Positive Rate of 1%. And again, this was achieved by focusing on value and doing what makes sense, more than chasing the OWASP score itself.

What’s Next?

We will continue to improve our SAST engine to reduce our False-Positive Rate over time, and obviously we’ll work to improve our True-Positive Rate. But today more than before, getting an amazing OWASP Benchmark Score is not our goal. It would be completely wrong to get a score of 100 now that we understand the limits of the benchmark.

The OWASP Benchmark was a great set of test cases to bootstrap our SAST engine with, but it’s not the end of the journey. There’s still lots more to do! For instance, we want to improve our coverage of the rest of the OWASP Top 10 2017 categories, such as A4-XXE and A8-Insecure Deserialization. For all the OWASP categories, we need to follow the trend of newer Java frameworks (Vert.x, SparkJava) and consider the specifics of well-established ones (Spring Dependency Injection). We also need to improve our XSS rule to consider front-end technologies; today we stop following the execution flow when the data leaves the Java scope. We need to continue tracking the data through the templating system (Thymeleaf, JSP) or the JavaScript front-end (React, Angular, Vue.js).

Obviously, continuing to use the OWASP Benchmark for this work would be ideal, but those things aren’t included in the benchmark today and it’s uncertain whether new test cases will be added in the future. The project is looking for maintainers, and while SonarSource could contribute - it’s part of our DNA to contribute to open-source - that would look weird, and it’s possible that we might inadvertently skew the results.

Plus, getting good results on any benchmark doesn’t necessarily mean you can detect real-world vulnerabilities. As a consequence, we need other sources of test cases (CVE databases, GitHub commits) in order to be confident our taint analyzer can detect vulnerabilities in a majority of the OWASP Top 10 2017 categories. To make sure we can find these vulnerabilities, we need good examples to work from, so we’re relying on the commit comments of vulnerability fixes in open-source projects, and then using the “before” code as our examples. Studying these commits helps us better understand which types of vulnerabilities are being fixed these days and how developers managed to mitigate them.

The results of this effort are already available for on-premise pipelines with the SonarQube Developer Edition, and also through online code analysis using SonarCloud (which is totally free if your project is open-source). We currently cover injection flaw detection for Java, C# and PHP, and are already working on Python support.

Cheers,
Alex

Appendix 1: OWASP Benchmark Score

Here are the basics of how the OWASP Benchmark Score is computed.

  • TP - True-Positive: an issue is expected and the analysis finds it
  • FN - False-Negative: an issue is expected and the analysis does NOT find it
  • TN - True-Negative: no issue is expected and the analysis is correctly silent
  • FP - False-Positive: no issue is expected and the analysis raises an unexpected issue
  • True-Positive Rate (TPR) = TP / (TP + FN)
  • False-Positive Rate (FPR) = FP / (FP + TN)
  • OWASP Benchmark Score = TPR - FPR

In order to get a score of 100, you have to find all the real problems without raising any false-positives.
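For illustration, here is the same arithmetic as a small Java snippet (the counts below are placeholders, not real benchmark results):

// Illustrative only: computing the OWASP Benchmark Score from placeholder counts.
public class BenchmarkScore {

    static double score(int tp, int fn, int tn, int fp) {
        double tpr = (double) tp / (tp + fn);   // True-Positive Rate
        double fpr = (double) fp / (fp + tn);   // False-Positive Rate
        return (tpr - fpr) * 100;               // OWASP Benchmark Score, as a percentage
    }

    public static void main(String[] args) {
        // Placeholder counts: 850 found out of 1000 expected issues, 5 false alarms out of 500 clean cases.
        System.out.printf("Score: %.1f%n", score(850, 150, 495, 5));
        // TPR = 0.85, FPR = 0.01 -> score = 84.0
    }
}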

If you look at the officially published OWASP Score for the “SonarQube Java Plugin”, you will see it is far from good at 33%. This bad score is explained by the fact that the OWASP Benchmark was last measured with SonarJava 3.14, which was released in Sept. 2016 - nearly three years ago at this writing - and at the time no one at SonarSource was looking to improve this score because developing security rules was not our main concern.

Things have changed a lot since that version. If you consider only the 1200 injection vulnerability test cases, the SonarQube Developer Edition gets an OWASP Score of 84 with a True-Positive Rate of 85% and False-Positive Rate of 1%.

This score was produced using SonarQube Developer Edition 7.9.1 running the Security Engine 8.0-M1.

Note: We tried to produce the input needed for the official OWASP Benchmark scoring, but for technical reasons, and because we discarded 372 test cases, we found it easier to compute the OWASP Score ourselves.

Appendix 2: OWASP Benchmark Content

Official OWASP Benchmark Content

The OWASP Benchmark only targets Java. It is made of 2,740 test cases stored in a single directory named “testcode”. The expected results are described in a CSV file. For each test case, it details:

  • which type of vulnerability it targets
  • whether an issue is expected
  • the CWE identifier related to the test case

The benchmark covers 11 types of vulnerability that we grouped into 2 sets:

  • Injection Vulnerabilities (6):
    • SQL Injection: 504 tests cases
    • Path Traversal: 268 test cases
    • LDAP Injection: 59 test cases
    • Command Injection: 251 test cases
    • XPath Injection: 35 test cases
    • Cross-Site Scripting (XSS): 455 test cases
  • Non-Injection Vulnerabilities (5):
    • Cryptography: 246 test cases
    • Hashing: 236 test cases
    • Secure Cookie: 67 test cases
    • Trust Boundary: 126 test cases
    • Weak Random Number: 493 test cases

SonarSource OWASP Benchmark’s Content

In order to make it clear which test cases we excluded and on which ones an issue is expected, we cloned the official OWASP Benchmark and re-organized the test cases by vulnerability type (sqli, pathtraver, ldapi, …) and into issueexpected / noissueexpected sub-directories. This helped us easily see in SonarQube/SonarCloud when unexpected issues were raised.

Appendix 3: Performance

If you want to reproduce the figures we mentioned in this document, you can clone our own version of the OWASP Benchmark (or the official one) and run your own scan on SonarCloud.io. Here is an example command line to trigger such a scan:

mvn clean package sonar:sonar \
 -Dsonar.projectKey=org.owasp:benchmark:changeme \
 -Dsonar.organization=changeme-github \
 -Dsonar.host.url=https://sonarcloud.io \
 -Dsonar.login=changeme \
 -Dsonar.scm.disabled=true \
 -Dsonar.cpd.exclusions=** \
 -Dsonar.branch.autoconfig.disabled=true

Despite the fact that we discarded some cases, we still analyze all the files available in the OWASP Benchmark; they are just placed in special directories so they are not counted when we compute the score.

On an average machine with an Intel Core i5 3570 @ 3.40 GHz and 16 GB of RAM, scanning the OWASP Benchmark should take less than 3 minutes.


Hi,

I’m trying to recreate the results from this post. I’ve cloned the Benchmark to my GitHub repo and have scanned it with SonarCloud as suggested.

Could you please let me know how to extract the CSV results file in order to generate the scorecards?

Thanks

Hello,

There is no automatic way to extract the results as a CSV file nor to recompute the ScoreCard. The script provided by the OWASP Benchmark for SonarQube is outdated, no longer works, and no one has taken the time to update it.
I produced these figures using a custom script relying on the SonarQube/SonarCloud API to extract the results as JSON data, and then I compared that with the expected/not-expected file.
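For illustration, here is a minimal sketch (not my actual script) of pulling issues from the SonarCloud Web API via /api/issues/search; the project key below is a placeholder, and a real script would also need to page through the results (ps/p parameters) and collect Security Hotspots separately:

// Illustrative only: fetching the raw issues JSON for a project from SonarCloud.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FetchIssues {

    public static void main(String[] args) throws Exception {
        String projectKey = "org.owasp:benchmark:changeme";   // placeholder, as in the mvn command above
        URI uri = URI.create("https://sonarcloud.io/api/issues/search"
                + "?componentKeys=" + projectKey
                + "&types=VULNERABILITY&ps=500");
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // The JSON body contains an "issues" array (rule, component, line, ...)
        // that can then be matched against the expected / not-expected test cases.
        System.out.println(response.body());
    }
}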

If you are using my version of the OWASP Benchmark, you will see that I sorted the test cases into different sub-directories, so it’s easier to review them manually.

Did you manage to reproduce the speed of analysis?

Alex


Hello,

Please note that nowadays, in 2023, as part of the App Defense Alliance CASA audit, auditors require official OWASP Benchmark results to be submitted alongside SAST results: “Guidelines for testing other tools” | App Defense Alliance

I’m using the OWASP Benchmark on a fresh installation of 9.9 LTS Enterprise. The official OWASP script seems to be totally broken, with jq “argument list too long” errors. If we clone and run the test, only a few results are picked up (~200 issues, i.e. the total result JSON file is only a few KB). This is evidently wrong and very small compared to the ~18000 vulnerabilities and ~1200 Security Hotspots recorded on the SonarQube dashboard. Using that to generate the scorecard, the actual score would be a red fail score close to 0-3 percentage points.

I managed to modify this script from the official OWASP repo: https://github.com/OWASP-Benchmark/BenchmarkJava/blob/master/scripts/runSonarQube.sh

I could work out collecting all Vulnerabilities and Security Hotspots and creating a somewhat good result JSON file (~16 MB in size), which ended up with the following score:

TPR 69.64% with FPR 23.52% = overall score of 46.26%

Not sure if the Benchmark is broken or SonarQube is not reporting what the Benchmark expects? There are some weird numbers in the score (marked in red):

SQL Injection: TPR 100% and FPR 100%
Trust Boundary: TPR 0% and FPR 0%

When executing ./createScorecards.sh, multiple lines of the following repeating errors appear:

  • SonarQubeReader: Unknown squid number: S5883 has no CWE mapping
  • WARN: Found new SonarQue HotSpot rule not seen before. Category: command-injection with message: “Make sure that this user-controlled command argument doesn’t lead to unwanted behaviour”
  • WARN: Failed to translate SonarQube security category: ‘command-injection’ with message: ‘Make sure that this user-controlled command argument doesn’t lead to unwanted behavior’
  • WARN: Failed to translate SonarQube security category: ‘others’ with message ‘Make sure creating this cookie without the “HTTPOnly” flag is safe.’

I think it would be great if the SonarSource team could contribute and provide an official, up-to-date OWASP-based benchmark setup that works with version 9.9 LTS, along with a script/guide to run it and get the reports out.


Just in case, here is the working PR/branch: “make runSonarqube.sh to work on LTS9.9” by zoobinn · Pull Request #196 · OWASP-Benchmark/BenchmarkJava · GitHub


Hello,

The SonarQube script provided by the OWASP Benchmark has indeed been broken for at least 4 years. SonarSource never participated in writing this script, and now we are in a situation where we are kind of forced to jump on it :frowning: It’s great that we have an awesome community and that you managed to update the script to make it work with SQ 9.9 LTS. I will definitely have a look at it and see if it matches our internal measurements.

At SonarSource, we are in the process of reviewing our coverage of well-known SAST benchmarks, since we haven’t done that in a while, with the goal of publishing an updated version of the post I wrote in 2019 and being transparent about our TPR on other SAST benchmarks.

Alex


Hello,

I just published a blog post about how Sonar scores on the Top 3 SAST Benchmarks and in particular on the OWASP Benchmark:

Please have a look at the blog post; it contains the link to the public repository with the ground truths (expected and not-expected issues) for each of the SAST benchmarks.

Regards
Alex
