Takeaways from building a SAST product, and why OWASP benchmark is not enough


In 2018, SonarSource decided to enter more seriously into the Static Application Security Testing (SAST) market. On top of our existing rules for detecting vulnerabilities, we developed a new security analysis engine to detect complex injection vulnerabilities, first targeting Java, C# and PHP.

Because we were new to the security domain, we looked for test cases to help us better understand which types of problems we should be able to detect. We quickly found the famous OWASP Benchmark, which is composed of 2,740 Java test cases. This benchmark really helped us bootstrap our development and measure our progress during initial development.

In this post, I’m going to explain how we used the OWASP Benchmark to improve our taint analyzer (in a SAST context, data that comes from users is considered “tainted”) and, most importantly, why we decided that getting a score of 100 on the OWASP Benchmark is not our goal.

The First 10 Months

We started implementation of our taint analyzer by targeting SQL Injection, which is probably the most famous vulnerability. We identified the basic sources of user input (e.g. HttpServletRequest.getParameter("ParamKey")) and the APIs where the vulnerability can be exploited (the “sinks” in SAST jargon). We were pretty proud of our initial results, but wanted to confirm them on more concrete projects, which is why we started using the OWASP Benchmark.
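
To make the source and sink vocabulary concrete, here is a minimal sketch. It assumes a plain Map standing in for HttpServletRequest parameters so that it runs without a servlet container; the class name, table name and column name are hypothetical. The tainted value flows from the source into a SQL string built by concatenation, which is the kind of flow a taint analyzer reports once that string reaches a sink such as java.sql.Statement.executeQuery.

```java
import java.util.Map;

// Illustrative only: a Map stands in for HttpServletRequest so the
// example runs without a servlet container. All names are hypothetical.
class SqlInjectionSketch {

    // Source: in a real servlet this would be request.getParameter("ParamKey").
    static String readParam(Map<String, String> request, String key) {
        return request.get(key);
    }

    // The tainted value is concatenated into the query text. Passing this
    // string to a sink such as java.sql.Statement.executeQuery(...) is
    // the flow a taint analyzer would report.
    static String buildUnsafeQuery(String userName) {
        return "SELECT * FROM users WHERE name = '" + userName + "'";
    }

    public static void main(String[] args) {
        Map<String, String> request = Map.of("ParamKey", "x' OR '1'='1");
        String tainted = readParam(request, "ParamKey");
        // The attacker-controlled quote changes the structure of the query.
        System.out.println(buildUnsafeQuery(tainted));
    }
}
```

The usual fix is a java.sql.PreparedStatement with a ? placeholder, so that user data is bound as a value instead of being parsed as SQL.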

You can split the benchmark test cases into two main sets: Injection Vulnerabilities and Non-Injection Vulnerabilities (see Appendix 2 for more details). Because we were developing a taint analyzer, we focused only on the 1572 Injection Vulnerabilities test cases.

Our first analyses of these cases were promising, with 27% true-positives. But there were also 21% false-positives, so we knew we had more work to do.

Digging into the false-positives made us realize we needed to track values in arrays and collections. We also had to add support for Objects and field access to our existing String variable support. Finally, we needed to add additional sources, sinks, and the standard methods that are used to make the data flow safe (i.e. sanitizers).
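
The propagation rules described above can be sketched with a toy model. This is not our engine: a real static analyzer reasons about abstract values without running the program, whereas this sketch simply carries a flag at runtime. The Value class, the sanitize method and its quote-doubling behavior are hypothetical illustrations.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of taint propagation; names and rules are hypothetical.
class TaintSketch {

    // A string paired with a flag saying whether it is user-controlled.
    static final class Value {
        final String text;
        final boolean tainted;

        Value(String text, boolean tainted) {
            this.text = text;
            this.tainted = tainted;
        }

        // Concatenation propagates taint from either operand.
        Value concat(Value other) {
            return new Value(text + other.text, tainted || other.tainted);
        }
    }

    // Source: anything coming from the user starts out tainted.
    static Value source(String userInput) {
        return new Value(userInput, true);
    }

    // Sanitizer (hypothetical): doubles single quotes and clears the flag.
    static Value sanitize(Value v) {
        return new Value(v.text.replace("'", "''"), false);
    }

    // Sink: returns true when an issue would be raised.
    static boolean sinkRaisesIssue(Value query) {
        return query.tainted;
    }

    public static void main(String[] args) {
        Value prefix = new Value("SELECT * FROM t WHERE c = '", false);

        // Taint must survive a round trip through a collection.
        List<Value> params = new ArrayList<>();
        params.add(source("x' OR '1'='1"));
        Value fromList = params.get(0);

        System.out.println(sinkRaisesIssue(prefix.concat(fromList)));           // true
        System.out.println(sinkRaisesIssue(prefix.concat(sanitize(fromList)))); // false
    }
}
```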

All along the way we’ve been releasing new versions to our community, and it was pretty exciting to see feedback coming in and results improving after each release. We ultimately reached 74% true-positives with pretty good performance (see Appendix 3), but weren’t satisfied with the 30% false-positive rate (false-positives are the worst: they can kill trust in a product). Digging into that, we started to understand some of the benchmark’s limits.

Reaching The Limits

Weird Test Cases

Execution Path Sensitivity

While digging into the issues generated by our analyzer, we discovered that a lot of the test cases were related to Execution Path Sensitivity, which is the ability to detect that an execution path won’t be reachable under certain conditions. All the Execution Path Sensitivity test cases contain an if statement that is either always true or always false, and the vulnerability is somewhere in the code that is not actually reachable. See BenchmarkTest00104 for a concrete example.
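
Here is a sketch of the pattern, written in the spirit of those test cases and not copied from BenchmarkTest00104 or any other specific case. The class and method names, the constant string and the query are hypothetical. The condition is always true, so the user-controlled value is never selected, yet an analyzer that is not path-sensitive still sees a possible flow to the query.

```java
// Illustrates an always-true condition guarding a tainted value.
// Not a copy of any benchmark test case; names are hypothetical.
class DeadBranchSketch {

    static String chooseValue(String param) {
        int num = 106;
        // (7 * 18) + 106 = 232, which is always greater than 200, so the
        // constant is always chosen and 'param' is never returned.
        return (7 * 18) + num > 200 ? "This_should_always_happen" : param;
    }

    public static void main(String[] args) {
        String bar = chooseValue("x' OR '1'='1");
        // A path-insensitive analyzer still considers that 'param' may
        // reach this query and raises an issue here.
        String sql = "SELECT * FROM users WHERE name = '" + bar + "'";
        System.out.println(sql);
    }
}
```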

At SonarSource, we believe that in real life, if a vulnerability exists in a branch of the code, eventually it will be called. Also, if you really have dead code and conditions that are always true or always false, that’s not a security issue, that’s a bug, and we have other rules (RSPEC-2589, RSPEC-2583, RSPEC-1850, RSPEC-1145) for that.

So we decided to discard Execution Path Sensitivity test cases and not consider them at all regardless of what weakness they were supposed to demonstrate. To be clear, our taint analyzer is not Execution Path Sensitive and probably never will be. If there’s a vulnerability in unreachable code, we’re going to raise an issue because we still think it should be fixed.

Trust Boundary

There are 126 test cases in the benchmark related to something called Trust Boundary. As it turns out, they aren’t actually exploitable, and Dave Wichers, the primary author of the OWASP Benchmark, proposed dropping them in a future version of the benchmark. We feel it would be incorrect to raise issues on these cases because in real life nothing bad can happen. So we also removed those cases from the scope of what we want to cover.

Path Traversal

Finally, 78% of the Path Traversal test cases treat instantiating a java.io.File from user-provided data as a problem, which we don’t think is correct (the same applies to instantiating java.io.FileOutputStream or java.io.FileInputStream, or to checking whether a file exists with File.exists()). Instantiating the class is not the problem; you are only at risk once you perform actions (read/write) on these objects. Therefore, we also excluded those test cases.
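
The distinction can be sketched as follows. The base directory and the class and method names are hypothetical, and the example only illustrates the argument above: building a java.io.File from user data creates a path object in memory, while reading through it is the kind of action where the risk materializes.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Illustrates instantiation versus action; paths and names are hypothetical.
class PathTraversalSketch {

    // Building a File from user data only creates a path object in memory.
    // No file is opened, read, or written here.
    static File buildOnly(String userInput) {
        return new File("/srv/uploads", userInput);
    }

    // Reading through that object is the kind of action (read/write)
    // described above as the real risk.
    static byte[] readContent(File file) throws IOException {
        return Files.readAllBytes(file.toPath());
    }

    public static void main(String[] args) {
        File f = buildOnly("../../etc/passwd");
        // Prints the composed path; nothing has been accessed on disk yet.
        System.out.println(f.getPath());
    }
}
```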

Where did that lead us?

In total, from the 1572 injection vulnerabilities test cases of the OWASP Benchmark, we discarded 372 and retained the 1200 that we feel are relevant to security testing. If you consider only these 1200 test cases, the SonarQube Developer Edition (as of Sept 2019) gets an OWASP Score of 84 with a True-Positive Rate of 85% and False-Positive Rate of 1%. And again, this was achieved by focusing on value and doing what makes sense, more than chasing the OWASP score itself.

What’s Next?

We will continue to improve our SAST engine to reduce our False-Positive Rate over time, and obviously we’ll work to improve our True-Positive Rate. But today more than before, getting an amazing OWASP Benchmark Score is not our goal. It would be completely wrong to get a score of 100 now that we understand the limits of the benchmark.

The OWASP Benchmark was a great set of test cases to bootstrap our SAST engine with, but it’s not the end of the journey. There’s still lots more to do! For instance, we want to improve our coverage of the rest of the OWASP Top 10 2017 categories, such as A4-XXE and A8-Insecure Deserialization. For all the OWASP categories, we need to follow the trend of Java frameworks (Vert.x, SparkJava) and consider the specificities of well-established ones (Spring Dependency Injection). We also need to improve our XSS rule to consider front-end technologies: today we stop following the execution flow when the data leaves the Java scope, and we need to continue tracking the data into the templating system (Thymeleaf, JSP) or the JavaScript front-end (React, Angular, Vue.js).

Obviously, continuing to use the OWASP Benchmark for this work would be ideal, but those things aren’t included in the benchmark today and it’s uncertain whether new test cases will be added in the future. The project is looking for maintainers, and while SonarSource could contribute - it’s part of our DNA to contribute to open-source - that would look weird, and it’s possible that we might inadvertently skew the results.

Plus, getting good results on any benchmark doesn’t necessarily mean you can detect real-world vulnerabilities. As a consequence we need other sources of test cases (CVE databases, GitHub commits) in order to be confident our taint analyzer can detect vulnerabilities in a majority of the OWASP Top 10 2017 categories. To make sure we can find these vulnerabilities we need good examples to work from, so we’re relying on the commit comments of vulnerability fixes in open source projects, and then using the “before” code as our examples. Studying these special commits helps us better understand which types of vulnerabilities are fixed these days and how developers managed to mitigate them.

The results of this effort are already available for on-premise pipelines with the SonarQube Developer Edition, and also through online code analysis using SonarCloud (which is totally free if your project is open-source). We currently cover injection flaw detection for Java, C# and PHP, and are already working on Python support.

Cheers,
Alex

Appendix 1: OWASP Benchmark Score

Here are the basics of how the OWASP Benchmark Score is computed.

TP - True-Positive: an issue is expected and the analysis finds it
FN - False-Negative: an issue is expected and the analysis does NOT find it
TN - True-Negative: no issue is expected and the analysis is correctly silent about it
FP - False-Positive: no issue is expected and the analysis raises an unexpected issue
TPR - True-Positive Rate: TPR = TP / ( TP + FN )
FPR - False-Positive Rate: FPR = FP / ( FP + TN )
OWASP Benchmark Score = TPR - FPR

In order to get a score of 100, you have to find all the real problems without raising any false-positives.
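
As a worked example, the formulas above translate directly into code. The counts used in main are made up purely to produce a True-Positive Rate of 85% and a False-Positive Rate of 1%; they are not the actual benchmark counts, and the class and method names are hypothetical.

```java
// Worked example of the score formula with made-up counts.
class BenchmarkScoreSketch {

    // TPR = TP / (TP + FN)
    static double truePositiveRate(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // FPR = FP / (FP + TN)
    static double falsePositiveRate(int fp, int tn) {
        return (double) fp / (fp + tn);
    }

    // Score = TPR - FPR, expressed in percentage points.
    static double score(int tp, int fn, int fp, int tn) {
        return (truePositiveRate(tp, fn) - falsePositiveRate(fp, tn)) * 100.0;
    }

    public static void main(String[] args) {
        // Hypothetical counts: TPR = 85 / 100, FPR = 1 / 100.
        System.out.println(Math.round(score(85, 15, 1, 99))); // prints 84
    }
}
```

With no false-negatives and no false-positives, the same function returns 100, which is the perfect score described above.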

If you look at the officially published OWASP Score for the “SonarQube Java Plugin”, you will see it is far from good at 33%. This bad score is linked to the fact that the OWASP Benchmark was last measured with SonarJava 3.14, which was released in Sept. 2016 - nearly three years ago at this writing - and at the time no one at SonarSource was looking to improve this score because developing security rules was not our main concern.

Things have changed a lot since that version. If you consider only the 1200 injection vulnerability test cases, the SonarQube Developer Edition gets an OWASP Score of 84 with a True-Positive Rate of 85% and False-Positive Rate of 1%.

This score was produced using SonarQube Developer Edition 7.9.1 running the Security Engine 8.0-M1.

Note: We tried to produce the input needed for the official OWASP benchmark scoring, but for technical reasons and because we discarded 372 test cases, we found it easier to compute our OWASP Score ourselves.

Appendix 2: OWASP Benchmark Content

Official OWASP Benchmark Content

The OWASP Benchmark only targets Java. It is made of 2740 test cases stored in a single directory named “testcode”. The expected results are described in a CSV file. For each test case it details:

  • which type of vulnerability it targets
  • whether an issue is expected
  • the CWE identifier related to the test case

The benchmark covers 11 types of vulnerability that we grouped into 2 sets:

  • Injection Vulnerabilities (6):
    • SQL Injection: 504 test cases
    • Path Traversal: 268 test cases
    • LDAP Injection: 59 test cases
    • Command Injection: 251 test cases
    • XPath Injection: 35 test cases
    • Cross-Site Scripting: 455 test cases
  • Non-Injection Vulnerabilities (5):
    • Cryptography: 246 test cases
    • Hashing: 236 test cases
    • Secure Cookie: 67 test cases
    • Trust Boundary: 126 test cases
    • Weak Random Number: 493 test cases

SonarSource OWASP Benchmark’s Content

In order to make it clear which test cases we excluded and on which ones an issue is expected, we cloned the official OWASP Benchmark and re-organized the test cases by vulnerability type (sqli, pathtraver, ldapi, …) and issueexpected / noissueexpected sub-directories. This helped us to see easily in SonarQube/SonarCloud when unexpected issues were raised.
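
The re-organization described above can be sketched as a mapping from one line of the expected-results CSV to a target directory. The column order assumed here (test name, category, whether an issue is expected, CWE number) and the sample lines are assumptions for illustration, as are the class and method names; check the CSV header before relying on it.

```java
// Sketch of the re-organization; the CSV column order is an assumption.
class ReorgSketch {

    // Expects a line shaped like "BenchmarkTest00001,pathtraver,true,22":
    // test name, category, whether an issue is expected, CWE number.
    static String targetDirectory(String csvLine) {
        String[] fields = csvLine.split(",");
        String category = fields[1].trim();
        boolean issueExpected = Boolean.parseBoolean(fields[2].trim());
        return category + "/" + (issueExpected ? "issueexpected" : "noissueexpected");
    }

    public static void main(String[] args) {
        // Sample lines are hypothetical.
        System.out.println(targetDirectory("BenchmarkTest00001,pathtraver,true,22"));
        System.out.println(targetDirectory("BenchmarkTest00002,sqli,false,89"));
    }
}
```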

Appendix 3: Performance

If you want to reproduce the figures we mentioned in this document, you can clone our own version of the OWASP Benchmark (or the official one) and run your own scan on SonarCloud.io. Here is an example command line to trigger such a scan:

mvn clean package sonar:sonar \
 -Dsonar.projectKey=org.owasp:benchmark:changeme \
 -Dsonar.organization=changeme-github \
 -Dsonar.host.url=https://sonarcloud.io \
 -Dsonar.login=changeme \
 -Dsonar.scm.disabled=true \
 "-Dsonar.cpd.exclusions=**" \
 -Dsonar.branch.autoconfig.disabled=true

Even though we discarded some cases, we still analyze all the files in the OWASP Benchmark; the discarded ones are simply placed in special directories so that they are not counted when we compute the score.

On an average machine with an Intel Core i5 3570 @ 3.40 GHz and 16 GB of RAM, scanning the OWASP Benchmark should take less than 3 minutes.
