Introducing SonarSweep: Improve training data quality for coding LLMs

Sonar is excited to announce SonarSweep, a new service designed to improve the quality of the coding datasets used to train LLMs, both in pre-training and in post-training (including supervised fine-tuning and reinforcement learning).

As developers, many of us are now using AI coding tools in our daily work. They can be incredibly helpful for productivity, but we’ve also seen that the quality and security of the code they generate can be inconsistent. Sometimes it’s great, and other times it contains bugs, security vulnerabilities, or maintainability issues.

At Sonar, we’ve been looking into why this happens, and the root cause is simple: an AI model is only as good as the data it was trained on. To address this, we are building SonarSweep.

SonarSweep is engineered to systematically remediate, optimize, and secure coding datasets for model training. It proactively ensures that models learn from high-quality, secure examples at every stage, from pre-training to model alignment—an essential step toward building reliable AI coding models. In our testing, models trained on data prepared by SonarSweep produced code with up to 67% fewer security vulnerabilities and up to 42% fewer bugs than models trained on the original, un-swept data, with no loss in functional performance.
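To make the idea concrete, here is a minimal sketch of what a dataset-sweeping step might look like. This is not the SonarSweep API: the names (`Sample`, `passes_quality_gate`, `sweep_dataset`) are hypothetical, and the quality gate below is a trivial parse check standing in for real static analysis of bugs and vulnerabilities.

```python
import ast
from dataclasses import dataclass


@dataclass
class Sample:
    """One training example: a prompt paired with candidate code."""
    prompt: str
    code: str


def passes_quality_gate(code: str) -> bool:
    """Stand-in quality check: here, just 'is it valid Python?'.

    A real sweep would run full static analysis (bugs, security
    vulnerabilities, maintainability issues) instead of this parse check.
    """
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def sweep_dataset(samples: list[Sample]) -> list[Sample]:
    """Keep only samples whose code passes the quality gate.

    A production pipeline might remediate flagged samples rather
    than simply dropping them.
    """
    return [s for s in samples if passes_quality_gate(s.code)]


if __name__ == "__main__":
    raw = [
        Sample("add two numbers", "def add(a, b):\n    return a + b\n"),
        Sample("broken example", "def add(a, b)\n    return a + b\n"),  # missing colon
    ]
    clean = sweep_dataset(raw)
    print(f"kept {len(clean)} of {len(raw)} samples")
```

The filter-or-remediate choice matters in practice: dropping flagged samples shrinks the dataset, while remediating them preserves coverage at the cost of a more involved pipeline.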

Additional detail on our testing can be found in the blog post. SonarSweep is now available in early access.