Hi everyone!
Thanks to all who attended our session yesterday. Please find below the questions asked during the webinar and the resources mentioned by the speaker.
Q&A
Q: Do Sonar’s internal Machine Learning experts use SonarQube?
A: Yes, we dogfood extensively. Personally, it has found bugs in my code that would otherwise have crashed my script after training the models and before saving them to disk.
Q: How can we better control our workflow’s entropic/noisy/chaotic nature?
A: There are ways to channel the chaos or abstract it out. Look into bootstrapping your next project (repo templates, Kedro, etc.); a good project structure goes a long way. Use the linters you're comfortable with (SonarQube for IDE, Black + Ruff, etc.). You have to be intentional about it: some of your time will need to be invested in managing the chaos/noise.
Q: How does Sonar help with MLOps?
A: Deployment and cloud management are open avenues for us at the moment. But if you're writing web servers, managing secrets and credentials in your repo, or building complex monitoring pipelines, you want to make sure your code is free of code smells and to minimize potential vulnerabilities. We can help there.
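For instance, a hardcoded credential is one of the most common issues a scanner will flag in serving or monitoring code. A minimal sketch of the pattern (the URL and environment variable name are placeholders for illustration):

```python
import os

import requests

# Risky: hardcoding a token in the repo is exactly the kind of issue
# static analysis should flag before it ships.
# API_TOKEN = "sk-live-..."  # never commit this

# Safer: pull the secret from the environment (or a secrets manager),
# so it never lives in version control.
API_TOKEN = os.environ["MONITORING_API_TOKEN"]

response = requests.get(
    "https://monitoring.example.com/api/metrics",  # placeholder URL
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
```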
Q: How does one effectively balance code maintainability and optimizing performance? Especially with large datasets?
A: Finding that sweet spot is key. What is clear is that prioritizing clear, readable code pays off when you need to understand what to change: it is harder to find performance bottlenecks and tweak code that is hard to understand. Premature optimization also leads to complex, hard-to-maintain code.
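One practical way to keep that balance is to write the readable version first and let a profiler tell you where the real bottleneck is before rewriting anything. A minimal sketch (the pipeline function is a stand-in for your own preprocessing code):

```python
import cProfile
import pstats

import numpy as np


def feature_pipeline(n_rows: int = 1_000_000) -> np.ndarray:
    """Stand-in for a real preprocessing step on a large dataset."""
    data = np.random.default_rng(0).normal(size=n_rows)
    # Readable first: a clear, vectorised transformation.
    return (data - data.mean()) / data.std()


if __name__ == "__main__":
    # Measure before optimizing: only rewrite the functions that actually
    # dominate the cumulative time reported here.
    profiler = cProfile.Profile()
    profiler.enable()
    feature_pipeline()
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```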
Q: Can you recommend the best tools for reproducibility of ML experiments?
A: As far as I can recall, there are no specific libraries that ensure reproducibility on their own, but there are steps you can take. For instance, pin your project's requirements: instead of adding 'torch' to your requirements file, specify exactly which versions of the libraries you're using. Also make sure that every time you generate randomness, you control the state/seed. Correct me if I'm wrong, Jean, but there is a rule in Sonar that helps with making sure you're setting the seed correctly and consistently, especially across multiple torch backends. But beyond that, I guess, it's just another thing to be intentional about.
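For illustration, a seed-setting helper along those lines might look like the sketch below (the torch calls assume PyTorch is in use; drop them if you only rely on NumPy):

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed every source of randomness used in the project."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy
    torch.manual_seed(seed)           # PyTorch CPU
    torch.cuda.manual_seed_all(seed)  # PyTorch, all CUDA devices
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Optional: trade some speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)
```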
Another crucial factor in ensuring reproducibility is, of course, taking care of data and configuration versions. There is no single tool for this, but I highly recommend using some sort of experiment and data-version tracking system (TensorBoard or Weights & Biases for the former; DVC for the latter, or even your own S3-bucket-based solution). Ideally, you should then make sure that your code version (git hash), data version (DVC version ID), and experiment version (all your configurations) are recorded for each experiment. I find that to be good enough insurance of reproducibility for a reasonable amount of effort.
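As a sketch of that last point, recording the three versions per run can be as simple as dumping a small manifest next to the experiment outputs. The data-version field is left as a placeholder since it depends on how you version data, and the config values are made up for illustration:

```python
import json
import subprocess
from datetime import datetime, timezone


def record_run(config: dict, output_path: str = "run_manifest.json") -> None:
    """Save code version, data version and configuration for one experiment."""
    git_hash = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "git_hash": git_hash,
        "data_version": "<dvc-version-id>",  # placeholder: fill from your data-versioning tool
        "config": config,
    }
    with open(output_path, "w") as f:
        json.dump(manifest, f, indent=2)


record_run({"lr": 1e-3, "batch_size": 64, "epochs": 10})
```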
Q: Do you have samples or recommendations around CI/CD pipelines for Python code bases, and especially notebooks? It's very different from traditional Java CI/CD processes.
A: This depends on your pipeline requirements and tooling integrations (e.g. Databricks notebooks via Azure DevOps). It is also an emerging field with endless options: start with established DevOps best practices and adapt them to your needs.
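As one possible building block, a CI step can execute notebooks end to end as a smoke test before merging. A minimal sketch using nbformat and nbconvert (the notebook path is a placeholder):

```python
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

# Placeholder path: point this at the notebooks your pipeline should check.
NOTEBOOK = "notebooks/training_report.ipynb"

nb = nbformat.read(NOTEBOOK, as_version=4)
executor = ExecutePreprocessor(timeout=600, kernel_name="python3")

# Raises an error if any cell fails, which in turn fails the CI job.
executor.preprocess(nb, {"metadata": {"path": "notebooks/"}})
print(f"{NOTEBOOK} executed cleanly")
```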
Q: We use Sonar Scan for Python for our ML Algos, does Sonar offer something more specific for ML/AI?
A: Yes, we do have rules specific to ML/AI, including rules for PyTorch, pandas, NumPy and other commonly used ML libraries. Have a look at Python static code analysis.
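To give a flavour of what library-specific rules look for, here is an illustrative pandas pitfall (an example of the kind of pattern such rules target, not a reference to a particular Sonar rule): chained indexing can silently operate on a copy, so the intended update never lands in the DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"score": [0.2, 0.8, 0.5], "label": [0, 1, 0]})

# Problematic: chained indexing may assign to a temporary copy,
# leaving `df` itself unchanged (pandas emits SettingWithCopyWarning).
# df[df["score"] > 0.5]["label"] = 1

# Preferred: a single .loc call updates the original DataFrame.
df.loc[df["score"] > 0.5, "label"] = 1
print(df)
```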
Q: Does the Sonar platform also offer any utility code tools to fix anomalies/issues automatically using AI capabilities?
A: Yes, the commercial editions of SonarQube (Cloud and Server) include AI CodeFix, which uses an LLM to offer code-fix suggestions. Have a look here: AI CodeFix: Automatically Generate AI Code Fix Suggestions
Resources
- Machine Learning Robustness: A Primer - Houssem Ben Braiek, Foutse Khomh
- Hidden Technical Debt in Machine Learning Systems - D. Sculley et al
- Machine Learning: The High-Interest Credit Card of Technical Debt - D. Sculley et al
- What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities - Chattopadhyay et al
- A Large-Scale Study About Quality and Reproducibility of Jupyter Notebooks - Pimentel et al
- MAD landscape - https://mad.firstmark.com/