Validate and standardize chemical structures for machine learning with one tool

Post author: @githubprojects


Clean Your Data, Train Better Models: Introducing ChemAudit

If you've ever worked on a machine learning project involving chemical data, you know the dirty secret: your dataset is probably a mess. Inconsistent atom labels, weird valence states, and salts hanging off your molecules can silently tank your model's performance. Cleaning and standardizing structures by hand is a tedious, error-prone chore.

That's where ChemAudit comes in. It's a new open-source tool designed to validate and standardize chemical structures specifically for machine learning pipelines. Think of it as a quality control checkpoint for your cheminformatics data, ensuring everything is consistent and model-ready before training begins.

What It Does

ChemAudit is a Python library that takes a list of chemical structures (as SMILES strings) and runs them through a series of checks and corrections. It doesn't just flag problems; it also fixes common issues. Its main jobs are to standardize atom labeling so that the same atom is always written the same way (in SMILES, a lowercase symbol such as "n" marks an aromatic atom, so this presumably means consistent aromatic notation rather than letter case alone), neutralize charges where appropriate, remove solvents and salts, and generate canonical SMILES. The output is a clean, uniform dataset that's far more reliable for building predictive models.
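To make those steps concrete, here is a minimal sketch of the same kind of cleanup written directly against RDKit's MolStandardize module. This is not ChemAudit's actual implementation, which may order or configure the steps differently. The function name standardize_smiles is hypothetical and used only for illustration.

```python
from typing import Optional

from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardize_smiles(smiles: str) -> Optional[str]:
    """Hypothetical helper: canonical SMILES of the neutralized parent, or None if invalid."""
    # MolFromSmiles returns None when parsing or sanitization (e.g. a valence check) fails.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # General cleanup: normalizes common functional-group drawings and reionizes.
    mol = rdMolStandardize.Cleanup(mol)
    # Keep only the largest fragment, which drops counter-ions and small solvents.
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)
    # Neutralize charges where that can be done by adding or removing hydrogens.
    mol = rdMolStandardize.Uncharger().uncharge(mol)
    # MolToSmiles emits canonical SMILES by default.
    return Chem.MolToSmiles(mol)
```

Keeping the largest fragment is a common heuristic for salt stripping, but it can discard the wrong component for mixtures or very large counter-ions, so it is worth spot-checking on your own data.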

Why It's Cool

The clever part is its focused, pipeline-friendly design. It's not a general-purpose molecule editor; it's built for the specific preprocessing needs of ML. It works in batches, which is perfect for processing entire datasets. Under the hood, it leverages the powerful RDKit library, but it wraps that power in a simple, single-function interface: audit_smiles(). You give it a list, you get back a clean list, along with a report of what was changed. This simplicity makes it easy to slot into an existing data preparation script or MLOps workflow.
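The "list in, clean list plus report out" pattern is easy to picture with a small sketch. The code below is an illustration of that batch pattern only, not ChemAudit's code: audit_batch, toy_standardize, and the report fields are all hypothetical names, and the placeholder cleaner does no real chemistry.

```python
from typing import Callable, Dict, List, Optional, Tuple


def toy_standardize(smiles: str) -> Optional[str]:
    """Placeholder cleaner: trims whitespace and rejects empty strings (not real chemistry)."""
    stripped = smiles.strip()
    return stripped or None


def audit_batch(
    smiles_list: List[str],
    standardize: Callable[[str], Optional[str]],
) -> Tuple[List[str], List[Dict[str, object]]]:
    """Apply `standardize` to every input; return kept outputs plus one report row per input."""
    cleaned: List[str] = []
    report: List[Dict[str, object]] = []
    for index, raw in enumerate(smiles_list):
        result = standardize(raw)
        if result is None:
            status = "rejected"   # could not be cleaned, so it is left out of the output
        elif result == raw:
            status = "unchanged"  # already in standard form
        else:
            status = "modified"   # the cleaner rewrote the structure
        if result is not None:
            cleaned.append(result)
        report.append({"index": index, "input": raw, "output": result, "status": status})
    return cleaned, report


cleaned, report = audit_batch(["CCO", " CCO ", ""], toy_standardize)
```

Recording the original index in each report row lets you map rejected or modified structures back to their labels, which matters when the cleaned list ends up shorter than the input.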

For researchers and developers, this means you can stop writing one-off sanitization scripts and spend more time on the actual model. It brings consistency, which is crucial for reproducibility. If your team uses ChemAudit, everyone starts from the same standardized data, making experiments comparable.

How to Try It

Getting started is straightforward. ChemAudit is on PyPI, so you can install it with pip.

pip install chemaudit

A basic usage example looks like this:

from chemaudit import audit_smiles

# Your raw, potentially messy input
raw_smiles_list = ["CCO", "[Na+].OC(=O)c1ccccc1", "c1ccccc1C=O"]

# Run the audit
clean_smiles, audit_report = audit_smiles(raw_smiles_list)

print("Cleaned SMILES:", clean_smiles)
print("\nReport:", audit_report)

Head over to the ChemAudit GitHub repository for the full source code, more detailed examples, and to contribute or report issues.

Final Thoughts

In machine learning, it's garbage in, garbage out. For cheminformatics, the "garbage" is often just inconsistent structure representation. ChemAudit tackles this foundational problem directly. It's a practical, no-nonsense tool that feels like it was built by developers who've suffered through the data-cleaning process themselves. If you're building models with chemical data, adding this single step to your preprocessing could save you hours of debugging and help your results hold up on new data. It's the kind of utility that can quickly become a regular part of your toolkit.

Project ID: 37807508-38bd-47df-a113-58ac24e96d02
Last updated: April 14, 2026 at 07:34 AM