Case Study: 3io0

3io0: EtuB from Clostridium kluyveri

3.0Å crystal, 230 protein residues (Heldt et al., 2009)

(Thank you very much to Richard Pickersgill for allowing me to use his model and data for this demonstration)

The goal of any crystallographic refinement package is to obtain the model consistent with its restraints that best explains the given dataset. While this is straightforward in principle, in practice the combination of low-resolution data and insufficient restraints can easily lead to an over-fitted model that appears to fit the data well but in fact contains various stereochemical errors.

The demonstration model bundled with ISOLDE (3io0: PDBe, PDB-REDO, accessed by clicking the "Load demo" button on the ISOLDE panel) is one of the best examples of this I have seen. For its resolution, the R-factors are remarkably low (R_work: 0.187, R_free: 0.202), yet the model contains six incorrect cis peptide bonds (and a further two that are severely twisted) as well as numerous flipped peptide bonds and incorrect rotamers, yielding a MolProbity score of 3.04.

I hasten to add that none of the errors I found are particularly unusual for models of this resolution and era. This is perhaps best exemplified by the MolProbity score itself: this scoring metric was designed such that models with a score approximately equal to their resolution had an "average" number of conformational errors for that resolution. My paper introducing ISOLDE discusses the conformational error rate in the PDB at some length (albeit for the slightly lower resolution 3.5-4.0Å range), and erroneous incorporation of non-proline cis bonds in particular was something of an epidemic at the time, apparently due mostly to the fact that common model-building and validation tools simply weren't flagging them.

The demonstration video here is an unedited real-time screencast covering the inspection and rebuilding of this model in ISOLDE. Key points to note:

Upon starting a simulation for the first time, the R-factors for this model immediately increase by about 3%. This is not a bad thing: it's a result of the rigorous handling of non-bonded interactions in the AMBER force field helping ISOLDE to avoid "lying" to you. Erroneous conformations are pushed out of their original over-fitted positions, reducing the correlation to the data and creating tell-tale difference density blobs.
As the model improves, the combination of live density map recalculation, live validation and AMBER forces often allows you to confidently choose between rotamers even for sidechains as small as valine, serine and threonine.
There is still no perfect substitute for human eyes looking through the model in detail. After checking and correcting (where necessary) all flagged peptide bond, Ramachandran and rotamer outliers, a final residue-by-residue inspection still picks up a half-dozen or so correctable errors.
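
For readers less familiar with the R-factors quoted above, here is a minimal sketch of how R_work and R_free are conventionally defined: the summed absolute difference between observed and scaled calculated structure-factor amplitudes, divided by the summed observed amplitudes, evaluated separately over the working reflections and the held-out "free" set. The function names, the single overall scale factor and the toy amplitudes are illustrative assumptions. Real refinement programs apply more elaborate scaling and bulk-solvent corrections, so this is not how either ISOLDE or phenix computes its values.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """Conventional R-factor: sum(|Fobs - k*Fcalc|) / sum(Fobs).

    f_obs and f_calc are equal-length sequences of non-negative
    structure-factor amplitudes; scale is a single overall factor k
    (a simplifying assumption for this sketch).
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    numerator = sum(abs(fo - scale * fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


def r_work_and_free(f_obs, f_calc, is_free, scale=1.0):
    """Return (R_work, R_free) given a per-reflection free-set flag."""
    work_obs, work_calc, free_obs, free_calc = [], [], [], []
    for fo, fc, free in zip(f_obs, f_calc, is_free):
        if free:
            free_obs.append(fo)
            free_calc.append(fc)
        else:
            work_obs.append(fo)
            work_calc.append(fc)
    return (r_factor(work_obs, work_calc, scale),
            r_factor(free_obs, free_calc, scale))


# Toy data (made up for illustration): four reflections, the last one "free".
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [90.0, 85.0, 60.0, 50.0]
is_free = [False, False, False, True]

r_work, r_free = r_work_and_free(f_obs, f_calc, is_free)
print(f"R_work = {r_work:.4f}, R_free = {r_free:.4f}")
# prints: R_work = 0.0625, R_free = 0.2500
```

Because the free reflections are never used to drive refinement, a model that has been over-fitted will typically look better on R_work than it deserves; the errors only become visible through independent checks such as geometry validation.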

Some vital statistics for the model as published, immediately after rebuilding in ISOLDE, and after further refinement in phenix.refine (mainly to re-refine B-factors, which ISOLDE doesn't currently touch) are shown in the table below. R-factors in brackets are as calculated by ISOLDE; all others are from phenix.model_vs_data.

	Before (as published)	After	After + phenix.refine
Ramachandran outliers (%)	1.76	1.32	0.88
Ramachandran favoured (%)	91.19	97.80	97.80
Rotamer outliers (%)	12.14	0.00	0.00
Clashscore	15.84	0.30	0.60
Non-proline cis bonds	6	0	0
MolProbity score	3.04	0.66	0.75
R_work	0.188 (0.218)	0.192 (0.222)	0.168 (0.189)
R_free	0.203 (0.227)	0.200 (0.230)	0.180 (0.199)
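
The MolProbity scores in the table can be reproduced from the clashscore, rotamer outlier and Ramachandran favoured rows. The sketch below assumes the commonly cited MolProbity score formula; its coefficients are not given in this case study and should be treated as an assumption, and the function name is purely illustrative. With the table values as input it returns the three scores listed above to two decimal places.

```python
import math


def molprobity_score(clashscore, rotamer_outliers_pct, rama_favoured_pct):
    """Combine three validation metrics into a single log-weighted score.

    Assumed formula (commonly cited for MolProbity, not stated in this text):
      0.426 * ln(1 + clashscore)
    + 0.33  * ln(1 + max(0, rotamer outliers % - 1))
    + 0.25  * ln(1 + max(0, (100 - Ramachandran favoured %) - 2))
    + 0.5
    """
    rama_not_favoured_pct = 100.0 - rama_favoured_pct
    return (
        0.426 * math.log(1.0 + clashscore)
        + 0.33 * math.log(1.0 + max(0.0, rotamer_outliers_pct - 1.0))
        + 0.25 * math.log(1.0 + max(0.0, rama_not_favoured_pct - 2.0))
        + 0.5
    )


# Inputs taken from the table: (clashscore, rotamer outliers %, Rama favoured %)
columns = {
    "Before (as published)": (15.84, 12.14, 91.19),
    "After": (0.30, 0.00, 97.80),
    "After + phenix.refine": (0.60, 0.00, 97.80),
}

for label, values in columns.items():
    print(f"{label}: {molprobity_score(*values):.2f}")
# prints:
# Before (as published): 3.04
# After: 0.66
# After + phenix.refine: 0.75
```

Under this formula the logarithmic weighting means the largest gains come from removing the first few clashes and outliers, which is why the rebuilt model's score drops so sharply.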