Inconsistencies have been uncovered in how acid dissociation constants for zwitterionic compounds are recorded in chemical databases, and in how they are used in modelling.1 This could have a significant impact on areas like drug design or environmental chemistry where pKa values play a crucial role. ‘We found that the ChEMBL database, one of the largest data repositories for biochemicals – and frequently used as a data corpus for training pKa models – includes many incorrect pKa values due to this nomenclature issue,’ says Jonathan Zheng from the Massachusetts Institute of Technology, who participated in the study.
The pKa of a molecule describes how that molecule behaves at different pH values, influencing important properties such as its solubility in water and other media or its ability to penetrate cell membranes. In pharmaceutical chemistry, this property can indicate whether a compound is suitable for medical applications or not. ‘Drug molecules often form zwitterions – molecules that have distinct charged centres while being neutral overall,’ explains Zheng. ‘Our results could therefore help to avoid confusion during drug development efforts, as well as in the broader literature.’
Kai Leonhard from RWTH Aachen University in Germany, who wasn’t involved in the research, agrees that the errors discovered by the US team might affect areas such as drug discovery. ‘Candidates for new medicines, even if active in vitro, are screened with respect to several properties like their solubility in blood. So, it could happen that an effective drug candidate isn’t brought into clinical studies just because the misinterpreted pKa data suggests it’s not soluble in blood, although it actually is.’
The scientists realised that something was wrong when they weren’t able to reconcile the values obtained from pKa prediction models with experimental data that had previously been compiled by the International Union of Pure and Applied Chemistry (Iupac). After discovering the mismatch between the experimental data and the values in ChEMBL, Zheng and colleagues noticed that a popular machine-learning model called QupKake,2 which was trained on ChEMBL data, was also less accurate for zwitterionic compounds.
These discrepancies result from confusion about what the terms acidic and basic mean when describing pKa values, points out Zheng. He explains that for compounds that can form zwitterions, these designations are ambiguous because of the presence of different isomers (uncharged and dipolar) in solution.
‘While it may seem simple to tell what a dissociation constant is, matters can become complex for molecules with multiple acidic and basic functional groups,’ comments Leonhard. ‘In this case, the pKa value of one group may depend on the protonation states of the others.’
That’s why, decades ago, chemists thought it would be reasonable to label the lower pKa of zwitterion-forming compounds as acidic and the higher pKa as basic, while the opposite convention is used for compounds that don’t really form zwitterions, notes Zheng. But he adds that this difference in nomenclature isn’t widely known, or if known, it isn’t handled consistently. ‘Modellers and data curators typically use one of these conventions, applying it broadly to all species, which leads to incorrect applications of the data.’
Zheng says that to fix this, data curators may have to re-examine many compounds for which errors have been identified and use more precise labels and metadata in future. He suggests avoiding acidic and basic as pKa labels, and instead either using pKa to refer only to acidic phenomena or using the labels proton gain and proton loss. He also recommends introducing the tags ‘macroscopic’ and ‘microscopic’ whenever possible to indicate whether a reported pKa value refers to the ensemble of multiple isomeric forms of an acid or to a specific isomer.
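One way to picture this more explicit labelling is a record that stores the direction of proton transfer and the macroscopic or microscopic scope alongside each value. The sketch below is purely illustrative: the names PKaRecord, ProtonTransfer, Scope and reference_species are hypothetical and come from neither the study nor ChEMBL, and the glycine values are the commonly quoted textbook macroscopic constants (about 2.34 and 9.60), not figures taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class ProtonTransfer(Enum):
    # Direction is stated relative to the reference species named in the record.
    PROTON_LOSS = "proton loss"
    PROTON_GAIN = "proton gain"


class Scope(Enum):
    MACROSCOPIC = "macroscopic"  # ensemble of isomeric forms with the same net charge
    MICROSCOPIC = "microscopic"  # one specific isomer


@dataclass(frozen=True)
class PKaRecord:
    """Hypothetical record layout; not a real database schema."""
    compound: str
    value: float
    transfer: ProtonTransfer
    scope: Scope
    reference_species: str  # assumption: the form the transfer is counted from


# Glycine, labelled relative to its net-neutral form (textbook values).
glycine_records = [
    PKaRecord("glycine", 2.34, ProtonTransfer.PROTON_GAIN,
              Scope.MACROSCOPIC, "net-neutral form"),
    PKaRecord("glycine", 9.60, ProtonTransfer.PROTON_LOSS,
              Scope.MACROSCOPIC, "net-neutral form"),
]


def describe(record: PKaRecord) -> str:
    """Render a record as an explicit, human-readable label."""
    return (f"{record.compound}: pKa {record.value:.2f} "
            f"({record.scope.value}, {record.transfer.value}, "
            f"relative to {record.reference_species})")


for rec in glycine_records:
    print(describe(rec))
```

With the direction and scope spelled out, a reader no longer has to guess which convention for acidic and basic a curator had in mind.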
The magnitude of the mismatch between experimental data and modelling that has made its way into databases depends on the molecule and the type of error made. However, it can be significant. For example, the values the team calculated for glycine showed that modelling errors could put dissolved ion concentrations out by two orders of magnitude, while errors where proton gain was used when the user wanted proton loss or vice versa led to values out by as much as five orders of magnitude.
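To get a feel for how a labelling mix-up can shift predicted concentrations by orders of magnitude, the sketch below computes the species distribution of a generic diprotic system from two macroscopic pKa values, then repeats the calculation with the two values assigned to the wrong steps. It is not the team's calculation: the function name diprotic_fractions is hypothetical, the glycine constants (2.34 and 9.60) are textbook values, and activity effects are ignored.

```python
import math


def diprotic_fractions(ph, pka_cation_to_neutral, pka_neutral_to_anion):
    """Return the fractions (cation, net-neutral, anion) at a given pH.

    Standard equilibrium expressions for H2A+ <-> HA <-> A-,
    using macroscopic constants and ideal-solution behaviour.
    """
    h = 10.0 ** (-ph)
    k1 = 10.0 ** (-pka_cation_to_neutral)
    k2 = 10.0 ** (-pka_neutral_to_anion)
    denom = h * h + k1 * h + k1 * k2
    return (h * h / denom, k1 * h / denom, k1 * k2 / denom)


PH = 7.0

# Intended reading: 2.34 governs cation -> neutral, 9.60 governs neutral -> anion.
correct = diprotic_fractions(PH, 2.34, 9.60)

# Mislabelled reading (illustrative): the two values are assigned to the wrong steps.
swapped = diprotic_fractions(PH, 9.60, 2.34)

# Size of the error in the predicted net-neutral fraction, in orders of magnitude.
orders_off = math.log10(correct[1] / swapped[1])

print(f"net-neutral fraction, correct labels: {correct[1]:.4f}")
print(f"net-neutral fraction, swapped labels: {swapped[1]:.2e}")
print(f"difference: about {orders_off:.1f} orders of magnitude")
```

In this toy case the correctly labelled values predict that glycine is almost entirely in its net-neutral form at pH 7, while the swapped assignment predicts that this form is only a trace species.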
‘In general, we believe that researchers should spend more time in carefully examining and curating data for any chemical property,’ concludes Zheng. ‘Based on our experience, it is very likely that systematic issues are pervasive in other physicochemical property data as well.’
References
1 J Zheng et al, J. Chem. Inf. Model., 2024, DOI: 10.1021/acs.jcim.4c01420
2 OD Abarbanel and GR Hutchison, J. Chem. Theory Comput., 2024, 20, 6946 (DOI: 10.1021/acs.jctc.4c00328)