Understanding the diversity of the metal-organic framework ecosystem

Development of descriptors for MOF chemistry

One of the aims of this work is to express the diversity of a MOF database in terms of features that can be related to the chemistry used in synthesizing MOFs as well as in generating the libraries of hypothetical structures. At present, different strategies have been developed to represent MOFs with feature vectors9,10,11,12. However, the global material descriptors9,13,14,15,16 that are presently used are not ideal for our purpose. We would like to connect directly to the structural building blocks of MOFs, which closely resemble the chemical intuition of MOF chemists, in which a MOF is a combination of the pore geometry and chemistry (i.e., metal nodes, ligands and functional groups)6,17. However, it is important to note that in developing these descriptors, it is impossible to completely separate the different effects and scopes. For example, for some MOFs adding a functional group can completely change the pore shape. Hence, depending on the details of the different types of descriptors and the properties of interest, one set of descriptors may register this mainly as a pore-shape effect, while another set will assign it to a functional-group effect.

To describe the pore geometry of nanoporous materials we use simple geometric descriptors, such as the pore size and volume18. For the MOF chemistry, we adapt the revised autocorrelations (RACs) descriptors19, which have been successfully applied19,20,21,22 for building structure–property relationships in transition metal chemistry19,23. RACs are discrete correlations between heuristic atomic properties (e.g., the Pauling electronegativity, nuclear charge, etc.) of atoms on a graph. We compute RACs using the molecular or crystal graphs derived from the adjacency matrix computed for the primitive cell of the crystal structure (see the “Methods” section). To describe the MOF chemistry, we extended conventional RACs to include descriptors for all domains of a MOF material, namely metal chemistry, linker chemistry, and functional groups (Fig. 1 and the “Methods” section).
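
As an illustration of what a "discrete correlation on a graph" means, the following minimal sketch computes a start-scoped product autocorrelation: for each start atom, the property of that atom is multiplied by the property of every atom exactly `depth` bonds away, and the products are summed. This is a simplified reading of the RAC idea, not the implementation used in this work; the function names, the toy four-atom graph and the use of nuclear charge as the property are assumptions made for the example, and periodic images of a crystal graph are not handled.

```python
from collections import deque

def shortest_path_lengths(adjacency, source):
    """Breadth-first search: bond-path distance from `source` to each reachable atom."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist

def product_rac(adjacency, prop, start_atoms, depth):
    """Sum of prop[i] * prop[j] over start atoms i and atoms j exactly `depth` bonds from i."""
    total = 0.0
    for i in start_atoms:
        for j, d in shortest_path_lengths(adjacency, i).items():
            if d == depth:
                total += prop[i] * prop[j]
    return total

# Hypothetical toy fragment Zn-O-C-O as a finite graph (atom index -> bonded atoms).
toy_adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
# Nuclear charge as the heuristic atomic property: Zn = 30, O = 8, C = 6.
toy_charge = {0: 30, 1: 8, 2: 6, 3: 8}

# Metal-centred descriptors: start from the Zn atom and walk outwards.
metal_racs = [product_rac(toy_adjacency, toy_charge, [0], d) for d in range(4)]
print(metal_racs)
```

Changing `start_atoms` is what distinguishes the domains in this sketch: starting from metal atoms gives metal-centred descriptors, while starting from linker connecting atoms or functional-group atoms on a linker graph would give the corresponding linker or functional-group descriptors.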

Fig. 1: Description of the three domains of MOF chemistry.

Metal centre RACs are computed on the crystal graph. Linker and functional-group RACs are computed on the corresponding linker molecular graph. Linker chemistry includes two types of RACs, namely full linker and linker connecting atoms. The graphs show the start atom (in green) and the nearby atom (in orange) used to define the RACs descriptors (see the “Methods” section).

Description of the databases

We consider several MOF databases: one experimental and five with in silico predicted structures (see Supplementary Note 2 for more details of databases). The Computation-Ready, Experimental (CoRE)2,24,25,26 MOF database represents a selection of synthesised MOFs.

The first in silico generated MOF database (hMOF) was developed by Wilmer et al.3 using a “Tinkertoy” algorithm by snapping MOF building blocks to form 130,000 MOF structures. This Tinkertoy algorithm, however, gave only a few underlying nets27. An alternative approach, using topology-based algorithms has been applied by Gomez-Gualdron et al.28 for their ToBaCCo database (~13,000 structures), and by Boyd and Woo4,29 for their BW-DB (over 300,000 structures). A comprehensive review of this topic can be found here30.

We use CoRE-2019 and a diverse subset of 20,000 structures from the BW-DB (called BW-20K) to establish the validity of the material descriptors. In addition, a relatively small database of around 400 structures developed by Anderson et al.14 (ARABG-DB) was included for comparison with their conclusions about the importance of structural domains14. For this test, we focus on adsorption properties, as their accurate prediction requires a meaningful descriptor for both the chemistry and the pore geometry. We study the adsorption properties of methane and carbon dioxide. Because of their differences in chemistry (i.e., molecule shape and size, and the non-zero quadrupole moment of carbon dioxide), designing porous materials with desired adsorption properties requires different strategies for each gas. To emphasize these differences, we study the adsorption properties at three different conditions, namely infinite dilution (i.e., the Henry regime), low pressure and high pressure.

Predicting adsorption properties of MOFs

We first establish that our descriptors capture the chemical similarity of MOF structures. As a test we show that instance-based machine-learning models (kernel ridge regression (KRR)) using these descriptors can accurately predict adsorption properties. A KRR model with a radial basis function kernel uses only similarity that is quantified using pairwise distances in the feature space; hence, the performance of the model can demonstrate the validity of the descriptors. KRR models show good performance in predictions of the adsorption properties of the CoRE-2019 and BW-20K databases (see Supplementary Note 3 for parities and statistics). We observe that for those properties that are less dependent on the chemistry, e.g., the high-pressure applications of CH4 and CO2, the geometric descriptors are sufficient to describe the materials, with the average relative error (RMAE) in the prediction of the gas uptake being below 5%. In addition, if we compare the relative ranking of the materials, we also obtain satisfactory agreement, as expressed by a Spearman rank correlation coefficient (SRCC) above 0.9. On the other hand, for the applications where chemistry plays a role, e.g., the Henry coefficient of CO2, the chemical descriptors are essential to accurately predict the materials properties (RMAE ~ 5% and SRCC ~ 0.8). The performance and accuracy of our models are comparable with those of prior studies14,31,32,33,34,35 (see a comprehensive list in ref. 36). However, comparing the accuracy and performance of different models and feature sets requires a benchmark study on a fixed, highly diverse set of materials and their corresponding properties; we observe, for example, that the performance of machine-learning models varies considerably from one database to another.
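
To make concrete why a KRR model with a radial basis function kernel depends only on pairwise distances in feature space, the sketch below implements the textbook form of the method in plain Python. It is not the code or the hyperparameters used in this work: the kernel width `gamma`, the regularisation strength `lam`, the helper names and the one-dimensional toy data are all assumptions chosen for illustration.

```python
import math

def rbf_kernel(a, b, gamma):
    """Similarity that depends only on the squared Euclidean distance between a and b."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def krr_fit(X, y, gamma, lam):
    """Dual weights alpha solving (K + lam * I) alpha = y."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve_linear(K, y)

def krr_predict(X_train, alpha, x_new, gamma):
    """Prediction is a similarity-weighted sum over the training structures."""
    return sum(a * rbf_kernel(xt, x_new, gamma) for a, xt in zip(alpha, X_train))

# Hypothetical toy data: one descriptor per "material" and a made-up target property.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 4.0, 9.0]
gamma, lam = 1.0, 1e-8
alpha = krr_fit(X_train, y_train, gamma, lam)
print(krr_predict(X_train, alpha, [1.5], gamma))
```

Because the prediction for a new structure is a weighted sum of its kernel similarities to the training structures, the model can only be accurate if nearby points in descriptor space have similar properties, which is the sense in which its performance tests the descriptors.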

The significance of the chemical descriptors is further illustrated by the predictions of the maximum positive charge (MPC) and the minimum negative charge (MNC) of MOF structures (SRCC above 0.9 and 0.7, respectively). The geometric descriptors are nearly irrelevant for these charges (SRCCs below 0.5 for all cases). This explains the relatively poor performance in prediction of CO2 adsorption properties at low pressures using only geometric descriptors as electrostatic interaction plays a crucial role. This analysis shows that our RACs and geometric descriptors are meaningful representations for the chemical space of MOFs for both CH4 and CO2 adsorption over the complete range of pressures. As a consequence, if two materials have similar descriptors, their adsorption properties will be similar. Hence, we can now quantify how the different regions of design space are covered by the different databases.

Diversity of MOF databases

We define the current chemical design space as the combination of all the synthesized materials and the in silico predicted structures, i.e., all the materials in the known databases. The real chemical design space, of course, can be much larger, as one can expect that novel classes of MOFs will be discovered. It is instructive to visualize how each MOF database is covering the current design space. This design space, as described by our descriptors, is a high-dimensional space and to visualize this we make a projection on two dimensions.

The projection of the pore geometry of our current design space is shown in Fig. 2a. The colour distribution shows a gradient in the pore size of the MOFs, from small to large pores moving on the map from left to the right. Other panels show how the different MOF databases are covering this space. The distributions of the geometric properties of the databases are considerably different from each other (Fig. 2b–d). For example, the experimental MOFs (CoRE-2019) are mainly in the small pore region of the map. Remarkably, the hypothetical databases also have very different distributions. While BW-DB covers the intermediate pore size regions, ToBaCCo is biased to the large pore regions of the design space.

Fig. 2: Map of the pore geometry of MOFs.

To project the geometric descriptor space of MOFs to a 2D map we use the t-distributed stochastic neighbour embedding (t-SNE)67 method (see Supplementary Note 6 for principal component analysis (PCA)). The t-SNE method preserves local similarity, ensuring similar structures are mapped close to each other in two dimensions. a The current design space colour coded with the largest included sphere. In (b), (c), and (d), the green, blue and red dots are representing the materials in the CoRE-2019, BW-DB and ToBaCCo databases, respectively, which are overlaid on the design space represented in grey. PCA plots show a similar distribution of databases (see Supplementary Note 6).

The hypothetical structures have been generated to explore the design space of MOFs beyond the experimentally known structures. In Fig. 3 we show how these databases cover the design space (see Supplementary Note 6 for the distribution of each database and for the PCA method). We use diversity metrics37 to quantify the coverage of these databases in terms of variety (V), balance (B) and disparity (D). The pore geometry, linker chemistry and functional-group design spaces are well covered and sampled by the hypothetical databases. However, we observe a serious limitation in diversity, in particular in the variety of the metal chemistry in hypothetical databases (Fig. 3b). Compared with the experimental database, the variety of metal chemistry covered by the hypothetical databases is surprisingly low; only a limited number of MOF metal centres are present (18 metal SBUs for all hypothetical databases, see Supplementary Note 14).

Fig. 3: Diversity metrics and maps of different domains of MOF structures.

The t-SNE method was used to project the a pore geometry, b metal chemistry, c linker chemistry and d functional groups descriptor spaces to 2D maps. Only descriptors up to the second coordination shell were included for metal chemistry to emphasize the local metal chemistry environment. In each panel, the structures from the hypothetical databases are coloured and overlaid on the entire known design space represented in grey. The radar charts show the three diversity metrics: variety (V), balance (B) and disparity (D), for the three databases. For this analysis, first we discretize the space into a fixed number of bins. Variety measures the number of bins that are sampled, balance the evenness of the distribution of materials among the sampled bins, and disparity the spread of the sampled bins (see the “Methods” section for more details).
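
The three metrics described in this caption can be sketched for a binned 2D map as follows. The caption fixes only what each metric measures; the specific formulas below are assumptions made for illustration (variety as the fraction of bins sampled, balance as Pielou's evenness of the occupancy distribution, and disparity as the mean pairwise distance between sampled bins, a simple stand-in for "spread"). The "Methods" section remains the authoritative definition, and the function and variable names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def diversity_metrics(bin_of_each_material, n_bins_total):
    """Return (variety, balance, disparity) for materials assigned to 2D grid bins.

    bin_of_each_material: one (row, col) bin index per material.
    n_bins_total: number of bins the design space was discretized into.
    """
    counts = Counter(bin_of_each_material)
    sampled = list(counts)
    n_materials = sum(counts.values())

    # Variety (assumed form): fraction of all bins containing at least one material.
    variety = len(sampled) / n_bins_total

    if len(sampled) > 1:
        # Balance (assumed form): Shannon entropy of bin occupancy divided by its
        # maximum, so 1.0 means materials are spread evenly over the sampled bins.
        probs = [c / n_materials for c in counts.values()]
        entropy = -sum(p * math.log(p) for p in probs)
        balance = entropy / math.log(len(sampled))
        # Disparity (stand-in): mean distance between pairs of sampled bins.
        pairs = list(combinations(sampled, 2))
        disparity = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    else:
        balance = 0.0
        disparity = 0.0
    return variety, balance, disparity

# Hypothetical database occupying two bins of a 5 x 5 grid, two materials in each.
toy_bins = [(0, 0), (0, 0), (3, 4), (3, 4)]
print(diversity_metrics(toy_bins, n_bins_total=25))
```

In this toy case the database scores low on variety (2 of 25 bins), maximal on balance (both bins equally populated) and comparatively high on disparity (the two bins are far apart), which shows that the three numbers capture different aspects of coverage.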

The choice of the organic linker and the placement of functional groups are readily enumerated; one can take the large databases of organic molecules38 as a rich source of possible MOF linkers or functional groups. In contrast, the metal nodes of MOFs are typically only known after a MOF is synthesised. For example, at present we cannot predict whether zinc atoms would, during MOF formation, cluster into a zinc paddle-wheel (e.g., in Zn-HKUST-1)39, a single node (e.g., in ZIFs)40, Zn4O (e.g., in IRMOFs)6, or a totally new configuration.

The diversity in metal chemistry was further reduced by the choices of researchers and/or limitations in the MOF structure assembly algorithms. Some of the hypothetical MOF databases are deliberately focused on specific sub-classes of MOFs to systematically investigate structure–property relationships. Examples include the study by Gomez-Gualdron et al.41, which focuses on generating stable MOFs using zirconium-based metal nodes for gas storage; Witman et al.42 on 1-D rod MOFs featuring open-metal sites for CO2 capture; and Moosavi et al.43 on ZIFs with various functional groups and underlying nets for mechanical stability. Lastly, in silico assembly of MOFs possessing complex nodes that are connected via multiple linkers, especially on a low-symmetry net, is still challenging for the current structure generation algorithms44. Therefore, we expect that there are many missing points on the metal chemistry map in Fig. 3b which will be found in the coming years.

Applications of diversity analysis

We illustrate the importance of quantifying the diversity of the different databases by three examples. The first example illustrates how machine learning can be used to extract insight on how the performance of a material is related to its underlying structure14,19,21. As our descriptors represent each domain of the MOF architecture, we can quantify the relative importance of these domains on CH4 and CO2 adsorption.

Within each database, the importance of variables varies significantly across different gases and different adsorption conditions (see Supplementary Note 5). These results follow our intuition; the chemistry of the material is more important in the low-pressure regime, while at high pressures the pore geometry is the dominant factor. Moreover, we observe that material chemistry is more important for CO2 than for CH4 adsorption.

If each of these databases covered a representative subset of MOF chemistry, one would expect each database to give a similar result for the importance of the different variables. However, we observe striking differences when we compare across different databases. An illustrative example is CO2 adsorption at low pressure. Anderson et al.14 concluded from their analysis of the ARABG-DB database that the metal chemistry is not an important variable for CO2 adsorption. However, Fig. 4a shows that for each of these databases different material characteristics are important for the models in predicting CO2 adsorption. For example, pore geometry is the most important variable in BW-20K, metal chemistry in CoRE-2019, and the functional groups in ARABG-DB. Since the material properties were computed using a consistent methodology for all databases, these differences in the importance of variables originate in the differences in the underlying distributions of the material databases (see Fig. 3 and Supplementary Note 6 for the distribution of databases). For instance, metal chemistry was not identified as an important factor by Anderson et al. because it was not explored sufficiently in their database: only four SBUs were used for structure enumeration. Also, since these values are relative importances, one can argue that in the CoRE-2019 MOFs the functional groups were not exploited as much as metal chemistry. At this point, it is important to note that our analysis is based on the current state-of-the-art methods that are used in screening studies, i.e., generic force fields and rigid crystals. It would be interesting to see how improvements in, for example, the description of open-metal sites in MOFs will change this analysis. If the changes are large, such improvements will likely have a large impact.

Fig. 4: Database dependence of the importance of material characteristics.

Pie charts showing the SHapley Additive exPlanations (SHAP) values (importance of variables) for a the low-pressure CO2 adsorption and b CH4 deliverable capacity. SHAP values were computed for the random forest regression models using a training set of CoRE-2019 and BW-20K, and all structures in ARABG-DB. For the chemical features, the importance of variables was summed over all RAC depths for each of the heuristic atomic properties. See the “Methods” section for the meaning of the labels. Similar values for importance of variables were obtained using other techniques (see Supplementary Note 5).

In our second example, we focus on how our diversity analysis can help us to identify opportunities for the design of new structures. At present, there are over 90,000 MOFs that have been synthesised and one would like to be sure that MOF 90,001 adds relevant information. Similarly, for the hypothetical databases one would add new structures to any screening study only if they are complementary to the many that already exist.

For CO2 capture from flue gases, which corresponds to CO2 adsorption at low pressure in our study, we have shown that metal chemistry cannot be ignored (Fig. 4a). Our diversity analysis shows that this domain is not well covered by hypothetical databases (see Fig. 3). Therefore, exploring different metal chemistries in these databases would increase the diversity of these databases. For this we have developed a methodology to mine unique MOF building blocks from the experimental MOF databases (see the “Methods” section). In Supplementary Note 7, we show some of these SBUs that have not been used for structure enumeration in these hypothetical databases yet, and including these missing structures in a screening study could lead to the discovery of materials with superior performance.

For methane storage our analysis shows that the single most important factor is the pore geometry (see Fig. 4b), and all databases confirm this. For this application, each of the databases has sufficient diversity in geometric structures, and other factors do not matter. This observation provides an important rationale for the provocative conclusion of Simon et al.45 that there is no point in looking for new structures for methane storage, as they are not expected to perform significantly better for this application. Simon et al. arrived at this conclusion from a large screening of 650,000 randomly selected structures from many databases of different classes of nanoporous materials. Our study shows that a large selection of structures from different databases will indeed cover the entire geometric space of the current design space. To significantly outperform the best performing materials one would need a completely new chemistry and mechanism, e.g., framework flexibility46.

In the final example, we focus on the effect of bias in the databases on the generalisability and transferability of machine-learning predictions. Intuitively, one would expect that if we include structures from all regions of the design space in our training set, our machine-learning results should be transferable to any database. We illustrate this point for the two databases CoRE-2019 and BW-DB. We randomly select 2000 structures that we use as a test set. A diverse set of structures based on the chemical and geometric descriptors was obtained from the remaining structures in these two databases47,48. The accuracy of random forest models trained using this diverse set is compared in Fig. 5 with that of models trained using training sets from each database. Clearly, the models that were trained on databases which are biased to some regions of the design space show poor transferability for predictions in unseen regions of the space. In contrast, and not surprisingly, the model that is trained with a diverse set performs relatively well for both databases. Moreover, diversity in the training set leads to more efficient learning. In the supplementary materials, we show learning curves demonstrating that the models trained on the diverse set have systematically lower error than the ones trained using biased databases. The number of training points at which the learning curves plateau can be an indication of the minimum number of structures for optimal coverage of the design space for a particular application. This number grows with the complexity of the material property, i.e., with how many material characteristics affect that property.

Fig. 5: Impact of diversity in training data on transferability of models.

The parity plots of random forest models using full features; rows and columns correspond to the training and test sets, respectively. The dashed lines represent the parity. The size of training sets is equal in all cases. The same structures were used as test sets in each column. The diverse set was selected using the MaxMin47 algorithm using all geometric and chemical descriptors. The colour bars show the number of structures in each cell of the histograms.
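
The MaxMin selection mentioned in this caption can be sketched as a greedy farthest-point procedure: starting from one structure, repeatedly add the candidate whose distance to its nearest already-selected structure is largest. This is the generic form of the algorithm, not necessarily the exact variant of ref. 47; the Euclidean metric, the choice of the first point via `seed_index`, the function name and the toy coordinates are assumptions for illustration.

```python
import math

def maxmin_select(points, n_select, seed_index=0):
    """Greedy MaxMin: return indices of up to n_select mutually distant points."""
    selected = [seed_index]
    chosen = {seed_index}
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[seed_index]) for p in points]
    while len(selected) < min(n_select, len(points)):
        candidates = [i for i in range(len(points)) if i not in chosen]
        # Pick the candidate that is farthest from everything selected so far.
        next_index = max(candidates, key=lambda i: min_dist[i])
        selected.append(next_index)
        chosen.add(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return selected

# Hypothetical descriptor vectors: two tight clusters plus one point in between.
toy_points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0), (10.1, 0.0)]
print(maxmin_select(toy_points, n_select=3))
```

On the toy data the procedure picks one point from each cluster and the isolated middle point rather than two near-duplicates from the same cluster, which is the behaviour that makes a MaxMin training set cover the design space more evenly than a random draw from a biased database.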

Original Text (This is the original text for your reference.)

Development of descriptors for MOF chemistry

One of the aims of this work is to express the diversity of a MOF database in terms of features that can be related to the chemistry that is used in synthesizing MOFs as well as generating the libraries of hypothetical structures. At present, different strategies have been developed to represent MOFs with feature vectors9,10,11,12. However, the global material descriptors9,13,14,15,16 that are presently used are not ideal for our purpose. We would like to directly connect to the structural building blocks of MOFs, which closely resemble the chemical intuition of MOF chemists, in which a MOF is a combination of the pore geometry and chemistry (i.e., metal nodes, ligands and functional groups)6,17. However, it is important to note that in developing these descriptors, it is impossible to completely separate the different effects and scopes. For example, for some MOFs adding a functional group can completely change the pore shape. Hence, depending on the details of the different types of descriptors and properties of interest, this may be seen as mainly pore-shape effect, while other sets of descriptions will assign it as functional-group effect.

To describe the pore geometry of nanoporous materials we use simple geometric descriptors, such as the pore size and volume18. For the MOF chemistry, we adapt the revised autocorrelations (RACs) descriptors19, which have been successfully applied19,20,21,22 for building structure–property relationships in transition metal chemistry19,23. RACs are discrete correlations between heuristic atomic properties (e.g., the Pauling electronegativity, nuclear charge, etc.) of atoms on a graph. We compute RACs using the molecular or crystal graphs derived from the adjacency matrix computed for the primitive cell of the crystal structure (see the “Methods” section). To describe the MOF chemistry, we extended conventional RACs to include descriptors for all domains of a MOF material, namely metal chemistry, linker chemistry, and functional groups (Fig. 1 and the “Methods” section).

Fig. 1: Description of the three domains of MOF chemistry.

Metal centre RACs are computed on the crystal graph. Linker and functional-group RACs are computed on the corresponding linker molecular graph. Linker chemistry includes two types of RACs, namely full linker and linker connecting atoms. The graphs show the start atom (in green) and the nearby atom (in orange) used to define the RACs descriptors (see the “Methods” section).

Description of the databases

We consider several MOF databases: one experimental and five with in silico predicted structures (see Supplementary Note 2 for more details of databases). The Computation-Ready, Experimental (CoRE)2,24,25,26 MOF database represents a selection of synthesised MOFs.

The first in silico generated MOF database (hMOF) was developed by Wilmer et al.3 using a “Tinkertoy” algorithm by snapping MOF building blocks to form 130,000 MOF structures. This Tinkertoy algorithm, however, gave only a few underlying nets27. An alternative approach, using topology-based algorithms has been applied by Gomez-Gualdron et al.28 for their ToBaCCo database (~13,000 structures), and by Boyd and Woo4,29 for their BW-DB (over 300,000 structures). A comprehensive review of this topic can be found here30.

We use CoRE-2019 and a diverse subset of 20,000 structures from the BW-DB (called BW-20K) to establish the validity of the material descriptors. In addition, a relatively small database of around 400 structures developed by Anderson et al.14 (ARABG-DB) was included for comparison with their conclusions about importance of structural domains14. For this test, we focus on adsorption properties as their accurate prediction requires a meaningful descriptor for both the chemistry and pore geometry. We study the adsorption properties of methane and carbon dioxide. Because of their differences in chemistry (i.e. molecule shape and size, and non-zero quadrupole moment of carbon dioxide), designing porous materials with desired adsorption properties requires different strategies for each gas. To emphasize on these differences, we study the adsorption properties at three different conditions, namely infinite dilution (i.e. Henry regime), low pressure and high pressure.

Predicting adsorption properties of MOFs

We first establish that our descriptors capture the chemical similarity of MOF structures. As a test we show that instance-based machine-learning models (kernel ridge regression (KRR)) using these descriptors can accurately predict adsorption properties. A KRR model with a radial basis function kernel uses only similarity that is quantified using pairwise distances in the feature space; hence, the performance of the model can demonstrate the validity of the descriptors. KRR models show good performance in predictions of the adsorption properties of CoRE-2019 and BW-20K databases (see Supplementary Note 3 for parities and statistics). We observe that for those properties that are less dependent on the chemistry, e.g., the high-pressure applications of CH4 and CO2, the geometric descriptors are sufficient to describe the materials with the average relative error (RMAE) in the prediction of the gas uptake being below 5%. In addition, if we compare the relative ranking of the materials, we also obtain satisfactory agreement as expressed by the Spearman rank correlation coefficient (SRCC) above 0.9. On the other hand, for the applications where chemistry plays a role, e.g., the Henry coefficient of CO2, the chemical descriptors are essential to accurately predict the materials properties (RMAE ~ 5% and SRCC ~ 0.8). The performance and accuracy of our models is comparable with the prior studies14,31,32,33,34,35 (see a comprehensive list in ref. 36). However, to be able to compare the accuracy and performance of different models and feature sets, one needs to perform a benchmark study using a fixed set of materials with high diversity and their corresponding properties as for example, we observe the performance of machine-learning models varies considerably from one database to another.

The significance of the chemical descriptors is further illustrated by the predictions of the maximum positive charge (MPC) and the minimum negative charge (MNC) of MOF structures (SRCC above 0.9 and 0.7, respectively). The geometric descriptors are nearly irrelevant for these charges (SRCCs below 0.5 for all cases). This explains the relatively poor performance in prediction of CO2 adsorption properties at low pressures using only geometric descriptors as electrostatic interaction plays a crucial role. This analysis shows that our RACs and geometric descriptors are meaningful representations for the chemical space of MOFs for both CH4 and CO2 adsorption over the complete range of pressures. As a consequence, if two materials have similar descriptors, their adsorption properties will be similar. Hence, we can now quantify how the different regions of design space are covered by the different databases.

Diversity of MOF databases

We define the current chemical design space as the combination of all the synthesized materials and the in silico predicted structures, i.e., all the materials in the known databases. The real chemical design space, of course, can be much larger, as one can expect that novel classes of MOFs will be discovered. It is instructive to visualize how each MOF database is covering the current design space. This design space, as described by our descriptors, is a high-dimensional space and to visualize this we make a projection on two dimensions.

The projection of the pore geometry of our current design space is shown in Fig. 2a. The colour distribution shows a gradient in the pore size of the MOFs, from small to large pores moving on the map from left to the right. Other panels show how the different MOF databases are covering this space. The distributions of the geometric properties of the databases are considerably different from each other (Fig. 2b–d). For example, the experimental MOFs (CoRE-2019) are mainly in the small pore region of the map. Remarkably, the hypothetical databases also have very different distributions. While BW-DB covers the intermediate pore size regions, ToBaCCo is biased to the large pore regions of the design space.

Fig. 2: Map of the pore geometry of MOFs.

To project the geometric descriptor space of MOFs to a 2D map we use the t-distributed stochastic neighbour embedding (t-SNE)67 method (see Supplementary Note 6 for principal component analysis (PCA)). The t-SNE method preserves local similarity, ensuring similar structures are mapped close to each other in two dimensions. a The current design space colour coded with the largest included sphere. In (b), (c), and (d), the green, blue and red dots are representing the materials in the CoRE-2019, BW-DB and ToBaCCo databases, respectively, which are overlaid on the design space represented in grey. PCA plots show a similar distribution of databases (see Supplementary Note 6).

The hypothetical structures have been generated to explore the design space of MOFs beyond the experimentally known structures. In Fig. 3 we show how these databases are covering the design space (see Supplementary Note 6 for the distribution of each database and for PCA method). We use diversity metrics37 to quantify the coverage of these databases in terms of variety (V), balance (B) and disparity (D). The pore geometry, linker chemistry and functional groups design spaces are well covered and sampled by the hypothetical databases. However, we observe a serious limitation in diversity, in particular in the variety of the metal chemistry in hypothetical databases (Fig. 3b). Compared with the experimental database, the variety of the metal chemistry of MOFs by hypothetical databases is surprisingly low; only a limited number of MOF metal centres are present (18 metal SBUs for all hypothetical databases, see Supplementary Note 14).

Fig. 3: Diversity metrics and maps of different domains of MOF structures.

The t-SNE method was used to project the a pore geometry, b metal chemistry, c linker chemistry and d functional groups descriptor spaces to 2D maps. Only descriptors up to the second coordination shell were included for metal chemistry to emphasize the local metal chemistry environment. In each panel, the structures from the hypothetical databases are coloured and overlaid on the entire known design space represented in grey. The radar charts show the three diversity metrics: variety (V), balance (B) and disparity (D), for the three databases. For this analysis, first we discretize the space into a fixed number of bins. Variety measures the number of bins that are sampled, balance the evenness of the distribution of materials among the sampled bins, and disparity the spread of the sampled bins (see the “Methods” section for more details).

The choice of the organic linker and the placement of functional groups are readily enumerated; one can take the large databases of organic molecules38 as a rich source of the possible MOF linkers or functional groups. In contrast, the metal nodes of MOFs are typically only known after a MOF is synthesised. For example, at present we cannot predict that if Zinc atoms during the MOF formation would cluster in a Zinc paddle-wheel (e.g., in Zn-HKUST-1)39, a single node (e.g., in ZIFs)40, Zn4O (e.g., in IRMOFs)6, or to a totally new configuration.

The diversity in metal chemistry was further reduced by the choices of researchers and/or limitations in the MOF structure assembly algorithms. For example, some of the hypothetical MOF databases are deliberately focused on specific sub-classes of MOFs to systematically investigate structure–property relationships: Gomez-Gualdron et al.41 focus on generating stable MOFs using zirconium-based metal nodes for gas storage, Witman et al.42 on 1-D rod MOFs featuring open-metal sites for CO2 capture, and Moosavi et al.43 on ZIFs with various functional groups and underlying nets for mechanical stability. Lastly, in silico assembly of MOFs possessing complex nodes that are connected via multiple linkers, especially on a low-symmetry net, is still challenging for the current structure generation algorithms44. Therefore, we expect that there are many missing points on the metal chemistry map in Fig. 3b which will be found in the coming years.

Applications of diversity analysis

We illustrate the importance of quantifying the diversity of the different databases by three examples. The first example illustrates how machine learning can be used to extract insight on how the performance of a material is related to its underlying structure14,19,21. As our descriptors represent each domain of the MOF architecture, we can quantify the relative importance of these domains on CH4 and CO2 adsorption.

Within each database, the importance of variables varies significantly across different gases and different adsorption conditions (see Supplementary Note 5). These results follow our intuition; the chemistry of the material is more important in the low-pressure regime, while at high pressures the pore geometry is the dominant factor. Moreover, we observe that material chemistry is more important for CO2 than for CH4 adsorption.

If each of these databases covered a representative subset of MOF chemistry, one would expect each database to give a similar result for the importance of the different variables. However, we observe striking differences when we compare across different databases. An illustrative example is CO2 adsorption at low pressure. Anderson et al.14 concluded from their analysis of the ARABG-DB database that the metal chemistry is not an important variable for CO2 adsorption. However, Fig. 4a shows that for each of these databases different material characteristics are important for the models in predicting CO2 adsorption. For example, pore geometry is the most important variable in BW-20K, metal chemistry in CoRE-2019, and the functional groups in ARABG-DB. Since the material properties were computed using a consistent methodology for all databases, these differences in the importance of variables originate in the differences in the underlying distributions of the material databases (see Fig. 3 and Supplementary Note 6 for the distribution of the databases). For instance, metal chemistry was not identified as an important factor by Anderson et al. because it was not explored sufficiently in their database, as only four SBUs were used for structure enumeration. Also, since these values are relative importances, one can argue that in CoRE-2019 MOFs the functional groups were not exploited as much as metal chemistry. At this point, it is important to note that our analysis is based on the current state-of-the-art methods that are used in screening studies, i.e., generic force fields and rigid crystals. It would be interesting to see how improvements in, for example, the description of open-metal sites in MOFs will change this analysis. If the changes are large, such improvements will likely have a large impact.

Fig. 4: Database dependence of the importance of material characteristics.

Pie charts showing the SHapley Additive exPlanations (SHAP) values (importance of variables) for (a) the low-pressure CO2 adsorption and (b) CH4 deliverable capacity. SHAP values were computed for the random forest regression models using a training set of CoRE-2019 and BW-20K, and all structures in ARABG-DB. For the chemical features, the importance of variables was summed over all RAC depths for each of the heuristic atomic properties. See the “Methods” section for the meaning of the labels. Similar values for importance of variables were obtained using other techniques (see Supplementary Note 5).
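
The summation over RAC depths mentioned in the caption of Fig. 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name group_importance and the feature naming scheme "<domain>-<property>-<depth>" are hypothetical, and the input values stand in for per-feature importances such as mean absolute SHAP values.

```python
from collections import defaultdict


def group_importance(importances):
    """Sum per-feature importances over RAC depths and normalise to fractions.

    `importances` maps hypothetical feature names of the form
    "<domain>-<property>-<depth>" (e.g. "metal-chi-2") to importance values.
    Names without a numeric depth suffix (e.g. geometric descriptors) are
    kept as they are.
    """
    grouped = defaultdict(float)
    for name, value in importances.items():
        head, _, tail = name.rpartition("-")
        # Strip the trailing depth only when it is a plain integer.
        key = head if head and tail.isdigit() else name
        grouped[key] += abs(value)

    total = sum(grouped.values())
    if total == 0:
        return dict(grouped)
    # Fractions sum to one, which is what a pie chart displays.
    return {key: value / total for key, value in grouped.items()}
```

Because the result is normalised, each value is a relative importance: a small share for one domain in a given database can reflect that the domain was sparsely sampled there, not that it is physically unimportant.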

In our second example, we focus on how our diversity analysis can help us to identify opportunities for the design of new structures. At present, there are over 90,000 MOFs that have been synthesised and one would like to be sure that MOF 90,001 adds relevant information. Similarly, for the hypothetical databases one would add new structures to any screening study only if they are complementary to the many that already exist.

For CO2 capture from flue gases, which corresponds to CO2 adsorption at low pressure in our study, we have shown that metal chemistry cannot be ignored (Fig. 4a). Our diversity analysis shows that this domain is not well covered by hypothetical databases (see Fig. 3). Therefore, exploring different metal chemistries would increase the diversity of these databases. For this we have developed a methodology to mine unique MOF building blocks from the experimental MOF databases (see the “Methods” section). In Supplementary Note 7, we show some of the SBUs that have not yet been used for structure enumeration in these hypothetical databases. Including these missing structures in a screening study could lead to the discovery of materials with superior performance.

For methane storage our analysis shows that the single most important factor is the pore geometry (see Fig. 4b). All databases confirm that pore geometry is the most important factor. For this application, each of the databases has sufficient diversity in geometric structures and other factors do not matter. This observation provides an important rationale for the provocative conclusion of Simon et al.45 that there is no point in looking for new structures for methane storage as they are not expected to perform significantly better for this application. Simon et al. arrived at this conclusion from a large screening of a random selection of 650,000 structures from many databases of different classes of nanoporous materials. Our study shows that indeed a large selection of structures from different databases will cover the entire geometric space of the current design space. To significantly outperform the best performing materials one would need a completely new chemistry and mechanism, e.g., framework flexibility46.

In the final example, we focus on the effect of bias in the databases on the generalisability and transferability of machine-learning predictions. Intuitively, one would expect that if we include structures from all regions of the design space in our training set, our machine-learning results should be transferable to any database. We illustrate this point for the two databases CoRE-2019 and BW-DB. We randomly select 2000 structures that we use as a test set. A diverse set of structures based on the chemical and geometric descriptors was obtained from the remaining structures in these two databases47,48. The accuracy of random forest models trained using this diverse set is compared in Fig. 5 with that of models trained using training sets from each database. Clearly, the models that were trained on databases which are biased to some regions of the design space show poor transferability for predictions in unseen regions of the space. In contrast, and not surprisingly, the model that is trained with a diverse set performs relatively well for both databases. In addition, diversity in the training set leads to more efficient learning. In the supplementary materials, we show learning curves that demonstrate that the models trained on the diverse set have systematically lower error than the ones trained using biased databases. The number of training points at which the learning curves plateau can be an indication of the minimum number of structures for optimal coverage of the design space for a particular application. This number depends on the complexity of the material property, i.e., on how many material characteristics affect that property.

Fig. 5: Impact of diversity in training data on transferability of models.

The parity plots of random forest models using full features; rows and columns correspond to the training and test sets, respectively. The dashed lines represent the parity. The size of training sets is equal in all cases. The same structures were used as test sets in each column. The diverse set was selected using the MaxMin47 algorithm using all geometric and chemical descriptors. The colour bars show the number of structures in each cell of the histograms.
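
The MaxMin selection named in the caption of Fig. 5 can be illustrated with a short greedy farthest-point sketch in Python. This is a generic version of the idea under stated assumptions, not the implementation used in the study: it assumes Euclidean distance on already scaled descriptor vectors, and the function name maxmin_select and the start argument are hypothetical.

```python
import math


def maxmin_select(features, n_select, start=0):
    """Greedy MaxMin (farthest-point) selection of a diverse subset.

    `features` is a list of equal-length numeric vectors (assumed to be
    scaled beforehand). Starting from index `start`, each step adds the
    point whose distance to its nearest already selected point is largest.
    Returns the selected indices in the order they were picked.
    """
    if not 0 < n_select <= len(features):
        raise ValueError("n_select must be between 1 and len(features)")

    selected = [start]
    # Distance from every point to its nearest selected point.
    min_dist = [math.dist(f, features[start]) for f in features]
    min_dist[start] = -1.0  # mark as taken so it is never picked again

    while len(selected) < n_select:
        nxt = max(range(len(features)), key=min_dist.__getitem__)
        selected.append(nxt)
        min_dist[nxt] = -1.0
        for i, f in enumerate(features):
            d = math.dist(f, features[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

Each step costs one pass over the data, so selecting k points from n structures takes on the order of n times k distance evaluations, which remains practical for databases of the size discussed here.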
