TY - JOUR
T1 - Automatic selection of molecular descriptors using random forest
T2 - Application to drug discovery
AU - Cano, Gaspar
AU - Garcia-Rodriguez, Jose
AU - Garcia-Garcia, Alberto
AU - Perez-Sanchez, Horacio
AU - Benediktsson, Jón Atli
AU - Thapa, Anil
AU - Barr, Alastair
N1 - Publisher Copyright: © 2016 Elsevier Ltd
PY - 2017/4/15
Y1 - 2017/4/15
AB - The optimal selection of chemical features (molecular descriptors) is an essential pre-processing step for the efficient application of computational intelligence techniques in virtual screening for the identification of bioactive molecules in drug discovery. The selection of molecular descriptors has a key influence on the accuracy of affinity prediction. In order to improve this prediction, we examined a Random Forest (RF)-based approach to automatically select molecular descriptors of training data for ligands of kinases, nuclear hormone receptors, and other enzymes. Reducing the number of features used during prediction dramatically reduces the computing time compared with existing approaches and consequently permits the exploration of much larger sets of experimental data. To test the validity of the method, we compared the results of our approach with those obtained using manual feature selection in our previous study (Perez-Sanchez, Cano, and Garcia-Rodriguez, 2014). The main novelty of this work in the field of drug discovery is the use of RF in two different ways: feature ranking and dimensionality reduction, and classification using the automatically selected feature subset. Our RF-based method outperforms the classification results provided by Support Vector Machine (SVM) and Neural Network (NN) approaches.
KW - Computational chemistry
KW - Drug discovery
KW - Molecular descriptors
KW - Random forest
UR - https://www.scopus.com/pages/publications/85006745435
U2 - 10.1016/j.eswa.2016.12.008
DO - 10.1016/j.eswa.2016.12.008
M3 - Article
SN - 0957-4174
VL - 72
SP - 151
EP - 159
JO - Expert Systems with Applications
JF - Expert Systems with Applications
ER -