InforSense & CCG All Rights Reserved

Advanced Application in MOE: QSAR

Outline
- QSAR Overview
- Descriptor calculation
- Descriptor selection (PCA)
- Deriving QSAR models
- Model Validation

QSAR
Quantitative Structure-Activity Relationship (QSAR) applications correlate experimental data (e.g. biological activity or physical properties) with the structure of chemical compounds in a quantitative manner.
QSAR models allow the interpretation and prediction of properties of structurally related compounds.
The art of deriving a QSAR model lies in:
- Identifying a suitable mathematical functional form
- Reducing the complex dimensionality of reality into as few dimensions as possible while still being able to give useful predictions of specific properties for molecules not experimentally tested so far.
Most QSAR models are based on linear correlations.

QSAR Model Development
Robust QSAR model development generally proceeds as follows:
- Assemble a database of experimental results and molecular structures.
- Identify a descriptor set that correlates highly with the property in question; use descriptors that are mutually orthogonal and as meaningful and intuitive as possible (based on the underlying physico-chemical properties).
- Split the dataset into an appropriate training set and test set. The training set is used to develop the model; the test set is used to validate the predictive power of the model. In most cases the applicability of the model will be closely limited to the property space of the training set.
- Apply methods (regression, classification, etc.) to generate the predictive models based on the training set.
- Predict activities for the test set to assess the robustness of the model.

Workflow (diagram) and the corresponding MOE tools:
- Descriptor calculation (QuaSAR-Descriptor)
- Descriptor selection (Principal Components, QuaSAR-Contingency)
- Model development (QuaSAR-Model, Model-Composer)
- Model validation (Model-Evaluate)

Quantitative & Qualitative QSAR Models in MOE
Besides the selection of the most appropriate descriptors and a meaningful separation of the available data into training and test sets, the choice of an appropriate functional form is key to successful QSAR modeling.
MOE provides quantitative as well as qualitative QSAR approaches:
- Quantitative approaches include linear regression methods such as Partial Least Squares (PLS) and Principal Component Regression (PCR).
- Qualitative approaches include a non-linear binary filter based on Bayesian statistics as well as a binary classification tree.
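
The split / fit / predict workflow described above can be sketched outside MOE. The following is only an illustration under stated assumptions: the descriptor matrix and activities are synthetic, the 45/15 split is arbitrary, and ordinary least squares from NumPy stands in for MOE's PLS or PCR. The helper names fit_linear_qsar, predict and r_squared are hypothetical and are not MOE functions.

```python
import numpy as np

def fit_linear_qsar(X, y):
    """Ordinary least squares fit; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

def predict(intercept, coefs, X):
    """Evaluate the linear model on a descriptor matrix."""
    return intercept + X @ coefs

def r_squared(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (assumption): 60 molecules, 3 descriptors, linear activity plus noise.
rng = np.random.default_rng(0)
n_mol, n_desc = 60, 3
X = rng.normal(size=(n_mol, n_desc))
true_w = np.array([0.8, -0.5, 0.0])
y = 0.2 + X @ true_w + rng.normal(scale=0.1, size=n_mol)

# Random split into training set (model development) and test set (validation).
idx = rng.permutation(n_mol)
train, test = idx[:45], idx[45:]

b0, w = fit_linear_qsar(X[train], y[train])
r2_train = r_squared(y[train], predict(b0, w, X[train]))
r2_test = r_squared(y[test], predict(b0, w, X[test]))
print(f"train R2 = {r2_train:.3f}, test R2 = {r2_test:.3f}")
```

A test-set R2 far below the training-set R2 would suggest an overfitted model or test compounds outside the property space covered by the training set.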

Descriptor Calculation

Initial Steps in Understanding a Dataset
The initial steps in interpreting an experimental dataset involve:
- Building preliminary Structure Activity Relationships
  - Common fragments to actives/inactives
- Looking for patterns within the data structure
  - Are clusters present in the data?
- Evaluating the relative importance of descriptors for a potential model
  - Involves both stochastic and heuristic evaluation
- Finding commonality and diversity within the data
  - Robustness in chemical space
Step 1 in data analysis: Find the relevant set of descriptors

Molecular Descriptors and Fingerprints
Molecular descriptors encode molecular properties per molecule into single numerical values.
- Qualitative: yes/no flags for the presence or absence of certain features (like the bits in fingerprints, see below).
- Quantitative: numerical measures of physico-chemical or structural properties.
- May depend on connectivity and chemistry only (2D) or also on conformation/3D geometry (3D).
Fingerprints typically consist of bit strings of several hundred or even thousands of individual yes/no flags.
- Each position of the bit string encodes the presence (1) or absence (0) of a distinct property or feature,
- including substructure fragments, connectivity patterns or pharmacophore-type functional properties.
[Figure: example molecule with descriptor values and fingerprint]
Molecular weight: 385.282
logP: 2.552
# rotatable single bonds: 5
010010100101001... (bit string)
or 2, 5, 7, 10, 12, 15, ... (bit position)

QuaSAR Descriptor Panel
Numerical molecular descriptors may be calculated either via (MOE | Compute | QuaSAR | QuaSAR-Descriptor) without opening a database or via (DBV | Compute | Descriptors).
Panel elements (screenshot callouts): Input database; Descriptor synchronization with database; Descriptor list; Display filters

Overview of MOE Descriptors
- 300 2D and 3D descriptors
  - Topological indices
  - Surface area properties
  - Physical properties
  - Energy terms
- Add new descriptors with SVL
  - Automatically added to relevant calculations
  - Existing descriptors can be used as template
- Proprietary VSA descriptors
  - Subdivision of surface area based on LogP, MR (molar refractivity) and Partial Charge
  - 2D based approximation (for speed on large datasets)
- Semi-empirical descriptors
  - Descriptor names prefixed with Hamiltonian: AM1_, PM3_, MNDO_
  - Total energy, electronic energy, heat of formation, HOMO, LUMO, Ionization Potential

Binned VSA Descriptors I
- A subset of highly uncorrelated, intuitive and meaningful 2D descriptors has been implemented in MOE to provide a stable "default" approach for new datasets: the binned van der Waals surface area descriptors (referred to as binned VSA descriptors in MOE).
- LogP (partition coefficient), MR (molar refractivity) and partial charge are used to cover a meaningful property space from hydrophobic to hydrophilic interactions. Each of these descriptor sets is derived from, or related to, the Hansch and Leo descriptors.
- The descriptor returns the approximate surface area of a molecule, produced from a 2D representation, that falls into a given range of property values.
- Using the subset of binned VSA descriptors may help to overcome the necessity of using automatic descriptor selection routines.

Binned VSA Descriptors II
- The surface contribution which may be sensed by neighboring molecules is approximated by subtracting overlapping surface areas from first shell atom neighbors.
- The 1999 Wildman & Crippen atom type model is used to map properties onto individual atoms. Contributions to LogP and MR are derived in linear models from datasets of about 10,000 experimental data points each.
- For partial charge calculation, the Gasteiger PEOE charges are used.
- The approximate surface area contributions of a given molecule are added for each property bin.
[Figure: example molecule with atoms C1, O2, C3, C4, C5, C6, N7, C8; the per-atom surface areas Vi are summed into bins of the property Pi]
Pi range:     [0,1)   [1,2)   [2,3)   [3,4)   [4,5)   [5,6)
Vi values:    V7      V2      V1      V6      V3      V4+V8+V5
Descriptors:  D1      D2      D3      D4      D5      D6
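
The binning step above (adding the approximate surface area contributions per property bin) can be sketched as follows. This is a minimal illustration, not MOE's implementation: the per-atom property values and surface areas are invented numbers, chosen only so that the atoms fall into the same bins as in the example figure, and the function name binned_vsa is hypothetical.

```python
import numpy as np

def binned_vsa(atom_property, atom_vsa, bin_edges):
    """Sum per-atom surface areas over half-open property bins [lo, hi)."""
    atom_property = np.asarray(atom_property, dtype=float)
    atom_vsa = np.asarray(atom_vsa, dtype=float)
    descriptors = np.zeros(len(bin_edges) - 1)
    # np.digitize returns i with bin_edges[i-1] <= p < bin_edges[i]
    bin_index = np.digitize(atom_property, bin_edges) - 1
    for k, area in zip(bin_index, atom_vsa):
        if 0 <= k < len(descriptors):      # atoms outside all bins are ignored
            descriptors[k] += area
    return descriptors

# Hypothetical values for atoms 1..8 (C1, O2, C3, C4, C5, C6, N7, C8).
P = [2.5, 1.5, 4.5, 5.1, 5.9, 3.5, 0.5, 5.0]   # per-atom property Pi (invented)
V = [10, 12, 8, 5, 6, 9, 11, 7]                # per-atom surface area Vi (invented)
edges = [0, 1, 2, 3, 4, 5, 6]                  # six bins: [0,1) ... [5,6)

D = binned_vsa(P, V, edges)
for k, value in enumerate(D, start=1):
    print(f"D{k} = {value:.1f}")
```

With these invented numbers D1 collects V7, D2 collects V2, D3 collects V1, D4 collects V6, D5 collects V3 and D6 collects V4 + V5 + V8, so the descriptors sum to the total surface area of the molecule.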

2D BCUT and GCUT Descriptors
BCUT: Burden matrix eigenvalues
- The BCUT descriptors are calculated from the eigenvalues of a modified adjacency matrix.
- The adjacency matrix contains a 1 if atoms i, j are bonded; 0 otherwise.
- Each ij entry of the adjacency matrix takes the value $b_{ij}^{-1/2}$, where $b_{ij}$ is the formal bond order between bonded atoms i and j.
- The diagonal takes the value of the associated PEOE partial charge, SMR or logP descriptor.
- The resulting eigenvalues are sorted and the smallest, 1/3 percentile, 2/3 percentile and largest eigenvalues are reported.
GCUT: Inverse graph distance matrix eigenvalues
- The GCUT descriptors are calculated from the eigenvalues of a modified graph distance adjacency matrix, similar to BCUT descriptors.
- Each ij entry of the adjacency matrix takes the value $d_{ij}^{-2}$, where $d_{ij}$ is the (modified) graph distance between atoms i and j.
- The diagonal takes the value of the associated PEOE partial charge, SMR or logP descriptor.
- The resulting eigenvalues are sorted and the smallest, 1/3 percentile, 2/3 percentile and largest eigenvalues are reported.

Caveats in Descriptor Calculation
To ensure consistent i3D and x3D descriptor values when starting from 2D structures without hydrogens, one of the following procedures should be used:
Via the DBV:
1. Import the structures without adding hydrogens
2. Energy minimize the database, enabling the following options:
   - “Rebuild 3D”
   - “Add Hydrogens”
   - “Calculate forcefield partial charges”
In the Command Line via sdproc, which adds hydrogens, calculates partial charges, performs energy minimization, and descriptor calculation in a single pass.
Note: Differences may arise when
- SMILES structures are used as a molecular source (random initial coordinates).
- Hydrogens, partial charge, and energy minimization steps are performed in series (coordinate truncation errors).

Exercise: Descriptor Calculation
Descriptor selection depends on the experience of the user. Here TPSA is used to account for molecular size and electrostatic interactions, SlogP for permeability, and SMR for polarizability. The correlation between the 3 descriptors is then plotted.
1. Open the merged_bb.mdb file, and save a local copy to the working directory.
2. Open the QuaSAR-Descriptor panel (DBV | Compute | Descriptors). A list of the built-in descriptors is displayed, which can be navigated using text filters.
3. Enter TPSA in the Descriptor Filter.
4. Left mouse click once to select the TPSA descriptor in the descriptor list.
5. Enter SMR in the Descriptor Filter and select the SMR descriptor from the filtered list.
6. Enter SlogP in the Descriptor Filter and select SlogP from the filtered list.
7. Press OK to calculate the three selected descriptors.
Check descriptor correlations:
8. Select the activity field (logBB) and the three descriptor fields in the database (SlogP, SMR, TPSA).

Descriptor Calculation: Correlation
The relationship between two variables X and Y is described by the correlation coefficient R. This is determined by linear regression analysis (see QSAR models), where the linear equation with the smallest deviations in x and y from all data points is derived.
The correlation coefficient is calculated by:
$$R = \frac{\operatorname{cov}(X,Y)}{\sqrt{\operatorname{var}(X)\,\operatorname{var}(Y)}} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}\,\sum_{i=1}^{n}(y_i-\bar{y})^{2}}}$$
A correlation coefficient of 1 indicates a perfect correlation, -1 a perfect inverse correlation, and 0 no linear relationship.
[Figure: example scatter plots with R = 1.00, R = -0.72, R = -0.06, R = 0.95 and R = 0.77]

Correlation Between Stork Populations and Human Birthrates (H. Sies, Nature, 332 (1988) 495)
Any correlation between descriptors and experimental data has to be meaningful mechanistically.
[Figure: amount of storks and babies per year, 1965 to 1981]

Exercise: Descriptor Calculation - Correlation Matrix
Models will be more robust if uncorrelated descriptors are used. Correlation can be inspected using either a correlation plot or a matrix.
1. Select (DBV | Compute | Analysis | Correlation Matrix). The numbers in the icons in the correlation matrix correspond to percent correlation.
2. Double-click on the highlighted cell to bring up the correlation plot (or use (DBV | Compute | Analysis | Correlation Plot) and select two numeric fields).

Exercise: Descriptor Calculation - Correlation Plot
A squared correlation coefficient (R2) of 0.0756 and the linear regression equation are indicated in the header line of the correlation plot. There is virtually no correlation between SlogP and TPSA.
3. Select e.g. active compounds (logBB 0.5) in the DBV or any data points in the plot (left mouse drag over the selection). The selection is interactive between the plot and the database viewer. To deselect entries, use the (DBV | Entry | Clear Entry Selection) menu, the Entry Popup menu or the Clear Selection button in the DBV plot.
Display attributes may be modified and data exported to other tools.
4. Clear the selection using the Clear Selection button in the plot.
5. Select Data to Clipboard to copy the XY values, e.g. into a text editor, or to import the data into Excel.
6. Select Attributes to change to a white background, black foreground, black markers, etc.
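
The correlation coefficient and the percent correlation matrix used in the exercise above can be reproduced outside MOE with a few lines of NumPy. The sketch below is an illustration only: the four columns are synthetic stand-ins (not the merged_bb.mdb data), and pearson_r is a hypothetical helper that evaluates the sum formula for R directly.

```python
import numpy as np

def pearson_r(x, y):
    """Correlation coefficient R = cov(X, Y) / sqrt(var(X) * var(Y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Synthetic columns (assumption): stand-ins for an activity and three descriptors.
rng = np.random.default_rng(1)
n = 200
a = rng.normal(size=n)
b = 2.0 * a + 1.0                # exact linear function of a: R = 1
c = -a + rng.normal(size=n)      # noisy inverse relationship: R < 0
d = rng.normal(size=n)           # generated independently of a: R near 0

names = ["a", "b", "c", "d"]
data = np.column_stack([a, b, c, d])

# Full correlation matrix, rounded to percent as in a percent-correlation display.
corr = np.corrcoef(data, rowvar=False)
percent = np.rint(100 * corr).astype(int)

print("     " + " ".join(f"{name:>5}" for name in names))
for name, row in zip(names, percent):
    print(f"{name:>5}" + " ".join(f"{value:>5d}" for value in row))
```

Pairs of descriptors with a high absolute percent correlation carry largely redundant information, which is the motivation for descriptor selection.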

Descriptor Selection
In the preceding example one of the three descriptors (SMR) shows a low relationship to logBB.
In practice, many descriptors (some correlated, some not) are calculated and used as the starting point to build a QSAR model.
There are two approaches in the development of robust QSAR models:
- Descriptor reduction: Select calculated descriptors which are not, or only weakly, correlated (orthogonal), either manually or semi-automatically with QuaSAR-Contingency.
- Dimension reduction: Use all calculated (possibly correlated) descriptors in a Principal Component Analysis (PCA).

Descriptor Selection: QuaSAR-Contingency
QuaSAR-Contingency (DBV | Compute | QuaSAR-Contingency) is a statistical application to assist in the selection of descriptors for QSAR or QSPR.
The application performs a bivariate contingency analysis for each descriptor and the activity or property value. It produces a table of coefficients that helps to select important descriptors.
Panel elements (screenshot callouts): Input database; Predictable property; Descriptor list

Exercise: QuaSAR-Contingency
Determine the most (un)important descriptors for the merged_bb.mdb dataset.
1. Open the QuaSAR-Contingency panel (DBV | Compute | QuaSAR-Contingency).
2. Select the 3 descriptors (SlogP, SMR, and TPSA) and press OK.
3. Examine the result in the text editor. SMR is considered the least important descriptor.
Output sections (screenshot callouts): Contingency measures; Descriptor dependence; Most important descriptors

Principal Components Analysis (PCA)
PCA reduces the dimensionality of a set of molecular descriptors by linearly transforming the data into components that are mutually orthogonal.
- The 1st PC describes the direction of greatest data variance
- The 2nd PC describes the direction of the second greatest data variance, etc.
[Figure: data points in the space of Descriptor 1, Descriptor 2 and Descriptor 3 with the axes PC 1 and PC 2]

PCA Pre-Processing
Since descriptors may be heterogeneous in nature (units, scale, etc.), the data should be pre-processed to build meaningful models.
PCA is generally applied to scaled and/or mean centered data.
Scaling:
- Usually appropriate in systems where the variables have different units and/or cover different magnitudes, e.g. variation between 100-110 °C and 0.01-0.1 M.
- Puts all descriptors on an equal basis in the analysis
Mean centering: Translates the origi