Bioinformatics, Chapter 3: Databases and data mining (K Van Steen)

GenBank sample record information
(http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#LocusB)
Statistics at NCBI
(http://www.ncbi.nlm.nih.gov/Sitemap/Summary/statistics.html#GenBankStats)

Primary databases in detail: dbSNP
(http://www.ncbi.nlm.nih.gov/projects/SNP/)

(http://www.ncbi.nlm.nih.gov/SNP/snp_summary.cgi)

NCBI SNPs
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=snp&cmd=search&term=)

NCBI SNPs
(http://www.ncbi.nlm.nih.gov/snp/limits)

The "equivalent" of the US NCBI: EMBL
(http://www.embl.org/)

Primary databases in detail: EMBL nucleotide sequence database
(http://www.ebi.ac.uk/embl/index.html)

DNA Data Bank of Japan (DDBJ)
(http://www.ddbj.nig.ac.jp/)
The International Sequence Database Collaboration
• These three databases have collaborated since 1982. Each database collects and processes new sequence data and relevant biological information from scientists in its region.
• These databases automatically update each other with the new sequences collected from each region, every 24 hours. The result is that they contain exactly the same information, except for any sequences that have been added in the last 24 hours.
• This is an important consideration in your choice of database. If you need accurate and up-to-date information, you must search an up-to-date database.
(S Star slide: Ping)
Secondary databases
• Derived information, curated or processed
• Fruits of analyses of sequences in the primary sources:
  - patterns,
  - blocks,
  - profiles, etc.,
  which represent the most conserved features of multiple alignments
Examples of secondary databases
• Sequence-related information: ProSite, Enzyme, REBase
• Genome-related information: OMIM, TransFac
• Structure-related information: DSSP, HSSP, FSSP, PDBFinder
• Pathway information: KEGG, Pathways
Secondary databases in detail: OMIM
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=omim)
Examples of questions that can be answered with OMIM in Entrez
• What human genes are related to hypertension? Which of those genes are on chromosome 17? (strategy)
• List the OMIM entries that describe genes on chromosome 10. (strategy)
• List the OMIM entries that contain information about allelic variants. (strategy)
• Retrieve the OMIM record for the cystic fibrosis transmembrane conductance regulator (CFTR), and link to related protein sequence records via Entrez. (strategy)
• Find the OMIM record for the p53 tumor protein, and link out to related information in Entrez Gene and the p53 Mutation Database. (strategy)
The "strategy" links lead to the Sample Searches section in the document
(http://www.ncbi.nlm.nih.gov/Omim/omimhelp.html#MainFeatures)
Secondary databases in detail: KEGG portal
(http://www.genome.jp/kegg/)

Secondary databases in detail: KEGG pathways database
(http://www.genome.ad.jp/kegg/pathway.html)
KEGG pathway for asthma
(http://www.genome.ad.jp/kegg-bin/resize_map.cgi?map=hsa05310&scale=0.67)
Secondary databases in detail: NCBI dbGaP
(http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/about.html)

NCBI as portal to dbGaP
(http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap)
Tertiary databases
• Tertiary sources consist of information which is a distillation and collection of primary and secondary sources.
• These include:
  - structure databases
  - flatfile databases
1.c Searching databases
Where the h… is the d… thing?
• Start looking in some of the big systems (EMBL, NCBI, KEGG, etc.).
• Read their help pages.
• Use their data.
• Follow their hyperlinks.
Ensembl genome browser portal
• Ensembl is a joint project between EMBL-EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on eukaryotic genomes.
(http://www.ensembl.org/index.html)

Ensembl genome browser portal
(http://www.ensembl.org/Homo_sapiens/Info/Index)
Contigs
• In order to make it easier to talk about data gained by the shotgun method of sequencing, researchers have invented the word "contig".
• A contig is a set of gel readings that are related to one another by overlap of their sequences.
• All gel readings belong to one and only one contig, and each contig contains at least one gel reading.
• The gel readings in a contig can be summed to form a contiguous consensus sequence, and the length of this sequence is the length of the contig.
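The "summing" of overlapping gel readings into one contiguous sequence can be sketched with a toy merge function (a plain-Python illustration, not from the slides; real assemblers also handle mismatches, reverse complements, and base-quality scores):

```python
def merge_reads(a, b, min_overlap=3):
    """Merge read b onto read a using the longest suffix(a)/prefix(b) overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None  # no sufficient overlap: the reads belong to different contigs

# Two overlapping gel readings joined into one contig consensus:
contig = merge_reads("ATTGCCGGA", "CCGGATTAC")
print(contig)  # ATTGCCGGATTAC
```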
Entrez genome browser portal
(http://www.ncbi.nlm.nih.gov/)

NCBI Site Map

NCBI Site Map (continued)

NCBI Handbook

NCBI Handbook snapshot

NCBI Site Map

Entrez: an integrated database search and retrieval system
(http://www.ncbi.nlm.nih.gov/sites/gquery)
Information integration is essential: data aggregation from several databases
(Bioinformatics: Managing Scientific Data)
2 Data mining

2.1 Supervised machine learning

Introduction
• Machine learning (ML) is typically divided into two separate areas:
  - supervised ML (referred to as classification), and
  - unsupervised ML (referred to as clustering).
• Both types of machine learning are concerned with the analysis of datasets containing multivariate observations.
• There is a large amount of literature that can provide an introduction to these topics; here we refer to Breiman et al. (1984) and Hastie et al. (2001).
Introduction
• In supervised learning, a p-dimensional multivariate observation x is associated with a class label c (e.g., the Class column below).
  - The p components of datum x are called features. The objective is to "learn" a mathematical function f that can be evaluated on the input x to yield a prediction of its class c.

  SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7 SNP8 SNP9 SNP10 Class
     1    2    0    0    0    0    1    0    1     1     1
     1    2    1    1    0    2    0    0    0     1     1
     1    2    0    0    0    0    0    0    1     1     1
     2    1    0    0    0    0    2    2    1     0     1
     2    1    0    0    1    0    0    1    1     1     1
   ...
Introduction
• One issue that typically arises in ML applications to high-throughput biological data is feature selection.
  - For example, in the case of microarray data one typically has tens of thousands of features that were collected on all samples, but many will correspond to genes that are not expressed. Other features will be important for predicting one phenotype, but largely irrelevant for predicting other phenotypes. Thus, feature selection is an important issue.
Introduction
• Fundamental to the task of ML is selecting a measure of similarity among (or distance between) multivariate data points.
• We emphasize the term "selecting" here because it can easily be forgotten that the units in which features have been measured have no legitimate priority over other transformed representations that may lead to more biologically sensible criteria for classification.
• If we simply drop our expression data into a classification procedure, we have made an implicit selection to embed our observations in the feature space employed by the procedure. Oftentimes this feature space has Euclidean structure.
Introduction
• Effective classification requires attention to the possible transformations (equivalently, the distance metric in the implied feature space) of complex machine learning tools such as kernel support vector machines.
  - If we extended our expression data to include, say, squares of expression values for certain genes, a given classification procedure may perform very differently, even though the original data have only been deterministically transformed.
• In many cases the distance metric is more important than the choice of classification algorithm, and MLInterfaces makes it reasonably easy to explore different choices of distance.
Supervised machine learning checklist
1. Filter out features (genes) that show little variation across samples, or that are known not to be of interest. If appropriate, transform the data of each feature so that they are all on the same scale.
2. Select a distance, or similarity, measure. What does it mean for two samples to be close? Make sure that the selected distance embodies your notion of similarity.
3. Feature selection: select the features to be used for ML. If you are using cross-validation, be sure that feature selection according to your criteria, which may be data-dependent, is performed at each iteration.
4. Select the algorithm: which of the many ML algorithms do you want to use?
5. Assess the performance of your analysis. With supervised ML, performance is often assessed using cross-validation, but this itself can be performed in various ways.
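The warning in step 3 (feature selection must be redone inside every cross-validation iteration, never once on the full data) can be illustrated with a toy sketch. This is Python rather than the chapter's R, and the variance filter and nearest-centroid classifier are stand-ins chosen only for brevity:

```python
import random

def top_variance_features(X, m=2):
    """Toy filter: keep the m features with highest variance in the training data."""
    def var(j):
        col = [row[j] for row in X]
        mu = sum(col) / len(col)
        return sum((v - mu) ** 2 for v in col)
    return sorted(range(len(X[0])), key=var, reverse=True)[:m]

def nearest_centroid_fit(X, y):
    """Per-class feature means serve as a minimal classifier."""
    cents = {}
    for c in set(y):
        rows = [x for x, lab in zip(X, y) if lab == c]
        cents[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def nearest_centroid_predict(cents, x):
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def cv_error(X, y, k=3, seed=1):
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    errors = 0
    for f in range(k):
        test = idx[f::k]
        train = [i for i in idx if i not in test]
        # Feature selection happens INSIDE the fold, on training samples only,
        # so no information from the held-out samples leaks into the model.
        feats = top_variance_features([X[i] for i in train])
        sub = lambda i: [X[i][j] for j in feats]
        model = nearest_centroid_fit([sub(i) for i in train], [y[i] for i in train])
        errors += sum(nearest_centroid_predict(model, sub(i)) != y[i] for i in test)
    return errors / len(y)

# Two cleanly separated classes; the third feature is constant noise.
X = [[0, 0, 9], [0, 1, 9], [1, 0, 9], [5, 5, 9], [5, 6, 9], [6, 5, 9]]
y = [0, 0, 0, 1, 1, 1]
print(cv_error(X, y))  # 0.0 on this toy set
```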
Running example
• The ALL dataset contains over 100 samples, for a variety of different subtypes of leukemia.
• In particular, the ALL data consist of microarrays from 128 different individuals with acute lymphoblastic leukemia (ALL). There are 95 samples with B-cell ALL and 33 with T-cell ALL. These involve different tissues and different diseases.
• Two different analyses have been reported that are useful to read for more background: Chiaretti et al. 2004, 2005.
• The data have been normalized using RMA (see later) and stored in the form of an ExpressionSet… (What is that?)
Introduction
• Once Bioconductor and biocLite have been installed, you can find out more about the ALL package using the command openVignette() and selecting "1".
• You will then be directed to a PDF file:
  Opening C:/PROGRA~1/R/R-27~1.2/library/ALL/doc/ALLintro.pdf

source("http://www.bioconductor.org/getBioC.R")
getBioC()
source("http://bioconductor.org/biocLite.R")
biocLite("ALL")
library("ALL")
data("ALL")
class(ALL)
show(ALL)
Running example

slotNames(ALL)        ## note: slots like exprs and phenoData
                      ## can be accessed by the slot accessor "@"
                      ## or by functions like exprs() or pData()
levels(ALL$mol.biol)  ## list different molecular biology types
table(ALL$mol.biol)   ## frequencies of these

> slotNames(ALL)
[1] "assayData"         "phenoData"         "featureData"
[4] "experimentData"    "annotation"        ".__classVersion__"
> table(ALL$mol.biol)
ALL1/AF4  BCR/ABL E2A/PBX1      NEG   NUP-98  p15/p16
      10       37        5       74        1        1
Running example

## let's only select two molecular types:
selSamples <- ALL$mol.biol %in% c("ALL1/AF4", "E2A/PBX1")
ALLs <- ALL[, selSamples]
show(ALLs)
ALLs$mol.biol <- factor(ALLs$mol.biol)
ALLs$mol.biol

> show(ALLs)
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 15 samples
  element names: exprs
phenoData
  sampleNames: 04006, 08018, ..., LAL5 (15 total)
  varLabels and varMetadata description:
    cod: Patient ID
    diagnosis: Date of diagnosis
    ...: ...
    date last seen: date patient was last seen
    (21 total)
featureData
  featureNames: 1000_at, 1001_at, ..., AFFX-YEL024w/RIP1_at (12625 total)
  fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
  pubMedIds: 14684422 16243790
Annotation: hgu95av2
> ALLs$mol.biol <- factor(ALLs$mol.biol)
> ALLs$mol.biol
 [1] ALL1/AF4 E2A/PBX1 ALL1/AF4 ALL1/AF4 ALL1/AF4 ALL1/AF4 E2A/PBX1 ALL1/AF4
 [9] E2A/PBX1 ALL1/AF4 ALL1/AF4 ALL1/AF4 E2A/PBX1 ALL1/AF4 E2A/PBX1
Levels: ALL1/AF4 E2A/PBX1
Running example

## add molecular biology type to colnames of samples
colnames(exprs(ALLs))
colnames(exprs(ALLs)) <- paste(ALLs$mol.biol, colnames(exprs(ALLs)))
colnames(exprs(ALLs))

> colnames(exprs(ALLs))
 [1] "04006" "08018" "15004" "16004" "19005" "24005" "24019" "26008" "28003"
[10] "28028" "28032" "31007" "36001" "63001" "LAL5"
> colnames(exprs(ALLs)) <- paste(ALLs$mol.biol, colnames(exprs(ALLs)))
> colnames(exprs(ALLs))
 [1] "ALL1/AF4 04006" "E2A/PBX1 08018" "ALL1/AF4 15004" "ALL1/AF4 16004"
 [5] "ALL1/AF4 19005" "ALL1/AF4 24005" "E2A/PBX1 24019" "ALL1/AF4 26008"
 [9] "E2A/PBX1 28003" "ALL1/AF4 28028" "ALL1/AF4 28032" "ALL1/AF4 31007"
[13] "E2A/PBX1 36001" "ALL1/AF4 63001" "E2A/PBX1 LAL5"

hist(exprs(ALLs))  ## use the accessor exprs(); note that ALLs@exprs would
                   ## fail, since ExpressionSet has no slot named "exprs"
Corresponding output
Curtailing the data to our needs
• In the code below we load the ALL data again, and then subset them to the particular phenotypes in which we are interested.
• The specific information we need is to select those with B-cell ALL, and then within that subset, those that are NEG and those that are labeled as BCR/ABL.
• The last line in the code below is used to drop unused levels of the factor encoding mol.biol.
Curtailing the data to our needs
The comparison of BCR/ABL to NEG is difficult, and the error rates are typically quite high. You could instead compare BCR/ABL to ALL1/AF4; they are rather easy to distinguish and the error rates should be smaller.
Non-specific filtering of features
• Nonspecific filtering removes those genes that we believe are not sufficiently informative for any phenotype, so that there is little point in considering them further.
  - For the purpose of this teaching exercise, we use a very stringent filter so that the dataset is small and the examples run quickly; in practice you would probably use a less stringent filter.
• We use the function nsFilter from the genefilter package to filter on a number of different criteria.
  - For instance, by default it removes the control probes on Affymetrix arrays, which can be identified by their AFFX prefix.
  - We also exclude genes without Entrez Gene identifiers, and select the top 25% of genes on the basis of variability across samples.
Non-specific filtering of features

library("genefilter")
ALLfilt_bcrneg = nsFilter(ALL_bcrneg, var.cutoff=0.75)$eset

> class(ALLfilt_bcrneg)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
Feature selection and standardization
• Feature selection is an important component of machine learning.
• Typically the identification and selection of features used for supervised ML relies on knowledge of the system being studied, and on univariate assessments of predictive capability.
  - Among the more commonly used methods are the selection of features that are predictive using t-statistics and ROC curves (at least for two-sample problems).
Interludium on ROC curves
(http://gim.unmc.edu/dxtests/ROC1.htm)
How to draw a ROC curve

How to draw a ROC curve
(http://www.medcalc.be/manual/roc.php)
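The construction behind a ROC curve can be sketched directly: sweep the decision threshold over the observed marker values and record the (false positive rate, true positive rate) pair at each cut-off. A minimal Python illustration, not taken from the linked pages:

```python
def roc_points(scores, labels):
    """Sweep the threshold over the observed scores and record
    (false positive rate, true positive rate) at each cut-off.
    labels are 1 (diseased/positive) or 0 (healthy/negative)."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

# A perfectly separating marker: the curve reaches (0, 1).
print(roc_points([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))
```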
Feature selection and standardization (continued)
  - In order to correctly assess error rates it is essential to account for the effects of feature selection. If cross-validation is used, then feature selection must be incorporated within the cross-validation process and not performed ahead of time using all of the data.
• A second important aspect is standardization. For gene expression data the recorded expression level is not directly interpretable, and so users must be careful to ensure that the statistics used are comparable.
• This standardization ensures that all genes have equal weighting in the ML applications.
  - In most cases this is most easily achieved by standardizing the expression data, within genes, across samples. In some cases (such as with a t-test) there is no real need to standardize because the statistic itself is standardized.
Feature selection and standardization (continued)
• In the code segments below, we standardize all gene expression values. It is important that nonspecific filtering has already been performed.
• We first write a helper function to compute a row-wise interquartile range (IQR) for us.

rowIQRs = function(eSet) {
    numSamp = ncol(eSet)
    lowQ = rowQ(eSet, floor(0.25 * numSamp))
    upQ = rowQ(eSet, ceiling(0.75 * numSamp))
    upQ - lowQ
}

• Next we subtract the row medians and divide by the row IQRs. Again, we write a helper function, standardize, that does most of the work.

standardize = function(x) (x - rowMedians(x)) / rowIQRs(x)
exprs(ALLfilt_bcrneg) = standardize(exprs(ALLfilt_bcrneg))
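The same median/IQR standardization can be written out in plain Python to make the arithmetic explicit (a sketch, not the Biobase implementation; the quartiles mimic the order-statistic indices used in the rowQ() calls above):

```python
import math

def standardize_rows(mat):
    """Subtract each row's median and divide by its IQR, so every
    row (gene) is on a comparable scale across samples."""
    out = []
    for row in mat:
        s = sorted(row)
        n = len(s)
        med = (s[(n - 1) // 2] + s[n // 2]) / 2
        low = s[max(math.floor(0.25 * n) - 1, 0)]  # cf. rowQ(eSet, floor(0.25*n))
        up = s[math.ceil(0.75 * n) - 1]            # cf. rowQ(eSet, ceiling(0.75*n))
        iqr = (up - low) or 1.0                    # guard: constant rows
        out.append([(v - med) / iqr for v in row])
    return out

print(standardize_rows([[1, 2, 3, 4]]))  # [[-0.75, -0.25, 0.25, 0.75]]
```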
Selecting a distance
• To some extent your choices here are not always that flexible, because many ML algorithms have a certain choice of distance measure, say, the Euclidean distance, built in.
• In such cases, you still have the choice of transformation of the variables; examples are
  - coordinatewise logarithmic transformation,
  - the linear Mahalanobis transformation, and
  - other linear or nonlinear projections of the original features into a (possibly lower-dimensional) space.
Selecting a distance
• If the ML algorithm does allow explicit specification of the distance metric, there are a number of different tools in R to compute the distance between objects.
  - They include the function dist, the function daisy from the cluster package (Kaufman and Rousseeuw, 1990), and the functions in the bioDist package.
• The dist function computes the distance between rows of an input matrix.
  - We want the distances between samples (and not genes), thus we transpose the matrix using the function t.
  - The return value is an instance of the dist class. Because this class is not supported by some R functions that we want to use, we also convert it to a matrix.
Selecting a distance

eucD = dist(t(exprs(ALLfilt_bcrneg)))
eucM = as.matrix(eucD)
dim(eucM)

• We next visualize the distances using a heatmap.
  - In the code below we generate a range of colors to use in the heatmap.
  - The RColorBrewer package provides a number of different palettes, and we have selected one that uses red and blue. Because we want red to correspond to high values, and blue to low, we must reverse the palette.

library("RColorBrewer")
hmcol = colorRampPalette(brewer.pal(10, "RdBu"))(256)
hmcol = rev(hmcol)
heatmap(eucM, sym=TRUE, col=hmcol, distfun=as.dist)
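What dist(t(exprs(...))) computes, a symmetric matrix of pairwise Euclidean distances between sample vectors, can be sketched in a few lines of Python (for illustration only; R's dist is the tool to use in practice):

```python
def euclidean_dist_matrix(samples):
    """Pairwise Euclidean distances between sample vectors (rows)."""
    n = len(samples)
    d = [[0.0] * n for _ in range(n)]
    for i, a in enumerate(samples):
        for j, b in enumerate(samples):
            d[i][j] = sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return d

m = euclidean_dist_matrix([[0, 0], [3, 4]])
print(m[0][1])  # 5.0, and the matrix is symmetric with a zero diagonal
```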
Heatmap of the between-sample distances
(figure: a heatmap of the between-sample distances in our example data)
Machine learning
• The user interfaces (i.e., the calling parameters and return values) of the machine learning algorithms that are available in R are quite diverse, and this can make switching your application code from one machine learning algorithm to another tedious.
• For this reason, the MLInterfaces package provides wrappers around the various machine learning algorithms that accept a standardized set of calling parameters and produce a standardized return value.
Machine learning
• The package does not implement any of the machine learning algorithms; it just converts the in- and out-going data structures into the appropriate format.
  - In general, the name of the function or method remains the same, but an I is appended; so, for instance, we use the MLInterfaces function knnI to interface to the function knn from the class package.
Machine learning
• It is easiest to understand most supervised ML methods in the setting where one has both a
  - training set on which to build the model, and a
  - test set on which to test the model.
• We begin by artificially dividing our data into a test and training set. Such a dichotomy is not actually that useful, and in practice one tends to rely on cross-validation or other similar schemes (see later).

Negs = which(ALLfilt_bcrneg$mol.biol == "NEG")
Bcr = which(ALLfilt_bcrneg$mol.biol == "BCR/ABL")
set.seed(1969)
S1 = sample(Negs, 20, replace=FALSE)
S2 = sample(Bcr, 20, replace=FALSE)
TrainInd = c(S1, S2)
TestInd = setdiff(1:79, TrainInd)
Machine learning

> Negs
 [1]  2  4  5  6  7  8 11 12 14 19 22 24 26 28 31 35 37 38 39 43 44 45 46 49 50
[26] 51 52 54 55 56 57 58 61 62 65 66 67 68 70 74 75 77
> ALLfilt_bcrneg$mol.biol
 [1] BCR/ABL NEG     BCR/ABL NEG     NEG     NEG     NEG     NEG     BCR/ABL
[10] BCR/ABL NEG     NEG     BCR/ABL NEG     BCR/ABL BCR/ABL BCR/ABL BCR/ABL
[19] NEG     BCR/ABL BCR/ABL NEG     BCR/ABL NEG     BCR/ABL NEG     BCR/ABL
[28] NEG     BCR/ABL BCR/ABL NEG     BCR/ABL BCR/ABL BCR/ABL NEG     BCR/ABL
[37] NEG     NEG     NEG     BCR/ABL BCR/ABL BCR/ABL NEG     NEG     NEG
[46] NEG     BCR/ABL BCR/ABL NEG     NEG     NEG     NEG     BCR/ABL NEG
[55] NEG     NEG     NEG     NEG     BCR/ABL BCR/ABL NEG     NEG     BCR/ABL
[64] BCR/ABL NEG     NEG     NEG     NEG     BCR/ABL NEG     BCR/ABL BCR/ABL
[73] BCR/ABL NEG     NEG     BCR/ABL NEG     BCR/ABL BCR/ABL
Levels: BCR/ABL NEG
Machine learning
• The term confusion matrix is typically used to refer to the table that cross-classifies the test set predictions with the true test set class labels.
• The MLInterfaces package provides a function called confuMat that will compute this matrix from most inputs.
• Let's get the MLInterfaces package first...
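The cross-classification that confuMat performs is easy to state explicitly. A minimal Python sketch (not the MLInterfaces implementation), with rows indexed by the given label and columns by the prediction, as in the confuMat output shown later:

```python
from collections import Counter

def confusion_matrix(truth, predicted):
    """Cross-classify true vs. predicted class labels."""
    counts = Counter(zip(truth, predicted))
    labels = sorted(set(truth) | set(predicted))
    return {g: {p: counts[(g, p)] for p in labels} for g in labels}

cm = confusion_matrix(["NEG", "NEG", "BCR/ABL"], ["NEG", "BCR/ABL", "BCR/ABL"])
print(cm["NEG"]["BCR/ABL"])  # 1 true NEG sample was predicted BCR/ABL
```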
Machine learning
• In every machine learning algorithm one can, at least conceptually, make one of three decisions:
  - To classify the sample into one of the known classes as defined by the training set.
  - To indicate doubt: the sample is somehow between two or more classes and there is no clear indication as to which class it belongs.
  - To indicate that the sample is an outlier, in the sense that it is so dissimilar to all samples in the training set that no sensible classification is possible.
K nearest neighbors classification (KNN)
• k-nearest neighbor classification of a test set from a training set works as follows:
  - For each "row" of the test set, the k nearest (in Euclidean distance) training set vectors are found,
  - and the classification is decided by majority vote,
  - with ties broken at random.
  - If there are ties for the kth nearest vector, all candidates are included in the vote.
K nearest neighbors classification (KNN)
(www.wikipedia.org)
• Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles.
  - If k = 3, it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle.
  - If k = 5, it is classified to the first class (3 squares vs. 2 triangles inside the outer circle).
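The voting procedure above can be sketched in a few lines of Python (a toy illustration, not the class::knn implementation; for brevity it does not handle ties for the k-th nearest neighbor specially):

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Classify x by majority vote among its k nearest training points
    in Euclidean distance."""
    dist = lambda a: sum((u - v) ** 2 for u, v in zip(a, x)) ** 0.5
    neighbours = sorted(zip(train_x, train_y), key=lambda t: dist(t[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# The Wikipedia picture in miniature: squares cluster left, triangles right.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = ["square", "square", "square", "triangle", "triangle", "triangle"]
print(knn_predict(X, y, (4.5, 5), k=3))  # triangle
```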
KNN code using MLInterfaces

> krun = MLearn(mol.biol ~ ., data=ALLfilt_bcrneg, knnI(k=1, l=0), TrainInd)
> krun
MLInterfaces classification output container
The call was:
MLearn(formula = mol.biol ~ ., data = ALLfilt_bcrneg, method = knnI(k = 1, l = 0), trainInd = TrainInd)
Predicted outcome distribution for test set:
BCR/ABL     NEG
     17      22
...
> names(RObject(krun))
[1] "traindat" "ans"      "traincl"
> confuMat(krun)
         predicted
given     BCR/ABL NEG
  BCR/ABL      10   7
  NEG           7  15
Linear discriminant analysis (LDA)
• Originally developed in 1936 by R.A. Fisher, discriminant analysis is a classic method of classification that has stood the test of time. Discriminant analysis often produces models whose accuracy approaches (and occasionally exceeds) that of more complex modern methods.
• Discriminant analysis can be used only for classification (i.e., with a categorical target variable), not for regression. The target variable may have two or more categories.
• A transformation function is found that maximizes the ratio of between-class variance to within-class variance, as illustrated by the figure produced by Ludwig Schwardt and Johan du Preez.
LDA code using MLInterfaces

> ldarun = MLearn(mol.biol ~ ., data=ALLfilt_bcrneg, ldaI, TrainInd)
> ldarun
MLInterfaces classification output container
The call was:
MLearn(formula = mol.biol ~ ., data = ALLfilt_bcrneg, method = ldaI, trainInd = TrainInd)
Predicted outcome distribution for test set:
BCR/ABL     NEG
     12      27
> names(RObject(ldarun))
 [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"
 [8] "call"    "terms"   "xlevels"
> confuMat(ldarun)
         predicted
given     BCR/ABL NEG
  BCR/ABL      10   7
  NEG           2  20
Diagonal linear discriminant analysis (DLDA)
• DLDA is the maximum likelihood discriminant rule, for multivariate normal class densities, when the class densities have the same diagonal variance-covariance matrix (i.e., variables are uncorrelated, and for each variable, its variance is the same in all classes).
• In spite of its simplicity and its somewhat unrealistic assumptions (independent multivariate normal class densities), this method has been found to work very well.
• In contrast to the more common Fisher's LDA technique, DLDA works even when the number of cases is smaller than the number of variables. Details and explanations of DLDA can be found in Dudoit et al. (2002).
Diagonal linear discriminant analysis (DLDA)
• The assumptions of DLDA give rise to a simple linear rule, where a sample x is assigned to the class k which minimizes

  \sum_{j=1}^{p} \frac{(x_j - \bar{x}_{kj})^2}{\hat{\sigma}_j^2},

  where p is the number of variables, x_j is the value on variable (gene) j of the test sample, \bar{x}_{kj} is the sample mean of class k and variable (gene) j, and \hat{\sigma}_j^2 is the (pooled) estimate of the variance of gene j.
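This rule is simple enough to implement directly. A toy Python sketch of DLDA (not the MLInterfaces dldaI implementation; variables with zero pooled variance are not handled):

```python
def dlda_fit(X, y):
    """Estimate the per-class means and the pooled per-variable variances
    that appear in the DLDA minimization rule."""
    classes = sorted(set(y))
    means = {}
    for c in classes:
        rows = [x for x, lab in zip(X, y) if lab == c]
        means[c] = [sum(col) / len(rows) for col in zip(*rows)]
    p, n = len(X[0]), len(X)
    # pooled variance: squared deviations from the class means, summed over
    # all samples, divided by (n - number of classes)
    var = [sum((x[j] - means[lab][j]) ** 2 for x, lab in zip(X, y)) / (n - len(classes))
           for j in range(p)]
    return means, var

def dlda_predict(model, x):
    """Assign x to the class minimizing sum_j (x_j - mean_kj)^2 / var_j."""
    means, var = model
    score = lambda c: sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[c], var))
    return min(means, key=score)

model = dlda_fit([[0, 0], [1, 1], [4, 4], [5, 5]], ["a", "a", "b", "b"])
print(dlda_predict(model, [3.5, 4.2]))  # b
```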
DLDA code using MLInterfaces

> dldarun = MLearn(mol.biol ~ ., data=ALLfilt_bcrneg, dldaI, TrainInd)
Loading required package: sma
> dldarun
MLInterfaces classification output container
The call was:
MLearn(formula = mol.biol ~ ., data = ALLfilt_bcrneg, method = dldaI, trainInd = TrainInd)
Predicted outcome distribution for test set:
BCR/ABL     NEG
     21      18
> names(RObject(dldarun))
[1] "traindat" "ans"      "traincl"
> confuMat(dldarun)
         predicted
given     BCR/ABL NEG
  BCR/ABL      13   4
  NEG           8  14
Machine learning
• Some features that remained after our non-specific filtering procedure are not likely to be predictive of the phenotypes of interest.
• What happens if we instead select genes that are able to discriminate between those with BCR/ABL and those samples labeled NEG?
• We use the t-test to select genes; those with small p-values for comparing BCR/ABL to NEG are used.
• Although it is tempting to use all the data to do this selection, that is not really a good idea, as it tends to give misleadingly low values for the error rates. Ever heard of "data snooping"?
(adapted from http://www.travelnotes.de/rays/fortran/snoopy.gif)
Machine learning
• In the code below, we compute the t-tests on the training set, then sort them from largest to smallest, and then obtain the names of the 50 genes that have the largest observed test statistics.
• Note:

> Traintt[1,]
           statistic         dm   p.value
41654_at    -1.01298 -0.1983496 0.3174765
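Ranking features by the size of a two-sample t-statistic, computed on the training samples only, can be sketched as follows (a toy Python illustration using the Welch form of the statistic; genefilter's rowttests is the practical tool in R):

```python
import math

def t_statistic(a, b):
    """Two-sample t-statistic (Welch form) for one feature."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return (ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def top_features(Xa, Xb, m=2):
    """Rank features by |t| computed on the two training groups Xa and Xb,
    and keep the top m. When cross-validating, this selection must be
    repeated inside each fold."""
    p = len(Xa[0])
    ts = [abs(t_statistic([r[j] for r in Xa], [r[j] for r in Xb])) for j in range(p)]
    return sorted(range(p), key=lambda j: ts[j], reverse=True)[:m]

# Feature 0 separates the two groups; feature 1 is identical in both.
print(top_features([[5, 1], [6, 2]], [[0, 1], [1, 2]], m=1))  # [0]
```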
Machine learning
• Now we can see how well the different machine learning algorithms work when the features have been selected to help discriminate between the two groups.
• For instance, with KNN:

> BNf = ALLfilt_bcrneg[fNtt,]
> knnf = MLearn(mol.biol ~ ., data=BNf, knnI(k=1, l=0), TrainInd)
> confuMat(knnf)
         predicted
given     BCR/ABL NEG
  BCR/ABL      14   3
  NEG           1  21

• What do you conclude?