
Support Vector Machines for Classification of Flow Data

  1. Support Vector Machines for Classification of Flow Data
     FlowCap 2010
     John Quinn, Ph.D., Treestar (john@treestar.com)
     Funded by SBIR Grant # R43 RR024094-01A1

  2. Our Objective
     • Demonstrate that supervised training algorithms can effectively replicate user-created gates
       – Very useful for high-throughput settings
       – Can increase robustness
     • We believe this will be the first application in which algorithmic gate placement becomes the norm.

  3. Selected Algorithm
     • Support Vector Machine (SVM)
       – Radial kernel
     • Supervised linear classifier that solves an optimization problem to find the hyperplane(s) that separate the classes with the maximum distance between them
       – With a non-linear mapping, data that are not linearly separable can also be classified

  4. SVM Operation
     Optimization:
     • Determine which elements of the training data mark the boundary of maximum distance, D, between two classes; these elements are the support vectors
     [Figure labels: Class 1, Class 2, Support vectors, D (maximum separation)]

  5. SVM Operation
     • Optimization problem
       – For data: [equation not preserved]
       – A hyperplane that separates any two classes can be defined as: [equation not preserved]
       – For c_i = 1: [equation not preserved]
       – For c_i = -1: [equation not preserved]
       – Knowing that the data points should be outside of the margin, we can impose the constraint: [equation not preserved]
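
The equations on this slide were images and did not survive extraction. As a sketch only, assuming the standard hard-margin formulation with the sign convention w·x − b = 0 (the convention and the symbols x_i, p, and n are assumptions, not taken from the slide), the setup would read:

```latex
\begin{align*}
\text{Data: } & \mathcal{D} = \{(\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\}\}_{i=1}^{n} \\
\text{Separating hyperplane: } & \mathbf{w}\cdot\mathbf{x} - b = 0 \\
\text{For } c_i = 1: \quad & \mathbf{w}\cdot\mathbf{x}_i - b \ge 1 \\
\text{For } c_i = -1: \quad & \mathbf{w}\cdot\mathbf{x}_i - b \le -1 \\
\text{Combined constraint: } & c_i\,(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1 \quad \text{for all } i
\end{align*}
```

The combined constraint is just the two class-wise inequalities multiplied through by c_i, which is why a single expression covers both classes.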

  6. SVM Operation
     • We know that the support vectors will have a perpendicular distance from the hyperplane of: [equations not preserved]
     • The distance between SVs can then be expressed as: [equation not preserved]
     • So the optimization maximizes the separation D, which is equivalent to minimizing ‖w‖
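
The distance expressions on this slide are also missing. Under the usual hard-margin convention w·x − b = 0 (an assumption, since the slide's own notation is lost), the support vectors lie on two margin hyperplanes, and the textbook expressions would be:

```latex
\begin{align*}
\text{Margin hyperplanes: } & \mathbf{w}\cdot\mathbf{x} - b = 1 \quad\text{and}\quad \mathbf{w}\cdot\mathbf{x} - b = -1 \\
\text{Distance of each from } \mathbf{w}\cdot\mathbf{x} - b = 0: \quad & \frac{1}{\lVert\mathbf{w}\rVert} \\
\text{Distance between them: } & D = \frac{2}{\lVert\mathbf{w}\rVert} \\
\text{Optimization: } & \max_{\mathbf{w},\,b} D \iff \min_{\mathbf{w},\,b} \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
  \quad \text{subject to } c_i\,(\mathbf{w}\cdot\mathbf{x}_i - b) \ge 1
\end{align*}
```

Because D is inversely proportional to ‖w‖, making the separation as large as possible is usually stated as minimizing ‖w‖ (or, for convenience, half its square).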

  7. SVM Operation
     • We then use the inequality, [equation not preserved], as a constraint to fix a critical point, and use Lagrange multipliers, α_i, to express w as a linear combination of the training vectors: [equation not preserved]
     • The support vectors, N_SV in number, are then the x_i associated with non-zero (strictly positive) Lagrange multipliers
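
A sketch of the step described here, assuming the standard hard-margin Lagrangian with the convention w·x − b = 0 (the exact form on the slide is not preserved, so this is the textbook version rather than a transcription):

```latex
\begin{align*}
L(\mathbf{w}, b, \boldsymbol{\alpha}) &= \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[ c_i\,(\mathbf{w}\cdot\mathbf{x}_i - b) - 1 \right],
  \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial \mathbf{w}} = 0 \;&\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i \mathbf{x}_i \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\; \sum_{i=1}^{n} \alpha_i c_i = 0
\end{align*}
```

All multipliers satisfy α_i ≥ 0, but only the points lying exactly on the margin end up with α_i > 0; those are the support vectors, and every other training point drops out of the sum for w.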

  8. SVM Operation
     • Once w is known and the support vectors have been identified, b can be solved as: [equation not preserved]
     • If there are more than two classes, the operation remains the same, but the hyperplanes are determined either as one versus all or pairwise
       – We chose a one-versus-all format
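
A minimal sketch of these two steps in plain Python. It assumes the convention w·x − b = 0 and the common choice of averaging b over the support vectors, b = (1/N_SV) Σ (w·x_i − c_i); the slide's own formula is not preserved, and all function names here are hypothetical. The one-versus-all rule shown (pick the class whose hyperplane gives the largest decision value) is the usual one, not necessarily the exact rule used in the presented work.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def weight_vector(alphas, labels, points):
    """w = sum_i alpha_i * c_i * x_i (linear combination of training vectors)."""
    dim = len(points[0])
    w = [0.0] * dim
    for a, c, x in zip(alphas, labels, points):
        for k in range(dim):
            w[k] += a * c * x[k]
    return w


def bias(w, alphas, labels, points, tol=1e-12):
    """Average of (w.x_i - c_i) over the support vectors (alpha_i > 0)."""
    sv = [(c, x) for a, c, x in zip(alphas, labels, points) if a > tol]
    return sum(dot(w, x) - c for c, x in sv) / len(sv)


def decision(w, b, x):
    """Signed decision value; positive means the c = +1 side."""
    return dot(w, x) - b


def one_vs_all_predict(models, x):
    """models maps class id -> (w, b), one hyperplane per class versus the rest."""
    return max(models, key=lambda k: decision(models[k][0], models[k][1], x))


# Toy two-point problem (illustrative values, not from the slides).
points = [(2.0, 0.0), (0.0, 0.0)]
labels = [1, -1]
alphas = [0.5, 0.5]
w = weight_vector(alphas, labels, points)
b = bias(w, alphas, labels, points)

# Two hypothetical one-versus-all models.
models = {"A": ([1.0, 0.0], 1.0), "B": ([-1.0, 0.0], -1.0)}
```

In the toy problem both points are support vectors, and each sits exactly on its margin: the decision value is +1 at (2, 0) and −1 at (0, 0).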

  9. SVM Operation
     • Data not linearly separable? Map it to a space where it is!
       – We assume that flow data will have a Gaussian distribution and selected a Gaussian mapping
     [Figure: Input Space → Mapped Space]
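
To illustrate the idea, here is a small sketch of a Gaussian (radial basis function) kernel, K(x, y) = exp(−γ‖x − y‖²), applied to one-dimensional toy data. The value of gamma, the landmark at 0, the threshold, and the data are all illustrative assumptions; the slides do not state the kernel parameters that were used.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# 1-D toy data: class +1 sits in the middle, class -1 on both sides,
# so no single threshold on x separates the two classes.
inner = [0.0, 0.5, -0.5]   # class +1
outer = [2.0, -2.0, 3.0]   # class -1


def mapped(x, landmark=0.0, gamma=1.0):
    """Map a scalar to its kernel similarity with a landmark point."""
    return rbf_kernel((x,), (landmark,), gamma)


inner_mapped = [mapped(x) for x in inner]
outer_mapped = [mapped(x) for x in outer]

# In the mapped space a single threshold now separates the classes.
threshold = 0.5
separable = min(inner_mapped) > threshold > max(outer_mapped)
```

After the mapping, points near the landmark get values close to 1 and distant points get values close to 0, so the two classes fall on opposite sides of one threshold.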

  10. Why use an SVM?
      • SVMs are deterministic
      • They find the global maximum rather than a local maximum
        – If the training data are representative of the real data, you cannot do better.
      • SVMs are fast
        – They solve a maximization problem, as opposed to doing an iterative fitting

  11. Preprocessing
      • To prepare the training data, we:
        – Normalized the data to a range of -1 to 1
        – Identified the training data set with the largest number of clusters
          • Used this data set as the reference set
        – Calculated the centroid of each cluster in the reference set
        – In all other training data, calculated the Euclidean distance of each cluster to the clusters in the reference set and assigned them cluster IDs matching the reference cluster with the smallest distance measure
        – Took a sample of each training data set and combined them into one training vector to present to the SVM
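
The normalization and cluster-matching steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the scaling is done independently per parameter (column), that each cluster is compared to the reference by the distance between centroids, and all function names are hypothetical.

```python
import math


def normalize_columns(rows):
    """Linearly rescale each column (parameter) to the range [-1, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [
            0.0 if hi == lo else 2.0 * (v - lo) / (hi - lo) - 1.0
            for v, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


def centroid(points):
    """Mean position of a cluster's points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def match_to_reference(clusters, reference):
    """Map each cluster ID to the reference cluster ID with the nearest centroid.

    Both arguments are dicts of cluster ID -> list of points.
    """
    ref_centroids = {rid: centroid(pts) for rid, pts in reference.items()}
    mapping = {}
    for cid, pts in clusters.items():
        c = centroid(pts)
        mapping[cid] = min(
            ref_centroids, key=lambda rid: math.dist(c, ref_centroids[rid])
        )
    return mapping


# Toy data (illustrative only).
scaled = normalize_columns([[0, 10], [5, 20], [10, 30]])
reference = {1: [[0, 0], [0, 2]], 2: [[10, 10], [12, 10]]}
clusters = {"a": [[9, 9], [11, 11]], "b": [[1, 1]]}
mapping = match_to_reference(clusters, reference)
```

Note that nearest-centroid matching can assign two clusters in one file to the same reference ID; the slides do not say how such ties were handled, so the sketch leaves them as they fall.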

  12. Algorithm choice
      • Matlab has a free file-sharing repository
        – Someone has already put almost any algorithm you can think of into code
      • I used the SVM coded by Junshui Ma and Yi Zhao of Ohio State University
        – It received 5 stars

  13. Training Data
      • Example training data
        – Showing parameters 1 & 2, and 3 & 4 of the stem cell data set

  14. Results

  15. Results
      Speed:
      Data set    Training time   Classification time
      CFSE        4 sec           2 min 48 sec (13 files)
      DLBCL       5 sec           67 sec (30 files)
      GvHD        5 sec           38 sec (12 files)
      NDD         11 sec          27 min 28 sec (30 files)
      Stem cell   4 sec           19 sec (30 files)
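
For a rough per-file comparison, the classification times above can be divided by the number of files. This sketch assumes each reported classification time is the total over the listed files, which the slide does not state explicitly.

```python
# (total classification time in seconds, number of files), from the slide.
timings = {
    "CFSE": (2 * 60 + 48, 13),
    "DLBCL": (67, 30),
    "GvHD": (38, 12),
    "NDD": (27 * 60 + 28, 30),
    "Stem cell": (19, 30),
}

# Average classification time per file, in seconds.
per_file = {name: secs / files for name, (secs, files) in timings.items()}

for name, secs in per_file.items():
    print(f"{name:10s} {secs:6.1f} s/file")

slowest = max(per_file, key=per_file.get)
```

On this reading, NDD is the slowest data set per file and the stem cell data set is the fastest.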

  16. Room for improvement…
      • The SVMs are highly dependent on identifying a transform that maps the data to a linearly separable space.
      • We could experiment with a number of different transforms

  17. FlowCap Feedback
      • What went well
        – Data easily available
        – Submission process easy
        – Questions answered immediately!
      • What could be improved
        – Wider publicity, particularly outside of our domain

  18. Questions?
