tttt BDT
Nick Amin
September 29, 2018
Overview
⚫ Last time, showed the cut-based analysis with the latest data and a luminosity of (35.87+41.53+35.53 =) 112.9 fb⁻¹, getting around 2.84 σ expected significance
⚫ Repeat with an updated BDT (previously, had a 19-variable TMVA BDT trained with 2016 samples)
⚫ Explore xgboost instead of TMVA, and in any case, retrain the TMVA BDT with 2016+2017 samples for more statistics
  • One intermediate goal is to come up with a sane binning scheme/formula (rather than trying random partitions and picking the best one)
Input details, TMVA
⚫ 19 variables (listed below) extracted from 2016+2017 MC
  • Looser baseline for more stats: Njets ≥ 2, Nb ≥ 1, HT ≥ 250, MET ≥ 30, lepton pT ≥ 15
  • No CRZ included, because we will separate that into its own bin for the full analysis
⚫ All numbers in these slides should be consistent, and associated with a luminosity of 35.9+41.5 = 77.4 fb⁻¹ — multiply significances by 1.2 to project to 112.9 fb⁻¹, or 1.3 to project to 132 fb⁻¹
⚫ I checked that the discriminator shape for signal is essentially the same for OS and SS events, so include signal OS events to double the statistics
  • ~400k unweighted signal and background events in total
⚫ Retrain the TMVA BDT with the configuration below (found from the hyperparameter scan last time)
  • Key points — 500 trees with a depth of 5 using the AdaBoost algorithm

    feature_names = [
        "nbtags", "njets", "met", "ptl2", "nlb40", "ntb40", "nleps", "htb",
        "q1", "ptj1", "ptj6", "ptj7", "ml1j1", "dphil1l2", "maxmjoverpt",
        "ptl1", "detal1l2", "ptj8", "ptl3",
    ]

    method = factory.BookMethod(loader, ROOT.TMVA.Types.kBDT, "BDT",
        ":".join([
            "!H",
            "!V",
            "NTrees=500",
            "nEventsMin=150",
            "MaxDepth=5",
            "BoostType=AdaBoost",
            "AdaBoostBeta=0.25",
            "SeparationType=GiniIndex",
            "nCuts=20",
            "PruneMethod=NoPruning",
        ]))
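For concreteness, a minimal sketch of how the variable list and the booking options above would typically be wired into a PyROOT TMVA training. The file names, tree name, and weight branch are assumptions for illustration, not taken from the slides.

    import ROOT

    ROOT.TMVA.Tools.Instance()
    fout = ROOT.TFile("tmva_output.root", "RECREATE")
    factory = ROOT.TMVA.Factory("tttt", fout, "!V:!Silent:AnalysisType=Classification")
    loader = ROOT.TMVA.DataLoader("dataset")

    # feature_names is the 19-variable list shown above
    for name in feature_names:
        loader.AddVariable(name, "F")

    # hypothetical input files/trees and weight branch
    sig = ROOT.TFile.Open("signal.root").Get("t")
    bkg = ROOT.TFile.Open("background.root").Get("t")
    loader.AddSignalTree(sig, 1.0)
    loader.AddBackgroundTree(bkg, 1.0)
    loader.SetWeightExpression("weight")
    loader.PrepareTrainingAndTestTree(ROOT.TCut(""), "SplitMode=Random:NormMode=NumEvents")

    # BookMethod call with the options listed above goes here, then:
    factory.TrainAllMethods()
    factory.TestAllMethods()
    factory.EvaluateAllMethods()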
xgboost
⚫ Preprocessing
  • Use the absolute value of weights in training for reasons of stability
  • When re-weighting signal and background to have average weights of 1, throw away a small (sub-percent) fraction of events that have large relative weights, mainly from x+gamma
⚫ Tried to use the BayesianOptimization package to get optimal hyperparameters
  • This attempts to iteratively find the best point by exploring regions for which "information gained" is maximized
  • Turns out once you get the learning rate (eta), the number of trees, and the subsampling fraction right, the rest don't matter/matter very little
⚫ Also naively tried Condor (pick random points and submit ~4-5k trainings)
  • Same story here
⚫ To avoid picking an overtrained hyperparameter set, rather than pick exactly the best point, I used representative values for the parameters below (definitions documented here) and made the numbers more round
⚫ Key points here
  • 500 trees, depth of 5 — same as TMVA
  • Gradient boosting algorithm instead of AdaBoost — this can actually affect the shape of the discriminator output

    num_trees = 500
    param['objective'] = 'binary:logistic'
    param['eta'] = 0.07
    param['max_depth'] = 5
    param['silent'] = 1
    param['nthread'] = 15
    param['eval_metric'] = "auc"
    param['subsample'] = 0.6
    param['alpha'] = 8.0
    param['gamma'] = 2.0
    param['lambda'] = 1.0
    param['min_child_weight'] = 1.0
    param['colsample_bytree'] = 1.0
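A minimal sketch of the weight preprocessing described above (absolute weights, rescaling each class to an average weight of 1, dropping the few events with very large relative weights), followed by an xgb.train call with the listed parameters. The X, y, w arrays and the truncation threshold are assumptions for illustration.

    import numpy as np
    import xgboost as xgb

    # X, y, w: assumed numpy arrays of features, labels (1 = signal), and MC event weights
    def preprocess_weights(w, y, max_rel_weight=50.0):
        """Take |w|, rescale each class to an average weight of 1, and flag the small
        fraction of events with large relative weights (threshold is an assumption)."""
        w = np.abs(w).astype(float)
        for cls in (0, 1):
            mask = (y == cls)
            w[mask] /= w[mask].mean()
        keep = w < max_rel_weight
        return keep, w

    keep, w = preprocess_weights(w, y)
    dtrain = xgb.DMatrix(X[keep], label=y[keep], weight=w[keep], feature_names=feature_names)

    param = {
        'objective': 'binary:logistic', 'eta': 0.07, 'max_depth': 5,
        'subsample': 0.6, 'alpha': 8.0, 'gamma': 2.0, 'lambda': 1.0,
        'min_child_weight': 1.0, 'colsample_bytree': 1.0,
        'eval_metric': 'auc', 'silent': 1, 'nthread': 15,
    }
    bst = xgb.train(param, dtrain, num_boost_round=500)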
Training results
⚫ Bottom left plot shows discriminator shapes for signal/bkg in train/test sets
  • Kolmogorov-Smirnov test shows good consistency — no overtraining observed
⚫ Top right shows the AUC of xgboost is ~1.2% higher than TMVA
⚫ Bottom right shows the maximal s/sqrt(s+b) (single cut) is 1.83 for xgboost, but 1.75 for TMVA (5% higher for xgboost)
  • The shape is qualitatively different, however
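The AUC and the best single-cut s/sqrt(s+b) quoted above can be reproduced from the discriminator scores along these lines; the scores, y, w arrays are assumed, and the cut-scan granularity is arbitrary.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    # scores, y, w: assumed arrays of discriminator outputs, true labels (1 = signal), event weights
    def best_single_cut_significance(scores, y, w, n_scan=200):
        """Scan lower cuts on the discriminator and return the maximum s/sqrt(s+b)."""
        best = 0.0
        for cut in np.linspace(scores.min(), scores.max(), n_scan):
            sel = scores > cut
            s = w[sel & (y == 1)].sum()
            b = w[sel & (y == 0)].sum()
            if s + b > 0:
                best = max(best, s / np.sqrt(s + b))
        return best

    print(roc_auc_score(y, scores, sample_weight=w))
    print(best_single_cut_significance(scores, y, w))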
Significance metrics
⚫ Ran HiggsCombine 10-50k times, using a simplified card/nuisance structure
  • Group fakes/flips into "Others", and rares/ttxx/tttx/xg into "Rares", as shown in the plot on the right
  • Then compute two versions of the expected significance
  • Significance without MC stats: 5 background processes + 1 signal process + 0 nuisances
  • Significance with MC stats: 5 background processes + 1 signal process + (Nbins * (5+1)) uncorrelated nuisances representing the MC statistical uncertainty in each bin
  • Use the latter for optimization/ranking to hopefully avoid low-MC-statistics bins/fluctuations, though the difference in the two values is only a few percent because this analysis is statistically limited
⚫ I'm showing s/sqrt(s+b) as the metric for each bin in the ratio panels, but I found that for a low number of bins (e.g., 2-3), it is not indicative of the expected significance from combine. However, the higher-order likelihood approximation below usually agrees with combine within ~2% (again, for 2-3 bins, so not useful in the right plot)

    σ = sqrt( 2 (s + b) ln(1 + s/b) − 2 s )

[Plots: discriminator distributions, with the TMVA output mapped from [-1,1] to [0,1]. Note, these discriminator plots require the actual baseline selection (HT > 300, MET > 50, Nb/Njets ≥ 2, lepton pT > 25, 20)]
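A sketch of the likelihood approximation above applied per bin; combining the bins in quadrature is my assumption here, and combine remains the reference number.

    import numpy as np

    def approx_significance(s, b):
        """Per-bin sigma = sqrt(2*(s+b)*ln(1+s/b) - 2*s), combined over bins in quadrature.
        s and b are arrays of signal and background yields per bin (b > 0 assumed)."""
        s = np.asarray(s, dtype=float)
        b = np.asarray(b, dtype=float)
        z2 = 2.0 * ((s + b) * np.log1p(s / b) - s)
        return np.sqrt(z2.sum())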
Exp. σ (out-of-the-box)
⚫ Here we can see the shape difference between TMVA and xgboost, though both get very similar AUC and s/sqrt(s+b)
⚫ Note that I scaled the TMVA plot from the previous slide from [0.15,1] to [0,1] to avoid empty bins, because the TMVA output doesn't cover the full [-1,1] range initially
  • This is one source of slight ambiguity for binning, since you can't just equally partition [-1,1] — you have to decide where to start binning on the left
⚫ Afterwards, create 20 equal-width bins for TMVA and xgboost and calculate the expected significance without MC stats and with MC stats
  • TMVA is ~6% higher than xgboost even though the s, b, and AUC metrics indicate xgboost should be winning…
  • Presumably, combine likes several moderately high s/sqrt(s+b) bins (TMVA) rather than one really high one (xgboost)
  • AUC doesn't care about the squished signal on the right, but a fit probably does
⚫ As a quick comparison (in backup), I ran this procedure on the cut-based SR binning (18 bins) and get ~2.25 σ
[Plots: stretched TMVA from [0.15,1] to [0,1], and xgboost; expected significance (no MC stat, with MC stat): TMVA 2.63477, 2.59117; xgboost 2.60103, 2.44803]
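The stretch of the TMVA output and the 20 equal-width bins can be written as below; the score and weight arrays are assumptions.

    import numpy as np

    def stretch(scores, lo=0.15, hi=1.0):
        """Linearly map discriminator values from [lo, hi] to [0, 1]."""
        return (np.asarray(scores, dtype=float) - lo) / (hi - lo)

    edges = np.linspace(0.0, 1.0, 21)  # 20 equal-width bins
    # tmva_sig/tmva_bkg and w_sig/w_bkg are assumed arrays of scores and weights
    s_hist, _ = np.histogram(stretch(tmva_sig), bins=edges, weights=w_sig)
    b_hist, _ = np.histogram(stretch(tmva_bkg), bins=edges, weights=w_bkg)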
Run combine a lot
⚫ Run combine a few thousand times for the TMVA and xgboost discriminators with a random number of bins (between 10 and 20) and random binning
  • Get a set of flat- or Gaussian-distributed random numbers (50-50 chance), take the cumulative sum, and squeeze to [0,1] to obtain a "random binning"
  • Reject the binning scheme if there is an empty bin (or one with < 0.05 s+b events)
  • Additionally, compute s/sqrt(s+b) and make sure >~80% of the bins are increasing in this metric, to avoid weird-looking distributions (e.g., right); the procedure is sketched after this list
⚫ Left plot shows significance without MC stats vs. significance with MC stats — on average, "sig no stat" is ~1.8% higher than "sig stat"
⚫ Middle plot has 1D distributions of "sig stat" for xgboost and TMVA
  • The difference here is quite striking. TMVA is better than xgboost and fairly stable
⚫ Right plot shows the maximum s/sqrt(s+b) across all bins against the significance
  • The narrow orange line at the top left contains cases where the last xgboost bin has a higher s/sqrt(s+b) than any other bin, so it dominates the result and is clearly correlated with the output of combine
  • TMVA has a lower maximum than xgboost on average, but obtains a better significance
  • This is along the lines of the suspicion on the previous slide about squishing the signal
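A sketch of the random-binning generator and the acceptance criteria described above; the 0.05-event yield threshold and the 80% monotonicity requirement come from the text, everything else is an assumption.

    import numpy as np

    def random_binning(nbins, rng=np.random):
        """Random bin edges on [0, 1]: flat- or Gaussian-distributed steps (50-50 chance),
        cumulative sum, then squeezed to [0, 1]."""
        if rng.random() < 0.5:
            steps = rng.random(nbins)
        else:
            steps = np.abs(rng.normal(size=nbins))
        edges = np.concatenate([[0.0], np.cumsum(steps)])
        return edges / edges[-1]

    def acceptable(edges, disc_sig, disc_bkg, w_sig, w_bkg, min_yield=0.05, frac_rising=0.8):
        """Reject binnings with a (nearly) empty bin or a mostly non-increasing s/sqrt(s+b)."""
        s, _ = np.histogram(disc_sig, bins=edges, weights=w_sig)
        b, _ = np.histogram(disc_bkg, bins=edges, weights=w_bkg)
        if np.any(s + b < min_yield):
            return False
        metric = s / np.sqrt(np.maximum(s + b, 1e-9))
        return (np.diff(metric) > 0).mean() > frac_rising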
Dependence on bin count
⚫ Plot expected significance for TMVA (left) and xgboost (right) — the legend is more useful than the histograms, though
⚫ For each bin count, display the mean significance and also the mean of the highest 10% of significances (a summary sketch follows below)
  • TMVA only has a ~1% gain going from the lowest bin count to the highest
  • xgboost has a 5-8% gain
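The per-bin-count summaries (mean, and mean of the highest 10%) can be formed from the scan output like this; the results list of (nbins, significance) pairs is an assumption.

    import numpy as np
    from collections import defaultdict

    # results: assumed list of (nbins, significance) pairs from the combine scan
    by_nbins = defaultdict(list)
    for nbins, sig in results:
        by_nbins[nbins].append(sig)

    for nbins in sorted(by_nbins):
        sigs = np.sort(by_nbins[nbins])
        top10 = sigs[int(0.9 * len(sigs)):]  # highest 10% of significances
        print(nbins, sigs.mean(), top10.mean())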
Effect of MC stats
⚫ Now plot the difference between the expected significance without MC statistics nuisances and with them, as a function of the number of bins
⚫ For TMVA, the difference decreases a little bit going from 10 to 19 bins
⚫ For xgboost, the difference increases going from 10 to 19
⚫ I would expect fewer bins to mean a smaller effect of MC statistics, along the lines of what xgboost shows
Reshaping xgboost output
⚫ From an earlier slide, signal is compressed at disc = 1 for xgboost. Naively try to reshape it to look like TMVA by matching the relative signal counts in each bin
⚫ Take the equally-spaced bins in the xgboost discriminator (x-axis) and make them match TMVA (y-axis) — this bins more finely where the signal is bunched up
⚫ Green dots are calculated by matching integrals; blue is a linear interpolation that we can apply
⚫ Two approaches (sketched below)
  • Convert the xgboost discriminator value on an event-by-event basis (blue)
  • Re-space the bins (orange, which is the inverse of blue)
⚫ Note that orange is very sigmoid-like…
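A sketch of the integral-matching map (green dots) and the event-by-event conversion (blue): for a grid of xgboost discriminator values, find the TMVA value with the same weighted signal CDF. The signal score and weight arrays are assumptions, and both discriminators are assumed to already live on [0, 1].

    import numpy as np

    def build_reshaping_map(xgb_sig, tmva_sig, w_xgb, w_tmva, npoints=20):
        """Match cumulative signal fractions between the two discriminators."""
        order_x = np.argsort(xgb_sig)
        order_t = np.argsort(tmva_sig)
        cdf_x = np.cumsum(w_xgb[order_x]) / w_xgb.sum()
        cdf_t = np.cumsum(w_tmva[order_t]) / w_tmva.sum()
        grid = np.linspace(0.0, 1.0, npoints + 1)
        # signal CDF at each grid point of the xgboost discriminator
        q = np.interp(grid, xgb_sig[order_x], cdf_x)
        # TMVA value with the same signal CDF (the "green dots")
        mapped = np.interp(q, cdf_t, tmva_sig[order_t])
        return grid, mapped

    grid, mapped = build_reshaping_map(xgb_sig, tmva_sig, w_xgb, w_tmva)
    # event-by-event conversion of xgboost scores (the "blue" linear interpolation)
    xgb_reshaped = np.interp(xgb_scores, grid, mapped)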