Semi-Cyclic SGD
Hubert Eichner (Google), Tomer Koren (Google), Brendan McMahan (Google), Kunal Talwar (Google), Nati Srebro (TTIC)
SGD update: $x_{t+1} \leftarrow x_t - \eta \nabla \ell(x_t, z_t)$

SGD is great … if you run on iid (randomly shuffled) data.

[Figure: a stream of training examples sorted alphabetically by user name (Ab, Ag, Az, …, Hi, Ho) feeding an SGD chain that outputs a model $\hat{x}$; illustrates non-shuffled data]
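A minimal Python sketch of the update above in the well-shuffled setting (the names `grad_loss`, `eta`, and `sgd` are illustrative, not from the paper):

```python
import numpy as np

def sgd(grad_loss, data, dim, eta=0.1, seed=0):
    """Plain SGD, x_{t+1} <- x_t - eta * grad l(x_t, z_t), over one
    randomly shuffled (hence iid-like) pass through the data."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    for z in rng.permutation(data):  # random shuffling is the key assumption
        x = x - eta * grad_loss(x, z)
    return x
```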
Cyclically varying (not fully shuffled) data, e.g. in Federated Learning:
• Train the model by executing SGD steps on user devices when a device is available (plugged in, idle, on WiFi)
• Diurnal variations (e.g. day vs. night available devices; US vs. UK vs. India)

Samples in block $i = 1..k$ are drawn as $z_t \sim D_i$; overall distribution: $\bar{D} = \frac{1}{k} \sum_i D_i$
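A toy simulation of this block-cyclic data model (the sampler functions and the $k = 2$ "day/night" split are hypothetical, chosen only to make the sketch concrete):

```python
import numpy as np

def block_cyclic_stream(samplers, block_len, num_cycles, rng):
    """Yield z_t from a block-cyclic stream: within block i every sample
    is drawn from D_i (samplers[i]), and the k blocks repeat each cycle."""
    for _ in range(num_cycles):
        for sample in samplers:          # D_1, ..., D_k in a fixed order
            for _ in range(block_len):
                yield sample(rng)

# Hypothetical k = 2 example: "day" vs. "night" user populations.
rng = np.random.default_rng(0)
samplers = [lambda r: r.normal(+1.0, 1.0),   # D_1: daytime users
            lambda r: r.normal(-1.0, 1.0)]   # D_2: nighttime users
stream = list(block_cyclic_stream(samplers, block_len=100, num_cycles=3, rng=rng))
```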
• Train a single model $\hat{x}$ for $\bar{D}$ by running block-cyclic SGD → could be MUCH slower, by an arbitrarily large factor
• Pluralistic approach: learn a different $\hat{x}_i$ for each block $i = 1..k$, training each $\hat{x}_i$ separately on data from that block (across all cycles) → could be slower/less efficient by a factor of $k$
• Our solution: train $\hat{x}_i$ using a single SGD chain + "pluralistic averaging"
→ exactly the same guarantee as if using random shuffling (no degradation)
→ no extra computational cost, no assumptions about the $D_i$ nor their relatedness
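How one chain can yield per-block models: below is a sketch of the pluralistic-averaging idea as stated on the slide, taking $\hat{x}_i$ to be an average of the iterates visited while processing block-$i$ data. This is only a reading of the slide; the exact averaging scheme that carries the guarantee is specified in the paper, and all names are illustrative.

```python
import numpy as np

def semi_cyclic_sgd(grad_loss, samplers, block_len, num_cycles, dim, eta, seed=0):
    """One shared SGD chain over the block-cyclic stream; the model for
    block i is the running average of the iterates produced on block-i
    data, across all cycles (pluralistic averaging, sketched)."""
    rng = np.random.default_rng(seed)
    k = len(samplers)
    x = np.zeros(dim)
    x_hat = [np.zeros(dim) for _ in range(k)]  # per-block averaged models
    n = [0] * k                                # iterates seen per block
    for _ in range(num_cycles):
        for i, sample in enumerate(samplers):
            for _ in range(block_len):
                z = sample(rng)
                x = x - eta * grad_loss(x, z)      # single shared chain
                n[i] += 1
                x_hat[i] += (x - x_hat[i]) / n[i]  # running mean for block i
    return x_hat  # k models from one chain, no extra gradient computations
```

With the toy day/night samplers above and squared loss $\ell(x, z) = \tfrac{1}{2}(x - z)^2$ (so `grad_loss = lambda x, z: x - z`), the two returned models drift toward the means of $D_1$ and $D_2$ rather than toward a single compromise.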