
Semi-Cyclic SGD
Hubert Eichner, Tomer Koren, Brendan McMahan, Kunal Talwar (Google), Nati Srebro


  1. Semi-Cyclic SGD. Hubert Eichner, Tomer Koren, Brendan McMahan, Kunal Talwar (Google), Nati Srebro.

  2. SGD is great ... The update: $x_{u+1} \leftarrow x_u - \theta\,\nabla g(x_u, A_u)$; after $U$ steps, output the averaged iterate $\bar{x}_U$.

  3. $x_{u+1} \leftarrow x_u - \theta\,\nabla g(x_u, A_u)$, output $\bar{x}_U$. SGD is great ... if you run on iid (randomly shuffled) data. [Figure: a data stream keyed by user names in alphabetical order (Ab, Ag, Az, Be, Bo, By, Ch, Cl, Co, ...), i.e. not shuffled.]
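
To make the update on these slides concrete, here is a minimal sketch of SGD with iterate averaging on an iid stream. The function name and the squared loss are illustrative choices, not from the slides:

```python
import numpy as np

# Sketch of the update from the slide: x_{u+1} <- x_u - theta * grad g(x_u, A_u),
# with the averaged iterate bar{x}_U as the output.
def sgd_average(samples, x0, theta):
    """Run SGD over `samples` (pairs (a, b), x0 a float vector); return the averaged iterate."""
    x = x0
    avg = np.zeros_like(x0, dtype=float)
    for u, (a, b) in enumerate(samples, start=1):
        grad = (x @ a - b) * a      # gradient of g(x, (a, b)) = 0.5 * (x.a - b)^2
        x = x - theta * grad        # SGD step
        avg += (x - avg) / u        # running average of the iterates
    return avg
```

On an iid stream this enjoys the usual SGD guarantee; the next slides ask what happens when the stream is block-cyclic instead.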

  4. $x_{u+1} \leftarrow x_u - \theta\,\nabla g(x_u, A_u)$. Samples in block $j = 1..n$ are drawn as $A_u \sim \mathcal{D}_j$; the overall distribution is $\mathcal{D} = \frac{1}{n} \sum_j \mathcal{D}_j$. SGD is great ... if you run on iid (randomly shuffled) data, but here the data is cyclically varying (not fully shuffled).

  5. $x_{u+1} \leftarrow x_u - \theta\,\nabla g(x_u, A_u)$. Samples in block $j = 1..n$ are drawn as $A_u \sim \mathcal{D}_j$; overall distribution $\mathcal{D} = \frac{1}{n} \sum_j \mathcal{D}_j$. Cyclically varying (not fully shuffled) data arises e.g. in Federated Learning:
  • Train the model by executing SGD steps on user devices when a device is available (plugged in, idle, on WiFi)
  • Diurnal variations (e.g. day vs. night available devices; US vs. UK vs. India)
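
A sketch of this block-cyclic sampling model, with a toy diurnal example; all names here are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def block_cyclic_stream(block_draws, steps_per_block, num_cycles):
    """Yield samples in block-cyclic order: block 1, ..., block n, repeated.

    block_draws[j] draws one sample from the component distribution D_j.
    An iid (shuffled) baseline would instead pick a random block at every
    step, matching the uniform mixture D = (1/n) * sum_j D_j.
    """
    for _ in range(num_cycles):
        for draw in block_draws:              # blocks j = 1..n in order
            for _ in range(steps_per_block):
                yield draw()

# Toy diurnal variation: "day" and "night" populations with different labels.
day   = lambda: (np.array([rng.normal(+1.0, 1.0)]), +1.0)
night = lambda: (np.array([rng.normal(-1.0, 1.0)]), -1.0)
stream = block_cyclic_stream([day, night], steps_per_block=1000, num_cycles=5)
```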

  6. $x_{u+1} \leftarrow x_u - \theta\,\nabla g(x_u, A_u)$; samples in block $j = 1..n$ are drawn as $A_u \sim \mathcal{D}_j$.
  • Train $\bar{x}_U$ by running block-cyclic SGD ➔ could be MUCH slower, by an arbitrarily large factor
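
A toy run suggesting why a single block-cyclic chain can degrade: when two blocks pull toward different optima, the iterate chases whichever block is active, so the model you stop with depends on the phase of the cycle rather than on the mixture $\mathcal{D}$. This is a hand-rolled illustration, not the paper's construction:

```python
# Block 1 pulls x toward +1, block 2 toward -1; the mixture optimum is 0.
theta, x, trace = 0.05, 0.0, []
for cycle in range(3):
    for target in (+1.0, -1.0):
        for _ in range(200):
            x -= theta * (x - target)   # gradient step on 0.5 * (x - target)^2
        trace.append(round(x, 3))
print(trace)  # oscillates near [+1, -1, +1, -1, ...], never settling near 0
```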

  7. $x_{u+1} \leftarrow x_u - \theta\,\nabla g(x_u, A_u)$; samples in block $j = 1..n$ are drawn as $A_u \sim \mathcal{D}_j$.
  • Train $\bar{x}_U$ by running block-cyclic SGD ➔ could be MUCH slower, by an arbitrarily large factor
  • Pluralistic approach: learn a different $\bar{x}_j$ for each block $j = 1..n$ ($\bar{x}_1$, $\bar{x}_2$, ...); train each $\bar{x}_j$ separately on data from that block (across all cycles) ➔ could be slower/less efficient by a factor of $n$
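
A sketch of this pluralistic baseline (function and argument names are hypothetical): each step updates only the model of the currently active block, so every model sees roughly a $1/n$ fraction of the steps, which is the factor-$n$ inefficiency noted above:

```python
def pluralistic_sgd(stream_with_block_ids, n, x0, theta, grad):
    """Maintain one model per block; block j's samples update only model j."""
    models = [x0] * n
    for j, sample in stream_with_block_ids:   # j = index of the active block
        models[j] = models[j] - theta * grad(models[j], sample)
    return models                             # bar{x}_1, ..., bar{x}_n
```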

  8. $x_{u+1} \leftarrow x_u - \theta\,\nabla g(x_u, A_u)$; samples in block $j = 1..n$ are drawn as $A_u \sim \mathcal{D}_j$.
  • Train $\bar{x}_U$ by running block-cyclic SGD ➔ could be MUCH slower, by an arbitrarily large factor
  • Pluralistic approach: train each $\bar{x}_j$ separately on data from its block (across all cycles) ➔ could be slower/less efficient by a factor of $n$
  • Our solution: train $\hat{x}_j$ ($\hat{x}_1$, $\hat{x}_2$, ...) using a single SGD chain + "pluralistic averaging"
  ➔ exactly the same guarantee as if using random shuffling (no degradation)
  ➔ no extra computation cost, no assumptions about the $\mathcal{E}_j$ nor their relatedness
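
A minimal sketch of how we read the "single SGD chain + pluralistic averaging" recipe: run one ordinary SGD chain over the block-cyclic stream, and for each block $j$ output the average of the iterates produced while block $j$ was active, accumulated across all cycles. Names are illustrative, and this assumes every block occurs at least once:

```python
def pluralistic_averaging_sgd(stream_with_block_ids, n, x0, theta, grad):
    """One shared SGD chain; hat{x}_j averages the iterates seen in block j."""
    x = x0
    sums, counts = [None] * n, [0] * n
    for j, sample in stream_with_block_ids:
        x = x - theta * grad(x, sample)       # single SGD chain, all blocks
        sums[j] = x if sums[j] is None else sums[j] + x
        counts[j] += 1
    return [s / c for s, c in zip(sums, counts)]   # hat{x}_j for each block j
```

Compared with the pluralistic baseline above, every $\hat{x}_j$ benefits from all the gradient steps of the shared chain; the claim that this matches the random-shuffling guarantee is the paper's result, not something this sketch proves.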
