Maintaining Training Efficiency and Accuracy for Edge-assisted Online Federated Learning with ABS
Jiayu Wang, Zehua Guo, Sen Liu, Yuanqing Xia
Beijing Institute of Technology, Fudan University
1 Federated Learning
[Figure: federated learning overview — user devices, cloud, and the data interaction between them]
1 Parameter Server
[Figure: parameter-server architecture — gradient synchronization flow between the server and workers]
1 Existing method

Training batch size
- Existing method: the training data batch size can fluctuate.
- Problem: a decrease in batch size can have a negative effect on the training process.

Computing speed
- Existing method: the difference in computing speed across workers is not considered.
- Problem: a worker with more training data and low computing speed may drag down the training process.

Utilization of the training data
- Existing method: the utilization of the training data is not considered.
- Problem: an improper batch size can decrease the utilization of the training data.
2 Observation: increasing the batch size
Training model: ResNet18. Dataset: CIFAR10.
- Increasing the batch size can accelerate the training process.
- A larger increase accelerates the training further.
Cases (batch size changed at a fixed iteration): Case 1: 32 → 32; Case 2: 32 → 64; Case 3: 32 → 128.
[Figure: training curves for each change of batch size]
2 Observation: decreasing the batch size
Training model: ResNet18. Dataset: CIFAR10.
- A decrease in the batch size can slow down the training.
- An extremely small batch size has a serious negative effect and leads to a long training duration.
Cases (batch size changed at a fixed iteration): Case 4: 128 → 128; Case 5: 128 → 64; Case 6: 128 → 32.
[Figure: training curves for each change of batch size]
3 Our method
- Considering the changeable data receiving speed, we adopt an adaptive batch size (see the sketch below).
- Considering the different computing speeds, we set a different batch size upper bound for each worker.
- To improve the utilization of the training data, we adopt a lower bound for the training batch size.
[Figure: existing method vs. our method]
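The adaptive rule can be written as a single clamp. A minimal sketch in Python, assuming the worker tracks how much data arrived since the last iteration (the name data_amount is ours, not the slides'):

```python
def adaptive_batch_size(data_amount: int, lower_bound: int, upper_bound: int) -> int:
    """Follow the data receiving speed, but keep the batch size inside
    the per-worker bounds decided in the warm-up phase."""
    return min(max(data_amount, lower_bound), upper_bound)
```

A fast worker receives a high upper bound and a slow worker a low one, so a slow worker with much data no longer drags the synchronized training.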
3 Warm-up phase
The setting of the lower bound:
- We train the machine learning model with different batch sizes on the training data for one iteration; the batch size with the best training result is set as the lower bound.
The setting of the upper bound:
- We first set an iteration duration. On each worker, the maximum batch size that can be processed within this duration is set as the upper bound.
A sketch of this procedure follows.
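A minimal sketch of the warm-up, assuming a hypothetical callable train_one_iteration(batch_size) -> loss that runs one local training iteration; the candidate sizes and the one-second budget are illustrative assumptions, not values from the slides:

```python
import time

def warm_up(train_one_iteration, candidate_sizes=(16, 32, 64, 128, 256),
            iteration_duration_s=1.0):
    """Decide one worker's batch size bounds before online training starts."""
    losses, durations = {}, {}
    for b in candidate_sizes:
        start = time.perf_counter()
        losses[b] = train_one_iteration(b)          # one iteration at batch size b
        durations[b] = time.perf_counter() - start

    # Lower bound: the batch size with the best one-iteration training result.
    lower_bound = min(losses, key=losses.get)
    # Upper bound: the largest batch size processed within the preset duration.
    feasible = [b for b in candidate_sizes if durations[b] <= iteration_duration_s]
    upper_bound = max(feasible) if feasible else min(candidate_sizes)
    return lower_bound, upper_bound
```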
3 System design
Warm-up phase: batch size bound decision; the server sends the start signal and the initial batch size.
Processing phase:
- Training data selection: choose C% of the data.
- Batch size selection: restrict the batch size within the bounds, following the amount of data in the buffer.
- Batch size bound update: compare the batch size with the lower bound and update the lower bound.
[Figure: ABS structure]
The processing loop is sketched below.
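Putting the three steps together, a sketch of one worker's processing loop. receive_data and train_step are hypothetical stand-ins for the incoming data stream and one training iteration, and the doubling step when the lower bound is raised is our assumption (the slides only say the bound is updated):

```python
def processing_phase(receive_data, train_step, lower_bound, upper_bound,
                     c_percent, num_iterations, patience=120):
    """Run the ABS processing phase on one worker (a sketch, not the authors' code)."""
    buffer, above_lower = [], 0
    for _ in range(num_iterations):
        buffer.extend(receive_data())
        # Training data selection: choose C% of the buffered data.
        selected = max(1, len(buffer) * c_percent // 100)
        # Batch size selection: restrict the batch size within the bounds.
        batch_size = min(max(selected, lower_bound), upper_bound)
        train_step(buffer[:batch_size])
        buffer = buffer[batch_size:]
        # Batch size bound update: raise the lower bound once the batch size
        # has exceeded it for `patience` consecutive iterations (120 in Sec. 4).
        above_lower = above_lower + 1 if batch_size > lower_bound else 0
        if above_lower >= patience:
            lower_bound = min(lower_bound * 2, upper_bound)
            above_lower = 0
```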
4 Experimental setup
Training data:
- CIFAR10 dataset.
Training model:
- Based on ResNet18, with the last layer adjusted.
Simulation of the data stream:
- We download the traffic dataset from Kaggle.
- We assume there is no network congestion.
Other parameters:
- We choose 1% of the data in each iteration.
- We improve the lower bound when the training batch size has been higher than the lower bound for 120 iterations.
Comparison algorithm:
- FederatedAveraging (sketched below): each worker's training batch size is the size of all the data on it.
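For reference, the FederatedAveraging baseline aggregates the workers' models weighted by their local data sizes. A minimal sketch with models represented as name-to-value dicts (a simplification; real implementations average parameter tensors):

```python
def federated_averaging(worker_models, worker_data_sizes):
    """Aggregate worker models, weighting each by its share of the total data."""
    total = sum(worker_data_sizes)
    averaged = {name: 0.0 for name in worker_models[0]}
    for model, n in zip(worker_models, worker_data_sizes):
        for name, value in model.items():
            averaged[name] += value * (n / total)
    return averaged
```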
4 Experimental results
- The training loss of ABS converges faster and more smoothly.
- The testing accuracy of ABS is higher.
[Figure: training loss and accuracy curves of ABS vs. FederatedAveraging]
Thank you! Questions?