Using Tesla to Provide the First GPU VM Service in China
Feng Zhu
Focus • Service • Neutrality

Outline
• UCloud Introduction
• K80 GPU VM
• P40 GPU VM
• UCloud GPU PaaS Service: UAI-Service
• UCloud GPU ecosystem

About UCloud
• Top 3 IaaS provider in China
• Founded in 2012
• HQ in Shanghai
• Serves 50,000+ enterprises

Data Centers
• 14 global regions: Frankfurt, LA, BJ2, DC, BJ1, SH1, Seoul, SH2, ZJ, GZ, HK, TW, Bangkok, SG

UCloud Product Line
(Product line diagram; labels not recoverable.)

GPU Timeline
• 2012: UCloud founded
• 2015.11: K80 GPU VM
• 2016.2: K80 GPU Physical Machine
• 2017.5: P40 GPU VM
• 2017.?: P40 GPU Physical Machine

GPU Decision: Virtualization
• Options compared: PCI pass-through and GRID (comparison table not recoverable; one option is marked √)

VM Advantage
• Flexible VM configuration
  • CPU, memory, disk size, and GPU count are all configurable
  • Flexible SDN networking
• All mainstream OSes supported (Windows/Linux)
  • CentOS 6.5/CentOS 7.0/Ubuntu 14.04/Ubuntu 12.04/Gentoo 2.2/Win 2008/Win 2012
• Fast deployment
  • With self-defined images, 1000 VMs can be deployed in 1 minute

VM Performance Degradation
• With pass-through technology, there is almost no degradation
Degradation                Virtualization   Bare Metal
(label not recoverable)    99.3%            100%
(label not recoverable)    98%              100%
(Chart comparing virtualization with bare metal, axis 97.00% to 101.00%; series labels not recoverable.)

UCloud GPU Virtualization: DL test
• Caffe performance (Ubuntu)
Cases                     iters   GPU(secs)  CPU(secs)  Speedup
(name not recoverable)    10000   253.9      900.3      3.5
cifar10                   5000    374.2      2931.0     7.8
(name not recoverable)    10000   3138.9     17634.8    5.6
(name not recoverable)    50000   2584.2     8937.1     3.5
(Bar chart of GPU and CPU times, axis 0 to 20000.)

UCloud GPU Virtualization: DL test (2)
• Theano/Keras (Ubuntu)
Cases                 iters   GPU(secs)  CPU(secs)  Speedup
imdb_lstm.py          20000   47         239        5.1
imdb_cnn.py           20000   11         563        51.2
imdb_cnn_lstm.py      20000   27         236        8.7
addition_rnn.py       45000   5          32         6.4
cifar10_cnn.py        50000   198        2670       13.5
babi_rnn.py           950     9          22         2.4
mnist_cnn.py          60000   23         1332       57.9
mnist_irnn.py         60000   270        451        1.7
mnist_mlp.py          60000   1          5          5.0

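The Speedup column appears to be CPU time divided by GPU time. The sketch below assumes that definition; the function name `speedup` is hypothetical, and the timings are example inputs chosen so that the ratios reproduce two of the reported speedups (51.2 and 5.0).

```python
def speedup(cpu_secs: float, gpu_secs: float) -> float:
    """Return how many times faster the GPU run is than the CPU run."""
    if gpu_secs <= 0:
        raise ValueError("gpu_secs must be positive")
    return cpu_secs / gpu_secs


# Example inputs (assumed): (label, CPU seconds, GPU seconds)
runs = [
    ("example A", 563.0, 11.0),
    ("example B", 5.0, 1.0),
]

for label, cpu_secs, gpu_secs in runs:
    print(f"{label}: {speedup(cpu_secs, gpu_secs):.1f}x")
```

A ratio near 1 (such as the 1.7 row) suggests the workload is not GPU-bound, so the speedup alone does not say whether a GPU VM is worth its price for that job.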
K80 Physical Machine
Hardware Specification
• GPU: Tesla K80
• CPU: Intel E5 2630 x2
• Memory: 192G
• Disk: 2T (type not recoverable)
• (label not recoverable): 10 … x4

VM Configuration - K80
• CPU: 4, 8, or 16 cores
• Memory: 8G, 16G, 32G, or 64G
• Disk: 100G to 1T
(Diagram: GPUs attached to VMs.)

Flexible VMs Save Cost
Configuration   Fixed              Flexible
CPU             16                 4
Memory          96G                8G
Disk            1T                 100G
GPU             1                  1
Price           4,300 RMB/month    2,200 RMB/month
USD Price       $615/M             $315/M
(Bar chart comparing Fixed and Flexible prices, axis 0 to 10000.)

GPU VM Features
• VPC: VM networking
• Image: self-defined images, deploy 1000 VMs in 1 min
• Snapshot: data backup
• DataArk: 24-hour continuous data protection, can roll back to any second
• Hotfix: kernel patch without system shutdown
• Rescale: resize CPU/Memory/Disk anytime

Storage Solution
• VM Disk: local SSD disk
• UDisk: NAS, no limit on device numbers
• UFS: NFS file system
• UFile: object storage
• UArchive: low-cost cloud archive

Create GPU VM

P40 Physical Machine
Hardware Specification
• GPU: Tesla P40 x4
• CPU: Intel E5 2650 x2
• Memory: 256G
• Disk: 3T (type not recoverable)
• (label not recoverable): 10 … x4

VM Configuration - P40
• CPU: 4, 8, 16, 24, or 32 cores
• Memory: 8G, 16G, 32G, 64G, 96G, or 128G
• Disk: 100G to 1T
(Diagram: GPUs attached to VMs.)

P40 Price
Configuration   Spec 1             Spec 2
CPU             4                  32
Memory          8G                 128G
Disk            100G               1T
GPU             1                  4
Price           4,000 RMB/month    18,200 RMB/month
USD Price       $570/M             $2,600/M
(Bar chart of K80 and P40 minimum and maximum prices, axis 0 to 20000.)

UAI-Service Overview
(Architecture diagram. Recoverable labels: User SDK/Web; Keras, TensorFlow, MXNet; Training; Resources, including FPGA.)

Distributed Training Layout
(Architecture diagram. Recoverable labels: User SDK/Web; Features; TensorFlow, Keras, MXNet; Training; Storage; Resources.)

Distributed Training Process
(Process diagram with three numbered steps. Recoverable labels: SDK/Web, Storage, Resources.)

Online Inference Layout
• User: SDK/Web
• Online Inference System: Deploy, Running, Eval, Test Env
• Frameworks: TensorFlow, Keras, MXNet
• Storage: /task1/code, /data, /ckpt, /log
• Docker Images
• Resources: CPU, FPGA, GPU

Online Inference Process
• 1. Upload (SDK/Web to Storage: /task1/code, /ckpt)
• 2. Test & Eval (Test Env, Tester, Perf)
• 3. Deploy (ULB in front of Docker containers on the Docker Resource)

Online Inference API/SDK
• Capabilities: Deploy, AB Test, Scalable, Perf report, Model update, Rollback
• Request path: User → ULB → Service instances (each in a Docker container) → Docker Resource

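The AB Test, Model update, and Rollback capabilities can be illustrated with a toy traffic splitter. This is only a sketch of the general idea, not UAI-Service's API: the class `ModelRouter`, its methods, and the version names are all hypothetical, and the hash-based bucketing is an assumed technique.

```python
import hashlib


class ModelRouter:
    """Toy A/B router between a stable and a candidate model version."""

    def __init__(self, stable: str):
        self.stable = stable
        self.candidate = None
        self.candidate_percent = 0

    def start_ab_test(self, candidate: str, percent: int) -> None:
        """Send roughly `percent` of users to the candidate version."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self.candidate = candidate
        self.candidate_percent = percent

    def route(self, user_id: str) -> str:
        # Hash the user id so the same user always lands in the same bucket.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        if self.candidate is not None and bucket < self.candidate_percent:
            return self.candidate
        return self.stable

    def promote(self) -> None:
        """Model update: the candidate becomes the new stable version."""
        if self.candidate is not None:
            self.stable = self.candidate
        self.rollback()

    def rollback(self) -> None:
        """Stop the test; all traffic returns to the stable version."""
        self.candidate = None
        self.candidate_percent = 0


router = ModelRouter(stable="model-v1")
router.start_ab_test(candidate="model-v2", percent=10)
served = [router.route(f"user-{i}") for i in range(1000)]
print(served.count("model-v2"), "of 1000 requests went to the candidate")
router.rollback()
```

Bucketing by a hash of the user id keeps each user on one version for the duration of the test, which makes the performance report for each version easier to interpret.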
GPU Scenario
• Deep Learning: Advertisement CTR, Face Recognition, Voice Recognition
• HPC: Gene Sequencing, Weather Forecasting
• Rendering: Picture\Film\ACG Rendering, Online Rendering, Maya, 3Dmax
• Simulator, Unity

GPU Scenario
• Training (compute-intensive): Big Data → Neural Network Model (multiple GPUs)
• Online Service (compute-sensitive): User Input → Neural Network Model (GPU) → Output
• Example workloads: Advertisement CTR, Face Recognition, Voice Recognition

GPU Scenario: Example
CTR (click-through rate) estimation
(Bullet text not recoverable.)

GPU Scenario: Example
CTR (click-through rate) estimation
• Input features: x = [Weekday=Wednesday, Gender=Male, City=Shanghai]
• One-hot encoding: x = [0,0,1,0,0,0,0, 0,1, 0,0,1,0…0]
• CTR Estimate Model → Percent of Click: 25%

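The feature vector above is a one-hot encoding: each categorical field becomes a block of zeros with a single 1 at the position of its value. The sketch below reproduces the start of that vector under stated assumptions: the vocabularies and their ordering (Monday first, Female before Male, a four-city list) are hypothetical, chosen only so the output matches the slide's prefix.

```python
# Assumed vocabularies; the slide does not list them.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
GENDERS = ["Female", "Male"]
CITIES = ["Beijing", "Guangzhou", "Shanghai", "Shenzhen"]  # hypothetical list

FIELDS = [("Weekday", WEEKDAYS), ("Gender", GENDERS), ("City", CITIES)]


def one_hot(value, vocabulary):
    """Return a list of zeros with a 1 at the index of `value`."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1
    return vector


def encode(sample):
    """Concatenate the one-hot blocks of every field, in FIELDS order."""
    features = []
    for name, vocabulary in FIELDS:
        features.extend(one_hot(sample[name], vocabulary))
    return features


sample = {"Weekday": "Wednesday", "Gender": "Male", "City": "Shanghai"}
print(encode(sample))
# [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0]
```

With realistic vocabularies (thousands of cities, ad ids, user segments) the vector becomes very long and sparse, which is part of why CTR models are demanding to train.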
Thank You
www.ucloud.cn