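When Mars Meets RAPIDS: Principles and Practice of GPU-Accelerated Distributed Processing of Massive Data
Alibaba Cloud Intelligence, Xuye Qin (秦续业), Kaisheng He (何开圣)
Contents: Background · What Mars+RAPIDS can do · How Mars+RAPIDS does it · Performance and outlook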
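Machine learning lifecycle
Data → data processing / data analysis (often takes 80% of the time) → feature engineering / model training → trained model → model deployment / maintenance / improvement
{new data} → trained model → {prediction}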
Google Trends (worldwide)
The ever-growing data science technology stack
Data Engineer Data Scientist
Mars: a parallel and distributed accelerator for numpy, pandas, and scikit-learn that lets you process more data
NumPy vs. Mars tensor

NumPy (run time: 11.9 s, peak memory: 5479.47 MiB):

    import numpy as np
    from scipy.special import erf

    def black_scholes(P, S, T, rate, vol):
        a = np.log(P / S)
        b = T * -rate

        z = T * (vol * vol * 2)
        c = 0.25 * z
        y = 1.0 / np.sqrt(z)

        w1 = (a - b + c) * y
        w2 = (a - b - c) * y

        d1 = 0.5 + 0.5 * erf(w1)
        d2 = 0.5 + 0.5 * erf(w2)

        Se = np.exp(b) * S

        call = P * d1 - Se * d2
        put = call - P + Se

        return call, put

    N = 50000000
    price = np.random.uniform(10.0, 50.0, N)
    strike = np.random.uniform(10.0, 50.0, N)
    t = np.random.uniform(1.0, 2.0, N)
    print(black_scholes(price, strike, t, 0.1, 0.2))

Mars tensor (run time: 5.48 s, peak memory: 1647.85 MiB):

    import mars.tensor as mt
    from mars.tensor.special import erf

    def black_scholes(P, S, T, rate, vol):
        a = mt.log(P / S)
        b = T * -rate

        z = T * (vol * vol * 2)
        c = 0.25 * z
        y = 1.0 / mt.sqrt(z)

        w1 = (a - b + c) * y
        w2 = (a - b - c) * y

        d1 = 0.5 + 0.5 * erf(w1)
        d2 = 0.5 + 0.5 * erf(w2)

        Se = mt.exp(b) * S

        call = P * d1 - Se * d2
        put = call - P + Se

        return call, put

    N = 50000000
    price = mt.random.uniform(10.0, 50.0, N)
    strike = mt.random.uniform(10.0, 50.0, N)
    t = mt.random.uniform(1.0, 2.0, N)
    # Nothing is computed until execute() is called on the result.
    print(mt.ExecutableTuple(black_scholes(price, strike, t, 0.1, 0.2)).execute())
pandas vs. Mars DataFrame

pandas (run time: 18.7 s, peak memory: 3430.29 MiB):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(100000000, 4),
                      columns=list('abcd'))
    print(df.sum())

Mars DataFrame (run time: 5.25 s, peak memory: 2007.92 MiB):

    import mars.tensor as mt
    import mars.dataframe as md

    df = md.DataFrame(mt.random.rand(100000000, 4),
                      columns=list('abcd'))
    print(df.sum().execute())
scikit-learn vs. Mars learn

scikit-learn (run time: 19.1 s, peak memory: 7314.82 MiB):

    from sklearn.datasets import make_blobs
    # PCA is imported from the public sklearn.decomposition package;
    # the private sklearn.decomposition.pca module path is not importable in current releases.
    from sklearn.decomposition import PCA

    X, y = make_blobs(
        n_samples=100000000, n_features=3,
        centers=[[3, 3, 3], [0, 0, 0],
                 [1, 1, 1], [2, 2, 2]],
        cluster_std=[0.2, 0.1, 0.2, 0.2],
        random_state=9)
    pca = PCA(n_components=3)
    pca.fit(X)
    print(pca.explained_variance_ratio_)
    print(pca.explained_variance_)

Mars learn (run time: 12.8 s, peak memory: 3814.32 MiB):

    from sklearn.datasets import make_blobs
    from mars.learn.decomposition import PCA

    X, y = make_blobs(
        n_samples=100000000, n_features=3,
        centers=[[3, 3, 3], [0, 0, 0],
                 [1, 1, 1], [2, 2, 2]],
        cluster_std=[0.2, 0.1, 0.2, 0.2],
        random_state=9)
    pca = PCA(n_components=3)
    pca.fit(X)
    print(pca.explained_variance_ratio_.execute())
    print(pca.explained_variance_.execute())
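Machine learning lifecycle
Data → data processing / data analysis (often takes 80% of the time): GPU???
Feature engineering / model training: GPU acceleration supported
{new data} → trained model → {prediction}; model deployment / maintenance / improvement: GPU acceleration supported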
NumPy vs. CuPy

NumPy:

    In [1]: import numpy as np
    In [4]: %%time
       ...: a = np.random.rand(8000, 10)
       ...: _ = ((a[:, np.newaxis, :] - a) ** 2).sum(axis=-1)
       ...:
    CPU times: user 17 s, sys: 1.84 s, total: 18.8 s
    Wall time: 5.23 s

CuPy:

    In [2]: import cupy as cp
    In [5]: %%time
       ...: a = cp.random.rand(8000, 10)
       ...: _ = ((a[:, cp.newaxis, :] - a) ** 2).sum(axis=-1)
       ...:
    CPU times: user 590 ms, sys: 292 ms, total: 882 ms
    Wall time: 880 ms
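The expression timed above computes the squared Euclidean distance between every pair of rows via broadcasting. The sketch below runs the same expression on a three-point input so the result can be checked by hand. The helper name `pairwise_sq_dists` and its `xp` parameter are illustrative assumptions, not part of the slides; `xp` is simply the array module, NumPy by default.

```python
import numpy as np


def pairwise_sq_dists(a, xp=np):
    """Squared Euclidean distance between every pair of rows of `a`.

    `xp` is the array module to use (hypothetical parameter; NumPy by default).
    """
    # a[:, xp.newaxis, :] has shape (n, 1, d); subtracting a (n, d)
    # broadcasts to (n, n, d): one difference vector per pair of rows.
    diff = a[:, xp.newaxis, :] - a
    # Summing the squared differences over the last axis gives an (n, n) matrix.
    return (diff ** 2).sum(axis=-1)


points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
dists = pairwise_sq_dists(points)
print(dists)
```

Note that the intermediate array has shape (n, n, d). For the 8000 x 10 input on the slide that is 640 million float64 values, about 5 GB, which is why this small-looking expression makes a demanding benchmark.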
pandas vs. RAPIDS cuDF

pandas:

    In [6]: %%time
       ...: import pandas as pd
       ...: ratings = pd.read_csv('ml-20m/ratings.csv')
       ...: ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']})
       ...:
    CPU times: user 10.5 s, sys: 1.58 s, total: 12.1 s
    Wall time: 18 s

RAPIDS cuDF:

    In [7]: %%time
       ...: import cudf
       ...: ratings = cudf.read_csv('ml-20m/ratings.csv')
       ...: ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']})
       ...:
    CPU times: user 1.2 s, sys: 409 ms, total: 1.61 s
    Wall time: 1.66 s
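To make the shape of that groupby result concrete, here is a small pandas-only sketch. The five-row table is made up for illustration; the slide reads the real data from `ml-20m/ratings.csv`.

```python
import pandas as pd

# Tiny made-up ratings table standing in for ml-20m/ratings.csv.
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 2],
    "rating": [4.0, 5.0, 3.0, 2.0, 4.0],
})

# One row per user, one column per (source column, aggregation) pair.
stats = ratings.groupby("userId").agg({"rating": ["sum", "mean", "max", "min"]})
print(stats)
```

The result is indexed by `userId` and has two-level columns such as `("rating", "sum")`, so a single value is read with `stats.loc[1, ("rating", "sum")]`.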
scikit-learn vs. RAPIDS cuML

scikit-learn:

    In [4]: import pandas as pd
    In [5]: from sklearn.neighbors import NearestNeighbors
    In [6]: %%time
       ...: df = pd.read_csv('data.csv')
       ...: nn = NearestNeighbors(n_neighbors=10)
       ...: nn.fit(df)
       ...: neighbors = nn.kneighbors(df)
       ...:
    CPU times: user 3 min 34s, sys: 1.73 s, total: 3 min 36s
    Wall time: 1 min 52s

RAPIDS cuML:

    In [1]: import cudf
    In [2]: from cuml.neighbors import NearestNeighbors
    In [3]: %%time
       ...: df = cudf.read_csv('data.csv')
       ...: nn = NearestNeighbors(n_neighbors=10)
       ...: nn.fit(df)
       ...: neighbors = nn.kneighbors(df)
       ...:
    CPU times: user 41.6 s, sys: 2.84 s, total: 44.4 s
    Wall time: 17.8 s
Mars+RAPIDS: process more data, faster
Mars tensor: implements 70% of the commonly used NumPy interfaces
• Tensor creation
  • ones
  • empty
  • zeros
  • ones_like
  • …
• Random sampling
  • rand
  • randint
  • beta
  • binomial
  • …
• Indexing
  • Slice
  • Boolean indexing
  • Fancy indexing
  • newaxis
  • Ellipsis
• Discrete Fourier transform
• Linear Algebra
  • QR
  • SVD
  • Cholesky
  • inv
  • norm
  • …
• Basic manipulations
  • astype
  • transpose
  • broadcast_to
  • sort
  • …
• Aggregation
  • sum
  • nansum
  • max
  • all
  • mean
  • …
Mars DataFrame and learn
• DataFrame interfaces implemented: https://github.com/mars-project/mars/issues/495
  • Creating a DataFrame: DataFrame, from_records
  • IO: read_csv
  • Basic arithmetic
  • Math
  • Indexing: iloc, column selection, set_index
  • Reduction (aggregation)
  • Groupby (grouped aggregation)
  • merge/join
• Learn:
  • Decomposition: PCA, TruncatedSVD
  • TensorFlow: run_tensorflow_script; MarsDataset in progress
  • XGBoost: XGBClassifier, XGBRegressor
  • PyTorch: in progress
Monte Carlo estimation of π

Scale up

1 x Tesla V100:

    In [4]: %%time
       ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2), gpu=True)
       ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000).execute(n_parallel=1))
       ...:
    3.14157076
    CPU times: user 2.72 s, sys: 1.27 s, total: 3.99 s
    Wall time: 3.98 s

4 x Tesla V100:

    In [4]: %%time
       ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2), gpu=True)
       ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000).execute(n_parallel=4))
       ...:
    3.14156894
    CPU times: user 1.64 s, sys: 918 ms, total: 2.56 s
    Wall time: 2.4 s

Scale out

24 cores:

    In [3]: %%time
       ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2))
       ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000).execute())
       ...:
    3.14160312
    CPU times: user 3 min 31s, sys: 1 min 42s, total: 5 min 14s
    Wall time: 25.8 s

4 x 24 cores:

    In [4]: from mars.session import new_session
    In [5]: new_session('http://192.168.0.111:40002').as_default()
    In [6]: %%time
       ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2))
       ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000).execute())
       ...:
    3.141611406
    CPU times: user 12.2 ms, sys: 2.02 ms, total: 14.3 ms
    Wall time: 7.66 s
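The benchmark above is the classic Monte Carlo estimate of π: sample points uniformly in the square [-1, 1] x [-1, 1] and count the fraction that lands inside the unit circle. That fraction approaches the area ratio π / 4, so multiplying by 4 gives π. A minimal single-machine sketch of the same computation in plain NumPy follows; the function name `estimate_pi`, the fixed seed, and the much smaller sample count are assumptions made here so it runs quickly anywhere.

```python
import numpy as np


def estimate_pi(n_samples, seed=0):
    """Estimate pi from the fraction of random points inside the unit circle."""
    rng = np.random.default_rng(seed)
    # Uniform points in the square [-1, 1] x [-1, 1].
    points = rng.uniform(-1, 1, size=(n_samples, 2))
    # A point is inside the unit circle when its distance from the origin is < 1.
    inside = np.linalg.norm(points, axis=1) < 1
    # Area ratio circle / square = pi / 4, hence the factor of 4.
    return inside.sum() * 4 / n_samples


pi_hat = estimate_pi(1_000_000)
print(pi_hat)
```

The error shrinks only with the square root of the sample count, which is why the slide uses two billion points and why the workload benefits from more GPUs or more machines.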
How does Mars achieve parallel and distributed execution? Let's look at the design philosophy behind Mars.
Philosophy 1: divide and conquer

    In [1]: import mars.tensor as mt
    In [2]: import mars.dataframe as md
    In [3]: a = mt.ones((10, 10), chunk_size=5)
    In [4]: a[5, 5] = 8
    In [5]: df = md.DataFrame(a)
    In [6]: s = df.sum()
    In [7]: s.execute()
    Out[7]:
    0    10.0
    1    10.0
    2    10.0
    3    10.0
    4    10.0
    5    17.0
    6    10.0
    7    10.0
    8    10.0
    9    10.0
    dtype: float64

Coarse-grained computation graph (figure), from bottom to top:
Ones → TensorData, held by tensor(a) → IndexSetValue (indexes: (5, 5), value: 8) → TensorData → FromTensor → DataFrameData, held by DataFrame(df) → Sum → SeriesData, held by Series(s)
Legend: Tileable, TileableData, Operand. Each Tileable refers to its TileableData through `data`.
Coarse-grained computation graph → Tile → fine-grained computation graph (figure)

Fine-grained graph, from bottom to top:
• Ones x 4 → TensorChunkData (0,0), (1,0), (0,1), (1,1)
• IndexSetValue (indexes: (0, 0), value: 8) on the chunk that contains element (5, 5) → TensorChunkData
• FromTensor x 4 → DataFrameChunkData x 4
• Sum x 4 → DataFrameChunkData x 4
• Concat x 2 → DataFrameChunkData x 2
• Sum x 2 → SeriesChunkData x 2
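The tiling step can be imitated by hand in plain NumPy to see why the chunked graph yields the same answer as the whole-array computation. This is only an illustrative stand-in under stated assumptions: it uses a dictionary of NumPy arrays for the chunks and does not use or reproduce Mars internals. The names `CHUNK`, `chunks`, `partial`, and `col_blocks` are made up for the sketch.

```python
import numpy as np

CHUNK = 5  # mirrors chunk_size=5 in the session above

# "Ones": a 10 x 10 array of ones split into four 5 x 5 chunks,
# keyed by chunk index (row block, column block).
chunks = {(i, j): np.ones((CHUNK, CHUNK)) for i in range(2) for j in range(2)}

# "IndexSetValue": global element (5, 5) lives in chunk (1, 1)
# at local position (0, 0), so only that one chunk is modified.
row, col = 5, 5
chunks[(row // CHUNK, col // CHUNK)][row % CHUNK, col % CHUNK] = 8

# First "Sum" layer: each chunk independently sums its own columns.
partial = {idx: c.sum(axis=0) for idx, c in chunks.items()}

# "Concat" + second "Sum" layer: for each column block, stack the partial
# sums from both row blocks and add them up.
col_blocks = [
    np.stack([partial[(i, j)] for i in range(2)]).sum(axis=0)
    for j in range(2)
]

# Gather the two result chunks into the final column sums.
result = np.concatenate(col_blocks)
print(result)
```

Every step before the final gather touches one chunk or one column block at a time, which is what allows the pieces to be scheduled on different cores, GPUs, or machines.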