GPU Enabled Data Science
Adam Thompson | adamt@nvidia.com | Senior Solutions Architect
Corey Nolet | cnolet@nvidia.com | Data Scientist and Senior Engineer
2 About Us
Adam Thompson (@adamlikesai): Senior Solutions Architect at NVIDIA, Signals/RF SME. Research interests include applications of machine and deep learning to non-traditional data types, HPC, and large system integration. Experience programming GPUs for signal processing tasks since CUDA Toolkit 2.
Corey Nolet (@cjnolet): Data Scientist & Senior Engineer on the RAPIDS cuML team at NVIDIA. Focuses on building and scaling machine learning algorithms to support extreme data loads at light speed. Over a decade of experience building massive-scale exploratory data science & real-time analytics platforms for HPC environments in the defense industry. Working toward a PhD in Computer Science, focused on unsupervised ML.
3 REALITIES OF PERFORMANCE: GPU Performance Grows
[Chart: log-scale performance (10^2 to 10^7) from 1980 to 2020. Single-threaded CPU performance once grew ~1.5X per year but has slowed to ~1.1X per year, while GPU-computing performance continues to grow ~1.5X per year, for a projected 1000X gap by 2025.]
4 Realities of Data
5 ML Workflow Stifles Innovation: It Requires Exploration and Iterations
[Pipeline: All Data → ETL / Manage Data → Structured Data Store → Data Preparation / Training → Model Training → Visualization / Evaluate → Inference / Deploy]
Accelerating just `Model Training` has benefit, but it doesn't address the whole problem; end-to-end acceleration is needed. Iterate … Cross Validate … Grid Search … Iterate some more.
7 GPU Adoption Barriers
Yes, GPUs are fast, but …
• Too much data movement
• Too many makeshift data formats
• Writing CUDA C/C++ is hard
• No Python API for data manipulation
8 DATA MOVEMENT AND TRANSFORMATION: The bane of productivity and performance
[Diagram: App A and App B each read and load data on the CPU, then repeatedly copy & convert it between CPU and GPU memory, each maintaining its own separate GPU data copy.]
9 DATA MOVEMENT AND TRANSFORMATION: What if we could keep data on the GPU?
[Diagram: the same two applications, but the repeated copy & convert steps are eliminated; data stays resident on the GPU and is shared between App A and App B.]
10 Learning from APACHE ARROW Source: Apache Arrow Home Page (https://arrow.apache.org/)
11 Data Processing Evolution: Faster Data Access, Less Data Movement
• Hadoop (processing, reading from disk): HDFS Read → Query → HDFS Write → HDFS Read → ETL → HDFS Write → HDFS Read → ML Train
• Spark (in-memory processing): HDFS Read → Query → ETL → ML Train. 25-100x improvement; less code; language flexible; primarily in-memory
• GPU/Spark (in-memory processing): HDFS Read → GPU Read → Query → CPU Write → GPU Read → ETL → CPU Write → GPU Read → ML Train. 5-10x improvement; more code; language rigid; substantially on GPU
• RAPIDS: Arrow Read → Query → ETL → ML Train. 50-100x improvement; same code; language flexible; primarily on GPU
12 [Stack: APPLICATIONS / SYSTEMS / ALGORITHMS / CUDA / ARCHITECTURE]
• Learn what the data science community needs
• Use best practices and standards
• Build scalable systems and algorithms
• Test applications and workflows
• Iterate
13 RAPIDS Open Source Software
[All components share GPU memory across Data Preparation, Model Training, and Visualization:]
• cuDF: Analytics
• cuML: Machine Learning
• cuGraph: Graph Analytics
• PyTorch & Chainer: Deep Learning
• Kepler.GL: Visualization
14 RAPIDS: OPEN GPU DATA SCIENCE. cuDF, cuML, and cuGraph mimic well-known libraries
[Stack: Python and Dask on top; the RAPIDS libraries cuDF (Pandas-like), cuML (Scikit-Learn-like), and cuGraph (NetworkX-like), alongside DL frameworks (cuDNN), all built on Apache Arrow; CUDA underneath.]
15 Dask: What is Dask, and why does RAPIDS use it for scaling out?
• Dask is a distributed computation scheduler built to scale Python workloads from laptops to supercomputer clusters.
• Extremely modular: scheduling, compute, data transfer, and out-of-core handling are all decoupled, allowing us to plug in our own implementations.
• Can easily run multiple Dask workers per node, allowing for an easier development model of one worker per GPU regardless of single-node or multi-node environment.
16 Dask: Scale up and out with cuDF
• Use cuDF primitives underneath in map-reduce style operations with the same high-level API (see the sketch below)
• Instead of using typical Dask data movement of pickling objects and sending via TCP sockets, take advantage of hardware advancements using a communications framework called OpenUCX:
• For intranode data movement, utilize NVLink and PCIe peer-to-peer communications
• For internode data movement, utilize GPU RDMA over InfiniBand and RoCE
https://github.com/rapidsai/dask_gdf
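A minimal sketch of the scale-out model above, using dask_cudf (the successor to the dask_gdf repository linked here); the file pattern and column names are hypothetical:

import dask_cudf

# Read CSV shards into a partitioned GPU DataFrame; each partition
# is an ordinary cuDF DataFrame living on one worker's GPU.
ddf = dask_cudf.read_csv('data/transactions-*.csv')

# Same high-level pandas-like API, executed map-reduce style across
# partitions and only materialized on .compute().
result = ddf.groupby('account_id')['amount'].sum().compute()
print(result)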
17 Faster speeds, real world benefits
Benchmark: 200GB CSV dataset; data preparation includes joins and variable transformations. Time in seconds (shorter is better).
CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark. DGX cluster configuration: 5x DGX-1 on InfiniBand network.

Configuration    cuIO/cuDF load & data prep (s)    cuML XGBoost (s)    End-to-end (s)
20 CPU Nodes     2,741                             2,290               8,762
30 CPU Nodes     1,675                             1,956               6,148
50 CPU Nodes     715                               1,999               3,925
100 CPU Nodes    379                               1,948               3,221
DGX-2            42                                169                 322
5x DGX-1         19                                157                 213
18 AI Libraries: Accelerating more of the AI ecosystem with cuML & cuGraph
Graph analytics is fundamental to network analysis. Machine learning is fundamental to prediction, classification, clustering, anomaly detection, and recommendations. Both can be accelerated with NVIDIA GPUs: 8x V100 runs 20-90x faster than a dual-socket CPU (XGBoost, Mortgage Dataset, 90x: 3 hours to 2 minutes on 1 DGX-1).
• Machine Learning: Decision Trees, Random Forests, Linear Regression, Logistic Regression, K-Means, K-Nearest Neighbors, DBSCAN, Kalman Filtering, Principal Components, Singular Value Decomposition, Bayesian Inference
• Graph Analytics: PageRank, BFS, Jaccard Similarity, Single Source Shortest Path, Triangle Counting, Louvain Modularity
• Time Series: ARIMA, Holt-Winters
19 cuDF
20 RAPIDS: OPEN GPU DATA SCIENCE. Following the Pandas API for optimal code reuse

import pandas as pd
df = pd.read_csv('foo.csv', names=['index', 'A'], dtype=['date', 'float64'])

import cudf
gdf = cudf.read_csv('foo.csv', names=['index', 'A'], dtype=['date', 'float64'])

# eventually the majority of pandas features will be available in cudf
21 cuDF GPU DataFrame library • Apache Arrow data format • Pandas-like API • Unary and Binary Operations • Joins / Merges • GroupBys • Filters • User-Defined Functions (UDFs) • Accelerated file readers • Much, much more
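A short sketch of a few of the operations listed above; it assumes a recent cuDF build, and the columns and values are made up for illustration:

import cudf

left = cudf.DataFrame({'key': [0, 1, 2, 3], 'a': [10.0, 20.0, 30.0, 40.0]})
right = cudf.DataFrame({'key': [0, 1, 2, 2], 'b': [1.5, 2.5, 3.5, 4.5]})

merged = left.merge(right, on='key')               # join / merge on the GPU
grouped = merged.groupby('key').agg({'b': 'sum'})  # groupby aggregation
filtered = merged.query('a > 15.0')                # JIT-compiled row filter
print(grouped)
print(filtered)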
22 cuDF Today: CUDA with Python bindings
CUDA C/C++ layer:
• Low-level library containing function implementations and a C/C++ API
• Importing/exporting Apache Arrow using the CUDA IPC mechanism
• CUDA kernels to perform element-wise math operations on GPU DataFrame columns
• CUDA sort, join, groupby, and reduction operations on GPU DataFrames
Python layer:
• A Python library for manipulating GPU DataFrames
• Python interface to CUDA C++ with additional functionality
• Creating Apache Arrow from Numpy arrays, Pandas DataFrames, and PyArrow Tables
• JIT compilation of User-Defined Functions (UDFs) using Numba
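A sketch of the Numba UDF path mentioned above, assuming cuDF's apply_rows interface; the kernel body, column names, and scale argument are illustrative:

import cudf
import numpy as np

df = cudf.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [10.0, 20.0, 30.0]})

# A plain Python function that cuDF JIT-compiles into a CUDA kernel via Numba.
def scaled_sum(x, y, out, scale):
    for i, (xi, yi) in enumerate(zip(x, y)):
        out[i] = scale * (xi + yi)

df = df.apply_rows(scaled_sum,
                   incols=['x', 'y'],
                   outcols={'out': np.float64},
                   kwargs={'scale': 0.5})
print(df)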
23 cuStrings & nvStrings: GPU-accelerated string functions with a Pandas-like API
• API and functionality follow Pandas: https://pandas.pydata.org/pandas-docs/stable/api.html#string-handling
• lower(): ~22x speedup
• find(): ~40x speedup
• slice(): ~100x speedup
[Chart: milliseconds for lower(), find(#), and slice(1,15), Pandas vs. cudastrings, on a 0-800 ms scale.]
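A minimal sketch of the nvStrings Python interface; to_device and the methods below mirror the Pandas string API, though exact signatures varied across early releases:

import nvstrings

s = nvstrings.to_device(['Apple', 'Banana', 'Cherry'])  # host list -> GPU

print(s.lower())       # element-wise lowercase on the GPU
print(s.find('an'))    # per-element substring search
print(s.slice(1, 3))   # per-element substring extraction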
24 GPU DataFrame: the next few months
• Continue improving performance and functionality: single GPU; single node, multi GPU; multi node, multi GPU
• String support: a dedicated "string" dtype with GPU-accelerated functionality similar to Pandas
• Accelerated data loading. File formats: CSV, Parquet, ORC, to start
25 cuGraph
26 cuGraph: GPU-Accelerated Graph Analytics Library
Coming soon: full nvGraph integration, Q1 2019
27 Early cuGraph benchmarks
• nvGraph methods (not currently in cuGraph): 100 billion TEPS (single GPU); can reach 300 billion TEPS on multi-GPU (4 GPUs) with hand-tuning
• PageRank (single GPU) with RAPIDS on a graph with 5.5m edges and 51k nodes:
• CPU (GraphFrames, Spark) [576 GB memory, 384 vcores]: 172 seconds
• GPU (cuGraph, V100) [32 GB memory, 5120 CUDA cores]: 671 ms
• Speed-up of 256x
• For large graphs, there are huge parallelization benefits
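A sketch of a single-GPU PageRank run like the one benchmarked above; the edge-list file and column names are hypothetical, and the Graph-construction call follows the early cuGraph API, which has since changed:

import cudf
import cugraph

edges = cudf.read_csv('graph_edges.csv', names=['src', 'dst'],
                      dtype=['int32', 'int32'])

G = cugraph.Graph()
G.add_edge_list(edges['src'], edges['dst'])  # early cuGraph edge-list API

pr = cugraph.pagerank(G, alpha=0.85, max_iter=100)  # cuDF DataFrame of scores
print(pr.head())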
28 cuML
29-43 Problem: Data sizes continue to grow
[Build slides: as data grows toward a massive dataset, the workflow trades off min(variance) against min(bias) and leans on Histograms / Distributions, Dimension Reduction, Feature Selection, Outlier Removal, and Sampling.]
Better to start with as much data as possible and explore / preprocess to scale to performance needs. Iterate. Cross validate & grid search. Iterate some more. Meet a reasonable speed vs. accuracy tradeoff, and time increases. Hours? Days?
44 “More data requires better approaches!” ~ Xavier Amatriain (CTO of Curai)
45 cuML API: GPU-accelerated machine learning at every layer
• Python: Scikit-learn-like interface for data scientists, utilizing cuDF & Numpy
• Algorithms: C++ API for developers to utilize accelerated machine learning algorithms
• Primitives: reusable building blocks for composing machine learning algorithms
46 Primitives: GPU-accelerated math optimized for feature matrices
Categories: Linear Algebra, Statistics, Sparse, Random, Distance / Metrics, Optimizers. More to come!
Linear algebra examples:
• Element-wise operations
• Matrix multiply
• Norms
• Eigendecomposition
• SVD/RSVD
• Transpose
• QR decomposition
47 Algorithms: GPU-accelerated Scikit-Learn
• Classification / Regression: Decision Trees / Random Forests, Linear Regression, Logistic Regression, K-Nearest Neighbors
• Statistical Inference: Kalman Filtering, Bayesian Inference, Gaussian Mixture Models, Hidden Markov Models
• Clustering: K-Means, DBSCAN, Spectral Clustering
• Decomposition & Dimensionality Reduction: Principal Components, Singular Value Decomposition, UMAP, Spectral Embedding
• Time Series Forecasting: ARIMA, Holt-Winters
• Recommendations: Implicit Matrix Factorization
• Plus Cross Validation and Hyper-parameter Tuning. More to come! (A short example follows below.)
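To make the Scikit-Learn resemblance concrete, here is a sketch using cuML's LinearRegression on synthetic data; the data and settings are illustrative:

import numpy as np
import cudf
from cuml import LinearRegression

# Synthetic regression problem: y = 2*x0 + 3*x1 + noise.
n = 1000
x0 = np.random.rand(n).astype(np.float32)
x1 = np.random.rand(n).astype(np.float32)
y_np = 2.0 * x0 + 3.0 * x1 + 0.01 * np.random.randn(n).astype(np.float32)

X = cudf.DataFrame({'x0': x0, 'x1': x1})
y = cudf.Series(y_np)

lr = LinearRegression()   # same fit/predict shape as sklearn
lr.fit(X, y)
preds = lr.predict(X)
print(preds)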
48-49 HIGH-LEVEL APIs
[Stack: Python layer with Dask Multi-GPU ML and a Scikit-Learn-like API; CUDA/C++ layer with ML Algorithms, ML Primitives, and Multi-Node / Multi-GPU Communications. Diagram: two hosts with four GPUs each, supporting both data parallelism and model parallelism.]
50-51 ML Technology Stack
• Layers: Python → Cython → cuML Algorithms (C++) → cuML Prims (C++) → CUDA Libraries → CUDA
• Python ecosystem: Dask cuML, Dask cuDF, cuDF, Numpy
• CUDA libraries used by the prims: Thrust, CUB, cuSolver, nvGraph, CUTLASS, cuSPARSE, cuRAND, cuBLAS
52-55 Dimensionality Reduction: Code Example

Load CSV & preprocess:

import numpy as np
import cudf

df = cudf.read_csv("iris/iris.data")
df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True)
labels = df["class"]
df = df.drop(["class"], axis=1).astype(np.float32)

   sepal_len  sepal_wid  petal_len  petal_wid
0        5.1        3.5        1.4        0.2
1        4.9        3.0        1.4        0.2
2        4.7        3.2        1.3        0.2
3        4.6        3.1        1.5        0.2
4        5.0        3.6        1.4        0.2

Explore variance:

from cuml import PCA

pca = PCA(n_components=df.shape[1])
pca.fit(df)
print("Explained var %%: %s" % pca.explained_variance_ratio_)
first_two = sum(pca.explained_variance_ratio_[0:2])
print("\nFirst two: %f" % first_two)

Output:
Explained var %:
0    0.92461634
1    0.05301554
2    0.017185122
3    0.005183111
First two: 0.9776318781077862
56-59 Dimensionality Reduction: Code Example (continued)

Reduce dimensions:

pca = PCA(n_components=2)
pca.fit(df)
X = pca.transform(df)

Chart rotated data:

import matplotlib.pyplot as plt

uniq = set(labels.values)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(uniq))]
for k, col in zip(uniq, colors):
    # copy each class's rows back to the host so matplotlib can plot them
    c = X[labels == k].as_gpu_matrix().copy_to_host()
    plt.plot(c[:, 0], c[:, 1], '.', markerfacecolor=tuple(col))
60-64 Clustering: Code Example

Load moons dataset and preprocess:

import pandas as pd
import cudf
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)
X_df = pd.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})
X = cudf.DataFrame.from_pandas(X_df)

Find clusters:

from cuml import DBSCAN

dbscan = DBSCAN(eps=0.3, min_samples=5)
y_hat = dbscan.fit_predict(X)
65 cuML Benchmarks of initial algorithms
66 cuDF + XGBoost: DGX-2 vs. scale-out CPU cluster
• Full end-to-end pipeline
• Leveraging Dask + cuDF
• Stores each GPU's results in system memory, then reads them back in
• Arrow to DMatrix (CSR) for XGBoost
67 cuDF + XGBoost: Scale-out GPU cluster vs. DGX-2
[Chart: seconds (0-350) for 5x DGX-1 vs. DGX-2, broken into ETL+CSV (s), ML Prep (s), and ML (s).]
• Full end-to-end pipeline
• Leveraging Dask for multi-node + cuDF
• Stores each GPU's results in system memory, then reads them back in
• Arrow to DMatrix (CSR) for XGBoost
68 cuDF + XGBoost: Fully in-GPU benchmarks
• Full end-to-end pipeline
• Leveraging Dask cuDF
• No data-prep time; all in memory
• Arrow to DMatrix (CSR) for XGBoost (see the sketch below)
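A sketch of the cuDF-to-DMatrix hand-off described above, on a single GPU; it assumes an XGBoost build with GPU support and a hypothetical train_df with a 'label' column (RAPIDS-era builds could ingest cuDF directly; a portable fallback is to go through pandas):

import xgboost as xgb

features = train_df.drop(['label'], axis=1)   # hypothetical column layout
labels = train_df['label']

# DMatrix is XGBoost's internal, CSR-backed training format.
dtrain = xgb.DMatrix(features.to_pandas(), label=labels.to_pandas())

params = {
    'tree_method': 'gpu_hist',      # GPU-accelerated histogram algorithm
    'objective': 'binary:logistic',
    'max_depth': 8,
}
bst = xgb.train(params, dtrain, num_boost_round=100)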
69 XGBoost Multi-node, Multi-GPU Performance
70 End-to-End Benchmarks
Benchmark: 200GB CSV dataset; data preparation includes joins and variable transformations. Time in seconds (shorter is better).
CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark. DGX cluster configuration: 5x DGX-1 on InfiniBand network.

Configuration    cuIO/cuDF load & data prep (s)    cuML XGBoost (s)    End-to-end (s)
20 CPU Nodes     2,741                             2,290               8,762
30 CPU Nodes     1,675                             1,956               6,148
50 CPU Nodes     715                               1,999               3,925
100 CPU Nodes    379                               1,948               3,221
DGX-2            42                                169                 322
5x DGX-1         19                                157                 213
71 RAPIDS: How do I get the software?
• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai
• https://hub.docker.com/r/rapidsai/rapidsai/
• https://github.com/rapidsai
• https://anaconda.org/rapidsai/
• WIP:
• https://pypi.org/project/cudf
• https://pypi.org/project/cuml
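For example, the containerized route from the Docker Hub link above is a single pull (the image tag and runtime flag reflect the nvidia-docker2 setup of the time and may have changed):

docker pull rapidsai/rapidsai
docker run --runtime=nvidia -it rapidsai/rapidsai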