The Future of Computing is Distributed
Ion Stoica, December 4, 2020
Distributed systems are not new: ARPA Network (1970s), HPC (1980s), Web (1990s), Big Data (2000s)…
Distributed computing is still the exception…
● Inaccessible to most developers
● Few universities teach distributed computing
This will change: distributed computing will be the norm, rather than the exception.
What is different this time?
● The rise of deep learning (DL)
● The end of Moore’s Law
● Apps becoming AI centric
DL compute demands are growing faster than ever: 35x every 18 months. (https://openai.com/blog/ai-and-compute//)
Not only esoteric apps: ResNet-50, ResNeXt-101, image processing, translation. (https://openai.com/blog/ai-and-compute//)
Not only processing, but memory. [Chart: model sizes from ResNet-50, Transformer, and BERT through GPT-1 (1.5B), GPT-2 (8.3B), Turing Proj. (17B), and GPT-3 (175B); growth of 20x, then 200x, every 18 months.]
What is different?
● The rise of deep learning (DL)
● The end of Moore’s Law
● Apps becoming AI centric
The end of Moore’s Law: from 2x every 18 months to 1.05x every 18 months.
Hardware cannot keep up: DL demand grows 35x every 18 months, while CPUs follow Moore’s Law at 2x every 18 months. (https://openai.com/blog/ai-and-compute//)
What about specialized hardware? It trades generality for performance, but it is not enough: it just extends Moore’s Law.
Specialized hardware is not enough. [Chart: DL demand at 35x every 18 months vs. Moore’s Law (2x every 18 months); CPU, GPU, TPU.] (https://openai.com/blog/ai-and-compute//)
Memory is dwarfed by demand: GPU memory increased by just 1.45x every 18 months, while model memory requirements (ResNet-50, Transformer, BERT, GPT-1 (1.5B), GPT-2 (8.3B), Turing Proj. (17B)) grew 20x every 18 months. (https://devblogs.nvidia.com/training-bert-with-gpus/) No way out but distributed!
What is different?
● The rise of deep learning (DL)
● The end of Moore’s Law
● Apps becoming AI centric
Apps are becoming AI centric… and integrating with other distributed workloads.
Before: isolated distributed workloads (HPC, Big Data, Microservices).
Now, AI integrates with the other distributed workloads:
● HPC: training and simulations (e.g., distributed training using MPI, leveraging industrial simulators for RL)
● Big Data: log processing and featurization (e.g., online learning, RL)
● Microservices: serving and business logic (e.g., inference, online learning, web backends)
Distributed apps are becoming the norm: there is no way to scale AI but to go distributed. Apps are becoming AI centric and integrating with other distributed workloads: HPC (training, simulations), Big Data (log processing, featurization), and Microservices (serving, business logic). (https://devblogs.nvidia.com/training-bert-with-gpus/)
No general solution exists for e2e AI apps spanning DL, Big Data, Microservices, and HPC.
The natural solution: stitch together existing frameworks, with a separate distributed system for each workload (model serving, training, hyperparameter tuning, data processing, simulation, business logic).
RAY Universal framework for distributed computing
Ray: universal framework for distributed computing (Python and Java), with libraries for model serving, training, hyperparameter tuning, data processing, simulation, and business logic.
Ray spans DL, Big Data, Microservices, and HPC.
Three key ideas
● Execute functions remotely as tasks, and instantiate classes remotely as actors
  ○ Support both stateful and stateless computations
● Asynchronous execution using futures
  ○ Enable parallelism
● Distributed (immutable) object store
  ○ Efficient communication (send arguments by reference)
Function:

import numpy as np

def read_array(file):
    # read ndarray "a" from "file"
    return a

def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Class:

class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()
Function → Task:

import numpy as np
import ray

@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

a = read_array(file1)
b = read_array(file2)
sum = add(a, b)

Class → Actor:

@ray.remote
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

c = Counter()
c.inc()
c.inc()

(This step only adds the @ray.remote decorators; the call sites are converted to .remote() calls in the next step.)
Function → Task:

@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

Class → Actor:

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.value = 0
    def inc(self):
        self.value += 1
        return self.value

c = Counter.remote()
id4 = c.inc.remote()
id5 = c.inc.remote()
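The actor calls above return object IDs just like tasks do; a small follow-up (not on the original slide) fetches their results:

# Each inc() call returned an object ID (future); fetch both results.
print(ray.get([id4, id5]))  # -> [1, 2]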
Task API
(The id variables are Object IDs, similar to futures. In the example, file1 lives on Node 1 and file2 on Node 2.)

@ray.remote
def read_array(file):
    # read ndarray "a" from "file"
    return a

@ray.remote
def add(a, b):
    return np.add(a, b)

id1 = read_array.remote(file1)
id2 = read_array.remote(file2)
id = add.remote(id1, id2)
sum = ray.get(id)

How execution unfolds:
● read_array.remote(file1) returns id1 (a future) immediately, before read_array() finishes
● The dynamic task graph is built at runtime, as read_array runs on Node 1 and Node 2
● add is scheduled on a third node; every task is scheduled, but not necessarily finished yet
● ray.get(id) blocks until the result is available
● The task graph is executed to compute sum
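To make the three key ideas concrete, here is a minimal sketch (not from the slides; normalize and the array are illustrative) combining futures for parallelism with the object store for pass-by-reference arguments:

import numpy as np
import ray

ray.init()

@ray.remote
def normalize(x):
    # Each task runs in parallel across the cluster.
    return (x - x.mean()) / x.std()

# Put a large array into the distributed object store once;
# tasks receive it by reference rather than by copy.
big = ray.put(np.random.rand(1_000_000))

# Each .remote() call returns a future immediately.
futures = [normalize.remote(big) for _ in range(4)]

# Block until all four results are available.
results = ray.get(futures)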
Library Ecosystem
Best distributed library ecosystem. Native Libraries: RLlib, the most popular scalable RL library ● PyTorch and TF support ● Largest number of algorithms
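A rough sketch (not from the slides) of what training with RLlib looked like around Ray 1.x; the environment and config values are illustrative:

import ray
from ray.rllib.agents.ppo import PPOTrainer

ray.init()

# Train PPO on CartPole with two parallel rollout workers.
trainer = PPOTrainer(env="CartPole-v0", config={"num_workers": 2})
for _ in range(3):
    result = trainer.train()
    print(result["episode_reward_mean"])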
Native Libraries: Tune, a popular hyperparameter tuning library. “For me, and I say this as a Hyperopt maintainer, Tune is the clear winner down the road. Tune is fairly well architected and it integrates with everything else, and it’s built on top of Ray so it has all the benefits stemming from that as well. … In 2020, I would certainly bet on Tune.” -- Max Pumperla, Hyperopt maintainer
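A minimal Tune sketch (not from the slides), assuming the Ray 1.x API; the trainable and metric are illustrative:

from ray import tune

def trainable(config):
    # Illustrative objective: report one score per sampled learning rate.
    tune.report(score=1.0 / config["lr"])

analysis = tune.run(
    trainable,
    config={"lr": tune.grid_search([0.001, 0.01, 0.1])},
)
print(analysis.get_best_config(metric="score", mode="max"))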
Native Libraries: another library just launched; a promising start, but a long way to go.
3rd Party Libraries:
● Horovod: the most popular distributed training library
● Optuna and Hyperopt: popular hyperparameter search libraries (see the Tune integration sketch below)
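As an illustration of the Hyperopt integration (not from the slides, and assuming the Ray 1.x ray.tune.suggest.hyperopt module), Tune can use Hyperopt as its search algorithm:

from hyperopt import hp
from ray import tune
from ray.tune.suggest.hyperopt import HyperOptSearch

def trainable(config):
    # Illustrative objective to minimize.
    tune.report(loss=(config["lr"] - 0.05) ** 2)

space = {"lr": hp.loguniform("lr", -7, -2)}
tune.run(
    trainable,
    search_alg=HyperOptSearch(space, metric="loss", mode="min"),
    num_samples=20,
)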
The two most popular NLP libraries use Ray to scale up.
Major ML cloud platforms (e.g., ModelArts) are embedding Ray/RLlib/Tune.
Dask running on Ray ● Faster, more resilient, and more scalable (see the sketch below)
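A minimal Dask-on-Ray sketch (not from the slides), assuming the ray.util.dask scheduler entry point from Ray 1.x; the array sizes are arbitrary:

import dask.array as da
import ray
from ray.util.dask import ray_dask_get

ray.init()

# Build an ordinary Dask computation...
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
result = (x + x.T).mean()

# ...and execute its task graph on Ray instead of Dask's own scheduler.
print(result.compute(scheduler=ray_dask_get))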
A popular experiment tracking platform offers a one-line integration with Ray Tune.
Intel’s unified Data Analytics and AI platform ● Integrated Ray together with Spark, TF, etc. ● Use cases include streaming ML with customers such as Burger King and BMW
Your apps here!
Adoption
Significant community contributions:
● Refactored actor management
● Placement groups (see the sketch below)
● Java API
● Direct calls between workers
● Azure support in the cluster launcher
● Original dashboard
● Refactored worker to C++
● Initial port to Bazel
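A rough sketch (not from the slides) of the placement group API as of Ray 1.x; the resource shapes and the work() task are illustrative:

import ray
from ray.util.placement_group import placement_group

ray.init()

# Reserve two 1-CPU bundles, packed onto as few nodes as possible.
pg = placement_group([{"CPU": 1}, {"CPU": 1}], strategy="PACK")
ray.get(pg.ready())  # block until the reservation is granted

@ray.remote(num_cpus=1)
def work(i):
    return i * i

# Schedule tasks into the reserved bundles.
futures = [work.options(placement_group=pg).remote(i) for i in range(2)]
print(ray.get(futures))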
Summary
● Ray: universal framework for distributed computing
● Comprehensive ecosystem of scalable libraries (native and 3rd party)
https://github.com/ray-project/ray
Thank you!

The Future of Computing is Distributed