SPECIAL SECTION ON TRENDS, PERSPECTIVES AND PROSPECTS OF MACHINE LEARNING
APPLIED TO BIOMEDICAL SYSTEMS IN INTERNET OF MEDICAL THINGS
Received August 16, 2018, accepted September 20, 2018, date of publication October 8, 2018, date of current version November 9, 2018.
Digital Object Identifier 10.1109/ACCESS.2018.2874767
Performance Analysis of Google Colaboratory as a Tool for Accelerating Deep Learning Applications
TIAGO CARNEIRO 1 , RAUL VICTOR MEDEIROS DA NÓBREGA 1 , THIAGO NEPOMUCENO 2 ,
GUI-BIN BIAN 3 , (Member, IEEE), VICTOR HUGO C. DE ALBUQUERQUE 4 , (Member, IEEE),
AND PEDRO PEDROSA REBOUÇAS FILHO 1
1 Instituto Federal de Educação, Ciência e Tecnologia do Ceará, Fortaleza-CE 60040-531, Brazil
2 Fraunhofer-Arbeitsgruppe für Supply Chain Services SCS, 90411 Nürnberg, Germany
3 State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
4 Programa de Pós-Graduação em Informática Aplicada, Universidade de Fortaleza, Fortaleza-CE 60811-905, Brazil
Corresponding author: Gui-Bin Bian (guibin.bian@ia.ac.cn)
This work was supported by the Youth Innovation Promotion Association of the Chinese Academy of Sciences under Grant 218165.
ABSTRACT Google Colaboratory (also known as Colab) is a cloud service based on Jupyter Notebooks
for disseminating machine learning education and research. It provides a runtime fully configured for deep
learning and free-of-charge access to a robust GPU. This paper presents a detailed analysis of Colaboratory
regarding hardware resources, performance, and limitations. This analysis is performed through the use of
Colaboratory for accelerating deep learning for computer vision and other GPU-centric applications. The
chosen test-cases are a parallel tree-based combinatorial search and two computer vision applications: object
detection/classification and object localization/segmentation. The hardware under the accelerated runtime
is compared with a mainstream workstation and a robust Linux server equipped with 20 physical cores.
Results show that the performance reached using this cloud service is equivalent to the performance of the
dedicated testbeds, given similar resources. Thus, this service can be effectively exploited to accelerate not
only deep learning but also other classes of GPU-centric applications. For instance, it is faster to train a CNN
on Colaboratory’s accelerated runtime than using 20 physical cores of a Linux server. The performance of
the GPU made available by Colaboratory may be enough for several profiles of researchers and students.
However, these free-of-charge hardware resources are far from enough to solve demanding real-world
problems and are not scalable. The most significant limitation found is the lack of CPU cores. Finally, several
strengths and limitations of this cloud service are discussed, which might be useful for helping potential users.
INDEX TERMS Deep learning, Colab, convolutional neural networks, Google Colaboratory, GPU computing.
I. INTRODUCTION
Deep learning applications are present in many aspects of our daily lives, such as web search engines, social network recommendations, natural language recognition, and e-commerce suggestions [1]. This class of application usually relies on heavy computations on massive datasets. Therefore, parallel computing is traditionally used to run the training process in a feasible time. Graphics processing units (GPUs) are massively parallel devices and natural candidates for such a task. This kind of accelerator is ubiquitous, accessible, and delivers a high GFlops/Dollar rate [2]. Additionally, the main deep learning frameworks are programmed for NVIDIA GPUs [3].
Hardware resources involve risks [4]: under- and overutilization, depreciation of the hardware, and failures. There are also costs related to maintenance, energy, and human resources. In a research group, it may be difficult to keep a robust computer with several GPUs for tests. Furthermore, it is costly to provide each member of the team with a workstation equipped with a high-end GPU.
Nowadays, cloud solutions are attractive because they provide hardware on the fly and remove the need for maintaining and configuring hardware resources. Cloud platforms such as Amazon, Intel, Azure, and Google Cloud provide, in a pay-by-hour manner, GPUs and a runtime fully configured for deep learning. Also, NVIDIA offers standalone dockers with
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
VOLUME 6, 2018 61677
T. Carneiro et al.: Performance Analysis of Google Colaboratory as a Tool for Accelerating Deep Learning Applications
a pre-configured CUDA environment for deep learning that can be applied to several cloud platforms [5].
Under the scope presented above, Google has created Colaboratory (a.k.a. Colab), a cloud service for disseminating machine learning education and research [6]. The runtime provided by this cloud service is fully configured with the leading artificial intelligence (AI) libraries and also offers a robust GPU. This Google service is linked to a Google Drive account, and it is free of charge.
The primary objective of this paper is to study the feasibility of Colaboratory for accelerating deep learning applications. To the best of our knowledge, the present paper is the first work to analyze both the performance and the resources of Colaboratory, as well as the use of this cloud-based service as a tool for accelerating deep learning applications. To accomplish the primary objective, we performed preliminary experiments and implemented two deep learning applications for computer vision: object classification, and object localization and segmentation.
The main contributions of the present research work are the following. We show that Google Colaboratory can be effectively used to accelerate not only deep learning but also other classes of GPU-centric scientific applications. Using Colaboratory’s accelerated runtime for training a CNN can be faster than using 20 physical cores of a Linux server. Moreover, we provide a detailed analysis of this cloud service regarding hardware resources, performance, and possible applications. Finally, we outline several strengths and limitations of Google Colaboratory, which might be useful for helping potential users.
The remainder of this paper is organized as follows. Section II presents the background topics and related works. Section III brings a preliminary evaluation of Colaboratory’s hardware resources and performance. In turn, two computer vision use cases are explored in Section IV: object classification, and object localization and segmentation. An availability experiment is performed in Section V. Next, Section VI brings a discussion about the findings of Sections III–V. Finally, Section VII outlines the conclusion of the present research work.

II. BACKGROUND AND RELATED WORKS
This section presents background information and provides an overview of related contributions in the literature that investigate the viability of cloud services for processing high-performance computing applications. The remainder of this section is organized as follows. Section II-A briefly introduces the topic of Deep Learning for Computer Vision. In turn, the Google Colaboratory cloud platform is introduced in Section II-B. Finally, Section II-C brings the related works.

A. DEEP LEARNING FOR COMPUTER VISION
Nowadays, digital image processing is used in a variety of applications, whether for segmenting objects in images, extracting image information, or even classifying patterns [7]. Many computer vision applications rely on the power of deep learning methods, such as Convolutional Neural Networks (CNNs) [8].
CNNs are mainly applied for analyzing visual imagery, and they have proven to be crucial in many applications for recognition and decision making [1]. The additional convolutional layers help the network to learn filters that were hand-engineered in other, traditional algorithms. The goal of the added layers is to make the network more robust when dealing with transformations in the image. Thus, CNNs are also known as space invariant artificial neural networks (SIANN).
Two main problems are encountered when using CNNs: the need for specific hardware to achieve good performance, and the high power consumption of this hardware. For example, health applications need to recognize diseases and structures with a high level of accuracy to save lives. This must be done promptly to treat the disease [9]. The high accuracy of CNNs suits this problem, but high-performance hardware is required to achieve a life-saving response time. Therefore, GPUs are good candidates for such a task. Besides being massively parallel, this sort of device is also energy efficient [2], [3].

B. GOOGLE COLABORATORY
Before introducing Google Colaboratory, we introduce Jupyter Notebooks, the technology on which Colaboratory is based. Jupyter is an open-source, browser-based tool that integrates interpreted languages, libraries, and tools for visualization [10]. A Jupyter notebook can work either locally or on the cloud. Each document is composed of multiple cells, where each cell contains script language or markdown code, and the output is embedded in the document. Typical outputs include text, tables, charts, and graphics. Using this technology makes it easier to share and replicate scientific works, since the experiments and results are presented in a self-contained manner [11].
Google Colaboratory (a.k.a. Colab) is a project that has the objective of disseminating machine learning education and research [6]. Colaboratory notebooks are based on Jupyter and work as a Google Docs object: they can be shared, and users can collaborate on the same notebook. Colaboratory provides both Python 2 and Python 3 runtimes pre-configured with the essential machine learning and artificial intelligence libraries, such as TensorFlow, Matplotlib, and Keras. The virtual machine (VM) under the runtime is deactivated after a period of time, and all of the user’s data and configurations are lost. However, the notebook is preserved, and it is also possible to transfer files from the VM hard disk to the user’s Google Drive account. Finally, this Google service provides a GPU-accelerated runtime, also fully configured with the software previously outlined. The Google Colaboratory infrastructure is hosted on the Google Cloud platform.

C. RELATED WORKS
Works that study the viability of cloud services for processing high-performance computing (HPC) applications usually rely on the Amazon EC2 service [12]–[14]. The experimental pro-
tocol of the listed studies varies. This way, the results and conclusions of the related works are different, even though they use Amazon services. According to Juve et al. [12], an instance of Amazon EC2 can achieve a performance comparable to physical systems given similar resources. In contrast, Jackson et al. [13] claim that the underlying network infrastructure of the EC2 cloud platform severely limits performance. As a consequence, EC2 instances are much slower than a typical mid-range Linux cluster. In turn, Expósito et al. [14] analyze performance bottlenecks in HPC application scalability on the Amazon EC2 service.
In contrast to the works listed in the last paragraph, Iosup et al. [15] compare four different cloud services for processing scientific computing workloads: Amazon EC2, GoGrid, ElasticHosts, and Mosso. According to the results, the evaluated services need an order of magnitude in performance improvement to be useful to the scientific community.
The literature on Google Colaboratory mainly consists of online tutorials based on the official documentation [6]. Among these tutorials, the one by Fuat [16] stands out from the others. It gives information about the underlying hardware, presents tutorials on installing well-known artificial intelligence frameworks on Colaboratory, and also provides information about how to access a project on GitHub.
Differently from the work by Fuat, the present research is not a tutorial. Additionally, it is also not a comparison between cloud services. This work analyzes different aspects of Google Colaboratory, such as hardware resources, performance, limitations, and possible uses. In this sense, it is similar to Juve et al. (2009) and Jackson et al. (2010). Moreover, the present research studies the feasibility of Colaboratory for accelerating deep learning and other GPU-centric applications.

III. PRELIMINARY EXPERIMENTS ON GOOGLE COLABORATORY
There is no consensus in the related works on whether cloud services are useful for processing compute-intensive scientific applications. Additionally, the related work about Colaboratory brings no performance analysis. This preliminary set of experiments is focused on better understanding Google Colaboratory’s hardware resources and finding out what kind of compute-intensive applications this cloud service can effectively accelerate. Furthermore, this section aims at answering the following research question: Given similar resources, is it possible to run a compute-intensive scientific application on Google Colaboratory and achieve a performance equivalent to that of dedicated hardware?

A. METHODOLOGY
In this analysis, a GPU-accelerated backtracking for enumerating all feasible solutions of the N-Queens problem is used as a test case. Backtracking is a problem-solving paradigm that evaluates a solution space in depth-first order. It is present in several research areas, such as artificial intelligence and operations research, and it is considered an essential class of high-performance computing (HPC) application [17]. Moreover, this class of parallel algorithm suits this evaluation well: the GPU portion of the application processes almost 100% of the solution space.
GPU-accelerated backtracking usually consists of two main parts: an initial backtracking on the CPU, and the search on the GPU [18]. The N-Queens problem, the problem of placing N non-attacking queens on an N × N board, is a classic benchmark for GPU-based backtracking. For the preliminary analysis, the backtracking solves the N-Queens problem for board sizes (N) ranging from 10 to 18. The solution space ranges from a few thousand to several billion nodes. Serial, multithreaded, and GPU-accelerated versions of the backtracking were implemented.
Two metrics are collected in each experiment: the execution time and the rate of node evaluations/second. The execution time is used for calculating the speedup metric, which expresses the benefit of using parallel programming for solving the N-Queens problem. In turn, the node evaluations/second rate is the performance of the hardware in terms of board configurations evaluated per second. The CPU baseline used for calculating the speedup is the serial backtracking executed on one CPU core/thread of the Colab testbed, further referred to as $T_{serial}$. The speedup metric is calculated as follows [17]:

$$speedup = \frac{T_{serial}}{T_{parallel}} \qquad (1)$$

In the scope of this performance analysis, $T_{parallel}$ means the execution time of the multithreaded or the GPU-based backtracking.

B. PARAMETERS SETTINGS
All CUDA programs were parallelized using CUDA C 9.0 and compiled with NVCC 9.0 and GCC 5.4. All multithreaded versions were parallelized using OpenMP. The kernel execution time was measured through the cudaEventRecord function of CUDA, whereas the overall application time was measured through the clock function of C.
Three testbeds were used in the present evaluation: a Linux server, a mainstream workstation, and the hardware configuration under Colaboratory’s accelerated runtime, further referred to as Colab. All testbeds are summarized in Table 1 and detailed as follows.
• Server: Operates under CentOS 7.1 64 bits and is composed of two Intel Xeon E5-2650v3 @2.30 GHz with 20 cores, 40 threads, and 32 GB RAM. It is equipped with an NVIDIA Tesla K40m (GK110B chipset), 12 GB RAM, 2880 CUDA cores @745 MHz.
• Mainstream: Operates under Ubuntu 16.04 LTS 64 bits and is composed of an Intel Core i7 3770, with 4 cores @3.4 GHz, 8 threads, and 8 GB RAM. It is equipped with an NVIDIA GeForce 1050TI (GP107 chipset), 4 GB RAM, 768 CUDA cores @1290 MHz.
• Colab: Operates under Ubuntu 17.10 64 bits and is composed of an Intel Xeon processor (not specified) with two cores @2.3 GHz and 13 GB RAM. It is
TABLE 1. Specification of the three testbeds used in the preliminary performance evaluation: Server, Mainstream, and the hardware configuration under Colaboratory’s accelerated runtime.
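Returning to the test case of Section III-A, the following is a minimal serial sketch of backtracking for the N-Queens problem in Python. It is not the authors' CUDA C or OpenMP code: the function names (`count_n_queens`, `speedup`), the bitmask formulation, and the definition of a "node" as one feasible queen placement are assumptions made here for illustration. It enumerates all solutions in depth-first order, counts the evaluated nodes (from which a nodes/second rate can be derived), and evaluates Eq. (1).

```python
import time


def count_n_queens(n):
    """Enumerate all N-Queens solutions by depth-first backtracking.

    Returns (solutions, nodes). `nodes` counts every feasible queen
    placement visited, an assumed definition of a node evaluation.
    """
    full = (1 << n) - 1  # bitmask with one bit per board column
    solutions = 0
    nodes = 0

    def search(cols, diag1, diag2):
        nonlocal solutions, nodes
        if cols == full:  # every row already holds one queen
            solutions += 1
            return
        # Columns of the current row not attacked by any placed queen.
        free = full & ~(cols | diag1 | diag2)
        while free:
            bit = free & -free  # lowest free column
            free ^= bit
            nodes += 1
            # Shift the diagonal masks to describe the next row.
            search(cols | bit, ((diag1 | bit) << 1) & full, (diag2 | bit) >> 1)

    search(0, 0, 0)
    return solutions, nodes


def speedup(t_serial, t_parallel):
    """Eq. (1): ratio between serial and parallel execution times."""
    return t_serial / t_parallel


if __name__ == "__main__":
    n = 8  # small board for a quick run; the paper uses N from 10 to 18
    start = time.perf_counter()
    solutions, nodes = count_n_queens(n)
    elapsed = time.perf_counter() - start
    print(f"N={n}: {solutions} solutions, {nodes} nodes evaluated, "
          f"{nodes / elapsed:.3e} nodes/second")
```

A pure-Python loop is far slower than the compiled versions evaluated in the paper, so this sketch only illustrates how the two metrics relate to the search; it says nothing about the reported rates.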
FIGURE 1. The average processing rate reached by: (a) one core and one thread and (b) the GPU of Server, Colab, and Mainstream testbeds enumerating all feasible solutions of the N-Queens problem. The plots show testbed configuration vs. average processing rate (in 10^6 nodes/second).
equipped with an NVIDIA Tesla K80 (GK210 chipset), 12 GB RAM, 2496 CUDA cores @560 MHz.

C. RESULTS
Figure 1 shows the average node evaluations/second rate observed for one core/one thread of each testbed, as well as for the GPU of each testbed. The results are in accordance with Table 1: the highest node evaluations/second rate is observed for the most powerful GPU (K40, Server). In turn, the free-of-charge GPU of Colab outperforms the Mainstream device by 6% and reaches 93% of the Tesla K40’s performance. The single-core performance is also in agreement with Table 1: the highest performance is observed for the Mainstream workstation (6.41 × 10^6 nodes/second), which has the CPU with the highest clock. In turn, Colab’s serial performance is slightly superior to the serial performance of the Linux server.
Figure 2 presents, for each testbed, the average speedup reached by the multithreaded and the GPU implementations compared to the serial baseline (Figure 1(a)). The results for the GPU-based implementation are also in accordance with Table 1: the Server testbed, which is equipped with a Tesla K40 GPU, reaches an average speedup of 19.11×. In turn, average speedups of 18.31× and 17× are observed for the Colab and Mainstream testbeds, respectively. Moreover, it can be observed in Figure 2 that Colab provides only one CPU core that supports two threads: no speedup is observed for the multicore implementation running on Colab. In turn, average speedups of 13.9× and 4.27× are observed for the multicore implementation running on Server and Mainstream, respectively.
The results of this section are in line with Juve et al. [12] and give an affirmative answer to the question posed at the beginning of the present section: it is possible to run a compute-intensive scientific application on Google Colaboratory and achieve a performance equivalent to that of dedicated hardware, given similar resources. In this case, the resources are a GPU and a single CPU core/thread. Furthermore, the results show that Google Colaboratory can be effectively used to accelerate GPU-centric applications.
The most significant limitation found concerning the hardware resources is the lack of CPU cores: only one CPU core that supports two threads is provided. In contrast, ordinary workstations are nowadays equipped with multicore CPUs. Therefore, it is not worthwhile, in terms of performance, to use Colaboratory for running CPU-based multithreaded applications. However, the lack of CPU cores is not a concern in the scope of deep learning applications. As stated in Section I, the essential deep learning frameworks support NVIDIA devices.

IV. USING COLABORATORY FOR ACCELERATING DEEP LEARNING APPLICATIONS
This section analyzes Google Colaboratory as a tool for accelerating modern and complex deep learning applications. Two computer vision use cases are explored in this evaluation: object detection/classification and object localization/segmentation. This section is organized as follows. Initially, each use case is introduced. Then, the implementation of each method is detailed following the organization of the provided notebook. Next, the methodology of evaluation is presented. Finally, the results are analyzed.

A. OBJECT CLASSIFICATION APPLICATION
In Computer Vision, object classification or detection is the task of assigning an image one label from a set of predefined
FIGURE 2. The average speedup reached by: (a) all cores/threads and (b) the GPU of Server, Colab, and Mainstream testbeds compared to one core/one thread of Colab’s CPU. The plots show testbed configuration vs. average speedup.
classes. This task is one of the core problems in Computer Vision and has several practical applications, ranging from lung nodule malignancy classification [19] to the localization of mobile robots [20].

1) IMPLEMENTATION
We implemented the object classification application using the browser-based notebook provided by Colaboratory.¹ The explored object detection implementation is composed of two stages. In the first one, the target dataset is loaded into memory and then pre-processed. In this preprocessing phase, the images are normalized between 0 and 1, and their labels are converted to the one-hot vector format, which is a binary vector where only one element is equal to 1.
In the second stage, a Convolutional Neural Network (CNN) is built, compiled, and trained. This network can be outlined as a sequence of two convolution layers, a MaxPooling layer, a Dropout layer, a flatten layer, and two dense layers. In the convolution layers, 32 and 64 (3 × 3) filters are applied using the ReLU activation function. In the Dropout layer, each input has a 0.5 probability of being set to zero, which reduces the overfitting of the network. In the MaxPooling layer, the subsampling process is performed by returning the maximum value of a 2 × 2 moving window, using stride 2. In the flattening layer, the MaxPooling layer output is reshaped into a one-dimensional vector. The first dense layer has 128 neurons using the ReLU activation function. The second dense layer has 10 neurons using the Softmax activation function.
Further, this CNN is trained using a mini-batch size of 128, a categorical cross-entropy loss, and the ADADELTA optimizer for 12 epochs. Moreover, the images used to train and test the CNN are from the MNIST dataset [8]. This dataset is composed of 70,000 labeled 28 × 28 pixel grayscale images of handwritten digits. More specifically, 10,000 of these images were used as the testing set and the others as the training set.

B. OBJECT LOCALIZATION AND SEGMENTATION APPLICATION
With the advent of facial detection, autonomous vehicles, computer-aided diagnosis, and many other systems, the demand for faster and more accurate object detection methods has significantly increased. To accomplish such challenging tasks, these methods have not only to classify every object in a given image, but also to localize, or even segment, them, which substantially raises the task complexity. Fortunately, the most successful approaches to solve these tasks, such as Mask Region-based Convolutional Neural Networks (Mask R-CNN) [21], are available on the Internet.

1) IMPLEMENTATION
The implementation of object localization and segmentation uses Python 3, Keras, and the TensorFlow implementation of Mask R-CNN.² This implementation consists of six steps. First of all, it is required to download and compile the Mask R-CNN implementation, as well as to download and compile the MS-COCO dataset Python API. Next, we change the Mask R-CNN default execution mode configurations to perform only one inference at a time. With the previously installed MS-COCO Python API, we download the Mask R-CNN weights pre-trained on the MS-COCO dataset. Then, we build the Mask R-CNN model and load the pre-trained weights into it. Finally, we load the target image into memory and then feed it to the Mask R-CNN model.

C. METHODOLOGY
The objective of the present computational evaluation is to verify whether it is advantageous to use Colaboratory for processing modern deep learning applications rather than dedicated hardware. The testbeds introduced in Section III are also used: Colab, Server, and Mainstream. For this evaluation, only the multithreaded and GPU versions of the deep learning applications previously detailed are considered (Sections IV-A and IV-B).
To compare the testbeds, we extend the speedup to define two metrics: $speedup^{t}_{gpu}$ and $speedup^{t}_{s}$. The first one means
¹ The description of the implementation follows the cells of the notebook, which is available at: https://goo.gl/4r6pZ6.
² The description of the implementation follows the cells of the notebook, which is available at: https://goo.gl/CvD6HQ.
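Returning to the CNN outlined in Section IV-A, its size can be checked with simple shape arithmetic. The sketch below is a worked example under stated assumptions, not the authors' notebook: it assumes 28 × 28 × 1 MNIST inputs, stride-1 convolutions without padding, and a bias term in every trainable layer, none of which the text states explicitly. All helper names are hypothetical. It tracks the tensor shape through the layers and counts the trainable parameters.

```python
def conv2d(shape, filters, kernel=3):
    """Unpadded, stride-1 convolution (assumed): output shape and parameters."""
    h, w, c = shape
    out = (h - kernel + 1, w - kernel + 1, filters)
    params = kernel * kernel * c * filters + filters  # weights + biases
    return out, params


def max_pool(shape, window=2, stride=2):
    """Max pooling only subsamples; it has no trainable parameters."""
    h, w, c = shape
    out = ((h - window) // stride + 1, (w - window) // stride + 1, c)
    return out, 0


def dense(units_in, units_out):
    """Fully connected layer: output size and parameters (weights + biases)."""
    return units_out, units_in * units_out + units_out


def describe_cnn(input_shape=(28, 28, 1)):
    """Return (list of (layer, output shape, parameters), total parameters)."""
    layers = []
    shape, p = conv2d(input_shape, 32)
    layers.append(("conv 3x3, 32 filters, ReLU", shape, p))
    shape, p = conv2d(shape, 64)
    layers.append(("conv 3x3, 64 filters, ReLU", shape, p))
    shape, p = max_pool(shape)
    layers.append(("max pooling 2x2, stride 2", shape, p))
    layers.append(("dropout 0.5", shape, 0))  # shape unchanged, no parameters
    flat = shape[0] * shape[1] * shape[2]
    layers.append(("flatten", (flat,), 0))
    units, p = dense(flat, 128)
    layers.append(("dense 128, ReLU", (units,), p))
    units, p = dense(units, 10)
    layers.append(("dense 10, softmax", (units,), p))
    total = sum(params for _, _, params in layers)
    return layers, total


if __name__ == "__main__":
    layers, total = describe_cnn()
    for name, shape, params in layers:
        print(f"{name:28s} {str(shape):16s} {params:>9d}")
    print(f"total trainable parameters: {total}")
```

Under these assumptions almost all parameters sit in the first dense layer, which is fed by the 9216-element flattened vector.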
TABLE 2. Average time (in seconds) required by the multithreaded and GPU versions of the object classification application for training the CNN on the Colab and Mainstream testbeds. In angled brackets <(a), (b)>, the average speedup observed for the GPU-accelerated implementation is shown compared to: (a) the multithreaded implementation executed on the same testbed ($speedup^{t}_{gpu}$); (b) the multithreaded implementation executed on 20 cores of the Server testbed ($speedup^{t}_{s}$).
TABLE 3. Average time (in seconds) required by the multithreaded and GPU versions of the object localization and segmentation application for processing the three images on the Colab and Mainstream testbeds. In angled brackets <(a), (b)>, the average speedup observed for the GPU-accelerated implementation is shown compared to: (a) the multithreaded implementation executed on the same testbed ($speedup^{t}_{gpu}$); (b) the multithreaded implementation executed on 20 cores of the Server testbed ($speedup^{t}_{s}$).
how much faster it is on average to use the GPU of testbed t instead of all its CPU cores, and it is defined as follows:

$$speedup^{t}_{gpu} = \frac{T^{t}_{multicore}}{T^{t}_{gpu}} \qquad (2)$$

where $T^{t}_{multicore}$ is the average execution time of the multithreaded version on testbed t using all its CPU cores, and $T^{t}_{gpu}$ is the same, but executing the application on the GPU of testbed t. In turn, $speedup^{t}_{s}$ is how much faster it is on average to use the GPU of testbed t than all 20 cores of the Linux server testbed. This metric is defined as follows:

$$speedup^{t}_{s} = \frac{T^{server}_{multicore}}{T^{t}_{gpu}} \qquad (3)$$

where $T^{server}_{multicore}$ is the average execution time of the multithreaded version executed on all 20 physical cores of the Server testbed.
Concerning the first implementation, the average execution time is the one necessary for training the CNN architecture (Section IV-A). The training procedure is repeated 30 times, and the average is considered for calculating the metrics given above.
The second implementation performs object localization and segmentation on three arbitrary images from the Internet: Highway, Savanna, and Team. In this second application, the evaluation metric is based on the execution time of the R-CNN model introduced in Section IV-B. More specifically, for each image, the object localization and segmentation is repeated 30 times. Then the average execution time is computed, generating three final results.
It is worth noting that the application execution time was measured using Python’s time module.

D. RESULTS
Concerning the object classification, Table 2 shows the average time required by both Colab and Mainstream to train the CNN. This table also brings the average speedup observed for the GPU-accelerated implementation compared to its multithreaded counterpart running on the same testbed ($speedup^{t}_{gpu}$) and compared to its multithreaded counterpart executed on all cores of the Server testbed ($speedup^{t}_{s}$). First of all, it is important to point out that the multithreaded implementation of TensorFlow does not exploit vectorization instructions such as AVX2 and FMA, which is detrimental to the performance of the Server testbed. Using Colaboratory’s accelerated runtime to train the CNN is on average 2.93× faster than using all physical cores of the Linux server. In turn, the Mainstream workstation is faster than Colab at training the CNN on both hardware configurations: it is on average 3.6× faster than the Linux server.
Table 3 shows the time required by Colab and Mainstream to perform the object localization and segmentation, as well as the speedup reached by the GPU versions compared to their multithreaded counterparts. The results for the object localization and segmentation follow the ones of the previous experiment: it is more advantageous to perform object localization and segmentation on Colaboratory’s accelerated runtime than on 20 physical cores of the Server testbed. The speedups observed for Colab compared to the Server testbed range from 1.3× to 2.5× ($speedup^{t}_{s}$). Moreover, Mainstream’s GPU is also slightly superior to Colab’s for performing object localization and segmentation: its speedups compared to the Server testbed range from 1.2× to 3.0×.
According to both sets of experiments, it is worth investing in GPUs for accelerating deep learning applications. A mid-end GPU, such as the NVIDIA GeForce 1050TI, can be more than 20× faster than a quad-core CPU, and more than 3× faster than two robust CPUs (refer to Tables 2 and 3). Therefore, we conclude that Colaboratory is useful for accelerating complex deep learning applications, provided that the underlying deep learning software supports NVIDIA GPUs. On the one hand, it is worth using this free-of-charge Google service, especially when the user has no access to at least a mid-end GPU. In this situation, a substantial benefit is observed when exploiting the accelerated runtime. On the other hand, if the underlying deep learning software is not GPU-ready, it is not worth using Colaboratory for performance, due to its lack of CPU cores (as observed in Section III).
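The two metrics of Section IV-C (Eqs. 2 and 3) are ratios of average execution times over repeated runs. A minimal Python sketch of that computation follows; the function name `average_speedup` is hypothetical, and the timings are placeholder values chosen for illustration, not the measurements of Tables 2 and 3.

```python
from statistics import mean


def average_speedup(reference_times, gpu_times):
    """Ratio of the average reference time to the average GPU time.

    With multicore times from the same testbed this is Eq. (2);
    with multicore times from the Server testbed it is Eq. (3).
    """
    return mean(reference_times) / mean(gpu_times)


if __name__ == "__main__":
    # Hypothetical placeholder timings in seconds (NOT values from the paper).
    t_multicore_colab = [400.0, 410.0, 390.0]   # all CPU cores of testbed t
    t_gpu_colab = [20.0, 21.0, 19.0]            # GPU of testbed t
    t_multicore_server = [60.0, 62.0, 58.0]     # 20 cores of the Server

    speedup_gpu = average_speedup(t_multicore_colab, t_gpu_colab)
    speedup_s = average_speedup(t_multicore_server, t_gpu_colab)
    print(f"speedup_gpu = {speedup_gpu:.2f}x, speedup_s = {speedup_s:.2f}x")
```

Both metrics share the same denominator, so they differ only in which multithreaded baseline is chosen. This is why a testbed with few CPU cores can show a large value for Eq. (2) and a much smaller one for Eq. (3).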
Even in situations where the underlying software is not GPU-ready, Colaboratory may be useful. It offers a runtime configured with several frameworks and libraries, e.g., CUDA, Keras, TensorFlow, and OpenCV. Relieving the programmer of the responsibility of setting up software may increase productivity. Concerning the notebooks, the output given is also a document, which makes sharing research results straightforward.

V. AVAILABILITY EXPERIMENTS
The training process of a deep learning application may take a long time. According to Colaboratory’s documentation [6], long GPU utilization can be mistaken for cryptocurrency mining. As a consequence, the user may be forbidden to access the accelerated runtime. The official documentation gives no information concerning how long it is possible to use the GPU resources, and the tutorial by Fuat [16] claims that the time limit is 12 hours.
This experiment aims at answering the following research questions:
1) Is it possible to use the accelerated runtime provided by Google Colaboratory indefinitely without penalties?
2) In case the user loses access to the accelerated runtime, how much time does it take to get access to the accelerated runtime again?

1) METHODOLOGY
To verify whether it is possible to use the accelerated runtime uninterruptedly, a CUDA-C program runs in the background until the user gets disconnected. To find out whether the cloud forbids the user to access the accelerated runtime, after disconnection, the user reconnects, and the program is launched once more.
The CUDA-C program presented in Listing 1 is the one used in this investigation. This code exploits the fact that kernel launches on the GPU are asynchronous with respect to the host [22, p. 32]. The kernel executed by the GPU performs an infinite loop on the device (lines 1–5). The CPU portion of the code (lines 6–13) launches the kernel loop on the GPU (line 7). After launching the kernel, the code on the CPU prints the date provided by the system every 60 seconds
Listing 1. CUDA-C program used in the availability experiment.
… regular runtime while the user is banned from the accelerated one. The experiment was run a third and a fourth time after the ban of 5 hours, and the results were the same.

VI. DISCUSSION AND MAIN INSIGHTS
This section brings a discussion on the use of Colaboratory for accelerating deep learning applications. The content of this section is based on the results of Sections III and IV and on our experiences using Colaboratory. This discussion is given concerning performance, possible applications, and limitations. Finally, the main insights from the present research work are also outlined.

A. PERFORMANCE
Our findings show that Colaboratory’s accelerated runtime is adequate not only for accelerating deep learning but also for processing other GPU-centric applications. Using Colaboratory’s accelerated runtime for processing the deep learning applications and the GPU-centric combinatorial search is faster than using 20 physical CPU cores. Therefore, it is more worthwhile to use the cloud service in question than, for instance, a robust server with no GPU, a laptop, or a workstation that needs to be configured and has a mid-end GPU. On the one hand, the free-of-charge hardware resources provided by Colaboratory are far from enough to solve demanding real-world problems and are not scalable. On the other hand, the main deep learning frameworks are programmed for NVIDIA GPUs, and the performance of the
(line 8–11). The result is outputted on browser-based the GPU provided by Colaboratory may be enough for several
notebook. This way, it is possible to retrieve the results, even profiles of researchers and students.
in case the VM is lost and restarted. In situations where better hardware resources than the ones
provided by Colaboratory are available, it is possible to install
2) RESULTS Jupyter locally and run a Colab notebook. It is also possible
According to the results, the program of Listing 1 can be to execute the content of each cell individually, without using
executed for 12 hours in the first time. Then, the user was Jupyter. In this case, it is required to configure the whole soft-
disconnected, and the VM restarted. After reconnecting, ware stack: GPU drivers, CUDA toolkit, artificial intelligence
the program was uploaded and run once more. In the sec- libraries, programming languages, and so on. It is worth to
ond time, it was observed that the user was disconnected point out another aspect of Colab: the Internet infrastruc-
after 3 hours of GPU utilization. Then, it was not possible ture is fast, especially when accessing resources from other
to connect to the accelerated runtime for 5 hours, and the Google services, such as Drive and Storage. Thus, it may
platform returned the following message: "No backend with be interesting for some users manipulating datasets using
GPU available". Additionally, it is possible to connect to the Colaboratory rather than a residential Internet connection.
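The availability program described in Section V (an infinite-loop kernel plus a host loop that prints the date every 60 seconds) can be sketched as follows. This is a minimal sketch written from that description, not the authors' Listing 1: the names `busy_loop` and `d_stop`, the error checks, and the volatile stop flag are assumptions of this sketch (the flag is never set, and is there only so that the compiler cannot discard an empty loop). The line numbers quoted in Section V refer to the original listing, not to this sketch.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>        // sleep()
#include <cuda_runtime.h>

// Device side: spin until *stop becomes non-zero. The host never sets it,
// so the loop is effectively infinite. The volatile read is an assumption
// of this sketch, added so the compiler cannot discard an empty loop.
__global__ void busy_loop(volatile int *stop) {
    while (*stop == 0) {
    }
}

int main() {
    int *d_stop = nullptr;
    if (cudaMalloc((void **)&d_stop, sizeof(int)) != cudaSuccess ||
        cudaMemset(d_stop, 0, sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "CUDA setup failed\n");
        return EXIT_FAILURE;
    }

    // Kernel launches are asynchronous with respect to the host,
    // so control returns here while the GPU keeps spinning.
    busy_loop<<<1, 1>>>(d_stop);
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Host side: print the system date every 60 seconds until the
    // process is killed or the runtime is disconnected.
    for (;;) {
        time_t now = time(nullptr);
        char stamp[64];
        strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", localtime(&now));
        printf("%s\n", stamp);
        fflush(stdout);
        sleep(60);
    }
}
```

Assuming the file is saved as `probe.cu` (a hypothetical name), it could be built with `nvcc probe.cu -o probe`. The last timestamp printed before the disconnection gives the length of the GPU session.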
B. OTHER POSSIBLE APPLICATIONS
Besides the use for accelerating deep learning and other GPU-centric applications, Colaboratory may be interesting for teaching HPC. The accelerated runtime is fully operational, which spares inexperienced users from configuring the CUDA Toolkit, compilers, and GPU drivers. Moreover, a teacher can share notebooks containing lessons and code ready for execution. The free-of-charge GPU is superior to several low-end and mid-end gamer GPUs, which democratizes access to such an expensive device.

In a scenario of a research group with a low budget, Colaboratory can also be useful. The accelerated runtime can be exploited for accelerating deep learning and other applications, sparing the members from software configuration and hardware maintenance. Moreover, it is worth using the free-of-charge resources of Colaboratory to decide whether to buy a dedicated computer or contract cloud-based services for deploying applications. Finally, Jupyter notebooks are a straightforward way of collaborating between team members.

C. LIMITATIONS
Besides the positive aspects previously discussed, this cloud service also presents limitations that are worth discussing. First of all, it is difficult to program directly on a Colaboratory notebook. Programs written in compiled languages must be compiled on the user's computer, and then uploaded for execution. Therefore, in case the user has no GPU, it is not straightforward to use the accelerated runtime for testing and validating a GPU-based application. Moreover, there are limits on both virtual machine lifetime and GPU utilization. The VM and all files are lost after 12 hours, and the user needs to reconfigure the runtime from scratch. Thanks to the notebooks saved in Google Drive, this reconfiguration is straightforward but may take a while.

There may be situations where it is required to use Google Drive as an interface between Colaboratory and the user's computer. For instance, several deep learning applications receive huge datasets as input. In such a situation, the user must learn the API used by Drive to transfer files from Drive to the VM hard disk. The main limitation concerning this fact is the limit on transfers between Colaboratory and Drive. One way of coping with such a restriction is creating scripts to compress the dataset. Finally, there are no contracts or guarantees, which means that the hardware resources may change over time, Google may end the Colaboratory project, and so on. Thus, users may lose computing resources or even the means of visualizing saved notebooks.

D. MAIN INSIGHTS
The following summarizes the main insights from our study on Colaboratory as a tool for accelerating deep learning applications:
• Colaboratory can be effectively used to accelerate not only deep learning applications but also other GPU-centric applications.
• The hardware resources provided by Colaboratory are far from enough to solve demanding real-world problems. However, these resources may be enough for several profiles of researchers.
• Colab's performance is equivalent to dedicated hardware, given similar resources.
• The user needs to learn Google API principles to fully exploit Colaboratory features.
• Jupyter Notebooks are a straightforward tool for sharing knowledge.

VII. CONCLUSION AND FUTURE WORKS
This work presented a study about the feasibility of Google Colaboratory for accelerating deep learning for computer vision applications. Results show that Colaboratory hardware resources can reach a performance similar to dedicated hardware. Results also show that it is worth running experiments on Colaboratory in case the research group has no GPU more robust than a K80. Moreover, it is possible to accelerate GPU-centric applications other than deep learning related ones, with no need for CUDA runtime configuration. Besides the performance, this cloud service is interesting because it is straightforward to share notebooks with code and outputs.

The present research also found limitations of Colaboratory regarding deep learning and HPC. It is worth noting the lifetime of a VM, the time limit for GPU utilization, the need to transfer data to/from Google Drive or Git, the limit of data transfer between Drive and Colab, and the lack of CPU cores. Furthermore, the hardware provided by Colaboratory is not scalable and is far from sufficient for solving bigger problems.

As a future research direction, we propose an application that breaks the problem in such a way that it could be executed on more than one Colaboratory instance. This way, it would be possible to use more than one GPU without paying for an expensive service. Furthermore, Jupyter notebooks can be executed locally or on other cloud services. We also propose as future research investigating the use of notebooks on other platforms, such as local machines and cloud services. Finally, future investigations of Google Colaboratory as a tool for teaching HPC are also considered, as suggested in the previous section.

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, and O. O. Storaasli, "State-of-the-art in heterogeneous computing," Sci. Program., vol. 18, no. 1, pp. 1–33, 2010.
[3] NVIDIA Corporation, "TESLA V100 performance guide: Deep learning and HPC applications," White Paper, 2016. [Online]. Available: https://images.nvidia.com/content/pdf/v100-application-performance-guide.pdf
[4] M. Armbrust et al., "Above the clouds: A Berkeley view of cloud computing," Dept. EECS, Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2009-28, 2009.
[5] NVIDIA Corporation, "Introduction to NVIDIA GPU cloud," Appl. Note DA-08792-001, 2018.
[6] Google. (2018). Colaboratory: Frequently Asked Questions. Accessed: Jun. 21, 2018. [Online]. Available: https://research.google.com/colaboratory/faq.html
[7] R. Girshick et al., "Deep learning for computer vision," Comput. Vis. Image Understand., vol. 164, pp. 1–2, 2017. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S1077314217301972, doi: 10.1016/j.cviu.2017.11.006.
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[9] P. P. R. Filho, E. D. S. Rebouças, L. B. Marinho, R. M. Sarmento, J. M. R. Tavares, and V. H. C. de Albuquerque, "Analysis of human tissue densities: A new approach to extract features from medical images," Pattern Recognit. Lett., vol. 94, pp. 211–218, Jul. 2017.
[10] F. Pérez and B. E. Granger, "IPython: A system for interactive scientific computing," Comput. Sci. Eng., vol. 9, no. 3, pp. 21–29, May/Jun. 2007.
[11] B. M. Randles, I. V. Pasquetto, M. S. Golshan, and C. L. Borgman, "Using the Jupyter notebook as a tool for open science: An empirical study," in Proc. ACM/IEEE Joint Conf. Digit. Libraries (JCDL), Jun. 2017, pp. 1–2.
[12] G. Juve et al., "Scientific workflow applications on Amazon EC2," in Proc. 5th IEEE Int. Conf. E-Sci. Workshops, Dec. 2009, pp. 59–66.
[13] K. R. Jackson et al., "Performance analysis of high performance computing applications on the Amazon Web services cloud," in Proc. IEEE 2nd Int. Conf. Cloud Comput. Technol. Sci. (CloudCom), Nov./Dec. 2010, pp. 159–168.
[14] R. R. Expósito, G. L. Taboada, S. Ramos, J. Touriño, and R. Doallo, "Performance analysis of HPC applications in the cloud," Future Gener. Comput. Syst., vol. 29, no. 1, pp. 218–229, 2013.
[15] A. Iosup, S. Ostermann, M. N. Yigitbasi, R. Prodan, T. Fahringer, and D. H. Epema, "Performance analysis of cloud computing services for many-tasks scientific computing," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 6, pp. 931–945, Jun. 2011.
[16] Fuat. (2018). Google Colab Free GPU Tutorial. Accessed: Jun. 25, 2018. [Online]. Available: https://medium.com/deep-learning-turkey/google-colab-free-gpu-tutorial-e113627b9f5d
[17] K. Asanović et al., "The landscape of parallel computing research: A view from Berkeley," EECS Dept., Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2006-183, 2006.
[18] T. Carneiro Pessoa, J. Gmys, F. H. de Carvalho, Jr., N. Melab, and D. Tuyttens, "GPU-accelerated backtracking using CUDA dynamic parallelism," Concurrency Comput., Pract. Exper., vol. 30, no. 9, p. e4374, 2017, doi: 10.1002/cpe.4374.
[19] G. Litjens et al., "A survey on deep learning in medical image analysis," Med. Image Anal., vol. 42, pp. 60–88, Dec. 2017.
[20] Z. Chen et al., "Deep learning features at scale for visual place recognition," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May/Jun. 2017, pp. 3223–3230.
[21] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2980–2988.
[22] CUDA C Programming Guide (Version 9.1), NVIDIA, Santa Clara, CA, USA, 2018.

TIAGO CARNEIRO was born in Fortaleza, Brazil, in 1986. He received the B.Sc. and M.Sc. degrees from the State University of Ceará, Brazil, in 2009 and 2012, respectively, and the Ph.D. degree from the Federal University of Ceará, Brazil, in 2017, all in computer science. He is currently a Post-Doctoral Fellow at the Instituto Federal de Educação, Ciência e Tecnologia do Ceará, Brazil. His research interests include the use of heterogeneous architectures for solving combinatorial optimization problems.

RAUL VICTOR MEDEIROS DA NÓBREGA received the bachelor's and master's degrees in computer science from the Instituto Federal de Educação, Ciência e Tecnologia do Ceará, Brazil, in 2016 and 2018, respectively. He is currently a member of the Laboratory of Image Processing, and Computational Simulation (LAPISCO), Instituto Federal de Educação, Ciência e Tecnologia do Ceará. His research interests include the classification of pulmonary nodules through techniques of digital image processing and artificial intelligence.

THIAGO NEPOMUCENO was born in Fortaleza, Brazil, in 1990. He received the B.Sc. and M.Sc. degrees in computer science from the State University of Ceará, where he developed research on optimization and meta-heuristics, especially genetic algorithms. He is currently pursuing the Ph.D. degree with the University of Erlangen–Nuremberg. He is also a full-time Employee at the Fraunhofer-Arbeitsgruppe für Supply Chain Services SCS, where he researches in the field of Internet of Things.

GUI-BIN BIAN (M'15) received the bachelor's degree in mechanical engineering from the North China University of Technology, Beijing, China, in 2004, and the master's and Ph.D. degrees in mechanical engineering from the Beijing Institute of Technology, Beijing, in 2007 and 2010, respectively. He is currently an Associate Professor with the State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, where his research interests include design, sensing, and control for medical robotics.

VICTOR HUGO C. DE ALBUQUERQUE received the degree in mechatronics technology from the Federal Center of Technological Education of Ceará in 2006, the M.Sc. degree in teleinformatics engineering from the Federal University of Ceará in 2007, and the Ph.D. degree in mechanical engineering with emphasis on materials from the Federal University of Paraíba in 2010. He is currently an Assistant VI Professor with the Graduate Program in Applied Informatics, Universidade de Fortaleza. He has experience in computer systems, mainly in the research fields of applied computing, intelligent systems, and visualization and interaction, with specific interest in pattern recognition, artificial intelligence, image processing and analysis, and automation with respect to biological signal/image processing, image segmentation, biomedical circuits, and human/brain–machine interaction, including augmented and virtual reality simulation modeling for animals and humans. In addition, he has researched in the microstructural characterization field through the combination of non-destructive techniques with signal/image processing and analysis and pattern recognition.

PEDRO PEDROSA REBOUÇAS FILHO received the degree in mechatronics engineering from the Federal Institute of Ceará, Brazil, in 2008, and the M.Sc. degree in teleinformatics engineering, in the field of biomedical engineering, and the Ph.D. degree in teleinformatics engineering from the Federal University of Ceará, in 2010 and 2013, respectively. He has been an Assistant Professor with the Instituto Federal de Educação, Ciência e Tecnologia do Ceará, since 2008. From 2015 to 2016, he was a Post-Doctoral Researcher at the University of Porto, Portugal. He has co-authored over 100 articles in national and international journals and conferences. Also, he has been involved in several research projects in the field of computer vision, medical image, and embedded systems.