Commit 167cdac

TensorBoard TensorFlow example notebook (aws#1267) (aws#1269)
Co-authored-by: Yegor Tokmakov <yegor@tokmakov.biz>
1 parent a26b1e6 commit 167cdac

1 file changed: +25 -70 lines changed

aws_sagemaker_studio/frameworks/keras_pipe_mode_horovod/keras_pipe_mode_horovod_cifar10.ipynb

Lines changed: 25 additions & 70 deletions
@@ -6,21 +6,28 @@
 "source": [
 "# Train and Host a Keras Model with Pipe Mode and Horovod on Amazon SageMaker\n",
 "\n",
-"*(This notebook was tested with the \"Python 3 (TensorFlow CPU Optimized)\" kernel.)*\n",
-"\n",
 "Amazon SageMaker is a fully-managed service that provides developers and data scientists with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon SageMaker removes the heavy lifting from each step of the machine learning process to make it easier to develop high-quality models. The SageMaker Python SDK makes it easy to train and deploy models in Amazon SageMaker with several different machine learning and deep learning frameworks, including TensorFlow and Keras.\n",
 "\n",
 "In this notebook, we train and host a [Keras Sequential model](https://keras.io/getting-started/sequential-model-guide) on SageMaker. The model used for this notebook is a simple deep convolutional neural network (CNN) that was extracted from [the Keras examples](https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py).\n",
 "\n",
-"For training our model, we also demonstrate distributed training with [Horovod](https://horovod.readthedocs.io) and Pipe Mode. Amazon SageMaker's Pipe Mode streams your dataset directly to your training instances instead of being downloaded first, which translates to training jobs that start sooner, finish quicker, and need less disk space."
+"For training our model, we also demonstrate distributed training with [Horovod](https://horovod.readthedocs.io) and Pipe Mode. Amazon SageMaker's Pipe Mode streams your dataset directly to your training instances instead of being downloaded first, which translates to training jobs that start sooner, finish quicker, and need less disk space. \n",
+"<br>\n",
+"<br>\n",
+"<b> Instance Type and Pricing: </b>\n",
+"\n",
+"This notebook was trained using the <b>Python 3 (TensorFlow CPU Optimized)</b> kernel using the <b>ml.p3.2xlarge</b> compute instance type in the <b>us-west-2 region</b>. Training time is approximately 70 minutes with the aforementioned hardware specifications.\n",
+"\n",
+"Price per hour depends on your region and instance type. You can reference prices on the [SageMaker pricing page](https://aws.amazon.com/sagemaker/pricing/). \n",
+"\n",
+"---\n",
+"---"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "## Setup\n",
-"\n",
 "First, we define a few variables that are be needed later in the example."
 ]
 },
@@ -161,7 +168,7 @@
 "from sagemaker.tensorflow import TensorFlow\n",
 "\n",
 "hyperparameters = {'epochs': 10, 'batch-size': 256}\n",
-"tags = [{'Key': 'Project', 'Value': 'cifar10'}, {'Key': 'TensorBoard', 'Value': 'file'}]\n",
+"tags = [{'Key': 'Project', 'Value': 'cifar10'}]\n",
 "\n",
 "estimator = TensorFlow(entry_point='keras_cifar10.py',\n",
 " source_dir='source',\n",
@@ -171,7 +178,7 @@
 " framework_version='1.15.2',\n",
 " py_version='py3',\n",
 " train_instance_count=1,\n",
-" train_instance_type='ml.p2.xlarge',\n",
+" train_instance_type='ml.p3.2xlarge',\n",
 " base_job_name='cifar10-tf',\n",
 " tags=tags)"
 ]
@@ -195,7 +202,7 @@
 " 'eval': '{}/eval'.format(dataset_uri),\n",
 "}\n",
 "\n",
-"estimator.fit(inputs, wait=True)"
+"estimator.fit(inputs)"
 ]
 },
 {
@@ -273,26 +280,19 @@
 " framework_version='1.15.2',\n",
 " py_version='py3',\n",
 " train_instance_count=1,\n",
-" train_instance_type='ml.p2.xlarge',\n",
+" train_instance_type='ml.p3.2xlarge',\n",
 " input_mode='Pipe',\n",
 " base_job_name='cifar10-tf-pipe',\n",
 " tags=tags)"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Using the same training inputs from before, we call `fit()` on our estimator. Here, we set `wait=False`, but you can set `wait=True` if you would like to see the training logs in the notebook."
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"pipe_mode_estimator.fit(inputs, wait=False)"
+"pipe_mode_estimator.fit(inputs)"
 ]
 },
 {
@@ -327,13 +327,6 @@
 "```python\n",
 "opt = Adam(lr=learning_rate * size, decay=weight_decay)\n",
 "opt = hvd.DistributedOptimizer(opt)\n",
-"```\n",
-"\n",
-"4. Choose to save checkpoints and send TensorBoard logs only from the master node:\n",
-"\n",
-"```python\n",
-"if hvd.rank() == 0:\n",
-" save_model(model, args.model_output_dir)\n",
 "```"
 ]
 },
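
Note on the hunk above: the context line `opt = Adam(lr=learning_rate * size, decay=weight_decay)` multiplies the learning rate by the number of Horovod workers before wrapping the optimizer. A minimal sketch of that scaling follows, written in plain Python so it runs without Horovod or Keras. The function name `scaled_learning_rate` and the base rate of 0.001 are illustrative assumptions, and `num_workers` stands in for the notebook's `size` variable, whose definition is not shown in this diff (presumably the Horovod worker count).

```python
def scaled_learning_rate(base_lr: float, num_workers: int) -> float:
    """Scale a single-worker learning rate linearly with the worker count."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return base_lr * num_workers


# Hypothetical single-worker rate; the notebook's actual value is not in the diff.
base_lr = 0.001

for num_workers in (1, 2, 4):
    print(num_workers, scaled_learning_rate(base_lr, num_workers))
```

With the distributed job in this commit (`train_instance_count=2`), and assuming one Horovod process per instance, the optimizer would receive twice the single-worker rate.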
@@ -380,60 +373,18 @@
 " framework_version='1.15.2',\n",
 " py_version='py3',\n",
 " train_instance_count=2,\n",
-" train_instance_type='ml.p2.xlarge',\n",
+" train_instance_type='ml.p3.2xlarge',\n",
 " base_job_name='cifar10-tf-dist',\n",
 " tags=tags)"
 ]
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"Like before, we call `fit()` on our estimator. If you want to see the training job logs in the notebook output, set `wait=True`."
-]
-},
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
-"dist_estimator.fit(inputs, wait=False)"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"### Compare the training jobs with TensorBoard\n",
-"\n",
-"Using the visualization tool [TensorBoard](https://www.tensorflow.org/tensorboard), we can compare our training jobs.\n",
-"\n",
-"In a local setting, install TensorBoard with `pip install tensorboard`. Then run the command generated by the following code:"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"metadata": {},
-"outputs": [],
-"source": [
-"!python generate_tensorboard_command.py"
-]
-},
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"After running that command, we can access TensorBoard locally at http://localhost:6006.\n",
-"\n",
-"Based on the TensorBoard metrics, we can see that:\n",
-"1. All jobs run for 10 epochs (0 - 9).\n",
-"1. Both File Mode and Pipe Mode run for ~1 minute - Pipe Mode doesn't affect training performance.\n",
-"1. Distributed training runs for only 45 seconds.\n",
-"1. All of the training jobs resulted in similar validation accuracy.\n",
-"\n",
-"This example uses a relatively small dataset (179 MB). For larger datasets, Pipe Mode can significantly reduce training time because it does not copy the entire dataset into local memory."
+"dist_estimator.fit(inputs)"
 ]
 },
 {
@@ -444,7 +395,11 @@
 "\n",
 "After we train our model, we can deploy it to a SageMaker Endpoint, which serves prediction requests in real-time. To do so, we simply call `deploy()` on our estimator, passing in the desired number of instances and instance type for the endpoint.\n",
 "\n",
-"Because we're using TensorFlow Serving for deployment, our training script saves the model in TensorFlow's SavedModel format. For more details, see [this blog post on deploying Keras and TF models in SageMaker](https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker)."
+"Because we're using TensorFlow Serving for deployment, our training script saves the model in TensorFlow's SavedModel format. \n",
+"\n",
+"We don't need accelerated computing power for inference, so let's switch over to a <b>ml.m4.xlarge</b> instance type. \n",
+"\n",
+"For more information about deploying Keras and TensorFlow models in SageMaker, see [this blog post](https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker)."
 ]
 },
 {
@@ -604,7 +559,7 @@
 "kernelspec": {
 "display_name": "Python 3 (TensorFlow CPU Optimized)",
 "language": "python",
-"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-east-2:429704687514:image/tensorflow-1.15-cpu-py36"
+"name": "python3__SAGEMAKER_INTERNAL__arn:aws:sagemaker:us-west-2:236514542706:image/tensorflow-1.15-cpu-py36"
 },
 "language_info": {
 "codemirror_mode": {
@@ -616,7 +571,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.9"
+"version": "3.6.3"
 },
 "notice": "Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.",
 "pycharm": {

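Summary of the estimator hunks: after this commit the three training jobs share the same instance type and tags and differ only in instance count and input mode. The sketch below collects just the keyword arguments visible in the diff into plain dictionaries, so it runs without the SageMaker SDK. It is not code from the notebook: `COMMON`, `JOBS`, and `estimator_kwargs` are hypothetical names, and arguments the diff does not show (for example the IAM role, hyperparameters wiring, or the entry point and distribution settings of the Pipe Mode and Horovod estimators) are deliberately left out.

```python
# Settings visible in the diff that all three jobs share after this commit.
COMMON = {
    "framework_version": "1.15.2",
    "py_version": "py3",
    "train_instance_type": "ml.p3.2xlarge",  # was ml.p2.xlarge before the commit
    "tags": [{"Key": "Project", "Value": "cifar10"}],  # TensorBoard tag removed
}

# Per-job settings, keyed by the base_job_name shown in each hunk.
JOBS = {
    "cifar10-tf": {"train_instance_count": 1},
    "cifar10-tf-pipe": {"train_instance_count": 1, "input_mode": "Pipe"},
    "cifar10-tf-dist": {"train_instance_count": 2},
}


def estimator_kwargs(base_job_name: str) -> dict:
    """Merge the shared settings with one job's overrides (shallow copy)."""
    kwargs = dict(COMMON)
    kwargs.update(JOBS[base_job_name])
    kwargs["base_job_name"] = base_job_name
    return kwargs


for name in JOBS:
    print(name, estimator_kwargs(name))
```

Each dictionary could, in principle, be unpacked into the `TensorFlow(...)` constructor alongside the omitted arguments. All three `fit(inputs)` calls now rely on the method's default for `wait` rather than passing it explicitly.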