
Commit 0d59934

committed
Add TF quickstart
1 parent d887135 commit 0d59934

File tree

1 file changed

+274
-0
lines changed

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Using TensorFlow in SageMaker - Quickstart\n",
"\n",
"Starting with TensorFlow framework version 1.11, you can use the SageMaker TensorFlow Container to train any TensorFlow script.\n",
"\n",
"For this example, you use [Multi-layer Recurrent Neural Networks (LSTM, RNN) for character-level language models in Python using Tensorflow](https://github.com/sherjilozair/char-rnn-tensorflow), but you can apply the same technique to other scripts or repositories, such as the [TensorFlow Model Zoo](https://github.com/tensorflow/models) and the [TensorFlow benchmark scripts](https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Get the data\n",
"For training data, use plain text versions of Sherlock Holmes stories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"data_dir = os.path.join(os.getcwd(), 'sherlock')\n",
"\n",
"os.makedirs(data_dir, exist_ok=True)\n",
"\n",
"!wget https://sherlock-holm.es/stories/plain-text/cnus.txt --force-directories --output-document=sherlock/input.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing the training script\n",
"\n",
"Let's start by cloning the repository that contains the example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!git clone https://github.com/sherjilozair/char-rnn-tensorflow"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This repository includes a [README.md](https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/README.md#basic-usage) with an overview of the project, requirements, and basic usage:\n",
"\n",
"> #### **Basic Usage**\n",
"> _To train with default parameters on the tinyshakespeare corpus, run **python train.py**. \n",
"To access all the parameters use **python train.py --help.**_\n",
"\n",
"[train.py](https://github.com/sherjilozair/char-rnn-tensorflow/blob/master/train.py#L11) uses the Python [argparse](https://docs.python.org/3/library/argparse.html) library and defines the following arguments:\n",
"\n",
"```python\n",
"parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)\n",
"# Data and model checkpoints directories\n",
"parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare', help='data directory containing input.txt with training examples')\n",
"parser.add_argument('--save_dir', type=str, default='save', help='directory to store checkpointed models')\n",
"...\n",
"args = parser.parse_args()\n",
"\n",
"```\n",
"When SageMaker training finishes, it deletes all data generated inside the container, with the exception of the directory _/opt/ml/model_. To ensure that model data is not lost, SageMaker invokes training scripts with an additional argument, **--model_dir**, which the training script needs to handle. We need to replace the argument **--save_dir** with the required argument **--model_dir**: "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this command replaces save_dir with model_dir in the training script\n",
"!sed -i 's/save_dir/model_dir/g' char-rnn-tensorflow/train.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training script can now be executed in the container as follows:\n",
"\n",
"> ```bash\n",
"python train.py --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test locally using the SageMaker Python SDK TensorFlow Estimator"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use the SageMaker Python SDK [TensorFlow](https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/README.rst#training-with-tensorflow) estimator to easily train locally and in SageMaker. To train locally, set the instance type to [local](https://github.com/aws/sagemaker-python-sdk#local-mode) as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import sagemaker\n",
"from sagemaker.tensorflow import TensorFlow\n",
"\n",
"# sets the script arguments --num_epochs and --data_dir\n",
"hyperparameters = {'num_epochs': 1, \n",
" 'data_dir': '/opt/ml/input/data/training'}\n",
"\n",
"estimator = TensorFlow(entry_point='train.py',\n",
" source_dir='char-rnn-tensorflow',\n",
" train_instance_type='local', # Run in local mode\n",
" train_instance_count=1,\n",
" hyperparameters=hyperparameters,\n",
" role=sagemaker.get_execution_role(), # Passes the AWS role used in this notebook to the container\n",
" framework_version='1.11.0', # Uses TensorFlow 1.11\n",
" py_version='py3',\n",
" script_mode=True)\n",
"\n",
"estimator.fit({'training': f'file://{data_dir}'}) # Starts training; creates a data channel named training with the contents of\n",
"# data_dir in the folder /opt/ml/input/data/training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How Script Mode executes the script in the container\n",
"\n",
"The cell above downloads the SageMaker TensorFlow container (Python 3, CPU version) locally and simulates a SageMaker training job. \n",
"When training starts, the SageMaker TensorFlow container executes **train.py**, passing **hyperparameters** and **model_dir** as script arguments. The example above is executed as follows:\n",
"```bash\n",
"python -m train --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model\n",
"```\n",
"\n",
"Let's explain the values of **--data_dir** and **--model_dir** in more detail:\n",
"\n",
"- **/opt/ml/input/data/training** is the directory inside the container where the training data is downloaded. The data is downloaded to this folder because **training** is the channel name defined in ```estimator.fit({'training': inputs})```. See [training data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-running-container-trainingdata) for more information. \n",
"\n",
"- **/opt/ml/model**: use this directory to save models, checkpoints, or any other data. Any data saved in this folder is uploaded to the S3 bucket defined for training. See [model data](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html#your-algorithms-training-algo-envvariables) for more information.\n",
"\n",
"### Reading additional information from the container\n",
"\n",
"Often, a user script needs additional information from the container that is not available in ```hyperparameters```.\n",
"SageMaker containers write this information as **environment variables** that are available inside the script.\n",
"\n",
"For example, the script above can read information about the **training** channel provided in the training job request by using the environment variable `SM_CHANNEL_TRAINING` as the default value for the `--data_dir` argument:\n",
"\n",
"```python\n",
"if __name__ == '__main__':\n",
" parser = argparse.ArgumentParser()\n",
" # reads the training input channel location from an environment variable\n",
" parser.add_argument('--data_dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])\n",
"```\n",
"\n",
"Script mode displays the list of available environment variables in the training logs. You can find the [entire list here](https://github.com/aws/sagemaker-containers/blob/master/README.md#environment-variables-full-specification)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training in SageMaker"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After you test the training job locally, upload the dataset to an S3 bucket so SageMaker can access the data during training:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"\n",
"inputs = sagemaker.Session().upload_data(path='sherlock', key_prefix='datasets/sherlock')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variable **inputs** returned above is a string with an S3 location that SageMaker Training has\n",
"permission to read data from. **This setup is for educational purposes;\n",
"larger datasets require more robust solutions:**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inputs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To train in SageMaker:\n",
"- change the estimator argument **train_instance_type** to any SageMaker ML instance type available for training.\n",
"- set the **training** channel to an S3 location.\n",
"\n",
"For example:"
]
},
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"estimator = TensorFlow(entry_point='train.py',\n",
" source_dir='char-rnn-tensorflow',\n",
" train_instance_type='ml.c4.xlarge', # Executes training on an ml.c4.xlarge instance\n",
" train_instance_count=1,\n",
" hyperparameters=hyperparameters,\n",
" role=sagemaker.get_execution_role(),\n",
" framework_version='1.11.0',\n",
" py_version='py3',\n",
" script_mode=True)\n",
"\n",
"estimator.fit({'training': inputs})"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_tensorflow_p36",
"language": "python",
"name": "conda_tensorflow_p36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
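The argument-handling pattern the notebook describes (reading the SageMaker channel and model locations from environment variables, with local fallbacks) can be sketched as a standalone script. This is a hypothetical illustration, not part of char-rnn-tensorflow: `parse_args` and the fallback defaults are assumptions, while `SM_CHANNEL_TRAINING` and `SM_MODEL_DIR` are the environment variables script mode documents.

```python
import argparse
import os


def parse_args(argv=None):
    # Hypothetical sketch of a SageMaker script-mode entry point.
    # SageMaker injects SM_CHANNEL_TRAINING and SM_MODEL_DIR as environment
    # variables; falling back to local paths lets the same script also run
    # outside the container.
    parser = argparse.ArgumentParser()
    parser.add_argument('--num_epochs', type=int, default=1)
    parser.add_argument('--data_dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING',
                                               'data/tinyshakespeare'))
    parser.add_argument('--model_dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', 'save'))
    return parser.parse_args(argv)


if __name__ == '__main__':
    args = parse_args(['--num_epochs', '2'])
    print(args.num_epochs, args.model_dir)
```

Because the environment lookup happens at parse time, the same script picks up container paths when run by SageMaker and local defaults otherwise.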

0 commit comments

Comments
 (0)
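As a rough sketch of how script mode turns the estimator's `hyperparameters` dict into the command-line flags shown in the notebook: `to_cli_args` is a hypothetical helper for illustration only; the real container's argument serialization is more involved.

```python
def to_cli_args(hyperparameters):
    # Loosely mimics how script mode appends the hyperparameters dict
    # as --name value flags when invoking the training script.
    args = []
    for name, value in sorted(hyperparameters.items()):
        args += ['--' + name, str(value)]
    return args


hyperparameters = {'num_epochs': 1, 'data_dir': '/opt/ml/input/data/training'}
print(' '.join(['python', 'train.py'] + to_cli_args(hyperparameters)))
# python train.py --data_dir /opt/ml/input/data/training --num_epochs 1
```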