
Commit 0495ab1

Tiny math font fix.
1 parent 7065009 commit 0495ab1

File tree

1 file changed: +4 -4 lines changed


16_Reinforcement_Learning.ipynb

Lines changed: 4 additions & 4 deletions
@@ -123,11 +123,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state *t* the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state *t* and *t+1* because NOOP means \"No Operation\".\n",
+"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state $t$ the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state $t$ and $t+1$ because NOOP means \"No Operation\".\n",
 "\n",
-"In state *t+1* the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state *t+1* is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
+"In state $t+1$ the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state $t+1$ is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
 "\n",
-"Now that we know the reward of taking the NOOP action from state *t* to *t+1*, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
+"Now that we know the reward of taking the NOOP action from state $t$ to $t+1$, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
 "\n",
 "$$\n",
 " Q(state_{t},NOOP) \\leftarrow \\underbrace{r_{t}}_{\\rm reward} + \\underbrace{\\gamma}_{\\rm discount} \\cdot \\underbrace{\\max_{a}Q(state_{t+1}, a)}_{\\rm estimate~of~future~rewards} = 1.0 + 0.97 \\cdot 1.830 \\simeq 2.775\n",
@@ -4456,7 +4456,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.1"
+"version": "3.6.8"
 }
 },
 "nbformat": 4,

0 commit comments