
Commit 0495ab1

Tiny math font fix.
1 parent 7065009 commit 0495ab1

File tree

1 file changed: +4 -4 lines changed


16_Reinforcement_Learning.ipynb

Lines changed: 4 additions & 4 deletions
@@ -123,11 +123,11 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state *t* the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state *t* and *t+1* because NOOP means \"No Operation\".\n",
+"The Q-values for the possible actions have been estimated by a Neural Network. For the action NOOP in state $t$ the Q-value is estimated to be 2.900, which is the highest Q-value for that state so the agent takes that action, i.e. the agent does not do anything between state $t$ and $t+1$ because NOOP means \"No Operation\".\n",
 "\n",
-"In state *t+1* the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state *t+1* is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
+"In state $t+1$ the agent scores 4 points, but this is limited to 1 point in this implementation so as to stabilize the training. The maximum Q-value for state $t+1$ is 1.830 for the action RIGHTFIRE. So if we select that action and continue to select the actions proposed by the Q-values estimated by the Neural Network, then the discounted sum of all the future rewards is expected to be 1.830.\n",
 "\n",
-"Now that we know the reward of taking the NOOP action from state *t* to *t+1*, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
+"Now that we know the reward of taking the NOOP action from state $t$ to $t+1$, we can update the Q-value to incorporate this new information. This uses the formula above:\n",
 "\n",
 "$$\n",
 " Q(state_{t},NOOP) \\leftarrow \\underbrace{r_{t}}_{\\rm reward} + \\underbrace{\\gamma}_{\\rm discount} \\cdot \\underbrace{\\max_{a}Q(state_{t+1}, a)}_{\\rm estimate~of~future~rewards} = 1.0 + 0.97 \\cdot 1.830 \\simeq 2.775\n",
@@ -4456,7 +4456,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.1"
+"version": "3.6.8"
 }
 },
 "nbformat": 4,

0 commit comments