Examining the Loss Calculation from a Deep Q Network
I finally got around to reading Playing Atari with Deep Reinforcement Learning, a well-cited paper from a few years back in which DeepMind trained a neural network to play several Atari games.
Digging into RL and OpenAI’s gym, I kept seeing references to it, and it seemed like only a matter of time before I would have to come back to it. The paper did not disappoint: even the “Background” section cleared up some details that had previously been a bit fuzzy for me.
I was familiar with stochastic gradient descent, backpropagation, and the basic idea of how loss is calculated in supervised learning, but it has taken a little while to digest how this works in reinforcement learning, particularly when optimizing a deep Q-network.
In the words of the paper:
A Q-network can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,

$$L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right],$$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \,\right]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over sequences $s$ and actions $a$ that we refer to as the behaviour distribution.
In supervised learning, it is usually very clear what $y$ is: it is simply the labeled data. But in RL, this becomes less clear. $\hat{y}$ also seems intuitive to me, as it is simply the Q-value output by the Q-network. But $y$ uses the Q-network itself, in addition to the immediate reward from the environment, to calculate the expected cumulative reward.
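To make that distinction concrete for myself, here is a rough PyTorch sketch of how the loss might be computed for a batch of transitions. The names (`dqn_loss`, `q_net`, `target_net`, the batch fields) are my own, not from the paper or any particular codebase, and I'm using a frozen copy of the network to stand in for the previous iteration's parameters $\theta_{i-1}$:

```python
import torch
import torch.nn as nn

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Illustrative DQN loss: y = r + gamma * max_a' Q(s', a'; theta_{i-1}),
    compared against Q(s, a; theta_i) with a squared error.
    (Function and variable names are assumptions for this sketch.)"""
    states, actions, rewards, next_states, dones = batch

    # y-hat: Q-value of the action actually taken, from the current network
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y: immediate reward plus discounted max Q-value of the next state,
    # evaluated with the frozen copy standing in for theta_{i-1}; no gradient
    # flows through the target
    with torch.no_grad():
        next_q_max = target_net(next_states).max(dim=1).values
        y = rewards + gamma * next_q_max * (1.0 - dones)

    # L_i(theta_i): mean squared error between target and prediction
    return nn.functional.mse_loss(q_pred, y)

if __name__ == "__main__":
    # Tiny smoke test: random 2-layer Q-networks over 4-dim states, 3 actions
    q_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
    target_net = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 3))
    target_net.load_state_dict(q_net.state_dict())

    batch = (
        torch.randn(8, 4),          # states
        torch.randint(0, 3, (8,)),  # actions
        torch.randn(8),             # rewards
        torch.randn(8, 4),          # next states
        torch.zeros(8),             # done flags
    )
    print(dqn_loss(q_net, target_net, batch))
```

The part that clarified things for me is the `torch.no_grad()` block: $y$ is treated as a fixed label for this iteration, even though it was produced with the Q-network's own (older) weights, so gradients only flow through $Q(s, a; \theta_i)$.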