I finally got around to reading Playing Atari with Deep Reinforcement Learning, a well-cited paper from a few years back in which DeepMind trained a neural network to play several Atari games.

Digging into RL and OpenAI’s gym, I kept seeing references to it, and it seemed like only a matter of time before I would have to make my way back to it. The paper did not disappoint: even in the “Background” section, some details that had previously been a bit fuzzy became clear.

I was familiar with stochastic gradient descent, backpropagation, and the basic idea of how loss is calculated in supervised learning, but it has taken me a little while to digest how this works in reinforcement learning, particularly when optimizing a deep Q-network.

In the words of the paper:

A Q-network can be trained by minimising a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,

$$L_i(\theta_i) = \mathbb{E}_{s,a \sim \rho(\cdot)}\left[\left(y_i - Q(s, a; \theta_i)\right)^2\right],$$

where $y_i = \mathbb{E}_{s' \sim \mathcal{E}}\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a \,\right]$ is the target for iteration $i$ and $\rho(s, a)$ is a probability distribution over sequences $s$ and actions $a$ that we refer to as the behaviour distribution.

In supervised learning, it is usually very clear what $y$ is: it is simply the labeled data. But in RL, this becomes less clear. $\hat{y}$ also seems intuitive to me, as it is simply the Q-value output by the Q-network. But $y$ uses the Q-network itself, in addition to the immediate reward from the environment, to calculate the expected cumulative reward.
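To make this concrete, here is a minimal PyTorch sketch of one gradient step on that loss. The toy network, its dimensions, and the single transition are all made up for illustration; I also use the same network for the target with gradients blocked, standing in for the frozen previous-iteration parameters $\theta_{i-1}$, and I ignore the terminal-state case where the target is just $r$.

```python
import torch
import torch.nn as nn

# Hypothetical toy Q-network: maps a state vector to one Q-value per action.
class QNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 32),
            nn.ReLU(),
            nn.Linear(32, n_actions),
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
gamma = 0.99

# A fake transition (s, a, r, s'), as it might come out of a replay buffer.
state = torch.randn(1, 4)
action = torch.tensor([0])
reward = torch.tensor([1.0])
next_state = torch.randn(1, 4)

# y = r + gamma * max_a' Q(s', a'): the target is treated as a fixed value,
# so no gradients flow through it (hence torch.no_grad()).
with torch.no_grad():
    y = reward + gamma * q_net(next_state).max(dim=1).values

# y_hat = Q(s, a; theta_i): the network's current estimate for the action taken.
y_hat = q_net(state).gather(1, action.unsqueeze(1)).squeeze(1)

# L_i(theta_i) = (y - y_hat)^2, the squared TD error the paper minimises.
loss = ((y - y_hat) ** 2).mean()
loss.backward()  # gradients only touch y_hat's path, not the target y
```

Seeing it laid out this way is what finally clicked for me: the “label” is partly produced by the very network being trained, which is why the loss function changes at each iteration.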