How important is scaling for SGDRegressor in scikit-learn?

Feb 10, 2020

[ pandas regression sgd scikit-learn ]

I’ve been playing around with SGDRegressor from the scikit-learn library and kept running into nonsensical outputs.

Even on a simple manufactured dataset, one that LinearRegression can fit with a perfect line, SGDRegressor was producing predictions nowhere near the targets.

Here was the sample dataset I used, where the predicted value was simply 5 times the input value:

import numpy as np

num_samples = 100
multiple = 5
x = np.arange(num_samples)  # inputs 0, 1, ..., 99
y = x * multiple            # targets are exactly 5 times the input
x[:5], y[:5]

# Output:
# (array([0, 1, 2, 3, 4]), array([ 0,  5, 10, 15, 20]))
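
As a baseline, here is a minimal sketch of the ordinary least-squares fit on the same data, assuming scikit-learn's LinearRegression. The names X and lin are mine, and the reshape is only there because scikit-learn estimators expect a 2-D feature matrix.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

num_samples = 100
multiple = 5
x = np.arange(num_samples)
y = x * multiple

# scikit-learn expects features with shape (n_samples, n_features)
X = x.reshape(-1, 1)

lin = LinearRegression().fit(X, y)
slope = lin.coef_[0]
intercept = lin.intercept_

# The slope should come out very close to 5 and the intercept very close to 0
print(slope, intercept)
```

This solver computes the least-squares solution directly rather than taking gradient steps, so it has no learning rate that the size of x can interact with.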

It wasn’t until I started scaling the data that I was able to get the results I expected.

From the scikit-learn website:

Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results.
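
A minimal sketch of that advice on the dataset above, assuming StandardScaler inside a pipeline so the same scaling is applied at fit and predict time. The variable names, random_state=0, and the three test inputs are my own choices for illustration, not anything prescribed by the documentation.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

num_samples = 100
multiple = 5
X = np.arange(num_samples, dtype=float).reshape(-1, 1)
y = X.ravel() * multiple

X_new = np.array([[10.0], [50.0], [90.0]])  # true targets: 50, 250, 450

# Unscaled: inputs up to 99 make the default step size far too large.
# Depending on the scikit-learn version this may emit a convergence
# warning or raise a ValueError about floating-point overflow.
try:
    raw_sgd = SGDRegressor(random_state=0).fit(X, y)
    raw_pred = raw_sgd.predict(X_new)
except ValueError as err:
    raw_pred = None
    print("unscaled fit failed:", err)

# Scaled: standardize the feature to mean 0 and variance 1 before SGD.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X, y)
scaled_pred = scaled_sgd.predict(X_new)

print("unscaled:", raw_pred)
print("scaled:  ", scaled_pred)
```

Putting the scaler in the pipeline means predict() reuses the mean and variance learned during fit, which covers the note about applying the same scaling to the test vector.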

I created a Jupyter Notebook below as a simple demonstration.
