What percentage of samples from a dataset fall into a range of another dataset (using Python, numpy, scipy)?
This notebook demonstrates using Python to compare two single-variable datasets. More specifically, it answers “What percentage of samples from dataset A fall into a particular range (e.g., the interquartile range) of dataset B?” We use Python 3, numpy, and scipy.
!python --version
Python 3.7.4
import numpy as np
import scipy.stats
Consider the following two datasets. I have chosen easy ones for clarity, but this methodology should work on any datasets of a single variable.
a = np.arange(10, 40)
b = np.arange(0, 100)
a, b
(array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]),
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]))
What fraction of the data points in dataset A fall into the interquartile range of dataset B? To answer this, we can build a numpy histogram from dataset B, turn it into a histogram-based random variable distribution using scipy, and then call the ppf function (the percent point function, i.e., the inverse of the cdf) at the quantiles that bound the range we want: 0.25 and 0.75 for the interquartile range. Here, I round for convenience.
hist_b = np.histogram(b, bins=100)
dist_b = scipy.stats.rv_histogram(hist_b)
start = dist_b.ppf(0.25).round(2)
finish = dist_b.ppf(0.75).round(2)
start, finish
(24.75, 74.25)
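As a sanity check, the same quartiles can be computed directly from the raw samples, without building a histogram. This is a minimal sketch that assumes numpy's default linear interpolation in np.percentile; the names q1 and q3 are introduced here for illustration only.

```python
import numpy as np

# Same dataset B as above.
b = np.arange(0, 100)

# 25th and 75th percentiles of the raw samples
# (np.percentile interpolates linearly between samples by default).
q1, q3 = np.percentile(b, [25, 75])
print(q1, q3)  # 24.75 74.25
```

For this evenly spaced dataset the two approaches agree. For other data, the histogram-based values depend on the number of bins, so small differences are to be expected.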
Next, I create a similar distribution for dataset a, but now I pass the ppf values from dataset b to its cdf function. The difference between the two cdf values is the fraction (a value between 0 and 1) of values from dataset a that fall within the specified range of b.
hist_a = np.histogram(a, bins=100)
dist_a = scipy.stats.rv_histogram(hist_a)
(dist_a.cdf(finish) - dist_a.cdf(start)).round(2)
0.5
In our sample, the answer is 50%.
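One way to cross-check this result is to count the samples directly with a boolean mask. The sketch below is not the histogram method used above but a simpler empirical alternative; it assumes the range bounds are inclusive and takes them from np.percentile rather than from the fitted distribution. The names in_range and fraction are illustrative.

```python
import numpy as np

a = np.arange(10, 40)
b = np.arange(0, 100)

# Bounds of the interquartile range of b, computed from the raw samples.
start, finish = np.percentile(b, [25, 75])

# True for every sample of a that lies inside [start, finish].
in_range = (a >= start) & (a <= finish)

# The mean of a boolean array is the fraction of True entries.
fraction = in_range.mean()
print(fraction)  # 0.5
```

Direct counting avoids the choice of bin count, while the histogram approach yields a reusable distribution object that can be queried for any range.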