What percentage of samples from a dataset fall into a range of another dataset (using Python, numpy, scipy)?

Nov 7, 2020

[ python numpy scipy ]

This notebook demonstrates using Python to compare two datasets of a single variable. More specifically, it answers “What percentage of samples from dataset A fall into a particular range (e.g., the interquartile range) of dataset B?” We use Python 3, numpy, and scipy.

!python --version

Python 3.7.4

import numpy as np
import scipy.stats

Consider the following two datasets. I have chosen easy ones for clarity, but this methodology should work on any datasets of a single variable.

a = np.arange(10, 40)
b = np.arange(0, 100)
a, b

(array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
        27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]),
 array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
        34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
        51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
        68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84,
        85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]))

What fraction of the data points in dataset A fall into the interquartile range of dataset B? First, we need that range. We can create a numpy histogram from dataset B, turn it into a continuous random-variable distribution using scipy.stats.rv_histogram, and then call its ppf function (the percent point function, i.e., the inverse of the CDF) at 0.25 and 0.75. Here, I round for convenience.

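# Histogram of b: a (counts, bin_edges) tuple
hist_b = np.histogram(b, bins=100)
# Piecewise-uniform distribution built from that histogram
dist_b = scipy.stats.rv_histogram(hist_b)
# Lower and upper quartiles of b
start = dist_b.ppf(0.25).round(2)
finish = dist_b.ppf(0.75).round(2)
start, finish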

(24.75, 74.25)
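As a quick cross-check, the same quartiles can be read directly from the data with np.percentile, without building a histogram. This is a minimal sketch, not part of the original notebook. It relies on numpy's default linear interpolation, which happens to agree with the histogram-based values for this evenly spaced dataset. The two approaches can differ slightly on other data.

```python
import numpy as np

b = np.arange(0, 100)

# Empirical 25th and 75th percentiles (default linear interpolation)
q1, q3 = np.percentile(b, [25, 75])
print(q1, q3)
```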

Next, I create a similar distribution for dataset A, but now I pass the ppf values from dataset B to its cdf function. The difference between the two CDF values is the fraction of values from dataset A that fall in the specified range of B.

hist_a = np.histogram(a, bins=100)
dist_a = scipy.stats.rv_histogram(hist_a)
# Fraction of a's probability mass between b's quartiles
(dist_a.cdf(finish) - dist_a.cdf(start)).round(2)

0.5

In our sample, the answer is 50%: 15 of the 30 values in dataset A (25 through 39) lie between 24.75 and 74.25.
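For small datasets like these, the result can also be verified by counting the samples directly. The sketch below is an alternative to the histogram approach, not the method used above. The helper name fraction_in_range is hypothetical, and the bounds are treated as inclusive, which is an assumption.

```python
import numpy as np

def fraction_in_range(samples, low, high):
    """Return the fraction of samples with low <= value <= high."""
    samples = np.asarray(samples)
    # The mean of a boolean mask is the share of True entries
    return np.mean((samples >= low) & (samples <= high))

a = np.arange(10, 40)
b = np.arange(0, 100)

# Interquartile range of b, taken straight from the data
low, high = np.percentile(b, [25, 75])
frac = fraction_in_range(a, low, high)
print(frac)
```

Direct counting gives the exact empirical fraction, while the rv_histogram approach smooths the data into bins, so the two can differ slightly when samples fall near the range boundaries.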
