scipy.stats

時間 2019-12-05

標籤 scipy.stats scipy stats 简体版

原文原文鏈接

scipy.statspython

Scipy的stats模塊包含了多種機率分佈的隨機變量，隨機變量分爲連續的和離散的兩種。
全部的連續隨機變量都是rv_continuous的派生類的對象，而全部的離散隨機變量都是 rv_discrete的派生類的對象。api

This module contains a large number of probability distributions as well as a growing library of statistical functions.數組

Each univariate distribution is an instance of a subclass of rv_continuous(rv_discrete for discrete distributions):app

rv_continuous([momtype, a, b, xtol, ...]) A generic continuous random variable class meant for subclassing.
rv_discrete([a, b, name, badvalue, ...]) A generic discrete random variable class meant for subclassing.
皮皮blogdom

連續分佈及其相關的函數
連續分佈
alpha An alpha continuous random variable.
anglit An anglit continuous random variable.
arcsine An arcsine continuous random variable.
beta A beta continuous random variable.
betaprime A beta prime continuous random variable.
bradford A Bradford continuous random variable.
burr A Burr (Type III) continuous random variable.
burr12 A Burr (Type XII) continuous random variable.
cauchy A Cauchy continuous random variable.
chi A chi continuous random variable.
chi2 A chi-squared continuous random variable.
cosine A cosine continuous random variable.
dgamma A double gamma continuous random variable.
dweibull A double Weibull continuous random variable.
erlang An Erlang continuous random variable.
expon An exponential continuous random variable.
exponnorm An exponentially modified Normal continuous random variable.
exponweib An exponentiated Weibull continuous random variable.
exponpow An exponential power continuous random variable.
f An F continuous random variable.
fatiguelife A fatigue-life (Birnbaum-Saunders) continuous random variable.
fisk A Fisk continuous random variable.
foldcauchy A folded Cauchy continuous random variable.
foldnorm A folded normal continuous random variable.
frechet_r A Frechet right (or Weibull minimum) continuous random variable.
frechet_l A Frechet left (or Weibull maximum) continuous random variable.
genlogistic A generalized logistic continuous random variable.
gennorm A generalized normal continuous random variable.
genpareto A generalized Pareto continuous random variable.
genexpon A generalized exponential continuous random variable.
genextreme A generalized extreme value continuous random variable.
gausshyper A Gauss hypergeometric continuous random variable.
gamma A gamma continuous random variable.
gengamma A generalized gamma continuous random variable.
genhalflogistic A generalized half-logistic continuous random variable.
gilbrat A Gilbrat continuous random variable.
gompertz A Gompertz (or truncated Gumbel) continuous random variable.
gumbel_r A right-skewed Gumbel continuous random variable.
gumbel_l A left-skewed Gumbel continuous random variable.
halfcauchy A Half-Cauchy continuous random variable.
halflogistic A half-logistic continuous random variable.
halfnorm A half-normal continuous random variable.
halfgennorm The upper half of a generalized normal continuous random variable.
hypsecant A hyperbolic secant continuous random variable.
invgamma An inverted gamma continuous random variable.
invgauss An inverse Gaussian continuous random variable.
invweibull An inverted Weibull continuous random variable.
johnsonsb A Johnson SB continuous random variable.
johnsonsu A Johnson SU continuous random variable.
kappa4 Kappa 4 parameter distribution.
kappa3 Kappa 3 parameter distribution.
ksone General Kolmogorov-Smirnov one-sided test.
kstwobign Kolmogorov-Smirnov two-sided test for large N.
laplace A Laplace continuous random variable.
levy A Levy continuous random variable.
levy_l A left-skewed Levy continuous random variable.
levy_stable A Levy-stable continuous random variable.
logistic A logistic (or Sech-squared) continuous random variable.
loggamma A log gamma continuous random variable.
loglaplace A log-Laplace continuous random variable.
lognorm A lognormal continuous random variable.
lomax A Lomax (Pareto of the second kind) continuous random variable.
maxwell A Maxwell continuous random variable.
mielke A Mielke’s Beta-Kappa continuous random variable.
nakagami A Nakagami continuous random variable.
ncx2 A non-central chi-squared continuous random variable.
ncf A non-central F distribution continuous random variable.
nct A non-central Student’s T continuous random variable.
norm A normal continuous random variable.
pareto A Pareto continuous random variable.
pearson3 A pearson type III continuous random variable.
powerlaw A power-function continuous random variable.
powerlognorm A power log-normal continuous random variable.
powernorm A power normal continuous random variable.
rdist An R-distributed continuous random variable.
reciprocal A reciprocal continuous random variable.
rayleigh A Rayleigh continuous random variable.
rice A Rice continuous random variable.
recipinvgauss A reciprocal inverse Gaussian continuous random variable.
semicircular A semicircular continuous random variable.
skewnorm A skew-normal random variable.
t A Student’s T continuous random variable.
trapz A trapezoidal continuous random variable.
triang A triangular continuous random variable.
truncexpon A truncated exponential continuous random variable.
truncnorm A truncated normal continuous random variable.
tukeylambda A Tukey-Lamdba continuous random variable.
uniform A uniform continuous random variable.
vonmises A Von Mises continuous random variable.
vonmises_line A Von Mises continuous random variable.
wald A Wald continuous random variable.
weibull_min A Frechet right (or Weibull minimum) continuous random variable.
weibull_max A Frechet left (or Weibull maximum) continuous random variable.
wrapcauchy A wrapped Cauchy continuous random variable.
連續隨機變量對象的方法
rvs(*args, **kwds) Random variates of given type.產生服從這種分佈的一個樣本，對隨機變量進行隨機取值，能夠經過size參數指定輸出的數組大小。
pdf(x, *args, **kwds) Probability density function at x of the given RV.隨機變量的機率密度函數。產生對應x的這種分佈的y值。
logpdf(x, *args, **kwds) Log of the probability density function at x of the given RV.
cdf(x, *args, **kwds) Cumulative distribution function of the given RV.隨機變量的累積分佈函數，它是機率密度函數的積分（也就是x時p(X<x)的機率）。產生對應x的這種分佈的累積分佈函數的值。
logcdf(x, *args, **kwds) Log of the cumulative distribution function at x of the given RV.
sf(x, *args, **kwds) Survival function (1 - cdf) at x of the given RV.隨機變量的生存函數，它的值是1-cdf(t)。
logsf(x, *args, **kwds) Log of the survival function of the given RV.
ppf(q, *args, **kwds) Percent point function (inverse of cdf) at q of the given RV.累積分佈函數的反函數。q=0.01時，ppf就是p(X<x)=0.01時的x值。
isf(q, *args, **kwds) Inverse survival function (inverse of sf) at q of the given RV.
moment(n, *args, **kwds) n-th order non-central moment of distribution.
stats(*args, **kwds) Some statistics of the given RV.計算隨機變量的指望值和方差。
entropy(*args, **kwds) Differential entropy of the RV.
expect([func, args, loc, scale, lb, ub, ...]) Calculate expected value of a function with respect to the distribution.
median(*args, **kwds) Median of the distribution.
mean(*args, **kwds) Mean of the distribution.
std(*args, **kwds) Standard deviation of the distribution.
var(*args, **kwds) Variance of the distribution.
interval(alpha, *args, **kwds) Confidence interval with equal areas around the median.
__call__(*args, **kwds) Freeze the distribution for the given arguments.
fit(data, *args, **kwds) Return MLEs for shape, location, and scale parameters from data.對一組隨機取樣進行擬合，找出最適合取樣數據的機率密度函數的係數。如stats.norm.fit(x)就是將x當作是某個norm分佈的抽樣，求出其最好的擬合參數（mean, std）。
fit_loc_scale(data, *args) Estimate loc and scale parameters from data using 1st and 2nd moments.
nnlf(theta, x) Return negative loglikelihood function.
[Continuous distributions]
[scipy.stats.rv_continuous]ssh

多變量分佈Multivariate distributions
multivariate_normal A multivariate normal random variable.
matrix_normal A matrix normal random variable.
dirichlet A Dirichlet random variable.
wishart A Wishart random variable.
invwishart An inverse Wishart random variable.
special_ortho_group A matrix-valued SO(N) random variable.
ortho_group A matrix-valued O(N) random variable.
random_correlation A random correlation matrix.
multivariate_normal
>>> x, y = np.mgrid[-1:1:.01, -1:1:.01]
>>> pos = np.dstack((x, y)) #二維座標組合成三維座標點座標
>>> rv = multivariate_normal([0.5, -0.2], [[2.0, 0.3], [0.3, 0.5]])
>>> rv.pdf(pos) #接受的參數是三維數據，第三維表明一個數據座標，一、2維表明網格座標位置。ide

皮皮blog函數

離散分佈及其相關的函數
當分佈函數的值域爲離散時，稱之爲離散機率分佈。例如投擲有6個面的骰子時，只能得到1到6的整數，所以獲得的機率分佈爲離散的。測試

對於離散隨機分佈，一般使用機率質量函數(PMF)描述其分佈狀況。在stats庫中全部描述離散分佈的隨機變量都從rv_discrete類繼承。ui

直接用rv_discrete 類自定義離散機率分佈
stats.rv_discrete(values=(x,p))中的參數表示隨機變量x和其對應的機率。

設有一個不均勻的骰子，各點出現的機率不相等。能夠用下面的數組x保存骰子的全部可能值，數組p保存每一個值出現的機率：
>>> x = range(1,7)
>>> p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
用下面的語句定義表示這個特殊骰子的隨機變量，並調用其rvs()方法投擲此骰子20次，得到符合機率p的隨機數:
>>> dice = stats.rv_discrete(values=(x,p))
>>> dice.rvs(size=20)
Array([2, 5, 1, 2, 1, 1, 2, 4, 1, 3, 1, 1, 4, 3, 1, 1, 1, 2, 6, 4])

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
fs_meetsig = np.random.random(30)
fs_xk = np.sort(fs_meetsig)
fs_pk = np.ones_like(fs_xk) / len(fs_xk)
fs_rv_dist = stats.rv_discrete(name='fs_rv_dist', values=(fs_xk, fs_pk))

plt.plot(fs_xk, fs_rv_dist.cdf(fs_xk), 'b-', ms=12, mec='r', label='friend')
plt.show()
[rv_discrete Examples]

離散分佈
bernoulli A Bernoulli discrete random variable.
binom A binomial discrete random variable.
boltzmann A Boltzmann (Truncated Discrete Exponential) random variable.
dlaplace A Laplacian discrete random variable.
geom A geometric discrete random variable.
hypergeom A hypergeometric discrete random variable.
logser A Logarithmic (Log-Series, Series) discrete random variable.
nbinom A negative binomial discrete random variable.
planck A Planck discrete exponential random variable.
poisson A Poisson discrete random variable.
randint A uniform discrete random variable.
skellam A Skellam discrete random variable.
zipf A Zipf discrete random variable.
離散分佈的函數
rvs(*args, **kwargs) Random variates of given type.
pmf(k, *args, **kwds) Probability mass function at k of the given RV.
logpmf(k, *args, **kwds) Log of the probability mass function at k of the given RV.
cdf(k, *args, **kwds) Cumulative distribution function of the given RV.
logcdf(k, *args, **kwds) Log of the cumulative distribution function at k of the given RV.
sf(k, *args, **kwds) Survival function (1 - cdf) at k of the given RV.
logsf(k, *args, **kwds) Log of the survival function of the given RV.
ppf(q, *args, **kwds) Percent point function (inverse of cdf) at q of the given RV.
isf(q, *args, **kwds) Inverse survival function (inverse of sf) at q of the given RV.
moment(n, *args, **kwds) n-th order non-central moment of distribution.
stats(*args, **kwds) Some statistics of the given RV.
entropy(*args, **kwds) Differential entropy of the RV.
expect([func, args, loc, lb, ub, ...]) Calculate expected value of a function with respect to the distribution for discrete distribution.
median(*args, **kwds) Median of the distribution.
mean(*args, **kwds) Mean of the distribution.
std(*args, **kwds) Standard deviation of the distribution.
var(*args, **kwds) Variance of the distribution.
interval(alpha, *args, **kwds) Confidence interval with equal areas around the median.
__call__(*args, **kwds) Freeze the distribution for the given arguments.
皮皮blog

統計函數Statistical functions
{scipy.stats頂層函數，能夠應用於不少分佈的函數}

Several of these functions have a similar version in scipy.stats.mstats which work for masked arrays.

describe(a[, axis, ddof, bias, nan_policy]) Computes several descriptive statistics of the passed array.
gmean(a[, axis, dtype]) Compute the geometric mean along the specified axis.
hmean(a[, axis, dtype]) Calculates the harmonic mean along the specified axis.
kurtosis(a[, axis, fisher, bias, nan_policy]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis, nan_policy]) Tests whether a dataset has normal kurtosis
mode(a[, axis, nan_policy]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis, nan_policy]) Calculates the nth moment about the mean for a sample.
normaltest(a[, axis, nan_policy]) Tests whether a sample differs from a normal distribution.
skew(a[, axis, bias, nan_policy]) Computes the skewness of a data set.
skewtest(a[, axis, nan_policy]) Tests whether the skew is different from the normal distribution.
kstat(data[, n]) Return the nth k-statistic (1<=n<=4 so far).
kstatvar(data[, n]) Returns an unbiased estimator of the variance of the k-statistic.
tmean(a[, limits, inclusive, axis]) Compute the trimmed mean.
tvar(a[, limits, inclusive, axis, ddof]) Compute the trimmed variance
tmin(a[, lowerlimit, axis, inclusive, ...]) Compute the trimmed minimum
tmax(a[, upperlimit, axis, inclusive, ...]) Compute the trimmed maximum
tstd(a[, limits, inclusive, axis, ddof]) Compute the trimmed sample standard deviation
tsem(a[, limits, inclusive, axis, ddof]) Compute the trimmed standard error of the mean.
variation(a[, axis, nan_policy]) Computes the coefficient of variation, the ratio of the biased standard deviation to the mean.
find_repeats(arr) Find repeats and repeat counts.
trim_mean(a, proportiontocut[, axis]) Return mean of array after trimming distribution from both tails.
cumfreq(a[, numbins, defaultreallimits, weights]) Returns a cumulative frequency histogram, using the histogram function.
histogram2(*args, **kwds) histogram2 is deprecated!
histogram(*args, **kwds) histogram is deprecated!
itemfreq(a) Returns a 2-D array of item frequencies.
percentileofscore(a, score[, kind]) The percentile rank of a score relative to a list of scores.
scoreatpercentile(a, per[, limit, ...]) Calculate the score at a given percentile of the input sequence.
relfreq(a[, numbins, defaultreallimits, weights]) Returns a relative frequency histogram, using the histogram function.
binned_statistic(x, values[, statistic, ...]) Compute a binned statistic for one or more sets of data.
binned_statistic_2d(x, y, values[, ...]) Compute a bidimensional binned statistic for one or more sets of data.
binned_statistic_dd(sample, values[, ...]) Compute a multidimensional binned statistic for a set of data.
obrientransform(*args) Computes the O’Brien transform on input data (any number of arrays).
signaltonoise(*args, **kwds) signaltonoise is deprecated!
bayes_mvs(data[, alpha]) Bayesian confidence intervals for the mean, var, and std.
mvsdist(data) ‘Frozen’ distributions for mean, variance, and standard deviation of data.
sem(a[, axis, ddof, nan_policy]) Calculates the standard error of the mean (or standard error of measurement) of the values in the input array.
zmap(scores, compare[, axis, ddof]) Calculates the relative z-scores.
zscore(a[, axis, ddof]) Calculates the z score of each value in the sample, relative to the sample mean and standard deviation.
iqr(x[, axis, rng, scale, nan_policy, ...]) Compute the interquartile range of the data along the specified axis.
sigmaclip(a[, low, high]) Iterative sigma-clipping of array elements.
threshold(*args, **kwds) threshold is deprecated!
trimboth(a, proportiontocut[, axis]) Slices off a proportion of items from both ends of an array.
trim1(a, proportiontocut[, tail, axis]) Slices off a proportion from ONE end of the passed array distribution.
f_oneway(*args) Performs a 1-way ANOVA.
pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation.
spearmanr(a[, b, axis, nan_policy]) Calculates a Spearman rank-order correlation coefficient and the p-value to test for non-correlation.
pointbiserialr(x, y) Calculates a point biserial correlation coefficient and its p-value.
kendalltau(x, y[, initial_lexsort, nan_policy]) Calculates Kendall’s tau, a correlation measure for ordinal data.
linregress(x[, y]) Calculate a linear least-squares regression for two sets of measurements.
theilslopes(y[, x, alpha]) Computes the Theil-Sen estimator for a set of points (x, y).
f_value(*args, **kwds) f_value is deprecated!
ttest_1samp(a, popmean[, axis, nan_policy]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var, nan_policy]) Calculates the T-test for the means of two independent samples of scores.
ttest_ind_from_stats(mean1, std1, nobs1, ...) T-test for means of two independent samples from descriptive statistics.
ttest_rel(a, b[, axis, nan_policy]) Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]) Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2) Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity, alternative]) Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals) Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]) Assign ranks to data, dealing with ties appropriately.
ranksums(x, y) Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]) Calculate the Wilcoxon signed-rank test.
kruskal(*args, **kwargs) Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args) Computes the Friedman test for repeated measurements
combine_pvalues(pvalues[, method, weights]) Methods for combining the p-values of independent tests bearing upon the same hypothesis.
ss(*args, **kwds) ss is deprecated!
square_of_sums(*args, **kwds) square_of_sums is deprecated!
jarque_bera(x) Perform the Jarque-Bera goodness of fit test on sample data.
ansari(x, y) Perform the Ansari-Bradley test for equal scale parameters
bartlett(*args) Perform Bartlett’s test for equal variances
levene(*args, **kwds) Perform Levene test for equal variances.
shapiro(x[, a, reta]) Perform the Shapiro-Wilk test for normality.
anderson(x[, dist]) Anderson-Darling test for data coming from a particular distribution
anderson_ksamp(samples[, midrank]) The Anderson-Darling test for k-samples.
binom_test(x[, n, p, alternative]) Perform a test that the probability of success is p.
fligner(*args, **kwds) Perform Fligner-Killeen test for equality of variance.
median_test(*args, **kwds) Mood’s median test.
mood(x, y[, axis]) Perform Mood’s test for equal scale parameters.
boxcox(x[, lmbda, alpha]) Return a positive dataset transformed by a Box-Cox power transformation.
boxcox_normmax(x[, brack, method]) Compute optimal Box-Cox transform parameter for input data.
boxcox_llf(lmb, data) The boxcox log-likelihood function.
entropy(pk[, qk, base]) Calculate the entropy of a distribution for given probability values.
chisqprob(*args, **kwds) chisqprob is deprecated!
betai(*args, **kwds) betai is deprecated!
describe函數
這個函數的輸出太難看了！

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat_percent = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
age = np.array(age)
fat_percent = np.array(fat_percent)
data = np.vstack([age, fat_percent]).reshape([-1, 2])

print(stats.describe(data))
DescribeResult(nobs=18, minmax=(array([ 7.8, 17.8]), array([ 60., 61.])), mean=array([ 37.36111111, 37.86666667]), variance=array([ 236.58604575, 188.78588235]), skewness=array([-0.30733374, 0.40999364]), kurtosis=array([-0.65245849, -1.26315357]))
修改了一個輸出結果形式

for key, value in stats.describe(data)._asdict().items():
print(key, ':', value)
nobs : 18
minmax : (array([ 7.8, 17.8]), array([ 60., 61.]))
mean : [ 37.36111111 37.86666667]
variance : [ 236.58604575 188.78588235]
skewness : [-0.30733374 0.40999364]
kurtosis : [-0.65245849 -1.26315357]
也可使用pandas中的函數進行替代，這樣輸出比較舒服[python數據處理庫pandas]

機率分佈的熵和kl散度的計算 scipy.stats.entropy
scipy.stats.entropy(pk, qk=None, base=None)[source]
    Calculate the entropy of a distribution for given probability values.
    If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).
    If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
    This routine will normalize pk and qk if they don’t sum to 1.

香農熵的計算entropy

shannon_entropy = stats.entropy(ij/sum(ij), base=None)
print(shannon_entropy)
entropy的python直接實現

shannon_entropy_func = lambda pij: -sum(pij*np.log(pij))
shannon_entropy = shannon_entropy_func(ij[np.nonzero(ij)])
print(shannon_entropy)
def entropy(counts):
    '''Compute entropy.'''
    ps = counts/float(sum(counts)) # coerce to float and normalize
    ps = ps[nonzero(ps)]            # toss out zeros
    H = -sum(ps * numpy.log2(ps))   # compute entropy
    return H

兩個分佈的kl散度的計算

kl = sp.stats.entropy(fs_rv_dist, nonfs_rv_dist)
kl散度的其它實現[距離和類似度度量方法]

[scipy.stats.entropy¶]

假設檢驗相關的
ttest_1samp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis, equal_var]) Calculates the T-test for the means of TWO INDEPENDENT samples of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
kstest(rvs, cdf[, args, N, alternative, mode]) Perform the Kolmogorov-Smirnov test for goodness of fit.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
power_divergence(f_obs[, f_exp, ddof, axis, ...]) Cressie-Read power divergence statistic and goodness of fit test.
ks_2samp(data1, data2) Computes the Kolmogorov-Smirnov statistic on 2 samples.
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney rank test on samples x and y.
tiecorrect(rankvals) Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests.
rankdata(a[, method]) Assign ranks to data, dealing with ties appropriately.
ranksums(x, y) Compute the Wilcoxon rank-sum statistic for two samples.
wilcoxon(x[, y, zero_method, correction]) Calculate the Wilcoxon signed-rank test.
kruskal(*args) Compute the Kruskal-Wallis H-test for independent samples
friedmanchisquare(*args) Computes the Friedman test for repeated measurements

ttest_1samp實現了單樣本t檢驗。所以，若是咱們想檢驗數據Abra列的稻穀產量均值，經過零假設，這裏咱們假定整體稻穀產量均值爲15000，咱們有：

from scipy import stats as ss
# Perform one sample t-test using 1500 as the true mean
print ss.ttest_1samp(a = df.ix[:, 'Abra'], popmean = 15000)

# OUTPUT
(-1.1281738488299586, 0.26270472069109496)
返回下述值組成的元祖：

t : 浮點或數組類型
t統計量
prob : 浮點或數組類型
two-tailed p-value 雙側機率值
經過上面的輸出，看到p值是0.267遠大於α等於0.05，所以沒有充分的證聽說平均稻穀產量不是150000。將這個檢驗應用到全部的變量，一樣假設均值爲15000，咱們有：

print ss.ttest_1samp(a = df, popmean = 15000)

# OUTPUT
(array([ -1.12817385,   1.07053437, -65.81425599, -4.564575 ,   6.17156198]),
array([ 2.62704721e-01,   2.87680340e-01,   4.15643528e-70,
          1.83764399e-05,   2.82461897e-08]))
第一個數組是t統計量，第二個數組則是相應的p值。

皮皮blog

列聯表函數Contingency table functions
chi2_contingency(observed[, correction, lambda_]) Chi-square test of independence of variables in a contingency table.
contingency.expected_freq(observed) Compute the expected frequencies from a contingency table.
contingency.margins(a) Return a list of the marginal sums of the array a.
fisher_exact(table[, alternative]) Performs a Fisher exact test on a 2x2 contingency table.

繪圖測試Plot-tests
ppcc_max(x[, brack, dist]) Returns the shape parameter that maximizes the probability plot correlation coefficient for ppcc_plot(x, a, b[, dist, plot, N]) Returns (shape, ppcc), and optionally plots shape vs.
probplot(x[, sparams, dist, fit, plot]) Calculate quantiles for a probability plot, and optionally show the plot.
boxcox_normplot(x, la, lb[, plot, N]) Compute parameters for a Box-Cox normality plot, optionally show it.

Statistical functions for masked arrays (scipy.stats.mstats)

蒙面統計函數Masked statistics functions
argstoarray(*args) Constructs a 2D array from a group of sequences.
betai(a, b, x) Returns the incomplete beta function.
chisquare(f_obs[, f_exp, ddof, axis]) Calculates a one-way chi square test.
count_tied_groups(x[, use_missing]) Counts the number of tied values.
describe(a[, axis]) Computes several descriptive statistics of the passed array.
f_oneway(*args) Performs a 1-way ANOVA, returning an F-value and probability given any f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b) Calculation of Wilks lambda F-statistic for multivariate data, per Maxwell find_repeats(arr) Find repeats in arr and return a tuple (repeats, repeat_count).
friedmanchisquare(*args) Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA.
kendalltau(x, y[, use_ties, use_missing]) Computes Kendall’s rank correlation tau on two variables x and y.
kendalltau_seasonal(x) Computes a multivariate Kendall’s rank correlation tau, for seasonal data.
kruskalwallis(*args) Compute the Kruskal-Wallis H-test for independent samples
kruskalwallis(*args) Compute the Kruskal-Wallis H-test for independent samples
ks_twosamp(data1, data2[, alternative]) Computes the Kolmogorov-Smirnov test on two samples.
ks_twosamp(data1, data2[, alternative]) Computes the Kolmogorov-Smirnov test on two samples.
kurtosis(a[, axis, fisher, bias]) Computes the kurtosis (Fisher or Pearson) of a dataset.
kurtosistest(a[, axis]) Tests whether a dataset has normal kurtosis
linregress(*args) Calculate a regression line
mannwhitneyu(x, y[, use_continuity]) Computes the Mann-Whitney statistic
plotting_positions(data[, alpha, beta]) Returns plotting positions (or empirical percentile points) for the data.
mode(a[, axis]) Returns an array of the modal (most common) value in the passed array.
moment(a[, moment, axis]) Calculates the nth moment about the mean for a sample.
mquantiles(a[, prob, alphap, betap, axis, limit]) Computes empirical quantiles for a data array.

msign(x) Returns the sign of x, or 0 if x is masked.
normaltest(a[, axis]) Tests whether a sample differs from a normal distribution.
obrientransform(*args) Computes a transform on input data (any number of columns).
pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-plotting_positions(data[, alpha, beta]) Returns plotting positions (or empirical percentile points) for the data.
pointbiserialr(x, y) Calculates a point biserial correlation coefficient and the associated p-value.
rankdata(data[, axis, use_missing]) Returns the rank (also known as order statistics) of each data point along scoreatpercentile(data, per[, limit, ...]) Calculate the score at the given ‘per’ percentile of the sequence a.
sem(a[, axis, ddof]) Calculates the standard error of the mean (or standard error of measurement) signaltonoise(data[, axis]) Calculates the signal-to-noise ratio, as the ratio of the mean over standard skew(a[, axis, bias]) Computes the skewness of a data set.
skewtest(a[, axis]) Tests whether the skew is different from the normal distribution.
spearmanr(x, y[, use_ties]) Calculates a Spearman rank-order correlation coefficient and the p-value theilslopes(y[, x, alpha]) Computes the Theil slope as the median of all slopes between paired values.
threshold(a[, threshmin, threshmax, newval]) Clip array to a given value.
tmax(a, upperlimit[, axis, inclusive]) Compute the trimmed maximum
tmean(a[, limits, inclusive]) Compute the trimmed mean.
tmin(a[, lowerlimit, axis, inclusive]) Compute the trimmed minimum
trim(a[, limits, inclusive, relative, axis]) Trims an array by masking the data outside some given limits.
trima(a[, limits, inclusive]) Trims an array by masking the data outside some given limits.
trimboth(data[, proportiontocut, inclusive, ...]) Trims the smallest and largest data values.
trimmed_stde(a[, limits, inclusive, axis]) Returns the standard error of the trimmed mean along the given axis.
trimr(a[, limits, inclusive, axis]) Trims an array by masking some proportion of the data on each end.
trimtail(data[, proportiontocut, tail, ...]) Trims the data by masking values from one tail.
tsem(a[, limits, inclusive]) Compute the trimmed standard error of the mean.
ttest_onesamp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_ind(a, b[, axis]) Calculates the T-test for the means of TWO INDEPENDENT samples of ttest_onesamp(a, popmean[, axis]) Calculates the T-test for the mean of ONE group of scores.
ttest_rel(a, b[, axis]) Calculates the T-test on TWO RELATED samples of scores, a and b.
tvar(a[, limits, inclusive]) Compute the trimmed variance
variation(a[, axis]) Computes the coefficient of variation, the ratio of the biased standard deviation winsorize(a[, limits, inclusive, inplace, axis]) Returns a Winsorized version of the input array.
zmap(scores, compare[, axis, ddof]) Calculates the relative z-scores.
zscore(a[, axis, ddof]) Calculates the z score of each value in the sample, relative to the sample

單變量和多變量核密度估計Univariate and multivariate kernel density estimation (scipy.stats.kde)
gaussian_kde(dataset[, bw_method]) Representation of a kernel-density estimate using Gaussian kernels.

皮皮blog

統計函數使用舉例
連續分佈-Norm高斯分佈
{高斯[正態]分佈隨機變量,A normal continuous random variable.}

生成服從高斯分佈的隨機向量（從正態分佈中採樣）stats.norm.rvs(loc, scale, size)
參數：

The location (loc) keyword specifies the mean.

The scale (scale) keyword specifies the standard deviation.

norm經過loc和scale參數能夠指定隨機變量的偏移和縮放參數。對於正態分佈的隨機變量來講，這兩個參數至關於指定其指望值和標準差。

高斯分佈N(0,0.01)隨機誤差
y = stats.norm.rvs(loc=0, scale=0.1, size=10)
輸出：array([ 0.05419826, 0.04151471, -0.10784729, 0.18283546, 0.02348312, -0.04611974, 0.0069336 , 0.03840133, -0.05015316, 0.23315205])
y.stats()
(array(0.0), array(0.1)

Note: 也可使用numpy.random.norm函數生成高斯分佈隨機數[numpy庫 - 隨機數模塊numpy.random]。

求正態分佈最佳擬合參數stats.norm.fit(x)
>>> X =stats.norm(loc=1.0,scale=2.0,size = 100)
可使用fit()方法對隨機取樣序列x進行擬合，返回的是與隨機取樣值最吻合的隨機變量的參數
>>> stats.norm.fit(x) #獲得隨機序列的指望值和標準差
array([ 1.01810091, 2.00046946])

求正態分佈N(1,1)機率密度函數某個x對應的值

lambda x: norm.pdf(x, 1, 1)
Note: 從正態分佈機率密度中看出，這個和norm.pdf(x - 1)是不同的，只有標準差爲1時才相等。
求正態分佈N(1,1)累積分佈函數某個x對應的值

lambda x: norm.cdf(x, 1, 1)
繪製一維和二維正態分佈機率密度圖
[機率論：高斯分佈]
[scipy.stats.norm]

均勻分佈
mu = uniform.rvs(size=N) # 從均勻分佈採樣
伽瑪分佈
伽瑪分佈須要額外的形狀參數。伽瑪分佈可用於描述等待k個獨立的隨機事件發生所需的時間，k就是伽瑪分佈的形狀參數。
伽瑪分佈的尺度參數theta和隨機事件發生的頻率相關，由scale參數指定。
>>> stats.gamma.stats(2.0,scale=2)
(array(4.0), array(8.0))
根據伽瑪分佈的數學定義可知其指望值爲k*theta,方差爲k*theta^2 。上面的程序驗證了這兩個公式。當隨機分佈有額外的形狀參數時，它所對應的rvs()、pdf()等方法都會增長額外的參數以接收形狀參數。

離散分佈-二項分佈
假設有一種只有兩個結果的試驗，其成功機率爲 P,那麼二項分佈描述了進行n次這樣的獨立試驗而成功k次的機率。
二項分佈的機率質量函數公式以下：

使用二項分佈的機率質量函數pmf()能夠很容易計算出現k次6點的機率。

pmf()
pmf()的第一個參數爲隨機變量的取值，後面的參數爲描述隨機分佈所需的參數。對於二項分佈來講，參數分別爲n和P,而取值範圍則爲0到n之間的整數。

程序經過二項分佈的機率質量公式計算投擲5次骰子出現0到6所對應的機率：

>>> stats.binom.pmf(range(6), 5, 1/6.0)
array([0.401878, 0.401878, 0.166751, 0.032150, 0.003215, 0.000129])

由結果可知：出現0或1次6點的機率爲40.2%,而出現3次6點的機率爲3.215%

泊松分佈
在二項分佈中，若是試驗次數n很大，而每次試驗成功的機率p很小，其乘積np比較適中，那麼試驗成功次數的機率能夠用泊松分佈近似描述。
在泊松分佈中，使用lambda描述單位時間(或單位面積)內隨機事件的平均發生率。若是將二項分佈中的試驗次數n看做單位時間內所作的試驗次數，那麼它和事件出現機率P的乘積就是事件的平均發生率，即lambda = np。
泊松分佈的機率質量函數公式以下：

二項分佈的近似分佈
程序分別計算二項分佈和泊松分佈的機率質量函數，當n足夠大時，兩者是十分接近的。
程序中事件平均發生率lambda恆等於10。根據二項分佈的試驗次數計算每次事件出現的機率p=lambda/n。
>>> _lambda = 10.0
>>> k = np.arange(20)
>>> possion = stats .poisson .pmf(k, _lambda) # 泊松分佈
>>> binom100 = stats.binom.pmf(k, 100, _lambda/100) #二項式分佈 100
>>> binom1000=stats.binom.pmf(k, 1000 , _lambda/1000) #二項式分佈 1000
>>> np.max(np.abs(binom100-possion)) # 計算最大偏差
0.006755311103353312
>>> np.max(np.abs(binom1000-possion))# n爲 1000時，偏差較小
0.00063017540509099912

泊松分佈的模擬過程
泊松分佈適合描述單位時間內隨機事件發生次數的分佈狀況。例如某設施在必定時間內的使用次數。機器出現故障的次數。天然災害發生的次數等等。

下面使用隨機數模擬泊松分佈，並與其機率質量函數進行比較，事件每秒的平均發生次數爲lambda=10。其中觀察時間分別爲1000秒，50000秒。能夠看出：觀察時間越長，事件每秒發生的次數就越符合泊松分佈。

>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time )*time
>>> count, time_edges = np.histogram(t, bins=time, range=(0,time))
>>> count
array([10, 9, 8, …, 11, 10, 18])
>>>x = count_edges[:-1]
>>> dist, count_edges = np. histogram (count, bins=20, range= (0,20), normed=True)
>>> poisson = stats .poisson.pmf(x, _lambda)
>>> np.max(np.abs(dist-poisson)) #最大偏差很小,符合泊松分佈
0.0088356241037075706

Note: 用rand()產平生均分佈於0到time之間的_lambda*time 個事件所發生的時刻。
用histogram()能夠統計數組t中每秒以內事件發生的次數count。
根據泊松分佈的定義，count數組中數值的分佈狀況應該符合泊松分佈。統計事件次數在0到20區間內的機率分佈。當histogram()的normed參數爲True而且每一個統計區間的長度爲1時，其結果和機率質量函數相等。

泊松分佈的時間間隔:伽瑪分佈
還能夠換一個角度看隨機事件的分佈問題。能夠觀察相鄰兩個事件之間時間間隔的分佈狀況，或者隔k個事件的時間間隔的分佈狀況。根據機率論，事件之間的時間間隔應符合伽瑪分佈，因爲時間間隔能夠是任意數值，所以伽瑪分佈是一種連續機率分佈。伽瑪分佈的機率密度函數公式以下，它描述第k個亊件發生所需的等待時間的機率分佈。伽瑪函數，當 k爲整數時，它的值和k的階乘k!相等。

程序模擬事件的時間間隔的伽瑪分佈，觀察時間爲1 000秒，平均每秒產生10個事件。
圖中「k=1」，它表示相鄰兩個事件之間的時間間隔的分佈，而「k=2」則表示相隔一個事件的兩個事件之間的時間間隔的分佈，能夠看出它們都符合伽瑪分佈.

>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> t.sort()#計算事性先後的時間間隔，須要先對隨機時刻進行排序
>>> s1 = t[1:] - t[:-1] #相鄰兩個事件之間的時間間隔
>>> s2 = t[2:] - t[:-2] #相隔一個事件的兩個亊件之間的時間間隔
>>> dist1, x1= np.histogram(s1, bins=100, normed=True)
>>> dist2, x2 = np.histogram(s2 , bins=100, normed=True)
>>> gamma1 = stats.gamma.pdf((x1[:-1]+x1[1:])/2, 1, scale=1.0/_lambda)
>>> gamma2 = stats.gamma.pdf((x2[:-1]+x2[1:])/2, 2, scale=1.0/_lambda)
>>> np.max(np.abs(gamma1 - dist1))
0.13557317865888141
>>> np.max(np.abs(gamma2 - dist2))
0.087375030861794656
>>> np.max(gamma1), np.max(gamma2)
(9.3483221580498537, 3.6767953241013656) #因爲機率密度函數的值自己比較大，所以上面的偏差已經很小了:
Note:模擬伽瑪分佈:
首先在10000秒以內產生100000個隨機事件發生的時刻.所以事件的平均發生次數爲每秒10次;
爲了計算事性先後的時間間隔，須要先對隨機時刻進行排序;
histogram()返回的第二個值爲統計區間的邊界，採用gamma.pdf()計算伽瑪分佈的機率密度時，使用各個區間的中值進行計算。Pdf()的第二個參數爲k值，scale參數爲1/λ;

from:http://blog.csdn.net/pipisorry/article/details/49515215

ref:Statistical functions (scipy.stats)

python標準庫中的隨機分佈函數

相關標籤/搜索

scipy.stats

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。