A Plain Explanation of the Bootstrap

Date: 2019-12-09


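The bootstrap here is not Twitter's front-end framework but a concept from statistics. It is explained below by way of an experiment.
Suppose there is an event that was observed 10,000,000 times, and its counts follow a Poisson distribution. Of course, we pretend that we do not know it is Poisson.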

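import numpy as np
import scipy.stats
import matplotlib.pyplot as plt

# The full "population": 10,000,000 draws from a Poisson distribution with rate 2.
ALL = np.random.poisson(2, size=10000000)
ALL.mean()  # 2.005085 in the original run; varies slightly between runs
ALL.var()   # 2.0007084414277481 in the original run

# Plot the theoretical Poisson(2) probability mass function for reference.
x = np.arange(0, 20)
y = scipy.stats.poisson(2).pmf(x)
fig = plt.figure()
plot = fig.add_subplot(111)
plot.plot(x, y)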

We only know a single sample drawn from it. That sample alone tells us very little; for example, even its mean is noticeably off.

SAMPLE = np.random.choice(ALL, size=20)
# array([1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 1, 4, 2, 5, 2])
SAMPLE.mean()
# 1.3500000000000001
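To see why a mean computed from only 20 observations can land this far from 2, here is a minimal sketch that draws many independent samples of size 20 and measures how much their means vary. The names `rng`, `population`, and `sample_means` are illustrative and not from the original, and the population is a smaller stand-in for ALL so that it runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Smaller stand-in for ALL (assumption: 1,000,000 draws is enough here).
population = rng.poisson(2, size=1_000_000)

# 2000 independent samples of size 20, one sample per row.
samples = rng.choice(population, size=(2000, 20))
sample_means = samples.mean(axis=1)

# Spread of the sample means; theory gives sqrt(variance / n) = sqrt(2 / 20),
# which is roughly 0.316.
spread = sample_means.std()
print(spread)
```

With a spread of roughly 0.3, a sample mean of 1.35 sits about two standard errors below the true value, which is unusual but not implausible.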

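Now we use the bootstrap (one variant of it, resampling): draw repeatedly from SAMPLE with replacement and compute the mean of each resample. The distribution of those means lets us compute a confidence interval.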

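# np.random.choice samples with replacement by default, which is what the bootstrap needs.
samples = [np.random.choice(SAMPLE, size=20) for _ in range(1000)]
means = [s.mean() for s in samples]
# Histogram of the 1000 resampled means, drawn on the axes created earlier.
plot.hist(means, bins=30)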
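The histogram only shows the shape of the resampled means. One common way to turn them into an interval is the percentile method: take the 2.5% and 97.5% quantiles of the bootstrap means. The sketch below does this for the 20 values printed for SAMPLE above. The helper `bootstrap_mean_ci` and its parameters are hypothetical names introduced for illustration, and the percentile method is one choice among several, not necessarily the one the author had in mind.

```python
import numpy as np

# The 20 observed values listed for SAMPLE above.
sample = np.array([1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 2, 1, 4, 2, 5, 2])

def bootstrap_mean_ci(data, rng, n_resamples=1000, level=0.95):
    """Percentile bootstrap confidence interval for the mean (illustrative helper)."""
    n = len(data)
    # Each row holds n indices drawn with replacement, i.e. one resample.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = data[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(means, [alpha, 1.0 - alpha])
    return lower, upper

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
low, high = bootstrap_mean_ci(sample, rng)
print(low, high)
```

The interval is centred near the sample mean of 1.35, not near the true mean, so its usefulness depends on how representative SAMPLE happens to be.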

We can repeat this a few times:

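def plot_hist():
    # Four panels, each built from a different random SAMPLE.
    fig = plt.figure()
    plot1 = fig.add_subplot(221)
    plot2 = fig.add_subplot(222)
    plot3 = fig.add_subplot(223)
    plot4 = fig.add_subplot(224)
    for plot in (plot1, plot2, plot3, plot4):
        SAMPLE = np.random.choice(ALL, size=50)
        # Each resample has the same size as SAMPLE, as in the standard bootstrap
        # (the original code resampled only 20 of the 50 values).
        samples = [np.random.choice(SAMPLE, size=SAMPLE.size) for _ in range(1000)]
        means = [s.mean() for s in samples]
        plot.clear()
        plot.hist(means, bins=30)
    return fig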

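As the plots show, the randomness of SAMPLE still has a large effect on the resulting histogram: each panel is centred on its own sample mean. Even so, hypothesis tests computed this way are for the most part reliable.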
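One way to probe the "mostly reliable" claim is to repeat the whole procedure many times and count how often a 95% percentile interval contains the true mean of 2. This is equivalent to testing the hypothesis "mean = 2" at the 5% level and counting how often it is not rejected. The sketch below is an interpretation of the claim, not the author's own check; `ci_covers_true_mean` is a hypothetical helper, and it draws each sample directly from a Poisson(2) generator as a shortcut for sampling from ALL.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
TRUE_MEAN = 2.0

def ci_covers_true_mean(n=50, n_resamples=1000):
    """Draw one sample, bootstrap its mean, and report whether the
    95% percentile interval contains TRUE_MEAN (illustrative helper)."""
    sample = rng.poisson(TRUE_MEAN, size=n)
    idx = rng.integers(0, n, size=(n_resamples, n))  # resample with replacement
    means = sample[idx].mean(axis=1)
    lower, upper = np.quantile(means, [0.025, 0.975])
    return bool(lower <= TRUE_MEAN <= upper)

trials = 500
coverage = sum(ci_covers_true_mean() for _ in range(trials)) / trials
print(coverage)
```

If the procedure behaves well, the printed fraction should be close to 0.95, usually a little below it for small samples, even though any individual histogram may be shifted.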
