The Pipeline mechanism in sklearn

<div id="article_content" class="article_content clearfix csdn-tracking-statistics" data-pid="blog" data-mod="popu_307" data-dsm="post" style="height: 1264px; overflow: hidden;"> 轉載自:https://blog.csdn.net/lanchunhui/article/details/50521648 <div class="markdown_views"> <pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline</code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li></ul></pre>css

The reason pipelines are so useful in machine learning is that the parameters fitted on one dataset can be **reused** on new data (for example, a test set).

A pipeline encapsulates and manages all of the steps as a single flow (**streaming workflows with pipelines**).
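To make the point concrete, here is a minimal sketch (with synthetic NumPy arrays as stand-in data, not the dataset used below): whatever `fit` learns on the training data, e.g. the scaler's mean and standard deviation, is reused as-is when the pipeline is applied to new data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_tr, y_tr = rng.randn(80, 4), rng.randint(0, 2, 80)   # stand-in training data
X_new = rng.randn(10, 4)                               # "new" data, e.g. a test set

pipe = Pipeline([('sc', StandardScaler()),
                 ('clf', LogisticRegression())])
pipe.fit(X_tr, y_tr)      # scaling statistics and model weights are learned here, once
pipe.predict(X_new)       # the same fitted parameters are reused on the new data
```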

Note: the pipeline mechanism is more of an innovation in programming technique than an innovation in algorithms.

Next, let's demonstrate the powerful Pipeline facility in sklearn with a concrete example:

<h2 id="1-加載數據集"><a name="t0"></a>1. <strong>加載數據集</strong></h2>markdown

<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">from</span> pandas <span class="hljs-keyword">as</span> pd <span class="hljs-keyword">from</span> sklearn.cross_validation <span class="hljs-keyword">import</span> train_test_split <span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> LabelEncoder df = pd.read_csv(<span class="hljs-string">'https://archive.ics.uci.edu/ml/machine-learning-databases/'</span> <span class="hljs-string">'breast-cancer-wisconsin/wdbc.data'</span>, header=<span class="hljs-keyword">None</span>) <span class="hljs-comment"># Breast Cancer Wisconsin dataset</span> X, y = df.values[:, <span class="hljs-number">2</span>:], df.values[:, <span class="hljs-number">1</span>] <span class="hljs-comment"># y爲字符型標籤</span> <span class="hljs-comment"># 使用LabelEncoder類將其轉換爲0開始的數值型</span> encoder = LabelEncoder() y = encoder.fit_transform(y) &gt;&gt;&gt; encoder.transform([<span class="hljs-string">'M'</span>, <span class="hljs-string">'B'</span>]) array([<span class="hljs-number">1</span>, <span class="hljs-number">0</span>]) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=<span class="hljs-number">.2</span>, random_state=<span class="hljs-number">0</span>) </code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li><li style="color: rgb(153, 153, 153);">15</li><li style="color: rgb(153, 153, 153);">16</li><li style="color: rgb(153, 153, 153);">17</li></ul></pre>網絡

<h2 id="2-構思算法的流程"><a name="t1"></a>2. <strong>構思算法的流程</strong></h2>dom

The steps that can go into a Pipeline might include:

- Feature standardization is needed, so it can serve as the first step.
- Since we are building a classifier, a classifier is indispensable and is naturally the last step.
- In between we can add, for example, dimensionality reduction (PCA).
- ...

<pre class="prettyprint" name="code"><code class="hljs python has-numbering"><span class="hljs-keyword">from</span> sklearn.preprocessing <span class="hljs-keyword">import</span> StandardScaler <span class="hljs-keyword">from</span> sklearn.decomposition <span class="hljs-keyword">import</span> PCA <span class="hljs-keyword">from</span> sklearn.linear_model <span class="hljs-keyword">import</span> LogisticRegression <span class="hljs-keyword">from</span> sklearn.pipeline <span class="hljs-keyword">import</span> Pipeline pipe_lr = Pipeline([(<span class="hljs-string">'sc'</span>, StandardScaler()), (<span class="hljs-string">'pca'</span>, PCA(n_components=<span class="hljs-number">2</span>)), (<span class="hljs-string">'clf'</span>, LogisticRegression(random_state=<span class="hljs-number">1</span>)) ]) pipe_lr.fit(X_train, y_train) print(<span class="hljs-string">'Test accuracy: %.3f'</span> % pipe_lr.score(X_test, y_test)) <span class="hljs-comment"># Test accuracy: 0.947</span></code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li><li style="color: rgb(153, 153, 153);">2</li><li style="color: rgb(153, 153, 153);">3</li><li style="color: rgb(153, 153, 153);">4</li><li style="color: rgb(153, 153, 153);">5</li><li style="color: rgb(153, 153, 153);">6</li><li style="color: rgb(153, 153, 153);">7</li><li style="color: rgb(153, 153, 153);">8</li><li style="color: rgb(153, 153, 153);">9</li><li style="color: rgb(153, 153, 153);">10</li><li style="color: rgb(153, 153, 153);">11</li><li style="color: rgb(153, 153, 153);">12</li><li style="color: rgb(153, 153, 153);">13</li><li style="color: rgb(153, 153, 153);">14</li></ul></pre>

A Pipeline object accepts a **list of 2-element tuples**. The first element of each tuple is an arbitrary **identifier string** that we use to access the individual elements of the Pipeline object; the second element is the scikit-learn **transformer or estimator** that goes with it.

<pre class="prettyprint" name="code"><code class="hljs bash has-numbering">Pipeline([(<span class="hljs-string">'sc'</span>, StandardScaler()), (<span class="hljs-string">'pca'</span>, PCA(n_components=<span class="hljs-number">2</span>)), (<span class="hljs-string">'clf'</span>, LogisticRegression(random_state=<span class="hljs-number">1</span>))])</code><ul class="pre-numbering" style=""><li style="color: rgb(153, 153, 153);">1</li></ul></pre>

<h2 id="3-pipeline執行流程的分析"><a name="t2"></a>3. <strong>Pipeline執行流程的分析</strong></h2>

The intermediate steps of a Pipeline consist of scikit-learn **transformers**, and the last step is an **estimator**. In the code above, the *StandardScaler* and *PCA* transformers form the intermediate steps, while *LogisticRegression* serves as the final estimator.

When we call `pipe_lr.fit(X_train, y_train)`, *StandardScaler* first runs its *fit* and *transform* methods on the training set, and the transformed data is passed on to the next step of the Pipeline object, namely PCA(). Just like *StandardScaler*, PCA also runs fit and transform, and finally passes the transformed data to *LogisticRegression*. The whole flow is shown in the figure below:

![Pipeline execution flow](https://img-blog.csdn.net/20160115095855517)
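To spell out what the figure shows, here is a hand-written equivalent of `pipe_lr.fit` and `pipe_lr.score` using separate objects; this is only a sketch of the equivalent sequence of calls, not the internal implementation:

```python
sc, pca, clf = StandardScaler(), PCA(n_components=2), LogisticRegression(random_state=1)

# fit: every intermediate transformer runs fit + transform, the final estimator runs fit
Z_train = pca.fit_transform(sc.fit_transform(X_train))
clf.fit(Z_train, y_train)

# score/predict: intermediate steps only transform (no refitting), then the estimator scores
Z_test = pca.transform(sc.transform(X_test))
print('Test accuracy: %.3f' % clf.score(Z_test, y_test))
```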

<h2 id="4-pipeline-與深度神經網絡的multi-layers"><a name="t3"></a>4. <strong>pipeline 與深度神經網絡的multi-layers</strong></h2>

The only difference is that the concept of a step (step) is replaced by the concept of a layer (layer); even the last step and the output layer mean essentially the same thing.

This is just to raise a question: isn't there at least a tiny bit of similarity?
