Hands-on: Using an SVM to Predict Whether Gene Expression Samples Are Cancerous


As we mentioned earlier, gene expression analysis has a wide variety of applications, including cancer studies. In 1999, Uri Alon analyzed gene expression data for 2,000 genes from 40 colon tumor tissues and compared them with data from colon tissues belonging to 21 healthy individuals, all measured at a single time point. We can represent his data as a 2,000 × 61 gene expression matrix, where the first 40 columns describe tumor samples and the last 21 columns describe normal samples.

Now, suppose you performed a gene expression experiment with a colon sample from a new patient, corresponding to a 62nd column in an augmented gene expression matrix. Your goal is to predict whether this patient has a colon tumor. Since the partition of tissues into two clusters (tumor vs. healthy) is known in advance, it may seem that classifying the sample from a new patient is easy. Indeed, since each patient corresponds to a point in 2,000-dimensional space, we can compute the center of gravity of these points for the tumor sample and for the healthy sample. Afterwards, we can simply check which of the two centers of gravity is closer to the new tissue.
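The center-of-gravity idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `centroid` and `nearest_centroid_label` are hypothetical, samples are stored one per row, and the 3-gene vectors are made-up numbers rather than Alon's data.

```python
import math


def centroid(samples):
    """Return the coordinate-wise mean (center of gravity) of equal-length vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]


def nearest_centroid_label(tumor, healthy, new_sample):
    """Label new_sample by whichever class centroid is closer (Euclidean distance)."""
    d_tumor = math.dist(centroid(tumor), new_sample)
    d_healthy = math.dist(centroid(healthy), new_sample)
    return "tumor" if d_tumor < d_healthy else "healthy"


# Toy 3-gene expression vectors (made-up numbers, not Alon's data).
tumor = [[5.0, 1.0, 3.0], [6.0, 2.0, 3.0], [7.0, 0.0, 3.0]]
healthy = [[1.0, 4.0, 3.0], [2.0, 6.0, 3.0]]

print(centroid(tumor))                                          # [6.0, 1.0, 3.0]
print(nearest_centroid_label(tumor, healthy, [5.5, 1.5, 3.0]))  # tumor
```

Every gene contributes equally to the distance here, so with 2,000 genes the few informative ones can be drowned out by the rest.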

Alternatively, we could perform a blind analysis, pretending that we do not already know the classification of samples into cancerous vs. healthy, and analyze the resulting 2,000 × 62 expression matrix to divide the 62 samples into two clusters. If we obtain a cluster consisting predominantly of cancer tissues, this cluster may help us diagnose colon cancer.
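A blind analysis of this kind could look roughly like the sketch below, which runs Lloyd's algorithm with two clusters and only afterwards compares the clusters with the known labels. The function `two_means`, its deterministic seeding from the first and last points, and the toy 2-gene data are all assumptions made for illustration; the passage does not say which clustering algorithm to use.

```python
import math


def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(points, iterations=20):
    """Lloyd's algorithm for k = 2; returns a 0/1 cluster index per point."""
    # Assumption: seed the two centers with the first and last points.
    centers = [points[0], points[-1]]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        labels = [0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
                  for p in points]
        # Update step: move each center to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = mean(members)
    return labels


# Toy 2-gene samples (made-up numbers); the labels are used only to inspect the result.
samples = [[5.0, 1.0], [6.0, 2.0], [7.0, 0.0], [1.0, 4.0], [2.0, 6.0]]
true_labels = ["tumor", "tumor", "tumor", "healthy", "healthy"]

clusters = two_means(samples)
for k in (0, 1):
    print(k, [t for t, c in zip(true_labels, clusters) if c == k])
```

On this toy data the clusters line up with the labels, but nothing in the procedure guarantees that on real expression data, where the dominant source of variation may be unrelated to cancer.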

Final Challenge: These approaches may seem straightforward, but it is unlikely that either of them will reliably diagnose the new patient. Why do you think this is? Given Alon’s 2,000 × 61 gene expression matrix and gene data from a new patient, derive a superior approach to evaluate whether this patient is likely to have a colon tumor.

1. Principles and functions

See:

https://www.cnblogs.com/dfcao/p/3462721.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

2. Code

Data:

40 Cancer Samples

21 Healthy Samples

Unknown Sample

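Problem analysis:

This is a classification problem with 61 labeled samples and 2,000 features. With far more features than samples, an SVM with a Gaussian (RBF) kernel is likely to overfit, so a linear kernel is used instead.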
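One way to check the kernel choice rather than assume it is to cross-validate both kernels. The sketch below does this on a synthetic stand-in for the data, since the real files are not reproduced here. The random matrix, the 50 shifted genes, and the 5-fold setup are assumptions for illustration, so the printed accuracies say nothing about Alon's data.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the expression matrix, one sample per row; NOT Alon's data.
n_tumor, n_healthy, n_genes = 40, 21, 2000
X = rng.normal(size=(n_tumor + n_healthy, n_genes))
y = np.array([1] * n_tumor + [0] * n_healthy)
X[y == 1, :50] += 0.5  # assumption: 50 genes are shifted in tumor samples

results = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel)
    # With an integer cv and a classifier, the folds are stratified by class.
    scores = cross_val_score(clf, X, y, cv=5)
    results[kernel] = scores.mean()
    print(f"{kernel}: mean 5-fold accuracy = {scores.mean():.2f}")
```

Replacing `X` and `y` with the real samples would give a per-kernel estimate that uses all 61 samples for evaluation instead of a single small held-out set.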

Code:

from os.path import dirname, join

from sklearn import svm


def read_matrix(filename):
    """Read a whitespace-separated file with one sample per row."""
    with open(join(dirname(__file__), filename)) as f:
        return [list(map(float, line.split())) for line in f.read().strip().split('\n')]


def Input():
    X, Y = [], []              # training samples and labels
    check_x, check_y = [], []  # held-out samples used to estimate accuracy

    # Tumor samples (label 1): hold out the first 10, train on the rest.
    dataset1 = read_matrix('colon_cancer.txt')
    X += dataset1[10:]
    check_x += dataset1[:10]
    Y += [1] * (len(dataset1) - 10)
    check_y += [1] * 10

    # Healthy samples (label 0): hold out the first 5, train on the rest.
    dataset2 = read_matrix('colon_healthy.txt')
    X += dataset2[5:]
    check_x += dataset2[:5]
    Y += [0] * (len(dataset2) - 5)
    check_y += [0] * 5

    # The new patient's sample, whose label is unknown.
    test_X = read_matrix('colon_test.txt')

    return [X, Y, test_X, check_x, check_y]


if __name__ == '__main__':
    [X_train, y_train, test_X, check_x, check_y] = Input()

    # Linear kernel; gamma is dropped because it has no effect on a linear kernel.
    clf = svm.SVC(kernel='linear')
    clf.fit(X_train, y_train)

    predict_for_check = clf.predict(check_x)
    cnt = 0
    for i in range(len(check_y)):
        if check_y[i] == predict_for_check[i]:
            cnt += 1
    # Scale the fraction to a percentage before printing.
    print('Accuracy %.0f%%' % (100 * cnt / len(check_y)))

    print(clf.predict(test_X))

Output:

Accuracy 87%
[0]

Oddly, when only the first 20 genes are used for the analysis, accuracy on the held-out check samples rises to over 90%:

Accuracy 93%

  [0]
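Before reading much into the jump from 87% to 93%, it helps to work out how coarse the accuracy estimate is. The short calculation below assumes the 15 held-out samples used by the script (10 tumor plus 5 healthy); under that assumption the two reported figures correspond to 13 and 14 correct predictions.

```python
n_check = 10 + 5  # 10 tumor + 5 healthy samples held out by the script

for correct in (13, 14):
    print(f"{correct}/{n_check} correct = {100 * correct / n_check:.0f}%")

step = 100 / n_check  # accuracy change caused by a single sample
print(f"one sample = {step:.1f} percentage points")
```

The difference between the two runs is therefore a single check sample, which is too little evidence to conclude that 20 genes are better than 2,000.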
