Computing Gene Co-expression Features

Date: 2019-11-30

Tags: computing, gene expression features

Gene co-expression features

The following co-expression coefficient features were obtained from COXPRESdb.

Open the COXPRESdb page and click "bulk download".

Then download the budding yeast file.

At the bottom of the page there is also a description of the file format:

Under the directory named Hsa.coex.v6, 19777 files will appear.

 
 Hsa.coex.v6 ----- 1
               |-- 10
               |-- 100
               |-- ...
               |-- 9997

1
462 8.1 0.596 2158 10.9 0.590 189 12.7 0.574 ... 220963 19749.5 -0.163 130367 19760.5 -0.175

10
80168 4.9 0.553 10223 5.8 0.650 27284 5.9 0.608 ... 84058 19772.0 -0.276 83871 19775.5 -0.304

100
85449 37.9 0.478 140807 47.7 0.391 636 50.2 0.469 ... 126969 19269.8 -0.113 55930 19273.0 -0.082

Column 1; Entrez Gene ID of an opposite gene of coexpression (19776 genes)
Column 2; MR (Mutual Rank) as a final measure of coexpression. Lines are sorted by this value.
Column 3; Pearson's correlation coefficient of gene expression pattern
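The three-column layout described above can be read directly with pandas. The following is only a minimal sketch: it assumes each file holds one partner gene per line with whitespace-separated fields, the column names gene_id, MR and COR are labels chosen here for illustration, and the three rows are copied from the sample for gene 1.

```python
import io

import pandas as pd

# Three rows copied from the sample for gene 1 shown above,
# assumed to be stored one partner gene per line.
sample = io.StringIO(
    "462 8.1 0.596\n"
    "2158 10.9 0.590\n"
    "189 12.7 0.574\n"
)


def read_coex(source):
    """Read one per-gene file into a table indexed by partner Entrez Gene ID."""
    return pd.read_csv(
        source,
        sep=r"\s+",                      # accept spaces or tabs between columns
        header=None,
        names=["gene_id", "MR", "COR"],  # illustrative column names
        index_col="gene_id",
    )


coex = read_coex(sample)
# Mutual rank between gene 1 and gene 2158.
print(coex.loc[2158, "MR"])
```

Indexing by the partner gene's ID makes the later pair lookup a single `loc` call.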

After downloading, the data looks like this:

The folder Sce.v14-08.G4461-S3819.rma.mrgeo.d contains 4461 files.

Each file has 4461 lines, one for each of the 4461 Gene IDs.
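Before going further it can be worth confirming that the download is complete. The helper below is a hypothetical sanity check, not part of the original workflow: it counts the files in a folder and collects the distinct line counts found in them.

```python
import os


def check_coex_dir(coex_dir):
    """Return (number of files, set of distinct line counts) for a folder."""
    names = [
        name for name in os.listdir(coex_dir)
        if os.path.isfile(os.path.join(coex_dir, name))
    ]
    line_counts = set()
    for name in names:
        with open(os.path.join(coex_dir, name)) as handle:
            line_counts.add(sum(1 for _ in handle))
    return len(names), line_counts
```

For the yeast folder described here, `check_coex_dir('Sce.v14-08.G4461-S3819.rma.mrgeo.d')` would be expected to return `(4461, {4461})` if every file is intact.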

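The problem is the same as before: the data only has three columns. The first is the Entrez Gene ID, the second is MR, and the third is COR.

So how do we get the mapping between proteins and Entrez Gene IDs?

From various sources we know that the Entrez Gene ID is simply the ID used by NCBI, and that the corresponding GeneID can be downloaded from UniProt.

On UniProt, click the pencil icon on the right to add a GeneID column.

Then find GeneID under Genome annotation and tick it (it was buried deep, but it still could not escape my sharp eyes). This gives us the GeneID information.

Then download the resulting data.

At this point the data preparation is basically done.

We now have one folder and one file that maps UniProt entries to GeneIDs.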

uniprot_to_geneid.csv
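The GeneID cells in the UniProt export usually need a little cleaning before they can be used as file names. The sketch below assumes the cells look like `850504;` (one or more IDs, each followed by a semicolon) and keeps only the first ID; that format, the helper name `first_gene_id`, and the accessions and IDs in the example are all made up for illustration and should be checked against the real file.

```python
import pandas as pd


def first_gene_id(cell):
    """Return the first Entrez Gene ID in a mapping cell as an int, or None."""
    if pd.isna(cell):
        return None
    first = str(cell).split(";")[0].strip()
    try:
        return int(float(first))  # also copes with values parsed as 850504.0
    except ValueError:
        return None


# Made-up accessions and IDs, only to exercise the helper.
mapping = pd.Series(
    {"PROT_A": "111111;", "PROT_B": "2222222;3333333;", "PROT_C": float("nan")}
)
gene_ids = {acc: first_gene_id(cell) for acc, cell in mapping.items()}
print(gene_ids)
```

Splitting on the semicolon avoids relying on every ID having the same number of digits.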

Ensemble learning prediction of protein–protein interactions using proteins functional annotations

Take the UniProt code columns (idA, idB) from the paper's data:

yeast_gold_protein_pair.csv

Now the data is ready. Time to write the code!

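# -*- coding: utf-8 -*-
"""
Created on Thu Nov 10 10:49:21 2016
@author: sun
"""
import os

import numpy as np
import pandas as pd

COEX_DIR = 'Sce.v14-08.G4461-S3819.rma.mrgeo.d'

yeast_gold_protein_pair = pd.read_csv('yeast_gold_protein_pair.csv', usecols=['idA', 'idB'])
GeneID = pd.read_csv('uniprot_to_geneid.csv',
                     usecols=['Entry', 'Cross-reference (GeneID)'],
                     index_col=0)
geneid_col = GeneID['Cross-reference (GeneID)']

# Note: loc selects data by label, iloc selects data by position.
# reindex is used instead of loc so that UniProt codes missing from the
# mapping become NaN instead of raising a KeyError.
idA = geneid_col.reindex(yeast_gold_protein_pair.idA).reset_index(drop=True)
idB = geneid_col.reindex(yeast_gold_protein_pair.idB).reset_index(drop=True)


def first_gene_id(cell):
    """Return the first Gene ID in a cell such as '850504;' as an int, or None."""
    if pd.isna(cell):
        return None
    try:
        return int(float(str(cell).split(';')[0].strip()))
    except ValueError:
        return None


mr = []
cor = []
for i in range(len(idA)):
    GeneIDA = first_gene_id(idA.iloc[i])
    GeneIDB = first_gene_id(idB.iloc[i])
    mr_value, cor_value = np.nan, np.nan  # default when anything is missing
    if GeneIDA is not None and GeneIDB is not None:
        # Each file is named after a Gene ID and lists its partner genes.
        path = os.path.join(COEX_DIR, str(GeneIDA))
        if os.path.exists(path):
            # Column 0 is the partner Gene ID, column 1 is MR, column 2 is COR.
            coex = pd.read_csv(path, header=None, sep=r'\s+', index_col=0)
            if GeneIDB in coex.index:
                mr_value = coex.loc[GeneIDB, 1]
                cor_value = coex.loc[GeneIDB, 2]
    mr.append(mr_value)
    cor.append(cor_value)

yeast_gold_protein_pair['MR'] = mr
yeast_gold_protein_pair['COR'] = cor
# Write missing values as 'nan', as in the original output.
yeast_gold_protein_pair.to_csv('coexpression.csv', index=False, na_rep='nan')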
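The loop above reads the co-expression file of protein A again for every pair, even when the same protein appears in many pairs. One possible speed-up, sketched here under the same file-layout assumptions, is to cache the parsed files. The names `load_coex` and `lookup` and the cache size of 256 are choices made for this illustration, not part of the original script.

```python
import os
from functools import lru_cache

import numpy as np
import pandas as pd

COEX_DIR = "Sce.v14-08.G4461-S3819.rma.mrgeo.d"


@lru_cache(maxsize=256)  # arbitrary bound to keep memory use modest
def load_coex(gene_id, coex_dir=COEX_DIR):
    """Parse one per-gene file; return None when the file does not exist."""
    path = os.path.join(coex_dir, str(gene_id))
    if not os.path.exists(path):
        return None
    # Column 0 is the partner Gene ID, column 1 is MR, column 2 is COR.
    return pd.read_csv(path, header=None, sep=r"\s+", index_col=0)


def lookup(gene_a, gene_b, coex_dir=COEX_DIR):
    """Return (MR, COR) for a gene pair, or (nan, nan) if either is missing."""
    coex = load_coex(gene_a, coex_dir)
    if coex is None or gene_b not in coex.index:
        return np.nan, np.nan
    return coex.loc[gene_b, 1], coex.loc[gene_b, 2]
```

Inside the loop, `mr_value, cor_value = lookup(GeneIDA, GeneIDB)` would replace the nested `if` blocks. Sorting the pairs by idA first would make the bounded cache hit more often.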