背景:php
什麼是富集分析,本身能夠百度。我到目前也沒發現一個比較通俗易懂的介紹。直接理解爲一種統計學方法就能夠了。 用於查看顯著性。html
富集分析有不少種,最多見的是GO富集分析。也有pathway富集分析。[pathway的我目前不會啊 ::>_<:: ]java
工具:python
也有不少種,我這裏主要是用Ontologizer (links:http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html)web
須要的文件:express
1.gene_ontology_edit.obo (這個相似GO的庫文件, links:http://www.geneontology.org/GO.downloads.ontology.shtml)app
2.whole_gene_list (這是所研究物種中,包含的全部gene_list)工具
經過一下腳本能夠從CDS文件中提取gene_listoop
grep "^>" sequence.fasta | tr -d ">"> sequenceHeader.listthis
3. sample_1_gene_list (其中一個點的gene_list)
4.最後一個也是最關鍵的。就是要有一個Association file .這個經過一下腳本生成
4.1須要2個文件 和一個腳本
文件1:GO_ids 文件 格式以下
gene136870 GO:0006950,GO:0005737
gene143400 0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020
文件2: GOIDTermIndexFile 這個能夠下載到 links:http://www.geneontology.org/GO.downloads.files.shtml
腳本1:
import sys #input the gene annotation file (inFile) and your GOID/terms index file (goFile)執行:
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website inFile = open(sys.argv[1],'r') goFile = open(sys.argv[2],'r') #parse the GOID/terms index and store it in the dictionary, goHash goHash = {} for line in goFile: #skip lines that start with '!' character as they are comment headers if line[0] != "!": data = line.strip().split('\t')
#skip obsolete terms
if data[-1] != 'obs':
for info in data:
if info[0:3] == "GO:":
#create dictionary of term aspects goHash[data[0]] = data[-1] #Here are some columns that the GAF format wants. #Since Ontologizer doesn't care about this, we can just make it up DB = 'yourOrganism' DBRef = 'PMID:0000000' DBEvi = 'ISO' DBObjectType = 'gene' DBTaxon = 'taxon:79327' DBDate = '23022011' DBAssignedBy = 'PFAM' #potential obselete goids that you have in your annotation potentialObs = []
#if you specified to not print out obsolete goids, then print out the .gaf if len(sys.argv) == 3: print '!gaf-version: 2.0' #Loop through the GO annotation file and generate assocation lines. for line in inFile: data = line.strip().split('\t') #if gene has go annotations if len(data) > 1: #gid is the gene id, goIDs is an array of assocated GO ids gid = data[0] goIDs = data[1].split(',') #second column of the .gaf file that Ontologizer doesn't care about DBID = "db." gid #third column is the name of the gene which Ontologizer does use DBObjSym = gid #for each GO ID in the line, print out an association line for goID in goIDs: if goHash.has_key(goID): DBAspect = goHash[goID] DBObjName = 'myOrganism' DBID outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
#only print out the .gaf file if you didn't specify to print out obsolete goids. if len(sys.argv) == 3: print '\t'.join(outArray) else: potentialObs.append(goID)
#if there is a 4th argument, print out the potential obsolete list if len(sys.argv) == 4: print '\n'.join(set(potentialObs))
python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf
以上4個文件都有了 能夠執行Ontologizer了 參考:http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html
java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology_edit.obo -s moso_target_gene.txt -p sequnecHeader.list -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000
5.這個軟件是能夠畫圖的。可是須要下載軟件。下載步驟以下: (as root )
yum list available 'graphviz*' yum install 'graphviz*'
dot -Tpng view-4hourSMinduced-Parent-Child-Westfall-Young-Single-Step.dot -oExample.png
還能夠參考 http://blog.nextgenetics.net/?e=5
Most gene enrichment websites out there only allow you to find enrichments for popular model organisms using pre-established gene ontology annotations. I ran into this problem early on during my phd when confronted with having to generate enrichment data on Schmidtea mediterranea.
To get custom enrichments, I used Ontologizer. The inputs for ontologizer are:
grep "^>" sequence.fasta | tr -d ">" > sequenceHeader.list
Assuming your GO annotations are formatted in two columns where first column is the gene id and second column is a comma separated list of your GO ids:
gene136870 GO:0006950,GO:0005737
gene143400 GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020
Here is the python script to generate the association file with comments:
import sys #input the gene annotation file (inFile) and your GOID/terms index file (goFile)
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website inFile = open(sys.argv[1],'r') goFile = open(sys.argv[2],'r') #parse the GOID/terms index and store it in the dictionary, goHash goHash = {} for line in goFile: #skip lines that start with '!' character as they are comment headers if line[0] != "!": data = line.strip().split('\t')
#skip obsolete terms
if data[-1] != 'obs':
for info in data:
if info[0:3] == "GO:":
#create dictionary of term aspects goHash[data[0]] = data[-1] #Here are some columns that the GAF format wants. #Since Ontologizer doesn't care about this, we can just make it up DB = 'yourOrganism' DBRef = 'PMID:0000000' DBEvi = 'ISO' DBObjectType = 'gene' DBTaxon = 'taxon:79327' DBDate = '23022011' DBAssignedBy = 'PFAM' #potential obselete goids that you have in your annotation potentialObs = []
#if you specified to not print out obsolete goids, then print out the .gaf if len(sys.argv) == 3: print '!gaf-version: 2.0' #Loop through the GO annotation file and generate assocation lines. for line in inFile: data = line.strip().split('\t') #if gene has go annotations if len(data) > 1: #gid is the gene id, goIDs is an array of assocated GO ids gid = data[0] goIDs = data[1].split(',') #second column of the .gaf file that Ontologizer doesn't care about DBID = "db." gid #third column is the name of the gene which Ontologizer does use DBObjSym = gid #for each GO ID in the line, print out an association line for goID in goIDs: if goHash.has_key(goID): DBAspect = goHash[goID] DBObjName = 'myOrganism' DBID outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
#only print out the .gaf file if you didn't specify to print out obsolete goids. if len(sys.argv) == 3: print '\t'.join(outArray) else: potentialObs.append(goID)
#if there is a 4th argument, print out the potential obsolete list if len(sys.argv) == 4: print '\n'.join(set(potentialObs))
Make sure you have your custom GO annotation file generated in the format shown above and the GOID to term aspect index file from here (text file version of 'Terms, IDs, secondary IDs, obsoletes'). Save this script and use it by:
python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf
If you want to check if any of your annotation is potentially deprecated or obsolete, use this command:
python myScript.py GO.customAnnotationFile GOIDTermIndexFile y > potentialDeprecated.goids
Now that you have the Gene Assocation File, you can use Ontologizer by:
java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology.1_2.obo -o out -p myOrganism.ids -s in
The parameters used are:
The enrichments with all relevent statistics will be generated per input gene list in a tab delimited file.