[protocol]GO enrichment analysis

[protocol]GO enrichment analysis

 

   

背景:php

   什麼是富集分析,本身能夠百度。我到目前也沒發現一個比較通俗易懂的介紹。直接理解爲一種統計學方法就能夠了。 用於查看顯著性。html

   富集分析有不少種,最多見的是GO富集分析。也有pathway富集分析。[pathway的我目前不會啊 ::>_<:: ]java

 

工具:python

   也有不少種,我這裏主要是用Ontologizer (links:http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html)web

 

須要的文件:express

  1.gene_ontology_edit.obo (這個相似GO的庫文件, links:http://www.geneontology.org/GO.downloads.ontology.shtml)app

 

  2.whole_gene_list (這是所研究物種中,包含的全部gene_list)工具

      經過一下腳本能夠從CDS文件中提取gene_listoop

        grep "^>" sequence.fasta | tr -d ">"> sequenceHeader.listthis

 

  3. sample_1_gene_list (其中一個點的gene_list)

 

  4.最後一個也是最關鍵的。就是要有一個Association file .這個經過一下腳本生成

       4.1須要2個文件 和一個腳本

           文件1:GO_ids 文件 格式以下

gene136870 GO:0006950,GO:0005737
gene143400 0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020

     文件2: GOIDTermIndexFile 這個能夠下載到 links:http://www.geneontology.org/GO.downloads.files.shtml

     腳本1:

   

import sys

#input the gene annotation file (inFile) and your GOID/terms index file (goFile)
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website inFile = open(sys.argv[1],'r') goFile = open(sys.argv[2],'r') #parse the GOID/terms index and store it in the dictionary, goHash goHash = {} for line in goFile: #skip lines that start with '!' character as they are comment headers if line[0] != "!": data = line.strip().split('\t')
#skip obsolete terms
if data[-1] != 'obs':
for info in data:
if info[0:3] == "GO:":
#create dictionary of term aspects goHash[data[0]] = data[-1] #Here are some columns that the GAF format wants. #Since Ontologizer doesn't care about this, we can just make it up DB = 'yourOrganism' DBRef = 'PMID:0000000' DBEvi = 'ISO' DBObjectType = 'gene' DBTaxon = 'taxon:79327' DBDate = '23022011' DBAssignedBy = 'PFAM' #potential obselete goids that you have in your annotation potentialObs = []
#if you specified to not print out obsolete goids, then print out the .gaf if len(sys.argv) == 3: print '!gaf-version: 2.0' #Loop through the GO annotation file and generate assocation lines. for line in inFile: data = line.strip().split('\t') #if gene has go annotations if len(data) > 1: #gid is the gene id, goIDs is an array of assocated GO ids gid = data[0] goIDs = data[1].split(',') #second column of the .gaf file that Ontologizer doesn't care about DBID = "db." gid #third column is the name of the gene which Ontologizer does use DBObjSym = gid #for each GO ID in the line, print out an association line for goID in goIDs: if goHash.has_key(goID): DBAspect = goHash[goID] DBObjName = 'myOrganism' DBID outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
#only print out the .gaf file if you didn't specify to print out obsolete goids. if len(sys.argv) == 3: print '\t'.join(outArray) else: potentialObs.append(goID)
#if there is a 4th argument, print out the potential obsolete list if len(sys.argv) == 4: print '\n'.join(set(potentialObs))
   執行:
python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf

 

 

以上4個文件都有了 能夠執行Ontologizer了 參考:http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

 

java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology_edit.obo  -s moso_target_gene.txt -p sequnecHeader.list -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000

 

 

 

   5.這個軟件是能夠畫圖的。可是須要下載軟件。下載步驟以下: (as root )

        yum list available 'graphviz*'
        yum install 'graphviz*'
dot -Tpng view-4hourSMinduced-Parent-Child-Westfall-Young-Single-Step.dot -oExample.png
 
 
還能夠參考 http://blog.nextgenetics.net/?e=5
 
 
  
 
 
by   Damian Kao

Most gene enrichment websites out there only allow you to find enrichments for popular model organisms using pre-established gene ontology annotations. I ran into this problem early on during my phd when confronted with having to generate enrichment data on Schmidtea mediterranea.

 

Software and inputs

To get custom enrichments, I used Ontologizer. The inputs for ontologizer are:

  • Gene Ontology OBO file
    A flatfile containing information on all the current GO terms and their relationship to other terms. You can get it here.
  • List of all gene ids
    If you have a .fasta file of all your genes, you can use this command to extract the gene id information:
    grep "^>" sequence.fasta | tr -d ">" > sequenceHeader.list
    
  • List of gene ids you want to find enrichment for
    The subset of all your genes that you want the enrichment for, ie. differentially expressed gene. Ontologizer can take in an entire folder of gene id lists if you set a directory as input. 
  • Association file for your organism
    This is a GO Annotation File (.gaf). The specifications is detailed here. Fortunately, Ontologizer doesn't really care about most of the database information columns in the .gaf file, so you can fake most of it. To generate this file, you will have to first generate GO annotations for your genes through software like InterproScan or Blast2Go. You will also need a tab delimited index of GO IDs to GO category which you can get here.

 

Generating a gene association file

Assuming your GO annotations are formatted in two columns where first column is the gene id and second column is a comma separated list of your GO ids:

gene136870 GO:0006950,GO:0005737
gene143400 GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020

Here is the python script to generate the association file with comments:

import sys

#input the gene annotation file (inFile) and your GOID/terms index file (goFile)
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website inFile = open(sys.argv[1],'r') goFile = open(sys.argv[2],'r') #parse the GOID/terms index and store it in the dictionary, goHash goHash = {} for line in goFile: #skip lines that start with '!' character as they are comment headers if line[0] != "!": data = line.strip().split('\t')
#skip obsolete terms
if data[-1] != 'obs':
for info in data:
if info[0:3] == "GO:":
#create dictionary of term aspects goHash[data[0]] = data[-1] #Here are some columns that the GAF format wants. #Since Ontologizer doesn't care about this, we can just make it up DB = 'yourOrganism' DBRef = 'PMID:0000000' DBEvi = 'ISO' DBObjectType = 'gene' DBTaxon = 'taxon:79327' DBDate = '23022011' DBAssignedBy = 'PFAM' #potential obselete goids that you have in your annotation potentialObs = []
#if you specified to not print out obsolete goids, then print out the .gaf if len(sys.argv) == 3: print '!gaf-version: 2.0' #Loop through the GO annotation file and generate assocation lines. for line in inFile: data = line.strip().split('\t') #if gene has go annotations if len(data) > 1: #gid is the gene id, goIDs is an array of assocated GO ids gid = data[0] goIDs = data[1].split(',') #second column of the .gaf file that Ontologizer doesn't care about DBID = "db." gid #third column is the name of the gene which Ontologizer does use DBObjSym = gid #for each GO ID in the line, print out an association line for goID in goIDs: if goHash.has_key(goID): DBAspect = goHash[goID] DBObjName = 'myOrganism' DBID outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
#only print out the .gaf file if you didn't specify to print out obsolete goids. if len(sys.argv) == 3: print '\t'.join(outArray) else: potentialObs.append(goID)
#if there is a 4th argument, print out the potential obsolete list if len(sys.argv) == 4: print '\n'.join(set(potentialObs))

 

Putting it all together

Make sure you have your custom GO annotation file generated in the format shown above and the GOID to term aspect index file from here (text file version of 'Terms, IDs, secondary IDs, obsoletes'). Save this script and use it by:

python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf

If you want to check if any of your annotation is potentially deprecated or obsolete, use this command:

python myScript.py GO.customAnnotationFile GOIDTermIndexFile y > potentialDeprecated.goids

Now that you have the Gene Assocation File, you can use Ontologizer by:

java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology.1_2.obo -o out -p myOrganism.ids -s in

The parameters used are:

  • a - the gaf file
  • g - the obo file
  • o - the output directory
  • p - the list of all gene ids
  • s - the input directory/file of gene subsets 

The enrichments with all relevent statistics will be generated per input gene list in a tab delimited file.

相關文章
相關標籤/搜索