[protocol]GO enrichment analysis

時間 2019-11-20

標籤 protocol enrichment analysis 欄目微服務简体版

原文原文鏈接

[protocol]GO enrichment analysis

背景：php

什麼是富集分析，本身能夠百度。我到目前也沒發現一個比較通俗易懂的介紹。直接理解爲一種統計學方法就能夠了。用於查看顯著性。html

富集分析有不少種，最多見的是GO富集分析。也有pathway富集分析。[pathway的我目前不會啊 ::>_<:: ]java

工具：python

也有不少種，我這裏主要是用Ontologizer （links:http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html）web

須要的文件：express

1.gene_ontology_edit.obo (這個相似GO的庫文件， links:http://www.geneontology.org/GO.downloads.ontology.shtml)app

2.whole_gene_list (這是所研究物種中，包含的全部gene_list)工具

經過一下腳本能夠從CDS文件中提取gene_listoop

grep "^>" sequence.fasta | tr -d ">"> sequenceHeader.listthis

3. sample_1_gene_list (其中一個點的gene_list)

4.最後一個也是最關鍵的。就是要有一個Association file .這個經過一下腳本生成

4.1須要2個文件和一個腳本

文件1：GO_ids 文件格式以下

gene136870 GO:0006950,GO:0005737
gene143400 0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020

文件2： GOIDTermIndexFile 這個能夠下載到 links:http://www.geneontology.org/GO.downloads.files.shtml

腳本1：

import sys

#input the gene annotation file (inFile) and your GOID/terms index file (goFile)
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website
inFile = open(sys.argv[1],'r')
goFile = open(sys.argv[2],'r')

#parse the GOID/terms index and store it in the dictionary, goHash
goHash = {}

for line in goFile:
    #skip lines that start with '!' character as they are comment headers
    if line[0] != "!":
        data = line.strip().split('\t')
        #skip obsolete terms
        if data[-1] != 'obs':
            for info in data:
                if info[0:3] == "GO:":
                    #create dictionary of term aspects
                    goHash[data[0]] = data[-1]

#Here are some columns that the GAF format wants. 
#Since Ontologizer doesn't care about this, we can just make it up
DB = 'yourOrganism'
DBRef = 'PMID:0000000'
DBEvi = 'ISO'
DBObjectType = 'gene'
DBTaxon = 'taxon:79327'
DBDate = '23022011'
DBAssignedBy = 'PFAM'

#potential obselete goids that you have in your annotation
potentialObs = []

#if you specified to not print out obsolete goids, then print out the .gaf
if len(sys.argv) == 3:
    print '!gaf-version: 2.0'

#Loop through the GO annotation file and generate assocation lines.
for line in inFile:
    data = line.strip().split('\t')

    #if gene has go annotations
    if len(data) > 1:
        #gid is the gene id, goIDs is an array of assocated GO ids
        gid = data[0]
        goIDs = data[1].split(',')

        #second column of the .gaf file that Ontologizer doesn't care about
        DBID = "db."   gid

        #third column is the name of the gene which Ontologizer does use
        DBObjSym = gid
    
        #for each GO ID in the line, print out an association line
        for goID in goIDs:
            if goHash.has_key(goID):
                DBAspect = goHash[goID]
                DBObjName = 'myOrganism'   DBID

                outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
                
                #only print out the .gaf file if you didn't specify to print out obsolete goids.
                if len(sys.argv) == 3:
                    print '\t'.join(outArray)
            else:
                potentialObs.append(goID)

#if there is a 4th argument, print out the potential obsolete list
if len(sys.argv) == 4:
    print '\n'.join(set(potentialObs))

執行：

python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf

以上4個文件都有了能夠執行Ontologizer了參考：http://compbio.charite.de/contao/index.php/cmdlineOntologizer.html

java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology_edit.obo -s moso_target_gene.txt -p sequnecHeader.list -c Parent-Child-Union -m Westfall-Young-Single-Step -d 0.05 -r 1000

5.這個軟件是能夠畫圖的。可是須要下載軟件。下載步驟以下：（as root ）

        yum list available 'graphviz*'
        yum install 'graphviz*'

dot -Tpng view-4hourSMinduced-Parent-Child-Westfall-Young-Single-Step.dot -oExample.png

還能夠參考 http://blog.nextgenetics.net/?e=5

Method: Gene enrichment using Ontologizer without a predefined association file

by Damian Kao

Most gene enrichment websites out there only allow you to find enrichments for popular model organisms using pre-established gene ontology annotations. I ran into this problem early on during my phd when confronted with having to generate enrichment data on Schmidtea mediterranea.

Software and inputs

To get custom enrichments, I used Ontologizer. The inputs for ontologizer are:

Gene Ontology OBO file
A flatfile containing information on all the current GO terms and their relationship to other terms. You can get it here.
List of all gene ids
If you have a .fasta file of all your genes, you can use this command to extract the gene id information:
```
grep "^>" sequence.fasta | tr -d ">" > sequenceHeader.list
```
List of gene ids you want to find enrichment for
The subset of all your genes that you want the enrichment for, ie. differentially expressed gene. Ontologizer can take in an entire folder of gene id lists if you set a directory as input.
Association file for your organism
This is a GO Annotation File (.gaf). The specifications is detailed here. Fortunately, Ontologizer doesn't really care about most of the database information columns in the .gaf file, so you can fake most of it. To generate this file, you will have to first generate GO annotations for your genes through software like InterproScan or Blast2Go. You will also need a tab delimited index of GO IDs to GO category which you can get here.

Generating a gene association file

Assuming your GO annotations are formatted in two columns where first column is the gene id and second column is a comma separated list of your GO ids:

gene136870 GO:0006950,GO:0005737
gene143400 GO:0016020,GO:0005524,GO:0006468,GO:0005737,GO:0004674,GO:0006914,GO:0016021,GO:0015031,GO:0004714
gene144020 GO:0003779,GO:0006941,GO:0005524,GO:0003774,GO:0005516,GO:0005737,GO:0005863
gene000620 GO:0005634,GO:0003677,GO:0030154,GO:0006350,GO:0006355,GO:0007275,GO:0030528
gene144530 GO:0016020

Here is the python script to generate the association file with comments:

import sys

#input the gene annotation file (inFile) and your GOID/terms index file (goFile)
#inFile is your custom annotations
#goFile is the GO IDs to GO categories index you download from the GO website
inFile = open(sys.argv[1],'r')
goFile = open(sys.argv[2],'r')

#parse the GOID/terms index and store it in the dictionary, goHash
goHash = {}

for line in goFile:
    #skip lines that start with '!' character as they are comment headers
    if line[0] != "!":
        data = line.strip().split('\t')
        #skip obsolete terms
        if data[-1] != 'obs':
            for info in data:
                if info[0:3] == "GO:":
                    #create dictionary of term aspects
                    goHash[data[0]] = data[-1]

#Here are some columns that the GAF format wants. 
#Since Ontologizer doesn't care about this, we can just make it up
DB = 'yourOrganism'
DBRef = 'PMID:0000000'
DBEvi = 'ISO'
DBObjectType = 'gene'
DBTaxon = 'taxon:79327'
DBDate = '23022011'
DBAssignedBy = 'PFAM'

#potential obselete goids that you have in your annotation
potentialObs = []

#if you specified to not print out obsolete goids, then print out the .gaf
if len(sys.argv) == 3:
    print '!gaf-version: 2.0'

#Loop through the GO annotation file and generate assocation lines.
for line in inFile:
    data = line.strip().split('\t')

    #if gene has go annotations
    if len(data) > 1:
        #gid is the gene id, goIDs is an array of assocated GO ids
        gid = data[0]
        goIDs = data[1].split(',')

        #second column of the .gaf file that Ontologizer doesn't care about
        DBID = "db."   gid

        #third column is the name of the gene which Ontologizer does use
        DBObjSym = gid
    
        #for each GO ID in the line, print out an association line
        for goID in goIDs:
            if goHash.has_key(goID):
                DBAspect = goHash[goID]
                DBObjName = 'myOrganism'   DBID

                outArray = [DB,DBID,DBObjSym,'0',goID,DBRef,DBEvi,'0',DBAspect,'0','0',DBObjectType,DBTaxon,DBDate,DBAssignedBy]
                
                #only print out the .gaf file if you didn't specify to print out obsolete goids.
                if len(sys.argv) == 3:
                    print '\t'.join(outArray)
            else:
                potentialObs.append(goID)

#if there is a 4th argument, print out the potential obsolete list
if len(sys.argv) == 4:
    print '\n'.join(set(potentialObs))

Putting it all together

Make sure you have your custom GO annotation file generated in the format shown above and the GOID to term aspect index file from here (text file version of 'Terms, IDs, secondary IDs, obsoletes'). Save this script and use it by:

python myScript.py GO.customAnnotationFile GOIDTermIndexFile > myOrganism.gaf

If you want to check if any of your annotation is potentially deprecated or obsolete, use this command:

python myScript.py GO.customAnnotationFile GOIDTermIndexFile y > potentialDeprecated.goids

Now that you have the Gene Assocation File, you can use Ontologizer by:

java -jar Ontologizer.jar -a myOrganism.gaf -g gene_ontology.1_2.obo -o out -p myOrganism.ids -s in

The parameters used are:

a - the gaf file
g - the obo file
o - the output directory
p - the list of all gene ids
s - the input directory/file of gene subsets

The enrichments with all relevent statistics will be generated per input gene list in a tab delimited file.

相關標籤/搜索

analysis

enrichment

微服務

每日一句

每一个你不满意的现在，都有一个你没有努力的曾经。