(1) GenBank file format
GenBank is part of the International Nucleotide Sequence Database Collaboration , which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on GenBank format can be found here.
When do we use the GenBank format?
GenBank format can represent a variety of information while keeping it human-readable, but it is not well suited for data analysis.
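To see why it is readable but hard to parse, here is a heavily abbreviated sketch of a GenBank record: labeled fields, then the sequence, then a terminating "//" line. The accession and sequence below are made up for illustration.

```
LOCUS       EX000001                24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Illustrative synthetic sequence (made-up entry).
ACCESSION   EX000001
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
```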
(2) FASTA format
In bioinformatics, FASTA format is a text-based format for recording nucleotide or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The format also allows a name and comments to be given before the sequence. Originally defined by the FASTA software package, it has since become a standard in bioinformatics.
FASTA's simple format lowers the difficulty of sequence manipulation and analysis, allowing sequences to be processed by text-processing tools and scripting languages such as Python, Ruby, and Perl.
FASTA is a plain-text format for representing DNA (or protein) sequences. It does not contain sequence quality information.
Reference: Wikipedia, FASTA format
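As a concrete illustration, here is a minimal sketch: a two-record FASTA file written with a heredoc (the sequence names and bases are made up), with grep counting the records by their '>' header lines.

```shell
# Write a tiny two-record FASTA file (illustrative sequences)
cat > demo.fa <<'EOF'
>seq1 example record
ACGTACGTACGT
>seq2 another record
GGGCCCAAATTT
EOF
# Each record starts with '>', so counting those lines counts records
grep -c '>' demo.fa   # 2
```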
(3) FASTQ file format
FASTQ is an extension of the FASTA format that adds a per-base sequencing quality score (Phred score).
Please refer to the following references:
The full range of ASCII characters used to encode quality scores (in Phred+33 encoding, '!' = Q0 up to '~' = Q93):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
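To make the encoding concrete, here is a small sketch (the read and its qualities are made up): a FASTQ record is four lines, and each quality character's Phred score is its ASCII code minus the offset 33.

```shell
# A single FASTQ record: @id, sequence, '+', quality string (illustrative)
cat <<'EOF'
@read1
ACGT
+
IIII
EOF
# Decode one quality character: ASCII code minus 33 gives the Phred score
q="I"
ascii=$(printf '%d' "'$q")   # leading quote in printf yields the character code
echo "Q$((ascii - 33))"      # Q40
```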
Further reading:
Differences between FASTA, FASTQ and SAM formats
I prefer NCBI GEO and SRA because I can use Aspera to download SRA files, which is super fast. It's best to keep the Aspera Connect software up-to-date.
Install Aspera Connect on Ubuntu Linux
mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.

# Construct a soft link
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h  # help
ascp -A  # version
If you have an older version installed, you need to uninstall it before installing a newer version of Aspera. In practice, that means deleting the related files:
# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
yes | rm -rf ~/.aspera/connect
The SRA group recommends the prefetch program provided in the SRA Toolkit. More detail can be found in the Download Guide.
1. Download SRA files by using prefetch
I don't recommend installing the SRA Toolkit with sudo apt-get install sratoolkit, because the packaged version may be outdated. I personally prefer to install the latest release.
SRA files will be deposited in the default folder ~/ncbi/public/sra.
# Install the SRA Toolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# Add the toolkit binaries to your PATH
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc
Prefetch can use several transfer methods to download SRA files; the default is Aspera. If you want prefetch to use only Aspera, use the following commands.
mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
# Manually generate an SRA file list
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222""$i" >> GSE48240.txt; done
# Or use efetch to generate the SRA file list
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' | grep SRR >> GSE48240.txt
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt
Alternatively, you can use curl, wget, or ftp to download from the generated download links, but that will be slow as a snail.
2. Convert SRA files to FASTQ files on the fly
This is a better approach if you don't have much space to store the SRA files: fastq-dump will convert SRA files to FASTQ files on the fly.
cat GSE48240.txt | xargs -n 1 fastq-dump --split-files
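Before running the real conversions, you can dry-run the xargs dispatch by putting echo in front of the command; each accession in the list then prints as one fastq-dump invocation (the run IDs below are just placeholders):

```shell
# Build a small accession list (placeholder run IDs)
printf 'SRR000001\nSRR000002\n' > demo_list.txt
# Dry run: echo prints each fastq-dump command instead of executing it
cat demo_list.txt | xargs -n 1 echo fastq-dump --split-files
# fastq-dump --split-files SRR000001
# fastq-dump --split-files SRR000002
```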
# Install bioawk
sudo apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin
# Download and unzip the file on the fly
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa
# Look at the file
head -4 chr22.fa
# Count how many "N" bases are in the chr22 sequence
grep -o N chr22.fa | wc -l
# Count how many bases are in chr22
bioawk -c fastx '{ print length($seq) }' chr22.fa
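If bioawk is not available, a plain-awk sketch computes the same per-record length from any (possibly multi-line) FASTA file; demo.fa and its sequences below are made up for illustration:

```shell
# Tiny multi-line FASTA for demonstration (made-up sequences)
cat > demo.fa <<'EOF'
>seq1
ACGTACGT
ACGT
>seq2
NNNNACGT
EOF
# Sum line lengths per record; print the total at each new header and at EOF
awk '/^>/ { if (len) print len; len = 0; next } { len += length($0) } END { print len }' demo.fa
# 12
# 8
```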
-------------------------------------------------------------------------------------------------------
Correspondence between genome builds
First, how the NCBI, UCSC, and ENSEMBL builds map to each other:
NCBI36 (hg18): ENSEMBL release_52.
GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.
GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.
As you can see, the ENSEMBL versions are particularly complicated and easy to mix up!
UCSC versions are simpler: just hg18, hg19, and hg38. hg19 is the most commonly used, but I recommend everyone switch to hg38.
NCBI also looks simple, just builds 36, 37, and 38, but there is more to it than meets the eye!
Feb 13 2014 00:00  Directory  April_14_2003
Apr 06 2006 00:00  Directory  BUILD.33
Apr 06 2006 00:00  Directory  BUILD.34.1
Apr 06 2006 00:00  Directory  BUILD.34.2
Apr 06 2006 00:00  Directory  BUILD.34.3
Apr 06 2006 00:00  Directory  BUILD.35.1
Aug 03 2009 00:00  Directory  BUILD.36.1
Aug 03 2009 00:00  Directory  BUILD.36.2
Sep 04 2012 00:00  Directory  BUILD.36.3
Jun 30 2011 00:00  Directory  BUILD.37.1
Sep 07 2011 00:00  Directory  BUILD.37.2
Dec 12 2012 00:00  Directory  BUILD.37.3
As you can see, there are 37.1, 37.2, 37.3, and so on; these sub-versions usually mean the annotation is being updated, while the genome sequence itself generally does not change!
In any case, remember that the hg19 genome is about 3 GB, and eight to nine hundred MB when compressed!
If you want to download GTF annotation files, the genome build is especially important!
For NCBI: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ ## latest build (hg38)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ ## other builds
For Ensembl:
ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Change the release number in the middle of the URL to reach any version: ftp://ftp.ensembl.org/pub/
For UCSC, it is a bit more involved:
You need to select a series of parameters:
http://genome.ucsc.edu/cgi-bin/hgTables
1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: Select "genome" for the entire genome.
output format: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser.
3. Click 'get output'.
Now for the key point: once the version relationships are clear, it is time to download!
Downloading from UCSC is very convenient; you only need to build the URL from the genome's short name:
http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz
Or use a shell script to download the chromosomes you specify:
for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;  ## NCBI can also be used here: ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/ with the chr prefix
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
rm -f chr*.fa
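The concatenation step above can be sanity-checked locally with tiny stand-in files before pointing it at real chromosome downloads (the file names and contents below are made up):

```shell
# Stand-in chromosome files, one fake record each
for i in 1 2 M; do printf '>chr%s\nACGTACGT\n' "$i" > chr${i}.fa; done
# Concatenate in a fixed order, exactly as in the hg19 loop above
for i in 1 2 M; do cat chr${i}.fa >> genome.fa; done
grep -c '>' genome.fa   # 3
rm -f chr*.fa genome.fa
```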
---------------------------------------------------------------------------------------------------------
Usage:
fastq-dump [options] <path/file> [<path/file> ...]
fastq-dump [options] <accession>
Frequently Used Options:
General:
  -h | --help                 Displays ALL options, general usage, and version information.
  -V | --version              Display the version of the program.

Data formatting:
  --split-files               Dump each read into a separate file. Files will receive a suffix corresponding to the read number.
  --split-spot                Split spots into individual reads.
  --fasta <[line width]>      FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
  -I | --readids              Append read id after spot id as 'accession.spot.readid' on the defline.
  -F | --origfmt              Defline contains only the original sequence name.
  -C | --dumpcs <[cskey]>     Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
  -B | --dumpbase             Formats sequence using base space (default for other than SOLiD).
  -Q | --offset <integer>     Offset to use for ASCII quality scores. Default is 33 ("!").

Filtering:
  -N | --minSpotId <rowid>    Minimum spot id to be dumped. Use with -X to dump a range.
  -X | --maxSpotId <rowid>    Maximum spot id to be dumped. Use with -N to dump a range.
  -M | --minReadLen <len>     Filter by sequence length >= <len>.
  --skip-technical            Dump only biological reads.
  --aligned                   Dump only aligned sequences. Aligned datasets only; see sra-stat.
  --unaligned                 Dump only unaligned sequences. Will dump all reads for unaligned datasets.

Workflow and piping:
  -O | --outdir <path>        Output directory, default is the current working directory ('.').
  -Z | --stdout               Output to stdout; all split data is joined into a single stream.
  --gzip                      Compress output using gzip.
  --bzip2                     Compress output using bzip2.
Use examples:
fastq-dump -X 5 -Z SRR390728
Prints the first five spots (-X 5) to standard out (-Z). This is a useful starting point for verifying other formatting options before dumping a whole file.
fastq-dump -I --split-files SRR390728
Produces two fastq files (--split-files) containing ".1" and ".2" read suffixes (-I) for paired-end data.
fastq-dump --split-files --fasta 60 SRR390728
Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).
fastq-dump --split-files --aligned -Q 64 SRR390728
Produces two fastq files (--split-files) that contain only aligned reads (--aligned; note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64). Please see the documentation on vdb-dump if you wish to produce fasta/qual data.

Possible errors and their solutions:
fastq-dump.2.x err: item not found while constructing within virtual database module - the path '<path/SRR*.sra>' cannot be opened as database or table
This error indicates that the .sra file cannot be found. Confirm that the path to the file is correct.
fastq-dump.2.x err: name not found while resolving tree within virtual file system module - failed SRR*.sra
The data are likely reference compressed and the toolkit is unable to acquire the reference sequence(s) needed to extract the .sra file. Please confirm that you have tested and validated the configuration of the toolkit. If you have elected to prevent the toolkit from contacting NCBI, you will need to manually acquire the reference(s) here
--------------------------------------------------------------------------------------------------------
Download workflow:
1. Download the SRA data file from NCBI:
wget ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP000/SRP000001/SRR000001/SRR000001.sra
2. Use the fastq-dump tool to convert the SRA file into paired-end FASTQ files.