(1) GenBank file format
GenBank is part of the International Nucleotide Sequence Database Collaboration , which comprises the DNA DataBank of Japan (DDBJ), the European Nucleotide Archive (ENA), and GenBank at NCBI. These three organizations exchange data on a daily basis.
More information on GenBank format can be found here.
When do we use the GenBank format?
GenBank format can represent a variety of information while keeping it human-readable, but it is not well suited for data analysis.
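To see why it is readable but hard to parse, here is a heavily abbreviated sketch of a GenBank record: labeled fields, then the sequence, then a terminating "//" line. The accession and sequence below are made up for illustration.

```
LOCUS       EX000001                24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Illustrative synthetic sequence (made-up entry).
ACCESSION   EX000001
FEATURES             Location/Qualifiers
     source          1..24
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
```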
(2) FASTA format
In bioinformatics, FASTA format is a text-based format for recording nucleotide or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. The format also allows a name and comments to be given before the sequence. Originally defined by the FASTA software package, it has since become a standard in bioinformatics.
FASTA's simple format lowers the difficulty of sequence manipulation and analysis, allowing sequences to be processed by text-processing tools and scripting languages such as Python, Ruby, and Perl.
FASTA is a plain-text format for representing DNA (or protein) sequences. It does not contain sequence quality information.
Reference: Wikipedia, FASTA format
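As a concrete illustration, here is a minimal sketch: a two-record FASTA file written with a heredoc (the sequence names and bases are made up), with grep counting the records by their '>' header lines.

```shell
# Write a tiny two-record FASTA file (illustrative sequences)
cat > demo.fa <<'EOF'
>seq1 example record
ACGTACGTACGT
>seq2 another record
GGGCCCAAATTT
EOF
# Each record starts with '>', so counting those lines counts records
grep -c '>' demo.fa   # 2
```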
(3) FASTQ file format
FASTQ is an extension of the FASTA format that adds a per-base sequencing quality score (Phred score).
Please refer to the following references:
The full range of ASCII characters used to encode quality scores (in Phred+33 encoding, '!' = Q0 up to '~' = Q93):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
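To make the encoding concrete, here is a small sketch (the read and its qualities are made up): a FASTQ record is four lines, and each quality character's Phred score is its ASCII code minus the offset 33.

```shell
# A single FASTQ record: @id, sequence, '+', quality string (illustrative)
cat <<'EOF'
@read1
ACGT
+
IIII
EOF
# Decode one quality character: ASCII code minus 33 gives the Phred score
q="I"
ascii=$(printf '%d' "'$q")   # leading quote in printf yields the character code
echo "Q$((ascii - 33))"      # Q40
```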
Further reading:
Differences between FASTA, FASTQ and SAM formats
I prefer NCBI GEO and SRA because I can use Aspera to download SRA files, which is super fast. It's best to keep the Aspera Connect software up-to-date.
Install Aspera Connect on Ubuntu Linux
mkdir -p ~/biosoft/ascp && cd ~/biosoft/ascp
wget https://download.asperasoft.com/download/sw/connect/3.7.4/aspera-connect-3.7.4.147727-linux-64.tar.gz
tar -zxvf aspera-connect-3.7.4.147727-linux-64.tar.gz
bash aspera-connect-3.7.4.147727-linux-64.sh
# Installing Aspera Connect
# Deploying Aspera Connect (/home/jshi/.aspera/connect) for the current user only.
# Unable to update desktop database, Aspera Connect may not be able to auto-launch
# Restart firefox manually to load the Aspera Connect plug-in
# Install complete.

# Construct a soft link
sudo ln -s /home/jshi/.aspera/connect/bin/ascp /usr/bin/ascp
ascp -h  # help
ascp -A  # version
If you have an older version installed, you need to uninstall it before installing a newer version of Aspera. In practice, that means deleting the related files:
# ~/.mozilla/plugins/libnpasperaweb.so
# ~/.aspera/connect
rm ~/.mozilla/plugins/libnpasperaweb_{connect build #}.so
yes | rm -rf ~/.aspera/connect
The SRA group recommends the prefetch program provided in the SRA Toolkit. More detail can be found in the Download Guide.
1. Download SRA files by using prefetch
I don't recommend installing the SRA Toolkit with sudo apt-get install sratoolkit, because the packaged version may be outdated. I personally prefer to install the latest release.
SRA files will be deposited in the default folder ~/ncbi/public/sra.
# Install the SRA Toolkit
mkdir -p ~/biosoft/sratools && cd ~/biosoft/sratools
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.8.2-1/sratoolkit.2.8.2-1-ubuntu64.tar.gz
tar -zxvf sratoolkit.2.8.2-1-ubuntu64.tar.gz
# Add the toolkit binaries to your PATH
echo 'export PATH=$PATH:/home/jshi/biosoft/sratools/sratoolkit.2.8.2-1-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc
Prefetch can use several transfer methods to download SRA files; the default is Aspera. If you want prefetch to use only Aspera, use the following commands.
mkdir -p ~/data/project/GSE48240 && cd ~/data/project/GSE48240
# Manually generate an SRA file list
touch GSE48240.txt
for i in $(seq -w 1 3); do echo "SRR92222""$i" >> GSE48240.txt; done
# Or use efetch to generate the SRA file list
esearch -db sra -query PRJNA209632 | efetch -format runinfo | cut -f 1 -d ',' | grep SRR >> GSE48240.txt
prefetch -t ascp -a "/usr/bin/ascp|/home/jshi/.aspera/connect/etc/asperaweb_id_dsa.openssh" --option-file GSE48240.txt
Alternatively, you can use curl, wget, or ftp to download from the generated download links, but that will be slow as a snail.
2. Convert SRA files to FASTQ files on the fly
This is a better approach if you don't have much space to store the SRA files: fastq-dump will convert SRA files to FASTQ files on the fly.
cat GSE48240.txt | xargs -n 1 fastq-dump --split-files
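Before running the real conversions, you can dry-run the xargs dispatch by putting echo in front of the command; each accession in the list then prints as one fastq-dump invocation (the run IDs below are just placeholders):

```shell
# Build a small accession list (placeholder run IDs)
printf 'SRR000001\nSRR000002\n' > demo_list.txt
# Dry run: echo prints each fastq-dump command instead of executing it
cat demo_list.txt | xargs -n 1 echo fastq-dump --split-files
# fastq-dump --split-files SRR000001
# fastq-dump --split-files SRR000002
```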
# Install bioawk
sudo apt-get install bison
cd ~/biosoft
git clone https://github.com/lh3/bioawk
cd bioawk
make
sudo cp bioawk /usr/local/bin
# Download and unzip the file on the fly
curl http://hgdownload.cse.ucsc.edu/goldenPath/hg38/chromosomes/chr22.fa.gz | gunzip -c > chr22.fa
# Look at the file
head -4 chr22.fa
# Count how many "N" bases are in the chr22 sequence
grep -o N chr22.fa | wc -l
# Count how many bases are in chr22
bioawk -c fastx '{ print length($seq) }' chr22.fa
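If bioawk is not available, a plain-awk sketch computes the same per-record length from any (possibly multi-line) FASTA file; demo.fa and its sequences below are made up for illustration:

```shell
# Tiny multi-line FASTA for demonstration (made-up sequences)
cat > demo.fa <<'EOF'
>seq1
ACGTACGT
ACGT
>seq2
NNNNACGT
EOF
# Sum line lengths per record; print the total at each new header and at EOF
awk '/^>/ { if (len) print len; len = 0; next } { len += length($0) } END { print len }' demo.fa
# 12
# 8
```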
-------------------------------------------------------------------------------------------------------
Correspondence between genome builds
First, how the NCBI, UCSC, and ENSEMBL builds map to each other:
NCBI36 (hg18): ENSEMBL release_52.
GRCh37 (hg19): ENSEMBL release_59/61/64/68/69/75.
GRCh38 (hg38): ENSEMBL release_76/77/78/80/81/82.
As you can see, the ENSEMBL versions are particularly complicated and easy to mix up!
UCSC versions are simpler: just hg18, hg19, and hg38. hg19 is the most commonly used, but I recommend everyone switch to hg38.
NCBI also looks simple, just builds 36, 37, and 38, but there is more to it than meets the eye!
Feb 13 2014 00:00  Directory  April_14_2003
Apr 06 2006 00:00  Directory  BUILD.33
Apr 06 2006 00:00  Directory  BUILD.34.1
Apr 06 2006 00:00  Directory  BUILD.34.2
Apr 06 2006 00:00  Directory  BUILD.34.3
Apr 06 2006 00:00  Directory  BUILD.35.1
Aug 03 2009 00:00  Directory  BUILD.36.1
Aug 03 2009 00:00  Directory  BUILD.36.2
Sep 04 2012 00:00  Directory  BUILD.36.3
Jun 30 2011 00:00  Directory  BUILD.37.1
Sep 07 2011 00:00  Directory  BUILD.37.2
Dec 12 2012 00:00  Directory  BUILD.37.3
As you can see, there are 37.1, 37.2, 37.3, and so on; these sub-versions usually mean the annotation is being updated, while the genome sequence itself generally does not change!
In any case, remember that the hg19 genome is about 3 GB, and eight to nine hundred MB when compressed!
If you want to download GTF annotation files, the genome build is especially important!
For NCBI: ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/GFF/ ## latest build (hg38)
ftp://ftp.ncbi.nlm.nih.gov/genomes/Homo_sapiens/ARCHIVE/ ## other builds
For Ensembl:
ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
Change the release number in the middle of the URL to reach any version: ftp://ftp.ensembl.org/pub/
For UCSC, it is a bit more involved:
You need to select a series of parameters:
http://genome.ucsc.edu/cgi-bin/hgTables
1. Navigate to http://genome.ucsc.edu/cgi-bin/hgTables
2. Select the following options:
clade: Mammal
genome: Human
assembly: Feb. 2009 (GRCh37/hg19)
group: Genes and Gene Predictions
track: UCSC Genes
table: knownGene
region: Select "genome" for the entire genome.
output format: GTF - gene transfer format
output file: enter a file name to save your results to a file, or leave blank to display results in the browser.
3. Click 'get output'.
Now for the key point: once the version relationships are clear, it is time to download!
Downloading from UCSC is very convenient; you only need to build the URL from the genome's short name:
http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/chromFa.tar.gz
http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/chromFa.tar.gz
Or use a shell script to download the chromosomes you specify:
for i in $(seq 1 22) X Y M;
do echo $i;
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr${i}.fa.gz;  ## NCBI can also be used here: ftp://ftp.ncbi.nih.gov/genomes/M_musculus/ARCHIVE/MGSCv3_Release3/Assembled_Chromosomes/ with the chr prefix
done
gunzip *.gz
for i in $(seq 1 22) X Y M;
do cat chr${i}.fa >> hg19.fasta;
done
rm -f chr*.fa
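The concatenation step above can be sanity-checked locally with tiny stand-in files before pointing it at real chromosome downloads (the file names and contents below are made up):

```shell
# Stand-in chromosome files, one fake record each
for i in 1 2 M; do printf '>chr%s\nACGTACGT\n' "$i" > chr${i}.fa; done
# Concatenate in a fixed order, exactly as in the hg19 loop above
for i in 1 2 M; do cat chr${i}.fa >> genome.fa; done
grep -c '>' genome.fa   # 3
rm -f chr*.fa genome.fa
```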
---------------------------------------------------------------------------------------------------------
Usage:
fastq-dump [options] <path/file> [<path/file> ...]
fastq-dump [options] <accession>
Frequently Used Options:
General:
  -h | --help                 Displays ALL options, general usage, and version information.
  -V | --version              Display the version of the program.

Data formatting:
  --split-files               Dump each read into a separate file. Files will receive a suffix corresponding to the read number.
  --split-spot                Split spots into individual reads.
  --fasta <[line width]>      FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
  -I | --readids              Append read id after spot id as 'accession.spot.readid' on the defline.
  -F | --origfmt              Defline contains only the original sequence name.
  -C | --dumpcs <[cskey]>     Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
  -B | --dumpbase             Formats sequence using base space (default for other than SOLiD).
  -Q | --offset <integer>     Offset to use for ASCII quality scores. Default is 33 ("!").

Filtering:
  -N | --minSpotId <rowid>    Minimum spot id to be dumped. Use with -X to dump a range.
  -X | --maxSpotId <rowid>    Maximum spot id to be dumped. Use with -N to dump a range.
  -M | --minReadLen <len>     Filter by sequence length >= <len>.
  --skip-technical            Dump only biological reads.
  --aligned                   Dump only aligned sequences. Aligned datasets only; see sra-stat.
  --unaligned                 Dump only unaligned sequences. Will dump all reads for unaligned datasets.

Workflow and piping:
  -O | --outdir <path>        Output directory, default is the current working directory ('.').
  -Z | --stdout               Output to stdout; all split data is joined into a single stream.
  --gzip                      Compress output using gzip.
  --bzip2                     Compress output using bzip2.
Use examples:
fastq-dump -X 5 -Z SRR390728
Prints the first five spots (-X 5) to standard out (-Z). This is a useful starting point for verifying other formatting options before dumping a whole file.
fastq-dump -I --split-files SRR390728
Produces two fastq files (--split-files) containing ".1" and ".2" read suffixes (-I) for paired-end data.
fastq-dump --split-files --fasta 60 SRR390728
Produces two (--split-files) fasta files (--fasta) with 60 bases per line ("60" included after --fasta).
fastq-dump --split-files --aligned -Q 64 SRR390728
Produces two fastq files (--split-files) that contain only aligned reads (--aligned; note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64). Please see the documentation on vdb-dump if you wish to produce fasta/qual data.

Possible errors and their solutions:
fastq-dump.2.x err: item not found while constructing within virtual database module - the path '<path/SRR*.sra>' cannot be opened as database or table
This error indicates that the .sra file cannot be found. Confirm that the path to the file is correct.
fastq-dump.2.x err: name not found while resolving tree within virtual file system module - failed SRR*.sra
The data are likely reference compressed and the toolkit is unable to acquire the reference sequence(s) needed to extract the .sra file. Please confirm that you have tested and validated the configuration of the toolkit. If you have elected to prevent the toolkit from contacting NCBI, you will need to manually acquire the reference(s) here
--------------------------------------------------------------------------------------------------------
Download workflow:
1. Download the SRA data file from NCBI:
wget ftp://ftp.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP000/SRP000001/SRR000001/SRR000001.sra
2. Use the fastq-dump tool to convert the SRA file into paired-end FASTQ files.