Deep-learning augmented RNA-seq analysis of transcript splicing | Predicting alternative splicing with deep learning

Predicting alternative splicing has become quite popular. There are currently two main schools:

  • Predicting alternative splicing from DNA sequence and variants: GeneSplicer, MaxEntScan, dbscSNV, S-CAP, MMSplice, ClinVar, SpliceAI
  • Predicting alternative splicing from RNA-seq data: MISO, rMATS, DARTS

Preamble (rambling)

Hot topics in academia always come in clusters, one wave after another, and most people who disdain chasing trends yet fail to produce results end up getting weeded out of the field.

Doing pure method development is actually exhausting: it costs time, effort, and brainpower, especially when your own research area has already gone out of fashion. On top of that you have to endure the disdain of outsiders: "What's the point of working on this? You can't get it published, and in the end nobody uses it."

One of the biggest hot topics of recent years is "single cell". Plenty of people rode that wind to a few papers, and the first wave of method developers published many Nature Methods and NBT papers, with even more in Bioinformatics and NAR. But most of them have since faded away. As the barrier to entry kept dropping and more people piled in, after a few years of development the field has settled into a three-way standoff: the strong get stronger and the weak exit.

There is also an unwritten rule for methods papers: never write them too plainly. If most reviewers can understand your work at a glance, they will naturally conclude that the research is too simple to deserve publication. Ideally, write it rigorously and with solid support, such that 90% of reviewers cannot understand it at first sight but find it somewhat interesting after careful thought. Ha, take this as a joke.


Now on to another paper that uses deep learning to predict alternative splicing.

Deep-learning augmented RNA-seq analysis of transcript splicing

PDF

 

Key background knowledge needed for this paper:

"Unlike methods that use cis sequence features to predict exon splicing patterns in specific samples" (refs. 7–10). Worth looking at how previous work predicted exon splicing patterns from cis sequence features.

Related references:

The human splicing code reveals new insights into the genetic determinants of disease - 2015

Deciphering the splicing code - 2010

Deep learning of the tissue-regulated splicing code - 2014

BRIE: transcriptome-wide splicing quantification in single cells - 2017

Ha, the wind of applying deep learning to alternative splicing has been blowing since 2014, so you can no longer be the first to try it.

To learn what AS (alternative splicing) is, you can go straight to the tools currently being developed; their documentation is sure to include detailed figures that also explain the algorithms. A picture is worth a thousand words.

MISO (Mixture of Isoforms) software documentation. It currently only supports Python 2; if installing via conda you also need to copy the miso_settings.txt file from the documentation.

rMATS: Robust and flexible detection of differential alternative splicing from replicate RNA-Seq data

There is a huge gap between biology and informatics; capable people quickly notice this gap and fill it.

 

Questions:

Why does AS identification depend on sequencing depth? This requires understanding the mainstream AS detection algorithms.

How should we understand the concept of differential alternative splicing between samples?

What are cis sequence features, and what information does this file contain?

How to predict exon-inclusion/skipping levels in bulk tissues or single cells?

How to understand "we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing"?

 

Two components:

a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features

a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing

 

During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN. 

During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.

This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deeplearning-augmented splicing analysis.

Now it makes sense. The first part, the BHT, is essentially an upgraded version of conventional differential splicing tools (such as MISO and rMATS), used to produce labeled training data. The BHT results are then used to train the DNN model; new data can be fed into the trained DNN, and its output serves as the prior for the downstream Bayesian model, while our own RNA-seq data are used to update that prior into a posterior. If the prior is accurate enough, the update does not depend heavily on the data, which is why this method can compensate for insufficient RNA-seq sequencing depth.
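The prior-versus-evidence interplay can be sketched with a toy two-hypothesis Bayes update. This is not the actual DARTS model (which works on read counts and PSI values); the numbers and the Bayes-factor framing here are purely illustrative assumptions.

```python
def posterior_differential(prior_h1, bayes_factor):
    """Posterior P(H1 = differential | data), given the prior P(H1)
    and a Bayes factor BF = P(data | H1) / P(data | H0)."""
    odds = (prior_h1 / (1 - prior_h1)) * bayes_factor
    return odds / (1 + odds)

# Same weak evidence (BF = 2, e.g. from shallow sequencing), two priors:
flat = posterior_differential(0.5, bayes_factor=2.0)  # uninformative prior
info = posterior_differential(0.9, bayes_factor=2.0)  # DNN-informed prior
print(round(flat, 3), round(info, 3))  # 0.667 0.947
```

With limited read support the flat prior leaves the call ambiguous, while an accurate informative prior pushes the posterior toward a confident call, which is exactly the role DARTS BHT(info) plays.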

To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison. 

 

cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions 

This DNN turns differential alternative splicing into a supervised prediction problem whose features are the two kinds above, since together they determine whether a given exon ends up differentially spliced.

Features finally used: 2,926 cis sequence features and 1,498 annotated RBPs.
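To make the input concrete, here is a toy sketch of assembling one training example from those feature counts: the exon's cis sequence features plus the RBP expression levels of the two compared conditions. The concatenation order and the random placeholder values are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
cis = rng.random(2926)        # exon-specific cis sequence features
rbp_cond1 = rng.random(1498)  # RBP mRNA levels in condition 1
rbp_cond2 = rng.random(1498)  # RBP mRNA levels in condition 2

# One input vector per (exon, comparison) pair.
x = np.concatenate([cis, rbp_cond1, rbp_cond2])
print(x.shape)  # (5922,)
```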

 

What exactly are the training data used by the DNN?

large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium

We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons

The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN 

From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT(flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods). With that, the features for each exon have been extracted and the labels are available, so training can proceed.

 

Three alternative models were compared:

We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.
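For reference, the two classical baselines (L2-penalized logistic regression and random forest) can be set up in a few lines of scikit-learn. The toy data and the separable label rule here are assumptions; this only shows the model classes being compared, not the paper's actual training setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Toy dataset with a simple linear decision rule.
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Baseline 1: logistic regression with L2 penalty.
logistic = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
# Baseline 2: random forest.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(logistic.score(X, y) > 0.8, forest.score(X, y) > 0.8)  # True True
```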

 

On the Bayesian model:

incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).

Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data

 

If the paper still feels hazy, just run the code!

The first function, BHT:

Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/

The test_norep_data.txt file looks like this:

ID	GeneID	geneSymbol	chr	strand	exonStart_0base	exonEnd	upstreamES	upstreamEE	downstreamES	downstreamEE	ID	IJC_SAMPLE_1	SJC_SAMPLE_1	IJC_SAMPLE_2	SJC_SAMPLE_2	IncFormLen	SkipFormLen
82439	ENSG00000169045.17_1	HNRNPH1	chr5	-	179046269	179046408	179045145	179045324	179047892	179048036	82439	15236	319	6774	834	180	90
21374	ENSG00000131876.16_3	SNRPA1	chr15	-	101826418	101826498	101825930	101826006	101827112	101827215	21374	4105	118	292	54	169	90
32815	ENSG00000141027.20_3	NCOR1	chr17	-	15990485	15990659	15989712	15989756	15995176	15995232	32815	624	564	549	1261	180	90
43143	ENSG00000133731.9_2	IMPA1	chr8	-	82597997	82598198	82593732	82593819	82598486	82598518	43143	155	332	22	341	180	90
111671	ENSG00000100320.22_3	RBFOX2	chr22	-	36232366	36232486	36205826	36206051	36236238	36236460	111671	93	193	35	534	180	90

Each row is one splicing event (a single exon of a gene, non-redundant), followed by its attributes.

The output looks like this:

ID      I1      S1      I2      S2      inc_len skp_len psi1    psi2    delta.mle       post_pr
1225    160     0       169     6       180     90      1       0.934   -0.0663 0.4367
15829   52      58      12      41      180     90      0.31    0.128   -0.1819 0.8867
20347   1084    930     371     615     180     90      0.368   0.232   -0.1365 1
21374   4105    118     292     54      169     90      0.949   0.742   -0.2065 1
24817   177     275     263     741     143     90      0.288   0.183   -0.1057 0.974
32815   624     564     549     1261    180     90      0.356   0.179   -0.1774 1
43143   155     332     22      341     180     90      0.189   0.031   -0.158  1
46548   1685    4040    216     1752    180     90      0.173   0.058   -0.1145 1

Each row is the inference result for the corresponding input entry.
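A typical follow-up is to filter this output for high-confidence differential events by the posterior probability column. The 0.9 cutoff below is an illustrative choice, and the rows are a hand-copied subset of the output above.

```python
# (ID, delta.mle, post_pr) for a few events from the BHT output
rows = [
    ("1225", -0.0663, 0.4367),
    ("15829", -0.1819, 0.8867),
    ("20347", -0.1365, 1.0),
    ("21374", -0.2065, 1.0),
]

# Keep events whose posterior probability of differential splicing > 0.9.
significant = [event_id for event_id, delta, post_pr in rows if post_pr > 0.9]
print(significant)  # ['20347', '21374']
```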

 

The second function, DNN:

Download the model:

Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS  

Predict:

Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS  

The first file is the input feature file (*.h5) or Darts_BHT output (*.txt):

ID      I1      S1      I2      S2      inc_len skp_len mu.mle  delta.mle       post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685      581     0       462     0       155     99      1       0       0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815      28      0       49      2       126     99      1       -0.0493827160493827     0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340      2       37      0       81      119     99      0.0430341230167355      -0.0430341230167355     0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901    11      2       49      4       176     99      0.755725190839695       0.117542135892979       0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005    80      750     64      738     133     99      0.0735580941766509      -0.0129207126090368     0

The second file is the Kallisto expression file (gene symbol plus TPM per sample; note the header row lists only the sample names):

thymus  adipose
RPS11   2678.83013      2531.887535
ERAL1   14.350975       13.709394
DDX27   18.2573 14.02368
DEK     32.463558       14.520312
PSMA6   102.332592      77.089475
TRIM56  4.519675        6.14762566667
TRIM71  0.082009        0.0153936666667
UPF2    7.150812        5.23628033333
FARS2   6.332831        7.291382
ALKBH8  3.056208        1.27043633333
ZNF579  5.13265 8.248575

The result file: the first column is the ID, the second the true label, the third the predicted label:

ID      Y_true  Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437   1.000000        0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620   1.000000        0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140    0.947333        0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993    1.000000        0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213     0.937333        0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750   1.000000        0.450762

  

Reference:

The Expanding Landscape of Alternative Splicing Variation in Human Populations. 

  

The following is a fairly pure DL application:

Gene expression inference with deep learning | Inferring gene expression with deep learning

Example paper: Gene expression inference with deep learning

uci-cbcl/D-GEX - github

The deep-learning wind has been blowing for several years now, and by now it is widely accepted as very effective in medical image processing. So if you want to publish later on, your data must be big and impressive enough; methodological innovation has become very hard.

LINCS L1000 data

The core idea: this project measured the expression of fewer than a thousand genes, yet uses LR and DL to infer the expression of all the remaining ~30,000 genes, which is why those 978 measured genes are called landmark genes.
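The linear-regression version of this idea can be sketched as learning one weight vector per target gene from the landmark genes. The dimensions below are shrunk for illustration (D-GEX replaces this ordinary least squares step with a deep network); the synthetic data are an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples, n_landmark, n_target = 100, 30, 50   # shrunk from 978 / ~30,000

X = rng.random((n_samples, n_landmark))          # landmark gene expression
W_true = rng.random((n_landmark, n_target))      # unknown true mapping
Y = X @ W_true + 0.001 * rng.standard_normal((n_samples, n_target))

# Fit all target genes at once: one least-squares problem per column of Y.
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.abs(W_hat - W_true).max() < 0.05)  # recovery is accurate
```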
