Alternative splicing prediction has already become quite popular; at the moment there are two main schools of thought.
Preface (rambling)
Hot-topic pile-ups in research are a permanent fixture, arriving wave after wave; most people who refuse to chase trends and also fail to produce results end up being weeded out of the field.
Doing pure methods development is genuinely draining: it costs time, effort, and brainpower, especially when your own research area has already gone out of fashion. On top of that you have to put up with condescension from outsiders: "What is the point of working on this? You can't get it published, and in the end nobody will use it."
The biggest hot topic of recent years has been single-cell. Plenty of people rode that wind to pick up a few papers, and the earliest wave of method developers landed quite a few Nature Methods and NBT publications, with even more in Bioinformatics and NAR. Most of them then faded away: the barrier to entry kept dropping, more and more people piled in, and after a few years of development the field has settled into a three-way standoff in which the strong get stronger and the weak exit.
There is also an unwritten rule for methods papers: never write them too plainly. If most reviewers can understand the work at a glance, they will naturally conclude it is too simple to deserve publication. Ideally the paper is well argued yet opaque enough that 90% of reviewers cannot grasp it immediately, although it starts to make sense after careful thought. Take that as a joke.
Let's jump to another paper, one that uses deep learning to predict alternative splicing.
Deep-learning augmented RNA-seq analysis of transcript splicing
PDF
Background knowledge in the paper that deserves particular attention:
"Unlike methods that use cis sequence features to predict exon splicing patterns in specific samples (refs. 7–10)" — look at how previous work used cis sequence features to predict exon splicing patterns.
Related references:
The human splicing code reveals new insights into the genetic determinants of disease - 2015
Deciphering the splicing code - 2010
Deep learning of the tissue-regulated splicing code - 2014
BRIE: transcriptome-wide splicing quantification in single cells - 2017
Ha, the wind of applying deep learning to alternative splicing started blowing back in 2014, so you can hardly claim to be the first to eat the crab.
If you want to understand what AS (alternative splicing) is, just look at the tools currently being developed; they are bound to include figures that explain it in detail alongside their algorithms, and a picture is worth a thousand words.
MISO (Mixture of Isoforms) software documentation: it currently only supports Python 2, and if you install it with conda you also need to copy the miso_settings.txt file from the documentation.
There is a huge gap between biology and informatics; capable people quickly notice this gap and fill it.
Questions:
Why does AS identification depend on sequencing depth? Answering this requires understanding the mainstream AS detection algorithms.
How should we understand the concept of differential alternative splicing between samples?
How should we understand cis sequence features, and what information does that file contain?
How do we predict exon-inclusion/skipping levels in bulk tissues or single cells?
How should we understand "we hypothesized that large-scale RNA-seq resources can be used to construct a deep-learning model of differential alternative splicing"?
Two components:
a deep neural network (DNN) model that predicts differential alternative splicing between two conditions on the basis of exon-specific sequence features and sample-specific regulatory features
a Bayesian hypothesis testing (BHT) statistical model that infers differential alternative splicing by integrating empirical evidence in a specific RNA-seq dataset with prior probability of differential alternative splicing
During training, large-scale RNA-seq data are analyzed by the DARTS BHT with an uninformative prior (DARTS BHT(flat), with only RNA-seq data used for the inference) to generate training labels of high-confidence differential or unchanged splicing events between conditions, which are then used to train the DARTS DNN.
During application, the trained DARTS DNN is used to predict differential alternative splicing in a user-specific dataset.
This prediction is then incorporated as an informative prior with the observed RNA-seq read counts by the DARTS BHT (DARTS BHT(info)) for deep-learning-augmented splicing analysis.
Now it more or less makes sense: the first component, the BHT, is essentially an upgraded version of conventional differential-splicing tools (such as MISO and rMATS) and is used to manufacture labeled training data. Those BHT results are used to train the DNN model; new data can then be fed into the trained DNN, and its output serves as the prior for the downstream Bayesian model, while our own RNA-seq data update that prior into a posterior. If the prior is accurate enough, the update depends less heavily on the data, which is exactly why this method can compensate for insufficient RNA-seq sequencing depth.
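To make that prior-versus-likelihood intuition concrete, here is a toy Beta-Binomial sketch. It is not the actual DARTS BHT model (which compares inclusion levels between two conditions); it only shows that with shallow coverage an informative prior pulls the estimate, while with deep coverage the read counts dominate. All numbers are made up.

```python
# Toy illustration, not the DARTS BHT model: the exon-inclusion ratio psi
# gets a Beta(a, b) prior, inclusion/skipping read counts give a Binomial
# likelihood, so the posterior is Beta(a + I, b + S).

def posterior_mean(inc_reads, skip_reads, prior_a, prior_b):
    """Posterior mean of psi under a Beta(prior_a, prior_b) prior."""
    return (prior_a + inc_reads) / (prior_a + prior_b + inc_reads + skip_reads)

flat_prior = (1, 1)   # uninformative, in the spirit of DARTS BHT(flat)
info_prior = (8, 2)   # "the DNN thinks psi is high", in the spirit of BHT(info)

for inc, skip in [(3, 1), (300, 100)]:   # shallow vs deep coverage
    print(f"I={inc:<4d} S={skip:<4d} "
          f"flat prior -> {posterior_mean(inc, skip, *flat_prior):.3f}   "
          f"informative prior -> {posterior_mean(inc, skip, *info_prior):.3f}")
# With only 4 reads the informative prior shifts the estimate noticeably
# (0.667 vs 0.786); with 400 reads both posteriors converge to the data (~0.75).
```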
To generate training labels, we applied DARTS BHT(flat) to calculate the probability of an exon being differentially spliced or unchanged in each pairwise comparison.
cis sequence features and messenger RNA (mRNA) levels of trans RNA-binding proteins (RBPs) in two conditions
This DNN recasts alternative splicing as a regression problem. The features are the two kinds listed above, since they determine whether a given exon ends up being differentially spliced.
Features ultimately used: 2,926 cis sequence features and 1,498 annotated RBPs.
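As a rough picture of what the DNN consumes, the sketch below concatenates exon-specific cis sequence features with the RBP expression of the two compared conditions. The matrices, the log1p transform, and the exact ordering are my own assumptions for illustration; the real Darts_DNN downloads its own cisFeature/transFeature files (see the get_data command further below).

```python
# Illustrative only: random toy matrices stand in for the real cisFeature /
# transFeature files shipped with Darts_DNN.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
exons = ["chr1:-:10002681:10002840:10002738:10002840:9996576:9996685"]
cis = pd.DataFrame(rng.random((1, 2926)), index=exons)        # exons x cis sequence features
rbp = pd.DataFrame(rng.lognormal(size=(1498, 2)),
                   columns=["thymus", "adipose"])             # RBP TPM per condition

def build_input(exon_id, cond1, cond2):
    """Concatenate the exon's cis features with log-transformed RBP
    expression of the two compared conditions."""
    exon_vec = cis.loc[exon_id].to_numpy()                        # 2,926 values
    trans_vec = np.log1p(rbp[[cond1, cond2]].to_numpy()).ravel()  # 2 x 1,498 values
    return np.concatenate([exon_vec, trans_vec])                  # 5,922-dim DNN input

x = build_input(exons[0], "thymus", "adipose")
print(x.shape)   # (5922,)
```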
What exactly is the training data used by the DNN?
large-scale RBP-depletion RNA-seq data in two human cell lines (K562 and HepG2) generated by the ENCODE consortium
We used RNA-seq data of 196 RBPs depleted by short-hairpin RNA (shRNA) in both cell lines, corresponding to 408 knockdown-versus-control pairwise comparisons
The remaining ENCODE data, corresponding to 58 RBPs depleted in only one cell line, were excluded from training and used as leave-out data for independent evaluation of the DARTS DNN
From the high-confidence differentially spliced versus unchanged exons called by DARTS BHT(flat) (Supplementary Table 2), we used 90% of labeled events for training and fivefold cross-validation, and the remaining 10% of events for testing (Methods). So the features for each exon have been extracted and the labels are in hand, and training can proceed (a sketch of this split follows below).
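A minimal sketch of that split and cross-validation setup, assuming a feature matrix X and labels y have already been assembled; the random data below is only there so the snippet runs, and this is not the authors' training code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 50))              # hypothetical event features
y = rng.integers(0, 2, size=1000)       # hypothetical labels from DARTS BHT(flat)

# 90% for training plus fivefold cross-validation, 10% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (tr_idx, va_idx) in enumerate(cv.split(X_train, y_train)):
    # here the DARTS DNN would be trained on X_train[tr_idx]
    # and validated on X_train[va_idx]
    pass
```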
Three alternative models were used for comparison:
We used the leave-out data to compare the DARTS DNN with three alternative baseline methods: the identical DNN structure trained on individual leave-out datasets (DNN), logistic regression with L2 penalty (logistic), and random forest.
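The two classical baselines are easy to reproduce in spirit with scikit-learn; the snippet below reuses the hypothetical X_train/y_train/X_test/y_test from the previous sketch (the DNN baseline would instead retrain the same DARTS DNN architecture on the individual leave-out dataset).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

baselines = {
    "logistic (L2)": LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: AUROC = {score:.3f}")
```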
On the Bayesian model:
incorporating the DARTS DNN predictions as the informative prior, and observed RNA-seq read counts as the likelihood (DARTS BHT(info)).
Simulation studies demonstrated that the informative prior improves the inference when the observed data are limited, for instance, because of low levels of gene expression or limited RNA-seq depth, but does not overwhelm the evidence in the observed data
If the paper still reads like a blur, just run the code!
First function, the BHT:
Darts_BHT bayes_infer --darts-count test_data/test_norep_data.txt --od test_data/
The input file test_norep_data.txt looks like this:
ID GeneID geneSymbol chr strand exonStart_0base exonEnd upstreamES upstreamEE downstreamES downstreamEE ID IJC_SAMPLE_1 SJC_SAMPLE_1 IJC_SAMPLE_2 SJC_SAMPLE_2 IncFormLen SkipFormLen
82439 ENSG00000169045.17_1 HNRNPH1 chr5 - 179046269 179046408 179045145 179045324 179047892 179048036 82439 15236 319 6774 834 180 90
21374 ENSG00000131876.16_3 SNRPA1 chr15 - 101826418 101826498 101825930 101826006 101827112 101827215 21374 4105 118 292 54 169 90
32815 ENSG00000141027.20_3 NCOR1 chr17 - 15990485 15990659 15989712 15989756 15995176 15995232 32815 624 564 549 1261 180 90
43143 ENSG00000133731.9_2 IMPA1 chr8 - 82597997 82598198 82593732 82593819 82598486 82598518 43143 155 332 22 341 180 90
111671 ENSG00000100320.22_3 RBFOX2 chr22 - 36232366 36232486 36205826 36206051 36236238 36236460 111671 93 193 35 534 180 90
Each row is one exon-skipping event in one gene (non-redundant), followed by its attributes: the coordinates of the target exon and its upstream/downstream exons, the inclusion (IJC) and skipping (SJC) junction read counts in the two samples, and the effective lengths of the inclusion and skipping isoforms.
The output looks like this:
ID I1 S1 I2 S2 inc_len skp_len psi1 psi2 delta.mle post_pr
1225 160 0 169 6 180 90 1 0.934 -0.0663 0.4367
15829 52 58 12 41 180 90 0.31 0.128 -0.1819 0.8867
20347 1084 930 371 615 180 90 0.368 0.232 -0.1365 1
21374 4105 118 292 54 169 90 0.949 0.742 -0.2065 1
24817 177 275 263 741 143 90 0.288 0.183 -0.1057 0.974
32815 624 564 549 1261 180 90 0.356 0.179 -0.1774 1
43143 155 332 22 341 180 90 0.189 0.031 -0.158 1
46548 1685 4040 216 1752 180 90 0.173 0.058 -0.1145 1
Each row is the inference result for one input event: psi1 and psi2 are the estimated exon-inclusion levels in the two samples, delta.mle is the estimated change between them, and post_pr is the posterior probability of differential splicing.
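The psi columns can be reproduced from the raw counts with a length-normalized inclusion ratio. This is my own back-of-the-envelope check rather than code from the DARTS package, but it matches the row for event 21374 above:

```python
def psi(inc_reads, skip_reads, inc_len, skp_len):
    """Percent spliced in: inclusion/skipping junction counts normalized
    by the effective lengths of the inclusion and skipping isoforms."""
    inc = inc_reads / inc_len
    skp = skip_reads / skp_len
    return inc / (inc + skp)

print(round(psi(4105, 118, 169, 90), 3))  # 0.949 -> matches psi1 of ID 21374
print(round(psi(292, 54, 169, 90), 3))    # 0.742 -> matches psi2 of ID 21374
```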
Second function, the DNN:
Download the model and feature data:
Darts_DNN get_data -d transFeature cisFeature trainedParam -t A5SS
Predict:
Darts_DNN predict -i darts_bht.flat.txt -e RBP_tpm.txt -o pred.txt -t A5SS
The first file (-i) is the "Input feature file (*.h5) or Darts_BHT output (*.txt)"; here it is the Darts_BHT flat output:
ID I1 S1 I2 S2 inc_len skp_len mu.mle delta.mle post_pr
chr1:-:10002681:10002840:10002738:10002840:9996576:9996685 581 0 462 0 155 99 1 0 0
chr1:-:100176361:100176505:100176389:100176505:100174753:100174815 28 0 49 2 126 99 1 -0.0493827160493827 0.248
chr1:-:109556441:109556547:109556462:109556547:109553537:109554340 2 37 0 81 119 99 0.0430341230167355 -0.0430341230167355 0.188
chr1:-:11009680:11009871:11009758:11009871:11007699:11008901 11 2 49 4 176 99 0.755725190839695 0.117542135892979 0.329333333333333
chr1:-:11137386:11137500:11137421:11137500:11136898:11137005 80 750 64 738 133 99 0.0735580941766509 -0.0129207126090368 0
The second file (-e) is the Kallisto expression file (RBP TPM values for the two conditions):
thymus adipose
RPS11 2678.83013 2531.887535
ERAL1 14.350975 13.709394
DDX27 18.2573 14.02368
DEK 32.463558 14.520312
PSMA6 102.332592 77.089475
TRIM56 4.519675 6.14762566667
TRIM71 0.082009 0.0153936666667
UPF2 7.150812 5.23628033333
FARS2 6.332831 7.291382
ALKBH8 3.056208 1.27043633333
ZNF579 5.13265 8.248575
The result file: the first column is the event ID, the second the true label (Y_true), and the third the predicted score (Y_pred):
ID Y_true Y_pred
chr22:-:39136893:39137055:39137011:39137055:39136271:39136437 1.000000 0.318161
chr12:-:69326921:69326979:69326949:69326979:69326457:69326620 1.000000 0.073966
chr3:-:49053236:49053305:49053251:49053305:49052920:49053140 0.947333 0.295664
chr4:-:68358468:68358715:68358586:68358715:68357897:68357993 1.000000 0.304907
chr11:-:124972532:124972705:124972629:124972705:124972027:124972213 0.937333 0.365548
chr15:+:43695880:43696040:43695880:43695997:43696610:43696750 1.000000 0.450762
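One simple way to eyeball pred.txt, assuming Y_true behaves like a posterior probability from DARTS BHT(flat): binarize it at an arbitrary cutoff (0.9 here, my own choice for illustration) and score the DNN predictions with an AUROC. This is not the authors' evaluation script.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

pred = pd.read_csv("pred.txt", sep=r"\s+")      # columns: ID, Y_true, Y_pred
labels = (pred["Y_true"] > 0.9).astype(int)     # illustrative cutoff, not from the paper
if labels.nunique() > 1:                        # AUROC needs both classes present
    print("AUROC:", roc_auc_score(labels, pred["Y_pred"]))
else:
    print("Only one class after thresholding; AUROC undefined.")
```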
References:
The Expanding Landscape of Alternative Splicing Variation in Human Populations.
This one is a more purely DL application:
Gene expression inference with deep learning
Case-study paper: Gene expression inference with deep learning
uci-cbcl/D-GEX - github
The deep-learning wind has been blowing for a few years now, and it is by now widely acknowledged to work very well for medical image analysis; so if you want to publish later on, your data had better be big and impressive enough, because purely methodological innovation has become very hard.
LINCS L1000 data
The core idea: this project measured the expression of fewer than a thousand genes, yet uses LR and DL to infer the expression of all of the other roughly 30,000 genes, which is why the 978 measured genes are called landmark genes.
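To make the landmark-gene idea concrete, here is a minimal sketch under toy assumptions: a multi-output linear regression maps the 978 landmark genes to the remaining target genes, and D-GEX essentially replaces this linear map with a deep network. The matrices below are random stand-ins, not LINCS data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples, n_landmark, n_target = 100, 978, 2000   # toy sizes; the real target count is much larger

X_landmark = rng.normal(size=(n_samples, n_landmark))                     # measured landmark genes
true_map = rng.normal(size=(n_landmark, n_target)) / np.sqrt(n_landmark)  # synthetic ground truth
Y_target = X_landmark @ true_map + 0.1 * rng.normal(size=(n_samples, n_target))

model = LinearRegression().fit(X_landmark, Y_target)   # the "LR" baseline
Y_hat = model.predict(X_landmark)                      # inferred expression of target genes
print(Y_hat.shape)                                     # (100, 2000)
```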