Laplacian和PCA貌似是同一種性質的方法,座標系變換。html
最近在看SC3聚類這篇文章,SC3使用了這個工具。算法
SC3: consensus clustering of single-cell RNA-seq dataexpress
All distance matrices are then transformed using either principal component analysis (PCA) or by calculating the eigenvectors of the associated graph Laplacian (L = I – D–1/2AD–1/2, where I is the identity matrix, A is a similarity matrix (A = e–A′/max(A′)), where A′ is a distance matrix) and D is the degree matrix of A, a diagonal matrix that contains the row-sums of A on the diagonal (Dii = ΣjAij). The columns of the resulting matrices are then sorted in ascending order by their corresponding eigenvalues.dom
先看下該工具的功能:SC3 package manualide
跑一下常規代碼:函數
library(SingleCellExperiment) library(SC3) library(scater) head(ann) yan[1:3, 1:3] # create a SingleCellExperiment object sce <- SingleCellExperiment( assays = list( counts = as.matrix(yan), logcounts = log2(as.matrix(yan) + 1) ), colData = ann ) # define feature names in feature_symbol column rowData(sce)$feature_symbol <- rownames(sce) # remove features with duplicated names sce <- sce[!duplicated(rowData(sce)$feature_symbol), ] # define spike-ins isSpike(sce, "ERCC") <- grepl("ERCC", rowData(sce)$feature_symbol) plotPCA(sce, colour_by = "cell_type1") sce <- sc3(sce, ks = 2:4, biology = TRUE) # sc3_interactive(sce) # sc3_export_results_xls(sce) ###################################### sce <- sc3_prepare(sce) sce <- sc3_estimate_k(sce) sce <- sc3_calc_dists(sce) names(metadata(sce)$sc3$distances) sce <- sc3_calc_transfs(sce) names(metadata(sce)$sc3$transformations) metadata(sce)$sc3$distances sce <- sc3_kmeans(sce, ks = 2:4) names(metadata(sce)$sc3$kmeans) col_data <- colData(sce) head(col_data[ , grep("sc3_", colnames(col_data))]) sce <- sc3_calc_consens(sce) names(metadata(sce)$sc3$consensus) names(metadata(sce)$sc3$consensus$`3`) col_data <- colData(sce) head(col_data[ , grep("sc3_", colnames(col_data))]) sce <- sc3_calc_biology(sce, ks = 2:4) sce <- sc3_run_svm(sce, ks = 2:4) col_data <- colData(sce) head(col_data[ , grep("sc3_", colnames(col_data))])
接下來會嘗試拆一下該工具。工具
怎麼拆這個工具?idea
這種封裝的很好的R包其實比較難拆,通常的經過函數名字就能夠看到R代碼,但這裏你輸入函數名,如sc3_calc_dists,看到的只是如下的封裝好的代碼:spa
new("nonstandardGenericFunction", .Data = function (object) { standardGeneric("sc3_calc_dists") }, generic = structure("sc3_calc_dists", package = "SC3"), package = "SC3", group = list(), valueClass = character(0), signature = "object", default = NULL, skeleton = (function (object) stop("invalid call in method dispatch to 'sc3_calc_dists' (no default method)", domain = NA))(object))
暫時還不熟悉這種形式,因此只能經過函數名去GitHub裏面查了。插件
GitHub真的很優秀,能夠直接查文件內部代碼,能夠很快定位到sc3_calc_dists。
再配合這個目錄插件,效率提升了很多,https://www.octotree.io/?utm_source=lite&utm_medium=extension
如下是封裝前的代碼:
#' Calculate distances between the cells. #' #' This function calculates distances between the cells. It #' creates and populates the following items of the \code{sc3} slot of the \code{metadata(object)}: #' \itemize{ #' \item \code{distances} - contains a list of distance matrices corresponding to #' Euclidean, Pearson and Spearman distances. #' } #' #' @name sc3_calc_dists #' @aliases sc3_calc_dists, sc3_calc_dists,SingleCellExperiment-method #' #' @param object an object of \code{SingleCellExperiment} class #' #' @return an object of \code{SingleCellExperiment} class #' #' @importFrom doRNG %dorng% #' @importFrom foreach foreach %dopar% #' @importFrom parallel makeCluster stopCluster #' @importFrom doParallel registerDoParallel sc3_calc_dists.SingleCellExperiment <- function(object) { dataset <- get_processed_dataset(object) # check whether in the SVM regime if (!is.null(metadata(object)$sc3$svm_train_inds)) { dataset <- dataset[, metadata(object)$sc3$svm_train_inds] } # NULLing the variables to avoid notes in R CMD CHECK i <- NULL distances <- c("euclidean", "pearson", "spearman") message("Calculating distances between the cells...") if (metadata(object)$sc3$n_cores > length(distances)) { n_cores <- length(distances) } else { n_cores <- metadata(object)$sc3$n_cores } cl <- parallel::makeCluster(n_cores, outfile = "") doParallel::registerDoParallel(cl, cores = n_cores) # calculate distances in parallel dists <- foreach::foreach(i = distances) %dorng% { try({ calculate_distance(dataset, i) }) } # stop local cluster parallel::stopCluster(cl) names(dists) <- distances metadata(object)$sc3$distances <- dists return(object) } #' @rdname sc3_calc_dists #' @aliases sc3_calc_dists setMethod("sc3_calc_dists", signature(object = "SingleCellExperiment"), sc3_calc_dists.SingleCellExperiment)
經過setMethod連接到一塊兒的。
順路找到了原函數:
#' Calculate a distance matrix #' #' Distance between the cells, i.e. columns, in the input expression matrix are #' calculated using the Euclidean, Pearson and Spearman metrics to construct #' distance matrices. #' #' @param data expression matrix #' @param method one of the distance metrics: 'spearman', 'pearson', 'euclidean' #' @return distance matrix #' #' @importFrom stats cor dist #' #' @useDynLib SC3 #' @importFrom Rcpp sourceCpp #' calculate_distance <- function(data, method) { return(if (method == "spearman") { as.matrix(1 - cor(data, method = "spearman")) } else if (method == "pearson") { as.matrix(1 - cor(data, method = "pearson")) } else { ED2(data) }) } #' Distance matrix transformation #' #' All distance matrices are transformed using either principal component #' analysis (PCA) or by calculating the #' eigenvectors of the graph Laplacian (Spectral). #' The columns of the resulting matrices are then sorted in #' descending order by their corresponding eigenvalues. #' #' @param dists distance matrix #' @param method transformation method: either 'pca' or #' 'laplacian' #' @return transformed distance matrix #' #' @importFrom stats prcomp cmdscale #' transformation <- function(dists, method) { if (method == "pca") { t <- prcomp(dists, center = TRUE, scale. = TRUE) return(t$rotation) } else if (method == "laplacian") { L <- norm_laplacian(dists) l <- eigen(L) # sort eigenvectors by their eigenvalues return(l$vectors[, order(l$values)]) } } #' Calculate consensus matrix #' #' Consensus matrix is calculated using the Cluster-based Similarity #' Partitioning Algorithm (CSPA). For each clustering solution a binary #' similarity matrix is constructed from the corresponding cell labels: #' if two cells belong to the same cluster, their similarity is 1, otherwise #' the similarity is 0. A consensus matrix is calculated by averaging all #' similarity matrices. #' #' @param clusts a matrix containing clustering solutions in columns #' @return consensus matrix #' #' @useDynLib SC3 #' @importFrom Rcpp sourceCpp #' @export consensus_matrix <- function(clusts) { res <- consmx(clusts) colnames(res) <- as.character(c(1:nrow(clusts))) rownames(res) <- as.character(c(1:nrow(clusts))) return(res) }
都在這裏。。。
ED2是他們實驗室本身用Rcpp寫的一個計算歐氏距離的工具。
transformation輸入的是對稱的距離矩陣(行列都是樣本細胞),而後作完PCA,返回了rotation,不知道這樣作有什麼意義?
還真有用PCA來處理距離類似度矩陣的,MDS,目的就是降維,由於後面要用kmean聚類;
而後對每個降維了的矩陣用kmeans;
consensus用的是這個算法:Cluster-based Similarity Partitioning Algorithm (CSPA),作這個的意義何在?輸入是每一個細胞的多重聚類結果,而後作了一個一致性統一。
參考: