Title: | Statistical Inference for Unsupervised Learning |
---|---|
Description: | Test for association between the observed data and their estimated latent variables. The jackstraw package provides a resampling strategy and testing scheme to estimate statistical significance of association between the observed data and their latent variables. Depending on the data type and the analysis aim, the latent variables may be estimated by principal component analysis (PCA), factor analysis (FA), K-means clustering, and related unsupervised learning algorithms. The jackstraw methods learn over-fitting characteristics inherent in this circular analysis, where the observed data are used to estimate the latent variables and used again to test against that estimated latent variables. When latent variables are estimated by PCA, the jackstraw enables statistical testing for association between observed variables and latent variables, as estimated by low-dimensional principal components (PCs). This essentially leads to identifying variables that are significantly associated with PCs. Similarly, unsupervised clustering, such as K-means clustering, partition around medoids (PAM), and others, finds coherent groups in high-dimensional data. The jackstraw estimates statistical significance of cluster membership, by testing association between data and cluster centers. Clustering membership can be improved by using the resulting jackstraw p-values and posterior inclusion probabilities (PIPs), with an application to unsupervised evaluation of cell identities in single cell RNA-seq (scRNA-seq). |
Authors: | Neo Christopher Chung [aut, cre] , John D. Storey [aut] , Wei Hao [aut], Alejandro Ochoa [aut] |
Maintainer: | Neo Christopher Chung <[email protected]> |
License: | GPL-2 |
Version: | 1.3.17.9000 |
Built: | 2024-11-08 17:30:05 UTC |
Source: | https://github.com/ncchung/jackstraw |
There are a wide range of algorithms and visual techniques to identify a number of clusters or principal components embedded in the observed data.
find_k()
find_k()
It is critical to explore the eigenvalues, cluster stability, and visualization.
See R packages bootcluster
, EMCluster
, and nFactors
.
Please see the R package SC3
, which provides estkTW()
function to
find the number of significant eigenvalues according to the Tracy-Widom test.
ADPclust
package includes adpclust()
function that runs the algorithm
on a range of K values. It helps you to identify the most suitable number of clusters.
This package also provides an alternative methods in permutationPA
.
Through a resampling-based Parallel Analysis, it finds a number of significant components.
Test for association between the observed data and their estimated latent variables. The jackstraw package provides a resampling strategy and testing scheme to estimate statistical significance of association between the observed data and their latent variables. Depending on the data type and the analysis aim, the latent variables may be estimated by principal component analysis (PCA), factor analysis (FA), K-means clustering, and related unsupervised learning algorithms. The jackstraw methods learn over-fitting characteristics inherent in this circular analysis, where the observed data are used to estimate the latent variables and used again to test against that estimated latent variables. When latent variables are estimated by PCA, the jackstraw enables statistical testing for association between observed variables and latent variables, as estimated by low-dimensional principal components (PCs). This essentially leads to identifying variables that are significantly associated with PCs. Similarly, unsupervised clustering, such as K-means clustering, partition around medoids (PAM), and others, finds coherent groups in high-dimensional data. The jackstraw estimates statistical significance of cluster membership, by testing association between data and cluster centers. Clustering membership can be improved by using the resulting jackstraw p-values and posterior inclusion probabilities (PIPs), with an application to unsupervised evaluation of cell identities in single cell RNA-seq (scRNA-seq).
The jackstraw package provides a resampling strategy and testing scheme to estimate statistical significance of association between the observed data and their latent variables. Depending on the data type and the analysis aim, the latent variables may be estimated by principal component analysis (PCA), K-means clustering, and related algorithms. The jackstraw methods learn over-fitting characteristics inherent in this circular analysis, where the observed data are used to estimate the latent variables and used again to test against those estimated latent variables.
The jackstraw tests enable us to identify the data features (i.e., variables or observations) that are
driving systematic variation, in a completely unsupervised manner. Using jackstraw_pca, we can
find statistically significant features with regard to the top r
principal components.
Alternatively, jackstraw_kmeans can identify the data features that are statistically significant
members of the data-dependent clusters. Furthermore, this package includes more general algorithms such as
jackstraw_subspace for the dimension reduction techniques and jackstraw_cluster for the clustering algorithms.
Overall, it computes m
p-values of association between the m
data features and their corresponding latent variables.
From m
p-values, pip computes posterior inclusion probabilities, that are useful for feature selection and visualization.
Neo Christopher Chung [email protected]
Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 doi:10.1093/bioinformatics/btu674
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
jackstraw_pca jackstraw_subspace jackstraw_kmeans jackstraw_cluster
Test association between the observed variables and population structure estimated by ALStructure.
jackstraw_alstructure( dat, r, FUN, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE )
jackstraw_alstructure( dat, r, FUN, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE )
dat |
a genotype matrix with |
r |
a number of significant LFs. |
FUN |
a function to ALStructure |
r1 |
a numeric vector of LFs of interest (implying you are not interested in all |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. There will be a total of |
covariate |
a data matrix of covariates with corresponding |
verbose |
a logical specifying to print the computational progress. |
This function uses ALStructure from Cabreros and Storey (2019). A deviation dev
in logistic regression
(the full model with r
LFs vs. the intercept-only model) is used to assess association.
This function also requires the Bioconductor gcatest
package to be installed.
jackstraw_alstructure
returns a list consisting of
p.value |
|
obs.stat |
|
null.stat |
|
Neo Christopher Chung [email protected]
Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 doi:10.1093/bioinformatics/btu674
## Not run: # load genotype data to analyze (not shown) into this variable X # choose the number of ancestries r <- 3 # load alstructure package (install from https://github.com/StoreyLab/alstructure) library(alstructure) # define the function this way, a function of the genotype matrix only FUN <- function(x) t( alstructure(x, d_hat = r)$Q_hat ) # calculate p-values (and other statistics) for each SNP out <- jackstraw_alstructure( X, r, FUN ) ## End(Not run)
## Not run: # load genotype data to analyze (not shown) into this variable X # choose the number of ancestries r <- 3 # load alstructure package (install from https://github.com/StoreyLab/alstructure) library(alstructure) # define the function this way, a function of the genotype matrix only FUN <- function(x) t( alstructure(x, d_hat = r)$Q_hat ) # calculate p-values (and other statistics) for each SNP out <- jackstraw_alstructure( X, r, FUN ) ## End(Not run)
Test the cluster membership using a user-defined clustering algorithm
jackstraw_cluster( dat, k, cluster, centers, algorithm = function(x, centers, ...) stats::kmeans(x, centers, ...), s = 1, B = 1000, center = TRUE, noise = NULL, covariate = NULL, pool = TRUE, verbose = FALSE, ... )
jackstraw_cluster( dat, k, cluster, centers, algorithm = function(x, centers, ...) stats::kmeans(x, centers, ...), s = 1, B = 1000, center = TRUE, noise = NULL, covariate = NULL, pool = TRUE, verbose = FALSE, ... )
dat |
a data matrix with |
k |
a number of clusters. |
cluster |
a vector of cluster assignments. |
centers |
a matrix of all cluster centers. |
algorithm |
a clustering algorithm to use, where an output must include |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. |
center |
a logical specifying to center the rows. By default, |
noise |
specify a parametric distribution to generate a noise term. If |
covariate |
a model matrix of covariates with |
pool |
a logical specifying to pool the null statistics across all clusters. By default, |
verbose |
a logical specifying to print the computational progress. By default, |
... |
additional, optional arguments to |
The clustering algorithms assign m
rows into K
clusters. This function enable statistical
evaluation if the cluster membership is correctly assigned. Each of m
p-values refers to
the statistical test of that row with regard to its assigned cluster.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of clusters from the observed data
and protects against an anti-conservative bias.
The user is expected to explore the data with a given clustering algorithm and
determine the number of clusters k
.
Furthermore, provide cluster
and centers
as given by applying algorithm
onto dat
.
The rows of centers
correspond to k
clusters, as well as available levels in cluster
.
This function allows you to specify a parametric distribution of a noise term. It is an experimental feature.
jackstraw_cluster
returns a list consisting of
F.obs |
|
F.null |
F null statistics between null variables and cluster centers, from the jackstraw method. |
p.F |
|
Neo Christopher Chung [email protected]
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
Test association between the observed variables and their latent variables captured by principal components (PCs). PCs are computed using the augmented implicitly restarted Lanczos bidiagonalization algorithm (IRLBA; see irlba
).
jackstraw_irlba( dat, r = NULL, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE, ... )
jackstraw_irlba( dat, r = NULL, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE, ... )
dat |
a data matrix with |
r |
a number (a positive integer) of significant principal components. See permutationPA and other methods. |
r1 |
a numeric vector of principal components of interest. Choose a subset of |
s |
a number (a positive integer) of “synthetic” null variables. Out of |
B |
a number (a positive integer) of resampling iterations. There will be a total of |
covariate |
a data matrix of covariates with corresponding |
verbose |
a logical specifying to print the computational progress. |
... |
additional arguments to |
This function computes m
p-values of linear association between m
variables and their PCs.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of PCs from the observed data
and protects against an anti-conservative bias.
Provide the data matrix, with m
variables as rows and n
observations as columns.
Given that there are r
significant PCs, this function tests for linear association between
m
variables and their r
PCs.
You could specify a subset of significant PCs that you are interested in (r1
). If r1
is given,
then this function computes statistical significance of association between m
variables and r1
,
while adjusting for other PCs (i.e., significant PCs that are not your interest).
For example, if you want to identify variables associated with first and second PCs,
when your data contains three significant PCs, set r=3
and r1=c(1,2)
.
Please take a careful look at your data and use appropriate graphical and statistical criteria
to determine a number of significant PCs, r
. The number of significant PCs depends on the data structure and the context.
In a case when you fail to specify r
, it will be estimated from a permutation test (Buja and Eyuboglu, 1992)
using a function permutationPA.
If s
is not supplied, s
is set to about 10\
If B
is not supplied, B
is set to m*10/s
.
jackstraw_irlba
returns a list consisting of
p.value |
|
obs.stat |
|
null.stat |
|
Neo Christopher Chung [email protected]
Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 doi:10.1093/bioinformatics/btu674
jackstraw jackstraw_subspace permutationPA
## simulate data from a latent variable model: Y = BL + E B = c(rep(1,10),rep(-1,10), rep(0,180)) L = rnorm(20) E = matrix(rnorm(200*20), nrow=200) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw out = jackstraw_irlba(dat, r=1) ## Use optional arguments ## For example, set s and B for a balance between speed of the algorithm and accuracy of p-values ## Not run: ## out = jackstraw_irlba(dat, r=1, s=10, B=200) ## End(Not run)
## simulate data from a latent variable model: Y = BL + E B = c(rep(1,10),rep(-1,10), rep(0,180)) L = rnorm(20) E = matrix(rnorm(200*20), nrow=200) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw out = jackstraw_irlba(dat, r=1) ## Use optional arguments ## For example, set s and B for a balance between speed of the algorithm and accuracy of p-values ## Not run: ## out = jackstraw_irlba(dat, r=1, s=10, B=200) ## End(Not run)
Test the cluster membership for K-means clustering
jackstraw_kmeans( dat, kmeans.dat, s = NULL, B = NULL, center = FALSE, covariate = NULL, match = TRUE, pool = TRUE, verbose = FALSE, ... )
jackstraw_kmeans( dat, kmeans.dat, s = NULL, B = NULL, center = FALSE, covariate = NULL, match = TRUE, pool = TRUE, verbose = FALSE, ... )
dat |
a matrix with |
kmeans.dat |
an output from applying |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. |
center |
a logical specifying to center the rows of the null samples. By default, |
covariate |
a model matrix of covariates with |
match |
a logical specifying to match the observed clusters and jackstraw clusters using minimum Euclidean distances. |
pool |
a logical specifying to pool the null statistics across all clusters. By default, |
verbose |
a logical specifying to print the computational progress. By default, |
... |
optional arguments to control the k-means clustering algorithm (refers to |
K-means clustering assign m
rows into K
clusters. This function enable statistical
evaluation if the cluster membership is correctly assigned. Each of m
p-values refers to
the statistical test of that row with regard to its assigned cluster.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of clusters from the observed data
and protects against an anti-conservative bias.
The input data (dat
) must be of a class matrix
.
jackstraw_kmeans
returns a list consisting of
F.obs |
|
F.null |
F null statistics between null variables and cluster centers, from the jackstraw method. |
p.F |
|
Neo Christopher Chung [email protected]
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
## Not run: dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) kmeans.dat <- kmeans(dat, centers=2, nstart = 10, iter.max = 100) jackstraw.out <- jackstraw_kmeans(dat, kmeans.dat) ## End(Not run)
## Not run: dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) kmeans.dat <- kmeans(dat, centers=2, nstart = 10, iter.max = 100) jackstraw.out <- jackstraw_kmeans(dat, kmeans.dat) ## End(Not run)
Test the cluster membership for K-means clustering, using K-means++ initialization
jackstraw_kmeanspp( dat, kmeans.dat, s = NULL, B = NULL, center = TRUE, covariate = NULL, verbose = FALSE, pool = TRUE, ... )
jackstraw_kmeanspp( dat, kmeans.dat, s = NULL, B = NULL, center = TRUE, covariate = NULL, verbose = FALSE, pool = TRUE, ... )
dat |
a matrix with |
kmeans.dat |
an output from applying |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. |
center |
a logical specifying to center the rows. By default, |
covariate |
a model matrix of covariates with |
verbose |
a logical specifying to print the computational progress. By default, |
pool |
a logical specifying to pool the null statistics across all clusters. By default, |
... |
optional arguments to control the k-means clustering algorithm (refers to |
K-means clustering assign m
rows into K
clusters. This function enable statistical
evaluation if the cluster membership is correctly assigned. Each of m
p-values refers to
the statistical test of that row with regard to its assigned cluster.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of clusters from the observed data
and protects against an anti-conservative bias.
Generally, it functions identical to jackstraw_kmeans
, but this uses ClusterR::KMeans_rcpp
instead of stats::kmeans
.
A speed improvement is gained by K-means++ initialization and RcppArmadillo
. If the input data is still too large,
consider using jackstraw_MiniBatchKmeans
.
The input data (dat
) must be of a class matrix
.
jackstraw_kmeanspp
returns a list consisting of
F.obs |
|
F.null |
F null statistics between null variables and cluster centers, from the jackstraw method. |
p.F |
|
Neo Christopher Chung [email protected]
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
## Not run: library(ClusterR) dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) kmeans.dat <- KMeans_rcpp(dat, clusters = 10, num_init = 1, max_iters = 100, initializer = 'kmeans++') jackstraw.out <- jackstraw_kmeanspp(dat, kmeans.dat) ## End(Not run)
## Not run: library(ClusterR) dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) kmeans.dat <- KMeans_rcpp(dat, clusters = 10, num_init = 1, max_iters = 100, initializer = 'kmeans++') jackstraw.out <- jackstraw_kmeanspp(dat, kmeans.dat) ## End(Not run)
Test association between the observed variables and their latent variables captured by logistic factors (LFs).
jackstraw_lfa( dat, r, FUN, r1 = NULL, s = NULL, B = NULL, covariate = NULL, permute_alleles = TRUE, verbose = TRUE )
jackstraw_lfa( dat, r, FUN, r1 = NULL, s = NULL, B = NULL, covariate = NULL, permute_alleles = TRUE, verbose = TRUE )
dat |
either a genotype matrix with |
r |
a number of significant LFs. |
FUN |
a function to use for LFA. |
r1 |
a numeric vector of LFs of interest (implying you are not interested in all |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. There will be a total of |
covariate |
a data matrix of covariates with corresponding |
permute_alleles |
If TRUE (default), alleles (rather than genotypes) are permuted, which results in a more Binomial synthetic null when data is highly structured. Changing to FALSE is not recommended, except for research purposes to confirm that it performs worse than the default. |
verbose |
a logical specifying to print the computational progress. |
This function uses logistic factor analysis (LFA) from Hao et al. (2016).
Particularly, the deviance in logistic regression (the full model with r
LFs vs. the intercept-only model) is used to assess significance.
This function requires the gcatest
package, and in practice also the lfa
package, to be installed from Bioconductor.
The random outputs of the regular matrix versus the BEDMatrix
versions are equal in distribution.
However, fixing a seed and providing the same data to both versions does not result in the same exact outputs.
This is because the BEDMatrix
version permutes loci in a different order by necessity.
jackstraw_lfa
returns a list consisting of
p.value |
|
obs.stat |
|
null.stat |
|
Neo Christopher Chung [email protected]
Alejandro Ochoa [email protected]
Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 doi:10.1093/bioinformatics/btu674
jackstraw_pca jackstraw jackstraw_subspace
## Not run: ## simulate genotype data from a logistic factor model: drawing rbinom from logit(BL) m <- 5000; n <- 100; pi0 <- .9 m0 <- round(m*pi0) m1 <- m - round(m*pi0) B <- matrix(0, nrow=m, ncol=1) B[1:m1,] <- matrix(runif(m1*n, min=-.5, max=.5), nrow=m1, ncol=n) L <- matrix(rnorm(n), nrow=1, ncol=n) BL <- B %*% L prob <- exp(BL)/(1+exp(BL)) dat <- matrix(rbinom(m*n, 2, as.numeric(prob)), m, n) # load lfa package (install from Bioconductor) library(lfa) # choose the number of logistic factors, including the intercept r <- 2 # define the function this way, a function of the genotype matrix only FUN <- function(x) lfa( x, r ) ## apply the jackstraw_lfa out <- jackstraw_lfa( dat, r, FUN ) # if you had very large genotype data in plink BED/BIM/FAM files, # use BEDMatrix and save memory by reading from disk (at the expense of speed) library(BEDMatrix) dat_BM <- BEDMatrix( 'filepath' ) # assumes filepath.bed, .bim and .fam exist # run jackstraw! out <- jackstraw_lfa( dat_BM, r, FUN ) ## End(Not run)
## Not run: ## simulate genotype data from a logistic factor model: drawing rbinom from logit(BL) m <- 5000; n <- 100; pi0 <- .9 m0 <- round(m*pi0) m1 <- m - round(m*pi0) B <- matrix(0, nrow=m, ncol=1) B[1:m1,] <- matrix(runif(m1*n, min=-.5, max=.5), nrow=m1, ncol=n) L <- matrix(rnorm(n), nrow=1, ncol=n) BL <- B %*% L prob <- exp(BL)/(1+exp(BL)) dat <- matrix(rbinom(m*n, 2, as.numeric(prob)), m, n) # load lfa package (install from Bioconductor) library(lfa) # choose the number of logistic factors, including the intercept r <- 2 # define the function this way, a function of the genotype matrix only FUN <- function(x) lfa( x, r ) ## apply the jackstraw_lfa out <- jackstraw_lfa( dat, r, FUN ) # if you had very large genotype data in plink BED/BIM/FAM files, # use BEDMatrix and save memory by reading from disk (at the expense of speed) library(BEDMatrix) dat_BM <- BEDMatrix( 'filepath' ) # assumes filepath.bed, .bim and .fam exist # run jackstraw! out <- jackstraw_lfa( dat_BM, r, FUN ) ## End(Not run)
Test the cluster membership for K-means clustering
jackstraw_MiniBatchKmeans( dat, MiniBatchKmeans.output = NULL, s = NULL, B = NULL, center = TRUE, covariate = NULL, verbose = FALSE, batch_size = floor(nrow(dat)/100), initializer = "kmeans++", pool = TRUE, ... )
jackstraw_MiniBatchKmeans( dat, MiniBatchKmeans.output = NULL, s = NULL, B = NULL, center = TRUE, covariate = NULL, verbose = FALSE, batch_size = floor(nrow(dat)/100), initializer = "kmeans++", pool = TRUE, ... )
dat |
a data matrix with |
MiniBatchKmeans.output |
an output from applying |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. |
center |
a logical specifying to center the rows. By default, |
covariate |
a model matrix of covariates with |
verbose |
a logical specifying to print the computational progress. By default, |
batch_size |
the size of the mini batches. |
initializer |
the method of initialization. By default, |
pool |
a logical specifying to pool the null statistics across all clusters. By default, |
... |
optional arguments to control the Mini Batch K-means clustering algorithm (refers to |
K-means clustering assign m
rows into K
clusters. This function enable statistical
evaluation if the cluster membership is correctly assigned. Each of m
p-values refers to
the statistical test of that row with regard to its assigned cluster.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of clusters from the observed data
and protects against an anti-conservative bias.
jackstraw_MiniBatchKmeans
returns a list consisting of
F.obs |
|
F.null |
F null statistics between null variables and cluster centers, from the jackstraw method. |
p.F |
|
Neo Christopher Chung [email protected]
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
## Not run: library(ClusterR) dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) MiniBatchKmeans.output <- MiniBatchKmeans(data=dat, clusters = 2, batch_size = 300, initializer = "kmeans++") jackstraw.output <- jackstraw_MiniBatchKmeans(dat, MiniBatchKmeans.output = MiniBatchKmeans.output) ## End(Not run)
## Not run: library(ClusterR) dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) MiniBatchKmeans.output <- MiniBatchKmeans(data=dat, clusters = 2, batch_size = 300, initializer = "kmeans++") jackstraw.output <- jackstraw_MiniBatchKmeans(dat, MiniBatchKmeans.output = MiniBatchKmeans.output) ## End(Not run)
Test the cluster membership for Partitioning Around Medoids (PAM)
jackstraw_pam( dat, pam.dat, s = NULL, B = NULL, center = TRUE, covariate = NULL, verbose = FALSE, pool = TRUE, ... )
jackstraw_pam( dat, pam.dat, s = NULL, B = NULL, center = TRUE, covariate = NULL, verbose = FALSE, pool = TRUE, ... )
dat |
a matrix with |
pam.dat |
an output from applying |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. |
center |
a logical specifying to center the rows. By default, |
covariate |
a model matrix of covariates with |
verbose |
a logical specifying to print the computational progress. By default, |
pool |
a logical specifying to pool the null statistics across all clusters. By default, |
... |
optional arguments to control the k-means clustering algorithm (refers to |
PAM assigns m
rows into K
clusters. This function enable statistical
evaluation if the cluster membership is correctly assigned. Each of m
p-values refers to
the statistical test of that row with regard to its assigned cluster.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of clusters from the observed data
and protects against an anti-conservative bias.
For a large dataset, PAM could be too slow. Consider using cluster::clara
and jackstraw::jackstraw_clara
.
The input data (dat
) must be of a class matrix
.
jackstraw_pam
returns a list consisting of
F.obs |
|
F.null |
F null statistics between null variables and cluster medoids, from the jackstraw method. |
p.F |
|
Neo Christopher Chung [email protected]
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
## Not run: library(cluster) dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) pam.dat <- pam(dat, k=2) jackstraw.out <- jackstraw_pam(dat, pam.dat = pam.dat) ## End(Not run)
## Not run: library(cluster) dat = t(scale(t(Jurkat293T), center=TRUE, scale=FALSE)) pam.dat <- pam(dat, k=2) jackstraw.out <- jackstraw_pam(dat, pam.dat = pam.dat) ## End(Not run)
Test association between the observed variables and their latent variables captured by principal components (PCs).
jackstraw_pca( dat, r = NULL, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE )
jackstraw_pca( dat, r = NULL, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE )
dat |
a data matrix with |
r |
a number (a positive integer) of significant principal components. See permutationPA and other methods. |
r1 |
a numeric vector of the principal components that are of interest. Choose a subset of |
s |
a number (a positive integer) of “synthetic” null variables. Out of |
B |
a number (a positive integer) of resampling iterations. There will be a total of |
covariate |
a data matrix of covariates with corresponding |
verbose |
a logical specifying to print the computational progress. |
This function computes m
p-values of linear association between m
variables and their PCs.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of PCs from the observed data
and protects against an anti-conservative bias.
Provide the data matrix, with m
variables as rows and n
observations as columns.
Given that there are r
significant PCs, this function tests for linear association between
m
variables and their r
PCs.
You could specify a subset of significant PCs that you are interested in (r1
). If r1
is given,
then this function computes statistical significance of association between m
variables and r1
,
while adjusting for other PCs (i.e., significant PCs that are not your interest).
For example, if you want to identify variables associated with first and second PCs,
when your data contains three significant PCs, set r=3
and r1=c(1,2)
.
Please take a careful look at your data and use appropriate graphical and statistical criteria
to determine a number of significant PCs, r
. The number of significant PCs depends on the data structure and the context.
In a case when you fail to specify r
, it will be estimated from a permutation test (Buja and Eyuboglu, 1992)
using a function permutationPA.
If s
is not supplied, s
is set to about 10\
If B
is not supplied, B
is set to m*10/s
.
jackstraw_pca
returns a list consisting of
p.value |
|
obs.stat |
|
null.stat |
|
Neo Christopher Chung [email protected]
Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 doi:10.1093/bioinformatics/btu674
jackstraw jackstraw_subspace permutationPA
## Not run: ## simulate data from a latent variable model: Y = BL + E B = c(rep(1,50),rep(-1,50), rep(0,900)) L = rnorm(20) E = matrix(rnorm(1000*20), nrow=1000) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw out = jackstraw_pca(dat, r=1) ## Use optional arguments ## For example, set s and B for a balance between speed of the algorithm and accuracy of p-values ## out = jackstraw_pca(dat, r=1, s=10, B=1000) ## End(Not run)
## Not run: ## simulate data from a latent variable model: Y = BL + E B = c(rep(1,50),rep(-1,50), rep(0,900)) L = rnorm(20) E = matrix(rnorm(1000*20), nrow=1000) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw out = jackstraw_pca(dat, r=1) ## Use optional arguments ## For example, set s and B for a balance between speed of the algorithm and accuracy of p-values ## out = jackstraw_pca(dat, r=1, s=10, B=1000) ## End(Not run)
Test association between the observed variables and their latent variables captured by principal components (PCs). PCs are computed by randomized Singular Value Decomposition (see rsvd
).
jackstraw_rpca( dat, r = NULL, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE, ... )
jackstraw_rpca( dat, r = NULL, r1 = NULL, s = NULL, B = NULL, covariate = NULL, verbose = TRUE, ... )
dat |
a data matrix with |
r |
a number (a positive integer) of significant principal components. See permutationPA and other methods. |
r1 |
a numeric vector of principal components of interest. Choose a subset of |
s |
a number (a positive integer) of “synthetic” null variables. Out of |
B |
a number (a positive integer) of resampling iterations. There will be a total of |
covariate |
a data matrix of covariates with corresponding |
verbose |
a logical specifying to print the computational progress. |
... |
additional arguments to |
This function computes m
p-values of linear association between m
variables and their PCs.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of PCs from the observed data
and protects against an anti-conservative bias.
Provide the data matrix, with m
variables as rows and n
observations as columns.
Given that there are r
significant PCs, this function tests for linear association between
m
variables and their r
PCs.
You could specify a subset of significant PCs that you are interested in (r1
). If r1
is given,
then this function computes statistical significance of association between m
variables and r1
,
while adjusting for other PCs (i.e., significant PCs that are not your interest).
For example, if you want to identify variables associated with first and second PCs,
when your data contains three significant PCs, set r=3
and r1=c(1,2)
.
Please take a careful look at your data and use appropriate graphical and statistical criteria
to determine a number of significant PCs, r
. The number of significant PCs depends on the data structure and the context.
In a case when you fail to specify r
, it will be estimated from a permutation test (Buja and Eyuboglu, 1992)
using a function permutationPA.
If s
is not supplied, s
is set to about 10\
If B
is not supplied, B
is set to m*10/s
.
jackstraw_rpca
returns a list consisting of
p.value |
|
obs.stat |
|
null.stat |
|
Neo Christopher Chung [email protected]
Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 doi:10.1093/bioinformatics/btu674
jackstraw jackstraw_subspace permutationPA
## simulate data from a latent variable model: Y = BL + E B = c(rep(1,10),rep(-1,10), rep(0,180)) L = rnorm(20) E = matrix(rnorm(200*20), nrow=200) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw out = jackstraw_rpca(dat, r=1) ## Use optional arguments ## For example, set s and B for a balance between speed of the algorithm and accuracy of p-values ## Not run: ## out = jackstraw_rpca(dat, r=1, s=10, B=200) ## End(Not run)
## simulate data from a latent variable model: Y = BL + E B = c(rep(1,10),rep(-1,10), rep(0,180)) L = rnorm(20) E = matrix(rnorm(200*20), nrow=200) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw out = jackstraw_rpca(dat, r=1) ## Use optional arguments ## For example, set s and B for a balance between speed of the algorithm and accuracy of p-values ## Not run: ## out = jackstraw_rpca(dat, r=1, s=10, B=200) ## End(Not run)
Test association between the observed variables and their latent variables, captured by a user-defined dimension reduction method.
jackstraw_subspace( dat, r, FUN, r1 = NULL, s = NULL, B = NULL, covariate = NULL, noise = NULL, verbose = TRUE )
jackstraw_subspace( dat, r, FUN, r1 = NULL, s = NULL, B = NULL, covariate = NULL, noise = NULL, verbose = TRUE )
dat |
a data matrix with |
r |
a number of significant latent variables. |
FUN |
Provide a specific function to estimate LVs. Must output |
r1 |
a numeric vector of latent variables of interest. |
s |
a number of “synthetic” null variables. Out of |
B |
a number of resampling iterations. |
covariate |
a model matrix of covariates with |
noise |
specify a parametric distribution to generate a noise term. If |
verbose |
a logical specifying to print the computational progress. |
This function computes m
p-values of linear association between m
variables and their latent variables,
captured by a user-defined dimension reduction method.
Its resampling strategy accounts for the over-fitting characteristics due to direct computation of PCs from the observed data
and protects against an anti-conservative bias.
This function allows you to specify a parametric distribution of a noise term. It is an experimental feature. Then, a small number s
of observed variables
are replaced by synthetic null variables generated from a specified distribution.
jackstraw_subspace
returns a list consisting of
p.value |
|
obs.stat |
|
null.stat |
|
Neo Christopher Chung [email protected]
Chung and Storey (2015) Statistical significance of variables driving systematic variation in high-dimensional data. Bioinformatics, 31(4): 545-554 doi:10.1093/bioinformatics/btu674
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
## simulate data from a latent variable model: Y = BL + E B = c(rep(1,50),rep(-1,50), rep(0,900)) L = rnorm(20) E = matrix(rnorm(1000*20), nrow=1000) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw with the svd as a function out = jackstraw_subspace(dat, FUN = function(x) svd(x)$v[,1,drop=FALSE], r=1, s=100, B=50)
## simulate data from a latent variable model: Y = BL + E B = c(rep(1,50),rep(-1,50), rep(0,900)) L = rnorm(20) E = matrix(rnorm(1000*20), nrow=1000) dat = B %*% t(L) + E dat = t(scale(t(dat), center=TRUE, scale=TRUE)) ## apply the jackstraw with the svd as a function out = jackstraw_subspace(dat, FUN = function(x) svd(x)$v[,1,drop=FALSE], r=1, s=100, B=50)
50%:50% Jurkat:293T
Jurkat293T
Jurkat293T
A data frame with 3381 rows corresponding to single cells and 10 columns corresponding to the top 10 principal components
Supplementary Data 1 from Zheng et al. (2017) https://static-content.springer.com/esm/art%3A10.1038%2Fncomms14049/MediaObjects/41467_2017_BFncomms14049_MOESM829_ESM.xlsx
Zheng et al. (2017) Massively parallel digital transcriptional profiling of single cells. Nature Communications. 8:14049. doi:10.1038/ncomms14049
This function estimates the non-centrality parameter (NCP) of data assuming a non-central chi squared distribution with known degrees of freedom. The function first uses the method of moments (MOM) to get a rough estimate of the NCP, then uses that as a starting point for calculating the maximum likelihood estimate (MLE).
ncp_est(x, df)
ncp_est(x, df)
x |
The vector of data whose parameter we are interested in. The data is assumed to follow a non-central chi squared distribution, and appropriate inputs are deviance or likelihood ratio test statistics, or other tests that ordinarily follow a central chi-squared distribution but become non-central due to "double dipping" (where fixed variables being tested were in fact estimated from the same data). Any values that are NA, infinites, or negatives are ignored (removed before all calculations). |
df |
The degrees of freedom of the test used to calculate the statistics. |
There is one case where this function does not calculate MLEs directly, due to numerical issues, namely when the MOM estimate is negative, in which case estimates of zero for both are returned. Negative MOM NCP estimates arise in two common cases. First, the data has a central chi squared distribution, in which case the NCP estimate may take on small negative values, and overall these estimates have a mean of zero and variance dependent on sample size; in this case the return values of zero are appropriate. Second, the data may not have a central or non-central chi squared distribution at all, in which case the estimation model is completely misspecified; we opted to return zero estimates quietly, without issuing errors or warnings, as this case overlaps with central chi-squared data and it's not trivial to tell the difference. This function does not aim to detect misspecified data, it is up to the researcher to compare the model against the observed statistics for goodness of fit, for example using q-q plots.
A numeric vector with two values, namely the MLE and MOM estimates.
pvals_nc_chisq()
, a wrapper around this function that facilitates p-value calculation for Jackstraw objects.
# true parameters of toy data df <- 1 ncp_true <- 1 x <- rchisq( 100, df, ncp_true ) # get estimates! ncp_ests <- ncp_est( x, df ) # this is MLE ncp_ests[1] # this is MOM estimate ncp_ests[2]
# true parameters of toy data df <- 1 ncp_true <- 1 x <- rchisq( 100, df, ncp_true ) # get estimates! ncp_ests <- ncp_est( x, df ) # this is MLE ncp_ests[1] # this is MOM estimate ncp_ests[2]
Estimate a number of significant principal components from a permutation test.
permutationPA(dat, B = 100, threshold = 0.05, verbose = TRUE)
permutationPA(dat, B = 100, threshold = 0.05, verbose = TRUE)
dat |
a data matrix with |
B |
a number (a positive integer) of resampling iterations. |
threshold |
a numeric value between 0 and 1 to threshold p-values. |
verbose |
a logical indicator as to whether to print the progress. |
Adopted from sva::num.sv
, and based on Buja and Eyuboglu (1992)
permutationPA
returns
r |
an estimated number of significant principal components based on thresholding p-values at |
p |
a list of p-values for significance of principal components |
Buja A and Eyuboglu N. (1992) Remarks on parallel analysis. Multivariate Behavioral Research, 27(4), 509-540
From a set of p-values, computes posterior probabilities that a feature should be truly included. For example, membership inclusion in a given cluster can be improved by filtering low quality members. In using PCA and related methods, it helps select variables that are truly associated with given latent variables.
pip(pvalue, group = NULL, pi0 = NULL, verbose = TRUE, ...)
pip(pvalue, group = NULL, pi0 = NULL, verbose = TRUE, ...)
pvalue |
a vector of p-values. |
group |
a vector of group indicators (optional). If provided, PIP analysis is stratified. Assumes groups are in 1:k where k is the number of unique groups. |
pi0 |
a vector of pi0 values (optional). Its length has to be either 1 or equal the number of groups. |
verbose |
If TRUE, reports information. |
... |
optional arguments for |
This function requires the Bioconductor qvalue
package to be installed.
pip
returns a vector of posterior inclusion probabilities
Neo Christopher Chung [email protected] John R. Yamamoto-Wilson
Chung (2020) Statistical significance of cluster membership for unsupervised evaluation of cell identities. Bioinformatics, 36(10): 3107–3114 doi:10.1093/bioinformatics/btaa087
Chung (2014) "Jackstraw Weighted Shrinkage for Principal Component Analysis and Covariance Matrix" in Statistical Inference of Variables Driving Systematic Variation in High-Dimensional Biological Data. PhD thesis, Princeton University. https://www.proquest.com/openview/e90b562d689cf3a021c35a93c6f346db/1?pq-origsite=gscholar&cbl=18750
This function takes a Jackstraw object for convenience, estimates the non-centrality parameter (NCP) from the null statistics (using ncp_est()
), then uses this non-central chi squared model to calculate p-values for the observed statistics.
The goal is to be able to calculate small p-values with arbitrary precision, which is normally not possible with the ordinary Jackstraw functions since the minimum empirical p-value is determined by the inverse of the number of null samples (1 /( s * B )
where sand
B' are Jackstraw parameters (see jackstraw_lfa()
)).
In contrast, large p-values values are typically similar between empirical and NCP model versions, and in fact this must hold if the null model is correctly specified.
This model works well empirically for jackstraw_lfa()
and jackstraw_alstructure()
, whose statistics are deviances that would ordinarily be central chi-squared distributed were the test factors not estimated from the same data.
This model is not appropriate for data from jackstraw_pca()
, whose statistics have a roughly F distribution, and possibly many other functions from this package.
pvals_nc_chisq(out, df, null.stat = out$null.stat, obs.stat = out$obs.stat)
pvals_nc_chisq(out, df, null.stat = out$null.stat, obs.stat = out$obs.stat)
out |
The object returned by |
df |
The degrees of freedom of the test used to calculate the statistics.
Note that for ordinary jackstraw runs this is |
null.stat |
Alternate way to provide these values if the input is not a Jackstraw object, so |
obs.stat |
Alternate way to provide these values if the input is not a Jackstraw object, so |
A list with two named values:
p.value
: the vector of p-values calculated for the input vector obs.stat
ncp
: the numeric vector of two values of NCP estimates calculated from null.stat
, namely the maximum likelihood (MLE) and method-of-moments estimates (see ncp_est()
). Only MLE is used to calculate p-values.
ncp_est()
for more notes on estimating non-centrality parameters only.
jackstraw_lfa()
, jackstraw_alstructure()
# instead of running jackstraw here, we simulate toy data df <- 1 ncp_true <- 1 # null data are truly non-central chi squared distributed # observed data are the same but with huge power reflected in a larger NCP out <- list( null.stat = rchisq( 100, df, ncp_true ), obs.stat = rchisq( 11, df, ncp_true + 10 ) ) # this calculates new p-values with much higher precision for highly significant cases out2 <- pvals_nc_chisq( out, df ) # these are the desired p-values out2$p.value # and NCP estimates from tow methds out2$ncp ## Not run: # This is the more typical usage in practice # first run Jackstraw-LFA, which returns empirical p-values with limited precision and raw stats out <- jackstraw_lfa( dat, r, FUN = function(x) lfa( x, r ) ) # then fit NCP model and use it to calculate high-precision p-values! out2 <- pvals_nc_chisq( out, r-1 ) # compare p-values! In linear scale they should agree very well plot( out$p.value, out2$p.value, pch = '.' ) # but in log scale it's clear the NCP values can take on much smaller values plot( out$p.value, out2$p.value, pch = '.', log = 'xy' ) ## End(Not run)
# instead of running jackstraw here, we simulate toy data df <- 1 ncp_true <- 1 # null data are truly non-central chi squared distributed # observed data are the same but with huge power reflected in a larger NCP out <- list( null.stat = rchisq( 100, df, ncp_true ), obs.stat = rchisq( 11, df, ncp_true + 10 ) ) # this calculates new p-values with much higher precision for highly significant cases out2 <- pvals_nc_chisq( out, df ) # these are the desired p-values out2$p.value # and NCP estimates from tow methds out2$ncp ## Not run: # This is the more typical usage in practice # first run Jackstraw-LFA, which returns empirical p-values with limited precision and raw stats out <- jackstraw_lfa( dat, r, FUN = function(x) lfa( x, r ) ) # then fit NCP model and use it to calculate high-precision p-values! out2 <- pvals_nc_chisq( out, r-1 ) # compare p-values! In linear scale they should agree very well plot( out$p.value, out2$p.value, pch = '.' ) # but in log scale it's clear the NCP values can take on much smaller values plot( out$p.value, out2$p.value, pch = '.', log = 'xy' ) ## End(Not run)