Title: | Testing similarity between binary datasets using Jaccard/Tanimoto coefficients |
---|---|
Description: | Calculate statistical significance of Jaccard/Tanimoto similarity coefficients. |
Authors: | Neo Christopher Chung <[email protected]>, Błażej Miasojedow <[email protected]>, Michał Startek <[email protected]> |
Maintainer: | Neo Christopher Chung <[email protected]> |
License: | GPL-2 |
Version: | 0.1.0 |
Built: | 2024-11-05 03:57:00 UTC |
Source: | https://github.com/ncchung/jaccard |
Compute a Jaccard/Tanimoto similarity coefficient
jaccard(x, y, center = FALSE, px = NULL, py = NULL)
x | a binary vector (e.g., fingerprint) |
y | a binary vector (e.g., fingerprint) |
center | whether to center the Jaccard/Tanimoto coefficient by its expectation |
px | probability of successes in x |
py | probability of successes in y |
jaccard returns a Jaccard/Tanimoto similarity coefficient.
set.seed(1234)
x = rbinom(100, 1, .5)
y = rbinom(100, 1, .5)
jaccard(x, y)
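For orientation, the uncentered coefficient can be sketched in base R as the number of positions where both vectors are 1 divided by the number of positions where at least one is 1. The helper name `jaccard_sketch` is hypothetical, and returning `NA` when neither vector contains a 1 is an assumption; the package's own handling of that case may differ.

```r
# Minimal sketch of an uncentered Jaccard/Tanimoto coefficient (not the package's code).
jaccard_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  both   <- sum(x == 1 & y == 1)  # positions where both vectors are 1
  either <- sum(x == 1 | y == 1)  # positions where at least one vector is 1
  if (either == 0) return(NA_real_)  # assumption: undefined when there are no 1s
  both / either
}

x <- c(1, 1, 0, 0, 1)
y <- c(1, 0, 0, 1, 1)
jaccard_sketch(x, y)  # 2 shared 1s over 4 positions with any 1: 0.5
```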
Compute an expected Jaccard/Tanimoto similarity coefficient under independence
jaccard.ev(x, y, px = NULL, py = NULL)
x | a binary vector (e.g., fingerprint) |
y | a binary vector (e.g., fingerprint) |
px | probability of successes in x |
py | probability of successes in y |
jaccard.ev returns an expected value.
set.seed(1234)
x = rbinom(100, 1, .5)
y = rbinom(100, 1, .5)
jaccard.ev(x, y)
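One way to picture an expectation under independence is the ratio of the expected overlap to the expected union, px*py / (px + py - px*py). The sketch below assumes that form and assumes that px and py default to the observed proportions of 1s; neither detail is stated on this page, and `jaccard_ev_sketch` is a hypothetical name.

```r
# Sketch only: assumes E[J] is approximated by px*py / (px + py - px*py).
jaccard_ev_sketch <- function(x, y, px = NULL, py = NULL) {
  if (is.null(px)) px <- mean(x)  # assumed default: observed proportion of 1s in x
  if (is.null(py)) py <- mean(y)  # assumed default: observed proportion of 1s in y
  (px * py) / (px + py - px * py)
}

jaccard_ev_sketch(px = 0.5, py = 0.5)  # 0.25 / 0.75 = 1/3
```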
In the EC-BLAST paper, Rahman et al. (2014) provide the following description: The mean ($\mu$) and s.d. ($\sigma$) of the similarity scores are used to define the z score, $z = (T_w - \mu)/\sigma$. For the purpose of calculating the P value, only hits with $T > 0$ are considered. The P value w is derived from the z score using an extreme value distribution $P = 1 - \exp\left(-e^{-z\pi/\sqrt{6} - \Gamma'(1)}\right)$, where the Euler-Mascheroni constant $\Gamma'(1) \approx 0.577215665$.
jaccard.rahman(j)
j | a numeric vector of observed Jaccard coefficients (uncentered) |
jaccard.rahman returns a numeric vector of p-values.
Rahman, Cuesta, Furnham, Holliday, and Thornton (2014) EC-BLAST: a tool to automatically search and compare enzyme reactions. Nature Methods, 11(2) http://www.nature.com/nmeth/journal/v11/n2/full/nmeth.2803.html
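The quoted description can be turned into a short base R sketch. It is an illustration of the quoted formula only and may not match what jaccard.rahman does internally. The name `rahman_pvalue_sketch` is hypothetical, and two details are assumptions: the mean and s.d. are estimated from the positive coefficients in `j`, and non-positive coefficients receive `NA`. At least two positive values are needed for the s.d.

```r
# Sketch of the extreme value p-value quoted from Rahman et al. (2014).
rahman_pvalue_sketch <- function(j) {
  euler_gamma <- 0.577215665           # constant as given in the quoted text
  positive <- j > 0                    # only hits with T > 0 are considered
  mu    <- mean(j[positive])           # assumption: estimated from positive hits
  sigma <- sd(j[positive])
  z <- (j - mu) / sigma                # z score
  p <- 1 - exp(-exp(-z * pi / sqrt(6) - euler_gamma))
  p[!positive] <- NA_real_             # assumption: no p-value for excluded hits
  p
}

rahman_pvalue_sketch(c(0.10, 0.20, 0.30, 0.90))
```

Larger coefficients give larger z scores and therefore smaller p-values.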
Compute statistical significance of Jaccard/Tanimoto similarity coefficients between binary vectors, using four different methods.
jaccard.test(x, y, method = "mca", px = NULL, py = NULL, verbose = TRUE, ...)
x | a binary vector (e.g., fingerprint) |
y | a binary vector (e.g., fingerprint) |
method | a method to compute a p-value ("mca", "bootstrap", "asymptotic", or "exact") |
px | probability of successes in x |
py | probability of successes in y |
verbose | whether to print progress messages |
... | optional arguments for specific computational methods |
There exist four methods to compute p-values of Jaccard/Tanimoto similarity coefficients: mca, bootstrap, asymptotic, and exact. This function is simply a wrapper for the four corresponding functions in this package: jaccard.test.mca, jaccard.test.bootstrap, jaccard.test.asymptotic, and jaccard.test.exact.
We recommend using either the mca or the bootstrap method, since the exact solution is slow for a moderately large vector and the asymptotic approximation may be inaccurate depending on the input vector size.
The bootstrap method resamples the binary vectors with replacement to compute a p-value (see optional arguments).
The mca method uses the measure concentration algorithm, which estimates the multinomial distribution with a known error bound (specified by the optional argument accuracy).
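The wrapper pattern described above can be sketched with `match.arg` and `switch`. The four `test_*` functions below are hypothetical stand-ins that only report which method was chosen; they are not the package's functions, and the real wrapper may be organized differently.

```r
# Hypothetical stand-ins for the four method-specific test functions.
test_mca        <- function(x, y, ...) "mca"
test_bootstrap  <- function(x, y, ...) "bootstrap"
test_asymptotic <- function(x, y, ...) "asymptotic"
test_exact      <- function(x, y, ...) "exact"

# Sketch of a wrapper that validates `method` and forwards optional arguments.
test_wrapper_sketch <- function(x, y,
                                method = c("mca", "bootstrap", "asymptotic", "exact"),
                                ...) {
  method <- match.arg(method)  # errors on an unknown method; defaults to "mca"
  fn <- switch(method,
               mca        = test_mca,
               bootstrap  = test_bootstrap,
               asymptotic = test_asymptotic,
               exact      = test_exact)
  fn(x, y, ...)                # method-specific options travel through `...`
}

test_wrapper_sketch(c(1, 0), c(1, 1), method = "exact")  # "exact"
```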
jaccard.test returns a list mainly consisting of
statistics | centered Jaccard/Tanimoto similarity coefficient |
pvalue | p-value |
expectation | expected Jaccard/Tanimoto coefficient under independence |
Optional arguments for method="bootstrap":
fix | whether to fix (i.e., not resample) x and/or y |
B | total number of bootstrap iterations |
seed | a seed for a random number generator |
Optional arguments for method="mca":
accuracy | an error bound on approximating a multinomial distribution |
error.type | an error type on approximating a multinomial distribution ("average", "upper", "lower") |
seed | a seed for the random number generator |
jaccard.test.bootstrap, jaccard.test.mca, jaccard.test.exact, jaccard.test.asymptotic
set.seed(1234)
x = rbinom(100, 1, .5)
y = rbinom(100, 1, .5)
jaccard.test(x, y, method = "bootstrap")
jaccard.test(x, y, method = "mca")
jaccard.test(x, y, method = "exact")
jaccard.test(x, y, method = "asymptotic")
Compute statistical significance of Jaccard/Tanimoto similarity coefficients using an asymptotic approximation.
jaccard.test.asymptotic(x, y, px = NULL, py = NULL, verbose = TRUE)
x | a binary vector (e.g., fingerprint) |
y | a binary vector (e.g., fingerprint) |
px | probability of successes in x |
py | probability of successes in y |
verbose | whether to print progress messages |
jaccard.test.asymptotic returns a list consisting of
statistics | centered Jaccard/Tanimoto similarity coefficient |
pvalue | p-value |
expectation | expected Jaccard/Tanimoto coefficient under independence |
set.seed(1234)
x = rbinom(100, 1, .5)
y = rbinom(100, 1, .5)
jaccard.test.asymptotic(x, y)
Compute statistical significance of Jaccard/Tanimoto similarity coefficients using the bootstrap (resampling with replacement).
jaccard.test.bootstrap(x, y, px = NULL, py = NULL, verbose = TRUE, fix = "x", B = 1000, seed = NULL)
x | a binary vector (e.g., fingerprint) |
y | a binary vector (e.g., fingerprint) |
px | probability of successes in x |
py | probability of successes in y |
verbose | whether to print progress messages |
fix | whether to fix (i.e., not resample) x and/or y |
B | total number of bootstrap iterations |
seed | a seed for a random number generator |
jaccard.test.bootstrap returns a list consisting of
statistics | centered Jaccard/Tanimoto similarity coefficient |
pvalue | p-value |
expectation | expected Jaccard/Tanimoto coefficient under independence |
set.seed(1234)
x = rbinom(100, 1, .5)
y = rbinom(100, 1, .5)
jaccard.test.bootstrap(x, y, B = 500)
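To make the resampling idea concrete, here is a minimal base R sketch of a resampling test with x held fixed. It is not the package's implementation. The name `bootstrap_pvalue_sketch` is hypothetical, and several details are assumptions: the expectation takes the form px*py / (px + py - px*py), the p-value is two-sided, the coefficient is counted as 0 when no position has a 1, and each vector contains at least one 1.

```r
# Sketch of a resampling-based p-value for the centered Jaccard coefficient.
bootstrap_pvalue_sketch <- function(x, y, B = 1000, seed = NULL) {
  stopifnot(length(x) == length(y))
  if (!is.null(seed)) set.seed(seed)

  # Uncentered coefficient; counted as 0 when neither vector has a 1 (assumption).
  jac <- function(a, b) {
    either <- sum(a == 1 | b == 1)
    if (either == 0) 0 else sum(a == 1 & b == 1) / either
  }

  # Assumed expectation under independence.
  px <- mean(x)
  py <- mean(y)
  ev <- (px * py) / (px + py - px * py)

  observed <- jac(x, y) - ev
  # Keep x fixed and resample y with replacement, which breaks the pairing.
  null_stats <- replicate(B, jac(x, sample(y, replace = TRUE)) - ev)

  list(statistics  = observed,
       pvalue      = mean(abs(null_stats) >= abs(observed)),
       expectation = ev)
}

x <- rep(c(1, 0), times = 50)
res <- bootstrap_pvalue_sketch(x, x, B = 200, seed = 1)
res$pvalue
```

Comparing a vector with itself gives the largest possible coefficient, so the resampled statistics should rarely, if ever, be as extreme as the observed one.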
Compute statistical significance of Jaccard/Tanimoto similarity coefficients using the exact solution.
jaccard.test.exact(x, y, px = NULL, py = NULL, verbose = TRUE)
x | a binary vector (e.g., fingerprint) |
y | a binary vector (e.g., fingerprint) |
px | probability of successes in x |
py | probability of successes in y |
verbose | whether to print progress messages |
jaccard.test.exact returns a list consisting of
statistics | centered Jaccard/Tanimoto similarity coefficient |
pvalue | p-value |
expectation | expected Jaccard/Tanimoto coefficient under independence |
set.seed(1234)
x = rbinom(100, 1, .5)
y = rbinom(100, 1, .5)
jaccard.test.exact(x, y)
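One way to picture an exact calculation is to enumerate every possible pair of counts (positions where both vectors are 1, positions where exactly one is 1) under independent Bernoulli vectors and to sum the multinomial probabilities of outcomes at least as extreme as the observed one. The sketch below follows that idea under stated assumptions and is not the package's algorithm: `exact_pvalue_sketch` is a hypothetical name, the p-value is two-sided around an assumed expectation px*py / (px + py - px*py), and the coefficient is counted as 0 when no position has a 1.

```r
# Sketch of an exact two-sided p-value by enumerating a multinomial distribution.
exact_pvalue_sketch <- function(x, y, px = mean(x), py = mean(y)) {
  stopifnot(length(x) == length(y))
  n   <- length(x)
  p11 <- px * py                         # both vectors are 1
  pd  <- px * (1 - py) + (1 - px) * py   # exactly one vector is 1
  p00 <- (1 - px) * (1 - py)             # both vectors are 0
  ev  <- p11 / (p11 + pd)                # assumed expectation under independence

  either <- sum(x == 1 | y == 1)
  obs <- if (either == 0) 0 else sum(x == 1 & y == 1) / either

  tail_prob <- 0
  for (a in 0:n) {            # a: number of positions where both are 1
    for (d in 0:(n - a)) {    # d: number of positions where exactly one is 1
      j <- if (a + d == 0) 0 else a / (a + d)
      # Small tolerance so that ties with the observed value are counted.
      if (abs(j - ev) >= abs(obs - ev) - 1e-12) {
        tail_prob <- tail_prob +
          dmultinom(c(a, d, n - a - d), prob = c(p11, pd, p00))
      }
    }
  }
  list(statistics = obs - ev, pvalue = tail_prob, expectation = ev)
}

exact_pvalue_sketch(c(1, 1, 0, 0), c(1, 1, 0, 0))
```

The double loop visits on the order of n^2 outcomes, which illustrates why exact enumeration becomes slow as the vectors grow.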
Compute statistical significance of Jaccard/Tanimoto similarity coefficients using the measure concentration algorithm (mca).
jaccard.test.mca(x, y, px = NULL, py = NULL, accuracy = 1e-05, error.type = "average", verbose = TRUE)
x | a binary vector (e.g., fingerprint) |
y | a binary vector (e.g., fingerprint) |
px | probability of successes in x |
py | probability of successes in y |
accuracy | an error bound on approximating a multinomial distribution |
error.type | an error type on approximating a multinomial distribution ("average", "upper", "lower") |
verbose | whether to print progress messages |
jaccard.test.mca returns a list consisting of
statistics | centered Jaccard/Tanimoto similarity coefficient |
pvalue | p-value |
expectation | expected Jaccard/Tanimoto coefficient under independence |
set.seed(1234)
x = rbinom(100, 1, .5)
y = rbinom(100, 1, .5)
jaccard.test.mca(x, y, accuracy = 1e-05)
Given a data matrix, compute pairwise Jaccard/Tanimoto similarity coefficients and p-values among rows (variables). Intended only for testing, because it relies on a for-loop.
jaccard.test.pairwise(dat, method = "mca", verbose = TRUE, compute.qvalue = TRUE, ...)
dat | a data matrix |
method | a method to compute a p-value ("mca", "bootstrap", "asymptotic", or "exact") |
verbose | whether to print progress messages |
compute.qvalue | whether to compute q-values |
... | optional arguments for specific computational methods |
jaccard.test.pairwise returns a list of matrices
statistics | Jaccard/Tanimoto similarity coefficients |
pvalues | p-values |
qvalues | q-values |
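The row-by-row loop can be sketched in base R for the coefficients alone. The name `pairwise_jaccard_sketch` is hypothetical, p-values and q-values are left out, and leaving the diagonal and undefined pairs as `NA` is an assumption rather than the package's documented behavior.

```r
# Sketch: pairwise uncentered Jaccard coefficients among the rows of a binary matrix.
pairwise_jaccard_sketch <- function(dat) {
  m <- nrow(dat)
  out <- matrix(NA_real_, nrow = m, ncol = m,
                dimnames = list(rownames(dat), rownames(dat)))
  if (m < 2) return(out)
  for (i in 1:(m - 1)) {
    for (k in (i + 1):m) {
      either <- sum(dat[i, ] == 1 | dat[k, ] == 1)
      both   <- sum(dat[i, ] == 1 & dat[k, ] == 1)
      # The matrix is symmetric, so each pair is computed once and mirrored.
      out[i, k] <- out[k, i] <- if (either == 0) NA_real_ else both / either
    }
  }
  out
}

dat <- rbind(a = c(1, 1, 0, 0),
             b = c(1, 0, 0, 1),
             c = c(0, 0, 1, 1))
pairwise_jaccard_sketch(dat)
```

For m rows the loop visits m(m - 1)/2 pairs, so the cost grows quadratically with the number of rows.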
Launch an interactive Shiny app on a local network
runJaccardApp()