Package 'omicwas'

Title: Cell-Type-Specific Association Testing in Bulk Omics Experiments
Description: In bulk epigenome/transcriptome experiments, molecular expression is measured in a tissue, which is a mixture of multiple types of cells. This package tests association of a disease/phenotype with a molecular marker for each cell type. The proportion of cell types in each sample needs to be given as input. The package is applicable to epigenome-wide association study (EWAS) and differential gene expression analysis. Takeuchi and Kato (submitted) "omicwas: cell-type-specific epigenome-wide and transcriptome association study".
Authors: Fumihiko Takeuchi [aut, cre]
Maintainer: Fumihiko Takeuchi <[email protected]>
License: GPL-3
Version: 0.8.0
Built: 2025-02-16 05:17:45 UTC
Source: https://github.com/fumi-github/omicwas

Help Index


Cell-Type-Specific Association Testing

Description

Cell-Type-Specific Association Testing

Usage

ctassoc(
  X,
  W,
  Y,
  C = NULL,
  test = "full",
  regularize = FALSE,
  num.cores = 1,
  chunk.size = 1000,
  seed = 123
)

Arguments

X

Matrix (or vector) of traits; samples x traits.

W

Matrix of cell type composition; samples x cell types.

Y

Matrix (or vector) of bulk omics measurements; markers x samples.

C

Matrix (or vector) of covariates; samples x covariates. X, W, Y, C should be numeric.

test

Statistical test to apply; either "full", "marginal", "nls.identity", "nls.log", "nls.logit", "propdiff.identity", "propdiff.log", "propdiff.logit" or "reducedrankridge".

regularize

Whether to apply Tikhonov (ie ridge) regularization to βhjk\beta_{h j k}. The regularization parameter is chosen automatically according to an unbiased version of (Lawless & Wang, 1976). Effective for nls.* and propdiff.* tests.

num.cores

Number of CPU cores to use. Full, marginal and propdiff tests are run in serial, thus num.cores is ignored.

chunk.size

The size of job for a CPU core in one batch. If you have many cores but limited memory, and there is a memory failure, decrease num.cores and/or chunk.size.

seed

Seed for random number generation.

Details

Let the indexes be hh for cell type, ii for sample, jj for marker (CpG site or gene), kk for each trait that has cell-type-specific effect, and ll for each trait that has a uniform effect across cell types. The input data are XikX_{i k}, CilC_{i l}, WihW_{i h} and YjiY_{j i}, where CilC_{i l} can be omitted. XikX_{i k} and CilC_{i l} are the values for two types of traits, showing effects that are cell-type-specific or not, respectively. Thus, calling XikX_{i k} and CilC_{i l} as "traits" and "covariates" gives a rough idea, but is not strictly correct. WihW_{i h} represents the cell type composition and YjiY_{j i} represents the marker level, such as methylation or gene expression. For each tissue sample, the cell type proportion WihW_{i h} is the proportion of each cell type in the bulk tissue, which is measured or imputed beforehand. The marker level YjiY_{j i} in bulk tissue is measured and provided as input.

The parameters we estimate are the cell-type-specific trait effect βhjk\beta_{h j k}, the tissue-uniform trait effect γjl\gamma_{j l}, and the basal marker level αhj\alpha_{h j} in each cell type.

We first describe the conventional linear regression models. For marker jj in sample ii, the maker level specific to cell type hh is

αhj+kβhjkXik.\alpha_{h j} + \sum_k \beta_{h j k} * X_{i k}.

This is a representative value rather than a mean, because we do not model a probability distribution for cell-type-specific expression. The bulk tissue marker level is the average weighted by WihW_{i h},

μji=hWih[αhj+kβhjkXik]+lγjlCil.\mu_{j i} = \sum_h W_{i h} [ \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} ] + \sum_l \gamma_{j l} C_{i l}.

The statistical model is

Yji=μji+ϵji,Y_{j i} = \mu_{j i} + \epsilon_{j i},

ϵji N(0,σj2).\epsilon_{j i} ~ N(0, \sigma^2_j).

The error of the marker level is is noramlly distributed with variance σj2\sigma^2_j, independently among samples.

The full model is the linear regression

Yji=(hαhjWih)+(hkβhjkWihXik)+(lγjlCil)+error.Y_{j i} = (\sum_h \alpha_{h j} * W_{i h}) + (\sum_{h k} \beta_{h j k} * W_{i h} * X_{i k}) + (\sum_l \gamma_{j l} * C_{i l}) + error.

The marginal model tests the trait association only in one cell type hh, under the linear regression,

Yji=(hαhjWih)+(kβhjkWihXik)+(lγjlCil)+error.Y_{j i} = (\sum_{h'} \alpha_{h' j} * W_{i h'}) + (\sum_k \beta_{h j k} * W_{i h} * X_{i k}) + (\sum_l \gamma_{j l} * C_{i l}) + error.

The nonlinear model simultaneously analyze cell type composition in linear scale and differential expression/methylation in log/logit scale. The normalizing function is the natural logarithm ff = log for gene expression, and ff = logit for methylation. Conventional linear regression can be formulated by defining ff as the identity function. The three models are named nls.log, nls.logit and nls.identity. We denote the inverse function of ff by gg; gg = exp for gene expression, and gg = logistic for methylation. The mean normalized marker level of marker jj in sample ii becomes

μji=f(hWihg(αhj+kβhjkXik))+lγjlCil.\mu_{j i} = f(\sum_h W_{i h} g( \alpha_{h j} + \sum_k \beta_{h j k} * X_{i k} )) + \sum_l \gamma_{j l} C_{i l}.

The statistical model is

f(Yji)=μji+ϵji,f(Y_{j i}) = \mu_{j i} + \epsilon_{j i},

ϵji N(0,σj2).\epsilon_{j i} ~ N(0, \sigma^2_j).

The error of the marker level is is noramlly distributed with variance σj2\sigma^2_j, independently among samples.

The ridge regression aims to cope with multicollinearity of the interacting terms WihXikW_{i h} * X_{i k}. Ridge regression is fit by minimizing the residual sum of squares (RSS) plus λhkβhjk2\lambda \sum_{h k} \beta_{h j k}^2, where λ>0\lambda > 0 is the regularization parameter.

The propdiff tests try to cope with multicollinearity by, roughly speaking, using mean-centered WihW_{i h}. We obtain, instead of βhjk\beta_{h j k}, the deviation of βhjk\beta_{h j k} from the average across cell types. Accordingly, the null hypothesis changes. The original null hypothesis was βhjk=0\beta_{h j k} = 0. The null hypothesis when centered is βhjk(ihWihβhjk)/(ihWih)=0\beta_{h j k} - (\sum_{i h'} W_{i h'} \beta_{h' j k}) / (\sum_{i h'} W_{i h'}) = 0. It becomes difficult to detect a signal for a major cell type, because βhjk\beta_{h j k} would be close to the average across cell types. The tests propdiff.log and propdiff.logit include an additional preprocessing step that converts YjiY_{j i} to f(Yji)f(Y_{j i}). Apart from the preprocessing, the computations are performed in linear scale. As the preprocessing distorts the linearity between the dependent variable and (the centered) WihW_{i h}, I actually think propdiff.identity is better.

Value

A list with one element, which is named "coefficients". The element gives the estimate, statistic, p.value in tibble format. In order to transform the estimate for αhj\alpha_{h j} to the original scale, apply plogis for test = nls.logit and exp for test = nls.log. The estimate for βhjk\beta_{h j k} by test = nls.log is the natural logarithm of fold-change, not the log2. If numerical convergence fails, NA is returned for that marker.

References

Lawless, J. F., & Wang, P. (1976). A simulation study of ridge and other regression estimators. Communications in Statistics - Theory and Methods, 5(4), 307–323. https://doi.org/10.1080/03610927608827353

See Also

ctcisQTL

Examples

data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients

Cell-Type-Specific QTL analysis

Description

Cell-Type-Specific QTL analysis

Usage

ctcisQTL(
  X,
  Xpos,
  W,
  Y,
  Ypos,
  C = NULL,
  max.pos.diff = 1e+06,
  outdir = tempdir(),
  outfile = "ctcisQTL.out.txt"
)

Arguments

X

Matrix (or vector) of SNP genotypes; SNPs x samples.

Xpos

Vector of the physical position of X

W

Matrix of cell type composition; samples x cell types.

Y

Matrix (or vector) of bulk omics measurements; markers x samples.

Ypos

Vector of the physical position of Y

C

Matrix (or vector) of covariates; samples x covariates. X, Xpos, W, Y, Ypos, C should be numeric.

max.pos.diff

Maximum positional difference to compute cis-QTL. Association analysis is performed between a row of X and a row of Y, only when they are within this limit. Since the limiting is only by position, the function needs to be run separately for each chromosome.

outdir

Output directory.

outfile

Output file.

Details

A function for analyses of QTL, such as eQTL, mQTL, pQTL. The statistical test is almost identical to ctassoc(test = "nls.identity", regularize = "TRUE"). Association analysis is performed between each row of Y and each row of X. Usually, the former will be a methylation/expression marker, and the latter will be a SNP. To cope with the large number of combinations, the testing is limited to pairs whose position is within the difference specified by max.pos.diff; i.e., limited to cis-QTL. In detail, this function performs linear ridge regression, whereas ctassoc(test = "nls.identity", regularize = "TRUE") actually is nonlinear regression but with ff = identity as normalizing transformation. In order to speed up computation, first, the parameters αhj\alpha_{h j} and γjl\gamma_{j l} are fit by ordinary linear regression assuming βhjk=0\beta_{h j k} = 0. Next, βhjk\beta_{h j k} are fit and tested by linear ridge regression (see documentation for ctassoc).

Value

The estimate, statistic, p.value are written to the specified file.

See Also

ctassoc

Examples

data(GSE79262small)
X    = GSE79262small$X
Xpos = GSE79262small$Xpos
W    = GSE79262small$W
Y    = GSE79262small$Y
Ypos = GSE79262small$Ypos
C    = GSE79262small$C
X    = X[seq(1, 3601, 100), ] # for brevity
Xpos = Xpos[seq(1, 3601, 100)]
ctcisQTL(X, Xpos, W, Y, Ypos, C = C)

Small Subset of GSE42861 Dataset From GEO

Description

The dataset includes 336 rheumatoid arthritis cases and 322 controls. A subset of 500 CpG sites were randomly selected from the original EWAS dataset.

Usage

data(GSE42861small)

Format

An object of class list of length 4.

Source

GEO

See Also

ctassoc

Examples

data(GSE42861small)
X = GSE42861small$X
W = GSE42861small$W
Y = GSE42861small$Y
Y = Y[seq(1, 20), ] # for brevity
C = GSE42861small$C
result = ctassoc(X, W, Y, C = C)
result$coefficients

Small Subset of GSE79262 Dataset From GEO

Description

The dataset includes 53 samples. A subset of 737 CpG sites and 3624 SNPs within Chr1:100,000,000-110,000,000 were selected from the original EWAS dataset. DNA methylation was measured in T cells. The estimated proportion of CD4T, CD8T, NK cells are saved in W.

Usage

data(GSE79262small)

Format

An object of class list of length 6.

Source

GEO

See Also

ctcisQTL

Examples

data(GSE79262small)
X    = GSE79262small$X
Xpos = GSE79262small$Xpos
W    = GSE79262small$W
Y    = GSE79262small$Y
Ypos = GSE79262small$Ypos
C    = GSE79262small$C
X    = X[seq(1, 3001, 100), ] # for brevity
Xpos = Xpos[seq(1, 3001, 100)]
Y    = Y[seq(1, 501, 100), ]
Ypos = Ypos[seq(1, 501, 100)]
ctcisQTL(X, Xpos, W, Y, Ypos, C = C)

Small Subset of GTEx Dataset

Description

The dataset includes gene expression measured in whole blood for 389 samples. A subset of 500 genes were randomly selected from the original dataset.

Usage

data(GTExsmall)

Format

An object of class list of length 4.

Source

GTEx

See Also

ctassoc

Examples

data(GTExsmall)
X = GTExsmall$X
W = GTExsmall$W
Y = GTExsmall$Y + 1
Y = Y[seq(1, 20), ] # for brevity
C = GTExsmall$C
result = ctassoc(X, W, Y, C = C)
result$coefficients

Fitting reduced-rank ridge regression with given rank and shrinkage penalty

Description

Fitting reduced-rank ridge regression with given rank and shrinkage penalty This is a modification of rrs.fit in rrpack version 0.1-6. In order to handle extremely large q = ncol(Y), generation of a q by q matrix is avoided.

Usage

rrs.fit(Y, X, nrank = min(ncol(Y), ncol(X)), lambda = 1, coefSVD = FALSE)

Arguments

Y

a matrix of response (n by q)

X

a matrix of covariate (n by p)

nrank

an integer specifying the desired rank

lambda

tunging parameter for the ridge penalty

coefSVD

logical indicating the need for SVD for the coeffient matrix int the output

Value

S3 rrr object, a list consisting of

coef

coefficient of rrs

coef.ls

coefficient of least square

fitted

fitted value of rrs

fitted.ls

fitted value of least square

A

right singular matrix

Ad

sigular value vector

nrank

rank of the fitted rrr

References

Mukherjee, A. and Zhu, J. (2011) Reduced rank ridge regression and its kernal extensions.

Mukherjee, A., Chen, K., Wang, N. and Zhu, J. (2015) On the degrees of freedom of reduced-rank estimators in multivariate regression. Biometrika, 102, 457–477.

Examples

Y <- matrix(rnorm(400), 100, 4)
X <- matrix(rnorm(800), 100, 8)
rfit <- rrs.fit(Y, X)