models.Rmd
Pando provides various regression models and modeling options that can be used for GRN inference. This vignette gives an overview of the available options. First, let’s load the object.
library(Pando)
library(tidyverse)
library(doParallel)
registerDoParallel(4)
muo_data <- read_rds('muo_data.rds')
The default option when running infer_grn()
is a generalized linear model (GLM) with gaussian noise. Using the family
argument, one can choose other noise models, e.g to fit directly on counts instead of on log-normalized data.
## # A tibble: 46 x 10
## tf target region term estimate std_err statistic pval padj corr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 POU3… SFRP2 chr4-… chr4… 0.0507 0.0201 2.53 1.15e- 2 3.83e- 2 0.156
## 2 NR2E1 SFRP2 chr4-… chr4… 0.0421 0.0197 2.14 3.27e- 2 8.72e- 2 0.191
## 3 PAX6 SFRP2 chr4-… chr4… 0.0804 0.0290 2.77 5.56e- 3 2.43e- 2 0.334
## 4 FOS SFRP2 chr4-… chr4… -0.0474 0.0189 -2.51 1.20e- 2 3.83e- 2 -0.168
## 5 JUNB SFRP2 chr4-… chr4… 0.00628 0.0369 0.170 8.65e- 1 9.65e- 1 -0.105
## 6 MSX1 SFRP2 chr4-… chr4… -0.0315 0.0140 -2.25 2.47e- 2 7.40e- 2 -0.168
## 7 MSX2 SFRP2 chr4-… chr4… -0.0596 0.0130 -4.59 4.57e- 6 5.48e- 5 -0.190
## 8 NR2E1 SFRP2 chr4-… NR2E… 0.185 0.0215 8.63 9.11e-18 1.46e-16 0.191
## 9 HES5 OTX2 chr14… chr1… 0.0727 0.123 0.591 5.55e- 1 7.99e- 1 0.100
## 10 ID3 OTX2 chr14… chr1… -0.00695 0.0178 -0.390 6.97e- 1 8.80e- 1 0.132
## # … with 36 more rows
The coefficients of the models are tested using a t-test and modules are extracted by applying a significance threshold on the p-value.
In regularized linear models, the coefficients can be penalized so that they are pushed towards 0. In this way, only ‘strong’ connections are maintained. Here we use the glmnet
implementation
muo_data <- infer_grn(
muo_data,
method = 'cv.glmnet',
parallel = T,
genes = c('OTX2', 'SFRP2')
)
coef(muo_data)
## # A tibble: 54 x 6
## tf target region term estimate corr
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 POU3F1 SFRP2 chr4-153764659-15376… chr4_153764659_153764803… 0 0.156
## 2 NR2E1 SFRP2 chr4-153765329-15376… chr4_153765329_153765353… 0 0.191
## 3 PAX6 SFRP2 chr4-153782079-15378… chr4_153782079_153782141… 0 0.334
## 4 FOS SFRP2 chr4-153782176-15378… chr4_153782176_153782415… 0 -0.168
## 5 JUNB SFRP2 chr4-153782176-15378… chr4_153782176_153782415… 0 -0.105
## 6 MSX1 SFRP2 chr4-153787173-15378… chr4_153787173_153787202… 0 -0.168
## 7 MSX2 SFRP2 chr4-153787173-15378… chr4_153787173_153787202… -0.00309 -0.190
## 8 NR2E1 SFRP2 chr4-153794115-15379… chr4_153794115_153794213… 0.0400 0.191
## 9 NR2E1 SFRP2 chr4-153794305-15379… chr4_153794305_153794350… 0.0368 0.191
## 10 HES5 OTX2 chr14-56730526-56730… chr14_56730526_56730914:… 0 0.100
## # … with 44 more rows
You might notice that this time there are no p-values here, but a lot of the coefficients (estimate
) are 0. In this case, modules will be extracted not by p-value, but by selecting non-zero coefficients. The alpha
argument can be used to adjust the elasticnet mixing parameter. 1 amounts to a lasso penalty and 0 to the ridge penalty. Lasso models are more sparse and will push more coefficients to zero.
CellOracle, another method for GRN inference, uses Bagging ridge and Bayesian ridge regression models from sklearn (python). We have used reticulate
to interact with python and implement these models also here. You do have to install scikit-learn in python for it, though.
muo_data <- infer_grn(
muo_data,
method = 'bagging_ridge',
parallel = T,
genes = c('OTX2', 'SFRP2')
)
coef(muo_data)
## # A tibble: 54 x 9
## tf target region term estimate pval neglog10p padj corr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 POU3F1 SFRP2 chr4-153… chr4_153… 0.0521 3.50e-27 26.5 1.24e-26 0.156
## 2 NR2E1 SFRP2 chr4-153… chr4_153… 0.0452 1.09e-25 25.0 2.63e-25 0.191
## 3 PAX6 SFRP2 chr4-153… chr4_153… 0.0737 1.15e-27 26.9 4.68e-27 0.334
## 4 FOS SFRP2 chr4-153… chr4_153… -0.0450 1.77e-28 27.8 1.34e-27 -0.168
## 5 JUNB SFRP2 chr4-153… chr4_153… 0.00724 2.66e- 2 1.57 2.88e- 2 -0.105
## 6 MSX1 SFRP2 chr4-153… chr4_153… -0.0353 3.50e-27 26.5 1.24e-26 -0.168
## 7 MSX2 SFRP2 chr4-153… chr4_153… -0.0634 7.72e-28 27.1 3.41e-27 -0.190
## 8 NR2E1 SFRP2 chr4-153… NR2E1:ch… 0.112 2.32e-26 25.6 6.48e-26 0.191
## 9 NR2E1 SFRP2 chr4-153… NR2E1:ch… 0.118 2.48e-28 27.6 1.64e-27 0.191
## 10 HES5 OTX2 chr14-56… chr14_56… 0.0594 8.67e- 8 7.06 1.12e- 7 0.100
## # … with 44 more rows
As with the regular glm
, modules will be extracted based on p-value.
XGBoost is yet another popular method that is used by SCENIC. It is not based on linear regression but uses gradient-boosted Random Forest regression to model non-linear relationships.
muo_data <- infer_grn(
muo_data,
method = 'xgb',
parallel = T,
genes = c('OTX2', 'SFRP2')
)
coef(muo_data)
## # A tibble: 42 x 8
## tf target region term gain cover frequency corr
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 NR2E1 SFRP2 chr4-15379411… NR2E1:chr4_153… 3.77e-1 1.79e-1 0.145 0.191
## 2 MSX1 SFRP2 chr4-15378717… chr4_153787173… 2.05e-1 1.98e-1 0.213 -0.168
## 3 MSX2 SFRP2 chr4-15378717… chr4_153787173… 1.97e-1 1.81e-1 0.188 -0.190
## 4 POU3F1 SFRP2 chr4-15376465… chr4_153764659… 1.03e-1 1.85e-1 0.210 0.156
## 5 PAX6 SFRP2 chr4-15378207… chr4_153782079… 5.22e-2 9.87e-2 0.117 0.334
## 6 NR2E1 SFRP2 chr4-15376532… chr4_153765329… 4.38e-2 9.28e-2 0.0805 0.191
## 7 FOS SFRP2 chr4-15378217… chr4_153782176… 2.12e-2 6.59e-2 0.0442 -0.168
## 8 JUNB SFRP2 chr4-15378217… chr4_153782176… 5.14e-4 1.93e-5 0.00129 -0.105
## 9 EMX2 OTX2 chr14-5680575… EMX2:chr14_568… 1.32e-1 6.64e-2 0.0715 0.241
## 10 GLI3 OTX2 chr14-5680843… chr14_56808439… 9.45e-2 1.18e-1 0.0994 0.209
## # … with 32 more rows
Here we get neither a ‘normal’ coefficient, nor a p-value, but instead 3 different importance values: gain
, cover
, and frequency
. These indicate the importance of the variable to the regressor. To extract modules one can select the top target genes for each TF based on the gain
value. Alternatively, one can select the top TFs for each target gene.