Models

Pando provides various regression models and modeling options that can be used for GRN inference. This vignette gives an overview of the available options. First, let’s load the object.

library(Pando)
library(tidyverse)
library(doParallel)
registerDoParallel(4)
muo_data <- read_rds('muo_data.rds')

Generalized linear model

The default option when running infer_grn() is a generalized linear model (GLM) with gaussian noise. Using the family argument, one can choose other noise models, e.g to fit directly on counts instead of on log-normalized data.

muo_data <- infer_grn(
    muo_data,
    parallel = T,
    genes = c('OTX2', 'SFRP2')
) 
coef(muo_data)

## # A tibble: 46 x 10
##    tf    target region term  estimate std_err statistic     pval     padj   corr
##    <chr> <chr>  <chr>  <chr>    <dbl>   <dbl>     <dbl>    <dbl>    <dbl>  <dbl>
##  1 POU3… SFRP2  chr4-… chr4…  0.0507   0.0201     2.53  1.15e- 2 3.83e- 2  0.156
##  2 NR2E1 SFRP2  chr4-… chr4…  0.0421   0.0197     2.14  3.27e- 2 8.72e- 2  0.191
##  3 PAX6  SFRP2  chr4-… chr4…  0.0804   0.0290     2.77  5.56e- 3 2.43e- 2  0.334
##  4 FOS   SFRP2  chr4-… chr4… -0.0474   0.0189    -2.51  1.20e- 2 3.83e- 2 -0.168
##  5 JUNB  SFRP2  chr4-… chr4…  0.00628  0.0369     0.170 8.65e- 1 9.65e- 1 -0.105
##  6 MSX1  SFRP2  chr4-… chr4… -0.0315   0.0140    -2.25  2.47e- 2 7.40e- 2 -0.168
##  7 MSX2  SFRP2  chr4-… chr4… -0.0596   0.0130    -4.59  4.57e- 6 5.48e- 5 -0.190
##  8 NR2E1 SFRP2  chr4-… NR2E…  0.185    0.0215     8.63  9.11e-18 1.46e-16  0.191
##  9 HES5  OTX2   chr14… chr1…  0.0727   0.123      0.591 5.55e- 1 7.99e- 1  0.100
## 10 ID3   OTX2   chr14… chr1… -0.00695  0.0178    -0.390 6.97e- 1 8.80e- 1  0.132
## # … with 36 more rows

The coefficients of the models are tested using a t-test and modules are extracted by applying a significance threshold on the p-value.

Regularized linear model

In regularized linear models, the coefficients can be penalized so that they are pushed towards 0. In this way, only ‘strong’ connections are maintained. Here we use the glmnet implementation

muo_data <- infer_grn(
    muo_data,
    method = 'cv.glmnet',
    parallel = T,
    genes = c('OTX2', 'SFRP2')
) 
coef(muo_data)

## # A tibble: 54 x 6
##    tf     target region                term                      estimate   corr
##    <chr>  <chr>  <chr>                 <chr>                        <dbl>  <dbl>
##  1 POU3F1 SFRP2  chr4-153764659-15376… chr4_153764659_153764803…  0        0.156
##  2 NR2E1  SFRP2  chr4-153765329-15376… chr4_153765329_153765353…  0        0.191
##  3 PAX6   SFRP2  chr4-153782079-15378… chr4_153782079_153782141…  0        0.334
##  4 FOS    SFRP2  chr4-153782176-15378… chr4_153782176_153782415…  0       -0.168
##  5 JUNB   SFRP2  chr4-153782176-15378… chr4_153782176_153782415…  0       -0.105
##  6 MSX1   SFRP2  chr4-153787173-15378… chr4_153787173_153787202…  0       -0.168
##  7 MSX2   SFRP2  chr4-153787173-15378… chr4_153787173_153787202… -0.00309 -0.190
##  8 NR2E1  SFRP2  chr4-153794115-15379… chr4_153794115_153794213…  0.0400   0.191
##  9 NR2E1  SFRP2  chr4-153794305-15379… chr4_153794305_153794350…  0.0368   0.191
## 10 HES5   OTX2   chr14-56730526-56730… chr14_56730526_56730914:…  0        0.100
## # … with 44 more rows

You might notice that this time there are no p-values here, but a lot of the coefficients (estimate) are 0. In this case, modules will be extracted not by p-value, but by selecting non-zero coefficients. The alpha argument can be used to adjust the elasticnet mixing parameter. 1 amounts to a lasso penalty and 0 to the ridge penalty. Lasso models are more sparse and will push more coefficients to zero.

Bagging ridge & Bayesian ridge

CellOracle, another method for GRN inference, uses Bagging ridge and Bayesian ridge regression models from sklearn (python). We have used reticulate to interact with python and implement these models also here. You do have to install scikit-learn in python for it, though.

muo_data <- infer_grn(
    muo_data,
    method = 'bagging_ridge',
    parallel = T,
    genes = c('OTX2', 'SFRP2')
) 
coef(muo_data)

## # A tibble: 54 x 9
##    tf     target region    term      estimate     pval neglog10p     padj   corr
##    <chr>  <chr>  <chr>     <chr>        <dbl>    <dbl>     <dbl>    <dbl>  <dbl>
##  1 POU3F1 SFRP2  chr4-153… chr4_153…  0.0521  3.50e-27     26.5  1.24e-26  0.156
##  2 NR2E1  SFRP2  chr4-153… chr4_153…  0.0452  1.09e-25     25.0  2.63e-25  0.191
##  3 PAX6   SFRP2  chr4-153… chr4_153…  0.0737  1.15e-27     26.9  4.68e-27  0.334
##  4 FOS    SFRP2  chr4-153… chr4_153… -0.0450  1.77e-28     27.8  1.34e-27 -0.168
##  5 JUNB   SFRP2  chr4-153… chr4_153…  0.00724 2.66e- 2      1.57 2.88e- 2 -0.105
##  6 MSX1   SFRP2  chr4-153… chr4_153… -0.0353  3.50e-27     26.5  1.24e-26 -0.168
##  7 MSX2   SFRP2  chr4-153… chr4_153… -0.0634  7.72e-28     27.1  3.41e-27 -0.190
##  8 NR2E1  SFRP2  chr4-153… NR2E1:ch…  0.112   2.32e-26     25.6  6.48e-26  0.191
##  9 NR2E1  SFRP2  chr4-153… NR2E1:ch…  0.118   2.48e-28     27.6  1.64e-27  0.191
## 10 HES5   OTX2   chr14-56… chr14_56…  0.0594  8.67e- 8      7.06 1.12e- 7  0.100
## # … with 44 more rows

As with the regular glm, modules will be extracted based on p-value.

XGBoost

XGBoost is yet another popular method that is used by SCENIC. It is not based on linear regression but uses gradient-boosted Random Forest regression to model non-linear relationships.

muo_data <- infer_grn(
    muo_data,
    method = 'xgb',
    parallel = T,
    genes = c('OTX2', 'SFRP2')
) 
coef(muo_data)

## # A tibble: 42 x 8
##    tf     target region         term               gain   cover frequency   corr
##    <chr>  <chr>  <chr>          <chr>             <dbl>   <dbl>     <dbl>  <dbl>
##  1 NR2E1  SFRP2  chr4-15379411… NR2E1:chr4_153… 3.77e-1 1.79e-1   0.145    0.191
##  2 MSX1   SFRP2  chr4-15378717… chr4_153787173… 2.05e-1 1.98e-1   0.213   -0.168
##  3 MSX2   SFRP2  chr4-15378717… chr4_153787173… 1.97e-1 1.81e-1   0.188   -0.190
##  4 POU3F1 SFRP2  chr4-15376465… chr4_153764659… 1.03e-1 1.85e-1   0.210    0.156
##  5 PAX6   SFRP2  chr4-15378207… chr4_153782079… 5.22e-2 9.87e-2   0.117    0.334
##  6 NR2E1  SFRP2  chr4-15376532… chr4_153765329… 4.38e-2 9.28e-2   0.0805   0.191
##  7 FOS    SFRP2  chr4-15378217… chr4_153782176… 2.12e-2 6.59e-2   0.0442  -0.168
##  8 JUNB   SFRP2  chr4-15378217… chr4_153782176… 5.14e-4 1.93e-5   0.00129 -0.105
##  9 EMX2   OTX2   chr14-5680575… EMX2:chr14_568… 1.32e-1 6.64e-2   0.0715   0.241
## 10 GLI3   OTX2   chr14-5680843… chr14_56808439… 9.45e-2 1.18e-1   0.0994   0.209
## # … with 32 more rows

Here we get neither a ‘normal’ coefficient, nor a p-value, but instead 3 different importance values: gain, cover, and frequency. These indicate the importance of the variable to the regressor. To extract modules one can select the top target genes for each TF based on the gain value. Alternatively, one can select the top TFs for each target gene.

Brms

Finally, we implemented to option to use Bayesian regression models with brms and Stan.

muo_data <- infer_grn(
    muo_data,
    method = 'brms',
    parallel = T,
    genes = c('OTX2', 'SFRP2')
)

However, these usually have very long runtimes and are only feasible on very small GRNs.

Jonas Simon Fleck

5/17/2022

Generalized linear model

Regularized linear model

Bagging ridge & Bayesian ridge

XGBoost

Brms