The analysis of multivariate count
data arises in numerous fields including genomics, image analysis, text
mining, and sports analytics. The multinomial logit model is limiting due
to its restrictive mean-variance structure. Moreover, it assumes that
counts of different categories are negatively correlated. Models that
allow over-dispersion and possess more flexible positive and/or negative
correlation structures offer more realism. We implement four models in the R package MGLM: multinomial logit (MN), Dirichlet multinomial (DM), generalized Dirichlet multinomial (GDM), and negative multinomial (NegMN). Distribution fitting, regression, hypothesis testing, and variable selection are treated in a unified framework. The multivariate count data we analyze here have d categories.
The function MGLMfit
fits various multivariate discrete
distributions and outputs a list with the maximum likelihood estimate
(MLE) and relevant statistics.
When fitting distributions, i.e., when no covariates are involved, MN is a sub-model of DM, and DM is a sub-model of GDM.
MGLMfit
outputs the p-value of the likelihood ratio test
(LRT) for comparing the fitted model with the most commonly used
multinomial model. The NegMN model does not have a nesting relationship
with any of the other three models. Therefore, no LRT is performed when
fitting a NegMN distribution.
We first generate data from a multinomial distribution. Note that the multinomial parameter supplied to the rmn function, which must be positive, is automatically scaled to a probability vector.
Multinomial distribution fitting, although trivial, is implemented.
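A minimal sketch of this step, assuming rmn(n, size, alpha); the seed, sample size, batch size, and parameter values below are illustrative assumptions chosen to be consistent with the estimates that follow, not the original simulation settings.

set.seed(123)                   # assumed seed
n <- 200                        # assumed sample size
d <- 4
alpha <- rep(1, d)              # positive weights; rmn rescales them to probabilities
m <- 50                         # assumed batch size
Y <- rmn(n, m, alpha)           # n multinomial draws with batch size m
mnFit <- MGLMfit(Y, dist="MN")  # maximum likelihood fit of the MN distribution
mnFit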
## estimate SE
## alpha_1 0.2568 0.030891
## alpha_2 0.2467 0.030483
## alpha_3 0.2451 0.030416
## alpha_4 0.2514 0.030676
##
## Distribution: Multinomial
## Log-likelihood: -1457.788
## BIC: 2931.471
## AIC: 2921.576
## LRT test p value:
## Iterations:
As a comparison, we fit the DM distribution to the same data set. The results indicate that fitting a more complex model to the multinomial data offers no advantage.
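A sketch of the comparison fit (the object name is ours):

dmFit <- MGLMfit(Y, dist="DM")  # fit DM to the same multinomial sample
dmFit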
## estimate SE
## alpha_1 5270901 786946676
## alpha_2 5063596 755995891
## alpha_3 5030755 751092797
## alpha_4 5160065 770398732
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1457.788
## BIC: 2936.769
## AIC: 2923.576
## LRT test p value: 1.000
## Iterations: 35
Both the DM parameter estimates and their standard errors are large, indicating possible overfitting by the DM model. This is confirmed by the fact that the p-value of the LRT for comparing MN to DM is close to 1.
DM is a Dirichlet mixture of multinomials and allows over-dispersion. Similar to the MN model, it assumes that the counts of any two different categories are negatively correlated. We generate the data from the DM model and fit the DM distribution.
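A sketch of this simulation, assuming rdirmn(n, size, alpha) analogous to the rgdirmn call used later; the seed, sample size, and batch size are assumptions, and the true concentration parameters are taken to be 1 in each category, consistent with the estimates below.

set.seed(123)                   # assumed seed
n <- 200                        # assumed sample size
d <- 4
alpha <- rep(1, d)              # true concentration parameters (assumed)
m <- 50                         # assumed batch size
Y <- rdirmn(n, m, alpha)        # n Dirichlet-multinomial draws
dmFit <- MGLMfit(Y, dist="DM")
dmFit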
## estimate SE
## alpha_1 0.976670 0.076589
## alpha_2 0.995142 0.079255
## alpha_3 1.006120 0.080033
## alpha_4 0.904500 0.072547
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -2011.225
## BIC: 4043.644
## AIC: 4030.451
## LRT test p value: <0.0001
## Iterations: 4
The estimates are very close to the true values, with small standard errors. The LRT shows that the DM model fits significantly better than the MN model.
The GDM model uses d − 2 more parameters than the DM model and allows both positive and negative correlations among categories. DM is a sub-model of GDM. Here we fit the GDM distribution to the DM sample above.
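A sketch of the call (the object name is ours):

gdmFit <- MGLMfit(Y, dist="GDM")  # fit GDM to the DM sample
gdmFit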
## estimate SE
## alpha_1 1.158474 0.123403
## alpha_2 0.993293 0.437239
## alpha_3 0.839967 0.106374
## beta_1 3.706863 0.226414
## beta_2 1.979389 0.094645
## beta_3 0.759641 0.084403
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -2007.559
## BIC: 4046.907
## AIC: 4027.117
## LRT test p value: <0.0001
## Iterations: 27
GDM yields a slightly larger log-likelihood but also a larger BIC, suggesting DM as the preferred model. The LRT p-value indicates that GDM is still significantly better than the MN model. Now we simulate data from GDM and fit the GDM distribution.
set.seed(124)
n <- 200
d <- 4
alpha <- rep(1, d-1)  # true alpha parameters
beta <- rep(1, d-1)   # true beta parameters
m <- 50               # batch size
Y <- rgdirmn(n, m, alpha, beta)
gdmFit <- MGLMfit(Y, dist="GDM")
gdmFit
## estimate SE
## alpha_1 1.019835 0.103480
## alpha_2 0.826125 0.109316
## alpha_3 0.774387 0.092568
## beta_1 1.062112 0.095566
## beta_2 0.846216 0.108725
## beta_3 0.924559 0.135008
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1820.616
## BIC: 3673.021
## AIC: 3653.231
## LRT test p value: <0.0001
## Iterations: 24
The NegMN model is a multivariate analog of the negative binomial model. It assumes positive correlations among the counts. The following code generates data from the NegMN model and fits the NegMN distribution:
set.seed(1220)
n <- 100
d <- 4
prob <- rep(0.2, d)  # success probabilities of the d categories
beta <- 10           # overdispersion parameter
Y <- rnegmn(n, beta, prob)
negmnFit <- MGLMfit(Y, dist="NegMN")
show(negmnFit)
## estimate SE
## p_1 0.188151 0.009584
## p_2 0.194311 0.009837
## p_3 0.191511 0.009722
## p_4 0.196177 0.009914
## phi 12.313935 2.266097
##
## Distribution: Negative Multinomial
## Log-likelihood: -1104.579
## BIC: 2232.184
## AIC: 2219.158
## LRT test p value:
## Iterations: 4
Because MN is not a sub-model of NegMN, no LRT is performed here.
In regression, the n × p covariate matrix X is similar to that used in the glm function. The response should be an n × d count matrix. Unlike a GLM, where we estimate a parameter vector β, we need to estimate a parameter matrix B when the responses are multivariate. The dimension of the parameter matrix depends on the model:
- MN: B is p × (d − 1);
- DM: B is p × d;
- GDM: B is p × 2(d − 1);
- NegMN: B is p × (d + 1) when the overdispersion parameter is also linked to the covariates (regBeta=TRUE), or p × d plus a scalar overdispersion parameter otherwise (regBeta=FALSE).
The GDM model provides the most flexibility, but also requires the most parameters. In the regression function MGLMreg, the option dist="MN", "DM", "GDM", or "NegMN" specifies the model.
The rows $B_{j,\cdot}$ of the parameter matrix correspond to covariates. By default, the function outputs the Wald test statistics and p-values for testing $H_0: B_{j,\cdot}={\bf 0}$ vs $H_a: B_{j,\cdot}\neq {\bf 0}$. If the option LRT=TRUE is specified, the function also outputs LRT statistics and p-values.
Next, we demonstrate that model mis-specification causes hypothesis testing to fail. We simulate response data from the GDM model. Covariates X1 and X2 have no effect, while X3, X4, and X5 have different effect sizes.
set.seed(1234)
n <- 200
p <- 5
d <- 4
X <- matrix(runif(p * n), n, p)
alpha <- matrix(c(0.6, 0.8, 1), p, d - 1, byrow=TRUE)
alpha[c(1, 2), ] <- 0  # X1 and X2 have no effect on the alpha parameters
Alpha <- exp(X %*% alpha)
beta <- matrix(c(1.2, 1, 0.6), p, d - 1, byrow=TRUE)
beta[c(1, 2), ] <- 0   # X1 and X2 have no effect on the beta parameters
Beta <- exp(X %*% beta)
m <- runif(n, min=0, max=25) + 25  # batch sizes between 25 and 50
Y <- rgdirmn(n, m, Alpha, Beta)
We fit various regression models and test significance of covariates.
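The calls below match the Call lines shown in the outputs that follow; only the object names are ours.

mnReg  <- MGLMreg(Y ~ 0 + X, dist="MN")
dmReg  <- MGLMreg(Y ~ 0 + X, dist="DM")
gdmReg <- MGLMreg(Y ~ 0 + X, dist="GDM")
negReg <- MGLMreg(Y ~ 0 + X, dist="NegMN", regBeta=FALSE)
mnReg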
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "MN")
##
## Coefficients:
## Col_1 Col_2 Col_3
## [1,] 0.277063 -0.182760 -0.123204
## [2,] 0.543064 0.422730 0.246523
## [3,] 0.333252 0.517605 0.221851
## [4,] 0.356843 0.486722 0.565427
## [5,] -0.302455 0.192508 0.323713
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 24.63244 1.842859e-05
## X2 21.99680 6.533133e-05
## X3 23.10908 3.832310e-05
## X4 25.07475 1.489470e-05
## X5 49.37327 1.086326e-10
##
## Distribution: Multinomial
## Log-likelihood: -2194.448
## BIC: 4468.371
## AIC: 4418.896
## Iterations: 5
Under the mis-specified MN model, the Wald test declares all predictors significant, including the null predictors X1 and X2.
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "DM")
##
## Coefficients:
## Col_1 Col_2 Col_3 Col_4
## [1,] 0.154137 -0.118264 -0.188339 -0.013172
## [2,] 0.183264 0.142034 -0.183394 -0.333886
## [3,] 1.143146 1.254828 1.092635 0.811259
## [4,] 0.392703 0.545421 0.590015 0.131132
## [5,] 0.226350 0.660108 0.939599 0.487034
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 3.349794 0.501082
## X2 7.845339 0.097411
## X3 25.497386 0.000040
## X4 8.735121 0.068072
## X5 23.136042 0.000119
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1683.961
## BIC: 3473.889
## AIC: 3407.922
## Iterations: 7
Under the mis-specified DM model, the Wald test declares that X1, X2, and X4 have no effect, while X3 and X5 are significant; the true effect of X4 is missed.
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "GDM")
##
## Coefficients:
## alpha_Col_2 alpha_Col_1 alpha_Col_3 beta_Col_2 beta_Col_1 beta_Col_3
## [1,] -0.283917 0.190503 0.315706 -0.400203 0.684651 0.467592
## [2,] -0.209171 0.395541 0.014426 -0.508254 0.552645 -0.171410
## [3,] 1.090140 1.243785 1.171786 1.304955 1.537526 0.916033
## [4,] 0.296819 0.405333 0.809087 0.487838 0.573695 0.305889
## [5,] 0.601841 0.030123 1.281219 1.430931 0.260197 0.853498
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 9.424379 1.510802e-01
## X2 4.865683 5.611523e-01
## X3 26.828718 1.559064e-04
## X4 12.607034 4.971848e-02
## X5 38.637877 8.427803e-07
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1676.399
## BIC: 3511.748
## AIC: 3412.798
## Iterations: 21
When using the correct model, GDM, the Wald test is able to differentiate the null effects from the significant ones. GDM regression yields the highest log-likelihood among the four fits, although the DM fit attains a slightly smaller BIC.
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "NegMN", regBeta = FALSE)
##
## Coefficients:
## $alpha
## Col_1 Col_2 Col_3 Col_4
## X1 0.243606 -0.216368 -0.156525 -0.031371
## X2 0.061896 -0.055885 -0.232635 -0.479780
## X3 -0.160914 0.022688 -0.271238 -0.495127
## X4 -0.176181 -0.043828 0.034781 -0.530126
## X5 -0.607974 -0.115829 0.012937 -0.315857
##
## $phi
## phi
## 13.77531
##
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 24.99903 5.033252e-05
## X2 24.90077 5.267448e-05
## X3 29.32830 6.704198e-06
## X4 28.18693 1.143075e-05
## X5 60.50179 2.275444e-12
##
## Distribution: Negative Multinomial
## Log-likelihood: -2908.896
## BIC: 5929.056
## AIC: 5859.792
## Iterations: 15
Again, the Wald test declares all predictors, including the null X1 and X2, to be significant.
We can use the fitted model for prediction. The predict function outputs the fitted probabilities of each category. This helps answer questions such as whether certain features increase the probability of observing category j. Take the fitted GDM model as an example:
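A sketch, assuming the predict method accepts the fitted object and a matrix of new covariate values; the new design point below is an arbitrary assumption.

newX <- matrix(runif(p), 1, p)  # one hypothetical observation of the p covariates
predict(gdmReg, newX)           # predicted probabilities of the d categories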
## Col_1 Col_2 Col_3 Col_4
## [1,] 0.3218286 0.2235816 0.3304694 0.1241204
Regularization is an important tool for model selection and for improving the risk properties of the estimates. In the package, we implement three types of penalties on the parameter matrix B, selected through the penalty option: "sweep", "group", and "nuclear"; each is demonstrated below.
The function MGLMtune
finds the optimal tuning parameter
with the smallest BIC and outputs the estimate using the chosen tuning
parameter. The output from MGLMtune
is a list containing
the solution path and the final estimate. Users can either provide a
vector of tuning parameters with option lambdas
or specify
the number of grid points via option ngridpt
and let the
function decide the default tuning parameters. The function MGLMsparsereg computes the regularized estimate at a given tuning parameter value lambda.
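For example, with a response Y and design matrix X in hand (such as those simulated below), a single regularized fit at a fixed tuning parameter might look like the following sketch; the lambda value and object name are assumptions.

sparseFit <- MGLMsparsereg(Y ~ 0 + X, dist="DM", lambda=10, penalty="group")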
We generate data from the DM model with row sparsity in the parameter matrix and show how each penalty type works.
set.seed(118)
n <- 100
p <- 10
d <- 5
m <- rbinom(n, 200, 0.8)  # batch sizes
X <- matrix(rnorm(n * p), n, p)
alpha <- matrix(0, p, d)
alpha[c(1, 3, 5), ] <- 1  # only X1, X3, X5 have effects (row sparsity)
Alpha <- exp(X %*% alpha)
Y <- rdirmn(size=m, alpha=Alpha)
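First, tuning with the sweep penalty. The call below matches the Call line in the output; the object name is ours.

sweepFit <- MGLMtune(Y ~ 0 + X, dist="DM", penalty="sweep", ngridpt=30)
show(sweepFit)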
## Call: MGLMtune(formula = Y ~ 0 + X, dist = "DM", penalty = "sweep",
## ngridpt = 30)
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1461.992
## BIC: 3066.744
## AIC: 2985.984
## Degrees of freedom: 31
## Lambda: 4.858822
## Number of grid points: 30
Since the rows of the parameter matrix correspond to predictors, selecting by rows performs variable selection at the predictor level.
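The call below matches the Call line in the output; the object name is ours.

groupFit <- MGLMtune(Y ~ 0 + X, dist="DM", penalty="group", ngridpt=30)
show(groupFit)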
## Call: MGLMtune(formula = Y ~ 0 + X, dist = "DM", penalty = "group",
## ngridpt = 30)
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1485.475
## BIC: 3079.238
## AIC: 3017.978
## Degrees of freedom: 23.51454
## Lambda: 20.20407
## Number of grid points: 30
Nuclear norm regularization encourages low rank in the regularized estimate.
nuclear <- MGLMtune(Y ~ 0 + X, dist="DM", penalty="nuclear", ngridpt=30, warm.start=FALSE)
show(nuclear)
## Call: MGLMtune(formula = Y ~ 0 + X, dist = "DM", penalty = "nuclear",
## ngridpt = 30, warm.start = FALSE)
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1492.063
## BIC: 3070.422
## AIC: 3021.604
## Degrees of freedom: 18.7391
## Lambda: 37.53776
## Number of grid points: 30