The analysis of multivariate count
data arises in numerous fields including genomics, image analysis, text
mining, and sports analytics. The multinomial logit model is limiting due
to its restrictive mean-variance structure. Moreover, it assumes that
counts of different categories are negatively correlated. Models that
allow over-dispersion and possess more flexible positive and/or negative
correlation structures offer more realism. We implement four models in the R package MGLM: multinomial logit (MN), Dirichlet multinomial (DM), generalized Dirichlet multinomial (GDM), and negative multinomial (NegMN). Distribution fitting, regression, hypothesis testing, and variable selection are treated in a unified framework. The multivariate count data we analyze here have d categories.
The function MGLMfit
fits various multivariate discrete
distributions and outputs a list with the maximum likelihood estimate
(MLE) and relevant statistics.
When fitting distributions, i.e., when no covariates are involved, MN is a sub-model of DM, and DM is a sub-model of GDM.
MGLMfit
outputs the p-value of the likelihood ratio test
(LRT) for comparing the fitted model with the most commonly used
multinomial model. The NegMN model does not have a nesting relationship
with any of the other three models. Therefore, no LRT is performed when
fitting a NegMN distribution.
We first generate data from a multinomial distribution. Note that the multinomial parameter supplied to the rmn function, which must be positive, is automatically scaled to a probability vector.
Multinomial distribution fitting, although trivial, is implemented.
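A minimal sketch of this step, assuming rmn(n, size, alpha); the seed, sample size, batch size, and parameter values below are illustrative assumptions chosen to be consistent with the estimates that follow, not the original simulation settings.

set.seed(123)                   # assumed seed
n <- 200                        # assumed sample size
d <- 4
alpha <- rep(1, d)              # positive weights; rmn rescales them to probabilities
m <- 50                         # assumed batch size
Y <- rmn(n, m, alpha)           # n multinomial draws with batch size m
mnFit <- MGLMfit(Y, dist="MN")  # maximum likelihood fit of the MN distribution
mnFit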
## estimate SE
## alpha_1 0.2568 0.030891
## alpha_2 0.2467 0.030483
## alpha_3 0.2451 0.030416
## alpha_4 0.2514 0.030676
##
## Distribution: Multinomial
## Log-likelihood: -1457.788
## BIC: 2931.471
## AIC: 2921.576
## LRT test p value:
## Iterations:
As a comparison, we fit the DM distribution to the same data set. The results indicate that fitting a more complex model to the multinomial data offers no advantage.
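A sketch of the comparison fit (the object name is ours):

dmFit <- MGLMfit(Y, dist="DM")  # fit DM to the same multinomial sample
dmFit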
## estimate SE
## alpha_1 5270901 786946676
## alpha_2 5063596 755995891
## alpha_3 5030755 751092797
## alpha_4 5160065 770398732
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1457.788
## BIC: 2936.769
## AIC: 2923.576
## LRT test p value: 1.000
## Iterations: 35
Both the DM parameter estimates and their standard errors are large, indicating possible overfitting by the DM model. This is confirmed by the fact that the p-value of the LRT for comparing MN to DM is close to 1.
DM is a Dirichlet mixture of multinomials and allows over-dispersion. Similar to the MN model, it assumes that the counts of any two different categories are negatively correlated. We generate the data from the DM model and fit the DM distribution.
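A sketch of this simulation, assuming rdirmn(n, size, alpha) analogous to the rgdirmn call used later; the seed, sample size, and batch size are assumptions, and the true concentration parameters are taken to be 1 in each category, consistent with the estimates below.

set.seed(123)                   # assumed seed
n <- 200                        # assumed sample size
d <- 4
alpha <- rep(1, d)              # true concentration parameters (assumed)
m <- 50                         # assumed batch size
Y <- rdirmn(n, m, alpha)        # n Dirichlet-multinomial draws
dmFit <- MGLMfit(Y, dist="DM")
dmFit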
## estimate SE
## alpha_1 0.976670 0.076589
## alpha_2 0.995142 0.079255
## alpha_3 1.006120 0.080033
## alpha_4 0.904500 0.072547
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -2011.225
## BIC: 4043.644
## AIC: 4030.451
## LRT test p value: <0.0001
## Iterations: 4
The estimates are very close to the true values, with small standard errors. The LRT shows that the DM model fits significantly better than the MN model.
The GDM model uses d − 2 more parameters than the DM model and allows both positive and negative correlations among categories. DM is a sub-model of GDM. Here we fit the GDM distribution to the DM sample above.
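A sketch of the call (the object name is ours):

gdmFit <- MGLMfit(Y, dist="GDM")  # fit GDM to the DM sample
gdmFit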
## estimate SE
## alpha_1 1.158474 0.123403
## alpha_2 0.993293 0.437239
## alpha_3 0.839967 0.106374
## beta_1 3.706863 0.226414
## beta_2 1.979389 0.094645
## beta_3 0.759641 0.084403
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -2007.559
## BIC: 4046.907
## AIC: 4027.117
## LRT test p value: <0.0001
## Iterations: 27
GDM yields a slightly larger log-likelihood but also a larger BIC, suggesting DM as the preferred model. The LRT p-value indicates that GDM is still significantly better than the MN model. Now we simulate data from GDM and fit the GDM distribution.
set.seed(124)
n <- 200
d <- 4
alpha <- rep(1, d-1)  # true alpha parameters
beta <- rep(1, d-1)   # true beta parameters
m <- 50               # batch size
Y <- rgdirmn(n, m, alpha, beta)
gdmFit <- MGLMfit(Y, dist="GDM")
gdmFit
## estimate SE
## alpha_1 1.019835 0.103480
## alpha_2 0.826125 0.109316
## alpha_3 0.774387 0.092568
## beta_1 1.062112 0.095566
## beta_2 0.846216 0.108725
## beta_3 0.924559 0.135008
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1820.616
## BIC: 3673.021
## AIC: 3653.231
## LRT test p value: <0.0001
## Iterations: 24
The NegMN model is a multivariate analog of the negative binomial model. It assumes positive correlations among the counts. The following code generates data from the NegMN model and fits the NegMN distribution:
set.seed(1220)
n <- 100
d <- 4
prob <- rep(0.2, d)  # success probabilities of the d categories
beta <- 10           # overdispersion parameter
Y <- rnegmn(n, beta, prob)
negmnFit <- MGLMfit(Y, dist="NegMN")
show(negmnFit)
## estimate SE
## p_1 0.188151 0.009584
## p_2 0.194311 0.009837
## p_3 0.191511 0.009722
## p_4 0.196177 0.009914
## phi 12.313935 2.266097
##
## Distribution: Negative Multinomial
## Log-likelihood: -1104.579
## BIC: 2232.184
## AIC: 2219.158
## LRT test p value:
## Iterations: 4
Because MN is not a sub-model of NegMN, no LRT is performed here.
In regression, the n × p covariate matrix X is similar to that used in the glm function. The response should be an n × d count matrix. Unlike a GLM, where we estimate a parameter vector β, we need to estimate a parameter matrix B when the responses are multivariate. The dimension of the parameter matrix depends on the model:
- MN: B is p × (d − 1);
- DM: B is p × d;
- GDM: B is p × 2(d − 1);
- NegMN: B is p × (d + 1) when the overdispersion parameter is also linked to the covariates (regBeta=TRUE), or p × d plus a scalar overdispersion parameter otherwise (regBeta=FALSE).
The GDM model provides the most flexibility, but also requires the most parameters. In the regression function MGLMreg, the option dist="MN", "DM", "GDM", or "NegMN" specifies the model.
The rows $B_{j,\cdot}$ of the parameter matrix correspond to covariates. By default, the function outputs the Wald test statistics and p-values for testing $H_0: B_{j,\cdot}={\bf 0}$ vs $H_a: B_{j,\cdot}\neq {\bf 0}$. If the option LRT=TRUE is specified, the function also outputs LRT statistics and p-values.
Next, we demonstrate that model mis-specification causes hypothesis testing to fail. We simulate response data from the GDM model. Covariates X1 and X2 have no effect, while X3, X4, and X5 have different effect sizes.
set.seed(1234)
n <- 200
p <- 5
d <- 4
X <- matrix(runif(p * n), n, p)
alpha <- matrix(c(0.6, 0.8, 1), p, d - 1, byrow=TRUE)
alpha[c(1, 2), ] <- 0  # X1 and X2 have no effect on the alpha parameters
Alpha <- exp(X %*% alpha)
beta <- matrix(c(1.2, 1, 0.6), p, d - 1, byrow=TRUE)
beta[c(1, 2), ] <- 0   # X1 and X2 have no effect on the beta parameters
Beta <- exp(X %*% beta)
m <- runif(n, min=0, max=25) + 25  # batch sizes between 25 and 50
Y <- rgdirmn(n, m, Alpha, Beta)
We fit various regression models and test significance of covariates.
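The calls below match the Call lines shown in the outputs that follow; only the object names are ours.

mnReg  <- MGLMreg(Y ~ 0 + X, dist="MN")
dmReg  <- MGLMreg(Y ~ 0 + X, dist="DM")
gdmReg <- MGLMreg(Y ~ 0 + X, dist="GDM")
negReg <- MGLMreg(Y ~ 0 + X, dist="NegMN", regBeta=FALSE)
mnReg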
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "MN")
##
## Coefficients:
## Col_1 Col_2 Col_3
## [1,] 0.277063 -0.182760 -0.123204
## [2,] 0.543064 0.422730 0.246523
## [3,] 0.333252 0.517605 0.221851
## [4,] 0.356843 0.486722 0.565427
## [5,] -0.302455 0.192508 0.323713
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 24.63244 1.842859e-05
## X2 21.99680 6.533133e-05
## X3 23.10908 3.832310e-05
## X4 25.07475 1.489470e-05
## X5 49.37327 1.086326e-10
##
## Distribution: Multinomial
## Log-likelihood: -2194.448
## BIC: 4468.371
## AIC: 4418.896
## Iterations: 5
Under the mis-specified MN model, the Wald test declares all predictors significant, including the null predictors X1 and X2.
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "DM")
##
## Coefficients:
## Col_1 Col_2 Col_3 Col_4
## [1,] 0.154137 -0.118264 -0.188339 -0.013172
## [2,] 0.183264 0.142034 -0.183394 -0.333886
## [3,] 1.143146 1.254828 1.092635 0.811259
## [4,] 0.392703 0.545421 0.590015 0.131132
## [5,] 0.226350 0.660108 0.939599 0.487034
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 3.349794 0.501082
## X2 7.845339 0.097411
## X3 25.497386 0.000040
## X4 8.735121 0.068072
## X5 23.136042 0.000119
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1683.961
## BIC: 3473.889
## AIC: 3407.922
## Iterations: 7
Under the mis-specified DM model, the Wald test declares that X1, X2, and X4 have no effect, while X3 and X5 are significant; the true effect of X4 is missed.
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "GDM")
##
## Coefficients:
## alpha_Col_2 alpha_Col_1 alpha_Col_3 beta_Col_2 beta_Col_1 beta_Col_3
## [1,] -0.283917 0.190503 0.315706 -0.400203 0.684651 0.467592
## [2,] -0.209171 0.395541 0.014426 -0.508254 0.552645 -0.171410
## [3,] 1.090140 1.243785 1.171786 1.304955 1.537526 0.916033
## [4,] 0.296819 0.405333 0.809087 0.487838 0.573695 0.305889
## [5,] 0.601841 0.030123 1.281219 1.430931 0.260197 0.853498
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 9.424379 1.510802e-01
## X2 4.865683 5.611523e-01
## X3 26.828718 1.559064e-04
## X4 12.607034 4.971848e-02
## X5 38.637877 8.427803e-07
##
## Distribution: Generalized Dirichlet Multinomial
## Log-likelihood: -1676.399
## BIC: 3511.748
## AIC: 3412.798
## Iterations: 21
When using the correct model, GDM, the Wald test is able to differentiate the null effects from the significant ones. GDM regression yields the highest log-likelihood among the four fits, although the DM fit attains a slightly smaller BIC.
## Call: MGLMreg(formula = Y ~ 0 + X, dist = "NegMN", regBeta = FALSE)
##
## Coefficients:
## $alpha
## Col_1 Col_2 Col_3 Col_4
## X1 0.243606 -0.216368 -0.156525 -0.031371
## X2 0.061896 -0.055885 -0.232635 -0.479780
## X3 -0.160914 0.022688 -0.271238 -0.495127
## X4 -0.176181 -0.043828 0.034781 -0.530126
## X5 -0.607974 -0.115829 0.012937 -0.315857
##
## $phi
## phi
## 13.77531
##
##
## Hypothesis test:
## wald value Pr(>wald)
## X1 24.99903 5.033252e-05
## X2 24.90077 5.267448e-05
## X3 29.32830 6.704198e-06
## X4 28.18693 1.143075e-05
## X5 60.50179 2.275444e-12
##
## Distribution: Negative Multinomial
## Log-likelihood: -2908.896
## BIC: 5929.056
## AIC: 5859.792
## Iterations: 15
Again, the Wald test declares all predictors, including the null X1 and X2, to be significant.
We can use the fitted model for prediction. The predict function outputs the fitted probabilities of each category. This helps answer questions such as whether certain features increase the probability of observing category j. Take the fitted GDM model as an example:
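A sketch, assuming the predict method accepts the fitted object and a matrix of new covariate values; the new design point below is an arbitrary assumption.

newX <- matrix(runif(p), 1, p)  # one hypothetical observation of the p covariates
predict(gdmReg, newX)           # predicted probabilities of the d categories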
## Col_1 Col_2 Col_3 Col_4
## [1,] 0.3218286 0.2235816 0.3304694 0.1241204
Regularization is an important tool for model selection and for improving the risk properties of the estimates. In the package, we implement three types of penalties on the parameter matrix B, selected through the penalty option: "sweep", "group", and "nuclear"; each is demonstrated below.
The function MGLMtune
finds the optimal tuning parameter
with the smallest BIC and outputs the estimate using the chosen tuning
parameter. The output from MGLMtune
is a list containing
the solution path and the final estimate. Users can either provide a
vector of tuning parameters with option lambdas
or specify
the number of grid points via option ngridpt
and let the
function decide the default tuning parameters. The function MGLMsparsereg computes the regularized estimate at a given tuning parameter value lambda.
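For example, with a response Y and design matrix X in hand (such as those simulated below), a single regularized fit at a fixed tuning parameter might look like the following sketch; the lambda value and object name are assumptions.

sparseFit <- MGLMsparsereg(Y ~ 0 + X, dist="DM", lambda=10, penalty="group")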
We generate data from the DM model with row sparsity in the parameter matrix and show how each penalty type works.
set.seed(118)
n <- 100
p <- 10
d <- 5
m <- rbinom(n, 200, 0.8)  # batch sizes
X <- matrix(rnorm(n * p), n, p)
alpha <- matrix(0, p, d)
alpha[c(1, 3, 5), ] <- 1  # only X1, X3, X5 have effects (row sparsity)
Alpha <- exp(X %*% alpha)
Y <- rdirmn(size=m, alpha=Alpha)
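First, tuning with the sweep penalty. The call below matches the Call line in the output; the object name is ours.

sweepFit <- MGLMtune(Y ~ 0 + X, dist="DM", penalty="sweep", ngridpt=30)
show(sweepFit)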
## Call: MGLMtune(formula = Y ~ 0 + X, dist = "DM", penalty = "sweep",
## ngridpt = 30)
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1461.992
## BIC: 3066.744
## AIC: 2985.984
## Degrees of freedom: 31
## Lambda: 4.858822
## Number of grid points: 30
Since the rows of the parameter matrix correspond to predictors, selecting by rows performs variable selection at the predictor level.
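The call below matches the Call line in the output; the object name is ours.

groupFit <- MGLMtune(Y ~ 0 + X, dist="DM", penalty="group", ngridpt=30)
show(groupFit)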
## Call: MGLMtune(formula = Y ~ 0 + X, dist = "DM", penalty = "group",
## ngridpt = 30)
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1485.475
## BIC: 3079.238
## AIC: 3017.978
## Degrees of freedom: 23.51454
## Lambda: 20.20407
## Number of grid points: 30
Nuclear norm regularization encourages low rank in the regularized estimate.
nuclear <- MGLMtune(Y ~ 0 + X, dist="DM", penalty="nuclear", ngridpt=30, warm.start=FALSE)
show(nuclear)
## Call: MGLMtune(formula = Y ~ 0 + X, dist = "DM", penalty = "nuclear",
## ngridpt = 30, warm.start = FALSE)
##
## Distribution: Dirichlet Multinomial
## Log-likelihood: -1492.063
## BIC: 3070.422
## AIC: 3021.604
## Degrees of freedom: 18.7391
## Lambda: 37.53776
## Number of grid points: 30