Overview

In this section, we demonstrate how to use MatrixLM.jl with a simple example using simulated data.

Matrix Linear Models (MLMs) provide a simple yet robust multivariate framework for encoding relationships and groupings in high-throughput data. MLM's flexibility allows it to encode both categorical and continuous relationships, thereby enhancing the detection of associations between responses and predictors.

Within the matrix linear model framework, the model is written as follows:

\[Y = XBZ^T+E\]

where

  • $Y_{n \times m}$ is the response matrix,
  • $X_{n \times p}$ is the matrix of main predictors,
  • $Z_{m \times q}$ denotes the response attributes matrix based on supervised knowledge,
  • $E_{n \times m}$ is the error term,
  • $B_{p \times q}$ is the matrix of main and interaction effects.
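To make the dimensions concrete, the bilinear product $XBZ^T$ maps the $p \times q$ coefficient matrix into the $n \times m$ response space. A minimal base-Julia sketch, using small toy dimensions chosen purely for illustration (separate from the ones used later in this example):

```julia
# Toy dimensions: n samples, m responses, p predictors, q response attributes
n, m, p, q = 6, 4, 3, 2

X = randn(n, p)   # predictor matrix (n × p)
Z = randn(m, q)   # response attributes matrix (m × q)
B = randn(p, q)   # coefficient matrix (p × q)
E = randn(n, m)   # error matrix (n × m)

Y = X * B * Z' + E   # (n × p)(p × q)(q × m) → n × m response matrix
size(Y)              # (6, 4)
```

The chain of matrix products only conforms when the dimensions line up as in the bullet list above, which is a quick sanity check when setting up your own data.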

Data Generation

For simplicity, we assume that both the responses and the predictors are presented as dataframes.

Our dataset consists of a dataframe X with p = 5 predictors observed on n = 100 samples: two categorical and three numerical. We then consider a response dataframe Y composed of m = 250 responses. To simulate the Y data, we need to generate the matrices Z, B, and E.

The matrix Z provides information about the response population, corresponding to the columns of Y, $y_{i \in [1, 250]}$. The dimensions of this matrix are set at 250 × 10.

Given this setup, the coefficient matrix B matches the number of predictor columns with the number of information categories in Z; after dummy coding of the two categorical variables, the design matrix built from X has p = 7 columns, so B has dimensions 7 × 10. Finally, we construct the noise matrix E containing the error terms. We generate this matrix with entries drawn from a normal distribution with mean 0 and standard deviation 4, adding a layer of randomness to our simulation.

using MatrixLM, DataFrames, Random, Plots, StatsModels, Statistics
Random.seed!(1)

# Dimensions of matrices
n = 100
m = 250

# Number of groupings designed in the Z matrix
q = 10

# Generate data with two categorical variables and three numerical variables.
dfX = hcat(
    DataFrame(catvar1=string.(rand(0:1, n)), catvar2=rand(["A", "B", "C", "D"], n)),
    DataFrame(rand(n,3), ["var3", "var4", "var5"])
    );

Let's use the function design_matrix() to get the predictor model matrix based on the formula expression, including all the variable terms.

# Convert dataframe to prediction matrix
X = design_matrix(@mlmformula(catvar1 + catvar2 + var3 + var4 + var5), dfX)

p = size(X)[2];

We also have the option to specify contrast coding in the model. For a detailed understanding of how to implement contrast coding, please refer to the StatsModels.jl documentation on contrasts, which provides comprehensive instructions and examples.

# Convert dataframe to prediction matrix, specifying contrast coding
my_ctrst = Dict(
             :catvar1 => DummyCoding(base = "0"),
             :catvar2 => DummyCoding(base = "A")
           )

X = design_matrix(@mlmformula(catvar1 + catvar2 + var3 + var4 + var5), dfX, my_ctrst);

Randomly generate some data for column covariates Z and the error matrix E:

Z = rand(m,q);
E = randn(n,m).*4;

Next, we will structure the coefficient matrix B following a specific pattern. This approach will facilitate a more effective visualization of the results in the subsequent steps:

# Coefficient matrix B of size (p, q) = (7, 10)
B = [
    2.0   3.0   4.0   5.0  6.0  7.0  8.0  0.0 0.5 -2.0;
    0.01  0.02  0.01  0.09 0.18 0.03 0.14 0.0 0.5 -2.0;
    -1.0  -0.5  0.02  0.49 1.1  2.0  5.0  0.0 0.5 -2.0;
    -0.01 0.02  -0.01 3.0  3.0  7.0  0.14 0.0 0.5 -2.0;
    0.0   0.0   0.0   0.0  3.0  3.0  3.0  3.0 0.5 -2.0;
    3.0   3.0   3.0   3.0  0.08 0.03 0.0  0.0 0.5 -2.0;
    0.01  0.0   3.0   3.0  3.0  3.0 0.04  0.0 0.5 -2.0;
];

Generate the response matrix Y:

Y = X*B*Z' + E;

Now, construct the RawData object consisting of the response variable Y and row/column predictors X and Z. All three matrices must be passed in as 2-dimensional arrays.

# Construct a RawData object
dat = RawData(Response(Y), Predictors(X, Z));

Model estimation

Least-squares estimates for matrix linear models can be obtained by running mlm. An object of type Mlm will be returned, with variables for the coefficient estimates (B), the coefficient variance estimates (varB), and the estimated variance of the errors (sigma). By default, mlm estimates both row and column main effects (X and Z intercepts), but this behavior can be suppressed by setting addXIntercept=false and/or addZIntercept=false. Column weights for Y and the target type for variance shrinkage[1] can be optionally supplied to weights and targetType, respectively.

est = mlm(dat; addXIntercept=false, addZIntercept=false); # Model estimation
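For intuition about what mlm computes, the least-squares problem $\min_B \|Y - XBZ^T\|_F^2$ has the closed-form solution $\hat{B} = (X^TX)^{-1}X^TYZ(Z^TZ)^{-1}$ when X and Z have full column rank. Below is a minimal base-Julia sketch of that formula on toy noiseless data; it is not the package's internal implementation, which additionally handles intercepts, column weights, and variance shrinkage:

```julia
using LinearAlgebra

# Toy noiseless data: with E = 0 the closed form recovers B exactly
n, m, p, q = 50, 20, 3, 2
X, Z, B = randn(n, p), randn(m, q), randn(p, q)
Y = X * B * Z'

# B̂ = (X'X)⁻¹ X' Y Z (Z'Z)⁻¹, written with left/right solves for stability
Bhat = (X' * X) \ (X' * Y * Z) / (Z' * Z)

maximum(abs.(Bhat - B))   # ≈ 0 (up to floating-point error)
```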

Model predictions and residuals

The coefficient estimates can be accessed using coef(est). Predicted values and residuals can be obtained by calling predict() and resid(). By default, both of these functions use the same data used to fit the model. However, a new Predictors object can be passed into predict() as the newPredictors argument, and a new RawData object can be passed into resid() as the newData argument. For convenience, fitted(est) will return the fitted values by calling predict with the default arguments.

To compare the estimated coefficients with the original matrix B, we will visualize the matrices using heatmaps. This graphical representation allows us to readily see differences and similarities between the two.

esti_coef = coef(est); # Get the coefficients of the model

plot(
    heatmap(B[end:-1:1, :],
            size = (800, 300)),
    heatmap(esti_coef[end:-1:1, :],
            size = (800, 300),
            clims = (-2, 8)),
    title = ["\$ \\mathbf{B}\$" "\$ \\mathbf{\\hat{B}}\$"]
)
Example block output

Let's employ the same visualization method to compare the predicted values with the original Y response matrix. This allows us to gauge the accuracy of our model predictions.

preds = predict(est); # Prediction value

plot(
    heatmap(Y[end:-1:1, :],
            size = (800, 300)),
    heatmap(preds.Y[end:-1:1, :],
            size = (800, 300),
            # clims = (-2, 8)
            ),
    title = ["\$ \\mathbf{Y}\$" "\$ \\mathbf{\\hat{Y}}\$"]
)
Example block output

The resid() function, available in MatrixLM.jl, computes residuals for each observation, helping you evaluate the discrepancy between the model's predictions and the observed data.

resids = resid(est);

plot(
    heatmap(resids[end:-1:1, :],
            size = (800, 300)),
    histogram(
        vec(resids),   # flatten the n×m residual matrix to a vector
            grid  = false,
            label = "",
            size = (800, 300)),
    title = ["Residuals" "Distribution of the residuals"]
)
Example block output

T-statistics and permutation test

The t-statistics for an Mlm object (defined as est.B ./ sqrt.(est.varB)) can be obtained by running t_stat. By default, t_stat does not return the corresponding t-statistics for any main effects that were estimated by mlm, but they will be returned if isMainEff=true.

tStats = t_stat(est);

Permutation p-values for the t-statistics can be computed by the mlm_perms function. mlm_perms calls the more general function perm_pvals and will run the permutations in parallel when possible. The illustrative example below runs 500 permutations; a different number can be specified as the second argument. By default, the function used to permute Y is shuffle_rows, which shuffles the rows of Y. Alternative functions for permuting Y, such as shuffle_cols, can be passed into the argument permFun. mlm_perms calls mlm and t_stat, so the user is free to specify keyword arguments for mlm or t_stat; by default, mlm_perms will call both functions using their default behavior.

nPerms = 500
tStats, pVals = mlm_perms(dat, nPerms, addXIntercept=false, addZIntercept=false);

plot(
    heatmap(tStats[end:-1:1, :],
            c = :bluesreds,
            clims = (-40, 40),
            size = (800, 300)),
    heatmap(-log.(pVals[end:-1:1, :]),
            grid = false,
            size = (800, 300)),
    title = ["T Statistics" "- Log(P-values)"]
)
Example block output
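The mechanics behind these permutation p-values can be sketched in base Julia: break the pairing between predictors and responses by shuffling, recompute the statistic, and count how often the permuted statistic exceeds the observed one in absolute value. A simplified univariate illustration (not the package code, which permutes the rows of the full Y matrix and recomputes all t-statistics):

```julia
using Random, Statistics

Random.seed!(1)
x = randn(100)
y = 2 .* x .+ randn(100)

# Observed statistic: absolute correlation between x and y
obs = abs(cor(x, y))

# Null distribution: shuffling y breaks the x–y pairing
nperms = 1000
exceed = count(_ -> abs(cor(x, shuffle(y))) >= obs, 1:nperms)

pval = exceed / nperms   # permutation p-value, between 0 and 1
```

With a signal this strong, essentially no permuted statistic reaches the observed one, so the permutation p-value is at (or near) the resolution limit of 1/nperms.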

Summary

Call MatrixLM.summary on a fitted Mlm object to inspect the estimates, standard errors, confidence intervals (CIs), and the associated row and column labels.

The method returns a dataframe with one row per coefficient and the following columns:

  • row_term, col_term: index labels for the row-side (X) and column-side (Z) associated with each coefficient entry (column-major order).
  • coef: coefficient estimate for each X–Z pair.
  • std_error, t_stat, p_value: respectively, the standard error, t-statistic, and p-value for testing each coefficient.
  • ci_lower, ci_upper: two-sided confidence interval bounds at the specified significance level.

The summary method has three keyword arguments:

  • alpha: two-sided significance level for CIs and margins of error (default 0.05, 95% CI).
  • permutation_test: when true, compute permutation-based t-stats/p-values; when false, use t-distribution.
  • nPerms: number of permutations to run when permutation_test is true.

This summary table can be used to identify which row/column covariate combinations are significant and to estimate their effect sizes along with their uncertainty.

MatrixLM.summary(est, alpha = 0.05, permutation_test = false)
70×8 DataFrame
 Row │ row_term  col_term  coef  std_error  t_stat  p_value  ci_lower  ci_upper
     │ String  String  Float64  Float64  Float64  Float64  Float64  Float64
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ row_1  col_1  1.75143  0.178673  9.80245  2.9513e-16  1.40124  2.10162
   2 │ row_2  col_1  0.360905  0.244869  1.47386  0.143691  -0.119031  0.84084
   3 │ row_3  col_1  -1.04554  0.242664  -4.30858  3.87877e-5  -1.52115  -0.569923
   4 │ row_4  col_1  0.278587  0.237857  1.17124  0.244315  -0.187604  0.744779
   5 │ row_5  col_1  0.0943032  0.262571  0.359154  0.720245  -0.420326  0.608932
   6 │ row_6  col_1  2.79998  0.28161  9.94277  1.45831e-16  2.24804  3.35193
   7 │ row_7  col_1  0.0486862  0.264119  0.184335  0.854128  -0.468977  0.566349
   8 │ row_1  col_2  3.22292  0.157896  20.4117  2.95957e-37  2.91345  3.53239
   9 │ row_2  col_2  0.0758909  0.216395  0.350706  0.726554  -0.348235  0.500017
  10 │ row_3  col_2  -0.353671  0.214446  -1.64923  0.10227  -0.773977  0.0666353
  11 │ row_4  col_2  -0.00443493  0.210198  -0.0210988  0.983209  -0.416416  0.407546
  12 │ row_5  col_2  -0.0909272  0.232038  -0.391864  0.696001  -0.545713  0.363858
  13 │ row_6  col_2  3.11068  0.248863  12.4996  4.43582e-22  2.62291  3.59844
  14 │ row_7  col_2  -0.136539  0.233406  -0.584986  0.559888  -0.594006  0.320928
  15 │ row_1  col_3  4.09147  0.171685  23.8313  8.49482e-43  3.75498  4.42797
  16 │ row_2  col_3  0.258602  0.235293  1.09906  0.274405  -0.202563  0.719767
  17 │ row_3  col_3  0.0356855  0.233173  0.153043  0.878676  -0.421326  0.492697
  18 │ row_4  col_3  0.012337  0.228555  0.0539782  0.957061  -0.435622  0.460296
  19 │ row_5  col_3  -0.426862  0.252302  -1.69187  0.093816  -0.921364  0.0676403
  20 │ row_6  col_3  3.43753  0.270596  12.7036  1.64292e-22  2.90718  3.96789
  21 │ row_7  col_3  2.69936  0.253789  10.6362  4.49829e-18  2.20194  3.19678
  22 │ row_1  col_4  5.15894  0.179026  28.8168  6.09203e-50  4.80805  5.50982
  23 │ row_2  col_4  0.572005  0.245353  2.33135  0.0217619  0.0911218  1.05289
  24 │ row_3  col_4  0.680793  0.243143  2.79997  0.00614554  0.204242  1.15734
  25 │ row_4  col_4  2.90803  0.238327  12.2018  1.90315e-21  2.44091  3.37514
  26 │ row_5  col_4  -0.349563  0.263089  -1.32869  0.187006  -0.865208  0.166082
  27 │ row_6  col_4  3.16142  0.282166  11.2041  2.6398e-19  2.60839  3.71446
  28 │ row_7  col_4  2.60315  0.26464  9.83657  2.48631e-16  2.08447  3.12184
  29 │ row_1  col_5  5.63887  0.168007  33.5632  6.81681e-56  5.30958  5.96815
  30 │ row_2  col_5  -0.104946  0.230253  -0.455788  0.649541  -0.556233  0.34634
  31 │ row_3  col_5  1.25197  0.228179  5.48678  3.15013e-7  0.804743  1.69919
  32 │ row_4  col_5  2.93038  0.223659  13.102  2.38875e-23  2.49202  3.36875
  33 │ row_5  col_5  2.93967  0.246897  11.9064  8.134e-21  2.45576  3.42358
  34 │ row_6  col_5  0.0495166  0.2648  0.186996  0.852046  -0.469481  0.568514
  35 │ row_7  col_5  3.46494  0.248353  13.9517  4.13526e-25  2.97818  3.95171
  36 │ row_1  col_6  7.00079  0.142827  49.0157  3.05003e-71  6.72085  7.28072
  37 │ row_2  col_6  0.0415151  0.195744  0.212089  0.832474  -0.342135  0.425166
  38 │ row_3  col_6  2.09535  0.193981  10.8018  1.96448e-18  1.71515  2.47554
  39 │ row_4  col_6  7.23972  0.190138  38.0761  6.25764e-61  6.86706  7.61239
  40 │ row_5  col_6  2.93459  0.209894  13.9813  3.59536e-25  2.52321  3.34597
  41 │ row_6  col_6  0.0589202  0.225113  0.261736  0.794069  -0.382293  0.500134
  42 │ row_7  col_6  2.97939  0.211131  14.1116  1.94513e-25  2.56558  3.3932
  43 │ row_1  col_7  7.88158  0.150038  52.5306  4.12982e-74  7.58751  8.17565
  44 │ row_2  col_7  0.130785  0.205626  0.636034  0.526222  -0.272234  0.533804
  45 │ row_3  col_7  4.86224  0.203774  23.861  7.64572e-43  4.46285  5.26163
  46 │ row_4  col_7  -0.00337578  0.199737  -0.0169011  0.98655  -0.394854  0.388102
  47 │ row_5  col_7  3.54285  0.22049  16.0681  2.44207e-29  3.11069  3.975
  48 │ row_6  col_7  -0.339128  0.236478  -1.43408  0.1547  -0.802616  0.12436
  49 │ row_7  col_7  0.287488  0.22179  1.29622  0.197914  -0.147212  0.722188
  50 │ row_1  col_8  0.0707872  0.186217  0.380132  0.704662  -0.294192  0.435767
  51 │ row_2  col_8  -0.141201  0.255209  -0.553276  0.581322  -0.641402  0.359
  52 │ row_3  col_8  0.0869886  0.252911  0.34395  0.731613  -0.408707  0.582684
  53 │ row_4  col_8  -0.0363569  0.247901  -0.146659  0.883699  -0.522234  0.44952
  54 │ row_5  col_8  3.20139  0.273658  11.6985  2.27048e-20  2.66504  3.73775
  55 │ row_6  col_8  -0.0476754  0.293501  -0.162437  0.871293  -0.622927  0.527576
  56 │ row_7  col_8  -0.339679  0.275271  -1.23398  0.220131  -0.879201  0.199842
  57 │ row_1  col_9  0.75085  0.154607  4.85651  4.47286e-6  0.447827  1.05387
  58 │ row_2  col_9  0.16314  0.211887  0.769936  0.443171  -0.252152  0.578431
  59 │ row_3  col_9  0.219224  0.209979  1.04403  0.299014  -0.192326  0.630775
  60 │ row_4  col_9  0.325582  0.20582  1.58188  0.116866  -0.0778165  0.728981
  61 │ row_5  col_9  0.813984  0.227204  3.58261  0.000529594  0.368672  1.2593
  62 │ row_6  col_9  0.00777983  0.243679  0.0319266  0.974595  -0.469822  0.485382
  63 │ row_7  col_9  0.668851  0.228544  2.92658  0.00425102  0.220914  1.11679
  64 │ row_1  col_10  -2.00649  0.164348  -12.2088  1.83921e-21  -2.32861  -1.68438
  65 │ row_2  col_10  -2.34591  0.225237  -10.4153  1.36077e-17  -2.78737  -1.90446
  66 │ row_3  col_10  -2.1022  0.223209  -9.41811  2.0346e-15  -2.53968  -1.66472
  67 │ row_4  col_10  -2.04612  0.218787  -9.35209  2.8342e-15  -2.47493  -1.6173
  68 │ row_5  col_10  -2.15476  0.241519  -8.92168  2.45227e-14  -2.62813  -1.68139
  69 │ row_6  col_10  -1.59495  0.259032  -6.15733  1.58938e-8  -2.10264  -1.08725
  70 │ row_7  col_10  -1.77337  0.242943  -7.29952  7.34675e-11  -2.24953  -1.29721
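Since the result is an ordinary DataFrame, the usual DataFrames.jl tools apply directly. For instance, to keep only the coefficients whose p-values clear a 0.05 threshold, sorted by effect size (a sketch, assuming the column names listed above; `sumTable` and `sig` are illustrative variable names):

```julia
using DataFrames

# Summary table returned by MatrixLM.summary for the fitted model
sumTable = MatrixLM.summary(est, alpha = 0.05, permutation_test = false)

# Keep rows with p-values below 0.05, largest estimated effects first
sig = sort(filter(r -> r.p_value < 0.05, sumTable), :coef, rev = true)
```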

References

  • [1] Ledoit, O., & Wolf, M. (2003). Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance, 10(5), 603–621.