3.5 Additional Matrix Tidbits that Will Come Up

Inverse

An \(n \times n\) matrix \(A\) is invertible if there exists an \(n \times n\) inverse matrix \(A^{-1}\) such that:

  • \(AA^{-1} = A^{-1}A = I_n\)
  • where \(I_n\) is the \(n \times n\) identity matrix, which has diagonal elements equal to 1 and off-diagonal elements equal to 0. Example:

\(I_n = \begin{pmatrix} 1_{11} & 0 & \dots & 0\\ 0& 1_{22} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1_{nn} \end{pmatrix}\)

  • Multiplying a matrix by the identity matrix returns the matrix itself: \(AI_n = A\) (see the quick check in R below).
    • It’s like the matrix version of multiplying a number by one.
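As a quick check in R (a sketch; the matrix A is re-created here so the snippet runs on its own and matches the example used later in this section), diag(n) creates an \(n \times n\) identity matrix:

## diag(2) creates the 2 x 2 identity matrix
A <- rbind(c(3,5), c(4,6), c(6,8))
A %*% diag(2)
     [,1] [,2]
[1,]    3    5
[2,]    4    6
[3,]    6    8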

Note: A matrix must be square (\(n \times n\)) to be invertible. (But not all square matrices are invertible.) A matrix is invertible if and only if its columns are linearly independent. This is important for understanding why you cannot have two perfectly collinear variables in a regression model.
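To see why this matters in R (a minimal sketch with a made-up matrix; solve(), which computes inverses, is shown below), a square matrix whose columns are linearly dependent cannot be inverted:

## The second column is exactly 2 times the first, so the columns
## are linearly dependent and M is singular (not invertible)
M <- cbind(c(1, 2, 5), c(2, 4, 10), c(3, 1, 7))
## solve(M) would stop with an error reporting that the system is singular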

We will not do much solving for inverses in this course. However, the inverse will be useful in solving for and simplifying expressions.

3.5.1 Transpose

When we transpose a matrix, we flip the \(i\) and \(j\) components.

  • Example: Take a \(4 \times 3\) matrix \(A\) and find the \(3 \times 4\) matrix \(A^{T}\).
  • A transpose is usually denoted as \(A^{T}\) or \(A'\).

\(A = \begin{pmatrix} a_{11} & a_{12} & a_{13}\\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ a_{41} & a_{42} & a_{43} \end{pmatrix}\) then \(A^T = \begin{pmatrix} a'_{11} & a'_{12} & a'_{13} & a'_{14}\\ a'_{21} & a'_{22} & a'_{23} & a'_{24} \\ a'_{31} & a'_{32} & a'_{33} & a'_{34} \end{pmatrix}\)

If \(A = \begin{pmatrix} 1 & 4 & 2 \\ 3 & 1 & 11 \\ 5 & 9 & 4 \\ 2 & 11& 4 \end{pmatrix}\) then \(A^T = \begin{pmatrix} 1 & 3 & 5 & 2\\ 4 & 1 & 9 & 11 \\ 2& 11 & 4 & 4 \end{pmatrix}\)

Check for yourself: What was in the first row (\(i=1\)), second column (\(j=2\)) is now in the second row (\(i=2\)), first column (\(j=1\)). That is, \(a_{12} = 4 = a'_{21}\).

We can transpose matrices in R using t(). For example, take our matrix A:

A
     [,1] [,2]
[1,]    3    5
[2,]    4    6
[3,]    6    8
t(A)
     [,1] [,2] [,3]
[1,]    3    4    6
[2,]    5    6    8

In R, you can find the inverse of a square matrix with solve():

solve(A)
Error in solve.default(A): 'a' (3 x 2) must be square

Note that while \(A\) is not square, \(A'A\) is square:

AtA <- t(A) %*% A

solve(AtA)
          [,1]      [,2]
[1,]  2.232143 -1.553571
[2,] -1.553571  1.089286
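As a sanity check (continuing with the AtA object above), multiplying the inverse by the original matrix should return the identity, up to floating-point rounding:

## The product of a matrix and its inverse is the identity matrix
round(solve(AtA) %*% AtA)
     [,1] [,2]
[1,]    1    0
[2,]    0    1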

3.5.2 Additional Matrix Properties and Rules

These are a few additional properties and rules that will be useful to us at various points in the course:

  • Symmetric: Matrix \(A\) is symmetric if \(A = A^T\)
  • Idempotent: Matrix \(A\) is idempotent if \(A^2 = A\) (a small example appears after the trace computation below)
  • Trace: The trace of a square matrix is the sum of its diagonal components: \(Tr(A) = a_{11} + a_{22} + \dots + a_{nn}\)

Example of symmetric matrix:

\(D = \begin{pmatrix} 1 & 6 & 22 \\ 6 & 4 & 7 \\ 22 & 7 & 11 \end{pmatrix}\)

## Look at the equivalence
D <- rbind(c(1,6,22), c(6,4,7), c(22,7,11))
D
     [,1] [,2] [,3]
[1,]    1    6   22
[2,]    6    4    7
[3,]   22    7   11
t(D)
     [,1] [,2] [,3]
[1,]    1    6   22
[2,]    6    4    7
[3,]   22    7   11

What is the trace of this matrix?

## diag() pulls out the diagonal of a matrix
sum(diag(D))
[1] 16
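The list above also mentions idempotent matrices. As a minimal sketch with a made-up matrix, a \(2 \times 2\) matrix in which every element is 0.5 is idempotent: multiplying it by itself returns the same matrix.

## An idempotent matrix: P %*% P equals P
P <- matrix(0.5, nrow = 2, ncol = 2)
P %*% P
     [,1] [,2]
[1,]  0.5  0.5
[2,]  0.5  0.5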

3.5.3 Matrix Rules

Due to conformability and other considerations, matrix operations are somewhat more restrictive, particularly when it comes to commutativity.

  • Associative \((A + B) + C = A + (B + C)\) and \((AB) C = A(BC)\)
  • Commutative \(A + B = B + A\)
  • Distributive \(A(B + C) = AB + AC\) and \((A + B) C = AC + BC\)
  • The commutative law for multiplication does not hold: the order of multiplication matters, \(AB \neq BA\) in general (see the quick check below).
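A quick numerical illustration in R (a sketch with two made-up \(2 \times 2\) matrices):

## Matrix multiplication is not commutative: the two products differ
B1 <- rbind(c(1, 2), c(3, 4))
B2 <- rbind(c(0, 1), c(1, 0))
B1 %*% B2
     [,1] [,2]
[1,]    2    1
[2,]    4    3
B2 %*% B1
     [,1] [,2]
[1,]    3    4
[2,]    1    2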

Rules for Inverses and Transposes

These rules will be helpful for simplifying expressions. Treat \(A\), \(B\), \(C\), and \(D\) as matrices below (conformable, and invertible where needed), and \(s\) as a scalar. A quick numerical check of two of these rules follows the list.

  • \((A + B)^T = A^T + B^T\)
  • \((s A)^T\) \(= s A^T\)
  • \((AB)^T = B^T A^T\)
  • \((A^T)^T = A\) and \(( A^{-1})^{-1} = A\)
  • \((A^T)^{-1} = (A^{-1})^T\)
  • \((AB)^{-1} = B^{-1} A^{-1}\)
  • \((ABCD)^{-1} = D^{-1} C^{-1} B^{-1} A^{-1}\)
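For instance, here is a hedged sketch verifying the transpose and inverse rules for products, again with two made-up \(2 \times 2\) matrices:

## Check (AB)^T = B^T A^T and (AB)^(-1) = B^(-1) A^(-1) numerically
B1 <- rbind(c(1, 2), c(3, 4))
B2 <- rbind(c(0, 1), c(1, 0))
all.equal(t(B1 %*% B2), t(B2) %*% t(B1))
[1] TRUE
all.equal(solve(B1 %*% B2), solve(B2) %*% solve(B1))
[1] TRUE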

3.5.4 Derivatives with Matrices and Vectors

Let’s say we have a \(p \times 1\) “column” vector \(\mathbf{x}\) and another \(p \times 1\) vector \(\mathbf{a}\).

We want to take the derivative with respect to the vector \(\mathbf{x}\).

Let’s say we have \(y = \mathbf{x}'\mathbf{a}\). The vector of partial derivatives of \(y\) with respect to each element of \(\mathbf{x}\) is called the gradient.

  • \(\frac{\delta y}{\delta x} = \begin{pmatrix}\frac{\delta y}{\delta x_1} \\ \frac{\delta y}{\delta x_2} \\ \vdots \\ \frac{\delta y}{\delta x_p} \end{pmatrix}\)

  • \(y\) will have dimensions \(1 \times 1\). \(y\) is a scalar.

    • Note: \(y = a_1x_1 + a_2x_2 + \dots + a_px_p\). From this expression, we can take a set of “partial derivatives”:
    • \(\frac{\delta y}{\delta x_1} = a_1\)
    • \(\frac{\delta y}{\delta x_2} = a_2\), and so on
    • \(\frac{\delta y}{\delta x} = \begin{pmatrix}\frac{\delta y}{\delta x_1} \\ \frac{\delta y}{\delta x_2} \\ \vdots \\ \frac{\delta y}{\delta x_p} \end{pmatrix} = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{pmatrix}\)
  • Well, this is just the vector \(\mathbf{a}\).

Answer: \(\frac{\delta }{\delta x} \mathbf{x}^T\mathbf{a} = \mathbf{a}\). We can apply this general rule in other situations.
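A quick numerical sketch of this rule in R, with made-up values for \(\mathbf{a}\) and \(\mathbf{x}\), comparing a finite-difference gradient to \(\mathbf{a}\):

## Finite-difference check that the gradient of y = x'a is the vector a
a <- c(2, -1, 4)
f <- function(x) sum(x * a)       # same as t(x) %*% a, a scalar
x0 <- c(1, 1, 1)
eps <- 1e-6
num_grad <- sapply(1:3, function(i) {
  e <- rep(0, 3); e[i] <- eps     # perturb one element at a time
  (f(x0 + e) - f(x0)) / eps
})
num_grad                          # approximately c(2, -1, 4), i.e., the vector a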

Example 2

Let’s say we want to differentiate the following, where the vector \(\mathbf{y}\) is \(n \times 1\), \(X\) is \(n \times k\), and \(\mathbf{b}\) is \(k \times 1\). Take the derivative with respect to \(\mathbf{b}\).

  • \(\mathbf{y}'\mathbf{y} - 2\mathbf{b}'X'\mathbf{y}\)
  • Note that the dimensions of the output are \(1 \times 1\), a scalar quantity.

Remember the derivative of a sum is the sum of derivatives. This allows us to focus on particular terms.

  • The first term has no \(\mathbf{b}\) in it, so it contributes 0.
  • The second term is \(-2\mathbf{b}'X'\mathbf{y}\). We can think about it like the previous example, with \(X'\mathbf{y}\) playing the role of \(\mathbf{a}\).
    • \(\frac{\delta }{\delta \mathbf{b}}\, \mathbf{b}'X'\mathbf{y} = \begin{pmatrix} \frac{\delta }{\delta b_1}\mathbf{b}'X'\mathbf{y}\\ \frac{\delta }{\delta b_2}\mathbf{b}'X'\mathbf{y} \\ \vdots \\ \frac{\delta }{\delta b_k}\mathbf{b}'X'\mathbf{y} \end{pmatrix} = X'\mathbf{y}\)
    • The output needs to be \(k \times 1\) like \(\mathbf{b}\), which is what \(X'\mathbf{y}\) is.
  • Reattaching the \(-2\), the derivative of the full expression is \(-2X'\mathbf{y}\), as the quick check below illustrates.
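Here is a hedged numerical sketch of this result in R, with made-up \(X\), \(\mathbf{y}\), and \(\mathbf{b}\):

## Compare a finite-difference gradient of y'y - 2b'X'y to -2X'y
set.seed(1)
X <- matrix(rnorm(12), nrow = 4, ncol = 3)   # n = 4, k = 3
y <- rnorm(4)
f <- function(b) sum(y * y) - 2 * sum(b * (t(X) %*% y))
b0 <- c(1, 2, 3)
eps <- 1e-6
num_grad <- sapply(1:3, function(i) {
  e <- rep(0, 3); e[i] <- eps
  (f(b0 + e) - f(b0)) / eps
})
cbind(num_grad, analytic = -2 * t(X) %*% y)  # the two columns should match closely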

Example 3

Another useful rule when a matrix \(A\) is symmetric: \(\frac{\delta}{\delta \mathbf{x}} \mathbf{x}^TA\mathbf{x} = (A + A^T)\mathbf{x} = 2A\mathbf{x}\).

Here are the details on getting to this result. We are treating the vector \(\mathbf{x}\) as \(n \times 1\) and the matrix \(A\) as symmetric.

When we take \(\frac{\delta}{\delta \mathbf{x}}\) (the derivative with respect to \(\mathbf{x}\)), we will be looking for a result with the same dimensions as \(\mathbf{x}\).

\(\frac{\delta }{\delta \mathbf{x}} = \begin{pmatrix}\frac{\delta }{\delta x_1} \\ \frac{\delta }{\delta x_2} \\ \vdots \\ \frac{\delta }{\delta x_n} \end{pmatrix}\)

Let’s inspect the dimensions of \(\mathbf{x}^TA\mathbf{x}\). They are \(1 \times 1\). If we perform this matrix multiplication, we would be multiplying:

\(\begin{pmatrix} x_1 & x_2 & \ldots & x_n \end{pmatrix} \times \begin{pmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{pmatrix} \times \begin{pmatrix}x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}\)

To simplify things, let’s say we have the following matrices, where \(A\) is symmetric:

\(\begin{pmatrix} x_1 & x_2 \end{pmatrix} \times \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \times \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\)

We can perform the matrix multiplication for the first two quantities, which will result in a \(1 \times 2\) vector. Recall that in matrix multiplication we take the sum of the element-wise multiplication of the \(i\)th row of the first object and the \(j\)th column of the second object. This means we multiply the (single) row of \(\mathbf{x}^T\) by the first column of \(A\) for the entry in cell \(i=1, j=1\), and so on.

\(\begin{pmatrix} (x_1a_{11} + x_2a_{21}) & (x_1a_{12} + x_2a_{22}) \end{pmatrix}\)

We can then multiply this quantity by the remaining \(\mathbf{x}\) vector:

\(\begin{pmatrix} (x_1a_{11} + x_2a_{21}) & (x_1a_{12} + x_2a_{22}) \end{pmatrix} \times \begin{pmatrix}x_1 \\ x_2 \end{pmatrix}\)

This results in the \(1 \times 1\) quantity: \((x_1a_{11} + x_2a_{21})x_1 + (x_1a_{12} + x_2a_{22})x_2 = x_1^2a_{11} + x_2a_{21}x_1 + x_1a_{12}x_2 + x_2^2a_{22}\)

We can now take the derivatives with respect to \(\mathbf{x}\). Because \(\mathbf{x}\) is \(2 \times 1\), our derivative will be a vector of the same dimensions with components:

\(\frac{\delta }{\delta \mathbf{x}} = \begin{pmatrix}\frac{\delta }{\delta x_1} \\ \frac{\delta }{\delta x_2} \end{pmatrix}\)

These represent the partial derivatives with respect to each component of \(\mathbf{x}\).

Let’s focus on the first: \(\frac{\delta }{\delta x_1}\).

\[\begin{align*} \frac{\delta }{\delta x_1} \left(x_1^2a_{11} + x_2a_{21}x_1 + x_1a_{12}x_2 + x_2^2a_{22}\right) &= 2x_1a_{11} + x_2a_{21} + a_{12}x_2 \end{align*}\]

We can repeat this for \(\frac{\delta }{\delta x_2}\):

\[\begin{align*} \frac{\delta }{\delta x_2} \left(x_1^2a_{11} + x_2a_{21}x_1 + x_1a_{12}x_2 + x_2^2a_{22}\right) &= a_{21}x_1 + x_1a_{12} + 2x_2a_{22} \end{align*}\]

Now we can put the result back into our vector format:

\(\begin{pmatrix}\frac{\delta }{\delta x_1} \\ \frac{\delta }{\delta x_2}\end{pmatrix} = \begin{pmatrix} 2x_1a_{11} + x_2a_{21} + a_{12}x_2\\ a_{21}x_1 + x_1a_{12} + 2x_2a_{22}\end{pmatrix}\)

Now it’s just about simplifying to show that we have indeed come back to the rule.

Recall that for a symmetric matrix, the element in row \(i\), column \(j\) equals the element in row \(j\), column \(i\): \(a_{ij} = a_{ji}\). This allows us to substitute \(a_{21} = a_{12}\) and combine terms (e.g., \(x_2a_{21} + a_{12}x_2 = 2a_{12}x_2\)):

\(\begin{pmatrix}\frac{\delta }{\delta x_1} \\ \frac{\delta }{\delta x_2}\end{pmatrix} = \begin{pmatrix} 2x_1a_{11} + 2a_{12}x_2\\ 2a_{21}x_1 + 2x_2a_{22}\end{pmatrix}\)

Next, we can bring the 2 out front.

\(\begin{pmatrix}\frac{\delta }{\delta x_1} \\ \frac{\delta }{\delta x_2} \end{pmatrix} = 2 * \begin{pmatrix} x_1a_{11} + a_{12}x_2\\ a_{21}x_1 + x_2a_{22}\end{pmatrix}\)

Finally, let’s inspect this result and show it is equivalent to the multiplication of the \(2 \times 2\) matrix \(A\) by the \(2 \times 1\) vector \(\mathbf{x}\). Minor note: Because any individual element of a vector or matrix is just a single number, we can change the order within a product of elements (e.g., \(a_{11}*x_1\) vs. \(x_1*a_{11}\)). We just can’t do that for full vectors or matrices.

\(2A\mathbf{x} = 2* \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \times \begin{pmatrix}x_1 \\ x_2 \end{pmatrix} = 2 * \begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix}\)

The last quantity is the same as the previous step. That’s the rule!
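As a hedged numerical sketch in R (with a made-up symmetric matrix), we can compare a finite-difference gradient of \(\mathbf{x}^TA\mathbf{x}\) to \(2A\mathbf{x}\):

## Finite-difference check that the gradient of x'Ax is 2Ax for symmetric A
A_sym <- rbind(c(1, 6), c(6, 4))             # a symmetric 2 x 2 matrix
f <- function(x) as.numeric(t(x) %*% A_sym %*% x)
x0 <- c(1, 2)
eps <- 1e-6
num_grad <- sapply(1:2, function(i) {
  e <- rep(0, 2); e[i] <- eps
  (f(x0 + e) - f(x0)) / eps
})
cbind(num_grad, analytic = 2 * A_sym %*% x0) # the two columns should match closely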

Applying the rules

  • This has a nice analogue to a scalar derivative we’ve seen before: \(\frac{d}{dq}q^2 = 2q\).
  • Let’s say we want to take the derivative of \(\mathbf{b}'X'X\mathbf{b}\) with respect to \(\mathbf{b}\).
  • We can think of \(X'X\) as playing the role of \(A\) (note that \(X'X\) is symmetric, so the rule applies).
    • This gives us \(2X'X\mathbf{b}\) as the result.

Why on earth would we care about this? For one, it helps us understand how we get to our estimates for \(\hat \beta\) in linear regression. When we have multiple variables, we don’t just want the best estimate for one coefficient, but a vector of coefficients. We will see more of this later in the course.
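To preview how the pieces fit together, here is a brief sketch (using the standard least squares objective, which this section does not derive in full): expanding the sum of squared residuals produces exactly the terms from Examples 2 and 3, and setting the gradient to zero gives the familiar estimator. (Here we use the fact that \(\mathbf{y}'X\mathbf{b}\) is a scalar and therefore equals its transpose \(\mathbf{b}'X'\mathbf{y}\).)

\[\begin{align*} S(\mathbf{b}) &= (\mathbf{y} - X\mathbf{b})'(\mathbf{y} - X\mathbf{b}) = \mathbf{y}'\mathbf{y} - 2\mathbf{b}'X'\mathbf{y} + \mathbf{b}'X'X\mathbf{b}\\ \frac{\delta S}{\delta \mathbf{b}} &= -2X'\mathbf{y} + 2X'X\mathbf{b} = 0 \quad \Rightarrow \quad \hat \beta = (X'X)^{-1}X'\mathbf{y} \end{align*}\]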

In MLE, we will find the gradient of the log likelihood function. We will also go on to the second derivatives to arrive at what is called the Hessian. More on that later.