Recall our projection equations
$$x = \frac{X}{Z}, \quad y = \frac{Y}{Z}.$$
However, because we have division, this is not linear. How can we represent it like
$$\mathbf{x}_c = R\mathbf{x}_w + \mathbf{t}?$$
This brings us to homogeneous coordinates, whereby 2D points are actually 3D points and 3D points are actually 4D points. For example, we can write
$$(x,y) = (x,y,1) = (kx, ky, k).$$
This allows us to write the transformation above as
$$\begin{bmatrix} R\mathbf{x}^w + \mathbf{t} \\ 1 \end{bmatrix} = \begin{bmatrix} R & \mathbf{t} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} \mathbf{x}^w \\ 1 \end{bmatrix},$$
which makes successive or inverse transformations easy.
As an example, if the camera coordinate system is the same as the world system but with the $y$ coordinate increased by $h$, then the extrinsic matrix of the camera is
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & -h \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
Here, $M$ (or $R$) is the identity matrix because the linear transformation involved is the identity (nothing happens). However, $t$ is not the identity; instead the $y$ coordinate should be $-h$ because in the camera the same point would have to have a $y$ coordinate reduced by $h$. Finally, the bottom right element should be $1$.