Using Rcpp to decrease memory usage and increase throughput.
In this post we will add optimizations to the Matrix::sparse.model.matrix
function to increase throughput and decrease memory usage. The optimizations can allow users to create sparse model matrices with interactions 5x faster.
The sparse.model.matrix
function creates a model matrix that is sparse. This is especially important when there are categorical variables in the data, which will be one-hot encoded and are, by definition, sparse. In addition, interactions with categorical variables are also inherently sparse. Since one-hot encoding generates multiple columns in the feature matrix, we want to make sure storage costs are minimized.
Let’s start with a data.frame that has 3 categorical variables, each has 10 unique values.
n = 1e5
data = data.frame(
x1 = sample(LETTERS[1:10], n, replace = TRUE),
x2 = sample(LETTERS[1:10], n, replace = TRUE),
x3 = sample(LETTERS[1:10], n, replace = TRUE)
)
data %>% head()
x1 x2 x3
1 H D B
2 C H H
3 J G B
4 F I E
5 D E F
6 A F D
A model matrix with all interactions has 1000 columns! However, approximately 8 of those columns are nonzero. The storage cost for such a matrix should be small. Yet, from the RStudio profiler, the memory cost for calling sparse.model.matrix
is roughly 700 MB. That is quite large. The bulk of the allocations come from the function Matrix:::sparse2int
. Below, we will make a copy of the function and add some optimizations.
X1 = sparse.model.matrix(~ x1*x2*x3, data)
dim(X1)
[1] 100000 1000
The role of the Matrix:::sparse2int
function is to compute interactions. It receives two blocks of features, \(X\) and \(Y\). These features are transpose of the original features in the dataframe, so columns in \(X\) and \(Y\) represent observations and rows represent features. Say \(x_1\) is a feature from the \(X\) block, and \(y_1\) is feature from the \(Y\) block. The interaction is simply the product \(x_1 \cdot y_1\). So, for every observation, we want to compute all pairwise products of the features in \(X\) and \(Y\).
This feels like a kronecker product: get feature \(y_1\) for every observation and multiply it with everything in \(X\); repeat for all features in \(Y\). In fact, this operation can be expressed as a kronecker product.
The kronecker product kronecker(Y, rep(1, nrow(X)))
effectively stretches rows in Y, so that they have the same shape as the X matrix, and can be multipled elementwise. However, it’s not necessary to physically stretch the matrices, we should keep the features compact and recycle instead of stretching them.
Below is an example for how to do this using RcppEigen.
Eigen::SparseMatrix<double> sparse_sparse_interaction_kronecker(
Eigen::SparseMatrix<double> X,
Eigen::SparseMatrix<double> Y
) {
SparseMatrix<double, RowMajor> output(X.rows() * Y.rows(), X.cols());
SparseMatrix<double> Y_transpose = Y.transpose();
for (int k=0; k<Y_transpose.outerSize(); ++k) {
SparseMatrix<double> X_scaled = X * VectorXd(Y_transpose.col(k)).asDiagonal();
X_scaled.prune(0.0);
output.middleRows(k * X.rows(), X.rows()) = X_scaled;
}
return output;
}
We’ll introduce a new function, RcppSparse2int
, which implements the interaction kronecker from Rcpp. This is largely a copy of the original Matrix::sparse2int. Next, we will fork a new function: RcppSparse.model.matrix
that will invoke RcppSparse2int
.
#' Faster version of sparse2int.r
#'
#' For most cases, we want to compute
#' kronecker(rep(1, ny), X) * kronecker(Y, rep(1, nx)).
#' However, we should do this without materializing the kronecker.
#' This represents a major speedup and reduction in memory
RcppSparse2int <- function(X, Y, do.names = TRUE, forceSparse = FALSE, verbose = FALSE) {
if (do.names) {
dnx <- dimnames(X)
dny <- dimnames(Y)
}
dimnames(Y) <- dimnames(X) <- list(NULL, NULL)
nx <- nrow(X)
ny <- nrow(Y)
r <-
if ((nX <- is.numeric(X)) | (nY <- is.numeric(Y))) {
# At least one was dense
if (nX) {
# X is dense
if (nY) {
# Y is also dense
dense_dense_interaction_kronecker(X, Y)
} else if (nx > 1) {
# X has more than 1 row and is dense.
# Y is sparse.
dense_sparse_interaction_kronecker(X, Y)
}
# if (nY || nx > 1) { # both numeric, or X >=2 "columns"
# F <- if (forceSparse) function(m) .Call(Matrix:::dense_to_Csparse, m) else identity
# # F((if(ny == 1) X else X[rep.int(seq_len(nx), ny) , ]) *
# # (if(nx == 1) Y else Y[rep (seq_len(ny),each=nx), ]))
#
# }
else { ## numeric X (1 "column"), sparseMatrix Y
r <- Y
dp <- Y@p[-1] - Y@p[-(Y@Dim[2] + 1L)]
## stopifnot(all(dp %in% 0:1)) # just for now
## if(nx == 1)
## FIXME: similar trick would be applicable for nx > 2
r@x <- X[dp == 1L] * Y@x
r
}
}
else { ## sparseMatrix X, dense Y
if (ny == 1) {
## FIXME: similar trick would be applicable for ny > 2
r <- X
dp <- X@p[-1] - X@p[-(X@Dim[2] + 1L)]
## stopifnot(all(dp %in% 0:1)) # just for now - drop! - FIXME
r@x <- Y[dp == 1L] * X@x
r
}
else { ## ny > 1 -- *larger* matrix
# X is sparse, Y is dense
sparse_dense_interaction_kronecker(X, Y)
}
}
}
else { ## X & Y are both sparseMatrix
sparse_sparse_interaction_kronecker(X, Y)
}
if (verbose) {
cat(sprintf(
" sp..2int(%s[%d],%s[%d]) ",
if (nX) "<N>" else "<sparse>", nx,
if (nY) "<N>" else "<sparse>", ny
))
}
if (do.names) {
## FIXME: This names business needs a good solution..
## but maybe "up in the caller"
if (!is.null(dim(r)) &&
!is.null(nX <- dnx[[1]]) &&
!is.null(nY <- dny[[1]])) {
rownames(r) <- outer(nX, nY, paste, sep = ":")
}
}
r
}
#' Faster version of sparseInt.r
#'
#' This version invokes RcppSparse2int instead of sparse2int,
#' which in tern invokes Rcpp code to do fast matrix operations.
RcppSparseInt.r <- function(rList, do.names = TRUE, forceSparse = FALSE, verbose = FALSE) {
nl <- length(rList)
if (forceSparse) {
F <- function(m) if (is.matrix(m)) .Call(Matrix:::dense_to_Csparse, m) else m
}
if (verbose) {
cat("RcppSparseInt.r(<list>[1:", nl, "], f.Sp=", forceSparse, "): is.mat()= (",
paste(symnum(vapply(rList, is.matrix, NA)), collapse = ""),
")\n",
sep = ""
)
}
if (nl == 1) {
if (forceSparse) F(rList[[1]]) else rList[[1]]
} else {
## 'recursion' free:
r <- rList[[1]]
for (j in 2:nl)
r <- RcppSparse2int(r, rList[[j]],
do.names = do.names, verbose = verbose
)
if (forceSparse) F(r) else r
}
}
#' A faster version of model.spmatrix
#'
#' This version calls RcppSparseInt.r instead of sparseint.r.
#' It also uses RcppSparseMatrixRbindList to bind multiple matrices vertically.
RcppModel.spmatrix <- function(trms, mf, transpose=FALSE,
drop.unused.levels = FALSE, row.names=TRUE, sep="", verbose=FALSE) {
## Author: Martin Maechler, Date: 7 Jul 2009
## mf is a model frame or a "simple" data.frame [after reorder !]
stopifnot(is.data.frame(mf))
n <- nrow(mf)
if(row.names)
rnames <- row.names(mf)
## mf: make into list, dropping all attributes (but the names)
### FIXME: for poly(., 5) mf has a 5-column matrix as "one column" => looses names here
fnames <- names(mf <- unclass(mf))
attributes(mf) <- list(names = fnames)
if(length(factorPattern <- attr(trms, "factors"))) {
d <- dim(factorPattern)
nVar <- d[1]
nTrm <- d[2]
n.fP <- dimnames(factorPattern)
fnames <- n.fP[[1]] # == names of variables {incl. "F(var)"} in the model
Names <- n.fP[[2]] # == colnames == names of terms: "a", "b:c", ...
} else { ## degenerate, e.g. 'Y ~ 1'
nVar <- nTrm <- 0L
fnames <- Names <- character(0)
}
## all the "variables in the model" are also in "mf", including "sin(x)";
## actually, ..../src/main/model.c even assumes
stopifnot((m <- length(mf)) >= nVar)
if(verbose)
cat(sprintf("model.spm..(): (n=%d, nVar=%d (m=%d), nTrm=%d)\n",
n, nVar,m, nTrm))
if(m > nVar) mf <- mf[seq_len(nVar)]
stopifnot(fnames == names(mf), allow.logical0 = TRUE)
noVar <- nVar == 0
##>> this seems wrong; we use 1:nVar for indexing mf[] below ..
##>> if(noVar) nVar <- 1L # (as in ~/R/D/r-devel/R/src/main/model.c)
## Note: "character" variables have been changed to factor in the caller;
## hence: both factor and *logical* should be dealt as factor :
is.f <- if(noVar) logical(0) else vapply(mf, function(.)
is.factor(.) | is.logical(.), NA)
indF <- which(is.f)
if(verbose) { cat(" --> indF =\n"); print(indF) }
hasInt <- attr(trms, "intercept") == 1
## the degree of interaction:
## intOrder <- attr(trms, "order")
##
if(!hasInt && length(indF)) {
## change the '1' of the first factor into a '2' :
if(any(i1 <- factorPattern[indF, ] == 1))
## replace at the first '1' location:
factorPattern[indF,][which.max(i1)] <- 2L
else {}
## nothing to do
}
## Convert "factors" to "Rowwise- sparseMatrix ("dummy"-matrix) -----------
## Result: a list of sparse model matrices for the "factor"s :
f.matr <- structure(vector("list", length = length(indF)),
names = fnames[indF])
i.f <- 0
## ---- For each variable in the model -------------------
for(i in seq_len(nVar)) {
nam <- fnames[i]
f <- mf[[i]]
if(is.f[i]) {
fp <- factorPattern[i,] ## == factorPattern[nam,]
contr <- attr(f, "contrasts")
f.matr[[(i.f <- i.f + 1)]] <- # a list of 2
lapply(fac2Sparse(f, to = "d",
drop.unused.levels=drop.unused.levels,
factorPatt12 = 1:2 %in% fp,
contrasts.arg = contr),
function(s) {
if(is.null(s)) return(s)
## else
rownames(s) <- ## for some contr.*(), have lost rownames; hmm..
paste(nam, rownames(s) %||% seq_len(nrow(s)), sep=sep)
s
})
} else { ## continuous variable --> "matrix" - for all of them
if(any(iA <- (cl <- class(f)) == "AsIs")) # drop "AsIs" class
class(f) <- if(length(cl) > 1L) cl[!iA]
nr <- if(is.matrix(f)) nrow(f <- t(f)) else (dim(f) <- c(1L, length(f)))[1]
if(is.null(rownames(f)))
rownames(f) <- if(nr == 1) nam else paste(nam, seq_len(nr), sep=sep)
mf[[i]] <- f
}
}
if(verbose) {
cat(" ---> f.matr list :\n")
str(f.matr, max = as.integer(verbose))
fNms <- format(dQuote(Names))
dim.string <- gsub('5', as.character(floor(1+log10(n))),
" -- concatenating (r, rj): dim = (%5d,%5d) | (%5d,%5d)\n")
}
## FIXME: do all this in C --
getR <- function(N) # using 'nm'
if(!is.null(r <- f.matr[[N]])) r[[factorPattern[N, nm]]] else mf[[N]]
vNms <- "(Intercept)"[hasInt]
counts <- integer(nTrm)
r <-
if(hasInt) ## column of 1's - as sparse
new("dgCMatrix", i = 0:(n-1L), p = c(0L, n),
Dim = c(n, 1L), x = rep.int(1, n))
else new("dgCMatrix", Dim = c(n, 0L))
if(transpose) r <- t(r)
iTrm <- seq_len(nTrm)
r.list <- list(r)
for(j in iTrm) { ## j-th term
nm <- Names[j]
if(verbose) cat(sprintf("term[%2d] %s .. ", j, fNms[j]))
nmSplits <- strsplit(nm, ":", fixed=TRUE)[[1]]
## NOTA BENE: This can be very slow when many terms are involved
## FIXME ??? why does it use *much* memory in those cases ??
# rj <- sparseInt.r(lapply(nmSplits, getR), do.names=TRUE,
# forceSparse = TRUE, verbose=verbose)# or just (verbose >= 2))
rj <- RcppSparseInt.r(lapply(nmSplits, getR),
do.names = TRUE,
forceSparse = TRUE, verbose = verbose
)
r.list[[j + 1]] <- rj
if(verbose) cat(sprintf(dim.string, nrow(r), ncol(r), nrow(rj),ncol(rj)))
## fast version of cbind2() / rbind2(), w/o checks, dimnames, etc
# r <- if(transpose) .Call(Csparse_vertcat, r, rj)
# else .Call(Csparse_horzcat, r, t(rj))
## if(verbose) cat(" [Ok]\n")
vNms <- c(vNms, dimnames(rj)[[1]])
counts[j] <- nrow(rj)
}
r <- RcppSparseMatrixRbindList(r.list) %>% t()
rns <- if(row.names) rnames
dimnames(r) <- if (transpose) list(rns, vNms) else list(vNms, rns)
attr(r, "assign") <- c(if(hasInt) 0L, rep(iTrm, counts))
r
}
#' Sparse Model Matrix
#'
#' Rcpp replacement for sparse.model.matrix.
#' This function replaces the call to model.spmatrix with RcppModel.spmatrix
RcppSparse.model.matrix <-
function(object, data = environment(object), contrasts.arg = NULL,
xlev = NULL, transpose = TRUE,
drop.unused.levels = FALSE, row.names = TRUE
, sep = ""
, verbose = FALSE, ...)
{
t <- if(missing(data)) terms(object) else terms(object, data=data)
if (is.null(attr(data, "terms")))
data <- model.frame(object, data, xlev=xlev)
else {
reorder <- match(sapply(attr(t,"variables"),deparse,
width.cutoff=500)[-1L],
names(data))
if (anyNA(reorder))
stop("model frame and formula mismatch in model.matrix()")
if(!isSeq(reorder, ncol(data), Ostart=FALSE))
data <- data[,reorder, drop=FALSE]
}
int <- attr(t, "response")
if(length(data)) { # otherwise no rhs terms, so skip all this
contr.funs <- as.character(getOption("contrasts"))
namD <- names(data)
## turn any character columns into factors
for(i in namD)
if(is.character(data[[i]]))
data[[i]] <- factor(data[[i]])
isF <- vapply(data, function(x) is.factor(x) || is.logical(x), NA)
isF[int] <- FALSE
isOF <- vapply(data, is.ordered, NA)
for(nn in namD[isF]) # drop response
if(is.null(attr(data[[nn]], "contrasts")))
contrasts(data[[nn]]) <- contr.funs[1 + isOF[nn]]
## it might be safer to have numerical contrasts:
## get(contr.funs[1 + isOF[nn]])(nlevels(data[[nn]]))
if (!is.null(contrasts.arg) && is.list(contrasts.arg)) {
if (is.null(namC <- names(contrasts.arg)))
stop("invalid 'contrasts.arg' argument")
for (nn in namC) {
if (is.na(ni <- match(nn, namD)))
warning(gettextf("variable '%s' is absent, its contrast will be ignored", nn),
domain = NA)
else {
ca <- contrasts.arg[[nn]]
## FIXME: work for *sparse* ca
if(is.matrix(ca)) contrasts(data[[ni]], ncol(ca)) <- ca
else contrasts(data[[ni]]) <- contrasts.arg[[nn]]
}
}
}
} else { # internal model.matrix needs some variable
isF <- FALSE
data <- cbind(data, x = 0)
}
## <Sparse> src/library/stats/R/models.R has
## ans <- .Internal(model.matrix(t, data))
if(verbose) {
cat("RcppModel.spmatrix(t, data, ..) with t =\n"); str(t,give.attr=FALSE) }
ans <- RcppModel.spmatrix(t, data, transpose=transpose,
## ==============
drop.unused.levels=drop.unused.levels,
row.names=row.names, sep=sep, verbose=verbose)
## </Sparse>
attr(ans, "contrasts") <-
lapply(data[isF], function(x) attr(x, "contrasts"))
ans
} ## {sparse.model.matrix}
Below we benchmark the results for accuracy, and for performance. The new Rcpp interaction kronecker produces the correct answer, and needs only 20% of the time.
X1 = sparse.model.matrix(~ x1*x2*x3, data)
X2 = RcppSparse.model.matrix(~ x1*x2*x3, data)
identical(X1, X2)
[1] TRUE
system.time(sparse.model.matrix(~ x1*x2*x3, data, row.names = FALSE))
user system elapsed
2.034 0.384 2.423
system.time(RcppSparse.model.matrix(~ x1*x2*x3, data, row.names = FALSE))
user system elapsed
0.451 0.040 0.492
For attribution, please cite this work as
Wong (2021, April 10). Jeffrey C. Wong: Sparse Model Matrix Optimizations in C++. Retrieved from http://jeffreycwong.com/posts/2021-04-10-sparse-model-matrix-optimizations-in-c/
BibTeX citation
@misc{wong2021sparse, author = {Wong, Jeffrey C.}, title = {Jeffrey C. Wong: Sparse Model Matrix Optimizations in C++}, url = {http://jeffreycwong.com/posts/2021-04-10-sparse-model-matrix-optimizations-in-c/}, year = {2021} }