This is an implementation of different algorithms for matrix completion. Given a partially known matrix, we want to guess the remaining entries.
This can be used for recommendation systems, by predicting user ratings. It was partially inspired by the Netflix Prize. In this case, we have a matrix of ratings (a_ij) of user i for movie j, of which only some are given. Computing the others gives good hints: just recommend to user i the best-rated movies j they have not yet watched.
There is no theory behind the SVD algorithm, but the idea is rather straightforward: we want the best rank-k approximation. We fill the unknown entries with some arbitrary but constant value (such as the mean of the given entries) and compute the singular value decomposition. Keeping only the singular vectors belonging to the k largest singular values gives the best rank-k approximation; we overwrite the unknown entries with its values, keep the given ones, and iterate these steps to fill the matrix with appropriate values. There are no guarantees, but it seems to work.
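For concreteness, here is a minimal sketch of that iteration in R. It is not the package's alg.svd; the fixed iteration count is my own choice and a real implementation would use a convergence criterion.

# Minimal sketch of the iterated-SVD idea above (not the package's alg.svd).
# `a` is a matrix with NA for unknown entries, `k` the target rank.
svd.complete.sketch <- function(a, k, iterations = 50) {
  known <- !is.na(a)
  x <- a
  x[!known] <- mean(a[known])                          # start with a constant fill value
  for (it in seq_len(iterations)) {
    s <- svd(x, nu = k, nv = k)                        # truncated SVD of the current fill
    approx <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)  # best rank-k approximation
    x[!known] <- approx[!known]                        # keep the given entries, update the rest
  }
  x
}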
The theory behind Hazan's algorithm is explained in Giesen et al. Surprisingly, the actual algorithm is much simpler than the math behind it.
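For intuition only, here is a rough Frank-Wolfe-style sketch of the kind of rank-one update such an algorithm performs. It is not the package's Hazan implementation; the trace-norm bound t, the step size, and the iteration count are assumptions of mine.

# Frank-Wolfe-style sketch in the spirit of Hazan's algorithm (not the package's code).
# `a` is a matrix with NA for unknown entries, `t` an assumed bound on the trace norm.
hazan.sketch <- function(a, t = 100, iterations = 200) {
  known <- !is.na(a)
  x <- matrix(0, nrow(a), ncol(a))
  for (it in seq_len(iterations)) {
    g <- matrix(0, nrow(a), ncol(a))
    g[known] <- x[known] - a[known]           # gradient of 0.5 * squared error on known entries
    s <- svd(-g, nu = 1, nv = 1)              # best rank-one direction in the trace-norm ball
    step <- 2 / (it + 1)                      # standard Frank-Wolfe step size
    x <- (1 - step) * x + step * t * (s$u %*% t(s$v))
  }
  x
}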
The data should be provided in CSV format, each line encoding one matrix entry:
i,j,a_ij
To evaluate, just provide another data set with only i,j lines; you will then get the computed a_ij as output. Read in the data with read.data. You can also use read.all to read both files at once. If your data is in another format, or is already read into R, don't worry: just do what read.data does and there will be no problems. If many columns or rows of your matrix are unused, use reduce.data to 'compress' it (and expand.data later to decompress); read.all also does this.
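If you need to roll your own reader, something along these lines should do. This is only a guess at the kind of structure read.data produces (a plain matrix with NA for the unknown entries), not its actual code.

# Hypothetical stand-in for read.data: reads "i,j,a_ij" lines into a matrix
# with NA for the unknown entries.
read.sketch <- function(file) {
  d <- read.csv(file, header = FALSE, col.names = c("i", "j", "a"))
  m <- matrix(NA_real_, max(d$i), max(d$j))
  m[cbind(d$i, d$j)] <- d$a                   # fill by (row, column) index pairs
  m
}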
To use the algorithms, first set their parameters. There are functions alg.*.pl supporting you here: they produce 'parameter lists', which can then be passed on to the algorithm. The algorithms return the completed matrix, which you might save, or evaluate on the test set with eval.
For parameter tuning, validation, etc., there is a built-in cross-validation routine (it is in fact a 'cross evaluation'). The results can be analyzed afterwards, for example with error.data. There is also a neat little tool which I called bisection; it can be used to tweak parameters. It takes a long time though, and some parameters are just as easily 'guessed'.
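If you would rather script a quick check yourself, a plain hold-out evaluation of the SVD algorithm might look roughly like this. It does not use the package's built-in routine; it assumes alg.r has been sourced, that the training data is an ordinary matrix with NA for unknown entries, and the held-out fraction is my own choice.

# Generic hold-out evaluation sketch (not the package's cross-validation routine).
# Hides a fraction of the known entries, runs the SVD algorithm, and reports the
# RMSE on the hidden entries. Assumes `a` is a matrix with NA for unknown entries.
holdout.rmse <- function(a, k = 20, frac = 0.1) {
  known <- which(!is.na(a))
  test <- sample(known, round(frac * length(known)))
  train <- a
  train[test] <- NA                           # hide the test entries
  mat <- alg.svd(train, alg.svd.pl(train, k = k))
  sqrt(mean((mat[test] - a[test])^2))         # root mean squared error on hidden entries
}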
On real-world data, we got a root mean squared error of about 0.4 to 0.5 on a rating scale from 1 to 5 (about 24% of the entries were given after compression). The best results came from combining the output of both algorithms; the unsophisticated SVD algorithm in particular showed surprisingly good results.
> source("data.r")
> source("alg.r")
> data<-read.all("train.csv", "qualifying.csv")
> params<-alg.svd.pl(data$train, k=20)
> mat<-alg.svd(data$train, params)
[1] 0.4468885 0.4513283 0.4576820 0.4675882 0.4852974 0.5279047 0.8372528 4.0000000 4.0812162
> write.res(data, mat, "result.csv")
The output is the (reversed) error history on the training data set. Thus we needed 7 iterations in this case.