Some term-document matrix data we generated and used in our papers
-
MEDLINE2500:
A 22095 x 1250 term-document matrix for training, and
another 22095 x 1250 term-document matrix for testing,
each with 5 clusters based on Medline data.
The files are in a sparse format in which each matrix entry is represented by
its column index, row index, and weight. The five clusters start at column indices
1, 251, 501, 751, and 1001, respectively. A more detailed description can be found in the paper
"Text classification using support vector machines with dimension reduction"
(with H. Kim and P. Howland)
Paper in ps
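
The sparse format and the cluster boundaries described above can be read with a few lines of Python. This is only a minimal sketch under stated assumptions: the three fields are assumed to be whitespace-separated on one line per entry, the indices are assumed to be 1-based (consistent with the first cluster starting at column 1), and the names parse_triplets and cluster_of and the sample entries are hypothetical, not part of the distributed files.

```python
from bisect import bisect_right

# Cluster start columns as listed for MEDLINE2500 (1-based column indices).
MEDLINE_STARTS = [1, 251, 501, 751, 1001]

def parse_triplets(lines):
    """Parse 'column row weight' lines into {column: {row: weight}}.

    Assumption: fields are separated by whitespace, one matrix entry per line.
    """
    docs = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        col, row, weight = int(parts[0]), int(parts[1]), float(parts[2])
        docs.setdefault(col, {})[row] = weight
    return docs

def cluster_of(col, starts=MEDLINE_STARTS):
    """Return the 0-based cluster number for a 1-based column index."""
    return bisect_right(starts, col) - 1

# Made-up entries for illustration only; real entries come from the data files.
sample = ["1 17 0.5", "1 230 1.25", "251 17 2.0"]
docs = parse_triplets(sample)
labels = {col: cluster_of(col) for col in docs}
```

Each column (document) is stored here as a dictionary of row (term) weights, which keeps the sketch dependency-free; the same triplets could equally be loaded into a sparse matrix library.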
-
Reuters:
A term-document matrix for training, and
another term-document matrix for testing, each with 90 clusters based
on Reuters data.
The files are in a sparse format in which each matrix entry is represented by
its column index, row index, and weight.
The matrix size and the starting position of each category can be found in
train_category and test_category.
A more detailed description can be found in the paper
"Text classification using support vector machines with dimension reduction"
(with H. Kim and P. Howland)
Paper in ps
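
Given the starting positions of the categories and the number of columns, the column range and size of each category follow directly. The sketch below illustrates that step only. The layout of train_category and test_category is not described on this page, so the function takes an already-parsed list of start positions; the name category_ranges and the example values are hypothetical and do not come from the Reuters files.

```python
def category_ranges(starts, n_columns):
    """Turn sorted 1-based start columns into inclusive (first, last) ranges.

    Assumption: categories occupy contiguous column blocks, and the last
    category runs to the final column of the matrix.
    """
    ranges = []
    for i, first in enumerate(starts):
        last = starts[i + 1] - 1 if i + 1 < len(starts) else n_columns
        ranges.append((first, last))
    return ranges

# Made-up start positions and matrix width, for illustration only.
example_starts = [1, 4, 9]
example_ranges = category_ranges(example_starts, 12)
example_sizes = [last - first + 1 for first, last in example_ranges]
```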
This page was last modified: March, 2003.