Text to Matrix Generator (TMG) is a MATLAB® toolbox for a variety of text mining (TM) tasks. Most of TMG (version 6.0; Dec. '11) is written in MATLAB, though a large part of the indexing phase in the current version is written in Perl. Previous versions written strictly in MATLAB are also available. If MySQL and the MATLAB Database Toolbox are available, TMG exploits their functionality for additional flexibility.
TMG is especially suited to TM applications where data is high-dimensional but extremely sparse, as it uses the sparse matrix infrastructure of MATLAB. Originally built as a preprocessing tool for creating term-document matrices (tdm's) from unstructured text, the new version of TMG (Dec. '11) offers a wide range of tools for the following tasks:
 Indexing
 Retrieval
 Dimensionality Reduction
 Nonnegative Matrix Factorizations
 Clustering
 Classification
TMG functionality is accessible to users in several ways. If MATLAB (version 7.0 or higher) is available, most users are expected either to use TMG's MATLAB-based GUIs or to invoke the relevant functions directly from the MATLAB command line. Command-line components of TMG have also been reported to run under Octave.
The indexing process constructs new term-document matrices, and updates existing ones, from documents, storing them as MATLAB sparse arrays. To reduce the size of the term dictionary, TMG can apply common filtering steps: removal of common words (such as articles and conjunctions), removal of very short or very long terms, removal of very frequent or very infrequent terms, and stemming. As input, TMG accepts files or directories of ASCII text; in most cases it also processes HTML and many PostScript and PDF files with reasonable accuracy. A variety of term-weighting and normalization schemes is available as an option.
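The filtering steps described above can be illustrated with a minimal sketch. It is written in Python with the standard library only, not in MATLAB, and it is not TMG code: the function name build_tdm, the tiny stoplist, and the threshold parameters are all hypothetical and chosen for illustration. Stemming is omitted, and the sparse matrix is represented as a dictionary holding only the nonzero entries.

```python
import re
from collections import Counter

# Hypothetical stoplist; a real one (e.g. TMG's common_words file) is much longer.
STOPLIST = {"the", "a", "an", "and", "or", "of", "in", "is", "to"}

def build_tdm(docs, min_len=3, max_len=30, min_global_freq=1, max_global_freq=None):
    """Return (dictionary, tdm); tdm[(i, j)] is the count of term i in document j."""
    # Tokenize, then apply the stoplist and the term-length filter.
    tokenized = []
    for text in docs:
        words = re.findall(r"[a-z]+", text.lower())
        tokenized.append([w for w in words
                          if w not in STOPLIST and min_len <= len(w) <= max_len])

    # Global frequency: total number of occurrences over the whole collection.
    global_freq = Counter(w for words in tokenized for w in words)
    keep = {w for w, f in global_freq.items()
            if f >= min_global_freq
            and (max_global_freq is None or f <= max_global_freq)}

    # The term dictionary is the sorted list of surviving terms.
    dictionary = sorted(keep)
    index = {term: i for i, term in enumerate(dictionary)}

    # Sparse storage: only nonzero (term, document) entries are kept.
    tdm = {}
    for j, words in enumerate(tokenized):
        for w, count in Counter(words).items():
            if w in index:
                tdm[(index[w], j)] = count
    return dictionary, tdm

docs = [
    "The matrix is sparse and the matrix is large.",
    "Clustering of sparse text data.",
    "A query is matched to the text.",
]
dictionary, tdm = build_tdm(docs)
print(dictionary)
print(sorted(tdm.items()))
```

Each key of tdm is a (term index, document index) pair, so rows correspond to dictionary terms and columns to documents, the same orientation as a term-document matrix.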
Availability and Requirements
To obtain the current version of TMG please file a request form.
 Current Version: TMG 6.0. Some modules are in MATLAB precompiled (.p) format. See version TMG 3.0R4 (below) for strictly MATLAB open source with indexing and retrieval modules.
 Depending on user requirements, version 6.0 might necessitate the following third party software packages.
 If one intends to use MySQL functionality:
 If one wishes to apply the packages below for SVD and lowrank approximation, the following packages are already preloaded:
 Other TMG versions:
 TMG 3.0R4 (version 3.0, release 4) includes only the indexing and query modules and requires only MATLAB version 7.0 or higher. It is written entirely in MATLAB and is available in its entirety in source form. To obtain it, please contact the authors by e-mail.
Installation
Installing TMG is made simple using the init_tmg script. Specifically:
 Download TMG by completing and filing a simple form.
 Only if MySQL functionality is required, install MySQL and Java Connector.
 Unzip TMG_X.XRX.zip and start MATLAB. See the directory structure of the TMG root directory.
 Change path to the TMG root directory.
 Run init_tmg. Give the MySQL login and password as well as the root directory of the MySQL Java Connector. The installation script creates all necessary information (including MySQL database TMG) and adds to the MATLAB path all necessary directories.
 Run gui. Alternatively, use the command-line interface; type "help tmg" for details.
Main Files
Filename: Description
 Core indexing functions: The basic functions for the indexing module, i.e. tmg.m, tdm_update.m, tdm_downdate.m, tmg_query.m, merge_tdms.m.
 Core dimensionality reduction functions: The basic functions for the dimensionality reduction module, i.e. svd_tmg.m, pca.m, clsi.m, cm.m, sdd_tmg.m.
 Core retrieval functions: The basic functions for the retrieval module, i.e. lsa.m, vsm.m.
 Core NMF functions: The basic functions for the NMF module, i.e. bisecting_nndsvd.m, block_nndsvd.m, nnmf_mul_update.m.
 Core clustering functions: The basic functions for the clustering module, i.e. ekmeans.m, skmeans.m, pddp.m, pddp_2means.m, pddp_optcut.m, pddp_optcut_2means.m, pddp_optcutpd.m.
 Core classification functions: The basic functions for the classification module, i.e. knn_multi.m, knn_single.m, llsf_multi.m, llsf_single.m, rocchio_multi.m, rocchio_single.m, scut_knn.m, scut_llsf.m, scut_rocchio.m.
 GUI functions: gui.m, tmg_gui.m, dr_gui.m, retrieval_gui.m, nnmf_gui.m, clustering_gui.m, classification_gui.m, open_file.m, tmg_save_results.m, about_tmg_gui.m.
 Auxiliary functions: block_diagonalize.m, col_normalization.m, clean_filters.m, cleanup.m, col_rearrange.m, column_norms.m, compute_fro_norm, compute_scat.m, create_kmeans_response.m, create_pddp_response.m, create_retrieval_response.m, diff_vector.m, doc2ascii.m, entropy.m, get_node_scat.m, ks_selection.m, ks_selection1.m, make_clusters_multi.m, make_clusters_single.m, make_labels.m, make_val_inds.m, merge_dictionary.m, myperms.m, opt_2means.m, pca_mat.m, pca_mat_afun.m, pca_propack.m, pca_propack_Atransfunc.m, pca_propack_afun.m, pca_update.m, pca_update_afun.m, pddp_extract_centroids.m, ps_pdf2ascii.m, stemmer.m, strip_html.m, svd_update.m, svd_update_afun.m, two_means_1d.m, unique_elements.m, unique_words.m, untex.m.
 tmg_template.m: Template for tmg coupling and testing with querying and clustering.
 init_tmg.m: Installation script.
 ver.5.0R6_tmg_manual.pdf: Related documentation.
 ver6.0R7_updates.pdf: Updates of version 6.0 Release 7.
 common_words: Stoplist file (default is from the GTP project).
Sample Output
Users can download the term-document matrices and query matrices obtained by applying TMG to standard IR test collections, namely MEDLINE, CRANFIELD and CISI. Results are provided for the simplest term-weighting scheme (term frequency as the local function, with no global weighting and no normalization).
 MEDLINE: term-document matrix (5,735 x 1,033), dictionary (5,735 indexing terms), query matrix (30 queries)
 CRANFIELD: term-document matrix (4,563 x 1,398), dictionary (4,563 indexing terms), query matrix (225 queries)
 CISI: term-document matrix (5,544 x 1,460), dictionary (5,544 indexing terms), query matrix (35 queries)
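To make the weighting terminology concrete, the sketch below contrasts raw term frequency (the scheme used for the matrices above) with two common alternatives. It is plain Python, not TMG code. The function weight_tdm and its parameters are hypothetical, the idf formula log2(n/df) is a standard textbook choice assumed here rather than taken from TMG, and the toy matrix is made-up data.

```python
import math

def weight_tdm(tdm, n_docs, use_idf=False, normalize=False):
    """Weight a sparse term-document matrix stored as {(term, doc): count}."""
    # Local weight: raw term frequency.
    weighted = {key: float(count) for key, count in tdm.items()}

    if use_idf:
        # Global weight (assumed form): log2(n_docs / document frequency).
        df = {}
        for (i, _j) in tdm:
            df[i] = df.get(i, 0) + 1
        for (i, j) in weighted:
            weighted[(i, j)] *= math.log2(n_docs / df[i])

    if normalize:
        # Scale every document (column) to unit Euclidean length.
        squared = {}
        for (_i, j), value in weighted.items():
            squared[j] = squared.get(j, 0.0) + value * value
        for (i, j) in weighted:
            norm = math.sqrt(squared[j])
            if norm > 0:
                weighted[(i, j)] /= norm
    return weighted

# Toy 3-term x 2-document matrix of raw counts (made-up data).
tdm = {(0, 0): 2, (1, 0): 1, (1, 1): 3, (2, 1): 4}

plain = weight_tdm(tdm, n_docs=2)                 # tf only: counts unchanged
unit = weight_tdm(tdm, n_docs=2, normalize=True)  # tf with column normalization
tfidf = weight_tdm(tdm, n_docs=2, use_idf=True)   # tf with idf global weighting
```

With term frequency only, the entries are the raw counts. In this toy example, term 1 occurs in both documents, so its idf weight is log2(2/2) = 0 and its entries vanish under the idf variant.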
Documentation
The following documents describe the design and use of the package for IR experiments:
 D. Zeimpekis and E. Gallopoulos, "TMG: A MATLAB toolbox for generating term-document matrices from text collections". In "Grouping Multidimensional Data: Recent Advances in Clustering", J. Kogan, C. Nicholas and M. Teboulle, eds., pp. 187-210, Springer, 2006. Also Technical Report HPCLAB-SCG 1/01-05, Computer Engineering & Informatics Dept., University of Patras, Greece, Jan. 2005.
 D. Zeimpekis and E. Gallopoulos, "Design of a MATLAB toolbox for term-document matrix generation", Technical Report HPCLAB-SCG 2/02-05, Computer Engineering & Informatics Dept., University of Patras, Greece, February 2005. In Proc. Workshop on Clustering High Dimensional Data and its Applications, (held in conjunction with 5th SIAM Int'l Conf. Data Mining), I.S. Dhillon, J. Kogan and J. Ghosh eds., pp. 38-48, April 2005, Newport Beach, California.
Usage Reports
 Several papers and other works have used TMG for various applications.
 Please e-mail us information regarding your use of the software: your questions, comments, and suggestions will help us improve it.
References and related links
 Latent Semantic Indexing Web Site. Maintained by M.W. Berry and S. Dumais.
 Latent Semantic Analysis Web site at the University of Colorado, Boulder.
 D. Boley, Unsupervised Document Set Exploration Using Divisive Partitioning page and the companion paper "Principal direction divisive partitioning", Data Mining and Knowledge Discovery, 2 (1998), no. 4, pp. 325-344.
 I.S. Dhillon and D.S. Modha, "Concept Decompositions for Large Sparse Text Data using Clustering", Machine Learning, 42:1, pp. 143-175, 2001.
 Lars Elden, Matrix Methods in Data Mining and Pattern Recognition, SIAM, 2007.
 J.R. Gilbert, C. Moler, and R. Schreiber, "Sparse matrices in MATLAB: Design and implementation", SIAM J. Matrix Anal. Appl. 13 (1992), no. 1, 333-356.
 J.T. Giles, L. Wo, and M.W. Berry, "GTP (General Text Parser) Software for Text Mining" in Statistical Data Mining and Knowledge Discovery, H. Bozdogan (Ed.), CRC Press, Boca Raton, (2003), pp. 455-471.
 T.G. Kolda and D.P. O'Leary, Computation and Uses of the Semidiscrete Matrix Decomposition, Computer Science Department Report CS-TR-4012, Institute for Advanced Computer Studies Report UMIACS-TR-99-22, University of Maryland, April 1999.
 M. Berry, Z. Drmac, and E. Jessup, Matrices, vector spaces, and information retrieval, SIAM Review 41 (1998), 335-362.
 M. W. Berry, S. A. Pulatova, and G. W. Stewart, Computing sparse reduced-rank approximations to sparse matrices, ACM TOMS 31 (2005), no. 2.
 D. Boley, Principal direction divisive partitioning, Data Mining and Knowledge Discovery 2 (1998), no. 4, 325-344.
 S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science 41 (1990), no. 6, 391-407.
 I. S. Dhillon and D. S. Modha, Concept decompositions for large sparse text data using clustering, Machine Learning 42 (2001), no. 1, 143-175.
 T. Kolda and D. O'Leary, Algorithm 805: computation and uses of the semidiscrete matrix decomposition, ACM TOMS 26 (2000), no. 3.
 C.D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
 C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
 H. Park, M. Jeon, and J. Rosen, Lower dimensional representation of text data based on centroids and least squares, BIT 43 (2003).
 M.F. Porter, An algorithm for suffix stripping, Program (1980), no. 3, 130-137.
 G. Salton, C. Yang, and A. Wong, A Vector Space Model for Automatic Indexing, Communications of the ACM 18 (1975), no. 11, 613-620.
 Y. Yang and C. Chute, A linear least squares fit mapping method for information retrieval from natural language texts, In 14th Conf. Comp. Linguistics, 1992.
 D. Zeimpekis and E. Gallopoulos, PDDP(l): Towards a Flexible Principal Direction Divisive Partitioning Clustering Algorithm, Proc. IEEE ICDM '03 Workshop on Clustering Large Data Sets (Melbourne, Florida) (D. Boley, I. Dhillon, J. Ghosh, and J. Kogan, eds.), 2003, pp. 26-35.
 D. Zeimpekis and E. Gallopoulos, CLSI: A flexible approximation scheme from clustered term-document matrices, In Proc. SIAM 2005 Data Mining Conf. (Newport Beach, California) (H. Kargupta, J. Srivastava, C. Kamath, and A. Goodman, eds.), April 2005, pp. 631-635.
 D. Zeimpekis and E. Gallopoulos, Linear and non-linear dimensional reduction via class representatives for text classification, In Proc. of the 2006 IEEE International Conference on Data Mining (Hong Kong), December 2006, pp. 1172-1177.
 D. Zeimpekis and E. Gallopoulos, TMG: A MATLAB toolbox for generating term-document matrices from text collections, Grouping Multidimensional Data: Recent Advances in Clustering (J. Kogan, C. Nicholas, and M. Teboulle, eds.), Springer, Berlin, 2006, pp. 187-210.
 D. Zeimpekis and E. Gallopoulos, k-means steering of spectral divisive clustering algorithms, In Proc. of Text Mining Workshop (Minneapolis), 2007.
Related tools
 Latent Semantic Indexing Web Site. Maintained by M.W. Berry and S. Dumais. This also contains an excellent collection of links to academic and other software tools.
 MC: A Toolkit for Creating Vector Models from Text Documents. See also I.S. Dhillon and D.S. Modha, "Concept Decompositions for Large Sparse Text Data using Clustering", Machine Learning, 42:1, pp. 143-175, 2001.
 GTP, J.T. Giles, L. Wo, and M.W. Berry, "GTP (General Text Parser) Software for Text Mining" in Statistical Data Mining and Knowledge Discovery, H. Bozdogan (Ed.), CRC Press, Boca Raton, (2003), pp. 455-471.
 G. Karypis, CLUTO - Family of Data Clustering Software Tools
 F. Wild, An Open Source LSA Package for R
 SAS® Text Miner: industry-strength commercial software
 Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering by Andrew McCallum
 Weka 3: Data Mining Software in Java
Disclaimers
TMG comes without ANY warranty. Users who decide to work with TMG cannot hold any of the authors accountable for the program's behavior.
Acknowledgements
Research supported in part by the University of Patras K. Karatheodori grant no. B120 and a Bodossaki foundation scholarship. We extend our thanks to many active, casual and potential users who offered their advice and opinions.
