|
Text to Matrix Generator (TMG) is a MATLAB® toolbox for various tasks in text mining (TM). Most of TMG (version 6.0; Dec.'11) is written in MATLAB, though a large part of the indexing phase in the current version is written in Perl. Previous versions written strictly in MATLAB are also available. If MySQL and the MATLAB Database Toolbox are available, TMG exploits their functionality for additional flexibility.
TMG is especially suited for TM applications where data is high-dimensional but extremely sparse, as it uses the sparse matrix infrastructure of MATLAB. Originally built as a preprocessing tool for creating term-document matrices (TDMs) from unstructured text, the new version of TMG (Dec.'11) offers a wide range of tools for the following tasks:
- Indexing
- Retrieval
- Dimensionality Reduction
- Non-negative Matrix Factorizations
- Clustering
- Classification
TMG functionality is accessible to users in several ways. If MATLAB (version 7.0 or higher) is available, most users are expected either to use TMG's MATLAB-based GUIs or to invoke the relevant functions directly from the MATLAB command line. Command-line components of TMG have also been reported to run under Octave.
The indexing process constructs new term-document matrices, and updates existing ones, from documents, in the form of MATLAB sparse arrays. To reduce the size of the term dictionary, TMG applies common filtering techniques: removal of common words (such as articles and conjunctions), removal of very short or very long terms, and removal of very frequent or very infrequent terms. Stemming and a variety of term-weighting and normalization schemes are available as options. TMG accepts as input files or directories consisting of ASCII text. In most cases it also processes HTML and many PostScript and PDF files with reasonable accuracy.
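The filtering steps described above can be sketched in a few lines. The example below is written in Python rather than MATLAB and is not TMG code: the stoplist, the thresholds, the function names `tokenize` and `build_tdm`, and the use of document frequency as the frequency filter are all assumptions made for illustration, and stemming is omitted. The sparse matrix is stored as a dictionary holding only the nonzero entries.

```python
import re
from collections import Counter

# Hypothetical stoplist, chosen only for this illustration.
STOPWORDS = {"a", "an", "and", "the", "of", "or", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_tdm(docs, min_len=3, max_len=30, min_docs=1, max_docs=None):
    """Return (dictionary, tdm); tdm maps (term_index, doc_index) to a raw count."""
    max_docs = len(docs) if max_docs is None else max_docs

    # Stoplist and term-length filters, applied per document.
    per_doc = []
    for text in docs:
        terms = [t for t in tokenize(text)
                 if t not in STOPWORDS and min_len <= len(t) <= max_len]
        per_doc.append(Counter(terms))

    # Frequency filter: keep terms that occur in an acceptable number of documents.
    doc_freq = Counter()
    for counts in per_doc:
        doc_freq.update(counts.keys())
    dictionary = sorted(t for t, df in doc_freq.items()
                        if min_docs <= df <= max_docs)
    index = {term: i for i, term in enumerate(dictionary)}

    # Store only the nonzero entries of the term-document matrix.
    tdm = {}
    for j, counts in enumerate(per_doc):
        for term, count in counts.items():
            if term in index:
                tdm[(index[term], j)] = count
    return dictionary, tdm


docs = [
    "The sparse matrix is stored in compressed form.",
    "A term-document matrix is a sparse matrix of term counts.",
    "Stemming and stoplists reduce the dictionary.",
]
dictionary, tdm = build_tdm(docs)
print(len(dictionary), "terms,", len(tdm), "nonzero entries")
```

Lowering `max_docs` removes terms that appear in too many documents, and raising `min_docs` removes rare ones; both shrink the dictionary in the way the paragraph above describes.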
Availability and Requirements
To obtain the current version of TMG please file a request form.
- Current Version: TMG 6.0. Some modules are in MATLAB precompiled (.p) format. See TMG 3.0R4 (below) for a strictly MATLAB, open-source version with indexing and retrieval modules.
- Depending on user requirements, version 6.0 may require the following third-party software packages:
- If one intends to use MySQL functionality:
- If one wishes to apply the packages below for SVD and low-rank approximation, the following packages are already preloaded:
- Other TMG versions:
- TMG 3.0R4 (version 3.0, release 4) includes only the indexing and query modules and requires only MATLAB version 7.0 or higher. It is written entirely in MATLAB and is available in its entirety in source form. To obtain it, please e-mail the authors.
Installation
Installing TMG is made simple using the init_tmg script. Specifically:
- Download TMG by completing and filing a simple form.
- Only if MySQL functionality is required, install MySQL and Java Connector.
- Unzip TMG_X.XRX.zip and start MATLAB. See the directory structure of the TMG root directory.
- Change the current directory to the TMG root directory.
- Run init_tmg. Provide the MySQL login and password as well as the root directory of the MySQL Java Connector. The installation script creates everything needed (including the MySQL database TMG) and adds all necessary directories to the MATLAB path.
- Run gui. Alternatively, use the command-line interface; type "help tmg" to get started.
Main Files
| Filename | Description |
| Core indexing functions | The basic functions for the indexing module, i.e. tmg.m, tdm_update.m, tdm_downdate.m, tmg_query.m, merge_tdms.m. |
| Core dimensionality reduction functions | The basic functions for the dimensionality reduction module, i.e. svd_tmg.m, pca.m, clsi.m, cm.m, sdd_tmg.m. |
| Core retrieval functions | The basic functions for the retrieval module, i.e. lsa.m, vsm.m. |
| Core NMF functions | The basic functions for the NMF module, i.e. bisecting_nndsvd.m, block_nndsvd.m, nnmf_mul_update.m. |
| Core clustering functions | The basic functions for the clustering module, i.e. ekmeans.m, skmeans.m, pddp.m, pddp_2means.m, pddp_optcut.m, pddp_optcut_2means.m, pddp_optcutpd.m. |
| Core classification functions | The basic functions for the classification module, i.e. knn_multi.m, knn_single.m, llsf_multi.m, llsf_single.m, rocchio_multi.m, rocchio_single.m, scut_knn.m, scut_llsf.m, scut_rocchio.m. |
| GUI functions | gui.m, tmg_gui.m, dr_gui.m, retrieval_gui.m, nnmf_gui.m, clustering_gui.m, classification_gui.m, open_file.m, tmg_save_results.m, about_tmg_gui.m. |
| Auxiliary functions | block_diagonalize.m, col_normalization.m, clean_filters.m, cleanup.m, col_rearrange.m, column_norms.m, compute_fro_norm, compute_scat.m, create_kmeans_response.m, create_pddp_response.m, create_retrieval_response.m, diff_vector.m, doc2ascii.m, entropy.m, get_node_scat.m, ks_selection.m, ks_selection1.m, make_clusters_multi.m, make_clusters_single.m, make_labels.m, make_val_inds.m, merge_dictionary.m, myperms.m, opt_2means.m, pca_mat.m, pca_mat_afun.m, pca_propack.m, pca_propack_Atransfunc.m, pca_propack_afun.m, pca_update.m, pca_update_afun.m, pddp_extract_centroids.m, ps_pdf2ascii.m, stemmer.m, strip_html.m, svd_update.m, svd_update_afun.m, two_means_1d.m, unique_elements.m, unique_words.m, untex.m. |
| tmg_template.m | Template for tmg coupling and testing with querying and clustering. |
| init_tmg.m | Installation script. |
| ver.5.0R6_tmg_manual.pdf | Related documentation. |
| ver6.0R7_updates.pdf | Updates of version 6.0 Release 7. |
| common_words | Stoplist file (default is from the GTP project). |
Sample Output
Users can download the term-document matrices and query matrices produced by applying TMG to standard IR test collections, namely MEDLINE, CRANFIELD and CISI. Results are provided for the simplest term-weighting scheme (term-frequency local weighting, with no global weighting and no normalization).
- MEDLINE: term-document matrix (5,735 x 1,033), dictionary (5,735 indexing terms), query matrix (30 queries)
- CRANFIELD: term-document matrix (4,563 x 1,398), dictionary (4,563 indexing terms), query matrix (225 queries)
- CISI: term-document matrix (5,544 x 1,460), dictionary (5,544 indexing terms), query matrix (35 queries)
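To make the weighting terminology concrete, the following Python sketch (not TMG code) applies a local term-frequency weight, an optional global weight, and an optional column normalization to a small sparse matrix stored as a dictionary of nonzero entries. The function name `weight_tdm`, the toy matrix, and the particular idf formula log2(n/df) are assumptions for illustration; TMG's own schemes may be defined differently. With both options switched off, the result is the simplest scheme mentioned above, namely raw term frequencies.

```python
import math


def weight_tdm(tdm, n_terms, n_docs, use_idf=False, normalize=False):
    """Weight a sparse term-document matrix given as {(term, doc): count}."""
    # Local weight: raw term frequency.
    weighted = {key: float(count) for key, count in tdm.items()}

    if use_idf:
        # Global weight (one common idf variant, assumed here): log2(n_docs / df).
        doc_freq = [0] * n_terms
        for (i, _j) in tdm:
            doc_freq[i] += 1
        weighted = {(i, j): w * math.log2(n_docs / doc_freq[i])
                    for (i, j), w in weighted.items()}

    if normalize:
        # Normalization: scale every document (column) to unit Euclidean length.
        sq_norms = [0.0] * n_docs
        for (_i, j), w in weighted.items():
            sq_norms[j] += w * w
        weighted = {(i, j): w / math.sqrt(sq_norms[j])
                    for (i, j), w in weighted.items() if sq_norms[j] > 0}

    return weighted


# Toy matrix with 3 terms and 2 documents (hypothetical counts).
toy = {(0, 0): 2, (1, 0): 1, (0, 1): 1, (2, 1): 3}

simplest = weight_tdm(toy, n_terms=3, n_docs=2)
with_idf = weight_tdm(toy, n_terms=3, n_docs=2, use_idf=True)
unit_cols = weight_tdm(toy, n_terms=3, n_docs=2, normalize=True)
print(simplest)
```

In this toy example, term 0 occurs in every document, so the assumed idf factor sets its weight to zero, while terms confined to a single document keep their counts.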
Documentation
The following documents describe the design and use of the package for IR experiments:
- D. Zeimpekis and E. Gallopoulos, "TMG: A MATLAB toolbox for generating term-document matrices from text collections". In "Grouping Multidimensional Data: Recent Advances in Clustering", J. Kogan, C. Nicholas and M. Teboulle, eds., pp. 187-210, Springer, 2006. Also Technical Report HPCLAB-SCG 1/01-05, Computer Engineering & Informatics Dept., University of Patras, Greece, Jan. 2005.
- D. Zeimpekis and E. Gallopoulos, "Design of a MATLAB toolbox for term-document matrix generation", Technical Report HPCLAB-SCG 2/02-05, Computer Engineering & Informatics Dept., University of Patras, Greece, February 2005. In Proc. Workshop on Clustering High Dimensional Data and its Applications (held in conjunction with 5th SIAM Int'l Conf. Data Mining), I.S. Dhillon, J. Kogan and J. Ghosh, eds., pp. 38-48, April 2005, Newport Beach, California.
Usage Reports
- Several papers and other works have used TMG for various applications.
- Please e-mail us information regarding your use of the software: your questions, comments, and suggestions will help us improve it.
References and related links
- Latent Semantic Indexing Web Site. Maintained by M.W. Berry and S. Dumais.
- Latent Semantic Analysis Web site at the University of Colorado, Boulder.
- D. Boley, Unsupervised Document Set Exploration Using Divisive Partitioning page and the companion paper "Principal direction divisive partitioning", Data Mining and Knowledge Discovery, 2 (1998), no. 4, pp. 325-344.
- I.S. Dhillon and D.S. Modha, "Concept Decompositions for Large Sparse Text Data using Clustering", Machine Learning, 42:1, pp. 143-175, 2001.
- Lars Elden, Matrix Methods in Data Mining and Pattern Recognition, SIAM, 2007.
- J.R. Gilbert, C. Moler, and R. Schreiber, "Sparse matrices in MATLAB: Design and implementation", SIAM J. Matrix Anal. Appl. 13 (1992), no. 1, 333-356.
- J.T. Giles, L. Wo, and M.W. Berry, "GTP (General Text Parser) Software for Text Mining" in Statistical Data Mining and Knowledge Discovery, H. Bozdogan (Ed.), CRC Press, Boca Raton, (2003), pp. 455-471.
- T.G. Kolda and D.P. O'Leary, Computation and Uses of the Semidiscrete Matrix Decomposition, Computer Science Department Report CS-TR-4012 Institute for Advanced Computer Studies Report UMIACS-TR-99-22, University of Maryland, April 1999.
- M. Berry, Z. Drmac, and E. Jessup, Matrices, vector spaces, and information retrieval, SIAM Review 41 (1999), 335-362.
- M. W. Berry, S. A. Pulatova, and G. W. Stewart, Computing sparse reduced-rank approximations to sparse matrices, ACM TOMS 31 (2005), no. 2.
- D. Boley, Principal direction divisive partitioning, Data Mining and Knowledge Discovery 2 (1998), no. 4, 325-344.
- S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science 41 (1990), no. 6, 391-407.
- I. S. Dhillon and D. S. Modha, Concept decompositions for large sparse text data using clustering, Machine Learning 42 (2001), no. 1, 143-175.
- T. Kolda and D. O'Leary, Algorithm 805: computation and uses of the semidiscrete matrix decomposition, ACM TOMS 26 (2000), no. 3.
- C.D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, The MIT Press, 1999.
- C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.
- H. Park, M. Jeon, and J. Rosen, Lower dimensional representation of text data based on centroids and least squares, BIT 43 (2003).
- M.F. Porter, An algorithm for suffix stripping, Program 14 (1980), no. 3, 130-137.
- G. Salton, C. Yang, and A. Wong, A Vector-Space Model for Automatic Indexing, Communications of the ACM 18 (1975), no. 11, 613-620.
- Y. Yang and C. Chute, A linear least squares fit mapping method for information retrieval from natural language texts, In 14th Conf. Comp. Linguistics, 1992.
- D. Zeimpekis and E. Gallopoulos, PDDP(l): Towards a Flexible Principal Direction Divisive Partitioning Clustering Algorithm, Proc. IEEE ICDM '03 Workshop on Clustering Large Data Sets (Melbourne, Florida) (D. Boley, I. Dhillon, J. Ghosh, and J. Kogan, eds.), 2003, pp. 26-35.
- D. Zeimpekis and E. Gallopoulos, CLSI: A flexible approximation scheme from clustered term-document matrices, In Proc. SIAM 2005 Data Mining Conf. (Newport Beach, California) (H. Kargupta, J. Srivastava, C. Kamath, and A. Goodman, eds.), April 2005, pp. 631-635.
- D. Zeimpekis and E. Gallopoulos, Linear and non-linear dimensional reduction via class representatives for text classification, In Proc. of the 2006 IEEE International Conference on Data Mining (Hong Kong), December 2006, pp. 1172-1177.
- D. Zeimpekis and E. Gallopoulos, TMG: A MATLAB toolbox for generating term document matrices from text collections, Grouping Multidimensional Data: Recent Advances in Clustering (J. Kogan, C. Nicholas, and M. Teboulle, eds.), Springer, Berlin, 2006, pp. 187-210.
- D. Zeimpekis and E. Gallopoulos, k-means steering of spectral divisive clustering algorithms, In Proc. of Text Mining Workshop (Minneapolis), 2007.
Related tools
- Latent Semantic Indexing Web Site. Maintained by M.W. Berry and S. Dumais. This also contains an excellent collection of links to academic and other software tools.
- MC: A Toolkit for Creating Vector Models from Text Documents. See also I.S. Dhillon and D.S. Modha, "Concept Decompositions for Large Sparse Text Data using Clustering", Machine Learning, 42:1, pp. 143-175, 2001.
- GTP, J.T. Giles, L. Wo, and M.W. Berry, "GTP (General Text Parser) Software for Text Mining" in Statistical Data Mining and Knowledge Discovery, H. Bozdogan (Ed.), CRC Press, Boca Raton, (2003), pp. 455-471.
- G. Karypis, CLUTO - Family of Data Clustering Software Tools
- F. Wild, An Open Source LSA Package for R
- SAS® Text Miner: industry-strength commercial software
- Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering by Andrew McCallum
- Weka 3: Data Mining Software in Java
Disclaimers
TMG comes without ANY warranty. Users who decide to work with TMG cannot hold any of the authors accountable for the program's behavior.
Acknowledgements
Research supported in part by the University of Patras K. Karatheodori grant no. B120 and a Bodossaki Foundation scholarship. We extend our thanks to the many active, casual and potential users who offered their advice and opinions.
 
|