TMG

Text to Matrix Generator (TMG) is a MATLAB® toolbox that can be used for various tasks in text mining (TM). Most of TMG (version 6.0; Dec.'11) is written in MATLAB, though a large segment of the indexing phase of the current version of TMG is written in Perl. Previous versions that were strictly MATLAB are also available. If MySQL and the MATLAB Database Toolbox are available, TMG exploits their functionality for additional flexibility.

TMG is especially suited for TM applications where the data are high-dimensional but extremely sparse, because it uses MATLAB's sparse matrix infrastructure. Originally built as a preprocessing tool for creating term-document matrices (TDMs) from unstructured text, the new version of TMG (Dec.'11) offers a wide range of tools for the following tasks:

  1. Indexing
  2. Retrieval
  3. Dimensionality Reduction
  4. Non-negative Matrix Factorizations
  5. Clustering
  6. Classification

TMG functionality is accessible to users in several ways. If MATLAB (version 7.0 or higher) is available, most users are expected either to use TMG's MATLAB-based GUIs or to invoke the relevant functions directly from the MATLAB command line. Command line components of TMG have also been reported to run under Octave.

The indexing process constructs new term-document matrices, and updates existing ones, from documents, in the form of MATLAB sparse arrays. To reduce the size of the term dictionary, it may apply several common filtering steps: removal of common words (such as articles and conjunctions), removal of terms that are too short or too long, removal of terms that are too frequent or too infrequent, and stemming. TMG accepts as input files or directories consisting of ASCII text. In most cases it also processes HTML and many PostScript and PDF files with reasonable accuracy. A variety of term-weighting and normalization schemes are available as options.
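The filtering steps described above can be illustrated with a small Python sketch. This is not TMG's code (TMG itself is MATLAB and Perl); the function name `build_tdm`, its parameters, and the tiny stoplist are hypothetical, and stemming is omitted for brevity. The matrix is kept in a sparse form that stores only the nonzero counts.

```python
import re
from collections import Counter

# Hypothetical stoplist for illustration only; TMG ships its own common_words file.
STOPWORDS = {"a", "an", "and", "the", "of", "or", "in", "is", "to"}

def build_tdm(docs, min_len=3, max_len=30, min_global=1, max_global=None):
    """Return (dictionary, entries); entries maps (term_row, doc_col) -> raw count."""
    per_doc = []
    for text in docs:
        tokens = re.findall(r"[a-z]+", text.lower())
        # Drop common words and terms that are too short or too long.
        kept = [t for t in tokens
                if t not in STOPWORDS and min_len <= len(t) <= max_len]
        per_doc.append(Counter(kept))

    # Global frequency: total occurrences of each term across the collection.
    global_freq = Counter()
    for counts in per_doc:
        global_freq.update(counts)

    # Drop terms that are too infrequent or too frequent.
    dictionary = sorted(
        t for t, f in global_freq.items()
        if f >= min_global and (max_global is None or f <= max_global)
    )
    row = {t: i for i, t in enumerate(dictionary)}

    # Sparse storage: only nonzero (term, document) cells are recorded.
    entries = {}
    for j, counts in enumerate(per_doc):
        for t, c in counts.items():
            if t in row:
                entries[(row[t], j)] = c
    return dictionary, entries

docs = [
    "The cat sat on the mat.",
    "A cat and a dog.",
    "Sparse matrices store only the nonzero entries of a matrix.",
]
dictionary, entries = build_tdm(docs, min_len=3, max_global=2)
print(len(dictionary), "terms,", len(entries), "nonzero entries")
```

Even on this toy collection most cells of the matrix are zero, which is why a sparse representation pays off on realistic corpora.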

Availability and Requirements


To obtain the current version of TMG, please complete and submit the request form.

  • Current Version: TMG 6.0. Some modules are in MATLAB precompiled (.p) format. See TMG 3.0R4 (below) for a strictly MATLAB, open-source version containing the indexing and retrieval modules.
  • Depending on user requirements, version 6.0 might necessitate the following third party software packages.
  • If one wishes to apply the packages below for SVD and low-rank approximation, the following packages are already preloaded:
  • Other TMG versions:
    • TMG 3.0R4 (version 3.0, release 4) includes only the indexing and query modules and requires only MATLAB version 7.0 or higher. It is written entirely in MATLAB and is available in its entirety in source form. To obtain it, please e-mail the authors.

Installation


Installing TMG is made simple using the init_tmg script. Specifically:

  • Download TMG by completing and filing a simple form.
  • If (and only if) MySQL functionality is required, install MySQL and the MySQL Java Connector.
  • Unzip TMG_X.XRX.zip and start MATLAB. See the directory structure of the TMG root directory.
  • Change the current directory to the TMG root directory.
  • Run init_tmg. Supply the MySQL login and password as well as the root directory of the MySQL Java Connector. The installation script creates all necessary information (including the MySQL database TMG) and adds all necessary directories to the MATLAB path.
  • Run gui. Alternatively, use the command line interface; type "help tmg" to get started.

Main Files


Filename | Description
Core indexing functions | The basic functions for the indexing module, i.e. tmg.m, tdm_update.m, tdm_downdate.m, tmg_query.m, merge_tdms.m.
Core dimensionality reduction functions | The basic functions for the dimensionality reduction module, i.e. svd_tmg.m, pca.m, clsi.m, cm.m, sdd_tmg.m.
Core retrieval functions | The basic functions for the retrieval module, i.e. lsa.m, vsm.m.
Core NMF functions | The basic functions for the NMF module, i.e. bisecting_nndsvd.m, block_nndsvd.m, nnmf_mul_update.m.
Core clustering functions | The basic functions for the clustering module, i.e. ekmeans.m, skmeans.m, pddp.m, pddp_2means.m, pddp_optcut.m, pddp_optcut_2means.m, pddp_optcutpd.m.
Core classification functions | The basic functions for the classification module, i.e. knn_multi.m, knn_single.m, llsf_multi.m, llsf_single.m, rocchio_multi.m, rocchio_single.m, scut_knn.m, scut_llsf.m, scut_rocchio.m.
GUI functions | gui.m, tmg_gui.m, dr_gui.m, retrieval_gui.m, nnmf_gui.m, clustering_gui.m, classification_gui.m, open_file.m, tmg_save_results.m, about_tmg_gui.m.
Auxiliary functions | block_diagonalize.m, col_normalization.m, clean_filters.m, cleanup.m, col_rearrange.m, column_norms.m, compute_fro_norm, compute_scat.m, create_kmeans_response.m, create_pddp_response.m, create_retrieval_response.m, diff_vector.m, doc2ascii.m, entropy.m, get_node_scat.m, ks_selection.m, ks_selection1.m, make_clusters_multi.m, make_clusters_single.m, make_labels.m, make_val_inds.m, merge_dictionary.m, myperms.m, opt_2means.m, pca_mat.m, pca_mat_afun.m, pca_propack.m, pca_propack_Atransfunc.m, pca_propack_afun.m, pca_update.m, pca_update_afun.m, pddp_extract_centroids.m, ps_pdf2ascii.m, stemmer.m, strip_html.m, svd_update.m, svd_update_afun.m, two_means_1d.m, unique_elements.m, unique_words.m, untex.m.
tmg_template.m | Template for tmg coupling and testing with querying and clustering.
init_tmg.m | Installation script.
ver.5.0R6_tmg_manual.pdf | Related documentation.
ver6.0R7_updates.pdf | Updates of version 6.0 Release 7.
common_words | Stoplist file (default is from the GTP project).

 

Sample Output


Users can download the term-document matrices and query matrices obtained by applying TMG to standard IR test collections, namely MEDLINE, CRANFIELD and CISI. Results are provided for the simplest term-weighting scheme (term frequency local function, with no global weighting and no normalization).
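To make the weighting terminology concrete, the following Python sketch treats each matrix entry as a local weight (here the raw term frequency) multiplied by an optional global weight, followed by optional column normalization. It is an illustration only, not TMG code: the function name `weight_tdm`, its flags, and the choice of log2(n/df) as the global weight are assumptions made for this example.

```python
import math

def weight_tdm(counts, use_idf=False, normalize=False):
    """counts: rows are terms, columns are documents, values are raw term frequencies."""
    n_terms = len(counts)
    n_docs = len(counts[0]) if counts else 0

    # Global weight per term: 1.0 (no global weighting) or an assumed
    # inverse document frequency, log2(n_docs / document frequency).
    global_w = []
    for row in counts:
        df = sum(1 for c in row if c > 0)
        if use_idf and df > 0:
            global_w.append(math.log2(n_docs / df))
        else:
            global_w.append(1.0)

    # Local weight (term frequency) times global weight.
    weighted = [[counts[i][j] * global_w[i] for j in range(n_docs)]
                for i in range(n_terms)]

    # Optional normalization: scale each document column to unit Euclidean length.
    if normalize:
        for j in range(n_docs):
            norm = math.sqrt(sum(weighted[i][j] ** 2 for i in range(n_terms)))
            if norm > 0:
                for i in range(n_terms):
                    weighted[i][j] /= norm
    return weighted

counts = [
    [2, 0, 1],  # term 0
    [0, 3, 0],  # term 1
    [1, 1, 1],  # term 2 (appears in every document)
]
plain = weight_tdm(counts)  # term frequency only, no global weighting, no normalization
tfidf = weight_tdm(counts, use_idf=True, normalize=True)
```

With the default flags the matrix is returned unchanged apart from conversion to floats, which corresponds to the simplest scheme named above. With the assumed global weight switched on, a term that occurs in every document receives weight zero.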

Documentation


The following documents describe the design and use of the package for IR experiments:

Usage Reports


  • Several papers and other works have used TMG for various applications.
  • Please e-mail us information regarding your use of the software: your questions, comments, and suggestions will help us improve it.

References and related links


Related tools


Disclaimers


TMG comes without ANY warranty. Users who decide to work with TMG cannot hold any of the authors accountable for the program's behavior.

Acknowledgements


Research supported in part by the University of Patras K. Karatheodori grant no. B120 and a Bodossaki foundation scholarship. We extend our thanks to many active, casual and potential users who offered their advice and opinions.
