Chamblon Systems Inc.

TerminologyExtractor


Version 3.0 main features

         Extraction of words and collocations from Microsoft Word, RTF, HTML and plain text documents.

         Frequency counts for every extracted word and collocation.

         Keywords in context (KWIC) available on the main window.

         On-line sorting of terms by frequency and alphabetically.

         Term filter that allows a view of only the terms that contain a specific string.

         Searching of terms in all documents. Source document names are included in search results.

         Export of sorted term lists and search results.

         Support for Word for Windows XP.

         Processes all documents at the same time. No need to cut and paste smaller documents together into one single large one before extracting terminology!

You can request a fully functional evaluation version of TerminologyExtractor 3.0 by e-mailing us at info@chamblon.com.

Click here to view a screen shot of version 3.0! You may also want to take a look at the documentation (help file).

Description

TerminologyExtractor is a tool that extracts word and collocation lists, with frequencies, from Microsoft Word documents and from HTML, Rich Text Format (RTF) and plain text files. TerminologyExtractor uses a number of features and algorithms to improve the quality of its output. For example, when processing English and French texts, it uses the root form of each word, i.e. it transforms plurals and conjugated verbs into singulars and infinitives. It also uses lists of control words (pronouns, articles, prepositions, etc.) to avoid collocations such as "of the" and "I have". In addition, all acronyms and proper nouns are kept in their original form; no changes are made to uppercase and lowercase letters.
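
As a rough illustration of the root-form step, the sketch below maps inflected forms to root forms with a small lookup table and leaves any token containing an uppercase letter untouched. The table ROOT_FORMS and the function root_form are hypothetical names invented for this example; the algorithm TerminologyExtractor actually uses is not described here and is certainly more complete.

```python
# Hypothetical lookup table of inflected forms; a real tool would rely on
# a full morphological dictionary for English and French.
ROOT_FORMS = {
    "extracts": "extract",
    "terms": "term",
    "files": "file",
    "documents": "document",
}

def root_form(token, table=ROOT_FORMS):
    """Return the root form of a token, keeping acronyms and proper nouns as-is."""
    # Simplification: any uppercase letter marks the token as an acronym or
    # proper noun. This also skips capitalized sentence-initial words.
    if any(ch.isupper() for ch in token):
        return token
    # Unknown tokens fall through unchanged.
    return table.get(token, token)

sentence = "TerminologyExtractor extracts terms from RTF files"
roots = [root_form(token) for token in sentence.split()]
```

In this sketch "TerminologyExtractor" and "RTF" keep their original casing, while "extracts", "terms" and "files" are reduced to "extract", "term" and "file".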

One of the main features of TerminologyExtractor is that it differentiates between words and non-words. TerminologyExtractor marks a string as a "word" if it is found in its dictionary. Otherwise, the string is marked as a "non-word". After TerminologyExtractor has processed a set of documents, the non-word list contains abbreviations, proper nouns, misspelled words and words that are very specific to the domain of the text. These can therefore be spotted immediately, without having to go through a long list of words manually.
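
The word/non-word split can be pictured as a simple dictionary lookup. The following is a minimal sketch under stated assumptions: DICTIONARY is a deliberately tiny, made-up word list, and split_words_and_non_words is a hypothetical helper, not part of the product.

```python
import re
from collections import Counter

# Hypothetical miniature dictionary; a real tool would load a full lexicon.
DICTIONARY = {"the", "court", "reviewed", "a", "ruling", "by", "on"}

def split_words_and_non_words(text, dictionary=DICTIONARY):
    """Count tokens found in the dictionary separately from all other tokens."""
    words, non_words = Counter(), Counter()
    for token in re.findall(r"[A-Za-z]+", text):
        if token.lower() in dictionary:
            words[token.lower()] += 1
        else:
            # Non-words keep their original casing (acronyms, proper nouns).
            non_words[token] += 1
    return words, non_words

words, non_words = split_words_and_non_words(
    "The court reviewed a ruling by the ICJ on teh treaty."
)
```

Here the non-word list ends up holding an acronym ("ICJ"), a misspelling ("teh") and a term missing from the small dictionary ("treaty"), which mirrors the kinds of entries described above.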

The collocation lists produced by TerminologyExtractor contain all sequences of words and non-words that appear more than once in the text. A special algorithm allows it to detect collocations that appear within longer collocations. For example, in a text about law you may find the terms "justice system" and "criminal justice system". Both terms will appear in the collocation list, each with its own frequency.
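
One simple way to obtain this behaviour is to count every word sequence up to a fixed length and keep those seen more than once, so that a short collocation nested inside a longer one is counted on its own. The sketch below takes that approach; it is not the product's algorithm. The names CONTROL_WORDS and collocations, the limit of three words, and the rule of dropping sequences that start or end with a control word are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical control-word list (articles, prepositions, auxiliaries).
CONTROL_WORDS = {"the", "is", "of"}

def collocations(text, max_len=3, control_words=CONTROL_WORDS):
    """Return sequences of 2..max_len tokens that occur more than once."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Skip sequences that begin or end with a control word.
            if gram[0] in control_words or gram[-1] in control_words:
                continue
            counts[" ".join(gram)] += 1
    return {term: freq for term, freq in counts.items() if freq > 1}

text = (
    "The criminal justice system is slow. "
    "Reform of the criminal justice system is overdue. "
    "The justice system needs funding."
)
result = collocations(text)
```

For this text the sketch reports "justice system" three times and "criminal justice system" twice, since the shorter term is also counted inside each occurrence of the longer one. Note that this simplified version ignores sentence boundaries.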

Version 3.0 features an integrated KWIC module. The terms (words, non-words and collocations) identified by TerminologyExtractor are displayed in a list. A filter can be applied to the list in order to display only the terms that include a word or part of a word. Terms can then be selected from the list and their context (i.e. the sentence segments in which they occur) displayed in a window or saved to a file. 
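
The filter and the KWIC display can be sketched in a few lines. This is an illustration only: filter_terms and kwic are hypothetical helpers, and the context here is a fixed number of characters on each side of the term rather than the sentence segments the product displays.

```python
import re

def filter_terms(terms, fragment):
    """Keep only the terms that contain the given string (case-insensitive)."""
    return [term for term in terms if fragment.lower() in term.lower()]

def kwic(text, term, width=30):
    """Return one line per occurrence of term, with surrounding context."""
    pattern = r"\b" + re.escape(term) + r"\b"
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keyword lines up in a column.
        lines.append(f"{left:>{width}}[{match.group(0)}]{right}")
    return lines

terms = ["justice system", "criminal justice system", "court"]
text = ("The justice system is slow. "
        "Reform of the criminal justice system is overdue.")

shown = filter_terms(terms, "just")
hits = kwic(text, "justice system", width=10)
```

Applying the filter "just" leaves the two terms that contain it, and kwic returns one context line for each of the two places where "justice system" occurs in the text.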

Click here for a description of how TerminologyExtractor works.

Applications

TerminologyExtractor can be applied in many areas, including:

Translation

Quickly extract terminology from a complete set of documents to speed up translation.

Technical writing

Establish the list of terms that everyone in the company should use. Identify terms that are used inconsistently across a set of documents.

Text summarization

Extract the most commonly used words and collocations as well as the list of proper nouns to provide an overview of what a text is about.

Text indexing

Automatically generate keyword lists that include collocations, not only single words.

If you are interested...

... Download a demo version of TerminologyExtractor.

... Look at example 1: output of TerminologyExtractor for a technical article.

... Look at example 2: output of TerminologyExtractor for an Internet RFC.