Traumwind - libTextCat - Lightweight text categorization

libTextCat is a library of functions that implement the classification technique described in Cavnar & Trenkle, "N-Gram-Based Text Categorization" [1]. It was developed primarily for language guessing, a task on which it is known to perform with near-perfect accuracy.

The central idea of the Cavnar & Trenkle technique is to calculate a "fingerprint" of a document with an unknown category, and compare this with the fingerprints of a number of documents of which the categories are known. The categories of the closest matches are output as the classification. A fingerprint is a list of the most frequent n-grams occurring in a document, ordered by frequency. Fingerprints are compared with a simple out-of-place metric. See the article for more details.
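To make the idea concrete, here is a minimal Python sketch of fingerprinting and the out-of-place comparison. It is not libTextCat's code or API: the function names, the underscore padding of words, the n-gram lengths (1 to 5), the fingerprint size (400), and the penalty for missing n-grams are all assumptions chosen for illustration.

```python
from collections import Counter


def fingerprint(text, max_n=5, top_k=400):
    """Return the top_k most frequent n-grams (n = 1..max_n), most frequent first.

    max_n and top_k are illustrative defaults, not values taken from the library.
    """
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # assumed convention: underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Order by descending frequency; break ties alphabetically so the result is stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_fp, cat_fp):
    """Sum, over the document's n-grams, how far each rank is from its rank in cat_fp.

    N-grams absent from cat_fp get a fixed penalty (here: len(cat_fp), an assumption).
    """
    cat_rank = {gram: rank for rank, gram in enumerate(cat_fp)}
    penalty = len(cat_fp)
    return sum(
        abs(rank - cat_rank[gram]) if gram in cat_rank else penalty
        for rank, gram in enumerate(doc_fp)
    )


def rank_categories(text, category_fps):
    """Return category names ordered from closest to farthest fingerprint."""
    doc_fp = fingerprint(text)
    return sorted(category_fps, key=lambda name: out_of_place(doc_fp, category_fps[name]))
```

A category fingerprint is built once from sample text, for instance `profiles = {"en": fingerprint(english_sample), "de": fingerprint(german_sample)}`, and `rank_categories(unknown_text, profiles)` then lists the categories with the closest match first. Real use needs far larger samples than a sentence or two.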

Considerable effort went into making this implementation fast and efficient. The language guesser processes over 100 documents/second on a simple PC, which makes it practical for many uses. It was developed for use in our web crawler and search engine software, in which it handles millions of documents a day.

by Martin

Martin Spernau
© 1994-2003