Automated Classification System

Download now

This package is licensed under a GPL-like Caltech license. Java 1.5 or later is required because the code makes heavy use of generics.

The software was originally written for use on C. elegans literature as part of Textpresso, but Textpresso is not required to use this system. The current deployment of this package can be viewed here. A proof-of-principle implementation that clusters the top 150 search result snippets from Yahoo is provided here. The design and methodology are described in a paper in BMC Bioinformatics.

Javadoc comments are in the source and provided here.


Usage:

Both SVM and clustering, as used currently for C. elegans

This is most easily done by following how the package is currently used. The cron job launches Java with Runner/AutoPilot as the entry class, which first checks that the necessary directories are in place. It then converts the Textpresso XML files into flat files containing sentences. These sentence files are then used to create files that record the number of occurrences of each word in each document. If you are not using Textpresso, you can modify the AutoPilot class to skip the XML conversion. In that case you will probably need to write your own function that prepares the flat text files from which the word counts are computed.
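If you are preparing your own flat text files, the word-counting step can be pictured with the minimal sketch below. It is not code from this package: the class name WordCountSketch, the method countWords, the one-sentence-per-line input layout, and the tokenization rule (lower-case, split on anything that is not an ASCII letter or digit) are all assumptions made for illustration. The counting done by the package itself may differ.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the package: counts word occurrences
// in a flat text source that is assumed to hold one sentence per line.
public class WordCountSketch {

    public static Map<String, Integer> countWords(Reader source) throws IOException {
        // TreeMap keeps the words in alphabetical order for readable output.
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        BufferedReader reader = new BufferedReader(source);
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumed tokenization: lower-case, then split on every run of
                // characters that are not ASCII letters or digits.
                for (String token : line.toLowerCase().split("[^a-z0-9]+")) {
                    if (token.length() == 0) {
                        continue; // split() yields an empty token if the line starts with a separator
                    }
                    Integer seen = counts.get(token);
                    counts.put(token, seen == null ? 1 : seen + 1);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        Reader source;
        if (args.length > 0) {
            source = new FileReader(args[0]);   // path to a flat sentence file
        } else {
            source = new StringReader("The worm moves.\nThe worm eats.");
        }
        // Print one "word<TAB>count" pair per line.
        for (Map.Entry<String, Integer> entry : countWords(source).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

The sketch avoids language features newer than Java 1.5 (no try-with-resources, no diamond operator) so that it compiles under the same requirement as the package. Run it with the path of a flat sentence file as the only argument, or with no argument to count the built-in sample text.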

Overview of classes used in cronjob

Just the clustering engine:

An example is provided by YahooTest and the classes in websearch. You will need a class to represent the documents that you want clustered. This class only needs to implement the methods declared in Util/ClusterDoc. You can then search for phrases the way YahooTest does, by calling PhraseFinder.findAndAddPhrases, and then create the hierarchy with TreeHelper.createHierarchyByCrossSimilarity (please see the method makeClusters in YahooTest for a concrete example).

The image below gives a graphical overview of the classes that represent documents. TestDoc and KnownDoc are used in the current deployment for C. elegans literature.

Diagram of classes representing documents