This package is licensed under a GPL-like Caltech license.
Java 1.5 is required because generics are used heavily.
The software was originally written for use on C. elegans literature as
part of Textpresso, but Textpresso is not required to use this system.
The current deployment of this package can be viewed here. A
proof-of-principle implementation that clusters the top 150 search
result snippets from Yahoo is provided here. The design and methodology
are described in a paper in BMC Bioinformatics.
Javadoc comments are in the source and are also provided here.
Usage:
Both SVM and clustering, as used currently for C. elegans:
This is most easily done by following how the package is currently
used. The cron job launches Java with Runner/AutoPilot as the entry
class, which checks that the necessary directories are in place. It
then converts the Textpresso XML files into flat files of sentences.
These sentence files are then used to create files that count the
number of occurrences of each word in the documents. If you are not
using Textpresso, you can simply modify the AutoPilot class so that it
does not convert the XML files. Instead, you will probably need to
write your own function that prepares flat text files whose words can
then be counted.
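If you write your own preparation step, the word-counting stage can be
sketched roughly as follows. This is only an illustration and not the
code the package uses: the class name WordCountSketch, the method
countWords, the one-sentence-per-line input layout, and the
tokenization rule are all assumptions made for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the package: counts word occurrences
// in flat text that is assumed to hold one sentence per line.
class WordCountSketch {

    // Reads sentences line by line and tallies lower-cased word counts.
    static Map<String, Integer> countWords(BufferedReader sentences) throws IOException {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        String line;
        while ((line = sentences.readLine()) != null) {
            // Assumption: words are separated by runs of characters that
            // are neither letters nor digits.
            for (String token : line.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (token.length() == 0) {
                    continue; // split can yield a leading empty token
                }
                Integer seen = counts.get(token);
                counts.put(token, seen == null ? 1 : seen + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a flat file of sentences; a real run would wrap a FileReader.
        String text = "The gene is expressed in muscle.\nThe gene is required.";
        Map<String, Integer> counts =
                countWords(new BufferedReader(new StringReader(text)));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

Note that this tokenization splits hyphenated names such as gene
identifiers into separate tokens; a different rule may suit your
literature better.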
Just the clustering engine:
An example is provided by YahooTest and the classes in websearch.
You will need a class to represent the documents that you want
clustered. This class just needs to implement the methods defined in
Util/ClusterDoc. You can then search for phrases the way that YahooTest
does by calling PhraseFinder.findAndAddPhrases and then create the
hierarchy with TreeHelper.createHierarchyByCrossSimilarity (please see
the method makeClusters in YahooTest for a concrete example).
The image below should provide a graphical overview of the classes that
represent documents. TestDoc and KnownDoc are used in the current
deployment for C. elegans literature.