benchmark
Class ClusterableReuters

java.lang.Object
  extended by benchmark.ClusterableReuters
All Implemented Interfaces:
ClusterDoc, java.lang.Comparable<ClusterDoc>

public class ClusterableReuters
extends java.lang.Object
implements ClusterDoc


Constructor Summary
ClusterableReuters(int id, java.lang.String text, VectorManager vm)
           
 
Method Summary
 void addTermSetCount(Phrase termSet, int n)
          This document should record how frequently this termSet occurred
 void checkSourceExists()
          Throws an exception if this file cannot be clustered
static java.lang.String clean(java.lang.String s)
           
 int compareTo(ClusterDoc arg0)
           
 void destroyLocalDoc()
          After finding how often all the phrases are in this doc, this method should allow the supporting document to be released to free up memory.
 java.lang.String[][] getFixedWordSentences()
          Each String should be fixed by VectorManager before being returned
 int getId()
           
 int[][] getIdxSentences(VectorManager vm)
          Each entry is the integer index of the corresponding word.
 int getNumInstancesOfTermSet(Phrase s)
          Returns the number of instances of the given term set recorded for this document
 java.lang.String[][] getSentences()
           
 double getTermSetsSupported()
           
 java.lang.String getText()
           
 java.lang.String getTopic()
           
 boolean isJunkPhrase(java.lang.String phrase)
          Added to allow differentiation between phrases of scientific articles and general search results
 void loadWindowedDoc()
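          Initially, the idea was to support proximity windows (e.g. phrases of 5 words excluding stopwords), but just using each sentence as a window yields good results with much better performance.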
 void setTopic(java.lang.String topic)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ClusterableReuters

public ClusterableReuters(int id,
                          java.lang.String text,
                          VectorManager vm)
Method Detail

addTermSetCount

public void addTermSetCount(Phrase termSet,
                            int n)
Description copied from interface: ClusterDoc
This document should record how frequently this termSet occurred

Specified by:
addTermSetCount in interface ClusterDoc
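
As an illustration of the recording contract described above, the following is a minimal sketch rather than the actual implementation. The class name TermSetCountSketch is hypothetical, a plain String stands in for Phrase, and it is an assumption that repeated calls for the same term set accumulate rather than overwrite.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the counting part of a ClusterDoc.
// The real methods take a Phrase; a String is used here to stay self-contained.
class TermSetCountSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    // Assumption: counts for the same term set add up across calls.
    void addTermSetCount(String termSet, int n) {
        counts.merge(termSet, n, Integer::sum);
    }

    // A term set that was never recorded is reported as zero instances.
    int getNumInstancesOfTermSet(String termSet) {
        return counts.getOrDefault(termSet, 0);
    }

    public static void main(String[] args) {
        TermSetCountSketch doc = new TermSetCountSketch();
        doc.addTermSetCount("oil price", 2);
        doc.addTermSetCount("oil price", 1);
        System.out.println(doc.getNumInstancesOfTermSet("oil price"));    // prints 3
        System.out.println(doc.getNumInstancesOfTermSet("grain export")); // prints 0
    }
}
```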

getNumInstancesOfTermSet

public int getNumInstancesOfTermSet(Phrase s)
Description copied from interface: ClusterDoc
Returns the number of instances of the given term set recorded for this document

Specified by:
getNumInstancesOfTermSet in interface ClusterDoc

getTermSetsSupported

public double getTermSetsSupported()
Specified by:
getTermSetsSupported in interface ClusterDoc

isJunkPhrase

public boolean isJunkPhrase(java.lang.String phrase)
Description copied from interface: ClusterDoc
Added to allow differentiation between phrases of scientific articles and general search results

Specified by:
isJunkPhrase in interface ClusterDoc
Parameters:
phrase - Space-separated words

checkSourceExists

public void checkSourceExists()
                       throws java.io.FileNotFoundException
Description copied from interface: ClusterDoc
Throws an exception if this file cannot be clustered

Specified by:
checkSourceExists in interface ClusterDoc
Throws:
java.io.FileNotFoundException

getFixedWordSentences

public java.lang.String[][] getFixedWordSentences()
                                           throws java.io.FileNotFoundException,
                                                  java.io.IOException
Description copied from interface: ClusterDoc
Each String should be fixed by VectorManager before being returned

Specified by:
getFixedWordSentences in interface ClusterDoc
Throws:
java.io.FileNotFoundException
java.io.IOException

getIdxSentences

public int[][] getIdxSentences(VectorManager vm)
Description copied from interface: ClusterDoc
Each entry is the integer index of the corresponding word. This method may let PhraseFinder save time while finding frequent phrases.
Note that the dimensions returned here should be exactly the same as those returned by getFixedWordSentences().
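
The dimension requirement can be sketched as follows. This is only an illustration under stated assumptions: IdxSentencesSketch, indexOf and toIdxSentences are hypothetical names, and a simple word-to-index map stands in for VectorManager, whose real API is not shown on this page.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: maps each word of each sentence to an integer index,
// keeping the same row and column counts as the input array.
class IdxSentencesSketch {
    // Assumed stand-in for VectorManager: assigns indices in first-seen order.
    private final Map<String, Integer> vocabulary = new HashMap<>();

    int indexOf(String word) {
        Integer existing = vocabulary.get(word);
        if (existing != null) {
            return existing;
        }
        int idx = vocabulary.size();
        vocabulary.put(word, idx);
        return idx;
    }

    int[][] toIdxSentences(String[][] sentences) {
        int[][] idx = new int[sentences.length][];
        for (int i = 0; i < sentences.length; i++) {
            idx[i] = new int[sentences[i].length];
            for (int j = 0; j < sentences[i].length; j++) {
                idx[i][j] = indexOf(sentences[i][j]);
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        String[][] sentences = { {"oil", "price", "rise"}, {"oil", "export"} };
        int[][] idx = new IdxSentencesSketch().toIdxSentences(sentences);
        System.out.println(Arrays.deepToString(idx)); // prints [[0, 1, 2], [0, 3]]
    }
}
```

Comparing integers instead of strings is one plausible reason such a representation could speed up phrase finding.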

Specified by:
getIdxSentences in interface ClusterDoc

getSentences

public java.lang.String[][] getSentences()
                                  throws java.io.FileNotFoundException,
                                         java.io.IOException
Specified by:
getSentences in interface ClusterDoc
Throws:
java.io.FileNotFoundException
java.io.IOException

loadWindowedDoc

public void loadWindowedDoc()
Description copied from interface: ClusterDoc
Initially, the idea was to support proximity windows (e.g. phrases of 5 words excluding stopwords), but just using each sentence as a window yields good results with much better performance.

Specified by:
loadWindowedDoc in interface ClusterDoc
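
The sentence-as-window idea can be sketched roughly as below. The class and method names are hypothetical, and the splitting rule (sentence ends at '.', '!' or '?', words separated by whitespace) is an assumption for illustration; the actual class may tokenize differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: each sentence of the text becomes one window of words.
class SentenceWindowSketch {
    static String[][] sentenceWindows(String text) {
        List<String[]> windows = new ArrayList<>();
        // Assumed sentence boundary: one or more of '.', '!' or '?'.
        for (String sentence : text.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty()) {
                windows.add(trimmed.split("\\s+"));
            }
        }
        return windows.toArray(new String[0][]);
    }

    public static void main(String[] args) {
        String[][] windows = sentenceWindows("Oil prices rose. Exports fell sharply!");
        System.out.println(Arrays.deepToString(windows));
        // prints [[Oil, prices, rose], [Exports, fell, sharply]]
    }
}
```

Unlike a sliding proximity window, each word here belongs to exactly one window, so the text is scanned only once.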

destroyLocalDoc

public void destroyLocalDoc()
Description copied from interface: ClusterDoc
After finding how often all the phrases are in this doc, this method should allow the supporting document to be released to free up memory.

Specified by:
destroyLocalDoc in interface ClusterDoc
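
The count-then-release pattern described above might look like the following sketch. ReleasableDocSketch, countWord, getCount and isReleased are hypothetical names, and single words stand in for phrases; only the method name destroyLocalDoc comes from this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the bulky sentence data is dropped once counting is
// done, while the much smaller count table is kept.
class ReleasableDocSketch {
    private String[][] sentences; // only needed while counting
    private final Map<String, Integer> counts = new HashMap<>();

    ReleasableDocSketch(String[][] sentences) {
        this.sentences = sentences;
    }

    // Counts how often a word occurs across all sentences and stores the result.
    void countWord(String word) {
        if (sentences == null) {
            throw new IllegalStateException("document already released");
        }
        int n = 0;
        for (String[] sentence : sentences) {
            for (String w : sentence) {
                if (w.equals(word)) {
                    n++;
                }
            }
        }
        counts.put(word, n);
    }

    // Drops the reference so the garbage collector can reclaim the text.
    void destroyLocalDoc() {
        sentences = null;
    }

    boolean isReleased() {
        return sentences == null;
    }

    int getCount(String word) {
        return counts.getOrDefault(word, 0);
    }

    public static void main(String[] args) {
        ReleasableDocSketch doc =
                new ReleasableDocSketch(new String[][] { {"oil", "price"}, {"oil"} });
        doc.countWord("oil");
        doc.destroyLocalDoc();
        System.out.println(doc.getCount("oil")); // prints 2
    }
}
```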

compareTo

public int compareTo(ClusterDoc arg0)
Specified by:
compareTo in interface java.lang.Comparable<ClusterDoc>

getText

public java.lang.String getText()

clean

public static java.lang.String clean(java.lang.String s)

getId

public int getId()

getTopic

public java.lang.String getTopic()

setTopic

public void setTopic(java.lang.String topic)