cluster
Interface ClusterDoc

All Superinterfaces:
java.lang.Comparable<ClusterDoc>
All Known Implementing Classes:
ClusterableReuters, KnownDoc, SearchResultDoc, TestDoc, TestReuters

public interface ClusterDoc
extends java.lang.Comparable<ClusterDoc>

Represents a document that can undergo unsupervised clustering.
Almost any document can be clustered, provided it implements these methods.
Implementations must also be able to compare themselves to one another (for sorting purposes).

Author:
davidc

Method Summary
 void addTermSetCount(Phrase termSet, int n)
          This document should record how frequently this termSet occurred
 void checkSourceExists()
          Throws an exception if this file cannot be clustered
 void destroyLocalDoc()
          After counting how often each phrase occurs in this document, this method should allow the supporting document to be released to free up memory.
 java.lang.String[][] getFixedWordSentences()
          Each String should be fixed by VectorManager before being returned
 int[][] getIdxSentences(VectorManager vm)
          Each entry is the integer index of the corresponding word.
 int getNumInstancesOfTermSet(Phrase s)
          Returns how many times the given term set has been recorded for this document
 java.lang.String[][] getSentences()
           
 double getTermSetsSupported()
           
 boolean isJunkPhrase(java.lang.String w)
          Added to allow differentiation between phrases of scientific articles and general search results
 void loadWindowedDoc()
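          Initially, the idea was to support proximity windows (e.g. phrases of 5 words excluding stopwords), but using each sentence as a window yields good results with much better performance.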
 
Methods inherited from interface java.lang.Comparable
compareTo
 

Method Detail

getTermSetsSupported

double getTermSetsSupported()

getNumInstancesOfTermSet

int getNumInstancesOfTermSet(Phrase s)
Returns how many times the given term set has been recorded for this document

Returns:

addTermSetCount

void addTermSetCount(Phrase termSet,
                     int n)
This document should record how frequently this termSet occurred

Parameters:
termSet - the term set (phrase) whose occurrences are being recorded
n - the number of occurrences to record
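
A minimal sketch of how an implementing class might back addTermSetCount and getNumInstancesOfTermSet with a map. The Phrase type is not documented on this page, so a plain String stands in for it here. The class name TermSetCounter and the decision to accumulate counts across calls (rather than overwrite them) are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a String key stands in for the undocumented Phrase type.
class TermSetCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    // Record n more occurrences of termSet (assumption: counts accumulate).
    void addTermSetCount(String termSet, int n) {
        counts.merge(termSet, n, Integer::sum);
    }

    // Number of recorded occurrences, or 0 if the term set was never added.
    int getNumInstancesOfTermSet(String termSet) {
        return counts.getOrDefault(termSet, 0);
    }
}
```

Returning 0 for an unseen term set keeps callers from having to check for missing entries. Whether the real implementations behave this way is not stated in the source.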

isJunkPhrase

boolean isJunkPhrase(java.lang.String w)
Added to allow differentiation between phrases of scientific articles and general search results

Parameters:
w - Space-separated words
Returns:

checkSourceExists

void checkSourceExists()
                       throws java.io.FileNotFoundException
Throws an exception if this file cannot be clustered

Throws:
java.io.FileNotFoundException

getFixedWordSentences

java.lang.String[][] getFixedWordSentences()
                                           throws java.io.FileNotFoundException,
                                                  java.io.IOException
Each String should be fixed by VectorManager before being returned

Returns:
Throws:
java.io.FileNotFoundException
java.io.IOException

getIdxSentences

int[][] getIdxSentences(VectorManager vm)
Each entry is the integer index of the corresponding word. This method may allow PhraseFinder to save time while finding frequent phrases.
Note that the dimensions returned here should be exactly the same as those returned by getFixedWordSentences().

Parameters:
vm -
Returns:
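
The dimension requirement above can be illustrated with a short sketch that converts sentences of words into sentences of integer ids. VectorManager's API is not shown on this page, so the WordIndex class below is a hypothetical stand-in that assigns ids in order of first appearance. IdxSentences and toIdx are likewise illustrative names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for VectorManager: gives each distinct word an integer id.
class WordIndex {
    private final Map<String, Integer> ids = new HashMap<>();

    // Return the id for a word, assigning the next free id on first sight.
    int idOf(String word) {
        Integer id = ids.get(word);
        if (id == null) {
            id = ids.size();
            ids.put(word, id);
        }
        return id;
    }
}

class IdxSentences {
    // Map every word to its id, preserving both array dimensions exactly.
    static int[][] toIdx(String[][] sentences, WordIndex index) {
        int[][] out = new int[sentences.length][];
        for (int i = 0; i < sentences.length; i++) {
            out[i] = new int[sentences[i].length];
            for (int j = 0; j < sentences[i].length; j++) {
                out[i][j] = index.idOf(sentences[i][j]);
            }
        }
        return out;
    }
}
```

Because out[i][j] always corresponds to sentences[i][j], a phrase finder could compare integers instead of strings without losing track of word positions.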

getSentences

java.lang.String[][] getSentences()
                                  throws java.io.FileNotFoundException,
                                         java.io.IOException
Throws:
java.io.FileNotFoundException
java.io.IOException

loadWindowedDoc

void loadWindowedDoc()
Initially, the idea was to support proximity windows (e.g. phrases of 5 words excluding stopwords), but using each sentence as a window yields good results with much better performance.

Throws:
java.io.FileNotFoundException
java.io.IOException
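
As a rough sketch of the sentence-as-window idea, the helper below splits raw text into sentences and treats the words of each sentence as one window. The class name SentenceWindows, the rule that sentences end at '.', '!' or '?', and whitespace tokenization are all assumptions; the source does not say how the real implementations split text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: each sentence becomes one window of words.
class SentenceWindows {
    static String[][] split(String text) {
        List<String[]> windows = new ArrayList<>();
        // Assumption: a sentence ends at one or more of '.', '!' or '?'.
        for (String sentence : text.split("[.!?]+")) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty()) {
                // Assumption: words are separated by whitespace.
                windows.add(trimmed.split("\\s+"));
            }
        }
        return windows.toArray(new String[0][]);
    }
}
```

The result has the same String[][] shape as the value returned by getSentences(), with one row per window.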

destroyLocalDoc

void destroyLocalDoc()
After counting how often each phrase occurs in this document, this method should allow the supporting document to be released to free up memory.
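
One way an implementation might satisfy this contract is to drop its reference to the bulky text while keeping small derived values. The sketch below is illustrative only: ReleasableDoc, isLoaded and getSentenceCount are hypothetical names, not part of this interface.

```java
// Hypothetical document holder: keeps a small summary, drops the bulky text on request.
class ReleasableDoc {
    private String[][] sentences;     // large; only needed while counting phrases
    private final int sentenceCount;  // small summary retained after release

    ReleasableDoc(String[][] sentences) {
        this.sentences = sentences;
        this.sentenceCount = sentences.length;
    }

    // Drop the reference so the garbage collector can reclaim the text.
    void destroyLocalDoc() {
        sentences = null;
    }

    boolean isLoaded() {
        return sentences != null;
    }

    int getSentenceCount() {
        return sentenceCount;
    }
}
```

Memory is only reclaimed if no other object still holds a reference to the same array.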