cluster
Class PhraseSupporter

java.lang.Object
  extended by cluster.PhraseSupporter

public class PhraseSupporter
extends java.lang.Object

Utility methods for inspecting documents. Use PhraseFinder first to identify the phrases that form the candidate clusters, then check those clusters against the documents with the methods provided here.
Most clients will not need to call these methods directly and should instead call methods from TreeHelper, which can create the hierarchy for you.

Author:
davidc

Constructor Summary
PhraseSupporter()
           
 
Method Summary
static double calculateRelevance(java.util.Set<ClusterDoc> docs, Phrase originalSet, Phrase combined)
          The interpretation is that if P(B|A) -> 1, then phrase B is seen every time phrase A occurs.
static double calculateRelevanceRelaxed(java.util.Set<ClusterDoc> docs, Phrase originalSet, Phrase combined)
          An experimental way of calculating how two clusters are related.
static boolean checkSet(java.util.Set<ClusterDoc> docs, Phrase termSet, double cutoff)
           
static java.util.List<Phrase> checkSets(java.util.List<? extends ClusterDoc> docs, java.util.List<Phrase> candidates, int sufficientDocs)
          Records in TestDoc the number of terms supported
Records in TermSet the documents that cover each term
static double findRelevanceRelaxed(ClusterDoc d, Phrase set, Phrase required, int slackNum)
          An experimental calculation of the relationship between two phrases.
Abandoned in favor of embedding alternative phrases within Phrase.
static int getNumInstances(java.util.Set<ClusterDoc> docs, Phrase set)
           
static int getNumInstancesOfCombinedSet(ClusterDoc d, Phrase setA, Phrase setB)
           
static int getNumInstancesOfCombinedSet(java.util.Set<ClusterDoc> docs, Phrase thisI, Phrase thisJ)
           
static int getNumInstancesOfSet(ClusterDoc d, Phrase set)
          Returns how many windows (i.e., sentences in the current implementation) contain the phrase.
static int getNumInstancesOfSetRelaxed(ClusterDoc d, Phrase set, Phrase required, int slackNum)
          A sentence counts as containing set if at most slackNum words of the combined phrase are missing and none of the missing words are in required.
static int getNumInstancesOfSetSingle(ClusterDoc d, Phrase set)
          A faster implementation when there is only one set
static int numDocsWithSet(java.util.Collection<ClusterDoc> docs, Phrase p)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PhraseSupporter

public PhraseSupporter()
Method Detail

checkSets

public static java.util.List<Phrase> checkSets(java.util.List<? extends ClusterDoc> docs,
                                               java.util.List<Phrase> candidates,
                                               int sufficientDocs)
                                        throws java.io.IOException
Records in TestDoc the number of terms supported
Records in TermSet the documents that cover each term

Parameters:
candidates -
sufficientDocs - a term is kept if it appears in at least this many docs
Returns:
those phrases that satisfy the criteria
Throws:
java.io.IOException
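
The filtering step described above can be sketched as follows. This is only an illustration under stated assumptions, not the class's actual implementation: documents and phrases are modelled as plain word lists instead of ClusterDoc and Phrase, and SupportFilterSketch, coveringDocs and supported are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: documents and phrases are plain word lists.
class SupportFilterSketch {
    // Maps each candidate phrase to the indices of the documents that contain all of its words.
    static Map<List<String>, List<Integer>> coveringDocs(List<List<String>> docs,
                                                         List<List<String>> candidates) {
        Map<List<String>, List<Integer>> cover = new LinkedHashMap<>();
        for (List<String> phrase : candidates) {
            List<Integer> hits = new ArrayList<>();
            for (int i = 0; i < docs.size(); i++) {
                if (docs.get(i).containsAll(phrase)) {
                    hits.add(i);
                }
            }
            cover.put(phrase, hits);
        }
        return cover;
    }

    // Keeps only the phrases found in at least sufficientDocs documents.
    static List<List<String>> supported(List<List<String>> docs,
                                        List<List<String>> candidates,
                                        int sufficientDocs) {
        List<List<String>> kept = new ArrayList<>();
        for (Map.Entry<List<String>, List<Integer>> e : coveringDocs(docs, candidates).entrySet()) {
            if (e.getValue().size() >= sufficientDocs) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("solar", "panel", "cost"),
            Arrays.asList("solar", "panel"),
            Arrays.asList("wind", "turbine"));
        List<List<String>> candidates = Arrays.asList(
            Arrays.asList("solar", "panel"),
            Arrays.asList("wind", "turbine"));
        System.out.println(supported(docs, candidates, 2)); // prints [[solar, panel]]
    }
}
```

The sketch uses "at least" (>=) for the threshold, matching the wording of the sufficientDocs parameter.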

calculateRelevance

public static double calculateRelevance(java.util.Set<ClusterDoc> docs,
                                        Phrase originalSet,
                                        Phrase combined)
The interpretation is that if P(B|A) -> 1, then phrase B is seen every time phrase A occurs. This implies that B is a child concept of phrase A.

Parameters:
docs -
originalSet -
combined -
Returns:
number of times the combined phrase is seen divided by the number of times the original phrase is seen
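
The ratio described above can be illustrated with a minimal sketch. It assumes a simplified model that is not part of this API: the document set is flattened into a list of sentences, each a list of words, and phrases are word lists. RelevanceSketch, countSentencesWith and relevance are hypothetical names, and returning 0 when the original phrase never occurs is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: sentences and phrases are plain word lists.
class RelevanceSketch {
    // Counts the sentences that contain every word of the phrase.
    static int countSentencesWith(List<List<String>> sentences, List<String> phrase) {
        int count = 0;
        for (List<String> sentence : sentences) {
            if (sentence.containsAll(phrase)) {
                count++;
            }
        }
        return count;
    }

    // Estimates P(combined | original) as count(combined) / count(original).
    static double relevance(List<List<String>> sentences,
                            List<String> original,
                            List<String> combined) {
        int originalCount = countSentencesWith(sentences, original);
        if (originalCount == 0) {
            return 0.0; // assumption: no evidence means no relevance
        }
        return (double) countSentencesWith(sentences, combined) / originalCount;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("neural", "network", "training"),
            Arrays.asList("neural", "network"),
            Arrays.asList("decision", "tree"));
        List<String> a = Arrays.asList("neural");
        List<String> ab = Arrays.asList("neural", "network");
        System.out.println(relevance(sentences, a, ab)); // prints 1.0
    }
}
```

In this toy input "network" appears in every sentence that contains "neural", so the ratio reaches 1.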

getNumInstancesOfCombinedSet

public static int getNumInstancesOfCombinedSet(java.util.Set<ClusterDoc> docs,
                                               Phrase thisI,
                                               Phrase thisJ)

getNumInstances

public static int getNumInstances(java.util.Set<ClusterDoc> docs,
                                  Phrase set)
Parameters:
docs -
set -
Returns:
number of instances of set

calculateRelevanceRelaxed

public static double calculateRelevanceRelaxed(java.util.Set<ClusterDoc> docs,
                                               Phrase originalSet,
                                               Phrase combined)
An experimental way of calculating how two clusters are related.

Parameters:
docs -
originalSet -
combined -
Returns:

numDocsWithSet

public static int numDocsWithSet(java.util.Collection<ClusterDoc> docs,
                                 Phrase p)
Parameters:
docs -
p -
Returns:
Number of documents in docs that contain p

checkSet

public static boolean checkSet(java.util.Set<ClusterDoc> docs,
                               Phrase termSet,
                               double cutoff)
Parameters:
docs -
termSet -
cutoff - double between 0 and 1 (fraction of size of docs needed to pass)
Returns:
true if the number of docs containing termSet is greater than docs.size() * cutoff
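
A minimal sketch of the cutoff test follows, assuming documents and phrases are plain word lists instead of ClusterDoc and Phrase. CutoffSketch, numDocsWith and passesCutoff are hypothetical names, and the strict comparison follows the wording of the return description above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: documents and phrases are plain word lists.
class CutoffSketch {
    // Counts the documents that contain every word of the phrase.
    static int numDocsWith(List<List<String>> docs, List<String> phrase) {
        int count = 0;
        for (List<String> doc : docs) {
            if (doc.containsAll(phrase)) {
                count++;
            }
        }
        return count;
    }

    // True if strictly more than cutoff * docs.size() documents contain the phrase.
    static boolean passesCutoff(List<List<String>> docs, List<String> phrase, double cutoff) {
        return numDocsWith(docs, phrase) > docs.size() * cutoff;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("solar", "panel"),
            Arrays.asList("solar", "cell"),
            Arrays.asList("wind", "turbine"),
            Arrays.asList("coal", "plant"));
        List<String> phrase = Arrays.asList("solar");
        System.out.println(passesCutoff(docs, phrase, 0.5)); // prints false: 2 is not more than 2.0
        System.out.println(passesCutoff(docs, phrase, 0.4)); // prints true: 2 is more than 1.6
    }
}
```

Because the comparison is strict, a phrase found in exactly half of the documents does not pass a cutoff of 0.5.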

getNumInstancesOfSetSingle

public static int getNumInstancesOfSetSingle(ClusterDoc d,
                                             Phrase set)
                                      throws java.io.IOException
A faster implementation when there is only one set

Parameters:
d -
set -
Returns:
Throws:
java.io.IOException

getNumInstancesOfSet

public static int getNumInstancesOfSet(ClusterDoc d,
                                       Phrase set)
                                throws java.io.IOException
Returns how many windows (i.e., sentences in the current implementation) contain the phrase.
Note that words that occur multiple times in the phrase must also occur multiple times in the sentence.

Parameters:
d -
set -
Returns:
Throws:
java.io.IOException
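
The window-counting rule for getNumInstancesOfSet, including the note on repeated words, can be sketched as a multiset comparison. This is an illustration only: sentences and phrases are modelled as word lists, WindowCountSketch and its methods are hypothetical names, and reading "multiple times" as "at least as many times as in the phrase" is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: sentences and phrases are plain word lists.
class WindowCountSketch {
    // Builds a word -> occurrence count table.
    static Map<String, Integer> wordCounts(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // True if the sentence holds each phrase word at least as often as the phrase does.
    static boolean sentenceContains(List<String> sentence, List<String> phrase) {
        Map<String, Integer> have = wordCounts(sentence);
        for (Map.Entry<String, Integer> need : wordCounts(phrase).entrySet()) {
            if (have.getOrDefault(need.getKey(), 0) < need.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Counts the sentences (windows) that contain the phrase.
    static int countWindows(List<List<String>> sentences, List<String> phrase) {
        int count = 0;
        for (List<String> sentence : sentences) {
            if (sentenceContains(sentence, phrase)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("to", "be", "or", "not", "to", "be"),
            Arrays.asList("to", "be", "here"));
        List<String> phrase = Arrays.asList("to", "be", "to");
        System.out.println(countWindows(sentences, phrase)); // prints 1
    }
}
```

The second sentence is rejected because it contains "to" only once while the phrase contains it twice.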

getNumInstancesOfCombinedSet

public static int getNumInstancesOfCombinedSet(ClusterDoc d,
                                               Phrase setA,
                                               Phrase setB)
                                        throws java.io.IOException
Throws:
java.io.IOException

getNumInstancesOfSetRelaxed

public static int getNumInstancesOfSetRelaxed(ClusterDoc d,
                                              Phrase set,
                                              Phrase required,
                                              int slackNum)
                                       throws java.io.IOException
A sentence counts as containing set if at most slackNum words of the combined phrase are missing and none of the missing words are in required.

Parameters:
d -
set -
required -
slackNum -
Returns:
how many times set is present
Throws:
java.io.IOException
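
The relaxed matching rule can be sketched as below. The sketch rests on assumptions that the documentation does not confirm: the phrase passed in is treated as the combined phrase, required is treated as a subset of it, all phrase words are distinct, and sentences and phrases are plain word lists. RelaxedMatchSketch, matchesRelaxed and countRelaxed are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: sentences and phrases are plain word lists of distinct words.
class RelaxedMatchSketch {
    // True if at most slackNum phrase words are missing and no missing word is required.
    static boolean matchesRelaxed(List<String> sentence, List<String> phrase,
                                  List<String> required, int slackNum) {
        int missing = 0;
        for (String word : phrase) {
            if (!sentence.contains(word)) {
                if (required.contains(word)) {
                    return false; // a required word may never be skipped
                }
                missing++;
            }
        }
        return missing <= slackNum;
    }

    // Counts the sentences that match the phrase under the relaxed rule.
    static int countRelaxed(List<List<String>> sentences, List<String> phrase,
                            List<String> required, int slackNum) {
        int count = 0;
        for (List<String> sentence : sentences) {
            if (matchesRelaxed(sentence, phrase, required, slackNum)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("deep", "neural", "network"), // nothing missing
            Arrays.asList("deep", "network", "model"),  // "neural" missing, within slack
            Arrays.asList("deep", "neural", "model"),   // required "network" missing
            Arrays.asList("network", "model"));         // two words missing, over slack
        List<String> phrase = Arrays.asList("deep", "neural", "network");
        List<String> required = Arrays.asList("network");
        System.out.println(countRelaxed(sentences, phrase, required, 1)); // prints 2
    }
}
```

With slackNum set to 0 the rule reduces to requiring every phrase word, so only the first sentence would match.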

findRelevanceRelaxed

public static double findRelevanceRelaxed(ClusterDoc d,
                                          Phrase set,
                                          Phrase required,
                                          int slackNum)
                                   throws java.io.IOException
An experimental calculation of the relationship between two phrases.
Abandoned in favor of embedding alternative phrases within Phrase.

Parameters:
d -
set -
required -
slackNum -
Returns:
Throws:
java.io.IOException