Util
Class TestDoc

java.lang.Object
  extended by Util.TestDoc
All Implemented Interfaces:
ClusterDoc, java.lang.Comparable<ClusterDoc>, SVMTestable, DocIdentifiable
Direct Known Subclasses:
KnownDoc, TestReuters

public class TestDoc
extends java.lang.Object
implements ClusterDoc, SVMTestable

Represents documents that we wish to classify or cluster.

For classification:
Register the memberships for all categories first (registerMembership), and only then query them with isMemberOf. Querying after registration is complete guarantees that each paper belongs to at least one category.
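
The register-then-query order can be illustrated with a minimal sketch. It is not the actual TestDoc implementation: the class name MembershipSketch, the two-argument registerMembership, the 0.0 decision threshold, and the fallback to the best-scoring category are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for TestDoc's membership bookkeeping (assumption, not the real class). */
final class MembershipSketch {
    // Assumed layout: one score per category, with 0.0 as the decision threshold.
    private final Map<Integer, Double> scores = new HashMap<>();

    void registerMembership(int cat, double val) {
        scores.put(cat, val);
    }

    boolean isMemberOf(int cat) {
        Double val = scores.get(cat);
        if (val == null) {
            return false; // nothing registered for this category
        }
        if (val > 0.0) {
            return true; // passes the assumed threshold on its own
        }
        // Assumed fallback: if no category passes, accept the best-scoring one
        // so that the document still belongs to at least one category.
        boolean anyPositive = false;
        int best = cat;
        double bestVal = val;
        for (Map.Entry<Integer, Double> e : scores.entrySet()) {
            if (e.getValue() > 0.0) {
                anyPositive = true;
            }
            if (e.getValue() > bestVal) {
                best = e.getKey();
                bestVal = e.getValue();
            }
        }
        return !anyPositive && best == cat;
    }
}
```

The fallback only gives a stable answer once every category has been registered, which is why the order matters: a query made too early could name a "best" category that a later registration overrides.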

Author:
davidc

Constructor Summary
TestDoc(java.lang.String wbid)
           
 
Method Summary
 void addTermSetCount(Phrase termSet, int n)
          This document should record how frequently this termSet occurred
 void checkSourceExists()
          Throws an exception if this file cannot be clustered
 int compareTo(ClusterDoc arg0)
           
 int compareTo(TestDoc arg0)
           
 void destroyLocalDoc()
          Once the frequency of every phrase in this doc has been found, this method should allow the supporting document to be released to free up memory.
 java.lang.StringBuffer getAbstract()
           
 java.lang.String getDocId()
           
 double getExactMembership(int cat)
           
 java.lang.String[][] getFixedWordSentences()
          Each String should be fixed by VectorManager before being returned
 int[][] getIdxSentences(VectorManager vm)
          Each entry is the integer that represents the corresponding word.
 int getNumInstancesOfTermSet(Phrase s)
          Returns the number of instances of the term set s recorded for this document
 java.lang.String[][] getSentences()
           
 double getTermSetsSupported()
           
 java.lang.String getTitle()
           
 java.lang.String getWbid()
           
 java.util.List<java.lang.String>[] getWindows()
           
 XMLDoc getXMLDoc()
           
 boolean isJunkPhrase(java.lang.String phrase)
          Added to allow differentiation between phrases of scientific articles and general search results
 boolean isMemberOf(int cat)
           
 void loadWindowedDoc()
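          Initially, the idea was to support proximity windows (e.g. phrases of 5 words excluding stopwords), but just using each sentence as a window yields good results with much better performance.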
 void readyDoc(VectorManager titlevm, VectorManager vm, java.lang.String src)
           
 void readyDocLocally(VectorManager titleVM, VectorManager articleVm, java.lang.String src)
           
 void readyTitleOnly(VectorManager titleVM)
           
 void registerMembership(int cat, double val, int source)
          The source parameter was added so that a document can be classified by title and by article simultaneously, with the two results weighted at the end.
 void setWbid(java.lang.String wbid)
           
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

TestDoc

public TestDoc(java.lang.String wbid)
Method Detail

registerMembership

public void registerMembership(int cat,
                               double val,
                               int source)
The source parameter was added so that a document can be classified by title and by article simultaneously, with the two results weighted at the end.

Specified by:
registerMembership in interface SVMTestable
Parameters:
cat -
val -
source - The guarantee that a paper belongs to at least one category is evaluated on the results registered with source 0 (i.e., this functionality can be disabled by never setting source to 0)
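
One way to picture the per-source weighting is the sketch below. It is only an illustration under stated assumptions: the class SourceWeightingSketch, the combinedScore method, and the weight map are hypothetical and do not appear in this API, and the documentation does not say which source id stands for the title or the article.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of keeping scores per source and weighting them at the end. */
final class SourceWeightingSketch {
    // scores.get(source).get(cat) is the value registered for that source and category.
    private final Map<Integer, Map<Integer, Double>> scores = new HashMap<>();

    void registerMembership(int cat, double val, int source) {
        Map<Integer, Double> perCategory = scores.get(source);
        if (perCategory == null) {
            perCategory = new HashMap<>();
            scores.put(source, perCategory);
        }
        perCategory.put(cat, val);
    }

    /** Weighted sum over sources; a source without a weight or a score contributes nothing. */
    double combinedScore(int cat, Map<Integer, Double> weights) {
        double total = 0.0;
        for (Map.Entry<Integer, Map<Integer, Double>> e : scores.entrySet()) {
            Double weight = weights.get(e.getKey());
            Double val = e.getValue().get(cat);
            if (weight != null && val != null) {
                total += weight * val;
            }
        }
        return total;
    }
}
```

Keeping the raw value for each source, rather than merging on registration, is what makes it possible to choose the weights after all classifiers have run.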

getExactMembership

public double getExactMembership(int cat)

isJunkPhrase

public boolean isJunkPhrase(java.lang.String phrase)
Description copied from interface: ClusterDoc
Added to allow differentiation between phrases of scientific articles and general search results

Specified by:
isJunkPhrase in interface ClusterDoc
Parameters:
phrase - Space-separated words

isMemberOf

public boolean isMemberOf(int cat)

getWbid

public java.lang.String getWbid()
Specified by:
getWbid in interface SVMTestable
Specified by:
getWbid in interface DocIdentifiable

setWbid

public void setWbid(java.lang.String wbid)

getXMLDoc

public XMLDoc getXMLDoc()

getAbstract

public java.lang.StringBuffer getAbstract()

getTitle

public java.lang.String getTitle()

readyDoc

public void readyDoc(VectorManager titlevm,
                     VectorManager vm,
                     java.lang.String src)
Parameters:
titlevm -
vm -
src - the source to read, either "art" or "abs"

readyTitleOnly

public void readyTitleOnly(VectorManager titleVM)

readyDocLocally

public void readyDocLocally(VectorManager titleVM,
                            VectorManager articleVm,
                            java.lang.String src)
Specified by:
readyDocLocally in interface SVMTestable

destroyLocalDoc

public void destroyLocalDoc()
Description copied from interface: ClusterDoc
Once the frequency of every phrase in this doc has been found, this method should allow the supporting document to be released to free up memory.

Specified by:
destroyLocalDoc in interface ClusterDoc
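
The count-then-release lifecycle described here might look like the following sketch. Everything in it is an assumption for illustration: LocalDocSketch, countPhrase, and hasLocalDoc are hypothetical names, a space-separated String stands in for the Phrase type, and addTermSetCount is assumed to accumulate rather than overwrite.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical document that keeps phrase counts after its text has been released. */
final class LocalDocSketch {
    private String[][] sentences; // the heavy supporting text (assumed representation)
    private final Map<String, Integer> termSetCounts = new HashMap<>();

    LocalDocSketch(String[][] sentences) {
        this.sentences = sentences;
    }

    void addTermSetCount(String termSet, int n) {
        termSetCounts.merge(termSet, n, Integer::sum); // assumed: counts accumulate
    }

    /** Counts contiguous occurrences of a space-separated phrase and records the result. */
    int countPhrase(String phrase) {
        if (sentences == null) {
            throw new IllegalStateException("local doc already released");
        }
        String[] words = phrase.split(" ");
        int hits = 0;
        for (String[] sentence : sentences) {
            for (int start = 0; start + words.length <= sentence.length; start++) {
                boolean match = true;
                for (int k = 0; k < words.length; k++) {
                    if (!sentence[start + k].equals(words[k])) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    hits++;
                }
            }
        }
        addTermSetCount(phrase, hits);
        return hits;
    }

    int getNumInstancesOfTermSet(String termSet) {
        return termSetCounts.getOrDefault(termSet, 0);
    }

    /** Drops the reference to the text so it can be garbage collected; counts are kept. */
    void destroyLocalDoc() {
        sentences = null;
    }

    boolean hasLocalDoc() {
        return sentences != null;
    }
}
```

Only the small count map outlives the call, so a large collection can be processed one document at a time without holding every text in memory.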

compareTo

public int compareTo(TestDoc arg0)

checkSourceExists

public void checkSourceExists()
                       throws java.io.FileNotFoundException
Description copied from interface: ClusterDoc
Throws an exception if this file cannot be clustered

Specified by:
checkSourceExists in interface ClusterDoc
Throws:
java.io.FileNotFoundException

getFixedWordSentences

public java.lang.String[][] getFixedWordSentences()
                                           throws java.io.FileNotFoundException,
                                                  java.io.IOException
Description copied from interface: ClusterDoc
Each String should be fixed by VectorManager before being returned

Specified by:
getFixedWordSentences in interface ClusterDoc
Throws:
java.io.FileNotFoundException
java.io.IOException

getIdxSentences

public int[][] getIdxSentences(VectorManager vm)
Description copied from interface: ClusterDoc
Each entry is the integer that represents the corresponding word. This may allow PhraseFinder to save time while finding frequent phrases.
Note that the dimensions of the returned array should be exactly the same as those of the array returned by getFixedWordSentences()
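
The dimension requirement can be made concrete with a small sketch. It assumes that each integer is a vocabulary index for the word in the same position; a plain Map stands in for VectorManager, whose real API is not shown on this page, and IdxSentencesSketch and toIdxSentences are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical conversion from word sentences to index sentences of identical shape. */
final class IdxSentencesSketch {
    static int[][] toIdxSentences(String[][] sentences, Map<String, Integer> vocab) {
        int[][] idx = new int[sentences.length][];
        for (int i = 0; i < sentences.length; i++) {
            idx[i] = new int[sentences[i].length]; // same row length as the word array
            for (int j = 0; j < sentences[i].length; j++) {
                Integer id = vocab.get(sentences[i][j]);
                if (id == null) {
                    id = vocab.size(); // assumed: unseen words get the next free index
                    vocab.put(sentences[i][j], id);
                }
                idx[i][j] = id;
            }
        }
        return idx;
    }
}
```

Because idx[i][j] always refers to the word at sentences[i][j], a phrase finder can compare integers instead of strings and still map a match back to its words.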

Specified by:
getIdxSentences in interface ClusterDoc

getSentences

public java.lang.String[][] getSentences()
                                  throws java.io.FileNotFoundException,
                                         java.io.IOException
Specified by:
getSentences in interface ClusterDoc
Throws:
java.io.FileNotFoundException
java.io.IOException

getWindows

public java.util.List<java.lang.String>[] getWindows()
                                              throws java.io.FileNotFoundException,
                                                     java.io.IOException
Throws:
java.io.FileNotFoundException
java.io.IOException

loadWindowedDoc

public void loadWindowedDoc()
Description copied from interface: ClusterDoc
Initially, the idea was to support proximity windows (e.g. phrases of 5 words excluding stopwords), but just using each sentence as a window yields good results with much better performance.

Specified by:
loadWindowedDoc in interface ClusterDoc
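
The two windowing strategies mentioned above can be contrasted in a short sketch. It is a guess at what each strategy could look like, not the code behind loadWindowedDoc: WindowSketch, sentenceWindows, and proximityWindows are hypothetical names, and the sliding-window definition is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

/** Hypothetical comparison of sentence windows and sliding proximity windows. */
final class WindowSketch {
    /** One window per sentence: the approach the description says was kept. */
    static List<List<String>> sentenceWindows(String[][] sentences) {
        List<List<String>> windows = new ArrayList<>();
        for (String[] sentence : sentences) {
            windows.add(new ArrayList<>(Arrays.asList(sentence)));
        }
        return windows;
    }

    /** Sliding windows of `size` consecutive non-stopwords (assumed definition). */
    static List<List<String>> proximityWindows(String[] words, int size, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word)) {
                kept.add(word);
            }
        }
        List<List<String>> windows = new ArrayList<>();
        for (int start = 0; start + size <= kept.size(); start++) {
            windows.add(new ArrayList<>(kept.subList(start, start + size)));
        }
        return windows;
    }
}
```

Under this definition a sliding window is created at nearly every word position, while sentence windows number only as many as there are sentences, which is consistent with the performance remark in the description.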

compareTo

public int compareTo(ClusterDoc arg0)
Specified by:
compareTo in interface java.lang.Comparable<ClusterDoc>

addTermSetCount

public void addTermSetCount(Phrase termSet,
                            int n)
Description copied from interface: ClusterDoc
This document should record how frequently this termSet occurred

Specified by:
addTermSetCount in interface ClusterDoc

getNumInstancesOfTermSet

public int getNumInstancesOfTermSet(Phrase s)
Description copied from interface: ClusterDoc
Returns the number of instances of the term set s recorded for this document

Specified by:
getNumInstancesOfTermSet in interface ClusterDoc

getTermSetsSupported

public double getTermSetsSupported()
Specified by:
getTermSetsSupported in interface ClusterDoc

getDocId

public java.lang.String getDocId()