Textpresso Sites
Our Production Sites:
C. elegans
D. melanogaster
Curation-specific Sites:
HIV (1)
HIV (2)
WNT Pathway (1)
WNT Pathway (2)
Our Pilot Sites:
Sites by other groups:
S. cerevisiae
Regulon DB
Ecoliwiki Textpresso
Ecocyc Textpresso
O. sativa
wheat, barley
Textpresso for sea urchin
Textpresso for Echinoderm
The Textpresso project serves the biological and biomedical research community by providing:
  • Full-text literature searches of model organism research and subject-specific articles at individual sites (see the sidebar menu on the left). The major elements of these search engines are (1) access to full text, so that the entire content of articles can be searched, and (2) search capabilities using categories of biological concepts and classes that relate two objects (e.g., association, regulation, etc.) or identify one (e.g., cell, gene, allele, etc.). The search engines are flexible, enabling users to query the entire literature using keywords, one or more categories, or a combination of keywords and categories.
  • Text classification and mining of biomedical literature for database curation. We help database curators identify and extract biological entities and facts from the full text of research articles. Examples of entity identification and extraction include new allele and gene names and human disease gene orthologs; examples of fact identification and extraction include sentence retrieval for curating gene-gene regulation, Gene Ontology (GO) cellular component annotations, and GO molecular function annotations. In addition, we classify papers according to curation needs. We employ a variety of methods, such as hidden Markov models, support vector machines, conditional random fields, and pattern matching. Our collaborators include WormBase, FlyBase, SGD, TAIR, dictyBase, and the Neuroscience Information Framework. We look forward to collaborating with more model organism databases and projects.
  • Linking biological entities in PDF and online journal articles to online databases. We have established a journal article markup pipeline that links selected content of Genetics journal articles to model organism databases such as WormBase and SGD. The entity markup pipeline links over nine classes of objects, including genes, proteins, alleles, phenotypes, and anatomical terms, to the appropriate page at each database. The first article published with online and PDF-embedded hyperlinks to WormBase appeared in the September 2009 issue of Genetics. As of January 2011, we have processed around 70 articles, and processing is ongoing. We plan to extend this pipeline to other journals and model organism databases.
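To illustrate the idea of combining keywords with categories in a query, here is a minimal Python sketch. It is not Textpresso's implementation or API: the category lexicon, the sample sentences, and the function names (tokenize, categories_in, search) are all hypothetical and chosen only for illustration. A sentence matches when it contains every requested keyword and at least one term from every requested category.

```python
# Hypothetical category lexicon (an assumption, not Textpresso's ontology):
# each category name maps to the terms that belong to it.
CATEGORIES = {
    "gene": {"lin-12", "glp-1", "daf-16"},
    "regulation": {"regulates", "represses", "activates"},
    "cell": {"neuron", "oocyte"},
}

# Made-up sample sentences standing in for full-text article content.
SENTENCES = [
    "daf-16 regulates lifespan.",
    "The neuron expresses lin-12.",
    "Lifespan varies with temperature.",
]


def tokenize(sentence):
    """Split on whitespace, strip simple punctuation, and lowercase."""
    return [token.strip(".,;()").lower() for token in sentence.split()]


def categories_in(tokens):
    """Return the names of all categories with at least one term in tokens."""
    return {
        name
        for name, terms in CATEGORIES.items()
        if any(token in terms for token in tokens)
    }


def search(sentences, keywords=(), categories=()):
    """Return sentences containing every keyword and every category."""
    hits = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        has_keywords = all(k.lower() in tokens for k in keywords)
        has_categories = set(categories) <= categories_in(tokens)
        if has_keywords and has_categories:
            hits.append(sentence)
    return hits
```

With this sketch, search(SENTENCES, keywords=["lifespan"], categories=["gene"]) keeps only the sentence that mentions both the keyword and a gene term, while a keyword-only or category-only query simply omits the other argument.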
Textpresso is currently implemented for 24 different literatures and can readily be extended to other text corpora. A software package can be downloaded from this site and installed locally.


Site last updated: 04/25/11
California Institute of Technology