The Textpresso project serves the biological and biomedical research
community. Textpresso is currently implemented for 24
different literatures and can readily be extended to other
corpora of text. A software package can be downloaded
from this site and installed locally. The project provides:
- Full-text literature searches of model organism research and subject-specific articles
at individual sites (see the sidebar menu on the left).
Major elements of these search engines are (1) access to full text,
so that the entire content of articles can be searched, and (2) search capabilities
using categories of biological concepts and classes that relate two objects (e.g.,
association, regulation) or identify one (e.g., cell, gene, allele).
The search engines are flexible, enabling users to query the entire literature
using keywords, one or more categories, or a combination of keywords and categories.
- Text classification and mining of biomedical literature for database curation.
We help database curators to identify and extract biological entities and facts
from the full text of research articles. Examples of entity identification and extraction
include new allele and gene names and human disease gene orthologs; examples of fact
identification and extraction include sentence retrieval for curating gene-gene regulation,
Gene Ontology (GO) cellular component and molecular function annotations. In addition, we
classify papers according to curation needs. We employ a variety of methods, such as
hidden Markov models, support vector machines, conditional random fields, and pattern matching.
Our collaborators include WormBase,
dictyBase and the
Neuroscience Information Framework.
We are looking forward to collaborating
with more model organism databases and projects.
- Linking biological entities in PDF and online journal articles to online databases.
We have established a journal article markup pipeline that links select content of
journal articles to model organism databases such as WormBase.
The entity markup pipeline links over nine classes of objects, including genes, proteins, alleles,
phenotypes, and anatomical terms, to the appropriate page at each database.
The first article published with online and PDF-embedded hyperlinks to WormBase appeared in the
September 2009 issue of Genetics. As of January 2011, we have processed around
70 articles, and processing is ongoing. Extension of this pipeline to other journals and model
organism databases is planned.
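
The idea behind the entity markup described above can be illustrated with a small dictionary-based sketch in Python. This is not the Textpresso pipeline itself, only a minimal example under stated assumptions: the lexicon contents, the `example.org` placeholder URLs, and the function names `build_pattern` and `link_entities` are all hypothetical, and matching is done by exact, case-sensitive lookup rather than by any of the statistical methods mentioned earlier.

```python
import html
import re

# Hypothetical lexicon: surface term -> (entity class, target URL).
# The URLs are placeholders, not real database addresses.
LEXICON = {
    "unc-54": ("gene", "https://example.org/gene/unc-54"),
    "lin-12": ("gene", "https://example.org/gene/lin-12"),
    "pharynx": ("anatomy", "https://example.org/anatomy/pharynx"),
}


def build_pattern(lexicon):
    """Compile one regex that matches any lexicon term as a whole token."""
    # Longest terms first, so a longer name wins over a shorter prefix.
    terms = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # Lookarounds instead of \b so hyphenated names such as "unc-54"
    # are not matched inside longer tokens such as "unc-540".
    return re.compile(r"(?<![\w-])(" + alternation + r")(?![\w-])")


def link_entities(text, lexicon=LEXICON):
    """Return HTML in which each recognized term is wrapped in a hyperlink."""
    pattern = build_pattern(lexicon)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        term = match.group(1)
        entity_class, url = lexicon[term]
        # Escape the untouched text between matches.
        parts.append(html.escape(text[last:match.start()]))
        parts.append(
            '<a class="{}" href="{}">{}</a>'.format(
                entity_class, html.escape(url, quote=True), html.escape(term)
            )
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


if __name__ == "__main__":
    print(link_entities("Mutations in unc-54 affect the pharynx."))
```

A production pipeline would also need to handle ambiguous names, synonyms, and terms split across lines or PDF text runs, none of which this sketch attempts.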