TEXTPRESSO 2.0 INSTALLATION GUIDE

INTRODUCTION

Welcome to Textpresso 2.0! This document guides the user through the installation of Textpresso and describes the components of the Textpresso database and how they work together to build and maintain a database. The system was developed at the California Institute of Technology. This version, Textpresso 2.0, was developed by Hans-Michael Muller, Arun Rangarajan and Paul W. Sternberg, with contributions from Juancarlos Chan. The primary system is located at www.textpresso.org and can be used for comparison with private installations. The latest package updates will also be available there. Any comments or questions should be directed to Hans-Michael Muller via e-mail at mueller(at)caltech.edu.

Textpresso is a text mining system for scientific literature whose capabilities go far beyond that of a simple keyword search engine. The two key elements are the collection of the full text of scientific articles split into individual sentences, and the implementation of semantic categories, for which the database of articles and individual sentences can be searched. A detailed description of the original Textpresso system (version 1.0) can be found in the November 2004 issue of PLoS Biology.

This is the next-generation version of Textpresso 1.0, which has been deprecated and is no longer supported or available. This version has many new features and improvements over the old version. The old software was fragmented and had a lot of repetitive code and disorganized data structures in the processing pipeline, so expanding the system in depth and breadth would have become increasingly difficult. We modularized the software, and many subroutines can now be found in Textpresso Perl modules. As the corpus of literature grew, mark-up and indexing times were becoming prohibitively long. We therefore changed the structure of the lexica by moving away from regular expressions to static strings, so that we could use lookup tables. This reduced the processing time for mark-up and indexing more than 10-fold. On the frontend, there are several enhancements to the functionality of the website. One can now use Boolean operations for keywords in the text field and can search for phrases (as opposed to single words). Keyword searches can be made case-sensitive. Search results can be sorted by any bibliographical field as well as by hit score. Once a search result page is returned, results can be further filtered; for example, one can exclude certain years of publication, or show only publications by a certain author. Scrambled sentences (from tables and such) are now suppressed, but can be displayed on demand. Finally, we introduced a query language which allows for very complex searches.

This package is Caltech proprietary. Do not use this package for commercial purposes. Contact the office of technology transfer at the California Institute of Technology (www.ott.caltech.edu) for any questions regarding commercial licensing. Any non-commercial application is granted under the license described in this document. See section LICENSE for details.

INSTALLATION

After unzipping and untarring the tarball, there will be a directory called package/ with three subdirectories, one of which is called installation/. Enter it with the command

cd package/installation/

It contains the installation script called 'install_textpresso_2.0'. Execute it by typing

./install_textpresso_2.0

The installation script is divided into two parts, an interview section and an installation section. As long as the interview is proceeding, no script or directory will be installed, so it is safe to interrupt it by pressing "ctrl-c". Even if the script has been fully executed, it can be re-run. If directories are specified that differ from those entered in the first run, the directories of the first run will not be removed automatically and need to be removed by hand. All entries specified in the interview section are used in two ways: first to determine where scripts and data are to be installed, and second to modify the locations of library and data directories in the scripts themselves. The interview section of the script is well documented, so only additional information is presented here:

  1. The first question asks where to install the Textpresso Perl modules. These are libraries and should be put system-wide in some library-related directories, but any other directory will work too (the 'use lib' statements in the scripts will be modified accordingly). The default location is /usr/local/lib/textpresso/.
  2. The next input determines the location of the database, where the corpus, indices, bibliography and ontology are stored. The default is /usr/local/textpresso/. Some subdirectories of the database directory will be linked to from the html directory. If the web server does not allow links to directories outside the html directory subtree, one needs to either change the configuration of the web server or put the affected directories in the html directory and then link to them from the database directory. Currently, the directories that are affected are

    Data/annotations, Data/indices, Data/ontology/lexica and Data/processedfiles.

  3. The new Textpresso system is capable of hosting search engines for several literatures. The installation script sets up the infrastructure for a first literature of choice, and more literatures can be added (by creating parallel directory structures in the database and html directories). The name of the first literature is to be specified here. It can contain only alphanumeric characters; however, if a fancier name is required on the website, it can be configured in one of the Textpresso configuration files (TextpressoDatabaseGlobals.pm).
  4. The location of the Textpresso html directory has to be given in this answer. The default is /var/www/html/textpresso/.
  5. Similarly, the location of the Textpresso cgi-bin directory is queried here. Default location is /var/www/cgi-bin/textpresso/.
  6. The URL of the web server is requested here. Please note that this is not the URL of the Textpresso system itself, but the base URL of the web server. Please do not forget the forward slash ('/') at the end.
  7. The URL of the Textpresso webpages RELATIVE to the base URL should be given here. The complete URL that gives access is the concatenation of the base URL and this relative address. The default is textpresso.
  8. Similarly, the URL of the Textpresso cgi scripts RELATIVE to the base URL is entered in this question. The default address is cgi-bin/textpresso.
  9. Textpresso stores search results temporarily in files. Specify the location of a directory that holds temporary files here. This directory will be made writable by everybody, including the http daemon. The web server configuration may need to be changed if the web server is not allowed to access the directory specified here. If all else fails, this directory can be put within the html directory structure. It needs to be cleaned out periodically, as the Textpresso system does not do this.
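
Since the temporary directory of item 9 is not cleaned by Textpresso itself, a periodic job can take over this task. The following is a minimal sketch and not part of the package; the function name, the placeholder path in the usage comment and the retention period of seven days are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical helper: delete regular files in a temporary directory
# that were last modified more than a given number of days ago.
# Usage: clean_textpresso_tmp <directory> [days]
clean_textpresso_tmp() {
    local tmp_dir="$1"
    local max_age_days="${2:-7}"   # retention period is an assumption

    # Refuse to run on a missing directory.
    [ -d "$tmp_dir" ] || { echo "no such directory: $tmp_dir" >&2; return 1; }

    # -mtime +N matches files whose age in whole days is greater than N.
    find "$tmp_dir" -type f -mtime +"$max_age_days" -delete
}

# Example (the path is a placeholder for the directory given in item 9):
# clean_textpresso_tmp /path/to/textpresso/tmp 7
```

Such a function could be called from a daily cron job of the account that owns the directory.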

Some post-installation adjustments of directories and scripts may be necessary, but they should concern only ownership or access permissions. This depends on whether Textpresso has been installed with root or with local user privileges, and under which account Textpresso is intended to be operated.

SYSTEM DESCRIPTION

Textpresso Perl Modules

The Textpresso Perl Modules contain subroutines and configuration files (in form of global constants) that are important for building the database as well as running the website. The modules in the Textpresso Perl modules directory are the master modules; many modules in subdirectories of this directory (such as dataprocessing/ and displaying/) can have symbolic links to these master modules, if no changes are required. These subdirectories, dataprocessing/ and displaying/, bundle master modules that are necessary for dataprocessing and displaying, respectively. If more than one literature is to be hosted, they would contain more directories in parallel to the one already existing.

The master subroutines are roughly grouped together according to their common functionality, as indicated by the first part of their names:

TextpressoSystemTasks.pm
TextpressoSystemGlobals.pm

These routines provide functions and settings to build the database of the system. They mostly act on files in the database directory.

TextpressoDataBaseAttributes.pm
TextpressoDataBaseCategories.pm
TextpressoDataBaseGlobals.pm
TextpressoDataBaseQuery.pm
TextpressoDataBaseSearch.pm

Elements in these modules define data objects and categories of the ontology, and perform and configure searches. Among them, TextpressoDatabaseQuery.pm defines the data model for querying the database and TextpressoDatabaseSearch.pm contains subroutines performing database searches.

TextpressoDisplayTasks.pm
TextpressoDisplayGlobals.pm

Functions and constants needed to run and configure the actual website are bundled here.

TextpressoGeneralTasks.pm
TextpressoGeneralGlobals.pm

These routines and constants are used throughout the whole system and are therefore combined in these modules.

Generally, the modules containing the name 'Tasks' do not need any change, but it is likely that global constants need adjustments. They can be found in the modules that contain the name 'Globals'. A short description of the most important constants follows:

TextpressoSystemGlobals.pm
SY_ROOT specifies the root of the database directory.
SY_SUBROOTS a hash that lists all the first level subdirectories of the database. It is in a hash form so one can conveniently change the name of the physical directory without changing all scripts.
SY_ANNOTATION_FIELDS hash contains the fields that are flagged for mark up and specifies the affected directories.
SY_ANNOTATION_TYPE specifies the type of annotation. The most common annotation types are grammatical or semantic. The grammatical annotation is not implemented, but kept here for users so they can easily add it on.
SY_PASSTHROUGH_FIELDS This hash lists the (in this case purely bibliographical) fields that are not annotated.
SY_SOURCE_FIELDS contains all fields and their directories that need processing.
SY_INDEX_FIELDS lists all fields that need indexing.
SY_INDEX_TYPE contains the types of indices that are requested.
SY_ONTOLOGY specifies the structure of the ontology directory. The lexica directory contains the lexica of the semantic categories, while the definitions directory could contain explanations about them. The latter is currently not implemented.
SY_SUPPLEMENTALS This constant is currently unused, as we decided to provide links and supplemental materials dynamically instead of statically. It is being kept for future use.
SY_PREPROCESSING_TAGS This list contains the names of XML-like tags that are used in preprocessing tasks. Preprocessing includes all processing before Textpresso processing. This is a convenient way of handling special cases, for example, tagging words that are to be left untouched by subsequent mark-up routines. Those routines, which can be found in TextpressoSystemTasks.pm, need to be modified if tags are added.
SY_MARKUP_EXCEPTIONS List specifies the categories that are not to be used in a semantic mark-up despite the fact that there is a corresponding lexicon available. For special occasions only.

TextpressoDatabaseGlobals.pm
DB_ROOT Path of the root of the database directory. This root is different from SY_ROOT, as this address needs to be publicly accessible via the Web (everybody on the web should be able to search the database). One can link to the path represented in SY_ROOT from the directory cited in DB_ROOT to avoid duplicating the database, but if this approach is pursued, one needs to make sure that links are followed by the web server.
DB_TMP Location of the directory that stores temporary search files.
DB_STOPWORDS File name of list of stopwords. Stopwords are not indexed and cannot be searched for.
DB_LITERATURE Hash contains literatures that can be searched and their directory names.
DB_LITERATURE_DEFAULTS List contains the literatures that are searched by default (preset on the web page).
DB_SEARCH_MODE List gives choices how a search result is scored. For 'boolean', each match adds one to the score, while 'tf*idf' accounts for matches of rare words being overrepresented in the score. The search mode 'latent mode' is not yet implemented.
DB_SEARCH_MODE_DEFAULT sets the default search mode.
DB_INDEX Location of the index within the database.
DB_TEXT Location of the fields, i.e., bibliography and full text of the corpus.
DB_ANNOTATION Location of the annotation files.
DB_SEARCH_RANGES This hash specifies the unit of text within which a query has to be met. Currently, 'sentence', 'field' and 'document' are implemented.
DB_SEARCH_RANGES_DEFAULT Default search range.
DB_IS_BIBLIOGRAPHY Specifies which fields are part of the bibliography.
DB_IS_TEXT Lists fields that are texts.
DB_SEARCH_TARGETS Hash identifies searchable fields.
DB_DISPLAY_FIELDS Hash identifies fields that are displayed.
DB_SEARCH_TARGETS_DEFAULTS Lists default search targets.

TextpressoDisplayGlobals.pm
DSP_BGCOLOR, DSP_TXTCOLOR, DSP_LNKCOLOR, DSP_HDRBCKGRND, DSP_HDRFACE, DSP_HDRSIZE, DSP_HDRCOLOR, DSP_TXTFACE, DSP_TXTSIZE These variables affect the appearance of the website, such as font, size and color.
DSP_HIGHLIGHT_COLOR A hash of various colors for different display purposes.
HTML_ROOT Base URL of the server hosting Textpresso.
HTML_LINKTEMPLATES Textpresso offers the opportunity to use displayed text to link out to other web resources; for example, gene names could be linked to gene summary pages of model organism databases. The identification and linking is done via Perl regular expressions called templates. The location of the template file is stored in this constant.
HTML_MENU This hash contains the menu which is displayed right underneath the logo, and specifies the location of the scripts corresponding to the menu items.
HTML_LOGO Location of logo, relative to base URL.
HTML_NONE, HTML_ON, HTML_OFF Textvalues for none, on and off.

Database

The Textpresso database directory is divided into two parts, 'Data' and 'Procedures'. The latter contains the scripts and wrappers that build the database. There is a script in the scripts/ subdirectory called 'MakeDataDirectories.pl' that makes the data directory (called Data/) and all its subdirectories. This script is provided despite the fact that the installation process already installs this directory. However, if another literature is added to the system, one would need to run 'MakeDataDirectories.pl' again, after modifying the global constant SY_ROOT to accommodate a different location for the added literature.

'ProcessSourceFiles.pl' (and its wrapper 'ProcessSourceFiles.com') is the centerpiece for building the database. It can be run to build the database from scratch or incrementally. The incremental build is useful when only a few papers at a time are added. 'ProcessSourceFiles.pl' takes an incoming file (in the directory includes/) and checks whether the corresponding file in the processedfiles/ directory is older. If this is the case, the annotation files and relevant entries in the index files are removed, and the incoming file is re-annotated and re-indexed. In case the corresponding file in processedfiles/ does not exist, the removal of annotation files and relevant entries in the index files is skipped. See also subsection Incremental Builds.
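
The time stamp comparison described above can be previewed from the shell before a build is started. The following is a minimal sketch and not taken from ProcessSourceFiles.pl; the function name is hypothetical, and it is an assumption that processedfiles/ mirrors the field subdirectories of includes/. Exclusion flags are not taken into account.

```shell
#!/bin/bash
# Hypothetical helper: list the files of one field (e.g. body) in
# includes/ that are missing from, or newer than, their counterpart
# in processedfiles/.
# Usage: list_pending <Data directory> <field>
list_pending() {
    local data_dir="$1" field="$2"
    local src name done_file
    for src in "$data_dir/includes/$field"/*; do
        [ -f "$src" ] || continue            # glob matched nothing
        name=$(basename "$src")
        done_file="$data_dir/processedfiles/$field/$name"
        # Report files never processed or modified since processing.
        if [ ! -e "$done_file" ] || [ "$src" -nt "$done_file" ]; then
            echo "$name"
        fi
    done
}

# Example (the path is a placeholder):
# list_pending /path/to/Data body
```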

Within the scripts/ directory there is also a script called 'purge-processed-data.com'. It cleans out all processed data (including indices and annotations), in case one wants to build a new database from scratch instead of an incremental build. This is advisable if the lexica have significantly changed. See subsection Building a Database from Scratch.

The following are the subdirectories of the Data/ directory. They are 'input' directories, i.e., directories that need to be filled before 'ProcessSourceFiles.com' is executed, and 'output' directories, i.e., directories where processed files are stored. [Image: directory structure of Data/ and Procedures/]

annotations/, output-directory: all annotation files (categories and syntax) are saved here by the build process. The global constant SY_ANNOTATION_FIELDS determines which fields are annotated, and corresponding subdirectories are created.

etc/, input-directory: currently, this directory only contains a file that has all stopwords. Other miscellaneous input files can be stored here.

includes/, input-directory: the fields of the corpus (bibliography, full text) are stored in its subdirectories. The directory body/ contains the full texts of the corpus. The new Textpresso system no longer includes any routines that fetch bibliography and full texts, as their acquisition is different for every site anyway. Building a system starts here by dumping all raw data into the subdirectories of this directory.

excludes/, input-directory: if one temporarily does not want to mark up and index some of the files that are in the includes/ directory, but wants to keep them there for later use, one simply flags those files by putting an empty file (using the command 'touch', for example) of the same name in the corresponding subdirectories of the excludes/ directory. Those empty files act as a flag not to process the corresponding files in the includes/ directory.

indices/, output-directory: all the indices for keyword and category are produced here by the build script. The keyword index is subdivided into many subdirectories. All annotations are also indexed here.

ontology/, input-directory: the global constant SY_ONTOLOGY determines the substructure of this directory. In its default implementation, the subdirectory lexica contains the lexica of the categories with which the corpus is to be annotated. The definitions directory could contain explanations about them, but it is currently not used.

processedfiles/, output-directory: the build process either copies files from the includes/ directory directly here if they do not need to be indexed or marked up, or takes the files from includes/, tokenizes, indexes and marks them up and then puts them in processedfiles/. The tokenized files are deposited in processedfiles/, while index and mark-up information is stored in indices/ and annotations/. If one later on runs the build script again, it checks the time stamps of the files in includes/ and compares them with the last modification time of the corresponding files in processedfiles/. If the former are newer than the latter, they are processed.

supplementals/, input-directory: directories for supplemental information that is static, i.e., that cannot be produced in real-time for display on the web, can be created here. In the default implementation, this directory is empty, and it does not need to be filled for the system to work.
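
The flagging mechanism of the excludes/ directory described above can be scripted. The following is a minimal sketch under the assumption that a flag is wanted in every field subdirectory in which the document has a file; the function name and the placeholder path are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: flag one document so that the build skips it.
# Creates an empty file of the same name under excludes/ for every
# field subdirectory of includes/ that holds a file for this document.
# Usage: exclude_document <Data directory> <document id>
exclude_document() {
    local data_dir="$1" doc_id="$2"
    local field_dir field
    for field_dir in "$data_dir/includes"/*/; do
        [ -f "$field_dir$doc_id" ] || continue   # no file for this field
        field=$(basename "$field_dir")
        mkdir -p "$data_dir/excludes/$field"
        touch "$data_dir/excludes/$field/$doc_id"
    done
}

# Example (the path is a placeholder):
# exclude_document /path/to/Data pmid15383839
```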

CGI

The Textpresso cgi-bin directory contains all cgi-scripts to run the website. Their names should be self-explanatory. The default implementation has set their configuration correctly. However, if the location of the Textpresso Perl modules is changed later on, the only line that needs to be changed in each script is the 'use lib' statement at its beginning, which must point at the new location of the modules.
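
If the modules are moved, the 'use lib' statements can be adjusted in one pass. The following is a minimal sketch only: the function name is hypothetical, and it assumes that each script holds a statement of the form use lib "<path>"; with the path in single or double quotes, which should be verified against the installed scripts first. The new path must not contain the characters '|' or '&'.

```shell
#!/bin/bash
# Hypothetical helper: point the 'use lib' line of every script in a
# cgi-bin directory at a new module location.
# Usage: update_use_lib <cgi-bin directory> <new module directory>
update_use_lib() {
    local cgi_dir="$1" new_lib="$2"
    local script
    for script in "$cgi_dir"/*; do
        [ -f "$script" ] || continue
        case "$script" in *.bak) continue ;; esac   # skip earlier backups
        # A .bak copy of each file is kept; '|' is the sed delimiter so
        # that slashes in the path need no escaping.
        sed -i.bak -E \
            "s|^([[:space:]]*use lib[[:space:]]+)[\"'][^\"']*[\"']|\1\"$new_lib\"|" \
            "$script"
    done
}

# Example (both paths are placeholders):
# update_use_lib /var/www/cgi-bin/textpresso /new/location/of/modules
```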

HTML

The Textpresso html pages are mostly links to the Textpresso database, some gifs and index.html pages which are either frames hosting cgis or empty files to prevent directory views. If the hosting web server does not allow links to directories other than the html directory, then the corresponding directories from the database directory need to be copied into subdirectories of tdb/.
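
Where links are permitted, they can be recreated by hand if needed. The following is a minimal sketch, not the procedure used by the installation script; the function name is hypothetical, and it is an assumption that the target directory below the html tree mirrors the layout of Data/. The database directory should be given as an absolute path so that the links resolve.

```shell
#!/bin/bash
# Hypothetical helper: link the database subdirectories that the web
# pages need into a directory below the html tree.
# Usage: link_database_dirs <database directory> <html target directory>
link_database_dirs() {
    local db_dir="$1" html_dir="$2"
    local sub
    # The four subdirectories named in the installation notes.
    for sub in Data/annotations Data/indices Data/ontology/lexica Data/processedfiles; do
        mkdir -p "$html_dir/$(dirname "$sub")"
        # -s symbolic link, -f replace an existing link,
        # -n do not follow an existing link to a directory
        ln -sfn "$db_dir/$sub" "$html_dir/$sub"
    done
}

# Example (the second path is a placeholder):
# link_database_dirs /usr/local/textpresso /path/below/html/directory
```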

SYSTEM BUILD

System Requirements

The package is designed for Linux operating systems and has been tested on Intel x86-based hardware. The required minimal disk space is around 3GB per 1000 full text papers. A large amount of memory, 1.5GB or more, is recommended, as the lexica are loaded into memory during markup for faster processing. Software for a world wide web server such as Apache needs to be installed, and an internet connection should exist. The installation script requires a bash shell. Furthermore, standard Perl 5.6.1 or higher should be present with the most common Perl packages. If a standard Perl package is missing, it can be downloaded and installed from http://www.cpan.org/. This package has been tested with the RedHat Enterprise Linux 4 distribution (http://www.redhat.com/) and OpenSUSE Linux 10.2 (http://www.opensuse.org/). Both work with a 2.6.9 kernel or higher. The package should also work with other kernels and Linux-based operating systems, but this has not been tested.

Prerequisite

The software package comes without any software that helps acquire full text or bibliography, because the sources of a corpus are so diverse: many users already have a set of PDFs and they are in different formats, others have specific resources on the web from which they will retrieve the corpus. Before the user can start using the system, she or he needs to fill the input directories (see subsection Database) of the database directory. In particular, the includes/ directory needs to be filled with bibliography and full text, in plain ASCII format. Files in its subdirectories abstract/, accession/, author/, body/, citation/, journal/, title/, type/, year/ contain the particular information of each data type (also called fields), and the name of the files should be the publication identifier of the document. For example, the file pmid15383839 in the author/ directory would contain

Muller HM
Kenny EE
Sternberg PW

and the corresponding file in the title/ directory, again with filename pmid15383839, would contain

Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

because the paper has that particular PMID identifier. All other data types for this document would be similarly stored in their respective directories.
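
Filling the field directories can be scripted along the following lines. This is a minimal sketch; the function name and the placeholder path are hypothetical, and how the field content is obtained is left to the site.

```shell
#!/bin/bash
# Hypothetical helper: store one field of one document.
# Reads the field content from standard input and writes it to
# <Data directory>/includes/<field>/<document id>.
# Usage: some_command | add_field <Data directory> <field> <document id>
add_field() {
    local data_dir="$1" field="$2" doc_id="$3"
    mkdir -p "$data_dir/includes/$field"
    cat > "$data_dir/includes/$field/$doc_id"
}

# Example, using the document from the text (the path is a placeholder):
# printf '%s\n' 'Muller HM' 'Kenny EE' 'Sternberg PW' \
#     | add_field /path/to/Data author pmid15383839
```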

The full texts of papers may often be acquired via their PDFs, because this is the standard way of dissemination, besides html. The conversion from pdf to text can be performed via the program pdftotext, which is available on many Linux distributions. It is part of XPDF and is available at http://www.foolabs.com/xpdf/.
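
A batch conversion with pdftotext might look like the following. This is a minimal sketch; the function name is hypothetical, and it is an assumption that each PDF is named after its publication identifier (for example pmid15383839.pdf). The converted text may still need cleaning to reach plain ASCII.

```shell
#!/bin/bash
# Hypothetical helper: convert every PDF in a directory to plain text
# and store the result under includes/body/, named after the PDF
# (paper.pdf -> includes/body/paper).
# Usage: convert_pdfs <PDF directory> <Data directory>
convert_pdfs() {
    local pdf_dir="$1" data_dir="$2"
    local pdf id
    mkdir -p "$data_dir/includes/body"
    for pdf in "$pdf_dir"/*.pdf; do
        [ -f "$pdf" ] || continue            # glob matched nothing
        id=$(basename "$pdf" .pdf)
        # pdftotext <input PDF> <output text file>
        pdftotext "$pdf" "$data_dir/includes/body/$id" \
            || echo "conversion failed: $pdf" >&2
    done
}

# Example (both paths are placeholders):
# convert_pdfs /path/to/pdfs /path/to/Data
```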

The default installation puts the lexica of the categories in the lexica/ subdirectory of the ontology/ directory. If additional categories are needed, they can be put in the same directory. All lexica present in this directory will be used by the script ProcessSourceFiles.pl for mark-up, so this directory needs to be kept tidy and clean, i.e., no other files should be there. The nomenclature of the file name is as follows:

name.number-gram

where 'name' will be used as annotation name and 'number' is an integer. The extension 'number-gram' can be used to distinguish between the number of words a lexicon entry has; for example, anatomy.1-gram would contain all anatomy terms that consist of one word ('liver'), while anatomy.2-gram contains all terms consisting of two words ('heart muscle'). This convention is for ease of use for the curator, but has no implication for the system. (One can find some *.0-gram lexica in the default implementation, where all terms of different word length are combined.)

The lexica have the form

phrase
attribute1='value-a'
attribute2='value-b'
attribute3='value-c'
#####

The phrase is a term of the lexicon whose category name is determined by the name of the file, as explained above. One can add a system of attributes to each category and assign them to the terms, as illustrated. The attributes are also indexed and can be searched for, but this requires use of the Textpresso query language interface. The five hash marks are the lexicon entry delimiter and can be set to any character string in TextpressoGeneralGlobals.pm. The complete lexica are loaded into memory at the beginning of the build process (ProcessSourceFiles.pl), and then used in a lookup-table fashion. This makes the markup over ten times faster, but at the cost of an increased memory requirement. When all the files of the lexica/ directory (in the default implementation) are loaded, the script can take 600-900 MB of RAM.
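
Lexicon files in this layout can be written and checked with a few shell functions. The following is a minimal sketch; the function names are hypothetical, the attribute in the usage comment is the placeholder from the layout above, and the default delimiter of five hash marks is assumed.

```shell
#!/bin/bash
# Hypothetical helper: append one entry to a lexicon file in the
# phrase / attribute / delimiter layout described above.
# Usage: add_lexicon_entry <lexicon file> <phrase> [attribute='value' ...]
add_lexicon_entry() {
    local lexicon="$1" phrase="$2"
    local attr
    shift 2
    {
        printf '%s\n' "$phrase"
        # Each remaining argument becomes one attribute line.
        for attr in "$@"; do
            printf '%s\n' "$attr"
        done
        printf '%s\n' '#####'                # default entry delimiter
    } >> "$lexicon"
}

# Hypothetical helper: count entries by counting delimiter lines.
# Usage: count_lexicon_entries <lexicon file>
count_lexicon_entries() {
    grep -c -x -F -e '#####' "$1"
}

# Example:
# add_lexicon_entry anatomy.1-gram liver "attribute1='value-a'"
# count_lexicon_entries anatomy.1-gram
```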

The etc/ directory of the Data/ directory contains a file named 'stopwords'. It contains common stopwords, such as determiners, prepositions, forms of auxiliary verbs, etc. None of the words in this file are indexed, and they cannot be searched for. The reason for this is two-fold: firstly, the search engine would probably overload when trying to retrieve all instances of a stopword, and secondly, a search for a stopword would probably return whole articles, which is in violation of copyright laws and does not fall under the fair-use clause.

The supplementals/ directory has been established for future use, for the case that Textpresso extensions, such as static information of considerable size (to be displayed in the search result page, for example), cannot be created dynamically ("on-the-fly"). The subdirectories of supplementals/ have arbitrary names (from the early days of Textpresso 2.0 development) and need to be adjusted to the actual needs.

Building a Database from Scratch

Building a database from scratch implies that all output-directories are empty. This mode is recommended if

These recommendations are based on the fact that removing and rewriting the category and keyword entries in the index and annotation files are time consuming processes, while just building new indices and annotation is very fast.

When building a database from scratch, one needs to make sure that the directory structure remains intact, and only the directory content is deleted when emptying the output-directories. A script in Procedures/scripts/ called

./purge-processed-data.com

is provided to accomplish this, but may need some adjustments if the system specifications have changed.
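
The requirement that directories stay in place while their content is deleted can be illustrated as follows. This is a minimal sketch and NOT the packaged purge script; the function name is hypothetical, and the list of output directories is taken from the Database subsection.

```shell
#!/bin/bash
# Hypothetical helper: empty the output directories of a Data directory
# while leaving every directory in place.
# Usage: empty_output_dirs <Data directory>
empty_output_dirs() {
    local data_dir="$1"
    local out
    for out in annotations indices processedfiles; do
        [ -d "$data_dir/$out" ] || continue
        # Delete regular files only; all directories are kept.
        find "$data_dir/$out" -type f -delete
    done
}

# Example (the path is a placeholder):
# empty_output_dirs /path/to/Data
```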

After all data are stored in the input-directories, the wrapper file

./ProcessSourceFiles.com

in the Procedures/wrappers/ directory can be started. The wrapper checks whether a build process is already running, and if not, calls the script ProcessSourceFiles.pl. If, for some reason, the build process is interrupted by the user or by a computer system event, a lock in the directory Procedures/locks/, which prevents a second build process from being started, is probably still set and needs to be removed by hand.
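
Removing a leftover lock by hand might be done as follows. This is a minimal sketch; the function name is hypothetical, and it is an assumption that locks are plain files in Procedures/locks/ (their names are not documented here). It should only be used after confirming, for example with ps, that no build process is still running.

```shell
#!/bin/bash
# Hypothetical helper: show and remove leftover lock files.
# Only call this after confirming that no build is running.
# Usage: remove_stale_locks <Procedures directory>
remove_stale_locks() {
    local locks_dir="$1/locks"
    local lock
    for lock in "$locks_dir"/*; do
        [ -e "$lock" ] || continue           # nothing to remove
        echo "removing lock: $lock"
        rm -f -- "$lock"
    done
}

# Example (the path is a placeholder):
# remove_stale_locks /path/to/Procedures
```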

The script ProcessSourceFiles.pl in its original form produces a lot of messages. The administrator might want to reduce this output by commenting out the corresponding print statements in ProcessSourceFiles.pl. We left the script in the most verbose state to help the new user understand what is happening during the build.

Incremental Builds

The incremental build should be initiated if

It is started in the same way as the build from scratch, by invoking

./ProcessSourceFiles.com

located in the Procedures/wrappers/ directory. The difference from the "build from scratch" procedure is that the index and annotation directories and all other output-directories already have files and information stored, and their content should not be deleted. The new information is simply added or corrected according to the new or altered content of the input-directories. The update resulting from an incremental build is immediately available on the website, if the corresponding output directories are linked to from the html pages.

LICENSE

COPYRIGHT 2000-2007 CALIFORNIA INSTITUTE OF TECHNOLOGY.
FOR QUESTIONS OR COMMENTS REGARDING THIS SOFTWARE, YOU MAY CONTACT HANS-MICHAEL MULLER AT MUELLER(AT)ITS.CALTECH.EDU.

PERMISSION TO USE, COPY AND DISTRIBUTE THIS SOFTWARE AND ITS DOCUMENTATION FOR NONCOMMERCIAL, RESEARCH PURPOSES WITHOUT FEE AND WITHOUT A WRITTEN AGREEMENT IS HEREBY GRANTED, PROVIDED THAT THE ABOVE COPYRIGHT NOTICE, THIS PARAGRAPH AND THE FOLLOWING TWO PARAGRAPHS APPEAR IN ALL COPIES. A LICENSE FOR THE USE, COPY, OR DISTRIBUTION OF THIS SOFTWARE FOR COMMERCIAL OR FOR-PROFIT USE MUST BE OBTAINED FROM:

OFFICE OF TECHNOLOGY TRANSFER
CALIFORNIA INSTITUTE OF TECHNOLOGY
1200 E. CALIFORNIA BLVD
M/C 210-85 PASADENA, CA 91125
FAX (626) 356-2486

THE CALIFORNIA INSTITUTE OF TECHNOLOGY MAKES NO PROPRIETARY CLAIMS TO THE RESULTS, PROTOTYPES, OR SYSTEMS SUPPORTING AND/OR NECESSARY FOR THE USE OF THE RESEARCH, RESULTS AND/OR PROTOTYPES FOR NON-COMMERCIAL, RESEARCH USES ONLY. IN NO EVENT SHALL CALIFORNIA INSTITUTE OF TECHNOLOGY BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF THE CALIFORNIA INSTITUTE OF TECHNOLOGY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

THE CALIFORNIA INSTITUTE OF TECHNOLOGY SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING THE IMPLIED WARRANTIES OR MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND DOCUMENTATION PROVIDED HEREUNDER IS ON AN "AS IS" BASIS, AND THE CALIFORNIA INSTITUTE OF TECHNOLOGY HAS NO OBLIGATIONS TO PROVIDE MAINTENANCE, UPDATES, ENHANCEMENTS OR MODIFICATIONS.

AUTHOR

This document was written by Hans-Michael Muller on June 6th, 2007.