Sequence database setup: UniProt proteomes

Overview

A UniProt complete proteome consists of the set of proteins thought to be expressed by an organism whose genome has been completely sequenced. A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.

UniProtKB is a collaboration between the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource.

Download

Database Manager fails to download Fasta from UniProt query

Changes to the way uniprot.org handles requests cause downloads of UniProt proteome Fasta files to fail (the full Swiss-Prot Fasta is not affected). The fix is as follows:

  1. Ensure that you have updated to Mascot 2.4.1. Your version number is displayed towards the top left of the database status page. For Windows, a service pack can be downloaded from the support page. For Linux, registered contacts were sent an email with download information in October 2012. If you cannot locate this message, email support@matrixscience.com.
  2. Download an updated CLI.pm (right click on the link and choose Save link as or similar) and save it to mascot/bin/modules/DBManager/Workhorse, replacing the file of the same name. There is no need to stop or re-start the Mascot Monitor service.

Fasta files representing the proteome for an organism can be downloaded by searching for a specific taxonomy accompanied by the keyword "Complete proteome":

  • Perform the query and view the resulting list of entries (e.g. organism:9606 AND keyword:"Complete proteome" for the human proteome)
  • Click the orange Download button in the query result page
  • Choose Fasta, Canonical and isoform sequence data in FASTA format

For example, to get the complete proteome for rice, search for taxonomy:4530 AND keyword:"Complete proteome".

In Database Manager, create a new custom definition using UniProt_proteome_template as the template. You can enable automatic updating of a UniProt proteome by setting the Fasta file URL. To do this, change the taxonomy ID in this sample URL to the one for your proteome of interest:
http://www.uniprot.org/uniprot/?query=taxonomy:4530+AND+keyword:"Complete+proteome"&force=yes&format=fasta&include=yes
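Swapping the taxonomy ID into the sample URL can be sketched in a few lines of Python. This is only an illustration: the function name proteome_fasta_url is hypothetical, and the URL layout is copied from the sample above rather than taken from any UniProt specification.

```python
def proteome_fasta_url(taxonomy_id: int) -> str:
    """Return the sample Fasta URL with the given taxonomy ID swapped in.

    The URL layout is copied from the rice example in the text; only the
    taxonomy ID changes.
    """
    return (
        "http://www.uniprot.org/uniprot/"
        f"?query=taxonomy:{taxonomy_id}"
        '+AND+keyword:"Complete+proteome"'
        "&force=yes&format=fasta&include=yes"
    )


# 4530 is the rice taxonomy ID used in the text; 9606 is human.
rice_url = proteome_fasta_url(4530)
human_url = proteome_fasta_url(9606)
print(rice_url)
print(human_url)
```

The resulting string is what would be pasted into the Fasta file URL field of the custom definition.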

The complete configuration for the rice proteome in Database Manager will look similar to this:

[Screenshot: Mascot Database Manager]

Taxonomy

Taxonomy is not required for a single-organism database.

Parse Rules

When a single entry is expanded into entries for multiple isoforms, they all share the same ID, so the accession (AC) must be used as the unique identifier:

>sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4

AC from Fasta title: ">..|\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
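To see what these two rules extract from the example title line, they can be approximated with Python's re module. This is a sketch for checking the patterns, not how Mascot itself evaluates parse rules. It assumes that \( and \) in the rules mark the captured group, so they become plain parentheses in Python, and that the literal | must be escaped there.

```python
import re

# Example Fasta title line from the text.
title = (">sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA "
         "ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4")

# Python equivalents of the two parse rules (assumed translation).
AC_RULE = re.compile(r">..\|([^|]*)")     # ">..|\([^|]*\)"
DESC_RULE = re.compile(r">[^ ]* (.*)")    # ">[^ ]* \(.*\)"

accession = AC_RULE.match(title).group(1)
description = DESC_RULE.match(title).group(1)

print(accession)     # Q67W82-2
print(description)   # everything after the first space in the title
```

The accession keeps its isoform suffix (-2), which is what makes it unique across isoforms of the same entry.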

Configuration (Mascot 2.3 and earlier)

A Fasta file containing the canonical and isoform sequences for the rice proteome was downloaded to /usr/local/mascot/sequence/rice_proteome/current and renamed to rice_proteome_20120414.fasta.

[Screenshot: Mascot database maintenance utility]

Full text for individual entries can be retrieved across the web from UniProt:

Host: www.uniprot.org
Port: 80
Path: /uniprot/#ACCESSION#.txt
Parse rule: RULE_23 "\(.*\)"
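As an illustration of how these settings combine, the sketch below builds the address that would be requested for one accession. It assumes that #ACCESSION# is replaced verbatim by the accession string and that the request is plain HTTP; the function name full_text_url is hypothetical.

```python
HOST = "www.uniprot.org"
PORT = 80
PATH_TEMPLATE = "/uniprot/#ACCESSION#.txt"


def full_text_url(accession: str) -> str:
    """Fill the #ACCESSION# placeholder and join host, port and path."""
    path = PATH_TEMPLATE.replace("#ACCESSION#", accession)
    return f"http://{HOST}:{PORT}{path}"


# Accession taken from the example Fasta title in the Parse Rules section.
example_url = full_text_url("Q67W82-2")
print(example_url)   # http://www.uniprot.org:80/uniprot/Q67W82-2.txt
```

The parse rule "\(.*\)" then captures the whole of the returned text, so no further trimming is applied.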

Always test a new definition before applying the changes to mascot.dat.