PROTINFO documentation

Calculation times
Submission tips
Troubleshooting tips
Methods
Version information

Calculation times

The methods are usually executed on a dedicated 64-processor cluster. Our goal is to ensure that the prediction time for each sequence is less than 24 hours (comparative modelling predictions will most likely take only a few hours), but this of course depends on how many people submit sequences. Please see the notes below on the methods and on length dependence to understand why predictions can take a long time. You can also monitor the progress of your jobs using the PROTINFO monitor.

Submission tips

Sequences must be specified in a single line using the one-letter amino acid notation.
Submit sequences split up into domains if you have a good feel for where the boundaries are. The cost of most of the calculations generally grows exponentially with sequence length, and most prediction methods are calibrated to work on domains. The programs currently perform a limited amount of automatic domain parsing, which will be enhanced in the future.
Very short sequences are not a good idea, since predictions made for them are usually unreliable.
Very long sequences are not a good idea, since the computation time required is much longer. People submitting long sequences may, on average, wait much longer before their sequence is pulled for processing than people submitting short sequences.
Any PDB files you optionally submit must generally start with residue 1, and the residues must be numbered consecutively without any chain breaks. Currently, at least the main chain coordinates are required for the template to be used. We do attempt to clean up PDB files, but this isn't guaranteed to work.
If you're not happy with the results, resubmitting with a template and alignment may improve them.
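
The formatting tips above can be checked locally before submitting. The following is only a minimal sketch in Python, not part of the PROTINFO software: the function names are hypothetical, the sequence check assumes that only the 20 standard one-letter codes are accepted (the server's handling of ambiguity codes such as X or B is not stated here), and the PDB check looks only at residue numbering and main chain atom names (N, CA, C, O) in ATOM records. It does not detect geometric chain breaks and ignores insertion codes.

```python
# Hypothetical pre-submission checks; not the server's own validation.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # assumption: 20 standard codes only
MAIN_CHAIN = {"N", "CA", "C", "O"}


def check_sequence(seq):
    """Return a list of problems found in a one-letter amino acid sequence."""
    seq = seq.strip()
    problems = []
    if "\n" in seq:
        problems.append("sequence spans more than one line")
    bad = sorted(set(seq.upper()) - VALID_RESIDUES - {"\n"})
    if bad:
        problems.append("unexpected characters: " + ", ".join(repr(c) for c in bad))
    return problems


def check_pdb_numbering(pdb_text):
    """Return a list of problems with residue numbering and main chain atoms."""
    atoms_by_residue = {}  # residue number -> atom names, kept in file order
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            atom_name = line[12:16].strip()  # PDB columns 13-16
            res_num = int(line[22:26])       # PDB columns 23-26
            atoms_by_residue.setdefault(res_num, set()).add(atom_name)
    if not atoms_by_residue:
        return ["no ATOM records found"]
    problems = []
    numbers = list(atoms_by_residue)
    if numbers[0] != 1:
        problems.append(f"first residue is numbered {numbers[0]}, not 1")
    for prev, cur in zip(numbers, numbers[1:]):
        if cur != prev + 1:
            problems.append(f"numbering jumps from {prev} to {cur}")
    for num, atoms in atoms_by_residue.items():
        missing = MAIN_CHAIN - atoms
        if missing:
            problems.append(f"residue {num} lacks main chain atoms: {sorted(missing)}")
    return problems
```

An empty list from either function means that no problem was detected by these particular checks; it does not guarantee that the server will accept the input.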

Troubleshooting tips

Look at the PROTINFO monitor to see whether your job has been started. If it has taken an unusually long time, or if the monitor reports that it has finished and you have not received a response, please contact us.
Take a look at the log files in the top-level directory pointed to by the URL returned by the server. These will contain a lot of garbage, but the end of the log can be revealing.
Read the warnings and the error messages generated (if any) and please contact us if you think something has gone terribly wrong.

Methods

All papers published regarding the research are accessible from our ongoing areas of research page, and all or most of the software is accessible from our software distribution server.

General notes

Following the CASP convention, up to five models for each prediction method may be returned (in CASP format). Under certain conditions (for example, when no clear target-template relationship is discerned), both methods may be executed by the PROTINFO server regardless of the method requested.

Comparative modelling using RAMP

If no template and alignment are specified, the server does a sequence-only search using a variety of methods and then uses the "hits" returned as seeds for a multiple sequence alignment. Initial models are then built for each alignment to a template, and the resulting models are scored. Loops and side chains are built on the best-scoring models using a frozen approximation. In some cases (when the templates all match well), a sophisticated graph-theory search is done to mix and match between various main chain and side chain conformations.

During the searches, templates with >= 95% sequence identity to the target are usually ignored (since this could represent the same structure in the PDB). If you really want a model where the target-template alignment has a sequence identity >= 95%, then you should submit the alignment and template structures explicitly (it should be trivial to construct such an alignment by hand).

This approach is likely to produce the best models when the relationship between the target and template proteins is clearly discernible (>= 30% sequence identity). Even though models are built if the sequence identity is lower, they are likely to contain errors.
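
To see where a given target-template alignment falls relative to the 30% and 95% thresholds mentioned above, the sequence identity can be computed directly from the aligned sequences. The following Python sketch is an illustration only: the function names are hypothetical, gaps are assumed to be written as "-", and identity is taken over columns where neither sequence has a gap. Other definitions of percent identity exist, and the one used internally by the server is not stated here.

```python
def percent_identity(target_aln, template_aln):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(target_aln) != len(template_aln):
        raise ValueError("aligned sequences must have the same length")
    aligned = matches = 0
    for a, b in zip(target_aln.upper(), template_aln.upper()):
        if a == "-" or b == "-":
            continue  # skip gapped columns (assumed convention)
        aligned += 1
        if a == b:
            matches += 1
    return 100.0 * matches / aligned if aligned else 0.0


def describe_template(identity):
    """Map a percent identity onto the thresholds described in the text."""
    if identity >= 95.0:
        return "usually ignored by the automatic search; submit explicitly"
    if identity >= 30.0:
        return "clear relationship; best case for comparative modelling"
    return "weak relationship; model likely to contain errors"


if __name__ == "__main__":
    identity = percent_identity("ACDE-FG", "ACDQKFG")
    print(f"{identity:.1f}% identity: {describe_template(identity)}")
```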

De novo prediction using RAMP

If no templates related to the target are found and/or if the target sequence has an appropriate length (around 100 residues), then it will be modelled using our de novo methods. This approach is likely to be most useful for small sequences.

Secondary structure assignment using PsiCSI

This method (published in Protein Science) uses neural networks to translate NMR chemical shifts into secondary structure information (somewhat similar to CSI) and combines it with sequence-based predictions (à la Psipred). It has a sustained three-state average accuracy of 89% on a rigorously jack-knifed test set of 92 proteins for which NMR chemical shift information was publicly available.

PsiCSI chemical shifts must be supplied in NMR-Star format (tools for converting to this format from other popular formats are available).

The output will include individual secondary structure assignments as well as the confidences for each of the three states.
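
As a rough illustration of how per-state confidences relate to an assignment and to a three-state accuracy figure, the Python sketch below picks the highest-confidence state for each residue and scores the result against a reference. Everything here is an assumption made for illustration: the state labels H, E and C (helix, strand, coil), the representation of the confidences as three numbers per residue, and the function names. The actual PsiCSI output layout is not described here, and a published average accuracy may be averaged per protein rather than per residue.

```python
STATES = ("H", "E", "C")  # assumed labels: helix, strand, coil


def assign_states(confidences):
    """Pick the highest-confidence state for each residue.

    `confidences` is a sequence of (helix, strand, coil) triples.
    """
    return "".join(
        STATES[max(range(3), key=lambda i: row[i])] for row in confidences
    )


def three_state_accuracy(predicted, reference):
    """Percentage of residues whose predicted state matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("assignments must have the same length")
    if not predicted:
        raise ValueError("assignments must not be empty")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(predicted)


if __name__ == "__main__":
    example = [(0.8, 0.1, 0.1), (0.2, 0.7, 0.1), (0.1, 0.2, 0.7), (0.5, 0.3, 0.2)]
    predicted = assign_states(example)
    print(predicted, three_state_accuracy(predicted, "HECC"))
```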

Version information

nov212012 PROTINFO-CM v0.2
nov212012 PROTINFO-AB v0.2
nov212012 PsiCSI v1.1
nov212012 RAMP v0.4

Protinfo || Bioverse || Samudrala Computational Biology Research Group || protinfo@compbio.org