Behind the SPARROW server
How does it work?
The predictors offered in the SPARROW server are the result of
statistical learning procedures with roots in Vladimir Vapnik's
statistical learning theory. To apply this theory, the problem of
protein secondary structure prediction is transformed into a
multiclass classification problem. The eight secondary structure
classes provided by DSSP serve as the reference classification. As
is commonly done, these eight classes are reduced to three secondary
structure classes, as described in the help section.
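The exact mapping applied by the server is the one documented in the
help section; the short sketch below only illustrates one commonly
used convention for such a reduction (helical classes to helix,
extended classes to strand, everything else to coil) and is not
necessarily identical to it.

    # Illustrative 8-to-3 reduction of DSSP classes; the mapping used by the
    # SPARROW server may differ (see the help section).
    DSSP_TO_THREE = {
        "H": "H", "G": "H", "I": "H",   # alpha-, 3-10- and pi-helix -> helix
        "E": "E", "B": "E",             # extended strand, beta-bridge -> strand
        "T": "C", "S": "C", "-": "C",   # turn, bend, unassigned -> coil
    }

    def reduce_dssp(eight_state: str) -> str:
        """Map a string of 8-state DSSP labels to 3-state labels."""
        return "".join(DSSP_TO_THREE.get(s, "C") for s in eight_state)

    print(reduce_dssp("HHHHTT-EEEEB-GGG"))   # -> HHHHCCCEEEEECHHH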
Once the problem is defined this way, supervised learning methods
from statistical learning theory can be used to produce classifiers
capable of predicting protein secondary structure. The optimal
classifiers are obtained using a finite learning dataset of proteins
with known secondary structure assignments. The learning procedure
consists of minimizing the classification error on this dataset.
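In more formal terms, the learning procedure can be summarized as
empirical risk minimization (the notation below is introduced here
only for illustration):

    \hat{f} \;=\; \operatorname*{arg\,min}_{f \in \mathcal{F}}
    \;\frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\, f(x_i) \neq y_i \,\right]

where x_i is the encoded sequence context of residue i, y_i its
assigned three-state class, N the number of residues in the learning
dataset and \mathcal{F} the family of scoring functions considered.
In practice a smooth surrogate of this misclassification error is
typically optimized.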
The SPARROW server provides two predictors based on such classifiers,
SPARROW and *SPARROW (pronounced "star-sparrow"), both operating in
three stages. The third stage is in both cases implemented by an
artificial neural network, while the first two differ slightly between
the two predictors. Both SPARROW and *SPARROW are trained with the
most recent release of the ASTRAL40 compendium (currently version
1.75, released June 5, 2009). The following paragraphs give a brief
insight into the methods used.
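As a purely conceptual illustration of such a staged architecture,
the sketch below assumes that the first two stages produce
per-residue class scores from sliding sequence windows and that the
third stage is a small feed-forward neural network operating on a
window of stage-two scores; all window sizes, dimensions and weights
are placeholders and do not reflect the actual SPARROW or *SPARROW
models.

    import numpy as np

    # Hypothetical three-stage pipeline: two scoring stages followed by a
    # small neural-network filter. All weights are random placeholders.
    rng = np.random.default_rng(0)
    N_CLASSES, WIN, N_FEAT = 3, 15, 20    # assumed window size and feature count

    W1 = rng.normal(size=(WIN * N_FEAT, N_CLASSES))     # stage-1 scoring weights
    W2 = rng.normal(size=(WIN * N_CLASSES, N_CLASSES))  # stage-2 scoring weights
    W3a = rng.normal(size=(WIN * N_CLASSES, 10))        # stage-3 neural network
    W3b = rng.normal(size=(10, N_CLASSES))

    def windows(x, w):
        """Stack a zero-padded sliding window of w rows around each row of x."""
        pad = np.zeros((w // 2, x.shape[1]))
        xp = np.vstack([pad, x, pad])
        return np.stack([xp[i:i + w].ravel() for i in range(len(x))])

    def predict(profile):
        """profile: (L, N_FEAT) per-residue features, e.g. a sequence profile."""
        s1 = windows(profile, WIN) @ W1       # stage 1: per-residue class scores
        s2 = windows(s1, WIN) @ W2            # stage 2: rescoring of stage-1 scores
        h = np.tanh(windows(s2, WIN) @ W3a)   # stage 3: small feed-forward network
        return (h @ W3b).argmax(axis=1)       # final three-state label per residue

    print(predict(rng.normal(size=(30, N_FEAT))))    # toy input: 30 residues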
SPARROW - divide et impera
SPARROW was developed by Francesco Bettella during his time as a
Ph.D. student in the AGKnapp group. The strategy behind the SPARROW
predictor is to split the multiclass problem into a set of two-class
problems. Each of these is addressed individually with the linear
discriminant approach introduced by R. A. Fisher in the 1930s. The
resulting discriminants (quadratic functions of the amino acid
sequence, but linear in their parameters), each optimized to
recognize one secondary structure type, provide a measure of the
propensity of a sequence to adopt that secondary structure. Direct
comparison of these propensities then yields the actual multiclass
prediction.
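The sketch below illustrates this one-against-rest strategy: one
Fisher-type discriminant is fitted per class on quadratic feature
expansions of an encoded sequence window, and the class with the
highest discriminant score (propensity) is predicted. The data, the
encoding and the use of scikit-learn are illustrative assumptions,
not SPARROW's actual parameterization or training procedure.

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # One-against-rest sketch: one Fisher-type discriminant per secondary
    # structure class, acting on quadratic features of the encoded window.
    rng = np.random.default_rng(0)
    X_win = rng.normal(size=(500, 9))   # 500 residues, toy 9-dim window encoding
    y = rng.integers(0, 3, size=500)    # toy three-state labels 0=H, 1=E, 2=C

    X = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_win)

    # One two-class discriminant per class (class vs. rest); its decision value
    # is read as the propensity for that secondary structure type.
    discriminants = [
        LinearDiscriminantAnalysis().fit(X, (y == c).astype(int)) for c in range(3)
    ]

    def propensities(x_quad):
        return np.column_stack([d.decision_function(x_quad) for d in discriminants])

    # Multiclass prediction: compare the propensities and take the largest.
    y_pred = propensities(X).argmax(axis=1)
    print("training accuracy on toy data:", (y_pred == y).mean())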
*SPARROW - a vectorial approach
In the course of 2010, Dawid Rasinski developed SPARROW's successor,
*SPARROW, as part of his Ph.D. project. Like SPARROW, *SPARROW uses
quadratic scoring functions to perform classifications in its first
two stages. However, instead of reducing the multiclass problem to a
set of two-class problems, *SPARROW solves the multiclass problem
directly. For this purpose a vector-valued scoring function is used
in each of the first two stages. Each scalar component of the
vector-valued function is a quadratic function of the sequence. The
number of component functions, and hence the dimension of the
function's output space, is one less than the number of classes in
the classification problem the function is applied to. In this
output space each class is represented by a unique vector, the
so-called class vector. Members of a given class should lie as close
as possible to the corresponding class vector. Training therefore
minimizes the mean squared distance between the outputs of the
vector-valued scoring function for the learning examples (residues)
and the class vectors of the classes these examples belong to. Once
fully trained, the vector-valued scoring function classifies unknown
samples by assigning each of them to the class whose class vector is
closest to the function's output.
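To make this concrete, the sketch below works through a toy
three-class example under assumed encodings (it is not *SPARROW's
actual model): the three class vectors are placed at the corners of
an equilateral triangle in a two-dimensional output space, the
vector-valued quadratic scoring function is fitted by least squares,
which minimizes the mean squared distance of the outputs to the
class vectors of their targets, and unknown samples are assigned to
the nearest class vector.

    import numpy as np

    rng = np.random.default_rng(0)
    X_win = rng.normal(size=(500, 6))   # toy window encoding, 500 residues
    y = rng.integers(0, 3, size=500)    # toy three-state labels

    def quad(X):
        """Quadratic feature map: bias, linear terms and all pairwise products."""
        cross = np.einsum("ni,nj->nij", X, X)
        iu = np.triu_indices(X.shape[1])
        return np.hstack([np.ones((len(X), 1)), X, cross[:, iu[0], iu[1]]])

    # K = 3 classes represented in a (K - 1) = 2 dimensional output space by
    # class vectors placed at the corners of an equilateral triangle.
    angles = 2 * np.pi * np.arange(3) / 3
    CLASS_VECTORS = np.column_stack([np.cos(angles), np.sin(angles)])   # (3, 2)

    # Training: least-squares fit of the vector-valued scoring function, i.e.
    # minimize the mean squared distance between its outputs and the class
    # vectors of the corresponding targets.
    Phi = quad(X_win)
    W, *_ = np.linalg.lstsq(Phi, CLASS_VECTORS[y], rcond=None)

    def predict(X_new):
        """Assign each sample to the class whose class vector is nearest."""
        out = quad(X_new) @ W                                   # (n, 2) outputs
        d2 = ((out[:, None, :] - CLASS_VECTORS[None]) ** 2).sum(axis=-1)
        return d2.argmin(axis=1)

    print("training accuracy on toy data:", (predict(X_win) == y).mean())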