This page provides a brief description of the predictor and the data set that was used to train it. More details can be found in:

Exploring sequence characteristics related to high-level production of secreted proteins in Aspergillus niger. B.A. van den Berg, M.J.T. Reinders, M. Hulsman, L. Wu, H.J. Pel, J.A. Roubos, D. de Ridder (2012) PLoS ONE, Volume 7, Issue 10, e45869.

Data set

The predictor was trained on a set of 345 Aspergillus niger proteins. All proteins in the data set have a predicted signal peptide (SignalP 3.0) and do not contain an ER retention signal or predicted transmembrane regions (TMHMM, Phobius). The proteins were overexpressed behind a strong constitutive promoter to test for successful high-level production and secretion. A protein was labeled successful if it gave a visible band on gel, which corresponds to a detection level of around 50 mg/L. A table with the protein names and assigned labels (neg for unsuccessful, pos for successful) and a FASTA file with the protein sequences can be found in the download section.
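As an illustration of how the two download files might be combined, here is a minimal Python sketch. The file layout is an assumption, not something this page specifies: the label table is taken to be whitespace-separated with the protein name in the first column and pos or neg in the second, and the FASTA identifiers are assumed to match those names. The function names read_fasta and read_labels are hypothetical.

```python
import io


def read_fasta(handle):
    """Return {identifier: sequence} from a FASTA-formatted text handle."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the identifier.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def read_labels(handle):
    """Return {protein name: True for pos, False for neg}.

    Assumed layout (hypothetical): name and label separated by whitespace.
    Lines without a pos/neg label, such as a header row, are skipped.
    """
    labels = {}
    for line in handle:
        fields = line.split()
        if len(fields) < 2 or fields[1] not in ("pos", "neg"):
            continue
        labels[fields[0]] = fields[1] == "pos"
    return labels


# Toy stand-ins for the real download files.
fasta_text = ">protA\nMKFS\nLLAA\n>protB\nMRTT\n"
label_text = "protA\tpos\nprotB\tneg\n"

sequences = read_fasta(io.StringIO(fasta_text))
labels = read_labels(io.StringIO(label_text))

# Keep only proteins that have both a sequence and a label.
dataset = [(name, sequences[name], labels[name])
           for name in sequences if name in labels]
```

With the real files, the two io.StringIO objects would be replaced by open file handles.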

Predictive sequence features

Classification was used to find the set of sequence features that best separates the successful from the unsuccessful proteins. A large number of sequence-based features were explored using a linear support vector machine (SVM) as the classifier. The best performance was obtained using either the codon composition or the amino acid composition. 10-fold cross-validation resulted in an average performance of 0.85 area under the ROC curve (AUC).
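To make the two best-performing feature types and the performance measure concrete, the following Python sketch computes an amino acid composition, a codon composition, and the area under the ROC curve. It is an illustration under stated assumptions, not the code used for the predictor: compositions are taken to be relative frequencies over the 20 standard amino acids and the 64 codons, codons are read in frame from the first nucleotide, and the function names are hypothetical. The SVM itself is not shown.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # 64 codons


def amino_acid_composition(protein):
    """Relative frequency of each of the 20 standard amino acids."""
    counts = Counter(protein.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def codon_composition(orf):
    """Relative frequency of each codon, read in frame from position 0."""
    orf = orf.upper()
    triplets = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    known = set(CODONS)
    counts = Counter(t for t in triplets if t in known)
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


def roc_auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative)
    pairs in which the positive scores higher; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs, not taken from the data set.
aa_features = amino_acid_composition("AAC")
codon_features = codon_composition("ATGATGTAA")
auc = roc_auc([True, True, False, False], [0.9, 0.4, 0.5, 0.1])  # 0.75
```

Feature vectors of this kind could be passed to any linear SVM implementation, for example LinearSVC in scikit-learn, and the decision values from each cross-validation fold scored with roc_auc.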