This page provides a brief description of the predictor and the data set that was used to train it. More details can be found in:
The predictor was trained on a set of 345 Aspergillus niger proteins. All proteins in the data set have a predicted signal peptide (SignalP 3.0) and contain neither an ER retention signal nor predicted transmembrane regions (TMHMM, Phobius). The proteins were over-expressed behind a strong constitutive promoter to test for successful high-level production and secretion. A protein was labeled successful if it produced a visible band on gel, which corresponds to a detection level of around 50 mg/l. A table with the protein names and assigned labels (neg for unsuccessful, pos for successful) and a FASTA file with the protein sequences can be found in the download section.
Classification was used to find a set of sequence features that best separates the successful from the unsuccessful proteins. A large number of sequence-based features were explored using a linear support vector machine (SVM) as the classifier. The best performance was obtained using either the codon composition or the amino acid composition. 10-fold cross-validation resulted in an average performance of 0.85 area under the ROC curve.
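
As an illustration of the two best-performing feature types and of the reported performance measure, the following minimal Python sketch computes an amino acid composition, a codon composition, and the area under the ROC curve. It is not the code used to build the predictor. The function names are hypothetical, and the sketch assumes that a composition is the relative frequency of each symbol, that the codon composition is computed from an in-frame coding DNA sequence, and that labels are the strings pos and neg as in the label table.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                       # 20 standard amino acids
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]   # 64 codons


def amino_acid_composition(protein):
    """Relative frequency of each standard amino acid (assumed normalization)."""
    counts = Counter(protein.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def codon_composition(cds):
    """Relative frequency of each codon in an in-frame coding DNA sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3          # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)    # skips codons with ambiguous bases
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    'pos' protein scores higher than a randomly chosen 'neg' protein,
    counting ties as one half."""
    pos = [s for l, s in zip(labels, scores) if l == "pos"]
    neg = [s for l, s in zip(labels, scores) if l == "neg"]
    if not pos or not neg:
        raise ValueError("need at least one pos and one neg example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Each protein then becomes a fixed-length vector (20 values for amino acids, 64 for codons) that can be passed to any linear SVM implementation. In a 10-fold cross-validation, roc_auc would be evaluated on the held-out fold each time and the ten values averaged.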