Molecular Biology and Genetics
EJB Electronic Journal of Biotechnology ISSN: 0717-3458
© 1998 by Universidad Católica de Valparaíso -- Chile
BIP RESEARCH ARTICLE

Using a neural network to backtranslate amino acid sequences

Gilbert White
Department of Biological Sciences
Clark Atlanta University
223 James Brawley Dr., S.W.
Atlanta, GA 30134 USA

William Seffens*
Department of Biological Sciences and Center for Theoretical Study of Physical Systems
Clark Atlanta University
223 James Brawley Dr., S.W. Atlanta, GA 30134 USA
phone: 404-880-6822 (USA) fax: 404-880-6756 (USA)
E-mail: wseffens@cau.edu

*Corresponding author

Financial Support: This work was supported (or partially supported) by NIH grant GM08247, Research Centers in Minority Institutions award G12RR03062 from the Division of Research Resources, National Institutes of Health and NSF CREST Center for Theoretical Studies of Physical Systems (CTSPS) Cooperative Agreement #HRD-9632844.

Keywords: neural network, genetic code, amino acids, nucleic acids, backtranslation


Neural networks simulate the pattern matching ability of the brain using computer programs. A neural network (NN) was trained on amino and nucleic acid sequences to test the NN's ability to predict a nucleic acid sequence given only an amino acid sequence. A multi-layer backpropagation network of one hidden layer with 5 to 9 neurons was used. Different network configurations were used with varying numbers of input neurons to represent amino acids, while a constant representation was used for the output layer representing nucleic acids. In the best-trained network, 93% of the overall bases, 85% of the degenerate bases, and 100% of the fixed bases were correctly predicted from randomly selected test sequences.

Different NN configurations involving the encoding of amino acids under increasing window sizes were evaluated to predict the behavior of the NN with a significantly larger training set. This genetic data analysis effort will assist in understanding human gene structure. Benefits include computational tools that could backtranslate amino acid sequences more reliably, which is useful for degenerate PCR cloning and may assist in identifying human gene coding sequences (CDS) among the open reading frames in DNA databases.


Degenerate primers or probes, usually designed from partially sequenced peptides or conserved regions on the basis of comparison of several proteins, have been widely used in the polymerase chain reaction (PCR), DNA library screening, or Southern blot analysis. The degenerate nature of the genetic code prevents backtranslation of amino acids into codons with certainty. Numerous statistical studies have established that codon frequencies are not random, and long-range correlations have been found in gene sequences. Due to these observations, a neural network approach may identify sequence patterns in coding regions that could be used to improve the accuracy of backtranslation.

Neural networks are able to form generalizations and can identify patterns in noisy data sets. To list just a few biological applications, neural networks have been used successfully to identify coding regions in genomic DNA, to detect mRNA splice sites, and to predict the secondary structure of proteins. Neural networks have also been used to study the structure of the genetic code. One such network was trained to classify the 61 nucleotide triplets of the genetic code into 20 amino acid categories (Tolstrup et al., 1994). The goal of the research reported here is to apply these successful NN techniques to analyze and generalize codon usage in mRNA sequences beginning at the CDS start site. Local and global patterns of codon usage in genes may be identifiable by neural networks of suitable architecture. This paper reports on NN performance using different sequence encoding schemes.

Materials and methods


Training set:
Human mRNA sequences were obtained from GenBank on the basis of several criteria. Gene sequences were identified by keywords indicating a complete mRNA could be reconstructed. The sequences were downloaded from the NIH web site (http://www.ncbi.nlm.nih.gov/Entrez/).

Binary representations:
In order to train the neural network (NN) it is necessary to formulate an encoding scheme, because the NN accepts only binary numeric inputs and cannot represent nucleic or amino acid sequences directly. Therefore, a binary numeric representation was used to encode the amino acid data. Different encoding schemes were evaluated.

Neural network:
All work with the NN was performed on a Sun SPARCstation™ 20 computer. The NN used was a utility of Partek 2.0b4, called a multi-layer perceptron (MLP). An MLP is an NN with at least three layers: an input layer, one or more hidden layers, and an output layer. Each layer is attached to the next by connection weights that are changed during the training process to reduce the overall error. This allows the network to "learn" patterns in the mRNA sequences. Training was stopped when the change in the total output error became less than 0.1% from the previous iteration. This usually occurred after 500-1200 iterations using the backpropagation learning method. Test sets were assembled to assess the predictive accuracy of the trained NN. The test sets consisted of 3 randomly selected human gene sequences from the same group of sequences from which the training set was selected. The predicted output was measured in 3 categories: the overall percent correct, the percent correct for degenerate bases, and the percent correct for fixed bases. These measures allow the assessment of the various schemes used to encode the amino acids.
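The training loop described above can be sketched as follows. Partek's MLP is not available, so the layer sizes, learning rate, weight initialization, and toy task below are illustrative assumptions; only the stopping rule (relative error change below 0.1%) follows the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_mlp(X, Y, hidden=7, lr=0.5, max_iter=5000, tol=0.001):
    """One-hidden-layer backpropagation MLP; stop when the total output
    error improves by less than 0.1% over the previous iteration."""
    W1 = rng.normal(0.0, 0.5, (X.shape[1], hidden))
    W2 = rng.normal(0.0, 0.5, (hidden, Y.shape[1]))
    errors = []
    for _ in range(max_iter):
        H = sigmoid(X @ W1)            # hidden-layer activations
        O = sigmoid(H @ W2)            # output-layer activations
        err = float(np.sum((Y - O) ** 2))
        if errors and errors[-1] - err < tol * errors[-1]:
            errors.append(err)
            break                      # improvement fell below 0.1%
        errors.append(err)
        # Backpropagate the output error and adjust the connection weights.
        dO = (O - Y) * O * (1 - O)
        dH = (dO @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dO
        W1 -= lr * X.T @ dH
    return W1, W2, errors

# Toy task: learn a 4-bit identity mapping on one-hot inputs.
X = np.eye(4)
W1, W2, errors = train_mlp(X, X)
print(f"stopped after {len(errors)} iterations, final error {errors[-1]:.3f}")
```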


Results


Encoding the amino acids: Different amino acid encoding schemes were examined to determine how the input configuration would affect the prediction accuracy of the networks in backtranslating amino acids into nucleic acids. The simplest and most direct scheme, called "Simple", is a 20-bit representation where each amino acid is represented by a single one and nineteen zeros (Figure 1). Alanine would be 10000000000000000000, and the one shifts to the right alphabetically based on the one-letter abbreviation of the amino acids. Another scheme, called "Simple-Shuffle", is a rearrangement or shuffling of the amino acids in the previous scheme. This tests whether the order of amino acids in the input layer is important, since the composition can be quite different between abundant and rare amino acids.
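The "Simple" scheme can be sketched in a few lines; the alphabet string and function name here are illustrative, not taken from the Partek implementation:

```python
# The 20 standard residues, alphabetical by one-letter abbreviation.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simple_encode(residue):
    """Return the 20-bit one-hot ("Simple") vector for one amino acid."""
    bits = [0] * 20
    bits[AMINO_ACIDS.index(residue)] = 1
    return bits

# Alanine ('A') is first alphabetically, so its one sits in position 1.
print("".join(map(str, simple_encode("A"))))  # -> 10000000000000000000
```

The "Simple-Shuffle" variant would differ only in permuting the `AMINO_ACIDS` string.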

Adding degeneracy information: The "simple" representation ignores the nucleic acid bases already known from the genetic code. Degeneracy could be incorporated into the input layer similar to the multiple sensor approach taken by Uberbacher and Mural (1991). Some input neurons could then convey processed information about limited codon choices.
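A sketch of how such degeneracy information could be derived: for each amino acid, determine which codon positions the standard genetic code fixes across all of its synonymous codons. The table construction and function names below are illustrative, not the authors' implementation.

```python
# Build the standard genetic code: codons enumerated in TCAG order, paired
# with the conventional one-letter translation string ('*' marks stops).
bases = "TCAG"
amino = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codons = [b1 + b2 + b3 for b1 in bases for b2 in bases for b3 in bases]
codon_table = dict(zip(codons, amino))

def fixed_positions(aa):
    """True at each codon position whose base is identical across all
    synonymous codons of this amino acid (i.e. not degenerate)."""
    syn = [c for c, a in codon_table.items() if a == aa]
    return [len({c[i] for c in syn}) == 1 for i in range(3)]

print(fixed_positions("M"))  # Met has one codon (ATG) -> [True, True, True]
print(fixed_positions("L"))  # Leu: only the middle base is fixed -> [False, True, False]
```

Input neurons carrying these flags would tell the network which of the three output bases are already determined by the genetic code.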

Binary encoding:
Another way of encoding the amino acids is to form groups that are based on some ordering. The scheme called "Binary-5-bit" is based on all the possible ways that ones and zeros can be combined in a five-bit group. There are 32 such patterns. When the patterns with no ones or all ones, and those with exactly one or exactly four ones, are removed, exactly twenty patterns remain — just enough to code for the 20 amino acids.
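The count can be verified by enumerating the five-bit patterns (a small sketch; variable names are illustrative):

```python
from itertools import product

# All 32 five-bit patterns; keeping only those with two or three ones is
# equivalent to removing the patterns with 0, 1, 4, or 5 ones, and leaves
# C(5,2) + C(5,3) = 10 + 10 = 20 codes for the 20 amino acids.
patterns = ["".join(map(str, bits)) for bits in product((0, 1), repeat=5)]
usable = [p for p in patterns if p.count("1") in (2, 3)]
print(len(patterns), len(usable))  # -> 32 20
```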

Comparing the schemes:
Four NN schemes were used to predict the correct codons given an amino acid sequence. The percent of degenerate bases predicted correctly was used to test the network's ability to backtranslate from amino acid sequences to nucleic acid sequences. The networks were trained, and test sets were used to assess the accuracy of each scheme. The change in predictive accuracy was analyzed as the window size increased, to determine which scheme or schemes would be most efficient with larger training sets. The best scheme at predicting the degenerate bases was Simple, which predicted 85% of the degenerate bases in Training Set 60S-10C (Table 1). All of the schemes predicted 100% of the fixed bases at all window sizes. The largest scheme, with 33 input neurons per amino acid, performed consistently better than the smallest scheme with 5 input neurons per amino acid. There is little difference between Simple and Simple-Shuffle, so the order of amino acids in the input layer is not important. The binary scheme does not perform as well as the unitary schemes at the smaller window sizes, a result also found by other researchers (O'Neill, 1991; Demeler and Zhou, 1991). However, at the largest window there is very little difference between the schemes. This may be because more amino acids are present in the training set, allowing a more complete representation of the genetic code. For the smaller windows not all codons are represented in the training sets, so the genetic code was incompletely represented; this may explain why Simple's accuracy did not exceed 85%. Accuracy decreased as the window size increased for Simple, possibly because the input layer of the NN grew with the window while the hidden layer increased only minimally: under the default settings of the NN, the hidden layer did not grow as fast as the input layer. Overall, all four schemes are capable of backtranslating the degenerate bases with high accuracy from a relatively small training set.
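The three accuracy measures can be sketched as follows, assuming predicted and true base strings plus a per-position degeneracy mask (supplied by hand here for illustration; in practice it would be derived from the genetic code):

```python
def scores(true_bases, pred_bases, degenerate):
    """Percent correct overall, on degenerate positions, and on fixed
    positions. degenerate[i] is True where the genetic code does not
    fix base i."""
    triples = list(zip(true_bases, pred_bases, degenerate))

    def pct(subset):
        return 100.0 * sum(t == p for t, p, _ in subset) / len(subset)

    overall = pct(triples)
    deg = pct([x for x in triples if x[2]])
    fixed = pct([x for x in triples if not x[2]])
    return overall, deg, fixed

# Leucine codon CTG predicted as CTA: the middle base is fixed, the
# first and third are degenerate.
print(scores("CTG", "CTA", [True, False, True]))
# overall 2/3 correct, degenerate 1/2 correct, fixed 1/1 correct
```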

Table 1. Accuracy of NN encoding schemes.

                          Simple   Simple-Shuffle   All-Degeneracy   Binary 5-bit
Bits/amino acid             20           20               33              5
Training set 60S-10C       85%          72%              80%             74%
Training set 60S-15C       72%          77%              77%             74%
Training set 60S-20C       80%          80%              80%             69%
Training set 60S-25C       74%          72%              74%             77%

Shown are the percent of correctly predicted degenerate bases in a test set composed of three sequences selected randomly from the same group of sequences from which the training set was assembled.
The nomenclature for each training set identifies the number of sequences used and the number of codons taken from each sequence. For example, in Training Set 60S-10C there are sixty sequences with a window of ten codons taken from each sequence. Since ten codons were taken from each sequence, there are 600 codons in this set.
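The windowing behind this nomenclature can be sketched as follows; the function name and the toy CDS strings are illustrative assumptions:

```python
def make_training_set(sequences, n_seq, n_codons):
    """Take the first n_codons codons (from the CDS start) of the first
    n_seq sequences, mirroring the 60S-10C style nomenclature."""
    return [s[: 3 * n_codons] for s in sequences[:n_seq]]

# Hypothetical CDSs: 60S-10C yields 60 windows of 30 nt (600 codons total).
cds = ["ATG" + "GCT" * 20 for _ in range(60)]
windows = make_training_set(cds, 60, 10)
print(len(windows), len(windows[0]))  # -> 60 30
```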

Discussion

One of the possible uses of this research is to improve the design of oligonucleotide probes (Eberhardt, 1992). One primer-design study found an overall homology greater than 82% between the predicted probe and the target sequence when codon utilization and dinucleotide frequencies were taken into account (Lathe, 1985). Our best network predicted 85% of the degenerate bases and 93% of the overall bases. The data set used in Lathe's study contained 13,000 nucleotides, while our largest training set had 4,500 nucleotides. Therefore, an increase in our network or training-set size could lead to greater accuracy by detecting patterns of codon choice within the mRNA sequences. The architecture of the amino acid encoding method apparently does not have a large impact on predictive accuracy, as found in this study. Therefore other factors, such as computational time or memory size, may be the criteria used to select an encoding scheme for a larger training set. It is also interesting to note that the network that predicted the highest percentage of correct overall bases did so on a test set containing eight leucines, one arginine, and two serines. These amino acids present difficulties for algorithms based on codon lookup tables, such as Lathe's method or common primer-selection programs (e.g. Nash, 1993). The work reported here demonstrates that a NN approach yields improved predictive accuracy for PCR primer selection.

References


Demeler, B. and Zhou, G. (1991). Neural network optimization for E. coli promoter prediction. Nucleic Acids Research 19:1593-1599.

Eberhardt, N. (1992). A shell program for the design of PCR primers using genetics computer group software. BioTechniques 13:914-916.

Lathe, R. (1985). Synthetic oligonucleotide probes deduced from amino acid sequence data. Journal of Molecular Biology 183:1-12.

Nash, J. (1993). A computer program to calculate and design oligonucleotide primers from amino acid sequences. CABIOS 9:469-471.

O'Neill, M. (1991). Training back-propagation neural networks to define and detect DNA binding sites. Nucleic Acids Research 19:313-318.

Tolstrup, N., Toftgard, J., Englebrecht, J., and Brunak, S. (1994). Neural network model of the genetic code is strongly correlated to the GES scale of amino acid transfer free energies. Journal of Molecular Biology 243:816-820.

Uberbacher, E. and Mural, R. (1991). Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proceedings of the National Academy of Sciences U.S.A. 88:11261-11265.
