Training neural networks to identify coding regions in genomic DNA

Document Type

Conference publication

Publication details

Roberts, L, Reeves, C, Steele, N & King, GJ 1995, 'Training neural networks to identify coding regions in genomic DNA', in EC Ifeachor (ed.), Proceedings of the International Workshop on Medical & Biological Signal Processing, University of Plymouth, Plymouth, UK, pp. 107-112.


The four nitrogenous bases of DNA spell out the recipes from which proteins are made. A gene typically contains five thousand or so bases, but often only a small percentage of these are protein coding. Computer-based prediction systems are increasingly relied upon as submissions to the major genetic databases grow exponentially. Several systems exist to locate coding regions (exons) and noncoding regions (introns) within genomic DNA; the common models used are neural networks and Markov chains (M. Borodovsky and J. McIninch (1993), A. Krogh et al. (1994)). One of the most successful programs is called GRAIL. Currently, two versions of GRAIL are available: GRAIL-I (E. Uberbacher and R. Mural (1991)) and GRAIL-II (Y. Xu et al. (1994)). In GRAIL-I, a neural network receives its inputs from seven statistical measures taken on a 99-base window. Performance is improved in GRAIL-II by the addition of variable-length windows, neural nets trained to locate intron/exon boundaries, and a number of steps designed to evaluate candidate exons and eliminate improbable ones. Both versions of GRAIL predict coding regions in human DNA. A simulation of GRAIL-I was carried out with the goal of improving classification performance without resorting to the additional measures used in GRAIL-II. The intention was then to supplement the resulting module with modules based on physicochemical measures of DNA (such as melting profiles, twist and wedge angles) to enable precise exon prediction in plant sequences.
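
The GRAIL-I arrangement described in the abstract (statistical measures computed on a 99-base window and passed to a neural network) can be illustrated with a minimal Python sketch. Only the window length of 99 comes from the abstract. The five features below (overall G+C fraction, G+C fraction at each codon position, and stop-codon fraction in one reading frame) are hypothetical stand-ins and are not GRAIL's seven measures, which the abstract does not list. The names `window_features`, `sliding_windows` and `TinyNet`, the hidden-layer size, and the random untrained weights are likewise assumptions made for illustration.

```python
import math
import random

WINDOW = 99  # window length stated in the abstract


def window_features(window):
    """Toy statistical measures for one window (assumed; NOT GRAIL's seven)."""
    n = len(window)
    # Overall G+C fraction of the window.
    gc = sum(base in "GC" for base in window) / n
    # G+C fraction at each of the three codon positions, reading frame 0.
    pos_gc = []
    for offset in range(3):
        bases = window[offset::3]
        pos_gc.append(sum(b in "GC" for b in bases) / len(bases))
    # Fraction of frame-0 codons that are stop codons.
    codons = [window[i:i + 3] for i in range(0, n - 2, 3)]
    stops = sum(c in ("TAA", "TAG", "TGA") for c in codons) / len(codons)
    return [gc, *pos_gc, stops]


def sliding_windows(seq, size=WINDOW, step=1):
    """Yield (start, window) pairs for every full-length window in seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


class TinyNet:
    """One-hidden-layer feedforward net with sigmoid units.

    Weights are random and untrained, so the outputs carry no biological
    meaning; the class only shows how features flow to a single score.
    """

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # The extra weight in each row is the bias term.
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def forward(self, x):
        inputs = list(x) + [1.0]  # append constant bias input
        hidden = [self._sigmoid(sum(w * v for w, v in zip(row, inputs)))
                  for row in self.w1]
        hidden.append(1.0)  # bias input for the output unit
        return self._sigmoid(sum(w * v for w, v in zip(self.w2, hidden)))


# Usage: score every 99-base window of a made-up 120-base sequence.
demo_seq = "ATGGCC" * 20
net = TinyNet(n_in=5, n_hidden=4)
scores = [(start, net.forward(window_features(w)))
          for start, w in sliding_windows(demo_seq)]
```

In a real system the network weights would be fitted to windows labelled as coding or noncoding, and each score would be read as the network's estimate that the window lies in an exon. The sketch omits training entirely.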