Subject: comp.ai.neural-nets FAQ, Part 1 of 7: Introduction
Date: Fri, 29 Aug 1997 03:00:12 GMT

Archive-name: ai-faq/neural-nets/part1
URL: ftp://ftp.sas.com/pub/neural/FAQ.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)

Copyright 1997 by Warren S. Sarle, Cary, NC, USA. 

  ---------------------------------------------------------------
    Additions, corrections, or improvements are always welcome.
    If you are willing to contribute any information, please
    email me; if it is relevant, I will incorporate it.

    The monthly posting departs on the 28th of every month.
  ---------------------------------------------------------------

This is the first of seven parts of a monthly posting to the Usenet
newsgroup comp.ai.neural-nets (as well as comp.answers and news.answers,
where it should be findable at any time). Its purpose is to provide basic
information for individuals who are new to the field of neural networks or
who are just beginning to read this group. It will help to avoid lengthy
discussion of questions that often arise for beginners. 

   SO, PLEASE, SEARCH THIS POSTING FIRST IF YOU HAVE A QUESTION
                           and
   DON'T POST ANSWERS TO FAQs: POINT THE ASKER TO THIS POSTING

The latest version of the FAQ is available as a hypertext document, readable
by any WWW (World Wide Web) browser such as Mosaic, under the URL: 
"ftp://ftp.sas.com/pub/neural/FAQ.html".

These postings are archived in the periodic posting archive on host
rtfm.mit.edu (and on some other hosts as well). Look in the anonymous ftp
directory "/pub/usenet/news.answers/ai-faq/neural-nets" under the file names
"part1", "part2", ... "part7". If you do not have anonymous ftp access, you
can access the archives by mail server as well. Send an E-mail message to
mail-server@rtfm.mit.edu with "help" and "index" in the body on separate
lines for more information.

For those of you who read this FAQ anywhere other than in Usenet: To read
comp.ai.neural-nets (or post articles to it) you need Usenet News access.
Try the commands 'xrn', 'rn', 'nn', or 'trn' on your Unix machine, 'news'
on your VMS machine, or ask a local guru. WWW browsers are often set up for
Usenet access, too--try the URL news:comp.ai.neural-nets. 

The FAQ posting departs to comp.ai.neural-nets on the 28th of every month.
It is also sent to the groups comp.answers and news.answers where it should
be available at any time (ask your news manager). The FAQ posting, like any
other posting, may take a few days to find its way over Usenet to your
site. Such delays are especially common outside of North America. 

This FAQ is not meant to discuss any topic exhaustively.

Disclaimer: 

   This posting is provided 'as is'. No warranty whatsoever is expressed or
   implied, in particular, no warranty that the information contained herein
   is correct or useful in any way, although both are intended. 

To find the answer to question "x", search for the string "Subject: x".

========== Questions ========== 
********************************

Part 1: Introduction

   What is this newsgroup for? How shall it be used?
   Where is comp.ai.neural-nets archived?
   May I copy this FAQ?
   What is a neural network (NN)?
   What can you do with an NN and what not?
   Who is concerned with NNs?
   How many kinds of NNs exist?
   How many kinds of Kohonen networks exist? (And what is k-means?)
   How are layers counted?
   What are cases and variables?
   What are the population, sample, training set, design set, validation
   set, and test set?
   How are NNs related to statistical methods?
   What about Genetic Algorithms and Evolutionary Computation?
   What about Fuzzy Logic?

Part 2: Learning

   What is backprop?
   What learning rate should be used for backprop?
   What are conjugate gradients, Levenberg-Marquardt, etc.?
   How should categories be coded?
   Why use a bias/threshold?
   Why use activation functions?
   What is a softmax activation function?
   What is the curse of dimensionality?
   How do MLPs compare with RBFs?
   What are OLS and subset regression?
   Should I normalize/standardize/rescale the data?
   Should I nonlinearly transform the data?
   How to measure importance of inputs?
   What is ART?
   What is PNN?
   What is GRNN?
   What does unsupervised learning learn?

Part 3: Generalization

   How is generalization possible?
   How does noise affect generalization?
   What is overfitting and how can I avoid it?
   What is jitter? (Training with noise)
   What is early stopping?
   What is weight decay?
   What is Bayesian learning?
   How many hidden layers should I use?
   How many hidden units should I use?
   How can generalization error be estimated?
   What are cross-validation and bootstrapping?

Part 4: Books, data, etc.

   Books and articles about Neural Networks?
   Journals and magazines about Neural Networks?
   The most important conferences concerned with Neural Networks?
   Neural Network Associations?
   On-line and machine-readable information about NNs?
   Databases for experimentation with NNs?

Part 5: Free software

   Freeware and shareware packages for NN simulation?

Part 6: Commercial software

   Commercial software packages for NN simulation?

Part 7: Hardware, etc.

   Neural Network hardware?
   How to learn an inverse of a function?
   How to get invariant recognition of images under translation, rotation,
   etc.?
   Unanswered FAQs

------------------------------------------------------------------------

Subject: What is this newsgroup for? How shall it be
====================================================
used?
=====

The newsgroup comp.ai.neural-nets is intended as a forum for people who want
to use or explore the capabilities of Artificial Neural Networks or
Neural-Network-like structures.

Posts should be in plain-text format, not postscript or html or TEX or any
word-processor format. 

There should be the following types of articles in this newsgroup:

1. Requests
+++++++++++

   Requests are articles of the form "I am looking for X", where X
   is something public like a book, an article, or a piece of software. The
   most important thing about such a request is to be as specific as
   possible!

   If several different answers can be expected, the person making the
   request should be prepared to summarize the answers he/she receives and
   should announce the intention to do so with a phrase like "Please reply
   by email, I'll summarize to the group" at the end of the posting.

   The Subject line of the posting should then be something like 
   "Request: X" 

2. Questions
++++++++++++

   As opposed to requests, questions ask for a larger piece of information
   or a more or less detailed explanation of something. To avoid a lot of
   redundant traffic, it is important that the poster provide, along with
   the question, all the information s/he already has about the subject, and
   state the actual question as precisely and narrowly as possible. The
   poster should be prepared to summarize the answers s/he receives and
   should announce the intention to do so with a phrase like "Please reply
   by email, I'll summarize to the group" at the end of the posting.

   The Subject line of the posting should be something like "Question:
   this-and-that", or it should have the form of a question (i.e., end with
   a question mark).

   Students: please do not ask comp.ai.neural-net readers to do your
   homework or take-home exams for you. 

3. Answers
++++++++++

   These are reactions to questions or requests. If an answer is too
   specific to be of general interest, or if a summary was announced with
   the question or request, the answer should be e-mailed to the poster, not
   posted to the newsgroup. 

   Most news-reader software automatically provides a subject line beginning
   with "Re:" followed by the subject of the article which is being
   followed-up. Note that sometimes longer threads of discussion evolve from
   an answer to a question or request. In this case posters should change
   the subject line suitably as soon as the topic goes too far away from the
   one announced in the original subject line. You can still carry along the
   old subject in parentheses in the form "Re: new subject (was:
   old subject)" 

4. Summaries
++++++++++++

   Whenever the answers to a request or question can be assumed to be of
   some general interest, the poster of the request or question should
   summarize the answers he/she receives. Such a summary should be
   announced in the original posting of the question or request with a
   phrase like "Please answer by email, I'll summarize".

   In such a case, people who answer a question should NOT post their
   answers to the newsgroup but instead mail them to the poster of the
   question, who collects and reviews them. About 5 to 20 days after the
   original posting, its poster should summarize the answers and post the
   summary to the newsgroup.

   Some care should be invested in a summary: 
    o simple concatenation of all the answers is not enough; instead,
      redundancies, irrelevancies, verbosities, and errors should be
      filtered out (as well as possible) 
    o the answers should be separated clearly 
    o the contributors of the individual answers should be identifiable
      (unless they requested to remain anonymous [yes, that happens]) 
    o the summary should start with the "quintessence" of the answers, as
      seen by the original poster 
    o A summary should, when posted, clearly be indicated to be one by
      giving it a Subject line starting with "SUMMARY:" 
   Note that a good summary is pure gold for the rest of the newsgroup
   community, so summary work will be most appreciated by all of us. Good
   summaries are more valuable than any moderator! :-) 

5. Announcements
++++++++++++++++

   Some articles never need any public reaction. These are called
   announcements (for instance, of a workshop or conference, or of the
   availability of some technical report or software system).

   Announcements should be clearly indicated to be such by giving them a
   subject line of the form "Announcement: this-and-that" 

6. Reports
++++++++++

   Sometimes people spontaneously want to report something to the newsgroup.
   This might be particular experiences with some software, results of their
   own experiments or conceptual work, or especially interesting information
   from somewhere else.

   Reports should be clearly indicated to be such by giving them a subject
   line of the form "Report: this-and-that" 

7. Discussions
++++++++++++++

   An especially valuable feature of Usenet is of course the possibility of
   discussing a certain topic with hundreds of potential participants. All
   traffic in the newsgroup that cannot be subsumed under one of the above
   categories should belong to a discussion.

   If somebody explicitly wants to start a discussion, he/she can do so by
   giving the posting a subject line of the form "Discussion:
   this-and-that"

   It is quite difficult to keep a discussion from drifting into chaos, but,
   unfortunately, as many other newsgroups show, there seems to be no sure
   way to avoid this. On the other hand, comp.ai.neural-nets has not had
   many problems with this effect in the past, so let's just go ahead and
   hope... 

8. Job Ads
++++++++++

   Advertisements for jobs requiring expertise in artificial neural networks
   are appropriate in comp.ai.neural-nets. Job ads should be clearly
   indicated to be such by giving them a subject line of the form "Job:
   this-and-that". It is also useful to include the
   country-state-city abbreviations that are conventional in
   misc.jobs.offered, such as: "Job: US-NY-NYC Neural network
   engineer". If an employer has more than one job opening, all such
   openings should be listed in a single post, not multiple posts. Job ads
   should not be reposted more than once per month. 

------------------------------------------------------------------------

Subject: Where is comp.ai.neural-nets archived? 
================================================

The following archives are available for comp.ai.neural-nets: 

 o Deja News at http://www.dejanews.com/ 
 o ftp://ftp.cs.cmu.edu/user/ai/pubs/news/comp.ai.neural-nets 
 o http://asknpac.npac.syr.edu 

   According to Gang Cheng, gcheng@npac.syr.edu, the Northeast Parallel
   Architecture Center (NPAC), Syracuse University, maintains an archive
   system for searching/reading USENET newsgroups and mailing lists. Two
   search/navigation interfaces accessible by any WWW browser are provided:
   one is an advanced search interface allowing queries with various options
   such as query by mail header, by date, by subject (keywords), by sender.
   The other is a Hypermail-like navigation interface for users familiar
   with Hypermail. 

For more information on newsgroup archives, see 
http://starbase.neosoft.com/~claird/news.lists/newsgroup_archives.html 

------------------------------------------------------------------------

Subject: May I copy this FAQ?
=============================

The intent in providing a FAQ is to make the information freely available to
whoever needs it. You may copy all or part of the FAQ, but please be sure to
include a reference to the URL of the master copy,
ftp://ftp.sas.com/pub/neural/FAQ.html, and do not sell copies of the FAQ. If
you want to include information from the FAQ in your own web site, it is
better to include links to the master copy rather than to copy text from the
FAQ to your web pages, because various answers in the FAQ are updated at
unpredictable times. To cite the FAQ in an academic-style bibliography, use
something along the lines of: 

   Sarle, W.S., ed. (1997), Neural Network FAQ, part 1 of 7: Introduction,
   periodic posting to the Usenet newsgroup comp.ai.neural-nets, URL:
   ftp://ftp.sas.com/pub/neural/FAQ.html 

------------------------------------------------------------------------

Subject: What is a neural network (NN)?
=======================================

First of all, when we are talking about a neural network, we should more
properly say "artificial neural network" (ANN), because that is what we mean
most of the time in comp.ai.neural-nets. Biological neural networks are much
more complicated than the mathematical models we use for ANNs. But it is
customary to be lazy and drop the "A" or the "artificial". 

There is no universally accepted definition of an NN. But perhaps most
people in the field would agree that an NN is a network of many simple
processors ("units"), each possibly having a small amount of local memory.
The units are connected by communication channels ("connections") which
usually carry numeric (as opposed to symbolic) data, encoded by any of
various means. The units operate only on their local data and on the inputs
they receive via the connections. The restriction to local operations is
often relaxed during training. 

Some NNs are models of biological neural networks and some are not, but
historically, much of the inspiration for the field of NNs came from the
desire to produce artificial systems capable of sophisticated, perhaps
"intelligent", computations similar to those that the human brain routinely
performs, and thereby possibly to enhance our understanding of the human
brain. 

Most NNs have some sort of "training" rule whereby the weights of
connections are adjusted on the basis of data. In other words, NNs "learn"
from examples (as children learn to recognize dogs from examples of dogs)
and exhibit some capability for generalization beyond the training data. 

NNs normally have great potential for parallelism, since the computations of
the components are largely independent of each other. Some people regard
massive parallelism and high connectivity to be defining characteristics of
NNs, but such requirements rule out various simple models, such as simple
linear regression (a minimal feedforward net with only two units plus bias),
which are usefully regarded as special cases of NNs. 
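
To make the last point concrete, here is a minimal sketch of the FAQ
editor's own (not code from any reference cited here; the data and learning
rate are arbitrary): simple linear regression written as a feedforward net
with a single linear unit, one connection weight plus a bias, trained by the
delta rule, i.e. on-line gradient descent on squared error.

```python
def train_linear_unit(xs, ys, lr=0.01, epochs=1000):
    """Train a one-input, one-output linear unit (plus bias) by the
    delta rule: on-line gradient descent on squared error."""
    w, b = 0.0, 0.0                     # connection weight and bias
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            out = w * x + b             # the unit's output: weighted input sum
            err = y - out
            w += lr * err * x           # delta rule weight update
            b += lr * err               # bias treated as weight on constant 1
    return w, b

# Illustrative data generated by y = 2x + 1; the trained "net" recovers
# the regression coefficients.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = train_linear_unit(xs, ys)
print(round(w, 2), round(b, 2))
```

The weight and bias converge to the ordinary least-squares solution, which
is why such a unit is usefully regarded as a special case of an NN.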

Here is a sampling of definitions from the books on the FAQ maintainer's
shelf. None will please everyone. Perhaps for that reason many NN textbooks
do not explicitly define neural networks. 

According to the DARPA Neural Network Study (1988, AFCEA International
Press, p. 60): 

   ... a neural network is a system composed of many simple processing
   elements operating in parallel whose function is determined by
   network structure, connection strengths, and the processing performed
   at computing elements or nodes. 

According to Haykin, S. (1994), Neural Networks: A Comprehensive
Foundation, NY: Macmillan, p. 2: 

   A neural network is a massively parallel distributed processor that
   has a natural propensity for storing experiential knowledge and
   making it available for use. It resembles the brain in two respects: 

   1. Knowledge is acquired by the network through a learning process. 
   2. Interneuron connection strengths known as synaptic weights are
      used to store the knowledge. 

According to Nigrin, A. (1993), Neural Networks for Pattern Recognition,
Cambridge, MA: The MIT Press, p. 11: 

   A neural network is a circuit composed of a very large number of
   simple processing elements that are neurally based. Each element
   operates only on local information. Furthermore each element operates
   asynchronously; thus there is no overall system clock. 

According to Zurada, J.M. (1992), Introduction To Artificial Neural Systems,
Boston: PWS Publishing Company, p. xv: 

   Artificial neural systems, or neural networks, are physical cellular
   systems which can acquire, store, and utilize experiential knowledge.

For more information on "What is a neural network?", with examples and
diagrams, see Leslie S. Smith's on-line introduction at: 
http://www.cs.stir.ac.uk/~lss/NNIntro/InvSlides.html. 

Another excellent introduction to NNs is Donald Tveter's Backpropagator's
Review at http://www.mcs.com/~drt/bprefs.html, which contains both answers
to additional FAQs and an annotated neural net bibliography emphasizing
on-line articles. 

------------------------------------------------------------------------

Subject: What can you do with an NN and what not?
=================================================

In principle, NNs can compute any computable function, i.e. they can do
everything a normal digital computer can do. 

In practice, NNs are especially useful for classification and function
approximation/mapping problems which are tolerant of some imprecision, which
have lots of training data available, but to which hard and fast rules (such
as those that might be used in an expert system) cannot easily be applied.
Almost any mapping between vector spaces can be approximated to arbitrary
precision by feedforward NNs (which are the type most often used in
practical applications) if you have enough data and enough computing
resources. 
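
As a toy demonstration of such approximation (again a sketch of the
editor's own; the architecture, constants, and function names are arbitrary
choices, not from any reference cited here), the following trains a small
one-hidden-layer sigmoid network by plain on-line backprop to fit
y = x^2 on [-1, 1]:

```python
import math
import random

def fit_mlp(xs, ys, n_hidden=5, lr=0.1, epochs=3000, seed=0):
    """One-hidden-layer feedforward net (sigmoid hidden units, linear
    output unit) trained by on-line backprop on squared error."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]   # input->hidden weights
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]   # hidden biases
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]   # hidden->output weights
    b2 = rng.uniform(-1, 1)                              # output bias

    def forward(x):
        h = [1.0 / (1.0 + math.exp(-(w * x + b))) for w, b in zip(w1, b1)]
        return h, sum(wo * hj for wo, hj in zip(w2, h)) + b2

    for _ in range(epochs):
        for x, y in zip(xs, ys):
            h, out = forward(x)
            err = out - y
            for j in range(n_hidden):
                # backpropagated error times sigmoid derivative h(1-h)
                grad_h = err * w2[j] * h[j] * (1 - h[j])
                w2[j] -= lr * err * h[j]
                w1[j] -= lr * grad_h * x
                b1[j] -= lr * grad_h
            b2 -= lr * err
    return lambda x: forward(x)[1]

xs = [i / 10.0 for i in range(-10, 11)]
ys = [x * x for x in xs]                 # target function on [-1, 1]
net = fit_mlp(xs, ys)
sse = sum((net(x) - y) ** 2 for x, y in zip(xs, ys))
```

After training, the summed squared error over the 21 sample points is far
smaller than that of the best constant predictor, illustrating (on a tiny
scale) the mapping-approximation claim above.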

To be somewhat more precise, feedforward networks with a single hidden layer
are statistically consistent estimators of arbitrary measurable,
square-integrable regression functions under certain practically-satisfiable
assumptions regarding sampling, target noise, number of hidden units, size
of weights, and form of hidden-unit activation function (White, 1990). Such
networks can also be trained as statistically consistent estimators of
derivatives of regression functions (White and Gallant, 1992) and quantiles
of the conditional noise distribution (White, 1992a). Feedforward networks
with a single hidden layer using threshold or sigmoid activation functions
are universally consistent estimators of binary classifications (Farag\'o
and Lugosi, 1993; Lugosi and Zeger 1995; Devroye, Gy\"orfi, and Lugosi,
1996) under similar assumptions. 

Unfortunately, the above consistency results depend on one impractical
assumption: that the networks are trained by an error (L_p error or
misclassification rate) minimization technique that comes arbitrarily close
to the global minimum. Such minimization is computationally intractable
except in small or simple problems (Judd, 1990). 

NNs are, at least today, difficult to apply successfully to problems that
concern manipulation of symbols and memory. And there are no methods for
training NNs that can magically create information that is not contained in
the training data. 

As for simulating human consciousness and emotion, that's still in the realm
of science fiction. 

For examples of NN applications, see: 

 o The Pacific Northwest National Laboratory web pages at 
   http://www.emsl.pnl.gov:2080/docs/cie/neural/ including a list of
   commercial applications at 
   http://www.emsl.pnl.gov:2080/docs/cie/neural/products/ 
 o The Stimulation Initiative for European Neural Applications web page at 
   http://www.mbfys.kun.nl/snn/siena/cases/ 
 o Roy Goodacre's web pages on pyrolysis mass spectrometry at 
   http://gepasi.dbs.aber.ac.uk/roy/pymshome.htm and Fourier transform
   infrared (FT-IR) spectroscopy at 
   http://gepasi.dbs.aber.ac.uk/roy/ftir/ftirhome.htm contain applications
   of a variety of NNs as well as PLS (partial least squares) and other
   statistical methods. 
 o The DTI NeuroComputing Web's Applications Portfolio at 
   http://www.globalweb.co.uk/nctt/portfolo/ 
 o The Applications Corner, provided by NeuroDimension, Inc., at 
   http://www.nd.com/appcornr/purpose.htm 
 o The BioComp Systems, Inc. Solutions page at http://www.bio-comp.com 
 o Athanasios Episcopos's web page with References on Neural Net
   Applications to Finance and Economics at 
   http://www.compulink.gr/users/episcopo/neurofin.html 
 o Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY:
   McGraw-Hill, ISBN 0-07-011189-8. 
 o Trippi, R.R. & Turban, E. (1993), Neural Networks in Finance and
   Investing, Chicago: Probus. 
 o The series Advances in Neural Information Processing Systems containing
   proceedings of the conference of the same name, published yearly by
   Morgan Kauffman starting in 1989. 

There is an on-line application of a Kohonen network with a 2-dimensional
output layer for prediction of protein secondary structure percentages from
UV circular dichroism spectra. According to J.J. Merelo: 

   You only need to submit 41 CD values ranging from 200 nm to 240 nm
   (given in deg cm^2 dmol^-1 multiplied by 0.001) and the k2d server
   gives back the estimated percentages of helix, beta and rest of
   secondary structure of your protein plus an estimation of the
   accuracy of the prediction. 

The address of the k2d server is http://kal-el.ugr.es/k2d/spectra.html. The
home page of the k2d program is at http://kal-el.ugr.es/k2d/k2d.html or 
http://www.embl-heidelberg.de/~andrade/k2d.html. 

References: 

   Devroye, L., Gy\"orfi, L., and Lugosi, G. (1996), A Probabilistic Theory
   of Pattern Recognition, NY: Springer. 

   Farag\'o, A. and Lugosi, G. (1993), "Strong Universal Consistency of
   Neural Network Classifiers," IEEE Transactions on Information Theory, 39,
   1146-1151. 

   Judd, J.S. (1990), Neural Network Design and the Complexity of
   Learning, Cambridge, MA: The MIT Press. 

   Lugosi, G., and Zeger, K. (1995), "Nonparametric Estimation via Empirical
   Risk Minimization," IEEE Transactions on Information Theory, 41, 677-678.

   White, H. (1990), "Connectionist Nonparametric Regression: Multilayer
   Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3,
   535-550. Reprinted in White (1992b). 

   White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles
   Using Neural Networks," in Page, C. and Le Page, R. (eds.), Proceedings
   of the 23rd Symposium on the Interface: Computing Science and Statistics,
   Alexandria, VA: American Statistical Association, pp. 190-199. Reprinted
   in White (1992b). 

   White, H. (1992b), Artificial Neural Networks: Approximation and
   Learning Theory, Blackwell. 

   White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an
   Unknown Mapping with Multilayer Feedforward Networks," Neural Networks,
   5, 129-138. Reprinted in White (1992b). 

------------------------------------------------------------------------

Subject: Who is concerned with NNs?
===================================

Neural Networks are of interest to quite a lot of very different people: 

 o Computer scientists want to find out about the properties of non-symbolic
   information processing with neural nets and about learning systems in
   general. 
 o Statisticians use neural nets as flexible, nonlinear regression and
   classification models. 
 o Engineers of many kinds exploit the capabilities of neural networks in
   many areas, such as signal processing and automatic control. 
 o Cognitive scientists view neural networks as a possible apparatus to
   describe models of thinking and consciousness (High-level brain
   function). 
 o Neuro-physiologists use neural networks to describe and explore
   medium-level brain function (e.g. memory, sensory system, motorics). 
 o Physicists use neural networks to model phenomena in statistical
   mechanics and for a lot of other tasks. 
 o Biologists use Neural Networks to interpret nucleotide sequences. 
 o Philosophers and some other people may also be interested in Neural
   Networks for various reasons. 

For world-wide lists of groups doing research on NNs, see the Foundation for
Neural Networks (SNN) page at 
http://www.mbfys.kun.nl/snn/pointers/groups.html and see Neural Networks
Research on the IEEE Neural Network Council's homepage 
http://www.ieee.org/nnc. 

------------------------------------------------------------------------

Subject: How many kinds of NNs exist?
=====================================

There are many kinds of NNs by now. Nobody knows exactly how many. New
ones (or at least variations of existing ones) are invented every week.
Below is a collection of some of the best-known methods, with no claim
of completeness.

The main categorization of these methods is the distinction between
supervised and unsupervised learning: 

 o In supervised learning, there is a "teacher" who in the learning phase
   "tells" the net how well it performs ("reinforcement learning") or what
   the correct behavior would have been ("fully supervised learning"). 
 o In unsupervised learning the net is autonomous: it just looks at the data
   it is presented with, finds out about some of the properties of the data
   set, and learns to reflect these properties in its output. What exactly
   these properties are that the network can learn to recognise depends on
   the particular network model and learning method. Usually, the net learns
   some compressed representation of the data. 

Many of these learning methods are closely connected with a certain (class
of) network topology.

Now here is the list, just giving some names:

1. UNSUPERVISED LEARNING (i.e. without a "teacher"):
     1). Feedback Nets:
        a). Additive Grossberg (AG)
        b). Shunting Grossberg (SG)
        c). Binary Adaptive Resonance Theory (ART1)
        d). Analog Adaptive Resonance Theory (ART2, ART2a)
        e). Discrete Hopfield (DH)
        f). Continuous Hopfield (CH)
        g). Discrete Bidirectional Associative Memory (BAM)
        h). Temporal Associative Memory (TAM)
        i). Adaptive Bidirectional Associative Memory (ABAM)
        j). Kohonen Self-organizing Map/Topology-preserving map (SOM/TPM)
        k). Competitive learning
     2). Feedforward-only Nets:
        a). Learning Matrix (LM)
        b). Driver-Reinforcement Learning (DR)
        c). Linear Associative Memory (LAM)
        d). Optimal Linear Associative Memory (OLAM)
        e). Sparse Distributed Associative Memory (SDM)
        f). Fuzzy Associative Memory (FAM)
        g). Counterpropagation (CPN)

2. SUPERVISED LEARNING (i.e. with a "teacher"):
     1). Feedback Nets:
        a). Brain-State-in-a-Box (BSB)
        b). Fuzzy Cognitive Map (FCM)
        c). Boltzmann Machine (BM)
        d). Mean Field Annealing (MFT)
        e). Recurrent Cascade Correlation (RCC)
        f). Backpropagation through time (BPTT)
        g). Real-time recurrent learning (RTRL)
        h). Recurrent Extended Kalman Filter (EKF)
     2). Feedforward-only Nets:
        a). Perceptron
        b). Adaline, Madaline
        c). Backpropagation (BP)
        d). Cauchy Machine (CM)
        e). Adaptive Heuristic Critic (AHC)
        f). Time Delay Neural Network (TDNN)
        g). Associative Reward Penalty (ARP)
        h). Avalanche Matched Filter (AMF)
        i). Backpercolation (Perc)
        j). Artmap
        k). Adaptive Logic Network (ALN)
        l). Cascade Correlation (CasCor)
        m). Extended Kalman Filter (EKF)
        n). Learning Vector Quantization (LVQ)
        o). Probabilistic Neural Network (PNN)
        p). General Regression Neural Network (GRNN) 

------------------------------------------------------------------------

Subject: How many kinds of Kohonen networks exist?
==================================================
(And what is k-means?)
======================

Teuvo Kohonen is one of the most famous and prolific researchers in
neurocomputing, and he has invented a variety of networks. But many people
refer to "Kohonen networks" without specifying which kind of Kohonen
network, and this lack of precision can lead to confusion. The phrase
"Kohonen network" most often refers to one of the following three types of
networks: 

 o VQ: Vector Quantization--competitive networks that can be viewed as
   unsupervised density estimators or autoassociators (Kohonen, 1995;
   Hecht-Nielsen 1990), closely related to k-means cluster analysis
   (MacQueen, 1967; Anderberg, 1973). Each competitive unit corresponds to a
   cluster, the center of which is called a "codebook vector". Kohonen's
   learning law is an on-line algorithm that finds the codebook vector
   closest to each training case and moves the "winning" codebook vector
   closer to the training case. The codebook vector is moved a certain
   proportion of the distance between it and the training case, the
   proportion being specified by the learning rate. Numerous similar
   algorithms have been developed in the neural net and machine learning
   literature; see Hecht-Nielsen (1990) for a brief historical overview, and
   Kosko (1992) for a more technical overview of competitive learning. 

   MacQueen's on-line k-means algorithm is essentially the same as Kohonen's
   learning law except that the learning rate is the reciprocal of the
   number of cases that have been assigned to the winning cluster; this
   reduction of the learning rate makes each codebook vector the mean of its
   cluster and guarantees convergence of the algorithm to an optimum value
   of the error function (the sum of squared Euclidean distances between
   cases and codebook vectors) as the number of training cases goes to
   infinity. Kohonen's learning law with a fixed learning rate does not
   converge. As is well known from stochastic approximation theory,
   convergence requires the sum of the infinite sequence of learning rates
   to be infinite, while the sum of squared learning rates must be finite
   (Kohonen, 1995, p. 34). These requirements are satisfied by MacQueen's
   k-means algorithm. 
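
   The two on-line update rules just described can be sketched as follows
   (an illustration of the editor's own; the data and function names are
   made up, not standard library calls):

```python
def nearest(codebooks, x):
    """Index of the codebook vector closest (squared Euclidean) to case x."""
    return min(range(len(codebooks)),
               key=lambda j: sum((c - xi) ** 2
                                 for c, xi in zip(codebooks[j], x)))

def kohonen_vq(data, codebooks, lr=0.1):
    """Kohonen's learning law: move the winning codebook vector a fixed
    proportion lr of the way toward each training case."""
    for x in data:
        j = nearest(codebooks, x)
        codebooks[j] = [c + lr * (xi - c) for c, xi in zip(codebooks[j], x)]
    return codebooks

def macqueen_kmeans(data, codebooks):
    """MacQueen's on-line k-means: the same update, but the learning rate
    is 1/n_j, where n_j counts the cases assigned to the winner so far,
    so each codebook vector is the running mean of its cluster."""
    counts = [0] * len(codebooks)
    for x in data:
        j = nearest(codebooks, x)
        counts[j] += 1
        codebooks[j] = [c + (xi - c) / counts[j]
                        for c, xi in zip(codebooks[j], x)]
    return codebooks

# Two well-separated pairs of points; codebook vectors seeded at one
# point of each pair.
data = [[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]]
centers = macqueen_kmeans(data, [[0.0, 0.0], [10.0, 10.0]])
# centers is now [[0.1, 0.0], [10.1, 10.0]]: each vector is its cluster mean.
centers_vq = kohonen_vq([[0.0, 0.0], [0.2, 0.0], [10.0, 10.0], [10.2, 10.0]],
                        [[0.0, 0.0], [10.0, 10.0]])
```

   Note that after one pass, MacQueen's rule leaves each codebook vector at
   the exact mean of its cluster, while Kohonen's fixed-rate rule leaves it
   somewhere between its start and the last case it won.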

   Kohonen VQ is often used for off-line learning, in which case the
   training data are stored and Kohonen's learning law is applied to each
   case in turn, cycling over the data set many times (incremental
   training). Convergence to a local optimum can be obtained as the training
   time goes to infinity if the learning rate is reduced in a suitable
   manner as described above. However, there are off-line k-means
   algorithms, both batch and incremental, that converge in a finite number
   of iterations (Anderberg, 1973; Hartigan, 1975; Hartigan and Wong, 1979).
   Batch algorithms such as Forgy's (1965; Anderberg, 1973) have an
   advantage for large data sets, since incremental methods require you
   either to store the cluster membership of each case or to do two
   nearest-cluster computations as each case is processed. Fastest training
   is usually obtained if MacQueen's on-line algorithm is used for the first
   pass and off-line k-means algorithms are applied on subsequent passes.
   However, these training methods do not necessarily converge to a global
   optimum of the error function. The chance of finding a global optimum can
   be improved by using rational initialization (SAS Institute, 1989, pp.
   824-825), multiple random initializations, or various time-consuming
   training methods intended for global optimization (Ismail and Kamel,
   1989; Zeger, Vaisey, and Gersho, 1992). 

   VQ has been a popular topic in the signal processing literature, which
   has been largely separate from the literature on Kohonen networks and
   from the cluster analysis literature in statistics and taxonomy. In
   signal processing, on-line methods such as Kohonen's and MacQueen's are
   called "adaptive vector quantization" (AVQ), while off-line k-means
   methods go by the names of "Lloyd-Max" (Lloyd, 1982; Max, 1960) and "LBG"
   (Linde, Buzo, and Gray, 1980). There is a recent textbook on VQ by Gersho
   and Gray (1992) that summarizes these algorithms as information
   compression methods. 

   Kohonen's work emphasized VQ as density estimation and hence the
   desirability of equiprobable clusters (Kohonen 1984; Hecht-Nielsen 1990).
   However, Kohonen's learning law does not produce equiprobable
   clusters--that is, the proportions of training cases assigned to each
   cluster are not usually equal. If there are I inputs and the number of
   clusters is large, the density of the codebook vectors approximates the 
   I/(I+2) power of the density of the training data (Kohonen, 1995, p.
   35; Ripley, 1996, p. 202; Zador, 1982), so the clusters are approximately
   equiprobable only if the data density is uniform or the number of inputs
   is large. The most popular method for obtaining equiprobability is
   Desieno's (1988) algorithm which adds a "conscience" value to each
   distance prior to the competition. The conscience value for each cluster
   is adjusted during training so that clusters that win more often have
   larger conscience values and are thus handicapped to even out the
   probabilities of winning in later iterations. 
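
   The conscience mechanism can be sketched as follows. This is a
   hypothetical Python fragment in the spirit of Desieno's algorithm, not
   his published version; the bias factor c and the frequency-update
   constant beta are illustrative values:

```python
import numpy as np

def conscience_winner(codebook, x, win_freq, c=10.0):
    """Pick a winner using distances handicapped by a conscience value.

    Clusters that have won more often than their fair share (1/k) get a
    positive handicap, evening out the probabilities of winning."""
    k = len(codebook)
    dist = ((codebook - x) ** 2).sum(axis=1)
    bias = c * (win_freq - 1.0 / k)  # positive for frequent winners
    return int(np.argmin(dist + bias))

def update_win_freq(win_freq, winner, beta=0.01):
    """Track each cluster's winning frequency with a running average."""
    target = np.zeros_like(win_freq)
    target[winner] = 1.0
    return win_freq + beta * (target - win_freq)
```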

   Kohonen's learning law is an approximation to the k-means model, which is
   an approximation to normal mixture estimation by maximum likelihood
   assuming that the mixture components (clusters) all have spherical
   covariance matrices and equal sampling probabilities. Hence if the
   population contains clusters that are not equiprobable, k-means will tend
   to produce sample clusters that are more nearly equiprobable than the
   population clusters. Corrections for this bias can be obtained by
   maximizing the likelihood without the assumption of equal sampling
   probabilities (Symons, 1981). Such corrections are similar to conscience
   but have the opposite effect. 

   In cluster analysis, the purpose is not to compress information but to
   recover the true cluster memberships. K-means differs from mixture models
   in that, for k-means, the cluster membership for each case is considered a
   separate parameter to be estimated, while mixture models estimate a
   posterior probability for each case based on the means, covariances, and
   sampling probabilities of each cluster. Balakrishnan, Cooper, Jacob, and
   Lewis (1994) found that k-means algorithms recovered cluster membership
   more accurately than Kohonen VQ. 

 o SOM: Self-Organizing Map--competitive networks that provide a
   "topological" mapping from the input space to the clusters (Kohonen,
   1995). The SOM was inspired by the way in which various human sensory
   impressions are neurologically mapped into the brain such that spatial or
   other relations among stimuli correspond to spatial relations among the
   neurons. In a SOM, the neurons (clusters) are organized into a
   grid--usually two-dimensional, but sometimes one-dimensional or (rarely)
   three- or more-dimensional. A SOM tries to find clusters such that any
   two clusters that are close to each other in the grid have codebook
   vectors close to each other in the input space. But the converse does not
   hold: codebook vectors that are close to each other in the input space do
   not necessarily correspond to clusters that are close to each other in
   the grid. Another way to look at this is that a SOM tries to embed the
   grid in the input space such that every training case is close to some
   codebook vector, but the grid is bent or stretched as little as possible.
   The best way to understand this is to look at the pictures in Kohonen
   (1995) or various other NN textbooks. 

   A SOM works by smoothing the codebook vectors in a manner somewhat
   similar to kernel estimation methods, but the smoothing is done in
   neighborhoods in the grid space rather than in the input space (Mulier
   and Cherkassky 1995). Kohonen's algorithm is heuristic, and it is not
   clear exactly what a Kohonen SOM learns, but recently some new approaches
   to SOMs have been developed that have better theoretical justification;
   see "What does unsupervised learning learn?" 

   It is important to shrink the smoothing neighborhoods during training. If
   you do not start with large neighborhoods and shrink them during
   training, the network can easily get stuck in bad local optima. But
   Kohonen (1995) is not clear on whether the neighborhoods should shrink to
   zero. On p. 80 he says that the final neighborhoods "can" contain the
   nearest neighbors, but on p. 128, regarding the batch SOM algorithm, he
   says that the final neighborhoods "may" contain only the single cluster.
   But in the latter case, as Kohonen points out, the SOM is basically a
   very fancy initialization algorithm for batch k-means, and you could lose
   the topological mapping properties of the SOM (Kohonen, 1995, p. 111). 
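
   The basic on-line SOM update, with a neighborhood that shrinks during
   training, can be sketched as follows. This is a minimal Python
   illustration; the Gaussian neighborhood function and the particular
   decay schedules are common choices, not the only ones:

```python
import numpy as np

def train_som(data, grid_shape, n_epochs=20, lr0=0.5, seed=0):
    """On-line SOM on a 2-D grid with a shrinking Gaussian neighborhood."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Grid coordinates of each node; neighborhood distances are measured
    # in this grid space, not in the input space.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    codebook = rng.normal(size=(rows * cols, data.shape[1]))
    sigma0 = max(rows, cols) / 2.0      # initial neighborhood radius
    n_steps = n_epochs * len(data)
    t = 0
    for _ in range(n_epochs):
        for x in data:
            frac = t / n_steps
            sigma = sigma0 * (0.01 / sigma0) ** frac  # shrink toward ~0
            lr = lr0 * (1.0 - frac)                   # reduce learning rate
            # Winner found in input space; neighbors smoothed in grid space.
            w = np.argmin(((codebook - x) ** 2).sum(axis=1))
            d2 = ((coords - coords[w]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            codebook += lr * h[:, None] * (x - codebook)
            t += 1
    return codebook
```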

   In a SOM, as in VQ, it is necessary to reduce the learning rate during
   training to obtain convergence. Greg Heath has commented in this regard: 

   I favor separate learning rates for each winning SOM node (or k-means
   cluster) in the form 1/(N_0i + N_i + 1), where N_i is the
   count of vectors that have caused node i to be a winner and N_0i
   is an initializing count that indicates the confidence in the initial
   weight vector assignment. The winning node expression is based on
   stochastic estimation convergence constraints and pseudo-Bayesian
   estimation of mean vectors. Kohonen derived a heuristic recursion
   relation for the "optimal" rate. To my surprise, when I solved the
   recursion relation I obtained the same above expression that I've
   been using for years. 

   In addition, I have had success using the similar form 
   (1/n)/(N_0j + N_j + (1/n)) for the n nodes in the
   shrinking updating-neighborhood. Before the final "winners-only"
   stage when neighbors are no longer updated, the number of updating
   neighbors eventually shrinks to n = 6 or 8 for hexagonal or
   rectangular neighborhoods, respectively. 

   Kohonen's neighbor-update formula is more precise, replacing my
   constant fraction (1/n) with a node-pair-specific h_ij (h_ij
   < 1). However, as long as the initial neighborhood is sufficiently
   large, the shrinking rate is sufficiently slow, and the final
   winner-only stage is sufficiently long, the results should be
   relatively insensitive to the exact form of h_ij. 

Kohonen (1995, p. VII) says that SOMs are not intended for pattern
recognition but for clustering, visualization, and abstraction. Kohonen has
used a "supervised SOM" (1995, pp. 160-161) that is similar to
counterpropagation (Hecht-Nielsen 1990), but he seems to prefer LVQ (see
below) for supervised classification. Many people continue to use SOMs for
classification tasks, sometimes with surprisingly (I am tempted to say
"inexplicably") good results (Cho, 1997). 

o LVQ: Learning Vector Quantization--competitive networks for supervised
classification (Kohonen, 1988, 1995; Ripley, 1996). Each codebook vector is
assigned to one of the target classes. Each class may have one or more
codebook vectors. A case is classified by finding the nearest codebook
vector and assigning the case to the class corresponding to the codebook
vector. Hence LVQ is a kind of nearest-neighbor rule. 

Ordinary VQ methods, such as Kohonen's VQ or k-means, can easily be used for
supervised classification. Simply count the number of training cases from
each class assigned to each cluster, and divide by the total number of cases
in the cluster to get the posterior probability. For a given case, output
the class with the greatest posterior probability--i.e. the class that forms
a majority in the nearest cluster. Such methods can provide universally
consistent classifiers (Devroye et al., 1996) even when the codebook vectors
are obtained by unsupervised algorithms. LVQ tries to improve on this
approach by adapting the codebook vectors in a supervised way. There are
several variants of LVQ--called LVQ1, OLVQ1, LVQ2, and LVQ3--based on
heuristics. However, a smoothed version of LVQ can be trained as a
feedforward network using a NRBFEQ architecture (see "How do MLPs compare
with RBFs?") and optimizing any of the usual error functions; as the width
of the RBFs goes to zero, the NRBFEQ network approaches an optimized LVQ
network. 
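
The counting scheme described above for turning an unsupervised codebook
into a classifier can be sketched as follows; a minimal Python
illustration (the function names are made up for the example):

```python
import numpy as np

def fit_vq_classifier(codebook, X_train, y_train, n_classes):
    """Count the training cases from each class assigned to each cluster."""
    counts = np.zeros((len(codebook), n_classes))
    for x, y in zip(X_train, y_train):
        i = np.argmin(((codebook - x) ** 2).sum(axis=1))
        counts[i, y] += 1
    return counts

def classify(codebook, counts, x):
    """Output the majority class of the nearest cluster."""
    i = np.argmin(((codebook - x) ** 2).sum(axis=1))
    return int(np.argmax(counts[i]))
```

The counts in each row, divided by the row total, are the posterior class
probabilities for that cluster.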

There are several other kinds of Kohonen networks described in Kohonen
(1995), including: 

 o DEC--Dynamically Expanding Context 
 o LSM--Learning Subspace Method 
 o ASSOM--Adaptive Subspace SOM 
 o FASSOM--Feedback-controlled Adaptive Subspace SOM 
 o Supervised SOM 
 o LVQ-SOM 

More information on the error functions (or absence thereof) used by Kohonen
VQ and SOM is provided under "What does unsupervised learning learn?" 

For more on-line information on Kohonen networks and other varieties of
SOMs, see: 

 o The web page of The Neural Networks Research Centre, Helsinki University
   of Technology, at http://nucleus.hut.fi/nnrc/ 
 o Akio Utsugi's web page on Bayesian SOMs at the National Institute of
   Bioscience and Human-Technology, Agency of Industrial Science and
   Technology, M.I.T.I., 1-1, Higashi, Tsukuba, Ibaraki, 305 Japan, at 
   http://www.aist.go.jp/NIBH/~b0616/Lab/index-e.html 
 o The GTM (generative topographic mapping) home page at the Neural
   Computing Research Group, Aston University, Birmingham, UK, at 
   http://www.ncrg.aston.ac.uk/GTM/ 

References: 

   Anderberg, M.R. (1973), Cluster Analysis for Applications, New York:
   Academic Press, Inc. 

   Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
   study of the classification capabilities of neural networks using
   unsupervised learning: A comparison with k-means clustering",
   Psychometrika, 59, 509-525. 

   Cho, S.-B. (1997), "Self-organizing map with dynamical node-splitting:
   Application to handwritten digit recognition," Neural Computation, 9,
   1345-1355. 

   Desieno, D. (1988), "Adding a conscience to competitive learning," Proc.
   Int. Conf. on Neural Networks, I, 117-124, IEEE Press. 

   Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory
   of Pattern Recognition, NY: Springer. 

   Forgy, E.W. (1965), "Cluster analysis of multivariate data: Efficiency
   versus interpretability," Biometric Society Meetings, Riverside, CA.
   Abstract in Biometrics, 21, 768. 

   Gersho, A. and Gray, R.M. (1992), Vector Quantization and Signal
   Compression, Boston: Kluwer Academic Publishers. 

   Hartigan, J.A. (1975), Clustering Algorithms, NY: Wiley. 

   Hartigan, J.A., and Wong, M.A. (1979), "Algorithm AS136: A k-means
   clustering algorithm," Applied Statistics, 28, 100-108. 

   Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley. 

   Ismail, M.A., and Kamel, M.S. (1989), "Multidimensional data clustering
   utilizing hybrid search strategies," Pattern Recognition, 22, 75-89. 

   Kohonen, T. (1984), Self-Organization and Associative Memory, Berlin:
   Springer-Verlag. 

   Kohonen, T. (1988), "Learning Vector Quantization," Neural Networks, 1
   (suppl 1), 303. 

   Kohonen, T. (1995), Self-Organizing Maps, Berlin: Springer-Verlag. 

   Kosko, B.(1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
   N.J.: Prentice-Hall. 

   Linde, Y., Buzo, A., and Gray, R. (1980), "An algorithm for vector
   quantizer design," IEEE Transactions on Communications, 28, 84-95. 

   Lloyd, S. (1982), "Least squares quantization in PCM," IEEE Transactions
   on Information Theory, 28, 129-137. 

   MacQueen, J.B. (1967), "Some Methods for Classification and Analysis of
   Multivariate Observations," Proceedings of the Fifth Berkeley Symposium on
   Mathematical Statistics and Probability, 1, 281-297. 

   Max, J. (1960), "Quantizing for minimum distortion," IEEE Transactions on
   Information Theory, 6, 7-12. 

   Mulier, F. and Cherkassky, V. (1995), "Self-Organization as an Iterative
   Kernel Smoothing Process," Neural Computation, 7, 1165-1177. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks,
   Cambridge: Cambridge University Press. 

   SAS Institute (1989), SAS/STAT User's Guide, Version 6, 4th edition,
   Cary, NC: SAS Institute. 

   Symons, M.J. (1981), "Clustering Criteria and Multivariate Normal
   Mixtures," Biometrics, 37, 35-43. 

   Zador, P.L. (1982), "Asymptotic quantization error of continuous signals
   and the quantization dimension," IEEE Transactions on Information Theory,
   28, 139-149. 

   Zeger, K., Vaisey, J., and Gersho, A. (1992), "Globally optimal vector
   quantizer design by stochastic relaxation," IEEE Transactions on Signal
   Processing, 40, 310-322. 

------------------------------------------------------------------------

Subject: How are layers counted? 
=================================

How to count layers is a matter of considerable dispute. 

 o Some people count layers of units. But of these people, some count the
   input layer and some don't. 

 o Some people count layers of weights. But I have no idea how they count
   skip-layer connections. 

To avoid ambiguity, you should speak of a 2-hidden-layer network, not a
4-layer network (as some would call it) or 3-layer network (as others would
call it). And if the connections follow any pattern other than fully
connecting each layer to the next and to no others, you should carefully
specify the connections. 

------------------------------------------------------------------------

Subject: What are cases and variables?
======================================

A vector of values presented at one time to all the input units of a neural
network is called a "case", "example", "pattern", "sample", etc. The term
"case" will be used in this FAQ because it is widely recognized,
unambiguous, and requires less typing than the other terms. A case may
include not only input values, but also target values and possibly other
information. 

A vector of values presented at different times to a single input unit is
often called an "input variable" or "feature". To a statistician, it is a
"predictor", "regressor", "covariate", "independent variable", "explanatory
variable", etc. A vector of target values associated with a given output
unit of the network during training will be called a "target variable" in
this FAQ. To a statistician, it is usually a "response" or "dependent
variable". 

A "data set" is a matrix containing one or (usually) more cases. In this
FAQ, it will be assumed that cases are rows of the matrix, while variables
are columns. 

Note that the often-used term "input vector" is ambiguous; it can mean
either an input case or an input variable. 

------------------------------------------------------------------------

Subject: What are the population, sample, training set,
=======================================================
design set, validation set, and test set?
=========================================

There seems to be no term in the NN literature for the set of all cases that
you want to be able to generalize to. Statisticians call this set the
"population". Neither is there a consistent term in the NN literature for
the set of cases that are available for training and evaluating an NN.
Statisticians call this set the "sample". The sample is usually a subset of
the population. 

(Neurobiologists mean something entirely different by "population,"
apparently some collection of neurons, but I have never found out the exact
meaning. I am going to continue to use "population" in the statistical sense
until NN researchers reach a consensus on some other terms for "population"
and "sample"; I suspect this will never happen.) 

In NN methodology, the sample is often subdivided into "training",
"validation", and "test" sets. The distinctions among these subsets are
crucial, but the terms "validation" and "test" sets are often confused.
There is no book in the NN literature more authoritative than Ripley (1996),
from which the following definitions are taken (p. 354): 

Training set: 
   A set of examples used for learning, that is to fit the parameters
   [weights] of the classifier. 
Validation set: 
   A set of examples used to tune the parameters of a classifier, for
   example to choose the number of hidden units in a neural network. 
Test set: 
   A set of examples used only to assess the performance [generalization] of
   a fully-specified classifier. 

Bishop (1995), another indispensable reference on neural networks, provides
the following explanation (p. 372): 

   Since our goal is to find the network having the best performance on
   new data, the simplest approach to the comparison of different
   networks is to evaluate the error function using data which is
   independent of that used for training. Various networks are trained
   by minimization of an appropriate error function defined with respect
   to a training data set. The performance of the networks is then
   compared by evaluating the error function using an independent 
   validation set, and the network having the smallest error with
   respect to the validation set is selected. This approach is called
   the hold out method. Since this procedure can itself lead to some
   overfitting to the validation set, the performance of the selected
   network should be confirmed by measuring its performance on a third
   independent set of data called a test set. 

The crucial point is that a test set, by definition, is never used to
choose among two or more networks, so that the error on the test set
provides an unbiased estimate of the generalization error (assuming that the
test set is representative of the population, etc.). Any data set that is
used to choose the best of two or more networks is, by definition, a
validation set, and the error of the chosen network on the validation set is
optimistically biased. 
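
The three-way division can be sketched as a simple random partition of the
sample; a minimal Python illustration with arbitrary split fractions:

```python
import numpy as np

def split_sample(n_cases, frac_train=0.5, frac_val=0.25, seed=0):
    """Partition case indices into training, validation, and test sets.

    Fit the weights on the training set, choose among networks using the
    validation set, and use the test set only once, to estimate the
    generalization error of the final chosen network."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_cases)
    n_train = int(frac_train * n_cases)
    n_val = int(frac_val * n_cases)
    return (idx[:n_train],                    # training set
            idx[n_train:n_train + n_val],     # validation set
            idx[n_train + n_val:])            # test set
```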

There is a problem with the usual distinction between training and
validation sets. Some training approaches, such as early stopping, require a
validation set, so in a sense, the validation set is used for training.
Other approaches, such as maximum likelihood, do not inherently require a
validation set. So the "training" set for maximum likelihood might encompass
both the "training" and "validation" sets for early stopping. Greg Heath has
suggested the term "design" set be used for cases that are used solely to
adjust the weights in a network, while "training" set be used to encompass
both design and validation sets. There is considerable merit to this
suggestion, but it has not yet been widely adopted. 

But things can get more complicated. Suppose you want to train nets with 5,
10, and 20 hidden units using maximum likelihood, and you want to train
nets with 20 and 50 hidden units using early stopping. You also want to use
a validation set to choose the best of these various networks. Should you
use the same validation set for early stopping that you use for the final
network choice, or should you use two separate validation sets? That is, you
could divide the sample into 3 subsets, say A, B, C and proceed as follows: 

 o Do maximum likelihood using A. 
 o Do early stopping with A to adjust the weights and B to decide when to
   stop (this makes B a validation set). 
 o Choose among all 3 nets trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on B (the
   validation set). 
 o Estimate the generalization error of the chosen network using C (the test
   set). 

Or you could divide the sample into 4 subsets, say A, B, C, and D and
proceed as follows: 

 o Do maximum likelihood using A and B combined. 
 o Do early stopping with A to adjust the weights and B to decide when to
   stop (this makes B a validation set with respect to early stopping). 
 o Choose among all 3 nets trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on C (this makes C
   a second validation set). 
 o Estimate the generalization error of the chosen network using D (the test
   set). 

Or, with the same 4 subsets, you could take a third approach: 

 o Do maximum likelihood using A. 
 o Choose among the 3 nets trained by maximum likelihood based on the error
   computed on B (the first validation set). 
 o Do early stopping with A to adjust the weights and B (the first
   validation set) to decide when to stop. 
 o Choose among the best net trained by maximum likelihood and the 2 nets
   trained by early stopping based on the error computed on C (the second
   validation set). 
 o Estimate the generalization error of the chosen network using D (the test
   set). 

You could argue that the first approach is biased towards choosing a net
trained by early stopping. Early stopping involves a choice among a
potentially large number of networks, and therefore provides more
opportunity for overfitting the validation set than does the choice among
only 3 networks trained by maximum likelihood. Hence if you make the final
choice of networks using the same validation set (B) that was used for early
stopping, you give an unfair advantage to early stopping. If you are writing
an article to compare various training methods, this bias could be a serious
flaw. But if you are using NNs for some practical application, this bias
might not matter at all, since you obtain an honest estimate of
generalization error using C. 

You could also argue that the second and third approaches are too wasteful
in their use of data. This objection could be important if your sample
contains 100 cases, but will probably be of little concern if your sample
contains 100,000,000 cases. For small samples, there are other methods that
make more efficient use of data; see "What are cross-validation and
bootstrapping?" 

References: 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Ripley, B.D. (1996) Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

------------------------------------------------------------------------

Subject: How are NNs related to statistical methods? 
=====================================================

There is considerable overlap between the fields of neural networks and
statistics. Statistics is concerned with data analysis. In neural network
terminology, statistical inference means learning to generalize from noisy
data. Some neural networks are not concerned with data analysis (e.g., those
intended to model biological systems) and therefore have little to do with
statistics. Some neural networks do not learn (e.g., Hopfield nets) and
therefore have little to do with statistics. Some neural networks can learn
successfully only from noise-free data (e.g., ART or the perceptron rule)
and therefore would not be considered statistical methods. But most neural
networks that can learn to generalize effectively from noisy data are
similar or identical to statistical methods. For example: 

 o Feedforward nets with no hidden layer (including functional-link neural
   nets and higher-order neural nets) are basically generalized linear
   models. 
 o Feedforward nets with one hidden layer are closely related to projection
   pursuit regression. 
 o Probabilistic neural nets are identical to kernel discriminant analysis. 
 o Kohonen nets for adaptive vector quantization are very similar to k-means
   cluster analysis. 
 o Hebbian learning is closely related to principal component analysis. 

Some neural network areas that appear to have no close relatives in the
existing statistical literature are: 

 o Kohonen's self-organizing maps. 
 o Reinforcement learning (although this is treated in the operations
   research literature on Markov decision processes). 
 o Stopped training (the purpose and effect of stopped training are similar
   to shrinkage estimation, but the method is quite different). 

Feedforward nets are a subset of the class of nonlinear regression and
discrimination models. Statisticians have studied the properties of this
general class but had not considered the specific case of feedforward neural
nets before such networks were popularized in the neural network field.
Still, many results from the statistical theory of nonlinear models apply
directly to feedforward nets, and the methods that are commonly used for
fitting nonlinear models, such as various Levenberg-Marquardt and conjugate
gradient algorithms, can be used to train feedforward nets. The application
of statistical theory to neural networks is explored in detail by Bishop
(1995) and Ripley (1996). Several summary articles have also been published
relating statistical models to neural networks, including Cheng and
Titterington (1994), Kuan and White (1994), Ripley (1993, 1994), Sarle
(1994), and several articles in Cherkassky, Friedman, and Wechsler (1994).
Among the many statistical concepts important to neural nets is the
bias/variance trade-off in nonparametric estimation, discussed by Geman,
Bienenstock, and Doursat (1992). Some more advanced results of
statistical theory applied to neural networks are given by White (1989a,
1989b, 1990, 1992a) and White and Gallant (1992), reprinted in White
(1992b). 

While neural nets are often defined in terms of their algorithms or
implementations, statistical methods are usually defined in terms of their
results. The arithmetic mean, for example, can be computed by a (very
simple) backprop net, by applying the usual formula SUM(x_i)/n, or by
various other methods. What you get is still an arithmetic mean regardless
of how you compute it. So a statistician would consider standard backprop,
Quickprop, and Levenberg-Marquardt as different algorithms for implementing
the same statistical model, such as a feedforward net. On the other hand,
different training criteria, such as least squares and cross entropy, are
viewed by statisticians as fundamentally different estimation methods with
different statistical properties. 

It is sometimes claimed that neural networks, unlike statistical models,
require no distributional assumptions. In fact, neural networks involve
exactly the same sort of distributional assumptions as statistical models
(Bishop, 1995), but statisticians study the consequences and importance of
these assumptions while many neural networkers ignore them. For example,
least-squares training methods are widely used by statisticians and neural
networkers. Statisticians realize that least-squares training involves
implicit distributional assumptions in that least-squares estimates have
certain optimality properties for noise that is normally distributed with
equal variance for all training cases and that is independent between
different cases. These optimality properties are consequences of the fact
that least-squares estimation is maximum likelihood under those conditions.
Similarly, cross-entropy is maximum likelihood for noise with a Bernoulli
distribution. If you study the distributional assumptions, then you can
recognize and deal with violations of the assumptions. For example, if you
have normally distributed noise but some training cases have greater noise
variance than others, then you may be able to use weighted least squares
instead of ordinary least squares to obtain more efficient estimates. 
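
The weighted least squares idea can be illustrated for a linear model; a
minimal Python sketch (for a neural net, the same weights would multiply
each case's contribution to the error function):

```python
import numpy as np

def weighted_least_squares(X, y, variances):
    """Linear WLS: weight each case by the reciprocal of its noise variance.

    Solves (X'WX) b = X'Wy, which is the maximum-likelihood estimate when
    the noise is normal and independent with known case-specific variances."""
    w = 1.0 / np.asarray(variances)
    Xw = X * w[:, None]   # each row of X scaled by its weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)
```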

Hundreds, perhaps thousands of people have run comparisons of neural nets
with "traditional statistics" (whatever that means). Most such studies
involve one or two data sets, and are of little use to anyone else unless
they happen to be analyzing the same kind of data. But there is an
impressive comparative study of supervised classification by Michie,
Spiegelhalter, and Taylor (1994), and an excellent comparison of
unsupervised Kohonen networks and k-means clustering by Balakrishnan,
Cooper, Jacob, and Lewis (1994). 

Communication between statisticians and neural net researchers is often
hindered by the different terminology used in the two fields. There is a
comparison of neural net and statistical jargon in 
ftp://ftp.sas.com/pub/neural/jargon 

For free statistical software, see the StatLib repository at 
http://lib.stat.cmu.edu/ at Carnegie Mellon University. 

There are zillions of introductory textbooks on statistics. One of the
better ones is Moore and McCabe (1989). At an intermediate level, the books
on linear regression by Weisberg (1985) and Myers (1986), on logistic
regression by Hosmer and Lemeshow (1989), and on discriminant analysis by
Hand (1981) can be recommended. At a more advanced level, the book on
generalized linear models by McCullagh and Nelder (1989) is an essential
reference, and the book on nonlinear regression by Gallant (1987) has much
material relevant to neural nets. 

References: 

   Balakrishnan, P.V., Cooper, M.C., Jacob, V.S., and Lewis, P.A. (1994) "A
   study of the classification capabilities of neural networks using
   unsupervised learning: A comparison with k-means clustering",
   Psychometrika, 59, 509-525. 

   Bishop, C.M. (1995), Neural Networks for Pattern Recognition, Oxford:
   Oxford University Press. 

   Cheng, B. and Titterington, D.M. (1994), "Neural Networks: A Review from
   a Statistical Perspective", Statistical Science, 9, 2-54. 

   Cherkassky, V., Friedman, J.H., and Wechsler, H., eds. (1994), From
   Statistics to Neural Networks: Theory and Pattern Recognition
   Applications, Berlin: Springer-Verlag. 

   Gallant, A.R. (1987) Nonlinear Statistical Models, NY: Wiley. 

   Geman, S., Bienenstock, E. and Doursat, R. (1992), "Neural Networks and
   the Bias/Variance Dilemma", Neural Computation, 4, 1-58. 

   Hand, D.J. (1981) Discrimination and Classification, NY: Wiley. 

   Hill, T., Marquez, L., O'Connor, M., and Remus, W. (1994), "Artificial
   neural network models for forecasting and decision making," International
   J. of Forecasting, 10, 5-15. 

   Kuan, C.-M. and White, H. (1994), "Artificial Neural Networks: An
   Econometric Perspective", Econometric Reviews, 13, 1-91. 

   Kushner, H. & Clark, D. (1978), Stochastic Approximation Methods for
   Constrained and Unconstrained Systems, Springer-Verlag. 

   McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, 2nd
   ed., London: Chapman & Hall. 

   Michie, D., Spiegelhalter, D.J. and Taylor, C.C. (1994), Machine
   Learning, Neural and Statistical Classification, Ellis Horwood. 

   Moore, D.S., and McCabe, G.P. (1989), Introduction to the Practice of
   Statistics, NY: W.H. Freeman. 

   Myers, R.H. (1986), Classical and Modern Regression with Applications,
   Boston: Duxbury Press. 

   Ripley, B.D. (1993), "Statistical Aspects of Neural Networks", in O.E.
   Barndorff-Nielsen, J.L. Jensen and W.S. Kendall, eds., Networks and
   Chaos: Statistical and Probabilistic Aspects, Chapman & Hall. ISBN 0 412
   46530 2. 

   Ripley, B.D. (1994), "Neural Networks and Related Methods for
   Classification," Journal of the Royal Statistical Society, Series B, 56,
   409-456. 

   Ripley, B.D. (1996), Pattern Recognition and Neural Networks, Cambridge:
   Cambridge University Press. 

   Sarle, W.S. (1994), "Neural Networks and Statistical Models," 
   Proceedings of the Nineteenth Annual SAS Users Group International
   Conference, Cary, NC: SAS Institute, pp 1538-1550. (
   ftp://ftp.sas.com/pub/neural/neural1.ps) 

   Weisberg, S. (1985), Applied Linear Regression, NY: Wiley. 

   White, H. (1989a), "Learning in Artificial Neural Networks: A Statistical
   Perspective," Neural Computation, 1, 425-464. 

   White, H. (1989b), "Some Asymptotic Results for Learning in Single Hidden
   Layer Feedforward Network Models", J. of the American Statistical Assoc.,
   84, 1008-1013. 

   White, H. (1990), "Connectionist Nonparametric Regression: Multilayer
   Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks, 3,
   535-550. 

   White, H. (1992a), "Nonparametric Estimation of Conditional Quantiles
   Using Neural Networks," in Page, C. and Le Page, R. (eds.), Computing
   Science and Statistics. 

   White, H. (1992b), Artificial Neural Networks: Approximation and
   Learning Theory, Blackwell. 

   White, H., and Gallant, A.R. (1992), "On Learning the Derivatives of an
   Unknown Mapping with Multilayer Feedforward Networks," Neural Networks,
   5, 129-138. 

------------------------------------------------------------------------

Subject: What about Genetic Algorithms?
=======================================

There are a number of definitions of GA (Genetic Algorithm). A possible one
is

  A GA is an optimization program
  that starts with
  a population of encoded procedures,       (Creation of Life :-> )
  mutates them stochastically,              (Get cancer or so :-> )
  and uses a selection process              (Darwinism)
  to prefer the mutants with high fitness
  and perhaps a recombination process       (Make babies :-> )
  to combine properties of (preferably) the successful mutants.
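
As a concrete sketch of the steps above (not part of the FAQ's definition),
here is a minimal GA over bit strings; the "one-max" fitness function,
population size, mutation rate, and all other parameters are illustrative
choices, not taken from any particular GA package:

```python
import random

def genetic_algorithm(fitness, length=16, pop_size=20, generations=60,
                      mutation_rate=0.05, seed=0):
    """Minimal GA over bit strings: selection, crossover, and mutation."""
    rng = random.Random(seed)
    # Creation of life: a random population of encoded candidates.
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        # Darwinism: rank by fitness and keep the better half as parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            # Make babies: one-point crossover of two (preferably)
            # successful parents drawn from the surviving half.
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]
            # Get cancer or so: stochastic bit-flip mutation.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

# "One-max" toy problem: fitness is the number of 1 bits; optimum is all ones.
best = genetic_algorithm(fitness=sum)
print(sum(best))
```

This is the plainest possible GA; real libraries add tournament selection,
elitism, and problem-specific encodings on top of these same steps.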

Genetic algorithms are just a special case of the more general idea of
``evolutionary computation''. There is a newsgroup that is dedicated to the
field of evolutionary computation called comp.ai.genetic. It has a detailed
FAQ posting which, for instance, explains the terms "Genetic Algorithm",
"Evolutionary Programming", "Evolution Strategy", "Classifier System", and
"Genetic Programming". That FAQ also contains many pointers to relevant
literature, software, and other sources of information; please consult it
for further details. 

Andrew Gray's Hybrid Systems FAQ at the University of Otago at 
http://divcom.otago.ac.nz:800/COM/INFOSCI/SMRL/people/andrew/publications/faq/hybrid/hybrid.htm
also has links to information on neuro-genetic methods. 

For general information on GAs, try the links at 
http://www.shef.ac.uk/~gaipp/galinks.html 

------------------------------------------------------------------------

Subject: What about Fuzzy Logic?
================================

Fuzzy logic is an area of research based on the work of L.A. Zadeh. It is a
departure from classical two-valued sets and logic: it uses "soft"
linguistic system variables (e.g., large, hot, tall) and a continuous range
of truth values in the interval [0,1], rather than strict binary (true or
false) decisions and assignments.

Fuzzy logic is used where a system is difficult to model exactly (but an
inexact model is available), is controlled by a human operator or expert, or
where ambiguity or vagueness is common. A typical fuzzy system consists of a
rule base, membership functions, and an inference procedure.
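
As an unofficial sketch of these ingredients, the following defines
triangular membership functions for hypothetical linguistic temperature
variables (the variable names and breakpoints are invented for
illustration). Truth values fall in [0,1], and fuzzy AND/OR are commonly
taken as min/max:

```python
def triangular(a, b, c):
    """Triangular membership function: truth rises linearly from 0 at a
    to a peak of 1 at b, then falls back to 0 at c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu

# Hypothetical linguistic variables for temperature in degrees Celsius.
cold = triangular(-10.0, 0.0, 20.0)
warm = triangular(10.0, 20.0, 30.0)
hot = triangular(25.0, 35.0, 50.0)

t = 18.0
# Unlike two-valued logic, 18 degrees is slightly cold AND mostly warm:
print(cold(t), warm(t), hot(t))   # 0.1 0.8 0.0
# Fuzzy AND/OR are commonly min/max over truth values in [0, 1].
print(min(cold(t), warm(t)))      # "cold AND warm" -> 0.1
print(max(cold(t), warm(t)))      # "cold OR warm"  -> 0.8
```

A rule base then combines such memberships (e.g. "IF cold THEN heater
high"), and the inference procedure aggregates the fired rules into an
output.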

Most fuzzy logic discussion takes place in the newsgroup comp.ai.fuzzy
(where there is a fuzzy logic FAQ) but there is also some work (and
discussion) about combining fuzzy logic with neural network approaches in
comp.ai.neural-nets.

Early work combining neural nets and fuzzy methods used competitive networks
to generate rules for fuzzy systems (Kosko 1992). This approach is sort of a
crude version of bidirectional counterpropagation (Hecht-Nielsen 1990) and
suffers from the same deficiencies. More recent work (Brown and Harris 1994)
has been based on the realization that a fuzzy system is a nonlinear mapping
from an input space to an output space that can be parameterized in various
ways and therefore can be adapted to data using the usual neural training
methods (see "What is backprop?") or conventional numerical optimization
algorithms (see "What are conjugate gradients, Levenberg-Marquardt, etc.?").

A neural net can incorporate fuzziness in various ways: 

 o The inputs can be fuzzy. Any garden-variety backprop net is fuzzy in this
   sense, and it seems rather silly to call a net "fuzzy" solely on this
   basis, although Fuzzy ART (Carpenter and Grossberg 1996), despite its
   name, has no other fuzzy characteristics. 
 o The outputs can be fuzzy. Again, any garden-variety backprop net is fuzzy
   in this sense. But competitive learning nets ordinarily produce crisp
   outputs, so for competitive learning methods, having fuzzy output is a
   meaningful distinction. For example, fuzzy c-means clustering (Bezdek
   1981) is meaningfully different from (crisp) k-means. Fuzzy ART does not
   have fuzzy outputs. 
 o The net can be interpretable as an adaptive fuzzy system. For example,
   Gaussian RBF nets and B-spline regression models (Dierckx 1995, van
   Rijckevorsal 1988) are fuzzy systems with adaptive weights (Brown and
   Harris 1994) and can legitimately be called neurofuzzy systems. 
 o The net can be a conventional NN architecture that operates on fuzzy
   numbers instead of real numbers (Lippe, Feuring and Mischke 1995). 
 o Fuzzy constraints can provide external knowledge (Lampinen and Selonen
   1996). 
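
As an illustration of the fuzzy-output distinction above, here is a minimal
one-dimensional fuzzy c-means sketch (after Bezdek 1981); the
initialization at evenly spaced points over the data range and the
parameter values are illustrative assumptions, not part of any standard
implementation:

```python
def fuzzy_cmeans(xs, k=2, m=2.0, iters=50):
    """Fuzzy c-means in one dimension: each point receives a membership in
    [0, 1] for every cluster, not a single crisp label as in k-means."""
    # Naive initialization: centers spaced evenly over the data range.
    span = max(xs) - min(xs)
    centers = [min(xs) + i * span / (k - 1) for i in range(k)]
    u = []
    for _ in range(iters):
        # Membership update: inversely related to relative distance to
        # each center; the fuzzifier m controls how soft assignments are.
        u = []
        for x in xs:
            d = [abs(x - c) + 1e-12 for c in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0))
                                for j in range(k))
                      for i in range(k)])
        # Center update: membership-weighted means of the data.
        centers = [sum(u[n][i] ** m * xs[n] for n in range(len(xs))) /
                   sum(u[n][i] ** m for n in range(len(xs)))
                   for i in range(k)]
    return centers, u

xs = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0]
centers, u = fuzzy_cmeans(xs)
print([round(c, 1) for c in sorted(centers)])   # e.g. [0.5, 9.5]
```

Crisp k-means would assign each point wholly to one cluster; here every
point carries a graded membership in both.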

More information on neurofuzzy systems is available online: 

 o The Fuzzy Logic and Neurofuzzy Resources page of the Image, Speech and
   Intelligent Systems (ISIS) research group at the University of
   Southampton, Southampton, Hampshire, UK: 
   http://www-isis.ecs.soton.ac.uk/research/nfinfo/fuzzy.html. 
 o The Neuro-Fuzzy Systems Research Group's web page at Tampere University
   of Technology, Tampere, Finland: http://www.cs.tut.fi/~tpo/group.html and
   http://dmiwww.cs.tut.fi/nfs/Welcome_uk.html 
 o Marcello Chiaberge's Neuro-Fuzzy page at 
   http://polimage.polito.it/~marcello. 
 o Jyh-Shing Roger Jang's home page at http://www.cs.nthu.edu.tw/~jang/ with
   information on ANFIS (Adaptive Neuro-Fuzzy Inference Systems), articles
   on neuro-fuzzy systems, and more links. 
 o Andrew Gray's Hybrid Systems FAQ at the University of Otago at 
   http://divcom.otago.ac.nz:800/COM/INFOSCI/SMRL/people/andrew/publications/faq/hybrid/hybrid.htm

References: 

   Bezdek, J.C. (1981), Pattern Recognition with Fuzzy Objective Function
   Algorithms, New York: Plenum Press. 

   Bezdek, J.C. & Pal, S.K., eds. (1992), Fuzzy Models for Pattern
   Recognition, New York: IEEE Press. 

   Brown, M., and Harris, C. (1994), Neurofuzzy Adaptive Modelling and
   Control, NY: Prentice Hall. 

   Carpenter, G.A. and Grossberg, S. (1996), "Learning, Categorization, Rule
   Formation, and Prediction by Fuzzy Neural Networks," in Chen, C.H.
   (1996), pp. 1.3-1.45. 

   Chen, C.H., ed. (1996) Fuzzy Logic and Neural Network Handbook, NY:
   McGraw-Hill, ISBN 0-07-011189-8. 

   Dierckx, P. (1995), Curve and Surface Fitting with Splines, Oxford:
   Clarendon Press. 

   Hecht-Nielsen, R. (1990), Neurocomputing, Reading, MA: Addison-Wesley. 

   Klir, G.J. and Folger, T.A. (1988), Fuzzy Sets, Uncertainty, and
   Information, Englewood Cliffs, N.J.: Prentice-Hall. 

   Kosko, B. (1992), Neural Networks and Fuzzy Systems, Englewood Cliffs,
   N.J.: Prentice-Hall. 

   Lampinen, J. and Selonen, A. (1996), "Using Background Knowledge for
   Regularization of Multilayer Perceptron Learning," submitted to the
   International Conference on Artificial Neural Networks, ICANN'96, Bochum,
   Germany. 

   Lippe, W.-M., Feuring, Th. and Mischke, L. (1995), "Supervised learning
   in fuzzy neural networks," Institutsbericht Angewandte Mathematik und
   Informatik, WWU Muenster, I-12, 
   http://wwwmath.uni-muenster.de/~feuring/WWW_literatur/bericht12_95.ps.gz 

   van Rijckevorsal, J.L.A. (1988), "Fuzzy coding and B-splines," in van
   Rijckevorsal, J.L.A., and de Leeuw, J., eds., Component and
   Correspondence Analysis, Chichester: John Wiley & Sons, pp. 33-54. 

------------------------------------------------------------------------

Next part is part 2 (of 7). 

-- 

Warren S. Sarle       SAS Institute Inc.   The opinions expressed here
saswss@unx.sas.com    SAS Campus Drive     are mine and not necessarily
(919) 677-8000        Cary, NC 27513, USA  those of SAS Institute.
* Do not send me unsolicited commercial, political, or religious email *
