Self-Organizing Maps: An Interesting Neural Network

Introduction

Kohonen Maps, or Self-Organizing Maps (SOMs), are inspired by how our brain cells are organized into topologically ordered maps that react to different sensory input signals by activating different regions of the cerebral cortex.
Along similar lines, a SOM builds a lattice of neurons, with similar neurons placed closer to each other, and uses this lattice to cluster the data in a lower-dimensional space. It is an artificial neural network algorithm developed by Teuvo Kohonen for the field of unsupervised learning. It is based on the principle of competitive learning, where the neurons compete with each other for every input sample: the winning neuron sets its output to one while all the others set theirs to zero. Learning happens in two phases: an Ordering Phase, where the neurons order themselves topologically, and a Convergence Phase, where the final weight updates move the neurons still closer to the input space.
A SOM provides an effective and easy way of visualizing clusters in high-dimensional data by projecting it into a lower-dimensional space, typically 1-D or 2-D.

Architecture

To build a SOM, we need to look at the following components:

1.1  Structure/Topology          

A typical Self-Organizing Map consists of a set of neurons that have positions relative to each other in the visualization space of the chosen dimensionality (theoretically it can span N dimensions, but we generally restrict it to 2-D for ease of visualization). In 2-D, the neurons can be arranged in a rectangular pattern or in a hexagonal pattern, in which each neuron has 6 neighbors. In addition, each of these neurons has a reference vector or weight vector that spans the input space dimensions (i.e. the number of features in the input dataset equals the dimension of the weight vector).
   
Fig 1-a: Rectangular Topology 
Fig 1-b: Hexagonal Topology
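To make this concrete, here is a minimal Python sketch of how such a lattice could be set up, with the map positions kept separate from the weight vectors. The variable names and sizes are illustrative, not taken from any particular implementation:

    import numpy as np

    # A 5 x 5 rectangular lattice: each neuron has a fixed position on the map
    # and a weight vector with as many components as the input has features.
    map_rows, map_cols = 5, 5
    n_features = 4                      # e.g. the 4 features of the Iris dataset

    # Grid positions (row, col) of every neuron, used later for lateral distances
    positions = np.array([(r, c) for r in range(map_rows) for c in range(map_cols)])

    # Weight vectors initialized to random values in [0, 1)
    rng = np.random.default_rng(0)
    weights = rng.random((map_rows * map_cols, n_features))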

1.2  Neighborhood Function

The basic principle of the SOM is that every input sample has only one winner neuron, or BMU (Best Matching Unit). However, the winner neuron shares the learning with its neighboring neurons, i.e. the winner and its neighbors all update their weights. The amount of learning depends on the distance of a particular neuron from the winner: the farther a neuron is from the BMU, the smaller its weight update.
A popular choice to quantify this learning is:

                            h_ci(t) = exp( − ‖r_c − r_i‖² / (2 σ(t)²) )                     ------ Eq 1

Where     ‖r_c − r_i‖ : lateral distance between the BMU and neighbor node i
              σ(t) : neighborhood radius at time step t, given by:

                            σ(t) = σ0 × exp( − t / λ )                                      ------ Eq 2

              Where:  σ0 = neighborhood radius at time step 0
                      t = current time step
                      λ = map constant = total iterations / log(σ0)
The neighborhood region decreases over time until, at the end, it includes only the winner neuron. The rate of decrease is controlled by λ: the greater it is, the slower the ordering. Haykin [1] recommends setting λ to 1000/log(σ0).
Another possibility for this function is just a step function, or bubble function as it is commonly called, given by:

                        h_ci(t) = { 1,  if i ∈ N_c
                                    0,  if i ∉ N_c                                          ------ Eq 3

Where N_c : set of neighboring neurons
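Here is a small Python sketch of the Gaussian neighborhood of Eq (1), the shrinking radius of Eq (2) and the bubble function of Eq (3). The function and variable names are illustrative, and the lateral distances are assumed to be measured on the map grid:

    import numpy as np

    def sigma(t, sigma0, lam):
        """Neighborhood radius at time step t (Eq 2): sigma0 * exp(-t / lambda)."""
        return sigma0 * np.exp(-t / lam)

    def gaussian_neighborhood(lateral_dist, t, sigma0, lam):
        """Eq 1: influence decays with the squared lateral distance from the BMU."""
        s = sigma(t, sigma0, lam)
        return np.exp(-(np.asarray(lateral_dist) ** 2) / (2 * s ** 2))

    def bubble_neighborhood(lateral_dist, t, sigma0, lam):
        """Eq 3: full influence inside the current radius, none outside."""
        return (np.asarray(lateral_dist) <= sigma(t, sigma0, lam)).astype(float)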

1.3  Learning Parameter - η

This signifies the learning rate of the network. There are a few possibilities for this function, the most popular of which is:
                                       η(t) = η0 × exp( − t / n )                           ------ Eq 4

Where  η0 = learning rate at time step 0
            t = current time step
            n = total no. of iterations

Another choice for decaying η could be a linear function, given by:

                                       η(t) = η0 × (1 − t / n)                              ------ Eq 5
Yet another choice of this function is an inverse decay, given by:

                                       η(t) = η0 / t                                        ------ Eq 6
As the number of iterations increases, the learning rate decreases more slowly. After the ordering phase, the learning parameter should remain at a low value such as 0.01 (Haykin) or 0.02 (Kohonen).
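The three decay schedules of Eq (4), Eq (5) and Eq (6) can be compared side by side in a short sketch (again, the names are illustrative):

    import numpy as np

    def eta_exponential(t, eta0, n_iters):
        """Eq 4: exponential decay, eta0 * exp(-t / n)."""
        return eta0 * np.exp(-t / n_iters)

    def eta_linear(t, eta0, n_iters):
        """Eq 5: linear decay, eta0 * (1 - t / n)."""
        return eta0 * (1 - t / n_iters)

    def eta_inverse(t, eta0):
        """Eq 6: inverse decay, eta0 / t (defined for t >= 1)."""
        return eta0 / max(t, 1)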

1.4  Shape of Neighborhood Region

The shape of this region determines the set of neighboring neurons during the cooperation phase. Typical choices for this are square or hexagonal.
                                 N(j,t) = { 1,  if d_jc ≤ D_t
                                            0,  if d_jc > D_t                               ------ Eq 7

Where  d_jc = lateral distance between the BMU and the jth neuron
           D_t = neighborhood width at time t, given by:

                                 D_t = { r²,                       for a square region
                                         3 × r² × √(r² − r²/4),    for a hexagonal region   ------ Eq 8



Fig 2-a Square Neighborhood
Fig 2-b Hexagonal Neighborhood
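As a sketch, the membership test of Eq (7) only needs the lateral distance of each neuron from the BMU and the current neighborhood width (the width itself would come from Eq (8)); here it is in Python, with illustrative names:

    import numpy as np

    def neighborhood_members(positions, bmu_index, width):
        """Eq 7: neuron j belongs to the neighborhood if its lateral
        (map-space) distance from the BMU does not exceed the width D_t."""
        lateral_dist = np.linalg.norm(positions - positions[bmu_index], axis=1)
        return lateral_dist <= width        # boolean mask over all neurons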


Pseudocode: Incremental SOM Algorithm (a Python sketch of these steps follows the list below)

  • Initialize the weights to random values (between 0 and 1) or to randomly chosen samples from the input space
  • Repeat for n epochs
    • Randomly select a sample from the input dataset without repetition (to ensure all of the inputs get a chance to influence the weight vectors)
    • Calculate the BMU (Best Matching Unit / winner neuron) by minimizing the distance between the input and the weight vectors.
                                     BMU = arg min_c ‖ x(t) − w_c(t) ‖
            Euclidean distance is the most commonly used metric; however, cosine distance is also popular and much more effective for sparse matrices.

    • Identify the neighborhood of the BMU: go through the complete set of neurons and evaluate whether each one is included, as per Eqs (7) and (8)
    • Update the weights of BMU and neighbors
                     w_i(t+1) = w_i(t) + η(t) × h_ci(t) × ( x(t) − w_i(t) )          ------ Eq 9

      • Where
        • h_ci(t) = neighborhood influence given by Eq (1) or Eq (3)
        • η(t) = learning rate at time step t
    •  Decrease the learning rate as per Eq (4), Eq (5) or Eq (6)
    •  Decrease the neighborhood/influence radius as per Eq (2)
    • Once all of the inputs have been used, calculate the quantization error as:

                                       QE = ( Σ_{i=1..N} ‖ x_i − w_ic ‖ ) / N

                      Where  x_i  = ith input sample
                                 w_ic = weight vector of the BMU for the ith sample
                                 N   = total number of samples
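Here is a minimal, self-contained Python sketch of this incremental algorithm, assuming a 1-D line of neurons, a Gaussian neighborhood and an exponential learning-rate decay. It only illustrates the steps above; it is not the exact code shared on GitHub:

    import numpy as np

    def train_som(X, n_neurons=10, n_epochs=100, eta0=0.1, sigma0=3.0, seed=0):
        """Incremental SOM on a 1-D line of neurons. X has shape (n_samples, n_features)."""
        rng = np.random.default_rng(seed)
        n_samples, n_features = X.shape
        weights = rng.random((n_neurons, n_features))          # random init in [0, 1)
        positions = np.arange(n_neurons, dtype=float)          # lateral positions on the line
        lam = n_epochs / np.log(sigma0)                        # map constant (Eq 2)

        for epoch in range(n_epochs):
            eta = eta0 * np.exp(-epoch / n_epochs)             # learning rate (Eq 4)
            sig = sigma0 * np.exp(-epoch / lam)                # neighborhood radius (Eq 2)
            qe = 0.0
            for i in rng.permutation(n_samples):               # each sample once, in random order
                x = X[i]
                dists = np.linalg.norm(weights - x, axis=1)    # Euclidean distance to all neurons
                bmu = np.argmin(dists)                         # Best Matching Unit
                qe += dists[bmu]
                lateral = np.abs(positions - positions[bmu])   # lateral distance on the map
                h = np.exp(-(lateral ** 2) / (2 * sig ** 2))   # Gaussian neighborhood (Eq 1)
                weights += eta * h[:, None] * (x - weights)    # weight update (Eq 9)
            print(f"epoch {epoch}: quantization error = {qe / n_samples:.4f}")
        return weights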


I have implemented this basic algorithm in Python and shared it on GitHub. This basic code gives about 97% accuracy on the Iris dataset, and to give you an idea of how the rearrangement of neurons happens, take a look at the image below:

Fig 3: Graph depicting the neurons rearranging themselves to data

The configuration used for this dataset is summarized below:
No. of neurons      : 10
Topology            : Line
Eta                 : 0.1
Eta_decay           : Lin
Dist_metric         : Euclidean
Neighborhood fun    : Bubble
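For reference, this configuration maps roughly onto the sketch from the pseudocode section as shown below. The snippet assumes scikit-learn is available for loading Iris, and note that the table's linear eta decay and bubble neighborhood differ from the sketch's exponential decay and Gaussian neighborhood:

    from sklearn.datasets import load_iris

    # Roughly the same setup: 10 neurons on a line, eta0 = 0.1, Euclidean distance.
    X = load_iris().data
    trained_weights = train_som(X, n_neurons=10, n_epochs=100, eta0=0.1)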







Hope this gives you a brief idea of this really interesting algorithm. It has enjoyed a vast range of applications, from weather and linguistic research to image compression and analysis.
Although this article is only a basic introduction, and one would need to build upon this knowledge base extensively to fully appreciate the potential of this neural network, here are some pointers to get you started; hope I have intrigued you enough to delve into them:
Thank you for reading!!
