
Guide: How to Use and Features (ver.1.03)

2. The data

On this page:
  • About the sample data
  • Requirements for the data
  • Improvement hints
Let's start!

In this example, we want the ANN to be able to tell us things like:

  • a+b=sum
  • a-b=diff
We call "a" and "b" the input data, or independent variables.
We call "sum" and "diff" the output data, or dependent variables.

Since ANNs learn from experience, telling them that "1+1" is "2", "1+2" is "3" and "1+3" is "4" is not enough. We have to provide more cases, so we use a spreadsheet to generate the data. It might look like this:

[Figure: spreadsheet with data]

How is the data organised?

OK, there are more columns in the spreadsheet than just "a", "b", "sum" and "diff".
The columns with "random" in the name are there only to try to confuse the training. They will also let us demonstrate some features later on, so just read on.

So the input data is in columns B through E, and the output data is in columns G and H.

Cases are stored in rows.
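To make the layout concrete, here is a minimal sketch of a few cases held as rows, with the input columns first and the output columns last. The column names, the number of "random" columns (two are assumed here, so that there are four input columns) and the random values themselves are placeholders made up for illustration, not the content of the real sample file.

```python
# Each row is one case. Column names and random values are illustrative only.
header = ["a", "b", "random1", "random2", "sum", "diff"]
rows = [
    [1, 1, 0.37, 0.91, 2, 0],
    [1, 2, 0.05, 0.48, 3, -1],
    [1, 3, 0.66, 0.12, 4, -2],
]

INPUT_COLUMNS = 4  # assumption: a, b and two random columns are the inputs


def split_case(row):
    """Return (inputs, outputs) for one case."""
    return row[:INPUT_COLUMNS], row[INPUT_COLUMNS:]


for row in rows:
    inputs, outputs = split_case(row)
    print(inputs, "->", outputs)
```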

How many cases (that is, rows in the spreadsheet) are needed?

There is a minimum of 5 cases that you have to submit. As you can probably guess, 5 cases won't give you a decent result.
150 cases is considered the minimum for reliable results.
It's like in the real world: the more hours you ride a bicycle, the better you perform. 5 minutes won't take you far...

And, the more situations you experience, the wider the range of knowledge you get.
So, you should:

  • get as many cases as you can, but
  • make sure they are independent and
  • make sure the output you want to consider is nearly equally distributed in the set.

The input range in this example goes from "0+0" through "1+2" and "2+1" to "9+9" for the sum, and likewise for the difference.
This gives us 100 cases (10 values of "a" times 10 values of "b").
Hmmmm... What's the trick to get to 150 here?
Well, just duplicate the complete set and, voilà, we have 200 cases.
Well, not quite a duplicate: we didn't copy the random numbers, we drew new ones...
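The steps just described can be sketched in a few lines of Python instead of a spreadsheet. This is only an illustration: the column names, the two "random" columns and the fixed seed are assumptions, and the real sample file may be organised differently.

```python
import random


def make_cases(rng):
    """One row per (a, b) pair with a and b in 0..9: 100 rows in total."""
    return [
        {
            "a": a,
            "b": b,
            "random1": rng.random(),  # noise, unrelated to the outputs
            "random2": rng.random(),  # noise, unrelated to the outputs
            "sum": a + b,
            "diff": a - b,
        }
        for a in range(10)
        for b in range(10)
    ]


rng = random.Random(42)          # fixed seed only to make the sketch repeatable
first_pass = make_cases(rng)
second_pass = make_cases(rng)    # same (a, b) pairs, newly drawn random numbers
cases = first_pass + second_pass

print(len(first_pass), len(cases))  # 100 200
```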

Ohhh, and there is another problem. If you duplicate the data, the cases are no longer independent. This means that the cross-validation process is undermined, because a case held back for validation may already have been seen during training.
With a human learner you would say he is learning by heart rather than learning the idea behind it. After such learning he would be able to say that 3+5=8, but if the training didn't include 5+3, he wouldn't be able to tell the result. This is a good way to understand the danger of overfitting (= learning by heart), but don't push the analogy too far by taking everything literally.
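To see why duplicated cases are a problem for validation, the following sketch shuffles the cases, holds a quarter of them back, and counts how many held-back cases also occur in the training part. The 75/25 split and the helper name count_leaks are assumptions made for this illustration; they are not a description of how microCortex splits the data. Only "a" and "b" are compared, since the random columns carry no information about the outputs.

```python
import random


def count_leaks(cases, n_train, seed=1):
    """Shuffle, split, and count validation cases that also occur in training."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    train, validation = shuffled[:n_train], shuffled[n_train:]
    seen = set(train)
    leaked = sum(1 for case in validation if case in seen)
    return leaked, len(validation)


pairs = [(a, b) for a in range(10) for b in range(10)]  # 100 unique cases

print(count_leaks(pairs, 75))           # unique cases: nothing can leak
print(count_leaks(pairs + pairs, 150))  # duplicated set: twins can straddle the split
```

With unique cases the first number is always zero. With the duplicated set, a held-back case leaks whenever its twin ended up in the training part.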

In our case, overfitting won't give wrong results, but you have to be careful with your data. If you train several ANNs with this data, you will observe that the number of hidden nodes varies a lot, while the ANNs are still able to predict nicely. Details on that will be presented later on.

What else has to be considered?

ANNs can extract information from a data set, even if the information is hidden in noise, represented only in a small amount of the data or hidden in another way.
However, your best bet is to distribute the data evenly. For example, if you want a "yes/no" type of answer, your training data should contain nearly as many cases with "yes" as cases with "no".
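A quick way to check the balance before submitting is to count how often each output value occurs. The sketch below assumes "yes" is coded as 1 and "no" as 0, and uses made-up labels, simply to show the idea.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of cases for each distinct output value."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Assumed coding: 1 = "yes", 0 = "no". The labels are illustrative only.
labels = [1] * 70 + [0] * 30
print(class_shares(labels))  # {1: 0.7, 0: 0.3}
```

A 70/30 split like this one is noticeably uneven, so it would be worth collecting more "no" cases if you can.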

The input data has to be numeric, so if you have something else, please read "What kind of data can be submitted".

Please note that you must not have empty cells.
If data is missing in your database, you must assign a value to it before submitting (for instance, the average or the median of the other cases). But beware: replacing missing data is notorious for biasing the data analysis.
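As a minimal sketch of this kind of replacement, the function below fills the gaps in one column with either the median or the mean of the known values. Missing cells are represented by None here, and both the function name and the example column are invented for illustration.

```python
from statistics import mean, median


def fill_missing(values, strategy=median):
    """Replace None entries with the mean or median of the known values."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]


column = [4.0, None, 5.0, 100.0, None, 6.0]  # illustrative values only

print(fill_missing(column, median))  # [4.0, 5.5, 5.0, 100.0, 5.5, 6.0]
print(fill_missing(column, mean))    # [4.0, 28.75, 5.0, 100.0, 28.75, 6.0]
```

The single outlier (100.0) pulls the mean far away from the typical values, which shows how easily the choice of replacement can bias the data.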

You should format the numbers with enough decimal digits. This might be relevant later on.
The decimal separator should be "." (dot).
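If you prepare the data with a script rather than a spreadsheet, fixed-point formatting gives you both a fixed number of decimal digits and a dot as decimal separator. In the sketch below the tab as column separator and the choice of four digits are assumptions; how many digits are "enough" depends on your data.

```python
def format_row(values, digits=4):
    """Format one case with a fixed number of decimal digits and "." as separator."""
    return "\t".join(f"{v:.{digits}f}" for v in values)


print(format_row([1, 2, 0.3712, 3, -1]))
# 1.0000	2.0000	0.3712	3.0000	-1.0000
```

Python's "f" format always writes a dot, whatever the locale settings of the computer are.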

So far, so good. You now know how to organise your data in a spreadsheet, so you can move on to the next step: "Submitting the Data" (including transferring the data from the spreadsheet to your browser).

Note: The sample data used to generate this example is available: Download Data.