In this example, we want the ANN to be able to tell us things like:
Since ANNs learn from experience, telling one that "1+1" is "2", "1+2" is "3" and "1+3" is "4" is not enough. We have to provide more cases, so we use a spreadsheet to generate the data. It might look like this:
How is the data organised?
OK, there are more columns in the spreadsheet than strictly needed.
The columns with "random" in the name are there just to try to confuse
the training. Additionally, they will let us present a feature later on;
just read on.
So the input data is in columns B through E, and the output data is in columns G and H.
Cases are stored in rows.
How many cases (that is, rows in the spreadsheet) are needed?
There is a minimum of 5 cases that you have to submit.
As you probably guess,
with 5 cases, you won't get a decent result.
150 cases is considered the minimum for reliable results.
It's like in the real world: the more hours you ride a bicycle, the better
you perform. 5 minutes won't take you far...
And, the more situations you experience,
the wider the range of knowledge you get.
So, you should:
The input range in this example goes from "0+0" through "1+2" and "2+1"
to "9+9" for the sum and respectively for the difference.
This way we get 100 cases.
Hmmmm... What's the trick to get 150 here???
Well, just duplicate the complete set, and, voilà, we got 200 cases.
Not quite. We didn't duplicate the random numbers, we chose new ones...
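The generation steps above can also be sketched outside a spreadsheet. The following Python snippet is only an illustration under assumptions: the column order (a, random, b, random as the four inputs; sum and difference as the two outputs), the name make_cases and the 0 to 9 range of the random values are hypothetical and not taken from the original spreadsheet.

```python
import random

def make_cases(copies=2, seed=0):
    """Build `copies` passes over all pairs a, b in 0..9 (100 cases per pass)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(copies):
        for a in range(10):
            for b in range(10):
                # The two random columns carry no information.
                # Fresh values are drawn on every pass, so the second
                # pass is not an exact copy of the first one.
                noise_1 = rng.randint(0, 9)
                noise_2 = rng.randint(0, 9)
                rows.append((a, noise_1, b, noise_2, a + b, a - b))
    return rows

cases = make_cases()
print(len(cases))  # 200
```

Each tuple is one case (one row): four input values followed by two output values.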
Ohhh, and there is another problem. If you duplicate the data, the cases are not
independent. This means that the crossvalidation process is undermined.
Of a human you would say that he is learning by heart, but not learning the
idea behind it. So, after the learning, he would be able to say that 3+5=8,
but if the learning didn't include 5+3, he wouldn't be able to tell the result.
This is a good way to understand the danger of overfitting (=learning by heart), but
you can take the analogy too far by taking everything literally.
In our case, overfitting won't give wrong results, but you have to be careful with your data. And if you train various ANNs with this data, you will observe that the number of hidden nodes varies a lot, while the ANNs are still able to predict nicely. Details on that will be presented later on.
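To see why duplicated cases are not independent, it can help to split the data by input pair instead of by row. The sketch below is an assumption made for illustration, not the crossvalidation procedure used by the ANN service described here; split_by_pair and the simplified (a, b, sum) rows are hypothetical. All rows that share the same (a, b) pair end up on the same side of the split, so a held-out case can never have been "learned by heart" from its duplicate.

```python
import random

def split_by_pair(rows, holdout_fraction=0.2, seed=0):
    """Split rows so that duplicates of one (a, b) pair stay together."""
    pairs = sorted({(r[0], r[1]) for r in rows})
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_holdout = int(len(pairs) * holdout_fraction)
    holdout_pairs = set(pairs[:n_holdout])
    train = [r for r in rows if (r[0], r[1]) not in holdout_pairs]
    holdout = [r for r in rows if (r[0], r[1]) in holdout_pairs]
    return train, holdout

# 100 distinct cases, each present twice (the duplicated set).
rows = [(a, b, a + b) for a in range(10) for b in range(10)] * 2
train, holdout = split_by_pair(rows)
print(len(train), len(holdout))  # 160 40
```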
What else has to be considered???
ANNs can extract information from a data set, even if the information
is hidden in noise, represented only in a small amount of the data or
hidden in another way.
However, your best bet is to distribute the data evenly. For example, if
you want a "yes/no" type of answer, your training data should
contain nearly as many cases with "yes" as cases with "no".
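A quick way to check how evenly the answers are distributed is to count them before submitting. This is a minimal sketch; the function name class_shares and the sample labels are made up for illustration.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of cases that fall into each answer class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

labels = ["yes", "no", "yes", "no", "no", "yes", "yes", "no"]
shares = class_shares(labels)
print(shares)  # {'yes': 0.5, 'no': 0.5}
```

If one share is much larger than the other, consider collecting more cases of the rarer answer.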
The input data has to be numeric, so, if you have something else, please read about "What kind of data can be submitted".
Please note that you must not have empty cells.
If you have missing data in your database, you must assign a value to it (the
average or the median value of the other cases, for instance) before
submitting it. But beware: replacing missing data is notorious for
biasing the data analysis.
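Filling the gaps with the median or the average of the known values can be sketched as follows. This is only an illustration: fill_missing is a hypothetical helper, and None is assumed to stand for an empty cell.

```python
from statistics import mean, median

def fill_missing(column, strategy=median):
    """Replace None entries by the median (or mean) of the known values."""
    known = [v for v in column if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in column]

column = [4.0, None, 10.0, 1.0]
print(fill_missing(column))        # [4.0, 4.0, 10.0, 1.0]
print(fill_missing(column, mean))  # [4.0, 5.0, 10.0, 1.0]
```

The two strategies already give different results on this tiny column, which is a reminder that the choice of replacement value influences the data.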
You should format the numbers with enough decimal digits. This might
be relevant later on.
Decimal separator should be "." (dot).
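If the data is prepared with a script instead of a spreadsheet, the same formatting rules can be applied like this. This is a sketch under assumptions: format_row, the choice of 4 decimal digits and the tab between values are all hypothetical, since the exact submission format is described elsewhere.

```python
def format_row(values, digits=4):
    """Format one case with a fixed number of decimal digits."""
    # The "f" format always writes "." as the decimal separator,
    # regardless of the locale settings of the computer.
    return "\t".join(f"{v:.{digits}f}" for v in values)

line = format_row([1, 0.5, 2 / 3])
print(line)
```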
So far, so good. You now know how to organise your data in a spreadsheet, so you can move on to the next step: "Submitting the Data" (including transferring the data from the spreadsheet to your browser).
Note: The sample data used to generate this example is available:
Download Data.