For many years we have been able to watch top chef Gordon Ramsay on TV. Most of the time he comes to the rescue of ailing restaurants. He is infamous for his constant swearing, but in almost every episode he is very outspoken about one thing: “Use fresh ingredients, not that f***ng canned or processed stuff!”
In data analysis we can learn from him. For us, too, it is extremely important to use fresh ingredients and not pre-processed ones. What do I mean by that? Just what it says: use data that have as much of the original information as possible, just like fresh ingredients have more taste than processed ones.
Let me illustrate this with the variable ‘Age of the person’. Often, we use this variable in an already condensed state, collapsed into a few categories like ‘young’, ‘middle’ and ‘old’. That can be handy for cross tabulations or an analysis of variance, where a series of ten categories would not be very insightful. But why should we collect the variable in just these three categories? We would do better to capture as much information as possible when collecting the data.
So, ask for the age itself, instead of an age category. We get even more precision in information when we ask for the date of birth. Collapsing can always be done later, depending on the technique we are going to use. Then we can decide how many categories we want and what the optimal cut-off values will be.
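As a rough illustration of collapsing later, here is a minimal Python sketch. The dates of birth, the reference date, the cut-off values and the helper names `age_on` and `collapse` are all made up for this example; they do not come from any real study.

```python
from bisect import bisect_right
from datetime import date


def age_on(birth: date, ref: date) -> int:
    """Age in completed years on the reference date."""
    had_birthday = (ref.month, ref.day) >= (birth.month, birth.day)
    return ref.year - birth.year - (0 if had_birthday else 1)


def collapse(age: int, cutoffs: list[int], labels: list[str]) -> str:
    """Map an age to a category label.

    Each cut-off is the lowest age of the next category, so there must be
    exactly one more label than there are cut-offs.
    """
    return labels[bisect_right(cutoffs, age)]


# Hypothetical raw data: dates of birth as collected.
births = [date(2008, 12, 20), date(1990, 5, 1), date(1950, 1, 15)]
ref = date(2024, 12, 19)  # assumed date of the survey
ages = [age_on(b, ref) for b in births]

# The same raw ages, collapsed in two different ways after collection.
three_groups = [collapse(a, [30, 60], ["young", "middle", "old"]) for a in ages]
two_groups = [collapse(a, [18], ["under 18", "18 and over"]) for a in ages]
```

Because the dates of birth are kept, a change of cut-offs only means calling `collapse` again with other values; nothing has to be collected anew.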
Why is this so important?
First, because pre-chosen cut-off points for categories can hide peculiarities of the distribution: maybe there is a peak right at your cut-off point. Also, the appropriate cut-off points may differ depending on the subject of the study and the techniques used. For education research 16 years is important, as youngsters need to go to school at least part-time until that age. But for research on political parties 18 years is a better choice, as this is the age at which one gets the right to vote. Collecting data ‘as fresh as possible’ also gives us the possibility of comparison with other research; we can then collapse our data into the categories they used.
If this is so important, why is age so often recorded in coarse categories? This is an artefact of the old days, long ago, when data were stored on punched cards. These had 80 columns of 12 positions each. In most cases data processing was nothing more than feeding the cards to a so-called counting-sorter. You could set these machines to sort your cards into 12 bins according to the content of one single column. First you sorted them according to the column that contained the coding for the 3 categories of age, and then you sorted the content of each separate bin again, this time according to a further variable, like income. Voilà: you had your cross tabulation of Age vs. Income.
However clever it was, the process allowed sorting on only one column at a time. Had we recorded Age as integer numbers between 0 and 99, we would have needed two columns. Sorting first on the first of these columns, the bins would contain the ages 0-9, 10-19, etc. Sorting these ten stacks on the second digit would then produce 100 stacks with the ages 0, 1, 2, etc., each of which we could sort again against the variable Income. Apart from the process being cumbersome and error-prone, we would need two columns instead of one. And columns were a scarce commodity: a card had only 80 columns available to store the answers to all questions, so you needed economical coding or you had to limit your number of questions. Huge questionnaires solved this problem by using more cards for each person. But then you needed to duplicate all ‘background variables’ like Age, Sex, Education, Income, etc. on each separate card, as you could only make a cross tabulation with variables within the same card.
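To get a feeling for how cumbersome this was, the two-column sort can be imitated in a few lines of Python. This is only a sketch under simplifying assumptions: each ‘card’ is reduced to a two-character string holding the age, and `sort_on_column` is a hypothetical stand-in for one run through the sorter, not a model of any real machine.

```python
from collections import defaultdict


def sort_on_column(cards, column):
    """One run through the sorter: bin the cards on the digit in one column."""
    bins = defaultdict(list)
    for card in cards:
        bins[card[column]].append(card)
    return bins


# Hypothetical deck: each card carries a two-column age, e.g. "07" or "42".
cards = ["07", "42", "45", "19", "42", "63"]

# First run: sort the whole deck on the tens column.
tens_bins = sort_on_column(cards, 0)
runs = 1

# Then every non-empty bin has to go through the machine again, on the units column.
stacks = {}
for tens, tens_bin in tens_bins.items():
    runs += 1
    for units, stack in sort_on_column(tens_bin, 1).items():
        stacks[tens + units] = stack
```

Even this tiny deck of six cards needs five runs (one for the tens, then one for each of the four bins that are not empty) before Income has been looked at at all, against a single run for a one-column category code.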
Do not think that this problem occurs only with a variable like Age. It is even more important to use ‘fresh’ data with variables like Price or Length. Imagine prices collapsed into a few Euro-based categories: you could never compare them with research reported in UK Pounds or US Dollars. And length in cm-based categories is hardly comparable with length in inch-based categories.
The techniques available some 55 years ago left us no other choice than to record variables in a highly condensed form. But today there is absolutely no need for that, so we can follow the advice of the swearing Chef and collect ‘fresh’ data with as much detail as we can get, just like fresh ingredients have a lot more flavour than processed ones!
Look forward to next week’s Christmas-themed article, where Henk Tijms will show us a few wonderful probability theory puzzles!