Monday, October 13, 2008

A Plea

Please, please, please, please... document your data.

It's not fun to get a matrix without any column or row labels. As my old physics teacher used to say when somebody proudly proclaimed that the answer to an exercise was 5.029: "5.029 what? Elephants?" In physics, a quantity is not meaningful without a unit, and in data management, a matrix is not useful without labels.

It's not fun to know nothing about any preprocessing of the data. Has it been normalised? I guess I can check the mean and standard deviation, but what if it's only close to 0 and 1, but not exactly so? Was that some special normalisation method? Hell if I know.
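For what it's worth, here is roughly what that mean-and-standard-deviation check looks like. It is only a sketch: the function name `looks_standardised` and the tolerance `tol` are my own inventions, and the cut-off of 0.05 is arbitrary. Note that it can only tell you whether a column looks standardised, never which method produced it.

```python
import statistics

def looks_standardised(values, tol=0.05):
    """Return (mean, sd, verdict) for one row or column of numbers.

    verdict is True when the mean is within tol of 0 and the sample
    standard deviation is within tol of 1. The tolerance is a guess,
    not a rule.
    """
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    verdict = abs(mean) <= tol and abs(sd - 1.0) <= tol
    return mean, sd, verdict

# Made-up numbers, purely for illustration.
mean, sd, verdict = looks_standardised([-1.0, 0.0, 1.0])
print(mean, sd, verdict)
```

If the verdict is False but the numbers are close to 0 and 1, you are back to guessing, which is exactly why a sentence of documentation beats any amount of checking.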

It's not fun to know nothing about the experimental setup. Maybe you told me every second data point is a wildtype. Does that mean that these were results from two-colour arrays? Or two single-colour arrays? Are the wildtypes from the same time point as the mutants? Come to think of it, are these even time-course data?

It's not fun to know nothing about what the biologists* want you to find. Are they looking for similarly expressed genes or for regulators? Would they rather have a network of the knocked-out genes or of all genes? Is it worse if I give them false positives or false negatives?

So please, please, please, document your data. Maybe some time in the future I will tell you about fun things called wikis and databases, but for now even a text file would do.
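To show how little it takes, here is one way such a text file could be produced. This is a sketch under my own assumptions: the function `write_documented_matrix`, the `#` comment-line convention, and every label and note in the example are made up for illustration, not a standard anyone has agreed on.

```python
import csv
import io

def write_documented_matrix(handle, matrix, row_labels, col_labels, notes,
                            row_header="gene"):
    """Write '#' comment lines with the notes, then a labelled
    tab-separated matrix with one header row and one label column."""
    if len(matrix) != len(row_labels):
        raise ValueError("need exactly one row label per row")
    for row in matrix:
        if len(row) != len(col_labels):
            raise ValueError("need exactly one column label per column")
    # Free-text documentation goes first, one "key: value" per line.
    for key, value in notes.items():
        handle.write(f"# {key}: {value}\n")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([row_header] + list(col_labels))
    for label, row in zip(row_labels, matrix):
        writer.writerow([label] + list(row))

# Hypothetical example data; none of these values come from a real experiment.
buffer = io.StringIO()
write_documented_matrix(
    buffer,
    matrix=[[1.5, 2.0], [0.5, 3.0]],
    row_labels=["geneA", "geneB"],
    col_labels=["wildtype_t0", "mutant_t0"],
    notes={"normalisation": "none", "design": "single-colour arrays"},
)
print(buffer.getvalue())
```

Swap the `StringIO` buffer for a file opened with `open(path, "w", newline="")` and the labels and notes travel with the numbers.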

*Or insert other applied science here.