Gresham College lecture on data
How the heck can one get a handle on the vast variety of data that is used in computers? After all, there are whole degrees on Data Science and whole courses on databases (and these are different courses to those on data structures). It's not even fair to say it is a big topic: it is a big set of big topics.
As I started to construct the lecture it became apparent to me that computer data goes through fashions. Some of those are entirely justifiable and I've put those in the lecture -- the switch from data structures as efficient storage on computers to data structures as tools for simplifying problems, for example -- others are less so, and I left them out of the lecture (nothing on object-oriented databases, for example).
So, here for your delectation, ladies and gentlemen, are my purely arbitrary seven ages of data.
Who does not love a flat file containing all the records and all their associated fields? Flat files are portable, or at least more portable than the multiple tables of a normalised database. Flat files are not proprietary, or need not be. Flat files are often just human-readable records. They are perfect for science and they are open and transparent. When Darwin was making notes on his Galapagos Islands adventures, it was flat files he was using. He had no concept of computers, so he wrote down the data as he found them.
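A flat file really is that simple: one record per line, every field present in every record, readable by eye or by machine. As a sketch (the field names and observations below are invented for illustration, not Darwin's actual notes):

```python
import csv
import io

# A hypothetical flat file of field observations: one record per line,
# delimited fields, no database engine required to read it.
flat_file = io.StringIO(
    "date,island,species,count\n"
    "1835-09-17,Chatham,finch,12\n"
    "1835-09-23,Charles,mockingbird,4\n"
    "1835-10-08,James,tortoise,2\n"
)

# csv.DictReader turns each line into a field-name -> value mapping.
records = list(csv.DictReader(flat_file))
total = sum(int(r["count"]) for r in records)
print(total)  # total specimens across all records
```

The same file could equally be opened in a text editor, which is the whole point: the data and its readability travel together.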
Data for computers
In this era we were heavily influenced by what computers could do. If a computer could store only 8-bit words in a single cycle, then characters had to fit in 8 bits.
Data for algorithms
As soon as we move to more complex data structures, it becomes evident that certain operations become so much easier if I choose the right data structure. Use a binary search tree and your database search drops from linear time -- thousands of seconds -- to logarithmic time: microseconds. All of which leads us to...
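The speed-up comes from halving the search space at every step. A minimal sketch (an unbalanced tree with illustrative keys, not a production structure):

```python
# Minimal binary search tree: each node's left subtree holds smaller keys
# and its right subtree larger ones, so a lookup discards half the
# remaining keys at each step -- O(log n) on average, versus O(n) for
# a linear scan of a flat file.

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Return (found, comparisons) for a lookup starting at root."""
    steps = 0
    while root is not None:
        steps += 1
        if key == root.key:
            return True, steps
        root = root.left if key < root.key else root.right
    return False, steps

root = None
for k in [50, 30, 70, 20, 40, 60, 80]:
    root = insert(root, k)

found, steps = search(root, 60)
# A linear scan of these seven keys could take up to 7 comparisons;
# the tree reaches 60 in 3 (50 -> 70 -> 60).
```

With a million keys the contrast is starker still: around 20 comparisons down a balanced tree against up to a million in a scan.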
How marvellous it would be to believe that we invented all the data structures before having to tackle real databases. Nothing could be further from the truth: the early databases were plagued with issues caused by poor data-structure choices. Nevertheless, the 1980s became the era in which databases were all-powerful and provided speed, speed, speed. Without databases we would have no e-commerce today - our Amazon product searches would take hours, not microseconds. But databases introduced a risk: data became separated from its meaning -- the metadata.
Data about data
When we were working on MPEG-7 it was commonly said that the metadata, the data about the data, was more valuable than the data. In retrospect this looks like so much phooey, but if you have ever come back from an expensive field trial with mountains of discs and no notebooks describing what is on the discs, then you will recognise the grain of truth in the statement. And it is true to say that for several decades we became complacent about the meaning of data - data was just data.
Data science was more than a buzzword. It was a movement to bring back the rigorous analysis of data - to convert data into insight. Of course there was nothing new in that, but data science implied that natural scientists ought to use the latest machine-learning techniques and build robust and open data platforms rather than just muddle through with Excel and a few error bars. And one of the features of data science, with its advanced machine learning, is the ability to handle unstructured data, of the type that we find in flat files.
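In miniature, "handling unstructured data" often means recovering structure from free text. A hedged sketch (the note format and the pattern are invented for illustration; real pipelines would use far more robust techniques than a single regular expression):

```python
import re

# Free-text field notes: no fixed schema, but enough regularity that
# a pattern can recover (species, count) pairs from each line.
notes = """\
Saw 12 finches near the shore this morning.
Counted 4 mockingbirds by the camp.
Two days of rain; no observations.
Found 2 tortoises inland.
"""

# Match a run of digits followed by a word: the count, then the species.
pattern = re.compile(r"(\d+)\s+(\w+)")
observations = [
    (m.group(2), int(m.group(1)))
    for line in notes.splitlines()
    if (m := pattern.search(line))
]
# Lines with no matching structure (the rainy day) are simply skipped.
```

The output is exactly the kind of record a flat file holds, which brings us full circle.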
And thus we have the ultimate recursive history of data: from the flat files of early science, through multiple billions of dollars of development, so that we can handle... er... flat files!