According to the United Nations, there will be more connected devices than people by 2014. One billion gigabytes equals one exabyte, a 1 followed by 18 zeroes. The average smartphone uses about 1 GB of data per month. One exabyte would hold 100,000 times the printed material in the Library of Congress.
Currently, there is extremely high interest in data, specifically big data. What is big data? It is a catch-all term for the plethora of data being captured from many sources. Everything you do, from what you watch on TV, the shows you stream to your electronic devices, what you listen to on the radio, your purchases, and your browsing on the internet, is stored in data warehouses across the world. Purchases made offline in brick-and-mortar stores contribute through barcodes and registers that track what you buy and when. Add the information people personally put out on social media, and there are easily exabytes of data available to mine. One exabyte is equivalent to one quintillion (10^18) bytes of data. Exabytes are vast oceans of data. What are the collectors of data doing with all of it?
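To put those magnitudes in perspective, the arithmetic is simple. A short sketch, using the article's own figure of roughly 1 GB of smartphone data per month:

```python
# Back-of-the-envelope scale of an exabyte.
GIGABYTE = 10**9   # one billion bytes
EXABYTE = 10**18   # a 1 followed by 18 zeroes

# One exabyte is one billion gigabytes.
gb_per_eb = EXABYTE // GIGABYTE
print(gb_per_eb)  # 1000000000

# At ~1 GB per smartphone per month (the article's estimate),
# one exabyte is roughly a billion smartphone-months of traffic.
smartphone_months = gb_per_eb
print(smartphone_months)  # 1000000000
```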
Analyst Expertise Critical
The hunt for the missing Malaysian Airlines plane is an example of data from many sources that was initially examined by inexperienced analysts. A plane literally falling off the radar is rare, but the case also exemplifies too much data arriving too fast. Early investigation into the incident was compromised by a lack of expertise in interpreting anomalies in the data sources and merging them into scenarios that fit the known facts. The serious nature of this event made it even more important to know what to look for in mounds of data.
It doesn’t matter how much data you have; what matters is what you do with it. A basic analytical plan requires:
- Data sources to include and exclude.
- An understanding of the quality of each data source.
- Compatibility of sources when merging. This is necessary to avoid garbage in, garbage out at a higher level: garbage data and analyses can actually be created by combining incompatible sources.
- Definitions of common key fields from each source and how they vary – this can be important for data elements such as race or ethnicity.
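As a minimal sketch of the key-field point above, suppose two hypothetical sources code the same ethnicity field differently. Mapping both to a shared vocabulary before merging avoids creating garbage through the merge itself (the field names and codes here are invented for illustration):

```python
# Hypothetical example: two sources code "ethnicity" differently.
# Harmonize both to a common vocabulary before merging.

SOURCE_A_CODES = {"H": "hispanic", "NH": "not_hispanic"}
SOURCE_B_CODES = {"1": "hispanic", "2": "not_hispanic", "9": "unknown"}

def harmonize(record, code_map, field="ethnicity"):
    """Return a copy of the record with the field mapped to the shared vocabulary."""
    out = dict(record)
    out[field] = code_map.get(record[field], "unknown")
    return out

source_a = [{"id": 1, "ethnicity": "H"}, {"id": 2, "ethnicity": "NH"}]
source_b = [{"id": 3, "ethnicity": "1"}, {"id": 4, "ethnicity": "9"}]

merged = ([harmonize(r, SOURCE_A_CODES) for r in source_a] +
          [harmonize(r, SOURCE_B_CODES) for r in source_b])
print(merged)
```

Without the harmonization step, a naive merge would leave "H" and "1" looking like two different categories, which is exactly the kind of garbage that combining sources can manufacture.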
Types of Analyses
Questions seeking answers. You have questions in mind and search the data for the answers. There may be several ways to search, but the searching is done to answer specific questions or test hypotheses.
Looking for patterns. Often called data mining, this method culls the data for patterns and relationships that suggest hypotheses. Data mining is used to discover unsuspected relationships, such as the finding in shopping data that baby diapers and beer were often purchased together after 10 pm: dads who went to the store when diapers ran out at night picked up beer while they were at it. With healthcare data, this type of analysis can propose cause and effect, although it cannot establish cause.
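The diapers-and-beer story is a classic association-pattern example. A minimal co-occurrence count over a handful of invented transactions shows the idea; the baskets and thresholds here are toy data, not real shopping records:

```python
from itertools import combinations
from collections import Counter

# Toy transactions (invented). Each set is one shopping basket.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"milk", "bread"},
    {"beer", "chips"},
]

# Count how often each pair of items appears together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support: fraction of all baskets containing the pair.
support = {pair: n / len(baskets) for pair, n in pair_counts.items()}

# Confidence of the rule "diapers -> beer": P(beer | diapers).
diaper_baskets = [b for b in baskets if "diapers" in b]
confidence = sum("beer" in b for b in diaper_baskets) / len(diaper_baskets)

print(support[("beer", "diapers")])  # 0.4
print(confidence)                    # 0.666...
```

A real data-mining pass does this over millions of baskets and thousands of items, then surfaces only the pairs whose support and confidence clear chosen thresholds; the surfaced pairs are hypotheses, not proof of cause.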
Too Much Data
Can there be more data than an individual can process? Take the company 23andMe, a DNA analysis service that allowed customers to learn about their genetic makeup and assess their own risk, and the risk of their offspring, for inheritable genetic diseases and conditions. DNA data was provided but not interpreted. Customers were left to interpret the test results and risk without adequate knowledge and perspective from a doctor or geneticist. The concern is that there was a lot of data but not enough information. The FDA ordered the company to stop marketing the test because determining medical risk classified the kit as a medical device, and customers could make bad choices based on its results without professional advice to properly interpret the risk.
In the future, knowing a person’s genetic code will be the basis for personalized medicine as well as for understanding genetic diseases. Ethicists are concerned that the average person will not know what to do with information that says they carry the gene for a certain disease or are at a higher risk of others. Mapping the human genome provides a lot of data, but without interpretation it provides little useful information. Scientists are racing to sequence more genomes. Millions of genomes in databases may unlock the therapies to preserve health and battle diseases. Will it help us or confuse us?
As data warehouses surround us filled with exabytes of data, models are already being created of our consumerism based on transactions. Advertisers already target micro-populations of purchasers based on what we watch, read, eat, listen to, buy, and browse online. From this data conglomeration, they are attempting to predict products that we will buy in the future.
If we control our own genome maps under privacy laws, what happens when we give our genome to these giant databases? They will cheerfully tell us how long we will live, our probability of dying from specific diseases, and probably make recommendations about what we should do in each year of our lives, such as when to start dating, get married, and have children. But it will be our choice to solicit this data and create information from the exabyte sphere.
Trusted Data Leads to Information then Knowledge
With the huge volume of data collected in today’s digital age, many companies have special analytics teams to unearth new insights. With our data and our permission, they will tell us about ourselves in infinite detail. We need to be careful about the sources, quality, and accuracy of the data in order to trust what we hear.
In the end, data that we trust leads to information, information leads to knowledge and knowledge leads to wisdom. Good data can set us on the path to wisdom.