Data Quantity
You can be confident in the results of an analysis only if there is sufficient data; for example, could you determine whether a baseball player is a good hitter after watching only a single at-bat? What if you knew whether the player swung at each pitch, but had no information on the result?
- Are there enough observations?
- Were any data removed? If so, why?
- Do the data contain the right fields?
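One rough way to put a number on the single at-bat question is to treat each at-bat as an independent trial with a fixed chance of a hit. This is a simplifying assumption, not a claim about real baseball. Under it, the standard error of an observed batting average after n at-bats is sqrt(p(1 - p) / n). The sketch below uses a hypothetical hitter whose true average is .300; the function name and the chosen at-bat counts are illustrative only.

```python
import math


def batting_average_standard_error(true_average: float, at_bats: int) -> float:
    """Standard error of an observed batting average.

    Assumes each at-bat is an independent trial with the same hit
    probability (a simplification made for illustration).
    """
    if at_bats <= 0:
        raise ValueError("at_bats must be positive")
    return math.sqrt(true_average * (1.0 - true_average) / at_bats)


# Hypothetical hitter with a true average of .300 (illustrative value).
for n in (1, 10, 100, 500):
    se = batting_average_standard_error(0.300, n)
    print(f"{n:>4} at-bats: standard error ~ {se:.3f}")
```

Under these assumptions the standard error after one at-bat is about 0.46, larger than the average being estimated, while after 500 at-bats it falls to about 0.02.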
Data Quality
There is an old mantra in data analysis: garbage in, garbage out. Even the best analytic tools cannot produce good results from bad data. Having a significant quantity of data is not very meaningful if the data are not accurate, complete, and consistent.
Could you determine whether a baseball player is a good hitter if the data on half of their at-bats were corrupted? What if the capital letters “O” and “I” were transcribed as the numerals “0” and “1”, respectively? A computer views “OUT” and “0UT” as being distinct strings!
- What are the possible sources of error?
- What is the error rate?
- Are the data authoritative, i.e., are they used as the basis for other analyses and decision-making?
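The “OUT” versus “0UT” problem and the error-rate question can be illustrated with a minimal sketch. It assumes a hypothetical list of valid outcome codes (`VALID_OUTCOMES`) and a small made-up sample of records; a real scoring system would define its own codes. The repair step only undoes the two substitutions mentioned above, and only when the result is a known code.

```python
# Hypothetical set of valid outcome codes, chosen for illustration.
VALID_OUTCOMES = {"OUT", "SINGLE", "DOUBLE", "TRIPLE", "HOME RUN", "WALK"}

# Map the numerals 0 and 1 back to the letters O and I.
DIGIT_TO_LETTER = str.maketrans({"0": "O", "1": "I"})


def is_valid(outcome: str) -> bool:
    """True if the string exactly matches a known outcome code."""
    return outcome in VALID_OUTCOMES


def error_rate(outcomes: list[str]) -> float:
    """Fraction of records that do not match any known outcome code."""
    if not outcomes:
        raise ValueError("no records to check")
    invalid = sum(1 for outcome in outcomes if not is_valid(outcome))
    return invalid / len(outcomes)


def repair(outcome: str) -> str:
    """Undo 0->O and 1->I only if that yields a known code."""
    candidate = outcome.translate(DIGIT_TO_LETTER)
    return candidate if is_valid(candidate) else outcome


# Made-up sample records containing two transcription errors.
records = ["OUT", "0UT", "S1NGLE", "WALK"]

print("OUT" == "0UT")                 # the strings differ
print(error_rate(records))            # share of records failing validation
print([repair(r) for r in records])   # records after the cautious repair
```

Measuring the error rate before and after a repair like this gives a rough sense of how much of the damage is systematic and how much would need manual review.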
Data Sanity
- Are the data appropriate for a particular analysis?
- Are there authoritative rules to identify “good” and “bad” data?
- Are the hardware and software systems that acquire, process, and store the data well understood?