ABSTRACT

Dirty data usually begins with keystrokes. The information in a database may well have been entered by someone doing one of the most boring jobs in the world, or the database may have been created by an algorithm with a flaw in its code. Many journalists doing computer-assisted reporting have discovered errors in the record layout and the codebook; probably more common than a bad record layout is an incomplete or inaccurate code sheet. Another peril is the loss of digits, such as leading zeros, when identifiers are imported as numbers. The way to escape it is always to import ZIP codes, identification numbers, and phone numbers into character or text fields. As a general rule, import as a character field any number that will never be added, subtracted, multiplied, or divided; the spreadsheet or database manager will then preserve all the digits. Databases can also contain offending characters: weird-looking smiling faces, or misplaced commas or semicolons.
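
The advice about text fields and stray characters can be illustrated with a minimal Python sketch. It is only one possible approach, not a prescribed method: the sample rows, the column names (`name`, `zip`, `phone`), and the helper functions `read_as_text` and `strip_stray_characters` are all hypothetical, invented for this illustration. The standard `csv` module returns every field as a string, so identifiers keep all their digits unless they are explicitly converted to numbers.

```python
import csv
import io

# Hypothetical sample data; names and values are invented for illustration.
SAMPLE = (
    "name,zip,phone\n"
    "A. Reader,02138,0015550100\n"
    "B. Writer,60614,3125550101\n"
)


def read_as_text(source):
    """Read every CSV field as a string so that all digits are preserved."""
    return list(csv.DictReader(source))


def strip_stray_characters(value):
    """Drop non-printable characters, then trim stray commas, semicolons and spaces."""
    printable = "".join(ch for ch in value if ch.isprintable())
    return printable.strip(" ,;")


rows = read_as_text(io.StringIO(SAMPLE))

# Kept as text, the ZIP code retains its leading zero.
zip_as_text = rows[0]["zip"]

# Converted to a number, the leading zero is lost.
zip_as_number = int(rows[0]["zip"])

# A control character and a trailing semicolon are removed from a name field.
cleaned = strip_stray_characters("\x01Smith;")

print(zip_as_text, zip_as_number, cleaned)
```

The same principle applies in a spreadsheet or database manager: the choice of field type is made at import time, before any digits can be discarded. The cleaning helper is deliberately conservative and only trims the edges of a value, since commas and semicolons inside a field may be legitimate.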