safe harbor de-identification of health data

by Michael Werneburg
on 2017.05.24

The health industry works with a standard called "Safe Harbor" for de-identifying personal information. It's supposed to reduce the share of uniquely identifiable records to 0.04% of the population, meaning only about 1 in 2,500 people can be uniquely identified from the data once it's been restricted or altered. It's part of HIPAA:

The Safe Harbor method for de-identification is defined
as follows:
(2)(i) The following identifiers of the individual or of
relatives, employers, or household members of the individual,
are removed:
(A) Names
(B) All geographic subdivisions smaller than a state,
including street address, city, county, precinct, ZIP code,
and their equivalent geocodes, except for the initial three
digits of the ZIP code if, according to the current publicly
available data from the Bureau of the Census:
(1) The geographic unit formed by combining all ZIP codes
with the same three initial digits contains more than 20,000
people; and
(2) The initial three digits of a ZIP code for all such
geographic units containing 20,000 or fewer people is changed
to 000.
(C) All elements of dates (except year) for dates that are
directly related to an individual, including birth date,
admission date, discharge date, death date, and all ages over
89 and all elements of dates (including year) indicative of
such age, except that such ages and elements may be
aggregated into a single category of age 90 or older.
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(J) Account numbers
(K) Certificate/license numbers
(L) Vehicle identifiers and serial numbers, including license
plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voice prints
(Q) Full-face photographs and any comparable images
(R) Any other unique identifying number, characteristic, or
code, except as permitted by paragraph (c) of this section;

(ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.
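To make the rules above concrete, here is a rough Python sketch of how items (A) through (R) might be applied to a single record. It's only an illustration under stated assumptions: the field names (`zip`, `age`, `birth_date` and so on), the `deidentify` helper, and the contents of `LOW_POPULATION_ZIP3` are all hypothetical placeholders of mine, not anything defined by HIPAA, and the real list of sparsely populated ZIP prefixes has to come from current Census data.

```python
from datetime import date

# Illustrative placeholder only: the real set of three-digit ZIP prefixes
# covering 20,000 or fewer people must be derived from current Census data.
LOW_POPULATION_ZIP3 = frozenset({"036", "059"})

# Hypothetical field names standing in for identifiers (A) and (D)-(R),
# plus the sub-state geography fields from (B).
DIRECT_IDENTIFIERS = frozenset({
    "name", "street_address", "city", "county", "phone", "fax", "email",
    "ssn", "medical_record_number", "health_plan_number", "account_number",
    "license_number", "vehicle_id", "device_id", "url", "ip_address",
})


def generalize_zip(zip_code, low_population=LOW_POPULATION_ZIP3):
    """Keep the first three ZIP digits, or return "000" for sparse areas."""
    prefix = zip_code.strip()[:3]
    return "000" if prefix in low_population else prefix


def generalize_age(age):
    """Collapse all ages over 89 into a single "90+" category."""
    return "90+" if age > 89 else age


def deidentify(record):
    """Return a copy of record with Safe Harbor style generalizations."""
    # Drop the direct identifiers outright.
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    if "zip" in out:
        out["zip"] = generalize_zip(out["zip"])

    age = out.get("age")
    if age is not None:
        out["age"] = generalize_age(age)

    # Dates keep only the year. For anyone over 89 the birth year is
    # suppressed as well, because it would reveal the age.
    for field in ("birth_date", "admission_date", "discharge_date"):
        if field in out:
            value = out.pop(field)
            year_field = field.replace("_date", "_year")
            if field == "birth_date" and age is not None and age > 89:
                out[year_field] = None
            else:
                out[year_field] = value.year
    return out
```

Note that a sketch like this only covers the mechanical part of the rule. Condition (ii), that the covered entity has no actual knowledge the remaining data could identify someone, is a judgment call that code can't make, and identifiers buried in free-text fields would need separate handling.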

I find it odd that the financial industry doesn't push something similar to Safe Harbor, which has been used in the health sphere for years. Or, if the finance field does have such a standard, I wonder how I could have operated in that area for so long without coming across it. Either way, nothing like this is in common practice: I've seen banks throw any and all of these fields at third parties at the slightest provocation. I think they need to learn from the health industry.
