How Can Search Engines Identify Medical Images?
Optical Character Recognition (OCR) is the process of identifying text characters in a digital image and converting them into machine-encoded text. OCR makes the text in an image available for data processing, such as searching and indexing.
Researchers explained, “The ability to use Optical Character Recognition (OCR) at scale allows programs to quickly re-generate explicit PHI that was originally burned into the image pixels. Search engines can then associate (‘index’) the image with that explicit PHI thereby making it discoverable. As a result, this data can be made available and linked to other text-based information.”
This poses a serious risk to PHI security because educators often use medical images from real patients in their presentations. When those presentations are made publicly available as PowerPoint or Adobe PDF files, search engines can scan the documents using OCR. Search engines can then index the files using protected health information (PHI), such as a patient’s name, extracted from images that the presenters thought had been scrubbed of it.
Researchers commented, “When explicit patient information becomes associated with images in the search engine database, it can be found on subsequent internet searches on the patient’s personal information.” This indexing process therefore allows anyone to enter the patient’s name into a search engine and be presented with actual medical images of the patient. They added, “For example, when a patient searches her name in a search engine, images from a diagnostic imaging study performed four years ago appear. When she clicks on the images, she is directed to the website of a professional imaging association which stored an Adobe PDF file as part of an educational presentation.”
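The indexing step the researchers describe can be illustrated with a toy inverted index. This is only a minimal sketch, not how any particular search engine works: the OCR stage is skipped and its output is supplied as hypothetical strings, and the file names, patient name, and the helpers `tokenize`, `build_index`, and `search` are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(ocr_text_by_file):
    """Map each token to the set of files whose OCR text contains it."""
    index = defaultdict(set)
    for filename, text in ocr_text_by_file.items():
        for token in tokenize(text):
            index[token].add(filename)
    return index


def search(index, query):
    """Return the files that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical OCR output (assumption): text recovered from image pixels
# in two publicly posted files. The patient name is fictional.
ocr_text_by_file = {
    "chest_ct_lecture.pdf": "DOE^JANE  DOB 1970-01-01  CT CHEST W/O CONTRAST",
    "anatomy_review.pptx": "Figure 3: normal knee MRI, sagittal view",
}

index = build_index(ocr_text_by_file)
hits = search(index, "Jane Doe")
print(hits)
```

Once the name burned into the image has been turned into tokens, a query on that name leads straight back to the file that contains the image, which is the exposure the researchers warn about.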
PHI Security and De-Identification
This discovery reinforces the need to properly de-identify patient information before sharing medical images. Under the HIPAA Privacy Rule, the Safe Harbor method is one recognized way to de-identify PHI before it is shared, for example for research purposes: it requires removing a specified set of identifiers, including names. De-identification is the process of removing all individually identifying information from medical files, which, in essence, means the data is no longer PHI.
When PHI is shared for purposes other than treatment, payment, or healthcare operations, such as education, entities need prior written consent from the patient. Otherwise, the data must be de-identified before it is shared.
Most providers are likely unaware of the OCR capabilities of search engines, so any resulting PHI breach would be accidental. However, the Department of Health and Human Services (HHS) does not accept ignorance as an excuse. At some point, it is likely that an organization will be audited and/or fined by HHS for failing to adequately de-identify PHI in a medical image shared in an educational presentation.
On that note, researchers provided some guidance on de-identifying medical images: “Specific functions are available in some software to permanently delete cropped, obscured or hidden information in presentation files. As a final quality control check, it’s recommended that these ‘sanitization’ functions be run on all presentations prior to being made public.”
The sanitization methods the researchers refer to include using an anonymization algorithm embedded in the picture archiving and communication system (PACS) or disabling patient information overlays in images. Simply cropping PHI out of an image, or placing black bars over sensitive information, does not adequately de-identify PHI.
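One reason cropping or covering is not enough is that a presentation file can keep the original picture intact. A .pptx file is a ZIP archive, and embedded pictures are normally stored under `ppt/media/`; a crop or an overlaid shape is typically recorded as a display setting rather than applied to the stored pixels. The sketch below, which uses only Python's standard `zipfile` module, lists those embedded pictures so they can be reviewed before a file is published. The function name `list_embedded_media` is hypothetical, and this is a quick check under those assumptions, not a substitute for the sanitization functions the researchers recommend.

```python
import zipfile


def list_embedded_media(pptx_path):
    """Return (name, size_in_bytes) for each file under ppt/media/.

    pptx_path may be a file path or a file-like object containing
    a .pptx (ZIP) archive.
    """
    with zipfile.ZipFile(pptx_path) as archive:
        return [
            (info.filename, info.file_size)
            for info in archive.infolist()
            if info.filename.startswith("ppt/media/") and not info.is_dir()
        ]
```

Calling `list_embedded_media("lecture.pptx")` on a presentation (the file name here is a placeholder) returns every stored picture with its size. Each picture can then be extracted and inspected at full size for burned-in patient information, regardless of how it appears on the slide.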