DiSMed

De-identifying Spanish medical texts - Named Entity Recognition applied to radiology reports

Irene Pérez-Díez*, Raúl Pérez-Moraga*, Adolfo López-Cerdán, José María Salinas, María de la Iglesia-Vayá

Abstract

Background

Medical texts such as radiology reports or electronic health records are a powerful source of data for researchers. Anonymization methods must be developed to de-identify documents containing personal information about both patients and medical staff. Although several anonymization strategies already exist for English, they are language-dependent. Here, we introduce a named entity recognition strategy for Spanish medical texts that can be adapted to other languages.

Results

We tested four neural networks on our radiology reports dataset, achieving a recall of 96.55% on the identifying entities. In addition, we developed a randomization algorithm that substitutes the detected entities with new ones from the same category, making it virtually impossible to differentiate real data from synthetic data. The three best architectures were then evaluated on the MEDDOCAN challenge dataset of electronic health records as an external test set, reaching a recall of 69.86%.
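The substitution step can be illustrated with a minimal sketch. It assumes that detected entities arrive as character spans with a category label. The surrogate lists, the category names and the function `randomize_entities` are hypothetical and are not taken from the DiSMed code.

```python
import random

# Hypothetical surrogate pools; categories and values are illustrative only.
SURROGATES = {
    "NAME": ["Lucía Gómez", "Carlos Ruiz", "Marta Ortiz"],
    "LOCATION": ["Sevilla", "Bilbao", "Zaragoza"],
}


def randomize_entities(text, entities, rng=None):
    """Replace each (start, end, category) span with a random
    surrogate drawn from the same category."""
    rng = rng or random.Random()
    pieces = []
    cursor = 0
    # Process spans left to right so the untouched text between them is kept.
    for start, end, category in sorted(entities):
        pieces.append(text[cursor:start])
        pieces.append(rng.choice(SURROGATES[category]))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_of(text, fragment, category):
    """Helper for the example: locate a fragment and label it."""
    start = text.index(fragment)
    return (start, start + len(fragment), category)


report = "Paciente Ana López, remitida desde Valencia."
spans = [
    span_of(report, "Ana López", "NAME"),
    span_of(report, "Valencia", "LOCATION"),
]
result = randomize_entities(report, spans, random.Random(0))
print(result)
```

Because each replacement comes from the same category as the entity it replaces, the output keeps the structure of the original sentence. Any identifying entity that the recognizer missed would then be hard to tell apart from the synthetic ones.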

Conclusions

The proposed strategy, which combines named entity recognition with randomization of entities, is suitable for Spanish radiology reports. It does not require a large training corpus, so it can be easily extended to other languages and medical texts, such as electronic health records.

Keywords

Natural Language Processing, Named Entity Recognition, radiology reports, medical texts, Spanish

Data availability
Use of DiSMed is free to all researchers. Researchers seeking to use the full Clinical Database must formally request access. By requesting access, users agree that (1) they will not share the data and (2) they will make no attempt to re-identify individuals.
Although de-identified, DiSMed still contains information regarding the clinical care of patients and must be treated with appropriate respect.

Investigators

Principal Investigators: Irene Pérez-Díez & María de la Iglesia-Vayá

Co-investigators: Susana Fernandez Celda & José María Salinas

Contact

If you want to know more about the project, or contact the research team, write to us.