Electronic Health Records (EHRs) are primarily designed to serve administrative purposes and to support clinical care. Thus, the data they include and the methods used to store them are optimized for these tasks. EHRs contain two primary types of information: structured and unstructured. Structured information includes diagnosis codes, dates, billing codes, lab values, medication lists, physiologic measurements, and other variables such as demographic information. This kind of data, stored as name-value pairs and often encoded using standard terminologies, is easily accessible and constitutes a valuable source of information for health research. However, a huge amount of significant information …
EHRs contain many personal data such as names, telephone numbers, e-mail addresses, and medical record numbers. Therefore, EHRs must be completely de-identified before they are released to clinical investigators. From a linguistic point of view, clinical narrative is highly heterogeneous: it often contains spelling and typing errors, does not always conform to standard grammar, and includes many acronyms and abbreviations. In short, clinical narratives represent a big challenge for natural language …
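The de-identification step mentioned above can be illustrated with a minimal sketch. It assumes a purely pattern-based approach; the patterns, the placeholder labels, and the `deidentify` function are hypothetical and cover only e-mail addresses, one telephone format, and record numbers written with an "MRN" prefix. Names, dates, and addresses would need additional methods.

```python
import re

# Hypothetical patterns for illustration only; real de-identification
# needs far broader coverage (names, dates, addresses, ...).
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
]


def deidentify(text: str) -> str:
    """Replace every match of each pattern with a bracketed placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Pt contact: 555-123-4567, jane.doe@example.org, MRN: 0012345."
print(deidentify(note))
```

Replacing identifiers with typed placeholders rather than deleting them keeps the sentence structure intact, which may help later language processing steps.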
Furthermore, gaining structured information from EHRs requires the recognition of entities (diseases, symptoms, treatments, medications…), which have to be mapped to codes in relevant controlled medical vocabularies and terminologies such as SNOMED-CT [4,5] or ICD-10. Additionally, it is crucial to identify negating terms, especially those that precede named entities. With deep semantic processing, many other highly interesting functionalities could be included, such as tools to determine temporality (recent or historical scenarios), to determine the degree of certainty or uncertainty, and to detect relations between entities.
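A minimal sketch can make the two steps above concrete: looking up entities in a vocabulary and checking for a preceding negating term. It assumes a toy single-word lexicon and a fixed token window; the trigger list, the window size, and the `extract_entities` function are hypothetical, and the ICD-10 codes are illustrative and should be verified against an official release.

```python
import re

# Toy lexicon: surface term -> ICD-10 code (illustrative values only).
LEXICON = {"pneumonia": "J18.9", "fever": "R50.9"}
# Hypothetical negation triggers; real systems use much larger lists.
NEGATION_TRIGGERS = {"no", "not", "denies", "without"}
WINDOW = 3  # assumed number of preceding tokens inspected for a trigger


def extract_entities(sentence):
    """Return (term, code, negated) for each lexicon term in the sentence."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    found = []
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            preceding = tokens[max(0, i - WINDOW):i]
            negated = any(t in NEGATION_TRIGGERS for t in preceding)
            found.append((tok, LEXICON[tok], negated))
    return found


print(extract_entities("Patient denies fever but chest X-ray shows pneumonia."))
```

This sketch handles only single-word terms and ignores the scope of a negation beyond the fixed window, so it marks "fever" as negated and "pneumonia" as affirmed in the example sentence but would miss multi-word entities, abbreviations, and misspellings of the kind clinical narrative often contains.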
Several medical language processing systems for English have already been developed, including the early system MedLEE (Medical Language Extraction and Encoding System) [6], as well as more recent open-source systems such as the clinical Text Analysis and Knowledge Extraction System (cTAKES) [7], now adopted by eMERGE, and the Informatics for Integrating Biology and the Bedside (i2b2) system HITEx [8], which could serve as a guide for developing systems in other