Current state of the art solutions from the field of Language Technology employ machine learning algorithms and large volumes of data to solve tasks that enable understanding of textual data, such as the extraction of entities and relations between these. Even though much progress has been made in terms of providing tools for the processing of unstructured data, BigMed faces two main challenges.
Firstly, data from the medical domain is notoriously difficult to process automatically due to large variation, sparse data and the large cost associated with the manual annotation of such data. Further, the processing of Norwegian texts require an effort to adapt and develop tools specifically tailored for Norwegian clinical text.
BigMed's NLP-group aims to develop tools for processing of Norwegian medical text by
(i) re-use and adaptation of existing tools for general-domain Norwegian text, and
(ii) creating reusable medical language resources for Norwegian,
(iii) applying state-of-the-art machine learning techniques that enable the combination of data sources and generalization to new data.
The NLP effort will mostly be focused on the "Sudden Cardiac Death"-use case and collaborations with AHUS. General activities are described below:
- Pre-processing pipeline:
- sentence splitter
- tokenizer
- PoS-tagger
- Clinical input representations: training of vector space representations (embeddings) for clinical terms
- Text corpus of clinical text
- Clinical entity detection using machine learning
Use case 1: Family history extraction from clinical text (w/OUS)
- synthetic corpus of family history statements, manually annotated
- SVM models for family history extraction
Resources: https://github.com/ltgoslo/NorSynthClinical
Publication: Taraka Rama, Pål Brekke, Øystein Nytrø and Lilja Øvrelid: " Iterative development of family history annotation guidelines using a synthetic corpus of clinical text"
Use case 2: Text classification from radiology reports (w/AHUS) - neural models for classification of radiology reports - domain-specific word embeddings trained on clinical text