Hypothesis formulation in medical records space

Thamer Omer, PhD student


Project Overview

  • Data held in patient records (e.g. the Clinical Practice Research Datalink, CPRD) are used to make new discoveries about diseases, medications and clinical practice. Researchers have most commonly used such data sources by querying them to answer very specific questions. For example, does treatment with a specific drug cause some patients to experience a particular rare side effect? If researchers have good questions, then patient records can provide good answers. But are we missing opportunities? Are there other, equally important questions that could be asked of the data but that nobody has yet thought to ask?
  • In other areas where large amounts of data have been collected, such as finance or retail, “data mining” is used to identify new areas to explore and to find good questions to ask of the data. In this research, we apply a new “data mining” strategy, developed specifically for patient data, to look for unusual and interesting patterns. Some of these patterns will correspond to questions that are already well known and understood, but some should point to new and important questions that have not yet been asked.

Start: January 2012

End: December 2015


Data Source

  • Primary care data, with access to the Salford Integrated Record (SIR) and Clinical Practice Research Datalink (CPRD) databases


  • We developed a novel methodology that builds on the idea of semantic similarity: patient data, in the form of codes, are mapped into a low-dimensional vector space in which distance reflects similarity of patient phenotype. The mapping takes two steps.
  • The first step maps patient records into a semantic similarity space; the second reduces the dimensionality of that space using principal component analysis (PCA). The result is a compact representation of patient data in which visualisation and clustering are much easier.
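
The two steps above can be sketched as follows. This is a minimal, dependency-free illustration and not the published method: the code strings, the prefix-based `code_similarity`, the best-match `patient_similarity` and the power-iteration `pca` are all assumptions made for the example.

```python
import math

# Hypothetical toy records: each patient is a set of hierarchical codes.
# The code strings are invented for illustration only.
patients = [
    {"C10E", "bd3j"},
    {"C10F", "bd3k"},
    {"H33", "c13"},
    {"H330", "c13v"},
]

def code_similarity(a, b):
    """Assumed measure: shared-prefix length divided by the longer code length."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))

def patient_similarity(p, q):
    """Symmetric best-match average over the codes of two patients (assumed)."""
    def directed(src, dst):
        return sum(max(code_similarity(c, d) for d in dst) for c in src) / len(src)
    return (directed(p, q) + directed(q, p)) / 2

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def pca(rows, n_components=2, iterations=200):
    """Project rows onto their top principal components (power iteration)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(x[i] * x[j] for x in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    components = []
    for _ in range(n_components):
        # Deterministic, normalised start vector.
        v = [1.0 / (j + 1) for j in range(d)]
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iterations):
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, matvec(cov, v)))
        components.append(v)
        # Deflate: remove the component just found before finding the next one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return [[sum(a * b for a, b in zip(x, c)) for c in components] for x in centred]

# Step 1: describe each patient by their similarity to every patient.
similarity_space = [[patient_similarity(p, q) for q in patients] for p in patients]

# Step 2: reduce that similarity space to two dimensions.
coords = pca(similarity_space, n_components=2)

for index, point in enumerate(coords):
    print(index, [round(x, 3) for x in point])
```

In this toy data the first two patients share code prefixes with each other but not with the last two, so they land close together in the reduced space and far from the other pair.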

Benefits and Outcomes

This research shows that patient data can be mapped into a low-dimensional space in such a way that distance reflects similarity between patient records. Mapping the patient data into a vector space opens up the possibility of applying a wide range of data mining strategies that have not yet been explored.

We expect that presenting data in this way will make it amenable to analysis with more traditional data mining strategies, and will provide a much more intuitive and straightforward environment in which to formulate new medical hypotheses.
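
As one example of a traditional data mining strategy that becomes available once patients are points in a vector space, the sketch below runs a basic k-means clustering over 2D coordinates. The coordinates are invented for illustration, and k-means is only an assumed example; the project description does not state which clustering algorithm was used.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic k-means; the first k points serve as deterministic starting centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(axis) / len(members) for axis in zip(*members)]
    return labels, centroids

# Hypothetical 2D patient coordinates, e.g. the output of a PCA projection.
patient_points = [
    (0.90, 0.10),
    (-0.85, 0.05),
    (0.88, -0.12),
    (-0.90, -0.10),
    (0.80, 0.00),
]

labels, centroids = kmeans(patient_points, k=2)
print(labels)
```

Patients that receive the same label form a candidate cluster, which a researcher could then inspect for shared diagnoses or treatments.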


In a small-scale study based on Salford Integrated Record (SIR) primary care data, we demonstrated that the mapping methodology provides informative views of patient phenotypes across a population and allows the construction of clusters of patients who share common diagnoses and treatments. The findings of this study were published and presented at MedInfo 2013, the 14th World Congress on Medical and Health Informatics.

L. Kalankesh, J. Weatherall, T. Ba-Dhfari, I. Buchan, and A. Brass, “Taming EHR data: Using semantic similarity to reduce dimensionality,” Stud. Health Technol. Inform., vol. 192, pp. 52–56, 2013.

Researchers Involved

Prof. Andrew Brass

Prof. Tjeerd van Staa

Darren Ashcroft

James Weatherall

Mark Davies