Note - Some parts of this post require some familiarity with the basics of Machine Learning. These terms should be understood in their technical sense - Binary Classification, Accuracy, AUC, Biased Dataset
DoctorC has been doing diagnostics for 3 years now. We have gone from running operations on spreadsheets to end-to-end automated systems that handle everything from placing an order to delivering reports. We also store all reports and some report values for our customers so that they can easily access their medical history from anywhere.
A few months ago, we started an initiative to see if we could apply ML to our data and get some useful applications out of it. We are an extremely small company with very limited resources, so it was basically just me running some experiments and dedicating only a few hours a week.
The first question we asked ourselves before starting was - “How is it going to be useful to us as a business?”. The answer was surprisingly easy.
We have two kinds of customers -
#1 are transactional and #2 are our repeat customers. #2 use and need our service the most, and they are consequently very valuable to us. In Urban India, chronic conditions (non-communicable diseases) are rising rapidly, and the prevalence of these diseases varies from 15% all the way to 35% (WHO link - http://www.who.int/choice/publications/Chronic_diseaseIndia.pdf).
These are the customers we are looking for. So the question boiled down to - “How do we identify people with chronic conditions?”. Once we identify them - we can enroll them in a program specifically tailored for people who need medications and diagnostic tests frequently.
We needed to start with a training set of people who are identified as chronic. We had 0 points of data for this.
So we looked for proxies. “What behaviour indicates if a person has a chronic condition?”. Answer - People who used our services frequently are more likely to be chronic.
We keep track of every test that was bought and their results for every customer. Therefore, we had a (comparatively speaking) rich medical history at our fingertips. On top of that - our in-house proprietary system can “read” a diagnostic report and infer structured medical data for a whole array of tests.
Based on this, we started with an initial training set of a couple of hundred people who were deemed “chronic”. We used another set of infrequent buyers as negative examples for our model.
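As a rough illustration of the proxy idea, here is a minimal Python sketch that turns order history into proxy labels. The function name and both thresholds are hypothetical; the post does not state the cutoffs we actually used.

```python
from collections import Counter

def proxy_labels(orders, frequent_min=4, infrequent_max=1):
    """Assign proxy labels from purchase frequency.

    orders: iterable of customer ids, one entry per order placed.
    frequent_min / infrequent_max: hypothetical cutoffs, for illustration only.
    Returns {customer_id: 1} for frequent buyers (proxy "chronic") and
    {customer_id: 0} for infrequent buyers. Customers in between stay unlabeled.
    """
    counts = Counter(orders)
    labels = {}
    for customer, n in counts.items():
        if n >= frequent_min:
            labels[customer] = 1
        elif n <= infrequent_max:
            labels[customer] = 0
    return labels

# Toy order log: "a" ordered 5 times, "c" twice, "b" and "d" once each.
orders = ["a", "a", "a", "a", "b", "c", "c", "d", "a"]
labels = proxy_labels(orders)
```

Leaving the middle group unlabeled keeps the ambiguous customers out of the initial training set, at the cost of a smaller set.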
I explored a whole bunch of options on what platform to use for training ML - Scikit Learn, Tensorflow, AWS Machine Learning and Microsoft Azure. Azure’s ease of use puts it in another league altogether - especially for people who just want to apply existing ML algorithms. Drag and drop, kicking off experiments in parallel, easy-to-construct flows etc. let you move much faster than writing code.
This might become a bottleneck later as we scale or it might become more expensive to run - but the tools they provide save tons of time, especially for prototyping and experimentation. I highly recommend trying out Azure - their free tier gives a good taste of what’s possible on their platform.
Before you start, or while you are playing with your dataset - you need to decide what your “One True Metric™” is. This is the thing you are going to optimize for while evaluating different machine learning models. By definition, it is very specific to your problem and cannot be purely decided based on a formula.
For us, it was identifying people with chronic conditions with a highly imbalanced data set (we have far fewer people with chronic conditions compared to the normal population). It’s a binary classification problem. I chose AUC as a measure and started experiments to optimize that. Why did I not choose Accuracy? That is a post for another day.
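Without getting into the full argument, a small Python sketch shows how the two metrics behave on an imbalanced set. The 15/85 split is only an illustrative assumption, and the helper functions are written out by hand here rather than taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This is the pairwise definition of ROC AUC,
    quadratic in the data size, so it is only meant for small examples.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative imbalanced set: 15 chronic, 85 not chronic.
y_true = [1] * 15 + [0] * 85

# A useless "model" that calls everyone non-chronic.
always_negative = [0] * 100
constant_scores = [0.0] * 100

print(accuracy(y_true, always_negative))  # 0.85
print(auc(y_true, constant_scores))       # 0.5
```

The useless model looks good on accuracy but scores 0.5 on AUC, which is the same as random ranking.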
We did an initial split of 50/50 of positive and negative samples to ensure we didn’t bias the system with a small initial set. Then I started my experiments. This is the messy part - you have to start from the simplest models and run different models simultaneously on the same data.
The first set of results yielded an ROC curve (the curve that AUC measures the area under) similar to this -
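One simple way to get a 50/50 set is to downsample the larger class. The sketch below assumes that approach; the post does not say exactly how the split was produced, and `balanced_sample` is a hypothetical helper.

```python
import random

def balanced_sample(positives, negatives, seed=0):
    """Return a shuffled list of (example, label) with equal class counts.

    The larger class is randomly downsampled to the size of the smaller one.
    A fixed seed keeps the sample reproducible between experiments.
    """
    rng = random.Random(seed)
    n = min(len(positives), len(negatives))
    sample = [(x, 1) for x in rng.sample(positives, n)]
    sample += [(x, 0) for x in rng.sample(negatives, n)]
    rng.shuffle(sample)
    return sample

# Toy ids: 3 proxy-chronic customers, 10 infrequent buyers.
positives = ["p1", "p2", "p3"]
negatives = ["n%d" % i for i in range(10)]
training_set = balanced_sample(positives, negatives)
```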
We did our first real-world test with a model that had an AUC of 0.58 (which is not a very good score). We took the top 100 predictions by the model and started verifying if the identified people were chronic. We recorded this data and then used it to create a newer model. This gave us another 100 predictions. Then we repeated the cycle again.
The model started improving with each iteration. Rebuilding the model with new data took only a couple of days, and the model training itself took only a few minutes. Gathering the data was the only bottleneck.
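The cycle described above can be sketched as a loop. This is a schematic under stated assumptions, not our actual pipeline: `train` and `verify` are hypothetical callables standing in for the model training step and the manual verification step, and the toy data at the bottom is made up.

```python
def refine(labeled, unlabeled, train, verify, rounds=3, top_k=100):
    """Repeat: train, take the top_k highest-scored unlabeled customers,
    verify them, and fold the verified labels back into the training data.

    labeled:   dict of customer id -> 0/1 label
    unlabeled: set of customer ids with no label yet
    train:     callable(labeled) -> scoring function(customer id) -> float
    verify:    callable(customer id) -> 0/1, the manual check
    """
    labeled = dict(labeled)      # work on copies, leave inputs untouched
    unlabeled = set(unlabeled)
    for _ in range(rounds):
        score = train(labeled)
        ranked = sorted(unlabeled, key=score, reverse=True)
        for customer in ranked[:top_k]:
            labeled[customer] = verify(customer)
            unlabeled.discard(customer)
    return labeled

# Toy stand-ins: one made-up feature per customer, "chronic" if it is >= 7.
features = {i: i % 10 for i in range(50)}

def toy_train(labeled):
    pos = [features[c] for c, label in labeled.items() if label == 1]
    center = sum(pos) / len(pos)
    # Score customers by closeness to the average known positive.
    return lambda c: -abs(features[c] - center)

def toy_verify(customer):
    return int(features[customer] >= 7)

seed_labels = {0: 0, 1: 0, 7: 1, 8: 1}
pool = set(features) - set(seed_labels)
result = refine(seed_labels, pool, toy_train, toy_verify, rounds=3, top_k=5)
```

Each round adds `top_k` verified labels, so the labeled set grows at the speed of the verification step, which matches the bottleneck we saw.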
Our model’s curve improved as shown below. From a raw data standpoint, our total dataset was around 1.2 million discrete values. Our accuracy, measured on the first couple of hundred predictions of every iteration, was approaching ~80%. The AUC of the best model varied from 0.74 to 0.80. The accuracy fell steeply after that, but it was still significantly better than the 15% to 20% accuracy you would get with a random sample.
More interesting were the predictive factors for chronic conditions. While I cannot delve into too many details, one of the most interesting finds was that Vitamin D (both the number of purchases and the test values) was one of the most predictive factors for whether someone is chronic or not. It fits neatly into the current research, which suggests that Vit D is extremely important to maintain since it affects almost every system in our body. You should get Vit D levels checked every 6 months or a year - especially if a large part of your day is spent indoors.
Yes, you can predict, with a high degree of accuracy as well as high AUC, if someone is suffering from a chronic condition based on a combination of their purchase behavior (surprising) and their test values (not very surprising, still exciting). The next question was whether we could actually narrow it down to a specific condition like Diabetes, Thyroid disorders etc. That is a post for another day.
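The comparison with a random sample can be made concrete by measuring the hit rate among the top-ranked predictions against the base rate. The sketch below uses made-up numbers, and `precision_at_k` is a hypothetical helper, not something from our system.

```python
def precision_at_k(y_true, scores, k):
    """Fraction of true positives among the k highest-scored examples."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

# Made-up example: 10 customers, 3 of them actually chronic.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

top4 = precision_at_k(y_true, scores, 4)   # 3 of the top 4 are chronic
base_rate = sum(y_true) / len(y_true)      # what a random pick would hit

print(top4)       # 0.75
print(base_rate)  # 0.3
```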
There is also the long term possibility of actually providing diagnosis while working side by side with doctors. This system is a small step in fulfilling that vision. We have miles to go before we sleep.
We are very excited about the possibilities that this kind of data enables. If you would like access to this data (especially for medical research purposes) or would like to know additional details please contact me on karan at the rate doctorc dot in.