1. A High-Level Overview of the Use Case:
One of the most effective use cases of data science in healthcare is predicting a patient's medical costs from the factors that drive higher expenditure. Rising medical costs are a major public health concern, so understanding the contributing factors is crucial. A significant aspect of medical cost prediction is identifying patients who are at risk of incurring high costs, which supports effective resource planning for hospitals. The relationship between patients' primary medical costs and their characteristics (such as age, gender, morbidity, and BMI) plays a vital role when investigating whether healthcare resources are allocated equitably and whether patients with equal needs receive an equal amount of care. It is also necessary to know how patients' expected costs vary with these commonly influential factors.
2. If we have data, let's look at it!
Now that we have a good understanding of the medical cost prediction use case, the next obvious step is to analyze it. Any analysis begins by looking at the data. We already have a predefined dataset uploaded to Smarten, so let's go step by step through how to open a loaded dataset in Smarten and analyze it.
- Steps to Open a loaded dataset in Smarten:
1. Go to Open -> Data in top right drop-down in Smarten
2. Search for the dataset to analyze
3. Examine the dataset
On opening the dataset, we can view the attributes (columns) that affect the use case and draw some first, superficial insights by browsing through the data. Your dataset will look as follows:
Medical Cost Prediction Dataset
- Data Analysis from Elementary Point of View:
From the dataset, we can see that there are multiple factors (i.e. attributes or columns) which may have an effect on our target variable (i.e. charges). Charges are continuous numeric values, and they may increase or decrease with the values of the predictor attributes. Though this effect may vary from person to person and with the type of data we use, we can certainly derive some associations between the predictor features and the amount of charges.
Our dataset consists of the following features apart from our target feature charges:
In layman's terms, we can expect that a smoker or an older person might have higher medical costs under typical conditions. But this may differ from person to person, as well as with the type of data we have in hand.
To validate our assumptions and to provide extensive data insights, Smarten gives us a straightforward and accurate approach for drawing better conclusions from our data.
Now it's time to gear up with some amazing data analytics using Smarten!
3. The goal is to figure out what to do with the data!
Smarten provides a feature called Smarten Assisted Predictive Modelling, which uses machine learning to reduce the time and skills required to produce accurate, clear results quickly. With Smarten Insight, the user simply has to select the dataset to be analyzed and the broad category of algorithm to be applied. That's it! The system will interpret the dataset, select the important columns of data, analyze their type, variety and other parameters, and then use machine learning to automatically apply the best algorithm within the selected algorithm technique in order to provide data insights.
Let's gradually familiarize ourselves with Smarten Insight!
1. Create a fresh New Smarten Insight:
Go to New -> SmartenInsight in the drop-down menu in the top right corner of the Smarten Dashboard
Create a New Smarten Insight
2. What do you want to do? Choose Algorithm Technique Accordingly:
The following window opens on selecting Smarten Insight:
Smarten Insight - Algorithm Technique Selection
Smarten lets users select the algorithm technique that best suits their use case and provides quality results for the chosen technique. To select the best category of algorithm, users need some basic data literacy.
We need to choose the best algorithm technique from the following list:
- Hypothesis Testing
- Frequent Pattern Mining
- Descriptive Statistics
Let's put on our thinking caps and try to recognize the best-suited algorithm technique for the Medical Cost Prediction dataset!
The high-level goal is pretty clear: we want to identify the factors that lead to fluctuations in medical charges and derive a relationship between these influencing factors and the charges. To put it another way, we want to describe the changes in our dependent variable (i.e. charges) in terms of the independent variables (i.e. age, smoker, BMI etc.). Because our target is a continuous numeric variable that depends on several other variables, regression analysis is the most appropriate choice.
This suggests that our problem can be framed as a Regression Analysis, and we can continue getting better insights from this data in Smarten by selecting Regression from the list of provided algorithm techniques. Moreover, Smarten provides a description of each algorithm technique along with a basic example, to enhance data literacy and assist us in algorithm selection.
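To make the regression framing concrete, here is a minimal Python sketch of multiple linear regression fitted by ordinary least squares. It is not Smarten's implementation: the rows, the coefficients used to generate the charges, and the helper names `solve` and `fit_ols` are all made up for illustration.

```python
# Illustrative only: made-up rows of (age, bmi, smoker flag), not the Smarten dataset.
ROWS = [
    (19, 27.9, 1), (18, 33.8, 0), (28, 33.0, 0), (33, 22.7, 0),
    (32, 28.9, 0), (46, 33.4, 1), (37, 27.7, 0), (60, 25.8, 1),
]
# Hypothetical relationship used only to generate the target values.
TRUE_INTERCEPT, TRUE_COEFS = 1000.0, (250.0, 300.0, 20000.0)
CHARGES = [
    TRUE_INTERCEPT + sum(c * x for c, x in zip(TRUE_COEFS, row)) for row in ROWS
]


def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] minimising the squared error."""
    xs = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(xs[0])
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


beta = fit_ols(ROWS, CHARGES)
print([round(b, 2) for b in beta])
```

Because the made-up charges follow the generating formula exactly, the fit should recover values very close to the generating intercept and coefficients. Real data is noisy, so the estimates would only approximate the underlying relationship.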
3. Select the data for Performing Regression upon:
- Search for the dataset (in our case it's named Medical cost prediction-Multiple Regression-Dataset) from the search bar.
- Select the radio button against the dataset of interest.
- Click on the next button at the bottom to advance the procedure.
Smarten Insight - Search and Select Dataset for Regression Analysis
4. Wait while Smarten identifies the target (outcome variable) and the predictors (independent variables) for us:
Smarten Insight - Identify best Target and Predictor for chosen dataset
5. Select the target and predictors based upon your choice:
After Smarten identifies the target and predictor columns as showcased below, we can change the settings as per our requirements.
Smarten Insight - Screen indicating chosen Target and predictors for Regression Analysis by Smarten
5.1. Target selection:
As we are interested in predicting the cost of medication, charges is the apt choice for our target. To make charges our target variable, we must select it from the drop-down labelled ‘Select the target variable containing predefined classes or groups’.
Smarten Insight - Modifying the Target Variable
5.2. Identify relevant predictors for selected Target:
Smarten has a feature that selects the significant predictors for the chosen target using machine learning techniques. In this scenario, the default target was already the one we are interested in (i.e. charges), and hence Smarten has already recommended significant predictors for us. The predictors highlighted in yellow are those recommended by Smarten's automatic mechanism.
5.3. Perform amendments in Predictors (Optional):
Now that Smarten has equipped us with the predictors, we still have the choice to add or remove predictors based on our interest. The right-side Predictors table lists those that significantly affect the target (here charges), and the left side contains the remaining predictors. The predictors in the right-side table will be fed to the machine learning model to generate outcomes, so we can alter that list using the + and - symbols before proceeding further.
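The article does not describe how Smarten scores predictors, so the following is only one simple way to get a feel for predictor relevance: ranking each candidate by the absolute Pearson correlation with the target. The data values and the function names `pearson` and `rank_predictors` are assumptions made for this sketch.

```python
from math import sqrt

# Made-up values for illustration; not taken from the Smarten dataset.
DATA = {
    "age": [20, 30, 40, 50, 60],
    "bmi": [22.0, 30.0, 25.0, 31.0, 29.0],
    "children": [2, 0, 1, 0, 2],
}
CHARGES = [1000, 2000, 3000, 4000, 5000]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rank_predictors(data, target):
    """Sort predictor names by absolute correlation with the target, strongest first."""
    scores = {name: abs(pearson(values, target)) for name, values in data.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_predictors(DATA, CHARGES)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A correlation ranking looks at one predictor at a time, so it can miss predictors that only matter in combination. It is a quick sanity check on a recommended list, not a substitute for it.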
5.4. Run model on full or sample data option:
Often we have to deal with huge data, and our machine learning model may take a long time to run and give us the required outcome. For such scenarios Smarten provides an option to use sampled data: it builds a sample that resembles the full data, which may give us results close to those we would have obtained on the full data. In our current scenario, however, we can feed the full data to the model by selecting the radio button as follows:
Smarten Insight - Sample vs Full Data Mode Option
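The idea behind running on a sample can be sketched with a simple random sample drawn without replacement. How Smarten actually builds its sample is not stated in this article, so the sampling method, the 10% fraction, the synthetic values and the function name `draw_sample` are all assumptions.

```python
import random

# Synthetic charges for illustration only: 1,000 made-up values.
rng = random.Random(42)  # fixed seed so the run is repeatable
full_data = [rng.uniform(1_000, 50_000) for _ in range(1_000)]


def draw_sample(values, fraction, seed=0):
    """Return a simple random sample (without replacement) of the given fraction."""
    size = max(1, int(len(values) * fraction))
    return random.Random(seed).sample(values, size)


sample = draw_sample(full_data, fraction=0.1)
full_mean = sum(full_data) / len(full_data)
sample_mean = sum(sample) / len(sample)
print(f"full mean: {full_mean:.0f}, sample mean: {sample_mean:.0f}")
```

Comparing a summary statistic such as the mean on the sample and on the full data is a quick way to judge whether the sample resembles the whole.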
5.5. Want to run regression on entire dataset option:
Often users want to analyze filtered data. For instance, what if we wanted to find the medical charges specifically for people who smoke? In that case, we need to select the smoker attribute with the value yes, keeping all the predictors as they are. For this kind of filtering, Smarten provides an option to run regression on the entire dataset or on a filtered dataset as follows:
Smarten Insight - Apply Filter upon Input data attributes
If we have no such filtering condition in mind, we can continue with the yes option for ‘Do you want to run regression on entire dataset?’.
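Outside Smarten, the same kind of filter amounts to keeping only the rows that match a condition before the model is fitted. The records and the helper `filter_rows` in this small sketch are hypothetical.

```python
# Made-up records for illustration; the column names mirror those mentioned above.
records = [
    {"age": 19, "smoker": "yes", "charges": 16800.0},
    {"age": 18, "smoker": "no", "charges": 1700.0},
    {"age": 46, "smoker": "yes", "charges": 38000.0},
    {"age": 33, "smoker": "no", "charges": 4400.0},
]


def filter_rows(rows, column, value):
    """Keep only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


smokers = filter_rows(records, "smoker", "yes")
mean_smoker_charges = sum(r["charges"] for r in smokers) / len(smokers)
print(len(smokers), mean_smoker_charges)
```

All predictor columns stay in place. Only the set of rows passed on to the regression changes.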
After completing the steps above, click on the next button, which lets Smarten pick the best-fit regression algorithm for our dataset using machine learning.
Smarten Insight - Recommendation of Best fit Regression Algorithm for chosen dataset by Smarten
4. So that was the recipe, Now let’s taste it!
On clicking the next button, Smarten leads us to the results and interpretations for our dataset without much effort on our part!
Smarten Insight - Multiple Linear Regression
This clearly suggests that multiple linear regression is the apt machine learning algorithm for our dataset, and the happy face in the interpretation indicates that the model is quite accurate in its predictions.
5. Smarten brings boring flat data to life using Visualizations!
For multiple linear regression, Smarten displays the following 3 visualizations to give clear visual interpretations of the chosen data.
- Line Fit Plot
- Normal Probability Plot
- Residual vs Fit plot
5.1. Line Fit Plot:
This plot shows the relationship between each predictor and the target variable, so that we can see whether they are linearly related. It is a good visualization whenever we want to know whether increasing or decreasing a predictor leads to an increase or decrease in the target value, or whether no such pattern is present in our data. We can select the predictor to visualize from the drop-down provided, as in the figure below:
Smarten Insight - Line Fit Plot for Regression Analysis
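The straight line in a line fit plot can be thought of as a least-squares fit of the target on a single predictor. The following minimal sketch computes that line. The ages and charges are made up and chosen to be exactly linear, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up, exactly linear data for illustration.
ages = [20, 30, 40, 50, 60]
charges = [3000.0, 5500.0, 8000.0, 10500.0, 13000.0]

slope, intercept = fit_line(ages, charges)
fitted = [intercept + slope * a for a in ages]
print(f"slope={slope:.1f}, intercept={intercept:.1f}")
```

A positive slope means the target tends to rise with the predictor and a negative slope means it tends to fall. A slope near zero suggests no linear pattern.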
5.2. Normal Probability Plot:
This plot is used to check the assumption of normality, which in regression diagnostics usually concerns the residuals. A trend line is added so that we can check whether the points fall along a straight line. Points lying close to a straight diagonal line suggest normally distributed data. If the points curve away from the line on the left or right, the data is not normally distributed. From the plot below, we can conclude that our data does not satisfy the normality assumption, owing to its skew away from the normality line.
Smarten Insight - Normal Probability Plot for Regression Analysis
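A normal probability plot pairs the sorted values with the quantiles a normal distribution would produce. The sketch below builds those pairs and uses their correlation as a rough measure of how straight the plot is. The plotting positions `(i + 0.5) / n`, the two small samples and the function names are assumptions for illustration, not Smarten's method.

```python
from math import sqrt
from statistics import NormalDist


def normal_plot_points(values):
    """Pair each sorted value with its theoretical standard-normal quantile."""
    n = len(values)
    quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(quantiles, sorted(values)))


def straightness(points):
    """Pearson correlation of the plotted points; near 1 means nearly a straight line."""
    xs, ys = zip(*points)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in points)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Two made-up samples: one roughly symmetric, one strongly right-skewed.
symmetric = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 30.0]

symmetric_score = straightness(normal_plot_points(symmetric))
skewed_score = straightness(normal_plot_points(skewed))
print(f"symmetric: {symmetric_score:.3f}, skewed: {skewed_score:.3f}")
```

The skewed sample scores noticeably lower because its largest value sits far from where a normal distribution would place it, which is the same departure from the line that the plot shows visually.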
5.3. Residual vs Fit plot:
This scatter plot is used to detect unequal residual variances and outliers in the data. A point that lies far outside the overall pattern is a potential outlier; it should be investigated, and removing it when it turns out to be erroneous can give a better-fitting model. For a good fit, the points should be scattered randomly across all four quadrants. From the figure below, we can interpret that the data is stable and there are not many outliers, owing to the even distribution of the points among the quadrants.
Smarten Insight - Residual Versus Fit Plot for Regression Analysis
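The quantities behind a residual versus fit plot are easy to compute by hand: a residual is the actual value minus the fitted value. In the sketch below, the fitted and actual values are made up, and the rule of flagging residuals more than two standard deviations from the mean residual is an assumed convention, not something Smarten is stated to use.

```python
from statistics import mean, pstdev


def residuals(actual, fitted):
    """Residual = actual value minus fitted value, row by row."""
    return [a - f for a, f in zip(actual, fitted)]


def flag_outliers(resids, threshold=2.0):
    """Indices of residuals more than `threshold` standard deviations from the mean residual."""
    centre, spread = mean(resids), pstdev(resids)
    return [i for i, r in enumerate(resids) if abs(r - centre) > threshold * spread]


# Made-up values for illustration; row 6 is deliberately far from its fitted value.
fitted = [5000.0, 6000.0, 7000.0, 8000.0, 9000.0, 10000.0, 11000.0, 12000.0]
actual = [5100.0, 5880.0, 7080.0, 7910.0, 9110.0, 9900.0, 16000.0, 11940.0]

resids = residuals(actual, fitted)
flagged = flag_outliers(resids)
print(flagged)
```

Flagged rows are candidates for a closer look. Whether to drop them depends on whether they are data errors or genuine, if unusual, patients.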
6. Every Choice we make, has an end result!
Apart from the best-fit algorithm that Smarten chooses automatically, the user is also free to try the other listed algorithms and review their results for further analysis. Users can compare the accuracy and other metrics of the different algorithms and explore further.
To analyze and draw conclusions from an algorithm other than the best-fit one, just select it in the left-side menu. For regression analysis we currently have at most two algorithms, simple linear regression and multiple linear regression, which leaves us a single alternative, so we choose simple linear regression for comparison in this scenario:
Note: Smarten makes sure that its users are aware that the default selection is the best choice if they want to save time. But if a user wants to explore other algorithms, they can select the yes option and proceed.
Smarten Insight - Option to Alter Algorithm choice
As advised by the Smarten pop-up, we cannot apply a simple linear regression algorithm to our chosen data because it contains more than a single predictor variable. Not every algorithm is applicable to every dataset: each machine learning algorithm has requirements that the data must satisfy before the analysis can be performed. Smarten shows appropriate pop-ups when our data is not suitable for the selected algorithm.
7. Simpler the Insight, more profound the Conclusion!
Out of the many features provided, Smarten Insight has 3 key sections that help us interpret our data better. They are:
- Interpretation
- Model Summary
- Apply the model
7.1. Interpretation:
The interpretation section gives users in-depth insight into the dataset in easy-to-understand language.
Smarten Insight - Interpretation Section
We can interpret which attributes are significant for our target (i.e. medical charges), as well as the type and size of their impact on the target. For illustration, here we can see that a person's age is significant in predicting the medical cost. Moreover, charges increase linearly with age with a coefficient of 256.711, that is, each additional year of age is associated with an increase of about 256.711 in predicted charges when the other predictors are held fixed.
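As a small worked example of reading a regression coefficient, the snippet below multiplies the reported age coefficient by an age difference. Only the value 256.711 comes from the interpretation above. The function name is hypothetical, and the calculation assumes all other predictors stay the same.

```python
AGE_COEFFICIENT = 256.711  # value reported in the interpretation above


def expected_change_in_charges(age_difference_years):
    """Expected change in predicted charges for an age difference, other predictors fixed."""
    return AGE_COEFFICIENT * age_difference_years


print(round(expected_change_in_charges(1), 3))
print(round(expected_change_in_charges(10), 2))
```

So, under this model, two otherwise identical patients who are 10 years apart in age would differ by 256.711 × 10 = 2567.11 in predicted charges.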
7.2. Model Summary:
The model summary section contains a more technical summary of the model. It is included in Smarten Insight specifically to promote data literacy. For instance, when carrying out regression analysis, data scientists draw conclusions by interpreting the obtained R square, RMSE and other metrics. Smarten provides this feature so that the technical details can be validated against the generalized interpretations.
Smarten Insight - Model Summary Section
R square is used in regression analysis to describe the goodness of fit, as shown in the model summary. Apart from this, Smarten also reports the regression coefficient values and the significance of each attribute in terms of p-values, and attempts to explain this jargon in easy-to-understand language.
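For readers who want to see what these metrics measure, here is a minimal sketch of R square and RMSE computed from actual and predicted values using their standard definitions. The four data points are made up for illustration.

```python
from math import sqrt


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up values for illustration.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"R square: {r_squared(actual, predicted):.3f}")
print(f"RMSE: {rmse(actual, predicted):.3f}")
```

An R square close to 1 means the model explains most of the variation in the target, while RMSE expresses the typical prediction error in the target's own units (here, charges).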
7.3. Apply the model:
Using the Apply feature, users can select or enter fixed values for each variable. On pressing the Apply button, Smarten tells us the outcome (here, the amount of charges). This is especially useful when we know the values of all the other variables and want to know the corresponding target value.
Smarten Insight - Apply Model Section
As the outcome of the Apply feature, users get the predicted cost of medication for the entered values. Besides this, Smarten also shows the accuracy in terms of R square to help us validate our results.
Smarten Insight - Result of Apply Model
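Conceptually, applying a fitted linear regression model means plugging the entered values into the regression equation. In this sketch only the age coefficient (256.711) comes from the article. The intercept, the other coefficients, the patient values and the function `apply_model` are hypothetical placeholders.

```python
INTERCEPT = -10000.0      # hypothetical
COEFFICIENTS = {
    "age": 256.711,       # reported in the Interpretation section
    "bmi": 300.0,         # hypothetical
    "smoker": 20000.0,    # hypothetical (1 = smoker, 0 = non-smoker)
}


def apply_model(inputs):
    """Predicted charges = intercept + sum(coefficient * entered value)."""
    missing = set(COEFFICIENTS) - set(inputs)
    if missing:
        raise ValueError(f"missing values for: {sorted(missing)}")
    return INTERCEPT + sum(COEFFICIENTS[name] * inputs[name] for name in COEFFICIENTS)


predicted = apply_model({"age": 40, "bmi": 30.0, "smoker": 1})
print(round(predicted, 2))
```

Every predictor needs a value before a prediction can be made, which is why the sketch raises an error when one is missing.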
The simulation modelling feature helps in analyzing the model to predict how it will perform in real-world scenarios. It helps users understand under what conditions a particular outcome is obtained, and lets them make predictions in real time. For instance, if the user wants to adjust the settings of all the predictors and see the prediction update in real time, the simulation feature can be used. We will find the icon for simulation in the top right corner of the screen as follows:
Smarten Insight - Simulation Section
As is evident from the screen, unlike the Apply feature, there is a default setting for each predictor, and the outcome for those default settings is shown. We can change the settings and obtain the corresponding medical cost in real time. We also obtain the line fit plot of the target against a selected predictor (we can pick the predictor of our choice from the drop-down menu). The red bubble in the plot indicates our predicted outcome value for the settings selected in the simulation.
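The simulation idea can be sketched as a what-if sweep: start from default settings, vary one predictor, and recompute the prediction each time. Apart from the age coefficient (256.711), every number here is a hypothetical placeholder, as are the names `predict` and `simulate`.

```python
INTERCEPT = -10000.0  # hypothetical
COEFFICIENTS = {"age": 256.711, "bmi": 300.0, "smoker": 20000.0}  # only age is from the article
DEFAULTS = {"age": 40, "bmi": 30.0, "smoker": 0}  # hypothetical default settings


def predict(settings):
    """Linear prediction for one complete set of predictor settings."""
    return INTERCEPT + sum(COEFFICIENTS[k] * settings[k] for k in COEFFICIENTS)


def simulate(predictor, values, defaults=DEFAULTS):
    """Vary one predictor while the others stay at their default settings."""
    return [(v, predict({**defaults, predictor: v})) for v in values]


curve = simulate("age", range(20, 61, 10))
for age, charges in curve:
    print(age, round(charges, 2))
```

Each `(value, prediction)` pair corresponds to one position of the simulated setting. Plotting the pairs gives the kind of line on which a single highlighted point marks the currently selected settings.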
8. Ready, Set, Go!
Now that we have walked step by step through the flow of regression analysis for the Medical Cost Prediction use case in Smarten, using sample data and a sample model, we are all set to follow a similar process with any other data. Given predictors and a target that match the business requirements, we can create the corresponding model, analyze its outcomes, and apply it to a patient's data to estimate how medication cost varies with the relevant factors!
Note: This article is based on Smarten Version 5.0. This may or may not be relevant to the Smarten version you may be using.