Clinicians often produce large amounts of data, from patient metrics to drug component analysis. Classical statistical analysis offers a view of interactions within the data, but machine learning can often surface additional patterns. With the recent boom in artificial intelligence models, clinicians are increasingly interested in applying machine learning to their data, yet many lack the knowledge and skills to train models and run inference effectively. Fortunately, by combining AutoML techniques with a user-friendly web interface, we can give clinicians a way to automatically train many different machine learning models on their tabular data and identify which performs best. We therefore present CLASSify as a way for clinicians to bridge the gap to artificial intelligence.
Even with a web interface and clear results and visualizations for each model, it can be difficult to interpret how a model reached its predictions or what they mean for the data itself. The interface therefore also provides explainability scores for each feature, indicating its contribution to the model's predictions. With these scores, users can see exactly how each column of the data affects the model and may gain new insights into the data itself.
Finally, CLASSify also provides tools for synthetic data generation. Clinical datasets frequently have imbalanced class labels or protected information, which necessitates synthetically generated data that follows the same patterns and trends as the real data. With this interface, users can generate entirely new datasets, bolster existing data with synthetic examples to balance class labels, or fill missing values with appropriate data.
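CLASSify's generation method is not detailed here; as a minimal illustration of the class-balancing use case only, the sketch below equalizes class counts by randomly duplicating minority-class rows (a crude stand-in for true synthetic generators such as SMOTE or GAN-based methods). All function names are hypothetical.

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Balance a dataset by duplicating minority-class rows until every
    class matches the size of the largest class. This is only a stand-in
    for real synthetic generation, which produces new (not copied) rows."""
    rng = random.Random(seed)
    by_class = {}
    for row, y in zip(rows, labels):
        by_class.setdefault(y, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(list(row))  # copy so duplicates are independent
            out_labels.append(y)
    return out_rows, out_labels
```

A real generator would also preserve inter-feature correlations, which simple duplication cannot add.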
First, the user uploads their tabular dataset, which must be a CSV file. The CSV must follow certain formatting rules: it must contain columns named 'index' and 'class', and it must not contain string values (other than TRUE/FALSE), so categorical variables must be encoded accordingly. If the dataset follows these rules, it is uploaded to the website.
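The formatting rules above can be satisfied with a short preprocessing step before upload. The sketch below, using only the standard library, renames a user-chosen label column to 'class', prepends an 'index' column, and integer-encodes any string-valued columns; the function names and the example columns are hypothetical, not part of CLASSify.

```python
import csv
import io

def _is_numeric_or_bool(value: str) -> bool:
    """TRUE/FALSE and numeric strings are allowed by the upload rules."""
    if value in ("TRUE", "FALSE"):
        return True
    try:
        float(value)
        return True
    except ValueError:
        return False

def prepare_csv(raw_csv: str, label_column: str) -> str:
    """Rename the label column to 'class', add an 'index' column, and
    integer-encode categorical (string) columns so the CSV follows the
    upload rules described above."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    rows = list(reader)
    fields = list(reader.fieldnames)

    # Build an integer code book for each column that holds strings.
    codes = {}
    for field in fields:
        if field == label_column:
            continue
        values = [r[field] for r in rows]
        if any(not _is_numeric_or_bool(v) for v in values):
            codes[field] = {v: i for i, v in enumerate(sorted(set(values)))}

    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["index"] + ["class" if f == label_column else f for f in fields])
    for i, r in enumerate(rows):
        writer.writerow([i] + [codes[f][r[f]] if f in codes else r[f] for f in fields])
    return out.getvalue()
```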
The user can then view their data, which is uploaded to ClearML, the platform used to perform training and host models.
Users are then taken to a page that lists all datasets they have uploaded. Here, the 'Prepare Dataset' option is the next step toward training. This is also the page where users can view their results once training has completed.
The preparation page offers many options for parameter values and customization of the training process. Important choices include which models to train, whether the classification task is binary or multiclass, whether to evaluate on a separate test dataset, and whether to generate synthetic data. Once the options are chosen, training can begin.
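The options just described could be represented as a simple configuration object. The sketch below is a hypothetical illustration only; CLASSify's actual parameter names and validation logic are not specified in this document.

```python
# Hypothetical representation of the preparation options described above;
# the key names here are illustrative, not CLASSify's real parameters.
training_config = {
    "models": ["random_forest", "xgboost", "logistic_regression"],
    "task": "binary",            # or "multiclass"
    "separate_test_set": False,  # evaluate on a user-supplied test CSV?
    "generate_synthetic": True,  # balance class labels before training?
}

def validate_config(cfg: dict) -> list:
    """Return a list of problems; an empty list means the config is usable."""
    problems = []
    if not cfg.get("models"):
        problems.append("choose at least one model")
    if cfg.get("task") not in ("binary", "multiclass"):
        problems.append("task must be 'binary' or 'multiclass'")
    return problems
```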
On the results page, users can view various performance metrics for each model or feature combination, depending on the parameters chosen, along with the explainability scores for each feature. Users can also download any synthetic data that was generated or view output logs of the training process.
Users can also view visualizations. Below is one example, which displays the performance of each model across various metrics. This can be used to compare the models and determine which is most appropriate, depending on which metrics matter most for the user's goals.
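The "best" model depends on which metric the user prioritizes, and two metrics can disagree. As a minimal, self-contained sketch (not CLASSify's scoring code), the example below computes accuracy and F1 for hypothetical model predictions and picks the winner under a chosen metric.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_model(results, metric):
    """results maps model name -> (y_true, y_pred); pick best by one metric."""
    return max(results, key=lambda name: metric(*results[name]))
```

For imbalanced clinical data, F1 (or recall) is often more informative than raw accuracy, which is why comparing models across several metrics matters.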
These visualizations may also include graphs that display the explainability scores for different features. In the example below, the top feature, 'y.Adm.Cr', was determined to be the most important feature to the model. From the color and position of the data points, the user can see that lower feature values have a negative impact on the model's predictions (pushing the model to predict closer to 0 than 1, for binary classification), while higher feature values have a strong positive impact (pushing predictions closer to 1).
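The reading described above (low feature values push predictions toward 0, high values toward 1) is how SHAP-style attribution plots are commonly interpreted. As a hedged sketch, assuming hypothetical per-sample attribution scores rather than CLASSify's actual explainability output, the direction of a feature's effect can be summarized from the sign of the covariance between its raw values and its attributions:

```python
def direction_of_effect(feature_values, attributions):
    """Summarize whether higher values of one feature push predictions
    toward the positive class, using the sign of the covariance between
    the raw feature values and their per-sample attribution scores."""
    n = len(feature_values)
    mean_x = sum(feature_values) / n
    mean_a = sum(attributions) / n
    cov = sum((x - mean_x) * (a - mean_a)
              for x, a in zip(feature_values, attributions)) / n
    if cov > 0:
        return "higher values push toward the positive class"
    if cov < 0:
        return "higher values push toward the negative class"
    return "no consistent direction"
```

A feature like 'y.Adm.Cr' in the example would fall in the first case: large values carry positive attributions and small values carry negative ones.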