Machine Learning Tasks

Problem Definition is the component that formulates the task for Machine Learning models. It involves identifying or generating two main concepts: the target variable and the cutoff times.

The first step in working with Cardea is therefore to define a Machine Learning Task (or to use one of the predefined tasks). For example, Missed Appointment is a common task that aims to predict whether a patient will show up for an appointment, helping hospitals optimize their scheduling policies and use their resources efficiently.

Outcome to predict

Continuing the previous example, the Missed Appointment task is currently defined in the system as a binary classification task: it determines, from the point at which the appointment is scheduled, whether the patient will show up for it.

Usually, the outcome is defined over the FHIR data schema, using the resource id values for references between instances.
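As a minimal sketch of what "defined over the schema using id references" can look like, the snippet below joins two small tables by id and derives a binary outcome with pandas. The tables, column names, and status values are hypothetical and only loosely modeled on FHIR resources; this is not how Cardea computes the outcome internally.

```python
import pandas as pd

# Hypothetical patient and appointment records. The column names are
# assumptions for this sketch, loosely modeled on FHIR resources.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "gender": ["female", "male", "female"],
})
appointments = pd.DataFrame({
    "appointment_id": [101, 102, 103, 104],
    "patient_id": [1, 2, 1, 3],  # reference to patients by resource id
    "status": ["fulfilled", "noshow", "noshow", "fulfilled"],
})

# Resolve the id reference to attach patient attributes to each appointment.
joined = appointments.merge(patients, on="patient_id", how="left")

# Binary target: True when the appointment was missed.
joined["missed"] = joined["status"].eq("noshow")

print(joined[["appointment_id", "patient_id", "gender", "missed"]])
```

Each appointment row carries the id of the patient it refers to, so the outcome can be computed per appointment while still having access to patient-level attributes.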

Cutoff times and Labels

As stated above, the Problem Definition step depends on two main concepts: the target variable and the cutoff times. The target variable defines the model output; Cardea generates it automatically if it does not already exist in the dataset. Cutoff times, on the other hand, split the data so that events before the cutoff time are used for training and events after it are used for testing. The following code shows the format of these values for the Missed Appointment task:

In [1]: from cardea import Cardea

In [2]: cardea = Cardea()

In [3]: cardea.load_entityset(data='kaggle')

In [4]: cardea.select_problem('MissedAppointment')
Out[4]: 
                             time  instance_id   label
5642903 2016-04-29 18:38:08+00:00      5642903  noshow
5642503 2016-04-29 16:08:27+00:00      5642503  noshow
5642549 2016-04-29 16:19:04+00:00      5642549  noshow
5642828 2016-04-29 17:29:31+00:00      5642828  noshow
5642494 2016-04-29 16:07:23+00:00      5642494  noshow
...                           ...          ...     ...
5651768 2016-05-03 09:15:35+00:00      5651768  noshow
5650093 2016-05-03 07:27:33+00:00      5650093  noshow
5630692 2016-04-27 16:03:52+00:00      5630692  noshow
5630323 2016-04-27 15:09:23+00:00      5630323  noshow
5629448 2016-04-27 13:30:56+00:00      5629448  noshow

[110527 rows x 3 columns]
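To illustrate the time-based split described above, here is a minimal pandas sketch over a table in the same three-column format (time, instance_id, label). The rows, the label values, and the split timestamp are all hypothetical, and the code does not call any Cardea API.

```python
import pandas as pd

# Hypothetical label table in the same format as the output above.
label_times = pd.DataFrame({
    "time": pd.to_datetime(
        [
            "2016-04-27 13:30:00",
            "2016-04-29 16:00:00",
            "2016-04-29 18:30:00",
            "2016-05-03 09:15:00",
        ],
        utc=True,
    ),
    "instance_id": [1, 2, 3, 4],
    "label": ["noshow", "fulfilled", "noshow", "fulfilled"],
})

# Assumed split point: rows strictly before it go to training,
# rows at or after it go to testing.
split_time = pd.Timestamp("2016-05-01", tz="UTC")

train = label_times[label_times["time"] < split_time]
test = label_times[label_times["time"] >= split_time]

print(len(train), "training rows,", len(test), "testing rows")
```

Splitting on time rather than at random keeps information from later events out of the training data.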

Current Prediction Problems

Cardea encapsulates six prediction problems for users to explore easily:

Diagnosis Prediction: Predicts whether a patient will be diagnosed with a specified diagnosis.
Length of Stay: Predicts how many days the patient will be in the hospital.
Missed Appointment: Predicts whether the patient will show up for the appointment.
Mortality Prediction: Predicts whether a patient will die.
Prolonged Length of Stay: Predicts whether a patient's hospital stay is longer or shorter than a given period of time (a week by default).
Readmission: Predicts whether a patient will revisit the hospital within a certain period of time (a month by default).
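As a rough illustration of how a threshold turns a regression target into a classification target, the sketch below computes a length of stay in days and a prolonged-stay flag with pandas. The table, its column names, and the strict "greater than 7 days" comparison are assumptions made for this example, not Cardea's implementation.

```python
import pandas as pd

# Hypothetical hospital stays; column names are assumptions for this sketch.
stays = pd.DataFrame({
    "encounter_id": [1, 2, 3],
    "start": pd.to_datetime(["2020-01-01", "2020-01-05", "2020-02-01"]),
    "end": pd.to_datetime(["2020-01-03", "2020-01-20", "2020-02-08"]),
})

# Length of Stay: number of whole days between admission and discharge.
stays["length_of_stay"] = (stays["end"] - stays["start"]).dt.days

# Prolonged Length of Stay: assumed here to mean strictly more than a week.
threshold_days = 7
stays["prolonged"] = stays["length_of_stay"] > threshold_days

print(stays[["encounter_id", "length_of_stay", "prolonged"]])
```

How a stay of exactly the threshold length is classified depends on the comparison chosen; the strict inequality here is only one option.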

You can list the available problems with the list_problems(...) method, for example:

In [5]: from cardea import Cardea

In [6]: cardea = Cardea()

In [7]: cardea.list_problems()
Out[7]: 
{'DiagnosisPrediction',
 'LengthOfStay',
 'MissedAppointment',
 'MortalityPrediction',
 'ProlongedLengthOfStay',
 'Readmission'}
