Quickstart

The first step is to follow the Installation instructions. Once Cardea is installed and your environment is set up, you can start using the library from a Python console by following these steps:
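
If Cardea is not installed yet, it is typically available from PyPI; the one-line install below is a sketch, and the Installation page remains the authoritative reference:

    pip install cardea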

First, load the core class to work with:

In [1]: from cardea import Cardea

In [2]: cardea = Cardea()

Second, load a dataset. By default, if no path is given, Cardea loads a pre-processed version of the Kaggle dataset Medical Appointment No Shows, using the following command (loading your own data is sketched after the entityset output below):

In [3]: cardea.load_entityset(data='kaggle')

In [4]: cardea.es
Out[4]: 
Entityset: fhir
  Entities:
    Patient [Rows: 6100, Columns: 4]
    Coding [Rows: 3, Columns: 2]
    Appointment_Participant [Rows: 6100, Columns: 2]
    Address [Rows: 81, Columns: 2]
    Observation [Rows: 110527, Columns: 3]
    Appointment [Rows: 110527, Columns: 5]
    CodeableConcept [Rows: 4, Columns: 2]
    Reference [Rows: 6100, Columns: 1]
    Identifier [Rows: 227151, Columns: 1]
  Relationships:
    Patient.address -> Address.object_id
    Appointment_Participant.actor -> Reference.identifier
    Observation.code -> CodeableConcept.object_id
    Observation.subject -> Reference.identifier
    Appointment.participant -> Appointment_Participant.object_id
    CodeableConcept.coding -> Coding.object_id
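
The data argument can also point to your own records instead of the bundled Kaggle dataset, as the "if no path is given" note above implies. The folder below is hypothetical, and the expected layout (FHIR-formatted CSV files) is an assumption; check the data loading documentation for the exact requirements:

    # Hypothetical path; a sketch of loading local data instead of 'kaggle'
    cardea.load_entityset(data='path/to/local/csvs')
    cardea.es  # inspect the resulting entityset, as above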

You can see the list of problem definitions and select one with the following commands:

In [5]: cardea.list_problems()
Out[5]: 
{'DiagnosisPrediction',
 'LengthOfStay',
 'MissedAppointment',
 'MortalityPrediction',
 'ProlongedLengthOfStay',
 'Readmission'}

From there, you can select the prediction problem you aim to solve by specifying the name of its class, which in return gives you the label_times of the problem.

In [6]: label_times = cardea.select_problem('MissedAppointment')

In [7]: label_times.head()
Out[7]: 
                             time  instance_id   label
5642903 2016-04-29 18:38:08+00:00      5642903  noshow
5642503 2016-04-29 16:08:27+00:00      5642503  noshow
5642549 2016-04-29 16:19:04+00:00      5642549  noshow
5642828 2016-04-29 17:29:31+00:00      5642828  noshow
5642494 2016-04-29 16:07:23+00:00      5642494  noshow
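
label_times prints like a pandas DataFrame; assuming it supports the DataFrame API, you can check the class balance of the target before generating features:

    # Assumption: label_times behaves like a pandas DataFrame
    label_times['label'].value_counts()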

Then, you can perform the AutoML steps and take advantage of Cardea.

Cardea extracts features through automated feature engineering: supply the label_times pertaining to the problem you aim to solve, using the following commands:

In [8]: feature_matrix = cardea.generate_features(label_times[:1000])  # a subset

In [9]: feature_matrix.head()
Out[9]: 
   participant  ...   label
0   3845236363  ...  noshow
1   3210493988  ...  noshow
2    422016776  ...  noshow
3   1775770955  ...  noshow
4   3487122758  ...  noshow

[5 rows x 14 columns]
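
Feature engineering can take a while on the full dataset, so it may be worth persisting the result; the feature matrix is a pandas DataFrame and the filename below is arbitrary:

    # Optional: save the engineered features so they can be reloaded
    # later with pandas.read_csv instead of recomputing them.
    feature_matrix.to_csv('feature_matrix.csv', index=False)  # filename is arbitrary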

Once we have the features, we can split the data into training and testing sets:

In [10]: y = list(feature_matrix.pop('label'))

In [11]: X = feature_matrix.values

In [12]: X_train, X_test, y_train, y_test = cardea.train_test_split(
   ....:     X, y, test_size=0.2, shuffle=True)
   ....: 
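
A quick sanity check confirms the split; with the 1000-row subset above and test_size=0.2, the sizes should be roughly 800 and 200:

    print(len(X_train), len(X_test))  # roughly 800 and 200 for this subset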

Now that we have our feature matrix properly divided, we can use it to train our machine learning pipeline. Modeling, optimizing hyperparameters, and finding the best model are done using the following commands:

In [13]: cardea.select_pipeline('Random Forest')

In [14]: cardea.fit(X_train, y_train)

In [15]: y_pred = cardea.predict(X_test)
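
Before computing aggregate metrics, you may want a per-class breakdown of the predictions; the snippet below assumes scikit-learn is available in the same environment:

    # Assumption: scikit-learn is installed alongside Cardea
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))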

Finally, you can see the evaluation results using the following command:

In [16]: cardea.evaluate(X, y, test_size=0.2, metrics=['Accuracy', 'F1 Macro'])
Out[16]: {'Accuracy': 0.775, 'F1 Macro': 0.5396654902562529}