Cardea
The first step to use Cardea is to follow the Installation instructions. Once it is installed and you have a working environment, you can start using the Cardea library in a Python console by following the steps below:
First, load the core class to work with:
```python
In [1]: from cardea import Cardea

In [2]: cardea = Cardea()
```
Second, load a dataset. By default, if no path is given, Cardea automatically loads a pre-processed version of the Kaggle dataset Medical Appointment No Shows, using the following command:
```python
In [3]: cardea.load_entityset(data='kaggle')

In [4]: cardea.es
Out[4]:
Entityset: fhir
  Entities:
    Patient [Rows: 6100, Columns: 4]
    Coding [Rows: 3, Columns: 2]
    Appointment_Participant [Rows: 6100, Columns: 2]
    Address [Rows: 81, Columns: 2]
    Observation [Rows: 110527, Columns: 3]
    Appointment [Rows: 110527, Columns: 5]
    CodeableConcept [Rows: 4, Columns: 2]
    Reference [Rows: 6100, Columns: 1]
    Identifier [Rows: 227151, Columns: 1]
  Relationships:
    Patient.address -> Address.object_id
    Appointment_Participant.actor -> Reference.identifier
    Observation.code -> CodeableConcept.object_id
    Observation.subject -> Reference.identifier
    Appointment.participant -> Appointment_Participant.object_id
    CodeableConcept.coding -> Coding.object_id
```
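As the note above implies, a path can be given instead of the 'kaggle' shortcut to load your own data. A minimal sketch, where 'path/to/fhir/csvs' is a hypothetical folder of FHIR-formatted CSV files whose names are assumed to match FHIR resource names:

```python
# Sketch: load a custom dataset instead of the Kaggle demo.
# 'path/to/fhir/csvs' is a hypothetical folder of CSV files, one per
# FHIR resource (for example Patient.csv, Appointment.csv).
cardea.load_entityset(data='path/to/fhir/csvs')

# The resulting entityset can then be inspected exactly as above.
cardea.es
```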
You can see the list of problem definitions and select one with the following commands:
```python
In [5]: cardea.list_problems()
Out[5]:
{'DiagnosisPrediction',
 'LengthOfStay',
 'MissedAppointment',
 'MortalityPrediction',
 'ProlongedLengthOfStay',
 'Readmission'}
```
From there, you can select the prediction problem you aim to solve by specifying the name of its class, which in return gives you the label_times of the problem.
```python
In [6]: label_times = cardea.select_problem('MissedAppointment')

In [7]: label_times.head()
Out[7]:
                              time  instance_id   label
5642903  2016-04-29 18:38:08+00:00      5642903  noshow
5642503  2016-04-29 16:08:27+00:00      5642503  noshow
5642549  2016-04-29 16:19:04+00:00      5642549  noshow
5642828  2016-04-29 17:29:31+00:00      5642828  noshow
5642494  2016-04-29 16:07:23+00:00      5642494  noshow
```
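Since label_times is returned as a plain pandas object (as the head() output above shows), you can sanity-check it with ordinary pandas before moving on, for example by looking at the label distribution. This is plain pandas, not part of Cardea's API:

```python
# How many appointments fall into each label class?
print(label_times['label'].value_counts())

# What time span do the cutoff times cover?
print(label_times['time'].min(), label_times['time'].max())
```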
Then, you can move on to the AutoML steps and take advantage of Cardea: automated feature engineering followed by modeling and evaluation.
Cardea extracts features through automated feature engineering; you supply the label_times pertaining to the problem you aim to solve, using the following commands:
```python
In [8]: feature_matrix = cardea.generate_features(label_times[:1000])  # a subset

In [9]: feature_matrix.head()
Out[9]:
   participant  ...   label
0   3845236363  ...  noshow
1   3210493988  ...  noshow
2    422016776  ...  noshow
3   1775770955  ...  noshow
4   3487122758  ...  noshow

[5 rows x 14 columns]
```
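The feature matrix is likewise a pandas DataFrame, so its shape and the names of the generated feature columns can be inspected with ordinary pandas before modeling (illustrative only):

```python
# Inspect the generated feature matrix (ordinary pandas).
print(feature_matrix.shape)          # roughly 1000 rows x 14 columns for the subset above
print(list(feature_matrix.columns))  # names of the automatically engineered features
```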
Once you have the features, you can split the data into training and testing sets:
```python
In [10]: y = list(feature_matrix.pop('label'))

In [11]: X = feature_matrix.values

In [12]: X_train, X_test, y_train, y_test = cardea.train_test_split(
   ....:     X, y, test_size=0.2, shuffle=True)
   ....:
```
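cardea.train_test_split is called here with the same arguments that scikit-learn's train_test_split uses, so an equivalent split can also be produced directly with scikit-learn if you prefer; a sketch of that alternative:

```python
from sklearn.model_selection import train_test_split

# Alternative: the same 80/20 shuffled split done directly with scikit-learn.
# random_state is added here only to make the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)
```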
Now that the feature matrix is properly divided, you can use it to train a machine learning pipeline. Modeling, optimizing hyperparameters, and finding the best model are done using the following commands:
```python
In [13]: cardea.select_pipeline('Random Forest')

In [14]: cardea.fit(X_train, y_train)

In [15]: y_pred = cardea.predict(X_test)
```
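Before moving to the built-in evaluation, the held-out predictions can also be examined with standard scikit-learn metrics, for instance a confusion matrix and a per-class report; this is plain scikit-learn, not part of Cardea's API:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Confusion matrix and per-class precision/recall for the held-out split.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```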
Finally, you can see the evaluation results using the following command:
```python
In [16]: cardea.evaluate(X, y, test_size=0.2, metrics=['Accuracy', 'F1 Macro'])
Out[16]:
{'Accuracy': 0.775,
 'F1 Macro': 0.5396654902562529}
```
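As a sanity check, the same two metrics can be recomputed by hand with scikit-learn on the explicit held-out split from above; the exact numbers may differ slightly from Out[16], since evaluate appears to draw its own split of X and y:

```python
from sklearn.metrics import accuracy_score, f1_score

# Recompute the reported metrics on the explicit held-out split from above.
print('Accuracy:', accuracy_score(y_test, y_pred))
print('F1 Macro:', f1_score(y_test, y_pred, average='macro'))
```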