Data Loading

Cardea uses a data loading module to plug in the user’s data and automatically organize it into the framework. It expects data in the Fast Healthcare Interoperability Resources (FHIR) format, a standard for health care data exchange published by HL7®. Among the advantages of FHIR over other standards are:

  • Fast and easy to implement

  • Specification is free for use with no restrictions

  • Strong foundation in Web standards: XML, JSON, HTTP, OAuth, etc.

  • Support for RESTful architectures

  • Concise and easily understood specifications

  • A human-readable serialization format for ease of use by developers
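
To make the JSON serialization concrete, the sketch below parses a minimal FHIR-style Patient resource with Python's standard json module. The values are invented for illustration and are not taken from the Cardea dataset; only the field names (resourceType, id, gender, birthDate, address) follow the FHIR Patient resource.

```python
import json

# A minimal, hand-written Patient resource (illustrative values only).
patient_json = """
{
  "resourceType": "Patient",
  "id": "example-1",
  "gender": "female",
  "birthDate": "1980-05-17",
  "address": [{"city": "Springfield"}]
}
"""

patient = json.loads(patient_json)

# Every FHIR resource declares its own type in "resourceType".
resource_type = patient["resourceType"]
city = patient["address"][0]["city"]
print(resource_type, patient["id"], city)
```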

By default, Cardea loads a dataset hosted on Amazon S3, a formatted version of the Kaggle dataset Medical Appointment No Shows. It also allows users to load their own datasets by providing a local path to a folder of CSV files. Both cases use the load_entityset(...) method. As an example, the following piece of code will load the default Kaggle dataset:

In [1]: from cardea import Cardea

In [2]: cardea = Cardea()

In [3]: cardea.load_entityset(data='kaggle')

Local files can be loaded using the same method, passing the local path as the data parameter:

cardea.load_entityset(data="your/local/path/")
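
Before pointing Cardea at a local folder, it can help to confirm that the folder actually contains CSV files. The sketch below uses only the standard library; list_csv_files is a hypothetical helper, not part of Cardea, and the file names and columns in the demo are made up.

```python
import tempfile
from pathlib import Path


def list_csv_files(folder):
    """Return the sorted names of the CSV files directly inside `folder`."""
    path = Path(folder)
    if not path.is_dir():
        raise NotADirectoryError(f"{folder!r} is not a directory")
    return sorted(p.name for p in path.glob("*.csv"))


# Demo on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Patient.csv").write_text("identifier,gender\n")
    (Path(tmp) / "Appointment.csv").write_text("identifier,status\n")
    (Path(tmp) / "notes.txt").write_text("not a CSV file\n")
    found = list_csv_files(tmp)

print(found)
```

If the list is not empty, the same folder can then be passed as the data argument of cardea.load_entityset, as in the call above.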

Cardea handles datasets as a collection of entities and the relationships between them, a representation that is useful for preparing raw, structured datasets for feature engineering. For this, it uses the featuretools.EntitySet class.

Using the following command, you will be able to summarize the dataset:

cardea.es
Entityset: fhir
  Entities:
    Address [Rows: 81, Columns: 2]
    Appointment_Participant [Rows: 6100, Columns: 2]
    Appointment [Rows: 110527, Columns: 5]
    CodeableConcept [Rows: 4, Columns: 2]
    Coding [Rows: 3, Columns: 2]
    Identifier [Rows: 227151, Columns: 1]
    Observation [Rows: 110527, Columns: 3]
    Patient [Rows: 6100, Columns: 4]
    Reference [Rows: 6100, Columns: 1]
  Relationships:
    Appointment_Participant.actor -> Reference.identifier
    Appointment.participant -> Appointment_Participant.object_id
    CodeableConcept.coding -> Coding.object_id
    Observation.code -> CodeableConcept.object_id
    Observation.subject -> Reference.identifier
    Patient.address -> Address.object_id

In this case, the summary shows the resources that were loaded into the framework (Entities section) and the relationships between those resources (Relationships section).
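
Each line of the Relationships section links a column of one entity to a column of another. As a rough illustration of that reading, the sketch below parses a few lines copied from the summary above into tuples with plain Python. It works on the printed text only and does not use the featuretools API; parse_relationship is a hypothetical helper.

```python
from collections import defaultdict

# Lines copied from the Relationships section of the summary above.
summary_lines = [
    "Appointment.participant -> Appointment_Participant.object_id",
    "Appointment_Participant.actor -> Reference.identifier",
    "Patient.address -> Address.object_id",
]


def parse_relationship(line):
    """Split 'A.x -> B.y' into (('A', 'x'), ('B', 'y'))."""
    left, right = (side.strip() for side in line.split("->"))
    return tuple(left.split(".", 1)), tuple(right.split(".", 1))


# Map each entity to the entities its columns point to.
links = defaultdict(list)
for line in summary_lines:
    (src_entity, _), (dst_entity, _) = parse_relationship(line)
    links[src_entity].append(dst_entity)

print(dict(links))
```

Following the links from Appointment leads to Appointment_Participant and then to Reference, which is how an appointment is tied to the reference for its participant in this dataset.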