An open source project from the Data to AI Lab at MIT.

MIT-D3M-TA2¶

MIT-Featuretools TA2 submission for the D3M program.

Free software: MIT license
Development Status: Pre-Alpha
Documentation: https://HDI-Project.github.io/mit-d3m-ta2

Overview¶

This repository contains the TA2 submission for the Data Driven Discovery of Models (D3M) DARPA program developed by the DAI-Lab and Featuretools teams.

Install¶

Requirements¶

mit-d3m-ta2 has been developed and tested on Python 3.6.

Although it is not strictly required, using a virtualenv is highly recommended to avoid interfering with other software installed on the system where mit-d3m-ta2 runs.

These are the minimum commands needed to create a virtualenv using python3.6 for mit-d3m-ta2:

pip install virtualenv
virtualenv -p $(which python3.6) mit-d3m-ta2-venv

Afterwards, execute this command to activate the virtualenv:

source mit-d3m-ta2-venv/bin/activate

Remember to execute it every time you start a new console to work on mit-d3m-ta2!
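
As a quick sanity check before installing, the following sketch reports whether the current interpreter is running inside a virtual environment and whether it is a Python 3.6 interpreter. The function names are illustrative and are not part of mit-d3m-ta2.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def is_tested_python(version_info=None):
    """Return True if the version is 3.6, the one the project was tested on."""
    major, minor = tuple(version_info or sys.version_info)[:2]
    return (major, minor) == (3, 6)


print("virtualenv active:", in_virtualenv())
print("Python 3.6:", is_tested_python())
```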

Install the latest release¶

In order to install mit-d3m-ta2, you will have to clone the repository and check out its stable branch:

git clone git@github.com:HDI-Project/mit-d3m-ta2.git
cd mit-d3m-ta2
git checkout stable

Once done, make sure that you have created and activated your virtualenv, and then simply execute:

make install

Install for Development¶

If you want to contribute to the project, a few more steps are required to make the project ready for development.

First, please head to the GitHub page of the project and fork it under your own username by clicking the fork button in the upper right corner of the page.

Afterwards, clone your fork and create a branch from master with a descriptive name that includes the number of the issue that you are going to work on:

git clone git@github.com:{your username}/mit-d3m-ta2.git
cd mit-d3m-ta2
git branch issue-xx-cool-new-feature master
git checkout issue-xx-cool-new-feature
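
To keep branch names consistent with the issue-xx-cool-new-feature pattern, a small helper along these lines could derive the name from an issue number and a short title. The helper and the issue number used in the example are hypothetical and are not part of the project tooling.

```python
import re


def issue_branch_name(issue_number, title):
    """Build a branch name such as 'issue-12-cool-new-feature'."""
    # Lowercase the title and collapse every run of characters that is
    # not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return "issue-{}-{}".format(issue_number, slug)


print(issue_branch_name(12, "Cool new feature"))
```

The resulting string can then be passed to git branch and git checkout exactly as in the commands above.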

Finally, install the project with the following command, which will install some additional dependencies for code linting and testing.

make install-develop

Make sure to use them regularly while developing by running the commands make lint and make test.

Additional Dependencies¶

Additional dependencies required to execute some of the TA1 primitives have been left out from the command above in order to keep maximum compatibility with the different types of systems and avoid dependency conflicts.

Because of this, some datasets, including timeseries and image data modalities, might not work properly.

In order to make them work, install the additional dependencies and download additional files with the following commands:

sudo apt-get install $(cat system_requirements.txt)
pip install -r devel_requirements.txt
mkdir -p static
python -m d3m.index download -o static

And keep in mind the following considerations:

The command line script ta2 explained in the usage section below will stop working and will need to be replaced with python -m ta2 in all the examples.
Some red warnings might show in the command line indicating that incompatible versions have been installed. These warnings can be safely ignored, as their only consequence is the previous point.
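
For scripts that need to work in both situations, the switch between the ta2 script and python -m ta2 can be sketched as below. The function name and the script_works parameter are assumptions made for illustration; the sketch cannot detect a broken script on its own, so the caller states whether the additional dependencies were installed.

```python
import shutil
import sys


def ta2_base_command(script_works=True):
    """Return the argv prefix used to invoke the ta2 command line interface."""
    # After installing the additional dependencies the ta2 script is
    # expected to stop working, so callers pass script_works=False and
    # get the `python -m ta2` form instead.
    if script_works and shutil.which("ta2"):
        return ["ta2"]
    return [sys.executable, "-m", "ta2"]


print(ta2_base_command(script_works=False) + ["test", "--help"])
```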

Data Format¶

mit-d3m-ta2 runs on datasets in the D3M Format.

Datasets Collection¶

You can find a collection of datasets in the D3M format in the d3m-data-dai S3 Bucket in AWS, including the corresponding TRAIN, TEST and SCORE partitions following the schema specification.

More datasets in newer versions of the schema can also be found in the private datasets repository.

D3M Seed Datasets¶

Our TA2 system is regularly evaluated over the collection of Seed Datasets found in the private datasets repository.

As specified in the README file from this repository, you will need git-lfs in order to download all the included files.

Note that the complete collection of seed datasets is around 60 GB in size, so the recommended approach is to download only those parts of the repository that will be used, following the instructions in the Partial Downloading section.

Once downloaded, the local testing commands can be used passing the seed_datasets_current root folder path to the --input option.

Example: --input /path/to/d3m/datasets/repo/seed_datasets_current
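
After a partial download it can be useful to check which dataset folders are actually complete before pointing --input at them. The sketch below assumes that each dataset is a direct subfolder of the input folder and contains TRAIN, TEST and SCORE directories; that layout is an assumption based on the partitions mentioned above, and the function name is illustrative.

```python
import os

# Partition folders expected inside each dataset (assumed layout).
PARTITIONS = ("TRAIN", "TEST", "SCORE")


def list_datasets(input_path):
    """Map each dataset folder under input_path to its missing partitions."""
    report = {}
    for name in sorted(os.listdir(input_path)):
        dataset_dir = os.path.join(input_path, name)
        if not os.path.isdir(dataset_dir):
            continue  # skip README files and other non-folder entries
        missing = [
            part for part in PARTITIONS
            if not os.path.isdir(os.path.join(dataset_dir, part))
        ]
        report[name] = missing
    return report
```

A dataset whose list of missing partitions is empty has all three folders in place.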

Leaderboard¶

The following leaderboard has been built using the TA2 Standalone Mode with 2 as the maximum number of tuning iterations to perform (budget) and 30 as the maximum time allowed for the tuning (timeout):

dataset	template	cv_score	test_score	elapsed_time	tuning_iterations	data_modality	task_type
30_personae	gradient_boosting_classification.all_hp.yml	0.728894	0.619048	5.93087	2	single_table	classification
57_hypothyroid	gradient_boosting_classification.all_hp.yml	0.862681	0.981003	38.6418	2	single_table	classification
185_baseball	gradient_boosting_classification.all_hp.yml	0.646959	0.675132	17.3313	2	single_table	classification
313_spectrometer	gradient_boosting_classification.all_hp.yml	0.281409	0.304201	45.3676	2	single_table	classification
27_wordLevels	gradient_boosting_classification.all_hp.yml	0.268882	0.288937	169.197	2	single_table	classification
1491_one_hundred_plants_margin	gradient_boosting_classification.all_hp.yml	0.00957403	0.451364	114.561	2	single_table	classification

This table can also be downloaded as a CSV file.
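
One simple way to read the leaderboard is to compare the cross-validation score with the test score of each dataset. The sketch below does this for three rows copied from the table above, using only the dataset, cv_score and test_score columns. The tab delimiter matches the table as shown here; the delimiter of the downloadable CSV file is an assumption and can be passed explicitly.

```python
import csv
import io

# Three rows copied from the leaderboard above; only the columns used here.
SAMPLE = "\n".join([
    "dataset\tcv_score\ttest_score",
    "30_personae\t0.728894\t0.619048",
    "57_hypothyroid\t0.862681\t0.981003",
    "1491_one_hundred_plants_margin\t0.00957403\t0.451364",
])


def score_gaps(text, delimiter="\t"):
    """Return (dataset, test_score - cv_score) pairs, largest gap first."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    gaps = [
        (row["dataset"], float(row["test_score"]) - float(row["cv_score"]))
        for row in reader
    ]
    # Sort by the absolute size of the gap, in descending order.
    return sorted(gaps, key=lambda pair: abs(pair[1]), reverse=True)


for dataset, gap in score_gaps(SAMPLE):
    print("{:<32} {:+.4f}".format(dataset, gap))
```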

Usage¶

Local Testing¶

Two scripts are included in the repository for local testing:

TA2 Standalone Mode¶

The TA2 Standalone mode can be executed locally using the ta2 command line interface.

To use this, run the ta2 test command passing one or more dataset names as positional arguments, as well as either a budget, -b, or a timeout, -t.

For example, in order to process the datasets 185_baseball and 196_autoMpg for 60 seconds each, the following command would be used:

ta2 test -t60 185_baseball 196_autoMpg

This will search for and tune the best possible pipeline for each dataset for a maximum of 60 seconds and, at the end, print a table with all the results on stdout.

Additionally, the following options can be passed:

-i INPUT_PATH: Path to the folder where the datasets can be found. Defaults to input.
-o OUTPUT_PATH: Path to the folder where the output pipelines will be saved. Defaults to output.
-b BUDGET: Maximum number of tuning iterations to perform.
-t TIMEOUT: Maximum allowed time for the tuning, in seconds.
-a, --all: Process all the datasets found in the input folder.
-v, --verbose: Set logs to INFO level. Use it twice to increase verbosity to DEBUG.
-r CSV_PATH: Store the results in the indicated CSV file instead of printing them on stdout.
-s STATIC_PATH: Path to a directory with static files required by primitives. Defaults to static.

For a full description of the options, execute ta2 test --help.
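
When launching several runs from a script, the options listed above can be assembled programmatically. The following is a minimal sketch that only builds the argument list and does not execute anything. The function and parameter names are hypothetical, and it assumes that at least one of -b or -t must be given and that the parser accepts -t 60 written as two tokens in the same way as -t60.

```python
import shlex


def build_ta2_test_command(datasets=(), timeout=None, budget=None,
                           input_path=None, output_path=None,
                           csv_path=None, static_path=None,
                           all_datasets=False, verbosity=0):
    """Assemble the argv list for `ta2 test` from the documented options."""
    if timeout is None and budget is None:
        raise ValueError("pass either a budget (-b) or a timeout (-t)")
    if not datasets and not all_datasets:
        raise ValueError("pass dataset names or set all_datasets=True")

    command = ["ta2", "test"]
    # Options that take a value are only added when one was provided.
    for flag, value in (("-i", input_path), ("-o", output_path),
                        ("-b", budget), ("-t", timeout),
                        ("-r", csv_path), ("-s", static_path)):
        if value is not None:
            command.extend([flag, str(value)])
    if all_datasets:
        command.append("-a")
    command.extend(["-v"] * verbosity)  # -v once for INFO, twice for DEBUG
    command.extend(datasets)
    return command


command = build_ta2_test_command(["185_baseball", "196_autoMpg"], timeout=60)
print(" ".join(shlex.quote(part) for part in command))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting issues with dataset paths.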

TA2-TA3 Server Mode¶

The TA2-TA3 API mode can be executed using the ta2 server command, together with any of its optional named arguments.

This will start a ta2 server in the background ready to serve requests from a ta3 client.

ta2 server

For a full description of the script options, execute ta2 server --help.

TA2-TA3 Test¶

In order to test the TA2-TA3 Server, a convenience ta3 command line interface has been included, which allows testing one or more datasets by issuing a predefined sequence of calls to the TA2-TA3 Server.

To use it, run the ta2 ta3 command passing one or more dataset names as positional arguments, as well as any of the optional arguments.

For example, in order to process the datasets 185_baseball and 196_autoMpg for 60 seconds each, the following command would be used:

ta2 ta3 -t60 185_baseball 196_autoMpg

NOTE: In order to be able to execute this command, a ta2 server process must already be running on the same machine.

This will start sending requests to the TA2-TA3 Server to search for and tune the best possible pipeline for each dataset for a maximum of 60 seconds.

For a full description of the script options, execute ta2 ta3 --help.

Also remember that a TA2-TA3 Server must be running when you execute this script!
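
A script can verify that the server is reachable before launching the test client. The sketch below assumes that the server accepts TCP connections; the port is not stated in this document, so the host and port are left as parameters to be filled in with the values used by your ta2 server. The function name is illustrative.

```python
import socket


def server_is_listening(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        # create_connection raises OSError when the connection is
        # refused or times out.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only confirms that a process is listening on the given port, not that the process is a ta2 server.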

Docker Usage¶

In order to run the TA2-TA3 server from Docker, you first have to build the image and execute the run_docker.sh script. After that, in a different console, you can run the ta3 script with the --docker flag, which adapts the input paths accordingly:

make build
./run_docker.sh

And, in a different terminal:

ta2 ta3 --docker <OPTIONS>

Submission¶

The submission steps are defined here: https://datadrivendiscovery.org/wiki/display/gov/Submission+Procedure+for+TA2

In our case, the submission steps consist of:

Execute the make submit command locally. This will build the docker image and push it to the gitlab registry.
Copy the kubernetes/ta2.yaml file to the Jump Server and execute the validation command /performer-toolbox/d3m_runner/d3m_runner.py --yaml-file ta2.yaml --mode ta2 --debug
If successful, copy the ta2.yaml file over to the submission repository folder and commit/push it.

For winter-2019 evaluation, the submission repository was https://gitlab.datadrivendiscovery.org/ta2-submissions/ta2-mit/may2019
