mlblocks¶

MLBlocks top module.

MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by seamlessly combining tools from any Python library with a simple, common and uniform interface.

Free software: MIT license
Documentation: https://MLBazaar.github.io/MLBlocks

Classes:

`MLBlock`(primitive, **kwargs)	MLBlock Class.
`MLPipeline`([pipeline, primitives, …])	MLPipeline Class.

Functions:

`add_pipelines_path`(path)	Add a new path to look for pipelines.
`add_primitives_path`(path)	Add a new path to look for primitives.
`find_pipelines`([pattern, filters])	Find pipelines by name and filters.
`find_primitives`([pattern, filters])	Find primitives by name and filters.
`get_pipelines_paths`()	Get the list of folders where pipelines will be looked for.
`get_primitives_paths`()	Get the list of folders where primitives will be looked for.
`load_pipeline`(name)	Locate and load the pipeline JSON annotation.
`load_primitive`(name)	Locate and load the primitive JSON annotation.

class mlblocks.MLBlock(primitive, **kwargs)[source]¶

MLBlock Class.

The MLBlock class represents a single step within an MLPipeline.

It is responsible for loading and interpreting JSON primitives, as well as wrapping them and providing a common interface to run them.

name¶

Primitive name.

Type: str

metadata¶

Additional information about this primitive

Type: dict

primitive¶

the actual function or instance which this MLBlock wraps.

Type: object

fit_args¶

specification of the arguments expected by the fit method.

Type: dict

fit_method¶

name of the primitive method to call on fit. None if the primitive is a function.

Type: str

produce_args¶

specification of the arguments expected by the produce method.

Type: dict

produce_output¶

specification of the outputs of the produce method.

Type: dict

produce_method¶

name of the primitive method to call on produce. None if the primitive is a function.

Type: str

Parameters

primitive (str or dict) – primitive name or primitive dictionary.
**kwargs – Any additional arguments that will be used as hyperparameters or passed to the fit or produce methods.

Raises

TypeError – A TypeError is raised if a required argument is not found within the kwargs or if an unexpected argument has been given.

Methods:

`fit`(**kwargs)	Call the fit method of the primitive.
`get_hyperparameters`()	Get the hyperparameter values that the current MLBlock is using.
`get_tunable_hyperparameters`()	Get the hyperparameters that can be tuned for this MLBlock.
`produce`(**kwargs)	Call the primitive function, or the predict method of the primitive.
`set_hyperparameters`(hyperparameters)	Set new hyperparameters.

fit(**kwargs)[source]¶

Call the fit method of the primitive.

The given keyword arguments will be passed directly to the fit method of the primitive instance specified in the JSON annotation.

If any of the arguments expected by the fit method had been given during the MLBlock initialization, they will be passed as well.

If the fit method was not specified in the JSON annotation, or if the primitive is a simple function, this will be a noop.

Parameters: **kwargs – Any given keyword argument will be directly passed to the primitive fit method.
Raises: TypeError – A TypeError might be raised if any argument not expected by the primitive fit method is given.

get_hyperparameters()[source]¶

Get the hyperparameter values that the current MLBlock is using.

Returns: the dictionary containing the hyperparameter values that the MLBlock is currently using.
Return type: dict

get_tunable_hyperparameters()[source]¶

Get the hyperparameters that can be tuned for this MLBlock.

The list of hyperparameters is taken from the JSON annotation, filtering out any hyperparameter for which a value has been given during the initialization.

Returns: the dictionary containing the hyperparameters that can be tuned, their types and, if applicable, the accepted ranges or values.
Return type: dict
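
The filtering described above can be pictured with plain dictionaries. This is only an illustrative sketch, not the MLBlock implementation: the annotation content, the hyperparameter names and the `range` key are made up for the example.

```python
# Hypothetical tunable hyperparameters, as they might appear in a JSON annotation.
annotation_tunable = {
    "n_estimators": {"type": "int", "default": 10, "range": [1, 500]},
    "criterion": {"type": "str", "default": "gini", "values": ["gini", "entropy"]},
}

# Hypothetical keyword arguments given when the block was created.
init_kwargs = {"criterion": "entropy"}

# Hyperparameters fixed at initialization are no longer offered for tuning.
tunable = {
    name: spec
    for name, spec in annotation_tunable.items()
    if name not in init_kwargs
}
```

Here only `n_estimators` remains tunable, because `criterion` was fixed when the block was created.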

produce(**kwargs)[source]¶

Call the primitive function, or the predict method of the primitive.

The given keyword arguments will be passed directly to the primitive, if it is a simple function, or to the produce method of the primitive instance specified in the JSON annotation, if it is a class.

If any of the arguments expected by the produce method had been given during the MLBlock initialization, they will be passed as well.

Returns: The output of the call to the primitive function or primitive produce method.
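
To make the difference between function primitives and class primitives more concrete, here is a minimal sketch of a wrapper that offers the same `fit` and `produce` calls for both. It does not use MLBlocks itself: `BlockSketch`, `MeanCenterer` and `clip` are hypothetical names invented for this example, and the real MLBlock reads this information from the JSON annotation instead of from constructor arguments.

```python
import inspect


class BlockSketch:
    """Toy wrapper giving functions and classes the same fit/produce interface."""

    def __init__(self, primitive, fit_method=None, produce_method=None, **hyperparameters):
        self.hyperparameters = hyperparameters
        self.fit_method = fit_method
        self.produce_method = produce_method
        self.is_class = inspect.isclass(primitive)
        # Classes are instantiated with the hyperparameters; functions are kept as they are.
        self.instance = primitive(**hyperparameters) if self.is_class else primitive

    def fit(self, **kwargs):
        # A no-op for plain functions or when no fit method was declared.
        if self.is_class and self.fit_method:
            getattr(self.instance, self.fit_method)(**kwargs)

    def produce(self, **kwargs):
        if self.is_class:
            return getattr(self.instance, self.produce_method)(**kwargs)
        # Functions receive the hyperparameters together with the call arguments.
        return self.instance(**self.hyperparameters, **kwargs)


class MeanCenterer:
    """Hypothetical class primitive: subtracts the mean seen during fit."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)

    def transform(self, X):
        return [x - self.mean_ for x in X]


def clip(X, low, high):
    """Hypothetical function primitive: limits the values to [low, high]."""
    return [min(max(x, low), high) for x in X]


class_block = BlockSketch(MeanCenterer, fit_method="fit", produce_method="transform")
class_block.fit(X=[1, 2, 3])
centered = class_block.produce(X=[4])

function_block = BlockSketch(clip, low=0, high=1)
function_block.fit(X=[1, 2, 3])  # nothing happens for a function
clipped = function_block.produce(X=[-5, 0.5, 9])
```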

set_hyperparameters(hyperparameters)[source]¶

Set new hyperparameters.

Only the specified hyperparameters are modified, so any other hyperparameter keeps the value that had been previously given.

If necessary, a new instance of the primitive is created.

Parameters: hyperparameters (dict) – Dictionary containing as keys the name of the hyperparameters and as values the values to be used.
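
The partial update can be sketched with plain dictionaries. The function name and the hyperparameter names are hypothetical, and the re-creation of the primitive instance is left out.

```python
def merge_hyperparameters(current, new):
    """Return a copy of `current` where only the keys present in `new` change."""
    merged = dict(current)
    merged.update(new)
    return merged


current = {"n_estimators": 10, "criterion": "gini"}
updated = merge_hyperparameters(current, {"n_estimators": 50})
```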

class mlblocks.MLPipeline(pipeline=None, primitives=None, init_params=None, input_names=None, output_names=None, outputs=None, verbose=True)[source]¶

MLPipeline Class.

The MLPipeline class represents a Machine Learning Pipeline, which is an ordered collection of Machine Learning tools or Primitives, represented by MLBlock instances, that will be fitted and then used sequentially in order to produce results.

The MLPipeline has two working modes or phases: fitting and predicting.

During the fitting phase, each MLBlock instance, or block, will be fitted and immediately afterwards used to produce results on the same fitting data. These results will then be passed to the next block of the sequence as its fitting data, and this process will be repeated until the last block is fitted.

During the predicting phase, each block will be used to produce results on the output of the previous one, until the last one has produced its results, which will be returned as the prediction of the pipeline.
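
The two phases can be illustrated with a small, self-contained sketch. It does not import MLBlocks: `Scaler`, `Thresholder`, `fit_pipeline` and `predict_pipeline` are hypothetical stand-ins that only mimic the order in which blocks are fitted and used.

```python
class Scaler:
    """Hypothetical block: learns the maximum and divides by it."""

    def fit(self, X):
        self.max_ = max(X)

    def produce(self, X):
        return [x / self.max_ for x in X]


class Thresholder:
    """Hypothetical block: learns the mean and flags the values above it."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)

    def produce(self, X):
        return [x > self.mean_ for x in X]


def fit_pipeline(blocks, X):
    # Fitting phase: fit each block, then use it on the same data
    # so that the next block is fitted on the transformed data.
    for block in blocks:
        block.fit(X)
        X = block.produce(X)


def predict_pipeline(blocks, X):
    # Predicting phase: only produce, passing each output to the next block.
    for block in blocks:
        X = block.produce(X)
    return X


blocks = [Scaler(), Thresholder()]
fit_pipeline(blocks, [1.0, 2.0, 4.0])
predictions = predict_pipeline(blocks, [0.5, 4.0])
```

Note that the second block is fitted on the scaled values, not on the raw input, which is the behavior the fitting phase describes.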

primitives¶

List of the names of the primitives that compose this pipeline.

Type: list

blocks¶

OrderedDict of the block names and the corresponding MLBlock instances.

Type: OrderedDict

init_params¶

init_params dictionary, as given when the instance was created.

Type: dict

input_names¶

input_names dictionary, as given when the instance was created.

Type: dict

output_names¶

output_names dictionary, as given when the instance was created.

Type: dict

Parameters

pipeline (str, list, dict or MLPipeline) –
The pipeline argument accepts four different types with different interpretations:
- str: the name of the pipeline to search and load.
- list: the primitives list.
- dict: a complete pipeline specification.
- MLPipeline: another pipeline to be cloned.
primitives (list) – List with the names of the primitives that will compose this pipeline.
init_params (dict) – dictionary containing initialization arguments to be passed when creating the MLBlock instances. The dictionary keys must be the corresponding primitive names and the values must be another dictionary that will be passed as **kwargs to the MLBlock instance.
input_names (dict) – dictionary that maps input variable names with the actual names expected by each primitive. This allows reusing the same input argument for multiple primitives that name it differently, as well as passing different values to primitives that expect arguments named similarly.
output_names (dict) – dictionary that maps output variable names with the name these variables will be given when stored in the context dictionary. This allows storing the output of different primitives in different variables, even if the primitive output name is the same one.
outputs (dict) – dictionary containing lists of output variables associated to a name.
verbose (bool) – whether to log the exceptions that occur when running the pipeline before raising them or not.

Methods:

`fit`([X, y, output_, start_, debug])	Fit the blocks of this pipeline.
`from_dict`(metadata)	Create a new MLPipeline from a dict specification.
`get_diagram`([fit, outputs, image_path])	Creates a png diagram for the pipeline, showing Pipeline Steps, Pipeline Inputs and Outputs, and block inputs and outputs.
`get_hyperparameters`([flat])	Get the current hyperparameters of each block.
`get_inputs`([fit])	Get a relation of all the input variables required by this pipeline.
`get_output_names`([outputs])	Get the names of the outputs that correspond to the given specification.
`get_output_variables`([outputs])	Get the list of variable specifications of the given outputs.
`get_outputs`([outputs])	Get the list of output variables that correspond to the specified outputs.
`get_tunable_hyperparameters`([flat])	Get the tunable hyperparameters of each block.
`load`(path)	Create a new MLPipeline from a JSON specification.
`predict`([X, output_, start_, debug])	Produce predictions using the blocks of this pipeline.
`save`(path)	Save the specification of this MLPipeline in a JSON file.
`set_hyperparameters`(hyperparameters)	Set new hyperparameter values for some blocks.
`to_dict`()	Return all the details of this MLPipeline in a dict.

fit(X=None, y=None, output_=None, start_=None, debug=False, **kwargs)[source]¶

Fit the blocks of this pipeline.

Sequentially call the fit and the produce methods of each block, capturing the outputs of each produce method before calling the fit method of the next one.

During the whole process a context dictionary is built, where both the passed arguments and the captured outputs of the produce methods are stored, and from which the arguments for the next fit and produce calls will be taken.
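
The role of the context dictionary can be sketched as follows. This is a simplified illustration rather than the MLPipeline code: `run_block`, `scale` and `add_labels` are hypothetical helpers, and the argument and output names are passed explicitly instead of being read from a JSON annotation.

```python
def run_block(context, func, arg_names, output_names):
    """Take the arguments from the context, call func and store the outputs back."""
    kwargs = {name: context[name] for name in arg_names}
    outputs = func(**kwargs)
    if len(output_names) == 1:
        outputs = (outputs,)
    context.update(zip(output_names, outputs))


def scale(X, factor):
    return [x * factor for x in X]


def add_labels(X, y):
    return [(x, label) for x, label in zip(X, y)]


# The context starts with the arguments passed by the user, including extra kwargs.
context = {"X": [1, 2, 3], "y": ["a", "b", "c"], "factor": 10}

run_block(context, scale, ["X", "factor"], ["X"])    # overwrites X
run_block(context, add_labels, ["X", "y"], ["pairs"])  # reads the new X
```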

Parameters

X – Fit Data, which the pipeline will learn from.
y – Fit Data labels, which the pipeline will use to learn how to behave.
output_ (str or int or list or None) – Output specification, as required by get_outputs. If None is given, nothing will be returned.
start_ (str or int or None) – Block index or block name to start processing from. The value can either be an integer, which will be interpreted as a block index, or the name of a block, including the counter number at the end. If given, the execution of the pipeline will start on the specified block, and all the blocks before that one will be skipped.
debug (bool or str) –
Debug a pipeline with the following options:
- t:
  Elapsed time for the primitive and the given stage (fit or predict).
- m:
  Amount of memory increase (or decrease) for the primitive. This amount is represented in bytes.
- i:
  The input values that the primitive takes for that step.
- o:
  The output values that the primitive generates.
If provided, return a dictionary with the fit and predict performance. This argument can be a string containing a combination of the letters listed above, or True, which returns the complete debug information.
**kwargs – Any additional keyword arguments will be directly added to the context dictionary and available for the blocks.

Returns

If no output is specified, nothing will be returned.
If output_ has been specified, either a single value or a tuple of values will be returned.

Return type

None or dict or object
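
As a rough illustration of how the debug letters select what is recorded, the sketch below wraps a single function call. It is not the MLPipeline implementation: `run_with_debug`, `total` and the keys of the returned dictionary are hypothetical, and the memory option (`m`) is accepted but not measured here.

```python
import time


def run_with_debug(func, kwargs, debug):
    """Call func(**kwargs) and collect the details selected by the debug letters."""
    # True selects every option; a string selects the letters it contains.
    options = set("tmio") if debug is True else set(debug or "")
    info = {}
    if "i" in options:
        info["input"] = dict(kwargs)

    start = time.perf_counter()
    output = func(**kwargs)
    if "t" in options:
        info["time"] = time.perf_counter() - start

    if "o" in options:
        info["output"] = output

    return output, info


def total(values):
    return sum(values)


output, info = run_with_debug(total, {"values": [1, 2, 3]}, "to")
_, full_info = run_with_debug(total, {"values": [1, 2, 3]}, True)
```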

classmethod from_dict(metadata)[source]¶

Create a new MLPipeline from a dict specification.

The dict structure is the same as the one created by the to_dict method.

Parameters: metadata (dict) – Dictionary containing the pipeline specification.
Returns: A new MLPipeline instance with the details found in the given specification dictionary.
Return type: MLPipeline

get_diagram(fit=True, outputs='default', image_path=None)[source]¶

Creates a png diagram for the pipeline, showing Pipeline Steps, Pipeline Inputs and Outputs, and block inputs and outputs.

If strings are given, they can either be one of the named outputs that have been specified on the pipeline definition or a full variable specification following the format {block-name}.{variable-name}.

Parameters

fit (bool) – Optional argument to include fit arguments or not. Defaults to True.
outputs (str, int, or list[str or int]) – Single or list of output specifications.
image_path (str) – Optional argument for the location at which to save the file. Defaults to None, which returns a graphviz.Digraph object instead of saving the file.

Returns

A graphviz.Digraph object containing the information about the pipeline diagram.

Return type

None or graphviz.Digraph object

get_hyperparameters(flat=False)[source]¶

Get the current hyperparameters of each block.

Parameters: flat (bool) – If True, return a flattened dictionary where each key is a two-element tuple containing the name of the block as the first element and the name of the hyperparameter as the second one. If False (default), return a dictionary where each key is the name of a block and each value is a dictionary containing the complete hyperparameter specification of that block.
Returns: A dictionary containing the block names as keys and the current block hyperparameters dictionary as values.
Return type: dict
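
The relation between the nested and the flattened formats can be sketched with a plain dictionary. The `flatten` helper is hypothetical and the block and hyperparameter names are placeholders; the snippet only illustrates the key layout described for the flat argument.

```python
def flatten(nested):
    """Turn {block: {name: value}} into {(block, name): value}."""
    return {
        (block_name, name): value
        for block_name, hyperparameters in nested.items()
        for name, value in hyperparameters.items()
    }


nested = {
    "a_primitive#1": {
        "an_argument": "a_value",
        "another_argument": "another_value",
    },
    "another_primitive#1": {
        "yet_another_argument": "yet_another_value",
    },
}

flat = flatten(nested)
```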

get_inputs(fit=True)[source]¶

Get a relation of all the input variables required by this pipeline.

The result is a list containing all of the input variables. Optionally include the fit arguments.

Parameters: fit (bool) – Optional argument to include fit arguments or not. Defaults to True.
Returns: List of dictionaries specifying all the input variables. Each dictionary contains the entry name, as well as any other metadata that may have been included in the pipeline inputs specification.
Return type: list

get_output_names(outputs='default')[source]¶

Get the names of the outputs that correspond to the given specification.

The indicated outputs will be resolved and the names of the output variables will be returned as a single list.

Parameters

outputs (str, int or list[str or int]) – Single or list of output specifications.

Returns

List of variable names

Return type

list

Raises

ValueError – If an output specification is not valid.
TypeError – If the type of a specification is not an str or an int.

get_output_variables(outputs='default')[source]¶

Get the list of variable specifications of the given outputs.

The indicated outputs will be resolved and their variables specifications will be returned as a single list.

Parameters

outputs (str, int or list[str or int]) – Single or list of output specifications.

Returns

List of variable specifications.

Return type

list

Raises

ValueError – If an output specification is not valid.
TypeError – If the type of a specification is not an str or an int.

get_outputs(outputs='default')[source]¶

Get the list of output variables that correspond to the specified outputs.

Outputs specification can either be a single string, a single integer, or a list of strings and integers.

If strings are given, they can either be one of the named outputs that have been specified on the pipeline definition or the name of a block, including the counter number at the end, or a full variable specification following the format {block-name}.{variable-name}.

Alternatively, integers can be passed as indexes of the blocks from which to get the outputs.

If output specifications that resolve to multiple output variables are given, such as the named outputs or block names, all the variables are concatenated together, in order, in a single variable list.

Parameters

outputs (str, int or list[str or int]) – Single or list of output specifications.

Returns

List of dictionaries specifying all the output variables. Each dictionary contains the entries name and variable, as well as any other metadata that may have been included in the pipeline outputs or block produce outputs specification.

Return type

list

Raises

ValueError – If an output specification is not valid.
TypeError – If the type of a specification is not an str or an int.
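
The resolution rules above can be approximated with the following sketch. It is an assumption-laden illustration and not the actual implementation: `resolve_outputs` is a hypothetical helper, the pipeline is reduced to two plain dictionaries, and the block and variable names are invented.

```python
def resolve_outputs(outputs, named_outputs, block_outputs):
    """Resolve output specifications into a single list of variable dicts.

    named_outputs maps an output name to a list of variable dicts.
    block_outputs maps each block name, in pipeline order, to the names
    of the variables that the block produces.
    """
    if not isinstance(outputs, (list, tuple)):
        outputs = [outputs]

    block_names = list(block_outputs)
    resolved = []
    for spec in outputs:
        if not isinstance(spec, (str, int)):
            raise TypeError("Invalid output specification type: {!r}".format(spec))

        if isinstance(spec, int):
            # Integers are interpreted as block indexes.
            if not 0 <= spec < len(block_names):
                raise ValueError("Invalid block index: {}".format(spec))
            spec = block_names[spec]

        if spec in named_outputs:
            resolved.extend(named_outputs[spec])
        elif spec in block_outputs:
            # A block name resolves to all the variables of that block.
            resolved.extend(
                {"name": name, "variable": "{}.{}".format(spec, name)}
                for name in block_outputs[spec]
            )
        elif "." in spec:
            # Full specification: {block-name}.{variable-name}
            block_name, name = spec.rsplit(".", 1)
            if name not in block_outputs.get(block_name, []):
                raise ValueError("Invalid variable specification: {}".format(spec))
            resolved.append({"name": name, "variable": spec})
        else:
            raise ValueError("Invalid output specification: {}".format(spec))

    return resolved


named_outputs = {"default": [{"name": "y", "variable": "classifier#1.y"}]}
block_outputs = {"scaler#1": ["X"], "classifier#1": ["y", "probabilities"]}

default = resolve_outputs("default", named_outputs, block_outputs)
mixed = resolve_outputs([0, "classifier#1.probabilities"], named_outputs, block_outputs)
```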

get_tunable_hyperparameters(flat=False)[source]¶

Get the tunable hyperparameters of each block.

Parameters: flat (bool) – If True, return a flattened dictionary where each key is a two-element tuple containing the name of the block as the first element and the name of the hyperparameter as the second one. If False (default), return a dictionary where each key is the name of a block and each value is a dictionary containing the complete hyperparameter specification of that block.
Returns: A dictionary containing the block names as keys and the block tunable hyperparameters dictionary as values.
Return type: dict

classmethod load(path)[source]¶

Create a new MLPipeline from a JSON specification.

The JSON file format is the same as the one created by the to_dict method.

Parameters: path (str) – Path of the JSON file to load.
Returns: A new MLPipeline instance with the specification found in the JSON file.
Return type: MLPipeline

predict(X=None, output_='default', start_=None, debug=False, **kwargs)[source]¶

Produce predictions using the blocks of this pipeline.

Sequentially call the produce method of each block, capturing the outputs before calling the next one.

During the whole process a context dictionary is built, where both the passed arguments and the captured outputs of the produce methods are stored, and from which the arguments for the next produce calls will be taken.

Parameters

X – Data which the pipeline will use to make predictions.
output_ (str or int or list or None) – Output specification, as required by get_outputs. If not specified the default output will be returned.
start_ (str or int or None) – Block index or block name to start processing from. The value can either be an integer, which will be interpreted as a block index, or the name of a block, including the counter number at the end. If given, the execution of the pipeline will start on the specified block, and all the blocks before that one will be skipped.
debug (bool or str) –
Debug a pipeline with the following options:
- t:
  Elapsed time for the primitive and the given stage (fit or predict).
- m:
  Amount of memory increase (or decrease) for the primitive. This amount is represented in bytes.
- i:
  The input values that the primitive takes for that step.
- o:
  The output values that the primitive generates.
If True then a dictionary will be returned containing all the elements listed previously. If a string value with the combination of letters is given for each option, it will return a dictionary with the selected elements.
**kwargs – Any additional keyword arguments will be directly added to the context dictionary and available for the blocks.

Returns

If a single output is requested, it is returned alone.
If multiple outputs have been requested, a tuple is returned.
If debug is given, a tuple will be returned where the first element is the predictions and the second is a dictionary containing the debug information.

Return type

object or tuple

save(path)[source]¶

Save the specification of this MLPipeline in a JSON file.

The content of the JSON file is the dict returned by the to_dict method.

Parameters: path (str) – Path to the JSON file to write.

set_hyperparameters(hyperparameters)[source]¶

Set new hyperparameter values for some blocks.

Parameters: hyperparameters (dict) – A dictionary containing the block names as keys and the new hyperparameters dictionary as values.

to_dict()[source]¶

Return all the details of this MLPipeline in a dict.

The dict structure contains all the __init__ arguments of the MLPipeline, as well as the current hyperparameter values and the specification of the tunable_hyperparameters:

{
    'primitives': [
        'a_primitive',
        'another_primitive'
    ],
    'init_params': {
        'a_primitive': {
            'an_argument': 'a_value'
        }
    },
    'hyperparameters': {
        'a_primitive#1': {
            'an_argument': 'a_value',
            'another_argument': 'another_value',
        },
        'another_primitive#1': {
            'yet_another_argument': 'yet_another_value'
        }
    },
    'tunable_hyperparameters': {
        'another_primitive#1': {
            'yet_another_argument': {
                'type': 'str',
                'default': 'a_default_value',
                'values': [
                    'a_default_value',
                    'yet_another_value'
                ]
            }
        }
    }
}
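
Since save writes this dict to a JSON file, a structure of this shape has to survive a JSON round trip. The sketch below checks that with the standard json module only; it does not call MLPipeline.save or MLPipeline.load, and the values are the placeholders from the example above.

```python
import json

spec = {
    "primitives": ["a_primitive", "another_primitive"],
    "init_params": {"a_primitive": {"an_argument": "a_value"}},
    "hyperparameters": {
        "a_primitive#1": {
            "an_argument": "a_value",
            "another_argument": "another_value",
        },
        "another_primitive#1": {"yet_another_argument": "yet_another_value"},
    },
    "tunable_hyperparameters": {
        "another_primitive#1": {
            "yet_another_argument": {
                "type": "str",
                "default": "a_default_value",
                "values": ["a_default_value", "yet_another_value"],
            }
        }
    },
}

# Serialize to JSON text and parse it back, as a file round trip would do.
serialized = json.dumps(spec, indent=4)
restored = json.loads(serialized)
```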

mlblocks.add_pipelines_path(path)[source]¶

Add a new path to look for pipelines.

The new path will be inserted in the first place of the list, so any pipeline found in this new folder will take precedence over any other pipeline with the same name that existed in the system before.

Parameters: path (str) – path to add
Raises: ValueError – A ValueError will be raised if the path is not valid.

mlblocks.add_primitives_path(path)[source]¶

Add a new path to look for primitives.

The new path will be inserted in the first place of the list, so any primitive found in this new folder will take precedence over any other primitive with the same name that existed in the system before.

Parameters: path (str) – path to add
Raises: ValueError – A ValueError will be raised if the path is not valid.

mlblocks.find_pipelines(pattern='', filters=None)[source]¶

Find pipelines by name and filters.

If a pattern is given, only the pipelines whose name matches the pattern will be returned.

If filters are given, they should be a dictionary containing key/value filters that will have to be matched within the pipeline annotation for it to be included in the results.

If the given key is not found but it contains dots, split by the dots and consider each part a sublevel in the annotation.

If the key value within the annotation is a list or a dict, check whether any of the given values is contained within it instead of checking for equality.

Parameters

pattern (str) – Regular expression to match against the pipeline names.
filters (dict) – Dictionary containing the filters to apply over the matching pipelines.

Returns

Names of the matching pipelines.

Return type

list
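
The filter rules (dotted keys as sublevels, containment for lists and dicts) can be sketched as below. This is a hypothetical re-creation written for illustration, not the library code: `lookup`, `matches` and the annotation content are invented, and the name pattern matching is left out.

```python
def lookup(annotation, key):
    """Get a value by key, treating dots as sublevels if the key is not found."""
    if key in annotation:
        return annotation[key]

    value = annotation
    for part in key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]

    return value


def matches(annotation, filters):
    """Check whether an annotation satisfies all the key/value filters."""
    for key, expected in filters.items():
        actual = lookup(annotation, key)
        if isinstance(actual, (list, dict)):
            # Containment check: any of the given values must be inside.
            expected_values = expected if isinstance(expected, list) else [expected]
            if not any(value in actual for value in expected_values):
                return False
        elif actual != expected:
            return False

    return True


# Hypothetical annotation content.
annotation = {
    "name": "my.pipeline",
    "metadata": {
        "data_type": "tabular",
        "tasks": ["classification", "regression"],
    },
}

by_sublevel = matches(annotation, {"metadata.data_type": "tabular"})
by_containment = matches(annotation, {"metadata.tasks": "classification"})
no_match = matches(annotation, {"metadata.tasks": ["clustering"]})
```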

mlblocks.find_primitives(pattern='', filters=None)[source]¶

Find primitives by name and filters.

If a pattern is given, only the primitives whose name matches the pattern will be returned.

If filters are given, they should be a dictionary containing key/value filters that will have to be matched within the primitive annotation for it to be included in the results.

If the given key is not found but it contains dots, split by the dots and consider each part a sublevel in the annotation.

If the key value within the annotation is a list or a dict, check whether any of the given values is contained within it instead of checking for equality.

Parameters

pattern (str) – Regular expression to match against the primitive names.
filters (dict) – Dictionary containing the filters to apply over the matching primitives.

Returns

Names of the matching primitives.

Return type

list

mlblocks.get_pipelines_paths()[source]¶

Get the list of folders where pipelines will be looked for.

This list will include the values of all the entry points named pipelines published under the entry point group mlblocks.

An example of such an entry point would be:

entry_points = {
    'mlblocks': [
        'pipelines=some_module:SOME_VARIABLE'
    ]
}

where the module some_module contains a variable such as:

SOME_VARIABLE = os.path.join(os.path.dirname(__file__), 'jsons')

Returns: The list of folders.
Return type: list

mlblocks.get_primitives_paths()[source]¶

Get the list of folders where primitives will be looked for.

This list will include the values of all the entry points named primitives published under the entry point group mlblocks.

Also, for backwards compatibility reasons, the paths from the entry points named jsons_path published under the mlprimitives group will also be included.

An example of such an entry point would be:

entry_points = {
    'mlblocks': [
        'primitives=some_module:SOME_VARIABLE'
    ]
}

where the module some_module contains a variable such as:

SOME_VARIABLE = os.path.join(os.path.dirname(__file__), 'jsons')

Returns: The list of folders.
Return type: list

mlblocks.load_pipeline(name)[source]¶

Locate and load the pipeline JSON annotation.

All the pipeline paths will be scanned to find a JSON file with the given name, and as soon as a JSON with the given name is found it is returned.

Parameters: name (str) – Path to a JSON file or name of the JSON to look for without the .json extension.
Returns: The content of the JSON annotation file loaded into a dict.
Return type: dict
Raises: ValueError – A ValueError will be raised if the pipeline cannot be found.
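
The lookup order, together with the precedence given to newly added paths, can be sketched with temporary folders. The helpers `add_path` and `load_annotation` are hypothetical and only approximate the documented behavior; they are not the MLBlocks functions.

```python
import json
import os
import tempfile


def add_path(paths, path):
    """Insert `path` first so that its JSON files take precedence."""
    if not os.path.isdir(path):
        raise ValueError("Invalid path: {}".format(path))
    if path not in paths:
        paths.insert(0, path)


def load_annotation(name, paths):
    """Return the first JSON found, either at `name` itself or inside `paths`."""
    if os.path.isfile(name):
        candidates = [name]
    else:
        candidates = [os.path.join(path, name + ".json") for path in paths]

    for candidate in candidates:
        if os.path.isfile(candidate):
            with open(candidate) as json_file:
                return json.load(json_file)

    raise ValueError("Unknown annotation: {}".format(name))


with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    # Two folders contain a JSON file with the same name but different content.
    for folder, origin in [(old_dir, "old"), (new_dir, "new")]:
        with open(os.path.join(folder, "my_pipeline.json"), "w") as json_file:
            json.dump({"origin": origin}, json_file)

    paths = [old_dir]
    before = load_annotation("my_pipeline", paths)

    add_path(paths, new_dir)  # new_dir is now scanned first
    after = load_annotation("my_pipeline", paths)
```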

mlblocks.load_primitive(name)[source]¶

Locate and load the primitive JSON annotation.

All the primitive paths will be scanned to find a JSON file with the given name, and as soon as a JSON with the given name is found it is returned.

Parameters: name (str) – Path to a JSON file or name of the JSON to look for without the .json extension.
Returns: The content of the JSON annotation file loaded into a dict.
Return type: dict
Raises: ValueError – A ValueError will be raised if the primitive cannot be found.