mlblocks
MLBlocks top module.
MLBlocks is a simple framework for composing end-to-end tunable Machine Learning Pipelines by seamlessly combining tools from any python library with a simple, common and uniform interface.
Free software: MIT license
Documentation: https://MLBazaar.github.io/MLBlocks
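Classes:
MLBlock(primitive, **kwargs) – MLBlock Class.
MLPipeline([pipeline, primitives, …]) – MLPipeline Class.
Functions:
add_pipelines_path(path) – Add a new path to look for pipelines.
add_primitives_path(path) – Add a new path to look for primitives.
find_pipelines([pattern, filters]) – Find pipelines by name and filters.
find_primitives([pattern, filters]) – Find primitives by name and filters.
get_pipelines_paths() – Get the list of folders where pipelines will be looked for.
get_primitives_paths() – Get the list of folders where primitives will be looked for.
load_pipeline(name) – Locate and load the pipeline JSON annotation.
load_primitive(name) – Locate and load the primitive JSON annotation.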
-
class mlblocks.MLBlock(primitive, **kwargs)
MLBlock Class.
The MLBlock class represents a single step within an MLPipeline.
It is responsible for loading and interpreting JSON primitives, as well as wrapping them and providing a common interface to run them.
-
name
Primitive name.
- Type
str
-
metadata
Additional information about this primitive.
- Type
dict
-
primitive
The actual function or instance which this MLBlock wraps.
- Type
object
-
fit_args
Specification of the arguments expected by the fit method.
- Type
dict
-
fit_method
Name of the primitive method to call on fit. None if the primitive is a function.
- Type
str
-
produce_args
Specification of the arguments expected by the produce method.
- Type
dict
-
produce_output
Specification of the outputs of the produce method.
- Type
dict
-
produce_method
Name of the primitive method to call on produce. None if the primitive is a function.
- Type
str
- Parameters
primitive (str or dict) – primitive name or primitive dictionary.
**kwargs – Any additional arguments that will be used as hyperparameters or passed to the fit or produce methods.
- Raises
TypeError – A TypeError is raised if a required argument is not found within the kwargs or if an unexpected argument has been given.
Methods:
fit(**kwargs) – Call the fit method of the primitive.
get_hyperparameters() – Get the hyperparameter values that the current MLBlock is using.
get_tunable_hyperparameters() – Get the hyperparameters that can be tuned for this MLBlock.
produce(**kwargs) – Call the primitive function, or the predict method of the primitive.
set_hyperparameters(hyperparameters) – Set new hyperparameters.
-
fit(**kwargs)
Call the fit method of the primitive.
The given keyword arguments will be passed directly to the fit method of the primitive instance specified in the JSON annotation.
If any of the arguments expected by the fit method had been given during the MLBlock initialization, they will be passed as well.
If the fit method was not specified in the JSON annotation, or if the primitive is a simple function, this will be a no-op.
- Parameters
**kwargs – Any given keyword argument will be directly passed to the primitive fit method.
- Raises
TypeError – A TypeError might be raised if any argument not expected by the primitive fit method is given.
-
get_hyperparameters()
Get the hyperparameter values that the current MLBlock is using.
- Returns
The dictionary containing the hyperparameter values that the MLBlock is currently using.
- Return type
dict
-
get_tunable_hyperparameters()
Get the hyperparameters that can be tuned for this MLBlock.
The list of hyperparameters is taken from the JSON annotation, filtering out any hyperparameter for which a value has been given during the initialization.
- Returns
The dictionary containing the hyperparameters that can be tuned, their types and, if applicable, the accepted ranges or values.
- Return type
dict
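The filtering step described for get_tunable_hyperparameters can be sketched in plain Python. This is only an illustration and does not call MLBlocks: the layout of the tunable specification and the helper name `filter_tunable` are assumptions made for this example.

```python
def filter_tunable(annotation_tunable, init_kwargs):
    """Keep only the tunable hyperparameters that were not fixed at init time."""
    return {
        name: spec
        for name, spec in annotation_tunable.items()
        if name not in init_kwargs
    }


# Hypothetical tunable section of a primitive annotation.
annotation_tunable = {
    "n_estimators": {"type": "int", "default": 10, "range": [1, 500]},
    "criterion": {"type": "str", "default": "gini", "values": ["gini", "entropy"]},
}

# The user fixed n_estimators when creating the block, so it is no longer tunable.
tunable = filter_tunable(annotation_tunable, {"n_estimators": 100})
print(sorted(tunable))
```

Only `criterion` remains in the result, together with its type, default and accepted values.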
-
produce(**kwargs)
Call the primitive function, or the predict method of the primitive.
The given keyword arguments will be passed directly to the primitive, if it is a simple function, or to the produce method of the primitive instance specified in the JSON annotation, if it is a class.
If any of the arguments expected by the produce method had been given during the MLBlock initialization, they will be passed as well.
- Returns
The output of the call to the primitive function or primitive produce method.
-
set_hyperparameters(hyperparameters)
Set new hyperparameters.
Only the specified hyperparameters are modified, so any other hyperparameter keeps the value that had been previously given.
If necessary, a new instance of the primitive is created.
- Parameters
hyperparameters (dict) – Dictionary containing as keys the name of the hyperparameters and as values the values to be used.
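The update semantics of set_hyperparameters (merge the new values, keep the rest, rebuild the wrapped object when needed) might look roughly like the following toy class. `ToyBlock` and `Scaler` are hypothetical names for this sketch; this is not the MLBlock implementation.

```python
class ToyBlock:
    """Toy stand-in for a block; not the MLBlocks implementation."""

    def __init__(self, primitive_class, **hyperparameters):
        self._primitive_class = primitive_class
        self._hyperparameters = dict(hyperparameters)
        self.instance = primitive_class(**self._hyperparameters)

    def get_hyperparameters(self):
        # Return a copy so callers cannot mutate the internal state.
        return dict(self._hyperparameters)

    def set_hyperparameters(self, hyperparameters):
        # Merge: keys that are not mentioned keep their previous values.
        self._hyperparameters.update(hyperparameters)
        # Rebuild the wrapped object so the new values take effect.
        self.instance = self._primitive_class(**self._hyperparameters)


class Scaler:
    """Hypothetical primitive with two hyperparameters."""

    def __init__(self, factor=1.0, offset=0.0):
        self.factor = factor
        self.offset = offset


block = ToyBlock(Scaler, factor=2.0, offset=1.0)
block.set_hyperparameters({"factor": 3.0})
print(block.get_hyperparameters())
```

After the call, `factor` has the new value while `offset` keeps the one given at creation time.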
-
-
class mlblocks.MLPipeline(pipeline=None, primitives=None, init_params=None, input_names=None, output_names=None, outputs=None, verbose=True)
MLPipeline Class.
The MLPipeline class represents a Machine Learning Pipeline, which is an ordered collection of Machine Learning tools or Primitives, represented by MLBlock instances, that will be fitted and then used sequentially in order to produce results.
The MLPipeline has two working modes or phases: fitting and predicting.
During the fitting phase, each MLBlock instance, or block, will be fitted and immediately afterwards used to produce results on the same fitting data. These results will then be passed to the next block of the sequence as its fitting data, and this process will be repeated until the last block is fitted.
During the predicting phase, each block will be used to produce results on the output of the previous one, until the last one has produced its results, which will be returned as the prediction of the pipeline.
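The two phases can be illustrated with a simplified sketch in plain Python. The block classes and the helper functions `fit_pipeline` and `predict_pipeline` are hypothetical and only mimic the behavior described above; the real pipeline also handles named arguments, a context dictionary and output selection.

```python
class MeanCenter:
    """Toy block: learns the mean during fit and subtracts it in produce."""

    def fit(self, X):
        self.mean = sum(X) / len(X)

    def produce(self, X):
        return [x - self.mean for x in X]


class Double:
    """Toy block with nothing to learn."""

    def fit(self, X):
        pass

    def produce(self, X):
        return [2 * x for x in X]


def fit_pipeline(blocks, X):
    # Fitting phase: fit each block, then immediately produce on the same
    # data so that the next block is fitted on the transformed output.
    for block in blocks[:-1]:
        block.fit(X)
        X = block.produce(X)
    blocks[-1].fit(X)


def predict_pipeline(blocks, X):
    # Predicting phase: only produce is called, one block after another.
    for block in blocks:
        X = block.produce(X)
    return X


blocks = [MeanCenter(), Double()]
fit_pipeline(blocks, [1.0, 2.0, 3.0])
predictions = predict_pipeline(blocks, [2.0, 4.0])
print(predictions)
```

The first block learns a mean of 2.0 during fitting, so the new data is centered with that value before being doubled.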
-
primitives
List of the names of the primitives that compose this pipeline.
- Type
list
-
blocks
OrderedDict of the block names and the corresponding MLBlock instances.
- Type
OrderedDict
-
init_params
init_params dictionary, as given when the instance was created.
- Type
dict
-
input_names
input_names dictionary, as given when the instance was created.
- Type
dict
-
output_names
output_names dictionary, as given when the instance was created.
- Type
dict
- Parameters
pipeline (str, list, dict or MLPipeline) –
- The pipeline argument accepts four different types with different interpretations:
str: the name of the pipeline to search and load.
list: the primitives list.
dict: a complete pipeline specification.
MLPipeline: another pipeline to be cloned.
primitives (list) – List with the names of the primitives that will compose this pipeline.
init_params (dict) – dictionary containing initialization arguments to be passed when creating the MLBlock instances. The dictionary keys must be the corresponding primitive names and the values must be another dictionary that will be passed as **kwargs to the MLBlock instance.
input_names (dict) – dictionary that maps input variable names with the actual names expected by each primitive. This allows reusing the same input argument for multiple primitives that name it differently, as well as passing different values to primitives that expect arguments named similarly.
output_names (dict) – dictionary that maps output variable names with the name these variables will be given when stored in the context dictionary. This allows storing the output of different primitives in different variables, even if the primitive output name is the same one.
outputs (dict) – dictionary containing lists of output variables associated to a name.
verbose (bool) – whether to log the exceptions that occur when running the pipeline before raising them or not.
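The idea behind input_names and output_names can be sketched with a small context dictionary. This is a simplified illustration under assumptions: the helper `run_block` and the direction of the mappings (primitive argument name to context variable name, and primitive output name to context variable name) are chosen for this example and should be checked against the library before relying on them.

```python
def run_block(func, arg_names, output_name, context, input_names=None, output_names=None):
    """Call func with arguments read from context and store its output back."""
    input_names = input_names or {}
    output_names = output_names or {}
    # For each argument the primitive expects, read the context variable it
    # is mapped to (by default, the variable with the same name).
    kwargs = {arg: context[input_names.get(arg, arg)] for arg in arg_names}
    result = func(**kwargs)
    # Store the output under its mapped name (by default, its own name).
    context[output_names.get(output_name, output_name)] = result


def scale(X):
    return [x / 10 for x in X]


def shift(X):
    return [x + 1 for x in X]


context = {"X": [10, 20]}
# Keep the original X untouched by storing the scaled copy under a new name.
run_block(scale, ["X"], "X", context, output_names={"X": "X_scaled"})
# Feed the scaled copy to a primitive whose argument is still called X.
run_block(shift, ["X"], "X", context,
          input_names={"X": "X_scaled"}, output_names={"X": "X_shifted"})
print(context)
```

Both primitives name their argument and output `X`, yet the context ends up holding three separate variables.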
Methods:
fit([X, y, output_, start_, debug]) – Fit the blocks of this pipeline.
from_dict(metadata) – Create a new MLPipeline from a dict specification.
get_diagram([fit, outputs, image_path]) – Create a PNG diagram for the pipeline, showing Pipeline Steps, Pipeline Inputs and Outputs, and block inputs and outputs.
get_hyperparameters([flat]) – Get the current hyperparameters of each block.
get_inputs([fit]) – Get a list of all the input variables required by this pipeline.
get_output_names([outputs]) – Get the names of the outputs that correspond to the given specification.
get_output_variables([outputs]) – Get the list of variable specifications of the given outputs.
get_outputs([outputs]) – Get the list of output variables that correspond to the specified outputs.
get_tunable_hyperparameters([flat]) – Get the tunable hyperparameters of each block.
load(path) – Create a new MLPipeline from a JSON specification.
predict([X, output_, start_, debug]) – Produce predictions using the blocks of this pipeline.
save(path) – Save the specification of this MLPipeline in a JSON file.
set_hyperparameters(hyperparameters) – Set new hyperparameter values for some blocks.
to_dict() – Return all the details of this MLPipeline in a dict.
-
fit(X=None, y=None, output_=None, start_=None, debug=False, **kwargs)
Fit the blocks of this pipeline.
Sequentially call the fit and the produce methods of each block, capturing the outputs of each produce method before calling the fit method of the next one.
During the whole process a context dictionary is built, where both the passed arguments and the captured outputs of the produce methods are stored, and from which the arguments for the next fit and produce calls will be taken.
- Parameters
X – Fit Data, which the pipeline will learn from.
y – Fit Data labels, which the pipeline will use to learn how to behave.
output_ (str or int or list or None) – Output specification, as required by get_outputs. If None is given, nothing will be returned.
start_ (str or int or None) – Block index or block name to start processing from. The value can either be an integer, which will be interpreted as a block index, or the name of a block, including the counter number at the end. If given, the execution of the pipeline will start on the specified block, and all the blocks before that one will be skipped.
debug (bool or str) –
Debug a pipeline with the following options:
t: Elapsed time for the primitive and the given stage (fit or predict).
m: Amount of memory increase (or decrease) for the primitive. This amount is represented in bytes.
i: The input values that the primitive takes for that step.
o: The output values that the primitive generates.
If provided, return a dictionary with the fit and predict performance. This argument can be a string containing a combination of the letters listed above, or True, which will return a complete debug.
**kwargs – Any additional keyword arguments will be directly added to the context dictionary and available for the blocks.
- Returns
If no output_ is specified, nothing will be returned.
If output_ has been specified, either a single value or a tuple of values will be returned.
- Return type
None or dict or object
-
classmethod from_dict(metadata)
Create a new MLPipeline from a dict specification.
The dict structure is the same as the one created by the to_dict method.
- Parameters
metadata (dict) – Dictionary containing the pipeline specification.
- Returns
A new MLPipeline instance with the details found in the given specification dictionary.
- Return type
MLPipeline
-
get_diagram(fit=True, outputs='default', image_path=None)
Create a PNG diagram for the pipeline, showing Pipeline Steps, Pipeline Inputs and Outputs, and block inputs and outputs.
If strings are given, they can either be one of the named outputs that have been specified on the pipeline definition or a full variable specification following the format {block-name}.{variable-name}.
- Parameters
fit (bool) – Optional argument to include fit arguments or not. Defaults to True.
outputs (str, int, or list[str or int]) – Single or list of output specifications.
image_path (str) – Optional argument for the location at which to save the file. Defaults to None, which returns a graphviz.Digraph object instead of saving the file.
- Returns
A graphviz.Digraph object that contains the information about the pipeline diagram.
- Return type
None or graphviz.Digraph object
-
get_hyperparameters(flat=False)
Get the current hyperparameters of each block.
- Parameters
flat (bool) – If True, return a flattened dictionary where each key is a two elements tuple containing the name of the block as the first element and the name of the hyperparameter as the second one. If False (default), return a dictionary where each key is the name of a block and each value is a dictionary containing the complete hyperparameter specification of that block.
- Returns
A dictionary containing the block names as keys and the current block hyperparameters dictionary as values.
- Return type
dict
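The relationship between the nested and the flat layouts described for the flat argument can be sketched as a pair of conversions. The helper names `flatten` and `unflatten` are hypothetical and not part of MLBlocks; the block names reuse the placeholder names from this page.

```python
def flatten(nested):
    """Turn {block: {name: value}} into {(block, name): value}."""
    return {
        (block, name): value
        for block, params in nested.items()
        for name, value in params.items()
    }


def unflatten(flat):
    """Inverse of flatten: rebuild the per-block dictionaries."""
    nested = {}
    for (block, name), value in flat.items():
        nested.setdefault(block, {})[name] = value
    return nested


nested = {
    "a_primitive#1": {"an_argument": "a_value"},
    "another_primitive#1": {"yet_another_argument": "yet_another_value"},
}
flat = flatten(nested)
print(flat)
```

The flat layout, with one entry per (block name, hyperparameter name) tuple, tends to be convenient when handing the whole search space to a tuner.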
-
get_inputs(fit=True)
Get a list of all the input variables required by this pipeline.
The result is a list that contains all of the input variables. Optionally include the fit arguments.
- Parameters
fit (bool) – Optional argument to include fit arguments or not. Defaults to True.
- Returns
List of dictionaries specifying all the input variables. Each dictionary contains the entry name, as well as any other metadata that may have been included in the pipeline inputs specification.
- Return type
list
-
get_output_names(outputs='default')
Get the names of the outputs that correspond to the given specification.
The indicated outputs will be resolved and the names of the output variables will be returned as a single list.
- Parameters
outputs (str, int or list[str or int]) – Single or list of output specifications.
- Returns
List of variable names
- Return type
list
- Raises
ValueError – If an output specification is not valid.
TypeError – If the type of a specification is not an str or an int.
-
get_output_variables(outputs='default')
Get the list of variable specifications of the given outputs.
The indicated outputs will be resolved and their variables specifications will be returned as a single list.
- Parameters
outputs (str, int or list[str or int]) – Single or list of output specifications.
- Returns
List of variable specifications.
- Return type
list
- Raises
ValueError – If an output specification is not valid.
TypeError – If the type of a specification is not an str or an int.
-
get_outputs(outputs='default')
Get the list of output variables that correspond to the specified outputs.
Outputs specification can either be a single string, a single integer, or a list of strings and integers.
If strings are given, they can either be one of the named outputs that have been specified on the pipeline definition, the name of a block, including the counter number at the end, or a full variable specification following the format {block-name}.{variable-name}.
Alternatively, integers can be passed as indexes of the blocks from which to get the outputs.
If output specifications that resolve to multiple output variables are given, such as the named outputs or block names, all the variables are concatenated together, in order, in a single variable list.
- Parameters
outputs (str, int or list[str or int]) – Single or list of output specifications.
- Returns
List of dictionaries specifying all the output variables. Each dictionary contains the entries name and variable, as well as any other metadata that may have been included in the pipeline outputs or block produce outputs specification.
- Return type
list
- Raises
ValueError – If an output specification is not valid.
TypeError – If the type of a specification is not an str or an int.
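The resolution rules for output specifications can be sketched as follows. This is a simplified reading of the description above, not the library code: the function `resolve_outputs` and the two lookup tables it receives are assumptions made for the example.

```python
def resolve_outputs(spec, named_outputs, block_outputs):
    """Resolve output specifications into a flat list of variable dicts.

    named_outputs: {output name: [variable dicts]}
    block_outputs: {block name: [output variable names]}, in pipeline order
    """
    specs = spec if isinstance(spec, list) else [spec]
    block_names = list(block_outputs)
    resolved = []
    for item in specs:
        if isinstance(item, int):
            # Integers are interpreted as block indexes.
            item = block_names[item]
        elif not isinstance(item, str):
            raise TypeError("Output specification must be a str or an int")

        if item in named_outputs:
            resolved.extend(named_outputs[item])
        elif item in block_outputs:
            # A block name resolves to all of the outputs of that block.
            resolved.extend(
                {"name": name, "variable": f"{item}.{name}"}
                for name in block_outputs[item]
            )
        elif "." in item:
            # Full specification: {block-name}.{variable-name}
            block, name = item.rsplit(".", 1)
            if block not in block_outputs or name not in block_outputs[block]:
                raise ValueError(f"Invalid output specification: {item}")
            resolved.append({"name": name, "variable": item})
        else:
            raise ValueError(f"Invalid output specification: {item}")
    return resolved


named_outputs = {"default": [{"name": "y", "variable": "model#1.y"}]}
block_outputs = {"scaler#1": ["X"], "model#1": ["y", "proba"]}
variables = resolve_outputs(["default", 0, "model#1.proba"], named_outputs, block_outputs)
print([variable["variable"] for variable in variables])
```

The three specifications (a named output, a block index and a full variable specification) are resolved in order and concatenated into a single list.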
-
get_tunable_hyperparameters(flat=False)
Get the tunable hyperparameters of each block.
- Parameters
flat (bool) – If True, return a flattened dictionary where each key is a two elements tuple containing the name of the block as the first element and the name of the hyperparameter as the second one. If False (default), return a dictionary where each key is the name of a block and each value is a dictionary containing the complete hyperparameter specification of that block.
- Returns
A dictionary containing the block names as keys and the block tunable hyperparameters dictionary as values.
- Return type
dict
-
classmethod load(path)
Create a new MLPipeline from a JSON specification.
The JSON file format is the same as the one created by the to_dict method.
- Parameters
path (str) – Path of the JSON file to load.
- Returns
A new MLPipeline instance with the specification found in the JSON file.
- Return type
MLPipeline
-
predict(X=None, output_='default', start_=None, debug=False, **kwargs)
Produce predictions using the blocks of this pipeline.
Sequentially call the produce method of each block, capturing the outputs before calling the next one.
During the whole process a context dictionary is built, where both the passed arguments and the captured outputs of the produce methods are stored, and from which the arguments for the next produce calls will be taken.
- Parameters
X – Data which the pipeline will use to make predictions.
output_ (str or int or list or None) – Output specification, as required by get_outputs. If not specified, the default output will be returned.
start_ (str or int or None) – Block index or block name to start processing from. The value can either be an integer, which will be interpreted as a block index, or the name of a block, including the counter number at the end. If given, the execution of the pipeline will start on the specified block, and all the blocks before that one will be skipped.
debug (bool or str) –
Debug a pipeline with the following options:
t: Elapsed time for the primitive and the given stage (fit or predict).
m: Amount of memory increase (or decrease) for the primitive. This amount is represented in bytes.
i: The input values that the primitive takes for that step.
o: The output values that the primitive generates.
If True, a dictionary will be returned containing all the elements listed previously. If a string value with a combination of the option letters is given, a dictionary with only the selected elements will be returned.
**kwargs – Any additional keyword arguments will be directly added to the context dictionary and available for the blocks.
- Returns
If a single output is requested, it is returned alone.
If multiple outputs have been requested, a tuple is returned.
If debug is given, a tuple will be returned where the first element is the predictions and the second a dictionary containing the debug information.
- Return type
object or tuple
-
save(path)
Save the specification of this MLPipeline in a JSON file.
The content of the JSON file is the dict returned by the to_dict method.
- Parameters
path (str) – Path to the JSON file to write.
-
set_hyperparameters(hyperparameters)
Set new hyperparameter values for some blocks.
- Parameters
hyperparameters (dict) – A dictionary containing the block names as keys and the new hyperparameters dictionary as values.
-
to_dict()
Return all the details of this MLPipeline in a dict.
The dict structure contains all the __init__ arguments of the MLPipeline, as well as the current hyperparameter values and the specification of the tunable_hyperparameters:
    {
        'primitives': [
            'a_primitive',
            'another_primitive'
        ],
        'init_params': {
            'a_primitive': {
                'an_argument': 'a_value'
            }
        },
        'hyperparameters': {
            'a_primitive#1': {
                'an_argument': 'a_value',
                'another_argument': 'another_value',
            },
            'another_primitive#1': {
                'yet_another_argument': 'yet_another_value'
            }
        },
        'tunable_hyperparameters': {
            'another_primitive#1': {
                'yet_another_argument': {
                    'type': 'str',
                    'default': 'a_default_value',
                    'values': [
                        'a_default_value',
                        'yet_another_value'
                    ]
                }
            }
        }
    }
-
-
mlblocks.add_pipelines_path(path)
Add a new path to look for pipelines.
The new path will be inserted in the first place of the list, so any pipeline found in this new folder will take precedence over any other pipeline with the same name that existed in the system before.
- Parameters
path (str) – path to add
- Raises
ValueError – A ValueError will be raised if the path is not valid.
-
mlblocks.add_primitives_path(path)
Add a new path to look for primitives.
The new path will be inserted in the first place of the list, so any primitive found in this new folder will take precedence over any other primitive with the same name that existed in the system before.
- Parameters
path (str) – path to add
- Raises
ValueError – A ValueError will be raised if the path is not valid.
-
mlblocks.find_pipelines(pattern='', filters=None)
Find pipelines by name and filters.
If a pattern is given, only the pipelines whose name matches the pattern will be returned.
If filters are given, they should be a dictionary containing key/value filters that will have to be matched within the pipeline annotation for it to be included in the results.
If the given key is not found but it contains dots, it is split by the dots and each part is considered a sublevel in the annotation.
If the key value within the annotation is a list or a dict, the filter checks whether any of the given values is contained within it instead of checking for equality.
- Parameters
pattern (str) – Regular expression to match against the pipeline names.
filters (dict) – Dictionary containing the filters to apply over the matching pipelines.
- Returns
Names of the matching pipelines.
- Return type
list
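The matching rules described for the filters can be sketched in plain Python. This is an illustration under assumptions, not the library code: the helpers `get_nested`, `matches` and `find`, the use of `re.search` for the pattern, and the sample annotations are all made up for this example.

```python
import re


def get_nested(annotation, key):
    """Look up key directly, falling back to dot-separated sublevels."""
    if key in annotation:
        return annotation[key]
    value = annotation
    for part in key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(annotation, filters):
    for key, expected in (filters or {}).items():
        found = get_nested(annotation, key)
        if isinstance(found, (list, dict)):
            # Containers match when they contain any of the given values.
            candidates = expected if isinstance(expected, list) else [expected]
            if not any(value in found for value in candidates):
                return False
        elif found != expected:
            return False
    return True


def find(annotations, pattern="", filters=None):
    regex = re.compile(pattern)
    return [
        name
        for name, annotation in annotations.items()
        if regex.search(name) and matches(annotation, filters)
    ]


# Hypothetical annotations, keyed by name.
annotations = {
    "xgb_classifier": {"metadata": {"data_type": "tabular", "tasks": ["classification"]}},
    "xgb_regressor": {"metadata": {"data_type": "tabular", "tasks": ["regression"]}},
    "lstm_text": {"metadata": {"data_type": "text", "tasks": ["classification"]}},
}
found = find(annotations, "xgb", {"metadata.tasks": "classification"})
print(found)
```

The dotted key `metadata.tasks` is resolved level by level, and because the value found is a list, the filter checks containment rather than equality.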
-
mlblocks.find_primitives(pattern='', filters=None)
Find primitives by name and filters.
If a pattern is given, only the primitives whose name matches the pattern will be returned.
If filters are given, they should be a dictionary containing key/value filters that will have to be matched within the primitive annotation for it to be included in the results.
If the given key is not found but it contains dots, it is split by the dots and each part is considered a sublevel in the annotation.
If the key value within the annotation is a list or a dict, the filter checks whether any of the given values is contained within it instead of checking for equality.
- Parameters
pattern (str) – Regular expression to match against the primitive names.
filters (dict) – Dictionary containing the filters to apply over the matching primitives.
- Returns
Names of the matching primitives.
- Return type
list
-
mlblocks.get_pipelines_paths()
Get the list of folders where pipelines will be looked for.
This list will include the values of all the entry points named pipelines published under the entry point group mlblocks.
An example of such an entry point would be:
    entry_points = {
        'mlblocks': [
            'pipelines=some_module:SOME_VARIABLE'
        ]
    }
where the module some_module contains a variable such as:
    SOME_VARIABLE = os.path.join(os.path.dirname(__file__), 'jsons')
- Returns
The list of folders.
- Return type
list
-
mlblocks.get_primitives_paths()
Get the list of folders where primitives will be looked for.
This list will include the values of all the entry points named primitives published under the entry point group mlblocks.
Also, for backwards compatibility reasons, the paths from the entry points named jsons_path published under the mlprimitives group will also be included.
An example of such an entry point would be:
    entry_points = {
        'mlblocks': [
            'primitives=some_module:SOME_VARIABLE'
        ]
    }
where the module some_module contains a variable such as:
    SOME_VARIABLE = os.path.join(os.path.dirname(__file__), 'jsons')
- Returns
The list of folders.
- Return type
list
-
mlblocks.load_pipeline(name)
Locate and load the pipeline JSON annotation.
All the pipeline paths will be scanned to find a JSON file with the given name, and as soon as a JSON with the given name is found it is returned.
- Parameters
name (str) – Path to a JSON file or name of the JSON to look for without the .json extension.
- Returns
The content of the JSON annotation file loaded into a dict.
- Return type
dict
- Raises
ValueError – A ValueError will be raised if the pipeline cannot be found.
-
mlblocks.load_primitive(name)
Locate and load the primitive JSON annotation.
All the primitive paths will be scanned to find a JSON file with the given name, and as soon as a JSON with the given name is found it is returned.
- Parameters
name (str) – Path to a JSON file or name of the JSON to look for without the .json extension.
- Returns
The content of the JSON annotation file loaded into a dict.
- Return type
dict
- Raises
ValueError – A ValueError will be raised if the primitive cannot be found.
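The lookup order described for the path and load functions (newest path first, first matching JSON wins) can be sketched with the standard library only. The functions `add_path` and `load_annotation` and the module-level `paths` list are hypothetical stand-ins; they do not reproduce the entry point discovery that the real functions also perform.

```python
import json
import os
import tempfile

paths = []


def add_path(path):
    if not os.path.isdir(path):
        raise ValueError(f"Invalid path: {path}")
    # The newest path goes first so that it wins over older ones.
    paths.insert(0, os.path.abspath(path))


def load_annotation(name):
    # Accept either a direct path to a JSON file or a bare name.
    if os.path.isfile(name):
        candidates = [name]
    else:
        candidates = [os.path.join(folder, name + ".json") for folder in paths]
    for candidate in candidates:
        if os.path.isfile(candidate):
            with open(candidate) as json_file:
                return json.load(json_file)
    raise ValueError(f"Unknown annotation: {name}")


with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    # Write an annotation with the same name into both folders.
    for folder, source in ((old_dir, "old"), (new_dir, "new")):
        with open(os.path.join(folder, "my_primitive.json"), "w") as json_file:
            json.dump({"name": "my_primitive", "source": source}, json_file)

    add_path(old_dir)
    add_path(new_dir)
    loaded = load_annotation("my_primitive")

print(loaded["source"])
```

Because the second folder was added last, it is searched first and its copy of the annotation shadows the older one.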