mlprimitives.custom.feature_extraction module

class mlprimitives.custom.feature_extraction.CategoricalEncoder(max_labels=None, max_unique_ratio=0, dropna=True, **kwargs)[source]

Bases: mlprimitives.custom.feature_extraction.FeatureExtractor

FeatureExtractor that encodes categorical features using OneHotLabelEncoder.

When autodetecting features, only features with dtype category or object are considered.

Optionally, a max_unique_ratio can be passed, which allows ignoring features that have a high number of unique values, such as primary keys.

Parameters
  • max_labels (int or None) – Maximum number of labels to use per feature. Defaults to None.

  • max_unique_ratio (float) – Maximum proportion of unique values that a feature may have in order to be considered a categorical feature. If 0 is given, the ratio is ignored. Defaults to 0.

  • dropna (bool) – Whether to drop null values before analyzing the features and fitting the encoders. Defaults to True.

>>> import pandas as pd
>>> from mlprimitives.custom.feature_extraction import CategoricalEncoder
>>> df = pd.DataFrame([
... {'a': 'a', 'b': 1, 'c': 1},
... {'a': 'a', 'b': 2, 'c': 2},
... {'a': 'b', 'b': 2, 'c': 1},
... ])
>>> df['c'] = df['c'].astype('category')
>>> ce = CategoricalEncoder(features='auto')
>>> ce.fit_transform(df)
   b  a=a  a=b  c=1  c=2
0  1    1    0    1    0
1  2    1    0    0    1
2  2    0    1    1    0
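
The autodetection rule described for CategoricalEncoder can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: detect_categorical is a hypothetical helper, and both the use of <= for the comparison and the choice to compute the ratio over all rows are assumptions.

```python
import pandas as pd


def detect_categorical(X, max_unique_ratio=0):
    """Hypothetical helper: return the names of columns that look categorical."""
    features = []
    # Only columns with dtype category or object are candidates.
    for column in X.select_dtypes(include=['category', 'object']).columns:
        # Share of distinct values among the rows of this column.
        unique_ratio = X[column].nunique() / len(X)
        # A ratio of 0 disables the filter (assumed to mirror the documented default).
        if max_unique_ratio == 0 or unique_ratio <= max_unique_ratio:
            features.append(column)
    return features


df = pd.DataFrame({
    # One distinct value per row, like a primary key.
    'id': pd.Series(['u1', 'u2', 'u3', 'u4'], dtype=object),
    'color': pd.Series(['red', 'red', 'blue', 'red'], dtype='category'),
    # Numeric column, never a candidate.
    'size': [1, 2, 2, 3],
})

print(detect_categorical(df))                        # ratio filter disabled
print(detect_categorical(df, max_unique_ratio=0.5))  # key-like column is skipped
```

With the filter disabled both 'id' and 'color' are selected; with a ratio of 0.5 only 'color' remains, because every value of 'id' is unique.
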
fit(X, y=None)[source]
class mlprimitives.custom.feature_extraction.DatetimeFeaturizer(copy=True, features=None, keep=False)[source]

Bases: mlprimitives.custom.feature_extraction.FeatureExtractor

Extract features from a datetime.
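
The page does not list which features are extracted, so the following is only a sketch of the general idea using the pandas .dt accessor. The helper datetime_features, the chosen components, and the column naming are all assumptions and may differ from what DatetimeFeaturizer produces.

```python
import pandas as pd


def datetime_features(x):
    """Hypothetical helper: expand one datetime Series into numeric columns."""
    x = pd.to_datetime(x)
    name = x.name
    return pd.DataFrame({
        f'{name}_year': x.dt.year,
        f'{name}_month': x.dt.month,
        f'{name}_day': x.dt.day,
        f'{name}_weekday': x.dt.weekday,  # Monday=0 ... Sunday=6
        f'{name}_hour': x.dt.hour,
    }, index=x.index)


ts = pd.Series(['2020-01-15 08:30:00', '2021-12-31 23:59:00'], name='ts')
print(datetime_features(ts))
```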

class mlprimitives.custom.feature_extraction.FeatureExtractor(copy=True, features=None, keep=False)[source]

Bases: object

Extract features by applying single-column feature extractors to multiple columns.

Optionally detect the features on which to apply the feature extractor automatically.

Parameters
  • copy (bool) – Whether to make a copy of the input data or modify it in place. Defaults to True.

  • features (list or str) – List of features to apply the feature extractor to. If 'auto' is passed, try to detect the features automatically. Defaults to an empty list.

  • keep (bool) – Whether to keep the original features instead of replacing them. Defaults to False.

fit(X, y=None)[source]
fit_transform(X, y=None)[source]
transform(X)[source]
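
The interplay of the copy, features and keep parameters can be sketched as follows. SketchExtractor and lengths are hypothetical names introduced for illustration; this is an assumption about the general pattern, not the FeatureExtractor source.

```python
import pandas as pd


class SketchExtractor:
    """Hypothetical stand-in that mimics the copy/features/keep parameters."""

    def __init__(self, extract, copy=True, features=None, keep=False):
        self.extract = extract            # function: Series -> DataFrame
        self.copy = copy
        self.features = features or []
        self.keep = keep

    def transform(self, X):
        if self.copy:
            X = X.copy()                  # leave the caller's frame untouched
        for feature in self.features:
            extracted = self.extract(X[feature])
            if not self.keep:
                X.pop(feature)            # replace the original column
            for column in extracted.columns:
                X[column] = extracted[column]
        return X


def lengths(x):
    # Single-column extractor: string length of each value.
    return pd.DataFrame({x.name + '_len': x.str.len()}, index=x.index)


df = pd.DataFrame({'a': ['xx', 'y'], 'b': ['p', 'qqq'], 'n': [1, 2]})
out = SketchExtractor(lengths, features=['a', 'b']).transform(df)
print(out)
```

Because copy defaults to True in this sketch, df keeps its original columns, while out contains 'n', 'a_len' and 'b_len'. Passing keep=True would retain 'a' and 'b' alongside the extracted columns.
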
class mlprimitives.custom.feature_extraction.OneHotLabelEncoder(name=None, max_labels=None, dropna=True)[source]

Bases: object

Combination of LabelEncoder + OneHotEncoder.

Parameters
  • name (str or None) – Name of this feature. If None is given, the name is taken from the training feature column.

  • max_labels (int or None) – Maximum number of columns to generate per feature. Defaults to None.

  • dropna (bool) – Whether to drop null values before fitting. Defaults to True.

>>> import pandas as pd
>>> from mlprimitives.custom.feature_extraction import OneHotLabelEncoder
>>> df = pd.DataFrame([
... {'a': 'a', 'b': 1, 'c': 1},
... {'a': 'a', 'b': 2, 'c': 2},
... {'a': 'b', 'b': 2, 'c': 1},
... ])
>>> OneHotLabelEncoder().fit_transform(df.a)
   a=a  a=b
0    1    0
1    1    0
2    0    1
>>> OneHotLabelEncoder(max_labels=1).fit_transform(df.a)
   a=a
0    1
1    1
2    0
>>> OneHotLabelEncoder(name='a_name').fit_transform(df.a)
   a_name=a  a_name=b
0         1         0
1         1         0
2         0         1
fit(x)[source]
fit_transform(x)[source]
transform(x)[source]
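
The output shown in the examples above can be reproduced with plain pandas. This is a rough sketch under one assumption that the examples are consistent with but do not confirm: when max_labels is set, the most frequent labels are the ones kept. The helper one_hot_top_labels is hypothetical and ignores the dropna parameter.

```python
import pandas as pd


def one_hot_top_labels(x, name=None, max_labels=None):
    """Hypothetical helper: one-hot encode the most frequent labels of x."""
    name = name or x.name
    # value_counts sorts by frequency; slicing with None keeps every label.
    labels = x.value_counts().index[:max_labels]
    return pd.DataFrame(
        {f'{name}={label}': (x == label).astype(int) for label in labels},
        index=x.index,
    )


s = pd.Series(['a', 'a', 'b'], name='a')
print(one_hot_top_labels(s))
print(one_hot_top_labels(s, max_labels=1))
print(one_hot_top_labels(s, name='a_name'))
```
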
class mlprimitives.custom.feature_extraction.StringVectorizer(copy=True, features=None, keep=False, min_words=0, **kwargs)[source]

Bases: mlprimitives.custom.feature_extraction.FeatureExtractor

FeatureExtractor that encodes text features using a scikit-learn CountVectorizer.

When autodetecting features, only features with dtype object are considered.

Optionally, a min_words value can be passed, which allows ignoring features that have fewer than the given number of words in all their occurrences.

Parameters
  • copy (bool) – Whether to make a copy of the input data or modify it in place. Defaults to True.

  • features (list or str) – List of features to apply the feature extractor to. If 'auto' is passed, try to detect the features automatically. Defaults to an empty list.

  • keep (bool) – Whether to keep the original features instead of replacing them. Defaults to False.

  • min_words (int) – Minimum number of words that a feature needs to have in order to be considered a text column. Defaults to 0.

  • **kwargs – Any additional keyword arguments will be passed to the underlying CountVectorizer instances.

fit(X, y=None)[source]
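
The min_words rule described above can be illustrated with plain pandas. This is a minimal sketch and not the library's code: detect_text_features is a hypothetical helper, and splitting on whitespace is a simplifying assumption that stands in for whatever tokenization the underlying vectorizer applies.

```python
import pandas as pd


def detect_text_features(X, min_words=0):
    """Hypothetical helper: return the object columns that look like free text."""
    features = []
    for column in X.select_dtypes(include=['object']).columns:
        # Number of whitespace-separated words in each non-null value.
        word_counts = X[column].dropna().map(lambda value: len(str(value).split()))
        # Keep the column if at least one value reaches min_words.
        if (word_counts >= min_words).any():
            features.append(column)
    return features


df = pd.DataFrame({
    'code': pd.Series(['A1', 'B2', 'C3'], dtype=object),
    'review': pd.Series(['great', 'not worth the price', 'ok'], dtype=object),
})
print(detect_text_features(df, min_words=3))
```

Here 'code' is ignored because none of its values has three or more words, while 'review' is kept because one of its values has four.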