mlprimitives.custom.feature_extraction module
class mlprimitives.custom.feature_extraction.CategoricalEncoder(max_labels=None, max_unique_ratio=0, dropna=True, **kwargs)
Bases: mlprimitives.custom.feature_extraction.FeatureExtractor
FeatureExtractor that encodes categorical features using OneHotLabelEncoder.
When autodetecting features, only features with dtype category or object are considered.
Optionally, a max_unique_ratio can be passed, which allows ignoring features that have a high number of unique values, such as primary keys.
- Parameters
  max_labels (int or None) – Maximum number of labels to use per feature. Defaults to None.
  max_unique_ratio (int) – Maximum proportion of unique values that a feature may have in order to be considered a categorical feature. If 0 is given, the ratio is ignored. Defaults to 0.
  dropna (bool) – Whether to drop null values before analyzing the features and fitting the encoders. Defaults to True.
>>> import pandas as pd
>>> from mlprimitives.custom.feature_extraction import CategoricalEncoder
>>> df = pd.DataFrame([
...     {'a': 'a', 'b': 1, 'c': 1},
...     {'a': 'a', 'b': 2, 'c': 2},
...     {'a': 'b', 'b': 2, 'c': 1},
... ])
>>> df['c'] = df['c'].astype('category')
>>> ce = CategoricalEncoder(features='auto')
>>> ce.fit_transform(df)
   b  a=a  a=b  c=1  c=2
0  1    1    0    1    0
1  2    1    0    0    1
2  2    0    1    1    0
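The max_unique_ratio filter described above can be illustrated with a minimal standard-library sketch. It assumes the ratio is computed as the number of distinct non-null values divided by the number of non-null values, which this page does not state. The detect_categorical helper and its dict-of-lists input are hypothetical, and the dtype check is omitted.

```python
def detect_categorical(columns, max_unique_ratio=0):
    """Return the names of columns that pass the unique-ratio check.

    `columns` maps a column name to its list of values (hypothetical shape).
    """
    selected = []
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        if not non_null:
            continue
        # Assumption: ratio = distinct values / total non-null values.
        ratio = len(set(non_null)) / len(non_null)
        # A max_unique_ratio of 0 disables the check, as documented.
        if max_unique_ratio == 0 or ratio <= max_unique_ratio:
            selected.append(name)
    return selected


columns = {
    "user_id": ["u1", "u2", "u3", "u4"],  # all values unique, like a primary key
    "country": ["US", "US", "ES", "US"],  # 2 distinct values out of 4
}
print(detect_categorical(columns))                        # ['user_id', 'country']
print(detect_categorical(columns, max_unique_ratio=0.5))  # ['country']
```

With the check disabled both columns are kept; with a ratio of 0.5 the key-like column is dropped.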
class mlprimitives.custom.feature_extraction.DatetimeFeaturizer(copy=True, features=None, keep=False)
Bases: mlprimitives.custom.feature_extraction.FeatureExtractor
Extract features from a datetime.
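This page does not list which features are extracted. As an illustration only, the sketch below derives a few common calendar fields from a single datetime using the standard library. The datetime_features helper and the choice of fields are assumptions, not the behaviour of DatetimeFeaturizer.

```python
from datetime import datetime


def datetime_features(value):
    """Split one datetime into simple numeric features (hypothetical selection)."""
    return {
        "year": value.year,
        "month": value.month,
        "day": value.day,
        "weekday": value.weekday(),  # Monday is 0, Sunday is 6
        "hour": value.hour,
    }


features = datetime_features(datetime(2020, 2, 29, 13, 30))
print(features)
```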
class mlprimitives.custom.feature_extraction.FeatureExtractor(copy=True, features=None, keep=False)
Bases: object
Extract features by applying single-column feature extractors to multiple columns.
Optionally detect the features on which to apply the feature extractor automatically.
- Parameters
  copy (bool) – Whether to make a copy of the input data or modify it in place. Defaults to True.
  features (list or str) – List of features to apply the feature extractor to. If 'auto' is passed, try to detect the features automatically. Defaults to an empty list.
  keep (bool) – Whether to keep the original features instead of replacing them. Defaults to False.
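The interaction of copy, features and keep can be sketched without the library. The code below is a simplified illustration, not the FeatureExtractor implementation. The extract_features function, the dict-of-lists table and the length_extractor callback are all hypothetical.

```python
import copy as copy_module


def extract_features(table, extractor, features, keep=False, copy=True):
    """Apply a single-column `extractor` to every column named in `features`.

    `table` maps column names to lists; `extractor(name, values)` returns a
    dict of new columns. Both shapes are assumptions made for this sketch.
    """
    result = copy_module.deepcopy(table) if copy else table
    for name in features:
        new_columns = extractor(name, result[name])
        if not keep:
            del result[name]  # replace the original column
        result.update(new_columns)
    return result


def length_extractor(name, values):
    """Toy extractor: one new column holding the length of each value."""
    return {name + "_len": [len(v) for v in values]}


table = {"city": ["Rome", "Oslo"], "code": ["IT", "NO"], "n": [1, 2]}
replaced = extract_features(table, length_extractor, ["city", "code"])
kept = extract_features(table, length_extractor, ["city"], keep=True)
print(replaced)  # 'city' and 'code' replaced by their *_len columns
print(kept)      # 'city' kept alongside 'city_len'
```

Because copy defaults to True in this sketch, the input table is left untouched by both calls.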
class mlprimitives.custom.feature_extraction.OneHotLabelEncoder(name=None, max_labels=None, dropna=True)
Bases: object
Combination of LabelEncoder + OneHotEncoder.
- Parameters
  name (str or None) – Name of this feature. If None is given, the name is taken from the training feature column.
  max_labels (int or None) – Maximum number of columns to generate per feature. Defaults to None.
  dropna (bool) – Whether to drop null values before fitting. Defaults to True.
>>> import pandas as pd
>>> from mlprimitives.custom.feature_extraction import OneHotLabelEncoder
>>> df = pd.DataFrame([
...     {'a': 'a', 'b': 1, 'c': 1},
...     {'a': 'a', 'b': 2, 'c': 2},
...     {'a': 'b', 'b': 2, 'c': 1},
... ])
>>> OneHotLabelEncoder().fit_transform(df.a)
   a=a  a=b
0    1    0
1    1    0
2    0    1
>>> OneHotLabelEncoder(max_labels=1).fit_transform(df.a)
   a=a
0    1
1    1
2    0
>>> OneHotLabelEncoder(name='a_name').fit_transform(df.a)
   a_name=a  a_name=b
0         1         0
1         1         0
2         0         1
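The effect of max_labels and name can be reproduced with a small standard-library sketch. It assumes that labels are ranked by frequency before being truncated to max_labels, which is consistent with the example above but is not stated on this page. The one_hot function is hypothetical and is not the library's code.

```python
from collections import Counter


def one_hot(values, name, max_labels=None):
    """Return a dict of '<name>=<label>' columns holding 0/1 indicators."""
    # Assumption: keep the most frequent labels first, at most max_labels of them.
    labels = [label for label, _ in Counter(values).most_common(max_labels)]
    return {
        "{}={}".format(name, label): [int(v == label) for v in values]
        for label in labels
    }


values = ["a", "a", "b"]
print(one_hot(values, "a"))                # {'a=a': [1, 1, 0], 'a=b': [0, 0, 1]}
print(one_hot(values, "a", max_labels=1))  # {'a=a': [1, 1, 0]}
print(one_hot(values, "a_name"))           # columns prefixed with 'a_name='
```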
class mlprimitives.custom.feature_extraction.StringVectorizer(copy=True, features=None, keep=False, min_words=0, **kwargs)
Bases: mlprimitives.custom.feature_extraction.FeatureExtractor
FeatureExtractor that encodes text features using a scikit-learn CountVectorizer.
When autodetecting features, only features with dtype object are considered.
Optionally, a min_words value can be passed, which allows ignoring features that have fewer than the given number of words in all their occurrences.
- Parameters
  copy (bool) – Whether to make a copy of the input data or modify it in place. Defaults to True.
  features (list or str) – List of features to apply the feature extractor to. If 'auto' is passed, try to detect the features automatically. Defaults to an empty list.
  keep (bool) – Whether to keep the original features instead of replacing them. Defaults to False.
  min_words (int) – Minimum number of words that a feature needs to have in order to be considered a text column. Defaults to 0.
  **kwargs – Any additional keyword arguments will be passed to the underlying CountVectorizer instances.
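The min_words rule and the bag-of-words encoding can be sketched with the standard library. This is a simplified illustration under stated assumptions: a column is kept when at least one of its values reaches min_words words, and tokens are produced by lowercasing and splitting on whitespace, which is cruder than scikit-learn's CountVectorizer tokenization. The is_text_column and count_vectorize helpers are hypothetical.

```python
from collections import Counter


def is_text_column(values, min_words=0):
    """Keep a column if at least one value has min_words words or more."""
    return any(len(str(v).split()) >= min_words for v in values)


def count_vectorize(values):
    """Return a sorted vocabulary and one row of word counts per value."""
    tokenized = [str(v).lower().split() for v in values]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


reviews = ["good food", "bad food bad service"]
codes = ["A1", "B2"]

print(is_text_column(reviews, min_words=2))  # True
print(is_text_column(codes, min_words=2))    # False

vocabulary, rows = count_vectorize(reviews)
print(vocabulary)  # ['bad', 'food', 'good', 'service']
print(rows)        # [[0, 1, 1, 0], [2, 1, 0, 1]]
```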