Text Pipelines¶
Here we show some examples of using MLBlocks to solve text-related problems.
Text Classification¶
For the text classification examples we will be using the Twenty Newsgroups Dataset,
which we will load using the mlprimitives.datasets.load_newsgroups
function.
The data of this dataset is a 1d numpy array containing the texts of 11314 newsgroup posts, and the target is a 1d numpy integer array indicating which of the 20 topics each post is about.
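As a quick sanity check, we can load the dataset and confirm the shapes described above. This is only a sketch; it assumes that the Dataset object returned by load_newsgroups exposes the raw arrays as data and target attributes:
from mlprimitives.datasets import load_newsgroups

dataset = load_newsgroups()

# Assumption: the Dataset object exposes the raw arrays directly
print(dataset.data.shape)                          # expected: (11314,) - one string per post
print(dataset.target.shape)                        # expected: (11314,) - one integer label per post
print(dataset.target.min(), dataset.target.max())  # expected: 0 19 - the 20 topic labels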
MLPrimitives + Keras Preprocessing + Keras LSTM¶
In this example we will start by applying some text cleanup using the TextCleaner primitive from MLPrimitives, then run some Keras preprocessing primitives, and end with the Keras LSTM Text Classifier from MLPrimitives.
Note how in this case we are using the input_names and output_names arguments to properly set up the pipeline and allow the outputs of some primitives to be used as additional inputs for later ones.
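When deciding which names to remap, it helps to look at the inputs and outputs that each primitive declares in its annotation. As a hedged sketch, assuming that an MLBlock can be built directly from a primitive name and that it exposes the parsed annotation as produce_args and produce_output attributes:
from mlblocks import MLBlock

# Load a single primitive to inspect its declared interface
block = MLBlock('mlprimitives.custom.counters.UniqueCounter')

print(block.produce_args)    # inputs the primitive expects, e.g. an argument named 'X'
print(block.produce_output)  # outputs it generates, e.g. a value named 'counts'
In the pipeline below, input_names feeds the pipeline variable y into this primitive's X argument, and output_names stores its counts output in the context as classes, where later primitives can pick it up.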
import nltk

from mlblocks import MLPipeline
from mlprimitives.datasets import load_newsgroups

dataset = load_newsgroups()
dataset.describe()

X_train, X_test, y_train, y_test = dataset.get_splits(1)

# Make sure that we have the necessary data
nltk.download('stopwords')
# Set up the pipeline: the primitives are executed in this order
primitives = [
    "mlprimitives.custom.counters.UniqueCounter",      # count the target classes
    "mlprimitives.custom.text.TextCleaner",            # clean up the raw texts
    "mlprimitives.custom.counters.VocabularyCounter",  # count the vocabulary size
    "keras.preprocessing.text.Tokenizer",              # turn texts into integer sequences
    "keras.preprocessing.sequence.pad_sequences",      # pad the sequences to a fixed length
    "keras.Sequential.LSTMTextClassifier"              # final LSTM classifier
]
# Feed the pipeline variable ``y`` into UniqueCounter's ``X`` argument
input_names = {
    "mlprimitives.custom.counters.UniqueCounter#1": {
        "X": "y"
    }
}

# Store the ``counts`` outputs of these primitives under new variable names
output_names = {
    "mlprimitives.custom.counters.UniqueCounter#1": {
        "counts": "classes"
    },
    "mlprimitives.custom.counters.VocabularyCounter#1": {
        "counts": "vocabulary_size"
    }
}
init_params = {
    "mlprimitives.custom.counters.VocabularyCounter#1": {
        "add": 1              # add 1 to the computed vocabulary size
    },
    "mlprimitives.custom.text.TextCleaner#1": {
        "language": "en"      # the newsgroup posts are written in English
    },
    "keras.preprocessing.sequence.pad_sequences#1": {
        "maxlen": 100         # pad or truncate every sequence to 100 tokens
    },
    "keras.Sequential.LSTMTextClassifier#1": {
        "input_length": 100   # must match the pad_sequences maxlen
    }
}
pipeline = MLPipeline(primitives, init_params, input_names, output_names)

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

dataset.score(y_test, predictions)
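dataset.score applies whichever metric the Dataset object was built with. For a classification dataset like this one we would expect that metric to be accuracy, so the following manual check (an assumption on our side, not part of the original example) should give the same number:
from sklearn.metrics import accuracy_score

# Should match dataset.score(y_test, predictions) if the dataset metric is accuracy
accuracy_score(y_test, predictions)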
Tabular Data with Text¶
For these examples we will be using the Personae Dataset, which we will load
using the mlprimitives.datasets.load_personae
function.
The data of this dataset is a 2d numpy array containing 145 entries with texts written by Dutch users on Twitter, along with some additional information about the author, and the target is a 1d numpy binary integer array indicating whether or not the author was an extrovert.
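Again, as a quick sanity check we can inspect the loaded data. A minimal sketch, under the same assumptions about the Dataset object as above:
from mlprimitives.datasets import load_personae

dataset = load_personae()

# Assumption: data and target attributes, as in the other mlprimitives datasets
print(dataset.data.shape)   # expected: 145 rows - text plus author information
print(set(dataset.target))  # expected: {0, 1} - extrovert or not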
MLPrimitives + Scikit-learn RandomForestClassifier¶
In this example we use the TextCleaner primitive again, followed by a StringVectorizer primitive to encode all the string features, and then go directly into the RandomForestClassifier from scikit-learn.
import nltk

from mlblocks import MLPipeline
from mlprimitives.datasets import load_personae

dataset = load_personae()
dataset.describe()

X_train, X_test, y_train, y_test = dataset.get_splits(1)

# Make sure that we have the necessary data
nltk.download('stopwords')
primitives = [
    'mlprimitives.custom.text.TextCleaner',                     # clean up the text column
    'mlprimitives.custom.feature_extraction.StringVectorizer',  # encode the string features
    'sklearn.ensemble.RandomForestClassifier',                  # final classifier
]

init_params = {
    'mlprimitives.custom.text.TextCleaner': {
        'column': 'text',    # name of the column that contains the texts
        'language': 'nl'     # the texts are written in Dutch
    },
    'sklearn.ensemble.RandomForestClassifier': {
        'n_jobs': -1,        # use all available cores
        'n_estimators': 100
    }
}
pipeline = MLPipeline(primitives, init_params)

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

dataset.score(y_test, predictions)
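Once the pipeline has been built, we can double-check that the init_params were picked up by inspecting the pipeline hyperparameters with MLPipeline.get_hyperparameters. The exact key format ('#1'-suffixed block names) is our assumption here, matching the counted names used in the first example:
# The values passed through init_params should show up here
hyperparameters = pipeline.get_hyperparameters()
print(hyperparameters['sklearn.ensemble.RandomForestClassifier#1']['n_estimators'])  # 100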