The clean-word parser will load data from a column in a source data table and map values to other values before producing categorical variables within a collection.
The clean-word parser will tokenize text from a column into words, removing all non-alphabetical characters. The tokenized words will become the categorical variables within the collection.
In comparison to the dirty-word parser, clean-word will not remove underscore characters during tokenization. The underscores will be converted back to space characters when the variable names are created.
This can be useful, for example, for creating variable names that contain both a first name and a last name, with a space in between.
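The tokenization behaviour described above can be illustrated with a small Python sketch. This is not the parser's actual implementation. The function name clean_word_tokens is hypothetical, and the sketch assumes that any run of characters other than ASCII letters and underscores separates tokens, and that case is preserved.

```python
import re


def clean_word_tokens(text):
    """Illustrative approximation of clean-word tokenization (hypothetical helper)."""
    # Assumption: tokens are runs of ASCII letters and underscores;
    # every other character (digits, punctuation, whitespace) is dropped.
    raw_tokens = re.findall(r"[A-Za-z_]+", text)
    # Underscores are turned into spaces when forming variable names.
    names = [token.replace("_", " ").strip() for token in raw_tokens]
    # Discard tokens that consisted only of underscores.
    return [name for name in names if name]


print(clean_word_tokens("John_Smith met Jane_Doe, 2 times!"))
# ['John Smith', 'met', 'Jane Doe', 'times']
```

In this sketch, the underscore keeps a first name and last name together as one token, and the resulting variable name contains a space.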
{
"parser_type": "clean-word",
"table_alias": "tata",
"column": "cccc",
"minimum": ####,
"collection": "collection_name"
}
minimum
usage: optional
The minimum attribute defines the minimum length of a word that will be included. The default is 3 characters.
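As a rough illustration of how the minimum attribute might act on tokenized words, the Python sketch below keeps only words that reach the configured length. The function name filter_by_minimum is hypothetical, and the sketch assumes that a word whose length equals the minimum is included.

```python
def filter_by_minimum(words, minimum=3):
    """Keep words whose length is at least `minimum` (hypothetical helper)."""
    # Assumption: the comparison is inclusive, so a 3-character word
    # passes when minimum is 3 (the documented default).
    return [word for word in words if len(word) >= minimum]


print(filter_by_minimum(["a", "an", "the", "word"]))
# ['the', 'word']
print(filter_by_minimum(["a", "an", "the", "word"], minimum=4))
# ['word']
```

Setting a higher minimum in the parser configuration would, under these assumptions, exclude more short words from the collection.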