clean-word

The clean-word parser will load data from a column in a source data table and map values to other values before producing categorical variables within a collection.

The clean-word parser will tokenize text from a column into words, removing all non-alphabetical characters. The tokenized words will become the categorical variables within the collection.

Unlike the dirty-word parser, clean-word will not remove underscore characters during tokenization. The underscores will be converted back to space characters when the variable names are created.

This can be useful, for example, for creating variable names that contain both a first name and a last name, with a space in between.
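
The underscore handling described above can be sketched in Python. This is only an illustration, not the parser's actual implementation: the function name clean_word_tokens is hypothetical, and the sketch assumes that non-alphabetical characters act as separators between words and that letter case is left unchanged.

```python
import re


def clean_word_tokens(text):
    """Hypothetical sketch of clean-word style tokenization."""
    # Assumption: runs of ASCII letters and underscores form tokens;
    # every other character is treated as a separator and dropped.
    raw_tokens = re.findall(r"[A-Za-z_]+", text)
    # Underscores are turned into spaces when the variable name is created.
    return [token.replace("_", " ") for token in raw_tokens]


print(clean_word_tokens("John_Smith met Jane_Doe, twice!"))
# ['John Smith', 'met', 'Jane Doe', 'twice']
```

Under these assumptions, a source value such as John_Smith yields a single variable named "John Smith" rather than two separate variables.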

{
  "parser_type": "clean-word",
  "table_alias": "tata",
  "column": "cccc",
  "minimum": ####,
  "collection": "collection_name"
}

minimum

usage: optional
The minimum attribute defines the minimum length a word must have to be included. The default is 3 characters.
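
As a rough illustration of the minimum attribute, the following Python sketch filters a list of words by length. The function name filter_by_minimum is hypothetical, and the sketch assumes the comparison is inclusive, so a word of exactly the minimum length is kept.

```python
DEFAULT_MINIMUM = 3  # default value stated in the documentation


def filter_by_minimum(words, minimum=DEFAULT_MINIMUM):
    """Hypothetical sketch: keep words that are at least `minimum` characters long."""
    # Assumption: a word whose length equals `minimum` is included.
    return [word for word in words if len(word) >= minimum]


words = ["an", "owl", "flew", "by"]
print(filter_by_minimum(words))             # ['owl', 'flew']
print(filter_by_minimum(words, minimum=4))  # ['flew']
```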