---
title: "clean-word"
slug: "clean-word-parser"
updated: 2022-07-27T23:51:25Z
published: 2022-07-27T23:51:25Z
---

# clean-word

The `clean-word` **parser** will load data from a column in a source data table and map values to other values before producing categorical **variables** within a **collection**.

The `clean-word` **parser** will tokenize text from a column into words, removing all non-alphabetical characters. The tokenized words will become the categorical **variables** within the **collection**.

Unlike the `dirty-word` **parser**, `clean-word` will not remove underscore characters during tokenization. The underscores will be converted back to space characters when variable names are created.

This is useful, for example, for creating variable names that contain both a first name and a last name with a space in between: a source value of `Jane_Doe` yields the variable name `Jane Doe`.

```
{
  "parser_type": "clean-word",
  "table_alias": "tata",
  "column": "cccc",
  "minimum": ####,
  "collection": "collection_name"
}
```
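The tokenization described above can be illustrated with a minimal Python sketch. This is not the parser's actual implementation: the function name `clean_word_tokens` is hypothetical, and the sketch assumes that any run of characters other than letters and underscores acts as a word separator. It does not apply the `minimum` length filter.

```python
import re


def clean_word_tokens(text: str) -> list[str]:
    """Hypothetical sketch of clean-word style tokenization."""
    # Assumption: any run of characters that are neither letters nor
    # underscores separates one word from the next.
    raw_tokens = re.split(r"[^A-Za-z_]+", text)
    # Underscores survive tokenization and become spaces in the final
    # variable names; empty pieces from leading/trailing separators are dropped.
    return [token.replace("_", " ") for token in raw_tokens if token]


print(clean_word_tokens("Jane_Doe, John_Smith; 42 nurses"))
# ['Jane Doe', 'John Smith', 'nurses']
```

In this sketch, the underscore is what keeps `Jane_Doe` together as a single token, so the resulting variable name contains both names.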

## minimum

**usage: *optional*** The `minimum` attribute defines the minimum length of a word to be included. The default is 3 characters.
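
As a rough illustration of the length filter, the following Python sketch keeps only words that meet the threshold. The function name `filter_by_minimum` is hypothetical, and the sketch assumes that a word with exactly `minimum` characters is kept; the parser's actual boundary behavior is not stated here.

```python
def filter_by_minimum(words: list[str], minimum: int = 3) -> list[str]:
    """Hypothetical sketch of the `minimum` length filter."""
    # Assumption: a word whose length equals `minimum` is included.
    return [word for word in words if len(word) >= minimum]


words = ["an", "MRI", "of", "the", "knee"]
print(filter_by_minimum(words))             # ['MRI', 'the', 'knee']
print(filter_by_minimum(words, minimum=4))  # ['knee']
```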
