The config file is the focal point for loading & modeling data in a Data Product.
The config is a single text file, formatted as a JSON object, that instructs the Data Product how to load data from one or more delimited files (or SQL databases).
Naming convention
The config file is typically named config.json, and is located in the Data Product directory at the path config/config.json.
Overall schema
There are three primary attributes in the config file schema: the entity_table object, the other_tables array, and the parsers array. There are some other optional attributes as well, like data_dictionary.
Here's the simplest version of a config file. This version assumes that all table objects register their own parsers.
{
  // A table object, or a string reference to a file
  // containing a table object (best practice)
  //
  "entity_table": {...},
  // An array of table objects and / or string references
  // to files containing table objects (best practice)
  //
  "other_tables": [...]
}
Here's another variant where the config file registers parsers:
{
  "entity_table": {...},
  "other_tables": [...],
  // An array of parsers and / or string references
  // to files containing parsers (best practice)
  //
  "parsers": [...]
}
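As a rough illustration of the schema, here is a minimal Python sketch that checks the types of the top-level attributes described on this page. It is not part of the Data Product itself: the helper names strip_line_comments and check_config_shape are hypothetical, and treating entity_table as required while other_tables and parsers are optional is an assumption. The comment stripping only handles lines that consist entirely of a // comment, as in the examples here.

```python
import json


def strip_line_comments(text: str) -> str:
    """Drop lines that contain only a // comment, as used in the examples on this page."""
    kept = [line for line in text.splitlines() if not line.strip().startswith("//")]
    return "\n".join(kept)


def check_config_shape(config: dict) -> list:
    """Return a list of problems found in the top-level attribute types."""
    problems = []

    # Assumption: entity_table is required.
    entity = config.get("entity_table")
    if not isinstance(entity, (dict, str)):
        problems.append("entity_table should be a table object or a file path string")

    # Assumption: other_tables may be omitted.
    others = config.get("other_tables", [])
    if not isinstance(others, list) or not all(isinstance(t, (dict, str)) for t in others):
        problems.append("other_tables should be an array of table objects and/or file path strings")

    # parsers is either the string "auto" or an array of parsers / file path strings.
    parsers = config.get("parsers")
    if parsers is not None and parsers != "auto":
        if not isinstance(parsers, list) or not all(isinstance(p, (dict, str)) for p in parsers):
            problems.append('parsers should be "auto" or an array of parsers and/or file path strings')

    return problems


example_text = """
{
  // Table objects kept in their own files (best practice)
  //
  "entity_table": "config/tables/table_eeee.json",
  "other_tables": ["config/tables/table_oooo.json"],
  "parsers": "auto"
}
"""

config = json.loads(strip_line_comments(example_text))
print(check_config_shape(config))  # → []
```

A check like this only looks at the outer shape of the file. It does not open the referenced files or validate the table objects and parsers themselves.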
Attribute - entity_table
The value for the entity_table attribute is a table object, or a string reference to a file containing a table object (best practice). The table object must be in entity_table form.
{
  // An embedded table object, entity_table form
  //
  "entity_table": {...},
  ...
}

{
  // A string reference to a file containing a table object,
  // entity_table form (best practice)
  //
  "entity_table": "config/tables/table_eeee.json",
  ...
}
See the Data Configuration - Table objects page for detailed information about table objects and the entity_table form.
Attribute - other_tables
The value for the other_tables attribute is an array of table objects and / or string references to files containing table objects (best practice). The table objects listed must be in other_tables form.
{
  "entity_table": {...},
  "other_tables": [
    // An embedded table object, other_tables form
    //
    {...},
    // A string reference to a file containing a table object,
    // other_tables form (best practice)
    //
    "config/tables/table_oooo.json",
    ...
  ]
}
See the Data Configuration - Table objects page for detailed information about table objects and the other_tables form.
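To make the two accepted forms concrete (an embedded table object, or a string reference to a file), here is a minimal Python sketch of how a reader of the config might resolve them. The functions resolve_table and resolve_all_tables are hypothetical and not part of the Data Product. The sketch assumes that string references are relative to the Data Product directory and that the referenced files are plain JSON without comments.

```python
import json
from pathlib import Path


def resolve_table(entry, product_dir):
    """Return a table object from either an embedded object or a file reference."""
    if isinstance(entry, dict):
        # Embedded table object: use it as-is.
        return entry
    if isinstance(entry, str):
        # String reference: assumed to be a path relative to the Data Product directory.
        with open(Path(product_dir) / entry, encoding="utf-8") as handle:
            return json.load(handle)
    raise TypeError(f"expected a table object or a path string, got {type(entry).__name__}")


def resolve_all_tables(config, product_dir):
    """Resolve entity_table and every entry in other_tables."""
    entity = resolve_table(config["entity_table"], product_dir)
    others = [resolve_table(t, product_dir) for t in config.get("other_tables", [])]
    return entity, others
```

Calling resolve_all_tables(config, ".") from the Data Product directory would return the entity table object and a list of the other table objects, whichever mix of embedded objects and file references the config uses.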
Attribute - parsers
The parsers attribute is an array of parsers and / or string references to files containing parsers (best practice).
Each parser in the array will load a variable or a collection of variables into the Data Product, from either the entity_table, or from one of the other_tables.
To autodetect and load all columns from all tables, without transformation or renaming, set the value of parsers to "auto":
{
  "entity_table": {...},
  "other_tables": [...],
  // Will autogenerate parser functions for all tables
  //
  "parsers": "auto"
}
Auto-generation of parsers is typically not utilized for a mature Data Product, due to the flexibility and power of customized parsers. We can register customized parsers here in the config file, or from within each table object.
{
  "entity_table": {...},
  "other_tables": [...],
  "parsers": [
    // A path to a file with parsers
    // for entity_table eeee (best practice)
    //
    "config/parsers/parsers_eeee.json",
    // An embedded parser
    // for other_table oooo
    //
    {
      "parser_type": "categorical",
      "table_alias": "oooo",
      "column": "xxx1"
    },
    ...
  ]
}
See the Data Configuration - Parsers page for detailed information about all the parser options.
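When a parsers array mixes file references and embedded parsers, it can help to summarize what is registered where. The following Python sketch is one way to do that. The function summarize_parsers is hypothetical, and it assumes embedded parsers carry the table_alias and column attributes shown in the example above, which may not hold for every parser type.

```python
from collections import defaultdict


def summarize_parsers(parsers):
    """Split a parsers value into file references and embedded parsers grouped by table alias."""
    if parsers == "auto":
        # Parsers are autogenerated for all tables; nothing to list.
        return {"mode": "auto", "files": [], "embedded": {}}

    files = [p for p in parsers if isinstance(p, str)]
    embedded = defaultdict(list)
    for parser in parsers:
        if isinstance(parser, dict):
            # .get() is used because not every parser is known to have these attributes.
            embedded[parser.get("table_alias")].append(parser.get("column"))
    return {"mode": "custom", "files": files, "embedded": dict(embedded)}


parsers = [
    "config/parsers/parsers_eeee.json",
    {"parser_type": "categorical", "table_alias": "oooo", "column": "xxx1"},
]
print(summarize_parsers(parsers))
# → {'mode': 'custom', 'files': ['config/parsers/parsers_eeee.json'], 'embedded': {'oooo': ['xxx1']}}
```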
Attribute - data_dictionary
It's possible to have the config file automatically produce a data dictionary after loading data.
The value for the data_dictionary attribute is typically a file path to a .tsv file, e.g. data_dictionary.tsv. That file will contain a tab-delimited overview of all collections and variables created in the Data Product after the config file is processed.
{
  "entity_table": {...},
  "other_tables": [...],
  // The file that will contain the data_dictionary output
  //
  "data_dictionary": "data_dictionary.tsv"
}
Be careful with this option when working with sensitive / protected data. You may want to write the data_dictionary file outside the repo, or add it to the .gitignore. Alternatively, consider using the object form of data_dictionary, shown below, which supports a redact attribute that can prevent sensitive data from leaking into the file.
The value for data_dictionary can also be an object. In that case, you can specify additional useful attributes.
{
  "entity_table": {...},
  "other_tables": [...],
  "data_dictionary": {
    // The file that will contain the data_dictionary output
    //
    "file": "data_dictionary.tsv",
    // An array of collection names which should not have
    // their variable names (e.g. patient IDs) written to data_dictionary
    //
    "redact": [
      "collection1",
      ...
    ],
    // This will limit the number of variables listed for each collection
    //
    "variable_limit": 10
  }
}
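Since data_dictionary accepts either a string or an object, a tool that reads the config may want to reduce both forms to one shape. Here is a minimal Python sketch of that idea. The function normalize_data_dictionary is hypothetical, and the defaults it fills in (an empty redact list and no variable_limit) are assumptions rather than documented behavior.

```python
def normalize_data_dictionary(value):
    """Return the data_dictionary setting in object form, or None if it is not set."""
    if value is None:
        return None
    if isinstance(value, str):
        # String form: the value is the output file path.
        return {"file": value, "redact": [], "variable_limit": None}
    if isinstance(value, dict):
        # Object form: keep the three attributes described above.
        return {
            "file": value.get("file"),
            "redact": list(value.get("redact", [])),
            "variable_limit": value.get("variable_limit"),
        }
    raise TypeError("data_dictionary should be a file path string or an object")


print(normalize_data_dictionary("data_dictionary.tsv"))
# → {'file': 'data_dictionary.tsv', 'redact': [], 'variable_limit': None}
```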
Other attributes
{
  "entity_table": {...},
  "other_tables": [...],
  "parsers": [...],
  "data_dictionary": "data_dictionary.tsv",
  // These are attributes for a table object that can
  // be set here and applied globally for all tables
  //
  "lines": ####,
  "entities": ####,
  "random": 0.###,
  "seed": ####,
  // These are attributes for a parser function that can
  // be set here and applied globally for all parsers
  //
  "null_indicators": [...]
}
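One way to picture "applied globally" is as a set of defaults copied onto each table object or parser. The Python sketch below illustrates that reading. The function apply_globals is hypothetical, the values are made up, and the precedence it uses (a value set on the table or parser itself overrides the global one) is an assumption that this page does not confirm.

```python
# Attribute names taken from the example above.
TABLE_KEYS = ("lines", "entities", "random", "seed")
PARSER_KEYS = ("null_indicators",)


def apply_globals(config, obj, keys):
    """Copy global settings onto obj unless obj already sets them (assumed precedence)."""
    merged = {key: config[key] for key in keys if key in config}
    merged.update(obj)
    return merged


# Hypothetical values, for illustration only.
config = {"lines": 1000, "seed": 42, "null_indicators": ["NA", ""]}

print(apply_globals(config, {"seed": 7}, TABLE_KEYS))
# → {'lines': 1000, 'seed': 7}
print(apply_globals(config, {"parser_type": "categorical"}, PARSER_KEYS))
# → {'null_indicators': ['NA', ''], 'parser_type': 'categorical'}
```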