config.json

The config file is the central place where data loading and modeling are defined for a Data Product.

The config is a single text file, formatted as a JSON object, that instructs the Data Product how to load data from one or more delimited files (or SQL databases).

Naming convention

The config file is typically named config.json and is located at config/config.json within the Data Product directory.

Overall schema

There are three primary attributes in the config file schema: the entity_table object, the other_tables array, and the parsers array. Other attributes, such as data_dictionary, are optional.

Here's the simplest version of a config file. This version assumes that all table objects register their own parsers.

{
  // A table object, or a string reference to a file 
  // containing a table object (best practice)
  //
  "entity_table": {...}, 
  
  // An array of table objects and / or string references 
  // to files containing table objects (best practice)
  //
  "other_tables": [...]
}

Here's another variant where the config file registers parsers:

{
  "entity_table": {...}, 
  "other_tables": [...], 
  
  // An array of parsers and / or string references
  // to files containing parsers (best practice)
  //
  "parsers": [...]
}

Attribute - entity_table

The value for the entity_table attribute is a table object, or a string reference to a file containing a table object (best practice). The table object must be in entity_table form.

{
  // An embedded table object, entity_table form
  //
  "entity_table": {...},
  
  ...
}

{
  // A string reference to a file containing a table object, 
  // entity_table form (best practice)
  //
  "entity_table": "config/tables/table_eeee.json",
  
  ...
}

See the Data Configuration - Table objects page for detailed information about table objects and the entity_table form.

Attribute - other_tables

The value for the other_tables attribute is an array of table objects and / or string references to files containing table objects (best practice). The table objects listed must be in other_tables form.

{
  "entity_table": {...}, 
  
  "other_tables": [
  
    // An embedded table object, other_tables form
    //
    {...}, 

    // A string reference to a file containing a table object,
    // other_tables form (best practice)
    //
    "config/tables/table_oooo.json",
    
    ...
  ]
}

See the Data Configuration - Table objects page for detailed information about table objects and the other_tables form.
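
Both entity_table and other_tables accept either an embedded table object or a string reference to a file. The following Python sketch shows one way a helper script might resolve both forms into table objects. It is not the Data Product's own loader: the function names resolve_table and resolve_tables are hypothetical, and it assumes that string references are paths relative to the Data Product directory, that the referenced files are plain JSON without // comments, and that a missing other_tables can be treated as an empty array.

```python
import json
from pathlib import Path


def resolve_table(value, product_dir="."):
    """Return a table object from an embedded object or a file reference.

    Assumption: string references are paths relative to product_dir and
    point to plain JSON files (no // comments).
    """
    if isinstance(value, dict):
        # Embedded table object: use it as-is.
        return value
    if isinstance(value, str):
        # String reference: read the table object from the referenced file.
        with open(Path(product_dir) / value, encoding="utf-8") as handle:
            return json.load(handle)
    raise TypeError(
        f"expected a table object or a file path, got {type(value).__name__}"
    )


def resolve_tables(config, product_dir="."):
    """Resolve entity_table and every entry of other_tables."""
    entity = resolve_table(config["entity_table"], product_dir)
    # Assumption: a missing other_tables is treated as an empty array.
    others = [
        resolve_table(item, product_dir)
        for item in config.get("other_tables", [])
    ]
    return entity, others
```

Keeping table objects in separate files, as the best-practice comments above suggest, means the config file itself stays short and each table can be reviewed on its own.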

Attribute - parsers

The parsers attribute is an array of parsers and / or string references to files containing parsers (best practice).

Each parser in the array loads a variable or a collection of variables into the Data Product, from either the entity_table or one of the other_tables.

To autodetect and load all columns from all tables, without transformation or renaming, set the value of parsers to "auto":

{
  "entity_table": {...}, 
  "other_tables": [...],
  
  // Will autogenerate parser functions for all tables
  //
  "parsers": "auto" 
}

Mature Data Products typically do not use auto-generated parsers, because customized parsers are more flexible and powerful. Customized parsers can be registered here in the config file, or from within each table object.

{
  "entity_table": {...}, 
  "other_tables": [...],
  
  "parsers": [
  
    // A path to a file with parsers 
    // for entity_table eeee (best practice)
    //
    "config/parsers/parsers_eeee.json",

    // An embedded parser
    // for other_table oooo
    //
    {
      "parser_type": "categorical",
      "table_alias": "oooo",
      "column": "xxx1"
    },
    
    ...
  ]
}

See the Data Configuration - Parsers page for detailed information about all the parser options.
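
The parsers value therefore takes one of two shapes: the string "auto", or an array whose entries are file references or embedded parsers. As a rough illustration, the Python sketch below classifies a parsers value accordingly. The function name describe_parsers and the ("file", ...) / ("embedded", ...) labels are hypothetical and are not part of the Data Product; the sketch only checks the shape of the value and does not validate parser options.

```python
def describe_parsers(value):
    """Classify a parsers value by shape.

    Returns the string "auto", or a list of (kind, entry) pairs where
    kind is "file" for a string reference and "embedded" for an object.
    """
    if value == "auto":
        # Parsers will be autogenerated for all tables.
        return "auto"
    if not isinstance(value, list):
        raise TypeError('parsers must be "auto" or an array')

    entries = []
    for item in value:
        if isinstance(item, str):
            entries.append(("file", item))       # path to a file with parsers
        elif isinstance(item, dict):
            entries.append(("embedded", item))   # parser defined inline
        else:
            raise TypeError(
                f"unexpected parsers entry of type {type(item).__name__}"
            )
    return entries
```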

Attribute - data_dictionary

The config file can instruct the Data Product to produce a data dictionary automatically after loading data.

The value for the data_dictionary attribute is typically a file path to a .tsv file, e.g. data_dictionary.tsv. That file will contain a tab-delimited overview of all collections and variables created in the Data Product after the config file is processed.

{
  "entity_table": {...}, 
  "other_tables": [...],
  
  // The file that will contain the data_dictionary output
  //
  "data_dictionary": "data_dictionary.tsv"
}

Be careful with this option when working with sensitive or protected data. Make sure that the data_dictionary file is written outside the repo, or that it is added to .gitignore. Alternatively, consider using the object form of data_dictionary, shown below, whose redact attribute can prevent sensitive data from leaking into the file.

The value for data_dictionary can also be an object. In that case, you can specify additional useful attributes.

{
  "entity_table": {...}, 
  "other_tables": [...],
  
  "data_dictionary": {
  
    // The file that will contain the data_dictionary output
    //
    "file": "data_dictionary.tsv",
  
    // An array of collection names which should not have 
    // their variable names (e.g. patient IDs) written to data_dictionary
    //
    "redact": [
      "collection1",
      
      ...
    ],
  
    // This will limit the number of variables listed for each collection
    //
    "variable_limit": 10
  }
}
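
Because data_dictionary accepts both a string and an object, a script that inspects config files may find it convenient to convert the string form to the object form first. The Python sketch below does that. The function name normalize_data_dictionary is hypothetical, and the fallback values used when redact or variable_limit are absent (an empty list and None) are placeholders chosen for this illustration, not documented defaults of the Data Product.

```python
def normalize_data_dictionary(value):
    """Return the object form of a data_dictionary value.

    Assumption: the fallbacks for redact ([]) and variable_limit (None)
    are placeholders for this sketch, not documented defaults.
    """
    if isinstance(value, str):
        # String form: the value is just the output file path.
        return {"file": value, "redact": [], "variable_limit": None}
    if isinstance(value, dict):
        # Object form: keep the given attributes, fill in the rest.
        return {
            "file": value["file"],
            "redact": list(value.get("redact", [])),
            "variable_limit": value.get("variable_limit"),
        }
    raise TypeError("data_dictionary must be a file path or an object")
```

With both forms reduced to one shape, a check such as "is any collection redacted?" can be written once, for example by testing whether the "redact" list is non-empty.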

Other attributes

{
  "entity_table": {...}, 
  "other_tables": [...], 
  "parsers": [...], 
  "data_dictionary": "data_dictionary.tsv",
  
  // These are attributes for a table object that can 
  // be set here and applied globally for all tables
  //
  "lines": ####,
  "entities": ####,
  "random": 0.###,
  "seed": ####,
  
  // These are attributes for a parser function that can 
  // be set here and applied globally for all parsers
  //
  "null_indicators": [...]
}