Tables

A table object provides the basis for mapping data from a delimited file or database table into a Data Product.

Table objects inform the system how to access and process the lines or rows in each data source, and then match those lines with entities in the Data Product.

{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/parsers_tttt.json"
}

Table objects do not load data into a Data Product by themselves. Data loading is performed by the parsers associated with each table object.

There are two distinct forms of table object: the entity_table and other_tables.

Each table object is typically isolated as a single file conventionally named tablealias_table.json, where you replace tablealias with the value of the table_alias attribute within the table object. You will find these files located within a project at ./config/tables/.

Expanded options

{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "delimiter": "\t",
  "include": {...},
  "exclude": {...},
  "joins": [...],
  "columns": [
    "ccc1",
    "ccc2"
  ],
  "transform": "uppercase",
  "line_separator": "\r\n",
  "max_columns": ####,
  "max_column_size": ####,
  "null_indicators": [
    "iiii"
  ],
  "groups": [
    "gggg"
  ],
  "lines": ####,
  "random": 0.###,
  "report_lines": #### 
}

table / tables / tables_dir

usage: required
Every table object must have one of the table, tables, or tables_dir attributes. These attributes tell the system which source data table(s) the table object will process.

{
  "table": "~/dir_with_source_data/tttt.csv"
}

In the case where source data is spread over multiple tables, the table attribute can be replaced with a tables array or the tables_dir attribute.

All the source data tables listed within the tables array or tables_dir path must conform to the same schema.

{
  "tables": [
    "~/dir_with_source_data/tttt1.csv", 
    "~/dir_with_source_data/tttt2.csv",
    ...
  ]
}

In the case where source data is spread over multiple tables in the same directory, you can specify tables_dir to load all non-hidden files from that directory.

{
  "tables_dir": "~/dir_with_source_data/"
}

table_alias

usage: required
The table_alias attribute gives the table a namespace element through which parsers are attached.

parsers

usage: required
Table objects register their own parsers via the parsers attribute, shown in the examples above as a path to a parsers file. There are many tools available for crafting parsers specific to a Data Product.

See the parsers reference for further details.

delimiter

usage: optional
The delimiter attribute overrides the system default, which assumes comma-delimited source data.
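
As a minimal sketch, a table object for a tab-delimited source might look like the following. The "\t" value is the one shown in the expanded options above; the tttt.tsv file name is only a placeholder.

```json
{
  "table": "~/dir_with_source_data/tttt.tsv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "delimiter": "\t"
}
```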

extension

usage: optional
When using the tables_dir attribute, you can also specify an extension to load only files of that file type.
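
A sketch of tables_dir combined with extension is shown below. The value "csv" is an assumption made for illustration; this page does not state whether the extension is written with or without a leading dot.

```json
{
  "tables_dir": "~/dir_with_source_data/",
  "extension": "csv"
}
```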

include

usage: optional
The include attribute is a boolean-type parser object used to restrict which rows will be parsed. If the parser evaluates a row and returns true, the row is included in the Data Product.

exclude

usage: optional
The exclude attribute is a boolean-type parser object used to restrict which rows will be parsed. If the parser evaluates a row and returns true, the row is excluded from the Data Product.

joins

usage: optional
The joins attribute is an array of table objects, or references to files containing table objects.

Data from these joined tables will be appended to each line of the parent table for processing by parsers.

See the joins reference for further details.
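
The two forms a joins entry can take might be sketched as follows: one inline table object and one reference to a file containing a table object. The jjjj and kkkk names are hypothetical placeholders, and any attributes that control how joined rows are matched to the parent table are deliberately left out, since they belong to the joins reference rather than this page.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "joins": [
    {
      "table": "~/dir_with_source_data/jjjj.csv",
      "table_alias": "jjjj",
      "parsers": "config/parsers/jjjj_parsers.json"
    },
    "config/tables/kkkk_table.json"
  ]
}
```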

columns

usage: optional
The columns attribute is an array of strings naming the columns in the source table to include during data loading.
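
For example, using the placeholder column names from the expanded options above, a table object that includes only ccc1 and ccc2 during data loading might look like this:

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "columns": [
    "ccc1",
    "ccc2"
  ]
}
```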

transform

usage: optional
The transform attribute is used to convert the key provided in id_columns to uppercase or lowercase.

line_separator

usage: optional
The line_separator attribute tells the CSV parser which line terminator to expect, for data sources with unusual line endings.

max_columns

usage: optional
The max_columns attribute overrides the CSV loading buffer, which by default loads only the first 16k columns of a source data file.

max_column_size

usage: optional
The max_column_size attribute overrides the CSV loading buffer's default limit on the string length of each parsed value within a column.

null_indicators

usage: optional
The null_indicators attribute is commonly used for individual parsers, but can also be set to apply across all parsers of a table object.

groups

usage: optional
The groups attribute is commonly used for individual parsers, but can also be set to apply across all parsers of a table object.

lines

usage: optional
The lines attribute is used to restrict the number of lines, or rows, to ingest from the source data.

random

usage: optional
The random attribute restricts ingestion to a randomly chosen proportion of the lines, or rows, in the source data. The value provided must be a decimal preceded by 0. For example, an 80% random sampling would be 0.8.

seed

usage: optional
When using the random attribute, the seed attribute fixes the randomization for reproducibility.
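
A reproducible 80% sample might be configured as in the sketch below. The seed value 42 is arbitrary, and its integer type is an assumption, since this page does not specify the accepted format.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "random": 0.8,
  "seed": 42
}
```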

report_lines

usage: optional
The report_lines attribute overrides the frequency of server logs during the data loading process.