A table object provides the basis for mapping data from a delimited file or database table into a Data Product.
Table objects inform the system how to access and process the lines or rows in each data source, and then match those lines with entities in the Data Product.
{
"table": "~/dir_with_source_data/tttt.csv",
"table_alias": "tttt",
"parsers": "config/parsers/tttt_parsers.json"
}
Table objects do not load data into a Data Product by themselves; data loading is performed by the parsers associated with each table object.
There are two distinct forms of table object: the entity_table and other_tables.
Each table object is typically isolated in a single file conventionally named tablealias_table.json, where you replace tablealias with the value of the table_alias attribute within the table object. You will find these files within a project at ./config/tables/.
expanded options
{
"table": "~/dir_with_source_data/tttt.csv",
"table_alias": "tttt",
"parsers": "config/parsers/tttt_parsers.json",
"delimiter": "\t",
"include": {...},
"exclude": {...},
"joins": [...],
"columns": [
"ccc1",
"ccc2"
],
"transform": "uppercase",
"line_separator": "\r\n",
"max_columns": ####,
"max_column_size": ####,
"null_indicators": [
"iiii"
],
"groups": [
"gggg"
],
"lines": ####,
"random": 0.###,
"report_lines": ####
}
table / tables / tables_dir
usage: required
All table objects are required to have either the table, tables, or tables_dir attribute. These attributes inform the system which source data table(s) will be processed by the table object.
{
"table": "~/dir_with_source_data/tttt.csv"
}
When source data is spread over multiple tables, the table attribute can be replaced with a tables array or the tables_dir attribute.
All the source data tables listed within the tables array or tables_dir path must conform to the same schema.
{
"tables": [
"~/dir_with_source_data/tttt1.csv",
"~/dir_with_source_data/tttt2.csv",
...
]
}
In the case where source data is spread over multiple tables in the same directory, you can specify tables_dir to load all non-hidden files from that directory.
{
"tables_dir": "~/dir_with_source_data/"
}
table_alias
usage: required
The table_alias attribute establishes a namespace element by which parsers will be attached.
parsers
usage: required
Table objects register their own parsers via the parsers array. There are many tools available for crafting parsers specific to a Data Product.
See the parsers reference for further details.
delimiter
usage: optional
The delimiter attribute overrides the system default delimiter, a comma, for source data ingestion.
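As a minimal sketch, a table object for a tab-delimited source file might set delimiter as follows. The file name and parsers path are placeholders in the style of the other examples on this page, not real project files.

```json
{
  "table": "~/dir_with_source_data/tttt.tsv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "delimiter": "\t"
}
```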
extension
usage: optional
When using the tables_dir attribute, you can also specify an extension to only load files of the provided file type.
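The sketch below combines tables_dir with extension to restrict loading to one file type. The value format shown (a bare csv, without a leading dot) is an assumption, since this page does not specify it; verify it against your installation before relying on it.

```json
{
  "tables_dir": "~/dir_with_source_data/",
  "extension": "csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json"
}
```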
include
usage: optional
The include attribute is a boolean-type parser object used to restrict which rows will be parsed. If the boolean-type parser evaluates a row and returns true, the row will be included within the Data Product.
exclude
usage: optional
The exclude attribute is a boolean-type parser object used to restrict which rows will be parsed. If the boolean-type parser evaluates a row and returns true, the row will be excluded from the Data Product.
joins
usage: optional
The joins attribute is an array of table objects, or references to files containing table objects.
Data from these joined tables will be appended to each line of the parent table for processing by parsers.
See the joins reference for further details.
columns
usage: optional
The columns attribute is an array of strings that reference the column names in the source table to include during data loading.
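For illustration, a table object that loads only two columns from its source might look like the following. The column names ccc1 and ccc2 are the same placeholders used in the expanded options listing.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "columns": [
    "ccc1",
    "ccc2"
  ]
}
```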
transform
usage: optional
The transform attribute is used to convert the key provided in id_columns to uppercase or lowercase.
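A minimal sketch of a table object that normalizes its keys to uppercase is shown below. The id_columns attribute that transform acts on is not shown in this page's examples, so it is omitted here rather than guessed.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "transform": "uppercase"
}
```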
line_separator
usage: optional
The line_separator attribute instructs the CSV parser to observe a specified line terminator, for data sources with unusual line endings.
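For example, a source file written with carriage return plus line feed (CRLF) endings could be described roughly as follows. The paths are placeholders.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "line_separator": "\r\n"
}
```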
max_columns
usage: optional
The max_columns attribute will override the CSV loading buffer, which has a default restriction to load the first 16k columns of a source data file.
max_column_size
usage: optional
The max_column_size attribute will override the CSV loading buffer, which has a default restriction on the string length of each parsed value within a column.
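The two buffer overrides can be sketched together for a very wide file with long values. Both numbers below are arbitrary illustrative values, not recommended settings, and the unit of max_column_size is assumed to be characters based on the "string length" wording above.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "max_columns": 20000,
  "max_column_size": 100000
}
```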
null_indicators
usage: optional
The null_indicators attribute is commonly used for individual parsers, but can also be set to apply across all parsers of a table object.
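As a sketch, setting null_indicators at the table level might look like this. The strings NA and NULL are illustrative guesses at values a source file might use to mark missing data; substitute whatever your source actually contains.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "null_indicators": [
    "NA",
    "NULL"
  ]
}
```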
groups
usage: optional
The groups attribute is commonly used for individual parsers, but can also be set to apply across all parsers of a table object.
lines
usage: optional
The lines attribute is used to restrict the number of lines, or rows, to ingest from the source data.
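For instance, a quick test load could cap ingestion at a fixed number of rows. The value 1000 is arbitrary and the paths are placeholders.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "lines": 1000
}
```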
random
usage: optional
The random attribute is used to proportionally restrict the number of lines, or rows, to randomly ingest from the source data. The value provided must be a decimal preceded by 0. For example, an 80% random sampling would be 0.8.
seed
usage: optional
When using the random attribute, the seed attribute will fix the randomization for reproducibility.
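A repeatable 80% sample could be sketched as follows. The seed value 42 is arbitrary, and an integer type is assumed because this page does not state which value types seed accepts.

```json
{
  "table": "~/dir_with_source_data/tttt.csv",
  "table_alias": "tttt",
  "parsers": "config/parsers/tttt_parsers.json",
  "random": 0.8,
  "seed": 42
}
```

Keeping the same seed between runs should select the same sample, which is useful when comparing parser changes against identical input.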
report_lines
usage: optional
The report_lines attribute overrides the frequency of server logs during the data loading process.