The Tag.bio engine works by performing calculations on a small number of basic data structures. It's important to have a working understanding of the terminology we use to describe these components.
Don't have formal training in programming? That's okay!
Once you have these basic concepts down, you'll be crafting analysis apps in no time.
data product (aka fc or flux-capacitor)
The Data Product is a self-contained suite of analysis apps for a data set. The magic of Tag.bio is that Data Products are not isolated! Data Products can reach out to other Data Products for expanded analyses.
entities
An entity is a singular, identifiable unit. This could be an account or record number, the identifier for a sample, or even the name of a football team. An entity ties all of its associated variables across collections back to one specific point.
collections
A collection is a container for related variables. If you were to think of this spatially, consider a spreadsheet or a dataframe. The collection would be the label used in the header row, and it would contain all of the unique variables listed therein.
variables
Variables are the values stored within a collection. Each variable is a data point referenced by a specific entity.
categorical and numeric
Variables and collections contain two primary data types: categorical and numeric.
Categorical collections and variables contain data that is frequently textual in nature. Categorical data is information to be grouped together when analyzed.
Numeric collections and variables contain raw numbers to which mathematical computations can be applied.
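As an illustration only, the relationship between entities, collections, and variables can be sketched with plain Python dictionaries. The sample data and the names `collections`, `is_numeric`, and `variables_for` are hypothetical; this is not how the Tag.bio engine stores data internally.

```python
# Hypothetical sample data: each collection maps an entity to its variable.
collections = {
    "team_conference": {   # categorical collection (values are grouped)
        "team_a": "East",
        "team_b": "West",
        "team_c": "East",
    },
    "points_scored": {     # numeric collection (values are computed on)
        "team_a": 24.0,
        "team_b": 17.0,
        "team_c": 31.0,
    },
}


def is_numeric(collection):
    """Treat a collection as numeric when every variable is a number."""
    return all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in collection.values()
    )


def variables_for(entity):
    """Gather the variables that one entity has across all collections."""
    return {
        name: values[entity]
        for name, values in collections.items()
        if entity in values
    }


print(variables_for("team_a"))
```

Here the entity `team_a` is the key that links a categorical variable ("East") and a numeric variable (24.0) stored in two different collections.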
protocol (aka analysis app)
A protocol is structured code used to create an interface to explore a question. The code for a protocol comprises everything about a statistical query.
Examples include: attributes to customize the text shown to the user, the collections used as part of an analysis, the way the user would want to define the background and focus of the analysis, and even the components of the visualizations provided in response to a query.
Protocols harness the power of the Tag.bio engine in a manner that makes exploratory analysis genuine and tailored to the needs of the user.
background and focus
In comparative analysis, the goal is to take a subset of the data, defined by a specific factor, and compare it to a larger part of the data. In Tag.bio, we've simplified this concept into two primary buckets: background and focus.
The background describes all of the data being used as part of the analysis, and that includes the focus.
The focus is a subset of the background as reduced by a specific factor. These concepts are comparable to the population and sample as used in statistics.
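The background and focus relationship can be sketched in a few lines of Python. This is a minimal illustration with made-up records and hypothetical field names (`tissue`, `age`, `expression`); it does not represent the Tag.bio engine or its protocol syntax.

```python
from statistics import mean

# Hypothetical records: one dictionary per entity.
records = [
    {"entity": "s1", "tissue": "lung", "age": 61, "expression": 8.0},
    {"entity": "s2", "tissue": "lung", "age": 47, "expression": 6.0},
    {"entity": "s3", "tissue": "liver", "age": 55, "expression": 3.0},
    {"entity": "s4", "tissue": "liver", "age": 70, "expression": 5.0},
    {"entity": "s5", "tissue": "lung", "age": 52, "expression": 10.0},
]

# Background: all of the data used in the analysis (here, age 50 or older).
background = [r for r in records if r["age"] >= 50]

# Focus: a subset of the background, reduced by one specific factor.
focus = [r for r in background if r["tissue"] == "lung"]

# Compare the focus against the background, which still includes the focus.
background_mean = mean(r["expression"] for r in background)
focus_mean = mean(r["expression"] for r in focus)

print(f"background: n={len(background)}, mean={background_mean}")
print(f"focus:      n={len(focus)}, mean={focus_mean}")
```

Because the focus is filtered from the background rather than from the full record list, every focus entity is guaranteed to be part of the background as well.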
arguments
An argument is structured code used to provide a selection interface for the user to configure the specifics of a protocol.
The variables selected by a user within an argument can be passed to a protocol to define the background, the focus, and the analysis variables, along with many other configuration features.
analysis variables
The analysis variables array is used to define what is to be analyzed.
For a summary protocol, this would be the data summarized, while in a comparison protocol, this is the data analyzed in relation to the specifications for the background and the focus.
Tag.score
The Tag.score is how the Tag.bio engine represents the p-value, a measure of the strength of evidence against the null hypothesis.
The Tag.score is calculated by taking the negative natural log of the p-value: -ln(p-value). A smaller p-value therefore yields a larger Tag.score.
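The transformation can be checked with a short Python sketch. The function name `tag_score` is hypothetical and is not part of any Tag.bio API; the code only applies the -ln(p-value) formula stated above.

```python
import math


def tag_score(p_value):
    """Return the negative natural log of a p-value."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must be greater than 0 and at most 1")
    return -math.log(p_value)


for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: Tag.score = {tag_score(p):.2f}")
# p = 0.05: Tag.score = 3.00
# p = 0.01: Tag.score = 4.61
# p = 0.001: Tag.score = 6.91
```

Under this formula the conventional significance threshold of p = 0.05 corresponds to a Tag.score of roughly 3.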
config
The config file points to the table configuration and parsers of a Data Product.
archive
The archive is the data as read into a Data Product and stored in a specialized format that enables fast calculations.
The archive is composed of collections produced by the parsers referenced in the config. It becomes part of the Data Product so that analyses run against consistent data.
parsers
Parsers are functions written to take in, shape, and write data to the archive.