Terminology

ACID (Atomicity, Consistency, Isolation and Durability) : Properties required to maintain database integrity.

Anti-patterns : An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counter-productive.

Entity Value Anti-Pattern : A design that tries to accommodate a lack of specifications by creating a table to store name-value pairs. Where the problem being solved is attribute volatility, consider having as solid a model as possible and using JSON as an extension point.
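
The "solid model plus JSON extension point" alternative can be sketched with SQLite (table and column names here are illustrative, not from any particular system):

```python
import json
import sqlite3

# Well-understood attributes get real, queryable columns; volatile
# attributes go into a single JSON column instead of an EAV table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE instrument (
        isin  TEXT PRIMARY KEY,   -- solid, modelled attributes
        name  TEXT NOT NULL,
        extra TEXT                -- JSON extension point for volatile attributes
    )
""")
conn.execute(
    "INSERT INTO instrument VALUES (?, ?, ?)",
    ("GB0002634946", "BAE Systems", json.dumps({"sector": "Defence"})),
)
row = conn.execute(
    "SELECT name, extra FROM instrument WHERE isin = ?",
    ("GB0002634946",),
).fetchone()
extra = json.loads(row[1])
print(row[0], extra["sector"])  # BAE Systems Defence
```

The core model stays strongly typed and indexable, while the JSON column absorbs the attributes that would otherwise have ended up as name-value rows.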

ID Anti-pattern : The ID anti-pattern is having a unique ID column in each table without requiring (or often discouraging) the application of alternate unique keys. The ID is generated for each new record. This avoids the crucial question of primary key design and fosters repeated insertion of contradictory redundant data. It makes database designers tear their hair out.

Atomicity : An atomic transaction is an indivisible series of database operations that either all occur or none occur.

Big Data : data sets having too high a velocity, volume, or variety to process in relational databases (see V's below)

[alt : a mis-used term used to describe any data that developers want to develop using new technologies to enhance their CV]

Business Intelligence (BI) : A set of strategies and technologies for analysing data and presenting actionable information which helps executives, managers and other corporate end users make informed business decisions. BI applications usually gather data from a Data Warehouse or Data Mart (see Data Warehouse and Data Mart)

Canonicalisation : process of converting data that has more than one possible representation into a "standard", "normal", or "canonical form" (see https://en.wikipedia.org/wiki/Canonical_form)
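
A small example of canonicalisation: two strings that look identical but have different byte representations are reduced to one canonical (NFC) Unicode form before comparison:

```python
import unicodedata

a = "café"            # precomposed: single code point U+00E9
b = "cafe\u0301"      # decomposed: 'e' followed by a combining acute accent
assert a != b         # different representations of the same text...

canon_a = unicodedata.normalize("NFC", a)
canon_b = unicodedata.normalize("NFC", b)
print(canon_a == canon_b)  # ...one canonical form: True
```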

Cache : hardware or software that is used to store data temporarily in a computing environment. It is often a small amount of faster, more expensive memory used to improve the performance of recently or frequently accessed data.

Cache (read cache) : a store of data that enables future requests for that data to be served faster; the data stored would either be a copy of data stored elsewhere or the result of an earlier computation elsewhere
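
A read cache in one line of Python, using `functools.lru_cache` (the lookup function is a stand-in for a slow backing store):

```python
from functools import lru_cache

calls = 0

@lru_cache(maxsize=128)
def expensive_lookup(key):
    """Stand-in for a slow backing-store read."""
    global calls
    calls += 1
    return key.upper()

expensive_lookup("abc")   # miss: hits the backing store
expensive_lookup("abc")   # hit: served from the cache
print(calls)              # 1 -- the second request never reached the store
```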

Cache (write cache) : a small preliminary store of data that enables faster writes than the underlying larger backing store

CAP Theorem (Theorem - "general proposition proved by a chain of reasoning") : a distributed data store cannot provide more than two out of the three features : Consistency, Availability, Partition Tolerance. (see https://en.wikipedia.org/wiki/CAP_theorem)

CDC : Change Data Capture : a set of software patterns used to track that data has changed.

Column Store : a database that serialises data by column (cf row based database). Column Store databases include Sybase IQ and Snowflake
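
The difference between the two serialisation orders can be shown with the same three records laid out both ways (a toy illustration, not how any real engine stores pages):

```python
records = [("alice", 100), ("bob", 250), ("carol", 50)]

# Row order: each record's fields are stored together.
row_store = [field for record in records for field in record]
# -> ['alice', 100, 'bob', 250, 'carol', 50]

# Column order: all names together, then all balances -- good for
# scanning or compressing a single column at a time.
column_store = [rec[i] for i in range(2) for rec in records]
# -> ['alice', 'bob', 'carol', 100, 250, 50]

print(row_store)
print(column_store)
```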

Consistency : (overloaded)

Consistency (ACID) : any transaction will not violate a desired integrity state

Consistency (CAP) : atomic consistency of a data operation

Consistency (distributed databases) : atomic consistency (aka linearisability), strong consistency, immediate consistency, external consistency

Copy-on-write : a method to efficiently implement a "duplicate" or "copy" operation on modifiable resources. If data is "duplicated" but not modified, no actual copy is made.
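
A minimal copy-on-write sketch (the `CowList` class is illustrative): "copies" share one underlying list until the first modification, at which point a real copy is made.

```python
class CowList:
    def __init__(self, data):
        self._data = data        # possibly shared buffer
        self._owned = False      # True once we hold a private copy

    def copy(self):
        self._owned = False      # both sides now share the buffer
        return CowList(self._data)   # cheap: no data copied yet

    def set(self, i, value):
        if not self._owned:          # first write: copy now
            self._data = list(self._data)
            self._owned = True
        self._data[i] = value

    def get(self, i):
        return self._data[i]

original = CowList([1, 2, 3])
dup = original.copy()            # no actual copy is made
dup.set(0, 99)                   # the write triggers the real copy
print(original.get(0), dup.get(0))  # 1 99
```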

Data (cooked) : information you have received and cleansed and/or transformed.

Data (raw) : information you have received and not cleansed or transformed.

Data (semi-structured) : data that includes meta-data or tags to describe the information (eg html).[JSON and XML can be types of semi-structured data but some sites disagree and call it structured as it conforms to a schema]

Data (structured) : information that can be stored in a schema.

Data (un-structured) : information that doesn't have a fixed understood schema (eg a text document).

Data Quality : measure of the accuracy, completeness, consistency and reliability of data

Data Virtualisation : a system to provide a unified view of data across disparate sources and formats without replicating the data. (often implemented over a Data Lake)

Data Lake : (an evolving term) Originally used to refer to a store of raw data. Data can be structured, semi-structured or unstructured. Data is loaded un-cleansed from its source. (https://www.talend.com/resources/data-lake-vs-data-warehouse/). Some are now starting to use the term to mean a centralised repository of structured, semi-structured and unstructured data (https://cloud.google.com/learn/what-is-a-data-lake)

(Not to be confused with Azure Data Lake, which is a Hadoop implementation of a data store). The opposite of a Data Warehouse. See Data Swamp

Data Mart : a store of structured, cleansed data used within a single department or used in a single application. (see Data Warehouse)

Data Mastering : a process where data sources are linked or merged into another data record (data master). (see MDM)

Data Mesh : a paradigm for developing data-intensive systems and platforms: a componentised data infrastructure with components owned and operated by domain owners

Data Mining : a process of analysing large data set for patterns

Data Modeler : someone who curates data structures in a database

Data Swamp : a badly curated Data Lake

Data Warehouse : a store of structured, cleansed data either from multiple departments or used across multiple applications. The term is often used interchangeably with Data Mart (but shouldn't be). A Data Warehouse is the opposite of a Data Lake.

Database : some data held on a computer

Database (Document) : some structured data held on a computer, where data can be organised into (usually) JSON formats

Database (Graph) : some structured data held on a computer, where data is organised into a graph for efficient storage

Database (Relational) : some data held on a computer, where data is strictly organised into different but related entities for efficient storage. Entities are usually stored as rows or columns (but some modern DB's allow complex datatypes such as arrays and JSON)

ETL/ELT : A system for loading source data and performing some cleansing/validation on it. ETL stands for Extract, Transform, Load. Extract is the data grab (which could be a data request, file grab or database select). Transform is the data manipulation to convert data from one schema to another. Load is loading the data into your storage system. ELT performs the same steps but loads the raw data first and transforms it inside the target store. ETL systems are usually third party systems (eg Informatica, EDM) whereas in-house systems are more often ELT.
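
The three steps can be sketched as plain functions (the CSV source and dict target schema are hypothetical):

```python
def extract():
    # Stand-in for a file grab or database select.
    return ["alice,100", "bob,250"]

def transform(raw_rows):
    # Map the source schema (csv line) to the target schema (dict).
    return [{"name": n, "balance": int(b)}
            for n, b in (row.split(",") for row in raw_rows)]

def load(rows, store):
    store.extend(rows)

warehouse = []
load(transform(extract()), warehouse)   # ETL ordering; ELT would load first
print(warehouse[0])  # {'name': 'alice', 'balance': 100}
```

Swapping the composition to load-then-transform inside the store is what makes a pipeline ELT rather than ETL.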

Ingestion : The extraction and loading of data into internal systems with little or no modification, but can also mean the loading and transformation of data.

Isolation : Isolation determines how transaction integrity is visible to other "users". Isolation levels define the degree to which a transaction must be isolated from other data modifications

JSON/BSON : A lightweight format for storing or interchanging data (see XML, Database (Document)). Unlike XML, JSON includes datatypes for strings and numbers (but not dates) and data structures for arrays/lists and objects.
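
The missing date type is worth seeing in practice: numbers round-trip typed, while dates must travel as strings by convention (ISO 8601 here):

```python
import json
from datetime import date

record = {"name": "trade-1", "qty": 100, "settled": True,
          "trade_date": date(2024, 1, 2).isoformat()}
text = json.dumps(record)
back = json.loads(text)
print(type(back["qty"]).__name__, type(back["trade_date"]).__name__)
# int str -- the number comes back typed, the date comes back as a string
```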

Lambda Architecture : a system design that loads data using different mechanisms for batch processing and stream processing. Not to be confused with other Lambda things ... (eg Lambda Functions)

Lakehouse (Data Lakehouse) : a term coined by Databricks but without a clear definition. A Lakehouse is "what you would get if you had to redesign data warehouses in the modern world", encompassing some or all of transactions, schema enforcement, BI support, support for un-structured or semi-structured data, and open standards. Is a Lakehouse a store of raw or warehouse data ? (see https://databricks.com/blog/2020/01/30/what-is-a-data-lakehouse.html)

Linearisability : provides real-time guarantees on the behaviour of operations on data - essentially "atomic consistency" (C from CAP). It could be summarised as the ability to make distributed data stores appear as only one copy of the data exists.

Master Data Management/MDM : A method/process of managing data into a single point of reference. (See Data Mastering)

MVCC - Multi-version Concurrency Control : a method of providing concurrent access to the database (see https://en.wikipedia.org/wiki/Multiversion_concurrency_control)

NewSQL : A type of modern database based on Relational Databases but providing scalable performance (eg Google Spanner)

NoSQL : generic term for any database that stores data in a non-tabular format (unlike a Relational Database). Non-tabular could be JSON or Key-Value (see Database (Document)). Now also used to mean "Not Only SQL" since some of these databases now support the SQL language.

OLAP : OnLine Analytical Processing. Typically this involves the reporting and analysis of data rather than updating or creating of data (see OLTP). Datawarehouses are typically OLAP focused.

OLTP : OnLine Transaction Processing. Typically this involves inserting, updating, and/or deleting small amounts of data in a database in multiple sessions in parallel.

Row Based Database : a database that serialises data by row (cf Column Store). Row based databases include Oracle, MS-SQL

Schema : An organisation of data that serves as a design for how data is organised or interchanged.

Schema-on-Read : A schema that is applied when the data is read rather than when it is stored (often used over a Data Lake)
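
A sketch of the idea: raw records are stored exactly as they arrived, and the schema (types, defaults) is imposed only at read time. The record shapes here are made up for illustration:

```python
import json

# Stored as-is; schema-on-write would have validated these on the way in.
raw_store = [
    '{"name": "alice", "balance": "100"}',   # balance arrived as a string
    '{"name": "bob"}',                       # balance missing entirely
]

def read_with_schema(raw):
    """Apply the schema at read time: coerce types, default missing fields."""
    rec = json.loads(raw)
    return {"name": rec["name"], "balance": int(rec.get("balance", 0))}

rows = [read_with_schema(r) for r in raw_store]
print(rows[1])  # {'name': 'bob', 'balance': 0}
```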

Schema-on-Write : A schema that is applied (and enforced) when the data is written to the store

Serializability : provides a guarantee that transactions behave as though processed in some serial order.

Temporality - uni-temporal : modelling technique to handle data along a single time axis - either validity or system time (although often these are fused (or confused) together)

Temporality - bi-temporal : modelling technique to handle data along two time axes - validity and system time. Allows all data to be immutable.
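
A toy bi-temporal store (field names are illustrative): every fact carries a validity date and a system (recorded) date, and corrections append new rows rather than mutating old ones.

```python
from datetime import date

history = []

def record(entity, value, valid_from, system_time):
    history.append({"entity": entity, "value": value,
                    "valid_from": valid_from, "system_time": system_time})

record("rate", 0.05, valid_from=date(2024, 1, 1), system_time=date(2024, 1, 1))
# Later we learn the rate was actually 0.04 all along: append a correction.
record("rate", 0.04, valid_from=date(2024, 1, 1), system_time=date(2024, 2, 1))

def as_of(entity, valid_on, known_on):
    """What did we believe on `known_on` about the value valid on `valid_on`?"""
    rows = [r for r in history
            if r["entity"] == entity
            and r["valid_from"] <= valid_on
            and r["system_time"] <= known_on]
    return rows[-1]["value"] if rows else None

print(as_of("rate", date(2024, 1, 15), known_on=date(2024, 1, 15)))  # 0.05
print(as_of("rate", date(2024, 1, 15), known_on=date(2024, 2, 15)))  # 0.04
```

Both answers remain reproducible forever because the correction never overwrote the original belief.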

Temporality - tri-temporal : modelling technique to handle data along three time axes - validity, system time and decision time.

Temporality - quad-temporal : something to be avoided at all costs.

Transformation - (see ELT/ETL) : includes normalisation, mapping of enumerations, standardisations, etc

V's - Measures of data "big-ness".

If one or more of the measures is considered "high", then the data can be considered as big data.

[Note: there is no definition of "high" but any measure which indicates that a single large machine cannot handle it would lead to a definition of high]

3 V's - Three measures of data - Volume, Velocity, Variety

4 V's - Volume, Velocity, Variety, Veracity : an added dimension to the 3V's by IBM in considering ways to analyse data

5 V's - Volume, Velocity, Variety, Veracity, Value : another dimension added by IBM serving little purpose.

6 V's - Volume, Velocity, Variety, Veracity, Value, Variability : yet another dimension.

[Note to IBM: Adding yet more V's doesn't make this more useful]

XML : An HTML-style file format used to store or interchange data. XML lacks datatypes to identify values as strings, dates or numbers and lacks data structure definitions for arrays/lists and scalars, but XSDs can be used to provide these.
