OLTP, OLAP, Columnar Data Stores – overview

All VLEs (very large enterprises) will have a mix of transactional, analytical and data storage technologies.  ‘Polyglot’ may refer to the variation in database models used for different use-cases and application demands.  A mainframe is essentially a large transactional database, or on-line transaction processing (OLTP) system, and should be considered part of a polyglot architecture within a VLE.

There are key differences between a transaction-oriented database and an analytical database as given in the table below.

OLTP vs OLAP

Area            OLTP (mainframe or SQL)                        OLAP (generally SQL)
Basic           Transactional system that manages              Data retrieval and analytics
                database modifications
Focus           Insert, update and delete records in the DB    Extract data for analysis and decision making
Data            The OLTP DB is the original data source        OLAP DBs use data marts or cubes
Transactions    Short                                          Long
Time            Short                                          Longer
Queries         Simpler                                        Complex
Normalisation   Tables are normalised (3NF)                    Not normalised
Integrity       Must maintain DB integrity constraints         Data is not modified, so DB integrity is not affected

OLTP systems are typically user-facing, which means that they may see a huge volume of requests.  In order to handle the load, applications usually only touch a small number of records in each query.  The application requests records using a key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here.  
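
As a minimal sketch of that access pattern, assuming an in-memory byte buffer standing in for the data file and a plain Python dict as the key index (both invented for illustration):

    import io

    # A toy illustration of key-based OLTP access: the storage engine keeps an
    # index from key to byte offset, so each request touches only one record.
    disk = io.BytesIO()          # stands in for a data file on disk
    index = {}                   # key -> (offset, length)

    def write_record(key: str, value: str) -> None:
        record = value.encode("utf-8")
        offset = disk.seek(0, io.SEEK_END)   # append at the end of the file
        disk.write(record)
        index[key] = (offset, len(record))   # remember where the record lives

    def read_record(key: str) -> str:
        offset, length = index[key]          # index lookup avoids scanning the file
        disk.seek(offset)                    # on a real disk this seek is the expensive part
        return disk.read(length).decode("utf-8")

    write_record("customer:42", '{"name": "Alice", "status": "active"}')
    print(read_record("customer:42"))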

Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts, not by end users.  They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time.  Disk bandwidth (not seek time) is often the bottleneck here, and column-oriented storage is an increasingly popular solution for this kind of workload.

Within OLTP storage engines there are two main schools of thought.  The first is the log-structured approach, which only permits appending to files and deleting obsolete files, but never updates a file that has been written.  Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene, and others belong to this group.  The second is the update-in-place approach, which treats the disk as a set of fixed-size pages that can be overwritten.  B-trees are the most prominent example of this approach, being used in all major relational databases and also in many nonrelational ones.

Log-structured storage engines are a comparatively recent development.  Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs.
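
A minimal sketch of this idea, loosely in the style of Bitcask: an append-only data file plus an in-memory map from key to the offset of its latest entry (the file name, record format and the omission of compaction are all simplifying assumptions):

    import os

    # Append-only, log-structured key-value store: writes are sequential appends,
    # updates simply append a newer version, and an in-memory map points at the
    # latest offset for each key. Compaction (dropping obsolete entries) is omitted.
    DATA_FILE = "toy_log.db"   # hypothetical file name
    offsets = {}               # key -> offset of its most recent entry

    def put(key: str, value: str) -> None:
        line = f"{key}\t{value}\n".encode("utf-8")
        with open(DATA_FILE, "ab") as f:       # sequential append, never overwrite
            offsets[key] = f.tell()
            f.write(line)

    def get(key: str) -> str:
        with open(DATA_FILE, "rb") as f:
            f.seek(offsets[key])               # jump straight to the latest version
            _, value = f.readline().decode("utf-8").rstrip("\n").split("\t", 1)
            return value

    put("user:1", "Alice")
    put("user:1", "Alicia")      # an update is just another append
    print(get("user:1"))         # -> Alicia
    os.remove(DATA_FILE)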

In contrast OLAPs are typified by a data warehouse, which is a separate database that analysts can query without affecting OLTP operations.  The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.  Data is extracted from OLTP databases (using either a periodic data dump or a continuous stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into the data warehouse.
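
A heavily simplified sketch of that extract-transform-load flow; the source rows, schema and column names are invented for illustration:

    from datetime import date

    # Extract: rows as they might come from an OLTP order table.
    oltp_orders = [
        {"order_id": 1, "sku": "A-100", "qty": 2, "price": "9.99", "ts": "2019-06-01"},
        {"order_id": 2, "sku": "B-200", "qty": 1, "price": "24.50", "ts": "2019-06-02"},
    ]

    def transform(row):
        # Clean up and reshape into an analysis-friendly fact row.
        return {
            "date_key": date.fromisoformat(row["ts"]),
            "product_sk": row["sku"],
            "quantity": row["qty"],
            "revenue": round(float(row["price"]) * row["qty"], 2),
        }

    warehouse_fact_sales = []            # stands in for the warehouse fact table

    def load(rows):
        warehouse_fact_sales.extend(transform(r) for r in rows)

    load(oltp_orders)                    # periodic batch load of extracted rows
    print(warehouse_fact_sales)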

The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytical queries and complex joins and aggregations.  There are many graphical data analysis tools that generate SQL queries, visualize the results, and allow analysts to explore the data (through operations such as drill-down and slicing and dicing).  On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface.  However, the internals of the systems can look quite different, because they are optimized for very different query patterns.  Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both, and this specialisation is especially visible on and between cloud platforms.

Data Analytics

Cloud-based systems are now heavily used for OLAP and analytics.  For example, Amazon Redshift is a popular AWS-hosted data warehouse derived from the ParAccel OLAP data-warehouse technology.  It is a data warehouse with cloud characteristics, used to house data and to aggregate it as the basis for ‘Big Data’ analytics and processing, and it integrates easily with BI tools and other services.  This can lead to predictive analytics and, over time, to cognitive systems and Artificial Intelligence – a vaguely defined domain open to much interpretation, including criticism of its relevance inside many business processes.  The same is true of Big Data.

Data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL.  If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time.  Instead you could cache some of the counts or sums that queries use most often.  One way of creating such a cache is a materialized view.  In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query. The difference is that a materialized view is an actual copy of the query results, written to disk, whereas a virtual view is just a shortcut for writing queries.  When you read from a virtual view, the SQL engine expands it into the view’s underlying query on the fly and then processes the expanded query.
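
A rough illustration of the distinction in Python terms (the data and names are invented): the ‘virtual view’ recomputes its query on every read, whereas the ‘materialized view’ is a stored copy of the results that must be refreshed on writes.

    sales = [
        {"product_sk": "A-100", "quantity": 2},
        {"product_sk": "B-200", "quantity": 1},
        {"product_sk": "A-100", "quantity": 5},
    ]

    def sales_per_product():
        # "Virtual view": nothing is stored; the aggregate is recomputed on every read.
        totals = {}
        for row in sales:
            totals[row["product_sk"]] = totals.get(row["product_sk"], 0) + row["quantity"]
        return totals

    # "Materialized view": the query results are written out once and then reused...
    materialized_sales_per_product = sales_per_product()

    # ...but every write now also has to refresh the copy, which is the extra cost
    # that makes materialized views less attractive in OLTP databases.
    def insert_sale(product_sk, quantity):
        global materialized_sales_per_product
        sales.append({"product_sk": product_sk, "quantity": quantity})
        materialized_sales_per_product = sales_per_product()

    insert_sale("B-200", 3)
    print(materialized_sales_per_product)   # reads are now just a lookup, no recomputation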

When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data.  The database can do that automatically, but such updates make writes more expensive, which is why materialized views are not often used in OLTP databases.  In read-heavy data warehouses they can make more sense (whether or not they actually improve read performance depends on the individual case).  A common special case of a materialized view is known as a data cube or OLAP cube. It is a grid of aggregates grouped by different dimensions.
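
A small sketch of the data-cube idea with two dimensions, date and product (the data and column names are invented):

    from collections import defaultdict

    fact_sales = [
        {"date_key": "2019-06-01", "product_sk": "A-100", "quantity": 2},
        {"date_key": "2019-06-01", "product_sk": "B-200", "quantity": 1},
        {"date_key": "2019-06-02", "product_sk": "A-100", "quantity": 5},
    ]

    # Two-dimensional OLAP cube: a grid of SUM(quantity) grouped by date and product.
    cube = defaultdict(int)
    for row in fact_sales:
        cube[(row["date_key"], row["product_sk"])] += row["quantity"]

    # Aggregates along either dimension fall out by summing one row or column of the grid.
    total_for_product = sum(v for (d, p), v in cube.items() if p == "A-100")
    total_for_date = sum(v for (d, p), v in cube.items() if d == "2019-06-01")
    print(total_for_product, total_for_date)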

The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data.  For example, there is no way of calculating what proportion of sales comes from items that cost more than $100, because the price isn’t one of the dimensions.  Most data warehouses therefore try to keep as much raw data as possible and use aggregates such as data cubes only as a performance boost for certain queries.

Columnar Storage

Hadoop (with its distributed file system and columnar storage formats) is the most widely used Big Data platform for columnar data, and there is now a wide variety of open source SQL-on-Hadoop projects aiming to compete with commercial data warehouse systems.  These include Apache Hive, Spark SQL, Cloudera Impala, Facebook Presto, Apache Tajo, and Apache Drill.  Some of them are based on ideas from Google’s Dremel.  All of them are complex to install, manage, understand and train people on, not to mention the difficulty of the extract, transform and load (ETL) process for data from database systems.

These platforms are used mainly for analytics, including making predictions, training models to recognise patterns (machine learning), and offering predictive solutions to quite specific problem sets.  In the basic pattern of data usage, an analytical query needs to scan over a large number of records, only reading a few columns per record, and calculates aggregate statistics (such as count, sum, or average) rather than returning the raw data to the user.

Within OLAP, most large enterprises use dimension tables, or OLAP data marts, which usually count records in the millions and are much easier to query and assess than larger ‘Big Data’ data-sets.  If, for example, the data-set is literally trillions of rows and petabytes of data in fact tables (not unusual with Hadoop), storing and querying it efficiently becomes a serious issue.  In a traditional SQL-OLAP system the fact tables might often be over 100 columns wide, so designing these systems and their associated queries is an important task.  In this case, using columnar data stores might be more efficient: petabytes spread over columns and millions of rows can be re-modelled as columnar tables of associated data which can be compressed and queried.

For a data query to be efficient, it is suggested that the query access only 4 or 5 columns at one time (“SELECT *” queries are rarely needed for analytics).  The set-up of the queries can therefore be quite granular in both large and small data-set analysis.  If, for instance, the query needs to access a large number of rows (every occurrence of someone buying a particular product during 2019), it may need to access only 3 columns of data (in a DFS like Hadoop), namely date_key, product_sk, and quantity from the fact_sales table.  The query ignores all other columns.

The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead.  If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work and avoid cross-column query complexity and latency.
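
A toy sketch of the row-versus-column layout for a fact_sales-style table (in-memory lists stand in for the per-column files, and the extra columns are invented):

    # Row-oriented layout: all values of one row are stored together.
    rows = [
        ("2019-06-01", "A-100", 2, "store-7", "GBP"),
        ("2019-06-01", "B-200", 1, "store-3", "GBP"),
        ("2019-06-02", "A-100", 5, "store-7", "GBP"),
    ]

    # Column-oriented layout: all values of one column are stored together
    # (each list would be a separate file on disk in a real column store).
    columns = {
        "date_key":   [r[0] for r in rows],
        "product_sk": [r[1] for r in rows],
        "quantity":   [r[2] for r in rows],
        "store_sk":   [r[3] for r in rows],
        "currency":   [r[4] for r in rows],
    }

    # The analytical query from the text needs only three of the columns; the
    # other column "files" are never read at all.
    total_2019_a100 = sum(
        q
        for d, p, q in zip(columns["date_key"], columns["product_sk"], columns["quantity"])
        if d.startswith("2019") and p == "A-100"
    )
    print(total_2019_a100)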

Column storage: Memory bandwidth and vectorized processing

For traditional data warehouse queries that need to scan over millions of rows, a big bottleneck is the bandwidth for getting data from disk into memory.  Developers of analytical databases must also make efficient use of the bandwidth from main memory into the CPU cache, avoid branch mispredictions and bubbles in the CPU instruction processing pipeline, and make use of single-instruction-multiple-data (SIMD) instructions in modern CPUs.  This is difficult and requires experience and experimentation.

Along with reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles.  For example, the query engine can take a chunk of compressed column data that fits comfortably in the CPU’s L1 cache and iterate through it in a tight loop (that is, with no function calls).  A CPU can execute such a loop much faster than code that requires a lot of function calls and conditions for each record that is processed.  Column compression allows more rows from a column to fit in the same amount of L1 cache.  Operators, such as bitwise AND and OR, can be designed to operate on such chunks of compressed column data directly; this technique is known as vectorized processing.
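
As a small illustration of that last point, the sketch below bitmap-encodes two columns and evaluates a two-predicate filter with a single bitwise AND; plain Python integers stand in for the compressed bitmap chunks, and the data is invented:

    # Bitmap encoding: for each distinct value in a column, one bit per row says
    # whether that row has the value. Predicates then become bitwise operations
    # that work on whole words of the bitmap at a time.
    product_sk = ["A-100", "B-200", "A-100", "C-300", "A-100"]
    store_sk   = ["store-7", "store-7", "store-3", "store-7", "store-7"]

    def bitmap(column, value):
        bits = 0
        for i, v in enumerate(column):
            if v == value:
                bits |= 1 << i           # set bit i if row i matches
        return bits

    # WHERE product_sk = 'A-100' AND store_sk = 'store-7'
    matches = bitmap(product_sk, "A-100") & bitmap(store_sk, "store-7")
    matching_rows = [i for i in range(len(product_sk)) if matches & (1 << i)]
    print(matching_rows)    # -> [0, 4]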

A column store can also choose a sort order for its rows.  If queries often target date ranges, such as the last month, it might make sense to make date_key the first sort key.  Then the query optimizer can scan only the rows from the last month, which will be much faster than scanning all rows.  A second column can determine the sort order of any rows that have the same value in the first column. For example, if date_key is the first sort key, it might make sense for product_sk to be the second sort key, so that all sales for the same product on the same day are grouped together in storage. That will help queries that need to group or filter sales by product within a certain date range. Another advantage of sorted order is that it can help with compression of columns.
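
A sketch of those two points together: rows sorted by (date_key, product_sk), followed by a simple run-length encoding of the now-repetitive date column (the data and encoding format are invented):

    from itertools import groupby

    fact_sales = [
        ("2019-06-02", "A-100", 5),
        ("2019-06-01", "B-200", 1),
        ("2019-06-01", "A-100", 2),
        ("2019-06-01", "A-100", 3),
    ]

    # Sort by date_key first, then product_sk, so sales for the same product on
    # the same day end up adjacent in storage.
    fact_sales.sort(key=lambda r: (r[0], r[1]))

    # The sorted date_key column now contains long runs of the same value, which
    # compresses well with run-length encoding: (value, run length) pairs.
    date_column = [r[0] for r in fact_sales]
    rle = [(value, sum(1 for _ in run)) for value, run in groupby(date_column)]
    print(rle)    # -> [('2019-06-01', 3), ('2019-06-02', 1)]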

Polyglot

Not every data warehouse is necessarily a column store, of course; traditional row-oriented databases and a few other architectures are also used.  However, columnar storage can be significantly faster for ad hoc analytical queries, so it is gaining popularity.  Another aspect of data warehouses worth mentioning briefly is materialized aggregates.  This highlights a key difference between traditional SQL data warehouses and columnar storage, which is the real-time nature of the queried results.  Columnar data stores usually provide more ‘real-time’ results and, of course, do not produce normalised or materialised views of the data for querying, which means there is nothing that needs to be refreshed.

==END