Designing Hybrid Data Systems

by Jon Pool and Brahmaiah Jarugumilli
Presented at 2003 Rocky Mountain Oracle User Group Training Days Conference
Copyright 2003 Maverick Systems, LLC

Jon Pool is the founder of Maverick Systems, the Denver-based specialist in complex data warehouse solutions. Brahmaiah Jarugumilli (JB) is an accomplished Oracle System Designer and Architect with a deep experience in Oracle High Performance Architectures. JB is chief architect at Maverick Systems.

Introduction
This paper discusses design concepts for a hybrid database system in a complex financial environment. A hybrid system is defined here as a database used simultaneously as an OLTP and an OLAP system. The challenge is meeting the divergent functional needs of OLTP (ODS) and a DWH system. The solution requires a process-oriented design, a pattern similar to object-oriented application systems and hybrid partitioning schemes. The development of ‘hybrid’ systems is constrained by existing product boundaries and technological limitations. Solution strategies that allow successful implementation of hybrid database systems aid cost control and rapid implementation timeframes—both of which are attractive to the IT market place.

What are Hybrid Systems?
One can attempt a definition on two characteristics:
· The nature of the data
· The usage of the data

Typically OLTP systems are transaction-oriented and enable online transaction processing like an online ordering system employing a relatively small data set. On the other hand, traditional data warehouses focus on bulk data operations. Their data models are based on FACTs (or MEASUREs) and DIMENSIONS. Star schema modeling involves data warehouse keys and STAR TRANSFORMATIONs that are typically driven by BITMAP-HASH-MERGE operations. This architecture serves aggregation and analytical purposes very well when used for customer or periodic sales information. Hybrid systems are expected to deliver both needs simultaneously such as when one corporate user interface is satisfying an online user tracking his order status, while a different one satisfies a request for an online report on product quantities shipped.

DWHs typically carry historic information that lags behind an ODS system. This lag can vary from a few hours to a week or more. A hybrid system may have to reflect the most current information in its summary and aggregates instantaneously.

As to usage volumes, OLTP systems connect hundreds of concurrent users through web or direct client-server interfaces to the database systems. Data warehouses might involve 10 to 20 users running reports and building analytical cubes through massive parallel processing. Hybrid systems have to display their efficiency through parallel backend processing, while simultaneously responding to hundreds of linear query and transaction processes.

Design Challenges in Hybrid Systems


Assembling Data for Online Presentation
Permissive data structures and storage, as well as their access mechanisms, should be built to let a hybrid system’s transactional and query demands transcend the demands of both DWH and OLTP systems. The following areas require special attention:

1. Instant gratification of new and ad hoc transaction data in aggregates and drill downs: Traditional data warehouses do not offer transactional data for online processing. Hybrid systems expose data to transactions.

2. Natural keys versus DWH keys: Traditional DWHs operate on ID keys based on star schemas; ODS systems operate on natural keys and relationships built on transactional vectors. Hybrid system requirements call for innovative approaches, such as smart keys that serve as DWH ID keys as well as encapsulating their natural values. Using object type data values that encapsulate natural value and key pairs is a possibility, but this approach is both a ground breaking practice and a difficult implementation in the current state of Oracle’s relational paradigm.

3. Multiple DWH measures and grains of analysis: When multiple measures are involved, like quantity versus time versus amount, data structures do not evolve easily.


Data Volumes for Analytical Work
1. Instant Cubes: Traditionally there is a lag between the timing of entry of a transaction and its reflection in a summary. A requirement for instant gratification, such as visualization of an adjustment transaction in the aggregates and in the Trial Balances, would call for aggregating transactions as part of an online transaction, resulting in reduced throughput.

2. Online Summation across variable boundaries: When the aggregation parameters are known, summaries can be preplanned. When the end users need to choose the aggregation parameters through an online ad hoc screen, pre-built summaries are not sufficient. Oracle would have to provide a summary value on the fly, a challenging requirement when hundreds of users are working in a multi-terabyte system.

3. Cumulative values since inception: Requirements for cumulative and year-to-date computations can be problematic, especially when the year-to-date boundaries are not well defined. When several months of fiscal data are concurrently open, cumulative computation requires forward adjustments, whenever transactions are passed to the previous periods.

Technical Challenges
1. I/O subsystem design for I/O: Partitioned systems are conducive to bulk operations. Single redo-vector or a record-set of transactions (unit of work pattern) need b-tree index-based access and writes. Hybrid systems require a centrist approach.

2. Process centric vs. data centric models: OLTP systems are traditionally process centric, with their data structures (tables and indexes) and program elements designed for transaction processing. DWH structures are designed on data centric principles for bulk operations, such as star-transformations, via Cartesian products between dimension table data and hash-merge joins. The ratio of fact table to dimension row counts is disproportionate in DWH systems, whereas they are more evenly distributed in OLTP systems. Hybrid systems, require an integrated approach, which, in the present literature and practice, is not well defined, nor are there database systems that are tailor-made for such situations.

Specific Challenges
1. Ad hoc querying and reporting feature transcends pre-built aggregations: If users require aggregated information across a variable set of dimension keys (selected natural values), across practically ALL attributes in the data model, then any amount of pre-aggregation would be inadequate in such combining operations. Also no query/rewrite could take place properly since the net aggregating requirement would have spanned multiple materialized views.

2. Shifting data models necessitates flexibility in derived facts and aggregates design: In financial environments, derived facts are based on balancing across rotating accounts. Some accounts close into others each fiscal year, necessitating carry forward of multiple cumulative records when such rotation takes place. The rotation is metadata-driven as well as ad hoc. This would inhibit drill-down paths to detail transactions from summarized Trial Balances, for example.

3. Uploading derived facts is both processor and I/O intensive: Classic bitmap-based operations are not conducive to heavy I/O (Index-based fetches) and CPU utilization. A mixed requirement for both bitmaps and b-tree indexes should be expected.

4. Managing cumulative values when multiple fiscal months are open for extended periods: In financial environments, this requirement could pose formidable challenges to process engineering of the derived facts and summaries because of the involvement of cumulative measures. It would dampen the speed of processing as well as the query rewrite capabilities of the system. Expect to build specific and targeted summaries for specific query and transaction processing requirements.

5. Cross functional and multi-level user hierarchies
· Budget Providers, Accountants, Financial Analysts, and Auditors: Access, transactional, reporting and analytical needs of end users will be divergent across user communities spanning organizational hierarchies, geographic regions, functional clusters and auditing requirements. To control and ration the usage across multiple user types, Resource Consumer Groups and multiple resource consumption plans need to be deployed. The logistic issues (such as who controls whom) would require fine-grained user account management.

· Data access and transactional controls: Because of the sensitive nature of budgetary management, users across organizational and functional boundaries usually are restricted to data sets permissible to their ‘level of authority’ and functional/geographic boundaries. An Oracle Virtual Private Database (OLS) solves this problem but increases the processing overhead and requires fairly involved data structure design decisions.

· Lead users at each level pre-build queries for public usage: In a demonstration of the hierarchical nature of work in the organizations, for each group of users, lead users could be designated, who would decide the content, frequency and access controls on the published reports, power cubes and other data dispensing means. The applications (Power Builder, Cognos, etc.) and the backend database (Roles and Grants) need to be aware of such organization.

Logical System Design
One software stack recommended for the type of environment discussed herein could include:
· Cognos 7.0 web client layer, offered to a majority of users for power-cubes and other analytical needs
· An additional Cognos client software layer, provided to selected users for querying interactively and building their own custom reports
· The application would have several modules, one of which would service field staff for inquiry into the data values
· Oracle Discoverer and other 3rd party tools (such as Tivoli), to be used for system health tracking and monitoring purposes.
Solution Strategies—Integrating Process Orientation and Bulk Data Operations

A Transient Solution
The solution is termed transient because of the nature of consumption of the data. Though modeled largely as a data warehouse and star schemas, the consumption is relational as a ‘unit of work’ and associated information. The storage in the DWH model is therefore transitional, in support of aggregations and several pseudo star schemas in the form of Oracle Nested Tables of relevant look-up and rollup information.

‘Silo’s of FACT Tables
Multiple fact tables for Transactions and Trial Balance facts would be designed based on the nature of data uploads. These multiple fact tables share the same conformed dimensions. Some examples of various types of uploads are:
· Transaction records uploaded in batch mode several times in a day (a few hundreds of thousands of records)
· Uploads of a few thousand transactions at a time through online transaction processing and client-server screens several times a day and more frequently (several times an hour) during the first week of a new fiscal month
· Adjustments processed as transaction vectors from approved journal vouchers made continuously by accountants simultaneously looking at the updated balances and summaries

Flexible Data Modeling
Some features would be suitable for rapidly changing dimensions and accounting rules from year to year:
· A metadata-driven application and data management
· Dimensions and accounting rules that are based on mapping tables
· The roll-forward process accounts for cumulative values, year-to-date values and opening and closing of accounts based on accounting rules
· Divergent end-user communities across functional and organizational hierarchies and geographies called for the same data values but using different business names. Metadata-driven business allocation at run time is based on user roles and community affiliation.

Infrastructure for Flexible Aggregation
Summaries and aggregates would be based on a meta information-driven table that decides on the involvement of views, materialized views (for query rewrite), merge frequency and merge SQL, create and rebuild options and create SQL, etc.

Hybrid Partitioning Scheme
Though the granularity of accounting is for a month, for easy storage and archiving (through read-only tablespace options), the architecture could be built on a hybrid-partitioning scheme with fiscal months as RANGE partitions under which an operational day is built as LIST-based sub-partitions. This approach would quicken the ETL and aggregation processes and enable data upload several times a day. Sub-partition exchanges would offer instantaneous data transfers between staging and the main fact tables.

Each fiscal-month partition would have 61 days of a month defined. Thus when past calendar months are still open for business, several concurrent days of operation, into the new month, could be accommodated. The year-end process is similar; extending each month to several calendar days beyond their natural end days would be workable.

‘MERGE’ Operation-Based Processing Coupled with ROLLFWD Methods
Without the powerful bulk MERGE operation in Oracle9i, a cursor-based programmatic SELECT and UPDATE operation from across previous partitions of the Fact tables would be necessitated. Rolling forward the year-to-date and cumulative-since-inception values into future partitions, ahead of anticipated future transactions, enables inline merge operations to preserve these values.

Rebuild Mechanisms for Derived Facts
For the complexity explained above, many facts and summaries are not conducive to rebuild in simple steps. The aggregation mechanisms need to account for the summation plus the roll-forward steps. The rebuild mechanisms are not only needed to recover from a logical corruption, but also for accommodating changing accounting rules.

Consumer and Resource Groups
The Database Resource Manager (DRM) can be used to prioritize users into consumer groups that are then related to plan directives that would limit resources.

Though only one resource plan can be active at a given time, this arrangement offers the following benefits:
· Guarantees certain users or groups a minimum amount of processing resources regardless of the load on the database system and the number of users
· Better distributes available database processing resources by allocating minimum percentages of database CPU time to different users and applications
· Limits the degree of parallelism that a set of users can use
· Dynamically switches database resource consumption plans
Eyes Wide Open
Manage Customer Expectations
Managing and packaging functionality in line with the system’s capacity will be critical to the success of developing a complex and intricate hybrid system.

Be Wary of Tools Being Used
Avoid interoperability-challenged version upgrades.

Data Cleansing and ETL Process Variations
Generating clean data for upload is a critical success factor. Time spent to cleanse data from legacy systems will help avoid repeated loading cycles. By the same token, be prepared to refine and revamp the ETL processes until the final data structures and mappings suitable to the hybrid usage emerge.
Prevailing Technological Limitations
Bitmaps vs. B-tree indexing
There is too wide a gap between these two technologies. The CBO’s index merge operations are still not prime time to meet such hybrid challenges.

Query Rewrite Mechanism
Prevailing QR technology based on SQL statements is inadequate in using summaries for cumulative and YTD values. A Query Rewrite technique based on the column lists of existing MVs and attribute-measure aggregation will be much more productive.

Ideal Architecture on ‘Active Record’ Pattern
A design based on an active record pattern would be ideal for such a hybrid system. In this pattern (typically used in an object-oriented system), an object’s data is retrieved upon initiation and is retained within its attributes. This is possible in a database centric system, such as Oracle, through nesting a parent and multiple detail tables.