by Jon Pool and Brahmaiah Jarugumilli
Presented at 2003 Rocky Mountain Oracle User Group Training Days Conference
Copyright 2003 Maverick Systems, LLC
Jon Pool is the founder of Maverick Systems, the Denver-based specialist in
complex data warehouse solutions. Brahmaiah Jarugumilli (JB) is an accomplished
Oracle System Designer and Architect with a deep experience in Oracle High
Performance Architectures. JB is chief architect at Maverick Systems.
Design Challenges in Hybrid Systems
Assembling Data for Online Presentation
Permissive data structures and storage, as well as their access mechanisms,
should be built to let a hybrid system’s transactional and query demands
transcend the demands of both DWH and OLTP systems. The following areas require
special attention:
1. Instant gratification of new and ad hoc transaction data in aggregates and
drill downs: Traditional data warehouses do not offer transactional data for
online processing. Hybrid systems expose data to transactions.
2. Natural keys versus DWH keys: Traditional DWHs operate on ID keys based on
star schemas; ODS systems operate on natural keys and relationships built on
transactional vectors. Hybrid system requirements call for innovative
approaches, such as smart keys that serve as DWH ID keys as well as
encapsulating their natural values. Using object type data values that
encapsulate natural value and key pairs is a possibility, but this approach is
both a ground breaking practice and a difficult implementation in the current
state of Oracle’s relational paradigm.
3. Multiple DWH measures and grains of analysis: When multiple measures are
involved, like quantity versus time versus amount, data structures do not evolve
easily.
Data Volumes for Analytical Work
1. Instant Cubes: Traditionally there is a lag between the timing of entry of a
transaction and its reflection in a summary. A requirement for instant
gratification, such as visualization of an adjustment transaction in the
aggregates and in the Trial Balances, would call for aggregating transactions as
part of an online transaction, resulting in reduced throughput.
2. Online Summation across variable boundaries: When the aggregation parameters
are known, summaries can be preplanned. When the end users need to choose the
aggregation parameters through an online ad hoc screen, pre-built summaries are
not sufficient. Oracle would have to provide a summary value on the fly, a
challenging requirement when hundreds of users are working in a multi-terabyte
system.
3. Cumulative values since inception: Requirements for cumulative and
year-to-date computations can be problematic, especially when the year-to-date
boundaries are not well defined. When several months of fiscal data are
concurrently open, cumulative computation requires forward adjustments, whenever
transactions are passed to the previous periods.
Technical Challenges
1. I/O subsystem design for I/O: Partitioned systems are conducive to bulk
operations. Single redo-vector or a record-set of transactions (unit of work
pattern) need b-tree index-based access and writes. Hybrid systems require a
centrist approach.
2. Process centric vs. data centric models: OLTP systems are traditionally
process centric, with their data structures (tables and indexes) and program
elements designed for transaction processing. DWH structures are designed on
data centric principles for bulk operations, such as star-transformations, via
Cartesian products between dimension table data and hash-merge joins. The ratio
of fact table to dimension row counts is disproportionate in DWH systems,
whereas they are more evenly distributed in OLTP systems. Hybrid systems,
require an integrated approach, which, in the present literature and practice,
is not well defined, nor are there database systems that are tailor-made for
such situations.
Specific Challenges
1. Ad hoc querying and reporting feature transcends pre-built aggregations: If
users require aggregated information across a variable set of dimension keys
(selected natural values), across practically ALL attributes in the data model,
then any amount of pre-aggregation would be inadequate in such combining
operations. Also no query/rewrite could take place properly since the net
aggregating requirement would have spanned multiple materialized views.
2. Shifting data models necessitates flexibility in derived facts and aggregates
design: In financial environments, derived facts are based on balancing across
rotating accounts. Some accounts close into others each fiscal year,
necessitating carry forward of multiple cumulative records when such rotation
takes place. The rotation is metadata-driven as well as ad hoc. This would
inhibit drill-down paths to detail transactions from summarized Trial Balances,
for example.
3. Uploading derived facts is both processor and I/O intensive: Classic
bitmap-based operations are not conducive to heavy I/O (Index-based fetches) and
CPU utilization. A mixed requirement for both bitmaps and b-tree indexes should
be expected.
4. Managing cumulative values when multiple fiscal months are open for extended
periods: In financial environments, this requirement could pose formidable
challenges to process engineering of the derived facts and summaries because of
the involvement of cumulative measures. It would dampen the speed of processing
as well as the query rewrite capabilities of the system. Expect to build
specific and targeted summaries for specific query and transaction processing
requirements.
5. Cross functional and multi-level user hierarchies
· Budget Providers, Accountants, Financial Analysts, and Auditors: Access,
transactional, reporting and analytical needs of end users will be divergent
across user communities spanning organizational hierarchies, geographic regions,
functional clusters and auditing requirements. To control and ration the usage
across multiple user types, Resource Consumer Groups and multiple resource
consumption plans need to be deployed. The logistic issues (such as who controls
whom) would require fine-grained user account management.
· Data access and transactional controls: Because of the sensitive nature of
budgetary management, users across organizational and functional boundaries
usually are restricted to data sets permissible to their ‘level of
authority’ and functional/geographic boundaries. An Oracle Virtual Private
Database (OLS) solves this problem but increases the processing overhead and
requires fairly involved data structure design decisions.
· Lead users at each level pre-build queries for public usage: In a
demonstration of the hierarchical nature of work in the organizations, for each
group of users, lead users could be designated, who would decide the content,
frequency and access controls on the published reports, power cubes and other
data dispensing means. The applications (Power Builder, Cognos, etc.) and the
backend database (Roles and Grants) need to be aware of such organization.
Logical System Design
One software stack recommended for the type of environment discussed herein
could include:
· Cognos 7.0 web client layer, offered to a majority of users for power-cubes
and other analytical needs
· An additional Cognos client software layer, provided to selected users for
querying interactively and building their own custom reports
· The application would have several modules, one of which would service field
staff for inquiry into the data values
· Oracle Discoverer and other 3rd party tools (such as Tivoli), to be used for
system health tracking and monitoring purposes.
Solution Strategies—Integrating Process Orientation and Bulk Data Operations
A Transient Solution
The solution is termed transient because of the nature of consumption of the
data. Though modeled largely as a data warehouse and star schemas, the
consumption is relational as a ‘unit of work’ and associated information.
The storage in the DWH model is therefore transitional, in support of
aggregations and several pseudo star schemas in the form of Oracle Nested Tables
of relevant look-up and rollup information.
‘Silo’s of FACT Tables
Multiple fact tables for Transactions and Trial Balance facts would be designed
based on the nature of data uploads. These multiple fact tables share the same
conformed dimensions. Some examples of various types of uploads are:
· Transaction records uploaded in batch mode several times in a day (a few
hundreds of thousands of records)
· Uploads of a few thousand transactions at a time through online transaction
processing and client-server screens several times a day and more frequently
(several times an hour) during the first week of a new fiscal month
· Adjustments processed as transaction vectors from approved journal vouchers
made continuously by accountants simultaneously looking at the updated balances
and summaries
Flexible Data Modeling
Some features would be suitable for rapidly changing dimensions and accounting
rules from year to year:
· A metadata-driven application and data management
· Dimensions and accounting rules that are based on mapping tables
· The roll-forward process accounts for cumulative values, year-to-date values
and opening and closing of accounts based on accounting rules
· Divergent end-user communities across functional and organizational
hierarchies and geographies called for the same data values but using different
business names. Metadata-driven business allocation at run time is based on user
roles and community affiliation.
Infrastructure for Flexible Aggregation
Summaries and aggregates would be based on a meta information-driven table that
decides on the involvement of views, materialized views (for query rewrite),
merge frequency and merge SQL, create and rebuild options and create SQL, etc.
Hybrid Partitioning Scheme
Though the granularity of accounting is for a month, for easy storage and
archiving (through read-only tablespace options), the architecture could be
built on a hybrid-partitioning scheme with fiscal months as RANGE partitions
under which an operational day is built as LIST-based sub-partitions. This
approach would quicken the ETL and aggregation processes and enable data upload
several times a day. Sub-partition exchanges would offer instantaneous data
transfers between staging and the main fact tables.
Each fiscal-month partition would have 61 days of a month defined. Thus when
past calendar months are still open for business, several concurrent days of
operation, into the new month, could be accommodated. The year-end process is
similar; extending each month to several calendar days beyond their natural end
days would be workable.
‘MERGE’ Operation-Based Processing Coupled with ROLLFWD Methods
Without the powerful bulk MERGE operation in Oracle9i, a cursor-based
programmatic SELECT and UPDATE operation from across previous partitions of the
Fact tables would be necessitated. Rolling forward the year-to-date and
cumulative-since-inception values into future partitions, ahead of anticipated
future transactions, enables inline merge operations to preserve these values.
Rebuild Mechanisms for Derived Facts
For the complexity explained above, many facts and summaries are not conducive
to rebuild in simple steps. The aggregation mechanisms need to account for the
summation plus the roll-forward steps. The rebuild mechanisms are not only
needed to recover from a logical corruption, but also for accommodating changing
accounting rules.
Consumer and Resource Groups
The Database Resource Manager (DRM) can be used to prioritize users into
consumer groups that are then related to plan directives that would limit
resources.
Though only one resource plan can be active at a given time, this arrangement
offers the following benefits:
· Guarantees certain users or groups a minimum amount of processing resources
regardless of the load on the database system and the number of users
· Better distributes available database processing resources by allocating
minimum percentages of database CPU time to different users and applications
· Limits the degree of parallelism that a set of users can use
· Dynamically switches database resource consumption plans
Eyes Wide Open
Manage Customer Expectations
Managing and packaging functionality in line with the system’s capacity will
be critical to the success of developing a complex and intricate hybrid system.
Be Wary of Tools Being Used
Avoid interoperability-challenged version upgrades.
Data Cleansing and ETL Process Variations
Generating clean data for upload is a critical success factor. Time spent to
cleanse data from legacy systems will help avoid repeated loading cycles. By the
same token, be prepared to refine and revamp the ETL processes until the final
data structures and mappings suitable to the hybrid usage emerge.
Prevailing Technological Limitations
Bitmaps vs. B-tree indexing
There is too wide a gap between these two technologies. The CBO’s index merge
operations are still not prime time to meet such hybrid challenges.
Query Rewrite Mechanism
Prevailing QR technology based on SQL statements is inadequate in using
summaries for cumulative and YTD values. A Query Rewrite technique based on the
column lists of existing MVs and attribute-measure aggregation will be much more
productive.
Ideal Architecture on ‘Active Record’ Pattern
A design based on an active record pattern would be ideal for such a hybrid
system. In this pattern (typically used in an object-oriented system), an
object’s data is retrieved upon initiation and is retained within its
attributes. This is possible in a database centric system, such as Oracle,
through nesting a parent and multiple detail tables.