Business Intelligence Trends on Forecasting and Modeling with Oracle10g SQL
Uncasville, CT (Quest North East Conference at Moheegan Sun Resort), July 20, 2007) Among the most interesting aspects of Business Intelligence, we can encounter Forecasting and Modeling, which can extends to areas such as finance, marketing, and population studies among others. With the new Materialized View cubic features available in Oracle11g, modeling for forecasting can become a superior strategic task provided by this release, with the already outstanding capabilities available in the Oracle10g release. The Quest NE Conference was strong in business applications in various areas, in particular, finance, insurance, general CRM, and various others. The following is an excerpt from that paper.BUSINESS INTELLIGENCE FORECASTING AND MODELING WITH ORACLE10g SQL
Anthony D Noriega, MSCS, BSSE, OCP, MBA
EXECUTIVE OVERVIEW
Oracle10g SQL capabilities approaching Business Intelligence modeling and forecasting capabilities have enormously increased in comparison to previous releases. It has now leveraged syntax and format to align with its business intelligence tools and thirty party vendor tools as well.
There are many different strategies to both modeling and forecasting a serial behavior or pattern in business intelligence, such is finance and market data, with important applications in other important areas of management and research and development. There are methods that smooth historic values; other use time series adjustments; others more complex have been categorized as ARIMA-based methods. This paper discusses and studies various ways to forecast and model with SQL in various business intelligence areas, in particular using the SQL MODEL clause available in Oracle 10g OLAP and Data Warehousing analytics. As such, the discussion focuses in the possible combination of existing business methods and models and the Oracle dimensions, measures, and rules provided by the SQL model, with important consideration to standard and custom aggregation, in particular with functions such as rollup and cube. The discussion is focused on deterministic models, but an important evaluation of a stochastic model is also presented.
This paper emphasizes the importance of applications in finance, marketing, research and development, and other important areas of management science, forecasting, and data mining and dredging. The key aspects of this approach, unlike most business intelligence analytics, is that it is driven by Oracle10g SQL modeling ability to achieve successful forecasting, utilizing traditional and scientific business models based on the dimensions, measures, and rules implemented.
Oracle Data Warehousing and API built-in functions support some of the methods mentioned, in particular, time series, exponential smoothing, and moving average, among others.
THE BUSINESS FRAMEWORK
Forecasting models have had an enormous impact in business and industries successes or failures. Historic areas such as operating research, marketing, research, statistics and artificial intelligence have provided comprehensive technical support to engineer models that successfully achieve desirable goals. Historically, information technology has evolved to support more complex and comprehensive models. In particular, in areas such as finance, demographics, marketing, and healthcare. Important methods and models derived from management science have lead to making forecast modeling a more assertive, congruent, and consistent area of research. This paper will discuss the SQL usage and implementation of forecasting models, such as, namely:
· Exponential Smoothing
· Exponential Smoothing with Seasonal Adjustments
· Time Series Methods
· Moving Average
· Moving Average with Seasonal Adjustments
· ARMA (Auto-regressive Moving Average)
· ARIMA (Auto-regressive Integrated Moving Average)
The areas of business applications involve financial markets, financial derivatives, stock markets, and more explicit areas of relevant financial forecasting models such as financial and economic market models supporting steady error variance regardless of values applied to model (homoskedasticity) , usually associated to models with stationary, resilient, recycling or iterative performance. In the economic markets, forecasting on levels of demand and supply in historic markets can lead to the implementation of successful models for economy of scale, and, similarly, markets and models with iterative behaviors. Also, in demographics models to explicitly predict the population growth, the male-to-female rate, the projected per-capita income, over a large number of years, and other relevant census information that can lead to effective socio-demographics planning directly influencing cultural and socio-economic values in the long-term. Traditional complex models have been usually founded on equivalently complex statistical studies based on techniques such as Linear Algebra, Multiple Regression Analysis measuring variable colinearity, custom Factorial Designs, and in some cases using advanced or stochastic models such as Markov chains, Bayesian Models, and advanced Game Theory.
The focus of this paper is to present a paradigm to use Oracle SQL to implement a advanced models strategy that business can use advantageously in applications and custom solutions, and eventually in conjunction with more sophisticated ad hoc BI tools, such as Oracle Discoverer and Hyperion, among others. This capability enhances Oracle leading positioning in the BI market.
Time Frame Time Level Typical Forecasting Horizon Best Approach Product Level Other Dimension Levels
Short Week, Biweek, or Month Up to 18 months Time Series UPC, SKU, NDC, ISBN Level of interest
Medium Month or Quarter 6 to 36 months Causal Analysis Brand Level of interest
Long Quarter or higher 19 months to 5 years Expert Opinion Brand, Company, Market Level of interest
Table 1. Oracle Corporation Vision to Forecasting Strategies
TECHNICAL FRAMEWORK
The usage of Oracle’s Business Intelligence analytical and aggregate functions of various business natures, in conjunction with analytical operators, namely ROLLUP and CUBE, the model clause in conjunction with dimensions, measures, and rules, and other custom aggregation capabilities driven by grouping sets and identifiers. The following section briefly summarizes the most important aspects of SQL related topics. The following concepts are important to establish the modeling and forecasting framework.
ANALYTICAL WORKSPACE METHODOLOGY
In order to perform SQL forecasting, it is recommended to establish an enhanced Analytical Workspace, involving dimensions, measures, rules, resulting in a logical cube, and a model where appropriate. The analytical workspace can be based on:
· Star Schema
· Snowflake Schema
· Parent-Child Tables
LOGICAL CUBES
Logical cubes provide ways to sort out measures with the same shape (exact same dimensions), which have the same relationships on the same cube to other logical objects and are analyzed and displayed together.
LOGICAL MEASURES
Measures populate the cells of a logical cube with the facts collected on business operations. Measures are organized by dimensions, typically including a Time dimension and seek to mine overall detail data.
LOGICAL DIMENSIONS
Dimensions contain a set of distinct values that identify and categorize data. They build the edges of a logical cube, and thus become the measures within the cube.
LOGICAL HIERARCHIES AND LEVELS
A hierarchy is a manner to categorize data at various levels of aggregation. Upon displaying data, analysts use dimension hierarchies to distinguish trends at one level, drill down to lower levels to identify reasons for these trends, and roll up to higher levels to see what influence these trends have on a larger business segments.
LEVEL-BASED HIERARCHIES
A level represents a position in the hierarchy. A level above the base (or most detailed) level contains aggregate values for the levels below it. The members at different levels have a one-to-many parent-child relation.
VALUE-BASED HIERARCHIES
Although hierarchies are typically composed of levels, they do not have to be. The parent-child relations among dimension members may not define meaningful levels.
LOGICAL ATTRIBUTES
An attribute provides additional information about the data. Some attributes are used for display. For example, you might have a product dimension that uses Stock Keeping Units (SKUs) for dimension members.
FORECAST AND REGRESSION STATEMENTS BEYOND SQL
In addition to these analytical and aggregation functions, Oracle10g OLAP also offer Forecasting and Regression statements, such as FORECAST.report, SMOOTH, and REGRESS.report, among others, as well as aggregation statements.
There are two main general categories to use forecasting and regression model DML statements, namely
· Simple forecasts and regressions
· Forecasts and regressions using a forecasting context
ANALYTICAL DML STRUCTURE DETAILS
SQL CLAUSES
ANALYTICAL OPERATORS
There are two main analytical operators, namely:
· Rollup
· Cube
GROUP BY CLAUSE AGGREGATION OPTIONS
Among aggregation through explicit grouping by option supports some of the following BI analytical enhancements, namely:
· Aggregation using Grouping Sets
· Aggregation using Operators
· Aggregation with the Group Id and Grouping Id differentiators
· Customized Aggregation using the ODCIAggregate interface.
MULTIDIMENSIONAL ANALYSIS
Oracle10g supports both the relational (ROLAP) and the multidimensional (MOLAP) vision of analytics. The following clauses are supported:
DIMENSION STATEMENT
The following dimension illustrates an Oracle DBA efforts to model backup size pattern over different distributed databases in the corporation based on the RMAN_HIST view involving RMAN_RECORDS and LOCATION tables. After creation the DBA validates the view the DBMS_OLAP.validate_dimension procedure.
CREATE DIMENSION RMAN_REC_DIM
LEVEL DB_NAME IS (RMAN_HIST.DB_NAME)
LEVEL CITY IS (RMAN_HIST.CITY)
LEVEL STATE IS (RMAN_HIST.STATE)
LEVEL REGION IS (RMAN_HIST.REGION)
HIERARCHY INST_ROLLUP (DB_NAME CHILD OF CITY CHILD OF STATE CHILD OF REGION)
ATTRIBUTE DB_NAME DETERMINES (RMAN_HIST.HOSTNAME ,
RMAN_HIST.MAX_DURATION,
RMAN_HIST.AVG_DB_SIZE,
RMAN_HIST.RMAN_BKP_MAXSIZE,
RMAN_HIST.RMAN_BKP_MINSIZE,
RMAN_HIST.BACKUP_TYPE,
RMAN_HIST.MIN_DURATION
);
SQL>
1 BEGIN
2 dbms_olap.validate_dimension(
3 dimension_name => 'RMAN_REC_DIM',
4 dimension_owner => 'ANTHONY' );
5*
END;
SQL> /PL/SQL procedure successfully completed.
Exhibit 5A. Create Dimension Statement (OLAP SQL DDL) Exhibit 5B. Dimension Validation with DBMS_OLAP
PARTITION CLAUSE
The partition clause can follow the optional MAIN subclause and precedes all MODEL subclauses and it implies the larger aggregation granularity for the query result set.
MEASURES
The measure clause determines the dimension being calculated at every possible dimensional granularity.
RULES
The rules clause determines the rows to be used in filtering the query, and it can accept insert, updates, and upsert all transactions applied to the most detailed cells in general. The IGNORE NAV clauses allows to resolve the sparsity caused by unavailable data.
MODEL CLAUSE
MODEL[
[MAIN
[PARTITION BY (
DIMENSION BY (
MEASURES (
[RULES]
RETURN {ALL|UPDATED} ROWS
[IGNORE NAV | [KEEP NAV] [UNIQUE DIMENSION | UNIQUE SINGLE REFERENCE]
DIMENSION BY (
MEASURES (
Exhibit 6A. Expanded MODEL Clause Syntax. Exhibit 6B. MODEL Clause Syntax Flowchart
MODEL CONDITIONS
· Using Math and Logical Operators
· IS ANY
· IS PRESENT
UNIFIED PERSPECTIVE
CURRENT AREAS OF APPLICATIONS
Modeling and Forecasting with SQL applications currently involve mostly econometrics, finance, and marketing research areas. It is also important to highlight data analysis in scientific applications such as research and development and LIMS. The critical integration with text management and unstructured data, such as XML, improve the ability to enhance modeling and forecasting with SQL with a variety of areas of application with OLAP and Data Warehousing influence.
SQL INVOLVEMENT
SQL’s improved ability to set up and manipulate custom collections and matrix-based containers, beyond conventional relational tables, allows for the aligned of SQL with other Oracle’s OLAP API and custom J2EE, XML, and SOA-driven applications.
BUSINESS INTELLIGENCE SOFTWARE INTEGRATION AND EXPECTED FUTURE BENEFITS
With the support of Analytic Workspace Manager, the OLAP Registry and metadata, and the usage of tools such as Discoverer, the SQL productivity in customized and ad hoc queries is extremely valuable at the present time and promises a brilliant future. The integration derived from tools API and SQL and PL/SQL API can increase and leverage productivity for J2EE/Java and other environments using other approaches to business intelligence forecasting and modeling.
CONCLUDING REMARKS
Upon completing this study on forecasting and modeling with Oracle10g SQL, these are the most relevant remarks:
· Oracle SQL analytical framework has attained an outstanding level of maturity deserving a leadership positioning.
· Oracle SQL and OLAP API can integrate with custom J2EE, XML, and .net applications, and for a key reporting Data Warehousing capability with Text and related tools.
· SQL Model capabilities have achieved a transactional and reporting level which allows the visualization of modeling and forecasting in query-like mode, with greater flexibility than conventional statistical analysis and modeling tools.
· Oracle10g support through both OLAP DML and API and SQL is essentially solid for conventional forecasting methods such as Time Series (Moving Average and ARIMA) and
Exponential Smoothing, among others.
SQL EXAMPLES
SELECT RTRIM(country) country,
SUBSTR(product,1,16) product,
year,
sales
FROM sales_view
WHERE country in ('France', 'China')
MODEL RETURN UPDATED ROWS
MAIN simple_model
PARTITION BY (country)
DIMENSION BY (product, year)
MEASURES (s)
RULES (
s[‘iPod’, 2002] = 2000,
s[‘iPod’, 2003] = s[‘iPod’, 2001] + s[‘iPod’, 2000],
s['PlayStation2', 2002] = s['PlayStation2', 2001]
)
ORDER BY country, product, year;
SELECT DISTINCT R,C,H,DB,AVG_S,rs,max_dur,dur_diff,slope,r2
FROM ( SELECT region r ,
city c,
hostname h ,
db_name db,
avg(avg_db_size) over (partition by region,city order by region,city) avg_s,
avg(rman_bkp_maxsize) over (partition by region,city order by region,city) rs ,
avg(max_duration) over (partition by region,city order by region,city) max_dur,
avg(max_duration-min_duration) over (partition by region,city order by region,city) dur_diff,
regr_slope(avg_db_size,max_duration) over (partition by region,city order by region,city) slope,
regr_r2(avg_db_size,max_duration) over (partition by region,city order by region,city) r2 from rman_hist ) A order by 1,2,3;
APPENDIX
GLOSSARY
AGGREGATION (SUMMARIZATION)
The process of consolidating data values into a single value. For example, sales data could be collected on a daily basis and then be aggregated to the week level, the week data could be aggregated to the month level, and so on.
ANALYTIC WORKSPACE
A dimensional schema stored in a LOB table in the relational database. An analytic workspace can contain a variety of objects. The OLAP DML is the basic, low-level language for working in an analytic workspace.
ANALYZING TRENDS
Drilling up and down through hierarchies Rotating to change the dimensional orientation
ANCESTOR
A value at any level higher than a given value in a hierarchy. For example, in a Time dimension, the value 2002 might be the ancestor of the values Q1-02 and Jan-02.
ARIMA (AUTO-REGRESSIVE INTEGRATED MOVING AVERAGE)
The general model introduced by Box and Jenkins (1976) includes autoregressive as well as moving average parameters, and explicitly includes differencing in the formulation of the model. Specifically, the three types of parameters in the model are: the autoregressive parameters (p), the number of differencing passes (d), and moving average parameters (q). In the notation introduced by Box and Jenkins, models are summarized as ARIMA (p, d, q); so, for example, a model described as (0, 1, 2) means that it contains 0 (zero) autoregressive (p) parameters and 2 moving average (q) parameters which were computed for the series after it was differenced once. The notation AR(p) refers to the autoregressive model of order p. The AR(p) model is written ARMA (AUTO-REGRESSIVE MOVING AVERAGE) . where are the parameters of the model, c is a constant and is an error or residual term . The constant term is omitted by many authors for simplicity
CALCULATED MEASURE
A measure that is calculated at run-time. The result set includes a value for each dimension member currently in status.
CELL
A single data value of an expression.
CONTROLLED SPARSITY
Occurs when a range of values of one or more dimensions has no data; for example, a new variable dimensioned by month for which you do not have data for past months.
CUBE
A logical organization of measures with identical dimensions. The edges of the cube contain dimension members and the body of the cube contains data values.
DATA WAREHOUSE A relational database that is designed for query and analysis rather than transaction processing.
DATABASE STANDARD FORM
An analytic workspace that has been constructed with a specific set of objects, such as hierarchy dimensions, level dimensions, parent relations, and level relations.
DESCENDANT
A dimension member at any level below a particular member in a hierarchy level.
DICTIONARY
The collection of object definitions in an analytic workspace. The dictionary is also called the workspace dictionary.
DIMENSION
A structure that categorizes data. Among the most common dimensions for sales-oriented data are time, geography, and product. Most dimensions have hierarchies.In an analytic workspace, a dimension is a container for a list of values.
DIMENSION TABLE
A relational table that stores all or part of the values for a logical dimension in a star or snowflake schema.
DIMENSION VIEW
A relational view of data in an analytic workspace that contains the same types of data as a dimension table in a star schema, that is, columns for dimension members and attributes.
DOUBLE EXPONENTIAL SMOOTHING
Identifies the trend, and adjusts the forecast data to reflect this trend instead of generating a single parameter for all forecast periods.
DRILL
To navigate from one item to a set of related items.
DRILL DOWN
To expand the view to include child values that are associated with parent values in the hierarchy.
DRILL UP
To collapse the list of descendant values that are associated with a parent value in the hierarchy.
EIF FILE
A specially formatted file for transferring data between analytic workspaces.
EXPONENTIAL SMOOTHING
A forecasting method based on the trend provided by a previous actual value and a estimated value and related probabilties, p and q=1-p.
FACT TABLE
A table in a star schema that contains facts.
FORMULA A type of workspace object that represents a stored calculation, expression, or procedure that produces a value.
HOLT-WINTERS EXPONENTIAL SMOOTHING
Identifies both trend and seasonal variance, and adjusts the forecast data to reflect these factors. This method is particularly sensitive to both high and low outliers.
MARKOV CHAINS
A matrix for which all the column vectors are probability vectors is called transition or stochastic matrix. Andrei Markov, a russian mathematician, was the first one to study these matrices. A Markov chain is a sequence of random variables X1, X2, X3, ... with the Markov property, namely that, given the present state, the future and past states are independent. A series with the Markov property is a sequence of states for which the conditional probability distribution of a state in the future can be deduced using only the current state
MEASURE VIEW
A relational view of data in analytic workspace that contains the same types of data as a fact table in a star schema. However, in addition to the base-level facts, a measure view also contains derived data, such as aggregates and inter-row calculations.
METADATA
Data that describes data and other structures, such as objects, business rules, and processes.
MODEL
A type of analytic workspace object that contains a set of interrelated equations, which are used to calculate data and assign it to a variable or dimension value. Models are used frequently when working with financial data. In a relational database, the process of removing redundancy in data by separating the data into multiple tables. Contrast with denormalized.
MOVING AVERAGE WITH SEASONAL ADJUSTMENTS
A time series model that receives special adjustments based on periodic seasonal adjustments.
OBJECT
In an analytic workspace, a distinct item in the workspace dictionary. Analytic workspaces consist of one or more objects, such as variables, formulas, dimensions, relations, and programs, which are used to organize, store, and retrieve data.
OLAP DML
The low-level data definition and manipulation language for analytic workspaces.
ON THE FLY
Calculated at run-time in response to a specific query. In an analytic workspace, calculated measures and custom members are typically calculated on the fly. Aggregate data can be precalculated, calculated on the fly, or a combination of the two methods. This concept contrasts with precalculate.
ONLINE TRANSACTION PROCESSING (OLTP)
Systems optimized for fast and reliable transaction handling. Compared to data analysis systems, most OLTP interactions involve a relatively small number of rows, but a larger group of tables.
PARENT-CHILD RELATION
A one-to-many relationship between one parent and one or more children in a hierarchical dimension. For example, New York (at the state level) might be the parent of Albany, Buffalo, Poughkeepsie, and Rochester (at the city level).
PRECALCULATE
Calculated and stored as a data maintenance procedure. In an analytic workspace, aggregate data can be precalculated, calculated on the fly, or a combination of the two methods. Contrast with on the fly.
RANDOM SPARSITY
Occurs when NA values are scattered throughout the variable, usually because some combinations of dimension values never have any data. For example, a district might only sell certain products and never have data for other products.
RELATION
A type of workspace object that is similar to a variable, except that it restricts its data values to the members of a particular dimension. A relation establishes a correspondence between the values of a given dimension and the values of that dimension or other dimensions in the database.
ROLE
The function of a workspace object within its broader categorization of object type. For example, a variable that stores numeric business measures has a role of measure. A variable that stores descriptive product names has a role of attribute. Both are variables, but they contain different types of information and play different roles in the dimensional model.
SCHEMA
A collection of related database objects.
SINGLE EXPONENTIAL SMOOTHING
Identifies the percentage of weight given to the prior period and all other historical periods. It does not adjust for trend or for seasonal variance.
SNOWFLAKE SCHEMA
A type of star schema in which the dimension tables are partly or fully normalized.
SPARSITY
A concept that refers to multidimensional data in which a relatively high percentage of the combinations of dimension values do not contain actual data. Such "empty," or NA, values can take up storage space in an analytic workspace. To handle sparse data efficiently, you can create a composite.
STAR QUERY
A join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.
STAR SCHEMA
A relational schema whose design represents a dimensional data model. The star schema consists of one or more fact tables and one or more dimension tables that are related through foreign keys.
UPDATE WINDOW
The length of time available for loading new data into your database.
VARIABLE A type of workspace object that stores data. The data type of a variable indicates the kind of data that it contains, such as numeric or text data.
TERMINOLOGY FOR SIMPLE REGRESSION
y x
Dependent Variable Independent Variable
Explained Variable Explanatory Variable
Response Variable Control Variable
Predicted Variable Predictor Variable
Regressand Regressor
No comments:
Post a Comment