Sunday, July 19, 2009

Business Intelligence and Statistical Error

Surface Response Model: From a Bull/Cow Fertility Test
(in my Advanced Design of Experiments Class)

Statisticians often study errors of Type I to refer to the probability of rejecting the null hypothesis when it is in fact true and Type II when accepting that null hypothesis when it is false. Beyond that point, systematic error can occur in general when wrong or inaccurate information is entered into a database and subsequently propagated to various other areas of information systems replication and reporting. While the most common type of data error is inadmissibly the human error, there are other reasons why decision support systems, and business intelligence systems may be subject to erroneous decision or reporting, other common causes are inappropriate forecasting strategies, for instance inappropriate categorization and choosing the wrong sources of data verification and validation or relevant resources. The well-known belief that “garbage in, garbage out” will drive quality into inconsistency and in many other instances into wrong data derivation. In some data quality for a calibration procedure project I worked on, a procedural error was entered not by the code but by a random reading of data files, which caused for the procedure to introduce inappropriate measurements based on the timestamp associated with the data files. After I found why the algorithm, in spite of being accurate, would introduce an error, the lead researcher refused to accept an improved algorithm that would have eliminated the error, simply because this would have meant to redo the entire experiment. Therefore, the cell calibration test introduced a small percent of inconsistent data, which did not invalidate the regression, correlation, and factorial design model used to validate the test. The entire data went to an Oracle database, where it was analyzed, mined, and presented for calibration test of the CDMA wireless protocol. Indeed, it is obvious that time management was a key factor in making this decision. Having also worked directly with statisticians, biostatisticians and operations research analysts and managers, where my role was a DBA, data analyst, software engineer, software developer or simply consultant, I believe that such a decision might have been contested by others or at least carefully revised.

There are other scenarios, where a DBA role may be the underlying reason or perhaps not so directly. For instance, if a company adopts a DBA corporate approach, such that the CURSOR_SHARING parameter is set to SIMILAR, this may perform consistently in most databases, but could become an issue in many OLTP scenarios, in particular, in queries widely using bind variables. This is because the SQL bind peeking will not be so much concerned with the actual value of the bind variables. In those scenarios, setting CURSOR_SHARING to EXACT is more practical, since it avoids the typical scenario where the cached statement becomes the “SQL bind peeking bible” matching the query and reparsing of the statement never actually occurs. In these queries were used in BI environments, it would obvious that there is the potential and probability that at some point erroneous reporting occurs. Therefore, there is the possibility that a large amount of BI effort, such as applications, and tool-driven implementation can span into a significant level where some rows have been neglected for displaying. In some cases, I have tested, the rendering of result set from re-indexing of tables on additional columns can lead for this misbehavior of the SQL bind peeking and statement parsing to take place, which occurs mostly in OLTP databases.

While error and error propagation of various types is always possible, achieving data quality and integrity can only be measure in relation to the impact. For instance, in a research environment the error may have great or no significance, but it could be devastating in mission-critical environments such as medical and financial systems. Error tracking then becomes a pro-active objective to attain data integrity, consistency, and quality that is to be handled not only by manual human control and resilient data consistency but quality corporate auditing controls as well. In most instances, internal controls are more valuable than external controls such as outsourced data verification and validation, simply because internal resources usually know more about the applications, data domains, and data contents overall.

In some instances, database-driven data mining and BI tools, such as Discoverer, can be complemented not only their own visualization capabilities but also with the usage of other methodologies such as, for instance, surface response models and artificial intelligence mining. While visualization is unlikely to eliminate or resolve any existing database error, i.e., data quality or consistency, it can certainly congruently derive data trend and convey confirming or rejecting the desirable results to verify or validate. Therefore, visualization models are helpful to regression, correlation, and statistical tests, simply because they can verify and validate the data quality with one simple view. While BI models that include functions such as CUBE and ROLLUP, which can further aggregate and categorize data, visualization techniques are powerful tools for decision making in that they provide truthful real-life validation of data trends. Both database and statistical tools and models can expand their data mining capabilities with analysis of central tendency, — such as kurtosis and skewness—, and measures of variability such as mean deviation, variance and standard deviation, and further studies on analysis of variance, but it the end it goes for good to present to have chosen a visualization tool that provides the greatest expectations overall.