Thursday, July 18, 2013

The Engineering of the Intelligent Backup (III)

Estimating the Optimal Backup Parallel Degree

The postulates presented here are in the public domain. However, the hypotheses and theorems on which the postulates are based are not in the public domain and will be copyrighted under the ADN REsearch logo.

Following from postulate 2, a mechanism is needed to enhance backup performance: identifying the optimal degree of parallelism to be used. In RMAN, the parallel degree is specified either when configuring the default settings used by RMAN operations or explicitly within an RMAN script.
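The two ways of setting the parallel degree can be sketched with a small Python helper that renders the corresponding RMAN text. The function names, the channel names (c1, c2, ...) and the DISK device type are assumptions chosen for illustration; only the RMAN keywords themselves (CONFIGURE DEVICE TYPE ... PARALLELISM, ALLOCATE CHANNEL, BACKUP DATABASE) come from RMAN.

```python
def configure_parallelism(degree: int, device: str = "DISK") -> str:
    """Render the persistent RMAN default for the parallel degree."""
    if degree < 1:
        raise ValueError("parallel degree must be a positive integer")
    return (f"CONFIGURE DEVICE TYPE {device} PARALLELISM {degree} "
            "BACKUP TYPE TO BACKUPSET;")


def run_block_with_channels(degree: int, device: str = "DISK") -> str:
    """Render a RUN block that allocates one channel per parallel stream."""
    if degree < 1:
        raise ValueError("parallel degree must be a positive integer")
    lines = ["RUN {"]
    for i in range(1, degree + 1):
        # Channel names c1..cN are hypothetical; any identifier would do.
        lines.append(f"  ALLOCATE CHANNEL c{i} DEVICE TYPE {device};")
    lines.append("  BACKUP DATABASE;")
    lines.append("}")
    return "\n".join(lines)


print(configure_parallelism(4))
print(run_block_with_channels(4))
```

The first form changes the default for every subsequent operation, while the second limits the chosen degree to one script, which may be preferable while experimenting with different degrees.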

A rather informal study suggests that the minimum parallel degree used in production should be at least 4, simply because it yields a significant improvement over a lower degree such as 2 or 1. However, identifying the optimal parallel degree in a scenario where there are several huge files (outliers by size) requires a statistical study or a heuristic mathematical formulation. A statistical study should produce a 95% confidence interval, using a sampling model where the mean and the variance are known.
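As a rough illustration of the statistical route, the sketch below computes a 95% confidence interval for the mean data file size from a sample, assuming the population standard deviation is known (a z-interval). The sample values and the value of sigma are hypothetical, and the post does not say how the interval would then be mapped to a parallel degree, so the sketch stops at the interval itself.

```python
from math import sqrt
from statistics import NormalDist, fmean


def mean_confidence_interval(sample, sigma, confidence=0.95):
    """Return (low, high) for the population mean, given a known sigma."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    centre = fmean(sample)
    half_width = z * sigma / sqrt(n)
    return centre - half_width, centre + half_width


# Hypothetical sample of data file sizes in GB and a hypothetical known sigma.
sample_gb = [1, 2, 2, 4, 4, 8, 8, 16]
low, high = mean_confidence_interval(sample_gb, sigma=5.0)
print(f"95% CI for the mean file size: [{low:.2f}, {high:.2f}] GB")
```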

In my experience, a mathematical formulation is possible based on the variance (or standard deviation) and the average size of all files being backed up.

The mathematical formulation proposed for this range is the following formula:

where σ is the standard deviation of the data file sizes and k = 1, 2, 3, …, i.e., a positive integer.  This means that an appropriate parallel degree should be established on the basis of the variability of the data file sizes rather than on the number of files being backed up.

I can summarize this new postulate as follows:

Postulate 3:

Derived from postulate 2, there exists a parallel degree such that the backup performance is optimized on the basis of duration.  The parallel degree can be established within a closed interval expressed in terms of k and σ, where k is a positive integer greater than or equal to 2 and σ is the standard deviation of the population of all (N) data file sizes.  Furthermore, an additional adjustment could take place when the ratio of the largest file to the largest small file is significant, e.g., larger than 1000.  Such an adjustment could be based on that ratio.  In Oracle technology, this applies regardless of whether the data file belongs to a SMALLFILE or a BIGFILE tablespace, i.e., whether one or more data files are allowed in a tablespace.  Alternatively, a parallel degree can be established through a statistical model derived from a sampling model with a 95% confidence interval, where data file sizes are used as input and specific transformations, such as logarithmic transformations, are applied to the model in order to attain reasonable values for the expected parallel degree range.
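The ratio test mentioned in the postulate can be sketched as follows. The postulate does not say how "small" files are told apart from the outliers, so this sketch assumes one possible rule: sort the sizes and cut at the largest gap on a logarithmic scale. That rule, the function names and the sample sizes are illustrative assumptions; only the threshold of 1000 comes from the text.

```python
from math import log2


def split_at_largest_log_gap(sizes):
    """Split sizes into (small, outliers) at the widest gap in log2 space."""
    ordered = sorted(sizes)
    if len(ordered) < 2:
        raise ValueError("need at least two file sizes")
    if ordered[0] <= 0:
        raise ValueError("file sizes must be positive")
    gaps = [log2(b) - log2(a) for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps)) + 1  # first index of the outlier group
    return ordered[:cut], ordered[cut:]


def outlier_ratio(sizes):
    """Ratio of the largest file to the largest small file."""
    small, outliers = split_at_largest_log_gap(sizes)
    return outliers[-1] / small[-1]


# Hypothetical file sizes in GB: five ordinary files and two very large ones.
sizes_gb = [1, 2, 4, 8, 16, 32768, 131072]
ratio = outlier_ratio(sizes_gb)
needs_adjustment = ratio > 1000  # threshold suggested in the postulate
print(ratio, needs_adjustment)
```

With these sizes the widest gap lies between 16 GB and 32768 GB, so the ratio is 131072 / 16 = 8192 and the adjustment would be triggered.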

Exhibit. An example of estimating the optimal degree of parallelism (as in an optimal economic model of returns) for a database with 100 files of size 1GB, 100 files of size 2GB, 100 files of size 4GB, 100 files of size 8GB, 50 files of size 16GB, 10 files of size 32TB, 4 files of size 64TB, and 2 files of size 128TB.
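The inputs that the formulation relies on can be computed for this file population with a short Python sketch. It assumes binary units (1 TB = 1024 GB) and treats the 16GB files as the largest "small" files; both are assumptions, as are the variable names. It derives only the mean, the population standard deviation σ and the outlier ratio, not the parallel degree itself.

```python
from statistics import fmean, pstdev

GB_PER_TB = 1024  # assumption: binary units

# (number of files, size in GB) as listed in the exhibit
exhibit = [
    (100, 1), (100, 2), (100, 4), (100, 8), (50, 16),
    (10, 32 * GB_PER_TB), (4, 64 * GB_PER_TB), (2, 128 * GB_PER_TB),
]

sizes_gb = [size for count, size in exhibit for _ in range(count)]

n_files = len(sizes_gb)          # N, the population size
total_gb = sum(sizes_gb)
mean_gb = fmean(sizes_gb)
sigma_gb = pstdev(sizes_gb)      # population standard deviation
largest_small_gb = 16            # assumption: 16GB is the largest small file
ratio = max(sizes_gb) / largest_small_gb

print(f"N = {n_files}, total = {total_gb} GB")
print(f"mean = {mean_gb:.2f} GB, sigma = {sigma_gb:.2f} GB")
print(f"largest file / largest small file = {ratio:.0f}")
```

The population holds 466 files, and the ratio of the largest file to the largest small file is 8192, well above the threshold of 1000 mentioned in postulate 3. The standard deviation also exceeds the mean, which reflects how strongly the few multi-terabyte files dominate the variability.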

From postulate 3, it is possible to suggest that the optimal value for the FILESPERSET RMAN parameter is either 1 or 2, in order to minimize the seek time for restore operations as well. However, MAXSETSIZE should probably be either left alone (UNLIMITED) or otherwise controlled by the mean size of all data files involved, excluding the outlier files. A different approach, with future capacity planning in mind, could use double the size of the largest file among the small files.
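The two MAXSETSIZE heuristics above can be compared with a brief sketch. The function names are hypothetical, the small-file sizes are taken from the exhibit, and rounding the limit up to whole gigabytes is an assumption made to keep the rendered command simple.

```python
from math import ceil
from statistics import fmean


def maxsetsize_candidates_gb(small_sizes_gb):
    """Return both heuristic limits (in GB) from the non-outlier files."""
    if not small_sizes_gb:
        raise ValueError("need at least one non-outlier file size")
    return {
        "mean_of_small_files": fmean(small_sizes_gb),
        "double_largest_small_file": 2 * max(small_sizes_gb),
    }


def render_settings(maxsetsize_gb=None, filesperset=1):
    """Render the RMAN commands; None leaves MAXSETSIZE at UNLIMITED."""
    limit = "UNLIMITED" if maxsetsize_gb is None else f"{ceil(maxsetsize_gb)}G"
    return [
        f"CONFIGURE MAXSETSIZE TO {limit};",
        f"BACKUP DATABASE FILESPERSET {filesperset};",
    ]


# Non-outlier files from the exhibit, sizes in GB.
small_gb = [1] * 100 + [2] * 100 + [4] * 100 + [8] * 100 + [16] * 50
candidates = maxsetsize_candidates_gb(small_gb)
print(candidates)
for command in render_settings(candidates["double_largest_small_file"]):
    print(command)
```

A limit derived from the small files cannot hold the outlier files, so a value like this would suit backups restricted to the small files, with the outliers handled under the UNLIMITED default; this is worth checking against the RMAN documentation before use.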