Saturday, June 29, 2013

The Engineering of the Intelligent Backup (II)

An Open Call to Study and Research 
“the Backup Fatigue Syndrome”

During my recent NYOUG presentation at St. John's University, I referred to the relevance of sorting data files by size in relation to backup performance, and to the performance degradation that occurs when significantly larger files are left to the end of the backup, which I will hereby refer to as "the backup fatigue syndrome." This brief article is a summary of informal statistics on this problem, which I have never formalized into rigorous statistical research. The syndrome appears to be associated with various factors such as, but not limited to, logic inherent to OS-level I/O, logic associated with generic backupset-driven technology, and the inefficiency of large storage devices in general. The combination of such factors, among others, is particularly conducive to the appearance of the backup fatigue syndrome.

For the past twenty years or so, I have observed various scenarios where a backup operation took somewhat longer, or much longer, than expected. These scenarios involved not only various instances of Oracle database backup but also OS-level backups (Windows, Linux, MacOS, and Solaris) and network-driven backups, across different media types, essentially tape or disk. These observations led me to systematic testing to verify that the order by size in which files or backup pieces are used to create the corresponding backup sets has a significant impact on the backup duration.

This could simply mean that a backup operation is not commutative with respect to content (data file) size: the order (by size) in which data files are backed up does result in different durations, in particular when significantly large files (huge in comparison to the rest of the data files in the backup) are placed at the end of the backup set, or last in the order in which files are backed up. As a mathematical analogy, this means the abstraction would not allow me to define an Abelian group, or more exactly, a class or abstract object that defines such a group, since the order by size does have an impact. Comparing mathematical concepts is quite practical here, although it may appear inappropriate to some readers.
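To make the order dependence concrete, here is a minimal Python sketch (entirely hypothetical, not RMAN itself) in which each file's backup time is modeled as simply proportional to its size and files are dispatched over a fixed number of channels. The file counts, sizes, and channel count are illustrative assumptions; the point is that the elapsed time (makespan) differs when the huge file is backed up last versus first:

```python
# Minimal sketch (hypothetical, not RMAN itself): each file's backup time
# is modeled as proportional to its size, dispatched over 4 channels.

def makespan(files_mb, channels):
    """Dispatch each file, in the given order, to the channel free first."""
    loads = [0] * channels
    for size in files_mb:
        i = loads.index(min(loads))  # next available channel
        loads[i] += size             # time ~ size (uniform I/O rate assumed)
    return max(loads)

small = [100] * 40       # forty 100 MB data files
huge = [4000]            # one 4 GB data file

last = makespan(small + huge, 4)    # huge file left for last
first = makespan(huge + small, 4)   # huge file started first

print(last, first)  # → 5000 4000 (model time units)
```

Under this idealized model the total work is identical in both orders; only the overlap changes, which is why the degradation shows up in elapsed time rather than in bytes processed.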

The purpose of this note is to encourage comprehensive research on this topic. Explicitly, it is important to determine the following:

1.     The proportion, if any, of large files to small files at which backup performance latency occurs and results in exponential degradation when only a serial channel is used, and at which nearly exponential degradation occurs when parallelism is used.

2.     A method to determine such a proportion of small files to large files, and the proportion between the number of small files and the number of larger files; such as, for instance, in Oracle, the number of SMALLFILE tablespace data files in comparison to the number of BIGFILE tablespace data files. The proportion could be established via an actual mathematical ratio, a statistical index or regression model, or a stochastic model with controlled probability via a Bayes, Markov, or Lévy process.

3.     The smoothing of the degradation impact as various levels of parallelism are used or increased. In Oracle RMAN technology, this explicitly relates to the number of channels actually used relative to the number configured.

4.     The average ratio of the (typically small) number of large files to the (typically large) number of small files at which the backup fatigue syndrome does not occur; otherwise said, whether there is a specific generalized proportion that one can use without the degradation occurring.

5.     A (formula-based) method to determine a factor or index to properly estimate the appropriate or custom level of parallelism for each backup operation, as needed.
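As a purely hypothetical starting point for item 5 (an assumption of mine, not an established formula), one could bound the useful degree of parallelism by how much of the remaining work can overlap the single largest file, since in an idealized model the largest file alone bounds the elapsed time:

```python
# Hypothetical heuristic (an assumption, not an established formula): beyond
# ceil(total_size / largest_file), extra channels cannot shorten an idealized
# backup, because the largest file alone bounds the elapsed time.

import math

def suggest_parallelism(sizes_mb, max_channels=16):
    total = sum(sizes_mb)
    largest = max(sizes_mb)
    return min(max_channels, max(1, math.ceil(total / largest)))

print(suggest_parallelism([100] * 40 + [4000]))  # → 2
```

A real factor would of course have to account for I/O contention, media type, and the other environmental factors discussed below; this only illustrates the kind of formula-based index the research should produce.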

Based on analysis derived from a good number of observations made while running RMAN and OS-level backups where significantly larger files were left at the end of the backup or backup set, it is possible to postulate the "Backup Fatigue Syndrome" as follows:

Postulate I

When a significantly large file is left at the end of a backup, after a comparatively very large number of much smaller files have already been backed up, there will be a significant degradation in the backup performance, and the backup duration will appear to increase exponentially.

Postulate II

Based on consistently replicated observations, if the order of data files by size is an important factor in attaining optimal backup performance, then there exists an optimal backup order by size for any backup operation, such that the backup duration is optimized.

The following observations are made in relation to the parallel degree used, namely:

·      For serial backups (parallel degree 1), based on a good number of observations, sorting the data files by decreasing size should produce an optimal duration for backup operations, under similar environmental factors such as processor, memory, operating system, storage, and various others.

·      For parallel backups (parallel degree 2 or greater), based on a good number of observations, distributing the data files (sorted by decreasing size) across the parallelized backup sets should produce an optimal duration for backup operations, under similar environmental factors such as processor, memory, operating system, OS clustering, database clustering (RAC, HARD, or other similar), storage, and various others.
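The parallel observation above is, in effect, the classic longest-processing-time-first (LPT) scheduling heuristic. A sketch under the same idealized size-proportional model (again with hypothetical file sizes and two channels) shows decreasing order finishing sooner than increasing order:

```python
# Sketch comparing backup order on 2 hypothetical parallel channels, with
# per-file backup time modeled as proportional to file size (idealized).

def makespan(files_mb, channels):
    """Dispatch each file, in the given order, to the channel free first."""
    loads = [0] * channels
    for size in files_mb:
        i = loads.index(min(loads))
        loads[i] += size
    return max(loads)

files = [3000, 2000, 2000] + [100] * 30   # a few large files, many small ones

naive = makespan(sorted(files), 2)              # increasing size: large last
lpt = makespan(sorted(files, reverse=True), 2)  # decreasing size (LPT order)

print(naive, lpt)  # decreasing order finishes sooner
```

In this run the decreasing order reaches the ideal elapsed time (total work divided by channels), while the increasing order leaves the largest file running alone at the end, which is exactly the straggler pattern the postulates describe.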

Based on these postulates, it is important to consider expanding RMAN backup capabilities, as well as third-party capabilities, to take advantage of these important observations driving the backup fatigue syndrome postulates.

Indeed, I believe that the backup fatigue syndrome exists as a combination of factors, in particular how backup pieces and backupsets are aggregated into a full backup, and various technology factors, such as large storage system management involving file-system-based, library-based, and raw devices.

Although I have a sufficient number of observations of perceived backup fatigue, repeated across similar scenarios, I currently do not have enough resources to conduct more formal research on this matter using the appropriate tools, statistical methods, and experiment design. I would therefore like to invite the storage networking community to join me in research on the backup fatigue syndrome, which could improve the logistics associated with backup technology in the era of big data and truly big data files.

One expectation that can be derived from these perspectives is a sort-by-size capability built into RMAN itself, rather than writing a script to attain these goals whenever a large number of small files is combined with a small number of huge files.
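Until such a capability exists, the scripted workaround can be sketched as follows; the file numbers and sizes below are hypothetical stand-ins for what would, in practice, be obtained by querying V$DATAFILE:

```python
# Hypothetical sketch of the scripted workaround: given (file#, bytes) pairs,
# which in practice would come from a query against V$DATAFILE, emit an RMAN
# BACKUP command listing the data files in decreasing size order.

datafiles = [                 # hypothetical (file#, bytes) pairs
    (1, 700 * 1024**2),       # 700 MB
    (2, 32 * 1024**3),        # 32 GB BIGFILE tablespace data file
    (3, 100 * 1024**2),       # 100 MB
]

ordered = sorted(datafiles, key=lambda df: df[1], reverse=True)
cmd = "BACKUP DATAFILE " + ", ".join(str(n) for n, _ in ordered) + ";"
print(cmd)  # → BACKUP DATAFILE 2, 1, 3;
```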
