Tech Hat: All about Batch Processing

I wanted to quickly come up to speed on Batch Processing and Spring Batch and found a great read on the Spring Batch documentation. Following is a summary of the Introduction to Spring Batch page that I created for my reference.

A Few Types of Inputs for Batch Processing

Files
Databases
Message queues

General Principles

Batch architecture typically affects the application architecture (which runs on-line) and vice-versa. Design both together using common building blocks
Simplify the batch process as much as possible. Avoid complex logical structures in batch applications
Process the data as close as possible to where the data exists (Eliminates the overhead in data transfer)
Minimize system resources use, especially I/O operations. Use internal memory operations as much as possible
Review I/O operations in SQLs. Avoid the following
- Reading data more than once (the data can be cached)
- Unnecessary table/index scans
- Not specifying key values in WHERE clauses
Avoid repetition of work in batch. For example, if reporting needs summarised data, do it during the initial data processing as much as possible
Allocate enough memory to avoid overhead in memory re-allocation during the batch job
Assume worst on data integrity. Include adequate checks to ensure integrity.
Implement checksums for input files to make sure the input processed is valid.
Do stress testing early as possible in production like environments using realistic workloads
If the batch processing works on files which are backed from production system, check the file backing up regularly to make sure that it happens properly

Main Building Blocks of Batch Processing

Conversion - Converts the different types of records to a format required for further processing. Most of the functionality may be provided by the batch framework
Validation - Validation of inputs (input files via checksums, file tails etc.) and cross checks for input data integrity
Extract - Extract records from an input source based on pre-defined rules, writes them to an output
Extract and Update - Extract records (as explained above), makes changes and writes to an output file
Process and Update - Processes records coming from an input source, more commonly from an Extraction or a Validation and writes to an output
Output/Format - Re-structures and/or formats input records to be printed or transferred to another application
General - There’s always logic which does not fall under above categories.

Utilities for Batch Processing

Sort
Split
Merge

Factors Defining Batch Processing Strategy

Batch system volume
Concurrency with on-line applications and/or other batch systems
Available batch windows

Batch Processing Strategies

Following are the typical batch processing approaches

During a pre-defined window where application is offline
Concurrent batch - Processing while application is on-line
Parallel processing - Many different batch runs or jobs at the same time)
Partitioning - Processing of many instances of the same job at the same time
A combination of these

These are discussed in details below;

1. During a pre-defined window where application is offline

This is the most simple form of batch processing
Usually a single commit after the processing is enough, but it’s always best to consider a more robust approach from the beginning since batch systems will grow in complexity and volume over time
Locking strategy is not required. However, a proper commit strategy and a recovery/re-start strategy must be defined

2. Concurrent Batch - Processing while application is on-line

Online application can update the data referred by the batch processing
A locking strategy is required. However, there should be a balance between locking and availability of data
Locking should be done for a smaller chunk of data. If it’s a DB Row Level Locking could be used
Updates (by the batch process or online app) should be committed to the db in smaller chunks often
If Row Level Locking is not supported or the input is a file, either Optimistic Locking or Pessimistic Locking could be used

Optimistic Locking

Suitable for scenarios where record contention probability is low
Maintains a timestamp against the record referred. When updating, the record is referred with the time stamp and the timestamp is updated to the latest time
If the timestamp differs, then another process has updated the record prior to this process. This should trigger a retry

Pessimistic Locking

Suitable for scenarios where record contention probability is high
A physical or logical lock should be obtained before data retrieval and released once data updated
- A DB column could be used to implement this. However, obtaining the lock and releasing should be managed in a thread safe manner
- A timeout period for the lock is also required to manage scenarios processes that might not be able to release the lock
To easier management, a dedicated Locking Manager becomes handy

Notes:

Both batch and business application should assume the same locking strategy and schema
If locking happens on a related group of data, the locking manager should be enhanced to maintain this metadata to effectively mange the locking.
- This would help minimise deadlocking scenarios

3. Parallel Processing

Multiple batch-runs/jobs to run in parallel
This helps minimising total batch processing time
If the batch-runs/jobs run in parallel do not share files, db tables, index spaces the complexity of the implementation is quite low. If not, the following strategies should be applied
- Partition data - Batch jobs operate on partitioned data
- Control Table - A table containing all shared resources and information on whether that’s been used at a given time by a process. Each process is to check this table before using shared resources
Could be implanted via multiple threads. However, the solution should be robust enough to ensure fair amount of processing power for all running processes
Following complexities exist in this approach
- Proper load balancing is required to ensure availability of files, database pools etc.
- Control Table, if used could be a highly shared resource and can be a bottleneck

4. Partitioning

Many instance of the same batch job running simultaneously, on mutually exclusive partitions of data
Operates on scenarios where either databases or files used can be or already is partitioned
The logic should be constructed to clearly communicate the partition of data that the batch job should operate on, and to make sure the job operates on only the communicated partition of data
The solution would be quite dependent on the input data partition strategy. However, should be flexible enough to support dynamic configuration of the number of partitions

Partitioning Strategies

Following are a few potential partitioning strategies. Selecting a strategy will vary from case to case

Fixed and Even Breakup of Records
- Breakup number of records to even number of portions
- Pre-processing is required to split the records in to partitions and/or identify lower and upper bounds for the partitions which could be an overhead

Breakup by Key Attribute
- Breakup record sets based on a key attribute (i.e. column) or a set of attributes
- If the attribute(s) are static, then each partition would have a batch process defined.
  - Dynamic configuration is not required
  - All new records will be attached to a batch job
- If the attributes are dynamic, manual configuration of new batch jobs (or updating existing) are required to cater new records under new attributes
- Cannot realise optimal even distribution of records per batch process

Breakup by Range of Value
- Each batch is assigned a range for an attribute (or attributes) to process
- Cannot realise optimal even distribution of records per batch process

Breakup by Views
- Similar to breakup by key attributes. But the separation is done at the DB level via DB views
- A batch is assigned a view
- New values for the attributes will require new views, hence new batch configurations

Breakup by Addition of a Process Indicator
- Introduce a new attribute to hold whether the record has been processed (successfully or otherwise) or not
- Each batch job can pick a record which is not processed successfully
- Can start many batch operations simultaneously
- All new records will be included automatically (new records should have the attribute set to not-processed)
- No need of dynamic configuration for batch jobs
- This would however, increase the I/O operations on the table

Extract DB Table to a Flat File and Split
- This require a lot of pre-processing for extracting db table to a flat file and splitting the file
- The advantages of partitioned batch processing may get nullified if the pre-processing overhead is too much

Use of a Hashing Column
- Quite similar to partitioning by attribute(s).

DB and Application Design Strategies for a Partitioned Environment

Using a central partition repository (i.e. a partition table) is recommended
- The partition table will contain static info about the data partition (i.e. partition number, program or process id using the partition, lower and upper bounds, attribute sets etc.)
On startup, batch processes should be communicated with the parameters to access the partition able and figure out which partition is to be worked on
Partition ID should be kept from start to end of processing since it’s important
- If the data has to be merged, then the output should carry the corresponding partition id
- Logging, error reporting should have the partition id mentioned for easier troubleshooting

Deadlock Prevention

Design table indices keeping deadlock prevention and performance in mind
The architecture should facilitate strategies such as wait and re-try etc.the database is not available (due to locking, overload etc.)

Parameter Passing and Validation

The architecture should support
- Retrieve and validate partition parameters before startup
- Pass parameters to batch jobs at run time
Validations should ensure that;
- There are sufficient partitions to cover the whole data set
- There are no gaps between partitions (i.e. 100% of records are processed)
When consolidation, the application design should address common concerns associated with partitioned processing such as;
- Whether all partitions should be processed before proceeding to the next step/job
- Recovery strategy if a batch operation on one partition fails

Tech Hat

Friday, June 3, 2016

All about Batch Processing

2 comments: