Friday, June 3, 2016

All about Batch Processing

I wanted to quickly come up to speed on Batch Processing and Spring Batch, and found a great read in the Spring Batch documentation. Following is a summary of the Introduction to Spring Batch page, which I created for my own reference.


A Few Types of Inputs for Batch Processing
  • Files
  • Databases
  • Message queues

General Principles

  • Batch architecture typically affects the on-line application architecture and vice versa. Design both together using common building blocks
  • Simplify the batch process as much as possible. Avoid complex logical structures in batch applications
  • Process the data as close as possible to where it lives (this eliminates data transfer overhead)
  • Minimize the use of system resources, especially I/O operations. Use internal memory operations as much as possible
  • Review I/O operations in SQL. Avoid the following:
    • Reading data more than once (the data can be cached)
    • Unnecessary table/index scans
    • Omitting key values in WHERE clauses
  • Avoid repeating work across batch steps. For example, if reporting needs summarised data, produce the summaries during the initial data processing as much as possible
  • Allocate enough memory to avoid overhead in memory re-allocation during the batch job
  • Assume the worst about data integrity. Include adequate checks to ensure integrity.
  • Implement checksums for input files to make sure the input processed is valid (a sketch follows this list).
  • Do stress testing as early as possible, in production-like environments, using realistic workloads
  • If the batch process works on files backed up from the production system, check the backup regularly to make sure it happens properly
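
To make the checksum point concrete, here's a rough Java sketch. The file names and the sidecar file carrying the expected value are assumptions for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.CRC32;

public class InputFileValidator {

    // Streams the file through a CRC32, so large batch inputs need not fit in memory
    static long checksumOf(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input file plus a sidecar file carrying the expected checksum
        Path input = Paths.get("accounts-20160603.dat");
        long expected = Long.parseLong(
                Files.readAllLines(Paths.get("accounts-20160603.crc")).get(0).trim());
        if (checksumOf(input) != expected) {
            throw new IllegalStateException("Checksum mismatch, refusing to process " + input);
        }
        System.out.println("Input verified, safe to start the batch run");
    }
}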

Main Building Blocks of Batch Processing

  • Conversion - Converts the different types of records to a format required for further processing. Most of the functionality may be provided by the batch framework
  • Validation - Validation of inputs (input files via checksums, file tails etc.) and cross checks for input data integrity
  • Extract - Extracts records from an input source based on pre-defined rules and writes them to an output
  • Extract and Update - Extracts records (as explained above), makes changes, and writes them to an output file
  • Process and Update - Processes records coming from an input source, more commonly from an Extraction or a Validation and writes to an output
  • Output/Format - Re-structures and/or formats input records to be printed or transferred to another application
  • General - There’s always logic that does not fall under the above categories.

Utilities for Batch Processing

  • Sort
  • Split
  • Merge

Factors Defining Batch Processing Strategy

  • Batch system volume
  • Concurrency with on-line applications and/or other batch systems
  • Available batch windows

Batch Processing Strategies

Following are the typical batch processing approaches:

  1. During a pre-defined window where the application is offline
  2. Concurrent batch - Processing while the application is on-line
  3. Parallel processing - Many different batch runs or jobs running at the same time
  4. Partitioning - Processing of many instances of the same job at the same time
  5. A combination of these

These are discussed in detail below:

1. During a pre-defined window where the application is offline

  • This is the simplest form of batch processing
  • Usually a single commit after the processing is enough, but it’s always best to consider a more robust approach from the beginning since batch systems will grow in complexity and volume over time
  • A locking strategy is not required. However, a proper commit strategy and a recovery/re-start strategy must be defined

2. Concurrent Batch - Processing while the application is on-line

  • The on-line application can update the data referred to by the batch process
  • A locking strategy is required. However, there should be a balance between locking and availability of data
  • Locking should be done on smaller chunks of data. If the source is a DB, row-level locking could be used
  • Updates (by the batch process or the on-line app) should be committed to the DB often, in smaller chunks (see the sketch after this list)
  • If row-level locking is not supported or the input is a file, either optimistic locking or pessimistic locking could be used
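
To make the chunked commit idea concrete, here's a rough JDBC sketch. The ACCOUNT table, its columns, and the interval are made up for illustration:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ChunkedCommit {

    // Applies updates in chunks of 'commitInterval' rows so locks are held briefly
    // and the on-line application sees committed data sooner
    static void process(Connection con, long[] recordIds, int commitInterval) throws SQLException {
        con.setAutoCommit(false);
        String sql = "UPDATE ACCOUNT SET PROCESSED = 'Y' WHERE ID = ?";  // hypothetical table
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            int inChunk = 0;
            for (long id : recordIds) {
                ps.setLong(1, id);
                ps.executeUpdate();
                if (++inChunk == commitInterval) {
                    con.commit();   // releases row locks taken in this chunk
                    inChunk = 0;
                }
            }
            con.commit();           // commit the final partial chunk
        } catch (SQLException e) {
            con.rollback();         // only the current chunk is lost; earlier chunks stay committed
            throw e;
        }
    }
}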

Optimistic Locking
  • Suitable for scenarios where record contention probability is low
  • Maintains a timestamp against each record. The record is read together with its timestamp; when updating, the update checks that the stored timestamp is unchanged and then sets it to the latest time
  • If the timestamp differs, then another process has updated the record prior to this process. This should trigger a retry
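
A minimal JDBC sketch of the timestamp check described above; the ACCOUNT table and its columns are invented for illustration:

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class OptimisticUpdate {

    // Updates a hypothetical ACCOUNT row that was read earlier along with its
    // LAST_UPDATED timestamp. Returns false if another process has modified the
    // row in the meantime, in which case the caller should re-read and retry.
    static boolean updateBalance(Connection con, long accountId, BigDecimal newBalance,
                                 Timestamp timestampWhenRead) throws SQLException {
        String sql = "UPDATE ACCOUNT SET BALANCE = ?, LAST_UPDATED = CURRENT_TIMESTAMP "
                   + "WHERE ID = ? AND LAST_UPDATED = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setBigDecimal(1, newBalance);
            ps.setLong(2, accountId);
            ps.setTimestamp(3, timestampWhenRead);
            // Zero rows updated means the timestamp no longer matches: someone else got there first
            return ps.executeUpdate() == 1;
        }
    }
}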

Pessimistic Locking
  • Suitable for scenarios where record contention probability is high
  • A physical or logical lock should be obtained before data retrieval and released once the data is updated
    • A DB column could be used to implement this. However, obtaining the lock and releasing should be managed in a thread safe manner
    • A timeout period for the lock is also required, to handle processes that might not be able to release the lock
  • For easier management, a dedicated locking manager comes in handy
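
Here's a rough sketch of such a DB-column based lock with a timeout. All table and column names are assumptions, and the cutoff is computed on the application clock, which assumes it is reasonably in sync with the DB clock:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class ColumnBasedLock {

    // Claims a hypothetical ACCOUNT row by stamping LOCKED_BY / LOCKED_AT.
    // The WHERE clause matches only rows that are unlocked or whose lock is
    // older than the timeout, so a crashed process cannot hold a lock forever.
    // The single UPDATE is atomic, which makes acquisition thread safe.
    static boolean acquire(Connection con, long accountId, String owner,
                           long timeoutMillis) throws SQLException {
        Timestamp cutoff = new Timestamp(System.currentTimeMillis() - timeoutMillis);
        String sql = "UPDATE ACCOUNT SET LOCKED_BY = ?, LOCKED_AT = CURRENT_TIMESTAMP "
                   + "WHERE ID = ? AND (LOCKED_BY IS NULL OR LOCKED_AT < ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, owner);
            ps.setLong(2, accountId);
            ps.setTimestamp(3, cutoff);
            return ps.executeUpdate() == 1;   // one row updated means the lock is ours
        }
    }

    static void release(Connection con, long accountId, String owner) throws SQLException {
        String sql = "UPDATE ACCOUNT SET LOCKED_BY = NULL, LOCKED_AT = NULL "
                   + "WHERE ID = ? AND LOCKED_BY = ?";   // only the owner can release
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, accountId);
            ps.setString(2, owner);
            ps.executeUpdate();
        }
    }
}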

Notes:
  • Both batch and business application should assume the same locking strategy and schema
  • If locking happens on a related group of data, the locking manager should be enhanced to maintain this metadata to effectively manage the locking.
    • This would help minimise deadlocking scenarios

3. Parallel Processing

  • Multiple batch-runs/jobs to run in parallel
  • This helps minimise the total batch processing time
  • If the batch-runs/jobs running in parallel do not share files, DB tables, or index spaces, the implementation complexity is quite low. If not, the following strategies should be applied
    • Partition data - Batch jobs operate on partitioned data
    • Control Table - A table containing all shared resources and, for each one, whether it is in use by a process at a given time. Each process is to check this table before using a shared resource (see the sketch after this list)
  • Could be implemented via multiple threads. However, the solution should be robust enough to ensure a fair share of processing power for all running processes
  • The following complexities exist in this approach
    • Proper load balancing is required to ensure availability of files, database pools etc.
    • The control table, if used, is a highly shared resource and can become a bottleneck
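
A rough sketch of the control table idea, assuming a hypothetical SHARED_RESOURCE table with RESOURCE_NAME and IN_USE_BY columns. A single atomic UPDATE avoids a read-then-write race between parallel jobs:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ControlTable {

    // Atomically marks a shared resource as in use. Every parallel job is
    // expected to call claim() before touching the resource and release() when done.
    static boolean claim(Connection con, String resource, String jobId) throws SQLException {
        String sql = "UPDATE SHARED_RESOURCE SET IN_USE_BY = ? "
                   + "WHERE RESOURCE_NAME = ? AND IN_USE_BY IS NULL";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, jobId);
            ps.setString(2, resource);
            return ps.executeUpdate() == 1;   // false means another job holds the resource
        }
    }

    static void release(Connection con, String resource, String jobId) throws SQLException {
        String sql = "UPDATE SHARED_RESOURCE SET IN_USE_BY = NULL "
                   + "WHERE RESOURCE_NAME = ? AND IN_USE_BY = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, resource);
            ps.setString(2, jobId);
            ps.executeUpdate();
        }
    }
}

Note that every claim and release hits this one table, which is exactly how it can become the bottleneck mentioned above.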

4. Partitioning

  • Many instances of the same batch job run simultaneously, on mutually exclusive partitions of data
  • Suits scenarios where the databases or files used can be, or already are, partitioned
  • The logic should clearly communicate to each batch job which partition of data it should operate on, and make sure the job operates only on that partition
  • The solution depends heavily on the input data partitioning strategy. However, it should be flexible enough to support dynamic configuration of the number of partitions (a sketch follows below)
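
Since this is a Spring Batch summary, here's a minimal sketch of communicating a partition to each worker using Spring Batch's Partitioner interface. The fixed key range and the context key names are invented for illustration; a real implementation would typically query MIN/MAX of the partitioning column first:

import java.util.HashMap;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

public class RangePartitioner implements Partitioner {

    private static final long MIN_ID = 1;
    private static final long MAX_ID = 1_000_000;   // assumed key range for illustration

    // Each worker step receives its own ExecutionContext carrying the partition
    // id and the key range it must stay within. gridSize is the dynamically
    // configured number of partitions.
    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        long rangeSize = (MAX_ID - MIN_ID + 1) / gridSize;
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            ExecutionContext ctx = new ExecutionContext();
            long lower = MIN_ID + i * rangeSize;
            // The last partition absorbs the remainder, so no records fall in a gap
            long upper = (i == gridSize - 1) ? MAX_ID : lower + rangeSize - 1;
            ctx.putInt("partitionId", i);
            ctx.putLong("lowerBound", lower);
            ctx.putLong("upperBound", upper);
            partitions.put("partition" + i, ctx);
        }
        return partitions;
    }
}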


Partitioning Strategies

Following are a few potential partitioning strategies. The right choice will vary from case to case:

  • Fixed and Even Breakup of Records
    • Break the record set up into a fixed, even number of portions
    • Pre-processing is required to split the records into partitions and/or identify the lower and upper bounds of each partition, which could be an overhead

  • Breakup by Key Attribute
    • Breakup record sets based on a key attribute (i.e. column) or a set of attributes
    • If the attribute(s) are static, then each partition would have a batch process defined.
      • Dynamic configuration is not required
      • All new records will be attached to a batch job
    • If the attributes are dynamic, manual configuration of new batch jobs (or updates to existing ones) is required to cater for records with new attribute values
    • Cannot guarantee an even distribution of records per batch process

  • Breakup by Range of Value
    • Each batch is assigned a range for an attribute (or attributes) to process
    • Cannot guarantee an even distribution of records per batch process

  • Breakup by Views
    • Similar to breakup by key attributes, but the separation is done at the DB level via DB views
    • A batch is assigned a view
    • New values for the attributes will require new views, hence new batch configurations

  • Breakup by Addition of a Process Indicator
    • Introduce a new attribute that holds whether the record has been processed (successfully or otherwise) or not
    • Each batch job can pick records that have not been processed successfully
    • Can start many batch operations simultaneously
    • All new records will be included automatically (new records should have the attribute set to not-processed)
    • No need of dynamic configuration for batch jobs
    • This would, however, increase the I/O operations on the table
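
A rough JDBC sketch of claiming records via such an indicator (table and column names are made up). Claiming rows with a single UPDATE keeps concurrent job instances from picking the same records:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class ProcessIndicatorClaim {

    // Marks a batch of unprocessed TRANSACTION rows as owned by this job instance.
    // Concurrent instances end up with disjoint rows because the UPDATE only
    // matches rows still in the 'N' (not processed) state.
    static int claimBatch(Connection con, String jobId, int batchSize) throws SQLException {
        String sql = "UPDATE TRANSACTION SET STATUS = 'P', PROCESSED_BY = ? "
                   + "WHERE STATUS = 'N' AND ROWNUM <= ?";   // ROWNUM is Oracle syntax;
                                                             // use LIMIT / FETCH FIRST elsewhere
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, jobId);
            ps.setInt(2, batchSize);
            return ps.executeUpdate();   // number of rows this instance now owns
        }
    }
}

The instance would then select the rows it owns, process them, and flip STATUS to a terminal value ('C' for complete, say), so failed rows can be picked up again later.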

  • Extract DB Table to a Flat File and Split
    • This requires a lot of pre-processing to extract the DB table to a flat file and split the file
    • The advantages of partitioned batch processing may be nullified if the pre-processing overhead is too high

  • Use of a Hashing Column
    • Quite similar to partitioning by key attribute(s), except that a hash of a column value determines the partition a record belongs to
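
For example, a hash bucket could be computed when a record is created and stored in the hash column, with each batch instance processing exactly one bucket (the scheme below is an assumption for illustration):

public class HashPartitionAssigner {

    // Maps an arbitrary record key onto one of 'partitions' buckets. The
    // resulting value would be stored in the hash column when the record is
    // created, and each batch instance processes exactly one bucket.
    static int bucketFor(String recordKey, int partitions) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(recordKey.hashCode(), partitions);
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("ACC-10001", 4));   // prints a value in 0..3
    }
}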

DB and Application Design Strategies for a Partitioned Environment

  • Using a central partition repository (i.e. a partition table) is recommended
    • The partition table will contain static info about the data partition (i.e. partition number, program or process id using the partition, lower and upper bounds, attribute sets etc.)
  • On startup, batch processes should be given the parameters needed to access the partition table and figure out which partition to work on (see the lookup sketch after this list)
  • The partition ID should be carried from the start to the end of processing, since it's important:
    • If the data has to be merged, then the output should carry the corresponding partition id
    • Logging and error reporting should mention the partition ID for easier troubleshooting
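
A sketch of the startup lookup against a hypothetical PARTITION_INFO table; the process is handed only its program ID and partition number, and reads everything else from the repository:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PartitionLookup {

    // Minimal holder for one row of the assumed PARTITION_INFO table
    static final class Partition {
        final int id;
        final long lowerBound;
        final long upperBound;
        Partition(int id, long lowerBound, long upperBound) {
            this.id = id;
            this.lowerBound = lowerBound;
            this.upperBound = upperBound;
        }
    }

    // Looks up the static partition definition this process should work on
    static Partition forProcess(Connection con, String programId, int partitionNumber)
            throws SQLException {
        String sql = "SELECT PARTITION_NO, LOWER_BOUND, UPPER_BOUND "
                   + "FROM PARTITION_INFO WHERE PROGRAM_ID = ? AND PARTITION_NO = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, programId);
            ps.setInt(2, partitionNumber);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new IllegalStateException(
                            "No partition " + partitionNumber + " defined for " + programId);
                }
                return new Partition(rs.getInt(1), rs.getLong(2), rs.getLong(3));
            }
        }
    }
}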

Deadlock Prevention

  • Design table indices keeping deadlock prevention and performance in mind
  • The architecture should facilitate strategies such as wait-and-retry for when the database is not available (due to locking, overload, etc.)
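
A minimal wait-and-retry sketch; the attempt count and back-off are arbitrary, and a real implementation might also have to inspect vendor-specific error codes, since not every driver throws SQLTransactionRollbackException for deadlocks:

import java.sql.SQLException;
import java.sql.SQLTransactionRollbackException;

public class RetryingExecutor {

    interface SqlWork {
        void run() throws SQLException;
    }

    // Retries a unit of work a few times with a growing pause between attempts
    static void withRetry(SqlWork work, int maxAttempts)
            throws SQLException, InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                work.run();
                return;
            } catch (SQLTransactionRollbackException e) {   // deadlock / lock timeout
                if (attempt >= maxAttempts) {
                    throw e;   // give up and let the job's recovery strategy kick in
                }
                Thread.sleep(1000L * attempt);   // simple linear back-off
            }
        }
    }
}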

Parameter Passing and Validation

  • The architecture should support
    • Retrieving and validating partition parameters before startup
    • Passing parameters to batch jobs at run time
  • Validations should ensure that:
    • There are sufficient partitions to cover the whole data set
    • There are no gaps between partitions (i.e. 100% of records are processed; a sketch of this check follows this list)
  • When consolidating results, the application design should address common concerns associated with partitioned processing, such as:
    • Whether all partitions should be processed before proceeding to the next step/job
    • Recovery strategy if a batch operation on one partition fails
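
For the gap check in particular, here's a small sketch: given each partition's inclusive bounds (the representation is assumed), verify that together they tile the full key range with no gaps:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PartitionValidator {

    // Minimal bounds holder; inclusive lower/upper over the partitioning key
    static final class Bounds {
        final long lower;
        final long upper;
        Bounds(long lower, long upper) { this.lower = lower; this.upper = upper; }
    }

    // Checks that the supplied partitions cover [datasetMin, datasetMax] with
    // no gaps, so 100% of records are guaranteed to be picked up by some job
    static void validateCoverage(List<Bounds> partitions, long datasetMin, long datasetMax) {
        if (partitions.isEmpty()) {
            throw new IllegalStateException("No partitions defined");
        }
        List<Bounds> sorted = new ArrayList<>(partitions);
        sorted.sort(Comparator.comparingLong(b -> b.lower));
        long coveredUpTo = datasetMin - 1;
        for (Bounds b : sorted) {
            if (b.lower > coveredUpTo + 1) {
                throw new IllegalStateException("Gap in coverage before key " + b.lower);
            }
            coveredUpTo = Math.max(coveredUpTo, b.upper);
        }
        if (coveredUpTo < datasetMax) {
            throw new IllegalStateException("Keys above " + coveredUpTo + " are not covered");
        }
    }

    public static void main(String[] args) {
        // Example: three partitions tiling keys 1..300 pass the validation
        validateCoverage(List.of(new Bounds(1, 100), new Bounds(101, 200),
                new Bounds(201, 300)), 1, 300);
        System.out.println("Partition coverage OK");
    }
}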