Friday, June 3, 2016

All about Batch Processing

I wanted to quickly come up to speed on Batch Processing and Spring Batch and found a great read on the Spring Batch documentation. Following is a summary of the Introduction to Spring Batch page that I created for my reference.

A Few Types of Inputs for Batch Processing
  • Files
  • Databases
  • Message queues

General Principles

  • Batch architecture typically affects the application architecture (which runs on-line) and vice-versa. Design both together using common building blocks
  • Simplify the batch process as much as possible. Avoid complex logical structures in batch applications
  • Process the data as close as possible to where the data exists (Eliminates the overhead in data transfer)
  • Minimize system resources use, especially I/O operations. Use internal memory operations as much as possible
  • Review I/O operations in SQLs. Avoid the following
    • Reading data more than once (the data can be cached)
    • Unnecessary table/index scans
    • Not specifying key values in WHERE clauses
  • Avoid repetition of work in batch. For example, if reporting needs summarised data, do it during the initial data processing as much as possible
  • Allocate enough memory to avoid overhead in memory re-allocation during the batch job
  • Assume worst on data integrity. Include adequate checks to ensure integrity.
  • Implement checksums for input files to make sure the input processed is valid.
  • Do stress testing early as possible in production like environments using realistic workloads
  • If the batch processing works on files which are backed from production system, check the file backing up regularly to make sure that it happens properly

Main Building Blocks of Batch Processing

  • Conversion - Converts the different types of records to a format required for further processing. Most of the functionality may be provided by the batch framework
  • Validation - Validation of inputs (input files via checksums, file tails etc.) and cross checks for input data integrity
  • Extract - Extract records from an input source based on pre-defined rules, writes them to an output
  • Extract and Update - Extract records (as explained above), makes changes and writes to an output file
  • Process and Update - Processes records coming from an input source, more commonly from an Extraction or a Validation and writes to an output
  • Output/Format - Re-structures and/or formats input records to be printed or transferred to another application
  • General - There’s always logic which does not fall under above categories.

Utilities for Batch Processing

  • Sort
  • Split
  • Merge

Factors Defining Batch Processing Strategy

  • Batch system volume
  • Concurrency with on-line applications and/or other batch systems
  • Available batch windows

Batch Processing Strategies

Following are the typical batch processing approaches 

  1. During a pre-defined window where application is offline
  2. Concurrent batch - Processing while application is on-line
  3. Parallel processing - Many different batch runs or jobs at the same time)
  4. Partitioning - Processing of many instances of the same job at the same time
  5. A combination of these

These are discussed in details below;

1. During a pre-defined window where application is offline

  • This is the most simple form of batch processing
  • Usually a single commit after the processing is enough, but it’s always best to consider a more robust approach from the beginning since batch systems will grow in complexity and volume over time
  • Locking strategy is not required. However, a proper commit strategy  and a recovery/re-start strategy must be defined

2. Concurrent Batch - Processing while application is on-line

  • Online application can update the data referred by the batch processing
  • A locking strategy is required. However, there should be a balance between locking and availability of data
  • Locking should be done for a smaller chunk of data. If it’s a DB Row Level Locking could be used
  • Updates (by the batch process or online app) should be committed to the db in smaller chunks often
  • If Row Level Locking is not supported or the input is a file, either Optimistic Locking or Pessimistic Locking could be used

Optimistic Locking
  • Suitable for scenarios where record contention probability is low
  • Maintains a timestamp against the record referred. When updating, the record is referred with the time stamp and the timestamp is updated to the latest time
  • If the timestamp differs, then another process has updated the record prior to this process. This should trigger a retry

Pessimistic Locking
  • Suitable for scenarios where record contention probability is high
  • A physical or logical lock should be obtained before data retrieval and released once data updated
    • A DB column could be used to implement this. However, obtaining the lock and releasing should be managed in a thread safe manner
    • A timeout period for the lock is also required to manage scenarios processes that might not be able to release the lock
  • To easier management, a dedicated Locking Manager becomes handy

  • Both batch and business application should assume the same locking strategy and schema
  • If locking happens on a related group of data, the locking manager should be enhanced to maintain this metadata to effectively mange the locking.
    • This would help minimise deadlocking scenarios

3. Parallel Processing

  • Multiple batch-runs/jobs to run in parallel
  • This helps minimising total batch processing time
  • If the batch-runs/jobs run in parallel do not share files, db tables, index spaces the complexity of the implementation is quite low. If not, the following strategies should be applied
    • Partition data - Batch jobs operate on partitioned data
    • Control Table - A table containing all shared resources and information on whether that’s been used at a given time by a process. Each process is to check this table before using shared resources
  • Could be implanted via multiple threads. However, the solution should be robust enough to ensure fair amount of processing power for all running processes
  • Following complexities exist in this approach
    • Proper load balancing is required to ensure availability of files, database pools etc.
    • Control Table, if used could be a highly shared resource and can be a bottleneck

4. Partitioning

  • Many instance of the same batch job running simultaneously, on mutually exclusive partitions of data
  • Operates on scenarios where either databases or files used can be or already is partitioned
  • The logic should be constructed to clearly communicate the partition of data that the batch job should operate on, and to make sure the job operates on only the communicated partition of data
  • The solution would be quite dependent on the input data partition strategy. However, should be flexible enough to support dynamic configuration of the number of partitions

Partitioning Strategies

Following are a few potential partitioning strategies. Selecting a strategy will vary from case to case

  • Fixed and Even Breakup of Records
    • Breakup number of records to even number of portions
    • Pre-processing is required to split the records in to partitions and/or identify lower and upper bounds for the partitions which could be an overhead

  • Breakup by Key Attribute
    • Breakup record sets based on a key attribute (i.e. column) or a set of attributes
    • If the attribute(s) are static, then each partition would have a batch process defined.
      • Dynamic configuration is not required
      • All new records will be attached to a batch job
    • If the attributes are dynamic, manual configuration of new batch jobs (or updating existing) are required to cater new records under new attributes
    • Cannot realise optimal even distribution of records per batch process

  • Breakup by Range of Value
    • Each batch is assigned a range for an attribute (or attributes) to process
    • Cannot realise optimal even distribution of records per batch process

  • Breakup by Views
    • Similar to breakup by key attributes. But the separation is done at the DB level via DB views
    • A batch is assigned a view
    • New values for the attributes will require new views, hence new batch configurations

  • Breakup by Addition of a Process Indicator
    • Introduce a new attribute to hold whether the record has been processed (successfully or otherwise) or not
    • Each batch job can pick a record which is not processed successfully
    • Can start many batch operations simultaneously
    • All new records will be included automatically (new records should have the attribute set to not-processed)
    • No need of dynamic configuration for batch jobs
    • This would however, increase the I/O operations on the table

  • Extract DB Table to a Flat File and Split
    • This require a lot of pre-processing for extracting db table to a flat file and splitting the file
    • The advantages of partitioned batch processing may get nullified if the pre-processing overhead is too much

  • Use of a Hashing Column
    • Quite similar to partitioning by attribute(s).

DB and Application Design Strategies for a Partitioned Environment

  • Using a central partition repository (i.e. a partition table) is recommended
    • The partition table will contain static info about the data partition (i.e. partition number, program or process id using the partition, lower and upper bounds, attribute sets etc.)
  • On startup, batch processes should be communicated with the parameters to access the partition able and figure out which partition is to be worked on
  • Partition ID should be kept from start to end of processing since it’s important
    • If the data has to be merged, then the output should carry the corresponding partition id
    • Logging, error reporting should have the partition id mentioned for easier troubleshooting

Deadlock Prevention

  • Design table indices keeping deadlock prevention and performance in mind
  • The architecture should facilitate strategies such as wait and re-try etc.the database is not available (due to locking, overload etc.)

Parameter Passing and Validation

  • The architecture should support
    • Retrieve and validate partition parameters before startup 
    • Pass parameters to batch jobs at run time
  • Validations should ensure that;
    • There are sufficient partitions to cover the whole data set
    • There are no gaps between partitions (i.e. 100% of records are processed)
  • When consolidation, the application design should address common concerns associated with partitioned processing such as;
    • Whether all partitions should be processed before proceeding to the next step/job
    • Recovery strategy if a batch operation on one partition fails

Tuesday, June 23, 2015

Join in OrientDB

I started working on OrientDB which is a NoSQL dbms which brings in the goodness of both Graph and Document DBs.

We (me and my team) were heavy on relational databases until we started working on OrientDB, it was a bit hard for us to get away from the Relational thinking where you try to do everything with SQL and think "Inner Joins" :)

However, what we realized was simply coz you are working on a NoSQL DB, the inner join like scenarios could not be avoided. A very good example is integrating with other systems in the organizations that have a relational DB as the back end.

Inner join like scenarios could be created using sub queries in OrientDB where we used the LET block to access the parent query context. However, we realized this is not the way to go forward in the hard way.

Here's a typical example.

My Schema:

Department {
 deptId : INTEGER
 deptName : STRING

Employee {
 empId : INTEGER
 empName : STRING
 deptId : INTEGER

Query to get departments linked to employees;

SELECT empName, $dept[0].deptName
FROM Employee
LET $dept = (SELECT FROM Department WHERE deptId = $parent.current.deptId)

For development we usually use a simple data set for easy debugging. There this worked wonderfully well and we were happy. However, when we started testing with larger sets of test data, the time taken for the query horribly increased. 

But we had a hidden weapon - Indexes!! We created indexes for Department.deptId and we were certain that the inner query would pick up the index. But to our utter horror, it didn't! The time taken was unchanged. We did a query profiling using the EXPLAIN feature and it did not indicate any index usage.

With out the index, the time complexity of the query was O(n2) since for each of the Employees (parent query) all the departments are searched. The expectation with the index was to bring the time complexity to O(n).

Therefore, we moved from pure SQL to OrientDB back-end functions. These are more like stored procedures in a Relational context and could be written in JavaScript, SQL and Groovy (for the time being). We liked JavaScript. Here's how the function body looked like;

var employeeList = db.query("SELECT FROM Employee");

for (var i = 0; i < employeeList.length; i++) {
  var deptId = employeeList[i].field("empId");
  var departmentList = db.query("SELECT FROM Department WHERE deptId = ? ", deptId);

The function first queries the list of Employees, then execute a direct SELECT query for the Department that should be liked. This query actually uses the index, and makes the execution time almost a constant. Selecting all the employees would be an additional cost. However, since OrientDB does a lazy load, this is not bad as it seems.

The above strategy actually helped us reduce certain queries that took around 10 minutes to 30 seconds.

Monday, September 16, 2013

IllumiRoom : Peripheral Projected Illusions for Interactive Experience

Video gaming is a popular past time and a multi-billion dollar industry spread worldwide. The numbers playing video games is on the increase all the time, and the game developers are spending millions of dollars in R&D to find ways to make the gaming experience as realistic as possible.

The latest trend in video gaming is to make the player her or himself to become the controller tracking the gesture of the player. Nintendo Wii with the Wii Remote, Sony PlayStation with Move and Eye and Microsoft Xbox with Kinect are three main gaming consoles that supported gesture tracking. Out of the three, Microsoft Kinect stands out tall since unlike its two competitors Kinect does not require the game player to hold any controller for gesture tracking. With the huge success received by Kinect where more than 24 million units were sold worldwide, Microsoft is motivated to take the gaming experience to a whole different level.

Imaging you sitting down in your living room to play a video game on your television. When the game starts, the room mysteriously transforms to look like an environment, matching the shading in the video game. The colors of the room become supersaturated and carton edges appear on your furniture. You come across an enemy in the game and suddenly a streaking bullet flies towards your character and then out of the television. The enemy throws a grenade towards you. The grenade rolls out of the television, bouncing off the coffee table and explodes in your living room, throwing shrapnel across the furniture. The entire living room appears to shake as you take damage from the explosion. Although this sounds like science fiction, Microsoft’s “Project IllumiRoom” which is still in R&D has completed a proof-of-concept which proves that the date this becomes a reality is not far from now!

IllumiRoom ??

“IllumiRoom” augments the area surrounding the television with peripheral projected visualizations. Well, what does that actually mean? Have a look at figure 1 below;

Figure 1 : Extending filed of view

The image on the left showcases the actual living room with the TV in the middle and some furniture surrounding the TV. The image on the right shows how IllumiRoom can extend the field of view so that the area surrounding the TV augmented with similar shadings of the game.

IllumiRoom focuses on three types of special enhancements;

1. Extend the field of view – The example explained above (Figure 1)

2. Change the appearance of the room – For example, if it’s snowing in the game, a snow visualization can be projected which corresponds to the motion of the game (Figure 2)

Figure 2 : Changing appearance

3. Induce apparent motion – For example, visually make the edges of the furniture in the room wobble when you get hit by enemy bullets (Figure 3)
Figure 3 : Induce apparent motion

How does IllumiRoom work?

As explained earlier, IllumiRoom “projects” visualizations. But the interesting question is how do the projections get adjusted to the room geometry?

The current prototype uses a Kinect sensor to capture the color of the objects in the room and the 3 dimensional geometry of the room, which becomes the key input when rendering the visualizations to be projected. The projection is done via a commonly used wide filed-of-view projector as show in in figure 4. Since Kinect is used to analyze the room geometry, the system can self-calibrate and project illusions in any type of room.

Figure 4 : To key hardware elements used

Once this goes in to production, this will be via a single device that combines both the features of the Kinect sensor and the projector, ideally sitting on top of the living room coffee table as shown in figure 5.

Figure 5 : Projection

IllumiRoom supports different types of extending the field of view. As explained earlier, it can entirely extend the gaming environment to your living room or it can selectively extend the field of view to either extend the weapon fire, other players in the game, important objects etc. However, this requires hard integrations with the games source code.

IllumiRoom has also invented algorithms to detect the camera motion of any given video filed. It can even intercept the gaming controller input. Therefore, even without integrating with a games source code, there can be illuminations projected to induce the corresponding motion of the game.


Since the IllumiRoom is not designed to project visualizations on a white flat surface, it should compensate for colors and geometry in the room. Therefore, the perceived projected visualization quality is always lower.
Since it’s projecting, it’s also subjected to limitations on any projection system, mainly neutralizing bright ambient illumination. Therefore, the best effects can be perceived if IllumiRoom is setup in a dark environment.


Still in its earlier stages, we can expect a lot of improvements over the coming years before this hit the markets. The usages may not limit to gaming, it can also extend to other forms of entertainment, education etc. If Microsoft could integrate IllumiRoom with Kinect, the possibilities can be enormous. One challenge is to bring down the cost of the system as it involves a projector which is not a common house-hold item right now.

The capabilities of IllumiRoom are much greater than what’s explained in this short article. More information with a set of videos which really speaks out of what IllumiRoom is capable of can be found at