Tuesday, June 23, 2015

Join in OrientDB

I started working on OrientDB which is a NoSQL dbms which brings in the goodness of both Graph and Document DBs.

We (me and my team) were heavy on relational databases until we started working on OrientDB, it was a bit hard for us to get away from the Relational thinking where you try to do everything with SQL and think "Inner Joins" :)

However, what we realized was simply coz you are working on a NoSQL DB, the inner join like scenarios could not be avoided. A very good example is integrating with other systems in the organizations that have a relational DB as the back end.

Inner join like scenarios could be created using sub queries in OrientDB where we used the LET block to access the parent query context. However, we realized this is not the way to go forward in the hard way.

Here's a typical example.

My Schema:

Department {
 deptId : INTEGER
 deptName : STRING

Employee {
 empId : INTEGER
 empName : STRING
 deptId : INTEGER
}

Query to get departments linked to employees;

SELECT empName, $dept[0].deptName
FROM Employee
LET $dept = (SELECT FROM Department WHERE deptId = $parent.current.deptId)


For development we usually use a simple data set for easy debugging. There this worked wonderfully well and we were happy. However, when we started testing with larger sets of test data, the time taken for the query horribly increased. 

But we had a hidden weapon - Indexes!! We created indexes for Department.deptId and we were certain that the inner query would pick up the index. But to our utter horror, it didn't! The time taken was unchanged. We did a query profiling using the EXPLAIN feature and it did not indicate any index usage.

With out the index, the time complexity of the query was O(n2) since for each of the Employees (parent query) all the departments are searched. The expectation with the index was to bring the time complexity to O(n).

Therefore, we moved from pure SQL to OrientDB back-end functions. These are more like stored procedures in a Relational context and could be written in JavaScript, SQL and Groovy (for the time being). We liked JavaScript. Here's how the function body looked like;

var employeeList = db.query("SELECT FROM Employee");

for (var i = 0; i < employeeList.length; i++) {
  var deptId = employeeList[i].field("empId");
  
  var departmentList = db.query("SELECT FROM Department WHERE deptId = ? ", deptId);
  
  print(departmentList[0].field("name"));
}

The function first queries the list of Employees, then execute a direct SELECT query for the Department that should be liked. This query actually uses the index, and makes the execution time almost a constant. Selecting all the employees would be an additional cost. However, since OrientDB does a lazy load, this is not bad as it seems.

The above strategy actually helped us reduce certain queries that took around 10 minutes to 30 seconds.

No comments: