RavenDB – The HiLo what how and why

One of the issues I touched on when covering basic interaction with RavenDB was the awkwardness of having to call SaveChanges in order to get the ids of entities saved during the unit of work. This problem is not new to the document db space, nor to any system where the domain is mapped to an id-based data store (ORMs/RDBMS/etc).

I was going to cook up a home-brew solution for use within my own projects and blog about it so that other people could use it, but after I posted my intentions on the RavenDB mailing list, Oren suggested that making it the default behaviour and moving id generation to the Store would be a welcome move.

After posting on Twitter about this now being default, I got asked quite a few questions on what HiLo was, what the advantages were, and why it was a good thing that in the .NET client for RavenDB this was now going to be the default.

The gist

  • Waiting until SaveChanges to get ids for saved entities makes writing logic against those entities troublesome
  • Calling SaveChanges every time a new entity is created makes transactions troublesome
  • Calling SaveChanges to get the entity id means a call across the wire just to get an entity id, which is expensive
  • Simply assigning a Guid to the Id makes accessing documents via REST an unpleasant experience
  • You can’t just assign a random integer, because you’d just get collisions as other clients did the same and tried to save their entities
  • HiLo provides a method of creating *incremental* integer based ids for entities in a fashion that is safe in concurrent environments

The algorithm

The basic premise is that the server still controls id generation, but effectively hands out a range of ids to each client, which the client can then assign to objects as they are created. When the client runs out of ids, it simply requests more.

Obviously, requesting a heap of ids all at the same time would be expensive, so the idea is that the server provides a single value, the “Hi” value, which identifies the range on the client. The client then supplies the “Lo” value, its position within that range.

There are a number of ways this can be implemented, but the one I chose was probably the simplest, and credit goes to Tuna Toksoz for the blog entry which showed me how to implement it myself.

  • The data store needs only store the latest “Hi” value, which starts at 1, and increases by 1 every time a new “Hi” value is requested by a client
  • The clients all use the same number for a “Capacity”, that is – the range of numbers that each “Hi” value represents. For example 1000
  • Each client requests a “Hi” value and resets their “Lo” value to 0
  • Every time a new Id is requested from the generator, the Id is generated by combining the Hi and Lo numbers together:
        (currentHi - 1)*capacity + (++currentLo)
  • When currentLo reaches capacity, a new Hi is requested and the cycle starts over again
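
The steps above can be sketched in a few lines of Python. This is only an illustration of the algorithm, not the code that went into the RavenDB client: `HiServer` is a hypothetical in-memory stand-in for the data store, and `HiLoGenerator` and its method names are made up for the example.

```python
class HiServer:
    """Hypothetical stand-in for the data store: it only tracks the latest Hi."""

    def __init__(self):
        self.latest_hi = 0

    def next_hi(self):
        # The first request returns 1; every later request returns the next integer.
        self.latest_hi += 1
        return self.latest_hi


class HiLoGenerator:
    """Client-side generator (not thread-safe in this simplified form)."""

    def __init__(self, server, capacity):
        self.server = server
        self.capacity = capacity
        self.current_hi = None
        self.current_lo = capacity  # forces a Hi request on first use

    def next_id(self):
        if self.current_lo >= self.capacity:
            # Range exhausted: ask the server for a new Hi and reset Lo.
            self.current_hi = self.server.next_hi()
            self.current_lo = 0
        self.current_lo += 1
        return (self.current_hi - 1) * self.capacity + self.current_lo


server = HiServer()
generator = HiLoGenerator(server, capacity=3)
ids = [generator.next_id() for _ in range(6)]
print(ids)
```

Six ids cost only two trips to the server here; with a capacity of 1000 it would be one trip per thousand ids.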

In the actual implementation, there is some locking going on around this algorithm in order to make the client generator available across threads (web requests) and avoid having to create a new generator per session (defeating the point of having one if you only create a single object in a session).
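
As a rough idea of what that locking looks like, here is a Python sketch that guards the Hi/Lo state with a single lock so one generator can be shared between threads. The class name and the `request_hi` callable are assumptions made for the example; the real client is .NET and may well be structured differently.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedHiLoGenerator:
    """One generator shared by many threads; a lock guards the Hi/Lo state."""

    def __init__(self, request_hi, capacity):
        self._request_hi = request_hi  # hypothetical callable returning the next Hi
        self._capacity = capacity
        self._lock = threading.Lock()
        self._hi = None
        self._lo = capacity  # forces a Hi request on first use

    def next_id(self):
        with self._lock:
            if self._lo >= self._capacity:
                self._hi = self._request_hi()
                self._lo = 0
            self._lo += 1
            return (self._hi - 1) * self._capacity + self._lo


hi_source = itertools.count(1)  # stands in for the server-side Hi counter
shared = LockedHiLoGenerator(lambda: next(hi_source), capacity=1000)

with ThreadPoolExecutor(max_workers=8) as pool:
    thread_ids = list(pool.map(lambda _: shared.next_id(), range(5000)))

print(len(set(thread_ids)))
```

Because the whole check-and-increment happens inside the lock, no two threads can ever be handed the same Lo for the same Hi.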

Let’s look at a sample run through, with a small capacity of “3”, to keep the sample small!

Description   currentLoBefore   currentHi   Created Id   currentLoAfter
Hi Request    0                 1           1            1
              1                 1           2            2
              2                 1           3            3 (capacity)
Hi Request    0                 2           4            1
              1                 2           5            2
              2                 2           6            3 (capacity)

As we can see, if all the clients are using the same capacity, and they are given different “Hi” values, then they can’t generate duplicate keys, and by and large the keys will still be sequential in nature.
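
To make that concrete, here is a small Python sketch with two clients sharing one Hi counter and taking turns to ask for ids. The helper `make_client` and the `itertools.count` stand-in for the server are assumptions for the example only.

```python
import itertools


def make_client(request_hi, capacity):
    """Return a function that yields the next id for one client."""
    state = {"hi": None, "lo": capacity}  # lo == capacity forces a Hi request

    def next_id():
        if state["lo"] >= capacity:
            state["hi"] = request_hi()
            state["lo"] = 0
        state["lo"] += 1
        return (state["hi"] - 1) * capacity + state["lo"]

    return next_id


hi_counter = itertools.count(1)  # stands in for the Hi value kept by the server


def request_hi():
    return next(hi_counter)


client_a = make_client(request_hi, capacity=3)
client_b = make_client(request_hi, capacity=3)

ids_a, ids_b = [], []
for _ in range(4):
    ids_a.append(client_a())
    ids_b.append(client_b())

print(ids_a)  # [1, 2, 3, 7]
print(ids_b)  # [4, 5, 6, 10]
```

Each client is sequential within its own range, and the gaps only appear where a client has to fetch a new Hi.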

The implementation in RavenDB

In RavenDB, the default function configured against the DocumentConvention is now HiLo, which means that if a new document is stored against the session with its Id set to NULL, it will have an Id generated on the spot which contains the name of the document type and the incremented id (for example, blogentry/1). Obviously this can be overridden by changing the convention to leave the created id at some default value of your application’s choosing.

My original implementation was a bit poor, generating quite a bit of noise in the document database (it was inserting documents to get the ids), and the incremented Ids were being shared amongst objects – which meant if you created say, blogentry/1, saving a new user would mean having newuser/2.

Oren changed this to directly store a single object in RavenDB for the generator, and to create a generator per type, which means a lot less noise and more sensible ids being generated for each document.
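
The effect of having a generator per type can be sketched like this in Python. The `KeyGenerator` class is hypothetical, the per-type counters stand in for the stored Hi values, and the lowercase `typename/number` format is only assumed from the blogentry/1 example above.

```python
import itertools
from collections import defaultdict


class KeyGenerator:
    """Keeps separate Hi/Lo state for every document type."""

    def __init__(self, capacity):
        self.capacity = capacity
        # One Hi counter per type stands in for the per-type value stored on the server.
        self._hi_counters = defaultdict(lambda: itertools.count(1))
        self._state = {}  # type tag -> (hi, lo)

    def next_key(self, type_name):
        tag = type_name.lower()
        hi, lo = self._state.get(tag, (None, self.capacity))
        if lo >= self.capacity:
            hi = next(self._hi_counters[tag])
            lo = 0
        lo += 1
        self._state[tag] = (hi, lo)
        return f"{tag}/{(hi - 1) * self.capacity + lo}"


keys = KeyGenerator(capacity=1000)
first_post = keys.next_key("BlogEntry")
first_user = keys.next_key("User")
second_post = keys.next_key("BlogEntry")
print(first_post, first_user, second_post)
```

With the ranges kept apart per type, creating a user no longer eats into the numbers used for blog entries.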

What it means

What this essentially means is that if you’re using RavenDB out of the box without changing any of the conventions, documents will have a generated Id as soon as Store is called for that document. SaveChanges therefore does not have to be called until right at the very end of the Unit of Work, so all changes can be efficiently batched in a single request. As a result, applications should be easier to write and performance should be easier to maintain.
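
Here is a rough Python sketch of that flow. `SketchSession` is a made-up unit of work, not the RavenDB client API, and the plain counter stands in for a HiLo generator; the point is only that ids exist before anything is sent.

```python
import itertools


class SketchSession:
    """Hypothetical unit of work: ids on store, one batch on save."""

    def __init__(self, next_id, send_batch):
        self._next_id = next_id
        self._send_batch = send_batch
        self._pending = []

    def store(self, document):
        if document.get("id") is None:
            document["id"] = self._next_id()  # assigned locally, no round trip
        self._pending.append(document)

    def save_changes(self):
        self._send_batch(list(self._pending))  # everything goes in one request
        self._pending.clear()


counter = itertools.count(1)  # stands in for a HiLo generator
requests = []                 # records each batch "sent" to the server
session = SketchSession(lambda: next(counter), requests.append)

post = {"id": None, "title": "HiLo"}
session.store(post)

# The post's id is already usable, before anything has been saved.
comment = {"id": None, "post_id": post["id"]}
session.store(comment)

session.save_changes()
print(len(requests), comment["post_id"])
```

The comment can reference the post straight away, and both documents still travel to the server together.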

This is a .NET client specific feature and nothing was changed in the database itself to make this work.

What this does mean is that if multiple clients from different platforms are going to be connecting to RavenDB and manipulating data, and you’re using the default HiLo implementation, then a similar algorithm will need implementing for those other platforms, using the same capacity, in order to prevent id collisions. This is not necessarily a downside, but it is worth noting if you are going to have this sort of setup.
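
A quick bit of arithmetic in Python shows why the capacity has to match. The helper `ids_for` and the capacities 1000 and 500 are just illustrative numbers, not values taken from any real client.

```python
def ids_for(hi, capacity):
    """All ids a client may hand out for one Hi value at the given capacity."""
    return {(hi - 1) * capacity + lo for lo in range(1, capacity + 1)}


# Two clients agreeing on a capacity of 1000 never overlap.
same_capacity_overlap = ids_for(hi=2, capacity=1000) & ids_for(hi=3, capacity=1000)

# A second client that assumes a capacity of 500 maps Hi 3 onto 1001..1500,
# which sits inside the 1001..2000 range the first client owns for Hi 2.
mismatch_overlap = ids_for(hi=2, capacity=1000) & ids_for(hi=3, capacity=500)

print(len(same_capacity_overlap), len(mismatch_overlap))
```

Different Hi values only guarantee different ranges when everybody multiplies by the same number.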

What I learned

While I might contribute the odd bug fix to open source projects now and then, the idea of going in and changing the fundamental way the .NET RavenDB client worked was a bit daunting. That was not from a technical perspective, but from a taste perspective, as I wasn’t sure how Oren wanted things done. As he later said, he’d prefer that code which then has to change be submitted than that no code be submitted at all. I’d like to pass that on to anybody who wants to contribute to this project: if you’ve got a good idea then hit the mailing list, suggest it and maybe implement it. There is nothing to be lost if it’s something people want to use.

In the end, my implementation is barely visible in there, but I'm still pleased that this is in, as it makes *my* life easier :)



   


Posted on Sunday, May 16, 2010 9:00 AM

Feedback

# re: RavenDB – The HiLo what how and why

Left by Sean at 5/16/2010 10:45 PM
When using hilo myself, I found it worked better if I incremented the hi by e.g. 1000 and then got a new id with
(currentHi) + (++currentLo)

That lets you change the capacity client side without problems.

It also simplified getting a new id inside sql server stored procs.

# re: RavenDB – The HiLo what how and why

Left by robashton at 5/16/2010 10:54 PM
Surely that means the client still needs to be aware of the capacity (because they'd still be able to go past capacity if it was changed on the server).

Either way it's brittle if you're not in control of the whole system - but if you're writing a proper system then only one client will access the database (ideally), and everything else will go through that client so it doesn't matter.

# re: RavenDB – The HiLo what how and why

Left by Ken Egozi at 5/17/2010 6:10 AM
@rob: proper system => only one client?
With large, complex systems, you can easily find a mix of technologies, like a RoR front-end, an Erlang-based chat server, a .NET logic engine, and Java-based batch processing. One of the things that can make RavenDB appealing would be a consistent set of clients for major environments.

E.g., I like that with MongoDB you get an official server build for every possible type of host (irrelevant with RavenDB as it is dependent on ESENT afaik), and a large list of consistent client APIs.

# re: RavenDB – The HiLo what how and why

Left by robashton at 5/17/2010 7:45 AM
Sure - but most architects would balk at the idea of letting all those things go directly to the database. Instead, the main application would most likely expose services which these would go to.

Even if you weren't to do that (Because let's face it, RavenDB exposes the ability to load logic directly into it *and* exposes REST services so why *not* go directly to it), the fact remains that with any client-id assignation system you're going to have to standardise how and when those clients generate those ids and it doesn't matter which variant of the HiLo algorithm you use :)


Copyright © Rob Ashton

Design by Rob Ashton, Based On A Design By Bartosz Brzezinski