Wednesday, May 12, 2010

RavenDB – Basic usage considerations

Note: The interfaces have been updated since this entry was written, and Linq query support is now built into the .NET client. I’ve updated these posts to use the LuceneQuery syntax, but that’s probably not the preferred way of doing things.

There will be plenty more of these to talk about as I carry on developing this application against RavenDB, but there are a few immediate concepts worth writing about now, all to do with the basic manner in which you interact with RavenDB.

DocumentSession vs DocumentStore

This is the most basic consideration:

  • When do you create a DocumentStore?
  • When do you create a DocumentSession?

The simple answer is that you create a DocumentStore on application start-up, and you create a DocumentSession for every unit of work after that.

In an ASP.NET MVC web application, this would be:

  • Create a DocumentStore in Application_Start
  • Create a DocumentSession on BeginRequest
  • Destroy the DocumentSession on EndRequest

Creating Indices

Because every tutorial on how to use RavenDB will no doubt include index creation, the temptation will be there to get into the habit of invoking the code to create indexes every time your application is run (or to simply forget that you started off this way and leave the code in there).

    documentStore.DatabaseCommands.PutIndex(
        "BookByTitle",
        new IndexDefinition<Book, Book>()
        {
            Map = docs => from doc in docs
                          where doc.Title != null
                          select new
                          {
                              Title = doc.Title
                          },
            Stores = { { x => x.Title, FieldStorage.Yes } }
        });

As data is added to the system or modified, RavenDB will (in its own time) run that dirty data across those indexes, and the application will use those indexes to pull the data out for display and manipulation purposes.

If an index is re-created, all of that indexed data becomes obsolete, and thus RavenDB must re-run *all* of the data in the system against that index. If your application is re-creating indexes or simply creating indexes on the fly as a regular action then performance will suffer.

The best practice is to treat these indices as a management function, something that is done once when the document database is first created – and then updated as part of maintenance/upgrades – like database changes in a traditional system (only somewhat easier!).

I have a simple script to create all the indexes in a blank, freshly created RavenDB instance, so while I’m developing against the application I can start from scratch again anytime. The important thing to note is that I don’t run this every time I start the application up – just when I’ve made changes to those indexes.

I might talk about this in a future blog post as I’ve ended up with a nice structure that involves disposing of the magic strings that form the names of the indexes in RavenDB and that can’t be a bad thing.

Saving new objects

This actually goes for most operations such as deletion, updates to objects etc – but saving objects is probably the most complete example of the collection. None of this is too dissimilar to the considerations we’d apply when working against a traditional RDBMS and an ORM, but it’s worth reiterating for those who are unfamiliar with the concepts.

Consider a simple repository for entities in our system whose interface looks something like this:

    public interface IBookRepository
    {
        Book Get(string id);
        void Save(Book book);
    }

A sample implementation of this repository might look like this:

    public class BookRepository : IBookRepository
    {
        private IDocumentSession mDocumentSession;

        public BookRepository(IDocumentSession documentSession)
        {
            mDocumentSession = documentSession;
        }

        public Book Get(string id)
        {
            return mDocumentSession.Load<Book>(id);
        }

        public void Save(Book book)
        {
            mDocumentSession.Store(book);
        }
    }

Ignoring the rest of the repository, there are decisions to be made at this point about what the Save method should actually do.

Consider a basic use of the repository like so:

    public void PublishBook(Book book)
    {
        mRepository.Save(book);
        mEventInvoker.RaiseEvent(new BookPublishedEvent(book.Id));
    }

Ignoring the obvious (like this publish method isn’t actually publishing a book!), our problem here is that the created book does not yet have an Id because we haven’t called SaveChanges yet, and yet we’re attempting to use this Id as the argument for another action in our application.

The proposed fix? Change the repository so we call SaveChanges of course!

    public void Save(Book book)
    {
        mDocumentSession.Store(book);
        mDocumentSession.SaveChanges();
    }

That appears to have fixed the problem, but if we were using IDocumentSession to control our unit of work, calling SaveChanges has just broken that, because all the pending changes (including others made across the rest of the system) were flushed across to the server.

We can fix that by wrapping our whole unit of work inside a TransactionScope (which RavenDB respects), but we’ve still got one problem we need to be aware of:

    foreach (Book book in booksToCreate)
    {
        mRepository.Save(book);
    }

Now we’re saving a collection of books, let’s say there are 100 of them – that’s 100 calls to SaveChanges, which is 100 calls across the wire, and 100 calls to ‘whatever RavenDB does when you push an object to RavenDB’ (It’s expensive okay?).

That’s not to say you can’t use this hammer to solve the problem, but you should think about it and do what makes sense in your application.

  • You could still add more interfaces/methods specifically for batch operations, and still call SaveChanges at that level
  • You could use your own client-side key generation code (RavenDB allows this) – and perhaps adopt something like HiLo against the Type of the document – thus negating the need to call SaveChanges at all until everything has been done that needs doing
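The first option can be sketched roughly as follows. This is only an illustration of the shape: ISession and CountingSession are stand-ins written for this example (they are not RavenDB types) so that the number of round trips can be counted, and SaveAll is a hypothetical method name. In a real application the session would be the IDocumentSession used earlier.

```csharp
using System;
using System.Collections.Generic;

public class Book
{
    public string Id { get; set; }
    public string Title { get; set; }
}

// Hypothetical stand-in exposing only the two session methods used in this post.
public interface ISession
{
    void Store(object entity);
    void SaveChanges();
}

// Fake session that counts how many times changes are flushed.
public class CountingSession : ISession
{
    private readonly List<object> mPending = new List<object>();

    public int RoundTrips { get; private set; }
    public int SavedCount { get; private set; }

    public void Store(object entity)
    {
        mPending.Add(entity);
    }

    public void SaveChanges()
    {
        RoundTrips++;
        SavedCount += mPending.Count;
        mPending.Clear();
    }
}

public class BookRepository
{
    private readonly ISession mSession;

    public BookRepository(ISession session)
    {
        mSession = session;
    }

    // Single save: registers the book and leaves flushing to the caller.
    public void Save(Book book)
    {
        mSession.Store(book);
    }

    // Batch save: registers every book, then flushes exactly once.
    public void SaveAll(IEnumerable<Book> books)
    {
        foreach (Book book in books)
        {
            mSession.Store(book);
        }
        mSession.SaveChanges();
    }
}
```

With this shape, saving 100 books through SaveAll costs one flush instead of 100, at the price of a second method on the repository.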

I’m probably going to experiment with the second option and write a blog entry once I’ve worked out what it is I want to achieve.

Update: I have since written a HiLo generator, and Oren has integrated it so that HiLo is the default generator for RavenDB. This means a call to SaveChanges is not needed in order to get the id for an item, so this bullet point is now almost irrelevant unless you override this behaviour to use keys generated by the server.
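For anyone who hasn’t met HiLo before, the idea is that the client reserves a whole block of ids at once (the “hi” part) and then hands out ids from that block locally (the “lo” part), so most new objects get an id without any round trip. The class below is a minimal sketch of that idea and not the generator that ships with RavenDB: the name HiLoKeyGenerator, the reserveNextHi delegate and the "prefix/number" key format are all assumptions made for illustration. In a real implementation, reserving the next hi value would be an atomic operation against the server.

```csharp
using System;

public class HiLoKeyGenerator
{
    private readonly object mLock = new object();
    private readonly string mPrefix;
    private readonly int mCapacity;
    private readonly Func<long> mReserveNextHi; // hypothetical: returns 1, 2, 3, ...
    private long mCurrentHi;
    private int mCurrentLo;

    public HiLoKeyGenerator(string prefix, int capacity, Func<long> reserveNextHi)
    {
        if (capacity < 1) throw new ArgumentOutOfRangeException("capacity");
        mPrefix = prefix;
        mCapacity = capacity;
        mReserveNextHi = reserveNextHi;
        mCurrentLo = capacity; // block starts exhausted, forcing a reservation on first use
    }

    public string NextKey()
    {
        lock (mLock)
        {
            if (mCurrentLo >= mCapacity)
            {
                // Current block is used up: reserve a new one (the only remote call).
                mCurrentHi = mReserveNextHi();
                mCurrentLo = 0;
            }

            // Block 1 covers ids 1..capacity, block 2 covers capacity+1..2*capacity, etc.
            long id = (mCurrentHi - 1) * mCapacity + mCurrentLo + 1;
            mCurrentLo++;
            return mPrefix + "/" + id;
        }
    }
}
```

The capacity is the trade-off: a bigger block means fewer reservations, but more unused ids are thrown away whenever the application restarts.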

Stale Data

Let’s say we have a top level page on our website which displays the top 20 books by popularity in a certain category. The following query is executed:

    Book[] categoryBooks = documentSession.LuceneQuery<Book>("BookByCategory")
                            .WaitForNonStaleResults()
                            .Where(String.Format("Category:{0}", category))
                            .Take(20)
                            .OrderBy("Popularity").ToArray();

The temptation is there to always call WaitForNonStaleResults, because most demo code will do this as a matter of course (invoking it deterministically says “give me back the results I expect for this demo”).

The problem is, WaitForNonStaleResults will do exactly what it says, it will wait until the results coming back are no longer stale – which means your page request will hang, which means you won’t have a responsive application – and the whole point of using a database like RavenDB is that you want the application to be responsive!

There is a good reason that WaitForNonStaleResults is not the default – before you write it, consider what it is you actually want. In this example, it really doesn’t matter if the data being displayed on this high traffic top level page is a bit out of date, and the call simply is not needed.

Paging

Let’s say there are 100,000 books in the document store and we invoke the following code:

    Book[] books = documentSession.LuceneQuery<Book>()
                             .ToArray();

How many books do you expect there to be in that collection? 100,000? If 100,000 objects were returned into that collection, how long would it take? What would you be doing to those 100,000 objects? How much memory would they require if held in memory all together like that? Yeah, it’s unlikely that you’d ever write the above code in your production application, because bringing back all the objects is rarely what the developer actually intends.

Thankfully RavenDB safeguards against this kind of sloppy code and automatically limits the number of results returned. Both the .NET client and the server have this behaviour built into them, which means you’ll only get (at the moment) 128 objects coming back for the above query. This is equally true for all types of queries, including queries against indices with where clauses and orderings and everything else you might want to put in a query.

Currently the server itself will only let you page 1024 objects at one time, so you can’t be lazy and make a call to Take(100000) because it won’t let you. I’ve actually got an extension method which *does* bring back *all* the objects for testing purposes, but I’ll leave that one out of this blog entry for fear of people actually using it!

Just be aware that paging is there to help you and don’t be surprised when you don’t get all the documents back when doing a blanket query. Use paging properly!
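As a small sketch of what “paging properly” can look like, the helper below turns a page index and page size into the skip/take pair for one bounded page, clamping the page size to the 1024 limit mentioned above. PageRequest is a hypothetical name invented for this example and not part of the RavenDB client; the resulting values would be handed to whatever skip and take calls your query API exposes (the Take call from the earlier example being one of them).

```csharp
using System;

public sealed class PageRequest
{
    // Server-side limit quoted in this post; an assumption that may change between versions.
    public const int MaxPageSize = 1024;

    public int Skip { get; private set; }
    public int Take { get; private set; }

    // pageIndex is zero-based: page 0 is the first page.
    public PageRequest(int pageIndex, int pageSize)
    {
        if (pageIndex < 0) throw new ArgumentOutOfRangeException("pageIndex");
        if (pageSize < 1) throw new ArgumentOutOfRangeException("pageSize");

        // Never ask for more than the server is willing to return in one go.
        Take = Math.Min(pageSize, MaxPageSize);
        Skip = pageIndex * Take;
    }
}
```

For the third page of a 20-per-page listing, new PageRequest(2, 20) gives a Skip of 40 and a Take of 20.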

posted @ Wednesday, May 12, 2010 2:05 PM

Copyright © Rob Ashton
