A place to discuss Development techniques, .NET, XNA, NHibernate or anything else that tickles your fancy

Sunday, June 28, 2009

FNH session from Chicago Code Camp

The sound is horrible in all of these, I apologize. I've gotten several requests to upload it anyhow regardless of the sound/camera issues. Watch at your own risk ;)


They're all broken into about 10 minute chunks, except for the last one, which was just the last 15 minutes.

Monday, June 15, 2009

NHibernate EAV system search performance results

We've been working on a pretty large EAV system this past year, and we're ramping up to get some real clients in it. One of it's features is a search builder that allows the end user to search on any combination of values on the entity that they've defined as attributes, and also any class specific attributes that are hardcoded on the class.

SearchBuilder's great, it works well. The problem we've had with it is performance, specifically relating to scale. E.g. 10,000 entities, with 1,000 attributes each, would be 10,000,000 rows for a user to search through, and that's just for one grouping of entities, there would be more rows than this that are actually in Sql.

Currently, we were doing a subquery of these using NHibernate's criteria using a projection with a PropertyIn. This resolves out to be an IN query in SQL, so something like this:


   1:  this_.EntityID in (SELECT this_0_.EntityID as y0_

   2:  FROM   Values this_0_

   3:         left outer join Attributes attr1_

   4:           on this_0_.AttributeID = attr1_.AttributeID

   5:         left outer join ValueEntities owner3_

   6:           on this_0_.EntityID = owner3_.EntityID

   7:  WHERE  this_0_.DiscriminatorValue = 'NumericValue'

   8:         AND attr1_.Code = 'acd3edf4-6d11-4747-8383-03a70174d72e' /* @p6 */

   9:         and attr1_.OwnerType = 'EntityType' /* @p8 */

  10:         and this_0_.NumericValue > 1 /* @p9 */)

This allows us to match entities that have values, and then check to see if the entity exists in the subselect. Now, using the IN operator is traditionally a DBA's worst nightmare, and since we were hitting some performance issues, I opted to attempt to refactor this away from a subselect after refactoring the domain model restrictions that prevented me from searching directly on the collection association in the first place. This moves into a JOIN model for each value type association (There are 4 value types in the system at the moment, int, string, datetime, bool; each strongly typed in the database, so we hold 4 different collections on the entity to associate their 4 value types so the Value property is strongly typed. The searchbuilder only creates the joins required to fulfill the search, so if only an int is being searched on, only that association will be made)

Thinking I would nab a huge performance increase, I twittered about the performance issues with the IN that I was having, and @tehlike (Tuna) responded back asking if I had tried an Exists. I had not, and to be fair, I was basing all of my assumptions off of DBA articles I read a while back. Enamored, I decided to set up a simple performance profiling of the three different ways that I could execute my EAV find: JOIN (the new way), IN (The old way), EXISTS (The suggested way).

I set up my environment with a LocalSql2005 instance. During each execution I restarted MsSql so it wouldn't page the query (otherwise the second time the query runs you'd get it back in 1-2 seconds). This in no way is a thorough performance test, or even the most ideal. These numbers should be taken with a grain of salt at best. But they do give some general indications of performance comparisons. Here's the results:

Query restriction: A single attribute value on the entity being > 1.
Baseline data: 1566 rows on the left side (entity side), 141119 rows on the right side (value sub select)
  • JOIN: 7s
  • IN: 7s
  • EXISTS: 8s
So, right away, the JOIN didn't give me the performance improvements that I had thought were coming. At this point, I started to get somewhat worried of what was going on behind the scenes, and if I had done something wrong. I relooked at the mappings, and the sql generated by each, and it all seemed to match up to what would by typed by hand. Next I bumped up the size of the data by 4x:

Query restriction: A single attribute value on the entity being > 1
Baseline data: 6264 rows on the left side, 564476 rows on the right side
  • JOIN: 26s
  • IN: 30s
  • EXISTS: 31s

Join seems to perform a little bit better, but only slightly. On a whole, the performance improvement is underwhelming. Now I complicate things by changing up the query and adding an additional attribute value to be searched on of a different type (Another join to the specific value type collection for the JOIN, and another sub-select just referencing a different value column for the IN/EXISTS)

Query Restriction: Single value on entity being > 0, and a text value being != 'NotAValue'
Baseline data: 6264 rows on the left side, 564476 rows on the right side
  • JOIN 35s
  • IN 32s
  • EXISTS 31

Wait...what? Join ended up slowing down significantly as you add different joins to the various value type collections and additional criteria values, wheras the subselects did not really, because SQL had already executed the sub-select for the first value, and the execution plan was already optimized when the second value subquery was being asked.

Grant you, these numbers fluxuated by a second or two when run multiple times, I'm giving the average. But IN/EXISTS seems to scale better than JOIN, in terms of adding additional restrictions, and additional value type associations.

In terms of row data scaling, the IN and Exists seemed to hold their own admirably to the JOIN, and in any complex query scenario, performed better.

So this brings me to the final decision: We're probably going to go back to the JOIN, and may look at tuning that to an EXISTS if we run into future performance issue. It's also important to note that indexing/statistics in MsSql plays a *HUGE* role in the table performance. When the properly tuned indexing was enabled for the database, all three options went to sub second response times. So in the end, why quibble about the 1% syntax performance tweaks of a problem that is 99% solved by proper database tuning.

Anyone else have anything to add? I know it's not a large amount of test data, with a stable environment, but discounting that, I feel much better about using IN's/EXISTS now.

UPDATE: Robert and I talked to Ayende at length about this (Another quick round of thanks to Ayende for being willing to dedicate the time).

It's looking like RDMS is definitely the wrong way to solve this problem. Looking at a document database, like Couch DB, or Ayende's own Divan DB. Alternatively, we can try and get something going using MsSql's Xml document column type.

Of course either of those options pose huge headaches in dealing with the search options we're giving to the client, so we're also looking at utilizing Lucene.NET to perhaps manage that.

Something tells me I'll be posing some blog entries in the near future milestoning our journey.

Thursday, June 11, 2009

Why do we take on side projects?

I currently contribute to the following OSS projects:

I have a full time job, a wife, and in a band. So why do I spend what little time I have working when I get home, much less for free? For me, it's simple. I get the opportunity to give back to the community, the pride in being a part of a great team of developers who are creating fantastic products, and the chance at incredible learning opportunities.

But that's not enough is it? For many of us, programming is a hobby in the evenings. It's a way to do things the way they were *meant* to be done, without all of the bureaucracy and red tape that typically clutters up our day job. I mean, this is really a chance to flex our muscles, see what we can do when the restraints are taken off.

Some of us go barreling into the wall at 100mph , but in general, it's a way to get back to the roots that drew all of us into programming in the first place. Fun. Doing something, just because you can, and it'll make someone's life better as a result. Pushing the limits of terse code. Seeing just how fast you can actually get that cycle to complete in. Creating that framework that lets you plug in third party modules like home appliances. It's all great fun, and in my opinion, the way everything should be.

So what went wrong? Where did we go wrong during our day jobs? Why is it so difficult to find a company that's doing software development the fun way? With so many software engineers being hired out, why is there company after company that has one codethulu setup after another? Where'd the fun go?

The false prophets took it away. Who am I talking about? Anyone who got into development for the wrong reasons:
  • They were in sales 15 years ago and had some skills with spreadsheets at a (then) small company and became the "lead technical" guy
  • They went to school, didn't know what to do, and were told by the campus counselor that software was a growing industry and needed more warm bodies
  • The old has been that grew up knowing one language, and one way to do something, and never changed (to a man with only a hammer, every problem appears to be a nail)
  • Warm bodies trained to pump out code which gets charged per line (aka many "call center" developers)
  • The cousin of the president of the company, who's always been "Good with computers"
Generally these people have the best intentions, sometimes they don't, and they fight learning new processes and tools like a blind man sitting on a key lime pie (this is my blog, I'll make up whatever ridiculous simile's I feel like). So where did they go wrong? If we're all trying to achieve the same thing, why is it such a common obstacle to overcome to develop software the Right Way™ ?

What are your stories and reasons for getting into development, and if you're on any passion projects, what were the reasons that led to you being involved?