We've been working on a pretty large
EAV system this past year, and we're ramping up to get some real clients in it. One of it's features is a search builder that allows the end user to search on any combination of values on the entity that they've defined as attributes, and also any class specific attributes that are hardcoded on the class.
SearchBuilder's great, it works well. The problem we've had with it is performance, specifically relating to scale. E.g. 10,000 entities, with 1,000 attributes each, would be 10,000,000 rows for a user to search through, and that's just for one grouping of entities, there would be more rows than this that are actually in Sql.
Currently, we were doing a subquery of these using NHibernate's criteria using a projection with a PropertyIn. This resolves out to be an IN query in SQL, so something like this:
1: this_.EntityID in (SELECT this_0_.EntityID as y0_
2: FROM Values this_0_
3: left outer join Attributes attr1_
4: on this_0_.AttributeID = attr1_.AttributeID
5: left outer join ValueEntities owner3_
6: on this_0_.EntityID = owner3_.EntityID
7: WHERE this_0_.DiscriminatorValue = 'NumericValue'
8: AND attr1_.Code = 'acd3edf4-6d11-4747-8383-03a70174d72e' /* @p6 */
9: and attr1_.OwnerType = 'EntityType' /* @p8 */
10: and this_0_.NumericValue > 1 /* @p9 */)
This allows us to match entities that have values, and then check to see if the entity exists in the subselect. Now, using the IN operator is traditionally a DBA's worst nightmare, and since we were hitting some performance issues, I opted to attempt to refactor this away from a subselect after refactoring the domain model restrictions that prevented me from searching directly on the collection association in the first place. This moves into a JOIN model for each value type association (There are 4 value types in the system at the moment, int, string, datetime, bool; each strongly typed in the database, so we hold 4 different collections on the entity to associate their 4 value types so the Value property is strongly typed. The searchbuilder only creates the joins required to fulfill the search, so if only an int is being searched on, only that association will be made)
Thinking I would nab a huge performance increase, I twittered about the performance issues with the IN that I was having, and @tehlike (Tuna) responded back asking if I had tried an Exists. I had not, and to be fair, I was basing all of my assumptions off of DBA articles I read a while back. Enamored, I decided to set up a simple performance profiling of the three different ways that I could execute my EAV find: JOIN (the new way), IN (The old way), EXISTS (The suggested way).
I set up my environment with a LocalSql2005 instance. During each execution I restarted MsSql so it wouldn't page the query (otherwise the second time the query runs you'd get it back in 1-2 seconds). This in no way is a thorough performance test, or even the most ideal. These numbers should be taken with a grain of salt at best. But they do give some general indications of performance comparisons. Here's the results:
Query restriction: A single attribute value on the entity being > 1.
Baseline data: 1566 rows on the left side (entity side), 141119 rows on the right side (value sub select)
So, right away, the JOIN didn't give me the performance improvements that I had thought were coming. At this point, I started to get somewhat worried of what was going on behind the scenes, and if I had done something wrong. I relooked at the mappings, and the sql generated by each, and it all seemed to match up to what would by typed by hand. Next I bumped up the size of the data by 4x:
Query restriction: A single attribute value on the entity being > 1
Baseline data: 6264 rows on the left side, 564476 rows on the right side
- JOIN: 26s
- IN: 30s
- EXISTS: 31s
Join seems to perform a little bit better, but only slightly. On a whole, the performance improvement is underwhelming. Now I complicate things by changing up the query and adding an additional attribute value to be searched on of a different type (Another join to the specific value type collection for the JOIN, and another sub-select just referencing a different value column for the IN/EXISTS)
Query Restriction: Single value on entity being > 0, and a text value being != 'NotAValue'
Baseline data: 6264 rows on the left side, 564476 rows on the right side
Wait...what? Join ended up slowing down significantly as you add different joins to the various value type collections and additional criteria values, wheras the subselects did not really, because SQL had already executed the sub-select for the first value, and the execution plan was already optimized when the second value subquery was being asked.
Grant you, these numbers fluxuated by a second or two when run multiple times, I'm giving the average. But IN/EXISTS seems to scale better than JOIN, in terms of adding additional restrictions, and additional value type associations.
In terms of row data scaling, the IN and Exists seemed to hold their own admirably to the JOIN, and in any complex query scenario, performed better.
So this brings me to the final decision: We're probably going to go back to the JOIN, and may look at tuning that to an EXISTS if we run into future performance issue. It's also important to note that indexing/statistics in MsSql plays a *HUGE* role in the table performance. When the properly tuned indexing was enabled for the database, all three options went to sub second response times. So in the end, why quibble about the 1% syntax performance tweaks of a problem that is 99% solved by proper database tuning.
Anyone else have anything to add? I know it's not a large amount of test data, with a stable environment, but discounting that, I feel much better about using IN's/EXISTS now.
UPDATE: Robert and I talked to Ayende at length about this (Another quick round of thanks to Ayende for being willing to dedicate the time).
It's looking like RDMS is definitely the wrong way to solve this problem. Looking at a document database, like Couch DB, or Ayende's own Divan DB. Alternatively, we can try and get something going using MsSql's Xml document column type.
Of course either of those options pose huge headaches in dealing with the search options we're giving to the client, so we're also looking at utilizing Lucene.NET to perhaps manage that.
Something tells me I'll be posing some blog entries in the near future milestoning our journey.