SearchBuilder's great, it works well. The problem we've had with it is performance, specifically relating to scale. E.g. 10,000 entities, with 1,000 attributes each, would be 10,000,000 rows for a user to search through, and that's just for one grouping of entities, there would be more rows than this that are actually in Sql.
Currently, we were doing a subquery of these using NHibernate's criteria using a projection with a PropertyIn. This resolves out to be an IN query in SQL, so something like this:
1: this_.EntityID in (SELECT this_0_.EntityID as y0_
2: FROM Values this_0_
3: left outer join Attributes attr1_
4: on this_0_.AttributeID = attr1_.AttributeID
5: left outer join ValueEntities owner3_
6: on this_0_.EntityID = owner3_.EntityID
7: WHERE this_0_.DiscriminatorValue = 'NumericValue'
8: AND attr1_.Code = 'acd3edf4-6d11-4747-8383-03a70174d72e' /* @p6 */
9: and attr1_.OwnerType = 'EntityType' /* @p8 */
10: and this_0_.NumericValue > 1 /* @p9 */)
This allows us to match entities that have values, and then check to see if the entity exists in the subselect. Now, using the IN operator is traditionally a DBA's worst nightmare, and since we were hitting some performance issues, I opted to attempt to refactor this away from a subselect after refactoring the domain model restrictions that prevented me from searching directly on the collection association in the first place. This moves into a JOIN model for each value type association (There are 4 value types in the system at the moment, int, string, datetime, bool; each strongly typed in the database, so we hold 4 different collections on the entity to associate their 4 value types so the Value property is strongly typed. The searchbuilder only creates the joins required to fulfill the search, so if only an int is being searched on, only that association will be made)
Thinking I would nab a huge performance increase, I twittered about the performance issues with the IN that I was having, and @tehlike (Tuna) responded back asking if I had tried an Exists. I had not, and to be fair, I was basing all of my assumptions off of DBA articles I read a while back. Enamored, I decided to set up a simple performance profiling of the three different ways that I could execute my EAV find: JOIN (the new way), IN (The old way), EXISTS (The suggested way).
I set up my environment with a LocalSql2005 instance. During each execution I restarted MsSql so it wouldn't page the query (otherwise the second time the query runs you'd get it back in 1-2 seconds). This in no way is a thorough performance test, or even the most ideal. These numbers should be taken with a grain of salt at best. But they do give some general indications of performance comparisons. Here's the results:
Query restriction: A single attribute value on the entity being > 1.
Baseline data: 1566 rows on the left side (entity side), 141119 rows on the right side (value sub select)
- JOIN: 7s
- IN: 7s
- EXISTS: 8s
Query restriction: A single attribute value on the entity being > 1
Baseline data: 6264 rows on the left side, 564476 rows on the right side
- JOIN: 26s
- IN: 30s
- EXISTS: 31s
Join seems to perform a little bit better, but only slightly. On a whole, the performance improvement is underwhelming. Now I complicate things by changing up the query and adding an additional attribute value to be searched on of a different type (Another join to the specific value type collection for the JOIN, and another sub-select just referencing a different value column for the IN/EXISTS)
Query Restriction: Single value on entity being > 0, and a text value being != 'NotAValue'
Baseline data: 6264 rows on the left side, 564476 rows on the right side
- JOIN 35s
- IN 32s
- EXISTS 31
Wait...what? Join ended up slowing down significantly as you add different joins to the various value type collections and additional criteria values, wheras the subselects did not really, because SQL had already executed the sub-select for the first value, and the execution plan was already optimized when the second value subquery was being asked.
Grant you, these numbers fluxuated by a second or two when run multiple times, I'm giving the average. But IN/EXISTS seems to scale better than JOIN, in terms of adding additional restrictions, and additional value type associations.
In terms of row data scaling, the IN and Exists seemed to hold their own admirably to the JOIN, and in any complex query scenario, performed better.
So this brings me to the final decision: We're probably going to go back to the JOIN, and may look at tuning that to an EXISTS if we run into future performance issue. It's also important to note that indexing/statistics in MsSql plays a *HUGE* role in the table performance. When the properly tuned indexing was enabled for the database, all three options went to sub second response times. So in the end, why quibble about the 1% syntax performance tweaks of a problem that is 99% solved by proper database tuning.
Anyone else have anything to add? I know it's not a large amount of test data, with a stable environment, but discounting that, I feel much better about using IN's/EXISTS now.
UPDATE: Robert and I talked to Ayende at length about this (Another quick round of thanks to Ayende for being willing to dedicate the time).
It's looking like RDMS is definitely the wrong way to solve this problem. Looking at a document database, like Couch DB, or Ayende's own Divan DB. Alternatively, we can try and get something going using MsSql's Xml document column type.
Of course either of those options pose huge headaches in dealing with the search options we're giving to the client, so we're also looking at utilizing Lucene.NET to perhaps manage that.
Something tells me I'll be posing some blog entries in the near future milestoning our journey.
6 comments:
EAV systems are generally _very_ bad for performance in relational DBs
They also tend to cause a lot of concurrency issues and deadlocks.
I would recommend strongly against this as an architectual approach.
Those are actually two very large issues we're hitting right now sir. I'm not exactly sure what a good alternative might be to this though. User definable attributes that behave like columns is a staple of the application :(
The only other alternative that we've been kicking around is to have the attributes describe the table schema to the user in user terms, and actually generate a wide table as a result of that, which the entity data would live in. So it wouldn't be a tall/thin system. The only problem with that, is we would be generating a new "values" table for each client that's stored in the system, and I'm not aware of how to go about mapping dynamically generated tables in NHibernate.
There might be another technique that could be applied here, but that's the limits of what I can come up with right now :/
Can you store attributes and values in XML format in XML datatype column?
Yup, the attributes still need a strong typed definition in an attributes table, but the entity/attribute Values, the many table that sits between, that can go away with an XML datacolumn, or a document database. That's what's causing the deadlocks and majority of the performance issues. Once again, thanks to Ayende for his insight regarding this.
Lucene is a really great search engine, and very customizable, I've used it on two different projects and have always been impressed by it's performance.
Post a Comment