Hi Guys
I'm Gareth, Paul Foley's business partner (silverbug on here).
One of our umbraco installations has been acting up, so much as that the media section takes almost a minute to load up.
The first thing I did was check the CPU on the server, recycle the app pool etc - nothing I tried
fixed it.
I thought what the heck, and fired up SQL Profiler, and started observing the SQL:BatchStarting event.
For those of you who haven't used it, SQL Profiler lets you see the database queries being submitted to the database server, their text, SPID and various other great bits of data.
Usually you use it when you want to see what's happening behind the scenes or if you want to optimise an application by reducing the number of database queries etc.
It doesn't come as part of the workgroup sql installation, or express edition but you can get it in the developer, standard or enterprise editions.
When I click the blue media icon in the bottom of the umbraco admin menu, I see approximately 20,000 queries hit my database server.
Another example, is loading the homepage of the site, I see 34 queries hit the database server.
This system has approximately 2000 media items of various types, but on average, they have 4+ properties.
From looking at the profiler output, I see this for loading the media list:
First, _for each element_ in the tree (3 queries each):
--Find the id based on the guid
Select id from umbracoNode where uniqueID = 'cc07b313-0843-4aa8-bbda-871c8da728c8'
--get some info from the UmbracoNode table
Select createDate, trashed, parentId, nodeObjectType, nodeUser, level, path, sortOrder,
uniqueId, text from UmbracoNode where id = 1032
--Get some more info from the cmsContentType table
Select Alias,icon,thumbnail,description from cmsContentType where nodeid=1032
This could easily be optimised into a single query with a single inner join.
Then it gets scary:
For each media item in each folder of the tree:
Select createDate, trashed, parentId, nodeObjectType, nodeUser, level, path, sortOrder, uniqueId, text from UmbracoNode where id = 1032
Select Alias,icon,thumbnail,description from cmsContentType where nodeid=1032
Select top 1 VersionId from cmsContentVersion where contentID = 1377 order by id desc
Select dataNvarchar from cmsPropertyData where id = 8472
Select dataNvarchar from cmsPropertyData where id = 8473
Select dataNvarchar from cmsPropertyData where id = 8474
Select dataNvarchar from cmsPropertyData where id = 8475
Select dataNvarchar from cmsPropertyData where id = 8476
Okay, so here's the root of the problem - and as far as i can see, it's a problem thoughout, we've only just noticed it in the media items due to the sheer number of them and the number of properites on each. I think you'll agree that these numbers multiply out real fast - a good example is the homepage of 34 queries. 1 Visitor is 34 queries, 2 is 68, 4 is 136...
For each media item, first the base data for the node is retrieved, such as it's parent, the node object type etc.
Then the icon details, description etc, presumably for the UI. (again a join could be used)
Then the most recent version guid for what I think is this node (though the data gets a little jumbled and out of order in profiler and it's hard to tell)
Then _for each property_ of the media type, the data is retreived from cmsPropertyData.
Okay, so a lot of complaining is going on here, I have read parts of the umbraco source, and i'm a fairly competent programmer with a fair bit of experience in application development - it seems to me that the core of the problem is that the database abstraction layer / orm / whatever you want to call it, is a little too broken down.
What this means, is that each little bit retreives it's own data when it needs it and returns this. While this is fine for a small scale, it won't scale well.
There seemed to be small things, like when you have two constructors for getting an item by either a guid or an id, the guid one will retreive the id and then call the id one - rather than abstracting the code that reads the returned results and feeding it two different data sources (one selected by guid and the other by id) but these will compound when applied en-masse, for example loading the media tree.
The real problem with loading the tree (I can deal with 3 queries per media item, though this is still going to be several thousand queries) is the property system.
As I understand it, looking at the output of the data, you're using a structure which i've used in a few systems and that's storing individual properties vertically rather than horizontally - this obviously has huge customisation benefits for the adminstration - one of the main features of umbraco of course is it's property system.
The problem is in the implementation of retrieving the data.
From looking at the output, I can't see where the actual property ids for a given media type are stored and then used, I assume they're retrieved much earlier in the batch and cached.
I can see several solutions to the problem.
1) Use a UDF to retrieve each property and rotate the data -
create function dbo.getVarcharProperty (@nodeId as int)
returns varchar(255)
begin
declare @dataNvarchar varchar(255)
select
@dataNvarchar = dataNvarchar
from
cmsPropertyData
where
id = @nodeId
return @dataNvarchar
end
Select dbo.getVarcharProperty(8472), dbo.getVarcharProperty(8473)
This will work and slightly optimise but still wont be very efficent since the UDF performs a seperate select anyway
2) Build a temporary table (rotated) and populate it using cursors or similar in a stored procedure - as far as I can see umbraco doesn't use stored procs or udfs, so this is probably not an option.
3) Select all the properties as a single batch to a cache and use a sqlcachedependency - while it's okay, it's sql 2005 only and from experience quite difficult to get working - this wouldn't work for an application such as umbraco where the deployment isn't strictly controlled.
4) Be sneaky and do some of this in the application: an example
Presently we do this to get the properties:
Select dataNvarchar from cmsPropertyData where id = 8472
Select dataNvarchar from cmsPropertyData where id = 8473
Select dataNvarchar from cmsPropertyData where id = 8474
Select dataNvarchar from cmsPropertyData where id = 8475
What if this was:
Select id, dataNvarchar, dataNText, dataInt, dataDate from cmsPropertyData where id in (8472,8473,8474,8475)
(or better yet where propertyTypeId = ??
This data would then come back to the application (you're getting all the properties anyway, you obviously know the property ids, so then iterate, once, over that result set and perform your class population.
//pseudo code, haven't checked source
Enum propertyTypeMapping
{
dataNvarchar = 1, dataNText = 2, dataInt = 3, dataDate = 4
}
Media m = new Media();
int[] cmsNodePropertyIds = new int[] { 8472, 8473,8474,8475 };
string sql = string.format("Select id, propertyTypeId, dataNvarchar, dataNText, dataInt, dataDate from cmsPropertyData where id in ({0}), String.Join(",",cmsNodePropertyIds)
DataTable props = getDataTable(sql)
foreach(DataRow row in props.Rows)
{
int propertyId = (int)row["id"]
Property p = m.properties.find(delegate(Property prop) { return prop.id == propertyId });
if(p!=null)
p.setValue(row[Enum.GetName(typeof(propertyTypeMapping), (int)row["propertyTypeId"])]);
}
Anyway, if someone else has a better solution than changing the code (that isn't removing properties or decreasing the number of media items) let me know - these are just ideas based on experience.
I don't want to start digging into the umbraco source and recompiling since this is clearly an issue that needs to be resolved in the main code base - it's a scalability nightmare waiting to happen.
Happy to discuss over chat mediums or just post here, i'll check it a few times today, i'm available on googletalk:
agrath@gmail.com, and msn:
agrath@orcon.net.nzGareth Evans
Sniper Systems Ltd