Friday, February 24, 2012

Limitation for UDAs

When user-defined aggregates are discussed, there are two usage scenarios that are often mentioned. One is calculating the median value of a set of numbers, and the other is concatenating a set of strings into a comma-separated list.
However, I can not see how it would be possible to create an aggregate function for either of these problems. Because MaxSizeBytes can never be more than 8000 bytes these functions would only be able to operate on a limited number of values, since all values need to be kept for the entire execution. I have added this as a suggestion at the Product Feedback centre but have not yet received any feedback.
Or am I just missing something here, something that I have misunderstood?You aren't missing anything. For the reason you point out, neither of these examples are particularly good. (Of course, string concatenation won't work for more reasons than the size limit: we also can't guarantee the order of the concatenation right now.) This may change in future versions of SQL Server, but for now it's a limitation we're stuck with.

There are cases where UDAs could be used, but you're right that they cannot be aggregates which need to store all of the inputs until they're done. One example would be to find the bitwise 'or' or 'and' of a set of bitfields.

Incidently, the median of a set of items can be computed using a single T-SQL query. Assume you have a table 'sales' with the schema (city int, number int), you can find the median value of the 'number's for each 'city' by using the rank() function as follows:

select middles.city, avg(convert(float, number)) as median
from (
select city, floor((count(*) + 1) / 2.0) as lowmiddle, ceiling((count(*) + 1) / 2.0) as highmiddle
from sales
group by city
) as middles
join (
select city, number, rank() over (partition by city order by number) as rank
from sales
) as ranks
on (middles.city = ranks.city) and
((middles.lowmiddle = ranks.rank) or (middles.highmiddle = ranks.rank))
group by middles.city
Cheers,
- Isaac

|||Isaac, thanks for the answer and the nice query for median. It is definitely a lot better than the alternatives that are available in earlier versions of SQL Server. It would still not be able to match an aggregate in performance though.
Regarding string concatenation I generally agree it is not a particularly useful operation. But if it is to be done I strongly feel that it should be implemented just the way it would be today, apart from the size limitation. Guaranteeing any specific order of the values would be undesirable and confusing behaviour, unless the specific order is specified by the query executing the aggregate. In fact, if size was unlimited it would of course be possible for the UDA implementor to sort the rows before returning, if they wanted a specific order.|||

Regarding order, what would be desirable is for the query writer to be able to specify the order in which items were input to an order-sensitive UDA. One can imagine cases in which order would be significant, but in which the state does not have to be maintained in the UDA if the input is guaranteed to be ordered. This could be a win.

Cheers,
-Isaac

No comments:

Post a Comment