SQL Server - Select Distinct for Only 1 Column

Asked By Greg Collins on 14-Feb-08 06:45 PM
I'm sure this has been discussed many times before, but I've looked around
all over the place and no one has a solution that works.

I've got a table with X number of columns, and I need to sort them and then
return a subset based on distinct values inc ColumnA.

As per the internet sources, I've tried using
SELECT
ColumnA,
MAX(ColumnB) AS ColumnB,
MAX(ColumnC) AS ColumnC
FROM Table
WHERE ColumnC != 'Value'
GROUP BY ColumnA
ORDER BY ColumnA, ColumnB, ColumnC DESC

But contrary to what many state, the MAX function does not return the value
of the particular row returned, but rather some other max... I haven't taken
time to figure out which--just that the returned column does not contain the
right data for the row it is supposedly returning.

So then, is there any solution to this? Seem incredible that something so
basic is not supported by SQL2005!

Any help is welcomed!

--
Greg Collins
Microsoft MVP
Visit Braintrove at http://www.braintrove.com




xyb replied on 16-Feb-08 03:16 AM
On 2=D4=C215=C8=D5, =C9=CF=CE=E77=CA=B145=B7=D6, "Greg Collins" <gcollins_at=

n
e
en
he

SELECT
ColumnA,
ColumnB =3D  (SELECT MAX(ColumnB) FROM Table ),
ColumnC =3D  (SELECT MAX(ColumnC ) FROM Table WHERE ColumnC !=3D
'Value' ),
FROM Table
GROUP BY ColumnA
ORDER BY ColumnA DESC
Plamen Ratchev replied on 14-Feb-08 11:51 PM
A bit unclear on what you are trying to do. See if this helps:

SELECT ColumnA, ColumnB, ColumnC
FROM (
SELECT ColumnA,
ColumnB,
ColumnC,
ROW_NUMBER() OVER(
PARTITION BY ColumnA
ORDER BY ColumnB, ColumnC DESC) AS seq
FROM Table
WHERE ColumnC <> 'Value') AS T
WHERE seq = 1;

HTH,

Plamen Ratchev
http://www.SQLStudio.com
Santhos replied on 15-Feb-08 06:17 AM
Select ColumnA,ColumnB = (Select Max(B.ColumnB) from [Table] B Where
B.ColumnA = A.ColumnA Group By B.ColumnA),
ColumnC = (Select Max(C.ColumnC) from [Table] C Where C.ColumnC <> 'Value'
And C.ColumnA = A.ColumnA Group By C.ColumnA)
From [Table] A Group By ColumnA Order By ColumnA,ColumnB,ColumnC
Greg Collins replied on 19-Feb-08 01:52 PM
This was just the thing. Thank you so much. I don't know why SQL Server
doesn't have built-in functionality (yet) that allows a user to easily
select DISTINCT based on specified columns rather than all columns. It seems
like a very useful add and frequently asked about. Hopefully the next time
around.

Thanks so much!

--
Greg Collins
Microsoft MVP
Visit Braintrove at http://www.braintrove.com
Alex Kuznetsov replied on 21-Feb-08 09:00 PM
There is built in functionality - ROW_NUMBER(). Plamen's solution uses
it.
Greg Collins replied on 20-Feb-08 02:36 PM
You are right, however, I refer to user-friendly functionality--not
something so convoluted as row_number(). Something more like:

SELECT DISTINCT (ColumnB, ColumnF) * FROM Table

Which would select * from table but only rows where columns b and f are
distinct -- or some other more user-friendly syntax.

--
Greg Collins
Microsoft MVP
Visit Braintrove at http://www.braintrove.com
--CELKO-- replied on 21-Feb-08 09:02 PM
SELECT DISTINCT (ColumnB, ColumnF) * FROM Table

Which would select * from table but only rows where columns b and f
are distinct <<

Standard SQL has a UNIQUE() predicate so you can write UNIQUE(col_b)
AND UNIQUE(col_f) or UNIQUE(COL_B, col_f) which are different things.

Let me throw out a little SQL puzzle I saw on a blog.CREATE TABLE
Source
(a INTEGER NOT NULL,
b INTEGER NOT NULL,
PRIMARY KEY (a,b));

INSERT INTO Source VALUES (1, 1);
INSERT INTO Source VALUES (1, 2);
INSERT INTO Source VALUES (2, 3);
INSERT INTO Source VALUES (7, 2);
INSERT INTO Source VALUES (2, 4);
INSERT INTO Source VALUES (5, 5);
INSERT INTO Source VALUES (5, 1);
INSERT INTO Source VALUES (5, 3);
INSERT INTO Source VALUES (9, 0);
INSERT INTO Source VALUES (11, 2);

CREATE TABLE Destination
(a INTEGER NOT NULL UNIQUE,
b INTEGER NOT NULL UNIQUE,
FOREIGN KEY (a,b)
REFERENCES Source(a,b));

How many subsets of rows from Source can be inserted into
Destination?  I am not going to give the authorship until later, but
give it a try without using any procedural code.
Hugo Kornelis replied on 20-Feb-08 05:42 PM
Hi Greg,

So, let's suppose you get to hand out work orders at Microsoft for one
day and you tell the developers to implement this syntax.

Now one of the developers comes into your office and shows you this
table (use fixed font to view as intended):

ColumnB | ColumnF | ColumnX | ColumnY
---------+---------+---------+---------
1    |    1    |    1    |    5
1    |    1    |    3    |    2
1    |    1    |    6    |    4

He says that, based on the DISTINCT (ColumnB, ColumnF), the query should
produce a single row of output. But which row should that be? What
values should be in the ColumnX and ColumnY columns for the single row
returned by the query?

--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
Alex Kuznetsov replied on 21-Feb-08 09:03 PM
Greg,

While I agree that your suggestion may look intuitive, it is
redundant. IMO redundancy is one of the biggest problems with SQL, so
adding more of it makes little sense to me.

Cheers,
AK
SQL Server MVP
Alex Kuznetsov replied on 21-Feb-08 09:03 PM
Hi Hugo,

While I agree that your points are valid, same objections might be
raised against implementing SET ROWCOUNT 1, and/or SELECT TOP 1. Yet
they exist. What do you think?
David Portas replied on 21-Feb-08 09:03 PM
The same objections definitely apply. Microsoft have already
deprecated ROWCOUNT. TOP was also a big mistake because it lacks the
requirement to include a sort specification but is stupidly linked to
an ORDER BY clause whenever one is included.

--
David Portas
Greg Collins replied on 21-Feb-08 11:33 AM
Actually, I thought of a few other syntaxes that would be more intuitive
than what is currently offered. In all cases, ordering would need to come
before distinction. This first one is my preference so far:

1. Just add DISTINCT into the WHERE clase:

SELECT * FROM Table
WHERE ColumnA = 'Value' AND DISTINCT ColumnB, ColumnE AND DISTINCT ColumnG

This would allow you to include DISTINCT right into the WHERE clause and you
could group, not group, combine, etc. So in this example, ColumnB and
ColumnE have to together be distinct, and then along with that, ColumnG has
to be distinct. Rather than AND, you could also specify OR, or NOT, etc.

- - - - -

2. Specify DISTINCT on the specific column in the SELECT:

SELECT ColumnA, DISTINCT ColumnB, ColumnC, DISTINCT ColumnD, ColumnE FROM
Table

This allows you to specify one or more columns that must be distinct,
however has the inability to group columns for distinctness together...
though you could maybe do somehting like DISTINCT (ColumnB, ColumnD)... but
that doesn't seem to work very well.

3. DISTINCT ON:

SELECT * FROM Table
DISTINCT ON ColumnA, ColumnF

Again, there's the issue of grouped distinction.

--
Greg Collins
Microsoft MVP
Visit Braintrove at http://www.braintrove.com
Visit InfoPathDev at http://www.infopathdev.com
Greg Collins replied on 21-Feb-08 11:37 AM
That would all depend on ORDER BY (if any).

Without any ordering, you'd take the first row 1,1,1,5.

If you did ORDER BY ColumnX DESC, you'd get 1,1,6,4

If you did WHERE ColumnY < 3, you'd of course get 1,1,3,2

If you did ORDER BY ColumnY, you'd get 1,1,3,2

--
Greg Collins
Microsoft MVP
Visit Braintrove at http://www.braintrove.com
Visit InfoPathDev at http://www.infopathdev.com
Alex Kuznetsov replied on 21-Feb-08 09:04 PM
yes well that is easy to fix:

SELECT TOP 1 OVER(PARTITION BY <COLUMN list> ORDER BY <another column
list>)
Hugo Kornelis replied on 21-Feb-08 06:05 PM
(snip)

Hi Alex,

I guess I pretty much agree with David here. SET ROWCOUNT is deprecated.
TOP with ORDER BY does produce reproducable results, so even though the
syntax is very unfortunate (they should instead have chose to implement
it as TOP() OVER (ORDER BY ...) ), I have no problems with the TOP
operator itself. TOP without ORDER BY is an abomination - just think
what can happen if you use an automated test program and the optimizer
decides to try a different plan - thus returning a different row!

--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
Hugo Kornelis replied on 21-Feb-08 06:20 PM
Hi Greg,



Ah, but there's the rub. If there's no ORDER BY, they how do you decide
that 1,1,1,5 is the first row?


Starting with SQL Server, you can also get that by using

WITH CTE AS
(SELECT ColumnB, ColumnF, ColumnX, ColumnY,
ROW_NUMBER() OVER (PARTITION BY ColumnB, ColumnF
ORDER BY ColumnX DESC) AS rn
FROM   TheTable)
SELECT ColumnB, ColumnF, ColumnX, ColumnY
FROM   CTE
WHERE  rn = 1;

Maybe a tad longer, but still pretty short, easy to understand and
maintain, and it's ANSI compliant. I fail to see why the way DISTINCT
works should be change from standard to non-standard just to duplicate
some existing functionality.


Even without adding DISTINCT to the query... :)


See above.

--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
--CELKO-- replied on 21-Feb-08 09:04 PM
it as TOP() OVER (ORDER BY ...) ) <<

Nope, that would not work.  The Window functions are scalar functions
and return a single value.  TOP(n) returns a set, so it belongs on a
cursor, just like the ORDER BY clause. You usually see this with a
keyword like "LIMIT n ROWS" in the vendor extensions of other SQLs.

What you can write in a full implementation is "MAX(<exp>) OVER (..)"
which is legal, but weird.
Greg Collins replied on 21-Feb-08 07:03 PM
I guess my assumption is that if I don't use ORDER BY then I could care less
what that first row is. Maybe that assumption is invalid at times, and at
other times is valid. I don't pretend to understand the inner workings of
the SQL engine... so I must assume your accuracy in stating that without
specifically ordering you might get different results as to what the first
row is each time. In that case your point is well taken.

--
Greg Collins
Microsoft MVP
Visit Braintrove at http://www.braintrove.com
Visit InfoPathDev at http://www.infopathdev.com
Alex Kuznetsov replied on 23-Feb-08 03:41 PM
On Feb 21, 5:05 pm, Hugo Kornelis

Hi Hugo,

my point is consistency - there are too many inconsistent redundancies
and discrepancies in the product already.
If you apply some logic in one case, the same logic should apply in
other simlar ones. We already have too much discrepancies like
@@SERVERNAME and DB_NAME(), no need to add more.


With all due respect,
1. ROW_NUMBER() without ORDER BY is part of the standard, DB@ and
Oracle support it all right.
Being consistent, TOP without ORDER BY is just as good.
2. A good automated test will use an intelligent success criteria,
such as "the returned row exists in the database and meets the
criteria". As such, a good automated test should and will succeed.
Hugo Kornelis replied on 22-Feb-08 06:32 PM
Hi Greg,

Well, that is what you currently get with TOP(n) if you do not specify
any ORDER BY clause.

--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis