Click here to Skip to main content
15,886,199 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I've two tables t1 and t2.
From t1:-
total records sel_col-distinct
28564502 12257389

From t2:-
total records sel_col-distinct
11217091 10890077

I am executing the below queries:-
SQL
select count( distinct t1.sel_col ) distinct, count(sel_col) total
from 
t1(nolock) join t2(nolock)
on t1.sel_col = CONVERT(VARBINARY(max),t2.sel_col,2)
distinct total
10600479 25666519
time:-00:50 mins

SQL
SELECT count(distinct t1.sel_col) total, count(t1.sel_col) total FROM 
t1(nolock)
where t1.sel_col in 
(select CONVERT(VARBINARY(max),t2.sel_col,2) from t2(nolock))
distinct total
10600479 24835271
time:-12:54mins

The distinct value is coming same, but there is difference between total. The 2nd query is taking huge time. Is there any way to get the desired result? Also why there is so much difference in the total count?

Thanks in advance
Posted
Updated 2-Jan-14 10:15am
v2
Comments
chaau 2-Jan-14 15:56pm    
I guess you have rows in t2 with duplicate sel_col

It's because the join has the ability to multiply rows. Look at the following tables:

TableA
Cust    Value 
1        1
1        2
2        1

TableB
Cust
1
1
2


When you join the two you get

a.Cust b.Cust Value
1      1      1
1      1      1
1      1      2
1      1      2
2      2      1


So if you count these values you get 5 where are using the in you get that same dataset as TableA so the count is 3.

If you run this SQL:

SQL
SELECT SUM(COUNT(*) - 1) diff, COUNT(*) total, sel_col FROM t2 GROUP BY sel_col HAVING COUNT(*) > 1


If this returns any results it means your count is being multiplied by the join. The diff column in the result will equal the difference between the two results.

If you need a join with the same result as the in, try the following.

SQL
SELECT COUNT(t1.sel_col) FROM t1 (NOLOCK) INNER JOIN (SELECT DISTINCT sel_col FROM t2 (NOLOCK)) t2


Joining to the sub query when the sub query is distinct will reduce duplicates and the multiplying effect of the join.
 
Share this answer
 
Comments
Christian Graus 2-Jan-14 17:28pm    
Cool - thanks for taking the time to make this clearer than I did. Sadly, the OP does not appear to be replying.
Member 10499886 5-Jan-14 0:36am    
Thanks a lot for such a detailed explanation.
The reason they take different times is that the second uses a subquery, which is slow. Comparing varbinary(MAX) sounds like you're comparing files, which is always slow. Consider storing a hash of the data in your varbinary, for faster comparisons. You can then do a deeper comparison where the hashes match.

You are throwing the DISTINCT keyword about a bit. This often means that your SQL has other issues which you are hiding.

Are you saying the second result is correct ? What is the table structure ?

OK, I see now. You have put the totals there. Your first select is the same, it's a distinct. The second, is not. If you have the same value twice in t2, an inner join will create 2 copies of that row in your results, even if it's in t1, once. Doing a subquery, is not a join, so it would not create duplicate rows. This is, I assume, why you're using DISTINCT. Instead, you should fix your database structure.
 
Share this answer
 
v2
Comments
Member 10499886 5-Jan-14 0:39am    
Thanks Christian for your solution. And yes there is issue with DB structure, in future I'll fix it.
Christian Graus 5-Jan-14 0:41am    
I explain the issue in more depth in my latest article : http://www.codeproject.com/Articles/705911/SQL-Wizardry-Part-Five-when-not-to-use-DISTINCT

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900