I have an interesting dataset that I need to compress while maintaining date contiguity. I have been struggling to formulate logic to resolve the issue. I was able to come up with a grouping logic using a cursor, but the dataset contains more than 5 million rows, and running update queries 5 million times is not an option.

PROBLEM:
If a row's zip code matches the next row's zip code, the rows need to be compressed into a single row using the MIN of the START dates and the MAX of the END dates.
I have created the last column, Grouping, manually to show how the data needs to be organized so I can easily compress it.
You can see rows 6, 7, 8 need to be grouped together, as do rows 21-23, and so on and so forth.

Once compressed, the dataset should return 30 rows, not 38.

ID	ZIP	 START	         END	  Rank   Grouping
500	55106	7/8/2009	9/1/2009	1	1
500	55407	9/2/2009	11/23/2009	2	2
500	55411	11/24/2009	11/29/2009	3	3
500	55407	11/30/2009	12/13/2009	4	4
500	55429	12/14/2009	12/20/2009	5	5
500	55407	12/21/2009	12/22/2009	6	6
500	55407	12/23/2009	1/3/2010	7	6
500	55407	1/4/2010	1/10/2010	8	6
500	55304	1/11/2010	1/13/2010	9	7
500	55407	1/14/2010	1/25/2010	10	8
500	55121	1/26/2010	2/9/2010	11	9
500	55119	2/10/2010	2/10/2010	12	10
500	55428	2/11/2010	2/11/2010	13	11
500	55038	2/12/2010	2/17/2010	14	12
500	55433	2/18/2010	3/8/2010	15	13
500	55406	3/9/2010	3/22/2010	16	14
500	55406	3/23/2010	3/24/2010	17	14
500	55433	3/25/2010	4/22/2010	18	15
500	55415	4/23/2010	6/20/2013	19	16
500	55106	6/21/2013	9/15/2013	20	17
500	55407	9/16/2013	9/19/2013	21	18
500	55407	9/20/2013	10/16/2013	22	18
500	55407	10/17/2013	11/25/2013	23	18
500	55102	11/26/2013	12/12/2013	24	19
500	55130	12/13/2013	1/21/2014	25	20
500	55407	1/22/2014	2/16/2014	26	21
500	55407	2/17/2014	7/10/2014	27	21
500	55125	7/11/2014	7/28/2014	28	22
500	55407	7/29/2014	10/29/2014	29	23
500	55411	10/30/2014	12/2/2014	30	24
500	55407	12/3/2014	7/7/2015	31	25
500	55434	7/8/2015	8/24/2015	32	26
500	55434	8/25/2015	11/3/2015	33	26
500	55130	11/4/2015	8/18/2016	34	27
500	55407	8/19/2016	7/11/2018	35	28
500	55407	7/12/2018	10/1/2018	36	28
500	55408	10/2/2018	10/5/2018	37	29
500	55411	10/6/2018	12/30/9999	38	30
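
The Grouping column above follows a simple rule: start a new group whenever the ZIP differs from the previous row's ZIP (rows already ordered by START). A minimal Python sketch of that rule, using the first nine sample rows (the helper name is illustrative):

```python
# Running count of ZIP changes: equivalent in spirit to a SUM(change_flag)
# OVER (ORDER BY rank) in SQL. Sample ZIPs are rows 1-9 of the table above.
def grouping(zips):
    """Assign a group number that increments whenever the ZIP changes."""
    groups, prev, g = [], None, 0
    for z in zips:
        if z != prev:
            g += 1
        groups.append(g)
        prev = z
    return groups

sample = [55106, 55407, 55411, 55407, 55429, 55407, 55407, 55407, 55304]
print(grouping(sample))  # [1, 2, 3, 4, 5, 6, 6, 6, 7] — rows 6, 7, 8 share group 6
```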


What I have tried:

I have tried using the LEAD and LAG functions to create the grouping column, but with no luck.

Thanks Much!
Posted
Updated 7-May-20 5:29am

Well, LEAD and LAG are the right way, I believe.
So first we create a subquery where, for each row, we fetch the previous row's EndDate alongside its StartDate, and likewise the next row's StartDate alongside its EndDate.
Then we create two more subqueries where we find the start and the end of every island, and create a key for them so we can join them at the end.
Performance will probably suck though. That can be fixed to a certain extent by using indexed temp tables.
SQL
WITH Data AS (
    SELECT  ID
           ,ZIP
           ,StartDate
           ,LAG(EndDate) OVER (PARTITION BY ID,ZIP ORDER BY StartDate,EndDate) AS PreviousEndDate
           ,EndDate
           ,LEAD(StartDate) OVER (PARTITION BY ID,ZIP ORDER BY StartDate,EndDate) AS NextStartDate
    FROM    Test
    )
,IslandStart AS (
    SELECT  ID
           ,ZIP
           ,ROW_NUMBER() OVER (PARTITION BY ID,ZIP ORDER BY StartDate) AS IslandNumber
           ,StartDate
    FROM    Data
    WHERE   DATEDIFF(DAY, PreviousEndDate, StartDate) > 1
        OR  PreviousEndDate IS NULL
    )
,IslandEnd AS (
    SELECT  ID
           ,ZIP
           ,ROW_NUMBER() OVER (PARTITION BY ID,ZIP ORDER BY EndDate) AS IslandNumber
           ,EndDate
    FROM    Data
    WHERE   DATEDIFF(DAY, EndDate, NextStartDate) > 1
        OR  NextStartDate IS NULL
    )
SELECT  s.ID
       ,s.ZIP
       ,s.StartDate
       ,e.EndDate
FROM    IslandStart s
JOIN    IslandEnd e
    ON  s.ID = e.ID
    AND s.ZIP = e.ZIP
    AND s.IslandNumber = e.IslandNumber
ORDER BY ID,ZIP,StartDate,EndDate
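
For comparison, runs of consecutive equal ZIPs can also be keyed with the classic row-number-difference ("gaps and islands") trick, which avoids the self-join. This sketch runs the pattern against SQLite (window functions need SQLite 3.25+) with a few of the sample rows; note that, unlike the LEAD/LAG version above, it does not split a run on a date gap:

```python
import sqlite3

# Row-number-difference alternative: ROW_NUMBER over (ID) minus ROW_NUMBER
# over (ID, ZIP) is constant within every run of consecutive equal ZIPs,
# so the difference serves as a group key. Column names follow the Test table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Test (ID INT, ZIP INT, StartDate TEXT, EndDate TEXT)")
con.executemany("INSERT INTO Test VALUES (?,?,?,?)", [
    (500, 55429, '2009-12-14', '2009-12-20'),
    (500, 55407, '2009-12-21', '2009-12-22'),
    (500, 55407, '2009-12-23', '2010-01-03'),
    (500, 55407, '2010-01-04', '2010-01-10'),
    (500, 55304, '2010-01-11', '2010-01-13'),
])

rows = con.execute("""
    WITH Grouped AS (
        SELECT ID, ZIP, StartDate, EndDate,
               ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StartDate)
             - ROW_NUMBER() OVER (PARTITION BY ID, ZIP ORDER BY StartDate) AS grp
        FROM   Test
    )
    SELECT ID, ZIP, MIN(StartDate), MAX(EndDate)
    FROM   Grouped
    GROUP BY ID, ZIP, grp
    ORDER BY MIN(StartDate)
""").fetchall()
for r in rows:
    print(r)  # three rows: the 55407 run collapses to 2009-12-21 .. 2010-01-10
```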
Comments
sulaif 7-May-20 11:51am    
Amazing!! Thanks so much. Just need to test and see how it performs on 5M rows
Jörgen Andersson 7-May-20 12:00pm    
Badly I suspect.

I also haven't tested it for overlapping ranges. And speaking of those, I see a possible bug: change the ORDER BY for the second ROW_NUMBER to EndDate.
sulaif 7-May-20 12:24pm    
Noted the change.
Performance wasn't bad at all. It took 54 seconds.

Thank You!
Jörgen Andersson 7-May-20 12:42pm    
I suspect that's as good as it gets when you have a self join on a table with 5M rows.
I guess you have an index on ID,ZIP,START,END
sulaif 7-May-20 12:43pm    
Yes, there is an index ID,ZIP,START,END
As I read it, there would be 17 rows in the output, because ZIP code 55407 is repeated 15 times and the other repeated ZIP codes appear twice each.

See:
SQL
SET DATEFORMAT mdy;

DECLARE @tmp TABLE (ID INT, ZIP INT, [START] Date, [END] Date)

INSERT INTO @tmp(ID, ZIP, [START], [END])
VALUES(500, 55106, '7/8/2009', '9/1/2009'),
(500, 55407, '9/2/2009', '11/23/2009'),
(500, 55411, '11/24/2009', '11/29/2009'),
(500, 55407, '11/30/2009', '12/13/2009'),
(500, 55429, '12/14/2009', '12/20/2009'),
(500, 55407, '12/21/2009', '12/22/2009'),
(500, 55407, '12/23/2009', '1/3/2010'),
(500, 55407, '1/4/2010', '1/10/2010'),
(500, 55304, '1/11/2010', '1/13/2010'),
(500, 55407, '1/14/2010', '1/25/2010'),
(500, 55121, '1/26/2010', '2/9/2010'),
(500, 55119, '2/10/2010', '2/10/2010'),
(500, 55428, '2/11/2010', '2/11/2010'),
(500, 55038, '2/12/2010', '2/17/2010'),
(500, 55433, '2/18/2010', '3/8/2010'),
(500, 55406, '3/9/2010', '3/22/2010'),
(500, 55406, '3/23/2010', '3/24/2010'),
(500, 55433, '3/25/2010', '4/22/2010'),
(500, 55415, '4/23/2010', '6/20/2013'),
(500, 55106, '6/21/2013', '9/15/2013'),
(500, 55407, '9/16/2013', '9/19/2013'),
(500, 55407, '9/20/2013', '10/16/2013'),
(500, 55407, '10/17/2013', '11/25/2013'),
(500, 55102, '11/26/2013', '12/12/2013'),
(500, 55130, '12/13/2013', '1/21/2014'),
(500, 55407, '1/22/2014', '2/16/2014'),
(500, 55407, '2/17/2014', '7/10/2014'),
(500, 55125, '7/11/2014', '7/28/2014'),
(500, 55407, '7/29/2014', '10/29/2014'),
(500, 55411, '10/30/2014', '12/2/2014'),
(500, 55407, '12/3/2014', '7/7/2015'),
(500, 55434, '7/8/2015', '8/24/2015'),
(500, 55434, '8/25/2015', '11/3/2015'),
(500, 55130, '11/4/2015', '8/18/2016'),
(500, 55407, '8/19/2016', '7/11/2018'),
(500, 55407, '7/12/2018', '10/1/2018'),
(500, 55408, '10/2/2018', '10/5/2018'),
(500, 55411, '10/6/2018', '12/30/9999')


UPDATE t1 SET [START] = t2.MinStart, 
	[END] =  t2.MaxEnd
FROM @tmp t1 INNER JOIN
(
	SELECT ID, ZIP, MIN([START]) OVER(PARTITION BY ZIP) AS MinStart, MAX([END]) OVER(PARTITION BY ZIP) AS MaxEnd
	FROM @tmp
) t2 ON t1.ID = t2.ID AND t1.ZIP = t2.ZIP


SELECT ID, ZIP, [START], [END], RowNo
FROM 
(
	SELECT *, ROW_NUMBER() OVER(PARTITION BY ID, ZIP ORDER BY [START]) AS RowNo
	FROM @tmp
) T
--WHERE RowNo = 1
--uncomment above line to see 17 rows


Result:
ID	ZIP	START	END	RowNo
500	55038	2010-02-12	2010-02-17	1
500	55102	2013-11-26	2013-12-12	1
500	55106	2009-07-08	2013-09-15	1
500	55106	2009-07-08	2013-09-15	2
500	55119	2010-02-10	2010-02-10	1
500	55121	2010-01-26	2010-02-09	1
500	55125	2014-07-11	2014-07-28	1
500	55130	2013-12-13	2016-08-18	1
500	55130	2013-12-13	2016-08-18	2
500	55304	2010-01-11	2010-01-13	1
500	55406	2010-03-09	2010-03-24	1
500	55406	2010-03-09	2010-03-24	2
500	55407	2009-09-02	2018-10-01	1
500	55407	2009-09-02	2018-10-01	2
500	55407	2009-09-02	2018-10-01	3
500	55407	2009-09-02	2018-10-01	4
500	55407	2009-09-02	2018-10-01	5
500	55407	2009-09-02	2018-10-01	6
500	55407	2009-09-02	2018-10-01	7
500	55407	2009-09-02	2018-10-01	8
500	55407	2009-09-02	2018-10-01	9
500	55407	2009-09-02	2018-10-01	10
500	55407	2009-09-02	2018-10-01	11
500	55407	2009-09-02	2018-10-01	12
500	55407	2009-09-02	2018-10-01	13
500	55407	2009-09-02	2018-10-01	14
500	55407	2009-09-02	2018-10-01	15
500	55408	2018-10-02	2018-10-05	1
500	55411	2009-11-24	9999-12-30	1
500	55411	2009-11-24	9999-12-30	2
500	55411	2009-11-24	9999-12-30	3
500	55415	2010-04-23	2013-06-20	1
500	55428	2010-02-11	2010-02-11	1
500	55429	2009-12-14	2009-12-20	1
500	55433	2010-02-18	2010-04-22	1
500	55433	2010-02-18	2010-04-22	2
500	55434	2015-07-08	2015-11-03	1
500	55434	2015-07-08	2015-11-03	2
Comments
sulaif 6-May-20 10:43am    
Thanks much for your response. Sorry for not being clear the first time. The dates cannot overlap, even if the zip code is listed multiple times. In other words, if a different zip code appears in between, the rows should be grouped separately, so the dates don't overlap while maintaining date contiguity.


Uncompressed Dataset				
ID	ZIP	   START	        END	   Row
500	55106	7/8/09	     9/1/09	    1
500	55407	9/2/09	     11/23/09	2
500	55411	11/24/09     11/29/09	3
500	55407	11/30/09	 12/13/09	4
500	55429	12/14/09	 12/20/09	5
500	55407	12/21/09	 12/22/09	6
500	55407	12/23/09	 1/3/10	    7
500	55407	1/4/10	     1/10/10	8
500	55304	1/11/10	     1/13/10	9
				
Final Compressed Dataset 				
ID	ZIP	    START	    END	      Row
500	55106	7/8/09	    9/1/09	    1
500	55407	9/2/09	    11/23/09	2
500	55411	11/24/09	11/29/09	3
500	55407	11/30/09	12/13/09	4
500	55429	12/14/09	12/20/09	5
500	55407	12/21/09	1/10/10	    6
500	55304	1/11/10	    1/13/10	    7

You can see rows 6, 7, 8 are grouped together for zip code 55407, with START being the MIN and END being the MAX of that run, even though the same zip code appears multiple times elsewhere.
Maciej Los 6-May-20 14:02pm    
Is there any unique value (primary key)?
I believe Row and Grouping are created by the ROW_NUMBER and DENSE_RANK functions.
sulaif 6-May-20 14:05pm    
There is no primary key. Using ID, ZIP, START, and END together can act as a primary key
Maciej Los 6-May-20 14:14pm    
Thanks for the reply. I'll improve my answer ASAP, but right now I'm a bit busy. OK?
sulaif 6-May-20 14:15pm    
No problem at all - Thanks once again

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
