I have an interesting dataset that I need to compress while maintaining date contiguity. I have been struggling to formulate logic to resolve the issue. I was able to come up with a grouping logic using a cursor, but the dataset contains more than 5 million rows, and running update queries 5 million times is not an option.

PROBLEM:
If a row's zip code matches the next row's zip code, the rows need to be compressed into a single row using the MIN of the START dates and the MAX of the END dates.
I have created the last column, Grouping, manually to show how the data needs to be organized so I can easily compress it.
You can see rows 6, 7, 8 need to be grouped together, as do rows 21-23, and so on and so forth.

Once compressed, the dataset should return 30 rows, not 38.

ID	ZIP	 START	         END	  Rank   Grouping
500	55106	7/8/2009	9/1/2009	1	1
500	55407	9/2/2009	11/23/2009	2	2
500	55411	11/24/2009	11/29/2009	3	3
500	55407	11/30/2009	12/13/2009	4	4
500	55429	12/14/2009	12/20/2009	5	5
500	55407	12/21/2009	12/22/2009	6	6
500	55407	12/23/2009	1/3/2010	7	6
500	55407	1/4/2010	1/10/2010	8	6
500	55304	1/11/2010	1/13/2010	9	7
500	55407	1/14/2010	1/25/2010	10	8
500	55121	1/26/2010	2/9/2010	11	9
500	55119	2/10/2010	2/10/2010	12	10
500	55428	2/11/2010	2/11/2010	13	11
500	55038	2/12/2010	2/17/2010	14	12
500	55433	2/18/2010	3/8/2010	15	13
500	55406	3/9/2010	3/22/2010	16	14
500	55406	3/23/2010	3/24/2010	17	14
500	55433	3/25/2010	4/22/2010	18	15
500	55415	4/23/2010	6/20/2013	19	16
500	55106	6/21/2013	9/15/2013	20	17
500	55407	9/16/2013	9/19/2013	21	18
500	55407	9/20/2013	10/16/2013	22	18
500	55407	10/17/2013	11/25/2013	23	18
500	55102	11/26/2013	12/12/2013	24	19
500	55130	12/13/2013	1/21/2014	25	20
500	55407	1/22/2014	2/16/2014	26	21
500	55407	2/17/2014	7/10/2014	27	21
500	55125	7/11/2014	7/28/2014	28	22
500	55407	7/29/2014	10/29/2014	29	23
500	55411	10/30/2014	12/2/2014	30	24
500	55407	12/3/2014	7/7/2015	31	25
500	55434	7/8/2015	8/24/2015	32	26
500	55434	8/25/2015	11/3/2015	33	26
500	55130	11/4/2015	8/18/2016	34	27
500	55407	8/19/2016	7/11/2018	35	28
500	55407	7/12/2018	10/1/2018	36	28
500	55408	10/2/2018	10/5/2018	37	29
500	55411	10/6/2018	12/30/9999	38	30
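
The Grouping column above follows a simple rule: start a new group whenever the ZIP differs from the previous row's ZIP (rows already ordered by START). A minimal Python sketch of that rule, using the first nine sample rows (the helper name is illustrative):

```python
# Running count of ZIP changes: equivalent in spirit to a SUM(change_flag)
# OVER (ORDER BY rank) in SQL. Sample ZIPs are rows 1-9 of the table above.
def grouping(zips):
    """Assign a group number that increments whenever the ZIP changes."""
    groups, prev, g = [], None, 0
    for z in zips:
        if z != prev:
            g += 1
        groups.append(g)
        prev = z
    return groups

sample = [55106, 55407, 55411, 55407, 55429, 55407, 55407, 55407, 55304]
print(grouping(sample))  # [1, 2, 3, 4, 5, 6, 6, 6, 7] — rows 6, 7, 8 share group 6
```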


What I have tried:

I have tried using the LEAD and LAG functions to create the grouping column, but with no luck.

Thanks Much!
Posted
Updated 7-May-20 5:29am

Well, LEAD and LAG are the right way, I believe.
So first we create a subquery where, for each row, we fetch the previous row's EndDate alongside its StartDate, and likewise the next row's StartDate alongside its EndDate.
Then we create two more subqueries where we find the start and the end of every island, and create a key for them so we can join them at the end.
Performance will probably suck though. That can be fixed to a certain extent by using indexed temp tables.
SQL
WITH Data AS (
    SELECT  ID
           ,ZIP
           ,StartDate
           ,LAG(EndDate) OVER (PARTITION BY ID,ZIP ORDER BY StartDate,EndDate) AS PreviousEndDate
           ,EndDate
           ,LEAD(StartDate) OVER (PARTITION BY ID,ZIP ORDER BY StartDate,EndDate) AS NextStartDate
    FROM    Test
    )
,IslandStart AS (
    SELECT  ID
           ,ZIP
           ,ROW_NUMBER() OVER (PARTITION BY ID,ZIP ORDER BY StartDate) AS IslandNumber
           ,StartDate
    FROM    Data
    WHERE   DATEDIFF(DAY, PreviousEndDate, StartDate) > 1
        OR  PreviousEndDate IS NULL
    )
,IslandEnd AS (
    SELECT  ID
           ,ZIP
           ,ROW_NUMBER() OVER (PARTITION BY ID,ZIP ORDER BY EndDate) AS IslandNumber
           ,EndDate
    FROM    Data
    WHERE   DATEDIFF(DAY, EndDate, NextStartDate) > 1
        OR  NextStartDate IS NULL
    )
SELECT  s.ID
       ,s.ZIP
       ,s.StartDate
       ,e.EndDate
FROM    IslandStart s
JOIN    IslandEnd e
    ON  s.ID = e.ID
    AND s.ZIP = e.ZIP
    AND s.IslandNumber = e.IslandNumber
ORDER BY ID,ZIP,StartDate,EndDate
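
For comparison, runs of consecutive equal ZIPs can also be keyed with the classic row-number-difference ("gaps and islands") trick, which avoids the self-join. This sketch runs the pattern against SQLite (window functions need SQLite 3.25+) with a few of the sample rows; note that, unlike the LEAD/LAG version above, it does not split a run on a date gap:

```python
import sqlite3

# Row-number-difference alternative: ROW_NUMBER over (ID) minus ROW_NUMBER
# over (ID, ZIP) is constant within every run of consecutive equal ZIPs,
# so the difference serves as a group key. Column names follow the Test table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Test (ID INT, ZIP INT, StartDate TEXT, EndDate TEXT)")
con.executemany("INSERT INTO Test VALUES (?,?,?,?)", [
    (500, 55429, '2009-12-14', '2009-12-20'),
    (500, 55407, '2009-12-21', '2009-12-22'),
    (500, 55407, '2009-12-23', '2010-01-03'),
    (500, 55407, '2010-01-04', '2010-01-10'),
    (500, 55304, '2010-01-11', '2010-01-13'),
])

rows = con.execute("""
    WITH Grouped AS (
        SELECT ID, ZIP, StartDate, EndDate,
               ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StartDate)
             - ROW_NUMBER() OVER (PARTITION BY ID, ZIP ORDER BY StartDate) AS grp
        FROM   Test
    )
    SELECT ID, ZIP, MIN(StartDate), MAX(EndDate)
    FROM   Grouped
    GROUP BY ID, ZIP, grp
    ORDER BY MIN(StartDate)
""").fetchall()
for r in rows:
    print(r)  # three rows: the 55407 run collapses to 2009-12-21 .. 2010-01-10
```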
Comments
sulaif 7-May-20 11:51am    
Amazing!! Thanks so much. Just need to test and see how it performs on 5M rows
Jörgen Andersson 7-May-20 12:00pm    
Badly I suspect.

I also haven't tested it for overlapping ranges. And speaking of those, I see a possible bug: change the ORDER BY for the second ROW_NUMBER to EndDate.
sulaif 7-May-20 12:24pm    
Noted the change.
Performance wasn't bad at all. It took 54 seconds.

Thank You!
Jörgen Andersson 7-May-20 12:42pm    
I suspect that's as good as it gets when you have a self join on a table with 5M rows.
I guess you have an index on ID,ZIP,START,END
sulaif 7-May-20 12:43pm    
Yes, there is an index ID,ZIP,START,END
As I read it, there would be 17 rows in the output, because ZIP code 55407 is repeated 15 times and the other repeated ZIP codes appear twice each.

See:
SQL
SET DATEFORMAT mdy;

DECLARE @tmp TABLE (ID INT, ZIP INT, [START] Date, [END] Date)

INSERT INTO @tmp(ID, ZIP, [START], [END])
VALUES(500, 55106, '7/8/2009', '9/1/2009'),
(500, 55407, '9/2/2009', '11/23/2009'),
(500, 55411, '11/24/2009', '11/29/2009'),
(500, 55407, '11/30/2009', '12/13/2009'),
(500, 55429, '12/14/2009', '12/20/2009'),
(500, 55407, '12/21/2009', '12/22/2009'),
(500, 55407, '12/23/2009', '1/3/2010'),
(500, 55407, '1/4/2010', '1/10/2010'),
(500, 55304, '1/11/2010', '1/13/2010'),
(500, 55407, '1/14/2010', '1/25/2010'),
(500, 55121, '1/26/2010', '2/9/2010'),
(500, 55119, '2/10/2010', '2/10/2010'),
(500, 55428, '2/11/2010', '2/11/2010'),
(500, 55038, '2/12/2010', '2/17/2010'),
(500, 55433, '2/18/2010', '3/8/2010'),
(500, 55406, '3/9/2010', '3/22/2010'),
(500, 55406, '3/23/2010', '3/24/2010'),
(500, 55433, '3/25/2010', '4/22/2010'),
(500, 55415, '4/23/2010', '6/20/2013'),
(500, 55106, '6/21/2013', '9/15/2013'),
(500, 55407, '9/16/2013', '9/19/2013'),
(500, 55407, '9/20/2013', '10/16/2013'),
(500, 55407, '10/17/2013', '11/25/2013'),
(500, 55102, '11/26/2013', '12/12/2013'),
(500, 55130, '12/13/2013', '1/21/2014'),
(500, 55407, '1/22/2014', '2/16/2014'),
(500, 55407, '2/17/2014', '7/10/2014'),
(500, 55125, '7/11/2014', '7/28/2014'),
(500, 55407, '7/29/2014', '10/29/2014'),
(500, 55411, '10/30/2014', '12/2/2014'),
(500, 55407, '12/3/2014', '7/7/2015'),
(500, 55434, '7/8/2015', '8/24/2015'),
(500, 55434, '8/25/2015', '11/3/2015'),
(500, 55130, '11/4/2015', '8/18/2016'),
(500, 55407, '8/19/2016', '7/11/2018'),
(500, 55407, '7/12/2018', '10/1/2018'),
(500, 55408, '10/2/2018', '10/5/2018'),
(500, 55411, '10/6/2018', '12/30/9999')


UPDATE t1 SET [START] = t2.MinStart, 
	[END] =  t2.MaxEnd
FROM @tmp t1 INNER JOIN
(
	SELECT ID, ZIP, MIN([START]) OVER(PARTITION BY ZIP) AS MinStart, MAX([END]) OVER(PARTITION BY ZIP) AS MaxEnd
	FROM @tmp
) t2 ON t1.ID = t2.ID AND t1.ZIP = t2.ZIP


SELECT ID, ZIP, [START], [END], RowNo
FROM 
(
	SELECT *, ROW_NUMBER() OVER(PARTITION BY ID, ZIP ORDER BY [START]) AS RowNo
	FROM @tmp
) T
--WHERE RowNo = 1
--uncomment above line to see 17 rows


Result:
ID	ZIP	START	END	RowNo
500	55038	2010-02-12	2010-02-17	1
500	55102	2013-11-26	2013-12-12	1
500	55106	2009-07-08	2013-09-15	1
500	55106	2009-07-08	2013-09-15	2
500	55119	2010-02-10	2010-02-10	1
500	55121	2010-01-26	2010-02-09	1
500	55125	2014-07-11	2014-07-28	1
500	55130	2013-12-13	2016-08-18	1
500	55130	2013-12-13	2016-08-18	2
500	55304	2010-01-11	2010-01-13	1
500	55406	2010-03-09	2010-03-24	1
500	55406	2010-03-09	2010-03-24	2
500	55407	2009-09-02	2018-10-01	1
500	55407	2009-09-02	2018-10-01	2
500	55407	2009-09-02	2018-10-01	3
500	55407	2009-09-02	2018-10-01	4
500	55407	2009-09-02	2018-10-01	5
500	55407	2009-09-02	2018-10-01	6
500	55407	2009-09-02	2018-10-01	7
500	55407	2009-09-02	2018-10-01	8
500	55407	2009-09-02	2018-10-01	9
500	55407	2009-09-02	2018-10-01	10
500	55407	2009-09-02	2018-10-01	11
500	55407	2009-09-02	2018-10-01	12
500	55407	2009-09-02	2018-10-01	13
500	55407	2009-09-02	2018-10-01	14
500	55407	2009-09-02	2018-10-01	15
500	55408	2018-10-02	2018-10-05	1
500	55411	2009-11-24	9999-12-30	1
500	55411	2009-11-24	9999-12-30	2
500	55411	2009-11-24	9999-12-30	3
500	55415	2010-04-23	2013-06-20	1
500	55428	2010-02-11	2010-02-11	1
500	55429	2009-12-14	2009-12-20	1
500	55433	2010-02-18	2010-04-22	1
500	55433	2010-02-18	2010-04-22	2
500	55434	2015-07-08	2015-11-03	1
500	55434	2015-07-08	2015-11-03	2
Comments
sulaif 6-May-20 10:43am    
Thanks much for your response. Sorry for not being clear the first time. The dates cannot overlap, even if the zip code is listed multiple times. In other words, if a different zip code appears in between, the rows should be grouped separately, so the dates don't overlap while maintaining date contiguity.


Uncompressed Dataset				
ID	ZIP	   START	        END	   Row
500	55106	7/8/09	     9/1/09	    1
500	55407	9/2/09	     11/23/09	2
500	55411	11/24/09     11/29/09	3
500	55407	11/30/09	 12/13/09	4
500	55429	12/14/09	 12/20/09	5
500	55407	12/21/09	 12/22/09	6
500	55407	12/23/09	 1/3/10	    7
500	55407	1/4/10	     1/10/10	8
500	55304	1/11/10	     1/13/10	9
				
Final Compressed Dataset 				
ID	ZIP	    START	    END	      Row
500	55106	7/8/09	    9/1/09	    1
500	55407	9/2/09	    11/23/09	2
500	55411	11/24/09	11/29/09	3
500	55407	11/30/09	12/13/09	4
500	55429	12/14/09	12/20/09	5
500	55407	12/21/09	1/10/10	    6
500	55304	1/11/10	    1/13/10	    7

You can see rows 6, 7, 8 are grouped together for zip code 55407, with START being the MIN and END being the MAX of that run, even though the same zip code appears multiple times elsewhere.
Maciej Los 6-May-20 14:02pm    
Is there any unique value (primary key)?
I believe Row and Grouping are created by the ROW_NUMBER and DENSE_RANK functions.
sulaif 6-May-20 14:05pm    
There is no primary key. Using ID, ZIP, START, and END together can act as a primary key
Maciej Los 6-May-20 14:14pm    
Thanks for the reply. I'll improve my answer ASAP, but right now I'm a bit busy. OK?
sulaif 6-May-20 14:15pm    
No problem at all - Thanks once again

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
