Archive | Partitioning RSS feed for this section

Automatic Data Purging With Partition Switching

16 Mar

In a previous article I showed a basic example of Partition Switching, using animal names (http://wp.me/p3Vxvi-2u). While not a realistic example for everyday use it did get the principles across. However, certainly in my experience, partitioning is normally dependent upon a number – usually a date.

In this article I’ll create a partition process that is used to keep data for three days (based upon a date within the row) and then remove the data automatically, via a job.

Firstly I require a database for this demonstration.

CREATE DATABASE PartitionTest;
GO

Now I need a Partition Function. Because there can only ever be 3 possible values with this calculation, I can be a little lazy and just specify two of the values. If I just write the function for values of 1 and 2, then 3 will always go into the third partition, normally used for ‘invalid’ value ranges. I can’t have any invalid values (provided it has been coded and tested properly), so use the third partition for the ‘normal’ work instead of out-of-range values.

CREATE PARTITION FUNCTION [pf_Retention](tinyint) AS RANGE LEFT FOR VALUES (1, 2)
GO

The associated Partition Scheme will have three partitions. Values of 1 will go to the first in the list, values of 2 will go to the second in the list and all others will go to the third. There are only ever going to be three values for this example, so I won’t worry about a fourth for invalid values.

CREATE PARTITION SCHEME [ps_Retention] AS PARTITION [pf_Retention] TO ([PRIMARY], [PRIMARY], [PRIMARY])
GO

Having created these I can now create the table that will use this partition Scheme:

CREATE TABLE UserAccess(
    UserAccessID		INT IDENTITY(1,1),
    UserID				INT,
    LoginDateTime       DATETIME,
    LogoutDateTime		DATETIME,
    PartitionNumber		AS (CONVERT([tinyint],abs(datediff(day,CONVERT([date],'20000101',(112)),LogoutDateTime)%(3)+(1)),(0))) PERSISTED NOT NULL,
CONSTRAINT [PK_Petdetails] PRIMARY KEY CLUSTERED
(
    UserAccessID		ASC,
    PartitionNumber     ASC
) WITH (IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE) ON ps_Retention(PartitionNumber)
) ON ps_Retention(PartitionNumber)  ; 

GO 

CREATE TABLE UserAccess_PartitionSwitch(
    UserAccessID		INT IDENTITY(1,1),
    UserID				INT,
    LoginDateTime       DATETIME,
    LogoutDateTime		DATETIME,
    PartitionNumber		AS (CONVERT([tinyint],abs(datediff(day,CONVERT([date],'20000101',(112)),LogoutDateTime)%(3)+(1)),(0))) PERSISTED NOT NULL,
CONSTRAINT [PK_Petdetails_PartitionSwitch] PRIMARY KEY CLUSTERED
(
    UserAccessID		ASC,
    PartitionNumber     ASC
) WITH (IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE) ON ps_Retention(PartitionNumber)
) ON ps_Retention(PartitionNumber)  ; 

GO

Notice that the PartitionNumber column is a calculated, persisted column. This value is based upon the LogoutDatetime, so will be kept for three days from that Datetime value.
At first glance the calculation for the Partition Number can look confusing:

(CONVERT([tinyint],abs(datediff(day,CONVERT([date],'20000101',(112)),LogoutDateTime)%(3)+(1)),(0)))

So let’s start with the basics and build up from there. Substitute the ‘LogoutDateTime’ with an ISO standard date, to help show what is happening.

Using a Base Date (in this case the 1st of January 2000) calculate the number of days between then and 20150310. Of course you need to set a Base date that you will suit your purposes. In this case should I ever have a date in 1999 or earlier then it will not work as expected.

The 112 enforces ISO standard date formatting:

SELECT datediff(day,CONVERT([date],'20000101',(112)),'20150310') --112 is ISO standard, to avoid any confusion

As I’m using a 3 day retention scheme, use a modulo of 3, which will result in 0, 1 or 2. The abs function is just there to ensure no negative values, an unlikely event for this example:

SELECT abs(datediff(day,CONVERT([date],'20000101',(112)),'20150310')%(3))

Because a Partition number of 0 is invalid add 1, so making the only possible values 1, 2 or 3:

SELECT abs(datediff(day,CONVERT([date],'20000101',(112)),'20150310')%(3)+(1))

As it is such a small number, convert it to a TinyInt:

SELECT (CONVERT([tinyint],abs(datediff(day,CONVERT([date],'20000101',(112)),'20150310')%(3)+(1)),(0)))

To make sure the partition function is working, add a few rows of test data:

USE [PartitionTest]
GO

INSERT INTO [dbo].[UserAccess]
           ([UserID]
           ,[LoginDateTime]
           ,[LogoutDateTime])
VALUES     (1
           ,'20150308 01:00:00'
           ,'20150308 02:00:00'),
		    (2
           ,'20150309 01:00:00'
           ,'20150309 02:00:00'),
		    (3
           ,'20150310 01:00:00'
           ,'20150310 02:00:00'),
		    (4
           ,'20150311 01:00:00'
           ,'20150311 02:00:00'),
		    (5
           ,'20150312 01:00:00'
           ,'20150312 02:00:00'),
		    (6
           ,'20150313 01:00:00'
           ,'20150313 02:00:00'),
		    (7
           ,'20150314 01:00:00'
           ,'20150314 02:00:00')

And then check the result – look at the table contents and at how the data is spread across the actual partitions:
Partition_01
From these result it can be shown that the PartitionNumber column is never outside of the range 1, 2, 3 and the partitions themselves match the data spread indicated by the value of PartitionID.
Now to test that the Partition Switching will actually work – ensuring that the two tables have been set up correctly:
Partition_02
The three rows in Partition 2 have been moved to the Partition_Switch table.
So, I know that the structure of the tables is correct and that we can switch partitions. The next step is to create a routine that will perform this action automatically.
Our table is designed to keep data for three days, and to do this successfully we need to make sure that the next partition we are going to need (tomorrow’s partition) is empty before that day starts. So just before midnight we need to remove the data that is about to become four days old.

A Stored Procedure is now required – ‘dbo.Purge_Table’, which ascertains the partition number that is now three days old and purges it (after some checks to make sure it should), because that will be the partition required tomorrow.

CREATE PROCEDURE [dbo].[Purge_Table] AS
    SET NOCOUNT ON 

    DECLARE @DaysToKeep				TINYINT,
        @PurgePartition			TINYINT,
        @TotalPartitions		TINYINT,
        @Today					DATETIME,
        @MaxDateInPartition		DATETIME,
        @DateToPurge			DATETIME,
        @SQLCode				VARCHAR(500) 

    SET @TotalPartitions = 3;
	SET @DaysToKeep = 2; 

    SET @Today = CURRENT_TIMESTAMP
    SET @DateToPurge = DATEADD (dd, (@DaysToKeep * -1), @Today)  --Includes today. We keep today, we keep yesterday but the day before yesterday needs to go 

    SET @PurgePartition = 0     

	SET @PurgePartition = (CONVERT(tinyint,ABS(datediff(day,CONVERT([date],'20000101',(112)),@DateToPurge)%(@TotalPartitions)+(1)),(0))) --Same formula as used within the table, but
																																		 --using the @DateToPurge, to find what partition
																																		 --that date will be in.

    SET @PurgePartition = $Partition.pf_Retention (@PurgePartition)   

	-----------------------------------------------------------------------------------------------
    ---- PURGE TABLE
    -----------------------------------------------------------------------------------------------

	--Get the latest LogOutDateTime from the partition that is going to be purged.
    SELECT @MaxDateInPartition = MAX (LogOutDateTime)
     FROM dbo.UserAccess
     WHERE $Partition.pf_Retention (PartitionNumber) = @PurgePartition 

    IF EXISTS (SELECT 1 FROM dbo.UserAccess
                WHERE PartitionNumber = @PurgePartition) --Check there is data for this partition
		BEGIN
			IF DATEDIFF (dd, @MaxDateInPartition, @Today) >= @DaysToKeep OR @MaxDateInPartition IS NULL --Check the date within this partition is old enough to be removed
			BEGIN
				SELECT @SQLCode = 'ALTER TABLE dbo.UserAccess SWITCH PARTITION ' +
					CAST (@PurgePartition AS varchar) +
					' TO dbo.UserAccess_PartitionSwitch PARTITION ' +
					CAST (@PurgePartition AS varchar) 

				EXEC (@SQLCode) 

				TRUNCATE TABLE dbo.UserAccess_PartitionSwitch	--Having switched the data over, remove it 

				RAISERROR ('The dbo.UserAccess Table has been successfully Purged.', 10, 1)
			END
			ELSE
				RAISERROR ('OPERATION CANCELLED. Incorrect Data would be purged from dbo.UserAccess',16, 1)
		END
	ELSE
		RAISERROR ('dbo.UserAccess - partition has no data', 10, 1)
GO

Having cleared the table (and PartitionSwitch table) I input three rows, with today, yesterday and the day before as the logOutDateTime:
Partition_03
Now execute the Stored Procedure and the row in partition 2 should be switched to the PartitionSwitch table and removed (based on the execution date of 20150316).
Partition_04
Now all that is required is a scheduled job that executes this Stored Procedure at the end of each day, ensuring there is always an empty partition available for the next day and any data older than is required is no longer stored.

Partition Switching – Basic Example

17 Feb

Want to delete a couple of records quickly from a large table; well DELETE is more than capable (indexing issues aside). Want to delete a couple of million from a large table quickly? Now it can get a little more interesting.

Partitioning is a very useful tool for moving large amounts of data very quickly. However, it does require an understanding of the several elements of this interesting aspect of SQL Server.

    What is Partitioning?

You can have a table within a database that for general use appears to be one large storage area for your information. However, it can be stored spilt into several filegroups and these filegroups can be spread across several folders/drives. So your data in reality can be spread across a wide range of drives for improved access on a hardware level whilst appearing to the user as one table, without having to take any considerations for its partitioned structure.

    How do you Partition?

To create a partitioned table there are three things that need to be present:

1. A Partition Function, which describes how the data will be split across partitions. It shows the values that will be used for dictating how the data is spread across those partitions. At this point it doesn’t specify the column itself, just the values.

2. A Partition Scheme, to show which filegroups are used for the partitions. If you design a table to have two partitioned values (as described in the Partition Function) then the Partition Scheme will have a filegroup for each of these plus one extra for those values that won’t fit within the values specified in the Function.

3. Allocating the table to a Partition Scheme. When the table is created it needs to be allocated at this time to a Partition Scheme. It is at this point that you specify which column of the table is used by the Partition Scheme, to map values according to the Partition Function.

    Example

Quite often examples of partitioning are given using date ranges, because that is a popular way of dividing the data. This time I’m going to use something that I found easier to visualise and demonstrate – animals.

For no particular reason I’m going to create a table that stores details of cats and dogs. One partition will contain Cats and the other Dogs. Using these as the boundaries of the partitions will show what happens when other animals are introduced, depending upon how they are listed alphabetically.

First of all, create a database. To properly utilise partitioning I should create the database across several filegroups but that isn’t essential for this demonstration. In my environment the following code will create a database in one filegroup – ‘PRIMARY’.

CREATE DATABASE PartitionTest; 
GO 

Now we create a Partition Function, which requires a name, the definition of the column it will ultimately be mapped against (or to look at it the other way – the mapped column should have the same definition as the Partition Function), the description of how the partition boundaries are set and the values that these boundaries will have.

CREATE PARTITION FUNCTION [pf_PetTypes](varchar(40)) AS RANGE LEFT FOR VALUES (N'Cat', N'Dog') 
GO 

This function is called ‘pf_PetTypes’. It has two boundaries, ‘Cat’ and ‘Dog’. The ‘RANGE LEFT’ means that anything to the left of ‘Cat’ will go into the first partition. So alphabetically anything with a value of ‘Cat’ or less will go into this partition.
Anything to the left of ‘Dog’ (and including ‘Dog’) alphabetically right down to (but not including) ‘Cat’ will go into the second partition.
Anything beyond ‘Dog’ will go into the third partition that will be specified in the Partition Scheme.

‘RANGE RIGHT’ is another option for the Partition Function and this would of course alter the behaviour of the partitions.

Having created the Partition Function we now associate it with a Partition Scheme, which will map the ranges specified in the Function to filegroups.

CREATE PARTITION SCHEME [ps_PetTypes] AS PARTITION [pf_PetTypes] TO ([PRIMARY], [PRIMARY], [PRIMARY]) 
GO 

This Scheme shows three partitions, which in this example all point to the same filegroup. For partition switching this is not a problem. To spread your partitions across multiple drives or folders you’d have to have different filegroups specified, which would have started with the database creation.
This Scheme maps the values of the first value within the Function to the first filegroup ‘PRIMARY’, the second value to the second filegroup ‘PRIMARY’ and any values that are beyond the range will go to the third filegroup ‘PRIMARY’.

Now the table can be created and mapped to the Partition Scheme.

CREATE TABLE PetDetails( 
	PetDetailID		INT IDENTITY(1,1), 
	PetType			VARCHAR(40) NOT NULL, 
	PetBreed			VARCHAR(20)	NOT NULL DEFAULT 'Unknown', 
	PetName			VARCHAR(20), 
	SpecialRequirements	VARCHAR(500), 
	AdditionalNotes		VARCHAR(500), 
CONSTRAINT [PK_Petdetails] PRIMARY KEY CLUSTERED 
( 
	PetDetailID ASC, 
	Pettype		ASC  
) WITH (IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE) ON ps_PetTypes(PetType)  
) ON ps_PetTypes(PetType); 
 
GO 

In this example I’ve specified the Partition Scheme against the Clustered Index and the table itself (‘ON ps_PetTypes(PetType)’). This isn’t actually necessary – one or the other is sufficient, it’s just a habit I’ve developed.
It is here that you specify which column is used to decide which partition is required. The column is the parameter used for the ps_PetTypes Partition Scheme. In this case the column is PetType.

A Clustered Index has been created against this table and where this has been done the partition column has to be part of the Key, otherwise the error ‘Partition columns for a unique index must be a subset of the index key’ is thrown and the table creation fails.

For partition switching we need somewhere to actually switch the data to. For this purpose you need another table that is identical in structure to the main table. If there are defaults and suchlike within the main table then these aren’t required in the additional table but columns, definitions and Clustered Index must match.

CREATE TABLE PetDetails_PartitionSwitch( 
	PetDetailID		INT IDENTITY(1,1), 
	PetType			VARCHAR(40) NOT NULL, 
	PetBreed			VARCHAR(20) NOT NULL, 
	PetName			VARCHAR(20), 
	SpecialRequirements	VARCHAR(500), 
	AdditionalNotes		VARCHAR(500), 
CONSTRAINT [PK_Petdetails_PartitionSwitch] PRIMARY KEY CLUSTERED 
( 
	PetDetailID ASC, 
	Pettype		ASC 
) WITH (IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, DATA_COMPRESSION = PAGE) ON ps_PetTypes(PetType)  
) ON ps_PetTypes(PetType); 
 
GO 

In this case I’ve created another table with the suffix ‘_PartitionSwitch’, to signify the purpose of this table.
Now that we’ve created the tables the partitions details can be found:

SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails'; 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails_PartitionSwitch'; 
GO 

PS_01

Both tables have three partitions, with no rows of data.
So we’ll add some data and look again:

INSERT INTO dbo.PetDetails 
        ( PetType , 
          PetBreed , 
          PetName , 
          SpecialRequirements , 
          AdditionalNotes 
        ) 
VALUES  ( 'Dog' , 'Alsatian' , 'Pookie' ,  '' , '' ), 
		( 'Dog' , 'Boxer', 'Rover','','Not the brightest bulb in the shop'), 
		( 'Cat', 'Persian', 'Snowball', '',''); 
GO 


SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails'; 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails_PartitionSwitch'; 
GO 

PS_02

PetDetails has one row in Partition 1 and two rows in Partition 2, because we added one ‘Cat’ row and two ‘Dog’ rows.
This table was designed with Cats and Dogs in mind, so what happens when other animals are used?

INSERT INTO dbo.PetDetails 
        ( PetType , 
          PetBreed , 
          PetName , 
          SpecialRequirements , 
          AdditionalNotes 
        ) 
VALUES  ( 'Bird' , 'Canary' , '' ,  '' , '' ); 
GO 
 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails'; 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails_PartitionSwitch'; 
GO 

PS_03

Because ‘Bird’ is left alphabetically of ‘Cat’ it also goes into Partition 1.

Add another one:

INSERT INTO dbo.PetDetails 
        ( PetType , 
          PetBreed , 
          PetName , 
          SpecialRequirements , 
          AdditionalNotes 
        ) 
VALUES  ( 'Frog' , 'Pig Frog' , '' ,  '' , '' ); 
GO 

--Frog goes into partition 3, because it is beyond the range of 'Dog' 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails'; 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails_PartitionSwitch'; 
GO 

PS_04

The value ‘Frog’ is beyond the boundary set by ‘Dog’ alphabetically, so it goes into the extra partition that is there to cater for out of range values.
Remove these additional values, so we can get back to ‘clean’ data:

--Get rid of the distracting data 
DELETE dbo.PetDetails 
WHERE PetType NOT IN ('cat','dog'); 
GO 

It is possible to use the partition function to ascertain what values will be stored into which partition without actually writing any data:

--What partition will 'cat' be in 
SELECT $PARTITION.pf_PetTypes('cat'); 
GO 

--What partition would 'elephant' be placed into? 
SELECT $PARTITION.pf_PetTypes('elephant'); 
GO 

The result is the partition number the data will be written to.
Check the partitions before we switch:

SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails'; 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails_PartitionSwitch'; 
GO 

PS_05

I know that the ‘Cat’ entry is in Partition 1. To move this data by switching the partition the command is:

ALTER TABLE dbo.PetDetails 
SWITCH PARTITION 1 TO dbo.PetDetails_PartitionSwitch PARTITION 1; 
GO 

SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails'; 
SELECT * FROM sys.partitions WHERE object_name(object_id)='PetDetails_PartitionSwitch'; 
GO 

PS_06

Partition 1 of PetDetails now has no rows (check the partition_number column as the order may have changed) and Partition 1 of PetDetails_PartitionSwitch has one row.

SELECT * 
FROM dbo.PetDetails; 
GO 

PS_07

There is no trace of the ‘Cat’ entry in this table, it is now in PetDetails_PartitionSwitch:

SELECT * 
FROM dbo.PetDetails_PartitionSwitch; 
GO 

PS_08

Now we can remove the data quickly by using TRUNCATE TABLE against Petdetails_PartitionSwitch.

The speed of the SWITCH PARTITION command and the TRUNCATE TABLE are extremely fast, because they are both DDL commands. Therefore switching and truncating one row or a million rows will take pretty much the same amount of time; certainly not enough difference for you to worry about.
For the finer detail on Partition Switching – http://technet.microsoft.com/en-us/library/ms191160(v=sql.105).aspx