Migrating your SQL Server Workloads to PostgreSQL – Part 2

Shameel

3.00/5 (1 vote)

Sep 21, 2019

CPOL

14 min read

4372

Part 2 of this multi-part article series compares SQL Server and PostgreSQL article, database structure, language differences, data type mappings, functions and operators and other important considerations for migration..

Article Series

Migrating your SQL Server Workloads to PostgreSQL – Part 1
Migrating your SQL Server Workloads to PostgreSQL – Part 2 (This article)
Migrating your SQL Server Workloads to PostgreSQL – Part 3
Migrating your SQL Server Workloads to PostgreSQL – Part 4

SQL Server vs PostgreSQL

For a high-level comparison, check this DB Engines page here.

Concurrency Control

Concurrency control of row data is an important aspect of database systems that determines how efficiently row data is maintained and updated. Different database systems have different concurrency control mechanisms that have a direct impact on their performance.

SQL Server Concurrency Control

SQL Server versions prior to 2005 traditionally used Row level exclusive locking that locks the entire row during write operations. This prevents other read and write threads from accessing the row and they had to wait till the current write operation on the row completed (either committed or rolled back). This has a significant impact on performance of the database during heavy concurrent writes and reads.

SQL Server 2005 improved this scenario by introducing “row-versioning isolation levels”. See Locking and Row Versioning. Note that there are two separate MVCC implementations, read committed isolation using row versioning (RCSI) and snapshot isolation (SI)

SQL Server queries can return dirty (uncommitted) rows depending on the isolation level of the transactions.

PostgreSQL Concurrency Control

PostgreSQL maintains data consistency using a multiversion model (Multiversion Concurrency Control, MVCC), which means that each SQL statement sees a snapshot of data (a database version) as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session. MVCC avoids locking rows which minimizes lock contention and improves overall performance.

The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading. PostgreSQL maintains this guarantee even when providing the strictest level of transaction isolation through the use of an Serializable Snapshot Isolation (SSI) level.

PostgreSQL queries never return dirty rows, no matter what the isolation level of the transaction is.

The Database Structure

Like SQL Server, PostgreSQL can contain multiple Databases within a single instance, and each database can contain multiple Schemas. Each of these schemas can contains other database objects like Tables, Views, Stored Procedures, Functions, etc.

When a connection to a database is opened in PostgreSQL, you can only refer to that database within that connection. If your queries on the connection refer to a different database running on the same instance, PostgreSQL will throw an error. You either open a connection for each database separately or use dblink or Foreign Data Wrappers to query tables from multiple databases.

T-SQL vs PL/pgSQL

The SQL Server procedural extension to the standard SQL language is called Transact-SQL or T-SQL in short and provides additional capabilities like procedural programming, looping constructs, conditional constructs, session management, Stored Procedures, etc.

PostgreSQL multiple procedural languages and the default language is called PL/pgSQL.

PostgreSQL has a more refined SQL syntax

Once you start using PostgreSQL, you’ll notice that it has a much cleaner and refined language syntax compared to SQL Server. Compare these statements:

SQL Server

IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[customers]') AND type IN (N'U')) 
BEGIN
    CREATE TABLE [dbo].[customers]( .... ) 
END

PostgreSQL

CREATE TABLE IF NOT EXISTS customers ( .... )

PostgreSQL code is much simpler, cleaner and refined. SQL Server code is definitely a maintenance nightmare.

Few more examples:

SQL Server

-- This throws an error if the table doesn’t exist
DROP TABLE [dbo].[customers]

-- This is how you do it in SQL Server 2014 and older
IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[customers]') AND type IN (N'U')) 
BEGIN
    DROP TABLE [dbo].[customers] 
END

-- This works in SQL Server 2016 and newer
DROP TABLE IF EXISTS [dbo].[customers]

PostgreSQL

DROP TABLE IF EXISTS customers;
-- No error if the table doesn’t exist

Similarly, PostgreSQL supports CREATE OR REPLACE syntax for Stored Procedures, Functions, Views etc. But in SQL Server 2014 and prior, you would have to check if the object exists and then drop it before attempting to create it again. You can use the ALTER statement, but it throws an error if the object doesn’t already exist. SQL Server 2016 SP1 introduced the CREATE OR ALTER syntax for Stored Procedures, Functions, Triggers and Views.

SQL Server 2016 and above support DROP IF EXISTS for Tables/Views but still does not support the CREATE IF NOT EXISTS syntax.

Case sensitivity of object names

In SQL Server; object names Customers, CUSTOMERS and customers are all the same. SQL Server does not impose case-sensitivity in object names and creates objects with the same case specified in the CREATE statement.

In PostgreSQL, object names are silently converted to lowercase. When a name is enclosed in double quotation marks, the name becomes case sensitive and must be used with quotation marks in queries. Therefore, the above names are converted to customers, but “Customers” (with quotes) is treated as a different object as must the used with quotes in queries that reference the object. This rule holds good for column names in tables as well.

Database Objects/Features

Migrating database objects from SQL Server to PostgreSQL is pretty straightforward as most of these objects are supported as-is in the target platform.

Object/Feature	SQL Server	PostgreSQL	Comments
Table	Yes	Yes
Indexes	Yes	Yes
Triggers	Yes	Yes
Views	Yes	Yes
Updatable Views	Yes	Yes
Materialized Views	Yes	Yes, starting v9.3
Computed Column	Yes	Yes, available in v12 Use Views for earlier versions	Computed Columns are called Generated Columns in PostgreSQL
Stored Procedures	Yes	Yes, starting v11	Prior to PostgreSQL 11, developers used Functions instead of Stored Procedures
User Defined Functions	Yes	Yes
Overloaded Functions	No	Yes	In PostgreSQL, you can have multiple functions with the same name but different parameters.
Common Table Expressions (CTE)	Yes	Yes	PostgreSQL 11 and older: CTEs are always materialized and may not perform well. PostgreSQL 12: Query hints allow CTEs to be either materialized or non-materialized.
Functional Indexes	No	Yes
Max Table Size	Unlimited	32 Terabytes (32TB) in PostgreSQL 9.6 or earlier 2 Exabytes (2EB) in PostgreSQL 10
Max Columns per table	1024 (non-wide) 30000 (wide)	250 – 1600 depending on column types
Max Database size	524,272 Terabytes	Unlimited
Max Varchar Length	Varchar(Max)	Varchar(10485760) Use text for longer strings
Scheduler	SQL Server Agent	pgAgent

SQL Server Maximum Capacity Specifications (v2012 and below)

SQL Server Maximum Capacity Specifications (v2014)

Data Types

Irrespective of whether you use tools or use manual method, keep the following data type mappings between SQL Server and PostgreSQL data types while doing the migration. Some tools allow you to play around the data type mappings to fine tune the migration process.

Microsoft SQL Server	Description	PostgreSQL
Text
CHAR(n)	Fixed length char string, 1 <= n <= 8000	CHAR(n)
VARCHAR(n)	Variable length char string, 1 <= n <= 8000	VARCHAR(n)
VARCHAR(max)	Variable length char string, <= 2GB	TEXT
NVARCHAR(n)	Variable length Unicode UCS-2 string	VARCHAR(n)
NVARCHAR(max)	Variable length Unicode UCS-2 data, <= 2GB	TEXT
TEXT	Variable length character data, <= 2GB	TEXT
NTEXT	Variable length Unicode UCS-2 data, <= 2GB	TEXT
UNIQUEIDENTIFIER	16 byte GUID(UUID) data	CHAR(16)
Numeric
BIGINT	64-bit integer	BIGINT
INTEGER	32 bit integer	INTEGER
TINYINT	8 bit unsigned integer, 0 to 255	SMALLINT
DOUBLE PRECISION	Double precision floating point number	DOUBLE PRECISION
FLOAT(p)	Floating point number	DOUBLE PRECISION
NUMERIC(p,s)	Fixed point number	NUMERIC(p,s)
SMALLMONEY	32 bit currency amount	MONEY
Date
DATE	Date includes year, month and day	DATE
DATETIME	Date and Time with fraction	TIMESTAMP(3)
DATETIME2(p)	Date and Time with fraction	TIMESTAMP(n)
DATETIMEOFFSET(p)	Date and Time with fraction and time zone	TIMESTAMP(p) WITH TIME ZONE
SMALLDATETIME	Date and Time	TIMESTAMP(0)
Boolean
BIT	1, 0 or NULL	BOOLEAN
Binary
BINARY(n)	Fixed length byte string	BYTEA
VARBINARY(n)	Variable length byte string, 1 <= n <= 8000	BYTEA
VARBINARY(max)	Variable length byte string, <= 2GB	BYTEA
ROWVERSION	Automatically updated binary data	BYTEA
IMAGE	Variable length binary data, <= 2GB	BYTEA

This list contains only commonly used data types, PostgreSQL supports a vast variety of data types and also supports custom data types. For a comprehensive list of major data types supported by PostgreSQL, check out this post.

Built-In Functions & operators

PostgreSQL has an extensively rich set of operators and built-in functions, way beyond what is offered by SQL Server.

Microsoft SQL Server	PostgreSQL	Comments
DATEPART	DATE_PART
ISNULL	COALESCE
SPACE	REPEAT
DATEADD	+
+	\|\|	String concatenation
CHARINDEX	POSITION
GETDATE	NOW
LTRIM/RTRIM	TRIM
REPLACE	OVERLAY
LEN	OCTET_LENGTH

For a comprehensive list of all PostgreSQL functions and operators, check out this page.

SQL Language differences

Feature	Microsoft SQL Server	PostgreSQL
Select first N rows	TOP n	LIMIT n
Statement terminator	; (not required)	; (required)
LIKE operator	Case insensitive by default (determined by collation)	Case sensitive, use ILIKE for case insensitive comparison
Regular Expressions	LIKE operator	SIMILAR TO operator

CTE Performance Differences

In SQL Server, this query:

WITH AllUsers AS (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;

results in a query plan for the entire query at once, and the WHERE clause filter is passed into the CTE. The resulting query plan is efficient, doing just a single clustered index seek.

In PostgreSQL, CTEs are optimization fences (outer query restrictions are not passed on to CTEs) and the database evaluates the query inside the CTE and caches the results (i.e., materialized results) and outer WHERE clauses are applied later when the outer query is processed, which means either a full table scan or a full index seek is performed and results in horrible performance for large tables. To overcome this, you rewrite this query in PostgreSQL as:

WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;

The other option is to rewrite the query using a subquery. Keep that in mind when migrating code and queries that involve CTEs.

Note that PostgreSQL 12 addresses this problem by introducing query optimizer hints to enable us to control if the CTE should be materialized or not: MATERIALIZED, NOT MATERIALIZED. So, the query can be rewritten as follows to achieve better performance.

WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;

Collation / Ordering

There are few fundamental differences in the way SQL Server and PostgreSQL store and compare data. In SQL Server, the default collation is Case insensitive but in PostgreSQL, it is case sensitive. Therefore, your code/query that searches for text in WHERE clause or ON clause in joins or LIKE statement might fail. PostgreSQL provides an ILIKE statement for case insensitive comparison.

When Delete does not delete

In SQL Server, when a DELETE statement is issued against a table, the rows are permanently deleted from the storage. In PostgreSQL they’re just marked for deletion (soft delete) and not immediately removed from storage. The UPDATE statement behaves similarly, the old row is marked for deletion and a new row is inserted with the new row data.

Keep that in mind when migrating tables that undergo heavy deletes and updates regularly, because they’ll keep growing exponentially until you do a VACUUM. You will have to set up a job that Vacuums the tables in your databases at regular intervals which could de daily, weekly or monthly depending on the growth rate of each table.

The AutoVacuum daemon is turned on by default and ensures that Vacuuming is done automatically at regular intervals, so you may not have to worry in most cases. However, it could be turned off during installation or later by the Database Admin. If you host your PostgreSQL databases in AWS Aurora, you need to be careful even when AutoVacuum daemon is running. Check out this post for more information.

See VACUUM statement for more details.

Where PostgreSQL has an edge

PostgreSQL provides a rich set of operators and functions to work with different types of data.

PostgreSQL literally runs anywhere

PostgreSQL runs on Windows, Linux, Unix, etc. whereas SQL Server 2016 and older run only on Windows. SQL Server 2017 runs on Linux as well.

Inserting Test data into a PostgreSQL table is a breeze

This query inserts one million rows into the customers table with random data.

INSERT INTO customers (id, name)
SELECT id, md5(random()::text) 
FROM generate_series(1, 1000000) AS id;

In SQL Server, you will be able to do this only using procedural code.

Multiple language support in PostgreSQL

There are currently four procedural languages available in the standard PostgreSQL distribution: PL/pgSQL, PL/Tcl, PL/Perl and PL/Python. PostgreSQL supports many other procedural languages as well, you just need to install the appropriate extension and enable it.

PL/pgSQL: this PostgreSQL’s native procedural language is like SQL Server’s T-SQL, but more refined and feature-complete.
PL/Python: PostgreSQL’s support for Python adds the enormous ecosystem of Python libraries to your arsenal. Imagine using Python functions in the middle of a SQL query, Yayy!!!
PL/Perl: PostgreSQL has full support for Perl as a procedural language.
PL/Tcl: PL/Tcl is a loadable procedural language for the PostgreSQL database system that enables the Tcl language to be used to write PostgreSQL functions.
PL/V8: This JavaScript engine is stable, feature-packed and extremely fast. The fact that PostgreSQL natively supports JSON data type make it powerful and flexible. PL/V8 supports caching data in memory which makes it an ideal language for data intensive row level operations.
PL/R: PostgreSQL has full support for R, a statistical programming language used extensively in data science that has a robust set of high-quality plugins and add-ons.
C: Though PostgreSQL supports C, it has to be compiled separately. This comes in handy when speed and fine control of memory management, resource usage for tasks and performance are critical.

PostgreSQL also has extensions for other languages like Java, Ruby, PHP, Lua, Tcl, etc.

One can argue that SQL Server also supports multiple languages because it acts as a .NET host and can run code written in any .NET languages like C#, VB.NET, F#, etc. Yes, but that involves writing code in a .NET IDE like Visual Studio, compiling code, deploying the assembly in SQL Server, writing a wrapper T-SQL function for your C# function and then calling the function from your T-SQL code. Not to mention the debugging hell. You can’t write C# code “in-line” within your T-SQL code the way the LANGUAGE keyword in PostgreSQL allows you to write a function in multiple languages.

Having personally done multi-language programming in SQL Server, I can vouch for the fact that the level of multi-language support provided by PostgreSQL is at a different level compared to that of SQL Server.

PostgreSQL has a rich set of Functions and Operators

The GREATEST function is similar to the MAX function, except that it works across columns. The LEAST is similar to MIN function, except that it works across columns.

SELECT GREATEST(LastLoginDate, LastReportDate, LastInteractionDate) FROM Users;

You can’t do this in SQL Server, except with multiple SELECT statements and procedural constructs. Similarly, there are tons of functions available in PostgreSQL which are not available in SQL Server.

PostgreSQL has much better support for CSV

People working with data on a day-to-day basis would work with CSV often to convert data from one format to another and for analytical processing, etc. PostgreSQL’s COPY FROM and COPY TO command do a much cleaner job at handling CSVs: no more truncated texts, no more encoding nightmares, no more quoting/escaping issues.

You can drop an entire Schema in PostgreSQL with a single statement

In PostgreSQL, all you do is execute DROP SCHEMA CASCADE. This is very useful during development but at the same time risky as hell in your Staging and Production environments.

In SQL Server, DROP SCHEMA doesn’t support the CASCADE clause. You must drop all the objects within the schema individually before dropping the schema, which can be a nightmare during development and prototyping.

Unicode and Character Encoding Basics

Unicode is a character encoding standard and UTF-8 and UTF-16 are different implementations of the standard.

UTF-8: UTF-8 uses one byte for characters codes below 128 and two, three or four bytes for characters beyond that. It is compatible with ASCII.

UTF-16: Always uses two or four bytes, not compatible with ASCII

UCS-2: Always uses two bytes.

Unicode Support

PostgreSQL has native support for UTF-8 encoding and its CHAR, VARCHAR and TEXT types are UTF-8 by default. String operations and regular expressions are UTF-8 aware. PostgreSQL doesn’t support UTF-16.

SQL Server (prior to 2012) support only UCS-2, a subset of UTF-16. SQL Server 2012 introduced optional support for UTF-16, but you must select an UTF-16 collation for your database for it to work. SQL Server does not support UTF-8.

Where SQL Server has an edge

You cannot query multiple databases directly within a single query

Unlike SQL Server, you cannot reference two databases within a query directly. But there’s a workaround, you can use a dblink or a Foreign Data Wrapper.

You cannot execute procedural code directly in PostgreSQL

In SQL Server, you can declare variables or use conditional logic in scripts directly:

DECLARE @MeanAge = 100
IF EXISTS (SELECT * FROM Users WHERE Age > @MeanAge) 
    SELECT 'Yay'  
ELSE 
    SELECT 'Nay';

SQL Server will allow you to run this code with a Client tool or in Inline queries in application code. In PostgreSQL, you can run procedural code like this:

DO $$ 
-- declare
BEGIN
  /* pl/pgsql here */
END $$;

However, you cannot return any results to the client. The only way you can run procedural code and return results to the client is by wrapping it within a Stored Procedure or Function and then executing it.

PostgreSQL doesn’t support Stored Procedures prior to version 11

PostgreSQL 10.x and below do not have Stored Procedures. But this is not a showstopper as you can do everything with a User Defined Function, with the only exception that you cannot have transactions inside a function.

PostgreSQL doesn’t support Computed Columns prior to version 12

PostgreSQL 11.x and below do not support Computed Columns. However, PostgreSQL 12, scheduled to be released in late 2019 introduces support for ‘Generated columns’. Check out this article for more information.

The post Migrating your SQL Server Workloads to PostgreSQL – Part 2 appeared first on The Developer Space.