Click here to Skip to main content
15,887,746 members
Articles / Database Development / PostgreSQL
Tip/Trick

Four Pitfalls of SQL Processing with NULL Values

Rate me:
Please Sign up or sign in to vote.
5.00/5 (4 votes)
1 Nov 2022Apache6 min read 4.9K   6  
How to cope with all scenarios regarding NULL value handling in your daily work
This article analyzes the root causes of the NULL value pitfalls, and concludes a simple and effective examination logic to infer the final result, and we explain the applicable conditions and solutions for the four common scenarios in daily development work.

Introduction

NULL value processing is the most error-prone for database application developers, mainly because we are accustomed to using binary Boolean logic, while the database's processing logic for NULL values is three-valued logic. In fact, the most flawed component in the database optimizers is the logic related to NULL value processing. Even mature database software, such as DB2/Teradata, still has more than 20% of the bugs related NULL processing.

In this article, we analyze the root causes of the NULL value pitfalls, and conclude with a simple and effective examination logic to infer the final result. At the same time, we explain the applicable conditions and solutions for the four common scenarios in daily development work. After reading this article, you will be able to cope with all the scenarios regarding NULL value handling in your daily SQL processing work.

Code of Examination

The following examination logic covers all scenarios for NULL values in SQL processing, and by understanding the following logic, you can avoid the NULL pitfalls.

  1. The result of all comparison and arithmetic operators (>, =, <, <>, <=, >=, +,-,*, /) with NULL result is unknown
  2. The logical operations(AND, OR, NOT) of unknown follows the truth values table of three-valued operations
  3. If the result of the operation is returned directly to the user, use NULL to represent unknown
  4. If the result of the operation is true or false as a condition during SQL processing, then the operation needs to be performed by the three-value logic, and the final result is determined by the following logic:
    1. {false、unknown} -> false
    2. {true} ->true
  5. In set operations such as UNION or INTERSECT, NULL values are considered equal to each other.

Three-valued Logic1

In logic, a three-valued logic (also trinary logic, trivalent, ternary, or trilean, sometimes abbreviated 3VL) is any of several many-valued logic systems in which there are three truth values indicating true, false and some indeterminate third value. This is contrasted with the more commonly known bivalent logics (such as classical sentential or Boolean logic) which provide only for true and false.

Image 1

Pitfalls of NULL Value Processing

1. Comparison with NULL Values

Conclusion: expr = null can’t determine the expression expr to be null, and is null should be used to determine whether the expr is null.

Suppose there is a customer table with following five columns, among which two columns (c_nationcode,c_phone) are nullable.

SQL
CREATE TABLE customer (
    c_custkey int4 NOT NULL,
    c_name varchar(25) NOT NULL,
    c_nationcode char(8) NULL,
    c_phone varchar(15) NULL,
  c_regdate date NULL,
    CONSTRAINT customer_pkey PRIMARY KEY (c_custkey)
);
insert into customer values(1, 'Randy', 'en', '13910010010', '20210911');
insert into customer values(2, 'Mandy', null, '13910010012', '20200211');
insert into customer values(3, 'Ray', 'us', null, '20180902');

If a developer wants to find out the customers with empty phones through the following statement, he won't succeed because the result of following statement is always empty.

SQL
select * from customer where c_phone = null;

The correct way should be:

SQL
select * from customer where c_phone is null;

Let's examine according to the , the predicate c_phone = null is evaluated to unknown; then according to Rule 4.1, unknown is treated as false.

SQL
c_phone=null
-> unknown
-> false;

2. Case When Expression with NULL

Conclusion: case expr when null can't determine if the expression is null, the correct way is case when expr is null.

Problems in the where or having clause are easier to detect and correct, while it is much more difficult to detect null usages in case when expressions, either by human or by existing SQL auditing tools.

For example, if we want to decode the nation code to nation name, and the code is null, we want to set nation name to China:

SQL
select c_name, case c_nationcode 
                when 'us' then 'USA' 
                when 'cn' then 'China'
                when null then 'China' 
                else 'Others' end 
from customer

The above statement does not convert the country code to null to China. Because when null is actually an operation performed by an c_nationcode = null. The correct way to do it should be:

SQL
select c_name, case when c_nationcode = 'us' then 'USA' 
                    when c_nationcode = 'cn' then 'China'
                    when c_nationcode is null then 'China' 
                    else 'Others' end 
from customer

3. NOT IN Subquery with NULL

Conclusion: A predicate of a not in subquery with nullable select elements will always be evaluated to false.

Suppose we have an orders table where the customer id(o_custkey) and order date(o_orderdate) are nullable due to missing data.

SQL
CREATE TABLE orders (
    o_orderkey int4 NOT NULL,
    o_custkey int4 NULL,
    o_orderdate date NULL,
    CONSTRAINT orders_pkey PRIMARY KEY (o_orderkey)
);
insert into orders values(1, 1, '2021-01-01');
insert into orders values(2, null, '2020-09-01');
insert into orders values(3, 3, null);

Now we want to find customers without orders for marketing. The expected result is the customer whose c_custkey is 2, and the query statement might look like this:

SQL
select * from 
customer 
where c_custkey not in (
  select o_custkey 
  from orders
  )

In fact, the above query returns nothing to us. The reason is that the o_custkey in the subquery has null values, and the processing logic of NOT IN is like this:

SQL
c_custkey not in (1,3,null) 
→ c_custkey<>1 and c_custkey<>3 and c_custkey<>null 
→ c_custkey<>1 and c_custkey<>3 and unknown 
→ unknown 
-> false

In fact, if there are null values in the result set of the subquery, the SQL will always return empty result set.

There are two correct ways:

  1. Add a NOT NULL predicate to the subquery, that is:

    SQL
    select * 
    from customer 
    where c_custkey not in (
      select o_custkey 
      from orders 
      where o_custkey is not null
      )
  2. Rewrite the NOT IN subquery to the not exists subquery, that is:

    SQL
    select *
    from customer
    where not exists (
      select o_custkey
      from orders
      where o_custkey=c_custkey
      )
    

4. ALL qualified Subquery with NULL

Conclusion: A condition of an ALL-qualified subquery with nullable select elements will always be evaluated to false.

Suppose we want to find out the orders which are wrongly registered after the user is revoked. One of the solutions is this following query statement:

SQL
select *
from customer
where c_regdate >  ALL (
  select o_orderdate
  from orders
  where c_custkey = o_custkey
  )

Similar to NOT IN above, this SQL does not return the expected result due to the presence of NULL in the result of the subquery. The ALL operation is actually performed by comparing it with the returned result set, and then performing the AND operation, and the final result is unknown. While unknown as a condition to be evaluated is, the result is false.

There are two ways to correct it:

  1. Add a NOT NULL predicate to the subquery:
    SQL
    select * 
    from customer 
    where c_regdate > all(
      select o_orderdate 
      from orders 
      where o_orderdate is not null
      )
  2. Rewrite expr > all to aggregate scalar subquery expr > (select max()...) 2:
    SQL
    select * 
    from customer 
    where c_regdate > (
      select max(o_custkey) 
      from orders
      )

Null Processing with SQL Optimization Tools

The database will do what the SQL tells. So the developers either have to pay special attention to these traps, or cope them with the help of SQL optimization tools, such as PawSQL or EverSQL. In PawSQL, there are three rewrite optimization rules for NULL processing, corresponding to the four cases above.

RuleCode Rule Description
UseEqual4NullRewrite `expr = null` or `case expr when null` can't determine whether the `expr` is null, `is null` should be used 
NotInNullableSubQueryRewrite A predicate of a `not in` subquery with nullable select elements will always be evaluated to false
AllQualifierSubQueryRewrite A condition of an `ALL`-qualified subquery with nullable select elements will always be evaluated to false

PawSQL will be based on whether the definition of a column in the DDL is nullable, and whether the operation on the column will produce nullable results, to determine whether the query elements in the subquery is nullable. If it is possible to be empty, the rewritten SQL will recommended to the user.

  • Case 1: = null is rewritten to is null
    SQL
    -- Original SQL
    select count(*) 
    from customer 
    where c_phone = null; 
    
    -- Rewritten SQL
    select count(*) 
    from customer 
    where customer.c_phone is null;
  • Case 2: case expr when null is rewritten to case when expr is null
    SQL
    -- Original SQL
    select case c_phone 
            when null then 1 
            when '139%' then 0 
            else -1 
            end 
    from customer;
    
    -- Rewritten SQL
    select case
             when c_phone is null then 1 
             when c_phone = '139%' then 0 
             else -1 
           end
      from customer;
  • Case 3: c_nationkey is nullable, add the condition c_nationkey is not null
    SQL
    -- Original SQL
    select count(*) 
    from nation 
    where n_nationkey not in (
      select c_nationkey 
      from customer
      );
    
    -- Rewritten SQL
    select count(*) 
    from nation 
    where n_nationkey not in ( 
      select c_nationkey 
      from customer
      where c_nationkey is not null
      )
  • Case 4: c_nationkey is nullable, so max(c_nationkey) is nullable, add the condition c_nationkey is not null
    SQL
    -- Original SQL
    select count(*) 
    from nation
    where n_nationkey not in (
      select max(c_nationkey) 
      from customer 
      group by c_mktsegment
      );
    
    -- Rewritten SQL
    select count(*) 
    from nation 
    where n_nationkey not in (
                 select max(customer.c_nationkey)
                   from customer
                   where c_nationkey is not null
                   group by c_mktsegment
                   )
  • Case 5: count(c_nationkey) is never empty, so there is no need to rewrite it.
    SQL
    select count(*)
    from nation
    where n_nationkey not in (
      select count(c_nationkey)
      from customer
      group by c_mktsegment
      );
    
  • Case 6: c_name is not empty, but c_nationkey is nullable, so the c_nationkey is not null condition needs to be added.
    SQL
    -- Original SQL
    select count(*)
    from nation
    where (n_name,n_nationkey) not in (
      select 'China',c_nationkey
      from customer
      );
    
    -- Rewritten SQL
    select count(*)
    from nation
    where(n_name, n_nationkey) not in (
      select'China', c_nationkey
      from customer
      where customer.c_nationkey is not null
      )
    
  • Case 7: c_nationkey is nullable, so > all (select c_nationkey ...) is rewritten to > (select min(c_nationkey) from customer)
    SQL
    -- Original SQL
     select count(*)
     from customer
     where n_nationkey > all(
       select c_nationkey
       from customer
       );
    
     -- Rewritten SQL
     select count(*)
       from customer
       where n_nationkey > (
         select min(c_nationkey)
         from customer
         )
    
  1. https://en.wikipedia.org/wiki/Three-valued_logic
  2. If expr < all or expr < = all, then rewrite it as expr < (select min() ...)

License

This article, along with any associated source code and files, is licensed under The Apache License, Version 2.0



Comments and Discussions

 
-- There are no messages in this forum --