pgsql-general-null_paradoxes_and_indexing_scheming_for_a_future_isnull_feature

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Sat, 15 Feb 2003 23:14:14 -0500

Tom Lane wrote:

> Nick Wellnhofer <wellnhofer@aevum.de> writes:

>>If I have a query like
>>SELECT * FROM table WHERE key IS NULL
>>and an index on column "key", a sequential scan is used.

> IS NULL is not an indexable operator.

> I suggest reconsidering your data representation, as this
> is unlikely to change soon...

What is the problem with indexing nulls?
I had the similar problem some time ago, and created a custom set of 
operators as a work around (that do the same thing as <=> for numbers, 
but treat null as infinity and '=' returns true if both operand are 
null, and false if only one is)...
It seems to work fine.
The only problem is, that it is kinda cumbersome to create custom 
opclasses in postgres, and also, that I don't want to create the same 
wrappers for all possible types (int2,int4,int8,float etc)...

It would be a lot nicer if the default operators could handle that...
Why can it not be done?

Thanks!

Dima

===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Sun, 16 Feb 2003 00:41:04 -0500

Dima Tkach <dmitry@openratings.com> writes:
> What is the problem with indexing nulls?

We do index nulls (at least in btree indexes).

What I said was

>> IS NULL is not an indexable operator.

IS NULL is not an operator at all (it's a special syntactic construct).
It has no entry in pg_operator.  Therefore it doesn't fit into the
operator class structure that describes which operators can be used
with indexes.  There are a bunch of internal structures (ScanKeys, etc)
that it wouldn't fit into, either.

> I had the similar problem some time ago, and created a custom set of 
> operators as a work around (that do the same thing as <=> for numbers, 
> but treat null as infinity and '=' returns true if both operand are 
> null, and false if only one is)...
> It seems to work fine.

Non-strict = operators wil be a real bad idea starting in PG 7.4,
as they prevent usage of a number of hashed-aggregation optimizations.

I suggest rethinking your schema: whatever you are using NULL to
represent does not fit very well with SQL's idea of NULL semantics.
In particular, the notion that "NULL = NULL" should yield true is
going to get you in all kinds of trouble.


===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Sun, 16 Feb 2003 11:27:52 -0500

>
>
>I suggest rethinking your schema: whatever you are using NULL to
>represent does not fit very well with SQL's idea of NULL semantics.
>In particular, the notion that "NULL = NULL" should yield true is
>going to get you in all kinds of trouble.
>

Oh, no, it is not really a notion of "NULL=NULL", as I said, I only use 
it as a workaround for postgres inability to use index with null keys.

Tom, as you said in your message "we do index nulls" - why do you index 
them, if there is no way to use those index values? :-) I mean, if they 
were not in the index at all, I could understand that, but they are 
already there, and not used, just because of some syntactical difference 
between 'is null' and other operators??? That looks very weird to me.

Of course, I would not want to use the 'notion of null=null' if "is 
null" worked the same way, but, as you said yourself, it doesn't... So 
what do I do?
As for "rethinking my schema"... I would appreciate any suggestions...
There are many instances where I need to have nulls in the indexes, here 
is the simplest one:

create table trees
(
   id serial primary key,
   parent int references trees on delete cascade on update cascade
   data text
);
create unique index trees_idx on trees (parent, id);

A row in the table is a tree node. A node can have one parent, ot no 
parent at all.
About the only way to do this I know (aside from hacking around and 
inserting "dummy" rows into the table) is to use null as parent values 
for the nodes with no parents, but then a query like select * from trees 
where parent is null will take forever if the table is any large...

What do you recommend? Predicate indexes? Waste of space... What else?

And what exactly is being able to just say something like 'select * from 
trees where parent == null' to work around the syntactical problem of is 
null not being an operator?

My only real problem with this is it being so complicated to set up. And 
I don't really understand what's wrong with it conceptually. To me, it 
looks like mereley a wrokaround for a problem with postgres parser (or 
planner?) not being able to treat is null as an operator for indexing 
purposes.

Dima


===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Sun, 16 Feb 2003 13:03:42 -0500

Dima Tkach <dmitry@openratings.com> writes:
> Tom, as you said in your message "we do index nulls" - why do you index 
> them, if there is no way to use those index values? :-)

So that an indexscan can be a substitute for seqscan + sort.

Also, in a multi-column index you must be prepared to index nulls,
or you won't correctly answer questions that look at only some of the
columns.

Index types that don't support ordered scans don't have to store nulls
(at least in their first column) and indeed rtree and gist do not.  I
forget whether hash does.

> A row in the table is a tree node. A node can have one parent, ot no 
> parent at all.

You're better off making the root node link to itself (compare handling
of /.. in a Unix filesystem).  NULL parent link does not mean "has no
parent", it means "parent is unknown".


===

To: pgsql-general@postgresql.org
From: Stephan Szabo <sszabo@megazone23.bigpanda.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Sun, 16 Feb 2003 10:12:45 -0800 (PST)


On Sat, 15 Feb 2003, Dima Tkach wrote:

> It would be a lot nicer if the default operators could handle that...
> Why can it not be done?

Jumping in... I usually use a partial index as a workaround.  Postgresql
will look at a partial index whose condition is IS NULL for queries of col
IS NULL.

===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Sun, 16 Feb 2003 15:37:17 -0500

>>A row in the table is a tree node. A node can have one parent, ot no 
>>parent at all.
> 
> 
> You're better off making the root node link to itself (compare handling
> of /.. in a Unix filesystem).  NULL parent link does not mean "has no
> parent", it means "parent is unknown".
> 

Great idea! I'll do that. Thanks!
What about another example:

create table user
(
    id serial primary key,
    login text not null unique
);

create table tag_set
(
    id     serial primay key,
    tag    text not null unique,
    data   text not null,
    userid int references users on delete cascade on update cascade
);

The idea is that 'tags' may be user-specific or user-independent - so 
that to get a set of tags for a given user, I would do

select tag,data from tag_set where userid is null or userid=?

with my 'workaround' solution I do
select tag,data from tag_set where userid==null or userid=?
(where '==' is my special non-strict operator)
to force both parts of the criteria to use the index

Any ideas how to do this better (again, other than creating a dummy user 
with id 0)?

I'll apppreciate any suggestions...

Thanks a lot!

Dima



===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Sun, 16 Feb 2003 23:16:23 -0500

Stephan Szabo wrote:
> On Sat, 15 Feb 2003, Dima Tkach wrote:
> 
> 
>>It would be a lot nicer if the default operators could handle that...
>>Why can it not be done?
> 
> 
> Jumping in... I usually use a partial index as a workaround.  Postgresql
> will look at a partial index whose condition is IS NULL for queries of col
> IS NULL.
> 

Yeah... I thought about it...
But, one problem was that in 7.2 partial indexes do not really work 
(more precisely, do not get used)  if your query has more than one table 
(Tom has given me a patch to fix that a while back, but I never got to 
installing it) :-(

More importantly, if you need to make queries of both kinds (for 'is 
null' and for = something), ther are two options, and both of them are 
not very good:

- create two indexes, one with predicate, and one without predicate - is 
a waste of space, because all the rows with nulls get indexed twice. The 
  space may not be such an important consideration by itself, but, when 
the table is huge and heavily being updated, the overhead of having to 
keep both indexes in synch becomes significant.

- create two indexes with complimetary predicates (one of IS NULL, one 
for IS NOT NULL)... Well, this seems to be better, at least from the 
space and performance standpoint, but I don't know how to even begin to 
explain to my users that they have to write queries like

... where parent is not null and parent=1

looks pretty reidiculous to me :-)
.. and the planner does not seem to be smart enough to know to use the 
index unless you mention the predicate *explicitly* in the query.

Dima


===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Sun, 16 Feb 2003 23:46:00 -0500

Dima Tkach <dmitry@openratings.com> writes:
> But, one problem was that in 7.2 partial indexes do not really work 
> (more precisely, do not get used)  if your query has more than one table 
> (Tom has given me a patch to fix that a while back, but I never got to 
> installing it) :-(

That patch is in 7.2.2 and later.  If you're still running 7.2 or 7.2.1
I have *zero* sympathy for any problems you may have.  Update before
you get bit by one of the data-loss bugs we've fixed.


===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Sun, 16 Feb 2003 23:08:36 -0500


>>A row in the table is a tree node. A node can have one parent, ot no 
>>parent at all.
> 
> 
> You're better off making the root node link to itself (compare handling
> of /.. in a Unix filesystem).  NULL parent link does not mean "has no
> parent", it means "parent is unknown".
> 

Actually, I am afraid, I have to take back my previous message (where I 
said, this was a good idea) - after giving it some thought, I don't see 
how it will make things better (if anything, it will make them worse).
For example, how would I get the list of the "top-level" (no parent) 
nodes given your suggestion?

select * from trees where parent=id

is hardly a good idea, because it just has to be a seq. scan, right?

Right now, I am, at least, able to do

select * from trees where parent == null;

(where '==' is my custom non-strict equivalence operator), that uses an 
index scan and works just fine.

Of course, it would be nicer to be able to get away with the standard 
sql set of operators, but, I guess, I have to do what I have to do :-(

Dima

P.S. Frankly, I still don't understand what is the big problem with 
making 'is null' indexable - as far as I can see, this is purely 
syntactical problem, because the btree code itself seems to be able to 
handle nulls just fine - it is at the level of the planner the index 
option just gets cut off, because it doesn't know what to do with 'is 
null'...
I may be missing something of course, but so far, this looks to me like 
a very useful feature, that would be very easy to implement too...


===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 00:42:19 -0500

Dima Tkach <dmitry@openratings.com> writes:
> For example, how would I get the list of the "top-level" (no parent) 
> nodes given your suggestion?
> select * from trees where parent=id

Exactly.

> is hardly a good idea, because it just has to be a seq. scan, right?

Make a partial index if you need it to be fast.

regression=# create table trees (id int, parent int);
CREATE TABLE
regression=# explain select * from trees where parent=id;
                      QUERY PLAN
------------------------------------------------------
 Seq Scan on trees  (cost=0.00..22.50 rows=5 width=8)
   Filter: (parent = id)
(2 rows)

regression=# create index foo on trees(id) where parent=id;
CREATE INDEX
regression=# explain select * from trees where parent=id;
                            QUERY PLAN
------------------------------------------------------------------
 Index Scan using foo on trees  (cost=0.00..17.07 rows=5 width=8)
   Filter: (parent = id)
(2 rows)


> I may be missing something of course, but so far, this looks to me like 
> a very useful feature, that would be very easy to implement too...

Criticism in the form of patches is more useful than unsubstantiated
opinions that something is easy.


===

To: pgsql-general@postgresql.org
From: Stephan Szabo <sszabo@megazone23.bigpanda.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Sun, 16 Feb 2003 21:46:49 -0800 (PST)

On Sun, 16 Feb 2003, Dima Tkach wrote:

> Stephan Szabo wrote:
> > On Sat, 15 Feb 2003, Dima Tkach wrote:
> >
> >
> >>It would be a lot nicer if the default operators could handle that...
> >>Why can it not be done?
> >
> >
> > Jumping in... I usually use a partial index as a workaround.  Postgresql
> > will look at a partial index whose condition is IS NULL for queries of col
> > IS NULL.
> >
>
> - create two indexes, one with predicate, and one without predicate - is
> a waste of space, because all the rows with nulls get indexed twice. The
>   space may not be such an important consideration by itself, but, when
> the table is huge and heavily being updated, the overhead of having to
> keep both indexes in synch becomes significant.

Yes, this solution does double index the NULLs, but if you have alot of
NULLs you probably should be doing a seqscan to find them anyway and don't
need the index.  High update frequency costs you the NULL check, which is
a little annoying, and if you've got a small number of NULL rows or the
data is clustered in some fashion (so that the index is a win) that have
lots of updates this may become significant.



===

To: pgsql-general@postgresql.org
From: Andrei Ivanov <andrei.ivanov@ines.ro>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 12:46:20 +0200 (EET)


Hello, sorry for barging in...
I use a similar structure for keeping some some text pages categorized.

CREATE TABLE pages (
  id          SERIAL NOT NULL PRIMARY KEY,
  categ       INTEGER,
  CONSTRAINT  categ_fk FOREIGN KEY(categ) REFERENCES categs(id) ON DELETE CASCADE
);

All the pages that are not contained in a category are marked by categ IS 
NULL ( this is like the files in / in a filesystem). If I use other values 
than NULL for marking this kind of pages, then the constraint would 
complain, but then I can't use an index to find these pages.

Do you have a better solution for this ?


===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Mon, 17 Feb 2003 09:43:02 -0500

>
>
>Yes, this solution does double index the NULLs, but if you have alot of
>NULLs you probably should be doing a seqscan to find them anyway and don't
>need the index.  High update frequency costs you the NULL check, which is
>a little annoying, and if you've got a small number of NULL rows or the
>data is clustered in some fashion (so that the index is a win) that have
>lots of updates this may become significant.
>  
>
Yeah... But imagine the real-life condition, when I have a *moderate* 
number of nulls (not enough to justify a seq.scan, but still a lot) in 
*several* different columns - so that, instead of creating a single 
multi-column index, I would have to create a whole bunch of them with 
different predcates - where this is null and that is not null, where 
this is null and tha is null etc, etc...

That's what annoys me a lot here :-(


===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Mon, 17 Feb 2003 09:39:55 -0500

>
>
>Criticism in the form of patches is more useful than unsubstantiated
>opinions that something is easy.
>
>  
>
I'd be happy to come up with a patch... It just was my understanding 
that you would not accept such a patch anyway, because your opinion is 
that it is unnecessary and dangerous... Did I misunderstand you here?

Thanks!

Dima


===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 10:10:28 -0500

Dima Tkach <dmitry@openratings.com> writes:
> I'd be happy to come up with a patch... It just was my understanding 
> that you would not accept such a patc hanyway, because your opinion is 
> that it is unnecessary and dangerous... Did I misunderstand you here?

I don't see anything dangerous about it --- except perhaps to
readability and mantainability of the code.  The problem is that IS NULL
doesn't fit into the operator-and-opclass model of what indexes can do.
If you can find a solution to that problem that's not a complete kluge,
I'm all ears.

===

To: pgsql-general@postgresql.org
From: Dmitry Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Mon, 17 Feb 2003 12:18:16 -0500

Tom Lane wrote:

>Dima Tkach <dmitry@openratings.com> writes:
>  
>
>>I'd be happy to come up with a patch... It just was my understanding 
>>that you would not accept such a patc hanyway, because your opinion is 
>>that it is unnecessary and dangerous... Did I misunderstand you here?
>>    
>>
>
>I don't see anything dangerous about it --- except perhaps to
>readability and mantainability of the code.  The problem is that IS NULL
>doesn't fit into the operator-and-opclass model of what indexes can do.
>If you can find a solution to that problem that's not a complete kluge,
>I'm all ears.
>
>			regards, tom lane
>  
>
Well... At first glance, it seems that what needs to be done is:

- add a special case in is_indexable_operator() to return true for IS_NULL
- modify _bt_checkkeys () to return isNull  from inside if 
(key->sk_flags & SK_ISNULL) clause instead of just false.
- remove sk_flags & SK_ISNULL checks from _bt_orderkeys

Obviously, I haven't tested this, and I may very well have missed 
something (I will, of course, inspect it and test thoroughly if you ok 
the change in principle - just don't want to spend much time right now 
on something, that may end up not being needed to anybody), but this 
looks to me like pretty much it.


===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 10:52:54 -0500

Dmitry Tkach <dmitry@openratings.com> writes:
> Tom Lane wrote:
>> I don't see anything dangerous about it --- except perhaps to
>> readability and mantainability of the code.  The problem is that IS NULL
>> doesn't fit into the operator-and-opclass model of what indexes can do.
>> If you can find a solution to that problem that's not a complete kluge,
>> I'm all ears.

> Well... At first glance, it seems that what needs to be done is:

> - add a special case in is_indexable_operator() to return true for IS_NULL

And is_indexable_operator() will know that this is safe how?  Or do you
plan to fix the other three index types to support NULLs too?

> - modify _bt_checkkeys () to return isNull  from inside if 
> (key->sk_flags & SK_ISNULL) clause instead of just false.
> - remove sk_flags & SK_ISNULL checks from _bt_orderkeys

IIRC, SK_ISNULL marks that the value being compared against is null
--- not that the scan operator is ISNULL.  An approach as above would
cause "WHERE x = something" indexscans to start returning nulls if the
"something" is null, no?  You need a representation that preserves the
difference between "x = NULL" and "x IS NULL".  The ScanKey structure
can't do this at the moment, mainly because it assumes that the scan
operator *is* an operator.  Which IS NULL is not.


===


To: pgsql-general@postgresql.org
From: Martijn van Oosterhout <kleptog@svana.org>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Tue, 18 Feb 2003 12:26:45 +1100

Tom Lane wrote:

> Dmitry Tkach <dmitry@openratings.com> writes:

> > Tom Lane wrote:

> >> I don't see anything dangerous about it --- except perhaps to
> >> readability and mantainability of the code.  The problem is that IS NULL
> >> doesn't fit into the operator-and-opclass model of what indexes can do.
> >> If you can find a solution to that problem that's not a complete kluge,
> >> I'm all ears.
>
> > Well... At first glance, it seems that what needs to be done is:
>
> > - add a special case in is_indexable_operator() to return true for IS_NULL

> And is_indexable_operator() will know that this is safe how?  Or do you
> plan to fix the other three index types to support NULLs too?

I would have thought that the other index type supported null anyway, for
the purposes of uniqueness checks.

> > - modify _bt_checkkeys () to return isNull  from inside if
> > (key->sk_flags & SK_ISNULL) clause instead of just false.
> > - remove sk_flags & SK_ISNULL checks from _bt_orderkeys

> IIRC, SK_ISNULL marks that the value being compared against is null
> --- not that the scan operator is ISNULL.  An approach as above would
> cause "WHERE x =3D something" indexscans to start returning nulls if the
> "something" is null, no?  You need a representation that preserves the
> difference between "x =3D NULL" and "x IS NULL".  The ScanKey structure
> can't do this at the moment, mainly because it assumes that the scan
> operator *is* an operator.  Which IS NULL is not.

I remember looking into this a while ago. My solution to that problem was
that x =3D NULL is always NULL and so doesn't need to go through the scan
anyway (index or sequential). Once you've taken care of the x =3D NULL case
elsewhere, you can use the available state for x IS NULL.

I don't remember if I actually found the place to fix that though.

I would really like it if this was finally made to work.

===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 20:46:17 -0500

Martijn van Oosterhout <kleptog@svana.org> writes:
>> And is_indexable_operator() will know that this is safe how?  Or do you
>> plan to fix the other three index types to support NULLs too?

> I would have thought that the other index type supported null anyway, for
> the purposes of uniqueness checks.

Well, (a) the other index types don't support uniqueness checks, and (b)
it wouldn't be relevant anyway, because multiple nulls don't violate
a unique constraint.  GIST does support nulls in second and subsequent
columns of a multi-column index, because it *has* to do so, but not in
the first column --- and hash and rtree don't store nulls at all.

> I remember looking into this a while ago. My solution to that problem was
> that x =3D NULL is always NULL and so doesn't need to go through the scan
> anyway (index or sequential). Once you've taken care of the x =3D NULL case
> elsewhere, you can use the available state for x IS NULL.

But how do you get from point A to point B?  You need to represent both
cases in ScanKeys further upstream than where that conclusion can be
drawn (namely _bt_orderkeys()) --- or else do some very substantial
restructuring work, which is exactly the point.

Also, this would amount to hard-wiring the assumption that indexable
operators are always strict.  Which is rather a curious assumption
to be putting in, if your goal is to support the obviously-not-strict
construct IS NULL as an indexable operator.  (Now I believe we make
that assumption anyway in the index access methods ... but wiring it
into ScanKeys, which is a very widespread data structure, would be the
death knell for any hope of removing it someday.)



===

To: pgsql-general@postgresql.org
From: Greg Stark <gsstark@mit.edu>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: 17 Feb 2003 20:48:50 -0500

Martijn van Oosterhout <kleptog@svana.org> writes:

> I would have thought that the other index type supported null anyway, for
> the purposes of uniqueness checks.

Well, rtree indexes are used for boxes, which don't even have a useful =
operators anyways... Unique indexes on them wouldn't work very well.


===

To: pgsql-general@postgresql.org
From: Martijn van Oosterhout <kleptog@svana.org>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Tue, 18 Feb 2003 13:20:29 +1100

Tom Lane wrote:

> Martijn van Oosterhout <kleptog@svana.org> writes:

> > I would have thought that the other index type supported null anyway, for
> > the purposes of uniqueness checks.
>
> Well, (a) the other index types don't support uniqueness checks, and (b)
> it wouldn't be relevant anyway, because multiple nulls don't violate
> a unique constraint.  GIST does support nulls in second and subsequent
> columns of a multi-column index, because it *has* to do so, but not in
> the first column --- and hash and rtree don't store nulls at all.

I stand corrected. I just tested it here and multiple nulls in a unique
column indeed do work.

> > I remember looking into this a while ago. My solution to that problem was
> > that x =3D3D NULL is always NULL and so doesn't need to go through the scan
> > anyway (index or sequential). Once you've taken care of the x =3D3D NULL case
> > elsewhere, you can use the available state for x IS NULL.

> But how do you get from point A to point B?  You need to represent both
> cases in ScanKeys further upstream than where that conclusion can be
> drawn (namely _bt_orderkeys()) --- or else do some very substantial
> restructuring work, which is exactly the point.
>
> Also, this would amount to hard-wiring the assumption that indexable
> operators are always strict.  Which is rather a curious assumption
> to be putting in, if your goal is to support the obviously-not-strict
> construct IS NULL as an indexable operator.  (Now I believe we make
> that assumption anyway in the index access methods ... but wiring it
> into ScanKeys, which is a very widespread data structure, would be the
> death knell for any hope of removing it someday.)

I hadn't thought of that. While I can't think of a situation of a
non-strict indexable operator, I wouldn't want to rule it out.

My Plan B was to create a operator IS (and its inverse ISNOT) which is then
binary operator. It would be identical to =3D and <> except that it would be
defined where either argument is NULL. Fiddle the parser to use this
operator instead of the unary ISNULL. The disadvantage is that (unless you
restrict it in the parser) you could say things like:

SELECT * FROM x, y WHERE x.field IS y.field

Allowing you to join on NULL fields. This is not allowed by the spec. Do you
think this would be a better approach? Or is there something special about
the ISNULL in SQL does means this cannot work? It does seem a bit wasteful
to have an operator whose second argument is always NULL (unless you allow
the extra syntax).

As a bonus, if this could be made to work, you *know* your index operators
don't need to be strict.

===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Mon, 17 Feb 2003 22:24:19 -0500

>
>
>  
>
>>I remember looking into this a while ago. My solution to that problem was
>>that x =3D NULL is always NULL and so doesn't need to go through the scan
>>anyway (index or sequential). Once you've taken care of the x =3D NULL case
>>elsewhere, you can use the available state for x IS NULL.
>>    
>>
>
>But how do you get from point A to point B?  You need to represent both
>cases in ScanKeys further upstream than where that conclusion can be
>drawn (namely _bt_orderkeys()) --- or else do some very substantial
>restructuring work, which is exactly the point.
>  
>
I think what he meant was that a = null is actually a CONSTANT (null) 
and can be
reduced at the parsing/planning stage, so that the indexing code never 
has to deal with such condition at all, and the only
way for it to ever see SK_NULL would be the case of 'is null'...

As for the hard-wired assumption that indexable operators are always 
strict, it seems to me that it is already there -
_bt_checkkeys() always retruns false if it ever sees a null key. Doesn't 
it mean that it assumes that all the operators it will ever see are strict?

Dima


===

To: pgsql-general@postgresql.org
From: Dennis Gearon <gearond@cvc.net>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Mon, 17 Feb 2003 19:42:28 -0800

Tom Lane wrote:
> 
> Martijn van Oosterhout <kleptog@svana.org> writes:
> >> And is_indexable_operator() will know that this is safe how?  Or do you
> >> plan to fix the other three index types to support NULLs too?
> 
> > I would have thought that the other index type supported null anyway, for
> > the purposes of uniqueness checks.
> 
> Well, (a) the other index types don't support uniqueness checks, and (b)
> it wouldn't be relevant anyway, because multiple nulls don't violate
> a unique constraint.  GIST does support nulls in second and subsequent
> columns of a multi-column index, because it *has* to do so, but not in
> the first column --- and hash and rtree don't store nulls at all.
> 
> > I remember looking into this a while ago. My solution to that problem was
> > that x =3D NULL is always NULL and so doesn't need to go through the scan
> > anyway (index or sequential). Once you've taken care of the x =3D NULL case
> > elsewhere, you can use the available state for x IS NULL.
> 
> But how do you get from point A to point B?  You need to represent both
> cases in ScanKeys further upstream than where that conclusion can be
> drawn (namely _bt_orderkeys()) --- or else do some very substantial
> restructuring work, which is exactly the point.
> 
> Also, this would amount to hard-wiring the assumption that indexable
> operators are always strict.  Which is rather a curious assumption
> to be putting in, if your goal is to support the obviously-not-strict
> construct IS NULL as an indexable operator.  (Now I believe we make
> that assumption anyway in the index access methods ... but wiring it
> into ScanKeys, which is a very widespread data structure, would be the
> death knell for any hope of removing it someday.)

so NULLs **DON'T** count in a unique index? You can have more than one
NULL in a single column UNIQUE constraint? I guess I am showing my
ignorance, I thought you could only have one.

I was planning to do some interesting default configuration for a column
value to ensure uniqueness, but flag an unknown value.


===


To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 23:40:21 -0500

Dima Tkach <dmitry@openratings.com> writes:
> I think what he meant was that a = null is actually a CONSTANT (null) 
> and can be
> reduced at the parsing/planning stage, so that the indexing code never 
> has to deal with such condition at all,

If the = operator is strict, yes it will be.  But that just begs the
question.

> As for the hard-wired assumption that indexable operators are always 
> strict, it seems to me that it is already there -
> _bt_checkkeys() always retruns false if it ever sees a null key. Doesn't 
> it mean that it assumes that all the operators it will ever see are strict?

That assumption does currently exist in _bt_checkkeys() and some other
localized places.  But it could possibly be removed from them --- or a
new index access method could be written that makes no such assumption.

If you propagate the assumption upwards into ScanKeys and the planner,
then it'll never be possible to get rid of it, not even with an entirely
new access method.

These sorts of considerations are why I said that I thought a non-kluge
solution was difficult.  Yes, we could hack slash and burn our way to
a patch that works (most of the time anyway) in short order.  Developing
something that doesn't cripple future progress is another story.


===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 23:54:59 -0500

Martijn van Oosterhout <kleptog@svana.org> writes:
> My Plan B was to create a operator IS (and its inverse ISNOT) which is then
> binary operator. It would be identical to =3D and <> except that it would be
> defined where either argument is NULL. Fiddle the parser to use this
> operator instead of the unary ISNULL.

I don't think there's anything fundamental that assumes that indexable
operators are binary, so you might as well make the operator unary.  The
problem with this approach isn't that --- it's the tedium of making an
ISNULL operator for every datatype, adding it to every opclass, etc.

Maybe there's no non-kluge answer that doesn't make us buy into that,
but it sure seems like the hard way.  It's definitely not going to be
a short and sweet patch :-(


===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Mon, 17 Feb 2003 23:44:59 -0500

Dennis Gearon <gearond@cvc.net> writes:
> so NULLs **DON'T** count in a unique index? You can have more than one
> NULL in a single column UNIQUE constraint? I guess I am showing my
> ignorance, I thought you could only have one.

Unique indexes enforce SQL92 unique constraints, which are defined by
the spec thusly (sec 4.10):

         A unique constraint is satisfied if and only if no two rows in
         a table have the same non-null values in the unique columns.

If this doesn't seem right to you, you have not grokked the concept of
NULL.  Two nulls are never "equal" per spec.


===

To: pgsql-general@postgresql.org
From: Martijn van Oosterhout <kleptog@svana.org>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Tue, 18 Feb 2003 16:29:46 +1100


Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:

> > My Plan B was to create a operator IS (and its inverse ISNOT) which is then
> > binary operator. It would be identical to =3D3D and <> except that it would be
> > defined where either argument is NULL. Fiddle the parser to use this
> > operator instead of the unary ISNULL.

> I don't think there's anything fundamental that assumes that indexable
> operators are binary, so you might as well make the operator unary.  The
> problem with this approach isn't that --- it's the tedium of making an
> ISNULL operator for every datatype, adding it to every opclass, etc.

I'll have to look again. I thought there was this whole section dealing with
how < related to <=3D and such which kind of implied it had to be binary.
Similarly, with an index scan on a unary operator would require a new entry
point, because it requires *no* parameters. An index scan relies on there
being an order, how can a unary operator have an order?

That said, how is this different from having an isnull() function and
building a functional index on that? An operator is just a function with
different syntax. Are there any optimisations involving IS NULL that
preclude us from simply replacing it with a function call globally. I guess
the issue is that it's a new index rather than using the existing one.

As for all the different versions, the actual implementation doesn't need to
know about the datatype, it only needs the NULL-ness of the argument so it
should be generic. Can an operator/function have a argument type that
matches anything?

> Maybe there's no non-kluge answer that doesn't make us buy into that,
> but it sure seems like the hard way.  It's definitely not going to be
> a short and sweet patch :-(

I'm still hoping. I'm sure we just need to think about it the right way...

===

To: pgsql-general@postgresql.org
From: Andrei Ivanov <andrei.ivanov@ines.ro>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Tue, 18 Feb 2003 17:34:19 +0200 (EET)


This is a resend, don't know if the first time it got to the list... sorry 
if it did.

Hello, sorry for barging in...
I use a similar structure for keeping some some text pages categorized.

CREATE TABLE pages (
  id          SERIAL NOT NULL PRIMARY KEY,
  categ       INTEGER,
  CONSTRAINT  categ_fk FOREIGN KEY(categ) REFERENCES categs(id) ON DELETE CASCADE
);

All the pages that are not contained in a category are marked by categ IS 
NULL ( this is like the files in / in a filesystem). If I use other values 
than NULL for marking this kind of pages, then the constraint would 
complain, but then I can't use an index to find these pages.

Do you have a better solution for this ?


===

To: pgsql-general@postgresql.org
From: "Mike Mascari" <mascarm@mascari.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Tue, 18 Feb 2003 10:55:50 -0500

From: "Andrei Ivanov" <andrei.ivanov@ines.ro>
>
> This is a resend, don't know if the first time it got to the
list... sorry
> if it did.
>
> Hello, sorry for barging in...
> I use a similar structure for keeping some some text pages
categorized.
>
> CREATE TABLE pages (
>   id          SERIAL NOT NULL PRIMARY KEY,
>   categ       INTEGER,
>   CONSTRAINT  categ_fk FOREIGN KEY(categ) REFERENCES
categs(id) ON DELETE CASCADE
> );
>
> All the pages that are not contained in a category are marked
by categ IS
> NULL ( this is like the files in / in a filesystem). If I use
other values
> than NULL for marking this kind of pages, then the constraint
would
> complain, but then I can't use an index to find these pages.
>
> Do you have a better solution for this ?

If some pages aren't associated with a category, shouldn't you
have three relations?

categories (
 categ PRIMARY KEY
 ...
);

pages (
 id PRIMARY KEY
 ...
);

category_pages (
 categ INTEGER NOT NULL,
 id INTEGER NOT NULL
);

Similarly, with previous posts regarding hierarchies, the model
should look like:

employees (
 employeeid PRIMARY KEY
 ...
)

employee_manager (
 employeeid INTEGER NOT NULL,
 manager INTEGER NOT NULL
)

*not*:

employees (
 employeeid PRIMARY KEY,
 manager INTEGER
);

NULLs are evil. ;-)


===

To: pgsql-general@postgresql.org
From: Dima Tkach <dmitry@openratings.com>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Wed, 19 Feb 2003 09:07:09 -0500

Tom Lane wrote:

>Dmitry Tkach <dmitry@openratings.com> writes:

>>Tom Lane wrote:

>>>I don't see anything dangerous about it --- except perhaps to
>>>readability and mantainability of the code.  The problem is that IS NULL
>>>doesn't fit into the operator-and-opclass model of what indexes can do.
>>>If you can find a solution to that problem that's not a complete kluge,
>>>I'm all ears.

>>Well... At first glance, it seems that what needs to be done is:
>>    
>
>>- add a special case in is_indexable_operator() to return true for IS_NULL
>
>And is_indexable_operator() will know that this is safe how?  Or do you
>plan to fix the other three index types to support NULLs too?
>
Well, yeah... Generally, I think, the best thing to do is to fix them 
all - if it can handle '=', there is no reason it cannot be made to 
handle 'is null' as well...
I haven't seen much of the code for rtree and hash, but I doubt it (null 
handling) is going to be that much different from btree...

But if we do not want to fix them all, just btree, you are right - 
indexable_operator() needs to know about that... In that case, we'd 
either need to add a param to it to pass the index info, or leave it 
alone, and just handle the is null case separately, at the higher level...
Either way, we'd probably want to add something like idxisstrict to 
pg_index - checking for btree explicitly would look like a kludge even 
to me :-)

>>- modify _bt_checkkeys () to return isNull  from inside if 
>>(key->sk_flags & SK_ISNULL) clause instead of just false.
>>- remove sk_flags & SK_ISNULL checks from _bt_orderkeys

>IIRC, SK_ISNULL marks that the value being compared against is null
>--- not that the scan operator is ISNULL.  An approach as above would
>cause "WHERE x = something" indexscans to start returning nulls if the
>"something" is null, no?  You need a representation that preserves the
>difference between "x = NULL" and "x IS NULL".  The ScanKey structure
>can't do this at the moment, mainly because it assumes that the scan
>operator *is* an operator.  Which IS NULL is not.


This can be resolved by just putting null (0) as the operator in the 
ScanKey to denote is null.
I know, this looks ugly, but the ugliness is due to 'is null' not being 
an operator in the first place... :-)

And another possibility is to create isnull () operator... but that 
would have to wait until postgres allows functions with unknown argument 
types (this can be done - just always pass pointers, and let the 
function figure out what it wants to do with them)...
Otherwise, as you said earlier, creating isnull() for every possible 
type would be quite a job.... and besides, there are also user-defined 
types too...

Dima

===

To: pgsql-general@postgresql.org
From: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] Index not used with IS NULL 
Date: Wed, 19 Feb 2003 09:44:03 -0500

Dima Tkach <dmitry@openratings.com> writes:
> And another possibility is to create isnull () operator... but that 
> would have to wait until postgres allows functions with unknown argument 
> types

Actually, we do have that now --- it'd be reasonable to implement such
a function and operator as taking type ANY.  Hm, maybe this is more
practical than I thought.  If we replace the special-purpose NullTest
expression node by two operators (IS NULL, IS NOT NULL) taking type ANY,
then you wouldn't have to do any violence to the ScanKeys representation
to handle these operators as index quals.  Rather than adding them to
pg_opclass for every btree opclass, I'd be inclined to special-case them
in the planner (they could be a case that special_indexable_operator
handles) --- with only two to deal with, that doesn't seem impractical.
Hm, probably only IS NULL need be indexable.  We don't index != ...


===

To: pgsql-general@postgresql.org
From: elein <elein@sbcglobal.net>
Subject: Re: [GENERAL] Index not used with IS NULL
Date: Wed, 19 Feb 2003 16:14:29 -0800

Tom Lane wrote:

> Dima Tkach <dmitry@openratings.com> writes:

> > And another possibility is to create isnull ()
> > operator... but that would have to wait until postgres
> > allows functions with unknown argument types

> Actually, we do have that now --- it'd be reasonable to implement such
> a function and operator as taking type ANY.  Hm, maybe this is more
> practical than I thought.  If we replace the special-purpose NullTest
> expression node by two operators (IS NULL, IS NOT NULL) taking type ANY,
> then you wouldn't have to do any violence to the ScanKeys representation
> to handle these operators as index quals.  Rather than adding them to
> pg_opclass for every btree opclass, I'd be inclined to special-case them
> in the planner (they could be a case that special_indexable_operator
> handles) --- with only two to deal with, that doesn't seem impractical.
> Hm, probably only IS NULL need be indexable.  We don't index != ...

I like this solution because it uses the existing infrastructure of
functional indexes and the ANY capability.  It makes perfect sense to me.

So, did this make it to the todo list?  It didn't see it there. 
Dima, Tom, Stephan, Martin?  Is this likely in 7.4?

===


the rest of The Pile (a partial mailing list archive)

doom@kzsu.stanford.edu