Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 10:20:50 -0500 (EST)
> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up. Seems that
> > would be a good way. I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer to
> > access.
>
> I thought to use a regular table. Of course, it will eat
> buffers, but managing external files or even large objects
> for it IMHO isn't that simple, if you take transaction
> commit/abort and MVCC problematic into account too. And IMHO
> this is something that must be covered, because I meant to
> create a DATATYPE that can be used as a replacement for TEXT
> if that's too small, so it must behave as a regular datatype,
> without any restrictions WRT beeing able to rollback etc.
OK, I have thought about your idea, and I like it very much. In fact,
it borders on genius.
Our/my original idea was to chain tuples in the main table. That has
some disadvantages:
More complex tuple handling of chained tuples
Requires more tuple storage overhead for housekeeping of chaining data
Sequential scan of table has to read those large fields
Vacuum has to keep the tuples chained as they are moved
Your system would be:
CREATE TABLE pg_long (
    refoid  OID,
    attno   int2,
    line    int4,
    attdata VARCHAR(8000)
);
CREATE INDEX pg_long_idx ON pg_long (refoid, attno, line);
You keep the long data out of the table. When updating the tuple, you
mark the pg_long tuples as superseded with the transaction id, and just
keep going. No need to do anything special. Vacuum will remove
superseded tuples automatically while processing pg_long if the
transaction was committed.
The pg_long_idx index will allow rapid access to tuple long data.
This approach seems better than tuple chaining because it uses our
existing code more efficiently. You keep long data out of the main
table, and allow use of existing tools to access the long data.
In fact, you may decide to just extend varchar() and text to allow use
of long tuples. Set the varlena VARLEN field to some special value like
-1, and when you see that, you go to pg_long to get the data. Seems
very easy. You could get fancy and keep data in the table in most
cases, but if the tuple length exceeds 8k, go to all the varlena fields
and start moving data into pg_long. That way, a table with three 4k
columns could be stored without the user even knowing pg_long is
involved, while shorter tuples are stored entirely in the main table.
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 10:38:28 -0500 (EST)
> I thought to use a regular table. Of course, it will eat
> buffers, but managing external files or even large objects
> for it IMHO isn't that simple, if you take transaction
> commit/abort and MVCC problematic into account too. And IMHO
> this is something that must be covered, because I meant to
> create a DATATYPE that can be used as a replacement for TEXT
> if that's too small, so it must behave as a regular datatype,
> without any restrictions WRT beeing able to rollback etc.
In fact, you could get fancy and allow an update of a column that
doesn't use pg_long to leave pg_long untouched. Just keep the same
value in the column. Whether the transaction fails or succeeds, the
pg_long data is the same for that tuple. Of course, because an update
is a delete and then an insert, that may be hard to do. For very long
fields, it would be a win for UPDATE. You certainly couldn't do that
with chained tuples.
===
Subject: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 17:21:28 +0100 (MET)
I wrote:
> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up. Seems that
> > would be a good way. I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer to
> > access.
>
> I thought to use a regular table. Of course, it will eat
> buffers ...
When looking at my actual implementation concept, I'm not
sure if it will win or lose compared with text itself!
Amazing, but I think it could win already at relatively small
text sizes (1-2K is IMHO small compared to what this type
could store).
Well, the implementation details. I really would like some
little comments to verify it's really complete before
starting.
- A new field "rellongrelid" type Oid is added to pg_class.
It contains the Oid of the long-value relation or the
invalid Oid for those who have no LONG attributes.
- At CREATE TABLE, a long value relation named
"_LONG<tablename>" is created for those tables who need it.
It is of course dropped and truncated appropriately. The schema
of this table is:
    rowid       Oid,     -- oid of our main data row
    rowattno    int2,    -- the attribute number in main data
    chunk_seq   int4,    -- the part number of this data chunk
    chunk       text     -- the content of this data chunk
There is a unique index defined on (rowid, rowattno).
- The new data type is of variable size with the following
header:
typedef struct LongData {
    int32   varsize;     /* total size of this datum (varlena header) */
    int32   datasize;    /* size of the long value itself */
    Oid     longrelid;   /* Oid of the long-value relation, or invalid */
    Oid     rowid;       /* Oid of the referencing main data row */
    int16   rowattno;    /* attribute number in the main data row */
} LongData;
The type's input function is very simple. Allocate
sizeof(LongData) + strlen(input), set varsize to it,
datasize to strlen(input), and the rest to invalid and 0.
Then copy the input after the struct.
The type's output function decides, based on the longrelid,
what to do. If it's invalid, just output the bytes stored
after the struct (it must be a datum that resulted from an
input operation). If longrelid isn't invalid, it does an
index scan on that relation, fetching all tuples that match
rowid and rowattno. Since it knows the datasize, it doesn't
need them in the correct order; it can put them at the right
places into the allocated return buffer by their chunk_seq
(see the sketch after these points).
- For now (until we have enough experience to judge) I think
it would be better to forbid ALTER TABLE when LONG
attributes are involved. Sure, it must be implemented
eventually, but IMHO not in the first evaluation attempt.
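To make the input function and the out-of-order chunk placement of
the output path a bit more concrete, here is a minimal sketch. It is
only an illustration of the idea above, not backend code: the stand-in
typedefs, the use of malloc() instead of the backend allocator, the
value of LONG_CHUNK_SIZE and the function names are all assumptions.

#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;               /* stand-ins for the backend types */
typedef int          int32;
typedef short        int16;
#define InvalidOid   ((Oid) 0)

typedef struct LongData {
    int32 varsize;                      /* total datum size (varlena header) */
    int32 datasize;                     /* size of the long value itself */
    Oid   longrelid;                    /* long-value relation, or invalid */
    Oid   rowid;                        /* referencing main data row */
    int16 rowattno;                     /* attribute number in that row */
} LongData;

/* Input: wrap the text in a LongData header.  Nothing is stored
 * externally yet, so longrelid stays invalid and the raw text simply
 * follows the struct.
 */
static LongData *
long_in(const char *input)
{
    int32     len    = (int32) strlen(input);
    LongData *result = malloc(sizeof(LongData) + len);

    result->varsize   = (int32) sizeof(LongData) + len;
    result->datasize  = len;
    result->longrelid = InvalidOid;
    result->rowid     = InvalidOid;
    result->rowattno  = 0;
    memcpy((char *) result + sizeof(LongData), input, len);
    return result;
}

#define LONG_CHUNK_SIZE 4096            /* assumed payload bytes per chunk row */

/* Output, external case: a chunk fetched by the index scan is placed by
 * its chunk_seq alone, so the scan may return the chunks in any order.
 */
static void
long_place_chunk(char *result, int32 chunk_seq, const char *chunk, int32 len)
{
    memcpy(result + (size_t) chunk_seq * LONG_CHUNK_SIZE, chunk, len);
}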
Now how the data goes in and out of the longrel.
- On heap_insert(), we look for non-NULL LONG attributes in
the tuple. Whether there could be any can simply be seen by
looking at the rellongrelid in rd_rel. We fetch the value
either from the memory after LongData or by using the type
output function (for fetching it from the relation where it
is!). Then we simply break it up into single chunks and
store them with our tuple's information (see the
chunk-splitting sketch after these points). Now we need to
do something tricky - to shrink the main data tuple size, we
form a new heap tuple with the datums of the original one,
but we replace all LongData items we stored by faked ones,
where the varsize is sizeof(LongData) and all the other
information is set up appropriately. We append that faked
tuple instead, copy the resulting information into the
original tuple's header and throw the faked one away.
This is a point where I'm not totally sure. Could it
possibly be better or required to copy the entire faked
tuple over the one we should have stored? It could never
need more space, so that wouldn't be a problem.
- On heap_replace(), we check all LONG attributes whether they
are NULL or whether the information in longrelid, rowid and rowattno
doesn't match our rellongrelid, tupleid, and attno. In that
case this attribute might have old content in the
longrel, which we need to delete first.
The rest of the operation is exactly like for
heap_insert(), except when all the attribute's information did
match - then it's our own OLD value that wasn't changed, so
we can simply skip it; the existing data is still valid.
- heap_delete() is so simple that I don't explain it.
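Here is a rough sketch of the chunk splitting mentioned for
heap_insert() above. It is an illustration only: LONG_CHUNK_SIZE and
the store_long_chunk() helper (which would insert one row into the
"_LONG<tablename>" relation) are made-up names, not backend functions,
and plain C types stand in for Oid and int16.

#define LONG_CHUNK_SIZE 4096    /* assumed payload bytes per chunk row */

/* Hypothetical helper: insert one (rowid, rowattno, chunk_seq, chunk)
 * row into the long-value relation.  Not a real backend function.
 */
extern void store_long_chunk(unsigned int rowid, int rowattno,
                             int chunk_seq, const char *data, int len);

/* Break one long value into chunks and store them, roughly as
 * heap_insert() would do for each non-NULL LONG attribute.
 */
static void
store_long_value(unsigned int rowid, int rowattno,
                 const char *data, int datasize)
{
    int seq = 0;
    int off;

    for (off = 0; off < datasize; off += LONG_CHUNK_SIZE, seq++)
    {
        int len = datasize - off;

        if (len > LONG_CHUNK_SIZE)
            len = LONG_CHUNK_SIZE;
        store_long_chunk(rowid, rowattno, seq, data + off, len);
    }
}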
Now I hear you asking "how could this overhead be a win?" :-)
That's easy to explain. As long as you don't use a LONG
column in the WHERE clause, when will the data be fetched? At
the time it's finally clear that it's needed. That's when a
result tuple is sent to the client (type output) or when a
tuple resulting from INSERT ... SELECT should be stored.
Thus, all the tuples moving around in the execution tree,
getting joined together, abused by sorts and aggregates and
filtered out again, always contain the small LongData
struct, not the data itself. Weren't there recent reports
about too expensive sorts due to their huge size?
Another bonus would be this: What happens on an UPDATE to a
table having LONG attributes? If the attribute is not
modified, the OLD LongData will be found in the targetlist,
and we'll not waste any space by storing the same information
again. IIRC that was one of the biggest concerns about
storing huge data in tuples, but it disappeared without
leaving a trace - funny, eh?
It is so simple, that I fear I made some mistake somewhere.
But where?
===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 17:45:53 +0100 (MET)
Bruce Momjian wrote:
> In fact, you may decide to just extent varchar() and text to allow use
> of long tuples. Set the varlena VARLEN field to some special value like
> -1, and when you see that, you go to pg_long to get the data. Seems
> very easy. You could get fancy and keep data in the table in most
> cases, but if the tuple length exceeds 8k, go to all the varlena fields
> and start moving data into pg_long. That way, a table with three 4k
> columns could be stored without the user even knowing pg_long is
> involved, but for shorter tuples, they are stored in the main table.
So you realized most of my explanations yourself while I
wrote the last mail. :-)
No, I don't intend to change anything on the existing data
types. Where should the limit be for deciding to store
a datum in pg_long? Based on the datum's size? On the tuple
size and attribute order, taking one attribute after another
until the tuple becomes small enough to fit?
Maybe we make this mechanism so general that it is
automatically applied to ALL varsize attributes? We'll end up
with one big pg_long where 90+% of the database's content will
be stored.
But as soon as an attribute stored there is used in a WHERE
or is subject to be joined, you'll see why not (as said, this
type will NOT be enabled for indexing). The operation will
probably fall back to a seq-scan on the main table and then
the attribute must be fetched from pg_long with an index scan
on every single compare etc. - no, no, no.
And it will not be one single pg_long table. Instead it will
be a separate table per table that contains one or more LONG
attributes. IIRC, the TRUNCATE functionality was implemented
exactly to QUICKLY be able to wipe out the data from huge
relations AND get the disk space back. In the case of a
central pg_long, TRUNCATE would have to scan pg_long to mark
the tuples for deletion and vacuum must be run to really get
back the space. And a vacuum on this central pg_long would
probably take longer than the old DELETE, VACUUM of the now
truncated table itself. Again no, no, no.
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: Peter Eisentraut <e99re41@DoCS.UU.SE>
Date: Sat, 11 Dec 1999 17:55:24 +0100 (MET)
On Sat, 11 Dec 1999, Jan Wieck wrote:
> Well, the implementation details. I really would like some
> little comments to verify it's really complete before
> starting.
Before I start the nagging, please be aware that I'm not as smart as I
think I am. Long datatypes of some sort are clearly necessary -- more
power to you.
> - A new field "rellongrelid" type Oid is added to pg_class.
> It contains the Oid of the long-value relation or the
> invalid Oid for those who have no LONG attributes.
I have a mixed feeling about all these "sparse" fields everywhere. Doing
it completely formally, this seems to be a one-to-many relation, so you
should put the referencing field into the pg_long table or whatever
structure you use, pointing the other way around. This is probably slower,
but it's cleaner. As I mentioned earlier, this whole arrangement will
(hopefully) not be needed for all too long, and then we wouldn't want to
be stuck with it.
> - At CREATE TABLE, a long value relation named
> "_LONG<tablename>" is created for those tables who need it.
Please don't forget, this would require changes to pg_dump and psql. Also,
the COPY command might not be able to get away without changes, either.
In general, it wouldn't surprise me if some sections of the code would go
nuts about the news of tuples longer than BLCKSZ coming along. (Where
"nuts" is either 'truncation' or 'segfault'.)
I guess what I'm really saying is that I'd be totally in awe of you if you
could get all of this (and RI) done by Feb 1st. Good luck.
===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 18:04:04 +0100 (MET)
Peter Eisentraut wrote:
> On Sat, 11 Dec 1999, Bruce Momjian wrote:
>
> > In fact, you could get fancy and allow an update of a non-pg_long using
> > column to not change pg_long at all. Just keep the same value in the
> > column. If the transaction fails or succeeds, the pg_long is the same
> > for that tuple. Of course, because an update is a delete and then an
> > insert, that may be hard to do. For very long fields, it would be a win
> > for UPDATE. You certainly couldn't do that with chained tuples.
>
> While this is great and all, what will happen when long tuples finally get
> done? Will you remove this, or keep it, or just make LONG and TEXT
> equivalent? I fear that elaborate structures will be put in place here
> that might perhaps only be of use for one release cycle.
With the actual design explained, I don't think we are in
that much need for long tuples any more that we should
introduce all the problems of chaining tuples into the
vacuum, bufmgr, heapam, hio etc. code.
The rare cases where someone really needs larger tuples and
isn't able to use the proposed LONG data type can be
tackled by increasing BLCKSZ for that specific installation.
Isn't there a FAQ entry about "tuple size too big" pointing
to BLCKSZ? Haven't checked, but if there is, could that be the
reason why we get fewer requests on this item?
===
Subject: Re: [HACKERS] LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:13:15 -0500
>> I thought about the huge size variable text type a little
>> more. And I think I could get the following implementation
>> to work reliable for our upcoming release.
>>
>> For any relation, having one or more LONG data type
>> attributes, another relation (named pg_<something>) is
>> created, accessible only to superusers (and internal access
>> routines). All LONG data items are stored as a reference
>> into that relation, split up automatically so the chunks fit
>> into the installation specific tuple limit size. Items are
>> added/updated/removed totally transparent.
> Should we use large objects for this, and beef them up. Seems that
> would be a good way.
Yes, I think what Jan is describing *is* a large object, with the
slight change that he wants to put multiple objects into the same
behind-the-scenes relation. (That'd be a good change for regular
large objects as well ... it'd cut down the umpteen-thousand-files
problem.)
The two principal tricky areas would be (1) synchronization ---
having one hidden relation per primary relation might solve the
problems there, but I'm not sure about it; and (2) VACUUM.
But I don't really see why this would be either easier to do or
more reliable than storing multiple segments of a tuple in the
primary relation itself. And I don't much care for
===
Subject: Re: [HACKERS] LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:32:46 -0500
wieck@debis.com (Jan Wieck) writes:
> The rare cases, where someone really needs larger tuples and
> not beeing able to use the proposed LONG data type can be
> tackled by increasing BLKSIZE for this specific installation.
This would be a more convincing argument if we supported BLCKSZ
greater than 32K, but we don't.
I think we've speculated about having a compilation flag that gets
thrown to change page offsets from shorts to longs, thereby allowing
larger page sizes. But as Bruce was just pointing out, all of the
code depends in a fundamental way on the assumption that writing a
page is an atomic action. The larger the page size, the more likely
that you'll see broken tables caused by partial page writes. So
allowing BLCKSZ large enough to accommodate any tuple wouldn't be a
very good answer.
I think the proposed LONG type is a hack, and I'd rather see us solve
the problem correctly. ISTM that allowing a tuple to be divided into
"primary" and "continuation" tuples, all stored in the same relation
file, would be a much more general answer and not significantly harder
to implement than a LONG datatype as Jan is describing it.
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 19:29:59 +0100 (MET)
Peter Eisentraut wrote:
> Before I start the nagging, please be aware that I'm not as smart as I
> think I am. Long datatypes of some sort are clearly necessary -- more
> power to you.
So be it. It forces me to think it over again and points to
sections I might have forgotten so far. Also, it has happened
to me more than once that writing a totally OBVIOUS
answer triggered a better solution in my brain (dunno what's
wrong with that brain, but sometimes it needs to be shaken
well before use). Thus, any of your notes can help, and that
counts!
>
> > - A new field "rellongrelid" type Oid is added to pg_class.
> > It contains the Oid of the long-value relation or the
> > invalid Oid for those who have no LONG attributes.
>
> I have a mixed feeling about all these "sparse" fields everywhere. Doing
> it completely formally, this seems to be a one-to-many relation, so you
> should put the referencing field into the pg_long table or whatever
> structure you use, pointing the other way around. This is probably slower,
> but it's cleaner. As I mentioned earlier, this whole arrangement will
> (hopefully) not be needed for all too long, and then we wouldn't want to
> be stuck with it.
It's 4 bytes per RELATION in pg_class. As a side effect, the
information will be available at NO COST immediately after
heap_open() and in every place where a relation is accessed.
So it is the best place to put it.
>
> > - At CREATE TABLE, a long value relation named
> > "_LONG<tablename>" is created for those tables who need it.
>
> Please don't forget, this would require changes to pg_dump and psql. Also,
> the COPY command might not be able to get away without changes, either.
Oh yes, thanks. That was a point I forgot!
Psql must not list tables that begin with "_LONG" on the \d
request. Anything else should IMHO be transparent.
Pg_dump either uses a SELECT to build a script that INSERTs
the data via SQL, or uses COPY. In the SELECT/INSERT case, my
implementation would again be totally transparent and not
noticed by pg_dump, only that it must IGNORE "_LONG*"
relations and be aware that really big tuples can be sent,
but that's more a libpq question I think (which I already
checked, because the view/rule/PL combo I created to
demonstrate a >128K tuple was done through psql). AFAIK,
pg_dump doesn't use a binary COPY, and looking at the code
tells me that this is transparent too (due to the use of
type-specific input/output functions there).
All pg_dump would have to do is to ignore "_LONG*" relations
too.
The real problem is COPY. In the case of a COPY BINARY it
outputs the data portion of the fetched tuples directly. But
these will only contain the LongData headers, not the data
itself.
So at that point, COPY has to do the reverse process of
heap_insert(): rebuild a faked tuple where all the non-NULL
LONG values are placed in the representation they would have
after type input. Not a big deal, it must only be done with
the same care as the changes in heapam, so as not to leave
unfreed, leaked memory around.
> In general, it wouldn't surprise me if some sections of the code would go
> nuts about the news of tuples longer than BLCKSZ coming along. (Where
> "nuts" is either 'truncation' or 'segfault'.)
The only place where the size of a heap tuple is checked
(and where the "tuple size too big" message comes from)
is in hio.c, right before it is copied into the block. Up to
then, a tuple is NOT explicitly limited to any size.
So I would be glad to see crashes coming up from this change
(not after release - during BETA of course). It would help us
to get another existing bug out of the code.
> I guess what I'm really saying is that I'd be totally in awe of you if you
> could get all of this (and RI) done by Feb 1st. Good luck.
Thanks for the flowers, but "awe" is far too much - sorry.
Over the years I have had my hands on nearly every part of the
code involved in this. So I'm not a newbie in creating data
types, utility commands or doing syscat changes. The LONG
type I described will be the work of two or three nights.
I already intended to tackle the long tuples next. What was
missing was the idea of how to simply AVOID them. And I had
this idea just while answering a question about storing big
text files in the database on the [SQL] list - that woke me up.
In contrast to the RI stuff, this time I don't expect any
bugs, because there are absolutely no side effects I noticed
so far. On the RI stuff, we discussed tuple visibility during
concurrent transactions for weeks (if not months), and I
finally ran into exactly those problems anyway.
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:52:11 -0500
wieck@debis.com (Jan Wieck) writes:
> Another bonus would be this: What happens on an UPDATE to a
> table having LONG attributes? If the attribute is not
> modified, the OLD LongData will be found in the targetlist,
> and we'll not waste any space by storing the same information
> again.
Won't work. If you do that, you have several generations of the
"primary" tuple pointing at the same item in the "secondary" table.
There is room in the multiple primary tuples to keep track of their
committed/uncommitted status, but there won't be enough room to
keep track in the secondary table.
I think this can only work if there are exactly as many generations
of the LONG chunks in the secondary table as there are of the primary
tuple in the main table, and all of them have the same transaction
identification info stored in them as the corresponding copies of
the primary tuple have.
Among other things, this means that an update or delete *must* scan
through the tuple, find all the LONG fields, and go over to the
secondary table to mark all the LONG chunks as deleted by the current
xact, just the same as the primary tuple gets marked. This puts a
considerable crimp in your claim that it'd be more efficient than
a multiple-tuple-segment approach.
Of course, this could be worked around if the secondary table did *not*
use standard access methods (it could be more like an index, and rely on
the primary table for all xact status info). But that makes it look
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 22:02:43 +0100 (MET)
Tom Lane wrote:
> wieck@debis.com (Jan Wieck) writes:
> > Another bonus would be this: What happens on an UPDATE to a
> > table having LONG attributes? If the attribute is not
> > modified, the OLD LongData will be found in the targetlist,
> > and we'll not waste any space by storing the same information
> > again.
>
> Won't work. If you do that, you have several generations of the
> "primary" tuple pointing at the same item in the "secondary" table.
> There is room in the multiple primary tuples to keep track of their
> committed/uncommitted status, but there won't be enough room to
> keep track in the secondary table.
A really critical point, to think about in depth. And another
point I could have stumbled over.
But it would work anyway.
I assumed up to now that even under MVCC, and even when
reading dirty, there can be at most one single transaction
modifying one and the same tuple - no? Ignore all the rest
and forget all my comments if your answer is no. But please
tell me how something like RI should ever work RELIABLY in
such an environment. In fact, in that case I would
immediately stop all my effort in FOREIGN KEY, because it
would be a dead-end street - so I assume your answer is yes.
My concept, using regular heap access inside of heap access
to act on the "secondary" table, means stamping the same
current xact as for the "primary" table into the xmax of old,
and into the xmin of new tuples of the "secondary" table. And
it means that this operation appears to be atomic when living
in a locking environment.
The only thing I DON'T want to do is to stamp xmax and
create new instances in the "secondary" table if no update is
done to the value of the old LONG attribute. Any UPDATE
modifying the LONG value, and INSERT/DELETE of course, will
stamp this information and/or create new instances. So the
only thing (because the only difference) to worry about are
unstamped and uncreated instances in the "secondary" table -
right?
Since INSERT/DELETE always act synchronously with the "primary"
table, and so does an UPDATE that modifies the LONG, the only
thing left to worry about is an UPDATE without updating the LONG.
In this scenario, a "secondary" tuple of a not updated
"primary" LONG will have an older, but surely committed,
xmin. And its xmax will be either infinite, or aborted. So
it is visible - no other chance. And that's good, because for
the time being, the updater of the "primary" tuple does a
NOOP on the "secondary". And this (extended) part of the
"primary" tuple's information is absolutely unaffected,
regardless of whether its transaction commits or rolls back.
Well, your concern is again valid. This concept MIGHT need to
force a NON-MVCC locking scheme for "secondary" tables. But
as far as I learned from the RI stuff, that isn't a problem -
and therefore the current jackpot value is to be added to
Vadim's account.
===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 23:36:22 +0100 (MET)
Bruce Momjian wrote:
> > While this is great and all, what will happen when long tuples finally get
> > done? Will you remove this, or keep it, or just make LONG and TEXT
> > equivalent? I fear that elaborate structures will be put in place here
> > that might perhaps only be of use for one release cycle.
>
> I think the idea is that Jan's idea is better than chaining tuples.
Just as Tom already pointed out, it cannot completely replace
tuple chaining because of the atomicity assumption of the single
fsync(2) operation in the current code. Due to this, we cannot
get around the cases LONG will leave open by simply raising
BLCKSZ; we instead need to tackle that anyway.
But I believe LONG would still be something worth the effort.
It will lower the pressure on chained tuples, giving us more
time to build a really good solution, and I think LONG can
survive tuple chaining and live in coexistence with it. As
said in my last mail, I still believe that not touching LONG
values at UPDATE can avoid storing the same huge value again.
And that's a benefit tuple chaining will never give us.
Remember: If your only tool is a hammer, anything MUST look
like a nail. So why not provide a richer set of tools?
Jan
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 18:25:12 -0500 (EST)
> Bruce Momjian wrote:
>
> > > While this is great and all, what will happen when long tuples finally get
> > > done? Will you remove this, or keep it, or just make LONG and TEXT
> > > equivalent? I fear that elaborate structures will be put in place here
> > > that might perhaps only be of use for one release cycle.
> >
> > I think the idea is that Jan's idea is better than chaining tuples.
>
> Just as Tom already pointed out, it cannot completely replace
> tuple chaining because of the atomicy assumption of single
> fsync(2) operation in current code. Due to this, we cannot
> get around the cases LONG will leave open by simply raising
> BLKSIZE, we instead need to tackle that anyways.
Actually, in looking at the fsync() system call, it does write the
entire file descriptor before marking the transaction as complete, so
there is no hard reason not to raise it, but because the OS has to do
two reads to get 16k, I think we are better off keeping 8k as our base
block size.
Jan's idea is not to chain tuples, but to keep tuples at 8k, and instead
chain out individual fields into 8k tuple chunks, as needed. This seems
like it makes much more sense. It uses the database to recreate the
chains.
Let me mention a few things. First, I would like to avoid a LONG data
type if possible. Seems a new data type is just going to make things
more confusing for users.
My idea is a much more limited one than Jan's. It is to have a special
-1 varlena length when the data is chained on the long relation. I
would do:
-1|oid|attno
in 12 bytes. That way, you can pass this around as long as you want,
and just expand it in the varlena textout and compare routines when you
need the value. That prevents the tuples from changing size while being
processed. As far as I remember, there is no need to see the data in
the tuple except in the type comparison/output routines.
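A sketch of what such a 12-byte reference could look like; the struct
and field names here are made up for illustration, not an existing
PostgreSQL structure:

/* Replacement varlena contents for a value chained into the long
 * relation: 4 bytes length marker, 4 bytes oid, 4 bytes attribute
 * number = 12 bytes total.
 */
typedef struct LongRef
{
    int          vl_len;   /* set to the special value -1 */
    unsigned int refoid;   /* oid locating the long data (stand-in for Oid) */
    int          attno;    /* attribute number of the long value */
} LongRef;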
Now it would be nice if we could set the varlena length to 12, its
actual length, and then just somehow know that the varlena of 12 was a
long data entry. Our current varlena has a maximum length of 64k. I
wonder if we should grab a high bit of that to trigger long. I think we
may be able to do that, and just do an AND mask to remove the bit to see
the length. We don't need the high bit because our varlenas can't be
over 32k. We can modify VARSIZE to strip it off, and make another
macro like ISLONG to check for that high bit.
Seems this could be done with little code.
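Something like the following macros could do it - a sketch only,
assuming the 32-bit vl_len word of struct varlena; VARLONGFLAG is a
made-up name, and the real VARSIZE is of course defined elsewhere:

/* Stand-in for the backend's varlena header: a 32-bit length word
 * followed by the data bytes.
 */
struct varlena
{
    int  vl_len;
    char vl_dat[1];
};

/* Steal one high bit of the length word to flag "data lives in the
 * long relation".  Ordinary varlena values never come near this bit,
 * since the tuple size limit keeps their lengths tiny by comparison.
 */
#define VARLONGFLAG     0x80000000
#define VARSIZE(PTR)    (((struct varlena *) (PTR))->vl_len & ~VARLONGFLAG)
#define ISLONG(PTR)     ((((struct varlena *) (PTR))->vl_len & VARLONGFLAG) != 0)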
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 18:54:58 -0500 (EST)
> Maybe we make this mechanism so general that it is
> automatically applied to ALL varsize attributes? We'll end up
> with on big pg_long where 90+% of the databases content will
> be stored.
If most joins, comparisons are done on the 10% in the main table, so
much the better.
>
> But as soon as an attribute stored there is used in a WHERE
> or is subject to be joined, you'll see why not (as said, this
> type will NOT be enabled for indexing). The operation will
> probably fallback to a seq-scan on the main table and then
> the attribute must be fetched from pg_long with an index scan
> on every single compare etc. - no, no, no.
Let's face it. Most long tuples are store/retrieve, not ordered on or
used in WHERE clauses. Moving them out of the main table speeds up
things. It also prevents expansion of rows that never end up in the
result set.
In your system, a sequential scan of the table will pull in all this
stuff because you are going to expand the tuple. That could be very
costly. In my system, the expansion only happens on output if the LONG
field does not appear in the WHERE or ORDER BY clauses.
Also, my idea was to auto-enable longs for all varlena types, so short
values stay in the table, while longer chained ones that take up lots of
space and are expensive to expand are retrieved only when needed.
I see this as much better than chained tuples.
>
> And it will not be one single pg_long table. Instead it will
> be a separate table per table, that contains one or more LONG
> attributes. IIRC, the TRUNCATE functionality was implemented
> exactly to QUICKLY be able to whipe out the data from huge
> relations AND get the disk space back. In the case of a
> central pg_long, TRUNCATE would have to scan pg_long to mark
> the tuples for deletion and vacuum must be run to really get
> back the space. And a vacuum on this central pg_long would
> probably take longer than the old DELETE, VACUUM of the now
> truncated table itself. Again no, no, no.
>
I guess a separate pg_long_ per table would be good.
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 19:01:49 -0500 (EST)
> Maybe we make this mechanism so general that it is
> automatically applied to ALL varsize attributes? We'll end up
> with on big pg_long where 90+% of the databases content will
> be stored.
>
> But as soon as an attribute stored there is used in a WHERE
> or is subject to be joined, you'll see why not (as said, this
> type will NOT be enabled for indexing). The operation will
> probably fallback to a seq-scan on the main table and then
> the attribute must be fetched from pg_long with an index scan
> on every single compare etc. - no, no, no.
A field value over 8k is not going to be something you join on,
restrict, or order by in most cases. It is going to be some long
narrative or field that is just for output to the user, usually not used
to process the query.
===