Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 10:20:50 -0500 (EST)
> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up. Seems that
> > would be a good way. I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer to
> > access.
>
> I thought to use a regular table. Of course, it will eat
> buffers, but managing external files or even large objects
> for it IMHO isn't that simple, if you take transaction
> commit/abort and MVCC problematic into account too. And IMHO
> this is something that must be covered, because I meant to
> create a DATATYPE that can be used as a replacement for TEXT
> if that's too small, so it must behave as a regular datatype,
> without any restrictions WRT beeing able to rollback etc.
OK, I have thought about your idea, and I like it very much. In fact,
it borders on genius.
Our/my original idea was to chain tuples in the main table. That has
some disadvantages:
More complex tuple handling of chained tuples
Requires more tuple storage overhead for housekeeping of chaining data
Sequential scan of table has to read those large fields
Vacuum has to keep the tuples chained as they are moved
Your system would be:
CREATE TABLE pg_long (
    refoid  OID,
    attno   int2,
    line    int4,
    attdata VARCHAR(8000)
);
CREATE INDEX pg_long_idx ON pg_long (refoid, attno, line);
You keep the long data out of the table. When updating the tuple, you
mark the pg_long tuples as superseded with the transaction id, and just
keep going. No need to do anything special. Vacuum will remove
superseded tuples automatically while processing pg_long if the
transaction was committed.
The pg_long_idx index will allow rapid access to tuple long data.
This approach seems better than tuple chaining because it uses our
existing code more efficiently. You keep long data out of the main
table, and allow use of existing tools to access the long data.
In fact, you may decide to just extend varchar() and text to allow use
of long tuples. Set the varlena VARLEN field to some special value like
-1, and when you see that, you go to pg_long to get the data. Seems
very easy. You could get fancy and keep data in the table in most
cases, but if the tuple length exceeds 8k, go to all the varlena fields
and start moving data into pg_long. That way, a table with three 4k
columns could be stored without the user even knowing pg_long is
involved, while shorter tuples are stored entirely in the main table.
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 10:38:28 -0500 (EST)
> I thought to use a regular table. Of course, it will eat
> buffers, but managing external files or even large objects
> for it IMHO isn't that simple, if you take transaction
> commit/abort and MVCC problematic into account too. And IMHO
> this is something that must be covered, because I meant to
> create a DATATYPE that can be used as a replacement for TEXT
> if that's too small, so it must behave as a regular datatype,
> without any restrictions WRT beeing able to rollback etc.
In fact, you could get fancy and allow an update of a column that
doesn't use pg_long to leave pg_long untouched. Just keep the same
value in the column. Whether the transaction fails or succeeds, the
pg_long data is the same for that tuple. Of course, because an update
is a delete and then an insert, that may be hard to do. For very long
fields, it would be a win for UPDATE. You certainly couldn't do that
with chained tuples.
===
Subject: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 17:21:28 +0100 (MET)
I wrote:
> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up. Seems that
> > would be a good way. I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer to
> > access.
>
> I thought to use a regular table. Of course, it will eat
> buffers ...
When looking at my actual implementation concept, I'm not
sure if it will win or lose compared with text itself!
Amazing, but I think it could win already at relatively small
text sizes (1-2K is IMHO small compared to what this type
could store).
Well, the implementation details. I really would like some
little comments to verify it's really complete before
starting.
- A new field "rellongrelid" type Oid is added to pg_class.
It contains the Oid of the long-value relation or the
invalid Oid for those who have no LONG attributes.
- At CREATE TABLE, a long value relation named
"_LONG<tablename>" is created for those tables who need it.
It is of course dropped and truncated appropriately. The schema
of this table is:
    rowid       Oid,     -- oid of our main data row
    rowattno    int2,    -- the attribute number in main data
    chunk_seq   int4,    -- the part number of this data chunk
    chunk       text     -- the content of this data chunk
There is a unique index defined on (rowid, rowattno).
- The new data type is of variable size with the following
header:
typedef struct LongData {
    int32   varsize;     /* total size of this datum (varlena header) */
    int32   datasize;    /* size of the long value itself */
    Oid     longrelid;   /* Oid of the long-value relation, or invalid */
    Oid     rowid;       /* Oid of the referencing main data row */
    int16   rowattno;    /* attribute number in the main data row */
} LongData;
The type's input function is very simple. Allocate
sizeof(LongData) + strlen(input), set varsize to it,
datasize to strlen(input), and the rest to invalid and 0.
Then copy the input after the struct.
The type's output function decides, based on the longrelid,
what to do. If it's invalid, just output the bytes stored
after the struct (it must be a datum that resulted from an
input operation). If longrelid isn't invalid, it does an
index scan on that relation, fetching all tuples that match
rowid and rowattno. Since it knows the datasize, it doesn't
need them in the correct order; it can put them at the right
places into the allocated return buffer by their chunk_seq
(see the sketch after these points).
- For now (until we have enough experience to judge) I think
it would be better to forbid ALTER TABLE when LONG
attributes are involved. Sure, it must be implemented
eventually, but IMHO not in the first evaluation attempt.
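To make the input function and the out-of-order chunk placement of
the output path a bit more concrete, here is a minimal sketch. It is
only an illustration of the idea above, not backend code: the stand-in
typedefs, the use of malloc() instead of the backend allocator, the
value of LONG_CHUNK_SIZE and the function names are all assumptions.

#include <stdlib.h>
#include <string.h>

typedef unsigned int Oid;               /* stand-ins for the backend types */
typedef int          int32;
typedef short        int16;
#define InvalidOid   ((Oid) 0)

typedef struct LongData {
    int32 varsize;                      /* total datum size (varlena header) */
    int32 datasize;                     /* size of the long value itself */
    Oid   longrelid;                    /* long-value relation, or invalid */
    Oid   rowid;                        /* referencing main data row */
    int16 rowattno;                     /* attribute number in that row */
} LongData;

/* Input: wrap the text in a LongData header.  Nothing is stored
 * externally yet, so longrelid stays invalid and the raw text simply
 * follows the struct.
 */
static LongData *
long_in(const char *input)
{
    int32     len    = (int32) strlen(input);
    LongData *result = malloc(sizeof(LongData) + len);

    result->varsize   = (int32) sizeof(LongData) + len;
    result->datasize  = len;
    result->longrelid = InvalidOid;
    result->rowid     = InvalidOid;
    result->rowattno  = 0;
    memcpy((char *) result + sizeof(LongData), input, len);
    return result;
}

#define LONG_CHUNK_SIZE 4096            /* assumed payload bytes per chunk row */

/* Output, external case: a chunk fetched by the index scan is placed by
 * its chunk_seq alone, so the scan may return the chunks in any order.
 */
static void
long_place_chunk(char *result, int32 chunk_seq, const char *chunk, int32 len)
{
    memcpy(result + (size_t) chunk_seq * LONG_CHUNK_SIZE, chunk, len);
}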
Now how the data goes in and out of the longrel.
- On heap_insert(), we look for non-NULL LONG attributes in
the tuple. Whether there could be any can simply be seen by
looking at the rellongrelid in rd_rel. We fetch the value
either from the memory after LongData or by using the type
output function (for fetching it from the relation where it
is!). Then we simply break it up into single chunks and
store them with our tuple's information (see the
chunk-splitting sketch after these points). Now we need to
do something tricky - to shrink the main data tuple size, we
form a new heap tuple with the datums of the original one,
but we replace all LongData items we stored by faked ones,
where the varsize is sizeof(LongData) and all the other
information is set up appropriately. We append that faked
tuple instead, copy the resulting information into the
original tuple's header and throw the faked one away.
This is a point where I'm not totally sure. Could it
possibly be better or required to copy the entire faked
tuple over the one we should have stored? It could never
need more space, so that wouldn't be a problem.
- On heap_replace(), we check all LONG attributes whether they
are NULL or whether the information in longrelid, rowid and rowattno
doesn't match our rellongrelid, tupleid, and attno. In that
case this attribute might have old content in the
longrel, which we need to delete first.
The rest of the operation is exactly like for
heap_insert(), except when all the attribute's information did
match - then it's our own OLD value that wasn't changed, so
we can simply skip it; the existing data is still valid.
- heap_delete() is so simple that I don't explain it.
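Here is a rough sketch of the chunk splitting mentioned for
heap_insert() above. It is an illustration only: LONG_CHUNK_SIZE and
the store_long_chunk() helper (which would insert one row into the
"_LONG<tablename>" relation) are made-up names, not backend functions,
and plain C types stand in for Oid and int16.

#define LONG_CHUNK_SIZE 4096    /* assumed payload bytes per chunk row */

/* Hypothetical helper: insert one (rowid, rowattno, chunk_seq, chunk)
 * row into the long-value relation.  Not a real backend function.
 */
extern void store_long_chunk(unsigned int rowid, int rowattno,
                             int chunk_seq, const char *data, int len);

/* Break one long value into chunks and store them, roughly as
 * heap_insert() would do for each non-NULL LONG attribute.
 */
static void
store_long_value(unsigned int rowid, int rowattno,
                 const char *data, int datasize)
{
    int seq = 0;
    int off;

    for (off = 0; off < datasize; off += LONG_CHUNK_SIZE, seq++)
    {
        int len = datasize - off;

        if (len > LONG_CHUNK_SIZE)
            len = LONG_CHUNK_SIZE;
        store_long_chunk(rowid, rowattno, seq, data + off, len);
    }
}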
Now I hear you asking "how could this overhead be a win?" :-)
That's easy to explain. As long as you don't use a LONG
column in the WHERE clause, when will the data be fetched? At
the time it's finally clear that it's needed. That's when a
result tuple is sent to the client (type output) or when a
tuple resulting from INSERT ... SELECT should be stored.
Thus, all the tuples moving around in the execution tree,
getting joined together, abused by sorts and aggregates and
filtered out again, always contain the small LongData
struct, not the data itself. Weren't there recent reports
about too expensive sorts due to their huge size?
Another bonus would be this: What happens on an UPDATE to a
table having LONG attributes? If the attribute is not
modified, the OLD LongData will be found in the targetlist,
and we'll not waste any space by storing the same information
again. IIRC that was one of the biggest concerns about
storing huge data in tuples, but it disappeared without
leaving a trace - funny, eh?
It is so simple, that I fear I made some mistake somewhere.
But where?
===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 17:45:53 +0100 (MET)
Bruce Momjian wrote:
> In fact, you may decide to just extent varchar() and text to allow use
> of long tuples. Set the varlena VARLEN field to some special value like
> -1, and when you see that, you go to pg_long to get the data. Seems
> very easy. You could get fancy and keep data in the table in most
> cases, but if the tuple length exceeds 8k, go to all the varlena fields
> and start moving data into pg_long. That way, a table with three 4k
> columns could be stored without the user even knowing pg_long is
> involved, but for shorter tuples, they are stored in the main table.
So you realized most of my explanations yourself while I
wrote the last mail. :-)
No, I don't intend to change anything on the existing data
types. Where should the limit be for deciding to store
a datum in pg_long? Based on the datum's size? On the tuple
size and attribute order, taking one attribute after another
until the tuple becomes small enough to fit?
Maybe we make this mechanism so general that it is
automatically applied to ALL varsize attributes? We'll end up
with one big pg_long where 90+% of the database's content will
be stored.
But as soon as an attribute stored there is used in a WHERE
or is subject to be joined, you'll see why not (as said, this
type will NOT be enabled for indexing). The operation will
probably fall back to a seq-scan on the main table and then
the attribute must be fetched from pg_long with an index scan
on every single compare etc. - no, no, no.
And it will not be one single pg_long table. Instead it will
be a separate table per table that contains one or more LONG
attributes. IIRC, the TRUNCATE functionality was implemented
exactly to QUICKLY be able to wipe out the data from huge
relations AND get the disk space back. In the case of a
central pg_long, TRUNCATE would have to scan pg_long to mark
the tuples for deletion and vacuum must be run to really get
back the space. And a vacuum on this central pg_long would
probably take longer than the old DELETE, VACUUM of the now
truncated table itself. Again no, no, no.
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: Peter Eisentraut <e99re41@DoCS.UU.SE>
Date: Sat, 11 Dec 1999 17:55:24 +0100 (MET)
On Sat, 11 Dec 1999, Jan Wieck wrote:
> Well, the implementation details. I really would like some
> little comments to verify it's really complete before
> starting.
Before I start the nagging, please be aware that I'm not as smart as I
think I am. Long datatypes of some sort are clearly necessary -- more
power to you.
> - A new field "rellongrelid" type Oid is added to pg_class.
> It contains the Oid of the long-value relation or the
> invalid Oid for those who have no LONG attributes.
I have a mixed feeling about all these "sparse" fields everywhere. Doing
it completely formally, this seems to be a one-to-many relation, so you
should put the referencing field into the pg_long table or whatever
structure you use, pointing the other way around. This is probably slower,
but it's cleaner. As I mentioned earlier, this whole arrangement will
(hopefully) not be needed for all too long, and then we wouldn't want to
be stuck with it.
> - At CREATE TABLE, a long value relation named
> "_LONG<tablename>" is created for those tables who need it.
Please don't forget, this would require changes to pg_dump and psql. Also,
the COPY command might not be able to get away without changes, either.
In general, it wouldn't surprise me if some sections of the code would go
nuts about the news of tuples longer than BLCKSZ coming along. (Where
"nuts" is either 'truncation' or 'segfault'.)
I guess what I'm really saying is that I'd be totally in awe of you if you
could get all of this (and RI) done by Feb 1st. Good luck.
===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 18:04:04 +0100 (MET)
Peter Eisentraut wrote:
> On Sat, 11 Dec 1999, Bruce Momjian wrote:
>
> > In fact, you could get fancy and allow an update of a non-pg_long using
> > column to not change pg_long at all. Just keep the same value in the
> > column. If the transaction fails or succeeds, the pg_long is the same
> > for that tuple. Of course, because an update is a delete and then an
> > insert, that may be hard to do. For very long fields, it would be a win
> > for UPDATE. You certainly couldn't do that with chained tuples.
>
> While this is great and all, what will happen when long tuples finally get
> done? Will you remove this, or keep it, or just make LONG and TEXT
> equivalent? I fear that elaborate structures will be put in place here
> that might perhaps only be of use for one release cycle.
With the actual design explained, I don't think we are in
that much need for long tuples any more that we should
introduce all the problems of chaining tuples into the
vacuum, bufmgr, heapam, hio etc. code.
The rare cases where someone really needs larger tuples and
isn't able to use the proposed LONG data type can be
tackled by increasing BLCKSZ for that specific installation.
Isn't there a FAQ entry about "tuple size too big" pointing
to BLCKSZ? Haven't checked, but if there is, could that be the
reason why we get fewer requests on this item?
===
Subject: Re: [HACKERS] LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:13:15 -0500
>> I thought about the huge size variable text type a little
>> more. And I think I could get the following implementation
>> to work reliable for our upcoming release.
>>
>> For any relation, having one or more LONG data type
>> attributes, another relation (named pg_<something>) is
>> created, accessible only to superusers (and internal access
>> routines). All LONG data items are stored as a reference
>> into that relation, split up automatically so the chunks fit
>> into the installation specific tuple limit size. Items are
>> added/updated/removed totally transparent.
> Should we use large objects for this, and beef them up. Seems that
> would be a good way.
Yes, I think what Jan is describing *is* a large object, with the
slight change that he wants to put multiple objects into the same
behind-the-scenes relation. (That'd be a good change for regular
large objects as well ... it'd cut down the umpteen-thousand-files
problem.)
The two principal tricky areas would be (1) synchronization ---
having one hidden relation per primary relation might solve the
problems there, but I'm not sure about it; and (2) VACUUM.
But I don't really see why this would be either easier to do or
more reliable than storing multiple segments of a tuple in the
primary relation itself. And I don't much care for
===
Subject: Re: [HACKERS] LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:32:46 -0500
wieck@debis.com (Jan Wieck) writes:
> The rare cases, where someone really needs larger tuples and
> not beeing able to use the proposed LONG data type can be
> tackled by increasing BLKSIZE for this specific installation.
This would be a more convincing argument if we supported BLCKSZ
greater than 32K, but we don't.
I think we've speculated about having a compilation flag that gets
thrown to change page offsets from shorts to longs, thereby allowing
larger page sizes. But as Bruce was just pointing out, all of the
code depends in a fundamental way on the assumption that writing a
page is an atomic action. The larger the page size, the more likely
that you'll see broken tables caused by partial page writes. So
allowing BLCKSZ large enough to accommodate any tuple wouldn't be a
very good answer.
I think the proposed LONG type is a hack, and I'd rather see us solve
the problem correctly. ISTM that allowing a tuple to be divided into
"primary" and "continuation" tuples, all stored in the same relation
file, would be a much more general answer and not significantly harder
to implement than a LONG datatype as Jan is describing it.
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 19:29:59 +0100 (MET)
Peter Eisentraut wrote:
> Before I start the nagging, please be aware that I'm not as smart as I
> think I am. Long datatypes of some sort are clearly necessary -- more
> power to you.
So be it. It forces me to think it over again and points to
sections I might have forgotten so far. Also, it has happened
to me more than once that writing a totally OBVIOUS
answer triggered a better solution in my brain (dunno what's
wrong with that brain, but sometimes it needs to be shaken
well before use). Thus, any of your notes can help, and that
counts!
>
> > - A new field "rellongrelid" type Oid is added to pg_class.
> > It contains the Oid of the long-value relation or the
> > invalid Oid for those who have no LONG attributes.
>
> I have a mixed feeling about all these "sparse" fields everywhere. Doing
> it completely formally, this seems to be a one-to-many relation, so you
> should put the referencing field into the pg_long table or whatever
> structure you use, pointing the other way around. This is probably slower,
> but it's cleaner. As I mentioned earlier, this whole arrangement will
> (hopefully) not be needed for all too long, and then we wouldn't want to
> be stuck with it.
It's 4 bytes per RELATION in pg_class. As a side effect, the
information will be available at NO COST immediately after
heap_open() and in every place where a relation is accessed.
So it is the best place to put it.
>
> > - At CREATE TABLE, a long value relation named
> > "_LONG<tablename>" is created for those tables who need it.
>
> Please don't forget, this would require changes to pg_dump and psql. Also,
> the COPY command might not be able to get away without changes, either.
Oh yes, thanks. That was a point I forgot!
Psql must not list tables that begin with "_LONG" on the \d
request. Anything else should IMHO be transparent.
Pg_dump either uses a SELECT to build a script that INSERTs
the data via SQL, or uses COPY. In the SELECT/INSERT case, my
implementation would again be totally transparent and not
noticed by pg_dump, only that it must IGNORE "_LONG*"
relations and be aware that really big tuples can be sent,
but that's more a libpq question I think (which I already
checked, because the view/rule/PL combo I created to
demonstrate a >128K tuple was done through psql). AFAIK,
pg_dump doesn't use a binary COPY, and looking at the code
tells me that this is transparent too (due to the use of
type-specific input/output functions there).
All pg_dump would have to do is to ignore "_LONG*" relations
too.
The real problem is COPY. In the case of a COPY BINARY it
outputs the data portion of the fetched tuples directly. But
these will only contain the LongData headers, not the data
itself.
So at that point, COPY has to do the reverse process of
heap_insert(): rebuild a faked tuple where all the non-NULL
LONG values are placed in the representation they would have
after type input. Not a big deal, it must only be done with
the same care as the changes in heapam, so as not to leave
unfreed, leaked memory around.
> In general, it wouldn't surprise me if some sections of the code would go
> nuts about the news of tuples longer than BLCKSZ coming along. (Where
> "nuts" is either 'truncation' or 'segfault'.)
The only place where the size of a heap tuple is checked
(and where the "tuple size too big" message comes from)
is in hio.c, right before it is copied into the block. Up to
then, a tuple is NOT explicitly limited to any size.
So I would be glad to see crashes coming up from this change
(not after release - during BETA of course). It would help us
to get another existing bug out of the code.
> I guess what I'm really saying is that I'd be totally in awe of you if you
> could get all of this (and RI) done by Feb 1st. Good luck.
Thanks for the flowers, but "awe" is far too much - sorry.
Over the years I have had my hands on nearly every part of the
code involved in this. So I'm not a newbie in creating data
types, utility commands or doing syscat changes. The LONG
type I described will be the work of two or three nights.
I already intended to tackle the long tuples next. What was
missing was the idea of how to simply AVOID them. And I had
this idea just while answering a question about storing big
text files in the database on the [SQL] list - that woke me up.
In contrast to the RI stuff, this time I don't expect any
bugs, because there are absolutely no side effects I noticed
so far. On the RI stuff, we discussed tuple visibility during
concurrent transactions for weeks (if not months), and I
finally ran into exactly those problems anyway.
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:52:11 -0500
wieck@debis.com (Jan Wieck) writes:
> Another bonus would be this: What happens on an UPDATE to a
> table having LONG attributes? If the attribute is not
> modified, the OLD LongData will be found in the targetlist,
> and we'll not waste any space by storing the same information
> again.
Won't work. If you do that, you have several generations of the
"primary" tuple pointing at the same item in the "secondary" table.
There is room in the multiple primary tuples to keep track of their
committed/uncommitted status, but there won't be enough room to
keep track in the secondary table.
I think this can only work if there are exactly as many generations
of the LONG chunks in the secondary table as there are of the primary
tuple in the main table, and all of them have the same transaction
identification info stored in them as the corresponding copies of
the primary tuple have.
Among other things, this means that an update or delete *must* scan
through the tuple, find all the LONG fields, and go over to the
secondary table to mark all the LONG chunks as deleted by the current
xact, just the same as the primary tuple gets marked. This puts a
considerable crimp in your claim that it'd be more efficient than
a multiple-tuple-segment approach.
Of course, this could be worked around if the secondary table did *not*
use standard access methods (it could be more like an index, and rely on
the primary table for all xact status info). But that makes it look
===
Subject: Re: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 22:02:43 +0100 (MET)
Tom Lane wrote:
> wieck@debis.com (Jan Wieck) writes:
> > Another bonus would be this: What happens on an UPDATE to a
> > table having LONG attributes? If the attribute is not
> > modified, the OLD LongData will be found in the targetlist,
> > and we'll not waste any space by storing the same information
> > again.
>
> Won't work. If you do that, you have several generations of the
> "primary" tuple pointing at the same item in the "secondary" table.
> There is room in the multiple primary tuples to keep track of their
> committed/uncommitted status, but there won't be enough room to
> keep track in the secondary table.
A really critical point, to think about in depth. And another
point I could have stumbled over.
But it would work anyway.
I assumed up to now that even under MVCC, and even when
reading dirty, there can be at most one single transaction
modifying one and the same tuple - no? Ignore all the rest
and forget all my comments if your answer is no. But please
tell me how something like RI should ever work RELIABLY in
such an environment. In fact, in that case I would
immediately stop all my effort in FOREIGN KEY, because it
would be a dead-end street - so I assume your answer is yes.
My concept, using regular heap access inside of heap access
to act on the "secondary" table, means stamping the same
current xact as for the "primary" table into the xmax of old,
and into the xmin of new tuples of the "secondary" table. And
it means that this operation appears to be atomic when living
in a locking environment.
The only thing I DON'T want to do is to stamp xmax and
create new instances in the "secondary" table if no update is
done to the value of the old LONG attribute. Any UPDATE
modifying the LONG value, and INSERT/DELETE of course, will
stamp this information and/or create new instances. So the
only thing (because the only difference) to worry about are
unstamped and uncreated instances in the "secondary" table -
right?
Since INSERT/DELETE always act synchronously with the "primary"
table, and so does an UPDATE that modifies the LONG, the only
thing left to worry about is an UPDATE without updating the LONG.
In this scenario, a "secondary" tuple of a not updated
"primary" LONG will have an older, but surely committed,
xmin. And its xmax will be either infinite, or aborted. So
it is visible - no other chance. And that's good, because for
the time being, the updater of the "primary" tuple does a
NOOP on the "secondary". And this (extended) part of the
"primary" tuple's information is absolutely unaffected,
regardless of whether its transaction commits or rolls back.
Well, your concern is again valid. This concept MIGHT need to
force a NON-MVCC locking scheme for "secondary" tables. But
as far as I learned from the RI stuff, that isn't a problem -
and therefore the current jackpot value is to be added to
Vadim's account.
===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 23:36:22 +0100 (MET)
Bruce Momjian wrote:
> > While this is great and all, what will happen when long tuples finally get
> > done? Will you remove this, or keep it, or just make LONG and TEXT
> > equivalent? I fear that elaborate structures will be put in place here
> > that might perhaps only be of use for one release cycle.
>
> I think the idea is that Jan's idea is better than chaining tuples.
Just as Tom already pointed out, it cannot completely replace
tuple chaining because of the atomicity assumption of the single
fsync(2) operation in the current code. Due to this, we cannot
get around the cases LONG will leave open by simply raising
BLCKSZ; we instead need to tackle that anyway.
But I believe LONG would still be something worth the effort.
It will lower the pressure on chained tuples, giving us more
time to build a really good solution, and I think LONG can
survive tuple chaining and live in coexistence with it. As
said in my last mail, I still believe that not touching LONG
values at UPDATE can avoid storing the same huge value again.
And that's a benefit tuple chaining will never give us.
Remember: If your only tool is a hammer, anything MUST look
like a nail. So why not provide a richer set of tools?
Jan
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 18:25:12 -0500 (EST)
> Bruce Momjian wrote:
>
> > > While this is great and all, what will happen when long tuples finally get
> > > done? Will you remove this, or keep it, or just make LONG and TEXT
> > > equivalent? I fear that elaborate structures will be put in place here
> > > that might perhaps only be of use for one release cycle.
> >
> > I think the idea is that Jan's idea is better than chaining tuples.
>
> Just as Tom already pointed out, it cannot completely replace
> tuple chaining because of the atomicy assumption of single
> fsync(2) operation in current code. Due to this, we cannot
> get around the cases LONG will leave open by simply raising
> BLKSIZE, we instead need to tackle that anyways.
Actually, in looking at the fsync() system call, it does write the
entire file descriptor before marking the transaction as complete, so
there is no hard reason not to raise it, but because the OS has to do
two reads to get 16k, I think we are better off keeping 8k as our base
block size.
Jan's idea is not to chain tuples, but to keep tuples at 8k, and instead
chain out individual fields into 8k tuple chunks, as needed. This seems
like it makes much more sense. It uses the database to recreate the
chains.
Let me mention a few things. First, I would like to avoid a LONG data
type if possible. Seems a new data type is just going to make things
more confusing for users.
My idea is a much more limited one than Jan's. It is to have a special
-1 varlena length when the data is chained on the long relation. I
would do:
-1|oid|attno
in 12 bytes. That way, you can pass this around as long as you want,
and just expand it in the varlena textout and compare routines when you
need the value. That prevents the tuples from changing size while being
processed. As far as I remember, there is no need to see the data in
the tuple except in the type comparison/output routines.
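A sketch of what such a 12-byte reference could look like; the struct
and field names here are made up for illustration, not an existing
PostgreSQL structure:

/* Replacement varlena contents for a value chained into the long
 * relation: 4 bytes length marker, 4 bytes oid, 4 bytes attribute
 * number = 12 bytes total.
 */
typedef struct LongRef
{
    int          vl_len;   /* set to the special value -1 */
    unsigned int refoid;   /* oid locating the long data (stand-in for Oid) */
    int          attno;    /* attribute number of the long value */
} LongRef;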
Now it would be nice if we could set the varlena length to 12, its
actual length, and then just somehow know that the varlena of 12 was a
long data entry. Our current varlena has a maximum length of 64k. I
wonder if we should grab a high bit of that to trigger long. I think we
may be able to do that, and just do an AND mask to remove the bit to see
the length. We don't need the high bit because our varlenas can't be
over 32k. We can modify VARSIZE to strip it off, and make another
macro like ISLONG to check for that high bit.
Seems this could be done with little code.
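Something like the following macros could do it - a sketch only,
assuming the 32-bit vl_len word of struct varlena; VARLONGFLAG is a
made-up name, and the real VARSIZE is of course defined elsewhere:

/* Stand-in for the backend's varlena header: a 32-bit length word
 * followed by the data bytes.
 */
struct varlena
{
    int  vl_len;
    char vl_dat[1];
};

/* Steal one high bit of the length word to flag "data lives in the
 * long relation".  Ordinary varlena values never come near this bit,
 * since the tuple size limit keeps their lengths tiny by comparison.
 */
#define VARLONGFLAG     0x80000000
#define VARSIZE(PTR)    (((struct varlena *) (PTR))->vl_len & ~VARLONGFLAG)
#define ISLONG(PTR)     ((((struct varlena *) (PTR))->vl_len & VARLONGFLAG) != 0)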
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 18:54:58 -0500 (EST)
> Maybe we make this mechanism so general that it is
> automatically applied to ALL varsize attributes? We'll end up
> with on big pg_long where 90+% of the databases content will
> be stored.
If most joins, comparisons are done on the 10% in the main table, so
much the better.
>
> But as soon as an attribute stored there is used in a WHERE
> or is subject to be joined, you'll see why not (as said, this
> type will NOT be enabled for indexing). The operation will
> probably fallback to a seq-scan on the main table and then
> the attribute must be fetched from pg_long with an index scan
> on every single compare etc. - no, no, no.
Let's face it. Most long tuples are store/retrieve, not ordered on or
used in WHERE clauses. Moving them out of the main table speeds up
things. It also prevents expansion of rows that never end up in the
result set.
In your system, a sequential scan of the table will pull in all this
stuff because you are going to expand the tuple. That could be very
costly. In my system, the expansion only happens on output if the LONG
field does not appear in the WHERE or ORDER BY clauses.
Also, my idea was to auto-enable longs for all varlena types, so short
values stay in the table, while longer chained ones that take up lots of
space and are expensive to expand are retrieved only when needed.
I see this as much better than chained tuples.
>
> And it will not be one single pg_long table. Instead it will
> be a separate table per table, that contains one or more LONG
> attributes. IIRC, the TRUNCATE functionality was implemented
> exactly to QUICKLY be able to whipe out the data from huge
> relations AND get the disk space back. In the case of a
> central pg_long, TRUNCATE would have to scan pg_long to mark
> the tuples for deletion and vacuum must be run to really get
> back the space. And a vacuum on this central pg_long would
> probably take longer than the old DELETE, VACUUM of the now
> truncated table itself. Again no, no, no.
>
I guess a separate pg_long_ per table would be good.
===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 19:01:49 -0500 (EST)
> Maybe we make this mechanism so general that it is
> automatically applied to ALL varsize attributes? We'll end up
> with on big pg_long where 90+% of the databases content will
> be stored.
>
> But as soon as an attribute stored there is used in a WHERE
> or is subject to be joined, you'll see why not (as said, this
> type will NOT be enabled for indexing). The operation will
> probably fallback to a seq-scan on the main table and then
> the attribute must be fetched from pg_long with an index scan
> on every single compare etc. - no, no, no.
A field value over 8k is not going to be something you join on,
restrict, or order by in most cases. It is going to be some long
narrative or field that is just for output to the user, usually not used
to process the query.
===