This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 10:20:50 -0500 (EST)

> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up. Seems that
> > would be a good way. I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer
> > to access.
>
> I thought to use a regular table. Of course, it will eat
> buffers, but managing external files or even large objects
> for it IMHO isn't that simple, if you take transaction
> commit/abort and MVCC problems into account too. And IMHO
> this is something that must be covered, because I meant to
> create a DATATYPE that can be used as a replacement for TEXT
> if that's too small, so it must behave as a regular datatype,
> without any restrictions WRT being able to rollback etc.

OK, I have thought about your idea, and I like it very much. In fact,
it borders on genius.

Our/my original idea was to chain tuples in the main table. That has
some disadvantages:

- More complex tuple handling of chained tuples
- Requires more tuple storage overhead for housekeeping of chaining data
- Sequential scan of the table has to read those large fields
- Vacuum has to keep the tuples chained as they are moved

Your system would be:

	CREATE TABLE pg_long (
		refoid	OID,
		attno	int2,
		line	int4,
		attdata	VARCHAR(8000)
	);

	CREATE INDEX pg_long_idx ON pg_long (refoid, attno, line);

You keep the long data out of the table. When updating the tuple, you
mark the pg_long tuples as superseded with the transaction id, and
just keep going. No need to do anything special. Vacuum will remove
superseded tuples automatically while processing pg_long if the
transaction was committed. The pg_long_idx index will allow rapid
access to tuple long data.
This approach seems better than tuple chaining because it uses our
existing code more efficiently. You keep long data out of the main
table, and allow use of existing tools to access the long data.

In fact, you may decide to just extend varchar() and text to allow use
of long tuples. Set the varlena VARLEN field to some special value
like -1, and when you see that, you go to pg_long to get the data.
Seems very easy. You could get fancy and keep data in the table in
most cases, but if the tuple length exceeds 8k, go to all the varlena
fields and start moving data into pg_long. That way, a table with
three 4k columns could be stored without the user even knowing pg_long
is involved, but for shorter tuples, they are stored in the main
table.

===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 10:38:28 -0500 (EST)

> I thought to use a regular table. Of course, it will eat
> buffers, but managing external files or even large objects
> for it IMHO isn't that simple, if you take transaction
> commit/abort and MVCC problems into account too. And IMHO
> this is something that must be covered, because I meant to
> create a DATATYPE that can be used as a replacement for TEXT
> if that's too small, so it must behave as a regular datatype,
> without any restrictions WRT being able to rollback etc.

In fact, you could get fancy and allow an update of a column not using
pg_long to not change pg_long at all. Just keep the same value in the
column. Whether the transaction fails or succeeds, the pg_long data is
the same for that tuple. Of course, because an update is a delete and
then an insert, that may be hard to do. For very long fields, it would
be a win for UPDATE. You certainly couldn't do that with chained
tuples.

===
Subject: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 17:21:28 +0100 (MET)

I wrote:
> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up.
> > Seems that would be a good way. I have considered putting them in
> > a hash bucket/directory tree for faster access to lots of large
> > objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer
> > to access.
>
> I thought to use a regular table. Of course, it will eat
> buffers ...

When looking at my actual implementation concept, I'm not sure if it
will win or lose compared against text itself! Amazing, but I think it
could win already at relatively small text sizes (1-2K is IMHO small
compared to what this type could store).

Well, the implementation details. I really would like some little
comments to verify it's really complete before starting.

- A new field "rellongrelid" of type Oid is added to pg_class. It
  contains the Oid of the long-value relation, or the invalid Oid for
  those relations that have no LONG attributes.

- At CREATE TABLE, a long value relation named "_LONG<tablename>" is
  created for those tables that need it, and of course dropped and
  truncated appropriately. The schema of this table is

      rowid      Oid,   -- oid of our main data row
      rowattno   int2,  -- the attribute number in main data
      chunk_seq  int4,  -- the part number of this data chunk
      chunk      text   -- the content of this data chunk

  There is a unique index defined on (rowid, rowattno).

- The new data type is of variable size with the following header:

      typedef struct LongData
      {
          int32   varsize;
          int32   datasize;
          Oid     longrelid;
          Oid     rowid;
          int16   rowattno;
      } LongData;

  The type's input function is very simple. Allocate
  sizeof(LongData) + strlen(input), set varsize to it, datasize to
  strlen(input), and the rest to invalid and 0. Then copy the input
  after the struct.

  The type's output function determines, from the longrelid, what to
  do. If it's invalid, just output the bytes stored after the struct
  (it must be a datum that resulted from an input operation).
  If longrelid isn't invalid, it does an index scan on that relation,
  fetching all tuples that match rowid and attno. Since it knows the
  datasize, it doesn't need them in the correct order; it can put them
  at the right places into the allocated return buffer by their
  chunk_seq.

- For now (until we have enough experience to judge) I think it would
  be better to forbid ALTER TABLE when LONG attributes are involved.
  Sure, it must be implemented eventually, but IMHO not on the first
  evaluation attempt.

Now how the data goes in and out of the longrel.

- On heap_insert(), we look for non-NULL LONG attributes in the tuple.
  Whether there could be any can simply be seen by looking at the
  rellongrelid in rd_rel. We fetch the value either from the memory
  after LongData or by using the type's output function (for fetching
  it from the relation where it is!). Then we simply break it up into
  single chunks and store them with our tuple's information.

  Now we need to do something tricky - to shrink the main data tuple
  size, we form a new heap tuple with the datums of the original one.
  But we replace all LongData items we stored by faked ones, where the
  varsize is sizeof(LongData) and all the other information is set up
  appropriately. We append that faked tuple instead, copy the
  resulting information into the original tuple's header and throw it
  away.

  This is a point where I'm not totally sure. Could it possibly be
  better or required to copy the entire faked tuple over the one we
  should have stored? It could never need more space, so that wouldn't
  be a problem.

- On heap_replace(), we check all LONG attributes to see if they are
  NULL, or if the information in longrelid, rowid and rowattno doesn't
  match our rellongrelid, tupleid, and attno. In that case this
  attribute might have an old content in the longrel, which we need to
  delete first.

  The rest of the operation is exactly like for heap_insert(), except
  if all the attribute's information did match - then it's our own OLD
  value that wasn't changed.
  So we can simply skip it - the existing data is still valid.

- heap_delete() is so simple that I don't explain it.

Now I hear you asking "how could this overhead be a win?" :-)

That's easy to explain. As long as you don't use a LONG column in the
WHERE clause, when will the data be fetched? At the time it's finally
clear that it's needed. That's when a result tuple is sent to the
client (type output) or when a tuple resulting from INSERT ... SELECT
should be stored.

Thus, all the tuples moving around in the execution tree, getting
joined together, abused by sorts and aggregates and filtered out
again, always contain the small LongData struct, not the data itself.
Weren't there recently reports about too-expensive sorts due to their
huge size?

Another bonus would be this: What happens on an UPDATE to a table
having LONG attributes? If the attribute is not modified, the OLD
LongData will be found in the targetlist, and we'll not waste any
space by storing the same information again. IIRC that was one of the
biggest concerns about storing huge data in tuples, but it disappeared
without leaving a trace - funny, eh?

It is so simple that I fear I made some mistake somewhere. But where?

===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 17:45:53 +0100 (MET)

Bruce Momjian wrote:

> In fact, you may decide to just extend varchar() and text to allow
> use of long tuples. Set the varlena VARLEN field to some special
> value like -1, and when you see that, you go to pg_long to get the
> data. Seems very easy. You could get fancy and keep data in the
> table in most cases, but if the tuple length exceeds 8k, go to all
> the varlena fields and start moving data into pg_long. That way, a
> table with three 4k columns could be stored without the user even
> knowing pg_long is involved, but for shorter tuples, they are stored
> in the main table.

So you realized most of my explanations yourself while I wrote the
last mail.
:-)

No, I don't intend to change anything on the existing data types.

Where should the limit be on which to decide to store a datum in
pg_long? Based on the datum's size? On the tuple size and attribute
order, taking one by one until the tuple becomes small enough to fit?

Maybe we make this mechanism so general that it is automatically
applied to ALL varsize attributes? We'll end up with one big pg_long
where 90+% of the database's content will be stored.

But as soon as an attribute stored there is used in a WHERE or is
subject to be joined, you'll see why not (as said, this type will NOT
be enabled for indexing). The operation will probably fall back to a
seq-scan on the main table, and then the attribute must be fetched
from pg_long with an index scan on every single compare etc. - no,
no, no.

And it will not be one single pg_long table. Instead it will be a
separate table per table that contains one or more LONG attributes.
IIRC, the TRUNCATE functionality was implemented exactly to QUICKLY be
able to wipe out the data from huge relations AND get the disk space
back. In the case of a central pg_long, TRUNCATE would have to scan
pg_long to mark the tuples for deletion, and vacuum must be run to
really get back the space. And a vacuum on this central pg_long would
probably take longer than the old DELETE, VACUUM of the now truncated
table itself. Again no, no, no.

===
Subject: Re: [HACKERS] Last thoughts about LONG
From: Peter Eisentraut <e99re41@DoCS.UU.SE>
Date: Sat, 11 Dec 1999 17:55:24 +0100 (MET)

On Sat, 11 Dec 1999, Jan Wieck wrote:

> Well, the implementation details. I really would like some
> little comments to verify it's really complete before
> starting.

Before I start the nagging, please be aware that I'm not as smart as I
think I am. Long datatypes of some sort are clearly necessary -- more
power to you.

> - A new field "rellongrelid" type Oid is added to pg_class.
>   It contains the Oid of the long-value relation or the
>   invalid Oid for those who have no LONG attributes.

I have a mixed feeling about all these "sparse" fields everywhere.
Doing it completely formally, this seems to be a one-to-many relation,
so you should put the referencing field into the pg_long table or
whatever structure you use, pointing the other way around. This is
probably slower, but it's cleaner. As I mentioned earlier, this whole
arrangement will (hopefully) not be needed for all too long, and then
we wouldn't want to be stuck with it.

> - At CREATE TABLE, a long value relation named
>   "_LONG<tablename>" is created for those tables who need it.

Please don't forget, this would require changes to pg_dump and psql.
Also, the COPY command might not be able to get away without changes,
either.

In general, it wouldn't surprise me if some sections of the code would
go nuts about the news of tuples longer than BLCKSZ coming along.
(Where "nuts" is either 'truncation' or 'segfault'.)

I guess what I'm really saying is that I'd be totally in awe of you if
you could get all of this (and RI) done by Feb 1st. Good luck.

===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 18:04:04 +0100 (MET)

Peter Eisentraut wrote:

> On Sat, 11 Dec 1999, Bruce Momjian wrote:
>
> > In fact, you could get fancy and allow an update of a column not
> > using pg_long to not change pg_long at all. Just keep the same
> > value in the column. If the transaction fails or succeeds, the
> > pg_long is the same for that tuple. Of course, because an update
> > is a delete and then an insert, that may be hard to do. For very
> > long fields, it would be a win for UPDATE. You certainly couldn't
> > do that with chained tuples.
>
> While this is great and all, what will happen when long tuples
> finally get done? Will you remove this, or keep it, or just make
> LONG and TEXT equivalent?
> I fear that elaborate structures will be put in place here
> that might perhaps only be of use for one release cycle.

With the actual design explained, I don't think we're that much in
need for long tuples any more - not so much that we should introduce
all the problems of chaining tuples into the vacuum, bufmgr, heapam,
hio etc. code.

The rare cases where someone really needs larger tuples and is not
able to use the proposed LONG data type can be tackled by increasing
BLKSIZE for that specific installation.

Isn't there a FAQ entry about "tuple size too big" pointing to
BLKSIZE? Haven't checked, but if there is, could that be the reason
why we get fewer requests on this item?

===
Subject: Re: [HACKERS] LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:13:15 -0500

>> I thought about the huge size variable text type a little
>> more. And I think I could get the following implementation
>> to work reliably for our upcoming release.
>>
>> For any relation having one or more LONG data type
>> attributes, another relation (named pg_<something>) is
>> created, accessible only to superusers (and internal access
>> routines). All LONG data items are stored as a reference
>> into that relation, split up automatically so the chunks fit
>> into the installation specific tuple limit size. Items are
>> added/updated/removed totally transparently.

> Should we use large objects for this, and beef them up. Seems that
> would be a good way.

Yes, I think what Jan is describing *is* a large object, with the
slight change that he wants to put multiple objects into the same
behind-the-scenes relation. (That'd be a good change for regular
large objects as well ... it'd cut down the umpteen-thousand-files
problem.)

The two principal tricky areas would be (1) synchronization --- having
one hidden relation per primary relation might solve the problems
there, but I'm not sure about it; and (2) VACUUM.
But I don't really see why this would be either easier to do or more
reliable than storing multiple segments of a tuple in the primary
relation itself. And I don't much care for

===
Subject: Re: [HACKERS] LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:32:46 -0500

wieck@debis.com (Jan Wieck) writes:

> The rare cases, where someone really needs larger tuples and
> is not able to use the proposed LONG data type can be
> tackled by increasing BLKSIZE for this specific installation.

This would be a more convincing argument if we supported BLCKSZ
greater than 32K, but we don't.

I think we've speculated about having a compilation flag that gets
thrown to change page offsets from shorts to longs, thereby allowing
larger page sizes. But as Bruce was just pointing out, all of the
code depends in a fundamental way on the assumption that writing a
page is an atomic action. The larger the page size, the more likely
that you'll see broken tables caused by partial page writes. So
allowing BLCKSZ large enough to accommodate any tuple wouldn't be a
very good answer.

I think the proposed LONG type is a hack, and I'd rather see us solve
the problem correctly. ISTM that allowing a tuple to be divided into
"primary" and "continuation" tuples, all stored in the same relation
file, would be a much more general answer and not significantly harder
to implement than a LONG datatype as Jan is describing it.

===
Subject: Re: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 19:29:59 +0100 (MET)

Peter Eisentraut wrote:

> Before I start the nagging, please be aware that I'm not as smart as
> I think I am. Long datatypes of some sort are clearly necessary --
> more power to you.

So be it. It forces me to think it over again and points to sections I
might have forgotten so far.
Also, it has happened more than once to me that writing a totally
OBVIOUS answer triggered a better solution in my brain (dunno what's
wrong with that brain, but sometimes it needs to be shaken well before
use). Thus, any of your notes can help, and that counts!

> > - A new field "rellongrelid" type Oid is added to pg_class.
> >   It contains the Oid of the long-value relation or the
> >   invalid Oid for those who have no LONG attributes.
>
> I have a mixed feeling about all these "sparse" fields everywhere.
> Doing it completely formally, this seems to be a one-to-many
> relation, so you should put the referencing field into the pg_long
> table or whatever structure you use, pointing the other way around.
> This is probably slower, but it's cleaner. As I mentioned earlier,
> this whole arrangement will (hopefully) not be needed for all too
> long, and then we wouldn't want to be stuck with it.

It's 4 bytes per RELATION in pg_class. As a side effect, the
information will be available at NO COST immediately after heap_open()
and in every place where a relation is accessed. So it is the best
place to put it.

> > - At CREATE TABLE, a long value relation named
> >   "_LONG<tablename>" is created for those tables who need it.
>
> Please don't forget, this would require changes to pg_dump and psql.
> Also, the COPY command might not be able to get away without
> changes, either.

Oh yes, thanks. That was a point I forgot!

Psql must not list tables that begin with "_LONG" on the \d request.
Anything else should IMHO be transparent.

Pg_dump either uses a SELECT to build a script that INSERTs the data
via SQL, or uses COPY. In the SELECT/INSERT case, my implementation
would again be totally transparent and not noticed by pg_dump, except
that it must IGNORE "_LONG*" relations and be aware that really big
tuples can be sent. But that's more a libpq question, I think (which I
already checked, because the view/rule/PL combo I created to
demonstrate a >128K tuple was done through psql).
AFAIK, pg_dump doesn't use a binary COPY, and looking at the code
tells me that this is transparent too (due to the use of type-specific
input/output functions there). All pg_dump would have to do is to
ignore "_LONG*" relations too.

The real problem is COPY. In the case of a COPY BINARY it outputs the
data portion of the fetched tuples directly. But these will only
contain the LongData headers, not the data itself. So at that point,
COPY has to do the reverse process of heap_insert(): rebuild a faked
tuple where all the non-NULL LONG values are placed in the
representation they would have after type input. Not a big deal; it
must only be done with the same care as the changes in heapam, so as
not to leave unfreed, leaked memory around.

> In general, it wouldn't surprise me if some sections of the code
> would go nuts about the news of tuples longer than BLCKSZ coming
> along. (Where "nuts" is either 'truncation' or 'segfault'.)

The place where the size of a heap tuple is checked (and where the
"tuple size too big" message is coming from) is in hio.c, right before
it is copied into the block. Up to then, a tuple is NOT explicitly
limited to any size. So I would be glad to see crashes coming up from
this change (not after release - during BETA of course). It would help
us to get another existing bug out of the code.

> I guess what I'm really saying is that I'd be totally in awe of you
> if you could get all of this (and RI) done by Feb 1st. Good luck.

Thanks for the flowers, but "awe" is far too much - sorry. Over the
years I've had my hands on nearly every part of the code involved in
this. So I'm not a newbie at creating data types, utility commands or
doing syscat changes. The LONG type I described will be the work of
two or three nights.

I already intended to tackle the long tuples next. Missing was the
idea how to AVOID them simply. And I had this idea just while
answering a question about storing big text files in the database on
the [SQL] list - that woke me up.
In contrast to the RI stuff, this time I don't expect any bugs,
because there are absolutely no side effects I've noticed so far. On
the RI stuff, we discussed for weeks (if not months) about tuple
visibility during concurrent transactions, and I finally ran into
exactly these problems anyway.

===
Subject: Re: [HACKERS] Last thoughts about LONG
From: Tom Lane <tgl@sss.pgh.pa.us>
Date: Sat, 11 Dec 1999 13:52:11 -0500

wieck@debis.com (Jan Wieck) writes:

> Another bonus would be this: What happens on an UPDATE to a
> table having LONG attributes? If the attribute is not
> modified, the OLD LongData will be found in the targetlist,
> and we'll not waste any space by storing the same information
> again.

Won't work. If you do that, you have several generations of the
"primary" tuple pointing at the same item in the "secondary" table.
There is room in the multiple primary tuples to keep track of their
committed/uncommitted status, but there won't be enough room to keep
track in the secondary table.

I think this can only work if there are exactly as many generations of
the LONG chunks in the secondary table as there are of the primary
tuple in the main table, and all of them have the same transaction
identification info stored in them as the corresponding copies of the
primary tuple have.

Among other things, this means that an update or delete *must* scan
through the tuple, find all the LONG fields, and go over to the
secondary table to mark all the LONG chunks as deleted by the current
xact, just the same as the primary tuple gets marked. This puts a
considerable crimp in your claim that it'd be more efficient than a
multiple-tuple-segment approach.

Of course, this could be worked around if the secondary table did
*not* use standard access methods (it could be more like an index, and
rely on the primary table for all xact status info).
But that makes it look

===
Subject: Re: [HACKERS] Last thoughts about LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 22:02:43 +0100 (MET)

Tom Lane wrote:

> wieck@debis.com (Jan Wieck) writes:
> > Another bonus would be this: What happens on an UPDATE to a
> > table having LONG attributes? If the attribute is not
> > modified, the OLD LongData will be found in the targetlist,
> > and we'll not waste any space by storing the same information
> > again.
>
> Won't work. If you do that, you have several generations of the
> "primary" tuple pointing at the same item in the "secondary" table.
> There is room in the multiple primary tuples to keep track of their
> committed/uncommitted status, but there won't be enough room to
> keep track in the secondary table.

A really critical point to think about in depth, and another point I
could have stumbled over. But it would work anyway.

I assumed up to now that even under MVCC, and even if reading dirty,
there can be at most one single transaction modifying one and the same
tuple - no? Ignore all the rest and forget all my comments if your
answer is no. But please tell me how something like RI should ever
work RELIABLY in such an environment. In fact, in that case I would
immediately stop all my effort on FOREIGN KEY, because it would be a
dead-end street - so I assume your answer is yes.

My concept of using regular heap access inside of heap access to act
on the "secondary" table means to stamp the same current xact as for
the "primary" table into the xmax of old, and into the xmin of new,
tuples in the "secondary" table. And it means that this operation
appears to be atomic when living in a locking environment.

The only thing I DON'T want to do is to stamp xmax and create new
instances in the "secondary" table if no update is done to the value
of the old LONG attribute. Any UPDATE modifying the LONG value, and
INSERT/DELETE of course, will stamp this information and/or create new
instances.
So the only thing (because the only difference) to worry about are
unstamped and uncreated instances in the "secondary" table - right?
Since INSERT/DELETE always act synchronously to the "primary" table,
as does an UPDATE modifying the LONG too, the only thing left to worry
about is an UPDATE without updating the LONG.

In this scenario, a "secondary" tuple of a not-updated "primary" LONG
will have an older, but surely committed, xmin. And its xmax will be
either infinite or aborted. So it is visible - no other chance. And
that's good, because at the time being, the updater of the "primary"
tuple does a NOOP on the "secondary". And this (extended) part of the
"primary" tuple's information is absolutely unaffected, regardless of
whether its transaction commits or rolls back.

Well, your concern is again valid. This concept MIGHT need to force a
NON-MVCC locking scheme for "secondary" tables. But as far as I
learned from the RI stuff, that isn't a problem, and therefore the
current Jackpot value is to be added to Vadim's account.

===
Subject: Re: [HACKERS] LONG
From: wieck@debis.com (Jan Wieck)
Date: Sat, 11 Dec 1999 23:36:22 +0100 (MET)

Bruce Momjian wrote:

> > While this is great and all, what will happen when long tuples
> > finally get done? Will you remove this, or keep it, or just make
> > LONG and TEXT equivalent? I fear that elaborate structures will be
> > put in place here that might perhaps only be of use for one
> > release cycle.
>
> I think the idea is that Jan's idea is better than chaining tuples.

Just as Tom already pointed out, it cannot completely replace tuple
chaining because of the atomicity assumption of a single fsync(2)
operation in the current code. Due to this, we cannot get around the
cases LONG will leave open by simply raising BLKSIZE; we instead need
to tackle that anyway.

But I believe LONG would still be something worth the effort.
It will lower the pressure on chained tuples, giving us more time to
build a really good solution, and I think LONG can survive tuple
chaining and live in coexistence with it.

As said in my last mail, I still believe that not touching LONG values
at UPDATE can avoid storing the same huge value again. And that's a
benefit tuple chaining will never give us.

Remember: If your only tool is a hammer, anything MUST look like a
nail. So why not provide a richer set of tools?

Jan

===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 18:25:12 -0500 (EST)

> Bruce Momjian wrote:
>
> > > While this is great and all, what will happen when long tuples
> > > finally get done? Will you remove this, or keep it, or just make
> > > LONG and TEXT equivalent? I fear that elaborate structures will
> > > be put in place here that might perhaps only be of use for one
> > > release cycle.
> >
> > I think the idea is that Jan's idea is better than chaining tuples.
>
> Just as Tom already pointed out, it cannot completely replace
> tuple chaining because of the atomicity assumption of a single
> fsync(2) operation in the current code. Due to this, we cannot
> get around the cases LONG will leave open by simply raising
> BLKSIZE; we instead need to tackle that anyway.

Actually, in looking at the fsync() system call, it does write the
entire file descriptor before marking the transaction as complete, so
there is no hard reason not to raise it, but because the OS has to do
two reads to get 16k, I think we are better off keeping 8k as our base
block size.

Jan's idea is not to chain tuples, but to keep tuples at 8k, and
instead chain out individual fields into 8k tuple chunks, as needed.
This seems like it makes much more sense. It uses the database to
recreate the chains.

Let me mention a few things. First, I would like to avoid a LONG data
type if possible. Seems a new data type is just going to make things
more confusing for users.
My idea is a much more limited one than Jan's. It is to have a special
-1 varlena length when the data is chained on the long relation. I
would do:

	-1|oid|attno

in 12 bytes. That way, you can pass this around as long as you want,
and just expand it in the varlena textout and compare routines when
you need the value. That prevents the tuples from changing size while
being processed. As far as I remember, there is no need to see the
data in the tuple except in the type comparison/output routines.

Now it would be nice if we could set the varlena length to 12, its
actual length, and then just somehow know that the varlena of 12 was a
long data entry. Our current varlena has a maximum length of 64k. I
wonder if we should grab a high bit of that to trigger long. I think
we may be able to do that, and just do an AND mask to remove the bit
to see the length. We don't need the high bit because our varlenas
can't be over 32k. We can modify VARSIZE to strip it off, and make
another macro like ISLONG to check for that high bit. Seems this could
be done with little code.
It also prevents expansion of rows that never end up in the result
set. In your system, a sequential scan of the table will pull in all
this stuff because you are going to expand the tuple. That could be
very costly. In my system, the expansion only happens on output if the
LONG field does not appear in the WHERE or ORDER BY clauses.

Also, my idea was to auto-enable longs for all varlena types, so short
values stay in the table, while longer chained ones that take up lots
of space and are expensive to expand are retrieved only when needed. I
see this as much better than chained tuples.

> And it will not be one single pg_long table. Instead it will
> be a separate table per table that contains one or more LONG
> attributes. IIRC, the TRUNCATE functionality was implemented
> exactly to QUICKLY be able to wipe out the data from huge
> relations AND get the disk space back. In the case of a
> central pg_long, TRUNCATE would have to scan pg_long to mark
> the tuples for deletion and vacuum must be run to really get
> back the space. And a vacuum on this central pg_long would
> probably take longer than the old DELETE, VACUUM of the now
> truncated table itself. Again no, no, no.

I guess a separate pg_long_ per table would be good.

===
Subject: Re: [HACKERS] LONG
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sat, 11 Dec 1999 19:01:49 -0500 (EST)

> Maybe we make this mechanism so general that it is
> automatically applied to ALL varsize attributes? We'll end up
> with one big pg_long where 90+% of the database's content will
> be stored.
>
> But as soon as an attribute stored there is used in a WHERE
> or is subject to be joined, you'll see why not (as said, this
> type will NOT be enabled for indexing). The operation will
> probably fall back to a seq-scan on the main table and then
> the attribute must be fetched from pg_long with an index scan
> on every single compare etc. - no, no, no.
A field value over 8k is not going to be something you join on,
restrict, or order by in most cases. It is going to be some long
narrative or field that is just for output to the user, usually not
used to process the query.

===