postgresql_vs_berkeleydb

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.





===

Subject: Re: [HACKERS] Berkeley DB...
From: Chris Bitmead <chrisb@nimrod.itg.telstra.com.au>
Date: Mon, 22 May 2000 10:10:48 +1000


"Michael A. Olson" wrote:

> You get another benefit from Berkeley DB -- we eliminate the 8K limit
> on tuple size.  For large records, we break them into page-sized
> chunks for you, and we reassemble them on demand.  Neither PostgreSQL
> nor the user needs to worry about this, it's a service that just works.
> 
> A single record or a single key may be up to 4GB in size.

That's certainly nice. But if you don't access a BIG column, you have to
retrieve the whole record? A very nice aspect of the Postgres TOAST idea
is that you don't. You can have...
CREATE TABLE image (name TEXT, size INTEGER, giganticTenMegImage GIF);
As long as you don't select the huge column, you don't lift it off disk.
That's pretty nice. In other databases I've had to do some annoying
refactoring of data models to avoid this.



===

Subject: Re: [HACKERS] Berkeley DB...
From: "Michael A. Olson" <mao@sleepycat.com>
Date: Sun, 21 May 2000 18:25:07 -0700


At 10:14 AM 5/22/00 +0900, Tatsuo Ishii wrote:

> Can I ask you a simple question? Does Berkeley DB support encodings
> other than ASCII?

Berkeley DB is entirely agnostic on data types.  We store and retrieve
keys and values; you define the types and assign semantics to them.
We've got a number of customers storing wide character data in various
encodings and character sets in Berkeley DB.

Our default btree comparator and hash function are simple bit string
operators.  You'd need to write a comparison function for btrees that
understands your encoding, so that keys sort the way you expect.
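
Something along these lines would do (a sketch against the
DB->set_bt_compare hook; the byte-wise compare is only a placeholder for
real encoding-aware logic, and the exact callback signature depends on
the Berkeley DB release):

    #include <string.h>
    #include <db.h>

    /*
     * Placeholder comparator: a real multibyte-aware version would
     * decode both keys according to the table's encoding/collation
     * before comparing, instead of comparing raw bytes.
     */
    static int
    my_key_compare(DB *dbp, const DBT *a, const DBT *b)
    {
        size_t len = a->size < b->size ? a->size : b->size;
        int cmp = memcmp(a->data, b->data, len);

        if (cmp != 0)
            return cmp;
        /* when one key is a prefix of the other, the shorter sorts first */
        return (int)a->size - (int)b->size;
    }

    /* after db_create(&dbp, env, 0), before DB->open: */
    /* dbp->set_bt_compare(dbp, my_key_compare); */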



===


Subject: RE: [HACKERS] Berkeley DB...
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
Date: Sun, 21 May 2000 18:47:03 -0700


> Empty pages get appended to a free list, and will be reused 
> on next page allocation. Empty space on pages (from deleted
> tuples) where the rest of the page isn't empty will get reused
> the next time the page is visited.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
If visited for update... but how can I know with what RECNO I have
to insert a new tuple so that it is stored on a half-empty page?

> We do, however, do reverse splits of underfull nodes, so 
> we're aggressive at getting empty pages back on the free list.

You can't merge two 49%-empty pages into one. So, how do we reuse
this 49%? How will we be able to implement a feature that good
databases have: one can specify at table creation -
"insert new tuples on pages which are N% empty"?

Moreover, you can't just re-use empty space (even if you know
where in the tree there is space for a new tuple) - you have to assign
the *right* recno to the new tuple. What if a 2K tuple on a page was
updated to 0.2K and no tuple was deleted on this page? You can't re-use
this empty space without a tree reorg, which 1. requires additional
writes; 2. is not always possible at all.
...Oh, we could work it out by using something like the current TID
(blkno + itemid) as the index key - this would require some changes
in the SDB btree code (a new RECNO-like AM)... so, this was just a
calculation of the work required -:)

And, while we are on the heap subject - using an index (RECNO) for the
heap means that all our secondary-index scans will perform TWO
index scans - first, to find the recno in the secondary index, and
second, to find the heap tuple using the recno (currently indices give
us a TID, which is a physical address).

> > 2. SDB's btrees support only one key, but we have multi-key
> > btrees...
> 
> This is a misunderstanding. Berkeley DB allows you to use
> arbitrary data structures as keys. You define your own comparison 
> function, which understands your key structure and is capable of
> doing comparisons between keys.

Oh, you're right.
Though, I'm unhappy with

DB_SET_RANGE
     The DB_SET_RANGE flag is identical to the DB_SET flag, except that
     the key is returned as well as the data item, and, in the case of
     the Btree access method, the returned key/data pair is 
     the smallest key greater than or equal to the specified key
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
we would need to differentiate > and >= : it's not good for "key > 2" to
read from disk all items with key == 2. But this could be done (see the
sketch after the quote).

     (as determined by the comparison function), permitting partial key
     matches and range searches. 
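
For illustration, a scan using DB_SET_RANGE would look roughly like this
(a sketch only; error handling trimmed, and the by-hand skipping of equal
keys is exactly the extra work I mean):

    #include <string.h>
    #include <db.h>

    /* Sketch: return all entries with key >= start_key; when the qual
     * is a strict '>', equal keys must still be read and skipped.    */
    static int
    range_scan(DB *dbp, DBT *start_key, int strict_greater)
    {
        DBC *cursor;
        DBT key, data;
        int ret;

        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = start_key->data;
        key.size = start_key->size;

        if ((ret = dbp->cursor(dbp, NULL, &cursor, 0)) != 0)
            return ret;

        /* position at the smallest key >= start_key */
        ret = cursor->c_get(cursor, &key, &data, DB_SET_RANGE);
        while (ret == 0) {
            int is_equal = key.size == start_key->size &&
                           memcmp(key.data, start_key->data, key.size) == 0;

            if (!(strict_greater && is_equal)) {
                /* ... evaluate quals, return the tuple, etc. ... */
            }
            ret = cursor->c_get(cursor, &key, &data, DB_NEXT);
        }

        (void)cursor->c_close(cursor);
        return ret == DB_NOTFOUND ? 0 : ret;
    }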

> You get another benefit from Berkeley DB -- we eliminate the 8K limit
> on tuple size.  For large records, we break them into page-sized
> chunks for you, and we reassemble them on demand.  Neither PostgreSQL
> nor the user needs to worry about this, it's a service that 
> just works.

But from what I've seen in the code - you *always* copy data to memory,
which is not good for both seq scans (when we have to evaluate the
query qual) and index ones (with range quals). See also the comments
from Chris Bitmead about the case when some of the data are not in the
select target list...
Probably we could change the DB->(c_)get methods and provide them
with two functions (one to check the qual and one to determine what
exactly should be returned)...
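
Purely to illustrate the idea (nothing like this exists in the SDB API
today; the names and signatures below are hypothetical):

    #include <db.h>

    /* Hypothetical extension, NOT part of Berkeley DB: the idea is that
     * the get/c_get path would take two callbacks, so qual evaluation
     * and projection could happen against the page in the buffer pool
     * instead of copying whole records out first.                     */
    typedef int  (*qual_func)(const DBT *key, const DBT *data, void *arg);
    typedef void (*proj_func)(const DBT *data, DBT *result, void *arg);

    /* hypothetical signature, for discussion only */
    int DBC_c_get_filtered(DBC *cursor, DBT *key, DBT *data,
                           u_int32_t flags,
                           qual_func check_qual, proj_func project,
                           void *arg);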

> A single record or a single key may be up to 4GB in size.
> 
> > 3. How can we implement gist, rtree AND (multi-key) BTREE 
> > access methods using btree and hash access methods provided by SDB?!
> 
> You'd build gist and rtree on top of the current buffer manager, much
> as rtree is currently implemented on top of the lower-level page manager
> in PostgreSQL.

Oh, so we have to implement redo/undo for them. Ok.

WAL:
> I encourage you to think hard about the amount of work that's really
> required to produce a commercial-grade recovery and transaction system.
> This stuff is extremely hard to get right -- you need to design, code,
> and test for very high-concurrency, complex workloads. The log is a
> new source of contention, and will be a gate to performance. The log
> is also a new way to consume space endlessly, so you'll want to think
> about backup and checkpoint support. With Berkeley DB, you get both
> today. Our backup support permits you to do on-line backups. Backups
> don't acquire locks and don't force a shutdown.

As for design and coding, 90% is already done (though with SDB we could
avoid the heap/btree/hash redo/undo implementation). As for
stability/testing - as I already said, after rewriting ~50% of the
system to use SDB, nothing in PostgreSQL will be "well tested".

> Testing this stuff is tricky.  For example, you need to prove 
> that you're able to survive a crash that interrupts the three
> internal page writes that you do in the btree access method on
> a page split.  

Oh, testing this case is very easy - I'll just stop a backend
under gdb at critical points and turn the power off -:))
I've run 2-3 backends under gdb to catch a concurrency-related
bug in the buffer manager - this technique works very well -:)

> All of that said, I'd boil Vadim's message down to this:
> 
> +  With Berkeley DB, you'd need to reimplement multi-version
>    concurrency control, and that's an opportunity to introduce
>    new bugs.

Unfortunately, this must be done *before* we could migrate to
SDB -:( So some of us would have to stop PG development and
switch to SDB code... which is hard on the soul -:)

> +  With PostgreSQL, you'll need to implement logging 
>    and recovery, and that's an opportunity to introduce new bugs.

But... we know where in our code to set breakpoints under gdb -:))

Oh, well.. let's continue to think...



===

Subject: Re: [HACKERS] Berkeley DB...
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sun, 21 May 2000 22:22:14 -0400 (EDT)


> And, while we are on the heap subject - using an index (RECNO) for the
> heap means that all our secondary-index scans will perform TWO
> index scans - first, to find the recno in the secondary index, and
> second, to find the heap tuple using the recno (currently indices give
> us a TID, which is a physical address).

Yes, that was one of my questions.  Why use recno at all?  We already
have heap access which is very fast.  Why switch to SDB which gives us
a recno ordering of heap that doesn't do us any real good, except to
allow tuple update without changing indexes.



===

Subject: RE: [HACKERS] Berkeley DB...
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
Date: Sun, 21 May 2000 20:00:01 -0700


> > And, while we are on the heap subject - using an index (RECNO) for the
> > heap means that all our secondary-index scans will perform TWO
> > index scans - first, to find the recno in the secondary index, and
> > second, to find the heap tuple using the recno (currently indices give
> > us a TID, which is a physical address).
> 
> Yes, that was one of my questions.  Why use recno at all?  We already
> have heap access which is very fast.  Why switch to SDB which gives us
> a recno ordering of heap that doesn't do us any real good, except to
> allow tuple update without changing indexes.

But if we use our heap AM, then we'll have to implement redo/undo
for it... no sense to switch to SDB for btree/hash WAL support -:)




===

Subject: Re: [HACKERS] Berkeley DB...
From: Bruce Momjian <pgman@candle.pha.pa.us>
Date: Sun, 21 May 2000 23:11:00 -0400 (EDT)


> > > And, while we are on the heap subject - using an index (RECNO) for the
> > > heap means that all our secondary-index scans will perform TWO
> > > index scans - first, to find the recno in the secondary index, and
> > > second, to find the heap tuple using the recno (currently indices give
> > > us a TID, which is a physical address).
> > 
> > Yes, that was one of my questions.  Why use recno at all?  We already
> > have heap access which is very fast.  Why switch to SDB which gives us
> > a recno ordering of heap that doesn't do us any real good, except to
> > allow tuple update without changing indexes.
> 
> But if we use our heap AM, then we'll have to implement redo/undo
> for it... no sense to switch to SDB for btree/hash WAL support -:)

Yes, SDB would give us redo/undo in heap, and that would make things
easier.  However, if there is the overhead of a double-index lookup when
using indexes, it seems like a very high cost.




===


Subject: RE: [HACKERS] Berkeley DB...
From: "Mikheev, Vadim" <vmikheev@SECTORBASE.COM>
Date: Sun, 21 May 2000 20:09:54 -0700


> > Yes, that was one of my questions.  Why use recno at all?  
> > We already have heap access which is very fast. Why switch
> > to SDB which gives us a recno ordering of heap that doesn't
> > do us any real good, except to allow tuple update without
> > changing indexes.
> 
> But if we use our heap AM, then we'll have to implement redo/undo
> for it... no sense to switch to SDB for btree/hash WAL support -:)

Also, I think that our native index logging will require less space
in the log, because we don't have to write *key values* to the log!
An index tuple insertion will be logged as "index tuple pointing to
heap TID was added to page BLKNO at position ITEMID".
The same for an index page split...
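
As a rough sketch of such a record (the layout below is hypothetical,
just to show that no key bytes need to be logged):

    /* Hypothetical log record for "index tuple pointing to heap TID was
     * added to page BLKNO at position ITEMID" - only positions are
     * logged, never the key value, which keeps the record small.      */
    typedef struct XLogIndexInsert
    {
        unsigned int   relid;       /* which index */
        unsigned int   blkno;       /* index page the entry was added to */
        unsigned short itemid;      /* position within that page */
        unsigned int   heap_blkno;  /* heap TID the entry points at: block */
        unsigned short heap_posid;  /* ...and offset within the heap page */
    } XLogIndexInsert;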




===


Subject: Re: [HACKERS] Berkeley DB...
From: "Michael A. Olson" <mao@sleepycat.com>
Date: Sun, 21 May 2000 21:09:27 -0700


I'm responding in a single message to several questions prompted by my
message of this morning.

Chris Bitmead asked:

> > A single record or a single key may be up to 4GB in size.
> 
> That's certainly nice. But if you don't access a BIG column, you have to
> retrieve the whole record?

You can do partial reads, but only at the expense of complicating the
code in the PostgreSQL server.  We provide some reasonable interfaces
for fetching only part of large records.  However, you need to know
whether you should call them or not.  As a result, you'd have to
record in the system catalog somewhere that a particular table contained
big tuples, and you'd have to write your fetch code to read only the
byte range you care about.

That would complicate the server, since you'd want to do simple
fetches for the simple case, too.
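
For what it's worth, the partial-fetch interface is just a flag on the
DBT; roughly (a sketch only, error handling omitted):

    #include <string.h>
    #include <db.h>

    /* Sketch: fetch only bytes [offset, offset+length) of a large record
     * instead of pulling the whole multi-megabyte value into memory.   */
    static int
    get_partial(DB *dbp, DBT *key, u_int32_t offset, u_int32_t length,
                DBT *result)
    {
        memset(result, 0, sizeof(*result));
        result->flags = DB_DBT_PARTIAL;
        result->doff  = offset;   /* byte offset into the stored value */
        result->dlen  = length;   /* number of bytes to return */

        return dbp->get(dbp, NULL, key, result, 0);
    }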

Vadim Mikheev made some good points on space reuse.  Unless a page is
completely empty, we can only reuse empty space on the page for keys in
the right range.  For append-only workloads (like increasing heap tids),
that's not what you want.

Vadim then asked:

> You can't merge two 49%-empty pages into one. So, how do we reuse
> this 49%? How will we be able to implement a feature that good
> databases have: one can specify at table creation -
> "insert new tuples on pages which are N% empty"?


We already recognize the special case of in-order insertions (as in
the case of increasing heap tids).  We split pages so that the right
child is nearly empty and the left is nearly full.  That gives you close
to 100% space utilization at build time.  Adding a fill factor to
the initialization code would be very easy.

> And, while we are on the heap subject - using an index (RECNO) for the
> heap means that all our secondary-index scans will perform TWO
> index scans - first, to find the recno in the secondary index, and
> second, to find the heap tuple using the recno (currently indices give
> us a TID, which is a physical address).

We're not going to resolve this question without building both
systems and measuring their performance.  The non-leaf levels of
btrees are pretty much always in the cache because they're hot.
Whether your fetch-a-tuple code path is shorter than my fetch-
a-tuple code path is undecided.

Frankly, based on my experience with Berkeley DB, I'd bet on mine.
I can do 2300 tuple fetches per CPU per second, with linear scale-
up to at least four processors (that's what we had on the box we
used).  That's 9200 fetches a second.  Performance isn't going
to be the deciding issue.

(The test system was a mid-range Solaris box -- reasonable, but not
extravagant, clock speed, memory, and disk.)

On testing failure at critical points in the code, Vadim wrote:

> Oh, testing this case is very easy - I'll just stop a backend
> under gdb at critical points and turn the power off -:))
> I've run 2-3 backends under gdb to catch a concurrency-related
> bug in the buffer manager - this technique works very well -:)

For limited concurrency and fairly simple problems, that technique
works well.  You should plan to test PostgreSQL with hundreds of
concurrent backends with a complex workload for days in order to
convince people that the system works correctly.  This is what the
commercial vendors (including Sleepycat!) do.  Your testing
strategy should include randomly killing the system to demonstrate
that you recover correctly.

I'm only warning you to be careful and to take this seriously.  It's
very hard to do the kind of testing you should.  The recovery system
is generally the most poorly-exercised part of the system, but it's
the one piece that absolutely has to work flawlessly.  It only runs
after your system has crashed, and your customer is already angry.

Finally, Vadim makes the point that switching to Berkeley DB forces
you to stop working on code you understand, and to pick up a new
package altogether.  Worse, you'd need to do some pretty serious
engineering to get multi-version concurrency control into Berkeley
DB before you could use it.

This is a pretty compelling argument to me.  I've been busy
explaining how you *could* make the switch, but the real question
is whether you *should*.  I can answer all of Vadim's questions
reasonably.  Frankly, though, if I were in charge of the engineering
effort for PostgreSQL, I'd be disinclined to use Berkeley DB on the
strength of the interface changes it requires and the effort that
would be required to implement MVCC.

I say this in the spirit of complete disclosure -- we'd like to
see you use our software, but you need to make a business decision
here.  If you hadn't already done MVCC, I'd be arguing the other
side, but you have.

===

