eating_large_textareas_postgresql

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



Subject: Re: [HACKERS] Article on MySQL vs. Postgres
From: "Robert B. Easter" <reaster@comptechnews.com>
Date: Wed, 5 Jul 2000 08:27:07 -0400

On Wed, 05 Jul 2000, Hannu Krosing wrote:
> Tim Perdue wrote:
> > 
> > The Hermit Hacker wrote:
> > > > Further, I have had situations where postgres actually had DUPLICATE
> > > > ids in a primary key field, probably due to some abort or other nasty
> > > > situation in the middle of a commit. How did I recover from That?
> > > > Well, I had to run a count(*) next to each ID and select out the rows
> > > > where there was more than one of each "unique" id, then reinsert those
> > > > rows and drop and rebuild the indexes and reset the sequences.
> > >
> > > Odd, were you using transactions here, or transactionless?
> > 
> > Does it matter? I suppose it was my programming error that somehow I got
> > duplicate primary keys in a table in the database where that should be
> > totally impossible under any circumstance? Another stupid
> > transactionless program I'm sure.
> > 
> > At any rate, it appears that the main problem I had with postgres (the
> > 8K tuple limit) is being fixed and I will mention that in my writeup.
> 
> Currently (as of 7.0.x) you could use BLKSIZE=32K + lztext datatype and 
> get text fields about 64-128K depending on data if you are desperately 
> after big textfields.
> 
> -----------
> Hannu

While it is slow, I've been able to store unlimited amounts of text into
the database by using the following code.  I've tested inserting over 4
megabytes from a TEXTAREA web form using PHP.  When inserting such massive
amounts of text, you will have to wait a while, but it will eventually succeed
if you don't run out of memory.  If you do run out of memory, the backend
terminates gracefully and the transaction aborts/rolls back.

 -- Load the PGSQL procedural language
 -- This could also be done with the createlang script/program.
 -- See man createlang.
CREATE FUNCTION plpgsql_call_handler()
	RETURNS OPAQUE AS '/usr/local/pgsql/lib/plpgsql.so'
	LANGUAGE 'C';

CREATE TRUSTED PROCEDURAL LANGUAGE 'plpgsql'
	HANDLER plpgsql_call_handler
	LANCOMPILER 'PL/pgSQL';
          	
 --------------------------------------------------------------------------------
 --
 -- Large Text storage
 --


 -- 	putlgtext -	generic function to store text into the
 --			specified text storage table.
 --		The table specified in $1 should have the following
 --		fields:
 --			id, text_seq, text_block
 --
 -- $1 is the name of the table into which $3 is stored
 -- $2 is the id of the text and references id in another table
 -- $3 is the text to store, which is broken into chunks.
 -- returns 0 on success
 -- nonzero otherwise
CREATE FUNCTION putlgtext (TEXT, INTEGER, TEXT) RETURNS INTEGER AS '
	set i_table $1
	set i_id $2
	set i_t {}
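	# Escape quotes and backslashes so chunks embed safely in the INSERT below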
	regsub -all {([\\''\\\\])} $3 {\\\\\\1} i_t
	set i_seq 0
	while { $i_t != {} } {
		set i_offset 0	
		set tblock [string range $i_t 0 [expr 7000 + $i_offset]]
		# Do not split string at a backslash
		while { [string range $tblock end end] == "\\\\" && $i_offset < 1001 } {
			set i_offset [expr $i_offset + 1]
			set tblock [string range $i_t 0 [expr 7000 + $i_offset]]
		}
		set i_t [string range $i_t [expr 7000 + [expr $i_offset + 1]] end]
		spi_exec "INSERT INTO $i_table (id, text_seq, text_block) VALUES ( $i_id , $i_seq , ''$tblock'' )"
		incr i_seq
	}
	return 0
' LANGUAGE 'pltcl';

 -- 		getlgtext - like putlgtext, this is a generic
 --				function that does the opposite of putlgtext
 -- $1 is the table from which to get TEXT
 -- $2 is the id of the text to get
 -- returns the text concatenated from one or more rows
CREATE FUNCTION getlgtext(TEXT, INTEGER) RETURNS TEXT AS '
	set o_text {}
	spi_exec -array q_row "SELECT text_block FROM $1 WHERE id = $2 ORDER BY text_seq" {
		append o_text $q_row(text_block)
	}
	return $o_text
' LANGUAGE 'pltcl';

 -- largetext exists just to hold an id and a dummy 'lgtext' attribute.
 -- This table's trigger function provides for inserting and updating
 -- into largetext_block.  The text input to lgtext actually gets
 -- broken into chunks and stored in largetext_block.
 -- Deletes to this table will chain to largetext_block automatically
 -- by referential integrity on the id attribute.
 -- Selects have to be done using the getlgtext function.
CREATE TABLE largetext (
	id				INTEGER PRIMARY KEY,
	lgtext		TEXT -- dummy field
);
COMMENT ON TABLE largetext IS 'Holds large text';

 -- This table must have the field names as they are.
 -- These attribute names are expected by put/getlgtext.
CREATE TABLE largetext_block (
	id					INTEGER NOT NULL
						REFERENCES largetext
						ON DELETE CASCADE,
						
	text_seq			INTEGER NOT NULL,
	
	text_block		TEXT,
	
	PRIMARY KEY (id, text_seq)
);
COMMENT ON TABLE largetext_block IS 'Holds blocks of text for table largetext';
CREATE SEQUENCE largetext_seq;

 -- SELECT:
 -- SELECT id AS the_id FROM largetext;
 -- SELECT getlgtext('largetext_block', id) FROM largetext WHERE id = the_id;

 -- INSERT:
 -- INSERT INTO largetext (lgtext) values ('.......');

 -- DELETE:
 -- DELETE FROM largetext WHERE id = someid;
 -- deletes from largetext and, by referential
 -- integrity, all associated block rows from largetext_block.
CREATE FUNCTION largetext_trigfun() RETURNS OPAQUE AS '
	set i_t {}
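	# Escape quotes and backslashes in the new lgtext before building queries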
	regsub -all {([\\''\\\\])} $NEW($2) {\\\\\\1} i_t
	switch $TG_op {
		INSERT {
			spi_exec "SELECT nextval(''largetext_seq'') AS new_id"
			set NEW($1) $new_id
			spi_exec "SELECT putlgtext(''largetext_block'', $new_id, ''$i_t'') AS rcode"
			if { $rcode != 0 } then { return SKIP }
		}
		UPDATE {
			if { $NEW($2) != {} } then {
				spi_exec "DELETE FROM largetext_text WHERE id = $OLD($1)"
				spi_exec "SELECT putlgtext(''largetext_block'', $OLD($1), ''$NEW($2)'') AS rcode"
				if { $rcode != 0 } then { return SKIP }
			}
		}
	}
	set NEW($2) "ok"
	return [array get NEW]
' LANGUAGE 'pltcl';

 -- Set the function as trigger for table largetext
CREATE TRIGGER largetext_trig BEFORE INSERT OR UPDATE
ON largetext FOR EACH ROW EXECUTE
PROCEDURE largetext_trigfun(id,lgtext);



I had to use the regsub function calls to add \ escaping for literal quotes
and backslashes.  What a pain!  If anyone can try this code and suggest ways
to improve its speed, I'd be happy.
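
For reference, here is a quick round trip following the usage comments above
(the string literal is just a stand-in for a real multi-megabyte TEXTAREA
value, and I'm assuming this is the first row so largetext_seq hands out id 1):

INSERT INTO largetext (lgtext) VALUES ('...lots and lots of text...');
 -- the BEFORE trigger assigns the id and chunks the text into largetext_block
SELECT getlgtext('largetext_block', id) FROM largetext WHERE id = 1;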

===

Subject: Re: [HACKERS] Article on MySQL vs. Postgres
From: Hannu Krosing <hannu@tm.ee>
Date: Wed, 05 Jul 2000 14:51:39 +0300

Robert B. Easter" wrote:
> 
> 
> While it is slow, I've been able to store unlimited amounts of text into
> the database by using the following code. 

Thanks for a really nice example!

> I've tested inserting over 4
> megabytes from a TEXTAREA web form using PHP.  When inserting such massive
> amounts of text, you will have to wait a while, but it will eventually succeed
> if you don't run out of memory.  If you do run out of memory, the backend
> terminates gracefully and the transaction aborts/rolls back.
> 
> -- Load the PGSQL procedural language
> -- This could also be done with the createlang script/program.
> -- See man createlang.
> CREATE FUNCTION plpgsql_call_handler()
>         RETURNS OPAQUE AS '/usr/local/pgsql/lib/plpgsql.so'
>         LANGUAGE 'C';
> 
> CREATE TRUSTED PROCEDURAL LANGUAGE 'plpgsql'
>         HANDLER plpgsql_call_handler
>         LANCOMPILER 'PL/pgSQL';

You probably meant pl/tcl, as all your code is using that?

===

Subject: Re: [HACKERS] Article on MySQL vs. Postgres
From: "Robert B. Easter" <reaster@comptechnews.com>
Date: Wed, 5 Jul 2000 09:14:48 -0400

On Wed, 05 Jul 2000, Hannu Krosing wrote:
> "Robert B. Easter" wrote:
> > -- Load the PGSQL procedural language
> > -- This could also be done with the createlang script/program.
> > -- See man createlang.
> > CREATE FUNCTION plpgsql_call_handler()
> >         RETURNS OPAQUE AS '/usr/local/pgsql/lib/plpgsql.so'
> >         LANGUAGE 'C';
> > 
> > CREATE TRUSTED PROCEDURAL LANGUAGE 'plpgsql'
> >         HANDLER plpgsql_call_handler
> >         LANCOMPILER 'PL/pgSQL';
> 
> You probably meant pl/tcl as all your code is using that ?

Yes, I meant to say this:

 -- Load the TCL procedural language
 -- This could also be done with the createlang script/program.
 -- See man createlang.
CREATE FUNCTION pltcl_call_handler()
	RETURNS OPAQUE AS '/usr/local/pgsql/lib/pltcl.so'
	LANGUAGE 'C';
	
CREATE TRUSTED PROCEDURAL LANGUAGE 'pltcl'
	HANDLER pltcl_call_handler
	LANCOMPILER 'PL/tcl';


===

Subject: Re: [HACKERS] Article on MySQL vs. Postgres
From: "Michael Mayo" <michael-a-mayo@worldnet.att.net>
Date: Wed, 5 Jul 2000 00:50:06 -0400

----- Original Message -----
From: "Tim Perdue" <tperdue@valinux.com>

> Before I do that I want to confirm the major problem I had w/postgres:
> the 8K tuple limit.

    Just wanted to point out that this is not *exactly* true.  While the
default limit is 8k, all that is required to change it to 32k is to change
one line of text in config.h (BLCKSZ from 8k to 32k).  This is pointed out
in the FAQ.  So I would really consider the *default* to be 8k and the
*limit* to be 32k.  IMHO 32k is good enough for 99% of tuples in a typical
bulletin-board-like application.  It is not unreasonable to reject posts >
32k in size.  Though you might want to evaluate performance with 32k tuples;
it might increase or decrease depending on the application.

===

Subject: Re: [HACKERS] Article on MySQL vs. Postgres
From: Benjamin Adida <ben@mit.edu>
Date: Wed, 05 Jul 2000 10:48:01 -0400

on 7/4/00 8:30 PM, Tim Perdue at tperdue@valinux.com wrote:

> This is for what most people do with PHP and databases - run
> semi-critical medium-traffic sites. Anyone running a mission-critical
> site would have to look elsewhere for true robustness. I would not at
> this time recommend any serious, life-threatening app run on either
> database.

To the person who owns the web site, data is always critical. Does
www.yahoo.com store "life-threatening" information? Not really, but if you
lose your yahoo.com email, the "oh sorry, our database doesn't support
transactions" excuse doesn't cut it.

> I took a real-world page from our site
> <http://sourceforge.net/forum/forum.php?forum_id=1> and made it portable
> to both databases. Of course, I could not import the "body" of the
> message into postgres because of the 8k limitation, so the body had to
> be dropped from both databases.
> 
> The "nested" view of this page requires joins against three tables and
> some recursion to show submessages.

Some recursion? That is interesting. Do you mean multiple queries to the
database? I don't see any reason to need more than one query to show nested
messages in a forum. Using stored procedures to create sort keys at insertion
or selection time is the efficient way to do this. Ah, but MySQL doesn't have
stored procedures.
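
To sketch what I mean (hypothetical table and column names, not your schema):
give each message a sort key derived from its ancestors at insert time, and
the whole nested view comes back from a single ORDER BY:

CREATE TABLE forum_message (
    msg_id    INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES forum_message,
    sort_key  TEXT,   -- e.g. '000017.000043' for a reply to message 17
    body      TEXT
);

-- At insert time, build the reply's key from its parent's key
-- (this is the part you would wrap in a stored procedure or trigger):
INSERT INTO forum_message (msg_id, parent_id, sort_key, body)
SELECT 43, 17, p.sort_key || '.' || lpad('43', 6, '0'), 'a reply'
FROM forum_message p WHERE p.msg_id = 17;

-- The nested view is then one query, no recursion:
SELECT msg_id, body FROM forum_message ORDER BY sort_key;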

>> Did you use connection pooling (a lot
> 
> I used persistent connections, yes. Without them, Postgres' showing was
> far poorer, with mysql showing about 2x the performance.

Well, there must be some issue with your setup, because 10 requests per
second on Postgres on reads only is far from what I've seen on much wimpier
boxes than yours. Maybe I should look some more into how pconnect really
handles connection pooling; I have heard bad things that need to be
verified.

> I'd really love to see a case where a real-world page view requires 4x
> the queries on MySQL. If you are doing subselects like that on a website
> in real-time you've got serious design problems and postgres would
> fold-up and quit under the load anyway.

I believe the "design problems" come up if you need subselects and you're
using MySQL. I've used Illustra/Informix, Oracle, and now Postgres to build
database-backed web sites, and subselects are a vital part of any
somewhat-complex web app. How exactly do subselects constitute a design
problem in your opinion?
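
For instance, something as ordinary as "show the forums with no posts in the
last month" is painful to express without a subselect (table names here are
hypothetical):

SELECT forum_id, forum_name
FROM forum
WHERE forum_id NOT IN (
    SELECT forum_id
    FROM forum_message
    WHERE posted_at > now() - '30 days'::interval
);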

> Specifically, what is the problem with my "intuition"? All I did in the
> prior message was report my results and ask for feedback before I post
> it.

Your intuition is that Postgres will be slower because it is slower than
MySQL at reads. I contend that:
    - Postgres 7.0 is much faster at reads than the numbers you've shown.
I've seen it be much faster on smaller boxes.
    - The slowdown you're seeing is probably due in no small part to the
implementation of pconnect(), the number of times it actually connects vs.
the number of times it goes to the pool, how large that pool gets, etc...
    - The write-inefficiencies of MySQL will, on any serious web site, cut
performance so significantly that it is simply not workable. I'm thinking of
the delayed updates on Slashdot, the 20-25 second page loads on SourceForge
for permission updating and such...

> Personally, I check every query in my PHP code. On the rare occasion
> that it fails, I show an error and get out. Even with postgres, I have
> always checked success or failure of a query and shown an appropriate
> error. Never in two years of programming PHP/postgres have I ever used
> commit/rollback, and I have written some extremely complex web apps
> (sourceforge being a prime example). Geocrawler.com runs on postgres and
> again, I NEVER saw any need for any kind of rollback at all.

Geez. So you never have two inserts or updates you need to perform at once?
*ever*? What happens if your second one fails? Do you manually attempt to
backtrack on the changes you've made?
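
The canonical case (hypothetical tables) is two writes that have to stand or
fall together:

BEGIN;
INSERT INTO orders (order_id, customer_id) VALUES (1001, 42);
INSERT INTO order_items (order_id, item_id, qty) VALUES (1001, 7, 3);
-- if the second INSERT fails, the transaction aborts and the first
-- INSERT is rolled back too, instead of leaving a half-written order
COMMIT;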

> The statelessness of the web pretty much obviates the needs for
> locks/rollbacks as each process is extremely quick and runs from start
> to finish instantly. It's not like the old days where you pull data down
> into a local application, work on it, then upload it again.
> 
> Only now, with some extremely complex stuff that we're doing on
> SourceForge would I like to see locks and rollbacks (hence my recent
> interest in benchmarking and comparing the two). Your average web
> programmer will almost never run into that in the short term.

This is simply false. If you're not using commit/rollbacks, you're either
cutting back on the functionality of your site, creating potential error
situations by the dozen, or you've got some serious design issues in your
system. Commit/Rollback is not an "advanced" part of building web sites. It
is a basic building block.

Telling your "average web programmer" to ignore transactions is like telling
your programmers not to free memory in your C programs because, hey, who
cares, you've got enough RAM for small programs, and they can learn to clean
up memory when they build "real" systems!

Of all things, this is precisely the type of thinking that crushes the
credibility of the open-source community. Enterprise IT managers understand
in great detail the need for transactions. Web sites actually need *more*
reliable technology, because you don't have that stateful session: you
sometimes need to recreate rollback mechanisms across pages by having
cleanup processes. Building this on a substrate that doesn't support the
basic transaction construct is impossible and irresponsible.

> Your own strong biases are shown in your message. I do this stuff
> because I'm curious and want to find out for myself. Most readers will
> find it interesting as I did. Few will switch from MySQL to postgres or
> vice versa because of it.

My bias? Well, my company doesn't have a vested interest in promoting
Postgres or MySQL. Before I started using Postgres, I looked into MySQL.
You're right if you think my evaluation didn't take too long. If I have
preferences, they're based purely on engineering decisions. That's not the
same as "my company just publicly endorsed MySQL, and check it out, we think
MySQL is better than Postgres."

Note that I am *not* saying that you're doing this on purpose, I'm just
saying that you're going to have a really hard time proving your
objectivity.

> Another clarification: PHPBuilder is owned by internet.com, a competitor
> of VA Linux/Andover.

PHP folks have a bias, too: PHP was built with MySQL in mind; it even ships
with MySQL drivers (and not Postgres). PHP's mediocre connection pooling
limits Postgres performance.

I'm happy to continue this discussion, but here's what I've noticed from
having had this argument many many times: if you don't believe that
transactions are useful or necessary, that subselects and enforced foreign
key constraints are hugely important, then this discussion will lead
nowhere. We simply begin with different assumptions.

I only suggest that you begin your evaluation article by explaining:
    - your assumptions
    - the fact that the page you used for benchmarking was originally built
for MySQL, and thus makes no use of more advanced Postgres features.

===






