dbi_select_caching

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.

Subject: SELECT cacheing
From: Roger Espel Llima <espel@iagora.net>
Date: Thu, 7 Sep 2000 20:05:43 -0400

I've written a very small module to cache SELECT results from DBI
requests.  The interface looks like:

  use SelectCache;

  my $db = whatever::get_a_handle();
  my $st = qq{ select this, that ... };
  my $rows = SelectCache::select($db, $st, 180);

this returns an arrayref of rows (like the selectall_arrayref function),
and caches the result in a file, which gets reused for 180 seconds
instead of asking the db again.

The names of the cache files are the md5's of the select statement,
using the last hex digit as a subdirectory name.  There's no file
cleanup function; you can always do that from cron with find.

This is all very simple, but it's pretty useful in combination with
mod_perl, to speed up things like showing the "latest 10 posts", on
frequently accessed webpages.

The question now is: is there any interest in releasing this?  I could
write some minimal docs and give it a 'proper' module name, if there's
interest.

===

Subject: Re: SELECT cacheing
From: Peter Skipworth <skip@realestate.com.au>
Date: Fri, 8 Sep 2000 11:39:33 +1100 (EST)

Roger,

Sounds interesting - as a matter of fact, I'm doing it exactly the same
way on realestate.com.au, except I use the last 2 hex digits as the
directory name. Great minds think alike!

I don't know if many people realise this, but mySQL, for those of you
that use it, will soon include a select_cached function, for doing
exactly the same. Of course, a non-RDSM-specific solution would be great 
as well on the Perl side. I'm using Storable to write the query result to
disk as an array of hashes. I've implemented most of DBI's query methods,
including fetchrow, fetchrow_hashref, fetchrow_arrayref, etc, and it works
quite well and is transparent to the programmer. Page generation speed
improved by up to 100% in a lot of cases.  

I'd suggest that you include an (optionally
implementable) cache-cleaning routine, so that it's an 'all in one'
solution, as well as parameters such as 'maximum cache size' and
cache-timeout per store. Let me know if you'd like any help with getting
this completed, as I'd be happy to help. 

===

Subject: Re: SELECT cacheing
From: "G.W. Haywood" <ged@www.jubileegroup.co.uk>
Date: Fri, 8 Sep 2000 02:09:15 +0100 (BST)

Hi there,

On Thu, 7 Sep 2000, Roger Espel Llima wrote:

> I've written a very small module to cache SELECT results from DBI
> requests.
[snip]
> The question now is: is there any interest in releasing this?  I could
> write some minimal docs and give it a 'proper' module name, if there's
> interest.

I'm sure there is.  My last but one contract I did a very similar thing
but I implemented the cache as a tied DBM file.

===

Subject: Re: SELECT cacheing
From: Perrin Harkins <perrin@primenet.com>
Date: Thu, 7 Sep 2000 18:22:40 -0700 (PDT)

On Thu, 7 Sep 2000, Roger Espel Llima wrote:
> The question now is: is there any interest in releasing this?  I could
> write some minimal docs and give it a 'proper' module name, if there's
> interest.

I'd say this is probably useful to some people, so go ahead.  A few
suggestions: 
- Use the DBIx namespace for the module. 
- If possible, use some existing cache module for the storage, like
Apache::Session or one of the m/Cache/ modules on CPAN.
- Provide a safety check so that if a query brought back a few million
rows by accident you wouldn't try to write the whole mess to disk.
- Maybe try to support the other results interfaces in DBI?

===

Subject: Re: SELECT cacheing
From: Drew Taylor <dtaylor@vialogix.com>
Date: Fri, 08 Sep 2000 09:26:23 -0400

Roger Espel Llima wrote:
> 
> I've written a very small module to cache SELECT results from DBI
> requests.  The interface looks like:
> 
>   use SelectCache;
> 
>   my $db = whatever::get_a_handle();
>   my $st = qq{ select this, that ... };
>   my $rows = SelectCache::select($db, $st, 180);
> 
> this returns an arrayref of rows (like the selectall_arrayref function),
> and caches the result in a file, which gets reused for 180 seconds
> instead of asking the db again.
> 
> The names of the cache files are the md5's of the select statement,
> using the last hex digit as a subdirectory name.  There's no file
> cleanup function; you can always do that from cron with find.
> 
> This is all very simple, but it's pretty useful in combination with
> mod_perl, to speed up things like showing the "latest 10 posts", on
> frequently accessed webpages.
> 
> The question now is: is there any interest in releasing this?  I could
> write some minimal docs and give it a 'proper' module name, if there's
> interest.
I'm certainly interested. One question though - in the module do you
blindly use the cache? I ask because in my instance I display the
contents of a shopping cart on every page. And while only a few pages
change the cart contents, the cart listing does need to be current. How
do you handle this situation?

===

Subject: Re: SELECT cacheing
From: Peter Skipworth <skip@realestate.com.au>
Date: Sat, 9 Sep 2000 00:35:42 +1100 (EST)

I don't know about Roger, but in my situation queries are called as
follows.

my $queryhandle=Query("select blah from blah where blah")

the Query routine can be overloaded with a timeout value (a default
capable of being set), with a timeout of 0 meaning that the select
should never be cached and should always be selected live from the
database. I'd assume Roger would need to have something similar in the
module he's developing.
===

Subject: Re: SELECT cacheing
From: "Rodney Broom" <rbroom@home.com>
Date: Fri, 8 Sep 2000 08:02:22 -0700

Some good ideas, I think that this package might come out a bit thin though.
I've written a package that does arbitrary variable caching (like everybody
else). But it has a list of other bells and whistles. Things like cache
expiration and data refresh hooks. It's a pretty simple process.

>From there, I've have (but addmittedly don't use yet) a little DB package that
sits as an interface between the programmer and the DB, and incorporates things
like this caching package at the same time.

So you do:

$dbh = DB->new(...)
$sth = $dbh->prepare($q)
%results1 = $sth->fetch...

$sth = $dbh->prepare($q)
%results2 = $sth->fetch...

# Results are the same, %results2 comes from cache.

$sth = $dbh->prepare($insert)
$sth->execute

$sth = $dbh->prepare($q)
%diff_results = $sth->fetch...
# %diff_results is new data because the DB has changed.


Just some thoughts for y'all to mull over.

===

Subject: Re: SELECT cacheing
From: Tim Sweetman <tim@aldigital.co.uk>
Date: Fri, 08 Sep 2000 15:46:25 +0100

DeWitt - this started as a reply to the modperl mailing list, & I had a
look at File::Cache as my reply grew. See the end of this for the
relevant bit :) - think I've found a bug...

Drew Taylor wrote:
> 
> Roger Espel Llima wrote:
> >
> > I've written a very small module to cache SELECT results from DBI
> > requests.  The interface looks like:
> >
> >   use SelectCache;
> >
> >   my $db = whatever::get_a_handle();
> >   my $st = qq{ select this, that ... };
> >   my $rows = SelectCache::select($db, $st, 180);
> >
> > this returns an arrayref of rows (like the selectall_arrayref function),
> > and caches the result in a file, which gets reused for 180 seconds
> > instead of asking the db again.

"Storable" is probably a good way to store this sort of result.

> > The names of the cache files are the md5's of the select statement,
> > using the last hex digit as a subdirectory name.  There's no file
> > cleanup function; you can always do that from cron with find.
> >
> > This is all very simple, but it's pretty useful in combination with
> > mod_perl, to speed up things like showing the "latest 10 posts", on
> > frequently accessed webpages.

> > The question now is: is there any interest in releasing this?  I could
> > write some minimal docs and give it a 'proper' module name, if there's
> > interest.

This can be an extremely powerful approach to speeding up web
applications. We use a similar module which ended up fairly large - it
takes a method name & arguments, rather than an SQL string, meaning that
you can cache the result of operations other than SQL queries. It's also
grown several other enhancements: a mutual-exclusion-and-backoff
algorithm, so if one process is looking for the answer, others wait for
it rather than performing the same query at the same time, and several
ways to expire results that have become outdated (specifying lifetime,
or via timestamp files that get touched when major changes happen)

I always thought it'd make a good thing to CPANify but never got round
to it :(

The one thing I'd advise is: BE VERY CAREFUL WITH RACE CONDITIONS. You
can easily end up with something that will, in an unusual case, store
garbled data. I think you'd need to either use flock(), or write to
files then rename them, since rename is an atomic operation - and I
don't know how well that works under OSs other than UNIXes.

> I'm certainly interested. One question though - in the module do you
> blindly use the cache? I ask because in my instance I display the
> contents of a shopping cart on every page.

I think this would be tricky to use with a cache - cart contents will
change in real time, and there's one copy per user, so you'd need a way
of expiring the cached data according to user ID. 

Some RDBMSs get Large performance improvements from using placeholders
("select * from foo where userid = ?") and cacheing the statement
handles - I don't know if this applies to MySQL. With your sort of
application I'd try those measures before trying to use a complex cache
mechanism. Where up-to-date results are not critical, a cache mechanism
has great merit, IMHO.

Reading back along this thread,

Perrin Hawkins wrote:
> - Use the DBIx namespace for the module. 

Possibly. SQL is not the only application for this sort of tool, though
it seems to be the main one.

> - If possible, use some existing cache module for the storage, like
> Apache::Session or one of the m/Cache/ modules on CPAN.

IIRC, Apache::Session *generates* its own key for each session. This
isn't going to work with a MD5-keyed-cache, where the key is generated
from the SQL.

File::Cache seems to do something rather similar, though without the MD5
bit. However, from a cursory look at the code, I think it's vulnerable
to concurrency conditions such as:
+ process (a) reads a file whilst (b) is still writing it
+ processes (a) and (b) both write to a file simultaneously, possibly
corrupting it?!
  (this may be impossible, not sure)
+ process fails whilst writing a file (eg. process catches a KILL);
subsequent reads of that file
  will get fatal error

... which will pop up only Sometimes, usually on a busy site open to the
public :) This is Not Nice, assuming it's true.

Many CPAN things that do this sort of thing use tied hashes, which
(mostly, at least) won't work in a multi-process environment because
they don't handle concurrent reads & writes.

===

Subject: Re: SELECT cacheing
From: Perrin Harkins <perrin@primenet.com>
Date: Fri, 8 Sep 2000 11:25:08 -0700 (PDT)

On Fri, 8 Sep 2000, Tim Sweetman wrote:
> > - Use the DBIx namespace for the module. 
> 
> Possibly. SQL is not the only application for this sort of tool, though
> it seems to be the main one.

The module we're discussing is DBI-specific.  At least the interesting
part of it is.  The actual caching part is the second most re-invented
wheel on the mod_perl list, right behind templating systems.

> > - If possible, use some existing cache module for the storage, like
> > Apache::Session or one of the m/Cache/ modules on CPAN.
> 
> IIRC, Apache::Session *generates* its own key for each session.

It only does that if you don't hand it one.

===

Subject: Re: SELECT cacheing
From: Roger Espel Llima <espel@iagora.net>
Date: Fri, 8 Sep 2000 18:18:26 -0400

On Thu, Sep 07, 2000 at 06:22:40PM -0700, Perrin Harkins wrote:
> I'd say this is probably useful to some people, so go ahead.  A few
> suggestions: 
> - Use the DBIx namespace for the module. 

Sounds reasonable.  The question then is: what should the API be like?

The way it works right now is with an API of its own:
$arrayref_of_arrays = SelectCache::select($dbh, $st, $timeout);

It'd be nice to have an API that mimics the DBI one more closely, with
the different fetchrow_* and fetchall_* interfaces.  Then again, for a
module whose main purpose in life is to speed up SELECTs, maybe
restricting it to mimic the selectall_arrayref(), selectrow_array() and
selectcol_arrayref() would be enough.  

I really don't see much of a purpose on writing iterators a-la
fetchrow_* for a module that gets all the rows at the same time.  This
is actually an important thing to decide, because AFAICS
fetchrow_hashref is the only method that returns hashes and therefore
needs to care about column names.  So if we decide to support only the
select* interfaces, all we have to store are arrayrefs of arrayrefs.  If
we do support fetchrow_hashref, then we need either two kinds of storage
(so that the results of the same SELECT can be cached twice, if one
scrpit wants arrays and the other wants hashes), or a way to get arrays
from hashes or vice versa, which looks hard because one loses the order
and the other the names.

Another matter is: should this be a subclass of DBI, mimicing its API,
with an interface like:

my $dbc = DBI::SelectCache->new($db);
$dbc->expiration(180);
my $st = qq{ select ... };
my $rows = $dbc->selectall_arrayref($st);

and letting everything else (prepare, fetch*, etc) fall through to the
superclass, or should it be passing the expiration time as part of the
main function all, as I was doing before?  That woudl be something like:

my $st = qq{ select ... };
my $rows = DBI::SelectCache->selectall_arrayref($db, $st);

The first option fits in better with DBI, but the second is more
practical for the user, who doesn't need to create another object, and
can think of the expiration as a per-statement thing (which it is),
rather than per-connection.

Any suggestions?  I'm a bit lost on how to give this thing a good,
extendable interface.

> - If possible, use some existing cache module for the storage, like
> Apache::Session or one of the m/Cache/ modules on CPAN.

Others have suggested Storable.  I've used this one before, and I can
agree that it's probably a good solution.

Right now, what I'm doing is just join()ing the arrays with null
characters as separators, and using spilt() to get them back.  But null
chars are allowed in databases, so I agree that switching to Storable
would be a good idea.

> - Provide a safety check so that if a query brought back a few million
> rows by accident you wouldn't try to write the whole mess to disk.
> - Maybe try to support the other results interfaces in DBI?

Sounds good.  This means that this module would need a config file.

===

Subject: Re: SELECT cacheing
From: Roger Espel Llima <espel@iagora.net>
Date: Fri, 8 Sep 2000 18:38:00 -0400

On Fri, Sep 08, 2000 at 09:26:23AM -0400, Drew Taylor wrote:
> I'm certainly interested. One question though - in the module do you
> blindly use the cache? I ask because in my instance I display the
> contents of a shopping cart on every page. And while only a few pages
> change the cart contents, the cart listing does need to be current. How
> do you handle this situation?

the module gets the expiration time.  if it's 0 or negative, it ignores
the cache and reads straight from the db.

===

Subject: Re: SELECT cacheing
From: Roger Espel Llima <espel@iagora.net>
Date: Fri, 8 Sep 2000 18:43:22 -0400

On Fri, Sep 08, 2000 at 03:46:25PM +0100, Tim Sweetman wrote:
> This can be an extremely powerful approach to speeding up web
> applications. We use a similar module which ended up fairly large - it
> takes a method name & arguments, rather than an SQL string, meaning that
> you can cache the result of operations other than SQL queries. It's also
> grown several other enhancements: a mutual-exclusion-and-backoff
> algorithm, so if one process is looking for the answer, others wait for
> it rather than performing the same query at the same time, and several
> ways to expire results that have become outdated (specifying lifetime,
> or via timestamp files that get touched when major changes happen)

That sure sounds powerful!

> The one thing I'd advise is: BE VERY CAREFUL WITH RACE CONDITIONS. You
> can easily end up with something that will, in an unusual case, store
> garbled data. I think you'd need to either use flock(), or write to
> files then rename them, since rename is an atomic operation - and I
> don't know how well that works under OSs other than UNIXes.

I use the latter approach: write to a temp name, then rename.  I really
think this should be safe anywhere, I don't think any OS would be broken
enough to make a rename non atomic, and let other processes read garbled
stuff when the original file was written to and closed.

> Many CPAN things that do this sort of thing use tied hashes, which
> (mostly, at least) won't work in a multi-process environment because
> they don't handle concurrent reads & writes.

I really prefer Storable or something like it, for this application.  So
each cached value is a file, and we can use the filesystem and its
lastmod metadata, and standard tools like find or File::Find (or
whatever its name is) to clean up.

===

Subject: Re: SELECT cacheing
From: Perrin Harkins <perrin@primenet.com>
Date: Fri, 8 Sep 2000 16:34:44 -0700 (PDT)

On Fri, 8 Sep 2000, Roger Espel Llima wrote:
> > - If possible, use some existing cache module for the storage, like
> > Apache::Session or one of the m/Cache/ modules on CPAN.
> 
> Others have suggested Storable.  I've used this one before, and I can
> agree that it's probably a good solution.

Storable is just a way to turn a complex data structure into a single
scalar.  You still need to handle the file manipulation yourself.  Most of
the existing cache modules use Storable to serialize to a scalar and then
files or shared memory or a dbm for actual storage.

> This means that this module would need a config file.

Or some PerlSetVar directives in httpd.conf.

===

Subject: Re: SELECT cacheing
From: "Brian Cocks" <bcocks@multi-ad.com>
Date: Fri, 08 Sep 2000 18:44:58 -0500

I'm wondering how much improvement this caching is over the database caching
the parsed SQL statement and results (disk blocks)?

In Oracle, if you issue a query that is cached, it doesn't need to be parsed.
If the resulting blocks are also cached, there isn't any disk access.  If the
database is tuned, you should be able to get stuff out of cache over 90% of
the time.  I don't know what other databases other than Oracle do.

What are the advantages of implementing your own cache?  Is there any reason
other than speed?

===

Subject: Re: SELECT cacheing
From: Tim Bishop <timb@activespace.com>
Date: Fri, 8 Sep 2000 16:52:50 -0700 (PDT)

On Fri, 8 Sep 2000, Perrin Harkins wrote:

> On Fri, 8 Sep 2000, Roger Espel Llima wrote:
> > > - If possible, use some existing cache module for the storage, like
> > > Apache::Session or one of the m/Cache/ modules on CPAN.
> > 
> > Others have suggested Storable.  I've used this one before, and I can
> > agree that it's probably a good solution.
> 
> Storable is just a way to turn a complex data structure into a single
> scalar.  You still need to handle the file manipulation yourself.  Most of
> the existing cache modules use Storable to serialize to a scalar and then
> files or shared memory or a dbm for actual storage.

I would delegate the tieing, serialization, and locking to a module
like Apache::Session  (It uses Storable internally).

Then the user can specify their own favorite backing store and locking
mechanism by subclassing Apache::Session.

I would also look to the Memoize module for
ideas:  http://search.cpan.org/search?dist=Memoize

===

Subject: Re: SELECT cacheing
From: Greg Cope <gjjc@rubberplant.freeserve.co.uk>
Date: Sat, 09 Sep 2000 11:52:32 +0000

Brian Cocks wrote:
> 
> I'm wondering how much improvement this caching is over the database caching
> the parsed SQL statement and results (disk blocks)?
> 
> In Oracle, if you issue a query that is cached, it doesn't need to be parsed.
> If the resulting blocks are also cached, there isn't any disk access.  If the
> database is tuned, you should be able to get stuff out of cache over 90% of
> the time.  I don't know what other databases other than Oracle do.

Oracle is clever in this respect - and you are right if you tune
correctly you should hit the cache ...

But may other DB's do not have a shared executition plan / results
cache.  MySQL, msql and postgreSQL do not.  As far as I am aware Sybase
only has an execution plan cache that is per connection (could be wrong
here).

The MySQL developers have got a SELECT CACHED idea, where you can define
a statement as cacheable, and further calls to SELECT CACHE will return
the cached results - this is all on the todo list with no fixed date.
 
> What are the advantages of implementing your own cache?  Is there any reason
> other than speed?

Could be wrong but no - and it has a bad point in that it introduces an
added layer of complexity .....

I am certainly interested as accessing a local cache should be an order
of magnitude faster than asking a buzy DB.

===
the rest of The Pile (a partial mailing list archive)
doom@kzsu.stanford.edu