modperl_caching_swishe_search_results

This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.



Date: Mon, 08 Jan 2001 09:25:33 -0800
To: modperl@apache.org
From: Bill Moseley <moseley@hank.org>
Subject: Caching search results

I've got a mod_perl application that's using swish-e.  A query from swish
may return hundreds of results, but I only display them 20 at a time.  

There's currently no session control on this application, and so when the
client asks for the next page (or to jump to page number 12, for example),
I  have to run the original query again, and then extract out just the
results for the page the client wants to see.

Seems like some basic design problems there.

Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
in the short term, I was thinking about caching search results (which is
just a sorted list of file names) using a simple file-system db -- that is,
(carefully) building file names out of the queries and writing them to some
directory tree.  Then I'd use cron to purge LRU files every so often.  I
think this approach will work fine instead of a dbm or rdbms approach.


So I'm asking for some advice:

- Is there a better way to do this?

- There was some discussion in the past about performance and how many
files to put in each directory.  Are there some commonly accepted numbers
for this?

- For file names, does it make sense to use an MD5 hash of the query string?
It would be nice to get an even distribution of files in each directory.

- Can someone offer any help with the locking issues?  I was hoping to
avoid shared locking during reading -- but maybe I'm worrying too much
about the time it takes to ask for a shared lock when reading.  I could
wait a second for the shared lock and if I don't get it I'll run the query
again.

But it seems like if one process creates the file and begins to write
without LOCK_EX and then gets blocked, then other processes might not see
the entire file when reading.

Would it be better to avoid the locks and instead use a temp file when
creating and then do an (atomic?) rename?
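
A rough sketch of that scheme -- the cache directory, the two-level split
of the MD5 digest, and the helper names are all illustrative, not part of
swish-e or mod_perl:

use Digest::MD5 qw(md5_hex);
use File::Path qw(mkpath);

my $cache_root = '/var/cache/search';    # illustrative location

# Hash the query to get an evenly distributed, filesystem-safe name,
# and use the first two hex digits as a subdirectory so no single
# directory grows too large.
sub cache_path {
    my ($query) = @_;
    my $digest = md5_hex($query);
    return ("$cache_root/" . substr($digest, 0, 2), $digest);
}

# Write the result list to a temp file, then rename() it into place.
# Within one filesystem the rename is atomic, so readers either see
# the old file or the complete new one -- no read lock needed.
sub write_cache {
    my ($query, @files) = @_;
    my ($dir, $name) = cache_path($query);
    mkpath($dir) unless -d $dir;
    my $tmp = "$dir/$name.tmp.$$";
    open my $fh, '>', $tmp or die "open $tmp: $!";
    print {$fh} map { "$_\n" } @files;
    close $fh or die "close $tmp: $!";
    rename $tmp, "$dir/$name" or die "rename $tmp: $!";
}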

===

Date: Mon, 08 Jan 2001 10:10:25 -0800
From: Perrin Harkins <perrin@primenet.com>
To: Bill Moseley <moseley@hank.org>
CC: modperl@apache.org
Subject: Re: Caching search results

Bill Moseley wrote:
> Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
> in the short term, I was thinking about caching search results (which is
> just a sorted list of file names) using a simple file-system db -- that is,
> (carefully) building file names out of the queries and writing them to some
> directory tree.  Then I'd use cron to purge LRU files every so often.  I
> think this approach will work fine instead of a dbm or rdbms approach.

Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
starters. A dbm would be fine too, but more trouble to purge old entries
from.
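
For example, a File::Cache lookup-or-populate wrapper would look roughly
like this (from memory of the module's interface -- check its docs; the
namespace, expiry, and run_swish_query() helper are made up):

use File::Cache;

my $cache = File::Cache->new({ namespace  => 'swish_results',
                               expires_in => 3600 });        # one hour

sub cached_results {
    my ($query) = @_;
    my $results = $cache->get($query);
    unless (defined $results) {
        $results = run_swish_query($query);   # hypothetical swish-e wrapper
        $cache->set($query, $results);
    }
    return $results;
}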

===

Date: Mon, 08 Jan 2001 14:26:24 -0500
To: modperl@apache.org
From: Simon Rosenthal <srosenthal@northernlight.com>
Subject: Re: Caching search results

At 10:10 AM 1/8/01 -0800, you wrote:
>Bill Moseley wrote:
> > Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
> > in the short term, I was thinking about caching search results (which is
> > just a sorted list of file names) using a simple file-system db -- that is,
> > (carefully) building file names out of the queries and writing them to some
> > directory tree.  Then I'd use cron to purge LRU files every so often.  I
> > think this approach will work fine instead of a dbm or rdbms approach.
>
>Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
>starters. A dbm would be fine too, but more trouble to purge old entries
>from.

an RDBMS is not much more trouble to purge, if you have a 
time-of-last-update field. And if you're ever going to access your cache 
from multiple servers, you definitely don't want to deal with  locking 
issues for DBM and filesystem based solutions ;=(
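
The purge itself can be a one-statement cron job, something like this
(connection details, table, and column names are made up):

use DBI;

# Run from cron; the schema and one-hour cutoff are illustrative.
my $dbh = DBI->connect('dbi:mysql:cache', 'user', 'password',
                       { RaiseError => 1 });
$dbh->do(q{DELETE FROM search_cache
           WHERE  last_update < NOW() - INTERVAL 1 HOUR});
$dbh->disconnect;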

===

Date: Mon, 8 Jan 2001 14:07:12 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Perrin Harkins <perrin@primenet.com>
cc: Bill Moseley <moseley@hank.org>, modperl@apache.org
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Perrin Harkins wrote:

> Bill Moseley wrote:
> > Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
> > in the short term, I was thinking about caching search results (which is
> > just a sorted list of file names) using a simple file-system db -- that is,
> > (carefully) building file names out of the queries and writing them to some
> > directory tree.  Then I'd use cron to purge LRU files every so often.  I
> > think this approach will work fine instead of a dbm or rdbms approach.
> Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
> starters. A dbm would be fine too, but more trouble to purge old entries
> from.

You could always have a second dbm file that can keep track of TTL issues
of your data keys, so it would simply be a series of delete calls.
Granted you would have another DBM file to maintain.
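
Something along these lines, say with DB_File -- the file names and the
one-hour TTL are only for illustration:

use DB_File;
use Fcntl;

# One DBM holds the cached data, a second records when each key was
# written; expired keys are deleted from both.
my (%data, %written);
tie %data,    'DB_File', 'cache.db', O_RDWR|O_CREAT, 0640, $DB_HASH or die $!;
tie %written, 'DB_File', 'ttl.db',   O_RDWR|O_CREAT, 0640, $DB_HASH or die $!;

my $ttl = 3600;
for my $key (grep { time - $written{$_} >= $ttl } keys %written) {
    delete $data{$key};
    delete $written{$key};
}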

===

Date: Mon, 08 Jan 2001 17:27:10 -0500
To: Sander van Zoest <sander@covalent.net>
From: Simon Rosenthal <srosenthal@northernlight.com>
Subject: Re: Caching search results
Cc: modperl@apache.org

At 02:02 PM 1/8/01 -0800, Sander van Zoest wrote:
>On Mon, 8 Jan 2001, Simon Rosenthal wrote:
>
> > an RDBMS is not much more trouble to purge, if you have a
> > time-of-last-update field. And if you're ever going to access your cache
> > from multiple servers, you definitely don't want to deal with  locking
> > issues for DBM and filesystem based solutions ;=(
>
>An RDBMS does bring replication and backup issues. The DBM and FS solutions
>definitely have their advantages. It would not be too difficult to write
>a serialized daemon that makes requests over the net to a DBM file.
>
>What in your experience makes you pick the overhead of an RDBMS for a simple
>cache over DBM or FS solutions?

We cache user session state (basically using Apache::Session) in a small 
(maybe 500K records) MySQL database, which is accessed by multiple web 
servers. We made an explicit decision NOT to replicate or back up this 
database - it's very dynamic, and the only user-visible consequence of a 
loss of the database would be an unexpected login screen - we felt this was 
a tradeoff we could live with.  We have a hot spare MySQL instance which 
can be brought into service immediately, if required.

  I couldn't see writing a daemon as you suggested  offering us any 
benefits under those circumstances, given that RDBMS access is built into 
Apache::Session.
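
Tying a session through Apache::Session::MySQL looks roughly like this
(the DSN, credentials, and stored data below are placeholders, not an
actual configuration):

use Apache::Session::MySQL;

my $session_id = undef;   # or an existing id pulled from a cookie
my %session;
tie %session, 'Apache::Session::MySQL', $session_id,
    { DataSource     => 'dbi:mysql:sessions',   # placeholder DSN
      UserName       => 'web',
      Password       => 'secret',
      LockDataSource => 'dbi:mysql:sessions',
      LockUserName   => 'web',
      LockPassword   => 'secret' };

$session{results} = \@file_names;   # whatever state needs caching
untie %session;                     # flushes the record back to MySQL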

I would not be as cavalier as this if we were doing anything more than 
using the RDBMS as a fast cache. With decent hardware (which we have - Sun 
Enterprise servers  with nice fast disks and enough memory) the typical 
record retrieval time  is around 10ms, which  even if slow compared to a 
local FS access is plenty fast enough in the context of the processing we 
do for dynamic pages.

Hope this answers your question.

===


Date: Mon, 8 Jan 2001 14:02:13 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Simon Rosenthal <srosenthal@northernlight.com>
cc: modperl@apache.org
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Simon Rosenthal wrote:

> an RDBMS is not much more trouble to purge, if you have a 
> time-of-last-update field. And if you're ever going to access your cache 
> from multiple servers, you definitely don't want to deal with  locking 
> issues for DBM and filesystem based solutions ;=(

An RDBMS does bring replication and backup issues. The DBM and FS solutions
definitely have their advantages. It would not be too difficult to write
a serialized daemon that makes requests over the net to a DBM file.

What in your experience makes you pick the overhead of an RDBMS for a simple
cache over DBM or FS solutions?
  
===

Date: Mon, 8 Jan 2001 17:03:12 -0500
From: DeWitt Clinton <dclinton@avacet.com>
To: Perrin Harkins <perrin@primenet.com>
Cc: Bill Moseley <moseley@hank.org>, modperl@apache.org
Subject: Re: Caching search results

On Mon, Jan 08, 2001 at 10:10:25AM -0800, Perrin Harkins wrote:

> Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
> starters. A dbm would be fine too, but more trouble to purge old
> entries from.

If you find that File::Cache works for you, then you may also want to
check out the simplified and improved version in the Avacet code,
which additionally offers a unified service model for mod_perl
applications.  Services are available for templates (either Embperl or
Template Toolkit), XML-based configuration, object caching, connecting
to the Avacet application engine, standardized error handling,
dynamically dispatching requests to modules, and many other things.


===

To: Sander van Zoest <sander@covalent.net>
From: Perrin Harkins <perrin@primenet.com>
Date: Mon, 8 Jan 2001 14:39:34 -0800 (PST)
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Sander van Zoest wrote:
> > starters. A dbm would be fine too, but more trouble to purge old entries
> > from.
> 
> You could always have a second dbm file that can keep track of TTL issues
> of your data keys, so it would simply be a series of delete calls.
> Granted you would have another DBM file to maintain.

I find it kind of painful to trim dbm files, because most implementations
don't relinquish disk space when you delete entries.  You end up having to
actually make a new dbm file with the "good" contents copied over to it in
order to slim it down.
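
The copy-and-swap itself is short, e.g. with DB_File (the file names and
the keep_entry() filter are made up):

use DB_File;
use Fcntl;

# Copy only the entries still worth keeping into a fresh file, then
# rename it over the bloated original.
my (%old, %new);
tie %old, 'DB_File', 'cache.db',     O_RDONLY,       0640, $DB_HASH or die $!;
tie %new, 'DB_File', 'cache.db.new', O_RDWR|O_CREAT, 0640, $DB_HASH or die $!;

while (my ($key, $value) = each %old) {
    $new{$key} = $value if keep_entry($key);   # hypothetical filter
}

untie %old;
untie %new;
rename 'cache.db.new', 'cache.db' or die "rename: $!";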


===

Date: Mon, 8 Jan 2001 15:38:41 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Simon Rosenthal <srosenthal@northernlight.com>
cc: modperl@apache.org
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Simon Rosenthal wrote:

>   I couldn't see writing a daemon as you suggested  offering us any 
> benefits under those circumstances, given that RDBMS access is built into 
> Apache::Session.

No, in your case I do not see a reason behind it either. ;-)
Again this shows that it all depends on the requirements and things you
are willing to sacrifice.

===

Date: Mon, 8 Jan 2001 15:54:34 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Perrin Harkins <perrin@primenet.com>
cc: Bill Moseley <moseley@hank.org>, modperl@apache.org
Subject: Re: Caching search results


On Mon, 8 Jan 2001, Perrin Harkins wrote:

> On Mon, 8 Jan 2001, Sander van Zoest wrote:
> > > starters. A dbm would be fine too, but more trouble to purge old entries
> > > from.
> > You could always have a second dbm file that can keep track of TTL issues
> > of your data keys, so it would simply be a series of delete calls.
> > Granted you would have another DBM file to maintain.
> I find it kind of painful to trim dbm files, because most implementations
> don't relinquish disk space when you delete entries.  You end up having to
> actually make a new dbm file with the "good" contents copied over to it in
> order to slim it down.

Yeah, this is true. Some DBMs have special routines to fix these issues.  
You could use the gdbm_reorganize call to clean up those issues, for 
example (if you are using gdbm, that is).

Just some quick pseudo code (don't have a quick example ready here):

use GDBM_File;

# Tie a hash to the GDBM file, creating it if it doesn't exist.
my %hash;
my $gdbm = tie %hash, 'GDBM_File', 'file.gdbm',
           &GDBM_WRCREAT|&GDBM_FAST, 0640
    or die "Can't tie file.gdbm: $!";

# Reclaim the space left behind by deleted entries.
$gdbm->reorganize;

That definitely helps a lot.

===

the rest of The Pile (a partial mailing list archive)

doom@kzsu.stanford.edu