This is part of The Pile, a partial archive of some open source mailing lists and newsgroups.
Date: Mon, 08 Jan 2001 09:25:33 -0800
To: modperl@apache.org
From: Bill Moseley <moseley@hank.org>
Subject: Caching search results

I've got a mod_perl application that's using swish-e. A query from swish may return hundreds of results, but I only display them 20 at a time. There's currently no session control on this application, so when the client asks for the next page (or jumps to page number 12, for example), I have to run the original query again and then extract just the results for the page the client wants to see. Seems like some basic design problems there.

Anyway, I'd like to avoid the repeated queries in mod_perl, of course. So, in the short term, I was thinking about caching search results (which are just sorted lists of file names) using a simple file-system db -- that is, (carefully) building file names out of the queries and writing them to some directory tree. Then I'd use cron to purge LRU files every so often. I think this approach will work fine, instead of a dbm or rdbms approach.

So I'm asking for some advice:

- Is there a better way to do this?

- There was some discussion in the past about performance and how many files to put in each directory. Are there some commonly accepted numbers for this?

- For file names, does it make sense to use an MD5 hash of the query string? It would be nice to get an even distribution of files in each directory.

- Can someone offer any help with the locking issues? I was hoping to avoid shared locking during reading -- but maybe I'm worrying too much about the time it takes to ask for a shared lock when reading. I could wait a second for the shared lock and, if I don't get it, run the query again. But it seems that if one process creates the file and begins to write without LOCK_EX and then gets blocked, other processes might not see the entire file when reading. Would it be better to avoid the locks and instead use a temp file when creating and then do an (atomic?) rename?
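A minimal sketch of the MD5-hashed-filename plus temp-file-and-rename approach Bill describes (the cache directory path and the two-hex-digit bucketing are assumptions for illustration; Digest::MD5 and File::Temp are standard CPAN modules):

```perl
use strict;
use Digest::MD5 qw(md5_hex);
use File::Temp qw(tempfile);

my $cache_dir = '/var/cache/swish';   # assumed location

sub cache_path {
    my $query = shift;
    my $hash  = md5_hex($query);
    # use the first two hex digits as a subdirectory, spreading
    # files evenly across 256 buckets
    return "$cache_dir/" . substr($hash, 0, 2) . "/$hash";
}

sub write_cache {
    my ($query, $results) = @_;   # $results is an arrayref of file names
    my $path  = cache_path($query);
    my ($dir) = $path =~ m{^(.*)/};
    mkdir $dir unless -d $dir;

    # write to a temp file in the same directory, then rename over the
    # final name; rename() is atomic within one filesystem, so readers
    # either see the old file or the complete new one -- no read locks
    my ($fh, $tmp) = tempfile(DIR => $dir);
    print $fh "$_\n" for @$results;
    close $fh or die "close: $!";
    rename $tmp, $path or die "rename: $!";
}
```

Because the rename is atomic, readers can open the cache file without taking a shared lock; a concurrent writer never leaves a half-written file visible under the final name.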
===
Date: Mon, 08 Jan 2001 10:10:25 -0800
From: Perrin Harkins <perrin@primenet.com>
To: Bill Moseley <moseley@hank.org>
CC: modperl@apache.org
Subject: Re: Caching search results

Bill Moseley wrote:
> Anyway, I'd like to avoid the repeated queries in mod_perl, of course. So,
> in the short term, I was thinking about caching search results (which are
> just sorted lists of file names) using a simple file-system db -- that is,
> (carefully) building file names out of the queries and writing them to some
> directory tree. Then I'd use cron to purge LRU files every so often. I
> think this approach will work fine, instead of a dbm or rdbms approach.

Always start with CPAN. Try Tie::FileLRUCache or File::Cache for starters. A dbm would be fine too, but more trouble to purge old entries from.

===
Date: Mon, 08 Jan 2001 14:26:24 -0500
To: modperl@apache.org
From: Simon Rosenthal <srosenthal@northernlight.com>
Subject: Re: Caching search results

At 10:10 AM 1/8/01 -0800, you wrote:
>Bill Moseley wrote:
> > Anyway, I'd like to avoid the repeated queries in mod_perl, of course. So,
> > in the short term, I was thinking about caching search results (which are
> > just sorted lists of file names) using a simple file-system db -- that is,
> > (carefully) building file names out of the queries and writing them to some
> > directory tree. Then I'd use cron to purge LRU files every so often. I
> > think this approach will work fine, instead of a dbm or rdbms approach.
>
>Always start with CPAN. Try Tie::FileLRUCache or File::Cache for
>starters. A dbm would be fine too, but more trouble to purge old entries
>from.

An RDBMS is not much more trouble to purge, if you have a time-of-last-update field.
And if you're ever going to access your cache from multiple servers, you definitely don't want to deal with locking issues for DBM and filesystem-based solutions ;=(

===
Date: Mon, 8 Jan 2001 14:07:12 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Perrin Harkins <perrin@primenet.com>
cc: Bill Moseley <moseley@hank.org>, modperl@apache.org
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Perrin Harkins wrote:
> Bill Moseley wrote:
> > Anyway, I'd like to avoid the repeated queries in mod_perl, of course. So,
> > in the short term, I was thinking about caching search results (which are
> > just sorted lists of file names) using a simple file-system db -- that is,
> > (carefully) building file names out of the queries and writing them to some
> > directory tree. Then I'd use cron to purge LRU files every so often. I
> > think this approach will work fine, instead of a dbm or rdbms approach.
> Always start with CPAN. Try Tie::FileLRUCache or File::Cache for
> starters. A dbm would be fine too, but more trouble to purge old entries
> from.

You could always have a second dbm file that keeps track of TTL issues for your data keys, so purging would simply be a series of delete calls. Granted, you would have another DBM file to maintain.

===
Date: Mon, 08 Jan 2001 17:27:10 -0500
To: Sander van Zoest <sander@covalent.net>
From: Simon Rosenthal <srosenthal@northernlight.com>
Subject: Re: Caching search results
Cc: modperl@apache.org

At 02:02 PM 1/8/01 -0800, Sander van Zoest wrote:
>On Mon, 8 Jan 2001, Simon Rosenthal wrote:
>
> > an RDBMS is not much more trouble to purge, if you have a
> > time-of-last-update field. And if you're ever going to access your cache
> > from multiple servers, you definitely don't want to deal with locking
> > issues for DBM and filesystem-based solutions ;=(
>
>RDBMS does bring replication and backup issues. The DBM and FS solutions
>definitely have their advantages.
>It would not be too difficult to write
>a serialized daemon that makes requests over the net to a DBM file.
>
>What in your experience makes you pick the overhead of an RDBMS for a simple
>cache in favor of DBM, FS solutions?

We cache user session state (basically using Apache::Session) in a small (maybe 500K records) mysql database, which is accessed by multiple web servers. We made an explicit decision NOT to replicate or back up this database - it's very dynamic, and the only user-visible consequence of a loss of the database would be an unexpected login screen - we felt this was a tradeoff we could live with. We have a hot spare mysql instance which can be brought into service immediately, if required. I couldn't see writing a daemon as you suggested offering us any benefits under those circumstances, given that RDBMS access is built into Apache::Session.

I would not be as cavalier as this if we were doing anything more than using the RDBMS as a fast cache. With decent hardware (which we have - Sun Enterprise servers with nice fast disks and enough memory), the typical record retrieval time is around 10ms, which, even if slow compared to a local FS access, is plenty fast enough in the context of the processing we do for dynamic pages.

Hope this answers your question.

===
Date: Mon, 8 Jan 2001 14:02:13 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Simon Rosenthal <srosenthal@northernlight.com>
cc: modperl@apache.org
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Simon Rosenthal wrote:
> an RDBMS is not much more trouble to purge, if you have a
> time-of-last-update field. And if you're ever going to access your cache
> from multiple servers, you definitely don't want to deal with locking
> issues for DBM and filesystem-based solutions ;=(

RDBMS does bring replication and backup issues. The DBM and FS solutions definitely have their advantages.
It would not be too difficult to write a serialized daemon that makes requests over the net to a DBM file.

What in your experience makes you pick the overhead of an RDBMS for a simple cache in favor of DBM, FS solutions?

===
Date: Mon, 8 Jan 2001 17:03:12 -0500
From: DeWitt Clinton <dclinton@avacet.com>
To: Perrin Harkins <perrin@primenet.com>
Cc: Bill Moseley <moseley@hank.org>, modperl@apache.org
Subject: Re: Caching search results

On Mon, Jan 08, 2001 at 10:10:25AM -0800, Perrin Harkins wrote:
> Always start with CPAN. Try Tie::FileLRUCache or File::Cache for
> starters. A dbm would be fine too, but more trouble to purge old
> entries from.

If you find that File::Cache works for you, then you may also want to check out the simplified and improved version in the Avacet code, which additionally offers a unified service model for mod_perl applications. Services are available for templates (either Embperl or Template Toolkit), XML-based configuration, object caching, connecting to the Avacet application engine, standardized error handling, dynamically dispatching requests to modules, and many other things.

===
To: Sander van Zoest <sander@covalent.net>
Date: Mon, 8 Jan 2001 14:39:34 -0800 (PST)
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Sander van Zoest wrote:
> > starters. A dbm would be fine too, but more trouble to purge old entries
> > from.
>
> You could always have a second dbm file that keeps track of TTL issues
> for your data keys, so purging would simply be a series of delete calls.
> Granted, you would have another DBM file to maintain.

I find it kind of painful to trim dbm files, because most implementations don't relinquish disk space when you delete entries. You end up having to actually make a new dbm file with the "good" contents copied over to it in order to slim it down.
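The copy-and-replace compaction described above is tedious but mechanical; a sketch using DB_File (file names are illustrative, and the same atomic-rename trick from earlier in the thread keeps readers safe):

```perl
use strict;
use DB_File;
use Fcntl qw(O_RDONLY O_RDWR O_CREAT);

# Copy the surviving entries into a fresh dbm, then rename it over
# the old one. This reclaims the disk space that deleted keys left
# behind, since most dbm implementations never shrink in place.
tie my %old, 'DB_File', 'cache.db',     O_RDONLY,        0640 or die "$!";
tie my %new, 'DB_File', 'cache.db.new', O_RDWR|O_CREAT, 0640 or die "$!";

while (my ($key, $val) = each %old) {
    $new{$key} = $val;
}

untie %old;
untie %new;
rename 'cache.db.new', 'cache.db' or die "rename: $!";
```

Run from cron after the expiry pass has deleted stale keys, this keeps the dbm file from growing without bound.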
===
Date: Mon, 8 Jan 2001 15:38:41 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Simon Rosenthal <srosenthal@northernlight.com>
cc: modperl@apache.org
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Simon Rosenthal wrote:
> I couldn't see writing a daemon as you suggested offering us any
> benefits under those circumstances, given that RDBMS access is built into
> Apache::Session.

No, in your case I do not see a reason behind it either. ;-) Again, this shows that it all depends on the requirements and things you are willing to sacrifice.

===
Date: Mon, 8 Jan 2001 15:54:34 -0800 (PST)
From: Sander van Zoest <sander@covalent.net>
To: Perrin Harkins <perrin@primenet.com>
cc: Bill Moseley <moseley@hank.org>, modperl@apache.org
Subject: Re: Caching search results

On Mon, 8 Jan 2001, Perrin Harkins wrote:
> On Mon, 8 Jan 2001, Sander van Zoest wrote:
> > > starters. A dbm would be fine too, but more trouble to purge old entries
> > > from.
> > You could always have a second dbm file that keeps track of TTL issues
> > for your data keys, so purging would simply be a series of delete calls.
> > Granted, you would have another DBM file to maintain.
> I find it kind of painful to trim dbm files, because most implementations
> don't relinquish disk space when you delete entries. You end up having to
> actually make a new dbm file with the "good" contents copied over to it in
> order to slim it down.

Yeah, this is true. Some DBMs have special routines to fix these issues. You could use the gdbm_reorganize call to clean up those issues, for example (if you are using gdbm, that is). Just some quick pseudo code (I don't have a quick example ready here):

    use GDBM_File;

    my $gdbm = tie my %hash, 'GDBM_File', 'file.gdbm',
        &GDBM_WRCREAT|&GDBM_FAST, 0640 or die "$!";
    $gdbm->reorganize;

That definitely helps a lot.

===
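Simon's point earlier in the thread — that an RDBMS cache is easy to purge given a time-of-last-update field — comes down to one statement run from cron; a sketch with DBI (database, table, and column names are made up, and the one-hour window is arbitrary):

```perl
use strict;
use DBI;

# Purge cache rows untouched for more than an hour. The
# session_cache table and last_update column are illustrative;
# any timestamp column updated on each write would do.
my $dbh = DBI->connect('dbi:mysql:sessions', 'user', 'pass',
                       { RaiseError => 1 });
$dbh->do(q{
    DELETE FROM session_cache
    WHERE  last_update < NOW() - INTERVAL 1 HOUR
});
$dbh->disconnect;
```

Because the database serializes the deletes against concurrent reads and writes, none of the file-locking concerns from the filesystem approach apply here.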