From rosssd at gmail.com Fri Feb 1 08:47:19 2008 From: rosssd at gmail.com (Sean Ross) Date: Fri, 1 Feb 2008 08:47:19 +0000 Subject: [rucksack-devel] State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> Message-ID: <5bef28df0802010047h5885877ey73b2cca42f7af2a5@mail.gmail.com> On 1/31/08, Arthur Lemmens wrote: > I followed your suggestion and changed the class and slot indexes: they > now map to objects instead of object ids, so the garbage collector will > add the indexed objects to the root set. Hi Arthur, Sorry about the delay and lack of a patch. My hard drive recently failed and I've been slowly building back up to where I was; unfortunately this also resulted in a lost patch. cheers, sean. Thou shalt run backups daily, thou shalt run backups daily.... From alemmens at xs4all.nl Fri Feb 1 16:44:05 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Fri, 01 Feb 2008 17:44:05 +0100 Subject: [rucksack-devel] State of the nation and heap patch In-Reply-To: <5bef28df0802010047h5885877ey73b2cca42f7af2a5@mail.gmail.com> References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0802010047h5885877ey73b2cca42f7af2a5@mail.gmail.com> Message-ID: Hi Sean, > Sorry about the delay and lack of a patch No problem at all. You already did the hard work by finding the bug and suggesting how to fix it. Looking forward to meeting you at the ECLM in April... 
Arthur From alemmens at xs4all.nl Sat Feb 2 17:12:53 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Sat, 02 Feb 2008 18:12:53 +0100 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> Message-ID: [Replying to the Rucksack mailing list] Sean Ross wrote: > The bad news is that there appears to be an issue with p-btrees and > gc. I've attached some code which exercises the bug. Good catch, thanks a lot for the report. This turned out to be a subtle bug in the garbage collector: when it deletes object ids from the object table (because the objects are dead and we may want to reuse their ids later for other objects), it should also remove that object from the cache. If it doesn't, there's a possibility that the object id will be reused later for a new object and the cache will still refer to the old in-memory object. Some kind of identity crisis ;-) I fixed this bug on my machine, but common-lisp.net ssh is inaccessible at the moment so I haven't fixed it in CVS yet. With this bug fix, all data corruption bugs that I'm aware of have been fixed now. But I'm sure there are more problems hidden in some dirty corners. If you find them, please let me know. Arthur From ch-rucksack at bobobeach.com Sat Feb 2 17:27:34 2008 From: ch-rucksack at bobobeach.com (Cyrus Harmon) Date: Sat, 2 Feb 2008 09:27:34 -0800 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> Message-ID: Great! Does that mean the rucksack-devel team is going to tackle performance issues next? 
:) Cyrus From alemmens at xs4all.nl Sat Feb 2 17:40:47 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Sat, 02 Feb 2008 18:40:47 +0100 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> Message-ID: Cyrus Harmon wrote: > Great! Does that mean the rucksack-devel team is going to tackle > performance issues next? :) Hehe. Improving performance is definitely on my list, but I'm not sure if it's on the /top/ of my list right now. Anyway... 
if you have concrete use cases or benchmarks that you'd like to see faster, feel free to send them to the list. Arthur From ch-rucksack at bobobeach.com Sat Feb 2 18:12:42 2008 From: ch-rucksack at bobobeach.com (Cyrus Harmon) Date: Sat, 2 Feb 2008 10:12:42 -0800 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> Message-ID: Yeah, the biggest performance problem I have is importing items from the NCBI taxonomy database which consists of organism name, id, etc... arranged into a tree of a million or so objects. I'll package up some sort of release of this and circulate the URL to the list. Right now it takes a few hours to import a million objects or so. It would be nice to get this down to a few minutes. Thanks, Cyrus From alemmens at xs4all.nl Mon Feb 11 11:52:31 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Mon, 11 Feb 2008 12:52:31 +0100 Subject: [rucksack-devel] Tutorial by Brad Beveridge Message-ID: Hi, I just added a Rucksack tutorial by Brad Beveridge to the repository. See http://common-lisp.net/cgi-bin/viewcvs.cgi/rucksack/doc/?root=rucksack Feedback welcome, as always. Arthur From alemmens at xs4all.nl Mon Feb 11 11:53:23 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Mon, 11 Feb 2008 12:53:23 +0100 Subject: [rucksack-devel] Rucksack? 
Message-ID: You promised me a "separate mail about Rucksack"? From edi at agharta.de Mon Feb 11 12:15:55 2008 From: edi at agharta.de (Edi Weitz) Date: Mon, 11 Feb 2008 13:15:55 +0100 Subject: [rucksack-devel] Tutorial by Brad Beveridge In-Reply-To: (Arthur Lemmens's message of "Mon, 11 Feb 2008 12:52:31 +0100") References: Message-ID: On Mon, 11 Feb 2008 12:52:31 +0100, "Arthur Lemmens" wrote: > I just added a Rucksack tutorial by Brad Beveridge to the repository. > See http://common-lisp.net/cgi-bin/viewcvs.cgi/rucksack/doc/?root=rucksack I haven't looked at the tutorial itself yet, but I think it's great that there is one! I'd suggest mentioning it on the Rucksack homepage. And at the same time removing the "Rucksack does not run" line. See also here... :) http://blog.viridian-project.de/2008/02/11/why-choose-elephant/ From alemmens at xs4all.nl Mon Feb 11 12:54:04 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Mon, 11 Feb 2008 13:54:04 +0100 Subject: [rucksack-devel] Version 0.1.16 Message-ID: * Version 0.1.16 Added tutorial by Brad Beveridge. Improved performance by decreasing persistent consing for btrees and using a lazy-cache. Fixed some small bugs. Added a few handy functions and macros. In more detail: Created a new doc directory and added the tutorial by Brad Beveridge. Added P-PUSH and P-POP. Improved btree efficiency by switching to a different data structure for the bindings. Instead of using a persistent cons for each key/value pair, we now put the keys and values directly into the bnode vector. This speeds up most btree operations because it reduces persistent consing when adding new values and it reduces indirections when searching for keys. Renamed BTREE-NODE to BNODE, BTREE-NODE-INDEX to BNODE-BINDINGS, BTREE-NODE-INDEX-COUNT to BNODE-NR-BINDINGS, FIND-BINDING-IN-NODE to FIND-KEY-IN-NODE. Fix a missing argument bug in REMOVE-CLASS-INDEX. Added a LAZY-CACHE which just clears the entire hash table whenever the cache gets full. 
This improves memory usage, because the normal cache queue kept track of a lot of objects that for some reason couldn't be cleaned up by the implementation's garbage collector. Added the convenience macros RUCKSACK-DO-CLASS and RUCKSACK-DO-SLOT. Made RUCKSACK-DELETE-OBJECT an exported symbol of the RUCKSACK package. Fix a bug in TEST-NON-UNIQUE-BTREE: it should call CHECK-NON-UNIQUE-CONTENTS instead of CHECK-CONTENTS. From alemmens at xs4all.nl Mon Feb 11 13:19:47 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Mon, 11 Feb 2008 14:19:47 +0100 Subject: [rucksack-devel] Tutorial by Brad Beveridge In-Reply-To: References: Message-ID: Edi Weitz wrote: > I haven't looked at the tutorial itself yet, but I think it's great > that there is one! I'd suggest mentioning it on the Rucksack > homepage. And at the same time removing the "Rucksack does not run" > line. Done. Thanks for the suggestion. > See also here... :) > > http://blog.viridian-project.de/2008/02/11/why-choose-elephant/ Hmm. I can't deny that Rucksack hasn't seen much "active development" last year. Arthur From alemmens at xs4all.nl Mon Feb 11 13:48:10 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Mon, 11 Feb 2008 14:48:10 +0100 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> Message-ID: Cyrus Harmon wrote: > Yeah, the biggest performance problem I have is importing items from > the NCBI taxonomy database which consists of organism name, id, etc... > arranged into a tree of a million or so objects. I'll package up some > sort of release of this and circulate the URL to the list. Right now > it takes a few hours to import a million objects or so. It would be > nice to get this down to a few minutes. 
I did some work on improving Rucksack performance last week, using various ways of importing a 43 MB XML file (Jim Breen's Japanese dictionary at http://ftp.cc.monash.edu.au/pub/nihongo/JMdict.gz, which is basically also "a tree of a million or so objects") as test cases. The most important changes are that p-btrees don't use persistent conses anymore to represent bindings, that the default cache now doesn't use a queue to keep track of most-recent-use information and that Rucksack looks at the basic slot index information before it even starts digging into the btrees. These changes improved the overall performance and the maximum memory usage for my test cases by factors varying between 2 and 20. To give you a rough idea: one representative test case (with class indexes on the 5 most frequently used classes and string indexes on 4 slots, resulting in a 300 MB rucksack) now takes 18 minutes on my machine. I'd be interested to know what kind of performance improvements you see with the new version (0.1.16). By the way: turning off the Rucksack garbage collector when importing large amounts of data is also a good idea. But you knew that already... Arthur From ch-rucksack at bobobeach.com Mon Feb 11 18:22:29 2008 From: ch-rucksack at bobobeach.com (Cyrus Harmon) Date: Mon, 11 Feb 2008 10:22:29 -0800 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> Message-ID: <35C7719A-DD92-426F-B67C-D4CE7691547E@bobobeach.com> Hi Arthur, Thanks for these changes. My initial tests suggest that the new version is faster, but not overwhelmingly so. I'll try to do some more rigorous benchmarking (and profiling) and see what I can come up with. 
My initial profiling efforts suggest that we're spending too much time looking up indexed entries in the b-tree caches. I would have thought that the time spent looking in the b-tree caches would be less than the time actually consing up the object and doing whatever else PCL needs to do, BICBW. More later today or tomorrow. Thanks again, Cyrus From alemmens at xs4all.nl Tue Feb 12 11:26:53 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Tue, 12 Feb 2008 12:26:53 +0100 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: <35C7719A-DD92-426F-B67C-D4CE7691547E@bobobeach.com> References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> <35C7719A-DD92-426F-B67C-D4CE7691547E@bobobeach.com> Message-ID: Cyrus Harmon wrote: > Thanks for these changes. My initial tests suggest that the new > version is faster, but not overwhelmingly so. I'll try to do some more > rigorous benchmarking (and profiling) and see what I can come up with. I expect that most of the time is caused by class and slot indexing, but it would be interesting to test that first. For example, maybe you could time how much time it takes to import your data without any indexing at all? And then with class indexing but no slot indexing? Do you have slot indexes where relatively few slot values map to many objects? In that case, the current implementation is far too slow, because it uses a plain list to represent the set of all btree values that belong to one key. I'm working on changing that to a different data structure, but I haven't finished that yet. 
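To illustrate why a plain list per key gets slow when a few keys own many values, here is a minimal in-memory sketch in plain Common Lisp. It is not Rucksack code and does not use persistent data structures; MAKE-INDEX, LIST-INDEX-INSERT, SET-INDEX-INSERT and the matching delete functions are hypothetical names that only model the idea, under the assumption that inserts have to check for duplicates.

```lisp
;;; Toy model of a non-unique index: KEY -> collection of values.
;;; Plain in-memory Common Lisp; none of these names are Rucksack functions.

(defun make-index ()
  (make-hash-table :test #'equal))

;; Variant 1: the values for one key are kept in a plain list.
(defun list-index-insert (index key value)
  ;; PUSHNEW scans the list to avoid duplicates: O(n) per insert.
  (pushnew value (gethash key index))
  index)

(defun list-index-delete (index key value)
  ;; DELETE also has to walk the list: O(n) per delete.
  (setf (gethash key index) (delete value (gethash key index)))
  index)

;; Variant 2: the values for one key are kept in a hash table used as a set.
(defun set-index-insert (index key value)
  (let ((set (or (gethash key index)
                 (setf (gethash key index) (make-hash-table)))))
    ;; Expected constant time per insert; duplicates collapse automatically.
    (setf (gethash value set) t)
    index))

(defun set-index-delete (index key value)
  (let ((set (gethash key index)))
    (when set
      (remhash value set))
    index))
```

With n values stored under one key, each list insert or delete may scan up to n elements, so loading them all costs on the order of n squared comparisons, while the set variant stays roughly constant per operation. Which structure would suit a persistent btree best is a separate question that this sketch does not answer.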
Arthur From ch-rucksack at bobobeach.com Tue Feb 12 20:01:55 2008 From: ch-rucksack at bobobeach.com (Cyrus Harmon) Date: Tue, 12 Feb 2008 12:01:55 -0800 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> <35C7719A-DD92-426F-B67C-D4CE7691547E@bobobeach.com> Message-ID: Yes, turning off the indices does seem to be helping the load time. Based on current progress, I'd guess that this will bring the load time down into the 30 min. range instead of the two hour range. Is there a way to add the indices back once the data is loaded? It's still not as fast as I would like, but one step at a time... No, the values being indexed are pretty spread out, so I doubt that's the problem. Thanks again for your help, Cyrus From alemmens at xs4all.nl Tue Feb 12 20:34:53 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Tue, 12 Feb 2008 21:34:53 +0100 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> <35C7719A-DD92-426F-B67C-D4CE7691547E@bobobeach.com> Message-ID: Cyrus Harmon wrote: > Yes, turning off the indices does seem to be helping the load time. > Based on current progress, I'd guess that this will bring the load > time down into the 30 min. range instead of the two hour range. > > Is there a way to add the indices back once the data is loaded? Not for class indices (and maybe that's not even possible at all, in the general case). But you can add a slot index to a slot (of a class that has a class index) by adding something like :index :string-index to the slot definition and re-evaluating the entire class definition. I usually just compile and load the file that contains the class definitions. See also the files test-index-1a.lisp and test-index-1b.lisp in the test directory. Arthur From ch-rucksack at bobobeach.com Tue Feb 12 21:09:15 2008 From: ch-rucksack at bobobeach.com (Cyrus Harmon) Date: Tue, 12 Feb 2008 13:09:15 -0800 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> <35C7719A-DD92-426F-B67C-D4CE7691547E@bobobeach.com> Message-ID: <96D93BB4-AC99-4DB2-BC1F-6B024005FD2B@bobobeach.com> What do class indexes buy us? We can locate an instance (given an identifier of some sort I presume) quickly with or without an index, right? 
The indices allow us to do rucksack-map-instances, but it's not clear to me why this needs to be an index per se. A hash-table or other (unordered) could do the trick just as well, although the mapping would then no longer be in order. It would be nice if the index functionality were exposed through functions and methods allowing for adding/removing indices, rather than by modifying the class definition (the use case I'm thinking is allowing user operations for bulk loads that would remove the indices and then add them back). Why wouldn't it be possible to add class indices after the fact? Thanks, Cyrus From alemmens at xs4all.nl Tue Feb 12 21:35:31 2008 From: alemmens at xs4all.nl (Arthur Lemmens) Date: Tue, 12 Feb 2008 22:35:31 +0100 Subject: [rucksack-devel] Re: Fwd: State of the nation and heap patch In-Reply-To: <96D93BB4-AC99-4DB2-BC1F-6B024005FD2B@bobobeach.com> References: <5bef28df0710311000u3e10bdbbn5cde1228d5b54ab5@mail.gmail.com> <5bef28df0710311332p5039929eh98062094935997e5@mail.gmail.com> <5bef28df0711040820u76216a5bw50b1d771d18f7c12@mail.gmail.com> <35C7719A-DD92-426F-B67C-D4CE7691547E@bobobeach.com> <96D93BB4-AC99-4DB2-BC1F-6B024005FD2B@bobobeach.com> Message-ID: Cyrus Harmon wrote: > What do class indexes buy us? They let us find all instances of a class (and they implicitly tell the garbage collector that instances of that class are alive, even if nothing else refers to such instances). This is not always necessary (for example if all instances of a class are always referred to by other persistent objects), but it can be very handy. > We can locate an instance (given an identifier of some sort I > presume) quickly with or without an index, right? Yes (assuming I understand your question). The object table always lets us locate an instance quickly, given an object identifier. > The indices allow us to do rucksack-map-instances, but it's not > clear to me why this needs to be an index per se. A hash-table or > other (unordered) could do the trick just as well Yes, that's right. Or some other persistent datastructure that can represent a set. But at the moment we only have persistent conses, persistent arrays and persistent b-trees. Hmm, now that you mention it... We could probably just push every new instance of an indexed class on a persistent list, instead of doing the whole btree insertion dance. Very good point, thank you! (The only problem with this idea is that it would make deleting an object from an index an O(n) instead of an O(log n) operation.) > although the mapping would then no longer be in order. 
Yeah, but we don't need that for class indices, do we? > It would be nice if the index functionality were exposed through > functions and methods allowing for adding/removing indices, rather > than by modifying the class definition Yes, I've been thinking about this too. The disadvantage of this would be that the (Rucksack-specific part of the) class definition no longer corresponds to the Rucksack reality. But I agree with you that the advantages of exposing such functions are probably greater than the disadvantage. > Why wouldn't it be possible to add class indices after the fact? Hmm... The only way to do it would be to iterate over the entire object table. Which is not impossible, but it is a bit expensive. I was also thinking that you wouldn't want to index an object that's dead (i.e. unreferenced by other objects), but I guess that could be solved by doing a complete garbage collection immediately before adding a class index. So: it's probably not impossible, but it would be quite expensive. Thanks for the feedback. Very interesting. Arthur
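The idea of adding a class index after the fact by iterating over the entire object table can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the "object table" here is an ordinary hash table from integer ids to structures, TOY-OBJECT and BUILD-CLASS-INDEX are hypothetical names, and the LIVE-P flag stands in for the result of a full garbage collection run beforehand. None of this is Rucksack's actual API.

```lisp
;;; Toy model only: the object table is a hash table from integer ids to
;;; TOY-OBJECT structures. None of these names are Rucksack functions.

(defstruct toy-object
  kind     ; symbol naming the object's class in this model
  live-p)  ; stands in for "still reachable after a full GC"

(defun build-class-index (object-table kind)
  "Scan every entry of OBJECT-TABLE and return the sorted list of ids
of live objects whose kind is KIND. The scan is O(size of the table)."
  (let ((ids '()))
    (maphash (lambda (id object)
               (when (and (toy-object-live-p object)
                          (eq (toy-object-kind object) kind))
                 (push id ids)))
             object-table)
    (sort ids #'<)))
```

The cost is proportional to the size of the whole table rather than to the number of instances of the class, which matches the "not impossible, but quite expensive" assessment above.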