[Thu Apr 24 22:49:50 2003] vs.pl: reading archive from ../html/blog/archive
[Thu Apr 24 22:49:50 2003] vs.pl: found 1257 postings
[Thu Apr 24 22:49:50 2003] vs.pl: setting up VectorSpace
[Thu Apr 24 22:49:50 2003] vs.pl: building Index
[Thu Apr 24 22:49:50 2003] vs.pl: Making word list:
[Thu Apr 24 22:50:01 2003] vs.pl: Finished with word list
[Thu Apr 24 22:50:01 2003] vs.pl: doing queries
[Thu Apr 24 22:50:01 2003] vs.pl: making cosines for doc 0, 2003-04-24_22-33
[Thu Apr 24 22:50:02 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_22-33.sim
[Thu Apr 24 22:50:02 2003] vs.pl:
[Thu Apr 24 22:50:02 2003] vs.pl: making cosines for doc 1, 2003-04-24_22-29
[Thu Apr 24 22:50:03 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_22-29.sim
[Thu Apr 24 22:50:03 2003] vs.pl:
[Thu Apr 24 22:50:03 2003] vs.pl: making cosines for doc 2, 2003-04-24_22-26
[Thu Apr 24 22:50:05 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_22-26.sim
[Thu Apr 24 22:50:05 2003] vs.pl:
[Thu Apr 24 22:50:05 2003] vs.pl: making cosines for doc 3, 2003-04-24_15-53
[Thu Apr 24 22:50:06 2003] vs.pl: writing ../html/blog/archive/sims/2003-04-24_15-53.sim
[Thu Apr 24 22:50:06 2003] vs.pl:
[Thu Apr 24 22:50:06 2003] vs.pl: making cosines for doc 4, 2003-04-24_10-41
So the first optimisation could be to go semi-incremental: there is no need to redo all 1200+ entries on each run, as most of them don't really change.
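A minimal sketch of what "semi-incremental" could look like, in Python rather than the Perl of vs.pl. It assumes, going by the log above, that each posting is a file named by its timestamp and that its similarity list lives in a `sims` directory under the same name plus `.sim`. The function name `stale_postings` is made up for illustration. Only postings whose `.sim` file is missing or older than the posting itself get picked for recomputation; it is only "semi" incremental because a new posting can still change the rankings of old ones.

```python
import os
from pathlib import Path


def stale_postings(archive_dir, sims_dir):
    """Return names of postings whose .sim file is missing or outdated.

    Assumption: postings are plain files directly inside archive_dir,
    and the matching similarity file is sims_dir/<posting name>.sim.
    """
    stale = []
    for posting in sorted(Path(archive_dir).iterdir()):
        if not posting.is_file():
            continue  # skips sub-directories such as the sims directory
        sim = Path(sims_dir) / (posting.name + ".sim")
        # Recompute when there is no .sim yet, or the posting was edited later.
        if not sim.exists() or sim.stat().st_mtime < posting.stat().st_mtime:
            stale.append(posting.name)
    return stale
```

A run would then loop over `stale_postings("../html/blog/archive", "../html/blog/archive/sims")` instead of over all postings.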
The next step would surely be to pre-generate the word index, which could also be used for the keyword search.
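As a rough illustration of a pre-generated word index, here is a Python sketch (not the actual vs.pl code) of an inverted index that is built once, stored on disk, and reused for keyword search. The JSON file format, the whitespace tokenisation, and all function names are assumptions made for the example.

```python
import json


def build_index(docs):
    """Map each word to {doc_id: count}; docs is {doc_id: text}."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            counts = index.setdefault(word, {})
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return index


def save_index(index, path):
    """Write the index to disk so later runs can skip the word-list phase."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)


def load_index(path):
    """Read a previously saved index back into memory."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def keyword_search(index, query):
    """Return the sorted doc ids that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(index.get(words[0], {}))
    for word in words[1:]:
        hits &= set(index.get(word, {}))
    return sorted(hits)
```

The same word-to-document counts can feed both the similarity vectors and the keyword search, which is the reuse the paragraph above is after.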
Each actual 'query' takes roughly a second to perform, which is rather fast considering that it compares one doc to 1200+ others.
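For reference, this is roughly what one such 'query' involves. The sketch below is in Python and uses raw term counts as vector components; whatever weighting vs.pl really applies is not shown in the log, so treat the weighting and the function names as assumptions.

```python
import math
from collections import Counter


def term_vector(text):
    """Sparse term-count vector for one document (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, top_n=5):
    """Compare one document against all others, best matches first."""
    query = vectors[query_id]
    scores = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    scores.sort(reverse=True)
    return scores[:top_n]
```

Each call to `most_similar` does one cosine per other document, so the cost of a query grows linearly with the size of the archive.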
And no, keeping the whole VectorSpace in memory between user queries is not really an option. The Perl VM size for this is about 125MB on my machine, which is OK for running it every so often, but not for running it continuously.
Another gripe is the fact that it's difficult to incrementally add new words (or docs containing new words) to a 'pre-generated' space.
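One possible way around this, sketched in Python and not taken from vs.pl: keep the vocabulary append-only and store documents as sparse vectors keyed by column number. A new word then just gets the next free column, and vectors written earlier stay valid. The `Vocabulary` class and `sparse_vector` function are hypothetical names for this example.

```python
class Vocabulary:
    """Append-only mapping from word to column number."""

    def __init__(self):
        self.columns = {}

    def id_for(self, word):
        # A word seen for the first time gets the next free column,
        # so column numbers handed out earlier never change.
        if word not in self.columns:
            self.columns[word] = len(self.columns)
        return self.columns[word]

    def __len__(self):
        return len(self.columns)


def sparse_vector(text, vocab):
    """Term-count vector as {column: count}, growing vocab as needed."""
    vec = {}
    for word in text.lower().split():
        col = vocab.id_for(word)
        vec[col] = vec.get(col, 0) + 1
    return vec
```

This only covers raw counts. If the space uses weights that depend on the whole collection, such as inverse document frequency, those would still have to be refreshed when documents are added.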
[ by Martin ]