Lucene - Document similarity framework
I am creating an application that searches for similar documents in a database; e.g. the user uploads a document (text, image, etc.), and the application is queried for similar ones.
I have already created the necessary algorithms for the process (fingerprinting, feature extraction, hashing, hash compare, etc.); I'm looking for a framework that couples all of these.
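To make the "hash compare" step concrete, here is a minimal sketch. It assumes the fingerprints are 64-bit values and that similarity is judged by Hamming distance; both are assumptions for illustration, since the actual algorithms are not shown here. The class and method names are hypothetical.

```java
public class FingerprintCompare {

    // Number of bit positions in which two 64-bit fingerprints differ.
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // Treats two fingerprints as similar when they differ in at most
    // maxDistance bits. The threshold is a tuning parameter (assumed).
    public static boolean isSimilar(long a, long b, int maxDistance) {
        return hammingDistance(a, b) <= maxDistance;
    }
}
```

For example, `FingerprintCompare.isSimilar(fpA, fpB, 3)` would accept two documents whose fingerprints differ in three bits or fewer.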
For example, if I were to implement this in Lucene, I would do the following:
- create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
- then add the created elements to the Lucene index
- and use the MoreLikeThis class to find similar documents
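The idea behind these three steps can be sketched without Lucene itself. The following is plain Java, not the Lucene API: it assumes each document has already been reduced to a set of feature tokens (the output of the custom "tokenizer"), stores them in a small inverted index, and ranks candidates by Jaccard similarity. The class `FeatureIndex` and its methods are hypothetical names, and Jaccard scoring is a stand-in; Lucene's MoreLikeThis scores differently.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FeatureIndex {

    // Inverted index: feature token -> ids of the documents containing it.
    private final Map<String, Set<String>> postings = new HashMap<>();
    // Forward index: document id -> its set of feature tokens.
    private final Map<String, Set<String>> docFeatures = new HashMap<>();

    // Step 2: add the extracted features of one document to the index.
    public void add(String docId, Collection<String> features) {
        Set<String> unique = new HashSet<>(features);
        docFeatures.put(docId, unique);
        for (String feature : unique) {
            postings.computeIfAbsent(feature, k -> new HashSet<>()).add(docId);
        }
    }

    // Step 3: return up to 'limit' document ids sharing at least one
    // feature with the query, best Jaccard score first.
    public List<String> moreLikeThis(Collection<String> queryFeatures, int limit) {
        Set<String> query = new HashSet<>(queryFeatures);

        // Count shared features per candidate using the inverted index.
        Map<String, Integer> shared = new HashMap<>();
        for (String feature : query) {
            for (String docId : postings.getOrDefault(feature, Collections.emptySet())) {
                shared.merge(docId, 1, Integer::sum);
            }
        }

        // Jaccard = |intersection| / |union|.
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : shared.entrySet()) {
            int union = query.size() + docFeatures.get(e.getKey()).size() - e.getValue();
            score.put(e.getKey(), (double) e.getValue() / union);
        }

        // Sort by descending score, breaking ties by document id.
        List<String> ranked = new ArrayList<>(score.keySet());
        ranked.sort(Comparator.comparingDouble((String id) -> -score.get(id))
                .thenComparing(Comparator.naturalOrder()));
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }
}
```

In a real Lucene setup, the feature tokens would be emitted by a custom analyzer and the ranking delegated to MoreLikeThis; the sketch only shows why a term-based index can serve as a similarity index once documents are expressed as feature "terms".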
So, Lucene might be a good choice - but as far as I know, Lucene is not meant to be a document similarity search engine, but rather a term-based search engine.
My question is: are there any applications/frameworks that might fit the above-mentioned problem?
Thanks, krisy
Update: It seems the process I described above is called content-based media (sound, image, video) retrieval.
There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/poweredby (lire, alike, etc.), but I still haven't found a dedicated framework ...
Since you're using Lucene, you might take a look at Solr. I realize it's not a dedicated framework for your purpose either, but it does add stuff on top of Lucene that comes in quite handy. Given the pluggability of Lucene, its track record, and the fact that there are a lot of useful resources out there, Solr might get the job done.
Also, the answer that @mindas pointed to links to a blog post describing the technical details of how to accomplish your goal with Solr (but you have probably read it in the meantime).