Lucene - Document similarity framework


I am creating an application that searches for similar documents in a database; e.g. the user uploads a document (text, image, etc.), and we query the application for similar ones.

I have already created the necessary algorithms for the process (fingerprinting, feature extraction, hashing, hash comparison, etc.); I'm looking for a framework that couples all of these.
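
To make the steps named above concrete, here is a minimal sketch of feature extraction, hashing and hash comparison for plain text. It is only an illustration, not the actual algorithms used in the application: word shingles stand in for the feature extractor, SHA-1 truncated to 64 bits stands in for the hash, Jaccard overlap stands in for the comparison, and the names `shingles`, `fingerprint` and `jaccard` are hypothetical.

```python
import hashlib


def shingles(text, k=3):
    """Feature extraction: overlapping k-word windows of the lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def fingerprint(text, k=3):
    """Hashing: map every shingle to a stable 64-bit integer."""
    return {
        int.from_bytes(hashlib.sha1(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text, k)
    }


def jaccard(a, b):
    """Hash compare: shared hashes divided by all hashes, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = fingerprint("the quick brown fox jumps")
doc_b = fingerprint("the quick brown fox sleeps")
print(jaccard(doc_a, doc_b))  # two of the four distinct shingles are shared
```

A framework for this kind of search would need to store such fingerprints per document and run the comparison efficiently across the whole collection instead of pair by pair.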

For example, if I were to implement it in Lucene, I would do the following:

  • create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
  • then add the created elements to the Lucene index
  • and use the MoreLikeThis class to find similar documents
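
The three steps above can be sketched without Lucene as a tiny in-memory index. This is a stand-in to show the shape of the pipeline, not Lucene's API: `TinyIndex`, `analyze` and `more_like_this` are hypothetical names, and the ranking simply counts shared terms, which is much cruder than the term weighting a real search engine applies.

```python
from collections import Counter, defaultdict


class TinyIndex:
    """Hypothetical stand-in for a search index; this is not Lucene's API."""

    def __init__(self, analyze):
        self.analyze = analyze            # custom "tokenizer"/"stemmer" step
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.docs = {}                    # document id -> set of its terms

    def add(self, doc_id, content):
        """Index a document under the terms produced by the analyzer."""
        terms = set(self.analyze(content))
        self.docs[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def more_like_this(self, doc_id, limit=5):
        """Rank the other documents by how many terms they share with doc_id."""
        scores = Counter()
        for term in self.docs[doc_id]:
            for other in self.postings[term]:
                if other != doc_id:
                    scores[other] += 1
        return scores.most_common(limit)


def analyze(content):
    # Placeholder feature extractor: lowercased words. A real one would emit
    # fingerprint or feature tokens for text, images and so on.
    return content.lower().split()


index = TinyIndex(analyze)
index.add("a", "red apple pie")
index.add("b", "green apple pie")
index.add("c", "blue sky")
print(index.more_like_this("a"))  # [('b', 2)]
```

The point of the sketch is that once features are turned into terms, similarity search reduces to a term lookup, which is why a term-based engine can be bent to this purpose.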

So Lucene might be a good choice, but as far as I know, Lucene is not meant to be a document similarity search engine, but rather a term-based search engine.

My question is: are there any applications/frameworks that might fit the problem described above?

Thanks, krisy

Update: it seems the process I described above is called content-based media (sound, image, video) retrieval.

There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/poweredby (LIRE, alike, etc.), but I still haven't found a dedicated framework ...

Since you're already using Lucene, you might take a look at Solr. I realize it's not a dedicated framework for your purpose either, but it does add stuff on top of Lucene that comes in quite handy. Given the pluggability of Lucene, its track record, and the fact that there are a lot of useful resources out there, Solr might get the job done.

Also, the answer that @mindas pointed to links to a blog post describing the technical details of how to accomplish your goal with Solr (but you might have read that in the meantime).

