Tomcat - Understanding Apache Solr scoring for a non-math background
I am learning Apache Solr scoring methods here. It is said here that I should go to this page to understand the scoring formula. I am not from a maths background, so it is hard for me to understand high-level math. Is there an alternative way to understand the basic scoring formula in an easy manner?
Lucene uses a number of features to score documents, but scoring relies mainly on the similarity between the document and the query. I have explained the idea of calculating similarity between documents before in more or less simple words, so allow me to explain it here briefly.
If you have a dictionary of words, you may organize them as a long, long list. Mathematicians are used to using the term "vector" for sequences, including lists of words, so let's call it a vector of words:
[abbat, about, bananas, ...]
We can express each document in our collection as a vector, where each element stands for the number of occurrences of the corresponding word in the document. For example, if a document has 1 occurrence of the word "bananas", 2 occurrences of "about" and no occurrences of "abbat", the document vector will start as follows:
[0, 2, 1, ...]
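To make the counting step concrete, here is a minimal Python sketch. The three-word vocabulary, the sample sentence and the helper name `to_vector` are made up for illustration, and splitting on whitespace is a simplification; a real search engine analyzes text far more carefully.

```python
from collections import Counter

# Hypothetical three-word dictionary, matching the example above.
vocabulary = ["abbat", "about", "bananas"]

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    # Lowercase and split on whitespace (an assumption, not real tokenization).
    counts = Counter(text.lower().split())
    # Words outside the vocabulary are ignored; missing words count as 0.
    return [counts[word] for word in vocabulary]

document = "about bananas and about nothing else"
vector = to_vector(document, vocabulary)
print(vector)  # [0, 2, 1]
```

Each position in the printed list corresponds to the word at the same position in the vocabulary.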
Now the interesting part comes. We can assume that if 2 documents have a lot of words in common, they are about similar topics, and if they have only a few in common, these documents are different. Since we know that documents may be represented as vectors of words, we can calculate the similarity of documents as the similarity of their vectors.
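There are many ways to calculate how similar 2 vectors are. Lucene uses a quite simple one: cosine distance (more precisely, the cosine of the angle between the vectors, usually called cosine similarity). The idea comes from the geometrical representation of vectors and the angle between them: if you draw 2 vectors in 2D space, you'll see that the more similar the coordinates of these vectors are, the smaller the angle between them is. That's where the cosine comes from, but in fact all you should care about is the number of same words in the 2 documents.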
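For readers who prefer code to formulas, the cosine measure can be sketched in a few lines of Python. This is only the plain textbook cosine, not Lucene's actual scoring code (which, as noted earlier, uses a number of other features); the function name and the sample vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    # Dot product: grows when the same words occur in both documents.
    dot = sum(x * y for x, y in zip(a, b))
    # Vector lengths, used to cancel out the effect of document size.
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

doc_a = [0, 2, 1]
doc_b = [0, 4, 2]  # same words as doc_a, in the same proportions
doc_c = [3, 0, 0]  # no words in common with doc_a

print(cosine_similarity(doc_a, doc_b))  # very close to 1: same direction
print(cosine_similarity(doc_a, doc_c))  # 0.0: nothing in common
```

A value near 1 means the two documents use the same words in similar proportions, and 0 means they share no words at all.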
When it comes to search engines, queries are treated as documents: a document vector is built for them and used to find similar (i.e. relevant) documents in the collection.
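Putting the pieces together, a toy search might look like the sketch below. It is an illustration of the idea only, under the same assumptions as before: the document names, their texts and both helper functions are hypothetical, and Lucene's real ranking is considerably more involved.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Turn text into a list of word counts, one per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary and tiny collection, for illustration only.
vocabulary = ["abbat", "about", "bananas"]
documents = {
    "doc1": "bananas about bananas",
    "doc2": "abbat about abbat",
    "doc3": "about about about",
}

# The query is treated exactly like a document.
query_vector = to_vector("bananas", vocabulary)

# Score every document against the query, then sort best-first.
scores = {
    name: cosine_similarity(query_vector, to_vector(text, vocabulary))
    for name, text in documents.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])  # doc1, the only document that mentions "bananas"
```

Here only doc1 contains the query word, so it gets the only non-zero score and comes out on top.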
tomcat solr lucene