Use relative term frequency instead of absolute #317
The algorithm we are using here is BM25. You can learn more about it here. In your extreme case, the spam document would rank better, but not by too much. In practice, a document in which a word occurs more often is more relevant. After running a lot of searches against documents, we decided to keep the coefficient at 0.5. Therefore, changing |
|
Ok, will read about BM25 later. But how can
$denom = $tfWeight * ((1 - $dlWeight) + $dlWeight) + $tf;
make any difference as this is always |
|
Sorry for the late reply! You are correct; the expression always equals 1. The formula has been modified over the years, and this was overlooked. I remember that we changed the BM25 formula to ignore the document length, which is why a parameter is missing. |
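To spell out why the term collapses, here is a short derivation. It assumes (the thread does not confirm this mapping) that `$tf` plays the role of the term frequency f, `$tfWeight` the role of k_1, and `$dlWeight` the role of b in the textbook BM25 formula:

```latex
% Textbook BM25 denominator for term frequency f, document length |D|
% and average document length avgdl:
\[
  \text{denom}_{\text{BM25}}
    = f + k_1 \left( (1 - b) + b \cdot \frac{|D|}{\text{avgdl}} \right)
\]
% Replacing the length ratio |D| / avgdl by 1 (i.e. ignoring document length)
% gives the expression quoted above, which no longer depends on b:
\[
  \text{denom} = k_1 \bigl( (1 - b) + b \bigr) + f = k_1 + f
\]
```

Under that reading, `$dlWeight` has no effect on the score at all, whatever value it is set to.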
|
Hey @BlackbitDevs could you (re-) fix this, based on the current implementation? There have been a lot of changes in the last few weeks / months :) |
# Conflicts:
#   src/Engines/MysqlEngine.php
#   src/Engines/RedisEngine.php
#   src/Engines/SqliteEngine.php
#   src/TNTSearch.php
|
Have rebased the PR to latest master. |
The score gets calculated in tntsearch/src/TNTSearch.php, lines 113 to 122 (as of a763e66). The problem here is that $document['hit_count'] returns the absolute number of times a term occurs in the document. As a result, a spam document which simply contains every word 10 times gets a higher score than a document in which some terms occur more often than others.
Example:
Let's take the example from https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Example_of_tf%E2%80%93idf but with slightly changed hit counts.
We have 3 documents:
Document 1
Document 2
Spam document
We search for "example":
idf = log(3 / 2) = 0.1761, because the term "example" occurs in 2 of the 3 documents.
tf as it is currently calculated in TNTSearch:
So the spam document has a higher score than document 2, although the spam document simply contains all words with the same frequency. Document 2's most frequent word, on the other hand, is "example", so imho it should get the higher score.
According to Wikipedia's calculation, term frequency is
(There are also some other definitions for the denominator there, but I can't find the one used in TNTSearch.)
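For reference, the relative term frequency on that Wikipedia page, as I read it, divides the raw count of a term by the total number of terms in the document. The symbols below follow Wikipedia's notation, not TNTSearch's code:

```latex
% f_{t,d}: raw count of term t in document d
% The denominator sums the raw counts of all terms t' in d,
% i.e. it is the length of the document in terms.
\[
  \mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}
\]
```

With this definition, repeating every word equally often does not raise the term frequency of any single word.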
So with Wikipedia's definition of term frequency, there would be the following results:
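The difference between the two definitions can also be sketched in a few lines of PHP. The hit counts and the helper functions `absoluteTf()` and `relativeTf()` below are made up for illustration only; they are not the numbers from the example above and not part of TNTSearch:

```php
<?php
// Hypothetical per-document term counts, invented for this sketch.
$documents = [
    'document 2'    => ['this' => 1, 'is' => 1, 'another' => 2, 'example' => 3],
    'spam document' => ['this' => 10, 'is' => 10, 'a' => 10, 'another' => 10, 'example' => 10],
];

// Absolute term frequency: the raw hit count of the term.
function absoluteTf(array $termCounts, string $term): int
{
    return $termCounts[$term] ?? 0;
}

// Relative term frequency: hit count divided by the total number of terms.
function relativeTf(array $termCounts, string $term): float
{
    $total = array_sum($termCounts);

    return $total > 0 ? ($termCounts[$term] ?? 0) / $total : 0.0;
}

foreach ($documents as $name => $termCounts) {
    printf(
        "%s: absolute tf = %d, relative tf = %.4f\n",
        $name,
        absoluteTf($termCounts, 'example'),
        relativeTf($termCounts, 'example')
    );
}
```

With these made-up counts the spam document wins on absolute term frequency (10 vs. 3), while document 2 wins on relative term frequency (3/7 vs. 10/50), which is the behaviour this PR is after.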