Commit 3da5bbe

Phase J: Introduce RankingExpression and apply to repos
Add a new Doogle\Search\RankingExpression utility to centralize SQL ranking snippets (weighted equals/like, non-empty, HTTPS, bounded click boost, capped MySQL full-text, and sum). Update Site, Image, and Video repositories to use the new expressions, adjust per-vertical weights and quality signals, and replace inline score aggregation. Add integration tests to verify click-boost capping so clicks cannot dominate relevance, plus unit tests for RankingExpression. Documentation updated (ARCHITECTURE2.md, README-refactored.md) to describe the deterministic ranking formula and capped boosts.
1 parent accfb80 commit 3da5bbe

10 files changed

Lines changed: 239 additions & 28 deletions

ARCHITECTURE2.md

Lines changed: 50 additions & 2 deletions
@@ -1191,8 +1191,56 @@ Tasks:
 - Ranking by Result Type
 - Bounded Click Boost (Bad behaviour: Result with most clicks always wins.)
 - Quality Signals
-- Search Analytics
-- Admin Ranking Settings (an analytics auth admin portal for managing/viewing ranking)
+
+Implemented ranking formula:
+
+```text
+rankingScore =
+cappedFullTextBoost
++ fieldRelevance
++ qualitySignals
++ boundedClickBoost
+```
+
+Bounded click boost:
+
+```text
+min(clicks, 100) * 0.1
+```
+
+This makes click boost useful as a tie-breaker and weak popularity signal, but
+caps it at 10 points so clicks cannot dominate stronger textual relevance.
+
+MySQL/MariaDB full-text boost is also capped:
+
+```text
+min(fullTextScore * 50, 120)
+```
+
+Deterministic ordering:
+
+```sql
+ORDER BY rankingScore DESC, clicks DESC, id DESC
+```
+
+Ranking by result type:
+
+- Sites prioritise exact/partial title, then keywords, description, URL, HTTPS, and metadata completeness.
+- Images prioritise exact/partial title, then alt text, image URL, HTTPS, and available title/alt text.
+- Videos prioritise exact/partial title, then description, video URL, site URL, HTTPS, thumbnail availability, and metadata completeness.
+
+Future optional work:
+
+- Search analytics beyond click telemetry.
+- Authenticated admin ranking settings for viewing/tuning ranking weights.
+
+Acceptance criteria:
+
+- all search verticals use the same ranking model shape
+- click counts are capped and cannot dominate stronger relevance
+- quality signals are included
+- ranking order is deterministic
+- repository tests cover ranking behaviour
 
 ---
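
The capped boosts and the tie-break order documented in this hunk can be checked with a little arithmetic. The following is a minimal sketch in plain PHP, not code from the commit: the function names `boundedClickBoost`, `cappedFullTextBoost`, and `compareResults`, and the sample rows, are hypothetical and only mirror the documented formulas `min(clicks, 100) * 0.1`, `min(fullTextScore * 50, 120)`, and `ORDER BY rankingScore DESC, clicks DESC, id DESC`.

```php
<?php

declare(strict_types=1);

// Hypothetical helper mirroring the documented click boost: min(clicks, 100) * 0.1.
function boundedClickBoost(int $clicks, int $cap = 100, float $weight = 0.1): float
{
    return min($clicks, $cap) * $weight;
}

// Hypothetical helper mirroring the documented full-text cap: min(score * 50, 120).
function cappedFullTextBoost(float $fullTextScore, int $weight = 50, int $cap = 120): float
{
    return (float) min($fullTextScore * $weight, $cap);
}

// Mirrors ORDER BY rankingScore DESC, clicks DESC, id DESC for in-memory rows.
function compareResults(array $a, array $b): int
{
    return [$b['rankingScore'], $b['clicks'], $b['id']]
        <=> [$a['rankingScore'], $a['clicks'], $a['id']];
}

// Illustrative rows with identical scores, so only the tie-breakers decide the order.
$rows = [
    ['id' => 1, 'clicks' => 5, 'rankingScore' => 100.5],
    ['id' => 2, 'clicks' => 5, 'rankingScore' => 100.5],
    ['id' => 3, 'clicks' => 9, 'rankingScore' => 100.5],
];
usort($rows, 'compareResults');

echo boundedClickBoost(10000), PHP_EOL;   // click boost for a heavily clicked row
echo cappedFullTextBoost(5.0), PHP_EOL;   // full-text boost for a very high raw score
echo implode(',', array_column($rows, 'id')), PHP_EOL;
```

With these inputs, a row with 10,000 clicks receives the same boost as a row with 100 clicks, and rows with equal scores fall back to clicks and then id, which is what makes the ordering repeatable.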

README-refactored.md

Lines changed: 3 additions & 3 deletions
@@ -23,12 +23,11 @@ Completed from `ARCHITECTURE2.md`:
 - Phase G: Crawl Jobs / History
 - Phase H: Production Hardening
 - Phase I: Videos Search Vertical
+- Phase J: Search Ranking
 
 Remaining:
 
-- Phase J - Search Ranking
-
-<!-- - None currently tracked in `ARCHITECTURE2.md`. -->
+- Phase K: Search analytics and admin ranking controls are future optional work.
 
 Key changes now in place:
 
@@ -37,6 +36,7 @@ Key changes now in place:
 - `app/` contains auth, crawl, database, repository, search, and security classes.
 - Legacy root `crawl.php`, root `crawl-manual.php`, and `classes/` have been removed.
 - Search uses repositories, services, DTOs, full-text indexes, relevance ranking, and bounded click boost for sites, images, and videos.
+- Ranking uses a shared deterministic formula with capped click/full-text boosts and per-vertical quality signals.
 - Browser crawling requires an admin login and CSRF token.
 - CLI crawling does not require a browser session, but still enforces crawler safety policy.
 - Web and CLI crawl jobs are stored in `crawl_jobs` and shown on the authenticated crawl page.

app/Repository/ImageRepository.php

Lines changed: 19 additions & 7 deletions
@@ -4,6 +4,7 @@
 
 namespace Doogle\Repository;
 
+use Doogle\Search\RankingExpression;
 use PDO;
 use PDOStatement;
 
@@ -92,21 +93,30 @@ private function imageWhereSql(): string
     private function imageRankingSql(): string
     {
         $scores = [
-            'CASE WHEN title LIKE :rankTitleTerm THEN 100 ELSE 0 END',
-            'CASE WHEN alt LIKE :rankAltTerm THEN 60 ELSE 0 END',
-            'CASE WHEN imageUrl LIKE :rankImageUrlTerm THEN 20 ELSE 0 END',
-            '(CASE WHEN clicks > 100 THEN 100 ELSE clicks END * 0.1)',
+            RankingExpression::weightedEquals('title', ':rankTitleExactTerm', 220),
+            RankingExpression::weightedEquals('alt', ':rankAltExactTerm', 180),
+            RankingExpression::weightedLike('title', ':rankTitleTerm', 100),
+            RankingExpression::weightedLike('alt', ':rankAltTerm', 75),
+            RankingExpression::weightedLike('imageUrl', ':rankImageUrlTerm', 25),
+            RankingExpression::httpsUrl('imageUrl', 3),
+            RankingExpression::nonEmpty('alt', 5),
+            RankingExpression::nonEmpty('title', 3),
+            RankingExpression::boundedClickBoost(),
         ];
 
         if ($this->isMysql()) {
             array_unshift(
                 $scores,
-                '(MATCH(' . self::IMAGE_FULL_TEXT_COLUMNS . ') '
-                . 'AGAINST (:rankFullTextTerm IN NATURAL LANGUAGE MODE) * 50)'
+                RankingExpression::boundedMysqlFullTextBoost(
+                    self::IMAGE_FULL_TEXT_COLUMNS,
+                    ':rankFullTextTerm',
+                    50,
+                    120
+                )
             );
         }
 
-        return '(' . implode(' + ', $scores) . ')';
+        return RankingExpression::sum($scores);
     }
 
     private function bindImageWhereTerms(PDOStatement $statement, string $term): void
@@ -128,6 +138,8 @@ private function bindImageRankingTerms(PDOStatement $statement, string $term): v
         }
 
         $likeTerm = '%' . $term . '%';
+        $statement->bindValue(':rankTitleExactTerm', $term);
+        $statement->bindValue(':rankAltExactTerm', $term);
         $statement->bindValue(':rankTitleTerm', $likeTerm);
         $statement->bindValue(':rankAltTerm', $likeTerm);
         $statement->bindValue(':rankImageUrlTerm', $likeTerm);

app/Repository/SiteRepository.php

Lines changed: 18 additions & 8 deletions
@@ -4,6 +4,7 @@
 
 namespace Doogle\Repository;
 
+use Doogle\Search\RankingExpression;
 use PDO;
 use PDOStatement;
 
@@ -83,22 +84,30 @@ private function siteWhereSql(): string
     private function siteRankingSql(): string
     {
         $scores = [
-            'CASE WHEN title LIKE :rankTitleTerm THEN 100 ELSE 0 END',
-            'CASE WHEN keywords LIKE :rankKeywordsTerm THEN 60 ELSE 0 END',
-            'CASE WHEN description LIKE :rankDescriptionTerm THEN 30 ELSE 0 END',
-            'CASE WHEN url LIKE :rankUrlTerm THEN 20 ELSE 0 END',
-            '(CASE WHEN clicks > 100 THEN 100 ELSE clicks END * 0.1)',
+            RankingExpression::weightedEquals('title', ':rankTitleExactTerm', 240),
+            RankingExpression::weightedLike('title', ':rankTitleTerm', 100),
+            RankingExpression::weightedLike('keywords', ':rankKeywordsTerm', 60),
+            RankingExpression::weightedLike('description', ':rankDescriptionTerm', 35),
+            RankingExpression::weightedLike('url', ':rankUrlTerm', 25),
+            RankingExpression::httpsUrl('url', 5),
+            RankingExpression::nonEmpty('title', 5),
+            RankingExpression::nonEmpty('description', 3),
+            RankingExpression::boundedClickBoost(),
         ];
 
         if ($this->isMysql()) {
             array_unshift(
                 $scores,
-                '(MATCH(' . self::SITE_FULL_TEXT_COLUMNS . ') '
-                . 'AGAINST (:rankFullTextTerm IN NATURAL LANGUAGE MODE) * 50)'
+                RankingExpression::boundedMysqlFullTextBoost(
+                    self::SITE_FULL_TEXT_COLUMNS,
+                    ':rankFullTextTerm',
+                    50,
+                    120
+                )
             );
         }
 
-        return '(' . implode(' + ', $scores) . ')';
+        return RankingExpression::sum($scores);
     }
 
     private function bindSiteWhereTerms(PDOStatement $statement, string $term): void
@@ -121,6 +130,7 @@ private function bindSiteRankingTerms(PDOStatement $statement, string $term): vo
         }
 
         $likeTerm = '%' . $term . '%';
+        $statement->bindValue(':rankTitleExactTerm', $term);
         $statement->bindValue(':rankTitleTerm', $likeTerm);
         $statement->bindValue(':rankKeywordsTerm', $likeTerm);
         $statement->bindValue(':rankDescriptionTerm', $likeTerm);

app/Repository/VideoRepository.php

Lines changed: 19 additions & 8 deletions
@@ -4,6 +4,7 @@
 
 namespace Doogle\Repository;
 
+use Doogle\Search\RankingExpression;
 use PDO;
 use PDOStatement;
 
@@ -83,22 +84,31 @@ private function videoWhereSql(): string
     private function videoRankingSql(): string
     {
         $scores = [
-            'CASE WHEN title LIKE :rankTitleTerm THEN 100 ELSE 0 END',
-            'CASE WHEN description LIKE :rankDescriptionTerm THEN 60 ELSE 0 END',
-            'CASE WHEN videoUrl LIKE :rankVideoUrlTerm THEN 30 ELSE 0 END',
-            'CASE WHEN siteUrl LIKE :rankSiteUrlTerm THEN 10 ELSE 0 END',
-            '(CASE WHEN clicks > 100 THEN 100 ELSE clicks END * 0.1)',
+            RankingExpression::weightedEquals('title', ':rankTitleExactTerm', 230),
+            RankingExpression::weightedLike('title', ':rankTitleTerm', 100),
+            RankingExpression::weightedLike('description', ':rankDescriptionTerm', 55),
+            RankingExpression::weightedLike('videoUrl', ':rankVideoUrlTerm', 25),
+            RankingExpression::weightedLike('siteUrl', ':rankSiteUrlTerm', 15),
+            RankingExpression::httpsUrl('videoUrl', 3),
+            RankingExpression::nonEmpty('thumbnailUrl', 5),
+            RankingExpression::nonEmpty('title', 5),
+            RankingExpression::nonEmpty('description', 3),
+            RankingExpression::boundedClickBoost(),
         ];
 
         if ($this->isMysql()) {
             array_unshift(
                 $scores,
-                '(MATCH(' . self::VIDEO_FULL_TEXT_COLUMNS . ') '
-                . 'AGAINST (:rankFullTextTerm IN NATURAL LANGUAGE MODE) * 50)'
+                RankingExpression::boundedMysqlFullTextBoost(
+                    self::VIDEO_FULL_TEXT_COLUMNS,
+                    ':rankFullTextTerm',
+                    50,
+                    120
+                )
             );
         }
 
-        return '(' . implode(' + ', $scores) . ')';
+        return RankingExpression::sum($scores);
     }
 
     private function bindVideoWhereTerms(PDOStatement $statement, string $term): void
@@ -121,6 +131,7 @@ private function bindVideoRankingTerms(PDOStatement $statement, string $term): v
         }
 
         $likeTerm = '%' . $term . '%';
+        $statement->bindValue(':rankTitleExactTerm', $term);
         $statement->bindValue(':rankTitleTerm', $likeTerm);
         $statement->bindValue(':rankDescriptionTerm', $likeTerm);
         $statement->bindValue(':rankVideoUrlTerm', $likeTerm);

app/Search/RankingExpression.php

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
+<?php
+
+declare(strict_types=1);
+
+namespace Doogle\Search;
+
+final class RankingExpression
+{
+    public const CLICK_BOOST_CAP = 100;
+    public const CLICK_BOOST_WEIGHT = 0.1;
+
+    public static function sum(array $expressions): string
+    {
+        return '(' . implode(' + ', $expressions) . ')';
+    }
+
+    public static function weightedEquals(string $column, string $parameter, int $weight): string
+    {
+        return "CASE WHEN LOWER({$column}) = LOWER({$parameter}) THEN {$weight} ELSE 0 END";
+    }
+
+    public static function weightedLike(string $column, string $parameter, int $weight): string
+    {
+        return "CASE WHEN {$column} LIKE {$parameter} THEN {$weight} ELSE 0 END";
+    }
+
+    public static function nonEmpty(string $column, int $weight): string
+    {
+        return "CASE WHEN {$column} <> '' THEN {$weight} ELSE 0 END";
+    }
+
+    public static function httpsUrl(string $column, int $weight): string
+    {
+        return "CASE WHEN {$column} LIKE 'https://%' THEN {$weight} ELSE 0 END";
+    }
+
+    public static function boundedClickBoost(string $column = 'clicks'): string
+    {
+        return '(CASE WHEN ' . $column . ' > ' . self::CLICK_BOOST_CAP
+            . ' THEN ' . self::CLICK_BOOST_CAP . ' ELSE ' . $column . ' END * '
+            . self::CLICK_BOOST_WEIGHT . ')';
+    }
+
+    public static function boundedMysqlFullTextBoost(
+        string $columns,
+        string $parameter,
+        int $weight,
+        int $cap,
+    ): string {
+        $score = 'MATCH(' . $columns . ') AGAINST (' . $parameter . ' IN NATURAL LANGUAGE MODE) * ' . $weight;
+
+        return 'LEAST((' . $score . '), ' . $cap . ')';
+    }
+}
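
To see what these helpers emit, it may help to print the SQL they compose. The sketch below assumes a standalone copy of three of the methods above under the hypothetical class name `RankingExpressionSketch`, so that it runs without the project's autoloader. The real class lives in `Doogle\Search` and has more helpers.

```php
<?php

declare(strict_types=1);

// Trimmed standalone copy of three helpers from the commit (hypothetical class name),
// so this sketch can run without the Doogle autoloader.
final class RankingExpressionSketch
{
    public const CLICK_BOOST_CAP = 100;
    public const CLICK_BOOST_WEIGHT = 0.1;

    public static function sum(array $expressions): string
    {
        return '(' . implode(' + ', $expressions) . ')';
    }

    public static function weightedEquals(string $column, string $parameter, int $weight): string
    {
        return "CASE WHEN LOWER({$column}) = LOWER({$parameter}) THEN {$weight} ELSE 0 END";
    }

    public static function boundedClickBoost(string $column = 'clicks'): string
    {
        return '(CASE WHEN ' . $column . ' > ' . self::CLICK_BOOST_CAP
            . ' THEN ' . self::CLICK_BOOST_CAP . ' ELSE ' . $column . ' END * '
            . self::CLICK_BOOST_WEIGHT . ')';
    }
}

// Compose two score terms the same way the repositories do in this commit.
$sql = RankingExpressionSketch::sum([
    RankingExpressionSketch::weightedEquals('title', ':rankTitleExactTerm', 240),
    RankingExpressionSketch::boundedClickBoost(),
]);

echo $sql, PHP_EOL;
```

Note that column names and weights are interpolated directly into the SQL string, while only the search term travels through a bound parameter. The helpers therefore appear to rely on callers passing trusted, hard-coded identifiers, as the repositories in this commit do.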

tests/Integration/Repository/ImageRepositoryTest.php

Lines changed: 12 additions & 0 deletions
@@ -77,6 +77,18 @@ public function testSearchRanksTitleMatchesAboveHigherClickedAltMatches(): void
         self::assertSame('https://example.com/alt.png', $results[1]['imageUrl']);
     }
 
+    public function testSearchCapsClickBoostSoClicksCannotDominateStrongerRelevance(): void
+    {
+        $this->insertImage('https://example.com/site', 'https://example.com/exact.png', '', 'linux', 0, 0);
+        $this->insertImage('https://example.com/site', 'https://example.com/clicked.png', 'linux', 'Clicked', 10000, 0);
+
+        $results = $this->repository->search('linux', 0, 30);
+
+        self::assertSame('https://example.com/exact.png', $results[0]['imageUrl']);
+        self::assertSame('https://example.com/clicked.png', $results[1]['imageUrl']);
+        self::assertGreaterThan((float) $results[1]['rankingScore'], (float) $results[0]['rankingScore']);
+    }
+
     public function testSearchCanMatchImageUrl(): void
     {
         $this->insertImage(

tests/Integration/Repository/SiteRepositoryTest.php

Lines changed: 12 additions & 0 deletions
@@ -75,6 +75,18 @@ public function testSearchRanksTitleMatchesAboveHigherClickedDescriptionMatches(
         self::assertSame('https://example.com/description', $results[1]['url']);
     }
 
+    public function testSearchCapsClickBoostSoClicksCannotDominateStrongerRelevance(): void
+    {
+        $this->insertSite('https://example.com/exact', 'linux', '', '', 0);
+        $this->insertSite('https://example.com/clicked', 'Clicked Result', 'linux', '', 10000);
+
+        $results = $this->repository->search('linux', 0, 20);
+
+        self::assertSame('https://example.com/exact', $results[0]['url']);
+        self::assertSame('https://example.com/clicked', $results[1]['url']);
+        self::assertGreaterThan((float) $results[1]['rankingScore'], (float) $results[0]['rankingScore']);
+    }
+
     public function testSearchAppliesOffsetAndLimit(): void
     {
         $this->insertSite('https://example.com/one', 'Linux One', 'Linux', 'linux', 3);

tests/Integration/Repository/VideoRepositoryTest.php

Lines changed: 12 additions & 0 deletions
@@ -65,6 +65,18 @@ public function testSearchRanksTitleMatchesAboveHigherClickedDescriptionMatches(
         self::assertSame('https://example.com/description.mp4', $results[1]['videoUrl']);
     }
 
+    public function testSearchCapsClickBoostSoClicksCannotDominateStrongerRelevance(): void
+    {
+        $this->insertVideo('https://example.com/exact.mp4', 'linux', '', 0);
+        $this->insertVideo('https://example.com/clicked.mp4', 'Clicked Video', 'linux', 10000);
+
+        $results = $this->repository->search('linux', 0, 24);
+
+        self::assertSame('https://example.com/exact.mp4', $results[0]['videoUrl']);
+        self::assertSame('https://example.com/clicked.mp4', $results[1]['videoUrl']);
+        self::assertGreaterThan((float) $results[1]['rankingScore'], (float) $results[0]['rankingScore']);
+    }
+
     public function testSearchCanMatchVideoUrl(): void
     {
         $this->insertVideo('https://example.com/linux-demo.mp4', 'Demo', 'Video', 1);
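
The three click-cap tests share one idea: an exact title match with zero clicks must outrank a weaker match with 10,000 clicks. The sketch below re-computes the site weights from `SiteRepository` in plain PHP for the two rows in the site test. It rests on assumptions: `siteScore` is a hypothetical function, the row layout (including which field holds `linux` on the clicked row) is a guess at the argument order of `insertSite`, the full-text term is left out (the non-MySQL path), and `stripos` only approximates case-insensitive `LIKE '%term%'`.

```php
<?php

declare(strict_types=1);

// Hypothetical PHP re-computation of the site ranking weights shown in this commit.
// Approximates SQL LIKE '%term%' with a case-insensitive substring check.
function siteScore(array $row, string $term): float
{
    $contains = static fn (string $field): bool => stripos($row[$field], $term) !== false;

    $score = 0.0;
    $score += strtolower($row['title']) === strtolower($term) ? 240 : 0; // exact title
    $score += $contains('title') ? 100 : 0;                              // partial title
    $score += $contains('keywords') ? 60 : 0;
    $score += $contains('description') ? 35 : 0;
    $score += $contains('url') ? 25 : 0;
    $score += str_starts_with($row['url'], 'https://') ? 5 : 0;          // HTTPS signal
    $score += $row['title'] !== '' ? 5 : 0;                              // metadata completeness
    $score += $row['description'] !== '' ? 3 : 0;
    $score += min($row['clicks'], 100) * 0.1;                            // bounded click boost

    return $score;
}

// Assumed field layout for the two rows inserted by the site test.
$exact = [
    'url' => 'https://example.com/exact', 'title' => 'linux',
    'description' => '', 'keywords' => '', 'clicks' => 0,
];
$clicked = [
    'url' => 'https://example.com/clicked', 'title' => 'Clicked Result',
    'description' => 'linux', 'keywords' => '', 'clicks' => 10000,
];

echo siteScore($exact, 'linux'), PHP_EOL;
echo siteScore($clicked, 'linux'), PHP_EOL;
```

Under these assumptions the exact-title row scores 350 and the heavily clicked row 58, of which only 10 points come from clicks. If `linux` sat in the keywords field instead, the clicked row would score 80, which still leaves a wide margin.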
