Is your feature request related to a problem? (Feature Description)
When the package is fed text transcribed from an actual recitation (e.g. ASR output), the input almost always opens with the Istiʿādhah ("أعوذ بالله من الشيطان الرجيم") and frequently a leading Basmalah before the first ayah. The package already has Smart Basmala Handling for the corpus side (Basmala stored as ayah 0, Basmalah-aware counting), but there is currently no way to detect or strip a leading Istiʿādhah: unlike the Basmalah, it is not part of the Quran corpus at all and matches no verse.
As a result, when a recitation transcript is passed to search_text / fuzzy_search / search_sliding_window / smart_search, the extra opening words skew similarity scores, throw off word-level positioning, and shift sliding-window alignment. There is also no clean signal back to the caller telling them the query began with an Istiʿādhah/Basmalah versus actual ayah text, which downstream apps need in order to segment or display the recitation correctly.
Describe the solution I'd like
A first-class "recitation prefix" detector that recognizes a leading Istiʿādhah and/or Basmalah and reports it, ideally:
A standalone helper, e.g. qal.detect_recitation_prefix(text), returning structured info: which prefix(es) were found (istiadha, basmalah), the matched span, a similarity score, and the remaining (cleaned) ayah text.
Optional flags on the existing search functions, e.g. strip_istiadha=True / strip_basmalah=True, so callers can normalize the query before matching in one call (also exposed as CLI flags and REST API params for consistency with the rest of the API).
Detection that reuses the package's existing Arabic normalization (diacritics removal, alef normalization) and fuzzy matching so it tolerates ASR/orthographic variation rather than requiring an exact string.
Coverage of the common Istiʿādhah variants reciters use, for example:
أعوذ بالله من الشيطان الرجيم
أعوذ بالله السميع العليم من الشيطان الرجيم
أعوذ بالله العظيم من الشيطان الرجيم
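To make the request concrete, here is a rough sketch of what such a helper could look like. It is illustrative only and uses just the standard library: `detect_recitation_prefix`, `PrefixMatch`, `PrefixResult`, the local `normalize` function, and the 0.8 threshold are all hypothetical names and values, and `difflib` stands in for the package's own normalization and fuzzy matching, which a real implementation would presumably reuse.

```python
import re
from dataclasses import dataclass, field
from difflib import SequenceMatcher

# Stand-in normalization (assumption: the package's own normalizer would be used instead).
_DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670\u06D6-\u06ED\u0640]")
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625\u0671]")  # آ أ إ ٱ

def normalize(text):
    """Strip diacritics/tatweel, unify alef forms, collapse whitespace."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return " ".join(text.split())

ISTIADHA_VARIANTS = [
    "أعوذ بالله من الشيطان الرجيم",
    "أعوذ بالله السميع العليم من الشيطان الرجيم",
    "أعوذ بالله العظيم من الشيطان الرجيم",
]
BASMALAH = "بسم الله الرحمن الرحيم"

@dataclass
class PrefixMatch:          # hypothetical result type
    kind: str               # "istiadha" or "basmalah"
    word_span: tuple        # (start, end) word indexes in the input, end exclusive
    score: float            # similarity ratio in [0, 1]

@dataclass
class PrefixResult:         # hypothetical result type
    matches: list = field(default_factory=list)
    remaining: str = ""     # input text with detected prefixes removed

def _best_prefix(words, variants, threshold):
    """Return (words_consumed, score) for the best-scoring variant, or (0, 0.0)."""
    best = (0, 0.0)
    for variant in variants:
        target = normalize(variant)
        n = len(target.split())
        if len(words) < n:
            continue
        candidate = " ".join(words[:n])
        score = SequenceMatcher(None, candidate, target).ratio()
        if score >= threshold and score > best[1]:
            best = (n, score)
    return best

def detect_recitation_prefix(text, threshold=0.8):
    """Detect a leading Istiʿādhah and then a leading Basmalah, in that order."""
    original = text.split()
    normalized = [normalize(w) for w in original]
    result = PrefixResult()
    pos = 0
    for kind, variants in (("istiadha", ISTIADHA_VARIANTS), ("basmalah", [BASMALAH])):
        n, score = _best_prefix(normalized[pos:], variants, threshold)
        if n:
            result.matches.append(PrefixMatch(kind, (pos, pos + n), score))
            pos += n
    # Remaining text keeps the caller's original spelling and diacritics.
    result.remaining = " ".join(original[pos:])
    return result
```

Called on a transcript, `detect_recitation_prefix(text).remaining` would be what gets passed on to the search functions, while `matches` gives the caller the labels and word spans needed to segment the opening. The sketch does not address the case where the Basmalah is itself the first ayah being recited (Al-Fatiha 1:1); that would need a design decision.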
Describe alternatives you've considered
Stripping the prefix with a regex before calling the search functions — brittle against ASR variation and diacritics, and it duplicates normalization logic the package already owns.
Adding the Istiʿādhah to the corpus as a pseudo-verse — this would pollute verse counts and absolute indexing (0–6347), and there is no single canonical form, so it does not fit the corpus model the way the Basmalah does.
Relying on the fuzzy threshold to "ignore" the extra words — unreliable, and it still distorts word-level positioning and sliding-window start/end alignment.
Additional context
The primary use case is ASR / transcription pipelines that match recitation audio back to the Quran text, where the reciter conventionally opens with Istiʿādhah (+ Basmalah). Detecting these cleanly improves match accuracy and lets downstream tools label/segment the opening properly. Keeping the Istiʿādhah out of the corpus (it is not Quranic text) but detectable as a recitation prefix seems like the right separation of concerns, and it pairs naturally with the existing Basmala handling and the MultiAyahMatch response model.