👁️ ModuDoc: VLM-Powered Document Parser for Advanced RAG

Python 3.10+ License: MIT VLM: Qwen-VL

ModuDoc is a document parsing tool that uses a vision-language model (VLM) to understand a document's visual layout and logical hierarchy and turn them into structured output.

It targets the hard cases in HWP/PDF documents (complex tables, multi-column layouts, and context that breaks across pages) and produces source data (JSON/Markdown) for building RAG systems.

✨ Key Features

  • 📚 Multimodal (image + text) hybrid parsing: combines the VLM's visual perception with extraction of the original text layer to minimize lost information.
  • 🔗 Cross-page context merging (Cross-page Chunking): tables and paragraphs cut off at a page break are merged by heading (table of contents) rather than by physical page.
  • 🌳 Semantic RAG chunking strategies: three chunking modes are available depending on your goal: page (page-based), toc (heading-based), and tree (the document's tree structure).
  • 📊 Visuals turned into knowledge: the VLM directly analyzes complex diagrams and figures that plain text cannot capture and converts them into detailed text (description).
  • 📄 Broad format support: PDF, HWP, HWPX, DOCX, PPTX, and XLSX, plus a convenient Web UI.

🏆 Benchmark

Evaluated on roughly 200 documents from opendataloader-bench, ModuDoc scored highest on table structure recognition (TEDS) and heading hierarchy structuring (MHS).

NID: reading-order accuracy · TEDS: table structure similarity · MHS: heading hierarchy accuracy

1. Comparison with major open-source parsers

| Parser | Maker | NID | TEDS | MHS | Overall | Notes |
|---|---|---|---|---|---|---|
| opendataloader-hybrid | Hancom (한컴) | 0.9355 | 0.9276 | 0.8057 | 0.9034 | |
| ModuDoc (Ours) | Marker-Inc | 0.9212 | 0.9358 | 0.8219 | 0.8993 | 👑 1st in tables (TEDS) and headings (MHS) |
| docling | IBM | 0.8995 | 0.8871 | 0.8019 | 0.8766 | |
| marker | - | 0.8897 | 0.8076 | 0.7956 | 0.8608 | |
| opendataloader | Hancom (한컴) | 0.9127 | 0.4942 | 0.7404 | 0.8393 | |
| mineru | - | 0.8574 | 0.8730 | 0.7430 | 0.8311 | |
| markitdown | Microsoft | 0.8786 | 0.0000 | 0.0000 | 0.5832 | |

2. Ablation Study

We quantified how ModuDoc's two core design choices, 'multimodal text-layer injection' and 'prompt engineering', affect performance.

| Configuration | VLM input | Prompt | TEDS (tables) | MHS (hierarchy) | Overall |
|---|---|---|---|---|---|
| Naive baseline | Image only | Basic instruction | 0.8635 | 0.7740 | 0.8667 |
| + Prompt Engineering | Image only | Forced JSON structure | 0.8705 | 0.8058 | 0.8913 |
| + Text Layer (Ours) | Text + image | Forced JSON structure | 0.9358 | 0.8219 | 0.8993 |

💡 Key insight: feeding the VLM the page image together with the extracted text (multimodal injection) raised table structure recognition (TEDS) on complex tables, where existing parsers tend to fail, from 0.8705 to 0.9358, an absolute gain of 0.0653.


🎯 Output Structure for RAG (TOC/Tree Chunking)

ModuDoc converts a document into hierarchical metadata (breadcrumbs) optimized for retrieval, rather than a flat run of text.

  • split_toc.json: chunks merged by heading (table of contents), ignoring page boundaries. Each chunk carries heading_path (the hierarchy path) as metadata, which can be used for RAG filtering.
  • split_tree.json: chunks that represent the document's logical hierarchy as a tree. Sections are split by heading depth.

The heading hierarchy is normalized consistently across the whole document using the Korean regulatory numbering scheme (제N장 [chapter] > 제N조 [article] > ① 항 [paragraph] > 1. 호 [subparagraph] > 가. 목 [item], plus decimal numbering such as 4.1.1), and the result is reflected in heading_path and depth. Articles buried in body text (제N조(제목), "Article N (title)") and definition entries (N. "용어"란 …, "N. 'term' means …") are promoted to headings, so retrieval works at the article and term level. (Controlled by CHUNK_NORMALIZE.)
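To make the numbering scheme concrete, here is a minimal sketch that classifies a heading line by its leading marker. The function name `classify_marker`, the regular expressions, and the level names are illustrative assumptions; this is not ModuDoc's actual normalization code.

```python
import re
from typing import Optional

# Checked in order; the first matching pattern wins.
# All patterns are assumptions made for illustration.
MARKER_PATTERNS = [
    ("chapter", re.compile(r"^제\s*\d+\s*장")),            # 제N장
    ("article", re.compile(r"^제\s*\d+\s*조")),            # 제N조
    ("paragraph", re.compile(r"^[\u2460-\u2473]")),        # circled numbers 1..20 (항)
    ("decimal", re.compile(r"^\d+(?:\.\d+)+(?!\d)")),      # 4.1.1
    ("subparagraph", re.compile(r"^\d+\.(?!\d)")),         # 1. (호)
    ("item", re.compile(r"^[가나다라마바사아자차카타파하]\.")),  # 가. (목)
]


def classify_marker(line: str) -> Optional[str]:
    """Return the numbering level a line starts with, or None for plain text."""
    text = line.strip()
    for name, pattern in MARKER_PATTERNS:
        if pattern.match(text):
            return name
    return None
```

A real normalizer would also need document-wide context (for example, whether "1." sits under an article or stands alone), which this per-line sketch ignores.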

⚠️ toc and tree depend on the headings extracted by the VLM. Without VLM structure results (*_structured.json), for example when no VLM is connected or every page fails, there are no headings, so toc/tree cannot be built and only page chunking based on the rendered text is available.

Which chunking strategy should I use?

| Strategy | Chunk unit | When to use | heading_path |
|---|---|---|---|
| page | 1 page = 1 chunk | Page independence matters, or you need fast chunking without a VLM | Yes, when section headings exist. Tables and body text that continue across a page break inherit the preceding section and are flagged with _heading_inherited |
| toc | 1 heading section = 1 chunk (page boundaries ignored and merged) | General-purpose RAG where section-level retrieval and merged context matter (the most versatile option) | Yes |
| tree | 1 hierarchy node of the heading-depth tree = 1 chunk | Precise hierarchical retrieval at the article or item level (regulations, statutes, technical standards, etc.) | Yes (+depth) |

If you are unsure, generate both toc and tree and compare them in your RAG index. A chunk longer than CHUNK_MAX_CHARS (default 4000 characters) is split automatically at element boundaries; tables are kept whole.
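The size-based split can be pictured with the sketch below. It is an assumption-laden illustration, not the project's chunker: the function name `split_elements` is hypothetical, and length is measured simply as `len(element["content"])`.

```python
def split_elements(elements, max_chars=4000):
    """Greedily pack elements into groups of at most max_chars characters.

    An element is never cut, so a single oversized element (for example a
    large table) ends up alone in a group that exceeds the limit.
    """
    groups, current, size = [], [], 0
    for element in elements:
        length = len(element["content"])
        # Start a new group when adding this element would overflow the current one.
        if current and size + length > max_chars:
            groups.append(current)
            current, size = [], 0
        current.append(element)
        size += length
    if current:
        groups.append(current)
    return groups
```

Because the split only ever happens between elements, a 5000-character table stays in one piece even though it exceeds the limit on its own.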

Structure of each chunk (split_toc.json / split_tree.json):

{
  "chunk_id": "tree_0007",
  "chunk_type": "tree",
  "depth": 2,
  "heading_path": ["적합판정 등 심사기준(제5조 관련)", "4. 품질경영시스템"],
  "page_range": [42, 43],
  "elements": [
    {"type": "heading_2", "content": "4. 품질경영시스템"},
    {"type": "text", "content": "..."},
    {"type": "table", "content": "<table>...</table>", "caption": "표 제목"}
  ]
}

  • heading_path: the hierarchy path from the root to the current section. Use it as a metadata filter or as context for RAG chunks. (In the sample: "Assessment criteria for conformity determination, etc. (re: Article 5)" > "4. Quality management system".)
  • depth: hierarchy depth (= len(heading_path); tree only)
  • page_range: page range in the original document, [start, end]
  • elements: the elements that belong to the chunk (heading_* / text / table / figure / footnote)
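As a rough illustration of using heading_path for filtering and context, the sketch below works on an in-memory list of chunk objects. The helper names `chunks_under` and `chunk_text` are hypothetical, the second chunk is invented for the example, and the assumption that a split file holds a plain list of such objects is not confirmed by this README.

```python
def chunks_under(chunks, prefix):
    """Return the chunks whose heading_path starts with the given prefix."""
    prefix = list(prefix)
    return [c for c in chunks if c.get("heading_path", [])[:len(prefix)] == prefix]


def chunk_text(chunk):
    """Prepend the breadcrumb to the chunk body so an embedding sees its context."""
    breadcrumb = " > ".join(chunk.get("heading_path", []))
    body = "\n".join(e["content"] for e in chunk.get("elements", []))
    return f"{breadcrumb}\n{body}" if breadcrumb else body


chunks = [
    {
        "chunk_id": "tree_0007",
        "chunk_type": "tree",
        "depth": 2,
        "heading_path": ["적합판정 등 심사기준(제5조 관련)", "4. 품질경영시스템"],
        "page_range": [42, 43],
        "elements": [
            {"type": "heading_2", "content": "4. 품질경영시스템"},
            {"type": "text", "content": "..."},
        ],
    },
    {   # hypothetical second chunk, added only to show filtering
        "chunk_id": "tree_0008",
        "chunk_type": "tree",
        "depth": 1,
        "heading_path": ["부칙"],
        "page_range": [44, 44],
        "elements": [{"type": "text", "content": "..."}],
    },
]

matches = chunks_under(chunks, ["적합판정 등 심사기준(제5조 관련)"])
```

Here `matches` contains only tree_0007, and `chunk_text(matches[0])` begins with the two-level breadcrumb joined by " > ".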

🚀 Getting Started

1. Installation

git clone https://github.com/Marker-Inc-Korea/ModuDoc.git
cd ModuDoc
pip install -r requirements.txt

System dependencies: PDF, HWP, and HWPX work without any extra installation. DOCX/PPTX/XLSX (and DOC/PPT/ODT/RTF) are converted to PDF internally with LibreOffice (soffice) before processing, so LibreOffice must be installed. (Python 3.10+)

HWP/HWPX rendering: on Linux, rhwp (a Skia-based native renderer) is used by default and renders HWP/HWPX into page images with no extra tools. It avoids overlapping tables and clipped embedded images, and its pagination stays close to the original in Hangul (Hancom's word processor). For accurate typesetting, installing Korean fonts (Hamchorom HCR / Noto CJK KR / Nanum) is recommended.

  • On Linux (x86_64), a patched rhwp build (vendor/) that fully renders OLE objects is used. It handles chemical structures such as ChemDraw (WMF), bitmaps embedded in press releases (StaticDib), and very large regulatory documents (e.g. the MFDS Food Additives Code and the Korean Pharmacopoeia). See patches/ for the patch contents and rebuild instructions.
  • If rhwp is not installed or rendering fails (encrypted or corrupted files, etc.), ModuDoc falls back automatically to LibreOffice + H2Orestart, provided they are installed.
  • Set USE_RHWP=0 to disable the rhwp path and always use LibreOffice.

(On Windows, lossless conversion requires Hancom's Hangul (HWP) program to be installed.)

2. Configuration (environment variables)

By default ModuDoc uses a local VLM (vLLM or any other OpenAI-compatible endpoint). Serve the model locally, then point VLM_BASE_URL at its address. (No separate API key is needed.)

# 1) Serve the VLM locally (e.g. vLLM, an OpenAI-compatible endpoint)
vllm serve Qwen/Qwen3-VL-30B-A3B-Instruct --port 8000

# 2) Point ModuDoc at this endpoint
export VLM_BASE_URL="http://localhost:8000/v1"
# (If the endpoint requires authentication) export VLM_API_KEY="your_key"

💡 The model parameter of /api/process (default Qwen/Qwen3-VL-30B-A3B-Instruct) must match the name of the model being served.

💡 Model choice: the default Qwen3-VL-30B-A3B-Instruct needs about 60 GB (FP16) or about 30 GB (FP8) of GPU memory. On a smaller GPU, serve a smaller Qwen-VL (e.g. 7B/8B) or a quantized model and set model to that name. Any VLM behind an OpenAI-compatible endpoint works; output quality scales with the model's capability.
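The memory figures above are consistent with a weights-only estimate, sketched below. The helper `weight_memory_gb` is hypothetical, and it assumes roughly 30 billion total parameters that must all be resident; KV cache and activations come on top, so treat the result as a lower bound.

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    return params_billion * bytes_per_param


fp16_gb = weight_memory_gb(30, 2)  # 2 bytes per parameter in FP16
fp8_gb = weight_memory_gb(30, 1)   # 1 byte per parameter in FP8
```

The same arithmetic gives a quick first check of whether a smaller or quantized model would fit a given GPU.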

Optional environment variables for HWP rendering (rhwp)

| Variable | Default | Description |
|---|---|---|
| USE_RHWP | 1 | Render HWP/HWPX with rhwp. Set to 0 to use LibreOffice only |
| RHWP_FONTCONFIG | (auto) | Path to a minimal fonts.conf that lists only Korean fonts. Generated automatically when unset (this avoids slow rendering on systems with thousands of fonts) |
| RENDER_DPI | 300 | Page render resolution (all formats, including PDF). Higher values separate side-by-side and dense tables more accurately but increase VLM tokens and time |
| VLM_IMG_MAXW | 2464 | Maximum width of the VLM input image (px, a multiple of 28). Pair it with RENDER_DPI: if this value is smaller than the rendered width, the image is downscaled and the table-separation benefit shrinks |
| VLM_IMG_MAXW_FALLBACK | 1024 | Reduced fallback width used when retrying after runaway repetition or a timeout |

💡 Table structure accuracy ↔ cost: the defaults (300 DPI / 2464 px) split tables that sit next to each other (e.g. an evaluation table on the left and a grade table on the right) into separate <table> elements. To save time and tokens you can lower them to RENDER_DPI=200 VLM_IMG_MAXW=1568; the page count and content stay the same, but side-by-side tables are separated less accurately.
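The pairing of the two settings can be checked with a little arithmetic. The sketch below assumes A4 pages (210 mm wide) and a simple proportional downscale capped at the maximum width; both are assumptions, and the helper names are hypothetical.

```python
A4_WIDTH_INCHES = 210 / 25.4  # A4 is 210 mm wide (assumed page size)


def rendered_width_px(dpi: int, width_inches: float = A4_WIDTH_INCHES) -> int:
    """Pixel width of a page rendered at the given DPI."""
    return round(dpi * width_inches)


def downscale_factor(dpi: int, max_width_px: int) -> float:
    """Scale applied to fit max_width_px; 1.0 means the render is not shrunk."""
    return min(1.0, max_width_px / rendered_width_px(dpi))


default_scale = downscale_factor(300, 2464)     # default pair
economy_scale = downscale_factor(200, 1568)     # economy pair
mismatched_scale = downscale_factor(300, 1568)  # high DPI, low width cap
```

Under these assumptions an A4 page is about 2480 px wide at 300 DPI, so the 2464 px cap keeps nearly all of the rendered detail, while pairing 300 DPI with a 1568 px cap would discard over a third of the width resolution before the VLM sees the page.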

The rhwp core prints layout diagnostics (LAYOUT_OVERFLOW, etc.) to stderr. They are harmless; if the log is too noisy, redirect the process's stderr.

Optional environment variables for RAG chunking

| Variable | Default | Description |
|---|---|---|
| CHUNK_MAX_CHARS | 4000 | Maximum chunk length (characters). Longer chunks are split at element boundaries (tables are kept whole) |
| CHUNK_OVERLAP | 0 | Overlap (characters) between the sub-chunks produced by size splitting. 0 = disabled |
| CHUNK_NORMALIZE | 1 | Normalize the heading hierarchy consistently across the document using the Korean numbering scheme (제N조 > ① > 1. > 가., decimal 4.1.1) and promote articles and definition entries buried in body text. 0 = use the VLM's original levels |
| CHUNK_MERGE_CONTINUED | 1 | Merge running headers and continued headings repeated at page breaks so sections are not fragmented. 0 = disabled |
| CHUNK_CLAUSE_HEADING_MAX | 40 | When a 항/호/목 clause marker line exceeds this length (characters), treat it as a body clause rather than a title and demote it (this keeps long clause sentences from polluting heading_path) |

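For experiments it can help to read these settings into one place. The sketch below mirrors the defaults from the table; the function `load_chunk_config` is hypothetical and only illustrates how integer environment variables with defaults are typically read, not how ModuDoc itself loads them.

```python
import os

# Defaults copied from the table above.
CHUNK_DEFAULTS = {
    "CHUNK_MAX_CHARS": 4000,
    "CHUNK_OVERLAP": 0,
    "CHUNK_NORMALIZE": 1,
    "CHUNK_MERGE_CONTINUED": 1,
    "CHUNK_CLAUSE_HEADING_MAX": 40,
}


def load_chunk_config(env=None):
    """Read chunking settings from a mapping (os.environ by default) as ints."""
    env = os.environ if env is None else env
    return {name: int(env.get(name, default)) for name, default in CHUNK_DEFAULTS.items()}


smaller_chunks = load_chunk_config({"CHUNK_MAX_CHARS": "2000", "CHUNK_OVERLAP": "200"})
```

Passing a plain dict instead of os.environ makes it easy to compare several settings side by side without touching the process environment.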
XLSX (Excel) handling and limitations

For Excel files, tables, charts, and images are structured together.

  • Table data: every cell (including merged cells) is extracted as an HTML <table>. Each sheet is chunked as one section (heading_path = [sheet name]).
  • Charts (graphs): the VLM converts the content into descriptive text.
  • Embedded images (photos, diagrams): the VLM describes them (logos and decorations are excluded).

Limitations

  • .xls (the legacy binary format) goes through the general VLM path, which can be slow for large sheets. .xlsx is recommended.
  • Formula cells use the last computed and saved value.

3. Run the server

python app.py

The server runs at http://localhost:5000, and the Web UI lets you upload and parse documents right away.


🔌 Usage

A. Web UI

After running python app.py, open http://localhost:5000 and upload documents by drag and drop → choose the output format and chunking strategies → review or download the results.

B. REST API (asynchronous, 3 steps)

1) Request parsing: POST /api/process → returns task_id

curl -X POST http://localhost:5000/api/process \
  -F "files=@sample_document.pdf" \
  -F "format=json" \
  -F "model=Qwen/Qwen3-VL-30B-A3B-Instruct" \
  -F "chunk=toc" -F "chunk=tree" \
  -F "concurrency=3"
# → {"message": "작업 시작됨", "task_id": "abc123..."}   (the message means "Job started")

| Parameter | Value | Default | Description |
|---|---|---|---|
| files | File (multiple allowed) | - | Documents to upload (PDF/HWP/HWPX/DOCX/PPTX/XLSX) |
| format | json / markdown / xml | json | Output format |
| model | Model name | Qwen/Qwen3-VL-30B-A3B-Instruct | Must match the name of the model being served |
| chunk | page / toc / tree | (none) | Chunking strategy; repeat the field to select several. See the guide above |
| concurrency | 1 to 16 | 3 | Number of documents (files) processed concurrently when several are uploaded. Page-level concurrency is set by VLM_PAGE_CONCURRENCY (default 16) |

2) Check progress: GET /api/progress/<task_id> (poll until is_done becomes true)

curl http://localhost:5000/api/progress/abc123...
# → {"progress": 100, "status": "...", "is_done": true, "error": null, ...}

3) Download results: GET /api/download/<task_id> → ZIP of all results
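The polling step can be scripted along these lines. This is a sketch using only the Python standard library: the helper names `fetch_progress` and `wait_until_done`, the interval and timeout values, and the assumption that a failed task reports a non-null error are all illustrative; only the endpoint path and the is_done / error fields come from the responses shown above.

```python
import json
import time
import urllib.request


def fetch_progress(base_url, task_id):
    """GET /api/progress/<task_id> and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/progress/{task_id}") as resp:
        return json.load(resp)


def wait_until_done(task_id, fetch, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll until is_done is true; raise on a reported error or on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        state = fetch(task_id)
        if state.get("error"):
            raise RuntimeError(f"task {task_id} failed: {state['error']}")
        if state.get("is_done"):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} not finished after {timeout}s")
        sleep(interval)
```

Against a running server this would be called as `wait_until_done(task_id, lambda t: fetch_progress("http://localhost:5000", t))`, followed by a request to /api/download/<task_id>. Passing `fetch` and `sleep` as parameters keeps the loop testable without a network.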

C. Calling directly from Python

import os
os.environ["VLM_BASE_URL"] = "http://localhost:8000/v1"   # the VLM endpoint being served
from utils import DocumentProcessor

DocumentProcessor.process_and_save(
    file_path="sample.pdf",
    base_output_dir="./processed",
    api_key="local-vllm-noauth-key",     # for a local no-auth endpoint, any non-empty value
    output_format="json",                # "json" | "markdown" | "xml"
    model_name="Qwen/Qwen3-VL-30B-A3B-Instruct",
    chunk_strategies=["page", "toc", "tree"],
)

Outputs

processed/
└── document_name/
    ├── metadata.json               # document metadata (+ vlm_pages_total / vlm_failed_pages)
    ├── page_0001.txt               # extracted text per page
    ├── page_0001.png               # image per page
    ├── page_0001_structured.json   # structured data per page (JSON mode)
    ├── page_0001_structured.md     # Markdown per page (Markdown mode)
    ├── split_page.json             # page-level chunks
    ├── split_toc.json              # chunks merged by heading (table of contents)
    └── split_tree.json             # depth-based hierarchical tree chunks

Checking partial failures: pages whose VLM structure extraction failed (even after retries) are recorded in vlm_failed_pages in metadata.json, alongside vlm_pages_total. Those pages are still included as text-only fallback, so no content is lost, and the remaining pages are structured normally.
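A quick post-run check might look like the sketch below. It assumes vlm_failed_pages is a list of page numbers and vlm_pages_total is an integer, which this README does not spell out; the helper names are hypothetical.

```python
import json
from pathlib import Path


def load_metadata(doc_dir):
    """Read metadata.json from one processed document directory."""
    return json.loads(Path(doc_dir, "metadata.json").read_text(encoding="utf-8"))


def vlm_failure_summary(metadata: dict) -> dict:
    """Summarize which pages fell back to text-only output."""
    failed = list(metadata.get("vlm_failed_pages") or [])
    total = metadata.get("vlm_pages_total") or 0
    return {
        "failed_pages": failed,
        "total_pages": total,
        "failure_rate": (len(failed) / total) if total else 0.0,
    }


example = vlm_failure_summary({"vlm_pages_total": 10, "vlm_failed_pages": [3, 7]})
```

With real output, `vlm_failure_summary(load_metadata("./processed/document_name"))` would show whether a document is worth re-running before it is indexed.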


🏗️ Project Architecture

ModuDoc/
├── app.py            # Flask web API and UI server
├── utils.py          # core parsing logic (hybrid text extraction + VLM prompting)
├── hwp_extract.py    # native HWP/HWPX text and table extractor (no external dependencies)
├── hwpx_paginate.py  # HWPX page-boundary estimation utility
├── hwp_figures.py    # position-aware salvage of HWP/HWPX embedded images (visuals) + VLM description
├── table_validate.py # table HTML validation and repair + native HWP/HWPX table substitution and page redistribution
├── hwp_memo.py       # extraction of HWP/HWPX editor memos (comments)
├── chunker.py        # RAG chunking and post-processing (page / toc / tree, hierarchy path heading_path)
├── hwp_to_pdf.py     # Windows COM-based lossless HWP/HWPX converter
└── templates/        # Web UI templates

📜 License

This project is released under the MIT License.