How are you running AnythingLLM?
Docker (remote machine)
What happened?
Problem Summary
When uploading and embedding large .jsonl files in AnythingLLM (Docker version, Weaviate backend), the embedding completes, but AnythingLLM throws an error at the very end:
[backend] info: [VectorDB::Weaviate] Inserting vectorized chunks into Weaviate collection.
[backend] info: [VectorDB::Weaviate] addDocumentToNamespace Invalid string length
[backend] error: Failed to vectorize rag_pdf.jsonl
I'm working on an AI pipeline that converts ~30 different file types into .jsonl for vector input. Some of these .jsonl files are very large, but I expected the inserts to be streamed rather than performed as one large insert.
Root Cause
- Large .jsonl files get chunked and embedded.
- After embedding, the chunk data is cached using storeVectorResult.
- The code currently calls JSON.stringify(vectorData) on the entire array of chunks. If the input file is sufficiently large, the resulting string can exceed the maximum string length that Node.js (V8) allows (about 2^29 characters, roughly 512 MB, in recent 64-bit builds; the exact limit depends on the V8 version), which throws RangeError: Invalid string length.
- This results in a misleading error: the data is often successfully inserted into Weaviate, but the local cache write then fails.
Ideal Behavior: Cache writing for vector data should use a streaming approach rather than a single giant JSON string to avoid out-of-memory and string length errors.
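To make the limit concrete, here is a rough sketch that compares this Node.js build's maximum string length with an estimate of how long JSON.stringify(vectorData) would be. The helper name estimateSerializedLength, the chunk shape, the 1536-dimension vector, and the 4000-character text are all assumptions for illustration, not values taken from AnythingLLM.

```javascript
const { constants } = require("buffer");

// Largest string this Node.js build can create, in UTF-16 code units.
const maxStringLength = constants.MAX_STRING_LENGTH;

// Hypothetical helper: length of JSON.stringify() for an array holding
// `chunkCount` copies of `sampleChunk`, computed without building the string.
function estimateSerializedLength(sampleChunk, chunkCount) {
  const perChunk = JSON.stringify(sampleChunk).length;
  const commas = Math.max(chunkCount - 1, 0);
  return perChunk * chunkCount + commas + 2; // +2 for "[" and "]"
}

// Assumed chunk shape: an embedding vector plus the chunk text.
const sampleChunk = {
  values: Array.from({ length: 1536 }, () => Math.random()),
  metadata: { text: "x".repeat(4000) },
};

const estimated = estimateSerializedLength(sampleChunk, 100000);
console.log(`Max string length: ${maxStringLength}`);
console.log(`Estimated serialized length: ${estimated}`);
console.log(`Exceeds limit: ${estimated > maxStringLength}`);
```

Under these assumptions each chunk serializes to tens of thousands of characters, so 100,000 chunks land well past the limit even though every individual chunk is small.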
Proposed Code Refactor
Old Code
File: server/utils/files/index.js
async function storeVectorResult(vectorData = [], filename = null) {
  if (!filename) return;
  console.log(
    `Caching vectorized results of ${filename} to prevent duplicated embedding.`
  );
  if (!fs.existsSync(vectorCachePath)) fs.mkdirSync(vectorCachePath);
  const digest = uuidv5(filename, uuidv5.URL);
  const writeTo = path.resolve(vectorCachePath, `${digest}.json`);
  fs.writeFileSync(writeTo, JSON.stringify(vectorData), "utf8");
  return;
}
New Replacement Code (Streamed Write version)
async function storeVectorResult(vectorData = [], filename = null) {
  if (!filename) return;
  console.log(
    `Caching vectorized results of ${filename} to prevent duplicated embedding.`
  );
  if (!fs.existsSync(vectorCachePath)) fs.mkdirSync(vectorCachePath);
  const digest = uuidv5(filename, uuidv5.URL);
  const writeTo = path.resolve(vectorCachePath, `${digest}.json`);
  return new Promise((resolve, reject) => {
    const writeStream = fs.createWriteStream(writeTo, { encoding: "utf8" });
    writeStream.on("error", (err) => {
      console.error(`Failed to write vector cache for ${filename}:`, err.message);
      reject(err);
    });
    writeStream.on("finish", resolve);
    let i = 0;
    // Serialize one chunk at a time. When write() returns false the stream's
    // buffer is full, so pause and resume on "drain" instead of queueing the
    // whole file in memory.
    const writeNext = () => {
      while (i < vectorData.length) {
        const chunkJson = (i === 0 ? "[" : ",") + JSON.stringify(vectorData[i]);
        i++;
        if (!writeStream.write(chunkJson)) return writeStream.once("drain", writeNext);
      }
      // An empty array is written as "[]"; otherwise close the open bracket.
      writeStream.end(vectorData.length === 0 ? "[]" : "]");
    };
    writeNext();
  });
}
Why this should be implemented
- Prevents fatal errors and out-of-memory crashes for large documents.
- Allows AnythingLLM to handle very large .jsonl.
- Output file format remains exactly the same ([chunk0,chunk1,chunk2,...]); compatibility with current cache read logic is preserved.
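The format claim in the last point can be checked with a small round trip. This is only a sketch: writeArrayIncrementally is a hypothetical helper that uses synchronous fs.writeSync calls instead of a write stream so the check stays short, and the chunk shape is made up for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: write an array as JSON one element at a time,
// so no single string ever holds the whole array.
function writeArrayIncrementally(filePath, items) {
  const fd = fs.openSync(filePath, "w");
  try {
    fs.writeSync(fd, "[");
    items.forEach((item, i) => {
      if (i > 0) fs.writeSync(fd, ",");
      fs.writeSync(fd, JSON.stringify(item));
    });
    fs.writeSync(fd, "]");
  } finally {
    fs.closeSync(fd);
  }
}

// Assumed chunk shape, for illustration only.
const chunks = [
  { vectorDbId: "a", values: [0.1, 0.2], metadata: { text: "first" } },
  { vectorDbId: "b", values: [0.3, 0.4], metadata: { text: "second" } },
];

const tmpFile = path.join(os.tmpdir(), `vector-cache-check-${process.pid}.json`);
writeArrayIncrementally(tmpFile, chunks);
const written = fs.readFileSync(tmpFile, "utf8");
fs.unlinkSync(tmpFile);

// The incremental output matches a single JSON.stringify call byte for byte.
console.log(written === JSON.stringify(chunks)); // true
```

The equivalence holds when every element is a plain JSON-serializable object. An undefined element would differ, because JSON.stringify turns it into null inside an array but returns undefined when called on it directly.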
I hope this makes sense!
Thank you!
Are there known steps to reproduce?
Create a .jsonl file with 100,000 entries and embed it; it will fail. I used a 1,000-token maximum for each "text" value, and each metadata object was about 150-300.