[BUG]: Large .jsonl Files Cause "Invalid string length" Error – Request Streamed Cache Writing #5063

Description

@Speedy059

How are you running AnythingLLM?

Docker (remote machine)

What happened?

Problem Summary

When uploading and embedding large .jsonl files in AnythingLLM (Docker version, Weaviate backend), the embedding completes, but AnythingLLM throws an error at the very end:

[backend] info: [VectorDB::Weaviate] Inserting vectorized chunks into Weaviate collection.
[backend] info: [VectorDB::Weaviate] addDocumentToNamespace Invalid string length
[backend] error: Failed to vectorize rag_pdf.jsonl

I'm working on an AI pipeline that converts ~30 different file types into .jsonl for vector input. Some of these .jsonl files are very large, so the data should be streamed in pieces rather than handled as one large insert.

Root Cause

  • Large .jsonl files get chunked and embedded.

  • After embedding, the chunk data is cached using storeVectorResult.

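  • The code currently uses JSON.stringify(vectorData) on the entire array of chunks. If the input file is sufficiently large, the resulting string can exceed the maximum string length of the V8 engine that Node.js runs on (about 2^29 characters, roughly 512 MB of single-byte text, on current 64-bit builds; the exact value for a given runtime is exposed as buffer.constants.MAX_STRING_LENGTH), and JSON.stringify throws RangeError: Invalid string length.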

  • This results in a misleading error: the data is often successfully inserted into Weaviate, but then fails to be cached/written locally.

Ideal Behavior: Cache writing for vector data should use a streaming approach rather than a single giant JSON string to avoid out-of-memory and string length errors.
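To check where the limit sits on a given runtime, and whether a particular array would cross it, something like the following sketch may help. It is not taken from the AnythingLLM codebase: `estimateSerializedLength` and `wouldExceedStringLimit` are hypothetical helper names, and the sketch assumes the array holds JSON-serializable values. It adds up per-element serialized lengths, so it can predict whether `JSON.stringify(vectorData)` would overflow without ever building the full string.

```javascript
const { constants } = require("buffer");

// Hypothetical helper: computes the length JSON.stringify(items) would have
// by serializing one element at a time, so no single huge string is created.
function estimateSerializedLength(items) {
  let total = 2; // opening "[" and closing "]"
  for (let i = 0; i < items.length; i++) {
    // JSON.stringify returns undefined for values such as `undefined`;
    // inside an array those are serialized as "null".
    const piece = JSON.stringify(items[i]) ?? "null";
    total += piece.length;
    if (i > 0) total += 1; // separating comma
  }
  return total;
}

// Hypothetical helper: true if a single JSON.stringify(items) call would
// need a string longer than this runtime allows.
function wouldExceedStringLimit(items) {
  return estimateSerializedLength(items) > constants.MAX_STRING_LENGTH;
}
```

Logging `constants.MAX_STRING_LENGTH` inside the Docker container shows the actual ceiling for that Node.js build, which is more reliable than a quoted figure.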

Proposed Code Refactor

Old Code
File: server/utils/files/index.js

async function storeVectorResult(vectorData = [], filename = null) {
  if (!filename) return;
  console.log(
    `Caching vectorized results of ${filename} to prevent duplicated embedding.`
  );
  if (!fs.existsSync(vectorCachePath)) fs.mkdirSync(vectorCachePath);

  const digest = uuidv5(filename, uuidv5.URL);
  const writeTo = path.resolve(vectorCachePath, `${digest}.json`);
  fs.writeFileSync(writeTo, JSON.stringify(vectorData), "utf8");
  return;
}

New Replacement Code (Streamed Write version)

async function storeVectorResult(vectorData = [], filename = null) {
  if (!filename) return;
  console.log(
    `Caching vectorized results of ${filename} to prevent duplicated embedding.`
  );
  if (!fs.existsSync(vectorCachePath)) fs.mkdirSync(vectorCachePath);

  const digest = uuidv5(filename, uuidv5.URL);
  const writeTo = path.resolve(vectorCachePath, `${digest}.json`);

  return new Promise((resolve, reject) => {
    const writeStream = fs.createWriteStream(writeTo, { encoding: "utf8" });

    writeStream.on("error", (err) => {
      console.error(`Failed to write vector cache for ${filename}:`, err.message);
      reject(err);
    });

    writeStream.on("finish", resolve);

    // Serialize one chunk at a time. When write() returns false the stream's
    // internal buffer is full, so wait for "drain" before continuing;
    // otherwise the whole output would still pile up in memory.
    let i = 0;
    const writeNext = () => {
      while (i < vectorData.length) {
        const prefix = i === 0 ? "[" : ",";
        const ok = writeStream.write(prefix + JSON.stringify(vectorData[i]));
        i++;
        if (!ok) {
          writeStream.once("drain", writeNext);
          return;
        }
      }
      // An empty array must still produce valid JSON.
      writeStream.end(vectorData.length === 0 ? "[]" : "]");
    };
    writeNext();
  });
}

Why this should be implemented

  • Prevents fatal errors and out-of-memory crashes for large documents.
  • Allows AnythingLLM to handle very large .jsonl files.
  • Output file format remains exactly the same ([chunk0,chunk1,chunk2,...]); compatibility with current cache read logic is preserved.
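The format-compatibility point can be checked on a small sample. The sketch below is a standalone illustration, not AnythingLLM code: `writeJsonArrayStreamed` and `roundTripMatches` are hypothetical names, and it assumes every element is a JSON-serializable object, as the cached chunks appear to be. It writes an array element by element to a temporary file and compares the bytes on disk with what a single `JSON.stringify` call would have produced.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: writes `items` as a JSON array, one element per write,
// waiting for "drain" whenever the stream's internal buffer is full.
function writeJsonArrayStreamed(filePath, items) {
  return new Promise((resolve, reject) => {
    const stream = fs.createWriteStream(filePath, { encoding: "utf8" });
    stream.on("error", reject);
    stream.on("finish", resolve);

    let i = 0;
    const writeNext = () => {
      while (i < items.length) {
        const prefix = i === 0 ? "[" : ",";
        const ok = stream.write(prefix + JSON.stringify(items[i]));
        i++;
        if (!ok) {
          stream.once("drain", writeNext);
          return;
        }
      }
      stream.end(items.length === 0 ? "[]" : "]");
    };
    writeNext();
  });
}

// Hypothetical check: resolves to true when the streamed file is identical
// to the output of a single JSON.stringify(items) call.
async function roundTripMatches(items) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), "vector-cache-"));
  const file = path.join(dir, "sample.json");
  try {
    await writeJsonArrayStreamed(file, items);
    return fs.readFileSync(file, "utf8") === JSON.stringify(items);
  } finally {
    fs.rmSync(dir, { recursive: true, force: true });
  }
}
```

Note that this only addresses the write side. If the cache is later read back with a single `JSON.parse` on the whole file, a file this large may run into the same string length limit on read.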

I hope this makes sense!

Thank you!

Are there known steps to reproduce?

Create a .jsonl file with 100,000 entries and it'll fail. I used a 1000-token max limit for each "text" value, and each metadata object was about 150-300.
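For anyone trying to replicate this, a synthetic file of that size could be generated with a sketch along these lines. The entry shape is an assumption: the report only says each line has a "text" value and a metadata object, so the field names inside `metadata`, the helper names `makeEntry` and `writeSyntheticJsonl`, and the use of roughly 4000 characters as a stand-in for about 1000 tokens are all placeholders.

```javascript
const fs = require("fs");

// Hypothetical entry shape; only "text" and a metadata object are mentioned
// in the report, the metadata field names here are placeholders.
function makeEntry(index, textLength) {
  return {
    text: "x".repeat(textLength),
    metadata: { source: "synthetic", index },
  };
}

// Writes `count` entries, one JSON object per line, with synchronous writes
// so that only one line is held in memory at a time.
function writeSyntheticJsonl(filePath, count, textLength) {
  const fd = fs.openSync(filePath, "w");
  try {
    for (let i = 0; i < count; i++) {
      fs.writeSync(fd, JSON.stringify(makeEntry(i, textLength)) + "\n");
    }
  } finally {
    fs.closeSync(fd);
  }
}

// Example call (assumption: ~4000 characters approximates ~1000 tokens):
// writeSyntheticJsonl("repro.jsonl", 100000, 4000);
```

Whether the resulting upload actually crosses the limit depends on the embedding dimensions and chunking settings, since the cached data includes the vectors and not just the text.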

Metadata

Labels: possible bug (Bug was reported but is not confirmed or is unable to be replicated.)
