[BUG]: Large .jsonl Files Cause "Invalid string length" Error – Request Streamed Cache Writing

### How are you running AnythingLLM?

Docker (remote machine)

### What happened?

### Problem Summary

When uploading and embedding large .jsonl files in AnythingLLM (Docker version, Weaviate backend), the embedding completes, but AnythingLLM throws an error at the very end:

```
[backend] info: [VectorDB::Weaviate] Inserting vectorized chunks into Weaviate collection.
[backend] info: [VectorDB::Weaviate] addDocumentToNamespace Invalid string length
[backend] error: Failed to vectorize rag_pdf.jsonl
```
I'm working on an AI pipeline that converts roughly 30 different file types into .jsonl for vector input. Some of the resulting .jsonl files are very large, so AnythingLLM should stream the inserts rather than perform one large insert.

**Root Cause**

- Large `.jsonl` files get chunked and embedded.

- After embedding, the chunk data is cached using `storeVectorResult`.

- The code currently calls `JSON.stringify(vectorData)` on the entire array of chunks. If the input file is large enough, the result exceeds the maximum string length in Node.js (about 2^29 characters, roughly 512 MB of single-byte text, on current 64-bit V8 builds; the exact value depends on the runtime and is exposed as `buffer.constants.MAX_STRING_LENGTH`), and V8 throws `RangeError: Invalid string length`.

- The resulting error is misleading: the data is often inserted into Weaviate successfully, and it is the local cache write afterward that fails.
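To make the limit concrete, here is a minimal sketch that reads the runtime's string limit and estimates whether a single `JSON.stringify` call over an array would fit under it. `fitsInOneString` is a hypothetical helper, not part of AnythingLLM, and it assumes every element is a plain JSON-serializable object.

```javascript
const { constants } = require("buffer");

// Hypothetical helper (not part of AnythingLLM): estimates whether
// JSON.stringify(items) would fit in a single JavaScript string.
// Assumes every item is a plain JSON-serializable object.
function fitsInOneString(items, limit = constants.MAX_STRING_LENGTH) {
  let total = 2; // opening "[" and closing "]"
  for (let i = 0; i < items.length; i++) {
    // Length of this element plus the comma separating it from the previous one.
    total += JSON.stringify(items[i]).length + (i > 0 ? 1 : 0);
    if (total > limit) return false; // stop early; no giant string is ever built
  }
  return true;
}

console.log(`Max string length on this runtime: ${constants.MAX_STRING_LENGTH}`);
```

Because each element is serialized separately, the check itself never approaches the limit, which is the same property the streamed cache write relies on.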

**Ideal Behavior:** Cache writing for vector data should use a streaming approach rather than a single giant JSON string to avoid out-of-memory and string length errors.

### Proposed Code Refactor

**Old Code**
File: server/utils/files/index.js

```
async function storeVectorResult(vectorData = [], filename = null) {
  if (!filename) return;
  console.log(
    `Caching vectorized results of ${filename} to prevent duplicated embedding.`
  );
  if (!fs.existsSync(vectorCachePath)) fs.mkdirSync(vectorCachePath);

  const digest = uuidv5(filename, uuidv5.URL);
  const writeTo = path.resolve(vectorCachePath, `${digest}.json`);
  fs.writeFileSync(writeTo, JSON.stringify(vectorData), "utf8");
  return;
}
```

**New Replacement Code (Streamed Write version)**

```
// Needs two extra imports at the top of the file:
const { Readable } = require("stream");
const { pipeline } = require("stream/promises");

async function storeVectorResult(vectorData = [], filename = null) {
  if (!filename) return;
  console.log(
    `Caching vectorized results of ${filename} to prevent duplicated embedding.`
  );
  if (!fs.existsSync(vectorCachePath)) fs.mkdirSync(vectorCachePath);

  const digest = uuidv5(filename, uuidv5.URL);
  const writeTo = path.resolve(vectorCachePath, `${digest}.json`);

  // Serialize one chunk at a time so no single string holds the whole array.
  function* jsonArrayParts() {
    yield "[";
    for (let i = 0; i < vectorData.length; i++) {
      yield (i > 0 ? "," : "") + JSON.stringify(vectorData[i]);
    }
    yield "]";
  }

  try {
    // pipeline() respects backpressure (a bare write() loop would buffer
    // every chunk in memory) and rejects if the write stream errors.
    await pipeline(
      Readable.from(jsonArrayParts()),
      fs.createWriteStream(writeTo, { encoding: "utf8" })
    );
  } catch (err) {
    console.error(`Failed to write vector cache for ${filename}:`, err.message);
    throw err;
  }
}
```

**Why this should be implemented**

- Prevents the `Invalid string length` failure and lowers peak memory use when caching large documents.
- Allows AnythingLLM to handle very large .jsonl files.
- Output file format remains exactly the same ([chunk0,chunk1,chunk2,...]); compatibility with current cache read logic is preserved.
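The format claim can be checked with a small round trip. The sketch below writes an array element by element and compares the file contents with the output of a single `JSON.stringify` call. `writeJsonArraySync` and the sample records are hypothetical, and it uses synchronous `fs.writeSync` calls instead of a stream purely to keep the check short.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: writes items as a JSON array one element at a time.
// Uses synchronous fs.writeSync instead of a stream to keep the check short.
function writeJsonArraySync(filePath, items) {
  const fd = fs.openSync(filePath, "w");
  try {
    fs.writeSync(fd, "[");
    for (let i = 0; i < items.length; i++) {
      fs.writeSync(fd, (i > 0 ? "," : "") + JSON.stringify(items[i]));
    }
    fs.writeSync(fd, "]");
  } finally {
    fs.closeSync(fd);
  }
}

// Made-up sample data standing in for embedded chunks.
const sample = [
  { id: "a", values: [0.1, 0.2], metadata: { text: "first" } },
  { id: "b", values: [0.3, 0.4], metadata: { text: "second" } },
];
const tmpFile = path.join(os.tmpdir(), `vector-cache-check-${process.pid}.json`);
writeJsonArraySync(tmpFile, sample);

// Compare the element-by-element output with the single-call output.
const written = fs.readFileSync(tmpFile, "utf8");
fs.unlinkSync(tmpFile);
console.log(written === JSON.stringify(sample));
```

One caveat, stated as an assumption about code not shown here: if the cache read path loads the whole file into one string before calling `JSON.parse`, a cache file above the string limit may still fail when it is read back, so the read side might need the same treatment.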

I hope this makes sense! 

Thank you!

### Are there known steps to reproduce?

Create a .jsonl file with 100,000 entries and embed it; it will fail. I used a 1000-token maximum for each "text" value, and the metadata object was about 150-300.
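For anyone trying to reproduce this, a rough sketch of a generator for such a file follows. `writeSampleJsonl` is a hypothetical helper, the `id` and `source` metadata fields are made up, and the 4-characters-per-token figure in the example call is only a rule-of-thumb assumption, not a measurement from my data.

```javascript
const fs = require("fs");

// Hypothetical generator for a synthetic test file. Each line is one JSON
// record with a "text" value padded to textChars characters. The "id" and
// "source" metadata fields are invented for illustration.
function writeSampleJsonl(filePath, entries, textChars) {
  const fd = fs.openSync(filePath, "w");
  try {
    for (let i = 0; i < entries; i++) {
      const record = {
        text: `entry ${i} `.padEnd(textChars, "x"),
        metadata: { id: i, source: "synthetic" },
      };
      // One record per line, written immediately so memory use stays flat.
      fs.writeSync(fd, JSON.stringify(record) + "\n");
    }
  } finally {
    fs.closeSync(fd);
  }
}

// Example call (assumes ~4 characters per token, so 1000 tokens is ~4000 characters):
// writeSampleJsonl("large-test.jsonl", 100000, 4000);
```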

[BUG]: Large .jsonl Files Cause "Invalid string length" Error – Request Streamed Cache Writing #5063
