Intelligent load balancer for distributed vLLM server clusters
Updated Oct 22, 2025 - Python
Diagnostic tool for vLLM inference servers
agentsculptor is an experimental AI-powered development agent designed to analyze, refactor, and extend Python projects automatically. It uses an OpenAI-like planner–executor loop on top of a vLLM backend, combining project context analysis, structured tool calls, and iterative refinement. It has only been tested with gpt-oss-120b via vLLM.
Wheels & Docker images for running vLLM on CPU-only systems, optimized for different CPU instruction sets
A curated list of plugins built on top of vLLM
A simple UI and config generator to run vLLM with Docker, GPU settings, model config parsing, memory estimation, and OpenAI-compatible test clients.
The core source files for this self-hostable successor to the OpenAI Assistants API. To contribute to the core logic, fork or submit pull requests to this repo.
Performant LLM inferencing on Kubernetes via vLLM
Qwen 3.5 Reverse Proxy for handling instant / thinking modes and their variants automatically
This repository contains Terraform configuration for the vLLM production-stack on cloud-managed Kubernetes
[KAIST CS632] Road damage detection using YOLOv8 on Xilinx FPGA, repair estimation with vLLM-Serve Phi-3.5 FAISS RAG, and data management via GS1 EPCISv2 and React dashboard
Deploy the Magistral-Small-2506 model using vLLM and Modal
Documenting my journey of understanding modern AI concepts—from Transformers and LLMs to RAG, Agents, Embeddings, Vector Databases, and vLLM.
This project offers a production-ready RAG (Retrieval-Augmented Generation) API running on FastAPI, utilizing the high-performance vLLM engine.
A production-style LLM evaluation pipeline spanning vLLM serving, lm-eval-harness integration, performance metrics (TTFT/TPOT/p95), deterministic guardrails, and statistically significant benchmark improvements.
Deploy the SOTA Qwen 3.5 to 9B Serverless
Load testing openai/gpt-oss-20b with vLLM and Docker
A collection of practical open-source projects developed with AI assistance
A reproducible benchmarking suite for vLLM inference. Measure latency, throughput, and VRAM across model configurations, quantization schemes, and deployment environments.