TileLang-Ascend

Tile Language Ascend (tilelang-ascend) is a specialized variant of the tile-lang domain-specific language, specifically optimized for Huawei Ascend NPU (Neural Processing Unit) architecture. Built upon the foundation of tile-lang's Pythonic syntax and TVM compiler infrastructure, tilelang-ascend enables developers to efficiently create high-performance AI compute kernels tailored for Ascend processors, including operations like GEMM, vector operations, and attention mechanisms. Tilelang-ascend allows developers to focus on productivity without sacrificing the low-level optimizations necessary for state-of-the-art performance on the NPU.

Within the TileLang ecosystem, we have developed an NPU Intermediate Representation (AscendNPU IR) infrastructure specifically for Ascend, enabling seamless integration into the open-source AI compiler ecosystem based on MLIR. This effort not only enhances the openness and extensibility of the compiler stack but also provides developers with a more flexible and efficient pathway for custom operator development. The compiler backend supports two technical routes: AscendNPU IR and Ascend C & PTO.

Latest News

4/24/2026 🚀: Released DeepSeek V4 kernels DeepSeek-V4!
3/28/2026 🚀: We provide a free environment to facilitate user experience and development for TileLang Pull Request#708.
1/23/2026 🚀: TileLang now supports CANN 8.5. Check out Pull Request#334 and Pull Request#346 for details!
1/2/2026 🚀: Support for store_nz2nd, arange, concat, flip, bitcast, and vpad ops. Check out Pull Request#201, Pull Request#202, and Pull Request#203 for details!
12/28/2025 🚀: Support for CV automatic pipelining (T.Pipelined), enabling parallel collaboration between Cube and Vector cores to boost performance. Check out Pull Request#181 for details!
12/29/2025 🚀: Support for double buffer, enabling concurrent computation and data transfer. Automatically takes effect in developer mode.
12/28/2025 🚀: Support for automatic buffer reuse, enabling improved memory efficiency. Automatically takes effect in developer mode.
12/28/2025 🚀: Support for T.Parallel with automatic vectorization, enabling scalar operations to be automatically converted to vector computations. Check out Pull Request#171 for details!
12/27/2025 🚀: We are excited to announce support for developer mode, enabling programming consistency across different hardware architectures. Check out Pull Request#173 for details!
12/16/2025 🚀: Support for developer mode memory op (T.alloc_shared, T.alloc_fragment), check out Pull Request#129 for details!
12/09/2025 🚀: Support for additional vector ops (VFlip, VLn, VNot, VAbs, VAnd, VOr, VCmp, VPad, VPow, VRec, VRelu, VRSqrt, VSel, VShl, VShr, VXor, VTranspose, VInterleave, VGather)!
12/09/2025 🚀: Support compilation based on MLIR APIs and fully replace the previous string-based implementation, providing a robust and extensible compilation process!
11/21/2025 🚀: Support integrated compilation of open source AscendNPU-IR together with TileLang, easing the compilation experience!
09/29/2025 🚀: Officially establish the NPU Intermediate Representation (AscendNPU IR) infrastructure for Ascend within the TileLang ecosystem, deeply integrating into the open-source AI compiler ecosystem based on MLIR. At the same time, deliver peak performance—fusion operators such as FlashAttention (FA) written in TileLang achieve performance on Ascend hardware that matches hand-written AscendC equivalents at a 1.0x level, balancing both development efficiency and ultimate performance!

High-Performance Ascend Operators with Developer Mode

TileLang enables developers to write high-performance kernels on Ascend NPU with minimal effort. The following benchmarks demonstrate that TileLang achieves performance comparable to hand-written AscendC implementations.

Cube Operators

Typical Case: GEMM | Average Performance: 0.98x AscendC | Example

Operator	Test Shape	AscendC (us)	TileLang (us)	Performance Ratio
gemm	M,N,K=(4096,4096,4096)	497.930	501.190	0.993
batch_gemm	Batch,M,N,K=(8,4096,4096,4096)	3800.376	3963.859	0.959

Vector Operators

Typical Case: DeepSeek-V4 mHC | Average Performance: 0.96x AscendC | Example

Operator	Test Shape	AscendC (us)	TileLang (us)	Performance Ratio
hc_sinkhorn (DSV4)	b*s=128; hc=4; d=4096	24.6	28.21	0.872
	b*s=4096; hc=4; d=4096	485.97	490.84	0.990
	b*s=16384; hc=4; d=4096	1902.1	1850.12	1.028

Cube-Vector Operators

Typical Case: Flash Attention | Average Performance: 0.95x AscendC | Example

Environment Variables Guide

Currently, we need to set environment variables to configure the developer mode or expert mode. For more environment variables, please refer to the EnvironmentVariables.md

Variable	Default	Description	Valid Values
`TILELANG_ASCEND_MODE`	Expert	Set the TileLang Mode; currently, Expert mode and Developer mode are supported	`Expert`: Expert Mode `Developer`: Developer Mode

Tested Devices

Although TileLang aims to support portability across a variety of devices, it has been specifically tested and validated on the following hardware:Huawei Ascend AI accelerators, including A2/A3.

Accessing Ascend NPU

If you need to access Ascend NPU computing resources for development or testing, please visit the HiDevLab - Online Development page on the Huawei HiDevLab platform to apply for and use them

OP Implementation Examples

tile-lang provides the building blocks to implement a wide variety of operators. Some examples include:

Within the examples directory, you will also find additional complex kernels—such as convolutions, forward/backward passes for FlashAttention, more operators will continuously be added.

Installation

Environment Setup

Install the Ascend Toolkit.

Download the installation package，installAscend-cann-toolkit.For complete installation instructions, refer to the relevant documentation.

chmod +x Ascend-cann-toolkit_{ascend_cann_toolkit version}_linux-aarch64.run
./Ascend-cann-toolkit_{ascend-cann-toolkit version}_linux-aarch64.run --install

Configure environment variables:

source /path/to/install/Ascend/ascend-toolkit/set_env.sh

Prepare a Python environment with Python version between 3.7.x and 3.11.4 (inclusive) and ensure that pip3 is available.

Ascend Toolkit Installation Requirements

pip3 install attrs cython 'numpy>=1.19.2,<=1.24.0' decorator sympy cffi pyyaml pathlib2 psutil protobuf==3.20.0 scipy requests absl-py

Set Environment Variables

export ACL_OP_INIT_MODE=1

Note: If you require a new compiler installation package, please contact the community administrators: zhaojiqiao@huawei.com, yangsichan@huawei.com

Build

Pull the code

git clone https://github.com/tile-ai/tilelang-mlir-ascend.git --recursive

Run the installation script

cd tilelang-mlir-ascend
# build AscendNPU-IR in 3rdparty
bash install_npuir.sh
# Alternative way of building with local AscendNPU-IR
bash install_npuir.sh --bishengir-path=/path/to/bishengir-compile

Install torch_npu

pip install pybind11 torch_npu

Quick Start

This code implements a vector addition kernel using TileLang, a domain-specific language for NPU (Neural Processing Unit) programming. It defines a parallel kernel that adds two float32 vectors of length 4096 on the NPU by loading data into on-chip unified buffers, performing element-wise addition via a low-level NPU instruction (npuir_add), and writing the result back to global memory. The test function compares the kernel’s output against PyTorch’s native vector addition to verify correctness. The example runs on NPU device 6 and demonstrates basic TileLang workflow: kernel definition, compilation to AscendNPU IR, and execution with PyTorch tensors.

# Copyright (c) Tile-AI Corporation.
# Licensed under the MIT License.

import os

import tilelang
import tilelang.language as T  # Import TileLang DSL for kernel definition

import torch
import torch_npu  # Import NPU (Neural Processing Unit) backend support for PyTorch

# Clear any previously cached compiled kernels to ensure a clean run
tilelang.cache.clear_cache()

# Define data type and sequence length for the vector addition
dtype = "float32"
seq_len = 4096  # Length of the vectors to be added

def vec_add(N, block_N, dtype="float32"):
    """
    Define a vector addition kernel using TileLang.

    Parameters:
    - N: Total length of the vectors.
    - block_N: Number of elements processed per kernel thread/block.
    - dtype: Data type of the tensors (default: "float32").

    Returns:
    - A TileLang prim_func representing the vector addition kernel.
    """
    n_num = N // block_N  # Number of blocks (each block processes `block_N` elements)

    @T.prim_func
    def main(
        A: T.Tensor((N), dtype),  # Input tensor A
        B: T.Tensor((N), dtype),  # Input tensor B
        C: T.Tensor((N), dtype),  # Output tensor C = A + B
        shape: T.int32,           # Actual size (used for handling tail cases if N is not divisible by block_N)
    ):
        # Launch kernel with `n_num` parallel threads on the NPU
        with T.Kernel(n_num, is_npu=True) as (cid, _):
            # Allocate on-chip Unified Buffer (UB) for local computation
            A_VEC = T.alloc_ub((block_N), dtype)
            B_VEC = T.alloc_ub((block_N), dtype)
            C_VEC = T.alloc_ub((block_N), dtype)

            # Calculate the starting index for this thread
            start_idx = cid * block_N
            # Compute remaining elements from this start index to the end of the tensor
            remaining = shape - start_idx
            # Determine how many elements this thread should actually process (handles tail)
            tail_size = T.min(block_N, remaining)

            # Copy data from global memory (A, B) into on-chip buffers (A_VEC, B_VEC)
            T.copy(A[start_idx : start_idx + tail_size], A_VEC[0:tail_size])
            T.copy(B[start_idx : start_idx + tail_size], B_VEC[0:tail_size])

            # Perform vector addition on the NPU using low-level NPU IR instruction
            T.npuir_add(A_VEC, B_VEC, C_VEC)

            # Write the result back from on-chip buffer (C_VEC) to global memory (C)
            T.copy(C_VEC[0:tail_size], C[start_idx : start_idx + tail_size])

    return main

def test_vec_add():
    """
    Test function to validate the vector addition kernel.
    Compares the result of the custom TileLang kernel against PyTorch's native addition.
    """

    # Instantiate the vector addition kernel for the full sequence length (single block)
    func = vec_add(seq_len, seq_len)

    # Compile the TileLang function to NPU IR for execution on the NPU
    compiled_kernel = tilelang.compile(func, target="npuir")

    # Create random input tensors on the NPU
    v1 = torch.randn(size=[seq_len], dtype=eval("torch." + dtype)).npu()
    v2 = torch.randn(size=[seq_len], dtype=eval("torch." + dtype)).npu()
    v3 = torch.zeros(size=[seq_len], dtype=eval("torch." + dtype)).npu()  # Output buffer

    # Compute reference result using PyTorch's native addition (on NPU)
    y_ref = v1 + v2

    # Launch the compiled TileLang kernel
    compiled_kernel(v1, v2, v3, seq_len)

    # Print both results for visual comparison (should be nearly identical)
    print("Reference result (PyTorch):")
    print(y_ref)
    print("TileLang kernel result:")
    print(v3)

if __name__ == "__main__":
    test_vec_add()

GEMM Example using developer mode

The developer mode approach simplifies code portability across different hardware. Below is a basic matrix multiplication (GEMM) example demonstrating its implementation.

# To run on Ascend NPU, Specify npuir as target in JIT
@tilelang.jit(out_idx=[-1], target="npuir")
def matmul(M, N, K, block_M, block_N, block_K, dtype="float16", accum_dtype="float32"):
    @T.prim_func
    def gemm(
        A: T.Tensor((M, K), dtype), # Input matrix A
        B: T.Tensor((K, N), dtype), # Input matrix B
        C: T.Tensor((M, N), dtype), # Output matrix C
    ):
        with T.Kernel(T.ceildiv(N, block_N) * T.ceildiv(M, block_M), is_npu=True) as (cid, _):

            by = cid // T.ceildiv(N, block_N) # Block row index
            bx = cid % T.ceildiv(N, block_N)  # Block column index

            # Alloc shared memory for inputs
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)

            # Alloc local fragment for accumulation
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)

            # Loop over the K dimension in block_K chunks, using ping-pong
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
                # Copy the data from global memory to shared memory
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)

                # Perform matrix multiplication with accumulation
                # If 'initC' is true, the result matrix will be initialized to zero before accumulation
                T.gemm(A_shared, B_shared, C_local, initC=(k == 0))

            # Copy the accumulated result from local memory to global memory
            T.copy(C_local, C[by * block_M, bx * block_N])

    return gemm

Roadmap

Acknowledgements

Peking University Kunpeng & Ascend Center for Excellence in Science, Education, Innovation

Name		Name	Last commit message	Last commit date
Latest commit History 709 Commits
.agents		.agents
.github/workflows		.github/workflows
3rdparty		3rdparty
benchmark		benchmark
docker		docker
docs		docs
examples		examples
images		images
maint/scripts		maint/scripts
src		src
testing		testing
tilelang		tilelang
tilelangir		tilelangir
unittest/npuir		unittest/npuir
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.gitmodules		.gitmodules
.pre-commit-config.yaml		.pre-commit-config.yaml
.pymarkdown		.pymarkdown
AGENTS.md		AGENTS.md
CMakeLists.txt		CMakeLists.txt
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
Makefile		Makefile
README.md		README.md
THIRDPARTYNOTICES.txt		THIRDPARTYNOTICES.txt
VERSION		VERSION
build_wheel.sh		build_wheel.sh
format.sh		format.sh
install_ascend.sh		install_ascend.sh
install_cpu.sh		install_cpu.sh
install_cuda.sh		install_cuda.sh
install_npuir.sh		install_npuir.sh
install_rocm.sh		install_rocm.sh
pyproject.toml		pyproject.toml
requirements-build.txt		requirements-build.txt
requirements-dev.txt		requirements-dev.txt
requirements-lint.txt		requirements-lint.txt
requirements-test.txt		requirements-test.txt
requirements.txt		requirements.txt
run-clang-tidy.py		run-clang-tidy.py
setup.py		setup.py
tox.ini		tox.ini

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TileLang-Ascend

Latest News

High-Performance Ascend Operators with Developer Mode

Cube Operators

Vector Operators

Cube-Vector Operators

Environment Variables Guide

Tested Devices

Accessing Ascend NPU

OP Implementation Examples

Installation

Environment Setup

Build

Quick Start

GEMM Example using developer mode

Roadmap

Acknowledgements

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TileLang-Ascend

Latest News

High-Performance Ascend Operators with Developer Mode

Cube Operators

Vector Operators

Cube-Vector Operators

Environment Variables Guide

Tested Devices

Accessing Ascend NPU

OP Implementation Examples

Installation

Environment Setup

Build

Quick Start

GEMM Example using developer mode

Roadmap

Acknowledgements

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages