👋 Welcome to MLC LLM

Discord | GitHub

Machine Learning Compilation for Large Language Models (MLC LLM) is a high-performance, universal deployment solution. It allows any large language model to be deployed natively, with native APIs and compiler acceleration. The mission of this project is to enable everyone to develop, optimize, and deploy AI models natively on everyone’s devices using ML compilation techniques.

Getting Started

To begin, try out MLC LLM's support for the int4-quantized Llama2-7B model. We recommend at least 6 GB of free VRAM to run it.
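As a rough check on the 6 GB figure, the arithmetic below estimates memory for the quantized weights plus a full-context KV cache. Several inputs are assumptions for illustration and are not taken from MLC LLM:

- the model dimensions: about 6.74B parameters, 32 layers, hidden size 4096, and a 4096-token context;
- the q4f16_1 layout: 4-bit weights with one fp16 scale per group of 32 weights.

Activations, workspace buffers, and runtime overhead are ignored.

```python
def estimate_weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for quantized weights, in bytes."""
    return n_params * bits_per_weight / 8


def estimate_kv_cache_bytes(n_layers: int, hidden_size: int, context_len: int,
                            bytes_per_elem: int = 2) -> int:
    """One key and one value vector of hidden_size per layer per token (fp16 by default)."""
    return 2 * n_layers * hidden_size * context_len * bytes_per_elem


# Assumed Llama-2-7B figures (illustrative, not from the MLC LLM docs).
N_PARAMS = 6.74e9
# Assumed q4f16_1 layout: 4-bit weights + one 16-bit scale per 32 weights.
BITS_PER_WEIGHT = 4 + 16 / 32

weights = estimate_weight_bytes(N_PARAMS, BITS_PER_WEIGHT)
kv_cache = estimate_kv_cache_bytes(n_layers=32, hidden_size=4096, context_len=4096)
total_gb = (weights + kv_cache) / 1e9
print(f"weights ~{weights / 1e9:.2f} GB, KV cache ~{kv_cache / 1e9:.2f} GB, "
      f"total ~{total_gb:.2f} GB")
```

Under these assumptions the estimate comes to roughly 5.9 GB, in line with the 6 GB recommendation. Halving the context length would halve the KV-cache term.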

Install MLC Chat Python. MLC LLM is available via pip. We recommend installing it in an isolated conda virtual environment.
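After installing, you can check which conda environment is active and whether the package can be imported. This is a minimal sketch and makes two assumptions:

- the import name is `mlc_chat`, matching the `from mlc_chat import ChatModule` line in the Python script of this section;
- conda exports `CONDA_DEFAULT_ENV` and `CONDA_PREFIX` on activation.

`describe_environment` is a hypothetical helper, not part of MLC LLM.

```python
import importlib.util
import os


def describe_environment() -> dict:
    """Report the active conda environment and whether mlc_chat is importable."""
    return {
        # Set by `conda activate`: environment name and its install prefix.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
        # find_spec returns None when no installed package has this name.
        "mlc_chat_installed": importlib.util.find_spec("mlc_chat") is not None,
    }


if __name__ == "__main__":
    info = describe_environment()
    if info["conda_env"] in (None, "base"):
        print("Warning: not inside a dedicated conda environment")
    print(info)
```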

Download pre-quantized weights. The commands below download the int4-quantized Llama2-7B weights from Hugging Face:

git lfs install && mkdir -p dist/prebuilt
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
                          dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1
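If Git LFS is not set up correctly, `git clone` can leave small text "pointer" files in place of the real weight shards. The sketch below scans the cloned directory for such stubs by checking for the standard Git LFS pointer header. The helper name `find_lfs_pointers` is hypothetical. The directory path is the one used in the commands above.

```python
from pathlib import Path
from typing import List

# Un-fetched Git LFS files are text stubs beginning with this header line.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir: str) -> List[Path]:
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    if not root.is_dir():
        raise FileNotFoundError(model_dir)
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        with path.open("rb") as f:
            head = f.read(len(LFS_POINTER_PREFIX))
        if head == LFS_POINTER_PREFIX:
            stubs.append(path)
    return stubs


if __name__ == "__main__":
    model_dir = "dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1"
    stubs = find_lfs_pointers(model_dir)
    if stubs:
        print(f"{len(stubs)} file(s) are LFS pointers; run `git lfs pull` in {model_dir}")
    else:
        print("No LFS pointer stubs found")
```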

Download pre-compiled model library. The pre-compiled model libraries can be downloaded as follows:

git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib
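To confirm that a library for the model landed in `dist/prebuilt/lib`, you can list matching files. This sketch makes two assumptions that should be checked against the actual clone:

- library files are named `<model>-<backend>.<ext>`;
- the extension is `.dll` on Windows and `.so` on Linux and macOS.

`find_model_libs` and `LIB_EXTENSIONS` are hypothetical names.

```python
import platform
from pathlib import Path
from typing import List, Optional

# Hypothetical mapping from platform.system() to the library file extension;
# the real files in binary-mlc-llm-libs may follow a different convention.
LIB_EXTENSIONS = {"Windows": ".dll", "Linux": ".so", "Darwin": ".so"}


def find_model_libs(lib_dir: str, model: str, system: Optional[str] = None) -> List[str]:
    """Return names of files in lib_dir matching `<model>-*<ext>` for this OS."""
    system = system or platform.system()
    ext = LIB_EXTENSIONS.get(system)
    if ext is None:
        return []  # unknown OS: no assumed extension
    return sorted(p.name for p in Path(lib_dir).glob(f"{model}-*{ext}"))


if __name__ == "__main__":
    libs = find_model_libs("dist/prebuilt/lib", "Llama-2-7b-chat-hf-q4f16_1")
    print(libs or "No matching library found; check the clone of binary-mlc-llm-libs")
```

The function only filters by name. Choosing which backend's library to use is left to you.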

Run in Python. The following Python script showcases the MLC LLM Python API and its streaming capability:

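from mlc_chat import ChatModule
from mlc_chat.callback import StreamToStdout

# Load the pre-quantized Llama-2-7B weights and model library downloaded above
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")
# Generate a reply and stream it to stdout as it is produced
cm.generate(prompt="What is the meaning of life?", progress_callback=StreamToStdout(callback_interval=2))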
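To illustrate what "streaming" means here, the sketch below mimics the pattern in plain Python without calling MLC LLM. A simulated decoder periodically hands the full message generated so far to a callback, and the callback prints only the newly added text. `DeltaPrinter` and `fake_generate` are hypothetical names. This is an assumption about the general pattern, not the implementation of `StreamToStdout`.

```python
from typing import List


class DeltaPrinter:
    """Hypothetical streaming callback: receives the cumulative message and
    prints only the part that was not printed before."""

    def __init__(self, callback_interval: int = 2):
        self.callback_interval = callback_interval  # decode steps between calls
        self.printed = ""  # text already shown to the user

    def __call__(self, message: str, stopped: bool = False) -> str:
        delta = message[len(self.printed):]
        self.printed = message
        print(delta, end="\n" if stopped else "", flush=True)
        return delta


def fake_generate(tokens: List[str], callback: DeltaPrinter) -> str:
    """Simulate a decoder that emits one token per step, calls the callback
    every `callback_interval` steps, and once more when generation stops."""
    message = ""
    for step, token in enumerate(tokens, start=1):
        message += token
        if step % callback.callback_interval == 0:
            callback(message)
    callback(message, stopped=True)
    return message


if __name__ == "__main__":
    fake_generate(["The", " meaning", " of", " life", " is", " 42."], DeltaPrinter())
```

A larger interval means fewer callback invocations, each printing bigger chunks.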

Colab walkthrough. A Jupyter notebook on Colab provides a detailed walkthrough of the Python API.

Documentation and tutorials. The Python API reference and tutorials are available online.

Figure: MLC LLM Python API (https://raw.githubusercontent.com/mlc-ai/web-data/main/images/mlc-llm/tutorials/python-api.jpg)

Install MLC Chat CLI. MLC Chat CLI is available via conda using the command below. We recommend installing it in an isolated conda virtual environment. Windows and Linux users should make sure the latest Vulkan driver is installed.

conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv

Download pre-quantized weights. The commands below download the int4-quantized Llama2-7B weights from Hugging Face:

git lfs install && mkdir -p dist/prebuilt
git clone https://huggingface.co/mlc-ai/mlc-chat-Llama-2-7b-chat-hf-q4f16_1 \
                          dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1

Download pre-compiled model library. The pre-compiled model libraries can be downloaded as follows:

git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

Run from the command line.

mlc_chat_cli --model Llama-2-7b-chat-hf-q4f16_1
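The same command can be launched from a Python script, for example as part of a setup check. This sketch uses only the `--model` flag shown above. It assumes the script is run from the directory that contains `dist/`, as in the earlier steps. `build_cli_command` is a hypothetical helper.

```python
import shutil
import subprocess
from typing import List, Optional


def build_cli_command(model: str) -> Optional[List[str]]:
    """Return the argv for mlc_chat_cli, or None if it is not on PATH."""
    exe = shutil.which("mlc_chat_cli")  # found via PATH of the active environment
    if exe is None:
        return None
    return [exe, "--model", model]


if __name__ == "__main__":
    cmd = build_cli_command("Llama-2-7b-chat-hf-q4f16_1")
    if cmd is None:
        print("mlc_chat_cli not found; is the mlc-chat-venv environment activated?")
    else:
        # Assumption: the current directory contains dist/ from the steps above.
        subprocess.run(cmd, check=True)
```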

Figure: MLC LLM on CLI (https://raw.githubusercontent.com/mlc-ai/web-data/main/images/mlc-llm/tutorials/Llama2-macOS.gif)

Note

The MLC Chat CLI package is only built with Vulkan (Windows/Linux) and Metal (macOS). To use other GPU backends such as CUDA and ROCm, please use the prebuilt Python package or build from source.
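The note above amounts to a small decision table. The sketch below encodes it, keyed by `platform.system()` names. `CLI_BACKENDS` and `recommended_install` are hypothetical names, and the table reflects only what the note states.

```python
# Backends the MLC Chat CLI package is built with, per OS, as stated in the note.
CLI_BACKENDS = {"Windows": {"vulkan"}, "Linux": {"vulkan"}, "Darwin": {"metal"}}


def recommended_install(system: str, backend: str) -> str:
    """Suggest an install route for an OS (platform.system() name) and GPU backend."""
    if backend.lower() in CLI_BACKENDS.get(system, set()):
        return "conda: mlc-chat-cli-nightly"
    # Other backends (e.g. CUDA, ROCm) are not in the CLI package.
    return "prebuilt Python package or build from source"


if __name__ == "__main__":
    for system, backend in [("Linux", "Vulkan"), ("Linux", "CUDA"), ("Darwin", "Metal")]:
        print(f"{system}/{backend}: {recommended_install(system, backend)}")
```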