Running a ModelKit as a Docker container
Jozu Hub automatically generates Jozu Runtime Inference Containers (RICs) for each ModelKit. RICs are a ready-to-run set of optimized AI containers designed to streamline the deployment of inference workloads, accelerating time-to-value by providing an optimized container for any AI model and runtime environment.
Jozu RICs can be run with Docker, deployed to Kubernetes, or shipped to any other container runtime.
Below we'll use jozu/qwen2-0.5b as an example repository and model to work with.
Docker
You can get the Docker command needed to run your ModelKit-packaged model either from the Deploy sub-tab in the UI, or via a modified URL you can use from the CLI or code.
Using the UI
- Select the "Docker" radio button on the left, then select llama.cpp on the right.
- Open a terminal on a computer with the Docker CLI or Docker Desktop installed, paste the command copied from the UI (it will look similar to the example below), and hit enter to run the Jozu RIC.
- Open a browser window and navigate to http://localhost:8000/. You'll see the llama.cpp UI, where you can have a conversation with your LLM.
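For reference, the command copied from the UI will look roughly like the one below (the exact tag and port may differ for your ModelKit; this matches the modified-URL example later on this page):

```bash
docker run -it --rm \
  --publish 8000:8000 \
  jozu.ml/jozu/qwen2-0.5b/llama-cpp:0.5b-instruct-q4_0
```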
Using a Modified URL
Each ModelKit in Jozu Hub has a URL with the format:
<username>/<repository>:<tag>
To get the generated Docker container, specify the container type you need by modifying the URL to:
<username>/<repository>/<container-type>:<tag>
For example, to get the llama.cpp container for this ModelKit, we add /llama-cpp before the colon and tag name:
```
// original ModelKit URL
jozu.ml/jozu/qwen2-0.5b:0.5b-instruct-q4_0

// generated container URL
jozu.ml/jozu/qwen2-0.5b/llama-cpp:0.5b-instruct-q4_0
```
Details: The compatible container types are listed in the radio button column on the right side of the Deploy tab for the ModelKit.
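Optionally, you can pre-fetch the generated container image with a standard docker pull using the same modified URL:

```bash
docker pull jozu.ml/jozu/qwen2-0.5b/llama-cpp:0.5b-instruct-q4_0
```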
Now, open a terminal on a computer with the Docker CLI or Docker Desktop installed and run the container using the modified URL, for example:
```bash
docker run -it --rm \
  --publish 8000:8000 \
  jozu.ml/jozu/qwen2-0.5b/llama-cpp:0.5b-instruct-q4_0
```
Docker will first check whether the RIC image is available locally (you may need to grant Docker permission to access local files), download it if necessary, and then start the container.
Open a browser window and navigate to http://localhost:8000/. You'll see the llama.cpp UI, where you can have a conversation with your LLM.
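Besides the browser UI, llama.cpp's server also exposes an OpenAI-compatible HTTP API. Assuming the RIC serves that API on the published port (an assumption; check your container's documentation), you can query the model from the command line as well:

```bash
# Query the OpenAI-compatible chat endpoint served by llama.cpp
# (endpoint path assumes the standard llama.cpp server API)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "user", "content": "Say hello in one sentence."}
        ]
      }'
```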
Kubernetes
Before you begin, make sure you can reach a Kubernetes cluster and that the kubectl CLI is installed on your local machine and configured to communicate with it (see the Kubernetes docs for details).
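You can quickly verify that kubectl is installed and can reach the cluster before proceeding:

```bash
# Confirm kubectl is installed and can talk to the cluster
kubectl version
kubectl cluster-info
```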
Creating the Namespace
In a terminal connected to your Kubernetes cluster, run:
```bash
kubectl create namespace jozu-ric
```
This will create a new namespace in the cluster called jozu-ric (you can use any name you want).
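You can confirm the namespace exists before moving on:

```bash
kubectl get namespace jozu-ric
```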
Creating the Deployment YAML File
Create a new file called jozu-deploy.yaml with the following contents:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qwen2-0.5b-llama-cpp
  labels:
    app: qwen2-0.5b-llama-cpp
spec:
  containers:
    - name: llama-cpp-serve
      image: jozu.ml/jozu/qwen2-0.5b/llama-cpp:latest
      ports:
        - containerPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: qwen2-0.5b-llama-cpp-svc
spec:
  selector:
    app: qwen2-0.5b-llama-cpp
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
```
In the same terminal session you used above, run:
```bash
kubectl apply -f jozu-deploy.yaml -n jozu-ric
```
You can watch the pod startup status by running:
```bash
kubectl get pods -n jozu-ric --watch
```
Once the pod is up and running, forward the port to access the model within the cluster. Note that this method doesn't expose the model externally; to make it accessible outside the cluster, you can configure a Service of a different type (such as NodePort or LoadBalancer), or use an Ingress or another proxy solution.
```bash
kubectl port-forward svc/qwen2-0.5b-llama-cpp-svc 8000:8000 -n jozu-ric
```
Accessing the Model Pod
Now you can open http://localhost:8000/ in your browser and see the UI.
You can also use alternate Kubernetes resources such as Deployments to scale the workload.
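As a rough sketch of that approach (the Deployment and Service names below are illustrative and not part of this guide's manifests), you could run the same RIC image as a Deployment with multiple replicas using kubectl directly:

```bash
# Run the RIC image as a Deployment with two replicas (names are illustrative)
kubectl create deployment qwen2-llama-cpp \
  --image=jozu.ml/jozu/qwen2-0.5b/llama-cpp:latest \
  --replicas=2 --port=8000 -n jozu-ric

# Expose the Deployment inside the cluster on port 8000
kubectl expose deployment qwen2-llama-cpp \
  --port=8000 --target-port=8000 -n jozu-ric
```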
This section outlined how to create a minimal Kubernetes model deployment that you can interact with. For a more robust deployment, you should factor in additional considerations such as scaling, GPUs, and ingress. For more information on getting production-ready workloads, contact Jozu.
Limitations
RICs use ModelKit layers as-is. This introduces some constraints:
- ModelKit contents are mounted into the container root directory
- Files are owned by root, with original permissions
- No file transformations or relocations are performed
Additionally, RICs currently work only with ModelKits that have a single model file (not model parts).
While we work around these limitations by designing the containers to use ModelKits directly, you may still run into compatibility issues on certain platforms. If you encounter issues or have any feedback, please email us at [email protected].