v2026.1 · now in private beta

Forge your proprietary model matrix.

MatForge is the fully automated model-engineering platform for your enterprise data. Run LoRA / QLoRA fine-tuning, 4-bit quantization, and high-performance cluster deployment from a single forge.

Whitelist required · access granted to vetted teams

-73%: train loss
4-bit: quant footprint
2.1k: tok/s · serve

forge://job/acme-support-70b

RUNNING

Throughput

2,104tok/s

Train loss

0.571↓

Step

1,408/ 4k

loss curveconvergence -73%

cluster · 16× h100

stdout

$ adapter · QLoRA r=64 · 4-bit nf4 · bf16 compute

$ shard 0 → gpu:0 shard 1 → gpu:1 shard 2 → gpu:2

$ step 1280 · loss 0.612 · lr 1.4e-4 · grad_norm 0.83

$ step 1344 · loss 0.589 · lr 1.3e-4 · grad_norm 0.79

$ checkpoint saved · adapter_1344.safetensors · 142MB

Architectures & runtimes supported out of the box

llama-3.3mistral-largeqwen-2.5deepseek-v3phi-4gemma-2vLLMAWQGPTQFlashAttention-3llama-3.3mistral-largeqwen-2.5deepseek-v3phi-4gemma-2vLLMAWQGPTQFlashAttention-3

// the forge suite

Raw compute into custom intelligence.

Three tightly integrated modules cover the full lifecycle — tune, compress, and serve — without leaving the forge.

smart fine-tuning bay

Forge-Tuning

Full fine-tuning, LoRA and QLoRA paradigms. Upload a JSONL dataset and the platform auto-selects optimal hyperparameters, schedules, and checkpoints.

Auto hyperparameter search
JSONL → adapter pipeline
Resumable checkpoints

forge tune --method qlora \
  --data acme.jsonl --auto-hp

matrix compressor

Matrix-Compress

Quantize a 70B model down to a 4-bit build that runs on consumer GPUs with near-zero accuracy loss, using state-of-the-art AWQ / GPTQ kernels.

AWQ & GPTQ kernels
70B → single-GPU
< 1% accuracy drop

forge compress --bits 4 \
  --algo awq --group 128

universal deploy engine

Omni-Deploy

Package any tuned model into an OpenAI-compatible API endpoint in one command, with vLLM-accelerated inference and autoscaling baked in.

OpenAI-compatible API
vLLM accelerated
Autoscaling endpoints

forge deploy --engine vllm \
  --replicas auto

// the forge pipeline

From dataset to endpoint in four stages.

Ingest

Stream your proprietary JSONL / parquet corpus into a versioned, encrypted dataset registry.

Tune

Launch a QLoRA job. MatForge shards across the cluster and auto-tunes hyperparameters.

Compress

Merge adapters and quantize to 4-bit with AWQ — verified against your eval set.

Deploy

Ship an OpenAI-compatible, vLLM-accelerated endpoint with autoscaling in one command.

// benchmarks

Engineered for throughput, measured in production.

Every job is profiled end to end. Compression preserves quality, while the serving layer squeezes maximum tokens-per-second out of each GPU in the cluster.

17.5×

smaller

70B FP16 → 4-bit AWQ build

0.4%

accuracy drop

post-quantization on eval set

2,140

tok/s

single H100 · vLLM serve

9 min

to first endpoint

dataset upload → live API

// pricing

Access is whitelist-gated.

MatForge is in private beta. Sign-in and purchase unlock only after your team is approved for the whitelist.

Forge Lab

$0/ for vetted teams

Evaluate the full pipeline on a single shared GPU.

1 concurrent tune job
Up to 8B models · QLoRA
4-bit quantization
Community support

Most chosen

Foundry

$4,800/ month

Dedicated cluster for production model engineering.

Up to 16× H100 cluster
Models up to 70B
AWQ / GPTQ + auto-HP search
vLLM autoscaling endpoints
Priority forge queue

Sovereign

Custom/ private cloud

Air-gapped, on-prem deployment in your own VPC.

Bring your own silicon
SOC 2 · ISO 27001
Dedicated solutions architect
Custom kernels & SLAs

// faq

Questions from the field.

// cold-core forging

Join the whitelist and start forging.

Reserve a slot in the next onboarding cohort. Approved teams get cluster access, the full Forge Suite, and a dedicated architect.