v2026.1 · now in private beta

Forge your proprietary model matrix.

MatForge is the fully automated model-engineering platform for your enterprise data. Run LoRA / QLoRA fine-tuning, 4-bit quantization, and high-performance cluster deployment from a single forge.

Whitelist required · access granted to vetted teams

-73%
train loss
4-bit
quant footprint
2.1k
tok/s · serve
forge://job/acme-support-70b
RUNNING
Throughput
2,104tok/s
Train loss
0.571
Step
1,408/ 4k
loss curveconvergence -73%
cluster · 16× h100
stdout

$ adapter · QLoRA r=64 · 4-bit nf4 · bf16 compute

$ shard 0 → gpu:0 shard 1 → gpu:1 shard 2 → gpu:2

$ step 1280 · loss 0.612 · lr 1.4e-4 · grad_norm 0.83

$ step 1344 · loss 0.589 · lr 1.3e-4 · grad_norm 0.79

$ checkpoint saved · adapter_1344.safetensors · 142MB

Architectures & runtimes supported out of the box

llama-3.3mistral-largeqwen-2.5deepseek-v3phi-4gemma-2vLLMAWQGPTQFlashAttention-3llama-3.3mistral-largeqwen-2.5deepseek-v3phi-4gemma-2vLLMAWQGPTQFlashAttention-3
// the forge suite

Raw compute into custom intelligence.

Three tightly integrated modules cover the full lifecycle — tune, compress, and serve — without leaving the forge.

smart fine-tuning bay

Forge-Tuning

Full fine-tuning, LoRA and QLoRA paradigms. Upload a JSONL dataset and the platform auto-selects optimal hyperparameters, schedules, and checkpoints.

  • Auto hyperparameter search
  • JSONL → adapter pipeline
  • Resumable checkpoints
forge tune --method qlora \
  --data acme.jsonl --auto-hp

matrix compressor

Matrix-Compress

Quantize a 70B model down to a 4-bit build that runs on consumer GPUs with near-zero accuracy loss, using state-of-the-art AWQ / GPTQ kernels.

  • AWQ & GPTQ kernels
  • 70B → single-GPU
  • < 1% accuracy drop
forge compress --bits 4 \
  --algo awq --group 128

universal deploy engine

Omni-Deploy

Package any tuned model into an OpenAI-compatible API endpoint in one command, with vLLM-accelerated inference and autoscaling baked in.

  • OpenAI-compatible API
  • vLLM accelerated
  • Autoscaling endpoints
forge deploy --engine vllm \
  --replicas auto
// the forge pipeline

From dataset to endpoint in four stages.

01

Ingest

Stream your proprietary JSONL / parquet corpus into a versioned, encrypted dataset registry.

02

Tune

Launch a QLoRA job. MatForge shards across the cluster and auto-tunes hyperparameters.

03

Compress

Merge adapters and quantize to 4-bit with AWQ — verified against your eval set.

04

Deploy

Ship an OpenAI-compatible, vLLM-accelerated endpoint with autoscaling in one command.

// benchmarks

Engineered for throughput, measured in production.

Every job is profiled end to end. Compression preserves quality, while the serving layer squeezes maximum tokens-per-second out of each GPU in the cluster.

17.5×
smaller
70B FP16 → 4-bit AWQ build
0.4%
accuracy drop
post-quantization on eval set
2,140
tok/s
single H100 · vLLM serve
9 min
to first endpoint
dataset upload → live API
// pricing

Access is whitelist-gated.

MatForge is in private beta. Sign-in and purchase unlock only after your team is approved for the whitelist.

Forge Lab

$0/ for vetted teams

Evaluate the full pipeline on a single shared GPU.

  • 1 concurrent tune job
  • Up to 8B models · QLoRA
  • 4-bit quantization
  • Community support
Most chosen

Foundry

$4,800/ month

Dedicated cluster for production model engineering.

  • Up to 16× H100 cluster
  • Models up to 70B
  • AWQ / GPTQ + auto-HP search
  • vLLM autoscaling endpoints
  • Priority forge queue

Sovereign

Custom/ private cloud

Air-gapped, on-prem deployment in your own VPC.

  • Bring your own silicon
  • SOC 2 · ISO 27001
  • Dedicated solutions architect
  • Custom kernels & SLAs
// faq

Questions from the field.

// cold-core forging

Join the whitelist and start forging.

Reserve a slot in the next onboarding cohort. Approved teams get cluster access, the full Forge Suite, and a dedicated architect.