There is a huge misconception that building your own AI requires a massive server farm, expensive NVIDIA graphics cards, or deep pockets to pay cloud providers like Amazon Web Services. I used to believe that too.
I needed a custom AI model that was:
- 100% open-source and MIT-licensed, so I could use it in my own commercial projects without any legal headaches.
- Incredibly lightweight, so it could run on standard, cheap hardware, like a laptop with just 2GB of VRAM or a basic web server with 512MB of RAM.
It sounded impossible, but I actually managed to do it in an afternoon. I built my own AI, trained it to chat exactly how I wanted, and shrunk the final “brain” down to a tiny 89MB file that runs completely offline.
If you know a little bit of Python, you can do this too. Here is the exact, step-by-step process I used.
Phase 1: Creating the Training Data
You can’t train an AI without data. If you want your model to chat casually, write code, or parse logs, you have to show it exactly what that looks like.
Instead of typing out thousands of examples by hand, I used a larger, smarter AI (I used a free model called Trinity Large via OpenRouter) to act as a “Teacher.” I wrote a Python script that asked the Teacher to generate 2,000 different examples of the behavior I wanted.
Tip for speed: To make the script fast, I used threading so it would generate 20 examples at the exact same time.
Here is the script I wrote to generate the data:
import os
import json
import requests
import concurrent.futures
import threading

OPENROUTER_API_KEY = "YOUR_API_KEY"
MODEL_ID = "arcee-ai/trinity-large-preview:free"

counter_lock = threading.Lock()
file_lock = threading.Lock()
success_count = 0
total_samples = 2000

def generate_sample(category):
    headers = {
        "Authorization": f"Bearer {OPENROUTER_API_KEY}",
        "Content-Type": "application/json"
    }
    # Custom prompts based on what we want the AI to learn.
    # Each prompt asks for a JSON object with "user" and "assistant" keys,
    # because that is exactly what we parse out of the response below.
    if category == "saas":
        system_prompt = ('Generate a realistic SaaS operation interaction (e.g., log parsing, JSON output). '
                         'Return a JSON object with "user" and "assistant" keys.')
    elif category == "coding":
        system_prompt = ('Generate a realistic programming or tech-support interaction. '
                         'Return a JSON object with "user" and "assistant" keys.')
    else:
        system_prompt = ('Generate a realistic, short casual conversation. '
                         'Return a JSON object with "user" and "assistant" keys.')
    data = {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Generate one example interaction."}
        ],
        "response_format": {"type": "json_object"}
    }
    try:
        response = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers=headers, json=data, timeout=120
        )
        response.raise_for_status()
        content = json.loads(response.json()['choices'][0]['message']['content'])
        # We format this using "ChatML" tags.
        # This teaches the AI exactly when a user stops talking and the AI should start.
        chatml = "<|im_start|>system\nYou are an efficient, lightweight AI assistant.<|im_end|>\n"
        chatml += f"<|im_start|>user\n{content.get('user', '')}<|im_end|>\n"
        chatml += f"<|im_start|>assistant\n{content.get('assistant', '')}<|im_end|>"
        return {"text": chatml}
    except Exception:
        return None

def worker(category, output_file):
    global success_count
    sample = generate_sample(category)
    if sample:
        with file_lock:
            with open(output_file, "a", encoding="utf-8") as f:
                f.write(json.dumps(sample) + "\n")
        with counter_lock:
            success_count += 1
            if success_count % 10 == 0:
                print(f"Progress: {success_count}/{total_samples} generated...")

def main():
    print("Beginning data generation...")
    categories = ["saas", "general", "coding", "creative"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
        futures = []
        for i in range(total_samples):
            # Rotate through the categories so we get a balanced dataset
            futures.append(executor.submit(worker, categories[i % len(categories)], "training_data.jsonl"))
        for future in concurrent.futures.as_completed(futures):
            pass

if __name__ == "__main__":
    main()
If you run this, you will end up with a file called training_data.jsonl packed with perfectly formatted examples.
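Before moving on, it is worth eyeballing a few lines of that file. Here is a quick, optional sanity check (pure standard library; nothing in the pipeline depends on it) that counts the examples and prints one so you can confirm the ChatML tags look right:

import json

# Quick look at the generated dataset
with open("training_data.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(samples)} training examples")
print(samples[0]["text"])  # should show the system/user/assistant ChatML turns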
Phase 2: Teaching Your Model (Fine-Tuning)
For the foundation of my AI, I chose OpenAI’s GPT-2 (124M). It’s an older model, but it is completely open-source and extremely small. By itself, GPT-2 isn’t very smart. But when we expose it to our new high-quality dataset, it learns our specific patterns.
I didn’t have a giant graphics card to train this. I actually trained it on a normal PC using only the CPU.
To prevent memory crashes, we use a technique called LoRA (Low-Rank Adaptation). LoRA freezes the original model’s weights entirely and trains only a tiny set of added adapter parameters, well under 1% of the total. That is what lets everyday computers train AI.
Here is the training script. You will need to pip install torch transformers datasets peft to run it.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, DataCollatorForLanguageModeling
from peft import LoraConfig, get_peft_model

def main():
    model_id = "openai-community/gpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    # Load model on CPU.
    # If you have an NVIDIA GPU, change this to device_map="auto" and torch_dtype=torch.float16
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu", torch_dtype=torch.float32)

    # Setup LoRA to dramatically lower RAM usage during training
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        target_modules=["c_attn"],
        task_type="CAUSAL_LM"
    )
    model = get_peft_model(model, lora_config)
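    # Optional: see how little LoRA actually trains. peft's PeftModel exposes
    # print_trainable_parameters(); with r=8 on c_attn it reports well under
    # 1% of GPT-2's 124M parameters as trainable.
    model.print_trainable_parameters()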
    dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

    def tokenize_func(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=256)

    tokenized_dataset = dataset.map(tokenize_func, batched=True, remove_columns=["text"])

    # We use a batch size of 1 so it doesn't crash low-end hardware
    training_args = TrainingArguments(
        output_dir="./custom-model",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        max_steps=200,  # CPU training is slow, so 200 steps is a good start
        fp16=False,
        optim="adamw_torch",
        report_to="none"
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )

    print("Training the brain...")
    trainer.train()

    # Save the custom brain adapter!
    model.save_pretrained("./my-lora-adapter")
    tokenizer.save_pretrained("./my-lora-adapter")

if __name__ == "__main__":
    main()
Let this run. Depending on your computer, it might take a little while. Once it finishes, you’ll have a folder called my-lora-adapter containing everything your AI just learned.
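Before compressing anything, you can optionally load the adapter back on top of GPT-2 and talk to it, just to confirm the fine-tune took. A minimal sketch (slow on CPU, and the test question is just an example):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained("./my-lora-adapter")
base = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", device_map="cpu")
model = PeftModel.from_pretrained(base, "./my-lora-adapter")

# Build a prompt in the exact ChatML format used in Phase 1
prompt = ("<|im_start|>system\nYou are an efficient, lightweight AI assistant.<|im_end|>\n"
          "<|im_start|>user\nWhat can you do?<|im_end|>\n<|im_start|>assistant\n")

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))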
Phase 3: Shrinking the Model (Quantization)
Our trained model works, but it takes up around 250MB. I wanted this to run on basic web servers that only have 512MB of total RAM. To make it fit, we need to “Quantize” it.
Quantization is the process of compressing the weights inside the neural network from precise 16-bit floating-point numbers down to chunky 4-bit integers. It drastically shrinks the file size, while the model barely loses any capability.
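The arithmetic behind those numbers is worth a quick sanity check. Assuming GPT-2’s roughly 124 million parameters and the commonly quoted average of about 4.85 bits per weight for Q4_K_M (both of these are my approximations, not exact figures):

# Rough, back-of-envelope sizes for a 124M-parameter model (estimates only)
params = 124_000_000

fp16_mb = params * 16 / 8 / 1e6    # ~248 MB at 16 bits per weight
q4_mb = params * 4.85 / 8 / 1e6    # ~75 MB at Q4_K_M's typical ~4.85 bits per weight

print(f"16-bit: ~{fp16_mb:.0f} MB, 4-bit: ~{q4_mb:.0f} MB")
# The real GGUF lands a bit higher (~89 MB) because some tensors, such as the
# token embeddings, are kept at higher precision, plus file metadata.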
1. Merge the Model
First, we need to permanently fuse our newly trained LoRA adapter from Phase 2 into the original GPT-2 model:
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the blank base model
base_model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2", device_map="cpu")

# Load our custom brain
model = PeftModel.from_pretrained(base_model, "./my-lora-adapter")

# Merge them permanently into a single model
model = model.merge_and_unload()
model.save_pretrained("./fully-merged-model")

# The GGUF conversion script below also needs the tokenizer files in this folder
tokenizer = AutoTokenizer.from_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./fully-merged-model")
2. Compress the File
Next, to actually compress it, you need to use an incredible open-source C++ project called llama.cpp.
When you download llama.cpp, you get a Python script (convert_hf_to_gguf.py) to convert your model to the GGUF format, and once you build the project (or grab a pre-built release) you also get the llama-quantize executable to shrink it.
Run these two commands in your terminal:
# 1. Convert our merged model to the standard GGUF format
python convert_hf_to_gguf.py ./fully-merged-model --outfile my-ai-f16.gguf
# 2. Compress the file down to pure 4-bit using the Q4_K_M algorithm
./llama-quantize my-ai-f16.gguf final-model.gguf Q4_K_M
The result? The model shrinks from 250MB down to a staggering 89 Megabytes.
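If you want to sanity-check the quantized file before wiring anything else up, the llama-cpp-python bindings can load the GGUF directly. This is optional, assumes a separate pip install llama-cpp-python, and is not part of the server setup below:

from llama_cpp import Llama

# Load the quantized model; n_ctx mirrors the small context we use on the server
llm = Llama(model_path="final-model.gguf", n_ctx=256, verbose=False)

prompt = ("<|im_start|>system\nYou are an efficient, lightweight AI assistant.<|im_end|>\n"
          "<|im_start|>user\nSay hello in five words.<|im_end|>\n<|im_start|>assistant\n")

out = llm(prompt, max_tokens=64, stop=["<|im_end|>"])
print(out["choices"][0]["text"].strip())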
Phase 4: Using Your AI (Connecting the API)
You now have final-model.gguf. This tiny file contains your custom, offline AI.
To actually chat with it, you use a program that comes bundled with llama.cpp called llama-server. You just run it in your terminal like this:
./llama-server -m final-model.gguf -c 256 -np 2 --host 127.0.0.1 --port 8080 -cb -mmap
Important: The -c 256 flag caps the context window, which keeps the KV-cache memory tiny so the server never blows past your RAM, and -np 2 lets it serve up to two requests in parallel. (If you need to limit CPU usage, the -t flag controls the number of threads.)
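Once it starts, you can confirm the server is actually listening before wiring anything to it. Newer llama.cpp builds expose a simple GET /health endpoint (if yours does not have it, a request to the /completion endpoint used below works just as well); a quick check from Python:

import requests

# Ask the local llama-server whether it has finished loading the model.
# (Assumes a llama.cpp build recent enough to ship the /health endpoint.)
r = requests.get("http://127.0.0.1:8080/health", timeout=5)
print(r.status_code, r.text)  # expect 200 and a small JSON status payload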
Now the AI is running locally on your machine on port 8080. If you want to connect a website or an app to it, you just send an HTTP request to that port.
For example, here is a very simple PHP script that you could put on your web server. It takes a message from a user, formats it exactly how we trained the AI in Phase 1, and returns the response:
<?php
header('Content-Type: application/json');

$input = json_decode(file_get_contents('php://input'), true);
$user_msg = $input['message'] ?? 'Hello';

// Format the prompt EXACTLY like our training data from Phase 1
$prompt = "<|im_start|>system\nYou are an efficient, lightweight AI assistant.<|im_end|>\n";
$prompt .= "<|im_start|>user\n" . $user_msg . "<|im_end|>\n<|im_start|>assistant\n";

$payload = json_encode([
    'prompt' => $prompt,
    'n_predict' => 128,
    'stop' => ["<|im_end|>"]
]);

// Send the prompt to our offline AI server
$ch = curl_init('http://127.0.0.1:8080/completion');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_POSTFIELDS, $payload);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// Output the AI's reply
echo json_encode(['reply' => trim($response['content'] ?? '')]);
?>
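And if your backend is Python rather than PHP, the same call is just as short. This sketch talks to the same /completion endpoint with the same fields as the PHP example above:

import requests

def ask_local_ai(user_msg: str) -> str:
    # Same ChatML framing the model was trained on in Phase 1
    prompt = ("<|im_start|>system\nYou are an efficient, lightweight AI assistant.<|im_end|>\n"
              f"<|im_start|>user\n{user_msg}<|im_end|>\n<|im_start|>assistant\n")
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={"prompt": prompt, "n_predict": 128, "stop": ["<|im_end|>"]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["content"].strip()

print(ask_local_ai("Summarize what you can do in one sentence."))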
Final Thoughts
And just like that, you’ve built a fully functional, offline AI. When you chat with it, responses come back from the local server almost instantly, powered entirely by an 89MB file that behaves exactly the way you trained it to.
If you’ve ever felt intimidated by AI development, just know it doesn’t take supercomputers anymore. It just takes an afternoon, some basic Python skills, and a little bit of patience.
Have questions about building your AI model?
Comment below and I’ll help you out!