GEMINI LAB
Dev Tools / 2026-06-24 / Intermediate

Folding a Local Gemma 4 into Daily Work — Practical Notes on the Ollama API and Response Speed

Taking a local Gemma 4 that you can already run interactively and folding it into real work: how to call Ollama's local API from a script, ways to improve perceived response speed, and a two-tier fallback that automatically routes to the cloud Gemini API, with code.


The previous article covered launching Gemma 4 locally with Ollama and running it interactively. This follow-up moves on to folding that local model into real work. Typing into the conversation window by hand gets you only half the benefit; repetitive work can be handed off only once the model is callable from a script.

As an indie developer, I often need to try many store-listing phrasings and blog drafts, and I offload those trials to a local Gemma 4. Here I lay out how to call the API, how to squeeze out more speed, and how to fall back safely to the cloud, implementation included.

Ollama stands up a local REST API

It is less well known than it should be, but when Ollama launches it quietly starts a local HTTP server in the background. By default it listens on localhost:11434, and posting there lets you call the model without opening the conversation window. The clearest starting point is the generate endpoint.

curl http://localhost:11434/api/generate -d '{
  "model": "gemma4:e2b",
  "prompt": "Give me three polite review replies",
  "stream": false
}'

With stream set to false, you receive the whole result in a single response once generation finishes. For scripts, this one-call shape is the easiest to handle. To display long text gradually instead, set stream to true and process the fragments as they arrive.
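For the stream-true case, a minimal sketch follows. It assumes the streamed body is newline-delimited JSON in which each object carries a "response" fragment and the last one sets "done" to true. The helper names iter_fragments and stream_local are illustrative, not from the article.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_fragments(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines between objects
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break  # the final object marks the end of generation


def stream_local(prompt: str, model: str = "gemma4:e2b") -> Iterator[str]:
    """Post with stream=True and yield fragments as they arrive."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as res:
        # The response object can be iterated line by line.
        yield from iter_fragments(res)

# Usage (requires a running Ollama server):
#   for piece in stream_local("Give me three polite review replies"):
#       print(piece, end="", flush=True)
```

Keeping the parsing in its own function means it can be exercised with canned lines, without a running server.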

Call it from Python and fold it into repeated work

You can call the API from the command line directly, but for repeated use it is easier to wrap it in a script. The wrapper needs only the standard library, which suits indie development, where you would rather not add dependencies.

import json
import urllib.request

def ask_local(prompt: str, model: str = "gemma4:e2b") -> str:
    """Send one prompt to the local Ollama server and return the generated text."""
    # stream=False makes Ollama reply with a single JSON object.
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"},
    )
    # The timeout keeps the script from hanging if the model stalls.
    with urllib.request.urlopen(req, timeout=120) as res:
        return json.loads(res.read())["response"]

if __name__ == "__main__":
    print(ask_local("Summarize this note into three bullet points: ..."))

With a wrapper this thin, you can call a canned instruction as a function as many times as you like. I run store-listing phrasings and note summaries through it, and a whole layer of manual retyping disappeared. Always setting a timeout is a small detail that matters: it keeps the whole process from freezing when the model stalls.
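When the timeout fires, urlopen raises an exception rather than returning, so a batch job needs somewhere to go next. The sketch below is one possible shape, not the article's implementation: it catches the failure and hands the prompt to a caller-supplied function. The name ask_with_fallback, the timeout parameter on ask_local, and the fallback callable (a stand-in for whatever cloud call you use) are all assumptions for illustration.

```python
import json
import urllib.request
from typing import Callable, Optional


def ask_local(prompt: str, model: str = "gemma4:e2b", timeout: float = 120.0) -> str:
    """Same wrapper as above, with the timeout exposed as a parameter."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as res:
        return json.loads(res.read())["response"]


def ask_with_fallback(
    prompt: str,
    local: Callable[[str], str] = ask_local,
    fallback: Optional[Callable[[str], str]] = None,
) -> str:
    """Try the local model first; on failure, use the fallback if one is given."""
    try:
        return local(prompt)
    except (OSError, ValueError, KeyError):
        # OSError covers timeouts, refused connections, and urllib's URLError;
        # ValueError covers malformed JSON; KeyError covers a missing "response".
        if fallback is None:
            raise
        return fallback(prompt)  # hypothetical second tier, e.g. a cloud call
```

Passing both tiers in as plain callables keeps the switching logic independent of any particular cloud SDK.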

WHAT YOU'LL LEARN
Minimal code to hit Ollama's local REST API from a script and receive results
Concrete steps to cut first-call latency with keep_alive and a resident model, improving perceived speed
A two-tier fallback that automatically switches to the cloud Gemini API when the local run fails
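As a rough illustration of the keep_alive point in the list above, the sketch below adds a keep_alive field to the generate request and sends an empty prompt once at startup, so the model is loaded before the first real call. It relies on keep_alive accepting a duration string such as "30m" and on an empty prompt loading the model without generating text. The helper names build_payload and preload and the 30-minute value are assumptions, not the article's code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "gemma4:e2b", keep_alive: str = "30m") -> bytes:
    """Build a generate request that asks Ollama to keep the model resident."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,  # how long the model stays loaded after this call
    }).encode()


def preload(model: str = "gemma4:e2b", keep_alive: str = "30m") -> None:
    """Send an empty prompt so the model is loaded before real work starts."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload("", model, keep_alive),
        headers={"Content-Type": "application/json"},
    )
    # Loading can take a while on first use, so allow a generous timeout.
    with urllib.request.urlopen(req, timeout=300) as res:
        res.read()

# Usage (requires a running Ollama server): call preload() once at script start,
# then send real prompts with the same keep_alive value.
```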
Related Articles

Dev Tools / 2026-06-24
Running Gemma 4 Locally on Windows — A Hands-On LLM in Two Commands with Ollama
How to run Google's lightweight open model Gemma 4 locally on a Windows laptop. With Ollama, you go from install to running in effectively two commands. Plus how to split work between the cloud Gemini API and a local Gemma.
Dev Tools / 2026-05-06
Running Gemma 4 Locally in Android Studio via Ollama — Setup, Performance, and Real-World Development Experience
A hands-on guide to connecting Android Studio's local LLM feature with Gemma 4 via Ollama. Covers macOS setup, model selection, practical coding experience, and when local AI makes more sense than cloud APIs.
Dev Tools / 2026-06-21
Finding Every Reference to the Image Preview Models Before They Stop on June 25
gemini-3.1-flash-image-preview and gemini-3-pro-image-preview stop on June 25. Here is a dependency audit for surfacing references buried in rarely-run branches and batches before the cutoff.