Reliable Text-in-Image with Gemini 3.1 Flash Image — an OCR-Verified Pipeline
After the preview shutdown, the GA gemini-3.1-flash-image still occasionally garbles text baked into images. Here is a generate -> read-back-verify -> regenerate/composite pipeline, with working code and an unattended retry budget.
When you automate image generation as an indie developer — banners for a wallpaper app, OGP thumbnails for a blog — you eventually hit a quietly maddening wall. The composition looks great, but the Japanese line you wanted baked into the image, something like "Free this week," comes out subtly wrong. One character swapped for a similar one, a diacritic dropped, the second line illegible. You don't notice at a glance; you notice when you zoom in.
On June 25, 2026 the preview models gemini-3.1-flash-image-preview and gemini-3-pro-image-preview were shut down, and the GA versions gemini-3.1-flash-image (Flash Image) and gemini-3-pro-image (Pro Image) took their place. Text rendering keeps getting better with each generation. Even so, in an unattended setup — spin it up, publish whatever comes out — the occasional garble is guaranteed to become an incident. I spent a while eyeballing that one bad frame by hand, which defeated half the point of automating.
What I landed on is a two-stage approach: don't trust the output, read the image back with a model to confirm the characters, and if it fails, either regenerate or switch to compositing. This article records that verification-gated pipeline, thresholds and retry design included.
After the shutdown: which model should draw the text
First, pick the model that draws. The two GA options trade text accuracy against speed and cost differently. Here is my rough feel for my use case (a few short Japanese characters on wallpaper-app announcement banners and blog OGP images).
| Aspect | gemini-3.1-flash-image (Flash Image) | gemini-3-pro-image (Pro Image) |
| --- | --- | --- |
| Accuracy of short Japanese text | Production-usable; stable up to a few lines | Higher; holds up on multi-line, smaller text |
| Speed per image | Fast (seconds) | Somewhat slower |
| Approx. cost per image | Low | A few times Flash |
| Video-to-image | Supported (gemini-3.1-flash-image only) | Not supported |
| Workhorse for unattended bulk runs | Make this the default | Reserve for regeneration on failure |
Personally, defaulting to Flash Image and only promoting an image to Pro Image after it fails verification twice in a row gave me the best cost-effectiveness. Running every image through Pro Image from the start reduces garbling but inflates cost to several times Flash, which undercuts the point of unattended runs. I think the realistic way to win on text accuracy is "promote only the suspicious ones," not "use the premium model for everything."
How to prompt the model to actually draw the text
The first lever for text-in-image is how you prompt. Three rules I keep:
1. Pass the target string as text to copy verbatim. Don't bury it inside a description; hand over the exact characters as an unchangeable quote. A vague "a title that feels like ~" leaves room for paraphrasing and mis-conversion.
2. Pin character count, line count, and placement with numbers. Instead of "large in the center," constrain layout numerically: "one line at 25% from the top, max 8 characters." Long or multi-line text raises the failure rate on its own, so keep the text payload short.
3. Forbid everything else. State explicitly that no other characters, alphanumerics, logos, or signatures may be drawn.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_GEMINI_API_KEY")

    MODEL_FAST = "gemini-3.1-flash-image"
    MODEL_STRONG = "gemini-3-pro-image"


    def build_prompt(headline: str) -> str:
        # Pass the desired characters as a verbatim, do-not-alter constraint
        return (
            "Generate a simple vertical (9:16) announcement banner image.\n"
            "Calm indigo background with a soft washi-paper texture.\n"
            f"At 25% from the top, draw EXACTLY these characters, one horizontal line, large: '{headline}'\n"
            "- Max 10 characters. Avoid decorative fonts; high-legibility gothic.\n"
            "- Draw NO other characters, no alphanumerics, no logo, no signature.\n"
            "- Do not omit diacritics or small kana."
        )


    def generate_image(model: str, headline: str) -> bytes:
        resp = client.models.generate_content(
            model=model,
            contents=build_prompt(headline),
            config=types.GenerateContentConfig(response_modalities=["Image"]),
        )
        # The response can mix text and image parts; return the first image part
        for part in resp.candidates[0].content.parts:
            if part.inline_data and (part.inline_data.mime_type or "").startswith("image/"):
                return part.inline_data.data  # raw image bytes (treated as PNG downstream)
        raise RuntimeError("No image part returned")
The negative instruction "draw no other characters" helps suppress stray English signatures and random logos that creep in. With Japanese in particular, unrequested alphabet tends to sneak in as decoration, so I include this line almost every time.
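The length and line limits written into the prompt can also be enforced before any API call is made, so an over-long headline never costs a generation. The following is a minimal sketch under my own assumptions: `validate_headline` and its `max_chars` default are hypothetical and not part of the pipeline above; the limit of 10 simply mirrors the "Max 10 characters" line in the prompt.

```python
def validate_headline(headline: str, max_chars: int = 10) -> str:
    """Reject headlines that the prompt's own constraints would not allow.

    Hypothetical pre-flight helper. max_chars mirrors the prompt's limit and
    should be changed together with it. Length is counted in code points.
    """
    text = headline.strip()
    if not text:
        raise ValueError("headline is empty")
    if "\n" in text or "\r" in text:
        raise ValueError("headline must be a single line")
    if "'" in text:
        # The prompt wraps the headline in single quotes.
        raise ValueError("headline must not contain a single quote")
    if len(text) > max_chars:
        raise ValueError(f"headline has {len(text)} characters, limit is {max_chars}")
    return text
```

Calling this before `build_prompt()` turns "the text was too long" from a silent quality problem into an immediate error.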
Don't trust the output: insert an OCR verification gate
This is the crux. Take the generated image, ask a model "what does this say," and compare the read-back against the intended string. You could add a dedicated OCR library, but I read the image back with gemini-3.5-flash, which is strong on multimodal input. It handles decorative Japanese gracefully and adds no dependency.
Strip whitespace and punctuation from both the read-back string and the original, then compute a character-level similarity ratio. Requiring an exact match fails too often over a single punctuation wobble, so I gate on a ratio.
    import re
    from difflib import SequenceMatcher


    def read_back_text(image_bytes: bytes) -> str:
        resp = client.models.generate_content(
            model="gemini-3.5-flash",
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
                "Output ONLY the Japanese characters drawn in this image, exactly as they appear, "
                "with no description of style or layout. Return an empty string if there is no text.",
            ],
        )
        # resp.text can be None when the model returns no text part
        return (resp.text or "").strip()


    def normalize(s: str) -> str:
        # Drop whitespace and punctuation to stabilize the comparison
        return re.sub(r"[\s、。・!?,.\-]", "", s)


    def char_match_ratio(expected: str, actual: str) -> float:
        e, a = normalize(expected), normalize(actual)
        if not e:
            # Nothing to verify, so treat an empty expectation as a pass
            return 1.0
        return SequenceMatcher(None, e, a).ratio()


    # Threshold: I use 0.9 as the pass line
    PASS_RATIO = 0.9
What I saw in practice: a 0.9 cutoff happens to reject almost exactly the garbles that, on zooming in, a human would call unpublishable. Loosening to 0.8 let single-character errors slip through; tightening to 0.95 triggered frequent regenerations over harmless punctuation wobble. This shifts a bit with the length of the text you handle, so I recommend calibrating once on your own data.
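Why the right threshold moves with text length follows from how `SequenceMatcher.ratio()` is defined: it returns 2M/T, where M is the number of matched characters and T is the combined length of both strings. A single wrong character in an n-character headline therefore scores (n-1)/n. The sketch below tabulates that; `ratio_with_one_substitution` is a hypothetical helper written for this illustration, and the test strings are synthetic.

```python
from difflib import SequenceMatcher


def ratio_with_one_substitution(n: int) -> float:
    """Ratio between an n-character string and a copy with one character swapped."""
    # n distinct hiragana code points, so no accidental extra matches
    expected = "".join(chr(ord("あ") + i) for i in range(n))
    middle = n // 2
    actual = expected[:middle] + "#" + expected[middle + 1:]
    return SequenceMatcher(None, expected, actual).ratio()


for n in (4, 6, 8, 10, 12):
    r = ratio_with_one_substitution(n)
    verdict = "passes" if r >= 0.9 else "rejected"
    print(f"{n:2d} chars, one wrong: ratio={r:.3f} -> {verdict} at 0.9")
```

Under this formula a 0.9 gate rejects a single wrong character only while the headline is shorter than 10 characters; at 10 characters the score lands on 0.9 itself. That is one concrete reason to recalibrate if your text gets longer.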
Two-stage fallback: regenerate, then composite
Design what happens on failure. Regenerating blindly makes cost unpredictable, so I fixed the stages:
1. Generate with Flash Image and run the OCR gate.
2. On failure, regenerate once more with the same Flash Image (fixed prompt, betting on RNG variance).
3. Still failing, promote to Pro Image and regenerate.
4. If Pro Image also fails, generate a text-free background and composite the text from code with Pillow.
That last composite fallback is the safety net. The model handles the background while the critical characters are burned in with my own font, so the model no longer has a chance to garble them (as long as the font file contains every glyph in the headline). For text where accuracy is everything, like an announcement line, I think it's safest to ultimately draw it on the code side.
    from io import BytesIO

    from PIL import Image, ImageDraw, ImageFont


    def composite_text(bg_bytes: bytes, headline: str, font_path: str) -> bytes:
        img = Image.open(BytesIO(bg_bytes)).convert("RGB")
        draw = ImageDraw.Draw(img)
        # Font size scales with image width so the headline stays proportionate
        font = ImageFont.truetype(font_path, size=int(img.width * 0.09))
        bbox = draw.textbbox((0, 0), headline, font=font)
        tw = bbox[2] - bbox[0]
        x = (img.width - tw) // 2          # horizontally centered
        y = int(img.height * 0.25)         # 25% from the top, matching the prompt
        draw.text((x + 2, y + 2), headline, font=font, fill=(0, 0, 0))  # subtle shadow
        draw.text((x, y), headline, font=font, fill=(255, 255, 255))
        out = BytesIO()
        img.save(out, format="PNG")
        return out.getvalue()


    def produce_banner(headline: str, font_path: str) -> bytes:
        attempts = [MODEL_FAST, MODEL_FAST, MODEL_STRONG]
        for model in attempts:
            img = generate_image(model, headline)
            actual = read_back_text(img)
            ratio = char_match_ratio(headline, actual)
            print(f"{model}: read_back='{actual}' ratio={ratio:.2f}")
            if ratio >= PASS_RATIO:
                return img

        # Last resort: text-free background + code composite
        bg_prompt = "Vertical (9:16) indigo washi-texture background. Draw no text, logo, or signature."
        resp = client.models.generate_content(
            model=MODEL_FAST,
            contents=bg_prompt,
            config=types.GenerateContentConfig(response_modalities=["Image"]),
        )
        # Pick the first image part instead of assuming parts[0] is the image
        for part in resp.candidates[0].content.parts:
            if part.inline_data and part.inline_data.data:
                return composite_text(part.inline_data.data, headline, font_path)
        raise RuntimeError("No background image returned")
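Because every stage of the ladder calls a paid API, it helps to be able to exercise the control flow without one. Here is a minimal sketch of the same staged fallback with the generator, the reader, and the composite step passed in as functions. `run_ladder` and the fake callables are hypothetical names of my own, and for brevity the comparison skips `normalize()`.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence, Tuple


def run_ladder(
    headline: str,
    stages: Sequence[str],
    generate: Callable[[str, str], bytes],
    read_back: Callable[[bytes], str],
    fallback: Callable[[str], bytes],
    pass_ratio: float = 0.9,
) -> Tuple[bytes, str]:
    """Try each stage in order; return (image, label of the stage that produced it)."""
    for index, model in enumerate(stages):
        image = generate(model, headline)
        ratio = SequenceMatcher(None, headline, read_back(image)).ratio()
        if ratio >= pass_ratio:
            return image, f"{index}:{model}"
    # Every model stage failed the gate: draw the text from code instead
    return fallback(headline), "composite"


# Fakes standing in for the API: only the "strong" model draws the text correctly.
def fake_generate(model: str, headline: str) -> bytes:
    drawn = headline if model == "strong" else headline[:-1] + "#"
    return drawn.encode("utf-8")


def fake_read_back(image: bytes) -> str:
    return image.decode("utf-8")


image, stage = run_ladder(
    "今週だけ無料",
    ["fast", "fast", "strong"],
    fake_generate,
    fake_read_back,
    fallback=lambda h: b"composited",
)
print(stage)
```

Returning the stage label alongside the image also gives you the raw material for tracking how often promotion happens.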
Sizing cost and retry limits for unattended runs
The two-stage design is reassuring, but retries map directly to cost. The first thing I fix when wiring an unattended pipeline is the per-image ceiling.
Roughly, if one Flash Image generation is a baseline cost of 1, the worst case (Flash x2 + Pro x1 + read-back x3 + one background generation) puts a single image at several times baseline. The read-backs produce only text output from gemini-3.5-flash and are orders of magnitude cheaper than image generation, so image generation still dominates the cost. Pro Image costs several times as much as Flash, so the number of promotions is what moves your monthly bill.
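The worst-case arithmetic can be written down once so that only the price ratios need updating. This is a sketch in relative units, not real prices: `worst_case_units`, `pro_multiple`, and `read_back_unit` are names I made up, and the values passed in are placeholders to be replaced with figures from the current price list.

```python
def worst_case_units(pro_multiple: float, read_back_unit: float) -> float:
    """Worst-case cost of one banner, in units of one Flash Image generation.

    pro_multiple   : assumed cost of one Pro Image generation relative to Flash
    read_back_unit : assumed cost of one read-back call relative to Flash
    """
    flash_attempts = 2   # first try + one retry
    pro_attempts = 1     # promotion after two Flash failures
    background = 1       # text-free background, generated with Flash
    read_backs = 3       # one per text-bearing attempt
    return (
        (flash_attempts + background) * 1.0
        + pro_attempts * pro_multiple
        + read_backs * read_back_unit
    )


# Placeholder inputs, not real prices: Pro at 4x Flash, read-back at 1% of Flash.
print(worst_case_units(pro_multiple=4.0, read_back_unit=0.01))
```

Multiplying the result by the actual Flash price and the monthly image count gives an upper bound that cannot be exceeded as long as the retry ladder stays fixed.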
So at an assumed 2,000 images per month, I set a ceiling: "alert if the share reaching Pro promotion exceeds 10%." Empirically, with sane prompts and copy, promotion stays at a few percent. When it crosses 10%, it's usually a sign the text is too long or you got greedy with line count. Trimming the input character count improved both cost and quality more than forcing things through with extra retries.
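The 10% alert can be a few lines of bookkeeping around the pipeline. A minimal sketch, assuming each finished banner is recorded with the stage that produced it; the class name, the stage labels, and the `min_samples` guard are my own inventions rather than part of the pipeline above.

```python
from collections import Counter


class PromotionMonitor:
    """Counts which stage produced each banner and flags a high promotion share."""

    def __init__(self, alert_share: float = 0.10, min_samples: int = 50):
        self.alert_share = alert_share
        self.min_samples = min_samples  # hypothetical guard against noisy early alerts
        self.counts = Counter()

    def record(self, stage: str) -> None:
        # stage is one of "flash", "flash_retry", "pro", "composite" (labels are my own)
        self.counts[stage] += 1

    def promotion_share(self) -> float:
        total = sum(self.counts.values())
        # Banners that ended in the composite fallback also went through Pro
        promoted = self.counts["pro"] + self.counts["composite"]
        return promoted / total if total else 0.0

    def should_alert(self) -> bool:
        total = sum(self.counts.values())
        return total >= self.min_samples and self.promotion_share() > self.alert_share
```

Calling `record()` once per banner and checking `should_alert()` at the end of each batch is enough to catch the "text got too long" drift described above.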
Absolute prices change with model revisions, so check the current per-image rate on the official billing page and plug it into the "several times baseline" figure to back-calculate your monthly ceiling. I back this out from the AdMob revenue of my wallpaper app and revisit the ceiling monthly so image-generation spend never eats into the margin.
Gotchas (what surfaced in production)
A few traps I caught by re-reading logs from unattended runs.
The read-back model "helpfully" paraphrases
If you don't strongly constrain the verification prompt with "exactly as they appear," gemini-3.5-flash sometimes corrects a typo into the right spelling and returns that. The image is garbled but the ratio comes out high, slipping past the gate. The fix is to state explicitly that read-back is a "transcribe the drawn glyphs" task, not a "guess correct Japanese" task.
Punctuation and whitespace wobble fails too often
Without dropping punctuation and spaces in normalization, effectively identical strings mismatch. The humble normalize() before comparison did the most for a stable pass rate.
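To see how much normalization matters, compare the same pair of strings with and without it. This sketch copies the character class from `normalize()` above; the two sample strings are made up for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(s: str) -> str:
    # Same character class as the normalize() shown earlier
    return re.sub(r"[\s、。・!?,.\-]", "", s)


expected = "今週だけ、無料。"
read_back = "今週だけ 無料"  # same words, punctuation read differently

raw = SequenceMatcher(None, expected, read_back).ratio()
clean = SequenceMatcher(None, normalize(expected), normalize(read_back)).ratio()
print(f"raw={raw:.2f} normalized={clean:.2f}")  # prints raw=0.80 normalized=1.00
```

Without normalization this pair would fail a 0.9 gate even though every meaningful character is correct. Note that the character class lists the ASCII `!` and `?`; if your copy uses the full-width forms, they would need to be added.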
Preview-era model names linger in code
The models with the -preview suffix were shut down on June 25, 2026. Copy an old script or sample and one day it just stops with an error. For the generation jobs we run at Dolice Labs, I bulk-replaced model names with the GA versions before the shutdown date. A dated deprecation, if missed, quietly kills an unattended pipeline, so decide the migration target ahead of time.
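A bulk replacement like that can be scripted. The sketch below is one possible shape, not the script I actually ran: the mapping uses the two preview and GA names mentioned earlier in this article, while `find_preview_names`, `migrate`, and the regular expression are illustrative assumptions.

```python
import re

# Preview names and their GA replacements, as listed earlier in this article.
GA_REPLACEMENTS = {
    "gemini-3.1-flash-image-preview": "gemini-3.1-flash-image",
    "gemini-3-pro-image-preview": "gemini-3-pro-image",
}

# Loose pattern for anything that looks like a preview-suffixed Gemini model name
PREVIEW_PATTERN = re.compile(r"gemini-[\w.\-]*-preview")


def find_preview_names(source: str) -> list[str]:
    """Return every preview-style model name found in a source string."""
    return sorted(set(PREVIEW_PATTERN.findall(source)))


def migrate(source: str) -> str:
    """Replace known preview names with their GA names; unknown ones are left alone."""
    for old, new in GA_REPLACEMENTS.items():
        source = source.replace(old, new)
    return source


old_script = 'MODEL_FAST = "gemini-3.1-flash-image-preview"'
print(find_preview_names(old_script))
print(migrate(old_script))
```

Running `find_preview_names()` over a repository in CI is a cheap way to notice a lingering preview name before it takes down an unattended job.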
Wrapping up: your next step
If you're currently rejecting garbled text by eye, start by dropping just read_back_text() and char_match_ratio() behind your existing generation step. Even logging how many images fail at threshold 0.9 turns "how often does it garble" into a number on your own data. Whether to add the composite fallback or Pro promotion is a decision you can make after seeing that number.
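Turning the log into that number takes very little code. A minimal sketch, assuming you have collected (expected, read-back) string pairs that are already normalized; `failure_rate` is a hypothetical helper and the log entries are made-up examples.

```python
from difflib import SequenceMatcher


def failure_rate(pairs, threshold=0.9):
    """Share of (expected, read_back) pairs whose ratio falls below threshold."""
    if not pairs:
        return 0.0
    failed = sum(
        1
        for expected, actual in pairs
        if SequenceMatcher(None, expected, actual).ratio() < threshold
    )
    return failed / len(pairs)


# Made-up log entries for illustration only.
log = [
    ("今週だけ無料", "今週だけ無料"),
    ("今週だけ無料", "今週だけ無科"),  # one character misdrawn
    ("新作壁紙", "新作壁紙"),
    ("新作壁紙", "新作壁紙"),
]
print(f"{failure_rate(log):.0%} of images failed the gate")
```

Rerunning the same function with a different `threshold` over the same log is also a quick way to calibrate the cutoff on your own data.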
A single verification gate turns unattended generation from "occasionally breaks" into "notices and fixes itself when it breaks." I personally stopped eyeballing my images once I added this one layer. If you're also worn down by text in generated images, I hope this helps.