API / SDK · 2026-06-12 · Intermediate

Building an App Store Rejection Workflow with the Gemini API — From Structured Notices to Resolution Center Replies

How I use the Gemini API to parse App Store rejection notices into structured JSON, cross-check guidelines, draft Resolution Center replies, and run pre-submission checks as an indie developer.

Gemini API · App Store Review · Structured Output · Indie Development · App Operations

One morning this spring, while I was pushing parallel updates to six of my apps, I opened App Store Connect and found two unread messages waiting in the Resolution Center. One was a Guideline 2.3.3 issue about screenshots; the other was a 5.1.1 issue about privacy disclosures. When your release queue is already full, a rejection is not just rework. The real burden is the traffic control: which app, which issue, in what order.

As an indie developer, there is no teammate to hand review correspondence to. I have been through enough rejections over the years that the process itself does not rattle me, but during that stretch of overlapping updates — a StoreKit 2 migration and new device resolutions landing at the same time — the read-research-reply-fix loop was visibly eating into development hours. So I wired parts of it into the Gemini API. It worked better than I expected, and this is a record of how the pipeline fits together and where it stops being useful.

Reading the notice was the actual bottleneck

Before any automation, I timed where the hours actually went. Rejection handling breaks down into four stages.

1. Read the notice and identify each issue
2. Look up the relevant guideline text and understand what is being asked
3. Decide whether to fix the build or respond with an explanation
4. Write the Resolution Center reply, or fix and resubmit

On average, one rejection cost me about 90 minutes. The surprising part was that the heavy stages were 1 and 2 — reading and researching — not the fixing. Review notices embed specific findings inside boilerplate, often bundle several issues into one message, and vary wildly in granularity. Some are concrete ("replace this screenshot"); others leave broad room for interpretation ("verify your app meets the guideline's requirements").

Misreading a notice leads straight to a second rejection, so I would end up reading every sentence carefully while other apps' work sat idle. Stack that on top of routine operations — staged rollout monitoring, AdMob report checks — and a single rejection could wreck the rhythm of half a day. The design goal of this pipeline is simple: let the machine do the reading and researching, and keep the judging and fixing for myself.

Converting a notice into three-layer JSON

The core of the pipeline turns the notice body into structured data using Gemini's structured output (response_schema). One notice becomes a list of individual findings, organized in three layers.

Notice level: app name, submission ID, and a flag for whether a reply alone might resolve it
Finding level: guideline number, what the reviewer is asking for, and the target (binary, metadata, or screenshot)
Evidence level: a verbatim quote from the notice body

The third layer — verbatim evidence — is the part that earns its keep in production. If you let the model output only summaries, you will not notice when it invents a requirement the notice never made. Forcing a quoted excerpt alongside each finding means I can verify mechanically that the quote actually exists in the source text, which catches hallucinated findings before they reach my to-do list.

Here is the working code, tidied up for reading. It takes the notice text and returns a Pydantic model containing the findings.

from google import genai
from pydantic import BaseModel
 
class RejectionItem(BaseModel):
    guideline: str      # e.g. "2.3.3"
    requirement: str    # summary of what the reviewer asks for
    evidence: str       # verbatim quote from the notice (required)
    target: str         # "binary" / "metadata" / "screenshot"
    action: str         # the concrete step on my side
 
class RejectionReport(BaseModel):
    app_name: str
    submission_id: str
    items: list[RejectionItem]
    reply_only_candidate: bool  # might a reply alone resolve this?
 
# The client reads the API key from the environment (GEMINI_API_KEY or GOOGLE_API_KEY).
client = genai.Client()
 
def parse_rejection(notice_text: str) -> RejectionReport:
    prompt = f"""The following is an App Store review rejection notice.
Break it down into individual findings according to the schema.
Constraints:
- Never add facts that are not stated in the notice
- The evidence field must quote the notice verbatim
- If a finding is ambiguous, write "needs human judgment" in action
 
---
{notice_text}
"""
    res = client.models.generate_content(
        model="gemini-3-flash",
        contents=prompt,
        config={
            "response_mime_type": "application/json",
            "response_schema": RejectionReport,
        },
    )
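    # res.parsed is None when the response could not be parsed into the schema.
    if res.parsed is None:
        raise ValueError("Gemini response did not match the RejectionReport schema")
    return res.parsed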
 
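def verify_evidence(report: RejectionReport, notice_text: str) -> list[str]:
    """Return guideline numbers whose quote is missing from the source text."""
    broken = []
    # Collapse whitespace on both sides so line wrapping cannot break a match.
    normalized = " ".join(notice_text.split())
    for item in report.items:
        quoted = " ".join(item.evidence.split())
        # Only the first 80 characters are compared. An empty quote is flagged
        # explicitly, because "" is a substring of every string.
        if not quoted or quoted[:80] not in normalized:
            broken.append(item.guideline)
    return broken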

Two notes on why it is written this way. First, gemini-3-flash is enough. Parsing a notice is a reading task, not a reasoning task, and the lower latency suits the tempo of "a notice just arrived, break it down now." Second, verify_evidence is deliberately a separate, deterministic function. Quote verification is plain string matching, so I do not delegate it to a probabilistic model. Separating generation from verification is the principle that runs through the whole pipeline.
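To see the quote check catch an invented finding without calling the API, here is a minimal sketch. The notice text and both findings are made up for illustration, and `Item` is a hypothetical stand-in that carries only the two fields the check reads, so the function takes a plain list rather than a full RejectionReport.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for RejectionItem (only the fields the check reads)."""
    guideline: str
    evidence: str


def verify_evidence(items: list[Item], notice_text: str) -> list[str]:
    """Return guideline numbers whose quote is missing from the notice."""
    normalized = " ".join(notice_text.split())
    broken = []
    for item in items:
        quoted = " ".join(item.evidence.split())
        # Compare only the first 80 characters; an empty quote always fails.
        if not quoted or quoted[:80] not in normalized:
            broken.append(item.guideline)
    return broken


# Invented notice text, for illustration only.
NOTICE = """Guideline 2.3.3 - Performance - Accurate Metadata
Your screenshots do not sufficiently
show the app in use."""

items = [
    # A real quote, wrapped differently from the source text.
    Item("2.3.3", "Your screenshots do not sufficiently show the app in use."),
    # A requirement the notice never stated.
    Item("5.1.1", "Your app must include a privacy policy link."),
]

flagged = verify_evidence(items, NOTICE)
print(flagged)  # only the finding with the unsupported quote
```

Whitespace normalization is what lets the first quote pass even though the notice wraps the sentence across two lines.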

A word on the two fields I went back and forth on. I kept target to exactly three values because my downstream work branches exactly three ways: a binary finding means rebuild and resubmit, a metadata finding means edits inside App Store Connect, and a screenshot finding means setting up capture devices. Finer-grained categories only increased classification jitter without changing what I do next. The reply_only_candidate flag marks the optimistic path — findings that might be resolved with an explanation instead of a fix — and starting with those can move a review forward without waiting on a rebuild. It is a candidate flag, though; the final call always comes after I read the guideline myself.
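The three-way branch on target can be written as a small lookup. This is a sketch, not the pipeline's actual code: the `Target` enum, the `NEXT_STEP` table, and the fallback string are assumptions that paraphrase the three branches described above.

```python
from enum import Enum


class Target(str, Enum):
    BINARY = "binary"
    METADATA = "metadata"
    SCREENSHOT = "screenshot"


# Next step per target, paraphrasing the three branches in the text.
NEXT_STEP = {
    Target.BINARY: "rebuild and resubmit",
    Target.METADATA: "edit in App Store Connect",
    Target.SCREENSHOT: "set up capture devices and reshoot",
}


def next_step(target: str) -> str:
    """Map a finding's target to the next step; unknown values go to a human."""
    try:
        return NEXT_STEP[Target(target.strip().lower())]
    except ValueError:
        # The model returned something outside the three expected values.
        return "needs human judgment"


print(next_step("metadata"))
print(next_step("in-app purchase"))
```

Routing unexpected values to a human, rather than guessing a lane, keeps classification jitter visible instead of hiding it.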

Cross-checking guideline text without hallucinations

Once findings are structured, the next step is matching them against the actual guideline text. Let me share the mistake I made first.

# Before: asking without providing the text (poor reproducibility)
My app was rejected under Guideline 2.3.3.
How should I fix it?

The problem is that the model reconstructs the guidelines from its training memory. The App Store Review Guidelines are revised several times a year, so memory-based answers mix in outdated clauses and requirements that do not exist. I once received a confidently worded "recommended phrasing" for a tracking permission dialog that appears nowhere in the current text. If you want interpretation, you must supply the clause itself.

# After: provide the clause verbatim and force citations
Below are the exact text of the relevant App Store Review Guidelines
section and the structured findings JSON.
 
Tasks:
1. Show which sentence of the clause each finding maps to, quoting it
2. List fix options ordered by effort, smallest first
3. Explicitly mark anything not present in the clause as "outside the text"
 
[paste the clause here]
[paste the findings JSON here]

After switching to the second form, verification time dropped sharply, because checking the answer reduces to two questions: does the quoted sentence exist in the clause, and is the interpretation consistent with the quote? I copy the relevant section by hand from the App Store Review Guidelines. I also tried feeding the entire page automatically, but unrelated clauses blurred the focus of the answers; passing only the relevant section was consistently more stable. More context is not better context — a lesson that keeps repeating across my other tooling.
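The same generate-then-verify split applies to this step. The sketch below assembles the second prompt from a hand-copied clause and checks that sentences the model cites really occur in that clause. The function names, the `## Clause` and `## Findings` headers, and the assumption that cited sentences have already been pulled out of the answer into a list are all illustrative. The clause text is a placeholder, not real guideline wording.

```python
import json


def build_match_prompt(clause_text: str, findings: list[dict]) -> str:
    """Assemble the clause-plus-findings prompt from hand-copied text."""
    return (
        "Below are the exact text of the relevant App Store Review Guidelines\n"
        "section and the structured findings JSON.\n\n"
        "Tasks:\n"
        "1. Show which sentence of the clause each finding maps to, quoting it\n"
        "2. List fix options ordered by effort, smallest first\n"
        '3. Explicitly mark anything not present in the clause as "outside the text"\n\n'
        f"## Clause\n{clause_text.strip()}\n\n"
        f"## Findings\n{json.dumps(findings, ensure_ascii=False, indent=2)}\n"
    )


def missing_citations(cited: list[str], clause_text: str) -> list[str]:
    """Return cited sentences that do not appear in the clause text."""
    normalized = " ".join(clause_text.split())
    return [
        q for q in cited
        if not q.strip() or " ".join(q.split()) not in normalized
    ]


# Placeholder clause, for illustration only.
clause = "Placeholder sentence one.\nPlaceholder sentence two."
prompt = build_match_prompt(clause, [{"guideline": "2.3.3"}])
print(missing_citations(["Placeholder sentence two.", "Invented sentence."], clause))
```

Any sentence the check returns is one the model attributed to the clause without support, which is the first thing to look at before trusting the interpretation.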

One caution: I treat the matching result strictly as candidate interpretation. Replying based on a misread guideline only prolongs the exchange with the review team. Gemini's job here is to assemble the decision material quickly, not to make the decision.

Drafting Resolution Center replies

When I judge that an explanation can resolve a finding without a fix, I reply in the Resolution Center. Writing quality directly affects the outcome here, and over the years I have settled on three principles.

State facts only — how the feature behaves, where to find the screen. No speculation, no emotion
Never argue — do not contest the finding's validity; dissolve misunderstandings with facts
Keep it short — something a reviewer can read in 30 seconds

These rules go straight into the prompt that drafts the English reply.

REPLY_RULES = """You are drafting a reply to the App Store
Resolution Center on behalf of an indie developer.
Principles:
- State only facts; no speculation, emotion, or rebuttal
- For each claim, note where in the app it can be verified
- Keep the whole reply under 120 words
- Polite but not excessively deferential
"""
 
def draft_reply(item_json: str, facts: str) -> str:
    res = client.models.generate_content(
        model="gemini-3-flash",
        contents=(
            f"{REPLY_RULES}\n\n"
            f"## Finding (structured)\n{item_json}\n\n"
            f"## Verified facts on my side\n{facts}\n"
        ),
    )
    return res.text

The facts input is a bullet list I write by hand. For a 2.3.3 finding it looks like "screenshot 3 shows the settings screen, captured on a physical iPhone 17 Pro Max" and "all on-screen text matches the live app." Writing this manually looks like overhead, but verifying facts is precisely the part only a human can do, so I never skip it. Conversely, once the facts are accurate, assembling them into reviewer-friendly English is something the model does faster and more consistently than I do.

One case made the value concrete. A reviewer flagged a settings-screen description as "not matching actual behavior." It did match; the reviewer had apparently looked at a different tab. I sent a 120-word reply that walked through the three-step navigation path to the feature, and the build was approved the next business day with no changes. A short, precise verification path beats a long defense.

Every draft gets read before it is sent. In my logs, about nine out of ten drafts are usable as-is; the remaining tenth is the model helpfully adding context I never provided as fact. Per the principles above, those additions get deleted mechanically.
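The word limit in REPLY_RULES is another constraint that can be checked deterministically rather than trusted. A minimal sketch follows. The helper names are hypothetical, and treating 120 as an inclusive ceiling is an assumption, since "under 120" could also be read strictly.

```python
def word_count(reply: str) -> int:
    """Count whitespace-separated words in a draft reply."""
    return len(reply.split())


def over_limit(reply: str, limit: int = 120) -> bool:
    """True when the draft is longer than the limit (inclusive ceiling)."""
    return word_count(reply) > limit


# A deliberately long draft: five words repeated thirty times.
draft = "Thank you for the review. " * 30
print(word_count(draft), over_limit(draft))
```

A draft that fails the check is a signal to trim by hand or regenerate, not something to send as is.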

Triage when several rejections land at once

The structured data pays off most when rejections pile up. Three signals drive the ordering.

Speed to resolve — findings with reply_only_candidate true, or metadata targets, can move the same day
Impact — an app frozen mid-staged-rollout, especially one carrying a crash fix, jumps the queue
Deadlines — letting a Resolution Center thread go stale risks the submission being treated as withdrawn

In practice I split work into two lanes: metadata findings that can be resubmitted same-day, and binary findings that fold into the normal development cycle. Back when I processed notices in arrival order, light issues regularly sat for days behind heavy ones. On that spring morning, the 5.1.1 privacy finding needed only an App Store Connect edit, so it was fixed and resubmitted before lunch, while the 2.3.3 screenshot reshoot took two days of device logistics. Swapping the order meant one app cleared review by the next day. It is just sorting — but having the sorting criteria appear as data within five minutes of opening the notice is the difference that compounds as the case count grows.
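The three signals can be expressed as a sort key. This sketch makes assumptions the text does not state: `rollout_frozen` and `days_since_notice` are hypothetical fields that would be filled in by hand, and ranking impact ahead of speed ahead of deadlines is one plausible ordering, not a documented rule.

```python
from dataclasses import dataclass


@dataclass
class Case:
    app: str
    target: str                # "binary" / "metadata" / "screenshot"
    reply_only_candidate: bool
    rollout_frozen: bool       # hypothetical: staged rollout is paused
    days_since_notice: int     # hypothetical: age of the Resolution Center thread


def triage_key(case: Case) -> tuple:
    """Sort key: smaller tuples are handled first."""
    quick = case.reply_only_candidate or case.target == "metadata"
    return (
        not case.rollout_frozen,   # impact: frozen rollouts jump the queue
        not quick,                 # speed: same-day items come next
        -case.days_since_notice,   # deadlines: older threads before newer
    )


cases = [
    Case("App B", "screenshot", False, False, 0),
    Case("App A", "metadata", False, False, 0),
    Case("App C", "binary", False, True, 0),
]
ordered = sorted(cases, key=triage_key)
print([c.app for c in ordered])
```

With these inputs the frozen rollout comes first, then the metadata fix, then the screenshot reshoot, which matches the order chosen by hand on that spring morning for the two real findings.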

Turning rejection history into an asset with NDJSON

Structured findings append naturally into newline-delimited JSON, one record per finding.

import json
import pathlib
from datetime import date
 
HISTORY = pathlib.Path("rejection_history.ndjson")
 
def append_history(report: RejectionReport, app_id: str) -> None:
    with HISTORY.open("a", encoding="utf-8") as f:
        for item in report.items:
            f.write(json.dumps({
                "date": date.today().isoformat(),
                "app_id": app_id,
                "guideline": item.guideline,
                "target": item.target,
                "requirement": item.requirement,
                "resolution": "",  # filled in later: "metadata-fix" etc.
            }, ensure_ascii=False) + "\n")

Aggregating the last twelve months surfaced something I had never seen while handling cases one at a time: of 9 rejections across 6 apps, 5 were metadata-driven (screenshots, descriptions, privacy text), and genuine binary findings were the minority. In other words, a metadata check before submission could have prevented more than half of them. Per-app patterns emerged too — live-wallpaper apps skew toward screenshot findings, apps with notifications skew toward privacy-text findings. I keep the store as NDJSON rather than a database because the volume is a few dozen records a year, fully greppable. Not over-tooling for the scale is part of what keeps a system maintained. Logging takes under a minute per case, and that minute buys real confidence at the next submission.
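The aggregation itself needs nothing beyond the standard library. Below is a sketch assuming the record fields written by append_history. The sample records are invented, and the counts are per finding, not per rejection, because the file holds one line per finding.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def summarize(path: Path) -> tuple[Counter, Counter]:
    """Count findings by target and by (app_id, target) from an NDJSON file."""
    by_target: Counter = Counter()
    by_app_target: Counter = Counter()
    if not path.exists():
        return by_target, by_app_target
    for line in path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        rec = json.loads(line)
        by_target[rec["target"]] += 1
        by_app_target[(rec["app_id"], rec["target"])] += 1
    return by_target, by_app_target


# Invented sample records, for illustration only.
records = [
    {"app_id": "app-a", "target": "metadata"},
    {"app_id": "app-a", "target": "screenshot"},
    {"app_id": "app-b", "target": "metadata"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "rejection_history.ndjson"
    sample.write_text(
        "\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8"
    )
    by_target, by_app_target = summarize(sample)

print(by_target.most_common())
```

The per-app counter is where patterns such as "this app keeps drawing screenshot findings" would show up.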

Running a pre-flight check before every submission

The history feeds a pre-submission self-check: before submitting a new version, I hand Gemini the upcoming metadata together with that app's past findings and ask it to flag recurrence risks.

def preflight_check(app_id: str, metadata_text: str) -> str:
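    # A missing history file simply means there are no past findings yet.
    raw = HISTORY.read_text(encoding="utf-8") if HISTORY.exists() else ""
    history_lines = [
        line for line in raw.splitlines()
        if line.strip() and json.loads(line)["app_id"] == app_id
    ]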
    res = client.models.generate_content(
        model="gemini-3-flash",
        contents=(
            "Below are this app's past review findings and the metadata "
            "for the upcoming submission.\n"
            "Check whether the submission contains issues of the same kind "
            "as past findings, citing the matching history line for each risk.\n"
            "If nothing matches, answer only with 'no findings'.\n\n"
            f"## Past findings\n{chr(10).join(history_lines)}\n\n"
            f"## Upcoming metadata\n{metadata_text}\n"
        ),
    )
    return res.text

The metadata_text bundle covers the app name, subtitle, description, keywords, promotional text, a per-screenshot note of which screen each image shows, and the privacy-related text. I experimented with passing the actual screenshot images, but for the purpose of matching against history, one-line textual notes proved more reproducible and faster to review. I reserve vision input for checks that genuinely require pixels — like spotting mismatches between on-screen labels and the description — and keep history matching in text.
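One way to assemble that bundle is a small formatter that labels every field and numbers the screenshot notes. The key names and the "(not provided)" marker below are assumptions for illustration, and the sample values are invented.

```python
def build_metadata_text(meta: dict[str, str], screenshot_notes: list[str]) -> str:
    """Render store metadata and per-screenshot notes as one labeled text block."""
    order = [
        "app_name", "subtitle", "description",
        "keywords", "promotional_text", "privacy_text",
    ]
    lines = []
    for key in order:
        # Mark absent fields explicitly so a gap is visible during review.
        lines.append(f"{key}: {meta.get(key, '(not provided)')}")
    lines.append("screenshots:")
    for i, note in enumerate(screenshot_notes, start=1):
        lines.append(f"  {i}. {note}")
    return "\n".join(lines)


# Invented sample values, for illustration only.
text = build_metadata_text(
    {"app_name": "Sample App", "description": "A sample description."},
    ["settings screen", "home screen with the default theme"],
)
print(text)
```

A fixed field order keeps successive submissions diffable, which makes it easier to see what changed between a rejected version and the next one.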

Since adding this gate, no submission has been rejected for a repeat reason. That covers roughly 20 update submissions across the six apps, so it is not yet a statistically meaningful sample, but it has eliminated the most demoralizing pattern: failing twice for the same cause. Overall handling time per rejection fell from about 90 minutes to around 30. The breakdown is telling — what shrank was the front half, reading and researching; the time I spend on final reply review and on the fixes themselves is essentially unchanged. The judgment stayed human. Only the preparation got faster.

Measured costs and model choice

For anyone weighing this up: a rejection notice body ran 1,500–3,000 tokens in my cases. Running all three stages — structuring, guideline matching, reply drafting — stays under 10,000 tokens total per case. At gemini-3-flash pricing that is effectively a rounding error, and at a few cases per month the rate limits never come into view, even on the free tier.

I compared gemini-3-flash against the Pro line on the same ten notices when I set this up. Structuring accuracy was effectively identical; the difference was latency. Flash returned in about 3 seconds per notice, while Pro occasionally exceeded 10, and for a "break this down right now" tool, Flash's tempo wins. The only place Pro showed an edge was composing prose when an interpretation spanned multiple clauses — not decisive when a human makes the final call anyway. My recommendation for operational tools like this: build on Flash first, and upgrade only the stages where accuracy disappoints. Start big and the latency, more than the cost, quietly pushes the tool out of daily use.

For structured-output stability, two things mattered most: putting concrete examples in the schema field comments, and keeping enumerated values to three or four options. Classification fields fragment as the option count grows. Treat the schema as instructions to the model, scoped at the same granularity a human reviewer would use.
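One point worth spelling out: Python `#` comments are not part of the schema Pydantic generates, so an example written there never leaves the source file. A sketch of one way to put the examples and the three-value constraint into the schema itself is below, using Pydantic `Field` descriptions and an `Enum`. It assumes the SDK forwards field descriptions and enum values to the model, which is worth confirming against the current structured-output documentation. The description wording is illustrative.

```python
from enum import Enum

from pydantic import BaseModel, Field


class Target(str, Enum):
    BINARY = "binary"
    METADATA = "metadata"
    SCREENSHOT = "screenshot"


class RejectionItem(BaseModel):
    guideline: str = Field(description='Guideline number only, e.g. "2.3.3"')
    requirement: str = Field(
        description="One-sentence summary of what the reviewer asks for"
    )
    evidence: str = Field(description="Verbatim quote copied from the notice body")
    target: Target = Field(
        description="Where the fix lands: binary, metadata, or screenshot"
    )
    action: str = Field(
        description='Concrete next step, or "needs human judgment"'
    )


item = RejectionItem(
    guideline="2.3.3",
    requirement="Replace the third screenshot",
    evidence="(quote from the notice)",
    target="metadata",
    action="Edit in App Store Connect",
)
print(item.target)
```

A side benefit of the enum is local validation: a value outside the three options raises a validation error when the response is parsed, instead of flowing silently into the history file.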

Where the limits are, and what stays human

The system runs well, but its boundaries are clear, and I want to be candid about them.

First, guideline interpretation cannot be delegated. Supplying clause text raises matching precision, but the operational reality of review — what actually passes under a 4.3-family finding this quarter — is not in any model. That gap is covered by developer-community knowledge and my own scar tissue.

Second, review correspondence is a dialogue, not a negotiation. A reply containing concrete facts about this specific app resolves faster than a polished template every time. Mass-producing generic drafts would be a detour, which is exactly why the hand-written facts list is a mandatory input.

Third, rejection notices sometimes reference unreleased features or internal details. I still scan every notice before it goes to the API, and I keep human eyes at both the entrance and the exit of the pipeline.

Fourth, this kind of semi-automation is never finished. Notice formats change without warning, and guideline numbering shifts with revisions. Once a quarter I re-run recent notices through the parser and adjust schema descriptions and prompt constraints. It is light work, but it is why I keep the code small — the fatter the pipeline, the more tiresome it becomes to track format drift.
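The quarterly re-run can be a few lines if each stored notice is paired with the guideline numbers a human confirmed for it. The sketch below is hypothetical throughout: `drift_report`, the sample notices, and the regex-based `fake_parser` are stand-ins, and in real use the parser argument would wrap parse_rejection and return the set of guideline values it produced.

```python
import re
from typing import Callable, Iterable


def drift_report(
    samples: Iterable[tuple[str, set[str]]],
    parse_guidelines: Callable[[str], set[str]],
) -> list[str]:
    """Compare parsed guideline numbers with hand-labeled expectations."""
    problems = []
    for notice_text, expected in samples:
        got = parse_guidelines(notice_text)
        if got != expected:
            label = notice_text.strip()[:40] or "(empty notice)"
            problems.append(
                f"{label!r}: expected {sorted(expected)}, got {sorted(got)}"
            )
    return problems


def fake_parser(text: str) -> set[str]:
    """Stand-in for the real parser: pull guideline-like numbers with a regex."""
    return set(re.findall(r"\b\d\.\d(?:\.\d+)?\b", text))


# Invented samples; the second expectation is wrong on purpose to show a report.
samples = [
    ("Guideline 2.3.3 - screenshots", {"2.3.3"}),
    ("Guideline 5.1.1 - privacy", {"5.1.2"}),
]
problems = drift_report(samples, fake_parser)
print(problems)
```

An empty report means the current schema and prompt still read the stored notices the way a human did. Any entry is a prompt to adjust descriptions or constraints.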

One closing note: the same pipeline now handles Google Play's pre-launch findings almost unchanged — the guideline field simply reads as a policy name, and the structuring and history code is shared. Rejections are unavoidable for as long as you ship as an indie developer. The point is not to absorb each one as a fresh wound, but to turn handling into a procedure and records into the next submission's safety net. Start by running one of your old rejection notices through the schema above — the shape of where your apps tend to stumble will surface as data. I hope this is useful to anyone juggling review correspondence across multiple apps.
