# Gemini 3.1 Pro Preview — Independent Research Report

**Publisher**: Swfte AI Research
**Report date**: May 2026
**Methodology**: https://www.swfte.com/research/methodology
**Web version**: https://www.swfte.com/research/gemini-3-1-pro
**Citation**: Swfte AI Research, "Gemini 3.1 Pro Preview — Independent Research Report", May 2026.

## Executive Summary

Gemini 3.1 Pro Preview is Google DeepMind's flagship reasoning and multimodal model, currently in public preview on Vertex AI and Google AI Studio. As of May 2026 it ranks #1 on the LMSys text-only Chatbot Arena (Elo ~1500) and posts the highest GPQA Diamond score (94.3%) of any publicly accessible model. The headline number that distinguishes Gemini 3.1 Pro from its peers is the 2,000,000-token context window, twice that of Opus 4.7 and GPT-5.5, which enables genuinely different workload patterns rather than just incrementally larger versions of existing ones.

Pricing is $3.50 per million input tokens and $10.50 per million output tokens, materially below both Opus 4.7 and GPT-5.5 at the standard tier. This price-quality positioning makes Gemini 3.1 Pro the most economically rational frontier-tier choice for many workloads.

Three strengths set this model apart. First, mathematical and scientific reasoning: 94.3% on GPQA Diamond is genuinely class-leading and translates into measurably stronger performance on AIME-style and FrontierMath-style benchmarks. Second, multimodal document understanding: the 2M context lets you fit hundreds of pages of mixed PDFs, images, and tables into a single prompt with high fidelity. Third, multilingual capability, particularly Korean, Japanese, and Vietnamese, where Google's training-data advantage shows.

Three weaknesses are worth flagging. First, "preview" status means SLAs and pricing are subject to change; production commitments here carry real risk. Second, agentic tool-use reliability trails Opus 4.7 by a measurable margin, particularly on error recovery. Third, the model still over-refuses on policy-adjacent prompts more often than its competitors, despite Google's stated improvements in 2026.

For buyers: Gemini 3.1 Pro is the right model for math-heavy, science-heavy, multimodal-heavy, or multilingual workloads — and for any workload where the 2M context unlocks something a 1M context cannot. It is a less appropriate choice for production agentic loops or workloads requiring guaranteed pricing through a long-running contract.

## 1. Model Snapshot

| Attribute | Value |
|---|---|
| Provider | Google DeepMind |
| Release date | March 18, 2026 (preview) |
| Parameters | Not disclosed; estimated MoE ~1.5T total / 180B active |
| Context window | 2,000,000 tokens |
| Max output | 64,000 tokens |
| License | Proprietary (commercial only); preview SLA caveats apply |
| Input pricing | $3.50 per 1M tokens (≤200K context) / $7.00 (>200K) |
| Output pricing | $10.50 per 1M tokens (≤200K context) / $21.00 (>200K) |
| Cache hit | $0.875 per 1M tokens |
| Batch (50% off) | $1.75 / $5.25 per 1M |
| Modalities | Text, image, audio, video input; text, image, audio output |
| Providers | Google AI Studio, Vertex AI, Gemini API |
| Knowledge cutoff | February 2026 |

## 2. Architecture & Training (what's known publicly)

Google has been more open about Gemini 3.1 Pro than OpenAI or Anthropic have been about their flagships. The Gemini 3 technical report (March 2026) describes the model as a sparse mixture-of-experts with native multimodal input baked into the pretraining stack. The 2M context is supported by an attention mechanism Google describes as "extended-locality attention", a variant of ring attention that accepts a roughly constant latency overhead in exchange for the ability to operate on much longer sequences.
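
The mechanism itself has not been published beyond the report's description, but the broad idea of locality-restricted attention is easy to illustrate. Below is a minimal NumPy sketch of a sliding-window variant, our illustration rather than Google's algorithm: each position attends only to a fixed neighborhood, so compute grows linearly in sequence length instead of quadratically.

```python
import numpy as np

def local_attention(q, k, v, window: int):
    """Toy locality-restricted attention: each query position attends
    only to keys within `window` positions of itself, so cost grows as
    O(n * window) rather than O(n^2). Illustrative only; the production
    mechanism has not been published."""
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)   # similarity to local keys
        weights = np.exp(scores - scores.max())   # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]               # weighted local values
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((128, 16)) for _ in range(3))
print(local_attention(q, k, v, window=8).shape)   # (128, 16)
```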

The pretraining corpus is reported at "approximately 22T tokens" with explicit balance across English, code, scientific literature, and a deliberately enriched multilingual subset that emphasizes high-resource Asian languages. Synthetic data ratio is described as "below 25%."

Post-training combines RLHF with what Google describes as "verifier-augmented RL" — using executable verifiers (math solvers, code interpreters, fact-checkers against authoritative sources) to provide reward signal where outcomes can be evaluated programmatically. This shows up clearly in the GPQA Diamond and AIME results.
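
The production verifier stack is not public, but the general shape of a programmatic reward is easy to sketch. A minimal example, with hypothetical names, for answers that can be checked numerically:

```python
import math

def verify_numeric(candidate: str, expected: float, tol: float = 1e-6) -> bool:
    """Executable verifier: parse the model's final answer and check it
    against a known solution. Stands in for the math-solver verifiers the
    report describes; the production system is not public."""
    try:
        return math.isclose(float(candidate.strip()), expected, rel_tol=tol)
    except ValueError:
        return False

def reward(sample: dict) -> float:
    """Binary reward signal for RL, available wherever the outcome can be
    checked programmatically."""
    return 1.0 if verify_numeric(sample["answer"], sample["expected"]) else 0.0

print(reward({"answer": "42.0", "expected": 42.0}))        # 1.0
print(reward({"answer": "forty-two", "expected": 42.0}))   # 0.0
```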

The "Preview" tag matters operationally. Google has published a deprecation policy stating that previews carry no SLA; pricing, rate limits, and capabilities can change with 30-day notice. As of May 2026, Google has indicated GA is expected in Q3 2026, but no committed date.

## 3. Pricing Reality

Headline: $3.50 / $10.50 per million tokens for prompts ≤200K. The per-token rates double once your input exceeds 200K, which is a procurement gotcha worth highlighting.

Effective production cost on a 4,000-token prompt → 1,000-token completion:

| Scenario | Cost |
|---|---|
| Cold prompt (≤200K) | $0.0245 |
| Cache hit (90% of prompt) | $0.01505 |
| Batch (≤200K) | $0.01225 |
| Cold prompt (same tokens billed at >200K rates) | $0.049 |

For long-context workloads, the regime that justifies choosing Gemini 3.1 Pro over a 1M-context competitor in the first place, the effective cost approximately doubles. A 1.5M-token prompt with 2K output costs roughly $10.54 per request. Compare Opus 4.7 at its 1M-context ceiling: a 1M-token prompt plus 2K output runs roughly $5.05 per request after the tokenizer adjustment. Cost-per-token is similar; the tier structure favors Gemini for small prompts (<200K) and for prompts that genuinely need 2M, and leaves it poorly positioned in the 200K-1M middle.

Cache-hit pricing at $0.875/M is higher than Opus 4.7 and GPT-5.5 (both $0.50/M), and the discount off the standard input rate is smaller (75% vs. 90% for Opus 4.7), so a cached prefix must be reused more heavily before caching pays for itself.
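
The scenarios above are simple enough to reproduce. Below is a minimal calculator using the snapshot-table rates. It assumes, as this section and Appendix C describe, that the whole request is billed at the higher tier once input crosses 200K, and it applies the flat cache-hit rate regardless of tier (the report does not say how caching interacts with the tier boundary).

```python
# Rates from the model snapshot table (USD per 1M tokens).
RATES = {
    "le_200k": {"input": 3.50, "output": 10.50},
    "gt_200k": {"input": 7.00, "output": 21.00},
}
CACHE_HIT = 0.875      # per 1M cached input tokens
BATCH_DISCOUNT = 0.5   # batch tier is 50% off

def request_cost(input_tokens, output_tokens, cached_fraction=0.0, batch=False):
    """Cost of one request; the entire request is billed at the >200K
    rates once input exceeds 200K, not just the marginal tokens."""
    tier = RATES["gt_200k" if input_tokens > 200_000 else "le_200k"]
    cached = input_tokens * cached_fraction
    cost = (
        cached / 1e6 * CACHE_HIT
        + (input_tokens - cached) / 1e6 * tier["input"]
        + output_tokens / 1e6 * tier["output"]
    )
    return cost * (BATCH_DISCOUNT if batch else 1.0)

print(round(request_cost(4_000, 1_000), 5))                       # 0.0245
print(round(request_cost(4_000, 1_000, cached_fraction=0.9), 5))  # 0.01505
print(round(request_cost(4_000, 1_000, batch=True), 5))           # 0.01225
print(round(request_cost(1_500_000, 2_000), 3))                   # 10.542
```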

## 4. SMQTS Programming Series Results

| Category | Score | Notes |
|---|---|---|
| Algorithm implementation (LeetCode-Hard) | 92 | 2 points behind Opus 4.7. |
| TypeScript refactor (50K LOC repo) | 86 | Behind Opus 4.7; skipped 4 files. |
| Python data pipeline (pandas → polars) | 88 | Strong; subtle correctness issues on lazy frames. |
| Go concurrency bug isolation | 85 | Solid; missed channel-direction issue. |
| SQL query optimization (Postgres) | 89 | Top of field on partitioned table joins. |
| React server component migration | 84 | Behind Opus 4.7. |
| Rust lifetime errors | 86 | Better than Opus 4.7 in our test set. |
| Code review (security-focused) | 87 | Solid; behind Opus 4.7. |
| Test generation (pytest, vitest) | 88 | Clean coverage. |
| Long-context refactor (600K-token monorepo) | 82 | Trailed Opus 4.7 despite 2M context. |

**Series average: 86.7** (vs. 90.5 for Opus 4.7, 84.1 for GPT-5.5, 78.3 for DeepSeek V4 Pro)

## 5. SMQTS Non-Programming Series Results

| Category | Score | Notes |
|---|---|---|
| Long-form analytical writing | 88 | Strong; 2 points behind GPT-5.5. |
| Multi-step financial analysis | 91 | Top of field on DCF + sensitivity. |
| Legal contract review (redlines) | 89 | Strong; solid edge-case detection. |
| Multilingual translation (EN→ZH/JA/KO) | 92 | Top of field, particularly on KO colloquial. |
| Image OCR + table extraction | 91 | Top of field on dense and faded scans. |
| Data extraction from PDFs (structured) | 92 | Best of class on multi-page invoice extraction. |
| Creative writing (genre fiction) | 81 | Capable; less voice-versatility than GPT-5.5. |
| Instruction-following under adversarial prompts | 87 | Below Opus 4.7. |
| Mathematical reasoning (AIME-2025) | 94 | Top of field. |
| Tool use (5+ interleaved tools) | 86 | Solid; below Opus 4.7. |

**Series average: 89.1** (vs. 86.7 for Opus 4.7, 85.4 for GPT-5.5, 76.2 for DeepSeek V4 Pro)

## 6. Cost-Quality Validation

We re-ran 200 prompts on DeepSeek V4 Pro and Gemma 4 27B. For 156 of the 200, the cheaper model produced output that blinded raters scored as indistinguishable from, or better than, Gemini 3.1 Pro's. That fungibility ratio sits between Opus 4.7's (142/200) and GPT-5.5's (168/200).

The 44 prompts where Gemini 3.1 Pro won decisively were concentrated in:
1. Multimodal document understanding with mixed image and text inputs (Gemini won 14 of 16).
2. Mathematical and scientific reasoning at AIME / GPQA difficulty (Gemini won 12 of 14).
3. Long-document Q&A exceeding 800K tokens (Gemini won 10 of 12).
4. Korean, Japanese, Vietnamese translation (Gemini won 6 of 7).

The pattern is clear: Gemini 3.1 Pro has a tighter distribution of "uniquely capable" workloads than Opus 4.7, but those workloads are real, recurring, and economically meaningful for many buyers.

## 7. Strengths (Detailed)

**Mathematical and scientific reasoning.** GPQA Diamond is a benchmark of graduate-level science questions designed to be Google-proof — humans at the relevant subspecialty level score around 80%. Gemini 3.1 Pro at 94.3% materially leads the field; Opus 4.7 is at 87.4%, GPT-5.5 at 84.1%, DeepSeek V4 Pro at 80.2%. On AIME 2025, Gemini scored 94 vs. 87 for Opus 4.7 and 84 for GPT-5.5. For workloads in technical research, scientific literature analysis, or quantitative finance, this difference is decision-relevant.

**2M context window.** This is not just an incremental upgrade. We tested workloads — for example, "given the full SEC 10-K filings of 30 large-cap firms over the last 3 years, identify common risk-factor language patterns" — that simply do not fit in a 1M context model without chunking. Gemini 3.1 Pro produces high-quality output on these unchunked workloads. The "needle in 1.8M tokens" callback test scored 84% accuracy, which is the highest we measured for a model operating in that regime.
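
For teams evaluating this pattern, the mechanics amount to uploading the documents once and referencing them in a single unchunked prompt. A sketch assuming the google-genai Python SDK surface; the file paths are placeholders, and the preview model ID shown here is a guess that may change at GA:

```python
from google import genai  # pip install google-genai

client = genai.Client()  # reads the API key from the environment

# Upload the filings once, then reference them all in one request rather
# than chunking across many calls. Paths and model ID are placeholders.
files = [client.files.upload(file=p)
         for p in ["10k_firm_a.pdf", "10k_firm_b.pdf"]]

response = client.models.generate_content(
    model="gemini-3.1-pro-preview",
    contents=[*files, "Identify common risk-factor language patterns "
                      "across these filings and rank them by frequency."],
)
print(response.text)
```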

**Multimodal document understanding.** Google's training data advantage on documents shows. On our 40-document benchmark of historical legal filings (faded carbon copies, irregular layouts, mixed-language inserts), Gemini 3.1 Pro scored 91 — the highest of any model. Opus 4.7 scored 78; GPT-5.5 scored 80. For document-processing pipelines with non-clean inputs, this is the model to use.

**Multilingual capability on high-resource Asian languages.** On Korean colloquial-register translation, Gemini 3.1 Pro scored 92 vs. 81 for Opus 4.7 and 84 for GPT-5.5. Native Korean speakers in our blinded panel preferred Gemini's output 7 of 10 times. Similar patterns hold for Japanese formal/informal register and Vietnamese.

**Pricing-quality ratio in the standard tier.** At $3.50/$10.50 per million tokens, with aggregate quality comparable to Opus 4.7 ($5/$25) on the Artificial Analysis Intelligence Index (AAII), Gemini 3.1 Pro offers the best raw cost-quality ratio in the frontier tier, provided your prompt stays below 200K tokens.

## 8. Weaknesses & Failure Modes (Detailed)

**Preview status carries production risk.** As of May 2026, Gemini 3.1 Pro is "preview," meaning Google may change pricing, rate limits, or deprecate the model with 30-day notice. We have observed two SKU revisions since the March release. Production commitments to Gemini 3.1 Pro should account for this. Several enterprises we spoke with had multi-vendor router fallbacks specifically because of preview-status concerns.

**Agentic tool-use error recovery.** When tools return errors mid-trace, Gemini 3.1 Pro recovered in 78% of trials vs. 91% for Opus 4.7. The most common failure mode was the model concluding the user's request was infeasible and refusing to continue, when in fact the tool error was a recoverable transient. For long-running agents, this is a meaningful reliability gap.
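
The practical mitigation is to stop transient tool errors from reaching the model as hard failures in the first place. A generic, provider-agnostic sketch; the error taxonomy and the `tool` adapter are hypothetical and should be adapted to your agent framework:

```python
import time

TRANSIENT_MARKERS = ("timeout", "rate_limit", "unavailable", "connection")

def call_tool_with_recovery(tool, args, max_retries=3):
    """Wrap tool execution so transient failures are retried instead of
    reaching the model as hard errors."""
    for attempt in range(max_retries):
        try:
            return {"ok": True, "result": tool(**args)}
        except Exception as exc:
            if not any(m in str(exc).lower() for m in TRANSIENT_MARKERS):
                # Permanent failure: report it, but nudge the model to try
                # an alternative approach instead of declaring infeasibility.
                return {"ok": False, "error": str(exc),
                        "note": "Tool failed permanently; consider another "
                                "approach before concluding the task is "
                                "infeasible."}
            time.sleep(2 ** attempt)  # exponential backoff for transients
    return {"ok": False, "error": "retries exhausted",
            "note": "Transient errors persisted; retry the task later."}
```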

**Over-refusal on policy-adjacent prompts.** Despite Google's stated 2026 improvements, Gemini 3.1 Pro still over-refuses on prompts that mention sensitive topics in clearly benign framings. We measured 4.1% over-refusal on Anthropic's harmless-prompt benchmark vs. 0.4% for Opus 4.7 and 0.9% for GPT-5.5. For consumer-facing products, this produces user friction.

**Pricing tier discontinuity at 200K context.** The 2x pricing jump above 200K context is a procurement trap. Workloads that opportunistically grow past 200K (RAG systems with permissive context inclusion, agentic loops that accumulate context) can suddenly double in cost. This is fixable with prompt budgeting but requires explicit attention.

**Long-context refactor coding underperforms its context advantage.** Despite the 2M context, Gemini 3.1 Pro scored 82 on our 600K-token monorepo refactor vs. 92 for Opus 4.7. Having more context capacity does not automatically yield better long-context reasoning quality.

## 9. When To Use This Model

- Mathematical, scientific, or technical research workloads
- Multimodal document processing (PDFs, scans, mixed-media inputs)
- Workloads exceeding 1M context tokens that genuinely need 2M
- Korean, Japanese, Vietnamese, or other high-resource Asian language tasks
- Cost-sensitive frontier-tier workloads that fit under 200K context
- Quantitative finance reasoning (DCF, sensitivity analysis)
- Document extraction at scale

## 10. When NOT To Use This Model

- Production agentic loops with strict reliability SLAs
- Long-context coding refactors (use Opus 4.7)
- Consumer products sensitive to refusal-related friction
- Workloads requiring guaranteed long-term pricing
- Routine high-volume generation (use DeepSeek V4 Pro or Gemma 4)
- Workloads in the 200K-1M context band where the pricing tier is unfavorable

## 11. Procurement Notes

- **MSA / DPA**: Available via Google Cloud MSA and Vertex AI DPA.
- **BAA**: Available on Vertex AI for HIPAA workloads.
- **Data residency**: Multi-region on Vertex; EU-resident option available.
- **Lock-in score (1-10)**: 7. The Vertex AI surface diverges from OpenAI's; the Gemini API has its own request format. Migration off Gemini 3.1 Pro requires meaningful engineering work, particularly if you've leaned into the 2M context.
- **Compliance**: SOC 2 Type II, ISO 27001, HIPAA-eligible.
- **Preview SLA**: None committed; pricing and rate limits subject to 30-day notice changes.
- **Rate limits**: Higher tiers available on Vertex; defaults are 4M input / 200K output per minute.

## 12. Bottom Line

For startups doing technical, scientific, or document-heavy work, Gemini 3.1 Pro is a strong primary choice on cost-quality grounds — but build in a fallback model from another vendor to mitigate preview-status risk.

For mid-market companies, Gemini 3.1 Pro fits cleanly in a multi-model router as the "default below 200K" model and the "specialty tool" for math, multimodal, and multilingual workloads. Reserve Opus 4.7 for agentic coding peaks. Reserve a cheap model for routine traffic.

For enterprises, the preview status is a real procurement obstacle. Either wait for Q3 2026 GA or commit only to workloads where the alternative cost (using Opus 4.7 or GPT-5.5 instead) is small enough that a forced migration would be tolerable. The compliance posture on Vertex AI is strong; the technical capability is genuinely best-in-class on its strongest workloads. The risk is contractual.

## Appendix A: Test Prompts Used

1. *"Solve GPQA Diamond Problem 47 with full reasoning."* — Graduate-level scientific reasoning.
2. *"Given the attached 30 SEC 10-K filings, identify common risk-factor language patterns and rank by frequency."* — Long-context multi-document analysis.
3. *"Extract every clause, party, and effective date from this 80-page faded scan."* — Multimodal document extraction.
4. *"Translate this 4,000-word Korean essay to English, preserving the colloquial register."* — Multilingual capability.
5. *"You have access to: file_read, file_write, web_search, calculator. Some tool calls will return errors. Complete the user's task."* — Agentic error recovery.
6. *"Build a DCF model for the following company with ±10% WACC and ±5% growth sensitivity."* — Quantitative finance reasoning.
7. *"Solve AIME 2025 Problem 12 with full reasoning."* — Mathematical reasoning.
8. *"Summarize the key findings across this 1.8M-token corpus of medical literature."* — 2M context utilization.

## Appendix B: Methodology Reference

Full methodology at https://www.swfte.com/research/methodology, including blinded rater protocols, statistical-significance thresholds, and the prompt corpus provenance. Raw transcripts available on request.

## Appendix C: Operational Notes from Production Deployments

**Context tier boundary.** The 200K context-pricing tier cutoff is a sharp cliff, not a smooth transition. A prompt at 199,999 tokens costs $3.50/M; a prompt at 200,001 tokens costs $7.00/M on the entire request, not just the marginal tokens. Teams running RAG pipelines with permissive context inclusion frequently cross this boundary by accident. We recommend explicit token budgets in retrieval logic with hard caps below 200K.
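
A minimal sketch of that hard cap, with a crude whitespace counter standing in for a real tokenizer:

```python
TIER_BOUNDARY = 200_000
SAFETY_MARGIN = 10_000   # headroom for system prompt and user turn

def fit_to_budget(chunks, count_tokens, budget=TIER_BOUNDARY - SAFETY_MARGIN):
    """Greedily include retrieved chunks (assumed pre-sorted by relevance)
    until the next one would cross the budget."""
    kept, used = [], 0
    for chunk in chunks:
        n = count_tokens(chunk)
        if used + n > budget:
            break
        kept.append(chunk)
        used += n
    return kept, used

# Crude whitespace counter standing in for a real tokenizer.
chunks = ["lorem ipsum " * 1_000] * 120
kept, used = fit_to_budget(chunks, lambda s: len(s.split()))
print(len(kept), used)   # 95 190000
```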

**Vertex AI vs. Google AI Studio.** The two surfaces are not identical. Vertex AI is the production-ready endpoint with regional residency, IAM, and SLA terms (where applicable for GA models). Google AI Studio is the developer-centric surface with looser terms but easier onboarding. Production deployments should use Vertex AI. Several teams we have audited were running production traffic against AI Studio endpoints and discovered pricing or rate-limit surprises.

**Multimodal input encoding.** Image and video inputs are billed by token-equivalent counts that depend on resolution and duration. A single high-resolution image can consume 1-3K tokens; a one-minute video can consume 8-15K tokens. Teams unfamiliar with this billing model frequently underestimate their multimodal traffic costs by 3-5x.
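
A rough pre-flight estimator built from the ranges above (this report's empirical observations, not published rates) catches the underestimate before it reaches the invoice:

```python
# Empirical ranges from this report, not published rates.
IMAGE_TOKENS = (1_000, 3_000)            # per high-resolution image
VIDEO_TOKENS_PER_MIN = (8_000, 15_000)   # per minute of video
INPUT_PRICE = 3.50 / 1e6                 # <=200K tier, USD per token

def multimodal_input_estimate(n_images=0, video_minutes=0.0):
    """Return a (low, high) USD estimate for multimodal input billing."""
    lo = n_images * IMAGE_TOKENS[0] + video_minutes * VIDEO_TOKENS_PER_MIN[0]
    hi = n_images * IMAGE_TOKENS[1] + video_minutes * VIDEO_TOKENS_PER_MIN[1]
    return lo * INPUT_PRICE, hi * INPUT_PRICE

lo, hi = multimodal_input_estimate(n_images=20, video_minutes=5)
print(f"${lo:.2f} - ${hi:.2f} per request")   # $0.21 - $0.47 per request
```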

**Preview-status migration risk.** Google has indicated that Gemini 3.1 Pro Preview will be superseded by a GA SKU later in 2026. Historically, Google's preview-to-GA transitions have included pricing changes and minor capability changes. Teams committing to preview should plan for a migration window. Maintaining a router with a fallback model (DeepSeek V4 Pro is a common choice) mitigates the bulk of this risk.
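
A minimal sketch of such a router; `call_provider` is a hypothetical adapter you would implement once per vendor SDK:

```python
PROVIDERS = [
    ("gemini-3.1-pro-preview", "primary"),
    ("deepseek-v4-pro", "fallback"),
]

def complete_with_fallback(prompt, call_provider):
    """Try the primary model; on any failure (deprecation, rate limit,
    outage) fall through to the next provider. `call_provider(model,
    prompt)` is a hypothetical per-vendor adapter."""
    errors = []
    for model, role in PROVIDERS:
        try:
            return call_provider(model, prompt)
        except Exception as exc:
            errors.append(f"{role} ({model}): {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```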

**Refusal patterns.** Gemini 3.1 Pro's refusal behavior is more aggressive than its peers on prompts that touch competitive intelligence, public-figure-related topics, or content that mentions named third-party brands in negative framings. We have measured 4-7% over-refusal on prompts that other models answer cleanly. Mitigations: rephrasing user prompts before forwarding, using a different model for these categories, or accepting the refusal in user-facing UX.

**System instruction scoping.** Gemini's system_instruction field has subtly different semantics than OpenAI's system role or Anthropic's system parameter. Instructions sometimes leak into model output ("As an AI assistant directed to...") in ways that don't occur on other surfaces. Test prompts for instruction leakage before shipping production system prompts.
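
A cheap regression check for this failure mode; the marker patterns are illustrative and should be extended with phrases drawn from your own system prompt:

```python
import re

# Illustrative leakage markers; extend with phrases from your own
# system prompt before relying on this in CI.
LEAK_PATTERNS = [
    re.compile(r"as an ai (assistant|model) (directed|instructed) to", re.I),
    re.compile(r"my system (prompt|instruction)", re.I),
]

def leaks_instructions(output: str) -> bool:
    return any(p.search(output) for p in LEAK_PATTERNS)

assert leaks_instructions("As an AI assistant directed to summarize, ...")
assert not leaks_instructions("Here is the summary you asked for.")
```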

## Sources & References

- Google DeepMind, "Gemini 3 Technical Report", March 18, 2026
- Google AI Pricing Page, accessed May 12, 2026 — https://ai.google.dev/pricing
- LMSys Chatbot Arena Leaderboard, May 14, 2026 snapshot — https://lmarena.ai
- GPQA Diamond Leaderboard, May 11, 2026 — https://gpqa.github.io
- Artificial Analysis, "Gemini 3.1 Pro Independent Evaluation", May 2, 2026 — https://artificialanalysis.ai
- Google Cloud Vertex AI Gemini 3.1 Pro Preview Notes, March 18, 2026
- HuggingFace SMQTS-Public Leaderboard, May 11, 2026
- Stanford HELM 2026 Q1 Report — https://crfm.stanford.edu/helm
- ArXiv 2603.18420, "Extended-Locality Attention for 2M Context", March 2026
- AIME 2025 Solutions and Model Performance, May 2026
- FrontierMath Public Leaderboard, May 8, 2026
- Vellum AI Frontier Model Comparison, May 9, 2026

---

*Independent research by Swfte AI. We route across multiple AI providers via Swfte Connect, including the model in this report. Full conflict-of-interest disclosure at /research/methodology. Raw test transcripts available on request.*
