Mapping Today's AI Landscape · 2026

Where AI Excels Across Research, Design, Code, and Multimodal Work

An exploration of where current AI systems are strongest, based on real use across product design workflows. This work focuses on the areas where AI feels genuinely useful—not just novel—by examining how different tools support strategy, concept generation, execution, and broader creative exploration.

Research leader
GPT-5.4 Pro
58.7% HLE with tools · 89.3% BrowseComp
Most versatile
Gemini 3.1 Pro
1M context · 80.6% SWE Verified
Best editorial collaborator
Claude 4.6
80.8% SWE Verified · 1M beta context
Visual frontier
Google leads Arena
Rank 1 in image and video preference
Evaluation Lens

I evaluate AI tools through two layers: public performance data and workflow reliability.

Benchmarks tell me who leads a lane: reasoning, web search, coding, long context, image, or video.

Workflow confidence tells me what I would actually trust in a real product, design, or research environment.

My strongest conclusion: there is no single winner. The best stack is compositional.

Public benchmark cut used here: official model cards, system cards, and Arena preference leaderboards available as of March 10, 2026. Trust, stability, and workflow recommendations are my editorial synthesis.

Products I would confidently speak to in an interview

OpenAI
Selected signal
GPT-5.4 Pro

If I need a reliable first draft for serious knowledge work, this is my default starting point.

Research · Strategy · Synthesis
Quick read
HLE with tools: 58.7%
BrowseComp: 89.3%
Primary lane: Research
Portfolio editorial signal
Stability: 97
Trust: 96
Speed: 93
Why it stays in my stack

Best current fit for deep research and structured synthesis.

Strongest public web research score in this set.

Ideal when I need clear reasoning, narrative framing, and decision-ready output.

Watch-out: not my first choice for ultra-long multimodal context or for the most coding-specialized terminal workflows.
Benchmark Lab

Where each model wins

Instead of pretending there is one universal winner, I compare models lane by lane. That is the more useful mindset for real workflows.

Selected lens: Broad reasoning with tools
Available lenses: Reasoning · Engineering · Creative · Interpretation

High signal snapshot for tool augmented reasoning across difficult academic tasks.

My read: the right tool depends on the lane. Benchmark leadership is already splitting into research, reasoning, engineering, and creative generation.
Current ranking (HLE with tools)
1. GPT-5.4 Pro (OpenAI) · 58.7%
2. Claude Opus 4.6 (Anthropic) · 53.1%
3. Kimi K2.5 (Moonshot) · 51.8%
4. Gemini 3.1 Pro (Google) · 51.4%
5. GLM-5 (Z.ai) · 50.4%
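One way to make the lane-by-lane habit concrete is to keep each lane's scores in a small table and rank within the lane, never globally. A minimal Python sketch using the HLE-with-tools numbers cited above (the `ranked` helper is my own illustration, not a real library API):

```python
# HLE-with-tools scores as cited on this page (March 2026 benchmark cut).
HLE_WITH_TOOLS = {
    "GPT-5.4 Pro": 58.7,
    "Claude Opus 4.6": 53.1,
    "Kimi K2.5": 51.8,
    "Gemini 3.1 Pro": 51.4,
    "GLM-5": 50.4,
}

def ranked(lane_scores: dict[str, float]) -> list[tuple[str, float]]:
    """Sort one lane's scores best-first; leadership is per lane, not universal."""
    return sorted(lane_scores.items(), key=lambda kv: kv[1], reverse=True)

leader, score = ranked(HLE_WITH_TOOLS)[0]  # → ("GPT-5.4 Pro", 58.7)
```

Swapping in a different lane's table (say, SWE Verified) would surface a different leader, which is exactly the point of comparing lane by lane.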
Workflow Lens

How I would actually use this stack

Most trusted for serious research
GPT-5.4 Pro → Claude 4.6 check → final human judgment

When accuracy, structure, and synthesis matter most, I prefer OpenAI first, then use Claude as an editorial pressure test.

Best for multimodal and huge context work
Gemini 3.1 Pro → targeted verification → distilled brief

This is my strongest option when a workflow spans large docs, screenshots, media, repo context, and cross format reasoning.

Best visual ideation stack
GPT-image-1.5 or Gemini Image → Figma or Adobe refinement

I use AI image systems to expand creative territory quickly, then push the final output through deliberate human design craft.

Best open and cost-sensitive lane
GLM-5, Kimi K2.5, and Qwen3.5

This lane is strategically important for self-hosting, API economics, and agentic experimentation beyond closed-model defaults.
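The compositional routing above reduces to a simple lookup: pick a lane, get the ordered stack. A hedged Python sketch (the lane names and stacks come from this page; the dictionary shape and the `pipeline_for` helper are my own illustration):

```python
# Lane → ordered stack, mirroring the workflow lens above (illustrative only).
PIPELINES: dict[str, list[str]] = {
    "research":   ["GPT-5.4 Pro", "Claude 4.6 editorial check", "final human judgment"],
    "multimodal": ["Gemini 3.1 Pro", "targeted verification", "distilled brief"],
    "visual":     ["GPT-image-1.5 or Gemini Image", "Figma or Adobe refinement"],
    "open":       ["GLM-5", "Kimi K2.5", "Qwen3.5"],
}

def pipeline_for(lane: str) -> list[str]:
    """Return the ordered stack for a lane; unknown lanes fall back to research."""
    return PIPELINES.get(lane, PIPELINES["research"])
```

Falling back to the research stack for unrecognized lanes mirrors the thesis: there is no single winner, but there is a most-trusted default starting point.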

Open and China Lane

Why this ecosystem matters

For interviews, this section signals that I am not only tracking the closed model leaders. I am also paying attention to deployment flexibility, pricing pressure, and the rise of native multimodal agents.

GLM-5 (Z.ai)
50.4% HLE with tools · 77.8% SWE Verified

Strong open contender for reasoning, coding, and agentic work.

Kimi K2.5 (Moonshot)
51.8% HLE with tools · 256K context and tools

A compelling long-context and tool-calling model to watch closely.

Qwen3.5 (Qwen)
Native multimodal agent · Open deployment lane

Important because it pushes toward native multimodal agents, not only chat UX.

Closing Read

My current thesis

AI-stack thinking over single-tool thinking
For trust

I trust GPT-5.4 Pro most for serious research and synthesis, Claude 4.6 most for polished articulation, and Gemini 3.1 Pro most for multimodal and long context work.

For efficiency

The fastest workflow is usually compositional: one model expands the territory, another checks the structure, and human design judgment shapes the final result.

For differentiation

Showing this landscape in a portfolio demonstrates not just AI usage, but product judgment: knowing which tools to trust, why, and under what constraints.