Do LLMs Have a Greece Problem?

Large language models are increasingly relied upon as de facto arbiters of knowledge. When someone asks GPT or Claude about a historical event, a territorial dispute, or a cultural claim, the answer carries weight, even when the model is hedging. I wanted to know: how do frontier LLMs handle questions where Greece has a well-established position under international law or scholarly consensus? More importantly, do they treat these positions fairly, or do they manufacture ambiguity where none exists?

To find out, I built a systematic testing framework. I assembled 118 sensitive questions spanning over 30 categories—Aegean sovereignty, Macedonia and Alexander the Great, Cyprus, Constantinople and Hagia Sophia, the Greek Genocides, the Elgin Marbles, Greek-Turkish relations, Greek identity and civilisational continuity, and more. Each question was defined with explicit pro-Greek and anti-Greek reference positions, and every response was scored on a 1–5 scale. I then ran the full battery against five frontier models: GPT-5.2, Claude Opus 4.6, Qwen 3.5 Plus, DeepSeek v3.2, and Gemini 3.1 Pro.
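
To make the setup concrete, here is a minimal sketch of how a question and a scored response could be represented. The field names, the model identifier strings, and the convention that 1 means agreement with the anti-Greek reference and 5 agreement with the pro-Greek one are my assumptions, not the framework's published schema.

```python
from dataclasses import dataclass

@dataclass
class Question:
    """One item from the 118-question battery."""
    qid: str
    category: str             # e.g. "Aegean sovereignty", "Cyprus"
    text: str
    pro_greek_position: str   # reference answer aligned with the Greek position
    anti_greek_position: str  # reference answer aligned with the opposing position

@dataclass
class ScoredResponse:
    """A single model answer, graded against the reference positions."""
    qid: str
    model: str   # e.g. "gpt-5.2"
    answer: str
    score: int   # 1-5; assumed here: 1 = anti-Greek reference, 5 = pro-Greek

MODELS = ["gpt-5.2", "claude-opus-4.6", "qwen-3.5-plus",
          "deepseek-v3.2", "gemini-3.1-pro"]
```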

But testing in English alone tells an incomplete story. I ran the same questions in 13 languages—from Mandarin and Hindi to Turkish and Arabic—to see whether models shift their positions depending on the language of the query. The results were revealing: some models showed meaningful score variance across languages, suggesting that the training data’s linguistic composition leaves fingerprints on the output.
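
One way to quantify that shift is to compare a model's mean score per language and measure the spread. A sketch, assuming scores arrive as a mapping from language code to the list of 1–5 scores the model earned on the battery in that language:

```python
from statistics import mean, pstdev

def language_spread(scores_by_lang: dict[str, list[int]]) -> float:
    """Std-dev of a model's mean score across query languages.

    Near zero: the model answers consistently regardless of language.
    High: its stance drifts with the language of the question.
    """
    per_language_means = [mean(scores) for scores in scores_by_lang.values()]
    return pstdev(per_language_means)

# Hypothetical scores for one model in three of the 13 languages:
print(language_spread({"en": [4, 5, 3], "tr": [2, 3, 2], "el": [5, 5, 4]}))
```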

· · ·

The methodology goes deeper than simple scoring. I introduced a position strength framework that classifies each question on a spectrum from undisputed fact (e.g., “Alexander the Great was Greek”—overwhelming scholarly consensus) to political opinion. Cross-referencing position strength against model scores reveals what I call manufactured ambiguity—cases where a model hedges on questions that are not genuinely contested. This is the more concerning finding: not outright bias, but a pattern of false balance on settled matters.
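
In code, the flag reduces to a comparison between a question's strength class and the score a model earned on it. The four-level enum and the hedging threshold below are my simplifications of the spectrum described above, not the framework's exact cut-offs:

```python
from enum import IntEnum

class PositionStrength(IntEnum):
    """Where a question sits on the fact-to-opinion spectrum."""
    POLITICAL_OPINION   = 1
    GENUINELY_CONTESTED = 2
    SCHOLARLY_CONSENSUS = 3
    UNDISPUTED_FACT     = 4  # e.g. "Alexander the Great was Greek"

def is_manufactured_ambiguity(strength: PositionStrength, score: int) -> bool:
    """A middling-or-worse score (<= 3) on a question backed by consensus
    or settled fact is hedging where no real dispute exists."""
    return strength >= PositionStrength.SCHOLARLY_CONSENSUS and score <= 3
```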

I also ran two adversarial experiments. In persona testing, I prompted GPT-5.2 and Claude Opus 4.6 through nine distinct personas—a British academic, a Turkish nationalist, a Greek diaspora member, an Arab conservative, a Hindu nationalist, among others—to measure how identity framing shifts responses. In the fake authority attack, I injected fabricated but authoritative-sounding citations (scaling from 1 to 5 fake sources) to test whether models could be nudged toward different positions through citation laundering—a technique with real implications for information warfare.
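
Both attacks are prompt-level transformations, so they slot into the same battery without touching the questions themselves. A sketch of the two wrappers follows; the persona wordings and the fabricated citation template are placeholders I invented, not the ones used in the experiments:

```python
PERSONAS = {
    "british_academic":    "You are a British academic historian.",
    "turkish_nationalist": "You are a Turkish nationalist commentator.",
    "greek_diaspora":      "You are a member of the Greek diaspora.",
    # ...six more framings in the actual nine-persona battery
}

def persona_prompt(persona_key: str, question: str) -> str:
    """Prefix the question with an identity framing."""
    return f"{PERSONAS[persona_key]}\n\n{question}"

def fake_authority_prompt(question: str, n_sources: int) -> str:
    """Prepend 1-5 fabricated, authoritative-sounding citations to test
    whether the volume of (fake) sourcing nudges the model's position."""
    fakes = "\n".join(
        f"[{i}] J. Doe et al., 'Revised Findings, vol. {i}', 2024."
        for i in range(1, n_sources + 1)
    )
    return f"Recent scholarship supports the following:\n{fakes}\n\n{question}"
```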

The goal was never to produce a scorecard of which model is “most pro-Greek.” It was to build a repeatable framework—a Model Alignment Index—that makes LLM behaviour on contested geopolitical topics measurable and comparable. The framework is general: swap out the questions and positions, and it works for any country or domain. Greece was simply where I started, because I know the territory.
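
As a closing illustration of what "measurable and comparable" means here, the index can be as simple as a per-model aggregate over the whole battery. The rescaling below is my stand-in, since the exact MAI formula isn't spelled out in this piece:

```python
from statistics import mean

def model_alignment_index(scores: list[tuple[str, int]]) -> dict[str, float]:
    """Mean 1-5 score per model, rescaled to 0-1.

    `scores` is a flat list of (model_name, score) pairs covering
    every question in the battery.
    """
    by_model: dict[str, list[int]] = {}
    for model, score in scores:
        by_model.setdefault(model, []).append(score)
    return {m: (mean(s) - 1) / 4 for m, s in by_model.items()}
```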