OpenAI Introduces IndQA: A Culture-Aware Benchmark for Indian Languages

Why IndQA?

OpenAI developed IndQA to address gaps in evaluating AI models’ understanding of Indian languages and cultural contexts. Traditional benchmarks often rely on translation or multiple-choice formats, which fail to capture nuanced cultural knowledge. With 80% of the global population not using English as their primary language, IndQA focuses on real-world relevance for Indian languages, where 1 billion people speak non-English languages and 22 official languages (e.g., Hindi, Bengali, Tamil) are used across diverse cultural domains.

Dataset, Languages, and Domains

IndQA spans 2,278 questions across 12 Indian languages (Bengali, Hindi, Tamil, etc.) and 10 cultural domains, including:

Architecture and Design
Food and Cuisine
History
Religion and Spirituality
Law and Ethics

Key features:

Hinglish (code-switching between Hindi and English) is included to reflect real-world communication.
Questions are authored by 261 domain experts from India, ensuring cultural and linguistic accuracy.
Each question includes:
- A native-language prompt
- English translation for auditability
- Grading rubric with weighted criteria
- Ideal answer reflecting expert expectations

Rubric-Based Evaluation Pipeline

IndQA uses rubric-based grading instead of exact-match accuracy to assess responses. For each question:

Domain experts define criteria (e.g., relevance, cultural accuracy, depth of reasoning) with assigned weights.
A model-based grader evaluates responses against these criteria, assigning partial credit for partial correctness.
Final scores are calculated as the sum of satisfied criteria weights divided by total possible score, enabling nuanced evaluation beyond surface-level matches.

Construction Process and Adversarial Filtering

IndQA’s development involved a four-step pipeline:

Expert Collaboration: Partnered with Indian organizations to recruit 261 experts in 10 domains (e.g., law, history), who authored culturally grounded, reasoning-heavy questions.
Adversarial Filtering: Draft questions were tested against OpenAI’s strongest models (GPT-4o, GPT-4.5, GPT-5) at creation. Only questions where majority of models failed to produce acceptable answers were retained, ensuring future progress is measurable.
Rubric Development: Experts created detailed grading rubrics, reused for all model evaluations.
Peer Review: Ideal answers and translations underwent iterative revisions and peer review for quality assurance.

Measuring Progress on Indian Languages

OpenAI uses IndQA to track improvements in AI models over time. Results show:

Significant performance gains in recent frontier models (e.g., GPT-5 Thinking High) on IndQA.
Stratified analysis by language and domain reveals disparities, with room for improvement in low-resource languages and niche domains.

Key Takeaways

Cultural Relevance: Focuses on real-world Indian contexts (e.g., regional history, legal systems) rather than translation tasks.
Expert-Curated Data: Questions are authored by native speakers and domain experts, ensuring depth and accuracy.
Nuanced Evaluation: Rubric-based grading supports partial credit and cultural correctness, avoiding over-reliance on exact matches.
Future-Proof Design: Adversarial filtering ensures benchmarks remain challenging for emerging models.

Editorial Comments

IndQA fills a critical gap by prioritizing culturally specific evaluation for Indian languages, which are often underrepresented in global benchmarks. Its adversarial filtering and rubric-based approach provide a practical framework for assessing AI progress in multilingual, culturally rich environments.

For more details, visit the IndQA announcement.

On This Page

OpenAI Introduces IndQA: A Culture-Aware Benchmark for Indian Languages