Establishes a resource-tier taxonomy for programming languages (PLs) based on their prevalence in training data, providing a framework for analyzing LLM code generation capabilities across different languages.
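The tiering idea can be sketched in a few lines: assign each language a tier based on its share of a training corpus. The corpus shares and thresholds below are purely illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch of a resource-tier taxonomy for programming languages.
# Tiers are assigned by each language's prevalence in a training corpus;
# all shares and thresholds here are invented for illustration.

CORPUS_SHARE = {          # fraction of code tokens in a hypothetical corpus
    "Python": 0.18,
    "JavaScript": 0.16,
    "Java": 0.12,
    "Rust": 0.02,
    "COBOL": 0.0005,
    "Racket": 0.0002,
}

def tier(share: float) -> int:
    """Map corpus prevalence to a resource tier (higher = better-resourced)."""
    if share >= 0.10:
        return 3   # high-resource
    if share >= 0.01:
        return 2   # mid-resource
    return 1       # low-resource

taxonomy = {lang: tier(s) for lang, s in CORPUS_SHARE.items()}
print(taxonomy["Python"], taxonomy["Rust"], taxonomy["COBOL"])  # 3 2 1
```

In practice, a real taxonomy of this kind would draw its prevalence figures from published corpus statistics (e.g., per-language token counts in an open code dataset) rather than hand-picked values.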
Defensibility
citations: 0
co_authors: 3
This project is a research paper (arXiv:2604.00239) rather than a software tool. It applies the established NLP resource-tiering logic of Joshi et al. (2020) to the domain of programming languages. While academically valuable for benchmarking LLM performance on low-resource versus high-resource languages, it has no technical moat. Defensibility is rated 2 because a taxonomy is a conceptual framework that is trivially reproducible once published. Frontier labs (OpenAI, Anthropic, Google) and platform holders (GitHub/Microsoft) already possess the internal telemetry and training-data statistics this taxonomy seeks to categorize; they effectively define the tiers through their data-collection processes (e.g., The Stack by BigCode). The project's low quantitative signals (0 citations, 3 co-authors) reflect its status as a newly released academic artifact. It is likely to be absorbed into broader research surveys or superseded within months by data-driven reports from GitHub (e.g., Octoverse) or Hugging Face.
TECH STACK
INTEGRATION: theoretical_framework
READINESS