Detects benchmark contamination in LLMs via paired confidence significance testing, determining whether a model has previously seen specific test data.
Defensibility
citations
0
co_authors
3
PaCoST is a research-oriented implementation focused on the critical issue of benchmark leakage. Despite the importance of the problem, the project has zero stars and minimal forks, indicating it has failed to gain traction outside of its immediate academic context. From a competitive standpoint, contamination detection is a 'feature, not a product' and is being aggressively addressed by platform giants. Hugging Face is integrating decontamination checks into their Leaderboard 2.0, and frontier labs like OpenAI and Anthropic utilize proprietary, more robust internal methods for data sanitation. The methodology, while statistically sound, is easily reproducible by any ML engineer familiar with hypothesis testing and model perplexity. Its value as a standalone project is low because the most effective contamination detection requires access to the training corpora or large-scale comparative datasets that a 0-star GitHub repo lacks. It is likely to be displaced or ignored as standard evaluation suites (like LM Eval Harness) incorporate their own decontamination modules.
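The review above notes that the methodology is reproducible by anyone familiar with hypothesis testing and model confidence scores. A minimal sketch of that idea: compare a model's confidence on original benchmark items against meaning-preserving rephrasings, and run a paired significance test on the per-item differences. The function name, the synthetic confidence scores, and the use of a normal approximation in place of a full t-distribution are all illustrative assumptions, not the project's actual implementation.

```python
from statistics import NormalDist, mean, stdev


def paired_confidence_test(conf_original, conf_rephrased, alpha=0.05):
    """Paired significance test on per-item confidence differences.

    If a model is systematically more confident on the exact benchmark
    wording than on semantically equivalent rephrasings, that asymmetry
    is evidence the original items appeared in its training data.
    Uses a normal approximation to the t distribution (assumption:
    reasonable for moderate sample sizes; a real analysis would use
    the t distribution with n - 1 degrees of freedom).
    """
    diffs = [o - r for o, r in zip(conf_original, conf_rephrased)]
    n = len(diffs)
    d_mean = mean(diffs)
    d_sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    t_stat = d_mean / (d_sd / n ** 0.5)
    # One-sided test: contamination predicts higher confidence on originals.
    p_value = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, p_value, p_value < alpha


# Hypothetical per-item confidence scores for illustration only.
contaminated_orig = [0.92, 0.95, 0.90, 0.93, 0.91, 0.94, 0.96, 0.89]
contaminated_reph = [0.70, 0.72, 0.68, 0.75, 0.71, 0.69, 0.74, 0.70]
clean_orig = [0.80, 0.78, 0.82, 0.79, 0.81, 0.80, 0.77, 0.83]
clean_reph = [0.81, 0.77, 0.82, 0.80, 0.79, 0.81, 0.78, 0.82]

t1, p1, flagged1 = paired_confidence_test(contaminated_orig, contaminated_reph)
t2, p2, flagged2 = paired_confidence_test(clean_orig, clean_reph)
```

In the contaminated case the confidence gap is large and consistent, so the test flags it; in the clean case the differences hover around zero and it does not. This also illustrates the review's point about limited defensibility: the entire statistical core fits in a few lines of standard-library code.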
TECH STACK
INTEGRATION
reference_implementation
READINESS