A Synthetic Conversational Smishing Dataset for Social Engineering Detection

arXiv

Generation of synthetic, multi-turn conversational SMS phishing (smishing) datasets to improve detection of complex social engineering attacks beyond single-message analysis.


Defensibility

2.0/10

citations

co_authors

Platform Domination: high

Market Consolidation: medium

Displacement Horizon: 6 months

REASONING

The project addresses a legitimate gap in cybersecurity: the shift from single-message spam to sophisticated, multi-turn social engineering. Its defensibility is nonetheless very low (score: 2) because the dataset is synthetic. Any researcher or security firm can prompt a frontier model (GPT-4o, Claude 3.5) to generate comparable conversational datasets. At 4 days old and with 0 stars, the project has neither the 'data gravity' nor the community validation that a higher score would require.

The primary threat comes from platform owners such as Google (Android) and Apple (iMessage), who have direct access to anonymized real-world data and are integrating on-device ML models for exactly this purpose. Platform domination risk is rated high because these companies can implement detection at the OS level, which makes third-party datasets for training niche detectors less relevant to consumer protection. The displacement horizon is short because synthetic data generation techniques are evolving rapidly: a competitor could produce a more comprehensive or better-validated dataset in a matter of weeks.

COMPOSABILITY

TECH STACK

Python, LLMs (for synthetic generation), JSON, NLP/Text Classification Frameworks
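
To make the stack above concrete, here is a minimal sketch of how one multi-turn conversation might be stored as JSON and flattened for a text classifier. The field names (`conversation_id`, `label`, `turns`, `sender`, `text`), the label values, and the sample messages are all hypothetical; the actual dataset schema is not described here and may differ.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class Turn:
    sender: str  # hypothetical values: "attacker" or "target"
    text: str


@dataclass
class Conversation:
    conversation_id: str  # hypothetical identifier format
    label: str            # hypothetical values: "smishing" or "benign"
    turns: List[Turn]


def to_json(conv: Conversation) -> str:
    """Serialize one conversation as a single JSON line."""
    return json.dumps(asdict(conv), ensure_ascii=False)


def from_json(line: str) -> Conversation:
    """Parse a JSON line back into a Conversation."""
    raw = json.loads(line)
    turns = [Turn(**t) for t in raw["turns"]]
    return Conversation(raw["conversation_id"], raw["label"], turns)


def flatten(conv: Conversation) -> str:
    """Join all turns into one string so a classifier sees the whole exchange."""
    return "\n".join(f"[{t.sender}] {t.text}" for t in conv.turns)


# Invented sample conversation, for illustration only.
example = Conversation(
    conversation_id="demo-001",
    label="smishing",
    turns=[
        Turn("attacker", "Hi, this is the delivery desk. Is this still your number?"),
        Turn("target", "Yes, who is this?"),
        Turn("attacker", "Your parcel is on hold. Please confirm your card details."),
    ],
)

restored = from_json(to_json(example))
print(flatten(restored))
```

Flattening with sender tags is only one possible design choice: it lets a single-text classifier see the early, innocuous-looking turns together with the later request, which is the context that single-message analysis misses.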

INTEGRATION

reference_implementation

social_engineering_detection, synthetic_data_generation, conversational_ai_safety, smishing_defense

READINESS

Composability: algorithm

Depth: reference_implementation

Novelty: novel_combination