Multimodal Speech Emotion Recognition (SER) using a fine-tuned Llama-3 model that integrates text (via Whisper ASR) and acoustic features (via OpenSMILE).
Defensibility
Stars: 1
The project is a typical Final Year Project (FYP), as the repository name indicates. It follows a standard academic recipe for multimodal SER: take an LLM, feed it transcripts from an ASR model (Whisper), and augment the prompt with engineered acoustic features (OpenSMILE). With only 1 star and no forks, it has no community traction or ecosystem, and defensibility is minimal: the approach relies on standard libraries and a publicly available dataset (IEMOCAP).

From a competitive standpoint, this 'cascaded' approach (ASR -> features -> LLM) is rapidly being superseded by natively multimodal models such as GPT-4o and Gemini 1.5, which consume raw audio tokens and capture emotional nuance more effectively than a text-only LLM prompted with external feature vectors. There is no technical moat: any developer with access to Hugging Face could replicate this pipeline in a weekend. The reported 72.4% accuracy on IEMOCAP is respectable for a student project but is not a breakthrough that would challenge commercial or frontier-lab capabilities.
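To make the "weekend-replicable" claim concrete, here is a minimal sketch of the core step in such a cascaded pipeline: formatting an ASR transcript plus a handful of OpenSMILE-style acoustic statistics into a single text prompt for the LLM. The function name, feature names, and prompt template are illustrative assumptions, not the project's actual format.

```python
# Hypothetical sketch of cascaded SER prompt construction: an ASR
# transcript and a few OpenSMILE-style (eGeMAPS-like) acoustic
# statistics are serialized into one prompt for a text-only LLM.
# All names and the template below are illustrative assumptions.

def build_ser_prompt(transcript: str, features: dict) -> str:
    """Combine ASR text and acoustic feature stats into an LLM prompt."""
    # Sort features for a deterministic prompt layout.
    feature_lines = "\n".join(
        f"- {name}: {value:.2f}" for name, value in sorted(features.items())
    )
    return (
        "Classify the speaker's emotion (angry, happy, neutral, sad).\n"
        f'Transcript: "{transcript}"\n'
        "Acoustic features:\n"
        f"{feature_lines}\n"
        "Emotion:"
    )

prompt = build_ser_prompt(
    "I can't believe you did that!",
    {"F0_mean_Hz": 212.4, "loudness_mean": 0.61, "jitter_local": 0.03},
)
print(prompt)
```

In a real pipeline the transcript would come from Whisper and the feature dict from an OpenSMILE functionals extractor; the prompt would then go to the fine-tuned Llama-3 model, which is exactly why the approach offers no moat.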
TECH STACK
INTEGRATION: reference_implementation
READINESS