B. Preliminary Experiment Using a Semiotic-Enriched Dataset
To empirically test the viability of the CAS-6 framework, we conducted preliminary experiments using a curated semiotic-enriched dataset specifically designed to capture layers of denotative, connotative, idiomatic, and metaphorical meaning in natural language.
1. Dataset Design and Composition
We constructed a pilot dataset composed of 4,000 labeled expressions in English, drawn from multiple genres:
Literary sources (poetry, prose, song lyrics)
Cultural idioms from English-speaking regions (UK, US, Australia)
Journalistic metaphor and rhetorical constructs
Annotated entries drawn from the VU Amsterdam Metaphor Corpus and the Cambridge International Dictionary of Idioms
Each entry was annotated with CAS-6 parameters:
Annotations were performed by three linguistic experts, with inter-annotator agreement measured via Cohen's kappa (κ = 0.79), indicating substantial agreement.
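For readers who want to reproduce an agreement figure of this kind, the following is a minimal sketch of Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. Cohen's kappa is defined for two annotators; the text does not state how the three annotators were combined, so averaging over annotator pairs is an assumption made here for illustration only. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the two annotators' label marginals.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def mean_pairwise_kappa(annotations):
    """Average Cohen's kappa over all annotator pairs (assumed aggregation)."""
    pairs = list(combinations(annotations, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Under the commonly used Landis and Koch interpretation, values between 0.61 and 0.80 are read as substantial agreement, which is consistent with the reported κ = 0.79.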
2. Experimental Setup
We implemented the CAS-6 semantic layer on top of a fine-tuned DistilBERT model and compared two conditions:
Baseline: Standard DistilBERT fine-tuned on the same dataset for idiom and metaphor classification.
CAS-6-Enhanced: The same model with an additional graph-structured CAS-6 layer and an auxiliary CAS-6-informed loss.
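The difference between the two conditions can be illustrated with a small sketch of how an auxiliary loss is typically combined with a main task loss. The form of the CAS-6-informed loss is not specified in this section, so the weighted sum below, the use of cross-entropy for the auxiliary term, and the `aux_weight` parameter are all assumptions for illustration, not the authors' formulation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def joint_loss(task_logits, task_target, aux_logits, aux_target, aux_weight=0.5):
    """Main task loss plus a weighted auxiliary term (assumed form).

    With aux_weight = 0 this reduces to the baseline objective.
    """
    task = cross_entropy(task_logits, task_target)
    aux = cross_entropy(aux_logits, aux_target)
    return task + aux_weight * aux
```

Setting `aux_weight` to zero recovers the baseline condition, which makes the auxiliary term easy to ablate in a comparison of this kind.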
Tasks evaluated:
1. Idiomaticity Detection (binary classification)
2. Output Type Prediction (denotative, idiomatic, metaphorical, artistic)
3. Stability Estimation (low, medium, high semantic resonance)
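The three tasks define three label spaces of sizes 2, 4, and 3. A minimal sketch of how one annotated expression might be encoded as integer targets for the three tasks is given below. The class labels follow the task list above; the identifier names, the "non_idiomatic" label string, and the integer ordering are hypothetical choices.

```python
from dataclasses import dataclass

# Label spaces taken from the task list; the ordering is an assumption.
IDIOMATICITY = ("non_idiomatic", "idiomatic")
OUTPUT_TYPE = ("denotative", "idiomatic", "metaphorical", "artistic")
STABILITY = ("low", "medium", "high")


@dataclass(frozen=True)
class TaskTargets:
    """Integer class indices for one expression, one per task."""
    idiomaticity: int
    output_type: int
    stability: int


def encode(idiomatic: bool, output_type: str, stability: str) -> TaskTargets:
    """Map human-readable labels to class indices; raises ValueError on unknown labels."""
    return TaskTargets(
        idiomaticity=int(idiomatic),
        output_type=OUTPUT_TYPE.index(output_type),
        stability=STABILITY.index(stability),
    )
```

For example, `encode(True, "metaphorical", "high")` yields targets for an idiomatic expression with a metaphorical output type and high semantic resonance.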