Using #AR, we carefully isolate speech and gestures, removing other cues (e.g., gaze, facial expressions). This allows us to analyze how partners coordinate on abstractions and how information shifts across these modalities over time.
We develop a computational model that extends the Rational Speech Act (RSA) framework to multimodal settings and simulates the behaviors we observe (toy sketch after this tweet).
2/4
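A minimal sketch of the kind of multimodal RSA model this could involve. This is not the paper's actual model: the toy domain, utterance set, semantics, and per-modality costs are all hypothetical, and only the standard RSA recursion (literal listener, pragmatic speaker, pragmatic listener) is assumed.

```python
import numpy as np

# Hypothetical toy domain: two possible block structures the speaker may mean.
states = ["tower", "arch"]

# Multimodal utterances as (speech, gesture) pairs; None = modality unused.
utterances = [
    ("the tower", None),
    ("the arch", None),
    (None, "point-up"),          # gesture alone, ambiguous
    ("tower", "point-up"),       # redundant speech + gesture
]

# Literal semantics (hypothetical): 1 if the utterance is compatible with the state.
meanings = np.array([
    # tower arch
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [1.0, 0.0],
])

def cost(u):
    """Per-modality production cost: longer speech costs more, gesture has a flat cost."""
    speech, gesture = u
    c = 0.0
    if speech is not None:
        c += 0.1 * len(speech.split())
    if gesture is not None:
        c += 0.2
    return c

prior = np.ones(len(states)) / len(states)
alpha = 4.0  # speaker rationality

# Literal listener: L0(s | u) proportional to [[u]](s) * P(s)
L0 = meanings * prior
L0 = L0 / L0.sum(axis=1, keepdims=True)

# Pragmatic speaker: S1(u | s) proportional to exp(alpha * (log L0(s | u) - cost(u)))
costs = np.array([cost(u) for u in utterances])
with np.errstate(divide="ignore"):
    utility = alpha * (np.log(L0) - costs[:, None])
S1 = np.exp(utility)
S1 = S1 / S1.sum(axis=0, keepdims=True)

# Pragmatic listener: L1(s | u) proportional to S1(u | s) * P(s)
L1 = S1 * prior
L1 = L1 / L1.sum(axis=1, keepdims=True)

for u, row in zip(utterances, L1):
    print(u, np.round(row, 2))
```

Under these made-up costs, a gesture alone shifts interpretation toward the referent a cooperative speaker would not have described as cheaply in speech, illustrating how per-modality costs let the recursion trade off speech and gesture.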
🤖 Our findings suggest strategies for future convention-aware multimodal agents that: (1) learn users’ chunked conventions as they emerge, (2) shift to abstract-first instructions over time, (3) adapt modality to evolving user preferences, and (4) use redundancy to highlight changes from prior interactions.
3/4
If you caught @jefan presenting our poster at #CogSci2025, here is the full paper, to appear at #CHI2026:
“Gesturing Toward Abstraction: Multimodal Convention Formation in Collaborative Physical Tasks”
🔗 https://multimodal-conventions.github.io
📄 https://arxiv.org/pdf/2602.08914
@hci 4/4