ACL 2025 Reflections

I wanted to share some reflections and ideas from the ACL 2025 conference (Association for Computational Linguistics), a machine-learning conference focused on text understanding.

Overall Reflections

Academically-inclined ML conferences can feel like a music festival (whoops, did I go to the wrong ACL?!) — you can get burned out after three days, especially if you give in to your FOMO, try to see everything, and just end up running from stage to stage and missing it all. But here's what I did see!

LLM everything

Pardon me, but have you heard about this thing called AI? Or how about this thing called an "LLM"? LLMs totally dominated the conference. From the keynote:

67% of papers have "LLM" in their title/abstract

Benchmarks (+ skepticism of them)

A lot of focus was put on benchmarking existing SOTA models across various tasks and datasets. Broadly, there seems to be an inherent skepticism of existing LLM benchmarks and evaluations: data leakage, hyper-parameter gaming, over-optimized few-shot prompting, and so on have made reported results hard to trust.

I was surprised by how much work focused on simply evaluating these foundational models on various tasks and measuring their biases, rather than understanding the root sources of the bias or proposing solutions.

Academia vs Industry // Foundational Models vs Small Players

The majority of folks at the conference seemed to come from academia and university affiliations — around two-thirds of attendees, I would estimate.

Attendees discussed shifts in the political climate in the US and how they're leading to less grant money.

Regardless of whether someone is in academia or industry, we all seem to be beholden to the big companies training foundational models, mostly because smaller groups simply do not have enough computational resources to compete.

There was also an undercurrent of tension throughout the conference between the more classical computational linguistics approach to NLP and the new guard of "just throw an LLM at it". Some dismay was expressed at the over-focus on LLM performance and benchmarking without a fundamental understanding of what these LLMs are doing.


Interesting Papers (by theme)

Here are the broad themes I noticed and some posters/papers I saw that looked cool.

(Also, here's a link to the conference's officially voted best papers.)

★ for especially interesting papers/posters

Architectures & Efficiency

How can we keep scaling and make it affordable to do so? Faster fine-tunes, longer context windows, and tokenization alternatives.

  • ModernBERT — Smarter, Better, Faster, Longer: A modernized encoder-only model with strong gains over classic BERT (longer context, better downstream, faster). [Answer.ai / Jeremy Howard] ACL Anthology
  • Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention (ACL Best Paper Award) [DeepSeek] ACL Anthology
  • Byte Latent Transformer (BLT) Patches Scale Better Than Tokens: Tokenizer-free — BLT encodes bytes into dynamically sized patches. [Meta] ACL Anthology
  • Run LoRA Run: Practical LoRA implementation tricks that speed up fine-tuning by 10-28% without hurting accuracy. ACL Anthology
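LoRA (the technique behind the Run LoRA Run paper) is simple enough to sketch. This is a minimal NumPy toy of the core idea — the frozen weight W plus a trainable low-rank update B·A — not the paper's optimized implementation; all names and shapes here are illustrative:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass with a LoRA adapter: y = xW + (alpha/r) * xAB.

    W is the frozen pretrained weight; only A (d_in x r) and
    B (r x d_out) are trained, with rank r << min(d_in, d_out).
    """
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A) @ B

# Toy shapes: d_in = d_out = 8, rank r = 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
W = rng.normal(size=(8, 8))
A = rng.normal(size=(8, 2)) * 0.01  # small random init
B = np.zeros((2, 8))                # B starts at zero, so the adapter is a no-op
y = lora_forward(x, W, A, B)
assert np.allclose(y, x @ W)  # with B = 0, output equals the frozen model's
```

The zero-init on B means fine-tuning starts exactly from the pretrained model's behavior, which is part of why LoRA is so cheap and stable to train.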

Alignment & Bias

  • Aligned but Blind: Alignment Increases Implicit Bias by Reducing Awareness of Race: Alignment surprisingly amplifies implicit bias in model outputs. ACL Anthology
  • Fairness through Difference Awareness: Introduces a metric for desired group discrimination when some differential treatment is actually the goal. (ACL Best Paper Award) ACL Anthology
  • Language Models Resist Alignment: Evidence from Data Compression: Evidence that alignment doesn't fully overwrite pretraining. (ACL Best Paper Winner) ACL Anthology

Agents (+ REALM workshop)

Agents are still pretty hot. Tool calling was a recurring theme during the Thursday workshop on agents — the REALM (Research on Agent Language Models) Workshop.

Prompting, and the automation of it

Can we automate prompt engineering? How do we get observability into LLMs?

  • PromptWizard: Optimizing Prompts via Task-Aware, Feedback-Driven Self Evolution: Fully automated framework for discrete prompt optimization. [Microsoft] ACL Anthology
  • SCULPT: Systematic Tuning of Long Prompts: Structured tuning for very long prompts. ACL Anthology
  • MExGen Multi-Level Explanations for Generative Language Models: Multi-level explanations that attribute generated outputs to parts of the context. ACL Anthology
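To make the automation idea concrete, here's a toy greedy hill-climbing loop over prompt variants — a stand-in for feedback-driven prompt search in general, not the actual PromptWizard or SCULPT algorithms; the mutation list and scorer are invented for illustration:

```python
import random

def optimize_prompt(base_prompt, mutations, score, rounds=5, seed=0):
    """Greedily append candidate instructions, keeping only edits
    that improve the feedback score."""
    rng = random.Random(seed)
    best, best_score = base_prompt, score(base_prompt)
    for _ in range(rounds):
        candidate = best + " " + rng.choice(mutations)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # keep only improving edits
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy "feedback" scorer: counts instruction keywords in the prompt.
# In a real system this would be task accuracy on a validation set.
def toy_score(prompt):
    keywords = ("step by step", "be concise", "cite sources")
    return sum(kw in prompt.lower() for kw in keywords)

mutations = ["Think step by step.", "Be concise.", "Cite sources."]
best, best_score = optimize_prompt("Answer the question.", mutations, toy_score)
```

The real systems replace the random mutations with LLM-generated rewrites and the keyword scorer with held-out task performance, but the outer search loop has this same shape.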

SLM ↔ LLM in Practice (where smaller models win)

Pattern: Use a foundational model to generate synthetic data → fine-tune a LLaMA ~8B model → match the few-shot performance of the foundational model

  • TOOLFLOW: Boosting LLM Tool-Calling Through Natural and Coherent Dialogue Synthesis: LLaMA3.1-8B achieves tool-calling performance comparable to or even surpassing GPT-4. ACL Anthology
  • Rethinking Low-Resource MT: The Surprising Effectiveness of Fine-Tuned Multilingual Models in the LLM Age. ACL Anthology
  • PiFi: Plug-in & Fine-tuning: Graft a single LLaMA layer after BERT. ACL Anthology

⚠️ But note this pattern has limitations — it doesn't work on really small models like LLaMA 1B / BERT (the models are just too small)!
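The distillation pattern above can be sketched as a two-stage data flow. Everything here is a mock — `teacher_generate` stands in for an API call to a large foundational model, and `fine_tune` stands in for actual gradient-based training of an ~8B student:

```python
def teacher_generate(task, n):
    """Mock of a foundational-model call producing synthetic
    (input, output) training pairs; in practice this would be
    an API call with few-shot prompts."""
    return [(f"{task} example {i}", f"label {i % 2}") for i in range(n)]

def fine_tune(model, data):
    """Mock of fine-tuning a small student model; here 'model' is
    just a dict memorizing pairs, to show the data flow only."""
    model.update(dict(data))
    return model

# Stage 1: the big model labels synthetic examples.
synthetic = teacher_generate("classify sentiment", 100)
# Stage 2: the small model is trained on that synthetic corpus.
student = fine_tune({}, synthetic)
```

The economics are the point: you pay the foundational model's inference cost once to build the dataset, then serve the cheap fine-tuned student instead of few-shotting the big model on every request.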

Tools for Researchers

Evaluation

  • Can External Validation Tools Improve LLM-as-a-Judge? (Apple): Tool-assisted judging improves annotation quality versus pure LLM-judges. ACL Anthology
  • Is That Your Final Answer? Test-Time Scaling for Selective QA: ACL Anthology

Random interesting stuff

  • Toward Automatic Discovery of a Canine Phonetic Alphabet ACL Anthology
  • Multi-Hop Reasoning for Question Answering with Hyperbolic Representations: Non-Euclidean geometry in ML! ACL Anthology
  • HotelMatch-LLM: Asymmetric dense retrieval—SLM for live queries, LLM for document embeddings. ACL Anthology
  • Tunable LLM-based Proactive Recommendation Agent: Actor-Critic agent that proactively explores user interests. ACL Anthology
