I wanted to share some reflections and ideas from the ACL 2025 conference (Association for Computational Linguistics), a machine-learning conference focused on text understanding.
Academically-inclined ML conferences can feel like a music festival (whoops, did I go to the wrong ACL?!) — you can get burned out after three days, especially if you give in to your FOMO, try to see everything, and just end up running from stage to stage and missing it all. But here's what I did see!
Pardon me, but have you heard about this thing called AI? Or how about this thing called an "LLM"? LLMs totally dominated the conference. From the keynote:
67% of papers have "LLM" in their title/abstract
A lot of focus was put on benchmarking existing SOTA models across various tasks and datasets. There seems to be broad skepticism of existing LLM benchmarks and evaluations: data leakage, hyper-parameter gaming, few-shot-prompting over-optimization, and so on have left people unable to trust reported results.
I was surprised by how much work focused simply on evaluating these foundational models on various tasks and measuring their biases, rather than understanding the root sources of the bias or proposing solutions.
The majority of folks at the conference seemed to come from academia and university affiliations; I'd estimate roughly two-thirds of attendees.
Attendees discussed shifts in the political climate in the US and how those shifts are leading to less grant money.
Regardless of whether someone is in academia or industry, we all seem to be beholden to the big companies training foundational models, most importantly because smaller groups simply do not have enough computational resources to compete.
There was also an undercurrent of tension throughout the conference between the more classical computational-linguistics approach to NLP and the new guard of "just throw an LLM at it". Some dismay was expressed at the over-focus on LLM performance and benchmarking alone, without a fundamental understanding of what these LLMs are doing.
Here are the broad themes I noticed and some posters/papers I saw that looked cool.
(Also, here's a link to all of the conference's officially voted best papers.)
★ for especially interesting papers/posters
How can we keep scaling and make it affordable to do so? Faster fine-tunes, longer context windows, and tokenization alternatives.
Agents are still pretty hot. Agent orchestration and tool calling were themes during the Thursday workshop on agents — the REALM (Research on Agent Language Models) Workshop.
Can we automate prompt engineering? How do we get observability into LLMs?
Pattern: Use a foundational model to generate synthetic data → fine-tune a ~8B LLaMA model → achieve the same performance as few-shot prompting the foundational model
⚠️ But note this pattern has limitations — it doesn't work on really small models like LLaMA 1B or BERT (the models are just too small)!
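The pattern above can be sketched as a three-step pipeline. To keep this self-contained and runnable, everything below is hypothetical scaffolding: the "teacher" is a trivial stand-in for a large foundation-model API, and `TinyStudent`'s "fine-tuning" just learns keyword rules in place of real supervised fine-tuning (e.g. LoRA on a ~8B model). The shape of the flow, not the code, is the point.

```python
# Schematic sketch of the synthetic-data distillation pattern.
# All names here (teacher_label, TinyStudent) are illustrative stand-ins,
# not the actual models or training code from any ACL paper.
from dataclasses import dataclass, field


def teacher_label(prompt: str) -> str:
    """Stand-in for querying a large foundation model (step 1's labeler).

    In practice this would be an API call to the big teacher model.
    """
    return "positive" if "great" in prompt else "negative"


def generate_synthetic_dataset(unlabeled_prompts):
    """Step 1: label cheap unlabeled prompts with the teacher."""
    return [(p, teacher_label(p)) for p in unlabeled_prompts]


@dataclass
class TinyStudent:
    """Stand-in for a ~8B student model.

    'Fine-tuning' here just memorizes keyword -> label associations so the
    sketch runs anywhere; real fine-tuning would update model weights.
    """
    rules: dict = field(default_factory=dict)

    def fine_tune(self, dataset):
        # Step 2: supervised fine-tuning on the synthetic (prompt, label) pairs.
        for prompt, label in dataset:
            for word in prompt.split():
                self.rules.setdefault(word, label)

    def predict(self, prompt: str) -> str:
        # Majority vote over known keywords; default label if none match.
        votes = [self.rules[w] for w in prompt.split() if w in self.rules]
        return max(set(votes), key=votes.count) if votes else "negative"


# Step 3: the fine-tuned student should match the teacher on new inputs.
prompts = ["this movie was great", "what a terrible plot", "great acting"]
student = TinyStudent()
student.fine_tune(generate_synthetic_dataset(prompts))
print(student.predict("great film"))  # prints "positive", matching the teacher
```

The caveat above maps onto this sketch naturally: if the student has too little capacity (too few "rules" it can absorb), it can't recover the teacher's behavior no matter how much synthetic data you generate.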