Why You Care
Ever wonder if the latest AI models are truly better for every task? Or are older, more established methods still superior in some areas? This new research dives into text classification, a core AI function. It compares large language models (LLMs) with traditional BERT-like models. Why should you care? Because understanding these differences can save you time, money, and effort in your own AI projects. It helps you pick the right tool for the job. Do you need raw power or practical efficiency?
What Actually Happened
A recent paper, authored by Taja Kuzman Pungeršek and her team, investigated the effectiveness of different AI models for text classification in several South Slavic languages, including Serbian, Croatian, and Slovenian. The researchers compared openly available fine-tuned BERT-like models with various open-source and closed-source LLMs. They evaluated these models across three distinct tasks: sentiment classification in parliamentary speeches, topic classification in news articles and parliamentary speeches, and genre identification in web texts. The goal was to see which approach delivered better results for these less-resourced languages.
Why This Matters to You
This research offers crucial insights for anyone working with AI, especially in multilingual contexts. The study finds that LLMs, despite their recent popularity, are not a straightforward win. They show strong zero-shot performance, meaning they can perform a task without any task-specific training examples. However, this power comes with trade-offs, and you may find yourself weighing the benefits of LLMs against their practical limitations. For instance, imagine you are building an AI system to categorize customer feedback in Croatian. An LLM might give you great initial accuracy without much setup, but its operational costs could quickly become prohibitive.
Key Findings for Text Classification:
- LLM Zero-Shot Performance: LLMs often match or surpass fine-tuned BERT-like models in zero-shot scenarios.
- Cross-Lingual Consistency: LLMs perform comparably in South Slavic languages and English in a zero-shot setup.
- Inference Speed: LLMs are significantly slower for inference (making predictions).
- Computational Cost: LLMs incur much higher computational costs.
- Output Predictability: LLMs tend to produce less predictable outputs.
As the team revealed, “LLMs demonstrate strong zero-shot performance, often matching or surpassing fine-tuned BERT-like models.” This is great news for quick deployment. However, it’s not the whole story. Do you prioritize initial performance or long-term operational efficiency? Your choice depends on your project’s specific needs.
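To make the zero-shot idea concrete, here is a minimal sketch of how such a classification request is typically framed: the candidate labels are listed directly in the prompt, so the model needs no task-specific training. The prompt wording and label set below are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch of zero-shot classification via prompting.
# The prompt template and label set are hypothetical, not from the study.

def build_zero_shot_prompt(text: str, labels: list[str]) -> str:
    """Build a classification prompt that lists candidate labels inline,
    so an LLM can answer without task-specific fine-tuning."""
    label_list = ", ".join(labels)
    return (
        f"Classify the following text into exactly one of these categories: "
        f"{label_list}.\n\nText: {text}\n\nCategory:"
    )

# Hypothetical label set for sentiment classification of parliamentary speech.
SENTIMENT_LABELS = ["positive", "negative", "neutral"]

prompt = build_zero_shot_prompt(
    "Podpiram predlagani zakon.",  # Slovenian: "I support the proposed law."
    SENTIMENT_LABELS,
)
```

A fine-tuned BERT-like model, by contrast, learns a fixed classification head from labeled examples, so no prompt construction is needed at inference time.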
The Surprising Finding
Here’s the twist: while LLMs showed impressive zero-shot capabilities, they are not always the best practical choice. You might assume that newer, larger models always win. However, the study indicates this isn’t true for all applications. Despite their strong performance, LLMs come with significant drawbacks: slower inference times and higher computational costs, as the paper states. What’s more, their outputs are less predictable. This challenges the common assumption that bigger, more general models are inherently superior for every task. For example, if you need to process millions of documents daily, even a small increase in inference time or cost per query adds up. This makes fine-tuned BERT-like models surprisingly resilient.
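A rough back-of-envelope calculation shows how quickly per-document latency compounds at scale. The latency figures below are illustrative assumptions for the sake of the arithmetic, not measurements from the study:

```python
# Back-of-envelope throughput comparison for a large daily workload.
# Both latency figures are assumed values, not numbers from the paper.

DOCS_PER_DAY = 1_000_000
BERT_LATENCY_S = 0.005   # assumed ~5 ms per document for a fine-tuned model
LLM_LATENCY_S = 0.200    # assumed ~200 ms per document for an LLM

def daily_compute_hours(latency_s: float, docs: int = DOCS_PER_DAY) -> float:
    """Total sequential compute time per day, in hours."""
    return latency_s * docs / 3600

bert_hours = daily_compute_hours(BERT_LATENCY_S)  # ~1.4 hours
llm_hours = daily_compute_hours(LLM_LATENCY_S)    # ~55.6 hours
```

Under these assumed latencies, the LLM needs 40x the sequential compute time of the fine-tuned model for the same workload, which is the kind of gap that dominates a production budget.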
What Happens Next
This research suggests a nuanced future for text classification technologies. We can expect continued innovation in both LLM efficiency and specialized fine-tuning techniques. For instance, in the next 12-18 months, companies might focus on optimizing LLMs for faster inference through techniques like quantization or distillation. Meanwhile, fine-tuned BERT-like models will likely remain the go-to for large-scale, high-throughput text annotation tasks. If you’re developing an AI application, consider a hybrid approach: use LLMs for initial prototyping or tasks requiring high flexibility, then deploy fine-tuned models for production-level work where cost and speed are essential. The industry implications are clear: a ‘one-size-fits-all’ AI approach is still a distant dream. As the paper concludes, “fine-tuned BERT-like models remain a more practical choice for large-scale automatic text annotation.”
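That hybrid routing decision can be sketched as a simple rule. The stage names, model labels, and throughput cutoff below are hypothetical, chosen only to illustrate the trade-off between flexibility and cost:

```python
# Sketch of a hybrid routing rule: LLMs for prototyping and flexible,
# low-volume tasks; fine-tuned models for high-throughput production work.
# All names and the threshold are illustrative assumptions.

def choose_model(stage: str, docs_per_day: int,
                 throughput_cutoff: int = 10_000) -> str:
    """Route a classification workload to a model family."""
    if stage == "prototyping":
        return "llm"  # zero-shot: no labeled data or training run needed
    if docs_per_day > throughput_cutoff:
        return "fine-tuned-bert"  # cheaper and faster at scale
    return "llm"  # low volume: flexibility outweighs per-query cost

# Example: a production pipeline labeling a million documents a day
# would be routed to the fine-tuned model.
print(choose_model("production", 1_000_000))
```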
