Ep 419 Tool May 20, 2026 4:21 w/ Justy & Cody

5 Small Language Models for Agentic Tool Calling KDnuggets

Small language models are gaining ground on a critical frontier benchmark: tool calling. This episode looks at five compact, open-weight models that can route to APIs, format JSON arguments, and run multi-step agentic workflows without requiring a data center. Cody and Justy debate whether the gap between small and frontier models is closing fast enough to matter for real shipping teams.

Read the source → Plain-text transcript →

Embed this episode

Paste this on any site — the player is a self-contained iframe with no cookies or trackers.

<iframe src="https://sandrise.io/exploring-next/embed/419"
  width="100%" height="180" style="max-width:640px;border:0;border-radius:12px;overflow:hidden"
  title="Exploring Next — Episode 419 audio player"
  loading="lazy" allow="autoplay" referrerpolicy="strict-origin-when-cross-origin"></iframe>

Embed & API docs →

Script Kimi K2.6 Voice Inworld TTS 2

Transcript

Justy So apparently we're at the point where a three-billion-parameter model can call your API, format the JSON, and not embarrass itself.

Cody Yeah, I saw that KDnuggets piece. Five small models, all with structured tool calling, all open weights.

Justy This is basically the thing you keep saying was impossible.

Cody I did NOT say impossible. I said the gap was still wide. There's a difference.

Justy Mm-hm.

Justy Anyway, you got in late? You look destroyed.

Cody Redeye from the west coast. I am running on whatever this airport coffee is.

Justy Pathetic. I actually slept eight hours. First time in like three weeks.

Cody Okay, showoff. But yeah, the article. The claim is that agentic AI lives or dies on tool calling, and these five small models are finally closing the gap with the big frontier ones.

Justy Right, right.

Cody They start with SmolLM3-3B. Hugging Face, three billion parameters, decoder-only with Grouped Query Attention and NoPE, which is their no-positional-embedding thing. Sixty-four K native context, up to one twenty-eight with YaRN extrapolation. Trained on eleven point two trillion tokens, post-trained with something called Anchored Preference Optimization.

Justy APO, yeah. I saw that paper.

Cody The interesting bit is the dual tool interfaces. XML blobs through xml_tools and Python-style function calls through python_tools. That's unusually flexible for a model this small.

Justy Okay, but who is reaching for a three B model when they could just call GPT-4?

Cody If you're running on an edge device or a machine with eight gigs of VRAM, you literally cannot run the big stuff. Plus, Apache two, fully open, weights and training code. SmolLM3 is built for people who need to ship without sending everything to an API.

Justy Fair. I do love a model I can actually host.

Cody Then there's Qwen3-4B-Instruct from Alibaba. Four billion parameters, but three point six excluding embeddings. Thirty-six layers, GQA with thirty-two query heads and eight KV heads. The big number is two hundred sixty-two thousand tokens of native context.

Justy That's absurd for four billion.

Cody Right? And it's non-thinking only, so optimized for fast responses. Hundred-plus languages, native tool calling through Qwen-Agent and MCP.

Justy Wait, MCP? As in the Anthropic protocol?

Cody Yeah, they adopted the Model Context Protocol. That's actually a big signal, because if a Chinese lab is building around Anthropic's open standard, that standard is winning.

Justy Okay, so the article's central argument is that these aren't just toy demos. They're actually viable for production agentic pipelines.

Cody That's the claim. My read? It's directionally true but the 'first-class' label is doing more work than the author admits. Frontier models still win on complex multi-hop reasoning where tools depend on each other. These small models are great for single-tool or shallow chains.

Justy Which is like eighty percent of real use cases, though.

Cody Probably, yeah.

Justy For product teams, the question is always latency and burn. If I can run this on a cheap GPU instead of paying per-token to OpenAI, that's not nothing.

Cody It's definitely not nothing. I just don't want people thinking the gap is gone. It's shrinking, not closed.

Justy Noted. You'll say 'I told you so' when someone's three B agent loops forever.

Cody I absolutely will.

Cody The thing the article doesn't dig into enough is the evaluation. They list specs but don't show head-to-head success rates on real tool-use benchmarks. I'd love to see how SmolLM3's xml_tools mode actually performs against Claude on something like BFCL or ToolBench.

Justy So your take is: exciting, directionally correct, but bring your own benchmarks before you ship.

Cody Exactly. And if you're already in the Qwen ecosystem, the MCP support is genuinely convenient. If you're in Hugging Face land, SmolLM3 is probably the easiest on-ramp.

Justy Good enough for me. Go sleep, Cody.