Lessons Learned Using LLMs with COBOL on Exercism
I've heard LLMs brought up in the context of legacy code modernization. Here are some lessons I've learned from digging into agentic coding techniques with Fixed Format COBOL on the Exercism COBOL track.
My Approach
When working through the track, I did the easy-difficulty problems by hand with LLMs disabled to refresh my COBOL knowledge, and saved the medium-difficulty problems for Grok Code, Claude Sonnet, Gemini 2.5, GPT-5, etc. via GitHub Copilot Agent Mode.
Generally, the models did well, and the verbosity of COBOL seemed well suited to having the machine do the typing and the human do the reading and reviewing.
Model Comparison
It's hard to declare which model did best. I feel like Claude Sonnet and GPT-5 had an edge over Gemini 2.5 Pro when debugging issues. Gemini seemed to get caught in loops, making the same change to the COBOL source over and over again.
Claude Sonnet was way faster than GPT-5. Grok Code Fast was the fastest overall and felt about 90% as smart as Claude Sonnet and GPT-5. Given that Grok Code Fast consumes no Copilot Tokens, I'd likely default to that model and switch to Claude Sonnet for edge cases.
Note that IBM has a special "watsonx Code Assistant for Z" which I assume has better post-training on Enterprise COBOL and an understanding of how to do tool calling on z/OS. This wasn't relevant for Exercism since it uses GnuCOBOL locally on my PC, but it's worth mentioning.
Where Models Struggle with COBOL
All of the models struggled with COBOL in similar ways because it's an odd duck among programming languages. Some notable lessons learned across models:
Fixed Format Column Rules
Models struggle to generate COBOL Fixed Format. This is an odd format in which column positions have special meaning: statements generally go in Area B (columns 12-72), division and paragraph headers start in Area A (columns 8-11), a comment requires an asterisk in column 7, and so on. Gemini crashed out and never figured this out, but the other models would get there eventually. It was time consuming and chewed through a ton of tokens, so this really is something that should be handled by callable auto-formatting tools.
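To make the column rules concrete, here is a minimal sketch of a fixed-format program (the program and paragraph names are my own invention) showing where each kind of line has to land:

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. COLUMN-DEMO.
      * An asterisk in column 7 makes this entire line a comment.
       PROCEDURE DIVISION.
       MAIN-PARAGRAPH.
           DISPLAY "Statements live in Area B, columns 12-72".
           STOP RUN.
```

Division headers and the paragraph name start in column 8 (Area A), while the DISPLAY and STOP RUN statements are indented to column 12 (Area B). Shift any of these by a column or two and GnuCOBOL rejects the line, which is exactly the kind of invisible constraint the models kept tripping over.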
1-Based Indices
Models seem to struggle with COBOL's 1-based indices at times, leading to occasional off-by-one errors.
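A small sketch of the convention (table and data names are hypothetical): COBOL subscripts run from 1 to the OCCURS count, so the idiomatic loop below starts at 1, and an LLM-generated `FROM 0` would reference storage outside the table.

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. INDEX-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01 LETTER-TABLE.
           05 LETTER-ITEM PIC X OCCURS 3 TIMES.
       01 I PIC 9.
       PROCEDURE DIVISION.
           MOVE "ABC" TO LETTER-TABLE.
      * Valid subscripts are 1 through 3; LETTER-ITEM(0) is out of
      * bounds, so the loop must start at 1, not 0.
           PERFORM VARYING I FROM 1 BY 1 UNTIL I > 3
               DISPLAY LETTER-ITEM(I)
           END-PERFORM.
           STOP RUN.
```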
Overflow at Unusual Boundaries
Models might generate infinite loops because overflow occurs at unusual boundaries determined by COBOL PIC clauses. For example, a PIC 99 item stores two decimal digits, so adding 1 to 99 truncates back to 00; a loop condition like UNTIL > 99 can therefore never be satisfied.
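A minimal sketch of the trap (the data name is hypothetical), with the broken condition shown in a comment and an equality test used instead so the program actually terminates:

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. OVERFLOW-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01 WS-COUNTER PIC 99 VALUE 0.
       PROCEDURE DIVISION.
      * WS-COUNTER holds only 00-99: adding 1 to 99 silently
      * truncates back to 00, so a loop written as
      *     PERFORM UNTIL WS-COUNTER > 99
      * never terminates. Testing for equality avoids the overflow:
           PERFORM UNTIL WS-COUNTER = 99
               ADD 1 TO WS-COUNTER
           END-PERFORM
           DISPLAY "Stopped at " WS-COUNTER.
           STOP RUN.
```

Widening the picture to PIC 999, or adding an ON SIZE ERROR phrase to the ADD, are other ways to make the overflow visible rather than silent.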
Paragraphs vs Subprograms
Models would default to generating COBOL paragraphs (callable subroutines that share the program's data and don't support parameter passing) rather than subprograms (separate programs invoked with CALL ... USING). Because paragraphs communicate by mutating shared state, there were occasional situations where a callee clobbered a loop index that the caller was still using.
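Here is a runnable sketch of the clobbering bug (program and paragraph names are my own). The outer loop and the INNER-WORK paragraph share the same index I, so when the paragraph returns, I has been left at 3, the outer PERFORM increments it to 4, and the outer loop exits after a single pass instead of three:

```cobol
       IDENTIFICATION DIVISION.
       PROGRAM-ID. CLOBBER-DEMO.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01 I PIC 9 VALUE 0.
       PROCEDURE DIVISION.
       MAIN-PARAGRAPH.
      * INNER-WORK reuses I, so it clobbers the outer loop's
      * index: the outer loop runs once instead of three times.
           PERFORM VARYING I FROM 1 BY 1 UNTIL I > 3
               PERFORM INNER-WORK
           END-PERFORM.
           STOP RUN.
       INNER-WORK.
           PERFORM VARYING I FROM 1 BY 1 UNTIL I > 2
               DISPLAY "inner pass " I
           END-PERFORM.
```

The fix is either to give the paragraph its own counter, or to move the work into a subprogram invoked with CALL ... USING, whose WORKING-STORAGE is separate from the caller's.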