Here are some simple metrics you can use to assess how good a chatbot’s output is. Or, since it’s nearly Bloomsday, to tell which era of James Joyce you are reading.
Before using a new chatbot, it’s helpful to get a sense of its output quality and where it may fall short. Some of this can be quantitative, numbers you can use to make comparisons, and some will always be qualitative, based on your own close reading.
Or for the more literary: before reading a piece by Joyce, it might be handy to know whether you're getting the clarity of The Dead, the experimentation of Ulysses, or whatever it is that's going on in Finnegans Wake.
The Metrics
Here are the five simple metrics we'll use (a code sketch follows the list):
Word count:
LLMs can be verbose. If a new model's outputs are much longer than an earlier model's, it may be inefficient.
Average sentence length:
Beyond the raw number of words, very long sentences are a common issue. A sudden increase in average sentence length is likely a red flag.
Stopword ratio:
The glue words of the language ('the', 'a', 'at', 'and', ...) are essential, but if overused they can indicate that a model's output is padded with filler.
Flesch reading ease:
A score based on sentence length and syllables per word. Higher means easier to read.
Repetition ratio:
Compares the number of unique words to the total word count. A low ratio means a high number of unique words, which can make text complicated to read; a very high ratio means heavy repetition, which can sound robotic.
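Here is a minimal Python sketch of all five metrics. It is an illustration, not the exact code behind the tables below: the stopword list is truncated, the tokeniser is a simple regex, the syllable counter is a crude vowel-group heuristic, and the repetition formula is one plausible reading of "unique words compared to total words". Expect its numbers to differ from dedicated tools like textstat, even if the relative ordering of texts holds up.

```python
import re

# Illustrative stopword list; a real implementation would use a fuller
# one (e.g. NLTK's English stopwords).
STOPWORDS = {
    "the", "a", "an", "and", "or", "but", "if", "of", "at", "by", "for",
    "with", "to", "in", "on", "is", "are", "was", "were", "be", "been",
    "it", "that", "this", "he", "she", "they", "we", "you", "i", "as",
    "his", "her", "their", "not", "no", "so", "do", "did", "have", "had",
}

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic; real tools use pronunciation data."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    count = len(groups)
    if word.lower().endswith("e") and count > 1:
        count -= 1  # drop a likely-silent final 'e'
    return max(count, 1)

def text_metrics(text: str) -> dict:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)      # guard against empty input
    n_sentences = max(len(sentences), 1)

    words_per_sentence = len(words) / n_sentences
    syllables_per_word = sum(count_syllables(w) for w in words) / n_words
    unique_fraction = len({w.lower() for w in words}) / n_words

    return {
        "word_count": len(words),
        "character_count": len(text),
        "avg_sentence_length": round(words_per_sentence, 2),
        "stopword_ratio": round(
            sum(w.lower() in STOPWORDS for w in words) / n_words, 2),
        # Standard Flesch Reading Ease formula.
        "flesch_reading_ease": round(
            206.835 - 1.015 * words_per_sentence
            - 84.6 * syllables_per_word, 2),
        # Share of tokens that repeat an earlier word; lower means
        # more unique vocabulary. (An assumed formula.)
        "repetition_ratio": round(1 - unique_fraction, 2),
    }

print(text_metrics("The snow was general all over Ireland."))
```

Feeding the three Joyce excerpts below through this sketch won't reproduce the tables exactly, but the direction of the differences between the texts should match.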
Most of these metrics can raise suspicions about a text when they are unusually high or low. They are a guide to where to look for issues rather than a guarantee of finding them.
As simple metrics, these will be wrong more often than nuanced ones. But I will use them to show some of the shortcomings that typically occur in LLM output and how they can be found.
Example Joycean Usage
Let's take three works by James Joyce and see how they look to these metrics. Before we start, let's say that text from 'The Dead' is good, that one of Ulysses' livelier passages is not as good, and that Finnegans Wake is really not the sort of thing you want a model outputting. Though of course your views and requirements might differ.
The Dead
| Metric | Value |
| --- | --- |
| Word Count | 2196 |
| Character Count | 10206 |
| Avg Sentence Length | 23.87 |
| Stopword Ratio | 0.44 |
| Flesch Reading Ease | 88.36 |
| Repetition Ratio | 0.25 |
Clear, direct prose. Short sentences. Good word variety. This is our baseline.
Oxen of the Sun (Ulysses)
| Metric | Value |
| --- | --- |
| Word Count | 1904 |
| Character Count | 10165 |
| Avg Sentence Length | 29.29 |
| Stopword Ratio | 0.47 |
| Flesch Reading Ease | 61.50 |
| Repetition Ratio | 0.18 |
Sentences are longer, stopwords increase, and readability drops. The style is denser and less repetitive.
Finnegans Wake
| Metric | Value |
| --- | --- |
| Word Count | 1659 |
| Character Count | 10160 |
| Avg Sentence Length | 18.64 |
| Stopword Ratio | 0.37 |
| Flesch Reading Ease | 70.73 |
| Repetition Ratio | 0.09 |
Finnegans Wake has shorter sentences and a relatively readable Flesch score, but extremely low repetition, suggesting high vocabulary churn. This shows some of the limitations of a metric based on syllable counts.
Conclusion
If you had a model that originally produced Dubliners-style outputs and suddenly started writing like Ulysses or Finnegans Wake, these metrics would flag the change immediately.
They won’t catch every issue, but they’ll give you a fast and useful first pass.
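As a sketch of that first pass, assuming the text_metrics helper from earlier: record baseline metrics from outputs you trust, then flag any metric that drifts beyond a tolerance. The 30% threshold and the sample variable names are illustrative assumptions, not recommendations.

```python
def flag_drift(baseline: dict, current: dict, tolerance: float = 0.30) -> list:
    """List the metrics that moved more than `tolerance` (relative) from baseline."""
    flags = []
    for name, base in baseline.items():
        if base == 0:
            continue  # zero baselines need separate handling
        change = (current[name] - base) / abs(base)
        if abs(change) > tolerance:
            flags.append(f"{name}: {base} -> {current[name]} ({change:+.0%})")
    return flags

# Hypothetical usage: dubliners_sample and new_output are texts you supply.
# for warning in flag_drift(text_metrics(dubliners_sample),
#                           text_metrics(new_output)):
#     print("Drifted:", warning)
```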
Try it yourself at https://optamam.com/metrics. Paste in some chatbot output or a literary excerpt and see what you find.
By David Curran