UK – The LinksAI English law benchmark (Version 2)

We have updated the LinksAI English law benchmark to test two of the latest LLMs – OpenAI o1 and Google Gemini 2.0. 

They show significant improvements, in terms of both increased accuracy and reduced use of fictitious or incorrect citations.

The benchmark

We created the LinksAI English law benchmark to test the ability of Large Language Models (LLMs) to answer legal questions. 

The benchmark comprises 50 questions from 10 different areas of legal practice. The questions are hard.

They are the sort of questions that would require advice from a competent mid-level (2 years’ post qualification experience) lawyer, specialised in that practice area – that is, someone typically four years out of law school.

The answers were marked by senior lawyers from each practice area. Each answer was given a mark out of 10, comprising 5 marks for substance (“is the answer right?”), 3 marks for citations (“is the answer supported by relevant statute, case law or regulations?”) and 2 marks for clarity.
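
By way of illustration, the sketch below shows how marks under this rubric aggregate into a model-level score out of 10. It is a minimal Python sketch of the arithmetic only: the Mark class, the model_score function and the uniform per-answer marks are our own invention (chosen so the average lands on 6.4 out of 10), not the tooling or the actual marks used in the exercise.

    from dataclasses import dataclass

    # Rubric caps from the benchmark: each answer is marked out of 10.
    MAX_SUBSTANCE = 5  # "is the answer right?"
    MAX_CITATIONS = 3  # "is the answer supported by relevant authority?"
    MAX_CLARITY = 2    # clarity of expression

    @dataclass
    class Mark:
        substance: float
        citations: float
        clarity: float

        def total(self) -> float:
            # Guard against marks outside the rubric caps.
            assert 0 <= self.substance <= MAX_SUBSTANCE
            assert 0 <= self.citations <= MAX_CITATIONS
            assert 0 <= self.clarity <= MAX_CLARITY
            return self.substance + self.citations + self.clarity

    def model_score(marks: list[Mark]) -> float:
        """Average mark out of 10 across the benchmark's 50 answers."""
        return sum(m.total() for m in marks) / len(marks)

    # Hypothetical, uniform marks chosen so the average is 6.4 out of 10;
    # the real per-question marks varied.
    marks = [Mark(substance=3.5, citations=1.7, clarity=1.2)] * 50
    print(round(model_score(marks), 1))  # 6.4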

The updated results

The last benchmarking exercise in October 2023 tested four different models: GPT 2, GPT 3, GPT 4 and Bard. The best was GPT 4, which scored 4.4 out of 10. However, all were often wrong and their citations were sometimes fictional.

In this second benchmarking exercise, we saw significant improvements. We gave:

  • Gemini 2.0 a score of 6.0 out of 10
  • OpenAI o1 the top score of 6.4 out of 10

In both cases, this was driven by material increases in the scores for substance and the accuracy of citations.

Despite the significant improvements in Gemini 2.0 and OpenAI o1, we recommend that they not be used for English law legal advice without expert human supervision. They are still not always right and they lack nuance.

However, if that expert supervision is available, they are getting to the stage where they could be useful, for example in producing a first draft or acting as a cross-check. This is particularly the case for tasks that involve summarising relatively well-known areas of law.

What about GPT 5?

The improvement in LLM technology since the last benchmarking exercise is significant. Whether this rate of progression will continue is less clear. 

It is possible there are inherent limitations to LLMs – which are, in part, stochastic parrots regurgitating the internet (and other learned text) on demand. For example, they all suffer from the embodiment problem: they will never experience the beauty of a summer game of cricket or the physical revulsion at finding a snail in a bottle of ginger beer. However, the fine tuning of this technology is likely to deliver performance improvements for years to come.

In any event, we will reapply the LinksAI English law benchmark to future iterations of this technology and update this report.

Supporting materials

Our report is here.

The annexes containing the questions and answers, together with other supporting materials, are here.

Summary of the benchmarking exercise