Legal Tech’s Dilemma: AI Power at Hand, Yet Essential Training Data Remains Elusive

Generative AI’s promise to revolutionize the legal tech arena faces an unexpected hurdle: while ample documentation exists, there’s a glaring shortage of “trainable” data.

In a landscape drowning in legal documents, one might assume that generative AI tools would be thriving. Surprisingly, despite the abundance, many tech providers in the legal space are grappling with the realization that much of this data is not as trainable as they had anticipated.

The primary impediment isn’t an absence of data: there are innumerable contracts and legal documents to draw from. The real challenge lies in client stipulations that often bar the use of their documents for AI training. Add genuine concerns about potential data breaches involving AI tools, and the legal tech industry appears caught in a conundrum: it has the machinery but lacks the right kind of fuel.

Experts say that while this data shortage may put pressure on legal tech companies in the short term, potential solutions exist, from using more generic training data sets to fine-tuning AI models to understand legalese. Yet the journey isn’t identical for all providers; how easy or hard training is varies from one to the next.

The Enigma of ‘Untrainable’ Legal Data

Delving into the heart of the matter, John Brewer, Chief AI Officer at e-discovery firm HaystackID, reflected on the ideal scenario. “If we wanted to train an e-discovery model to be good at reading the kind of data that we push through [in] large amounts on a regular basis, the way that we would do that is train it on actual authentic discovery data.”

But reality doesn’t always align with ideals. Often, the original owners of data prohibit its use for AI training. Attempts to anonymize such data, a standard practice when tech companies use client information, dilute its value. Brewer explains why: the AI builds intricate networks of relationships among words, or “tokens”. On non-sensitive documents, this works well. But the model depends on forming ties between significant nouns, or “sensitive tokens”, and anonymizing them breaks these crucial connections.
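
The effect Brewer describes can be illustrated with a toy sketch. This is not how a language model works internally: simple co-occurrence counting stands in here for the associations a model would learn, and the sample sentences, party names, and the `[PARTY]` placeholder are all invented for illustration. Under those assumptions, it shows that replacing every sensitive name with the same placeholder erases the record of which parties appear together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample sentences and party names, invented for illustration.
DOCS = [
    "Acme Corp shall indemnify Globex Inc under the supply agreement",
    "Globex Inc notified Acme Corp of the breach",
    "Initech LLC assigned the lease to Acme Corp",
]
SENSITIVE = ["Acme Corp", "Globex Inc", "Initech LLC"]


def redact(text, entities, placeholder="[PARTY]"):
    """Replace every sensitive name with one generic placeholder."""
    for name in entities:
        text = text.replace(name, placeholder)
    return text


def entity_pairs(text, vocabulary):
    """Return each pair of distinct vocabulary terms found in the text."""
    present = sorted(term for term in vocabulary if term in text)
    return list(combinations(present, 2))


def cooccurrence(docs, vocabulary):
    """Count how often each pair of terms appears in the same document."""
    counts = Counter()
    for doc in docs:
        counts.update(entity_pairs(doc, vocabulary))
    return counts


original = cooccurrence(DOCS, SENSITIVE)
redacted_docs = [redact(doc, SENSITIVE) for doc in DOCS]
anonymized = cooccurrence(redacted_docs, SENSITIVE + ["[PARTY]"])

print(original)    # pairs of named parties and how often they co-occur
print(anonymized)  # no pairs left: every party collapsed into one token
```

In the original text, the pairing of Acme Corp with Globex Inc shows up twice; after redaction, no pairing between distinct parties can be recovered at all. Real anonymization schemes can be gentler, for example by giving each party its own consistent pseudonym, but that choice trades some privacy for preserved structure.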

Moreover, even when firms like HaystackID hold data that is not explicitly barred from AI training, ethical quandaries arise. Much of this data, Brewer notes, was collected before generative AI became mainstream in late 2022, and its owners probably haven’t revisited their position on training since then. So even where no legal barrier exists, the growing threat of cyberattacks and data leaks associated with this technology raises serious ethical concerns.

The Path Forward

The consequences of inadequate or subpar training data for AI are significant. Ryan O’Leary, Research Director at IDC, emphasizes that the real question is not how much training data is available but how good it is once the necessary safeguards have been applied. The cost of building tools on client-specific data further complicates matters, potentially adding to the already soaring costs of AI.

Yet, Brewer offers a different perspective, suggesting that the intrinsic value of legal data in AI training might be overestimated. He argues, “It’s unclear whether training on specifically legal data provides enough of an incentive over a general purpose model.”

Interestingly, general-purpose models can be repurposed with some tweaking. HaystackID, for instance, “fine-tunes” its generic models with e-discovery-specific legalese. However, this approach may not apply universally. While companies like HaystackID might get by with broader training, other providers, such as contract life cycle management firms, may need more specialized training.

In short, Brewer states, “Nobody is going to seriously argue that you cannot get a better product out of training on data that’s closer to what you expect your model to be capturing and producing.” The crux, however, lies in whether the incremental benefits justify the significant investment.