Synthetic data could help when it comes to evaluating RAGs, researchers find

Synthetic data generated by LLMs could provide a way to head off an impending data crunch, at least when it comes to evaluating RAG systems, a team of Dutch researchers has shown.

But the prospect of a tsunami of LLM-generated information means enterprises will have to rethink their data management systems and skill sets, according to one of the researchers, Pegasystems AI lab director and chief scientist Peter van der Putten.

A SynDAiTE workshop in Porto next week will be considering whether synthetic data offers a way to offset a projected “shortage of fresh text data” by 2050. Image data is expected to “become similarly limited” by 2060. This data crunch potentially creates “significant barriers to progress” in AI.

The paper, due to be presented at the workshop by the team of Dutch researchers led by Jonas van Elburg of the IR Lab at the University of Amsterdam, investigates “whether synthetic question-answer (QA) data generated by large language models (LLMs) can serve as an effective proxy for human-labelled benchmarks when such data is unavailable.”

The short answer is they can, up to a point. “Our findings suggest that synthetic benchmarks can provide a reliable signal when tuning retrieval parameters, particularly when the synthetic and human tasks are well-aligned in format and difficulty.” However, the paper continued, there were “substantial inconsistencies between synthetic and human benchmark ratings” when comparing results from different generator architectures.

It’s a question of lineage

The researchers said, “As such, synthetic benchmarks should not be treated as universally reliable, but rather as tools whose validity depends on the alignment between task design, metric choice, and evaluation target.”

Pegasystems’ Van der Putten told Blocks and Files that RAGs “are very useful systems, because you can ask questions about a particular topic, as long as you have a stack of documents. You can ask domain-specific questions.”

These domains could be insurance policies or HR regulations, he said. “The appeal is that you can build these kind of smart, let’s say, chat bots, without too much effort. Just find a stack of documents and put a RAG on top of it.”

But the answers RAGs provide need to be evaluated using verified data, he continued. “Of course, maintaining such a reference set of golden truth answers, that’s a lot of work, and typically, because it’s a lot of work, there’s also poor coverage.”

It’s also costly, he said. All of this “becomes a blocker on the critical path for rolling out many of those knowledge buddies.”

Using LLMs to generate synthetic question-and-answer data could fast-track the evaluation part of the process.
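To make the idea concrete, here is a minimal sketch of what such an evaluation loop could look like: an LLM writes one question per document chunk, the chunk it came from serves as the gold label, and the retriever is scored on how often it surfaces that chunk. The ask_llm and retrieve callables are hypothetical placeholders for whatever LLM client and retriever are actually in use; this illustrates the general approach, not the researchers’ pipeline.

```python
# Minimal sketch: build a synthetic QA benchmark from document chunks and use it
# to score a retriever's recall@k. `ask_llm` and `retrieve` are hypothetical
# stand-ins for a real LLM client and retriever; the generation methods studied
# in the paper are richer than this single-prompt approach.
from typing import Callable, List, Tuple


def build_synthetic_benchmark(
    chunks: List[str],
    ask_llm: Callable[[str], str],
) -> List[Tuple[str, int]]:
    """Ask the LLM to write one question per chunk; the chunk index is the gold label."""
    benchmark = []
    for idx, chunk in enumerate(chunks):
        prompt = (
            "Write one factual question that can be answered "
            f"only from the following passage:\n\n{chunk}"
        )
        benchmark.append((ask_llm(prompt).strip(), idx))
    return benchmark


def recall_at_k(
    benchmark: List[Tuple[str, int]],
    retrieve: Callable[[str, int], List[int]],
    k: int = 5,
) -> float:
    """Fraction of synthetic questions whose source chunk appears in the top-k results."""
    if not benchmark:
        return 0.0
    hits = sum(1 for question, gold_idx in benchmark if gold_idx in retrieve(question, k))
    return hits / len(benchmark)
```

Sweeping retrieval parameters such as k or chunk size against this synthetic score is the kind of tuning signal the paper found to be reasonably reliable, provided the synthetic questions resemble the real task in format and difficulty.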

Van der Putten said the researchers had “some hunches…on how to improve synthetic generation methods to get more consistency.”

Ultimately, he said, the best results would probably come from a combination of synthetic and human-curated data.

As for the possibility of a data crunch for AI, there were a number of factors at play, he said, including copyright concerns limiting the amount of data available for training models. The sensitivity, diversity, and quality of data could also be limiting factors.

AI training also meant a greater need for extensive historical data in certain categories, he said, but this was often not available.

All of this could create more demand for synthetic data. But, he continued, “That’s yet another category of data that will be generated and also that needs to be managed.”

Establishing data lineage would become important, he said. “You need to know that this was synthetic data and not real. And you also need to know how it was generated.”
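One lightweight way to capture that lineage is to attach provenance metadata to every synthetic record at the moment it is generated. The sketch below is illustrative only: the field names are assumptions, not an established schema, and Van der Putten does not prescribe a particular format.

```python
# Illustrative only: tag each synthetic QA record with provenance metadata so that
# downstream consumers can tell synthetic data from real data and see how it was
# generated. The field names are hypothetical, not an established lineage standard.
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json


@dataclass
class SyntheticRecord:
    question: str
    answer: str
    source_chunk_id: str       # which document chunk the pair was generated from
    generator_model: str       # the LLM (name/version) that produced it
    generation_prompt: str     # the prompt template used
    is_synthetic: bool = True  # explicit flag, never inferred downstream
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = SyntheticRecord(
    question="What does the policy cover?",
    answer="Accidental damage to the insured property.",
    source_chunk_id="policy-doc-17#chunk-3",
    generator_model="example-llm-v1",
    generation_prompt="Write one factual question ...",
)
print(json.dumps(asdict(record), indent=2))
```

Keeping the generator model and prompt alongside each record answers both of Van der Putten’s questions: whether the data is synthetic, and how it was generated.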

This will need to be incorporated into the skill set for data professionals, he said, as well as into the systems they use.

