Dictionaries and text documents are segmented by the indexer into chunks of size 512 using a custom JSON splitter and a standard text splitter. The chunks are then forwarded to an embedding model. This work considered three models, selected according to the MTEB leaderboard at the time of implementation:[25] Luminous-base (13B parameters), bge-base-en-v1.5 (335M),[26] and the smaller nomic-embed (137M). All generated embeddings, along with the corresponding texts as metadata, are finally loaded into a Qdrant vector database (https://qdrant.tech) with default settings for further processing.

NAIVE AND ADVANCED RAG SYSTEMS

Both the naive and the advanced system,[27] whose general workflows are shown in Fig. 4, were implemented with the LlamaIndex framework using the following models: (i) Starling-LM-7B-alpha (7.24B parameters); (ii) Mixtral-8x7B-Instruct-v0.1 (46.7B parameters); (iii) Llama-2-70B (69B parameters); and (iv) the BAAI/bge-base-en-v1.5 embedding model (109M parameters). After connecting to the Qdrant store, the framework generates an embedding for a given question, retrieves the most relevant documents, augments the question and documents within a prompt, and calls the LLM to generate an answer. The naive query engine was specified with a custom prompt template, shown in Fig. 5.

The advanced RAG comprises two additional components. A pre-retrieval router calls an LLM to classify a given query into one of the question categories: next-step prediction, related information identification, failure mode identification, safety verification, and default knowledge retrieval. Based on the LLM's choice, the corresponding query engine with a category-specific prompt template is used as a tool. The post-retrieval step implements a re-ranker, which first sorts the retrieved documents and then creates the prompt. Experimental results showed that the best performance was achieved with the flag embedding re-ranker (FER), which reorders the retrieved results at the document level using a specialized re-ranking model, and with the sentence embedding optimizer (SEO), which splits the documents into sentences, generates a similarity score for each sentence, and prunes sentences below a threshold or percentile.

EVALUATION

The RAG pipelines were first evaluated in an ablation study using reference-free metrics, which do not rely on a golden answer and thus enable evaluation without validation datasets; selected pipelines were then further evaluated by human experts. In the first case, responses generated by the RAG approaches were presented to an LLM evaluator with a specialized prompt instructing the LLM-as-a-judge to assess each generated response along a given dimension. Following the Ragas framework,[28] the team adopted four metrics: answer faithfulness, answer relevance, context recall, and context precision. Based on these metrics, an ablation study analyzed the impact of hyperparameters by changing one component at a time and evaluating the results. Optimal context precision was attained when retrieving at most 10 documents. Furthermore, integrating a re-ranker into the naive RAG pipeline substantially increased performance, particularly in terms of context recall and precision, indicating a reduction of noise in the retrieved data. The SEO re-ranker achieved a 51.41% improvement over the naive pipeline.
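To make the indexing step described at the start of this section concrete, the following is a minimal ingestion sketch. It assumes a recent LlamaIndex package layout, a locally running Qdrant instance, and a hypothetical ./fa_documents folder; the chunk size and embedding model follow the text, while the collection name and paths are illustrative.

```python
# Minimal ingestion sketch: chunk documents, embed them, and load them into Qdrant.
# Assumptions: LlamaIndex >= 0.10 package layout, a local Qdrant instance,
# and a hypothetical ./fa_documents folder.
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Load raw FA documents and split them into chunks of size 512.
documents = SimpleDirectoryReader("./fa_documents").load_data()
splitter = SentenceSplitter(chunk_size=512)

# Embed the chunks and store them, with their texts as metadata, in Qdrant.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(client=client, collection_name="fa_knowledge")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    transformations=[splitter],
    embed_model=embed_model,
    storage_context=storage_context,
)
```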
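The advanced query path, with a pre-retrieval router and a post-retrieval SEO step, could be wired up roughly as follows. This sketch builds on the index from the previous snippet; the prompt text, category descriptions, similarity cutoff, and example query are illustrative, not the article's actual templates, and a FlagEmbeddingReranker post-processor could be substituted to reproduce the FER variant.

```python
# Sketch of an advanced RAG query path: LLM router over per-category query
# engines, each with a custom prompt and an SEO post-retrieval pruning step.
from llama_index.core import PromptTemplate
from llama_index.core.postprocessor import SentenceEmbeddingOptimizer
from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

# Illustrative stand-in for the FA prompt template of Fig. 5.
FA_PROMPT = PromptTemplate(
    "Context information from failure analysis documents is below.\n"
    "---------------------\n{context_str}\n---------------------\n"
    "Given the context and no prior knowledge, answer the query.\n"
    "Query: {query_str}\nAnswer: "
)

# Post-retrieval step: score sentences and prune the least similar ones (SEO).
seo = SentenceEmbeddingOptimizer(percentile_cutoff=0.5)

# One query engine (tool) per question category; only two categories are shown.
tools = [
    QueryEngineTool.from_defaults(
        query_engine=index.as_query_engine(
            similarity_top_k=10,
            text_qa_template=FA_PROMPT,
            node_postprocessors=[seo],
        ),
        description="Predict the next step in a failure analysis job.",
    ),
    QueryEngineTool.from_defaults(
        query_engine=index.as_query_engine(
            similarity_top_k=10,
            text_qa_template=FA_PROMPT,
            node_postprocessors=[seo],
        ),
        description="Identify the most likely failure mode.",
    ),
]

# Pre-retrieval router: an LLM selects the category, and hence the tool, to use.
router = RouterQueryEngine(
    selector=LLMSingleSelector.from_defaults(),
    query_engine_tools=tools,
)
response = router.query("Which inspection should follow the optical microscopy step?")
```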
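The LLM-as-a-judge evaluation loop can be sketched with Ragas as below, assuming the 0.1-style API in which the metrics are imported as ready-made objects. The question, answer, contexts, and reference strings are hypothetical; note that in this API version the context metrics expect a reference column.

```python
# Evaluation sketch with Ragas: an LLM judge scores each generated response
# along faithfulness, answer relevance, context recall, and context precision.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Each row holds a question, the pipeline's answer, and the retrieved contexts.
eval_data = Dataset.from_dict({
    "question": ["Which failure mode matches a short between adjacent metal lines?"],
    "answer": ["A metal bridging defect, typically confirmed by SEM inspection."],
    "contexts": [["Bridging between metal lines is a common short-circuit defect ..."]],
    "ground_truth": ["Metal bridging (short-circuit) defect."],
})

# Faithfulness and answer relevance judge the response; context recall and
# precision judge the retrieval step.
result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(result)
```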
Moreover, expert evaluations were conducted to assess the quality of the RAG pipelines by rating their responses to a set of 20 questions for accuracy and usefulness on a five-point Likert scale, where 5 corresponds to the highest quality and 1 to the lowest. The results show that the best overall score of 2.4 was obtained by a pipeline without a pre-processor and with an SEO re-ranker on both types of data: (i) Total average without refusal: 2.39; (ii) Next-step prediction: 2.50; (iii) Related information identification: 1.67; (iv) Failure mode identification: 2.83; (v) Safety verification: 2.00; (vi) Default knowledge retrieval: 2.63.

MODULAR RAG SYSTEM

To address issues related to hallucinations and retrieval quality, a modular RAG pipeline was designed and implemented. Figure 6 shows the complete workflow, including the various optimization methods. The modular RAG pipeline begins with pre-retrieval enrichment, using step-back prompting and sub-query generation via specialized templates and few-shot demonstrations to improve question clarity. A semantic router, initialized with encoded examples and leveraging SFR-Embedding-Mistral, classifies queries into categories within milliseconds (a minimal routing sketch follows at the end of this section). Retrieval combines dense (Qdrant), sparse

Fig. 5 General FA prompt template.
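A semantic router of the kind described above can be approximated, without committing to any particular router library, by encoding a few example questions per category once at start-up and then routing each incoming query to the nearest category centroid. The encoder name, example questions, and similarity cutoff below are illustrative stand-ins (the article uses SFR-Embedding-Mistral).

```python
# Minimal semantic-router sketch: classify a query by cosine similarity to
# pre-encoded example questions per category. All names and examples are
# illustrative; any sentence-embedding model can serve as the encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")

ROUTES = {
    "next_step_prediction": [
        "What should the next analysis step be after optical inspection?",
        "Which measurement follows the curve tracing step?",
    ],
    "failure_mode_identification": [
        "What failure mode explains a short between adjacent metal lines?",
        "Which defect causes this leakage current signature?",
    ],
    "safety_verification": [
        "Is it safe to decapsulate this package with fuming nitric acid?",
    ],
}

# Encode the examples once at start-up; routing later only embeds the query,
# which is why classification takes milliseconds rather than an LLM call.
route_names = list(ROUTES)
centroids = np.stack([
    encoder.encode(examples, normalize_embeddings=True).mean(axis=0)
    for examples in ROUTES.values()
])

def route(query: str, default: str = "default_knowledge_retrieval") -> str:
    """Return the best-matching category, or the default below a similarity cutoff."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = centroids @ q
    best = int(np.argmax(scores))
    return route_names[best] if scores[best] > 0.5 else default

print(route("Which failure mechanism fits delamination at the die attach?"))
```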