2️⃣DSPy RAG

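Retrieval-Augmented Generation (RAG) is an approach that lets an LLM draw on a large corpus of knowledge from external sources: it queries a knowledge store to find relevant passages or content and then produces a well-refined answer.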

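RAG lets an LLM dynamically use up-to-date knowledge and give thoughtful answers even if it was not originally trained on the subject. This benefit comes at a price, though: setting up a refined RAG pipeline is more complex. To reduce that complexity, we turn to DSPy, which offers a seamless approach to setting up prompting pipelines!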
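To make the retrieve-then-generate idea concrete, here is a minimal sketch in plain Python. It is not DSPy code: the toy corpus, the word-overlap scoring, and the prompt layout are all assumptions chosen for illustration, and a real pipeline would send the resulting prompt to an LM.

```python
import re

# Hypothetical toy corpus in "Title | text" form; not the Wikipedia index.
TOY_CORPUS = [
    "Paris | Paris is the capital of France.",
    "Berlin | Berlin is the capital of Germany.",
    "Mount Everest | Mount Everest is the highest mountain on Earth.",
]

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, corpus, k=2):
    """Rank passages by word overlap with the question (a stand-in for a real retriever)."""
    question_tokens = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(question_tokens & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question so an LM could ground its answer."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

question = "What is the capital of France?"
passages = retrieve(question, TOY_CORPUS)
prompt = build_prompt(question, passages)
print(prompt)
```

The rest of this page replaces each of these hand-written pieces with a DSPy component.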

Configuring LM and RM

We start by setting up the language model (LM) and retrieval model (RM), which DSPy supports through multiple LM and RM APIs as well as local model hosting.

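In this notebook, we work with GPT-3.5 (gpt-3.5-turbo) and the ColBERTv2 retriever (a free server hosting a Wikipedia 2017 "abstracts" search index, which contains the first paragraph of each article from the 2017 dump). We configure the LM and RM within DSPy so that DSPy can call the respective module internally whenever generation or retrieval is needed.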

import dspy

turbo = dspy.OpenAI(model='gpt-3.5-turbo')
colbertv2_wiki17_abstracts = dspy.ColBERTv2(url='http://20.102.90.50:2017/wiki17_abstracts')

dspy.settings.configure(lm=turbo, rm=colbertv2_wiki17_abstracts)

Loading the Dataset

For this tutorial, we use the HotPotQA dataset, a collection of complex question-answer pairs that are typically answered in a multi-hop fashion. DSPy provides this dataset, and we can load it through the HotPotQA class:

from dspy.datasets import HotPotQA

# Load the dataset.
dataset = HotPotQA(train_seed=1, train_size=20, eval_seed=2023, dev_size=50, test_size=0)

# Tell DSPy that the 'question' field is the input. Any other fields are labels and/or metadata.
trainset = [x.with_inputs('question') for x in dataset.train]
devset = [x.with_inputs('question') for x in dataset.dev]

len(trainset), len(devset)

Output:

(20, 50)

Building Signatures

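Now that the data is loaded, let's define the signatures for the sub-tasks of our pipeline.

We could identify a simple question input and answer output, but since we are building a RAG pipeline, we also want to use contextual information from our ColBERT corpus. So let's define our signature as context, question --> answer.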

class GenerateAnswer(dspy.Signature):
    """Answer questions with short factoid answers."""

    context = dspy.InputField(desc="may contain relevant facts")
    question = dspy.InputField()
    answer = dspy.OutputField(desc="often between 1 and 5 words")

We include short descriptions for the context and answer fields to give the model more robust guidelines on what it will receive and what it should generate.
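As a rough mental model of how a docstring and field descriptions could end up in a prompt, the sketch below renders them as plain text. ToyField and render_prompt are hypothetical names and the layout is an assumption; DSPy's actual prompt construction is more involved and is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class ToyField:
    # Hypothetical stand-in for a signature field; not a DSPy class.
    name: str
    desc: str = ""
    is_input: bool = True

def render_prompt(instructions, fields, values):
    """Render the instructions, a format description, and the filled-in inputs."""
    lines = [instructions, "", "---", ""]
    # Format section: one line per field, using its description as a hint.
    for field in fields:
        hint = field.desc or field.name
        lines.append(f"{field.name.capitalize()}: <{hint}>")
    lines += ["", "---", ""]
    # Task section: inputs are filled in, outputs are left open for the LM.
    for field in fields:
        label = field.name.capitalize()
        if field.is_input:
            lines.append(f"{label}: {values[field.name]}")
        else:
            lines.append(f"{label}:")
    return "\n".join(lines)

toy_fields = [
    ToyField("context", desc="may contain relevant facts"),
    ToyField("question"),
    ToyField("answer", desc="often between 1 and 5 words", is_input=False),
]
toy_prompt = render_prompt(
    "Answer questions with short factoid answers.",
    toy_fields,
    {"context": "Paris is the capital of France.", "question": "What is the capital of France?"},
)
print(toy_prompt)
```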

Building the Pipeline

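We will build our RAG pipeline as a DSPy module, which requires two methods:

The __init__ method simply declares the sub-modules it needs: dspy.Retrieve and dspy.ChainOfThought. The latter is defined to implement our GenerateAnswer signature.
The forward method describes the control flow for answering the question with the modules we have: given a question, it retrieves the top 3 relevant passages and then feeds them as context for answer generation.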

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
    
    def forward(self, question):
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

Optimizing the Pipeline

Compiling the RAG program

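Having defined this program, let's now compile it. Compiling a program updates the parameters stored in each module. In our setting, this mainly means collecting and selecting good demonstrations to include in the prompt.

Compiling depends on three things:

A training set. We'll use the 20 question-answer examples from trainset above.
A metric for validation. We'll define a simple validate_context_and_answer that checks that the predicted answer is correct and that the retrieved context actually contains the answer.
A specific teleprompter. The DSPy compiler includes a number of teleprompters that can optimize your programs.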

from dspy.teleprompt import BootstrapFewShot

# Validation logic: check that the predicted answer is correct.
# Also check that the retrieved context does actually contain that answer.
def validate_context_and_answer(example, pred, trace=None):
    answer_EM = dspy.evaluate.answer_exact_match(example, pred)
    answer_PM = dspy.evaluate.answer_passage_match(example, pred)
    return answer_EM and answer_PM

# Set up a basic teleprompter, which will compile our RAG program.
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)

# Compile!
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

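Teleprompters: Teleprompters are powerful optimizers that can take any program and learn to bootstrap and select effective prompts for its modules. Hence the name, which means "prompting at a distance".

Different teleprompters offer various tradeoffs in how much they optimize cost versus quality, and so on. The example above uses a simple default, BootstrapFewShot.

If you like analogies, you can think of these three ingredients as the training data, the loss function, and the optimizer in a standard DNN supervised learning setup. Whereas SGD is a basic optimizer, there are more sophisticated (and more expensive!) ones such as Adam or RMSProp.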
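The idea of bootstrapping demonstrations can be sketched without DSPy: run the program on each training example and keep only the outputs that the metric accepts. The function below is a conceptual illustration, not BootstrapFewShot's implementation; bootstrap_demos, toy_program, the max_demos parameter, and the toy data are all hypothetical.

```python
def exact_match_metric(example, prediction):
    """Accept a prediction whose answer matches the label, ignoring case and spacing."""
    return example["answer"].strip().lower() == prediction["answer"].strip().lower()

def bootstrap_demos(program, trainset, metric, max_demos=4):
    """Run the program on training examples and keep outputs that pass the metric."""
    demos = []
    for example in trainset:
        prediction = program(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction["answer"]})
        if len(demos) >= max_demos:
            break
    return demos

# A toy "program" that answers from a lookup table; one answer is deliberately wrong.
TOY_ANSWERS = {
    "What is the capital of France?": "Paris",
    "What is 2 + 2?": "5",
    "What is the capital of Germany?": "Berlin",
}

def toy_program(question):
    return {"answer": TOY_ANSWERS[question]}

toy_trainset = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of Germany?", "answer": "berlin"},
]

demos = bootstrap_demos(toy_program, toy_trainset, exact_match_metric)
print(demos)
```

Only the two examples that pass the metric survive as demonstrations; the wrong arithmetic answer is filtered out, which is why the choice of metric matters.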

Executing the Pipeline

Now that we have compiled our RAG program, let's try it out.

# Ask any question you like to this simple RAG program.
my_question = "What castle did David Gregory inherit?"

# Get the prediction. This contains `pred.context` and `pred.answer`.
pred = compiled_rag(my_question)

# Print the contexts and the answer.
print(f"Question: {my_question}")
print(f"Predicted Answer: {pred.answer}")
print(f"Retrieved Contexts (truncated): {[c[:200] + '...' for c in pred.context]}")

Excellent. How about we inspect the last prompt sent to the LM?

turbo.inspect_history(n=1)

Output:

Answer questions with short factoid answers.

---

Question: At My Window was released by which American singer-songwriter?
Answer: John Townes Van Zandt

Question: "Everything Has Changed" is a song from an album released under which record label ?
Answer: Big Machine Records
...(truncated)

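Even though we did not write any of these detailed demonstrations, DSPy was able to bootstrap this roughly 3,000-token prompt for 3-shot retrieval-augmented generation with hard negative passages, and it uses Chain-of-Thought reasoning, all within a very simply written program.

This illustrates the power of composition and learning. Of course, this prompt was generated by one particular teleprompter, which may or may not be perfect in each setting. As you will see in DSPy, there is a large but systematic space of options to optimize and validate with respect to your program's quality and cost.

You can also easily inspect the learned objects themselves.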

for name, parameter in compiled_rag.named_predictors():
    print(name)
    print(parameter.demos[0])
    print()

Evaluating the Pipeline

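We can now evaluate our compiled_rag program on the dev set. Of course, this tiny set is not meant to be a reliable benchmark, but it is instructive to use it for illustration.

Let's evaluate the accuracy (exact match) of the predicted answers.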

from dspy.evaluate.evaluate import Evaluate

# Set up the `evaluate_on_hotpotqa` function. We'll use this many times below.
evaluate_on_hotpotqa = Evaluate(devset=devset, num_threads=1, display_progress=False, display_table=5)

# Evaluate the `compiled_rag` program with the `answer_exact_match` metric.
metric = dspy.evaluate.answer_exact_match
evaluate_on_hotpotqa(compiled_rag, metric=metric)

Output:

Average Metric: 22 / 50  (44.0): 100%|██████████| 50/50 [00:00<00:00, 116.45it/s]
Average Metric: 22 / 50  (44.0%)

44.0
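An exact-match metric like the one used above can be pictured as "normalize both strings, then compare". The sketch below is an assumption about the general technique, not a copy of dspy.evaluate.answer_exact_match, whose normalization rules may differ; normalize, exact_match, and accuracy_percent are hypothetical helper names.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}

def accuracy_percent(predictions, gold):
    """Percentage of predictions that exactly match one of their gold answers."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return round(100.0 * hits / len(gold), 1)

# Toy data for illustration only.
toy_predictions = ["Paris", "berlin!", "Rome"]
toy_gold = [["Paris"], ["Berlin"], ["Madrid"]]
print(accuracy_percent(toy_predictions, toy_gold))
```

The same arithmetic gives the reported score: 22 correct answers out of 50 is 100 * 22 / 50 = 44.0.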

Evaluating the Retrieval

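It is also instructive to look at the accuracy of retrieval. There are multiple ways to do this; a simple one is to check whether the retrieved passages include the ones that contain the answer.

We can use our dev set, which includes the gold titles that should be retrieved.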

def gold_passages_retrieved(example, pred, trace=None):
    gold_titles = set(map(dspy.evaluate.normalize_text, example['gold_titles']))
    found_titles = set(map(dspy.evaluate.normalize_text, [c.split(' | ')[0] for c in pred.context]))

    return gold_titles.issubset(found_titles)

compiled_rag_retrieval_score = evaluate_on_hotpotqa(compiled_rag, metric=gold_passages_retrieved)

Output:

Average Metric: 13 / 50  (26.0): 100%|██████████| 50/50 [00:00<00:00, 671.76it/s]
Average Metric: 13 / 50  (26.0%)
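To see why this metric is strict, the sketch below applies the same title check to toy passages in "Title | text" form. It uses plain lowercasing as a stand-in for dspy.evaluate.normalize_text, and the helper names and passages are hypothetical.

```python
def titles_from_passages(passages):
    """Extract the title part of each 'Title | text' passage, lowercased."""
    return {p.split(" | ")[0].strip().lower() for p in passages}

def all_gold_retrieved(gold_titles, passages):
    """True only if every gold title appears among the retrieved titles."""
    gold = {t.strip().lower() for t in gold_titles}
    return gold.issubset(titles_from_passages(passages))

toy_passages = [
    "Paris | Paris is the capital of France.",
    "Berlin | Berlin is the capital of Germany.",
    "Mount Everest | Mount Everest is the highest mountain on Earth.",
]

# A multi-hop question needs both of its gold passages to be retrieved.
print(all_gold_retrieved(["Paris", "Berlin"], toy_passages))
print(all_gold_retrieved(["Paris", "Rome"], toy_passages))
```

Because the check uses a subset test, retrieving one of two gold passages counts as a miss, which is part of why multi-hop questions score low with a single retrieval step.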

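Although this simple compiled_rag program answers a decent fraction of the questions correctly (over 40% on this tiny set), the quality of retrieval is much lower (26%).

This may suggest that the LM often relies on knowledge it memorized during training to answer questions. To address this weak retrieval, let's explore a second program that involves more advanced search behavior.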
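One way to probe this hypothesis is to cross-tabulate the two metrics per example and count how often the answer is right even though the gold passages were not retrieved. The sketch below assumes you have collected two per-example boolean lists; cross_tabulate and the toy values are hypothetical and are not results from this notebook.

```python
from collections import Counter

def cross_tabulate(answer_correct, retrieval_ok):
    """Count examples for each combination of answer and retrieval outcome."""
    counts = Counter(zip(answer_correct, retrieval_ok))
    return {
        "answer_right_retrieval_right": counts[(True, True)],
        "answer_right_retrieval_wrong": counts[(True, False)],
        "answer_wrong_retrieval_right": counts[(False, True)],
        "answer_wrong_retrieval_wrong": counts[(False, False)],
    }

# Toy per-example outcomes for illustration only.
toy_answer_correct = [True, True, False, True, False]
toy_retrieval_ok = [True, False, False, False, True]

table = cross_tabulate(toy_answer_correct, toy_retrieval_ok)
print(table)
```

A large "answer_right_retrieval_wrong" count would be consistent with the LM answering from memorized knowledge rather than from the retrieved context.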

