5️⃣ Naive RAG

Naive RAG with LlamaIndex and Weaviate

This guide walks through a naive RAG pipeline built with LlamaIndex and Weaviate.

Prerequisites

#%pip install -U weaviate-client
#%pip install llama-index-vector-stores-weaviate
import llama_index
import weaviate
from importlib.metadata import version
import os
from dotenv import load_dotenv

# Sanity-check installed package versions
print(f"llama-index version: {version('llama-index')}")
print(f"weaviate-client version: {version('weaviate-client')}")

!echo "OPENAI_API_KEY=<Your OpenAI Key>" >> .env  # set the OpenAI API key once
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
True

Embedding Model & LLM

First, we define the embedding model and the LLM on the global Settings object. This way, the models don't have to be specified explicitly again elsewhere in the code.

  • Embedding model: used to generate vector embeddings for the document chunks as well as for the query.

  • LLM: used to generate an answer based on the user query and the relevant context.

Weaviate can also be configured with its own embedding model (a vectorizer module) and LLM (a generative module), but in this case the LLM and embedding model defined in LlamaIndex are used.

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding()

Load data

!mkdir -p 'data'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham_essay.txt'
from llama_index.core import SimpleDirectoryReader

# Load data
documents = SimpleDirectoryReader(
        input_files=["./data/paul_graham_essay.txt"]
).load_data()

#documents

Chunk documents to Nodes

Because the full document is too large to fit into the LLM's context window, it must be split into smaller text chunks, which LlamaIndex calls nodes.

With SimpleNodeParser, we set the chunk size to 1024.

from llama_index.core.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)

# Extract nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)
i = 10

print(f"Text: \n{nodes[i].text}")
Text: 
But for what it's worth, as a sort of time capsule, here's why I don't like the look of Java:  
  
1\. It has been so energetically hyped. Real standards don't have to be promoted. No one had to promote C, or Unix, or HTML. A real standard tends to be already established by the time most people hear about it. On the hacker radar screen, Perl is as big as Java, or bigger, just on the strength of its own merits.  
  
2\. It's aimed low. In the original Java white paper, Gosling explicitly says Java was designed not to be too difficult for programmers used to C. It was designed to be another C++: C plus a few ideas taken from more advanced languages. Like the creators of sitcoms or junk food or package tours, Java's designers were consciously designing a product for people not as smart as them. Historically, languages designed for other people to use have been bad: Cobol, PL/I, Pascal, Ada, C++. The good languages have been those that were designed for their own creators: C, Perl, Smalltalk, Lisp.  
  
3\. It has ulterior motives. Someone once said that the world would be a better place if people only wrote books because they had something to say, rather than because they wanted to write a book. Likewise, the reason we hear about Java all the time is not because it has something to say about programming languages. We hear about Java as part of a plan by Sun to undermine Microsoft.  
  
4\. No one loves it. C, Perl, Python, Smalltalk, and Lisp programmers love their languages. I've never heard anyone say that they loved Java.  
  
5\. People are forced to use it. A lot of the people I know using Java are using it because they feel they have to. Either it's something they felt they had to do to get funded, or something they thought customers would want, or something they were told to do by management. These are smart people; if the technology was good, they'd have used it voluntarily.  
  
6\. It has too many cooks. The best programming languages have been developed by small groups. Java seems to be run by a committee. If it turns out to be a good language, it will be the first time in history that a committee has designed a good language.  
  
7\. It's bureaucratic. From what little I know about Java, there seem to be a lot of protocols for doing things. Really good languages aren't like that. They let you do what you want and get out of the way.  
  
8\. It's pseudo-hip. Sun now pretends that Java is a grassroots, open-source language effort like Perl or Python. This one just happens to be controlled by a giant company. So the language is likely to have the same drab clunkiness as anything else that comes out of a big company.  
  
9\. It's designed for large organizations. Large organizations have different aims from hackers. They want languages that are (believed to be) suitable for use by large teams of mediocre programmers-- languages with features that, like the speed limiters in U-Haul trucks, prevent fools from doing too much damage. Hackers don't like a language that talks down to them. Hackers just want power. Historically, languages designed for large organizations (PL/I, Ada) have lost, while hacker languages (C, Perl) have won. The reason: today's teenage hacker is tomorrow's CTO.  
  
10\. The wrong people like it. The programmers I admire most are not, on the whole, captivated by Java. Who does like Java? Suits, who don't know one language from another, but know that they keep hearing about Java in the press; programmers at big companies, who are amazed to find that there is something even better than C++; and plug-and-chug undergrads, who are ready to like anything that might get them a job (will this be on the test?). These people's opinions change with every wind.  
  
11\. Its daddy is in a pinch. Sun's business model is being undermined on two fronts. Cheap Intel processors, of the same type used in desktop machines, are now more than fast enough for servers. And FreeBSD seems to be at least as good an OS for servers as Solaris. Sun's advertising implies that you need Sun servers for industrial strength applications. If this were true, Yahoo would be first in line to buy Suns; but when I worked there, the servers were all Intel boxes running FreeBSD. This bodes ill for Sun's future.
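SimpleNodeParser splits by token count and respects sentence boundaries, with a small overlap between neighboring chunks. The character-based sketch below only illustrates the core sliding-window idea; the window and overlap sizes are illustrative, not LlamaIndex's defaults, and real node parsers work on tokens, not characters.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Naive character-based chunker: slide a fixed-size window with overlap.

    SimpleNodeParser works on tokens and respects sentence boundaries;
    this sketch ignores both and exists only to illustrate the idea.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap  # each step advances by chunk_size - overlap
    return chunks

# 100 characters with a 40-char window and 10-char overlap -> windows start at 0, 30, 60, 90
print(len(chunk_text("x" * 100)))  # → 4
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.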

Building Index

  • Build an index that stores all of the external knowledge in a Weaviate vector database.

  • First, connect to a Weaviate instance. Here, Weaviate Embedded is used.

  • Note that an embedded instance persists only as long as the parent application is running. For a more permanent solution, consider a managed Weaviate Cloud Services (WCS) instance, which offers a 14-day free trial.

import weaviate

# Connect to your Weaviate instance
client = weaviate.Client(
    embedded_options=weaviate.embedded.EmbeddedOptions(), 
)

print(f"Client is ready: {client.is_ready()}")

# Print this line to get more information about the client
# client.get_meta()
Client is ready: True

Next, build a VectorStoreIndex on top of the Weaviate client to store the data and interact with it.

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore

index_name = "MyExternalContext"

# Construct vector store
vector_store = WeaviateVectorStore(
    weaviate_client = client, 
    index_name = index_name
)

# Set up the storage for the embeddings
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# If an index with the same index name already exists within Weaviate, delete it
if client.schema.exists(index_name):
    client.schema.delete_class(index_name)

# Setup the index
# build VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex(
    nodes,
    storage_context = storage_context,
)
import json
response = client.schema.get(index_name)

print(json.dumps(response, indent=2))
{
  "class": "MyExternalContext",
  "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanupIntervalSeconds": 60,
    "stopwords": {
      "additions": null,
      "preset": "en",
      "removals": null
    }
  },
  "multiTenancyConfig": {
    "enabled": false
  },
  "properties": [
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "file_type",
      "tokenization": "word"
    },
    {
      "dataType": [
        "number"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": false,
      "name": "file_size"
    },
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "last_modified_date",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "text",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "file_name",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "creation_date",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "_node_content",
      "tokenization": "word"
    },
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "_node_type",
      "tokenization": "word"
    },
    {
      "dataType": [
        "uuid"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": false,
      "name": "document_id"
    },
    {
      "dataType": [
        "uuid"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": false,
      "name": "doc_id"
    },
    {
      "dataType": [
        "uuid"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": false,
      "name": "ref_doc_id"
    },
    {
      "dataType": [
        "text"
      ],
      "description": "This property was generated by Weaviate's auto-schema feature on Mon Apr 15 23:30:35 2024",
      "indexFilterable": true,
      "indexSearchable": true,
      "name": "file_path",
      "tokenization": "word"
    }
  ],
  "replicationConfig": {
    "factor": 1
  },
  "shardingConfig": {
    "virtualPerPhysical": 128,
    "desiredCount": 1,
    "actualCount": 1,
    "desiredVirtualCount": 128,
    "actualVirtualCount": 128,
    "key": "_id",
    "strategy": "hash",
    "function": "murmur3"
  },
  "vectorIndexConfig": {
    "skip": false,
    "cleanupIntervalSeconds": 300,
    "maxConnections": 64,
    "efConstruction": 128,
    "ef": -1,
    "dynamicEfMin": 100,
    "dynamicEfMax": 500,
    "dynamicEfFactor": 8,
    "vectorCacheMaxObjects": 1000000000000,
    "flatSearchCutoff": 40000,
    "distance": "cosine",
    "pq": {
      "enabled": false,
      "bitCompression": false,
      "segments": 0,
      "centroids": 256,
      "trainingLimit": 100000,
      "encoder": {
        "type": "kmeans",
        "distribution": "log-normal"
      }
    }
  },
  "vectorIndexType": "hnsw",
  "vectorizer": "none"
}

Setup Query Engine

Finally, set up the index as a query engine.

# The QueryEngine class is equipped with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()
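Under the hood, the query engine wires a retriever (embed the query, fetch the top-k most similar nodes from the vector store) to a response synthesizer (feed the query plus the retrieved context to the LLM). The toy example below shows only the retrieval step, using exact cosine similarity over hand-made 2-D vectors; the vectors and k are made up for illustration, whereas Weaviate's HNSW index does this approximately over real embeddings.

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve_top_k(query_vec: list[float], node_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Rank node indices by cosine similarity to the query vector."""
    ranked = sorted(range(len(node_vecs)),
                    key=lambda i: cosine(query_vec, node_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D vectors standing in for OpenAI embeddings (illustrative only)
node_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(retrieve_top_k([0.9, 0.1], node_vecs))  # → [0, 1]
```

The indices returned here correspond to the nodes whose text would be passed to the LLM as context in the generation step.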

Query & Response

Now, you can run naive RAG queries on your data.

# Use your Default RAG
response = query_engine.query(
    "Interleaf에서 무슨일이 벌어졌어?"
)
print(str(response))
At Interleaf, inspired by Emacs, they added a scripting language and made that scripting language a dialect of Lisp. They hired a Lisp hacker to work in it, but Paul Graham didn't know C and didn't want to learn it, so he never understood most of the software. Paul Graham was also very irresponsible during that period, and since a programming job meant showing up every day during certain working hours, that was a problem. Paul Graham learned many useful lessons at Interleaf, but they were mostly about what not to do.
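The answer above is synthesized from the retrieved chunks. Conceptually, the generation step stuffs those chunks into a prompt template before calling the LLM; the template below is a paraphrase of a typical RAG QA prompt, not LlamaIndex's exact default wording.

```python
def build_rag_prompt(query: str, contexts: list[str]) -> str:
    # Paraphrase of a typical RAG QA template (not LlamaIndex's exact default).
    context_str = "\n\n".join(contexts)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context_str}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, "
        f"answer the query.\nQuery: {query}\nAnswer: "
    )

prompt = build_rag_prompt(
    "What happened at Interleaf?",
    ["Interleaf made software for creating documents.",
     "Inspired by Emacs, they added a scripting language."],
)
print(prompt)
```

Instructing the model to rely on the context "and not prior knowledge" is what grounds the answer in the retrieved documents rather than in the LLM's training data.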
