3️⃣ Data Connectors

In RAG business scenarios, data loading is a critical step. LlamaIndex defines a Data Connectors interface and provides a number of implementations that load data from different data sources and data formats, including:

  • Simple Directory Reader

  • Psychic Reader

  • DeepLake Reader

  • Qdrant Reader

  • Discord Reader

  • MongoDB Reader

  • Chroma Reader

  • MyScale Reader

  • Faiss Reader

  • Obsidian Reader

  • Slack Reader

  • Web Page Reader

  • Pinecone Reader

  • Mbox Reader

  • Milvus Reader

  • Notion Reader

  • GitHub Repo Reader

  • Google Docs Reader

  • Database Reader

  • Twitter Reader

  • Weaviate Reader

  • Make Reader
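What the readers listed above have in common is a single loading contract: each one hides its data source behind a method that returns a list of document objects. The following is a minimal standard-library sketch of that pattern, not LlamaIndex's actual classes. `Document`, `Reader`, `InMemoryReader`, and `load_all` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: raw text plus metadata."""
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


class Reader(Protocol):
    """Hypothetical connector contract: every reader returns a list of documents."""

    def load_data(self) -> list[Document]: ...


class InMemoryReader:
    """Toy connector that 'loads' documents from a dict of name -> text."""

    def __init__(self, records: dict[str, str]) -> None:
        self.records = records

    def load_data(self) -> list[Document]:
        return [
            Document(text=text, metadata={"source": name})
            for name, text in self.records.items()
        ]


def load_all(readers: list[Reader]) -> list[Document]:
    """Downstream code depends only on load_data(), not on the data source."""
    docs: list[Document] = []
    for reader in readers:
        docs.extend(reader.load_data())
    return docs


docs = load_all([InMemoryReader({"a.md": "hello"}), InMemoryReader({"b.md": "world"})])
print([d.metadata["source"] for d in docs])
```

Because indexing code only needs the returned list, swapping one data source for another should not require changes further down the pipeline.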

LlamaHub

Data Connectors for LlamaIndex are distributed through LlamaHub. LlamaHub is an open-source repository of data connectors that can be easily integrated into any LlamaIndex application.

Usage examples

Using the Data Connectors built into LlamaIndex

The LlamaIndex framework ships with a set of built-in Data Connectors, which developers can use directly without loading them from LlamaHub.

The following code shows how to read data from a web page.

# Assumption: in llama-index 0.10+, this reader ships in the separate llama-index-readers-web package
from llama_index.readers.web import SimpleWebPageReader

documents = SimpleWebPageReader(html_to_text=True).load_data(
    ["http://paulgraham.com/worked.html"]
)
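The `html_to_text=True` flag asks the reader to return readable text rather than raw markup. As a rough illustration of what such a conversion involves, here is a standard-library sketch that drops tags and skips script and style content. It is an assumption-laden stand-in: `TextExtractor` and `html_to_plain_text` are hypothetical names, and the real reader may use a different converter and produce differently formatted output.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_plain_text(html: str) -> str:
    """Return the visible text of an HTML string, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_plain_text(sample))
```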

Loading Data Connectors from LlamaHub

The following sample loads the Markdown document data connector from LlamaHub. For details on this connector, see https://llamahub.ai/l/file-markdown.

from pathlib import Path
from llama_index.core import download_loader

MarkdownReader = download_loader("MarkdownReader")

loader = MarkdownReader()
documents = loader.load_data(file=Path('./README.md'))
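To give a feel for what a Markdown loader does with a file, the sketch below splits Markdown text into sections at heading lines, so that each section could become its own document. This is an assumption about the general approach, not the `MarkdownReader` implementation. `split_markdown_by_heading` is a hypothetical helper, and it deliberately ignores edge cases such as `#` lines inside code fences.

```python
import re

# ATX-style headings only: one to six '#' characters followed by whitespace.
HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets heading ''."""
    sections: list[tuple[str, str]] = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section before starting a new one.
            if title or any(part.strip() for part in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    if title or any(part.strip() for part in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


sample = "# Intro\nOverview text\n\n## Getting Started\nInstall it\n"
for heading, text in split_markdown_by_heading(sample):
    print(f"{heading!r}: {text!r}")
```

Splitting at headings keeps each chunk topically coherent, which tends to help retrieval later on.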

LlamaIndex: Data Connectors

Setup Environments

import os
from dotenv import load_dotenv  

!echo "OPENAI_API_KEY=<Your OpenAI Key>" >> .env # OpenAI API key
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
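Because `os.getenv` returns `None` when the variable is missing, and the `echo` line writes the literal placeholder unless it is edited first, it can help to fail early with a clear message. A small sketch of such a check follows. `require_env` is a hypothetical helper, and `DEMO_API_KEY` is a made-up variable used only so that the example runs without a real key.

```python
import os


def require_env(name: str) -> str:
    """Return the variable's value, or raise if it is missing or still a placeholder."""
    value = os.getenv(name, "").strip()
    # A value like "<Your OpenAI Key>" means the template was never filled in.
    if not value or value.startswith("<"):
        raise RuntimeError(f"{name} is not set; add it to .env or export it first")
    return value


# Demo with a made-up variable so no real key is needed.
os.environ["DEMO_API_KEY"] = "sk-demo-1234"
key = require_env("DEMO_API_KEY")
print(f"{key[:3]}... ({len(key)} characters)")
```

Note that `>>` appends to `.env`, so re-running the cell adds a duplicate line each time.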

Data connectors

LlamaIndex: Data Connectors

Using the LlamaIndex tutorials repository (https://github.com/Anil-matcha/LlamaIndex-tutorials) as sample data, load its files into LlamaIndex Document objects so that they can be queried with a query engine.

!git clone https://github.com/Anil-matcha/LlamaIndex-tutorials.git
Cloning into 'LlamaIndex-tutorials'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (15/15), done.
remote: Total 16 (delta 3), reused 4 (delta 1), pack-reused 0
Unpacking objects: 100% (16/16), 8.04 KiB | 1.15 MiB/s, done.
from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(
    input_dir="./LlamaIndex-tutorials",
    required_exts=[".md"],
    recursive=True
)
docs = reader.load_data()
docs
[Document(id_='e06e478c-8581-4ff8-b3ba-59be370e8ffc', embedding=None, metadata={'file_path': '/home/kubwa/kubwai/13-LlamaIndex/LlamaIndex-Tutorials/04_Data_Connectors/LlamaIndex-tutorials/README.md', 'file_name': 'README.md', 'file_type': 'text/markdown', 'file_size': 455, 'creation_date': '2024-04-15', 'last_modified_date': '2024-04-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nLlamaIndex tutorials\n\nOverview and tutorials of the LlamaIndex Library\n\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='3eb8990d-ae3e-4940-9f23-809934e30e33', embedding=None, metadata={'file_path': '/home/kubwa/kubwai/13-LlamaIndex/LlamaIndex-Tutorials/04_Data_Connectors/LlamaIndex-tutorials/README.md', 'file_name': 'README.md', 'file_type': 'text/markdown', 'file_size': 455, 'creation_date': '2024-04-15', 'last_modified_date': '2024-04-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nGetting Started\n\nVideos coming soon https://www.youtube.com/@AnilChandraNaiduMatcha\n.Subscribe to the channel to get latest content\n\nFollow Anil Chandra Naidu Matcha on twitter for updates\n\nJoin our discord server for support https://discord.gg/FBpafqbbYF\n\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'),
 Document(id_='195d642a-f821-4d8f-82ee-10e3059633a7', embedding=None, metadata={'file_path': '/home/kubwa/kubwai/13-LlamaIndex/LlamaIndex-Tutorials/04_Data_Connectors/LlamaIndex-tutorials/README.md', 'file_name': 'README.md', 'file_type': 'text/markdown', 'file_size': 455, 'creation_date': '2024-04-15', 'last_modified_date': '2024-04-15'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='\n\nAlso check\n\nLlamaIndex Course\n\n', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')]
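The `input_dir`, `required_exts`, and `recursive` arguments used above decide which files are picked up before any parsing happens. A minimal standard-library sketch of that selection step is shown below. `find_files` is a hypothetical helper, not part of LlamaIndex, and the real reader applies further rules (for example around hidden files) that are not modeled here.

```python
import tempfile
from pathlib import Path


def find_files(input_dir: str, required_exts: list[str], recursive: bool = True) -> list[Path]:
    """Return files under input_dir whose suffix is in required_exts, sorted by path."""
    root = Path(input_dir)
    candidates = root.rglob("*") if recursive else root.glob("*")
    wanted = {ext.lower() for ext in required_exts}
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() in wanted)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "README.md").write_text("# top")
    (root / "sub" / "notes.md").write_text("# nested")
    (root / "sub" / "data.csv").write_text("a,b")
    found = [p.relative_to(root).as_posix() for p in find_files(tmp, [".md"])]
    top_only = [p.relative_to(root).as_posix() for p in find_files(tmp, [".md"], recursive=False)]

print(found)
print(top_only)
```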
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is LlamaIndex?")

print(response)
LlamaIndex is a library that provides an overview and tutorials for users.
response = query_engine.query("What do the LlamaIndex tutorials provide?")

print(response)
The LlamaIndex tutorials provide an overview and tutorials of the LlamaIndex Library.
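Conceptually, a vector index answers a query by embedding each document chunk, ranking the chunks by similarity to the embedded query, and handing the best matches to an LLM to phrase the answer. The toy sketch below covers only the ranking step, and it substitutes word counts for a learned embedding model. That is a deliberate simplification, not how `VectorStoreIndex` computes embeddings. `embed`, `cosine`, and `retrieve` are hypothetical helpers.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real index uses an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vec, embed(chunk)), reverse=True)
    return ranked[:top_k]


chunks = [
    "LlamaIndex tutorials Overview and tutorials of the LlamaIndex Library",
    "Getting Started Videos coming soon",
    "Also check LlamaIndex Course",
]
print(retrieve("What do the LlamaIndex tutorials provide", chunks))
```

With so little source text in the README, the retrieved context is thin, which is one plausible reason the answers above mostly restate the README's own wording.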

LlamaHub: Data Connectors

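Using the LlamaIndex tutorials repository (https://github.com/Anil-matcha/LlamaIndex-tutorials) as sample data, load its README into LlamaIndex Document objects with the Markdown connector so that it can be queried with a query engine.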

from pathlib import Path
from llama_index.core import download_loader

MarkdownReader = download_loader("MarkdownReader")

loader = MarkdownReader()
documents = loader.load_data(file=Path('./LlamaIndex-tutorials/README.md'))
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Which version of the framework does the LlamaIndex getting-started tutorial use?")

print(response)
The LlamaIndex tutorials use a specific version of a framework.
