6️⃣Llama3-8B with LlamaIndex

Introducing Meta Llama 3: The most capable openly available LLM to date (Meta AI)

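Meta has developed and released a family of pretrained and instruction-tuned LLMs in 8B and 70B sizes. The instruction-tuned Llama 3 models are optimized for dialogue use cases and outperform many of the available open-source chat models on common industry benchmarks.

What is Meta Llama 3?

Llama 3 is the next generation of Meta's (Facebook's) open-source large language model (LLM). It is available as pretrained and instruction-tuned models at 8B and 70B parameters and supports a wide range of use cases.

Meta states that Llama 3 is the best open-source model currently available at its scale. According to Meta, improvements to the pretraining and post-training procedures substantially improved capabilities such as reasoning, code generation, and instruction following.

What's new in Llama 3

Llama 3 brings several improvements over the previous version, Llama 2: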

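Tokenizer improvements: a vocabulary of 128K tokens encodes language more efficiently, which improves performance
Inference efficiency: Grouped Query Attention (GQA) is applied to both the 8B and 70B models
Larger-scale pretraining: trained on more than 15 trillion tokens, a dataset over 7 times larger than Llama 2's
Instruction-tuning advances: model alignment that combines SFT, rejection sampling, PPO, and DPO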
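
To make the GQA item concrete, the sketch below estimates the size of the key/value cache for one sequence with and without shared KV heads. The layer count, head counts, and head dimension are assumptions based on the published 8B configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128) and are not stated on this page, so treat the numbers as a rough illustration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed Llama 3 8B shape (not stated on this page).
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128
SEQ_LEN = 8192  # sequence length in tokens; 2 bytes per value corresponds to bfloat16

# Standard multi-head attention: every query head has its own key/value head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# Grouped query attention: each group of 4 query heads shares one key/value head.
gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA cache: {mha / 2**30:.1f} GiB")
print(f"GQA cache: {gqa / 2**30:.1f} GiB")
print(f"reduction: {mha // gqa}x")
```

Under these assumptions the cache shrinks by the ratio of query heads to KV heads, which is where the inference-efficiency gain comes from.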

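Goals of Llama 3

Meta has the following goals in releasing Llama 3:

Build the best open-source models, on par with the best proprietary models available today
Incorporate developer feedback to increase the overall usefulness of Llama 3
Play a leading role in the responsible use and deployment of LLMs
Release models still under development early to improve access for the community

Starting with the text-based Llama 3 models, Meta plans to add multilingual and multimodal support, a longer context window, and further improvements in overall performance.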

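Llama 3 model architecture

Llama 3 adopts a relatively standard decoder-only transformer architecture. Its main characteristics are:

A vocabulary of 128K tokens that encodes language efficiently
Grouped Query Attention (GQA) in both the 8B and 70B models to improve inference efficiency
Training on sequences of 8,192 tokens, with a mask that keeps self-attention from crossing document boundaries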
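
The document-boundary masking in the last item can be pictured with a small example. The sketch below is a plain-Python illustration, not Meta's training code: several documents are packed into one sequence, and a position may attend only to earlier positions that belong to the same document. The function name and the toy `doc_ids` are made up for illustration.

```python
def document_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(doc_ids)
    return [
        # Causal (j <= i) and restricted to tokens of the same document.
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two documents packed into one 5-token sequence: tokens 0-2 and tokens 3-4.
doc_ids = [0, 0, 0, 1, 1]
mask = document_causal_mask(doc_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query position; the first token of the second document sees only itself, even though three tokens precede it in the packed sequence.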

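The training data consists of more than 15 trillion tokens collected from publicly available sources. Over 5% of it is high-quality non-English data covering more than 30 languages.

For quality control, Meta developed and applied a data-filtering pipeline that uses heuristic filters, NSFW filters, semantic deduplication, and text classifiers.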
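
As a rough picture of what such a pipeline does, here is a toy two-stage filter in plain Python. It is not Meta's pipeline: the thresholds are arbitrary, and it uses exact deduplication on normalized text in place of semantic deduplication, which would compare embeddings instead. All function names are hypothetical.

```python
import hashlib
import re


def passes_heuristics(text, min_words=5, max_symbol_ratio=0.3):
    """Toy heuristic filter: drop very short or symbol-heavy documents."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def normalized_hash(text):
    """Hash of the lowercased text with whitespace collapsed (exact dedup key)."""
    canonical = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def filter_corpus(docs):
    """Keep documents that pass the heuristics and are not duplicates."""
    seen, kept = set(), []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        key = normalized_hash(doc)
        if key in seen:  # same content as a document that was already kept
            continue
        seen.add(key)
        kept.append(doc)
    return kept


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "$$$ !!! ### @@@ %%% ^^^",  # symbol-heavy
    "Too short.",
]
print(filter_corpus(docs))
```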

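Developing with Llama 3

Meta says it has adopted an industry-leading approach so that the newly released Llama 3 models are as helpful as possible while being deployed responsibly. To that end, it introduced a new system-level approach to developing and deploying Llama.

Meta envisions the Llama 3 models as part of a larger system in which the developer is in control. Developers can design systems for their own goals with the Llama models as a foundation.

Instruction tuning also plays an important role in the safety of the models. Meta's instruction-tuned models were tested for safety through internal and external red teaming.

Red teaming uses human experts and automated methods to generate problematic prompts and evaluate the model's responses. The risk of misuse in areas such as chemistry, biology, and cybersecurity is assessed comprehensively, and the findings feed into safety fine-tuning of the models. Details are available in the model card.

The Llama Guard models serve as a foundation for prompt and response safety, and can easily be fine-tuned to create a new taxonomy to suit an application's needs.

Llama Guard 2 uses the recently announced MLCommons taxonomy to support the emergence of industry standards.

CyberSecEval 2 adds measures of an LLM's propensity to allow abuse of its code interpreter, its offensive cybersecurity capabilities, and its susceptibility to prompt injection attacks. (See the technical paper.)

Code Shield adds inference-time filtering of insecure code produced by LLMs. This mitigates risks around insecure code suggestions and code interpreter abuse, and supports secure command execution.

Because generative AI is moving quickly, Meta believes an open approach is an important way to bring the ecosystem together and mitigate potential harms. Accordingly, it updated the Responsible Use Guide, a comprehensive guide to developing LLMs responsibly.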

LlamaIndex: using Llama3-8B

This section shows how to use Llama 3 with LlamaIndex. The demo uses the Llama-3-8B-Instruct model.

To use the Llama3-8B model, log in to Hugging Face and request access on the page below; you can use the model once the request is approved. (The approval is sent to your HF account email within 30 minutes.)

meta-llama/Meta-Llama-3-8B-Instruct · Hugging Face

Installation

%pip install llama-index
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface

hf_token = "<Your_Huggingface_Token>"  # sign up at huggingface.co, then create and look up the token in your account settings

Tokenizer and Stopping ids

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token=hf_token,
)

stopping_ids = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
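
The stopping ids tell generation when to end. As a minimal sketch, assuming the Llama 3 special-token ids `<|end_of_text|>` = 128001 and `<|eot_id|>` = 128009 (the first matches the `eos_token_id` reported in the generation log on this page), the function below cuts a token stream at the first stopping id. The other token ids in the stream are made up, and `truncate_at_stop` is a hypothetical helper, not part of transformers or LlamaIndex.

```python
# Assumed Llama 3 special-token ids: <|end_of_text|> = 128001, <|eot_id|> = 128009.
STOPPING_IDS = {128001, 128009}


def truncate_at_stop(token_ids, stopping_ids):
    """Return the tokens generated before the first stopping id."""
    output = []
    for token_id in token_ids:
        if token_id in stopping_ids:
            break
        output.append(token_id)
    return output


# Made-up stream: the model ends its turn with <|eot_id|>, then keeps emitting tokens.
stream = [9906, 1917, 128009, 11, 12]

with_eot = truncate_at_stop(stream, STOPPING_IDS)
eos_only = truncate_at_stop(stream, {128001})

print(with_eot)  # generation ends at <|eot_id|>
print(eos_only)  # with only <|end_of_text|>, nothing in this stream stops it
```

This is why `<|eot_id|>` is listed next to `eos_token_id`: an instruction-tuned model ends each chat turn with `<|eot_id|>`, so stopping on the end-of-text token alone can let it run on past its answer.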

import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Optional quantization to 4bit
# import torch
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

llm = HuggingFaceLLM(
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={
        "token": hf_token,
        "torch_dtype": torch.bfloat16,  # comment this line and uncomment below to use 4bit
        # "quantization_config": quantization_config
    },
    generate_kwargs={
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
    },
    tokenizer_name="meta-llama/Meta-Llama-3-8B-Instruct",
    tokenizer_kwargs={"token": hf_token},
    stopping_ids=stopping_ids,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
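
When choosing between `torch.bfloat16` and the optional 4-bit quantization, a back-of-the-envelope estimate of weight memory helps. The sketch below assumes roughly 8.03 billion parameters for the 8B model (an approximate figure, not taken from this page). It counts the weights only, ignoring activations, the KV cache, and the small overhead of quantization constants.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8.03e9  # assumed approximate parameter count of the 8B model

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"bfloat16: {bf16_gb:.1f} GB")
print(f"4-bit:    {int4_gb:.1f} GB")
```

Under these assumptions, bfloat16 needs about four times the weight memory of 4-bit, which is the main reason to enable `quantization_config` on smaller GPUs.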

If you deploy the model to a Hugging Face Inference Endpoint, use the code below instead.

# from llama_index.llms.huggingface import HuggingFaceInferenceAPI

# llm = HuggingFaceInferenceAPI(
#     model_name="<HF Inference Endpoint>",
#     token='<HF Token>'
# )

Call complete with a prompt

# Korean prompt: "Can you tell me who Paul Graham is?"
response = llm.complete("Paul Graham이 누구인지 알려줄래?")

print(response)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


](https://www.youtube.com/watch?v=QVwq3w6W2k8)
* [Paul Graham - How to Start a Startup (Y Combinator)](https://www.youtube.com/watch?v=QVwq3w6W2k8)

## 4. How to Start a Startup (Paul Graham)

* A lecture by Paul Graham, founder of Y Combinator, on how to start a startup.
* Covers the reasons to start a startup, the characteristics of startups, and the factors behind startup success.

## 5. How to Write a Startup Law (Paul Graham)

* A lecture by Paul Graham, founder of Y Combinator, on law for startups.
* Covers the importance of law for startups, startups' legal problems, and ways to resolve them.

## 6. The Power of Iteration (Paul Graham)

* Paul Graham, founder of Y Combinator, repetition (iteration

Call chat with a list of messages

from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="당신은 MetaAI의 CEO 입니다."),  # "You are the CEO of MetaAI."
    ChatMessage(role="user", content="Llama3를 전세계에 소개해주세요."),  # "Please introduce Llama3 to the world."
]
response = llm.chat(messages)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.

print(response)

assistant: assistant

What an exciting moment! As the CEO of MetaAI, I am thrilled to introduce LLaMA 3, our latest and most advanced large language model to the world!

LLaMA 3 is the culmination of our team's tireless efforts to push the boundaries of natural language processing (NLP) and artificial intelligence (AI). This cutting-edge model is designed to understand and generate human-like text, with a focus on conversational dialogue and creative writing.

Here are some of the key features that set LLaMA 3 apart:

1. **Improved Conversational Understanding**: LLaMA 3 has been trained on a massive dataset of text from the internet, books, and other sources, allowing it to understand complex conversations and nuances of human language.
2. **Enhanced Creativity**: This model is capable of generating original text, including stories, poems, and even entire scripts. Its creative capabilities are unmatched, making it a valuable tool for writers, artists, and creatives.
3. **Multilingual Support**: LLaMA 3 can process and respond in multiple languages, including English, Spanish, French, German, Chinese, and many more. This opens up new possibilities for global communication and collaboration.
4. **Real-time Processing
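
For context on what a chat call has to send to the model, the sketch below renders a message list into the Llama 3 Instruct prompt layout in plain Python. The special tokens follow the published Llama 3 chat format as I understand it; `format_llama3_chat` is a hypothetical helper for illustration. In practice, the tokenizer's `apply_chat_template` method in transformers is the usual way to produce this string.

```python
def format_llama3_chat(messages):
    """Render (role, content) pairs into the Llama 3 Instruct prompt layout."""
    prompt = "<|begin_of_text|>"
    for role, content in messages:
        # Each turn: a role header, a blank line, the content, then <|eot_id|>.
        prompt += (
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{content.strip()}<|eot_id|>"
        )
    # Open an assistant turn so the model writes the reply next.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    ("system", "You are the CEO of MetaAI."),
    ("user", "Please introduce Llama3 to the world."),
]
prompt = format_llama3_chat(messages)
print(prompt)
```

Every turn ends with `<|eot_id|>`, which is the same token the stopping ids rely on to detect the end of the assistant's reply.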

Build RAG pipeline with Llama3

Download & Load Dataset

!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O "paul_graham_essay.txt"



from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["paul_graham_essay.txt"]
).load_data()

Embedding Model

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


LLM & Embedding Model

from llama_index.core import Settings

# bge embedding model
Settings.embed_model = embed_model

# Llama-3-8B-Instruct model
Settings.llm = llm

Create Index

index = VectorStoreIndex.from_documents(
    documents,
)
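
Before anything is embedded, the index splits each document into smaller chunks. The sketch below shows the general idea with overlapping word windows. It is a simplification: LlamaIndex's default splitter is sentence-aware and counts tokens rather than words, and the sizes here are toy values chosen for readability. `chunk_words` is a hypothetical helper.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into overlapping windows of words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


text = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(text)
for chunk in chunks:
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk.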

Create QueryEngine

query_engine = index.as_query_engine(similarity_top_k=3)
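
`similarity_top_k=3` means the three chunks most similar to the query are passed to the LLM. A minimal sketch of that retrieval step follows, assuming cosine similarity as the score and using made-up 3-dimensional vectors in place of real embeddings (the bge-small model produces 384-dimensional vectors). The chunk labels and `top_k` helper are hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, index, k=3):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    return heapq.nlargest(k, scored)


# Toy index of (chunk text, embedding) pairs; the vectors are invented.
index = [
    ("chunk about programming on the IBM 1401", [0.9, 0.1, 0.0]),
    ("chunk about painting", [0.1, 0.9, 0.1]),
    ("chunk about Y Combinator", [0.2, 0.1, 0.9]),
    ("chunk about writing short stories", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, index, k=3)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

A larger `k` gives the LLM more context to draw on, at the cost of a longer prompt and more loosely related chunks.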

Querying & Response

# Korean query: "What did Paul Graham do growing up?"
response = query_engine.query("폴 그레이엄은 자라면서 무엇을 했나요??")
print(response)

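1. Paul Graham wrote short stories outside of school.
2. In 9th grade he started programming in Fortran on an IBM 1401.
3. He built his own microcomputer using a Heathkit kit.
4. Around 1980 he convinced his father to buy a TRS-80, which he used to write simple games and a word processor.
5. He planned to study philosophy in college, but eventually switched to AI under the influence of a Heinlein novel and a PBS documentary.
6. He wrote essays on a variety of topics and worked on spam filters, painting, cooking for groups, and so on.
7. He bought another building in Cambridge to use as an office.
8. He hosted dinner parties for friends every Thursday night, which taught him how to cook for groups.
9. He persuaded Jessica Livingston to quit her job and work at his startup.
10. He founded Y Combinator with Jessica Livingston, Robert Tappan Morris, and Trevor Blackwell.
11. He wrote a talk on how to start a startup and gave it at the Harvard Computer Society.
12. He began developing Arc, a new dialect of Lisp, with Dan.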
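
An answer like this comes from a prompt that combines the retrieved chunks with the question. The sketch below shows that assembly step in plain Python. The template is illustrative: it is modeled on the style of LlamaIndex's default question-answering prompt but is not guaranteed to match the library's exact wording. `build_rag_prompt` and the sample chunks are hypothetical.

```python
# Illustrative template; the library's actual default wording may differ.
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information and not prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)


def build_rag_prompt(query, retrieved_chunks):
    """Join the retrieved chunks and place them above the question."""
    context = "\n\n".join(retrieved_chunks)
    return QA_TEMPLATE.format(context_str=context, query_str=query)


# Made-up chunks standing in for the top-3 retrieval results.
retrieved_chunks = [
    "I wrote short stories outside of school.",
    "In 9th grade I started programming on an IBM 1401.",
    "I built a microcomputer from a kit.",
]
rag_prompt = build_rag_prompt("What did Paul Graham do growing up?", retrieved_chunks)
print(rag_prompt)
```

Because the model is told to rely on the supplied context, the quality of the answer depends directly on which chunks the retriever returns.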

