An end-to-end guide for scaling and serving LLM applications in production.
This repo currently contains one such application: a retrieval-augmented generation (RAG) app for answering questions about supplied information. By default, the app uses the Ray documentation as its source of information. The app first indexes the documentation in a vector database and then uses an LLM to generate responses to questions, augmenting each question with relevant information retrieved from the index.
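To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop described above; `embed_fn`, `vector_index`, and `llm` are illustrative placeholders rather than this app's actual interfaces.

```python
# Minimal sketch of the retrieve-then-generate (RAG) flow described above.
# `embed_fn`, `vector_index`, and `llm` are illustrative placeholders, not the
# actual interfaces used by this app.

def answer_question(query, embed_fn, vector_index, llm, num_chunks=5):
    # 1) Embed the query with the same model that was used to index the docs.
    query_embedding = embed_fn(query)

    # 2) Retrieve the most relevant documentation chunks from the vector DB.
    chunks = vector_index.search(query_embedding, top_k=num_chunks)

    # 3) Augment the question with the retrieved context and ask the LLM.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using the additional context provided.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm(prompt)
```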
- Start a new Anyscale workspace on staging using a `g3.8xlarge` head node on an AWS cloud.
- Use the `default_cluster_env_2.6.2_py39` cluster environment.
First, clone this repository.
```bash
git clone https://github.com/ray-project/llm-applications.git .
```

Then set up the environment by specifying the values in your `.env` file and installing the dependencies:
```bash
cp ./envs/.env_template .envs
source .envs
pip install --user -r requirements.txt
pre-commit install
pre-commit autoupdate
```

Our data is already available at `/efs/shared_storage/pcmoritz/docs.ray.io/en/master/` (on Staging), but if you want to load it yourself, run this bash command:
```bash
bash scrape-docs.sh
```

Local installation with brew on macOS:
```bash
brew install postgresql
brew install pgvector
psql -c "CREATE USER postgres WITH SUPERUSER;"
# pragma: allowlist nextline secret
psql -c "ALTER USER postgres with password 'postgres';"
psql -c "CREATE EXTENSION vector;"
psql -f migrations/initial.sql
python app/index.py create-index
```

Alternatively, run the setup script and then create the index:

```bash
bash setup-pgvector.sh
sudo -u postgres psql -f migrations/initial.sql
python app/index.py create-index
```
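For intuition, retrieval against the pgvector index boils down to a nearest-neighbor SQL query. Below is a hypothetical sketch, assuming an illustrative `document` table with `text` and `embedding` columns; the actual schema lives in `migrations/initial.sql`, and the app's own query code may differ.

```python
# Hypothetical sketch of a similarity search against pgvector; the table name,
# column names, and connection settings are illustrative placeholders.
import psycopg2
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("thenlper/gte-base")
query_embedding = embedder.encode("What is the default batch size for map_batches?")

with psycopg2.connect("dbname=postgres user=postgres password=postgres") as conn:
    with conn.cursor() as cur:
        # <=> is pgvector's cosine distance operator; smaller means more similar.
        cur.execute(
            "SELECT text FROM document ORDER BY embedding <=> %s::vector LIMIT 5",
            (str(query_embedding.tolist()),),
        )
        top_chunks = [row[0] for row in cur.fetchall()]
```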
The following is just a sample and uses the current index that's been created.

```python
import json
from app.query import QueryAgent

query = "What is the default batch size for map_batches?"
system_content = "Your job is to answer a question using the additional context provided."

agent = QueryAgent(
    embedding_model="thenlper/gte-base",
    llm="meta-llama/Llama-2-7b-chat-hf",
    max_context_length=4096,
    system_content=system_content,
)
result = agent.get_response(query=query)
print(json.dumps(result, indent=2))
```

Generate responses:

```bash
python app/main.py generate-responses \
    --system-content "Answer the {query} using the additional {context} provided."
```

Evaluate the generated responses against reference answers:

```bash
python app/main.py evaluate-responses \
    --system-content """
    Your job is to rate the quality of our generated answer {generated_answer}
    given a query {query} and a reference answer {reference_answer}.
    Your score has to be between 1 and 5.
    You must return your response in a line with only the score.
    Do not return answers in any other format.
    On a separate line provide your reasoning for the score as well.
    """
```

Launch the dashboard:

```bash
export APP_PORT=8501
echo https://$APP_PORT-port-$ANYSCALE_SESSION_DOMAIN
streamlit run dashboard/Home.py
```
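A note on the `evaluate-responses` prompt above: since the evaluator is told to put the score alone on one line and its reasoning on a separate line, its raw output can be split with a small helper like the one below (hypothetical, not part of this repo).

```python
# Hypothetical parser for the evaluator output format described above:
# a lone 1-5 score on one line, with the reasoning on a separate line.

def parse_evaluation(response: str) -> tuple[int, str]:
    lines = [line.strip() for line in response.strip().splitlines() if line.strip()]
    score = int(lines[0])            # the first non-empty line contains only the score
    reasoning = " ".join(lines[1:])  # everything after it is the reasoning
    return score, reasoning
```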
TODO:
- notebook cleanup
- evaluator (e.g., GPT-4) response script
- DB dump & load
- experiments (in order, fixing choices along the way; see the sketch after this list)
  - Evaluator
    - GPT-4 best experiment
    - Llama-70b consistency with GPT-4
  - OSS vs. closed LLMs (gpt-3.5 vs. llama)
  - w/ and w/out context (value of RAG)
  - # of chunks to use in context
    - Does using more resources help or harm?
    - 1, 5, and 10 chunks will all fit in the smallest context length of 4K
  - Chunking size/overlap
    - related to # of chunks and context length, but we'll treat it as an independent variable
  - Embedding model (top 3 in the leaderboard)
    - the global leaderboard may not be your leaderboard (empirically validate)
- Later
  - Commercial Assistant evaluation
  - Human Assistant evaluation
  - Data sources
- Much later
  - Prompt
    - Prompt-tuning on query
  - Embedding vs. LLM for retrieval
  - Evaluator
  - Ray Tune to tweak a subset of components
  - CI/CD workflows
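For the experiments above, here is a rough sketch of what the configuration grid could look like; every value and model name below is an illustrative placeholder rather than a settled choice.

```python
# Hypothetical experiment grid for the TODO items above; all values are placeholders.
import itertools

num_chunks_options = [1, 5, 10]                     # all fit within a 4K context window
chunk_configs = [(300, 50), (500, 50), (800, 100)]  # (chunk_size, chunk_overlap)
embedding_models = [                                # e.g. top models from an embedding leaderboard
    "thenlper/gte-base",
    "BAAI/bge-base-en",
    "intfloat/e5-base-v2",
]

for num_chunks, (chunk_size, chunk_overlap), embedding_model in itertools.product(
    num_chunks_options, chunk_configs, embedding_models
):
    # In practice: rebuild the index with this chunking, generate responses using
    # num_chunks retrieved chunks, evaluate them, and log the scores per config.
    print(num_chunks, chunk_size, chunk_overlap, embedding_model)
```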