A chill, open-source web engine that crawls, indexes, and vibes with web content using semantic search.
Froxy is a modular full-stack web engine designed to crawl web pages, extract content, and index it using semantic embeddings for intelligent search, all powered by modern tools. It includes:
- A Go-based crawler (aka the spider 🕷️) with real-time indexing
- FastEmbed service for generating semantic embeddings
- Qdrant vector database for semantic search
- Froxy Apex - AI-powered intelligent search (Perplexity-style)
- A PostgreSQL database for structured data
- A Next.js front-end UI (fully integrated with real APIs)
This project is built for learning, experimenting, and extending: great for developers who want to understand how modern semantic search engines work from scratch.
Fun fact: I made this project in just 3 days, so it might not be perfect, but you know what? It works!
(We'll keep evolving this codebase together ❤️)
Note: I prefer simplicity over unnecessary complexity. We might make the architecture more advanced in the future, but for now it's simple, clean, and straightforward: no fancy stuff, no over-engineering. It's just a chill project for now. If needed, we can scale and make it more complex later. After all, it started as a fun project, nothing more. <3
- 🕷️ Crawl websites with real-time indexing (Go)
- 🧠 Semantic search using embeddings (FastEmbed + Qdrant)
- 🤖 AI-powered intelligent search with LLM integration (Froxy Apex)
- 🔍 Vector similarity search for intelligent results
- 📊 Chunk-based relevance scoring with cosine similarity
- 🐘 Store structured data in PostgreSQL
- 🎨 Modern UI in Next.js + Tailwind
- 🐳 Fully containerized with Docker
The frontend is fully connected to the backend and provides semantic search capabilities.
```
froxy/
├── front-end/          # Next.js frontend
│   ├── app/            # App routes (search, terms, about, etc.)
│   ├── components/     # UI components (shadcn-style)
│   ├── hooks/          # React hooks
│   ├── lib/            # Utility logic
│   ├── public/         # Static assets
│   └── styles/         # TailwindCSS setup
├── indexer-search/     # Node.js search backend
│   ├── lib/
│   ├── functions/
│   ├── services/       # DB + search service
│   └── utils/          # Helper utilities
├── froxy-apex/         # AI-powered intelligent search service
│   ├── api/            # API endpoints
│   ├── db/             # Database connections
│   ├── functions/      # AI processing logic
│   ├── llama/          # LLM integration
│   ├── models/         # Data models
│   └── utils/          # Helper utilities
├── spider/             # Web crawler in Go with real-time indexing
│   ├── db/             # DB handling (PostgreSQL + Qdrant)
│   ├── functions/      # Crawl + indexing logic + proxies (if needed)
│   ├── models/         # Data models
│   └── utils/          # Misc helpers
├── fastembed/          # FastEmbed embedding service
│   ├── models/         # Cached embedding models
│   └── docker-compose.yml
├── qdrant/             # Qdrant vector database
│   └── docker-compose.yml
├── db/                 # PostgreSQL database
│   ├── scripts/        # Shell backups
│   └── docker-compose.yml
├── froxy.sh            # Automated setup & runner script
├── LICENSE             # MIT License
└── readme.md           # This file
```
- Node.js (18+)
- pnpm or npm
- Go (1.23+)
- Docker & Docker Compose
- At least 2GB RAM (for embedding service)
For the fastest crawler setup without dealing with configuration details:
```bash
# Make the script executable and run it
chmod +x froxy.sh
./froxy.sh
```
The script will automatically:
- Set up all environment variables with default values
- Create the Docker network
- Start all required services (PostgreSQL, Qdrant, FastEmbed)
- Health check all containers
- Guide you through the crawling process
Note: The `froxy.sh` script only handles the crawler setup. You'll need to manually start the `froxy-apex` AI service and front-end after crawling.
If you prefer to set things up manually:
```bash
# 1. Create Docker network
docker network create froxy-network

# 2. Start Qdrant vector database
cd qdrant
docker-compose up -d --build

# 3. Start PostgreSQL database
cd ../db
# Set proper permissions for PostgreSQL data directory
sudo chown -R 999:999 postgres_data/
docker-compose up -d --build

# 4. Start FastEmbed service
cd ../fastembed
docker-compose up -d --build

# 5. Wait for all services to be healthy, then run the crawler
cd ../spider
go run main.go

# 6. After crawling, start the search backend
cd ../indexer-search
npm install
npm start

# 7. Start the AI-powered search service (Froxy Apex)
# Make sure to configure froxy-apex/.env first
cd ../froxy-apex
go run main.go

# 8. Launch the front-end
cd ../front-end
npm i --legacy-peer-deps
npm run dev
```
All services use these environment variables (automatically set by `froxy.sh`):
```bash
# Database Configuration (for spider & indexer-search)
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable

# Vector Database Configuration
QDRANT_API_KEY=froxy-secret-key
QDRANT_HOST=http://localhost:6333

# FastEmbed Service
EMBEDDING_HOST=http://localhost:5050

# AI Service (for froxy-apex)
LLM_API_KEY=your_groq_api_key
API_KEY=your_froxy_apex_api_key
```
Per-service `.env` files:
```bash
# db service (PostgreSQL)
POSTGRES_DB=froxy_db
POSTGRES_USER=froxy_user
POSTGRES_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable

# qdrant service
QDRANT_API_KEY=froxy-secret-key

# spider & indexer-search
DB_HOST=localhost
DB_PORT=5432
DB_USER=froxy_user
DB_PASSWORD=froxy_password
DB_NAME=froxy_db
DB_SSLMODE=disable
QDRANT_API_KEY=froxy-secret-key
EMBEDDING_HOST=http://localhost:5050

# froxy-apex/.env
LLM_API_KEY=your_groq_api_key
QDRANT_HOST=http://localhost:6333
EMBEDDING_HOST=http://localhost:5050
API_KEY=your_froxy_apex_api_key
QDRANT_API_KEY=froxy-secret-key

# front-end/.env
API_URL=http://localhost:8080
API_KEY=your_api_key
WEBSOCKET_URL=ws://localhost:8080/ws/search
FROXY_APEX_API_KEY=your_froxy_apex_api_key
ACCESS_CODE=auth_access_for_froxy_apex_ui
AUTH_SECRET_TOKEN=jwt_token_for_apex_ui_to_calc_the_usage
```
💡 The `froxy.sh` script automatically creates `.env` files with working default values for the crawler and database services. You'll need to manually configure `froxy-apex/.env` and `front-end/.env` for the AI search and UI components.
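Inside the Go services, these settings are read straight from the process environment. A minimal sketch of a config loader, assuming the variable names above; the `Config` struct and `getenv` helper are illustrative, not the actual spider code:
```go
package main

import (
	"fmt"
	"os"
)

// Config mirrors the environment variables listed above.
// This struct is a hypothetical illustration, not the real spider code.
type Config struct {
	DBHost, DBPort, DBUser, DBPassword, DBName string
	QdrantAPIKey, EmbeddingHost                string
}

// getenv falls back to a default when the variable is unset.
func getenv(key, fallback string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return fallback
}

func loadConfig() Config {
	return Config{
		DBHost:        getenv("DB_HOST", "localhost"),
		DBPort:        getenv("DB_PORT", "5432"),
		DBUser:        getenv("DB_USER", "froxy_user"),
		DBPassword:    getenv("DB_PASSWORD", "froxy_password"),
		DBName:        getenv("DB_NAME", "froxy_db"),
		QdrantAPIKey:  getenv("QDRANT_API_KEY", "froxy-secret-key"),
		EmbeddingHost: getenv("EMBEDDING_HOST", "http://localhost:5050"),
	}
}

func main() {
	cfg := loadConfig()
	fmt.Printf("connecting to %s:%s as %s\n", cfg.DBHost, cfg.DBPort, cfg.DBUser)
}
```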
- Crawler pulls website content from your provided URLs
- Real-time indexing generates semantic embeddings using FastEmbed (see the sketch below)
- Qdrant stores vector embeddings for semantic similarity search
- PostgreSQL stores structured metadata
- Frontend provides intelligent semantic search interface
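Condensed into code, steps 2-4 of that flow look roughly like this. A minimal sketch, assuming a FastEmbed-style HTTP endpoint at `/embed` and stubbing out the Qdrant and PostgreSQL writes; the real logic lives in `spider/functions/`:
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// embeddingHost matches the EMBEDDING_HOST default from the configuration above.
const embeddingHost = "http://localhost:5050"

// embedText asks the FastEmbed service for a semantic embedding.
// The /embed route and response shape are assumptions for illustration.
func embedText(text string) ([]float32, error) {
	payload, err := json.Marshal(map[string]string{"text": text})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(embeddingHost+"/embed", "application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, fmt.Errorf("embedding request: %w", err)
	}
	defer resp.Body.Close()

	var out struct {
		Embedding []float32 `json:"embedding"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, fmt.Errorf("decode embedding: %w", err)
	}
	return out.Embedding, nil
}

// indexPage sketches real-time indexing: embed the extracted text, then
// store the vector in Qdrant and the metadata in PostgreSQL (stubbed here).
func indexPage(url, title, text string) error {
	vec, err := embedText(text)
	if err != nil {
		return err
	}
	fmt.Printf("would upsert %d-dim vector for %s into Qdrant\n", len(vec), url)
	fmt.Printf("would insert metadata (%q) for %s into PostgreSQL\n", title, url)
	return nil
}

func main() {
	if err := indexPage("https://example.com", "Example", "Example page text"); err != nil {
		fmt.Println("index error:", err)
	}
}
```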
- User query is received and processed
- Query enhancement using Llama 3.1 8B via Groq API
- Embedding generation for the enhanced query using FastEmbed
- Vector search in Qdrant to find relevant pages
- Content chunking of relevant pages for detailed analysis
- Cosine similarity calculation for each chunk against the query (see the sketch after this list)
- LLM processing to generate structured response with:
- Concise summary
- Detailed results with sources
- Relevance scores
- Reference links and favicons
- Confidence ratings
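The chunk scoring in step 6 is plain cosine similarity between the query embedding and each chunk embedding. A minimal, self-contained sketch (the function and type names are illustrative, not Froxy Apex's actual code):
```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity of two equal-length vectors:
// dot(a, b) / (|a| * |b|), in the range [-1, 1].
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// scoredChunk pairs a content chunk with its relevance score.
type scoredChunk struct {
	Text  string
	Score float64
}

// rankChunks scores every chunk against the query embedding and sorts
// the most relevant chunks first, as in step 6 of the Apex pipeline.
func rankChunks(query []float32, chunks map[string][]float32) []scoredChunk {
	ranked := make([]scoredChunk, 0, len(chunks))
	for text, vec := range chunks {
		ranked = append(ranked, scoredChunk{Text: text, Score: cosine(query, vec)})
	}
	sort.Slice(ranked, func(i, j int) bool { return ranked[i].Score > ranked[j].Score })
	return ranked
}

func main() {
	q := []float32{0.1, 0.9, 0.2}
	chunks := map[string][]float32{
		"chunk A": {0.1, 0.8, 0.3},
		"chunk B": {0.9, 0.1, 0.0},
	}
	for _, c := range rankChunks(q, chunks) {
		fmt.Printf("%.3f  %s\n", c.Score, c.Text)
	}
}
```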
{ "summary": "Concise overview addressing the query directly", "results": [ { "point": "Detailed information in markdown format", "reference": "https://exact-source-url.com", "reference_favicon": "https://exact-source-url.com/favicon.ico", "relevance_score": 0.95, "timestamp": "when this info was published/updated" } ], "language": "detected_language_code", "last_updated": "timestamp", "confidence": 0.90 }When you run the spider, you'll be prompted to:
- Enter URLs you want to crawl
- Set the number of workers (default: 5; see the worker-pool sketch below)
The crawler will:
- Extract content from each page
- Generate embeddings in real-time
- Store vectors in Qdrant
- Store metadata in PostgreSQL
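The worker count from the prompt maps to a standard Go worker pool draining a shared URL queue, roughly like this minimal sketch (the `crawlAndIndex` stub stands in for the real fetch-extract-index step in `spider/functions/`):
```go
package main

import (
	"fmt"
	"sync"
)

// crawlAndIndex stands in for the real fetch + extract + index step.
func crawlAndIndex(url string) {
	fmt.Println("crawled:", url)
}

// runWorkers fans the URL queue out to n concurrent workers,
// mirroring the "number of workers" prompt (default: 5).
func runWorkers(urls []string, n int) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				crawlAndIndex(u)
			}
		}()
	}
	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
}

func main() {
	runWorkers([]string{"https://example.com", "https://example.org"}, 5)
}
```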
Since `froxy.sh` only handles the crawler, you'll need to manually configure:
- Froxy Apex: Set up your Groq API key and other environment variables
- Frontend: Configure API endpoints and keys
- Service startup: Start each service individually after crawler completes
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Next.js UI    │────▶│  Search Backend  │────▶│   PostgreSQL    │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         │                       │
         │                       ▼
         │              ┌──────────────────┐     ┌─────────────────┐
         │              │      Qdrant      │◀────│    FastEmbed    │
         │              │ (Vector Search)  │     │  (Embeddings)   │
         │              └──────────────────┘     └─────────────────┘
         │                       ▲                        ▲
         │                       │                        │
         │              ┌──────────────────┐              │
         │              │    Go Crawler    │──────────────┘
         │              │   (Real-time     │
         │              │    Indexing)     │
         │              └──────────────────┘
         ▼
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Froxy Apex    │────▶│   Groq LLM API   │     │ Chunk Analysis  │
│   (AI Search)   │     │  (Llama 3.1 8B)  │◀────│  (Cosine Sim)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
```
- 🕷️ Go (Golang) - crawler with real-time indexing
- 🧠 FastEmbed - embedding generation service
- 🔍 Qdrant - vector database for semantic search
- 🤖 Froxy Apex - AI-powered search with LLM integration
- 🦙 Llama 3.1 8B - language model via Groq API
- 💪 Node.js - search backend API
- 🐘 PostgreSQL - structured data storage
- ⚛️ Next.js - frontend interface
- 🎨 TailwindCSS + shadcn/ui - UI components
- 🐳 Docker - containerized services
- 🔗 Docker Network - service communication
- AI-Powered Search: Perplexity-style intelligent search with LLM integration
- Semantic Search: Find content by meaning, not just keywords
- Real-time Indexing: Content is indexed as it's crawled
- Vector Similarity: Intelligent search results based on context
- Chunk Analysis: Deep content analysis with cosine similarity
- Structured Responses: Rich JSON responses with sources and confidence scores
- Query Enhancement: AI-powered query understanding and improvement
- Scalable Architecture: Microservices with Docker containers
- Automated Setup: One-command deployment with `froxy.sh`
- Fork it 🔀
- Open a PR 🍰
- Share your ideas 💡
MIT: feel free to fork, remix, and learn from it.
Made with ❤️ for the curious minds of the internet.
Stay weird. Stay building.
"Not all who wander are lost β some are just crawling the web with semantic understanding."
