If you want to stop the "student A copied from student B" drama in class, on assignments, or at the office, this is a practical tutorial for building a plagiarism checker that understands Indonesian. It can check the whole text, per paragraph, and per sentence, and it stores every submission in a text bank (Postgres + Qdrant), so when you upload 40 assignments they automatically get checked against each other.
Architecture overview
- Frontend & API gateway: Nuxt (server route) receives the `/check` request and forwards it to the embedding/check service.
- Embedding & search service: Python + FastAPI + `sentence-transformers/paraphrase-multilingual-mpnet-base-v2` + `qdrant-client` for creating embeddings, searching, and upserting vectors.
- Vector DB: Qdrant stores the embeddings (vector size = 768, distance = Cosine).
- Relational DB: PostgreSQL stores document metadata (unique doc_id, title, created_at, raw_text).
Prerequisites (tools & versions)
- Docker + docker-compose
- Python 3.10+
- Node.js + Nuxt 3
- (Optional) A GPU if you want fast inference
1) Docker: run Postgres + Qdrant quickly
Create a simple `docker-compose.yml`:
```yaml
version: '3.8'
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: pguser
      POSTGRES_PASSWORD: pgpass
      POSTGRES_DB: plagiarism
    ports:
      - "5432:5432"
    volumes:
      - ./pgdata:/var/lib/postgresql/data
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"
    volumes:
      - ./qdrant_storage:/qdrant/storage
```
Run it:
```bash
docker-compose up -d
```
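Optional sanity check before moving on: a minimal Python sketch (assuming the credentials and default ports from the compose file above) that confirms both services are reachable.

```python
# check_services.py - quick connectivity check (assumes the docker-compose values above)
from qdrant_client import QdrantClient
from sqlalchemy import create_engine, text

# Qdrant: list collections (an empty list is fine on a fresh install)
qdrant = QdrantClient(url="http://localhost:6333")
print("Qdrant collections:", qdrant.get_collections())

# Postgres: run a trivial query
engine = create_engine("postgresql+psycopg2://pguser:pgpass@localhost:5432/plagiarism")
with engine.connect() as conn:
    print("Postgres says:", conn.execute(text("SELECT 1")).scalar())
```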
2) Backend embedding & search (Python + FastAPI)
Create a virtualenv and install the main dependencies:
```bash
python -m venv venv
source venv/bin/activate
pip install -U pip
pip install fastapi uvicorn sentence-transformers qdrant-client sqlalchemy psycopg2-binary python-multipart numpy
```
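If you want to confirm the model downloads correctly and really produces 768-dimensional vectors (the `VECTOR_SIZE` used for the Qdrant collection below), a quick sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
emb = model.encode(["Ini contoh kalimat bahasa Indonesia.", "This is an English sentence."])
print(emb.shape)  # expected: (2, 768)
```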
Create `service/app.py`; here is a condensed but complete version:
```python
# service/app.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.http.models import VectorParams, Distance, PointStruct
from sqlalchemy import create_engine, Column, String, Text, DateTime
from sqlalchemy.orm import declarative_base, sessionmaker
from datetime import datetime
import re
import uuid

# CONFIG
QDRANT_COLLECTION = "plagiarism_vectors"
VECTOR_SIZE = 768
THRESHOLD = 0.8
QDRANT_URL = "http://localhost:6333"
DATABASE_URL = "postgresql+psycopg2://pguser:pgpass@localhost:5432/plagiarism"

# Init
app = FastAPI(title="Plagiarism Checker Service")
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')  # loads on startup
qdrant = QdrantClient(url=QDRANT_URL)
engine = create_engine(DATABASE_URL)
Base = declarative_base()
SessionLocal = sessionmaker(bind=engine)

# DB model
class Document(Base):
    __tablename__ = "documents"
    doc_id = Column(String, primary_key=True, index=True)
    title = Column(String)
    text = Column(Text)
    created_at = Column(DateTime, default=datetime.utcnow)

Base.metadata.create_all(bind=engine)

# Ensure the Qdrant collection exists (create only if missing, so restarts don't wipe the bank)
try:
    qdrant.get_collection(QDRANT_COLLECTION)
except Exception:
    qdrant.create_collection(
        collection_name=QDRANT_COLLECTION,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE)
    )

# Helpers
def split_paragraphs(text: str):
    paras = [p.strip() for p in re.split(r'\n{2,}', text) if p.strip()]
    return paras if paras else [text.strip()]

def split_sentences(paragraph: str):
    # simple rule-based sentence split (works reasonably for Indonesian). Improve with spaCy if needed.
    sents = re.split(r'(?<=[\.\?\!])\s+', paragraph.strip())
    return [s.strip() for s in sents if s.strip()]

def cosine_sim(a: np.ndarray, b: np.ndarray):
    den = (np.linalg.norm(a) * np.linalg.norm(b))
    if den == 0:
        return 0.0
    return float(np.dot(a, b) / den)

def chunk_text_all(text: str):
    """Return list of chunks with type info: full, paragraphs, sentences"""
    chunks = []
    # full
    chunks.append({"type": "full", "index": 0, "text": text.strip()})
    paras = split_paragraphs(text)
    for i, p in enumerate(paras):
        chunks.append({"type": "paragraph", "index": i, "text": p})
        sents = split_sentences(p)
        for j, s in enumerate(sents):
            chunks.append({"type": "sentence", "index": f"{i}-{j}", "text": s})
    return chunks

# Request model
class CheckRequest(BaseModel):
    doc_id: str
    title: str
    text: str

@app.post("/check")
def check_and_add(req: CheckRequest):
    doc_id = req.doc_id
    title = req.title
    text = req.text

    # 1) chunk
    chunks = chunk_text_all(text)
    texts = [c["text"] for c in chunks]

    # 2) embed
    embeddings = model.encode(texts, show_progress_bar=False)
    embeddings = np.array(embeddings)  # shape (n, 768)

    # 3) search each chunk in Qdrant (exclude same doc_id results to avoid self-match)
    results = {"full": [], "paragraphs": [], "sentences": []}
    for i, c in enumerate(chunks):
        vec = embeddings[i].tolist()
        # search top 5
        hits = qdrant.search(
            collection_name=QDRANT_COLLECTION,
            query_vector=vec,
            limit=5,
            with_payload=True,
            with_vectors=True
        )
        matches = []
        for h in hits:
            payload = h.payload or {}
            source_doc = payload.get("doc_id")
            # skip matches from the same doc (this doc may already be in the bank)
            if source_doc == doc_id:
                continue
            # compute exact cosine using the vectors returned
            if hasattr(h, "vector") and h.vector is not None:
                sim = cosine_sim(np.array(vec), np.array(h.vector))
            else:
                # fallback to the search score if the vector is not returned
                sim = float(getattr(h, "score", 0.0))
            matches.append({
                "score": sim,
                "source_doc_id": source_doc,
                "source_text": payload.get("text"),
                "source_type": payload.get("chunk_type"),
                "source_index": payload.get("chunk_index")
            })
        # sort matches desc
        matches = sorted(matches, key=lambda x: x["score"], reverse=True)
        entry = {
            "chunk_type": c["type"],
            "chunk_index": c["index"],
            "text": c["text"],
            "top_matches": matches[:5]
        }
        if c["type"] == "full":
            results["full"].append(entry)
        elif c["type"] == "paragraph":
            results["paragraphs"].append(entry)
        else:
            results["sentences"].append(entry)

    # 4) store doc metadata in Postgres (prevent duplicates by doc_id)
    db = SessionLocal()
    existing = db.query(Document).filter(Document.doc_id == doc_id).first()
    if not existing:
        newdoc = Document(doc_id=doc_id, title=title, text=text)
        db.add(newdoc)
        db.commit()
    db.close()

    # 5) upsert all chunks to Qdrant (deterministic UUID derived from doc_id + chunk,
    #    so re-uploading the same doc overwrites instead of duplicating; Qdrant IDs must be UUIDs or ints)
    points = []
    for i, c in enumerate(chunks):
        pid = str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc_id}__{c['type']}__{c['index']}"))
        payload = {
            "doc_id": doc_id,
            "title": title,
            "chunk_type": c["type"],
            "chunk_index": c["index"],
            "text": c["text"]
        }
        points.append(PointStruct(id=pid, vector=embeddings[i].tolist(), payload=payload))
    qdrant.upsert(collection_name=QDRANT_COLLECTION, points=points)

    # 6) build report: flag any sentences/paragraphs whose best match >= THRESHOLD
    flagged = {"sentences": [], "paragraphs": []}
    for s in results["sentences"]:
        if s["top_matches"] and s["top_matches"][0]["score"] >= THRESHOLD:
            flagged["sentences"].append({
                "text": s["text"],
                "best_match": s["top_matches"][0]
            })
    for p in results["paragraphs"]:
        if p["top_matches"] and p["top_matches"][0]["score"] >= THRESHOLD:
            flagged["paragraphs"].append({
                "text": p["text"],
                "best_match": p["top_matches"][0]
            })

    return {
        "status": "ok",
        "doc_id": doc_id,
        "scores": results,
        "flagged": flagged
    }
```
Quick notes:
- The `/check` endpoint accepts `{doc_id, title, text}` (see the example call below).
- It runs the check (full / paragraph / sentence), then inserts the metadata into Postgres and upserts all chunks to Qdrant.
- Self-matches are excluded during search (so a new document does not match against itself).
- The default threshold is 0.8 (adjust via `THRESHOLD`).
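As a quick illustration of the request/response shape, here is a minimal sketch using `requests` (not in the pip list above, so install it first); the doc_id and text are made up:

```python
# hypothetical smoke test for /check (pip install requests first)
import requests

resp = requests.post(
    "http://localhost:8000/check",
    json={
        "doc_id": "tugas-demo-001",   # made-up id
        "title": "Demo assignment",
        "text": "Fotosintesis adalah proses tumbuhan mengubah cahaya menjadi energi.\n\nProses ini terjadi di daun.",
    },
    timeout=120,
)
report = resp.json()
print(report["status"], report["doc_id"])
print("flagged sentences:", len(report["flagged"]["sentences"]))
```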
3) Nuxt (frontend + server route)
You can create a server API route in Nuxt that forwards the request to FastAPI (or call FastAPI directly from the frontend, but going through a server route is safer).
Example `server/api/check.post.ts` (Nuxt 3 / Nitro):
```ts
// server/api/check.post.ts
import { defineEventHandler, readBody } from 'h3'

export default defineEventHandler(async (event) => {
  const body = await readBody(event)
  // change the URL if the service runs on a different host/port
  const FASTAPI_URL = process.env.FASTAPI_URL || "http://localhost:8000/check"
  const res = await $fetch(FASTAPI_URL, {
    method: "POST",
    body,
    headers: { "Content-Type": "application/json" }
  })
  return res
})
```
Frontend form (simplified):
```vue
<template>
  <form @submit.prevent="submit">
    <input v-model="docId" placeholder="doc id (unique)"/>
    <input v-model="title" placeholder="title"/>
    <textarea v-model="text" placeholder="paste the assignment text"></textarea>
    <button>Check</button>
  </form>

  <div v-if="report">
    <h3>Flagged:</h3>
    <div v-for="s in report.flagged.sentences" :key="s.text">
      <b>Sentence:</b> {{ s.text }} — <i>match</i>: {{ s.best_match.source_doc_id }} ({{ s.best_match.score.toFixed(3) }})
      <div>source text: {{ s.best_match.source_text }}</div>
    </div>
    <div v-for="p in report.flagged.paragraphs" :key="p.text">
      <b>Paragraph:</b> {{ p.text }} — match: {{ p.best_match.source_doc_id }} ({{ p.best_match.score.toFixed(3) }})
    </div>
  </div>
</template>

<script setup>
import { ref } from 'vue'

const docId = ref(`task-${Date.now()}`)
const title = ref('')
const text = ref('')
const report = ref(null)

async function submit() {
  report.value = null
  const res = await $fetch('/api/check', {
    method: 'POST',
    body: { doc_id: docId.value, title: title.value, text: text.value }
  })
  report.value = res
}
</script>
```
4) Quick testing (local)
- Run FastAPI: `uvicorn service.app:app --reload --port 8000`
- Run Nuxt, open the form, submit two similar documents, and check the `flagged` field in the response.
- Try uploading 40 assignments: give each one a unique `doc_id` (e.g. `tugas-2025-09-25-001`); the system stores them all in the bank and checks them against each other (see the sketch after this list).
- If you have a textbook or paper that students use as a reference, you can also add it to the bank.
5) Production tips / tuning
- Threshold: 0.8 is a strong indication (adjust as needed). For heavy paraphrasing, 0.7–0.75 may work better.
- Top_k: 5–10 is enough for detection.
- Batching: embed in batches (e.g. 128 per batch) for performance (see the sketch after this list).
- GPU: move the model to the GPU (`device='cuda'`) if you have many documents.
- Dedup: use `doc_id` as the primary key; keep `/check` idempotent.
- Self-match exclusion: already implemented; make sure the search always filters on `payload.doc_id != doc_id`.
- Tokenization: the sentence splitter above is a heuristic; for higher accuracy use an Indonesian-aware tokenizer (a spaCy model or transformer-based segmentation).
- Privacy: only store the text if your policy/privacy rules allow it (coursework or sensitive business documents).
- Scaling: Qdrant can run as a cluster; use sharding & replicas for large collections.
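A small sketch of the batching and GPU points above (the batch size and device choice are just illustrative values):

```python
import torch
from sentence_transformers import SentenceTransformer

# pick the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    device=device,
)

texts = [f"Kalimat contoh nomor {i}." for i in range(1000)]  # dummy corpus
# encode in batches of 128; sentence-transformers handles the batching internally
embeddings = model.encode(texts, batch_size=128, show_progress_bar=True)
print(embeddings.shape)  # (1000, 768)
```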
6) Deployment checklist
- [ ] Docker compose running (Postgres + Qdrant)
- [ ] FastAPI deployed (container / VM) and reachable from Nuxt
- [ ] Nuxt server deployed (environment variable FASTAPI_URL)
- [ ] Backup Postgres, snapshot Qdrant
- [ ] Monitoring: embedding latency, Qdrant search time (see the sketch below)
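For the monitoring point, a minimal timing sketch (no real monitoring stack, just prints; hook the numbers into whatever metrics system you use):

```python
import time
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
qdrant = QdrantClient(url="http://localhost:6333")

sample = ["Contoh kalimat untuk mengukur latensi embedding."]

t0 = time.perf_counter()
vec = model.encode(sample)[0]
embed_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
qdrant.search(collection_name="plagiarism_vectors", query_vector=vec.tolist(), limit=5)
search_ms = (time.perf_counter() - t0) * 1000

print(f"embed: {embed_ms:.1f} ms, qdrant search: {search_ms:.1f} ms")
```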
Wrap-up (TL;DR)
- You now have the flow: Nuxt → FastAPI (embedding + Qdrant) → Postgres.
- The checker works on the whole text, per paragraph, and per sentence.
- Output: arrays of scores & flagged items (sentences/paragraphs scoring > 0.8, each shown with its text and source).
- Calling `/check` also adds the document to the bank (unique `doc_id`), so assignments get checked against each other.
Disclaimer
This tutorial was written by AI. If you find an error, treat it as part of the learning process. Even better if you can write a more accurate tutorial yourself.