End-to-end ML pipeline that scrapes RemoteOK, clusters jobs with K-Means + TF-IDF, classifies seniority across 3 models, generates LLM market insights via Groq LLaMA 3.3 70B, and pushes automated reports to Google Sheets — all from one API call.
Last updated April 2026
An end-to-end ML data pipeline that monitors the AI/ML job market in real time. A single POST /run-pipeline triggers the full chain: async scraping → PostgreSQL storage → TF-IDF clustering → seniority classification → Groq LLaMA 3.3 70B market briefing → Google Sheets report with a dated tab per run.
Built as a portfolio project to demonstrate every layer of an ML engineering stack — scraping, databases, unsupervised and supervised ML, LLM integration, external API automation, and containerised deployment. Live run: 97 jobs scraped, 8 clusters found, 36/36 tests passing.
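
Triggering a run from a client could look like the minimal sketch below. It uses only the Python standard library. The base URL (uvicorn's default local address), the empty request body, and the JSON response shape are assumptions; this README only names the POST /run-pipeline route and the jobs_scraped field. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the FastAPI app is served locally by uvicorn on its default address.
BASE_URL = "http://127.0.0.1:8000"


def build_run_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST request that triggers one full pipeline run."""
    return urllib.request.Request(
        url=f"{base_url}/run-pipeline",
        data=b"",  # assumption: the endpoint needs no request body
        method="POST",
        headers={"Accept": "application/json"},
    )


def run_pipeline(base_url: str = BASE_URL, timeout: float = 600.0) -> dict:
    """Send the request and decode the response (assumed to be a JSON object)."""
    request = build_run_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, `run_pipeline().get("jobs_scraped")` would read the scrape count from the response, assuming that field is returned at the top level. The long timeout is a guess that allows for scraping, model fitting, and the LLM call to finish within one request.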
| Step | Component | Technology |
|---|---|---|
| 1 | Async Scraper | aiohttp · asyncio · BeautifulSoup4 · RemoteOK JSON API |
| 2 | Storage | PostgreSQL · SQLAlchemy 2 · Alembic migrations |
| 3 | Feature Extraction | TF-IDF vectorisation (5,000 features) · skill frequency analysis |
| 4 | Clustering | K-Means · silhouette k-selection (k=2–7) · PCA visualisation |
| 5 | Seniority Classifier | Logistic Regression · LinearSVC · Random Forest · 5-fold CV F1 |
| 6 | LLM Insights | Groq API · LLaMA 3.3 70B · structured prompt template |
| 7 | Reporter | gspread · Google Service Account · dated tab per run |
| 8 | API | FastAPI · uvicorn · POST /run-pipeline |
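
Steps 3 and 4 of the table can be sketched with scikit-learn as follows. This is an illustrative reconstruction, not the project's actual code: the signature of find_optimal_k, the random seed, n_init, the stop-word setting, and the toy job descriptions are all assumptions. Only the k=2–7 range, the 5,000-feature cap, and the "highest silhouette score wins" rule come from this README.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def find_optimal_k(X, k_min=2, k_max=7, random_state=42):
    """Fit K-Means for each k in [k_min, k_max] and return (best_k, scores).

    best_k is the k with the highest silhouette score; scores maps k -> score.
    """
    scores = {}
    for k in range(k_min, k_max + 1):
        # The silhouette score needs at most n_samples - 1 clusters.
        if k >= X.shape[0]:
            break
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy stand-in for scraped job descriptions (hypothetical data).
docs = [
    "senior machine learning engineer pytorch deployment",
    "machine learning engineer tensorflow model serving",
    "ml engineer pytorch training pipelines",
    "data analyst sql dashboards reporting",
    "business data analyst excel sql tableau",
    "analytics engineer sql dbt warehouse",
    "llm engineer prompt design retrieval agents",
    "ai engineer llm agents vector search",
    "nlp engineer transformers llm fine tuning",
    "data engineer spark airflow etl pipelines",
]

# TF-IDF with the 5,000-feature cap named in the table.
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X_text = tfidf.fit_transform(docs)

best_k, scores = find_optimal_k(X_text)
print("best k:", best_k)
```

Returning the full score dictionary alongside best_k makes it easy to log or plot how close the runner-up values of k were on a given day.
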
- find_optimal_k() tries k=2–7 and picks the highest silhouette score automatically: there is no hardcoded cluster count, so it adapts to each day's data distribution.
- Deduplication keyed on source_url: re-running never double-counts, and a second run on the same day shows jobs_scraped: 0 by design.

| Metric | Result |
|---|---|
| Jobs scraped | 97 |
| Clusters found | 8 |
| Best classifier (F1) | LogisticRegression — 0.417 |
| LinearSVC F1 | 0.404 |
| Random Forest F1 | 0.351 |
| Tests passing | 36 / 36 |
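
The classifier rows above come from comparing three models under 5-fold cross-validation. A comparison of that kind could be sketched as below, assuming scikit-learn. The seniority labels, toy texts, hyperparameters, and the choice of macro-averaged F1 are assumptions; this README does not state the label set or the averaging mode, and compare_classifiers is a hypothetical helper name.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(texts, labels, n_splits=5, random_state=42):
    """Return {model name: mean cross-validated F1} for three text classifiers."""
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "LinearSVC": LinearSVC(),
        "RandomForest": RandomForestClassifier(
            n_estimators=200, random_state=random_state
        ),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    results = {}
    for name, model in models.items():
        # Fitting TF-IDF inside the pipeline keeps each fold's vocabulary
        # learned from its training split only.
        pipeline = make_pipeline(TfidfVectorizer(max_features=5000), model)
        fold_scores = cross_val_score(
            pipeline, texts, labels, cv=cv, scoring="f1_macro"  # assumed averaging
        )
        results[name] = float(fold_scores.mean())
    return results


# Hypothetical toy data: three seniority classes, ten postings each.
toy_texts = (
    [f"junior ml engineer entry level graduate role {i}" for i in range(10)]
    + [f"mid level data scientist python sql role {i}" for i in range(10)]
    + [f"senior staff ml engineer lead architect role {i}" for i in range(10)]
)
toy_labels = ["junior"] * 10 + ["mid"] * 10 + ["senior"] * 10

results = compare_classifiers(toy_texts, toy_labels)
best_model = max(results, key=results.get)
print(best_model, results)
```

Scores on this toy data say nothing about the real figures in the table; the sketch only shows the shape of the comparison.
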
Each pipeline run writes a new dated tab (e.g. 2026-04-23) containing: