Safdar.

AI Engineer & Full Stack Developer building intelligent systems.

Connect

safdarayub@gmail.com

Kohat District, KP, Pakistan

© 2026 Safdar Ayub. All rights reserved.
ML Pipelines

AI/ML Job Market Intelligence Pipeline

End-to-end ML pipeline that scrapes RemoteOK, clusters jobs with K-Means + TF-IDF, classifies seniority across 3 models, generates LLM market insights via Groq LLaMA 3.3 70B, and pushes automated reports to Google Sheets — all from one API call.

Last updated April 2026


Overview

An end-to-end ML data pipeline that monitors the AI/ML job market in real time. A single POST /run-pipeline triggers the full chain: async scraping → PostgreSQL storage → TF-IDF clustering → seniority classification → Groq LLaMA 3.3 70B market briefing → Google Sheets report with a dated tab per run.
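The chain above can be pictured as a list of stages that share one state dictionary. The following is a minimal sketch under that assumption, not the project's actual code: `run_pipeline`, the stage names, and the stub stages are all hypothetical and stand in for the real scraping, clustering, and reporting steps.

```python
from typing import Callable

# A stage reads the shared state and returns the keys it wants to add.
Stage = Callable[[dict], dict]


def run_pipeline(stages: list[tuple[str, Stage]]) -> dict:
    """Run each stage in order, merging its output into a shared state dict."""
    state: dict = {"completed": []}
    for name, stage in stages:
        state.update(stage(state))
        state["completed"].append(name)
    return state


# Hypothetical stub stages standing in for the real scraper and summary step.
def scrape(state: dict) -> dict:
    return {"jobs": ["ML Engineer", "Data Scientist"]}


def summarise(state: dict) -> dict:
    return {"jobs_scraped": len(state["jobs"])}


result = run_pipeline([("scrape", scrape), ("summarise", summarise)])
print(result["jobs_scraped"], result["completed"])
```

Keeping every stage behind the same signature means a single endpoint handler only has to call one function and return the final state as the response.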

Built as a portfolio project to demonstrate every layer of an ML engineering stack — scraping, databases, unsupervised and supervised ML, LLM integration, external API automation, and containerised deployment. Live run: 97 jobs scraped, 8 clusters found, 36/36 tests passing.

Pipeline Architecture

Step | Component | Technology
1 | Async Scraper | aiohttp · asyncio · BeautifulSoup4 · RemoteOK JSON API
2 | Storage | PostgreSQL · SQLAlchemy 2 · Alembic migrations
3 | Feature Extraction | TF-IDF vectorisation (5,000 features) · skill frequency analysis
4 | Clustering | K-Means · silhouette k-selection (k=2–7) · PCA visualisation
5 | Seniority Classifier | Logistic Regression · LinearSVC · Random Forest · 5-fold CV F1
6 | LLM Insights | Groq API · LLaMA 3.3 70B · structured prompt template
7 | Reporter | gspread · Google Service Account · dated tab per run
8 | API | FastAPI · uvicorn · POST /run-pipeline
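Steps 3 and 4 can be sketched with scikit-learn as follows. This is an illustrative sketch, not the repository's implementation: the name `find_optimal_k` comes from this page, but its signature, the `n_init` and `random_state` values, and the sample job titles are assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def find_optimal_k(matrix, k_min=2, k_max=7, seed=42):
    """Fit K-Means for each k in [k_min, k_max] and return (best_k, scores)."""
    n_samples = matrix.shape[0]
    scores = {}
    # The silhouette score needs at least 2 and at most n_samples - 1 clusters.
    for k in range(k_min, min(k_max, n_samples - 1) + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(matrix)
        scores[k] = silhouette_score(matrix, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Made-up job texts, used only to exercise the function.
docs = [
    "senior machine learning engineer pytorch training pipelines",
    "machine learning engineer tensorflow model deployment",
    "ml engineer pytorch distributed training gpu",
    "data analyst sql dashboards tableau reporting",
    "business intelligence analyst sql power bi reporting",
    "analytics engineer sql dbt warehouse modelling",
    "frontend developer react typescript css",
    "full stack developer react node typescript",
    "web developer javascript react accessibility",
]

matrix = TfidfVectorizer(max_features=5000, stop_words="english").fit_transform(docs)
best_k, scores = find_optimal_k(matrix)
print(best_k, {k: round(float(s), 3) for k, s in scores.items()})
```

Because the score is recomputed on every run, the chosen k follows whatever structure that day's listings happen to have.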

Key Engineering Decisions

  • Silhouette k-selection: find_optimal_k() tries k=2–7 and picks the highest silhouette score automatically — no hardcoded cluster count, adapts to each day's data distribution
  • Classifier comparison: All three models are evaluated via 5-fold cross-validation F1 macro on every run — the winning model and score are returned in the API response
  • RemoteOK JSON API over Playwright: Switched from headless browser scraping to the public JSON API via aiohttp — more reliable, faster, no browser install
  • Deduplication by URL: Jobs are keyed on source_url — re-running never double-counts; second run on the same day shows jobs_scraped: 0 by design
  • SQLite in tests: Integration tests use SQLite in-memory with mocked LLM and Sheets — no external services needed in CI
  • Groq free tier: LLaMA 3.3 70B via Groq — no credit card required; rate-limit retries built in with 10-second backoff
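The URL-based deduplication described above can be illustrated with a uniqueness constraint on `source_url`. This is a sketch under assumptions: it uses the standard-library `sqlite3` module instead of the project's PostgreSQL and SQLAlchemy setup, and the table layout, the `save_jobs` helper, and the example URLs are hypothetical.

```python
import sqlite3


def save_jobs(conn, jobs):
    """Insert jobs, skipping rows whose source_url already exists.

    Returns the number of rows actually inserted.
    """
    inserted = 0
    for job in jobs:
        cur = conn.execute(
            "INSERT OR IGNORE INTO jobs (source_url, title) VALUES (?, ?)",
            (job["source_url"], job["title"]),
        )
        inserted += cur.rowcount  # 1 for a new row, 0 when the URL was a duplicate
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (source_url TEXT PRIMARY KEY, title TEXT NOT NULL)")

# Hypothetical listings; example.com URLs are placeholders.
batch = [
    {"source_url": "https://example.com/jobs/1", "title": "ML Engineer"},
    {"source_url": "https://example.com/jobs/2", "title": "Data Scientist"},
]

first_run = save_jobs(conn, batch)
second_run = save_jobs(conn, batch)
total_rows = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
print(first_run, second_run, total_rows)
```

In PostgreSQL the same effect is available through `INSERT ... ON CONFLICT DO NOTHING` on a unique column.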

Live Run Results (2026-04-23)

Metric | Result
Jobs scraped | 97
Clusters found | 8
Best classifier (F1) | LogisticRegression, 0.417
LinearSVC F1 | 0.404
Random Forest F1 | 0.351
Tests passing | 36 / 36
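The F1 figures above are macro-averaged, meaning each seniority class contributes equally regardless of how many jobs it contains. The sketch below computes macro F1 by hand on made-up labels to show the definition; it is not the project's evaluation code, and the sample predictions are invented for illustration.

```python
def macro_f1(y_true, y_pred):
    """Average the per-class F1 scores, giving every class equal weight."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted members scores 0 here.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Invented labels: the single junior job is misclassified as mid.
y_true = ["junior", "mid", "senior", "senior", "mid", "senior"]
y_pred = ["mid", "mid", "senior", "senior", "senior", "senior"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(round(accuracy, 3), round(score, 3))
```

In this toy case four of six predictions are correct, yet the macro F1 is noticeably lower because the missed junior class counts as a full third of the average.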

Google Sheets Output

Each pipeline run writes a new dated tab (e.g. 2026-04-23) containing:

  • LLM-generated 2–3 paragraph market intelligence briefing
  • Top 20 skills by mention rate across all job listings
  • Cluster breakdown — job count and percentage per cluster
  • Seniority distribution — junior / mid / senior counts
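
The tabular parts of that report can be derived from the job records with a few counters. The following sketch assumes each job is a dictionary with `skills`, `cluster`, and `seniority` keys; that shape, the `build_report` helper, and the sample jobs are hypothetical, and the step that writes the rows to Google Sheets is left out.

```python
from collections import Counter
from datetime import date


def build_report(jobs, run_date, top_n=20):
    """Assemble the dated tab title and the three summary tables for one run."""
    total = len(jobs)
    # Count each skill once per job so the rate is "share of listings".
    skill_counts = Counter(s for job in jobs for s in set(job["skills"]))
    cluster_counts = Counter(job["cluster"] for job in jobs)
    seniority_counts = Counter(job["seniority"] for job in jobs)

    def pct(n):
        return round(100 * n / total, 1)

    return {
        "tab_title": run_date.isoformat(),
        "top_skills": [[s, pct(n)] for s, n in skill_counts.most_common(top_n)],
        "clusters": [[c, n, pct(n)] for c, n in sorted(cluster_counts.items())],
        "seniority": [
            [level, seniority_counts.get(level, 0)]
            for level in ("junior", "mid", "senior")
        ],
    }


# Invented job records, used only to exercise the function.
jobs = [
    {"skills": ["python", "pytorch"], "cluster": 0, "seniority": "senior"},
    {"skills": ["python", "sql"], "cluster": 1, "seniority": "mid"},
    {"skills": ["python", "pytorch"], "cluster": 0, "seniority": "senior"},
    {"skills": ["sql"], "cluster": 1, "seniority": "junior"},
]

report = build_report(jobs, date(2026, 4, 23))
print(report["tab_title"], report["top_skills"][0])
```

Each list in the result is already row-shaped, so it could be passed to a spreadsheet client as a block of cells.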

Tech Stack

Python · scikit-learn · FastAPI · PostgreSQL · Groq API · Google Sheets · SQLAlchemy · Docker

Interested in similar work?

I build AI agents, full-stack applications, and cloud-native systems. Let's discuss your next project.


Other Projects

CareerCoach Pakistan
Platinum Tier · SaaS Products

Full-stack SaaS: paste a job description, answer 10 AI-generated questions in English or Urdu, and get instant scored feedback. Subscription billing, Google auth, transactional email, and analytics included.

Next.js 16 · Supabase · Stripe · Groq AI · Resend

Personal AI Employee
Platinum Tier · AI Agents

A 24/7 autonomous AI agent that monitors Gmail, WhatsApp, and the filesystem. It drafts responses on a cloud VM while you're offline, then executes them with your approval when you reconnect.

Python · FastMCP · Claude Code · Gmail OAuth2 · Playwright

BusBot Pakistan
Platinum Tier · AI Applications

Pakistan's first AI-powered public bus guide: speak in Urdu, get your route. Built for the Google AI Seekho 2026 competition, covering Lahore, Karachi, and Islamabad with community-driven route data.

React · Gemini 3 Flash · Firebase Firestore · Web Speech API · Google Cloud Run