© 2026 Safdar Ayub. All rights reserved.

AI/ML Job Market Intelligence Pipeline

End-to-end ML pipeline that scrapes RemoteOK, clusters jobs with TF-IDF + K-Means, compares three seniority classifiers, generates LLM market insights via Groq LLaMA 3.3 70B, and pushes automated reports to Google Sheets, all from one API call.

Last updated April 2026

Overview

An end-to-end ML data pipeline that monitors the AI/ML job market on demand. A single POST /run-pipeline request triggers the full chain: async scraping → PostgreSQL storage → TF-IDF clustering → seniority classification → Groq LLaMA 3.3 70B market briefing → Google Sheets report with a dated tab per run.
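
The chain above can be pictured as a list of stages that each read and update a shared context. The sketch below is illustrative only: `run_pipeline`, `PipelineRun`, and the stand-in stage functions are hypothetical names, not the project's actual code, and the real stages would talk to RemoteOK, PostgreSQL, Groq, and Google Sheets.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical stage signature: each stage takes the context dict and returns an updated one.
Stage = Callable[[dict], dict]


@dataclass
class PipelineRun:
    context: dict = field(default_factory=dict)
    completed: list = field(default_factory=list)


def run_pipeline(stages: list[tuple[str, Stage]]) -> PipelineRun:
    """Run the stages in order, handing each one the context left by the previous stage."""
    run = PipelineRun()
    for name, stage in stages:
        run.context = stage(run.context)
        run.completed.append(name)
    return run


# Stand-in stages (placeholders, not the real scraper, storage, or clustering code).
def scrape(ctx: dict) -> dict:
    return {**ctx, "jobs": ["ML Engineer", "Data Scientist"]}


def store(ctx: dict) -> dict:
    return {**ctx, "jobs_scraped": len(ctx["jobs"])}


def cluster(ctx: dict) -> dict:
    return {**ctx, "n_clusters": 2}


run = run_pipeline([("scrape", scrape), ("store", store), ("cluster", cluster)])
print(run.completed, run.context["jobs_scraped"])
```

Keeping every stage behind the same signature is one way to let a single endpoint handler drive the whole chain and return the final context as its response.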

Built as a portfolio project to demonstrate every layer of an ML engineering stack — scraping, databases, unsupervised and supervised ML, LLM integration, external API automation, and containerised deployment. Live run: 97 jobs scraped, 8 clusters found, 36/36 tests passing.

Pipeline Architecture

| Step | Component | Technology |
|------|-----------|------------|
| 1 | Async Scraper | aiohttp · asyncio · BeautifulSoup4 · RemoteOK JSON API |
| 2 | Storage | PostgreSQL · SQLAlchemy 2 · Alembic migrations |
| 3 | Feature Extraction | TF-IDF vectorisation (5,000 features) · skill frequency analysis |
| 4 | Clustering | K-Means · silhouette k-selection (k=2–7) · PCA visualisation |
| 5 | Seniority Classifier | Logistic Regression · LinearSVC · Random Forest · 5-fold CV F1 |
| 6 | LLM Insights | Groq API · LLaMA 3.3 70B · structured prompt template |
| 7 | Reporter | gspread · Google Service Account · dated tab per run |
| 8 | API | FastAPI · uvicorn · POST /run-pipeline |
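
As a rough illustration of step 4, silhouette-based k-selection can be sketched with scikit-learn as follows. The function name `find_optimal_k` comes from the project, but its signature and body here are assumptions. The toy 2-D points stand in for the TF-IDF matrix, which could be passed in the same way since `KMeans` and `silhouette_score` both accept sparse input.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def find_optimal_k(X, k_min=2, k_max=7, random_state=0):
    """Return (best_k, scores), where scores maps each candidate k to its silhouette score."""
    n_samples = X.shape[0]
    # silhouette_score needs between 2 and n_samples - 1 distinct labels.
    k_max = min(k_max, n_samples - 1)
    scores = {}
    for k in range(k_min, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data: three tight, well-separated groups of 2-D points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])

best_k, scores = find_optimal_k(X)
print(best_k, {k: round(s, 3) for k, s in scores.items()})
```

Returning the full score map alongside the winner makes it easy to log how close the runner-up values of k were on a given day.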

Key Engineering Decisions

  • Silhouette k-selection: find_optimal_k() tries k=2–7 and picks the k with the highest silhouette score, so there is no hardcoded cluster count and the choice adapts to each day's data distribution
  • Classifier comparison: All three models are scored by macro-averaged F1 under 5-fold cross-validation on every run; the winning model and its score are returned in the API response
  • RemoteOK JSON API over Playwright: Switched from headless browser scraping to the public JSON API via aiohttp — more reliable, faster, no browser install
  • Deduplication by URL: Jobs are keyed on source_url — re-running never double-counts; second run on the same day shows jobs_scraped: 0 by design
  • SQLite in tests: Integration tests use SQLite in-memory with mocked LLM and Sheets — no external services needed in CI
  • Groq free tier: LLaMA 3.3 70B via Groq — no credit card required; rate-limit retries built in with 10-second backoff
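
The URL-based deduplication described above can be sketched with a uniqueness constraint on `source_url`. This is a minimal sketch using the standard-library `sqlite3` module rather than the project's SQLAlchemy and PostgreSQL setup. The table layout, the `insert_new_jobs` helper, and the example URLs are all assumptions.

```python
import sqlite3


def insert_new_jobs(conn, jobs):
    """Insert jobs, skipping any whose source_url is already stored.

    Returns the number of rows that were actually new.
    """
    before = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO jobs (source_url, title) VALUES (?, ?)",
        [(job["source_url"], job["title"]) for job in jobs],
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM jobs").fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
# The primary key on source_url is what makes a repeated insert a no-op.
conn.execute("CREATE TABLE jobs (source_url TEXT PRIMARY KEY, title TEXT NOT NULL)")

batch = [
    {"source_url": "https://example.com/jobs/1", "title": "ML Engineer"},
    {"source_url": "https://example.com/jobs/2", "title": "Data Scientist"},
]
first_run = insert_new_jobs(conn, batch)
second_run = insert_new_jobs(conn, batch)
print({"first_run": first_run, "second_run": second_run})
```

On PostgreSQL the same idea is usually written as INSERT ... ON CONFLICT (source_url) DO NOTHING.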

Live Run Results (2026-04-23)

| Metric | Result |
|--------|--------|
| Jobs scraped | 97 |
| Clusters found | 8 |
| Best classifier (F1) | LogisticRegression, 0.417 |
| LinearSVC F1 | 0.404 |
| Random Forest F1 | 0.351 |
| Tests passing | 36 / 36 |
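
A comparison like the one in the table can be reproduced in outline with scikit-learn's `cross_val_score` and the `f1_macro` scorer. The sketch below is an assumption about how such a comparison might look, not the project's code. It runs on synthetic three-class data standing in for junior / mid / senior labels, so its scores will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC


def compare_classifiers(X, y, cv=5):
    """Return (best_name, scores) using mean macro-F1 across the cv folds."""
    models = {
        "LogisticRegression": LogisticRegression(max_iter=1000),
        "LinearSVC": LinearSVC(),
        "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    scores = {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="f1_macro").mean())
        for name, model in models.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


# Synthetic stand-in data; the real pipeline would pass TF-IDF features and seniority labels.
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_classes=3, random_state=0
)
best_name, scores = compare_classifiers(X, y)
print(best_name, {name: round(score, 3) for name, score in scores.items()})
```

Macro averaging weights each seniority class equally, which matters when one level dominates the scraped listings.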

Google Sheets Output

Each pipeline run writes a new dated tab (e.g. 2026-04-23) containing:

  • LLM-generated 2–3 paragraph market intelligence briefing
  • Top 20 skills by mention rate across all job listings
  • Cluster breakdown — job count and percentage per cluster
  • Seniority distribution — junior / mid / senior counts
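
The tabular parts of the report can be derived with a few small helpers before anything is written to Sheets. The sketch below assumes "mention rate" means the share of listings that mention a skill at least once, and it uses naive substring matching. The helper names, the skill list, and the sample descriptions are hypothetical.

```python
from collections import Counter
from datetime import date


def tab_title(run_date: date) -> str:
    """Dated tab name in YYYY-MM-DD form, e.g. 2026-04-23."""
    return run_date.isoformat()


def skill_mention_rates(descriptions, skills, top_n=20):
    """Return [(skill, rate)] sorted by rate, where rate is the share of
    descriptions mentioning the skill (case-insensitive substring match)."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for skill in skills:
            if skill.lower() in lowered:
                counts[skill] += 1
    total = len(descriptions) or 1
    # Sort by descending count, then alphabetically to keep ties stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(skill, count / total) for skill, count in ranked[:top_n]]


def distribution(labels):
    """Return {label: (count, percentage)} for cluster or seniority labels."""
    counts = Counter(labels)
    total = sum(counts.values()) or 1
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}


descriptions = ["Python and PyTorch required", "Strong Python, SQL", "SQL and dbt"]
rates = skill_mention_rates(descriptions, ["Python", "SQL", "PyTorch"])
seniority = distribution(["senior", "mid", "senior", "junior"])
print(tab_title(date(2026, 4, 23)), rates, seniority)
```

Substring matching will over-count very short skill names, so a production version would likely match on token boundaries instead.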

Tech Stack

Python · scikit-learn · FastAPI · PostgreSQL · Groq API · Google Sheets · SQLAlchemy · Docker
