Language‑Guided Zero‑Shot Object Detection

A simple, practical pipeline for zero‑shot object detection that uses YOLO for detection and CLIP for language guidance. Provide natural‑language prompts (e.g., “a black smartphone”), and the system highlights the best‑matching object in images or a live webcam feed. The project is still a work in progress.
Implementation details #
Code on GitHub
Features #
- Zero‑shot detection without task‑specific training
- Language‑guided selection with CLIP (image–text similarity)
- YOLO for detection/segmentation; optional SAM prototype
- Customizable prompts
- Webcam demo for real‑time testing
Project structure #
- segmentation/yolo_detection.py: YOLO‑based object detection
- sam_detection.py: SAM‑based detection (prototype)
- selector.py: CLIP‑based selection of detections by prompt similarity
- models/: pre‑trained model weights (YOLO, CLIP)
- resources/: sample images
- README.md: documentation
Requirements #
- Python 3.8+
- PyTorch
- OpenCV
- Ultralytics (YOLO, SAM)
- OpenAI CLIP (clip)
Install (example):
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Usage #
Webcam demo:
python segmentation/webcam_demo.py
Customize prompts in webcam_demo.py:
self.object_prompt = [
    "A black smartphone",
    "A person wearing a red shirt",
    "A remote controller",
    "A frying pan",
]
The demo:
- runs YOLO to detect objects
- crops detections
- ranks crops by CLIP similarity to your prompts
- overlays boxes + matched prompt on the live feed
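The cropping step above is easy to get wrong when a detection box extends past the frame edge. The following is a minimal sketch of how box coordinates might be clamped before cropping. It is not the repository's code: the helper name `clamp_box` and the sample boxes are hypothetical, and boxes are assumed to be in (x1, y1, x2, y2) pixel format.

```python
def clamp_box(box, frame_width, frame_height):
    """Clamp an (x1, y1, x2, y2) box to the frame and return integer pixel coords.

    Returns None when the clamped box has no area (hypothetical helper).
    """
    x1, y1, x2, y2 = box
    x1 = max(0, min(int(x1), frame_width))
    y1 = max(0, min(int(y1), frame_height))
    x2 = max(0, min(int(x2), frame_width))
    y2 = max(0, min(int(y2), frame_height))
    if x2 <= x1 or y2 <= y1:
        return None  # box lies outside the frame or is degenerate
    return x1, y1, x2, y2


# Hypothetical detections on a 640x480 frame.
boxes = [(-12.4, 30.0, 200.7, 510.2), (650.0, 10.0, 700.0, 50.0)]
crop_boxes = [clamp_box(b, 640, 480) for b in boxes]
print(crop_boxes)  # [(0, 30, 200, 480), None]
```

With a NumPy or OpenCV frame, a valid result would then be used as `frame[y1:y2, x1:x2]`. Boxes that come back as `None` are skipped rather than passed to CLIP.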
Example #
Given the prompt “A black smartphone”:
- YOLO finds candidate objects
- Each crop is embedded with CLIP
- Text–image similarity is computed for the prompt
- The highest‑scoring object is highlighted
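The ranking steps above can be sketched with plain cosine similarity. This is an illustrative sketch, not the project's `selector.py`: the toy 3‑dimensional vectors stand in for real CLIP embeddings (which would come from the model's image and text encoders), the function names are hypothetical, and embeddings are assumed to be non‑zero.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(text_embedding, crop_embeddings):
    """Return (index, score) of the crop most similar to the prompt."""
    scores = [cosine_similarity(text_embedding, e) for e in crop_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy vectors standing in for CLIP embeddings (assumption for illustration).
text_embedding = [0.9, 0.1, 0.0]          # e.g. "A black smartphone"
crop_embeddings = [
    [0.0, 1.0, 0.0],                      # crop 0
    [1.0, 0.2, 0.1],                      # crop 1, closest to the prompt
    [0.1, 0.1, 1.0],                      # crop 2
]
index, score = best_match(text_embedding, crop_embeddings)
print(index)  # 1
```

The crop at the returned index is the one that would be highlighted. With several prompts, the same comparison would be repeated per prompt.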
Notes / roadmap #
- Add lightweight Ultralytics SAM acceleration for live use
- Implement NMS consolidation across overlapping detections
- Consider detection models with broader class sets or custom training
- Package a simple executable for non‑dev users
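For the NMS item in the roadmap above, one possible starting point is a greedy, class‑agnostic non‑maximum suppression. The sketch below is only an assumption about how consolidation could work, not a planned implementation. The function names, the 0.5 IoU threshold, and the sample boxes are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    it does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The scores passed in could be detector confidences or CLIP similarities. Which one suits this pipeline better is an open design choice.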
Troubleshooting #
- File/path errors → run scripts from repo root; check relative paths
- Close similarity scores → refine prompts; ensure consistent preprocessing
- Model downloads → verify internet/SSL; confirm cache directory
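To make the "close similarity scores" case above easier to spot, one option is to turn the per‑prompt similarities into probabilities and check the gap between the top two. The sketch below assumes a scale factor of 100, roughly the logit scale of the released CLIP checkpoints. The 0.2 gap threshold and the function names are hypothetical and would need tuning.

```python
import math


def softmax(values, scale=100.0):
    """Softmax over scaled scores; subtracting the max keeps exp() stable."""
    scaled = [scale * v for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def is_ambiguous(similarities, min_gap=0.2, scale=100.0):
    """True when the two most likely prompts are within min_gap of each other."""
    probs = sorted(softmax(similarities, scale), reverse=True)
    return len(probs) > 1 and (probs[0] - probs[1]) < min_gap


# Hypothetical cosine similarities of one crop against three prompts.
print(is_ambiguous([0.251, 0.249, 0.200]))  # True: top two prompts nearly tie
print(is_ambiguous([0.290, 0.240, 0.200]))  # False: one prompt clearly wins
```

A crop flagged as ambiguous is a hint that the prompts overlap in meaning and may need more distinguishing detail.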