Sign Language Dataset Hub: Curating 73+ Verified Datasets for 26 Sign Languages
March 27, 2026 · 8 min read
The Problem: Sign Language Research Needs Better Data Discovery
Sign language recognition (SLR) is a rapidly growing field in computer vision, NLP, and sensor-based computing. But finding the right dataset is shockingly difficult. Datasets are scattered across personal websites, academic pages, Kaggle, and GitHub. Many have broken links. Some have been taken down entirely. And crucially, many catalogs online contain fabricated entries: datasets that don't actually exist.
As someone working on sensor-based sign language recognition research, I needed a trustworthy, verified catalog. So I built one.
Verification: No Fabricated Data
This was the most important design decision. Every single dataset in the hub is verified: the link works, the data is actually downloadable, the language and modality are confirmed. I personally checked each URL and removed any dataset that was unavailable or had broken access.
The initial version of the repo (not by me) had inflated claims about dataset counts and some entries that didn't exist. I did a full audit: deleted everything unverified and rebuilt from scratch with proper citations and honest statistics.
Coverage: 26 Sign Languages
The hub covers 26 sign languages including ASL (American), BSL (British), ISL (Indian), BdSL (Bengali), JSL (Japanese), DGS (German), LSF (French), and many more. Each entry includes the language, modality (video, sensor, image, skeleton), sample count (when verified), and proper academic citations.
The Demo Data: BdSL-Sensor-Glove
The repository includes a small demo dataset from the Bengali Sign Language Sensor Glove project (BdSL-Sensor-Glove) with 4,824 sensor samples split across train (3,528), validation (648), and test (648) sets. This is included specifically for learning and prototyping, not as a research dataset.
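The split sizes can be sanity-checked in a few lines of Python. The counts below are taken from the catalog; the dictionary itself is illustrative, not a file in the repo:

```python
# Demo split sizes as stated in the catalog (not loaded from disk).
splits = {"train": 3528, "validation": 648, "test": 648}

total = sum(splits.values())
assert total == 4824, f"expected 4824 samples, got {total}"

# Report each split's share of the full dataset.
for name, count in splits.items():
    print(f"{name}: {count} ({count / total:.1%})")
```

The ~73/13/13 split is a conventional hold-out scheme: enough validation data to tune hyperparameters without starving the training set.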
Beyond a Catalog: Tools for Researchers
The hub isn't just a list of links. It includes Jupyter notebooks for data exploration, a PyTorch data loader, visualization tools, and tutorials for beginners. The GitHub Pages site provides a searchable, filterable interface for finding datasets by language, modality, or type.
Academic Use
Every dataset has proper citation information. The hub itself includes a BibTeX entry for researchers who want to cite it. The REFERENCES.md file provides a curated bibliography of key papers in sign language recognition.
For the Deaf Community
This isn't just a research tool. It's also for developers building assistive technology: sign language translation apps, educational tools, and communication aids. Every dataset entry notes whether it's free, commercial, or academic, making it easy for indie developers to find usable data.
Try It
Browse the catalog: rudra496.github.io/SignLanguage-Dataset-Hub
github.com/rudra496/SignLanguage-Dataset-Hub
Sample Dataset Entries
Each entry in the catalog includes the language, modality, approximate sample count, and a direct link:
| Dataset | Language | Modality | Samples |
|---|---|---|---|
| WLASL | ASL | Video | 2,000+ glosses |
| Phoenix-2014T | German SL | Video | 1,329 classes |
| AUTSL | Turkish SL | Video | 226 signs |
| RWTH-PHOENIX-Weather | German SL | Video | 1,107 sentences |
| How2Sign | ASL | Video | 80+ hours |
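To give a feel for how entries like these can be queried programmatically, here is a minimal sketch. The `DatasetEntry` class and `filter_by` helper are illustrative, not part of the repo; the rows mirror a few entries from the table above:

```python
from dataclasses import dataclass

@dataclass
class DatasetEntry:
    name: str
    language: str
    modality: str
    samples: str

# A few rows from the catalog table (illustrative subset).
CATALOG = [
    DatasetEntry("WLASL", "ASL", "Video", "2,000+ glosses"),
    DatasetEntry("AUTSL", "Turkish SL", "Video", "226 signs"),
    DatasetEntry("How2Sign", "ASL", "Video", "80+ hours"),
]

def filter_by(catalog, *, language=None, modality=None):
    """Return entries matching the given language and/or modality."""
    return [
        e for e in catalog
        if (language is None or e.language == language)
        and (modality is None or e.modality == modality)
    ]

asl_video = filter_by(CATALOG, language="ASL", modality="Video")
print([e.name for e in asl_video])  # ['WLASL', 'How2Sign']
```

The GitHub Pages site exposes the same kind of language/modality filtering through a search interface.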
The Verification Process
Every dataset was verified in three steps: (1) URL check, confirming the link returns HTTP 200; (2) content check, confirming the page actually contains downloadable data (not a dead page); and (3) metadata check, confirming the language, modality, and sample counts match what's advertised. Any dataset that failed any check was removed. This is critical because many existing catalogs include entries that link to pages that no longer exist or have been moved.
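Step (1) can be automated with Python's standard library. The sketch below is a minimal illustration of the URL check, not the actual verification script; the `check_url` name and the timeout value are assumptions:

```python
import urllib.request
from urllib.error import URLError

def check_url(url: str, timeout: float = 10.0) -> bool:
    """Step 1 of the verification: does the link resolve with HTTP 200?

    Uses a HEAD request to avoid downloading the full page. Note that
    some servers reject HEAD; a production script would fall back to GET.
    """
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, ValueError):
        # Network failure, HTTP error, or a malformed URL all count as dead.
        return False
```

Steps (2) and (3) are harder to automate, which is why the content and metadata checks were done by hand.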
BibTeX Citation
@misc{sarker2026signlanguage,
  author    = {Sarker, Rudra},
  title     = {Sign Language Dataset Hub: A Verified Catalog of 73+ Datasets},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/rudra496/SignLanguage-Dataset-Hub}
}
Why SLR Research Matters
Over 70 million deaf people worldwide use sign language as their primary language. Automatic sign language recognition can bridge communication gaps, enable real-time translation, and improve accessibility in education, healthcare, and public services. The field spans computer vision (video-based recognition), NLP (gloss-to-text translation), and sensor-based computing (glove and accelerometer data). Having a centralized, verified dataset catalog accelerates research by reducing the time researchers spend finding and validating data sources.
PyTorch Data Loader
The repo includes a ready-to-use PyTorch data loader for the demo dataset:
from torch.utils.data import DataLoader

from datasets.bdsl_loader import BdSLDataset

# Load the demo dataset's training split
train_dataset = BdSLDataset(
    data_dir="data/bdsl/BdSL-Sensor-Glove/train",
    normalize=True,
)

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=2,
)

# Each batch yields (sensor_data, labels); `model` and `criterion`
# are defined elsewhere (e.g. an nn.Module and nn.CrossEntropyLoss)
for batch_x, batch_y in train_loader:
    predictions = model(batch_x)
    loss = criterion(predictions, batch_y)