🤟 Open Source · Research · Deaf Community

Sign Language Dataset Hub: Curating 73+ Verified Datasets for 26 Sign Languages

March 27, 2026 · 8 min read

The Problem: Sign Language Research Needs Better Data Discovery

Sign language recognition (SLR) is a rapidly growing field in computer vision, NLP, and sensor-based computing. But finding the right dataset is shockingly difficult. Datasets are scattered across personal websites, academic pages, Kaggle, and GitHub. Many have broken links. Some have been taken down entirely. And crucially, many catalogs online contain fabricated entries — datasets that don't actually exist.

As someone working on sensor-based sign language recognition research, I needed a trustworthy, verified catalog. So I built one.

Verification: No Fabricated Data

This was the most important design decision. Every single dataset in the hub is verified: the link works, the data is actually downloadable, the language and modality are confirmed. I personally checked each URL and removed any dataset that was unavailable or had broken access.

The initial version of the repo (not by me) had inflated claims about dataset counts and some entries that didn't exist. I did a full audit: deleted everything unverified and rebuilt from scratch with proper citations and honest statistics.

Coverage: 26 Sign Languages

The hub covers 26 sign languages, including ASL (American), BSL (British), ISL (Indian), BdSL (Bengali), JSL (Japanese), DGS (German), LSF (French), and many more. Each entry includes the language, modality (video, sensor, image, skeleton), sample count (when verified), and proper academic citations.

The Demo Data: BdSL-Sensor-Glove

The repository includes a small demo dataset from the Bengali Sign Language Sensor Glove project (BdSL-Sensor-Glove) with 4,824 sensor samples split across train (3,528), validation (648), and test (648) sets. This is included specifically for learning and prototyping — not as a research dataset.
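As a quick sanity check, the split sizes quoted above can be summed and converted to proportions. This is a minimal sketch that uses only those three counts; the variable names are illustrative and not taken from the repository.

```python
# Split sizes quoted above for the BdSL-Sensor-Glove demo data.
splits = {"train": 3528, "validation": 648, "test": 648}

total = sum(splits.values())

# Share of each split as a percentage of the total, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in splits.items()}

print(total)   # 4824
print(shares)  # {'train': 73.1, 'validation': 13.4, 'test': 13.4}
```

The counts add up to the stated 4,824 samples, which works out to roughly a 73/13/13 split.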

Beyond a Catalog: Tools for Researchers

The hub isn't just a list of links. It includes Jupyter notebooks for data exploration, a PyTorch data loader, visualization tools, and tutorials for beginners. The GitHub Pages site provides a searchable, filterable interface for finding datasets by language, modality, or type.
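To make the idea of filtering by language or modality concrete, here is a minimal sketch in Python. The `DatasetEntry` fields and the `filter_entries` helper are assumptions made for illustration; they are not the hub's actual schema or site code. The example rows reuse figures quoted elsewhere in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog row; field names are illustrative, not the hub's schema."""
    name: str
    language: str
    modality: str                  # e.g. "video", "sensor", "image", "skeleton"
    size: Optional[str] = None     # only filled in when verified


def filter_entries(entries, language=None, modality=None):
    """Return entries matching every filter that is not None (case-insensitive)."""
    def matches(value, wanted):
        return wanted is None or value.lower() == wanted.lower()

    return [
        e for e in entries
        if matches(e.language, language) and matches(e.modality, modality)
    ]


# Example rows using figures quoted in this post.
CATALOG = [
    DatasetEntry("WLASL", "ASL", "video", "2,000+ glosses"),
    DatasetEntry("AUTSL", "Turkish SL", "video", "226 signs"),
    DatasetEntry("How2Sign", "ASL", "video", "80+ hours"),
    DatasetEntry("BdSL-Sensor-Glove", "BdSL", "sensor", "4,824 samples"),
]

asl_video = filter_entries(CATALOG, language="asl", modality="video")
print([e.name for e in asl_video])  # ['WLASL', 'How2Sign']
```

Leaving a filter as `None` means "do not filter on this field", so a call with no arguments returns the whole catalog.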

Academic Use

Every dataset has proper citation information. The hub itself includes a BibTeX entry for researchers who want to cite it. The REFERENCES.md file provides a curated bibliography of key papers in sign language recognition.

For the Deaf Community

This isn't just a research tool. It's also for developers building assistive technology such as sign language translation apps, educational tools, and communication aids. Every dataset entry notes whether the data is free to use, commercial, or restricted to academic use, making it easy for indie developers to find usable data.

Try It

Browse the catalog: rudra496.github.io/SignLanguage-Dataset-Hub

Source code: github.com/rudra496/SignLanguage-Dataset-Hub

Sample Dataset Entries

Each entry in the catalog includes the language, modality, approximate size, and a direct link. A few examples (links omitted here):

Dataset	Language	Modality	Size
WLASL	ASL	Video	2,000+ glosses
Phoenix-2014T	German SL	Video	1,329 classes
AUTSL	Turkish SL	Video	226 signs
RWTH-PHOENIX-Weather	German SL	Video	1,107 sentences
How2Sign	ASL	Video	80+ hours

The Verification Process

Every dataset was verified in three steps: (1) URL check — confirming the link returns HTTP 200, (2) content check — confirming the page actually contains downloadable data (not a dead page), and (3) metadata check — confirming the language, modality, and sample counts match what's advertised. Any dataset that failed any check was removed. This is critical because many existing catalogs include entries that link to pages that no longer exist or have been moved.
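The three checks could be scripted roughly as follows. This is only a sketch of one way to automate them, not the process actually used for the hub, which was checked by hand. The `Claimed` class, the `verify` function, the keyword list, and the example.org URLs are all hypothetical. The fetch function is passed in as a parameter so the logic can be exercised without network access.

```python
from dataclasses import dataclass
from typing import Callable, Tuple
import urllib.request


@dataclass
class Claimed:
    """Metadata an entry advertises (hypothetical fields)."""
    url: str
    language: str
    modality: str


def http_fetch(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Fetch a URL and return (status_code, body_text); raises on network errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def verify(entry: Claimed,
           fetch: Callable[[str], Tuple[int, str]] = http_fetch) -> str:
    """Return 'ok', or the name of the first check that failed."""
    # Step 1: URL check. Any error or non-200 status counts as a failure.
    try:
        status, body = fetch(entry.url)
    except Exception:
        return "url"
    if status != 200:
        return "url"

    text = body.lower()

    # Step 2: content check. Keyword matching is a crude stand-in for
    # actually looking at the page (assumed keyword list).
    if not any(word in text for word in ("download", ".zip", ".tar", ".csv")):
        return "content"

    # Step 3: metadata check. The advertised language and modality
    # should both be mentioned on the page.
    if entry.language.lower() not in text or entry.modality.lower() not in text:
        return "metadata"
    return "ok"


# Offline usage example with a fake fetcher and made-up pages.
def fake_fetch(url):
    pages = {
        "https://example.org/good": (200, "Download the ASL video corpus (.zip)"),
        "https://example.org/dead": (404, "Not found"),
        "https://example.org/empty": (200, "This page has moved."),
    }
    return pages[url]


print(verify(Claimed("https://example.org/good", "ASL", "video"), fake_fetch))   # ok
print(verify(Claimed("https://example.org/dead", "ASL", "video"), fake_fetch))   # url
print(verify(Claimed("https://example.org/empty", "ASL", "video"), fake_fetch))  # content
```

An automated pass like this can only flag candidates for removal. Confirming that sample counts match what is advertised still needs a manual look at the data itself.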

BibTeX Citation

@misc{sarker2026signlanguage,
  author = {Sarker, Rudra},
  title = {Sign Language Dataset Hub: A Verified Catalog of 73+ Datasets},
  year = {2026},
  publisher = {GitHub},
  url = {https://github.com/rudra496/SignLanguage-Dataset-Hub}
}

Why SLR Research Matters

Over 70 million deaf people worldwide use sign language as their primary language. Automatic sign language recognition can bridge communication gaps, enable real-time translation, and improve accessibility in education, healthcare, and public services. The field spans computer vision (video-based recognition), NLP (gloss-to-text translation), and sensor-based computing (glove and accelerometer data). Having a centralized, verified dataset catalog accelerates research by reducing the time researchers spend finding and validating data sources.

PyTorch Data Loader

The repo includes a ready-to-use PyTorch data loader for the demo dataset:

from torch.utils.data import DataLoader
from datasets.bdsl_loader import BdSLDataset

# Load the demo dataset
train_dataset = BdSLDataset(
    data_dir="data/bdsl/BdSL-Sensor-Glove/train",
    normalize=True
)

train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=2
)

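# Each batch is a (sensor_data, labels) pair.
# `model`, `criterion`, and `optimizer` are not provided by the loader; supply
# your own (for example an nn.Module, nn.CrossEntropyLoss(), and torch.optim.Adam).
for batch_x, batch_y in train_loader:
    optimizer.zero_grad()
    predictions = model(batch_x)
    loss = criterion(predictions, batch_y)
    loss.backward()
    optimizer.step()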
