Machine Learning · Healthcare · Wearable Sensors · Research

Stroke Rehabilitation Activity Classification from Wearable IMU Data

By Rudra Sarker • Published May 9, 2026

The Challenge: Monitoring Stroke Rehabilitation

Stroke is one of the leading causes of long-term disability worldwide. Recovery depends heavily on consistent, repetitive rehabilitation exercises. But here is the problem: once patients leave the clinic, therapists have almost no visibility into whether exercises are being performed correctly, how often, or for how long. Patients are given paper sheets with exercise instructions that are easy to ignore or perform incorrectly.

Wearable sensors, specifically Inertial Measurement Units (IMUs) that combine accelerometers and gyroscopes, offer a way to capture movement data unobtrusively. The challenge is building a classification system that can take raw IMU signals and determine which rehabilitation exercise a patient is performing. This is not a diagnosis problem. It is a pattern recognition problem. And it is one where machine learning can provide genuine value as an assistive tool for therapists.

I wanted to explore this space rigorously: take a real clinical dataset, build classification models, validate them properly, and produce interpretable results that a healthcare professional could actually understand and trust. This project is the result of that exploration, documented in a 67-page research paper with fully reproducible Jupyter notebooks.

Dataset and Methods

I worked with the PrimSeq (StrokeRehab) dataset, which contains IMU recordings from stroke patients performing standardized rehabilitation exercises. The dataset captures the reality of clinical data: varying numbers of repetitions, different movement quality levels, inter-subject variability in exercise execution, and sensor noise from everyday use.

Before any modeling, I applied stratified sampling so that the class distribution was preserved across training and test splits. This matters because rehabilitation sessions do not contain equal numbers of every exercise. Some activities are overrepresented, and naive random splitting can leave rare exercises underrepresented in one of the splits, which makes the reported metrics unreliable.
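As a rough illustration of what stratification buys you, here is a minimal sketch on synthetic data. It is not the project's actual splitting code: the three classes, their 80/15/5 proportions, and the placeholder feature matrix are all assumptions made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic, imbalanced labels for three hypothetical exercise classes
y = np.array([0] * 160 + [1] * 30 + [2] * 10)
X = rng.normal(size=(len(y), 4))  # placeholder feature matrix

# stratify=y keeps the class proportions roughly equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

def class_fractions(labels, n_classes=3):
    """Return the fraction of samples that belong to each class."""
    return np.bincount(labels, minlength=n_classes) / len(labels)

train_frac = class_fractions(y_train)
test_frac = class_fractions(y_test)
print("train:", train_frac.round(3))
print("test: ", test_frac.round(3))
```

The two printed rows should be close to each other, which is the point: the rare class still shows up in the test split in about the same proportion as in training.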

The preprocessing pipeline included windowing the continuous IMU streams into fixed-length segments, computing statistical and frequency-domain features from each window, and normalizing the feature space. Feature engineering was a critical step. For each sensor axis, I extracted features like mean, standard deviation, variance, minimum, maximum, signal magnitude area, correlation between axes, and spectral energy. These hand-crafted features give traditional classifiers the information they need without requiring the massive parameter count of deep learning approaches.
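To make the windowing and feature steps concrete, here is a minimal sketch on a synthetic three-axis signal. The window length, step size, function names, and the exact definitions of signal magnitude area and spectral energy are my assumptions for illustration; the notebooks may define them differently.

```python
import numpy as np

def sliding_windows(signal, window_size, step):
    """Split a (n_samples, n_axes) stream into overlapping fixed-length windows."""
    starts = range(0, signal.shape[0] - window_size + 1, step)
    return np.stack([signal[s:s + window_size] for s in starts])

def window_features(window):
    """Compute a flat feature vector for one (window_size, n_axes) window."""
    per_axis = [
        window.mean(axis=0),
        window.std(axis=0),
        window.var(axis=0),
        window.min(axis=0),
        window.max(axis=0),
    ]
    # Signal magnitude area: time average of the summed absolute axis values
    sma = np.abs(window).sum(axis=1).mean()
    # Pairwise correlation between axes (upper triangle, diagonal excluded)
    corr = np.corrcoef(window, rowvar=False)
    upper = np.triu_indices(window.shape[1], k=1)
    # Spectral energy per axis: mean squared FFT magnitude after removing the mean
    spectrum = np.abs(np.fft.rfft(window - window.mean(axis=0), axis=0)) ** 2
    energy = spectrum.mean(axis=0)
    return np.concatenate(per_axis + [[sma], corr[upper], energy])

rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 3))  # stand-in for one accelerometer recording

windows = sliding_windows(stream, window_size=100, step=50)
X_features = np.array([window_features(w) for w in windows])

# Z-score normalization of the feature space
mu, sigma = X_features.mean(axis=0), X_features.std(axis=0)
X_norm = (X_features - mu) / sigma
print(windows.shape, X_features.shape)
```

In a real evaluation, `mu` and `sigma` should be computed on the training fold only and then applied to the held-out data, so that no test statistics leak into training.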

Model Architecture: Random Forest and LSTM

I implemented two complementary modeling approaches to balance interpretability with raw classification performance.

Random Forest was the first model. Decision tree ensembles are well-suited to tabular feature representations, handle mixed feature scales gracefully, and most importantly provide direct feature importance rankings. In a healthcare context, knowing which sensor features drive a classification decision is just as valuable as the decision itself. Therapists need to understand why the system thinks a patient performed a particular exercise, not just receive a label.

LSTM (Long Short-Term Memory) networks were the second model. Unlike the Random Forest, which operates on engineered features, the LSTM processes raw sequential IMU data and learns temporal dependencies automatically. This makes it potentially more powerful for capturing subtle movement dynamics, at the cost of being harder to interpret.

Here is a simplified version of the training approach for the Random Forest classifier:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import accuracy_score

# X_features, y_labels, and subject_ids are NumPy arrays from the preprocessing
# stage: one feature row, one exercise label, and one subject ID per window.

# LOSO validation: each fold leaves one subject out entirely
logo = LeaveOneGroupOut()

accuracies = []
for train_idx, test_idx in logo.split(X_features, y_labels, groups=subject_ids):
    X_train, X_test = X_features[train_idx], X_features[test_idx]
    y_train, y_test = y_labels[train_idx], y_labels[test_idx]

    # A fresh model per fold, so nothing learned from the held-out subject carries over
    clf = RandomForestClassifier(
        n_estimators=200,
        max_depth=None,
        min_samples_split=5,
        class_weight='balanced',  # reweight classes to offset exercise imbalance
        random_state=42,
        n_jobs=-1
    )
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracies.append(accuracy_score(y_test, y_pred))

# Mean and standard deviation of per-subject accuracy across all folds
print(f"LOSO Accuracy: {np.mean(accuracies)*100:.1f}% +/- {np.std(accuracies)*100:.1f}%")

This code shows the core validation strategy: Leave-One-Subject-Out cross-validation. Every fold trains on all subjects except one, then tests on the held-out subject. This is the gold standard for wearable sensor studies because it measures generalization to entirely new patients, not just new data from patients the model has already seen.

Results: 92.5% Accuracy Under LOSO Validation

The primary result: the combined pipeline achieved 92.5% accuracy under Leave-One-Subject-Out (LOSO) validation. This is a meaningful number because LOSO is a strict evaluation protocol. It means the system can correctly classify rehabilitation activities for a patient it has never encountered before, using only wearable sensor data.

Several factors contributed to this result:

Feature engineering quality: The hand-crafted statistical and frequency-domain features captured the discriminative characteristics of each exercise movement effectively. Not all features were equally informative, and the feature importance analysis helped identify the most useful ones.
Stratified sampling: By preserving class distributions during data splitting, the models were trained and evaluated under realistic conditions, preventing inflated performance metrics from imbalanced data.
Model selection: Random Forest provided a strong baseline with interpretable results. The LSTM complemented this by capturing temporal patterns in the raw signal that static features might miss.

Beyond accuracy, I also generated ROC curves for each activity class. These curves show the trade-off between true positive rate and false positive rate for individual exercises, giving a more nuanced view of performance than a single accuracy number. Some exercises with very similar movement patterns (for example, different arm raise variations) naturally showed lower discrimination, which is an important finding for guiding future work.
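For readers who want to see how per-class ROC curves are typically computed, here is a minimal one-vs-rest sketch with scikit-learn. It runs on a synthetic three-class problem from `make_classification`, not on PrimSeq, so the AUC values it prints say nothing about the results reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic stand-in for windowed IMU features with three exercise classes
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_test)  # shape: (n_test_samples, n_classes)

# One-vs-rest: treat each class in turn as "positive" and all others as "negative"
y_onehot = label_binarize(y_test, classes=clf.classes_)
roc_auc = {}
for i, c in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_onehot[:, i], proba[:, i])
    roc_auc[int(c)] = auc(fpr, tpr)
    print(f"class {c}: AUC = {roc_auc[int(c)]:.3f}")
```

Plotting `fpr` against `tpr` for each class gives the curves themselves. Classes whose curve sits closer to the diagonal are the ones the model confuses with others.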

Feature Importance: What the Models Learned

One of the most valuable outputs from this project was the feature importance ranking from the Random Forest model. Understanding which sensor features matter most for classification has practical implications:

It tells us which sensor placements and axes are most informative, potentially allowing for fewer sensors and simpler wearable hardware.
It provides clinicians with interpretable evidence. If the model classifies an exercise based primarily on wrist rotation velocity and acceleration magnitude, a therapist can evaluate whether that aligns with their clinical understanding of the exercise.
It guides future data collection by highlighting which features to prioritize and which can be dropped without significant performance loss.

The feature importance analysis revealed that signal magnitude area and variance across accelerometer axes were consistently among the top discriminators. This makes intuitive sense: different rehabilitation exercises involve distinct movement intensities and ranges of motion that are captured well by these aggregate statistics.
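A ranking like this can be read directly off a fitted Random Forest. The sketch below uses synthetic data and hypothetical feature names (only the first two are constructed to carry signal), so it shows the mechanics rather than reproducing the project's ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical feature names; the last three columns are pure noise by construction
feature_names = ["acc_x_var", "acc_sma", "noise_a", "noise_b", "noise_c"]

n = 400
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, len(feature_names)))
X[:, 0] += 2.0 * y  # make the first feature informative
X[:, 1] += 1.5 * y  # make the second feature informative

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, sorted from most to least important
importances = clf.feature_importances_
order = np.argsort(importances)[::-1]
ranking = [(feature_names[i], float(importances[i])) for i in order]

for name, score in ranking:
    print(f"{name:>10s}  {score:.3f}")
```

These impurity-based scores are normalized to sum to 1. They can be biased toward high-cardinality or correlated features, so permutation importance on held-out data is a useful cross-check.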

Few-Shot Personalization

A purely subject-independent classifier, while robust, does not account for individual movement patterns. Stroke patients in particular exhibit highly variable motor impairments. To address this, I explored few-shot personalization: fine-tuning the pre-trained model with just a small number of labeled examples from a new patient.

The idea is practical. A therapist records a patient performing each exercise a few times, labels them, and the system adapts. Even with limited per-patient data, personalization can improve classification accuracy for that individual beyond what the generic model achieves. This bridges the gap between a one-size-fits-all classifier and a fully personalized system that requires extensive data collection from every patient.
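Here is one simple way the adaptation step could look for a feature-based classifier. To be clear, this is an assumption on my part for illustration, not the method in the paper: instead of gradient-based fine-tuning, it refits a Random Forest on the pooled data plus a handful of up-weighted calibration samples from the new patient. The synthetic subjects, the offset, `k = 5`, and the weight of 20 are all made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_subject(n_per_class, offset):
    """Synthetic two-class, two-feature data shifted by a per-subject offset."""
    X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n_per_class, 2))
    X1 = rng.normal(loc=[2.0, 0.0], scale=1.0, size=(n_per_class, 2))
    X = np.vstack([X0, X1]) + np.asarray(offset)
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_pop, y_pop = make_subject(200, offset=[0.0, 0.0])  # pooled training subjects
X_new, y_new = make_subject(100, offset=[3.0, 6.0])  # new patient, shifted features

# Few-shot calibration set: k labeled samples per class from the new patient
k = 5
calib_idx = np.concatenate([np.where(y_new == c)[0][:k] for c in (0, 1)])
test_mask = np.ones(len(y_new), dtype=bool)
test_mask[calib_idx] = False  # evaluate only on samples not used for calibration

# Generic model: trained on the pooled population only
generic = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_pop, y_pop)
acc_generic = accuracy_score(y_new[test_mask], generic.predict(X_new[test_mask]))

# Personalized model: refit with the calibration samples up-weighted
X_aug = np.vstack([X_pop, X_new[calib_idx]])
y_aug = np.concatenate([y_pop, y_new[calib_idx]])
weights = np.concatenate([np.ones(len(y_pop)), np.full(len(calib_idx), 20.0)])
personal = RandomForestClassifier(n_estimators=100, random_state=0)
personal.fit(X_aug, y_aug, sample_weight=weights)
acc_personal = accuracy_score(y_new[test_mask], personal.predict(X_new[test_mask]))

print(f"generic: {acc_generic:.2f}  personalized: {acc_personal:.2f}")
```

The synthetic patient is deliberately shifted far from the population so the effect is visible. On real data the gain from a few calibration samples will depend on how atypical the patient's movement actually is.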

Why This Matters for Healthcare

This project is a research exploration, not a clinical tool. But the implications are worth discussing.

Stroke rehabilitation is resource-constrained. Therapists cannot be present for every exercise session. Home-based exercise programs are prescribed, but adherence and quality are difficult to monitor. An assistive classification system built on wearable IMU data could, with further validation and clinical trials, provide therapists with objective data about which exercises patients performed, how often, and whether movement patterns suggest correct or incorrect execution.

Key design principles that matter for real-world healthcare applications:

Interpretability over black-box accuracy: Healthcare professionals need to understand and trust system outputs. Feature importance rankings and ROC curves provide this transparency in a way that raw neural network predictions cannot.
Subject-independent validation: LOSO evaluation ensures results generalize to new patients, which is essential for any deployable system. Reporting only within-subject accuracy would be misleading.
Minimal sensor burden: Feature importance analysis helps identify the smallest set of sensors needed for reliable classification, reducing cost and improving patient comfort.
Reproducibility: Every experiment in this project is reproducible through provided Jupyter notebooks. Transparent methodology is critical for healthcare research.

Reproducibility and Resources

Full reproducibility was a core goal of this project. The repository includes:

67-page research document in the /docs/ directory covering literature review, methodology, experimental setup, results, and discussion in depth
Jupyter notebooks for every stage of the pipeline: data loading, preprocessing, feature extraction, model training, evaluation, and visualization
Requirements: Python 3.8+, numpy, pandas, scikit-learn, tensorflow, matplotlib

All code is open-source under the MIT License.

Get Started

The complete codebase, research documentation, and notebooks are available on GitHub:

GitHub: github.com/rudra496/Stroke

If you are working on wearable computing, rehabilitation technology, or healthcare machine learning, I would welcome your feedback and collaboration. Feel free to open an issue, submit a pull request, or reach out to me on any of my social channels.

