An unofficial implementation of TubeViT in "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning"


daniel-code/TubeViT


TubeViT

An unofficial implementation of TubeViT in "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning"

Spec.

  • Fixed Positional embedding
  • Sparse Tube Construction
    • Multi-Tube
    • Interpolated Kernels
    • Space To Depth
    • config of tubes
  • pipeline
    • training
    • evaluating
    • inference
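As a rough illustration of what sparse tube construction means for sequence length, the sketch below counts the tokens each tube would produce if it is treated as a strided 3D window with a start offset. The helper names and the tube shapes, strides, and offsets in `TUBES` are hypothetical values chosen for illustration; they are not read from this repository's configuration.

```python
from typing import List, Tuple

Triple = Tuple[int, int, int]  # (time, height, width)


def tokens_per_axis(size: int, kernel: int, stride: int, offset: int = 0) -> int:
    """Count window positions along one axis when sliding by `stride` from `offset`."""
    span = size - offset - kernel
    return 0 if span < 0 else span // stride + 1


def tube_token_count(video: Triple, kernel: Triple, stride: Triple,
                     offset: Triple = (0, 0, 0)) -> int:
    """Tokens produced by one tube: the product of positions along T, H, and W."""
    count = 1
    for size, k, s, o in zip(video, kernel, stride, offset):
        count *= tokens_per_axis(size, k, s, o)
    return count


# Hypothetical multi-tube config: (kernel, stride, offset) per tube.
TUBES: List[Tuple[Triple, Triple, Triple]] = [
    ((8, 8, 8), (16, 32, 32), (0, 0, 0)),
    ((16, 4, 4), (6, 32, 32), (4, 8, 8)),
    ((4, 12, 12), (16, 32, 32), (0, 16, 16)),
    ((1, 16, 16), (32, 16, 16), (0, 0, 0)),
]


def total_tokens(video: Triple, tubes=TUBES) -> int:
    """Sum the token counts of all tubes (class token not included)."""
    return sum(tube_token_count(video, k, s, o) for k, s, o in tubes)


if __name__ == "__main__":
    print(total_tokens((32, 224, 224)))
```

Under these assumed values a 32 x 224 x 224 clip yields 539 tokens, whereas densely patching every frame with 16 x 16 patches would yield 32 x 14 x 14 = 6272.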

Usage

This project is based on torch==1.13.1 and pytorch-lightning.

Setup

  1. Install requirements

    pip install -r requirements.txt

  2. Download the UCF101 dataset

Convert ViT pre-trained weight

Use convert_vit_weight.py to convert torch ViT pre-trained weights to TubeViT.

    python scripts/convert_vit_weight.py --help
    Usage: convert_vit_weight.py [OPTIONS]

    Options:
      -nc, --num-classes INTEGER      num of classes of dataset.
      -f, --frames-per-clip INTEGER   frame per clip.
      -v, --video-size <INTEGER INTEGER>...
                                      frame per clip.
      -o, --output-path PATH          output model weight name.
      --help                          Show this message and exit.

Example

Convert ImageNet pre-trained weights for use with UCF101. --num-classes defaults to 101.

python scripts/convert_vit_weight.py 

Train

Currently, train.py trains only on the torchvision UCF101 dataset. Change the dataset if needed.

--dataset-root and --annotation-path follow torchvision.datasets.UCF101: they correspond to its root and annotation_path arguments.
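To check that the two paths line up before training, the sketch below resolves the videos listed in a split file. It assumes the standard UCF101 split layout (files named like trainlist01.txt and testlist01.txt inside the annotation folder, one `Class/video.avi` entry per line, optionally followed by a label), which is the layout torchvision.datasets.UCF101 is understood to read. The function name `list_split_videos` is hypothetical and is not part of this repository or of torchvision.

```python
import os
from typing import List


def list_split_videos(dataset_root: str, annotation_path: str,
                      train: bool = True, fold: int = 1) -> List[str]:
    """Return the video paths named in one UCF101 split file, joined onto dataset_root."""
    split = "train" if train else "test"
    list_file = os.path.join(annotation_path, f"{split}list{fold:02d}.txt")
    videos = []
    with open(list_file, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Train lists look like "Class/video.avi <label>"; test lists omit the label.
            rel_path = line.split(" ")[0]
            videos.append(os.path.join(dataset_root, *rel_path.split("/")))
    return videos


def missing_videos(dataset_root: str, annotation_path: str,
                   train: bool = True, fold: int = 1) -> List[str]:
    """List split entries that do not exist on disk under dataset_root."""
    paths = list_split_videos(dataset_root, annotation_path, train, fold)
    return [p for p in paths if not os.path.isfile(p)]
```

An empty result from `missing_videos("path/to/dataset", "path/to/annotation")` suggests the two options point at matching folders.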

    python scripts/train.py --help
    Usage: train.py [OPTIONS]

    Options:
      -r, --dataset-root PATH         path to dataset.  [required]
      -a, --annotation-path PATH      path to dataset.  [required]
      -nc, --num-classes INTEGER      num of classes of dataset.
      -b, --batch-size INTEGER        batch size.
      -f, --frames-per-clip INTEGER   frame per clip.
      -v, --video-size <INTEGER INTEGER>...
                                      frame per clip.
      --max-epochs INTEGER            max epochs.
      --num-workers INTEGER
      --fast-dev-run
      --seed INTEGER                  random seed.
      --preview-video                 Show input video
      --help                          Show this message and exit.

Examples

python scripts/train.py -r path/to/dataset -a path/to/annotation 

Evaluation

    python scripts/evaluate.py --help
    Usage: evaluate.py [OPTIONS]

    Options:
      -r, --dataset-root PATH         path to dataset.  [required]
      -m, --model-path PATH           path to model weight.  [required]
      -a, --annotation-path PATH      path to dataset.  [required]
      --label-path PATH               path to classInd.txt.  [required]
      -nc, --num-classes INTEGER      num of classes of dataset.
      -b, --batch-size INTEGER        batch size.
      -f, --frames-per-clip INTEGER   frame per clip.
      -v, --video-size <INTEGER INTEGER>...
                                      frame per clip.
      --num-workers INTEGER
      --seed INTEGER                  random seed.
      --verbose                       Show input video
      --help                          Show this message and exit.

Examples

python scripts/evaluate.py -r path/to/dataset -a path/to/annotation -m path/to/model --label-path path/to/classInd.txt
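The --label-path option expects UCF101's classInd.txt, which pairs each class index with its name, one pair per line (for example `1 ApplyEyeMakeup`). The sketch below shows one way to read that file into a lookup table for interpreting predictions. The function name `load_class_index` is hypothetical, and shifting the 1-based file indices to 0-based model outputs is an assumption rather than something confirmed from evaluate.py.

```python
from typing import Dict


def load_class_index(label_path: str) -> Dict[int, str]:
    """Parse classInd.txt lines of the form '<index> <ClassName>' into {index - 1: name}."""
    classes: Dict[int, str] = {}
    with open(label_path, encoding="utf-8") as fh:
        for line in fh:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            index, name = parts
            # Assumption: the file is 1-based while model outputs are 0-based.
            classes[int(index) - 1] = name
    return classes
```

With the table loaded, `load_class_index("path/to/classInd.txt")[predicted_id]` maps a predicted class id back to a readable label.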

Model Architecture

fig1.png fig2.png fig3.png

Positional embedding

Position_Embedding.png
