Sakuga-42M is the first large-scale cartoon animation dataset. It comprises 42 million keyframes covering a wide range of artistic styles, regions, and years, with comprehensive semantic annotations including video-text description pairs, anime tags, and content taxonomies. We demonstrate the benefits of such a large-scale cartoon dataset on comprehension and generation tasks by finetuning contemporary foundation models such as Video CLIP, Video Mamba, and SVD, achieving strong performance on cartoon-related tasks.
Our motivation is to bring large-scale data to cartoon research and to foster generalization and robustness in future cartoon applications. The dataset, code, and pretrained models will be publicly available.
Sakuga-42M primarily comprises clips with keyframe durations of up to 96 frames and emphasizes a high proportion of clips with strong aesthetic and dynamic scores. To categorize the clips more effectively, we provide an additional content-based taxonomy derived from anime tags. Our dataset surpasses the combined size of all previous cartoon datasets, paving the way for large-scale models.
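For illustration, the minimal sketch below shows one way such metadata could be filtered by duration and score and grouped by taxonomy; the file name and column names (num_keyframes, aesthetic_score, dynamic_score, taxonomy) are hypothetical placeholders, not the released schema.

```python
# Minimal sketch of clip selection over Sakuga-42M-style metadata.
# Column names and the CSV path are assumptions for illustration only.
import pandas as pd

def select_clips(metadata_csv: str,
                 max_keyframes: int = 96,
                 min_aesthetic: float = 0.5,
                 min_dynamic: float = 0.5) -> pd.DataFrame:
    """Keep clips that are short enough and score well on both metrics."""
    df = pd.read_csv(metadata_csv)
    mask = (
        (df["num_keyframes"] <= max_keyframes)
        & (df["aesthetic_score"] >= min_aesthetic)
        & (df["dynamic_score"] >= min_dynamic)
    )
    return df.loc[mask]

if __name__ == "__main__":
    clips = select_clips("sakuga42m_metadata.csv")  # hypothetical file name
    # Group the surviving clips by their content taxonomy label.
    print(clips.groupby("taxonomy").size().sort_values(ascending=False))
```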
Sakuga-42M reveals the differences in data distribution between natural videos and hand-drawn cartoons. While different natural datasets overlap in feature space, Sakuga-42M forms a distinct cluster, highlighting its unique characteristics.
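A comparison of this kind could be reproduced, for example, by projecting precomputed frame embeddings from cartoon and natural-video sources into two dimensions. The sketch below assumes the embeddings already exist as .npy arrays; the file names are placeholders, not released artifacts.

```python
# Sketch of a feature-space comparison between cartoon and natural-video frames.
# Inputs are assumed to be precomputed embeddings (e.g., CLIP features), shape (N, D).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

cartoon = np.load("sakuga_features.npy")   # placeholder: cartoon frame embeddings
natural = np.load("natural_features.npy")  # placeholder: natural-video embeddings

features = np.concatenate([cartoon, natural], axis=0)
labels = np.array([0] * len(cartoon) + [1] * len(natural))

# t-SNE gives a qualitative picture of whether the two sources separate.
embedded = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)

plt.scatter(embedded[labels == 0, 0], embedded[labels == 0, 1], s=2, label="Sakuga-42M")
plt.scatter(embedded[labels == 1, 0], embedded[labels == 1, 1], s=2, label="natural video")
plt.legend()
plt.savefig("feature_distribution.png", dpi=200)
```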
@article{sakuga42m2024,
title = {Sakuga-42M Dataset: Scaling Up Cartoon Research},
author = {Pan, Zhenglin and Zhu, Yu and Mu, Yuxuan},
journal = {arXiv preprint arXiv:2405.07425},
year = {2024}
}