
MusicCaps is a dataset of 5,521 ten-second music clips, each annotated by musicians with an aspect list and a free-text caption. An aspect list is a set of short descriptors of how a piece of music sounds, for example “pop, tinny wide hi hats, mellow piano melody, high pitched female vocal melody, sustained pulsating synth lead.”
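To make the aspect-list idea concrete, here is a minimal Python sketch that turns the example above into a list of individual descriptors. It assumes the aspect list arrives as a single comma-separated string, which may not match how the dataset actually stores it; the helper name `parse_aspect_list` is hypothetical.

```python
from typing import List


def parse_aspect_list(raw: str) -> List[str]:
    """Split a comma-separated aspect string into trimmed descriptors.

    Assumption: descriptors are separated by commas and contain no commas themselves.
    """
    return [part.strip() for part in raw.split(",") if part.strip()]


# The example aspect list quoted in the text.
example = (
    "pop, tinny wide hi hats, mellow piano melody, "
    "high pitched female vocal melody, sustained pulsating synth lead"
)

aspects = parse_aspect_list(example)
for aspect in aspects:
    print(aspect)
```

Each descriptor can then be treated as a separate tag, for instance to count how often a given instrument or mood appears.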

The free-text caption describes how the music sounds, including specifics such as the instruments and the atmosphere. The clips are taken from the AudioSet dataset and are drawn from both its eval and train splits.

The dataset is released under a Creative Commons BY-SA 4.0 licence. Each clip has metadata attached to it: a YT ID (pointing to the YouTube video in which the labelled music segment appears), the start and end positions of the segment within that video, labels from the AudioSet dataset, an aspect list, a caption, an author ID (for grouping samples according to who wrote them), a flag marking whether the clip belongs to the balanced subset, and a flag marking whether it comes from the AudioSet eval split.
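The metadata fields listed above could be modelled as a small record type. The sketch below is only an illustration: the class name, field names, and types are assumptions that mirror the description, not necessarily the dataset's actual column names, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MusicCapsClip:
    """One annotated clip (hypothetical field names mirroring the description)."""

    yt_id: str                   # ID of the YouTube video containing the segment
    start_s: int                 # segment start within the video, in seconds
    end_s: int                   # segment end within the video, in seconds
    audioset_labels: List[str]   # labels carried over from AudioSet
    aspect_list: List[str]       # short descriptors of how the music sounds
    caption: str                 # free-text description written by a musician
    author_id: int               # groups samples by who wrote them
    is_balanced_subset: bool     # True if the clip is in the balanced subset
    is_audioset_eval: bool       # True if the clip comes from the AudioSet eval split

    @property
    def duration_s(self) -> int:
        """Length of the labelled segment in seconds."""
        return self.end_s - self.start_s


# Made-up example values, for illustration only.
clip = MusicCapsClip(
    yt_id="EXAMPLE_ID",
    start_s=30,
    end_s=40,
    audioset_labels=["Music"],
    aspect_list=["pop", "mellow piano melody"],
    caption="A pop song with a mellow piano melody.",
    author_id=1,
    is_balanced_subset=False,
    is_audioset_eval=True,
)

print(clip.duration_s)
```

Keeping the two flags as booleans makes it easy to filter, for example, only the clips that belong to the balanced subset.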

The dataset is intended for tasks that involve describing music.
