CLAP Format Specification
- Status: DRAFT
- Document revision: 0.0.1
- Last updated: Feb 6th, 2024
- Author(s): Julian BILCKE (@flngr)
BEFORE YOU READ
The CLAP format spec is experimental and not finished yet! There might be inconsistencies, unnecessary redundancies or blatant omissions.
What are CLAP files?
The CLAP format (.clap) is a file format designed for AI video projects.
It stores prompts and assets in the same container, making it easier to share an AI video project between different people or applications.
Structure
A CLAP is an array of objects serialized into a YAML text string, then compressed with gzip into a binary file.
The file extension is .clap.
The mime type is application/x-yaml.
There can be 5 different types of objects:
- one HEADER
- one METADATA
- zero, one or more MODEL(s)
- zero, one or more SCENE(s)
- zero, one or more SEGMENT(s)
This can be represented in JavaScript like this:
[
clapHeader, // one header object
clapMeta, // one metadata object
...clapModels, // optional array of models
...clapScenes, // optional array of scenes
...clapSegments // optional array of segments
]
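The pipeline described above (array of objects, YAML text, gzip) can be sketched as follows. This is a minimal illustration, not the reference implementation: `encodeClap` and `decodeClap` are hypothetical names, and the sketch uses JSON text as the YAML payload (JSON is, for practical purposes, valid YAML 1.2) to stay free of third-party dependencies. A real encoder would presumably use a YAML library.

```typescript
import { gzipSync, gunzipSync } from "node:zlib";

// Hypothetical helper: serialize the flat array of CLAP objects, then gzip it.
// Assumption: JSON text stands in for YAML here, since YAML 1.2 parsers accept JSON syntax.
function encodeClap(objects: unknown[]): Uint8Array {
  const yamlText = JSON.stringify(objects);
  return gzipSync(yamlText);
}

// Hypothetical helper: gunzip the bytes, then parse the text back into an array.
// Assumption: the payload was produced by encodeClap above (JSON-flavoured YAML).
function decodeClap(bytes: Uint8Array): unknown[] {
  const yamlText = gunzipSync(bytes).toString("utf8");
  const parsed: unknown = JSON.parse(yamlText);
  if (!Array.isArray(parsed)) {
    throw new Error("a CLAP payload must be an array of objects");
  }
  return parsed;
}
```

A decoder for arbitrary .clap files would need a full YAML parser in place of `JSON.parse`, since files written by other tools may use YAML syntax that is not JSON.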
Header
The HEADER provides information about how to decode a CLAP.
Knowing the number of models, scenes and segments in advance helps the decoder parse the file and, in some implementations, helps with debugging, logging and memory provisioning.
However, a different scheme may be used in the future in order to support streaming: either recognizing each object by its shape (its fields), or using a dedicated field, e.g. a _type field.
{
// indicates which version of the format is used.
// CLAP is still in development and the format is not fully specified yet;
// during this period most .clap files will have the "clap-0" format
format: "clap-0"
numberOfModels: number // integer
numberOfScenes: number // integer
numberOfSegments: number // integer
}
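To illustrate how the counts in the header help a decoder, here is a sketch that splits an already-decoded array into its parts. The names `readHeader`, `splitClap` and `SplitClap` are assumptions made for this example, as is the choice to reject an array whose length does not match the header.

```typescript
interface ClapHeader {
  format: string;
  numberOfModels: number;
  numberOfScenes: number;
  numberOfSegments: number;
}

// Hypothetical result shape: the objects themselves are left untyped here.
interface SplitClap {
  header: ClapHeader;
  meta: unknown;
  models: unknown[];
  scenes: unknown[];
  segments: unknown[];
}

// Check that the first object has a format string and three non-negative integer counts.
function readHeader(value: unknown): ClapHeader {
  if (typeof value !== "object" || value === null) {
    throw new Error("CLAP header must be an object");
  }
  const h = value as Partial<ClapHeader>;
  const counts = [h.numberOfModels, h.numberOfScenes, h.numberOfSegments];
  const countsOk = counts.every((n) => Number.isInteger(n) && (n as number) >= 0);
  if (typeof h.format !== "string" || !countsOk) {
    throw new Error("CLAP header is missing a field or has an invalid count");
  }
  return h as ClapHeader;
}

// Slice the flat array in the documented order: header, metadata, models, scenes, segments.
function splitClap(objects: unknown[]): SplitClap {
  if (objects.length < 2) {
    throw new Error("a CLAP needs at least a header and a metadata object");
  }
  const header = readHeader(objects[0]);
  const modelsEnd = 2 + header.numberOfModels;
  const scenesEnd = modelsEnd + header.numberOfScenes;
  const expected = scenesEnd + header.numberOfSegments;
  if (objects.length !== expected) {
    throw new Error(`expected ${expected} objects, found ${objects.length}`);
  }
  return {
    header,
    meta: objects[1],
    models: objects.slice(2, modelsEnd),
    scenes: objects.slice(modelsEnd, scenesEnd),
    segments: objects.slice(scenesEnd),
  };
}
```

Because the counts are known up front, each group is a plain slice and no per-object type detection is needed.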
Metadata
{
id: string // "<a valid UUID V4>"
title: string // "project title"
description: string // "project description"
licence: string // "information about licensing"
// this provides information about the image ratio
// this might be removed in the final spec, as this
// can be re-computed from width and height
orientation: "landscape" | "vertical" | "square"
// the suggested width and height of the video
// note that this is just an indicator,
// and might be superseded by the application reading the .clap file
width: number // integer between 256 and 8192 (value in pixels)
height: number // integer between 256 and 8192 (value in pixels)
// name of the suggested video model to use
// note that this is just an indicator,
// and might be superseded by the application reading the .clap file
defaultVideoModel: string
// additional prompt to use in the video generation
// this helps add a magic touch and flair to the videos,
// but perhaps the field should be renamed
extraPositivePrompt: string
// the screenplay (script) of the video
screenplay: string
}
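The comments above note that the orientation can be re-computed from the width and height. A minimal sketch of that computation, together with the documented size range, could look like this. The function names are hypothetical, and two details are assumptions not stated in the spec: "square" is returned only when width and height are exactly equal, and the 256 to 8192 range is treated as inclusive.

```typescript
type ClapOrientation = "landscape" | "vertical" | "square";

// Assumption: no tolerance is applied, so only width === height counts as "square".
function computeOrientation(width: number, height: number): ClapOrientation {
  if (width > height) return "landscape";
  if (width < height) return "vertical";
  return "square";
}

// Assumption: the bounds 256 and 8192 are inclusive.
function isValidDimension(value: number): boolean {
  return Number.isInteger(value) && value >= 256 && value <= 8192;
}
```

For example, `computeOrientation(1024, 576)` yields "landscape" and `computeOrientation(576, 1024)` yields "vertical".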
Models
Before talking about models, first we should describe the concept of entity:
in a story, an entity is something (person, place, vehicle, animal, robot, alien, object) with a name, a description of the appearance, an age, mileage or quality, an origin, and so on.
An example could be "a giant magical school bus, with appearance of a cat with wheels, and which talks"
The CLAP model would be an instance (an interpretation) of this entity, where we would assign it an identity:
- a name and age
- a visual style (a photo of the magic school bus cat)
- a voice style
- and maybe other things eg. an origin or background story
As you can see, it can be difficult to create clearly separated categories, like "vehicle", "character", or "location" (the magical cat bus could turn into a location in one scene, a speaking character in another, etc.).
This is why there is a common schema for all models:
{
id: string
category: ClapSegmentCategory
triggerName: string
label: string
description: string
author: string
thumbnailUrl: string
seed: number
assetSourceType: ClapAssetSource
assetUrl: string
age: number
gender: ClapModelGender
region: ClapModelRegion
appearance: ClapModelAppearance
voiceVendor: ClapVoiceVendor
voiceId: string
}
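As a rough illustration of how an application might check that an object follows this common schema, here is a small runtime check. It is only a sketch: `looksLikeClapModel` is a hypothetical name, every field is assumed to be required, and the enum-like types (ClapSegmentCategory, ClapAssetSource, ClapModelGender, ClapModelRegion, ClapModelAppearance, ClapVoiceVendor) are assumed to be string-valued, which should be verified against "./types.ts".

```typescript
// Fields declared as string in the schema, plus the enum-like fields
// (assumption: those enums are represented as strings).
const MODEL_STRING_FIELDS = [
  "id", "category", "triggerName", "label", "description", "author",
  "thumbnailUrl", "assetSourceType", "assetUrl", "gender", "region",
  "appearance", "voiceVendor", "voiceId",
] as const;

// Fields declared as number in the schema.
const MODEL_NUMBER_FIELDS = ["seed", "age"] as const;

// Returns true when every schema field is present with the expected primitive type.
// Assumption: all fields are required; the spec does not say which are optional.
function looksLikeClapModel(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    MODEL_STRING_FIELDS.every((field) => typeof record[field] === "string") &&
    MODEL_NUMBER_FIELDS.every((field) => typeof record[field] === "number")
  );
}
```

A check like this only validates the shape of the object. It does not verify that enum fields hold one of the values actually allowed by the format.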
TO BE CONTINUED
(you can read "./types.ts" for more information)