Spaces:

jbilcke-hf
/

ai-tube

Running on CPU Upgrade

App Files Files Community

ai-tube / src /clap /clap-specification-draft.md

jbilcke-hf HF staff

fix broken markdown

2d3ddad 9 months ago

preview code

raw

history blame

4.54 kB

	# CLAP Format Specification

	- Status: DRAFT
	- Document revision: 0.0.1
	- Last updated: Feb 6th, 2024
	- Author(s): Julian BILCKE (@flngr)

	## BEFORE YOU READ

	The CLAP format spec is experimental and not finished yet!
	There might be inconsistencies, unnecessary redundancies or blatant omissions.

	## What are CLAP files?

	The CLAP format (.clap) is a file format designed for AI video projects.

	It preserves prompts and assets into the same container, making it easier to share an AI video project between different people or applications.

	## Structure

	A CLAP is an array of objects serialized into a YAML text string, then finally compressed using gzip to a binary file.

	The file extension is `.clap`
	The mime type is `application/x-yaml`

	There can be 5 different types of objects:

	- one HEADER
	- one METADATA
	- zero, one or more MODEL(s)
	- zero, one or more SCENE(s)
	- zero, one or more SEGMENT(s)

	This can be represented in javascript like this:

	```javascript
	[
	clapHeader, // one metadata object
	clapMeta, // one metadata object
	...clapModels, // optional array of models
	...clapScenes, // optional array of scenes
	...clapSegments // optional array of segments
	]
	```

	## Header

	The HEADER provides information about how to decode a CLAP.

	Knowing in advance the number of models, scenes and segments helps the decoder parsing the information,
	and in some implementation, help with debugging, logging, and provisioning memory usage.

	However in the future, it is possible that a different scheme is used, in order to support streaming.

	Either by recognizing the shape of each object (fields), or by using a specific field eg. a `_type`.

	```typescript
	{
	// used to know which format version is used.
	// CLAP is still in development and the format is not fully specified yet,
	// during the period most .clap file will have the "clap-0" format
	format: "clap-0"

	numberOfModels: number // integer
	numberOfScenes: number // integer
	numberOfSegments: number // integer
	}
	```

	## Metadata

	```typescript
	{
	id: string // "<a valid UUID V4>"
	title: string // "project title"
	description: string // "project description"
	licence: string // "information about licensing"

	// this provides information about the image ratio
	// this might be removed in the final spec, as this
	// can be re-computed from width and height
	orientation: "landscape" \| "vertical" \| "square"

	// the suggested width and height of the video
	// note that this is just an indicator,
	// and might be superseeded by the application reading the .clap file
	width: number // integer between 256 and 8192 (value in pixels)
	height: number // integer between 256 and 8192 (value in pixels)

	// name of the suggested video model to use
	// note that this is just an indicator,
	// and might be superseeded by the application reading the .clap file
	defaultVideoModel: string

	// additional prompt to use in the video generation
	// this helps adding some magic touch and flair to the videos,
	// but perhaps the field should be renamed
	extraPositivePrompt: string

	// the screenplay (script) of the video
	screenplay: string
	}
	```

	## Models

	Before talking about models, first we should describe the concept of entity:

	in a story, an entity is something (person, place, vehicle, animal, robot, alien, object) with a name, a description of the appearance, an age, mileage or quality, an origin, and so on.

	An example could be "a giant magical school bus, with appearance of a cat with wheels, and which talks"

	The CLAP model would be an instance (an interpretation) of this entity, where we would assign it an identity:
	- a name and age
	- a visual style (a photo of the magic school bus cat)
	- a voice style
	- and maybe other things eg. an origin or background story

	As you can see, it can be difficult to create clearly separated categories, like "vehicule", "character", or "location"
	(the magical cat bus could turn into a location in some scene, a speaking character in another etc)

	This is why there is a common schema for all models:

	```typescript
	{
	id: string
	category: ClapSegmentCategory
	triggerName: string
	label: string
	description: string
	author: string
	thumbnailUrl: string
	seed: number

	assetSourceType: ClapAssetSource
	assetUrl: string

	age: number
	gender: ClapModelGender
	region: ClapModelRegion
	appearance: ClapModelAppearance
	voiceVendor: ClapVoiceVendor
	voiceId: string
	}
	```

	TO BE CONTINUED

	(you can read "./types.ts" for more information)