# CLAP Format Specification
- Status: DRAFT
- Document revision: 0.0.1
- Last updated: Feb 6th, 2024
- Author(s): Julian BILCKE (@flngr)
## BEFORE YOU READ
The CLAP format spec is experimental and not finished yet!
There might be inconsistencies, unnecessary redundancies or blatant omissions.
## What are CLAP files?
The CLAP format (.clap) is a file format designed for AI video projects.
It preserves prompts and assets into the same container, making it easier to share an AI video project between different people or applications.
## Structure
A CLAP file is an array of objects serialized into a YAML text string, which is then compressed into a binary file using gzip.
The file extension is `.clap`.
The MIME type is `application/x-yaml`.
There can be 5 different types of objects:
- one HEADER
- one METADATA
- zero, one or more MODEL(s)
- zero, one or more SCENE(s)
- zero, one or more SEGMENT(s)
This can be represented in JavaScript like this:
```javascript
[
clapHeader, // one header object
clapMeta, // one metadata object
...clapModels, // optional array of models
...clapScenes, // optional array of scenes
...clapSegments // optional array of segments
]
```
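As a concrete sketch, the header's counts can be derived from the arrays themselves before serialization. The function and type names below are illustrative assumptions, not part of the spec:

```typescript
// Sketch: assemble the top-level array, deriving header counts
// from the input arrays so they always stay consistent.
type ClapObject = Record<string, unknown>;

function assembleClap(
  meta: ClapObject,
  models: ClapObject[],
  scenes: ClapObject[],
  segments: ClapObject[]
): ClapObject[] {
  const header = {
    format: "clap-0",
    numberOfModels: models.length,
    numberOfScenes: scenes.length,
    numberOfSegments: segments.length,
  };
  return [header, meta, ...models, ...scenes, ...segments];
}
```

The resulting array would then be serialized to YAML and gzip-compressed as described above; that step is omitted here since it depends on the YAML library used.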
## Header
The HEADER provides information about how to decode a CLAP.
Knowing the number of models, scenes, and segments in advance helps the decoder parse the file, and in some implementations helps with debugging, logging, and memory provisioning.
However, a different scheme may be used in the future in order to support streaming:
either by recognizing the shape of each object (its fields), or by using a dedicated field, e.g. `_type`.
```typescript
{
// used to know which format version is used.
// CLAP is still in development and the format is not fully specified yet;
// during this period, most .clap files will have the "clap-0" format
format: "clap-0"
numberOfModels: number // integer
numberOfScenes: number // integer
numberOfSegments: number // integer
}
```
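A decoder can use these counts as a sanity check before reading the rest of the file. A minimal sketch, where the function name and error messages are assumptions:

```typescript
// Sketch: validate that a parsed CLAP array matches the counts
// declared in its header before processing the remaining objects.
interface ClapHeader {
  format: string;
  numberOfModels: number;
  numberOfScenes: number;
  numberOfSegments: number;
}

function checkHeader(objects: unknown[]): ClapHeader {
  const header = objects[0] as ClapHeader | undefined;
  if (header?.format !== "clap-0") {
    throw new Error(`unsupported format: ${header?.format}`);
  }
  // 2 = the HEADER itself plus the single METADATA object
  const expected =
    2 + header.numberOfModels + header.numberOfScenes + header.numberOfSegments;
  if (objects.length !== expected) {
    throw new Error(`expected ${expected} objects, got ${objects.length}`);
  }
  return header;
}
```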
## Metadata
```typescript
{
id: string // "<a valid UUID V4>"
title: string // "project title"
description: string // "project description"
licence: string // "information about licensing"
// this provides information about the image ratio
// this might be removed in the final spec, as this
// can be re-computed from width and height
orientation: "landscape" | "vertical" | "square"
// the suggested width and height of the video
// note that this is just an indicator,
// and might be superseded by the application reading the .clap file
width: number // integer between 256 and 8192 (value in pixels)
height: number // integer between 256 and 8192 (value in pixels)
// name of the suggested video model to use
// note that this is just an indicator,
// and might be superseded by the application reading the .clap file
defaultVideoModel: string
// additional prompt to use in the video generation
// this helps add some magic touch and flair to the videos,
// but perhaps the field should be renamed
extraPositivePrompt: string
// the screenplay (script) of the video
screenplay: string
}
```
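As the comment on `orientation` notes, the value can be recomputed from `width` and `height`. A minimal sketch of that derivation (the function name is an assumption):

```typescript
// Derive the orientation from the suggested pixel dimensions.
type ClapOrientation = "landscape" | "vertical" | "square";

function computeOrientation(width: number, height: number): ClapOrientation {
  if (width > height) return "landscape";
  if (height > width) return "vertical";
  return "square";
}
```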
## Models
Before talking about models, we should first describe the concept of an entity:
in a story, an entity is something (person, place, vehicle, animal, robot, alien, object) with a name, a description of the appearance, an age, mileage or quality, an origin, and so on.
An example could be "a giant magical school bus, with the appearance of a cat with wheels, and which talks".
The CLAP model would be an instance (an interpretation) of this entity, where we would assign it an identity:
- a name and age
- a visual style (a photo of the magic school bus cat)
- a voice style
- and maybe other things eg. an origin or background story
As you can see, it can be difficult to create clearly separated categories like "vehicle", "character", or "location"
(the magical cat bus could turn into a location in one scene, a speaking character in another, etc.)
This is why there is a common schema for all models:
```typescript
{
id: string
category: ClapSegmentCategory
triggerName: string
label: string
description: string
author: string
thumbnailUrl: string
seed: number
assetSourceType: ClapAssetSource
assetUrl: string
age: number
gender: ClapModelGender
region: ClapModelRegion
appearance: ClapModelAppearance
voiceVendor: ClapVoiceVendor
voiceId: string
}
```
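To illustrate, here is a hypothetical instance of this schema for the "magical school bus cat" entity described above. The enum types (`ClapSegmentCategory`, `ClapAssetSource`, `ClapModelGender`, etc.) are defined in `./types.ts` and not shown here, so every enum-typed value below is a placeholder string, not an actual value from the spec:

```typescript
// Hypothetical model instance; all enum-typed values are placeholders.
const magicBusCat = {
  id: "9b1deb4d-3b7d-4bad-9bdd-2b0d7b3dcb6d",
  category: "character", // assumed ClapSegmentCategory value
  triggerName: "magic_bus_cat",
  label: "Magic Bus Cat",
  description:
    "a giant magical school bus, with the appearance of a cat with wheels, and which talks",
  author: "example author",
  thumbnailUrl: "",
  seed: 42,
  assetSourceType: "REMOTE", // assumed ClapAssetSource value
  assetUrl: "",
  age: 7,
  gender: "unknown", // assumed ClapModelGender value
  region: "unknown", // assumed ClapModelRegion value
  appearance: "cartoon", // assumed ClapModelAppearance value
  voiceVendor: "example-vendor", // assumed ClapVoiceVendor value
  voiceId: "example-voice-id",
};
```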
TO BE CONTINUED
(you can read "./types.ts" for more information)