Exploring animation use case / ControlNetModel usage - down_block_res_samples, mid_block_res_sample

#30
by johndpope - opened

I'm working on this paper - EMOTE:
https://github.com/johndpope/Emote-hack/issues/23

It seems like one of their training stages is basically to train this foundational model.
The paper shows input images that get combined through a VAE frames encoder:
https://github.com/johndpope/Emote-hack/blob/main/Net.py#L42
I want to skip this step and create something akin to it, but powered by this model instead.
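
Roughly, the frames-encoder step I'd like to skip boils down to encoding each frame into SD latents with a VAE. A minimal sketch, assuming a stock AutoencoderKL (stabilityai/sd-vae-ft-mse) rather than whatever Net.py actually wires up:

```python
# Minimal sketch of the "VAE frames encoder" step, assuming a stock SD VAE.
# This is my guess at what that stage amounts to, not a copy of Net.py.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")  # assumed: any SD-compatible VAE works here
vae.eval()

frames = torch.rand(8, 3, 512, 512) * 2 - 1  # 8 RGB frames scaled to [-1, 1]
with torch.no_grad():
    latents = vae.encode(frames).latent_dist.sample() * vae.config.scaling_factor

print(latents.shape)  # (8, 4, 64, 64) - per-frame latents a temporal module could consume
```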

Q) Rather than using the images / latents, I think it would be faster to get convergence on the symbolic representation, i.e. the outputs of this model - ControlNetMediaPipeFace.
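
For reference, the conditioning input this ControlNet consumes is the rendered MediaPipe face mesh per frame. A hedged sketch of producing it - I'm assuming controlnet_aux's MediapipeFaceDetector matches the annotator used for training, and the frame path is hypothetical:

```python
# Hedged sketch: render the MediaPipe face mesh for one frame.
# Assumption: controlnet_aux's MediapipeFaceDetector matches the annotator used to train this ControlNet.
from PIL import Image
from controlnet_aux import MediapipeFaceDetector

detector = MediapipeFaceDetector()
frame = Image.open("frame_0001.png").convert("RGB")  # hypothetical frame path
face_mesh = detector(frame)  # PIL image of the face-mesh rendering, used as controlnet_cond
face_mesh.save("face_mesh_0001.png")
```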

Typically the model is only ever used in conjunction with the SD pipeline, where a single image gets spat out - rendering out / flattening the data into image format.
While I can do this, I'm wondering if keeping the activations in some temporal model might be a better approach. I was excited about this, until it dawned on me that these activations are specific to the input image, so I started looking at visualizing them - but is this pushing the proverbial uphill?
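
Concretely, here's what I mean by using the outputs directly: calling the ControlNetModel on its own (outside StableDiffusionControlNetPipeline) and keeping down_block_res_samples / mid_block_res_sample instead of rendering a frame. A minimal sketch - the repo id and subfolder are my assumptions from the model card, and the latents / text embeds are random stand-ins:

```python
# Minimal sketch: run the ControlNet alone and keep its residual activations.
# Repo id / subfolder are assumptions from the model card; inputs are random stand-ins.
import torch
from diffusers import ControlNetModel

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

controlnet = ControlNetModel.from_pretrained(
    "CrucibleAI/ControlNetMediaPipeFace",  # assumed repo id
    subfolder="diffusion_sd15",            # assumed location of the SD 1.5 weights
    torch_dtype=dtype,
).to(device)

batch = 1
latents = torch.randn(batch, 4, 64, 64, device=device, dtype=dtype)           # noisy latent (512 / 8 = 64)
timestep = torch.tensor([500], device=device)                                 # arbitrary diffusion step
text_dim = controlnet.config.cross_attention_dim
prompt_embeds = torch.randn(batch, 77, text_dim, device=device, dtype=dtype)  # stand-in for CLIP text embeds
face_mesh_cond = torch.rand(batch, 3, 512, 512, device=device, dtype=dtype)   # rendered face mesh in [0, 1]

with torch.no_grad():
    down_block_res_samples, mid_block_res_sample = controlnet(
        sample=latents,
        timestep=timestep,
        encoder_hidden_states=prompt_embeds,
        controlnet_cond=face_mesh_cond,
        conditioning_scale=1.0,
        return_dict=False,
    )

for i, res in enumerate(down_block_res_samples):
    print(f"down_block_res_samples[{i}]: {tuple(res.shape)}")
print(f"mid_block_res_sample: {tuple(mid_block_res_sample.shape)}")
```

One caveat I keep coming back to: these residuals depend on the noisy latent and the timestep as well as on the face-mesh conditioning image, so they aren't a pure function of the control signal - which is part of why they feel so image-specific.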

johndpope changed discussion status to closed
