NeuralBody / supplementary_material.md
pengsida
initial commit
1ba539f

Supplementary Material

Training and test data

We provide a website for visualization.

The multi-view videos are captured by 23 cameras. We train our model on the "0, 6, 12, 18" cameras and test it on the remaining cameras.

The following table shows the detailed frame numbers for training and test of each video. Since the video length of each subject is different, we choose the appropriate number of frames for training and test.

Note that since rendering is very slow, we test our model every 30 frames. For example, although the frame range of video 313 is "0-59", we only test our model on the 0-th and 30-th frames.

Video 313 315 377 386 387 390 392 393 394
Number of frames 1470 2185 617 646 654 1171 556 658 859
Frame Range (Training) 0-59 0-399 0-299 0-299 0-299 700-999 0-299 0-299 0-299
Frame Range (Unseen human poses) 60-1060 400-1400 300-617 300-646 300-654 0-700 300-556 300-658 300-859

Evaluation metrics

We save our rendering results on novel views of training frames and unseen human poses at here.

As described in the paper, we evaluate our model in terms of the PSNR and SSIM metrics.

A straightforward way for evaluation is calculating the metrics on the whole image. Since we already know the 3D bounding box of the target human, we can project the 3D box to obtain a bound_mask and make the colors of pixels outside the mask as zero, as shown in the following figure.

fig

As a result, the PSNR and SSIM metrics appear very high performances, as shown in the following table.

Training frames Unseen human poses
PSNR SSIM PSNR SSIM
313 35.21 0.985 29.02 0.964
315 33.07 0.988 25.70 0.957
392 35.76 0.984 31.53 0.971
393 33.24 0.979 28.40 0.960
394 34.31 0.980 29.61 0.961
377 33.86 0.985 30.60 0.977
386 36.07 0.984 33.05 0.974
390 34.48 0.980 30.25 0.964
387 31.39 0.975 27.68 0.961
34.15 0.982 29.54 0.966

To overcome this problem, a solution is only calculating the metrics on pixels inside the bound_mask. Since the SSIM metric requires the input to have the image format, we first compute the 2D box that bounds the bound_mask and then crop the corresponding image region.

def ssim_metric(rgb_pred, rgb_gt, batch):
    mask_at_box = batch['mask_at_box'][0].detach().cpu().numpy()
    H, W = int(cfg.H * cfg.ratio), int(cfg.W * cfg.ratio)
    mask_at_box = mask_at_box.reshape(H, W)
    # convert the pixels into an image
    img_pred = np.zeros((H, W, 3))
    img_pred[mask_at_box] = rgb_pred
    img_gt = np.zeros((H, W, 3))
    img_gt[mask_at_box] = rgb_gt
    # crop the object region
    x, y, w, h = cv2.boundingRect(mask_at_box.astype(np.uint8))
    img_pred = img_pred[y:y + h, x:x + w]
    img_gt = img_gt[y:y + h, x:x + w]
    # compute the ssim
    ssim = compare_ssim(img_pred, img_gt, multichannel=True)
    return ssim

The following table lists corresponding results.

Training frames Unseen human poses
PSNR SSIM PSNR SSIM
313 30.56 0.971 23.95 0.905
315 27.24 0.962 19.56 0.852
392 29.44 0.946 25.76 0.909
394 28.44 0.940 23.80 0.878
393 27.58 0.939 23.25 0.893
377 27.64 0.951 23.91 0.909
386 28.60 0.931 25.68 0.881
387 25.79 0.928 21.60 0.870
390 27.59 0.926 23.90 0.870
28.10 0.944 23.49 0.885

Results of other methods on ZJU-MoCap

We save rendering results of other methods on novel views of training frames and unseen human poses at here, including Neural Volumes, Multi-view Neural Human Rendering, and Deferred Neural Human Rendering. Note that we only generate novel views of training frames for Neural Volumes.

The following table lists quantitative results of Neural Volumes.

PSNR SSIM
313 20.09 0.831
315 18.57 0.824
392 22.88 0.726
394 22.08 0.843
393 21.29 0.842
377 21.15 0.842
386 23.21 0.820
387 20.74 0.838
390 22.49 0.825
21.39 0.821

The following table lists quantitative results of Multi-view Neural Human Rendering.

Training frames Unseen human poses
PSNR SSIM PSNR SSIM
313 26.68 0.935 23.05 0.893
315 19.81 0.874 18.88 0.844
392 24.73 0.902 23.66 0.893
394 25.01 0.906 22.87 0.874
393 23.47 0.894 22.27 0.885
377 23.79 0.918 21.94 0.885
386 25.02 0.879 23.70 0.853
387 22.65 0.858 20.97 0.866
390 23.72 0.873 22.65 0.858
23.87 0.893 22.22 0.872

The following table lists quantitative results of Deferred Neural Human Rendering.

Training frames Unseen human poses
PSNR SSIM PSNR SSIM
313 25.78 0.929 22.56 0.889
315 19.44 0.869 18.38 0.841
392 24.96 0.905 24.08 0.900
394 24.84 0.903 22.67 0.871
393 23.50 0.896 22.45 0.888
377 23.74 0.917 22.07 0.886
386 24.93 0.877 23.70 0.851
387 22.44 0.888 20.64 0.862
390 24.33 0.881 22.90 0.864
23.77 0.896 22.16 0.872