Performance & Image Processor
First, thanks for sharing the weights. I am having difficulties replicating (or even approaching) the SegFormer paper performance on Cityscapes. The first issue I observed is that, among the B0 to B5 image processors, only the B1 and B5 ones resize the input image to 1024x1024; the rest resize to 512x512, as you can see in the attached screenshot.
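For reference, this is how I am inspecting the processor attributes. The checkpoint names below are the nvidia Cityscapes fine-tuned checkpoints I am assuming here; adjust if yours differ:

    from transformers import SegformerImageProcessor

    # Assumed checkpoint names for the Cityscapes fine-tuned weights.
    for variant in ("b0", "b1", "b2", "b3", "b4", "b5"):
        ckpt = f"nvidia/segformer-{variant}-finetuned-cityscapes-1024-1024"
        processor = SegformerImageProcessor.from_pretrained(ckpt)
        print(variant, processor.size)

    # Overriding the stored size at load time forces a 1024x1024 resize
    # regardless of what the config on the Hub says.
    processor = SegformerImageProcessor.from_pretrained(
        "nvidia/segformer-b0-finetuned-cityscapes-1024-1024",
        size={"height": 1024, "width": 1024},
    )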
Moreover, with a standard PyTorch Dataset (see below; it uses the labelTrainIds generated by cityscapesscripts) that loads the image and label as PIL images and passes both through the image processor, I obtain an mIoU of 58 on the Cityscapes validation set.
When using a custom albumentations transform pipeline of Resize(1024, 1024), Normalize, ToTensorV2 (sketched just below), I improve the result to an mIoU of 68, which is still far from the 76.2 reported in the paper.
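A minimal sketch of that pipeline (ImageNet mean/std assumed, which is what the SegFormer image processors use by default):

    import albumentations as A
    from albumentations.pytorch import ToTensorV2

    val_transform = A.Compose(
        [
            A.Resize(1024, 1024),
            A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),  # ImageNet stats
            ToTensorV2(),
        ]
    )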
According to the paper and mmsegmentation, the evaluation was done via 1024x1024 sliding-window inference with stride 768 (sketched below). Were the results replicated here with this type of inference? It is unclear from the image processor attributes. Is there an available implementation of the Cityscapes inference pipeline for this model implementation?
If not, what were the achieved results and the pre-processing pipeline used?
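To be explicit about what I mean, here is a rough sketch of sliding-window evaluation, loosely following the mmsegmentation logic. The helper name and defaults are mine, not an existing API, and it assumes the input is already normalized and at least 1024x1024:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def slide_inference(model, pixel_values, num_classes=19, crop=1024, stride=768):
        # pixel_values: (1, 3, H, W), already resized/normalized; H, W >= crop is assumed.
        _, _, h, w = pixel_values.shape
        logits_sum = pixel_values.new_zeros((1, num_classes, h, w))
        counts = pixel_values.new_zeros((1, 1, h, w))
        h_grids = max(h - crop + stride - 1, 0) // stride + 1
        w_grids = max(w - crop + stride - 1, 0) // stride + 1
        for i in range(h_grids):
            for j in range(w_grids):
                y1 = min(i * stride, h - crop)  # clamp the last window to the image border
                x1 = min(j * stride, w - crop)
                window = pixel_values[:, :, y1:y1 + crop, x1:x1 + crop]
                logits = model(pixel_values=window).logits  # SegFormer outputs logits at 1/4 resolution
                logits = F.interpolate(logits, size=(crop, crop), mode="bilinear", align_corners=False)
                logits_sum[:, :, y1:y1 + crop, x1:x1 + crop] += logits
                counts[:, :, y1:y1 + crop, x1:x1 + crop] += 1
        return (logits_sum / counts).argmax(dim=1)  # (1, H, W) predicted train ids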
CityScapes Dataset:
import os
from pathlib import Path
from typing import Dict, Optional

import albumentations as A
import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset
from transformers.image_processing_utils import BaseImageProcessor


class CityscapesDataset(Dataset):
    """Cityscapes dataset built from the raw data."""

    def __init__(
        self,
        root_dir: os.PathLike,
        image_processor: Optional[BaseImageProcessor] = None,
        transform: Optional[A.Compose] = None,
        split: str = "train",
    ):  # TODO: we could specify a callable for transform
        """Initialize the dataset object.

        Args:
            root_dir (os.PathLike): Local path to raw data.
            image_processor (Optional[BaseImageProcessor], optional): HuggingFace image processor. Defaults to None.
            transform (Optional[A.Compose], optional): Set of transforms for processing images and masks. Defaults to None.
            split (str, optional): Dataset split to load. Defaults to "train".
        """
        self.root_dir = Path(root_dir)
        self.image_processor = image_processor
        self.split = split
        self.transform = transform
        self.images_dir = self.root_dir / "leftImg8bit" / self.split
        self.labels_dir = self.root_dir / "gtFine" / self.split
        self.images = []
        self.labels = []
        self.ids = []
        for city in self.images_dir.iterdir():
            for image_path in city.iterdir():
                self.images.append(image_path)
                self.labels.append(
                    self.labels_dir / city.name / image_path.name.replace("leftImg8bit", "gtFine_labelTrainIds")
                )
                self.ids.append(image_path.name.replace("_leftImg8bit.png", ""))

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx) -> Dict[str, torch.Tensor]:
        image = Image.open(self.images[idx]).convert("RGB")
        label = Image.open(self.labels[idx])
        if self.transform is not None:
            transformed = self.transform(image=np.asarray(image).copy(), mask=np.asarray(label).copy())
            image = transformed["image"]
            label = transformed["mask"]
        if self.image_processor is not None:
            encoded_inputs = {}
            process_inputs = self.image_processor.preprocess(images=image, segmentation_maps=label, return_tensors="pt")
            for k, v in process_inputs.items():
                encoded_inputs[k] = v.squeeze()  # remove batch dimension
        else:
            encoded_inputs = {"pixel_values": image, "labels": label}
        encoded_inputs["id"] = self.ids[idx]
        # encoded_inputs["labels"].apply_(lambda x: CityscapesDataset.mapping_ids[x])
        return encoded_inputs
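For completeness, this is roughly how I instantiate it for evaluation (the local path is obviously mine, and val_transform is the albumentations pipeline sketched above):

    from pathlib import Path
    from torch.utils.data import DataLoader

    val_dataset = CityscapesDataset(
        root_dir=Path("/data/cityscapes"),  # assumed layout: leftImg8bit/, gtFine/
        transform=val_transform,            # or pass image_processor=processor instead
        split="val",
    )
    val_loader = DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=4)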