Possible issue with the data

#5
by pgabrys - opened

I believe there is a problem with the data labels. Some records with the same title and synopsis have different labels in the dataset. That would be fine if such records were confined to either the train set or the test set, but they aren't. For example, "Sironia" appears only in the train set, with two different labels. "Iron Man" has the adventure genre in the train set. Does that mean it is expected to have anything but the adventure genre in the test set? There are 8540 (out of 70291) movies with the same title and synopsis that appear in both the train and test sets.
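For anyone who wants to reproduce this kind of check, here is a minimal sketch using only the standard library. The field names (`title`, `synopsis`, `genre`) and the toy rows are assumptions for illustration, not the competition's actual schema, and the two genres shown for "Sironia" are placeholders rather than its real labels.

```python
from collections import defaultdict

# Toy rows standing in for the real train/test files (hypothetical values).
train = [
    {"title": "Sironia", "synopsis": "s1", "genre": "drama"},
    {"title": "Sironia", "synopsis": "s1", "genre": "family"},
    {"title": "Iron Man", "synopsis": "s2", "genre": "adventure"},
]
test = [
    {"title": "Iron Man", "synopsis": "s2"},
    {"title": "Beta", "synopsis": "s4"},
]


def key(row):
    """Identify a movie by its (title, synopsis) pair."""
    return (row["title"], row["synopsis"])


# Collect the set of distinct genres seen for each movie in train.
genres_by_key = defaultdict(set)
for row in train:
    genres_by_key[key(row)].add(row["genre"])

# Movies that carry more than one distinct label in train.
conflicting = {k: g for k, g in genres_by_key.items() if len(g) > 1}

# Movies that appear in both train and test.
overlap = set(genres_by_key) & {key(row) for row in test}

print(conflicting)
print(overlap)
```

On the real data, the size of `overlap` would be the number to compare against the 8540 figure quoted above.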


Competitions org

@pgabrys Thank you for pointing this out.
A movie can be associated with multiple genres. We'll leave the handling of the training data up to the user. When making predictions, users should choose the most probable genre for each test sample.
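One possible way to handle the training data, sketched here as an assumption rather than anything the organizers prescribe, is to collapse duplicate records into per-movie label counts and treat the most frequent genre as the "most probable" one. The tuple layout and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training rows: (title, synopsis, genre).
train = [
    ("ABC", "syn", "drama"),
    ("ABC", "syn", "drama"),
    ("ABC", "syn", "family"),
    ("XYZ", "syn2", "horror"),
]

# Count how often each genre occurs for each (title, synopsis) pair.
counts = defaultdict(Counter)
for title, synopsis, genre in train:
    counts[(title, synopsis)][genre] += 1

# Keep one label per movie: the most frequent genre.
# Ties are broken by first occurrence, which is an arbitrary choice.
majority = {k: c.most_common(1)[0][0] for k, c in counts.items()}

print(majority)
```

Keeping the full `counts` instead of only the majority label would also allow training against soft labels, if the model supports it.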

OK. Let's take the example of "Iron Man". The record has the "adventure" genre in the train set. Is it possible for it to have the same genre in the test set? My concern is that it isn't, in which case we should predict the most probable genre other than "adventure".

A follow-up question:

Consider the toy dataset below, with actual and predicted labels. We know there are duplicated entries in both train and test. Focusing on the test set: in the scenario below, the predicted labels are correct as a set but are assigned to the ids in the opposite order. The submission file contains the id and the predicted genre, so even though the model identified the correct labels, the accuracy will be low because the id-to-genre mapping is swapped.

What can one do in such a scenario?

id  title  genre (actual)  prediction
1   ABC    family          drama
2   ABC    drama           family
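To make the effect concrete, here is a small sketch that scores the toy rows two ways: strict per-id accuracy, and an order-insensitive match within each title. The second score is only a local diagnostic suggested here as an assumption; nothing in this thread indicates that the leaderboard scores submissions that way.

```python
from collections import Counter, defaultdict

# The two toy rows from the table above.
rows = [
    {"id": 1, "title": "ABC", "actual": "family", "pred": "drama"},
    {"id": 2, "title": "ABC", "actual": "drama", "pred": "family"},
]

# Strict accuracy: a prediction counts only if it matches its own id.
strict = sum(r["actual"] == r["pred"] for r in rows) / len(rows)

# Order-insensitive score: within each title, compare the multiset of
# actual genres with the multiset of predicted genres.
actual_by_title = defaultdict(Counter)
pred_by_title = defaultdict(Counter)
for r in rows:
    actual_by_title[r["title"]][r["actual"]] += 1
    pred_by_title[r["title"]][r["pred"]] += 1

# Counter intersection keeps the minimum count of each shared genre.
matched = sum(
    sum((actual_by_title[t] & pred_by_title[t]).values())
    for t in actual_by_title
)
relaxed = matched / len(rows)

print(strict, relaxed)  # 0.0 1.0
```

A large gap between the two scores on a validation split would suggest that duplicated records, rather than the model itself, account for much of the measured error.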
