# Filters to reduce the number of instances in the datasets
Motivation:

I noticed that some movies have very few ratings. How can we say whether a movie is good based on only 1 or 2 ratings?

Therefore, here we reduce the dataset by removing users and movies with little "relevance" according to the concepts mentioned earlier.

In [53]:
import pandas as pd

## Filter by number of ratings per user and movie

### Set the minimum number of ratings required to be kept

In [54]:
n_ratings_movie = 10

In [55]:
ratings = pd.read_csv('../data/standard/ratings.csv', index_col='userId')
ratings.head()

userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931


In [56]:
ratings.describe()

,movieId,rating,timestamp
count,100836.0,100836.0,100836.0
mean,19435.295718,3.501557,1205946000.0
std,35530.987199,1.042529,216261000.0
min,1.0,0.5,828124600.0
25%,1199.0,3.0,1019124000.0
50%,2991.0,3.5,1186087000.0
75%,8122.0,4.0,1435994000.0
max,193609.0,5.0,1537799000.0


In [57]:
ratings.index.value_counts().describe()

count     610.000000
mean      165.304918
std       269.480584
min        20.000000
25%        35.000000
50%        70.500000
75%       168.000000
max      2698.000000
Name: count, dtype: float64

In [58]:
ratings['movieId'].value_counts().describe()

count    9724.000000
mean       10.369807
std        22.401005
min         1.000000
25%         1.000000
50%         3.000000
75%         9.000000
max       329.000000
Name: count, dtype: float64
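
The distribution above shows that half of the movies have 3 ratings or fewer and three quarters have 9 or fewer, so the choice of threshold strongly affects how much data survives. Below is a minimal sketch of how candidate thresholds could be compared before picking one. It uses a small made-up table rather than the real `ratings.csv`, and the helper `survival_table` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

# Toy data (made up): movie 10 has 4 ratings, movie 20 has 2, movie 30 has 1
toy = pd.DataFrame({
    'userId':  [1, 2, 3, 4, 1, 2, 3],
    'movieId': [10, 10, 10, 10, 20, 20, 30],
    'rating':  [4.0, 3.5, 5.0, 4.0, 2.0, 3.0, 5.0],
})

def survival_table(df, column, thresholds):
    """For each threshold, count the items and rows that would be kept."""
    counts = df[column].value_counts()
    rows = []
    for t in thresholds:
        kept = counts[counts >= t]
        rows.append({
            'threshold': t,
            'items_kept': len(kept),       # distinct values still present
            'rows_kept': int(kept.sum()),  # ratings still present
        })
    return pd.DataFrame(rows).set_index('threshold')

summary = survival_table(toy, 'movieId', thresholds=[1, 2, 3])
print(summary)
```

Run against the real data with `column='movieId'`, a table like this would make the trade-off between movie coverage and rating volume visible for each candidate value of `n_ratings_movie`.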

In [59]:
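def filter_n_ratings(df, column, n_ratings):
    # Count how many rows (ratings) each value of `column` has
    counts = df[column].value_counts()
    # Keep only the values that reach the minimum number of ratings
    keep = counts[counts >= n_ratings].index
    return df[df[column].isin(keep)]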
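
To make the behavior of `filter_n_ratings` concrete, here is a small sketch on made-up data. The function is restated so the block runs on its own, and the toy frame is an assumption that only mimics the shape of `ratings` (with `userId` as the index).

```python
import pandas as pd

def filter_n_ratings(df, column, n_ratings):
    # Same logic as the function above, restated so this block is self-contained
    counts = df[column].value_counts()
    return df[df[column].isin(counts[counts >= n_ratings].index)]

# Toy data (made up): movie 10 has 3 ratings, movie 20 has only 1
toy = pd.DataFrame(
    {'movieId': [10, 10, 10, 20], 'rating': [4.0, 3.0, 5.0, 1.0]},
    index=pd.Index([1, 2, 3, 1], name='userId'),
)

# With a minimum of 2 ratings, every row of movie 20 is dropped
filtered = filter_n_ratings(toy, 'movieId', n_ratings=2)
print(filtered)
```

The function returns a filtered copy and leaves the input frame untouched, which is why the notebook reassigns the result to `ratings`.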


In [60]:
ratings = filter_n_ratings(ratings, 'movieId', n_ratings_movie)

In [61]:
ratings.describe()

,movieId,rating,timestamp
count,81116.0,81116.0,81116.0
mean,14857.178078,3.573678,1197217000.0
std,29539.336412,1.01859,216718200.0
min,1.0,0.5,828124600.0
25%,1007.0,3.0,1001562000.0
50%,2471.0,4.0,1180447000.0
75%,6016.0,4.0,1431955000.0
max,187593.0,5.0,1537799000.0


In [62]:
len(ratings['movieId'].value_counts())

2269

In [63]:
len(ratings.index.value_counts())

610
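
Only the movie-side threshold is applied here, and all 610 users remain. Before the movie filter, the least active user already had 20 ratings. If a user-side filter were also wanted, one possible sketch is shown below. `n_ratings_user` and `filter_n_ratings_index` are hypothetical names that do not appear in this notebook, and the data is made up. Because `userId` is the index of `ratings`, the counting is done on the index instead of a column.

```python
import pandas as pd

def filter_n_ratings_index(df, n_ratings):
    """Keep rows whose index value appears at least n_ratings times."""
    counts = df.index.value_counts()
    keep = counts[counts >= n_ratings].index
    return df[df.index.isin(keep)]

n_ratings_user = 2  # hypothetical threshold, chosen only for this example

# Toy data (made up): users 1 and 3 have 2 ratings each, user 2 has 1
toy = pd.DataFrame(
    {'movieId': [10, 20, 10, 30, 10], 'rating': [4.0, 3.0, 5.0, 2.0, 4.5]},
    index=pd.Index([1, 1, 2, 3, 3], name='userId'),
)

by_user = filter_n_ratings_index(toy, n_ratings_user)
print(by_user)
```

Removing users can push some movies back below the movie threshold, so if both filters are used they may need to be repeated until the counts stop changing.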

## Applying the filter to the movies dataset

In [64]:
movies = pd.read_csv('../data/standard/movies.csv')
movies.head()

,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [65]:
movies = movies[movies['movieId'].isin(ratings['movieId'])]

In [66]:
movies.describe()

,movieId
count,2269.0
mean,20530.586161
std,35185.840333
min,1.0
25%,1345.0
50%,3256.0
75%,8958.0
max,187593.0


In [67]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2269 entries, 0 to 9709
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  2269 non-null   int64 
 1   title    2269 non-null   object
 2   genres   2269 non-null   object
dtypes: int64(1), object(2)
memory usage: 70.9+ KB


In [68]:
movies = movies.set_index('movieId')

In [69]:
movies

movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance
5,Father of the Bride Part II (1995),Comedy
6,Heat (1995),Action|Crime|Thriller
...,...,...
174055,Dunkirk (2017),Action|Drama|Thriller|War
176371,Blade Runner 2049 (2017),Sci-Fi
177765,Coco (2017),Adventure|Animation|Children
179819,Star Wars: The Last Jedi (2017),Action|Adventure|Fantasy|Sci-Fi


## Applying the filters to the 'links' dataset

In [70]:
links = pd.read_csv('../data/standard/links.csv', index_col='movieId')

In [71]:
links.describe()

,imdbId,tmdbId
count,9742.0,9734.0
mean,677183.9,55162.123793
std,1107228.0,93653.481487
min,417.0,2.0
25%,95180.75,9665.5
50%,167260.5,16529.0
75%,805568.5,44205.75
max,8391976.0,525662.0


In [72]:
links = links[links.index.isin(movies.index)]

In [73]:
links.describe()

,imdbId,tmdbId
count,2269.0,2269.0
mean,367098.1,19720.70119
std,587023.8,49425.176137
min,13442.0,5.0
25%,99653.0,1452.0
50%,119951.0,9071.0
75%,347149.0,11566.0
max,5463162.0,503475.0
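
After filtering, `movies` and `links` both report 2269 rows, matching the 2269 distinct movies left in `ratings`. Equal counts suggest the three tables agree, but they do not prove that the IDs are the same. The sketch below shows one way such a check could be made explicit, using made-up frames shaped like the filtered tables (the `*_toy` names are assumptions for illustration).

```python
import pandas as pd

# Made-up stand-ins for the filtered tables
ratings_toy = pd.DataFrame(
    {'movieId': [1, 1, 3], 'rating': [4.0, 5.0, 3.0]},
    index=pd.Index([1, 2, 2], name='userId'),
)
movies_toy = pd.DataFrame(
    {'title': ['A', 'C']}, index=pd.Index([1, 3], name='movieId')
)
links_toy = pd.DataFrame(
    {'imdbId': [111, 333]}, index=pd.Index([1, 3], name='movieId')
)

# Compare the sets of movie IDs present in each table
rated_ids = set(ratings_toy['movieId'])
same_ids = rated_ids == set(movies_toy.index) == set(links_toy.index)
print(same_ids)
```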


In [74]:
ratings

userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.0,964981247
1,6,4.0,964982224
1,47,5.0,964983815
1,50,5.0,964982931
...,...,...,...
610,159093,3.0,1493847704
610,164179,5.0,1493845631
610,166528,4.0,1493879365
610,168250,5.0,1494273047


In [75]:
ratings.to_csv('../data/reduced/ratings_m{}.csv'.format(n_ratings_movie), index_label='userId')

In [76]:
movies.to_csv('../data/reduced/movies_m{}.csv'.format(n_ratings_movie))

In [77]:
links.to_csv('../data/reduced/links_m{}.csv'.format(n_ratings_movie))
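
`to_csv` does not create missing directories, so the three writes above assume that `../data/reduced/` already exists. The following is a hedged sketch of how the output path could be built and the directory created first. It writes to a temporary directory so it can run anywhere. `reduced_path` is a hypothetical helper, and the frame is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

n_ratings_movie = 10

def reduced_path(base_dir, name, n_ratings):
    """Build e.g. <base_dir>/ratings_m10.csv, creating base_dir if needed."""
    base = Path(base_dir)
    base.mkdir(parents=True, exist_ok=True)
    return base / f'{name}_m{n_ratings}.csv'

# Made-up frame standing in for the filtered ratings
toy = pd.DataFrame(
    {'movieId': [1, 3], 'rating': [4.0, 4.0]},
    index=pd.Index([1, 1], name='userId'),
)

with tempfile.TemporaryDirectory() as tmp:
    out = reduced_path(Path(tmp) / 'reduced', 'ratings', n_ratings_movie)
    toy.to_csv(out)
    # Reading back with the same index column restores the original layout
    restored = pd.read_csv(out, index_col='userId')

print(out.name)
print(restored)
```

Encoding the threshold in the file name, as the notebook does with the `_m{}` suffix, keeps reduced datasets built with different thresholds from overwriting each other.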