
Boris Orekhov

nevmenandr
AI & ML interests

Natural Language Processing, Poetry Generation, Linguistics, Low-resource languages

Posts 2

nevmenandr/w2v-russian-tolstoy

import gensim
from gensim.models import Word2Vec  # needed for Word2Vec.load below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style("darkgrid")

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

modelLNT2 = Word2Vec.load("cbow_300_10.model")

# some code is skipped here; see the model card for the full version

tsnescatterplot(modelLNT2, 'жизнь_S', [i[0] for i in modelLNT2.wv.most_similar(negative=["жизнь_S"])])


"Life" according to Tolstoy (w2v):

Playing with dhcloud/w2v-russian-19c-fiction-lemmas


import numpy as np
from gensim.models import Word2Vec
from sklearn.manifold import TSNE

modell = Word2Vec.load("w2vlemmas.model")
keys = ['Шекспир', 'Пушкин', 'Гоголь', 'матрос', 'кот', 'роман']
embedding_clusters = []
word_clusters = []
# For every key word, collect its 30 nearest neighbours and their vectors
for word in keys:
    embeddings = []
    words = []
    for similar_word, _ in modell.wv.most_similar(word, topn=30):
        words.append(similar_word)
        embeddings.append(modell.wv[similar_word])
    embedding_clusters.append(embeddings)
    word_clusters.append(words)
# Note: newer scikit-learn releases rename n_iter to max_iter
tsne_model_en_2d = TSNE(perplexity=15, n_components=2, init='pca', n_iter=3500, random_state=32)
embedding_clusters = np.array(embedding_clusters)
n, m, k = embedding_clusters.shape  # n key words, m neighbours each, k dimensions
# Flatten to (n * m, k) for t-SNE, then restore the per-key grouping as (n, m, 2)
embeddings_en_2d = np.array(tsne_model_en_2d.fit_transform(embedding_clusters.reshape(n * m, k))).reshape(n, m, 2)

The novel is a different kind of literature than Shakespeare and Pushkin
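
The snippet above stops once the 2-D coordinates are computed. As a minimal sketch of how they could be drawn, the helper below plots one colored cluster per key word and labels every point. The function name `plot_similar_words` and the synthetic demo data are assumptions made for illustration; they are not taken from the post or the model card.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_similar_words(labels, embeddings_2d, word_clusters):
    """Scatter-plot one colored cluster per key word and annotate each point.

    labels:        key words, one per cluster (length n)
    embeddings_2d: array of shape (n, m, 2), e.g. embeddings_en_2d
    word_clusters: n lists of m neighbour words, aligned with embeddings_2d
    """
    fig, ax = plt.subplots(figsize=(16, 9))
    cmap = plt.get_cmap("tab10")  # 10 distinct colors, reused cyclically
    for i, (label, points, words) in enumerate(zip(labels, embeddings_2d, word_clusters)):
        ax.scatter(points[:, 0], points[:, 1], color=cmap(i % 10), alpha=0.7, label=label)
        for word, (x, y) in zip(words, points):
            ax.annotate(word, xy=(x, y), xytext=(5, 2),
                        textcoords="offset points", fontsize=8, alpha=0.7)
    ax.legend(loc="best")
    ax.grid(True)
    return fig


# Hypothetical stand-in data so the sketch runs without the model file:
# 2 clusters of 3 points each, in place of the real t-SNE output.
rng = np.random.default_rng(0)
demo_keys = ["A", "B"]
demo_points = rng.normal(size=(2, 3, 2))
demo_words = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
demo_fig = plot_similar_words(demo_keys, demo_points, demo_words)
```

With the real data, the call would presumably be `plot_similar_words(keys, embeddings_en_2d, word_clusters)`, followed by `plt.show()` or `fig.savefig(...)` on the returned figure.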