{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "cadena = \"El/DT gato/N come/V pescado/N de/P la/DT nevera/N y/C de/P la/DT lata/N y/C baila/V el/DT la/N la/N la/N ./Fp\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1) Build a dictionary that gives the frequency of each category. Print the result sorted alphabetically by category." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "cadenaS = cadena.split(' ')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C 2\n", "DT 4\n", "Fp 1\n", "N 7\n", "P 2\n", "V 2\n" ] } ], "source": [ "diccionario = {}\n", "\n", "# Count how often each category (the part after \"/\") occurs.\n", "for i in cadenaS:\n", "    separacion = i.split(\"/\")\n", "    try:\n", "        diccionario[separacion[1]] = diccionario[separacion[1]] + 1\n", "    except KeyError:\n", "        diccionario[separacion[1]] = 1\n", "\n", "\n", "# Task 1: print the categories in alphabetical order.\n", "\n", "for s in sorted(diccionario):\n", "    print(s,diccionario[s])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, build a dictionary with an entry for each word of \"cadena\", holding the word's frequency and a list of its morphosyntactic categories with their frequencies. Print the result sorted alphabetically."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "diccionario = {}\n", "\n", "for i in cadenaS:\n", "    separacion = i.split(\"/\")\n", "    separacion[0] = separacion[0].lower()\n", "\n", "    if separacion[0] not in diccionario:\n", "        diccionario[separacion[0]] = {}\n", "\n", "    if separacion[1] in diccionario[separacion[0]]:\n", "        diccionario[separacion[0]][separacion[1]] += 1\n", "    else:\n", "        diccionario[separacion[0]][separacion[1]] = 1\n", "\n", "\n", "for s in sorted(diccionario.keys()):\n", "    tmp = 0\n", "    salida = \"\"\n", "    for j in diccionario[s].keys():\n", "        tmp += diccionario[s][j]\n", "        salida += \" \"+j+\" \"+str(diccionario[s][j])\n", "\n", "\n", "    print(s,tmp,salida)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the frequency of every category bigram in the string, taking into account an initial symbol `` and a final symbol `` for the string.\n", "\n", "```\n", "('DT', 'N') 4\n", " ('N', 'V') 1\n", " ('N', 'C') 2\n", " ('N', 'Fp') 1\n", " ('N', 'N') 2\n", " ('C', 'V') 1\n", " ('V', 'N') 1\n", " ('V', 'DT') 1\n", " ('P', 'DT') 2\n", " ('Fp', '') 1\n", " ('', 'DT') 1\n", " ('C', 'P') 1\n", " ('N', 'P') 1\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('', 'DT') 1\n", "('DT', 'N') 4\n", "('N', 'V') 1\n", "('V', 'N') 1\n", "('N', 'P') 1\n", "('P', 'DT') 2\n", "('N', 'C') 2\n", "('C', 'P') 1\n", "('C', 'V') 1\n", "('V', 'DT') 1\n", "('N', 'N') 2\n", "('N', 'Fp') 1\n", "('Fp', '') 1\n" ] } ], "source": [ "diccionario = {}\n", "bigramas = []\n", "\n", "#cosa = [\"\"] + [ (cadenaS[0].split(\"/\")[1],cadenaS[i+1].split(\"/\")[1]) if (i+1) < len(cosa) else [] for i in range(len(cadenaS)) ] + [\"\"] \n", "# Sequence of category tags, padded with the start and end symbols.\n", "cosa = [\"\"] + [ i.split(\"/\")[1] for i in cadenaS ] + [\"\"] \n", "\n", "#\"El/DT perro/N come/V carne/N de/P la/DT carnicería/N y/C de/P la/DT nevera/N y/C canta/V el/DT la/N la/N la/N ./Fp\"\n", "#print(cosa)\n", "\n", "l = 
len(cosa)\n", "\n", "# Pair each tag with its successor to form the bigrams.\n", "for i in range(l):\n", "    if (i+1) < l:\n", "        bigramas += [(cosa[i],cosa[i+1])]\n", "    else:\n", "        break\n", "\n", "\n", "for i in bigramas:\n", "    if i not in diccionario:\n", "        diccionario[i] = 1\n", "    else:\n", "        diccionario[i] += 1\n", "\n", "for i in diccionario.keys():\n", "    print(i,diccionario[i])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we build a function that returns the lexical probabilities P(C|w) and the emission probabilities P(w|C) for a given word (w), over all of its categories (C) that appear in the dictionary built earlier. If the word is not in the dictionary, the function must report that the word is unknown.\n", "\n", "```\n", "For example, for the word w=\"la\", it should return:\n", " P( DT | la )= 0.400000\n", " P( N | la )= 0.600000\n", " P( la | DT )= 0.500000\n", " P( la | N )= 0.428571\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def lex(w,cased=True):\n", "    diccionario = {}\n", "\n", "    #iteracion = 0\n", "    for i in cadenaS:\n", "        separacion = i.split('/')\n", "        if cased == False:\n", "            w = w.lower()\n", "            separacion[0] = separacion[0].lower()\n", "\n", "        if separacion[1] not in diccionario:\n", "            diccionario[separacion[1]] = {\"cantidad\" : 1}\n", "            if separacion[0] not in diccionario[separacion[1]] and separacion[0] == w:\n", "                diccionario[separacion[1]][separacion[0]] = 1\n", "            elif separacion[0] == w:\n", "                diccionario[separacion[1]][separacion[0]] += 1\n", "        else:\n", "            diccionario[separacion[1]][\"cantidad\"] += 1\n", "            if w not in diccionario[separacion[1]] and separacion[0] == w:\n", "                diccionario[separacion[1]][separacion[0]] = 1\n", "            elif separacion[0] == w:\n", "                diccionario[separacion[1]][separacion[0]] += 1\n", "\n", "        if w not in diccionario and w == separacion[0]:\n", "            diccionario[w] = {\"cantidad\":1}\n", "            if separacion[1] not in diccionario[w]:\n", "                diccionario[w][separacion[1]] = 1\n", "            else:\n", "                
diccionario[w][separacion[1]] += 1\n", "        elif w == separacion[0]:\n", "            diccionario[w][\"cantidad\"] += 1\n", "            if separacion[1] not in diccionario[w]:\n", "                diccionario[w][separacion[1]] = 1\n", "            else:\n", "                diccionario[w][separacion[1]] += 1\n", "\n", "\n", "        #print(iteracion,diccionario)\n", "        #iteracion += 1\n", "\n", "    # Report unknown words, as the task statement requires.\n", "    if w not in diccionario:\n", "        print(\"Unknown word:\", w)\n", "        return\n", "\n", "    for i in diccionario.keys():\n", "        if w in diccionario[i]:\n", "            probWC = diccionario[i][w]/diccionario[i][\"cantidad\"]\n", "            print(\"P(\", w, \"|\", i, \") = \", probWC) # P( la | N )= 0.428571\n", "\n", "        if i == w:\n", "            for categoria in diccionario[w]:\n", "                probCW = diccionario[w][categoria] / diccionario[w][\"cantidad\"]\n", "                if categoria != \"cantidad\":\n", "                    print(\"P(\", categoria, \"|\", w, \") = \", probCW) #P( DT | la )= 0.400000" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "P( la | DT ) = 0.5\n", "P( la | N ) = 0.42857142857142855\n", "P( DT | la ) = 0.4\n", "P( N | la ) = 0.6\n" ] } ], "source": [ "lex(\"la\",True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "interpreter": { "hash": "6e5e0e4de587a08ae1fd499d48602c29fc81255ce67beabc6badfa0dc31fba78" }, "kernelspec": { "display_name": "Python 3.6.13 ('myenv')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.13" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }