{ "cells": [ { "cell_type": "markdown", "id": "iiNbGRn-KitL", "metadata": { "id": "iiNbGRn-KitL" }, "source": [ "# Question generation\n", "This notebook is inspired by the [Question Generation tutorial](https://haystack.deepset.ai/tutorials/question-generation) from the Haystack documentation.\n", "\n", "Here we use a collection of articles about Twin Peaks to generate a variety of questions about that awesome TV series!\n", "\n", "The following steps are performed:\n", "* load data\n", "* create document store and write documents\n", "* generate questions and save them" ] }, { "cell_type": "markdown", "id": "viixGIJcKPSQ", "metadata": { "id": "viixGIJcKPSQ" }, "source": [ "## Preliminary operations" ] }, { "cell_type": "code", "execution_count": 1, "id": "MevE4jEZ5QBT", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "MevE4jEZ5QBT", "outputId": "136106e4-40c9-4443-ee84-784fb922e188" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mounted at /content/drive\n" ] } ], "source": [ "from google.colab import drive\n", "drive.mount('/content/drive')" ] }, { "cell_type": "code", "execution_count": null, "id": "VYWRJ-Lf55nV", "metadata": { "id": "VYWRJ-Lf55nV" }, "outputs": [], "source": [ "# install dependencies\n", "! 
pip install farm-haystack[faiss-gpu]==1.4.0" ] }, { "cell_type": "markdown", "id": "QVDuHAMIK4bg", "metadata": { "id": "QVDuHAMIK4bg" }, "source": [ "## Load data" ] }, { "cell_type": "code", "execution_count": 4, "id": "72139774", "metadata": { "execution": { "iopub.execute_input": "2022-01-09T08:40:46.176031Z", "iopub.status.busy": "2022-01-09T08:40:46.175755Z", "iopub.status.idle": "2022-01-09T08:40:46.179554Z", "shell.execute_reply": "2022-01-09T08:40:46.178704Z", "shell.execute_reply.started": "2022-01-09T08:40:46.175959Z" }, "id": "72139774" }, "outputs": [], "source": [ "import glob\n", "import json\n", "import os" ] }, { "cell_type": "code", "execution_count": 7, "id": "4421e328", "metadata": { "execution": { "iopub.execute_input": "2022-01-09T08:40:47.846999Z", "iopub.status.busy": "2022-01-09T08:40:47.846757Z", "iopub.status.idle": "2022-01-09T08:40:48.327632Z", "shell.execute_reply": "2022-01-09T08:40:48.326829Z", "shell.execute_reply.started": "2022-01-09T08:40:47.846975Z" }, "id": "4421e328" }, "outputs": [], "source": [ "DATA_DIRECTORY = '/content/drive/MyDrive/Colab Notebooks/wklp/data'\n", "\n", "docs = []\n", "\n", "for json_file in glob.glob(f'{DATA_DIRECTORY}/*.json'):\n", "    # select only the largest documents\n", "    if os.path.getsize(json_file) >= 5000:\n", "        with open(json_file, 'r') as fin:\n", "            json_content = json.load(fin)\n", "\n", "        doc = {'content': json_content['text'],\n", "               'meta': {'name': json_content['name'],\n", "                        'url': json_content['url']}}\n", "        docs.append(doc)" ] }, { "cell_type": "code", "execution_count": 8, "id": "GR6qWQAn72WG", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GR6qWQAn72WG", "outputId": "1198a602-7f4e-444a-f8f4-05b488663799" }, "outputs": [ { "data": { "text/plain": [ "134" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(docs)" ] }, { "cell_type": "code", "execution_count": 9, "id": "aa231b94", "metadata": { "colab": { "base_uri": 
"https://localhost:8080/" }, "execution": { "iopub.execute_input": "2022-01-09T08:40:48.796741Z", "iopub.status.busy": "2022-01-09T08:40:48.796550Z", "iopub.status.idle": "2022-01-09T08:40:48.805224Z", "shell.execute_reply": "2022-01-09T08:40:48.804705Z", "shell.execute_reply.started": "2022-01-09T08:40:48.796722Z" }, "id": "aa231b94", "outputId": "3d88f0a8-635d-419c-8660-f6c77803e369" }, "outputs": [ { "data": { "text/plain": [ "{'content': 'Part 5\\nNot to be confused with Episode 5.\\n\"Part 5\" is the fifth episode of the 2017 series of Twin Peaks and the thirty-fifth episode of the franchise as a whole. It aired on June 4, 2017.\\nPlot\\n\"Case files.\"\\n ―Dale Cooper\\nGene and Jake sit in a car, the former on the phone with Lorraine, reporting on the situation with Dougie Jones. Frustrated, she sends the message \"2\" (leaving 159 characters to type) to her contact \"ARGENT\" which causes a device in Buenos Aires to ring and flash twice with its two red lights.\\nConstance Talbot, Detective Macklay, and Detective Harrison observe the John Doe in the morgue. Talbot confirms the decapitation as the man\\'s cause of death and presents a ring found inside the body. On it is an inscription that reads, \"To Dougie, with love, Janey-E.\"\\nCooper\\'s doppelganger sits in his jail cell and correctly predicts that his food is coming. He takes his food and goes to the mirror, noting that BOB is still with him.\\nAt his place of employment, Mike Nelson calls in Steven Burnett, who has applied for a job. Mike tells him that his resume is inadequate and his forms were filled out incorrectly, then kicks him out.\\nSheriff Frank Truman talks to Harry on the phone and is informed by Lucy Brennan that his wife, Doris, is coming to him. Doris tells him about her frustrations, including a leaky pipe.\\nJaney-E, Sonny Jim and Cooper leave the Jones home and Janey-E tells Cooper that he had won $425,000. He looks at Sonny Jim and begins to shed tears. 
On realization that Dougie\\'s car is not there, Janey-E begrudgingly takes Cooper to Dougie\\'s work.\\nGene and Jake check on Dougie\\'s car again, which still has not moved. A group of delinquent youths also drive by the car.\\nJaney-E drops Cooper off at work and he wanders, following the aim of a statue of a man carrying a revolver. He stands around until Dougie\\'s co-worker, Phil Bisby comes carrying coffee on his way to a board meeting. Following him into the elevator, Cooper takes one of the coffees and begins drinking the \"damn good Joe.\" It was Frank\\'s one who then takes a green tea latte instead while Darren is turned down by Rhonda and Bonnici next to Frank is served the eighth cup carried by Phil.\\nAnthony Sinclair tells Cooper that he has covered for Dougie\\'s absence and during the meeting, as Sinclair presents a report. When saying that there was no arson with Littlefield, Cooper blurts out \"He\\'s lying,\" but does not elaborate, causing the boss Bushnell Mullins to have \"Dougie\" meet with him after the meeting. Mullins questions his accusation and gives him case files to assess by the next day.\\nRodney and Bradley Mitchum come to the Silver Mustang Casino and in front of Candie, Mandie and Sandie punish Burns for Cooper\\'s win at the casino and replace Burns with Warrick, who they tell to inform them if Cooper ever returns to the casino.\\nWhile his mother is passed out on drugs, the little boy living in the home across from Dougie\\'s car goes to examine it. He is shooed away by the gang of youths, arriving in a loud black 1970 Dodge Charger, who try to steal the car. The bomb under Dougie\\'s car explodes, killing several members of the gang, and the boy runs back to his home. Hearing him coming back in, his mother slowly wakes up and stares at the door.\\nAn auto detailer informs Jade that he found a set of keys for the Great Northern Hotel in her car. 
Since they have an address on them, she puts them inside a mailbox for delivery.\\nNorma sorts through documents as Heidi is serving and Becky delivers bread to Toad and gets money from Shelly. Norma goes to Shelly, urging her to help Becky rather than continue to enable her. Becky takes the money to Steven and they snort a drug.\\nCooper is pushed out of the elevator at the end of the workday and he goes to the statue he saw that morning.\\nAt the Twin Peaks sheriff\\'s station, Hawk and Andy continue to sort through files.\\nJacoby starts up his webcast he hosts as \"Dr. Amp\" and it is viewed by Jerry Horne—who smokes a joint—and Nadine Hurley. His broadcast ends with an advertisement for his golden shovels that he urges his viewers to buy to shovel themselves \"out of the shit and into the truth.\"\\nAt the Pentagon, Lieutenant Cynthia Knox informs Colonel Davis that they have received a match on Major Garland Briggs\\' fingerprints – the sixteenth match in 25 years – in Buckhorn, South Dakota. Davis doubts the legitimacy of the match but says that if it is indeed truly Briggs that has been identified, that the FBI must be informed.\\nThe band Trouble plays at the Roadhouse as Richard Horne smokes underneath a \\'no smoking\\' sign. Employee Federico asks him to quit and the off-duty Deputy Chad Broxford takes over but ends up taking a bribe from Horne. Charlotte, from the next table over with Elizabeth, asks him for a light, but he grabs her and threatens to rape her.\\nAgent Preston examines Cooper\\'s file and compares his fingerprints from before his 1989 disappearance and from the doppelganger\\'s booking at the federal prison.\\nWarden Murphy gives the doppelganger his phone call. 
However, the doppelganger dials a number that sets off the prison\\'s alarms and he says \"The cow jumped over the moon,\" before hanging up, stopping the alarms.\\nIn Buenos Aires, the device contacted before by Lorraine rings and flashes twice with its two red lights and then shrinks to a kind of seed.\\nCooper continues to observe the statue.\\nCredits\\n\\nStarring\\nKyle MacLachlan as Dale Cooper / Dale Cooper (doppelganger)\\nIn Alphabetical Order\\nJane Adams as Constance Talbot\\nMädchen Amick as Shelly\\nTammie Baird as Lorraine\\nChrysta Bell as FBI Agent Tammy Preston\\nJim Belushi as Bradley Mitchum\\nSean Bolger as Detailer\\nBrent Briscoe as Detective Dave Macklay\\nWes Brown as Darren\\nJuan Carlos Cantu as Officer Reynaldo\\nVincent Castellanos as Federico\\nBailey Chase as Detective Don Harrison\\nCandy Clark as Doris Truman\\nGrace Victoria Cox as Charlotte\\nGiselle Damier as Sandie\\nDavid Dastmalchian as Pit Boss Warrick\\nJosh Fadem as Phil Bisby\\nEamon Farren as Richard Horne\\nRobert Forster as Sheriff Frank Truman\\nPierce Gagnon as Sonny Jim Jones\\nHailey Gates as Drugged-out Mother\\nBrett Gelman as Supervisor Burns\\nHarry Goaz as Deputy Andy Brennan\\nHank Harris as Prison Tech\\nAndrea Hays as Heidi\\nGary Hershberger as Mike Nelson\\nMichael Horse as Deputy Chief Tommy \"Hawk\" Hill\\nErnie Hudson as Colonel Davis\\nCaleb Landry Jones as Steven Burnett\\nDavid Patrick Kelly as Jerry Horne\\nRobert Knepper as Rodney Mitchum\\nAndrea Leal as Mandie\\nSheryl Lee as Laura Palmer\\nJane Levy as Elizabeth\\nPeggy Lipton as Norma Jennings\\nKarl Makinen as Inspector Randy Hollister\\nJames Morrison as Warden Dwight Murphy\\nDon Murray as Bushnell Mullins\\nJohn Pirruccello as Deputy Chad Broxford\\nAdele René as Lieutenant Cynthia Cox\\nKimmy Robertson as Lucy Brennan\\nWendy Robie as Nadine Hurley\\nMarv Rosand as Toad\\nElena Satine as Rhonda\\nAmanda Seyfried as Rebecca (Becky) Burnett\\nAmie Shiels as Candie\\nSawyer Shipman as Little 
Boy\\nFrank Silva as Bob\\nTom Sizemore as Anthony Sinclair\\nBob Stephenson as Frank\\nRuss Tamblyn as Dr. Lawrence Jacoby\\nBill Tangradi as Jake\\nGreg Vrotsos as Gene\\nNaomi Watts as Janey-E Jones\\nNafessa Williams as Jade\\nBlake Zingale as Punk Leader\\nTrouble:\\nRiley Lynch\\nSam Smith\\nAlex Zhang Hungtai\\nDean Hurley\\nUncredited\\nTyler Malik as stand-in\\nKenneth Welsh as Windom Earle (archive footage)\\nUnknown performer as Bonnici\\nUnknown performer as Woman in elevator\\nUnknown performer as Man across Mullins\\nUnknown performer as Woman at meeting\\nUnknown performer as Mullins\\' secretary\\nProduction staff\\nSee: Twin Peaks (2017) § Production staff\\nFeatured music\\n\"The Flame\"\\nWritten and performed by Johnny Jewel\\nCourtesy of Italians Do It Better\\n\"Frank 2000\"\\nWritten by Angelo Badalamenti and David Lynch\\nPerformed by Thought Gang\\n\"I Love How You Love Me\"\\nWritten by Barry Mann and Larry Kolber\\nPerformed by The Paris Sisters\\nPublished by Screen Gems-EMI Music Inc. (BMI)\\n\"I Am\"\\nWritten and performed by BluntedBeatz\\n\"Stars And Stripes Forever\"\\nWritten by John Philip Sousa\\nPerformed and arranged by the U.S. Army Band\\n\"Snake Eyes\"\\nWritten by Dean Hurley, Riley Lynch and Alex Zhang Hungtai\\nPerformed by Trouble\\n\"Habit\" and \"Tabloid\"\\nWritten and performed by Uniform\\nCourtesy of Sacred Bones Records\\n\"Windswept\"\\nWritten and performed by Johnny Jewel\\nCourtesy of Italians Do It Better\\nNotes\\nThis episode was dedicated to the memory of Marv Rosand.\\nAmy Shiels is credited as \"Amie\".\\nFrank, Dougie\\'s coworker who discovers he likes green tea lattes, is played by Bob Stephenson, who appeared in Episode 6 as the burger cook at the Double R Diner. 
This was Stephenson\\'s first acting gig.\\nUpon release, Twin Peaks: The Return earned some criticism for earning the \"Empty Cup Award,\" a satirical achievement for television series where actors handle coffee cups that are claimed to be full in the dialogue but are very clearly empty based on how they are handled by the performers. In the case of this episode, however, Kyle MacLachlan was given some praise for being the sole actor to handle his cup as though it were actually full, especially in an episode where a character (Phil Bisby) unrealistically balances two full trays of coffee while running around.\\nThe statue in front of Dougie\\'s workplace was not originally part of location and was brought by the production staff. It might be a statue of Donald Lynch, father of David Lynch, since according to the stand-in Tyler Malkin, Lynch talked to it saying \"Hi, Dad\".\\nThe numbers input by the doppelganger during his phone call are, using the standard DTMF tones pitched up 2 octaves for offscreen ones:\\n16 (pause) 1235789 (computer modem response) 3135378912315 (01189998819991197253 offscreen)\\nThis could be interpreted as two numbers dialing to get an outside line from the internal prison phone system, then a 7 digit local number calling a computer set up beforehand with a local number so it would be a free local call and finally a code that triggers a pre-planned, automated hack of the prison systems.',\n", " 'meta': {'name': 'Part_5', 'url': 'https://twinpeaks.fandom.com/wiki/Part_5'}}" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "docs[5]" ] }, { "cell_type": "markdown", "id": "Yu3bAUPoLrPI", "metadata": { "id": "Yu3bAUPoLrPI" }, "source": [ "## Create document store ([FAISS](https://github.com/facebookresearch/faiss)) and write documents\n", "\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "bfe846df", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "execution": { 
"iopub.execute_input": "2022-01-09T08:40:59.678181Z", "iopub.status.busy": "2022-01-09T08:40:59.678003Z", "iopub.status.idle": "2022-01-09T08:40:59.753228Z", "shell.execute_reply": "2022-01-09T08:40:59.752500Z", "shell.execute_reply.started": "2022-01-09T08:40:59.678161Z" }, "id": "bfe846df", "outputId": "be9c9ef8-bcc4-4c4a-e7a5-9003077a7ea3" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO - haystack.modeling.model.optimization - apex not found, won't use it. See https://nvidia.github.io/apex/\n", "ERROR - root - Failed to import 'magic' (from 'python-magic' and 'python-magic-bin' on Windows). FileTypeClassifier will not perform mimetype detection on extensionless files. Please make sure the necessary OS libraries are installed if you need this functionality.\n", "INFO - haystack.telemetry - Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. 
More information at https://haystack.deepset.ai/guides/telemetry\n" ] } ], "source": [ "from haystack.document_stores import FAISSDocumentStore\n", "\n", "# the document store settings are those compatible with Embedding Retriever\n", "document_store = FAISSDocumentStore(\n", " similarity=\"dot_product\",\n", " embedding_dim=768)" ] }, { "cell_type": "code", "execution_count": 11, "id": "191144b4", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 49, "referenced_widgets": [ "81c8d8eb80d64687bdcc31ab1e3f156e", "b18ba72ca8e94c508fde617b04b82273", "0cc8731e80994ab097236f295da512c7", "e1b08799bca14b7aa8aed017b5545923", "4ee263ae48834c5dbb138a5dbe2183bd", "ae4e66819ab04f68a867768d03bc4a04", "6bba9b7051f64993a477688eb8c6ed92", "96e28522b93b4c77a9d6da982d601465", "9eb3c190c28b47618572ecd9e3b80932", "887b5bcc8d1e4c38b40700d324424f33", "3739785a410b4eadb0c36881e41b88ed" ] }, "execution": { "iopub.execute_input": "2022-01-09T08:41:10.695292Z", "iopub.status.busy": "2022-01-09T08:41:10.695064Z", "iopub.status.idle": "2022-01-09T08:41:22.144864Z", "shell.execute_reply": "2022-01-09T08:41:22.144203Z", "shell.execute_reply.started": "2022-01-09T08:41:10.695271Z" }, "id": "191144b4", "outputId": "f564e88e-be20-4f20-b4ed-4c663bfd71ac" }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "81c8d8eb80d64687bdcc31ab1e3f156e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Writing Documents: 0%| | 0/134 [00:00