# Preparing Dataset

We will download the initial dataset from Kaggle. To do this we first need to install the Kaggle client with `pip install kaggle`.

In [None]:
!pip install kaggle

In [None]:
import kaggle

The first time we run `import kaggle` we will see an error telling us where to place a Kaggle API key. We can get an API key by signing up for a Kaggle account and clicking on our **profile in the top-right > Account > API > Create API token**. This downloads a *kaggle.json* file, which we must place in the directory named in the error message.
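As a rough sketch of that last step, the snippet below copies the downloaded key into place and restricts its permissions. Both paths are assumptions: the download folder is a guess, and `~/.kaggle` (or the directory in the `KAGGLE_CONFIG_DIR` environment variable) is the commonly used location, so prefer whatever directory the import error actually names. `install_kaggle_key` is a hypothetical helper, not part of the Kaggle client.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_key(downloaded: Path, config_dir: Path) -> Path:
    """Copy a downloaded kaggle.json into config_dir and restrict its permissions."""
    config_dir.mkdir(parents=True, exist_ok=True)
    target = config_dir / "kaggle.json"
    shutil.copy(downloaded, target)
    # owner read/write only; the client warns when the key is readable by others
    target.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return target


# Assumed locations: adjust both to match your download folder and the
# directory named in the import error.
downloaded_key = Path.home() / "Downloads" / "kaggle.json"
config_dir = Path(os.environ.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle"))

if downloaded_key.exists():
    print(install_kaggle_key(downloaded_key, config_dir))
```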

*(The dataset is ~10GB in size, so feel free to skip this step and download the modified dataset directly from Hugging Face Datasets.)*

Once we have our *kaggle.json* in the correct directory we download the YTTTS speech collection dataset like so:

In [None]:
!kaggle datasets download ryanrudes/yttts-speech

We can unzip the dataset files:

In [None]:
!unzip yttts-speech.zip

To build the full dataset we need to work through a few additional steps and install a few more libraries.

In [None]:
!pip install bs4
!pip install tqdm
!pip install datasets

In [None]:
import os
import time
import requests
from tqdm.auto import tqdm
from bs4 import BeautifulSoup

The dataset is organized as a `data` directory containing one folder per video, each named after its video ID.

In [None]:
video_ids = os.listdir("data")
video_ids[:5]

['ZPewmEu7644', 'g4M7stjzR1I', 'P0yVuoATjzs', 'EkzZSaeIikI', 'pWAc9B2zJS4']

Inside each of these we find many more directories, each named after a timestamp range taken from the video.

In [None]:
splits = sorted(os.listdir(f"data/{video_ids[0]}"))
splits[:5]

['00:00:00,030-00:00:02,040',
 '00:00:02,040-00:00:03,570',
 '00:00:03,570-00:00:05,670',
 '00:00:05,670-00:00:07,230',
 '00:00:07,230-00:00:09,120']
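Each directory name has the form `start-end`, with timestamps written as `HH:MM:SS,mmm`. As a small sketch of how such a name could be turned into numeric values, the hypothetical helpers `to_seconds` and `parse_split` below keep the millisecond part (the loop later in this notebook discards it and works in whole seconds).

```python
from typing import Tuple


def to_seconds(timestamp: str) -> float:
    """Convert 'HH:MM:SS,mmm' into seconds as a float."""
    clock, millis = timestamp.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    # work in whole milliseconds first, then divide once
    total_ms = ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)
    return total_ms / 1000


def parse_split(name: str) -> Tuple[float, float]:
    """Split a 'start-end' directory name into (start_seconds, end_seconds)."""
    start, end = name.split("-")
    return to_seconds(start), to_seconds(end)


print(parse_split("00:00:00,030-00:00:02,040"))  # (0.03, 2.04)
```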

Inside each timestamp directory is a *subtitles.txt* file containing the transcribed text for that range.

In [None]:
with open(f"data/{video_ids[0]}/{splits[0]}/subtitles.txt") as f:
    text = f.read()
print(text)

hi this is Jeff Dean welcome to


We will loop through all of these files to give us the initial core dataset consisting of *video_id*, *text*, *start_second*, *end_second*, and *url*.

In [None]:
documents = []
for video_id in tqdm(video_ids):
    splits = sorted(os.listdir(f"data/{video_id}"))
    # we start at 00:00:00
    start_timestamp = "00:00:00"
    passage = ""
    for i, s in enumerate(splits):
        with open(f"data/{video_id}/{s}/subtitles.txt") as f:
            # append text to current chunk
            out = f.read()
            passage += " " + out
        # average sentence length is 75-100 characters so we will cut off
        # around 3-4 sentences
        if len(passage) > 360:
            # now we've hit the needed length create a record
            # extract the end timestamp from the directory name
            end_timestamp = s.split("-")[1].split(",")[0]
            # parse the string timestamps into time.struct_time objects
            start = time.strptime(start_timestamp, "%H:%M:%S")
            end = time.strptime(end_timestamp, "%H:%M:%S")
            # now we extract the second/minute/hour values and convert
            # to total number of seconds
            start_second = start.tm_sec + start.tm_min*60 + start.tm_hour*3600
            end_second = end.tm_sec + end.tm_min*60 + end.tm_hour*3600
            # save this to the documents list
            documents.append({
                "video_id": video_id,
                "text": passage,
                "start_second": start_second,
                "end_second": end_second,
                "url": f"https://www.youtube.com/watch?v={video_id}&t={start_second}s",
            })
            # now we update the start_timestamp for the next chunk
            start_timestamp = end_timestamp
            # refresh passage
            passage = ""

100%|██████████| 127/127 [00:19<00:00, 6.60it/s]
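Note that the loop only saves a passage once it exceeds 360 characters, so any shorter remainder at the end of a video is never appended. If that tail matters for your use case, the sketch below shows one way to flush it. `chunk_segments` and `min_chars` are hypothetical names, and the function takes `(split_name, text)` pairs rather than reading files so it can be tried without the dataset.

```python
def chunk_segments(segments, min_chars=360):
    """Group (split_name, text) pairs into passages longer than min_chars.

    Mirrors the chunking loop, but also emits the final, shorter passage.
    """
    chunks = []
    start_timestamp = "00:00:00"
    end_timestamp = start_timestamp
    passage = ""
    for name, text in segments:
        passage += " " + text
        # end time of this segment, with milliseconds dropped
        end_timestamp = name.split("-")[1].split(",")[0]
        if len(passage) > min_chars:
            chunks.append((start_timestamp, end_timestamp, passage))
            start_timestamp = end_timestamp
            passage = ""
    if passage.strip():
        # flush whatever text is left once the segments run out
        chunks.append((start_timestamp, end_timestamp, passage))
    return chunks


segments = [
    ("00:00:00,030-00:00:02,040", "a" * 10),
    ("00:00:02,040-00:00:03,570", "b" * 10),
    ("00:00:03,570-00:00:05,670", "ccc"),
]
for chunk in chunk_segments(segments, min_chars=20):
    print(chunk)
```

With `min_chars=20` the first two segments form one passage and the trailing `"ccc"` is kept as a second, shorter one instead of being dropped.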


In [None]:
documents[:3]

[{'video_id': 'ZPewmEu7644',
 'text': " hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan's have a wide array of uses beyond just the face generation that you",
 'start_second': 0,
 'end_second': 20,
 'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s'},
 {'video_id': 'ZPewmEu7644',
 'text': ' often see them use for they can definitely generate other types of images but they can also work on tabular data and really any sort of data where you are attempting to have a neural network that is generating data that should be real or should or could be classified as fake the key element to having something as again is having that discriminator that tells the difference',
 'start_second': 20,
 'end_second': 41,
 'url': 'https://www.

We also need additional video metadata that cannot be pulled from our dataset, like the video *title* and *thumbnail*. For both of these we can scrape the data using Beautiful Soup.

In [None]:
import lxml  # if on mac, pip/conda install lxml

metadata = {}
for _id in tqdm(video_ids):
    r = requests.get(f"https://www.youtube.com/watch?v={_id}")
    soup = BeautifulSoup(r.content, 'lxml')  # lxml package is used here
    try:
        title = soup.find("meta", property="og:title").get("content")
        thumbnail = soup.find("meta", property="og:image").get("content")
        metadata[_id] = {"title": title, "thumbnail": thumbnail}
    except Exception as e:
        print(e)
        print(_id)
        metadata[_id] = {"title": "", "thumbnail": ""}

len(metadata)

 51%|█████ | 65/127 [02:56<02:01, 1.96s/it]

'NoneType' object has no attribute 'get'
fpDaQxG5w4o


 52%|█████▏ | 66/127 [03:00<02:42, 2.67s/it]

'NoneType' object has no attribute 'get'
arbbhHyRP90


100%|██████████| 127/127 [05:21<00:00, 2.54s/it]


127
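The two `'NoneType' object has no attribute 'get'` messages are what we would expect when a page has no matching `<meta>` tag: `soup.find` returns `None` in that case, and calling `.get` on it raises. To illustrate the same "look up the Open Graph tag, fall back to an empty string" logic without a network call, here is a minimal sketch that swaps Beautiful Soup for the standard library's `html.parser`. `OpenGraphParser` and `extract_og` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags into a dict."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        prop = attrs.get("property") or ""
        if prop.startswith("og:") and attrs.get("content") is not None:
            self.tags[prop] = attrs["content"]


def extract_og(html, fields=("og:title", "og:image")):
    """Return the requested Open Graph fields, with "" for any that are missing."""
    parser = OpenGraphParser()
    parser.feed(html)
    return {field: parser.tags.get(field, "") for field in fields}


sample = '<html><head><meta property="og:title" content="Example video"></head></html>'
print(extract_og(sample))  # {'og:title': 'Example video', 'og:image': ''}
```

Here the missing `og:image` tag produces an empty string rather than an exception, which matches the fallback values stored by the scraping loop.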

In [None]:
documents[0]

{'video_id': 'ZPewmEu7644',
 'text': " hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan's have a wide array of uses beyond just the face generation that you",
 'start_second': 0,
 'end_second': 20,
 'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s'}

In [None]:
metadata['ZPewmEu7644']

{'title': 'GANS for Semi-Supervised Learning in Keras (7.4)',
 'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg'}

In [None]:
for i, doc in enumerate(documents):
    _id = doc['video_id']
    meta = metadata[_id]
    # add metadata to existing doc
    documents[i] = {**doc, **meta}

In [None]:
documents[0]

{'video_id': 'ZPewmEu7644',
 'text': " hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we're going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan's have a wide array of uses beyond just the face generation that you",
 'start_second': 0,
 'end_second': 20,
 'url': 'https://www.youtube.com/watch?v=ZPewmEu7644&t=0s',
 'title': 'GANS for Semi-Supervised Learning in Keras (7.4)',
 'thumbnail': 'https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg'}
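The merge relies on dictionary unpacking: `{**doc, **meta}` builds a new dictionary from both inputs, and when a key appears in both, the value from the right-hand dictionary wins. A toy example with made-up values:

```python
doc = {"video_id": "abc", "text": "hello", "title": "old"}
meta = {"title": "new", "thumbnail": "thumb.jpg"}

# keys from meta are added; the shared "title" key takes meta's value
merged = {**doc, **meta}
print(merged)
# {'video_id': 'abc', 'text': 'hello', 'title': 'new', 'thumbnail': 'thumb.jpg'}

# the original dictionaries are left untouched
print(doc["title"])  # old
```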

In [None]:
import json

with open("train.jsonl", "w") as f:
    for doc in documents:
        json.dump(doc, f)
        f.write('\n')

In [None]:
with open("train.jsonl") as f:
    d = f.readlines()
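`readlines` gives us the raw JSON strings. To get dictionaries back, each line can be parsed with `json.loads`. The sketch below round-trips a couple of made-up records through a temporary file (so it does not touch *train.jsonl*); the record contents and file name are purely illustrative.

```python
import json
import os
import tempfile

records = [
    {"video_id": "abc", "start_second": 0},
    {"video_id": "abc", "start_second": 20},
]

# write to a throwaway file so the example does not touch train.jsonl
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl")
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# read it back: one JSON object per non-empty line
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)  # True
```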

In [None]:
d[:3]

['{"video_id": "ZPewmEu7644", "text": " hi this is Jeff Dean welcome to applications of deep neural networks of Washington University in this video we\'re going to look at how we can use ganz to generate additional training data for the latest on my a I course and projects click subscribe in the bell next to it to be notified of every new video Dan\'s have a wide array of uses beyond just the face generation that you", "start_second": 0, "end_second": 20, "url": "https://www.youtube.com/watch?v=ZPewmEu7644&t=0s", "title": "GANS for Semi-Supervised Learning in Keras (7.4)", "thumbnail": "https://i.ytimg.com/vi/ZPewmEu7644/maxresdefault.jpg"}\n',
 '{"video_id": "ZPewmEu7644", "text": " often see them use for they can definitely generate other types of images but they can also work on tabular data and really any sort of data where you are attempting to have a neural network that is generating data that should be real or should or could be classified as fake the key element to having somet