retkowski committed
Commit cb71ef5 · 0 Parent(s)
This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +40 -0
  2. README.md +13 -0
  3. app.py +256 -0
  4. demo_data/lectures/Lecture-01-18.04.2023/English.vtt +2582 -0
  5. demo_data/lectures/Lecture-01-18.04.2023/video.mp4 +3 -0
  6. demo_data/lectures/Lecture-02-20.04.2023/English.vtt +2984 -0
  7. demo_data/lectures/Lecture-02-20.04.2023/video.mp4 +3 -0
  8. demo_data/lectures/Lecture-03-25.04.2023/English.vtt +3102 -0
  9. demo_data/lectures/Lecture-03-25.04.2023/video.mp4 +3 -0
  10. demo_data/lectures/Lecture-04-27.04.2023/English.vtt +2919 -0
  11. demo_data/lectures/Lecture-04-27.04.2023/video.mp4 +3 -0
  12. demo_data/lectures/Lecture-05-02.05.2023/English.vtt +1124 -0
  13. demo_data/lectures/Lecture-05-02.05.2023/video.mp4 +3 -0
  14. demo_data/lectures/Lecture-06-09.05.2023/English.vtt +2970 -0
  15. demo_data/lectures/Lecture-06-09.05.2023/video.mp4 +3 -0
  16. demo_data/lectures/Lecture-07-11.05.2023/English.vtt +2593 -0
  17. demo_data/lectures/Lecture-07-11.05.2023/video.mp4 +3 -0
  18. demo_data/lectures/Lecture-07-16.05.2023/English.vtt +0 -0
  19. demo_data/lectures/Lecture-07-16.05.2023/video.mp4 +3 -0
  20. demo_data/lectures/Lecture-09-25.05.2023/English.vtt +3031 -0
  21. demo_data/lectures/Lecture-09-25.05.2023/video.mp4 +3 -0
  22. demo_data/lectures/Lecture-10-13.06.2023/English.vtt +2450 -0
  23. demo_data/lectures/Lecture-10-13.06.2023/video.mp4 +3 -0
  24. demo_data/lectures/Lecture-11-15.06.2023/English.vtt +0 -0
  25. demo_data/lectures/Lecture-11-15.06.2023/video.mp4 +3 -0
  26. demo_data/lectures/Lecture-12-20.06.2023/English.vtt +0 -0
  27. demo_data/lectures/Lecture-12-20.06.2023/video.mp4 +3 -0
  28. demo_data/lectures/Lecture-13-04.07.2023/English.vtt +2696 -0
  29. demo_data/lectures/Lecture-13-04.07.2023/video.mp4 +3 -0
  30. demo_data/lectures/Lecture-14-27.06.2023/English.vtt +2747 -0
  31. demo_data/lectures/Lecture-14-27.06.2023/video.mp4 +3 -0
  32. demo_data/lectures/Lecture-15-11.07.2023/English.vtt +2279 -0
  33. demo_data/lectures/Lecture-15-11.07.2023/video.mp4 +3 -0
  34. demo_data/lectures/Lecture-18-18.07.2023/English.vtt +2732 -0
  35. demo_data/lectures/Lecture-18-18.07.2023/video.mp4 +3 -0
  36. demo_data/lectures/Lecture-19-21.07.2023/English.vtt +2853 -0
  37. demo_data/lectures/Lecture-19-21.07.2023/video.mp4 +3 -0
  38. demo_data/nips-2021/25957/metadata.json +3 -0
  39. demo_data/nips-2021/25957/transcript_whisper_large-v2.txt +179 -0
  40. demo_data/nips-2021/25957/transcript_whisper_large-v2.vtt +539 -0
  41. demo_data/nips-2021/25957/video.mp4 +3 -0
  42. demo_data/nips-2021/25958/metadata.json +3 -0
  43. demo_data/nips-2021/25958/transcript_whisper_large-v2.txt +124 -0
  44. demo_data/nips-2021/25958/transcript_whisper_large-v2.vtt +374 -0
  45. demo_data/nips-2021/25958/video.mp4 +3 -0
  46. demo_data/nips-2021/25959/metadata.json +3 -0
  47. demo_data/nips-2021/25959/transcript_whisper_large-v2.txt +117 -0
  48. demo_data/nips-2021/25959/transcript_whisper_large-v2.vtt +353 -0
  49. demo_data/nips-2021/25959/video.mp4 +3 -0
  50. demo_data/nips-2021/25963/metadata.json +3 -0
.gitattributes ADDED
@@ -0,0 +1,40 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ video.mp4 filter=lfs diff=lfs merge=lfs -text
+ *.psd filter=lfs diff=lfs merge=lfs -text
+ *.mp4 filter=lfs diff=lfs merge=lfs -text
+ demo_data/lectures/*/*.mp4 filter=lfs diff=lfs merge=lfs -text
+ demo_data/*/.mp4 filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,13 @@
+ ---
+ title: Chaptering Demo (YTSeg & MiniSeg)
+ emoji: ⚡
+ colorFrom: blue
+ colorTo: blue
+ sdk: streamlit
+ sdk_version: 1.32.2
+ app_file: app.py
+ pinned: false
+ license: other
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
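The YAML front matter above is the Hugging Face Spaces configuration: it tells the Hub to start app.py with the Streamlit SDK, version 1.32.2. A minimal sketch of fetching the Space and launching it locally is shown below; the repo id is a placeholder (the page does not state it), and a local run would additionally need the generate_text_api / model_inferences modules and the backend endpoints that app.py calls.

import subprocess

from huggingface_hub import snapshot_download

# Placeholder repo id -- substitute the actual "<user>/<space-name>" of this Space.
local_dir = snapshot_download(repo_id="<user>/<space-name>", repo_type="space")

# The front matter declares sdk: streamlit and app_file: app.py,
# so a local run mirrors what the Hub does:
subprocess.run(["streamlit", "run", "app.py"], cwd=local_dir, check=True)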
app.py ADDED
@@ -0,0 +1,256 @@
+ import itertools
+ import json
+ import re
+ from functools import partial
+ from pathlib import Path
+
+ import pandas as pd
+ import requests
+ import streamlit as st
+
+ from generate_text_api import SummarizerGenerator
+ from model_inferences.utils.files import get_captions_from_vtt, get_transcript
+
+ USE_PARAGRAPHING_MODEL = True
+
+ def get_sublist_by_flattened_index(A, i):
+     current_index = 0
+     for sublist in A:
+         sublist_length = len(sublist)
+         if current_index <= i < current_index + sublist_length:
+             return sublist, A.index(sublist)
+         current_index += sublist_length
+     return None, None
+
+ import requests
+
+
+ def get_talk_metadata(video_id):
+     url = "https://www.ted.com/graphql"
+
+     headers = {
+         "Content-Type": "application/json",
+         "Accept": "application/json",
+         "x-operation-name": "Transcript", # Replace with the actual operation name
+     }
+
+     data = {
+         "query": """
+         query GetTalk($videoId: ID!) {
+             video(id: $videoId) {
+                 title,
+                 presenterDisplayName,
+                 nativeDownloads {medium}
+             }
+         }
+         """,
+         "variables": {
+             "videoId": video_id, # Corrected key to "videoId"
+         },
+     }
+
+     response = requests.post(url, json=data, headers=headers)
+
+     if response.status_code == 200:
+         result = response.json()
+         return result
+     else:
+         print(f"Error: {response.status_code}, {response.text}")
+
+ class OfflineTextSegmenterClient:
+     def __init__(self, host_url):
+         self.host_url = host_url.rstrip("/") + "/segment"
+
+     def segment(self, text, captions=None, generate_titles=False, threshold=0.4):
+         payload = {
+             'text': text,
+             'captions': captions,
+             'generate_titles': generate_titles,
+             "prefix_titles": True,
+             "threshold": threshold,
+         }
+
+         headers = {
+             'Content-Type': 'application/json'
+         }
+
+         response = requests.post(self.host_url, data=json.dumps(payload), headers=headers).json()
+         #segments = response["annotated_segments"] if "annotated_segments" in response else response["segments"]
+         return {'segments':response["segments"], 'titles': response["titles"], 'sentences': response["sentences"]}
+
+ class Toc:
+
+     def __init__(self):
+         self._items = []
+         self._placeholder = None
+
+     def title(self, text):
+         self._markdown(text, "h1")
+
+     def header(self, text):
+         self._markdown(text, "h2", " " * 2)
+
+     def subheader(self, text):
+         self._markdown(text, "h3", " " * 4)
+
+     def placeholder(self, sidebar=False):
+         self._placeholder = st.sidebar.empty() if sidebar else st.empty()
+
+     def generate(self):
+         if self._placeholder:
+             self._placeholder.markdown("\n".join(self._items), unsafe_allow_html=True)
+
+     def _markdown(self, text, level, space=""):
+         key = re.sub(r'[^\w-]', '', text.replace(" ", "-").replace("'", "-").lower())
+         st.markdown(f"<{level} id='{key}'>{text}</{level}>", unsafe_allow_html=True)
+         self._items.append(f"{space}* <a href='#{key}'>{text}</a>")
+
+ endpoint = "http://hiaisc.isl.iar.kit.edu/summarize/summarize_stream"
+
+ client = OfflineTextSegmenterClient("http://hiaisc.isl.iar.kit.edu/chapterize")
+ if USE_PARAGRAPHING_MODEL:
+     paragrapher = OfflineTextSegmenterClient("http://hiaisc.isl.iar.kit.edu/paragraph")
+ summarizer = SummarizerGenerator(endpoint)
+
+ import re
+
+
+ def replace_newlines(text):
+     updated_text = re.sub(r'\n+', r'\n\n', text)
+     return updated_text
+
+ def generate_summary(summarizer, generated_text_box, input_, prefix=""):
+     all_generated_text = prefix
+     for generated_text in summarizer.generate_summary_stream(input_):
+         all_generated_text += replace_newlines(generated_text)
+         generated_text_box.info(all_generated_text)
+     print(all_generated_text)
+     return all_generated_text.strip()
+
+ st.header("Demo: Intelligent Recap")
+
+ if not hasattr(st, 'global_state'):
+     st.global_state = {'NIPS 2021 Talks': None, 'TED Talks': None}
+     # NIPS 2021 Talks
+     transcript_files = itertools.islice(Path("demo_data/nips-2021/").rglob("transcript_whisper_large-v2.vtt"), 15)
+     # get titles from metadata.json
+     transcripts_map = {}
+     for transcript_file in transcript_files:
+         base_path = transcript_file.parent
+         metadata = base_path / "metadata.json"
+         txt_file = base_path / "transcript_whisper_large-v2.txt"
+         with open(metadata) as f:
+             metadata = json.load(f)
+         title = metadata["title"]
+         transcript = get_transcript(txt_file)
+         captions = get_captions_from_vtt(transcript_file)
+         transcripts_map[title] = {"transcript": transcript, "captions": captions, "video": base_path / "video.mp4"}
+     st.global_state['NIPS 2021 Talks'] = transcripts_map
+
+     data = pd.read_json("demo_data/ted_talks.json")
+     video_ids = data.talk_id.tolist()
+     transcripts = data.text.apply(lambda x: " ".join(x)).tolist()
+     transcripts_map = {}
+     for video_id, transcript in zip(video_ids, transcripts):
+         metadata = get_talk_metadata(video_id)
+         title = metadata["data"]["video"]["title"]
+         presenter = metadata["data"]["video"]["presenterDisplayName"]
+         print(metadata["data"])
+         if metadata["data"]["video"]["nativeDownloads"] is None:
+             continue
+         video_url = metadata["data"]["video"]["nativeDownloads"]["medium"]
+         transcripts_map[title] = {"transcript": transcript, "video": video_url, "presenter": presenter}
+     st.global_state['TED Talks'] = transcripts_map
+
+     def get_lecture_id(path):
+         return int(path.parts[-2].split('-')[1])
+
+     transcript_files = Path("demo_data/lectures/").rglob("English.vtt")
+     sorted_path_list = sorted(transcript_files, key=get_lecture_id)
+
+     transcripts_map = {}
+     for transcript_file in sorted_path_list:
+         base_path = transcript_file.parent
+         lecture_id = base_path.parts[-1]
+         transcript = " ".join([c["text"].strip() for c in get_captions_from_vtt(transcript_file)]).replace("\n", " ")
+         video_path = Path(base_path, "video.mp4")
+         transcripts_map["Machine Translation: " + lecture_id] = {"transcript": transcript, "video": video_path}
+     st.global_state['KIT Lectures'] = transcripts_map
+
+ type_of_document = st.selectbox('What kind of document do you want to test it on?', list(st.global_state.keys()))
+
+ transcripts_map = st.global_state[type_of_document]
+
+ selected_talk = st.selectbox("Choose a document...", list(transcripts_map.keys()))
+
+ st.video(str(transcripts_map[selected_talk]['video']), format="video/mp4", start_time=0)
+
+ input_text = st.text_area("Transcript", value=transcripts_map[selected_talk]['transcript'], height=300)
+
+ toc = Toc()
+
+ summarization_todos = []
+
+ with st.expander("Adjust Thresholds"):
+     threshold = st.slider('Chapter Segmentation Threshold', 0.00, 1.00, value=0.4, step=0.05)
+     paragraphing_threshold = st.slider('Paragraphing Threshold', 0.00, 1.00, value=0.5, step=0.05)
+
+ if st.button("Process Transcript"):
+     with st.sidebar:
+         st.header("Table of Contents")
+         toc.placeholder()
+
+     st.header(selected_talk, divider='rainbow')
+     # if 'presenter' in transcripts_map[selected_talk]:
+     #     st.markdown(f"### *by **{transcripts_map[selected_talk]['presenter']}***")
+
+     captions = transcripts_map[selected_talk]['captions'] if 'captions' in transcripts_map[selected_talk] else None
+     result = client.segment(input_text, captions, generate_titles=True, threshold=threshold)
+     if USE_PARAGRAPHING_MODEL:
+         presult = paragrapher.segment(input_text, captions, generate_titles=False, threshold=paragraphing_threshold)
+         paragraphs = presult['segments']
+     segments, titles, sentences = result['segments'], result['titles'], result['sentences']
+
+     if USE_PARAGRAPHING_MODEL:
+         prev_chapter_idx = 0
+         prev_paragraph_idx = 0
+         segment = []
+         for i, sentence in enumerate(sentences):
+             chapter, chapter_idx = get_sublist_by_flattened_index(segments, i)
+             paragraph, paragraph_idx = get_sublist_by_flattened_index(paragraphs, i)
+
+             if (chapter_idx != prev_chapter_idx and paragraph_idx == prev_paragraph_idx) or (paragraph_idx != prev_paragraph_idx and chapter_idx != prev_chapter_idx):
+                 print("Chapter / Chapter & Paragraph")
+                 segment_text = " ".join(segment)
+                 toc.subheader(titles[prev_chapter_idx])
+                 if len(segment_text) > 450:
+                     generated_text_box = st.info("")
+                     summarization_todos.append(partial(generate_summary, summarizer, generated_text_box, segment_text))
+                 st.write(segment_text)
+                 segment = []
+             elif paragraph_idx != prev_paragraph_idx and chapter_idx == prev_chapter_idx:
+                 print("Paragraph")
+                 segment.append("\n\n")
+
+             segment.append(sentence)
+
+             prev_chapter_idx = chapter_idx
+             prev_paragraph_idx = paragraph_idx
+
+         segment_text = " ".join(segment)
+         toc.subheader(titles[prev_chapter_idx])
+         generated_text_box = st.info("")
+         summarization_todos.append(partial(generate_summary, summarizer, generated_text_box, segment_text))
+         st.write(segment_text)
+
+     else:
+         segments = [" ".join([sentence for sentence in segment]) for segment in segments]
+         for title, segment in zip(titles, segments):
+             toc.subheader(title)
+             generated_text_box = st.info("")
+             summarization_todos.append(partial(generate_summary, summarizer, generated_text_box, segment))
+             st.write(segment)
+     toc.generate()
+
+     for summarization_todo in summarization_todos:
+         summarization_todo()
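For reference, a small self-contained sketch of how the pieces of app.py fit together: client.segment() is expected to return parallel "segments" (chapters as lists of sentences), "titles", and "sentences", and get_sublist_by_flattened_index maps a flat sentence index back to its chapter. The response shape is inferred from the code above, not from separate API documentation.

# Toy data standing in for a segmentation response ("segments", "titles", "sentences").
sentences = ["s0", "s1", "s2", "s3", "s4"]
segments = [["s0", "s1"], ["s2", "s3", "s4"]]  # chapters as lists of sentences
titles = ["Introduction", "Evaluation"]

def get_sublist_by_flattened_index(A, i):
    # Same helper as in app.py: find which sublist the i-th flattened element belongs to.
    current_index = 0
    for sublist in A:
        if current_index <= i < current_index + len(sublist):
            return sublist, A.index(sublist)
        current_index += len(sublist)
    return None, None

for i, sentence in enumerate(sentences):
    _, chapter_idx = get_sublist_by_flattened_index(segments, i)
    print(f"{sentence} -> chapter {chapter_idx} ({titles[chapter_idx]})")
# s0 and s1 land in chapter 0 ("Introduction"); s2-s4 in chapter 1 ("Evaluation").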
demo_data/lectures/Lecture-01-18.04.2023/English.vtt ADDED
@@ -0,0 +1,2582 @@
1
+ WEBVTT
2
+
3
+ 0:00:00.000 --> 0:00:10.115
4
+ That easy to say this is a good translation
5
+ and this is a bad translation.
6
+
7
+ 0:00:10.115 --> 0:00:12.947
8
+ How can we evaluate?
9
+
10
+ 0:00:13.413 --> 0:00:26.083
11
+ We will put an emphasis on machine translation
12
+ because that is currently the state of the
13
+
14
+ 0:00:26.083 --> 0:00:26.787
15
+ art.
16
+
17
+ 0:00:28.028 --> 0:00:35.120
18
+ But we are now focused on the details of neural
19
+ networks where we are describing the basic
20
+
21
+ 0:00:35.120 --> 0:00:39.095
22
+ ideas and how to use the info machine translation.
23
+
24
+ 0:00:39.095 --> 0:00:41.979
25
+ This is not a neural network course.
26
+
27
+ 0:00:42.242 --> 0:00:49.574
28
+ If you have some background in Neo Networks,
29
+ that is of course of an advantage, but it should
30
+
31
+ 0:00:49.574 --> 0:00:51.134
32
+ not be a challenge.
33
+
34
+ 0:00:51.134 --> 0:00:58.076
35
+ If you have not done the details, we'll shortly
36
+ cover the background and the main ideas.
37
+
38
+ 0:00:58.076 --> 0:01:00.338
39
+ How can we use them for for?
40
+
41
+ 0:01:00.280 --> 0:01:06.880
42
+ Machine translation: We will starve the first
43
+ two, three lectures with some like more traditional
44
+
45
+ 0:01:06.880 --> 0:01:12.740
46
+ approaches how they work because they still
47
+ give some good intuition, some good ideas.
48
+
49
+ 0:01:12.872 --> 0:01:17.141
50
+ And they help us to understand where our systems
51
+ might be better.
52
+
53
+ 0:01:17.657 --> 0:01:22.942
54
+ And yeah, we have an innocence on really what
55
+ do we need to do to build a strong system.
56
+
57
+ 0:01:23.343 --> 0:01:35.534
58
+ And then we have a part on experience where
59
+ it's about how to build the systems and how
60
+
61
+ 0:01:35.534 --> 0:01:37.335
62
+ to apply it.
63
+
64
+ 0:01:39.799 --> 0:01:47.774
65
+ For additional reading materials, so we have
66
+ the slides on the website.
67
+
68
+ 0:01:47.774 --> 0:01:55.305
69
+ There is also links to papers which cover
70
+ the topic of the lecture.
71
+
72
+ 0:01:55.235 --> 0:01:58.436
73
+ If You'd Like to Study Additional Books.
74
+
75
+ 0:01:59.559 --> 0:02:07.158
76
+ Think the most relevant is this machine translation
77
+ from Philip Kurnan, which gives an introduction
78
+
79
+ 0:02:07.158 --> 0:02:09.210
80
+ about machine translation.
81
+
82
+ 0:02:09.210 --> 0:02:15.897
83
+ But this lecture is, of course, not a one
84
+ to one like we don't go through the book, but
85
+
86
+ 0:02:15.897 --> 0:02:17.873
87
+ it covers related topics.
88
+
89
+ 0:02:18.678 --> 0:02:25.094
90
+ Is a previous version of that statistical
91
+ machine translation focusing on that part,
92
+
93
+ 0:02:25.094 --> 0:02:28.717
94
+ and we cover some of that part rather than
95
+ all.
96
+
97
+ 0:02:28.717 --> 0:02:35.510
98
+ If you want to have more basics about natural
99
+ language processing, this might be helpful.
100
+
101
+ 0:02:39.099 --> 0:02:53.738
102
+ In addition, there is an online course on
103
+ machine translation which we also develop here
104
+
105
+ 0:02:53.738 --> 0:02:57.521
106
+ at which is available.
107
+
108
+ 0:02:57.377 --> 0:03:04.894
109
+ Input where you're, of course, free to use
110
+ that I might give you some other type of presentation
111
+
112
+ 0:03:04.894 --> 0:03:07.141
113
+ of the lecture important is.
114
+
115
+ 0:03:07.141 --> 0:03:14.193
116
+ It's, of course, a lot shorter and book doesn't
117
+ cover all the topics which you're covering
118
+
119
+ 0:03:14.193 --> 0:03:15.432
120
+ in the lecture.
121
+
122
+ 0:03:15.655 --> 0:03:19.407
123
+ So, of course, for the exam everything which
124
+ was in the lecture is important.
125
+
126
+ 0:03:19.679 --> 0:03:25.012
127
+ This covers like the first half where don't
128
+ know exactly the first X lectures.
129
+
130
+ 0:03:26.026 --> 0:03:28.554
131
+ Feel free to have a look at that.
132
+
133
+ 0:03:28.554 --> 0:03:29.596
134
+ It's shorter.
135
+
136
+ 0:03:29.596 --> 0:03:36.438
137
+ Maybe there's some of you interesting to have
138
+ very short videos or after the lecture single
139
+
140
+ 0:03:36.438 --> 0:03:39.934
141
+ this topic I didn't understand want to repeat.
142
+
143
+ 0:03:40.260 --> 0:03:50.504
144
+ Then this might be helpful, but it's important
145
+ that there is more content in the lecture.
146
+
147
+ 0:03:53.753 --> 0:04:02.859
148
+ The exam will be minutes and oral exam and
149
+ just make an appointment and then.
150
+
151
+ 0:04:05.305 --> 0:04:09.735
152
+ If you think this is a really cool topic,
153
+ want to hear more.
154
+
155
+ 0:04:09.735 --> 0:04:14.747
156
+ There's two similars, one on advanced topics
157
+ in machine translation.
158
+
159
+ 0:04:15.855 --> 0:04:24.347
160
+ Which is every Thursday and there is one which
161
+ was already on Monday.
162
+
163
+ 0:04:24.347 --> 0:04:34.295
164
+ But if you're interested in speech translation
165
+ to contact us and there, I think,.
166
+
167
+ 0:04:34.734 --> 0:04:47.066
168
+ Then there are other lectures, one more learning
169
+ by Professor Vival, and for us some of you
170
+
171
+ 0:04:47.066 --> 0:04:48.942
172
+ have already.
173
+
174
+ 0:04:48.888 --> 0:04:55.496
175
+ Lecture, which is related but of discovering
176
+ more general natural language processing than
177
+
178
+ 0:04:55.496 --> 0:04:57.530
179
+ will be again available in.
180
+
181
+ 0:04:57.597 --> 0:05:07.108
182
+ Winter semester, and then we are concentrating
183
+ on the task of machine translation and mighty.
184
+
185
+ 0:05:11.191 --> 0:05:14.630
186
+ Yeah, and also there's an automatic speech
187
+ emission problem.
188
+
189
+ 0:05:16.616 --> 0:05:27.150
190
+ And this is a bit what we are planning to
191
+ talk about in this semester.
192
+
193
+ 0:05:27.150 --> 0:05:30.859
194
+ Today we have a general.
195
+
196
+ 0:05:31.371 --> 0:05:37.362
197
+ Then on Thursday we are doing a bit of a different
198
+ lecture and that's about the linguistic.
199
+
200
+ 0:05:37.717 --> 0:05:42.475
201
+ It may be quite different from what you're
202
+ more computer scientist, what you've done there,
203
+
204
+ 0:05:42.475 --> 0:05:43.354
205
+ but don't worry.
206
+
207
+ 0:05:43.763 --> 0:05:49.051
208
+ We're coming in a very basic thing that I
209
+ think it's important if you're dealing with
210
+
211
+ 0:05:49.051 --> 0:05:53.663
212
+ natural language to have a bit of an understanding
213
+ of what language isn't.
214
+
215
+ 0:05:53.663 --> 0:05:59.320
216
+ Maybe I've learned about that in high school,
217
+ but also for you this I guess some years ago.
218
+
219
+ 0:05:59.619 --> 0:06:07.381
220
+ And so it's a bit of yeah, it better understand
221
+ also what other challenges there.
222
+
223
+ 0:06:07.307 --> 0:06:16.866
224
+ And especially since we are all dealing with
225
+ our mother time, it may be English, but there
226
+
227
+ 0:06:16.866 --> 0:06:25.270
228
+ is a lot of interesting phenomena which would
229
+ not occur in these two languages.
230
+
231
+ 0:06:25.625 --> 0:06:30.663
232
+ And therefore we'll also look a bit into what
233
+ are things which might happen in other languages.
234
+
235
+ 0:06:30.930 --> 0:06:35.907
236
+ If we want to build machine translation, of
237
+ course we want to build machine Translation
238
+
239
+ 0:06:35.907 --> 0:06:36.472
240
+ for many.
241
+
242
+ 0:06:38.178 --> 0:06:46.989
243
+ Then we will see a lot of these machine learning
244
+ based how to get the data and process the data
245
+
246
+ 0:06:46.989 --> 0:06:47.999
247
+ next week.
248
+
249
+ 0:06:48.208 --> 0:07:03.500
250
+ And then we'll have one lecture about statistical
251
+ machine translation, which was the approach
252
+
253
+ 0:07:03.500 --> 0:07:06.428
254
+ for twenty years.
255
+
256
+ 0:07:07.487 --> 0:07:17.308
257
+ And then maybe surprisingly very early we'll
258
+ talk about evaluation and this is because evaluation
259
+
260
+ 0:07:17.308 --> 0:07:24.424
261
+ is really essential for machine translation
262
+ and it's very challenging.
263
+
264
+ 0:07:24.804 --> 0:07:28.840
265
+ To decide if machine translation output is
266
+ good or bad is really challenging.
267
+
268
+ 0:07:29.349 --> 0:07:38.563
269
+ If you see another translation for a machine
270
+ to decide is not as difficult and even for
271
+
272
+ 0:07:38.563 --> 0:07:48.387
273
+ a machine translation output and ask them to
274
+ rate, you'll get three different answers: And
275
+
276
+ 0:07:48.387 --> 0:07:55.158
277
+ so it's worse to investigate it, and of course
278
+ it's also important to have that at the beginning
279
+
280
+ 0:07:55.158 --> 0:08:01.928
281
+ because if we're later talking about some techniques,
282
+ it will be always saying this technique is
283
+
284
+ 0:08:01.928 --> 0:08:03.813
285
+ better by x percent or so.
286
+
287
+ 0:08:04.284 --> 0:08:06.283
288
+ And we'll also have a practical good course
289
+ of this.
290
+
291
+ 0:08:06.746 --> 0:08:16.553
292
+ Then we're going to build language models
293
+ which are in point to translation models.
294
+
295
+ 0:08:16.736 --> 0:08:28.729
296
+ After the half you have a basic understanding
297
+ of what and basic machine translation.
298
+
299
+ 0:08:29.029 --> 0:08:39.065
300
+ And then on the second part of the lecture
301
+ we will cover more advanced topics.
302
+
303
+ 0:08:39.065 --> 0:08:42.369
304
+ What are the challenging?
305
+
306
+ 0:08:43.463 --> 0:08:48.035
307
+ One challenge is, of course, about additional
308
+ resources about data.
309
+
310
+ 0:08:48.208 --> 0:08:53.807
311
+ So the question is how can we get more data
312
+ or better data and their different ways of
313
+
314
+ 0:08:53.807 --> 0:08:54.258
315
+ doing?
316
+
317
+ 0:08:54.214 --> 0:09:00.230
318
+ Our thralling data will look into our building
319
+ systems which not translate between one language
320
+
321
+ 0:09:00.230 --> 0:09:06.122
322
+ but which translate between fifteen languages
323
+ and youth knowledge and share knowledge between
324
+
325
+ 0:09:06.122 --> 0:09:09.632
326
+ the language so that for each pair they need
327
+ less data.
328
+
329
+ 0:09:11.751 --> 0:09:19.194
330
+ And then we'll have something about efficiency.
331
+
332
+ 0:09:19.194 --> 0:09:27.722
333
+ That is, of course, with more and more complex
334
+ models.
335
+
336
+ 0:09:27.647 --> 0:09:33.053
337
+ Because then nobody can afford to do that,
338
+ so how can you build really efficient things?
339
+
340
+ 0:09:33.393 --> 0:09:38.513
341
+ Who also like energy is getting more expensive
342
+ so it's even more important to build systems.
343
+
344
+ 0:09:39.419 --> 0:09:43.447
345
+ We're Looking to Biases So.
346
+
347
+ 0:09:43.423 --> 0:09:50.364
348
+ That is a machine translation quite interesting
349
+ because some information are represented different
350
+
351
+ 0:09:50.364 --> 0:09:51.345
352
+ in languages.
353
+
354
+ 0:09:51.345 --> 0:09:55.552
355
+ So if you think about German, there is always
356
+ clear or not.
357
+
358
+ 0:09:55.552 --> 0:10:00.950
359
+ But in a lot of situations, it's clear if
360
+ you talk about to teach her about.
361
+
362
+ 0:10:01.321 --> 0:10:03.807
363
+ Another Person If It's Male or Female.
364
+
365
+ 0:10:04.204 --> 0:10:13.832
366
+ From English to German you don't have this
367
+ information, so how do you generate that and
368
+
369
+ 0:10:13.832 --> 0:10:15.364
370
+ what systems?
371
+
372
+ 0:10:15.515 --> 0:10:24.126
373
+ Will just assume things and we'll see that
374
+ exactly this is happening, so in order to address
375
+
376
+ 0:10:24.126 --> 0:10:27.459
377
+ these challenges and try to reduce.
378
+
379
+ 0:10:28.368 --> 0:10:35.186
380
+ The main adaptation is what I said that beginning
381
+ systems are good at the task they are trained.
382
+
383
+ 0:10:35.186 --> 0:10:37.928
384
+ But how can we adapt them to new task?
385
+
386
+ 0:10:38.959 --> 0:10:51.561
387
+ Document level is doing more context and we
388
+ have two lectures about speech translation,
389
+
390
+ 0:10:51.561 --> 0:10:56.859
391
+ so mostly before we are translating.
392
+
393
+ 0:10:57.117 --> 0:11:00.040
394
+ Are now translating audio things.
395
+
396
+ 0:11:00.040 --> 0:11:05.371
397
+ We have just additional challenges and these
398
+ we will address.
399
+
400
+ 0:11:10.450 --> 0:11:22.165
401
+ So to the motivation, why should you work
402
+ on the theme translation and why should you
403
+
404
+ 0:11:22.165 --> 0:11:23.799
405
+ put effort?
406
+
407
+ 0:11:24.224 --> 0:11:30.998
408
+ So we want or we are living in a more global
409
+ society.
410
+
411
+ 0:11:30.998 --> 0:11:37.522
412
+ You have now the chance to communicate with
413
+ people.
414
+
415
+ 0:11:37.897 --> 0:11:44.997
416
+ And the danger of course is that languages
417
+ are dying, and more and more languages are
418
+
419
+ 0:11:44.997 --> 0:11:45.988
420
+ going away.
421
+
422
+ 0:11:46.006 --> 0:11:53.669
423
+ I think at least that some opportunity in
424
+ order to keep more languages is that we have
425
+
426
+ 0:11:53.669 --> 0:12:01.509
427
+ technology solutions which help you to speak
428
+ in your language and still communicate with
429
+
430
+ 0:12:01.509 --> 0:12:04.592
431
+ people who speak another language.
432
+
433
+ 0:12:04.864 --> 0:12:16.776
434
+ And on the one hand there is the need and
435
+ more and more people want to speak in some
436
+
437
+ 0:12:16.776 --> 0:12:19.159
438
+ other languages.
439
+
440
+ 0:12:19.759 --> 0:12:27.980
441
+ For example, Iceland was really keen on getting
442
+ Icelandic into commercial systems and they
443
+
444
+ 0:12:27.980 --> 0:12:36.471
445
+ even provided data and so on because they wanted
446
+ that their language is spoken longer and not
447
+
448
+ 0:12:36.471 --> 0:12:38.548
449
+ just people switching.
450
+
451
+ 0:12:38.959 --> 0:12:47.177
452
+ So there's even like yeah, they were spending
453
+ for promoting this language in order to have
454
+
455
+ 0:12:47.177 --> 0:12:55.125
456
+ all these digital tools available for languages
457
+ which are not spoken by so many people.
458
+
459
+ 0:12:56.156 --> 0:13:07.409
460
+ So it's questionable and it's not completely
461
+ clear technology always provides.
462
+
463
+ 0:13:10.430 --> 0:13:25.622
464
+ If we think about machine translation, there
465
+ are different use cases in which you can use
466
+
467
+ 0:13:25.622 --> 0:13:26.635
468
+ that.
469
+
470
+ 0:13:27.207 --> 0:13:36.978
471
+ And this has some characteristics: So typically
472
+ in this case it is where machine translation
473
+
474
+ 0:13:36.978 --> 0:13:40.068
475
+ was used first anybody.
476
+
477
+ 0:13:40.780 --> 0:13:50.780
478
+ Because most youth outlets around the world
479
+ report at least some of the same events, like
480
+
481
+ 0:13:50.780 --> 0:13:58.669
482
+ was probably covered around the world in a
483
+ lot of different languages.
484
+
485
+ 0:13:59.279 --> 0:14:08.539
486
+ That is one point yes, so the training gator
487
+ is there.
488
+
489
+ 0:14:08.539 --> 0:14:16.284
490
+ That's definitely a good point here and then.
491
+
492
+ 0:14:17.717 --> 0:14:19.425
493
+ Yes, there was my regional idea.
494
+
495
+ 0:14:19.425 --> 0:14:23.256
496
+ The motivation program was a bit different
497
+ by you, but it's a good point.
498
+
499
+ 0:14:23.256 --> 0:14:26.517
500
+ So on the one end you'll understand maybe
501
+ not perfect English.
502
+
503
+ 0:14:26.517 --> 0:14:30.762
504
+ Also, it's for his personal use, so you're
505
+ using machine translation for you use.
506
+
507
+ 0:14:31.311 --> 0:14:37.367
508
+ It's not as important that this is really
509
+ perfect written text, but you're more interested
510
+
511
+ 0:14:37.367 --> 0:14:38.564
512
+ in understanding.
513
+
514
+ 0:14:38.858 --> 0:14:45.570
515
+ Maybe it's more clearer if you think about
516
+ the other situation where it's about dissimination
517
+
518
+ 0:14:45.570 --> 0:14:48.926
519
+ that means producing text in another language.
520
+
521
+ 0:14:48.926 --> 0:14:55.138
522
+ So just imagine you have a website or you
523
+ have a restaurant and you want to offer your
524
+
525
+ 0:14:55.138 --> 0:14:55.566
526
+ menu.
527
+
528
+ 0:14:56.476 --> 0:15:01.948
529
+ And in this case maybe you want to have a
530
+ higher quality because in some of your.
531
+
532
+ 0:15:01.901 --> 0:15:06.396
533
+ You're presenting something of yourself and
534
+ you want to have good quality.
535
+
536
+ 0:15:06.396 --> 0:15:11.490
537
+ Just remember you're writing a letter and
538
+ if you're translating your letter then you
539
+
540
+ 0:15:11.490 --> 0:15:17.123
541
+ don't want to have it full of mistakes because
542
+ it's somehow a bad, bad oppression but if it's
543
+
544
+ 0:15:17.123 --> 0:15:20.300
545
+ assimilation it's about you getting the information.
546
+
547
+ 0:15:20.660 --> 0:15:25.564
548
+ So here you want your disciplination, you're
549
+ producing texts for another language.
550
+
551
+ 0:15:26.006 --> 0:15:31.560
552
+ And then you have the disadvantage that you
553
+ maybe want to have a higher quality.
554
+
555
+ 0:15:31.831 --> 0:15:43.432
556
+ Therefore, typically there is less amount,
557
+ so normally you're getting more information
558
+
559
+ 0:15:43.432 --> 0:15:46.499
560
+ than you're producing.
561
+
562
+ 0:15:49.109 --> 0:15:57.817
563
+ Then of course there is a dynamic scenario
564
+ where there is some type of interaction and
565
+
566
+ 0:15:57.817 --> 0:16:07.099
567
+ the one thing which is interesting about the
568
+ dialogue scenario is there is: So if you're
569
+
570
+ 0:16:07.099 --> 0:16:18.045
571
+ translating a website you have all the data
572
+ available but in a dialogue scenario you.
573
+
574
+ 0:16:18.378 --> 0:16:23.655
575
+ And we'll see that in speech recognition this
576
+ is a big challenge.
577
+
578
+ 0:16:23.655 --> 0:16:30.930
579
+ Just to mention German where in German the
580
+ work is often more at the end, so each harmony.
581
+
582
+ 0:16:32.052 --> 0:16:36.343
583
+ Know that you want to generate the English
584
+ sentence.
585
+
586
+ 0:16:36.343 --> 0:16:42.740
587
+ Now you need to know if you cancel this registration
588
+ to produce a second word.
589
+
590
+ 0:16:42.740 --> 0:16:49.785
591
+ So you have to either guess or do something
592
+ in order to provide the translation before
593
+
594
+ 0:16:49.785 --> 0:16:52.052
595
+ the translation is already.
596
+
597
+ 0:16:57.817 --> 0:17:00.530
598
+ The question, of course, is in the new world.
599
+
600
+ 0:17:00.530 --> 0:17:05.659
601
+ I mean, of course, we can, on the one hand,
602
+ say we don't want to have English, but the
603
+
604
+ 0:17:05.659 --> 0:17:10.789
605
+ question is do we really need that many languages
606
+ and how many are here at the moment?
607
+
608
+ 0:17:11.291 --> 0:17:20.248
609
+ Does anybody have an idea how many languages
610
+ are spoken in the world?
611
+
612
+ 0:17:23.043 --> 0:17:26.510
613
+ This is already the first big challenge.
614
+
615
+ 0:17:26.510 --> 0:17:34.120
616
+ What a language is and what no language is
617
+ is already difficult, and then maybe one point
618
+
619
+ 0:17:34.120 --> 0:17:40.124
620
+ people have to argue first about written language
621
+ or spoken languages.
622
+
623
+ 0:17:40.400 --> 0:17:47.765
624
+ For written languages I think that number
625
+ is still too low, but for a spoken language
626
+
627
+ 0:17:47.765 --> 0:17:53.879
628
+ people normally think: So you see that it's
629
+ really a lot of languages which will be difficult
630
+
631
+ 0:17:53.879 --> 0:17:54.688
632
+ to all happen.
633
+
634
+ 0:17:55.035 --> 0:18:00.662
635
+ And these are just like you see Europe where
636
+ there's relatively few languages.
637
+
638
+ 0:18:00.662 --> 0:18:05.576
639
+ You already have quite a lot of languages,
640
+ even walls and countries.
641
+
642
+ 0:18:06.126 --> 0:18:13.706
643
+ Of course sometimes you share the language,
644
+ but then you have Briton or Gillesian vest
645
+
646
+ 0:18:13.706 --> 0:18:17.104
647
+ where you have languages in a country.
648
+
649
+ 0:18:18.478 --> 0:18:24.902
650
+ And yeah, of course, there's the question:
651
+ When does it start to be a language?
652
+
653
+ 0:18:24.902 --> 0:18:27.793
654
+ And when is it more like a dialect?
655
+
656
+ 0:18:27.793 --> 0:18:28.997
657
+ So is Catalan?
658
+
659
+ 0:18:28.997 --> 0:18:31.727
660
+ Is Swiss German a known language?
661
+
662
+ 0:18:31.727 --> 0:18:33.253
663
+ Or is it the same?
664
+
665
+ 0:18:33.293 --> 0:18:36.887
666
+ So then, of course, it's are like Czech and
667
+ Slovakian.
668
+
669
+ 0:18:36.887 --> 0:18:42.704
670
+ I know heard that people can understand each
671
+ other so they can just continue talking and
672
+
673
+ 0:18:42.704 --> 0:18:45.711
674
+ understand by some of their own language and.
675
+
676
+ 0:18:46.026 --> 0:18:56.498
677
+ Of course, it's partly also like about your
678
+ own nationality, so I think some people said
679
+
680
+ 0:18:56.498 --> 0:18:57.675
681
+ creation.
682
+
683
+ 0:18:58.018 --> 0:19:04.957
684
+ But think for a lot of people you shouldn't
685
+ say that they are part of being creation language.
686
+
687
+ 0:19:05.165 --> 0:19:10.876
688
+ But you see therefore that it is not completely
689
+ clear that there is no hardwater between this
690
+
691
+ 0:19:10.876 --> 0:19:13.974
692
+ and the new language, and this is a different
693
+ one.
694
+
695
+ 0:19:14.094 --> 0:19:19.403
696
+ And of course it's getting more fluent when
697
+ you talk about scientific things.
698
+
699
+ 0:19:19.403 --> 0:19:25.189
700
+ I guess sometimes it's no longer clear if
701
+ it's German or English because we start to
702
+
703
+ 0:19:25.189 --> 0:19:27.707
704
+ use a lot of English terms in there.
705
+
706
+ 0:19:27.707 --> 0:19:31.519
707
+ So of course there's interesting mixes which
708
+ will talk.
709
+
710
+ 0:19:33.193 --> 0:19:38.537
711
+ So should everybody just speak English, and
712
+ these numbers are a bit older, have to admit:
713
+
714
+ 0:19:38.938 --> 0:19:47.124
715
+ However, I don't think they're completely different
716
+ now and it says like how many people know in
717
+
718
+ 0:19:47.124 --> 0:19:54.718
719
+ Europe can speak English for countries where
720
+ English is not the mothertown or for people.
721
+
722
+ 0:19:54.995 --> 0:20:06.740
723
+ In some countries like smaller ones, for smaller
724
+ countries you have quite high numbers.
725
+
726
+ 0:20:07.087 --> 0:20:13.979
727
+ However, there are many countries where you
728
+ have like twenty to thirty percent of the population,
729
+
730
+ 0:20:13.979 --> 0:20:16.370
731
+ only being able to speak English.
732
+
733
+ 0:20:16.370 --> 0:20:22.559
734
+ So if we would only do everything only in
735
+ English, we would exclude half the population
736
+
737
+ 0:20:22.559 --> 0:20:23.333
738
+ of Europe.
739
+
740
+ 0:20:23.563 --> 0:20:30.475
741
+ And therefore providing translations is very
742
+ important and therefore, for example, the European
743
+
744
+ 0:20:30.475 --> 0:20:35.587
745
+ Parliament puts a really large amount of money
746
+ into doing translation.
747
+
748
+ 0:20:35.695 --> 0:20:40.621
749
+ So that's why you can speak in your mother
750
+ too in the European Parliament.
751
+
752
+ 0:20:40.621 --> 0:20:46.204
753
+ Everybody like everyone elected there can
754
+ speak in there and they were translated to
755
+
756
+ 0:20:46.204 --> 0:20:52.247
757
+ all the other languages and it's a huge effort
758
+ and so the question is can we do better with
759
+
760
+ 0:20:52.247 --> 0:20:52.838
761
+ machine.
762
+
763
+ 0:20:53.493 --> 0:20:58.362
764
+ And for other countries things are even more.
765
+
766
+ 0:20:58.362 --> 0:21:05.771
767
+ They may be not worse, difficult, but they
768
+ are even more challenging.
769
+
770
+ 0:21:06.946 --> 0:21:13.764
771
+ So there's even more diversity of languages
772
+ and it might be even more important to do machines.
773
+
774
+ 0:21:16.576 --> 0:21:31.034
775
+ If you see how many people speak French, Portuguese
776
+ or English, it's relatively few compared to
777
+
778
+ 0:21:31.034 --> 0:21:33.443
779
+ the population.
780
+
781
+ 0:21:33.813 --> 0:21:46.882
782
+ So think that this should be around millions
783
+ would understand you, but all the others wouldn't.
784
+
785
+ 0:21:49.289 --> 0:21:54.877
786
+ So it seems to be very important to provide
787
+ some taebo translation.
788
+
789
+ 0:21:54.877 --> 0:21:58.740
790
+ It's a quite big industry as a European Union.
791
+
792
+ 0:21:58.740 --> 0:22:05.643
793
+ This is already also quite long ago, but it
794
+ won't get less spent like in that year.
795
+
796
+ 0:22:05.643 --> 0:22:08.931
797
+ One point three billion on translation.
798
+
799
+ 0:22:09.289 --> 0:22:21.315
800
+ So it might be very helpful to have tools
801
+ in order to provide them, and as said, not
802
+
803
+ 0:22:21.315 --> 0:22:26.267
804
+ all directions might be important.
805
+
806
+ 0:22:26.426 --> 0:22:35.059
807
+ Is even not possible for students, so in the
808
+ European Parliament they don't have all combinations
809
+
810
+ 0:22:35.059 --> 0:22:36.644
811
+ of the different.
812
+
813
+ 0:22:36.977 --> 0:22:42.210
814
+ And language is so if they want to translate
815
+ from Maltese to Estonian or so.
816
+
817
+ 0:22:42.402 --> 0:22:47.361
818
+ And maybe they have a translator for that,
819
+ but there are some directions which don't have
820
+
821
+ 0:22:47.361 --> 0:22:47.692
822
+ that.
823
+
824
+ 0:22:47.692 --> 0:22:52.706
825
+ Then they handle directly, but they would
826
+ translate first to French, German or or English,
827
+
828
+ 0:22:52.706 --> 0:22:57.721
829
+ and then there would be a second translator
830
+ getting the translation and really translating
831
+
832
+ 0:22:57.721 --> 0:22:59.154
833
+ to your Italian language.
834
+
835
+ 0:22:59.299 --> 0:23:06.351
836
+ And it's not always English, so they are really
837
+ selecting what is most helpful.
838
+
839
+ 0:23:06.351 --> 0:23:13.931
840
+ But you see that even in this small setup,
841
+ with this large amount of effort in there,
842
+
843
+ 0:23:13.931 --> 0:23:17.545
844
+ there's not enough ability to translate.
845
+
846
+ 0:23:19.819 --> 0:23:21.443
847
+ And of course this was text.
848
+
849
+ 0:23:21.443 --> 0:23:26.538
850
+ Then you have a lot of other things where
851
+ you want to, for example, do speech translation.
852
+
853
+ 0:23:26.538 --> 0:23:31.744
854
+ There is a lot of conferences which currently
855
+ are all held in English, which of course might
856
+
857
+ 0:23:31.744 --> 0:23:35.831
858
+ also not be the best solution if you've gone
859
+ to some of the conferences.
860
+
861
+ 0:23:36.176 --> 0:23:45.964
862
+ You might have heard some accented speech
863
+ where people speak a language that is very
864
+
865
+ 0:23:45.964 --> 0:23:49.304
866
+ different from their mother.
867
+
868
+ 0:23:49.749 --> 0:23:52.059
869
+ Might be difficult to understand.
870
+
871
+ 0:23:52.212 --> 0:23:59.123
872
+ We're currently having an effort for example
873
+ by ACL, which is the conference organized in
874
+
875
+ 0:23:59.123 --> 0:24:06.112
876
+ this field to provide these translations into
877
+ ten hour languages so that also students who
878
+
879
+ 0:24:06.112 --> 0:24:06.803
880
+ are not.
881
+
882
+ 0:24:06.746 --> 0:24:12.446
883
+ That familiar English is able to read the
884
+ papers and watch the present case.
885
+
886
+ 0:24:16.416 --> 0:24:25.243
887
+ So the question is what can you do here and
888
+ one interesting solution which we'll cover
889
+
890
+ 0:24:25.243 --> 0:24:26.968
891
+ in this lecture?
892
+
893
+ 0:24:27.087 --> 0:24:38.112
894
+ This always comes with a question: is it will
895
+ it replace the human?
896
+
897
+ 0:24:38.112 --> 0:24:40.382
898
+ And yes, the.
899
+
900
+ 0:24:40.300 --> 0:24:49.300
901
+ Idea, but the question doesn't really happen
902
+ and I'm any skeptical about that.
903
+
904
+ 0:24:49.300 --> 0:24:52.946
905
+ So currently we are not seeing.
906
+
907
+ 0:24:53.713 --> 0:24:55.807
908
+ So much more effort needed.
909
+
910
+ 0:24:55.807 --> 0:25:00.294
911
+ Of course, machine translation is now used
912
+ as some type of.
913
+
914
+ 0:25:01.901 --> 0:25:11.785
915
+ If you think about in the European Parliament,
916
+ they will have some humans doing their translation
917
+
918
+ 0:25:11.785 --> 0:25:18.060
919
+ because: If you think about the chancel of
920
+ Germany trembling somewhere and quite sure
921
+
922
+ 0:25:18.060 --> 0:25:18.784
923
+ you want,.
924
+
925
+ 0:25:19.179 --> 0:25:31.805
926
+ And so it's more like we are augmenting the
927
+ possibilities to have more possibilities to
928
+
929
+ 0:25:31.805 --> 0:25:37.400
930
+ provide translation and travel around.
931
+
932
+ 0:25:39.499 --> 0:25:53.650
933
+ How can this technology help so machine translation
934
+ is one way of dealing with?
935
+
936
+ 0:25:54.474 --> 0:26:01.144
937
+ Of course, there is other tasks which do even
938
+ without machine translation.
939
+
940
+ 0:26:01.144 --> 0:26:04.613
941
+ Just think about summarize my lecture.
942
+
943
+ 0:26:04.965 --> 0:26:08.019
944
+ Approaches doing that what they call end to
945
+ end.
946
+
947
+ 0:26:08.019 --> 0:26:11.635
948
+ So you just put an English text and get a
949
+ German summary.
950
+
951
+ 0:26:11.635 --> 0:26:17.058
952
+ However, a good baseline and an important
953
+ thing is to either first lecture into German
954
+
955
+ 0:26:17.058 --> 0:26:22.544
956
+ and then do a summary art, first do a summary
957
+ in English and then translation language.
958
+
959
+ 0:26:23.223 --> 0:26:28.764
960
+ Translation is very important in order to
961
+ different application scenarios.
962
+
963
+ 0:26:28.764 --> 0:26:33.861
964
+ We have that dissemination dialogue but also
965
+ information extraction.
966
+
967
+ 0:26:33.861 --> 0:26:39.993
968
+ So if you want to do like get information
969
+ not only from English websites but from.
970
+
971
+ 0:26:40.300 --> 0:26:42.427
972
+ Very different websites.
973
+
974
+ 0:26:42.427 --> 0:26:46.171
975
+ It's helpful to have this type of solution.
976
+
977
+ 0:26:50.550 --> 0:26:52.772
978
+ Yeah, what can you translate?
979
+
980
+ 0:26:52.772 --> 0:26:59.660
981
+ Of course, we will focus on text, as I said
982
+ for most of them, because it's about translation
983
+
984
+ 0:26:59.660 --> 0:27:06.178
985
+ and anything first translates to text, and
986
+ then change to text, and then we can do text
987
+
988
+ 0:27:06.178 --> 0:27:07.141
989
+ translation.
990
+
991
+ 0:27:09.189 --> 0:27:19.599
992
+ And text is not equals text, so we can do
993
+ translation that is some of the most common.
994
+
995
+ 0:27:19.499 --> 0:27:27.559
996
+ Is working on translation, so just imagine
997
+ you are developing your new.
998
+
999
+ 0:27:27.947 --> 0:27:34.628
1000
+ Nowadays you don't want to have to only be
1001
+ available in English or German books in as
1002
+
1003
+ 0:27:34.628 --> 0:27:40.998
1004
+ many languages as possible, and if you use
1005
+ the standard tools it's not that easy.
1006
+
1007
+ 0:27:41.141 --> 0:27:50.666
1008
+ We have a different type of domain and there
1009
+ again we have very few contexts.
1010
+
1011
+ 0:27:50.666 --> 0:27:56.823
1012
+ Normally we translate: To pick up an app you
1013
+ have the menu and there's like safe.
1014
+
1015
+ 0:27:57.577 --> 0:28:02.535
1016
+ And then you only have safe.
1017
+
1018
+ 0:28:02.535 --> 0:28:14.845
1019
+ How should translate safe should it be written
1020
+ or should it be spicing?
1021
+
1022
+ 0:28:16.856 --> 0:28:24.407
1023
+ Then, of course, if you have like files, it
1024
+ might be that you have meta data to transport.
1025
+
1026
+ 0:28:26.466 --> 0:28:27.137
1027
+ Novels.
1028
+
1029
+ 0:28:27.137 --> 0:28:32.501
1030
+ Some work on that, but yeah, that's always
1031
+ a typical criticism.
1032
+
1033
+ 0:28:32.501 --> 0:28:36.440
1034
+ You'll never be able to translate Shakespeare.
1035
+
1036
+ 0:28:36.656 --> 0:28:43.684
1037
+ Think this is somehow the last use case of
1038
+ machine translation.
1039
+
1040
+ 0:28:43.684 --> 0:28:47.637
1041
+ For a translation of books there's.
1042
+
1043
+ 0:28:47.847 --> 0:28:57.047
1044
+ But the nice thing about machine translation
1045
+ is that it can translate to things which are
1046
+
1047
+ 0:28:57.047 --> 0:29:05.327
1048
+ boring, so think about translating some bureaucrative
1049
+ forms or some regulations.
1050
+
1051
+ 0:29:05.565 --> 0:29:11.302
1052
+ This is normally not very interesting, it's
1053
+ very repetitive, so their automation works
1054
+
1055
+ 0:29:11.302 --> 0:29:11.697
1056
+ well.
1057
+
1058
+ 0:29:11.931 --> 0:29:17.519
1059
+ Of course, there is also translations on Paibos
1060
+ images.
1061
+
1062
+ 0:29:17.519 --> 0:29:24.604
1063
+ I guess you point your camera to an object
1064
+ where it translates things.
1065
+
1066
+ 0:29:25.005 --> 0:29:43.178
1067
+ And we'll cover that at the end, as said,
1068
+ the speech translation.
1069
+
1070
+ 0:29:43.663 --> 0:29:46.795
1071
+ So you can't provide the translation of the
1072
+ lecture.
1073
+
1074
+ 0:29:46.795 --> 0:29:50.518
1075
+ If I'm five slides further then you would
1076
+ see the translation.
1077
+
1078
+ 0:29:50.518 --> 0:29:52.291
1079
+ It might not be very helpful.
1080
+
1081
+ 0:29:54.794 --> 0:29:57.062
1082
+ We are not speaking as we are written.
1083
+
1084
+ 0:29:57.062 --> 0:29:59.097
1085
+ It's again like a domain mismatch.
1086
+
1087
+ 0:29:59.359 --> 0:30:10.161
1088
+ So typically the sentences are not full sentences
1089
+ and I'm saying this is not the right way to
1090
+
1091
+ 0:30:10.161 --> 0:30:19.354
1092
+ praise it and if you just read what was written
1093
+ it might be hard to understand.
1094
+
1095
+ 0:30:23.803 --> 0:30:36.590
1096
+ We are focusing on the first application scenario
1097
+ that is fully out of management.
1098
+
1099
+ 0:30:37.177 --> 0:30:46.373
1100
+ Of course, there are quite interesting application
1101
+ scenarios for other things where it should
1102
+
1103
+ 0:30:46.373 --> 0:30:47.645
1104
+ be referred.
1105
+
1106
+ 0:30:47.867 --> 0:30:49.695
1107
+ Where it's no longer going to be.
1108
+
1109
+ 0:30:49.695 --> 0:30:52.436
1110
+ We have this tool and it works, but it's a
1111
+ market.
1112
+
1113
+ 0:30:52.436 --> 0:30:57.381
1114
+ We have the machine translation system and
1115
+ the human translator, and they somehow cooperate
1116
+
1117
+ 0:30:57.381 --> 0:30:59.853
1118
+ and try to be as fast as possible in doing
1119
+ a.
1120
+
1121
+ 0:31:00.380 --> 0:31:12.844
1122
+ The easiest idea there would be the first
1123
+ point you take the machine translation.
1124
+
1125
+ 0:31:13.553 --> 0:31:17.297
1126
+ That sometimes farther might not be the best
1127
+ way of suing it.
1128
+
1129
+ 0:31:17.357 --> 0:31:25.308
1130
+ Any ideas or what else you could do, then
1131
+ maybe the machine could aid the human and say
1132
+
1133
+ 0:31:25.308 --> 0:31:27.838
1134
+ I'm sure about this author.
1135
+
1136
+ 0:31:28.368 --> 0:31:32.319
1137
+ Yeah, very interesting, very good.
1138
+
1139
+ 0:31:32.319 --> 0:31:42.252
1140
+ Of course, the dangerous thing there is you
1141
+ asking something from a machine translation
1142
+
1143
+ 0:31:42.252 --> 0:31:45.638
1144
+ system where it's really bad.
1145
+
1146
+ 0:31:45.845 --> 0:31:50.947
1147
+ There is quality estimation that maybe it
1148
+ will couple that in evaluation so in evaluation
1149
+
1150
+ 0:31:50.947 --> 0:31:55.992
1151
+ you know what is correct translation and you
1152
+ have another output and you try to estimate
1153
+
1154
+ 0:31:55.992 --> 0:31:57.409
1155
+ how good is the quality.
1156
+
1157
+ 0:31:57.409 --> 0:32:02.511
1158
+ In quality estimation you don't have that; you only
1159
+ have a source and a hypothesis, and the good question is
1160
+
1161
+ 0:32:02.511 --> 0:32:03.531
1162
+ exactly this one.
1163
+
1164
+ 0:32:03.531 --> 0:32:05.401
1165
+ Is it a good translation or not?
1166
+
1167
+ 0:32:05.665 --> 0:32:12.806
1168
+ This might be easier because the system might
1169
+ not know what translation is.
1170
+
1171
+ 0:32:13.053 --> 0:32:23.445
1172
+ Humans are very good at that; for machines it
1173
+ is difficult, but of course that's an interesting
1174
+
1175
+ 0:32:23.445 --> 0:32:24.853
1176
+ application.
1177
+
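To make the distinction just described concrete, here is a minimal Python sketch (not the methods used in the lecture): reference-based evaluation sees the correct translation, while quality estimation only sees the source and the system output. Both scoring rules below are invented toy heuristics.

```python
# Toy contrast between evaluation (with a reference) and quality estimation
# (without one). Both scoring rules are invented stand-ins for real metrics.

def evaluate_with_reference(hypothesis: str, reference: str) -> float:
    """Reference-based evaluation: the correct translation is known."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not hyp or not ref:
        return 0.0
    return len(set(hyp) & set(ref)) / max(len(hyp), len(ref))  # crude word overlap

def estimate_quality(source: str, hypothesis: str) -> float:
    """Quality estimation: only source and hypothesis are available.
    A real QE model would be learned; a length ratio stands in here."""
    s, h = len(source.split()), len(hypothesis.split())
    return 0.0 if not s or not h else min(s, h) / max(s, h)

print(evaluate_with_reference("the house is green", "the house is green"))  # 1.0
print(estimate_quality("das Haus ist grün", "the house is green"))          # 1.0
```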
1178
+ 0:32:25.065 --> 0:32:32.483
1179
+ Be more interactive so that you may be translating
1180
+ if the human changes the fifth word.
1181
+
1182
+ 0:32:32.483 --> 0:32:36.361
1183
+ What does it mean for the remaining sentence?
1184
+
1185
+ 0:32:36.361 --> 0:32:38.131
1186
+ Do I need to change?
1187
+
1188
+ 0:32:38.131 --> 0:32:43.948
1189
+ There are also things like you don't have
1190
+ to repeat the same errors.
1191
+
1192
+ 0:32:47.767 --> 0:32:57.651
1193
+ Hell our automated basemen, you only want
1194
+ to correct at once and not at all positions.
1195
+
1196
+ 0:33:00.000 --> 0:33:21.784
1197
+ And then they ask, for example, so before
1198
+ the translation is done they ask: I'm not directly
1199
+
1200
+ 0:33:21.784 --> 0:33:23.324
1201
+ aware of that.
1202
+
1203
+ 0:33:23.324 --> 0:33:33.280
1204
+ I think it's a good way of ending and I think
1205
+ it's where, especially with more advanced dialogue
1206
+
1207
+ 0:33:33.280 --> 0:33:34.717
1208
+ strategy and.
1209
+
1210
+ 0:33:35.275 --> 0:33:38.831
1211
+ Currently think of most of the focus is like
1212
+ at least determining.
1213
+
1214
+ 0:33:39.299 --> 0:33:45.646
1215
+ Don't have this information that is already
1216
+ challenging, so there is quite some work on
1217
+
1218
+ 0:33:45.646 --> 0:33:49.541
1219
+ quality estimation that I'm missing your information.
1220
+
1221
+ 0:33:49.789 --> 0:33:53.126
1222
+ But is there something missing?
1223
+
1224
+ 0:33:53.126 --> 0:33:59.904
1225
+ It's really quite challenging and think that
1226
+ is where currently.
1227
+
1228
+ 0:34:00.260 --> 0:34:05.790
1229
+ What is there is there is opportunities to
1230
+ provide or there is models to directly provide
1231
+
1232
+ 0:34:05.790 --> 0:34:06.527
1233
+ additional?
1234
+
1235
+ 0:34:06.786 --> 0:34:13.701
1236
+ You can give them anything you have and provide
1237
+ them.
1238
+
1239
+ 0:34:13.701 --> 0:34:21.129
1240
+ It's a similar situation if you're translating
1241
+ to German.
1242
+
1243
+ 0:34:21.641 --> 0:34:31.401
1244
+ And it would just guess normally or do some
1245
+ random guessing always means it's using some
1246
+
1247
+ 0:34:31.401 --> 0:34:36.445
1248
+ information which should not be really there.
1249
+
1250
+ 0:34:36.776 --> 0:34:46.449
1251
+ So then you can provide it with an additional
1252
+ input, like whether to use formal or informal address.
1253
+
1254
+ 0:34:47.747 --> 0:35:04.687
1255
+ To know that this information is missing.
1256
+
1257
+ 0:35:04.544 --> 0:35:19.504
1258
+ Since you're not specifically modeling this,
1259
+ it's likely that there is a gender difference
1260
+
1261
+ 0:35:19.504 --> 0:35:21.805
1262
+ in languages.
1263
+
1264
+ 0:35:26.046 --> 0:35:39.966
1265
+ Why are we doing research on machine translation:
1266
+ it's a very important task in natural
1267
+
1268
+ 0:35:39.966 --> 0:35:42.860
1269
+ language processing.
1270
+
1271
+ 0:35:43.283 --> 0:35:49.234
1272
+ So of course you have a lot of computer science
1273
+ thing in there and that's the backbone of.
1274
+
1275
+ 0:35:49.569 --> 0:36:01.848
1276
+ However, task and understanding you can also
1277
+ get from information like computational linguistics,
1278
+
1279
+ 0:36:01.848 --> 0:36:08.613
1280
+ which tell you about what language it's good
1281
+ to know.
1282
+
1283
+ 0:36:08.989 --> 0:36:15.425
1284
+ Doesn't mean that in a computer we have to
1285
+ model it exactly the same, but for example
1286
+
1287
+ 0:36:15.425 --> 0:36:22.453
1288
+ to know that there is something like morphology,
1289
+ which means how words are built, and that for
1290
+
1291
+ 0:36:22.453 --> 0:36:24.746
1292
+ some languages it's very easy.
1293
+
1294
+ 0:36:24.746 --> 0:36:28.001
1295
+ In English there is nearly no worth coming.
1296
+
1297
+ 0:36:28.688 --> 0:36:35.557
1298
+ Well in Germany you already start for soon
1299
+ you have like different forms and so on.
1300
+
1301
+ 0:36:36.316 --> 0:36:41.991
1302
+ And for other languages, for Finnish, it's
1303
+ even more complicated with Basque.
1304
+
1305
+ 0:36:41.991 --> 0:36:44.498
1306
+ I think for some words more than.
1307
+
1308
+ 0:36:45.045 --> 0:36:52.098
1309
+ So knowing this, of course, gives you some
1310
+ advice.
1311
+
1312
+ 0:36:52.098 --> 0:37:04.682
1313
+ How do I look at that now because we'll see
1314
+ in the basic treat each word as an individual?
1315
+
1316
+ 0:37:06.106 --> 0:37:09.259
1317
+ Of course there is a lot of interest also
1318
+ prone from industry.
1319
+
1320
+ 0:37:09.259 --> 0:37:10.860
1321
+ There is a lot of applications.
1322
+
1323
+ 0:37:11.191 --> 0:37:17.068
1324
+ There's research groups at Google, Facebook,
1325
+ and Amazon.
1326
+
1327
+ 0:37:17.068 --> 0:37:26.349
1328
+ So there's quite a lot of interest in providing
1329
+ that for German and English it is solved.
1330
+
1331
+ 0:37:26.546 --> 0:37:27.569
1332
+ Annoucing it's hard.
1333
+
1334
+ 0:37:27.569 --> 0:37:31.660
1335
+ We're saying that not hard, but of course
1336
+ we haven't acquired high quality in them.
1337
+
1338
+ 0:37:32.212 --> 0:37:39.296
1339
+ But there's currently really a large trend
1340
+ in building other systems for low-resource
1341
+
1342
+ 0:37:39.296 --> 0:37:40.202
1343
+ languages.
1344
+
1345
+ 0:37:40.480 --> 0:37:53.302
1346
+ So there are tasks on last year's task on
1347
+ translating from Native American languages:
1348
+
1349
+ 0:37:53.193 --> 0:37:58.503
1350
+ Don't know yet but but five other languages,
1351
+ so how can you translate from them?
1352
+
1353
+ 0:37:58.538 --> 0:38:05.074
1354
+ Then you don't have like millions of sentences,
1355
+ but you might have only the Bible or some more
1356
+
1357
+ 0:38:05.074 --> 0:38:05.486
1358
+ data.
1359
+
1360
+ 0:38:05.486 --> 0:38:08.169
1361
+ Then the question is, what can you do?
1362
+
1363
+ 0:38:08.169 --> 0:38:09.958
1364
+ And how good can you get?
1365
+
1366
+ 0:38:14.794 --> 0:38:17.296
1367
+ One thing is very important.
1368
+
1369
+ 0:38:17.296 --> 0:38:25.751
1370
+ Of course, in a lot of A I is to measure the
1371
+ quality and what you can measure is quite important.
1372
+
1373
+ 0:38:25.986 --> 0:38:37.213
1374
+ So that's why for many years of regular there
1375
+ is different evaluation campaigns where people
1376
+
1377
+ 0:38:37.213 --> 0:38:38.178
1378
+ submit.
1379
+
1380
+ 0:38:39.419 --> 0:38:45.426
1381
+ It was originally the workshop on statistical machine
1382
+ translation, yet now I think it's the conference
1383
+
1384
+ 0:38:45.426 --> 0:38:51.019
1385
+ on machine translation, where it's mostly about
1386
+ European languages and news texts.
1387
+
1388
+ 0:38:51.051 --> 0:38:57.910
1389
+ The International Workshop of Spoken Language
1390
+ Translation, which is translation about lectures
1391
+
1392
+ 0:38:57.910 --> 0:39:04.263
1393
+ which we are co organizing, and there is a
1394
+ bovia as I said building strong systems this
1395
+
1396
+ 0:39:04.263 --> 0:39:04.696
1397
+ time.
1398
+
1399
+ 0:39:04.664 --> 0:39:11.295
1400
+ This has established translating conference
1401
+ presentations from English into ten different
1402
+
1403
+ 0:39:11.295 --> 0:39:17.080
1404
+ languages: And then, of course, you have to
1405
+ deal with things like special vocabulary.
1406
+
1407
+ 0:39:17.037 --> 0:39:23.984
1408
+ You think about recurrent real networks are
1409
+ terms like co-recurrent networks, convolutional
1410
+
1411
+ 0:39:23.984 --> 0:39:24.740
1412
+ networks.
1413
+
1414
+ 0:39:25.545 --> 0:39:29.917
1415
+ That might be more difficult to translate
1416
+ and you also have to decide do I need to translate it
1417
+
1418
+ 0:39:29.917 --> 0:39:33.359
1419
+ or should I keep it in English, and that's
1420
+ not the same in each language.
1421
+
1422
+ 0:39:33.873 --> 0:39:37.045
1423
+ In German maybe mostly you keep it.
1424
+
1425
+ 0:39:37.045 --> 0:39:44.622
1426
+ I think in French people are typically like
1427
+ wanting to translate as much as possible.
1428
+
1429
+ 0:39:44.622 --> 0:39:52.200
1430
+ These are then challenges and then, of course,
1431
+ in Poland where it's also challenging.
1432
+
1433
+ 0:39:53.153 --> 0:39:59.369
1434
+ I think all of the speakers in the test that
1435
+ are not native in your speakers, so you need
1436
+
1437
+ 0:39:59.369 --> 0:40:05.655
1438
+ to translate people with a German accent or
1439
+ with a French accent or with a Japanese accent
1440
+
1441
+ 0:40:05.655 --> 0:40:09.178
1442
+ or an English accent, which poses additional challenges.
1443
+
1444
+ 0:40:12.272 --> 0:40:21.279
1445
+ Yes, so there is criticism always with new
1446
+ technologies, because people say it will never
1447
+
1448
+ 0:40:21.279 --> 0:40:23.688
1449
+ translate Shakespeare.
1450
+
1451
+ 0:40:24.204 --> 0:40:26.845
1452
+ Partly agree with the second.
1453
+
1454
+ 0:40:26.845 --> 0:40:34.682
1455
+ Maybe it's not good at translating Shakespeare,
1456
+ but there's many people working on that.
1457
+
1458
+ 0:40:35.255 --> 0:40:38.039
1459
+ Of course, the poison cookie is a challenge.
1460
+
1461
+ 0:40:38.858 --> 0:40:44.946
1462
+ The thing here is the critique that
1463
+ you can never be sure if the machine translation
1464
+
1465
+ 0:40:44.946 --> 0:40:47.546
1466
+ system doesn't make a mistake somewhere.
1467
+
1468
+ 0:40:47.546 --> 0:40:53.316
1469
+ So if you can't be sure that there's no error
1470
+ in there, how can you trust the translation?
1471
+
1472
+ 0:40:55.275 --> 0:41:01.892
1473
+ That is partly true, on the other hand, otherwise
1474
+ you have to trust a human translator
1475
+
1476
+ 0:41:01.892 --> 0:41:06.116
1477
+ and we are sometimes overestimating human
1478
+ performance.
1479
+
1480
+ 0:41:06.746 --> 0:41:15.111
1481
+ They are very good translators but under a
1482
+ lot of pressure and not human translations.
1483
+
1484
+ 0:41:15.715 --> 0:41:22.855
1485
+ The question is: When can you trust it enough
1486
+ anyway?
1487
+
1488
+ 0:41:22.855 --> 0:41:28.540
1489
+ You should be careful about trusting them.
1490
+
1491
+ 0:41:31.011 --> 0:41:38.023
1492
+ And I think some of them are too old now because
1493
+ it has been shown that it is helpful to have
1494
+
1495
+ 0:41:38.023 --> 0:41:41.082
1496
+ some type of machine translation system.
1497
+
1498
+ 0:41:41.082 --> 0:41:47.722
1499
+ Of course, it is not like buying a car, so typically
1500
+ still a system is not working forever.
1501
+
1502
+ 0:41:48.048 --> 0:41:56.147
1503
+ If you want your dedicated system, which is
1504
+ good for the task you are, they are typically
1505
+
1506
+ 0:41:56.147 --> 0:41:57.947
1507
+ not as generalized.
1508
+
1509
+ 0:41:58.278 --> 0:42:07.414
1510
+ That can translate news and chats, and I don't
1511
+ know what.
1512
+
1513
+ 0:42:07.414 --> 0:42:12.770
1514
+ So typically if you want to show.
1515
+
1516
+ 0:42:12.772 --> 0:42:18.796
1517
+ It's not made for, it has not seen very well
1518
+ and then you see a bad quality.
1519
+
1520
+ 0:42:19.179 --> 0:42:27.139
1521
+ But that's also like yeah, therefore you don't
1522
+ build it.
1523
+
1524
+ 0:42:27.139 --> 0:42:42.187
1525
+ If you have a sports car and you are driving
1526
+ off road you should: Yeah, you can also say
1527
+
1528
+ 0:42:42.187 --> 0:42:49.180
1529
+ the other way around, that machine translation
1530
+ is already solved, and especially with more
1531
+
1532
+ 0:42:49.180 --> 0:42:50.487
1533
+ people think so.
1534
+
1535
+ 0:42:50.750 --> 0:43:04.275
1536
+ However, there is an impressive performance
1537
+ of machine translation, but it's not stated
1538
+
1539
+ 0:43:04.275 --> 0:43:06.119
1540
+ of the art.
1541
+
1542
+ 0:43:06.586 --> 0:43:11.811
1543
+ And yeah, they're good for some domains and
1544
+ some languages that are even like already.
1545
+
1546
+ 0:43:12.572 --> 0:43:27.359
1547
+ Microsoft even claimed super-human performance
1548
+ for their machine translation system.
1549
+
1550
+ 0:43:27.467 --> 0:43:38.319
1551
+ However, that was one domain, news, and some
1552
+ language like Spanish where there is a huge amount
1553
+
1554
+ 0:43:38.319 --> 0:43:45.042
1555
+ of training data and you can build a very strong
1556
+ system.
1557
+
1558
+ 0:43:45.505 --> 0:43:48.605
1559
+ And you even don't have to go to these extreme
1560
+ cases.
1561
+
1562
+ 0:43:48.688 --> 0:43:54.328
1563
+ We have worked on Kannada, which is a language
1564
+ spoken in India.
1565
+
1566
+ 0:43:54.328 --> 0:44:01.669
1567
+ I think by also around eighty million people
1568
+ so similar to to German that it has.
1569
+
1570
+ 0:44:01.669 --> 0:44:07.757
1571
+ The quality is significantly worse, it has
1572
+ significantly less data.
1573
+
1574
+ 0:44:08.108 --> 0:44:15.132
1575
+ There are still quite a lot of languages where
1576
+ the quality is not, where you want to have.
1577
+
1578
+ 0:44:15.295 --> 0:44:17.971
1579
+ Scaling this is not as easy at this thing.
1580
+
1581
+ 0:44:17.971 --> 0:44:23.759
1582
+ That's why we're also interested in multilingual
1583
+ systems with the hope that we don't have to
1584
+
1585
+ 0:44:23.759 --> 0:44:29.548
1586
+ build a system for each possible combination,
1587
+ but we can build a system which can cover many
1588
+
1589
+ 0:44:29.548 --> 0:44:33.655
1590
+ tasks, many languages, and then also need less
1591
+ data for each of them.
1592
+
1593
+ 0:44:39.639 --> 0:44:51.067
1594
+ With invasion maybe some presentation of everything
1595
+ is a bit cat that can say the most important.
1596
+
1597
+ 0:44:51.331 --> 0:45:09.053
1598
+ So machine translation started coming from
1599
+ information theory, where there was this idea of
1600
+
1601
+ 0:45:09.053 --> 0:45:13.286
1602
+ treating machine translation as encryption
1603
+ or decryption.
1604
+
1605
+ 0:45:13.533 --> 0:45:21.088
1606
+ I don't understand it, I want to have it in English,
1607
+ treat it as if it's like encrypted English,
1608
+
1609
+ 0:45:21.088 --> 0:45:28.724
1610
+ and then apply my decryption algorithm, which
1611
+ they were working a lot during the Second World
1612
+
1613
+ 0:45:28.724 --> 0:45:29.130
1614
+ War.
1615
+
1616
+ 0:45:29.209 --> 0:45:34.194
1617
+ And so if I cannot do this detruction then
1618
+ this sings a song.
1619
+
1620
+ 0:45:34.934 --> 0:45:42.430
1621
+ And they based on that they had rules and
1622
+ so on.
1623
+
1624
+ 0:45:42.430 --> 0:45:50.843
1625
+ So there was the famous Georgetown experiment,
1626
+ where they translated
1627
+
1628
+ 0:45:51.691 --> 0:45:57.419
1629
+ from Russian into English, and then they were like wow.
1630
+
1631
+ 0:45:57.419 --> 0:46:01.511
1632
+ This is solved in some years.
1633
+
1634
+ 0:46:01.511 --> 0:46:04.921
1635
+ Now we can do sentences.
1636
+
1637
+ 0:46:06.546 --> 0:46:18.657
1638
+ As you can imagine this didn't really work
1639
+ out that way, so it's not really happening.
1640
+
1641
+ 0:46:18.657 --> 0:46:24.503
1642
+ The spirit is willing, but the flesh is weak.
1643
+
1644
+ 0:46:24.444 --> 0:46:30.779
1645
+ Translated it to Russian and then back again,
1646
+ and then: the vodka is good but the meat is rotten.
1647
+
1648
+ 0:46:31.271 --> 0:46:39.694
1649
+ Think it never really happened this way, but
1650
+ you can see you can imagine that something
1651
+
1652
+ 0:46:39.694 --> 0:46:49.533
1653
+ like that could happen, and then in in the
1654
+ there was this report saying: It's more challenging
1655
+
1656
+ 0:46:49.533 --> 0:46:56.877
1657
+ than expected and the problem is that we have
1658
+ to invest more.
1659
+
1660
+ 0:46:56.877 --> 0:47:02.801
1661
+ There's no benefit for doing machine translation.
1662
+
1663
+ 0:47:04.044 --> 0:47:09.255
1664
+ At least in some other countries there was
1665
+ a bit, but then for some time there wasn't
1666
+
1667
+ 0:47:09.255 --> 0:47:10.831
1668
+ that big out of progress.
1669
+
1670
+ 0:47:12.152 --> 0:47:26.554
1671
+ We have then in the' 70s there were some rule
1672
+ based systems that would cover out some linguistic
1673
+
1674
+ 0:47:26.554 --> 0:47:28.336
1675
+ background.
1676
+
1677
+ 0:47:28.728 --> 0:47:34.013
1678
+ They are now doing very good machine translation,
1679
+ but they had a really huge rule base.
1680
+
1681
+ 0:47:34.314 --> 0:47:43.538
1682
+ So they really had handwritten rules for
1683
+ how to parse sentences, how to transfer parse
1684
+
1685
+ 0:47:43.538 --> 0:47:45.587
1686
+ trees into target parse trees.
1687
+
1688
+ 0:47:46.306 --> 0:47:55.868
1689
+ When which word should be translated, these
1690
+ rule based systems were quite strong for a
1691
+
1692
+ 0:47:55.868 --> 0:47:57.627
1693
+ very long time.
1694
+
1695
+ 0:47:57.917 --> 0:48:03.947
1696
+ So even for some language pairs and
1697
+ some domains, it was for a long time better than machine
1698
+
1699
+ 0:48:03.947 --> 0:48:04.633
1700
+ learning.
1701
+
1702
+ 0:48:05.505 --> 0:48:09.576
1703
+ Well, of course, there was a lot of effort
1704
+ in and a lot of experts were building this.
1705
+
1706
+ 0:48:11.791 --> 0:48:13.170
1707
+ And then.
1708
+
1709
+ 0:48:13.053 --> 0:48:18.782
1710
+ The first statistical machine translations
1711
+ were coming in the early nineties.
1712
+
1713
+ 0:48:18.782 --> 0:48:25.761
1714
+ There's the system by IBM will refer to them
1715
+ as a T by the IBM models, which are quite famous,
1716
+
1717
+ 0:48:25.761 --> 0:48:32.886
1718
+ and they were used to film your machine translations
1719
+ from the nineties nineties to two thousand.
1720
+
1721
+ 0:48:32.912 --> 0:48:35.891
1722
+ Fifteen or so people were working on the IBM
1723
+ models.
1724
+
1725
+ 0:48:36.496 --> 0:48:44.608
1726
+ And that was the first way of doing a machine
1727
+ translation with statisticals or machine learning.
1728
+
1729
+ 0:48:44.924 --> 0:48:52.143
1730
+ And it was possible through the French-English
1731
+ corpus from the Canadian Parliament:
1732
+
1733
+ 0:48:52.143 --> 0:48:59.516
1734
+ they also had proceedings in French and English
1735
+ and people tried to use that to translate and.
1736
+
1737
+ 0:49:01.681 --> 0:49:06.919
1738
+ And yes, so that was than the start of statistical
1739
+ machine translation.
1740
+
1741
+ 0:49:07.227 --> 0:49:17.797
1742
+ Then so-called phrase-based machine translation
1743
+ was introduced where you could add more information
1744
+
1745
+ 0:49:17.797 --> 0:49:26.055
1746
+ in use longer chunks to translate and phrase
1747
+ page translation was somehow.
1748
+
1749
+ 0:49:26.326 --> 0:49:27.603
1750
+ She'll Start Fourteen.
1751
+
1752
+ 0:49:27.767 --> 0:49:37.721
1753
+ With this phrase-based machine translation
1754
+ we saw the first commercial systems.
1755
+
1756
+ 0:49:38.178 --> 0:49:45.301
1757
+ And yeah, that was the first big advantage
1758
+ where really you can see the machine translation.
1759
+
1760
+ 0:49:47.287 --> 0:49:55.511
1761
+ And neural machine translation was mainly
1762
+ introduced.
1763
+
1764
+ 0:49:55.511 --> 0:50:07.239
1765
+ That means there was a shift from traditional
1766
+ statistical modeling to using.
1767
+
1768
+ 0:50:07.507 --> 0:50:09.496
1769
+ And that was quite impressive.
1770
+
1771
+ 0:50:09.496 --> 0:50:11.999
1772
+ It was really within one or two years.
1773
+
1774
+ 0:50:11.999 --> 0:50:17.453
1775
+ The whole research community shifted from
1776
+ what they had been working on since twenty
1777
+
1778
+ 0:50:17.453 --> 0:50:17.902
1779
+ years.
1780
+
1781
+ 0:50:17.902 --> 0:50:23.485
1782
+ And everybody was using this pattern, you
1783
+ know networks, because just the performances
1784
+
1785
+ 0:50:23.485 --> 0:50:25.089
1786
+ were really really much.
1787
+
1788
+ 0:50:25.425 --> 0:50:35.048
1789
+ Especially they are what we also see now with
1790
+ chat boards like the impressive thing.
1791
+
1792
+ 0:50:35.135 --> 0:50:45.261
1793
+ That was very, very challenging if you see
1794
+ machine translation before that, especially
1795
+
1796
+ 0:50:45.261 --> 0:50:47.123
1797
+ if the English.
1798
+
1799
+ 0:50:47.547 --> 0:50:53.352
1800
+ But if you were transmitting to German you
1801
+ would see that the agreement so that it's there
1802
+
1803
+ 0:50:53.352 --> 0:50:58.966
1804
+ shown abound and dishewn and boima and this
1805
+ didn't always really work perfect maybe for
1806
+
1807
+ 0:50:58.966 --> 0:51:04.835
1808
+ the short range of work but then it has to
1809
+ be accusative and it's like far away then things
1810
+
1811
+ 0:51:04.835 --> 0:51:06.430
1812
+ didn't really work well.
1813
+
1814
+ 0:51:06.866 --> 0:51:13.323
1815
+ Now with new machine translation we have a
1816
+ bit of a different problem: So the sentences
1817
+
1818
+ 0:51:13.323 --> 0:51:16.901
1819
+ are typically really nice.
1820
+
1821
+ 0:51:16.901 --> 0:51:24.056
1822
+ They are perfectly written not always but
1823
+ very often.
1824
+
1825
+ 0:51:24.224 --> 0:51:36.587
1826
+ So adequacy, that source and translation should
1827
+ have the same meaning, is typically the bigger problem.
1828
+
1829
+ 0:51:42.002 --> 0:51:46.039
1830
+ So how can we do so last?
1831
+
1832
+ 0:51:46.039 --> 0:51:54.889
1833
+ What are the things and how can we do machine
1834
+ rendering?
1835
+
1836
+ 0:51:55.235 --> 0:52:01.297
1837
+ So we had first blue based systems, and as
1838
+ a side systems we did that we manually created
1839
+
1840
+ 0:52:01.297 --> 0:52:01.769
1841
+ rules.
1842
+
1843
+ 0:52:01.861 --> 0:52:07.421
1844
+ And there were rules how to disambiguate
1845
+ ambiguities.
1846
+
1847
+ 0:52:07.421 --> 0:52:16.417
1848
+ For example, we had the word banks look at
1849
+ the context and do rules like to decide when.
1850
+
1851
+ 0:52:17.197 --> 0:52:28.418
1852
+ How to translate the structure, but you know
1853
+ how to transfer the structure that you work
1854
+
1855
+ 0:52:28.418 --> 0:52:33.839
1856
+ has to split it in German and move to the.
1857
+
1858
+ 0:52:35.295 --> 0:52:36.675
1859
+ Here's a difficult thing.
1860
+
1861
+ 0:52:36.675 --> 0:52:39.118
1862
+ My thing is you don't need any training data.
1863
+
1864
+ 0:52:39.118 --> 0:52:41.295
1865
+ It's not like now with machine learning.
1866
+
1867
+ 0:52:41.295 --> 0:52:46.073
1868
+ If you build a machine translation system,
1869
+ the first question you should ask is do I have
1870
+
1871
+ 0:52:46.073 --> 0:52:46.976
1872
+ data to do that?
1873
+
1874
+ 0:52:46.976 --> 0:52:48.781
1875
+ Do I have parallel data to train?
1876
+
1877
+ 0:52:49.169 --> 0:52:50.885
1878
+ Here there's no data.
1879
+
1880
+ 0:52:50.885 --> 0:52:57.829
1881
+ All it takes is people writing the rules, but
1882
+ the problem is who writes the rules, and
1883
+
1884
+ 0:52:57.829 --> 0:52:59.857
1885
+ this needs to be experts.
1886
+
1887
+ 0:52:59.799 --> 0:53:06.614
1888
+ Understand at least the grammar in one language,
1889
+ basically the grammar in both languages.
1890
+
1891
+ 0:53:06.614 --> 0:53:09.264
1892
+ It needs to be a real language to.
1893
+
1894
+ 0:53:10.090 --> 0:53:17.308
1895
+ Then we have the two corpus based machine
1896
+ translation approaches, and then we use machine
1897
+
1898
+ 0:53:17.308 --> 0:53:22.682
1899
+ learning to learn how to translate from one
1900
+ language to the other.
1901
+
1902
+ 0:53:22.882 --> 0:53:29.205
1903
+ We should find out ourselves what is the meaning
1904
+ of individual words, which words translate
1905
+
1906
+ 0:53:29.205 --> 0:53:30.236
1907
+ to each other.
1908
+
1909
+ 0:53:30.236 --> 0:53:36.215
1910
+ The only information we give is the German
1911
+ sentence, the English sentence, and then we
1912
+
1913
+ 0:53:36.215 --> 0:53:37.245
1914
+ look for many.
1915
+
1916
+ 0:53:37.697 --> 0:53:42.373
1917
+ So maybe you think there's a Bible for each
1918
+ language.
1919
+
1920
+ 0:53:42.373 --> 0:53:44.971
1921
+ There shouldn't be a problem.
1922
+
1923
+ 0:53:45.605 --> 0:53:52.752
1924
+ But this is not the scale when we're talking
1925
+ about.
1926
+
1927
+ 0:53:52.752 --> 0:54:05.122
1928
+ Small systems have maybe one hundred thousand
1929
+ sentences when we're building large models.
1930
+
1931
+ 0:54:05.745 --> 0:54:19.909
1932
+ The statistical models do statistics about
1933
+ how the word screw occur and how often the
1934
+
1935
+ 0:54:19.909 --> 0:54:21.886
1936
+ word screw.
1937
+
1938
+ 0:54:22.382 --> 0:54:29.523
1939
+ What we will focus on is what is currently in
1940
+ most of the cases referred to as neural machine translation.
1941
+
1942
+ 0:54:30.050 --> 0:54:44.792
1943
+ So in this case the idea is that you have
1944
+ a neural model which is a big neural network.
1945
+
1946
+ 0:54:45.345 --> 0:54:55.964
1947
+ And for these machine drums there quite challenging
1948
+ tasks.
1949
+
1950
+ 0:54:55.964 --> 0:55:03.883
1951
+ For example, the Transformer architecture.
1952
+
1953
+ 0:55:03.903 --> 0:55:07.399
1954
+ First proposed by Google in two thousand seventeen.
1955
+
1956
+ 0:55:08.028 --> 0:55:19.287
1957
+ Here want to ask the screw-based machine translation
1958
+ of that part.
1959
+
1960
+ 0:55:22.862 --> 0:55:33.201
1961
+ Would say it's mainly rule based systems because
1962
+ purely rule based systems maybe exist with
1963
+
1964
+ 0:55:33.201 --> 0:55:36.348
1965
+ some very exotic languages.
1966
+
1967
+ 0:55:36.776 --> 0:55:43.947
1968
+ Of course, the idea of investigating if we
1969
+ have this type of rulers that might be still
1970
+
1971
+ 0:55:43.947 --> 0:55:45.006
1972
+ interesting.
1973
+
1974
+ 0:55:45.105 --> 0:55:52.090
1975
+ Maybe you can try to let someone force the
1976
+ rules in there.
1977
+
1978
+ 0:55:52.090 --> 0:55:57.655
1979
+ You might use rules to create artificial data.
1980
+
1981
+ 0:55:57.557 --> 0:56:03.577
1982
+ That it might be helpful to have some concepts
1983
+ which develop by bilinguistic researches to
1984
+
1985
+ 0:56:03.577 --> 0:56:09.464
1986
+ somehow interview that that's still an open
1987
+ question is sometimes helpful, and of course
1988
+
1989
+ 0:56:09.464 --> 0:56:13.235
1990
+ is also interesting from more the analyzed
1991
+ perspectives.
1992
+
1993
+ 0:56:13.235 --> 0:56:13.499
1994
+ So.
1995
+
1996
+ 0:56:13.793 --> 0:56:20.755
1997
+ Do the new networks have these types of concepts
1998
+ of gender or anything?
1999
+
2000
+ 0:56:20.755 --> 0:56:23.560
2001
+ And can we test that though?
2002
+
2003
+ 0:56:30.330 --> 0:56:34.255
2004
+ Yes, and then the other way of describing
2005
+ how this can be done.
2006
+
2007
+ 0:56:34.574 --> 0:56:52.021
2008
+ And then originally mainly for a rule based
2009
+ system that can be used for a lot of scenarios.
2010
+
2011
+ 0:56:52.352 --> 0:57:04.135
2012
+ In real ways, the first world has really direct
2013
+ translation systems that work for related languages.
2014
+
2015
+ 0:57:04.135 --> 0:57:11.367
2016
+ You mainly look at each word and replace the
2017
+ word by the one.
2018
+
2019
+ 0:57:11.631 --> 0:57:22.642
2020
+ Another idea is that you first do some type
2021
+ of animus on the source side, so for example
2022
+
2023
+ 0:57:22.642 --> 0:57:28.952
2024
+ you can create what is referred to as a path
2025
+ tree.
2026
+
2027
+ 0:57:30.150 --> 0:57:36.290
2028
+ Or you can instead, and that is what is called
2029
+ the lingua face approach.
2030
+
2031
+ 0:57:36.290 --> 0:57:44.027
2032
+ You take the short sentence and parse it into
2033
+ a semantic representation, which is hopefully
2034
+
2035
+ 0:57:44.027 --> 0:57:44.448
2036
+ the.
2037
+
2038
+ 0:57:44.384 --> 0:57:50.100
2039
+ Only of the meaning of what is said and then
2040
+ you can generate it to any other language because
2041
+
2042
+ 0:57:50.100 --> 0:57:55.335
2043
+ it has a meaning and then you can need a part
2044
+ generation which can generate all other.
2045
+
2046
+ 0:57:57.077 --> 0:58:09.248
2047
+ The idea is somewhat nice to have this type
2048
+ of interlingua, general representation of all
2049
+
2050
+ 0:58:09.248 --> 0:58:17.092
2051
+ meanings, and they always translate into the
2052
+ interlingua.
2053
+
2054
+ 0:58:17.177 --> 0:58:19.189
2055
+ A Little World and It's Been Somewhere.
2056
+
2057
+ 0:58:20.580 --> 0:58:26.684
2058
+ It shouldn't be a natural language because
2059
+ it shouldn't have ambiguities so that's a big
2060
+
2061
+ 0:58:26.684 --> 0:58:32.995
2062
+ difference so the story and the tiger language
2063
+ have ambiguities so the idea is they do some
2064
+
2065
+ 0:58:32.995 --> 0:58:39.648
2066
+ semantic representation or what does it mean
2067
+ and so on and therefore it's very easy to generate.
2068
+
2069
+ 0:58:41.962 --> 0:58:45.176
2070
+ However, that is a challenge that this really
2071
+ exists.
2072
+
2073
+ 0:58:45.176 --> 0:58:48.628
2074
+ You cannot define the language for anything
2075
+ in the world.
2076
+
2077
+ 0:58:49.249 --> 0:58:56.867
2078
+ And that's why the Lingo-based approach typically
2079
+ worked for small domains to do hotel reservation,
2080
+
2081
+ 0:58:56.867 --> 0:59:00.676
2082
+ but if you want to define the Lingo for anything.
2083
+
2084
+ 0:59:01.061 --> 0:59:07.961
2085
+ There have been approaches and semantics,
2086
+ but it's yeah, it's not really possible CR.
2087
+
2088
+ 0:59:07.961 --> 0:59:15.905
2089
+ So approaches to this because I mean a seasonal
2090
+ vector's face and bitch eyes and slaves everything
2091
+
2092
+ 0:59:15.905 --> 0:59:20.961
2093
+ that I mitonized that they all could end up
2094
+ in the same space.
2095
+
2096
+ 0:59:21.821 --> 0:59:24.936
2097
+ That is not the question.
2098
+
2099
+ 0:59:24.936 --> 0:59:35.957
2100
+ If you talk about neural networks, it's direct
2101
+ translation on the one you're putting in the
2102
+
2103
+ 0:59:35.957 --> 0:59:36.796
2104
+ input.
2105
+
2106
+ 0:59:36.957 --> 0:59:44.061
2107
+ And you can argue for both that we have been
2108
+ making this representation language agnostic
2109
+
2110
+ 0:59:44.061 --> 0:59:45.324
2111
+ or independent.
2112
+
2113
+ 0:59:47.227 --> 0:59:52.912
2114
+ Until now we were able to make it less language
2115
+ dependent but it's very hard to make it completely
2116
+
2117
+ 0:59:52.912 --> 0:59:54.175
2118
+ language independent.
2119
+
2120
+ 0:59:54.175 --> 0:59:59.286
2121
+ Maybe it's also not necessary and of course
2122
+ if there's again the problem there's not all
2123
+
2124
+ 0:59:59.286 --> 1:00:04.798
2125
+ information and the source and the target there
2126
+ is different types of information if you remove
2127
+
2128
+ 1:00:04.798 --> 1:00:05.602
2129
+ all language.
2130
+
2131
+ 1:00:05.585 --> 1:00:09.408
2132
+ Information might be that you have removed
2133
+ too many information.
2134
+
2135
+ 1:00:10.290 --> 1:00:15.280
2136
+ Talk about this and there's a very interesting
2137
+ research direction in which we are working
2138
+
2139
+ 1:00:15.280 --> 1:00:20.325
2140
+ on on the multilingual part because there is
2141
+ especially the case if we have several source
2142
+
2143
+ 1:00:20.325 --> 1:00:25.205
2144
+ languages, several type of languages who try
2145
+ to generate a representation in the middle
2146
+
2147
+ 1:00:25.205 --> 1:00:27.422
2148
+ which have the few language dependence.
2149
+
2150
+ 1:00:32.752 --> 1:00:46.173
2151
+ Yes, so for a direct base approach, so as
2152
+ said the first one is dictionary based approach.
2153
+
2154
+ 1:00:46.806 --> 1:00:48.805
2155
+ Replace some words with other words.
2156
+
2157
+ 1:00:48.805 --> 1:00:51.345
2158
+ Then you have exactly the same same structure.
2159
+
2160
+ 1:00:51.771 --> 1:00:55.334
2161
+ Other problems are one to one correspondence.
2162
+
2163
+ 1:00:55.334 --> 1:01:01.686
2164
+ Some phrases are expressed with several words
2165
+ in English, but one word in German.
2166
+
2167
+ 1:01:01.686 --> 1:01:03.777
2168
+ That's extremely the case.
2169
+
2170
+ 1:01:03.777 --> 1:01:07.805
2171
+ Just think about all our composites like the
2172
+ Donau.
2173
+
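A toy sketch of the dictionary-based direct approach just mentioned: replace each word by a dictionary entry and keep the structure. The mini-dictionary is invented for illustration, and the example also shows why the approach breaks down when there is no one-to-one correspondence, for instance with unseen compounds.

```python
# Word-by-word direct translation with a hand-made toy dictionary.
toy_dict = {"das": "the", "haus": "house", "ist": "is", "grün": "green"}

def direct_translate(sentence: str) -> str:
    out = []
    for word in sentence.lower().split():
        # unknown words (e.g. unseen compounds) simply stay untranslated
        out.append(toy_dict.get(word, word))
    return " ".join(out)

print(direct_translate("das Haus ist grün"))             # the house is green
print(direct_translate("das Containerschiff ist grün"))  # compound left as-is
```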
2174
+ 1:01:08.608 --> 1:01:18.787
2175
+ Which is used very often as been referred
2176
+ to as translation memory.
2177
+
2178
+ 1:01:18.787 --> 1:01:25.074
2179
+ It might seem very simple, but it's like.
2180
+
2181
+ 1:01:26.406 --> 1:01:33.570
2182
+ That means you might think of this not helpful
2183
+ at all, but you know think about translating.
2184
+
2185
+ 1:01:33.513 --> 1:01:38.701
2186
+ The law text is more like the interactive
2187
+ scenario for the human translator.
2188
+
2189
+ 1:01:38.701 --> 1:01:44.091
2190
+ In law text there is a lot of repetition and
2191
+ a lot of phrases occur very often.
2192
+
2193
+ 1:01:44.424 --> 1:01:55.412
2194
+ The translator has just a background of translation
2195
+ memory and retrieve all this translation.
2196
+
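A minimal sketch of such a translation memory, assuming that fuzzy matching over previously translated segments is enough for illustration (real CAT tools are far more elaborate); the stored sentence pairs are invented.

```python
# Retrieve the closest previously translated segment with difflib.
import difflib

memory = {
    "The contract is valid for one year.": "Der Vertrag gilt für ein Jahr.",
    "The contract can be terminated in writing.": "Der Vertrag kann schriftlich gekündigt werden.",
}

def retrieve(segment: str, threshold: float = 0.8):
    best, best_score = None, 0.0
    for src, tgt in memory.items():
        score = difflib.SequenceMatcher(None, segment, src).ratio()
        if score > best_score:
            best, best_score = (src, tgt), score
    # return a fuzzy match only if it is similar enough to be useful
    return best if best_score >= threshold else None

print(retrieve("The contract is valid for two years."))
```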
2197
+ 1:01:55.895 --> 1:02:07.147
2198
+ There is even another benefit in addition
2199
+ to less work: That is also precise in the way
2200
+
2201
+ 1:02:07.147 --> 1:02:19.842
2202
+ know this creates a small mistake in the North
2203
+ Carolina.
2204
+
2205
+ 1:02:20.300 --> 1:02:22.584
2206
+ By especially its like consistence,.
2207
+
2208
+ 1:02:23.243 --> 1:02:32.954
2209
+ If you once translate the sentence this way
2210
+ you again translate it and especially for some
2211
+
2212
+ 1:02:32.954 --> 1:02:36.903
2213
+ situations like a company they have.
2214
+
2215
+ 1:02:37.217 --> 1:02:47.695
2216
+ With this one, of course, you get more consistent
2217
+ translations.
2218
+
2219
+ 1:02:47.695 --> 1:02:56.700
2220
+ Each one is a style where phrases maybe are
2221
+ retrieved.
2222
+
2223
+ 1:03:01.861 --> 1:03:15.502
2224
+ Then we have these transfer based approaches
2225
+ where we have three steps: Analysts remain
2226
+
2227
+ 1:03:15.502 --> 1:03:25.975
2228
+ that you check one synthetic structure, so
2229
+ for example for morphology the basic.
2230
+
2231
+ 1:03:26.286 --> 1:03:37.277
2232
+ Then you will do a parstry or dependency structure
2233
+ that this is the adjective of the balm.
2234
+
2235
+ 1:03:37.917 --> 1:03:42.117
2236
+ Then you can do the transfer where you transfer
2237
+ the structure to the other.
2238
+
2239
+ 1:03:42.382 --> 1:03:46.633
2240
+ There you have to do, for example, it's re-ordering
2241
+ because the satisfaction is different.
2242
+
2243
+ 1:03:46.987 --> 1:03:50.088
2244
+ In German, the adjective is before the noun.
2245
+
2246
+ 1:03:50.088 --> 1:03:52.777
2247
+ In Spanish, it's the other way around.
2248
+
2249
+ 1:03:52.777 --> 1:03:59.256
2250
+ You have first found and then that it's nice
2251
+ and these types of rehonoring can be done there.
2252
+
2253
+ 1:03:59.256 --> 1:04:04.633
2254
+ You might have to do other things like passive
2255
+ voice to exit voice and so on.
2256
+
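As a rough sketch of the three transfer steps (analysis, transfer, generation) for the single phenomenon discussed here, adjective–noun order: the lexicon and the one rule below are invented toys, not a real transfer grammar.

```python
# Analysis -> transfer -> generation for one reordering rule (ADJ NOUN -> NOUN ADJ).
lexicon = {"the": ("la", "DET"), "green": ("verde", "ADJ"), "house": ("casa", "NOUN")}

def analyse(sentence):
    # analysis: assign a part-of-speech tag to every word
    return [(w, lexicon[w][1]) for w in sentence.lower().split()]

def transfer(tagged):
    # transfer: swap adjective and noun for the target-language order
    out, i = [], 0
    while i < len(tagged):
        if i + 1 < len(tagged) and tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            out += [tagged[i + 1], tagged[i]]
            i += 2
        else:
            out.append(tagged[i])
            i += 1
    return out

def generate(tagged):
    # generation: emit target word forms (agreement is ignored here)
    return " ".join(lexicon[w][0] for w, _ in tagged)

print(generate(transfer(analyse("the green house"))))  # la casa verde
```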
2257
+ 1:04:05.145 --> 1:04:14.074
2258
+ And in some type of lexical transverse it
2259
+ should like to me: And then you are doing the
2260
+
2261
+ 1:04:14.074 --> 1:04:16.014
2262
+ generation.
2263
+
2264
+ 1:04:16.014 --> 1:04:25.551
2265
+ Of course, you would do the agreement if it
2266
+ is accusative.
2267
+
2268
+ 1:04:25.551 --> 1:04:29.430
2269
+ What type of adjective?
2270
+
2271
+ 1:04:30.090 --> 1:04:32.048
2272
+ Is some kind of saving.
2273
+
2274
+ 1:04:32.048 --> 1:04:39.720
2275
+ Of course, here, because the analyze has only
2276
+ to be done in the source language, the transfer
2277
+
2278
+ 1:04:39.720 --> 1:04:41.679
2279
+ has to do on the pairs.
2280
+
2281
+ 1:04:41.679 --> 1:04:48.289
2282
+ But if you not look German, English and French
2283
+ through all directions, you only.
2284
+
2285
+ 1:04:53.273 --> 1:04:59.340
2286
+ Then there is an interlingua card which is
2287
+ really about the pure meaning, so you have
2288
+
2289
+ 1:04:59.340 --> 1:05:00.751
2290
+ a semantic grammar.
2291
+
2292
+ 1:05:01.061 --> 1:05:07.930
2293
+ To represent everything and one thing, one
2294
+ nice implication is more extreme than before.
2295
+
2296
+ 1:05:07.930 --> 1:05:15.032
2297
+ You don't have the transfer anymore, so if
2298
+ you add one language to it and you have already.
2299
+
2300
+ 1:05:15.515 --> 1:05:26.188
2301
+ If you add the one parting and the one generation
2302
+ phase, you can now translate from: So you need
2303
+
2304
+ 1:05:26.188 --> 1:05:40.172
2305
+ components which do the and components which
2306
+ do the generation, and then you can translate:
2307
+
2308
+ 1:05:41.001 --> 1:05:45.994
2309
+ You can also do other things like paraphrasing.
2310
+
2311
+ 1:05:45.994 --> 1:05:52.236
2312
+ You can translate back to the words language
2313
+ and hopefully.
2314
+
2315
+ 1:05:53.533 --> 1:06:05.013
2316
+ If you're sparkling trying to analyze it,
2317
+ it was also down a lot for ungrammetical speech
2318
+
2319
+ 1:06:05.013 --> 1:06:11.518
2320
+ because the idea is you're in this representation.
2321
+
2322
+ 1:06:12.552 --> 1:06:18.679
2323
+ Of course, it's very much work and it's only
2324
+ realistic for limited domains.
2325
+
2326
+ 1:06:20.000 --> 1:06:25.454
2327
+ Then we're, we're have the campus based approach.
2328
+
2329
+ 1:06:25.745 --> 1:06:32.486
2330
+ So we'll talk about a lot about peril layer
2331
+ and what is really peril data is what you know
2332
+
2333
+ 1:06:32.486 --> 1:06:34.634
2334
+ from the Rosetta stone page.
2335
+
2336
+ 1:06:34.634 --> 1:06:41.227
2337
+ That is, you have a sewer sentence and you
2338
+ have a target sentence and you know they need
2339
+
2340
+ 1:06:41.227 --> 1:06:42.856
2341
+ to watch translation.
2342
+
2343
+ 1:06:43.343 --> 1:06:46.651
2344
+ And that's important, so the alignment is
2345
+ typically at a sentence level.
2346
+
2347
+ 1:06:46.987 --> 1:06:50.252
2348
+ So you know, for each sentence what is a translation?
2349
+
2350
+ 1:06:50.252 --> 1:06:55.756
2351
+ Not always perfect because maybe there's two
2352
+ German sentences and one English, but at that
2353
+
2354
+ 1:06:55.756 --> 1:06:57.570
2355
+ level it's normally possible.
2356
+
2357
+ 1:06:57.570 --> 1:07:03.194
2358
+ At word level you can't do that because it's
2359
+ a very complicated thing and sense level that's
2360
+
2361
+ 1:07:03.194 --> 1:07:04.464
2362
+ normally a relative.
2363
+
2364
+ 1:07:05.986 --> 1:07:12.693
2365
+ Some type of machine learning which tries
2366
+ to learn dismapping between sentences on the
2367
+
2368
+ 1:07:12.693 --> 1:07:14.851
2369
+ English side and sentences.
2370
+
2371
+ 1:07:15.355 --> 1:07:22.088
2372
+ Of course this doesn't look like good mapping
2373
+ too complex but you try to find something like
2374
+
2375
+ 1:07:22.088 --> 1:07:28.894
2376
+ that where it's a very nice mapping so there's
2377
+ always the mixing things are met to each other
2378
+
2379
+ 1:07:28.894 --> 1:07:32.224
2380
+ and then if you have the English you can try.
2381
+
2382
+ 1:07:32.172 --> 1:07:36.900
2383
+ In another English sentence you can apply
2384
+ the same mannering and hopefully adhere to
2385
+
2386
+ 1:07:36.900 --> 1:07:38.514
2387
+ the right sentence in terms.
2388
+
2389
+ 1:07:38.918 --> 1:07:41.438
2390
+ The big problem here.
2391
+
2392
+ 1:07:41.438 --> 1:07:44.646
2393
+ How can we find this model?
2394
+
2395
+ 1:07:44.646 --> 1:07:50.144
2396
+ How to map English centers into German centers?
2397
+
2398
+ 1:07:54.374 --> 1:08:08.492
2399
+ How we do that is that we are trying to maximize
2400
+ the probability, so we have all the letterstone.
2401
+
2402
+ 1:08:09.109 --> 1:08:15.230
2403
+ Then we're having some type of model here
2404
+ which takes the Suez language and translates
2405
+
2406
+ 1:08:15.230 --> 1:08:16.426
2407
+ it for a target.
2408
+
2409
+ 1:08:16.896 --> 1:08:34.008
2410
+ And then we are in our translation, and we
2411
+ are adjusting our model in a way that the probability.
2412
+
2413
+ 1:08:34.554 --> 1:08:48.619
2414
+ How that is the idea behind it, how we are
2415
+ pushed now, implement that is part of the bottle.
2416
+
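In symbols, the training idea just described is usually written roughly as below: choose the model parameters that maximize the probability of the observed translations in the parallel data (the notation is assumed here, not taken from the slides).

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \sum_{(s,\,t)\,\in\,D} \log P(t \mid s;\,\theta)
```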
2417
+ 1:08:51.131 --> 1:09:01.809
2418
+ And then if we want to do translation, what
2419
+ we are doing is we are trying to find the translation.
2420
+
2421
+ 1:09:01.962 --> 1:09:06.297
2422
+ So we are scoring many possible translations.
2423
+
2424
+ 1:09:06.297 --> 1:09:12.046
2425
+ There is an infinite number of sentences that
2426
+ we are trying.
2427
+
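The search step described here, finding the most probable translation, can be sketched as an argmax over candidate outputs. The candidate list and the scoring function below are invented placeholders, since enumerating all possible sentences is of course impossible in practice.

```python
# Toy decoding: score a tiny, hand-picked candidate set and keep the best.
def model_probability(source: str, candidate: str) -> float:
    # placeholder for P(target | source) from a trained model
    return 1.0 / (1.0 + abs(len(source.split()) - len(candidate.split())))

def decode(source: str, candidates):
    return max(candidates, key=lambda c: model_probability(source, c))

print(decode("das Haus ist grün",
             ["the house is green", "green house", "the house is very green indeed"]))
```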
2428
+ 1:09:12.552 --> 1:09:18.191
2429
+ That may be a bit of a problem when we talk
2430
+ about confidence because we are always trying
2431
+
2432
+ 1:09:18.191 --> 1:09:19.882
2433
+ to find the most probable.
2434
+
2435
+ 1:09:20.440 --> 1:09:28.241
2436
+ And then, of course, we are not really having
2437
+ intrinsically the possibility to say, oh, I
2438
+
2439
+ 1:09:28.241 --> 1:09:31.015
2440
+ have no idea in this situation.
2441
+
2442
+ 1:09:31.015 --> 1:09:35.782
2443
+ But our general model is always about how
2444
+ can we find?
2445
+
2446
+ 1:09:40.440 --> 1:09:41.816
2447
+ Think It's.
2448
+
2449
+ 1:09:42.963 --> 1:09:44.242
2450
+ Get Four More Slides.
2451
+
2452
+ 1:09:46.686 --> 1:09:52.025
2453
+ So just high level, so for a proper space
2454
+ this one we won't cover again.
2455
+
2456
+ 1:09:52.352 --> 1:10:00.808
2457
+ Its example based machine translation was
2458
+ at the beginning of SMT.
2459
+
2460
+ 1:10:00.808 --> 1:10:08.254
2461
+ The idea is that you take subparts and combine
2462
+ them again.
2463
+
2464
+ 1:10:08.568 --> 1:10:11.569
2465
+ So this will not be really covered here.
2466
+
2467
+ 1:10:11.569 --> 1:10:15.228
2468
+ Then the statistical machine translation we
2469
+ will.
2470
+
2471
+ 1:10:17.077 --> 1:10:18.773
2472
+ Yeah, we will cover next week.
2473
+
2474
+ 1:10:19.079 --> 1:10:27.594
2475
+ The idea is there that we automatically now,
2476
+ if we have the sentence alignment, we automatically.
2477
+
2478
+ 1:10:27.527 --> 1:10:34.207
2479
+ In the sentences, and then we can learn statistical
2480
+ models of how probable words are translated
2481
+
2482
+ 1:10:34.207 --> 1:10:39.356
2483
+ to each other, and then the surge is that we
2484
+ create different hypotheses.
2485
+
2486
+ 1:10:39.356 --> 1:10:45.200
2487
+ This could be a translation of this part,
2488
+ this could be a translation of that part.
2489
+
2490
+ 1:10:45.200 --> 1:10:47.496
2491
+ We give a score to each of them.
2492
+
2493
+ 1:10:47.727 --> 1:10:51.584
2494
+ The statistical machine manual is where a
2495
+ lot of work is done.
2496
+
2497
+ 1:10:51.584 --> 1:10:54.155
2498
+ How can we score how good translation is?
2499
+
2500
+ 1:10:54.494 --> 1:11:04.764
2501
+ The words can recur this type of structure,
2502
+ how is it reordered, and then based on that
2503
+
2504
+ 1:11:04.764 --> 1:11:08.965
2505
+ we search for the best translation.
2506
+
2507
+ 1:11:12.252 --> 1:11:19.127
2508
+ Then yeah, that one what we'll cover most
2509
+ of the time is is a neural, a model where we
2510
+
2511
+ 1:11:19.127 --> 1:11:21.102
2512
+ can use neural networks.
2513
+
2514
+ 1:11:21.102 --> 1:11:27.187
2515
+ The nice thing is between everything together
2516
+ before we get some compliment.
2517
+
2518
+ 1:11:27.187 --> 1:11:30.269
2519
+ Each of them is trained independently.
2520
+
2521
+ 1:11:30.210 --> 1:11:34.349
2522
+ Which of course has a disadvantage that they
2523
+ might not best work together.
2524
+
2525
+ 1:11:34.694 --> 1:11:36.601
2526
+ Here everything is trained together.
2527
+
2528
+ 1:11:36.601 --> 1:11:39.230
2529
+ The continuous representation will look into
2530
+ that.
2531
+
2532
+ 1:11:39.339 --> 1:11:41.846
2533
+ That's very helpful soft.
2534
+
2535
+ 1:11:41.846 --> 1:11:50.426
2536
+ We then neonetworks are able to learn somehow
2537
+ the relation between words and that's very
2538
+
2539
+ 1:11:50.426 --> 1:11:57.753
2540
+ helpful because then we can more easily deal
2541
+ with words which didn't occur.
2542
+
2543
+ 1:12:00.000 --> 1:12:05.240
2544
+ One thing just to correlate that to interlingua
2545
+ based.
2546
+
2547
+ 1:12:05.345 --> 1:12:07.646
2548
+ So we have this as an actual language.
2549
+
2550
+ 1:12:07.627 --> 1:12:11.705
2551
+ And if you do an interlingual based approach
2552
+ but don't take an artificial.
2553
+
2554
+ 1:12:11.731 --> 1:12:17.814
2555
+ With no ambiguities, but with a natural language
2556
+ that's referred to as pivot based in tea and
2557
+
2558
+ 1:12:17.814 --> 1:12:20.208
2559
+ can be done with all the approaches.
2560
+
2561
+ 1:12:20.208 --> 1:12:25.902
2562
+ So the ideas instead of directly translating
2563
+ from German to French, you first translate
2564
+
2565
+ 1:12:25.902 --> 1:12:29.073
2566
+ from German to English and then from English
2567
+ to.
2568
+
2569
+ 1:12:29.409 --> 1:12:40.954
2570
+ French where the big advantage is that you
2571
+ might have a lot more data for these two directions
2572
+
2573
+ 1:12:40.954 --> 1:12:43.384
2574
+ than you have here.
2575
+
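A minimal sketch of the pivot idea (German to French via English) as function composition; the two translate functions are dictionary stubs, not a real system, and they illustrate that errors of the first step are passed on to the second.

```python
# Pivot translation: source -> pivot -> target.
def translate_de_en(text: str) -> str:
    return {"guten morgen": "good morning"}.get(text.lower(), text)

def translate_en_fr(text: str) -> str:
    return {"good morning": "bonjour"}.get(text.lower(), text)

def pivot_translate(text: str) -> str:
    # any mistake made in the German->English step propagates to French
    return translate_en_fr(translate_de_en(text))

print(pivot_translate("Guten Morgen"))  # bonjour
```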
2576
+ 1:12:44.864 --> 1:12:54.666
2577
+ With this thank you and deserve more questions
2578
+ and a bit late I'm sorry and then I'll see
2579
+
2580
+ 1:12:54.666 --> 1:12:55.864
2581
+ you again.
2582
+
demo_data/lectures/Lecture-01-18.04.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7f95bffd5a310af38b1ee51daef47a0af905687cbee799c161515f743cb30d0c
3
+ size 103388000
demo_data/lectures/Lecture-02-20.04.2023/English.vtt ADDED
@@ -0,0 +1,2984 @@
 
 
 
 
1
+ WEBVTT
2
+
3
+ 0:00:01.561 --> 0:00:05.186
4
+ Okay So Um.
5
+
6
+ 0:00:08.268 --> 0:00:17.655
7
+ Welcome to today's presentation of the second
8
+ class and machine translation where we'll today
9
+
10
+ 0:00:17.655 --> 0:00:25.044
11
+ do a bit of a specific topic and we'll talk
12
+ about linguistic backgrounds.
13
+
14
+ 0:00:26.226 --> 0:00:34.851
15
+ Will cover their three different parts of
16
+ the lecture.
17
+
18
+ 0:00:35.615 --> 0:00:42.538
19
+ We'll do first a very, very brief introduction
20
+ about linguistic background in a way that what
21
+
22
+ 0:00:42.538 --> 0:00:49.608
23
+ is language, what are ways of describing language,
24
+ what are the theories behind it, very, very
25
+
26
+ 0:00:49.608 --> 0:00:50.123
27
+ short.
28
+
29
+ 0:00:50.410 --> 0:00:57.669
30
+ I don't know, some of you have listened, I think,
31
+ to NLP in the last semester or so.
32
+
33
+ 0:00:58.598 --> 0:01:02.553
34
+ So there we did a lot longer explanation.
35
+
36
+ 0:01:02.553 --> 0:01:08.862
37
+ Here is just because we are not talking about
38
+ machine translation.
39
+
40
+ 0:01:09.109 --> 0:01:15.461
41
+ So it's really focused on the parts which
42
+ are important when we talk about machine translation.
43
+
44
+ 0:01:15.755 --> 0:01:19.377
45
+ Though for everybody who has listened to that
46
+ already, it's a bit of a repetition.
47
+
48
+ 0:01:19.377 --> 0:01:19.683
49
+ Maybe.
50
+
51
+ 0:01:19.980 --> 0:01:23.415
52
+ But it's really trying to look.
53
+
54
+ 0:01:23.415 --> 0:01:31.358
55
+ These are properties of languages and how
56
+ can they influence translation.
57
+
58
+ 0:01:31.671 --> 0:01:38.928
59
+ We'll use that in the second part to discuss
60
+ why is machine translation more from what we
61
+
62
+ 0:01:38.928 --> 0:01:40.621
63
+ know about language.
64
+
65
+ 0:01:40.940 --> 0:01:47.044
66
+ We will see that I mean there's two main things
67
+ is that the language might express ideas and
68
+
69
+ 0:01:47.044 --> 0:01:53.279
70
+ information differently, and if they are expressed
71
+ differently in different languages we have to
72
+
73
+ 0:01:53.279 --> 0:01:54.920
74
+ do somehow the transfer.
75
+
76
+ 0:01:55.135 --> 0:02:02.771
77
+ And it's not purely that we know there's words
78
+ used for it, but it's not that simple and very
79
+
80
+ 0:02:02.771 --> 0:02:03.664
81
+ different.
82
+
83
+ 0:02:04.084 --> 0:02:10.088
84
+ And the other problem we mentioned last time
85
+ about biases is that there's not always the
86
+
87
+ 0:02:10.088 --> 0:02:12.179
88
+ same amount of information in.
89
+
90
+ 0:02:12.592 --> 0:02:18.206
91
+ So it can be that there's some more information
92
+ in the one or you can't express that few information
93
+
94
+ 0:02:18.206 --> 0:02:19.039
95
+ on the target.
96
+
97
+ 0:02:19.039 --> 0:02:24.264
98
+ We had that also, for example, with the example
99
+ with the rice plant: in German, we would just
100
+
101
+ 0:02:24.264 --> 0:02:24.820
102
+ say rice.
103
+
104
+ 0:02:24.904 --> 0:02:33.178
105
+ Or in English, while in other languages you
106
+ have to distinguish between the rice plant and rice
107
+
108
+ 0:02:33.178 --> 0:02:33.724
109
+ as a.
110
+
111
+ 0:02:34.194 --> 0:02:40.446
112
+ And then it's not always possible to directly
113
+ infer this on the surface.
114
+
115
+ 0:02:41.781 --> 0:02:48.501
116
+ And if we make it to the last point otherwise
117
+ we'll do that next Tuesday or we'll partly
118
+
119
+ 0:02:48.501 --> 0:02:55.447
120
+ do it only here is like we'll describe briefly
121
+ the three main approaches on a rule based so
122
+
123
+ 0:02:55.447 --> 0:02:59.675
124
+ linguistic motivated ways of doing machine
125
+ translation.
126
+
127
+ 0:02:59.779 --> 0:03:03.680
128
+ We mentioned them last time like the direct
129
+ translation.
130
+
131
+ 0:03:03.680 --> 0:03:10.318
132
+ The translation by transfer the lingua interlingua
133
+ bass will do that a bit more in detail today.
134
+
135
+ 0:03:10.590 --> 0:03:27.400
136
+ But very briefly because this is not a focus
137
+ of this class and then next week because.
138
+
139
+ 0:03:29.569 --> 0:03:31.757
140
+ Why do we think this is important?
141
+
142
+ 0:03:31.757 --> 0:03:37.259
143
+ On the one hand, of course, we are dealing
144
+ with natural language, so therefore it might
145
+
146
+ 0:03:37.259 --> 0:03:43.074
147
+ be good to spend a bit of time in understanding
148
+ what we are really dealing with because this
149
+
150
+ 0:03:43.074 --> 0:03:45.387
151
+ is challenging these other problems.
152
+
153
+ 0:03:45.785 --> 0:03:50.890
154
+ And on the other hand, this was the first
155
+ way of how we're doing machine translation.
156
+
157
+ 0:03:51.271 --> 0:04:01.520
158
+ Therefore, it's interesting to understand
159
+ what was the idea behind that and also to later
160
+
161
+ 0:04:01.520 --> 0:04:08.922
162
+ see what is done differently and to understand
163
+ when some models.
164
+
165
+ 0:04:13.453 --> 0:04:20.213
166
+ When we're talking about linguistics, we can
167
+ of course do that on different levels and there's
168
+
169
+ 0:04:20.213 --> 0:04:21.352
170
+ different ways.
171
+
172
+ 0:04:21.521 --> 0:04:26.841
173
+ On the right side here you are seeing the
174
+ basic levels of linguistics.
175
+
176
+ 0:04:27.007 --> 0:04:31.431
177
+ So we have at the bottom the phonetics and
178
+ phonology.
179
+
180
+ 0:04:31.431 --> 0:04:38.477
181
+ Phonetics we will not cover this year because we
182
+ are mainly focusing on text input where we
183
+
184
+ 0:04:38.477 --> 0:04:42.163
185
+ are directly having characters and then words.
186
+
187
+ 0:04:42.642 --> 0:04:52.646
188
+ Then what we touch today, at least mention
189
+ what it is, is a morphology which is the first
190
+
191
+ 0:04:52.646 --> 0:04:53.424
192
+ level.
193
+
194
+ 0:04:53.833 --> 0:04:59.654
195
+ Already mentioned it a bit on Tuesday that
196
+ of course there are some languages where this
197
+
198
+ 0:04:59.654 --> 0:05:05.343
199
+ is very, very basic and there is not really
200
+ a lot of rules of how you can build words.
201
+
202
+ 0:05:05.343 --> 0:05:11.099
203
+ But since I assume you all have some basic
204
+ knowledge of German there is like a lot more
205
+
206
+ 0:05:11.099 --> 0:05:12.537
207
+ challenges than that.
208
+
209
+ 0:05:13.473 --> 0:05:20.030
210
+ You know, maybe if you're a native speaker
211
+ that's quite easy and everything is clear,
212
+
213
+ 0:05:20.030 --> 0:05:26.969
214
+ but if you have to learn it like the endings
215
+ of a word, we are famous for doing Komposita
216
+
217
+ 0:05:26.969 --> 0:05:29.103
218
+ and putting words together.
219
+
220
+ 0:05:29.103 --> 0:05:31.467
221
+ So this is like the first lab.
222
+
223
+ 0:05:32.332 --> 0:05:40.268
224
+ Then we have the syntax, which is both on
225
+ the word and on the sentence level, and that's
226
+
227
+ 0:05:40.268 --> 0:05:43.567
228
+ about the structure of the sentence.
229
+
230
+ 0:05:43.567 --> 0:05:46.955
231
+ What are the functions of some words?
232
+
233
+ 0:05:47.127 --> 0:05:51.757
234
+ You might remember part of speech tags from
235
+ your high school time.
236
+
237
+ 0:05:51.757 --> 0:05:57.481
238
+ There is like noun and adjective and things
239
+ like that and this is something helpful.
240
+
241
+ 0:05:57.737 --> 0:06:03.933
242
+ Just imagine in the beginning that it was
243
+ not only used for rule based but for statistical
244
+
245
+ 0:06:03.933 --> 0:06:10.538
246
+ machine translation, for example, the reordering
247
+ between languages was quite a challenging task.
248
+
249
+ 0:06:10.770 --> 0:06:16.330
250
+ Especially if you have long range reorderings
251
+ and their part of speech information is very
252
+
253
+ 0:06:16.330 --> 0:06:16.880
254
+ helpful.
255
+
256
+ 0:06:16.880 --> 0:06:20.301
257
+ You know, in German you have to move
258
+ the verb.
259
+
260
+ 0:06:20.260 --> 0:06:26.599
261
+ To the second position, if you have Spanish
262
+ you have to change the noun and the adjective
263
+
264
+ 0:06:26.599 --> 0:06:30.120
265
+ so information from part of speech could be
266
+ very.
267
+
268
+ 0:06:30.410 --> 0:06:38.621
269
+ Then you have a syntax base structure where
270
+ you have a full syntax tree in the beginning
271
+
272
+ 0:06:38.621 --> 0:06:43.695
273
+ and then it came into statistical machine translation.
274
+
275
+ 0:06:44.224 --> 0:06:50.930
276
+ And it got more and more important for statistical
277
+ machine translation that you are really trying
278
+
279
+ 0:06:50.930 --> 0:06:53.461
280
+ to model the whole syntax tree of a.
281
+
282
+ 0:06:53.413 --> 0:06:57.574
283
+ Sentence in order to better match how to do
284
+ that in UM.
285
+
286
+ 0:06:57.574 --> 0:07:04.335
287
+ In the target language, a bit yeah, the syntax
288
+ based statistical machine translation had a
289
+
290
+ 0:07:04.335 --> 0:07:05.896
291
+ bit of a problem.
292
+
293
+ 0:07:05.896 --> 0:07:08.422
294
+ It got better and better and was.
295
+
296
+ 0:07:08.368 --> 0:07:13.349
297
+ Just on the way of getting better in some
298
+ languages than traditional statistical models.
299
+
300
+ 0:07:13.349 --> 0:07:18.219
301
+ But then the neural models came up and they
302
+ were just so much better in modelling that
303
+
304
+ 0:07:18.219 --> 0:07:19.115
305
+ all implicitly.
306
+
307
+ 0:07:19.339 --> 0:07:23.847
308
+ So that they are never were used in practice
309
+ so much.
310
+
311
+ 0:07:24.304 --> 0:07:34.262
312
+ And then we'll talk about the semantics, so
313
+ what is the meaning of the words?
314
+
315
+ 0:07:34.262 --> 0:07:40.007
316
+ We saw last time that words can have different meanings.
317
+
318
+ 0:07:40.260 --> 0:07:46.033
319
+ And yeah, how you represent meaning of cause
320
+ is very challenging.
321
+
322
+ 0:07:45.966 --> 0:07:53.043
323
+ And normally that like formalizing this is
324
+ typically done in quite limited domains because
325
+
326
+ 0:07:53.043 --> 0:08:00.043
327
+ like doing that for like all possible words
328
+ has not really been achieved yet and is very
329
+
330
+ 0:08:00.043 --> 0:08:00.898
331
+ challenging.
332
+
333
+ 0:08:02.882 --> 0:08:09.436
334
+ About pragmatics, so pragmatics is then what
335
+ is meaning in the context of the current situation.
336
+
337
+ 0:08:09.789 --> 0:08:16.202
338
+ So one famous example is there, for example,
339
+ if you say the light is red.
340
+
341
+ 0:08:16.716 --> 0:08:21.795
342
+ The traffic light is red so that typically
343
+ not you don't want to tell the other person
344
+
345
+ 0:08:21.795 --> 0:08:27.458
346
+ if you're sitting in a car that it's surprising
347
+ oh the light is red but typically you're meaning
348
+
349
+ 0:08:27.458 --> 0:08:30.668
350
+ okay you should stop and you shouldn't pass
351
+ the light.
352
+
353
+ 0:08:30.850 --> 0:08:40.994
354
+ So the meaning of this sentence, the light,
355
+ is red in the context of sitting in the car.
356
+
357
+ 0:08:42.762 --> 0:08:51.080
358
+ So let's start with the morphology so that
359
+ with the things we are starting there and one
360
+
361
+ 0:08:51.080 --> 0:08:53.977
362
+ easy and first thing is there.
363
+
364
+ 0:08:53.977 --> 0:09:02.575
365
+ Of course we have to split the sentence into
366
+ words or join characters so that we have words.
367
+
368
+ 0:09:02.942 --> 0:09:09.017
369
+ Because in most of our work we'll deal like
370
+ machine translation with some type of words.
371
+
372
+ 0:09:09.449 --> 0:09:15.970
373
+ In neural machine translation, people are working
374
+ also on character-based and subword units, but a
375
+
376
+ 0:09:15.970 --> 0:09:20.772
377
+ basic unique words of the sentence is a very
378
+ important first step.
379
+
380
+ 0:09:21.421 --> 0:09:32.379
381
+ And for many languages that is quite simple
382
+ in German, it's not that hard to determine
383
+
384
+ 0:09:32.379 --> 0:09:33.639
385
+ the word.
386
+
387
+ 0:09:34.234 --> 0:09:46.265
388
+ In tokenization, the main challenge is if
389
+ we are doing corpus-based methods that we are
390
+
391
+ 0:09:46.265 --> 0:09:50.366
392
+ also dealing as normal words.
393
+
394
+ 0:09:50.770 --> 0:10:06.115
395
+ And there of course it's getting a bit more
396
+ challenging.
397
+
398
+ 0:10:13.173 --> 0:10:17.426
399
+ So that is maybe the main thing where, for
400
+ example, in German, if you think of German
401
+
402
+ 0:10:17.426 --> 0:10:19.528
403
+ tokenization, it's easy to get every word.
404
+
405
+ 0:10:19.779 --> 0:10:26.159
406
+ You split it at a space, but then you would
407
+ have the dots at the end join to the last word,
408
+
409
+ 0:10:26.159 --> 0:10:30.666
410
+ and of course that you don't want because it's
411
+ a different word.
412
+
413
+ 0:10:30.666 --> 0:10:37.046
414
+ The last word would not be go, but go dot,
415
+ but what you can do is split up the dots always.
416
+
417
+ 0:10:37.677 --> 0:10:45.390
418
+ Can you really do that always or it might
419
+ be sometimes better to keep the dot as a point?
420
+
421
+ 0:10:47.807 --> 0:10:51.001
422
+ For example, email addresses or abbreviations
423
+ here.
424
+
425
+ 0:10:51.001 --> 0:10:56.284
426
+ For example, doctor, maybe it doesn't make
427
+ sense to split up the dot because then you
428
+
429
+ 0:10:56.284 --> 0:11:01.382
430
+ would assume oh, here starts a new sentence,
431
+ but it's just the DR dot from doctor.
432
+
433
+ 0:11:01.721 --> 0:11:08.797
434
+ Or if you have numbers like he's a seventh
435
+ person like the zipter, then you don't want
436
+
437
+ 0:11:08.797 --> 0:11:09.610
438
+ to split.
439
+
440
+ 0:11:09.669 --> 0:11:15.333
441
+ So there are some things where it could be
442
+ a bit more difficult, but it's not really challenging.
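
To make the dot-splitting rule described here concrete, below is a minimal Python sketch. It is not the tokenizer used in the lecture; the abbreviation list, the ordinal rule and the example sentence are made-up illustrations of the idea of cutting off sentence-final periods while keeping abbreviations like "Dr." and ordinals like "7." intact.

```python
import re

# Hypothetical, tiny abbreviation list purely for illustration.
ABBREVIATIONS = {"Dr.", "Prof.", "e.g.", "etc."}

def tokenize(sentence: str) -> list[str]:
    tokens = []
    for tok in sentence.split():
        # Split a final period off, unless the token is a known abbreviation
        # or an ordinal number such as "7.".
        if tok.endswith(".") and tok not in ABBREVIATIONS and not re.fullmatch(r"\d+\.", tok):
            tokens.extend([tok[:-1], "."])   # "May." -> "May", "."
        else:
            tokens.append(tok)               # keep "Dr." and "7." as one token
    return tokens

print(tokenize("Dr. Smith arrives on the 7. of May."))
# ['Dr.', 'Smith', 'arrives', 'on', 'the', '7.', 'of', 'May', '.']
```
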
443
+
444
+ 0:11:16.796 --> 0:11:23.318
445
+ In other languages it's getting a lot more
446
+ challenging, especially in Asian languages
447
+
448
+ 0:11:23.318 --> 0:11:26.882
449
+ where often there are no spaces between words.
450
+
451
+ 0:11:27.147 --> 0:11:32.775
452
+ So you just have the sequence of characters.
453
+
454
+ 0:11:32.775 --> 0:11:38.403
455
+ The quick brown fox jumps over the lazy dog.
456
+
457
+ 0:11:38.999 --> 0:11:44.569
458
+ And then it still might be helpful to work
459
+ on something like words.
460
+
461
+ 0:11:44.569 --> 0:11:48.009
462
+ Then you need to have a bit more complex.
463
+
464
+ 0:11:48.328 --> 0:11:55.782
465
+ And here you see we are again having our typical
466
+ problem.
467
+
468
+ 0:11:55.782 --> 0:12:00.408
469
+ That means that there is ambiguity.
470
+
471
+ 0:12:00.600 --> 0:12:02.104
472
+ So you're seeing here.
473
+
474
+ 0:12:02.104 --> 0:12:08.056
475
+ We have exactly the same sequence of characters
476
+ or here, but depending on how we split it,
477
+
478
+ 0:12:08.056 --> 0:12:12.437
479
+ it means he is your servant or he is the one
480
+ who used your things.
481
+
482
+ 0:12:12.437 --> 0:12:15.380
483
+ Or here we have round eyes and take the air.
484
+
485
+ 0:12:15.895 --> 0:12:22.953
486
+ So then of course yeah this type of tokenization
487
+ gets more important because you could introduce
488
+
489
+ 0:12:22.953 --> 0:12:27.756
490
+ already errors and you can imagine if you're
491
+ doing it here wrong.
492
+
493
+ 0:12:27.756 --> 0:12:34.086
494
+ If you once do a wrong decision it's quite
495
+ difficult to recover from a wrong decision.
496
+
497
+ 0:12:34.634 --> 0:12:47.088
498
+ And so in these cases looking about how we're
499
+ doing tokenization is an important issue.
500
+
501
+ 0:12:47.127 --> 0:12:54.424
502
+ And then it might be helpful to do things
503
+ like character-based models where we treat each
504
+
505
+ 0:12:54.424 --> 0:12:56.228
506
+ character as a symbol.
507
+
508
+ 0:12:56.228 --> 0:13:01.803
509
+ For example, do this decision later
510
+ or never really do this?
511
+
512
+ 0:13:06.306 --> 0:13:12.033
513
+ The other thing is that if we have words we
514
+ might, it might not be the optimal unit to
515
+
516
+ 0:13:12.033 --> 0:13:18.155
517
+ work with because it can be that we should
518
+ look into the internal structure of words because
519
+
520
+ 0:13:18.155 --> 0:13:20.986
521
+ if we have a morphological rich language,.
522
+
523
+ 0:13:21.141 --> 0:13:27.100
524
+ That means we have a lot of different types
525
+ of words, and if you have a lot of many different
526
+
527
+ 0:13:27.100 --> 0:13:32.552
528
+ types of words, it on the other hand means
529
+ of course each of these words we have seen
530
+
531
+ 0:13:32.552 --> 0:13:33.757
532
+ very infrequently.
533
+
534
+ 0:13:33.793 --> 0:13:39.681
535
+ So if you only have ten words and you have
536
+ a large corpus, each word occurs more often.
537
+
538
+ 0:13:39.681 --> 0:13:45.301
539
+ If you have three million different words,
540
+ then each of them will occur less often.
541
+
542
+ 0:13:45.301 --> 0:13:51.055
543
+ Hopefully you know, from machine learning,
544
+ it's helpful if you have seen each example
545
+
546
+ 0:13:51.055 --> 0:13:51.858
547
+ very often.
548
+
549
+ 0:13:52.552 --> 0:13:54.524
550
+ And so why does it help?
551
+
552
+ 0:13:54.524 --> 0:13:56.495
553
+ Why does it help happen?
554
+
555
+ 0:13:56.495 --> 0:14:02.410
556
+ Yeah, in some languages we have quite a complex
557
+ information inside a word.
558
+
559
+ 0:14:02.410 --> 0:14:09.271
560
+ So here's a word from Finnish, talossanikinko,
561
+ or something like that, and it means in my
562
+
563
+ 0:14:09.271 --> 0:14:10.769
564
+ house to question.
565
+
566
+ 0:14:11.491 --> 0:14:15.690
567
+ So you have all these information attached
568
+ to the word.
569
+
570
+ 0:14:16.036 --> 0:14:20.326
571
+ And that of course in extreme case that's
572
+ why typically, for example, Finnish is the
573
+
574
+ 0:14:20.326 --> 0:14:20.831
575
+ language.
576
+
577
+ 0:14:20.820 --> 0:14:26.725
578
+ Where machine translation quality is less
579
+ good because generating all these different
580
+
581
+ 0:14:26.725 --> 0:14:33.110
582
+ morphological variants is a challenge and
583
+ the additional challenge is that Finnish is
584
+
585
+ 0:14:33.110 --> 0:14:39.564
586
+ not really low resource, but in low resource
587
+ languages you quite often have more difficult
588
+
589
+ 0:14:39.564 --> 0:14:40.388
590
+ morphology.
591
+
592
+ 0:14:40.440 --> 0:14:43.949
593
+ Mean English is an example of a relatively
594
+ easy one.
595
+
596
+ 0:14:46.066 --> 0:14:54.230
597
+ And so in general we can say that words are
598
+ composed of morphemes, and morphemes are
599
+
600
+ 0:14:54.230 --> 0:15:03.069
601
+ the smallest meaning carrying unit, so normally
602
+ it means: all morphemes should have some type
603
+
604
+ 0:15:03.069 --> 0:15:04.218
605
+ of meaning.
606
+
607
+ 0:15:04.218 --> 0:15:09.004
608
+ For example, here does not really have a meaning.
609
+
610
+ 0:15:09.289 --> 0:15:12.005
611
+ Bian has some type of meaning.
612
+
613
+ 0:15:12.005 --> 0:15:14.371
614
+ It's changing the meaning.
615
+
616
+ 0:15:14.371 --> 0:15:21.468
617
+ The 'ness' has the meaning that it's making
618
+ a noun out of an adjective, and happy.
619
+
620
+ 0:15:21.701 --> 0:15:31.215
621
+ So each of these parts conveys some meaning,
622
+ but you cannot split them further up and have
623
+
624
+ 0:15:31.215 --> 0:15:32.156
625
+ somehow.
626
+
627
+ 0:15:32.312 --> 0:15:36.589
628
+ You see that of course a little bit more is
629
+ happening.
630
+
631
+ 0:15:36.589 --> 0:15:43.511
632
+ Typically the y is going into an i so there
633
+ can be some variation, but these are typical
634
+
635
+ 0:15:43.511 --> 0:15:46.544
636
+ examples of what we have as morphemes.
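
As a toy illustration of the morpheme splitting discussed here (un + happy + ness, with the y-to-i spelling change), the sketch below only knows a handful of affixes and one spelling rule. It is an assumption-laden illustration, not a real morphological analyzer.

```python
# Toy affix lists, chosen only for the "unhappiness" example.
PREFIXES = ["un", "re", "dis"]
SUFFIXES = ["ness", "ful", "ing", "ed", "s"]

def split_morphemes(word: str) -> list[str]:
    parts = []
    for p in PREFIXES:
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[: -len(s)]
            break
    # Undo the y -> i spelling rule: "happi" -> "happy".
    if word.endswith("i") and suffix is not None:
        word = word[:-1] + "y"
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

print(split_morphemes("unhappiness"))   # ['un', 'happy', 'ness']
```
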
637
+
638
+ 0:16:02.963 --> 0:16:08.804
639
+ That is, of course, a problem and that's the
640
+ question why how you do your splitting.
641
+
642
+ 0:16:08.804 --> 0:16:15.057
643
+ But that problem we have anyway always because
644
+ even full words can have different meanings
645
+
646
+ 0:16:15.057 --> 0:16:17.806
647
+ depending on the context they're using.
648
+
649
+ 0:16:18.038 --> 0:16:24.328
650
+ So we always have to somewhat have a model
651
+ which can infer or represent the meaning of
652
+
653
+ 0:16:24.328 --> 0:16:25.557
654
+ the word in the.
655
+
656
+ 0:16:25.825 --> 0:16:30.917
657
+ But you are right that this problem might
658
+ get even more severe if you're splitting up.
659
+
660
+ 0:16:30.917 --> 0:16:36.126
661
+ Therefore, it might not be the best to go
662
+ for the very extreme and represent each letter
663
+
664
+ 0:16:36.126 --> 0:16:41.920
665
+ and have a model which is only on letters because,
666
+ of course, a letter can have a lot of different
667
+
668
+ 0:16:41.920 --> 0:16:44.202
669
+ meanings depending on where it's used.
670
+
671
+ 0:16:44.524 --> 0:16:50.061
672
+ And yeah, there is no right solution like
673
+ what is the right splitting.
674
+
675
+ 0:16:50.061 --> 0:16:56.613
676
+ It depends on the language and the application
677
+ on the amount of data you're having.
678
+
679
+ 0:16:56.613 --> 0:17:01.058
680
+ For example, typically it means the fewer
681
+ data you have.
682
+
683
+ 0:17:01.301 --> 0:17:12.351
684
+ The more splitting you should do, if you have
685
+ more data, then you can be better distinguish.
686
+
687
+ 0:17:13.653 --> 0:17:19.065
688
+ Then there are different types of morphemes:
689
+ So we typically have one stem morpheme: It's
690
+
691
+ 0:17:19.065 --> 0:17:21.746
692
+ like house or Tisch, so the main meaning.
693
+
694
+ 0:17:21.941 --> 0:17:29.131
695
+ And then you can have functional or bound
696
+ morphemes, which can be prefix,
697
+
698
+ 0:17:29.131 --> 0:17:34.115
699
+ suffix, infix or circumfix so it can be before
700
+ can be after.
701
+
702
+ 0:17:34.114 --> 0:17:39.416
703
+ It can be inside or it can be around it, something
704
+ like 'gekauft' there.
705
+
706
+ 0:17:39.416 --> 0:17:45.736
707
+ Typically you would say that it's not like
708
+ two morphemes, 'ge' and 't', because they both
709
+
710
+ 0:17:45.736 --> 0:17:50.603
711
+ describe the function, but together 'ge' and 't'
712
+ are marking the cough.
713
+
714
+ 0:17:53.733 --> 0:18:01.209
715
+ For what are people using them you can use
716
+ them for inflection to describe something like
717
+
718
+ 0:18:01.209 --> 0:18:03.286
719
+ tense count person case.
720
+
721
+ 0:18:04.604 --> 0:18:09.238
722
+ That is yeah, if you know German, this is
723
+ commonly used in German.
724
+
725
+ 0:18:10.991 --> 0:18:16.749
726
+ But of course there is a lot more complicated
727
+ things: I think in some languages it also.
728
+
729
+ 0:18:16.749 --> 0:18:21.431
730
+ I mean, in German the count and person only
731
+ depend on the subject.
732
+
733
+ 0:18:21.431 --> 0:18:27.650
734
+ For the word, for example, in other languages
735
+ it can also determine the first and on the
736
+
737
+ 0:18:27.650 --> 0:18:28.698
738
+ second object.
739
+
740
+ 0:18:28.908 --> 0:18:35.776
741
+ So that, like if you buy an apple or a
742
+ house, it's not only that the
743
+
744
+ 0:18:35.776 --> 0:18:43.435
745
+ kauft depends on me like in German, but
746
+ it can also depend on whether it's an apple
747
+
748
+ 0:18:43.435 --> 0:18:44.492
749
+ or a house.
750
+
751
+ 0:18:44.724 --> 0:18:48.305
752
+ And then of course you have an exploding number
753
+ of word forms.
754
+
755
+ 0:18:49.409 --> 0:19:04.731
756
+ Furthermore, it can be used to do derivations
757
+ so you can make other types of words from it.
758
+
759
+ 0:19:05.165 --> 0:19:06.254
760
+ And then yeah.
761
+
762
+ 0:19:06.254 --> 0:19:12.645
763
+ This is like creating new words by joining
764
+ them like rainbow waterproof but for example
765
+
766
+ 0:19:12.645 --> 0:19:19.254
767
+ in German like Einkaufswagen, eiskalt and
768
+ so on, where you can do that
769
+
770
+ 0:19:19.254 --> 0:19:22.014
771
+ with nouns and German adjectives and.
772
+
773
+ 0:19:22.282 --> 0:19:29.077
774
+ Then of course you might have additional challenges
775
+ like the Fugen-s where you have to add this s.
776
+
777
+ 0:19:32.452 --> 0:19:39.021
778
+ Yeah, then there is a yeah of course additional
779
+ special things.
780
+
781
+ 0:19:39.639 --> 0:19:48.537
782
+ You have to sometimes put extra stuff because
783
+ of phonology, so it's dig the plural, not plural.
784
+
785
+ 0:19:48.537 --> 0:19:56.508
786
+ The third person singular, as in English,
787
+ is normally s, but for 'goes', for example, it is
788
+
789
+ 0:19:56.508 --> 0:19:57.249
790
+ an 'es'.
791
+
792
+ 0:19:57.277 --> 0:20:04.321
793
+ In German you can also have other things that
794
+ like Mutter gets Mütter, so you're changing
795
+
796
+ 0:20:04.321 --> 0:20:11.758
797
+ the umlaut in order to express the plural and
798
+ in other languages for example the vowel harmony
799
+
800
+ 0:20:11.758 --> 0:20:17.315
801
+ where the vowels inside are changing depending
802
+ on which form you have.
803
+
804
+ 0:20:17.657 --> 0:20:23.793
805
+ Which makes things more difficult than splitting
806
+ a word into its part doesn't really work anymore.
807
+
808
+ 0:20:23.793 --> 0:20:28.070
809
+ So like for Mutter and Mütter, for example, that
810
+ is not really possible.
811
+
812
+ 0:20:28.348 --> 0:20:36.520
813
+ The nice thing is, of course, more like a
814
+ general thing, but often irregular things are
815
+
816
+ 0:20:36.520 --> 0:20:39.492
817
+ happening as words which occur.
818
+
819
+ 0:20:39.839 --> 0:20:52.177
820
+ So that you can have enough examples, while
821
+ the regular things you can do by some type
822
+
823
+ 0:20:52.177 --> 0:20:53.595
824
+ of rules.
825
+
826
+ 0:20:55.655 --> 0:20:57.326
827
+ Yeah, This Can Be Done.
828
+
829
+ 0:20:57.557 --> 0:21:02.849
830
+ So there are tasks on this: how to do automatic
831
+ inflection, how to analyze them.
832
+
833
+ 0:21:02.849 --> 0:21:04.548
834
+ So you give it a word to.
835
+
836
+ 0:21:04.548 --> 0:21:10.427
837
+ It's telling you what are the possible forms
838
+ of that, like how they are built, and so on.
839
+
840
+ 0:21:10.427 --> 0:21:15.654
841
+ And for at least the high-resource languages,
842
+ there are a lot of tools for that.
843
+
844
+ 0:21:15.654 --> 0:21:18.463
845
+ Of course, if you now want to do that for.
846
+
847
+ 0:21:18.558 --> 0:21:24.281
848
+ Some language which is very low resourced
849
+ might be very difficult and there might be
850
+
851
+ 0:21:24.281 --> 0:21:25.492
852
+ no tool for them.
853
+
854
+ 0:21:28.368 --> 0:21:37.652
855
+ Good before we are going for the next part
856
+ about part of speech, are there any questions
857
+
858
+ 0:21:37.652 --> 0:21:38.382
859
+ about?
860
+
861
+ 0:22:01.781 --> 0:22:03.187
862
+ Yeah, we'll come to that a bit.
863
+
864
+ 0:22:03.483 --> 0:22:09.108
865
+ So it's a very good question and difficult
866
+ and especially we'll see that later if you
867
+
868
+ 0:22:09.108 --> 0:22:14.994
869
+ just put in words it would be very bad because
870
+ words are put into neural networks just as
871
+
872
+ 0:22:14.994 --> 0:22:15.844
873
+ some digits.
874
+
875
+ 0:22:15.844 --> 0:22:21.534
876
+ Each word is mapped onto a digit and you
877
+ put it in so it doesn't really know any more
878
+
879
+ 0:22:21.534 --> 0:22:22.908
880
+ about the structure.
881
+
882
+ 0:22:23.543 --> 0:22:29.898
883
+ What we will see therefore the most successful
884
+ approach which is mostly done is a subword
885
+
886
+ 0:22:29.898 --> 0:22:34.730
887
+ unit where we split: But we will do this.
888
+
889
+ 0:22:34.730 --> 0:22:40.154
890
+ Don't know if you have been in advanced.
891
+
892
+ 0:22:40.154 --> 0:22:44.256
893
+ We'll cover this on a Tuesday.
894
+
895
+ 0:22:44.364 --> 0:22:52.316
896
+ So there is an algorithm called byte pair
897
+ encoding, which is about splitting words into
898
+
899
+ 0:22:52.316 --> 0:22:52.942
900
+ parts.
901
+
902
+ 0:22:53.293 --> 0:23:00.078
903
+ So it's doing the splitting of words but not
904
+ morphologically motivated but more based on
905
+
906
+ 0:23:00.078 --> 0:23:00.916
907
+ frequency.
908
+
909
+ 0:23:00.940 --> 0:23:11.312
910
+ However, it performs very good and that's
911
+ why it's used and there is a bit of correlation.
912
+
913
+ 0:23:11.312 --> 0:23:15.529
914
+ Sometimes they agree on count based.
915
+
916
+ 0:23:15.695 --> 0:23:20.709
917
+ So we're splitting words and we're splitting
918
+ especially words which are infrequent and that's
919
+
920
+ 0:23:20.709 --> 0:23:23.962
921
+ maybe a good motivation why that's good for
922
+ neural networks.
923
+
924
+ 0:23:23.962 --> 0:23:28.709
925
+ That means if you have seen a word very often
926
+ you don't need to split it and it's easier
927
+
928
+ 0:23:28.709 --> 0:23:30.043
929
+ to just process it fast.
930
+
931
+ 0:23:30.690 --> 0:23:39.218
932
+ While if you have seen the words infrequently,
933
+ it is good to split it into parts so it can
934
+
935
+ 0:23:39.218 --> 0:23:39.593
936
+ do.
937
+
938
+ 0:23:39.779 --> 0:23:47.729
939
+ So there is some way of doing it, but linguists
940
+ would say this is not a morphological analysis.
941
+
942
+ 0:23:47.729 --> 0:23:53.837
943
+ That is true, but we are splitting words into
944
+ parts if they are not seen.
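
Below is a minimal sketch of the byte pair encoding idea mentioned here: words start as character sequences and the most frequent adjacent symbol pair is merged repeatedly. The toy corpus counts and the number of merges are made-up values; real systems learn many thousands of merges on large corpora.

```python
from collections import Counter

def learn_bpe(word_counts: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = count
        vocab = new_vocab
    return merges

toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}   # invented counts
print(learn_bpe(toy_corpus, 10))
```

Frequent words end up as single symbols after enough merges, while rare words stay split into smaller pieces, which matches the motivation given in the lecture.
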
945
+
946
+ 0:23:59.699 --> 0:24:06.324
947
+ Yes, so another important thing about words
948
+ are the part of speech tags.
949
+
950
+ 0:24:06.324 --> 0:24:14.881
951
+ These are the common ones: noun, verb, adjective,
952
+ adverb, determiner, pronoun, preposition, and
953
+
954
+ 0:24:14.881 --> 0:24:16.077
955
+ conjunction.
956
+
957
+ 0:24:16.077 --> 0:24:26.880
958
+ There are some more: They are not the same
959
+ in all languages, but for example there is this
960
+
961
+ 0:24:26.880 --> 0:24:38.104
962
+ universal grammar which tries to do this type
963
+ of part of speech tags for many languages.
964
+
965
+ 0:24:38.258 --> 0:24:42.018
966
+ And then, of course, it's helping you for
967
+ generalization.
968
+
969
+ 0:24:42.018 --> 0:24:48.373
970
+ There are some language deals with verbs and
971
+ nouns, especially if you look at sentence structure.
972
+
973
+ 0:24:48.688 --> 0:24:55.332
974
+ And so if you know the part of speech tag
975
+ you can easily generalize and do get these
976
+
977
+ 0:24:55.332 --> 0:24:58.459
978
+ rules or apply these rules as you know.
979
+
980
+ 0:24:58.459 --> 0:25:02.680
981
+ The verb in English is always at the second
982
+ position.
983
+
984
+ 0:25:03.043 --> 0:25:10.084
985
+ So you know how to deal with verbs independently
986
+ of which words you are now really looking at.
987
+
988
+ 0:25:12.272 --> 0:25:18.551
989
+ And that again can be ambiguous.
990
+
991
+ 0:25:18.598 --> 0:25:27.171
992
+ So there are some words which can have several
993
+ part of speech tags.
994
+
995
+ 0:25:27.171 --> 0:25:38.686
996
+ Example are the word can, for example, which
997
+ can be the can of beans or can do something.
998
+
999
+ 0:25:38.959 --> 0:25:46.021
1000
+ Often is also in English related work.
1001
+
1002
+ 0:25:46.021 --> 0:25:55.256
1003
+ Access can be to excess or to access to something.
1004
+
1005
+ 0:25:56.836 --> 0:26:02.877
1006
+ Most words have only one single part of speech
1007
+ tag, but they are some where it's a bit more
1008
+
1009
+ 0:26:02.877 --> 0:26:03.731
1010
+ challenging.
1011
+
1012
+ 0:26:03.731 --> 0:26:09.640
1013
+ The nice thing is the ones which are in big
1014
+ are often more words, which occur more often,
1015
+
1016
+ 0:26:09.640 --> 0:26:12.858
1017
+ while for really rare words it's not that often.
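
To make the 'can' example concrete, here is a toy tagger sketch: a tiny hand-made lexicon lists the possible tags per word, and a single context rule picks one of them. The lexicon, the tag names and the rule are purely illustrative assumptions, not a real tagger.

```python
# Toy lexicon: each word maps to its possible part-of-speech tags.
LEXICON = {
    "can":   ["MODAL", "NOUN"],
    "the":   ["DET"],
    "of":    ["PREP"],
    "beans": ["NOUN"],
    "swim":  ["VERB"],
    "fish":  ["NOUN", "VERB"],
}

def tag(words: list[str]) -> list[tuple[str, str]]:
    tagged = []
    for i, w in enumerate(words):
        options = LEXICON.get(w.lower(), ["UNK"])
        # Single hand-written rule: after a determiner, prefer the noun reading.
        if len(options) > 1 and i > 0 and tagged[i - 1][1] == "DET":
            choice = "NOUN"
        else:
            choice = options[0]
        tagged.append((w, choice))
    return tagged

print(tag("Fish can swim".split()))      # 'can' -> MODAL
print(tag("The can of beans".split()))   # 'can' -> NOUN
```
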
1018
+
1019
+ 0:26:13.473 --> 0:26:23.159
1020
+ If you look at these classes you can distinguish
1021
+ open classes where new words can happen so
1022
+
1023
+ 0:26:23.159 --> 0:26:25.790
1024
+ we can invent new nouns.
1025
+
1026
+ 0:26:26.926 --> 0:26:31.461
1027
+ But then there are the close classes which
1028
+ I think are determined or pronoun.
1029
+
1030
+ 0:26:31.461 --> 0:26:35.414
1031
+ For example, it's not that you can easily
1032
+ develop your new pronoun.
1033
+
1034
+ 0:26:35.414 --> 0:26:38.901
1035
+ So there is a fixed list of pronouns and we
1036
+ are using that.
1037
+
1038
+ 0:26:38.901 --> 0:26:44.075
1039
+ So it's not like that or tomorrow there is
1040
+ something happening and then people are using
1041
+
1042
+ 0:26:44.075 --> 0:26:44.482
1043
+ a new.
1044
+
1045
+ 0:26:45.085 --> 0:26:52.426
1046
+ pronoun or new conjunctions, so it's like 'and',
1047
+ because it's not that you normally invent a
1048
+
1049
+ 0:26:52.426 --> 0:26:52.834
1050
+ new.
1051
+
1052
+ 0:27:00.120 --> 0:27:03.391
1053
+ And additional to part of speech tags.
1054
+
1055
+ 0:27:03.391 --> 0:27:09.012
1056
+ Then some of these part of speech tags have
1057
+ different properties.
1058
+
1059
+ 0:27:09.389 --> 0:27:21.813
1060
+ So, for example, for nouns and adjectives
1061
+ we can have a singular plural: In other languages,
1062
+
1063
+ 0:27:21.813 --> 0:27:29.351
1064
+ there is a dual so that a word is not only
1065
+ like singular or plural, but also like a
1066
+
1067
+ 0:27:29.351 --> 0:27:31.257
1068
+ dual if it refers to two.
1069
+
1070
+ 0:27:31.631 --> 0:27:36.246
1071
+ You have the gender and masculine feminine
1072
+ neuter we know.
1073
+
1074
+ 0:27:36.246 --> 0:27:43.912
1075
+ In other languages there is animate and inanimate
1076
+ and you have the cases like in German you have
1077
+
1078
+ 0:27:43.912 --> 0:27:46.884
1079
+ nominative, genitive, accusative.
1080
+
1081
+ 0:27:47.467 --> 0:27:57.201
1082
+ So here and then in other languages you also
1083
+ have Latin with the ablative.
1084
+
1085
+ 0:27:57.497 --> 0:28:03.729
1086
+ So there's like more, it's just like yeah,
1087
+ and there you have no one to one correspondence,
1088
+
1089
+ 0:28:03.729 --> 0:28:09.961
1090
+ so it can be that there are some cases which
1091
+ are only in the one language and do not happen
1092
+
1093
+ 0:28:09.961 --> 0:28:11.519
1094
+ in the other language.
1095
+
1096
+ 0:28:13.473 --> 0:28:20.373
1097
+ For verbs we have tenses of course like walk
1098
+ is walking, walked, have walked, had walked, will
1099
+
1100
+ 0:28:20.373 --> 0:28:21.560
1101
+ walk and so on.
1102
+
1103
+ 0:28:21.560 --> 0:28:28.015
1104
+ Interestingly for example in Japanese this
1105
+ can also happen for adjectives though there
1106
+
1107
+ 0:28:28.015 --> 0:28:32.987
1108
+ is a difference between something is white
1109
+ or something was white.
1110
+
1111
+ 0:28:35.635 --> 0:28:41.496
1112
+ There is this continuous thing which we do
1113
+ not really have that commonly in German and
1114
+
1115
+ 0:28:41.496 --> 0:28:47.423
1116
+ I guess that's if you're German and learning
1117
+ English that's something like she sings and
1118
+
1119
+ 0:28:47.423 --> 0:28:53.350
1120
+ she is singing and of course we can express
1121
+ that but it's not commonly used and normally
1122
+
1123
+ 0:28:53.350 --> 0:28:55.281
1124
+ we're not doing this aspect.
1125
+
1126
+ 0:28:55.455 --> 0:28:57.240
1127
+ Also about tenses.
1128
+
1129
+ 0:28:57.240 --> 0:29:05.505
1130
+ If you use pasts in English you will also
1131
+ use past tenses in German, so we have similar
1132
+
1133
+ 0:29:05.505 --> 0:29:09.263
1134
+ tenses, but the use might be different.
1135
+
1136
+ 0:29:14.214 --> 0:29:20.710
1137
+ There is uncertainty like the mood in there
1138
+ indicative.
1139
+
1140
+ 0:29:20.710 --> 0:29:26.742
1141
+ If he were here, there's voices active and
1142
+ passive.
1143
+
1144
+ 0:29:27.607 --> 0:29:34.024
1145
+ That you know, that is like both in German
1146
+ and English there, but there is something in
1147
+
1148
+ 0:29:34.024 --> 0:29:35.628
1149
+ the middle voice in Greek.
1150
+
1151
+ 0:29:35.628 --> 0:29:42.555
1152
+ I get myself taught, so there are other phenomena
1153
+ which might only happen in one language.
1154
+
1155
+ 0:29:42.762 --> 0:29:50.101
1156
+ This is, like yeah, the different syntactic
1157
+ structures that you can have in the language,
1158
+
1159
+ 0:29:50.101 --> 0:29:57.361
1160
+ and where there's the two things, so it might
1161
+ be that some only are in some language, others
1162
+
1163
+ 0:29:57.361 --> 0:29:58.376
1164
+ don't exist.
1165
+
1166
+ 0:29:58.358 --> 0:30:05.219
1167
+ And on the other hand there is also matching,
1168
+ so it might be that in some situations you
1169
+
1170
+ 0:30:05.219 --> 0:30:07.224
1171
+ use different structures.
1172
+
1173
+ 0:30:10.730 --> 0:30:13.759
1174
+ The next would be then about semantics.
1175
+
1176
+ 0:30:13.759 --> 0:30:16.712
1177
+ Do you have any questions before that?
1178
+
1179
+ 0:30:19.819 --> 0:30:31.326
1180
+ I'll just continue, but if something is unclear
1181
+ beside the structure, we typically have more
1182
+
1183
+ 0:30:31.326 --> 0:30:39.863
1184
+ ambiguities, so it can be that words itself
1185
+ have different meanings.
1186
+
1187
+ 0:30:40.200 --> 0:30:48.115
1188
+ And we are typically talking about polysemy
1189
+ and homonymy, where polysemy means that a word
1190
+
1191
+ 0:30:48.115 --> 0:30:50.637
1192
+ can have different meanings.
1193
+
1194
+ 0:30:50.690 --> 0:30:58.464
1195
+ So if you have the English word interest,
1196
+ it can be that you are interested in something.
1197
+
1198
+ 0:30:58.598 --> 0:31:07.051
1199
+ Or it can be like the interest rate financial,
1200
+ but it is somehow related because if you are
1201
+
1202
+ 0:31:07.051 --> 0:31:11.002
1203
+ getting some interest rates there is some.
1204
+
1205
+ 0:31:11.531 --> 0:31:18.158
1206
+ But there is homonymy, where they
1207
+ really are not related.
1208
+
1209
+ 0:31:18.458 --> 0:31:24.086
1210
+ So you can and can doesn't really have anything
1211
+ in common, so it's really very different.
1212
+
1213
+ 0:31:24.324 --> 0:31:29.527
1214
+ And of course that's not completely clear
1215
+ so there is not a clear definition so for example
1216
+
1217
+ 0:31:29.527 --> 0:31:34.730
1218
+ for the bank it can be that you say it's related
1219
+ but it can also be other can argue that so
1220
+
1221
+ 0:31:34.730 --> 0:31:39.876
1222
+ there are some clear things which is interest
1223
+ there are some which is vague and then there
1224
+
1225
+ 0:31:39.876 --> 0:31:43.439
1226
+ are some where it's very clear again that there
1227
+ are different.
1228
+
1229
+ 0:31:45.065 --> 0:31:49.994
1230
+ And in order to translate them, of course,
1231
+ we might need the context to disambiguate.
1232
+
1233
+ 0:31:49.994 --> 0:31:54.981
1234
+ That's typically where we can disambiguate,
1235
+ and that's not only for lexical semantics,
1236
+
1237
+ 0:31:54.981 --> 0:32:00.198
1238
+ that's generally very often that if you want
1239
+ to disambiguate, context can be very helpful.
1240
+
1241
+ 0:32:00.198 --> 0:32:03.981
1242
+ So in which sentence and which general knowledge
1243
+ who is speaking?
1244
+
1245
+ 0:32:04.944 --> 0:32:09.867
1246
+ You can do that externally by some disambiguation
1247
+ task.
1248
+
1249
+ 0:32:09.867 --> 0:32:14.702
1250
+ Machine translation system will also do it
1251
+ internally.
1252
+
1253
+ 0:32:16.156 --> 0:32:21.485
1254
+ And sometimes you're lucky and you don't need
1255
+ to do it because you just have the same ambiguity
1256
+
1257
+ 0:32:21.485 --> 0:32:23.651
1258
+ in the source and the target language.
1259
+
1260
+ 0:32:23.651 --> 0:32:26.815
1261
+ And then it doesn't matter if you think about
1262
+ the mouse.
1263
+
1264
+ 0:32:26.815 --> 0:32:31.812
1265
+ As I said, you don't really need to know if
1266
+ it's a computer mouse or the living mouse you
1267
+
1268
+ 0:32:31.812 --> 0:32:36.031
1269
+ translate from German to English because it
1270
+ has exactly the same ambiguity.
1271
+
1272
+ 0:32:40.400 --> 0:32:46.764
1273
+ There's also relations between words like
1274
+ synonyms, antonyms, hyponyms, like the is-a
1275
+ relation, and the part-of, like door and house.
1276
+ 0:32:46.764 --> 0:32:50.019
1277
+ a relation and the part of like Dora House.
1278
+
1279
+ 0:32:50.019 --> 0:32:55.569
1280
+ Big small is an antonym and synonym is like
1281
+ which needs something similar.
1282
+
1283
+ 0:32:56.396 --> 0:33:03.252
1284
+ There are resources which try to express all
1285
+ this linguistic information like WordNet
1286
+
1287
+ 0:33:03.252 --> 0:33:10.107
1288
+ or GermaNet where you have a graph with words
1289
+ and how they are related to each other.
1290
+
1291
+ 0:33:11.131 --> 0:33:12.602
1292
+ Which can be helpful.
1293
+
1294
+ 0:33:12.602 --> 0:33:18.690
1295
+ Typically these things were more used in tasks
1296
+ where there is fewer data, so there's a lot
1297
+
1298
+ 0:33:18.690 --> 0:33:24.510
1299
+ of tasks in NLP where you have very limited
1300
+ data because you really need to hand align
1301
+
1302
+ 0:33:24.510 --> 0:33:24.911
1303
+ that.
1304
+
1305
+ 0:33:25.125 --> 0:33:28.024
1306
+ Machine translation has a big advantage.
1307
+
1308
+ 0:33:28.024 --> 0:33:31.842
1309
+ There's naturally a lot of text translated
1310
+ out there.
1311
+
1312
+ 0:33:32.212 --> 0:33:39.519
1313
+ Typically in machine translation we have compared
1314
+ to other tasks a significant amount of data.
1315
+
1316
+ 0:33:39.519 --> 0:33:46.212
1317
+ People have looked into integrating wordnet
1318
+ or things like that, but it is rarely used
1319
+
1320
+ 0:33:46.212 --> 0:33:49.366
1321
+ in like commercial systems or something.
1322
+
1323
+ 0:33:52.692 --> 0:33:55.626
1324
+ So this was based on the words.
1325
+
1326
+ 0:33:55.626 --> 0:34:03.877
1327
+ We have morphology, syntax, and semantics,
1328
+ and then of course it makes sense to also look
1329
+
1330
+ 0:34:03.877 --> 0:34:06.169
1331
+ at the bigger structure.
1332
+
1333
+ 0:34:06.169 --> 0:34:08.920
1334
+ That means information about.
1335
+
1336
+ 0:34:08.948 --> 0:34:17.822
1337
+ Of course, we don't really have morphology
1338
+ there because morphology is about the structure
1339
+
1340
+ 0:34:17.822 --> 0:34:26.104
1341
+ of words, but we have syntax on the sentence
1342
+ level and the semantic representation.
1343
+
1344
+ 0:34:28.548 --> 0:34:35.637
1345
+ When we are thinking about the sentence structure,
1346
+ then the sentence is, of course, first a sequence
1347
+
1348
+ 0:34:35.637 --> 0:34:37.742
1349
+ of words terminated by a dot.
1350
+
1351
+ 0:34:37.742 --> 0:34:42.515
1352
+ Jane bought the house and we can say something
1353
+ about the structure.
1354
+
1355
+ 0:34:42.515 --> 0:34:47.077
1356
+ It's typically its subject work and then one
1357
+ or several objects.
1358
+
1359
+ 0:34:47.367 --> 0:34:51.996
1360
+ And the number of objects, for example, is
1361
+ then determined by the word.
1362
+
1363
+ 0:34:52.232 --> 0:34:54.317
1364
+ It's Called the Valency.
1365
+
1366
+ 0:34:54.354 --> 0:35:01.410
1367
+ So you have intransitive verbs which don't
1368
+ get any object, it's just to sleep.
1369
+
1370
+ 0:35:02.622 --> 0:35:05.912
1371
+ For example, there is no object sleep beds.
1372
+
1373
+ 0:35:05.912 --> 0:35:14.857
1374
+ You cannot say that: And there are transitive
1375
+ verbs where you have to put one or more objects,
1376
+
1377
+ 0:35:14.857 --> 0:35:16.221
1378
+ and you always.
1379
+
1380
+ 0:35:16.636 --> 0:35:19.248
1381
+ Sentence is not correct if you don't put the
1382
+ object.
1383
+
1384
+ 0:35:19.599 --> 0:35:33.909
1385
+ So if you have to buy something you have to
1386
+ say bought this or give someone something then.
1387
+
1388
+ 0:35:34.194 --> 0:35:40.683
1389
+ Here you see a bit that may be interesting
1390
+ the relation between word order and morphology.
1391
+
1392
+ 0:35:40.683 --> 0:35:47.243
1393
+ Of course it's not that strong, but for example
1394
+ in English you always have to first say who
1395
+
1396
+ 0:35:47.243 --> 0:35:49.453
1397
+ you gave it and what you gave.
1398
+
1399
+ 0:35:49.453 --> 0:35:53.304
1400
+ So the structure is very clear and cannot
1401
+ be changed.
1402
+
1403
+ 0:35:54.154 --> 0:36:00.801
1404
+ German, for example, has a possibility of
1405
+ determining what you gave and whom you gave
1406
+
1407
+ 0:36:00.801 --> 0:36:07.913
1408
+ it because there is a morphology and you can
1409
+ do what you gave a different form than to whom
1410
+
1411
+ 0:36:07.913 --> 0:36:08.685
1412
+ you gave.
1413
+
1414
+ 0:36:11.691 --> 0:36:18.477
1415
+ And that is a general tendency that if you
1416
+ have morphology then typically the word order
1417
+
1418
+ 0:36:18.477 --> 0:36:25.262
1419
+ is more free and possible, while in English
1420
+ you cannot express these information through
1421
+
1422
+ 0:36:25.262 --> 0:36:26.482
1423
+ the morphology.
1424
+
1425
+ 0:36:26.706 --> 0:36:30.238
1426
+ You typically have to express them through
1427
+ the word order.
1428
+
1429
+ 0:36:30.238 --> 0:36:32.872
1430
+ It's not as free, but it's more restricted.
1431
+
1432
+ 0:36:35.015 --> 0:36:40.060
1433
+ Yeah, the first part is typically the noun
1434
+ phrase, the subject, and that can not only
1435
+
1436
+ 0:36:40.060 --> 0:36:43.521
1437
+ be a single noun, but of course it can be a
1438
+ longer phrase.
1439
+
1440
+ 0:36:43.521 --> 0:36:48.860
1441
+ So if you have Jane the woman, it can be Jane,
1442
+ it can be the woman, it can a woman, it can
1443
+
1444
+ 0:36:48.860 --> 0:36:52.791
1445
+ be the young woman or the young woman who lives
1446
+ across the street.
1447
+
1448
+ 0:36:53.073 --> 0:36:56.890
1449
+ All of these are the subjects, so this can
1450
+ be already very, very long.
1451
+
1452
+ 0:36:57.257 --> 0:36:58.921
1453
+ And they also put this.
1454
+
1455
+ 0:36:58.921 --> 0:37:05.092
1456
+ The verb is on the second position in a bit
1457
+ more complicated way because if you have now
1458
+
1459
+ 0:37:05.092 --> 0:37:11.262
1460
+ the young woman who lives across the street
1461
+ runs to somewhere or so then yeah runs is at
1462
+
1463
+ 0:37:11.262 --> 0:37:16.185
1464
+ the second position in this tree but the first
1465
+ position is quite long.
1466
+
1467
+ 0:37:16.476 --> 0:37:19.277
1468
+ And so it's not just counting okay.
1469
+
1470
+ 0:37:19.277 --> 0:37:22.700
1471
+ The second word is always is always a word.
1472
+
1473
+ 0:37:26.306 --> 0:37:32.681
1474
+ Additional to these simple things, there's
1475
+ more complex stuff.
1476
+
1477
+ 0:37:32.681 --> 0:37:43.104
1478
+ Jane bought the house from Jim without hesitation,
1479
+ or Jane bought the house in the posh neighborhood
1480
+
1481
+ 0:37:43.104 --> 0:37:44.925
1482
+ across the river.
1483
+
1484
+ 0:37:45.145 --> 0:37:51.694
1485
+ And these often lead to additional ambiguities
1486
+ because it's not always completely clear to
1487
+
1488
+ 0:37:51.694 --> 0:37:53.565
1489
+ which this prepositional.
1490
+
1491
+ 0:37:54.054 --> 0:37:59.076
1492
+ So that we'll see and you have, of course,
1493
+ subclasses and so on.
1494
+
1495
+ 0:38:01.061 --> 0:38:09.926
1496
+ And then there is a theory behind it which
1497
+ was very important for rule based machine translation
1498
+
1499
+ 0:38:09.926 --> 0:38:14.314
1500
+ because that's exactly what you're doing there.
1501
+
1502
+ 0:38:14.314 --> 0:38:18.609
1503
+ You would take the sentence, do the syntactic.
1504
+
1505
+ 0:38:18.979 --> 0:38:28.432
1506
+ So that we can have these constituents which
1507
+ like describe the basic parts of the language.
1508
+
1509
+ 0:38:28.468 --> 0:38:35.268
1510
+ And we can create the sentence structure as
1511
+ a context free grammar, which you hopefully
1512
+
1513
+ 0:38:35.268 --> 0:38:42.223
1514
+ remember from basic computer science, which
1515
+ is a pair of non terminals, terminal symbols,
1516
+
1517
+ 0:38:42.223 --> 0:38:44.001
1518
+ production rules, and.
1519
+
1520
+ 0:38:43.943 --> 0:38:50.218
1521
+ And the star symbol, and you can then describe
1522
+ a sentence by this phrase structure grammar:
1523
+
1524
+ 0:38:51.751 --> 0:38:59.628
1525
+ So a simple example would be something like
1526
+ that: you have a lexicon, Jane is a noun, Frays
1527
+
1528
+ 0:38:59.628 --> 0:39:02.367
1529
+ is a noun, Telescope is a noun.
1530
+
1531
+ 0:39:02.782 --> 0:39:10.318
1532
+ And then you have these production rules sentences:
1533
+ a noun phrase in the web phrase.
1534
+
1535
+ 0:39:10.318 --> 0:39:18.918
1536
+ The noun phrase can either be a determiner and a
1537
+ noun or it can be a noun phrase and a prepositional
1538
+
1539
+ 0:39:18.918 --> 0:39:19.628
1540
+ phrase.
1541
+
1542
+ 0:39:19.919 --> 0:39:25.569
1543
+ Or a prepositional phrase and a prepositional
1544
+ phrase is a preposition and a noun phrase.
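
The toy grammar sketched in these rules can be written down directly, for example with NLTK's chart parser. The exact rule and word list below is an assumption, only roughly following the lexicon and productions described in the lecture, and NLTK itself is just one convenient tool choice.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | N | NP PP
VP -> V NP
PP -> P NP
Det -> 'a' | 'the'
N  -> 'Jane' | 'house' | 'telescope'
V  -> 'buys'
P  -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Jane buys a house".split()):
    tree.pretty_print()   # S -> NP(Jane) VP(buys, NP(a house))
```
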
1545
+
1546
+ 0:39:26.426 --> 0:39:27.622
1547
+ We're looking at this.
1548
+
1549
+ 0:39:27.622 --> 0:39:30.482
1550
+ What is the valency of the word we're describing
1551
+ here?
1552
+
1553
+ 0:39:33.513 --> 0:39:36.330
1554
+ How many objects would in this case the verb
1555
+ have?
1556
+
1557
+ 0:39:46.706 --> 0:39:48.810
1558
+ We're looking at the verb phrase.
1559
+
1560
+ 0:39:48.810 --> 0:39:54.358
1561
+ The verb phrase is a verb and a noun phrase,
1562
+ so one object here, so this would be a
1563
+
1564
+ 0:39:54.358 --> 0:39:55.378
1565
+ valency of one.
1566
+
1567
+ 0:39:55.378 --> 0:40:00.925
1568
+ If you have intransitive verbs, it would be
1569
+ a verb phrase with just a verb, and if you have
1570
+
1571
+ 0:40:00.925 --> 0:40:03.667
1572
+ two, it would be verb, noun phrase, noun phrase.
1573
+
1574
+ 0:40:08.088 --> 0:40:15.348
1575
+ And yeah, then the challenge or
1576
+ what you have to do is like this: Given a natural
1577
+
1578
+ 0:40:15.348 --> 0:40:23.657
1579
+ language sentence, you want to parse it to
1580
+ get this type of parse tree, like from programming languages
1581
+
1582
+ 0:40:23.657 --> 0:40:30.198
1583
+ where you also need to parse the code in order
1584
+ to get the representation.
1585
+
1586
+ 0:40:30.330 --> 0:40:39.356
1587
+ However, there is one challenge if you parse
1588
+ natural language compared to computer language.
1589
+
1590
+ 0:40:43.823 --> 0:40:56.209
1591
+ So there are different ways of how you can
1592
+ express things and there are different parse trees
1593
+
1594
+ 0:40:56.209 --> 0:41:00.156
1595
+ belonging to the same input.
1596
+
1597
+ 0:41:00.740 --> 0:41:05.241
1598
+ So if you have Jane buys a house, that's
1599
+ an easy example.
1600
+
1601
+ 0:41:05.241 --> 0:41:07.491
1602
+ So you do the lexicon look up.
1603
+
1604
+ 0:41:07.491 --> 0:41:13.806
1605
+ Jane can be a noun phrase, buys is a verb,
1606
+ a is a determiner, and a house is a noun.
1607
+
1608
+ 0:41:15.215 --> 0:41:18.098
1609
+ And then you can now use the grammar rules
1610
+ of here.
1611
+
1612
+ 0:41:18.098 --> 0:41:19.594
1613
+ There is no rule for that.
1614
+
1615
+ 0:41:20.080 --> 0:41:23.564
1616
+ Here we have no rules, but here we have a
1617
+ rule.
1618
+
1619
+ 0:41:23.564 --> 0:41:27.920
1620
+ A noun is a noun phrase, so we have mapped
1621
+ that to the noun phrase.
1622
+
1623
+ 0:41:28.268 --> 0:41:34.012
1624
+ Then we can map this to the verb phrase.
1625
+
1626
+ 0:41:34.012 --> 0:41:47.510
1627
+ We have a verb and a noun phrase to verb phrase and
1628
+ then we can map this to a sentence representation.
1629
+
1630
+ 0:41:49.069 --> 0:41:53.042
1631
+ We can have that even more complex.
1632
+
1633
+ 0:41:53.042 --> 0:42:01.431
1634
+ The woman who won the lottery yesterday bought
1635
+ the house across the street.
1636
+
1637
+ 0:42:01.431 --> 0:42:05.515
1638
+ The structure gets more complicated.
1639
+
1640
+ 0:42:05.685 --> 0:42:12.103
1641
+ You now see that the verb phrase is at the
1642
+ second position, but the noun phrase is quite.
1643
+
1644
+ 0:42:12.052 --> 0:42:18.655
1645
+ Quite big in here and the p p phrases, it's
1646
+ sometimes difficult where to put them because
1647
+
1648
+ 0:42:18.655 --> 0:42:25.038
1649
+ they can be put to the noun phrase, but in
1650
+ other sentences they can also be put to the
1651
+
1652
+ 0:42:25.038 --> 0:42:25.919
1653
+ verb phrase.
1654
+
1655
+ 0:42:36.496 --> 0:42:38.250
1656
+ Yeah.
1657
+
1658
+ 0:42:43.883 --> 0:42:50.321
1659
+ Yes, so then either it can have two tags,
1660
+ noun or noun phrase, or you can have the extra
1661
+
1662
+ 0:42:50.321 --> 0:42:50.755
1663
+ rule.
1664
+
1665
+ 0:42:50.755 --> 0:42:57.409
1666
+ The noun phrase can not only be a determiner
1667
+ and a noun, but it can also be just a noun.
1668
+
1669
+ 0:42:57.717 --> 0:43:04.360
1670
+ Then of course either you introduce additional
1671
+ rules when what is possible or the problem
1672
+
1673
+ 0:43:04.360 --> 0:43:11.446
1674
+ that if you do pastures which are not correct
1675
+ and then you have to add some type of probability
1676
+
1677
+ 0:43:11.446 --> 0:43:13.587
1678
+ which type is more probable.
1679
+
1680
+ 0:43:16.876 --> 0:43:23.280
1681
+ But of course some things also can't really
1682
+ be modelled easily with this type of trees.
1683
+
1684
+ 0:43:23.923 --> 0:43:32.095
1685
+ There, for example, the agreement is not straightforward
1686
+ to do, so that for subject and verb you can check
1687
+
1688
+ 0:43:32.095 --> 0:43:38.866
1689
+ that the person, the agreement, the number
1690
+ in person, the number agreement is correct,
1691
+
1692
+ 0:43:38.866 --> 0:43:41.279
1693
+ but if it's a singular subject.
1694
+
1695
+ 0:43:41.561 --> 0:43:44.191
1696
+ A singular verb, it's also a singular.
1697
+
1698
+ 0:43:44.604 --> 0:43:49.242
1699
+ Non-subject, and if it's a plural subject,
1700
+ it's a plural verb.
1701
+
1702
+ 0:43:49.489 --> 0:43:56.519
1703
+ Things like that are yeah, the agreement in
1704
+ determiner, adjective and noun, so they also
1705
+
1706
+ 0:43:56.519 --> 0:43:57.717
1707
+ have to agree.
1708
+
1709
+ 0:43:57.877 --> 0:44:05.549
1710
+ Things like that cannot be easily done with
1711
+ this type of grammar or this subcategorization
1712
+
1713
+ 0:44:05.549 --> 0:44:13.221
1714
+ that you check whether the verb is transitive
1715
+ or intransitive, and that Jane sleeps is OK,
1716
+
1717
+ 0:44:13.221 --> 0:44:16.340
1718
+ but Jane sleeps the house is not OK.
1719
+
1720
+ 0:44:16.436 --> 0:44:21.073
1721
+ And Jane Walterhouse is okay, but Jane Walterhouse
1722
+ is not okay.
1723
+
1724
+ 0:44:23.183 --> 0:44:29.285
1725
+ Furthermore, this long range dependency might
1726
+ be difficult and which word orders are allowed
1727
+
1728
+ 0:44:29.285 --> 0:44:31.056
1729
+ and which are not allowed.
1730
+
1731
+ 0:44:31.571 --> 0:44:40.011
1732
+ This is also not directly so you can say Maria
1733
+ gibt dem Mann das Buch, dem Mann gibt Maria das
1734
+
1735
+ 0:44:40.011 --> 0:44:47.258
1736
+ Buch, das Buch gibt Maria dem Mann, aber Maria
1737
+ dem Mann gibt das Buch is some.
1738
+
1739
+ 0:44:47.227 --> 0:44:55.191
1740
+ One yeah, which one from this one is possible
1741
+ and not is sometimes not possible to model,
1742
+
1743
+ 0:44:55.191 --> 0:44:56.164
1744
+ is simple.
1745
+
1746
+ 0:44:56.876 --> 0:45:05.842
1747
+ Therefore, people have done more complex stuff
1748
+ like this unification grammar and tried to
1749
+
1750
+ 0:45:05.842 --> 0:45:09.328
1751
+ model both the categories of verb.
1752
+
1753
+ 0:45:09.529 --> 0:45:13.367
1754
+ The agreement has to be that it's person and
1755
+ single.
1756
+
1757
+ 0:45:13.367 --> 0:45:20.028
1758
+ You're joining that so you're annotating this
1759
+ thing with more information and then you have
1760
+
1761
+ 0:45:20.028 --> 0:45:25.097
1762
+ more complex syntactic structures in order
1763
+ to model also these types.
1764
+
1765
+ 0:45:28.948 --> 0:45:33.137
1766
+ Yeah, why is this difficult?
1767
+
1768
+ 0:45:33.873 --> 0:45:39.783
1769
+ We have different ambiguities and that makes
1770
+ it difficult, so words have different part
1771
+
1772
+ 0:45:39.783 --> 0:45:43.610
1773
+ of speech tags and if you have time flies like
1774
+ an arrow.
1775
+
1776
+ 0:45:43.583 --> 0:45:53.554
1777
+ It can mean that the flies, the animals, like
1778
+ an arrow, or it can mean that the time
1779
+
1780
+ 0:45:53.554 --> 0:45:59.948
1781
+ is flying very fast, is going away very fast,
1782
+ like an arrow.
1783
+
1784
+ 0:46:00.220 --> 0:46:10.473
1785
+ And if you want to do a parse tree, these two
1786
+ meanings have different part of speech tags,
1787
+
1788
+ 0:46:10.473 --> 0:46:13.008
1789
+ so flies is the verb.
1790
+
1791
+ 0:46:13.373 --> 0:46:17.999
1792
+ And of course that is a different semantic,
1793
+ and so that is very different.
1794
+
1795
+ 0:46:19.499 --> 0:46:23.361
1796
+ And otherwise a structural.
1797
+
1798
+ 0:46:23.243 --> 0:46:32.419
1799
+ Ambiguity so that like some part of the sentence
1800
+ can have different rules, so the famous thing
1801
+
1802
+ 0:46:32.419 --> 0:46:34.350
1803
+ is this attachment.
1804
+
1805
+ 0:46:34.514 --> 0:46:39.724
1806
+ So the cop saw the burglar with the binoculars.
1807
+
1808
+ 0:46:39.724 --> 0:46:48.038
1809
+ Then with the binoculars can be attached to saw
1810
+ or it can be attached to the burglar.
1811
+
1812
+ 0:46:48.448 --> 0:46:59.897
1813
+ And so in the first two it's more probable
1814
+ that he saw the thief, and not that the thief
1815
+
1816
+ 0:46:59.897 --> 0:47:01.570
1817
+ has the binoculars.
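
The PP-attachment ambiguity can be reproduced with the same kind of toy grammar as before: because both NP -> NP PP and VP -> VP PP are allowed, the chart parser returns two trees, one attaching the binoculars to 'saw' and one to the burglar. The rules and words are again an illustrative sketch, not the lecture's own grammar.

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N  -> 'cop' | 'burglar' | 'binoculars'
V  -> 'saw'
P  -> 'with'
""")

sentence = "the cop saw the burglar with the binoculars".split()
for tree in nltk.ChartParser(grammar).parse(sentence):
    tree.pretty_print()   # prints two trees, one per attachment
```
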
1818
+
1819
+ 0:47:01.982 --> 0:47:13.356
1820
+ And this, of course, makes things difficult
1821
+ while parsing and doing structure implicitly
1822
+
1823
+ 0:47:13.356 --> 0:47:16.424
1824
+ defining the semantics.
1825
+
1826
+ 0:47:20.120 --> 0:47:29.736
1827
+ Therefore, we would then go directly to semantics,
1828
+ but maybe some questions about syntax and
1829
+
1830
+ 0:47:29.736 --> 0:47:31.373
1831
+ how that works.
1832
+
1833
+ 0:47:33.113 --> 0:47:46.647
1834
+ Then we'll do a bit more about semantics,
1835
+ so now we only describe the structure of the
1836
+
1837
+ 0:47:46.647 --> 0:47:48.203
1838
+ sentence.
1839
+
1840
+ 0:47:48.408 --> 0:47:55.584
1841
+ And for the meaning of the sentence we typically
1842
+ have the compositionality of meaning.
1843
+
1844
+ 0:47:55.584 --> 0:48:03.091
1845
+ The meaning of the full sentence is determined
1846
+ by the meaning of the individual words, and
1847
+
1848
+ 0:48:03.091 --> 0:48:06.308
1849
+ they together form the meaning of the.
1850
+
1851
+ 0:48:06.686 --> 0:48:17.936
1852
+ For words that is partly true but not always
1853
+ mean for things like rainbow, jointly rain
1854
+
1855
+ 0:48:17.936 --> 0:48:19.086
1856
+ and bow.
1857
+
1858
+ 0:48:19.319 --> 0:48:26.020
1859
+ But this is not always a case, while for sentences
1860
+ typically that is happening because you can't
1861
+
1862
+ 0:48:26.020 --> 0:48:30.579
1863
+ directly determine the full meaning, but you
1864
+ split it into parts.
1865
+
1866
+ 0:48:30.590 --> 0:48:36.164
1867
+ Sometimes only in some parts like kick the
1868
+ bucket the expression.
1869
+
1870
+ 0:48:36.164 --> 0:48:43.596
1871
+ Of course you cannot get the meaning of kick
1872
+ the bucket by looking at the individual words, or
1873
+
1874
+ 0:48:43.596 --> 0:48:46.130
1875
+ in German 'biss ins Gras'.
1876
+
1877
+ 0:48:47.207 --> 0:48:53.763
1878
+ You cannot get that he died by looking at
1879
+ the individual words of 'biss ins Gras', but
1880
+
1881
+ 0:48:53.763 --> 0:48:54.611
1882
+ they have.
1883
+
1884
+ 0:48:55.195 --> 0:49:10.264
1885
+ And there are different ways of describing
1886
+ that some people have tried that more commonly
1887
+
1888
+ 0:49:10.264 --> 0:49:13.781
1889
+ used for some tasks.
1890
+
1891
+ 0:49:14.654 --> 0:49:20.073
1892
+ Will come to so the first thing would be something
1893
+ like first order logic.
1894
+
1895
+ 0:49:20.073 --> 0:49:27.297
1896
+ If you have Peter loves Jane then you have
1897
+ this meaning and you're having the end of representation
1898
+
1899
+ 0:49:27.297 --> 0:49:33.005
1900
+ that you have a love property between Peter
1901
+ and Jane and you try to construct.
1902
+
1903
+ 0:49:32.953 --> 0:49:40.606
1904
+ That you will see this a lot more complex
1905
+ than directly than only doing syntax but also
1906
+
1907
+ 0:49:40.606 --> 0:49:43.650
1908
+ doing this type of representation.
1909
+
1910
+ 0:49:44.164 --> 0:49:47.761
1911
+ The other thing is to try to do frame semantics.
1912
+
1913
+ 0:49:47.867 --> 0:49:55.094
1914
+ That means that you try to represent the knowledge
1915
+ about the world and you have these ah frames.
1916
+
1917
+ 0:49:55.094 --> 0:49:58.372
1918
+ For example, you might have a frame to buy.
1919
+
1920
+ 0:49:58.418 --> 0:50:05.030
1921
+ And the meaning is that you have a commercial
1922
+ transaction.
1923
+
1924
+ 0:50:05.030 --> 0:50:08.840
1925
+ You have a person who is selling.
1926
+
1927
+ 0:50:08.969 --> 0:50:10.725
1928
+ You Have a Person Who's Buying.
1929
+
1930
+ 0:50:11.411 --> 0:50:16.123
1931
+ You have something that is priced, you might
1932
+ have a price, and so on.
1933
+
1934
+ 0:50:17.237 --> 0:50:22.698
1935
+ And then what you are doing in semantic parsing
1936
+ with frame semantics you first try to determine.
1937
+
1938
+ 0:50:22.902 --> 0:50:30.494
1939
+ Which frames are happening in the sentence,
1940
+ so if it's something with Bowie buying you
1941
+
1942
+ 0:50:30.494 --> 0:50:33.025
1943
+ would try to first identify.
1944
+
1945
+ 0:50:33.025 --> 0:50:40.704
1946
+ Oh, here we have to try Brain B, which does
1947
+ not always have to be indicated by the verb
1948
+
1949
+ 0:50:40.704 --> 0:50:42.449
1950
+ cell or other ways.
1951
+
1952
+ 0:50:42.582 --> 0:50:52.515
1953
+ And then you try to find out which elements
1954
+ of these frame are in the sentence and try
1955
+
1956
+ 0:50:52.515 --> 0:50:54.228
1957
+ to align them.
1958
+
1959
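+ [Editor's illustrative sketch, not from the lecture materials: a toy version of the two steps just described, first identifying which frame occurs in the sentence and then, in a real system, aligning its elements; the frame name, evoker list and role names are simplified assumptions, not official FrameNet labels.]
+ # Toy frame inventory for the "buy" example: a commercial transaction
+ # with a buyer, a seller, the goods and a price.
+ FRAMES = {
+     "Commerce_buy": {
+         "evokers": {"buy", "buys", "bought", "purchase", "sell", "sells", "sold"},
+         "roles": ["buyer", "seller", "goods", "price"],
+     },
+ }
+
+ def identify_frames(tokens):
+     # Step 1: which frames are evoked in the sentence?
+     return [name for name, frame in FRAMES.items() if frame["evokers"] & set(tokens)]
+
+ tokens = "peter buys an apple from jane".split()
+ print(identify_frames(tokens))   # ['Commerce_buy']
+ # Step 2, aligning buyer=Peter, goods=apple, seller=Jane, would need a real
+ # parser and is omitted here.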
+ 0:50:56.856 --> 0:51:01.121
1960
+ Yeah, you have, for example, to buy and sell.
1961
+
1962
+ 0:51:01.121 --> 0:51:07.239
1963
+ If you have a model that has frames, they
1964
+ have the same elements.
1965
+
1966
+ 0:51:09.829 --> 0:51:15.018
1967
+ In addition over like sentence, then you have
1968
+ also a phenomenon beyond sentence level.
1969
+
1970
+ 0:51:15.018 --> 0:51:20.088
1971
+ We're coming to this later because it's a
1972
+ special challenge for machine translation.
1973
+
1974
+ 0:51:20.088 --> 0:51:22.295
1975
+ There is, for example, co reference.
1976
+
1977
+ 0:51:22.295 --> 0:51:27.186
1978
+ That means if you first mention it, it's like
1979
+ the President of the United States.
1980
+
1981
+ 0:51:27.467 --> 0:51:30.107
1982
+ And later you would refer to him maybe as
1983
+ he.
1984
+
1985
+ 0:51:30.510 --> 0:51:36.966
1986
+ And that is especially challenging in machine
1987
+ translation because you're not always using
1988
+
1989
+ 0:51:36.966 --> 0:51:38.114
1990
+ the same thing.
1991
+
1992
+ 0:51:38.114 --> 0:51:44.355
1993
+ Of course, for the president, it's he and
1994
+ air in German, but for other things it might
1995
+
1996
+ 0:51:44.355 --> 0:51:49.521
1997
+ be different depending on the gender in languages
1998
+ that you refer to it.
1999
+
2000
+ 0:51:55.435 --> 0:52:03.866
2001
+ So much for the background and the next, we
2002
+ want to look based on the knowledge we have
2003
+
2004
+ 0:52:03.866 --> 0:52:04.345
2005
+ now.
2006
+
2007
+ 0:52:04.345 --> 0:52:10.285
2008
+ Why is machine translation difficult before
2009
+ we have any more?
2010
+
2011
+ 0:52:16.316 --> 0:52:22.471
2012
+ The first type of problem is what we refer
2013
+ to as translation divergences.
2014
+
2015
+ 0:52:22.471 --> 0:52:30.588
2016
+ That means that we have the same information
2017
+ in source and target, but the problem is that
2018
+
2019
+ 0:52:30.588 --> 0:52:33.442
2020
+ they are expressed differently.
2021
+
2022
+ 0:52:33.713 --> 0:52:42.222
2023
+ So it is not expressed the same way, and we cannot
2024
+ translate these things easily; we need something just
2025
+
2026
+ 0:52:42.222 --> 0:52:44.924
2027
+ a bit more complex.
2028
+
2029
+ 0:52:45.325 --> 0:52:51.324
2030
+ So an example is the structure in
2031
+ English: the delicious soup.
2032
+
2033
+ 0:52:51.324 --> 0:52:59.141
2034
+ The adjective is before the noun, while in
2035
+ Spanish you have to put it after the noun,
2036
+
2037
+ 0:52:59.141 --> 0:53:02.413
2038
+ and so you have to change the word order.
2039
+
2040
+ 0:53:02.983 --> 0:53:10.281
2041
+ So there are different ways of divergence,
2042
+ so there can be structural divergence, which
2043
+
2044
+ 0:53:10.281 --> 0:53:10.613
2045
+ is.
2046
+
2047
+ 0:53:10.550 --> 0:53:16.121
2048
+ The word orders so that the order is different,
2049
+ so in German we have that especially in the
2050
+
2051
+ 0:53:16.121 --> 0:53:19.451
2052
+ in the sub clause, while in English in the
2053
+ sub clause.
2054
+
2055
+ 0:53:19.451 --> 0:53:24.718
2056
+ The verb is also at the second position, in
2057
+ German it's at the end, and so you have to
2058
+
2059
+ 0:53:24.718 --> 0:53:25.506
2060
+ move it all.
2061
+
2062
+ 0:53:25.465 --> 0:53:27.222
2063
+ Um All Over.
2064
+
2065
+ 0:53:27.487 --> 0:53:32.978
2066
+ It can be that that it's a complete different
2067
+ grammatical role.
2068
+
2069
+ 0:53:33.253 --> 0:53:35.080
2070
+ So,.
2071
+
2072
+ 0:53:35.595 --> 0:53:37.458
2073
+ You have: you like her.
2074
+
2075
+ 0:53:38.238 --> 0:53:41.472
2076
+ And that is in
2077
+
2078
+ 0:53:41.261 --> 0:53:47.708
2079
+ English. In Spanish it's 'ella te gusta', which
2080
+ means she is now no longer the object
2081
+
2082
+ 0:53:47.708 --> 0:53:54.509
2083
+ but she is the subject here and you are now in the accusative,
2084
+ and it is more like 'she pleases you', so you really
2085
+
2086
+ 0:53:54.509 --> 0:53:58.689
2087
+ use a different sentence structure and you
2088
+ have to change.
2089
+
2090
+ 0:53:59.139 --> 0:54:03.624
2091
+ Can also be the head switch.
2092
+
2093
+ 0:54:03.624 --> 0:54:09.501
2094
+ In English you say the baby just ate.
2095
+
2096
+ 0:54:09.501 --> 0:54:16.771
2097
+ In Spanish literally you say the baby finishes eating.
2098
+
2099
+ 0:54:16.997 --> 0:54:20.803
2100
+ So eating is no longer the main verb, but finishing
2101
+ is the verb.
2102
+
2103
+ 0:54:21.241 --> 0:54:30.859
2104
+ So you have to learn so you cannot always
2105
+ have the same structures in your input and
2106
+
2107
+ 0:54:30.859 --> 0:54:31.764
2108
+ output.
2109
+
2110
+ 0:54:36.856 --> 0:54:42.318
2111
+ Lexical things like to swim across or to cross
2112
+ swimming.
2113
+
2114
+ 0:54:43.243 --> 0:54:57.397
2115
+ You have categorical like an adjective gets
2116
+ into a noun, so you have a little bread to
2117
+
2118
+ 0:54:57.397 --> 0:55:00.162
2119
+ make a decision.
2120
+
2121
+ 0:55:00.480 --> 0:55:15.427
2122
+ That is the one challenge and the even bigger
2123
+ challenge is referred to as translation.
2124
+
2125
+ 0:55:17.017 --> 0:55:19.301
2126
+ That can be their lexical mismatch.
2127
+
2128
+ 0:55:19.301 --> 0:55:21.395
2129
+ That's the fish we talked about.
2130
+
2131
+ 0:55:21.395 --> 0:55:27.169
2132
+ If it's like the, the fish you eat or the
2133
+ fish which is living is the two different worlds
2134
+
2135
+ 0:55:27.169 --> 0:55:27.931
2136
+ in Spanish.
2137
+
2138
+ 0:55:28.108 --> 0:55:34.334
2139
+ And then that's partly sometimes even not
2140
+ known, so even the human might not be able
2141
+
2142
+ 0:55:34.334 --> 0:55:34.627
2143
+ to.
2144
+
2145
+ 0:55:34.774 --> 0:55:40.242
2146
+ Infer that you maybe need to see the context
2147
+ you maybe need to have the sentences around,
2148
+
2149
+ 0:55:40.242 --> 0:55:45.770
2150
+ so one problem is that at least traditional
2151
+ machine translation works on a sentence level,
2152
+
2153
+ 0:55:45.770 --> 0:55:51.663
2154
+ so we take each sentence and translate it independent
2155
+ of everything else, but that's, of course,
2156
+
2157
+ 0:55:51.663 --> 0:55:52.453
2158
+ not correct.
2159
+
2160
+ 0:55:52.532 --> 0:55:59.901
2161
+ Will look into some ways of looking at and
2162
+ doing document-based machine translation, but.
2163
+
2164
+ 0:56:00.380 --> 0:56:06.793
2165
+ There's gender information might be a problem,
2166
+ so in English it's player and you don't know
2167
+
2168
+ 0:56:06.793 --> 0:56:10.139
2169
+ if it's Spieler Spielerin or if it's not known.
2170
+
2171
+ 0:56:10.330 --> 0:56:15.770
2172
+ But from the English, if you now generate German,
2173
+ you would need to know about the reader.
2174
+
2175
+ 0:56:15.770 --> 0:56:21.830
2176
+ Does he know the gender or does he not know
2177
+ the gender and then generate the right one?
2178
+
2179
+ 0:56:22.082 --> 0:56:38.333
2180
+ So just imagine a commentator if he's talking
2181
+ about the player and you can see if it's male
2182
+
2183
+ 0:56:38.333 --> 0:56:40.276
2184
+ or female.
2185
+
2186
+ 0:56:40.540 --> 0:56:47.801
2187
+ So in generally the problem is that if you
2188
+ have less information and you need more information
2189
+
2190
+ 0:56:47.801 --> 0:56:51.928
2191
+ in your target, this translation doesn't really
2192
+ work.
2193
+
2194
+ 0:56:55.175 --> 0:56:59.180
2195
+ Another problem is what we just talked
2196
+ about:
2197
+
2198
+ 0:56:59.119 --> 0:57:01.429
2199
+ the co-reference.
2200
+
2201
+ 0:57:01.641 --> 0:57:08.818
2202
+ So if you refer to an object and that can
2203
+ be across sentence boundaries then you have
2204
+
2205
+ 0:57:08.818 --> 0:57:14.492
2206
+ to use the right pronoun and you cannot just
2207
+ translate the pronoun.
2208
+
2209
+ 0:57:14.492 --> 0:57:18.581
2210
+ If the baby does not thrive on raw milk boil
2211
+ it.
2212
+
2213
+ 0:57:19.079 --> 0:57:28.279
2214
+ And if you are now using it and just take
2215
+ the typical translation, it will be: And That
2216
+
2217
+ 0:57:28.279 --> 0:57:31.065
2218
+ Will Be Ah Wrong.
2219
+
2220
+ 0:57:31.291 --> 0:57:35.784
2221
+ No, that will be even right because it is
2222
+ 'das Baby'.
2223
+
2224
+ 0:57:35.784 --> 0:57:42.650
2225
+ Yes, but I mean, you have to determine that
2226
+ and it might be wrong at some point.
2227
+
2228
+ 0:57:42.650 --> 0:57:48.753
2229
+ So getting this this um yeah, it will be wrong
2230
+ yes, that is right yeah.
2231
+
2232
+ 0:57:48.908 --> 0:57:55.469
2233
+ Because in English, both the baby and the milk
2234
+ are both referred to by 'it', so if you
2235
+
2236
+ 0:57:55.469 --> 0:58:02.180
2237
+ use 'it', it will refer to the first one mentioned,
2238
+ so it's correct, but in German it will be
2239
+
2240
+ 0:58:02.180 --> 0:58:06.101
2241
+ 'es', and so if you translate it as 'es' it will
2242
+ be the baby.
2243
+
2244
+ 0:58:06.546 --> 0:58:13.808
2245
+ But you have to use 'sie' because milk is feminine,
2246
+ although that is really very uncommon because
2247
+
2248
+ 0:58:13.808 --> 0:58:18.037
2249
+ maybe a model is an object and so it should
2250
+ be more.
2251
+
2252
+ 0:58:18.358 --> 0:58:25.176
2253
+ Of course, I agree there might be a situation
2254
+ which is a bit created and not a common thing,
2255
+
2256
+ 0:58:25.176 --> 0:58:29.062
2257
+ but you can see that these things are not that
2258
+ easy.
2259
+
2260
+ 0:58:29.069 --> 0:58:31.779
2261
+ Another example is this: Dr.
2262
+
2263
+ 0:58:31.779 --> 0:58:37.855
2264
+ McLean often brings his dog champion to visit
2265
+ with his patients.
2266
+
2267
+ 0:58:37.855 --> 0:58:41.594
2268
+ He loves to give big wet sloppy kisses.
2269
+
2270
+ 0:58:42.122 --> 0:58:58.371
2271
+ And there, of course, it's also important
2272
+ if he refers to the dog or to the doctor.
2273
+
2274
+ 0:58:59.779 --> 0:59:11.260
2275
+ Another example of challenging is that we
2276
+ don't have a fixed language and that was referred
2277
+
2278
+ 0:59:11.260 --> 0:59:16.501
2279
+ to morphology and we can build new words.
2280
+
2281
+ 0:59:16.496 --> 0:59:23.787
2282
+ So we can in all languages build new words
2283
+ by just concatenating parts of it, like Brexit,
2284
+
2285
+ 0:59:23.787 --> 0:59:30.570
2286
+ and things like that. And then, of course, also
2287
+ words don't exist only in one language, don't exist
2288
+
2289
+ 0:59:30.570 --> 0:59:31.578
2290
+ in isolation.
2291
+
2292
+ 0:59:32.012 --> 0:59:41.591
2293
+ In German you can now use the word download
2294
+ somewhere and you can also use a morphological
2295
+
2296
+ 0:59:41.591 --> 0:59:43.570
2297
+ operation on that.
2298
+
2299
+ 0:59:43.570 --> 0:59:48.152
2300
+ I guess there is not even a correct word for it.
2301
+
2302
+ 0:59:48.508 --> 0:59:55.575
2303
+ But so you have to deal with these things,
2304
+ and yeah, especially in social media.
2305
+
2306
+ 0:59:55.996 --> 1:00:00.215
2307
+ This word is maybe most of you have forgotten
2308
+ already.
2309
+
2310
+ 1:00:00.215 --> 1:00:02.517
2311
+ This was ten years ago or so.
2312
+
2313
+ 1:00:02.517 --> 1:00:08.885
2314
+ I don't know there was a volcano in Iceland
2315
+ which stopped Europeans flying around.
2316
+
2317
+ 1:00:09.929 --> 1:00:14.706
2318
+ So there is always new words coming up and
2319
+ you have to deal with.
2320
+
2321
+ 1:00:18.278 --> 1:00:24.041
2322
+ Yeah, one last thing, so some of these examples
2323
+ we have seen are a bit artificial.
2324
+
2325
+ 1:00:24.041 --> 1:00:30.429
2326
+ So one example what is very common with machine
2327
+ translation doesn't really work is this box
2328
+
2329
+ 1:00:30.429 --> 1:00:31.540
2330
+ was in the pen.
2331
+
2332
+ 1:00:32.192 --> 1:00:36.887
2333
+ And maybe you would be surprised, at least
2334
+ when read it.
2335
+
2336
+ 1:00:36.887 --> 1:00:39.441
2337
+ How can a box be inside a pen?
2338
+
2339
+ 1:00:40.320 --> 1:00:44.175
2340
+ Does anybody have a solution for that while
2341
+ the sentence is still correct?
2342
+
2343
+ 1:00:47.367 --> 1:00:51.692
2344
+ Maybe it's directly clear for you, maybe your
2345
+ English was aside, yeah.
2346
+
2347
+ 1:00:54.654 --> 1:01:07.377
2348
+ Yes, like at a farm or for small children,
2349
+ and that is also called a pen or a pen on a
2350
+
2351
+ 1:01:07.377 --> 1:01:08.254
2352
+ farm.
2353
+
2354
+ 1:01:08.368 --> 1:01:12.056
2355
+ And then this is, and so you can mean okay.
2356
+
2357
+ 1:01:12.056 --> 1:01:16.079
2358
+ To infer these two meanings is quite difficult.
2359
+
2360
+ 1:01:16.436 --> 1:01:23.620
2361
+ But at least when I saw it, I wasn't completely
2362
+ convinced because it's maybe not the sentence
2363
+
2364
+ 1:01:23.620 --> 1:01:29.505
2365
+ you're using in your daily life, and some of
2366
+ these constructions seem to be.
2367
+
2368
+ 1:01:29.509 --> 1:01:35.155
2369
+ They are very good in showing where the problem
2370
+ is, but the question is, does it really imply
2371
+
2372
+ 1:01:35.155 --> 1:01:35.995
2373
+ in real life?
2374
+
2375
+ 1:01:35.996 --> 1:01:42.349
2376
+ And therefore here some examples also that
2377
+ we had here with a lecture translator that
2378
+
2379
+ 1:01:42.349 --> 1:01:43.605
2380
+ really occurred.
2381
+
2382
+ 1:01:43.605 --> 1:01:49.663
2383
+ They maybe looked simple, but you will see
2384
+ that some of them still are happening.
2385
+
2386
+ 1:01:50.050 --> 1:01:53.948
2387
+ And they are partly about splitting words,
2388
+ and then they are happening.
2389
+
2390
+ 1:01:54.294 --> 1:01:56.816
2391
+ So Um.
2392
+
2393
+ 1:01:56.596 --> 1:02:03.087
2394
+ We had a text about the numeral system in
2395
+ German, the Zahlensystem, which got split
2396
+
2397
+ 1:02:03.087 --> 1:02:07.041
2398
+ into sub parts because otherwise we can't translate.
2399
+
2400
+ 1:02:07.367 --> 1:02:14.927
2401
+ And then it did only an approximate match and
2402
+ was talking about the binary payment system
2403
+
2404
+ 1:02:14.927 --> 1:02:23.270
2405
+ because the payment system was a lot more common
2406
+ in the training data than the Zahlensystem.
2407
+
2408
+ 1:02:23.823 --> 1:02:29.900
2409
+ And so there you see like rare words, which
2410
+ don't occur that often.
2411
+
2412
+ 1:02:29.900 --> 1:02:38.211
2413
+ They are very challenging to deal with because
2414
+ we are good and inferring that sometimes, but
2415
+
2416
+ 1:02:38.211 --> 1:02:41.250
2417
+ for others that's very difficult.
2418
+
2419
+ 1:02:44.344 --> 1:02:49.605
2420
+ Another challenge is that, of course, the
2421
+ context is very difficult.
2422
+
2423
+ 1:02:50.010 --> 1:02:56.448
2424
+ This is also an example a bit older from also
2425
+ the lecture translators we were translating
2426
+
2427
+ 1:02:56.448 --> 1:03:01.813
2428
+ in a math lecture, and it was always talking
2429
+ about the omens of the numbers.
2430
+
2431
+ 1:03:02.322 --> 1:03:11.063
2432
+ Which doesn't make any sense at all, but the
2433
+ German word Vorzeichen can of course mean the
2434
+
2435
+ 1:03:11.063 --> 1:03:12.408
2436
+ sign and the omen.
2437
+
2438
+ 1:03:12.732 --> 1:03:22.703
2439
+ And if you do not have the right domain knowledge
2440
+ encoded in there, it might use the wrong domain
2441
+
2442
+ 1:03:22.703 --> 1:03:23.869
2443
+ knowledge.
2444
+
2445
+ 1:03:25.705 --> 1:03:31.205
2446
+ A more recent version of that is like here
2447
+ from a paper where it's about translating.
2448
+
2449
+ 1:03:31.205 --> 1:03:36.833
2450
+ We had this pivot based translation where
2451
+ you translate maybe to English and to another
2452
+
2453
+ 1:03:36.833 --> 1:03:39.583
2454
+ because you have not enough training data.
2455
+
2456
+ 1:03:40.880 --> 1:03:48.051
2457
+ And we did that from Dutch to German guess
2458
+ if you don't understand Dutch, if you speak
2459
+
2460
+ 1:03:48.051 --> 1:03:48.710
2461
+ German.
2462
+
2463
+ 1:03:48.908 --> 1:03:56.939
2464
+ So we have this Dutch 'voorbeeld geven', which means
2465
+ 'to give an example' in English.
2466
+
2467
+ 1:03:56.939 --> 1:04:05.417
2468
+ It's correctly translated as setting an example: However,
2469
+ if we then translate to German, it didn't
2470
+
2471
+ 1:04:05.417 --> 1:04:11.524
2472
+ get the full context, and in German you normally
2473
+ don't set an example, but you give an example,
2474
+
2475
+ 1:04:11.524 --> 1:04:16.740
2476
+ and so yes, going through another language
2477
+ you introduce additional errors.
2478
+
2479
+ 1:04:19.919 --> 1:04:27.568
2480
+ Good so much for this are there more questions
2481
+ about why this is difficult.
2482
+
2483
+ 1:04:30.730 --> 1:04:35.606
2484
+ Then we'll start with this one.
2485
+
2486
+ 1:04:35.606 --> 1:04:44.596
2487
+ I have to leave a bit early today in a quarter
2488
+ of an hour.
2489
+
2490
+ 1:04:44.904 --> 1:04:58.403
2491
+ If you look at linguistic approaches to
2492
+ machine translation, they are typically described
2493
+
2494
+ 1:04:58.403 --> 1:05:03.599
2495
+ by: So we can do a direct translation, so you
2496
+ take the source language.
2497
+
2498
+ 1:05:03.599 --> 1:05:09.452
2499
+ Do not apply a lot of the analysis we were
2500
+ discussing today about syntax representation,
2501
+
2502
+ 1:05:09.452 --> 1:05:11.096
2503
+ semantic representation.
2504
+
2505
+ 1:05:11.551 --> 1:05:14.678
2506
+ But you directly translate to your target
2507
+ text.
2508
+
2509
+ 1:05:14.678 --> 1:05:16.241
2510
+ That's here the direct.
2511
+
2512
+ 1:05:16.516 --> 1:05:19.285
2513
+ Then there is a transfer based approach.
2514
+
2515
+ 1:05:19.285 --> 1:05:23.811
2516
+ Then you transfer everything over and you
2517
+ do the text translation.
2518
+
2519
+ 1:05:24.064 --> 1:05:28.354
2520
+ And you can do that at two levels, more at
2521
+ the syntax level.
2522
+
2523
+ 1:05:28.354 --> 1:05:34.683
2524
+ That means you only do syntactic analysis,
2525
+ like you run a parser or so, or at the semantic
2526
+
2527
+ 1:05:34.683 --> 1:05:37.848
2528
+ level where you do semantic parsing with frames.
2529
+
2530
+ 1:05:38.638 --> 1:05:51.489
2531
+ Then there is an interlingua based approach
2532
+ where you don't do any transfer anymore, but
2533
+
2534
+ 1:05:51.489 --> 1:05:55.099
2535
+ you only do an analysis.
2536
+
2537
+ 1:05:57.437 --> 1:06:02.790
2538
+ So how does now the direct transfer, the direct
2539
+ translation?
2540
+
2541
+ 1:06:03.043 --> 1:06:07.031
2542
+ Look like it's one of the earliest approaches.
2543
+
2544
+ 1:06:07.327 --> 1:06:18.485
2545
+ So you do maybe some morphological analysts,
2546
+ but not a lot, and then you do this bilingual
2547
+
2548
+ 1:06:18.485 --> 1:06:20.202
2549
+ word mapping.
2550
+
2551
+ 1:06:20.540 --> 1:06:25.067
2552
+ You might do some here in generations.
2553
+
2554
+ 1:06:25.067 --> 1:06:32.148
2555
+ These two things are not really big, but you
2556
+ are working on.
2557
+
2558
+ 1:06:32.672 --> 1:06:39.237
2559
+ And of course this might be a first easy solution
2560
+ about all the challenges we have seen that
2561
+
2562
+ 1:06:39.237 --> 1:06:41.214
2563
+ the structure is different.
2564
+
2565
+ 1:06:41.214 --> 1:06:45.449
2566
+ That you have to reorder, look at the agreement,
2567
+ then work.
2568
+
2569
+ 1:06:45.449 --> 1:06:47.638
2570
+ That's why the first approach.
2571
+
2572
+ 1:06:47.827 --> 1:06:54.618
2573
+ So if we have different word order, structural
2574
+ shifts or idiomatic expressions that doesn't
2575
+
2576
+ 1:06:54.618 --> 1:06:55.208
2577
+ really work.
2578
+
2579
+ 1:06:57.797 --> 1:07:05.034
2580
+ Then there are these rule based approaches
2581
+ which were more commonly used.
2582
+
2583
+ 1:07:05.034 --> 1:07:15.249
2584
+ They might still be somewhere: Mean most commonly
2585
+ they are now used by neural networks but wouldn't
2586
+
2587
+ 1:07:15.249 --> 1:07:19.254
2588
+ be sure there is no system out there but.
2589
+
2590
+ 1:07:19.719 --> 1:07:25.936
2591
+ And in this transfer based approach we have
2592
+ these steps there nicely visualized in the.
2593
+
2594
+ 1:07:26.406 --> 1:07:32.397
2595
+ Triangle: so we have the analysis of the source
2596
+ sentence where we then get some type of abstract
2597
+
2598
+ 1:07:32.397 --> 1:07:33.416
2599
+ representation.
2600
+
2601
+ 1:07:33.693 --> 1:07:40.010
2602
+ Then we are doing the transfer of the representation
2603
+ of the source sentence into the representation
2604
+
2605
+ 1:07:40.010 --> 1:07:40.263
2606
+ of.
2607
+
2608
+ 1:07:40.580 --> 1:07:46.754
2609
+ And then we have the generation where we take
2610
+ this abstract representation and do then the
2611
+
2612
+ 1:07:46.754 --> 1:07:47.772
2613
+ surface forms.
2614
+
2615
+ 1:07:47.772 --> 1:07:54.217
2616
+ For example, it might be that there is no
2617
+ morphological variants in the abstract representation
2618
+
2619
+ 1:07:54.217 --> 1:07:56.524
2620
+ and we have to do this agreement.
2621
+
2622
+ 1:07:56.656 --> 1:08:00.077
2623
+ Which components do you need?
2624
+
2625
+ 1:08:01.061 --> 1:08:08.854
2626
+ You need monolingual source and target lexicon
2627
+ and the corresponding grammars in order to
2628
+
2629
+ 1:08:08.854 --> 1:08:12.318
2630
+ do both the analysis and the generation.
2631
+
2632
+ 1:08:12.412 --> 1:08:18.584
2633
+ Then you need the bilingual dictionary in
2634
+ order to do the lexical translation and the
2635
+
2636
+ 1:08:18.584 --> 1:08:25.116
2637
+ bilingual transfer rules in order to transfer
2638
+ the grammar, for example in German, into the
2639
+
2640
+ 1:08:25.116 --> 1:08:28.920
2641
+ grammar in English, and that enables you to
2642
+ do that.
2643
+
2644
+ 1:08:29.269 --> 1:08:32.579
2645
+ So an example is is something like this here.
2646
+
2647
+ 1:08:32.579 --> 1:08:38.193
2648
+ So if you're doing a syntactic transfer it
2649
+ means you're starting with John E.
2650
+
2651
+ 1:08:38.193 --> 1:08:38.408
2652
+ Z.
2653
+
2654
+ 1:08:38.408 --> 1:08:43.014
2655
+ Apple you do the analyst then you have this
2656
+ type of graph here.
2657
+
2658
+ 1:08:43.014 --> 1:08:48.340
2659
+ Therefore you need your monolingual lexicon
2660
+ and your monolingual grammar.
2661
+
2662
+ 1:08:48.748 --> 1:08:59.113
2663
+ Then you're doing the transfer where you're
2664
+ transferring this representation into this
2665
+
2666
+ 1:08:59.113 --> 1:09:01.020
2667
+ representation.
2668
+
2669
+ 1:09:01.681 --> 1:09:05.965
2670
+ So how could this type of translation then
2671
+ look like?
2672
+
2673
+ 1:09:07.607 --> 1:09:08.276
2674
+ Style.
2675
+
2676
+ 1:09:08.276 --> 1:09:14.389
2677
+ We have the example of a delicious soup and
2678
+ una soup deliciosa.
2679
+
2680
+ 1:09:14.894 --> 1:09:22.173
2681
+ This is your source language tree and this
2682
+ is your target language tree and then the rules
2683
+
2684
+ 1:09:22.173 --> 1:09:26.092
2685
+ that you need are these ones to do the transfer.
2686
+
2687
+ 1:09:26.092 --> 1:09:31.211
2688
+ So if you have a noun phrase that also goes
2689
+ to the noun phrase.
2690
+
2691
+ 1:09:31.691 --> 1:09:44.609
2692
+ You see here that the switch is happening,
2693
+ so the second position is here at the first
2694
+
2695
+ 1:09:44.609 --> 1:09:46.094
2696
+ position.
2697
+
2698
+ 1:09:46.146 --> 1:09:52.669
2699
+ Then you have the translation of determiner
2700
+ of the words, so the dictionary entries.
2701
+
2702
+ 1:09:53.053 --> 1:10:07.752
2703
+ And with these types of rules you can then
2704
+ do these mappings and do the transfer between
2705
+
2706
+ 1:10:07.752 --> 1:10:11.056
2707
+ the representation.
2708
+
2709
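+ [Editor's illustrative sketch, not from the lecture materials: the transfer step for 'a delicious soup' -> 'una sopa deliciosa' as just described, i.e. one structural rule that moves the adjective behind the noun plus bilingual dictionary entries; the flat rule format and the three dictionary entries are simplifying assumptions.]
+ DICTIONARY = {"a": "una", "delicious": "deliciosa", "soup": "sopa"}
+
+ def transfer_np(det, adj, noun):
+     # Structural transfer rule NP: DET ADJ N -> DET N ADJ,
+     # plus lexical transfer through the bilingual dictionary.
+     return [DICTIONARY[det], DICTIONARY[noun], DICTIONARY[adj]]
+
+ print(" ".join(transfer_np("a", "delicious", "soup")))   # una sopa deliciosa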
+ 1:10:25.705 --> 1:10:32.505
2710
+ Think it more depends on the amount of expertise
2711
+ you have in representing them.
2712
+
2713
+ 1:10:32.505 --> 1:10:35.480
2714
+ The rules will get more difficult.
2715
+
2716
+ 1:10:36.136 --> 1:10:42.445
2717
+ For example, these rule based were, so I think
2718
+ it more depends on how difficult the structure
2719
+
2720
+ 1:10:42.445 --> 1:10:42.713
2721
+ is.
2722
+
2723
+ 1:10:42.713 --> 1:10:48.619
2724
+ So for German generating German they were
2725
+ quite long, quite successful because modeling
2726
+
2727
+ 1:10:48.619 --> 1:10:52.579
2728
+ all the German phenomena which are in there
2729
+ was difficult.
2730
+
2731
+ 1:10:52.953 --> 1:10:56.786
2732
+ And that can be done there, and it wasn't
2733
+ easy to learn that just from data.
2734
+
2735
+ 1:10:59.019 --> 1:11:07.716
2736
+ Think even if you think about Chinese and
2737
+ English or so, if you have the trees there
2738
+
2739
+ 1:11:07.716 --> 1:11:10.172
2740
+ is quite some rule and.
2741
+
2742
+ 1:11:15.775 --> 1:11:23.370
2743
+ Another thing is you can also try to do something
2744
+ like that on the semantic, which means this
2745
+
2746
+ 1:11:23.370 --> 1:11:24.905
2747
+ gets more complex.
2748
+
2749
+ 1:11:25.645 --> 1:11:31.047
2750
+ This gets maybe a bit easier because this
2751
+ representation, the semantic representation
2752
+
2753
+ 1:11:31.047 --> 1:11:36.198
2754
+ between languages, are more similar and therefore
2755
+ this gets more difficult again.
2756
+
2757
+ 1:11:36.496 --> 1:11:45.869
2758
+ So typically if you go higher in your triangle
2759
+ this is more work while this is less work.
2760
+
2761
+ 1:11:49.729 --> 1:11:56.023
2762
+ So it can be then, for example, like with 'gustar',
2763
+ we have again that the order changes.
2764
+
2765
+ 1:11:56.023 --> 1:12:02.182
2766
+ So you see the transfer rule for like is that
2767
+ the first argument is here and the second is
2768
+
2769
+ 1:12:02.182 --> 1:12:06.514
2770
+ there, while on the on the Gusta side here
2771
+ the second argument.
2772
+
2773
+ 1:12:06.466 --> 1:12:11.232
2774
+ It is in the first position and the first
2775
+ argument is in the second position.
2776
+
2777
+ 1:12:11.511 --> 1:12:14.061
2778
+ So that you do yeah, and also there you're
2779
+ ordering,.
2780
+
2781
+ 1:12:14.354 --> 1:12:20.767
2782
+ From the principle it is more like you have
2783
+ a different type of formalism of representing
2784
+
2785
+ 1:12:20.767 --> 1:12:27.038
2786
+ your sentence and therefore you need to do
2787
+ more on one side and less on the other side.
2788
+
2789
+ 1:12:32.852 --> 1:12:42.365
2790
+ Then so in general transfer based approaches
2791
+ are you have to first select how to represent
2792
+
2793
+ 1:12:42.365 --> 1:12:44.769
2794
+ a syntactic structure.
2795
+
2796
+ 1:12:45.165 --> 1:12:55.147
2797
+ There's like these variable abstraction levels
2798
+ and then you have the three components: The
2799
+
2800
+ 1:12:55.147 --> 1:13:04.652
2801
+ disadvantage is that on the one hand you need
2802
+ normally a lot of experts monolingual experts
2803
+
2804
+ 1:13:04.652 --> 1:13:08.371
2805
+ who analyze how to do the transfer.
2806
+
2807
+ 1:13:08.868 --> 1:13:18.860
2808
+ And if you're doing a new language, you have
2809
+ to do analysis, generation, and the
2810
+
2811
+ 1:13:18.860 --> 1:13:19.970
2812
+ transfer.
2813
+
2814
+ 1:13:20.400 --> 1:13:27.074
2815
+ So if you need one language, add one language
2816
+ in existing systems, of course you have to
2817
+
2818
+ 1:13:27.074 --> 1:13:29.624
2819
+ do transfer to all the languages.
2820
+
2821
+ 1:13:32.752 --> 1:13:39.297
2822
+ Therefore, the other idea which people were
2823
+ interested in is the interlingua based machine
2824
+
2825
+ 1:13:39.297 --> 1:13:40.232
2826
+ translation.
2827
+
2828
+ 1:13:40.560 --> 1:13:47.321
2829
+ Where the idea is that we have this intermediate
2830
+ language with this abstract language independent
2831
+
2832
+ 1:13:47.321 --> 1:13:53.530
2833
+ representation and so the important thing is
2834
+ it's language independent so it's really the
2835
+
2836
+ 1:13:53.530 --> 1:13:59.188
2837
+ same for all language and it's a pure meaning
2838
+ and there is no ambiguity in there.
2839
+
2840
+ 1:14:00.100 --> 1:14:05.833
2841
+ That allows this nice translation without
2842
+ transfer, so you just do an analysis into your
2843
+
2844
+ 1:14:05.833 --> 1:14:11.695
2845
+ representation, and there afterwards you do
2846
+ the generation into the other target language.
2847
+
2848
+ 1:14:13.293 --> 1:14:16.953
2849
+ And that of course makes especially multilingual.
2850
+
2851
+ 1:14:16.953 --> 1:14:19.150
2852
+ It's like somehow is a dream.
2853
+
2854
+ 1:14:19.150 --> 1:14:25.519
2855
+ If you want to add a language you just need
2856
+ to add one analysis tool and one generation
2857
+
2858
+ 1:14:25.519 --> 1:14:25.959
2859
+ tool.
2860
+
2861
+ 1:14:29.249 --> 1:14:32.279
2862
+ Which is not the case in the other scenario.
2863
+
2864
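+ [Editor's worked example, not part of the transcript: the scaling argument made here can be stated as a small calculation; with N languages a transfer-based setup needs one transfer module per ordered language pair, while an interlingua setup needs only one analysis and one generation module per language.]
+ for n in (2, 5, 10, 20):
+     transfer_modules = n * (n - 1)   # one per ordered language pair
+     interlingua_modules = 2 * n      # one analyser + one generator per language
+     print(f"{n} languages: {transfer_modules} transfer vs {interlingua_modules} interlingua modules")
+ # e.g. 20 languages: 380 transfer vs 40 interlingua modules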
+ 1:14:33.193 --> 1:14:40.547
2865
+ However, the big challenge is in this case
2866
+ the interlingua based representation because
2867
+
2868
+ 1:14:40.547 --> 1:14:47.651
2869
+ you need to represent all different types of
2870
+ knowledge in there in order to do that.
2871
+
2872
+ 1:14:47.807 --> 1:14:54.371
2873
+ And also like world knowledge, so something
2874
+ like an apple is a fruit and property is a
2875
+
2876
+ 1:14:54.371 --> 1:14:57.993
2877
+ fruit, so they are eatable and stuff like that.
2878
+
2879
+ 1:14:58.578 --> 1:15:06.286
2880
+ So that is why this is typically always only
2881
+ done for small amounts of data.
2882
+
2883
+ 1:15:06.326 --> 1:15:13.106
2884
+ So what people have done for special applications
2885
+ like hotel reservation people have looked into
2886
+
2887
+ 1:15:13.106 --> 1:15:18.348
2888
+ that, but they have typically not done it for
2889
+ any possibility of doing it.
2890
+
2891
+ 1:15:18.718 --> 1:15:31.640
2892
+ So the disadvantage is that you need to represent
2893
+ all the world knowledge in your interlingua.
2894
+
2895
+ 1:15:32.092 --> 1:15:40.198
2896
+ And that is not possible at the moment or
2897
+ never was possible so far.
2898
+
2899
+ 1:15:40.198 --> 1:15:47.364
2900
+ Typically they were for small domains for
2901
+ hotel reservation.
2902
+
2903
+ 1:15:51.431 --> 1:15:57.926
2904
+ But of course this idea of doing that and
2905
+ that's why some people are interested in is
2906
+
2907
+ 1:15:57.926 --> 1:16:04.950
2908
+ like if you now do a neural system where you
2909
+ learn the representation in your neural network
2910
+
2911
+ 1:16:04.950 --> 1:16:07.442
2912
+ is that some type of artificial.
2913
+
2914
+ 1:16:08.848 --> 1:16:09.620
2915
+ Interlingua.
2916
+
2917
+ 1:16:09.620 --> 1:16:15.025
2918
+ However, what we at least found out until
2919
+ now is that there's often very language specific
2920
+
2921
+ 1:16:15.025 --> 1:16:15.975
2922
+ information in.
2923
+
2924
+ 1:16:16.196 --> 1:16:19.648
2925
+ And they might be important and essential.
2926
+
2927
+ 1:16:19.648 --> 1:16:26.552
2928
+ You don't have all the information in your
2929
+ input, so you typically can't do resolving
2930
+
2931
+ 1:16:26.552 --> 1:16:32.412
2932
+ all ambiguities inside there because you might
2933
+ not have all information.
2934
+
2935
+ 1:16:32.652 --> 1:16:37.870
2936
+ So in English you don't know if it's a living
2937
+ fish or the fish which you're eating, and if
2938
+
2939
+ 1:16:37.870 --> 1:16:43.087
2940
+ you're translating to German you also don't
2941
+ have to resolve this problem because you have
2942
+
2943
+ 1:16:43.087 --> 1:16:45.610
2944
+ the same ambiguity in your target language.
2945
+
2946
+ 1:16:45.610 --> 1:16:50.828
2947
+ So why would you put in our effort in finding
2948
+ out if it's the one fish or the other fish if it's
2949
+
2950
+ 1:16:50.828 --> 1:16:52.089
2951
+ not necessary at all?
2952
+
2953
+ 1:16:54.774 --> 1:16:59.509
2954
+ Yeah Yeah.
2955
+
2956
+ 1:17:05.585 --> 1:17:15.019
2957
+ The semantic transfer is not the same for
2958
+ both languages, so you still represent the
2959
+
2960
+ 1:17:15.019 --> 1:17:17.127
2961
+ semantic language.
2962
+
2963
+ 1:17:17.377 --> 1:17:23.685
2964
+ So you have the like semantic representation
2965
+ in the Gusta, but that's not the same as semantic
2966
+
2967
+ 1:17:23.685 --> 1:17:28.134
2968
+ representation for both languages, and that's
2969
+ the main difference.
2970
+
2971
+ 1:17:35.515 --> 1:17:44.707
2972
+ Okay, then these are the most important things
2973
+ for today: what language is and how rule
2974
+
2975
+ 1:17:44.707 --> 1:17:46.205
2976
+ based systems work.
2977
+
2978
+ 1:17:46.926 --> 1:17:59.337
2979
+ And if there are no more questions, thank you
2980
+ for joining, we have today a bit of a shorter
2981
+
2982
+ 1:17:59.337 --> 1:18:00.578
2983
+ lecture.
2984
+
demo_data/lectures/Lecture-02-20.04.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e0ac15772e9e528ff3f7fb957401be410fcdf4a4ad54542e96916fe654443eb3
3
+ size 111655016
demo_data/lectures/Lecture-03-25.04.2023/English.vtt ADDED
@@ -0,0 +1,3102 @@
1
+ WEBVTT
2
+
3
+ 0:00:02.822 --> 0:00:07.880
4
+ We look into more linguistic approaches.
5
+
6
+ 0:00:07.880 --> 0:00:14.912
7
+ We can do machine translation in a more traditional
8
+ way.
9
+
10
+ 0:00:14.912 --> 0:00:21.224
11
+ It should be: Translation should be generated
12
+ this way.
13
+
14
+ 0:00:21.224 --> 0:00:27.933
15
+ We can analyze first the source sentence, what
16
+ is the meaning or the syntax.
17
+
18
+ 0:00:27.933 --> 0:00:35.185
19
+ Then we transfer this information to the target
20
+ side and then we then generate.
21
+
22
+ 0:00:36.556 --> 0:00:42.341
23
+ And this was the strong and common used approach
24
+ for yeah several years.
25
+
26
+ 0:00:44.024 --> 0:00:50.839
27
+ However, we saw already at the beginning there
28
+ some challenges with that: Language is very
29
+
30
+ 0:00:50.839 --> 0:00:57.232
31
+ ambiguous, and it's often very difficult to really
32
+ get hand-coded rules right.
33
+
34
+ 0:00:57.232 --> 0:01:05.336
35
+ What are the different meanings and we have
36
+ to do that also with a living language so new
37
+
38
+ 0:01:05.336 --> 0:01:06.596
39
+ things occur.
40
+
41
+ 0:01:07.007 --> 0:01:09.308
42
+ And that's why people look into.
43
+
44
+ 0:01:09.308 --> 0:01:13.282
45
+ Can we maybe do it differently and use machine
46
+ learning?
47
+
48
+ 0:01:13.333 --> 0:01:24.849
49
+ So we are no longer giving rules of how to
50
+ do it, but we just give examples and the system.
51
+
52
+ 0:01:25.045 --> 0:01:34.836
53
+ And one important thing then is these examples:
54
+ how can we learn how to translate one sentence?
55
+
56
+ 0:01:35.635 --> 0:01:42.516
57
+ And therefore these yeah, the data is now
58
+ really a very important issue.
59
+
60
+ 0:01:42.582 --> 0:01:50.021
61
+ And that is what we want to look into today.
62
+
63
+ 0:01:50.021 --> 0:01:58.783
64
+ What type of data do we use for machine translation?
65
+
66
+ 0:01:59.019 --> 0:02:08.674
67
+ So the idea in preprocessing is always: Can
68
+ we make the task somehow a bit easier so that
69
+
70
+ 0:02:08.674 --> 0:02:13.180
71
+ the empty system will be in a way better?
72
+
73
+ 0:02:13.493 --> 0:02:28.309
74
+ So one example could be if it has problems
75
+ dealing with numbers because they are occurring.
76
+
77
+ 0:02:28.648 --> 0:02:35.479
78
+ Or think about so one problem which still
79
+ might be is there in some systems think about
80
+
81
+ 0:02:35.479 --> 0:02:36.333
82
+ different.
83
+
84
+ 0:02:36.656 --> 0:02:44.897
85
+ So a system might learn that of course if
86
+ there's a German over in English there should.
87
+
88
+ 0:02:45.365 --> 0:02:52.270
89
+ However, if it's in parallel text, it will see
90
+ that in German there is often km, and in English
91
+
92
+ 0:02:52.270 --> 0:02:54.107
93
+ typically various miles.
94
+
95
+ 0:02:54.594 --> 0:03:00.607
96
+ Might just translate three hundred and fifty
97
+ five miles into three hundred and fifty five
98
+
99
+ 0:03:00.607 --> 0:03:04.348
100
+ kilometers, which of course is not right, and
101
+ so forth.
102
+
103
+ 0:03:04.348 --> 0:03:06.953
104
+ It might make things to look into the.
105
+
106
+ 0:03:07.067 --> 0:03:13.072
107
+ Therefore, first step when you build your
108
+ machine translation system is normally to look
109
+
110
+ 0:03:13.072 --> 0:03:19.077
111
+ at the data, to check it, to see if there is
112
+ anything happening which you should address
113
+
114
+ 0:03:19.077 --> 0:03:19.887
115
+ beforehand.
116
+
117
+ 0:03:20.360 --> 0:03:29.152
118
+ And then the second part is how do you represent
119
+ no works machine learning normally?
120
+
121
+ 0:03:29.109 --> 0:03:35.404
122
+ So the question is how do we get out from
123
+ the words into numbers and I've seen some of
124
+
125
+ 0:03:35.404 --> 0:03:35.766
126
+ you?
127
+
128
+ 0:03:35.766 --> 0:03:42.568
129
+ For example, in advance there we have introduced
130
+ to an algorithm which we also shortly repeat
131
+
132
+ 0:03:42.568 --> 0:03:43.075
133
+ today.
134
+
135
+ 0:03:43.303 --> 0:03:53.842
136
+ The subword unit approach which was first
137
+ introduced in machine translation and now used
138
+
139
+ 0:03:53.842 --> 0:04:05.271
140
+ for an in order to represent: Now you've learned
141
+ about morphology, so you know that maybe in
142
+
143
+ 0:04:05.271 --> 0:04:09.270
144
+ English it's not that important.
145
+
146
+ 0:04:09.429 --> 0:04:22.485
147
+ In German you have all these different word
148
+ forms and need to learn independent representations.
149
+
150
+ 0:04:24.024 --> 0:04:26.031
151
+ And then, of course, they are more extreme.
152
+
153
+ 0:04:27.807 --> 0:04:34.387
154
+ So how are we doing?
155
+
156
+ 0:04:34.975 --> 0:04:37.099
157
+ Machine translation.
158
+
159
+ 0:04:37.099 --> 0:04:46.202
160
+ So hopefully you remember we had these approaches
161
+ to machine translation, the rule based.
162
+
163
+ 0:04:46.202 --> 0:04:52.473
164
+ We had a big block of corpus space machine
165
+ translation which.
166
+
167
+ 0:04:52.492 --> 0:05:00.443
168
+ Will on Thursday have an overview on statistical
169
+ models and then afterwards concentrate on the.
170
+
171
+ 0:05:00.680 --> 0:05:08.828
172
+ Both of them are corpus based machine translation
173
+ and therefore it's really essential, and while
174
+
175
+ 0:05:08.828 --> 0:05:16.640
176
+ we are typically training a machine translation
177
+ system is what we refer to as parallel data.
178
+
179
+ 0:05:16.957 --> 0:05:22.395
180
+ We talk a lot about parallel corpus or parallel data,
181
+ and what I mean there is something which you
182
+
183
+ 0:05:22.395 --> 0:05:28.257
184
+ might know from was that a stone or something
185
+ like that, so it's typically you have one sentence
186
+
187
+ 0:05:28.257 --> 0:05:33.273
188
+ in the one language, and then you have aligned
189
+ to it one sentence in the charcote.
190
+
191
+ 0:05:33.833 --> 0:05:38.261
192
+ And this is how we train all our alignments.
193
+
194
+ 0:05:38.261 --> 0:05:43.181
195
+ We'll see today that of course we might not
196
+ have.
197
+
198
+ 0:05:43.723 --> 0:05:51.279
199
+ However, this is relatively easy to create,
200
+ at least for high-quality data.
201
+
202
+ 0:05:51.279 --> 0:06:00.933
203
+ We look into data trawling so that means how
204
+ we can automatically create this parallel data
205
+
206
+ 0:06:00.933 --> 0:06:02.927
207
+ from the Internet.
208
+
209
+ 0:06:04.144 --> 0:06:13.850
210
+ It's not so difficult to learn these alignments
211
+ if we have some type of dictionary, so which
212
+
213
+ 0:06:13.850 --> 0:06:16.981
214
+ sentence is aligned to which.
215
+
216
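+ [Editor's illustrative sketch, not the lecturer's actual tooling: a minimal length-ratio score in the spirit of classical sentence aligners such as Gale-Church; real aligners additionally use dictionaries, as mentioned above.]
+ def length_score(src_sentence: str, tgt_sentence: str) -> float:
+     # Sentences of similar character length are more likely to be mutual translations.
+     a, b = len(src_sentence), len(tgt_sentence)
+     return min(a, b) / max(a, b)
+
+ print(length_score("Das ist ein Haus.", "This is a house."))   # close to 1.0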
+ 0:06:18.718 --> 0:06:25.069
217
+ What it would, of course, be a lot more difficult
218
+ is really to word alignment, and that's also
219
+
220
+ 0:06:25.069 --> 0:06:27.476
221
+ often no longer that good possible.
222
+
223
+ 0:06:27.476 --> 0:06:33.360
224
+ We do that automatically in some yes for symbols,
225
+ but it's definitely more challenging.
226
+
227
+ 0:06:33.733 --> 0:06:40.691
228
+ For sentence alignment, of course, it's still
229
+ not always perfect, so there might be that
230
+
231
+ 0:06:40.691 --> 0:06:46.085
232
+ there is two German sentences and one English
233
+ sentence or the other.
234
+
235
+ 0:06:46.085 --> 0:06:53.511
236
+ So there's not always perfect alignment, but
237
+ if you look at text, it's still bigly relatively.
238
+
239
+ 0:06:54.014 --> 0:07:03.862
240
+ If we have that then we can build a machine
241
+ learning model which tries to map ignition
242
+
243
+ 0:07:03.862 --> 0:07:06.239
244
+ sentences somewhere.
245
+
246
+ 0:07:06.626 --> 0:07:15.932
247
+ So this is the idea of behind statistical
248
+ machine translation and machine translation.
249
+
250
+ 0:07:15.932 --> 0:07:27.098
251
+ The difference is: Statistical machine translation
252
+ is typically a whole box of different models
253
+
254
+ 0:07:27.098 --> 0:07:30.205
255
+ which try to evaluate the.
256
+
257
+ 0:07:30.510 --> 0:07:42.798
258
+ In neural machine translation, it's all one
259
+ large neural network where we use the source sentence as
260
+
261
+ 0:07:42.798 --> 0:07:43.667
262
+ input.
263
+
264
+ 0:07:44.584 --> 0:07:50.971
265
+ And then we can train it by having exactly
266
+ this mapping port or parallel data.
267
+
268
+ 0:07:54.214 --> 0:08:02.964
269
+ So what we want today to look at today is
270
+ we want to first look at general text data.
271
+
272
+ 0:08:03.083 --> 0:08:06.250
273
+ So what is text data?
274
+
275
+ 0:08:06.250 --> 0:08:09.850
276
+ What text data is there?
277
+
278
+ 0:08:09.850 --> 0:08:18.202
279
+ Why is it challenging so that we have large
280
+ vocabularies?
281
+
282
+ 0:08:18.378 --> 0:08:22.003
283
+ It's so that you always have words which you
284
+ haven't seen.
285
+
286
+ 0:08:22.142 --> 0:08:29.053
287
+ If you increase your corporate science normally
288
+ you will also increase your vocabulary so you
289
+
290
+ 0:08:29.053 --> 0:08:30.744
291
+ always find new words.
292
+
293
+ 0:08:31.811 --> 0:08:39.738
294
+ Then based on that we'll look into pre-processing.
295
+
296
+ 0:08:39.738 --> 0:08:45.333
297
+ So how can we pre-process our data?
298
+
299
+ 0:08:45.333 --> 0:08:46.421
300
+ Maybe.
301
+
302
+ 0:08:46.526 --> 0:08:54.788
303
+ This is a lot about tokenization, for example,
304
+ which we heard is not so challenging in European
305
+
306
+ 0:08:54.788 --> 0:09:02.534
307
+ languages but still important, but might be
308
+ really difficult in Asian languages where you
309
+
310
+ 0:09:02.534 --> 0:09:05.030
311
+ don't have space separation.
312
+
313
+ 0:09:05.986 --> 0:09:12.161
314
+ And this preprocessing typically tries to
315
+ deal with the extreme cases where you have
316
+
317
+ 0:09:12.161 --> 0:09:13.105
318
+ seen things.
319
+
320
+ 0:09:13.353 --> 0:09:25.091
321
+ If you have seen your words three one hundred
322
+ times, it doesn't really matter if you have
323
+
324
+ 0:09:25.091 --> 0:09:31.221
325
+ seen them with them without punctuation or
326
+ so.
327
+
328
+ 0:09:31.651 --> 0:09:38.578
329
+ And then we look into word representation,
330
+ so what is the best way to represent a word?
331
+
332
+ 0:09:38.578 --> 0:09:45.584
333
+ And finally, we look into the other type of
334
+ data we really need for machine translation.
335
+
336
+ 0:09:45.725 --> 0:09:56.842
337
+ So in first we can use for many tasks, and
338
+ later we can also use purely monolingual data
339
+
340
+ 0:09:56.842 --> 0:10:00.465
341
+ to make machine translation.
342
+
343
+ 0:10:00.660 --> 0:10:03.187
344
+ So then the traditional approach was that
345
+ it was easier.
346
+
347
+ 0:10:03.483 --> 0:10:08.697
348
+ We have this type of language model which
349
+ we can train only on the target data to make
350
+
351
+ 0:10:08.697 --> 0:10:12.173
352
+ the text more fluent in neural machine translation
353
+ model.
354
+
355
+ 0:10:12.173 --> 0:10:18.106
356
+ It's partly a bit more complicated to integrate
357
+ this data but still it's very important especially
358
+
359
+ 0:10:18.106 --> 0:10:22.362
360
+ if you think about lower issue languages where
361
+ you have very few data.
362
+
363
+ 0:10:23.603 --> 0:10:26.999
364
+ It's harder to get parallel data than you
365
+ get monolingual data.
366
+
367
+ 0:10:27.347 --> 0:10:33.821
368
+ Because monolingual data you just have out
369
+ there not huge amounts for some languages,
370
+
371
+ 0:10:33.821 --> 0:10:38.113
372
+ but definitely the amount of data is always
373
+ significant.
374
+
375
+ 0:10:40.940 --> 0:10:50.454
376
+ When we talk about data, it's also of course
377
+ important how we use it for machine learning.
378
+
379
+ 0:10:50.530 --> 0:11:05.867
380
+ And that you hopefully learn in some prior
381
+ class, so typically we separate our data into
382
+
383
+ 0:11:05.867 --> 0:11:17.848
384
+ three chunks: So this is really by far the
385
+ largest, and this grows with the data we get.
386
+
387
+ 0:11:17.848 --> 0:11:21.387
388
+ Today we get here millions.
389
+
390
+ 0:11:22.222 --> 0:11:27.320
391
+ Then we have our validation data and that
392
+ is to train some type of parameters.
393
+
394
+ 0:11:27.320 --> 0:11:33.129
395
+ So not only you have some things to configure
396
+ and you don't know what is the right value,
397
+
398
+ 0:11:33.129 --> 0:11:39.067
399
+ so what you can do is train a model and change
400
+ these a bit and try to find the best ones on
401
+
402
+ 0:11:39.067 --> 0:11:40.164
403
+ your validation.
404
+
405
+ 0:11:40.700 --> 0:11:48.531
406
+ For a statistical model, for example data
407
+ in what you want to use if you have several
408
+
409
+ 0:11:48.531 --> 0:11:54.664
410
+ models: You know how to combine it, so how
411
+ much focus should you put on the different
412
+
413
+ 0:11:54.664 --> 0:11:55.186
414
+ models?
415
+
416
+ 0:11:55.186 --> 0:11:59.301
417
+ And if it's like twenty models, so it's only
418
+ twenty parameters.
419
+
420
+ 0:11:59.301 --> 0:12:02.828
421
+ It's not that much, so that is still bigly
422
+ estimated.
423
+
424
+ 0:12:03.183 --> 0:12:18.964
425
+ In your model there's often a question how
426
+ long should train the model before you have
427
+
428
+ 0:12:18.964 --> 0:12:21.322
429
+ overfitting.
430
+
431
+ 0:12:22.902 --> 0:12:28.679
432
+ And then you have your test data, which is
433
+ finally where you report on your test.
434
+
435
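+ [Editor's illustrative sketch, not from the lecture: splitting a parallel corpus into the three chunks described above; the 98/1/1 proportions, the fixed seed and the variable names are assumptions, and in practice test sets are often curated separately.]
+ import random
+
+ def split_corpus(sentence_pairs, valid_ratio=0.01, test_ratio=0.01, seed=42):
+     pairs = list(sentence_pairs)
+     random.Random(seed).shuffle(pairs)
+     n_valid = int(len(pairs) * valid_ratio)
+     n_test = int(len(pairs) * test_ratio)
+     valid = pairs[:n_valid]
+     test = pairs[n_valid:n_valid + n_test]
+     train = pairs[n_valid + n_test:]          # by far the largest chunk
+     return train, valid, test
+
+ # Hypothetical usage: train, valid, test = split_corpus(zip(source_sentences, target_sentences))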
+ 0:12:29.009 --> 0:12:33.663
436
+ And therefore it's also important that from
437
+ time to time you get new test data because
438
+
439
+ 0:12:33.663 --> 0:12:38.423
440
+ if you're always through your experiments you
441
+ test on it and then you do new experiments
442
+
443
+ 0:12:38.423 --> 0:12:43.452
444
+ and tests again at some point you have tested
445
+ so many on it that you do some type of training
446
+
447
+ 0:12:43.452 --> 0:12:48.373
448
+ on your test data again because you just select
449
+ the things which is at the end best on your
450
+
451
+ 0:12:48.373 --> 0:12:48.962
452
+ test data.
453
+
454
+ 0:12:49.009 --> 0:12:54.755
455
+ It's important to get a new test data from
456
+ time to time, for example in important evaluation
457
+
458
+ 0:12:54.755 --> 0:12:58.340
459
+ campaigns for machine translation and speech
460
+ translation.
461
+
462
+ 0:12:58.618 --> 0:13:07.459
463
+ There is like every year there should do tests
464
+ that create it so we can see if the model really
465
+
466
+ 0:13:07.459 --> 0:13:09.761
467
+ gets better on new data.
468
+
469
+ 0:13:10.951 --> 0:13:19.629
470
+ And of course it is important that this is
471
+ a representative of the use case you are interested.
472
+
473
+ 0:13:19.879 --> 0:13:36.511
474
+ So if you're building a system for translating
475
+ websites, this should be on websites.
476
+
477
+ 0:13:36.816 --> 0:13:39.356
478
+ So normally a system is good on some tasks.
479
+
480
+ 0:13:40.780 --> 0:13:48.596
481
+ I would solve everything and then your test
482
+ data should be out of everything because if
483
+
484
+ 0:13:48.596 --> 0:13:54.102
485
+ you only have a very small subset you know
486
+ it's good on this.
487
+
488
+ 0:13:54.394 --> 0:14:02.714
489
+ Therefore, the selection of your test data
490
+ is really important in order to ensure that
491
+
492
+ 0:14:02.714 --> 0:14:05.200
493
+ the MP system in the end.
494
+
495
+ 0:14:05.525 --> 0:14:12.646
496
+ Is the greatest system ever you have evaluated
497
+ on translating Bible.
498
+
499
+ 0:14:12.646 --> 0:14:21.830
500
+ The use case is to translate some Twitter
501
+ data and you can imagine the performance might
502
+
503
+ 0:14:21.830 --> 0:14:22.965
504
+ be really.
505
+
506
+ 0:14:23.803 --> 0:14:25.471
507
+ And privately.
508
+
509
+ 0:14:25.471 --> 0:14:35.478
510
+ Of course, in honor to have this and realistic
511
+ evaluation, it's important that there's no
512
+
513
+ 0:14:35.478 --> 0:14:39.370
514
+ overlap between this data because.
515
+
516
+ 0:14:39.799 --> 0:14:51.615
517
+ Because the danger might be is learning by
518
+ heart how to translate the sentences from your
519
+
520
+ 0:14:51.615 --> 0:14:53.584
521
+ training data.
522
+
523
+ 0:14:54.194 --> 0:15:04.430
524
+ That the test data is really different from
525
+ your training data.
526
+
527
+ 0:15:04.430 --> 0:15:16.811
528
+ Therefore, it's important to: So what type
529
+ of data we have?
530
+
531
+ 0:15:16.811 --> 0:15:24.966
532
+ There's a lot of different text data and the
533
+ nice thing is with digitalization.
534
+
535
+ 0:15:25.345 --> 0:15:31.785
536
+ You might think there's a large amount with
537
+ books, but to be honest books and printed things
538
+
539
+ 0:15:31.785 --> 0:15:35.524
540
+ that's by now a minor percentage of the data
541
+ we have.
542
+
543
+ 0:15:35.815 --> 0:15:39.947
544
+ There's like so much data created every day
545
+ on the Internet.
546
+
547
+ 0:15:39.980 --> 0:15:46.223
548
+ With social media and all the other types.
549
+
550
+ 0:15:46.223 --> 0:15:56.821
551
+ This of course is a largest amount of data,
552
+ more of colloquial language.
553
+
554
+ 0:15:56.856 --> 0:16:02.609
555
+ It might be more noisy and harder to process,
556
+ so there is a whole area on how to deal with
557
+
558
+ 0:16:02.609 --> 0:16:04.948
559
+ more social media and outdoor stuff.
560
+
561
+ 0:16:07.347 --> 0:16:20.702
562
+ What type of data is there if you think about
563
+ parallel data news type of data official sites?
564
+
565
+ 0:16:20.900 --> 0:16:26.629
566
+ So the first parallel corpora were like things
567
+ like the European Parliament or like some news
568
+
569
+ 0:16:26.629 --> 0:16:27.069
570
+ sites.
571
+
572
+ 0:16:27.227 --> 0:16:32.888
573
+ Nowadays there's quite a large amount of data
574
+ crawled from the Internet, but of course if
575
+
576
+ 0:16:32.888 --> 0:16:38.613
577
+ you crawl parallel data from the Internet,
578
+ a lot of the data is also like company websites
579
+
580
+ 0:16:38.613 --> 0:16:41.884
581
+ or so which gets translated into several languages.
582
+
583
+ 0:16:45.365 --> 0:17:00.613
584
+ Then, of course, there is different levels
585
+ of text and we have to look at what level we
586
+
587
+ 0:17:00.613 --> 0:17:05.118
588
+ want to process our data.
589
+
590
+ 0:17:05.885 --> 0:17:16.140
591
+ It one normally doesn't make sense to work
592
+ on full sentences because a lot of sentences
593
+
594
+ 0:17:16.140 --> 0:17:22.899
595
+ have never been seen and you always create
596
+ new sentences.
597
+
598
+ 0:17:23.283 --> 0:17:37.421
599
+ So typically what we take is our basic words,
600
+ something between words and letters, and that
601
+
602
+ 0:17:37.421 --> 0:17:40.033
603
+ is an essential.
604
+
605
+ 0:17:40.400 --> 0:17:47.873
606
+ So we need some of these atomic blocks or
607
+ basic blocks on which we can't make smaller.
608
+
609
+ 0:17:48.128 --> 0:17:55.987
610
+ So if we're building a sentence, for example,
611
+ you can build it out of something and you can
612
+
613
+ 0:17:55.987 --> 0:17:57.268
614
+ either decide.
615
+
616
+ 0:17:57.268 --> 0:18:01.967
617
+ For example, you take words and you spit them
618
+ further.
619
+
620
+ 0:18:03.683 --> 0:18:10.178
621
+ Then, of course, the nice thing is not too
622
+ small and therefore building larger things
623
+
624
+ 0:18:10.178 --> 0:18:11.386
625
+ like sentences.
626
+
627
+ 0:18:11.831 --> 0:18:16.690
628
+ So you only have to take your vocabulary and
629
+ put it somewhere together to get your full
630
+
631
+ 0:18:16.690 --> 0:18:17.132
632
+ center.
633
+
634
+ 0:18:19.659 --> 0:18:27.670
635
+ However, if it's too large, these blocks don't
636
+ occur often enough, and you have more blocks
637
+
638
+ 0:18:27.670 --> 0:18:28.715
639
+ that occur.
640
+
641
+ 0:18:29.249 --> 0:18:34.400
642
+ And that's why yeah we can work with blocks
643
+ for smaller like software blocks.
644
+
645
+ 0:18:34.714 --> 0:18:38.183
646
+ Work with neural models.
647
+
648
+ 0:18:38.183 --> 0:18:50.533
649
+ Then you can work on letters so you have a
650
+ system which tries to understand the sentence
651
+
652
+ 0:18:50.533 --> 0:18:53.031
653
+ letter by letter.
654
+
655
+ 0:18:53.313 --> 0:18:57.608
656
+ But that is a design decision which you have
657
+ to take at some point.
658
+
659
+ 0:18:57.608 --> 0:19:03.292
660
+ On which level do you want to split your text
661
+ and that of the evasive blocks that you are
662
+
663
+ 0:19:03.292 --> 0:19:04.176
664
+ working with?
665
+
666
+ 0:19:04.176 --> 0:19:06.955
667
+ And that's something we'll look into today.
668
+
669
+ 0:19:06.955 --> 0:19:08.471
670
+ What possibilities are?
671
+
672
+ 0:19:12.572 --> 0:19:14.189
673
+ Any question.
674
+
675
+ 0:19:17.998 --> 0:19:24.456
676
+ Then let's look a bit on what type of data
677
+ there is in how much data there is to person.
678
+
679
+ 0:19:24.824 --> 0:19:34.006
680
+ Is that nowadays, at least for pure text,
681
+ it's no longer for some language.
682
+
683
+ 0:19:34.006 --> 0:19:38.959
684
+ There is so much data we cannot process.
685
+
686
+ 0:19:39.479 --> 0:19:49.384
687
+ That is only true for some languages, but
688
+ there is also interest in other languages and
689
+
690
+ 0:19:49.384 --> 0:19:50.622
691
+ important.
692
+
693
+ 0:19:50.810 --> 0:20:01.483
694
+ So if you want to build a system for Swedish
695
+ or for some dialect in other countries, then
696
+
697
+ 0:20:01.483 --> 0:20:02.802
698
+ of course.
699
+
700
+ 0:20:03.103 --> 0:20:06.888
701
+ Otherwise you have this huge amount of data here.
702
+
703
+ 0:20:06.888 --> 0:20:11.515
704
+ We are often no longer talking about gigabytes
705
+ or more.
706
+
707
+ 0:20:11.891 --> 0:20:35.788
708
+ The general information that is produced every
709
+ year is: And this is like all the information
710
+
711
+ 0:20:35.788 --> 0:20:40.661
712
+ that are available in the, so there are really.
713
+
714
+ 0:20:41.001 --> 0:20:44.129
715
+ We look at machine translation.
716
+
717
+ 0:20:44.129 --> 0:20:53.027
718
+ We can see these numbers are really like more
719
+ than ten years old, but we see this increase
720
+
721
+ 0:20:53.027 --> 0:20:58.796
722
+ in one billion words we had at that time for
723
+ English data.
724
+
725
+ 0:20:59.019 --> 0:21:01.955
726
+ Then I wore like new shuffle on Google Maps
727
+ and stuff.
728
+
729
+ 0:21:02.382 --> 0:21:05.003
730
+ For this one you could train your system on.
731
+
732
+ 0:21:05.805 --> 0:21:20.457
733
+ And the interesting thing is this one billion
734
+ words is more than any human typically speaks.
735
+
736
+ 0:21:21.001 --> 0:21:25.892
737
+ So these systems they see by now like a magnitude
738
+ of more data.
739
+
740
+ 0:21:25.892 --> 0:21:32.465
741
+ We know I think are a magnitude higher of
742
+ more data than a human has ever seen in his
743
+
744
+ 0:21:32.465 --> 0:21:33.229
745
+ lifetime.
746
+
747
+ 0:21:35.175 --> 0:21:41.808
748
+ And that is maybe the interesting thing why
749
+ it still doesn't work on it because you see
750
+
751
+ 0:21:41.808 --> 0:21:42.637
752
+ they seem.
753
+
754
+ 0:21:43.103 --> 0:21:48.745
755
+ So we are seeing a really impressive result,
756
+ but in most cases it's not that they're really
757
+
758
+ 0:21:48.745 --> 0:21:49.911
759
+ better than human.
760
+
761
+ 0:21:50.170 --> 0:21:56.852
762
+ However, they really have seen more data than
763
+ any human ever has seen in this lifetime.
764
+
765
+ 0:21:57.197 --> 0:22:01.468
766
+ They can just process so much data, so.
767
+
768
+ 0:22:01.501 --> 0:22:08.425
769
+ The question is, can we make them more efficient
770
+ so that they can learn similarly good without
771
+
772
+ 0:22:08.425 --> 0:22:09.592
773
+ that much data?
774
+
775
+ 0:22:09.592 --> 0:22:16.443
776
+ And that is essential if we now go to low-resource
777
+ languages where we might never get that much
778
+
779
+ 0:22:16.443 --> 0:22:21.254
780
+ data, and we should be also able to achieve
781
+ a reasonable perform.
782
+
783
+ 0:22:23.303 --> 0:22:32.399
784
+ On the other hand, this of course links also
785
+ to one topic which we will cover later: If
786
+
787
+ 0:22:32.399 --> 0:22:37.965
788
+ you think about this, it's really important
789
+ that your algorithms are also very efficient
790
+
791
+ 0:22:37.965 --> 0:22:41.280
792
+ in order to process that much data both in
793
+ training.
794
+
795
+ 0:22:41.280 --> 0:22:46.408
796
+ If you have more data, you want to process
797
+ more data so you can make use of that.
798
+
799
+ 0:22:46.466 --> 0:22:54.499
800
+ On the other hand, if more and more data is
801
+ processed, more and more people will use machine
802
+
803
+ 0:22:54.499 --> 0:23:06.816
804
+ translation to generate translations, and it
805
+ will be important to: And there is yeah, there
806
+
807
+ 0:23:06.816 --> 0:23:07.257
808
+ is.
809
+
810
+ 0:23:07.607 --> 0:23:10.610
811
+ More.
812
+
813
+ 0:23:10.170 --> 0:23:17.262
814
+ More data generated every day, we hear just
815
+ some general numbers on how much data there
816
+
817
+ 0:23:17.262 --> 0:23:17.584
818
+ is.
819
+
820
+ 0:23:17.584 --> 0:23:24.595
821
+ It says that a lot of the data we produce
822
+ at least at the moment is text rich, so text
823
+
824
+ 0:23:24.595 --> 0:23:26.046
825
+ that is produced.
826
+
827
+ 0:23:26.026 --> 0:23:29.748
828
+ That is very important to either wise.
829
+
830
+ 0:23:29.748 --> 0:23:33.949
831
+ We can use it as training data in some way.
832
+
833
+ 0:23:33.873 --> 0:23:40.836
834
+ That we want to translate some of that because
835
+ it might not be published in all the languages,
836
+
837
+ 0:23:40.836 --> 0:23:46.039
838
+ and step with the need for machine translation
839
+ is even more important.
840
+
841
+ 0:23:47.907 --> 0:23:51.547
842
+ So what are the challenges with this?
843
+
844
+ 0:23:51.831 --> 0:24:01.360
845
+ So first of all that seems to be very good
846
+ news, so there is more and more data, so we
847
+
848
+ 0:24:01.360 --> 0:24:10.780
849
+ can just wait for three years and have more
850
+ data, and then our system will be better.
851
+
852
+ 0:24:11.011 --> 0:24:22.629
853
+ If you see in competitions, the system performance
854
+ increases.
855
+
856
+ 0:24:24.004 --> 0:24:27.190
857
+ See that here are three different systems.
858
+
859
+ 0:24:27.190 --> 0:24:34.008
860
+ BLEU score is a metric to measure how good an
861
+ MT system is and we'll talk about evaluation
862
+
863
+ 0:24:34.008 --> 0:24:40.974
864
+ and the next week so you'll have to evaluate
865
+ machine translation and also a practical session.
866
+
867
+ 0:24:41.581 --> 0:24:45.219
868
+ And so.
869
+
870
+ 0:24:44.784 --> 0:24:50.960
871
+ This shows you that this is like how much
872
+ data of the training data you have five percent.
873
+
874
+ 0:24:50.960 --> 0:24:56.117
875
+ You're significantly worse than if you're
876
+ forty percent and eighty percent.
877
+
878
+ 0:24:56.117 --> 0:25:02.021
879
+ You're getting better and you're seeing two
880
+ between this curve, which maybe not really
881
+
882
+ 0:25:02.021 --> 0:25:02.971
883
+ flattens out.
884
+
885
+ 0:25:02.971 --> 0:25:03.311
886
+ But.
887
+
888
+ 0:25:03.263 --> 0:25:07.525
889
+ Of course, the gains you get are normally
890
+ smaller and smaller.
891
+
892
+ 0:25:07.525 --> 0:25:09.216
893
+ The more data you have,.
894
+
895
+ 0:25:09.549 --> 0:25:21.432
896
+ If your improvements are unnormally better,
897
+ if you add the same thing or even double your
898
+
899
+ 0:25:21.432 --> 0:25:25.657
900
+ data late, of course more data.
901
+
902
+ 0:25:26.526 --> 0:25:34.955
903
+ However, you see the clear tendency if you
904
+ need to improve your system.
905
+
906
+ 0:25:34.955 --> 0:25:38.935
907
+ This is possible by just getting.
908
+
909
+ 0:25:39.039 --> 0:25:41.110
910
+ But it's not all about data.
911
+
912
+ 0:25:41.110 --> 0:25:45.396
913
+ It can also be the domain of the data that
914
+ you're building for.
915
+
916
+ 0:25:45.865 --> 0:25:55.668
917
+ So this was a test on machine translation
918
+ system on translating genome data.
919
+
920
+ 0:25:55.668 --> 0:26:02.669
921
+ We have the like SAI said he's working on
922
+ translating.
923
+
924
+ 0:26:02.862 --> 0:26:06.868
925
+ Here you see the performance measured with the BLEU score.
926
+
927
+ 0:26:06.868 --> 0:26:12.569
928
+ You see one system which only was trained
929
+ on genome data and it only has.
930
+
931
+ 0:26:12.812 --> 0:26:17.742
932
+ That's very, very few for machine translation.
933
+
934
+ 0:26:18.438 --> 0:26:23.927
935
+ And to compare that to a system which was
936
+ generally trained on news translation data.
937
+
938
+ 0:26:24.104 --> 0:26:34.177
939
+ With four point five million sentences so
940
+ roughly one hundred times as much data you
941
+
942
+ 0:26:34.177 --> 0:26:40.458
943
+ still see that this system doesn't really work
944
+ well.
945
+
946
+ 0:26:40.820 --> 0:26:50.575
947
+ So you see it's not only about data, it's
948
+ also that the data has to somewhat fit to the
949
+
950
+ 0:26:50.575 --> 0:26:51.462
951
+ domain.
952
+
953
+ 0:26:51.831 --> 0:26:58.069
954
+ The more general data you get that you have
955
+ covered up all domains.
956
+
957
+ 0:26:58.418 --> 0:27:07.906
958
+ But that's very difficult and especially for
959
+ more specific domains.
960
+
961
+ 0:27:07.906 --> 0:27:16.696
962
+ It can be really important to get data which
963
+ fits your domain.
964
+
965
+ 0:27:16.716 --> 0:27:18.520
966
+ Maybe if you can do some prompting
967
+ or something like that, maybe if you.
968
+
969
+ 0:27:18.598 --> 0:27:22.341
970
+ To say okay, concentrate this as you like
971
+ for being at better.
972
+
973
+ 0:27:24.564 --> 0:27:28.201
974
+ It's not that easy to prompt it.
975
+
976
+ 0:27:28.201 --> 0:27:35.807
977
+ You can do the prompting in the more traditional
978
+ way of fine tuning.
979
+
980
+ 0:27:35.807 --> 0:27:44.514
981
+ Then, of course, if you select UIV later combine
982
+ this one, you can get better.
983
+
984
+ 0:27:44.904 --> 0:27:52.675
985
+ But it will always be that this type of similar
986
+ data is much more important than the general.
987
+
988
+ 0:27:52.912 --> 0:28:00.705
989
+ So of course it can make the lower system
990
+ a lot better if you search for similar data
991
+
992
+ 0:28:00.705 --> 0:28:01.612
993
+ and find.
994
+
995
+ 0:28:02.122 --> 0:28:08.190
996
+ Will have a lecture on domain adaptation where
997
+ it's exactly the idea how you can make systems
998
+
999
+ 0:28:08.190 --> 0:28:13.935
1000
+ in these situations better so you can adapt
1001
+ it to this data but then you still need this
1002
+
1003
+ 0:28:13.935 --> 0:28:14.839
1004
+ type of data.
1005
+
1006
+ 0:28:15.335 --> 0:28:21.590
1007
+ And in prompting it might work if you have
1008
+ seen it in your data so it can make the system
1009
+
1010
+ 0:28:21.590 --> 0:28:25.134
1011
+ aware and tell it focus more in this type of
1012
+ data.
1013
+
1014
+ 0:28:25.465 --> 0:28:30.684
1015
+ But if you haven't had enough of the really
1016
+ specific good matching data, I think it will
1017
+
1018
+ 0:28:30.684 --> 0:28:31.681
1019
+ always not work.
1020
+
1021
+ 0:28:31.681 --> 0:28:37.077
1022
+ So you need to have this type of data and
1023
+ therefore it's important not only to have general
1024
+
1025
+ 0:28:37.077 --> 0:28:42.120
1026
+ data but also data, at least in your overall
1027
+ system, which really fits to the domain.
1028
+
1029
+ 0:28:45.966 --> 0:28:53.298
1030
+ And then the second thing, of course, is you
1031
+ need to have data that has good quality.
1032
+
1033
+ 0:28:53.693 --> 0:29:00.170
1034
+ In the early stages it might be good to have
1035
+ all the data but later it's especially important
1036
+
1037
+ 0:29:00.170 --> 0:29:06.577
1038
+ that you have somehow good quality and so that
1039
+ you're learning what you really want to learn
1040
+
1041
+ 0:29:06.577 --> 0:29:09.057
1042
+ and not learning some great things.
1043
+
1044
+ 0:29:10.370 --> 0:29:21.551
1045
+ We talked about this with the kilometers and
1046
+ miles, so if you just take in some type of
1047
+
1048
+ 0:29:21.551 --> 0:29:26.253
1049
+ data and don't look at the quality,.
1050
+
1051
+ 0:29:26.766 --> 0:29:30.875
1052
+ But of course, the question here is what is
1053
+ good quality data?
1054
+
1055
+ 0:29:31.331 --> 0:29:35.054
1056
+ It is not yet that easy to define what is
1057
+ a good quality data.
1058
+
1059
+ 0:29:36.096 --> 0:29:43.961
1060
+ That doesn't mean it has to what people generally
1061
+ assume as high quality text or so, like written
1062
+
1063
+ 0:29:43.961 --> 0:29:47.814
1064
+ by a Nobel Prize winner or something like that.
1065
+
1066
+ 0:29:47.814 --> 0:29:54.074
1067
+ This is not what we mean by this quality,
1068
+ but again the most important again.
1069
+
1070
+ 0:29:54.354 --> 0:30:09.181
1071
+ So if you have Twitter data, high quality
1072
+ data doesn't mean you have now some novels.
1073
+
1074
+ 0:30:09.309 --> 0:30:12.875
1075
+ Test data, but it should also be represented
1076
+ similarly.
1077
+
1078
+ 0:30:12.875 --> 0:30:18.480
1079
+ Don't have, for example, quality definitely
1080
+ as it should be really translating yourself
1081
+
1082
+ 0:30:18.480 --> 0:30:18.862
1083
+ into.
1084
+
1085
+ 0:30:19.199 --> 0:30:25.556
1086
+ So especially if you corral data you would
1087
+ often have that it's not a direct translation.
1088
+
1089
+ 0:30:25.805 --> 0:30:28.436
1090
+ So then, of course, this is not high quality
1091
+ teaching.
1092
+
1093
+ 0:30:29.449 --> 0:30:39.974
1094
+ But in generally that's a very difficult thing
1095
+ to, and it's very difficult to design what
1096
+
1097
+ 0:30:39.974 --> 0:30:41.378
1098
+ is reading.
1099
+
1100
+ 0:30:41.982 --> 0:30:48.333
1101
+ And of course a biometric is always the quality
1102
+ of your data is good if your machine translation.
1103
+
1104
+ 0:30:48.648 --> 0:30:50.719
1105
+ So that is like the indirect.
1106
+
1107
+ 0:30:50.991 --> 0:30:52.447
1108
+ Well, what can we motive?
1109
+
1110
+ 0:30:52.447 --> 0:30:57.210
1111
+ Of course, it's difficult to always try a
1112
+ lot of things and evaluate either of them,
1113
+
1114
+ 0:30:57.210 --> 0:30:59.396
1115
+ build a full MT system and then check.
1116
+
1117
+ 0:30:59.396 --> 0:31:00.852
1118
+ Oh, was this a good idea?
1119
+
1120
+ 0:31:00.852 --> 0:31:01.357
1121
+ I mean,.
1122
+
1123
+ 0:31:01.581 --> 0:31:19.055
1124
+ You have two tokenizers which split sentences
1125
+ into words, and you want to know which one you really want to apply.
1126
+
1127
+ 0:31:19.179 --> 0:31:21.652
1128
+ Now you could maybe argue or your idea could
1129
+ be.
1130
+
1131
+ 0:31:21.841 --> 0:31:30.186
1132
+ Just take it there very fast and then get
1133
+ the result, but the problem is there is not
1134
+
1135
+ 0:31:30.186 --> 0:31:31.448
1136
+ always this.
1137
+
1138
+ 0:31:31.531 --> 0:31:36.269
1139
+ One thing that works very well for small data.
1140
+
1141
+ 0:31:36.269 --> 0:31:43.123
1142
+ It's not for sure that the same effect will
1143
+ happen in large stages.
1144
+
1145
+ 0:31:43.223 --> 0:31:50.395
1146
+ This idea really improves on very low resource
1147
+ data if only train on hundred words.
1148
+
1149
+ 0:31:51.271 --> 0:31:58.357
1150
+ But if you use it for a large data set, it
1151
+ doesn't really matter and all your ideas not.
1152
+
1153
+ 0:31:58.598 --> 0:32:01.172
1154
+ So that is also a typical thing.
1155
+
1156
+ 0:32:01.172 --> 0:32:05.383
1157
+ This quality issue is more and more important
1158
+ if you.
1159
+
1160
+ 0:32:06.026 --> 0:32:16.459
1161
+ By one motivation which generally you should
1162
+ have, you want to represent your data in having
1163
+
1164
+ 0:32:16.459 --> 0:32:17.469
1165
+ as many.
1166
+
1167
+ 0:32:17.677 --> 0:32:21.805
1168
+ Why is this the case any idea?
1169
+
1170
+ 0:32:21.805 --> 0:32:33.389
1171
+ Why this could be a motivation that we try
1172
+ to represent the data in a way that we have
1173
+
1174
+ 0:32:33.389 --> 0:32:34.587
1175
+ as many.
1176
+
1177
+ 0:32:38.338 --> 0:32:50.501
1178
+ We also want to learn about the context because
1179
+ maybe sometimes something occurs in the context.
1180
+
1181
+ 0:32:52.612 --> 0:32:54.020
1182
+ The context is here.
1183
+
1184
+ 0:32:54.020 --> 0:32:56.432
1185
+ It's more about the learning first.
1186
+
1187
+ 0:32:56.432 --> 0:33:00.990
1188
+ You can generally learn better if you've seen
1189
+ something more often.
1190
+
1191
+ 0:33:00.990 --> 0:33:06.553
1192
+ So if you have seen an event only once, it's
1193
+ really hard to learn about the event.
1194
+
1195
+ 0:33:07.107 --> 0:33:15.057
1196
+ If you have seen an event a hundred times
1197
+ your bearing estimating which and maybe that
1198
+
1199
+ 0:33:15.057 --> 0:33:18.529
1200
+ is the context, then you can use the.
1201
+
1202
+ 0:33:18.778 --> 0:33:21.331
1203
+ So, for example, if you here have the word
1204
+ towels.
1205
+
1206
+ 0:33:21.761 --> 0:33:28.440
1207
+ If you would just take the data normally you
1208
+ would directly process the data.
1209
+
1210
+ 0:33:28.440 --> 0:33:32.893
1211
+ In the upper case you would have the house with
1212
+ the dot.
1213
+
1214
+ 0:33:32.893 --> 0:33:40.085
1215
+ That's a different word than the house this
1216
+ way and then the house with the comma.
1217
+
1218
+ 0:33:40.520 --> 0:33:48.365
1219
+ So you want to learn how this translates into
1220
+ house, but you translate an upper case.
1221
+
1222
+ 0:33:48.365 --> 0:33:50.281
1223
+ How this translates.
1224
+
1225
+ 0:33:50.610 --> 0:33:59.445
1226
+ You were learning how to translate into house
1227
+ and house, so you have to learn four different
1228
+
1229
+ 0:33:59.445 --> 0:34:00.205
1230
+ things.
1231
+
1232
+ 0:34:00.205 --> 0:34:06.000
1233
+ Instead, we really want to learn that house
1234
+ gets into house.
1235
+
1236
+ 0:34:06.366 --> 0:34:18.796
1237
+ And then imagine if it would be even a beak,
1238
+ it might be like here a house would be into.
1239
+
1240
+ 0:34:18.678 --> 0:34:22.089
1241
+ Good-bye Then.
1242
+
1243
+ 0:34:22.202 --> 0:34:29.512
1244
+ If it's an upper case then I always have to
1245
+ translate it into a boiler while it's a lower
1246
+
1247
+ 0:34:29.512 --> 0:34:34.955
1248
+ case that is translated into house and that's
1249
+ of course not right.
1250
+
1251
+ 0:34:34.955 --> 0:34:39.260
1252
+ We have to use the context to decide what
1253
+ is better.
1254
+
1255
+ 0:34:39.679 --> 0:34:47.086
1256
+ If you have seen an event several times then
1257
+ you are better able to learn your model and
1258
+
1259
+ 0:34:47.086 --> 0:34:51.414
1260
+ that doesn't matter what type of learning you
1261
+ have.
1262
+
1263
+ 0:34:52.392 --> 0:34:58.981
1264
+ I shouldn't say all but for most of these
1265
+ models it's always better to have like seen
1266
+
1267
+ 0:34:58.981 --> 0:35:00.897
1268
+ an event more often.
1269
+
1270
+ 0:35:00.920 --> 0:35:11.483
1271
+ Therefore, if you preprocessive data, you
1272
+ should ask the question how can represent data
1273
+
1274
+ 0:35:11.483 --> 0:35:14.212
1275
+ in order to have seen.
1276
+
1277
+ 0:35:14.514 --> 0:35:17.885
1278
+ Of course you should not remove that information.
1279
+
1280
+ 0:35:18.078 --> 0:35:25.519
1281
+ So you could now, of course, just lowercase
1282
+ everything.
1283
+
1284
+ 0:35:25.519 --> 0:35:30.303
1285
+ Then you've seen things more often.
1286
+
1287
+ 0:35:30.710 --> 0:35:38.443
1288
+ And that might be an issue because in the
1289
+ final application you want to have real text
1290
+
1291
+ 0:35:38.443 --> 0:35:38.887
1292
+ and.
1293
+
1294
+ 0:35:40.440 --> 0:35:44.003
1295
+ And finally, even it's more important than
1296
+ it's consistent.
1297
+
1298
+ 0:35:44.965 --> 0:35:52.630
1299
+ So this is a problem where, for example, aren't
1300
+ consistent.
1301
+
1302
+ 0:35:52.630 --> 0:35:58.762
1303
+ So I am, I'm together written in training
1304
+ data.
1305
+
1306
+ 0:35:58.762 --> 0:36:04.512
1307
+ And if you're not in test data, have a high.
1308
+
1309
+ 0:36:04.824 --> 0:36:14.612
1310
+ Therefore, most important is to generate preprocessing
1311
+ and represent your data that is most consistent
1312
+
1313
+ 0:36:14.612 --> 0:36:18.413
1314
+ because it's easier to map how similar.
1315
+
1316
+ 0:36:18.758 --> 0:36:26.588
1317
+ If your text is represented very, very differently
1318
+ then your data will be badly translated.
1319
+
1320
+ 0:36:26.666 --> 0:36:30.664
1321
+ So we once had the case.
1322
+
1323
+ 0:36:30.664 --> 0:36:40.420
1324
+ For example, there is some data who wrote
1325
+ it, but in German.
1326
+
1327
+ 0:36:40.900 --> 0:36:44.187
1328
+ And if you read it as a human you see it.
1329
+
1330
+ 0:36:44.187 --> 0:36:49.507
1331
+ It's even hard to get the difference because
1332
+ it looks very similar.
1333
+
1334
+ 0:36:50.130 --> 0:37:02.997
1335
+ If you use it for a machine translation system,
1336
+ it would not be able to translate anything
1337
+
1338
+ 0:37:02.997 --> 0:37:08.229
1339
+ of it because it's a different word.
1340
+
1341
+ 0:37:09.990 --> 0:37:17.736
1342
+ And especially on the other hand you should
1343
+ of course not rechange significant training
1344
+
1345
+ 0:37:17.736 --> 0:37:18.968
1346
+ data thereby.
1347
+
1348
+ 0:37:18.968 --> 0:37:27.155
1349
+ For example, removing case information because
1350
+ if your task is to generate case information.
1351
+
1352
+ 0:37:31.191 --> 0:37:41.081
1353
+ One thing which is a bit point to look into
1354
+ it in order to see the difficulty of your data
1355
+
1356
+ 0:37:41.081 --> 0:37:42.711
1357
+ is to compare.
1358
+
1359
+ 0:37:43.103 --> 0:37:45.583
1360
+ There are types and tokens.
1361
+
1362
+ 0:37:45.583 --> 0:37:57.983
1363
+ We mean the number of unique words in the
1364
+ corpus, so your vocabulary and the tokens.
1365
+
1366
+ 0:37:58.298 --> 0:38:08.628
1367
+ And then you can look at the type token ratio
1368
+ that means a number of types per token.
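+
+ A minimal Python sketch of this count, assuming simple whitespace tokenization (illustrative only):
+
+ ```python
+ # Types are the unique words, tokens are the running words;
+ # the type/token ratio is their quotient.
+ from collections import Counter
+
+ def type_token_ratio(text):
+     tokens = text.split()
+     types = Counter(tokens)
+     return len(types), len(tokens), len(types) / len(tokens)
+
+ print(type_token_ratio("the house and the dog and the cat"))
+ # (5, 8, 0.625)
+ ```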
1369
+
1370
+ 0:38:15.815 --> 0:38:22.381
1371
+ You have fewer types than tokens because every
1372
+ word appears at least once in the corpus, but most
1373
+
1374
+ 0:38:22.381 --> 0:38:27.081
1375
+ of them will occur more often until this number
1376
+ is bigger, so.
1377
+
1378
+ 0:38:27.667 --> 0:38:30.548
1379
+ And of course this changes if you have more
1380
+ date.
1381
+
1382
+ 0:38:31.191 --> 0:38:38.103
1383
+ Here is an example from an English Wikipedia.
1384
+
1385
+ 0:38:38.103 --> 0:38:45.015
1386
+ That means each word in average occurs times.
1387
+
1388
+ 0:38:45.425 --> 0:38:47.058
1389
+ Of course there's a big difference.
1390
+
1391
+ 0:38:47.058 --> 0:38:51.323
1392
+ There will be some words which occur one hundred
1393
+ times, but therefore most of the words occur
1394
+
1395
+ 0:38:51.323 --> 0:38:51.777
1396
+ only one.
1397
+
1398
+ 0:38:52.252 --> 0:38:55.165
1399
+ However, you see this ratio goes down.
1400
+
1401
+ 0:38:55.165 --> 0:39:01.812
1402
+ That's a good thing, so you have seen each
1403
+ word more often and therefore your model gets
1404
+
1405
+ 0:39:01.812 --> 0:39:03.156
1406
+ typically better.
1407
+
1408
+ 0:39:03.156 --> 0:39:08.683
1409
+ However, the problem is we always have a lot
1410
+ of words which we have seen.
1411
+
1412
+ 0:39:09.749 --> 0:39:15.111
1413
+ Even here there will be a bound of words which
1414
+ you have only seen once.
1415
+
1416
+ 0:39:15.111 --> 0:39:20.472
1417
+ However, this can give you an indication about
1418
+ the quality of the data.
1419
+
1420
+ 0:39:20.472 --> 0:39:27.323
1421
+ So you should always, of course, try to achieve
1422
+ data where you have a very low type to talk
1423
+
1424
+ 0:39:27.323 --> 0:39:28.142
1425
+ and ratio.
1426
+
1427
+ 0:39:28.808 --> 0:39:39.108
1428
+ For example, if you compare Simple and
1429
+ normal Wikipedia, what would be your expectation?
1430
+
1431
+ 0:39:41.861 --> 0:39:49.842
1432
+ Yes, that's exactly, but however it's surprisingly
1433
+ only a little bit lower, but you see that it's
1434
+
1435
+ 0:39:49.842 --> 0:39:57.579
1436
+ lower, so we are using less words to express
1437
+ the same thing, and therefore the task to produce
1438
+
1439
+ 0:39:57.579 --> 0:39:59.941
1440
+ this text is also easier.
1441
+
1442
+ 0:40:01.221 --> 0:40:07.702
1443
+ However, as how many words are there, there
1444
+ is no clear definition.
1445
+
1446
+ 0:40:07.787 --> 0:40:19.915
1447
+ So there will be always more words, especially
1448
+ depending on your dataset, how many different
1449
+
1450
+ 0:40:19.915 --> 0:40:22.132
1451
+ words there are.
1452
+
1453
+ 0:40:22.482 --> 0:40:30.027
1454
+ So if you have million tweets where around
1455
+ fifty million tokens and you have six hundred
1456
+
1457
+ 0:40:30.027 --> 0:40:30.875
1458
+ thousand.
1459
+
1460
+ 0:40:31.251 --> 0:40:40.299
1461
+ If you have times this money teen tweeds you
1462
+ also have significantly more tokens but also.
1463
+
1464
+ 0:40:40.660 --> 0:40:58.590
1465
+ So especially in things like the social media,
1466
+ of course, there's always different types of
1467
+
1468
+ 0:40:58.590 --> 0:40:59.954
1469
+ words.
1470
+
1471
+ 0:41:00.040 --> 0:41:04.028
1472
+ Another example from not social media is here.
1473
+
1474
+ 0:41:04.264 --> 0:41:18.360
1475
+ So yeah, there is a small corpus, Switchboard, with
1476
+ phone conversations, two million tokens, and
1477
+
1478
+ 0:41:18.360 --> 0:41:22.697
1479
+ only twenty thousand words.
1480
+
1481
+ 0:41:23.883 --> 0:41:37.221
1482
+ If you think about Shakespeare, it has even
1483
+ less token, significantly less than a million,
1484
+
1485
+ 0:41:37.221 --> 0:41:40.006
1486
+ but the number of.
1487
+
1488
+ 0:41:40.060 --> 0:41:48.781
1489
+ On the other hand, there is this Google N-gram
1490
+ corpus which has tokens and there is always
1491
+
1492
+ 0:41:48.781 --> 0:41:50.506
1493
+ new words coming.
1494
+
1495
+ 0:41:50.991 --> 0:41:52.841
1496
+ Is English.
1497
+
1498
+ 0:41:52.841 --> 0:42:08.103
1499
+ The nice thing about English is that the vocabulary
1500
+ is relatively small, too small, but relatively
1501
+
1502
+ 0:42:08.103 --> 0:42:09.183
1503
+ small.
1504
+
1505
+ 0:42:09.409 --> 0:42:14.224
1506
+ So here you see the Ted Corpus here.
1507
+
1508
+ 0:42:15.555 --> 0:42:18.144
1509
+ All know Ted's lectures.
1510
+
1511
+ 0:42:18.144 --> 0:42:26.429
1512
+ They are transcribed and translated, a nice resource
1513
+ for us, an especially small corpus.
1514
+
1515
+ 0:42:26.846 --> 0:42:32.702
1516
+ You can do a lot of experiments with that
1517
+ and you see that the corpus site is relatively
1518
+
1519
+ 0:42:32.702 --> 0:42:36.782
1520
+ similar so we have around four million tokens
1521
+ in this corpus.
1522
+
1523
+ 0:42:36.957 --> 0:42:44.464
1524
+ However, if you look at the vocabulary, English
1525
+ has half as many different words
1526
+
1527
+ 0:42:44.464 --> 0:42:47.045
1528
+ as German and Dutch and Italian.
1529
+
1530
+ 0:42:47.527 --> 0:42:56.260
1531
+ So this is one influence from compound words,
1532
+ which are more frequent in German, and even
1533
+
1534
+ 0:42:56.260 --> 0:43:02.978
1535
+ more important since we have all these different
1536
+ morphological forms.
1537
+
1538
+ 0:43:03.263 --> 0:43:08.170
1539
+ There all leads to new words and they need
1540
+ to be somewhat expressed in there.
1541
+
1542
+ 0:43:11.531 --> 0:43:20.278
1543
+ So to deal with this, the question is how
1544
+ can we normalize the text in order to make
1545
+
1546
+ 0:43:20.278 --> 0:43:22.028
1547
+ the text easier?
1548
+
1549
+ 0:43:22.028 --> 0:43:25.424
1550
+ Can we simplify the task easier?
1551
+
1552
+ 0:43:25.424 --> 0:43:29.231
1553
+ But we need to keep all information.
1554
+
1555
+ 0:43:29.409 --> 0:43:32.239
1556
+ So an example where not all information skipped.
1557
+
1558
+ 0:43:32.239 --> 0:43:35.012
1559
+ Of course you make the task easier if you
1560
+ just.
1561
+
1562
+ 0:43:35.275 --> 0:43:41.141
1563
+ You don't have to deal with different cases.
1564
+
1565
+ 0:43:41.141 --> 0:43:42.836
1566
+ It's easier.
1567
+
1568
+ 0:43:42.836 --> 0:43:52.482
1569
+ However, information gets lost and you might
1570
+ need to generate the target.
1571
+
1572
+ 0:43:52.832 --> 0:44:00.153
1573
+ So the question is always: How can we on the
1574
+ one hand simplify the task but keep all the
1575
+
1576
+ 0:44:00.153 --> 0:44:01.223
1577
+ information?
1578
+
1579
+ 0:44:01.441 --> 0:44:06.639
1580
+ Say necessary because it depends on the task.
1581
+
1582
+ 0:44:06.639 --> 0:44:11.724
1583
+ For some tasks you might find to remove the.
1584
+
1585
+ 0:44:14.194 --> 0:44:23.463
1586
+ So the steps they were typically doing are
1587
+ that you can the segment and words in a running
1588
+
1589
+ 0:44:23.463 --> 0:44:30.696
1590
+ text, so you can normalize word forms and segmentation
1591
+ into sentences.
1592
+
1593
+ 0:44:30.696 --> 0:44:33.955
1594
+ Also, if you have not a single.
1595
+
1596
+ 0:44:33.933 --> 0:44:38.739
1597
+ If this is not a redundancy point to segments,
1598
+ the text is also into segments.
1599
+
1600
+ 0:44:39.779 --> 0:44:52.609
1601
+ So what are we doing there for European language
1602
+ segmentation into words?
1603
+
1604
+ 0:44:52.609 --> 0:44:57.290
1605
+ It's not that complicated.
1606
+
1607
+ 0:44:57.277 --> 0:45:06.001
1608
+ You have to somehow handle the joint words
1609
+ and by handling joint words the most important.
1610
+
1611
+ 0:45:06.526 --> 0:45:11.331
1612
+ So in most systems it really doesn't matter
1613
+ much.
1614
+
1615
+ 0:45:11.331 --> 0:45:16.712
1616
+ If you write, I'm together as one word or
1617
+ as two words.
1618
+
1619
+ 0:45:17.197 --> 0:45:23.511
1620
+ The nice thing about 'I'm' is maybe it occurs
1621
+ so often that it doesn't matter if you do both,
1622
+
1623
+ 0:45:23.511 --> 0:45:26.560
1624
+ and if they're both accrued often enough.
1625
+
1626
+ 0:45:26.560 --> 0:45:32.802
1627
+ But you'll have some of these cases where
1628
+ they don't occur there often, so you should
1629
+
1630
+ 0:45:32.802 --> 0:45:35.487
1631
+ be as consistent as possible.
1632
+
1633
+ 0:45:36.796 --> 0:45:41.662
1634
+ But of course things can get more complicated.
1635
+
1636
+ 0:45:41.662 --> 0:45:48.598
1637
+ If you have Finland's capital, do you want to
1638
+ split the 's or not?
1639
+
1640
+ 0:45:48.598 --> 0:45:53.256
1641
+ Do you split it, or do you even write it out?
1642
+
1643
+ 0:45:53.433 --> 0:46:00.468
1644
+ And what about like things with hyphens in
1645
+ the middle and so on?
1646
+
1647
+ 0:46:00.540 --> 0:46:07.729
1648
+ So there is not everything is very easy, but
1649
+ is generally possible to somewhat keep as.
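+
+ A minimal Python sketch of such rule-based tokenization; the regular expressions below are illustrative and cover far fewer cases than real toolkits:
+
+ ```python
+ import re
+
+ def tokenize(sentence):
+     # separate common punctuation from the word it is attached to
+     sentence = re.sub(r'([.,!?;:()"])', r' \1 ', sentence)
+     # split off frequent English clitics such as 'm, 's, n't
+     sentence = re.sub(r"(\w)('m|'re|'s|n't|'ve|'ll)\b", r"\1 \2", sentence)
+     return sentence.split()
+
+ print(tokenize("I'm sure it isn't Finland's capital."))
+ # ['I', "'m", 'sure', 'it', 'is', "n't", 'Finland', "'s", 'capital', '.']
+ ```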
1650
+
1651
+ 0:46:11.791 --> 0:46:25.725
1652
+ Sometimes the most challenging thing in traditional
1653
+ systems were compounds, or how to deal with
1654
+
1655
+ 0:46:25.725 --> 0:46:28.481
1656
+ things like this.
1657
+
1658
+ 0:46:28.668 --> 0:46:32.154
1659
+ The nice thing is, as said, will come to the
1660
+ later.
1661
+
1662
+ 0:46:32.154 --> 0:46:34.501
1663
+ Nowadays we typically use subword.
1664
+
1665
+ 0:46:35.255 --> 0:46:42.261
1666
+ Unit, so we don't have to deal with this in
1667
+ the preprocessing directly, but in the subword
1668
+
1669
+ 0:46:42.261 --> 0:46:47.804
1670
+ splitting we're doing it, and then we can learn
1671
+ how to best spit these.
1672
+
1673
+ 0:46:52.392 --> 0:46:56.974
1674
+ Things Get More Complicated.
1675
+
1676
+ 0:46:56.977 --> 0:46:59.934
1677
+ About non European languages.
1678
+
1679
+ 0:46:59.934 --> 0:47:08.707
1680
+ Because in non European languages, not all
1681
+ of them, there is no space between the words.
1682
+
1683
+ 0:47:09.029 --> 0:47:18.752
1684
+ Nowadays you can also download word segmentation
1685
+ models where you put in the full sentence and
1686
+
1687
+ 0:47:18.752 --> 0:47:22.744
1688
+ then it's getting splitted into parts.
1689
+
1690
+ 0:47:22.963 --> 0:47:31.814
1691
+ And then, of course, it's even that you have
1692
+ different writing systems, sometimes in Japanese.
1693
+
1694
+ 0:47:31.814 --> 0:47:40.385
1695
+ For example, they have these katakana, hiragana
1696
+ and kanji symbols in there, and you have to
1697
+
1698
+ 0:47:40.385 --> 0:47:42.435
1699
+ some idea with these.
1700
+
1701
+ 0:47:49.669 --> 0:47:54.560
1702
+ Then the next thing is we can do some
1703
+ normalization.
1704
+
1705
+ 0:47:54.874 --> 0:48:00.376
1706
+ So the idea is that you map several words
1707
+ onto the same.
1708
+
1709
+ 0:48:00.460 --> 0:48:07.877
1710
+ And that is task dependent, and the idea is
1711
+ to define something like equivalence classes so
1712
+
1713
+ 0:48:07.877 --> 0:48:15.546
1714
+ that words, which have the same meaning where
1715
+ it's not in order to have the difference, to
1716
+
1717
+ 0:48:15.546 --> 0:48:19.423
1718
+ map onto the same thing in order to make the.
1719
+
1720
+ 0:48:19.679 --> 0:48:27.023
1721
+ The most important thing there is about casing,
1722
+ and then there is something like sometimes
1723
+
1724
+ 0:48:27.023 --> 0:48:27.508
1725
+ word.
1726
+
1727
+ 0:48:28.048 --> 0:48:37.063
1728
+ For casing you can do two things and then
1729
+ depend on the task.
1730
+
1731
+ 0:48:37.063 --> 0:48:44.769
1732
+ You can lowercase everything, maybe some exceptions.
1733
+
1734
+ 0:48:45.045 --> 0:48:47.831
1735
+ For the target side, it is
1736
+ normally not done.
1737
+
1738
+ 0:48:48.188 --> 0:48:51.020
1739
+ Why is it not done?
1740
+
1741
+ 0:48:51.020 --> 0:48:56.542
1742
+ Why should you only do it for the source side?
1743
+
1744
+ 0:48:56.542 --> 0:49:07.729
1745
+ Yes, so you have to generate correct text
1746
+ instead of lower case and uppercase.
1747
+
1748
+ 0:49:08.848 --> 0:49:16.370
1749
+ Nowadays we always do true casing on both
1750
+ sides, also on the source side, that means you
1751
+
1752
+ 0:49:16.370 --> 0:49:17.610
1753
+ keep the case.
1754
+
1755
+ 0:49:17.610 --> 0:49:24.966
1756
+ The only thing where people try to work on
1757
+ or sometimes do that is that at the beginning
1758
+
1759
+ 0:49:24.966 --> 0:49:25.628
1760
+ of the.
1761
+
1762
+ 0:49:25.825 --> 0:49:31.115
1763
+ For words like this, this is not that important
1764
+ because you will have seen otherwise a lot
1765
+
1766
+ 0:49:31.115 --> 0:49:31.696
1767
+ of times.
1768
+
1769
+ 0:49:31.696 --> 0:49:36.928
1770
+ But if you know have rare words, which you
1771
+ only have seen maybe three times, and you have
1772
+
1773
+ 0:49:36.928 --> 0:49:42.334
1774
+ only seen in the middle of the sentence, and
1775
+ now it occurs at the beginning of the sentence,
1776
+
1777
+ 0:49:42.334 --> 0:49:45.763
1778
+ which is upper case, then you don't know how
1779
+ to deal with.
1780
+
1781
+ 0:49:46.146 --> 0:49:50.983
1782
+ So then it might be good to do a true casing.
1783
+
1784
+ 0:49:50.983 --> 0:49:56.241
1785
+ That means you recase each word on the beginning.
1786
+
1787
+ 0:49:56.576 --> 0:49:59.830
1788
+ The only question, of course, is how do you
1789
+ recase it?
1790
+
1791
+ 0:49:59.830 --> 0:50:01.961
1792
+ So what case would you always know?
1793
+
1794
+ 0:50:02.162 --> 0:50:18.918
1795
+ Word of the senders, or do you have a better
1796
+ solution, especially not English, maybe German.
1797
+
1798
+ 0:50:18.918 --> 0:50:20.000
1799
+ It's.
1800
+
1801
+ 0:50:25.966 --> 0:50:36.648
1802
+ The fancy solution would be to count how often
1803
+ each form occurs and decide based on this; the unfancy one
1804
+
1805
+ 0:50:36.648 --> 0:50:43.147
1806
+ would: Think it's not really good because most
1807
+ of the cane boards are lower paced.
1808
+
1809
+ 0:50:43.683 --> 0:50:53.657
1810
+ That is one idea to count and definitely better
1811
+ because as a word more often occurs upper case.
1812
+
1813
+ 0:50:53.653 --> 0:50:57.934
1814
+ Otherwise you only have a lower case at the
1815
+ beginning where you have again.
1816
+
1817
+ 0:50:58.338 --> 0:51:03.269
1818
+ Haven't gained anything, you can make it even
1819
+ a bit better when counting.
1820
+
1821
+ 0:51:03.269 --> 0:51:09.134
1822
+ You're ignoring the first position so that
1823
+ you don't count the word beginning and yeah,
1824
+
1825
+ 0:51:09.134 --> 0:51:12.999
1826
+ that's typically how it's done to do this type
1827
+ of casing.
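+
+ A minimal Python sketch of this counting approach (function names are made up; real truecasers are more careful):
+
+ ```python
+ from collections import Counter
+
+ def train_truecaser(sentences):
+     # count casing variants, ignoring the sentence-initial position
+     counts = {}
+     for sent in sentences:
+         for word in sent.split()[1:]:
+             counts.setdefault(word.lower(), Counter())[word] += 1
+     return {w: c.most_common(1)[0][0] for w, c in counts.items()}
+
+ def truecase(sentence, model):
+     words = sentence.split()
+     words[0] = model.get(words[0].lower(), words[0])
+     return " ".join(words)
+
+ model = train_truecaser(["we see the house", "the new house is big"])
+ print(truecase("The house is new", model))  # 'the house is new'
+ ```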
1828
+
1829
+ 0:51:13.273 --> 0:51:23.907
1830
+ And that's the easy thing; you can even use
1831
+ bigram features, so word pairs.
1832
+
1833
+ 0:51:23.907 --> 0:51:29.651
1834
+ There's very few words which occur more often.
1835
+
1836
+ 0:51:29.970 --> 0:51:33.163
1837
+ It's OK to have them both because you can
1838
+ otherwise learn it.
1839
+
1840
+ 0:51:36.376 --> 0:51:52.305
1841
+ Another thing about these classes is to use
1842
+ word classes that were partly done, for example,
1843
+
1844
+ 0:51:52.305 --> 0:51:55.046
1845
+ and more often.
1846
+
1847
+ 0:51:55.375 --> 0:51:57.214
1848
+ Ten Thousand One Hundred Books.
1849
+
1850
+ 0:51:57.597 --> 0:52:07.397
1851
+ And then for an MT system the exact number might not be important,
1852
+ so you can do something like a number class plus 'books'.
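+
+ A minimal Python sketch of such a number class, purely illustrative (the <num> token is an assumption, and such mappings need care with exceptions like 'one book'):
+
+ ```python
+ import re
+
+ def normalize_numbers(text):
+     # replace digit sequences (with optional decimal part) by a class token
+     return re.sub(r"\d+([.,]\d+)?", "<num>", text)
+
+ print(normalize_numbers("He sold 10,100 books for 9.99 euros"))
+ # 'He sold <num> books for <num> euros'
+ ```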
1853
+
1854
+ 0:52:07.847 --> 0:52:16.450
1855
+ However, you see here already that it's not
1856
+ that easy because if you have one book you
1857
+
1858
+ 0:52:16.450 --> 0:52:19.318
1859
+ don't have a plural.
1860
+
1861
+ 0:52:20.020 --> 0:52:21.669
1862
+ Always be careful.
1863
+
1864
+ 0:52:21.669 --> 0:52:28.094
1865
+ It's very fast to ignore some exceptions and
1866
+ make more things worse than better.
1867
+
1868
+ 0:52:28.488 --> 0:52:37.879
1869
+ So it's always difficult to decide when to
1870
+ do this and when to better not do it and keep
1871
+
1872
+ 0:52:37.879 --> 0:52:38.724
1873
+ things.
1874
+
1875
+ 0:52:43.483 --> 0:52:56.202
1876
+ Then the next step is sentence segmentation,
1877
+ so we are typically working on sentences.
1878
+
1879
+ 0:52:56.476 --> 0:53:11.633
1880
+ However, with dots things are a bit more complicated,
1881
+ so you can do a bit more.
1882
+
1883
+ 0:53:11.731 --> 0:53:20.111
1884
+ You can even have some type of classifier
1885
+ with features by then generally.
1886
+
1887
+ 0:53:20.500 --> 0:53:30.731
1888
+ Is not too complicated, so you can have different
1889
+ types of classifiers to do that, but in generally.
1890
+
1891
+ 0:53:30.650 --> 0:53:32.537
1892
+ I Didn't Know It.
1893
+
1894
+ 0:53:33.393 --> 0:53:35.583
1895
+ It's not a super complicated task.
1896
+
1897
+ 0:53:35.583 --> 0:53:39.461
1898
+ There are nowadays also a lot of libraries
1899
+ which you can use.
1900
+
1901
+ 0:53:39.699 --> 0:53:45.714
1902
+ To do that normally if you're doing the normalization
1903
+ beforehand that can be done there so you only
1904
+
1905
+ 0:53:45.714 --> 0:53:51.126
1906
+ split up the dot if it's like the sentence
1907
+ boundary and otherwise you keep it to the word
1908
+
1909
+ 0:53:51.126 --> 0:53:54.194
1910
+ so you can do that a bit jointly with the segment.
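+
+ A minimal Python sketch of rule-based sentence splitting; the abbreviation list is a made-up placeholder and real libraries handle many more cases:
+
+ ```python
+ ABBREVIATIONS = {"Dr.", "e.g.", "etc.", "Mr."}  # illustrative, not complete
+
+ def split_sentences(text):
+     sentences, current = [], []
+     for token in text.split():
+         current.append(token)
+         # end a sentence after ., ! or ? unless the token is a known abbreviation
+         if token[-1] in ".!?" and token not in ABBREVIATIONS:
+             sentences.append(" ".join(current))
+             current = []
+     if current:
+         sentences.append(" ".join(current))
+     return sentences
+
+ print(split_sentences("Dr. Smith arrived. He was late!"))
+ # ['Dr. Smith arrived.', 'He was late!']
+ ```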
1911
+
1912
+ 0:53:54.634 --> 0:54:06.017
1913
+ It's something to think about and take care of because
1914
+ it's where errors happen.
1915
+
1916
+ 0:54:06.017 --> 0:54:14.712
1917
+ However, on the one hand you can still do it
1918
+ very well.
1919
+
1920
+ 0:54:14.834 --> 0:54:19.740
1921
+ You will never get data which is perfectly
1922
+ clean and where everything is great.
1923
+
1924
+ 0:54:20.340 --> 0:54:31.020
1925
+ There's just too much data and it will never
1926
+ happen, so therefore it's important to be aware
1927
+
1928
+ 0:54:31.020 --> 0:54:35.269
1929
+ of that during the full development.
1930
+
1931
+ 0:54:37.237 --> 0:54:42.369
1932
+ And one last thing about the preprocessing,
1933
+ we'll get into the representation.
1934
+
1935
+ 0:54:42.369 --> 0:54:47.046
1936
+ If you're working on that, you'll become friends
1937
+ with regular expressions.
1938
+
1939
+ 0:54:47.046 --> 0:54:50.034
1940
+ That's not only how you do all this matching.
1941
+
1942
+ 0:54:50.430 --> 0:55:03.811
1943
+ And if you look into the scripts of how to
1944
+ deal with punctuation marks and stuff like
1945
+
1946
+ 0:55:03.811 --> 0:55:04.900
1947
+ that,.
1948
+
1949
+ 0:55:11.011 --> 0:55:19.025
1950
+ So if we have now the data of our next step
1951
+ to build, the system is to represent our words.
1952
+
1953
+ 0:55:19.639 --> 0:55:27.650
1954
+ Before we start with this, any more questions
1955
+ about preprocessing.
1956
+
1957
+ 0:55:27.650 --> 0:55:32.672
1958
+ While we work on the pure text, I'm sure.
1959
+
1960
+ 0:55:33.453 --> 0:55:40.852
1961
+ The idea is again to make things more simple
1962
+ because if you think about the production mark
1963
+
1964
+ 0:55:40.852 --> 0:55:48.252
1965
+ at the beginning of a sentence, it might be
1966
+ that you haven't seen the word or, for example,
1967
+
1968
+ 0:55:48.252 --> 0:55:49.619
1969
+ think of titles.
1970
+
1971
+ 0:55:49.619 --> 0:55:56.153
1972
+ In newspaper articles there's: So you then
1973
+ have seen the word now in the title before,
1974
+
1975
+ 0:55:56.153 --> 0:55:58.425
1976
+ and the text you have never seen.
1977
+
1978
+ 0:55:58.898 --> 0:56:03.147
1979
+ But there is always the decision.
1980
+
1981
+ 0:56:03.123 --> 0:56:09.097
1982
+ Do I gain more because I've seen things more
1983
+ often or do I lose because now I remove information
1984
+
1985
+ 0:56:09.097 --> 0:56:11.252
1986
+ which helps me to the same degree?
1987
+
1988
+ 0:56:11.571 --> 0:56:21.771
1989
+ Because if we, for example, do that in German
1990
+ and remove the case, this might be an important
1991
+
1992
+ 0:56:21.771 --> 0:56:22.531
1993
+ issue.
1994
+
1995
+ 0:56:22.842 --> 0:56:30.648
1996
+ So there is not the perfect solution, but
1997
+ generally you can get some arrows to make things
1998
+
1999
+ 0:56:30.648 --> 0:56:32.277
2000
+ look more similar.
2001
+
2002
+ 0:56:35.295 --> 0:56:43.275
2003
+ What you can do about products like the state
2004
+ of the area or the trends that are more or
2005
+
2006
+ 0:56:43.275 --> 0:56:43.813
2007
+ less.
2008
+
2009
+ 0:56:44.944 --> 0:56:50.193
2010
+ It starts even less because models get more
2011
+ powerful, so it's not that important, but be
2012
+
2013
+ 0:56:50.193 --> 0:56:51.136
2014
+ careful partly.
2015
+
2016
+ 0:56:51.136 --> 0:56:56.326
2017
+ It's also the evaluation thing because these
2018
+ things which are problematic are happening
2019
+
2020
+ 0:56:56.326 --> 0:56:57.092
2021
+ very rarely.
2022
+
2023
+ 0:56:57.092 --> 0:57:00.159
2024
+ If you take average performance, it doesn't
2025
+ matter.
2026
+
2027
+ 0:57:00.340 --> 0:57:06.715
2028
+ However, in between it's doing the stupid
2029
+ mistakes that don't count on average, but they
2030
+
2031
+ 0:57:06.715 --> 0:57:08.219
2032
+ are not really good.
2033
+
2034
+ 0:57:09.089 --> 0:57:15.118
2035
+ Then you do some type of tokenization.
2036
+
2037
+ 0:57:15.118 --> 0:57:19.911
2038
+ You can do true casing or not.
2039
+
2040
+ 0:57:19.911 --> 0:57:28.723
2041
+ Some people nowadays don't do it, but that's
2042
+ still done.
2043
+
2044
+ 0:57:28.948 --> 0:57:34.441
2045
+ Then it depends a bit on the type
2046
+ of domain.
2047
+
2048
+ 0:57:34.441 --> 0:57:37.437
2049
+ Again we have so translation.
2050
+
2051
+ 0:57:37.717 --> 0:57:46.031
2052
+ So in the text sometimes there is mark in
2053
+ the menu, later the shortcut.
2054
+
2055
+ 0:57:46.031 --> 0:57:49.957
2056
+ This letter is used for shortcut.
2057
+
2058
+ 0:57:49.957 --> 0:57:57.232
2059
+ You cannot mistake the word because it's no
2060
+ longer a file but.
2061
+
2062
+ 0:57:58.018 --> 0:58:09.037
2063
+ Then you cannot deal with it, so then it might
2064
+ make sense to remove this.
2065
+
2066
+ 0:58:12.032 --> 0:58:17.437
2067
+ Now the next step is how to match words into
2068
+ numbers.
2069
+
2070
+ 0:58:17.437 --> 0:58:22.142
2071
+ Machine learning models deal with some digits.
2072
+
2073
+ 0:58:22.342 --> 0:58:27.091
2074
+ The first idea is to use words as our basic
2075
+ components.
2076
+
2077
+ 0:58:27.247 --> 0:58:40.695
2078
+ And then you have a large vocabulary where
2079
+ each word gets mapped to an index.
2080
+
2081
+ 0:58:40.900 --> 0:58:49.059
2082
+ So your sentence 'go home' is now a sequence of indices, and that is
2083
+ your input.
2084
+
2085
+ 0:58:52.052 --> 0:59:00.811
2086
+ So the nice thing is you have very short sequences
2087
+ so that you can deal with them.
2088
+
2089
+ 0:59:00.811 --> 0:59:01.867
2090
+ However,.
2091
+
2092
+ 0:59:01.982 --> 0:59:11.086
2093
+ So you have not really understood how words
2094
+ are processed.
2095
+
2096
+ 0:59:11.086 --> 0:59:16.951
2097
+ Why is this or can that be a problem?
2098
+
2099
+ 0:59:17.497 --> 0:59:20.741
2100
+ And there is an easy solution to deal with
2101
+ unknown words.
2102
+
2103
+ 0:59:20.741 --> 0:59:22.698
2104
+ You just have one token, which is.
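+
+ A minimal Python sketch of this mapping with an unknown-word token (indices and names are illustrative):
+
+ ```python
+ def build_vocab(sentences):
+     vocab = {"<unk>": 0}
+     for sent in sentences:
+         for word in sent.split():
+             vocab.setdefault(word, len(vocab))
+     return vocab
+
+ def encode(sentence, vocab):
+     # unseen words fall back to the <unk> index
+     return [vocab.get(word, vocab["<unk>"]) for word in sentence.split()]
+
+ vocab = build_vocab(["I go home", "he goes home"])
+ print(encode("she goes home", vocab))  # 'she' is unknown -> [0, 5, 3]
+ ```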
2105
+
2106
+ 0:59:23.123 --> 0:59:25.906
2107
+ Worrying in maybe some railroads in your training
2108
+ day, do you deal?
2109
+
2110
+ 0:59:26.206 --> 0:59:34.938
2111
+ That's working a bit for some province, but
2112
+ in general it's not good because you know nothing
2113
+
2114
+ 0:59:34.938 --> 0:59:35.588
2115
+ about.
2116
+
2117
+ 0:59:35.895 --> 0:59:38.770
2118
+ Can at least deal with this and maybe map
2119
+ it.
2120
+
2121
+ 0:59:38.770 --> 0:59:44.269
2122
+ So an easy solution in machine translation
2123
+ is always if it's an unknown word or we just
2124
+
2125
+ 0:59:44.269 --> 0:59:49.642
2126
+ copy it to the target side because unknown
2127
+ words are often named entities and in many
2128
+
2129
+ 0:59:49.642 --> 0:59:52.454
2130
+ languages the good solution is just to keep.
2131
+
2132
+ 0:59:53.013 --> 1:00:01.203
2133
+ So that is somehow a trick, trick, but yeah,
2134
+ that's of course not a good thing.
2135
+
2136
+ 1:00:01.821 --> 1:00:08.959
2137
+ It's also a problem if you deal with full
2138
+ words is that you have very few examples for
2139
+
2140
+ 1:00:08.959 --> 1:00:09.451
2141
+ some.
2142
+
2143
+ 1:00:09.949 --> 1:00:17.696
2144
+ And of course if you've seen a word once you
2145
+ can, someone may be translated, but we will
2146
+
2147
+ 1:00:17.696 --> 1:00:24.050
2148
+ learn that in your networks you represent words
2149
+ with continuous vectors.
2150
+
2151
+ 1:00:24.264 --> 1:00:26.591
2152
+ You have seen them two, three or four times.
2153
+
2154
+ 1:00:26.591 --> 1:00:31.246
2155
+ It is not really well learned, and you are
2156
+ typically doing most Arabs and words with your
2157
+
2158
+ 1:00:31.246 --> 1:00:31.763
2159
+ crow rap.
2160
+
2161
+ 1:00:33.053 --> 1:00:40.543
2162
+ And yeah, you cannot deal with things which
2163
+ are inside the world.
2164
+
2165
+ 1:00:40.543 --> 1:00:50.303
2166
+ So if you know that house is index one hundred
2167
+ and twelve and you now see houses, you have
2168
+
2169
+ 1:00:50.303 --> 1:00:51.324
2170
+ no idea.
2171
+
2172
+ 1:00:51.931 --> 1:00:55.533
2173
+ Of course, not really convenient, so humans
2174
+ are better.
2175
+
2176
+ 1:00:55.533 --> 1:00:58.042
2177
+ They can use the internal information.
2178
+
2179
+ 1:00:58.498 --> 1:01:04.080
2180
+ So if we have houses you'll know that it's
2181
+ like the plural form of house.
2182
+
2183
+ 1:01:05.285 --> 1:01:16.829
2184
+ And for the ones who weren't in advance, ay,
2185
+ you have this night worth here and guess.
2186
+
2187
+ 1:01:16.716 --> 1:01:20.454
2188
+ Don't know the meaning of these words.
2189
+
2190
+ 1:01:20.454 --> 1:01:25.821
2191
+ However, all of you will know is the fear
2192
+ of something.
2193
+
2194
+ 1:01:26.686 --> 1:01:39.437
2195
+ From the ending, -phobia is always
2196
+ the fear of something, but you don't know how.
2197
+
2198
+ 1:01:39.879 --> 1:01:46.618
2199
+ So we can split words into some parts that
2200
+ is helpful to deal with.
2201
+
2202
+ 1:01:46.618 --> 1:01:49.888
2203
+ This, for example, is a fear of.
2204
+
2205
+ 1:01:50.450 --> 1:02:04.022
2206
+ It's not very important, it's not how to happen
2207
+ very often, but yeah, it's also not important
2208
+
2209
+ 1:02:04.022 --> 1:02:10.374
2210
+ for understanding that you know everything.
2211
+
2212
+ 1:02:15.115 --> 1:02:18.791
2213
+ So what can we do instead?
2214
+
2215
+ 1:02:18.791 --> 1:02:29.685
2216
+ One thing which we could do instead is to
2217
+ represent words by the other extreme.
2218
+
2219
+ 1:02:29.949 --> 1:02:42.900
2220
+ So you really do split everything into single
2221
+ characters, and then you also need a space symbol.
2222
+
2223
+ 1:02:43.203 --> 1:02:55.875
2224
+ So you have now a representation for each
2225
+ character that enables you to implicitly learn
2226
+
2227
+ 1:02:55.875 --> 1:03:01.143
2228
+ morphology because words which have.
2229
+
2230
+ 1:03:01.541 --> 1:03:05.517
2231
+ And you can then deal with unknown words.
2232
+
2233
+ 1:03:05.517 --> 1:03:10.344
2234
+ There's still not everything you can process,
2235
+ but.
2236
+
2237
+ 1:03:11.851 --> 1:03:16.953
2238
+ So if you would go on character level what might
2239
+ still be a problem?
2240
+
2241
+ 1:03:18.598 --> 1:03:24.007
2242
+ So all characters which you haven't seen,
2243
+ but that's nowadays a little bit more often
2244
+
2245
+ 1:03:24.007 --> 1:03:25.140
2246
+ with new emoties.
2247
+
2248
+ 1:03:25.140 --> 1:03:26.020
2249
+ You couldn't.
2250
+
2251
+ 1:03:26.020 --> 1:03:31.366
2252
+ It could also be that you have translated
2253
+ from Germany and German, and then there is
2254
+
2255
+ 1:03:31.366 --> 1:03:35.077
2256
+ a Japanese character or Chinese that you cannot
2257
+ translate.
2258
+
2259
+ 1:03:35.435 --> 1:03:43.938
2260
+ But most of the time all characters that occur
2261
+ have been seen, so that somewhat works very well.
2262
+
2263
+ 1:03:44.464 --> 1:03:58.681
2264
+ This is first a nice thing, so you have a
2265
+ very small vocabulary size, so one big part
2266
+
2267
+ 1:03:58.681 --> 1:04:01.987
2268
+ of the calculation.
2269
+
2270
+ 1:04:02.222 --> 1:04:11.960
2271
+ in neural networks depends on the
2272
+ vocabulary size, so if you are efficient there
2273
+
2274
+ 1:04:11.960 --> 1:04:13.382
2275
+ it's better.
2276
+
2277
+ 1:04:14.914 --> 1:04:26.998
2278
+ On the other hand, the problem is you have
2279
+ no very long sequences, so if you think about
2280
+
2281
+ 1:04:26.998 --> 1:04:29.985
2282
+ this before you have.
2283
+
2284
+ 1:04:30.410 --> 1:04:43.535
2285
+ Your computation often depends on your input
2286
+ size and not only linear but quadratic going
2287
+
2288
+ 1:04:43.535 --> 1:04:44.410
2289
+ more.
2290
+
2291
+ 1:04:44.504 --> 1:04:49.832
2292
+ And of course it might also be that you just
2293
+ generally make things more complicated than
2294
+
2295
+ 1:04:49.832 --> 1:04:50.910
2296
+ they were before.
2297
+
2298
+ 1:04:50.951 --> 1:04:58.679
2299
+ We said before make things easy, but now if
2300
+ we really have to analyze each director independently,
2301
+
2302
+ 1:04:58.679 --> 1:05:05.003
2303
+ we cannot directly learn that university is
2304
+ the same, but we have to learn that.
2305
+
2306
+ 1:05:05.185 --> 1:05:12.179
2307
+ Is beginning and then there is an I and then
2308
+ there is an E and then all this together means
2309
+
2310
+ 1:05:12.179 --> 1:05:17.273
2311
+ university but another combination of these
2312
+ letters is a complete.
2313
+
2314
+ 1:05:17.677 --> 1:05:24.135
2315
+ So of course you make everything here a lot
2316
+ more complicated than you have on word basis.
2317
+
2318
+ 1:05:24.744 --> 1:05:32.543
2319
+ Character based models work very well in conditions
2320
+ with few data because you have seen the words
2321
+
2322
+ 1:05:32.543 --> 1:05:33.578
2323
+ very rarely.
2324
+
2325
+ 1:05:33.578 --> 1:05:38.751
2326
+ It's not good to learn but you have seen all
2327
+ letters more often.
2328
+
2329
+ 1:05:38.751 --> 1:05:44.083
2330
+ So if you have scenarios with very few data
2331
+ this is like one good.
2332
+
2333
+ 1:05:46.446 --> 1:05:59.668
2334
+ The other idea is to split now not doing the
2335
+ extreme, so neither taking full words nor taking
2336
+
2337
+ 1:05:59.668 --> 1:06:06.573
2338
+ only characters, but doing something in between.
2339
+
2340
+ 1:06:07.327 --> 1:06:12.909
2341
+ And one of these ideas has been done for a
2342
+ long time.
2343
+
2344
+ 1:06:12.909 --> 1:06:17.560
2345
+ It's called compound splitting, but we only.
2346
+
2347
+ 1:06:17.477 --> 1:06:18.424
2348
+ Baumstamm.
2349
+
2350
+ 1:06:18.424 --> 1:06:24.831
2351
+ You see that Baum and Stamm occur very often,
2352
+ then maybe more often than Baumstamm.
2353
+
2354
+ 1:06:24.831 --> 1:06:28.180
2355
+ Then you split it into Baum and Stamm and you use
2356
+ that.
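+
+ A minimal Python sketch of frequency-based compound splitting; the counts are invented and real splitters also handle linking letters like the extra 'e' mentioned later:
+
+ ```python
+ def split_compound(word, counts, min_count=10):
+     # split into two parts if both parts are frequent words on their own
+     best = None
+     for i in range(3, len(word) - 2):
+         left, right = word[:i], word[i:]
+         if counts.get(left, 0) >= min_count and counts.get(right, 0) >= min_count:
+             score = counts[left] + counts[right]
+             if best is None or score > best[0]:
+                 best = (score, [left, right])
+     return best[1] if best else [word]
+
+ counts = {"baum": 50, "stamm": 40, "baumstamm": 5}
+ print(split_compound("baumstamm", counts))  # ['baum', 'stamm']
+ ```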
2357
+
2358
+ 1:06:29.509 --> 1:06:44.165
2359
+ But it's even not so easy it will learn wrong
2360
+ splits so we did that in all the systems and
2361
+
2362
+ 1:06:44.165 --> 1:06:47.708
2363
+ there is a word like 'asiatisch'.
2364
+
2365
+ 1:06:48.288 --> 1:06:56.137
2366
+ And this, of course, is not a really
2367
+ good way of dealing with it because it is not semantic.
2368
+
2369
+ 1:06:56.676 --> 1:07:05.869
2370
+ The good thing is we didn't really care that
2371
+ much about it because the system just learned that
2372
+
2373
+ 1:07:05.869 --> 1:07:09.428
2374
+ if you have 'Asia' and 'tisch' together, it means asiatisch.
2375
+
2376
+ 1:07:09.729 --> 1:07:17.452
2377
+ So you can of course learn all that the compound
2378
+ splitting doesn't really help you to get a deeper
2379
+
2380
+ 1:07:17.452 --> 1:07:18.658
2381
+ understanding.
2382
+
2383
+ 1:07:21.661 --> 1:07:23.364
2384
+ The Thing of Course.
2385
+
2386
+ 1:07:23.943 --> 1:07:30.475
2387
+ Yeah, there was one paper where this doesn't
2388
+ work like they report, but it's called Burning
2389
+
2390
+ 1:07:30.475 --> 1:07:30.972
2391
+ Ducks.
2392
+
2393
+ 1:07:30.972 --> 1:07:37.503
2394
+ I think because it was like if you had German
2395
+ NS Branter, you could split it in NS Branter,
2396
+
2397
+ 1:07:37.503 --> 1:07:43.254
2398
+ and sometimes you have to add an E to make
2399
+ the compounds that was Enter Branter.
2400
+
2401
+ 1:07:43.583 --> 1:07:48.515
2402
+ So he translated Esperanto into burning dark.
2403
+
2404
+ 1:07:48.888 --> 1:07:56.127
2405
+ So of course you can introduce there some
2406
+ type of additional arrows, but in generally
2407
+
2408
+ 1:07:56.127 --> 1:07:57.221
2409
+ it's a good.
2410
+
2411
+ 1:07:57.617 --> 1:08:03.306
2412
+ Of course there is a trade off between vocabulary
2413
+ size so you want to have a lower vocabulary
2414
+
2415
+ 1:08:03.306 --> 1:08:08.812
2416
+ size so you've seen everything more often but
2417
+ the length of the sequence should not be too
2418
+
2419
+ 1:08:08.812 --> 1:08:13.654
2420
+ long because if you split more often you get
2421
+ less different types but you have.
2422
+
2423
+ 1:08:16.896 --> 1:08:25.281
2424
+ The motivation, or the advantage compared
2425
+ to character-based models, is that you can directly
2426
+
2427
+ 1:08:25.281 --> 1:08:33.489
2428
+ learn the representation for words that occur
2429
+ very often while still being able to represent
2430
+
2431
+ 1:08:33.489 --> 1:08:35.783
2432
+ words that are rare.
2433
+
2434
+ 1:08:36.176 --> 1:08:42.973
2435
+ And while first this was only done for compounds,
2436
+ nowadays there's an algorithm which really
2437
+
2438
+ 1:08:42.973 --> 1:08:49.405
2439
+ tries to do it on everything and there are
2440
+ different ways to do this, like compound splitting
2441
+
2442
+ 1:08:49.405 --> 1:08:50.209
2443
+ and so on.
2444
+
2445
+ 1:08:50.209 --> 1:08:56.129
2446
+ But the most successful one which is commonly
2447
+ used is based on data compression.
2448
+
2449
+ 1:08:56.476 --> 1:08:59.246
2450
+ And there the idea is okay.
2451
+
2452
+ 1:08:59.246 --> 1:09:06.765
2453
+ Can we find an encoding so that parts are
2454
+ compressed in the most efficient?
2455
+
2456
+ 1:09:07.027 --> 1:09:22.917
2457
+ And the compression algorithm is called the
2458
+ byte pair encoding, and this is also then used
2459
+
2460
+ 1:09:22.917 --> 1:09:25.625
2461
+ for splitting.
2462
+
2463
+ 1:09:26.346 --> 1:09:39.164
2464
+ And the idea is we recursively represent the
2465
+ most frequent pair of bytes by a new byte.
2466
+
2467
+ 1:09:39.819 --> 1:09:51.926
2468
+ For language, you now first split all your
2469
+ words into letters, and then you look at what
2470
+
2471
+ 1:09:51.926 --> 1:09:59.593
2472
+ is the most frequent bigram, that is, which two letters
2473
+ occur.
2474
+
2475
+ 1:10:00.040 --> 1:10:04.896
2476
+ And then you replace it and repeat until you
2477
+ have a fixed vocabulary.
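As an aside, here is a minimal sketch of the merge loop described above, assuming whitespace-pre-tokenized text and an explicit end-of-word marker; the function and variable names are illustrative, not the exact implementation used in the lecture.

```python
# Minimal BPE merge-learning sketch (illustrative names): words are split into
# characters plus an end-of-word marker, and the most frequent adjacent pair
# is merged repeatedly until the requested number of merge rules is reached.
from collections import Counter

END = "</w>"  # end-of-word marker, as discussed in the lecture

def learn_bpe(corpus, num_merges):
    # represent every word as a tuple of symbols, counted over the corpus
    vocab = Counter(tuple(word) + (END,) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # count how often each adjacent symbol pair occurs
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair
        merges.append(best)
        # replace every occurrence of the pair by one joined symbol
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```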
2478
+
2479
+ 1:10:04.985 --> 1:10:08.031
2480
+ So that's a nice thing.
2481
+
2482
+ 1:10:08.031 --> 1:10:16.663
2483
+ Now you can predefine how large a vocabulary you want
2484
+ to use to represent your text.
2485
+
2486
+ 1:10:16.936 --> 1:10:28.486
2487
+ By hand, and then you can represent any text
2488
+ with these symbols, and of course the shorter
2489
+
2490
+ 1:10:28.486 --> 1:10:30.517
2491
+ your text will be.
2492
+
2493
+ 1:10:32.772 --> 1:10:36.543
2494
+ So the original idea was something like that.
2495
+
2496
+ 1:10:36.543 --> 1:10:39.411
2497
+ We have the sequence A, B, A, B, C.
2498
+
2499
+ 1:10:39.411 --> 1:10:45.149
2500
+ For example, a common bigram is A, B, so
2501
+ you can replace A, B by a new symbol D.
2502
+
2503
+ 1:10:45.149 --> 1:10:46.788
2504
+ Then the text gets shorter.
2505
+
2506
+ 1:10:48.108 --> 1:10:53.615
2507
+ Then you can make to and then you have eating
2508
+ beet and so on, so this is then your text.
2509
+
2510
+ 1:10:54.514 --> 1:11:00.691
2511
+ Similarly, we can do it now for tokenization.
2512
+
2513
+ 1:11:01.761 --> 1:11:05.436
2514
+ Let's assume you have these sentences.
2515
+
2516
+ 1:11:05.436 --> 1:11:11.185
2517
+ I go, he goes, she goes, so your vocabulary
2518
+ is go, goes, he.
2519
+
2520
+ 1:11:11.851 --> 1:11:30.849
2521
+ And the first thing you're doing is split
2522
+ your corpus into single characters.
2523
+
2524
+ 1:11:30.810 --> 1:11:34.692
2525
+ So thereby you can split words again like
2526
+ you split sentences into words.
2527
+
2528
+ 1:11:34.692 --> 1:11:38.980
2529
+ Because now you only have characters, you
2530
+ don't know the word boundaries.
2531
+
2532
+ 1:11:38.980 --> 1:11:44.194
2533
+ You introduce the word boundaries by having
2534
+ a special symbol at the end of each word, and
2535
+
2536
+ 1:11:44.194 --> 1:11:46.222
2537
+ then you know this symbol happens.
2538
+
2539
+ 1:11:46.222 --> 1:11:48.366
2540
+ I can split it and have it in a new.
2541
+
2542
+ 1:11:48.708 --> 1:11:55.245
2543
+ So you have the corpus I go, he goes, and
2544
+ she goes, and then you have now here the sequences
2545
+
2546
+ 1:11:55.245 --> 1:11:56.229
2547
+ of Character.
2548
+
2549
+ 1:11:56.229 --> 1:12:02.625
2550
+ So this is then the character-based representation,
2551
+ and now you calculate the bigram statistics.
2552
+
2553
+ 1:12:02.625 --> 1:12:08.458
2554
+ So 'I' plus end-of-word occurs one time, 'G'
2555
+ and 'O' occur three times, and so on.
2556
+
2557
+ 1:12:09.189 --> 1:12:18.732
2558
+ And these are all the others, and now you
2559
+ look, which is the most common happening.
2560
+
2561
+ 1:12:19.119 --> 1:12:26.046
2562
+ So then you have learned the first rule.
2563
+
2564
+ 1:12:26.046 --> 1:12:39.235
2565
+ If G and O occur together you merge them, and you have these new
2566
+ words: 'go' is no longer two symbols, but it's
2567
+
2568
+ 1:12:39.235 --> 1:12:41.738
2569
+ one single symbol, because you join them.
2570
+
2571
+ 1:12:42.402 --> 1:12:51.175
2572
+ And then you have here now the new number
2573
+ of biceps, steel and wood, and and so on.
2574
+
2575
+ 1:12:52.092 --> 1:13:01.753
2576
+ In small examples now you have a lot of rules
2577
+ which occur the same time.
2578
+
2579
+ 1:13:01.753 --> 1:13:09.561
2580
+ In reality that is happening sometimes but
2581
+ not that often.
2582
+
2583
+ 1:13:10.370 --> 1:13:21.240
2584
+ You add the end-of-word symbol to it, and this
2585
+ way you go on until you have your vocabulary.
2586
+
2587
+ 1:13:21.601 --> 1:13:38.242
2588
+ And your vocabulary is essentially these rules, so
2589
+ people speak about the vocabulary of the rules.
2590
+
2591
+ 1:13:38.658 --> 1:13:43.637
2592
+ And these are the rules, and if you now have
2593
+ a different sentence, something like they tell.
2594
+
2595
+ 1:13:44.184 --> 1:13:53.600
2596
+ Then your final output looks like something
2597
+ like that.
2598
+
2599
+ 1:13:53.600 --> 1:13:59.250
2600
+ These two words represent by by.
2601
+
2602
+ 1:14:00.940 --> 1:14:06.398
2603
+ And that is your algorithm.
2604
+
2605
+ 1:14:06.398 --> 1:14:18.873
2606
+ Now you can represent any type of text with
2607
+ a fixed vocabulary.
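To make this concrete, here is what the sketch from above produces on the toy corpus used in the lecture; this reuses the hypothetical learn_bpe function defined earlier, and the exact order of later merges depends on tie-breaking, as the lecture notes.

```python
# Applying the earlier sketch to the lecture's toy corpus.
merges = learn_bpe("i go he goes she goes", num_merges=4)
print(merges)
# The first learned merge is ('g', 'o'), since that bigram occurs three times;
# subsequent merges depend on how ties between equally frequent pairs are broken.
```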
2608
+
2609
+ 1:14:20.400 --> 1:14:23.593
2610
+ So I think that's defined in the beginning.
2611
+
2612
+ 1:14:23.593 --> 1:14:27.243
2613
+ Like how many merges you want, and that's it?
2614
+
2615
+ 1:14:28.408 --> 1:14:35.253
2616
+ It's nearly correct that it writes a number
2617
+ of characters.
2618
+
2619
+ 1:14:35.253 --> 1:14:38.734
2620
+ It can be that in additional.
2621
+
2622
+ 1:14:38.878 --> 1:14:49.162
2623
+ So on the one hand all the right sides
2624
+ of the rules can occur, and then additionally
2625
+
2626
+ 1:14:49.162 --> 1:14:49.721
2627
+ all single characters.
2628
+
2629
+ 1:14:49.809 --> 1:14:55.851
2630
+ In reality it can even happen that there is
2631
+ less your vocabulary smaller because it might
2632
+
2633
+ 1:14:55.851 --> 1:15:01.960
2634
+ happen that like for example go never occurs
2635
+ singular at the end but you always like merge
2636
+
2637
+ 1:15:01.960 --> 1:15:06.793
2638
+ all occurrences so there are not all right
2639
+ sides really happen because.
2640
+
2641
+ 1:15:06.746 --> 1:15:11.269
2642
+ This rule is never only applied, but afterwards
2643
+ another rule is also applied.
2644
+
2645
+ 1:15:11.531 --> 1:15:15.621
2646
+ So it's more an upper bound on your vocabulary size
2647
+ than the exact size.
2648
+
2649
+ 1:15:20.480 --> 1:15:29.014
2650
+ Then we come to the last part, which is about
2651
+ parallel data, but are there some questions beforehand?
2652
+
2653
+ 1:15:36.436 --> 1:15:38.824
2654
+ So what is parallel data?
2655
+
2656
+ 1:15:38.824 --> 1:15:47.368
2657
+ So for machine translation it is really,
2658
+ really important that we are dealing with parallel
2659
+
2660
+ 1:15:47.368 --> 1:15:52.054
2661
+ data, that means we have aligned input and
2662
+ output.
2663
+
2664
+ 1:15:52.054 --> 1:15:54.626
2665
+ You have this type of data.
2666
+
2667
+ 1:15:55.015 --> 1:16:01.773
2668
+ However, in machine translation we have one
2669
+ very big advantage that is somewhat naturally
2670
+
2671
+ 1:16:01.773 --> 1:16:07.255
2672
+ occurring, so you have a lot of parallel data
2673
+ which you can summar gaps.
2674
+
2675
+ 1:16:07.255 --> 1:16:13.788
2676
+ In many NLP tasks you need to manually annotate
2677
+ your data and generate the aligned data.
2678
+
2679
+ 1:16:14.414 --> 1:16:22.540
2680
+ We have to manually create translations, and
2681
+ of course that is very expensive, but it's
2682
+
2683
+ 1:16:22.540 --> 1:16:29.281
2684
+ really expensive to pay for like one million
2685
+ sentences to be translated.
2686
+
2687
+ 1:16:29.889 --> 1:16:36.952
2688
+ The nice thing is that there is data normally
2689
+ available because other people have already done the
2690
+
2691
+ 1:16:36.952 --> 1:16:37.889
2692
+ translation.
2693
+
2694
+ 1:16:40.120 --> 1:16:44.672
2695
+ So there is this data, and of course you have to process
2696
+ it.
2697
+
2698
+ 1:16:44.672 --> 1:16:51.406
2699
+ We'll have a full lecture on how to deal with
2700
+ more complex situations.
2701
+
2702
+ 1:16:52.032 --> 1:16:56.645
2703
+ The idea is really you don't do really much
2704
+ human work.
2705
+
2706
+ 1:16:56.645 --> 1:17:02.825
2707
+ You really just start the crawler with some
2708
+ initial start pages and then it runs.
2709
+
2710
+ 1:17:03.203 --> 1:17:07.953
2711
+ But a lot of high-quality parallel data is really
2712
+ targeted on some scenarios.
2713
+
2714
+ 1:17:07.953 --> 1:17:13.987
2715
+ So, for example, think of the European Parliament
2716
+ as one website where you can easily extract
2717
+
2718
+ 1:17:13.987 --> 1:17:17.581
2719
+ this information from, and there you have a
2720
+ large data.
2721
+
2722
+ 1:17:17.937 --> 1:17:22.500
2723
+ Or like we have the TED data, which is also
2724
+ you can get from the TED website.
2725
+
2726
+ 1:17:23.783 --> 1:17:33.555
2727
+ So in generally parallel corpus is a collection
2728
+ of texts with translations into one of several.
2729
+
2730
+ 1:17:34.134 --> 1:17:42.269
2731
+ And this data is important because there is
2732
+ no general MT system that works well everywhere.
2733
+
2734
+ 1:17:42.222 --> 1:17:46.732
2735
+ It works especially well if your training
2736
+ and test conditions are similar.
2737
+
2738
+ 1:17:46.732 --> 1:17:50.460
2739
+ So if the topic is similar, the style of modality
2740
+ is similar.
2741
+
2742
+ 1:17:50.460 --> 1:17:55.391
2743
+ So if you want to translate speech, it's often
2744
+ better to train also on speech.
2745
+
2746
+ 1:17:55.391 --> 1:17:58.818
2747
+ If you want to translate text, it's better
2748
+ to train on text.
2749
+
2750
+ 1:17:59.379 --> 1:18:08.457
2751
+ And there is a lot of these data available
2752
+ nowadays for common languages.
2753
+
2754
+ 1:18:08.457 --> 1:18:12.014
2755
+ You normally can start with.
2756
+
2757
+ 1:18:12.252 --> 1:18:15.298
2758
+ It's really available.
2759
+
2760
+ 1:18:15.298 --> 1:18:27.350
2761
+ For example, Opus is a big website collecting
2762
+ different types of parallel corpora where you
2763
+
2764
+ 1:18:27.350 --> 1:18:29.601
2765
+ can select them.
2766
+
2767
+ 1:18:29.529 --> 1:18:33.276
2768
+ You have this document alignment; we will come
2769
+ to that later.
2770
+
2771
+ 1:18:33.553 --> 1:18:39.248
2772
+ There is things like comparable data where
2773
+ you have not full sentences but only some parts
2774
+
2775
+ 1:18:39.248 --> 1:18:40.062
2776
+ that are parallel.
2777
+
2778
+ 1:18:40.220 --> 1:18:48.700
2779
+ But now first let's assume we have easy tasks
2780
+ like European Parliament when we have the speech
2781
+
2782
+ 1:18:48.700 --> 1:18:55.485
2783
+ in German and the speech in English and you
2784
+ need to generate parallel data.
2785
+
2786
+ 1:18:55.485 --> 1:18:59.949
2787
+ That means you have to align the source and target sentences.
2788
+
2789
+ 1:19:00.000 --> 1:19:01.573
2790
+ And doing this right.
2791
+
2792
+ 1:19:05.905 --> 1:19:08.435
2793
+ How can we do that?
2794
+
2795
+ 1:19:08.435 --> 1:19:19.315
2796
+ And that is what people refer to as sentence
2797
+ alignment, so we have parallel documents in
2798
+
2799
+ 1:19:19.315 --> 1:19:20.707
2800
+ languages.
2801
+
2802
+ 1:19:22.602 --> 1:19:32.076
2803
+ This is so you cannot normally do that word
2804
+ by word because there is no direct correspondence
2805
+
2806
+ 1:19:32.076 --> 1:19:34.158
2807
+ between the words, but it is.
2808
+
2809
+ 1:19:34.074 --> 1:19:39.837
2810
+ Relatively possible to do it on sentence level,
2811
+ it will not be perfect, so you sometimes have
2812
+
2813
+ 1:19:39.837 --> 1:19:42.535
2814
+ two sentences in English and one in German.
2815
+
2816
+ 1:19:42.535 --> 1:19:47.992
2817
+ German like to have these long sentences with
2818
+ sub clauses and so on, so there you can do
2819
+
2820
+ 1:19:47.992 --> 1:19:51.733
2821
+ it, but with long sentences it might not be
2822
+ really possible.
2823
+
2824
+ 1:19:55.015 --> 1:19:59.454
2825
+ And for some data we saw that sentence markers aren't
2826
+ there, so it's more complicated.
2827
+
2828
+ 1:19:59.819 --> 1:20:10.090
2829
+ So how can we formalize this sentence alignment
2830
+ problem?
2831
+
2832
+ 1:20:10.090 --> 1:20:16.756
2833
+ So we have a set of source sentences.
2834
+
2835
+ 1:20:17.377 --> 1:20:22.167
2836
+ And machine translation relatively often.
2837
+
2838
+ 1:20:22.167 --> 1:20:32.317
2839
+ Sometimes source sentences nowadays are and,
2840
+ but traditionally it was F and E because people
2841
+
2842
+ 1:20:32.317 --> 1:20:34.027
2843
+ started using French and English.
2844
+
2845
+ 1:20:34.594 --> 1:20:45.625
2846
+ And then the idea is to find this alignment
2847
+ where we have alignment.
2848
+
2849
+ 1:20:46.306 --> 1:20:50.421
2850
+ And of course you want these sequences to
2851
+ be as short as possible.
2852
+
2853
+ 1:20:50.421 --> 1:20:56.400
2854
+ Of course an easy solution is here all my
2855
+ source sentences and here all my target sentences.
2856
+
2857
+ 1:20:56.756 --> 1:21:07.558
2858
+ So want to have short sequences there, typically
2859
+ one sentence or maximum two or three sentences,
2860
+
2861
+ 1:21:07.558 --> 1:21:09.340
2862
+ so that really.
2863
+
2864
+ 1:21:13.913 --> 1:21:21.479
2865
+ Then there is different ways of restriction
2866
+ to this type of alignment, so first of all
2867
+
2868
+ 1:21:21.479 --> 1:21:29.131
2869
+ it should be a monotone alignment, so that
2870
+ means that each segment on the source should
2871
+
2872
+ 1:21:29.131 --> 1:21:31.218
2873
+ start after each other.
2874
+
2875
+ 1:21:31.431 --> 1:21:36.428
2876
+ So we assume that in the document there's really
2877
+ a monotone order and it's going the same way in source and target.
2878
+
2879
+ 1:21:36.957 --> 1:21:41.965
2880
+ Of course for a very free translation that might
2881
+ not be valid anymore.
2882
+
2883
+ 1:21:41.965 --> 1:21:49.331
2884
+ But this algorithm, the first one, the Church
2885
+ and Gale algorithm, is meant more for translations
2886
+
2887
+ 1:21:49.331 --> 1:21:51.025
2888
+ which are very direct.
2889
+
2890
+ 1:21:51.025 --> 1:21:54.708
2891
+ So each segment should be like coming after
2892
+ each.
2893
+
2894
+ 1:21:55.115 --> 1:22:04.117
2895
+ Then we want to translate the full sequence,
2896
+ and of course each segment should start before
2897
+
2898
+ 1:22:04.117 --> 1:22:04.802
2899
+ it is.
2900
+
2901
+ 1:22:05.525 --> 1:22:22.654
2902
+ And then you want to have something like that,
2903
+ but you have to alignments or alignments.
2904
+
2905
+ 1:22:25.525 --> 1:22:41.851
2906
+ The alignment types are: one-to-one, and then, of course,
2907
+ sometimes insertions and deletions, where there
2908
+
2909
+ 1:22:41.851 --> 1:22:43.858
2910
+ is some information added or removed.
2911
+
2912
+ 1:22:44.224 --> 1:22:50.412
2913
+ That can be, for example, an explanation, so it can
2914
+ be that some term is known in the one language
2915
+
2916
+ 1:22:50.412 --> 1:22:51.018
2917
+ but not.
2918
+
2919
+ 1:22:51.111 --> 1:22:53.724
2920
+ Think of things like Deutschland ticket.
2921
+
2922
+ 1:22:53.724 --> 1:22:58.187
2923
+ In Germany everybody will by now know what
2924
+ the Deutschland ticket is.
2925
+
2926
+ 1:22:58.187 --> 1:23:03.797
2927
+ But if you translate it to English it might
2928
+ be important to explain it and other things
2929
+
2930
+ 1:23:03.797 --> 1:23:04.116
2931
+ are.
2932
+
2933
+ 1:23:04.116 --> 1:23:09.853
2934
+ So sometimes you have to explain things and
2935
+ then you have more sentences with insertions.
2936
+
2937
+ 1:23:10.410 --> 1:23:15.956
2938
+ Then you have two to one and one to two alignment,
2939
+ and that is, for example, in Germany you have
2940
+
2941
+ 1:23:15.956 --> 1:23:19.616
2942
+ a lot of subclauses, and these are maybe expressed
2943
+ by two sentences in English.
2944
+
2945
+ 1:23:20.580 --> 1:23:37.725
2946
+ Of course, it might be more complex, but typically
2947
+ you make it simple and only allow for these types
2948
+
2949
+ 1:23:37.725 --> 1:23:40.174
2950
+ of alignment.
2951
+
2952
+ 1:23:41.301 --> 1:23:56.588
2953
+ Then it is about finding the alignment and
2954
+ that is, we try to score where we just take
2955
+
2956
+ 1:23:56.588 --> 1:23:59.575
2957
+ a general score.
2958
+
2959
+ 1:24:00.000 --> 1:24:04.011
2960
+ That is done, as in the Gale and Church algorithm, by scoring the
2961
+ matching of one segment.
2962
+
2963
+ 1:24:04.011 --> 1:24:09.279
2964
+ If you have one segment now so this is one
2965
+ of the global things so the global alignment
2966
+
2967
+ 1:24:09.279 --> 1:24:13.828
2968
+ is as good as the product of all single steps
2969
+ and then you have two scores.
2970
+
2971
+ 1:24:13.828 --> 1:24:18.558
2972
+ First of all you say one to one alignments
2973
+ are much better than all the others.
2974
+
2975
+ 1:24:19.059 --> 1:24:26.884
2976
+ And then you have a lexical similarity, which
2977
+ is, for example, based on an initial dictionary
2978
+
2979
+ 1:24:26.884 --> 1:24:30.713
2980
+ which counts how many dictionary entries match.
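To illustrate the two ingredients just mentioned (a prior over alignment types and a lexical similarity score), here is a minimal dynamic-programming sketch in the spirit of the Gale and Church style of sentence alignment. All names, the particular prior values, and the toy dictionary-overlap score are illustrative assumptions, not the original algorithm, which scored segments mainly by sentence length.

```python
import math

# Illustrative priors: 1-1 alignments are much more likely than the others,
# as stated in the lecture; the remaining mass goes to 1-0, 0-1, 2-1, 1-2.
STEP_PRIOR = {(1, 1): 0.89, (1, 0): 0.01, (0, 1): 0.01, (2, 1): 0.045, (1, 2): 0.045}

def lex_score(src_sents, tgt_sents, seed_dict):
    # crude lexical similarity: fraction of source tokens whose dictionary
    # translation appears anywhere on the target side of the candidate segment
    src = " ".join(src_sents).lower().split()
    tgt = set(" ".join(tgt_sents).lower().split())
    hits = sum(1 for w in src if seed_dict.get(w) in tgt)
    return (hits + 1) / (len(src) + 1)

def align(src, tgt, seed_dict):
    # best[i][j] = best log-score for aligning the first i source and j target sentences
    best = [[-math.inf] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    back = [[None] * (len(tgt) + 1) for _ in range(len(src) + 1)]
    best[0][0] = 0.0
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if best[i][j] == -math.inf:
                continue
            for (di, dj), prior in STEP_PRIOR.items():
                ni, nj = i + di, j + dj
                if ni > len(src) or nj > len(tgt):
                    continue
                score = best[i][j] + math.log(prior) + math.log(
                    lex_score(src[i:ni], tgt[j:nj], seed_dict))
                if score > best[ni][nj]:
                    best[ni][nj] = score
                    back[ni][nj] = (i, j)
    # follow the back-pointers to read off the aligned segment pairs
    segs, i, j = [], len(src), len(tgt)
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        segs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return list(reversed(segs))
```

The global score is the product (here: sum of logs) of the single steps, exactly as described in the next cues.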
2981
+
2982
+ 1:24:31.091 --> 1:24:35.407
2983
+ So this is a very simple algorithm.
2984
+
2985
+ 1:24:35.407 --> 1:24:41.881
2986
+ Typically violates like your first step and
2987
+ you want.
2988
+
2989
+ 1:24:43.303 --> 1:24:54.454
2990
+ And that is like with this one you can get
2991
+ an initial one you can have better parallel
2992
+
2993
+ 1:24:54.454 --> 1:24:55.223
2994
+ data.
2995
+
2996
+ 1:24:55.675 --> 1:25:02.369
2997
+ Now, it is an optimization problem, and you
2998
+ are now based on the scores you can calculate
2999
+
3000
+ 1:25:02.369 --> 1:25:07.541
3001
+ for each possible alignment and score and then
3002
+ select the best one.
3003
+
3004
+ 1:25:07.541 --> 1:25:14.386
3005
+ Of course, you won't try all possibilities
3006
+ out but you can do a good search and then find
3007
+
3008
+ 1:25:14.386 --> 1:25:15.451
3009
+ the best one.
3010
+
3011
+ 1:25:15.815 --> 1:25:18.726
3012
+ This can typically be done automatically.
3013
+
3014
+ 1:25:18.726 --> 1:25:25.456
3015
+ Of course, you should do some checks like
3016
+ aligning sentences as possible.
3017
+
3018
+ 1:25:26.766 --> 1:25:32.043
3019
+ A bill like typically for training data is
3020
+ done this way.
3021
+
3022
+ 1:25:32.043 --> 1:25:35.045
3023
+ Maybe if you have test data you.
3024
+
3025
+ 1:25:40.000 --> 1:25:47.323
3026
+ Sorry, I'm a bit late because originally I wanted
3027
+ to do a quiz at the end.
3028
+
3029
+ 1:25:47.323 --> 1:25:49.129
3030
+ Can we go a quiz?
3031
+
3032
+ 1:25:49.429 --> 1:25:51.833
3033
+ We'll do it somewhere else.
3034
+
3035
+ 1:25:51.833 --> 1:25:56.813
3036
+ We had a bachelor project about making quiz
3037
+ for lectures.
3038
+
3039
+ 1:25:56.813 --> 1:25:59.217
3040
+ And I still want to try it.
3041
+
3042
+ 1:25:59.217 --> 1:26:04.197
3043
+ So let's see I hope in some other lecture
3044
+ we can do that.
3045
+
3046
+ 1:26:04.197 --> 1:26:09.435
3047
+ Then we can at the end of the lecture do
3048
+ some quiz about it.
3049
+
3050
+ 1:26:09.609 --> 1:26:13.081
3051
+ All we can do is the practical thing, let's
3052
+ see.
3053
+
3054
+ 1:26:13.533 --> 1:26:24.719
3055
+ So what you should remember from today is
3056
+ what parallel data is and how we can
3057
+
3058
+ 1:26:25.045 --> 1:26:29.553
3059
+ create parallel data, and how to generally
3060
+ process data.
3061
+
3062
+ 1:26:29.553 --> 1:26:36.435
3063
+ How you think about data is really important
3064
+ if you build systems, and the different ways of representing words.
3065
+
3066
+ 1:26:36.696 --> 1:26:46.857
3067
+ The three main options: full words, going directly
3068
+ on character level, or using subword units.
3069
+
3070
+ 1:26:47.687 --> 1:26:49.634
3071
+ Is there any question?
3072
+
3073
+ 1:26:52.192 --> 1:26:57.768
3074
+ Yes, is this alignment thing comparable to
3075
+ dynamic time warping?
3076
+
3077
+ 1:27:00.000 --> 1:27:05.761
3078
+ It's not directly using dynamic time warping,
3079
+ but the idea is similar and you can use all
3080
+
3081
+ 1:27:05.761 --> 1:27:11.771
3082
+ this type of similar algorithms, which is the
3083
+ main thing. The difficulty
3084
+
3085
+ 1:27:11.771 --> 1:27:14.807
3086
+ is to define your loss function
3087
+ here.
3088
+
3089
+ 1:27:14.807 --> 1:27:16.418
3090
+ What is a good alignment?
3091
+
3092
+ 1:27:16.736 --> 1:27:24.115
3093
+ But as you do not have a time walk on, you
3094
+ have a monotone alignment in there, and you
3095
+
3096
+ 1:27:24.115 --> 1:27:26.150
3097
+ cannot have reordering.
3098
+
3099
+ 1:27:30.770 --> 1:27:40.121
3100
+ There then thanks a lot and on first day we
3101
+ will then start with or discuss.
3102
+
demo_data/lectures/Lecture-03-25.04.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7b241226dacb56a88fcbccaecb2639c3b5765fbea6f60e4758715c6941fbc512
3
+ size 117644511
demo_data/lectures/Lecture-04-27.04.2023/English.vtt ADDED
@@ -0,0 +1,2919 @@
1
+ WEBVTT
2
+
3
+ 0:00:03.663 --> 0:00:07.970
4
+ Okay, then I should switch back to English,
5
+ sorry.
6
+
7
+ 0:00:08.528 --> 0:00:18.970
8
+ So welcome to today's lecture in the course
9
+ machine translation and today we're planning
10
+
11
+ 0:00:18.970 --> 0:00:20.038
12
+ to talk.
13
+
14
+ 0:00:20.880 --> 0:00:31.845
15
+ Which will be without our summary of power
16
+ translation was done from around till.
17
+
18
+ 0:00:32.872 --> 0:00:38.471
19
+ Fourteen, so this was an approach which was
20
+ quite long.
21
+
22
+ 0:00:38.471 --> 0:00:47.070
23
+ It was the first approach where at the end
24
+ the quality was really so good that it was
25
+
26
+ 0:00:47.070 --> 0:00:49.969
27
+ used as a commercial system.
28
+
29
+ 0:00:49.990 --> 0:00:56.482
30
+ Or something like that, so the first systems
31
+ were using statistical machine translation.
32
+
33
+ 0:00:57.937 --> 0:01:02.706
34
+ So when I came into the field this was the
35
+ main part of the lecture, so there would be
36
+
37
+ 0:01:02.706 --> 0:01:07.912
38
+ not be one lecture, but in more detail than
39
+ half of the full course would be about statistical
40
+
41
+ 0:01:07.912 --> 0:01:09.063
42
+ machine translation.
43
+
44
+ 0:01:09.369 --> 0:01:23.381
45
+ So what we try to do today is like get the
46
+ most important things, which I think
47
+
48
+ 0:01:23.381 --> 0:01:27.408
49
+ are still very important.
50
+
51
+ 0:01:27.267 --> 0:01:31.196
52
+ Four State of the Art Box.
53
+
54
+ 0:01:31.952 --> 0:01:45.240
55
+ Then we'll have the presentation about how
56
+ to evaluate the other part of the machine translation.
57
+
58
+ 0:01:45.505 --> 0:01:58.396
59
+ The other important thing is the language
60
+ modeling part will explain later how they combine.
61
+
62
+ 0:01:59.539 --> 0:02:04.563
63
+ Shortly mentioned this one already.
64
+
65
+ 0:02:04.824 --> 0:02:06.025
66
+ On Tuesday.
67
+
68
+ 0:02:06.246 --> 0:02:21.849
69
+ So in a lot of these explanations, how we
70
+ model translation process, it might be surprising:
71
+
72
+ 0:02:22.082 --> 0:02:27.905
73
+ Later some people say it's for four eight words
74
+ traditionally came because the first models
75
+
76
+ 0:02:27.905 --> 0:02:32.715
77
+ which we'll discuss here and which are
78
+ referred to as the IBM models.
79
+
80
+ 0:02:32.832 --> 0:02:40.043
81
+ They were trained on French to English translation
82
+ directions and that's why they started using
83
+
84
+ 0:02:40.043 --> 0:02:44.399
85
+ F and E and then this was done for the next
86
+ twenty years.
87
+
88
+ 0:02:44.664 --> 0:02:52.316
89
+ So while we are trying to wait, the source
90
+ words is: We have a big I, typically the
91
+
92
+ 0:02:52.316 --> 0:03:02.701
93
+ length of the source sentence, and in small i
94
+ the position, and similarly in the target and
95
+
96
+ 0:03:02.701 --> 0:03:05.240
97
+ the lengths of small.
98
+
99
+ 0:03:05.485 --> 0:03:13.248
100
+ Things will get a bit complicated in this
101
+ way because it is not always clear what is
102
+
103
+ 0:03:13.248 --> 0:03:13.704
104
+ the.
105
+
106
+ 0:03:14.014 --> 0:03:21.962
107
+ See that there is this noisy channel model
108
+ which switches the direction in your model,
109
+
110
+ 0:03:21.962 --> 0:03:25.616
111
+ but in the application it's the target.
112
+
113
+ 0:03:26.006 --> 0:03:37.077
114
+ So that is why if you especially read these
115
+ papers, it might sometimes be a bit disturbing.
116
+
117
+ 0:03:37.437 --> 0:03:40.209
118
+ Try to keep it here always.
119
+
120
+ 0:03:40.209 --> 0:03:48.427
121
+ The source is, and even if we use a model
122
+ where it's inverse, we'll keep this way.
123
+
124
+ 0:03:48.468 --> 0:03:55.138
125
+ Don't get disturbed by that, and I think it's
126
+ possible to understand all that without this
127
+
128
+ 0:03:55.138 --> 0:03:55.944
129
+ confusion.
130
+
131
+ 0:03:55.944 --> 0:04:01.734
132
+ But in some of the papers you might get confused
133
+ because they switched to the.
134
+
135
+ 0:04:04.944 --> 0:04:17.138
136
+ In general, in statistics and machine translation,
137
+ the goal is how we do translation.
138
+
139
+ 0:04:17.377 --> 0:04:25.562
140
+ But first we are seeing all our possible target
141
+ sentences as possible translations.
142
+
143
+ 0:04:26.726 --> 0:04:37.495
144
+ And we are assigning some probability to the
145
+ combination, so we are modeling.
146
+
147
+ 0:04:39.359 --> 0:04:49.746
148
+ And then we are doing a search over all possible
149
+ things or at least theoretically, and we are
150
+
151
+ 0:04:49.746 --> 0:04:56.486
152
+ trying to find the translation with the highest
153
+ probability.
154
+
155
+ 0:04:56.936 --> 0:05:05.116
156
+ And this general idea is also true for neural machine
157
+ translation.
158
+
159
+ 0:05:05.116 --> 0:05:07.633
160
+ They differ in how.
161
+
162
+ 0:05:08.088 --> 0:05:10.801
163
+ So these were then of course the two big challenges.
164
+
165
+ 0:05:11.171 --> 0:05:17.414
166
+ On the one hand, how can we estimate this
167
+ probability?
168
+
169
+ 0:05:17.414 --> 0:05:21.615
170
+ How is the translation of the other?
171
+
172
+ 0:05:22.262 --> 0:05:32.412
173
+ The other challenge is the search, so we cannot,
174
+ of course, say we want to find the most probable
175
+
176
+ 0:05:32.412 --> 0:05:33.759
177
+ translation.
178
+
179
+ 0:05:33.759 --> 0:05:42.045
180
+ We cannot go over all possible English sentences
181
+ and calculate the probability.
182
+
183
+ 0:05:43.103 --> 0:05:45.004
184
+ So,.
185
+
186
+ 0:05:45.165 --> 0:05:53.423
187
+ What we have to do there is some are doing
188
+ intelligent search and look for the ones and
189
+
190
+ 0:05:53.423 --> 0:05:54.268
191
+ compare.
192
+
193
+ 0:05:54.734 --> 0:05:57.384
194
+ That will be done.
195
+
196
+ 0:05:57.384 --> 0:06:07.006
197
+ This process of finding them is called the
198
+ decoding process because.
199
+
200
+ 0:06:07.247 --> 0:06:09.015
201
+ They will be covered well later.
202
+
203
+ 0:06:09.015 --> 0:06:11.104
204
+ Today we will concentrate on the model.
205
+
206
+ 0:06:11.451 --> 0:06:23.566
207
+ The model is trained using data, so in the
208
+ first step we're having data, we're somehow
209
+
210
+ 0:06:23.566 --> 0:06:30.529
211
+ having a definition of what the model looks
212
+ like.
213
+
214
+ 0:06:34.034 --> 0:06:42.913
215
+ And in statistical machine translation the
216
+ common model is behind.
217
+
218
+ 0:06:42.913 --> 0:06:46.358
219
+ That is what is referred.
220
+
221
+ 0:06:46.786 --> 0:06:55.475
222
+ And this is motivated by the initial idea
223
+ from Shannon.
224
+
225
+ 0:06:55.475 --> 0:07:02.457
226
+ We have this that you can think of decoding.
227
+
228
+ 0:07:02.722 --> 0:07:10.472
229
+ So think of it as we have this text in maybe
230
+ German.
231
+
232
+ 0:07:10.472 --> 0:07:21.147
233
+ Originally it was an English text, but somebody
234
+ used some noisy encoding.
235
+
236
+ 0:07:21.021 --> 0:07:28.579
237
+ The task is to decipher it again, this crazy cipher
238
+ expressing things in German, and to decipher
239
+
240
+ 0:07:28.579 --> 0:07:31.993
241
+ the meaning again and doing that between.
242
+
243
+ 0:07:32.452 --> 0:07:35.735
244
+ And that is the idea about this noisy channel
245
+ when it.
246
+
247
+ 0:07:36.236 --> 0:07:47.209
248
+ It goes through some type of channel which
249
+ adds noise to the source and then you receive
250
+
251
+ 0:07:47.209 --> 0:07:48.811
252
+ the message.
253
+
254
+ 0:07:49.429 --> 0:08:00.190
255
+ And then the idea is, can we now construct
256
+ the original message out of these messages
257
+
258
+ 0:08:00.190 --> 0:08:05.070
259
+ by modeling some of the channels here?
260
+
261
+ 0:08:06.726 --> 0:08:15.797
262
+ There you know to see a bit the surface of
263
+ the source message with English.
264
+
265
+ 0:08:15.797 --> 0:08:22.361
266
+ It went through some channel and received
267
+ the message.
268
+
269
+ 0:08:22.682 --> 0:08:31.381
270
+ If you're not looking at machine translation,
271
+ your source language is English.
272
+
273
+ 0:08:31.671 --> 0:08:44.388
274
+ Here you see now a bit of this where the confusion
275
+ starts while English as a target language is
276
+
277
+ 0:08:44.388 --> 0:08:47.700
278
+ also the source message.
279
+
280
+ 0:08:47.927 --> 0:08:48.674
281
+ You can see.
282
+
283
+ 0:08:48.674 --> 0:08:51.488
284
+ There is also a mathematics of how we model
285
+ the.
286
+
287
+ 0:08:52.592 --> 0:08:56.888
288
+ It's a noisy channel model from a mathematical
289
+ point of view.
290
+
291
+ 0:08:56.997 --> 0:09:00.245
292
+ So this is again our general formula.
293
+
294
+ 0:09:00.245 --> 0:09:08.623
295
+ We are looking for the most probable translation
296
+ and that is the translation that has the highest
297
+
298
+ 0:09:08.623 --> 0:09:09.735
299
+ probability.
300
+
301
+ 0:09:09.809 --> 0:09:19.467
302
+ We are not interested in the probability itself,
303
+ but we are interested in the target sentence
304
+
305
+ 0:09:19.467 --> 0:09:22.082
306
+ E where this probability.
307
+
308
+ 0:09:23.483 --> 0:09:33.479
309
+ Therefore, we can use the definition
310
+ of conditional probability and Bayes'
311
+
312
+ 0:09:33.479 --> 0:09:42.712
313
+ rule, so this probability equals the probability
314
+ of F given E times the probability of E divided
315
+
316
+ 0:09:42.712 --> 0:09:44.858
317
+ by the probability of F.
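Written out in the notation used here (F for the source sentence, E for the target sentence), the step just described is the following; this is only the standard Bayes decomposition the lecture is stating, not anything beyond it.

```latex
% Bayes' rule applied to the translation probability
P(e \mid f) \;=\; \frac{P(f \mid e)\, P(e)}{P(f)}
% and, since P(f) is fixed for a given input, the search becomes
\hat{e} \;=\; \operatorname*{arg\,max}_{e} P(e \mid f)
        \;=\; \operatorname*{arg\,max}_{e} P(f \mid e)\, P(e)
```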
318
+
319
+ 0:09:45.525 --> 0:09:48.218
320
+ Now see mathematically this confusion.
321
+
322
+ 0:09:48.218 --> 0:09:54.983
323
+ Originally we are interested in the probability
324
+ of the target sentence given the source sentence.
325
+
326
+ 0:09:55.295 --> 0:10:00.742
327
+ And if we are modeling things now, we are
328
+ looking here at the inverse direction, so the
329
+
330
+ 0:10:00.742 --> 0:10:06.499
331
+ probability of F given E to the probability
332
+ of the source sentence given the target sentence
333
+
334
+ 0:10:06.499 --> 0:10:10.832
335
+ is the probability of the target sentence divided
336
+ by the probability.
337
+
338
+ 0:10:13.033 --> 0:10:15.353
339
+ Why are we doing this?
340
+
341
+ 0:10:15.353 --> 0:10:24.333
342
+ Maybe I mean, of course, once it's motivated
343
+ by our model, that we were saying this type
344
+
345
+ 0:10:24.333 --> 0:10:27.058
346
+ of how we are modeling it.
347
+
348
+ 0:10:27.058 --> 0:10:30.791
349
+ The other interesting thing is that.
350
+
351
+ 0:10:31.231 --> 0:10:40.019
352
+ So we are looking at this probability up there,
353
+ which we had before we formulate that we can
354
+
355
+ 0:10:40.019 --> 0:10:40.775
356
+ remove.
357
+
358
+ 0:10:41.181 --> 0:10:46.164
359
+ If we are searching for the highest translation,
360
+ this is fixed.
361
+
362
+ 0:10:46.164 --> 0:10:47.800
363
+ This doesn't change.
364
+
365
+ 0:10:47.800 --> 0:10:52.550
366
+ We have an input, the source sentence, and
367
+ we cannot change.
368
+
369
+ 0:10:52.812 --> 0:11:02.780
370
+ It is always the same, so we can ignore it in
371
+ the argmax because the denominator is exactly
372
+
373
+ 0:11:02.780 --> 0:11:03.939
374
+ the same.
375
+
376
+ 0:11:04.344 --> 0:11:06.683
377
+ And then we have P of F given
378
+
379
+ 0:11:06.606 --> 0:11:13.177
380
+ E times P of E and that is so we are modeling
381
+ the translation process on the one hand with
382
+
383
+ 0:11:13.177 --> 0:11:19.748
384
+ the translation model which models how probable
385
+ is the sentence F given E and on the other
386
+
387
+ 0:11:19.748 --> 0:11:25.958
388
+ hand with the language model which models only
389
+ how probable is this English sentence.
390
+
391
+ 0:11:26.586 --> 0:11:39.366
392
+ That somebody wrote this language or translation
393
+ point of view, this is about fluency.
394
+
395
+ 0:11:40.200 --> 0:11:44.416
396
+ You should have in German, for example, agreement.
397
+
398
+ 0:11:44.416 --> 0:11:50.863
399
+ If the agreement is not right, that's properly
400
+ not said by anybody in German.
401
+
402
+ 0:11:50.863 --> 0:11:58.220
403
+ Nobody would say that's Schönest's house because
404
+ it's not according to the German rules.
405
+
406
+ 0:11:58.598 --> 0:12:02.302
407
+ So this can be modeled by the language model.
408
+
409
+ 0:12:02.542 --> 0:12:09.855
410
+ And you have the translation model which models
411
+ how things get translated between the languages.
412
+
413
+ 0:12:10.910 --> 0:12:18.775
414
+ And here you see again our confusion again,
415
+ and now here the translation model, which
416
+
417
+ 0:12:18.775 --> 0:12:24.360
418
+ is a bit counterintuitive because the
419
+ probability of a source sentence given the
420
+
421
+ 0:12:24.360 --> 0:12:24.868
422
+ target.
423
+
424
+ 0:12:26.306 --> 0:12:35.094
425
+ We have to do that for the Bayes formula, but in
426
+ the following slides I'll talk again about.
427
+
428
+ 0:12:35.535 --> 0:12:45.414
429
+ Because yeah, that's more intuitive that you
430
+ model the translation of the target sentence
431
+
432
+ 0:12:45.414 --> 0:12:48.377
433
+ given the source sentence.
434
+
435
+ 0:12:50.930 --> 0:12:55.668
436
+ And this is what we want to talk about today.
437
+
438
+ 0:12:55.668 --> 0:13:01.023
439
+ We later talk about language models how to
440
+ do that.
441
+
442
+ 0:13:00.940 --> 0:13:04.493
443
+ And maybe also how to combine them.
444
+
445
+ 0:13:04.493 --> 0:13:13.080
446
+ But the focus on today would be how can we
447
+ model this probability to how to generate a
448
+
449
+ 0:13:13.080 --> 0:13:16.535
450
+ translation from source to target?
451
+
452
+ 0:13:19.960 --> 0:13:24.263
453
+ How can we do that and the easiest thing?
454
+
455
+ 0:13:24.263 --> 0:13:33.588
456
+ Maybe if you think about statistics, you count
457
+ how many examples you have, how many target
458
+
459
+ 0:13:33.588 --> 0:13:39.121
460
+ sentences co-occur, and that gives you an estimation.
461
+
462
+ 0:13:40.160 --> 0:13:51.632
463
+ However, like in another model that is not
464
+ possible because most sentences you will never
465
+
466
+ 0:13:51.632 --> 0:13:52.780
467
+ see, so.
468
+
469
+ 0:13:53.333 --> 0:14:06.924
470
+ So what we have to do is break up the translation
471
+ process into smaller models and model each
472
+
473
+ 0:14:06.924 --> 0:14:09.555
474
+ of the decisions.
475
+
476
+ 0:14:09.970 --> 0:14:26.300
477
+ So this simple solution with how you throw
478
+ a dice is like you have a and that gives you
479
+
480
+ 0:14:26.300 --> 0:14:29.454
481
+ the probability.
482
+
483
+ 0:14:29.449 --> 0:14:40.439
484
+ But here's the principle because each event
485
+ is so rare that most of them never happened.
486
+
487
+ 0:14:43.063 --> 0:14:48.164
488
+ Although it might be that in all your training
489
+ data you have never seen this title of set.
490
+
491
+ 0:14:49.589 --> 0:14:52.388
492
+ How can we do that?
493
+
494
+ 0:14:52.388 --> 0:15:04.845
495
+ We look in statistical machine translation
496
+ into two different models, a generative model
497
+
498
+ 0:15:04.845 --> 0:15:05.825
499
+ where.
500
+
501
+ 0:15:06.166 --> 0:15:11.736
502
+ So the idea was to really model model like
503
+ each individual translation between words.
504
+
505
+ 0:15:12.052 --> 0:15:22.598
506
+ So you break down the translation of a full
507
+ sentence into the translation of each individual's
508
+
509
+ 0:15:22.598 --> 0:15:23.264
510
+ word.
511
+
512
+ 0:15:23.264 --> 0:15:31.922
513
+ So you say if you have the black cat, if you
514
+ translate it, the full sentence.
515
+
516
+ 0:15:32.932 --> 0:15:38.797
517
+ Of course, this has some challenges, any ideas
518
+ where this type of model could be very challenging.
519
+
520
+ 0:15:40.240 --> 0:15:47.396
521
+ Vocabularies and videos: Yes, we're going
522
+ to be able to play in the very color.
523
+
524
+ 0:15:47.867 --> 0:15:51.592
525
+ Yes, but you could at least use a bit of the
526
+ context around it.
527
+
528
+ 0:15:51.592 --> 0:15:55.491
529
+ It will not only depend on the word, but it's
530
+ already challenging.
531
+
532
+ 0:15:55.491 --> 0:15:59.157
533
+ You make things very hard, so that's definitely
534
+ one challenge.
535
+
536
+ 0:16:00.500 --> 0:16:07.085
537
+ One other, what did you talk about that we
538
+ just don't want to say?
539
+
540
+ 0:16:08.348 --> 0:16:11.483
541
+ Yes, they are challenging.
542
+
543
+ 0:16:11.483 --> 0:16:21.817
544
+ You have to do something like words, but the
545
+ problem is that you might introduce errors.
546
+
547
+ 0:16:21.841 --> 0:16:23.298
548
+ Later and makes things very comfortable.
549
+
550
+ 0:16:25.265 --> 0:16:28.153
551
+ Wrong splitting is the worst things that are
552
+ very complicated.
553
+
554
+ 0:16:32.032 --> 0:16:35.580
555
+ Saints, for example, and also maybe Japanese
556
+ medicine.
557
+
558
+ 0:16:35.735 --> 0:16:41.203
559
+ In German, yes, especially like these are
560
+ all right.
561
+
562
+ 0:16:41.203 --> 0:16:46.981
563
+ The first thing is maybe the one which is
564
+ most obvious.
565
+
566
+ 0:16:46.981 --> 0:16:49.972
567
+ It is raining cats and dogs.
568
+
569
+ 0:16:51.631 --> 0:17:01.837
570
+ To German, the cat doesn't translate this
571
+ whole chunk into something because there is
572
+
573
+ 0:17:01.837 --> 0:17:03.261
574
+ not really.
575
+
576
+ 0:17:03.403 --> 0:17:08.610
577
+ Mean, of course, in generally there is this
578
+ type of alignment, so there is a correspondence
579
+
580
+ 0:17:08.610 --> 0:17:11.439
581
+ between words in English and the words in German.
582
+
583
+ 0:17:11.439 --> 0:17:16.363
584
+ However, that's not true for all sentences,
585
+ so in some sentences you cannot really say
586
+
587
+ 0:17:16.363 --> 0:17:18.174
588
+ this word translates into that.
589
+
590
+ 0:17:18.498 --> 0:17:21.583
591
+ But you can only let more locate this whole
592
+ phrase.
593
+
594
+ 0:17:21.583 --> 0:17:23.482
595
+ This model into something else.
596
+
597
+ 0:17:23.563 --> 0:17:30.970
598
+ If you think about the don't in English, the
599
+ do is not really clearly where should that
600
+
601
+ 0:17:30.970 --> 0:17:31.895
602
+ be aligned.
603
+
604
+ 0:17:32.712 --> 0:17:39.079
605
+ Then for a long time the most successful approach
606
+ was this phrase based translation model where
607
+
608
+ 0:17:39.079 --> 0:17:45.511
609
+ the idea is your block is not a single word
610
+ but a longer phrase if you try to build translations
611
+
612
+ 0:17:45.511 --> 0:17:46.572
613
+ based on these.
614
+
615
+ 0:17:48.768 --> 0:17:54.105
616
+ But let's start with a word based and what
617
+ you need.
618
+
619
+ 0:17:54.105 --> 0:18:03.470
620
+ There is two main knowledge sources, so on
621
+ the one hand we have a lexicon where we translate
622
+
623
+ 0:18:03.470 --> 0:18:05.786
624
+ possible translations.
625
+
626
+ 0:18:06.166 --> 0:18:16.084
627
+ The main difference between the lexicon and
628
+ statistical machine translation and lexicon
629
+
630
+ 0:18:16.084 --> 0:18:17.550
631
+ as you know.
632
+
633
+ 0:18:17.837 --> 0:18:23.590
634
+ Traditional lexicon: You know how word is
635
+ translated and mainly it's giving you two or
636
+
637
+ 0:18:23.590 --> 0:18:26.367
638
+ three examples with any example sentence.
639
+
640
+ 0:18:26.367 --> 0:18:30.136
641
+ So in this context it gets translated like
642
+ that henceon.
643
+
644
+ 0:18:30.570 --> 0:18:38.822
645
+ In order to model that and work with probabilities
646
+ what we need in a machine translation is these:
647
+
648
+ 0:18:39.099 --> 0:18:47.962
649
+ So if we have the German word bargain, it sends
650
+ me out with a probability of zero point five.
651
+
652
+ 0:18:47.962 --> 0:18:51.545
653
+ Maybe it's translated into a vehicle.
654
+
655
+ 0:18:52.792 --> 0:18:58.876
656
+ And of course this is not easy to be created
657
+ by a human.
658
+
659
+ 0:18:58.876 --> 0:19:07.960
660
+ If ask you and give probabilities for how
661
+ probable this vehicle is, there might: So how
662
+
663
+ 0:19:07.960 --> 0:19:12.848
664
+ we are doing is again that the lexicon is automatically
665
+ will be created from a corpus.
666
+
667
+ 0:19:13.333 --> 0:19:18.754
668
+ And we're just counting here, so we count
669
+ how often does it work, how often does it co
670
+
671
+ 0:19:18.754 --> 0:19:24.425
672
+ occur with vehicle, and then we're taking the
673
+ ratio and saying in the house of time on the
674
+
675
+ 0:19:24.425 --> 0:19:26.481
676
+ English side there was vehicles.
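A minimal sketch of this relative-frequency estimate, t(e given f) = count(f, e) / count(f), assuming we already have word co-occurrence pairs from aligned data; the function name and the tiny example (using 'Fahrzeug' as an illustrative German word) are assumptions, not the lecture's actual data.

```python
from collections import Counter

# Toy relative-frequency lexicon: t(e | f) = count(f, e) / count(f).
# The (source, target) pairs would come from word-aligned parallel data.
def lexicon_from_pairs(pairs):
    pairs = list(pairs)
    joint = Counter(pairs)                    # count(f, e)
    marginal = Counter(f for f, _ in pairs)   # count(f)
    return {(f, e): c / marginal[f] for (f, e), c in joint.items()}

lex = lexicon_from_pairs([("Fahrzeug", "vehicle"), ("Fahrzeug", "car"),
                          ("Haus", "house"), ("Fahrzeug", "vehicle"),
                          ("Fahrzeug", "car")])
print(lex[("Fahrzeug", "vehicle")])   # 0.5, as in the example probability above
```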
677
+
678
+ 0:19:26.481 --> 0:19:31.840
679
+ There was a probability of vehicles given
680
+ back, and there's something like zero point
681
+
682
+ 0:19:31.840 --> 0:19:32.214
683
+ five.
684
+
685
+ 0:19:33.793 --> 0:19:46.669
686
+ That we need another concept, and that is
687
+ this concept of alignment, and now you can
688
+
689
+ 0:19:46.669 --> 0:19:47.578
690
+ have.
691
+
692
+ 0:19:47.667 --> 0:19:53.113
693
+ Since this is quite complicated, the alignment
694
+ in general can be complex.
695
+
696
+ 0:19:53.113 --> 0:19:55.689
697
+ It can be that it's not only like.
698
+
699
+ 0:19:55.895 --> 0:20:04.283
700
+ It can be that two words of a surrender target
701
+ sign and it's also imbiguous.
702
+
703
+ 0:20:04.283 --> 0:20:13.761
704
+ It can be that you say all these two words
705
+ only are aligned together and our words are
706
+
707
+ 0:20:13.761 --> 0:20:15.504
708
+ aligned or not.
709
+
710
+ 0:20:15.875 --> 0:20:21.581
711
+ Is should the do be aligned to the knot in
712
+ German?
713
+
714
+ 0:20:21.581 --> 0:20:29.301
715
+ It's only there because in German it's not,
716
+ so it should be aligned.
717
+
718
+ 0:20:30.510 --> 0:20:39.736
719
+ However, typically it's formalized and it's
720
+ formalized by a function from the target language.
721
+
722
+ 0:20:40.180 --> 0:20:44.051
723
+ And that is to make these models get easier
724
+ and clearer.
725
+
726
+ 0:20:44.304 --> 0:20:49.860
727
+ That means what means does it mean that you
728
+ have a fence that means that each.
729
+
730
+ 0:20:49.809 --> 0:20:58.700
731
+ A sewer's word gives target word and the alliance
732
+ to only one source word because the function
733
+
734
+ 0:20:58.700 --> 0:21:00.384
735
+ is also directly.
736
+
737
+ 0:21:00.384 --> 0:21:05.999
738
+ However, a source word can be hit or like
739
+ by several target words.
740
+
741
+ 0:21:06.286 --> 0:21:11.332
742
+ So you are allowing for one to many alignments,
743
+ but not for many to one alignment.
744
+
745
+ 0:21:11.831 --> 0:21:17.848
746
+ That is a bit of a challenge because you assume
747
+ an alignment should be symmetrical.
748
+
749
+ 0:21:17.848 --> 0:21:24.372
750
+ So if you look at a parallel sentence, it
751
+ should not matter if you look at it from German
752
+
753
+ 0:21:24.372 --> 0:21:26.764
754
+ to English or English to German.
755
+
756
+ 0:21:26.764 --> 0:21:34.352
757
+ So however, it makes these models: Yea possible
758
+ and we'll like to see yea for the phrase bass
759
+
760
+ 0:21:34.352 --> 0:21:36.545
761
+ until we need these alignments.
762
+
763
+ 0:21:36.836 --> 0:21:41.423
764
+ So this alignment was the most important of
765
+ the word-based models.
766
+
767
+ 0:21:41.423 --> 0:21:47.763
768
+ For the next twenty years you need the world
769
+ based models to generate this type of alignment,
770
+
771
+ 0:21:47.763 --> 0:21:50.798
772
+ which is then the first step for the phrase.
773
+
774
+ 0:21:51.931 --> 0:21:59.642
775
+ Approach, and there you can then combine them
776
+ again like both directions into one we'll see.
777
+
778
+ 0:22:00.280 --> 0:22:06.850
779
+ This alignment is very important and allows
780
+ us to do this type of separation.
781
+
782
+ 0:22:08.308 --> 0:22:15.786
783
+ And yet the most commonly used word based
784
+ models are these models referred to as IBM
785
+
786
+ 0:22:15.786 --> 0:22:25.422
787
+ models, and there is a sequence of them with
788
+ great names: And they were like yeah very commonly
789
+
790
+ 0:22:25.422 --> 0:22:26.050
791
+ used.
792
+
793
+ 0:22:26.246 --> 0:22:31.719
794
+ We'll mainly focus on the simple one here
795
+ and look how this works and then not do all
796
+
797
+ 0:22:31.719 --> 0:22:34.138
798
+ the details about the further models.
799
+
800
+ 0:22:34.138 --> 0:22:38.084
801
+ The interesting thing is also that all of
802
+ them are important.
803
+
804
+ 0:22:38.084 --> 0:22:43.366
805
+ So if you want to train this alignment what
806
+ you normally do is first train IBM Model 1.
807
+
808
+ 0:22:43.743 --> 0:22:50.940
809
+ Then you take that as your initialization
810
+ to then train IBM Model 2 and so on.
811
+
812
+ 0:22:50.940 --> 0:22:53.734
813
+ The motivation for that is yeah.
814
+
815
+ 0:22:53.734 --> 0:23:00.462
816
+ The first model gives you: Is so simple that
817
+ you can even find a global optimum, so it gives
818
+
819
+ 0:23:00.462 --> 0:23:06.403
820
+ you a good starting point for the next one
821
+ where the optimization in finding the right
822
+
823
+ 0:23:06.403 --> 0:23:12.344
824
+ model is more difficult and therefore like
825
+ the default technique was to make your model
826
+
827
+ 0:23:12.344 --> 0:23:13.641
828
+ step by step more.
829
+
830
+ 0:23:15.195 --> 0:23:27.333
831
+ In these models we are breaking down the probability
832
+ into smaller steps and then we can define:
833
+
834
+ 0:23:27.367 --> 0:23:38.981
835
+ You see it's not a bit different, so it's not
836
+ the curability and one specific alignment given.
837
+
838
+ 0:23:39.299 --> 0:23:42.729
839
+ We'll let us learn how we can then go from
840
+ one alignment to the full set.
841
+
842
+ 0:23:43.203 --> 0:23:52.889
843
+ The probability of target sentences and one
844
+ alignment between the source and target sentences
845
+
846
+ 0:23:52.889 --> 0:23:56.599
847
+ alignment is this type of function.
848
+
849
+ 0:23:57.057 --> 0:24:14.347
850
+ That every word is aligned in order to ensure
851
+ that every word is aligned.
852
+
853
+ 0:24:15.835 --> 0:24:28.148
854
+ So first of all you do some epsilon, the epsilon
855
+ is just a normalization factor that everything
856
+
857
+ 0:24:28.148 --> 0:24:31.739
858
+ is somehow to inferability.
859
+
860
+ 0:24:31.631 --> 0:24:37.539
861
+ It is epsilon divided by the length of the source sentence plus one to the power
862
+ of the length of the targets.
863
+
864
+ 0:24:37.937 --> 0:24:50.987
865
+ And this is somehow the probability of this
866
+ alignment.
867
+
868
+ 0:24:51.131 --> 0:24:53.224
869
+ So is this alignment probable or not?
870
+
871
+ 0:24:53.224 --> 0:24:55.373
872
+ Of course you can have some intuition.
873
+
874
+ 0:24:55.373 --> 0:24:58.403
875
+ So if there's a lot of crossing, it may be
876
+ not a good.
877
+
878
+ 0:24:58.403 --> 0:25:03.196
879
+ If all of the words align to the same one
880
+ might be not a good alignment, but generally
881
+
882
+ 0:25:03.196 --> 0:25:06.501
883
+ it's difficult to really describe what is a
884
+ good alignment.
885
+
886
+ 0:25:07.067 --> 0:25:11.482
887
+ Say for the first model that's the most simple
888
+ thing.
889
+
890
+ 0:25:11.482 --> 0:25:18.760
891
+ What can be the most simple thing if you think
892
+ about giving a probability to some event?
893
+
894
+ 0:25:21.401 --> 0:25:25.973
895
+ Yes exactly, so just take the uniform distribution.
896
+
897
+ 0:25:25.973 --> 0:25:33.534
898
+ If we don't really know the best thing of
899
+ modeling is all equally probable, of course
900
+
901
+ 0:25:33.534 --> 0:25:38.105
902
+ that is not true, but it's giving you a good
903
+ study.
904
+
905
+ 0:25:38.618 --> 0:25:44.519
906
+ And so this one is just a number of all possible
907
+ alignments for this sentence.
908
+
909
+ 0:25:44.644 --> 0:25:53.096
910
+ So how many alignments are possible, so the
911
+ first target word can be aligned to all source
912
+
913
+ 0:25:53.096 --> 0:25:53.746
914
+ words.
915
+
916
+ 0:25:54.234 --> 0:26:09.743
917
+ The second one can also be aligned to all
918
+ source work, and the third one also to source.
919
+
920
+ 0:26:10.850 --> 0:26:13.678
921
+ This is the number of alignments.
922
+
923
+ 0:26:13.678 --> 0:26:19.002
924
+ The second part is to model the probability
925
+ of the translation.
926
+
927
+ 0:26:19.439 --> 0:26:31.596
928
+ And there it's not nice to have this function,
929
+ so now we are making the product over all target words.
930
+
931
+ 0:26:31.911 --> 0:26:40.068
932
+ And we are making a very strong independent
933
+ assumption because in these models we normally
934
+
935
+ 0:26:40.068 --> 0:26:45.715
936
+ assume the translation probability of one word
937
+ is independent.
938
+
939
+ 0:26:46.126 --> 0:26:49.800
940
+ So how you translate 'visit' is independent
941
+ of all the other parts.
942
+
943
+ 0:26:50.290 --> 0:26:52.907
944
+ That is very strong and very bad.
945
+
946
+ 0:26:52.907 --> 0:26:55.294
947
+ Yeah, you should do it better.
948
+
949
+ 0:26:55.294 --> 0:27:00.452
950
+ We know that it's wrong because how you translate
951
+ this depends on.
952
+
953
+ 0:27:00.452 --> 0:27:05.302
954
+ However, it's a first easy solution and again
955
+ a good starting.
956
+
957
+ 0:27:05.966 --> 0:27:14.237
958
+ So what you do is that you take a product
959
+ of all words and take a translation probability
960
+
961
+ 0:27:14.237 --> 0:27:15.707
962
+ on this target.
963
+
964
+ 0:27:16.076 --> 0:27:23.901
965
+ And because we know that there is always one
966
+ source word aligned to that, so it.
967
+
968
+ 0:27:24.344 --> 0:27:37.409
969
+ If the probability of visits in the zoo doesn't
970
+ really work, the good here I'm again.
971
+
972
+ 0:27:38.098 --> 0:27:51.943
973
+ So most only we have it here, so the probability
974
+ is epsilon divided by the source length plus one to the power of the target length.
975
+
976
+ 0:27:53.913 --> 0:27:58.401
977
+ And then there is somewhere in the last one.
978
+
979
+ 0:27:58.401 --> 0:28:04.484
980
+ There is an error and a switch, so it is the
981
+ other way around.
982
+
983
+ 0:28:04.985 --> 0:28:07.511
984
+ Then you have your translation model.
985
+
986
+ 0:28:07.511 --> 0:28:12.498
987
+ Hopefully, let's assume you have your model
988
+ trained, so that's only assigning.
989
+
990
+ 0:28:12.953 --> 0:28:25.466
991
+ And then this sentence has the probability
992
+ of generating I visit a friend given that you
993
+
994
+ 0:28:25.466 --> 0:28:31.371
995
+ have the source sentence 'ich besuche einen Freund'.
996
+
997
+ 0:28:32.012 --> 0:28:34.498
998
+ Times ten to the power of minus five.
999
+
1000
+ 0:28:35.155 --> 0:28:36.098
1001
+ So this is your model.
1002
+
1003
+ 0:28:36.098 --> 0:28:37.738
1004
+ This is how you're applying your model.
1005
+
1006
+ 0:28:39.479 --> 0:28:44.220
1007
+ As you said, it's the most simple model you
1008
+ assume that all word translations are.
1009
+
1010
+ 0:28:44.204 --> 0:28:46.540
1011
+ Independent of each other.
1012
+
1013
+ 0:28:46.540 --> 0:28:54.069
1014
+ You assume that all alignments are equally
1015
+ important, and then the only thing you need
1016
+
1017
+ 0:28:54.069 --> 0:29:00.126
1018
+ for this type of model is to have this lexicon
1019
+ in order to calculate.
1020
+
1021
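
To make this concrete, here is a minimal sketch (in Python, not the lecturer's code) of scoring one alignment under IBM Model 1: a uniform alignment prior, epsilon over (source length + 1) to the power of the target length, times the product of word translation probabilities. The toy lexicon values and the example sentence pair are assumptions for illustration.

```python
# Minimal sketch (not from the lecture slides): score one alignment under IBM Model 1,
#   P(e, a | f) = epsilon / (len(f) + 1) ** len(e) * prod_j t(e_j | f_{a_j})
# The toy lexicon values below are illustrative assumptions, not trained numbers.
from math import prod

t = {  # t[target_word][source_word] = word translation probability (assumed)
    "I": {"ich": 0.8}, "visit": {"besuche": 0.6},
    "a": {"einen": 0.5}, "friend": {"Freund": 0.9},
}

def model1_score(src, tgt, alignment, epsilon=1.0):
    """alignment[j] = index of the source word that target word j is aligned to."""
    uniform = epsilon / (len(src) + 1) ** len(tgt)                  # alignment prior
    lexical = prod(t[e].get(src[a], 1e-9) for e, a in zip(tgt, alignment))
    return uniform * lexical

src = ["ich", "besuche", "einen", "Freund"]
tgt = ["I", "visit", "a", "friend"]
print(model1_score(src, tgt, alignment=[0, 1, 2, 3]))               # ~3.5e-4
```
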
+ 0:29:00.940 --> 0:29:04.560
1022
+ And that is, of course, now the training process.
1023
+
1024
+ 0:29:04.560 --> 0:29:08.180
1025
+ The question is how do we get this type of
1026
+ lexicon?
1027
+
1028
+ 0:29:09.609 --> 0:29:15.461
1029
+ But before we look into the training, do you
1030
+ have any questions about the model itself?
1031
+
1032
+ 0:29:21.101 --> 0:29:26.816
1033
+ The problem in training is that we have incomplete
1034
+ data.
1035
+
1036
+ 0:29:26.816 --> 0:29:32.432
1037
+ So if you want to count, I mean said you want
1038
+ to count.
1039
+
1040
+ 0:29:33.073 --> 0:29:39.348
1041
+ However, if you don't have the alignment,
1042
+ on the other hand, if you would have a lexicon
1043
+
1044
+ 0:29:39.348 --> 0:29:44.495
1045
+ you could maybe generate the alignment, which
1046
+ is the most probable word.
1047
+
1048
+ 0:29:45.225 --> 0:29:55.667
1049
+ And this is the very common problem that you
1050
+ have this type of incomplete data where you
1051
+
1052
+ 0:29:55.667 --> 0:29:59.656
1053
+ have not one type of information.
1054
+
1055
+ 0:30:00.120 --> 0:30:08.767
1056
+ And you can model this by considering the
1057
+ alignment as your hidden variable and then
1058
+
1059
+ 0:30:08.767 --> 0:30:17.619
1060
+ you can use the expectation maximization algorithm
1061
+ in order to generate the alignment.
1062
+
1063
+ 0:30:17.577 --> 0:30:26.801
1064
+ So the nice thing is that you only need your
1065
+ parallel data, which is aligned on sentence
1066
+
1067
+ 0:30:26.801 --> 0:30:29.392
1068
+ level, but you normally.
1069
+
1070
+ 0:30:29.389 --> 0:30:33.720
1071
+ Is just a lot of work we saw last time.
1072
+
1073
+ 0:30:33.720 --> 0:30:39.567
1074
+ Typically what you have is this type of corpus
1075
+ where.
1076
+
1077
+ 0:30:41.561 --> 0:30:50.364
1078
+ And yeah, the EM algorithm sounds very fancy.
1079
+
1080
+ 0:30:50.364 --> 0:30:58.605
1081
+ However, let's again look at it at a high level.
1082
+
1083
+ 0:30:58.838 --> 0:31:05.841
1084
+ So you're initializing a model by uniform
1085
+ distribution.
1086
+
1087
+ 0:31:05.841 --> 0:31:14.719
1088
+ You're just saying, for the lexicon, that all
1089
+ words are equally possible.
1090
+
1091
+ 0:31:15.215 --> 0:31:23.872
1092
+ And then you apply your model to the data,
1093
+ and that is your expectation step.
1094
+
1095
+ 0:31:23.872 --> 0:31:30.421
1096
+ So given this initial lexicon, we are now
1097
+ calculating the.
1098
+
1099
+ 0:31:30.951 --> 0:31:36.043
1100
+ So we can now take all our parallel sentences,
1101
+ and of course ought to check what is the most
1102
+
1103
+ 0:31:36.043 --> 0:31:36.591
1104
+ probable.
1105
+
1106
+ 0:31:38.338 --> 0:31:49.851
1107
+ And then, of course, at the beginning maybe
1108
+ 'house' is most often aligned.
1109
+
1110
+ 0:31:50.350 --> 0:31:58.105
1111
+ Once we have done this expectation step, we
1112
+ can next do the maximization step and based
1113
+
1114
+ 0:31:58.105 --> 0:32:06.036
1115
+ on this guest alignment, which we have, we
1116
+ can now learn better translation probabilities
1117
+
1118
+ 0:32:06.036 --> 0:32:09.297
1119
+ by just counting how often words co-occur.
1120
+
1121
+ 0:32:09.829 --> 0:32:22.289
1122
+ And then we iterate these steps: We can make
1123
+ this whole process even more stable by not only taking
1124
+
1125
+ 0:32:22.289 --> 0:32:26.366
1126
+ the most probable alignment.
1127
+
1128
+ 0:32:26.346 --> 0:32:36.839
1129
+ in the second step, but in contrast we calculate
1130
+ for all possible alignments the alignment probability
1131
+
1132
+ 0:32:36.839 --> 0:32:40.009
1133
+ and weigh the co-occurrences.
1134
+
1135
+ 0:32:40.000 --> 0:32:41.593
1136
+ Then Things Are Most.
1137
+
1138
+ 0:32:42.942 --> 0:32:49.249
1139
+ Why could that be very challenging if we do
1140
+ it in general and really calculate all probabilities
1141
+
1142
+ 0:32:49.249 --> 0:32:49.834
1143
+ for all?
1144
+
1145
+ 0:32:53.673 --> 0:32:55.905
1146
+ How many alignments are there for a sentence?
1147
+
1148
+ 0:32:58.498 --> 0:33:03.344
1149
+ Yes there, we just saw that in the formula
1150
+ if you remember.
1151
+
1152
+ 0:33:03.984 --> 0:33:12.336
1153
+ This was the formula so it's exponential in
1154
+ the lengths of the target sentence.
1155
+
1156
+ 0:33:12.336 --> 0:33:15.259
1157
+ It would calculate all the alignments.
1158
+
1159
+ 0:33:15.415 --> 0:33:18.500
1160
+ Be very inefficient and not really possible.
1161
+
1162
+ 0:33:18.500 --> 0:33:25.424
1163
+ The nice thing is we can again use some type
1164
+ of dynamic programming, so then we can do this
1165
+
1166
+ 0:33:25.424 --> 0:33:27.983
1167
+ without really calculating all of it.
1168
+
1169
+ 0:33:28.948 --> 0:33:40.791
1170
+ We have the next five slides or so with the
1171
+ most equations in the whole lecture, so don't
1172
+
1173
+ 0:33:40.791 --> 0:33:41.713
1174
+ worry.
1175
+
1176
+ 0:33:42.902 --> 0:34:01.427
1177
+ So we said we have first the expectation step, where
1178
+ it is about calculating the alignment.
1179
+
1180
+ 0:34:02.022 --> 0:34:20.253
1181
+ And we can do this with our initial definition
1182
+ of because this formula.
1183
+
1184
+ 0:34:20.160 --> 0:34:25.392
1185
+ So we can define this as P of E and A given F divided
1186
+ by P of E given F.
1187
+
1188
+ 0:34:25.905 --> 0:34:30.562
1189
+ This is just the normal definition of a conditional
1190
+ probability.
1191
+
1192
+ 0:34:31.231 --> 0:34:37.937
1193
+ And what we then need to be able to calculate
1194
+ is P of E given F.
1195
+
1196
+ 0:34:37.937 --> 0:34:41.441
1197
+ P of E given F is still again quite
1198
+
1199
+ 0:34:41.982 --> 0:34:56.554
1200
+ simple: The probability of the source sentence
1201
+ given the target sentence is quite intuitive.
1202
+
1203
+ 0:34:57.637 --> 0:35:15.047
1204
+ So let's just calculate how to calculate the
1205
+ probability of such an event.
1206
+
1207
+ 0:35:15.215 --> 0:35:21.258
1208
+ So in here we can then put in our original
1209
+ form in our soils.
1210
+
1211
+ 0:35:21.201 --> 0:35:28.023
1212
+ There is the sum over the possible alignments
1213
+ of the first word, and so on until the sum over
1214
+
1215
+ 0:35:28.023 --> 0:35:30.030
1216
+ all possible alignments.
1217
+
1218
+ 0:35:29.990 --> 0:35:41.590
1219
+ And then we have the probability here of the
1220
+ alignment times this product of translation probabilities.
1221
+
1222
+ 0:35:42.562 --> 0:35:58.857
1223
+ Now this one is independent of the alignment,
1224
+ so we can put it to the front here.
1225
+
1226
+ 0:35:58.959 --> 0:36:03.537
1227
+ And now this is where dynamic programming
1228
+ works in.
1229
+
1230
+ 0:36:03.537 --> 0:36:08.556
1231
+ We can change that and make thereby things
1232
+ a lot easier.
1233
+
1234
+ 0:36:08.668 --> 0:36:21.783
1235
+ We can reformulate it like this just as a product
1236
+ over all target positions, and then it's the
1237
+
1238
+ 0:36:21.783 --> 0:36:26.456
1239
+ sum over all source positions.
1240
+
1241
+ 0:36:27.127 --> 0:36:36.454
1242
+ Maybe at least the intuition why this is equal
1243
+ is a lot easier if you look into it as graphic.
1244
+
1245
+ 0:36:36.816 --> 0:36:39.041
1246
+ So what we have here is the table.
1247
+
1248
+ 0:36:39.041 --> 0:36:42.345
1249
+ We have the target position and the source
1250
+ position.
1251
+
1252
+ 0:36:42.862 --> 0:37:03.643
1253
+ And we have to sum up all possible paths
1254
+ through that: The nice thing is that each of
1255
+
1256
+ 0:37:03.643 --> 0:37:07.127
1257
+ these paths the probabilities are independent
1258
+ of each.
1259
+
1260
+ 0:37:07.607 --> 0:37:19.678
1261
+ In order to get the sum of all paths through
1262
+ this table you can use dynamic programming
1263
+
1264
+ 0:37:19.678 --> 0:37:27.002
1265
+ and then say oh this probability is exactly
1266
+ the same.
1267
+
1268
+ 0:37:26.886 --> 0:37:34.618
1269
+ Times the sum of this column, times the sum
1270
+ of this column, and times the sum of this column.
1271
+
1272
+ 0:37:35.255 --> 0:37:41.823
1273
+ That is the same as if you go through all
1274
+ possible paths here and multiply always the
1275
+
1276
+ 0:37:41.823 --> 0:37:42.577
1277
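
A quick numeric check of the rearrangement being described (a sketch with an arbitrary 2x3 table of toy translation probabilities, not from the slides): summing the product over all alignment paths gives the same value as the product of the per-position sums.

```python
# Brute-force sum over all alignments vs. the factored product of column sums.
from itertools import product
from math import prod, isclose

t = [[0.2, 0.5, 0.3],        # t[j][i]: translation probability of target position j
     [0.1, 0.4, 0.2]]        # given source position i (toy values)

J, I = len(t), len(t[0])
brute = sum(prod(t[j][a[j]] for j in range(J)) for a in product(range(I), repeat=J))
factored = prod(sum(t[j][i] for i in range(I)) for j in range(J))
print(brute, factored, isclose(brute, factored))    # 0.7 0.7 True
```
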
+ elements.
1278
+
1279
+ 0:37:43.923 --> 0:37:54.227
1280
+ And that is a simplification because now we
1281
+ only have quadratic numbers and we don't have
1282
+
1283
+ 0:37:54.227 --> 0:37:55.029
1284
+ to go.
1285
+
1286
+ 0:37:55.355 --> 0:38:12.315
1287
+ Similar to, I guess, you may have seen the same
1288
+ type of algorithm for what is it?
1289
+
1290
+ 0:38:14.314 --> 0:38:19.926
1291
+ Yeah, well yeah, so that is the saying.
1292
+
1293
+ 0:38:19.926 --> 0:38:31.431
1294
+ But yeah, I think graphically this is seeable
1295
+ if you don't know exactly the math.
1296
+
1297
+ 0:38:32.472 --> 0:38:49.786
1298
+ Now put these both together, so if you really
1299
+ want to take a piece of and put these two formulas
1300
+
1301
+ 0:38:49.786 --> 0:38:51.750
1302
+ together,.
1303
+
1304
+ 0:38:51.611 --> 0:38:56.661
1305
+ Eliminated and Then You Get Your Final Formula.
1306
+
1307
+ 0:38:56.716 --> 0:39:01.148
1308
+ And that somehow really makes now really intuitively
1309
+ again sense.
1310
+
1311
+ 0:39:01.401 --> 0:39:08.301
1312
+ So the probability of an alignment is the
1313
+ product over all target positions, and then it's
1314
+
1315
+ 0:39:08.301 --> 0:39:15.124
1316
+ the probability of to translate a word into
1317
+ the word that is aligned to, divided by the sum
1318
+
1319
+ 0:39:15.124 --> 0:39:17.915
1320
+ of the other words in the sentence.
1321
+
1322
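
In symbols, the alignment probability just described can be written as follows (a reconstruction of the formula from the spoken description, using t for the lexicon probabilities and position 0 for the empty word):

\[
P(a_j = i \mid e, f) \;=\; \frac{t(e_j \mid f_i)}{\sum_{i'=0}^{|f|} t(e_j \mid f_{i'})}
\]
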
+ 0:39:18.678 --> 0:39:31.773
1323
+ If you look at this again, it makes real sense.
1324
+
1325
+ 0:39:31.891 --> 0:39:43.872
1326
+ So you're looking at how probable it is to
1327
+ translate compared to all the other words.
1328
+
1329
+ 0:39:43.872 --> 0:39:45.404
1330
+ So you're.
1331
+
1332
+ 0:39:45.865 --> 0:39:48.543
1333
+ So and that gives you the alignment probability.
1334
+
1335
+ 0:39:48.768 --> 0:39:54.949
1336
+ Somehow it's not only that it's mathematically
1337
+ correct if you look at it this way, it's somehow
1338
+
1339
+ 0:39:54.949 --> 0:39:55.785
1340
+ intuitively.
1341
+
1342
+ 0:39:55.785 --> 0:39:58.682
1343
+ So if you would say how good is it to align?
1344
+
1345
+ 0:39:58.638 --> 0:40:04.562
1346
+ We had to zoo him to visit, or yet it should
1347
+ depend on how good this is the translation
1348
+
1349
+ 0:40:04.562 --> 0:40:10.620
1350
+ probability compared to how good are the other
1351
+ words in the sentence, and how probable is
1352
+
1353
+ 0:40:10.620 --> 0:40:12.639
1354
+ it that I align them to them.
1355
+
1356
+ 0:40:15.655 --> 0:40:26.131
1357
+ Then you have the expectation step, and the next
1358
+ thing is now the maximization step, so we have
1359
+
1360
+ 0:40:26.131 --> 0:40:30.344
1361
+ now the probability of an alignment.
1362
+
1363
+ 0:40:31.451 --> 0:40:37.099
1364
+ Intuitively, that means how often are words
1365
+ aligned to each other giving this alignment
1366
+
1367
+ 0:40:37.099 --> 0:40:39.281
1368
+ or, in a more precise definition?
1369
+
1370
+ 0:40:39.281 --> 0:40:43.581
1371
+ What is the expectation value that they are
1372
+ aligned to each other?
1373
+
1374
+ 0:40:43.581 --> 0:40:49.613
1375
+ So if there's a lot of alignments with high probability
1376
+ that they're aligned to each other, then.
1377
+
1378
+ 0:40:50.050 --> 0:41:07.501
1379
+ So the count of E given F, given our parallel
1380
+ data is a sum of all possible alignments.
1381
+
1382
+ 0:41:07.968 --> 0:41:14.262
1383
+ That is, this count, and you don't do just
1384
+ count with absolute numbers, but you count
1385
+
1386
+ 0:41:14.262 --> 0:41:14.847
1387
+ always.
1388
+
1389
+ 0:41:15.815 --> 0:41:26.519
1390
+ And to make that a translation probability,
1391
+ you have to normalize it, of course, through:
1392
+
1393
+ 0:41:27.487 --> 0:41:30.584
1394
+ And that's then the whole model.
1395
+
1396
+ 0:41:31.111 --> 0:41:39.512
1397
+ It looks now maybe a bit mathematically complex.
1398
+
1399
+ 0:41:39.512 --> 0:41:47.398
1400
+ The whole training process is described here.
1401
+
1402
+ 0:41:47.627 --> 0:41:53.809
1403
+ So you really, really just have to collect
1404
+ these counts and later normalize that.
1405
+
1406
+ 0:41:54.134 --> 0:42:03.812
1407
+ So repeating that until convergence we have
1408
+ said the EM iteration is always done again.
1409
+
1410
+ 0:42:04.204 --> 0:42:15.152
1411
+ Equally, then you go over all sentence pairs
1412
+ and all of words and calculate the translation.
1413
+
1414
+ 0:42:15.355 --> 0:42:17.983
1415
+ And then you go once again over.
1416
+
1417
+ 0:42:17.983 --> 0:42:22.522
1418
+ It counted this count, count given, and totally
1419
+ e-given.
1420
+
1421
+ 0:42:22.702 --> 0:42:35.316
1422
+ Initially how probable is the E translated
1423
+ to something else, and you normalize your translation
1424
+
1425
+ 0:42:35.316 --> 0:42:37.267
1426
+ probabilities.
1427
+
1428
+ 0:42:38.538 --> 0:42:45.761
1429
+ So this is the whole training process for this
1430
+ type of.
1431
+
1432
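
The whole loop fits in a few lines; here is a minimal sketch of it in Python (a reconstruction, not the lecturer's code), using the toy corpus that appears next ('das Haus' / 'the house', 'das Buch' / 'the book', 'ein Buch' / 'a book'):

```python
# Minimal sketch of the IBM Model 1 EM training loop described here.
from collections import defaultdict

corpus = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"]),
          (["ein", "Buch"], ["a", "book"])]

src_vocab = {w for f, _ in corpus for w in f}
t = defaultdict(lambda: 1.0 / len(src_vocab))      # uniform initialization t[(e, f)]

for _ in range(10):                                # repeat until convergence
    count = defaultdict(float)                     # expected co-occurrence counts
    total = defaultdict(float)
    for f_sent, e_sent in corpus:
        for e in e_sent:                           # E-step: alignment posteriors
            norm = sum(t[(e, f)] for f in f_sent)
            for f in f_sent:
                count[(e, f)] += t[(e, f)] / norm
                total[f] += t[(e, f)] / norm
    for (e, f) in count:                           # M-step: renormalize the lexicon
        t[(e, f)] = count[(e, f)] / total[f]

print(round(t[("house", "Haus")], 3))              # moves towards 1.0 over the iterations
```
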
+ 0:42:46.166 --> 0:43:00.575
1433
+ How that then works is shown here a bit, so
1434
+ we have a very simple corpus.
1435
+
1436
+ 0:43:01.221 --> 0:43:12.522
1437
+ And as we said, you initialize your translation
1438
+ with all possible translations, so 'das'
1439
+
1440
+ 0:43:12.522 --> 0:43:16.620
1441
+ can be aligned to 'the', 'book' or 'house'.
1442
+
1443
+ 0:43:16.997 --> 0:43:25.867
1444
+ And the other ones are missing because only
1445
+ a curse with and book, and then the others
1446
+
1447
+ 0:43:25.867 --> 0:43:26.988
1448
+ will soon.
1449
+
1450
+ 0:43:27.127 --> 0:43:34.316
1451
+ In the initial way your vocabulary is four
1452
+ words, so the initial probabilities are all:
1453
+
1454
+ 0:43:34.794 --> 0:43:50.947
1455
+ And then if you iterate you see that the things
1456
+ which occur often and then get alignments get
1457
+
1458
+ 0:43:50.947 --> 0:43:53.525
1459
+ more and more.
1460
+
1461
+ 0:43:55.615 --> 0:44:01.506
1462
+ In reality, of course, you won't get like
1463
+ zero alignments, but you would normally get
1464
+
1465
+ 0:44:01.506 --> 0:44:02.671
1466
+ there sometimes.
1467
+
1468
+ 0:44:03.203 --> 0:44:05.534
1469
+ But as the probability increases.
1470
+
1471
+ 0:44:05.785 --> 0:44:17.181
1472
+ The training process also guarantees that
1473
+ the probability of your training data is always
1474
+
1475
+ 0:44:17.181 --> 0:44:20.122
1476
+ increased in each iteration.
1477
+
1478
+ 0:44:21.421 --> 0:44:27.958
1479
+ You see that the model tries to model your
1480
+ training data and give you at least good models.
1481
+
1482
+ 0:44:30.130 --> 0:44:37.765
1483
+ Okay, are there any more questions to the
1484
+ training of these type of word-based models?
1485
+
1486
+ 0:44:38.838 --> 0:44:54.790
1487
+ Initially there are like four words on the source
1488
+ side, so it's just one fourth to do equal distribution.
1489
+
1490
+ 0:44:55.215 --> 0:45:01.888
1491
+ So each target word, the probability of the
1492
+ target word, is at four target words, so the
1493
+
1494
+ 0:45:01.888 --> 0:45:03.538
1495
+ uniform distribution.
1496
+
1497
+ 0:45:07.807 --> 0:45:14.430
1498
+ However, there are problems with this initial
1499
+ model and we have this already mentioned at
1500
+
1501
+ 0:45:14.430 --> 0:45:15.547
1502
+ the beginning.
1503
+
1504
+ 0:45:15.547 --> 0:45:21.872
1505
+ There is for example things that yeah you
1506
+ want to allow for reordering but there are
1507
+
1508
+ 0:45:21.872 --> 0:45:27.081
1509
+ definitely some alignments which should be
1510
+ more probable than others.
1511
+
1512
+ 0:45:27.347 --> 0:45:42.333
1513
+ So a friend visit should have a lower probability
1514
+ than visit a friend.
1515
+
1516
+ 0:45:42.302 --> 0:45:50.233
1517
+ It's not always monotone, there is some
1518
+ reordering happening, but if you just mix it
1519
+
1520
+ 0:45:50.233 --> 0:45:51.782
1521
+ crazy, it's not.
1522
+
1523
+ 0:45:52.252 --> 0:46:11.014
1524
+ You have things like one-to-many alignments
1525
+ and they are not really modeled.
1526
+
1527
+ 0:46:11.491 --> 0:46:17.066
1528
+ But it shouldn't be that you align one word
1529
+ to all the others, and that is, you don't want
1530
+
1531
+ 0:46:17.066 --> 0:46:18.659
1532
+ this type of probability.
1533
+
1534
+ 0:46:19.199 --> 0:46:27.879
1535
+ You don't want to align to null, so there's
1536
+ nothing about that and how to deal with other
1537
+
1538
+ 0:46:27.879 --> 0:46:30.386
1539
+ words on the source side.
1540
+
1541
+ 0:46:32.272 --> 0:46:45.074
1542
+ And therefore this was only like the initial
1543
+ model in there.
1544
+
1545
+ 0:46:45.325 --> 0:46:47.639
1546
+ Models, which we saw.
1547
+
1548
+ 0:46:47.639 --> 0:46:57.001
1549
+ They only model the translation probability,
1550
+ so how probable is it to translate one word
1551
+
1552
+ 0:46:57.001 --> 0:46:58.263
1553
+ to another?
1554
+
1555
+ 0:46:58.678 --> 0:47:05.915
1556
+ What you could then add is the absolute position.
1557
+
1558
+ 0:47:05.915 --> 0:47:16.481
1559
+ Yeah, the second word should more probable
1560
+ align to the second position.
1561
+
1562
+ 0:47:17.557 --> 0:47:22.767
1563
+ We add a fertility model that means one word
1564
+ is mostly translated into one word.
1565
+
1566
+ 0:47:23.523 --> 0:47:29.257
1567
+ For example, we saw it there that should be
1568
+ translated into two words, but most words should
1569
+
1570
+ 0:47:29.257 --> 0:47:32.463
1571
+ be one to one, and it's even modeled for each
1572
+ word.
1573
+
1574
+ 0:47:32.463 --> 0:47:37.889
1575
+ So for each source word, how probable is it
1576
+ that it is translated to one, two, three or
1577
+
1578
+ 0:47:37.889 --> 0:47:38.259
1579
+ more?
1580
+
1581
+ 0:47:40.620 --> 0:47:50.291
1582
+ Then IBM model four adds relative positions,
1583
+ so it asks: Maybe instead of modeling, how
1584
+
1585
+ 0:47:50.291 --> 0:47:55.433
1586
+ probable is it that you translate from position
1587
+ five to position twenty five?
1588
+
1589
+ 0:47:55.433 --> 0:48:01.367
1590
+ It's not a very good way, but in a relative
1591
+ position instead of what you try to model it.
1592
+
1593
+ 0:48:01.321 --> 0:48:06.472
1594
+ How probable is that you are jumping Swiss
1595
+ steps forward or Swiss steps back?
1596
+
1597
+ 0:48:07.287 --> 0:48:15.285
1598
+ However, this makes things more complex because
1599
+ what is a jump forward and a jump backward
1600
+
1601
+ 0:48:15.285 --> 0:48:16.885
1602
+ is not that easy.
1603
+
1604
+ 0:48:18.318 --> 0:48:30.423
1605
+ You want to have a model that describes reality,
1606
+ so every sentence that is not possible should
1607
+
1608
+ 0:48:30.423 --> 0:48:37.304
1609
+ have the probability zero because that cannot
1610
+ happen.
1611
+
1612
+ 0:48:37.837 --> 0:48:48.037
1613
+ However, with this type of IBM model four
1614
+ this has a positive probability, so it makes
1615
+
1616
+ 0:48:48.037 --> 0:48:54.251
1617
+ a sentence more complex and you can easily
1618
+ check it.
1619
+
1620
+ 0:48:57.457 --> 0:49:09.547
1621
+ So these models were the first models which
1622
+ tried to directly model and where they are
1623
+
1624
+ 0:49:09.547 --> 0:49:14.132
1625
+ the first to do the translation.
1626
+
1627
+ 0:49:14.414 --> 0:49:19.605
1628
+ So in all of these models, the probability
1629
+ of a word translating into another word is
1630
+
1631
+ 0:49:19.605 --> 0:49:25.339
1632
+ always independent of all the other translations,
1633
+ and that is a challenge because we know that
1634
+
1635
+ 0:49:25.339 --> 0:49:26.486
1636
+ this is not right.
1637
+
1638
+ 0:49:26.967 --> 0:49:32.342
1639
+ And therefore we will come now to then the
1640
+ phrase-based translation models.
1641
+
1642
+ 0:49:35.215 --> 0:49:42.057
1643
+ However, this word alignment is the very important
1644
+ concept which was used in phrase based.
1645
+
1646
+ 0:49:42.162 --> 0:49:50.559
1647
+ Even when people use phrase based, they first
1648
+ would always train a word based model not to
1649
+
1650
+ 0:49:50.559 --> 0:49:56.188
1651
+ get the real model but only to get this type
1652
+ of alignment.
1653
+
1654
+ 0:49:57.497 --> 0:50:01.343
1655
+ What was the main idea of a phrase based machine
1656
+ translation?
1657
+
1658
+ 0:50:03.223 --> 0:50:08.898
1659
+ It's not only that things got mathematically
1660
+ a lot more simple here because you don't try
1661
+
1662
+ 0:50:08.898 --> 0:50:13.628
1663
+ to express the whole translation process, but
1664
+ it's a discriminative model.
1665
+
1666
+ 0:50:13.628 --> 0:50:19.871
1667
+ So what you only try to model is this translation
1668
+ probability or is this translation more probable
1669
+
1670
+ 0:50:19.871 --> 0:50:20.943
1671
+ than some other.
1672
+
1673
+ 0:50:24.664 --> 0:50:28.542
1674
+ The main idea is that the basic units are
1675
+ the phrases.
1676
+
1677
+ 0:50:28.542 --> 0:50:31.500
1678
+ That's why it's called phrase-based.
1679
+
1680
+ 0:50:31.500 --> 0:50:35.444
1681
+ You have to be aware that these are not linguistic
1682
+ phrases.
1683
+
1684
+ 0:50:35.444 --> 0:50:39.124
1685
+ I guess you have some intuition about what
1686
+ is a phrase.
1687
+
1688
+ 0:50:39.399 --> 0:50:45.547
1689
+ You would express as a phrase.
1690
+
1691
+ 0:50:45.547 --> 0:50:58.836
1692
+ However, you wouldn't say that is a very good
1693
+ phrase because it's.
1694
+
1695
+ 0:50:59.339 --> 0:51:06.529
1696
+ However, in this machine learning-based motivated
1697
+ thing, phrases are just indicative.
1698
+
1699
+ 0:51:07.127 --> 0:51:08.832
1700
+ So it can be any split.
1701
+
1702
+ 0:51:08.832 --> 0:51:12.455
1703
+ We don't consider linguistically motivated
1704
+ or not.
1705
+
1706
+ 0:51:12.455 --> 0:51:15.226
1707
+ It can be any sequence of consecutive words.
1708
+
1709
+ 0:51:15.335 --> 0:51:16.842
1710
+ That's the Only Important Thing.
1711
+
1712
+ 0:51:16.977 --> 0:51:25.955
1713
+ The phrase is always a thing of consecutive
1714
+ words, and the motivation behind that is getting
1715
+
1716
+ 0:51:25.955 --> 0:51:27.403
1717
+ computational.
1718
+
1719
+ 0:51:27.387 --> 0:51:35.912
1720
+ People have looked into how you can also use discontinuous
1721
+ phrases, which might be very helpful if you
1722
+
1723
+ 0:51:35.912 --> 0:51:38.237
1724
+ think about the German verb 'haben'.
1725
+
1726
+ 0:51:38.237 --> 0:51:40.046
1727
+ Has this one phrase?
1728
+
1729
+ 0:51:40.000 --> 0:51:47.068
1730
+ There's two phrases, although there's many
1731
+ things in between, but in order to make things
1732
+
1733
+ 0:51:47.068 --> 0:51:52.330
1734
+ still possible and runnable, it's always
1735
+ like consecutive words.
1736
+
1737
+ 0:51:53.313 --> 0:52:05.450
1738
+ The nice thing is that on the one hand you
1739
+ don't need this word to word correspondence
1740
+
1741
+ 0:52:05.450 --> 0:52:06.706
1742
+ anymore.
1743
+
1744
+ 0:52:06.906 --> 0:52:17.088
1745
+ You no longer need to invent some type of alignment
1746
+ that in this case doesn't really make sense.
1747
+
1748
+ 0:52:17.417 --> 0:52:21.710
1749
+ So you can just learn okay, you have this
1750
+ phrase and this phrase and their translation.
1751
+
1752
+ 0:52:22.862 --> 0:52:25.989
1753
+ Secondly, we can add a bit of context into
1754
+ that.
1755
+
1756
+ 0:52:26.946 --> 0:52:43.782
1757
+ You're saying, for example, of Ultimate Customs
1758
+ and of My Shift.
1759
+
1760
+ 0:52:44.404 --> 0:52:51.443
1761
+ And this was difficult to model in word-based
1762
+ models because they always model the translation.
1763
+
1764
+ 0:52:52.232 --> 0:52:57.877
1765
+ Here you can have phrases where you have more
1766
+ context and just jointly translate the phrases,
1767
+
1768
+ 0:52:57.877 --> 0:53:03.703
1769
+ and if you then have seen all by the question
1770
+ as a phrase you can directly use that to generate.
1771
+
1772
+ 0:53:08.468 --> 0:53:19.781
1773
+ Okay, before we go into how to do that, then
1774
+ we start, so the start is when we start with
1775
+
1776
+ 0:53:19.781 --> 0:53:21.667
1777
+ the alignment.
1778
+
1779
+ 0:53:22.022 --> 0:53:35.846
1780
+ So that is what we get from the word-based
1781
+ model and we are assuming to get the.
1782
+
1783
+ 0:53:36.356 --> 0:53:40.786
1784
+ So that is your starting point.
1785
+
1786
+ 0:53:40.786 --> 0:53:47.846
1787
+ You have a certain sentence and one most probable.
1788
+
1789
+ 0:53:48.989 --> 0:54:11.419
1790
+ The challenge you now have is that these alignments
1791
+ are: On the one hand, a source word can be hit
1792
+
1793
+ 0:54:11.419 --> 0:54:19.977
1794
+ several times, so one source word can be aligned
1795
+ to several: So in this case you see that for
1796
+
1797
+ 0:54:19.977 --> 0:54:29.594
1798
+ example Bisher is aligned to three words, so
1799
+ this can be the alignment from English to German,
1800
+
1801
+ 0:54:29.594 --> 0:54:32.833
1802
+ but it cannot be the alignment.
1803
+
1804
+ 0:54:33.273 --> 0:54:41.024
1805
+ In order to account for this inconsistency
1806
+ and being able to do that, what you typically
1807
+
1808
+ 0:54:41.024 --> 0:54:49.221
1809
+ then do is: If you have this inconsistency
1810
+ and you get different things in both directions,.
1811
+
1812
+ 0:54:54.774 --> 0:55:01.418
1813
+ In machine translation to do that you just
1814
+ do it in both directions and somehow combine
1815
+
1816
+ 0:55:01.418 --> 0:55:08.363
1817
+ them because both will do arrows and the hope
1818
+ is yeah if you know both things you minimize.
1819
+
1820
+ 0:55:08.648 --> 0:55:20.060
1821
+ So you would also do it in the other direction
1822
+ and get a different type of alignment, for example
1823
+
1824
+ 0:55:20.060 --> 0:55:22.822
1825
+ that you now have saw.
1826
+
1827
+ 0:55:23.323 --> 0:55:37.135
1828
+ So in this way you are having two alignments
1829
+ and the question is now how do get one alignment
1830
+
1831
+ 0:55:37.135 --> 0:55:38.605
1832
+ and what?
1833
+
1834
+ 0:55:38.638 --> 0:55:45.828
1835
+ There were a lot of different types of heuristics.
1836
+
1837
+ 0:55:45.828 --> 0:55:55.556
1838
+ They normally start with intersection because
1839
+ you should trust them.
1840
+
1841
+ 0:55:55.996 --> 0:55:59.661
1842
+ And your maximum would be to take the
1843
+ union.
1844
+
1845
+ 0:55:59.980 --> 0:56:04.679
1846
+ If one of the systems says they are not aligned
1847
+ then maybe you should not align them.
1848
+
1849
+ 0:56:05.986 --> 0:56:12.240
1850
+ The only question where they are different is what
1851
+ should I do about things where they don't agree?
1852
+
1853
+ 0:56:12.240 --> 0:56:18.096
1854
+ So where only one of them aligns, and then
1855
+ you have heuristics depending on other words
1856
+
1857
+ 0:56:18.096 --> 0:56:22.288
1858
+ around it, you can decide should I align them
1859
+ or should I not.
1860
+
1861
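
As a rough illustration of this intersection/union idea (a sketch only; the actual growing heuristics used in practice are more refined than this):

```python
# Symmetrize two directional word alignments: start from the intersection (trusted links)
# and add links from the union that still touch an unaligned word.
def symmetrize(src2tgt, tgt2src):
    """src2tgt, tgt2src: sets of (source_index, target_index) links."""
    inter = src2tgt & tgt2src          # links both directions agree on: trust these
    union = src2tgt | tgt2src          # candidates only one direction proposed
    result = set(inter)
    for (s, j) in union - inter:       # simple heuristic: only fill unaligned words
        if all(s != s2 for s2, _ in result) or all(j != j2 for _, j2 in result):
            result.add((s, j))
    return result

a1 = {(0, 0), (1, 1), (2, 1)}          # e.g. German->English alignment (toy links)
a2 = {(0, 0), (1, 1), (3, 2)}          # English->German alignment, flipped to the same order
print(sorted(symmetrize(a1, a2)))
```
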
+ 0:56:24.804 --> 0:56:34.728
1862
+ So that is your first step and then the second
1863
+ step in your model.
1864
+
1865
+ 0:56:34.728 --> 0:56:41.689
1866
+ So now you have one alignment for the process.
1867
+
1868
+ 0:56:42.042 --> 0:56:47.918
1869
+ And the idea is that we will now extract all
1870
+ phrase pairs to combinations of source and
1871
+
1872
+ 0:56:47.918 --> 0:56:51.858
1873
+ target phrases where they are consistent within
1874
+ alignment.
1875
+
1876
+ 0:56:52.152 --> 0:56:57.980
1877
+ The idea is a consistence with an alignment
1878
+ that should be a good example and that we can
1879
+
1880
+ 0:56:57.980 --> 0:56:58.563
1881
+ extract.
1882
+
1883
+ 0:56:59.459 --> 0:57:14.533
1884
+ And there are three conditions where we say
1885
+ an alignment has to be consistent.
1886
+
1887
+ 0:57:14.533 --> 0:57:17.968
1888
+ The first one is.
1889
+
1890
+ 0:57:18.318 --> 0:57:24.774
1891
+ So if you have 'bisher' in your phrase pair, then
1892
+
1893
+ 0:57:24.774 --> 0:57:32.306
1894
+ all the three words 'up', 'till' and 'now' should
1895
+ be in there.
1896
+
1897
+ 0:57:32.492 --> 0:57:42.328
1898
+ So 'bisher' to 'up till' would not be a valid phrase
1899
+ pair in this case, but for example 'bisher' to 'up
1900
+
1901
+ 0:57:42.328 --> 0:57:43.433
1902
+ till now' would be.
1903
+
1904
+ 0:57:45.525 --> 0:58:04.090
1905
+ Does anybody now have already an idea about
1906
+ the second rule that should be there?
1907
+
1908
+ 0:58:05.325 --> 0:58:10.529
1909
+ Yes, that is exactly the other thing.
1910
+
1911
+ 0:58:10.529 --> 0:58:22.642
1912
+ If a target word is in the phrase pair, there
1913
+ are also: Then there is one very obvious one.
1914
+
1915
+ 0:58:22.642 --> 0:58:28.401
1916
+ If you extract a phrase pair, at least one
1917
+ word in the phrase.
1918
+
1919
+ 0:58:29.069 --> 0:58:32.686
1920
+ And this sounds like a nice way of working.
1921
+
1922
+ 0:58:32.686 --> 0:58:40.026
1923
+ However, in reality a captain will select
1924
+ some part of the sentence.
1925
+
1926
+ 0:58:40.380 --> 0:58:47.416
1927
+ You can take any possible combination of source
1928
+ and target words for this part, and that of
1929
+
1930
+ 0:58:47.416 --> 0:58:54.222
1931
+ course is not very helpful because you just
1932
+ have no idea, and therefore it says at least
1933
+
1934
+ 0:58:54.222 --> 0:58:58.735
1935
+ one source word should be aligned to one target word
1936
+ to prevent.
1937
+
1938
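
The three conditions can be checked mechanically. Here is a minimal sketch (an illustration, not the exact formulation from the slides), where a candidate phrase pair is a source span and a target span checked against a set of alignment links:

```python
# Consistency check for phrase extraction over an alignment A = {(source_idx, target_idx)}.
def consistent(src_span, tgt_span, alignment):
    s_lo, s_hi = src_span          # inclusive source range
    t_lo, t_hi = tgt_span          # inclusive target range
    inside = [(s, t) for s, t in alignment
              if s_lo <= s <= s_hi and t_lo <= t <= t_hi]
    if not inside:                 # condition 3: at least one alignment link inside
        return False
    for s, t in alignment:         # conditions 1+2: no link may leave the box
        if (s_lo <= s <= s_hi) != (t_lo <= t <= t_hi):
            return False
    return True

A = {(0, 0), (1, 2), (1, 3), (1, 4)}       # e.g. 'bisher' (index 1) -> 'up till now'
print(consistent((1, 1), (2, 4), A))        # True: all three aligned words are covered
print(consistent((1, 1), (2, 3), A))        # False: link (1, 4) leaves the target span
```
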
+ 0:58:59.399 --> 0:59:09.615
1939
+ But still, it means that if you have many
1940
+ unaligned words, the more unaligned words you
1941
+
1942
+ 0:59:09.615 --> 0:59:10.183
1943
+ can.
1944
+
1945
+ 0:59:10.630 --> 0:59:13.088
1946
+ That's not true for the very extreme case.
1947
+
1948
+ 0:59:13.088 --> 0:59:17.603
1949
+ If no word is aligned you can extract nothing
1950
+ because you can never fulfill it.
1951
+
1952
+ 0:59:17.603 --> 0:59:23.376
1953
+ However, if only for example one word is aligned
1954
+ then you can align a lot of different possibilities
1955
+
1956
+ 0:59:23.376 --> 0:59:28.977
1957
+ because you can start with this word and then
1958
+ add source words or target words or any combination
1959
+
1960
+ 0:59:28.977 --> 0:59:29.606
1961
+ of source.
1962
+
1963
+ 0:59:30.410 --> 0:59:37.585
1964
+ So there was typically a problem that if you
1965
+ have too few works in light you can really
1966
+
1967
+ 0:59:37.585 --> 0:59:38.319
1968
+ extract.
1969
+
1970
+ 0:59:38.558 --> 0:59:45.787
1971
+ If you think about this already here you can
1972
+ extract very, very many phrase pairs from:
1973
+
1974
+ 0:59:45.845 --> 0:59:55.476
1975
+ So what you can extract is, for example, what
1976
+ we saw up and so on.
1977
+
1978
+ 0:59:55.476 --> 1:00:00.363
1979
+ So all of them will be extracted.
1980
+
1981
+ 1:00:00.400 --> 1:00:08.379
1982
+ In order to limit this you typically have
1983
+ a length limit so you can only extract phrases
1984
+
1985
+ 1:00:08.379 --> 1:00:08.738
1986
+ up.
1987
+
1988
+ 1:00:09.049 --> 1:00:18.328
1989
+ But still there these phrases where you have
1990
+ all these phrases extracted.
1991
+
1992
+ 1:00:18.328 --> 1:00:22.968
1993
+ You have to think about how to deal.
1994
+
1995
+ 1:00:26.366 --> 1:00:34.966
1996
+ Now we have the phrases, so the other question
1997
+ is what is a good phrase pair and not so good.
1998
+
1999
+ 1:00:35.255 --> 1:00:39.933
2000
+ It might be that you sometimes extract one
2001
+ which is explaining this sentence but is not
2002
+
2003
+ 1:00:39.933 --> 1:00:44.769
2004
+ really a good one because there is something
2005
+ error in there or something special, so it might
2006
+
2007
+ 1:00:44.769 --> 1:00:47.239
2008
+ not be a good phrase pair in another situation.
2009
+
2010
+ 1:00:49.629 --> 1:00:59.752
2011
+ And therefore the easiest thing is again just
2012
+ count, and if a phrase pair occurs very often
2013
+
2014
+ 1:00:59.752 --> 1:01:03.273
2015
+ seems to be a good phrase pair.
2016
+
2017
+ 1:01:03.743 --> 1:01:05.185
2018
+ So if we have this one.
2019
+
2020
+ 1:01:05.665 --> 1:01:09.179
2021
+ And if you have the exam up till now,.
2022
+
2023
+ 1:01:09.469 --> 1:01:20.759
2024
+ Then you look how often does up till now to
2025
+ 'bisher' occur?
2026
+
2027
+ 1:01:20.759 --> 1:01:28.533
2028
+ How often does up until now to 'bisher'?
2029
+
2030
+ 1:01:30.090 --> 1:01:36.426
2031
+ So this is one way of yeah describing the
2032
+ quality of the phrase pair.
2033
+
2034
+ 1:01:37.257 --> 1:01:47.456
2035
+ So one difference is now, and that is the
2036
+ advantage of these primitive models.
2037
+
2038
+ 1:01:47.867 --> 1:01:55.442
2039
+ But instead we are trying to have a lot of
2040
+ features describing how good a phrase parent
2041
+
2042
+ 1:01:55.442 --> 1:01:55.786
2043
+ is.
2044
+
2045
+ 1:01:55.786 --> 1:02:04.211
2046
+ One of these features is this one describing:
2047
+ But in this model we'll later see how to combine
2048
+
2049
+ 1:02:04.211 --> 1:02:04.515
2050
+ it.
2051
+
2052
+ 1:02:04.515 --> 1:02:10.987
2053
+ The nice thing is we can invent any other
2054
+ type of features and add that and normally
2055
+
2056
+ 1:02:10.987 --> 1:02:14.870
2057
+ if you have two or three metrics to describe
2058
+ then.
2059
+
2060
+ 1:02:15.435 --> 1:02:18.393
2061
+ And therefore the phrase pairs.
2062
+
2063
+ 1:02:18.393 --> 1:02:23.220
2064
+ They were not only like evaluated by one type
2065
+ but by several.
2066
+
2067
+ 1:02:23.763 --> 1:02:36.580
2068
+ So this could, for example, have a problem
2069
+ because your target phrase here occurs only
2070
+
2071
+ 1:02:36.580 --> 1:02:37.464
2072
+ once.
2073
+
2074
+ 1:02:38.398 --> 1:02:46.026
2075
+ It will of course only occur with one other
2076
+ source phrase, and that probability will be
2077
+
2078
+ 1:02:46.026 --> 1:02:53.040
2079
+ one which might not be a very good estimation
2080
+ because you've only seen it once.
2081
+
2082
+ 1:02:53.533 --> 1:02:58.856
2083
+ Therefore, we use additional ones to better
2084
+ deal with that, and the first thing is we're
2085
+
2086
+ 1:02:58.856 --> 1:02:59.634
2087
+ doing again.
2088
+
2089
+ 1:02:59.634 --> 1:03:01.129
2090
+ Yeah, we know it by now.
2091
+
2092
+ 1:03:01.129 --> 1:03:06.692
2093
+ If you look at it in the one direction, it's
2094
+ helpful to us to look into the other direction.
2095
+
2096
+ 1:03:06.692 --> 1:03:11.297
2097
+ So you take also the inverse probability,
2098
+ so you not only take in peer of E.
2099
+
2100
+ 1:03:11.297 --> 1:03:11.477
2101
+ G.
2102
+
2103
+ 1:03:11.477 --> 1:03:11.656
2104
+ M.
2105
+
2106
+ 1:03:11.656 --> 1:03:12.972
2107
+ F., but also peer of.
2108
+
2109
+ 1:03:13.693 --> 1:03:19.933
2110
+ And then in addition you say maybe for the
2111
+ especially prolonged phrases they occur rarely,
2112
+
2113
+ 1:03:19.933 --> 1:03:25.898
2114
+ and then you have very high probabilities,
2115
+ and that might not be always the right one.
2116
+
2117
+ 1:03:25.898 --> 1:03:32.138
2118
+ So maybe it's good to also look at the word
2119
+ based probabilities to represent how good they
2120
+
2121
+ 1:03:32.138 --> 1:03:32.480
2122
+ are.
2123
+
2124
+ 1:03:32.692 --> 1:03:44.202
2125
+ So in addition you take the word-based probabilities
2126
+ of this phrase pair as an additional model.
2127
+
2128
+ 1:03:44.704 --> 1:03:52.828
2129
+ So then you would have in total four different
2130
+ values describing how good the phrase is.
2131
+
2132
+ 1:03:52.828 --> 1:04:00.952
2133
+ It would be the relative frequencies in
2134
+ both directions and the lexical probabilities.
2135
+
2136
+ 1:04:01.361 --> 1:04:08.515
2137
+ So four values in describing how probable
2138
+ a phrase translation is.
2139
+
2140
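
A minimal sketch of the two relative-frequency scores just described, in both directions (toy counts, not real data); the two word-based lexical weights would be added as the third and fourth feature values:

```python
# Relative-frequency phrase scores phi(e|f) and phi(f|e) from co-occurrence counts.
from collections import Counter

pair_count = Counter({("bisher", "up to now"): 8, ("bisher", "until now"): 2})
f_count = Counter({"bisher": 10})
e_count = Counter({"up to now": 8, "until now": 2})

def phi(f, e):
    return pair_count[(f, e)] / f_count[f]      # phi(e | f): how often f was translated as e

def phi_inv(f, e):
    return pair_count[(f, e)] / e_count[e]      # phi(f | e): inverse direction

print(phi("bisher", "up to now"), phi_inv("bisher", "up to now"))   # 0.8 1.0
```

Note how the inverse score is 1.0 simply because the target phrase was only seen with this one source phrase, which is exactly the estimation problem discussed above.
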
+ 1:04:11.871 --> 1:04:20.419
2141
+ Then the next challenge is how can we combine
2142
+ these different types of probabilities into
2143
+
2144
+ 1:04:20.419 --> 1:04:23.458
2145
+ a global score saying how good?
2146
+
2147
+ 1:04:24.424 --> 1:04:36.259
2148
+ Model, but before we are doing that give any
2149
+ questions to this phrase extraction and phrase
2150
+
2151
+ 1:04:36.259 --> 1:04:37.546
2152
+ creation.
2153
+
2154
+ 1:04:40.260 --> 1:04:44.961
2155
+ And the motivation for that this was our initial
2156
+ model.
2157
+
2158
+ 1:04:44.961 --> 1:04:52.937
2159
+ If you remember from the beginning of a lecture
2160
+ we had the probability of like P of F given E times
2161
+
2162
+ 1:04:52.937 --> 1:04:53.357
2163
+ P of E.
2164
+
2165
+ 1:04:55.155 --> 1:04:57.051
2166
+ Now the problem is here.
2167
+
2168
+ 1:04:57.051 --> 1:04:59.100
2169
+ That is, of course, right.
2170
+
2171
+ 1:04:59.100 --> 1:05:06.231
2172
+ However, we have done a lot of simplification
2173
+ that the translation probability is independent
2174
+
2175
+ 1:05:06.231 --> 1:05:08.204
2176
+ of the other translation.
2177
+
2178
+ 1:05:08.628 --> 1:05:14.609
2179
+ So therefore our estimations of P of F given E
2180
+ and P of E might not be right, and therefore the
2181
+
2182
+ 1:05:14.609 --> 1:05:16.784
2183
+ combination might not be right.
2184
+
2185
+ 1:05:17.317 --> 1:05:22.499
2186
+ So it can be that, for example, at the end
2187
+ you have a fluent but not accurate translation.
2188
+
2189
+ 1:05:22.782 --> 1:05:25.909
2190
+ And then there could be an easy way around
2191
+ it.
2192
+
2193
+ 1:05:26.126 --> 1:05:32.019
2194
+ If it is fluent but not accurate, it might
2195
+ be that we put too much effort on the language
2196
+
2197
+ 1:05:32.019 --> 1:05:36.341
2198
+ model and we are putting too little effort on
2199
+ the translation model.
2200
+
2201
+ 1:05:36.936 --> 1:05:43.016
2202
+ There we can weight them, so we can weight this
2203
+ a bit stronger.
2204
+
2205
+ 1:05:43.016 --> 1:05:46.305
2206
+ This one is more important than.
2207
+
2208
+ 1:05:48.528 --> 1:05:53.511
2209
+ And based on that we can extend this idea
2210
+ to the log-linear model.
2211
+
2212
+ 1:05:53.893 --> 1:06:02.164
2213
+ The log linear model now says all the translation
2214
+ probabilities is just we have.
2215
+
2216
+ 1:06:02.082 --> 1:06:09.230
2217
+ Describing how good this translation process
2218
+ is, these are the features H which depend on
2219
+
2220
+ 1:06:09.230 --> 1:06:09.468
2221
+ E.
2222
+
2223
+ 1:06:09.468 --> 1:06:09.706
2224
+ F.
2225
+
2226
+ 1:06:09.706 --> 1:06:13.280
2227
+ Only one of them, but generally depend on
2228
+ E.
2229
+
2230
+ 1:06:13.280 --> 1:06:13.518
2231
+ E.
2232
+
2233
+ 1:06:13.518 --> 1:06:13.757
2234
+ E.
2235
+
2236
+ 1:06:13.757 --> 1:06:13.995
2237
+ N.
2238
+
2239
+ 1:06:13.995 --> 1:06:14.233
2240
+ F.
2241
+
2242
+ 1:06:14.474 --> 1:06:22.393
2243
+ Each of these features has a weight saying
2244
+ yeah how good does it model it so that if you're
2245
+
2246
+ 1:06:22.393 --> 1:06:29.968
2247
+ asking a lot of people about some opinion it
2248
+ might also be waiting some opinion more so
2249
+
2250
+ 1:06:29.968 --> 1:06:34.100
2251
+ I put more effort on that and he may not be
2252
+ so.
2253
+
2254
+ 1:06:34.314 --> 1:06:39.239
2255
+ If you're saying that it's maybe a good indication,
2256
+ yeah, would trust that much.
2257
+
2258
+ 1:06:39.559 --> 1:06:41.380
2259
+ And exactly you can do that for you too.
2260
+
2261
+ 1:06:41.380 --> 1:06:42.446
2262
+ You can't add no below.
2263
+
2264
+ 1:06:43.423 --> 1:07:01.965
2265
+ It's like depending on how many you want to
2266
+ have and each of the features gives you value.
2267
+
2268
+ 1:07:02.102 --> 1:07:12.655
2269
+ The nice thing is that we can normally ignore
2270
+ because we are not interested in the probability
2271
+
2272
+ 1:07:12.655 --> 1:07:13.544
2273
+ itself.
2274
+
2275
+ 1:07:13.733 --> 1:07:18.640
2276
+ And again, if that's not normalized, that's
2277
+ fine.
2278
+
2279
+ 1:07:18.640 --> 1:07:23.841
2280
+ So if this value is the highest, that's the
2281
+ highest.
2282
+
2283
+ 1:07:26.987 --> 1:07:29.302
2284
+ Can we do that?
2285
+
2286
+ 1:07:29.302 --> 1:07:34.510
2287
+ Let's start with two simple things.
2288
+
2289
+ 1:07:34.510 --> 1:07:39.864
2290
+ Then you have one translation model.
2291
+
2292
+ 1:07:40.000 --> 1:07:43.102
2293
+ Which gives you the P of E given F.
2294
+
2295
+ 1:07:43.383 --> 1:07:49.203
2296
+ It can be typically as a feature it would
2297
+ take the logarithm of this probability, so minus
2298
+ nine point four seven.
2299
+ 1:07:49.203 --> 1:07:51.478
2300
+ is nine hundred and fourty seven.
2301
+
2302
+ 1:07:51.451 --> 1:07:57.846
2303
+ And the language model which says you how
2304
+ fluent the English side is; how you can calculate
2305
+
2306
+ 1:07:57.846 --> 1:07:59.028
2307
+ the probability.
2308
+
2309
+ 1:07:58.979 --> 1:08:03.129
2310
+ In some future lectures we'll cover how you get
2311
+ this probability.
2312
+
2313
+ 1:08:03.129 --> 1:08:10.465
2314
+ You can take as a feature again the log of the probability,
2315
+ then you have minus seven and then give different
2316
+
2317
+ 1:08:10.465 --> 1:08:11.725
2318
+ weights to them.
2319
+
2320
+ 1:08:12.292 --> 1:08:19.243
2321
+ And that means that your probability is one
2322
+ divided by Z times e to the power of this.
2323
+
2324
+ 1:08:20.840 --> 1:08:38.853
2325
+ You're not really interested in the probability,
2326
+ so you just calculate the score in the exponent.
2327
+
2328
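
A minimal sketch of this log-linear combination (toy feature values and weights, not the numbers from the slide): since we only compare candidate translations, the normalization constant can be ignored and the score is just a weighted sum of the feature values.

```python
# Log-linear model score: score(e, f) = sum_i lambda_i * h_i(e, f),
# where each h_i is e.g. a log-probability from one of the models.
def loglinear_score(features, weights):
    return sum(weights[name] * value for name, value in features.items())

candidate = {"log_tm": -9.4, "log_lm": -7.0}            # assumed feature values
weights = {"log_tm": 1.0, "log_lm": 0.6}                # assumed lambdas (to be tuned)
print(loglinear_score(candidate, weights))               # higher score = preferred candidate
```
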
+ 1:08:40.000 --> 1:08:41.668
2329
+ Maximal Maximal I Think.
2330
+
2331
+ 1:08:42.122 --> 1:08:57.445
2332
+ You can, for example, try different translations,
2333
+ calculate all their scores and take in the
2334
+
2335
+ 1:08:57.445 --> 1:09:00.905
2336
+ end the translation.
2337
+
2338
+ 1:09:03.423 --> 1:09:04.661
2339
+ Why to do that.
2340
+
2341
+ 1:09:05.986 --> 1:09:10.698
2342
+ We've done that now for two, but of course
2343
+ you cannot only do it with two.
2344
+
2345
+ 1:09:10.698 --> 1:09:16.352
2346
+ You can do it now with any fixed number, so
2347
+ of course you have to decide in the beginning
2348
+
2349
+ 1:09:16.352 --> 1:09:21.944
2350
+ I want to have ten features or something like
2351
+ that, but you can take all these features.
2352
+
2353
+ 1:09:22.002 --> 1:09:29.378
2354
+ And yeah, based on them, they calculate your
2355
+ model probability or the model score.
2356
+
2357
+ 1:09:31.031 --> 1:09:40.849
2358
+ A big advantage over the initial.
2359
+
2360
+ 1:09:40.580 --> 1:09:45.506
2361
+ A model because now we can add a lot of features
2362
+ and there was diamond machine translation,
2363
+
2364
+ 1:09:45.506 --> 1:09:47.380
2365
+ a statistical machine translation.
2366
+
2367
+ 1:09:47.647 --> 1:09:57.063
2368
+ So you can develop new features, new ways
2369
+ of evaluating them so that can hopefully better
2370
+
2371
+ 1:09:57.063 --> 1:10:00.725
2372
+ describe what is good translation?
2373
+
2374
+ 1:10:01.001 --> 1:10:16.916
2375
+ If you have a new great feature you can calculate
2376
+ these features and then how much better do
2377
+
2378
+ 1:10:16.916 --> 1:10:18.969
2379
+ they model?
2380
+
2381
+ 1:10:21.741 --> 1:10:27.903
2382
+ There is one challenge which haven't touched
2383
+ upon yet.
2384
+
2385
+ 1:10:27.903 --> 1:10:33.505
2386
+ So could you easily build your model if you
2387
+ have.
2388
+
2389
+ 1:10:38.999 --> 1:10:43.016
2390
+ I assumed here something which I just guessed, but
2391
+ which might not be that easy.
2392
+
2393
+ 1:10:49.990 --> 1:10:56.333
2394
+ The weight for the translation model is and
2395
+ the weight for the language model is.
2396
+
2397
+ 1:10:56.716 --> 1:11:08.030
2398
+ That's a bit arbitrary, so why should you
2399
+ use this one and guess normally you won't be
2400
+
2401
+ 1:11:08.030 --> 1:11:11.801
2402
+ able to select that by hand?
2403
+
2404
+ 1:11:11.992 --> 1:11:19.123
2405
+ So typically we didn't have like or features
2406
+ in there, but features is very common.
2407
+
2408
+ 1:11:19.779 --> 1:11:21.711
2409
+ So how do you select them?
2410
+
2411
+ 1:11:21.711 --> 1:11:24.645
2412
+ There was a second part of the training.
2413
+
2414
+ 1:11:24.645 --> 1:11:27.507
2415
+ These models were trained in two steps.
2416
+
2417
+ 1:11:27.507 --> 1:11:32.302
2418
+ On the one hand, we had the training of the
2419
+ individual components.
2420
+
2421
+ 1:11:32.302 --> 1:11:38.169
2422
+ We saw that now how to build the phrase based
2423
+ system, how to extract the phrases.
2424
+
2425
+ 1:11:38.738 --> 1:11:46.223
2426
+ But then if you have these different components
2427
+ you need a second training to learn the optimal.
2428
+
2429
+ 1:11:46.926 --> 1:11:51.158
2430
+ And typically this is referred to as the tuning
2431
+ of the system.
2432
+
2433
+ 1:11:51.431 --> 1:12:07.030
2434
+ So now if you have different types of models
2435
+ describing what a good translation is you need
2436
+
2437
+ 1:12:07.030 --> 1:12:10.760
2438
+ to find good weights.
2439
+
2440
+ 1:12:12.312 --> 1:12:14.315
2441
+ So how can you do it?
2442
+
2443
+ 1:12:14.315 --> 1:12:20.871
2444
+ The easiest thing is, of course, you can just
2445
+ try different things out.
2446
+
2447
+ 1:12:21.121 --> 1:12:27.496
2448
+ You can then always select the best hyper
2449
+ parameters.
2450
+
2451
+ 1:12:27.496 --> 1:12:38.089
2452
+ You can evaluate it with some metrics saying:
2453
+ You can score all your outputs, always select
2454
+
2455
+ 1:12:38.089 --> 1:12:42.543
2456
+ the best one and then get this translation.
2457
+
2458
+ 1:12:42.983 --> 1:12:45.930
2459
+ And you can do that for a lot of different
2460
+ possible combinations.
2461
+
2462
+ 1:12:47.067 --> 1:12:59.179
2463
+ However, the challenge is the complexity,
2464
+ so if you have only parameters and each of
2465
+
2466
+ 1:12:59.179 --> 1:13:04.166
2467
+ them has values you try for, then.
2468
+
2469
+ 1:13:04.804 --> 1:13:16.895
2470
+ We won't be able to try all of these possible
2471
+ combinations, so what we have to do is some
2472
+
2473
+ 1:13:16.895 --> 1:13:19.313
2474
+ more intelligent.
2475
+
2476
+ 1:13:20.540 --> 1:13:34.027
2477
+ And what has been done there in machine translation
2478
+ is referred to as a minimum error rate training.
2479
+
2480
+ 1:13:34.534 --> 1:13:41.743
2481
+ The whole search is a very intuitive one, so you have
2482
+ all these different parameters, so how do.
2483
+
2484
+ 1:13:42.522 --> 1:13:44.358
2485
+ And the idea is okay.
2486
+
2487
+ 1:13:44.358 --> 1:13:52.121
2488
+ I start with an initial guess and then I optimize
2489
+ one single parameter that's always easier.
2490
+
2491
+ 1:13:52.121 --> 1:13:54.041
2492
+ That's some or linear.
2493
+
2494
+ 1:13:54.041 --> 1:13:58.882
2495
+ So you're searching the best value for the
2496
+ one parameter.
2497
+
2498
+ 1:13:59.759 --> 1:14:04.130
2499
+ Often visualized with a San Francisco map.
2500
+
2501
+ 1:14:04.130 --> 1:14:13.786
2502
+ Just imagine if you want to go to the highest
2503
+ spot in San Francisco, you're standing somewhere
2504
+
2505
+ 1:14:13.786 --> 1:14:14.395
2506
+ here.
2507
+
2508
+ 1:14:14.574 --> 1:14:21.220
2509
+ You are switching your dimensions so you are
2510
+ going in this direction again finding.
2511
+
2512
+ 1:14:21.661 --> 1:14:33.804
2513
+ Now you're on a different street and this
2514
+ one is not a different one so you go in here
2515
+
2516
+ 1:14:33.804 --> 1:14:36.736
2517
+ so you can iterate.
2518
+
2519
+ 1:14:36.977 --> 1:14:56.368
2520
+ The one thing of course is find a local optimum,
2521
+ especially if you start in two different positions.
2522
+
2523
+ 1:14:56.536 --> 1:15:10.030
2524
+ So yeah, there is a heuristic in there, so
2525
+ typically it's done again if you land in different
2526
+
2527
+ 1:15:10.030 --> 1:15:16.059
2528
+ positions with different starting points.
2529
+
2530
+ 1:15:16.516 --> 1:15:29.585
2531
+ What is different or what is like the addition
2532
+ of error rate training compared to the standard?
2533
+
2534
+ 1:15:29.729 --> 1:15:37.806
2535
+ So the question is, like we said, you can
2536
+ now evaluate different values for one parameter.
2537
+
2538
+ 1:15:38.918 --> 1:15:42.857
2539
+ And the question is: Which values should you
2540
+ try out for one parameters?
2541
+
2542
+ 1:15:42.857 --> 1:15:47.281
2543
+ Should you just do zero point one, zero point
2544
+ two, zero point three, or anything?
2545
+
2546
+ 1:15:49.029 --> 1:16:03.880
2547
+ If you change only one parameter then you
2548
+ can define the score of translation as a linear
2549
+
2550
+ 1:16:03.880 --> 1:16:05.530
2551
+ function.
2552
+
2553
+ 1:16:05.945 --> 1:16:17.258
2554
+ That this is the one that possesses, and yet
2555
+ if you change the parameter, the score of this.
2556
+
2557
+ 1:16:17.397 --> 1:16:26.506
2558
+ It may depend so your score is there because
2559
+ the rest you don't change your feature value.
2560
+
2561
+ 1:16:26.826 --> 1:16:30.100
2562
+ And the feature value is therefore the steepness
2563
+ of the curve.
2564
+
2565
+ 1:16:30.750 --> 1:16:38.887
2566
+ And now look at different possible translations.
2567
+
2568
+ 1:16:38.887 --> 1:16:46.692
2569
+ Therefore, how they go up here is differently.
2570
+
2571
+ 1:16:47.247 --> 1:16:59.289
2572
+ So in this case if you look at the minimum
2573
+ score so there should be as minimum.
2574
+
2575
+ 1:17:00.300 --> 1:17:10.642
2576
+ So it's enough to check once here and check
2577
+ once here because if you check here and here.
2578
+
2579
+ 1:17:11.111 --> 1:17:24.941
2580
+ And that is the idea in minimum error rate training
2581
+ when you select different hypotheses.
2582
+
2583
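
A minimal sketch of this line-search insight (an illustration, not Och's full minimum error rate training): with one weight varying, every n-best hypothesis' model score is a line, so the model's choice can only change at intersections of those lines, and one probe per interval is enough.

```python
# For lines (a, b, id) with score a + b*lambda, find one probe point per interval
# between intersections and report which hypothesis the model would pick there.
def best_hypothesis_per_interval(lines):
    xs = {0.0}                                           # seed so the set is never empty
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            a1, b1, _ = lines[i]; a2, b2, _ = lines[j]
            if b1 != b2:
                xs.add((a2 - a1) / (b1 - b2))            # intersection of the two lines
    xs = sorted(xs)
    probes = [xs[0] - 1.0] + [(x + y) / 2 for x, y in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    return [(x, max(lines, key=lambda l: l[0] + l[1] * x)[2]) for x in probes]

# Three toy hypotheses with (offset, slope) in the varied weight:
print(best_hypothesis_per_interval([(1.0, 0.2, "A"), (0.5, 1.0, "B"), (2.0, -0.5, "C")]))
```

In the real procedure one would then compute the error metric (e.g. BLEU) once for the winner of each interval and move the weight into the best interval.
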
+ 1:17:29.309 --> 1:17:34.378
2584
+ So in yeah, the minimum error rate training
2585
+ is a power search.
2586
+
2587
+ 1:17:34.378 --> 1:17:37.453
2588
+ Then we do an intelligent step size.
2589
+
2590
+ 1:17:37.453 --> 1:17:39.364
2591
+ We do random restarts.
2592
+
2593
+ 1:17:39.364 --> 1:17:46.428
2594
+ Then things are still too slow because it
2595
+ might say we would have to decode a lot of
2596
+
2597
+ 1:17:46.428 --> 1:17:47.009
2598
+ times.
2599
+
2600
+ 1:17:46.987 --> 1:17:54.460
2601
+ So what we can do to make things even faster
2602
+ is we are decoding once with the current parameters,
2603
+
2604
+ 1:17:54.460 --> 1:18:01.248
2605
+ but then we are not generating only the most
2606
+ probable translation, but we are generating
2607
+
2608
+ 1:18:01.248 --> 1:18:05.061
2609
+ the most probable ten hundred translations
2610
+ or so.
2611
+
2612
+ 1:18:06.006 --> 1:18:18.338
2613
+ And then we are optimizing our weights by
2614
+ only looking at this one hundred translation
2615
+
2616
+ 1:18:18.338 --> 1:18:23.725
2617
+ and finding the optimal values there.
2618
+
2619
+ 1:18:24.564 --> 1:18:39.284
2620
+ Of course, it might be a problem that at some
2621
+ point you have no good way to find good translations
2622
+
2623
+ 1:18:39.284 --> 1:18:42.928
2624
+ inside your n-best list.
2625
+
2626
+ 1:18:43.143 --> 1:18:52.357
2627
+ You have to iterate that sometime, but the
2628
+ important thing is you don't have to decode
2629
+
2630
+ 1:18:52.357 --> 1:18:56.382
2631
+ every time you need weights, but you.
2632
+
2633
+ 1:18:57.397 --> 1:19:11.325
2634
+ There is mainly a speed up process in order
2635
+ to make things more, make things even faster.
2636
+
2637
+ 1:19:15.515 --> 1:19:20.160
2638
+ Good Then We'll Finish With.
2639
+
2640
+ 1:19:20.440 --> 1:19:25.289
2641
+ Looking at how do you really calculate the
2642
+ scores and everything?
2643
+
2644
+ 1:19:25.289 --> 1:19:32.121
2645
+ Because what we did look into was a translation
2646
+ of a full sentence doesn't really consist of
2647
+
2648
+ 1:19:32.121 --> 1:19:37.190
2649
+ only one single phrase, but of course you have
2650
+ to combine different.
2651
+
2652
+ 1:19:37.637 --> 1:19:40.855
2653
+ So how does that now really look and how do
2654
+ we have to do?
2655
+
2656
+ 1:19:41.361 --> 1:19:48.252
2657
+ Just think again of the translation we have
2658
+ done before.
2659
+
2660
+ 1:19:48.252 --> 1:19:59.708
2661
+ The sentence must be: What is the probability
2662
+ of translating this one into what we saw after
2663
+
2664
+ 1:19:59.708 --> 1:20:00.301
2665
+ now?
2666
+
2667
+ 1:20:00.301 --> 1:20:03.501
2668
+ We're doing this by using.
2669
+
2670
+ 1:20:03.883 --> 1:20:07.157
2671
+ So we're having the phrase pair.
2672
+
2673
+ 1:20:07.157 --> 1:20:12.911
2674
+ Vasvia is the phrase pair up to now and gazine
2675
+ harm into.
2676
+
2677
+ 1:20:13.233 --> 1:20:18.970
2678
+ In addition, that is important because translation
2679
+ is not monotone.
2680
+
2681
+ 1:20:18.970 --> 1:20:26.311
2682
+ We are not putting phrase pairs in the same
2683
+ order as we are doing it on the source and
2684
+
2685
+ 1:20:26.311 --> 1:20:31.796
2686
+ on the target, but in order to generate the
2687
+ correct translation.
2688
+
2689
+ 1:20:31.771 --> 1:20:34.030
2690
+ So we have to shuffle the phrase pairs.
2691
+
2692
+ 1:20:34.294 --> 1:20:39.747
2693
+ And the blue one is in front on the source
2694
+ side but not on the target.
2695
+
2696
+ 1:20:40.200 --> 1:20:49.709
2697
+ This reordering makes statistical machine
2698
+ translation really complicated because if you
2699
+
2700
+ 1:20:49.709 --> 1:20:53.313
2701
+ would just monotonely do this then.
2702
+
2703
+ 1:20:53.593 --> 1:21:05.288
2704
+ The problem is if you would allow all possible
2705
+ combinations of reshuffling them, then again.
2706
+
2707
+ 1:21:05.565 --> 1:21:11.508
2708
+ So you again have to use some type of heuristics
2709
+ which shuffle you allow and which you don't
2710
+
2711
+ 1:21:11.508 --> 1:21:11.955
2712
+ allow.
2713
+
2714
+ 1:21:12.472 --> 1:21:27.889
2715
+ That was relatively challenging since, for
2716
+ example, if you think of German you would
2717
+
2718
+ 1:21:27.889 --> 1:21:32.371
2719
+ have to allow very long.
2720
+
2721
+ 1:21:33.033 --> 1:21:52.218
2722
+ But if we have now this, how do we calculate
2723
+ the translation score so the translation score?
2724
+
2725
+ 1:21:52.432 --> 1:21:55.792
2726
+ That's why we sum up the scores at the end.
2727
+
2728
+ 1:21:56.036 --> 1:22:08.524
2729
+ So you said our first feature is the probability
2730
+ of the full sentence.
2731
+
2732
+ 1:22:08.588 --> 1:22:13.932
2733
+ So we say, the translation of each phrase
2734
+ pair is independent of each other, and then
2735
+
2736
+ 1:22:13.932 --> 1:22:19.959
2737
+ we can hear the probability of the full sentences,
2738
+ fear of what we give, but fear of times, fear
2739
+
2740
+ 1:22:19.959 --> 1:22:24.246
2741
+ of sobbing because they have time to feel up
2742
+ till now is impossible.
2743
+
2744
+ 1:22:24.664 --> 1:22:29.379
2745
+ Now we can use the laws of logarithm calculation.
2746
+
2747
+ 1:22:29.609 --> 1:22:36.563
2748
+ That's the logarithm of the first probability.
2749
+
2750
+ 1:22:36.563 --> 1:22:48.153
2751
+ We'll get our first score, which says the
2752
+ translation model is minus.
2753
+
2754
+ 1:22:49.970 --> 1:22:56.586
2755
+ And that we're not doing only once, but we're
2756
+ exactly doing it with all our translation model.
2757
+
2758
+ 1:22:56.957 --> 1:23:03.705
2759
+ So we said we also have the relative frequency
2760
+ and the inverse directions of the.
2761
+
2762
+ 1:23:03.843 --> 1:23:06.226
2763
+ So in the end you'll have four scores.
2764
+
2765
+ 1:23:06.226 --> 1:23:09.097
2766
+ Here how you combine them is exactly the same.
2767
+
2768
+ 1:23:09.097 --> 1:23:12.824
2769
+ The only thing is how you look them up for
2770
+ each phrase pair.
2771
+
2772
+ 1:23:12.824 --> 1:23:18.139
2773
+ We have said in the beginning we are storing
2774
+ four scores describing how good they are.
2775
+
2776
+ 1:23:19.119 --> 1:23:25.415
2777
+ And these are then four scores describing
2778
+ how probable the sentence is.
2779
+
2780
+ 1:23:27.427 --> 1:23:31.579
2781
+ Then we can have more scores.
2782
+
2783
+ 1:23:31.579 --> 1:23:37.806
2784
+ For example, we can have a distortion model.
2785
+
2786
+ 1:23:37.806 --> 1:23:41.820
2787
+ How much reordering is done?
2788
+
2789
+ 1:23:41.841 --> 1:23:47.322
2790
+ There were different types of ones who won't
2791
+ go into detail, but just imagine you have no
2792
+
2793
+ 1:23:47.322 --> 1:23:47.748
2794
+ score.
2795
+
2796
+ 1:23:48.548 --> 1:23:56.651
2797
+ Then you have a language model which is the
2798
+ sequence of what we saw until now.
2799
+
2800
+ 1:23:56.651 --> 1:24:06.580
2801
+ How we generate this language model probability
2802
+ we will cover later. And there were even more scores.
2803
+
2804
+ 1:24:06.580 --> 1:24:11.841
2805
+ So one, for example, was a phrase count score,
2806
+ which just counts how many.
2807
+
2808
+ 1:24:12.072 --> 1:24:19.555
2809
+ In order to learn is it better to have more
2810
+ short phrases or should bias on having fewer
2811
+
2812
+ 1:24:19.555 --> 1:24:20.564
2813
+ and longer.
2814
+
2815
+ 1:24:20.940 --> 1:24:28.885
2816
+ Easily add this but just counting so the value
2817
+ will be here and like putting in a count like
2818
+
2819
+ 1:24:28.885 --> 1:24:32.217
2820
+ typically how good is it to translate.
2821
+
2822
+ 1:24:32.932 --> 1:24:44.887
2823
+ For language model, the probability normally
2824
+ gets shorter the longer the sequences in order
2825
+
2826
+ 1:24:44.887 --> 1:24:46.836
2827
+ to counteract.
2828
+
2829
+ 1:24:47.827 --> 1:24:59.717
2830
+ And then you get your final score by multiplying
2831
+ each of the scores we had before.
2832
+
2833
+ 1:24:59.619 --> 1:25:07.339
2834
+ Optimization and that gives you a final score
2835
+ maybe of twenty three point seven eight five
2836
+
2837
+ 1:25:07.339 --> 1:25:13.278
2838
+ and then you can do that with several possible
2839
+ translation tests and.
2840
+
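A sketch of how such feature scores might be combined into one final score for a hypothesis. The feature names, values and weights below are placeholders; in a real system the weights would be tuned on held-out data, which is not shown here.

```python
# Illustrative log-domain feature scores for one translation hypothesis.
features = {
    "p(e|f)":         -2.08,  # phrase translation model
    "p(f|e)":         -1.90,  # inverse direction
    "lex(e|f)":       -3.10,  # lexical weighting
    "lex(f|e)":       -2.75,
    "language_model": -4.20,
    "distortion":     -1.00,
    "phrase_count":    2.0,   # number of phrases used
}

# One weight per feature; these placeholder values stand in for tuned weights.
weights = {name: 1.0 for name in features}
weights["phrase_count"] = -0.1

final_score = sum(weights[name] * value for name, value in features.items())
print(final_score)
```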
2841
+ 1:25:14.114 --> 1:25:23.949
2842
+ One may be important point here is so the
2843
+ score not only depends on the target side but
2844
+
2845
+ 1:25:23.949 --> 1:25:32.444
2846
+ it also depends on which phrases you have used
2847
+ so you could have generated.
2848
+
2849
+ 1:25:32.772 --> 1:25:38.076
2850
+ So you would have the same translation, but
2851
+ you would have a different split into phrase.
2852
+
2853
+ 1:25:38.979 --> 1:25:45.636
2854
+ And this was normally ignored so you would
2855
+ just look at all of them and then select the
2856
+
2857
+ 1:25:45.636 --> 1:25:52.672
2858
+ one which has the highest probability and ignore
2859
+ that this translation could be generated by
2860
+
2861
+ 1:25:52.672 --> 1:25:54.790
2862
+ several splits into phrase.
2863
+
2864
+ 1:25:57.497 --> 1:26:06.097
2865
+ So to summarize what we look into today and
2866
+ what you should hopefully remember is: Statistical
2867
+
2868
+ 1:26:06.097 --> 1:26:11.440
2869
+ models in how to generate machine translation
2870
+ output that were the word based statistical
2871
+
2872
+ 1:26:11.440 --> 1:26:11.915
2873
+ models.
2874
+
2875
+ 1:26:11.915 --> 1:26:16.962
2876
+ There were the IBM models at the beginning, and
2877
+ then we have the phrase-based MT where
2878
+
2879
+ 1:26:16.962 --> 1:26:22.601
2880
+ it's about building the translation by putting
2881
+ together these blocks of phrases and combining.
2882
+
2883
+ 1:26:23.283 --> 1:26:34.771
2884
+ If you have a model which has several features,
2885
+ you can't do that with millions, but with a handful of features.
2886
+
2887
+ 1:26:34.834 --> 1:26:42.007
2888
+ Then you can combine them with your log-linear
2889
+ model, which allows you to have a variable
2890
+
2891
+ 1:26:42.007 --> 1:26:45.186
2892
+ number of features and easily combine.
2893
+
2894
+ 1:26:45.365 --> 1:26:47.920
2895
+ Yeah, how much can you trust each of these
2896
+ models?
2897
+
2898
+ 1:26:51.091 --> 1:26:54.584
2899
+ Do you have any further questions for this
2900
+ topic?
2901
+
2902
+ 1:26:58.378 --> 1:27:08.715
2903
+ And there will be on Tuesday a lecture by
2904
+ Tuan about evaluation, and then next Thursday
2905
+
2906
+ 1:27:08.715 --> 1:27:12.710
2907
+ there will be the practical part.
2908
+
2909
+ 1:27:12.993 --> 1:27:21.461
2910
+ So please bring the practical pot here, but
2911
+ you can do something yourself if you are not
2912
+
2913
+ 1:27:21.461 --> 1:27:22.317
2914
+ able to.
2915
+
2916
+ 1:27:23.503 --> 1:27:26.848
2917
+ So then please tell us and we'll have to see
2918
+ how we find the difference in this.
2919
+
demo_data/lectures/Lecture-04-27.04.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8786f0bc34cf397879e95757fe367887c5f5d01d0f388aa98f768203cccc5269
3
+ size 116390723
demo_data/lectures/Lecture-05-02.05.2023/English.vtt ADDED
@@ -0,0 +1,1124 @@
1
+ WEBVTT
2
+
3
+ 0:00:56.957 --> 0:01:10.166
4
+ Today we are going to talk about evaluation,
5
+ like how you can tell how good your translation is.
6
+
7
+ 0:01:11.251 --> 0:01:23.175
8
+ Today we're going to talk about first some
9
+ introduction about the difficulties and also
10
+
11
+ 0:01:23.175 --> 0:01:27.783
12
+ the dimensions of the evaluation.
13
+
14
+ 0:01:28.248 --> 0:01:32.315
15
+ And the second one is on automatic evaluation.
16
+
17
+ 0:01:32.315 --> 0:01:33.960
18
+ The second one is.
19
+
20
+ 0:01:33.893 --> 0:01:40.952
21
+ Would be less human effort costly, but it
22
+ probably is not really as perfect.
23
+
24
+ 0:01:42.702 --> 0:02:01.262
25
+ So on machine translation evaluation, so the
26
+ goal is to measure the quality of translation.
27
+
28
+ 0:02:03.003 --> 0:02:06.949
29
+ We need machine translation evaluation.
30
+
31
+ 0:02:06.949 --> 0:02:14.152
32
+ The first thing is for application scenarios
33
+ and whether it is reliable.
34
+
35
+ 0:02:14.674 --> 0:02:22.911
36
+ Second thing is to guide our research because
37
+ given symmetrics we will be able to find out
38
+
39
+ 0:02:22.911 --> 0:02:30.875
40
+ which improvement direction is valuable for
41
+ our machine translation system and the last
42
+
43
+ 0:02:30.875 --> 0:02:34.224
44
+ thing is for our system development.
45
+
46
+ 0:02:36.116 --> 0:02:42.926
47
+ So now we will come to some difficulties on
48
+ evaluation.
49
+
50
+ 0:02:42.926 --> 0:02:50.952
51
+ The first thing is ambiguity because usually
52
+ for one sentence it.
53
+
54
+ 0:02:51.431 --> 0:03:04.031
55
+ Here you can see that, for example, we have
56
+ the correct reference.
57
+
58
+ 0:03:05.325 --> 0:03:19.124
59
+ The second difficulty is that small changes
60
+ can be very important.
61
+
62
+ 0:03:20.060 --> 0:03:22.531
63
+ The first difficulty is subjective.
64
+
65
+ 0:03:23.123 --> 0:03:39.266
66
+ So it depends on each person's opinion whether
67
+ translation is correct.
68
+
69
+ 0:03:41.041 --> 0:03:49.393
70
+ The last is that evaluation sometimes is application
71
+ dependent.
72
+
73
+ 0:03:49.393 --> 0:03:54.745
74
+ We're not sure how good it's getting up.
75
+
76
+ 0:03:57.437 --> 0:04:04.502
77
+ The first dimension is human versus automatic
78
+ evaluation, which I definitely talked about
79
+
80
+ 0:04:04.502 --> 0:04:06.151
81
+ in the introduction.
82
+
83
+ 0:04:06.151 --> 0:04:13.373
84
+ The second thing is on granularity, so evaluation
85
+ could be on sentence level, document level,
86
+
87
+ 0:04:13.373 --> 0:04:14.472
88
+ or task base.
89
+
90
+ 0:04:15.375 --> 0:04:28.622
91
+ The last thing is whether the translation
92
+ is correct in order to capture the meaning.
93
+
94
+ 0:04:30.630 --> 0:04:33.769
95
+ So on the first dimensions, human verses are
96
+ automatic.
97
+
98
+ 0:04:34.334 --> 0:04:45.069
99
+ So human evaluation is the gold
100
+ standard, because in the end we give our machine
101
+
102
+ 0:04:45.069 --> 0:04:48.647
103
+ translation system to people.
104
+
105
+ 0:04:49.329 --> 0:04:55.040
106
+ And is also expensive and time consuming for
107
+ people to manually evaluate some systems.
108
+
109
+ 0:04:57.057 --> 0:05:05.575
110
+ For automatic evaluation, it is of course
111
+ tupper and faster, and it would use human reference.
112
+
113
+ 0:05:08.168 --> 0:05:16.971
114
+ The next dimension is on granularity.
115
+
116
+ 0:05:16.971 --> 0:05:25.529
117
+ The first level is sentence based.
118
+
119
+ 0:05:25.885 --> 0:05:33.003
120
+ But this is difficult because if you translate
121
+ a single sentence, it will be difficult to
122
+
123
+ 0:05:33.003 --> 0:05:35.454
124
+ tell whether this translation.
125
+
126
+ 0:05:37.537 --> 0:05:40.633
127
+ The second level is document based.
128
+
129
+ 0:05:40.633 --> 0:05:46.051
130
+ This should be the most commonly used in automatic
131
+ evaluation.
132
+
133
+ 0:05:46.286 --> 0:06:00.750
134
+ This should be like the final bowl of our
135
+ machine translation.
136
+
137
+ 0:06:01.061 --> 0:06:02.315
138
+ And slow in general.
139
+
140
+ 0:06:02.315 --> 0:06:07.753
141
+ We are not sure whether the errors come from
142
+ the machine translation system itself or some
143
+
144
+ 0:06:07.753 --> 0:06:08.828
145
+ other components.
146
+
147
+ 0:06:11.431 --> 0:06:21.300
148
+ The next dimension is adequacy versus
149
+ fluency, so adequacy is whether the meaning is translated correctly.
150
+
151
+ 0:06:22.642 --> 0:06:25.384
152
+ Can see the example here.
153
+
154
+ 0:06:25.384 --> 0:06:32.237
155
+ In hypothesis different is everything now,
156
+ so basically it just.
157
+
158
+ 0:06:32.852 --> 0:06:36.520
159
+ But then you can see it's not fluent.
160
+
161
+ 0:06:36.520 --> 0:06:38.933
162
+ It sounds kind of weird.
163
+
164
+ 0:06:38.933 --> 0:06:41.442
165
+ Nothing is different now.
166
+
167
+ 0:06:41.442 --> 0:06:43.179
168
+ It sounds fluent.
169
+
170
+ 0:06:46.006 --> 0:06:50.650
171
+ Next we come to error analysis.
172
+
173
+ 0:06:50.650 --> 0:07:02.407
174
+ When we value the system and give a score
175
+ we want to have interpretable results.
176
+
177
+ 0:07:03.083 --> 0:07:07.930
178
+ So usually there would be some tetsus first
179
+ in order to detect these errors.
180
+
181
+ 0:07:08.448 --> 0:07:21.077
182
+ And usually they would be like quite specific
183
+ to some specific type of arrow, for example
184
+
185
+ 0:07:21.077 --> 0:07:23.743
186
+ wrong translation.
187
+
188
+ 0:07:24.344 --> 0:07:32.127
189
+ All morphological agreements in whether the
190
+ world form is correct.
191
+
192
+ 0:07:32.127 --> 0:07:35.031
193
+ If you have the article.
194
+
195
+ 0:07:37.577 --> 0:07:45.904
196
+ So now we come to human evaluation, which
197
+ is the final goal of machine translation.
198
+
199
+ 0:07:47.287 --> 0:07:50.287
200
+ So why do we perform human evaluation?
201
+
202
+ 0:07:51.011 --> 0:08:00.115
203
+ The first thing is that automatic machine
204
+ translation magic is not sufficient.
205
+
206
+ 0:08:00.480 --> 0:08:06.725
207
+ Existing automated metrics and are sometimes
208
+ biased.
209
+
210
+ 0:08:06.725 --> 0:08:16.033
211
+ For example, the blue spar, but the blue scar
212
+ will usually try to look at the.
213
+
214
+ 0:08:16.496 --> 0:08:24.018
215
+ So it doesn't take into account some deeper
216
+ meaning like cares about word-to-word matching
217
+
218
+ 0:08:24.018 --> 0:08:26.829
219
+ instead of rephrasing or synonym.
220
+
221
+ 0:08:27.587 --> 0:08:34.881
222
+ And bias, as in that metrics like that would
223
+ usually depend a lot on the gold standard reference
224
+
225
+ 0:08:34.881 --> 0:08:41.948
226
+ given from some human, and that person could
227
+ have some specific type or language preferences,
228
+
229
+ 0:08:41.948 --> 0:08:43.979
230
+ and then the metric would.
231
+
232
+ 0:08:47.147 --> 0:08:55.422
233
+ The next thing is that automatic metrics don't
234
+ provide sufficient insights for error analysis.
235
+
236
+ 0:08:57.317 --> 0:09:04.096
237
+ Different types of errors would have different
238
+ implications depending on the underlying task.
239
+
240
+ 0:09:04.644 --> 0:09:09.895
241
+ So, for example, if you use machine translation
242
+ for information with you both,.
243
+
244
+ 0:09:10.470 --> 0:09:20.202
245
+ Then if it makes some error omitting some
246
+ words in translation then it would be very
247
+
248
+ 0:09:20.202 --> 0:09:20.775
249
+ bad.
250
+
251
+ 0:09:21.321 --> 0:09:30.305
252
+ Another example is if you use machine translation
253
+ in chat pop then fluency would be very important
254
+
255
+ 0:09:30.305 --> 0:09:50.253
256
+ because: And we also need human measure in
257
+ order to develop and assess automatic translation
258
+
259
+ 0:09:50.253 --> 0:09:52.324
260
+ evaluation.
261
+
262
+ 0:09:55.455 --> 0:10:01.872
263
+ Okay, so now we will come to the quality measures
264
+ of human evaluation.
265
+
266
+ 0:10:02.402 --> 0:10:05.165
267
+ The first thing is inter-annotator agreement.
268
+
269
+ 0:10:05.825 --> 0:10:25.985
270
+ This is agreement between different annotators.
271
+
272
+ 0:10:26.126 --> 0:10:31.496
273
+ So as you can see here, this would measure
274
+ the reliability of the other features.
275
+
276
+ 0:10:32.252 --> 0:10:49.440
277
+ And here we have an example where the kappa
278
+ score here is.
279
+
280
+ 0:10:49.849 --> 0:10:57.700
281
+ And this is in contrast to intra-annotator
282
+ agreement, so this is agreement within an annotator.
283
+
284
+ 0:10:58.118 --> 0:11:03.950
285
+ So instead of measuring reliability, here
286
+ it measures consistency of a single annotator.
287
+
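Cohen's kappa is a common way to quantify this kind of agreement (the example score mentioned in the lecture is presumably of this type). A small self-contained sketch, with an invented toy example:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(labels_a) | set(labels_b)
    )
    return (observed - expected) / (1 - expected)

# Toy example: two annotators rating five translations as good/bad.
a = ["good", "good", "bad", "good", "bad"]
b = ["good", "bad", "bad", "good", "bad"]
print(cohens_kappa(a, b))  # ≈ 0.62
```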
288
+ 0:11:04.884 --> 0:11:07.027
289
+ And yep.
290
+
291
+ 0:11:07.027 --> 0:11:22.260
292
+ We also have an example here of the which
293
+ is so which is quite.
294
+
295
+ 0:11:23.263 --> 0:11:42.120
296
+ So now we will come to the main types of human
297
+ assessment: The first thing is direct assessment.
298
+
299
+ 0:11:42.842 --> 0:11:53.826
300
+ The second thing is human ranking of the translation
301
+ at sentence level.
302
+
303
+ 0:11:56.176 --> 0:12:11.087
304
+ So direct assessment given the source and
305
+ translation, and possibly the reference translation.
306
+
307
+ 0:12:12.612 --> 0:12:18.023
308
+ The goal here is to give the scores to evaluate
309
+ performance,adequacy and fluency.
310
+
311
+ 0:12:18.598 --> 0:12:23.619
312
+ The problem here is that we need normalization
313
+ across different judges, different human.
314
+
315
+ 0:12:24.604 --> 0:12:27.043
316
+ And here we have an example.
317
+
318
+ 0:12:27.043 --> 0:12:33.517
319
+ She was treated at the site by an emergency
320
+ doctor and taken to hospital by.
321
+
322
+ 0:12:34.334 --> 0:12:48.444
323
+ The hypothesis here is that she was treated
324
+ on site and emergency medical rescue workers
325
+
326
+ 0:12:48.444 --> 0:12:52.090
327
+ brought to a hospital.
328
+
329
+ 0:12:52.472 --> 0:12:56.267
330
+ Lesson five is best in one sport.
331
+
332
+ 0:13:00.060 --> 0:13:04.716
333
+ I don't think it's hard because I think there
334
+ should be broad threat to a hospital right.
335
+
336
+ 0:13:05.905 --> 0:13:09.553
337
+ Yes, that is like a crucial error.
338
+
339
+ 0:13:09.553 --> 0:13:19.558
340
+ Yeah, I think I would agree because this sentence
341
+ somehow gives us the idea of what the meaning
342
+
343
+ 0:13:19.558 --> 0:13:21.642
344
+ of the sentence is.
345
+
346
+ 0:13:21.642 --> 0:13:24.768
347
+ But then it lost towards her.
348
+
349
+ 0:13:27.027 --> 0:13:29.298
350
+ The next time of human evaluation is ranking.
351
+
352
+ 0:13:30.810 --> 0:13:38.893
353
+ Which is a great different system according
354
+ to performance like which one is better.
355
+
356
+ 0:13:40.981 --> 0:13:43.914
357
+ So here now we have a second hypothesis.
358
+
359
+ 0:13:43.914 --> 0:13:49.280
360
+ She was hospitalized on the spot and taken
361
+ to hospital by ambulance crews.
362
+
363
+ 0:13:50.630 --> 0:14:01.608
364
+ As you can see here, the second hypothesis
365
+ seems to be more fluent, more smooth.
366
+
367
+ 0:14:01.608 --> 0:14:09.096
368
+ The meaning capture seems to be: So yeah,
369
+ it's difficult to compare different errors
370
+
371
+ 0:14:09.096 --> 0:14:11.143
372
+ in whether which error is more severe.
373
+
374
+ 0:14:13.373 --> 0:14:16.068
375
+ The next type of human evaluation is post
376
+ editing.
377
+
378
+ 0:14:17.817 --> 0:14:29.483
379
+ So we want to measure how much time and effort
380
+ human needs to spend in order to turn it into
381
+
382
+ 0:14:29.483 --> 0:14:32.117
383
+ correct translation.
384
+
385
+ 0:14:32.993 --> 0:14:47.905
386
+ So this effort can be measured by time or
387
+ keystrokes.
388
+
389
+ 0:14:49.649 --> 0:14:52.889
390
+ And the last one is task based evaluation.
391
+
392
+ 0:14:52.889 --> 0:14:56.806
393
+ Here we would want to evaluate the complete
394
+ system.
395
+
396
+ 0:14:56.806 --> 0:15:03.436
397
+ But if you are using the lecture translator
398
+ and you see my lecture in German, the final
399
+
400
+ 0:15:03.436 --> 0:15:05.772
401
+ evaluation here would be like.
402
+
403
+ 0:15:05.772 --> 0:15:08.183
404
+ In the end, can you understand?
405
+
406
+ 0:15:09.769 --> 0:15:15.301
407
+ Their friendship here that we get the overall
408
+ performance, which is our final goal.
409
+
410
+ 0:15:16.816 --> 0:15:25.850
411
+ But the disadvantage here that it could be
412
+ complex and again if the spur is low it might
413
+
414
+ 0:15:25.850 --> 0:15:31.432
415
+ be other problems than the machine translation
416
+ itself.
417
+
418
+ 0:15:33.613 --> 0:15:42.941
419
+ So guess that was about the human evaluation
420
+ part any question so far.
421
+
422
+ 0:15:42.941 --> 0:15:44.255
423
+ Yes, and.
424
+
425
+ 0:16:00.000 --> 0:16:15.655
426
+ Then we will come to our magic matrix here
427
+ to access the quality of the machine translation
428
+
429
+ 0:16:15.655 --> 0:16:26.179
430
+ system by comparing: So the premise here is
431
+ that the more similar translation is to reference,
432
+
433
+ 0:16:26.179 --> 0:16:31.437
434
+ the better and we want some algorithms that
435
+ can approximate.
436
+
437
+ 0:16:34.114 --> 0:16:47.735
438
+ So the most famous measure could be the BLEU
439
+ score, the bilingual evaluation understudy.
440
+
441
+ 0:16:50.930 --> 0:16:56.358
442
+ So if we are given the goal that the more
443
+ similar translation is to the reference, the
444
+
445
+ 0:16:56.358 --> 0:17:01.785
446
+ better I think the most naive way would be
447
+ count the number of sentences equal to the
448
+
449
+ 0:17:01.785 --> 0:17:02.472
450
+ reference.
451
+
452
+ 0:17:02.472 --> 0:17:08.211
453
+ But as you can see, this would be very difficult
454
+ because sentence being exactly the same to
455
+
456
+ 0:17:08.211 --> 0:17:10.332
457
+ the reference would be very rare.
458
+
459
+ 0:17:11.831 --> 0:17:24.222
460
+ You can see the example here in the reference
461
+ and machine translation output.
462
+
463
+ 0:17:24.764 --> 0:17:31.930
464
+ So the idea here is that instead of comparing
465
+ the two whole sentences up, we consider the.
466
+
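The n-gram comparison described here can be sketched as follows; the reference string is only an approximation of the lecture's example sentence.

```python
def ngrams(tokens, n):
    """All n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

reference = "back to the future hit premieres thirty years ago".split()
hypothesis = "back to the future".split()

for n in range(1, 5):
    hyp = ngrams(hypothesis, n)
    ref = set(ngrams(reference, n))
    matches = sum(1 for gram in hyp if gram in ref)
    # Precision for order n: matching n-grams over n-grams in the hypothesis.
    print(n, matches, "/", len(hyp))
```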
467
+ 0:17:35.255 --> 0:17:43.333
468
+ Now we can look at an example, so for the
469
+ blow score we consider one to three four grams.
470
+
471
+ 0:17:44.844 --> 0:17:52.611
472
+ The one ramp of a lap we would have back to
473
+ the future, not at premieres thirty years ago,
474
+
475
+ 0:17:52.611 --> 0:17:59.524
476
+ so it should be like one, two, three, four,
477
+ five, six, seven, eight, so like it.
478
+
479
+ 0:17:59.459 --> 0:18:01.476
480
+ One ram is overlap to the reverence.
481
+
482
+ 0:18:01.921 --> 0:18:03.366
483
+ So you should be over.
484
+
485
+ 0:18:06.666 --> 0:18:08.994
486
+ Is kind of the same.
487
+
488
+ 0:18:08.994 --> 0:18:18.529
489
+ Instead of considering only the word back
490
+ for three, one is to be back to the future.
491
+
492
+ 0:18:19.439 --> 0:18:31.360
493
+ So that is basically the idea of the blue
494
+ score, and in the end we calculate the geometric.
495
+
496
+ 0:18:32.812 --> 0:18:39.745
497
+ So as you can see here, when we look at the
498
+ A brand overlap you can only look at the machine
499
+
500
+ 0:18:39.745 --> 0:18:40.715
501
+ translation.
502
+
503
+ 0:18:41.041 --> 0:18:55.181
504
+ We only care about how many words in the machine
505
+ translation output appear.
506
+
507
+ 0:18:55.455 --> 0:19:02.370
508
+ So this metric is kind of like a precision
509
+ based and not really recall based.
510
+
511
+ 0:19:04.224 --> 0:19:08.112
512
+ So this would lead to a problem like the example
513
+ here.
514
+
515
+ 0:19:08.112 --> 0:19:14.828
516
+ The reference is back to the future of Premier
517
+ 30 years ago and the machine translation output
518
+
519
+ 0:19:14.828 --> 0:19:16.807
520
+ is only back to the future.
521
+
522
+ 0:19:17.557 --> 0:19:28.722
523
+ The one-gram overlap will be one because
524
+ you can see back to the future is overlap entirely
525
+
526
+ 0:19:28.722 --> 0:19:30.367
527
+ in reference.
528
+
529
+ 0:19:31.231 --> 0:19:38.314
530
+ Is not right because one is the perfect score,
531
+ but this is obviously not a good translation.
532
+
533
+ 0:19:40.120 --> 0:19:47.160
534
+ So in order to tackle this they use something
535
+ called the brevity penalty.
536
+
537
+ 0:19:47.988 --> 0:19:59.910
538
+ So it should be a factor that is multiplied
539
+ to the geometric nymph.
540
+
541
+ 0:19:59.910 --> 0:20:04.820
542
+ This form is the length of.
543
+
544
+ 0:20:05.525 --> 0:20:19.901
545
+ So the penalty over or overseas to the power
546
+ of the length of this river over.
547
+
548
+ 0:20:21.321 --> 0:20:32.298
549
+ Which is lower than, and if we apply this
550
+ to the example, the BLEU score is going to be
551
+
552
+ 0:20:32.298 --> 0:20:36.462
553
+ which is not a good translation.
554
+
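For reference, BLEU's brevity penalty can be written down directly; the token counts used in the example call are illustrative.

```python
import math

def brevity_penalty(hyp_len, ref_len):
    """BLEU brevity penalty: 1 if the output is at least as long as the
    reference, otherwise exp(1 - ref_len / hyp_len)."""
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

# A 4-token output scored against a 9-token reference.
print(brevity_penalty(4, 9))  # ≈ 0.29, so the inflated precision is scaled down
```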
555
+ 0:20:38.999 --> 0:20:42.152
556
+ Yep so any question of this place.
557
+
558
+ 0:20:44.064 --> 0:21:00.947
559
+ Yes exactly that should be a problem as well,
560
+ and it will be mentioned later on.
561
+
562
+ 0:21:00.947 --> 0:21:01.990
563
+ But.
564
+
565
+ 0:21:03.203 --> 0:21:08.239
566
+ Is very sensitive to zero score like that,
567
+ so that is why we usually don't use the blue
568
+
569
+ 0:21:08.239 --> 0:21:13.103
570
+ score sentence level because sentence can be
571
+ short and then there can be no overlap.
572
+
573
+ 0:21:13.103 --> 0:21:16.709
574
+ That is why we usually use it on documents
575
+ as you can imagine.
576
+
577
+ 0:21:16.709 --> 0:21:20.657
578
+ Documents are very long and very little chance
579
+ to have zero overlap.
580
+
581
+ 0:21:23.363 --> 0:21:28.531
582
+ Yeah okay, so the next thing on the BLEU
583
+ score is clipping.
584
+
585
+ 0:21:29.809 --> 0:21:42.925
586
+ So you can see here we have two references,
587
+ the new movie and the new film, and we have
588
+
589
+ 0:21:42.925 --> 0:21:47.396
590
+ a machine translation output.
591
+
592
+ 0:21:47.807 --> 0:21:54.735
593
+ Because the here is also in the reference,
594
+ so yeah two or two books is one, which is:
595
+
596
+ 0:21:56.236 --> 0:22:02.085
597
+ So but then this is not what we want because
598
+ this is just repeating something that appears.
599
+
600
+ 0:22:02.702 --> 0:22:06.058
601
+ So that's why we use clipping.
602
+
603
+ 0:22:06.058 --> 0:22:15.368
604
+ Clipping here is that we consider the max
605
+ counts in any reference, so as you can see
606
+
607
+ 0:22:15.368 --> 0:22:17.425
608
+ here in reference.
609
+
610
+ 0:22:18.098 --> 0:22:28.833
611
+ So here when we do clipping we will just use
612
+ the maximum counts in the references.
613
+
614
+ 0:22:29.809 --> 0:22:38.717
615
+ Yeah, just to avoid avoid overlapping repetitive
616
+ words in the translation.
617
+
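A sketch of the clipped (modified) n-gram precision just described, using the "the new movie" / "the new film" example:

```python
from collections import Counter

def clipped_precision(hypothesis, references, n=1):
    """Modified n-gram precision: each hypothesis n-gram is credited at most
    as often as it appears in any single reference (clipping)."""
    def counts(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    hyp_counts = counts(hypothesis)
    max_ref_counts = Counter()
    for ref in references:
        for gram, c in counts(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], c)

    clipped = sum(min(c, max_ref_counts[gram]) for gram, c in hyp_counts.items())
    return clipped / max(1, sum(hyp_counts.values()))

refs = ["the new movie".split(), "the new film".split()]
hyp = "the the the the".split()
print(clipped_precision(hyp, refs))  # 1/4 instead of 4/4, because "the" is clipped to 1
```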
618
+ 0:22:41.641 --> 0:23:00.599
619
+ It could happen that there is no overlap between
620
+ the machine translation output and reference.
621
+
622
+ 0:23:00.500 --> 0:23:01.917
623
+ Then Everything Is Going To Go To Zero.
624
+
625
+ 0:23:02.402 --> 0:23:07.876
626
+ So that's why for the BLEU score we usually use
627
+ a corpus-level score where we aggregate the
628
+
629
+ 0:23:07.876 --> 0:23:08.631
630
+ statistics.
631
+
632
+ 0:23:12.092 --> 0:23:18.589
633
+ Some summary about BLEU: as you can see,
634
+ it matches exact words.
635
+
636
+ 0:23:18.589 --> 0:23:31.751
637
+ It can take several references: it measures
638
+ adequacy by the word precision, and it measures
639
+
640
+ 0:23:31.751 --> 0:23:36.656
641
+ the fluency by the n-gram precision.
642
+
643
+ 0:23:37.437 --> 0:23:47.254
644
+ And as mentioned, it doesn't consider how
645
+ much meaning that is captured in the machine
646
+
647
+ 0:23:47.254 --> 0:23:48.721
648
+ translation.
649
+
650
+ 0:23:49.589 --> 0:23:53.538
651
+ So here they use the brevity penalty to prevent
652
+ short sentences.
653
+
654
+ 0:23:54.654 --> 0:24:04.395
655
+ Will get the spot over the last test set to
656
+ avoid the zero issues.
657
+
658
+ 0:24:04.395 --> 0:24:07.012
659
+ As we mentioned,.
660
+
661
+ 0:24:09.829 --> 0:24:22.387
662
+ Yes, that's mentioned with multiple reference
663
+ translation simultaneously, and it's a precision
664
+
665
+ 0:24:22.387 --> 0:24:24.238
666
+ based matrix.
667
+
668
+ 0:24:24.238 --> 0:24:27.939
669
+ So we are not sure if this.
670
+
671
+ 0:24:29.689 --> 0:24:37.423
672
+ The second thing is that blows calls common
673
+ safe for recall by routine penalty, and we
674
+
675
+ 0:24:37.423 --> 0:24:38.667
676
+ are not sure.
677
+
678
+ 0:24:39.659 --> 0:24:50.902
679
+ Matches, so can still improve the similarity
680
+ measure and improve the correlation score to
681
+
682
+ 0:24:50.902 --> 0:24:51.776
683
+ human.
684
+
685
+ 0:24:52.832 --> 0:25:01.673
686
+ The next is that all work will have the same
687
+ importance.
688
+
689
+ 0:25:01.673 --> 0:25:07.101
690
+ What if a scheme for wedding work?
691
+
692
+ 0:25:11.571 --> 0:25:26.862
693
+ And the last witness is that blows for high
694
+ grade order engrams that can confluency dramatically.
695
+
696
+ 0:25:27.547 --> 0:25:32.101
697
+ So the pressure is that can be accounted for
698
+ fluency, and grammatically there's some other.
699
+
700
+ 0:25:35.956 --> 0:25:47.257
701
+ We have some further issues and not created
702
+ equally so we can use stemming or knowledge
703
+
704
+ 0:25:47.257 --> 0:25:48.156
705
+ space.
706
+
707
+ 0:25:50.730 --> 0:26:00.576
708
+ The next way we incorporate information is
709
+ within the metrics.
710
+
711
+ 0:26:01.101 --> 0:26:07.101
712
+ And can be used like a stop list to like somehow
713
+ ignore the non-important words.
714
+
715
+ 0:26:08.688 --> 0:26:12.687
716
+ Text normalization spelling conjugation lower
717
+ case and mix case.
718
+
719
+ 0:26:12.687 --> 0:26:18.592
720
+ The next thing is that for some language like
721
+ Chinese there can be different world segmentation
722
+
723
+ 0:26:18.592 --> 0:26:23.944
724
+ so exact word matching might no longer be a
725
+ good idea so maybe it's ready to cover the
726
+
727
+ 0:26:23.944 --> 0:26:27.388
728
+ score as the character level instead of the
729
+ word level.
730
+
731
+ 0:26:29.209 --> 0:26:33.794
732
+ And the last thing is speech translation.
733
+
734
+ 0:26:33.794 --> 0:26:38.707
735
+ Usually input from speech translation would.
736
+
737
+ 0:26:38.979 --> 0:26:51.399
738
+ And there should be some way to segment into
739
+ sentences so that we can calculate the score
740
+
741
+ 0:26:51.399 --> 0:26:52.090
742
+ and.
743
+
744
+ 0:26:52.953 --> 0:27:01.326
745
+ And the way to soften is to use some tools
746
+ like enware segmentation to align the output
747
+
748
+ 0:27:01.326 --> 0:27:01.896
749
+ with.
750
+
751
+ 0:27:06.306 --> 0:27:10.274
752
+ Yes, so guess that was all about the blow
753
+ score any question.
754
+
755
+ 0:27:14.274 --> 0:27:28.292
756
+ Again on automatic metrics we'll talk about
757
+ probably good metrics, strange automatic metrics,
758
+
759
+ 0:27:28.292 --> 0:27:32.021
760
+ use cases on evaluation.
761
+
762
+ 0:27:34.374 --> 0:27:44.763
763
+ How to measure the performance of the matrix,
764
+ so a good matrix would be a.
765
+
766
+ 0:27:49.949 --> 0:28:04.905
767
+ We would want the matrix to be interpretable
768
+ if this is the ranking from a human that somehow
769
+
770
+ 0:28:04.905 --> 0:28:08.247
771
+ can rank the system.
772
+
773
+ 0:28:12.132 --> 0:28:15.819
774
+ We would also want the evaluation metric to
775
+ be sensitive.
776
+
777
+ 0:28:15.819 --> 0:28:21.732
778
+ Like small differences in the machine translation
779
+ can be distinguished, we would not need to
780
+
781
+ 0:28:21.732 --> 0:28:22.686
782
+ be consistent.
783
+
784
+ 0:28:22.686 --> 0:28:28.472
785
+ Like if the same machine translation system
786
+ is used on a similar text, it should reproduce
787
+
788
+ 0:28:28.472 --> 0:28:29.553
789
+ a similar score.
790
+
791
+ 0:28:31.972 --> 0:28:40.050
792
+ Next, we would want the machine translation
793
+ system to be reliable.
794
+
795
+ 0:28:40.050 --> 0:28:42.583
796
+ Machine translation.
797
+
798
+ 0:28:43.223 --> 0:28:52.143
799
+ We want the matrix to be easy to run in general
800
+ and can be applied to multiple different machine.
801
+
802
+ 0:28:55.035 --> 0:29:11.148
803
+ The difficulty of evaluating the metric itself
804
+ is kind of similar to when you evaluate the
805
+
806
+ 0:29:11.148 --> 0:29:13.450
807
+ translation.
808
+
809
+ 0:29:18.638 --> 0:29:23.813
810
+ And here is some components of the automatic
811
+ machine translation matrix.
812
+
813
+ 0:29:23.813 --> 0:29:28.420
814
+ So for the matching matrix the component would
815
+ be the precision.
816
+
817
+ 0:29:28.420 --> 0:29:30.689
818
+ Recall, or Levenshtein distance.
819
+
820
+ 0:29:30.689 --> 0:29:35.225
821
+ So for the BLEU score you have seen it cares
822
+ mostly about the.
823
+
824
+ 0:29:36.396 --> 0:29:45.613
825
+ And on the features it would be about how
826
+ to measure the matches or character based.
827
+
828
+ 0:29:48.588 --> 0:30:01.304
829
+ Now we will talk about more matrix because
830
+ the blue score is the most common.
831
+
832
+ 0:30:02.082 --> 0:30:10.863
833
+ So it compared the reference and hypothesis
834
+ using edit operations.
835
+
836
+ 0:30:10.863 --> 0:30:14.925
837
+ They count how many insertion.
838
+
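The edit-operation comparison described here is a word-level Levenshtein distance; a compact dynamic-programming sketch, with invented example sentences:

```python
def edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance: minimum number of insertions,
    deletions and substitutions to turn the hypothesis into the reference."""
    m, n = len(reference), len(hypothesis)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution or match
    return dist[m][n]

ref = "she was taken to hospital by ambulance".split()
hyp = "she was brought to a hospital".split()
print(edit_distance(ref, hyp))
```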
839
+ 0:30:23.143 --> 0:30:31.968
840
+ We already talked about it beyond what matching
841
+ would care about character based mathematization
842
+
843
+ 0:30:31.968 --> 0:30:34.425
844
+ or linguistic information.
845
+
846
+ 0:30:36.636 --> 0:30:41.502
847
+ The next metric is the METEOR metric.
848
+
849
+ 0:30:41.502 --> 0:30:50.978
850
+ This is strong called metric for evaluation
851
+ of translation with explicit.
852
+
853
+ 0:30:51.331 --> 0:31:03.236
854
+ So METEOR's new idea is that they reintroduce
855
+ recall and combine it with precision as score
856
+
857
+ 0:31:03.236 --> 0:31:04.772
858
+ components.
859
+
860
+ 0:31:05.986 --> 0:31:16.700
861
+ The language translation output with each
862
+ reference individually and takes part of the
863
+
864
+ 0:31:16.700 --> 0:31:18.301
865
+ best parent.
866
+
867
+ 0:31:20.940 --> 0:31:27.330
868
+ The next thing is that matching takes into
869
+ account inflection variation by stemming, so it's
870
+
871
+ 0:31:27.330 --> 0:31:28.119
872
+ no longer.
873
+
874
+ 0:31:30.230 --> 0:31:40.165
875
+ When they address fluency, they're a direct
876
+ penalty instead of ink arms so they would care
877
+
878
+ 0:31:40.165 --> 0:31:40.929
879
+ about.
880
+
881
+ 0:31:45.925 --> 0:31:56.287
882
+ The next thing is on trainable metrics, so
883
+ for this metric we want to extract some features.
884
+
885
+ 0:31:56.936 --> 0:32:04.450
886
+ So for example here the nice house is on the
887
+ right and the building is on the right side
888
+
889
+ 0:32:04.450 --> 0:32:12.216
890
+ so we will have to extract some pictures like
891
+ for example here the reference and hypothesis
892
+
893
+ 0:32:12.216 --> 0:32:14.158
894
+ have hypers in common.
895
+
896
+ 0:32:14.714 --> 0:32:19.163
897
+ They have one insertion, two deletions, and
898
+ they have the same verb.
899
+
900
+ 0:32:21.141 --> 0:32:31.530
901
+ So the idea is to use machine translation
902
+ techniques to combine features and this machine
903
+
904
+ 0:32:31.530 --> 0:32:37.532
905
+ translation model will be trained on human
906
+ ranking.
907
+
908
+ 0:32:39.819 --> 0:32:44.788
909
+ A common framework for this is COMET.
910
+
911
+ 0:32:44.684 --> 0:32:48.094
912
+ Which is a narrow model that is used with
913
+ X for.
914
+
915
+ 0:32:48.094 --> 0:32:54.149
916
+ The features would be created using some pretrained
917
+ model like XLM-RoBERTa.
918
+
919
+ 0:32:54.149 --> 0:33:00.622
920
+ Here the input would be the source, the reference
921
+ and the hypothesis and then they would try
922
+
923
+ 0:33:00.622 --> 0:33:02.431
924
+ to produce an assessment.
925
+
926
+ 0:33:03.583 --> 0:33:05.428
927
+ Yeah, it's strange to predict human sport.
928
+
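A rough sketch of this trainable-metric idea: embed the source, hypothesis and reference with a pretrained encoder and regress onto human judgements. The class, encoder and layer sizes below are placeholders and not the actual COMET implementation.

```python
import torch
import torch.nn as nn

class TrainableMetric(nn.Module):
    def __init__(self, encoder, hidden=768):
        super().__init__()
        self.encoder = encoder  # assumed to map a batch of sentences to [batch, hidden]
        self.regressor = nn.Sequential(
            nn.Linear(3 * hidden, 256), nn.Tanh(), nn.Linear(256, 1)
        )

    def forward(self, src, hyp, ref):
        # Concatenate the three sentence embeddings and predict a quality score.
        feats = torch.cat(
            [self.encoder(src), self.encoder(hyp), self.encoder(ref)], dim=-1
        )
        return self.regressor(feats)

# Training would minimise e.g. MSE between these predictions and human scores.
```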
929
+ 0:33:06.346 --> 0:33:19.131
930
+ And they also have some additional versions,
931
+ as we train this model in order to tell whether
932
+
933
+ 0:33:19.131 --> 0:33:20.918
934
+ translation.
935
+
936
+ 0:33:21.221 --> 0:33:29.724
937
+ So instead of checking the source and the
938
+ hypothesis as input, they could take only the
939
+
940
+ 0:33:29.724 --> 0:33:38.034
941
+ source and the hypotheses as input and try
942
+ to predict the quality of the translation.
943
+
944
+ 0:33:42.562 --> 0:33:49.836
945
+ So assumptions before machine translation
946
+ systems are often used in larger systems.
947
+
948
+ 0:33:50.430 --> 0:33:57.713
949
+ So the question is how to evaluate the performance
950
+ of the machine translation system in this larger
951
+
952
+ 0:33:57.713 --> 0:34:04.997
953
+ scenario, and an example would be speech translation
954
+ system when you try to translate English audio
955
+
956
+ 0:34:04.997 --> 0:34:05.798
957
+ to German.
958
+
959
+ 0:34:06.506 --> 0:34:13.605
960
+ Then it would usually have two components,
961
+ ASR and MT, where ASR is like speech recognition
962
+
963
+ 0:34:13.605 --> 0:34:20.626
964
+ that can describe English audio to English
965
+ text, and then we have the machine translation
966
+
967
+ 0:34:20.626 --> 0:34:24.682
968
+ system that translates English text to German
969
+ text.
970
+
971
+ 0:34:26.967 --> 0:34:33.339
972
+ So in order to have these overall performances
973
+ in this bigger scenario, they are so willing
974
+
975
+ 0:34:33.339 --> 0:34:34.447
976
+ to evaluate it.
977
+
978
+ 0:34:34.447 --> 0:34:41.236
979
+ So the first one is to evaluate the individual
980
+ components like how good is the speech recognizer,
981
+
982
+ 0:34:41.236 --> 0:34:46.916
983
+ how good is the analyzed and generalization
984
+ engines, how good is the synthesizer.
985
+
986
+ 0:34:47.727 --> 0:34:56.905
987
+ The second way is to evaluate translation
988
+ quality from speech input to text output.
989
+
990
+ 0:34:56.905 --> 0:35:00.729
991
+ How good is the final translation?
992
+
993
+ 0:35:02.102 --> 0:35:10.042
994
+ The next thing is to measure the to evaluate
995
+ the architecture effectiveness like: How is
996
+
997
+ 0:35:10.042 --> 0:35:12.325
998
+ the level effects in general?
999
+
1000
+ 0:35:12.325 --> 0:35:19.252
1001
+ The next one is task based evaluation or use
1002
+ a study like we just simply ask the user what
1003
+
1004
+ 0:35:19.252 --> 0:35:24.960
1005
+ is their experience like whether the system
1006
+ works well and how well it is.
1007
+
1008
+ 0:35:27.267 --> 0:35:32.646
1009
+ So here we have an example of the ITF shale
1010
+ test result.
1011
+
1012
+ 0:35:33.153 --> 0:35:38.911
1013
+ So the first block would be the human evaluation
1014
+ like I think they are asked to give a score
1015
+
1016
+ 0:35:38.911 --> 0:35:44.917
1017
+ from one to five again, where five is best
1018
+ and one is worst, and the lower one is the BLEU score,
1019
+
1020
+ 0:35:44.917 --> 0:35:50.490
1021
+ and they find out that the human evaluation
1022
+ is far actually correlated with the blowsfall
1023
+
1024
+ 0:35:50.490 --> 0:35:51.233
1025
+ quite well.
1026
+
1027
+ 0:35:53.193 --> 0:36:02.743
1028
+ Here you can also see that the systems from
1029
+ our university are actually on top many sub-tasts.
1030
+
1031
+ 0:36:05.605 --> 0:36:07.429
1032
+ So Yeah.
1033
+
1034
+ 0:36:08.868 --> 0:36:14.401
1035
+ For this lecture is that machine translation
1036
+ evaluation is difficult.
1037
+
1038
+ 0:36:14.401 --> 0:36:21.671
1039
+ We talk about human versus automatic evaluation
1040
+ that human would be costly, but then is the
1041
+
1042
+ 0:36:21.671 --> 0:36:27.046
1043
+ goal standard automatic evaluation would be
1044
+ a fast and cheaper way.
1045
+
1046
+ 0:36:27.547 --> 0:36:36.441
1047
+ We talk about granulity on sentence level,
1048
+ document level or task level evaluation machine
1049
+
1050
+ 0:36:36.441 --> 0:36:38.395
1051
+ translation system.
1052
+
1053
+ 0:36:39.679 --> 0:36:51.977
1054
+ And we talked about human evaluation versus
1055
+ automatic metrics in details.
1056
+
1057
+ 0:36:54.034 --> 0:36:59.840
1058
+ So we introduced a lot of metric metrics.
1059
+
1060
+ 0:36:59.840 --> 0:37:10.348
1061
+ How do they compare from the quadrating of
1062
+ human assessment so it's better?
1063
+
1064
+ 0:37:12.052 --> 0:37:16.294
1065
+ I don't have the exact score and reference
1066
+ in my head.
1067
+
1068
+ 0:37:16.294 --> 0:37:22.928
1069
+ I would assume that METEOR should have
1070
+ a better correlation because here they also
1071
+
1072
+ 0:37:22.928 --> 0:37:30.025
1073
+ consider other aspects like the recall whether
1074
+ the information in the reference is captured
1075
+
1076
+ 0:37:30.025 --> 0:37:31.568
1077
+ in the translation.
1078
+
1079
+ 0:37:32.872 --> 0:37:41.875
1080
+ Like synonyms, so I would assume that mid
1081
+ air is better, but again don't have the reference
1082
+
1083
+ 0:37:41.875 --> 0:37:43.441
1084
+ in my hair, so.
1085
+
1086
+ 0:37:43.903 --> 0:37:49.771
1087
+ But guess the reason people are still using
1088
+ BlueScore is that in most literature, a machine
1089
+
1090
+ 0:37:49.771 --> 0:38:00.823
1091
+ translation system, they report: So now you
1092
+ create a new machine translation system.
1093
+
1094
+ 0:38:00.823 --> 0:38:07.990
1095
+ It might be better to also report the blow.
1096
+
1097
+ 0:38:08.228 --> 0:38:11.472
1098
+ Exactly just slice good, just spread white,
1099
+ and then we're going to go ahead.
1100
+
1101
+ 0:38:12.332 --> 0:38:14.745
1102
+ And don't know what you're doing.
1103
+
1104
+ 0:38:17.457 --> 0:38:18.907
1105
+ I Want to Talk Quickly About.
1106
+
1107
+ 0:38:19.059 --> 0:38:32.902
1108
+ So it is like a language model, so it's kind
1109
+ of the same uses as.
1110
+
1111
+ 0:38:33.053 --> 0:38:39.343
1112
+ So the idea is that we have this layer in
1113
+ order to embed the source and the reference
1114
+
1115
+ 0:38:39.343 --> 0:38:39.713
1116
+ and.
1117
+
1118
+ 0:38:40.000 --> 0:38:54.199
1119
+ Into some feature vectors that we can later
1120
+ on use to predict the human sport in the.
1121
+
1122
+ 0:38:58.618 --> 0:39:00.051
1123
+ It If There's Nothing Else.
1124
+
demo_data/lectures/Lecture-05-02.05.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5014f3570b8db38818ab44ed117dc6d67206c5163b6b87b45df4a2aa426b8222
3
+ size 314238982
demo_data/lectures/Lecture-06-09.05.2023/English.vtt ADDED
@@ -0,0 +1,2970 @@
1
+ WEBVTT
2
+
3
+ 0:00:01.721 --> 0:00:08.584
4
+ Hey, then welcome to today's lecture on language
5
+ modeling.
6
+
7
+ 0:00:09.409 --> 0:00:21.608
8
+ We had a different view on machine translation,
9
+ which was the evaluation part; it's important
10
+
11
+ 0:00:21.608 --> 0:00:24.249
12
+ to evaluate and see.
13
+
14
+ 0:00:24.664 --> 0:00:33.186
15
+ We want to continue with building the MT system
16
+ and this will be the last part before we are
17
+
18
+ 0:00:33.186 --> 0:00:36.668
19
+ going into a neural step on Thursday.
20
+
21
+ 0:00:37.017 --> 0:00:45.478
22
+ So we had the broader view on statistical
23
+ machine translation and the.
24
+
25
+ 0:00:45.385 --> 0:00:52.977
26
+ Thursday: A week ago we talked about the statistical
27
+ machine translation and mainly the translation
28
+
29
+ 0:00:52.977 --> 0:00:59.355
30
+ model, so how we model how probable is it that
31
+ one word is translated into another.
32
+
33
+ 0:01:00.800 --> 0:01:15.583
34
+ However, there is another component when doing
35
+ generation tasks in general and machine translation.
36
+
37
+ 0:01:16.016 --> 0:01:23.797
38
+ There are several characteristics which you
39
+ only need to model on the target side in the
40
+
41
+ 0:01:23.797 --> 0:01:31.754
42
+ traditional approach where we talked about
43
+ the generation from a more semantic or syntactic
44
+
45
+ 0:01:31.754 --> 0:01:34.902
46
+ representation into the real world.
47
+
48
+ 0:01:35.555 --> 0:01:51.013
49
+ And the challenge is that there's some constructs
50
+ which are only there in the target language.
51
+
52
+ 0:01:52.132 --> 0:01:57.908
53
+ You cannot really get that translation, but
54
+ it's more something that needs to model on
55
+
56
+ 0:01:57.908 --> 0:01:58.704
57
+ the target.
58
+
59
+ 0:01:59.359 --> 0:02:05.742
60
+ And this is done typically by a language model
61
+ and this concept of language model.
62
+
63
+ 0:02:06.326 --> 0:02:11.057
64
+ Guess you can assume nowadays very important.
65
+
66
+ 0:02:11.057 --> 0:02:20.416
67
+ You've read a lot about large language models
68
+ recently and they are all somehow trained or
69
+
70
+ 0:02:20.416 --> 0:02:22.164
71
+ the idea behind.
72
+
73
+ 0:02:25.986 --> 0:02:41.802
74
+ What we'll look today at if get the next night
75
+ and look what a language model is and today's
76
+
77
+ 0:02:41.802 --> 0:02:42.992
78
+ focus.
79
+
80
+ 0:02:43.363 --> 0:02:49.188
81
+ This was the common approach to the language
82
+ model for twenty or thirty years, so a lot
83
+
84
+ 0:02:49.188 --> 0:02:52.101
85
+ of time it was really the state of the art.
86
+
87
+ 0:02:52.101 --> 0:02:58.124
88
+ And people have used that in many applications
89
+ in machine translation and automatic speech
90
+
91
+ 0:02:58.124 --> 0:02:58.985
92
+ recognition.
93
+
94
+ 0:02:59.879 --> 0:03:11.607
95
+ Again you are measuring the performance, but
96
+ this is purely the performance of the language
97
+
98
+ 0:03:11.607 --> 0:03:12.499
99
+ model.
100
+
101
+ 0:03:13.033 --> 0:03:23.137
102
+ And then we will see that the traditional
103
+ language model will have a major drawback in how
104
+
105
+ 0:03:23.137 --> 0:03:24.683
106
+ we can deal.
107
+
108
+ 0:03:24.944 --> 0:03:32.422
109
+ So if you model language you will see that
110
+ in most of the sentences and you have not really
111
+
112
+ 0:03:32.422 --> 0:03:39.981
113
+ seen and you're still able to assess if this
114
+ is good language or if this is native language.
115
+
116
+ 0:03:40.620 --> 0:03:45.092
117
+ And this is challenging if you do just like
118
+ parameter estimation.
119
+
120
+ 0:03:45.605 --> 0:03:59.277
121
+ We are using two different techniques to do:
122
+ interpolation, and these are essentially in
123
+
124
+ 0:03:59.277 --> 0:04:01.735
125
+ order to build.
126
+
127
+ 0:04:01.881 --> 0:04:11.941
128
+ It also motivates why things might be easier
129
+ if we are going into neural models, as we will.
130
+
131
+ 0:04:12.312 --> 0:04:18.203
132
+ And at the end we'll talk a bit about some
133
+ additional type of language models which are
134
+
135
+ 0:04:18.203 --> 0:04:18.605
136
+ also.
137
+
138
+ 0:04:20.440 --> 0:04:29.459
139
+ So where our language was used, or how are
140
+ they used in the machine translations?
141
+
142
+ 0:04:30.010 --> 0:04:38.513
143
+ So the idea of a language model is that we
144
+ are modeling what is the fluency of language.
145
+
146
+ 0:04:38.898 --> 0:04:49.381
147
+ So if you have, for example, sentence will,
148
+ then you can estimate that there are some words:
149
+
150
+ 0:04:49.669 --> 0:05:08.929
151
+ For example, the next word is valid, but will
152
+ card's words not?
153
+
154
+ 0:05:09.069 --> 0:05:13.673
155
+ And we can do that.
156
+
157
+ 0:05:13.673 --> 0:05:22.192
158
+ We have seen that the noise channel.
159
+
160
+ 0:05:22.322 --> 0:05:33.991
161
+ That we have seen someone two weeks ago, and
162
+ today we will look into how can we model P
163
+
164
+ 0:05:33.991 --> 0:05:36.909
165
+ of Y or how possible.
166
+
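As a toy illustration of this noisy-channel view, with invented probabilities: the language model P(y) is what lets the system prefer the more fluent target sentence.

```python
import math

# Invented toy tables, not trained models.
translation_model = {  # P(source | target)
    ("das haus ist klein", "the house is small"): 0.6,
    ("das haus ist klein", "the home is small"): 0.6,
}
language_model = {     # P(target)
    "the house is small": 0.010,
    "the home is small": 0.001,
}

def best_translation(source, candidates):
    # argmax over P(source | target) * P(target), computed in log space.
    return max(
        candidates,
        key=lambda y: math.log(translation_model[(source, y)])
                      + math.log(language_model[y]),
    )

print(best_translation("das haus ist klein", list(language_model)))
```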
167
+ 0:05:37.177 --> 0:05:44.192
168
+ Now this is completely independent of the
169
+ translation process.
170
+
171
+ 0:05:44.192 --> 0:05:49.761
172
+ How fluent is a sentence and how you can express?
173
+
174
+ 0:05:51.591 --> 0:06:01.699
175
+ And this language model task has one really
176
+ big advantage and assume that is even the big
177
+
178
+ 0:06:01.699 --> 0:06:02.935
179
+ advantage.
180
+
181
+ 0:06:03.663 --> 0:06:16.345
182
+ The big advantage is the data we need to train
183
+ that so normally we are doing supervised learning.
184
+
185
+ 0:06:16.876 --> 0:06:20.206
186
+ So machine translation will talk about.
187
+
188
+ 0:06:20.206 --> 0:06:24.867
189
+ That means we have the source sentence and target
190
+ sentence.
191
+
192
+ 0:06:25.005 --> 0:06:27.620
193
+ They need to be aligned.
194
+
195
+ 0:06:27.620 --> 0:06:31.386
196
+ We look into how we can model them.
197
+
198
+ 0:06:31.386 --> 0:06:39.270
199
+ Generally, the problem with this is that:
200
+ Machine translation: You still have the advantage
201
+
202
+ 0:06:39.270 --> 0:06:45.697
203
+ that there's quite huge amounts of this data
204
+ for many languages, not all but many, but other
205
+
206
+ 0:06:45.697 --> 0:06:47.701
207
+ classes even more difficult.
208
+
209
+ 0:06:47.701 --> 0:06:50.879
210
+ There's very few data where you have summary.
211
+
212
+ 0:06:51.871 --> 0:07:02.185
213
+ So the big advantage of language model is
214
+ we're only modeling the sentences, so we only
215
+
216
+ 0:07:02.185 --> 0:07:04.103
217
+ need pure text.
218
+
219
+ 0:07:04.584 --> 0:07:11.286
220
+ And pure text, especially since we have the
221
+ Internet face melting large amounts of text.
222
+
223
+ 0:07:11.331 --> 0:07:17.886
224
+ Of course, it's still, it's still maybe only
225
+ for some domains, some type.
226
+
227
+ 0:07:18.198 --> 0:07:23.466
228
+ Want to have data for speech about machine
229
+ translation.
230
+
231
+ 0:07:23.466 --> 0:07:27.040
232
+ Maybe there's only limited data that.
233
+
234
+ 0:07:27.027 --> 0:07:40.030
235
+ There's always and also you go to some more
236
+ exotic languages and then you will have less
237
+
238
+ 0:07:40.030 --> 0:07:40.906
239
+ data.
240
+
241
+ 0:07:41.181 --> 0:07:46.803
242
+ And in language models we can now look, how
243
+ can we make use of these data?
244
+
245
+ 0:07:47.187 --> 0:07:54.326
246
+ And: Nowadays this is often also framed as
247
+ self supervised learning because on the one
248
+
249
+ 0:07:54.326 --> 0:08:00.900
250
+ hand here we'll see it's a type of classification
251
+ task or supervised learning, but we create the
252
+
253
+ 0:08:00.900 --> 0:08:02.730
254
+ labels from the data itself.
255
+
256
+ 0:08:02.742 --> 0:08:13.922
257
+ So it's not that we have this pair of data
258
+ text and labels, but we have only the text.
259
+
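A small sketch of what creating the training data from the text itself can look like for language modeling; the corpus lines are invented.

```python
# From raw text alone we derive (history, next word) training examples,
# with no manually provided labels.
corpus = [
    "the house is small",
    "the plane flies to Berlin",
]

examples = []
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for i in range(1, len(tokens)):
        examples.append((tuple(tokens[:i]), tokens[i]))

for history, nxt in examples[:4]:
    print(history, "->", nxt)
```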
260
+ 0:08:15.515 --> 0:08:21.367
261
+ So the question is how can we use this monolingual
262
+ data and how can we train our language model?
263
+
264
+ 0:08:22.302 --> 0:08:35.086
265
+ The main goal is to produce fluent English,
266
+ so we want to somehow model that something
267
+
268
+ 0:08:35.086 --> 0:08:38.024
269
+ is a sentence of a.
270
+
271
+ 0:08:38.298 --> 0:08:44.897
272
+ So there is no clear separation about semantics
273
+ and syntax, but in this case it is not about
274
+
275
+ 0:08:44.897 --> 0:08:46.317
276
+ a clear separation.
277
+
278
+ 0:08:46.746 --> 0:08:50.751
279
+ So we will model them somehow in there.
280
+
281
+ 0:08:50.751 --> 0:08:56.091
282
+ There will be some notion of semantics, some
283
+ notion of.
284
+
285
+ 0:08:56.076 --> 0:09:08.748
286
+ Because you say you want to model how fluent
287
+ or probable is that the native speaker is producing
288
+
289
+ 0:09:08.748 --> 0:09:12.444
290
+ that because of the one in.
291
+
292
+ 0:09:12.512 --> 0:09:17.711
293
+ We are rarely talking like things that are
294
+ semantically wrong, and therefore there is
295
+
296
+ 0:09:17.711 --> 0:09:18.679
297
+ also some type.
298
+
299
+ 0:09:19.399 --> 0:09:24.048
300
+ So, for example, the house is small.
301
+
302
+ 0:09:24.048 --> 0:09:30.455
303
+ It should have a higher probability than the home
304
+ is small.
305
+
306
+ 0:09:31.251 --> 0:09:38.112
307
+ Because home and house both have the same meaning in German,
308
+ they are used differently.
309
+
310
+ 0:09:38.112 --> 0:09:43.234
311
+ For example, it should be more probable that
312
+ the plane.
313
+
314
+ 0:09:44.444 --> 0:09:51.408
315
+ So this is both syntactically correct, but
316
+ semantically not.
317
+
318
+ 0:09:51.408 --> 0:09:58.372
319
+ But still you will see much more often the
320
+ probability that.
321
+
322
+ 0:10:03.883 --> 0:10:14.315
323
+ So more formally, it's about like the language model
324
+ should be some type of function, and it gives
325
+
326
+ 0:10:14.315 --> 0:10:18.690
327
+ us the probability that this sentence occurs.
328
+
329
+ 0:10:19.519 --> 0:10:27.312
330
+ Indicating that this is good English or more
331
+ generally English, of course you can do that.
332
+
333
+ 0:10:28.448 --> 0:10:37.609
334
+ And in earlier times people have even tried
335
+ to do that deterministically; that was especially
336
+
337
+ 0:10:37.609 --> 0:10:40.903
338
+ used for more dialogue systems.
339
+
340
+ 0:10:40.840 --> 0:10:50.660
341
+ You have a very strict syntax so you can only
342
+ use like turn off the, turn off the radio.
343
+
344
+ 0:10:50.690 --> 0:10:56.928
345
+ Something else, but you have a very strict
346
+ deterministic finite state grammar, like which
347
+
348
+ 0:10:56.928 --> 0:10:58.107
349
+ type of phrases.
350
+
351
+ 0:10:58.218 --> 0:11:04.791
352
+ The problem of course if we're dealing with
353
+ language is that language is variable, we're
354
+
355
+ 0:11:04.791 --> 0:11:10.183
356
+ not always talking correct sentences, and so
357
+ this type of deterministic approach does not really work.
358
+
359
+ 0:11:10.650 --> 0:11:22.121
360
+ That's why for already many, many years people
361
+ look into statistical language models and try
362
+
363
+ 0:11:22.121 --> 0:11:24.587
364
+ to model something.
365
+
366
+ 0:11:24.924 --> 0:11:35.096
367
+ So something like what is the probability
368
+ of a sequence of words, and that is what we model.
369
+
370
+ 0:11:35.495 --> 0:11:43.076
371
+ The advantage of doing it statistically is
372
+ that we can train on large text databases so we
373
+
374
+ 0:11:43.076 --> 0:11:44.454
375
+ can train them.
376
+
377
+ 0:11:44.454 --> 0:11:52.380
378
+ We don't have to define it and most of these
379
+ cases we don't want to have the hard decision.
380
+
381
+ 0:11:52.380 --> 0:11:55.481
382
+ This is a sentence of the language.
383
+
384
+ 0:11:55.815 --> 0:11:57.914
385
+ Why we want to have some type of probability?
386
+
387
+ 0:11:57.914 --> 0:11:59.785
388
+ How probable is this part of the language?
389
+
390
+ 0:12:00.560 --> 0:12:04.175
391
+ Because yeah, even for a few minutes, it's
392
+ not always clear.
393
+
394
+ 0:12:04.175 --> 0:12:06.782
395
+ Is this a sentence that you can use or not?
396
+
397
+ 0:12:06.782 --> 0:12:12.174
398
+ I mean, I just in this presentation gave several
399
+ sentences, which are not correct English.
400
+
401
+ 0:12:12.174 --> 0:12:17.744
402
+ So it might still happen that people speak
403
+ sentences or write sentences that are not correct,
404
+
405
+ 0:12:17.744 --> 0:12:19.758
406
+ and you want to deal with all of them.
407
+
408
+ 0:12:20.020 --> 0:12:25.064
409
+ So that is then, of course, a big advantage
410
+ if you use your more statistical models.
411
+
412
+ 0:12:25.705 --> 0:12:35.810
413
+ The disadvantage is that you need a suitably
414
+ large text database, which might exist for
415
+
416
+ 0:12:35.810 --> 0:12:37.567
417
+ many languages.
418
+
419
+ 0:12:37.857 --> 0:12:46.511
420
+ Nowadays you see that there is of course issues
421
+ that you need large computational resources
422
+
423
+ 0:12:46.511 --> 0:12:47.827
424
+ to deal with.
425
+
426
+ 0:12:47.827 --> 0:12:56.198
427
+ You need to collect all these crawls of
428
+ the internet which can create enormous amounts
429
+
430
+ 0:12:56.198 --> 0:12:57.891
431
+ of training data.
432
+
433
+ 0:12:58.999 --> 0:13:08.224
434
+ So if we want to build this then the question
435
+ is of course how can we estimate the probability?
436
+
437
+ 0:13:08.448 --> 0:13:10.986
438
+ So how probable is the sentence good morning?
439
+
440
+ 0:13:11.871 --> 0:13:15.450
441
+ And you all know basic statistics.
442
+
443
+ 0:13:15.450 --> 0:13:21.483
444
+ So if you see this you have a large database
445
+ of sentences.
446
+
447
+ 0:13:21.901 --> 0:13:28.003
448
+ Made this a real example, so this was from
449
+ the TED talks.
450
+
451
+ 0:13:28.003 --> 0:13:37.050
452
+ I guess most of you have heard about them,
453
+ and if you account for all many sentences,
454
+
455
+ 0:13:37.050 --> 0:13:38.523
456
+ good morning.
457
+
458
+ 0:13:38.718 --> 0:13:49.513
459
+ It happens so the probability of good morning
460
+ is three point something times ten to the power of minus something.
461
+
462
+ 0:13:50.030 --> 0:13:53.755
463
+ Okay, so this is a very easy thing.
464
+
465
+ 0:13:53.755 --> 0:13:58.101
466
+ We can directly model the language model.
467
+
468
+ 0:13:58.959 --> 0:14:03.489
469
+ Does anybody see a problem why this might
470
+ not be the final solution?
471
+
472
+ 0:14:06.326 --> 0:14:14.962
473
+ Think we would need a whole lot more sentences
474
+ to make anything useful of this.
475
+
476
+ 0:14:15.315 --> 0:14:29.340
477
+ Because the probability of the talk starting
478
+ with good morning, good morning is much higher
479
+
480
+ 0:14:29.340 --> 0:14:32.084
481
+ than ten minutes.
482
+
483
+ 0:14:33.553 --> 0:14:41.700
484
+ In all the probability presented in this face,
485
+ not how we usually think about it.
486
+
487
+ 0:14:42.942 --> 0:14:55.038
488
+ The probability is even OK, but you're going
489
+ into the right direction about the large data.
490
+
491
+ 0:14:55.038 --> 0:14:59.771
492
+ Yes, you can't form a new sentence.
493
+
494
+ 0:15:00.160 --> 0:15:04.763
495
+ It's about a large data, so you said it's
496
+ hard to get enough data.
497
+
498
+ 0:15:04.763 --> 0:15:05.931
499
+ It's impossible.
500
+
501
+ 0:15:05.931 --> 0:15:11.839
502
+ I would say we are always saying sentences
503
+ which have never been said and we are able
504
+
505
+ 0:15:11.839 --> 0:15:12.801
506
+ to deal with.
507
+
508
+ 0:15:13.133 --> 0:15:25.485
509
+ The problem with the sparsity of the data
510
+ is that we will have a lot of perfect English sentences we have never seen.
511
+
512
+ 0:15:26.226 --> 0:15:31.338
513
+ And this is, of course, not what we want to
514
+ deal with.
515
+
516
+ 0:15:31.338 --> 0:15:39.332
517
+ If we want to model that, we need to have
518
+ a model which can really estimate how good.
519
+
520
+ 0:15:39.599 --> 0:15:47.970
521
+ And if we are just like counting this way,
522
+ most of it will get a zero probability, which
523
+
524
+ 0:15:47.970 --> 0:15:48.722
525
+ is not what we want.
526
+
527
+ 0:15:49.029 --> 0:15:56.572
528
+ So we need to make things a bit different.
529
+
530
+ 0:15:56.572 --> 0:16:06.221
531
+ For the models we had already some idea of
532
+ doing that.
533
+
534
+ 0:16:06.486 --> 0:16:08.058
535
+ And that we can do here again.
536
+
537
+ 0:16:08.528 --> 0:16:12.866
538
+ So we can especially use the chain rule.
539
+
540
+ 0:16:12.772 --> 0:16:19.651
541
+ The chain rule and the definition of conditional
542
+ probability. So the conditional probability
543
+
544
+ 0:16:19.599 --> 0:16:26.369
545
+ of an event B given an event A is the probability
546
+ of A and B divided by the probability of A.
547
+
548
+ 0:16:26.369 --> 0:16:32.720
549
+ Yes, I recently had an exam on automatic speech
550
+ recognition and Mister Rival said this is not
551
+
552
+ 0:16:32.720 --> 0:16:39.629
553
+ called a chain rule, because I used this terminology
554
+ and he said it's just applying Bayes another time.
555
+
556
+ 0:16:40.500 --> 0:16:56.684
557
+ But this is definitely the definition of the
558
+ conditional probability.
559
+
560
+ 0:16:57.137 --> 0:17:08.630
561
+ The conditional probability of B given A is defined as P of A and B
562
+ divided by P of A.
563
+
564
+ 0:17:08.888 --> 0:17:16.392
565
+ And that can be easily rewritten into P of A and B equals P of A
566
+ times P of B given A.
567
+
568
+ 0:17:16.816 --> 0:17:35.279
569
+ And the nice thing is, we can easily extend
570
+ it, of course, into more variables so we can
571
+
572
+ 0:17:35.279 --> 0:17:38.383
573
+ have: And so on.
574
+
575
+ 0:17:38.383 --> 0:17:49.823
576
+ So more generally you can do that for now
577
+ any length of sequence.
578
+
579
+ 0:17:50.650 --> 0:18:04.802
580
+ So if we are now going back to words, we can
581
+ model that as the probability of the sequence
582
+
583
+ 0:18:04.802 --> 0:18:08.223
584
+ is given its history.
585
+
586
+ 0:18:08.908 --> 0:18:23.717
587
+ Maybe it's more clear if we're looking at
588
+ real words, so if we have P of its water
589
+
590
+ 0:18:23.717 --> 0:18:26.914
591
+ is so transparent.
592
+
593
+ 0:18:26.906 --> 0:18:39.136
594
+ So this way we are able to model the probability
595
+ of the whole sentence given the sequence by
596
+
597
+ 0:18:39.136 --> 0:18:42.159
598
+ looking at each word.
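A minimal LaTeX sketch of the decomposition just described, written with generic word symbols w_1 ... w_n (the notation is assumed here, not taken from the lecture slides):

```latex
% Chain rule: the sentence probability as a product of per-word probabilities,
% each conditioned on its full history.
P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})
```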
599
+
600
+ 0:18:42.762 --> 0:18:49.206
601
+ And of course the big advantage is that each
602
+ word occurs more often than the full sentence.
603
+
604
+ 0:18:49.206 --> 0:18:54.991
605
+ So hopefully we see that still, of course,
606
+ the problem that if the word doesn't occur
607
+
608
+ 0:18:54.991 --> 0:19:01.435
609
+ then this doesn't work, but that will cover
610
+ most of the lecture today: dealing with
611
+
612
+ 0:19:01.435 --> 0:19:01.874
613
+ this.
614
+
615
+ 0:19:02.382 --> 0:19:08.727
616
+ So first of all, this generally is at least
617
+ easier than the thing we had before.
618
+
619
+ 0:19:13.133 --> 0:19:23.531
620
+ That we really make sense easier, no, because
621
+ those jumps get utterly long and we have central.
622
+
623
+ 0:19:23.943 --> 0:19:29.628
624
+ Yes exactly, so when we look at the last probability
625
+ here, we still have to have seen the full.
626
+
627
+ 0:19:30.170 --> 0:19:38.146
628
+ So if we want to model the probability of transparent given its
629
+ water is so, we have to have seen the full sequence.
630
+
631
+ 0:19:38.578 --> 0:19:48.061
632
+ So in first step we didn't really have to
633
+ have seen the full sentence.
634
+
635
+ 0:19:48.969 --> 0:19:52.090
636
+ However, a little bit of a step nearer.
637
+
638
+ 0:19:52.512 --> 0:19:59.673
639
+ So this is still a problem and we will never
640
+ have seen it for all the time.
641
+
642
+ 0:20:00.020 --> 0:20:08.223
643
+ So you can look at this if you have a vocabulary
644
+ of words.
645
+
646
+ 0:20:08.223 --> 0:20:17.956
647
+ Now, for example, if the average sentence
648
+ is, you would leave to the.
649
+
650
+ 0:20:18.298 --> 0:20:22.394
651
+ And we are quite sure we have never seen that
652
+ much data.
653
+
654
+ 0:20:22.902 --> 0:20:26.246
655
+ So this is, we cannot really compute this
656
+ probability.
657
+
658
+ 0:20:26.786 --> 0:20:37.794
659
+ However, there's a trick how we can do that
660
+ and that's the idea behind most of the language models.
661
+
662
+ 0:20:38.458 --> 0:20:44.446
663
+ So instead of saying how often does this word
664
+ occur after exactly this history, we are trying
665
+
666
+ 0:20:44.446 --> 0:20:50.433
667
+ to do some kind of clustering and cluster a
668
+ lot of different histories into the same class,
669
+
670
+ 0:20:50.433 --> 0:20:55.900
671
+ and then we are modeling the probability of
672
+ the word given this class of histories.
673
+
674
+ 0:20:56.776 --> 0:21:06.245
675
+ And then, of course, the big design decision
676
+ is how to be modeled like how to cluster history.
677
+
678
+ 0:21:06.666 --> 0:21:17.330
679
+ So how do we put all these histories together
680
+ so that we have seen each of them often enough
681
+
682
+ 0:21:17.330 --> 0:21:18.396
683
+ so that.
684
+
685
+ 0:21:20.320 --> 0:21:25.623
686
+ So there is quite different types of things
687
+ people can do.
688
+
689
+ 0:21:25.623 --> 0:21:33.533
690
+ You can add part-of-speech tags, you can use
691
+ semantic words, you can model the similarity,
692
+
693
+ 0:21:33.533 --> 0:21:46.113
694
+ you can model grammatical content, and things
695
+ like: However, like quite often in these statistical
696
+
697
+ 0:21:46.113 --> 0:21:53.091
698
+ models, a very simple solution often works best.
699
+
700
+ 0:21:53.433 --> 0:21:58.455
701
+ And this is what most statistical models do.
702
+
703
+ 0:21:58.455 --> 0:22:09.616
704
+ They are based on the so-called Markov assumption,
705
+ and that means we are assuming all this history
706
+
707
+ 0:22:09.616 --> 0:22:12.183
708
+ is not that important.
709
+
710
+ 0:22:12.792 --> 0:22:25.895
711
+ So we are modeling the probability of zirkins
712
+ is so transparent that or we have maybe two
713
+
714
+ 0:22:25.895 --> 0:22:29.534
715
+ words by having a fixed.
716
+
717
+ 0:22:29.729 --> 0:22:38.761
718
+ So the class of all our histories from word one
719
+ to word i minus one is just the last two words.
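A short LaTeX sketch of this Markov approximation, assuming a history of n-1 words (so the last two words for the case just mentioned):

```latex
% Markov assumption: approximate the full history by the last n-1 words.
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
% Example with a two-word history:
% P(w_i | w_1, ..., w_{i-1}) is replaced by P(w_i | w_{i-2}, w_{i-1})
```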
720
+
721
+ 0:22:39.679 --> 0:22:45.229
722
+ And by doing this classification, which of
723
+ course does not need any additional knowledge.
724
+
725
+ 0:22:45.545 --> 0:22:51.176
726
+ It's very easy to calculate: we have now limited
727
+ our histories.
728
+
729
+ 0:22:51.291 --> 0:23:00.906
730
+ So instead of an arbitrary long one here,
731
+ we have here only like.
732
+
733
+ 0:23:00.906 --> 0:23:10.375
734
+ For example, if we have two grams, a lot of
735
+ them will not occur.
736
+
737
+ 0:23:10.930 --> 0:23:20.079
738
+ So it's a very simple trick to make all these
739
+ classes into a few classes and motivated by,
740
+
741
+ 0:23:20.079 --> 0:23:24.905
742
+ of course, the language the nearest things
743
+ are.
744
+
745
+ 0:23:24.944 --> 0:23:33.043
746
+ Like a lot of sequences, they mainly depend
747
+ on the previous one, and things which are far
748
+
749
+ 0:23:33.043 --> 0:23:33.583
750
+ away.
751
+
752
+ 0:23:38.118 --> 0:23:47.361
753
+ In our product here everything is just modeled
754
+ not by the whole history but by the last n
755
+
756
+ 0:23:47.361 --> 0:23:48.969
757
+ minus one words.
758
+
759
+ 0:23:50.470 --> 0:23:54.322
760
+ So and this is typically expressed by people.
761
+
762
+ 0:23:54.322 --> 0:24:01.776
763
+ People are therefore also talking about an N gram
764
+ language model because we are always looking
765
+
766
+ 0:24:01.776 --> 0:24:06.550
767
+ at these chunks of N words and modeling the
768
+ probability.
769
+
770
+ 0:24:07.527 --> 0:24:10.485
771
+ So again start with the most simple case.
772
+
773
+ 0:24:10.485 --> 0:24:15.485
774
+ The most extreme is the unigram case, where we're
775
+ ignoring the whole history.
776
+
777
+ 0:24:15.835 --> 0:24:24.825
778
+ The probability of a sequence of words is
779
+ just the probability of each of the words in
780
+
781
+ 0:24:24.825 --> 0:24:25.548
782
+ there.
783
+
784
+ 0:24:26.046 --> 0:24:32.129
785
+ And therefore we are removing the whole context.
786
+
787
+ 0:24:32.129 --> 0:24:40.944
788
+ The most probable sequence would be something
789
+ like one of them is the.
790
+
791
+ 0:24:42.162 --> 0:24:44.694
792
+ Most probable wordsuit by itself.
793
+
794
+ 0:24:44.694 --> 0:24:49.684
795
+ It might not make sense, but it, of course,
796
+ can give you a bit of.
797
+
798
+ 0:24:49.629 --> 0:24:52.682
799
+ Intuition like which types of words should
800
+ be more frequent.
801
+
802
+ 0:24:53.393 --> 0:25:00.012
803
+ And if you what you can do is train such a
804
+ model and you can just automatically generate text.
805
+
806
+ 0:25:00.140 --> 0:25:09.496
807
+ And this sequence is generated by sampling,
808
+ so we will come to that later in the lecture.
809
+
810
+ 0:25:09.496 --> 0:25:16.024
811
+ The sampling is that you randomly pick a word
812
+ but based on its probability.
813
+
814
+ 0:25:16.096 --> 0:25:22.711
815
+ So if the probability of one word is zero
816
+ point two then you'll put it on and if another
817
+
818
+ 0:25:22.711 --> 0:25:23.157
819
+ word.
820
+
821
+ 0:25:23.483 --> 0:25:36.996
822
+ And if you see that you'll see here now, for
823
+ example, it seems that these are two occurring
824
+
825
+ 0:25:36.996 --> 0:25:38.024
826
+ posts.
827
+
828
+ 0:25:38.138 --> 0:25:53.467
829
+ But you see there's not really any continuing
830
+ type of structure because each word is modeled
831
+
832
+ 0:25:53.467 --> 0:25:55.940
833
+ independently.
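A minimal Python sketch of the sampling procedure described here, using a toy unigram distribution; the words, probabilities and function name are made up for illustration only:

```python
import random

# Toy unigram distribution (made-up words and probabilities for illustration).
unigram = {"the": 0.20, "of": 0.10, "and": 0.10, "a": 0.10, "to": 0.10,
           "is": 0.10, "in": 0.10, "it": 0.10, "house": 0.05, "small": 0.05}

def sample_unigram(model, length=10):
    words = list(model)
    weights = list(model.values())
    # Every word is drawn independently of the previous ones, which is why
    # the generated "sentence" shows no continuing structure.
    return " ".join(random.choices(words, weights=weights, k=length))

print(sample_unigram(unigram))
```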
834
+
835
+ 0:25:57.597 --> 0:26:03.037
836
+ This you can do better even though going to
837
+ a bigram, so then we're having a bit of context.
838
+
839
+ 0:26:03.037 --> 0:26:08.650
840
+ Of course, it's still very small, so the probability
841
+ of your word of the actual word only depends
842
+
843
+ 0:26:08.650 --> 0:26:12.429
844
+ on the previous word and all the context before
845
+ there is ignored.
846
+
847
+ 0:26:13.133 --> 0:26:18.951
848
+ This of course will come to that wrong, but
849
+ it models a regular language significantly
850
+
851
+ 0:26:18.951 --> 0:26:19.486
852
+ better.
853
+
854
+ 0:26:19.779 --> 0:26:28.094
855
+ Seeing some things here still doesn't really
856
+ make a lot of sense, but you're seeing some
857
+
858
+ 0:26:28.094 --> 0:26:29.682
859
+ typical phrases.
860
+
861
+ 0:26:29.949 --> 0:26:39.619
862
+ In this hope doesn't make sense, but in this
863
+ issue is also frequent.
864
+
865
+ 0:26:39.619 --> 0:26:51.335
866
+ Issue is also: Very nice is this year new
867
+ car parking lot after, so if you have the word
868
+
869
+ 0:26:51.335 --> 0:26:53.634
870
+ new then the word.
871
+
872
+ 0:26:53.893 --> 0:27:01.428
873
+ Is also quite common, but new car they wouldn't
874
+ put parking.
875
+
876
+ 0:27:01.428 --> 0:27:06.369
877
+ Often the continuation is parking lots.
878
+
879
+ 0:27:06.967 --> 0:27:12.417
880
+ And now it's very interesting because here
881
+ we see the two semantic meanings of lot: You
882
+
883
+ 0:27:12.417 --> 0:27:25.889
884
+ have a parking lot, but in general if you just
885
+ think about the history, the most common use
886
+
887
+ 0:27:25.889 --> 0:27:27.353
888
+ is a lot.
889
+
890
+ 0:27:27.527 --> 0:27:33.392
891
+ So you see that he's really not using the
892
+ context before, but he's only using the current
893
+
894
+ 0:27:33.392 --> 0:27:33.979
895
+ context.
896
+
897
+ 0:27:38.338 --> 0:27:41.371
898
+ So in general we can of course do that longer.
899
+
900
+ 0:27:41.371 --> 0:27:43.888
901
+ We can do unigrams, bigrams, trigrams.
902
+
903
+ 0:27:45.845 --> 0:27:52.061
904
+ People typically went up to four or five grams,
905
+ and then it's getting difficult because.
906
+
907
+ 0:27:52.792 --> 0:27:56.671
908
+ There are so many five grams that it's getting
909
+ complicated.
910
+
911
+ 0:27:56.671 --> 0:28:02.425
912
+ Storing all of them and storing these models
913
+ get so big that it's no longer working, and
914
+
915
+ 0:28:02.425 --> 0:28:08.050
916
+ of course at some point the calculation of
917
+ the probabilities again gets too difficult,
918
+
919
+ 0:28:08.050 --> 0:28:09.213
920
+ and each of them.
921
+
922
+ 0:28:09.429 --> 0:28:14.777
923
+ If you have a small corpus, of course you
924
+ will use a smaller n-gram length.
925
+
926
+ 0:28:14.777 --> 0:28:16.466
927
+ You will take a larger.
928
+
929
+ 0:28:18.638 --> 0:28:24.976
930
+ What is important to keep in mind is that,
931
+ of course, this is wrong.
932
+
933
+ 0:28:25.285 --> 0:28:36.608
934
+ So we have long range dependencies, and if
935
+ we really want to model everything in language
936
+
937
+ 0:28:36.608 --> 0:28:37.363
938
+ then.
939
+
940
+ 0:28:37.337 --> 0:28:46.965
941
+ So here is like one of these extreme cases,
942
+ the computer, which I had just put into the machine
943
+
944
+ 0:28:46.965 --> 0:28:49.423
945
+ room on the fifth floor, crashed.
946
+
947
+ 0:28:49.423 --> 0:28:55.978
948
+ Like somehow, there is a dependency between
949
+ computer and crash.
950
+
951
+ 0:28:57.978 --> 0:29:10.646
952
+ However, in most situations these are typically
953
+ rare and normally most important things happen
954
+
955
+ 0:29:10.646 --> 0:29:13.446
956
+ in the near context.
957
+
958
+ 0:29:15.495 --> 0:29:28.408
959
+ But of course it's important to keep that
960
+ in mind that you can't model the thing so you
961
+
962
+ 0:29:28.408 --> 0:29:29.876
963
+ can't do.
964
+
965
+ 0:29:33.433 --> 0:29:50.200
966
+ The next question is again how can we train
967
+ so we have to estimate these probabilities.
968
+
969
+ 0:29:51.071 --> 0:30:00.131
970
+ And the question is how we do that, and again
971
+ the most simple thing.
972
+
973
+ 0:30:00.440 --> 0:30:03.168
974
+ The thing is exactly what's maximum likelihood
975
+ estimation.
976
+
977
+ 0:30:03.168 --> 0:30:12.641
978
+ What gives you the right answer is: So how
979
+ probable is it that the word is following the word i minus
980
+
981
+ 0:30:12.641 --> 0:30:13.370
982
+ one?
983
+
984
+ 0:30:13.370 --> 0:30:20.946
985
+ You just count how often does this sequence
986
+ happen?
987
+
988
+ 0:30:21.301 --> 0:30:28.165
989
+ So guess this is what most of you would have
990
+ intuitively done, and this also works best.
991
+
992
+ 0:30:28.568 --> 0:30:39.012
993
+ So it's not complicated to train: you once
994
+ have to go over your corpus, you have to count
995
+
996
+ 0:30:39.012 --> 0:30:48.662
997
+ all bigrams and unigrams, and then you can
998
+ directly train the basic language model.
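A minimal Python sketch of this counting-based (maximum likelihood) training, assuming a tiny tokenised corpus with sentence markers; the corpus and names are illustrative only:

```python
from collections import Counter

# Toy corpus with sentence-start/end markers; in practice this is the full training text.
corpus = [["<s>", "i", "live", "here", "</s>"],
          ["<s>", "i", "work", "here", "</s>"]]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter((sent[i], sent[i + 1])
                        for sent in corpus for i in range(len(sent) - 1))

def p_mle(w_prev, w):
    # Maximum likelihood estimate: count(w_prev, w) / count(w_prev).
    # Unseen bigrams get probability zero here, which is exactly the
    # problem that smoothing addresses later in the lecture.
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev] if unigram_counts[w_prev] else 0.0

print(p_mle("i", "live"))  # 0.5 in this toy corpus
```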
999
+
1000
+ 0:30:49.189 --> 0:30:50.651
1001
+ Where is it difficult?
1002
+
1003
+ 0:30:50.651 --> 0:30:58.855
1004
+ There are two difficulties: The basic language
1005
+ model doesn't work that well because of zero
1006
+
1007
+ 0:30:58.855 --> 0:31:03.154
1008
+ counts and how we address that and the second.
1009
+
1010
+ 0:31:03.163 --> 0:31:13.716
1011
+ Because we saw that especially if you go for
1012
+ larger you have to store all these engrams
1013
+
1014
+ 0:31:13.716 --> 0:31:15.275
1015
+ efficiently.
1016
+
1017
+ 0:31:17.697 --> 0:31:21.220
1018
+ So how we can do that?
1019
+
1020
+ 0:31:21.220 --> 0:31:24.590
1021
+ Here's some examples.
1022
+
1023
+ 0:31:24.590 --> 0:31:33.626
1024
+ For example, if you have this sequence as your
1025
+ training corpus.
1026
+
1027
+ 0:31:33.713 --> 0:31:41.372
1028
+ You see that the word happens, ascends the
1029
+ start, and the sequence happens two times.
1030
+
1031
+ 0:31:42.182 --> 0:31:45.651
1032
+ We have three times.
1033
+
1034
+ 0:31:45.651 --> 0:31:58.043
1035
+ The sentence starts, so the probability is two thirds
1036
+ and the other probability.
1037
+
1038
+ 0:31:58.858 --> 0:32:09.204
1039
+ Here we have what is following so you have
1040
+ twice and once do, so again two thirds and one third.
1041
+
1042
+ 0:32:09.809 --> 0:32:20.627
1043
+ And this is all that you need to know here
1044
+ about it, so you can do this calculation.
1045
+
1046
+ 0:32:23.723 --> 0:32:35.506
1047
+ So the question then, of course, is what do
1048
+ we really learn in these types of models?
1049
+
1050
+ 0:32:35.506 --> 0:32:45.549
1051
+ Here are examples from the Europarl corpus:
1052
+ The green, the red, and the blue, and here
1053
+
1054
+ 0:32:45.549 --> 0:32:48.594
1055
+ you have the probabilities which is the next.
1056
+
1057
+ 0:32:48.989 --> 0:33:01.897
1058
+ That there is a lot more than just like the
1059
+ syntax because the initial phrase is all the
1060
+
1061
+ 0:33:01.897 --> 0:33:02.767
1062
+ same.
1063
+
1064
+ 0:33:03.163 --> 0:33:10.132
1065
+ For example, you see the green paper in the
1066
+ green group.
1067
+
1068
+ 0:33:10.132 --> 0:33:16.979
1069
+ It's more European Parliament, the red cross,
1070
+ which is by.
1071
+
1072
+ 0:33:17.197 --> 0:33:21.777
1073
+ What you also see that it's like sometimes
1074
+ easier, sometimes it's more difficult.
1075
+
1076
+ 0:33:22.302 --> 0:33:28.345
1077
+ So, for example, following the red, in one
1078
+ hundred cases it was a red cross.
1079
+
1080
+ 0:33:28.668 --> 0:33:48.472
1081
+ So it seems to be easier to guess the next
1082
+ word.
1083
+
1084
+ 0:33:48.528 --> 0:33:55.152
1085
+ So there is different types of information
1086
+ coded in that you also know that I guess sometimes
1087
+
1088
+ 0:33:55.152 --> 0:33:58.675
1089
+ you directly know all the speakers will continue.
1090
+
1091
+ 0:33:58.675 --> 0:34:04.946
1092
+ It's not a lot of new information in the next
1093
+ word, but in other cases like blue there's
1094
+
1095
+ 0:34:04.946 --> 0:34:06.496
1096
+ a lot of information.
1097
+
1098
+ 0:34:11.291 --> 0:34:14.849
1099
+ Another example is this Berkeley restaurant
1100
+ sentences.
1101
+
1102
+ 0:34:14.849 --> 0:34:21.059
1103
+ It's collected at Berkeley and you have sentences
1104
+ like can you tell me about any good spaghetti
1105
+
1106
+ 0:34:21.059 --> 0:34:21.835
1107
+ restaurant.
1108
+
1109
+ 0:34:21.835 --> 0:34:27.463
1110
+ Big price title is what I'm looking for so
1111
+ it's more like a dialogue system and people
1112
+
1113
+ 0:34:27.463 --> 0:34:31.215
1114
+ have collected this data and of course you
1115
+ can also look.
1116
+
1117
+ 0:34:31.551 --> 0:34:46.878
1118
+ Into this and get the counts, so you count
1119
+ the bigrams in the top, so the column is the
1120
+
1121
+ 0:34:49.409 --> 0:34:52.912
1122
+ This is a bigram which is the first word of
1123
+ West.
1124
+
1125
+ 0:34:52.912 --> 0:34:54.524
1126
+ This one fuzzy is one.
1127
+
1128
+ 0:34:56.576 --> 0:35:12.160
1129
+ One because want to hyperability, but want
1130
+ a lot less, and there where you see it, for
1131
+
1132
+ 0:35:12.160 --> 0:35:17.004
1133
+ example: So here you see after I want.
1134
+
1135
+ 0:35:17.004 --> 0:35:23.064
1136
+ It's very often for I eat, but an island which
1137
+ is not just.
1138
+
1139
+ 0:35:27.347 --> 0:35:39.267
1140
+ The absolute counts of how often each road
1141
+ occurs, and then you can see here the probabilities
1142
+
1143
+ 0:35:39.267 --> 0:35:40.145
1144
+ again.
1145
+
1146
+ 0:35:42.422 --> 0:35:54.519
1147
+ Then you can do that: if you want to model I want Dutch
1148
+ food you get the sequence you have to multiply
1149
+
1150
+ 0:35:54.519 --> 0:35:55.471
1151
+ all of them.
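A small LaTeX sketch of the multiplication just mentioned, for the bigram case; the sentence markers are added for illustration and the actual numbers would come from the count tables above:

```latex
P(\text{<s> i want dutch food </s>}) \approx
  P(\text{i} \mid \text{<s>}) \cdot
  P(\text{want} \mid \text{i}) \cdot
  P(\text{dutch} \mid \text{want}) \cdot
  P(\text{food} \mid \text{dutch}) \cdot
  P(\text{</s>} \mid \text{food})
```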
1152
+
1153
+ 0:35:55.635 --> 0:36:00.281
1154
+ And then you of course get a bit of interesting
1155
+ experience on that.
1156
+
1157
+ 0:36:00.281 --> 0:36:04.726
1158
+ For example: Information is there.
1159
+
1160
+ 0:36:04.726 --> 0:36:15.876
1161
+ So, for example, if you compare I want Dutch
1162
+ or I want Chinese, it seems that.
1163
+
1164
+ 0:36:16.176 --> 0:36:22.910
1165
+ That the sentence often starts with I.
1166
+
1167
+ 0:36:22.910 --> 0:36:31.615
1168
+ You have it after two is possible, but after
1169
+ one it.
1170
+
1171
+ 0:36:31.731 --> 0:36:39.724
1172
+ And you cannot say want, but you have to say
1173
+ I want to spend, so there's grammatical information.
1174
+
1175
+ 0:36:40.000 --> 0:36:51.032
1176
+ So there is domain information and so on. Here, before
1177
+ we're going into measuring quality, is there
1178
+
1179
+ 0:36:51.032 --> 0:36:58.297
1180
+ any questions about language model and the
1181
+ idea of modeling?
1182
+
1183
+ 0:37:02.702 --> 0:37:13.501
1184
+ Hope that doesn't mean everybody sleeping,
1185
+ and so when we're doing the training these
1186
+
1187
+ 0:37:13.501 --> 0:37:15.761
1188
+ language models,.
1189
+
1190
+ 0:37:16.356 --> 0:37:26.429
1191
+ You need to decide what the n-gram length is:
1192
+ should we use a trigram or a fourgram?
1193
+
1194
+ 0:37:27.007 --> 0:37:34.040
1195
+ So in order to decide how can you now decide
1196
+ which of the two models are better?
1197
+
1198
+ 0:37:34.914 --> 0:37:40.702
1199
+ And if you would have to do that, how would
1200
+ you decide taking language model or taking
1201
+
1202
+ 0:37:40.702 --> 0:37:41.367
1203
+ language?
1204
+
1205
+ 0:37:43.263 --> 0:37:53.484
1206
+ I take some test text and see which model
1207
+ assigns a higher probability to it.
1208
+
1209
+ 0:37:54.354 --> 0:38:03.978
1210
+ It's very good, so that's even the second
1211
+ thing, so the first thing maybe would have
1212
+
1213
+ 0:38:03.978 --> 0:38:04.657
1214
+ been.
1215
+
1216
+ 0:38:05.925 --> 0:38:12.300
1217
+ The first thing would be that you take the language
1218
+ model and use it in machine translation.
1219
+
1220
+ 0:38:13.193 --> 0:38:18.773
1221
+ Problems: First of all you have to build a
1222
+ whole system which is very time consuming and
1223
+
1224
+ 0:38:18.773 --> 0:38:21.407
1225
+ it might not only depend on the language.
1226
+
1227
+ 0:38:21.407 --> 0:38:24.730
1228
+ On the other hand, that's of course what the
1229
+ end is.
1230
+
1231
+ 0:38:24.730 --> 0:38:30.373
1232
+ The end want and the pressure will model each
1233
+ component individually or do you want to do
1234
+
1235
+ 0:38:30.373 --> 0:38:31.313
1236
+ an end to end.
1237
+
1238
+ 0:38:31.771 --> 0:38:35.463
1239
+ What can also happen is you'll see your metric
1240
+ model.
1241
+
1242
+ 0:38:35.463 --> 0:38:41.412
1243
+ This is a very good language model, but it
1244
+ somewhat doesn't really work well with your
1245
+
1246
+ 0:38:41.412 --> 0:38:42.711
1247
+ translation model.
1248
+
1249
+ 0:38:43.803 --> 0:38:49.523
1250
+ But of course it's very good to also have
1251
+ this type of intrinsic evaluation where the
1252
+
1253
+ 0:38:49.523 --> 0:38:52.116
1254
+ assumption should be as a pointed out.
1255
+
1256
+ 0:38:52.116 --> 0:38:57.503
1257
+ If we have good English it should be a
1258
+ high probability, and a low probability if it's bad English.
1259
+
1260
+ 0:38:58.318 --> 0:39:07.594
1261
+ And this is measured by the take a held out
1262
+ data set, so some data which you don't train
1263
+
1264
+ 0:39:07.594 --> 0:39:12.596
1265
+ on then calculate the probability of this data.
1266
+
1267
+ 0:39:12.912 --> 0:39:26.374
1268
+ Then you're just looking at the language model
1269
+ and you take the language model with the higher probability.
1270
+
1271
+ 0:39:27.727 --> 0:39:33.595
1272
+ You're not directly using the probability,
1273
+ but you're taking the perplexity.
1274
+
1275
+ 0:39:33.595 --> 0:39:40.454
1276
+ The perplexity is two to the power of the
1277
+ cross entropy, and you see in the cross entropy
1278
+
1279
+ 0:39:40.454 --> 0:39:46.322
1280
+ you're doing something like an average probability
1281
+ of always coming to this.
1282
+
1283
+ 0:39:46.846 --> 0:39:54.721
1284
+ So how exactly is that defined? Perplexity
1285
+ is typically what people refer to, or the cross entropy.
1286
+
1287
+ 0:39:54.894 --> 0:40:02.328
1288
+ The cross entropy is a negative average, and
1289
+ then you have the log of the probability of
1290
+
1291
+ 0:40:02.328 --> 0:40:03.246
1292
+ the whole.
1293
+
1294
+ 0:40:04.584 --> 0:40:10.609
1295
+ We are modeling this probability as the product
1296
+ of each of the words.
1297
+
1298
+ 0:40:10.609 --> 0:40:18.613
1299
+ That's how the n-gram model was defined, and now
1300
+ you hopefully can remember the rules of logarithms
1301
+
1302
+ 0:40:18.613 --> 0:40:23.089
1303
+ so you can turn the product inside the logarithm into a sum.
1304
+
1305
+ 0:40:23.063 --> 0:40:31.036
1306
+ The sum here: so the cross entropy is minus one
1307
+ divided by n, times the sum over all your words
1308
+
1309
+ 0:40:31.036 --> 0:40:35.566
1310
+ of the logarithm of the probability of each
1311
+ word.
1312
+
1313
+ 0:40:36.176 --> 0:40:39.418
1314
+ And then the perplexity is just like two to
1315
+ the power.
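A LaTeX sketch of the two definitions just described, assuming a test text w_1 ... w_N and logarithms to base 2:

```latex
% Cross entropy: average negative log-probability per word of the test text.
H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_1, \dots, w_{i-1})
% Perplexity: two to the power of the cross entropy.
PPL = 2^{H}
```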
1316
+
1317
+ 0:40:41.201 --> 0:40:44.706
1318
+ Why can this be interpreted as a branching
1319
+ factor?
1320
+
1321
+ 0:40:44.706 --> 0:40:50.479
1322
+ So it gives you a bit like the average thing,
1323
+ like how many possibilities you have.
1324
+
1325
+ 0:40:51.071 --> 0:41:02.249
1326
+ Say you have a digit task and you have no idea,
1327
+ but the probability of the next digit is like
1328
+
1329
+ 0:41:02.249 --> 0:41:03.367
1330
+ one tenth.
1331
+
1332
+ 0:41:03.783 --> 0:41:09.354
1333
+ And if you then calculate the perplexity, it
1334
+ will be exactly ten.
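The digit example worked out, assuming a uniform probability of one tenth for every next digit:

```latex
H = -\log_2 \tfrac{1}{10} = \log_2 10 \approx 3.32, \qquad
PPL = 2^{H} = 2^{\log_2 10} = 10
```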
1335
+
1336
+ 0:41:09.849 --> 0:41:24.191
1337
+ And that is like this perplexity gives you
1338
+ a meaningful interpretation, so how much randomness
1339
+
1340
+ 0:41:24.191 --> 0:41:27.121
1341
+ is still in there?
1342
+
1343
+ 0:41:27.307 --> 0:41:32.433
1344
+ Of course, now it's good to have a lower perplexity.
1345
+
1346
+ 0:41:32.433 --> 0:41:36.012
1347
+ We have less ambiguity in there and.
1348
+
1349
+ 0:41:35.976 --> 0:41:48.127
1350
+ If you have a hundred words and you only have
1351
+ to uniformly compare it to ten different, so
1352
+
1353
+ 0:41:48.127 --> 0:41:49.462
1354
+ you have.
1355
+
1356
+ 0:41:49.609 --> 0:41:53.255
1357
+ Yes, think so it should be.
1358
+
1359
+ 0:41:53.255 --> 0:42:03.673
1360
+ You had here a logarithm and then two to the power
1361
+ and that should then be eliminated.
1362
+
1363
+ 0:42:03.743 --> 0:42:22.155
1364
+ So which logarithm you use is not that important
1365
+ because it's a constant factor to reformulate.
1366
+
1367
+ 0:42:23.403 --> 0:42:28.462
1368
+ Yes and Yeah So the Best.
1369
+
1370
+ 0:42:31.931 --> 0:42:50.263
1371
+ The best model is always like you want to
1372
+ have a high probability.
1373
+
1374
+ 0:42:51.811 --> 0:43:04.549
1375
+ Time you see here, so here the probabilities
1376
+ would like to commend the rapporteur on his
1377
+
1378
+ 0:43:04.549 --> 0:43:05.408
1379
+ work.
1380
+
1381
+ 0:43:05.285 --> 0:43:14.116
1382
+ You have then the log two probabilities and
1383
+ then the average, so this is not the perplexity
1384
+
1385
+ 0:43:14.116 --> 0:43:18.095
1386
+ but the cross entropy as mentioned here.
1387
+
1388
+ 0:43:18.318 --> 0:43:26.651
1389
+ And then due to the power of that we'll give
1390
+ you the perplexity of the sentence.
1391
+
1392
+ 0:43:29.329 --> 0:43:40.967
1393
+ And these metrics of perplexity are essential
1394
+ in modeling that and we'll also see nowadays.
1395
+
1396
+ 0:43:41.121 --> 0:43:47.898
1397
+ You also measure quality often in perplexity
1398
+ or cross entropy, which gives you how good
1399
+
1400
+ 0:43:47.898 --> 0:43:50.062
1401
+ is it in estimating the same.
1402
+
1403
+ 0:43:50.010 --> 0:43:53.647
1404
+ The better the model is, the more information
1405
+ you have about this.
1406
+
1407
+ 0:43:55.795 --> 0:44:03.106
1408
+ Talked about isomic ability or quit sentences,
1409
+ but don't most have to any much because.
1410
+
1411
+ 0:44:03.463 --> 0:44:12.512
1412
+ You are doing that in this way implicitly
1413
+ because of the correct word.
1414
+
1415
+ 0:44:12.512 --> 0:44:19.266
1416
+ If you are modeling this one, the sun over
1417
+ all next.
1418
+
1419
+ 0:44:20.020 --> 0:44:29.409
1420
+ Therefore, you have that implicitly in there
1421
+ because in each position you're modeling the
1422
+
1423
+ 0:44:29.409 --> 0:44:32.957
1424
+ probability of this word given the history.
1425
+
1426
+ 0:44:35.515 --> 0:44:43.811
1427
+ You have a very large number of negative examples
1428
+ because all the possible extensions which are
1429
+
1430
+ 0:44:43.811 --> 0:44:49.515
1431
+ not there are incorrect, which of course might
1432
+ also be a problem.
1433
+
1434
+ 0:44:52.312 --> 0:45:00.256
1435
+ And the biggest challenge of these types of
1436
+ models is how to model unseen events.
1437
+
1438
+ 0:45:00.840 --> 0:45:04.973
1439
+ So that can be unknown words or it can be
1440
+ unknown bigrams.
1441
+
1442
+ 0:45:05.245 --> 0:45:10.096
1443
+ So that's important also like you've seen
1444
+ all the words.
1445
+
1446
+ 0:45:10.096 --> 0:45:17.756
1447
+ But if you have a bigram language model, if
1448
+ you haven't seen the bigram, you'll still get
1449
+
1450
+ 0:45:17.756 --> 0:45:23.628
1451
+ a zero probability because we know that the
1452
+ bigram count divided by the unigram count.
1453
+
1454
+ 0:45:24.644 --> 0:45:35.299
1455
+ If you have unknown words, the problem gets
1456
+ even bigger because one word typically causes
1457
+
1458
+ 0:45:35.299 --> 0:45:37.075
1459
+ a lot of zero.
1460
+
1461
+ 0:45:37.217 --> 0:45:41.038
1462
+ So if you, for example, if your vocabulary
1463
+ is go to and care it,.
1464
+
1465
+ 0:45:41.341 --> 0:45:43.467
1466
+ And you have now a sentence.
1467
+
1468
+ 0:45:43.467 --> 0:45:47.941
1469
+ I want to pay a T, so you have one word, which
1470
+ is here 'an'.
1471
+
1472
+ 0:45:47.887 --> 0:45:54.354
1473
+ It is unknown, then you have the problem.
1474
+
1475
+ 0:45:54.354 --> 0:46:02.147
1476
+ It is I get a sentence star and sentence star.
1477
+
1478
+ 0:46:02.582 --> 0:46:09.850
1479
+ To model this probability you always have
1480
+ to take the count of these sequences divided
1481
+
1482
+ 0:46:09.850 --> 0:46:19.145
1483
+ by: Since when does it occur, all of these
1484
+ n-grams can also never occur because of the word in the
1485
+
1486
+ 0:46:19.145 --> 0:46:19.961
1487
+ middle.
1488
+
1489
+ 0:46:20.260 --> 0:46:27.800
1490
+ So all of these probabilities are directly
1491
+ zero.
1492
+
1493
+ 0:46:27.800 --> 0:46:33.647
1494
+ You see that just by having a single unknown word.
1495
+
1496
+ 0:46:34.254 --> 0:46:47.968
1497
+ This tells you it might not always be better to
1498
+ have larger n-grams because if you have a larger n-gram
1499
+
1500
+ 0:46:47.968 --> 0:46:50.306
1501
+ language model, zeros occur more often.
1502
+
1503
+ 0:46:50.730 --> 0:46:57.870
1504
+ So sometimes it's better to have a smaller
1505
+ n-gram context because the chances that you're
1506
+
1507
+ 0:46:57.870 --> 0:47:00.170
1508
+ seeing the n-gram are higher.
1509
+
1510
+ 0:47:00.170 --> 0:47:07.310
1511
+ On the other hand, you want to have a larger
1512
+ context because the larger the context is, the
1513
+
1514
+ 0:47:07.310 --> 0:47:09.849
1515
+ longer the context is modeling.
1516
+
1517
+ 0:47:10.670 --> 0:47:17.565
1518
+ So how can we address this type of problem?
1519
+
1520
+ 0:47:17.565 --> 0:47:28.064
1521
+ We address this type of problem by somehow
1522
+ adjusting our counts.
1523
+
1524
+ 0:47:29.749 --> 0:47:40.482
1525
+ We have often, but most of your entries in
1526
+ the table are zero, and if one of these n-grams
1527
+
1528
+ 0:47:40.482 --> 0:47:45.082
1529
+ occurs you'll have a zero probability.
1530
+
1531
+ 0:47:46.806 --> 0:48:06.999
1532
+ So therefore we need to find some of our ways
1533
+ in order to estimate this type of event because:
1534
+
1535
+ 0:48:07.427 --> 0:48:11.619
1536
+ So there are different ways of how to model
1537
+ it and how to adjust it.
1538
+
1539
+ 0:48:11.619 --> 0:48:15.326
1540
+ The one idea is to do smoothing and that's
1541
+ the first thing.
1542
+
1543
+ 0:48:15.326 --> 0:48:20.734
1544
+ So in smoothing you're saying okay, we take
1545
+ a bit of the probability we gave to our seen
1546
+
1547
+ 0:48:20.734 --> 0:48:23.893
1548
+ events and distribute this thing we're taking
1549
+ away.
1550
+
1551
+ 0:48:23.893 --> 0:48:26.567
1552
+ We're distributing to all the other events.
1553
+
1554
+ 0:48:26.946 --> 0:48:33.927
1555
+ The nice thing is in this case oh now each
1556
+ event has a non zero probability and that is
1557
+
1558
+ 0:48:33.927 --> 0:48:39.718
1559
+ of course very helpful because we don't have
1560
+ zero probabilities anymore.
1561
+
1562
+ 0:48:40.180 --> 0:48:48.422
1563
+ It smoothed out, but at least you have some
1564
+ kind of probability everywhere, so you take
1565
+
1566
+ 0:48:48.422 --> 0:48:50.764
1567
+ some of the probability.
1568
+
1569
+ 0:48:53.053 --> 0:49:05.465
1570
+ You can also do that more here when you have
1571
+ the n-gram, for example, and this is your
1572
+
1573
+ 0:49:05.465 --> 0:49:08.709
1574
+ original distribution.
1575
+
1576
+ 0:49:08.648 --> 0:49:15.463
1577
+ Then you are taking some mass away from here
1578
+ and distributing this mass to all the other
1579
+
1580
+ 0:49:15.463 --> 0:49:17.453
1581
+ words that you have seen.
1582
+
1583
+ 0:49:18.638 --> 0:49:26.797
1584
+ And thereby you are now making sure that it's
1585
+ yeah, that it's now possible to model that.
1586
+
1587
+ 0:49:28.828 --> 0:49:36.163
1588
+ The other idea we're coming into more detail
1589
+ on how we can do this type of smoothing, but
1590
+
1591
+ 0:49:36.163 --> 0:49:41.164
1592
+ one other idea you can do is to do some type
1593
+ of clustering.
1594
+
1595
+ 0:49:41.501 --> 0:49:48.486
1596
+ And that means if we are can't model go Kit's,
1597
+ for example because we haven't seen that.
1598
+
1599
+ 0:49:49.349 --> 0:49:56.128
1600
+ Then we're just looking at the full thing
1601
+ and we're just going to live directly how probable.
1602
+
1603
+ 0:49:56.156 --> 0:49:58.162
1604
+ Go two ways or so.
1605
+
1606
+ 0:49:58.162 --> 0:50:09.040
1607
+ Then we are modeling just only the word interpolation
1608
+ where you're interpolating all the probabilities
1609
+
1610
+ 0:50:09.040 --> 0:50:10.836
1611
+ and thereby can combine them.
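A LaTeX sketch of this interpolation idea for the trigram case, assuming interpolation weights lambda that sum to one (how the weights are chosen is discussed later):

```latex
P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1}) =
  \lambda_3\, P(w_i \mid w_{i-2}, w_{i-1}) +
  \lambda_2\, P(w_i \mid w_{i-1}) +
  \lambda_1\, P(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```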
1612
+
1613
+ 0:50:11.111 --> 0:50:16.355
1614
+ These are the two things which are helpful
1615
+ in order to better calculate all these types.
1616
+
1617
+ 0:50:19.499 --> 0:50:28.404
1618
+ Let's start with what counts news so the idea
1619
+ is okay.
1620
+
1621
+ 0:50:28.404 --> 0:50:38.119
1622
+ We have not seen an event and then the probability
1623
+ is zero.
1624
+
1625
+ 0:50:38.618 --> 0:50:50.902
1626
+ It's not that high, but you should always
1627
+ be aware that there might be new things happening
1628
+
1629
+ 0:50:50.902 --> 0:50:55.308
1630
+ and somehow be able to estimate.
1631
+
1632
+ 0:50:56.276 --> 0:50:59.914
1633
+ So the idea is okay.
1634
+
1635
+ 0:50:59.914 --> 0:51:09.442
1636
+ We can also assign a positive probability
1637
+ to a higher.
1638
+
1639
+ 0:51:10.590 --> 0:51:23.233
1640
+ We are changing so currently we worked on
1641
+ empirical counts, so how often we have seen
1642
+
1643
+ 0:51:23.233 --> 0:51:25.292
1644
+ the accounts.
1645
+
1646
+ 0:51:25.745 --> 0:51:37.174
1647
+ And now we are going on to expected counts:
1648
+ how often this would occur in unseen data.
1649
+
1650
+ 0:51:37.517 --> 0:51:39.282
1651
+ So we are directly trying to model that.
1652
+
1653
+ 0:51:39.859 --> 0:51:45.836
1654
+ Of course, the empirical accounts are a good
1655
+ starting point, so if you've seen the word
1656
+
1657
+ 0:51:45.836 --> 0:51:51.880
1658
+ very often in your training data, it's a good
1659
+ estimation of how often you would see it in
1660
+
1661
+ 0:51:51.880 --> 0:51:52.685
1662
+ the future.
1663
+
1664
+ 0:51:52.685 --> 0:51:58.125
1665
+ However, it might make sense to think about
1666
+ it only because you haven't seen it.
1667
+
1668
+ 0:51:58.578 --> 0:52:10.742
1669
+ So does anybody have a very simple idea how
1670
+ you start with smoothing it?
1671
+
1672
+ 0:52:10.742 --> 0:52:15.241
1673
+ What count would you give?
1674
+
1675
+ 0:52:21.281 --> 0:52:32.279
1676
+ Now you have the probability to calculation
1677
+ how often have you seen the bigram with zero
1678
+
1679
+ 0:52:32.279 --> 0:52:33.135
1680
+ count.
1681
+
1682
+ 0:52:33.193 --> 0:52:39.209
1683
+ So what count would you give in order to still
1684
+ do this calculation?
1685
+
1686
+ 0:52:39.209 --> 0:52:41.509
1687
+ We have to smooth, so we.
1688
+
1689
+ 0:52:44.884 --> 0:52:52.151
1690
+ We could clump together all the rare words,
1691
+ for example everywhere we have only seen ones.
1692
+
1693
+ 0:52:52.652 --> 0:52:56.904
1694
+ And then just we can do the massive moment
1695
+ of those and don't.
1696
+
1697
+ 0:52:56.936 --> 0:53:00.085
1698
+ So remove the real ones.
1699
+
1700
+ 0:53:00.085 --> 0:53:06.130
1701
+ Yes, and then every unseen word is one of
1702
+ them.
1703
+
1704
+ 0:53:06.130 --> 0:53:13.939
1705
+ Yeah, but it's not only about unseen words,
1706
+ it's even unseen.
1707
+
1708
+ 0:53:14.874 --> 0:53:20.180
1709
+ You can even start easier and that's what
1710
+ people do at the first thing.
1711
+
1712
+ 0:53:20.180 --> 0:53:22.243
1713
+ That's add-one smoothing.
1714
+
1715
+ 0:53:22.243 --> 0:53:28.580
1716
+ You'll see it's not working well, but a variation
1717
+ works fine and we're just as here.
1718
+
1719
+ 0:53:28.580 --> 0:53:30.644
1720
+ We've seen everything once.
1721
+
1722
+ 0:53:31.771 --> 0:53:39.896
1723
+ That's similar to this because you're clustering
1724
+ the one and the zero together and you just
1725
+
1726
+ 0:53:39.896 --> 0:53:45.814
1727
+ say you've seen everything once or have seen
1728
+ them twice and so on.
1729
+
1730
+ 0:53:46.386 --> 0:53:53.249
1731
+ And if you've done that, now there's no zero probability
1732
+ because each event has happened at least once.
1733
+
1734
+ 0:53:55.795 --> 0:54:02.395
1735
+ If you otherwise have seen the bigram five
1736
+ times, you would not now do five times but
1737
+
1738
+ 0:54:02.395 --> 0:54:03.239
1739
+ six times.
1740
+
1741
+ 0:54:03.363 --> 0:54:09.117
1742
+ So the nice thing is to have seen everything.
1743
+
1744
+ 0:54:09.117 --> 0:54:19.124
1745
+ Once the probability of the engrap is now
1746
+ out, you have seen it divided by the.
1747
+
1748
+ 0:54:20.780 --> 0:54:23.763
1749
+ However, there's one big big problem with
1750
+ it?
1751
+
1752
+ 0:54:24.064 --> 0:54:38.509
1753
+ Just imagine that you have a vocabulary of
1754
+ words, and you have a corpus of thirty million
1755
+
1756
+ 0:54:38.509 --> 0:54:39.954
1757
+ bigrams.
1758
+
1759
+ 0:54:39.954 --> 0:54:42.843
1760
+ So if you have a.
1761
+
1762
+ 0:54:43.543 --> 0:54:46.580
1763
+ Simple Things So You've Seen Them Thirty Million
1764
+ Times.
1765
+
1766
+ 0:54:47.247 --> 0:54:49.818
1767
+ That is your count, your distributing.
1768
+
1769
+ 0:54:49.818 --> 0:54:55.225
1770
+ According to your gain, the problem is yet
1771
+ how many possible bigrams do you have?
1772
+
1773
+ 0:54:55.225 --> 0:55:00.895
1774
+ You have seven point five billion possible
1775
+ bigrams, and each of them you are counting
1776
+
1777
+ 0:55:00.895 --> 0:55:04.785
1778
+ now as give up your ability, like you give
1779
+ account of one.
1780
+
1781
+ 0:55:04.785 --> 0:55:07.092
1782
+ So each of them is saying a curse.
1783
+
1784
+ 0:55:07.627 --> 0:55:16.697
1785
+ Then this number of possible bigrams is many
1786
+ times larger than the number you really see.
1787
+
1788
+ 0:55:17.537 --> 0:55:21.151
1789
+ You're mainly doing equal distribution.
1790
+
1791
+ 0:55:21.151 --> 0:55:26.753
1792
+ Everything gets the same because this is much
1793
+ more important.
1794
+
1795
+ 0:55:26.753 --> 0:55:31.541
1796
+ Most of your probability mass is used for
1797
+ smoothing.
1798
+
1799
+ 0:55:32.412 --> 0:55:37.493
1800
+ Because most of the probability mass has
1801
+ to be distributed that you at least give every
1802
+
1803
+ 0:55:37.493 --> 0:55:42.687
1804
+ bigram at least a count of one, and the other
1805
+ counts are only the thirty million, so seven
1806
+
1807
+ 0:55:42.687 --> 0:55:48.219
1808
+ point five billion counts go to like a distribute
1809
+ around all the engrons, and only thirty million
1810
+
1811
+ 0:55:48.219 --> 0:55:50.026
1812
+ are according to your frequent.
1813
+
1814
+ 0:55:50.210 --> 0:56:02.406
1815
+ So you put a lot too much mass on your smoothing
1816
+ and you're doing some kind of extreme smoothing.
1817
+
1818
+ 0:56:02.742 --> 0:56:08.986
1819
+ So that of course is a bit bad then and will
1820
+ give you not the best performance.
1821
+
1822
+ 0:56:10.130 --> 0:56:16.160
1823
+ However, there's a nice thing and that means
1824
+ to do probability calculations.
1825
+
1826
+ 0:56:16.160 --> 0:56:21.800
1827
+ We are doing it based on counts, but to do
1828
+ this division we don't need.
1829
+
1830
+ 0:56:22.302 --> 0:56:32.112
1831
+ So we can also do that with floating point
1832
+ values and there is still a valid type of calculation.
1833
+
1834
+ 0:56:32.392 --> 0:56:39.380
1835
+ So we can have less probability mass to unseen
1836
+ events.
1837
+
1838
+ 0:56:39.380 --> 0:56:45.352
1839
+ We don't have to give one because if we count.
1840
+
1841
+ 0:56:45.785 --> 0:56:50.976
1842
+ But to do our calculation we can also give
1843
+ zero point zero to something like that, so
1844
+
1845
+ 0:56:50.976 --> 0:56:56.167
1846
+ very small value, and thereby we have less
1847
+ value on the smooth thing, and we are more
1848
+
1849
+ 0:56:56.167 --> 0:56:58.038
1850
+ focusing on the actual corpus.
1851
+
1852
+ 0:56:58.758 --> 0:57:03.045
1853
+ And that is what people refer to as Alpha
1854
+ smoothing.
1855
+
1856
+ 0:57:03.223 --> 0:57:12.032
1857
+ You see that we are now adding not one to
1858
+ it but only alpha, and then we are giving less
1859
+
1860
+ 0:57:12.032 --> 0:57:19.258
1861
+ probability to the unseen event and more probability
1862
+ to the really seen.
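A minimal Python sketch of this add-alpha smoothing for bigrams, reusing the kind of count tables from the earlier counting example; the names and the alpha value are illustrative, and alpha would normally be tuned on held-out data:

```python
def p_add_alpha(bigram_counts, unigram_counts, vocab_size, w_prev, w, alpha=0.02):
    # Add alpha pseudo-counts to every possible continuation of w_prev:
    # the numerator gets +alpha, the denominator +alpha for each of the V words.
    return ((bigram_counts.get((w_prev, w), 0) + alpha)
            / (unigram_counts.get(w_prev, 0) + alpha * vocab_size))
```

With alpha set to one this reduces to the add-one smoothing above; smaller values keep more probability mass on the bigrams that were actually observed.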
1863
+
1864
+ 0:57:20.780 --> 0:57:24.713
1865
+ Question: Of course, how do you find this
1866
+ alpha?
1867
+
1868
+ 0:57:24.713 --> 0:57:29.711
1869
+ I'm here to either use some help out data
1870
+ and optimize them.
1871
+
1872
+ 0:57:30.951 --> 0:57:35.153
1873
+ So what what does it now really mean?
1874
+
1875
+ 0:57:35.153 --> 0:57:40.130
1876
+ This gives you a bit of an idea behind that.
1877
+
1878
+ 0:57:40.700 --> 0:57:57.751
1879
+ So here you have the grams which occur one
1880
+ time, for example all grams which occur one.
1881
+
1882
+ 0:57:57.978 --> 0:58:10.890
1883
+ So, for example, that means that if you have
1884
+ engrams which occur one time, then.
1885
+
1886
+ 0:58:11.371 --> 0:58:22.896
1887
+ If you look at all the engrams which occur
1888
+ two times, then they occur.
1889
+
1890
+ 0:58:22.896 --> 0:58:31.013
1891
+ If you look at the engrams that occur zero,
1892
+ then.
1893
+
1894
+ 0:58:32.832 --> 0:58:46.511
1895
+ So if you are now doing the smoothing you
1896
+ can look what is the probability estimating
1897
+
1898
+ 0:58:46.511 --> 0:58:47.466
1899
+ them.
1900
+
1901
+ 0:58:47.847 --> 0:59:00.963
1902
+ You see that for all these n-grams you heavily
1903
+ underestimate how often they occur in the test
1904
+
1905
+ 0:59:00.963 --> 0:59:01.801
1906
+ data.
1907
+
1908
+ 0:59:02.002 --> 0:59:10.067
1909
+ So what you want is very good to estimate
1910
+ this distribution, so for each n-gram estimate
1911
+
1912
+ 0:59:10.067 --> 0:59:12.083
1913
+ quite well how often.
1914
+
1915
+ 0:59:12.632 --> 0:59:16.029
1916
+ You're quite bad at that for all of them.
1917
+
1918
+ 0:59:16.029 --> 0:59:22.500
1919
+ You're apparently underestimating only for
1920
+ the top ones which you haven't seen.
1921
+
1922
+ 0:59:22.500 --> 0:59:24.845
1923
+ You'll heavily overestimate.
1924
+
1925
+ 0:59:25.645 --> 0:59:30.887
1926
+ If you're doing alpha smoothing and optimize
1927
+ that to fit on the zero count because that's
1928
+
1929
+ 0:59:30.887 --> 0:59:36.361
1930
+ not completely fair because this alpha is now
1931
+ optimized on the test counts, you see that you're
1932
+
1933
+ 0:59:36.361 --> 0:59:37.526
1934
+ doing a lot better.
1935
+
1936
+ 0:59:37.526 --> 0:59:42.360
1937
+ It's not perfect, but you're a lot better
1938
+ in estimating how often they will occur.
1939
+
1940
+ 0:59:45.545 --> 0:59:49.316
1941
+ So this is one idea of doing it.
1942
+
1943
+ 0:59:49.316 --> 0:59:57.771
1944
+ Of course there's other ways and this is like
1945
+ a large research direction.
1946
+
1947
+ 0:59:58.318 --> 1:00:03.287
1948
+ So there is this deleted estimation.
1949
+
1950
+ 1:00:03.287 --> 1:00:11.569
1951
+ What you are doing is splitting your training
1952
+ data into parts.
1953
+
1954
+ 1:00:11.972 --> 1:00:19.547
1955
+ Looking at how many n-grams occur exactly
1956
+ r times, which n-grams occur r times in
1957
+
1958
+ 1:00:19.547 --> 1:00:20.868
1959
+ your training.
1960
+
1961
+ 1:00:21.281 --> 1:00:27.716
1962
+ And then you look for these ones.
1963
+
1964
+ 1:00:27.716 --> 1:00:36.611
1965
+ How often do they occur in your training data?
1966
+
1967
+ 1:00:36.611 --> 1:00:37.746
1968
+ It's.
1969
+
1970
+ 1:00:38.118 --> 1:00:45.214
1971
+ And then you say: for this n-gram, the expected
1972
+ count is how often we will see it.
1973
+
1974
+ 1:00:45.214 --> 1:00:56.020
1975
+ It is divided by: Some type of clustering
1976
+ you're putting all the n-grams which occur
1977
+
1978
+ 1:00:56.020 --> 1:01:04.341
1979
+ are at times in your data together and in order
1980
+ to estimate how often.
1981
+
1982
+ 1:01:05.185 --> 1:01:12.489
1983
+ And if you do half your data related to your
1984
+ final estimation by just using those statistics,.
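A LaTeX sketch of this held-out estimation, with assumed notation: N_r is the number of n-grams that occur r times in the first half of the data, and T_r is their total count in the second half:

```latex
% Expected count for an n-gram seen r times in the training half:
r^{*} = \frac{T_r}{N_r}
```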
1985
+
1986
+ 1:01:14.014 --> 1:01:25.210
1987
+ So this is called deleted estimation, and thereby
1988
+ you are now able to estimate better how often
1989
+
1990
+ 1:01:25.210 --> 1:01:25.924
1991
+ does.
1992
+
1993
+ 1:01:28.368 --> 1:01:34.559
1994
+ And again we can do the same look and compare
1995
+ it to the expected counts.
1996
+
1997
+ 1:01:34.559 --> 1:01:37.782
1998
+ Again we have exactly the same table.
1999
+
2000
+ 1:01:38.398 --> 1:01:47.611
2001
+ So then we're having to hear how many engrams
2002
+ that does exist.
2003
+
2004
+ 1:01:47.611 --> 1:01:55.361
2005
+ So, for example, there's like engrams which
2006
+ you can.
2007
+
2008
+ 1:01:55.835 --> 1:02:08.583
2009
+ Then you look into your other half and how
2010
+ often do these N grams occur in your 2nd part
2011
+
2012
+ 1:02:08.583 --> 1:02:11.734
2013
+ of the training data?
2014
+
2015
+ 1:02:12.012 --> 1:02:22.558
2016
+ For example, an unseen N gram I expect to
2017
+ occur, an engram which occurs one time.
2018
+
2019
+ 1:02:22.558 --> 1:02:25.774
2020
+ I expect that it occurs.
2021
+
2022
+ 1:02:27.527 --> 1:02:42.564
2023
+ Yeah, the number of zero counts are if take
2024
+ my one grams and then just calculate how many
2025
+
2026
+ 1:02:42.564 --> 1:02:45.572
2027
+ possible bigrams.
2028
+
2029
+ 1:02:45.525 --> 1:02:50.729
2030
+ Yes, so in this case we are now not assuming
2031
+ about having a more larger cattle because then,
2032
+
2033
+ 1:02:50.729 --> 1:02:52.127
2034
+ of course, it's getting.
2035
+
2036
+ 1:02:52.272 --> 1:02:54.730
2037
+ So you're doing that given the current gram.
2038
+
2039
+ 1:02:54.730 --> 1:03:06.057
2040
+ The cavalry is better to: So yeah, there's
2041
+ another problem in how to deal with them.
2042
+
2043
+ 1:03:06.057 --> 1:03:11.150
2044
+ This is more about how to smooth the n-gram
2045
+ counts to also deal.
2046
+
2047
+ 1:03:14.394 --> 1:03:18.329
2048
+ Certainly as I Think The.
2049
+
2050
+ 1:03:18.198 --> 1:03:25.197
2051
+ Yes, the last idea of doing this is so-called Good
2052
+ Turing smoothing, and the idea here is
2053
+
2054
+ 1:03:25.197 --> 1:03:32.747
2055
+ similar, so there is a typical mathematic approve,
2056
+ but you can show that a very good estimation
2057
+
2058
+ 1:03:32.747 --> 1:03:34.713
2059
+ for the expected counts.
2060
+
2061
+ 1:03:34.654 --> 1:03:42.339
2062
+ Is that you take the number of engrams which
2063
+ occur one time more divided by the number of
2064
+
2065
+ 1:03:42.339 --> 1:03:46.011
2066
+ engram which occur R times and R plus one.
2067
+
2068
+ 1:03:46.666 --> 1:03:49.263
2069
+ So this is then the estimation of.
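The Good-Turing estimate written out in LaTeX, assuming N_r denotes the number of n-grams that occur exactly r times in the training data:

```latex
% Good-Turing adjusted count for an n-gram observed r times:
r^{*} = (r + 1)\,\frac{N_{r+1}}{N_r}
```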
2070
+
2071
+ 1:03:49.549 --> 1:04:05.911
2072
+ So if you are looking now at an engram which
2073
+ occurs times then you are looking at how many
2074
+
2075
+ 1:04:05.911 --> 1:04:08.608
2076
+ engrams occur.
2077
+
2078
+ 1:04:09.009 --> 1:04:18.938
2079
+ It's very simple, so in this one you only
2080
+ have to count all the bigrams, how many different
2081
+
2082
+ 1:04:18.938 --> 1:04:23.471
2083
+ bigrams out there, and that is very good.
2084
+
2085
+ 1:04:23.903 --> 1:04:33.137
2086
+ So if you are saying now about end drums which
2087
+ occur or times,.
2088
+
2089
+ 1:04:33.473 --> 1:04:46.626
2090
+ It might be that there are some occurring
2091
+ times, but no times, and then.
2092
+
2093
+ 1:04:46.866 --> 1:04:54.721
2094
+ So what you normally do is you are doing for
2095
+ small R, and for large R you do some curve
2096
+
2097
+ 1:04:54.721 --> 1:04:55.524
2098
+ fitting.
2099
+
2100
+ 1:04:56.016 --> 1:05:07.377
2101
+ In general this type of smoothing is important
2102
+ for engrams which occur rarely.
2103
+
2104
+ 1:05:07.377 --> 1:05:15.719
2105
+ If an engram occurs so this is more important
2106
+ for events.
2107
+
2108
+ 1:05:17.717 --> 1:05:25.652
2109
+ So here again you see you have the counts
2110
+ and then based on that you get the adjusted
2111
+
2112
+ 1:05:25.652 --> 1:05:26.390
2113
+ counts.
2114
+
2115
+ 1:05:26.390 --> 1:05:34.786
2116
+ This is here and if you compare it's a test
2117
+ count you see that it really works quite well.
2118
+
2119
+ 1:05:35.035 --> 1:05:41.093
2120
+ But for the low numbers it's a very good modeling
2121
+ of how much how good this works.
2122
+
2123
+ 1:05:45.005 --> 1:05:50.018
2124
+ Then, of course, the question is how good
2125
+ does it work in language modeling?
2126
+
2127
+ 1:05:50.018 --> 1:05:51.516
2128
+ We also want tomorrow.
2129
+
2130
+ 1:05:52.372 --> 1:05:54.996
2131
+ We can measure that with perplexity.
2132
+
2133
+ 1:05:54.996 --> 1:05:59.261
2134
+ We learned that before, and then we have add-one smoothing.
2135
+
2136
+ 1:05:59.579 --> 1:06:07.326
2137
+ You saw that too much probability
2138
+ mass is put on the events which have zero probability.
2139
+
2140
+ 1:06:07.667 --> 1:06:11.098
2141
+ Then you have an alpha smoothing.
2142
+
2143
+ 1:06:11.098 --> 1:06:16.042
2144
+ Here's a start because it's not completely
2145
+ fair.
2146
+
2147
+ 1:06:16.042 --> 1:06:20.281
2148
+ The alpha was maximized on the test data.
2149
+
2150
+ 1:06:20.480 --> 1:06:25.904
2151
+ But you see that the deleted estimation
2152
+ and the Good-Turing give you a similar performance.
2153
+
2154
+ 1:06:26.226 --> 1:06:29.141
2155
+ So they seem to really work quite well.
2156
+
2157
+ 1:06:32.232 --> 1:06:41.552
2158
+ So this is all about assigning probability
2159
+ mass to n-grams which we have not seen,
2160
+
2161
+ 1:06:41.552 --> 1:06:50.657
2162
+ in order to also estimate their probability
2163
+ before we're going to the interpolation.
2164
+
2165
+ 1:06:55.635 --> 1:07:00.207
2166
+ Good, so now we have.
2167
+
2168
+ 1:07:00.080 --> 1:07:11.818
2169
+ Done this estimation, and the problem is we
2170
+ have this general.
2171
+
2172
+ 1:07:11.651 --> 1:07:19.470
2173
+ We want to have a longer context because we
2174
+ can then model the language better, because of
2175
+
2176
+ 1:07:19.470 --> 1:07:21.468
2177
+ long-range dependencies.
2178
+
2179
+ 1:07:21.701 --> 1:07:26.745
2180
+ On the other hand, we have limited data, so
2181
+ we want to have short n-grams because we
2182
+
2183
+ 1:07:26.745 --> 1:07:28.426
2184
+ see short n-grams more often.
2185
+
2186
+ 1:07:29.029 --> 1:07:43.664
2187
+ And the smoothing and the discounting
2188
+ we did before always treat all n-grams the same.
2189
+
2190
+ 1:07:44.024 --> 1:07:46.006
2191
+ So we didn't really look at the n-grams themselves.
2192
+
2193
+ 1:07:46.006 --> 1:07:48.174
2194
+ They were all classed by how often they
2195
+ occur.
2196
+
2197
+ 1:07:49.169 --> 1:08:00.006
2198
+ However, sometimes this might not be very
2199
+ helpful, so for example look at the n-grams
2200
+
2201
+ 1:08:00.006 --> 1:08:06.253
2202
+ Scottish beer drinkers and Scottish beer eaters.
2203
+
2204
+ 1:08:06.686 --> 1:08:12.037
2205
+ Because we have not seen the trigram, so you
2206
+ will estimate the trigram probability by the
2207
+
2208
+ 1:08:12.037 --> 1:08:14.593
2209
+ probability you assign to the zero counts.
2210
+
2211
+ 1:08:15.455 --> 1:08:26.700
2212
+ However, if you look at the bigram probability,
2213
+ that you might have seen, it might be helpful.
2214
+
2215
+ 1:08:26.866 --> 1:08:34.538
2216
+ So beer drinker is more probable to see than
2217
+ Scottish beer drinker, and beer drinker should
2218
+
2219
+ 1:08:34.538 --> 1:08:36.039
2220
+ be more probable.
2221
+
2222
+ 1:08:36.896 --> 1:08:39.919
2223
+ So this type of information is somehow ignored.
2224
+
2225
+ 1:08:39.919 --> 1:08:45.271
2226
+ So if we have the Trigram language model,
2227
+ we are only looking at trigrams divided by
2228
+
2229
+ 1:08:45.271 --> 1:08:46.089
2230
+ the bigrams.
2231
+
2232
+ 1:08:46.089 --> 1:08:49.678
2233
+ But if we have not seen the trigram, we are
2234
+ not looking.
2235
+
2236
+ 1:08:49.678 --> 1:08:53.456
2237
+ Oh, maybe we will have seen the bigram and
2238
+ we can back off.
2239
+
2240
+ 1:08:54.114 --> 1:09:01.978
2241
+ And that is what people do in interpolation
2242
+ and back off.
2243
+
2244
+ 1:09:01.978 --> 1:09:09.164
2245
+ The idea is if we don't have seen the large
2246
+ n-grams.
2247
+
2248
+ 1:09:09.429 --> 1:09:16.169
2249
+ So we then go to a shorter sequence
2250
+ and try to see if we can estimate this probability.
2251
+
2252
+ 1:09:16.776 --> 1:09:20.730
2253
+ And this is the idea of interpolation.
2254
+
2255
+ 1:09:20.730 --> 1:09:25.291
2256
+ There's like two different ways of doing it.
2257
+
2258
+ 1:09:25.291 --> 1:09:26.507
2259
+ One is the.
2260
+
2261
+ 1:09:26.646 --> 1:09:29.465
2262
+ The easiest thing is like okay.
2263
+
2264
+ 1:09:29.465 --> 1:09:32.812
2265
+ If we have bigrams, we have trigrams.
2266
+
2267
+ 1:09:32.812 --> 1:09:35.103
2268
+ If we have programs, why?
2269
+
2270
+ 1:09:35.355 --> 1:09:46.544
2271
+ Mean, of course, we have the larger ones,
2272
+ the larger context, but the short n-grams are
2273
+
2274
+ 1:09:46.544 --> 1:09:49.596
2275
+ maybe better estimated.
2276
+
2277
+ 1:09:50.090 --> 1:10:00.487
2278
+ Time just by taking the probability of just
2279
+ the word class of probability of and.
2280
+
2281
+ 1:10:01.261 --> 1:10:07.052
2282
+ And of course we need to know because otherwise
2283
+ we don't have a probability distribution, but
2284
+
2285
+ 1:10:07.052 --> 1:10:09.332
2286
+ we can somehow optimize the weights.
2287
+
2288
+ 1:10:09.332 --> 1:10:15.930
2289
+ For example, on a held-out data set: And
2290
+ thereby we have now a probability distribution
2291
+
2292
+ 1:10:15.930 --> 1:10:17.777
2293
+ which takes both into account.
2294
+
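+ A minimal sketch of the simple interpolation just described; the component probabilities and the weights below are placeholder values, and in practice the weights would be optimized on held-out data.
+
+ ```python
+ def interpolate(p_trigram, p_bigram, p_unigram, weights=(0.6, 0.3, 0.1)):
+     """Linear interpolation of n-gram probabilities; the weights must sum to one
+     so that the result is still a probability distribution."""
+     w3, w2, w1 = weights
+     assert abs(w3 + w2 + w1 - 1.0) < 1e-9
+     return w3 * p_trigram + w2 * p_bigram + w1 * p_unigram
+
+ # an unseen trigram (probability 0) still gets mass from the bigram and unigram
+ print(interpolate(0.0, 0.02, 0.001))
+ ```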
2295
+ 1:10:18.118 --> 1:10:23.705
2296
+ Think about the Scottish beer drinker example.
2297
+
2298
+ 1:10:23.705 --> 1:10:33.763
2299
+ The trigram probability will be the same for
2300
+ both of those because they both occur zero
2301
+
2302
+ 1:10:33.763 --> 1:10:34.546
2303
+ times.
2304
+
2305
+ 1:10:36.116 --> 1:10:45.332
2306
+ But the bigram probability will hopefully
2307
+ be different because we might have seen beer
2308
+
2309
+ 1:10:45.332 --> 1:10:47.611
2310
+ eaters and therefore.
2311
+
2312
+ 1:10:48.668 --> 1:10:57.296
2313
+ The idea that sometimes it's better to have
2314
+ different models and combine them instead.
2315
+
2316
+ 1:10:58.678 --> 1:10:59.976
2317
+ Another idea, instead
2318
+
2319
+ 1:11:00.000 --> 1:11:08.506
2320
+ of this overall interpolation, is that you can also
2321
+ do this type of recursive interpolation.
2322
+
2323
+ 1:11:08.969 --> 1:11:23.804
2324
+ The probability of the word given its history
2325
+ is in the current language model probability.
2326
+
2327
+ 1:11:24.664 --> 1:11:30.686
2328
+ Plus one minus lambda, so that the two weights sum
2329
+ to one, and here it's an interpolated probability
2330
+
2331
+ 1:11:30.686 --> 1:11:36.832
2332
+ from the n minus one gram, and then of course
2333
+ it goes recursively on until you are at the unigram
2334
+
2335
+ 1:11:36.832 --> 1:11:37.639
2336
+ probability.
2337
+
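+ As a rough sketch of the recursive interpolation just described (the relative-frequency function `rel_freq` and the single fixed interpolation weight are assumptions for illustration; real implementations use history-dependent weights):
+
+ ```python
+ def p_interpolated(word, history, rel_freq, lam=0.7):
+     """P_int(w | h) = lam * P(w | h) + (1 - lam) * P_int(w | shorter h),
+     recursing until only the unigram is left."""
+     if not history:                          # recursion ends at the unigram
+         return rel_freq(word, ())
+     return (lam * rel_freq(word, history)
+             + (1 - lam) * p_interpolated(word, history[1:], rel_freq, lam))
+ ```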
2338
+ 1:11:38.558 --> 1:11:49.513
2339
+ What you can also do, you can not only do
2340
+ the same weights for all our words, but you
2341
+
2342
+ 1:11:49.513 --> 1:12:06.020
2343
+ can, for example: For example, for n-grams
2344
+ which you have seen very often, you put more
2345
+
2346
+ 1:12:06.020 --> 1:12:10.580
2347
+ weight on the trigrams.
2348
+
2349
+ 1:12:13.673 --> 1:12:29.892
2350
+ The other thing you can do is the back off
2351
+ and the difference in back off is we are not
2352
+
2353
+ 1:12:29.892 --> 1:12:32.656
2354
+ interpolating.
2355
+
2356
+ 1:12:32.892 --> 1:12:41.954
2357
+ If we have seen the trigram probability so
2358
+ if the trigram count is bigger than zero then we take
2359
+
2360
+ 1:12:41.954 --> 1:12:48.412
2361
+ the trigram probability and if we have seen
2362
+ this one then we.
2363
+
2364
+ 1:12:48.868 --> 1:12:54.092
2365
+ So that is the difference.
2366
+
2367
+ 1:12:54.092 --> 1:13:06.279
2368
+ In interpolation we are always taking all the n-gram probabilities,
2369
+ and in back-off we don't.
2370
+
2371
+ 1:13:07.147 --> 1:13:09.941
2372
+ Why do we need to do this just a minute?
2373
+
2374
+ 1:13:09.941 --> 1:13:13.621
2375
+ So why have we here just take the probability
2376
+ of the.
2377
+
2378
+ 1:13:15.595 --> 1:13:18.711
2379
+ Yes, because otherwise the probabilities don't
2380
+ sum up to one.
2381
+
2382
+ 1:13:19.059 --> 1:13:28.213
2383
+ In order to make them still sum to one, we
2384
+ have to take away a bit of a probability mass
2385
+
2386
+ 1:13:28.213 --> 1:13:29.773
2387
+ from the seen events.
2388
+
2389
+ 1:13:29.709 --> 1:13:38.919
2390
+ The difference is we are no longer distributing
2391
+ it equally as before to the unseen, but we
2392
+
2393
+ 1:13:38.919 --> 1:13:40.741
2394
+ are distributing.
2395
+
2396
+ 1:13:44.864 --> 1:13:56.220
2397
+ For example, this can be done with Good-Turing,
2398
+ so the expected counts in Good-Turing we saw.
2399
+
2400
+ 1:13:57.697 --> 1:13:59.804
2401
+ The adjusted counts.
2402
+
2403
+ 1:13:59.804 --> 1:14:04.719
2404
+ They are always lower than the ones we see
2405
+ here.
2406
+
2407
+ 1:14:04.719 --> 1:14:14.972
2408
+ These counts are always lower, so you can
2409
+ now take this difference and distribute this
2410
+
2411
+ 1:14:14.972 --> 1:14:18.852
2412
+ weight to the lower-order n-grams.
2413
+
2414
+ 1:14:23.323 --> 1:14:29.896
2415
+ Is how we can distribute things.
2416
+
2417
+ 1:14:29.896 --> 1:14:43.442
2418
+ Then there is one last thing people are doing,
2419
+ especially how much.
2420
+
2421
+ 1:14:43.563 --> 1:14:55.464
2422
+ And there's one thing which is called
2423
+ Witten-Bell smoothing.
2424
+
2425
+ 1:14:55.315 --> 1:15:01.335
2426
+ In the background, like in the background,
2427
+ it might make sense to look at the words and
2428
+
2429
+ 1:15:01.335 --> 1:15:04.893
2430
+ see how probable it is that you need to back off.
2431
+
2432
+ 1:15:05.425 --> 1:15:11.232
2433
+ So look at these two words, spite and constant.
2434
+
2435
+ 1:15:11.232 --> 1:15:15.934
2436
+ Those occur exactly times in the.
2437
+
2438
+ 1:15:16.316 --> 1:15:27.804
2439
+ They would be treated exactly the same because
2440
+ both occur the same number of times, and it would be
2441
+
2442
+ 1:15:27.804 --> 1:15:29.053
2443
+ the same.
2444
+
2445
+ 1:15:29.809 --> 1:15:48.401
2446
+ However, they shouldn't really be modeled the same.
2447
+
2448
+ 1:15:48.568 --> 1:15:57.447
2449
+ If you compare that for constant there are
2450
+ four hundred different continuations of this
2451
+
2452
+ 1:15:57.447 --> 1:16:01.282
2453
+ work, so there is nearly always this.
2454
+
2455
+ 1:16:02.902 --> 1:16:11.203
2456
+ So if you're now seeing a new bigram or a
2457
+ bigram with either constant or spite starting
2458
+
2459
+ 1:16:11.203 --> 1:16:13.467
2460
+ and then another word,.
2461
+
2462
+ 1:16:15.215 --> 1:16:25.606
2463
+ For constant, it's very frequent that you see
2464
+ new n-grams because there are many different
2465
+
2466
+ 1:16:25.606 --> 1:16:27.222
2467
+ combinations.
2468
+
2469
+ 1:16:27.587 --> 1:16:35.421
2470
+ Therefore, it might make sense not only to look
2471
+ at the counts of the n-grams, but also at how
2472
+
2473
+ 1:16:35.421 --> 1:16:37.449
2474
+ many extensions does.
2475
+
2476
+ 1:16:38.218 --> 1:16:43.222
2477
+ And this is done by Witten-Bell smoothing.
2478
+
2479
+ 1:16:43.222 --> 1:16:51.032
2480
+ The idea is we count how many possible extensions
2481
+ in this case.
2482
+
2483
+ 1:16:51.371 --> 1:17:01.966
2484
+ So we had for spive, we had possible extensions,
2485
+ and for constant we had a lot more.
2486
+
2487
+ 1:17:02.382 --> 1:17:09.394
2488
+ And then how much we put into our back-off model,
2489
+ how much weight we put into the back-off, is
2490
+
2491
+ 1:17:09.394 --> 1:17:13.170
2492
+ depending on this number of possible extensions.
2493
+
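+ A small sketch of how such a back-off weight can be computed in the Witten-Bell style (the toy counts below are made up): the more distinct continuations a history has, the more probability mass goes to the lower-order model.
+
+ ```python
+ from collections import Counter
+
+ # toy counts of (history_word, next_word) bigrams, for illustration only
+ bigrams = Counter({("spite", "of"): 9,
+                    ("constant", "rate"): 3, ("constant", "growth"): 3,
+                    ("constant", "change"): 2, ("constant", "noise"): 1})
+
+ def backoff_mass(history_word):
+     """Witten-Bell style weight for the lower-order model:
+     T / (N + T), with T = distinct continuations, N = total continuation count."""
+     t = len({w for (h, w) in bigrams if h == history_word})
+     n = sum(c for (h, _), c in bigrams.items() if h == history_word)
+     return t / (n + t)
+
+ print(backoff_mass("spite"))     # few continuation types -> small back-off mass
+ print(backoff_mass("constant"))  # many continuation types -> larger back-off mass
+ ```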
2494
+ 1:17:14.374 --> 1:17:15.557
2495
+ Style.
2496
+
2497
+ 1:17:15.557 --> 1:17:29.583
2498
+ We have it here, so this is the weight you
2499
+ put on your lower-order n-gram probability.
2500
+
2501
+ 1:17:29.583 --> 1:17:46.596
2502
+ For example: And if you compare these two
2503
+ numbers, so for spite you compute how many extensions
2504
+
2505
+ 1:17:46.596 --> 1:17:55.333
2506
+ does spite have divided by: While for constant
2507
+ you have zero point three, you know,.
2508
+
2509
+ 1:17:55.815 --> 1:18:05.780
2510
+ So you're putting a lot more weight to like
2511
+ it's not as bad to fall back to the back-off
2512
+
2513
+ 1:18:05.780 --> 1:18:06.581
2514
+ model.
2515
+
2516
+ 1:18:06.581 --> 1:18:10.705
2517
+ So for spite it's really unusual.
2518
+
2519
+ 1:18:10.730 --> 1:18:13.369
2520
+ For constant there's a lot of probability
2521
+ mass in it.
2522
+
2523
+ 1:18:13.369 --> 1:18:15.906
2524
+ The chances that you're doing that is quite
2525
+ high.
2526
+
2527
+ 1:18:20.000 --> 1:18:26.209
2528
+ Similarly, but just from the other way around,
2529
+ it's now looking at this probability distribution.
2530
+
2531
+ 1:18:26.546 --> 1:18:37.103
2532
+ So now when we back off the probability distribution
2533
+ for the lower-order n-grams, we calculated it exactly
2534
+
2535
+ 1:18:37.103 --> 1:18:40.227
2536
+ the same as the probability.
2537
+
2538
+ 1:18:40.320 --> 1:18:48.254
2539
+ However, they are used in a different way,
2540
+ so the lower-order n-grams are only used
2541
+
2542
+ 1:18:48.254 --> 1:18:49.361
2543
+ if we have.
2544
+
2545
+ 1:18:50.410 --> 1:18:54.264
2546
+ So it's like you're modeling something different.
2547
+
2548
+ 1:18:54.264 --> 1:19:01.278
2549
+ You're now modeling how probable this n-gram is
2550
+ if we haven't seen the larger n-gram, and that
2551
+
2552
+ 1:19:01.278 --> 1:19:04.361
2553
+ is captured by the diversity of histories.
2554
+
2555
+ 1:19:04.944 --> 1:19:14.714
2556
+ For example, if you look at York, that's a
2557
+ quite frequent word.
2558
+
2559
+ 1:19:14.714 --> 1:19:18.530
2560
+ It occurs as many times.
2561
+
2562
+ 1:19:19.559 --> 1:19:27.985
2563
+ However, four hundred seventy three times
2564
+ the word before it was New.
2565
+
2566
+ 1:19:29.449 --> 1:19:40.237
2567
+ So if you now think the unigram model is only
2568
+ used, the probability of York as a unigram
2569
+
2570
+ 1:19:40.237 --> 1:19:49.947
2571
+ model should be very, very low because of that: So
2572
+ you should have a lower probability for York
2573
+
2574
+ 1:19:49.947 --> 1:19:56.292
2575
+ than, for example, for foods, although you
2576
+ have seen both of them the same number of times, and
2577
+
2578
+ 1:19:56.292 --> 1:20:02.853
2579
+ this is done by Kneser-Ney smoothing, where
2580
+ you are not counting the words itself, but
2581
+
2582
+ 1:20:02.853 --> 1:20:05.377
2583
+ you count the number of histories.
2584
+
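+ A minimal sketch of the continuation-count idea behind Kneser-Ney (toy bigram data, purely illustrative): the unigram weight of a word is based on how many different words precede it, not on its raw frequency.
+
+ ```python
+ from collections import Counter
+
+ # toy bigram counts; "york" is frequent but almost always preceded by "new"
+ bigrams = Counter({("new", "york"): 473, ("in", "york"): 1,
+                    ("good", "food"): 5, ("cheap", "food"): 4, ("thai", "food"): 3})
+
+ def continuation_prob(word):
+     """P_continuation(w): distinct left contexts of w / total distinct bigram types."""
+     left_contexts = len({h for (h, w) in bigrams if w == word})
+     return left_contexts / len(bigrams)
+
+ print(continuation_prob("york"))  # low: it only ever follows very few words
+ print(continuation_prob("food"))  # higher: it follows many different words
+ ```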
2585
+ 1:20:05.845 --> 1:20:15.233
2586
+ So, the other way around:
2587
+ by how many different words was it preceded?
2588
+
2589
+ 1:20:15.233 --> 1:20:28.232
2590
+ Then instead of the normal way you count the
2591
+ words: So you don't need to know all the formulas
2592
+
2593
+ 1:20:28.232 --> 1:20:28.864
2594
+ here.
2595
+
2596
+ 1:20:28.864 --> 1:20:33.498
2597
+ The more important thing is this intuition.
2598
+
2599
+ 1:20:34.874 --> 1:20:44.646
2600
+ More than it means already that I haven't
2601
+ seen the larger n-gram, and therefore
2602
+
2603
+ 1:20:44.646 --> 1:20:49.704
2604
+ it might be better to model it differently.
2605
+
2606
+ 1:20:49.929 --> 1:20:56.976
2607
+ So if there's a new n-gram with something
2608
+ in New York that's very improbable compared
2609
+
2610
+ 1:20:56.976 --> 1:20:57.297
2611
+ to.
2612
+
2613
+ 1:21:00.180 --> 1:21:06.130
2614
+ And yeah, this modified Kneser-Ney smoothing
2615
+ is what people took into use.
2616
+
2617
+ 1:21:06.130 --> 1:21:08.249
2618
+ That's the fall approach.
2619
+
2620
+ 1:21:08.728 --> 1:21:20.481
2621
+ Has an absolute discounting for small and
2622
+ grams, and then bells smoothing, and for it
2623
+
2624
+ 1:21:20.481 --> 1:21:27.724
2625
+ uses the discounting of histories which we
2626
+ just had.
2627
+
2628
+ 1:21:28.028 --> 1:21:32.207
2629
+ And there's even two versions of it, like
2630
+ the back-off one and the interpolated one.
2631
+
2632
+ 1:21:32.472 --> 1:21:34.264
2633
+ So that may be interesting.
2634
+
2635
+ 1:21:34.264 --> 1:21:40.216
2636
+ These are here even works well for interpolation,
2637
+ although your assumption is even no longer
2638
+
2639
+ 1:21:40.216 --> 1:21:45.592
2640
+ true because you're using the lower n-grams
2641
+ even if you've seen the higher n-grams.
2642
+
2643
+ 1:21:45.592 --> 1:21:49.113
2644
+ But since you're then focusing on the higher
2645
+ engrams,.
2646
+
2647
+ 1:21:49.929 --> 1:21:53.522
2648
+ So if you see that some beats on the perfectities,.
2649
+
2650
+ 1:21:54.754 --> 1:22:00.262
2651
+ So you see normally that interpolated modified
2652
+ Kneser-Ney gives you some of the best
2653
+
2654
+ 1:22:00.262 --> 1:22:00.980
2655
+ performing.
2656
+
2657
+ 1:22:02.022 --> 1:22:08.032
2658
+ You see the larger your end drum than it is
2659
+ with interpolation.
2660
+
2661
+ 1:22:08.032 --> 1:22:15.168
2662
+ You also get significant better so you can
2663
+ not only look at the last words.
2664
+
2665
+ 1:22:18.638 --> 1:22:32.725
2666
+ Good so much for these types of things, and
2667
+ we will finish with some special things about
2668
+
2669
+ 1:22:32.725 --> 1:22:34.290
2670
+ language.
2671
+
2672
+ 1:22:38.678 --> 1:22:44.225
2673
+ One thing we talked about the unknown words,
2674
+ so there are different ways of doing it because
2675
+
2676
+ 1:22:44.225 --> 1:22:49.409
2677
+ in all the estimations we were still assuming
2678
+ mostly that we have a fixed vocabulary.
2679
+
2680
+ 1:22:50.270 --> 1:23:06.372
2681
+ So you can often, for example, create an unknown
2682
+ token and use that while training the statistical language model.
2683
+
2684
+ 1:23:06.766 --> 1:23:16.292
2685
+ It was mainly useful language processing since
2686
+ newer models are coming, but maybe it's surprising.
2687
+
2688
+ 1:23:18.578 --> 1:23:30.573
2689
+ What is also nice is that if you're going
2690
+ to really large n-grams, it's more
2691
+
2692
+ 1:23:30.573 --> 1:23:33.114
2693
+ about efficiency.
2694
+
2695
+ 1:23:33.093 --> 1:23:37.378
2696
+ And then you have to remember lock it in your
2697
+ model.
2698
+
2699
+ 1:23:37.378 --> 1:23:41.422
2700
+ In a lot of situations it's not really important.
2701
+
2702
+ 1:23:41.661 --> 1:23:46.964
2703
+ It's more about ranking so which one is better
2704
+ and if they don't sum up to one that's not
2705
+
2706
+ 1:23:46.964 --> 1:23:47.907
2707
+ that important.
2708
+
2709
+ 1:23:47.907 --> 1:23:53.563
2710
+ Of course then you cannot calculate any perplexity
2711
+ anymore because if this is not a probability
2712
+
2713
+ 1:23:53.563 --> 1:23:58.807
2714
+ mass then the thing we had about the negative
2715
+ example doesn't fit anymore and that's not
2716
+
2717
+ 1:23:58.807 --> 1:23:59.338
2718
+ working.
2719
+
2720
+ 1:23:59.619 --> 1:24:02.202
2721
+ However, anification is also very helpful.
2722
+
2723
+ 1:24:02.582 --> 1:24:13.750
2724
+ And that is why there is this stupid back-off, which was
2725
+ presented to remove all these complicated things
2726
+
2727
+ 1:24:13.750 --> 1:24:14.618
2728
+ which.
2729
+
2730
+ 1:24:15.055 --> 1:24:28.055
2731
+ And it just does this: we directly take the
2732
+ relative counts, and otherwise we're backing off.
2733
+
2734
+ 1:24:28.548 --> 1:24:41.867
2735
+ There is no longer any discounting, so it's
2736
+ very, very simple and however they show you
2737
+
2738
+ 1:24:41.867 --> 1:24:47.935
2739
+ have to calculate a lot less statistics.
2740
+
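+ A rough sketch of the stupid back-off scoring just mentioned (function and variable names are mine; the fixed factor 0.4 is the value commonly quoted for this method): note that the scores are not normalized probabilities.
+
+ ```python
+ def stupid_backoff(word, history, count, alpha=0.4):
+     """Stupid back-off: relative frequency if the n-gram was seen,
+     otherwise alpha times the score of the shorter history (no discounting)."""
+     c_full = count(tuple(history) + (word,))
+     c_hist = count(tuple(history))        # count(()) should return total tokens
+     if c_hist > 0 and c_full > 0:
+         return c_full / c_hist
+     if not history:                       # unigram level and still unseen
+         return 0.0
+     return alpha * stupid_backoff(word, history[1:], count, alpha)
+ ```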
2741
+ 1:24:50.750 --> 1:24:57.525
2742
+ In addition you can have other type of language
2743
+ models.
2744
+
2745
+ 1:24:57.525 --> 1:25:08.412
2746
+ We had word based language models and they
2747
+ normally go up to four or five or six grams.
2748
+
2749
+ 1:25:08.412 --> 1:25:10.831
2750
+ They are too large.
2751
+
2752
+ 1:25:11.531 --> 1:25:20.570
2753
+ So what people have then looked also into
2754
+ is what is referred to as part of speech language
2755
+
2756
+ 1:25:20.570 --> 1:25:21.258
2757
+ model.
2758
+
2759
+ 1:25:21.258 --> 1:25:29.806
2760
+ So instead of looking at the word sequence
2761
+ you're modeling directly the part of speech
2762
+
2763
+ 1:25:29.806 --> 1:25:30.788
2764
+ sequence.
2765
+
2766
+ 1:25:31.171 --> 1:25:34.987
2767
+ Then of course now you're only being modeling
2768
+ syntax.
2769
+
2770
+ 1:25:34.987 --> 1:25:41.134
2771
+ There's no semantic information anymore in
2772
+ the part-of-speech tags, but now you might go
2773
+
2774
+ 1:25:41.134 --> 1:25:47.423
2775
+ to a larger context length so you can do seven,
2776
+ eight or nine grams and then you can capture some
2777
+
2778
+ 1:25:47.423 --> 1:25:50.320
2779
+ of the long range dependencies in order.
2780
+
2781
+ 1:25:52.772 --> 1:25:59.833
2782
+ And there's other things people have done
2783
+ like cache language models, so the idea in cache
2784
+
2785
+ 1:25:59.833 --> 1:26:07.052
2786
+ language models is that words that you have
2787
+ recently seen are
2788
+
2789
+ 1:26:07.052 --> 1:26:11.891
2790
+ more probable to reoccur, so you want to model
2791
+ the dynamics.
2792
+
2793
+ 1:26:12.152 --> 1:26:20.734
2794
+ If you're just talking here, we talked about
2795
+ language models in my presentation.
2796
+
2797
+ 1:26:20.734 --> 1:26:23.489
2798
+ There will be a lot more.
2799
+
2800
+ 1:26:23.883 --> 1:26:37.213
2801
+ Can do that by having a dynamic and a static
2802
+ component, and then you have a dynamic component
2803
+
2804
+ 1:26:37.213 --> 1:26:41.042
2805
+ which looks at the bigram.
2806
+
2807
+ 1:26:41.261 --> 1:26:49.802
2808
+ And thereby, for example, if you once generated
2809
+ a word, its language model probability is increased
2810
+
2811
+ 1:26:49.802 --> 1:26:52.924
2812
+ and you're modeling that problem.
2813
+
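+ A small sketch of this cache idea (all names and the mixing weight are assumptions): the static model is interpolated with a simple cache distribution estimated from the recently produced words.
+
+ ```python
+ from collections import Counter
+
+ def cache_lm_prob(word, recent_words, p_static, lam=0.9):
+     """Interpolate a static LM probability with a unigram cache of recent words."""
+     cache = Counter(recent_words)
+     p_cache = cache[word] / len(recent_words) if recent_words else 0.0
+     return lam * p_static(word) + (1 - lam) * p_cache
+ ```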
2814
+ 1:26:56.816 --> 1:27:03.114
2815
+ So the dynamic component is trained on the
2816
+ text translated so far.
2817
+
2818
+ 1:27:04.564 --> 1:27:12.488
2819
+ To train them what you just have done, there's
2820
+ no human feedback there.
2821
+
2822
+ 1:27:12.712 --> 1:27:25.466
2823
+ The speech model all the time and then it
2824
+ will repeat its errors and that is, of course,.
2825
+
2826
+ 1:27:25.966 --> 1:27:31.506
2827
+ A similar idea is people have looked into
2828
+ trigger language models where, if one word occurs,
2829
+
2830
+ 1:27:31.506 --> 1:27:34.931
2831
+ then you increase the probability of some other
2832
+ words.
2833
+
2834
+ 1:27:34.931 --> 1:27:40.596
2835
+ So if you're talking about money that will
2836
+ increase the probability of bank saving account
2837
+
2838
+ 1:27:40.596 --> 1:27:41.343
2839
+ dollar and.
2840
+
2841
+ 1:27:41.801 --> 1:27:47.352
2842
+ Because then you have to somehow model this
2843
+ dependency, but it's somehow also an idea of
2844
+
2845
+ 1:27:47.352 --> 1:27:52.840
2846
+ modeling long range dependency, because if
2847
+ one word occurs very often in your document,
2848
+
2849
+ 1:27:52.840 --> 1:27:58.203
2850
+ you like somehow like learning which other
2851
+ words to occur because they are more often
2852
+
2853
+ 1:27:58.203 --> 1:27:59.201
2854
+ than by chance.
2855
+
2856
+ 1:28:02.822 --> 1:28:10.822
2857
+ Yes, then the last thing is, of course, especially
2858
+ for languages which are, which are morphologically
2859
+
2860
+ 1:28:10.822 --> 1:28:11.292
2861
+ rich.
2862
+
2863
+ 1:28:11.292 --> 1:28:18.115
2864
+ You can do something similar to BPE so you
2865
+ can now do morphemes or so, and then model
2866
+
2867
+ 1:28:18.115 --> 1:28:22.821
2868
+ the morpheme sequence because the morphemes
2869
+ occur more often.
2870
+
2871
+ 1:28:23.023 --> 1:28:26.877
2872
+ However, the problem is, of course, that your
2873
+ sequence length also gets longer.
2874
+
2875
+ 1:28:27.127 --> 1:28:33.185
2876
+ And so if they have a four gram language model,
2877
+ it's not counting the last three words but
2878
+
2879
+ 1:28:33.185 --> 1:28:35.782
2880
+ only the last three morphemes.
2881
+
2882
+ 1:28:36.196 --> 1:28:39.833
2883
+ So of course then it's a bit challenging and
2884
+ know how to deal with.
2885
+
2886
+ 1:28:40.680 --> 1:28:51.350
2887
+ What about languages like Finnish, with the idea
2888
+ of a suffix at the end of the word?
2889
+
2890
+ 1:28:51.350 --> 1:28:58.807
2891
+ Yeah, but there you can typically do something
2892
+ like that.
2893
+
2894
+ 1:28:59.159 --> 1:29:02.157
2895
+ It is not the one perfect solution.
2896
+
2897
+ 1:29:02.157 --> 1:29:05.989
2898
+ You have to do a bit of testing what is best.
2899
+
2900
+ 1:29:06.246 --> 1:29:13.417
2901
+ One way of dealing with a large vocabulary
2902
+ that you haven't seen is to split these words
2903
+
2904
+ 1:29:13.417 --> 1:29:20.508
2905
+ into parts and subwords that are either more
2906
+ linguistically motivated, like morphemes, or more
2907
+
2908
+ 1:29:20.508 --> 1:29:25.826
2909
+ statistically motivated like we have in the
2910
+ byte-pair encoding.
2911
+
2912
+ 1:29:28.188 --> 1:29:33.216
2913
+ The representation of your text is different.
2914
+
2915
+ 1:29:33.216 --> 1:29:41.197
2916
+ How you are later doing all the counting and
2917
+ the statistics is the same.
2918
+
2919
+ 1:29:41.197 --> 1:29:44.914
2920
+ What you assume is your sequence.
2921
+
2922
+ 1:29:45.805 --> 1:29:49.998
2923
+ That's the same thing for the other things
2924
+ we had here.
2925
+
2926
+ 1:29:49.998 --> 1:29:55.390
2927
+ Here you don't have words, but everything
2928
+ you're doing is done exactly.
2929
+
2930
+ 1:29:57.857 --> 1:29:59.457
2931
+ Some practical issues.
2932
+
2933
+ 1:29:59.457 --> 1:30:05.646
2934
+ Typically you're doing things in log space
2935
+ and you're adding, because multiplying very
2936
+
2937
+ 1:30:05.646 --> 1:30:09.819
2938
+ small values gives you sometimes problems with
2939
+ calculation.
2940
+
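+ This is the standard trick; a minimal illustration of scoring a sentence by summing log probabilities instead of multiplying raw ones (the probability values are placeholders):
+
+ ```python
+ import math
+
+ word_probs = [0.01, 0.2, 0.05, 0.003]             # per-word model probabilities
+
+ log_score = sum(math.log(p) for p in word_probs)  # stays in a safe numeric range
+ print(log_score, math.exp(log_score))             # exp only for comparison
+ ```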
2941
+ 1:30:10.230 --> 1:30:16.687
2942
+ The good thing is you don't have to care about
2943
+ this mostly, so there are very good toolkits
2944
+
2945
+ 1:30:16.687 --> 1:30:23.448
2946
+ like SRILM or KenLM, to which you can
2947
+ just give your data and they will train the
2948
+
2949
+ 1:30:23.448 --> 1:30:30.286
2950
+ language model and do all the complicated maths
2951
+ behind that and you are able to run them.
2952
+
2953
+ 1:30:31.911 --> 1:30:39.894
2954
+ So what you should keep from today is what
2955
+ is a language model and how we can do maximum likelihood
2956
+
2957
+ 1:30:39.894 --> 1:30:44.199
2958
+ training on that and different language models.
2959
+
2960
+ 1:30:44.199 --> 1:30:49.939
2961
+ Similar ideas we use for a lot of different
2962
+ statistical models.
2963
+
2964
+ 1:30:50.350 --> 1:30:52.267
2965
+ Where You Always Have the Problem.
2966
+
2967
+ 1:30:53.233 --> 1:31:01.608
2968
+ Different way of looking at it and doing it
2969
+ will do it on Thursday when we will go to language.
2970
+
demo_data/lectures/Lecture-06-09.05.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59fe56576cf62256b2c62b8fdcf6e502ce1931907278fc420d397cd360774f72
3
+ size 129548573
demo_data/lectures/Lecture-07-11.05.2023/English.vtt ADDED
@@ -0,0 +1,2593 @@
 
 
 
 
1
+ WEBVTT
2
+
3
+ 0:00:01.301 --> 0:00:05.707
4
+ Okay So Welcome to Today's Lecture.
5
+
6
+ 0:00:06.066 --> 0:00:12.592
7
+ I'm sorry for the inconvenience.
8
+
9
+ 0:00:12.592 --> 0:00:19.910
10
+ Sometimes there are project meetings.
11
+
12
+ 0:00:19.910 --> 0:00:25.843
13
+ There will be one other time.
14
+
15
+ 0:00:26.806 --> 0:00:40.863
16
+ So what we want to talk today about is want
17
+ to start with neural approaches to machine
18
+
19
+ 0:00:40.863 --> 0:00:42.964
20
+ translation.
21
+
22
+ 0:00:43.123 --> 0:00:51.285
23
+ I guess you have heard about other types of
24
+ neural models for other types of neural language
25
+
26
+ 0:00:51.285 --> 0:00:52.339
27
+ processing.
28
+
29
+ 0:00:52.339 --> 0:00:59.887
30
+ This was some of the first steps in introducing
31
+ neural networks to machine translation.
32
+
33
+ 0:01:00.600 --> 0:01:06.203
34
+ They are similar to what you know they see
35
+ in as large language models.
36
+
37
+ 0:01:06.666 --> 0:01:11.764
38
+ And today look into what are these neuro-language
39
+ models?
40
+
41
+ 0:01:11.764 --> 0:01:13.874
42
+ What is the difference?
43
+
44
+ 0:01:13.874 --> 0:01:15.983
45
+ What is the motivation?
46
+
47
+ 0:01:16.316 --> 0:01:21.445
48
+ And first will use them in statistics and
49
+ machine translation.
50
+
51
+ 0:01:21.445 --> 0:01:28.935
52
+ So if you remember how fully like two or three
53
+ weeks ago we had this log-linear model where you
54
+
55
+ 0:01:28.935 --> 0:01:31.052
56
+ can integrate easily any.
57
+
58
+ 0:01:31.351 --> 0:01:40.967
59
+ We just have another model which evaluates
60
+ how good a system is or how good a fluent language
61
+
62
+ 0:01:40.967 --> 0:01:41.376
63
+ is.
64
+
65
+ 0:01:41.376 --> 0:01:53.749
66
+ The main advantage compared to the statistical
67
+ models we saw on Tuesday is: Next week we will
68
+
69
+ 0:01:53.749 --> 0:02:06.496
70
+ then go for a neural machine translation where
71
+ we replace the whole model.
72
+
73
+ 0:02:11.211 --> 0:02:21.078
74
+ Just as a remember from Tuesday, we've seen
75
+ the main challenge in language modeling was that
76
+
77
+ 0:02:21.078 --> 0:02:25.134
78
+ most of the n-grams we haven't seen.
79
+
80
+ 0:02:26.946 --> 0:02:33.967
81
+ So this was therefore difficult to estimate
82
+ any probability because you've seen that normally
83
+
84
+ 0:02:33.967 --> 0:02:39.494
85
+ if you have not seen the n-gram you will assign
86
+ the probability of zero.
87
+
88
+ 0:02:39.980 --> 0:02:49.420
89
+ However, this is not really very good because
90
+ we don't want to give zero probabilities to
91
+
92
+ 0:02:49.420 --> 0:02:54.979
93
+ sentences, which still might be a very good
94
+ English.
95
+
96
+ 0:02:55.415 --> 0:03:02.167
97
+ And then we learned a lot of techniques and
98
+ that is the main challenging statistical machine
99
+
100
+ 0:03:02.167 --> 0:03:04.490
101
+ translate statistical language.
102
+
103
+ 0:03:04.490 --> 0:03:10.661
104
+ What's how we can give a good estimate of
105
+ probability to events that we haven't seen
106
+
107
+ 0:03:10.661 --> 0:03:12.258
108
+ smoothing techniques?
109
+
110
+ 0:03:12.258 --> 0:03:15.307
111
+ We've seen this interpolation and back-off.
112
+
113
+ 0:03:15.435 --> 0:03:21.637
114
+ And they invent or develop very specific techniques.
115
+
116
+ 0:03:21.637 --> 0:03:26.903
117
+ To deal with that, however, it might not be.
118
+
119
+ 0:03:28.568 --> 0:03:43.190
120
+ And therefore maybe we can do things different,
121
+ so if we have not seen an gram before in statistical
122
+
123
+ 0:03:43.190 --> 0:03:44.348
124
+ models.
125
+
126
+ 0:03:45.225 --> 0:03:51.361
127
+ Before and we can only get information from
128
+ exactly the same words.
129
+
130
+ 0:03:51.411 --> 0:04:06.782
131
+ We don't have some on like approximate matching
132
+ like that, maybe in a sentence that cures similarly.
133
+
134
+ 0:04:06.782 --> 0:04:10.282
135
+ So if you have seen a.
136
+
137
+ 0:04:11.191 --> 0:04:17.748
138
+ And so you would like to have more something
139
+ like that where endgrams are represented, more
140
+
141
+ 0:04:17.748 --> 0:04:21.953
142
+ in a general space, and we can generalize similar
143
+ numbers.
144
+
145
+ 0:04:22.262 --> 0:04:29.874
146
+ So if you learn something about walk then
147
+ maybe we can use this knowledge and also apply.
148
+
149
+ 0:04:30.290 --> 0:04:42.596
150
+ The same as we have done before, but we can
151
+ really better model how similar they are and
152
+
153
+ 0:04:42.596 --> 0:04:45.223
154
+ transfer to other.
155
+
156
+ 0:04:47.047 --> 0:04:54.236
157
+ And we maybe want to do that in a more hierarchical
158
+ approach that we know okay.
159
+
160
+ 0:04:54.236 --> 0:05:02.773
161
+ Some words are similar but like go and walk
162
+ is somehow similar and I and P and G and therefore
163
+
164
+ 0:05:02.773 --> 0:05:06.996
165
+ like maybe if we then merge them in an engram.
166
+
167
+ 0:05:07.387 --> 0:05:15.861
168
+ If we learn something about our walk, then
169
+ it should tell us also something about Hugo.
170
+
171
+ 0:05:15.861 --> 0:05:17.113
172
+ He walks or.
173
+
174
+ 0:05:17.197 --> 0:05:27.327
175
+ You see that there is some relations which
176
+ we need to integrate for you.
177
+
178
+ 0:05:27.327 --> 0:05:35.514
179
+ We need to add the s, but maybe walks should
180
+ also be here.
181
+
182
+ 0:05:37.137 --> 0:05:45.149
183
+ And luckily there is one really convincing
184
+ method in doing that: And that is by using
185
+
186
+ 0:05:45.149 --> 0:05:47.231
187
+ a neural mechanism.
188
+
189
+ 0:05:47.387 --> 0:05:58.497
190
+ That's what we will introduce today so we
191
+ can use this type of neural networks to try
192
+
193
+ 0:05:58.497 --> 0:06:04.053
194
+ to learn this similarity and to learn how.
195
+
196
+ 0:06:04.324 --> 0:06:14.355
197
+ And that is one of the main advantages that
198
+ we have by switching from the standard statistical
199
+
200
+ 0:06:14.355 --> 0:06:15.200
201
+ models.
202
+
203
+ 0:06:15.115 --> 0:06:22.830
204
+ To learn similarities between words and generalized,
205
+ and learn what is called hidden representations
206
+
207
+ 0:06:22.830 --> 0:06:29.705
208
+ or representations of words, where we can measure
209
+ similarity in some dimensions of words.
210
+
211
+ 0:06:30.290 --> 0:06:42.384
212
+ So we can measure in which way words are similar.
213
+
214
+ 0:06:42.822 --> 0:06:48.902
215
+ We had it before and we've seen that words
216
+ were just easier.
217
+
218
+ 0:06:48.902 --> 0:06:51.991
219
+ The only thing we did is like.
220
+
221
+ 0:06:52.192 --> 0:07:02.272
222
+ But this energies don't have any meaning,
223
+ so it wasn't that word is more similar to words.
224
+
225
+ 0:07:02.582 --> 0:07:12.112
226
+ So we couldn't learn anything about words
227
+ in the statistical model and that's a big challenge.
228
+
229
+ 0:07:12.192 --> 0:07:23.063
230
+ About words even like in morphology, so going
231
+ goes is somehow more similar because the person
232
+
233
+ 0:07:23.063 --> 0:07:24.219
234
+ singular.
235
+
236
+ 0:07:24.264 --> 0:07:34.924
237
+ The basic models we have to now have no idea
238
+ about that and goes as similar to go than it
239
+
240
+ 0:07:34.924 --> 0:07:37.175
241
+ might be to sleep.
242
+
243
+ 0:07:39.919 --> 0:07:44.073
244
+ So what we want to do today.
245
+
246
+ 0:07:44.073 --> 0:07:53.096
247
+ In order to go to this we will have a short
248
+ introduction into.
249
+
250
+ 0:07:53.954 --> 0:08:05.984
251
+ It very short just to see how we use them
252
+ here, but that's a good thing, so most of you
253
+
254
+ 0:08:05.984 --> 0:08:08.445
255
+ think it will be.
256
+
257
+ 0:08:08.928 --> 0:08:14.078
258
+ And then we will first look into feed-forward
259
+ neural network language models.
260
+
261
+ 0:08:14.454 --> 0:08:23.706
262
+ And there we will still have this approximation.
263
+
264
+ 0:08:23.706 --> 0:08:33.902
265
+ We have before we are looking only at a fixed
266
+ window.
267
+
268
+ 0:08:34.154 --> 0:08:35.030
269
+ The case.
270
+
271
+ 0:08:35.030 --> 0:08:38.270
272
+ However, we have the umbellent here.
273
+
274
+ 0:08:38.270 --> 0:08:43.350
275
+ That's why they're already better in order
276
+ to generalize.
277
+
278
+ 0:08:44.024 --> 0:08:53.169
279
+ And then at the end we'll look at language
280
+ models where we then have the additional advantage.
281
+
282
+ 0:08:53.093 --> 0:09:04.317
283
+ Case that we need to have a fixed history,
284
+ but in theory we can model arbitrary long dependencies.
285
+
286
+ 0:09:04.304 --> 0:09:12.687
287
+ And we talked about on Tuesday where it is
288
+ not clear what type of information it is to.
289
+
290
+ 0:09:16.396 --> 0:09:24.981
291
+ So in general neural networks normally
292
+ learn to prove that they perform some tasks.
293
+
294
+ 0:09:25.325 --> 0:09:33.472
295
+ We have the structure and we are learning
296
+ them from samples so that is similar to what
297
+
298
+ 0:09:33.472 --> 0:09:34.971
299
+ we have before.
300
+
301
+ 0:09:34.971 --> 0:09:42.275
302
+ So now we have the same task here, a language
303
+ model giving input or forwards.
304
+
305
+ 0:09:42.642 --> 0:09:48.959
306
+ And is somewhat originally motivated by human
307
+ brain.
308
+
309
+ 0:09:48.959 --> 0:10:00.639
310
+ However, when you now need to know about artificial
311
+ neural networks, it's hard to get similarity.
312
+
313
+ 0:10:00.540 --> 0:10:02.889
314
+ There seemed to be not that point.
315
+
316
+ 0:10:03.123 --> 0:10:11.014
317
+ So what they are mainly doing is summation and
318
+ multiplication and then one non-linear activation.
319
+
320
+ 0:10:12.692 --> 0:10:16.085
321
+ So the basic units are these type of.
322
+
323
+ 0:10:17.937 --> 0:10:29.891
324
+ Perceptron basic blocks which we have and
325
+ this does processing so we have a fixed number
326
+
327
+ 0:10:29.891 --> 0:10:36.070
328
+ of input features and that will be important.
329
+
330
+ 0:10:36.096 --> 0:10:39.689
331
+ So we have here numbers to xn as input.
332
+
333
+ 0:10:40.060 --> 0:10:53.221
334
+ And this makes partly of course language processing
335
+ difficult.
336
+
337
+ 0:10:54.114 --> 0:10:57.609
338
+ So we have to model this time on and then
339
+ go stand home and model.
340
+
341
+ 0:10:58.198 --> 0:11:02.099
342
+ Then we are having weights, which are the
343
+ parameters and the number of weights exactly
344
+
345
+ 0:11:02.099 --> 0:11:03.668
346
+ the same as the number of weights.
347
+
348
+ 0:11:04.164 --> 0:11:06.322
349
+ Of input features.
350
+
351
+ 0:11:06.322 --> 0:11:15.068
352
+ Sometimes you have a bias in there, and then
353
+ it's not really an input from.
354
+
355
+ 0:11:15.195 --> 0:11:19.205
356
+ And what you then do is multiply.
357
+
358
+ 0:11:19.205 --> 0:11:26.164
359
+ Each input resists weight and then you sum
360
+ it up and then.
361
+
362
+ 0:11:26.606 --> 0:11:34.357
363
+ What is then additionally later important
364
+ is that we have an activation function and
365
+
366
+ 0:11:34.357 --> 0:11:42.473
367
+ it's important that this activation function
368
+ is non linear, so we come to just a linear.
369
+
370
+ 0:11:43.243 --> 0:11:54.088
371
+ And later it will be important that this is
372
+ differentiable because otherwise all the training.
373
+
374
+ 0:11:54.714 --> 0:12:01.907
375
+ This model by itself is not very powerful.
376
+
377
+ 0:12:01.907 --> 0:12:10.437
378
+ It was originally shown that this is not powerful.
379
+
380
+ 0:12:10.710 --> 0:12:19.463
381
+ However, there is a very easy extension, the
382
+ multilayer perceptron, and then things get
383
+
384
+ 0:12:19.463 --> 0:12:20.939
385
+ very powerful.
386
+
387
+ 0:12:21.081 --> 0:12:27.719
388
+ The thing is you just connect a lot of these
389
+ in this layer of structures and we have our
390
+
391
+ 0:12:27.719 --> 0:12:35.029
392
+ input layer where we have the inputs and our
393
+ hidden layer at least one where there is everywhere.
394
+
395
+ 0:12:35.395 --> 0:12:39.817
396
+ And then we can combine them all to do that.
397
+
398
+ 0:12:40.260 --> 0:12:48.320
399
+ The input layer is of course somewhat given
400
+ by a problem of dimension.
401
+
402
+ 0:12:48.320 --> 0:13:00.013
403
+ The outward layer is also given by your dimension,
404
+ but the hidden layer is of course a hyperparameter.
405
+
406
+ 0:13:01.621 --> 0:13:08.802
407
+ So let's start with the first question, now
408
+ more language related, and that is how we represent.
409
+
410
+ 0:13:09.149 --> 0:13:23.460
411
+ So we've seen here we have the but the question
412
+ is now how can we put in a word into this?
413
+
414
+ 0:13:26.866 --> 0:13:34.117
415
+ Noise: The first thing we're able to be better
416
+ is by the fact that like you are said,.
417
+
418
+ 0:13:34.314 --> 0:13:43.028
419
+ That is not that easy because the continuous
420
+ vector will come to that.
421
+
422
+ 0:13:43.028 --> 0:13:50.392
423
+ So from the neo-network we can directly put
424
+ in the bedding.
425
+
426
+ 0:13:50.630 --> 0:13:57.277
427
+ But if we need to input a word into the needle
428
+ network, it has to be something which is easily
429
+
430
+ 0:13:57.277 --> 0:13:57.907
431
+ defined.
432
+
433
+ 0:13:59.079 --> 0:14:12.492
434
+ The one-hot encoding, and then we have one
435
+ out of encoding, so one value is one, and all
436
+
437
+ 0:14:12.492 --> 0:14:15.324
438
+ the others are zero.
439
+
440
+ 0:14:16.316 --> 0:14:25.936
441
+ That means we are always dealing with fixed
442
+ vocabulary because what said is we cannot.
443
+
444
+ 0:14:26.246 --> 0:14:38.017
445
+ So you cannot easily extend your vocabulary
446
+ because if you mean you would extend your vocabulary.
447
+
448
+ 0:14:39.980 --> 0:14:41.502
449
+ That's also motivating.
450
+
451
+ 0:14:41.502 --> 0:14:43.722
452
+ We talked about byte-pair encoding.
453
+
454
+ 0:14:43.722 --> 0:14:45.434
455
+ That's a nice thing there.
456
+
457
+ 0:14:45.434 --> 0:14:47.210
458
+ We have a fixed vocabulary.
459
+
460
+ 0:14:48.048 --> 0:14:55.804
461
+ The big advantage of this one encoding is
462
+ that we don't implicitly sum our implement
463
+
464
+ 0:14:55.804 --> 0:15:04.291
465
+ similarity between words, but really re-learning
466
+ because if you first think about this, this
467
+
468
+ 0:15:04.291 --> 0:15:06.938
469
+ is a very, very inefficient.
470
+
471
+ 0:15:07.227 --> 0:15:15.889
472
+ So you need like to represent end words, you
473
+ need a dimension of an end dimensional vector.
474
+
475
+ 0:15:16.236 --> 0:15:24.846
476
+ Imagine you could do binary encoding so you
477
+ could represent words as binary vectors.
478
+
479
+ 0:15:24.846 --> 0:15:26.467
480
+ Then you would.
481
+
482
+ 0:15:26.806 --> 0:15:31.177
483
+ Will be significantly more efficient.
484
+
485
+ 0:15:31.177 --> 0:15:36.813
486
+ However, then you have some implicit similarity.
487
+
488
+ 0:15:36.813 --> 0:15:39.113
489
+ Some numbers share.
490
+
491
+ 0:15:39.559 --> 0:15:46.958
492
+ Would somehow be bad because you would force
493
+ someone to do this by hand or clear how to
494
+
495
+ 0:15:46.958 --> 0:15:47.631
496
+ define.
497
+
498
+ 0:15:48.108 --> 0:15:55.135
499
+ So therefore currently this is the most successful
500
+ approach: to just do this one-hot.
501
+
502
+ 0:15:55.095 --> 0:15:59.563
503
+ Representations, so we take a fixed vocabulary.
504
+
505
+ 0:15:59.563 --> 0:16:06.171
506
+ We map each word to an index, and then we
507
+ represent a word like this.
508
+
509
+ 0:16:06.171 --> 0:16:13.246
510
+ So if home will be one, the representation
511
+ will be one zero zero zero, and.
512
+
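+ A tiny sketch of the one-hot representation just described (the vocabulary and indices are made up):
+
+ ```python
+ vocab = {"home": 0, "house": 1, "go": 2, "goes": 3}
+
+ def one_hot(word):
+     """Vector of vocabulary size with a single 1 at the word's index."""
+     vec = [0] * len(vocab)
+     vec[vocab[word]] = 1
+     return vec
+
+ print(one_hot("home"))   # [1, 0, 0, 0]
+ ```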
513
+ 0:16:14.514 --> 0:16:30.639
514
+ But this dimension here is a vocabulary size
515
+ and that is quite high, so we are always trying
516
+
517
+ 0:16:30.639 --> 0:16:33.586
518
+ to be efficient.
519
+
520
+ 0:16:33.853 --> 0:16:43.792
521
+ We are doing then some type of efficiency
522
+ because typically we are having this next layer.
523
+
524
+ 0:16:44.104 --> 0:16:51.967
525
+ It can be still maybe two hundred or five
526
+ hundred or one thousand neurons, but this is
527
+
528
+ 0:16:51.967 --> 0:16:53.323
529
+ significantly.
530
+
531
+ 0:16:53.713 --> 0:17:03.792
532
+ You can learn that directly and there we then
533
+ have similarity between words.
534
+
535
+ 0:17:03.792 --> 0:17:07.458
536
+ Then it is that some words.
537
+
538
+ 0:17:07.807 --> 0:17:14.772
539
+ But the nice thing is that this is then learned
540
+ that we are not need to hand define that.
541
+
542
+ 0:17:17.117 --> 0:17:32.742
543
+ We'll come later to the explicit architecture
544
+ of the neural language one, and there we can
545
+
546
+ 0:17:32.742 --> 0:17:35.146
547
+ see how it's.
548
+
549
+ 0:17:38.418 --> 0:17:44.857
550
+ So we're seeing that the other one or our
551
+ representation always has the same similarity.
552
+
553
+ 0:17:45.105 --> 0:17:59.142
554
+ Then we're having this continuous factor which
555
+ is a lot smaller dimension and that's important
556
+
557
+ 0:17:59.142 --> 0:18:00.768
558
+ for later.
559
+
560
+ 0:18:01.121 --> 0:18:06.989
561
+ What we are doing then is learning these representations
562
+ so that they are best for language.
563
+
564
+ 0:18:07.487 --> 0:18:14.968
565
+ So the representations are implicitly training
566
+ the language for the cards.
567
+
568
+ 0:18:14.968 --> 0:18:19.058
569
+ This is the best way for doing language.
570
+
571
+ 0:18:19.479 --> 0:18:32.564
572
+ And the nice thing that was found out later
573
+ is these representations are really good.
574
+
575
+ 0:18:33.153 --> 0:18:39.253
576
+ And that is why they are now even called word
577
+ embeddings by themselves and used for other
578
+
579
+ 0:18:39.253 --> 0:18:39.727
580
+ tasks.
581
+
582
+ 0:18:40.360 --> 0:18:49.821
583
+ And they are somewhat describing very different
584
+ things so they can describe and semantic similarities.
585
+
586
+ 0:18:49.789 --> 0:18:58.650
587
+ Are looking at the very example of today mass
588
+ vector space by adding words and doing some
589
+
590
+ 0:18:58.650 --> 0:19:00.618
591
+ interesting things.
592
+
593
+ 0:19:00.940 --> 0:19:11.178
594
+ So they got really like the first big improvement
595
+ when switching to neurostaff.
596
+
597
+ 0:19:11.491 --> 0:19:20.456
598
+ Are like part of the model, but with more
599
+ complex representation, but they are the basic
600
+
601
+ 0:19:20.456 --> 0:19:21.261
602
+ models.
603
+
604
+ 0:19:23.683 --> 0:19:36.979
605
+ In the output layer we are also having one
606
+ output layer structure and a connection function.
607
+
608
+ 0:19:36.997 --> 0:19:46.525
609
+ That is, for language learning we want to
610
+ predict what is the most common word.
611
+
612
+ 0:19:47.247 --> 0:19:56.453
613
+ And that can be done very well with this so
614
+ called softmax layer, where again the dimension.
615
+
616
+ 0:19:56.376 --> 0:20:02.825
617
+ Vocabulary size, so this is a vocabulary size,
618
+ and again the case neural represents the case
619
+
620
+ 0:20:02.825 --> 0:20:03.310
621
+ class.
622
+
623
+ 0:20:03.310 --> 0:20:09.759
624
+ So in our case we have again one round representation,
625
+ someone saying this is a core report.
626
+
627
+ 0:20:10.090 --> 0:20:17.255
628
+ Our probability distribution is a probability
629
+ distribution over all works, so the case entry
630
+
631
+ 0:20:17.255 --> 0:20:21.338
632
+ tells us how probable is that the next word
633
+ is this.
634
+
635
+ 0:20:22.682 --> 0:20:33.885
636
+ So we need to have some probability distribution
637
+ at our output in order to achieve that this
638
+
639
+ 0:20:33.885 --> 0:20:37.017
640
+ activation function goes.
641
+
642
+ 0:20:37.197 --> 0:20:46.944
643
+ And we can achieve that with a soft max activation
644
+ we take the input to the form of the value,
645
+
646
+ 0:20:46.944 --> 0:20:47.970
647
+ and then.
648
+
649
+ 0:20:48.288 --> 0:20:58.021
650
+ So by having this type of activation function
651
+ we are really getting this type of probability.
652
+
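+ A minimal softmax sketch matching the description above (the input scores are placeholders); it exponentiates the output values and normalizes them so they sum to one.
+
+ ```python
+ import math
+
+ def softmax(scores):
+     exps = [math.exp(s) for s in scores]       # exponentiate each output value
+     total = sum(exps)
+     return [e / total for e in exps]           # normalize to a distribution
+
+ print(softmax([2.0, 1.0, 0.1]))                # entries sum to 1.0
+ ```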
653
+ 0:20:59.019 --> 0:21:15.200
654
+ At the beginning was also very challenging
655
+ because again we have this inefficient representation.
656
+
657
+ 0:21:15.235 --> 0:21:29.799
658
+ You can imagine that something over is maybe
659
+ a bit inefficient with cheap users, but definitely.
660
+
661
+ 0:21:36.316 --> 0:21:44.072
662
+ And then for training the models that will
663
+ be fine, so we have to use architecture now.
664
+
665
+ 0:21:44.264 --> 0:21:48.491
666
+ We need to minimize the error.
667
+
668
+ 0:21:48.491 --> 0:21:53.264
669
+ Are we doing it taking the output?
670
+
671
+ 0:21:53.264 --> 0:21:58.174
672
+ We are comparing it to our targets.
673
+
674
+ 0:21:58.298 --> 0:22:03.830
675
+ So one important thing is by training them.
676
+
677
+ 0:22:03.830 --> 0:22:07.603
678
+ How can we measure the error?
679
+
680
+ 0:22:07.603 --> 0:22:12.758
681
+ So what is if we are training the ideas?
682
+
683
+ 0:22:13.033 --> 0:22:15.163
684
+ And how well we are measuring.
685
+
686
+ 0:22:15.163 --> 0:22:19.768
687
+ It is in natural language processing, typically
688
+ the cross entropy.
689
+
690
+ 0:22:19.960 --> 0:22:35.575
691
+ And that means we are comparing the target
692
+ with the output.
693
+
694
+ 0:22:35.335 --> 0:22:44.430
695
+ It gets optimized and you're seeing that this,
696
+ of course, makes it again very nice and easy
697
+
698
+ 0:22:44.430 --> 0:22:49.868
699
+ because our target is again a one-hot representation.
700
+
701
+ 0:22:50.110 --> 0:23:00.116
702
+ So all of these are always zero, and what
703
+ we are then doing is we are taking the one.
704
+
705
+ 0:23:00.100 --> 0:23:04.615
706
+ And we only need to multiply the one with
707
+ the logarithm here, and that is all the feedback
708
+
709
+ 0:23:04.615 --> 0:23:05.955
710
+ signal we are taking here.
711
+
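+ A small illustration of the cross-entropy loss with a one-hot target, as described: only the log probability of the correct word contributes.
+
+ ```python
+ import math
+
+ def cross_entropy(predicted_probs, target_index):
+     """With a one-hot target, the loss reduces to -log p(correct word)."""
+     return -math.log(predicted_probs[target_index])
+
+ print(cross_entropy([0.7, 0.2, 0.1], 0))   # small loss: confident and correct
+ print(cross_entropy([0.1, 0.2, 0.7], 0))   # large loss: confident and wrong
+ ```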
712
+ 0:23:06.946 --> 0:23:13.885
713
+ Of course, this is not always influenced by
714
+ all the others.
715
+
716
+ 0:23:13.885 --> 0:23:17.933
717
+ Why is this influenced by all the.
718
+
719
+ 0:23:24.304 --> 0:23:34.382
720
+ Have the activation function, which is the
721
+ current activation divided by the sum over the others.
722
+
723
+ 0:23:34.354 --> 0:23:45.924
724
+ Otherwise it could easily just increase this
725
+ volume and ignore the others, but if you increase
726
+
727
+ 0:23:45.924 --> 0:23:49.090
728
+ one value all the others.
729
+
730
+ 0:23:51.351 --> 0:23:59.912
731
+ Then we can do with neural networks one very nice
732
+ and easy type of training that is done in all
733
+
734
+ 0:23:59.912 --> 0:24:07.721
735
+ the neural networks where we are now calculating
736
+ our error and especially the gradient.
737
+
738
+ 0:24:07.707 --> 0:24:11.640
739
+ So in which direction does the error show?
740
+
741
+ 0:24:11.640 --> 0:24:18.682
742
+ And then if we want to go to a smaller error,
743
+ that's what we want to achieve.
744
+
745
+ 0:24:18.682 --> 0:24:26.638
746
+ We are taking the inverse direction of the
747
+ gradient and thereby trying to minimize our
748
+
749
+ 0:24:26.638 --> 0:24:27.278
750
+ error.
751
+
752
+ 0:24:27.287 --> 0:24:31.041
753
+ And we have to do that, of course, for all
754
+ the weights.
755
+
756
+ 0:24:31.041 --> 0:24:36.672
757
+ And to calculate the error of all the weights,
758
+ we won't do the defectvagation here.
759
+
760
+ 0:24:36.672 --> 0:24:41.432
761
+ But but what you can do is you can propagate
762
+ the arrow which measured.
763
+
764
+ 0:24:41.432 --> 0:24:46.393
765
+ At the end you can propagate it back its basic
766
+ mass and basic derivation.
767
+
768
+ 0:24:46.706 --> 0:24:58.854
769
+ For each weight in your model you measure how much
770
+ you contribute to the error and then change
771
+
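+ A one-line sketch of the weight update implied here (the learning rate and names are assumptions): each weight moves a small step against its gradient.
+
+ ```python
+ def sgd_step(weights, gradients, learning_rate=0.01):
+     """Move every weight in the inverse direction of its gradient."""
+     return [w - learning_rate * g for w, g in zip(weights, gradients)]
+
+ print(sgd_step([0.5, -0.3], [0.2, -0.1]))
+ ```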
772
+ 0:24:58.854 --> 0:25:01.339
773
+ it in a way that.
774
+
775
+ 0:25:04.524 --> 0:25:11.625
776
+ So to summarize what for at least machine
777
+ translation on your machine translation should
778
+
779
+ 0:25:11.625 --> 0:25:19.044
780
+ remember, you know, to understand on this problem
781
+ is that this is how a multilayer first the
782
+
783
+ 0:25:19.044 --> 0:25:20.640
784
+ problem looks like.
785
+
786
+ 0:25:20.580 --> 0:25:28.251
787
+ There are fully two layers and no connections.
788
+
789
+ 0:25:28.108 --> 0:25:29.759
790
+ Across layers.
791
+
792
+ 0:25:29.829 --> 0:25:35.153
793
+ And what they're doing is always just a weighted
790
+ sum here and then an activation function.
795
+
796
+ 0:25:35.415 --> 0:25:38.792
797
+ And in order to train you have this forward
798
+ and backward pass.
799
+
800
+ 0:25:39.039 --> 0:25:41.384
801
+ So We Put in Here.
802
+
803
+ 0:25:41.281 --> 0:25:41.895
804
+ Inputs.
805
+
806
+ 0:25:41.895 --> 0:25:45.347
807
+ We have some random values at the beginning.
808
+
809
+ 0:25:45.347 --> 0:25:47.418
810
+ Then calculate the output.
811
+
812
+ 0:25:47.418 --> 0:25:54.246
813
+ We are measuring our error, propagating
814
+ the error back and then changing our model
815
+
816
+ 0:25:54.246 --> 0:25:57.928
817
+ in a way that we hopefully get a smaller error.
818
+
819
+ 0:25:57.928 --> 0:25:59.616
820
+ And then that is how.
821
+
822
+ 0:26:01.962 --> 0:26:12.893
823
+ So before we're coming into our neural networks
824
+ language models, how can we use this type of
825
+
826
+ 0:26:12.893 --> 0:26:17.595
827
+ neural network to do language modeling?
828
+
829
+ 0:26:23.103 --> 0:26:33.157
830
+ So how can we use them in natural language
831
+ processing, especially machine translation?
832
+
833
+ 0:26:33.157 --> 0:26:41.799
834
+ The first idea of using them was to estimate:
835
+ So we have seen that the output can be monitored
836
+
837
+ 0:26:41.799 --> 0:26:42.599
838
+ here as well.
839
+
840
+ 0:26:43.603 --> 0:26:50.311
841
+ A probability distribution and if we have
842
+ a full vocabulary we could mainly hear estimating
843
+
844
+ 0:26:50.311 --> 0:26:56.727
845
+ how probable each next word is and then use
846
+ that in our language model fashion as we've
847
+
848
+ 0:26:56.727 --> 0:26:58.112
849
+ done it last time.
850
+
851
+ 0:26:58.112 --> 0:27:03.215
852
+ We got the probability of a full sentence
853
+ as a product of individual.
854
+
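The cue above recalls that a sentence probability is the product of the individual word probabilities. A short sketch of that chain-rule product, with made-up per-word probabilities:

```python
import math

# Hypothetical per-word probabilities p(w_i | history) for a four-word sentence.
word_probs = [0.2, 0.1, 0.4, 0.3]

# Probability of the full sentence is the product of the individual factors;
# in practice one sums log-probabilities to avoid underflow.
p_sentence = math.prod(word_probs)
logp_sentence = sum(math.log(p) for p in word_probs)

print(p_sentence, math.exp(logp_sentence))  # identical up to rounding
```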
855
+ 0:27:04.544 --> 0:27:12.820
856
+ And: That was done in the ninety seven years
857
+ and it's very easy to integrate it into this
858
+
859
+ 0:27:12.820 --> 0:27:14.545
860
+ log-linear model.
861
+
862
+ 0:27:14.545 --> 0:27:19.570
863
+ So we have said that this is how the log-linear
864
+ model looks like.
865
+
866
+ 0:27:19.570 --> 0:27:25.119
867
+ So we are searching the best translation which
868
+ maximizes the sum of each weight times
869
+
870
+ 0:27:25.125 --> 0:27:26.362
871
+ the feature value.
872
+
873
+ 0:27:26.646 --> 0:27:31.647
874
+ We have that with minimum error rate training
875
+ if you can remember where we search for the
876
+
877
+ 0:27:31.647 --> 0:27:32.147
878
+ optimal.
879
+
880
+ 0:27:32.512 --> 0:27:40.422
881
+ The language model and many others, and we
882
+ can just add here a neural model as one
883
+
884
+ 0:27:40.422 --> 0:27:41.591
885
+ of the features.
886
+
887
+ 0:27:41.861 --> 0:27:45.761
888
+ So that is quite easy as said.
889
+
890
+ 0:27:45.761 --> 0:27:53.183
891
+ That was how statistical machine translation
892
+ was improved.
893
+
894
+ 0:27:53.183 --> 0:27:57.082
895
+ You just add one more feature.
896
+
897
+ 0:27:58.798 --> 0:28:07.631
898
+ So how can we model the language modeling
899
+ with a network?
900
+
901
+ 0:28:07.631 --> 0:28:16.008
902
+ So what we have to do is model the probability
903
+ of the.
904
+
905
+ 0:28:16.656 --> 0:28:25.047
906
+ The problem in general in the head is that
907
+ mostly we haven't seen long sequences.
908
+
909
+ 0:28:25.085 --> 0:28:35.650
910
+ Mostly we have to back off to very short sequences
911
+ and we are working on this discrete space where
912
+
913
+ 0:28:35.650 --> 0:28:36.944
914
+ similarity.
915
+
916
+ 0:28:37.337 --> 0:28:50.163
917
+ So the idea is if we have now a neural network,
918
+ we can make words into continuous representation.
919
+
920
+ 0:28:51.091 --> 0:29:00.480
921
+ And the structure then looks like this, so
922
+ this is a basic still feed forward neural network.
923
+
924
+ 0:29:01.361 --> 0:29:10.645
925
+ We are doing this approximation again, so
926
+ we are not putting in all previous words, but
927
+
928
+ 0:29:10.645 --> 0:29:11.375
929
+ it is.
930
+
931
+ 0:29:11.691 --> 0:29:25.856
932
+ This is done because we said that in the neural
933
+ network we can have only a fixed size of input.
934
+
935
+ 0:29:25.945 --> 0:29:31.886
936
+ You can only do a fixed size, and then we'll
937
+ be doing that with exactly n minus one words.
938
+
939
+ 0:29:33.593 --> 0:29:39.536
940
+ So here you are, for example, three words
941
+ and three different words.
942
+
943
+ 0:29:39.536 --> 0:29:50.704
944
+ One and all the others are: And then we're
945
+ having the first layer of the neural network,
946
+
947
+ 0:29:50.704 --> 0:29:56.230
948
+ which like you learns is word embedding.
949
+
950
+ 0:29:57.437 --> 0:30:04.976
951
+ There is one thing which is maybe special
952
+ compared to the standard neural network.
953
+
954
+ 0:30:05.345 --> 0:30:11.918
955
+ So the representation of this word we want
956
+ to learn first of all position independence.
957
+
958
+ 0:30:11.918 --> 0:30:19.013
959
+ So we just want to learn what is the general
960
+ meaning of the word independent of its neighbors.
961
+
962
+ 0:30:19.299 --> 0:30:26.239
963
+ And therefore the representation you get here
964
+ should be the same as if in the second position.
965
+
966
+ 0:30:27.247 --> 0:30:36.865
967
+ The nice thing you can achieve is that this
968
+ weights which you're using here you're reusing
969
+
970
+ 0:30:36.865 --> 0:30:41.727
971
+ here and reusing here so we are forcing them.
972
+
973
+ 0:30:42.322 --> 0:30:48.360
974
+ You then learn your word embedding, which
975
+ is contextual, independent, so it's the same
976
+
977
+ 0:30:48.360 --> 0:30:49.678
978
+ for each position.
979
+
980
+ 0:30:49.909 --> 0:31:03.482
981
+ So that's the idea that you want to learn
982
+ the representation first of and you don't want
983
+
984
+ 0:31:03.482 --> 0:31:07.599
985
+ to really use the context.
986
+
987
+ 0:31:08.348 --> 0:31:13.797
988
+ That of course might have a different meaning
989
+ depending on where it stands, but we'll learn
990
+
991
+ 0:31:13.797 --> 0:31:14.153
992
+ that.
993
+
994
+ 0:31:14.514 --> 0:31:20.386
995
+ So first we are learning here representational
996
+ words, which is just the representation.
997
+
998
+ 0:31:20.760 --> 0:31:32.498
999
+ Normally we said in neurons all input neurons
1000
+ here are connected to all here, but we're reducing
1001
+
1002
+ 0:31:32.498 --> 0:31:37.338
1003
+ the complexity by saying these neurons.
1004
+
1005
+ 0:31:37.857 --> 0:31:47.912
1006
+ Then we have a lot denser representation that
1007
+ is our three word embedded in here, and now
1008
+
1009
+ 0:31:47.912 --> 0:31:57.408
1010
+ we are learning this interaction between words,
1011
+ a direction between words not based.
1012
+
1013
+ 0:31:57.677 --> 0:32:08.051
1014
+ So we have at least one connected layer here,
1015
+ which takes a three embedding input and then
1016
+
1017
+ 0:32:08.051 --> 0:32:14.208
1018
+ learns a new embedding which now represents
1019
+ the full.
1020
+
1021
+ 0:32:15.535 --> 0:32:16.551
1022
+ Layers.
1023
+
1024
+ 0:32:16.551 --> 0:32:27.854
1025
+ Then there is the output layer, which then gives
1026
+ again the probability distribution over all the words.
1027
+
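To make the architecture described above concrete, here is a minimal numpy sketch of one forward pass of such a feed-forward n-gram language model; all sizes and word indices are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical sizes: vocabulary of 10k words, 100-dim embeddings,
# 200-dim hidden layer, and a context of n-1 = 3 previous words.
V, d_emb, d_hid, context = 10_000, 100, 200, 3
rng = np.random.default_rng(0)

E = rng.normal(size=(V, d_emb))                  # shared embedding matrix (one row per word)
W1 = rng.normal(size=(context * d_emb, d_hid))   # hidden layer weights
W2 = rng.normal(size=(d_hid, V))                 # output layer weights

prev_words = [17, 42, 7]                         # indices of the previous three words

# The same embedding matrix is reused for every position (position-independent),
# then the three embeddings are concatenated and fed through the network.
x = np.concatenate([E[w] for w in prev_words])
h = np.tanh(x @ W1)
p_next = softmax(h @ W2)                         # distribution over the whole vocabulary

print(p_next.shape, p_next.sum())                # (10000,) 1.0
```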
1028
+ 0:32:28.168 --> 0:32:48.612
1029
+ So here is your target prediction.
1030
+
1031
+ 0:32:48.688 --> 0:32:56.361
1032
+ The nice thing is that you learn everything
1033
+ together, so you don't have to teach them what
1034
+
1035
+ 0:32:56.361 --> 0:32:58.722
1036
+ a good word representation.
1037
+
1038
+ 0:32:59.079 --> 0:33:08.306
1039
+ Training the whole network together, so it
1040
+ learns what a good representation for a word
1041
+
1042
+ 0:33:08.306 --> 0:33:13.079
1043
+ you get in order to perform your final task.
1044
+
1045
+ 0:33:15.956 --> 0:33:19.190
1046
+ Yeah, that is the main idea.
1047
+
1048
+ 0:33:20.660 --> 0:33:32.731
1049
+ This is nowadays often referred to as one
1050
+ way of self-supervised learning.
1051
+
1052
+ 0:33:33.053 --> 0:33:37.120
1053
+ The output is the next word and the input
1054
+ is the previous word.
1055
+
1056
+ 0:33:37.377 --> 0:33:46.783
1057
+ But it's not really that we created labels,
1058
+ but we artificially created a task out of unlabeled.
1059
+
1060
+ 0:33:46.806 --> 0:33:59.434
1061
+ We just had pure text, and then we created
1062
+ the labels ourselves by predicting the next word,
1063
+
1064
+ 0:33:59.434 --> 0:34:18.797
1065
+ which is: Say we have like two sentences like
1066
+ go home and the second one is go to prepare.
1067
+
1068
+ 0:34:18.858 --> 0:34:30.135
1069
+ And then we have to predict the next series
1070
+ and my questions in the labels for the album.
1071
+
1072
+ 0:34:31.411 --> 0:34:42.752
1073
+ We model this as one vector with like probability
1074
+ for possible weights starting again.
1075
+
1076
+ 0:34:44.044 --> 0:34:57.792
1077
+ Multiple examples, so then you would twice
1078
+ train one to predict KRT, one to predict home,
1079
+
1080
+ 0:34:57.792 --> 0:35:02.374
1081
+ and then of course the easel.
1082
+
1083
+ 0:35:04.564 --> 0:35:13.568
1084
+ Is a very good point, so you are not aggregating
1085
+ examples beforehand, but you are taking each.
1086
+
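The discussion above is about how raw text already provides training examples: each position yields a (context, next word) pair. A small sketch of that construction, assuming a context of n-1 = 2 words and a padding symbol:

```python
# Sketch of how raw text provides the labels: for every position,
# the input is the preceding n-1 words and the label is the next word.
n = 3
sentence = "i go home".split()
sentence = ["<s>", "<s>"] + sentence            # pad so the first word also gets a context

pairs = []
for i in range(n - 1, len(sentence)):
    context = tuple(sentence[i - (n - 1):i])
    target = sentence[i]
    pairs.append((context, target))

print(pairs)
# [(('<s>', '<s>'), 'i'), (('<s>', 'i'), 'go'), (('i', 'go'), 'home')]
```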
1087
+ 0:35:19.259 --> 0:35:37.204
1088
+ So when you do it simultaneously learn the
1089
+ projection layer and the n-gram probabilities
1090
+
1091
+ 0:35:37.204 --> 0:35:39.198
1092
+ and then.
1093
+
1094
+ 0:35:39.499 --> 0:35:47.684
1095
+ And later analyze it that these representations
1096
+ are very powerful.
1097
+
1098
+ 0:35:47.684 --> 0:35:56.358
1099
+ The task is just a very important task to
1100
+ model what is the next word.
1101
+
1102
+ 0:35:56.816 --> 0:35:59.842
1103
+ Is motivated by nowadays.
1104
+
1105
+ 0:35:59.842 --> 0:36:10.666
1106
+ In order to get the meaning of the word you
1107
+ have to look at its companies where the context.
1108
+
1109
+ 0:36:10.790 --> 0:36:16.048
1110
+ If you read texts in days of word which you
1111
+ have never seen, you often can still estimate
1112
+
1113
+ 0:36:16.048 --> 0:36:21.130
1114
+ the meaning of this word because you do not
1115
+ know how it is used, and this is typically
1116
+
1117
+ 0:36:21.130 --> 0:36:22.240
1118
+ used as a city or.
1119
+
1120
+ 0:36:22.602 --> 0:36:25.865
1121
+ Just imagine you read a text about some city.
1122
+
1123
+ 0:36:25.865 --> 0:36:32.037
1124
+ Even if you've never seen the city before,
1125
+ you often know from the context of how it's
1126
+
1127
+ 0:36:32.037 --> 0:36:32.463
1128
+ used.
1129
+
1130
+ 0:36:34.094 --> 0:36:42.483
1131
+ So what is now the big advantage of using
1132
+ neural networks?
1133
+
1134
+ 0:36:42.483 --> 0:36:51.851
1135
+ So just imagine we have to estimate that I
1136
+ bought my first iPhone.
1137
+
1138
+ 0:36:52.052 --> 0:36:56.608
1139
+ So you have to monitor the probability of
1140
+ ad hitting them.
1141
+
1142
+ 0:36:56.608 --> 0:37:00.237
1143
+ Now imagine iPhone, which you have never seen.
1144
+
1145
+ 0:37:00.600 --> 0:37:11.588
1146
+ So all the techniques we had last time at
1147
+ the end, if you haven't seen iPhone you will
1148
+
1149
+ 0:37:11.588 --> 0:37:14.240
1150
+ always fall back to.
1151
+
1152
+ 0:37:15.055 --> 0:37:26.230
1153
+ You have no idea how to deal that you won't
1154
+ have seen the bigram, the trigram, and all
1155
+
1156
+ 0:37:26.230 --> 0:37:27.754
1157
+ the others.
1158
+
1159
+ 0:37:28.588 --> 0:37:43.441
1160
+ If you're having this type of model, what
1161
+ does it do if you have my first and then something?
1162
+
1163
+ 0:37:43.483 --> 0:37:50.270
1164
+ Maybe this representation is really messed
1165
+ up because it's mainly an out-of-vocabulary word.
1166
+
1167
+ 0:37:50.730 --> 0:37:57.793
1168
+ However, you have still these two information
1169
+ that two words before was first and therefore.
1170
+
1171
+ 0:37:58.098 --> 0:38:06.954
1172
+ So you have a lot of information in order
1173
+ to estimate how good it is.
1174
+
1175
+ 0:38:06.954 --> 0:38:13.279
1176
+ There could be more information if you know
1177
+ that.
1178
+
1179
+ 0:38:13.593 --> 0:38:25.168
1180
+ So all this type of modeling we can do that
1181
+ we couldn't do beforehand because we always
1182
+
1183
+ 0:38:25.168 --> 0:38:25.957
1184
+ have.
1185
+
1186
+ 0:38:27.027 --> 0:38:40.466
1187
+ Good point, so typically you would have one
1188
+ token for a vocabulary so that you could, for
1189
+
1190
+ 0:38:40.466 --> 0:38:45.857
1191
+ example: All you're doing by parent coding
1192
+ when you have a fixed thing.
1193
+
1194
+ 0:38:46.226 --> 0:38:49.437
1195
+ Oh yeah, you have to do something like that
1196
+ that that that's true.
1197
+
1198
+ 0:38:50.050 --> 0:38:55.420
1199
+ So yeah, auto vocabulary are by thanking where
1200
+ you don't have other words written.
1201
+
1202
+ 0:38:55.735 --> 0:39:06.295
1203
+ But then, of course, you might be getting
1204
+ very long previous things, and your sequence
1205
+
1206
+ 0:39:06.295 --> 0:39:11.272
1207
+ length gets very long for unknown words.
1208
+
1209
+ 0:39:17.357 --> 0:39:20.067
1210
+ Any more questions to the basic stable.
1211
+
1212
+ 0:39:23.783 --> 0:39:36.719
1213
+ For this model, what we then want to continue
1214
+ is looking a bit into how complex or how we
1215
+
1216
+ 0:39:36.719 --> 0:39:39.162
1217
+ can make things.
1218
+
1219
+ 0:39:40.580 --> 0:39:49.477
1220
+ Because at the beginning there was definitely
1221
+ a major challenge, it's still not that easy,
1222
+
1223
+ 0:39:49.477 --> 0:39:58.275
1224
+ and I mean our likeers followed the talk about
1225
+ their environmental fingerprint and so on.
1226
+
1227
+ 0:39:58.478 --> 0:40:05.700
1228
+ So this calculation is not really heavy, and
1229
+ if you build systems yourselves you have to
1230
+
1231
+ 0:40:05.700 --> 0:40:06.187
1232
+ wait.
1233
+
1234
+ 0:40:06.466 --> 0:40:14.683
1235
+ So it's good to know a bit about how complex
1236
+ things are in order to do a good or efficient
1237
+
1238
+ 0:40:14.683 --> 0:40:15.405
1239
+ affair.
1240
+
1241
+ 0:40:15.915 --> 0:40:24.211
1242
+ So one thing where most of the calculation
1243
+ really happens is if you're doing it in a bad
1244
+
1245
+ 0:40:24.211 --> 0:40:24.677
1246
+ way.
1247
+
1248
+ 0:40:25.185 --> 0:40:33.523
1249
+ So in general, in all these layers we are talking
1250
+ about neural networks and it sounds fancy.
1251
+
1252
+ 0:40:33.523 --> 0:40:46.363
1253
+ In the end it is: So what you have to do in
1254
+ order to calculate here, for example, these
1255
+
1256
+ 0:40:46.363 --> 0:40:52.333
1257
+ activations: So make it simple a bit.
1258
+
1259
+ 0:40:52.333 --> 0:41:06.636
1260
+ Let's see where outputs and you just do metric
1261
+ multiplication between your weight matrix and
1262
+
1263
+ 0:41:06.636 --> 0:41:08.482
1264
+ your input.
1265
+
1266
+ 0:41:08.969 --> 0:41:20.992
1267
+ So that is why computers are so powerful for
1268
+ neural networks because they are very good
1269
+
1270
+ 0:41:20.992 --> 0:41:22.358
1271
+ in doing.
1272
+
1273
+ 0:41:22.782 --> 0:41:28.013
1274
+ However, for some type for the embedding layer
1275
+ this is really very inefficient.
1276
+
1277
+ 0:41:28.208 --> 0:41:39.652
1278
+ So because remember we're having this one
1279
+ hot encoding in this input, it's always like
1280
+
1281
+ 0:41:39.652 --> 0:41:42.940
1282
+ one and everything else.
1283
+
1284
+ 0:41:42.940 --> 0:41:47.018
1285
+ It's zero if we're doing this.
1286
+
1287
+ 0:41:47.387 --> 0:41:55.552
1288
+ So therefore you can do at least the forward
1289
+ pass a lot more efficient if you don't really
1290
+
1291
+ 0:41:55.552 --> 0:42:01.833
1292
+ do this calculation, but you can select the
1293
+ one column where the one is.
1294
+
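The point above, that the one-hot input makes the matrix multiplication unnecessary, can be shown in a few lines; sizes and the word index are made up:

```python
import numpy as np

V, d = 10_000, 100
rng = np.random.default_rng(1)
E = rng.normal(size=(V, d))        # embedding weight matrix

word_id = 4242
one_hot = np.zeros(V)
one_hot[word_id] = 1.0

# Multiplying the one-hot vector with the matrix wastes almost all the work ...
via_matmul = one_hot @ E
# ... because it just selects the row that belongs to this word.
via_lookup = E[word_id]

print(np.allclose(via_matmul, via_lookup))  # True
```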
1295
+ 0:42:01.833 --> 0:42:07.216
1296
+ Therefore, you also see this is called your
1297
+ word embedding.
1298
+
1299
+ 0:42:08.348 --> 0:42:19.542
1300
+ So the weight matrix of the embedding layer
1301
+ is just that in each column you have the embedding
1302
+
1303
+ 0:42:19.542 --> 0:42:20.018
1304
+ of.
1305
+
1306
+ 0:42:20.580 --> 0:42:30.983
1307
+ So this is like how your initial weights look
1308
+ like and how you can interpret or understand.
1309
+
1310
+ 0:42:32.692 --> 0:42:39.509
1311
+ And this is already relatively important because
1312
+ remember this is a huge dimensional thing.
1313
+
1314
+ 0:42:39.509 --> 0:42:46.104
1315
+ So typically here we have the number of words
1316
+ is ten thousand or so, so this is the word
1317
+
1318
+ 0:42:46.104 --> 0:42:51.365
1319
+ embedding matrix, typically the most expensive
1320
+ matrix to calculate.
1321
+
1322
+ 0:42:51.451 --> 0:42:59.741
1323
+ Because it's the largest one there, we have
1324
+ ten thousand entries, while for the hours we
1325
+
1326
+ 0:42:59.741 --> 0:43:00.393
1327
+ maybe.
1328
+
1329
+ 0:43:00.660 --> 0:43:03.408
1330
+ So therefore the addition to a little bit
1331
+ more to make this.
1332
+
1333
+ 0:43:06.206 --> 0:43:10.538
1334
+ Then you can go where else the calculations
1335
+ are very difficult.
1336
+
1337
+ 0:43:10.830 --> 0:43:20.389
1338
+ So here we then have our network, so we have
1339
+ the word embeddings.
1340
+
1341
+ 0:43:20.389 --> 0:43:29.514
1342
+ We have one hidden there, and then you can
1343
+ look how difficult.
1344
+
1345
+ 0:43:30.270 --> 0:43:38.746
1346
+ Could save a lot of calculation by not really
1347
+ calculating the selection because that is always.
1348
+
1349
+ 0:43:40.600 --> 0:43:46.096
1350
+ The number of calculations you have to do
1351
+ here is so.
1352
+
1353
+ 0:43:46.096 --> 0:43:51.693
1354
+ The length of this layer is n minus one times
1355
+ the projection size.
1356
+
1357
+ 0:43:52.993 --> 0:43:56.321
1358
+ That times the hidden size.
1359
+
1360
+ 0:43:56.321 --> 0:44:10.268
1361
+ So the first step of calculation for this
1362
+ matrix multiplication is this much calculation.
1363
+
1364
+ 0:44:10.730 --> 0:44:18.806
1365
+ Then you have to do some activation function
1366
+ and then you have to do again the calculation.
1367
+
1368
+ 0:44:19.339 --> 0:44:27.994
1369
+ Here we need the vocabulary size because we
1370
+ need to calculate the probability for each
1371
+
1372
+ 0:44:27.994 --> 0:44:29.088
1373
+ next word.
1374
+
1375
+ 0:44:29.889 --> 0:44:43.155
1376
+ And if you look at these numbers, so if you
1377
+ have a projector size of and a vocabulary size
1378
+
1379
+ 0:44:43.155 --> 0:44:53.876
1380
+ of, you see: And that is why there has been
1381
+ especially at the beginning some ideas how
1382
+
1383
+ 0:44:53.876 --> 0:44:55.589
1384
+ we can reduce.
1385
+
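To make the cost comparison above concrete, here is a back-of-the-envelope count of multiplications per forward pass; all the sizes are assumptions for illustration, not the numbers from the slides:

```python
# Rough operation counts for one forward pass, assuming hypothetical sizes:
n_minus_1 = 3        # previous words fed in
P = 100              # projection / embedding size
H = 500              # hidden layer size
V = 100_000          # vocabulary size

hidden_mults = (n_minus_1 * P) * H   # first fully connected layer
output_mults = H * V                 # output layer over the whole vocabulary

print(hidden_mults)   # 150,000
print(output_mults)   # 50,000,000  -> the output layer dominates by far
```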
1386
+ 0:44:55.956 --> 0:45:01.942
1387
+ And if we really need to calculate all of
1388
+ our capabilities, or if we can calculate only
1389
+
1390
+ 0:45:01.942 --> 0:45:02.350
1391
+ some.
1392
+
1393
+ 0:45:02.582 --> 0:45:10.871
1394
+ And there again the one important thing to
1395
+ think about is for what will use my language
1396
+
1397
+ 0:45:10.871 --> 0:45:11.342
1398
+ model.
1399
+
1400
+ 0:45:11.342 --> 0:45:19.630
1401
+ I can use it for generations and that's what
1402
+ we will see next week in an achiever which
1403
+
1404
+ 0:45:19.630 --> 0:45:22.456
1405
+ really is guiding the search.
1406
+
1407
+ 0:45:23.123 --> 0:45:30.899
1408
+ If it just uses a feature, we do not want
1409
+ to use it for generations, but we want to only
1410
+
1411
+ 0:45:30.899 --> 0:45:32.559
1412
+ know how probable.
1413
+
1414
+ 0:45:32.953 --> 0:45:39.325
1415
+ There we might not be really interested in
1416
+ all the probabilities, but we already know
1417
+
1418
+ 0:45:39.325 --> 0:45:46.217
1419
+ we just want to know the probability of this
1420
+ one word, and then it might be very inefficient
1421
+
1422
+ 0:45:46.217 --> 0:45:49.403
1423
+ to really calculate all the probabilities.
1424
+
1425
+ 0:45:51.231 --> 0:45:52.919
1426
+ And how can you do that so?
1427
+
1428
+ 0:45:52.919 --> 0:45:56.296
1429
+ Initially, for example, the people look into
1430
+ shortlists.
1431
+
1432
+ 0:45:56.756 --> 0:46:02.276
1433
+ So this calculation at the end is really very
1434
+ expensive.
1435
+
1436
+ 0:46:02.276 --> 0:46:05.762
1437
+ So can we make that more efficient.
1438
+
1439
+ 0:46:05.945 --> 0:46:17.375
1440
+ And most words occur very rarely, and maybe
1441
+ we don't need anger, and so there we may want
1442
+
1443
+ 0:46:17.375 --> 0:46:18.645
1444
+ to focus.
1445
+
1446
+ 0:46:19.019 --> 0:46:29.437
1447
+ And so they use the smaller vocabulary, which
1448
+ is maybe.
1449
+
1450
+ 0:46:29.437 --> 0:46:34.646
1451
+ This layer is used from to.
1452
+
1453
+ 0:46:34.646 --> 0:46:37.623
1454
+ Then you merge.
1455
+
1456
+ 0:46:37.937 --> 0:46:45.162
1457
+ So you're taking, if the word is in the shortlist,
1458
+ so in the two thousand most frequent words.
1459
+
1460
+ 0:46:45.825 --> 0:46:58.299
1461
+ Of this short word by some normalization here,
1462
+ and otherwise you take a back-off probability
1463
+
1464
+ 0:46:58.299 --> 0:46:59.655
1465
+ from the.
1466
+
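As a rough sketch of the shortlist idea described above: score frequent words with the neural model and fall back to a classical model otherwise. The combination and all values below are simplified assumptions, not the exact normalization from the lecture:

```python
# Sketch of the shortlist idea: the neural model only scores the most
# frequent words; everything else falls back to a classical n-gram model.
SHORTLIST = {"the", "cat", "sat", "on", "mat"}   # stand-in for the ~2,000 most frequent words

def p_neural(word, history):     # placeholder for the neural shortlist model
    return 0.20

def p_backoff(word, history):    # placeholder for the back-off n-gram model
    return 0.001

def p_shortlist_lm(word, history, p_short_mass=0.9):
    if word in SHORTLIST:
        # scale the neural probability by the mass assigned to the shortlist
        return p_short_mass * p_neural(word, history)
    return (1.0 - p_short_mass) * p_backoff(word, history)

print(p_shortlist_lm("cat", ("the",)), p_shortlist_lm("aardvark", ("the",)))
```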
1467
+ 0:47:00.020 --> 0:47:04.933
1468
+ It will not be as good, but the idea is okay.
1469
+
1470
+ 0:47:04.933 --> 0:47:14.013
1471
+ Then we don't have to calculate all these
1472
+ probabilities here at the end, but we only
1473
+
1474
+ 0:47:14.013 --> 0:47:16.042
1475
+ have to calculate.
1476
+
1477
+ 0:47:19.599 --> 0:47:32.097
1478
+ With some type of cost because it means we
1479
+ don't model the probability of the infrequent
1480
+
1481
+ 0:47:32.097 --> 0:47:39.399
1482
+ words, and maybe it's even very important to
1483
+ model.
1484
+
1485
+ 0:47:39.299 --> 0:47:46.671
1486
+ And one idea is to do what is referred to as
1487
+ structured output layer
1488
+
1489
+ 0:47:46.606 --> 0:47:49.571
1490
+ Network language models you see some years
1491
+ ago.
1492
+
1493
+ 0:47:49.571 --> 0:47:53.154
1494
+ People were very creative and giving names
1495
+ to new models.
1496
+
1497
+ 0:47:53.813 --> 0:48:00.341
1498
+ And there the idea is that we model the output
1499
+ vocabulary as a clustered tree.
1500
+
1501
+ 0:48:00.680 --> 0:48:06.919
1502
+ So you don't need to model all of our bodies
1503
+ directly, but you are putting words into a
1504
+
1505
+ 0:48:06.919 --> 0:48:08.479
1506
+ sequence of clusters.
1507
+
1508
+ 0:48:08.969 --> 0:48:15.019
1509
+ So maybe a very infrequent word is first
1510
+ in cluster three and then in cluster three.
1511
+
1512
+ 0:48:15.019 --> 0:48:21.211
1513
+ You have subclusters again and there is subclusters
1514
+ seven and subclusters and there is.
1515
+
1516
+ 0:48:21.541 --> 0:48:40.134
1517
+ And this is the path, so that is what was
1518
+ the man in the past.
1519
+
1520
+ 0:48:40.340 --> 0:48:52.080
1521
+ And then you can calculate the probability
1522
+ of the word again just by the product of the
1523
+
1524
+ 0:48:52.080 --> 0:48:55.548
1525
+ first class of the world.
1526
+
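A minimal sketch of the class-factored output described above: the word probability is the product of the probabilities along its path, so only one class layer and one within-class layer have to be evaluated. Sizes and indices are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
h = rng.normal(size=16)                  # hidden representation of the history

n_classes, words_per_class = 100, 1000   # 100 x 1000 = 100k words overall
Wc = rng.normal(size=(16, n_classes))                   # predicts the word class
Ww = rng.normal(size=(n_classes, 16, words_per_class))  # per-class word predictors

cls, word_in_cls = 7, 123
p_class = softmax(h @ Wc)[cls]
p_word_given_class = softmax(h @ Ww[cls])[word_in_cls]

# Probability of the word is the product along its path through the tree;
# only 100 + 1000 scores were computed instead of 100,000.
p_word = p_class * p_word_given_class
print(p_word)
```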
1527
+ 0:48:57.617 --> 0:49:07.789
1528
+ That it may be more clear where you have this
1529
+ architecture, so this is all the same.
1530
+
1531
+ 0:49:07.789 --> 0:49:13.773
1532
+ But then you first predict here which main
1533
+ class.
1534
+
1535
+ 0:49:14.154 --> 0:49:24.226
1536
+ Then you go to the appropriate subclass, then
1537
+ you calculate the probability of the subclass
1538
+
1539
+ 0:49:24.226 --> 0:49:26.415
1540
+ and maybe the cell.
1541
+
1542
+ 0:49:27.687 --> 0:49:35.419
1543
+ Anybody have an idea why this is more efficient
1544
+ or if you do it first, it looks a lot more.
1545
+
1546
+ 0:49:42.242 --> 0:49:51.788
1547
+ You have to do less calculations, so maybe
1548
+ if you do it here you have to calculate the
1549
+
1550
+ 0:49:51.788 --> 0:49:59.468
1551
+ element there, but you don't have to do all
1552
+ the one hundred thousand.
1553
+
1554
+ 0:49:59.980 --> 0:50:06.115
1555
+ The probabilities in the set classes that
1556
+ you're going through and not for all of them.
1557
+
1558
+ 0:50:06.386 --> 0:50:18.067
1559
+ Therefore, it's more efficient if you don't
1560
+ need all output proficient because you have
1561
+
1562
+ 0:50:18.067 --> 0:50:21.253
1563
+ to calculate the class.
1564
+
1565
+ 0:50:21.501 --> 0:50:28.936
1566
+ So it's only more efficient in scenarios
1567
+ where you really need to use a language model
1568
+
1569
+ 0:50:28.936 --> 0:50:30.034
1570
+ to evaluate.
1571
+
1572
+ 0:50:35.275 --> 0:50:52.456
1573
+ How this works is that you can first train
1574
+ your language model on the shortlist.
1575
+
1576
+ 0:50:52.872 --> 0:51:03.547
1577
+ But on the input layer you have your full
1578
+ vocabulary because at the input we saw that
1579
+
1580
+ 0:51:03.547 --> 0:51:06.650
1581
+ this is not complicated.
1582
+
1583
+ 0:51:06.906 --> 0:51:26.638
1584
+ And then you can cluster down all your words
1585
+ here into classes and use those as your classes.
1586
+
1587
+ 0:51:29.249 --> 0:51:34.148
1588
+ That is one idea of doing it.
1589
+
1590
+ 0:51:34.148 --> 0:51:44.928
1591
+ There is also a second idea of doing it, and
1592
+ again we don't need.
1593
+
1594
+ 0:51:45.025 --> 0:51:53.401
1595
+ So sometimes it doesn't really need to be
1596
+ a probability to evaluate.
1597
+
1598
+ 0:51:53.401 --> 0:51:56.557
1599
+ It's only important that.
1600
+
1601
+ 0:51:58.298 --> 0:52:04.908
1602
+ And: Here it's called self normalization what
1603
+ people have done so.
1604
+
1605
+ 0:52:04.908 --> 0:52:11.562
1606
+ We have seen that the probability is in this
1607
+ softmax always e to the power of the score divided
1608
+
1609
+ 0:52:11.562 --> 0:52:18.216
1610
+ by our normalization, and the normalization
1611
+ is a sum over the vocabulary of e to the power
1612
+
1613
+ 0:52:18.216 --> 0:52:19.274
1614
+ of the scores.
1615
+
1616
+ 0:52:19.759 --> 0:52:25.194
1617
+ So this is how we calculate the softmax.
1618
+
1619
+ 0:52:25.825 --> 0:52:41.179
1620
+ In self normalization of the idea, if this
1621
+ would be zero then we don't need to calculate
1622
+
1623
+ 0:52:41.179 --> 0:52:42.214
1624
+ that.
1625
+
1626
+ 0:52:42.102 --> 0:52:54.272
1627
+ Will be zero, and then you don't even have
1628
+ to calculate the normalization because it's.
1629
+
1630
+ 0:52:54.514 --> 0:53:08.653
1631
+ So how can we achieve that and then the nice
1632
+ thing in your networks?
1633
+
1634
+ 0:53:09.009 --> 0:53:23.928
1635
+ And now we're just adding a second loss with
1636
+ some alpha parameter here.
1637
+
1638
+ 0:53:24.084 --> 0:53:29.551
1639
+ And the second loss just tells us it should be
1640
+ trained in a way that
1641
+
1642
+ 0:53:29.551 --> 0:53:31.625
1643
+ the log of the normalization is zero.
1644
+
1645
+ 0:53:32.352 --> 0:53:38.614
1646
+ So then if it's nearly zero at the end we
1647
+ don't need to calculate this and it's also
1648
+
1649
+ 0:53:38.614 --> 0:53:39.793
1650
+ very efficient.
1651
+
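A minimal sketch of the self-normalization objective described above, assuming the extra term is simply a weighted squared log-normalizer (the weight alpha is a hypothetical hyperparameter):

```python
import numpy as np

def self_normalized_loss(logits, target_idx, alpha=0.1):
    # Standard cross-entropy term.
    log_z = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
    ce = -(logits[target_idx] - log_z)
    # Extra penalty that pushes log Z towards zero, so that at test time
    # the unnormalized score can be used without computing the sum.
    return ce + alpha * log_z ** 2

logits = np.array([1.2, -0.3, 0.5, 2.0])
print(self_normalized_loss(logits, target_idx=3))
```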
1652
+ 0:53:40.540 --> 0:53:49.498
1653
+ One important thing is this, of course, is
1654
+ only in inference.
1655
+
1656
+ 0:53:49.498 --> 0:54:04.700
1657
+ During tests we don't need to calculate that
1658
+ because: You can do a bit of a hyperparameter
1659
+
1660
+ 0:54:04.700 --> 0:54:14.851
1661
+ here where you do the weighting, so how good
1662
+ should it be estimating the probabilities and
1663
+
1664
+ 0:54:14.851 --> 0:54:16.790
1665
+ how much effort?
1666
+
1667
+ 0:54:18.318 --> 0:54:28.577
1668
+ The only disadvantage is no speed up during
1669
+ training.
1670
+
1671
+ 0:54:28.577 --> 0:54:43.843
1672
+ There are other ways of doing that, for example:
1673
+ Englishman is in case you get it.
1674
+
1675
+ 0:54:44.344 --> 0:54:48.540
1676
+ Then we are coming very, very briefly like
1677
+ just one idea.
1678
+
1679
+ 0:54:48.828 --> 0:54:53.058
1680
+ That there is more things on different types
1681
+ of language models.
1682
+
1683
+ 0:54:53.058 --> 0:54:58.002
1684
+ We are having a very short view on restricted
1685
+ Boltzmann machine based language models.
1686
+
1687
+ 0:54:58.298 --> 0:55:08.931
1688
+ Talk about recurrent neural networks for language
1689
+ models because they have the advantage that
1690
+
1691
+ 0:55:08.931 --> 0:55:17.391
1692
+ we can even further improve by not having a
1693
+ continuous representation on.
1694
+
1695
+ 0:55:18.238 --> 0:55:23.845
1696
+ So there's different types of neural networks.
1697
+
1698
+ 0:55:23.845 --> 0:55:30.169
1699
+ These are these Boltzmann machines, and they are interesting.
1700
+
1701
+ 0:55:30.330 --> 0:55:39.291
1702
+ They have these: And they define like an energy
1703
+ function on the network, which can be in restricted
1704
+
1705
+ 0:55:39.291 --> 0:55:44.372
1706
+ Boltzmann machines efficiently calculated, in general
1707
+ and restricted needs.
1708
+
1709
+ 0:55:44.372 --> 0:55:51.147
1710
+ You only have connection between the input
1711
+ and the hidden layer, but you don't have connections
1712
+
1713
+ 0:55:51.147 --> 0:55:53.123
1714
+ in the input or within the.
1715
+
1716
+ 0:55:53.393 --> 0:56:00.194
1717
+ So you see here you don't have an input output,
1718
+ you just have an input, and you calculate.
1719
+
1720
+ 0:56:00.460 --> 0:56:15.612
1721
+ Which of course nicely fits with the idea
1722
+ we're having, so you can then use this for
1723
+
1724
+ 0:56:15.612 --> 0:56:19.177
1725
+ an N Gram language.
1726
+
1727
+ 0:56:19.259 --> 0:56:25.189
1728
+ Retaining the flexibility of the input by
1729
+ this type of neural networks.
1730
+
1731
+ 0:56:26.406 --> 0:56:30.589
1732
+ And the advantage of this type of model was
1733
+ there's.
1734
+
1735
+ 0:56:30.550 --> 0:56:37.520
1736
+ Very, very fast to integrate it, so that one
1737
+ was the first one which was used during the
1738
+
1739
+ 0:56:37.520 --> 0:56:38.616
1740
+ coding model.
1741
+
1742
+ 0:56:38.938 --> 0:56:45.454
1743
+ The engram language models were that they
1744
+ were very good and gave performance.
1745
+
1746
+ 0:56:45.454 --> 0:56:50.072
1747
+ However, calculation still with all these
1748
+ tricks takes.
1749
+
1750
+ 0:56:50.230 --> 0:56:58.214
1751
+ We have talked about n-best lists, so they
1752
+ generated an n-best list of the most probable
1753
+
1754
+ 0:56:58.214 --> 0:57:05.836
1755
+ outputs and then they took this n-best list
1756
+ scored each entry with a new network.
1757
+
1758
+ 0:57:06.146 --> 0:57:09.306
1759
+ A language model, and then only change the
1760
+ order again.
1761
+
1762
+ 0:57:09.306 --> 0:57:10.887
1763
+ Select based on that which.
1764
+
1765
+ 0:57:11.231 --> 0:57:17.187
1766
+ The n-best list is maybe only like a hundred
1767
+ entries.
1768
+
1769
+ 0:57:17.187 --> 0:57:21.786
1770
+ When decoding you look at several thousand.
1771
+
1772
+ 0:57:26.186 --> 0:57:35.196
1773
+ Let's look at the context so we have now seen
1774
+ your language models.
1775
+
1776
+ 0:57:35.196 --> 0:57:43.676
1777
+ There is the big advantage we can use this
1778
+ word similarity and.
1779
+
1780
+ 0:57:44.084 --> 0:57:52.266
1781
+ Remember, for n-gram language models it is not always
1782
+ minus one words because sometimes you have
1783
+
1784
+ 0:57:52.266 --> 0:57:59.909
1785
+ to back off or interpolate to lower-order n-grams
1786
+ and you don't know the previous words.
1787
+
1788
+ 0:58:00.760 --> 0:58:04.742
1789
+ And however in neural models we always have
1790
+ all of this importance.
1791
+
1792
+ 0:58:04.742 --> 0:58:05.504
1793
+ Can some of.
1794
+
1795
+ 0:58:07.147 --> 0:58:20.288
1796
+ The disadvantage is that you are still limited
1797
+ in your context, and if you remember the sentence
1798
+
1799
+ 0:58:20.288 --> 0:58:22.998
1800
+ from last lecture,.
1801
+
1802
+ 0:58:22.882 --> 0:58:28.328
1803
+ Sometimes you need more context and there
1804
+ is unlimited context that you might need and
1805
+
1806
+ 0:58:28.328 --> 0:58:34.086
1807
+ you can always create sentences where you may
1808
+ need this five context in order to put a good
1809
+
1810
+ 0:58:34.086 --> 0:58:34.837
1811
+ estimation.
1812
+
1813
+ 0:58:35.315 --> 0:58:44.956
1814
+ Can also do it different in order to understand
1815
+ that it makes sense to view language.
1816
+
1817
+ 0:58:45.445 --> 0:58:59.510
1818
+ So sequence labeling tasks are a very common
1819
+ type of task in language processing where you
1820
+
1821
+ 0:58:59.510 --> 0:59:03.461
1822
+ have the input sequence.
1823
+
1824
+ 0:59:03.323 --> 0:59:05.976
1825
+ So you have one output for each input.
1826
+
1827
+ 0:59:05.976 --> 0:59:12.371
1828
+ Machine translation is not a sequence labeling
1829
+ task because the number of inputs and the number
1830
+
1831
+ 0:59:12.371 --> 0:59:14.072
1832
+ of outputs is different.
1833
+
1834
+ 0:59:14.072 --> 0:59:20.598
1835
+ So you put in a string German which has five
1836
+ words and the output can be: See, for example,
1837
+
1838
+ 0:59:20.598 --> 0:59:24.078
1839
+ you always have the same number and the same
1840
+ number of outputs.
1841
+
1842
+ 0:59:24.944 --> 0:59:39.779
1843
+ And you can model language modeling as that,
1844
+ and you just say the label for each word is
1845
+
1846
+ 0:59:39.779 --> 0:59:43.151
1847
+ always a next word.
1848
+
1849
+ 0:59:45.705 --> 0:59:50.312
1850
+ This is the more generous you can think of
1851
+ it.
1852
+
1853
+ 0:59:50.312 --> 0:59:56.194
1854
+ For example, part-of-speech tagging, named entity
1855
+ recognition.
1856
+
1857
+ 0:59:58.938 --> 1:00:08.476
1858
+ And if you look at now, this output token
1859
+ and generally sequenced labeling can depend
1860
+
1861
+ 1:00:08.476 --> 1:00:26.322
1862
+ on: The input tokens are the same so we can
1863
+ easily model it and they only depend on the
1864
+
1865
+ 1:00:26.322 --> 1:00:29.064
1866
+ input tokens.
1867
+
1868
+ 1:00:31.011 --> 1:00:42.306
1869
+ But we can always look at one specific type
1870
+ of sequence labeling, unidirectional sequence
1871
+
1872
+ 1:00:42.306 --> 1:00:44.189
1873
+ labeling type.
1874
+
1875
+ 1:00:44.584 --> 1:01:00.855
1876
+ The probability of the next word only depends
1877
+ on the previous words that we are having here.
1878
+
1879
+ 1:01:01.321 --> 1:01:05.998
1880
+ That's also not completely true in language.
1881
+
1882
+ 1:01:05.998 --> 1:01:14.418
1883
+ Well, the back context might also be helpful
1884
+ by direction of the model's Google.
1885
+
1886
+ 1:01:14.654 --> 1:01:23.039
1887
+ We will always model the probability of the
1888
+ word given on its history.
1889
+
1890
+ 1:01:23.623 --> 1:01:30.562
1891
+ And currently there is approximation and sequence
1892
+ labeling that we have this windowing approach.
1893
+
1894
+ 1:01:30.951 --> 1:01:43.016
1895
+ So in order to predict this type of word we
1896
+ always look at the previous three words.
1897
+
1898
+ 1:01:43.016 --> 1:01:48.410
1899
+ This is this type of windowing model.
1900
+
1901
+ 1:01:49.389 --> 1:01:54.780
1902
+ If you're into neural networks you recognize
1903
+ this type of structure.
1904
+
1905
+ 1:01:54.780 --> 1:01:57.515
1906
+ Also, the typical neural networks.
1907
+
1908
+ 1:01:58.938 --> 1:02:11.050
1909
+ Yes, yes, so like engram models you can, at
1910
+ least in some way, prepare for that type of
1911
+
1912
+ 1:02:11.050 --> 1:02:12.289
1913
+ context.
1914
+
1915
+ 1:02:14.334 --> 1:02:23.321
1916
+ There are also other types of neural network structures
1917
+ which we can use for sequence labeling and which
1918
+
1919
+ 1:02:23.321 --> 1:02:30.710
1920
+ might help us where we don't have this type
1921
+ of fixed size representation.
1922
+
1923
+ 1:02:32.812 --> 1:02:34.678
1924
+ That we can do so.
1925
+
1926
+ 1:02:34.678 --> 1:02:39.391
1927
+ The idea in recurrent neural networks is:
1928
+
1929
+ 1:02:39.391 --> 1:02:43.221
1930
+ We are saving complete history in one.
1931
+
1932
+ 1:02:43.623 --> 1:02:56.946
1933
+ So again we have to do this fixed size representation
1934
+ because the neural networks always need a habit.
1935
+
1936
+ 1:02:57.157 --> 1:03:09.028
1937
+ And then the network should look like that,
1938
+ so we start with an initial value for our storage.
1939
+
1940
+ 1:03:09.028 --> 1:03:15.900
1941
+ We are giving our first input and calculating
1942
+ the new.
1943
+
1944
+ 1:03:16.196 --> 1:03:35.895
1945
+ So again in your network with two types of
1946
+ inputs: Then you can apply it to the next type
1947
+
1948
+ 1:03:35.895 --> 1:03:41.581
1949
+ of input and you're again having this.
1950
+
1951
+ 1:03:41.581 --> 1:03:46.391
1952
+ You're taking this hidden state.
1953
+
1954
+ 1:03:47.367 --> 1:03:53.306
1955
+ Nice thing is now that you can do now step
1956
+ by step by step, so all the way over.
1957
+
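The step-by-step processing just described can be summarized in a few lines: the same two weight matrices are applied at every position, and the hidden state carries the history forward. Sizes and initialization below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_hid = 8, 16
W_x = rng.normal(size=(d_in, d_hid)) * 0.1   # input-to-hidden weights
W_h = rng.normal(size=(d_hid, d_hid)) * 0.1  # hidden-to-hidden (recurrent) weights

def rnn_step(h_prev, x):
    # The new state mixes the stored history with the current input.
    return np.tanh(x @ W_x + h_prev @ W_h)

h = np.zeros(d_hid)                   # initial value of the "storage"
sequence = rng.normal(size=(5, d_in)) # five input word vectors
for x in sequence:
    h = rnn_step(h, x)                # same weights reused at every position

print(h.shape)  # the whole history is summarized in one fixed-size vector
```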
1958
+ 1:03:55.495 --> 1:04:06.131
1959
+ The nice thing we are having here now is that
1960
+ now we are having context information from
1961
+
1962
+ 1:04:06.131 --> 1:04:07.206
1963
+ all the.
1964
+
1965
+ 1:04:07.607 --> 1:04:14.181
1966
+ So if you're looking like based on which words
1967
+ do you, you calculate the probability of varying.
1968
+
1969
+ 1:04:14.554 --> 1:04:20.090
1970
+ It depends on this part.
1971
+
1972
+ 1:04:20.090 --> 1:04:33.154
1973
+ It depends on and this hidden state was influenced
1974
+ by two.
1975
+
1976
+ 1:04:33.473 --> 1:04:38.259
1977
+ So now we're having something new.
1978
+
1979
+ 1:04:38.259 --> 1:04:46.463
1980
+ We can model like the word probability not
1981
+ only on a fixed.
1982
+
1983
+ 1:04:46.906 --> 1:04:53.565
1984
+ Because the hidden states we are having here
1985
+ in our RNN are influenced by all the previous words.
1986
+
1987
+ 1:04:56.296 --> 1:05:02.578
1988
+ So how is there to be Singapore?
1989
+
1990
+ 1:05:02.578 --> 1:05:16.286
1991
+ But then we have the initial idea about this
1992
+ P of given on the history.
1993
+
1994
+ 1:05:16.736 --> 1:05:25.300
1995
+ So do not need to do any clustering here,
1996
+ and you also see how things are put together
1997
+
1998
+ 1:05:25.300 --> 1:05:26.284
1999
+ in order.
2000
+
2001
+ 1:05:29.489 --> 1:05:43.449
2002
+ The green box this night since we are starting
2003
+ from the left to the right.
2004
+
2005
+ 1:05:44.524 --> 1:05:51.483
2006
+ Voices: Yes, that's right, so there are clusters,
2007
+ and here is also sometimes clustering happens.
2008
+
2009
+ 1:05:51.871 --> 1:05:58.687
2010
+ The small difference does matter again, so
2011
+ if you have now a lot of different histories,
2012
+
2013
+ 1:05:58.687 --> 1:06:01.674
2014
+ the similarity which you have in here.
2015
+
2016
+ 1:06:01.674 --> 1:06:08.260
2017
+ If two of the histories are very similar,
2018
+ these representations will be the same, and
2019
+
2020
+ 1:06:08.260 --> 1:06:10.787
2021
+ then you're treating them again.
2022
+
2023
+ 1:06:11.071 --> 1:06:15.789
2024
+ Because in order to do the final restriction
2025
+ you only do a good base on the green box.
2026
+
2027
+ 1:06:16.156 --> 1:06:28.541
2028
+ So you are now still learning some type of
2029
+ clustering in there, but you are learning it
2030
+
2031
+ 1:06:28.541 --> 1:06:30.230
2032
+ implicitly.
2033
+
2034
+ 1:06:30.570 --> 1:06:38.200
2035
+ The only restriction you're giving is you
2036
+ have to store everything that is important
2037
+
2038
+ 1:06:38.200 --> 1:06:39.008
2039
+ in this.
2040
+
2041
+ 1:06:39.359 --> 1:06:54.961
2042
+ So it's a different type of limitation, so
2043
+ you calculate the probability based on the
2044
+
2045
+ 1:06:54.961 --> 1:06:57.138
2046
+ last words.
2047
+
2048
+ 1:06:57.437 --> 1:07:04.430
2049
+ And that is how you still need to somehow
2050
+ cluster things together in order to do efficiently.
2051
+
2052
+ 1:07:04.430 --> 1:07:09.563
2053
+ Of course, you need to do some type of clustering
2054
+ because otherwise.
2055
+
2056
+ 1:07:09.970 --> 1:07:18.865
2057
+ But this is where things get merged together
2058
+ in this type of hidden representation.
2059
+
2060
+ 1:07:18.865 --> 1:07:27.973
2061
+ So here the probability of the word first
2062
+ only depends on this hidden representation.
2063
+
2064
+ 1:07:28.288 --> 1:07:33.104
2065
+ On the previous words, but they are some other
2066
+ bottleneck in order to make a good estimation.
2067
+
2068
+ 1:07:34.474 --> 1:07:41.231
2069
+ So the idea is that we can store all our history
2070
+ into or into one lecture.
2071
+
2072
+ 1:07:41.581 --> 1:07:44.812
2073
+ Which is the one that makes it more strong.
2074
+
2075
+ 1:07:44.812 --> 1:07:51.275
2076
+ Next we come to problems that of course at
2077
+ some point it might be difficult if you have
2078
+
2079
+ 1:07:51.275 --> 1:07:57.811
2080
+ very long sequences and you always write all
2081
+ the information you have on this one block.
2082
+
2083
+ 1:07:58.398 --> 1:08:02.233
2084
+ Then maybe things get overwritten or you cannot
2085
+ store everything in there.
2086
+
2087
+ 1:08:02.662 --> 1:08:04.514
2088
+ So,.
2089
+
2090
+ 1:08:04.184 --> 1:08:09.569
2091
+ Therefore, yet for short things like single
2092
+ sentences that works well, but especially if
2093
+
2094
+ 1:08:09.569 --> 1:08:15.197
2095
+ you think of other tasks and like symbolizations
2096
+ with our document based on T where you need
2097
+
2098
+ 1:08:15.197 --> 1:08:20.582
2099
+ to consider the full document, these things
2100
+ got got a bit more more more complicated and
2101
+
2102
+ 1:08:20.582 --> 1:08:23.063
2103
+ will learn another type of architecture.
2104
+
2105
+ 1:08:24.464 --> 1:08:30.462
2106
+ In order to understand these neighbors, it
2107
+ is good to have all the bus use always.
2108
+
2109
+ 1:08:30.710 --> 1:08:33.998
2110
+ So this is the unrolled view.
2111
+
2112
+ 1:08:33.998 --> 1:08:43.753
2113
+ Somewhere you're over the type or in language
2114
+ over the words you're unrolling a network.
2115
+
2116
+ 1:08:44.024 --> 1:08:52.096
2117
+ Here is the article and here is the network
2118
+ which is connected by itself and that is recurrent.
2119
+
2120
+ 1:08:56.176 --> 1:09:04.982
2121
+ There is one challenge in this networks and
2122
+ training.
2123
+
2124
+ 1:09:04.982 --> 1:09:11.994
2125
+ We can train them first of all as forward.
2126
+
2127
+ 1:09:12.272 --> 1:09:19.397
2128
+ So we don't really know how to train them,
2129
+ but if you unroll them like this, it is a feed
2130
+
2131
+ 1:09:19.397 --> 1:09:20.142
2132
+ forward network.
2133
+
2134
+ 1:09:20.540 --> 1:09:38.063
2135
+ It is exactly the same, so you can measure your
2136
+ errors here and backpropagate your errors.
2137
+
2138
+ 1:09:38.378 --> 1:09:45.646
2139
+ If you unroll something, it's a feed-forward
2140
+ network and you can train it the same way.
2141
+
2142
+ 1:09:46.106 --> 1:09:57.606
2143
+ The only important thing is again, of course,
2144
+ for different inputs.
2145
+
2146
+ 1:09:57.837 --> 1:10:05.145
2147
+ But since parameters are shared, it's somehow
2148
+ a similar point you can train it.
2149
+
2150
+ 1:10:05.145 --> 1:10:08.800
2151
+ The training algorithm is very similar.
2152
+
2153
+ 1:10:10.310 --> 1:10:29.568
2154
+ One thing which makes things difficult is
2155
+ what is referred to as the vanishing gradient.
2156
+
2157
+ 1:10:29.809 --> 1:10:32.799
2158
+ That's a very strong thing in the motivation
2159
+ of using hardness.
2160
+
2161
+ 1:10:33.593 --> 1:10:44.604
2162
+ The influence here gets smaller and smaller,
2163
+ and the modems are not really able to monitor.
2164
+
2165
+ 1:10:44.804 --> 1:10:51.939
2166
+ Because the gradient gets smaller and smaller,
2167
+ and so the error here propagated to this one
2168
+
2169
+ 1:10:51.939 --> 1:10:58.919
2170
+ that contributes to the error is very small,
2171
+ and therefore you don't do any changes there
2172
+
2173
+ 1:10:58.919 --> 1:10:59.617
2174
+ anymore.
2175
+
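A toy numeric illustration of the vanishing gradient just described: if each backward step multiplies the error signal by a factor below one, the contribution of early time steps shrinks towards zero. The factor 0.5 is an arbitrary stand-in for the per-step derivative:

```python
# Backpropagating through many time steps multiplies the gradient by a
# Jacobian-like factor at every step; if that factor is below one, the
# contribution of early inputs shrinks towards zero.
grad = 1.0
factor = 0.5          # stand-in for the per-step derivative
for step in range(1, 31):
    grad *= factor
    if step in (5, 10, 20, 30):
        print(step, grad)
# After 30 steps the error signal reaching the first input is ~1e-9,
# so its weights barely change -- the vanishing gradient problem.
```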
2176
+ 1:11:00.020 --> 1:11:06.703
2177
+ And yeah, that's why standard RNNs are
2178
+ difficult to train, or you have to treat them with care.
2179
+
2180
+ 1:11:07.247 --> 1:11:11.462
2181
+ So when everybody is talking about RNNs
2182
+ nowadays,
2183
+
2184
+ 1:11:11.791 --> 1:11:23.333
2185
+ What we are typically meaning are LSTMs or
2186
+ long short-term memories.
2187
+
2188
+ 1:11:23.333 --> 1:11:30.968
2189
+ You see they are by now quite old already.
2190
+
2191
+ 1:11:31.171 --> 1:11:39.019
2192
+ So there was a model in the language model
2193
+ task.
2194
+
2195
+ 1:11:39.019 --> 1:11:44.784
2196
+ It's some more storing information.
2197
+
2198
+ 1:11:44.684 --> 1:11:51.556
2199
+ Because if you only look at the last words,
2200
+ it's often no longer clear this is a question
2201
+
2202
+ 1:11:51.556 --> 1:11:52.548
2203
+ or a normal.
2204
+
2205
+ 1:11:53.013 --> 1:12:05.318
2206
+ So there you have these mechanisms with gates
2207
+ in order to store things for a longer time
2208
+
2209
+ 1:12:05.318 --> 1:12:08.563
2210
+ into your hidden state.
2211
+
2212
+ 1:12:10.730 --> 1:12:20.162
2213
+ Here they are used in in in selling quite
2214
+ a lot of works.
2215
+
2216
+ 1:12:21.541 --> 1:12:29.349
2217
+ For especially machine translation now, the
2218
+ standard is to do Transformer-based models which
2219
+
2220
+ 1:12:29.349 --> 1:12:30.477
2221
+ we'll learn.
2222
+
2223
+ 1:12:30.690 --> 1:12:38.962
2224
+ But for example, in architecture we have later
2225
+ one lecture about efficiency.
2226
+
2227
+ 1:12:38.962 --> 1:12:42.830
2228
+ So how can we build very efficient?
2229
+
2230
+ 1:12:42.882 --> 1:12:53.074
2231
+ And there in the decoder in parts of the networks
2232
+ they are still using.
2233
+
2234
+ 1:12:53.473 --> 1:12:57.518
2235
+ So it's not that yeah our hands are of no
2236
+ importance in the body.
2237
+
2238
+ 1:12:59.239 --> 1:13:08.956
2239
+ In order to make them strong, there are some
2240
+ more things which are helpful and should be:
2241
+
2242
+ 1:13:09.309 --> 1:13:19.683
2243
+ So one thing is there is a nice trick to make
2244
+ this new network stronger and better.
2245
+
2246
+ 1:13:19.739 --> 1:13:21.523
2247
+ So of course it doesn't work always.
2248
+
2249
+ 1:13:21.523 --> 1:13:23.451
2250
+ They have to have enough training data.
2251
+
2252
+ 1:13:23.763 --> 1:13:28.959
2253
+ But in general there's the easiest way of
2254
+ making your models bigger and stronger just
2255
+
2256
+ 1:13:28.959 --> 1:13:30.590
2257
+ to increase your parameters.
2258
+
2259
+ 1:13:30.630 --> 1:13:43.236
2260
+ And you've seen that with a large language
2261
+ models they are always bragging about.
2262
+
2263
+ 1:13:43.903 --> 1:13:56.463
2264
+ This is one way, so the question is how do
2265
+ you get more parameters?
2266
+
2267
+ 1:13:56.463 --> 1:14:01.265
2268
+ There's ways of doing it.
2269
+
2270
+ 1:14:01.521 --> 1:14:10.029
2271
+ And the other thing is to make your networks
2272
+ deeper, so to have more layers in between.
2273
+
2274
+ 1:14:11.471 --> 1:14:13.827
2275
+ And then you can also get to get more calm.
2276
+
2277
+ 1:14:14.614 --> 1:14:23.340
2278
+ There's more traveling with this and it's
2279
+ very similar to what we just saw with our hand.
2280
+
2281
+ 1:14:23.603 --> 1:14:34.253
2282
+ We have this problem of gradient flow, that
2283
+ when it flows through many layers the gradient gets very
2284
+
2285
+ 1:14:34.253 --> 1:14:35.477
2286
+ small.
2287
+
2288
+ 1:14:35.795 --> 1:14:42.704
2289
+ Exactly the same thing happens in deep
2290
+ LSTMs.
2291
+
2292
+ 1:14:42.704 --> 1:14:52.293
2293
+ If you take here the gradient, tell you what
2294
+ is the right or wrong.
2295
+
2296
+ 1:14:52.612 --> 1:14:56.439
2297
+ With three layers it's no problem, but if
2298
+ you're going to ten, twenty or hundred layers.
2299
+
2300
+ 1:14:57.797 --> 1:14:59.698
2301
+ That's Getting Typically Young.
2302
+
2303
+ 1:15:00.060 --> 1:15:07.000
2304
+ What people are doing is using what is called residual
2305
+ connections.
2306
+
2307
+ 1:15:07.000 --> 1:15:15.855
2308
+ That's a very helpful idea, which is maybe
2309
+ very surprising that it works.
2310
+
2311
+ 1:15:15.956 --> 1:15:20.309
2312
+ And so the idea is that these networks.
2313
+
2314
+ 1:15:20.320 --> 1:15:29.982
2315
+ In between should no longer calculate what
2316
+ is a new good representation, but they're more
2317
+
2318
+ 1:15:29.982 --> 1:15:31.378
2319
+ calculating.
2320
+
2321
+ 1:15:31.731 --> 1:15:37.588
2322
+ Therefore, in the end you're always the output
2323
+ of a layer is added with the input.
2324
+
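A minimal sketch of the residual connection just described: the layer's output is added onto its unchanged input, so the layer only has to learn the change. Sizes and scaling are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 32
W = rng.normal(size=(d, d)) * 0.05

def layer(x):
    # The layer only has to learn how to *change* the representation ...
    return np.tanh(x @ W)

def residual_block(x):
    # ... because its output is added onto the unchanged input.
    return x + layer(x)

x = rng.normal(size=d)
y = residual_block(x)
print(np.linalg.norm(y - x))   # the change contributed by the layer
print(np.linalg.norm(y))       # the input itself is carried through unchanged
```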
2325
+ 1:15:38.318 --> 1:15:48.824
2326
+ The nice thing is later, if you are doing backpropagation,
2327
+ the error can flow very fast through these connections.
2328
+
2329
+ 1:15:49.209 --> 1:16:02.540
2330
+ Nowadays in very deep architectures, not only
2331
+ on other but always has this residual or highway
2332
+
2333
+ 1:16:02.540 --> 1:16:04.224
2334
+ connection.
2335
+
2336
+ 1:16:04.704 --> 1:16:06.616
2337
+ Has two advantages.
2338
+
2339
+ 1:16:06.616 --> 1:16:15.409
2340
+ On the one hand, these layers don't need to
2341
+ learn a representation, they only need to learn
2342
+
2343
+ 1:16:15.409 --> 1:16:18.754
2344
+ what to change the representation.
2345
+
2346
+ 1:16:22.082 --> 1:16:24.172
2347
+ Good.
2348
+
2349
+ 1:16:23.843 --> 1:16:31.768
2350
+ That much for the new map before, so the last
2351
+ thing now means this.
2352
+
2353
+ 1:16:31.671 --> 1:16:33.750
2354
+ Language was are yeah.
2355
+
2356
+ 1:16:33.750 --> 1:16:41.976
2357
+ I were used in the molds itself and now were
2358
+ seeing them again, but one thing which at the
2359
+
2360
+ 1:16:41.976 --> 1:16:53.558
2361
+ beginning they were reading was very essential
2362
+ was: So people really train part of the language
2363
+
2364
+ 1:16:53.558 --> 1:16:59.999
2365
+ models only to get this type of embedding.
2366
+
2367
+ 1:16:59.999 --> 1:17:04.193
2368
+ Therefore, we want to look.
2369
+
2370
+ 1:17:09.229 --> 1:17:15.678
2371
+ So now some last words to the word embeddings.
2372
+
2373
+ 1:17:15.678 --> 1:17:27.204
2374
+ The interesting thing is that word embeddings
2375
+ can be used for very different tasks.
2376
+
2377
+ 1:17:27.347 --> 1:17:31.329
2378
+ The nice thing is you can train that on just
2379
+ large amounts of data.
2380
+
2381
+ 1:17:31.931 --> 1:17:41.569
2382
+ And then if you have these word embeddings
2383
+ we have seen that they reduce the parameters.
2384
+
2385
+ 1:17:41.982 --> 1:17:52.217
2386
+ So then you can train your small mark to do
2387
+ any other task and therefore you are more efficient.
2388
+
2389
+ 1:17:52.532 --> 1:17:55.218
2390
+ These initial word embeddings is important.
2391
+
2392
+ 1:17:55.218 --> 1:18:00.529
2393
+ They really depend only on the word itself,
2394
+ so if you look at the two meanings of can,
2395
+
2396
+ 1:18:00.529 --> 1:18:06.328
2397
+ the can of beans or I can do that, they will
2398
+ have the same embedding, so some of the embedding
2399
+
2400
+ 1:18:06.328 --> 1:18:08.709
2401
+ has to save the ambiguity inside that.
2402
+
2403
+ 1:18:09.189 --> 1:18:12.486
2404
+ That cannot be resolved.
2405
+
2406
+ 1:18:12.486 --> 1:18:24.753
2407
+ Therefore, if you look at the higher levels
2408
+ in the context, but in the word embedding layers
2409
+
2410
+ 1:18:24.753 --> 1:18:27.919
2411
+ that really depends on.
2412
+
2413
+ 1:18:29.489 --> 1:18:33.757
2414
+ However, even this one has quite very interesting.
2415
+
2416
+ 1:18:34.034 --> 1:18:39.558
2417
+ So that people like to visualize them.
2418
+
2419
+ 1:18:39.558 --> 1:18:47.208
2420
+ They're always difficult because if you look
2421
+ at this.
2422
+
2423
+ 1:18:47.767 --> 1:18:52.879
2424
+ And drawing your five-hundred-dimensional
2425
+ vector is still a bit challenging.
2426
+
2427
+ 1:18:53.113 --> 1:19:12.472
2428
+ So you cannot directly do that, so people
2429
+ have to do it like they look at some type of.
2430
+
2431
+ 1:19:13.073 --> 1:19:17.209
2432
+ And of course then yes some information is
2433
+ getting lost by a bunch of control.
2434
+
2435
+ 1:19:18.238 --> 1:19:24.802
2436
+ And you see, for example, this is the most
2437
+ famous and common example, so what you can
2438
+
2439
+ 1:19:24.802 --> 1:19:31.289
2440
+ look is you can look at the difference between
2441
+ the male and the female word in English.
2442
+
2443
+ 1:19:31.289 --> 1:19:37.854
2444
+ This is here in your embedding of king, and
2445
+ this is the embedding of queen, and this.
2446
+
2447
+ 1:19:38.058 --> 1:19:40.394
2448
+ You can do that for a very different work.
2449
+
2450
+ 1:19:40.780 --> 1:19:45.407
2451
+ And that is where the masks come into, that
2452
+ is what people then look into.
2453
+
2454
+ 1:19:45.725 --> 1:19:50.995
2455
+ So what you can now, for example, do is you
2456
+ can calculate the difference between man and
2457
+
2458
+ 1:19:50.995 --> 1:19:51.410
2459
+ woman?
2460
+
2461
+ 1:19:52.232 --> 1:19:55.511
2462
+ Then you can take the embedding of tea.
2463
+
2464
+ 1:19:55.511 --> 1:20:02.806
2465
+ You can add on it the difference between man
2466
+ and woman, and then you can notice what are
2467
+
2468
+ 1:20:02.806 --> 1:20:04.364
2469
+ the similar words.
2470
+
2471
+ 1:20:04.364 --> 1:20:08.954
2472
+ So you won't, of course, directly hit the
2473
+ correct word.
2474
+
2475
+ 1:20:08.954 --> 1:20:10.512
2476
+ It's a continuous.
2477
+
2478
+ 1:20:10.790 --> 1:20:23.127
2479
+ But you can look what are the nearest neighbors
2480
+ to this same, and often these words are near
2481
+
2482
+ 1:20:23.127 --> 1:20:24.056
2483
+ there.
2484
+
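The offset-and-nearest-neighbor procedure described above can be sketched on a toy embedding table; the vectors below are hand-made for illustration and not learned embeddings:

```python
import numpy as np

# Tiny hand-made embedding table (values are illustrative only).
emb = {
    "man":   np.array([0.9, 0.1, 0.0]),
    "woman": np.array([0.9, 0.1, 1.0]),
    "king":  np.array([0.1, 0.9, 0.0]),
    "queen": np.array([0.1, 0.9, 1.0]),
}

def nearest(vec, exclude=()):
    # Cosine similarity against every word in the (toy) vocabulary.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], vec))

# king + (woman - man): the offset encodes the male/female difference,
# and the nearest remaining word is "queen".
query = emb["king"] + (emb["woman"] - emb["man"])
print(nearest(query, exclude={"king", "man", "woman"}))  # queen
```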
2485
+ 1:20:24.224 --> 1:20:33.913
2486
+ So it somehow learns that the difference between
2487
+ these words is always the same.
2488
+
2489
+ 1:20:34.374 --> 1:20:37.746
2490
+ You can do that for different things.
2491
+
2492
+ 1:20:37.746 --> 1:20:41.296
2493
+ He also imagines that it's not perfect.
2494
+
2495
+ 1:20:41.296 --> 1:20:49.017
2496
+ He says the world tends to be swimming and
2497
+ swimming, and with walking and walking you.
2498
+
2499
+ 1:20:49.469 --> 1:20:51.639
2500
+ So you can try to use them.
2501
+
2502
+ 1:20:51.639 --> 1:20:59.001
2503
+ It's no longer like saying yeah, but the interesting
2504
+ thing is this is completely unsupervised.
2505
+
2506
+ 1:20:59.001 --> 1:21:03.961
2507
+ So nobody taught him the principle of their
2508
+ gender in language.
2509
+
2510
+ 1:21:04.284 --> 1:21:09.910
2511
+ So it's purely trained on the task of doing
2512
+ the next word prediction.
2513
+
2514
+ 1:21:10.230 --> 1:21:20.658
2515
+ And even for really semantic information
2516
+ like the capital, this is the difference between
2517
+
2518
+ 1:21:20.658 --> 1:21:23.638
2519
+ the city and the capital.
2520
+
2521
+ 1:21:23.823 --> 1:21:25.518
2522
+ Visualization.
2523
+
2524
+ 1:21:25.518 --> 1:21:33.766
2525
+ Here we have done the same things of the difference
2526
+ between country and.
2527
+
2528
+ 1:21:33.853 --> 1:21:41.991
2529
+ You see it's not perfect, but it's building
2530
+ some kinds of a right direction, so you can't
2531
+
2532
+ 1:21:41.991 --> 1:21:43.347
2533
+ even use them.
2534
+
2535
+ 1:21:43.347 --> 1:21:51.304
2536
+ For example, for question answering, if you
2537
+ have the difference between them, you apply
2538
+
2539
+ 1:21:51.304 --> 1:21:53.383
2540
+ that to a new country.
2541
+
2542
+ 1:21:54.834 --> 1:22:02.741
2543
+ So it seems these ones are able to really
2544
+ learn a lot of information and collapse all
2545
+
2546
+ 1:22:02.741 --> 1:22:04.396
2547
+ this information.
2548
+
2549
+ 1:22:05.325 --> 1:22:11.769
2550
+ At just to do the next word prediction: And
2551
+ that also explains a bit maybe or not explains
2552
+
2553
+ 1:22:11.769 --> 1:22:19.016
2554
+ wrong life by motivating why what is the main
2555
+ advantage of this type of neural models that
2556
+
2557
+ 1:22:19.016 --> 1:22:26.025
2558
+ we can use this type of hidden representation,
2559
+ transfer them and use them in different.
2560
+
2561
+ 1:22:28.568 --> 1:22:43.707
2562
+ So summarize what we did today, so what you
2563
+ should hopefully have with you is for machine
2564
+
2565
+ 1:22:43.707 --> 1:22:45.893
2566
+ translation.
2567
+
2568
+ 1:22:45.805 --> 1:22:49.149
2569
+ Then how we can do language modeling with
2570
+ neural networks.
2571
+
2572
+ 1:22:49.449 --> 1:22:55.617
2573
+ We looked at three different architectures:
2574
+ We looked into the feed-forward language model
2575
+
2576
+ 1:22:55.617 --> 1:22:59.063
2577
+ and the one based on Boltzmann machines.
2578
+
2579
+ 1:22:59.039 --> 1:23:05.366
2580
+ And finally there are different architectures
2581
+ to do in your networks.
2582
+
2583
+ 1:23:05.366 --> 1:23:14.404
2584
+ We have seen feed-forward networks and we'll
2585
+ see the next lectures, the last type of architecture.
2586
+
2587
+ 1:23:15.915 --> 1:23:17.412
2588
+ Have Any Questions.
2589
+
2590
+ 1:23:20.680 --> 1:23:27.341
2591
+ Then thanks a lot, and next on Tuesday we
2592
+ will be again in our order to know how to play.
2593
+
demo_data/lectures/Lecture-07-11.05.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ee1fc2af8bf4d95a18dacaa3d5d9aad8c6c207e0f5f63090a9adefcfcf29f418
3
+ size 150440033
demo_data/lectures/Lecture-07-16.05.2023/English.vtt ADDED
The diff for this file is too large to render. See raw diff
 
demo_data/lectures/Lecture-07-16.05.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ee1fc2af8bf4d95a18dacaa3d5d9aad8c6c207e0f5f63090a9adefcfcf29f418
3
+ size 150440033
demo_data/lectures/Lecture-09-25.05.2023/English.vtt ADDED
@@ -0,0 +1,3031 @@
1
+ WEBVTT
2
+
3
+ 0:00:01.721 --> 0:00:05.064
4
+ Hey, and then welcome to today's lecture.
5
+
6
+ 0:00:06.126 --> 0:00:13.861
7
+ What we want to do today is we will finish
8
+ with what we have done last time, so we started
9
+
10
+ 0:00:13.861 --> 0:00:22.192
11
+ looking at the neural machine translation system,
12
+ but we have had all the components of the sequence
13
+
14
+ 0:00:22.192 --> 0:00:22.787
15
+ model.
16
+
17
+ 0:00:22.722 --> 0:00:29.361
18
+ What we're still missing is the transformer-based
19
+ architecture, so mainly the self-attention.
20
+
21
+ 0:00:29.849 --> 0:00:31.958
22
+ Then we want to look at the beginning today.
23
+
24
+ 0:00:32.572 --> 0:00:39.315
25
+ And then the main part of the day's lecture
26
+ will be decoding.
27
+
28
+ 0:00:39.315 --> 0:00:43.992
29
+ That means we know how to train the model.
30
+
31
+ 0:00:44.624 --> 0:00:47.507
32
+ So decoding sewage all they can be.
33
+
34
+ 0:00:47.667 --> 0:00:53.359
35
+ Be useful that and the idea is how we find
36
+ that and what challenges are there.
37
+
38
+ 0:00:53.359 --> 0:00:59.051
39
+ Since it's autoregressive, we will see that
40
+ it's not as easy as for other tasks.
41
+
42
+ 0:00:59.359 --> 0:01:08.206
43
+ While generating the translation step by step,
44
+ we might make additional errors that lead to a worse translation.
45
+
46
+ 0:01:09.069 --> 0:01:16.464
47
+ But let's start with the self-attention, so
48
+ what we looked into was an RNN-based model.
49
+
50
+ 0:01:16.816 --> 0:01:27.931
51
+ And then in RNN-based models you always take
52
+ the last hidden state, you take your input, you
53
+
54
+ 0:01:27.931 --> 0:01:31.513
55
+ generate a new hidden state.
56
+
57
+ 0:01:31.513 --> 0:01:35.218
58
+ This is more like a standard.
59
+
60
+ 0:01:35.675 --> 0:01:41.088
61
+ And one challenge in this is that we always
62
+ store all our history in one single hidden
63
+
64
+ 0:01:41.088 --> 0:01:41.523
65
+ state.
66
+
67
+ 0:01:41.781 --> 0:01:50.235
68
+ We saw that this is a problem when going from
69
+ encoder to decoder, and that is why we then
70
+
71
+ 0:01:50.235 --> 0:01:58.031
72
+ introduced the attention mechanism so that
73
+ we can look back and see all the parts.
74
+
75
+ 0:01:59.579 --> 0:02:06.059
76
+ However, in the decoder we still have this
77
+ issue so we are still storing all information
78
+
79
+ 0:02:06.059 --> 0:02:12.394
80
+ in one hidden state and we might do things
81
+ like here that we start to overwrite things
82
+
83
+ 0:02:12.394 --> 0:02:13.486
84
+ and we forget.
85
+
86
+ 0:02:14.254 --> 0:02:23.575
87
+ So the idea is, can we do something similar
88
+ which we do between encoder and decoder within
89
+
90
+ 0:02:23.575 --> 0:02:24.907
91
+ the decoder?
92
+
93
+ 0:02:26.526 --> 0:02:33.732
94
+ And the idea is each time we're generating
95
+ here a new hidden state, it will not only depend
96
+
97
+ 0:02:33.732 --> 0:02:40.780
98
+ on the previous one, but we will focus on the
99
+ whole sequence and look at different parts
100
+
101
+ 0:02:40.780 --> 0:02:46.165
102
+ as we did in attention in order to generate
103
+ our new representation.
104
+
105
+ 0:02:46.206 --> 0:02:53.903
106
+ So each time we generate a new representation
107
+ we will look into what is important now to
108
+
109
+ 0:02:53.903 --> 0:02:54.941
110
+ understand.
111
+
112
+ 0:02:55.135 --> 0:03:00.558
113
+ You may want to understand what much is important.
114
+
115
+ 0:03:00.558 --> 0:03:08.534
116
+ You might want to look to vary and to like
117
+ so that it's much about liking.
118
+
119
+ 0:03:08.808 --> 0:03:24.076
120
+ So the idea is that we are not storing everything
121
+ in each time we are looking at the full sequence.
122
+
123
+ 0:03:25.125 --> 0:03:35.160
124
+ And that is achieved by no longer going really
125
+ sequential, and the hidden states here aren't dependent
126
+
127
+ 0:03:35.160 --> 0:03:37.086
128
+ on the same layer.
129
+
130
+ 0:03:37.086 --> 0:03:42.864
131
+ But instead we are always looking at the previous
132
+ layer.
133
+
134
+ 0:03:42.942 --> 0:03:45.510
135
+ We will always have more information that
136
+ we are coming.
137
+
138
+ 0:03:47.147 --> 0:03:51.572
139
+ So how does this self-attention work in detail?
140
+
141
+ 0:03:51.572 --> 0:03:56.107
142
+ So we started with our initial mistakes.
143
+
144
+ 0:03:56.107 --> 0:04:08.338
145
+ So, for example: Now where we had the three
146
+ terms already, the query, the key and the value,
147
+
148
+ 0:04:08.338 --> 0:04:12.597
149
+ it was motivated by our database.
150
+
151
+ 0:04:12.772 --> 0:04:20.746
152
+ We are comparing it to the keys to all the
153
+ other values, and then we are merging the values.
154
+
155
+ 0:04:21.321 --> 0:04:35.735
156
+ There was a difference between the decoder
157
+ and the encoder.
158
+
159
+ 0:04:35.775 --> 0:04:41.981
160
+ You can assume all the same because we are
161
+ curving ourselves.
162
+
163
+ 0:04:41.981 --> 0:04:49.489
164
+ However, we can make them different but just
165
+ learning a linear projection.
166
+
167
+ 0:04:49.529 --> 0:05:01.836
168
+ So you learn here some projection based on
169
+ what need to do in order to ask which question.
170
+
171
+ 0:05:02.062 --> 0:05:11.800
172
+ That is, the query and the key is to what
173
+ do want to compare and provide others, and
174
+
175
+ 0:05:11.800 --> 0:05:13.748
176
+ which values do.
177
+
178
+ 0:05:14.014 --> 0:05:23.017
179
+ This is not like hand defined, but learn,
180
+ so it's like three linear projections that
181
+
182
+ 0:05:23.017 --> 0:05:26.618
183
+ you apply on all of these hidden.
184
+
185
+ 0:05:26.618 --> 0:05:32.338
186
+ That is the first thing based on your initial
187
+ hidden.
188
+
189
+ 0:05:32.612 --> 0:05:37.249
190
+ And now you can do exactly as before, you
191
+ can do the attention.
192
+
193
+ 0:05:37.637 --> 0:05:40.023
194
+ How did the attention work?
195
+
196
+ 0:05:40.023 --> 0:05:45.390
197
+ The first thing is we are comparing our query
198
+ to all the keys.
199
+
200
+ 0:05:45.445 --> 0:05:52.713
201
+ And that is now the difference before the
202
+ query was from the decoder, the keys were
203
+
204
+ 0:05:52.713 --> 0:05:54.253
205
+ from the encoder.
206
+
207
+ 0:05:54.253 --> 0:06:02.547
208
+ Now it's like all from the same, so we started
209
+ the first in state to the keys of all the others.
210
+
211
+ 0:06:02.582 --> 0:06:06.217
212
+ We're learning some value here.
213
+
214
+ 0:06:06.217 --> 0:06:12.806
215
+ How important are these information to better
216
+ understand?
217
+
218
+ 0:06:13.974 --> 0:06:19.103
219
+ And these are just like floating point numbers.
220
+
221
+ 0:06:19.103 --> 0:06:21.668
222
+ They are normalized so.
223
+
224
+ 0:06:22.762 --> 0:06:30.160
225
+ And that is the first step, so let's go first
226
+ for the first curve.
227
+
228
+ 0:06:30.470 --> 0:06:41.937
229
+ What we can then do is multiply each value
230
+ as we have done before with the importance
231
+
232
+ 0:06:41.937 --> 0:06:43.937
233
+ of each state.
234
+
235
+ 0:06:45.145 --> 0:06:47.686
236
+ And then we have in here the new hidden state.
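As a rough illustration of the mechanism just described (learned query/key/value projections, comparing the query with all keys, and a weighted sum of the values), here is a minimal NumPy sketch. The shapes, random initialisation and function name are assumptions for illustration only, not the lecturer's implementation.

```python
import numpy as np

def self_attention(H, Wq, Wk, Wv):
    """H: (seq_len, d) hidden states of the previous layer."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv           # three learned linear projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # compare every query to every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V                         # weighted sum of the values

rng = np.random.default_rng(0)
d = 8
H = rng.normal(size=(5, d))                    # 5 positions, all computed in parallel
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(H, Wq, Wk, Wv).shape)     # (5, 8): one new state per position
```

Note that every output row only needs the previous layer's states, which is why all positions of a layer can be computed in parallel, as the lecture points out next.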
237
+
238
+ 0:06:48.308 --> 0:06:57.862
239
+ See now this new hidden state is depending
240
+ on all the hidden states of the whole sequence
241
+
242
+ 0:06:57.862 --> 0:06:59.686
243
+ of the previous layer.
244
+
245
+ 0:06:59.879 --> 0:07:01.739
246
+ One important thing.
247
+
248
+ 0:07:01.739 --> 0:07:08.737
249
+ This one doesn't really depend, so the hidden
250
+ states here don't depend on the.
251
+
252
+ 0:07:09.029 --> 0:07:15.000
253
+ So it only depends on the hidden state of
254
+ the previous layer, but it depends on all the
255
+
256
+ 0:07:15.000 --> 0:07:18.664
257
+ hidden states, and that is of course a big
258
+ advantage.
259
+
260
+ 0:07:18.664 --> 0:07:25.111
261
+ So on the one hand information can directly
262
+ flow from each hidden state before the information
263
+
264
+ 0:07:25.111 --> 0:07:27.214
265
+ flow was always a bit limited.
266
+
267
+ 0:07:28.828 --> 0:07:35.100
268
+ And the independence is important so we can
269
+ calculate all these in the states in parallel.
270
+
271
+ 0:07:35.100 --> 0:07:41.371
272
+ That's another big advantage of self attention
273
+ that we can calculate all the hidden states
274
+
275
+ 0:07:41.371 --> 0:07:46.815
276
+ in one layer in parallel and therefore it's
277
+ the ad designed for GPUs and fast.
278
+
279
+ 0:07:47.587 --> 0:07:50.235
280
+ Then we can do the same thing for the second
281
+ in the state.
282
+
283
+ 0:07:50.530 --> 0:08:06.866
284
+ And the only difference here is how we calculate
285
+ what is occurring.
286
+
287
+ 0:08:07.227 --> 0:08:15.733
288
+ Getting these values is different because
289
+ we use the different query and then getting
290
+
291
+ 0:08:15.733 --> 0:08:17.316
292
+ our new hidden.
293
+
294
+ 0:08:18.258 --> 0:08:26.036
295
+ Yes, this is the word of words that underneath
296
+ this case might, but this is simple.
297
+
298
+ 0:08:26.036 --> 0:08:26.498
299
+ Not.
300
+
301
+ 0:08:27.127 --> 0:08:33.359
302
+ That's a very good question that is like on
303
+ the initial thing.
304
+
305
+ 0:08:33.359 --> 0:08:38.503
306
+ That is exactly not one of you in the architecture.
307
+
308
+ 0:08:38.503 --> 0:08:44.042
309
+ Maybe first you would think of a very big
310
+ disadvantage.
311
+
312
+ 0:08:44.384 --> 0:08:49.804
313
+ So this hidden state would be the same if
314
+ the movie would be different.
315
+
316
+ 0:08:50.650 --> 0:08:59.983
317
+ And of course this estate is a site someone
318
+ should like, so if the estate would be here
319
+
320
+ 0:08:59.983 --> 0:09:06.452
321
+ except for this correspondence the word order
322
+ is completely.
323
+
324
+ 0:09:06.706 --> 0:09:17.133
325
+ Therefore, just doing self attention wouldn't
326
+ work at all because we know word order is important
327
+
328
+ 0:09:17.133 --> 0:09:21.707
329
+ and there is a complete different meaning.
330
+
331
+ 0:09:22.262 --> 0:09:26.277
332
+ We introduce the word position again.
333
+
334
+ 0:09:26.277 --> 0:09:33.038
335
+ The main idea is if the position is already
336
+ in your embeddings.
337
+
338
+ 0:09:33.533 --> 0:09:39.296
339
+ Then of course the position is there and you
340
+ don't lose it anymore.
341
+
342
+ 0:09:39.296 --> 0:09:46.922
343
+ So mainly if your life representation here
344
+ encodes at the second position and your output
345
+
346
+ 0:09:46.922 --> 0:09:48.533
347
+ will be different.
348
+
349
+ 0:09:49.049 --> 0:09:54.585
350
+ And that's how you encode it, but that's essential
351
+ in order to get this work.
352
+
353
+ 0:09:57.137 --> 0:10:08.752
354
+ But before we are coming to the next slide,
355
+ one other thing that is typically done is multi-head
356
+
357
+ 0:10:08.752 --> 0:10:10.069
358
+ attention.
359
+
360
+ 0:10:10.430 --> 0:10:15.662
361
+ And it might be that in order to understand
362
+ much, it might be good that in some way we
363
+
364
+ 0:10:15.662 --> 0:10:19.872
365
+ focus on life, and in some way we can focus
366
+ on vary, but not equally.
367
+
368
+ 0:10:19.872 --> 0:10:25.345
369
+ But maybe it's like to understand again on
370
+ different dimensions we should look into these.
371
+
372
+ 0:10:25.905 --> 0:10:31.393
373
+ And therefore what we're doing is we're just
374
+ doing the self attention at once, but we're
375
+
376
+ 0:10:31.393 --> 0:10:35.031
377
+ doing it n times based on your multi-head
378
+ attentions.
379
+
380
+ 0:10:35.031 --> 0:10:41.299
381
+ So in typical examples, the number of heads
382
+ people are talking about is like: So you're
383
+
384
+ 0:10:41.299 --> 0:10:50.638
385
+ doing this process and have different queries
386
+ and keys so you can focus.
387
+
388
+ 0:10:50.790 --> 0:10:52.887
389
+ How can you generate eight different?
390
+
391
+ 0:10:53.593 --> 0:11:07.595
392
+ Things: it's quite easy here, so instead of
393
+ having one linear projection you can have eight
394
+
395
+ 0:11:07.595 --> 0:11:09.326
396
+ different.
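A hedged sketch of the multi-head idea from this passage: the same attention is run several times with separate learned projections, and the results are concatenated so that each head can focus on something different. All sizes and helper names below are illustrative assumptions.

```python
import numpy as np

def attention(H, Wq, Wk, Wv):
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True)); w /= w.sum(-1, keepdims=True)
    return w @ V

def multi_head(H, heads):
    """heads: list of (Wq, Wk, Wv) tuples, one per head."""
    return np.concatenate([attention(H, *h) for h in heads], axis=-1)

rng = np.random.default_rng(1)
d, d_head, n_heads = 8, 4, 8                   # e.g. eight heads, as mentioned above
H = rng.normal(size=(5, d))
heads = [tuple(rng.normal(size=(d, d_head)) for _ in range(3)) for _ in range(n_heads)]
print(multi_head(H, heads).shape)              # (5, 32): 8 heads x 4 dims each
```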
397
+
398
+ 0:11:09.569 --> 0:11:13.844
399
+ And it might be that sometimes you're looking
400
+ more into one thing, and sometimes you're Looking
401
+
402
+ 0:11:13.844 --> 0:11:14.779
403
+ more into the other.
404
+
405
+ 0:11:15.055 --> 0:11:24.751
406
+ So that's of course nice with this type of
407
+ learned approach because we can automatically
408
+
409
+ 0:11:24.751 --> 0:11:25.514
410
+ learn.
411
+
412
+ 0:11:29.529 --> 0:11:36.629
413
+ And what you correctly said is its positional
414
+ independence, so it doesn't really matter the
415
+
416
+ 0:11:36.629 --> 0:11:39.176
417
+ order which should be important.
418
+
419
+ 0:11:39.379 --> 0:11:47.686
420
+ So how can we do that and the idea is we are
421
+ just encoding it directly into the embedding
422
+
423
+ 0:11:47.686 --> 0:11:52.024
424
+ so into the starting so that a representation.
425
+
426
+ 0:11:52.512 --> 0:11:55.873
427
+ How do we get that so we started with our
428
+ embeddings?
429
+
430
+ 0:11:55.873 --> 0:11:58.300
431
+ Just imagine this is embedding of eye.
432
+
433
+ 0:11:59.259 --> 0:12:06.169
434
+ And then we are having additionally this positional
435
+ encoding.
436
+
437
+ 0:12:06.169 --> 0:12:10.181
438
+ In this position, encoding is just.
439
+
440
+ 0:12:10.670 --> 0:12:19.564
441
+ With different wavelength, so with different
442
+ lengths of your signal as you see here.
443
+
444
+ 0:12:20.160 --> 0:12:37.531
445
+ And the number of functions you have is exactly
446
+ the number of dimensions you have in your embedded.
447
+
448
+ 0:12:38.118 --> 0:12:51.091
449
+ And what will then do is take the first one,
450
+ and based on your position you multiply your
451
+
452
+ 0:12:51.091 --> 0:12:51.955
453
+ word.
454
+
455
+ 0:12:52.212 --> 0:13:02.518
456
+ And you see now if you put it in this position,
457
+ of course it will get a different value.
458
+
459
+ 0:13:03.003 --> 0:13:12.347
460
+ And thereby in each position a different function
461
+ is multiplied.
462
+
463
+ 0:13:12.347 --> 0:13:19.823
464
+ This is the representation for a word at the first
465
+ position.
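The sinusoidal position encoding described here can be sketched as below; it follows the common transformer formulation (one sine or cosine wave per embedding dimension, with different wavelengths, added to the word embedding). The concrete sizes are only examples.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                  # positions 0..seq_len-1
    i = np.arange(d_model)[None, :]                    # embedding dimensions
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions: cosine
    return pe

embeddings = np.random.default_rng(2).normal(size=(6, 16))   # toy word embeddings
inputs = embeddings + positional_encoding(6, 16)              # same word gets a
print(inputs.shape)                                           # different value per position
```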
466
+
467
+ 0:13:20.020 --> 0:13:34.922
468
+ If you have it in the input already encoded
469
+ then of course the model is able to keep the
470
+
471
+ 0:13:34.922 --> 0:13:38.605
472
+ position information.
473
+
474
+ 0:13:38.758 --> 0:13:48.045
475
+ But your embeddings can also learn your embeddings
476
+ in a way that they are optimal collaborating
477
+
478
+ 0:13:48.045 --> 0:13:49.786
479
+ with these types.
480
+
481
+ 0:13:51.451 --> 0:13:59.351
482
+ Is that somehow clear where he is there?
483
+
484
+ 0:14:06.006 --> 0:14:13.630
485
+ Am the first position and second position?
486
+
487
+ 0:14:16.576 --> 0:14:17.697
488
+ Have a long wait period.
489
+
490
+ 0:14:17.697 --> 0:14:19.624
491
+ I'm not going to tell you how to turn the.
492
+
493
+ 0:14:21.441 --> 0:14:26.927
494
+ Be completely issued because if you have a
495
+ very short wavelength there might be quite
496
+
497
+ 0:14:26.927 --> 0:14:28.011
498
+ big differences.
499
+
500
+ 0:14:28.308 --> 0:14:33.577
501
+ And it might also be that then it depends,
502
+ of course, like what type of world embedding
503
+
504
+ 0:14:33.577 --> 0:14:34.834
505
+ you've learned like.
506
+
507
+ 0:14:34.834 --> 0:14:37.588
508
+ Is the dimension where you have long changes?
509
+
510
+ 0:14:37.588 --> 0:14:43.097
511
+ Is the report for your embedding or not so
512
+ that's what I mean so that the model can somehow
513
+
514
+ 0:14:43.097 --> 0:14:47.707
515
+ learn that by putting more information into
516
+ one of the embedding dimensions?
517
+
518
+ 0:14:48.128 --> 0:14:54.560
519
+ So incorporated and would assume it's learning
520
+ it a bit haven't seen.
521
+
522
+ 0:14:54.560 --> 0:14:57.409
523
+ Details studied how different.
524
+
525
+ 0:14:58.078 --> 0:15:07.863
526
+ It's also a bit difficult because really measuring
527
+ how similar or different a world isn't that
528
+
529
+ 0:15:07.863 --> 0:15:08.480
530
+ easy.
531
+
532
+ 0:15:08.480 --> 0:15:13.115
533
+ You can do, of course, the average distance.
534
+
535
+ 0:15:14.114 --> 0:15:21.393
536
+ Them, so are the weight tags not at model
537
+ two, or is there fixed weight tags that the
538
+
539
+ 0:15:21.393 --> 0:15:21.986
540
+ model.
541
+
542
+ 0:15:24.164 --> 0:15:30.165
543
+ To believe they are fixed and the mono learns
544
+ there's a different way of doing it.
545
+
546
+ 0:15:30.165 --> 0:15:32.985
547
+ The other thing you can do is you can.
548
+
549
+ 0:15:33.213 --> 0:15:36.945
550
+ So you can learn the second embedding which
551
+ says this is position one.
552
+
553
+ 0:15:36.945 --> 0:15:38.628
554
+ This is position two and so on.
555
+
556
+ 0:15:38.628 --> 0:15:42.571
557
+ Like for words you could learn fixed embeddings
558
+ and then add them upwards.
559
+
560
+ 0:15:42.571 --> 0:15:45.094
561
+ So then it would have the same thing it's
562
+ done.
563
+
564
+ 0:15:45.094 --> 0:15:46.935
565
+ There is one disadvantage of this.
566
+
567
+ 0:15:46.935 --> 0:15:51.403
568
+ There is anybody an idea what could be the
569
+ disadvantage of a more learned embedding.
570
+
571
+ 0:15:54.955 --> 0:16:00.000
572
+ Here maybe extra play this finger and ethnic
573
+ stuff that will be an art.
574
+
575
+ 0:16:00.000 --> 0:16:01.751
576
+ This will be an art for.
577
+
578
+ 0:16:02.502 --> 0:16:08.323
579
+ You would only be good at positions you have
580
+ seen often and especially for long sequences.
581
+
582
+ 0:16:08.323 --> 0:16:14.016
583
+ You might have seen the positions very rarely
584
+ and then normally not performing that well
585
+
586
+ 0:16:14.016 --> 0:16:17.981
587
+ while here it can better learn a more general
588
+ representation.
589
+
590
+ 0:16:18.298 --> 0:16:22.522
591
+ So that is another thing which we won't discuss
592
+ here.
593
+
594
+ 0:16:22.522 --> 0:16:25.964
595
+ Guess is what is called relative attention.
596
+
597
+ 0:16:25.945 --> 0:16:32.570
598
+ And in this case you don't learn absolute
599
+ positions, but in your calculation of the similarity
600
+
601
+ 0:16:32.570 --> 0:16:39.194
602
+ you take again the relative distance into account
603
+ and have a different similarity depending on
604
+
605
+ 0:16:39.194 --> 0:16:40.449
606
+ how far they are.
607
+
608
+ 0:16:40.660 --> 0:16:45.898
609
+ And then you don't need to encode it beforehand,
610
+ but you would more happen within your comparison.
611
+
612
+ 0:16:46.186 --> 0:16:53.471
613
+ So when you compare how similar things you
614
+ print, of course also take the relative position.
615
+
616
+ 0:16:55.715 --> 0:17:03.187
617
+ Because there are multiple ways to use the
618
+ one, to multiply all the embedding, or to use
619
+
620
+ 0:17:03.187 --> 0:17:03.607
621
+ all.
622
+
623
+ 0:17:17.557 --> 0:17:21.931
624
+ The encoder can be bidirectional.
625
+
626
+ 0:17:21.931 --> 0:17:30.679
627
+ We have everything from the beginning so we
628
+ can have a model where.
629
+
630
+ 0:17:31.111 --> 0:17:36.455
631
+ Decoder training of course has also everything
632
+ available but during inference you always have
633
+
634
+ 0:17:36.455 --> 0:17:41.628
635
+ only the past available so you can only look
636
+ into the previous one and not into the future
637
+
638
+ 0:17:41.628 --> 0:17:46.062
639
+ because if you generate word by word you don't
640
+ know what it will be there in.
641
+
642
+ 0:17:46.866 --> 0:17:53.180
643
+ And so we also have to consider this somehow
644
+ in the attention, and until now we look more
645
+
646
+ 0:17:53.180 --> 0:17:54.653
647
+ at the ecoder style.
648
+
649
+ 0:17:54.653 --> 0:17:58.652
650
+ So if you look at this type of model, it's
651
+ by direction.
652
+
653
+ 0:17:58.652 --> 0:18:03.773
654
+ So for this hill state we are looking into
655
+ the past and into the future.
656
+
657
+ 0:18:04.404 --> 0:18:14.436
658
+ So the question is, can we have to do this
659
+ like unidirectional so that you only look into
660
+
661
+ 0:18:14.436 --> 0:18:15.551
662
+ the past?
663
+
664
+ 0:18:15.551 --> 0:18:22.573
665
+ And the nice thing is, this is even easier
666
+ than for our hands.
667
+
668
+ 0:18:23.123 --> 0:18:29.738
669
+ So we would have different types of parameters
670
+ and models because you have a forward direction.
671
+
672
+ 0:18:31.211 --> 0:18:35.679
673
+ For attention, that is very simple.
674
+
675
+ 0:18:35.679 --> 0:18:39.403
676
+ We are doing what is masking.
677
+
678
+ 0:18:39.403 --> 0:18:45.609
679
+ If you want to have a backward model, these
680
+ ones.
681
+
682
+ 0:18:45.845 --> 0:18:54.355
683
+ So for the first hidden state there is nothing before,
684
+ so it's maybe only looking at itself.
685
+
686
+ 0:18:54.894 --> 0:19:05.310
687
+ By the second it looks on the second and the
688
+ third, so you're always masking all values
689
+
690
+ 0:19:05.310 --> 0:19:07.085
691
+ in the future.
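A small sketch of the masking just described: before the softmax, every score that points to a future position is set to minus infinity, so each state only attends to itself and to earlier positions. Shapes are illustrative assumptions.

```python
import numpy as np

def causal_attention_weights(scores):
    """scores: (seq_len, seq_len) raw attention scores."""
    seq_len = scores.shape[-1]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # strictly upper triangle
    masked = np.where(future, -np.inf, scores)                      # hide the future
    w = np.exp(masked - masked.max(-1, keepdims=True))
    return w / w.sum(-1, keepdims=True)

scores = np.random.default_rng(3).normal(size=(4, 4))
print(np.round(causal_attention_weights(scores), 2))  # row i has zeros right of position i
```

The same parameters can therefore serve a bidirectional encoder (no mask) or a unidirectional decoder (with this mask).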
692
+
693
+ 0:19:07.507 --> 0:19:13.318
694
+ And thereby you can have with the same parameters
695
+ the same model.
696
+
697
+ 0:19:13.318 --> 0:19:15.783
698
+ You can have then a unique.
699
+
700
+ 0:19:16.156 --> 0:19:29.895
701
+ In the decoder you do the masked self attention
702
+ where you only look into the past and you don't
703
+
704
+ 0:19:29.895 --> 0:19:30.753
705
+ look.
706
+
707
+ 0:19:32.212 --> 0:19:36.400
708
+ Then we only have, of course, looked onto
709
+ itself.
710
+
711
+ 0:19:36.616 --> 0:19:50.903
712
+ So the question: How can we combine forward
713
+ and decoder and then we can do a decoder and
714
+
715
+ 0:19:50.903 --> 0:19:54.114
716
+ just have a second?
717
+
718
+ 0:19:54.374 --> 0:20:00.286
719
+ And then we're doing the cross attention which
720
+ attacks from the decoder to the anchoder.
721
+
722
+ 0:20:00.540 --> 0:20:10.239
723
+ So in this case it's again that the queries
724
+ are the current state of the decoder, while the keys
725
+
726
+ 0:20:10.239 --> 0:20:22.833
727
+ are from the encoder: You can do both, attend to yourself to get the
728
+ meaning on the target side and to the encoder to get the source meaning.
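A minimal sketch of the cross-attention described here, with queries taken from the decoder states and keys and values taken from the encoder states; all matrices and sizes are made up for illustration.

```python
import numpy as np

def cross_attention(dec_H, enc_H, Wq, Wk, Wv):
    Q = dec_H @ Wq                  # queries: decoder side
    K, V = enc_H @ Wk, enc_H @ Wv   # keys and values: encoder side
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True)); w /= w.sum(-1, keepdims=True)
    return w @ V                    # source context for every target position

rng = np.random.default_rng(4)
d = 8
enc_H, dec_H = rng.normal(size=(7, d)), rng.normal(size=(3, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(cross_attention(dec_H, enc_H, Wq, Wk, Wv).shape)   # (3, 8)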
729
+
730
+ 0:20:23.423 --> 0:20:25.928
731
+ So see then the full picture.
732
+
733
+ 0:20:25.928 --> 0:20:33.026
734
+ This is now the typical picture of the transformer
735
+ and where you use self attention.
736
+
737
+ 0:20:33.026 --> 0:20:36.700
738
+ So what you have is have your power hidden.
739
+
740
+ 0:20:37.217 --> 0:20:43.254
741
+ What you then apply is here the position they're
742
+ coding: We have then doing the self attention
743
+
744
+ 0:20:43.254 --> 0:20:46.734
745
+ to all the others, and this can be bi-directional.
746
+
747
+ 0:20:47.707 --> 0:20:54.918
748
+ You normally do another feed forward layer
749
+ just like to make things to learn additional
750
+
751
+ 0:20:54.918 --> 0:20:55.574
752
+ things.
753
+
754
+ 0:20:55.574 --> 0:21:02.785
755
+ You're just having also a feed forward layer
756
+ which takes your heel stable and generates
757
+
758
+ 0:21:02.785 --> 0:21:07.128
759
+ your heel state because we are making things
760
+ deeper.
761
+
762
+ 0:21:07.747 --> 0:21:15.648
763
+ Then this blue part you can stack over several
764
+ times so you can have layers so that.
765
+
766
+ 0:21:16.336 --> 0:21:30.256
767
+ In addition to these blue arrows, so we talked
768
+ about this in R&amp;S that if you are now back
769
+
770
+ 0:21:30.256 --> 0:21:35.883
771
+ propagating your arrow from the top,.
772
+
773
+ 0:21:36.436 --> 0:21:48.578
774
+ In order to prevent that we are not really
775
+ learning how to transform that, but instead
776
+
777
+ 0:21:48.578 --> 0:21:51.230
778
+ we have to change.
779
+
780
+ 0:21:51.671 --> 0:22:00.597
781
+ You're calculating what should be changed
782
+ with this one.
783
+
784
+ 0:22:00.597 --> 0:22:09.365
785
+ The backwards clip each layer and the learning
786
+ is just.
787
+
788
+ 0:22:10.750 --> 0:22:21.632
789
+ The encoder before we go to the decoder.
790
+
791
+ 0:22:21.632 --> 0:22:30.655
792
+ We have any additional questions.
793
+
794
+ 0:22:31.471 --> 0:22:33.220
795
+ That's a Very Good Point.
796
+
797
+ 0:22:33.553 --> 0:22:38.709
798
+ Yeah, you normally take always that at least
799
+ the default architecture to only look at the
800
+
801
+ 0:22:38.709 --> 0:22:38.996
802
+ top.
803
+
804
+ 0:22:40.000 --> 0:22:40.388
805
+ Coder.
806
+
807
+ 0:22:40.388 --> 0:22:42.383
808
+ Of course, you can do other things.
809
+
810
+ 0:22:42.383 --> 0:22:45.100
811
+ We investigated, for example, the lowest layout.
812
+
813
+ 0:22:45.100 --> 0:22:49.424
814
+ The decoder is looking at the lowest level
815
+ of the incoder and not of the top.
816
+
817
+ 0:22:49.749 --> 0:23:05.342
818
+ You can average or you can even learn theoretically
819
+ that what you can also do is attending to all.
820
+
821
+ 0:23:05.785 --> 0:23:11.180
822
+ Can attend to all possible layers and states.
823
+
824
+ 0:23:11.180 --> 0:23:18.335
825
+ But what the default thing is is that you
826
+ only have the top.
827
+
828
+ 0:23:20.580 --> 0:23:31.999
829
+ The decoder when we're doing is firstly doing
830
+ the same position and coding, then we're doing
831
+
832
+ 0:23:31.999 --> 0:23:36.419
833
+ self attention in the decoder side.
834
+
835
+ 0:23:37.837 --> 0:23:43.396
836
+ Of course here it's not important we're doing
837
+ the mask self attention so that we're only
838
+
839
+ 0:23:43.396 --> 0:23:45.708
840
+ attending to the past and we're not.
841
+
842
+ 0:23:47.287 --> 0:24:02.698
843
+ Here you see the difference, so in this case
844
+ the keys and values are from the encoder and
845
+
846
+ 0:24:02.698 --> 0:24:03.554
847
+ the.
848
+
849
+ 0:24:03.843 --> 0:24:12.103
850
+ You're comparing it to all the counter hidden
851
+ states calculating the similarity and then
852
+
853
+ 0:24:12.103 --> 0:24:13.866
854
+ you do the weight.
855
+
856
+ 0:24:14.294 --> 0:24:17.236
857
+ And that is an edit to what is here.
858
+
859
+ 0:24:18.418 --> 0:24:29.778
860
+ Then you have a linen layer and again this
861
+ green one is sticked several times and then.
862
+
863
+ 0:24:32.232 --> 0:24:36.987
864
+ Question, so each code is off.
865
+
866
+ 0:24:36.987 --> 0:24:46.039
867
+ Every one of those has the last layer of thing,
868
+ so in the.
869
+
870
+ 0:24:46.246 --> 0:24:51.007
871
+ All with and only to the last or the top layer
872
+ of the anchor.
873
+
874
+ 0:24:57.197 --> 0:25:00.127
875
+ Good So That Would Be.
876
+
877
+ 0:25:01.501 --> 0:25:12.513
878
+ To sequence models we have looked at attention
879
+ and before we are decoding do you have any
880
+
881
+ 0:25:12.513 --> 0:25:18.020
882
+ more questions to this type of architecture.
883
+
884
+ 0:25:20.480 --> 0:25:30.049
885
+ Transformer was first used in machine translation,
886
+ but now it's a standard thing for doing nearly
887
+
888
+ 0:25:30.049 --> 0:25:32.490
889
+ any tie sequence models.
890
+
891
+ 0:25:33.013 --> 0:25:35.984
892
+ Even large language models.
893
+
894
+ 0:25:35.984 --> 0:25:38.531
895
+ They are a bit similar.
896
+
897
+ 0:25:38.531 --> 0:25:45.111
898
+ They are just throwing away the anchor and
899
+ cross the tension.
900
+
901
+ 0:25:45.505 --> 0:25:59.329
902
+ And that is maybe interesting that it's important
903
+ to have this attention because you cannot store
904
+
905
+ 0:25:59.329 --> 0:26:01.021
906
+ everything.
907
+
908
+ 0:26:01.361 --> 0:26:05.357
909
+ The interesting thing with the attention is
910
+ now we can attend to everything.
911
+
912
+ 0:26:05.745 --> 0:26:13.403
913
+ So you can again go back to your initial model
914
+ and have just a simple sequence model and then
915
+
916
+ 0:26:13.403 --> 0:26:14.055
917
+ target.
918
+
919
+ 0:26:14.694 --> 0:26:24.277
920
+ There would be a more language model style
921
+ or people call it Decoder Only model where
922
+
923
+ 0:26:24.277 --> 0:26:26.617
924
+ you throw this away.
925
+
926
+ 0:26:27.247 --> 0:26:30.327
927
+ The nice thing is because of your self attention.
928
+
929
+ 0:26:30.327 --> 0:26:34.208
930
+ You have the original problem why you introduce
931
+ the attention.
932
+
933
+ 0:26:34.208 --> 0:26:39.691
934
+ You don't have that anymore because it's not
935
+ everything is summarized, but each time you
936
+
937
+ 0:26:39.691 --> 0:26:44.866
938
+ generate, you're looking back at all the previous
939
+ words, the source and the target.
940
+
941
+ 0:26:45.805 --> 0:26:51.734
942
+ And there is a lot of work on is a really
943
+ important to have encoded a decoded model or
944
+
945
+ 0:26:51.734 --> 0:26:54.800
946
+ is a decoded only model as good if you have.
947
+
948
+ 0:26:54.800 --> 0:27:00.048
949
+ But the comparison is not that easy because
950
+ how many parameters do you have?
951
+
952
+ 0:27:00.360 --> 0:27:08.832
953
+ So think the general idea at the moment is,
954
+ at least for machine translation, it's normally
955
+
956
+ 0:27:08.832 --> 0:27:17.765
957
+ a bit better to have an encoded decoder model
958
+ and not a decoder model where you just concatenate
959
+
960
+ 0:27:17.765 --> 0:27:20.252
961
+ the source and the target.
962
+
963
+ 0:27:21.581 --> 0:27:24.073
964
+ But there is not really a big difference anymore.
965
+
966
+ 0:27:24.244 --> 0:27:29.891
967
+ Because this big issue, which we had initially
968
+ with it that everything is stored in the working
969
+
970
+ 0:27:29.891 --> 0:27:31.009
971
+ state, is nothing.
972
+
973
+ 0:27:31.211 --> 0:27:45.046
974
+ Of course, the advantage maybe here is that
975
+ you give it a bias at your same language information.
976
+
977
+ 0:27:45.285 --> 0:27:53.702
978
+ While in an encoder only model this all is
979
+ merged into one thing and sometimes it is good
980
+
981
+ 0:27:53.702 --> 0:28:02.120
982
+ to give models a bit of bias okay you should
983
+ maybe treat things separately and you should
984
+
985
+ 0:28:02.120 --> 0:28:03.617
986
+ look different.
987
+
988
+ 0:28:04.144 --> 0:28:11.612
989
+ And of course one other difference, one other
990
+ disadvantage, maybe of an encoder owning one.
991
+
992
+ 0:28:16.396 --> 0:28:19.634
993
+ You think about the suicide sentence and how
994
+ it's treated.
995
+
996
+ 0:28:21.061 --> 0:28:33.787
997
+ Architecture: Anchorer can both be in the
998
+ sentence for every state and cause a little
999
+
1000
+ 0:28:33.787 --> 0:28:35.563
1001
+ difference.
1002
+
1003
+ 0:28:35.475 --> 0:28:43.178
1004
+ If you only have a decoder that has to be
1005
+ unidirectional because for the decoder side
1006
+
1007
+ 0:28:43.178 --> 0:28:51.239
1008
+ for the generation you need it and so your
1009
+ input is read state by state so you don't have
1010
+
1011
+ 0:28:51.239 --> 0:28:54.463
1012
+ positional bidirection information.
1013
+
1014
+ 0:28:56.596 --> 0:29:05.551
1015
+ Again, it receives a sequence of embeddings
1016
+ with position encoding.
1017
+
1018
+ 0:29:05.551 --> 0:29:11.082
1019
+ The piece is like long vector has output.
1020
+
1021
+ 0:29:11.031 --> 0:29:17.148
1022
+ Don't understand how you can set footworks
1023
+ to this part of each other through inputs.
1024
+
1025
+ 0:29:17.097 --> 0:29:20.060
1026
+ Other than cola is the same as the food consume.
1027
+
1028
+ 0:29:21.681 --> 0:29:27.438
1029
+ Okay, it's very good bye, so this one hand
1030
+ coding is only done on the top layer.
1031
+
1032
+ 0:29:27.727 --> 0:29:32.012
1033
+ So this green one is only repeated.
1034
+
1035
+ 0:29:32.012 --> 0:29:38.558
1036
+ You have the word embedding or the position
1037
+ embedding.
1038
+
1039
+ 0:29:38.558 --> 0:29:42.961
1040
+ You have one layer of decoder which.
1041
+
1042
+ 0:29:43.283 --> 0:29:48.245
1043
+ Then you stick in the second one, the third
1044
+ one, the fourth one, and then on the top.
1045
+
1046
+ 0:29:48.208 --> 0:29:55.188
1047
+ Layer: You put this projection layer which
1048
+ takes a one thousand dimensional backtalk and
1049
+
1050
+ 0:29:55.188 --> 0:30:02.089
1051
+ generates based on your vocabulary maybe in
1052
+ ten thousand soft max layer which gives you
1053
+
1054
+ 0:30:02.089 --> 0:30:04.442
1055
+ the probability of all words.
1056
+
1057
+ 0:30:06.066 --> 0:30:22.369
1058
+ It's a very good part part of the mass tape
1059
+ ladies, but it wouldn't be for the X-rays.
1060
+
1061
+ 0:30:22.262 --> 0:30:27.015
1062
+ Aquarium filters to be like monsoon roding
1063
+ as they get by the river.
1064
+
1065
+ 0:30:27.647 --> 0:30:33.140
1066
+ Yes, there is work on that think we will discuss
1067
+ that in the pre-trained models.
1068
+
1069
+ 0:30:33.493 --> 0:30:39.756
1070
+ It's called where you exactly do that.
1071
+
1072
+ 0:30:39.756 --> 0:30:48.588
1073
+ If you have more metric side, it's like diagonal
1074
+ here.
1075
+
1076
+ 0:30:48.708 --> 0:30:53.018
1077
+ And it's a full metric, so here everybody's
1078
+ attending to each position.
1079
+
1080
+ 0:30:53.018 --> 0:30:54.694
1081
+ Here you're only attending.
1082
+
1083
+ 0:30:54.975 --> 0:31:05.744
1084
+ Then you can do the previous one where this
1085
+ one is decoded, not everything but everything.
1086
+
1087
+ 0:31:06.166 --> 0:31:13.961
1088
+ So you have a bit more that is possible, and
1089
+ we'll have that in the lecture on pre-train
1090
+
1091
+ 0:31:13.961 --> 0:31:14.662
1092
+ models.
1093
+
1094
+ 0:31:18.478 --> 0:31:27.440
1095
+ So we now know how to build a translation
1096
+ system, but of course we don't want to have
1097
+
1098
+ 0:31:27.440 --> 0:31:30.774
1099
+ a translation system by itself.
1100
+
1101
+ 0:31:31.251 --> 0:31:40.037
1102
+ Now given this model an input sentence, how
1103
+ can we generate an output mind?
1104
+
1105
+ 0:31:40.037 --> 0:31:49.398
1106
+ The general idea is still: So what we really
1107
+ want to do is we start with the model.
1108
+
1109
+ 0:31:49.398 --> 0:31:53.893
1110
+ We generate different possible translations.
1111
+
1112
+ 0:31:54.014 --> 0:31:59.754
1113
+ We score them the lock probability that we're
1114
+ getting, so for each input and output pair
1115
+
1116
+ 0:31:59.754 --> 0:32:05.430
1117
+ we can calculate the lock probability, which
1118
+ is a product of all probabilities for each
1119
+
1120
+ 0:32:05.430 --> 0:32:09.493
1121
+ word in there, and then we can find what is
1122
+ the most probable.
1123
+
1124
+ 0:32:09.949 --> 0:32:15.410
1125
+ However, that's a bit complicated we will
1126
+ see because we can't look at all possible translations.
1127
+
1128
+ 0:32:15.795 --> 0:32:28.842
1129
+ So there is infinite or a number of possible
1130
+ translations, so we have to do it somehow in
1131
+
1132
+ 0:32:28.842 --> 0:32:31.596
1133
+ more intelligence.
1134
+
1135
+ 0:32:32.872 --> 0:32:37.821
1136
+ So what we want to do today in the rest of
1137
+ the lecture?
1138
+
1139
+ 0:32:37.821 --> 0:32:40.295
1140
+ What is the search problem?
1141
+
1142
+ 0:32:40.295 --> 0:32:44.713
1143
+ Then we will look at different search algorithms.
1144
+
1145
+ 0:32:45.825 --> 0:32:56.636
1146
+ Will compare model and search errors, so there
1147
+ can be errors on the model where the model
1148
+
1149
+ 0:32:56.636 --> 0:33:03.483
1150
+ is not giving the highest score to the best
1151
+ translation.
1152
+
1153
+ 0:33:03.903 --> 0:33:21.069
1154
+ This is always like searching the best translation
1155
+ out of one model, which is often also interesting.
1156
+
1157
+ 0:33:24.004 --> 0:33:29.570
1158
+ And how do we do the search?
1159
+
1160
+ 0:33:29.570 --> 0:33:41.853
1161
+ We want to find the translation where the
1162
+ reference is minimal.
1163
+
1164
+ 0:33:42.042 --> 0:33:44.041
1165
+ So the nice thing is SMT.
1166
+
1167
+ 0:33:44.041 --> 0:33:51.347
1168
+ It wasn't the case, but in neuromachine translation
1169
+ we can't find any possible translation, so
1170
+
1171
+ 0:33:51.347 --> 0:33:53.808
1172
+ at least within our vocabulary.
1173
+
1174
+ 0:33:53.808 --> 0:33:58.114
1175
+ But if we have BPE we can really generate
1176
+ any possible.
1177
+
1178
+ 0:33:58.078 --> 0:34:04.604
1179
+ Translation and cereal: We could always minimize
1180
+ that, but yeah, we can't do it that easy because
1181
+
1182
+ 0:34:04.604 --> 0:34:07.734
1183
+ of course we don't have the reference at hand.
1184
+
1185
+ 0:34:07.747 --> 0:34:10.384
1186
+ If it has a reference, it's not a problem.
1187
+
1188
+ 0:34:10.384 --> 0:34:13.694
1189
+ We know what we are searching for, but we
1190
+ don't know.
1191
+
1192
+ 0:34:14.054 --> 0:34:23.886
1193
+ So how can we then model this by just finding
1194
+ the translation with the highest probability?
1195
+
1196
+ 0:34:23.886 --> 0:34:29.015
1197
+ Looking at it, we want to find the translation.
1198
+
1199
+ 0:34:29.169 --> 0:34:32.525
1200
+ Idea is our model is a good approximation.
1201
+
1202
+ 0:34:32.525 --> 0:34:34.399
1203
+ That's how we train it.
1204
+
1205
+ 0:34:34.399 --> 0:34:36.584
1206
+ What is a good translation?
1207
+
1208
+ 0:34:36.584 --> 0:34:43.687
1209
+ And if we find translation with the highest
1210
+ probability, this should also give us the best
1211
+
1212
+ 0:34:43.687 --> 0:34:44.702
1213
+ translation.
1214
+
1215
+ 0:34:45.265 --> 0:34:56.965
1216
+ And that is then, of course, the difference
1217
+ between the search error is that the model
1218
+
1219
+ 0:34:56.965 --> 0:35:02.076
1220
+ doesn't predict the best translation.
1221
+
1222
+ 0:35:02.622 --> 0:35:08.777
1223
+ How can we do the basic search first of all
1224
+ in basic search that seems to be very easy
1225
+
1226
+ 0:35:08.777 --> 0:35:15.003
1227
+ so what we can do is we can do the forward
1228
+ pass for the whole encoder and that's how it
1229
+
1230
+ 0:35:15.003 --> 0:35:21.724
1231
+ starts the input sentences known you can put
1232
+ the input sentence and calculate all your estates
1233
+
1234
+ 0:35:21.724 --> 0:35:22.573
1235
+ and hidden?
1236
+
1237
+ 0:35:23.083 --> 0:35:35.508
1238
+ Then you can put in your sentence start and
1239
+ you can generate.
1240
+
1241
+ 0:35:35.508 --> 0:35:41.721
1242
+ Here you have the probability.
1243
+
1244
+ 0:35:41.801 --> 0:35:52.624
1245
+ A good idea we would see later that as a typical
1246
+ algorithm is guess what you all would do, you
1247
+
1248
+ 0:35:52.624 --> 0:35:54.788
1249
+ would then select.
1250
+
1251
+ 0:35:55.235 --> 0:36:06.265
1252
+ So if you generate here a probability distribution
1253
+ over all the words in your vocabulary then
1254
+
1255
+ 0:36:06.265 --> 0:36:08.025
1256
+ you can solve.
1257
+
1258
+ 0:36:08.688 --> 0:36:13.147
1259
+ Yeah, this is how autocompletion is done
1260
+ in our system.
1261
+
1262
+ 0:36:14.794 --> 0:36:19.463
1263
+ Yeah, this is also why there you have to have
1264
+ a model of possible extending.
1265
+
1266
+ 0:36:19.463 --> 0:36:24.314
1267
+ It's more of a language model, but then this
1268
+ is one algorithm to do the search.
1269
+
1270
+ 0:36:24.314 --> 0:36:26.801
1271
+ They maybe have also more advanced ones.
1272
+
1273
+ 0:36:26.801 --> 0:36:32.076
1274
+ We will see that so this search and other
1275
+ completion should be exactly the same as the
1276
+
1277
+ 0:36:32.076 --> 0:36:33.774
1278
+ search machine translation.
1279
+
1280
+ 0:36:34.914 --> 0:36:40.480
1281
+ So we'll see that this is not optimal, so
1282
+ hopefully it's not that this way, but for this
1283
+
1284
+ 0:36:40.480 --> 0:36:41.043
1285
+ problem.
1286
+
1287
+ 0:36:41.941 --> 0:36:47.437
1288
+ And what you can do then you can select this
1289
+ word.
1290
+
1291
+ 0:36:47.437 --> 0:36:50.778
1292
+ This was the best translation.
1293
+
1294
+ 0:36:51.111 --> 0:36:57.675
1295
+ Because the decoder, of course, in the next
1296
+ step needs not to know what is the best word
1297
+
1298
+ 0:36:57.675 --> 0:37:02.396
1299
+ here, it inputs it and generates that flexibility
1300
+ distribution.
1301
+
1302
+ 0:37:03.423 --> 0:37:14.608
1303
+ And then your new distribution, and you can
1304
+ do the same thing, there's the best word there,
1305
+
1306
+ 0:37:14.608 --> 0:37:15.216
1307
+ and.
1308
+
1309
+ 0:37:15.435 --> 0:37:22.647
1310
+ So you can continue doing that and always
1311
+ get hopefully the best translation in the end.
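A sketch of this greedy decoding loop in Python. The `model.encode()` and `model.step()` calls are placeholders for whatever interface a sequence-to-sequence model exposes; they are assumptions for illustration, not a real API.

```python
def greedy_decode(model, src_tokens, bos, eos, max_len=100):
    """Greedy search: at every step keep only the single most probable word."""
    enc_states = model.encode(src_tokens)          # one forward pass over the encoder
    output = [bos]                                 # start with the sentence-start token
    for _ in range(max_len):
        probs = model.step(enc_states, output)     # distribution over the vocabulary
        next_word = max(range(len(probs)), key=probs.__getitem__)  # argmax
        output.append(next_word)
        if next_word == eos:                       # the stop token ends the loop
            break
    return output[1:]
```

As the lecture goes on to show, this early, local argmax decision is exactly what can cause search errors in an autoregressive model.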
1312
+
1313
+ 0:37:23.483 --> 0:37:30.839
1314
+ The first question is, of course, how long
1315
+ are you doing it?
1316
+
1317
+ 0:37:30.839 --> 0:37:33.854
1318
+ Now we could go forever.
1319
+
1320
+ 0:37:36.476 --> 0:37:52.596
1321
+ We had this token at the input and we put
1322
+ the stop token at the output.
1323
+
1324
+ 0:37:53.974 --> 0:38:07.217
1325
+ And this is important because if we wouldn't
1326
+ do that then we wouldn't have a good idea.
1327
+
1328
+ 0:38:10.930 --> 0:38:16.193
1329
+ So that seems to be a good idea, but is it
1330
+ really?
1331
+
1332
+ 0:38:16.193 --> 0:38:21.044
1333
+ Do we find the most probable sentence in this?
1334
+
1335
+ 0:38:23.763 --> 0:38:25.154
1336
+ Or my dear healed proverb,.
1337
+
1338
+ 0:38:27.547 --> 0:38:41.823
1339
+ We are always selecting the highest probability
1340
+ one, so it seems to be that this is a very
1341
+
1342
+ 0:38:41.823 --> 0:38:45.902
1343
+ good solution to anybody.
1344
+
1345
+ 0:38:46.406 --> 0:38:49.909
1346
+ Yes, that is actually the problem.
1347
+
1348
+ 0:38:49.909 --> 0:38:56.416
1349
+ You might do early decisions and you don't
1350
+ have the global view.
1351
+
1352
+ 0:38:56.796 --> 0:39:02.813
1353
+ And this problem happens because it is an
1354
+ autoregressive model.
1355
+
1356
+ 0:39:03.223 --> 0:39:13.275
1357
+ So it happens because yeah, the output we
1358
+ generate is the input in the next step.
1359
+
1360
+ 0:39:13.793 --> 0:39:19.493
1361
+ And this, of course, is leading to problems.
1362
+
1363
+ 0:39:19.493 --> 0:39:27.474
1364
+ If we always take the best solution, it doesn't
1365
+ mean you have.
1366
+
1367
+ 0:39:27.727 --> 0:39:33.941
1368
+ It would be different if you have a problem
1369
+ where the output is not influencing your input.
1370
+
1371
+ 0:39:34.294 --> 0:39:44.079
1372
+ Then this solution will give you the best
1373
+ model, but since the output is influencing
1374
+
1375
+ 0:39:44.079 --> 0:39:47.762
1376
+ your next input and the model,.
1377
+
1378
+ 0:39:48.268 --> 0:39:51.599
1379
+ Because one question might not be why do we
1380
+ have this type of model?
1381
+
1382
+ 0:39:51.771 --> 0:39:58.946
1383
+ So why do we really need to put here in the
1384
+ last source word?
1385
+
1386
+ 0:39:58.946 --> 0:40:06.078
1387
+ You can also put in: And then always predict
1388
+ the word and the nice thing is then you wouldn't
1389
+
1390
+ 0:40:06.078 --> 0:40:11.846
1391
+ need to do beams or a difficult search because
1392
+ then the output here wouldn't influence what
1393
+
1394
+ 0:40:11.846 --> 0:40:12.975
1395
+ is inputted here.
1396
+
1397
+ 0:40:15.435 --> 0:40:20.219
1398
+ Idea whether that might not be the best idea.
1399
+
1400
+ 0:40:20.219 --> 0:40:24.588
1401
+ You'll just be translating each word and.
1402
+
1403
+ 0:40:26.626 --> 0:40:37.815
1404
+ The second one is right, yes, you're not generating
1405
+ a Korean sentence.
1406
+
1407
+ 0:40:38.058 --> 0:40:48.197
1408
+ We'll also see that later it's called non
1409
+ auto-progressive translation, so there is work
1410
+
1411
+ 0:40:48.197 --> 0:40:49.223
1412
+ on that.
1413
+
1414
+ 0:40:49.529 --> 0:41:02.142
1415
+ So you might know it roughly because you know
1416
+ it's based on this hidden state, but it can
1417
+
1418
+ 0:41:02.142 --> 0:41:08.588
1419
+ be that in the end you have your probability.
1420
+
1421
+ 0:41:09.189 --> 0:41:14.633
1422
+ And then you're not modeling the dependencies
1423
+ within a work within the target sentence.
1424
+
1425
+ 0:41:14.633 --> 0:41:27.547
1426
+ For example: You can express things in German,
1427
+ then you don't know which one you really select.
1428
+
1429
+ 0:41:27.547 --> 0:41:32.156
1430
+ That influences what you later.
1431
+
1432
+ 0:41:33.393 --> 0:41:46.411
1433
+ Then you try to find a better way not only
1434
+ based on the English sentence and the words
1435
+
1436
+ 0:41:46.411 --> 0:41:48.057
1437
+ that come.
1438
+
1439
+ 0:41:49.709 --> 0:42:00.954
1440
+ Yes, that is more like a two-step decoding,
1441
+ but that is, of course, a lot more like computational.
1442
+
1443
+ 0:42:01.181 --> 0:42:15.978
1444
+ The first thing you can do, which is typically
1445
+ done, is doing not really search.
1446
+
1447
+ 0:42:16.176 --> 0:42:32.968
1448
+ So first look at what the problem of research
1449
+ is to make it a bit more clear.
1450
+
1451
+ 0:42:34.254 --> 0:42:53.163
1452
+ And now you can extend them and you can extend
1453
+ these and the joint probabilities.
1454
+
1455
+ 0:42:54.334 --> 0:42:59.063
1456
+ The other thing is the second word.
1457
+
1458
+ 0:42:59.063 --> 0:43:03.397
1459
+ You can do the second word dusk.
1460
+
1461
+ 0:43:03.397 --> 0:43:07.338
1462
+ Now you see the problem here.
1463
+
1464
+ 0:43:07.707 --> 0:43:17.507
1465
+ It is true that these have the highest probability,
1466
+ but for these you have an extension.
1467
+
1468
+ 0:43:18.078 --> 0:43:31.585
1469
+ So the problem is just because in one position
1470
+ one hypothesis, so you can always call this
1471
+
1472
+ 0:43:31.585 --> 0:43:34.702
1473
+ partial translation.
1474
+
1475
+ 0:43:34.874 --> 0:43:41.269
1476
+ The blue one begin is higher, but the green
1477
+ one can be better extended and it will overtake.
1478
+
1479
+ 0:43:45.525 --> 0:43:54.672
1480
+ So the problem is if we are doing this greedy
1481
+ search is that we might not end up in really
1482
+
1483
+ 0:43:54.672 --> 0:43:55.275
1484
+ good.
1485
+
1486
+ 0:43:55.956 --> 0:44:00.916
1487
+ So the first thing we could not do is like
1488
+ yeah, we can just try.
1489
+
1490
+ 0:44:00.880 --> 0:44:06.049
1491
+ All combinations that are there, so there
1492
+ is the other direction.
1493
+
1494
+ 0:44:06.049 --> 0:44:13.020
1495
+ So if the solution to to check the first one
1496
+ is to just try all and it doesn't give us a
1497
+
1498
+ 0:44:13.020 --> 0:44:17.876
1499
+ good result, maybe what we have to do is just
1500
+ try everything.
1501
+
1502
+ 0:44:18.318 --> 0:44:23.120
1503
+ The nice thing is if we try everything, we'll
1504
+ definitely find the best translation.
1505
+
1506
+ 0:44:23.463 --> 0:44:26.094
1507
+ So we won't have a search error.
1508
+
1509
+ 0:44:26.094 --> 0:44:28.167
1510
+ We'll come to that later.
1511
+
1512
+ 0:44:28.167 --> 0:44:32.472
1513
+ The interesting thing is our translation performance.
1514
+
1515
+ 0:44:33.353 --> 0:44:37.039
1516
+ But we will definitely find the most probable
1517
+ translation.
1518
+
1519
+ 0:44:38.598 --> 0:44:44.552
1520
+ However, it's not really possible because
1521
+ the number of combinations is just too high.
1522
+
1523
+ 0:44:44.764 --> 0:44:57.127
1524
+ So the number of combinations is your vocabulary
1525
+ size to the power of the length of your sentence.
1526
+
1527
+ 0:44:57.157 --> 0:45:03.665
1528
+ Ten thousand or so you can imagine that very
1529
+ soon you will have so many possibilities here
1530
+
1531
+ 0:45:03.665 --> 0:45:05.597
1532
+ that you cannot check all.
1533
+
1534
+ 0:45:06.226 --> 0:45:13.460
1535
+ So this is not really an implication or an
1536
+ algorithm that you can use for applying machine
1537
+
1538
+ 0:45:13.460 --> 0:45:14.493
1539
+ translation.
1540
+
1541
+ 0:45:15.135 --> 0:45:24.657
1542
+ So maybe we have to do something in between
1543
+ and yeah, not look at all but only look at
1544
+
1545
+ 0:45:24.657 --> 0:45:25.314
1546
+ some.
1547
+
1548
+ 0:45:26.826 --> 0:45:29.342
1549
+ And the easiest thing for that is okay.
1550
+
1551
+ 0:45:29.342 --> 0:45:34.877
1552
+ Just do sampling, so if we don't know what
1553
+ to look at, maybe it's good to randomly pick
1554
+
1555
+ 0:45:34.877 --> 0:45:35.255
1556
+ some.
1557
+
1558
+ 0:45:35.255 --> 0:45:40.601
1559
+ That's not only a very good algorithm, so
1560
+ the basic idea will always randomly select
1561
+
1562
+ 0:45:40.601 --> 0:45:42.865
1563
+ the word, of course, based on bits.
1564
+
1565
+ 0:45:43.223 --> 0:45:52.434
1566
+ We are doing that or times, and then we are
1567
+ looking which one at the end has the highest.
1568
+
1569
+ 0:45:52.672 --> 0:45:59.060
1570
+ So we are not doing anymore really searching
1571
+ for the best one, but we are more randomly
1572
+
1573
+ 0:45:59.060 --> 0:46:05.158
1574
+ doing selections with the idea that we always
1575
+ select the best one at the beginning.
1576
+
1577
+ 0:46:05.158 --> 0:46:11.764
1578
+ So maybe it's better to do random, but of
1579
+ course one important thing is how do we randomly
1580
+
1581
+ 0:46:11.764 --> 0:46:12.344
1582
+ select?
1583
+
1584
+ 0:46:12.452 --> 0:46:15.756
1585
+ If we just do uniform distribution, it would
1586
+ be very bad.
1587
+
1588
+ 0:46:15.756 --> 0:46:18.034
1589
+ You'll only have very bad translations.
1590
+
1591
+ 0:46:18.398 --> 0:46:23.261
1592
+ Because in each position if you think about
1593
+ it you have ten thousand possibilities.
1594
+
1595
+ 0:46:23.903 --> 0:46:28.729
1596
+ Most of them are really bad decisions and
1597
+ you shouldn't do that.
1598
+
1599
+ 0:46:28.729 --> 0:46:35.189
1600
+ There is always only a very small number,
1601
+ at least compared to the 10 000 translation.
1602
+
1603
+ 0:46:35.395 --> 0:46:43.826
1604
+ So if you have the sentence here, this is
1605
+ an English sentence.
1606
+
1607
+ 0:46:43.826 --> 0:46:47.841
1608
+ You can start with these and.
1609
+
1610
+ 0:46:48.408 --> 0:46:58.345
1611
+ You're thinking about setting legal documents
1612
+ in a legal document.
1613
+
1614
+ 0:46:58.345 --> 0:47:02.350
1615
+ You should not change the.
1616
+
1617
+ 0:47:03.603 --> 0:47:11.032
1618
+ The problem is we have a neural network, we
1619
+ have a black box, so it's anyway a bit random.
1620
+
1621
+ 0:47:12.092 --> 0:47:24.341
1622
+ It is considered, but you will see that if
1623
+ you make it intelligent for clear sentences,
1624
+
1625
+ 0:47:24.341 --> 0:47:26.986
1626
+ there is not that.
1627
+
1628
+ 0:47:27.787 --> 0:47:35.600
1629
+ Is an issue we should consider that this one
1630
+ might lead to more randomness, but it might
1631
+
1632
+ 0:47:35.600 --> 0:47:39.286
1633
+ also be positive for machine translation.
1634
+
1635
+ 0:47:40.080 --> 0:47:46.395
1636
+ Least can't directly think of a good implication
1637
+ where it's positive, but if you most think
1638
+
1639
+ 0:47:46.395 --> 0:47:52.778
1640
+ about dialogue systems, for example, whereas
1641
+ the similar architecture is nowadays also used,
1642
+
1643
+ 0:47:52.778 --> 0:47:55.524
1644
+ you predict what the system should say.
1645
+
1646
+ 0:47:55.695 --> 0:48:00.885
1647
+ Then you want to have randomness because it's
1648
+ not always saying the same thing.
1649
+
1650
+ 0:48:01.341 --> 0:48:08.370
1651
+ Machine translation is typically not you want
1652
+ to have consistency, so if you have the same
1653
+
1654
+ 0:48:08.370 --> 0:48:09.606
1655
+ input normally.
1656
+
1657
+ 0:48:09.889 --> 0:48:14.528
1658
+ Therefore, sampling is not the method you would typically use.
1659
+
1660
+ 0:48:14.528 --> 0:48:22.584
1661
+ There are some things you will later see as
1662
+ a preprocessing step.
1663
+
1664
+ 0:48:23.003 --> 0:48:27.832
1665
+ But of course it's important how you can make
1666
+ this process not too random.
1667
+
1668
+ 0:48:29.269 --> 0:48:41.619
1669
+ Therefore, the first thing is don't take a
1670
+ uniform distribution, but we have a very nice
1671
+
1672
+ 0:48:41.619 --> 0:48:43.562
1673
+ distribution.
1674
+
1675
+ 0:48:43.843 --> 0:48:46.621
1676
+ So I'm like randomly taking a word.
1677
+
1678
+ 0:48:46.621 --> 0:48:51.328
1679
+ We are looking at output distribution and
1680
+ now taking a word.
1681
+
1682
+ 0:48:51.731 --> 0:49:03.901
1683
+ So that means we are taking the word these,
1684
+ we are taking the word does, and all these.
1685
+
1686
+ 0:49:04.444 --> 0:49:06.095
1687
+ How can you do that?
1688
+
1689
+ 0:49:06.095 --> 0:49:09.948
1690
+ You randomly draw a number between zero and
1691
+ one.
1692
+
1693
+ 0:49:10.390 --> 0:49:23.686
1694
+ And then you have ordered your words in some
1695
+ way, and then you take the words before the
1696
+
1697
+ 0:49:23.686 --> 0:49:26.375
1698
+ sum of the words.
1699
+
1700
+ 0:49:26.806 --> 0:49:34.981
1701
+ So the easiest thing is you have zero point
1702
+ five, zero point two five, and zero point two
1703
+
1704
+ 0:49:34.981 --> 0:49:35.526
1705
+ five.
1706
+
1707
+ 0:49:35.526 --> 0:49:43.428
1708
+ If you have a number smaller than 0.5 you take
1709
+ the first word, between 0.5 and 0.75 the second word, and
1710
+
1711
+ 0:49:43.428 --> 0:49:45.336
1712
+ if it's higher than 0.75 the third.
1713
+
1714
+ 0:49:45.845 --> 0:49:57.707
1715
+ Therefore, you can very easily get a distribution
1716
+ distributed according to this probability mass
1717
+
1718
+ 0:49:57.707 --> 0:49:59.541
1719
+ and no longer.
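As a rough illustration of the cumulative-probability trick just described (a minimal sketch added here, not part of the lecture materials), drawing a word according to the model's distribution can look like this in Python:

# Minimal sketch: sample the next word by drawing a number in [0, 1)
# and walking the cumulative sums, as in the 0.5 / 0.25 / 0.25 example.
import random

def sample_word(words, probs):
    r = random.random()          # uniform draw in [0, 1)
    cumulative = 0.0
    for word, p in zip(words, probs):
        cumulative += p
        if r < cumulative:       # first word whose cumulative mass exceeds r
            return word
    return words[-1]             # guard against rounding issues

print(sample_word(["these", "this", "that"], [0.5, 0.25, 0.25]))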
1720
+
1721
+ 0:49:59.799 --> 0:50:12.479
1722
+ You can't even do that a bit more and more
1723
+ focus on the important part if we are not randomly
1724
+
1725
+ 0:50:12.479 --> 0:50:19.494
1726
+ drawing from all words, but we are looking
1727
+ only at.
1728
+
1729
+ 0:50:21.361 --> 0:50:24.278
1730
+ Do you have an idea why this is an important
1731
+ step?
1732
+
1733
+ 0:50:24.278 --> 0:50:29.459
1734
+ Although we say I'm only throwing away the
1735
+ words which have a very low probability, so
1736
+
1737
+ 0:50:29.459 --> 0:50:32.555
1738
+ anyway the probability of taking them is quite
1739
+ low.
1740
+
1741
+ 0:50:32.555 --> 0:50:35.234
1742
+ So normally that shouldn't matter that much.
1743
+
1744
+ 0:50:36.256 --> 0:50:38.830
1745
+ There's ten thousand words.
1746
+
1747
+ 0:50:40.300 --> 0:50:42.074
1748
+ Of course, there are like nine thousand nine hundred of them.
1749
+
1750
+ 0:50:42.074 --> 0:50:44.002
1751
+ They're going to build a good people steal
1752
+ it up.
1757
+
1758
+ 0:50:47.867 --> 0:50:55.299
1759
+ Yes, that's exactly why you do this most sampling
1760
+ or so that you don't take the lowest.
1761
+
1762
+ 0:50:55.415 --> 0:50:59.694
1763
+ Probability words, but you only look at the
1764
+ most probable ones and then like.
1765
+
1766
+ 0:50:59.694 --> 0:51:04.632
1767
+ Of course you have to rescale your probability
1768
+ mass then so that it's still a probability
1769
+
1770
+ 0:51:04.632 --> 0:51:08.417
1771
+ because now it's a probability distribution
1772
+ over ten thousand words.
1773
+
1774
+ 0:51:08.417 --> 0:51:13.355
1775
+ If you only take ten of them or so it's no
1776
+ longer a probability distribution, you rescale
1777
+
1778
+ 0:51:13.355 --> 0:51:15.330
1779
+ them and you can still do that and.
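A minimal sketch of this restricted sampling, assuming the distribution is given as a plain word-to-probability dictionary (illustration only, not the lecture's code):

# Sketch: restrict sampling to the k most probable words and renormalize,
# so the many low-probability words cannot add up and get drawn.
import random

def sample_top_k(probs, k=10):
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)            # rescale so the kept mass sums to 1
    r, cumulative = random.random() * total, 0.0
    for word, p in top:
        cumulative += p
        if r < cumulative:
            return word
    return top[-1][0]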
1780
+
1781
+ 0:51:16.756 --> 0:51:20.095
1782
+ That is what is done in sampling.
1783
+
1784
+ 0:51:20.095 --> 0:51:26.267
1785
+ It's not the most common thing, but it's done
1786
+ several times.
1787
+
1788
+ 0:51:28.088 --> 0:51:40.625
1789
+ Then there is beam search, which is somehow the standard
1790
+ if you're doing some type of machine translation.
1791
+
1792
+ 0:51:41.181 --> 0:51:50.162
1793
+ And the basic idea is that in greedy search we
1794
+ select the most probable word and only continue
1795
+
1796
+ 0:51:50.162 --> 0:51:51.171
1797
+ with that one.
1798
+
1799
+ 0:51:51.691 --> 0:51:53.970
1800
+ You can easily generalize this.
1801
+
1802
+ 0:51:53.970 --> 0:52:00.451
1803
+ We are not only continuing the most probable
1804
+ one, but we are continuing the most probable.
1805
+
1806
+ 0:52:00.880 --> 0:52:21.376
1807
+ The.
1808
+
1809
+ 0:52:17.697 --> 0:52:26.920
1810
+ You should say we are sampling how many examples
1811
+ it makes sense to take the one with the highest.
1812
+
1813
+ 0:52:27.127 --> 0:52:33.947
1814
+ But that is important that once you do a mistake
1815
+ you might want to not influence that much.
1816
+
1817
+ 0:52:39.899 --> 0:52:45.815
1818
+ So the idea is if we're keeping the end best
1819
+ hypotheses and not only the first fact.
1820
+
1821
+ 0:52:46.586 --> 0:52:51.558
1822
+ And the nice thing is in statistical machine
1823
+ translation.
1824
+
1825
+ 0:52:51.558 --> 0:52:54.473
1826
+ We have exactly the same problem.
1827
+
1828
+ 0:52:54.473 --> 0:52:57.731
1829
+ You would do the same thing, however.
1830
+
1831
+ 0:52:57.731 --> 0:53:03.388
1832
+ Since the model wasn't that strong you needed
1833
+ a quite large beam.
1834
+
1835
+ 0:53:03.984 --> 0:53:18.944
1836
+ Machine translation models are really strong
1837
+ and you get already a very good performance.
1838
+
1839
+ 0:53:19.899 --> 0:53:22.835
1840
+ So how does it work?
1841
+
1842
+ 0:53:22.835 --> 0:53:35.134
1843
+ We can't relate to our capabilities, but now
1844
+ we are not storing the most probable ones.
1845
+
1846
+ 0:53:36.156 --> 0:53:45.163
1847
+ Done that we extend all these hypothesis and
1848
+ of course there is now a bit difficult because
1849
+
1850
+ 0:53:45.163 --> 0:53:54.073
1851
+ now we always have to switch what is the input
1852
+ so the search gets more complicated and the
1853
+
1854
+ 0:53:54.073 --> 0:53:55.933
1855
+ first one is easy.
1856
+
1857
+ 0:53:56.276 --> 0:54:09.816
1858
+ In this case we have to once put in here these
1859
+ and then somehow delete this one and instead
1860
+
1861
+ 0:54:09.816 --> 0:54:12.759
1862
+ put that into that.
1863
+
1864
+ 0:54:13.093 --> 0:54:24.318
1865
+ Otherwise you could only store your current
1866
+ network states here and just continue by going
1867
+
1868
+ 0:54:24.318 --> 0:54:25.428
1869
+ forward.
1870
+
1871
+ 0:54:26.766 --> 0:54:34.357
1872
+ So now you have done the first two, and then
1873
+ you have known the best.
1874
+
1875
+ 0:54:34.357 --> 0:54:37.285
1876
+ Can you now just continue?
1877
+
1878
+ 0:54:39.239 --> 0:54:53.511
1879
+ Yes, that's very important, otherwise all
1880
+ your beam search doesn't really help because
1881
+
1882
+ 0:54:53.511 --> 0:54:57.120
1883
+ you would still have.
1884
+
1885
+ 0:54:57.317 --> 0:55:06.472
1886
+ So now you have to do one important step and
1887
+ then reduce again to end.
1888
+
1889
+ 0:55:06.472 --> 0:55:13.822
1890
+ So in our case to make things easier we have
1891
+ the inputs.
1892
+
1893
+ 0:55:14.014 --> 0:55:19.072
1894
+ Otherwise you will have two to the power of
1895
+ length possibilities, so it is still exponential.
1896
+
1897
+ 0:55:19.559 --> 0:55:26.637
1898
+ But by always throwing them away you keep
1899
+ your beans fixed.
1900
+
1901
+ 0:55:26.637 --> 0:55:31.709
1902
+ The items now differ in the last position.
1903
+
1904
+ 0:55:32.492 --> 0:55:42.078
1905
+ They are completely different, but you are
1906
+ always searching what is the best one.
1907
+
1908
+ 0:55:44.564 --> 0:55:50.791
1909
+ So another way of hearing it is like this,
1910
+ so just imagine you start with the empty sentence.
1911
+
1912
+ 0:55:50.791 --> 0:55:55.296
1913
+ Then you have three possible extensions: A,
1914
+ B, and end of sentence.
1915
+
1916
+ 0:55:55.296 --> 0:55:59.205
1917
+ It's throwing away the worst one, continuing
1918
+ with the two.
1919
+
1920
+ 0:55:59.699 --> 0:56:13.136
1921
+ Then you want to stay too, so in this state
1922
+ it's either or and then you continue.
1923
+
1924
+ 0:56:13.293 --> 0:56:24.924
1925
+ So you always have this exponential growing
1926
+ tree by destroying most of them away and only
1927
+
1928
+ 0:56:24.924 --> 0:56:26.475
1929
+ continuing.
1930
+
1931
+ 0:56:26.806 --> 0:56:42.455
1932
+ And thereby you can hopefully do less errors
1933
+ because in these examples you always see this
1934
+
1935
+ 0:56:42.455 --> 0:56:43.315
1936
+ one.
1937
+
1938
+ 0:56:43.503 --> 0:56:47.406
1939
+ So you're preventing some errors, but of course
1940
+ it's not perfect.
1941
+
1942
+ 0:56:47.447 --> 0:56:56.829
1943
+ You can still do errors because it could be
1944
+ not the second one but the fourth one.
1945
+
1946
+ 0:56:57.017 --> 0:57:03.272
1947
+ Now just the idea is that you make yeah less
1948
+ errors and prevent that.
1949
+
1950
+ 0:57:07.667 --> 0:57:11.191
1951
+ Then the question is how much does it help?
1952
+
1953
+ 0:57:11.191 --> 0:57:14.074
1954
+ And here is some examples for that.
1955
+
1956
+ 0:57:14.074 --> 0:57:16.716
1957
+ So for S & T it was really like.
1958
+
1959
+ 0:57:16.716 --> 0:57:23.523
1960
+ Typically the larger beam you have a larger
1961
+ third space and you have a better score.
1962
+
1963
+ 0:57:23.763 --> 0:57:27.370
1964
+ So the larger you get, the bigger your beam,
1965
+ the better you will be.
1966
+
1967
+ 0:57:27.370 --> 0:57:30.023
1968
+ Typically maybe use something like three hundred.
1969
+
1970
+ 0:57:30.250 --> 0:57:38.777
1971
+ And it's mainly a trade-off between quality
1972
+ and speed because the larger your beams, the
1973
+
1974
+ 0:57:38.777 --> 0:57:43.184
1975
+ more time it takes and you want to finish it.
1976
+
1977
+ 0:57:43.184 --> 0:57:49.124
1978
+ So your quality improvements are getting smaller
1979
+ and smaller.
1980
+
1981
+ 0:57:49.349 --> 0:57:57.164
1982
+ So the difference between a beam of one and
1983
+ ten is bigger than the difference between a.
1984
+
1985
+ 0:57:58.098 --> 0:58:14.203
1986
+ And the interesting thing is we're seeing
1987
+ a bit of a different view, and we're seeing
1988
+
1989
+ 0:58:14.203 --> 0:58:16.263
1990
+ typically.
1991
+
1992
+ 0:58:16.776 --> 0:58:24.376
1993
+ And then especially if you look at the green
1994
+ ones, this is unnormalized.
1995
+
1996
+ 0:58:24.376 --> 0:58:26.770
1997
+ You're seeing a sharp drop.
1998
+
1999
+ 0:58:27.207 --> 0:58:32.284
2000
+ So your translation quality here measured
2001
+ in blue will go down again.
2002
+
2003
+ 0:58:33.373 --> 0:58:35.663
2004
+ That is now a question.
2005
+
2006
+ 0:58:35.663 --> 0:58:37.762
2007
+ Why is that the case?
2008
+
2009
+ 0:58:37.762 --> 0:58:43.678
2010
+ Why should we are seeing more and more possible
2011
+ translations?
2012
+
2013
+ 0:58:46.226 --> 0:58:48.743
2014
+ If we have a bigger stretch and we are going.
2015
+
2016
+ 0:58:52.612 --> 0:58:56.312
2017
+ I'm going to be using my examples before we
2018
+ also look at the bar.
2019
+
2020
+ 0:58:56.656 --> 0:58:59.194
2021
+ A good idea.
2022
+
2023
+ 0:59:00.000 --> 0:59:18.521
2024
+ But it's not everything because we in the
2025
+ end always in this list we're selecting.
2026
+
2027
+ 0:59:18.538 --> 0:59:19.382
2028
+ So this is here.
2029
+
2030
+ 0:59:19.382 --> 0:59:21.170
2031
+ We don't do any regions to do that.
2032
+
2033
+ 0:59:21.601 --> 0:59:29.287
2034
+ So the probabilities at the end we always
2035
+ give out the hypothesis with the highest probabilities.
2036
+
2037
+ 0:59:30.250 --> 0:59:33.623
2038
+ That is always the case.
2039
+
2040
+ 0:59:33.623 --> 0:59:43.338
2041
+ If you have a beam of this should be a subset
2042
+ of the items you look at.
2043
+
2044
+ 0:59:44.224 --> 0:59:52.571
2045
+ So if you increase your beam size you're just
2046
+ looking at more and you're always taking the
2047
+
2048
+ 0:59:52.571 --> 0:59:54.728
2049
+ one with the highest.
2050
+
2051
+ 0:59:57.737 --> 1:00:07.014
2052
+ Maybe they are all the probability that they
2053
+ will be comparable to don't really have.
2054
+
2055
+ 1:00:08.388 --> 1:00:14.010
2056
+ But the probabilities are the same, not that
2057
+ easy.
2058
+
2059
+ 1:00:14.010 --> 1:00:23.931
2060
+ One morning maybe you will have more examples
2061
+ where we look at some stuff that's not seen
2062
+
2063
+ 1:00:23.931 --> 1:00:26.356
2064
+ in the trading space.
2065
+
2066
+ 1:00:28.428 --> 1:00:36.478
2067
+ That's mainly the answer why we give a hyperability
2068
+ math we will see, but that is first of all
2069
+
2070
+ 1:00:36.478 --> 1:00:43.087
2071
+ the biggest issues, so here is a blue score,
2072
+ so that is somewhat translation.
2073
+
2074
+ 1:00:43.883 --> 1:00:48.673
2075
+ This will go down by the probability of the
2076
+ highest one that only goes out where stays
2077
+
2078
+ 1:00:48.673 --> 1:00:49.224
2079
+ at least.
2080
+
2081
+ 1:00:49.609 --> 1:00:57.971
2082
+ The problem is if we are searching more, we
2083
+ are finding high processes which have a high
2084
+
2085
+ 1:00:57.971 --> 1:00:59.193
2086
+ translation.
2087
+
2088
+ 1:00:59.579 --> 1:01:10.375
2089
+ So we are finding these things which we wouldn't
2090
+ find and we'll see why this is happening.
2091
+
2092
+ 1:01:10.375 --> 1:01:15.714
2093
+ So somehow we are reducing our search error.
2094
+
2095
+ 1:01:16.336 --> 1:01:25.300
2096
+ However, we also have a model error and we
2097
+ don't assign the highest probability to translation
2098
+
2099
+ 1:01:25.300 --> 1:01:27.942
2100
+ quality to the really best.
2101
+
2102
+ 1:01:28.548 --> 1:01:31.460
2103
+ They don't always add up.
2104
+
2105
+ 1:01:31.460 --> 1:01:34.932
2106
+ Of course somehow they add up.
2107
+
2108
+ 1:01:34.932 --> 1:01:41.653
2109
+ If your bottle is worse then your performance
2110
+ will even go.
2111
+
2112
+ 1:01:42.202 --> 1:01:49.718
2113
+ But sometimes it's happening that by increasing
2114
+ search errors we are missing out the really
2115
+
2116
+ 1:01:49.718 --> 1:01:57.969
2117
+ bad translations which have a high probability
2118
+ and we are only finding the decently good probability
2119
+
2120
+ 1:01:57.969 --> 1:01:58.460
2121
+ mass.
2122
+
2123
+ 1:01:59.159 --> 1:02:03.859
2124
+ So they are a bit independent of each other
2125
+ and you can make those types of arrows.
2126
+
2127
+ 1:02:04.224 --> 1:02:09.858
2128
+ That's why, for example, doing exact search
2129
+ will give you the translation with the highest
2130
+
2131
+ 1:02:09.858 --> 1:02:15.245
2132
+ probability, but there has been work on it
2133
+ that you then even have a lower translation
2134
+
2135
+ 1:02:15.245 --> 1:02:21.436
2136
+ quality because then you find some random translation
2137
+ which has a very high translation probability
2138
+
2139
+ 1:02:21.436 --> 1:02:22.984
2140
+ by which I'm really bad.
2141
+
2142
+ 1:02:23.063 --> 1:02:29.036
2143
+ Because our model is not perfect and giving
2144
+ a perfect translation probability over air,.
2145
+
2146
+ 1:02:31.431 --> 1:02:34.537
2147
+ So why is this happening?
2148
+
2149
+ 1:02:34.537 --> 1:02:42.301
2150
+ And one issue with this is the so-called label
2151
+ or length bias.
2152
+
2153
+ 1:02:42.782 --> 1:02:47.115
2154
+ And we are in each step of decoding.
2155
+
2156
+ 1:02:47.115 --> 1:02:55.312
2157
+ We are modeling the probability of the next
2158
+ word given the input and.
2159
+
2160
+ 1:02:55.895 --> 1:03:06.037
2161
+ So if you have this picture, so you always
2162
+ hear you have the probability of the next word.
2163
+
2164
+ 1:03:06.446 --> 1:03:16.147
2165
+ That's that's what your modeling, and of course
2166
+ the model is not perfect.
2167
+
2168
+ 1:03:16.576 --> 1:03:22.765
2169
+ So it can be that if we at one time do a bitter
2170
+ wrong prediction not for the first one but
2171
+
2172
+ 1:03:22.765 --> 1:03:28.749
2173
+ maybe for the 5th or 6th thing, then we're
2174
+ giving it an exceptional high probability we
2175
+
2176
+ 1:03:28.749 --> 1:03:30.178
2177
+ cannot recover from.
2178
+
2179
+ 1:03:30.230 --> 1:03:34.891
2180
+ Because this high probability will stay there
2181
+ forever and we just multiply other things to
2182
+
2183
+ 1:03:34.891 --> 1:03:39.910
2184
+ it, but we cannot like later say all this probability
2185
+ was a bit too high, we shouldn't have done.
2186
+
2187
+ 1:03:41.541 --> 1:03:48.984
2188
+ And this leads to that the more the longer
2189
+ your translation is, the more often you use
2190
+
2191
+ 1:03:48.984 --> 1:03:51.637
2192
+ this probability distribution.
2193
+
2194
+ 1:03:52.112 --> 1:04:03.321
2195
+ The typical example is this one, so you have
2196
+ the probability of the translation.
2197
+
2198
+ 1:04:04.104 --> 1:04:12.608
2199
+ And this probability is quite low as you see,
2200
+ and maybe there are a lot of other things.
2201
+
2202
+ 1:04:13.053 --> 1:04:25.658
2203
+ However, it might still be overestimated that
2204
+ it's still a bit too high.
2205
+
2206
+ 1:04:26.066 --> 1:04:33.042
2207
+ The problem is if you know the project translation
2208
+ is a very long one, but probability mask gets
2209
+
2210
+ 1:04:33.042 --> 1:04:33.545
2211
+ lower.
2212
+
2213
+ 1:04:34.314 --> 1:04:45.399
2214
+ Because each time you multiply your probability
2215
+ to it, so your sequence probability gets lower
2216
+
2217
+ 1:04:45.399 --> 1:04:46.683
2218
+ and lower.
2219
+
2220
+ 1:04:48.588 --> 1:04:59.776
2221
+ And this means that at some point you might
2222
+ get over this, and it might be a lower probability.
2223
+
2224
+ 1:05:00.180 --> 1:05:09.651
2225
+ And if you then have this probability at the
2226
+ beginning away, but it wasn't your beam, then
2227
+
2228
+ 1:05:09.651 --> 1:05:14.958
2229
+ at this point you would select the empty sentence.
2230
+
2231
+ 1:05:15.535 --> 1:05:25.379
2232
+ So this has happened because this short translation
2233
+ is seen and it's not thrown away.
2234
+
2235
+ 1:05:28.268 --> 1:05:31.121
2236
+ So,.
2237
+
2238
+ 1:05:31.151 --> 1:05:41.256
2239
+ If you have a very sore beam that can be prevented,
2240
+ but if you have a large beam, this one is in
2241
+
2242
+ 1:05:41.256 --> 1:05:41.986
2243
+ there.
2244
+
2245
+ 1:05:42.302 --> 1:05:52.029
2246
+ This in general seems reasonable that shorter
2247
+ pronunciations instead of longer sentences
2248
+
2249
+ 1:05:52.029 --> 1:05:54.543
2250
+ because non-religious.
2251
+
2252
+ 1:05:56.376 --> 1:06:01.561
2253
+ It's a bit depending on whether the translation
2254
+ should be a bit related to your input.
2255
+
2256
+ 1:06:02.402 --> 1:06:18.053
2257
+ And since we are always multiplying things,
2258
+ the longer the sequences we are getting smaller,
2259
+
2260
+ 1:06:18.053 --> 1:06:18.726
2261
+ it.
2262
+
2263
+ 1:06:19.359 --> 1:06:29.340
2264
+ It's somewhat right for humans too, but
2265
+ the models tend to overestimate the probability of
2266
+
2267
+ 1:06:29.340 --> 1:06:34.388
2268
+ short translations compared to long translations.
2269
+
2270
+ 1:06:35.375 --> 1:06:46.474
2271
+ Then, of course, that means that it's not
2272
+ easy to stay on a computer because eventually
2273
+
2274
+ 1:06:46.474 --> 1:06:48.114
2275
+ it suggests.
2276
+
2277
+ 1:06:51.571 --> 1:06:59.247
2278
+ First of all there is another way and that's
2279
+ typically used but you don't have to do really
2280
+
2281
+ 1:06:59.247 --> 1:07:07.089
2282
+ because this is normally not a second position
2283
+ and if it's like on the 20th position you only
2284
+
2285
+ 1:07:07.089 --> 1:07:09.592
2286
+ have to have some bean lower.
2287
+
2288
+ 1:07:10.030 --> 1:07:17.729
2289
+ But you are right because these issues get
2290
+ larger, the larger your input is, and then
2291
+
2292
+ 1:07:17.729 --> 1:07:20.235
2293
+ you might make more errors.
2294
+
2295
+ 1:07:20.235 --> 1:07:27.577
2296
+ So therefore this is true, but it's not as
2297
+ simple that this one is always in the.
2298
+
2299
+ 1:07:28.408 --> 1:07:45.430
2300
+ That the translation for it goes down with
2301
+ higher insert sizes has there been more control.
2302
+
2303
+ 1:07:47.507 --> 1:07:51.435
2304
+ In this work you see a dozen knocks.
2305
+
2306
+ 1:07:51.435 --> 1:07:53.027
2307
+ Knots go down.
2308
+
2309
+ 1:07:53.027 --> 1:08:00.246
2310
+ That's light green here, but at least you
2311
+ don't see the sharp drop.
2312
+
2313
+ 1:08:00.820 --> 1:08:07.897
2314
+ So if you do some type of normalization, at
2315
+ least you can assess this probability and limit
2316
+
2317
+ 1:08:07.897 --> 1:08:08.204
2318
+ it.
2319
+
2320
+ 1:08:15.675 --> 1:08:24.828
2321
+ There is other reasons why, like initial,
2322
+ it's not only the length, but there can be
2323
+
2324
+ 1:08:24.828 --> 1:08:26.874
2325
+ other reasons why.
2326
+
2327
+ 1:08:27.067 --> 1:08:37.316
2328
+ And if you just take it too large, you're
2329
+ looking too often at ways in between, but it's
2330
+
2331
+ 1:08:37.316 --> 1:08:40.195
2332
+ better to ignore things.
2333
+
2334
+ 1:08:41.101 --> 1:08:44.487
2335
+ But that's more a hand-wavy argument.
2336
+
2337
+ 1:08:44.487 --> 1:08:47.874
2338
+ Agree so don't know if the exact word.
2339
+
2340
+ 1:08:48.648 --> 1:08:53.223
2341
+ You need to do the normalization and there
2342
+ are different ways of doing it.
2343
+
2344
+ 1:08:53.223 --> 1:08:54.199
2345
+ It's mainly OK.
2346
+
2347
+ 1:08:54.199 --> 1:08:59.445
2348
+ We're just now not taking the translation
2349
+ with the highest probability, but we during
2350
+
2351
+ 1:08:59.445 --> 1:09:04.935
2352
+ the coding have another feature saying not
2353
+ only take the one with the highest probability
2354
+
2355
+ 1:09:04.935 --> 1:09:08.169
2356
+ but also prefer translations which are a bit
2357
+ longer.
2358
+
2359
+ 1:09:08.488 --> 1:09:16.933
2360
+ You can do that differently; one way is to divide
2361
+ by the sentence length.
2362
+
2363
+ 1:09:16.933 --> 1:09:23.109
2364
+ We take not the highest but the highest average.
2365
+
2366
+ 1:09:23.563 --> 1:09:28.841
2367
+ Of course, if both are the same lengths, it
2368
+ doesn't matter if M is the same lengths in
2369
+
2370
+ 1:09:28.841 --> 1:09:34.483
2371
+ all cases, but if you compare a translation
2372
+ with seven or eight words, there is a difference
2373
+
2374
+ 1:09:34.483 --> 1:09:39.700
2375
+ if you want to have the one with the highest
2376
+ probability or with the highest average.
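A small sketch of such length normalization, comparing hypotheses by average log-probability per token (the exponent alpha and the toy numbers are assumptions; illustration only):

# Sketch: length-normalized scoring, so longer translations are not
# penalized simply for having more factors multiplied together.
import math

def length_normalized_score(log_prob, length, alpha=1.0):
    # alpha = 1.0 is plain averaging; other exponents are also used in practice
    return log_prob / (length ** alpha)

hyps = [(["a", "b", "c"], math.log(0.004)), (["a", "b"], math.log(0.01))]
# picks the 3-token hypothesis here, even though its raw probability is lower
best = max(hyps, key=lambda h: length_normalized_score(h[1], len(h[0])))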
2377
+
2378
+ 1:09:41.021 --> 1:09:50.993
2379
+ So that is the first one can have some reward
2380
+ model for each word, add a bit of the score,
2381
+
2382
+ 1:09:50.993 --> 1:09:51.540
2383
+ and.
2384
+
2385
+ 1:09:51.711 --> 1:10:03.258
2386
+ And then, of course, you have to find you
2387
+ that there is also more complex ones here.
2388
+
2389
+ 1:10:03.903 --> 1:10:08.226
2390
+ So there is different ways of doing that,
2391
+ and of course that's important.
2392
+
2393
+ 1:10:08.428 --> 1:10:11.493
2394
+ But in all of that, the main idea is OK.
2395
+
2396
+ 1:10:11.493 --> 1:10:18.520
2397
+ We are like knowing of the arrow that the
2398
+ model seems to prevent or prefer short translation.
2399
+
2400
+ 1:10:18.520 --> 1:10:24.799
2401
+ We circumvent that by OK we are adding we
2402
+ are no longer searching for the best one.
2403
+
2404
+ 1:10:24.764 --> 1:10:30.071
2405
+ But we're searching for the one best one and
2406
+ some additional constraints, so mainly you
2407
+
2408
+ 1:10:30.071 --> 1:10:32.122
2409
+ are doing here during the coding.
2410
+
2411
+ 1:10:32.122 --> 1:10:37.428
2412
+ You're not completely trusting your model,
2413
+ but you're adding some buyers or constraints
2414
+
2415
+ 1:10:37.428 --> 1:10:39.599
2416
+ into what should also be fulfilled.
2417
+
2418
+ 1:10:40.000 --> 1:10:42.543
2419
+ That can be, for example, that the length
2420
+ should be recently.
2421
+
2422
+ 1:10:49.369 --> 1:10:51.071
2423
+ Any More Questions to That.
2424
+
2425
+ 1:10:56.736 --> 1:11:04.001
2426
+ Last idea which gets recently quite a bit
2427
+ more interest also is what is called minimum
2428
+
2429
+ 1:11:04.001 --> 1:11:11.682
2430
+ Bayes risk decoding, and there is maybe not the
2431
+ one correct translation but there are several
2432
+
2433
+ 1:11:11.682 --> 1:11:13.937
2434
+ good correct translations.
2435
+
2436
+ 1:11:14.294 --> 1:11:21.731
2437
+ And the idea is now we don't want to find
2438
+ the one translation, which is maybe the highest
2439
+
2440
+ 1:11:21.731 --> 1:11:22.805
2441
+ probability.
2442
+
2443
+ 1:11:23.203 --> 1:11:31.707
2444
+ Instead we are looking at all the high translation,
2445
+ all translation with high probability and then
2446
+
2447
+ 1:11:31.707 --> 1:11:39.524
2448
+ we want to take one representative out of this
2449
+ so we're just most similar to all the other
2450
+
2451
+ 1:11:39.524 --> 1:11:42.187
2452
+ high-probability translations again.
2453
+
2454
+ 1:11:43.643 --> 1:11:46.642
2455
+ So how does it work?
2456
+
2457
+ 1:11:46.642 --> 1:11:55.638
2458
+ First you could have imagined you have reference
2459
+ translations.
2460
+
2461
+ 1:11:55.996 --> 1:12:13.017
2462
+ You have a set of reference translations and
2463
+ then what you want to get is you want to have.
2464
+
2465
+ 1:12:13.073 --> 1:12:28.641
2466
+ As a probability distribution you measure
2467
+ the similarity of reference and the hypothesis.
2468
+
2469
+ 1:12:28.748 --> 1:12:31.408
2470
+ So you have two sets of translation.
2471
+
2472
+ 1:12:31.408 --> 1:12:34.786
2473
+ You have the human translations of a sentence.
2474
+
2475
+ 1:12:35.675 --> 1:12:39.251
2476
+ That's of course not realistic, but first
2477
+ from the idea.
2478
+
2479
+ 1:12:39.251 --> 1:12:42.324
2480
+ Then you have your set of possible translations.
2481
+
2482
+ 1:12:42.622 --> 1:12:52.994
2483
+ And now you're not saying okay, we have only
2484
+ one human, but we have several humans with
2485
+
2486
+ 1:12:52.994 --> 1:12:56.294
2487
+ different types of quality.
2488
+
2489
+ 1:12:56.796 --> 1:13:07.798
2490
+ You have to have two metrics here, the similarity
2491
+ between the automatic translation and the quality
2492
+
2493
+ 1:13:07.798 --> 1:13:09.339
2494
+ of the human.
2495
+
2496
+ 1:13:10.951 --> 1:13:17.451
2497
+ Of course, we have the same problem that we
2498
+ don't have the human reference, so we have.
2499
+
2500
+ 1:13:18.058 --> 1:13:29.751
2501
+ So when we are doing it, instead of estimating
2502
+ the quality based on the human, we use our
2503
+
2504
+ 1:13:29.751 --> 1:13:30.660
2505
+ model.
2506
+
2507
+ 1:13:31.271 --> 1:13:37.612
2508
+ So we can't be like humans, so we take the
2509
+ model probability.
2510
+
2511
+ 1:13:37.612 --> 1:13:40.782
2512
+ We take the set here first of.
2513
+
2514
+ 1:13:41.681 --> 1:13:48.755
2515
+ Then we are comparing each hypothesis to this
2516
+ one, so you have two sets.
2517
+
2518
+ 1:13:48.755 --> 1:13:53.987
2519
+ Just imagine here you take all possible translations.
2520
+
2521
+ 1:13:53.987 --> 1:13:58.735
2522
+ Here you take your hypothesis in comparing
2523
+ them.
2524
+
2525
+ 1:13:58.678 --> 1:14:03.798
2526
+ And then you're taking estimating the quality
2527
+ based on the outcome.
2528
+
2529
+ 1:14:04.304 --> 1:14:06.874
2530
+ So the overall idea is okay.
2531
+
2532
+ 1:14:06.874 --> 1:14:14.672
2533
+ We are not finding the best hypothesis but
2534
+ finding the hypothesis which is most similar
2535
+
2536
+ 1:14:14.672 --> 1:14:17.065
2537
+ to many good translations.
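A minimal sketch of this minimum Bayes risk selection, where similarity() stands for any sentence-level metric such as sentence BLEU or a neural metric, and the samples are the model's own outputs used as pseudo-references (illustration only):

# Sketch: pick the hypothesis that is most similar, on average, to all
# the other sampled high-probability translations.
def mbr_select(samples, similarity):
    best, best_utility = None, float("-inf")
    for h in samples:                        # candidate hypothesis
        # expected similarity of h to the other sampled pseudo-references
        utility = sum(similarity(h, y) for y in samples if y is not h)
        if utility > best_utility:
            best, best_utility = h, utility
    return best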
2538
+
2539
+ 1:14:19.599 --> 1:14:21.826
2540
+ Why would you do that?
2541
+
2542
+ 1:14:21.826 --> 1:14:25.119
2543
+ It's a bit like a smoothing idea.
2544
+
2545
+ 1:14:25.119 --> 1:14:28.605
2546
+ Imagine this is the probability of.
2547
+
2548
+ 1:14:29.529 --> 1:14:36.634
2549
+ So if you would do beam search or mini search
2550
+ or anything, if you just take the highest probability
2551
+
2552
+ 1:14:36.634 --> 1:14:39.049
2553
+ one, you would take this red one.
2554
+
2555
+ 1:14:39.799 --> 1:14:45.686
2556
+ Has this type of probability distribution.
2557
+
2558
+ 1:14:45.686 --> 1:14:58.555
2559
+ Then it might be better to take some of these
2560
+ models because it's a bit lower in probability.
2561
+
2562
+ 1:14:58.618 --> 1:15:12.501
2563
+ So what you're mainly doing is you're doing
2564
+ some smoothing of your probability distribution.
2565
+
2566
+ 1:15:15.935 --> 1:15:17.010
2567
+ How can you do that?
2568
+
2569
+ 1:15:17.010 --> 1:15:20.131
2570
+ Of course, we cannot do this again compared
2571
+ to all the hype.
2572
+
2573
+ 1:15:21.141 --> 1:15:29.472
2574
+ But what we can do is we have just two sets
2575
+ and we're just taking them the same.
2576
+
2577
+ 1:15:29.472 --> 1:15:38.421
2578
+ So we're having our penny data of the hypothesis
2579
+ and the set of pseudo references.
2580
+
2581
+ 1:15:39.179 --> 1:15:55.707
2582
+ And we can just take the same clue so we can
2583
+ just compare the utility of the.
2584
+
2585
+ 1:15:56.656 --> 1:16:16.182
2586
+ And then, of course, the question is how do
2587
+ we measure the quality of the hypothesis?
2588
+
2589
+ 1:16:16.396 --> 1:16:28.148
2590
+ Course: You could also take here the probability
2591
+ of this pee of given, but you can also say
2592
+
2593
+ 1:16:28.148 --> 1:16:30.958
2594
+ we only take the top.
2595
+
2596
+ 1:16:31.211 --> 1:16:39.665
2597
+ And where we don't want to really rely on
2598
+ how good they are, we filtered out all the
2599
+
2600
+ 1:16:39.665 --> 1:16:40.659
2601
+ bad ones.
2602
+
2603
+ 1:16:40.940 --> 1:16:54.657
2604
+ So that is the first question for the minimum
2605
+ Bayes risk decoding, and what are your pseudo references?
2606
+
2607
+ 1:16:55.255 --> 1:17:06.968
2608
+ So how do you set the quality of all these
2609
+ references here in the independent sampling?
2610
+
2611
+ 1:17:06.968 --> 1:17:10.163
2612
+ They all have the same.
2613
+
2614
+ 1:17:10.750 --> 1:17:12.308
2615
+ There's Also Work Where You Can Take That.
2616
+
2617
+ 1:17:13.453 --> 1:17:17.952
2618
+ And then the second question you have to do
2619
+ is, of course,.
2620
+
2621
+ 1:17:17.917 --> 1:17:26.190
2622
+ How do you prepare now two hypothesisms so
2623
+ you have now Y and H which are post generated
2624
+
2625
+ 1:17:26.190 --> 1:17:34.927
2626
+ by the system and you want to find the H which
2627
+ is most similar to all the other translations.
2628
+
2629
+ 1:17:35.335 --> 1:17:41.812
2630
+ So it's mainly like this model here, which
2631
+ says how similar is age to all the other whites.
2632
+
2633
+ 1:17:42.942 --> 1:17:50.127
2634
+ So you have to again use some type of similarity
2635
+ metric, which says how similar to possible.
2636
+
2637
+ 1:17:52.172 --> 1:17:53.775
2638
+ How can you do that?
2639
+
2640
+ 1:17:53.775 --> 1:17:58.355
2641
+ We luckily knew how to compare a reference
2642
+ to a hypothesis.
2643
+
2644
+ 1:17:58.355 --> 1:18:00.493
2645
+ We have evaluation metrics.
2646
+
2647
+ 1:18:00.493 --> 1:18:03.700
2648
+ You can do something like sentence level.
2649
+
2650
+ 1:18:04.044 --> 1:18:13.501
2651
+ But especially if you're looking into neuromodels
2652
+ you should have a stromometric so you can use
2653
+
2654
+ 1:18:13.501 --> 1:18:17.836
2655
+ a neural metric which directly compares to.
2656
+
2657
+ 1:18:22.842 --> 1:18:29.292
2658
+ Yes, so that is, is the main idea of minimum
2659
+ Bayes risk decoding, so the important idea you should
2660
+
2661
+ 1:18:29.292 --> 1:18:35.743
2662
+ keep in mind is that it's doing somehow the
2663
+ smoothing by not taking the highest probability
2664
+
2665
+ 1:18:35.743 --> 1:18:40.510
2666
+ one, but by comparing like by taking a set
2667
+ of high probability one.
2668
+
2669
+ 1:18:40.640 --> 1:18:45.042
2670
+ And then looking for the translation, which
2671
+ is most similar to all of that.
2672
+
2673
+ 1:18:45.445 --> 1:18:49.888
2674
+ And thereby doing a bit more smoothing because
2675
+ you look at this one.
2676
+
2677
+ 1:18:49.888 --> 1:18:55.169
2678
+ If you have this one, for example, it would
2679
+ be more similar to all of these ones.
2680
+
2681
+ 1:18:55.169 --> 1:19:00.965
2682
+ But if you take this one, it's higher probability,
2683
+ but it's very dissimilar to all these.
2684
+
2685
+ 1:19:05.445 --> 1:19:17.609
2686
+ Hey, that is all for decoding before we finish
2687
+ with your combination of models.
2688
+
2689
+ 1:19:18.678 --> 1:19:20.877
2690
+ How do you get the set of pseudo-references?
2691
+
2692
+ 1:19:20.877 --> 1:19:24.368
2693
+ Thomas Brown writes a little bit of type research
2694
+ or.
2695
+
2696
+ 1:19:24.944 --> 1:19:27.087
2697
+ For example, you can do beam search.
2698
+
2699
+ 1:19:27.087 --> 1:19:28.825
2700
+ You can do sampling for that.
2701
+
2702
+ 1:19:28.825 --> 1:19:31.257
2703
+ Oh yeah, we had mentioned sampling there.
2704
+
2705
+ 1:19:31.257 --> 1:19:34.500
2706
+ I don't know somebody asking for what sampling
2707
+ is good.
2708
+
2709
+ 1:19:34.500 --> 1:19:37.280
2710
+ So there's, of course, another important issue.
2711
+
2712
+ 1:19:37.280 --> 1:19:40.117
2713
+ How do you get a good representative set of
2714
+ age?
2715
+
2716
+ 1:19:40.620 --> 1:19:47.147
2717
+ If you do beam search, it might be that you
2718
+ end up with two similar ones, and maybe it's
2719
+
2720
+ 1:19:47.147 --> 1:19:49.274
2721
+ prevented by doing sampling.
2722
+
2723
+ 1:19:49.274 --> 1:19:55.288
2724
+ But maybe in sampling you find worse ones,
2725
+ but yet some type of model is helpful.
2726
+
2727
+ 1:19:56.416 --> 1:20:04.863
2728
+ Which search method is used more for transformer-based
2729
+ translation models?
2730
+
2731
+ 1:20:04.863 --> 1:20:09.848
2732
+ Nowadays beam search is definitely the standard.
2733
+
2734
+ 1:20:10.130 --> 1:20:13.749
2735
+ There is work on this.
2736
+
2737
+ 1:20:13.749 --> 1:20:27.283
2738
+ The problem is that the MBR is often a lot
2739
+ more like heavy because you have to sample
2740
+
2741
+ 1:20:27.283 --> 1:20:29.486
2742
+ translations.
2743
+
2744
+ 1:20:31.871 --> 1:20:40.946
2745
+ If you are bustling then we take a pen or
2746
+ a pen for the most possible one.
2747
+
2748
+ 1:20:40.946 --> 1:20:43.003
2749
+ Now we put them.
2750
+
2751
+ 1:20:43.623 --> 1:20:46.262
2752
+ Bit and then we say okay, you don't have to
2753
+ be fine.
2754
+
2755
+ 1:20:46.262 --> 1:20:47.657
2756
+ I'm going to put it to you.
2757
+
2758
+ 1:20:48.428 --> 1:20:52.690
2759
+ Yes, so that is what you can also do.
2760
+
2761
+ 1:20:52.690 --> 1:21:00.092
2762
+ Instead of taking uniform probability, you
2763
+ could take the model's.
2764
+
2765
+ 1:21:01.041 --> 1:21:14.303
2766
+ The uniform is a bit more robust because if
2767
+ you had this one it might be that there is
2768
+
2769
+ 1:21:14.303 --> 1:21:17.810
2770
+ some crazy exceptions.
2771
+
2772
+ 1:21:17.897 --> 1:21:21.088
2773
+ And then it would still relax.
2774
+
2775
+ 1:21:21.088 --> 1:21:28.294
2776
+ So if you look at this picture, the probability
2777
+ here would be higher.
2778
+
2779
+ 1:21:28.294 --> 1:21:31.794
2780
+ But yeah, that's a bit of tuning.
2781
+
2782
+ 1:21:33.073 --> 1:21:42.980
2783
+ In this case, and yes, it is like modeling
2784
+ also the ants that.
2785
+
2786
+ 1:21:49.169 --> 1:21:56.265
2787
+ The last thing is now we always have considered
2788
+ one model.
2789
+
2790
+ 1:21:56.265 --> 1:22:04.084
2791
+ It's also some prints helpful to not only
2792
+ look at one model but.
2793
+
2794
+ 1:22:04.384 --> 1:22:10.453
2795
+ So in general there's many ways of how you
2796
+ can make several models and with it's even
2797
+
2798
+ 1:22:10.453 --> 1:22:17.370
2799
+ easier: you can just start from three different random
2800
+ initializations and you get three different models
2801
+
2802
+ 1:22:17.370 --> 1:22:18.428
2803
+ and typically.
2804
+
2805
+ 1:22:19.019 --> 1:22:27.299
2806
+ And then the question is, can we combine their
2807
+ strength into one model and use that then?
2808
+
2809
+ 1:22:29.669 --> 1:22:39.281
2810
+ And that can be done and it can be either
2811
+ online or ensemble, and the more offline thing
2812
+
2813
+ 1:22:39.281 --> 1:22:41.549
2814
+ is called reranking.
2815
+
2816
+ 1:22:42.462 --> 1:22:52.800
2817
+ So the idea is, for example, an ensemble that
2818
+ you combine different initializations.
2819
+
2820
+ 1:22:52.800 --> 1:23:02.043
2821
+ Of course, you can also do other things like
2822
+ having different architecture.
2823
+
2824
+ 1:23:02.222 --> 1:23:08.922
2825
+ But the easiest thing you can change always
2826
+ in generating two motors is to have different.
2827
+
2828
+ 1:23:09.209 --> 1:23:24.054
2829
+ And then the question is how can you combine
2830
+ that?
2831
+
2832
+ 1:23:26.006 --> 1:23:34.245
2833
+ And the easiest thing, as said, is the ensemble
2834
+ of models.
2835
+
2836
+ 1:23:34.245 --> 1:23:39.488
2837
+ What you mainly do is in parallel.
2838
+
2839
+ 1:23:39.488 --> 1:23:43.833
2840
+ You decode with all of the models.
2841
+
2842
+ 1:23:44.444 --> 1:23:59.084
2843
+ So the probability of the output and you can
2844
+ join this one to a joint one by just summing
2845
+
2846
+ 1:23:59.084 --> 1:24:04.126
2847
+ up over your k models again.
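A sketch of this kind of ensemble at decoding time, assuming each model exposes a hypothetical next_word_probs(prefix) over the same output vocabulary (illustration only, not the lecture's code):

# Sketch: average the next-word distributions of k models at each step;
# the result is still a probability distribution over the shared vocabulary.
def ensemble_probs(models, prefix):
    combined = {}
    for m in models:
        for word, p in m.next_word_probs(prefix).items():
            combined[word] = combined.get(word, 0.0) + p / len(models)
    return combined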
2848
+
2849
+ 1:24:04.084 --> 1:24:10.374
2850
+ So you still have a probability distribution,
2851
+ but you are not taking only one output here,
2852
+
2853
+ 1:24:10.374 --> 1:24:10.719
2854
+ but.
2855
+
2856
+ 1:24:11.491 --> 1:24:20.049
2857
+ So that's one you can easily combine different
2858
+ models, and the nice thing is it typically
2859
+
2860
+ 1:24:20.049 --> 1:24:20.715
2861
+ works.
2862
+
2863
+ 1:24:21.141 --> 1:24:27.487
2864
+ You get additional improvement with only more
2865
+ calculation but not more human work.
2866
+
2867
+ 1:24:27.487 --> 1:24:33.753
2868
+ You just do the same thing for times and you're
2869
+ getting a better performance.
2870
+
2871
+ 1:24:33.793 --> 1:24:41.623
2872
+ Like having more layers and so on, the advantage
2873
+ of bigger models is of course you have to have
2874
+
2875
+ 1:24:41.623 --> 1:24:46.272
2876
+ the big models only joint and decoding during
2877
+ inference.
2878
+
2879
+ 1:24:46.272 --> 1:24:52.634
2880
+ There you have to load models in parallel
2881
+ because you have to do your search.
2882
+
2883
+ 1:24:52.672 --> 1:24:57.557
2884
+ Normally there is more memory resources for
2885
+ training than you need for inference.
2886
+
2887
+ 1:25:00.000 --> 1:25:12.637
2888
+ You have to train four models and the decoding
2889
+ speed is also slower because you need to decode
2890
+
2891
+ 1:25:12.637 --> 1:25:14.367
2892
+ four models.
2893
+
2894
+ 1:25:14.874 --> 1:25:25.670
2895
+ There is one other very important thing and
2896
+ the models have to be very similar, at least
2897
+
2898
+ 1:25:25.670 --> 1:25:27.368
2899
+ in some ways.
2900
+
2901
+ 1:25:27.887 --> 1:25:28.506
2902
+ Course.
2903
+
2904
+ 1:25:28.506 --> 1:25:34.611
2905
+ You can only combine this one if you have
2906
+ the same vocabulary, because you are just summing.
2907
+
2908
+ 1:25:34.874 --> 1:25:43.110
2909
+ So just imagine you have two different sizes
2910
+ because you want to compare them, or a character
2911
+
2912
+ 1:25:43.110 --> 1:25:44.273
2913
+ based model.
2914
+
2915
+ 1:25:44.724 --> 1:25:53.327
2916
+ That's at least not easily possible here because
2917
+ once your output would be here a word and the
2918
+
2919
+ 1:25:53.327 --> 1:25:56.406
2920
+ other one would have to sum over.
2921
+
2922
+ 1:25:56.636 --> 1:26:07.324
2923
+ So this ensemble typically only works if you
2924
+ have the same output vocabulary.
2925
+
2926
+ 1:26:07.707 --> 1:26:16.636
2927
+ Your input can be different because that is
2928
+ only done once and then.
2929
+
2930
+ 1:26:16.636 --> 1:26:23.752
2931
+ Your output vocabulary has to be the same
2932
+ otherwise.
2933
+
2934
+ 1:26:27.507 --> 1:26:41.522
2935
+ There's even a surprising effect of improving
2936
+ your performance and it's again some kind of
2937
+
2938
+ 1:26:41.522 --> 1:26:43.217
2939
+ smoothing.
2940
+
2941
+ 1:26:43.483 --> 1:26:52.122
2942
+ So normally during training what we are doing
2943
+ is we can save the checkpoints after each epoch.
2944
+
2945
+ 1:26:52.412 --> 1:27:01.774
2946
+ And you have this type of curve where your
2947
+ Arab performance normally should go down, and
2948
+
2949
+ 1:27:01.774 --> 1:27:09.874
2950
+ if you do early stopping it means that at the
2951
+ end you select not the lowest.
2952
+
2953
+ 1:27:11.571 --> 1:27:21.467
2954
+ However, some type of smoothing is there again.
2955
+
2956
+ 1:27:21.467 --> 1:27:31.157
2957
+ Sometimes what you can do is take an ensemble.
2958
+
2959
+ 1:27:31.491 --> 1:27:38.798
2960
+ That is not as good, but you still have four
2961
+ different bottles, and they give you a little.
2962
+
2963
+ 1:27:39.259 --> 1:27:42.212
2964
+ So,.
2965
+
2966
+ 1:27:43.723 --> 1:27:48.340
2967
+ It's some are helping you, so now they're
2968
+ supposed to be something different, you know.
2969
+
2970
+ 1:27:49.489 --> 1:27:53.812
2971
+ Oh didn't do that, so that is a checkpoint.
2972
+
2973
+ 1:27:53.812 --> 1:27:59.117
2974
+ There is one thing interesting, which is even
2975
+ faster.
2976
+
2977
+ 1:27:59.419 --> 1:28:12.255
2978
+ Normally let's give you better performance
2979
+ because this one might be again like a smooth
2980
+
2981
+ 1:28:12.255 --> 1:28:13.697
2982
+ ensemble.
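One common, cheap way to combine the last few checkpoints is to average their weights instead of ensembling them at decode time; a sketch assuming plain PyTorch-style tensor state dicts (the lecture's exact variant is not fully clear from the transcript):

# Sketch: average the parameters of the last k saved checkpoints.
import torch

def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")  # assumes a flat tensor state dict
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}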
2983
+
2984
+ 1:28:16.736 --> 1:28:22.364
2985
+ Of course, there is also some problems with
2986
+ this, so I said.
2987
+
2988
+ 1:28:22.364 --> 1:28:30.022
2989
+ For example, maybe you want to do different
2990
+ word representations, with BPE or characters.
2991
+
2992
+ 1:28:30.590 --> 1:28:37.189
2993
+ You want to do right to left decoding so you
2994
+ normally do like I go home but then your translation
2995
+
2996
+ 1:28:37.189 --> 1:28:39.613
2997
+ depends only on the previous words.
2998
+
2999
+ 1:28:39.613 --> 1:28:45.942
3000
+ If you want to model on the future you could
3001
+ do the inverse direction and generate the target
3002
+
3003
+ 1:28:45.942 --> 1:28:47.895
3004
+ sentence from right to left.
3005
+
3006
+ 1:28:48.728 --> 1:28:50.839
3007
+ But it's not easy to combine these things.
3008
+
3009
+ 1:28:51.571 --> 1:28:56.976
3010
+ In order to do this, or what is also sometimes
3011
+ interesting is doing in verse translation.
3012
+
3013
+ 1:28:57.637 --> 1:29:07.841
3014
+ You can combine these types of models in the
3015
+ next election.
3016
+
3017
+ 1:29:07.841 --> 1:29:13.963
3018
+ That is only a bit which we can do.
3019
+
3020
+ 1:29:14.494 --> 1:29:29.593
3021
+ Next time what you should remember is how
3022
+ search works and do you have any final questions.
3023
+
3024
+ 1:29:33.773 --> 1:29:43.393
3025
+ Then I wish you a happy holiday for next week
3026
+ and then Monday there is another practical
3027
+
3028
+ 1:29:43.393 --> 1:29:50.958
3029
+ and then Thursday in two weeks so we'll have
3030
+ the next lecture Monday.
3031
+
demo_data/lectures/Lecture-09-25.05.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fb17280ddd03304eacdea7239b8a65b617c0c5bc9a4ab92e07100370c09187af
3
+ size 119262060
demo_data/lectures/Lecture-10-13.06.2023/English.vtt ADDED
@@ -0,0 +1,2450 @@
1
+ WEBVTT
2
+
3
+ 0:00:00.860 --> 0:00:04.211
4
+ Okay Again Welcome.
5
+
6
+ 0:00:04.524 --> 0:00:09.256
7
+ So today I'll be doing the lecture.
8
+
9
+ 0:00:09.256 --> 0:00:12.279
10
+ My name is Danny Liro.
11
+
12
+ 0:00:12.279 --> 0:00:16.747
13
+ I'm one of the PhD students with.
14
+
15
+ 0:00:17.137 --> 0:00:25.942
16
+ And specifically how to learn representations
17
+ that are common across languages and use that
18
+
19
+ 0:00:25.942 --> 0:00:29.004
20
+ to help low resource languages.
21
+
22
+ 0:00:29.689 --> 0:00:39.445
23
+ So hope today we can explore a little bit
24
+ about multilingual machine translation and hopefully.
25
+
26
+ 0:00:40.100 --> 0:00:50.940
27
+ So today what we are going to do first we
28
+ are going to look at.
29
+
30
+ 0:00:52.152 --> 0:01:02.491
31
+ Second, we will be looking into more details
32
+ as in how we achieve multilingual machine translation
33
+
34
+ 0:01:02.491 --> 0:01:06.183
35
+ and what are the techniques there.
36
+
37
+ 0:01:06.183 --> 0:01:12.197
38
+ At last, we are going to look at the current
39
+ challenges.
40
+
41
+ 0:01:13.573 --> 0:01:15.976
42
+ Alright, so some definitions.
43
+
44
+ 0:01:15.976 --> 0:01:19.819
45
+ First, what is multilingual machine translation?
46
+
47
+ 0:01:21.201 --> 0:01:28.637
48
+ So for a multilingual machine translation
49
+ system, it's basically a system that is able
50
+
51
+ 0:01:28.637 --> 0:01:34.279
52
+ to handle multiple source languages or multiple
53
+ target languages.
54
+
55
+ 0:01:34.254 --> 0:01:44.798
56
+ You see here you've got source on the source
57
+ side, some German Chinese, Spanish and English.
58
+
59
+ 0:01:45.485 --> 0:01:50.615
60
+ Physically, it's also a quite interesting
61
+ machine learning challenge actually.
62
+
63
+ 0:01:51.031 --> 0:02:05.528
64
+ So if you consider each translation pair as
65
+ a different task in machine learning, then
66
+
67
+ 0:02:05.528 --> 0:02:08.194
68
+ a multilingual.
69
+
70
+ 0:02:08.628 --> 0:02:17.290
71
+ Where it has to specialize in all these different
72
+ translation directions and try to be good.
73
+
74
+ 0:02:17.917 --> 0:02:26.890
75
+ So this is basically about multi-task learning,
76
+ and here when translation direction being one
77
+
78
+ 0:02:26.890 --> 0:02:27.462
79
+ task.
80
+
81
+ 0:02:28.428 --> 0:02:35.096
82
+ Interesting question to ask here is like do
83
+ we get synergy like different tasks helping
84
+
85
+ 0:02:35.096 --> 0:02:39.415
86
+ each other, the knowledge of one task helping
87
+ the other?
88
+
89
+ 0:02:39.539 --> 0:02:48.156
90
+ Or do we get more interference in English
91
+ to German, and now I get worse at English to
92
+
93
+ 0:02:48.156 --> 0:02:49.047
94
+ Chinese.
95
+
96
+ 0:02:49.629 --> 0:02:55.070
97
+ So this is also a very interesting question
98
+ that we'll look into later.
99
+
100
+ 0:02:56.096 --> 0:02:58.605
101
+ Now a little bit of context.
102
+
103
+ 0:02:59.519 --> 0:03:04.733
104
+ We care about multilingual machine translation.
105
+
106
+ 0:03:04.733 --> 0:03:10.599
107
+ Part of the thing is that machine translation
108
+ models.
109
+
110
+ 0:03:11.291 --> 0:03:22.659
111
+ If you consider all the languages in the world,
112
+ there are a read it here roughly seven thousand
113
+
114
+ 0:03:22.659 --> 0:03:23.962
115
+ languages.
116
+
117
+ 0:03:24.684 --> 0:03:37.764
118
+ So consider this number, and if you think
119
+ about this many languages out there, how many
120
+
121
+ 0:03:37.764 --> 0:03:39.548
122
+ directions.
123
+
124
+ 0:03:40.220 --> 0:03:46.897
125
+ So this means to cover n languages.
126
+
127
+ 0:03:46.897 --> 0:03:59.374
128
+ We're going to end up with a quadratic, n
129
+ squared, number of directions.
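To make the quadratic growth concrete (a worked example added here, not from the lecture slides): with roughly 7,000 languages there are about 7,000 x 6,999, so roughly 49 million, ordered translation directions.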
130
+
131
+ 0:03:59.779 --> 0:04:02.290
132
+ This is very bad; quadratic is very bad.
133
+
134
+ 0:04:03.203 --> 0:04:14.078
135
+ The quadratic situation going on means that
136
+ for a lot of translation directions, if you
137
+
138
+ 0:04:14.078 --> 0:04:16.278
139
+ consider all the.
140
+
141
+ 0:04:17.177 --> 0:04:34.950
142
+ For many of them we aren't going to have any
143
+ parallel data as in existing translated data.
144
+
145
+ 0:04:35.675 --> 0:04:40.001
146
+ So this is a very data scarce situation.
147
+
148
+ 0:04:40.001 --> 0:04:49.709
149
+ We're not going to get parallel data in blue
150
+ wear, especially likely when you have a system
151
+
152
+ 0:04:49.709 --> 0:04:52.558
153
+ that covers tan languages.
154
+
155
+ 0:04:52.912 --> 0:05:04.437
156
+ If this access actually goes towards thousands
157
+ that are realistic, we are going to end up
158
+
159
+ 0:05:04.437 --> 0:05:06.614
160
+ with some holes.
161
+
162
+ 0:05:07.667 --> 0:05:15.400
163
+ So now we are going to ask: Can we use multilinguality
164
+ to help this kind of low-resource setting?
165
+
166
+ 0:05:15.875 --> 0:05:22.858
167
+ So one useful concept there is mutual intelligibility,
168
+ don't know if you've heard of this.
169
+
170
+ 0:05:23.203 --> 0:05:30.264
171
+ Basically isn't linguistic when you say somebody
172
+ who's speaking one language can directly without
173
+
174
+ 0:05:30.264 --> 0:05:33.218
175
+ learning understands the other language.
176
+
177
+ 0:05:33.218 --> 0:05:39.343
178
+ So if you're a German speaker maybe Dutch
179
+ or Danish and all that kind of stuff would
180
+
181
+ 0:05:39.343 --> 0:05:39.631
182
+ be.
183
+
184
+ 0:05:40.000 --> 0:05:45.990
185
+ Useful or like directly understandable partially
186
+ to you.
187
+
188
+ 0:05:46.586 --> 0:05:52.082
189
+ That is, thanks to this kind of mutual intelligibility
190
+ that is basically based on language
191
+
192
+ 0:05:52.082 --> 0:05:52.791
193
+ similarity.
194
+
195
+ 0:05:53.893 --> 0:05:57.105
196
+ And then there's knowledge sharing this concept.
197
+
198
+ 0:05:57.105 --> 0:06:01.234
199
+ I mean, it's quite intuitive, basically a
200
+ very German speaker.
201
+
202
+ 0:06:01.234 --> 0:06:06.805
203
+ If you start learning Dutch or Danish and
204
+ all these Mordic languages, I think you're
205
+
206
+ 0:06:06.805 --> 0:06:11.196
207
+ going to be faster than just a native English
208
+ speaker or anything.
209
+
210
+ 0:06:11.952 --> 0:06:18.751
211
+ So hopefully our model is also able to do
212
+ this, but we'll see later what the real situation.
213
+
214
+ 0:06:19.799 --> 0:06:27.221
215
+ So we said multilingual is good multilingual
216
+ transmission, it's nice and there's a lot of
217
+
218
+ 0:06:27.221 --> 0:06:28.210
219
+ potentials.
220
+
221
+ 0:06:28.969 --> 0:06:32.205
222
+ So it's a long path towards there.
223
+
224
+ 0:06:32.205 --> 0:06:37.569
225
+ Think all the efforts started in so quite
226
+ some years ago.
227
+
228
+ 0:06:37.958 --> 0:06:54.639
229
+ At first people started with models with language
230
+ specific modules.
231
+
232
+ 0:06:54.454 --> 0:06:58.747
233
+ So we talked about the input of the decoder
234
+ architecture in the previous lecturer area.
235
+
236
+ 0:07:00.100 --> 0:07:06.749
237
+ And with this separation of the inputter and
238
+ the decoder, it gives it a natural way to split
239
+
240
+ 0:07:06.749 --> 0:07:07.679
241
+ the modules.
242
+
243
+ 0:07:09.069 --> 0:07:20.805
244
+ So basically what's happening going on here
245
+ is dedicated to each toes language and dedicated.
246
+
247
+ 0:07:21.281 --> 0:07:34.252
248
+ Now given parallel data of body good data
249
+ English German data we just activate this German
250
+
251
+ 0:07:34.252 --> 0:07:39.241
252
+ inputter and activate this and an.
253
+
254
+ 0:07:40.680 --> 0:07:48.236
255
+ So now we are training basically like corresponding
256
+ parts of the encoder decoders.
257
+
258
+ 0:07:48.236 --> 0:07:55.278
259
+ It has some advantages: First, we have a multilingual
260
+ system.
261
+
262
+ 0:07:55.278 --> 0:08:03.898
263
+ Of course, second modularity is also an advantage
264
+ in software engineering.
265
+
266
+ 0:08:03.898 --> 0:08:10.565
267
+ We want to decouple things if the German input
268
+ is broken.
269
+
270
+ 0:08:11.011 --> 0:08:19.313
271
+ So modularity is advantage in this case, but
272
+ again if we think about scalability, if we
273
+
274
+ 0:08:19.313 --> 0:08:27.521
275
+ think about all the languages out there that we talked
276
+ about, scalability isn't great here.
277
+
278
+ 0:08:27.947 --> 0:08:37.016
279
+ We also talked about sharing knowledge or
280
+ sharing representations for different languages.
281
+
282
+ 0:08:37.317 --> 0:08:41.968
283
+ We have a separate thing for each language.
284
+
285
+ 0:08:41.968 --> 0:08:46.513
286
+ How likely is it that we are sharing much?
287
+
288
+ 0:08:46.513 --> 0:08:52.538
289
+ So these are potential disadvantages with
290
+ this approach.
291
+
292
+ 0:08:53.073 --> 0:09:01.181
293
+ So yeah we talked about, we want to have knowledge
294
+ transfer, we want to have similar languages
295
+
296
+ 0:09:01.181 --> 0:09:02.888
297
+ helping each other.
298
+
299
+ 0:09:02.822 --> 0:09:06.095
300
+ This is somehow a more reachable goal.
301
+
302
+ 0:09:06.095 --> 0:09:13.564
303
+ If you have a shared encoder and a shared
304
+ decoder, basically a fully parameter-shared model
305
+
306
+ 0:09:13.564 --> 0:09:21.285
307
+ for all the translation pairs out there, and
308
+ there's also another gain, so if you just have
309
+
310
+ 0:09:21.285 --> 0:09:21.705
311
+ one.
312
+
313
+ 0:09:22.582 --> 0:09:26.084
314
+ One single model for all the translation directions
315
+ out there.
316
+
317
+ 0:09:26.606 --> 0:09:38.966
318
+ It's easier to deploy in the sense that if
319
+ you are serving a model you don't have a thousand
320
+
321
+ 0:09:38.966 --> 0:09:42.555
322
+ small modules to maintain.
323
+
324
+ 0:09:42.762 --> 0:09:52.448
325
+ So in terms of engineering, somehow these kinds
326
+ of fully parameter-shared models have advantages. So this
327
+
328
+ 0:09:52.448 --> 0:09:59.819
329
+ is also where the parent research has been
330
+ going towards in recent years.
331
+
332
+ 0:10:00.460 --> 0:10:16.614
333
+ So the rest of the lecture is also going
334
+ to focus on this kind of model.
335
+
336
+ 0:10:17.037 --> 0:10:30.901
337
+ So the first type of multilinguality is this
338
+ kind of many-to-one situation.
339
+
340
+ 0:10:30.901 --> 0:10:34.441
341
+ Basically what's going.
342
+
343
+ 0:10:35.355 --> 0:10:49.804
344
+ So one use case that you can think of here
345
+ is if you are doing subtitles for international movies
346
+
347
+ 0:10:49.804 --> 0:10:51.688
348
+ in Germany.
349
+
350
+ 0:10:53.073 --> 0:11:02.863
351
+ Then flipping the situation there is also
352
+ many configurations where we only have one
353
+
354
+ 0:11:02.863 --> 0:11:04.798
355
+ source language.
356
+
357
+ 0:11:06.046 --> 0:11:13.716
358
+ There's also many use cases like if you think
359
+ about the lecture translator here you've seen.
360
+
361
+ 0:11:14.914 --> 0:11:21.842
362
+ So here most of the lectures are in German
363
+ and now we want to translate them into other languages.
364
+
365
+ 0:11:21.842 --> 0:11:28.432
366
+ I think on the user end we only support English
367
+ but they're also supportable.
368
+
369
+ 0:11:28.608 --> 0:11:38.988
370
+ So in this kind of used case, if you have
371
+ one speaker and you want to serve or expand
372
+
373
+ 0:11:38.988 --> 0:11:41.281
374
+ to many audience,.
375
+
376
+ 0:11:42.802 --> 0:11:50.542
377
+ But of course, combining everything, there's
378
+ the many to many situation here.
379
+
380
+ 0:11:50.542 --> 0:11:54.015
381
+ You can think of Google Translate.
382
+
383
+ 0:11:54.015 --> 0:11:58.777
384
+ They are doing basically any selected language.
385
+
386
+ 0:11:59.159 --> 0:12:03.760
387
+ And this is also more difficult.
388
+
389
+ 0:12:03.760 --> 0:12:14.774
390
+ If you consider the data you need to get and
391
+ concerns, we'll cover this later.
392
+
393
+ 0:12:15.135 --> 0:12:21.034
394
+ But first we are going to start with many
395
+ to one translations.
396
+
397
+ 0:12:21.741 --> 0:12:30.436
398
+ Say this is the most similar to the bilingual
399
+ translation situation you saw earlier, but
400
+
401
+ 0:12:30.436 --> 0:12:39.423
402
+ now one difference is we need a vocabulary
403
+ or tokens that can represent all these different
404
+
405
+ 0:12:39.423 --> 0:12:40.498
406
+ languages.
407
+
408
+ 0:12:41.301 --> 0:12:44.200
409
+ So we need a joint multilingual vocabulary.
410
+
411
+ 0:12:44.924 --> 0:12:48.794
412
+ So let's just quickly recall what word embeddings
413
+ are supposed to do.
414
+
415
+ 0:12:49.189 --> 0:12:54.561
416
+ Basically we need to represent it.
417
+
418
+ 0:12:54.561 --> 0:13:04.077
419
+ We have to get some vector representation
420
+ for discrete words.
421
+
422
+ 0:13:04.784 --> 0:13:16.911
423
+ And when we embed a token, we are retrieving
424
+ the corresponding vector out of this lookup table.
425
+
426
+ 0:13:17.697 --> 0:13:19.625
427
+ And then we put it.
428
+
429
+ 0:13:19.625 --> 0:13:26.082
430
+ We feed a sequence of vectors into the encoder
431
+ as the next steps.
432
+
433
+ 0:13:26.987 --> 0:13:34.973
434
+ Now if it's multilingual you can imagine that
435
+ vocabulary suddenly gets very, very big because
436
+
437
+ 0:13:34.973 --> 0:13:36.262
438
+ the languages.
439
+
440
+ 0:13:37.877 --> 0:13:46.141
441
+ So what is quite useful here is byte-pair
442
+ encoding, like the subwords you talked about earlier.
443
+
444
+ 0:13:46.406 --> 0:13:55.992
445
+ So in this case we are still limiting ourselves
446
+ to a finite vocabulary size, so that we
447
+
448
+ 0:13:55.992 --> 0:13:59.785
449
+ are not exploding the vocabulary table.
450
+
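As a rough illustration (not the exact setup used here), a joint subword vocabulary over all languages can be learned with an off-the-shelf tool such as sentencepiece; the corpus file, vocabulary size and options below are placeholders:

import sentencepiece as spm

# corpus.txt is assumed to mix sentences from all languages, ideally sampled
# so that no single language dominates which merges are learned.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="multilingual_bpe",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,  # keep rare characters from large scripts
)

sp = spm.SentencePieceProcessor(model_file="multilingual_bpe.model")
print(sp.encode("Ein mehrsprachiges Beispiel", out_type=str))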
451
+ 0:14:01.181 --> 0:14:11.631
452
+ So when we learn these kinds of subwords,
453
+ what happens basically?
454
+
455
+ 0:14:11.631 --> 0:14:17.015
456
+ We look at all the training data.
457
+
458
+ 0:14:18.558 --> 0:14:20.856
459
+ So think about this.
460
+
461
+ 0:14:20.856 --> 0:14:28.077
462
+ If we do this now on a bunch of multilingual data,
463
+ are there concerns?
464
+
465
+ 0:14:30.050 --> 0:14:36.811
466
+ Maybe we have an unbalanced data set,
467
+ so we get overly English-centric merges and vocabulary entries.
468
+
469
+ 0:14:37.337 --> 0:14:39.271
470
+ Yeah Exactly Thanks.
471
+
472
+ 0:14:39.539 --> 0:14:46.602
473
+ So what we have to pay attention to here is
474
+ learning this multilingual vocabulary.
475
+
476
+ 0:14:46.602 --> 0:14:52.891
477
+ We should pay attention: All the languages
478
+ are more or less balanced, not that you only
479
+
480
+ 0:14:52.891 --> 0:14:58.912
481
+ learning word pieces for English or some bigger
482
+ languages, and then neglecting the other
483
+
484
+ 0:14:58.912 --> 0:15:00.025
485
+ languages, yeah.
486
+
487
+ 0:15:01.021 --> 0:15:04.068
488
+ Of course, this is not going to solve everything.
489
+
490
+ 0:15:04.068 --> 0:15:09.614
491
+ Even if we get a perfectly uniform distribution
492
+ out of all the languages out, there is not
493
+
494
+ 0:15:09.614 --> 0:15:13.454
495
+ going to mean that we are ending up with a
496
+ perfect vocabulary.
497
+
498
+ 0:15:14.154 --> 0:15:20.068
499
+ There are also language differences read,
500
+ so if you consider more European languages.
501
+
502
+ 0:15:20.180 --> 0:15:27.081
503
+ There will be many shared subcomponents like
504
+ how you write a certain word, somewhat similar.
505
+
506
+ 0:15:27.267 --> 0:15:34.556
507
+ But then there are other languages with completely
508
+ different scripts like Arabic, Cyrillic scripts
509
+
510
+ 0:15:34.556 --> 0:15:40.594
511
+ or Eastern Asian scripts where you get a vocabulary
512
+ like the characters set with.
513
+
514
+ 0:15:40.940 --> 0:15:43.531
515
+ Tens of thousands of characters.
516
+
517
+ 0:15:43.531 --> 0:15:50.362
518
+ So these are also individual concerns that
519
+ one has to think about when building specific
520
+
521
+ 0:15:50.362 --> 0:15:51.069
522
+ systems.
523
+
524
+ 0:15:51.591 --> 0:16:02.660
525
+ But overall, the rule of thumb is that when
526
+ you build a multilingual tokenizer vocabulary, the languages should be
527
+
528
+ 0:16:02.660 --> 0:16:04.344
529
+ more or less balanced.
530
+
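One common recipe for getting this rough balance (a sketch of temperature-based sampling, not necessarily the exact scheme referred to here): each language's data share is raised to 1/T before normalizing, which flattens the distribution for T greater than 1 and so upsamples smaller languages:

def sampling_probs(sizes, temperature=5.0):
    # sizes: sentences per language; returns the probability of drawing each
    # language when building the tokenizer corpus (or training batches).
    # T = 1 keeps raw proportions; larger T upsamples low-resource languages.
    weights = {lang: n ** (1.0 / temperature) for lang, n in sizes.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}

print(sampling_probs({"high-resource": 1_000_000_000, "low-resource": 30_000}))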
531
+ 0:16:05.385 --> 0:16:17.566
532
+ And there's actually some paper showing that
533
+ the performance of the final system is going
534
+
535
+ 0:16:17.566 --> 0:16:25.280
536
+ to start to degrade if you have a disproportionate
537
+ data.
538
+
539
+ 0:16:27.207 --> 0:16:33.186
540
+ Of course there is currently the trend of
541
+ using pre-train models.
542
+
543
+ 0:16:33.186 --> 0:16:39.890
544
+ If you take a pre-train model somewhere then
545
+ you don't have this concern.
546
+
547
+ 0:16:40.580 --> 0:16:47.810
548
+ Making sure that you use the same tokenizers
549
+ that they used so that there is no train test
550
+
551
+ 0:16:47.810 --> 0:16:48.287
552
+ mismatch.
553
+
554
+ 0:16:48.888 --> 0:16:53.634
555
+ Yeah, for pre-training, we're going to talk
556
+ about a little bit later as well.
557
+
558
+ 0:16:54.734 --> 0:16:59.960
559
+ Alright: So now we have a multilingual vocabulary.
560
+
561
+ 0:17:00.920 --> 0:17:04.187
562
+ There are several good things, obviously.
563
+
564
+ 0:17:04.187 --> 0:17:10.953
565
+ So one thing is that if we have words that
566
+ are in the textful form like we said, there
567
+
568
+ 0:17:10.953 --> 0:17:16.242
569
+ are European languages that share some vocabulary,
570
+ then it's great.
571
+
572
+ 0:17:16.242 --> 0:17:19.897
573
+ Then we have the first step towards knowledge.
574
+
575
+ 0:17:20.000 --> 0:17:30.464
576
+ For example, the word pineapple for some reason
577
+ is also in Eastern European languages.
578
+
579
+ 0:17:30.464 --> 0:17:34.915
580
+ In Cyrillic scripts that's also the.
581
+
582
+ 0:17:36.116 --> 0:17:42.054
583
+ But however, there is also ambiguity if you
584
+ embed everything together: take the word 'die'.
585
+
586
+ 0:17:42.054 --> 0:17:46.066
587
+ Of course, it means different things in English and
588
+ German.
589
+
590
+ 0:17:46.246 --> 0:17:53.276
591
+ Then, of course, that's possible to rely on
592
+ further context.
593
+
594
+ 0:17:53.276 --> 0:17:59.154
595
+ It's not a problem, it's something to think
596
+ about.
597
+
598
+ 0:18:00.200 --> 0:18:11.061
599
+ And when we go higher to cover more vocabulary
600
+ entries, we might need to go bigger in the
601
+
602
+ 0:18:11.061 --> 0:18:13.233
603
+ vocabulary count.
604
+
605
+ 0:18:13.653 --> 0:18:28.561
606
+ So there is always sort of a bottleneck as
607
+ the number of languages increase.
608
+
609
+ 0:18:30.110 --> 0:18:32.836
610
+ Right, so what is the result?
611
+
612
+ 0:18:32.836 --> 0:18:38.289
613
+ What are these cross-lingual word embeddings actually
614
+ learning?
615
+
616
+ 0:18:40.160 --> 0:18:44.658
617
+ So normally to inspect them it's quite hard.
618
+
619
+ 0:18:44.658 --> 0:18:53.853
620
+ It's like high dimensional vectors with dimensions,
621
+ but researchers also try to project it.
622
+
623
+ 0:18:54.454 --> 0:19:05.074
624
+ So in this case it is a little bit small,
625
+ but in this case for English and French there
626
+
627
+ 0:19:05.074 --> 0:19:07.367
628
+ are many entries.
629
+
630
+ 0:19:07.467 --> 0:19:20.014
631
+ One example is like the same word appearing in
632
+ different morphological forms.
633
+
634
+ 0:19:20.014 --> 0:19:26.126
635
+ Basically, it's like a morphological.
636
+
637
+ 0:19:26.546 --> 0:19:32.727
638
+ There are also words in different languages
639
+ like think there is research for English and
640
+
641
+ 0:19:32.727 --> 0:19:33.282
642
+ French.
643
+
644
+ 0:19:33.954 --> 0:19:41.508
645
+ So the take away from this plot is that somehow
646
+ we learn a bit of semantic meanings beyond
647
+
648
+ 0:19:41.508 --> 0:19:43.086
649
+ the textual forms.
650
+
651
+ 0:19:45.905 --> 0:19:50.851
652
+ But then this looks good and this gives us
653
+ hope.
654
+
655
+ 0:19:52.252 --> 0:20:05.240
656
+ That if we consider what is the baseline here,
657
+ the baseline we compare to is a bilingual system
658
+
659
+ 0:20:05.240 --> 0:20:09.164
660
+ without any multilinguality.
661
+
662
+ 0:20:10.290 --> 0:20:19.176
663
+ This looks good because if we compare for
664
+ many Central European languages, Eastern and
665
+
666
+ 0:20:19.176 --> 0:20:28.354
667
+ Central European languages to English, we compare:
668
+ And we see that the many-to-English system has actually
669
+
670
+ 0:20:28.354 --> 0:20:30.573
671
+ always gained quite a bit over it.
672
+
673
+ 0:20:31.751 --> 0:20:38.876
674
+ But there is also later investigation on whether
675
+ it is actually due to multilinguality or
676
+
677
+ 0:20:38.876 --> 0:20:39.254
678
+ not.
679
+
680
+ 0:20:39.639 --> 0:20:46.692
681
+ So this is a spoiler won't tell much about
682
+ it until the second half, but just remember
683
+
684
+ 0:20:46.692 --> 0:20:47.908
685
+ there is this.
686
+
687
+ 0:20:49.449 --> 0:20:53.601
688
+ Now let's move on to one-to-many translation.
689
+
690
+ 0:20:53.601 --> 0:21:01.783
691
+ Let's recall in a normal transformer or any
692
+ encoder decoder setup.
693
+
694
+ 0:21:02.242 --> 0:21:08.839
695
+ We have an encoder that creates a sort of contextual
696
+ representation of the source sentence.
697
+
698
+ 0:21:09.949 --> 0:21:17.787
699
+ This is more or less the context for generating
700
+ the target sentence, right.
701
+
702
+ 0:21:17.787 --> 0:21:28.392
703
+ Now on the target side we get the first token,
704
+ then we feed it again and then get the second
705
+
706
+ 0:21:28.392 --> 0:21:29.544
707
+ decoding.
708
+
709
+ 0:21:31.651 --> 0:21:35.039
710
+ And now we have multiple target languages.
711
+
712
+ 0:21:35.039 --> 0:21:39.057
713
+ Does anybody see a problem with this architecture?
714
+
715
+ 0:21:48.268 --> 0:21:57.791
716
+ Specifically, it's in the decoder, so now
717
+ we have a German sentence encoded.
718
+
719
+ 0:21:57.791 --> 0:22:01.927
720
+ We now want to generate Spanish.
721
+
722
+ 0:22:07.367 --> 0:22:11.551
723
+ So the problem is how does the model know
724
+ which language to generate?
725
+
726
+ 0:22:12.112 --> 0:22:24.053
727
+ If you just give it a generic start token,
728
+ there is nothing telling the model which language to produce.
729
+
730
+ 0:22:24.944 --> 0:22:30.277
731
+ So that this can only be a guess, and this
732
+ model will definitely not run well.
733
+
734
+ 0:22:32.492 --> 0:22:40.021
735
+ So this comes to the question: How do we indicate
736
+ the intended target language to the model?
737
+
738
+ 0:22:41.441 --> 0:22:52.602
739
+ One first idea that people tried is basically
740
+ now on the source side to not only include the
741
+
742
+ 0:22:52.602 --> 0:22:53.552
743
+ source sentence, but also a marker like 'translate
744
+
745
+ 0:22:53.933 --> 0:23:01.172
746
+ to Spanish', things like this, so basically
747
+ the source is already informed.
748
+
749
+ 0:23:01.172 --> 0:23:12.342
750
+ The source sentence is already supplemented
751
+ with: Now this is also called a target forcing
752
+
753
+ 0:23:12.342 --> 0:23:19.248
754
+ in the sense that we try to force it to give
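A minimal sketch of this target forcing on the source side, assuming a made-up tag format such as <2es>; the tag is just another vocabulary token that tells the shared model which language to produce:

def add_target_tag(source_sentence: str, target_lang: str) -> str:
    # Prepend a target-language tag, e.g. "<2es>" for "translate into Spanish".
    return f"<2{target_lang}> {source_sentence}"

print(add_target_tag("Das ist ein Test.", "es"))  # "<2es> Das ist ein Test."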
755
+ the right target.
756
+
757
+ 0:23:20.080 --> 0:23:24.622
758
+ This is one approach.
759
+
760
+ 0:23:24.622 --> 0:23:38.044
761
+ Another approach is basically based on the
762
+ idea that if we have.
763
+
764
+ 0:23:38.438 --> 0:23:52.177
765
+ So if we create a contextual representation of the source,
766
+ the encoder output shouldn't really differ depending on the target language.
767
+
768
+ 0:23:52.472 --> 0:24:02.397
769
+ So out of this motivation people have moved
770
+ this signaling mechanism to the decoder side.
771
+
772
+ 0:24:02.397 --> 0:24:09.911
773
+ They basically replaced the traditional start
774
+ token with a target-language-specific one.
775
+
776
+ 0:24:10.330 --> 0:24:17.493
777
+ So here we are not feeding in the
778
+ generic start token anymore but instead a language-
779
+
780
+ 0:24:17.493 --> 0:24:18.298
781
+ specific one.
782
+
783
+ 0:24:18.938 --> 0:24:21.805
784
+ So this is also another way to achieve this.
785
+
786
+ 0:24:23.283 --> 0:24:27.714
787
+ But there are still more challenging cases.
788
+
789
+ 0:24:27.714 --> 0:24:35.570
790
+ Sometimes the output can still fall back to, say,
791
+ English or German even when it shouldn't.
792
+
793
+ 0:24:35.570 --> 0:24:39.700
794
+ Later on it goes further and further on.
795
+
796
+ 0:24:40.320 --> 0:24:46.752
797
+ Basically this information is not strong enough
798
+ to always enforce the target language, especially
799
+
800
+ 0:24:46.752 --> 0:24:48.392
801
+ in zero shot conditions.
802
+
803
+ 0:24:48.392 --> 0:24:54.168
804
+ We'll look into this later so we'll get this
805
+ kind of target translation into generating
806
+
807
+ 0:24:54.168 --> 0:24:57.843
808
+ and generating and then going into some wrong
809
+ language.
810
+
811
+ 0:24:59.219 --> 0:25:12.542
812
+ So another technique actually developed here
813
+ some years ago was to inject this language information at every decoding step.
814
+
815
+ 0:25:12.872 --> 0:25:19.834
816
+ So when we are doing the autoregressive
817
+ decoding, normally we only feed the output token.
818
+
819
+ 0:25:20.000 --> 0:25:22.327
820
+ Into the decoder.
821
+
822
+ 0:25:22.327 --> 0:25:33.704
823
+ But if we also add a language embedding for
824
+ the target language, on top of that we have
825
+
826
+ 0:25:33.704 --> 0:25:37.066
827
+ the language information.
828
+
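A rough PyTorch sketch of this injection, assuming a small embedding table of target languages whose vector is added to every decoder input embedding; names and dimensions are illustrative only:

import torch
import torch.nn as nn

class DecoderInputWithLanguage(nn.Module):
    # Adds a target-language embedding to every decoder input token embedding,
    # so the language signal is present at each step, not only at the start.
    def __init__(self, vocab_size=32000, num_langs=8, d_model=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(num_langs, d_model)

    def forward(self, token_ids, lang_id):
        tok = self.tok_emb(token_ids)               # (batch, len, d_model)
        lang = self.lang_emb(lang_id).unsqueeze(1)  # (batch, 1, d_model)
        return tok + lang

emb = DecoderInputWithLanguage()
out = emb(torch.randint(0, 32000, (2, 7)), torch.tensor([0, 3]))
print(out.shape)  # torch.Size([2, 7, 512])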
829
+ 0:25:37.397 --> 0:25:44.335
830
+ And this has shown to perform quite a bit
831
+ better, especially in conditions where the
832
+
833
+ 0:25:44.335 --> 0:25:44.906
834
+ model.
835
+
836
+ 0:25:46.126 --> 0:25:56.040
837
+ So yeah, we introduced three ways to enforce
838
+ the target language. And now with this we're
839
+
840
+ 0:25:56.040 --> 0:26:02.607
841
+ going to move on to the more interesting case
842
+ of many-to-many translation.
843
+
844
+ 0:26:03.503 --> 0:26:14.021
845
+ Am so here we just consider a system that
846
+ translates two directions, say German to English
847
+
848
+ 0:26:14.021 --> 0:26:15.575
849
+ and English to French.
850
+
851
+ 0:26:16.676 --> 0:26:21.416
852
+ Now we have target languages read.
853
+
854
+ 0:26:21.416 --> 0:26:29.541
855
+ Can you see where we're enforcing the target
856
+ language here?
857
+
858
+ 0:26:29.541 --> 0:26:33.468
859
+ In this case what technique?
860
+
861
+ 0:26:34.934 --> 0:26:45.338
862
+ So here we are enforcing the target
863
+ language with the language tag when we train this system.
864
+
865
+ 0:26:46.526 --> 0:27:00.647
866
+ And at the inference time we are able to generate
867
+ English to French, but in addition to this
868
+
869
+ 0:27:00.647 --> 0:27:12.910
870
+ we are also able to: We will be able to do
871
+ zero shot inference that basically translates
872
+
873
+ 0:27:12.910 --> 0:27:17.916
874
+ a direction that is not seen in training.
875
+
876
+ 0:27:19.319 --> 0:27:25.489
877
+ So this is so called zero shot translation
878
+ using a multilingual system.
879
+
880
+ 0:27:26.606 --> 0:27:34.644
881
+ Of course, we have to achieve several things:
882
+ first, we must be able to control the target language,
883
+
884
+ 0:27:34.644 --> 0:27:36.769
885
+ otherwise it's no use.
886
+
887
+ 0:27:37.317 --> 0:27:51.087
888
+ Second, we should also have some kind of language
889
+ independent representation.
890
+
891
+ 0:27:51.731 --> 0:27:53.196
892
+ Why is this?
893
+
894
+ 0:27:53.196 --> 0:27:55.112
895
+ Why is this big?
896
+
897
+ 0:27:55.112 --> 0:28:00.633
898
+ Because if we want to generate French up
899
+ here?
900
+
901
+ 0:28:00.940 --> 0:28:05.870
902
+ It was trained to translate from some English.
903
+
904
+ 0:28:07.187 --> 0:28:15.246
905
+ But now we feed in encoded German instead,
906
+ so intuitively we need these representations
907
+
908
+ 0:28:15.246 --> 0:28:22.429
909
+ to be similar enough, not that they are so
910
+ far apart that we cannot use them.
911
+
912
+ 0:28:25.085 --> 0:28:32.059
913
+ So there are several works out there showing
914
+ that if you do a standard transformer architecture
915
+
916
+ 0:28:32.059 --> 0:28:39.107
917
+ this language independent property is not really
918
+ there and you need to add additional approaches
919
+
920
+ 0:28:39.107 --> 0:28:40.633
921
+ in order to enforce.
922
+
923
+ 0:28:41.201 --> 0:28:51.422
924
+ So you can, for example, add an additional
925
+ training objective that says: the encoded source,
926
+
927
+ 0:28:51.422 --> 0:29:00.305
928
+ say the encoded German and the encoded English,
929
+ have to be the same or be as close to each
930
+
931
+ 0:29:00.305 --> 0:29:02.201
932
+ other as possible.
933
+
934
+ 0:29:02.882 --> 0:29:17.576
935
+ So if we take the output and the output for
936
+ another language, how can we formulate this
937
+
938
+ 0:29:17.576 --> 0:29:18.745
939
+ as an.
940
+
941
+ 0:29:20.981 --> 0:29:27.027
942
+ We can take the translation to the encoder
943
+ and whatever you translate.
944
+
945
+ 0:29:27.027 --> 0:29:32.817
946
+ The embeddings also must be similar and that's
947
+ the great direction.
948
+
949
+ 0:29:33.253 --> 0:29:42.877
950
+ So one thing to take care of here is the length
951
+ for the same sentence in German and English
952
+
953
+ 0:29:42.877 --> 0:29:44.969
954
+ is not necessarily the same.
955
+
956
+ 0:29:45.305 --> 0:30:00.858
957
+ So instead of a word-to-word matching,
958
+ we can always do pooling to a fixed-length
959
+
960
+ 0:30:00.858 --> 0:30:03.786
961
+ representation.
962
+
963
+ 0:30:04.004 --> 0:30:08.392
964
+ Or there are more advanced techniques that
965
+ involve some alignments.
966
+
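One way such an auxiliary objective can look (a sketch with mean pooling and an MSE penalty, not the exact formulation used in the experiments mentioned here): pool the encoder states of a sentence and of its translation to fixed-length vectors and penalize their distance:

import torch
import torch.nn.functional as F

def mean_pool(states, mask):
    # states: (batch, len, d); mask: (batch, len), 1 for real tokens, 0 for padding.
    mask = mask.unsqueeze(-1).float()
    return (states * mask).sum(1) / mask.sum(1).clamp(min=1.0)

def similarity_loss(enc_src, mask_src, enc_tgt, mask_tgt):
    # Encourage language-independent encodings: a sentence and its translation
    # should be mapped to nearby points, even though their lengths differ.
    return F.mse_loss(mean_pool(enc_src, mask_src), mean_pool(enc_tgt, mask_tgt))

# total_loss = translation_loss + lambda_sim * similarity_loss(...)
e_de, m_de = torch.randn(2, 5, 512), torch.ones(2, 5)
e_en, m_en = torch.randn(2, 8, 512), torch.ones(2, 8)
print(similarity_loss(e_de, m_de, e_en, m_en).item())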
967
+ 0:30:08.848 --> 0:30:23.456
968
+ So this is useful in the sense that in this
969
+ part in experiments we have shown it improves
970
+
971
+ 0:30:23.456 --> 0:30:27.189
972
+ zero shot translation.
973
+
974
+ 0:30:27.447 --> 0:30:36.628
975
+ This is on the data condition of English to
976
+ Malay, Java and Filipino, so a kind of mid-to-
977
+
978
+ 0:30:36.628 --> 0:30:39.722
979
+ low resource language family.
980
+
981
+ 0:30:40.100 --> 0:30:50.876
982
+ And there we assume that we get parallel English
983
+ to all of them, but nothing among these languages themselves.
984
+
985
+ 0:30:51.451 --> 0:31:03.592
986
+ So the blue bar is a Vanilla Transformer model,
987
+ and the purple bar is when we add a language.
988
+
989
+ 0:31:04.544 --> 0:31:12.547
990
+ You see that in supervised conditions it's
991
+ not changing much, but in zero shots there's
992
+
993
+ 0:31:12.547 --> 0:31:13.183
994
+ quite a gain.
995
+
996
+ 0:31:15.215 --> 0:31:22.649
997
+ Yeah, so far we said zero shots is doable
998
+ and it's even more achievable if we enforce
999
+
1000
+ 0:31:22.649 --> 0:31:26.366
1001
+ some language independent representations.
1002
+
1003
+ 0:31:26.366 --> 0:31:29.823
1004
+ However, there's one practical concern.
1005
+
1006
+ 0:31:29.823 --> 0:31:33.800
1007
+ Don't know if you also had the same question.
1008
+
1009
+ 0:31:34.514 --> 0:31:39.835
1010
+ If you have two languages, you don't have
1011
+ direct parallel data between them.
1012
+
1013
+ 0:31:39.835 --> 0:31:43.893
1014
+ You go once into English and once out of English.
1015
+
1016
+ 0:31:45.685 --> 0:31:52.845
1017
+ It's actually this kind of approach is called
1018
+ pivoting as in pivoting over an intermediate
1019
+
1020
+ 0:31:52.845 --> 0:31:53.632
1021
+ language.
1022
+
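A toy sketch of pivoting: two supervised systems are chained, here faked with dictionary lookups; the doubled inference cost and the loss of source-side information (such as speaker gender, picked up again below) both come from this middle step:

# Toy stand-ins for two separately trained supervised systems.
def translate_de_en(text: str) -> str:
    return {"Ich fühle mich entfremdet.": "I feel alienated."}.get(text, text)

def translate_en_fr(text: str) -> str:
    return {"I feel alienated.": "Je me sens aliéné."}.get(text, text)

def pivot_translate_de_fr(text: str) -> str:
    # Two hops instead of one direct (possibly zero-shot) hop: twice the
    # computation, and anything English does not mark is lost in the middle.
    return translate_en_fr(translate_de_en(text))

print(pivot_translate_de_fr("Ich fühle mich entfremdet."))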
1023
+ 0:31:55.935 --> 0:32:00.058
1024
+ Yeah, that it definitely has advantages in
1025
+ the sense that we're going.
1026
+
1027
+ 0:32:00.440 --> 0:32:11.507
1028
+ Now if we go over these two steps every direction
1029
+ was trained with supervised data so you could
1030
+
1031
+ 0:32:11.507 --> 0:32:18.193
1032
+ always assume that when we are working with
1033
+ a supervised.
1034
+
1035
+ 0:32:18.718 --> 0:32:26.868
1036
+ So in this case we can expect more robust
1037
+ inference time behavior.
1038
+
1039
+ 0:32:26.868 --> 0:32:31.613
1040
+ However, there are also disadvantages.
1041
+
1042
+ 0:32:31.531 --> 0:32:38.860
1043
+ An inference where passing through the model
1044
+ ties so that's doubling the inference time
1045
+
1046
+ 0:32:38.860 --> 0:32:39.943
1047
+ computation.
1048
+
1049
+ 0:32:40.500 --> 0:32:47.878
1050
+ You might think okay doubling then what, but
1051
+ if you consider if your company like Google,
1052
+
1053
+ 0:32:47.878 --> 0:32:54.929
1054
+ Google Translate and all your life traffic
1055
+ suddenly becomes twice as big, this is not
1056
+
1057
+ 0:32:54.929 --> 0:33:00.422
1058
+ something scalable that you want to see, especially
1059
+ in production.
1060
+
1061
+ 0:33:01.641 --> 0:33:11.577
1062
+ Another problem with this is information
1063
+ loss, because going over these hops is like the game where
1064
+
1065
+ 0:33:11.577 --> 0:33:20.936
1066
+ a chain of kids pass the word to each other,
1067
+ in the end it's losing information.
1068
+
1069
+ 0:33:22.082 --> 0:33:24.595
1070
+ Can give it an example here.
1071
+
1072
+ 0:33:24.595 --> 0:33:27.803
1073
+ It's also from a master thesis here.
1074
+
1075
+ 0:33:27.803 --> 0:33:30.316
1076
+ It's on gender preservation.
1077
+
1078
+ 0:33:30.770 --> 0:33:39.863
1079
+ Basically, some languages like Italian and
1080
+ French have different word forms based on the
1081
+
1082
+ 0:33:39.863 --> 0:33:40.782
1083
+ speaker.
1084
+
1085
+ 0:33:41.001 --> 0:33:55.987
1086
+ So if a male person says feel alienated, this
1087
+ word for alienated would take the masculine form, while a
1088
+
1089
+ 0:33:55.987 --> 0:33:58.484
1090
+ female speaker would use a different form.
1091
+
1092
+ 0:34:00.620 --> 0:34:05.730
1093
+ Now imagine that we pivot through English.
1094
+
1095
+ 0:34:05.730 --> 0:34:08.701
1096
+ The information is lost.
1097
+
1098
+ 0:34:08.701 --> 0:34:11.910
1099
+ We don't know what gender.
1100
+
1101
+ 0:34:12.492 --> 0:34:19.626
1102
+ When we go out into French again, there are
1103
+ different forms.
1104
+
1105
+ 0:34:19.626 --> 0:34:29.195
1106
+ Depending on the speaker gender, we can: So
1107
+ this is one problem.
1108
+
1109
+ 0:34:31.871 --> 0:34:44.122
1110
+ This is especially the case because English
1111
+ compared to many other languages is relatively
1112
+
1113
+ 0:34:44.122 --> 0:34:45.199
1114
+ simple.
1115
+
1116
+ 0:34:45.205 --> 0:34:53.373
1117
+ It doesn't have gendered word forms like this, and it also
1118
+ doesn't have many cases, so going through English
1119
+
1120
+ 0:34:53.373 --> 0:34:56.183
1121
+ much information would be lost.
1122
+
1123
+ 0:34:57.877 --> 0:35:12.796
1124
+ And another thing is if you have similar languages
1125
+ that you are translating out of my systems
1126
+
1127
+ 0:35:12.796 --> 0:35:15.494
1128
+ that translates.
1129
+
1130
+ 0:35:16.496 --> 0:35:24.426
1131
+ This is the output of going from Dutch to
1132
+ German again.
1133
+
1134
+ 0:35:24.426 --> 0:35:30.231
1135
+ If you read the German, how many of you?
1136
+
1137
+ 0:35:32.552 --> 0:35:51.679
1138
+ Good and the problem here is that we are going
1139
+ over English and then the English to German.
1140
+
1141
+ 0:35:51.831 --> 0:36:06.332
1142
+ However, if we go direct in this case zero
1143
+ shot translation you see that word forgive.
1144
+
1145
+ 0:36:06.546 --> 0:36:09.836
1146
+ In this case, the direct translation is better.
1147
+
1148
+ 0:36:10.150 --> 0:36:20.335
1149
+ And we believe this has to do with using the
1150
+ language similarity between the two languages.
1151
+
1152
+ 0:36:20.335 --> 0:36:26.757
1153
+ There is also quantitative results we found
1154
+ when born in.
1155
+
1156
+ 0:36:27.988 --> 0:36:33.780
1157
+ The models are always doing better when translating
1158
+ similar languages directly, compared to pivoting.
1159
+
1160
+ 0:36:35.535 --> 0:36:42.093
1161
+ Yeah, so in this first half what we talked
1162
+ about basically first, we started with how
1163
+
1164
+ 0:36:42.093 --> 0:36:49.719
1165
+ multilinguality or multilingual machine translation
1166
+ could enable knowledge transfer between languages
1167
+
1168
+ 0:36:49.719 --> 0:36:53.990
1169
+ and help with conditions where we don't have
1170
+ much data.
1171
+
1172
+ 0:36:55.235 --> 0:37:02.826
1173
+ Then we looked at three types of multilingual
1174
+ translation, so one is many to one, one to
1175
+
1176
+ 0:37:02.826 --> 0:37:03.350
1177
+ many.
1178
+
1179
+ 0:37:05.285 --> 0:37:13.397
1180
+ We got there first about a shared vocabulary
1181
+ based on different languages and how these
1182
+
1183
+ 0:37:13.397 --> 0:37:22.154
1184
+ cross lingual word embeddings capture semantic
1185
+ meanings rather than just the textual surface form.
1186
+
1187
+ 0:37:25.505 --> 0:37:37.637
1188
+ Then we looked at how to signal the target
1189
+ language, how to ask for the model to generate,
1190
+
1191
+ 0:37:37.637 --> 0:37:43.636
1192
+ and then we looked at zero shot translation.
1193
+
1194
+ 0:37:45.325 --> 0:37:58.187
1195
+ Now before we go into the second half, are
1196
+ there questions about the first? Okay, good.
1197
+
1198
+ 0:38:00.140 --> 0:38:10.932
1199
+ In the second half of this lecture we'll be
1200
+ looking into challenges like what is still
1201
+
1202
+ 0:38:10.932 --> 0:38:12.916
1203
+ unsolved about multilingual translation.
1204
+
1205
+ 0:38:13.113 --> 0:38:18.620
1206
+ There are some aspects to look at it.
1207
+
1208
+ 0:38:18.620 --> 0:38:26.591
1209
+ The first is modeling, the second is more
1210
+ engineering.
1211
+
1212
+ 0:38:28.248 --> 0:38:33.002
1213
+ Okay, so we talked about this question several
1214
+ times.
1215
+
1216
+ 0:38:33.002 --> 0:38:35.644
1217
+ How does multilinguality help?
1218
+
1219
+ 0:38:35.644 --> 0:38:37.405
1220
+ Where does it help?
1221
+
1222
+ 0:38:38.298 --> 0:38:45.416
1223
+ Here want to show results of an experiment
1224
+ based on over a hundred languages.
1225
+
1226
+ 0:38:46.266 --> 0:38:58.603
1227
+ Here you can see the data amount so they use
1228
+ parallel data to English and it's very skewed.
1229
+
1230
+ 0:38:58.999 --> 0:39:00.514
1231
+ This is already log scale.
1232
+
1233
+ 0:39:00.961 --> 0:39:12.982
1234
+ So for higher resource languages like English
1235
+ to French or German to Spanish, you get over a billion
1236
+
1237
+ 0:39:12.982 --> 0:39:14.359
1238
+ sentences.
1239
+
1240
+ 0:39:14.254 --> 0:39:21.003
1241
+ In parallel, and when we go more to the right
1242
+ to the more low resource spectrum on the other
1243
+
1244
+ 0:39:21.003 --> 0:39:26.519
1245
+ hand, there are languages that maybe many of
1246
+ us have never heard of, like:
1247
+
1248
+ 0:39:26.466 --> 0:39:29.589
1249
+ Do You Want to Move Back?
1250
+
1251
+ 0:39:30.570 --> 0:39:33.270
1252
+ Hawaiian Indians have heard of it.
1253
+
1254
+ 0:39:34.414 --> 0:39:39.497
1255
+ So on that spectrum we only have like thirty
1256
+ thousand sentences.
1257
+
1258
+ 0:39:40.400 --> 0:39:48.389
1259
+ So what this means is when we train, we have
1260
+ to up sample these guys.
1261
+
1262
+ 0:39:48.389 --> 0:39:51.585
1263
+ The model didn't even know.
1264
+
1265
+ 0:39:52.732 --> 0:40:05.777
1266
+ Yeah, so on this graph, the way we read it is:
1267
+ this horizontal line and zero is basically
1268
+
1269
+ 0:40:05.777 --> 0:40:07.577
1270
+ indicating the bilingual baseline.
1271
+
1272
+ 0:40:07.747 --> 0:40:14.761
1273
+ Because we want to see where multilinguality
1274
+ helps, we compare to what happens when there
1275
+
1276
+ 0:40:14.761 --> 0:40:15.371
1277
+ is not.
1278
+
1279
+ 0:40:16.356 --> 0:40:29.108
1280
+ So upper like higher than the zero line it
1281
+ means we're gaining.
1282
+
1283
+ 0:40:29.309 --> 0:40:34.154
1284
+ The same like for these languages.
1285
+
1286
+ 0:40:34.154 --> 0:40:40.799
1287
+ This side means we are a high resource for
1288
+ the.
1289
+
1290
+ 0:40:40.981 --> 0:40:46.675
1291
+ Yeah sorry, I think I've somehow removed
1292
+ the x-axis labels.
1293
+
1294
+ 0:40:48.008 --> 0:40:58.502
1295
+ Yeah alright, what happens now if we look
1296
+ at many into English?
1297
+
1298
+ 0:40:58.698 --> 0:41:08.741
1299
+ On the low-resource spectrum, by going multilingual
1300
+ we gain a lot over the bilingual systems.
1301
+
1302
+ 0:41:10.010 --> 0:41:16.658
1303
+ Overall, if you consider the average for all
1304
+ of the languages, it's still a gain.
1305
+
1306
+ 0:41:17.817 --> 0:41:27.301
1307
+ Now we're looking at the green line so you
1308
+ can ignore the blue line.
1309
+
1310
+ 0:41:27.301 --> 0:41:32.249
1311
+ Basically we have to do upsampling.
1312
+
1313
+ 0:41:33.753 --> 0:41:41.188
1314
+ Yeah, so if you just even consider the average,
1315
+ it's still a gain over bilingual.
1316
+
1317
+ 0:41:42.983 --> 0:41:57.821
1318
+ However, if we go to the English to many systems
1319
+ looking at the gains, we only get minor improvements.
1320
+
1321
+ 0:41:59.039 --> 0:42:12.160
1322
+ So why is it the case that going multilingual
1323
+ isn't really helping universally?
1324
+
1325
+ 0:42:16.016 --> 0:42:18.546
1326
+ Do you have some intuitions on yeah?
1327
+
1328
+ 0:42:18.698 --> 0:42:38.257
1329
+ It's easier to understand something than to generate it,
1330
+ if we consider what the model has to generate.
1331
+
1332
+ 0:42:38.718 --> 0:42:40.091
1333
+ I See It Like.
1334
+
1335
+ 0:42:40.460 --> 0:42:49.769
1336
+ Generating is a bit like writing or speaking,
1337
+ while inputing on the source side is more like
1338
+
1339
+ 0:42:49.769 --> 0:42:50.670
1340
+ reading.
1341
+
1342
+ 0:42:50.650 --> 0:42:57.971
1343
+ So one is more passive and the other is more
1344
+ active and don't know if you have similar experience.
1345
+
1346
+ 0:42:57.971 --> 0:43:05.144
1347
+ I think speaking and writing is always a little
1348
+ bit more difficult than just passively listening
1349
+
1350
+ 0:43:05.144 --> 0:43:06.032
1351
+ or reading.
1352
+
1353
+ 0:43:06.032 --> 0:43:09.803
1354
+ But this is a very hand-wavy kind of understanding.
1355
+
1356
+ 0:43:10.390 --> 0:43:11.854
1357
+ And fed.
1358
+
1359
+ 0:43:12.032 --> 0:43:20.309
1360
+ In terms of the model, if we consider what
1361
+ is the difference for the target side for many
1362
+
1363
+ 0:43:20.309 --> 0:43:26.703
1364
+ to English: One difference is that there's
1365
+ a data difference.
1366
+
1367
+ 0:43:27.167 --> 0:43:33.438
1368
+ So if you just consider a many-to-English system
1369
+ with German to English and Spanish to English,.
1370
+
1371
+ 0:43:34.975 --> 0:43:44.321
1372
+ One thing we have to keep in mind is that
1373
+ the parallel data is not all the same, so on
1374
+
1375
+ 0:43:44.321 --> 0:43:49.156
1376
+ the target side there are different English sentences.
1377
+
1378
+ 0:43:49.769 --> 0:43:54.481
1379
+ So the situation rather looks like this.
1380
+
1381
+ 0:43:54.481 --> 0:43:59.193
1382
+ What this means is that by going many-to-English,
1383
+
1384
+ 0:44:00.820 --> 0:44:04.635
1385
+ We also add more data on the target side for
1386
+ English.
1387
+
1388
+ 0:44:06.967 --> 0:44:18.581
1389
+ Now since the target side data is not identical,
1390
+ how do we do a controlled experiment to remove
1391
+
1392
+ 0:44:18.581 --> 0:44:21.121
1393
+ the multilinguality?
1394
+
1395
+ 0:44:24.644 --> 0:44:42.794
1396
+ So what people tried as a control experiment
1397
+ is to keep all the English same as the above
1398
+
1399
+ 0:44:42.794 --> 0:44:44.205
1400
+ setup.
1401
+
1402
+ 0:44:44.684 --> 0:44:49.700
1403
+ So they take the English on English data of
1404
+ the same branch to German.
1405
+
1406
+ 0:44:50.090 --> 0:44:55.533
1407
+ And then they generate synthetic data for German.
1408
+
1409
+ 0:44:55.533 --> 0:45:05.864
1410
+ So now we have a bilingual system again, but
1411
+ on the target side we still have the previously
1412
+
1413
+ 0:45:05.864 --> 0:45:08.419
1414
+ enriched English data.
1415
+
1416
+ 0:45:10.290 --> 0:45:25.092
1417
+ Now back to this picture that we've seen before,
1418
+ this mysterious orange line here is basically
1419
+
1420
+ 0:45:25.092 --> 0:45:26.962
1421
+ the result.
1422
+
1423
+ 0:45:27.907 --> 0:45:36.594
1424
+ And somewhat strikingly, and perhaps sadly for
1425
+ believers of multilinguality.
1426
+
1427
+ 0:45:36.594 --> 0:45:39.176
1428
+ This is also gaining.
1429
+
1430
+ 0:45:41.001 --> 0:45:52.775
1431
+ So what this means is that many-to-English
1432
+ is gaining not really because of multilinguality
1433
+
1434
+ 0:45:52.775 --> 0:45:55.463
1435
+ but just because of the added target-side data.
1436
+
1437
+ 0:45:55.976 --> 0:46:10.650
1438
+ And this means that there is still quite a
1439
+ lot to do if we really want to gain from just
1440
+
1441
+ 0:46:10.650 --> 0:46:13.618
1442
+ shared knowledge.
1443
+
1444
+ 0:46:14.514 --> 0:46:27.599
1445
+ But this also gives hope because there are
1446
+ still many things to research in this area
1447
+
1448
+ 0:46:27.599 --> 0:46:28.360
1449
+ now.
1450
+
1451
+ 0:46:28.708 --> 0:46:40.984
1452
+ So we've seen adding more languages helps
1453
+ somewhat as a data side effect; but can it also hurt?
1454
+
1455
+ 0:46:40.984 --> 0:46:45.621
1456
+ So if we just add more languages.
1457
+
1458
+ 0:46:47.007 --> 0:46:48.408
1459
+ We've seen this.
1460
+
1461
+ 0:46:48.408 --> 0:46:52.694
1462
+ This is the picture for the many-to-English
1463
+ system.
1464
+
1465
+ 0:46:53.793 --> 0:47:09.328
1466
+ Comparing to this bilingual baseline, we see
1467
+ that for these high resource languages we are
1468
+
1469
+ 0:47:09.328 --> 0:47:12.743
1470
+ not doing as great.
1471
+
1472
+ 0:47:15.956 --> 0:47:18.664
1473
+ So why are we losing here?
1474
+
1475
+ 0:47:18.664 --> 0:47:25.285
1476
+ It's been shown that this performance loss
1477
+ is somewhat related to capacity.
1478
+
1479
+ 0:47:26.026 --> 0:47:37.373
1480
+ In the sense that the model has to learn so
1481
+ much that at some point it has to sacrifice
1482
+
1483
+ 0:47:37.373 --> 0:47:39.308
1484
+ capacity somewhere.
1485
+
1486
+ 0:47:41.001 --> 0:47:57.081
1487
+ So what to do to basically grow a bigger brain
1488
+ to tackle this is to add some dedicated capacity
1489
+
1490
+ 0:47:57.081 --> 0:47:59.426
1491
+ per language.
1492
+
1493
+ 0:48:00.100 --> 0:48:15.600
1494
+ Here it's like a simplified graph of a transformer
1495
+ architecture, so this is the encoder within
1496
+
1497
+ 0:48:15.600 --> 0:48:16.579
1498
+ time.
1499
+
1500
+ 0:48:17.357 --> 0:48:27.108
1501
+ But additionally here these little colored
1502
+ blocks are now the language-specific bits
1503
+
1504
+ 0:48:27.108 --> 0:48:28.516
1505
+ of capacity.
1506
+
1507
+ 0:48:29.169 --> 0:48:42.504
1508
+ They are language-specific in the sense that
1509
+ if you get Chinese to English, only the matching parts are used.
1510
+
1511
+ 0:48:43.103 --> 0:48:54.900
1512
+ We are also going through language-specific parts
1513
+ that in this case consist of a down projection and an up projection.
1514
+
1515
+ 0:48:56.416 --> 0:49:07.177
1516
+ So this is also called adaptors, something
1517
+ that is plugged into an existing model and
1518
+
1519
+ 0:49:07.177 --> 0:49:11.556
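A small PyTorch sketch of such an adapter, assuming the usual bottleneck design (down-projection, non-linearity, up-projection, residual); one block per language, conditionally selected by the input:

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter plugged into a frozen model: down-project, apply a
    # non-linearity, up-project, and add the result back onto the hidden states.
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

# One adapter per language; only the matching one is activated for an input.
adapters = nn.ModuleDict({lang: Adapter() for lang in ["de", "zh", "fr"]})
h = torch.randn(2, 10, 512)     # hidden states from a frozen layer
print(adapters["zh"](h).shape)  # torch.Size([2, 10, 512])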
1520
+ it adapts towards a specific task.
1521
+
1522
+ 0:49:12.232 --> 0:49:22.593
1523
+ And this is conditionally activated in the
1524
+ sense that if you get a different input sentence.
1525
+
1526
+ 0:49:27.307 --> 0:49:34.173
1527
+ So this was first proposed by some folks
1528
+ at Google.
1529
+
1530
+ 0:49:34.173 --> 0:49:36.690
1531
+ Does this scale well?
1532
+
1533
+ 0:49:39.619 --> 0:49:56.621
1534
+ Yes exactly, so this is a per-translation-direction
1535
+ kind of adapter, and this is not going to scale
1536
+
1537
+ 0:49:56.621 --> 0:49:57.672
1538
+ well.
1539
+
1540
+ 0:49:58.959 --> 0:50:13.676
1541
+ So this also brought people to try some more
1542
+ simple architecture.
1543
+
1544
+ 0:50:16.196 --> 0:50:22.788
1545
+ Yeah, this is also an alternative, in this
1546
+ case called monolingual adapters.
1547
+
1548
+ 0:50:24.184 --> 0:50:32.097
1549
+ Any of these adapters so again have this low
1550
+ resource.
1551
+
1552
+ 0:50:32.097 --> 0:50:42.025
1553
+ The zero line is bilingual baseline, but the
1554
+ lines are interpolated.
1555
+
1556
+ 0:50:43.783 --> 0:50:48.767
1557
+ The red one is the original
1558
+ multilingual model.
1559
+
1560
+ 0:50:49.929 --> 0:50:57.582
1561
+ And if we put the adapters in like a basic
1562
+ vanilla adapter, that corresponds to the blue line.
1563
+
1564
+ 0:50:58.078 --> 0:51:08.582
1565
+ You see that it is gaining performance for the
1566
+ high resource languages.
1567
+
1568
+ 0:51:08.582 --> 0:51:16.086
1569
+ If we scale the adapters up even more, this further increases.
1570
+
1571
+ 0:51:16.556 --> 0:51:22.770
1572
+ So this is also a side kind of this.
1573
+
1574
+ 0:51:23.103 --> 0:51:27.807
1575
+ From the side it shows that it's really a capacity
1576
+ bottleneck.
1577
+
1578
+ 0:51:28.488 --> 0:51:30.590
1579
+ Like If You Eleanor.
1580
+
1581
+ 0:51:31.151 --> 0:51:34.313
1582
+ Resource they regain their performance.
1583
+
1584
+ 0:51:38.959 --> 0:51:50.514
1585
+ For smaller languages, but it's just.
1586
+
1587
+ 0:51:50.770 --> 0:52:03.258
1588
+ I think in the original model, the smaller
1589
+ languages they weren't constrained by capacity.
1590
+
1591
+ 0:52:05.445 --> 0:52:13.412
1592
+ So guess for the smaller languages, the difficulty
1593
+ is more the data rather than the model capacity.
1594
+
1595
+ 0:52:13.573 --> 0:52:26.597
1596
+ So in general you always want to have more
1597
+ or less data matching your model capacity.
1598
+
1599
+ 0:52:27.647 --> 0:52:33.255
1600
+ Yeah, here I think the bigger challenge for
1601
+ the low-resource languages was the data.
1602
+
1603
+ 0:52:34.874 --> 0:52:39.397
1604
+ You also mention it a little bit.
1605
+
1606
+ 0:52:39.397 --> 0:52:46.979
1607
+ Are these adapters per language or how many
1608
+ adapters do?
1609
+
1610
+ 0:52:47.267 --> 0:52:55.378
1611
+ And do we have to design them differently
1612
+ so that we learn to share more like a language
1613
+
1614
+ 0:52:55.378 --> 0:52:56.107
1615
+ family?
1616
+
1617
+ 0:52:56.576 --> 0:53:15.680
1618
+ So one downside of the adaptor we talked about
1619
+ is that basically there is no sharing across languages.
1620
+
1621
+ 0:53:16.516 --> 0:53:31.391
1622
+ So then a recent kind of additional approach
1623
+ for these language specific capacity is so
1624
+
1625
+ 0:53:31.391 --> 0:53:36.124
1626
+ called routing, or learning where to route.
1627
+
1628
+ 0:53:36.256 --> 0:53:42.438
1629
+ Basically, we have these language specific
1630
+ components.
1631
+
1632
+ 0:53:42.438 --> 0:53:45.923
1633
+ We also have a shared adapter.
1634
+
1635
+ 0:53:45.923 --> 0:53:52.574
1636
+ The model should learn: So in this case maybe
1637
+ we could imagine for the lower resource case
1638
+
1639
+ 0:53:52.574 --> 0:53:54.027
1640
+ that we just talked about.
1641
+
1642
+ 0:53:54.094 --> 0:54:04.838
1643
+ It makes sense to go to the shared one because there's not much
1644
+ that is language-specific anyway, and then it's
1645
+
1646
+ 0:54:04.838 --> 0:54:10.270
1647
+ better to make use of the similarity with other languages.
1648
+
1649
+ 0:54:11.111 --> 0:54:30.493
1650
+ So this architecture is more data driven instead
1651
+ of what we specify prior to training.
1652
+
1653
+ 0:54:31.871 --> 0:54:33.998
1654
+ So how do we learn this?
1655
+
1656
+ 0:54:35.095 --> 0:54:49.286
1657
+ Basically, in terms of the mask, we want to
1658
+ basically have a binary value that routes either
1659
+
1660
+ 0:54:49.286 --> 0:54:50.548
1661
+ to the language-specific or the shared component.
1662
+
1663
+ 0:54:51.311 --> 0:54:56.501
1664
+ But how do we get a value of zero or one?
1665
+ I mean, we can...
1666
+
1667
+ 0:54:56.501 --> 0:54:58.498
1668
+ We can do a sigmoid.
1669
+
1670
+ 0:54:58.999 --> 0:55:13.376
1671
+ However, one thing is we don't want to get
1672
+ stuck in the middle, so we don't want values in between.
1673
+
1674
+ 0:55:14.434 --> 0:55:28.830
1675
+ It is also bad because it is not going to
1676
+ be the same training and test time by the way.
1677
+
1678
+ 0:55:31.151 --> 0:55:50.483
1679
+ So here the question is how do we force basically
1680
+ the model to always go there prior to activation?
1681
+
1682
+ 0:55:54.894 --> 0:56:02.463
1683
+ Found it interesting because it sounds like
1684
+ a trick for me.
1685
+
1686
+ 0:56:02.463 --> 0:56:05.491
1687
+ This approach has been.
1688
+
1689
+ 0:56:06.026 --> 0:56:15.844
1690
+ So what they do is prior to going through
1691
+ this activation, they add some Gaussian noise.
1692
+
1693
+ 0:56:17.257 --> 0:56:31.610
1694
+ If there is always noise prior to activation
1695
+ then the model will be encouraged to push the pre-activations far from zero to preserve
1696
+
1697
+ 0:56:31.610 --> 0:56:34.291
1698
+ the information, so the gate saturates towards zero or one.
1699
+
1700
+ 0:56:36.356 --> 0:56:44.067
1701
+ Was a very interesting thing that found out
1702
+ while preparing this, so wanted to share this
1703
+
1704
+ 0:56:44.067 --> 0:56:44.410
1705
+ as.
1706
+
1707
+ 0:56:44.544 --> 0:56:48.937
1708
+ So basically you can create a binary gate
1709
+ with this technique.
1710
+
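A sketch of that trick with made-up scales: Gaussian noise is added before the sigmoid during training, which pushes the model to keep the pre-activations far from zero, so the gate saturates towards 0 or 1 and can simply be thresholded at test time:

import torch

def binary_gate(logits, training=True, noise_std=1.0):
    # logits decide between the language-specific and the shared path.
    if training:
        # Noise before the activation: only large-magnitude logits survive it,
        # so the learned gate is driven towards clean 0/1 decisions.
        return torch.sigmoid(logits + noise_std * torch.randn_like(logits))
    # Test time: deterministic hard decision, matching the saturated training gate.
    return (logits > 0).float()

scores = torch.tensor([-4.0, 0.1, 5.0])
print(binary_gate(scores))                  # noisy, mostly near 0 or 1
print(binary_gate(scores, training=False))  # tensor([0., 1., 1.])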
1711
+ 0:56:50.390 --> 0:57:01.668
1712
+ And if you add these language specific routing:
1713
+ Here they also have a parameter that can control how
1714
+
1715
+ 0:57:01.668 --> 0:57:07.790
1716
+ much is shared and how much is language specific.
1717
+
1718
+ 0:57:07.727 --> 0:57:16.374
1719
+ Here the seals are the is the routing with
1720
+ the red and orange lines, so.
1721
+
1722
+ 0:57:16.576 --> 0:57:22.752
1723
+ So you can see that for one-to-many and many-
1724
+ to-one, in both cases there are quite some gains.
1725
+
1726
+ 0:57:23.063 --> 0:57:30.717
1727
+ So that is the overall picture and just find
1728
+ the idea of the routing quite interesting.
1729
+
1730
+ 0:57:30.991 --> 0:57:32.363
1731
+ And UM.
1732
+
1733
+ 0:57:32.212 --> 0:57:38.348
1734
+ It's also getting a bit more increasingly
1735
+ used as there are the so called mixture of
1736
+
1737
+ 0:57:38.348 --> 0:57:39.431
1738
+ expert models.
1739
+
1740
+ 0:57:39.499 --> 0:57:51.801
1741
+ The model learns where to route the input
1742
+ so they are all conditionally activated when
1743
+
1744
+ 0:57:51.801 --> 0:57:53.074
1745
+ you are.
1746
+
1747
+ 0:57:53.213 --> 0:57:59.089
1748
+ But this is not really something specific
1749
+ to multilinguality, so I won't talk too much
1750
+
1751
+ 0:57:59.089 --> 0:57:59.567
1752
+ about.
1753
+
1754
+ 0:58:00.620 --> 0:58:02.115
1755
+ No.
1756
+
1757
+ 0:58:01.761 --> 0:58:09.640
1758
+ The takeaway from this part is first that we talked about
1759
+ the existence of the capacity bottleneck.
1760
+
1761
+ 0:58:10.570 --> 0:58:19.808
1762
+ Where we can partly compensate by adapters
1763
+ or adding language specific capacity, there's
1764
+
1765
+ 0:58:19.808 --> 0:58:23.026
1766
+ the idea of negative transfer.
1767
+
1768
+ 0:58:24.844 --> 0:58:35.915
1769
+ When we add any additional capacity, how can
1770
+ we improve the knowledge sharing?
1771
+
1772
+ 0:58:38.318 --> 0:58:46.662
1773
+ Also, for those one-to-many directions that
1774
+ seem to be hopeless for multilinguality, can
1775
+
1776
+ 0:58:46.662 --> 0:58:47.881
1777
+ we actually do better?
1778
+
1779
+ 0:58:49.129 --> 0:58:52.171
1780
+ Yeah, these are all open things still in the
1781
+ area.
1782
+
1783
+ 0:58:53.673 --> 0:59:04.030
1784
+ Now next part, I'm going to talk about some
1785
+ data challenges for multilingual models.
1786
+
1787
+ 0:59:04.030 --> 0:59:07.662
1788
+ We talked about multilingual models.
1789
+
1790
+ 0:59:08.488 --> 0:59:14.967
1791
+ But there are these lower resource languages
1792
+ that don't have well curated parallel data.
1793
+
1794
+ 0:59:16.216 --> 0:59:27.539
1795
+ As an alternative, people resort to crawled data
1796
+ from the Internet, there's a lot of noise.
1797
+
1798
+ 0:59:27.927 --> 0:59:36.244
1799
+ And in this paper last year they did some
1800
+ manual analyses of several popular crawled data
1801
+
1802
+ 0:59:36.244 --> 0:59:36.811
1803
+ sets.
1804
+
1805
+ 0:59:37.437 --> 0:59:55.262
1806
+ And you'll see that there are a lot of wrong
1807
+ translations, non-linguistic contents, pornographic
1808
+
1809
+ 0:59:55.262 --> 0:59:57.100
1810
+ contents.
1811
+
1812
+ 0:59:57.777 --> 1:00:04.661
1813
+ So as you can imagine, they say you are what you eat.
1814
+
1815
+ 1:00:04.661 --> 1:00:20.116
1816
+ If you use this kind of data to train a model,
1817
+ you get similar quality out. So there are also many techniques
1818
+
1819
+ 1:00:20.116 --> 1:00:28.819
1820
+ for filtering these noisy data
1821
+ sets.
1822
+
1823
+ 1:00:29.809 --> 1:00:36.982
1824
+ So to filter these out we can use an additional
1825
+ classifier that is basically trained to classify
1826
+
1827
+ 1:00:36.982 --> 1:00:43.496
1828
+ which language the sentences are in, and then kick out
1829
+ all the sentences with the wrong language.
1830
+
1831
+ 1:00:45.105 --> 1:00:49.331
1832
+ Another thing is the length ratio.
1833
+
1834
+ 1:00:49.331 --> 1:01:00.200
1835
+ Basically, the assumption there is that if
1836
+ two sentences are translations of each other, their lengths should not differ too much.
1837
+
1838
+ 1:01:01.901 --> 1:01:08.718
1839
+ So often people use maybe a ratio of three
1840
+ and then it eliminates the rest.
1841
+
1842
+ 1:01:09.909 --> 1:01:20.187
1843
+ Also, the other idea maybe similar to the
1844
+ language classifier is basically to have an
1845
+
1846
+ 1:01:20.187 --> 1:01:24.540
1847
+ allowed character set per language.
1848
+
1849
+ 1:01:24.540 --> 1:01:28.289
1850
+ So if you're trying to filter English and you see,
1851
+
1852
+ 1:01:28.568 --> 1:01:34.622
1853
+ I don't know, Cyrillic scripts or Arabic scripts,
1854
+ then it's maybe a good idea to remove them.
1855
+
1856
+ 1:01:35.775 --> 1:01:43.123
1857
+ This is not all there are many other ideas
1858
+ using some pre-trained neural networks to compare
1859
+
1860
+ 1:01:43.123 --> 1:01:50.629
1861
+ the representations, but just to give you an
1862
+ idea of what the basic filtering techniques are.
1863
+
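A sketch of the basic filters just listed, with made-up thresholds and character patterns; the language-ID step is only indicated, since it would call an external classifier:

import re

MAX_LEN_RATIO = 3.0
ALLOWED = {  # hypothetical allowed-character patterns per language
    "en": re.compile(r"^[A-Za-z0-9\s.,;:!?'\"()-]+$"),
    "de": re.compile(r"^[A-Za-zÄÖÜäöüß0-9\s.,;:!?'\"()-]+$"),
}

def keep_pair(src, tgt, src_lang="de", tgt_lang="en"):
    # 1) Length ratio: translations should not differ in length too much.
    ratio = max(len(src), len(tgt)) / max(1, min(len(src), len(tgt)))
    if ratio > MAX_LEN_RATIO:
        return False
    # 2) Character set: drop sentences containing unexpected scripts.
    if not (ALLOWED[src_lang].match(src) and ALLOWED[tgt_lang].match(tgt)):
        return False
    # 3) A language-ID classifier would be applied here as well (omitted).
    return True

print(keep_pair("Das ist ein Haus.", "This is a house."))  # True
print(keep_pair("Das ist ein Haus.", "Это дом"))           # False (wrong script)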
1864
+ 1:01:50.991 --> 1:01:53.458
1865
+ This is quite important.
1866
+
1867
+ 1:01:53.458 --> 1:02:02.465
1868
+ We have seen in our experience that if you
1869
+ do this thoroughly there is quite some gain.
1870
+
1871
+ 1:02:03.883 --> 1:02:17.814
1872
+ So after all, even if we do web crawling,
1873
+ there is still a bit of data scarcity problem.
1874
+
1875
+ 1:02:18.118 --> 1:02:30.760
1876
+ So there are many bad things that can happen
1877
+ when there's too little training data.
1878
+
1879
+ 1:02:30.760 --> 1:02:35.425
1880
+ The first is low performances.
1881
+
1882
+ 1:02:35.735 --> 1:02:55.562
1883
+ So they did it on many English system index
1884
+ languages, all together with here means: So
1885
+
1886
+ 1:02:55.562 --> 1:03:04.079
1887
+ we really need to get that area of a lot of
1888
+ data in order to get that ideal performance.
1889
+
1890
+ 1:03:04.884 --> 1:03:20.639
1891
+ There are also many horrible things that can
1892
+ happen in general when you train a model on so little data, for example across
1893
+
1894
+ 1:03:20.639 --> 1:03:24.874
1895
+ different training runs.
1896
+
1897
+ 1:03:26.946 --> 1:03:36.733
1898
+ So one solution to tackle this problem, the
1899
+ data scarcity problem, is by fine tuning some
1900
+
1901
+ 1:03:36.733 --> 1:03:38.146
1902
+ pre-trained model.
1903
+
1904
+ 1:03:38.979 --> 1:03:46.245
1905
+ And basically the idea is you've got the pre-trained
1906
+ model that can already do translation.
1907
+
1908
+ 1:03:46.846 --> 1:03:54.214
1909
+ Then you fine-tune it on your own training data
1910
+ and you end up with a more specialized model.
1911
+
1912
+ 1:03:55.155 --> 1:03:59.369
1913
+ So why does pretraining help?
1914
+
1915
+ 1:03:59.369 --> 1:04:11.448
1916
+ One argument is that if you do pretraining
1917
+ then the model has seen far more data and
1918
+
1919
+ 1:04:11.448 --> 1:04:12.713
1920
+ learned.
1921
+
1922
+ 1:04:13.313 --> 1:04:19.135
1923
+ Say more generalizable representations that
1924
+ can help more downstream tasks.
1925
+
1926
+ 1:04:19.719 --> 1:04:28.063
1927
+ So in this case we are basically trying to
1928
+ make use of the more meaningful and generalizable
1929
+
1930
+ 1:04:28.063 --> 1:04:29.499
1931
+ representation.
1932
+
1933
+ 1:04:30.490 --> 1:04:45.103
1934
+ So for machine translation there are several
1935
+ open source models out there that can handle
1936
+
1937
+ 1:04:45.103 --> 1:04:46.889
1938
+ languages.
1939
+
1940
+ 1:04:48.188 --> 1:04:49.912
1941
+ Two hundred model.
1942
+
1943
+ 1:04:49.912 --> 1:04:53.452
1944
+ They also cover two hundred languages.
1945
+
1946
+ 1:04:53.452 --> 1:04:57.628
1947
+ That means that's quite a lot of translation.
1948
+
1949
+ 1:04:57.978 --> 1:05:06.218
1950
+ However, one thing to remember is that these
1951
+ models are more like a, how do you call it, a
1952
+
1953
+ 1:05:06.146 --> 1:05:12.812
1954
+ jack of all trades, master of none, in the
1955
+ sense that they are very good at coverage,
1956
+
1957
+ 1:05:12.812 --> 1:05:20.498
1958
+ but if you look at specific translation directions
1959
+ they might be not as good as dedicated models.
1960
+
1961
+ 1:05:21.521 --> 1:05:34.170
1962
+ So here I'm going to have some results by
1963
+ comparing random initialization versus the
1964
+
1965
+ 1:05:34.170 --> 1:05:36.104
1966
+ first thing.
1967
+
1968
+ 1:05:36.396 --> 1:05:46.420
1969
+ The third line is the result of basically
1970
+ fine-tuning a pre-trained model that is one of this
1971
+
1972
+ 1:05:46.420 --> 1:05:47.342
1973
+ family.
1974
+
1975
+ 1:05:47.947 --> 1:05:51.822
1976
+ So in this case you could see the.
1977
+
1978
+ 1:05:51.831 --> 1:05:58.374
1979
+ If we just look at the second line, that is
1980
+ the pre-trained model out of the box, you see
1981
+
1982
+ 1:05:58.374 --> 1:06:04.842
1983
+ that if we just use it out of the box, the
1984
+ performance everywhere isn't super great as
1985
+
1986
+ 1:06:04.842 --> 1:06:06.180
1987
+ dedicated models.
1988
+
1989
+ 1:06:07.867 --> 1:06:21.167
1990
+ But then here, the X means English:
1991
+ So the first takeaway here is that if we do
1992
+
1993
+ 1:06:21.167 --> 1:06:31.560
1994
+ pre-training and fine-tuning, we gain again when we go into
1995
+ English,.
1996
+
1997
+ 1:06:33.433 --> 1:06:40.438
1998
+ The downside here is that we are forgetting.
1999
+
2000
+ 1:06:40.438 --> 1:06:50.509
2001
+ When we do further training there is no data for the other directions.
2002
+
2003
+ 1:06:50.770 --> 1:07:04.865
2004
+ So even if we initialize with the pre-trained model
2005
+ and continue training, the translation directions we don't see will degrade.
2006
+
2007
+ 1:07:05.345 --> 1:07:13.826
2008
+ So this is bad machine learning people termed
2009
+ it catastrophic forgetting, in the sense that
2010
+
2011
+ 1:07:13.826 --> 1:07:20.115
2012
+ if you have a model that is trained to do some
2013
+ task and then you train it on another, it forgets the first.
2014
+
2015
+ 1:07:20.860 --> 1:07:22.487
2016
+ This Is Also Pretty Bad.
2017
+
2018
+ 1:07:24.244 --> 1:07:32.341
2019
+ Is especially bad if you consider training
2020
+ data actually grows over time.
2021
+
2022
+ 1:07:32.341 --> 1:07:35.404
2023
+ It's not like you have one fixed data set.
2024
+
2025
+ 1:07:36.336 --> 1:07:46.756
2026
+ So in practice we do not always train systems
2027
+ from scratch, so it's more like you have an
2028
+
2029
+ 1:07:46.756 --> 1:07:54.951
2030
+ existing system and later we want to expand
2031
+ the translation coverage.
2032
+
2033
+ 1:07:57.277 --> 1:08:08.932
2034
+ Here and the key question is how do we continue
2035
+ training from an existing system in doing so?
2036
+
2037
+ 1:08:09.909 --> 1:08:12.288
2038
+ Approaches.
2039
+
2040
+ 1:08:12.288 --> 1:08:27.945
2041
+ One very simple one is to include a portion
2042
+ of your previous training data, so that the model does not forget it.
2043
+
2044
+ 1:08:28.148 --> 1:08:34.333
2045
+ So if you consider you have an English German
2046
+ system and now you want to expand it to English
2047
+
2048
+ 1:08:34.333 --> 1:08:34.919
2049
+ French,.
2050
+
2051
+ 1:08:36.036 --> 1:08:42.308
2052
+ Like so nice going English, French and English
2053
+ German, so when you train it you still include
2054
+
2055
+ 1:08:42.308 --> 1:08:45.578
2056
+ a small proportion of your previous German
2057
+ data.
2058
+
2059
+ 1:08:45.578 --> 1:08:51.117
2060
+ Hopefully your model is not forgetting that
2061
+ much about the previously lent German.
2062
+
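A tiny sketch of that replay recipe: when fine-tuning the existing English-German system on new English-French data, a small fraction of the old German data is mixed back in; the 10% ratio and the toy pairs are placeholders:

import random

def build_finetuning_data(new_en_fr, old_en_de, replay_fraction=0.1):
    # Mix a small portion of the previously seen data into the new training set
    # so the model does not completely forget the old direction.
    n_replay = min(int(replay_fraction * len(new_en_fr)), len(old_en_de))
    mixed = new_en_fr + random.sample(old_en_de, n_replay)
    random.shuffle(mixed)
    return mixed

old = [("<2de> hello", "hallo")] * 1000
new = [("<2fr> hello", "bonjour")] * 5000
print(len(build_finetuning_data(new, old)))  # 5500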
2063
+ 1:08:53.073 --> 1:08:58.876
2064
+ Idea here is what we saw earlier.
2065
+
2066
+ 1:08:58.876 --> 1:09:09.800
2067
+ We can also add adaptors and only train them
2068
+ while keeping the rest of the model frozen.
2069
+
2070
+ 1:09:10.170 --> 1:09:26.860
2071
+ So this means we're going to end up with a
2072
+ generic model that was not anyhow changed.
2073
+
2074
+ 1:09:27.447 --> 1:09:37.972
2075
+ So in this way it's also more modular and more
2076
+ suitable to the incremental learning kind of setup.
2077
+
2078
+ 1:09:38.758 --> 1:09:49.666
2079
+ Right, in this part the takeaways, I guess, are
2080
+ first data filtering.
2081
+
2082
+ 1:09:49.666 --> 1:09:55.120
2083
+ as Internet data is very noisy.
2084
+
2085
+ 1:09:56.496 --> 1:10:05.061
2086
+ Second, it's about fine-tuning pre-trained models
2087
+ and how we can or cannot avoid catastrophic
2088
+
2089
+ 1:10:05.061 --> 1:10:06.179
2090
+ forgetting.
2091
+
2092
+ 1:10:07.247 --> 1:10:15.866
2093
+ And of course open questions would include
2094
+ how can we do incremental learning with these
2095
+
2096
+ 1:10:15.866 --> 1:10:19.836
2097
+ multilingual machine translation models?
2098
+
2099
+ 1:10:20.860 --> 1:10:31.840
2100
+ So with this in mind would like to briefly
2101
+ cover several engineering challenges when we
2102
+
2103
+ 1:10:31.840 --> 1:10:43.031
2104
+ talk about: Yeah, earlier we also briefly talked
2105
+ about how going multilingual means sometimes you have
2106
+
2107
+ 1:10:43.031 --> 1:10:51.384
2108
+ to scale up, you have to make your models bigger
2109
+ just to have that capacity to deal with.
2110
+
2111
+ 1:10:52.472 --> 1:10:59.262
2112
+ This means the model sizes are getting bigger
2113
+ and sometimes having one single GPU is not enough
2114
+
2115
+ 1:10:59.262 --> 1:11:00.073
2116
+ to handle.
2117
+
2118
+ 1:11:00.400 --> 1:11:08.914
2119
+ Here I wanted to introduce ideas of going parallel
2120
+ and scaling up.
2121
+
2122
+ 1:11:08.914 --> 1:11:12.843
2123
+ The first is so called model.
2124
+
2125
+ 1:11:14.434 --> 1:11:18.859
2126
+ Don't know if you also had this in other like
2127
+ maury cue related courses.
2128
+
2129
+ 1:11:20.220 --> 1:11:30.639
2130
+ Okay, so the idea of data parallel is basically
2131
+ we train in parallel.
2132
+
2133
+ 1:11:30.790 --> 1:11:35.852
2134
+ We put our model onto several GPUs.
2135
+
2136
+ 1:11:35.852 --> 1:11:47.131
2137
+ We send the same model there and then when
2138
+ we get the training data we split.
2139
+
2140
+ 1:11:48.108 --> 1:11:54.594
2141
+ So on each of these GPUs we are doing the
2142
+ forward and backward pass in parallel.
2143
+
2144
+ 1:11:55.355 --> 1:12:07.779
2145
+ Then after we get the gradients, all these GPUs
2146
+ will be synchronized and the gradients will
2147
+
2148
+ 1:12:07.779 --> 1:12:09.783
2149
+ be aggregated.
2150
+
2151
+ 1:12:11.691 --> 1:12:27.127
2152
+ We are having a bigger batch size in effect,
2153
+ so this would be much faster than, for example,
2154
+
2155
+ 1:12:27.127 --> 1:12:31.277
2156
+ doing all these smaller batches one after another.
2157
+
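A minimal sketch of that data-parallel step (conceptual only; in practice one would use a library wrapper such as torch.nn.parallel.DistributedDataParallel instead of averaging gradients by hand, and the device list is a placeholder):

import copy
import torch

def data_parallel_step(model, shards, loss_fn, devices):
    # One step: replicate the model, give each replica its own shard of the
    # batch, run forward/backward on each, then average the gradients back
    # onto the main copy (the "synchronize and aggregate" part).
    replicas = [copy.deepcopy(model).to(d) for d in devices]
    for replica, (x, y), d in zip(replicas, shards, devices):
        loss_fn(replica(x.to(d)), y.to(d)).backward()
    for params in zip(model.parameters(), *(r.parameters() for r in replicas)):
        main, rest = params[0], params[1:]
        main.grad = torch.stack([p.grad.to(main.device) for p in rest]).mean(dim=0)
    # afterwards a single optimizer.step() updates the main model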
2158
+ 1:12:32.772 --> 1:12:45.252
2159
+ That is, if your model itself is too big to
2160
+ fit onto a single GPU, so you cannot do
2161
+
2162
+ 1:12:45.252 --> 1:12:46.084
2163
+ this data split.
2164
+
2165
+ 1:12:46.486 --> 1:12:51.958
2166
+ And honestly, the model itself, unless you're
2167
+ going for those
2168
+
2169
+ 1:12:51.891 --> 1:12:55.500
2170
+ huge models the industry makes these days,
2171
+
2172
+ 1:12:55.500 --> 1:13:03.233
2173
+ I've never run into a situation where the
2174
+ single model itself does not fit onto one GPU
2175
+
2176
+ 1:13:03.233 --> 1:13:03.748
2177
+ here.
2178
+
2179
+ 1:13:03.748 --> 1:13:08.474
2180
+ Realistically, it's more the what is memory
2181
+ consuming.
2182
+
2183
+ 1:13:08.528 --> 1:13:14.871
2184
+ It is more of the backward pass and the optimizer
2185
+ states that need to be stored.
2186
+
2187
+ 1:13:15.555 --> 1:13:22.193
2188
+ So but still there are people training gigantic
2189
+ models where they have to go model parallel.
2190
+
2191
+ 1:13:22.602 --> 1:13:35.955
2192
+ This means you have a model consisting of
2193
+ all those orange parts, but it doesn't fit, so you
2194
+
2195
+ 1:13:35.955 --> 1:13:40.714
2196
+ split off the next several layers onto another device.
2197
+
2198
+ 1:13:41.581 --> 1:13:51.787
2199
+ So this means when you do the forward pass
2200
+ you have to wait for one part to finish before doing the next.
2201
+
2202
+ 1:13:52.532 --> 1:14:11.193
2203
+ And this kind of implementation is sometimes
2204
+ a bit architecture-specific.
2205
+
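A naive sketch of such layer-wise model parallelism (illustrative; the device names are placeholders and real implementations pipeline this to avoid idle GPUs):

import torch.nn as nn

class TwoDeviceModel(nn.Module):
    # First half of the layers lives on one device, second half on another.
    def __init__(self, layers, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        half = len(layers) // 2
        self.part0 = nn.Sequential(*layers[:half]).to(dev0)
        self.part1 = nn.Sequential(*layers[half:]).to(dev1)
        self.dev0, self.dev1 = dev0, dev1

    def forward(self, x):
        h = self.part0(x.to(self.dev0))
        # the second device has to wait for the first one to finish
        return self.part1(h.to(self.dev1))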
2206
+ 1:14:12.172 --> 1:14:17.177
2207
+ Right, so there's one more thing when scaling
2208
+ up.
2209
+
2210
+ 1:14:17.177 --> 1:14:19.179
2211
+ that I wanted to mention.
2212
+
2213
+ 1:14:20.080 --> 1:14:25.687
2214
+ We also talked about it briefly earlier.
2215
+
2216
+ 1:14:25.687 --> 1:14:34.030
2217
+ We said that when we go multilingual we need
2218
+ a vocabulary that covers all the languages.
2219
+
2220
+ 1:14:34.614 --> 1:14:40.867
2221
+ And can give you some numbers.
2222
+
2223
+ 1:14:40.867 --> 1:14:53.575
2224
+ Most of the pre-trained multilingual models here
2225
+ use a very large vocabulary.
2226
+
2227
+ 1:14:53.933 --> 1:14:58.454
2228
+ Normally each vector is high-dimensional.
2229
+
2230
+ 1:14:58.454 --> 1:15:10.751
2231
+ This means just the word embedding table alone
2232
+ is vocabulary size times embedding dimension parameters.
2233
+
2234
+ 1:15:11.011 --> 1:15:18.620
2235
+ This means just for the embedding table alone
2236
+ it's already taking many millions of parameters of the model.
2237
+
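The concrete numbers did not survive in the transcript, but the arithmetic is simply vocabulary size times embedding dimension; with illustrative values of my own choosing:

vocab_size = 250_000     # assumed size of a large shared multilingual vocabulary
embedding_dim = 1_024    # assumed embedding width

embedding_params = vocab_size * embedding_dim
print(f"{embedding_params:,}")   # 256,000,000, i.e. roughly 256M parameters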
2238
+ 1:15:19.859 --> 1:15:28.187
2239
+ And this is often one of the largest parts
2240
+ of the machine translation model.
2241
+
2242
+ 1:15:28.187 --> 1:15:31.292
2243
+ This also comes with.
2244
+
2245
+ 1:15:31.651 --> 1:15:43.891
2246
+ So one question is how can we efficiently
2247
+ represent a multilingual vocabulary?
2248
+
2249
+ 1:15:43.891 --> 1:15:49.003
2250
+ Are there better ways than just subwords?
2251
+
2252
+ 1:15:50.750 --> 1:16:00.526
2253
+ There are many ideas out there that people tried, maybe
2254
+ not all targeted at multilingual models, but I think they are relevant.
2255
+
2256
+ 1:16:00.840 --> 1:16:03.635
2257
+ So one is byte-level representation.
2258
+
2259
+ 1:16:03.743 --> 1:16:11.973
2260
+ So the idea there is if we train with data
2261
+ they're all stored on computers, so all their
2262
+
2263
+ 1:16:11.973 --> 1:16:15.579
2264
+ characters must be representable in bytes.
2265
+
2266
+ 1:16:15.579 --> 1:16:23.716
2267
+ So they then want to use not subwords, not
2268
+ characters, but bytes instead.
2269
+
2270
+ 1:16:25.905 --> 1:16:27.693
2271
+ Do You See Some Downsides?
2272
+
2273
+ 1:16:31.791 --> 1:16:38.245
2274
+ There are some languages that are easier to
2275
+ represent than others.
2276
+
2277
+ 1:16:38.245 --> 1:16:40.556
2278
+ That's definitely true.
2279
+
2280
+ 1:16:41.081 --> 1:16:44.981
2281
+ So if you have a sentence normally of five
2282
+ words,.
2283
+
2284
+ 1:16:46.246 --> 1:16:59.899
2285
+ You think about if we split it into characters,
2286
+ how many characters we have, and each character
2287
+
2288
+ 1:16:59.899 --> 1:17:04.166
2289
+ that would be how many bytes.
2290
+
2291
+ 1:17:04.424 --> 1:17:15.749
2292
+ And then it's more to model, it's more for
2293
+ the model to learn, and it's also a bigger
2294
+
2295
+ 1:17:15.749 --> 1:17:19.831
2296
+ sequence to give to the model.
2297
+
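A small sketch of why byte-level inputs blow up the sequence length, especially for non-Latin scripts (the example strings are my own):

def byte_tokens(text):
    # UTF-8 byte-level "tokenization": only 256 possible symbols are needed
    return list(text.encode("utf-8"))

for sentence in ["good morning", "доброе утро", "おはよう"]:
    print(sentence, len(sentence), "characters ->", len(byte_tokens(sentence)), "bytes")
# Latin text stays at 1 byte per character, Cyrillic at roughly 2, Japanese at
# roughly 3, so some languages end up with much longer input sequences.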
2298
+ 1:17:20.260 --> 1:17:22.038
2299
+ Yeah.
2300
+
2301
+ 1:17:21.941 --> 1:17:31.232
2302
+ Visual representation is also quite interesting,
2303
+ so some people argued that we don't want to
2304
+
2305
+ 1:17:31.232 --> 1:17:35.428
2306
+ have a fixed discrete vocabulary anymore.
2307
+
2308
+ 1:17:35.428 --> 1:17:41.921
2309
+ Instead, we want to do it like OCR, like reading
2310
+ them as images.
2311
+
2312
+ 1:17:42.942 --> 1:17:54.016
2313
+ We'll look at one example for this next: Then
2314
+ another idea is how if you can distill the
2315
+
2316
+ 1:17:54.016 --> 1:18:03.966
2317
+ vocabulary as in learning some more compact
2318
+ representation,.
2319
+
2320
+ 1:18:04.284 --> 1:18:12.554
2321
+ But next I wanted to show you an example of
2322
+ pixel inputs for multilingual machine translation.
2323
+
2324
+ 1:18:12.852 --> 1:18:29.757
2325
+ If you look at the picture, all the characters
2326
+ that are marked with red are actually not what they look like.
2327
+
2328
+ 1:18:32.772 --> 1:18:48.876
2329
+ They are actually from a different script; if you give this
2330
+ to the model and let it do the subword tokenization,
2331
+
2332
+ 1:18:52.852 --> 1:19:04.373
2333
+ You would get maybe mostly characters out
2334
+ of it, because I guess in the pre-existing vocabulary
2335
+
2336
+ 1:19:04.373 --> 1:19:07.768
2337
+ there won't be Latin H and.
2338
+
2339
+ 1:19:07.707 --> 1:19:16.737
2340
+ So you'll get characters out of it, which
2341
+ means it's probably going to be more difficult
2342
+
2343
+ 1:19:16.737 --> 1:19:18.259
2344
+ for the model.
2345
+
2346
+ 1:19:20.140 --> 1:19:28.502
2347
+ Yeah, so the motivation for pixel inputs is
2348
+ that there is more sharing across languages.
2349
+
2350
+ 1:19:30.010 --> 1:19:37.773
2351
+ Here basically illustrates an embedding table
2352
+ for subwords and saying if you have sentences
2353
+
2354
+ 1:19:37.773 --> 1:19:45.705
2355
+ in the Latin script like French and English
2356
+ then it's going to take certain proportions
2357
+
2358
+ 1:19:45.705 --> 1:19:48.152
2359
+ of this big embedding table.
2360
+
2361
+ 1:19:48.328 --> 1:19:56.854
2362
+ While for Arabic and Chinese it's yet again
2363
+ another part,
2364
+
2365
+ 1:19:56.796 --> 1:20:09.037
2366
+ that is not joined with the previous one, which is a problem if
2367
+ we want to have shared representations for
2368
+
2369
+ 1:20:09.037 --> 1:20:11.992
2370
+ different languages.
2371
+
2372
+ 1:20:12.692 --> 1:20:18.531
2373
+ On the other hand, if we're going with pixels,
2374
+ there's definitely more sharing.
2375
+
2376
+ 1:20:22.362 --> 1:20:30.911
2377
+ There's a difference though to a standard
2378
+ kind of normal machine translation pipeline.
2379
+
2380
+ 1:20:32.252 --> 1:20:47.581
2381
+ If you have this brace then how do we go with
2382
+ images into a translation model?
2383
+
2384
+ 1:20:50.690 --> 1:20:58.684
2385
+ We still have to tokenize it somehow, so in
2386
+ this case they do an overlapping sliding window.
2387
+
2388
+ 1:20:59.259 --> 1:21:13.636
2389
+ Since it's more visual, we're using some kind
2390
+ of convolution blocks before going into these
2391
+
2392
+ 1:21:13.636 --> 1:21:14.730
2393
+ encoder blocks.
2394
+
2395
+ 1:21:15.035 --> 1:21:25.514
2396
+ So here I wanted to show that if you go with
2397
+ these more specialized architectures we can work with
2398
+
2399
+ 1:21:25.514 --> 1:21:27.829
2400
+ pixels, and that's it.
2401
+
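As a rough sketch of the overlapping sliding-window step over a rendered sentence image (array sizes, window and stride are illustrative; the actual systems add convolutional blocks on top of these slices before the translation model):

import numpy as np

def sliding_window_patches(image, window=16, stride=8):
    # Cut a rendered text line (height x width) into overlapping vertical
    # slices; each slice plays the role of one input "token".
    height, width = image.shape
    patches = [image[:, s:s + window]
               for s in range(0, max(width - window, 0) + 1, stride)]
    return np.stack(patches)               # (num_patches, height, window)

rendered = np.random.rand(24, 128)         # stand-in for a rendered sentence
print(sliding_window_patches(rendered).shape)   # (15, 24, 16)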
2402
+ 1:21:30.050 --> 1:21:31.310
2403
+ There's Also One Downside.
2404
+
2405
+ 1:21:31.431 --> 1:21:51.380
2406
+ If we go with pixels and present teachings,
2407
+ what are our challenges?
2408
+
2409
+ 1:21:52.993 --> 1:22:00.001
2410
+ Exactly so as they beat us others here, also
2411
+ pointing out here for their experiments.
2412
+
2413
+ 1:22:01.061 --> 1:22:08.596
2414
+ They only consider one target language,
2415
+ and on their target side
2416
+
2417
+ 1:22:08.596 --> 1:22:10.643
2418
+ it's not pixel-based.
2419
+
2420
+ 1:22:11.131 --> 1:22:31.033
2421
+ So this is definitely, in my opinion, a very
2422
+ interesting step towards more shared representations.
2423
+
2424
+ 1:22:31.831 --> 1:22:40.574
2425
+ Yeah, so with this kind of out of the box
2426
+ approach, I just wanted to summarize today's lecture.
2427
+
2428
+ 1:22:41.962 --> 1:22:53.158
2429
+ First, I think we saw why multilingual MT is cool,
2430
+ why there are several open challenges out there
2431
+
2432
+ 1:22:53.158 --> 1:22:53.896
2433
+ that.
2434
+
2435
+ 1:22:55.355 --> 1:23:03.601
2436
+ We also saw, like several approaches, how
2437
+ to realize and implement a multilingual machine translation
2438
+
2439
+ 1:23:03.601 --> 1:23:11.058
2440
+ system, and yeah, lastly, we've seen quite
2441
+ some open challenges on what is unsolved.
2442
+
2443
+ 1:23:11.691 --> 1:23:22.403
2444
+ Yeah, so with this I want to thank you for being
2445
+ here today, and I'm up here if you want to ask anything.
2446
+
2447
+ 1:23:26.106 --> 1:23:29.727
2448
+ If you have questions, we will also be around
2449
+ for a moment.
2450
+
demo_data/lectures/Lecture-10-13.06.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a8dc282db3512e8731326f1898c8dd757c40f33bd1468ffae249a9374f76fe28
3
+ size 122197601
demo_data/lectures/Lecture-11-15.06.2023/English.vtt ADDED
The diff for this file is too large to render. See raw diff
 
demo_data/lectures/Lecture-11-15.06.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:018f7b42f2225e9ea6d68c39e22111b3d3e172c045fde57e3dfd6b2ca3df4198
3
+ size 123175586
demo_data/lectures/Lecture-12-20.06.2023/English.vtt ADDED
The diff for this file is too large to render. See raw diff
 
demo_data/lectures/Lecture-12-20.06.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e86b4df900483ac17cf6e78c131d83ab5f7df2a0790c7ae034502bdce61554f3
3
+ size 158173841
demo_data/lectures/Lecture-13-04.07.2023/English.vtt ADDED
@@ -0,0 +1,2696 @@
1
+ WEBVTT
2
+
3
+ 0:00:01.641 --> 0:00:06.302
4
+ Hey, so welcome again to today's lecture on machine
5
+ translation.
6
+
7
+ 0:00:07.968 --> 0:00:15.152
8
+ This week we'll have a bit of a different focus;
9
+ the last two weeks or so we have been looking into
10
+
11
+ 0:00:15.655 --> 0:00:28.073
12
+ How we can improve our system by having more
13
+ data, other data sources, or using them
14
+
15
+ 0:00:28.073 --> 0:00:30.331
16
+ more efficiently.
17
+
18
+ 0:00:30.590 --> 0:00:38.046
19
+ And we'll have a bit more of that next week
20
+ with the anti-travised and the context.
21
+
22
+ 0:00:38.338 --> 0:00:47.415
23
+ So that we are shifting from this idea of
24
+ we treat each sentence independently, but treat
25
+
26
+ 0:00:47.415 --> 0:00:49.129
27
+ the translation in context.
28
+
29
+ 0:00:49.129 --> 0:00:58.788
30
+ Because maybe you can remember from the beginning,
31
+ there are phenomena in machine translation
32
+
33
+ 0:00:58.788 --> 0:01:02.143
34
+ that you cannot correctly check.
35
+
36
+ 0:01:03.443 --> 0:01:14.616
37
+ However, today we want to look more into what
38
+ challenges arise, specifically when we're practically
39
+
40
+ 0:01:14.616 --> 0:01:16.628
41
+ applying machine translation.
42
+
43
+ 0:01:17.017 --> 0:01:23.674
44
+ And this block will be a total of four different
45
+ lectures.
46
+
47
+ 0:01:23.674 --> 0:01:29.542
48
+ What type of biases are in machine translation
49
+ can.
50
+
51
+ 0:01:29.729 --> 0:01:37.646
52
+ Just then can we try to improve this, but
53
+ of course the first focus can be at least the.
54
+
55
+ 0:01:37.717 --> 0:01:41.375
56
+ And this, of course, gets more and more important.
57
+
58
+ 0:01:41.375 --> 0:01:48.333
59
+ The more often you apply this type of technology,
60
+ when it was mainly a basic research tool which
61
+
62
+ 0:01:48.333 --> 0:01:53.785
63
+ you were using in a research environment, it's
64
+ not directly that important.
65
+
66
+ 0:01:54.054 --> 0:02:00.370
67
+ But once you apply it, the question is: does
68
+ it perform the same for everybody, or is the
69
+
70
+ 0:02:00.370 --> 0:02:04.436
71
+ performance for some people less good than for other
72
+ people?
73
+
74
+ 0:02:04.436 --> 0:02:10.462
75
+ Does it have specific challenges and we are
76
+ seeing that especially in translation?
77
+
78
+ 0:02:10.710 --> 0:02:13.420
79
+ We have the major challenge.
80
+
81
+ 0:02:13.420 --> 0:02:20.333
82
+ We have the grammatical gender and this is
83
+ not the same in all languages.
84
+
85
+ 0:02:20.520 --> 0:02:35.431
86
+ In English, it's not clear if you talk about
87
+ some person, if it's male or female, and so
88
+
89
+ 0:02:35.431 --> 0:02:39.787
90
+ hopefully you've learned.
91
+
92
+ 0:02:41.301 --> 0:02:50.034
93
+ Just as a brief overview, so besides this one
94
+ aspect of application we will then have two other
95
+
96
+ 0:02:50.034 --> 0:02:57.796
97
+ aspects: On Thursday we'll look into adaptation,
98
+ so how can we adapt to specific situations?
99
+
100
+ 0:02:58.718 --> 0:03:09.127
101
+ Because we have seen that your systems perform
102
+ well when the test case is similar to the training
103
+
104
+ 0:03:09.127 --> 0:03:15.181
105
+ case, so ideally you should get matching training
106
+ data.
107
+
108
+ 0:03:16.036 --> 0:03:27.577
109
+ However, in practical applications, it's not
110
+ always possible to collect really the best
111
+
112
+ 0:03:27.577 --> 0:03:31.642
113
+ fitting data, so in that case.
114
+
115
+ 0:03:32.092 --> 0:03:39.269
116
+ And then the third larger group of applications
117
+ will then be speech translation.
118
+
119
+ 0:03:39.269 --> 0:03:42.991
120
+ What do we have to change in our machine translation system?
121
+
122
+ 0:03:43.323 --> 0:03:53.569
123
+ If we are now not translating text, but if
124
+ we want to translate speech, that will be more
125
+
126
+ 0:03:53.569 --> 0:03:54.708
127
+ lectures.
128
+
129
+ 0:04:00.180 --> 0:04:12.173
130
+ So what are we talking about when we are talking
131
+ about bias, from a definition point of view?
132
+
133
+ 0:04:12.092 --> 0:04:21.799
134
+ Means we are introducing systematic errors
135
+ when testing, and then we encourage the selection
136
+
137
+ 0:04:21.799 --> 0:04:24.408
138
+ of the specific answers.
139
+
140
+ 0:04:24.804 --> 0:04:36.862
141
+ The most prominent case, which is analyzed
142
+ most in the research community, is a bias based
143
+
144
+ 0:04:36.862 --> 0:04:38.320
145
+ on gender.
146
+
147
+ 0:04:38.320 --> 0:04:43.355
148
+ One example: she works in a hospital.
149
+
150
+ 0:04:43.523 --> 0:04:50.787
151
+ It is not directly able to assess whether
152
+ this is now a point or a friend.
153
+
154
+ 0:04:51.251 --> 0:05:07.095
155
+ And although in this one even there is, it's
156
+ possible to ambiguate this based on the context.
157
+
158
+ 0:05:07.127 --> 0:05:14.391
159
+ However, there is yeah, this relation to learn
160
+ is of course not that easy.
161
+
162
+ 0:05:14.614 --> 0:05:27.249
163
+ So the system might also learn more like shortcut
164
+ connections, which might be that in your training
165
+
166
+ 0:05:27.249 --> 0:05:31.798
167
+ data most of the doctors are males.
168
+
169
+ 0:05:32.232 --> 0:05:41.725
170
+ That is like that was too bigly analyzed and
171
+ biased, and we'll focus on that also in this.
172
+
173
+ 0:05:41.641 --> 0:05:47.664
174
+ In this lecture, however, of course, the system
175
+ might be a lot of other biases too, which have
176
+
177
+ 0:05:47.664 --> 0:05:50.326
178
+ been partly investigated in other fields.
179
+
180
+ 0:05:50.326 --> 0:05:53.496
181
+ But I think machine translation is not that
182
+ much.
183
+
184
+ 0:05:53.813 --> 0:05:57.637
185
+ For example, it can be based on your originals.
186
+
187
+ 0:05:57.737 --> 0:06:09.405
188
+ So there is an example for a sentiment analysis
189
+ that's a bit prominent.
190
+
191
+ 0:06:09.405 --> 0:06:15.076
192
+ A sentiment analysis means you're.
193
+
194
+ 0:06:15.035 --> 0:06:16.788
195
+ Like you're seeing it in reviews.
196
+
197
+ 0:06:17.077 --> 0:06:24.045
198
+ And then you can show that with baseline models,
199
+ if the name is Mohammed then the sentiment
200
+
201
+ 0:06:24.045 --> 0:06:30.786
202
+ in a lot of systems will be more negative than
203
+ if it's like a traditional European name.
204
+
205
+ 0:06:31.271 --> 0:06:33.924
206
+ Are with foods that is simple.
207
+
208
+ 0:06:33.924 --> 0:06:36.493
209
+ It's this type of restaurant.
210
+
211
+ 0:06:36.493 --> 0:06:38.804
212
+ It's positive and another.
213
+
214
+ 0:06:39.319 --> 0:06:49.510
215
+ You have other aspects, so we have seen this.
216
+
217
+ 0:06:49.510 --> 0:06:59.480
218
+ We have done some experiments in Vietnamese.
219
+
220
+ 0:06:59.559 --> 0:07:11.040
221
+ And then, for example, you can analyze that
222
+ if it's like he's Germany will address it more
223
+
224
+ 0:07:11.040 --> 0:07:18.484
225
+ formal, while if he is North Korean he'll use
226
+ an informal.
227
+
228
+ 0:07:18.838 --> 0:07:24.923
229
+ So these are also possible types of gender.
230
+
231
+ 0:07:24.923 --> 0:07:31.009
232
+ However, this is difficult types of biases.
233
+
234
+ 0:07:31.251 --> 0:07:38.903
235
+ However, especially in translation, the bias
236
+ for gender is the most challenging because
237
+
238
+ 0:07:38.903 --> 0:07:42.989
239
+ we are treating gender in different languages.
240
+
241
+ 0:07:45.405 --> 0:07:46.930
242
+ Hi this is challenging.
243
+
244
+ 0:07:48.148 --> 0:07:54.616
245
+ The reason for that is that there is a translation
246
+ mismatch and we have, I mean, one reason for
247
+
248
+ 0:07:54.616 --> 0:08:00.140
249
+ that is there's a translation mismatch and
250
+ that's the most challenging situation.
251
+
252
+ 0:08:00.140 --> 0:08:05.732
253
+ So there is there is different information
254
+ in the source language or in the target.
255
+
256
+ 0:08:06.046 --> 0:08:08.832
257
+ So if we have the English word dot player,.
258
+
259
+ 0:08:09.029 --> 0:08:12.911
260
+ It's there is no information about the gender
261
+ in there.
262
+
263
+ 0:08:12.911 --> 0:08:19.082
264
+ However, if you want to translate in German,
265
+ you cannot easily generate a word without a
266
+
267
+ 0:08:19.082 --> 0:08:20.469
268
+ gender information.
269
+
270
+ 0:08:20.469 --> 0:08:27.056
271
+ Or man, you can't do something like Shubila
272
+ in, but that sounds a bit weird if you're talking.
273
+
274
+ 0:08:27.027 --> 0:08:29.006
275
+ About a specific person.
276
+
277
+ 0:08:29.006 --> 0:08:32.331
278
+ Then you should use the appropriate font.
279
+
280
+ 0:08:32.692 --> 0:08:44.128
281
+ And so it's most challenging translation as
282
+ always in this situation where you have less
283
+
284
+ 0:08:44.128 --> 0:08:50.939
285
+ information on the source side but more information.
286
+
287
+ 0:08:51.911 --> 0:08:57.103
288
+ Similar things like if you think about Japanese,
289
+ for example where there's different formality
290
+
291
+ 0:08:57.103 --> 0:08:57.540
292
+ levels.
293
+
294
+ 0:08:57.540 --> 0:09:02.294
295
+ If in German there is no formality or like
296
+ two only or in English there's no formality
297
+
298
+ 0:09:02.294 --> 0:09:02.677
299
+ level.
300
+
301
+ 0:09:02.862 --> 0:09:08.139
302
+ And now you have to estimate the formality
303
+ level.
304
+
305
+ 0:09:08.139 --> 0:09:10.884
306
+ Of course, it takes some.
307
+
308
+ 0:09:10.884 --> 0:09:13.839
309
+ It's not directly possible.
310
+
311
+ 0:09:14.094 --> 0:09:20.475
312
+ What nowadays systems are doing is at least
313
+ assess.
314
+
315
+ 0:09:20.475 --> 0:09:27.470
316
+ This is a situation where don't have enough
317
+ information.
318
+
319
+ 0:09:27.567 --> 0:09:28.656
320
+ Translation.
321
+
322
+ 0:09:28.656 --> 0:09:34.938
323
+ So here you have that suggesting it can be
324
+ doctor or doctorate in Spanish.
325
+
326
+ 0:09:35.115 --> 0:09:37.051
327
+ So that is a possibility.
328
+
329
+ 0:09:37.051 --> 0:09:41.595
330
+ However, it is of course very, very challenging
331
+ to find out.
332
+
333
+ 0:09:42.062 --> 0:09:46.130
334
+ Is there two really different meanings, or
335
+ is it not the case?
336
+
337
+ 0:09:46.326 --> 0:09:47.933
338
+ You can do the big rule base here.
339
+
340
+ 0:09:47.933 --> 0:09:49.495
341
+ Maybe don't know how they did it.
342
+
343
+ 0:09:49.990 --> 0:09:57.469
344
+ You can, of course, if you are focusing on
345
+ gender, the source and the target is different,
346
+
347
+ 0:09:57.469 --> 0:09:57.879
348
+ and.
349
+
350
+ 0:09:58.118 --> 0:10:05.799
351
+ But if you want to do it more general, it's
352
+ not that easy because there's always.
353
+
354
+ 0:10:06.166 --> 0:10:18.255
355
+ But it's not clear if these are really different
356
+ or if there's only slight differences.
357
+
358
+ 0:10:22.142 --> 0:10:36.451
359
+ Between that another reason why there is a
360
+ bias in there is typically the system tries
361
+
362
+ 0:10:36.451 --> 0:10:41.385
363
+ to always do the most simple.
364
+
365
+ 0:10:42.262 --> 0:10:54.483
366
+ And also in your training data there are unintended
367
+ shortcuts or clues only in the training data
368
+
369
+ 0:10:54.483 --> 0:10:59.145
370
+ because you sample them in some way.
371
+
372
+ 0:10:59.379 --> 0:11:06.257
373
+ This example, if she works in a hospital and
374
+ my friend is a nurse, then it might be that
375
+
376
+ 0:11:06.257 --> 0:11:07.184
377
+ one friend.
378
+
379
+ 0:11:08.168 --> 0:11:18.979
380
+ Male and female because it has learned that
381
+ in your trained doctor is a male and a nurse
382
+
383
+ 0:11:18.979 --> 0:11:20.802
384
+ is doing this.
385
+
386
+ 0:11:20.880 --> 0:11:29.587
387
+ And of course, if we are doing maximum likelihood
388
+ approximation as we are doing it in general,
389
+
390
+ 0:11:29.587 --> 0:11:30.962
391
+ we are always.
392
+
393
+ 0:11:30.951 --> 0:11:43.562
394
+ So that means if in your training data this
395
+ correlation is maybe in the case then your
396
+
397
+ 0:11:43.562 --> 0:11:48.345
398
+ predictions are always the same.
399
+
400
+ 0:11:48.345 --> 0:11:50.375
401
+ It typically.
402
+
403
+ 0:11:55.035 --> 0:12:06.007
404
+ What does it mean, of course, if we are having
405
+ this type of fires and if we are applying?
406
+
407
+ 0:12:05.925 --> 0:12:14.821
408
+ It might be that the benefit of machine translation
409
+ rice so more and more people can benefit from
410
+
411
+ 0:12:14.821 --> 0:12:20.631
412
+ the ability to talk to people in different
413
+ languages and so on.
414
+
415
+ 0:12:20.780 --> 0:12:27.261
416
+ But if you more often use it, problems of
417
+ the system also get more and more important.
418
+
419
+ 0:12:27.727 --> 0:12:36.984
420
+ And so if we are seeing that these problems
421
+ and people nowadays only start to analyze these
422
+
423
+ 0:12:36.984 --> 0:12:46.341
424
+ problems partly, also because if it hasn't
425
+ been used, it's not that important if the quality
426
+
427
+ 0:12:46.341 --> 0:12:47.447
428
+ is so bad.
429
+
430
+ 0:12:47.627 --> 0:12:51.907
431
+ Version or is mixing it all the time like
432
+ we have seen in old systems.
433
+
434
+ 0:12:51.907 --> 0:12:52.993
435
+ Then, of course,.
436
+
437
+ 0:12:53.053 --> 0:12:57.303
438
+ The issue is not that you have biased issues
439
+ that you at first need to create a right view.
440
+
441
+ 0:12:57.637 --> 0:13:10.604
442
+ So only with the wide application of the good
443
+ quality this becomes important, and then of
444
+
445
+ 0:13:10.604 --> 0:13:15.359
446
+ course you should look into how.
447
+
448
+ 0:13:15.355 --> 0:13:23.100
449
+ In order to first get aware of what are the
450
+ challenges, and that is a general idea not
451
+
452
+ 0:13:23.100 --> 0:13:24.613
453
+ only about bias.
454
+
455
+ 0:13:24.764 --> 0:13:31.868
456
+ Of course, we have learned about blue scores,
457
+ so how can you evaluate the over quality and
458
+
459
+ 0:13:31.868 --> 0:13:36.006
460
+ they are very important, either blue or any
461
+ of that.
462
+
463
+ 0:13:36.006 --> 0:13:40.378
464
+ However, they are somehow giving us a general
465
+ overview.
466
+
467
+ 0:13:40.560 --> 0:13:58.410
468
+ And if we want to improve our systems, of
469
+ course it's important that we also do more
470
+
471
+ 0:13:58.410 --> 0:14:00.510
472
+ detailed.
473
+
474
+ 0:14:00.340 --> 0:14:05.828
475
+ Test sets which are very challenging in order
476
+ to attend to see how good these systems.
477
+
478
+ 0:14:06.446 --> 0:14:18.674
479
+ Of course, one last reminder to that if you
480
+ do a challenge that says it's typically good
481
+
482
+ 0:14:18.674 --> 0:14:24.581
483
+ to keep track of your general performance.
484
+
485
+ 0:14:24.784 --> 0:14:28.648
486
+ You don't want to improve normally then on
487
+ the general quality.
488
+
489
+ 0:14:28.688 --> 0:14:41.555
490
+ So if you build a system which will mitigate
491
+ some biases then the aim is that if you evaluate
492
+
493
+ 0:14:41.555 --> 0:14:45.662
494
+ it on the challenging biases.
495
+
496
+ 0:14:45.745 --> 0:14:53.646
497
+ You don't need to get better because the aggregated
498
+ versions don't really measure that aspect well,
499
+
500
+ 0:14:53.646 --> 0:14:57.676
501
+ but if you significantly drop in performance
502
+ then.
503
+
504
+ 0:15:00.000 --> 0:15:19.164
505
+ What are, in generally calms, people report
506
+ about that or why should you care about?
507
+
508
+ 0:15:19.259 --> 0:15:23.598
509
+ And you're even then amplifying this type
510
+ of stereotypes.
511
+
512
+ 0:15:23.883 --> 0:15:33.879
513
+ And that is not what you want to achieve with
514
+ using this technology.
515
+
516
+ 0:15:33.879 --> 0:15:39.384
517
+ It's not working through some groups.
518
+
519
+ 0:15:39.819 --> 0:15:47.991
520
+ And secondly what is referred to as allocational
521
+ parts.
522
+
523
+ 0:15:47.991 --> 0:15:54.119
524
+ The system might not perform as well for.
525
+
526
+ 0:15:54.314 --> 0:16:00.193
527
+ So another example of which we would like
528
+ to see is that sometimes the translation depends
529
+
530
+ 0:16:00.193 --> 0:16:01.485
531
+ on who is speaking.
532
+
533
+ 0:16:01.601 --> 0:16:03.463
534
+ So Here You Have It in French.
535
+
536
+ 0:16:03.723 --> 0:16:16.359
537
+ Not say it, but the word happy or French has
538
+ to be expressed differently, whether it's a
539
+
540
+ 0:16:16.359 --> 0:16:20.902
541
+ male person or a female person.
542
+
543
+ 0:16:21.121 --> 0:16:28.917
544
+ It's nearly impossible to guess that or it's
545
+ impossible, so then you always select one.
546
+
547
+ 0:16:29.189 --> 0:16:37.109
548
+ And of course, since we do greedy search,
549
+ it will always generate the same, so you will
550
+
551
+ 0:16:37.109 --> 0:16:39.449
552
+ have a worse performance.
553
+
554
+ 0:16:39.779 --> 0:16:46.826
555
+ And of course not what we want to achieve
556
+ in average.
557
+
558
+ 0:16:46.826 --> 0:16:54.004
559
+ You might be then good, but you also have
560
+ the ability.
561
+
562
+ 0:16:54.234 --> 0:17:08.749
563
+ This is a biased problem or an interface problem
564
+ because mean you can say well.
565
+
566
+ 0:17:09.069 --> 0:17:17.358
567
+ And if you do it, we still have a system that
568
+ generates unusable output.
569
+
570
+ 0:17:17.358 --> 0:17:24.057
571
+ If you don't tell it what you want to do,
572
+ so in this case.
573
+
574
+ 0:17:24.244 --> 0:17:27.173
575
+ So in this case it's like if we don't have
576
+ enough information.
577
+
578
+ 0:17:27.467 --> 0:17:34.629
579
+ So you have to adapt your system in some way
580
+ that can either access the information or output.
581
+
582
+ 0:17:34.894 --> 0:17:46.144
583
+ But yeah, how you mean there's different ways
584
+ of how to improve over that first thing is
585
+
586
+ 0:17:46.144 --> 0:17:47.914
587
+ you find out.
588
+
589
+ 0:17:48.688 --> 0:17:53.826
590
+ Then there is different ways of addressing
591
+ them, and they of course differ.
592
+
593
+ 0:17:53.826 --> 0:17:57.545
594
+ Isn't the situation where the information's
595
+ available?
596
+
597
+ 0:17:58.038 --> 0:18:12.057
598
+ That's the first case we have, or is it a
599
+ situation where we don't have the information
600
+
601
+ 0:18:12.057 --> 0:18:13.332
602
+ either?
603
+
604
+ 0:18:14.154 --> 0:18:28.787
605
+ Or should give the system maybe the opportunity
606
+ to output those or say don't know this is still
607
+
608
+ 0:18:28.787 --> 0:18:29.701
609
+ open.
610
+
611
+ 0:18:29.769 --> 0:18:35.470
612
+ And even if they have enough information,
613
+ need this additional information, but they
614
+
615
+ 0:18:35.470 --> 0:18:36.543
616
+ are just doing.
617
+
618
+ 0:18:36.776 --> 0:18:51.132
619
+ Which is a bit based on how we find that there
620
+ is research on that, but it's not that easy
621
+
622
+ 0:18:51.132 --> 0:18:52.710
623
+ to solve.
624
+
625
+ 0:18:52.993 --> 0:19:05.291
626
+ But in general, detecting do have enough information
627
+ to do a good translation or are information
628
+
629
+ 0:19:05.291 --> 0:19:06.433
630
+ missing?
631
+
632
+ 0:19:09.669 --> 0:19:18.951
633
+ But before we come on how we will address
634
+ it or try to change it, and before we look
635
+
636
+ 0:19:18.951 --> 0:19:22.992
637
+ at how we can assess it, of course,.
638
+
639
+ 0:19:23.683 --> 0:19:42.820
640
+ And therefore wanted to do a bit of a review
641
+ on how gender is represented in languages.
642
+
643
+ 0:19:43.743 --> 0:19:48.920
644
+ Course: You can have more fine grained.
645
+
646
+ 0:19:48.920 --> 0:20:00.569
647
+ It's not that everything in the group is the
648
+ same, but in general you have a large group.
649
+
650
+ 0:20:01.381 --> 0:20:08.347
651
+ For example, you even don't say ishi or but
652
+ it's just one word for it written.
653
+
654
+ 0:20:08.347 --> 0:20:16.107
655
+ Oh, don't know how it's pronounced, so you
656
+ cannot say from a sentence whether it's ishi
657
+
658
+ 0:20:16.107 --> 0:20:16.724
659
+ or it.
660
+
661
+ 0:20:17.937 --> 0:20:29.615
662
+ Of course, there are some exceptions for whether
663
+ it's a difference between male and female.
664
+
665
+ 0:20:29.615 --> 0:20:35.962
666
+ They have different names for brother and
667
+ sister.
668
+
669
+ 0:20:36.036 --> 0:20:41.772
670
+ So normally you cannot infer whether this
671
+ is a male speaker or speaking about a male
672
+
673
+ 0:20:41.772 --> 0:20:42.649
674
+ or a female.
675
+
676
+ 0:20:44.304 --> 0:20:50.153
677
+ Examples for these languages are, for example,
678
+ Finnish and Turkish.
679
+
680
+ 0:20:50.153 --> 0:21:00.370
681
+ There are more languages, but these are: Then
682
+ we have no nutritional gender languages where
683
+
684
+ 0:21:00.370 --> 0:21:05.932
685
+ there's some gender information in there, but
686
+ it's.
687
+
688
+ 0:21:05.905 --> 0:21:08.169
689
+ And this is an example.
690
+
691
+ 0:21:08.169 --> 0:21:15.149
692
+ This is English, which is in that way a nice
693
+ example because most people.
694
+
695
+ 0:21:15.415 --> 0:21:20.164
696
+ So you have there some lexicogender and phenomenal
697
+ gender.
698
+
699
+ 0:21:20.164 --> 0:21:23.303
700
+ I mean mamadeta there she-hee and him.
701
+
702
+ 0:21:23.643 --> 0:21:31.171
703
+ And very few words are marked like actor and
704
+ actress, but in general most words are not
705
+
706
+ 0:21:31.171 --> 0:21:39.468
707
+ marked, so it's teacher and lecturer and friend,
708
+ so in all these words the gender is not marked,
709
+
710
+ 0:21:39.468 --> 0:21:41.607
711
+ and so you cannot infer.
712
+
713
+ 0:21:42.622 --> 0:21:48.216
714
+ So the initial Turkish sentence here would
715
+ be translated to either he is a good friend
716
+
717
+ 0:21:48.216 --> 0:21:49.373
718
+ or she is a good.
719
+
720
+ 0:21:51.571 --> 0:22:05.222
721
+ In this case you would have them gender information
722
+ in there, but of course there's a good friend.
723
+
724
+ 0:22:07.667 --> 0:22:21.077
725
+ And then finally there is the grammatical
726
+ German languages where each noun has a gender.
727
+
728
+ 0:22:21.077 --> 0:22:25.295
729
+ That's the case in Spanish.
730
+
731
+ 0:22:26.186 --> 0:22:34.025
732
+ This is mostly formal, but at least if you're
733
+ talking about a human that also agrees.
734
+
735
+ 0:22:34.214 --> 0:22:38.209
736
+ Of course, it's like the sun.
737
+
738
+ 0:22:38.209 --> 0:22:50.463
739
+ There is no clear thing why the sun should
740
+ be female, and in other language it's different.
741
+
742
+ 0:22:50.390 --> 0:22:56.100
743
+ The matching, and then you also have more
744
+ agreements with this that makes things more
745
+
746
+ 0:22:56.100 --> 0:22:56.963
747
+ complicated.
748
+
749
+ 0:22:57.958 --> 0:23:08.571
750
+ Here he is a good friend and the good is also
751
+ depending whether it's male or went up so it's
752
+
753
+ 0:23:08.571 --> 0:23:17.131
754
+ changing also based on the gender so you have
755
+ a lot of gender information.
756
+
757
+ 0:23:17.777 --> 0:23:21.364
758
+ Get them, but do you always get them correctly?
759
+
760
+ 0:23:21.364 --> 0:23:25.099
761
+ It might be that they're in English, for example.
762
+
763
+ 0:23:28.748 --> 0:23:36.154
764
+ And since this is the case, and you need to
765
+ like often express the gender even though you
766
+
767
+ 0:23:36.154 --> 0:23:37.059
768
+ might not.
769
+
770
+ 0:23:37.377 --> 0:23:53.030
771
+ Aware of it or it's not possible, there's
772
+ some ways in German how to mark mutual forms.
773
+
774
+ 0:23:54.194 --> 0:24:03.025
775
+ But then it's again from the machine learning
776
+ side of view, of course quite challenging because
777
+
778
+ 0:24:03.025 --> 0:24:05.417
779
+ you only want to use the.
780
+
781
+ 0:24:05.625 --> 0:24:11.108
782
+ If it's known to the reader you want to use
783
+ the correct, the not mutual form but either
784
+
785
+ 0:24:11.108 --> 0:24:12.354
786
+ the male or female.
787
+
788
+ 0:24:13.013 --> 0:24:21.771
789
+ So they are assessing what is known to the
790
+ reader as a challenge which needs to in some
791
+
792
+ 0:24:21.771 --> 0:24:23.562
793
+ way be addressed.
794
+
795
+ 0:24:26.506 --> 0:24:30.887
796
+ Here why does that happen?
797
+
798
+ 0:24:30.887 --> 0:24:42.084
799
+ Three reasons we have that in a bit so one
800
+ is, of course, that your.
801
+
802
+ 0:24:42.162 --> 0:24:49.003
803
+ Example: If you look at the Europarl corpus,
804
+ which is an important resource for doing machine
805
+
806
+ 0:24:49.003 --> 0:24:49.920
807
+ translation.
808
+
809
+ 0:24:50.010 --> 0:24:59.208
810
+ Then there's only thirty percent of the speakers
811
+ are female, and so if you train a model on
812
+
813
+ 0:24:59.208 --> 0:25:06.606
814
+ that data, if you're translating to French,
815
+ there will be a male version.
816
+
817
+ 0:25:06.746 --> 0:25:10.762
818
+ And so you'll just have a lot more like seventy
819
+ percent of your mail for it.
820
+
821
+ 0:25:10.971 --> 0:25:18.748
822
+ And that will be Yep will make the model therefore
823
+ from this data sub.
824
+
825
+ 0:25:18.898 --> 0:25:25.882
826
+ And of course this will be in the data for
827
+ a very long time.
828
+
829
+ 0:25:25.882 --> 0:25:33.668
830
+ So if there's more female speakers in the
831
+ European Parliament, but.
832
+
833
+ 0:25:33.933 --> 0:25:42.338
834
+ But we are training on historical data, so
835
+ even if there is for a long time, it will not
836
+
837
+ 0:25:42.338 --> 0:25:43.377
838
+ be in the.
839
+
840
+ 0:25:46.346 --> 0:25:57.457
841
+ Then besides these preexisting data there
842
+ is of course technical biases which will amplify
843
+
844
+ 0:25:57.457 --> 0:25:58.800
845
+ this type.
846
+
847
+ 0:25:59.039 --> 0:26:04.027
848
+ So one we already address, that's for example
849
+ sampling or beam search.
850
+
851
+ 0:26:04.027 --> 0:26:06.416
852
+ You get the most probable output.
853
+
854
+ 0:26:06.646 --> 0:26:16.306
855
+ So if there's a bias in your model, it will
856
+ amplify that not only in the case we had before,
857
+
858
+ 0:26:16.306 --> 0:26:19.423
859
+ and produce the male version.
860
+
861
+ 0:26:20.040 --> 0:26:32.873
862
+ So if you have the same source sentence like
863
+ am happy and in your training data it will
864
+
865
+ 0:26:32.873 --> 0:26:38.123
866
+ be male and female if you're doing.
867
+
868
+ 0:26:38.418 --> 0:26:44.510
869
+ So in that way by doing this type of algorithmic
870
+ design you will have.
871
+
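A tiny illustration of this amplification effect with made-up probabilities: even if the female form makes up 30% of the training data, always taking the most probable output produces the male form every single time, while sampling would at least reproduce the training distribution.

import random

probs = {"Je suis heureux": 0.7, "Je suis heureuse": 0.3}   # toy model output

greedy = max(probs, key=probs.get)     # greedy/beam search: always the same output
samples = random.choices(list(probs), weights=probs.values(), k=1000)

print("greedy always picks:", greedy)
print("sampled female ratio:", samples.count("Je suis heureuse") / 1000)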
872
+ 0:26:44.604 --> 0:26:59.970
873
+ Another use case is if you think about a multilingual
874
+ machine translation, for example if you are
875
+
876
+ 0:26:59.970 --> 0:27:04.360
877
+ now doing a pivot language.
878
+
879
+ 0:27:04.524 --> 0:27:13.654
880
+ But if you're first trying to English this
881
+ information might get lost and then you translate
882
+
883
+ 0:27:13.654 --> 0:27:14.832
884
+ to Spanish.
885
+
886
+ 0:27:15.075 --> 0:27:21.509
887
+ So while in general in this class there is
888
+ not this type of bias there,.
889
+
890
+ 0:27:22.922 --> 0:27:28.996
891
+ You might introduce it because you might have
892
+ good reasons for doing a modular system because
893
+
894
+ 0:27:28.996 --> 0:27:31.968
895
+ you don't have enough training data or so on.
896
+
897
+ 0:27:31.968 --> 0:27:37.589
898
+ It's performing better in average, but of
899
+ course by doing this choice you'll introduce
900
+
901
+ 0:27:37.589 --> 0:27:40.044
902
+ an additional type of bias into your.
903
+
904
+ 0:27:45.805 --> 0:27:52.212
905
+ And then there is what people refer to as
906
+ emergent bias, and that is, if you use a system
907
+
908
+ 0:27:52.212 --> 0:27:58.903
909
+ for a different use case as we see in, generally
910
+ it is the case that is performing worse, but
911
+
912
+ 0:27:58.903 --> 0:28:02.533
913
+ then of course you can have even more challenging.
914
+
915
+ 0:28:02.942 --> 0:28:16.196
916
+ So the extreme case would be if you train
917
+ a system only on male speakers, then of course
918
+
919
+ 0:28:16.196 --> 0:28:22.451
920
+ it will perform worse on female speakers.
921
+
922
+ 0:28:22.902 --> 0:28:36.287
923
+ So, of course, if you're doing this type of
924
+ problem, if you use a system for a different
925
+
926
+ 0:28:36.287 --> 0:28:42.152
927
+ situation where it was original, then.
928
+
929
+ 0:28:44.004 --> 0:28:54.337
930
+ And with this we would then go for type of
931
+ evaluation, but before we are looking at how
932
+
933
+ 0:28:54.337 --> 0:28:56.333
934
+ we can evaluate.
935
+
936
+ 0:29:00.740 --> 0:29:12.176
937
+ Before we want to look into how we can improve
938
+ the system, think yeah, maybe at the moment
939
+
940
+ 0:29:12.176 --> 0:29:13.559
941
+ most work.
942
+
943
+ 0:29:13.954 --> 0:29:21.659
944
+ And the one thing is the system trying to
945
+ look into stereotypes.
946
+
947
+ 0:29:21.659 --> 0:29:26.164
948
+ So how does a system use stereotypes?
949
+
950
+ 0:29:26.466 --> 0:29:29.443
951
+ So if you have the Hungarian sentence,.
952
+
953
+ 0:29:29.729 --> 0:29:33.805
954
+ Which should be he is an engineer or she is
955
+ an engineer.
956
+
957
+ 0:29:35.375 --> 0:29:43.173
958
+ And you cannot guess that because we saw that
959
+ he and she is not different in Hungary.
960
+
961
+ 0:29:43.423 --> 0:29:57.085
962
+ Then you can have a test set where you have
963
+ these type of ailanomal occupations.
964
+
965
+ 0:29:56.977 --> 0:30:03.862
966
+ You have statistics from how is the distribution
967
+ by gender so you can automatically generate
968
+
969
+ 0:30:03.862 --> 0:30:04.898
970
+ the sentence.
971
+
972
+ 0:30:04.985 --> 0:30:21.333
973
+ Then you could put in jobs which are mostly
974
+ done by a man and then you can check how is
975
+
976
+ 0:30:21.333 --> 0:30:22.448
977
+ your.
978
+
979
+ 0:30:22.542 --> 0:30:31.315
980
+ That is one type of evaluating stereotypes
981
+ that one of the most famous benchmarks called
982
+
983
+ 0:30:31.315 --> 0:30:42.306
984
+ vino is exactly: The second type of evaluation
985
+ is about gender preserving.
986
+
987
+ 0:30:42.342 --> 0:30:51.201
988
+ So that is exactly what we have seen beforehand.
989
+
990
+ 0:30:51.201 --> 0:31:00.240
991
+ If these information are not in the text itself,.
992
+
993
+ 0:31:00.320 --> 0:31:01.875
994
+ Gender as a speaker.
995
+
996
+ 0:31:02.062 --> 0:31:04.450
997
+ And how good does a system do that?
998
+
999
+ 0:31:04.784 --> 0:31:09.675
1000
+ And we'll see there's, for example, one benchmark
1001
+ on this.
1002
+
1003
+ 0:31:09.675 --> 0:31:16.062
1004
+ For example: For Arabic there is one benchmark
1005
+ on this foot: Audio because if you're now think
1006
+
1007
+ 0:31:16.062 --> 0:31:16.781
1008
+ already of the.
1009
+
1010
+ 0:31:17.157 --> 0:31:25.257
1011
+ From when we're talking about speech translation,
1012
+ it might be interesting because in the speech
1013
+
1014
+ 0:31:25.257 --> 0:31:32.176
1015
+ signal you should have a better guess on whether
1016
+ it's a male or a female speaker.
1017
+
1018
+ 0:31:32.432 --> 0:31:38.928
1019
+ So but mean current systems, mostly you can
1020
+ always add, and they will just first transcribe.
1021
+
1022
+ 0:31:42.562 --> 0:31:45.370
1023
+ Yes, so how do these benchmarks?
1024
+
1025
+ 0:31:45.305 --> 0:31:51.356
1026
+ Look like that, the first one is here.
1027
+
1028
+ 0:31:51.356 --> 0:32:02.837
1029
+ There's an occupation test where it looks
1030
+ like a simple test set because.
1031
+
1032
+ 0:32:03.023 --> 0:32:10.111
1033
+ So I've known either hurry him or pronounce
1034
+ the name for a long time.
1035
+
1036
+ 0:32:10.111 --> 0:32:13.554
1037
+ My friend works as an occupation.
1038
+
1039
+ 0:32:13.833 --> 0:32:16.771
1040
+ So that is like all sentences in that look
1041
+ like that.
1042
+
1043
+ 0:32:17.257 --> 0:32:28.576
1044
+ So in this case you haven't had the biggest
1045
+ work in here, which is friends.
1046
+
1047
+ 0:32:28.576 --> 0:32:33.342
1048
+ So your only checking later is.
1049
+
1050
+ 0:32:34.934 --> 0:32:46.981
1051
+ This can be inferred from whether it's her
1052
+ or her or her, or if it's a proper name, so
1053
+
1054
+ 0:32:46.981 --> 0:32:55.013
1055
+ can you infer it from the name, and then you
1056
+ can compare.
1057
+
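A sketch of how such template-based test sentences can be generated (the template wording follows the lecture example, but the occupation list here is a placeholder; real benchmarks pair it with occupation statistics):

TEMPLATE = "I've known {pronoun} for a long time, my friend works as {occupation}."

occupations = ["a doctor", "a nurse", "an engineer", "a teacher"]   # placeholder

def occupation_test_set():
    for occupation in occupations:
        for pronoun, gold in [("her", "female"), ("him", "male")]:
            yield {"source": TEMPLATE.format(pronoun=pronoun, occupation=occupation),
                   "gold_gender": gold, "occupation": occupation}

for example in occupation_test_set():
    print(example["gold_gender"], "->", example["source"])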
1058
+ 0:32:55.115 --> 0:33:01.744
1059
+ So is this because the job description is
1060
+ nearer to friend.
1061
+
1062
+ 0:33:01.744 --> 0:33:06.937
1063
+ Does the system get disturbed by this type
1064
+ of.
1065
+
1066
+ 0:33:08.828 --> 0:33:14.753
1067
+ And there you can then automatically assess
1068
+ yeah this type.
1069
+
1070
+ 0:33:14.774 --> 0:33:18.242
1071
+ Of course, that's what said at the beginning.
1072
+
1073
+ 0:33:18.242 --> 0:33:24.876
1074
+ You shouldn't only rely on that because if
1075
+ you only rely on it you can easily trick the
1076
+
1077
+ 0:33:24.876 --> 0:33:25.479
1078
+ system.
1079
+
1080
+ 0:33:25.479 --> 0:33:31.887
1081
+ So one type of sentence is translated, but
1082
+ of course it can give you very important.
1083
+
1084
+ 0:33:33.813 --> 0:33:35.309
1085
+ Any questions yeah.
1086
+
1087
+ 0:33:36.736 --> 0:33:44.553
1088
+ Much like the evaluation of stereotype, we
1089
+ want the system to agree with stereotypes because
1090
+
1091
+ 0:33:44.553 --> 0:33:46.570
1092
+ it increases precision.
1093
+
1094
+ 0:33:46.786 --> 0:33:47.979
1095
+ No, no, no.
1096
+
1097
+ 0:33:47.979 --> 0:33:53.149
1098
+ In this case, if we say oh yeah, he is an
1099
+ engineer.
1100
+
1101
+ 0:33:53.149 --> 0:34:01.600
1102
+ From the example, it's probably the most likely
1103
+ translation, probably in more cases.
1104
+
1105
+ 0:34:02.702 --> 0:34:08.611
1106
+ Now there is two things, so yeah yeah, so
1107
+ there is two ways of evaluating.
1108
+
1109
+ 0:34:08.611 --> 0:34:15.623
1110
+ The one thing is in this case he's using that
1111
+ he's an engineer, but there is conflicting
1112
+
1113
+ 0:34:15.623 --> 0:34:19.878
1114
+ information that in this case the engineer
1115
+ is female.
1116
+
1117
+ 0:34:20.380 --> 0:34:21.890
1118
+ So anything was.
1119
+
1120
+ 0:34:22.342 --> 0:34:29.281
1121
+ Information yes, so that is the one in the
1122
+ other case.
1123
+
1124
+ 0:34:29.281 --> 0:34:38.744
1125
+ Typically it's not evaluated in that, but
1126
+ in that time you really want it.
1127
+
1128
+ 0:34:38.898 --> 0:34:52.732
1129
+ That's why most of those cases you have evaluated
1130
+ in scenarios where you have context information.
1131
+
1132
+ 0:34:53.453 --> 0:34:58.878
1133
+ How to deal with the other thing is even more
1134
+ challenging to one case where it is the case
1135
+
1136
+ 0:34:58.878 --> 0:35:04.243
1137
+ is what I said before is when it's about the
1138
+ speaker so that the speech translation test.
1139
+
1140
+ 0:35:04.584 --> 0:35:17.305
1141
+ And there they try to look in a way that can
1142
+ you use, so use the audio also as input.
1143
+
1144
+ 0:35:18.678 --> 0:35:20.432
1145
+ Yeah.
1146
+
1147
+ 0:35:20.640 --> 0:35:30.660
1148
+ So if we have a reference where she is an
1149
+ engineer okay, are there efforts to adjust
1150
+
1151
+ 0:35:30.660 --> 0:35:37.497
1152
+ the metric so that our transmissions go into
1153
+ the correct?
1154
+
1155
+ 0:35:37.497 --> 0:35:38.676
1156
+ We don't.
1157
+
1158
+ 0:35:38.618 --> 0:35:40.389
1159
+ Only done for mean this is evaluation.
1160
+
1161
+ 0:35:40.389 --> 0:35:42.387
1162
+ You are not pushing the model for anything.
1163
+
1164
+ 0:35:43.023 --> 0:35:53.458
1165
+ But if you want to do it in training, that
1166
+ you're not doing it this way.
1167
+
1168
+ 0:35:53.458 --> 0:35:58.461
1169
+ I'm not aware of any direct model.
1170
+
1171
+ 0:35:58.638 --> 0:36:04.146
1172
+ Because you have to find out, is it known
1173
+ in this scenario or not?
1174
+
1175
+ 0:36:05.725 --> 0:36:12.622
1176
+ So at least I'm not aware of there's like
1177
+ the directive doing training try to assess
1178
+
1179
+ 0:36:12.622 --> 0:36:13.514
1180
+ more than.
1181
+
1182
+ 0:36:13.813 --> 0:36:18.518
1183
+ Mean there is data augmentation in the way
1184
+ that is done.
1185
+
1186
+ 0:36:18.518 --> 0:36:23.966
1187
+ Think we'll have that later, so what you can
1188
+ do is generate more.
1189
+
1190
+ 0:36:24.144 --> 0:36:35.355
1191
+ You can do that automatically or there's ways
1192
+ of biasing so that you can try to make your
1193
+
1194
+ 0:36:35.355 --> 0:36:36.600
1195
+ training.
1196
+
1197
+ 0:36:36.957 --> 0:36:46.228
1198
+ That's typically not done with focusing on
1199
+ scenarios where you check before or do have
1200
+
1201
+ 0:36:46.228 --> 0:36:47.614
1202
+ information.
1203
+
1204
+ 0:36:49.990 --> 0:36:58.692
1205
+ Mean, but for everyone it's not clear and
1206
+ agree with you in this scenario, the normal
1207
+
1208
+ 0:36:58.692 --> 0:37:01.222
1209
+ evaluation system where.
1210
+
1211
+ 0:37:01.341 --> 0:37:07.006
1212
+ Maybe you could say it shouldn't do always
1213
+ the same but have a distribution like a training
1214
+
1215
+ 0:37:07.006 --> 0:37:12.733
1216
+ data or something like that because otherwise
1217
+ we're amplifying but that current system can't
1218
+
1219
+ 0:37:12.733 --> 0:37:15.135
1220
+ do current systems can't predict both.
1221
+
1222
+ 0:37:15.135 --> 0:37:17.413
1223
+ That's why we see all the beginning.
1224
+
1225
+ 0:37:17.413 --> 0:37:20.862
1226
+ They have this extra interface where they
1227
+ then propose.
1228
+
1229
+ 0:37:24.784 --> 0:37:33.896
1230
+ Another thing is the WinoMT system, and
1231
+ it started from a challenge set for co-reference
1232
+
1233
+ 0:37:33.896 --> 0:37:35.084
1234
+ resolution.
1235
+
1236
+ 0:37:35.084 --> 0:37:43.502
1237
+ Co-reference resolution means we have pear
1238
+ on him and we need to find out what it's.
1239
+
1240
+ 0:37:43.823 --> 0:37:53.620
1241
+ So you have the doctor off the nurse to help
1242
+ her in the procedure, and now her does not
1243
+
1244
+ 0:37:53.620 --> 0:37:55.847
1245
+ refer to the nurse.
1246
+
1247
+ 0:37:56.556 --> 0:38:10.689
1248
+ And there you of course have the same type
1249
+ of stewardesses and the same type of buyers
1250
+
1251
+ 0:38:10.689 --> 0:38:15.237
1252
+ as the machine translation.
1253
+
1254
+ 0:38:16.316 --> 0:38:25.165
1255
+ And no think that normally yeah mean maybe
1256
+ that's also biased.
1257
+
1258
+ 0:38:27.687 --> 0:38:37.514
1259
+ No, but if you ask somebody, I guess if you
1260
+ ask somebody, then I mean syntectically it's
1261
+
1262
+ 0:38:37.514 --> 0:38:38.728
1263
+ ambiguous.
1264
+
1265
+ 0:38:38.918 --> 0:38:50.248
1266
+ If you ask somebody to help, then the horror
1267
+ has to refer to that.
1268
+
1269
+ 0:38:50.248 --> 0:38:54.983
1270
+ So it should also help the.
1271
+
1272
+ 0:38:56.396 --> 0:38:57.469
1273
+ Of the time.
1274
+
1275
+ 0:38:57.469 --> 0:39:03.906
1276
+ The doctor is female and says please have
1277
+ me in the procedure, but the other.
1278
+
1279
+ 0:39:04.904 --> 0:39:09.789
1280
+ Oh, you mean that it's helping the third person.
1281
+
1282
+ 0:39:12.192 --> 0:39:16.140
1283
+ Yeah, agree that it could also be yes.
1284
+
1285
+ 0:39:16.140 --> 0:39:19.077
1286
+ Don't know how easy that is.
1287
+
1288
+ 0:39:19.077 --> 0:39:21.102
1289
+ Only know the test.
1290
+
1291
+ 0:39:21.321 --> 0:39:31.820
1292
+ Then guess yeah, then you need a situation
1293
+ context where you know the situation, the other
1294
+
1295
+ 0:39:31.820 --> 0:39:34.589
1296
+ person having problems.
1297
+
1298
+ 0:39:36.936 --> 0:39:42.251
1299
+ Yeah no yeah that is like here when there
1300
+ is additional ambiguity in there.
1301
+
1302
+ 0:39:45.465 --> 0:39:48.395
1303
+ You see that pure text models are not always enough.
1304
+
1305
+ 0:39:48.395 --> 0:39:51.134
1306
+ How to solve it: I mean, there is a lot of work on that also.
1307
+
1308
+ 0:39:52.472 --> 0:40:00.119
1309
+ We will not cover that in the lecture, but there
1310
+ are things like multimodal machine translation
1311
+
1312
+ 0:40:00.119 --> 0:40:07.109
1313
+ where you try to add pictures or something
1314
+ like that to have more context, and then.
1315
+
1316
+ 0:40:10.370 --> 0:40:23.498
1317
+ Yeah, it starts with this, so in order to
1318
+ evaluate that what it does is that you translate
1319
+
1320
+ 0:40:23.498 --> 0:40:25.229
1321
+ with the system.
1322
+
1323
+ 0:40:25.305 --> 0:40:32.310
1324
+ It's doing stereotyping so the doctor is male
1325
+ and the nurse is female.
1326
+
1327
+ 0:40:32.492 --> 0:40:42.362
1328
+ And then you're using word alignment, and
1329
+ then you check whether this gender maps with
1330
+
1331
+ 0:40:42.362 --> 0:40:52.345
1332
+ the annotated gender there, and that is
1333
+ how you evaluate in this WinoMT setup.
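
As an editor's illustration of the evaluation idea described above (this is a hedged sketch, not the official WinoMT scripts; the helper names, the toy lexicon and the hand-made alignment are assumptions), one can align the annotated entity to a target word, look up that word's grammatical gender, and compare it with the gold label:

```python
# Hedged sketch of an alignment-based gender-accuracy check (toy example).
def grammatical_gender(target_word):
    # Assumption: a tiny lexicon standing in for a real morphological analyzer.
    lexicon = {"Arzt": "male", "Ärztin": "female",
               "Krankenpfleger": "male", "Krankenschwester": "female"}
    return lexicon.get(target_word, "unknown")

def gender_accuracy(examples):
    correct = 0
    for ex in examples:
        tgt_tokens = ex["translation"].split()
        tgt_idx = ex["alignment"][ex["entity_src_idx"]]   # source index -> target index
        predicted = grammatical_gender(tgt_tokens[tgt_idx])
        correct += predicted == ex["gold_gender"]
    return correct / len(examples)

examples = [{
    "translation": "Der Arzt bat die Krankenschwester , ihm zu helfen .",
    "alignment": {1: 1},            # English "doctor" -> German token 1 ("Arzt")
    "entity_src_idx": 1,
    "gold_gender": "female",        # in this sentence the doctor is female
}]
print(gender_accuracy(examples))    # 0.0: the stereotypical output counts as wrong
```
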
1334
+
1335
+ 0:40:52.832 --> 0:40:59.475
1336
+ I mean, as you see, you're only focusing on
1337
+ the situations where you can check, or where the gender
1338
+
1339
+ 0:40:59.475 --> 0:41:00.214
1340
+ is known.
1341
+
1342
+ 0:41:00.214 --> 0:41:06.930
1343
+ That is why for this one you don't do any evaluation,
1344
+ because the nurse can in that case be either
1345
+
1346
+ 0:41:06.930 --> 0:41:08.702
1347
+ and you cannot assess it.
1348
+
1349
+ 0:41:08.728 --> 0:41:19.112
1350
+ The benchmarks are at the moment designed
1351
+ in a way that you only evaluate things that
1352
+
1353
+ 0:41:19.112 --> 0:41:20.440
1354
+ are known.
1355
+
1356
+ 0:41:23.243 --> 0:41:25.081
1357
+ Then yeah, you can have a look.
1358
+
1359
+ 0:41:25.081 --> 0:41:28.931
1360
+ For example, here what people are looking
1361
+ is you can do the first.
1362
+
1363
+ 0:41:28.931 --> 0:41:32.149
1364
+ Oh well, the currency, how often does it do
1365
+ it correct?
1366
+
1367
+ 0:41:32.552 --> 0:41:41.551
1368
+ And there you see these numbers are a bit
1369
+ older.
1370
+
1371
+ 0:41:41.551 --> 0:41:51.835
1372
+ There's more work on that, but this is the
1373
+ first color.
1374
+
1375
+ 0:41:51.731 --> 0:42:01.311
1376
+ Because they do it like in this test, they
1377
+ do it twice, one with him and one with her.
1378
+
1379
+ 0:42:01.311 --> 0:42:04.834
1380
+ So the chance is fifty percent.
1381
+
1382
+ 0:42:05.065 --> 0:42:12.097
1383
+ Except somehow here, the one system seems
1384
+ to be quite good there that everything.
1385
+
1386
+ 0:42:13.433 --> 0:42:30.863
1387
+ What you can also do is look at the difference,
1388
+ where you need to predict female and the difference.
1389
+
1390
+ 0:42:30.850 --> 0:42:40.338
1391
+ It's more often correct on the male forms
1392
+ than on the female forms, and you see that
1393
+
1394
+ 0:42:40.338 --> 0:42:43.575
1395
+ it's except for this system.
1396
+
1397
+ 0:42:43.603 --> 0:42:53.507
1398
+ So would assume that they maybe in this one
1399
+ language did some type of method in there.
1400
+
1401
+ 0:42:55.515 --> 0:42:57.586
1402
+ If you are more often mean there is like.
1403
+
1404
+ 0:42:58.178 --> 0:43:01.764
1405
+ It's not a lot lower, there's one.
1406
+
1407
+ 0:43:01.764 --> 0:43:08.938
1408
+ I don't know why, but if you're always to
1409
+ the same then it should be.
1410
+
1411
+ 0:43:08.938 --> 0:43:14.677
1412
+ You seem to be counter intuitive, so maybe
1413
+ it's better.
1414
+
1415
+ 0:43:15.175 --> 0:43:18.629
1416
+ Don't know exactly how yes, but it's, it's
1417
+ true.
1418
+
1419
+ 0:43:19.019 --> 0:43:20.849
1420
+ Mean, there's very few cases.
1421
+
1422
+ 0:43:20.849 --> 0:43:22.740
1423
+ I also don't know for Russian.
1424
+
1425
+ 0:43:22.740 --> 0:43:27.559
1426
+ I mean, there is, I think, mainly for Russian
1427
+ where you have very low numbers.
1428
+
1429
+ 0:43:27.559 --> 0:43:30.183
1430
+ I mean, I would say like forty five or so.
1431
+
1432
+ 0:43:30.183 --> 0:43:32.989
1433
+ There can be more about renting and sampling.
1434
+
1435
+ 0:43:32.989 --> 0:43:37.321
1436
+ I don't know if they have even more gender
1437
+ or if they have a new tool.
1438
+
1439
+ 0:43:37.321 --> 0:43:38.419
1440
+ I don't think so.
1441
+
1442
+ 0:43:40.040 --> 0:43:46.901
1443
+ Then you have typically even a stronger bias
1444
+ here where you not do the differentiation between
1445
+
1446
+ 0:43:46.901 --> 0:43:53.185
1447
+ how often is it correct for me and the female,
1448
+ but you are distinguishing between the.
1449
+
1450
+ 0:43:53.553 --> 0:44:00.503
1451
+ So you're here, for you can check for each
1452
+ occupation, which is the most important.
1453
+
1454
+ 0:44:00.440 --> 0:44:06.182
1455
+ A comment one based on statistics, and then
1456
+ you take that on the one side and the anti
1457
+
1458
+ 0:44:06.182 --> 0:44:12.188
1459
+ stereotypically on the other side, and you
1460
+ see that not in all cases but in a lot of cases
1461
+
1462
+ 0:44:12.188 --> 0:44:16.081
1463
+ that null probabilities are even higher than
1464
+ on the other.
1465
+
1466
+ 0:44:21.061 --> 0:44:24.595
1467
+ Ah, I'm telling you there's something.
1468
+
1469
+ 0:44:28.668 --> 0:44:32.850
1470
+ But it has to be for a doctor.
1471
+
1472
+ 0:44:32.850 --> 0:44:39.594
1473
+ For example, for a doctor there three don't
1474
+ know.
1475
+
1476
+ 0:44:40.780 --> 0:44:44.275
1477
+ Yeah, but guess here it's mainly imminent
1478
+ job description.
1479
+
1480
+ 0:44:44.275 --> 0:44:45.104
1481
+ So yeah, but.
1482
+
1483
+ 0:44:50.050 --> 0:45:01.145
1484
+ And then there is the Arabic Parallel Gender
1485
+ corpus where it is about more assessing how
1486
+
1487
+ 0:45:01.145 --> 0:45:03.289
1488
+ strong a singer.
1489
+
1490
+ 0:45:03.483 --> 0:45:09.445
1491
+ How that is done: they use the OpenSubtitles
1492
+
1493
+ 0:45:09.445 --> 0:45:18.687
1494
+ corpus, which is a corpus of subtitles generated
1495
+ by volunteers.
1496
+
1497
+ 0:45:18.558 --> 0:45:23.426
1498
+ For words like "I", "me", "myself"
1499
+
1500
+ 0:45:23.303 --> 0:45:30.670
1501
+ And mine, and then they annotated the Arabic
1502
+ sentences, whether "I" here refers to a female
1503
+
1504
+ 0:45:30.670 --> 0:45:38.198
1505
+ and masculine, or whether it's ambiguous, and
1506
+ then from the male and female one they generate
1507
+
1508
+ 0:45:38.198 --> 0:45:40.040
1509
+ types of translations.
1510
+
1511
+ 0:45:43.703 --> 0:45:51.921
1512
+ And then a bit more different test sets as
1513
+ the last one, that is referred to as the MuST-SHE
1514
+
1515
+ 0:45:52.172 --> 0:45:57.926
1516
+ Corpus, which is based on these lectures.
1517
+
1518
+ 0:45:57.926 --> 0:46:05.462
1519
+ In general, this lecture is very important
1520
+ because it.
1521
+
1522
+ 0:46:05.765 --> 0:46:22.293
1523
+ And here is also interesting because you also
1524
+ have the audio signal, and it's done in the
1525
+
1526
+ 0:46:22.293 --> 0:46:23.564
1527
+ worst.
1528
+
1529
+ 0:46:23.763 --> 0:46:27.740
1530
+ In the first case is where it can only be
1531
+ determined based on the speaker.
1532
+
1533
+ 0:46:27.968 --> 0:46:30.293
1534
+ So something like "I am a good speaker".
1535
+
1536
+ 0:46:30.430 --> 0:46:32.377
1537
+ You cannot do that correctly.
1538
+
1539
+ 0:46:32.652 --> 0:46:36.970
1540
+ However, if you would have the audio signal
1541
+ you should have a much better guess.
1542
+
1543
+ 0:46:37.257 --> 0:46:47.812
1544
+ So it wasn't evaluated, especially machine
1545
+ translation and speech translation system,
1546
+
1547
+ 0:46:47.812 --> 0:46:53.335
1548
+ which take this into account or, of course,.
1549
+
1550
+ 0:46:57.697 --> 0:47:04.265
1551
+ The second thing is where you can do it based
1552
+ on the context.
1553
+
1554
+ 0:47:04.265 --> 0:47:08.714
1555
+ In this case we are not using artificial.
1556
+
1557
+ 0:47:11.011 --> 0:47:15.550
1558
+ It comes from the real data, so it's
1559
+ not like artificially created data.
1560
+
1561
+ 0:47:15.815 --> 0:47:20.939
1562
+ Of course, in a lot more work you have to
1563
+ somehow find these in the corpus and use them
1564
+
1565
+ 0:47:20.939 --> 0:47:21.579
1566
+ as a test.
1567
+
1568
+ 0:47:21.601 --> 0:47:27.594
1569
+ Is something she got together with two of
1570
+ her dearest friends, this older woman, and
1571
+
1572
+ 0:47:27.594 --> 0:47:34.152
1573
+ then, of course, "her friends" we can get from
1574
+ the context, but it might be that some systems
1575
+
1576
+ 0:47:34.152 --> 0:47:36.126
1577
+ ignore that that should be.
1578
+
1579
+ 0:47:36.256 --> 0:47:43.434
1580
+ So you have two test sets in there, two types
1581
+ of benchmarks, and you want to determine which
1582
+
1583
+ 0:47:43.434 --> 0:47:43.820
1584
+ one.
1585
+
1586
+ 0:47:47.787 --> 0:47:55.801
1587
+ Yes, this is how we can evaluate it, so the
1588
+ next question is how can we improve our systems
1589
+
1590
+ 0:47:55.801 --> 0:48:03.728
1591
+ because that's normally how we do evaluation
1592
+ and why we do evaluation so before we go into
1593
+
1594
+ 0:48:03.728 --> 0:48:04.251
1595
+ that?
1596
+
1597
+ 0:48:08.508 --> 0:48:22.685
1598
+ One idea is to do what is referred to as modeling,
1599
+ so the idea is somehow change the model in
1600
+
1601
+ 0:48:22.685 --> 0:48:24.495
1602
+ a way that.
1603
+
1604
+ 0:48:24.965 --> 0:48:38.271
1605
+ And yes, one idea is, of course, if we are
1606
+ giving him more information, the system doesn't
1607
+
1608
+ 0:48:38.271 --> 0:48:44.850
1609
+ need to do a guess without this information.
1610
+
1611
+ 0:48:44.724 --> 0:48:47.253
1612
+ In order to disambiguate and avoid the bias.
1613
+
1614
+ 0:48:47.707 --> 0:48:59.746
1615
+ The first thing is you can do that on the
1616
+ sentence level, for example, especially if
1617
+
1618
+ 0:48:59.746 --> 0:49:03.004
1619
+ you have the speakers.
1620
+
1621
+ 0:49:03.063 --> 0:49:12.518
1622
+ You can annotate the sentence with whether
1623
+ a speaker is male or female, and then you
1624
+
1625
+ 0:49:12.518 --> 0:49:25.998
1626
+ can: Here we're seeing one thing which is very
1627
+ successful in neural machine translation and
1628
+
1629
+ 0:49:25.998 --> 0:49:30.759
1630
+ other kinds of neural networks.
1631
+
1632
+ 0:49:31.711 --> 0:49:39.546
1633
+ However, in neural machine translation, since
1634
+ we have no longer the strong correlation between
1635
+
1636
+ 0:49:39.546 --> 0:49:47.043
1637
+ input and output, the nice thing is you can
1638
+ normally put everything into your input, and
1639
+
1640
+ 0:49:47.043 --> 0:49:50.834
1641
+ if you have enough data, it's well balanced.
1642
+
1643
+ 0:49:51.151 --> 0:50:00.608
1644
+ So how you can do it here is you can add the
1645
+ token here saying female or male if the speaker
1646
+
1647
+ 0:50:00.608 --> 0:50:01.523
1648
+ is male.
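
A minimal sketch of this sentence-level tagging (the exact token strings are an assumption; they only have to be used consistently at training and at test time):

```python
# Prepend a pseudo-token carrying the speaker's gender to the source sentence.
def add_speaker_tag(source_sentence, speaker_gender):
    tag = {"female": "<speaker:female>", "male": "<speaker:male>"}.get(speaker_gender)
    return f"{tag} {source_sentence}" if tag else source_sentence

print(add_speaker_tag("I am a good speaker.", "female"))
# <speaker:female> I am a good speaker.
```
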
1649
+
1650
+ 0:50:01.881 --> 0:50:07.195
1651
+ So, of course, this is no longer for human
1652
+ correct translation.
1653
+
1654
+ 0:50:07.195 --> 0:50:09.852
1655
+ It's like female Madam because.
1656
+
1657
+ 0:50:10.090 --> 0:50:22.951
1658
+ If you are doing the same thing then the translation
1659
+ would not be to translate female but can use
1660
+
1661
+ 0:50:22.951 --> 0:50:25.576
1662
+ it to disambiguate.
1663
+
1664
+ 0:50:25.865 --> 0:50:43.573
1665
+ And so this type of tagging is a very commonly
1666
+ used method in order to add more information.
1667
+
1668
+ 0:50:47.107 --> 0:50:54.047
1669
+ So this is first of all a very good thing,
1670
+ a very easy one.
1671
+
1672
+ 0:50:54.047 --> 0:50:57.633
1673
+ You don't have to change your.
1674
+
1675
+ 0:50:58.018 --> 0:51:04.581
1676
+ For example, has also been done if you think
1677
+ about formality in German.
1678
+
1679
+ 0:51:04.581 --> 0:51:11.393
1680
+ Whether you have to produce or, you can: We'll
1681
+ see it on Thursday.
1682
+
1683
+ 0:51:11.393 --> 0:51:19.628
1684
+ It's a very common approach for domains, so
1685
+ you put in the domain beforehand.
1686
+
1687
+ 0:51:19.628 --> 0:51:24.589
1688
+ This is from a Twitter or something like that.
1689
+
1690
+ 0:51:24.904 --> 0:51:36.239
1691
+ Of course, it only learns it if it has seen
1692
+ it and it dees them out, but in this case you
1693
+
1694
+ 0:51:36.239 --> 0:51:38.884
1695
+ don't need an equal.
1696
+
1697
+ 0:51:39.159 --> 0:51:42.593
1698
+ But however, it's still like challenging to
1699
+ get this availability.
1700
+
1701
+ 0:51:42.983 --> 0:51:55.300
1702
+ If you would do that on the first of all,
1703
+ of course, it only works if you really have
1704
+
1705
+ 0:51:55.300 --> 0:52:02.605
1706
+ data from speaking because otherwise it's unclear.
1707
+
1708
+ 0:52:02.642 --> 0:52:09.816
1709
+ You would only have the text and you would
1710
+ not easily see whether it is the male or the
1711
+
1712
+ 0:52:09.816 --> 0:52:14.895
1713
+ female speaker because this information has
1714
+ been removed from.
1715
+
1716
+ 0:52:16.456 --> 0:52:18.745
1717
+ Does anybody of you have an idea of how it
1718
+ fits?
1719
+
1720
+ 0:52:20.000 --> 0:52:25.480
1721
+ Manage that and still get the data of whether
1722
+ it's a male or a female speaking.
1723
+
1724
+ 0:52:32.152 --> 0:52:34.270
1725
+ Can do a small trick.
1726
+
1727
+ 0:52:34.270 --> 0:52:37.834
1728
+ We can just look on the target side.
1729
+
1730
+ 0:52:37.937 --> 0:52:43.573
1731
+ Mean this is, of course, only important if
1732
+ in the target side this is the case.
1733
+
1734
+ 0:52:44.004 --> 0:52:50.882
1735
+ So for your training data you can annotate
1736
+ it based on your target side; in German you
1737
+
1738
+ 0:52:50.882 --> 0:52:51.362
1739
+ know.
1740
+
1741
+ 0:52:51.362 --> 0:52:58.400
1742
+ In German you don't know but in Spanish for
1743
+ example you know because different and then
1744
+
1745
+ 0:52:58.400 --> 0:53:00.400
1746
+ you can use grammatical.
1747
+
1748
+ 0:53:00.700 --> 0:53:10.964
1749
+ Of course, the test day would still need to
1750
+ do that more interface decision.
1751
+
1752
+ 0:53:13.954 --> 0:53:18.829
1753
+ And: You can, of course, do it even more advanced.
1754
+
1755
+ 0:53:18.898 --> 0:53:30.659
1756
+ You can even try to add these information
1757
+ to each word, so you're not doing it for the
1758
+
1759
+ 0:53:30.659 --> 0:53:32.687
1760
+ full sentence.
1761
+
1762
+ 0:53:32.572 --> 0:53:42.129
1763
+ If it's unknown, if it's female or if it's
1764
+ male, you know word alignment so you can't
1765
+
1766
+ 0:53:42.129 --> 0:53:42.573
1767
+ do.
1768
+
1769
+ 0:53:42.502 --> 0:53:55.919
1770
+ Here then you can do a word alignment, which
1771
+ is of course not always perfect, but roughly
1772
+
1773
+ 0:53:55.919 --> 0:53:59.348
1774
+ then you can annotate.
1775
+
1776
+ 0:54:01.401 --> 0:54:14.165
1777
+ Now you have these type of inputs where you
1778
+ have one information per word, but on the one
1779
+
1780
+ 0:54:14.165 --> 0:54:16.718
1781
+ end you have the.
1782
+
1783
+ 0:54:17.517 --> 0:54:26.019
1784
+ This has been used before in other scenarios,
1785
+ so you might not put in the gender, but in
1786
+
1787
+ 0:54:26.019 --> 0:54:29.745
1788
+ general this can be other information.
1789
+
1790
+ 0:54:30.090 --> 0:54:39.981
1791
+ And people refer to that or have used that
1792
+ as a factored translation model, so what you
1793
+
1794
+ 0:54:39.981 --> 0:54:42.454
1795
+ may do is you factor.
1796
+
1797
+ 0:54:42.742 --> 0:54:45.612
1798
+ You have the word itself.
1799
+
1800
+ 0:54:45.612 --> 0:54:48.591
1801
+ You might have the gender.
1802
+
1803
+ 0:54:48.591 --> 0:54:55.986
1804
+ You could have more information like, I don't
1805
+ know, the part of speech.
1806
+
1807
+ 0:54:56.316 --> 0:54:58.564
1808
+ And then you have an embedding for each of
1809
+ them.
1810
+
1811
+ 0:54:59.199 --> 0:55:03.599
1812
+ And you concatenate them, and then you have
1813
+ this concatenated embedding.
1814
+
1815
+ 0:55:03.563 --> 0:55:09.947
1816
+ Which says okay, this is a female plumber
1817
+ or a male plumber or so on.
1818
+
1819
+ 0:55:09.947 --> 0:55:18.064
1820
+ This has additional information and then you
1821
+ can train this factored model where you have
1822
+
1823
+ 0:55:18.064 --> 0:55:22.533
1824
+ the ability to give the model extra information.
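
A hedged sketch of the factored-input idea described above (the dimensions, vocabularies and factor names are made up): each word has one embedding per factor, and the factor embeddings are concatenated into the input vector.

```python
import numpy as np

# One lookup table per factor, concatenated into a single input vector.
rng = np.random.default_rng(0)
word_emb   = {"plumber": rng.normal(size=8), "doctor": rng.normal(size=8)}
gender_emb = {"male": rng.normal(size=2), "female": rng.normal(size=2),
              "unknown": rng.normal(size=2)}
pos_emb    = {"NOUN": rng.normal(size=2), "VERB": rng.normal(size=2)}

def embed(word, gender, pos):
    return np.concatenate([word_emb[word], gender_emb[gender], pos_emb[pos]])

x = embed("plumber", "female", "NOUN")
print(x.shape)  # (12,) -> fed into the encoder instead of the plain word embedding
```
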
1825
+
1826
+ 0:55:23.263 --> 0:55:35.702
1827
+ And of course now if you are training this
1828
+ way directly you always need to have this information.
1829
+
1830
+ 0:55:36.576 --> 0:55:45.396
1831
+ So that might not be the best way if you want
1832
+ to use a translation system and sometimes don't
1833
+
1834
+ 0:55:45.396 --> 0:55:45.959
1835
+ have.
1836
+
1837
+ 0:55:46.866 --> 0:55:57.987
1838
+ So any idea of how you can train it or what
1839
+ machine learning technique you can use to deal
1840
+
1841
+ 0:55:57.987 --> 0:55:58.720
1842
+ with.
1843
+
1844
+ 0:56:03.263 --> 0:56:07.475
1845
+ Mainly despite it already, many of your things.
1846
+
1847
+ 0:56:14.154 --> 0:56:21.521
1848
+ Drop out so you sometimes put information
1849
+ in there and then you can use dropouts to inputs.
1850
+
1851
+ 0:56:21.861 --> 0:56:27.599
1852
+ Is sometimes put in this information in there,
1853
+ sometimes not, and the system is then able
1854
+
1855
+ 0:56:27.599 --> 0:56:28.874
1856
+ to deal with those.
1857
+
1858
+ 0:56:28.874 --> 0:56:34.803
1859
+ If it doesn't have the information, it's doing
1860
+ some of the best it can do, but if it has the
1861
+
1862
+ 0:56:34.803 --> 0:56:39.202
1863
+ information, it can use the information and
1864
+ maybe do a more grounded prediction.
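
A minimal sketch of the dropout-on-the-factor trick just mentioned (the "unknown" value and the drop probability are assumptions): during training the gender information is sometimes withheld, so the model also learns to cope without it.

```python
import random

# Randomly drop the gender factor during training so the model also works
# when the information is missing at test time.
def maybe_drop_gender(gender, drop_prob=0.3):
    return "unknown" if random.random() < drop_prob else gender

random.seed(0)
print([maybe_drop_gender("female") for _ in range(5)])
```
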
1865
+
1866
+ 0:56:46.766 --> 0:56:52.831
1867
+ So then there is, of course, more ways to
1868
+ try to do model debiasing.
1869
+
1870
+ 0:56:52.993 --> 0:57:01.690
1871
+ We will only want to mention here because
1872
+ you'll have a full lecture on that next week
1873
+
1874
+ 0:57:01.690 --> 0:57:08.188
1875
+ and that is referred to as context-based
1876
+ machine translation.
1877
+
1878
+ 0:57:08.728 --> 0:57:10.397
1879
+ Good, and in this other ones, but.
1880
+
1881
+ 0:57:10.750 --> 0:57:16.830
1882
+ If you translate several sentences well, of
1883
+ course, there are more situations where you
1884
+
1885
+ 0:57:16.830 --> 0:57:17.866
1886
+ can disambiguate.
1887
+
1888
+ 0:57:18.118 --> 0:57:23.996
1889
+ Because it might be that the information is
1890
+ not in the current sentence, but it's in the
1891
+
1892
+ 0:57:23.996 --> 0:57:25.911
1893
+ previous sentence or before.
1894
+
1895
+ 0:57:26.967 --> 0:57:33.124
1896
+ If you have the mean with the speaker maybe
1897
+ not, but if it's referring to, you can core
1898
+
1899
+ 0:57:33.124 --> 0:57:33.963
1900
+ references.
1901
+
1902
+ 0:57:34.394 --> 0:57:40.185
1903
+ They are often referring to things in the
1904
+ previous sentence so you can use them in order
1905
+
1906
+ 0:57:40.185 --> 0:57:44.068
1907
+ to: And that can be done basically and very
1908
+ easy.
1909
+
1910
+ 0:57:44.068 --> 0:57:47.437
1911
+ You'll see more advanced options, but the
1912
+ main.
1913
+
1914
+ 0:57:48.108 --> 0:57:58.516
1915
+ Mean, no machine translation is a sequence
1916
+ to sequence model, which can use any input
1917
+
1918
+ 0:57:58.516 --> 0:58:02.993
1919
+ sequence to output sequence mapping.
1920
+
1921
+ 0:58:02.993 --> 0:58:04.325
1922
+ So now at.
1923
+
1924
+ 0:58:04.484 --> 0:58:11.281
1925
+ So then you can do, for example, five to five
1926
+ translations, or also five to one, or so there's.
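
One very simple way to realize this multi-sentence input, sketched here as an editor's illustration (the separator token and window size are assumptions, not the setup used in the lecture): concatenate the previous sentences with a separator and translate the whole block.

```python
# Hedged sketch of simple context-aware input via concatenation.
SEP = "<sep>"  # assumed separator token; must match what the model was trained with

def build_context_input(previous, current, window=2):
    context = previous[-window:]
    return f" {SEP} ".join(context + [current])

prev = ["I met the doctor yesterday.", "She was very helpful."]
print(build_context_input(prev, "I will see her again tomorrow."))
```
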
1927
+
1928
+ 0:58:11.811 --> 0:58:19.211
1929
+ This is not a method like only dedicated to
1930
+ debiasing, of course, but the hope is:
1931
+
1932
+ 0:58:19.139 --> 0:58:25.534
1933
+ If you're using this because I mean bias often,
1934
+ we have seen that it rises in situations where
1935
+
1936
+ 0:58:25.534 --> 0:58:27.756
1937
+ we're not having enough context.
1938
+
1939
+ 0:58:27.756 --> 0:58:32.940
1940
+ So the idea is if we generally increase our
1941
+ context, it will also help this.
1942
+
1943
+ 0:58:32.932 --> 0:58:42.378
1944
+ Of course, it will help other situations where
1945
+ you need context to disambiguate.
1946
+
1947
+ 0:58:43.603 --> 0:58:45.768
1948
+ Get There If You're Saying I'm Going to the
1949
+ Bank.
1950
+
1951
+ 0:58:46.286 --> 0:58:54.761
1952
+ It's not directly from this sentence clear
1953
+ whether it's the finance institute or the bank
1954
+
1955
+ 0:58:54.761 --> 0:58:59.093
1956
+ for sitting, but maybe if you say afterward,.
1957
+
1958
+ 0:59:02.322 --> 0:59:11.258
1959
+ And then there is in generally a very large
1960
+ amount of work on debiasing the word embeddings.
1961
+
1962
+ 0:59:11.258 --> 0:59:20.097
1963
+ So the one I hear like, I mean, I think that
1964
+ partly comes from the fact that like a first.
1965
+
1966
+ 0:59:21.041 --> 0:59:26.925
1967
+ Or that first research was done often on inspecting
1968
+ the word embeddings and seeing whether they
1969
+
1970
+ 0:59:26.925 --> 0:59:32.503
1971
+ are biased or not, and people found out how
1972
+ there is some bias in there, and then the idea
1973
+
1974
+ 0:59:32.503 --> 0:59:38.326
1975
+ is oh, if you remove them from the word embedded
1976
+ in already, then maybe your system later will
1977
+
1978
+ 0:59:38.326 --> 0:59:39.981
1979
+ not have that strong of a.
1980
+
1981
+ 0:59:40.520 --> 0:59:44.825
1982
+ So how can that work?
1983
+
1984
+ 0:59:44.825 --> 0:59:56.369
1985
+ Or like maybe first, how do words encounter
1986
+ bias in there?
1987
+
1988
+ 0:59:56.369 --> 0:59:57.152
1989
+ So.
1990
+
1991
+ 0:59:57.137 --> 1:00:05.555
1992
+ So you can look at the word embedding, and
1993
+ then you can compare the distance of the word
1994
+
1995
+ 1:00:05.555 --> 1:00:11.053
1996
+ compared: And there's like interesting findings.
1997
+
1998
+ 1:00:11.053 --> 1:00:18.284
1999
+ For example, you have the difference in occupation
2000
+ and how similar.
2001
+
2002
+ 1:00:18.678 --> 1:00:33.068
2003
+ And of course it's not a perfect correlation,
2004
+ but you see some type of correlation: jobs
2005
+
2006
+ 1:00:33.068 --> 1:00:37.919
2007
+ which have a high occupation.
2008
+
2009
+ 1:00:37.797 --> 1:00:41.387
2010
+ They also are more similar to the word what
2011
+ we're going to be talking about.
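
A hedged sketch of how such a comparison can be computed (the tiny 2-d vectors are made-up stand-ins for real word embeddings): measure the cosine similarity of occupation words to a gender direction such as "he" minus "she".

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {"he": np.array([1.0, 0.1]), "she": np.array([-1.0, 0.1]),
       "engineer": np.array([0.7, 0.5]), "nurse": np.array([-0.6, 0.6])}
gender_dir = emb["he"] - emb["she"]

for job in ("engineer", "nurse"):
    print(job, round(cos(emb[job], gender_dir), 2))
# In this toy example "engineer" leans towards "he" and "nurse" towards "she".
```
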
2012
+
2013
+ 1:00:43.023 --> 1:00:50.682
2014
+ Maybe a secretary is also a bit difficult,
2015
+ but because yeah maybe it's more often.
2016
+
2017
+ 1:00:50.610 --> 1:00:52.438
2018
+ Done in general by by women.
2019
+
2020
+ 1:00:52.438 --> 1:00:58.237
2021
+ However, there is a secretary like the Secretary
2022
+ of State or so, the German minister, which
2023
+
2024
+ 1:00:58.237 --> 1:01:03.406
2025
+ I of course know that many so in the statistics
2026
+ they are not counting that often.
2027
+
2028
+ 1:01:03.543 --> 1:01:11.576
2029
+ But in data they of course cook quite often,
2030
+ so there's different ways of different meanings.
2031
+
2032
+ 1:01:14.154 --> 1:01:23.307
2033
+ So how can you not try to remove this type
2034
+ of bias?
2035
+
2036
+ 1:01:23.307 --> 1:01:32.988
2037
+ One way is the idea of hard-debiasing the
2038
+ embeddings.
2039
+
2040
+ 1:01:33.113 --> 1:01:39.354
2041
+ So if you remember on word embeddings think
2042
+ we have this image that you can do the difference
2043
+
2044
+ 1:01:39.354 --> 1:01:44.931
2045
+ between man and woman and add this difference
2046
+ to king and then you end up near queen.
2047
+
2048
+ 1:01:45.865 --> 1:01:57.886
2049
+ So here's the idea we want to remove this
2050
+ gender information from some things which should
2051
+
2052
+ 1:01:57.886 --> 1:02:00.132
2053
+ not have gender.
2054
+
2055
+ 1:02:00.120 --> 1:02:01.386
2056
+ The word engineer.
2057
+
2058
+ 1:02:01.386 --> 1:02:06.853
2059
+ There is no information about the gender in
2060
+ that, so you should remove this type.
2061
+
2062
+ 1:02:07.347 --> 1:02:16.772
2063
+ Of course, you first need to find out where
2064
+ this information is, and you can.
2065
+
2066
+ 1:02:17.037 --> 1:02:23.603
2067
+ However, normally if you do the difference
2068
+ like the subspace by only one example, it's
2069
+
2070
+ 1:02:23.603 --> 1:02:24.659
2071
+ not the best.
2072
+
2073
+ 1:02:24.924 --> 1:02:31.446
2074
+ So you can do the same thing for things like
2075
+ brother and sister, mum and dad, and then you
2076
+
2077
+ 1:02:31.446 --> 1:02:38.398
2078
+ can somehow take the average of these differences
2079
+ saying this is a vector which maps a male from
2080
+
2081
+ 1:02:38.398 --> 1:02:39.831
2082
+ to the female form.
2083
+
2084
+ 1:02:40.660 --> 1:02:50.455
2085
+ And then you can try to neutralize this gender
2086
+ information on this dimension.
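
A hedged sketch of this neutralize step, in the spirit of the hard-debiasing approach described here (toy 3-d vectors, not real embeddings; estimating the direction from only two pairs is purely illustrative):

```python
import numpy as np

# Estimate a gender direction from averaged difference vectors and remove the
# projection of gender-neutral words onto it.
pairs = [(np.array([ 1.0, 0.2, 0.0]), np.array([-1.0, 0.2, 0.0])),   # he / she
         (np.array([ 0.9, 0.1, 0.3]), np.array([-0.8, 0.1, 0.3]))]   # man / woman
g = np.mean([a - b for a, b in pairs], axis=0)
g = g / np.linalg.norm(g)                      # unit gender direction

def neutralize(v, g):
    return v - (v @ g) * g                     # remove the gender component

engineer = np.array([0.6, 0.4, 0.2])
print(neutralize(engineer, g) @ g)             # ~0.0 after debiasing
```
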
2087
+
2088
+ 1:02:50.490 --> 1:02:57.951
2089
+ You can find its subspace or dimension.
2090
+
2091
+ 1:02:57.951 --> 1:03:08.882
2092
+ It would be a line, but now this is dimensional,
2093
+ and then you.
2094
+
2095
+ 1:03:08.728 --> 1:03:13.104
2096
+ Representation: Where you remove this type
2097
+ of information from the embedding.
2098
+
2099
+ 1:03:15.595 --> 1:03:18.178
2100
+ This is, of course, quite strong of the questions.
2101
+
2102
+ 1:03:18.178 --> 1:03:19.090
2103
+ How good does it?
2104
+
2105
+ 1:03:19.090 --> 1:03:20.711
2106
+ Thanks tell them for one other.
2107
+
2108
+ 1:03:20.880 --> 1:03:28.256
2109
+ But it's an idea we are trying to after learning
2110
+ before we are using the word embeddings for
2111
+
2112
+ 1:03:28.256 --> 1:03:29.940
2113
+ machine translation.
2114
+
2115
+ 1:03:29.940 --> 1:03:37.315
2116
+ We are trying to remove the gender information
2117
+ from the jobs and then have a representation
2118
+
2119
+ 1:03:37.315 --> 1:03:38.678
2120
+ which hopefully.
2121
+
2122
+ 1:03:40.240 --> 1:03:45.047
2123
+ A similar idea is the one of gender-neutral
2124
+ GloVe.
2125
+
2126
+ 1:03:45.047 --> 1:03:50.248
2127
+ Glove is another technique to learn word embeddings.
2128
+
2129
+ 1:03:50.750 --> 1:03:52.870
2130
+ Think we discussed one shortly.
2131
+
2132
+ 1:03:52.870 --> 1:03:56.182
2133
+ It was word2vec, which was one of the first
2134
+ one.
2135
+
2136
+ 1:03:56.456 --> 1:04:04.383
2137
+ But there are other of course methods how
2138
+ you can train word embeddings and glove as
2139
+
2140
+ 1:04:04.383 --> 1:04:04.849
2141
+ one.
2142
+
2143
+ 1:04:04.849 --> 1:04:07.460
2144
+ The idea is we're training.
2145
+
2146
+ 1:04:07.747 --> 1:04:19.007
2147
+ At least this is somehow a bit separated,
2148
+ so where you have part of the vector is gender
2149
+
2150
+ 1:04:19.007 --> 1:04:20.146
2151
+ neutral.
2152
+
2153
+ 1:04:20.300 --> 1:04:29.247
2154
+ What you need therefore is three sets of words,
2155
+ so you have male words, female words, and neutral words.
2156
+
2157
+ 1:04:29.769 --> 1:04:39.071
2158
+ And then you're trying to learn some type
2159
+ of vector where some dimensions are not.
2160
+
2161
+ 1:04:39.179 --> 1:04:51.997
2162
+ So the idea is can learn a representation
2163
+ where at least know that this part is gender
2164
+
2165
+ 1:04:51.997 --> 1:04:56.123
2166
+ neutral and the other part.
2167
+
2168
+ 1:05:00.760 --> 1:05:03.793
2169
+ How can we do that?
2170
+
2171
+ 1:05:03.793 --> 1:05:12.435
2172
+ How can we change the system to learn anything
2173
+ specific?
2174
+
2175
+ 1:05:12.435 --> 1:05:20.472
2176
+ Nearly in all cases this works by the loss
2177
+ function.
2178
+
2179
+ 1:05:20.520 --> 1:05:26.206
2180
+ And that is more a general approach in machine
2181
+ translation.
2182
+
2183
+ 1:05:26.206 --> 1:05:30.565
2184
+ The general loss function is we are learning.
2185
+
2186
+ 1:05:31.111 --> 1:05:33.842
2187
+ Here is the same idea.
2188
+
2189
+ 1:05:33.842 --> 1:05:44.412
2190
+ You have the general loss function in order
2191
+ to learn good embeddings and then you try to
2192
+
2193
+ 1:05:44.412 --> 1:05:48.687
2194
+ introduce additional loss function.
2195
+
2196
+ 1:05:48.969 --> 1:05:58.213
2197
+ Yes, I think yes, yes, that's the solution,
2198
+ and how you make sure that if I have training
2199
+
2200
+ 1:05:58.213 --> 1:06:07.149
2201
+ data where all nurses are female, how do you make sure
2202
+ that the algorithm puts it into neutral?
2203
+
2204
+ 1:06:07.747 --> 1:06:12.448
2205
+ And you need, so this is like for only the
2206
+ first learning of word embeddings.
2207
+
2208
+ 1:06:12.448 --> 1:06:18.053
2209
+ Then the idea is if you have word embeddings
2210
+ where the gender is separate and then you train
2211
+
2212
+ 1:06:18.053 --> 1:06:23.718
2213
+ on top of that machine translation where you
2214
+ don't change the embeddings, it should hopefully
2215
+
2216
+ 1:06:23.718 --> 1:06:25.225
2217
+ be less and less biased.
2218
+
2219
+ 1:06:25.865 --> 1:06:33.465
2220
+ And in order to train that yes you need additional
2221
+ information so these information need to be
2222
+
2223
+ 1:06:33.465 --> 1:06:40.904
2224
+ hence defined and they can't be general so
2225
+ you need to have a list of these are male persons
2226
+
2227
+ 1:06:40.904 --> 1:06:44.744
2228
+ or males these are nouns for females and these.
2229
+
2230
+ 1:06:49.429 --> 1:06:52.575
2231
+ So the first step, of course, we still want
2232
+ to have good word embeddings.
2233
+
2234
+ 1:06:54.314 --> 1:07:04.100
2235
+ So you have the normal objective function
2236
+ of the word embedding.
2237
+
2238
+ 1:07:04.100 --> 1:07:09.519
2239
+ It's something like the similarity.
2240
+
2241
+ 1:07:09.849 --> 1:07:19.751
2242
+ How it's exactly derived is not that important
2243
+ because we're not interested in GloVe itself,
2244
+
2245
+ 1:07:19.751 --> 1:07:23.195
2246
+ but you have any loss function.
2247
+
2248
+ 1:07:23.195 --> 1:07:26.854
2249
+ Of course, you have to keep that.
2250
+
2251
+ 1:07:27.167 --> 1:07:37.481
2252
+ And then there are three more loss functions
2253
+ that you can add: So the one is you take the
2254
+
2255
+ 1:07:37.481 --> 1:07:51.341
2256
+ average value of all the male words and the
2257
+ average word embedding of all the female words.
2258
+
2259
+ 1:07:51.731 --> 1:08:00.066
2260
+ So the good thing about this is we don't always
2261
+ need to have for one word the male and the
2262
+
2263
+ 1:08:00.066 --> 1:08:05.837
2264
+ female counterpart, so it's only like we have a
2265
+ set of male words.
2266
+
2267
+ 1:08:06.946 --> 1:08:21.719
2268
+ So this is just saying yeah, we want these
2269
+ two should be somehow similar to each other.
2270
+
2271
+ 1:08:21.719 --> 1:08:25.413
2272
+ It shouldn't be that.
2273
+
2274
+ 1:08:30.330 --> 1:08:40.081
2275
+ Should be the other one, or think this should
2276
+ be it.
2277
+
2278
+ 1:08:40.081 --> 1:08:45.969
2279
+ This is agenda, the average of.
2280
+
2281
+ 1:08:45.945 --> 1:09:01.206
2282
+ The average should be the same, but if you're
2283
+ looking at the female should be at the other.
2284
+
2285
+ 1:09:01.681 --> 1:09:06.959
2286
+ This is like on these dimensions, the male
2287
+ should be on the one and the female on the
2288
+
2289
+ 1:09:06.959 --> 1:09:07.388
2290
+ other.
2291
+
2292
+ 1:09:07.627 --> 1:09:16.123
2293
+ The same yeah, this gender information should
2294
+ be there, so you're pushing all the males to
2295
+
2296
+ 1:09:16.123 --> 1:09:17.150
2297
+ the other.
2298
+
2299
+ 1:09:21.541 --> 1:09:23.680
2300
+ Then their words should be.
2301
+
2302
+ 1:09:23.680 --> 1:09:30.403
2303
+ If you have that you see the neutral words,
2304
+ they should be in the middle of between the
2305
+
2306
+ 1:09:30.403 --> 1:09:32.008
2307
+ male and the female.
2308
+
2309
+ 1:09:32.012 --> 1:09:48.261
2310
+ So you say is the middle point between all
2311
+ male and female words and just somehow putting
2312
+
2313
+ 1:09:48.261 --> 1:09:51.691
2314
+ the neutral words.
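
As an editor's sketch of these three extra loss terms on top of the usual GloVe objective (the convention that the last vector dimension is the "gender part", and the target values of +1 and -1, are assumptions and not the exact published formulation):

```python
import numpy as np

def aux_losses(E, male, female, neutral, g_dims=1):
    neu = lambda w: E[w][:-g_dims]        # gender-neutral part of the embedding
    gen = lambda w: E[w][-g_dims:]        # gender part of the embedding
    # (1) averages of male and female words should match on the neutral part
    l1 = np.sum((np.mean([neu(w) for w in male], axis=0)
                 - np.mean([neu(w) for w in female], axis=0)) ** 2)
    # (2) male words pushed to +1, female words to -1 on the gender dimension
    l2 = sum(np.sum((gen(w) - 1.0) ** 2) for w in male) \
       + sum(np.sum((gen(w) + 1.0) ** 2) for w in female)
    # (3) neutral words should sit in the middle between the two groups
    mid = (np.mean([gen(w) for w in male]) + np.mean([gen(w) for w in female])) / 2
    l3 = sum(np.sum((gen(w) - mid) ** 2) for w in neutral)
    return l1, l2, l3

E = {"king": np.array([0.5, 0.2, 0.9]), "queen": np.array([0.4, 0.3, -0.8]),
     "engineer": np.array([0.6, 0.1, 0.05])}
print(aux_losses(E, ["king"], ["queen"], ["engineer"]))
```
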
2315
+
2316
+ 1:09:52.912 --> 1:09:56.563
2317
+ And then you're learning them, and then you
2318
+ can apply them in different ways.
2319
+
2320
+ 1:09:57.057 --> 1:10:03.458
2321
+ So you have this a bit in the pre-training
2322
+ thing.
2323
+
2324
+ 1:10:03.458 --> 1:10:10.372
2325
+ You can use the pre-trained inbeddings on
2326
+ the output.
2327
+
2328
+ 1:10:10.372 --> 1:10:23.117
2329
+ All you can use are: And then you can analyze
2330
+ what happens instead of training them directly.
2331
+
2332
+ 1:10:23.117 --> 1:10:30.504
2333
+ If have this additional loss, which tries
2334
+ to optimize.
2335
+
2336
+ 1:10:32.432 --> 1:10:42.453
2337
+ And then it was evaluated exactly on the sentences
2338
+ we had at the beginning where it is about know
2339
+
2340
+ 1:10:42.453 --> 1:10:44.600
2341
+ her for a long time.
2342
+
2343
+ 1:10:44.600 --> 1:10:48.690
2344
+ My friend works as an accounting clerk.
2345
+
2346
+ 1:10:48.788 --> 1:10:58.049
2347
+ So all these examples are not very difficult
2348
+ to translation, but the question is how often
2349
+
2350
+ 1:10:58.049 --> 1:10:58.660
2351
+ does?
2352
+
2353
+ 1:11:01.621 --> 1:11:06.028
2354
+ That it's not that complicated as you see
2355
+ here, so even the baseline.
2356
+
2357
+ 1:11:06.366 --> 1:11:10.772
2358
+ If you're doing nothing is working quite well,
2359
+ it's most challenging.
2360
+
2361
+ 1:11:10.772 --> 1:11:16.436
2362
+ It seems overall in the situation where it's
2363
+ a name, so for he and him he has learned the
2364
+
2365
+ 1:11:16.436 --> 1:11:22.290
2366
+ correlation because that's maybe not surprisingly
2367
+ because this correlation occurs more often
2368
+
2369
+ 1:11:22.290 --> 1:11:23.926
2370
+ than with any name there.
2371
+
2372
+ 1:11:24.044 --> 1:11:31.749
2373
+ If you have a name that you can extract, that
2374
+ is talking about Mary, that's female is a lot
2375
+
2376
+ 1:11:31.749 --> 1:11:34.177
2377
+ harder to extract than this.
2378
+
2379
+ 1:11:34.594 --> 1:11:40.495
2380
+ So you'll see already in the baseline this
2381
+ is yeah, not working, not working.
2382
+
2383
+ 1:11:43.403 --> 1:11:47.159
2384
+ And for all the other cases it's working very
2385
+ well.
2386
+
2387
+ 1:11:47.787 --> 1:11:53.921
2388
+ Where the best one is achieved here with
2389
+ hard debiasing both on the encoder and on the decoder.
2390
+
2391
+ 1:11:57.077 --> 1:12:09.044
2392
+ It makes sense that hard debiasing on the
2393
+ decoder doesn't really work because there you
2394
+
2395
+ 1:12:09.044 --> 1:12:12.406
2396
+ have gender information.
2397
+
2398
+ 1:12:14.034 --> 1:12:17.406
2399
+ For glove it seems to already work here.
2400
+
2401
+ 1:12:17.406 --> 1:12:20.202
2402
+ That's maybe surprising and yeah.
2403
+
2404
+ 1:12:20.260 --> 1:12:28.263
2405
+ So there is no clear else we don't have numbers
2406
+ for that doesn't really work well on the other.
2407
+
2408
+ 1:12:28.263 --> 1:12:30.513
2409
+ So how much do I use then?
2410
+
2411
+ 1:12:33.693 --> 1:12:44.720
2412
+ Then as a last way of improving that is a
2413
+ bit what we had mentioned before.
2414
+
2415
+ 1:12:44.720 --> 1:12:48.493
2416
+ That is what is referred.
2417
+
2418
+ 1:12:48.488 --> 1:12:59.133
2419
+ One problem is the bias in the data so you
2420
+ can adapt your data so you can just try to
2421
+
2422
+ 1:12:59.133 --> 1:13:01.485
2423
+ find equal amount.
2424
+
2425
+ 1:13:01.561 --> 1:13:11.368
2426
+ In your data like you adapt your data and
2427
+ then you find your data on the smaller but
2428
+
2429
+ 1:13:11.368 --> 1:13:12.868
2430
+ you can try.
2431
+
2432
+ 1:13:18.298 --> 1:13:19.345
2433
+ This is line okay.
2434
+
2435
+ 1:13:19.345 --> 1:13:21.605
2436
+ We have access to the data to the model.
2437
+
2438
+ 1:13:21.605 --> 1:13:23.038
2439
+ We can improve our model.
2440
+
2441
+ 1:13:24.564 --> 1:13:31.328
2442
+ One situation we haven't talked a lot about
2443
+ but another situation might also be and that's
2444
+
2445
+ 1:13:31.328 --> 1:13:37.942
2446
+ even getting more important is oh you want
2447
+ to work with a model which you don't have but
2448
+
2449
+ 1:13:37.942 --> 1:13:42.476
2450
+ you want to improve the model without having
2451
+ access so when.
2452
+
2453
+ 1:13:42.862 --> 1:13:49.232
2454
+ Nowadays there are a lot of companies who
2455
+ are not developing their own system but they're
2456
+
2457
+ 1:13:49.232 --> 1:13:52.983
2458
+ using or something like that or machine translation.
2459
+
2460
+ 1:13:53.313 --> 1:13:59.853
2461
+ So there is interest that you might not be
2462
+ able to fine-tune these models completely.
2463
+
2464
+ 1:14:00.080 --> 1:14:09.049
2465
+ So the question is, can you do some type of
2466
+ black box adaptation of a system that takes
2467
+
2468
+ 1:14:09.049 --> 1:14:19.920
2469
+ the black box system but tries to improve it
2470
+ in some ways through: There's some ways of
2471
+
2472
+ 1:14:19.920 --> 1:14:21.340
2473
+ doing that.
2474
+
2475
+ 1:14:21.340 --> 1:14:30.328
2476
+ One is called black box injection and that's
2477
+ what is referred to as prompting.
2478
+
2479
+ 1:14:30.730 --> 1:14:39.793
2480
+ So the problem is if you have sentences you
2481
+ don't have information about the speakers.
2482
+
2483
+ 1:14:39.793 --> 1:14:43.127
2484
+ So how can you put information?
2485
+
2486
+ 1:14:43.984 --> 1:14:53.299
2487
+ And what we know from a large language model,
2488
+ we just prompt them, and you can do that.
2489
+
2490
+ 1:14:53.233 --> 1:14:59.545
2491
+ Translating directly, I love you, you said
2492
+ she said to him, I love you, and then of course
2493
+
2494
+ 1:14:59.545 --> 1:15:01.210
2495
+ you have to strip away.
2496
+
2497
+ 1:15:01.181 --> 1:15:06.629
2498
+ I mean, you cannot prevent the model from
2499
+ translating that, but you should be able to
2500
+
2501
+ 1:15:06.629 --> 1:15:08.974
2502
+ see what is the translation of this.
2503
+
2504
+ 1:15:08.974 --> 1:15:14.866
2505
+ One can strip that away, and now the system
2506
+ had hopefully the information that it's somebody
2507
+
2508
+ 1:15:14.866 --> 1:15:15.563
2509
+ like that.
2510
+
2511
+ 1:15:15.563 --> 1:15:17.020
2512
+ The speaker is female.
2513
+
2514
+ 1:15:18.198 --> 1:15:23.222
2515
+ Because you're no longer translating love
2516
+ you, but you're translating the sentence she
2517
+
2518
+ 1:15:23.222 --> 1:15:24.261
2519
+ said to him love.
2520
+
2521
+ 1:15:24.744 --> 1:15:37.146
2522
+ And so you insert this information as contextual
2523
+ information around it and don't have to change
2524
+
2525
+ 1:15:37.146 --> 1:15:38.567
2526
+ the model.
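
A minimal sketch of this black-box injection (the carrier sentence, the stripping heuristic and the `translate` placeholder are assumptions; in practice `translate` would be whatever external MT system you call):

```python
# Wrap the sentence in a gendered carrier sentence, translate with the
# unmodified black-box system, then strip the carrier from the output again.
def translate(text):  # placeholder for the black-box system
    return {"She said to him: 'I love you.'": 'Sie sagte zu ihm: "Ich liebe dich."'}[text]

def translate_with_speaker(sentence, speaker_gender):
    carrier = "She said to him:" if speaker_gender == "female" else "He said to her:"
    output = translate(f"{carrier} '{sentence}'")
    # Strip everything up to the first colon to recover only the quoted part.
    return output.split(":", 1)[1].strip().strip('"').strip()

print(translate_with_speaker("I love you.", "female"))  # Ich liebe dich.
```
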
2527
+
2528
+ 1:15:41.861 --> 1:15:56.946
2529
+ Last idea is to do what is referred to as
2530
+ lattice rescoring, so the idea there is you
2531
+
2532
+ 1:15:56.946 --> 1:16:01.156
2533
+ generate a translation.
2534
+
2535
+ 1:16:01.481 --> 1:16:18.547
2536
+ And now you have an additional component which
2537
+ tries to add possibilities where gender information
2538
+
2539
+ 1:16:18.547 --> 1:16:21.133
2540
+ might be lost.
2541
+
2542
+ 1:16:21.261 --> 1:16:29.687
2543
+ It's just a graph in this way, a simplified
2544
+ graph where there's always one word between
2545
+
2546
+ 1:16:29.687 --> 1:16:31.507
2547
+ two nodes and you.
2548
+
2549
+ 1:16:31.851 --> 1:16:35.212
2550
+ So you have something like "sie ist ein Arzt" or
2551
+ "sie ist eine Ärztin".
2552
+
2553
+ 1:16:35.535 --> 1:16:41.847
2554
+ And then you can generate all possible variants.
2555
+
2556
+ 1:16:41.847 --> 1:16:49.317
2557
+ Then, of course, we're not done because the
2558
+ final output.
2559
+
2560
+ 1:16:50.530 --> 1:16:56.999
2561
+ Then you can re-score the system by a gender
2562
+ de-biased model.
2563
+
2564
+ 1:16:56.999 --> 1:17:03.468
2565
+ So the nice thing is why why don't we directly
2566
+ use our model?
2567
+
2568
+ 1:17:03.468 --> 1:17:10.354
2569
+ The idea is our model, which is only focusing
2570
+ on gender devising.
2571
+
2572
+ 1:17:10.530 --> 1:17:16.470
2573
+ It can be, for example, if it's just trained
2574
+ on some synthetic data, it will not be that
2575
+
2576
+ 1:17:16.470 --> 1:17:16.862
2577
+ well.
2578
+
2579
+ 1:17:16.957 --> 1:17:21.456
2580
+ But what we can do then is now you can rescore
2581
+ the possible translations in here.
2582
+
2583
+ 1:17:21.721 --> 1:17:31.090
2584
+ And here the cases of course in general structure
2585
+ is already done how to translate the words.
2586
+
2587
+ 1:17:31.051 --> 1:17:42.226
2588
+ Then you're only using the second component
2589
+ in order to rescore some variants and then
2590
+
2591
+ 1:17:42.226 --> 1:17:45.490
2592
+ get the best translation.
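
An editor's sketch of this lattice idea (the per-word alternatives and the scorer are toy stand-ins; a real system would build a proper lattice and score it with a dedicated gender-debiased model):

```python
from itertools import product

# Expand gender-marked words into their alternatives and let a second,
# gender-debiased scorer pick the best path.
def expand(tokens, alternatives):
    options = [alternatives.get(t, [t]) for t in tokens]
    return [" ".join(path) for path in product(*options)]

def debiased_score(sentence):
    # dummy scorer: rewards the feminine form with correct agreement here
    return ("eine Ärztin" in sentence) + ("sie" in sentence)

base = "sie ist ein Arzt".split()
alts = {"ein": ["ein", "eine"], "Arzt": ["Arzt", "Ärztin"]}
candidates = expand(base, alts)            # 4 variants of the sentence
print(max(candidates, key=debiased_score)) # sie ist eine Ärztin
```
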
2593
+
2594
+ 1:17:45.925 --> 1:17:58.479
2595
+ And: As the last one there is the post processing
2596
+ so you can't have it.
2597
+
2598
+ 1:17:58.538 --> 1:18:02.830
2599
+ Mean this was one way of post-processing was
2600
+ to generate the lattice and retranslate it.
2601
+
2602
+ 1:18:03.123 --> 1:18:08.407
2603
+ But you can also have a processing, for example
2604
+ only on the target side where you have additional
2605
+
2606
+ 1:18:08.407 --> 1:18:12.236
2607
+ components with checks about the gender which
2608
+ maybe only knows gender.
2609
+
2610
+ 1:18:12.236 --> 1:18:17.089
2611
+ So it's not a machine translation component
2612
+ but more like a grammatical checker which can
2613
+
2614
+ 1:18:17.089 --> 1:18:19.192
2615
+ be used as most processing to do that.
2616
+
2617
+ 1:18:19.579 --> 1:18:22.926
2618
+ Think about it a bit like when you use ChatGPT.
2619
+
2620
+ 1:18:22.926 --> 1:18:25.892
2621
+ There's also a lot of post processing.
2622
+
2623
+ 1:18:25.892 --> 1:18:32.661
2624
+ If you used it directly, it would tell you
2625
+ how to build a bomb, but they have some checks
2626
+
2627
+ 1:18:32.661 --> 1:18:35.931
2628
+ either before and after to prevent things.
2629
+
2630
+ 1:18:36.356 --> 1:18:40.580
2631
+ So often there might be an application system.
2632
+
2633
+ 1:18:40.580 --> 1:18:44.714
2634
+ There might be extra pre and post processing.
2635
+
2636
+ 1:18:48.608 --> 1:18:52.589
2637
+ And yeah, with this we're at the end of.
2638
+
2639
+ 1:18:52.512 --> 1:19:09.359
2640
+ To this lecture where we focused on the bias,
2641
+ but think a lot of these techniques we have
2642
+
2643
+ 1:19:09.359 --> 1:19:11.418
2644
+ seen here.
2645
+
2646
+ 1:19:11.331 --> 1:19:17.664
2647
+ So we saw, on the one hand, we saw that evaluating
2648
+ just pure BLEU scores might not always be enough.
2649
+
2650
+ 1:19:17.677 --> 1:19:18.947
2651
+ Mean it's very important.
2652
+
2653
+ 1:19:20.000 --> 1:19:30.866
2654
+ Always do that, but if you want to check and
2655
+ some specific things are important, then you
2656
+
2657
+ 1:19:30.866 --> 1:19:35.696
2658
+ might have to do dedicated evaluations.
2659
+
2660
+ 1:19:36.036 --> 1:19:44.296
2661
+ It is now translating for the President and
2662
+ it is like in German that guess it is not very
2663
+
2664
+ 1:19:44.296 --> 1:19:45.476
2665
+ appropriate.
2666
+
2667
+ 1:19:45.785 --> 1:19:53.591
2668
+ So it might be important if characteristics
2669
+ of your system are essential to have dedicated
2670
+
2671
+ 1:19:53.591 --> 1:19:54.620
2672
+ evaluation.
2673
+
2674
+ 1:19:55.135 --> 1:20:02.478
2675
+ And then if you have that, of course, it might
2676
+ be also important to develop dedicated techniques.
2677
+
2678
+ 1:20:02.862 --> 1:20:10.988
2679
+ We have seen today some how to mitigate biases,
2680
+ but I hope you see that a lot of these techniques
2681
+
2682
+ 1:20:10.988 --> 1:20:13.476
2683
+ you can also use to mitigate.
2684
+
2685
+ 1:20:13.573 --> 1:20:31.702
2686
+ At least related things you can adjust the
2687
+ training data you can do for other things.
2688
+
2689
+ 1:20:33.253 --> 1:20:36.022
2690
+ Before we have been finishing, we have any
2691
+ more questions.
2692
+
2693
+ 1:20:41.761 --> 1:20:47.218
2694
+ Then thanks a lot, and then we will see each
2695
+ other again on the first step.
2696
+
demo_data/lectures/Lecture-13-04.07.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:42f89fc932d5818061ea4e7490a1ea9a58c6b937b7696d69d117fca50623f0a2
3
+ size 108699463
demo_data/lectures/Lecture-14-27.06.2023/English.vtt ADDED
@@ -0,0 +1,2747 @@
1
+ WEBVTT
2
+
3
+ 0:00:01.921 --> 0:00:16.424
4
+ Hey welcome to today's lecture, what we today
5
+ want to look at is how we can make neural machine translation more efficient.
6
+
7
+ 0:00:16.796 --> 0:00:26.458
8
+ So until now we have this global system, the
9
+ encoder and the decoder mostly, and we haven't
10
+
11
+ 0:00:26.458 --> 0:00:29.714
12
+ really thought about how long.
13
+
14
+ 0:00:30.170 --> 0:00:42.684
15
+ And what we, for example, know is yeah, you
16
+ can make the systems bigger in different ways.
17
+
18
+ 0:00:42.684 --> 0:00:47.084
19
+ We can make them deeper so the.
20
+
21
+ 0:00:47.407 --> 0:00:56.331
22
+ And if we have at least enough data that typically
23
+ helps you make the performance better.
24
+
25
+ 0:00:56.576 --> 0:01:00.620
26
+ But of course leads to problems that we need
27
+ more resources.
28
+
29
+ 0:01:00.620 --> 0:01:06.587
30
+ That is a problem at universities where we
31
+ have typically limited computation capacities.
32
+
33
+ 0:01:06.587 --> 0:01:11.757
34
+ So at some point you have such big models
35
+ that you cannot train them anymore.
36
+
37
+ 0:01:13.033 --> 0:01:23.792
38
+ And also for companies it is of course important
39
+ what it costs you to generate a translation,
40
+
41
+ 0:01:23.792 --> 0:01:26.984
42
+ just by power consumption.
43
+
44
+ 0:01:27.667 --> 0:01:35.386
45
+ So yeah, there's different reasons why you
46
+ want to do efficient machine translation.
47
+
48
+ 0:01:36.436 --> 0:01:48.338
49
+ One reason is there are different ways of
50
+ how you can improve your machine translation
51
+
52
+ 0:01:48.338 --> 0:01:50.527
53
+ system once we.
54
+
55
+ 0:01:50.670 --> 0:01:55.694
56
+ There can be different types of data we looked
57
+ into data crawling, monolingual data.
58
+
59
+ 0:01:55.875 --> 0:01:59.024
60
+ All this data and the aim is always.
61
+
62
+ 0:01:59.099 --> 0:02:05.735
63
+ Of course, we are not just purely interested
64
+ in having more data, but the idea why we want
65
+
66
+ 0:02:05.735 --> 0:02:12.299
67
+ to have more data is that more data also means
68
+ that we have better quality because mostly
69
+
70
+ 0:02:12.299 --> 0:02:17.550
71
+ we are interested in increasing the quality
72
+ of the machine translation.
73
+
74
+ 0:02:18.838 --> 0:02:24.892
75
+ But there's also other ways of how you can
76
+ improve the quality of a machine translation.
77
+
78
+ 0:02:25.325 --> 0:02:36.450
79
+ And what is, of course, that is where most
80
+ research is focusing on.
81
+
82
+ 0:02:36.450 --> 0:02:44.467
83
+ It means that we want to build better algorithms.
84
+
85
+ 0:02:44.684 --> 0:02:48.199
86
+ Course: The other things are normally as good.
87
+
88
+ 0:02:48.199 --> 0:02:54.631
89
+ Sometimes it's easier to improve, so often
90
+ it's easier to just collect more data than
91
+
92
+ 0:02:54.631 --> 0:02:57.473
93
+ to invent some great new algorithms.
94
+
95
+ 0:02:57.473 --> 0:03:00.315
96
+ But yeah, both of them are important.
97
+
98
+ 0:03:00.920 --> 0:03:09.812
99
+ But there is this third thing, especially
100
+ with neural machine translation, and that means
101
+
102
+ 0:03:09.812 --> 0:03:11.590
103
+ we make a bigger.
104
+
105
+ 0:03:11.751 --> 0:03:16.510
106
+ Can be, as said, that we have more layers,
107
+ that we have wider layers.
108
+
109
+ 0:03:16.510 --> 0:03:19.977
110
+ The other thing we talked a bit about is ensemble.
111
+
112
+ 0:03:19.977 --> 0:03:24.532
113
+ That means we are not building one new machine
114
+ translation system.
115
+
116
+ 0:03:24.965 --> 0:03:27.505
117
+ And we can easily build four.
118
+
119
+ 0:03:27.505 --> 0:03:32.331
120
+ What is the typical strategy to build different
121
+ systems?
122
+
123
+ 0:03:32.331 --> 0:03:33.177
124
+ Remember.
125
+
126
+ 0:03:35.795 --> 0:03:40.119
127
+ It should be of course a bit different if
128
+ you have the same.
129
+
130
+ 0:03:40.119 --> 0:03:44.585
131
+ If they all predict the same then combining
132
+ them doesn't help.
133
+
134
+ 0:03:44.585 --> 0:03:48.979
135
+ So what is the easiest way if you have to
136
+ build four systems?
137
+
138
+ 0:03:51.711 --> 0:04:01.747
139
+ And the Charleston's will take, but this is
140
+ the best output of a single system.
141
+
142
+ 0:04:02.362 --> 0:04:10.165
143
+ Mean now, it's really three different systems
144
+ so that you later can combine them and maybe
145
+
146
+ 0:04:10.165 --> 0:04:11.280
147
+ the average.
148
+
149
+ 0:04:11.280 --> 0:04:16.682
150
+ Ensembles are typically that the average is
151
+ all probabilities.
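
A minimal sketch of that averaging (toy numbers, three "models", three target tokens; real ensembling would average the full distributions at every decoding step):

```python
# Average the next-token distributions of several models at one decoding step.
def ensemble_step(distributions):
    vocab = distributions[0].keys()
    return {t: sum(d[t] for d in distributions) / len(distributions) for t in vocab}

step_probs = [
    {"Haus": 0.6, "Gebäude": 0.3, "Heim": 0.1},
    {"Haus": 0.5, "Gebäude": 0.4, "Heim": 0.1},
    {"Haus": 0.7, "Gebäude": 0.2, "Heim": 0.1},
]
avg = ensemble_step(step_probs)
print(max(avg, key=avg.get), avg)   # Haus is chosen with averaged probability 0.6
```
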
152
+
153
+ 0:04:19.439 --> 0:04:24.227
154
+ The idea is to think about neural networks.
155
+
156
+ 0:04:24.227 --> 0:04:29.342
157
+ There's one parameter which you can easily adjust.
158
+
159
+ 0:04:29.342 --> 0:04:36.525
160
+ That's exactly it: the easiest way is to randomize
161
+ with three different seeds.
162
+
163
+ 0:04:37.017 --> 0:04:43.119
164
+ They have the same architecture, so all the
165
+ hyperparameters are the same, but they are
166
+
167
+ 0:04:43.119 --> 0:04:43.891
168
+ different.
169
+
170
+ 0:04:43.891 --> 0:04:46.556
171
+ They will have different predictions.
172
+
173
+ 0:04:48.228 --> 0:04:52.572
174
+ So, of course, bigger amounts.
175
+
176
+ 0:04:52.572 --> 0:05:05.325
177
+ Some of these are a bit the easiest way of
178
+ improving your quality because you don't really
179
+
180
+ 0:05:05.325 --> 0:05:08.268
181
+ have to do anything.
182
+
183
+ 0:05:08.588 --> 0:05:12.588
184
+ There is limits on that bigger models only
185
+ get better.
186
+
187
+ 0:05:12.588 --> 0:05:19.132
188
+ If you have enough training data you can't
189
+ do like a hundred layers and it will not work
190
+
191
+ 0:05:19.132 --> 0:05:24.877
192
+ on very small data but with a recent amount
193
+ of data that is the easiest thing.
194
+
195
+ 0:05:25.305 --> 0:05:33.726
196
+ However, they are challenging with making
197
+ better models, bigger models, and that is the
198
+
199
+ 0:05:33.726 --> 0:05:34.970
200
+ computation.
201
+
202
+ 0:05:35.175 --> 0:05:44.482
203
+ So, of course, if you have a bigger model
204
+ that can mean that you have longer running
205
+
206
+ 0:05:44.482 --> 0:05:49.518
207
+ times, if you have models, you have to times.
208
+
209
+ 0:05:51.171 --> 0:05:56.685
210
+ Normally you cannot parallelize the different
211
+ layers because the input to one layer is always
212
+
213
+ 0:05:56.685 --> 0:06:02.442
214
+ the output of the previous layer, so you propagate
215
+ that so it will also increase your runtime.
216
+
217
+ 0:06:02.822 --> 0:06:10.720
218
+ Then you have to store all your models in
219
+ memory.
220
+
221
+ 0:06:10.720 --> 0:06:20.927
222
+ If you have double the weights you will need more memory.
223
+ It is more difficult to then do back propagation.
224
+
225
+ 0:06:20.927 --> 0:06:27.680
226
+ You have to store in between the activations,
227
+ so there's not only do you increase the model
228
+
229
+ 0:06:27.680 --> 0:06:31.865
230
+ in your memory, but also all these other variables
231
+ that.
232
+
233
+ 0:06:34.414 --> 0:06:36.734
234
+ And so in general it is more expensive.
235
+
236
+ 0:06:37.137 --> 0:06:54.208
237
+ And therefore there's good reasons in looking
238
+ into whether we can make these models more efficient.
239
+
240
+ 0:06:54.134 --> 0:07:00.982
241
+ So the question can be, for example: okay, you have
242
+ one GPU and one day of training time,
243
+
244
+ 0:07:00.982 --> 0:07:01.274
245
+ or.
246
+
247
+ 0:07:01.221 --> 0:07:07.535
248
+ Forty thousand euros and then what is the
249
+ best machine translation system I can get within
250
+
251
+ 0:07:07.535 --> 0:07:08.437
252
+ this budget.
253
+
254
+ 0:07:08.969 --> 0:07:19.085
255
+ And then, of course, you can make the models
256
+ bigger, but then you have to train them shorter,
257
+
258
+ 0:07:19.085 --> 0:07:24.251
259
+ and then we can make more efficient algorithms.
260
+
261
+ 0:07:25.925 --> 0:07:31.699
262
+ If you think about efficiency, there's a bit
263
+ different scenarios.
264
+
265
+ 0:07:32.312 --> 0:07:43.635
266
+ So if you're more of coming from the research
267
+ community, what you'll be doing is building
268
+
269
+ 0:07:43.635 --> 0:07:47.913
270
+ a lot of models in your research.
271
+
272
+ 0:07:48.088 --> 0:07:58.645
273
+ So you're having your test set of maybe sentences,
274
+ calculating the BLEU score, then training another model.
275
+
276
+ 0:07:58.818 --> 0:08:08.911
277
+ So what that means is typically you're training
278
+ on millions of sentences, so your training time
279
+
280
+ 0:08:08.911 --> 0:08:14.944
281
+ is long, maybe a day, but maybe in other cases
282
+ a week.
283
+
284
+ 0:08:15.135 --> 0:08:22.860
285
+ The testing is not really the cost efficient,
286
+ but the training is very costly.
287
+
288
+ 0:08:23.443 --> 0:08:37.830
289
+ If you are more thinking of building models
290
+ for application, the scenario is quite different.
291
+
292
+ 0:08:38.038 --> 0:08:46.603
293
+ And then you keep it running, and maybe thousands
294
+ of customers are using it in translating.
295
+
296
+ 0:08:46.603 --> 0:08:47.720
297
+ So in that.
298
+
299
+ 0:08:48.168 --> 0:08:59.577
300
+ And we will see that it is not always the
301
+ same type of challenges: you can parallelize some
302
+
303
+ 0:08:59.577 --> 0:09:07.096
304
+ things in training, which you cannot parallelize
305
+ in testing.
306
+
307
+ 0:09:07.347 --> 0:09:14.124
308
+ For example, in training you have to do back
309
+ propagation, so you have to store the activations.
310
+
311
+ 0:09:14.394 --> 0:09:23.901
312
+ Therefore, in testing we briefly discussed
313
+ that we would do it in more detail today in
314
+
315
+ 0:09:23.901 --> 0:09:24.994
316
+ training.
317
+
318
+ 0:09:25.265 --> 0:09:36.100
319
+ You know the target and you can process
320
+ everything in parallel while in testing.
321
+
322
+ 0:09:36.356 --> 0:09:46.741
323
+ So you can only do one word at a time, and
324
+ so you can parallelize this less.
325
+
326
+ 0:09:46.741 --> 0:09:50.530
327
+ Therefore, it's important.
328
+
329
+ 0:09:52.712 --> 0:09:55.347
330
+ Is a specific task on this.
331
+
332
+ 0:09:55.347 --> 0:10:03.157
333
+ For example, it's the efficiency task where
334
+ it's about making things as efficient.
335
+
336
+ 0:10:03.123 --> 0:10:09.230
337
+ Is possible and they can look at different
338
+ resources.
339
+
340
+ 0:10:09.230 --> 0:10:14.207
341
+ So how much deep fuel run time do you need?
342
+
343
+ 0:10:14.454 --> 0:10:19.366
344
+ See how much memory you need or you can have
345
+ a fixed memory budget and then have to build
346
+
347
+ 0:10:19.366 --> 0:10:20.294
348
+ the best system.
349
+
350
+ 0:10:20.500 --> 0:10:29.010
351
+ And here is a bit like an example of that,
352
+ so there's three teams from Edinburgh from
353
+
354
+ 0:10:29.010 --> 0:10:30.989
355
+ and they submitted.
356
+
357
+ 0:10:31.131 --> 0:10:36.278
358
+ So then, of course, if you want to know the
359
+ most efficient system you have to do a bit
360
+
361
+ 0:10:36.278 --> 0:10:36.515
362
+ of.
363
+
364
+ 0:10:36.776 --> 0:10:44.656
365
+ You want to have a better quality or more
366
+ runtime and there's not the one solution.
367
+
368
+ 0:10:44.656 --> 0:10:46.720
369
+ You can improve your.
370
+
371
+ 0:10:46.946 --> 0:10:49.662
372
+ And that you see that there are different
373
+ systems.
374
+
375
+ 0:10:49.909 --> 0:11:06.051
376
+ Here is how many words you can do for a second
377
+ on the clock, and you want to be as talk as
378
+
379
+ 0:11:06.051 --> 0:11:07.824
380
+ possible.
381
+
382
+ 0:11:08.068 --> 0:11:08.889
383
+ And you see here a bit.
384
+
385
+ 0:11:08.889 --> 0:11:09.984
386
+ This is a little bit different.
387
+
388
+ 0:11:11.051 --> 0:11:27.717
389
+ You want to be there on the top right corner
390
+ and you can get a score of something between
391
+
392
+ 0:11:27.717 --> 0:11:29.014
393
+ words.
394
+
395
+ 0:11:30.250 --> 0:11:34.161
396
+ Two hundred and fifty thousand, then you'll
397
+ ever come and score zero point three.
398
+
399
+ 0:11:34.834 --> 0:11:41.243
400
+ There is, of course, any bit of a decision,
401
+ but the question is, like how far can you again?
402
+
403
+ 0:11:41.243 --> 0:11:47.789
404
+ Some of all these points on this line would
405
+ be winners because they are somehow most efficient
406
+
407
+ 0:11:47.789 --> 0:11:53.922
408
+ in a way that there's no system which achieves
409
+ the same quality with less computational.
410
+
411
+ 0:11:57.657 --> 0:12:04.131
412
+ So there's the one question of which resources
413
+ are you interested.
414
+
415
+ 0:12:04.131 --> 0:12:07.416
416
+ Are you running it on CPU or GPU?
417
+
418
+ 0:12:07.416 --> 0:12:11.668
419
+ There are different ways of parallelizing stuff.
420
+
421
+ 0:12:14.654 --> 0:12:20.777
422
+ Another dimension is how you process your
423
+ data.
424
+
425
+ 0:12:20.777 --> 0:12:27.154
426
+ There's really batch processing and streaming.
427
+
428
+ 0:12:27.647 --> 0:12:34.672
429
+ So in batch processing you have the whole
430
+ document available so you can translate all
431
+
432
+ 0:12:34.672 --> 0:12:39.981
433
+ sentences in parallel and then you're interested
434
+ in throughput.
435
+
436
+ 0:12:40.000 --> 0:12:43.844
437
+ But you can then process, for example, especially
438
+ on GPUs.
439
+
440
+ 0:12:43.844 --> 0:12:49.810
441
+ That's interesting, you're not translating
442
+ one sentence at a time, but you're translating
443
+
444
+ 0:12:49.810 --> 0:12:56.108
445
+ one hundred sentences or so in parallel, so
446
+ you have one more dimension where you can parallelize
447
+
448
+ 0:12:56.108 --> 0:12:57.964
449
+ and then be more efficient.
450
+
451
+ 0:12:58.558 --> 0:13:14.863
452
+ On the other hand, for example sorts of documents,
453
+ so we learned that if you do batch processing
454
+
455
+ 0:13:14.863 --> 0:13:16.544
456
+ you have padding.
457
+
458
+ 0:13:16.636 --> 0:13:24.636
459
+ Then, of course, it makes sense to sort the
460
+ sentences in order to have the minimum padding
461
+
462
+ 0:13:24.636 --> 0:13:25.535
463
+ attached.
464
+
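A small sketch of the sorting trick mentioned here, assuming the corpus is already tokenized into lists of tokens: grouping sentences of similar length into a batch keeps the padding overhead low.

```python
def length_sorted_batches(sentences, batch_size):
    """Yield batches of sentences with similar lengths so that each batch
    needs as little padding as possible (offline / batch translation).
    The original indices are returned so the translations can be put
    back into document order afterwards."""
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    for start in range(0, len(order), batch_size):
        ids = order[start:start + batch_size]
        yield ids, [sentences[i] for i in ids]
```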
465
+ 0:13:27.427 --> 0:13:32.150
466
+ The other scenario is more the streaming scenario
467
+ where you do live translation.
468
+
469
+ 0:13:32.512 --> 0:13:40.212
470
+ So in that case you can't wait for the whole
471
+ document to pass, but you have to do.
472
+
473
+ 0:13:40.520 --> 0:13:49.529
474
+ And then, for example, that's especially in
475
+ situations like speech translation, and then
476
+
477
+ 0:13:49.529 --> 0:13:53.781
478
+ you're interested in things like latency.
479
+
480
+ 0:13:53.781 --> 0:14:00.361
481
+ So how much do you have to wait to get the
482
+ output of a sentence?
483
+
484
+ 0:14:06.566 --> 0:14:16.956
485
+ Finally, there is the thing about the implementation:
486
+ Today we're mainly looking at different algorithms,
487
+
488
+ 0:14:16.956 --> 0:14:23.678
489
+ different models of how you can model them
490
+ in your machine translation system, but of
491
+
492
+ 0:14:23.678 --> 0:14:29.227
493
+ course for the same algorithms there's also
494
+ different implementations.
495
+
496
+ 0:14:29.489 --> 0:14:38.643
497
+ So, for example, for a machine translation
498
+ this tool could be very fast.
499
+
500
+ 0:14:38.638 --> 0:14:46.615
501
+ So they have like coded a lot of the operations
502
+ very low resource, not low resource, low level
503
+
504
+ 0:14:46.615 --> 0:14:49.973
505
+ directly on the CUDA kernels.
506
+
507
+ 0:14:50.110 --> 0:15:00.948
508
+ So the same attention network is typically
509
+ more efficient in that type of algorithm.
510
+
511
+ 0:15:00.880 --> 0:15:02.474
512
+ than in any other.
513
+
514
+ 0:15:03.323 --> 0:15:13.105
515
+ Of course, it might be other disadvantages,
516
+ so if you're a little worker or have worked
517
+
518
+ 0:15:13.105 --> 0:15:15.106
519
+ in the practical.
520
+
521
+ 0:15:15.255 --> 0:15:22.604
522
+ Because it's normally easier to understand,
523
+ easier to change, and so on, but there is again
524
+
525
+ 0:15:22.604 --> 0:15:23.323
526
+ a trade-off.
527
+
528
+ 0:15:23.483 --> 0:15:29.440
529
+ You have to think about, do you want to include
530
+ this into my study or comparison or not?
531
+
532
+ 0:15:29.440 --> 0:15:36.468
533
+ Should it be like I compare different implementations
534
+ and I also find the most efficient implementation?
535
+
536
+ 0:15:36.468 --> 0:15:39.145
537
+ Or is it only about the pure algorithm?
538
+
539
+ 0:15:42.742 --> 0:15:50.355
540
+ Yeah, when building these systems there is
541
+ a different trade-off to do.
542
+
543
+ 0:15:50.850 --> 0:15:56.555
544
+ So there's one of the traders between memory
545
+ and throughput, so how many words you can generate
546
+
547
+ 0:15:56.555 --> 0:15:57.299
548
+ per second.
549
+
550
+ 0:15:57.557 --> 0:16:03.351
551
+ So typically you can easily like increase
552
+ your throughput by increasing the batch size.
553
+
554
+ 0:16:03.643 --> 0:16:06.899
555
+ So that means you are translating more sentences
556
+ in parallel.
557
+
558
+ 0:16:07.107 --> 0:16:09.241
559
+ And GPUs are very good at that stuff.
560
+
561
+ 0:16:09.349 --> 0:16:15.161
562
+ It should translate one sentence or one hundred
563
+ sentences, not the same time, but its.
564
+
565
+ 0:16:15.115 --> 0:16:20.784
566
+ Rough are very similar because they are at
567
+ this efficient matrix multiplication so that
568
+
569
+ 0:16:20.784 --> 0:16:24.415
570
+ you can do the same operation on all sentences
571
+ in parallel.
572
+
573
+ 0:16:24.415 --> 0:16:30.148
574
+ So typically that means if you increase your
575
+ batch size you can do more things in parallel
576
+
577
+ 0:16:30.148 --> 0:16:31.995
578
+ and you will translate more.
579
+
580
+ 0:16:31.952 --> 0:16:33.370
581
+ Second.
582
+
583
+ 0:16:33.653 --> 0:16:43.312
584
+ On the other hand, with this advantage, of
585
+ course you will need larger batch sizes and
586
+
587
+ 0:16:43.312 --> 0:16:44.755
588
+ more memory.
589
+
590
+ 0:16:44.965 --> 0:16:56.452
591
+ To begin with, the other problem is that you
592
+ have such big models that you can only translate
593
+
594
+ 0:16:56.452 --> 0:16:59.141
595
+ with smaller batch sizes.
596
+
597
+ 0:16:59.119 --> 0:17:08.466
598
+ If you are running out of memory with translating,
599
+ one way to deal with that is to decrease your batch size.
600
+
601
+ 0:17:13.453 --> 0:17:24.456
602
+ Then there is the trade-off between quality and throughput,
603
+ of course, and before it's like larger models,
604
+
605
+ 0:17:24.456 --> 0:17:28.124
606
+ but in generally higher quality.
607
+
608
+ 0:17:28.124 --> 0:17:31.902
609
+ The first one is always this way.
610
+
611
+ 0:17:32.092 --> 0:17:38.709
612
+ Of course, a larger model does not always help; you can
613
+ have overfitting at some point, but in general it does.
614
+
615
+ 0:17:43.883 --> 0:17:52.901
616
+ And with this a bit on this training and testing
617
+ thing we had before.
618
+
619
+ 0:17:53.113 --> 0:17:58.455
620
+ So it wears all the difference between training
621
+ and testing, and for the encoder and decoder.
622
+
623
+ 0:17:58.798 --> 0:18:06.992
624
+ So if we are looking at what mentioned before
625
+ at training time, we have a source sentence
626
+
627
+ 0:18:06.992 --> 0:18:17.183
628
+ here: And how this is processed on a is not
629
+ the attention here.
630
+
631
+ 0:18:17.183 --> 0:18:21.836
632
+ That's a typical transformer.
633
+
634
+ 0:18:22.162 --> 0:18:31.626
635
+ And how we can do that on a is that we can
636
+ paralyze the ear ever since.
637
+
638
+ 0:18:31.626 --> 0:18:40.422
639
+ The first thing to know is: So that is, of
640
+ course, not in all cases.
641
+
642
+ 0:18:40.422 --> 0:18:49.184
643
+ We'll later talk about speech translation
644
+ where we might want to translate.
645
+
646
+ 0:18:49.389 --> 0:18:56.172
647
+ Without the general case in, it's like you
648
+ have the full sentence you want to translate.
649
+
650
+ 0:18:56.416 --> 0:19:02.053
651
+ So the important thing is we are here everything
652
+ available on the source side.
653
+
654
+ 0:19:03.323 --> 0:19:13.524
655
+ And then this was one of the big advantages
656
+ that you can remember back of transformer.
657
+
658
+ 0:19:13.524 --> 0:19:15.752
659
+ There are several.
660
+
661
+ 0:19:16.156 --> 0:19:25.229
662
+ But the other one is now that we can calculate
663
+ the full layer.
664
+
665
+ 0:19:25.645 --> 0:19:29.318
666
+ There is no dependency between this and this
667
+ state or this and this state.
668
+
669
+ 0:19:29.749 --> 0:19:36.662
670
+ So we always did like here to calculate the
671
+ key value and query, and based on that you
672
+
673
+ 0:19:36.662 --> 0:19:37.536
674
+ calculate.
675
+
676
+ 0:19:37.937 --> 0:19:46.616
677
+ Which means we can do all these calculations
678
+ here in parallel and in parallel.
679
+
680
+ 0:19:48.028 --> 0:19:55.967
681
+ And there, of course, is this very efficiency
682
+ because again for GPUs it's typically faster
683
+
684
+ 0:19:55.967 --> 0:20:00.887
685
+ to do these things in parallel and one after
686
+ each other.
687
+
688
+ 0:20:01.421 --> 0:20:10.311
689
+ And then we can also for each layer one by
690
+ one, and then we calculate here the encoder.
691
+
692
+ 0:20:10.790 --> 0:20:21.921
693
+ In training now an important thing is that
694
+ for the decoder we have the full sentence available
695
+
696
+ 0:20:21.921 --> 0:20:28.365
697
+ because we know this is the target we should
698
+ generate.
699
+
700
+ 0:20:29.649 --> 0:20:33.526
701
+ We have models now in a different way.
702
+
703
+ 0:20:33.526 --> 0:20:38.297
704
+ This hidden state is only on the previous
705
+ ones.
706
+
707
+ 0:20:38.598 --> 0:20:51.887
708
+ And the first thing here depends only on this
709
+ information, so you see if you remember we
710
+
711
+ 0:20:51.887 --> 0:20:56.665
712
+ had this masked self-attention.
713
+
714
+ 0:20:56.896 --> 0:21:04.117
715
+ So that means, of course, we can only calculate
716
+ the decoder once the encoder is done, but that's.
717
+
718
+ 0:21:04.444 --> 0:21:06.656
719
+ Percent can calculate the end quarter.
720
+
721
+ 0:21:06.656 --> 0:21:08.925
722
+ Then we can calculate here the decoder.
723
+
724
+ 0:21:09.569 --> 0:21:25.566
725
+ But again in training we have x, y and that
726
+ is available so we can calculate everything
727
+
728
+ 0:21:25.566 --> 0:21:27.929
729
+ in parallel.
730
+
731
+ 0:21:28.368 --> 0:21:40.941
732
+ So the interesting thing or advantage of transformer
733
+ is in training.
734
+
735
+ 0:21:40.941 --> 0:21:46.408
736
+ We can do it for the decoder.
737
+
738
+ 0:21:46.866 --> 0:21:54.457
739
+ That means you will have more calculations
740
+ because you can only calculate one layer at
741
+
742
+ 0:21:54.457 --> 0:22:02.310
743
+ a time, but for example the length which is
744
+ typically quite long, doesn't really matter
745
+
746
+ 0:22:02.310 --> 0:22:03.270
747
+ that much.
748
+
749
+ 0:22:05.665 --> 0:22:10.704
750
+ However, in testing this situation is different.
751
+
752
+ 0:22:10.704 --> 0:22:13.276
753
+ In testing we only have.
754
+
755
+ 0:22:13.713 --> 0:22:20.622
756
+ So this means we start with a sentence: we don't
757
+ know the full target sentence yet because we have
758
+
759
+ 0:22:20.622 --> 0:22:29.063
760
+ to autoregressively generate it, so for the encoder
761
+ we have the same here but for the decoder.
762
+
763
+ 0:22:29.409 --> 0:22:39.598
764
+ In this case we only have the first and the
765
+ second instinct, but only for all states in
766
+
767
+ 0:22:39.598 --> 0:22:40.756
768
+ parallel.
769
+
770
+ 0:22:41.101 --> 0:22:51.752
771
+ And then we can do the next step for y because
772
+ we are putting our most probable one.
773
+
774
+ 0:22:51.752 --> 0:22:58.643
775
+ We do greedy search or beam search, but you
776
+ cannot do.
777
+
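A small sketch contrasting the two regimes just described: in training all target positions are scored in one parallel pass, while at test time the decoder is called once per generated token. The `model.encode` and `model.decode` methods are hypothetical placeholders for an encoder-decoder model with masked self-attention inside the decoder.

```python
import torch
import torch.nn.functional as F

# Training: source and reference target are both known, so the decoder
# scores all target positions in one pass (causal mask inside the model).
def training_step(model, src, tgt_in, tgt_out):
    enc = model.encode(src)                      # parallel over source positions
    logits = model.decode(tgt_in, enc)           # (batch, tgt_len, vocab), parallel
    vocab = logits.size(-1)
    return F.cross_entropy(logits.reshape(-1, vocab), tgt_out.reshape(-1))

# Testing: the target is produced one token at a time (greedy here), so the
# decoder has to be called once per output position.
def greedy_decode(model, src, bos_id, eos_id, max_len=100):
    enc = model.encode(src)
    ys = [bos_id]
    for _ in range(max_len):
        logits = model.decode(torch.tensor([ys]), enc)
        next_id = int(logits[0, -1].argmax())
        ys.append(next_id)
        if next_id == eos_id:
            break
    return ys
```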
778
+ 0:23:03.663 --> 0:23:16.838
779
+ Yes, so if we are interesting in making things
780
+ more efficient for testing, which we see, for
781
+
782
+ 0:23:16.838 --> 0:23:22.363
783
+ example in the scenario of really our.
784
+
785
+ 0:23:22.642 --> 0:23:34.286
786
+ It makes sense that we think about our architecture
787
+ and that we are currently working on attention
788
+
789
+ 0:23:34.286 --> 0:23:35.933
790
+ based models.
791
+
792
+ 0:23:36.096 --> 0:23:44.150
793
+ The decoder there is some of the most time
794
+ spent testing and testing.
795
+
796
+ 0:23:44.150 --> 0:23:47.142
797
+ It's similar, but during.
798
+
799
+ 0:23:47.167 --> 0:23:50.248
800
+ Nothing about beam search.
801
+
802
+ 0:23:50.248 --> 0:23:59.833
803
+ It might be even more complicated because
804
+ in beam search you have to keep several different hypotheses.
805
+
806
+ 0:24:02.762 --> 0:24:15.140
807
+ So the question is what can you now do in
808
+ order to make your model more efficient and
809
+
810
+ 0:24:15.140 --> 0:24:21.905
811
+ better in translation in these types of cases?
812
+
813
+ 0:24:24.604 --> 0:24:30.178
814
+ And the one thing is to look into the encoder-
815
+ decoder trade-off.
816
+
817
+ 0:24:30.690 --> 0:24:43.898
818
+ And then until now we typically assume that
819
+ the depth of the encoder and the depth of the
820
+
821
+ 0:24:43.898 --> 0:24:48.154
822
+ decoder is roughly the same.
823
+
824
+ 0:24:48.268 --> 0:24:55.553
825
+ So if you haven't thought about it, you just
826
+ take what is running well.
827
+
828
+ 0:24:55.553 --> 0:24:57.678
829
+ You would try to do.
830
+
831
+ 0:24:58.018 --> 0:25:04.148
832
+ However, we saw now that there is a quite
833
+ big challenge and the runtime is a lot longer
834
+
835
+ 0:25:04.148 --> 0:25:04.914
836
+ than here.
837
+
838
+ 0:25:05.425 --> 0:25:14.018
839
+ The question is also the case for the calculations,
840
+ or do we have there the same issue that we
841
+
842
+ 0:25:14.018 --> 0:25:21.887
843
+ only get the good quality if we are having
844
+ high and high, so we know that making these
845
+
846
+ 0:25:21.887 --> 0:25:25.415
847
+ more depths is increasing our quality.
848
+
849
+ 0:25:25.425 --> 0:25:31.920
850
+ But what we haven't talked about is really
851
+ important that we increase the depth the same
852
+
853
+ 0:25:31.920 --> 0:25:32.285
854
+ way.
855
+
856
+ 0:25:32.552 --> 0:25:41.815
857
+ So what we can instead also do is something
858
+ like this where you have a deep encoder and
859
+
860
+ 0:25:41.815 --> 0:25:42.923
861
+ a shallow decoder.
862
+
863
+ 0:25:43.163 --> 0:25:57.386
864
+ So that would be that you, for example, have
865
+ instead of having equally many layers on the encoder and
866
+
867
+ 0:25:57.386 --> 0:25:59.757
868
+ the decoder, more layers on the encoder and fewer on the decoder.
869
+
870
+ 0:26:00.080 --> 0:26:10.469
871
+ So in this case the overall depth from start
872
+ to end would be similar and so hopefully.
873
+
874
+ 0:26:11.471 --> 0:26:21.662
875
+ But we could a lot more things hear parallelized,
876
+ and hear what is costly at the end during decoding
877
+
878
+ 0:26:21.662 --> 0:26:22.973
879
+ the decoder.
880
+
881
+ 0:26:22.973 --> 0:26:29.330
882
+ Because that is generated in an autoregressive
883
+ way; there we cannot parallelize.
884
+
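A minimal illustration of the deep-encoder / shallow-decoder idea, using the standard `torch.nn.Transformer` module only to show the asymmetric layer counts; the 12/2 split is an illustrative choice, not the configuration used in the results discussed here.

```python
import torch.nn as nn

# The encoder runs once, fully in parallel over the source, while the decoder
# is called once per output token at test time, so moving layers from the
# decoder to the encoder mainly buys decoding speed.
balanced = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=6, num_decoder_layers=6)
deep_enc = nn.Transformer(d_model=512, nhead=8,
                          num_encoder_layers=12, num_decoder_layers=2)
```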
885
+ 0:26:31.411 --> 0:26:33.727
886
+ And that that can be analyzed.
887
+
888
+ 0:26:33.727 --> 0:26:38.734
889
+ So here is some examples: Where people have
890
+ done all this.
891
+
892
+ 0:26:39.019 --> 0:26:55.710
893
+ So here it's mainly interested on the orange
894
+ things, which is auto-regressive about the
895
+
896
+ 0:26:55.710 --> 0:26:57.607
897
+ speed up.
898
+
899
+ 0:26:57.717 --> 0:27:15.031
900
+ You have the system, so agree is not exactly
901
+ the same, but it's similar.
902
+
903
+ 0:27:15.055 --> 0:27:23.004
904
+ It's always the case if you look at speed
905
+ up.
906
+
907
+ 0:27:23.004 --> 0:27:31.644
908
+ Think they put a speed of so that's the baseline.
909
+
910
+ 0:27:31.771 --> 0:27:35.348
911
+ So between and times as fast.
912
+
913
+ 0:27:35.348 --> 0:27:42.621
914
+ If you switch from a system to where you have
915
+ layers in the.
916
+
917
+ 0:27:42.782 --> 0:27:52.309
918
+ You see that although you have slightly more
919
+ parameters, more calculations are also roughly
920
+
921
+ 0:27:52.309 --> 0:28:00.283
922
+ the same, but you can speed out because now
923
+ during testing you can parallelize more.
924
+
925
+ 0:28:02.182 --> 0:28:09.754
926
+ The other thing is that you're speeding up,
927
+ but if you look at the performance it's similar,
928
+
929
+ 0:28:09.754 --> 0:28:13.500
930
+ so sometimes you improve, sometimes you lose.
931
+
932
+ 0:28:13.500 --> 0:28:20.421
933
+ There's a bit of a loss for English to Romanian,
934
+ but in general the quality is very similar.
935
+
936
+ 0:28:20.680 --> 0:28:30.343
937
+ So you see that you can keep a similar performance
938
+ while improving your speed by just having different.
939
+
940
+ 0:28:30.470 --> 0:28:34.903
941
+ And you also see the encoder layers from speed.
942
+
943
+ 0:28:34.903 --> 0:28:38.136
944
+ They don't really matter that much.
945
+
946
+ 0:28:38.136 --> 0:28:38.690
947
+ Most.
948
+
949
+ 0:28:38.979 --> 0:28:50.319
950
+ Because if you compare the 12th system to
951
+ the 6th system you have a lower performance
952
+
953
+ 0:28:50.319 --> 0:28:57.309
954
+ with six encoder layers, but the speed is
955
+ similar.
956
+
957
+ 0:28:57.897 --> 0:29:02.233
958
+ And see the huge decrease is it maybe due
959
+ to a lack of data.
960
+
961
+ 0:29:03.743 --> 0:29:11.899
962
+ Good idea would say it's not the case.
963
+
964
+ 0:29:11.899 --> 0:29:23.191
965
+ Romanian English should have the same number
966
+ of data.
967
+
968
+ 0:29:24.224 --> 0:29:31.184
969
+ Maybe it's just that something in that language.
970
+
971
+ 0:29:31.184 --> 0:29:40.702
972
+ If you generate Romanian maybe they need more
973
+ target dependencies.
974
+
975
+ 0:29:42.882 --> 0:29:46.263
976
+ The Wine's the Eye Also Don't Know Any Sex
977
+ People Want To.
978
+
979
+ 0:29:47.887 --> 0:29:49.034
980
+ There could be yeah the.
981
+
982
+ 0:29:49.889 --> 0:29:58.962
983
+ As the maybe if you go from like a movie sphere
984
+ to a hybrid sphere, you can: It's very much
985
+
986
+ 0:29:58.962 --> 0:30:12.492
987
+ easier to expand the vocabulary to English,
988
+ but it must be the vocabulary.
989
+
990
+ 0:30:13.333 --> 0:30:21.147
991
+ Have to check, but would assume that in this
992
+ case the system is not retrained, but it's
993
+
994
+ 0:30:21.147 --> 0:30:22.391
995
+ trained with.
996
+
997
+ 0:30:22.902 --> 0:30:30.213
998
+ And that's why I was assuming that they have
999
+ the same, but maybe you'll write that in this
1000
+
1001
+ 0:30:30.213 --> 0:30:35.595
1002
+ piece, for example, if they were pre-trained,
1003
+ the decoder English.
1004
+
1005
+ 0:30:36.096 --> 0:30:43.733
1006
+ But don't remember exactly if they do something
1007
+ like that, but that could be a good.
1008
+
1009
+ 0:30:45.325 --> 0:30:52.457
1010
+ So this is one of the easiest ways to speed
1011
+ up.
1012
+
1013
+ 0:30:52.457 --> 0:31:01.443
1014
+ You just switch two hyperparameters; you don't have to
1015
+ implement anything.
1016
+
1017
+ 0:31:02.722 --> 0:31:08.367
1018
+ Of course, there's other ways of doing that.
1019
+
1020
+ 0:31:08.367 --> 0:31:11.880
1021
+ We'll look into two things.
1022
+
1023
+ 0:31:11.880 --> 0:31:16.521
1024
+ The other thing is the architecture.
1025
+
1026
+ 0:31:16.796 --> 0:31:28.154
1027
+ We are now at some of the baselines that we
1028
+ are doing.
1029
+
1030
+ 0:31:28.488 --> 0:31:39.978
1031
+ However, in translation in the decoder side,
1032
+ it might not be the best solution.
1033
+
1034
+ 0:31:39.978 --> 0:31:41.845
1035
+ There is no.
1036
+
1037
+ 0:31:42.222 --> 0:31:47.130
1038
+ So we can use different types of architectures,
1039
+ also in the encoder and the.
1040
+
1041
+ 0:31:47.747 --> 0:31:52.475
1042
+ And there's two ways of what you could do
1043
+ different, or there's more ways.
1044
+
1045
+ 0:31:52.912 --> 0:31:54.825
1046
+ We will look into two todays.
1047
+
1048
+ 0:31:54.825 --> 0:31:58.842
1049
+ The one is average attention, which is a very
1050
+ simple solution.
1051
+
1052
+ 0:31:59.419 --> 0:32:01.464
1053
+ You can do as it says.
1054
+
1055
+ 0:32:01.464 --> 0:32:04.577
1056
+ It's not really attending anymore.
1057
+
1058
+ 0:32:04.577 --> 0:32:08.757
1059
+ It's just like equal attendance to everything.
1060
+
1061
+ 0:32:09.249 --> 0:32:23.422
1062
+ And the other idea, which is currently done
1063
+ in most systems which are optimized to efficiency,
1064
+
1065
+ 0:32:23.422 --> 0:32:24.913
1066
+ is we're.
1067
+
1068
+ 0:32:25.065 --> 0:32:32.623
1069
+ But on the decoder side we are then not using
1070
+ transformer or self attention, but we are using
1071
+
1072
+ 0:32:32.623 --> 0:32:39.700
1073
+ recurrent neural network because they are the
1074
+ disadvantage of recurrent neural network.
1075
+
1076
+ 0:32:39.799 --> 0:32:48.353
1077
+ And then the recurrent is normally easier
1078
+ to calculate because it only depends on inputs,
1079
+
1080
+ 0:32:48.353 --> 0:32:49.684
1081
+ the input on.
1082
+
1083
+ 0:32:51.931 --> 0:33:02.190
1084
+ So what is the difference between decoding
1085
+ and why attention is maybe not ideal
1086
+
1087
+ 0:33:02.190 --> 0:33:03.841
1088
+ for decoding?
1089
+
1090
+ 0:33:04.204 --> 0:33:14.390
1091
+ If we want to compute the new state, we only
1092
+ have to look at the input and the previous
1093
+
1094
+ 0:33:14.390 --> 0:33:15.649
1095
+ state, so.
1096
+
1097
+ 0:33:16.136 --> 0:33:19.029
1098
+ We are more conditional here networks.
1099
+
1100
+ 0:33:19.029 --> 0:33:19.994
1101
+ We have the.
1102
+
1103
+ 0:33:19.980 --> 0:33:31.291
1104
+ Dependency to a fixed number of previous ones,
1105
+ but that's rarely used for decoding.
1106
+
1107
+ 0:33:31.291 --> 0:33:39.774
1108
+ In contrast, in transformer we have this large
1109
+ dependency, so.
1110
+
1111
+ 0:33:40.000 --> 0:33:52.760
1112
+ So y t depends on y 1 up to y t minus one, so that is
1113
+ mainly not very efficient in this way; I mean
1114
+
1115
+ 0:33:52.760 --> 0:33:56.053
1116
+ it's very good because.
1117
+
1118
+ 0:33:56.276 --> 0:34:03.543
1119
+ However, the disadvantage is that we also
1120
+ have to do all these calculations, so if we
1121
+
1122
+ 0:34:03.543 --> 0:34:10.895
1123
+ more view from the point of view of efficient
1124
+ calculation, this might not be the best.
1125
+
1126
+ 0:34:11.471 --> 0:34:20.517
1127
+ So the question is, can we change our architecture
1128
+ to keep some of the advantages but make things
1129
+
1130
+ 0:34:20.517 --> 0:34:21.994
1131
+ more efficient?
1132
+
1133
+ 0:34:24.284 --> 0:34:31.131
1134
+ The one idea is what is called the average
1135
+ attention, and the interesting thing is this
1136
+
1137
+ 0:34:31.131 --> 0:34:32.610
1138
+ works surprisingly well.
1139
+
1140
+ 0:34:33.013 --> 0:34:38.917
1141
+ So the only idea what you're doing is doing
1142
+ the decoder.
1143
+
1144
+ 0:34:38.917 --> 0:34:42.646
1145
+ You're not doing attention anymore.
1146
+
1147
+ 0:34:42.646 --> 0:34:46.790
1148
+ The attention weights are all the same.
1149
+
1150
+ 0:34:47.027 --> 0:35:00.723
1151
+ So you don't calculate with query and key
1152
+ the different weights, and then you just take
1153
+
1154
+ 0:35:00.723 --> 0:35:03.058
1155
+ equal weights.
1156
+
1157
+ 0:35:03.283 --> 0:35:07.585
1158
+ So here would be one third from this, one
1159
+ third from this, and one third.
1160
+
1161
+ 0:35:09.009 --> 0:35:14.719
1162
+ And while it is simpler, you can now do
1163
+ precalculation and things get more efficient.
1164
+
1165
+ 0:35:15.195 --> 0:35:18.803
1166
+ So first go the formula that's maybe not directed
1167
+ here.
1168
+
1169
+ 0:35:18.979 --> 0:35:38.712
1170
+ So the difference here is that your new hidden
1171
+ state is the sum of all the hidden states so far, then averaged.
1172
+
1173
+ 0:35:38.678 --> 0:35:40.844
1174
+ So here would be with this.
1175
+
1176
+ 0:35:40.844 --> 0:35:45.022
1177
+ It would be one third of this plus one third
1178
+ of this.
1179
+
1180
+ 0:35:46.566 --> 0:35:57.162
1181
+ But if you calculate it this way, it's not
1182
+ yet being more efficient because you still
1183
+
1184
+ 0:35:57.162 --> 0:36:01.844
1185
+ have to sum over here all the hidden states.
1186
+
1187
+ 0:36:04.524 --> 0:36:22.932
1188
+ But you can now easily speed up these things
1189
+ by having an in between value, which is just
1190
+
1191
+ 0:36:22.932 --> 0:36:24.568
1192
+ always.
1193
+
1194
+ 0:36:25.585 --> 0:36:30.057
1195
+ If you take this as ten to one, you take this
1196
+ one plus this one.
1197
+
1198
+ 0:36:30.350 --> 0:36:36.739
1199
+ Because this one then was before this, and
1200
+ this one was this, so in the end.
1201
+
1202
+ 0:36:37.377 --> 0:36:49.545
1203
+ So now this one is not the final one in order
1204
+ to get the final one to do the average.
1205
+
1206
+ 0:36:49.545 --> 0:36:50.111
1207
+ So.
1208
+
1209
+ 0:36:50.430 --> 0:37:00.264
1210
+ But then if you do this calculation with speed
1211
+ up you can do it with a fixed number of steps.
1212
+
1213
+ 0:37:00.180 --> 0:37:11.300
1214
+ Instead of the sum, which depends on the length,
1215
+ you only have to do a fixed number of calculations to calculate
1216
+
1217
+ 0:37:11.300 --> 0:37:12.535
1218
+ this one.
1219
+
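A small sketch of the cumulative-average trick just described, assuming the decoder inputs are stacked into one tensor: the running sum is the in-between value, and each new position only needs one addition and one division.

```python
import torch

def average_attention_states(x):
    """x: (seq_len, d_model) decoder inputs.  Returns, for every position t,
    the mean of all inputs up to and including t (equal 'attention' weights)."""
    running_sum = torch.cumsum(x, dim=0)                    # g_t = g_{t-1} + x_t
    steps = torch.arange(1, x.size(0) + 1,
                         dtype=x.dtype, device=x.device).unsqueeze(1)
    return running_sum / steps                              # h_t = g_t / t

# During incremental decoding only the running sum and the step count have to
# be stored, so each new position costs a fixed number of operations.
```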
1220
+ 0:37:12.732 --> 0:37:21.718
1221
+ Can you do the lakes and the lakes?
1222
+
1223
+ 0:37:21.718 --> 0:37:32.701
1224
+ For example, light bulb here now takes and.
1225
+
1226
+ 0:37:32.993 --> 0:37:38.762
1227
+ That's a very good point and that's why this
1228
+ is now in the image.
1229
+
1230
+ 0:37:38.762 --> 0:37:44.531
1231
+ It's not very good so this is the one with
1232
+ tilder and the tilder.
1233
+
1234
+ 0:37:44.884 --> 0:37:57.895
1235
+ So this one is just the sum of these two,
1236
+ because this is just this one.
1237
+
1238
+ 0:37:58.238 --> 0:38:08.956
1239
+ So the sum of this is exactly as the sum of
1240
+ these, and the sum of these is the sum of here.
1241
+
1242
+ 0:38:08.956 --> 0:38:15.131
1243
+ So you only do the sum in here, and the multiplying.
1244
+
1245
+ 0:38:15.255 --> 0:38:22.145
1246
+ So what you can mainly do here is you can
1247
+ do it more mathematically.
1248
+
1249
+ 0:38:22.145 --> 0:38:31.531
1250
+ You can take this one-over-t out of the
1251
+ sum, and then you can calculate the sum differently.
1252
+
1253
+ 0:38:36.256 --> 0:38:42.443
1254
+ That maybe looks a bit weird and simple, so
1255
+ we were all talking about this great attention
1256
+
1257
+ 0:38:42.443 --> 0:38:47.882
1258
+ that we can focus on different parts, and a
1259
+ bit surprising on this work is now.
1260
+
1261
+ 0:38:47.882 --> 0:38:53.321
1262
+ In the end it might also work well without
1263
+ really attending and just using equal weights.
1264
+
1265
+ 0:38:53.954 --> 0:38:56.164
1266
+ Mean it's not that easy.
1267
+
1268
+ 0:38:56.376 --> 0:38:58.261
1269
+ It's like sometimes this is working.
1270
+
1271
+ 0:38:58.261 --> 0:39:00.451
1272
+ There's also report weight work that well.
1273
+
1274
+ 0:39:01.481 --> 0:39:05.848
1275
+ But I think it's an interesting way and it
1276
+ maybe shows that a lot of.
1277
+
1278
+ 0:39:05.805 --> 0:39:10.624
1279
+ Things in the self or in the transformer paper
1280
+ which are more put as like yet.
1281
+
1282
+ 0:39:10.624 --> 0:39:15.930
1283
+ These are some hyperparameters around it,
1284
+ like that you do the layer norm in between,
1285
+
1286
+ 0:39:15.930 --> 0:39:21.785
1287
+ and that you do a feed-forward before, and
1288
+ things like that, that these are also all important,
1289
+
1290
+ 0:39:21.785 --> 0:39:25.566
1291
+ and that the right set up around that is also
1292
+ very important.
1293
+
1294
+ 0:39:28.969 --> 0:39:38.598
1295
+ The other thing you can do in the end is not
1296
+ completely different from this one.
1297
+
1298
+ 0:39:38.598 --> 0:39:42.521
1299
+ It's just like a very different.
1300
+
1301
+ 0:39:42.942 --> 0:39:54.338
1302
+ And that is a recurrent network which also
1303
+ has this type of highway connection that can
1304
+
1305
+ 0:39:54.338 --> 0:40:01.330
1306
+ ignore the recurrent unit and directly put
1307
+ the input.
1308
+
1309
+ 0:40:01.561 --> 0:40:10.770
1310
+ It's not really adding out, but if you see
1311
+ the hitting step is your input, but what you
1312
+
1313
+ 0:40:10.770 --> 0:40:15.480
1314
+ can do is somehow directly go to the output.
1315
+
1316
+ 0:40:17.077 --> 0:40:28.390
1317
+ These are the four components of the simple
1318
+ recurrent unit, and the unit is motivated by GRUs
1319
+
1320
+ 0:40:28.390 --> 0:40:33.418
1321
+ and by LSTMs, which we have seen before.
1322
+
1323
+ 0:40:33.513 --> 0:40:43.633
1324
+ And that has proven to be very good for RNNs:
1325
+ gating, which allows you to have a gate on your states.
1326
+
1327
+ 0:40:44.164 --> 0:40:48.186
1328
+ In this thing we have two gates, the reset
1329
+ gate and the forget gate.
1330
+
1331
+ 0:40:48.768 --> 0:40:57.334
1332
+ So first we have the general structure which
1333
+ has a cell state.
1334
+
1335
+ 0:40:57.334 --> 0:41:01.277
1336
+ Here we have the cell state.
1337
+
1338
+ 0:41:01.361 --> 0:41:09.661
1339
+ And then this goes next, and we always get
1340
+ the different cell states over the times that.
1341
+
1342
+ 0:41:10.030 --> 0:41:11.448
1343
+ This is the cell state.
1344
+
1345
+ 0:41:11.771 --> 0:41:16.518
1346
+ How do we now calculate that just assume we
1347
+ have an initial cell state here?
1348
+
1349
+ 0:41:17.017 --> 0:41:19.670
1350
+ But the first thing is we're doing the forget
1351
+ gate.
1352
+
1353
+ 0:41:20.060 --> 0:41:34.774
1354
+ The forget gate models: should the new cell
1355
+ state mainly depend on the previous cell state
1356
+
1357
+ 0:41:34.774 --> 0:41:40.065
1358
+ or should it depend on our input.
1359
+
1360
+ 0:41:40.000 --> 0:41:41.356
1361
+ Like Add to Them.
1362
+
1363
+ 0:41:41.621 --> 0:41:42.877
1364
+ How can we model that?
1365
+
1366
+ 0:41:44.024 --> 0:41:45.599
1367
+ First we were at a cocktail.
1368
+
1369
+ 0:41:45.945 --> 0:41:52.151
1370
+ The forget gait is depending on minus one.
1371
+
1372
+ 0:41:52.151 --> 0:41:56.480
1373
+ You also see here the formula.
1374
+
1375
+ 0:41:57.057 --> 0:42:01.963
1376
+ So we are multiplying both the cell state
1377
+ and our input.
1378
+
1379
+ 0:42:01.963 --> 0:42:04.890
1380
+ With some weights we are getting.
1381
+
1382
+ 0:42:05.105 --> 0:42:08.472
1383
+ We are adding some bias vector and then
1384
+ we are applying a sigmoid on that.
1385
+
1386
+ 0:42:08.868 --> 0:42:13.452
1387
+ So in the end we have numbers between zero
1388
+ and one saying for each dimension.
1389
+
1390
+ 0:42:13.853 --> 0:42:22.041
1391
+ Like how much if it's near to zero we will
1392
+ mainly use the new input.
1393
+
1394
+ 0:42:22.041 --> 0:42:31.890
1395
+ If it's near to one we will keep the old cell state
1396
+ and ignore the input at this dimension.
1397
+
1398
+ 0:42:33.313 --> 0:42:40.173
1399
+ And by this motivation we can then create
1400
+ here the new cell state, and here you see
1401
+
1402
+ 0:42:40.173 --> 0:42:41.141
1403
+ the formula.
1404
+
1405
+ 0:42:41.601 --> 0:42:55.048
1406
+ So you take your forget gate and multiply
1407
+ it with your previous cell state.
1408
+
1409
+ 0:42:55.048 --> 0:43:00.427
1410
+ So if my was around then.
1411
+
1412
+ 0:43:00.800 --> 0:43:07.405
1413
+ In the other case, when the value was others,
1414
+ that's what you added.
1415
+
1416
+ 0:43:07.405 --> 0:43:10.946
1417
+ Then you're adding a transformation.
1418
+
1419
+ 0:43:11.351 --> 0:43:24.284
1420
+ So if this value was maybe zero then you're
1421
+ putting most of the information from inputting.
1422
+
1423
+ 0:43:25.065 --> 0:43:26.947
1424
+ Is already your element?
1425
+
1426
+ 0:43:26.947 --> 0:43:30.561
1427
+ The only question is now based on your element.
1428
+
1429
+ 0:43:30.561 --> 0:43:32.067
1430
+ What is the output?
1431
+
1432
+ 0:43:33.253 --> 0:43:47.951
1433
+ And there you have another opportunity so
1434
+ you can either take the output or instead you
1435
+
1436
+ 0:43:47.951 --> 0:43:50.957
1437
+ prefer the input.
1438
+
1439
+ 0:43:52.612 --> 0:43:58.166
1440
+ So is the value also the same for the reset
1441
+ gate and the forget gate?
1442
+
1443
+ 0:43:58.166 --> 0:43:59.417
1444
+ Yes, the movie.
1445
+
1446
+ 0:44:00.900 --> 0:44:10.004
1447
+ Yes exactly so the matrices are different
1448
+ and therefore it can be and that should be
1449
+
1450
+ 0:44:10.004 --> 0:44:16.323
1451
+ and maybe there is sometimes you want to have
1452
+ information.
1453
+
1454
+ 0:44:16.636 --> 0:44:23.843
1455
+ So here again we have this vector with values
1456
+ between zero and which says controlling how
1457
+
1458
+ 0:44:23.843 --> 0:44:25.205
1459
+ the information.
1460
+
1461
+ 0:44:25.505 --> 0:44:36.459
1462
+ And then the output is calculated here similar
1463
+ to a cell stage, but again input is from.
1464
+
1465
+ 0:44:36.536 --> 0:44:45.714
1466
+ So either the reset gate decides should give
1467
+ what is currently stored in there, or.
1468
+
1469
+ 0:44:46.346 --> 0:44:58.647
1470
+ So it's not exactly as the thing we had before,
1471
+ with the residual connections where we added
1472
+
1473
+ 0:44:58.647 --> 0:45:01.293
1474
+ up, but here we do a gated combination.
1475
+
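A sketch of one step of such a gated cell with a forget gate and a reset (highway) gate, in the spirit of what is described above. As a simplifying assumption the gates here are computed from the input only, which is exactly what enables the parallelization discussed next; the version in the lecture also feeds the previous cell state into the gates.

```python
import torch

def gated_recurrent_step(x_t, c_prev, W, Wf, bf, Wr, br):
    """One step for vectors of size d (weight matrices d x d): a forget gate
    mixes the old cell state with a projection of the new input, and a reset
    (highway) gate mixes the cell state with the raw input for the output."""
    f_t = torch.sigmoid(Wf @ x_t + bf)              # forget gate
    r_t = torch.sigmoid(Wr @ x_t + br)              # reset / highway gate
    c_t = f_t * c_prev + (1.0 - f_t) * (W @ x_t)    # new cell state
    h_t = r_t * c_t + (1.0 - r_t) * x_t             # gated highway output
    return h_t, c_t
```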
1476
+ 0:45:04.224 --> 0:45:08.472
1477
+ This is the general idea of a simple recurrent
1478
+ neural network.
1479
+
1480
+ 0:45:08.472 --> 0:45:13.125
1481
+ Then we will now look at how we can make things
1482
+ even more efficient.
1483
+
1484
+ 0:45:13.125 --> 0:45:17.104
1485
+ But first do you have more questions on how
1486
+ it is working?
1487
+
1488
+ 0:45:23.063 --> 0:45:38.799
1489
+ Now these calculations are a bit where things
1490
+ get more efficient because this somehow.
1491
+
1492
+ 0:45:38.718 --> 0:45:43.177
1493
+ It depends on all the other dimensions, and for the
1494
+ second one also.
1495
+
1496
+ 0:45:43.423 --> 0:45:48.904
1497
+ Because if you do a matrix multiplication
1498
+ with a vector like for the output vector, each
1499
+
1500
+ 0:45:48.904 --> 0:45:52.353
1501
+ dimension of the output vector depends on all
1502
+ the other.
1503
+
1504
+ 0:45:52.973 --> 0:46:06.561
1505
+ The cell state here depends because this one
1506
+ is used here, and somehow the first dimension
1507
+
1508
+ 0:46:06.561 --> 0:46:11.340
1509
+ of the cell state only depends.
1510
+
1511
+ 0:46:11.931 --> 0:46:17.973
1512
+ In order to make that, of course, is sometimes
1513
+ again making things less parallelizable if things
1514
+
1515
+ 0:46:17.973 --> 0:46:18.481
1516
+ depend.
1517
+
1518
+ 0:46:19.359 --> 0:46:35.122
1519
+ We can easily make that different by changing
1520
+ from the matrix product to an element-wise product.
1521
+
1522
+ 0:46:35.295 --> 0:46:51.459
1523
+ So you do first, just like inside here, you
1524
+ take like the first dimension, my second dimension.
1525
+
1526
+ 0:46:52.032 --> 0:46:53.772
1527
+ Is, of course, narrow.
1528
+
1529
+ 0:46:53.772 --> 0:46:59.294
1530
+ This should be reset or this should be because
1531
+ it should be a different.
1532
+
1533
+ 0:46:59.899 --> 0:47:12.053
1534
+ Now the first dimension only depends on the
1535
+ first dimension, so you don't have dependencies
1536
+
1537
+ 0:47:12.053 --> 0:47:16.148
1538
+ any longer between dimensions.
1539
+
1540
+ 0:47:18.078 --> 0:47:25.692
1541
+ Maybe it gets a bit clearer if you see about
1542
+ it in this way, so what we have to do now.
1543
+
1544
+ 0:47:25.966 --> 0:47:31.911
1545
+ First, we have to do a matrix multiplication
1546
+ on the input to get the gates.
1547
+
1548
+ 0:47:32.292 --> 0:47:38.041
1549
+ And then we only have the element wise operations
1550
+ where we take this output.
1551
+
1552
+ 0:47:38.041 --> 0:47:38.713
1553
+ We take.
1554
+
1555
+ 0:47:39.179 --> 0:47:42.978
1556
+ Minus one and our original.
1557
+
1558
+ 0:47:42.978 --> 0:47:52.748
1559
+ Here we only have element-wise operations which
1560
+ can be optimally parallelized.
1561
+
1562
+ 0:47:53.273 --> 0:48:07.603
1563
+ So here we can additionally parallelize things
1564
+ across the dimensions and don't have to do that sequentially.
1565
+
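A sketch of the efficiency reformulation just described, under the same simplifying assumption as before (gates computed from the input only): every matrix multiplication touches only the inputs and can be done for the whole sequence at once, while the per-step recurrence is purely element-wise, so each dimension depends only on itself.

```python
import torch

def gated_recurrent_forward(X, W, Wf, bf, Wr, br):
    """X: (T, d); W, Wf, Wr: (d, d); bf, br: (d,)."""
    U = X @ W.T                          # projected inputs, all steps at once
    F = torch.sigmoid(X @ Wf.T + bf)     # forget gates, all steps at once
    R = torch.sigmoid(X @ Wr.T + br)     # reset gates, all steps at once
    c = torch.zeros(X.size(1))
    outputs = []
    for t in range(X.size(0)):           # only cheap element-wise work per step
        c = F[t] * c + (1.0 - F[t]) * U[t]
        outputs.append(R[t] * c + (1.0 - R[t]) * X[t])
    return torch.stack(outputs), c
```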
1566
+ 0:48:09.929 --> 0:48:24.255
1567
+ Yeah, but this you can do like in parallel
1568
+ again for all xts.
1569
+
1570
+ 0:48:24.544 --> 0:48:33.014
1571
+ Here you can't do it in parallel, but you
1572
+ only have to do it on each seat, and then you
1573
+
1574
+ 0:48:33.014 --> 0:48:34.650
1575
+ can parallelize.
1576
+
1577
+ 0:48:35.495 --> 0:48:39.190
1578
+ But this maybe for the dimension.
1579
+
1580
+ 0:48:39.190 --> 0:48:42.124
1581
+ Maybe it's also important.
1582
+
1583
+ 0:48:42.124 --> 0:48:46.037
1584
+ I don't know if they have tried it.
1585
+
1586
+ 0:48:46.037 --> 0:48:55.383
1587
+ I assume it's not only for dimension reduction,
1588
+ but it's hard because you can easily.
1589
+
1590
+ 0:49:01.001 --> 0:49:08.164
1591
+ People have even like made the second thing
1592
+ even more easy.
1593
+
1594
+ 0:49:08.164 --> 0:49:10.313
1595
+ So there is this.
1596
+
1597
+ 0:49:10.313 --> 0:49:17.954
1598
+ This is how we have the highway connections
1599
+ in the transformer.
1600
+
1601
+ 0:49:17.954 --> 0:49:20.699
1602
+ Then it's like you do.
1603
+
1604
+ 0:49:20.780 --> 0:49:24.789
1605
+ So that is like how things are put together
1606
+ as a transformer.
1607
+
1608
+ 0:49:25.125 --> 0:49:39.960
1609
+ And that is a similar and simple recurring
1610
+ neural network where you do exactly the same
1611
+
1612
+ 0:49:39.960 --> 0:49:44.512
1613
+ for the so you don't have.
1614
+
1615
+ 0:49:46.326 --> 0:49:47.503
1616
+ This type of things.
1617
+
1618
+ 0:49:49.149 --> 0:50:01.196
1619
+ And with this we are at the end of how to
1620
+ make efficient architectures before we go to
1621
+
1622
+ 0:50:01.196 --> 0:50:02.580
1623
+ the next.
1624
+
1625
+ 0:50:13.013 --> 0:50:24.424
1626
+ Between the encoder-decoder trade-off and the architectures
1627
+ there is a next technique which is used in
1628
+
1629
+ 0:50:24.424 --> 0:50:28.988
1630
+ nearly all deep learning and is very successful.
1631
+
1632
+ 0:50:29.449 --> 0:50:43.463
1633
+ So the idea is can we extract the knowledge
1634
+ from a large network into a smaller one, but
1635
+
1636
+ 0:50:43.463 --> 0:50:45.983
1637
+ it's similarly.
1638
+
1639
+ 0:50:47.907 --> 0:50:53.217
1640
+ And the nice thing is that this really works,
1641
+ and it may be very, very surprising.
1642
+
1643
+ 0:50:53.673 --> 0:51:03.000
1644
+ So the idea is that we have a large, strong
1645
+ model which we train for long, and the question
1646
+
1647
+ 0:51:03.000 --> 0:51:07.871
1648
+ is: Can that help us to train a smaller model?
1649
+
1650
+ 0:51:08.148 --> 0:51:16.296
1651
+ So can what we refer to as teacher model tell
1652
+ us better to build a small student model than
1653
+
1654
+ 0:51:16.296 --> 0:51:17.005
1655
+ before.
1656
+
1657
+ 0:51:17.257 --> 0:51:27.371
1658
+ So what we're before in it as a student model,
1659
+ we learn from the data and that is how we train
1660
+
1661
+ 0:51:27.371 --> 0:51:28.755
1662
+ our systems.
1663
+
1664
+ 0:51:29.249 --> 0:51:37.949
1665
+ The question is: Can we train this small model
1666
+ better if we are not only learning from the
1667
+
1668
+ 0:51:37.949 --> 0:51:46.649
1669
+ data, but we are also learning from a large
1670
+ model which has been trained maybe in the same
1671
+
1672
+ 0:51:46.649 --> 0:51:47.222
1673
+ data?
1674
+
1675
+ 0:51:47.667 --> 0:51:55.564
1676
+ So that you have then in the end a smaller
1677
+ model that is somehow better performing than.
1678
+
1679
+ 0:51:55.895 --> 0:51:59.828
1680
+ And maybe that's on the first view.
1681
+
1682
+ 0:51:59.739 --> 0:52:05.396
1683
+ Very very surprising because it has seen the
1684
+ same data so it should have learned the same
1685
+
1686
+ 0:52:05.396 --> 0:52:11.053
1687
+ so the baseline model trained only on the data
1688
+ and the student teacher knowledge to still
1689
+
1690
+ 0:52:11.053 --> 0:52:11.682
1691
+ model it.
1692
+
1693
+ 0:52:11.682 --> 0:52:17.401
1694
+ They all have seen only this data because
1695
+ your teacher modeling was also trained typically
1696
+
1697
+ 0:52:17.401 --> 0:52:19.161
1698
+ only on this model however.
1699
+
1700
+ 0:52:20.580 --> 0:52:30.071
1701
+ It has by now shown that by many ways the
1702
+ model trained in the teacher-student framework
1703
+
1704
+ 0:52:30.071 --> 0:52:32.293
1705
+ is performing better.
1706
+
1707
+ 0:52:33.473 --> 0:52:40.971
1708
+ A bit of an explanation when we see how that
1709
+ works.
1710
+
1711
+ 0:52:40.971 --> 0:52:46.161
1712
+ There's different ways of doing it.
1713
+
1714
+ 0:52:46.161 --> 0:52:47.171
1715
+ Maybe.
1716
+
1717
+ 0:52:47.567 --> 0:52:51.501
1718
+ So how does it work?
1719
+
1720
+ 0:52:51.501 --> 0:53:04.802
1721
+ This is our student network, the normal one,
1722
+ some type of neural network.
1723
+
1724
+ 0:53:04.802 --> 0:53:06.113
1725
+ We're.
1726
+
1727
+ 0:53:06.586 --> 0:53:17.050
1728
+ So we are training the model to predict the
1729
+ same thing as we are doing that by calculating.
1730
+
1731
+ 0:53:17.437 --> 0:53:23.173
1732
+ The cross-entropy loss was defined in a way
1733
+ saying that the probability for the
1734
+
1735
+ 0:53:23.173 --> 0:53:25.332
1736
+ correct word should be as high as possible.
1737
+
1738
+ 0:53:25.745 --> 0:53:32.207
1739
+ So you are calculating your output probabilities
1740
+ always, and at each time step you have an output
1741
+
1742
+ 0:53:32.207 --> 0:53:33.055
1743
+ probability.
1744
+
1745
+ 0:53:33.055 --> 0:53:38.669
1746
+ What is the most probable in the next word
1747
+ and your training signal is put as much of
1748
+
1749
+ 0:53:38.669 --> 0:53:43.368
1750
+ your probability mass to the correct word to
1751
+ the word that is there in.
1752
+
1753
+ 0:53:43.903 --> 0:53:51.367
1754
+ And this is achieved by this cross-entropy
1755
+ loss, which sums over all training
1756
+
1757
+ 0:53:51.367 --> 0:53:58.664
1758
+ examples and all positions, and sums over the
1759
+ full vocabulary, and then this one is this
1760
+
1761
+ 0:53:58.664 --> 0:54:03.947
1762
+ indicator that this current word is the k-th word
1763
+ in the vocabulary.
1764
+
1765
+ 0:54:04.204 --> 0:54:11.339
1766
+ And then we take here the log probability
1767
+ of that, so what we made me do is: We have
1768
+
1769
+ 0:54:11.339 --> 0:54:27.313
1770
+ this metric here, so each position of your
1771
+ vocabulary size.
1772
+
1773
+ 0:54:27.507 --> 0:54:38.656
1774
+ In the end what you just do is sum up these
1775
+ log probabilities, and then you want
1776
+
1777
+ 0:54:38.656 --> 0:54:40.785
1778
+ to have as much.
1779
+
1780
+ 0:54:41.041 --> 0:54:54.614
1781
+ So although this is a thumb over this metric
1782
+ here, in the end of each dimension you.
1783
+
1784
+ 0:54:54.794 --> 0:55:06.366
1785
+ So that is the normal cross-entropy loss that
1786
+ we have discussed at the very beginning of
1787
+
1788
+ 0:55:06.366 --> 0:55:07.016
1789
+ how.
1790
+
1791
+ 0:55:08.068 --> 0:55:15.132
1792
+ So what can we do differently in the teacher
1793
+ network?
1794
+
1795
+ 0:55:15.132 --> 0:55:23.374
1796
+ We also have a teacher network which is trained
1797
+ on large data.
1798
+
1799
+ 0:55:24.224 --> 0:55:35.957
1800
+ And of course this distribution might be better
1801
+ than the one from the small model because it's.
1802
+
1803
+ 0:55:36.456 --> 0:55:40.941
1804
+ So in this case we have now the training signal
1805
+ from the teacher network.
1806
+
1807
+ 0:55:41.441 --> 0:55:46.262
1808
+ And it's the same way as we had before.
1809
+
1810
+ 0:55:46.262 --> 0:55:56.507
1811
+ The only difference is we're training not towards
1812
+ the ground-truth probability distribution
1813
+
1814
+ 0:55:56.507 --> 0:55:59.159
1815
+ here, which is sharp.
1816
+
1817
+ 0:55:59.299 --> 0:56:11.303
1818
+ That's also a probability, so this word has
1819
+ a high probability, but other words also have some probability.
1820
+
1821
+ 0:56:12.612 --> 0:56:19.577
1822
+ And that is the main difference.
1823
+
1824
+ 0:56:19.577 --> 0:56:30.341
1825
+ Typically you do an interpolation of
1826
+ these two.
1827
+
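A minimal sketch of this word-level knowledge distillation loss: the usual cross-entropy against the reference words is interpolated with a cross-entropy against the teacher's soft distribution. The interpolation weight `alpha` is an illustrative choice, not a value taken from the lecture.

```python
import torch
import torch.nn.functional as F

def word_level_distillation_loss(student_logits, teacher_logits, gold_ids, alpha=0.5):
    """student_logits, teacher_logits: (batch, seq_len, vocab);
    gold_ids: (batch, seq_len) indices of the reference words."""
    vocab = student_logits.size(-1)
    # hard part: cross-entropy against the one-hot reference words
    hard = F.cross_entropy(student_logits.reshape(-1, vocab), gold_ids.reshape(-1))
    # soft part: cross-entropy against the teacher's full distribution
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    soft = -(p_teacher * log_p_student).sum(dim=-1).mean()
    return alpha * hard + (1.0 - alpha) * soft
```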
1828
+ 0:56:33.213 --> 0:56:38.669
1829
+ Because there's more information contained
1830
+ in the distribution than in the ground truth,
1831
+
1832
+ 0:56:38.669 --> 0:56:44.187
1833
+ because it encodes more information about the
1834
+ language, because language always has more
1835
+
1836
+ 0:56:44.187 --> 0:56:47.907
1837
+ options to express the same sentence.
1838
+ Yes, exactly.
1839
+
1840
+ 0:56:47.907 --> 0:56:53.114
1841
+ So there's ambiguity in there that is encoded
1842
+ hopefully very well in the complaint.
1843
+
1844
+ 0:56:53.513 --> 0:56:57.257
1845
+ Trade you two networks so better than a student
1846
+ network you have in there from your learner.
1847
+
1848
+ 0:56:57.537 --> 0:57:05.961
1849
+ So maybe often there's only one correct word,
1850
+ but it might be two or three, and then all
1851
+
1852
+ 0:57:05.961 --> 0:57:10.505
1853
+ of these three have a probability distribution.
1854
+
1855
+ 0:57:10.590 --> 0:57:21.242
1856
+ And then is the main advantage or one explanation
1857
+ of why it's better to train from the.
1858
+
1859
+ 0:57:21.361 --> 0:57:32.652
1860
+ Of course, it's good to also keep the signal
1861
+ in there because then you can prevent it because
1862
+
1863
+ 0:57:32.652 --> 0:57:33.493
1864
+ crazy.
1865
+
1866
+ 0:57:37.017 --> 0:57:49.466
1867
+ Any more questions on the first type of knowledge
1868
+ distillation, also distribution changes.
1869
+
1870
+ 0:57:50.550 --> 0:58:02.202
1871
+ Coming around again, this would put it a bit
1872
+ different, so this is not a solution to maintenance
1873
+
1874
+ 0:58:02.202 --> 0:58:04.244
1875
+ or distribution.
1876
+
1877
+ 0:58:04.744 --> 0:58:12.680
1878
+ But I don't think it's performing worse than
1879
+ only training on the ground truth.
1880
+
1881
+ 0:58:13.113 --> 0:58:21.254
1882
+ So it's more like it's not improving you would
1883
+ assume it's similarly helping you, but.
1884
+
1885
+ 0:58:21.481 --> 0:58:28.145
1886
+ Of course, if you now have a teacher, maybe
1887
+ you have no data in your target domain,
1888
+
1889
+ 0:58:28.145 --> 0:58:28.524
1890
+ but.
1891
+
1892
+ 0:58:28.888 --> 0:58:39.895
1893
+ Then you can use this one which is not the
1894
+ ground truth but helpful to learn better for
1895
+
1896
+ 0:58:39.895 --> 0:58:42.147
1897
+ the distribution.
1898
+
1899
+ 0:58:46.326 --> 0:58:57.012
1900
+ The second idea is to do sequence level knowledge
1901
+ distillation, so what we have in this case
1902
+
1903
+ 0:58:57.012 --> 0:59:02.757
1904
+ is we have looked at each position independently.
1905
+
1906
+ 0:59:03.423 --> 0:59:05.436
1907
+ Mean, we do that often.
1908
+
1909
+ 0:59:05.436 --> 0:59:10.972
1910
+ We are not generating a lot of sequences,
1911
+ but that has a problem.
1912
+
1913
+ 0:59:10.972 --> 0:59:13.992
1914
+ We have this propagation of errors.
1915
+
1916
+ 0:59:13.992 --> 0:59:16.760
1917
+ We start with one area and then.
1918
+
1919
+ 0:59:17.237 --> 0:59:27.419
1920
+ So if we are doing word-level knowledge distillation,
1921
+ we are treating each word in the sentence independently.
1922
+
1923
+ 0:59:28.008 --> 0:59:32.091
1924
+ So we are not trying to like somewhat model
1925
+ the dependency between.
1926
+
1927
+ 0:59:32.932 --> 0:59:47.480
1928
+ We can try to do that by sequence level knowledge
1929
+ distillation, but the problem is, of course:
1930
+
1931
+ 0:59:47.847 --> 0:59:53.478
1932
+ So we can that for each position we can get
1933
+ a distribution over all the words at this.
1934
+
1935
+ 0:59:53.793 --> 1:00:05.305
1936
+ But if we want to have a distribution of all
1937
+ possible target sentences, that's not possible
1938
+
1939
+ 1:00:05.305 --> 1:00:06.431
1940
+ because.
1941
+
1942
+ 1:00:08.508 --> 1:00:15.940
1943
+ So we can then again do a bit of a hack
1944
+ on that.
1945
+
1946
+ 1:00:15.940 --> 1:00:23.238
1947
+ If we can't have a distribution of all sentences,
1948
+ it.
1949
+
1950
+ 1:00:23.843 --> 1:00:30.764
1951
+ So what we can do is we can use the
1952
+ teacher network and sample different translations.
1953
+
1954
+ 1:00:31.931 --> 1:00:39.327
1955
+ And now we can do different ways to train
1956
+ them.
1957
+
1958
+ 1:00:39.327 --> 1:00:49.343
1959
+ We can use them as their probability, the
1960
+ easiest one to assume.
1961
+
1962
+ 1:00:50.050 --> 1:00:56.373
1963
+ So what that ends to is that we're taking
1964
+ our teacher network, we're generating some
1965
+
1966
+ 1:00:56.373 --> 1:01:01.135
1967
+ translations, and these ones we're using as
1968
+ additional training data.
1969
+
1970
+ 1:01:01.781 --> 1:01:11.382
1971
+ Then we have mainly done this sequence level
1972
+ because the teacher network tells us:
1973
+
1974
+ 1:01:11.382 --> 1:01:17.513
1975
+ These are all probable translations of the
1976
+ sentence.
1977
+
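A sketch of the data-generation step for sequence-level knowledge distillation as described here: the teacher translates the training source side (e.g. with beam search) and its outputs become the student's training targets. `teacher.translate` is a hypothetical interface, not the API of any specific toolkit.

```python
def sequence_level_distillation_data(teacher, source_sentences, beam_size=5):
    """Return (source, teacher hypothesis) pairs to train the student on."""
    pairs = []
    for src in source_sentences:
        hyp = teacher.translate(src, beam_size=beam_size)   # hypothetical call
        pairs.append((src, hyp))
    return pairs
```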
1978
+ 1:01:26.286 --> 1:01:34.673
1979
+ And then you can do a bit of a yeah, and you
1980
+ can try to better make a bit of an interpolated
1981
+
1982
+ 1:01:34.673 --> 1:01:36.206
1983
+ version of that.
1984
+
1985
+ 1:01:36.716 --> 1:01:42.802
1986
+ So what people have also done is like sequence-
1987
+ level interpolation.
1988
+
1989
+ 1:01:42.802 --> 1:01:52.819
1990
+ You generate here several translations: But
1991
+ then you don't use all of them.
1992
+
1993
+ 1:01:52.819 --> 1:02:00.658
1994
+ You use some metric to decide which of these ones to keep.
1995
+
1996
+ 1:02:01.021 --> 1:02:12.056
1997
+ So rather than training on the ground truth,
1998
+ which might be improbable or unreachable
1999
+
2000
+ 1:02:12.056 --> 1:02:16.520
2001
+ because we cannot generate everything,
2002
+
2003
+ 1:02:16.676 --> 1:02:23.378
2004
+ And we are giving it an easier solution which
2005
+ is also good quality and training of that.
2006
+
2007
+ 1:02:23.703 --> 1:02:32.602
2008
+ So you're not training it on a very difficult
2009
+ solution, but you're training it on an easier
2010
+
2011
+ 1:02:32.602 --> 1:02:33.570
2012
+ solution.
2013
+
2014
+ 1:02:36.356 --> 1:02:38.494
2015
+ Any More Questions to This.
2016
+
2017
+ 1:02:40.260 --> 1:02:41.557
2018
+ Yeah.
2019
+
2020
+ 1:02:41.461 --> 1:02:44.296
2021
+ Good.
2022
+
2023
+ 1:02:43.843 --> 1:03:01.642
2024
+ The next idea is to look at the vocabulary, so the problem
2025
+ is we have seen that vocabulary calculations
2026
+
2027
+ 1:03:01.642 --> 1:03:06.784
2028
+ are often very time-consuming.
2029
+
2030
+ 1:03:09.789 --> 1:03:19.805
2031
+ The thing is that most of the vocabulary is
2032
+ not needed for each sentence, so in each sentence.
2033
+
2034
+ 1:03:20.280 --> 1:03:28.219
2035
+ The question is: Can we somehow easily precalculate,
2036
+ which words are probable to occur in the sentence,
2037
+
2038
+ 1:03:28.219 --> 1:03:30.967
2039
+ and then only calculate these ones?
2040
+
2041
+ 1:03:31.691 --> 1:03:34.912
2042
+ And this can be done so.
2043
+
2044
+ 1:03:34.912 --> 1:03:43.932
2045
+ For example, if you have sentenced card, it's
2046
+ probably not happening.
2047
+
2048
+ 1:03:44.164 --> 1:03:48.701
2049
+ So what you can try to do is to limit your
2050
+ vocabulary.
2051
+
2052
+ 1:03:48.701 --> 1:03:51.093
2053
+ You're considering for each.
2054
+
2055
+ 1:03:51.151 --> 1:04:04.693
2056
+ So you're no longer taking the full vocabulary
2057
+ as possible output, but you're restricting.
2058
+
2059
+ 1:04:06.426 --> 1:04:18.275
2060
+ What typically works is that we always include
2061
+ the most frequent target words, because
2062
+
2063
+ 1:04:18.275 --> 1:04:23.613
2064
+ these are not so easy to align to source words.
2065
+
2066
+ 1:04:23.964 --> 1:04:32.241
2067
+ So we take the most frequent target words and
2068
+ then the words that often align with one of the
2069
+
2070
+ 1:04:32.241 --> 1:04:32.985
2071
+ source words.
2072
+
2073
+ 1:04:33.473 --> 1:04:46.770
2074
+ So for each source word you calculate the
2075
+ word alignment on your training data, and then
2076
+
2077
+ 1:04:46.770 --> 1:04:51.700
2078
+ you calculate which target words often occur as its translation.
2079
+
2080
+ 1:04:52.352 --> 1:04:57.680
2081
+ And then for decoding you build this union
2082
+ of maybe the source word list that other.
2083
+
2084
+ 1:04:59.960 --> 1:05:02.145
2085
+ Are like for each source work.
2086
+
2087
+ 1:05:02.145 --> 1:05:08.773
2088
+ One of the most frequent translations of these
2089
+ source words, for example for each source work
2090
+
2091
+ 1:05:08.773 --> 1:05:13.003
2092
+ like in the most frequent ones, and then the
2093
+ most frequent.
2094
+
2095
+ 1:05:13.193 --> 1:05:24.333
2096
+ In total, if you have short sentences, you
2097
+ have a lot less words, so in most cases it's
2098
+
2099
+ 1:05:24.333 --> 1:05:26.232
2100
+ not more than.
2101
+
2102
+ 1:05:26.546 --> 1:05:33.957
2103
+ And so you have dramatically reduced your
2104
+ vocabulary, and thereby can also speed up the output computation.
2105
+
2106
+ 1:05:35.495 --> 1:05:43.757
2107
+ That sounds easy, but does anybody see what is challenging
2108
+ here and why that might not always help?
2109
+
2110
+ 1:05:47.687 --> 1:05:54.448
2111
+ The performance is not why this might not.
2112
+
2113
+ 1:05:54.448 --> 1:06:01.838
2114
+ If you implement it, it might not be a strong.
2115
+
2116
+ 1:06:01.941 --> 1:06:06.053
2117
+ You have to store this list.
2118
+
2119
+ 1:06:06.053 --> 1:06:14.135
2120
+ You have to build the union, and of course that
2121
+ costs part of the saved time.
2122
+
2123
+ 1:06:14.554 --> 1:06:21.920
2124
+ The second thing the vocabulary is used in
2125
+ our last step, so we have the hidden state,
2126
+
2127
+ 1:06:21.920 --> 1:06:23.868
2128
+ and then we calculate.
2129
+
2130
+ 1:06:24.284 --> 1:06:29.610
2131
+ Now we are no longer calculating them for
2132
+ all output words, but for a subset of them.
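To make the second point concrete, a hedged sketch of what "only calculate a subset" means at the output layer (shapes and names are illustrative assumptions, not the lecture's implementation): the logits are computed only for the rows of the output projection that belong to the shortlist.

```python
import torch

# Sketch: compute the softmax only over a shortlist of output words.
# `W_out` (V x d) and `hidden` (d,) stand in for a trained model's output
# projection and decoder state; `shortlist_ids` is the per-sentence vocabulary.
V, d = 32000, 512
W_out = torch.randn(V, d)
hidden = torch.randn(d)
shortlist_ids = torch.tensor([5, 42, 7, 1003])

logits = W_out[shortlist_ids] @ hidden   # (len(shortlist),) instead of (V,)
probs = torch.softmax(logits, dim=-1)
best = shortlist_ids[probs.argmax()]     # map back to the full-vocabulary id
```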
2133
+
2134
+ 1:06:30.430 --> 1:06:35.613
2135
+ However, this matrix multiplication is typically
2136
+ parallelized, and the GPU is very good at that.
2137
+
2138
+ 1:06:35.956 --> 1:06:46.937
2139
+ But if you now only calculate some of them,
2140
+ and you're not implementing it right, it will take
2141
+
2142
+ 1:06:46.937 --> 1:06:52.794
2143
+ as long as before because of the nature of
2144
+ the.
2145
+
2146
+ 1:06:56.776 --> 1:07:07.997
2147
+ Here for beam search there's some ideas of
2148
+ course you can go back to greedy search because
2149
+
2150
+ 1:07:07.997 --> 1:07:10.833
2151
+ that's more efficient.
2152
+
2153
+ 1:07:11.651 --> 1:07:18.347
2154
+ And better quality, and you can buffer some
2155
+ states in between, so how much buffering it's
2156
+
2157
+ 1:07:18.347 --> 1:07:22.216
2158
+ again this tradeoff between calculation and
2159
+ memory.
2160
+
2161
+ 1:07:25.125 --> 1:07:41.236
2162
+ Then at the end of today what we want to look
2163
+ into is one last type of new machine translation
2164
+
2165
+ 1:07:41.236 --> 1:07:42.932
2166
+ approach.
2167
+
2168
+ 1:07:43.403 --> 1:07:53.621
2169
+ And the idea is what we've already seen in
2170
+ our first two steps is that this autoregressive
2171
+
2172
+ 1:07:53.621 --> 1:07:57.246
2173
+ part is the decoding: the encoder
2174
+
2175
+ 1:07:57.557 --> 1:08:04.461
2176
+ Can process everything in parallel, but we
2177
+ are always taking the most probable and then.
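The sequential nature described here can be shown in a few lines. This is only a sketch: `model.step`, the token ids and the length limit are assumptions for illustration, not an actual API from the lecture.

```python
# Minimal sketch of why autoregressive decoding is sequential: each step
# feeds the previously generated token back in before the next one can start.
def greedy_decode(model, encoder_states, bos_id, eos_id, max_len=100):
    tokens = [bos_id]
    for _ in range(max_len):
        probs = model.step(encoder_states, tokens)  # depends on all previous tokens
        next_id = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```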
2178
+
2179
+ 1:08:05.905 --> 1:08:10.476
2180
+ The question is: Do we really need to do that?
2181
+
2182
+ 1:08:10.476 --> 1:08:14.074
2183
+ Therefore, there is a bunch of work.
2184
+
2185
+ 1:08:14.074 --> 1:08:16.602
2186
+ Can we do it differently?
2187
+
2188
+ 1:08:16.602 --> 1:08:19.616
2189
+ Can we generate a full target?
2190
+
2191
+ 1:08:20.160 --> 1:08:29.417
2192
+ We'll see it's not that easy and there's still
2193
+ an open debate whether this is really faster
2194
+
2195
+ 1:08:29.417 --> 1:08:31.832
2196
+ and quality, but think.
2197
+
2198
+ 1:08:32.712 --> 1:08:45.594
2199
+ So, as said, what we have done is our encoder
2200
+ decoder where we can process our encoder in parallel,
2201
+
2202
+ 1:08:45.594 --> 1:08:50.527
2203
+ and then the output always depends.
2204
+
2205
+ 1:08:50.410 --> 1:08:54.709
2206
+ We generate the output and then we have to
2207
+ put it here the wide because then everything
2208
+
2209
+ 1:08:54.709 --> 1:08:56.565
2210
+ depends on the purpose of the output.
2211
+
2212
+ 1:08:56.916 --> 1:09:10.464
2213
+ This is what is referred to as an autoregressive
2214
+ model, and nearly all speech generation and
2215
+
2216
+ 1:09:10.464 --> 1:09:16.739
2217
+ language generation works in this autoregressive way.
2218
+
2219
+ 1:09:18.318 --> 1:09:21.132
2220
+ So the motivation is, can we do that more
2221
+ efficiently?
2222
+
2223
+ 1:09:21.361 --> 1:09:31.694
2224
+ And can we somehow process all target words
2225
+ in parallel?
2226
+
2227
+ 1:09:31.694 --> 1:09:41.302
2228
+ So instead of doing it one by one, we are
2229
+ inputting.
2230
+
2231
+ 1:09:45.105 --> 1:09:46.726
2232
+ So how does it work?
2233
+
2234
+ 1:09:46.726 --> 1:09:50.587
2235
+ So let's first look at the basic non-autoregressive
2236
+ model.
2237
+
2238
+ 1:09:50.810 --> 1:09:53.551
2239
+ So the encoder looks as it is before.
2240
+
2241
+ 1:09:53.551 --> 1:09:58.310
2242
+ That's maybe not surprising because here we
2243
+ know we can parallelize.
2244
+
2245
+ 1:09:58.618 --> 1:10:04.592
2246
+ So we have put in here our encoder input and
2247
+ generated the encoder states, so that's exactly
2248
+
2249
+ 1:10:04.592 --> 1:10:05.295
2250
+ the same.
2251
+
2252
+ 1:10:05.845 --> 1:10:16.229
2253
+ However, now we need to do one more thing:
2254
+ One challenge is what we had before and that's
2255
+
2256
+ 1:10:16.229 --> 1:10:26.799
2257
+ a challenge of natural language generation
2258
+ like machine translation.
2259
+
2260
+ 1:10:32.672 --> 1:10:38.447
2261
+ We generate until we produce this end
2262
+ of sentence token, but if we now generate
2263
+
2264
+ 1:10:38.447 --> 1:10:44.625
2265
+ everything at once that's no longer possible,
2266
+ so we cannot generate as long because we only
2267
+
2268
+ 1:10:44.625 --> 1:10:45.632
2269
+ generated one.
2270
+
2271
+ 1:10:46.206 --> 1:10:58.321
2272
+ So the question is how can we now determine
2273
+ how long the sequence is, and we can also accelerate.
2274
+
2275
+ 1:11:00.000 --> 1:11:06.384
2276
+ Yes, but there would be one idea, and there
2277
+ is other work which tries to do that.
2278
+
2279
+ 1:11:06.806 --> 1:11:15.702
2280
+ However, in here there's some work already
2281
+ done before and maybe you remember we had the
2282
+
2283
+ 1:11:15.702 --> 1:11:20.900
2284
+ IBM models and there was this concept of fertility.
2285
+
2286
+ 1:11:21.241 --> 1:11:26.299
2287
+ The concept of fertility means: for
2288
+ one source word, into how many target words does
2289
+
2290
+ 1:11:26.299 --> 1:11:27.104
2291
+ it translate?
2292
+
2293
+ 1:11:27.847 --> 1:11:34.805
2294
+ And exactly that we try to do here, and that
2295
+ means we are calculating like at the top we
2296
+
2297
+ 1:11:34.805 --> 1:11:36.134
2298
+ are calculating.
2299
+
2300
+ 1:11:36.396 --> 1:11:42.045
2301
+ So it says word is translated into word.
2302
+
2303
+ 1:11:42.045 --> 1:11:54.171
2304
+ A word might be translated into two words, and so on,
2305
+ so we're trying to predict in how many words.
2306
+
2307
+ 1:11:55.935 --> 1:12:10.314
2308
+ And then the end of the anchor, so this is
2309
+ like a length estimation.
2310
+
2311
+ 1:12:10.314 --> 1:12:15.523
2312
+ You can do it otherwise.
2313
+
2314
+ 1:12:16.236 --> 1:12:24.526
2315
+ You initialize your decoder input and we know
2316
+ it's good with word embeddings so we're trying
2317
+
2318
+ 1:12:24.526 --> 1:12:28.627
2319
+ to do the same thing and what people then do.
2320
+
2321
+ 1:12:28.627 --> 1:12:35.224
2322
+ They initialize it again with word embedding
2323
+ but repeated according to the fertility.
2324
+
2325
+ 1:12:35.315 --> 1:12:36.460
2326
+ So we have the cartilage.
2327
+
2328
+ 1:12:36.896 --> 1:12:47.816
2329
+ So one has two, so twice the is and then one
2330
+ is, so that is then our initialization.
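A small sketch of this fertility-based initialization, assuming PyTorch and made-up shapes (not the lecture's code): each source embedding is copied as many times as its predicted fertility, which both fixes the target length and gives the parallel decoder its input.

```python
import torch

# Sketch: build the decoder input by repeating source embeddings
# according to predicted fertilities. All names are illustrative.
def build_decoder_input(src_embeddings, fertilities):
    # src_embeddings: (src_len, d); fertilities: (src_len,) non-negative ints
    return torch.repeat_interleave(src_embeddings, fertilities, dim=0)

src = torch.randn(3, 8)                  # three source words
fert = torch.tensor([1, 2, 1])           # the second word produces two target words
dec_in = build_decoder_input(src, fert)  # shape (4, 8): target length is 4
```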
2331
+
2332
+ 1:12:48.208 --> 1:12:57.151
2333
+ In other words, if you don't predict fertilities
2334
+ but predict lengths, you can just initialize
2335
+
2336
+ 1:12:57.151 --> 1:12:57.912
2337
+ second.
2338
+
2339
+ 1:12:58.438 --> 1:13:07.788
2340
+ This often works a bit better, but that's
2341
+ the other.
2342
+
2343
+ 1:13:07.788 --> 1:13:16.432
2344
+ Now you have everything in training and testing.
2345
+
2346
+ 1:13:16.656 --> 1:13:18.621
2347
+ This is all available at once.
2348
+
2349
+ 1:13:20.280 --> 1:13:31.752
2350
+ Then we can generate everything in parallel,
2351
+ so we have the decoder stack, and that is now
2352
+
2353
+ 1:13:31.752 --> 1:13:33.139
2354
+ as before.
2355
+
2356
+ 1:13:35.395 --> 1:13:41.555
2357
+ And then we're doing the translation predictions
2358
+ here on top of it in order to do.
2359
+
2360
+ 1:13:43.083 --> 1:13:59.821
2361
+ And then we are predicting here the target
2362
+ words and once predicted, and that is the basic
2363
+
2364
+ 1:13:59.821 --> 1:14:00.924
2365
+ idea.
2366
+
2367
+ 1:14:01.241 --> 1:14:08.171
2368
+ Machine translation: Where the idea is, we
2369
+ don't have to do one by one what we're.
2370
+
2371
+ 1:14:10.210 --> 1:14:13.900
2372
+ So this looks really, really, really great.
2373
+
2374
+ 1:14:13.900 --> 1:14:20.358
2375
+ On the first view there's one challenge with
2376
+ this, and this is the baseline.
2377
+
2378
+ 1:14:20.358 --> 1:14:27.571
2379
+ Of course there's some improvements, but in
2380
+ general the quality is often significantly worse.
2381
+
2382
+ 1:14:28.068 --> 1:14:32.075
2383
+ So here you see the baseline models.
2384
+
2385
+ 1:14:32.075 --> 1:14:38.466
2386
+ You have a loss of ten BLEU points or something
2387
+ like that.
2388
+
2389
+ 1:14:38.878 --> 1:14:40.230
2390
+ So why does it change?
2391
+
2392
+ 1:14:40.230 --> 1:14:41.640
2393
+ So why is it happening?
2394
+
2395
+ 1:14:43.903 --> 1:14:56.250
2396
+ If you look at the errors there is repetitive
2397
+ tokens, so you have like or things like that.
2398
+
2399
+ 1:14:56.536 --> 1:15:01.995
2400
+ Broken sentences or disfluent sentences; that is
2401
+ exactly where autoregressive models are
2402
+
2403
+ 1:15:01.995 --> 1:15:04.851
2404
+ very good, we say that's a bit of a problem.
2405
+
2406
+ 1:15:04.851 --> 1:15:07.390
2407
+ They generate very fluent translations.
2408
+
2409
+ 1:15:07.387 --> 1:15:10.898
2410
+ Sometimes the translation doesn't have
2411
+ to do anything with the input.
2412
+
2413
+ 1:15:11.411 --> 1:15:14.047
2414
+ But generally it really looks always very
2415
+ fluid.
2416
+
2417
+ 1:15:14.995 --> 1:15:20.865
2418
+ Here exactly the opposite, so the problem
2419
+ is that we don't have really fluid translation.
2420
+
2421
+ 1:15:21.421 --> 1:15:26.123
2422
+ And that is mainly due to the challenge that
2423
+ we have this independent assumption.
2424
+
2425
+ 1:15:26.646 --> 1:15:35.873
2426
+ So in this case, the probability of Y of the
2427
+ second position is independent of the probability
2428
+
2429
+ 1:15:35.873 --> 1:15:40.632
2430
+ of X, so we don't know what was there generated.
2431
+
2432
+ 1:15:40.632 --> 1:15:43.740
2433
+ We're just generating it there.
2434
+
2435
+ 1:15:43.964 --> 1:15:55.439
2436
+ You can see it also in a bit of examples.
2437
+
2438
+ 1:15:55.439 --> 1:16:03.636
2439
+ You can over-penalize shifts.
2440
+
2441
+ 1:16:04.024 --> 1:16:10.566
2442
+ And the problem is this is already an improvement
2443
+ again, but this is also similar to.
2444
+
2445
+ 1:16:11.071 --> 1:16:19.900
2446
+ So you can, for example, translate heeded
2447
+ back, or maybe you could also translate it
2448
+
2449
+ 1:16:19.900 --> 1:16:31.105
2450
+ with: But on their feeling down in feeling
2451
+ down, if the first position thinks of their
2452
+
2453
+ 1:16:31.105 --> 1:16:34.594
2454
+ feeling done and the second.
2455
+
2456
+ 1:16:35.075 --> 1:16:42.908
2457
+ So each position here and that is one of the
2458
+ main issues here doesn't know what the other.
2459
+
2460
+ 1:16:43.243 --> 1:16:53.846
2461
+ And for example, if you are translating something
2462
+ with, you can often translate things in two
2463
+
2464
+ 1:16:53.846 --> 1:16:58.471
2465
+ ways: German with a different agreement.
2466
+
2467
+ 1:16:58.999 --> 1:17:02.058
2468
+ And then here where you have to decide do
2469
+ a used jet.
2470
+
2471
+ 1:17:02.162 --> 1:17:05.460
2472
+ Interpretator: It doesn't know which word
2473
+ it has to select.
2474
+
2475
+ 1:17:06.086 --> 1:17:14.789
2476
+ Mean, of course, it knows a hidden state,
2477
+ but in the end you have a liability distribution.
2478
+
2479
+ 1:17:16.256 --> 1:17:20.026
2480
+ And that is the important thing in the auto
2481
+ regressive model.
2482
+
2483
+ 1:17:20.026 --> 1:17:24.335
2484
+ You know that because you have put it in you
2485
+ here, you don't know that.
2486
+
2487
+ 1:17:24.335 --> 1:17:29.660
2488
+ If it's equal probable here to two, you don't
2489
+ Know Which Is Selected, and of course that
2490
+
2491
+ 1:17:29.660 --> 1:17:32.832
2492
+ depends on what should be the latest traction
2493
+ under.
2494
+
2495
+ 1:17:33.333 --> 1:17:39.554
2496
+ Yep, that's the undershift, and we're going
2497
+ to last last the next time.
2498
+
2499
+ 1:17:39.554 --> 1:17:39.986
2500
+ Yes.
2501
+
2502
+ 1:17:40.840 --> 1:17:44.935
2503
+ Doesn't this also appear in and like now we're
2504
+ talking about physical training?
2505
+
2506
+ 1:17:46.586 --> 1:17:48.412
2507
+ The thing is in the auto regress.
2508
+
2509
+ 1:17:48.412 --> 1:17:50.183
2510
+ If you give it the correct one,.
2511
+
2512
+ 1:17:50.450 --> 1:17:55.827
2513
+ So if you predict here comma what the reference
2514
+ is feeling then you tell the model here.
2515
+
2516
+ 1:17:55.827 --> 1:17:59.573
2517
+ The last one was feeling and then it knows
2518
+ it has to be done.
2519
+
2520
+ 1:17:59.573 --> 1:18:04.044
2521
+ But here it doesn't know that because it doesn't
2522
+ get as input as a right.
2523
+
2524
+ 1:18:04.204 --> 1:18:24.286
2525
+ Yes, that's a bit depending on what.
2526
+
2527
+ 1:18:24.204 --> 1:18:27.973
2528
+ But in training, of course, you just try to
2529
+ make the highest one the current one.
2530
+
2531
+ 1:18:31.751 --> 1:18:38.181
2532
+ So what you can do is things like CTC loss
2533
+ which can adjust for this.
2534
+
2535
+ 1:18:38.181 --> 1:18:42.866
2536
+ So then you can also have this shifted correction.
2537
+
2538
+ 1:18:42.866 --> 1:18:50.582
2539
+ If you're doing this type of correction in
2540
+ the CTC loss you don't get full penalty.
2541
+
2542
+ 1:18:50.930 --> 1:18:58.486
2543
+ Just shifted by one, so it's a bit of a different
2544
+ loss, which is mainly used in, but.
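As a generic sketch of the CTC idea mentioned here (standard torch.nn.CTCLoss usage with toy shapes, not the lecture's exact formulation): the model emits per-position distributions and CTC marginalizes over alignments, so outputs shifted by one are not punished as hard as with position-wise cross entropy.

```python
import torch
import torch.nn as nn

# Toy CTC loss computation: blank symbol at index 0, output length T,
# reference of length 3; all values are random placeholders.
T, N, C = 10, 1, 6
log_probs = torch.randn(T, N, C).log_softmax(dim=-1)
target = torch.tensor([[2, 3, 5]])
input_lengths = torch.tensor([T])
target_lengths = torch.tensor([3])

ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, target, input_lengths, target_lengths)
```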
2545
+
2546
+ 1:19:00.040 --> 1:19:03.412
2547
+ It can be used in order to address this problem.
2548
+
2549
+ 1:19:04.504 --> 1:19:13.844
2550
+ The other problem is that outer regressively
2551
+ we have the label buyers that tries to disimmigrate.
2552
+
2553
+ 1:19:13.844 --> 1:19:20.515
2554
+ That's the example did before was if you translate
2555
+ thank you to Dung.
2556
+
2557
+ 1:19:20.460 --> 1:19:31.925
2558
+ And then it might end up because it learns
2559
+ in the first position and the second also.
2560
+
2561
+ 1:19:32.492 --> 1:19:43.201
2562
+ In order to prevent that, it would be helpful
2563
+ for one output, only one output, so that makes
2564
+
2565
+ 1:19:43.201 --> 1:19:47.002
2566
+ the system already better learn.
2567
+
2568
+ 1:19:47.227 --> 1:19:53.867
2569
+ Might be that for slightly different inputs
2570
+ you have different outputs, but for the same.
2571
+
2572
+ 1:19:54.714 --> 1:19:57.467
2573
+ That we can luckily very easily solve.
2574
+
2575
+ 1:19:59.119 --> 1:19:59.908
2576
+ And it's done.
2577
+
2578
+ 1:19:59.908 --> 1:20:04.116
2579
+ We just learned the technique about it, which
2580
+ is called knowledge distillation.
2581
+
2582
+ 1:20:04.985 --> 1:20:13.398
2583
+ So what we can do and the easiest solution
2584
+ to prove your non-autoregressive model is to
2585
+
2586
+ 1:20:13.398 --> 1:20:16.457
2587
+ train an auto regressive model.
2588
+
2589
+ 1:20:16.457 --> 1:20:22.958
2590
+ Then you decode your whole training gamer
2591
+ with this model and then.
2592
+
2593
+ 1:20:23.603 --> 1:20:27.078
2594
+ While the main advantage of that is that this
2595
+ is more consistent,.
2596
+
2597
+ 1:20:27.407 --> 1:20:33.995
2598
+ So for the same input you always have the
2599
+ same output.
2600
+
2601
+ 1:20:33.995 --> 1:20:41.901
2602
+ So you have to make your training data more
2603
+ consistent and learn.
2604
+
2605
+ 1:20:42.482 --> 1:20:54.471
2606
+ So there is another advantage of knowledge
2607
+ distillation and that advantage is you have
2608
+
2609
+ 1:20:54.471 --> 1:20:59.156
2610
+ more consistent training signals.
2611
+
2612
+ 1:21:04.884 --> 1:21:10.630
2613
+ There's another to make the things more easy
2614
+ at the beginning.
2615
+
2616
+ 1:21:10.630 --> 1:21:16.467
2617
+ There's this plants model, black model where
2618
+ you do more masks.
2619
+
2620
+ 1:21:16.756 --> 1:21:26.080
2621
+ So during training, especially at the beginning,
2622
+ you give some correct solutions at the beginning.
2623
+
2624
+ 1:21:28.468 --> 1:21:38.407
2625
+ And there is this tokens at a time, so the
2626
+ idea is to establish other regressive training.
2627
+
2628
+ 1:21:40.000 --> 1:21:50.049
2629
+ And some targets are open, so you always predict
2630
+ only like first auto regression is K.
2631
+
2632
+ 1:21:50.049 --> 1:21:59.174
2633
+ It puts one, so you always have one input
2634
+ and one output, then you do partial.
2635
+
2636
+ 1:21:59.699 --> 1:22:05.825
2637
+ So in that way you can slowly learn what is
2638
+ a good and what is a bad answer.
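A minimal sketch of this partial-masking idea (the masking schedule and token name are assumptions, not the lecture's exact recipe): only part of the reference is hidden, so the model gets some correct target words as hints, and the ratio of hints can be reduced as training progresses.

```python
import random

# Sketch: reveal a fraction of the reference during training and
# let the model predict the masked rest.
def partial_mask(target_tokens, mask_token="<mask>", keep_ratio=0.5):
    return [t if random.random() < keep_ratio else mask_token for t in target_tokens]

print(partial_mask(["ich", "bin", "heute", "muede"]))  # some tokens kept as hints
```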
2639
+
2640
+ 1:22:08.528 --> 1:22:10.862
2641
+ It doesn't sound very impressive.
2642
+
2643
+ 1:22:10.862 --> 1:22:12.578
2644
+ Don't contact me anyway.
2645
+
2646
+ 1:22:12.578 --> 1:22:15.323
2647
+ Go all over your training data several.
2648
+
2649
+ 1:22:15.875 --> 1:22:20.655
2650
+ You can even switch in between.
2651
+
2652
+ 1:22:20.655 --> 1:22:29.318
2653
+ There is a homework on this thing where you
2654
+ try to start.
2655
+
2656
+ 1:22:31.271 --> 1:22:41.563
2657
+ You have to learn so there's a whole work
2658
+ on that so this is often happening and it doesn't
2659
+
2660
+ 1:22:41.563 --> 1:22:46.598
2661
+ mean it's less efficient but still it helps.
2662
+
2663
+ 1:22:49.389 --> 1:22:57.979
2664
+ For later maybe here are some examples of
2665
+ how much things help.
2666
+
2667
+ 1:22:57.979 --> 1:23:04.958
2668
+ Maybe one point here is that it's really important.
2669
+
2670
+ 1:23:05.365 --> 1:23:13.787
2671
+ Here's the translation performance and speed.
2672
+
2673
+ 1:23:13.787 --> 1:23:24.407
2674
+ One point which is a point is if you compare
2675
+ researchers.
2676
+
2677
+ 1:23:24.784 --> 1:23:33.880
2678
+ So yeah, if you're compared to one very weak
2679
+ baseline transformer even with beam search,
2680
+
2681
+ 1:23:33.880 --> 1:23:40.522
2682
+ then you're ten times slower than a very strong
2683
+ auto regressive.
2684
+
2685
+ 1:23:40.961 --> 1:23:48.620
2686
+ If you make a strong baseline then it's going
2687
+ down to depending on times and here like: You
2688
+
2689
+ 1:23:48.620 --> 1:23:53.454
2690
+ have a lot of different speed ups.
2691
+
2692
+ 1:23:53.454 --> 1:24:03.261
2693
+ Generally, it makes a strong baseline and
2694
+ not very simple transformer.
2695
+
2696
+ 1:24:07.407 --> 1:24:20.010
2697
+ Yeah, with this one last thing that you can
2698
+ do to speed up things and also reduce your
2699
+
2700
+ 1:24:20.010 --> 1:24:25.950
2701
+ memory is what is called half precision.
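A hedged sketch of half precision in practice, using PyTorch as an example framework (the model here is just a stand-in linear layer): weights and activations are cast to 16-bit floats for decoding, which roughly halves memory; for training, mixed precision is the more common and more stable choice.

```python
import torch

# Sketch: half-precision inference to cut memory and speed up decoding.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.bfloat16  # bf16 is safer on CPU

model = torch.nn.Linear(512, 32000).to(device=device, dtype=dtype).eval()
x = torch.randn(1, 512, device=device, dtype=dtype)
with torch.no_grad():
    logits = model(x)
```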
2702
+
2703
+ 1:24:26.326 --> 1:24:29.139
2704
+ And especially for decoding issues for training.
2705
+
2706
+ 1:24:29.139 --> 1:24:31.148
2707
+ Sometimes it also gets less stable.
2708
+
2709
+ 1:24:32.592 --> 1:24:45.184
2710
+ With this we close nearly wait a bit, so what
2711
+ you should remember is that efficient machine
2712
+
2713
+ 1:24:45.184 --> 1:24:46.963
2714
+ translation.
2715
+
2716
+ 1:24:47.007 --> 1:24:51.939
2717
+ We have, for example, looked at knowledge
2718
+ distillation.
2719
+
2720
+ 1:24:51.939 --> 1:24:55.991
2721
+ We have looked at non auto regressive models.
2722
+
2723
+ 1:24:55.991 --> 1:24:57.665
2724
+ We have different.
2725
+
2726
+ 1:24:58.898 --> 1:25:02.383
2727
+ For today and then only requests.
2728
+
2729
+ 1:25:02.383 --> 1:25:08.430
2730
+ So if you haven't done so, please fill out
2731
+ the evaluation.
2732
+
2733
+ 1:25:08.388 --> 1:25:20.127
2734
+ So now if you have done so think then you
2735
+ should have and with the online people hopefully.
2736
+
2737
+ 1:25:20.320 --> 1:25:29.758
2738
+ Only possibility to tell us what things are
2739
+ good and what not the only one but the most
2740
+
2741
+ 1:25:29.758 --> 1:25:30.937
2742
+ efficient.
2743
+
2744
+ 1:25:31.851 --> 1:25:35.871
2745
+ So think of all the students doing it in this
2746
+ case okay and then thank.
2747
+
demo_data/lectures/Lecture-14-27.06.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:59f384b3137c89cb3f00f2020badb6eb5ff6de5043bd9e015adab92072e27e62
3
+ size 113488295
demo_data/lectures/Lecture-15-11.07.2023/English.vtt ADDED
@@ -0,0 +1,2279 @@
1
+ WEBVTT
2
+
3
+ 0:00:00.060 --> 0:00:07.762
4
+ OK good so today's lecture is on unsupervised
5
+ machine translation so what you have seen
6
+
7
+ 0:00:07.762 --> 0:00:13.518
8
+ so far is different techniques are on supervised
9
+ and MP so you are.
10
+
11
+ 0:00:13.593 --> 0:00:18.552
12
+ Data right so let's say in English corpus
13
+ you have one file and then in German you have
14
+
15
+ 0:00:18.552 --> 0:00:23.454
16
+ another file which is sentence to sentence
17
+ aligned and then you try to build systems around
18
+
19
+ 0:00:23.454 --> 0:00:23.679
20
+ it.
21
+
22
+ 0:00:24.324 --> 0:00:30.130
23
+ But what's different about this lecture is
24
+ that you assume that you have no final data
25
+
26
+ 0:00:30.130 --> 0:00:30.663
27
+ at all.
28
+
29
+ 0:00:30.663 --> 0:00:37.137
30
+ You only have monolingual data and the question
31
+ is how can we build systems to translate between
32
+
33
+ 0:00:37.137 --> 0:00:39.405
34
+ these two languages right and so.
35
+
36
+ 0:00:39.359 --> 0:00:44.658
37
+ This is a bit more realistic scenario because
38
+ you have so many languages in the world.
39
+
40
+ 0:00:44.658 --> 0:00:50.323
41
+ You cannot expect to have parallel data between
42
+ all the two languages and so, but in typical
43
+
44
+ 0:00:50.323 --> 0:00:55.623
45
+ cases you have newspapers and so on, which
46
+ is like monolingual files, and the question
47
+
48
+ 0:00:55.623 --> 0:00:57.998
49
+ is can we build something around them?
50
+
51
+ 0:00:59.980 --> 0:01:01.651
52
+ They like said for today.
53
+
54
+ 0:01:01.651 --> 0:01:05.893
55
+ First we'll start up with the interactions,
56
+ so why do we need it?
57
+
58
+ 0:01:05.893 --> 0:01:11.614
59
+ and also some infusion on how these models
60
+ work before going into the technical details.
61
+
62
+ 0:01:11.614 --> 0:01:17.335
63
+ I want to also go through an example,, which
64
+ kind of gives you more understanding on how
65
+
66
+ 0:01:17.335 --> 0:01:19.263
67
+ people came into more elders.
68
+
69
+ 0:01:20.820 --> 0:01:23.905
70
+ Then the rest of the lecture is going to be
71
+ two parts.
72
+
73
+ 0:01:23.905 --> 0:01:26.092
74
+ One is we're going to translate words.
75
+
76
+ 0:01:26.092 --> 0:01:30.018
77
+ We're not going to care about how can we translate
78
+ the full sentence.
79
+
80
+ 0:01:30.018 --> 0:01:35.177
81
+ But given to monolingual files, how can we
82
+ get a dictionary basically, which is much easier
83
+
84
+ 0:01:35.177 --> 0:01:37.813
85
+ than generating something in a sentence level?
86
+
87
+ 0:01:38.698 --> 0:01:43.533
88
+ Then we're going to go into the harder case,
89
+ which is the unsupervised sentence-level translation.
90
+
91
+ 0:01:44.204 --> 0:01:50.201
92
+ And here what you'll see is what are the training
93
+ objectives which are quite different than the
94
+
95
+ 0:01:50.201 --> 0:01:55.699
96
+ word translation and also where it doesn't
97
+ but because this is also quite important and
98
+
99
+ 0:01:55.699 --> 0:02:01.384
100
+ it's one of the reasons why unsupervised does
101
+ not use anymore because the limitations kind
102
+
103
+ 0:02:01.384 --> 0:02:03.946
104
+ of go away from the realistic use cases.
105
+
106
+ 0:02:04.504 --> 0:02:06.922
107
+ And then that leads to the multilingual
108
+ model.
109
+
110
+ 0:02:06.922 --> 0:02:07.115
111
+ So.
112
+
113
+ 0:02:07.807 --> 0:02:12.915
114
+ People are trying to do to build systems for
115
+ languages that will not have any parallel data.
116
+
117
+ 0:02:12.915 --> 0:02:17.693
118
+ Is use multilingual models and combine with
119
+ these training objectives to get better at
120
+
121
+ 0:02:17.693 --> 0:02:17.913
122
+ it.
123
+
124
+ 0:02:17.913 --> 0:02:18.132
125
+ So.
126
+
127
+ 0:02:18.658 --> 0:02:24.396
128
+ People are not trying to build bilingual systems
129
+ currently for unsupervised arm translation,
130
+
131
+ 0:02:24.396 --> 0:02:30.011
132
+ but I think it's good to know how they came
133
+ to hear this point and what they're doing now.
134
+
135
+ 0:02:30.090 --> 0:02:34.687
136
+ You also see some patterns overlapping which
137
+ people are using.
138
+
139
+ 0:02:36.916 --> 0:02:41.642
140
+ So as you said before, and you probably hear
141
+ it multiple times now is that we have seven
142
+
143
+ 0:02:41.642 --> 0:02:43.076
144
+ thousand languages around.
145
+
146
+ 0:02:43.903 --> 0:02:49.460
147
+ Can be different dialects in someone, so it's
148
+ quite hard to distinguish what's the language,
149
+
150
+ 0:02:49.460 --> 0:02:54.957
151
+ but you can typically approximate that seven
152
+ thousand and that leads to twenty five million
153
+
154
+ 0:02:54.957 --> 0:02:59.318
155
+ pairs, which is the obvious reason why we do
156
+ not have any parallel data.
157
+
158
+ 0:03:00.560 --> 0:03:06.386
159
+ So you want to build an empty system for all
160
+ possible language pests and the question is
161
+
162
+ 0:03:06.386 --> 0:03:07.172
163
+ how can we?
164
+
165
+ 0:03:08.648 --> 0:03:13.325
166
+ The typical use case, but there are actually
167
+ quite few interesting use cases than what you
168
+
169
+ 0:03:13.325 --> 0:03:14.045
170
+ would expect.
171
+
172
+ 0:03:14.614 --> 0:03:20.508
173
+ One is the animal languages, which is the
174
+ real thing that's happening right now with.
175
+
176
+ 0:03:20.780 --> 0:03:26.250
177
+ The dog but with dolphins and so on, but I
178
+ couldn't find a picture that could show this,
179
+
180
+ 0:03:26.250 --> 0:03:31.659
181
+ but if you are interested in stuff like this
182
+ you can check out the website where people
183
+
184
+ 0:03:31.659 --> 0:03:34.916
185
+ are actually trying to understand how animals
186
+ speak.
187
+
188
+ 0:03:35.135 --> 0:03:37.356
189
+ It's Also a Bit More About.
190
+
191
+ 0:03:37.297 --> 0:03:44.124
192
+ Knowing what the animals want to say but may
193
+ not die dead but still people are trying to
194
+
195
+ 0:03:44.124 --> 0:03:44.661
196
+ do it.
197
+
198
+ 0:03:45.825 --> 0:03:50.689
199
+ More realistic thing that's happening is the
200
+ translation of programming languages.
201
+
202
+ 0:03:51.371 --> 0:03:56.963
203
+ And so this is quite a quite good scenario
204
+ for entrepreneurs and empty is that you have
205
+
206
+ 0:03:56.963 --> 0:04:02.556
207
+ a lot of code available online right in C +
208
+ + and in Python and the question is how can
209
+
210
+ 0:04:02.556 --> 0:04:08.402
211
+ we translate by just looking at the code alone
212
+ and no parallel functions and so on and this
213
+
214
+ 0:04:08.402 --> 0:04:10.754
215
+ is actually quite good right now so.
216
+
217
+ 0:04:12.032 --> 0:04:16.111
218
+ See how these techniques were applied to do
219
+ the programming translation.
220
+
221
+ 0:04:18.258 --> 0:04:23.882
222
+ And then you can also think of language as
223
+ something that is quite common so you can take
224
+
225
+ 0:04:23.882 --> 0:04:24.194
226
+ off.
227
+
228
+ 0:04:24.194 --> 0:04:29.631
229
+ Think of formal sentences in English as one
230
+ language and informal sentences in English
231
+
232
+ 0:04:29.631 --> 0:04:35.442
233
+ as another language and then learn the kind
234
+ to stay between them and then it kind of becomes
235
+
236
+ 0:04:35.442 --> 0:04:37.379
237
+ a style plan for a problem so.
238
+
239
+ 0:04:38.358 --> 0:04:43.042
240
+ Although it's translation, you can consider
241
+ different characteristics of a language and
242
+
243
+ 0:04:43.042 --> 0:04:46.875
244
+ then separate them as two different languages
245
+ and then try to map them.
246
+
247
+ 0:04:46.875 --> 0:04:52.038
248
+ So it's not only about languages, but you
249
+ can also do quite cool things by using unsophisticated
250
+
251
+ 0:04:52.038 --> 0:04:54.327
252
+ techniques, which are quite possible also.
253
+
254
+ 0:04:56.256 --> 0:04:56.990
255
+ I am so.
256
+
257
+ 0:04:56.990 --> 0:05:04.335
258
+ This is kind of TV modeling for many of the
259
+ use cases that we have for ours, ours and MD.
260
+
261
+ 0:05:04.335 --> 0:05:11.842
262
+ But before we go into the modeling of these
263
+ systems, what I want you to do is look at these
264
+
265
+ 0:05:11.842 --> 0:05:12.413
266
+ dummy.
267
+
268
+ 0:05:13.813 --> 0:05:19.720
269
+ We have text and language one, text and language
270
+ two right, and nobody knows what these languages
271
+
272
+ 0:05:19.720 --> 0:05:20.082
273
+ mean.
274
+
275
+ 0:05:20.082 --> 0:05:23.758
276
+ They completely are made up right, and the
277
+ question is also.
278
+
279
+ 0:05:23.758 --> 0:05:29.364
280
+ They're not parallel lines, so the first line
281
+ here and the first line is not a line, they're
282
+
283
+ 0:05:29.364 --> 0:05:30.810
284
+ just monolingual files.
285
+
286
+ 0:05:32.052 --> 0:05:38.281
287
+ And now think about how can you translate
288
+ the word M1 from language one to language two,
289
+
290
+ 0:05:38.281 --> 0:05:41.851
291
+ and this kind of you see how we try to model
292
+ this.
293
+
294
+ 0:05:42.983 --> 0:05:47.966
295
+ Would take your time and then think of how
296
+ can you translate more into language two?
297
+
298
+ 0:06:41.321 --> 0:06:45.589
299
+ About the model, if you ask somebody who doesn't
300
+ know anything about machine translation right,
301
+
302
+ 0:06:45.589 --> 0:06:47.411
303
+ and then you ask them to translate more.
304
+
305
+ 0:07:01.201 --> 0:07:10.027
306
+ But it's also not quite easy if you think
307
+ of the way that I made this example is relatively
308
+
309
+ 0:07:10.027 --> 0:07:10.986
310
+ easy, so.
311
+
312
+ 0:07:11.431 --> 0:07:17.963
313
+ Basically, the first two sentences are these
314
+ two: A, B, C is E, and G cured up the U, V
315
+
316
+ 0:07:17.963 --> 0:07:21.841
317
+ is L, A, A, C, S, and S, on and this is used
318
+ towards the German.
319
+
320
+ 0:07:22.662 --> 0:07:25.241
321
+ And then when you join these two words, it's.
322
+
323
+ 0:07:25.205 --> 0:07:32.445
324
+ English German the third line and the last
325
+ line, and then the fourth line is the first
326
+
327
+ 0:07:32.445 --> 0:07:38.521
328
+ line, so German language, English, and then
329
+ speak English, speak German.
330
+
331
+ 0:07:38.578 --> 0:07:44.393
332
+ So this is how I made made up the example
333
+ and what the intuition here is that you assume
334
+
335
+ 0:07:44.393 --> 0:07:50.535
336
+ that the languages have a fundamental structure
337
+ right and it's the same across all languages.
338
+
339
+ 0:07:51.211 --> 0:07:57.727
340
+ Doesn't matter what language you are thinking
341
+ of words kind of you have in the same way join
342
+
343
+ 0:07:57.727 --> 0:07:59.829
344
+ together is the same way and.
345
+
346
+ 0:07:59.779 --> 0:08:06.065
347
+ And plasma sign thinks the same way but this
348
+ is not a realistic assumption for sure but
349
+
350
+ 0:08:06.065 --> 0:08:12.636
351
+ it's actually a decent one to make and if you
352
+ can think of this like if you can assume this
353
+
354
+ 0:08:12.636 --> 0:08:16.207
355
+ then we can model systems in an unsupervised
356
+ way.
357
+
358
+ 0:08:16.396 --> 0:08:22.743
359
+ So this is the intuition that I want to give,
360
+ and you can see that whenever assumptions fail,
361
+
362
+ 0:08:22.743 --> 0:08:23.958
363
+ the systems fail.
364
+
365
+ 0:08:23.958 --> 0:08:29.832
366
+ So in practice whenever we go far away from
367
+ these assumptions, the systems try to more
368
+
369
+ 0:08:29.832 --> 0:08:30.778
370
+ time to fail.
371
+
372
+ 0:08:33.753 --> 0:08:39.711
373
+ So the example that I gave was actually perfect
374
+ mapping right, so it never really sticks bad.
375
+
376
+ 0:08:39.711 --> 0:08:45.353
377
+ They have the same number of words, same sentence
378
+ structure, perfect mapping, and so on.
379
+
380
+ 0:08:45.353 --> 0:08:50.994
381
+ This doesn't happen, but let's assume that
382
+ this happens and try to see how we can moral.
383
+
384
+ 0:08:53.493 --> 0:09:01.061
385
+ Okay, now let's go a bit more formal, so what
386
+ you want to do is unsupervise word translation.
387
+
388
+ 0:09:01.901 --> 0:09:08.773
389
+ Here the task is that we have input data as
390
+ monolingual data, so a bunch of sentences in
391
+
392
+ 0:09:08.773 --> 0:09:15.876
393
+ one file and a bunch of sentences another file
394
+ in two different languages, and the question
395
+
396
+ 0:09:15.876 --> 0:09:18.655
397
+ is how can we get a bilingual word?
398
+
399
+ 0:09:19.559 --> 0:09:25.134
400
+ So if you look at the picture you see that
401
+ it's just kind of projected down into two dimension
402
+
403
+ 0:09:25.134 --> 0:09:30.358
404
+ planes, but it's basically when you map them
405
+ into a plot you see that the words that are
406
+
407
+ 0:09:30.358 --> 0:09:35.874
408
+ parallel are closer together, and the question
409
+ is how can we do it just looking at two files?
410
+
411
+ 0:09:36.816 --> 0:09:42.502
412
+ And you can say that what we want to basically
413
+ do is create a dictionary in the end given
414
+
415
+ 0:09:42.502 --> 0:09:43.260
416
+ two fights.
417
+
418
+ 0:09:43.260 --> 0:09:45.408
419
+ So this is the task that we want.
420
+
421
+ 0:09:46.606 --> 0:09:52.262
422
+ And the first step on how we do this is to
423
+ learn word vectors, and this chicken is whatever
424
+
425
+ 0:09:52.262 --> 0:09:56.257
426
+ techniques that you have seen before, but to
427
+ work glow or so on.
428
+
429
+ 0:09:56.856 --> 0:10:00.699
430
+ So you take a monolingual data and try to
431
+ learn word embeddings.
432
+
433
+ 0:10:02.002 --> 0:10:07.675
434
+ Then you plot them into a graph, and then
435
+ typically what you would see is that they're
436
+
437
+ 0:10:07.675 --> 0:10:08.979
438
+ not aligned at all.
439
+
440
+ 0:10:08.979 --> 0:10:14.717
441
+ One word space is somewhere, and one word
442
+ space is somewhere else, and this is what you
443
+
444
+ 0:10:14.717 --> 0:10:18.043
445
+ would typically expect to see in the in the
446
+ image.
447
+
448
+ 0:10:19.659 --> 0:10:23.525
449
+ Now our assumption was that both lines we
450
+ just have the same.
451
+
452
+ 0:10:23.563 --> 0:10:28.520
453
+ Culture and so that we can use this information
454
+ to learn the mapping between these two spaces.
455
+
456
+ 0:10:30.130 --> 0:10:37.085
457
+ So before how we do it, I think this is quite
458
+ famous already, and everybody knows it a bit
459
+
460
+ 0:10:37.085 --> 0:10:41.824
461
+ more is that we're emitting capture semantic
462
+ relations right.
463
+
464
+ 0:10:41.824 --> 0:10:48.244
465
+ So the distance between man and woman is approximately
466
+ the same as king and prince.
467
+
468
+ 0:10:48.888 --> 0:10:54.620
469
+ It's also for world dances, country capital
470
+ and so on, so there are some relationships
471
+
472
+ 0:10:54.620 --> 0:11:00.286
473
+ happening in the word emmering space, which
474
+ is quite clear for at least one language.
475
+
476
+ 0:11:03.143 --> 0:11:08.082
477
+ Now if you think of this, let's say of the
478
+ English word embryng.
479
+
480
+ 0:11:08.082 --> 0:11:14.769
481
+ Let's say of German word embryng and the way
482
+ the King Keene Man woman organized is same
483
+
484
+ 0:11:14.769 --> 0:11:17.733
485
+ as the German translation of his word.
486
+
487
+ 0:11:17.998 --> 0:11:23.336
488
+ This is the main idea is that although they
489
+ are somewhere else, the relationship is the
490
+
491
+ 0:11:23.336 --> 0:11:28.008
492
+ same between the both languages and we can
493
+ use this to to learn the mapping.
494
+
495
+ 0:11:31.811 --> 0:11:35.716
496
+ 'S not only for these poor words where it
497
+ happens for all the words in the language,
498
+
499
+ 0:11:35.716 --> 0:11:37.783
500
+ and so we can use this to to learn the math.
501
+
502
+ 0:11:39.179 --> 0:11:43.828
503
+ This is the main idea is that both emittings
504
+ have a similar shape.
505
+
506
+ 0:11:43.828 --> 0:11:48.477
507
+ It's only that they're just not aligned and
508
+ so you go to the here.
509
+
510
+ 0:11:48.477 --> 0:11:50.906
511
+ They kind of have a similar shape.
512
+
513
+ 0:11:50.906 --> 0:11:57.221
514
+ They're just in some different spaces and
515
+ what you need to do is to map them into a common
516
+
517
+ 0:11:57.221 --> 0:11:57.707
518
+ space.
519
+
520
+ 0:12:06.086 --> 0:12:12.393
521
+ We want a W such that, if we multiply W with X,
522
+ they both become aligned.
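For intuition, here is a minimal numpy sketch of learning such a W in the supervised variant, assuming a small seed dictionary of word pairs (X and Y hold their embeddings row by row); the point of the rest of the lecture is precisely to remove this dictionary requirement.

```python
import numpy as np

# Sketch: orthogonal Procrustes solution for the mapping W, given paired
# embeddings. Shapes and data are toy placeholders.
def procrustes(X, Y):
    # Solve min_W ||X W^T - Y|| with W orthogonal, via SVD of Y^T X.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

X = np.random.randn(1000, 300)   # source-side embeddings of dictionary entries
Y = np.random.randn(1000, 300)   # embeddings of their translations
W = procrustes(X, Y)
mapped = X @ W.T                 # source embeddings moved into the target space
```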
523
+
524
+ 0:12:35.335 --> 0:12:41.097
525
+ That's true, but there are also many works
526
+ that have the relationship right, and we hope
527
+
528
+ 0:12:41.097 --> 0:12:43.817
529
+ that this is enough to learn the mapping.
530
+
531
+ 0:12:43.817 --> 0:12:49.838
532
+ So there's always going to be a bit of noise,
533
+ as in how when we align them they're not going
534
+
535
+ 0:12:49.838 --> 0:12:51.716
536
+ to be exactly the same, but.
537
+
538
+ 0:12:51.671 --> 0:12:57.293
539
+ What you can expect is that there are these
540
+ main works that allow us to learn the mapping,
541
+
542
+ 0:12:57.293 --> 0:13:02.791
543
+ so it's not going to be perfect, but it's an
544
+ approximation that we make to to see how it
545
+
546
+ 0:13:02.791 --> 0:13:04.521
547
+ works and then practice it.
548
+
549
+ 0:13:04.521 --> 0:13:10.081
550
+ Also, it's not that the fact that women do
551
+ not have any relationship does not affect that
552
+
553
+ 0:13:10.081 --> 0:13:10.452
554
+ much.
555
+
556
+ 0:13:10.550 --> 0:13:15.429
557
+ A lot of words usually have, so it kind of
558
+ works out in practice.
559
+
560
+ 0:13:22.242 --> 0:13:34.248
561
+ I have not heard about it, but if you want
562
+ to say something about it, I would be interested,
563
+
564
+ 0:13:34.248 --> 0:13:37.346
565
+ but we can do it later.
566
+
567
+ 0:13:41.281 --> 0:13:44.133
568
+ Usual case: This is supervised.
569
+
570
+ 0:13:45.205 --> 0:13:49.484
571
+ First way to do a supervised work translation
572
+ where we have a dictionary right and that we
573
+
574
+ 0:13:49.484 --> 0:13:53.764
575
+ can use that to learn the mapping, but in our
576
+ case we assume that we have nothing right so
577
+
578
+ 0:13:53.764 --> 0:13:55.222
579
+ we only have monolingual data.
580
+
581
+ 0:13:56.136 --> 0:14:03.126
582
+ Then we need unsupervised planning to figure
583
+ out W, and we're going to use guns to to find
584
+
585
+ 0:14:03.126 --> 0:14:06.122
586
+ W, and it's quite a nice way to do it.
587
+
588
+ 0:14:08.248 --> 0:14:15.393
589
+ So just before I go on how we use it to use
590
+ case, I'm going to go briefly on gas right,
591
+
592
+ 0:14:15.393 --> 0:14:19.940
593
+ so we have two components: generator and discriminator.
594
+
595
+ 0:14:21.441 --> 0:14:27.052
596
+ Gen data tries to generate something obviously,
597
+ and the discriminator tries to see if it's
598
+
599
+ 0:14:27.052 --> 0:14:30.752
600
+ real data or something that is generated by
601
+ the generation.
602
+
603
+ 0:14:31.371 --> 0:14:37.038
604
+ And there's like this two player game where
605
+ the winner decides to fool and the winner decides
606
+
607
+ 0:14:37.038 --> 0:14:41.862
608
+ to market food and they try to build these
609
+ two components and try to learn WWE.
610
+
611
+ 0:14:43.483 --> 0:14:53.163
612
+ Okay, so let's say we have two languages,
613
+ X and Y right, so the X language has N words
614
+
615
+ 0:14:53.163 --> 0:14:56.167
616
+ with numbering dimensions.
617
+
618
+ 0:14:56.496 --> 0:14:59.498
619
+ So what I'm reading is matrix is peak or something.
620
+
621
+ 0:14:59.498 --> 0:15:02.211
622
+ Then we have target language why with m words.
623
+
624
+ 0:15:02.211 --> 0:15:06.944
625
+ I'm also the same amount of things I mentioned
626
+ and then we have a matrix peak or.
627
+
628
+ 0:15:07.927 --> 0:15:13.784
629
+ Basically what you're going to do is use word
630
+ to vec and learn our word embeddings.
631
+
632
+ 0:15:14.995 --> 0:15:23.134
633
+ Now we have these X embeddings, Y embeddings, and
634
+ what you want to know is W, such that W X and
635
+
636
+ 0:15:23.134 --> 0:15:24.336
637
+ Y are align.
638
+
639
+ 0:15:29.209 --> 0:15:35.489
640
+ With guns you have two steps, one is a discriminative
641
+ step and one is the the mapping step and the
642
+
643
+ 0:15:35.489 --> 0:15:41.135
644
+ discriminative step is to see if the embeddings
645
+ are from the source or mapped embedding.
646
+
647
+ 0:15:41.135 --> 0:15:44.688
648
+ So it's going to be much scary when I go to
649
+ the figure.
650
+
651
+ 0:15:46.306 --> 0:15:50.041
652
+ So we have a monolingual documents with two
653
+ different languages.
654
+
655
+ 0:15:50.041 --> 0:15:54.522
656
+ From here we get our source language ambients
657
+ target language ambients right.
658
+
659
+ 0:15:54.522 --> 0:15:57.855
660
+ Then we randomly initialize the transformation
661
+ metrics W.
662
+
663
+ 0:16:00.040 --> 0:16:06.377
664
+ Then we have the discriminator which tries
665
+ to see if it's WX or Y, so it needs to know
666
+
667
+ 0:16:06.377 --> 0:16:13.735
668
+ that this is a mapped one and this is the original
669
+ language, and so if you look at the lost function
670
+
671
+ 0:16:13.735 --> 0:16:20.072
672
+ here, it's basically that source is one given
673
+ WX, so this is from the source language.
674
+
675
+ 0:16:23.543 --> 0:16:27.339
676
+ Which means it's the target language em yeah.
677
+
678
+ 0:16:27.339 --> 0:16:34.436
679
+ It's just like my figure is not that great,
680
+ but you can assume that they are totally.
681
+
682
+ 0:16:40.260 --> 0:16:43.027
683
+ So this is the kind of the lost function.
684
+
685
+ 0:16:43.027 --> 0:16:46.386
686
+ We have N source words, M target words, and
687
+ so on.
688
+
689
+ 0:16:46.386 --> 0:16:52.381
690
+ So that's why you have one by M, one by M,
691
+ and the discriminator is to just see if they're
692
+
693
+ 0:16:52.381 --> 0:16:55.741
694
+ mapped or they're from the original target
695
+ number.
696
+
697
+ 0:16:57.317 --> 0:17:04.024
698
+ And then we have the mapping step where we
699
+ train W to fool the the discriminators.
700
+
701
+ 0:17:04.564 --> 0:17:10.243
702
+ So here it's the same way, but what you're
703
+ going to just do is inverse the loss function.
704
+
705
+ 0:17:10.243 --> 0:17:15.859
706
+ So now we freeze the discriminators, so it's
707
+ important to note that in the previous sect
708
+
709
+ 0:17:15.859 --> 0:17:20.843
710
+ we freezed the transformation matrix, and here
711
+ we freezed your discriminators.
712
+
713
+ 0:17:22.482 --> 0:17:28.912
714
+ And now it's to fool the discriminator,
715
+ so it should predict that the source is zero
716
+
717
+ 0:17:28.912 --> 0:17:35.271
718
+ given the map numbering, and the source is
719
+ one given the target numbering, which is wrong,
720
+
721
+ 0:17:35.271 --> 0:17:37.787
722
+ which is why we're training the W.
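A hedged PyTorch sketch of these two alternating steps, in the spirit of the MUSE-style setup rather than the exact lecture configuration (layer sizes, learning rates and labels are illustrative assumptions): the discriminator is trained to separate mapped source vectors from real target vectors, and the mapping is then trained with inverted labels to fool it.

```python
import torch
import torch.nn as nn

d = 300
W = nn.Linear(d, d, bias=False)                      # the mapping to learn
disc = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))
bce = nn.BCEWithLogitsLoss()
opt_w = torch.optim.SGD(W.parameters(), lr=0.1)
opt_d = torch.optim.SGD(disc.parameters(), lr=0.1)

def step(x_batch, y_batch):
    # 1) Discriminator step: mapped source -> label 1, real target -> label 0.
    opt_d.zero_grad()
    pred = torch.cat([disc(W(x_batch).detach()), disc(y_batch)])
    labels = torch.cat([torch.ones(len(x_batch), 1), torch.zeros(len(y_batch), 1)])
    bce(pred, labels).backward()
    opt_d.step()
    # 2) Mapping step: discriminator frozen, labels inverted to fool it.
    opt_w.zero_grad()
    bce(disc(W(x_batch)), torch.zeros(len(x_batch), 1)).backward()
    opt_w.step()

step(torch.randn(32, d), torch.randn(32, d))   # one training iteration on toy data
```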
723
+
724
+ 0:17:39.439 --> 0:17:46.261
725
+ Any questions on this okay so then how do
726
+ we know when to stop?
727
+
728
+ 0:17:46.261 --> 0:17:55.854
729
+ We just train until we reach convergence right
730
+ and then we have our W hopefully train and
731
+
732
+ 0:17:55.854 --> 0:17:59.265
733
+ map them into an airline space.
734
+
735
+ 0:18:02.222 --> 0:18:07.097
736
+ The question is how can we evaluate this mapping?
737
+
738
+ 0:18:07.097 --> 0:18:13.923
739
+ Does anybody know what we can use to mapping
740
+ or evaluate the mapping?
741
+
742
+ 0:18:13.923 --> 0:18:15.873
743
+ How good is a word?
744
+
745
+ 0:18:28.969 --> 0:18:33.538
746
+ We use as I said we use a dictionary, at least
747
+ in the end.
748
+
749
+ 0:18:33.538 --> 0:18:40.199
750
+ We need a dictionary to evaluate, so this
751
+ is our only final, so we aren't using it at
752
+
753
+ 0:18:40.199 --> 0:18:42.600
754
+ all in attaining data and the.
755
+
756
+ 0:18:43.223 --> 0:18:49.681
757
+ Is one is to check what's the position for
758
+ our dictionary, just that.
759
+
760
+ 0:18:50.650 --> 0:18:52.813
761
+ The first nearest neighbor and see if it's
762
+ there on.
763
+
764
+ 0:18:53.573 --> 0:18:56.855
765
+ But this is quite strict because there's a
766
+ lot of noise in the emitting space right.
767
+
768
+ 0:18:57.657 --> 0:19:03.114
769
+ Not always your first neighbor is going to
770
+ be the translation, so what people also report
771
+
772
+ 0:19:03.114 --> 0:19:05.055
773
+ is precision at file and so on.
774
+
775
+ 0:19:05.055 --> 0:19:10.209
776
+ So you take the finerest neighbors and see
777
+ if the translation is in there and so on.
778
+
779
+ 0:19:10.209 --> 0:19:15.545
780
+ So the more you increase, the more likely
781
+ that there is a translation because where I'm
782
+
783
+ 0:19:15.545 --> 0:19:16.697
784
+ being quite noisy.
785
+
786
+ 0:19:19.239 --> 0:19:25.924
787
+ What's interesting is that people have used
788
+ dictionary to to learn word translation, but
789
+
790
+ 0:19:25.924 --> 0:19:32.985
791
+ the way of doing this is much better than using
792
+ a dictionary, so somehow our assumption helps
793
+
794
+ 0:19:32.985 --> 0:19:36.591
795
+ us to to build better than a supervised system.
796
+
797
+ 0:19:39.099 --> 0:19:42.985
798
+ So as you see on the top you have a question
799
+ at one five ten.
800
+
801
+ 0:19:42.985 --> 0:19:47.309
802
+ These are the typical numbers that you report
803
+ for world translation.
804
+
805
+ 0:19:48.868 --> 0:19:55.996
806
+ But guns are usually quite tricky to to train,
807
+ and it does not converge on on language based,
808
+
809
+ 0:19:55.996 --> 0:20:02.820
810
+ and this kind of goes back to a assumption
811
+ that they kind of behave in the same structure
812
+
813
+ 0:20:02.820 --> 0:20:03.351
814
+ right.
815
+
816
+ 0:20:03.351 --> 0:20:07.142
817
+ But if you take a language like English and
818
+ some.
819
+
820
+ 0:20:07.087 --> 0:20:12.203
821
+ Other languages are almost very lotus, so
822
+ it's quite different from English and so on.
823
+
824
+ 0:20:12.203 --> 0:20:13.673
825
+ Then I've one language,.
826
+
827
+ 0:20:13.673 --> 0:20:18.789
828
+ So whenever whenever our assumption fails,
829
+ these unsupervised techniques always do not
830
+
831
+ 0:20:18.789 --> 0:20:21.199
832
+ converge or just give really bad scores.
833
+
834
+ 0:20:22.162 --> 0:20:27.083
835
+ And so the fact is that the monolingual embryons
836
+ for distant languages are too far.
837
+
838
+ 0:20:27.083 --> 0:20:30.949
839
+ They do not share the same structure, and
840
+ so they do not convert.
841
+
842
+ 0:20:32.452 --> 0:20:39.380
843
+ And so I just want to mention that there is
844
+ a better retrieval technique than the nearest
845
+
846
+ 0:20:39.380 --> 0:20:41.458
847
+ neighbor, which is called.
848
+
849
+ 0:20:42.882 --> 0:20:46.975
850
+ But it's more advanced than mathematical,
851
+ so I didn't want to go in it now.
852
+
853
+ 0:20:46.975 --> 0:20:51.822
854
+ But if your interest is in some quite good
855
+ retrieval segments, you can just look at these
856
+
857
+ 0:20:51.822 --> 0:20:53.006
858
+ if you're interested.
859
+
860
+ 0:20:55.615 --> 0:20:59.241
861
+ Okay, so this is about the the word translation.
862
+
863
+ 0:20:59.241 --> 0:21:02.276
864
+ Does anybody have any questions of cure?
865
+
866
+ 0:21:06.246 --> 0:21:07.501
867
+ Was the worst answer?
868
+
869
+ 0:21:07.501 --> 0:21:12.580
870
+ It was a bit easier than a sentence right,
871
+ so you just assume that there's a mapping and
872
+
873
+ 0:21:12.580 --> 0:21:14.577
874
+ then you try to learn the mapping.
875
+
876
+ 0:21:14.577 --> 0:21:19.656
877
+ But now it's a bit more difficult because
878
+ you need to jump at stuff also, which is quite
879
+
880
+ 0:21:19.656 --> 0:21:20.797
881
+ much more trickier.
882
+
883
+ 0:21:22.622 --> 0:21:28.512
884
+ Task here is that we have our input as manually
885
+ well data for both languages as before, but
886
+
887
+ 0:21:28.512 --> 0:21:34.017
888
+ now what we want to do is instead of translating
889
+ word by word we want to do sentence.
890
+
891
+ 0:21:37.377 --> 0:21:44.002
892
+ We have word of work now and so on to learn
893
+ word amber inks, but sentence amber inks are
894
+
895
+ 0:21:44.002 --> 0:21:50.627
896
+ actually not the site powered often, at least
897
+ when people try to work on Answer Voice M,
898
+
899
+ 0:21:50.627 --> 0:21:51.445
900
+ E, before.
901
+
902
+ 0:21:52.632 --> 0:21:54.008
903
+ Now they're a bit okay.
904
+
905
+ 0:21:54.008 --> 0:21:59.054
906
+ I mean, as you've seen in the practice on
907
+ where we used places, they were quite decent.
908
+
909
+ 0:21:59.054 --> 0:22:03.011
910
+ But then it's also the case on which data
911
+ it's trained on and so on.
912
+
913
+ 0:22:03.011 --> 0:22:03.240
914
+ So.
915
+
916
+ 0:22:04.164 --> 0:22:09.666
917
+ Sentence embedings are definitely much more
918
+ harder to get than were embedings, so this
919
+
920
+ 0:22:09.666 --> 0:22:13.776
921
+ is a bit more complicated than the task that
922
+ you've seen before.
923
+
924
+ 0:22:16.476 --> 0:22:18.701
925
+ Before we go into how U.
926
+
927
+ 0:22:18.701 --> 0:22:18.968
928
+ N.
929
+
930
+ 0:22:18.968 --> 0:22:19.235
931
+ M.
932
+
933
+ 0:22:19.235 --> 0:22:19.502
934
+ T.
935
+
936
+ 0:22:19.502 --> 0:22:24.485
937
+ Works, so this is your typical supervised
938
+ system right.
939
+
940
+ 0:22:24.485 --> 0:22:29.558
941
+ So we have parallel data source sentence target
942
+ centers.
943
+
944
+ 0:22:29.558 --> 0:22:31.160
945
+ We have a source.
946
+
947
+ 0:22:31.471 --> 0:22:36.709
948
+ We have a target decoder and then we try to
949
+ minimize the cross center pillar on this viral
950
+
951
+ 0:22:36.709 --> 0:22:37.054
952
+ data.
953
+
954
+ 0:22:37.157 --> 0:22:39.818
955
+ And this is how we train our typical system.
956
+
957
+ 0:22:43.583 --> 0:22:49.506
958
+ But now we do not have any parallel data,
959
+ and so the intuition here is that if we can
960
+
961
+ 0:22:49.506 --> 0:22:55.429
962
+ learn language independent representations
963
+ at the end quota outputs, then we can pass
964
+
965
+ 0:22:55.429 --> 0:22:58.046
966
+ it along to the decoder that we want.
967
+
968
+ 0:22:58.718 --> 0:23:03.809
969
+ It's going to get more clear in the future,
970
+ but I'm trying to give a bit more intuition
971
+
972
+ 0:23:03.809 --> 0:23:07.164
973
+ before I'm going to show you all the planning
974
+ objectives.
975
+
976
+ 0:23:08.688 --> 0:23:15.252
977
+ So I assume that we have these different encoders
978
+ right, so it's not only two, you have a bunch
979
+
980
+ 0:23:15.252 --> 0:23:21.405
981
+ of different source language encoders, a bunch
982
+ of different target language decoders, and
983
+
984
+ 0:23:21.405 --> 0:23:26.054
985
+ also I assume that the encoder is in the same
986
+ representation space.
987
+
988
+ 0:23:26.706 --> 0:23:31.932
989
+ If you give a sentence in English and the
990
+ same sentence in German, the embeddings are
991
+
992
+ 0:23:31.932 --> 0:23:38.313
993
+ quite the same, so like the muddling when embeddings
994
+ die right, and so then what we can do is, depending
995
+
996
+ 0:23:38.313 --> 0:23:42.202
997
+ on the language we want, pass it to the the
998
+ appropriate decode.
999
+
1000
+ 0:23:42.682 --> 0:23:50.141
1001
+ And so the kind of goal here is to find out
1002
+ a way to create language independent representations
1003
+
1004
+ 0:23:50.141 --> 0:23:52.909
1005
+ and then pass it to the decoder we want.
1006
+
1007
+ 0:23:54.975 --> 0:23:59.714
1008
+ Just keep in mind that you're trying to do
1009
+ language independent for some reason, but it's
1010
+
1011
+ 0:23:59.714 --> 0:24:02.294
1012
+ going to be more clear once we see how it works.
1013
+
1014
+ 0:24:05.585 --> 0:24:12.845
1015
+ So in total we have three objectives that
1016
+ we're going to try to train in our systems,
1017
+
1018
+ 0:24:12.845 --> 0:24:16.981
1019
+ and all of them use monolingual
1020
+ data.
1021
+
1022
+ 0:24:17.697 --> 0:24:19.559
1023
+ So there's no parallel data at all.
1024
+
1025
+ 0:24:19.559 --> 0:24:24.469
1026
+ The first one is denoising auto-encoding,
1027
+ so it's more like you add noise to
1028
+
1029
+ 0:24:24.469 --> 0:24:27.403
1030
+ the sentence, and then reconstruct the original.
1031
+
1032
+ 0:24:28.388 --> 0:24:34.276
1033
+ Then we have the on-the-fly back translation,
1034
+ so this is where you take a sentence, generate
1035
+
1036
+ 0:24:34.276 --> 0:24:39.902
1037
+ a translation, and then learn to translate
1038
+ back, which I'm going to show in pictures
1039
+
1040
+ 0:24:39.902 --> 0:24:45.725
1041
+ later, and then we have an adversarial
1042
+ training step to learn the language independent
1043
+
1044
+ 0:24:45.725 --> 0:24:46.772
1045
+ representation.
1046
+
1047
+ 0:24:47.427 --> 0:24:52.148
1048
+ So somehow we'll train on these three tasks
1049
+ or pretrain on these three tasks.
1050
+
1051
+ 0:24:52.148 --> 0:24:54.728
1052
+ We somehow get an unsupervised M.
1053
+
1054
+ 0:24:54.728 --> 0:24:54.917
1055
+ T.
1056
+
1057
+ 0:24:56.856 --> 0:25:02.964
1058
+ OK, so the first thing we're going to do is denoising
1059
+ auto-encoding right, so as I said we add
1060
+
1061
+ 0:25:02.964 --> 0:25:06.295
1062
+ noise to the sentence, so we take our sentence.
1063
+
1064
+ 0:25:06.826 --> 0:25:09.709
1065
+ And then there are different ways to add noise.
1066
+
1067
+ 0:25:09.709 --> 0:25:11.511
1068
+ You can shuffle words around.
1069
+
1070
+ 0:25:11.511 --> 0:25:12.712
1071
+ You can drop words.
1072
+
1073
+ 0:25:12.712 --> 0:25:18.298
1074
+ Do whatever you want to do as long as there's
1075
+ enough information to reconstruct the original
1076
+
1077
+ 0:25:18.298 --> 0:25:18.898
1078
+ sentence.
1079
+
1080
+ 0:25:19.719 --> 0:25:25.051
1081
+ And then we assume that the noised one and
1082
+ the original one are parallel data and train
1083
+
1084
+ 0:25:25.051 --> 0:25:26.687
1085
+ similar to the supervised.
1086
+
1087
+ 0:25:28.168 --> 0:25:30.354
1088
+ So we have a source sentence.
1089
+
1090
+ 0:25:30.354 --> 0:25:32.540
1091
+ We have a noisy source right.
1092
+
1093
+ 0:25:32.540 --> 0:25:37.130
1094
+ So here what basically happened is that the
1095
+ word got shuffled.
1096
+
1097
+ 0:25:37.130 --> 0:25:39.097
1098
+ One word is dropped right.
1099
+
1100
+ 0:25:39.097 --> 0:25:41.356
1101
+ So this was a noisy source.
1102
+
1103
+ 0:25:41.356 --> 0:25:47.039
1104
+ And then we treat the noisy source and
1105
+ source as a sentence pair basically.
1106
+
1107
+ 0:25:49.009 --> 0:25:53.874
1108
+ We train by optimizing the cross entropy
1109
+ loss similar to before.
1110
+
1111
+ 0:25:57.978 --> 0:26:03.211
1112
+ Basically a picture to show what's happening
1113
+ and we have the noised source.
1114
+
1115
+ 0:26:03.163 --> 0:26:09.210
1116
+ Noised target, and then we have the reconstructed
1117
+ original source and original target, and since
1118
+
1119
+ 0:26:09.210 --> 0:26:14.817
1120
+ the languages are different we have our source
1121
+ encoder, target encoder, source decoder and target decoder.
1122
+
1123
+ 0:26:17.317 --> 0:26:20.202
1124
+ And for this task we only need monolingual
1125
+ data.
1126
+
1127
+ 0:26:20.202 --> 0:26:25.267
1128
+ We don't need any parallel data because it's
1129
+ just taking a sentence and shuffling it and
1130
+
1131
+ 0:26:25.267 --> 0:26:27.446
1132
+ reconstructing the original one.
1133
+
1134
+ 0:26:28.848 --> 0:26:31.058
1135
+ And we have four different blocks.
1136
+
1137
+ 0:26:31.058 --> 0:26:36.841
1138
+ This is kind of very important to keep in
1139
+ mind on how we change these connections later.
1140
+
1141
+ 0:26:41.121 --> 0:26:49.093
1142
+ Then this is more like the mathematical formulation
1143
+ where you predict the source given the noisy source.
1144
+
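As a rough illustration of the denoising auto-encoding objective described above, here is a minimal Python sketch of the noising step; the idea of training on (noisy, original) pairs follows the lecture, while the exact noise parameters and helper names are just illustrative assumptions.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3):
    """Corrupt a sentence for denoising auto-encoding:
    randomly drop some words and locally shuffle the rest."""
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    # each word may move at most ~shuffle_k positions from its original slot
    keys = [i + random.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

# The (noisy source, original source) pair is then treated like parallel data
# and trained with the usual cross-entropy loss.
sentence = "the cat sat on the mat".split()
training_pair = (add_noise(sentence), sentence)
```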
1145
+ 0:26:52.492 --> 0:26:55.090
1146
+ So that was the denoising auto-encoding.
1147
+
1148
+ 0:26:55.090 --> 0:26:58.403
1149
+ The second step is on-the-fly back translation.
1150
+
1151
+ 0:26:59.479 --> 0:27:06.386
1152
+ So what we do is, we put our model inference
1153
+ mode right, we take a source of sentences,
1154
+
1155
+ 0:27:06.386 --> 0:27:09.447
1156
+ and we generate a translation pattern.
1157
+
1158
+ 0:27:09.829 --> 0:27:18.534
1159
+ It might be completely wrong or maybe partially
1160
+ correct or so on, but we assume that the moral
1161
+
1162
+ 0:27:18.534 --> 0:27:20.091
1163
+ knows of it and.
1164
+
1165
+ 0:27:20.680 --> 0:27:25.779
1166
+ Tend rate: T head right and then what we do
1167
+ is assume that T head or not assume but T head
1168
+
1169
+ 0:27:25.779 --> 0:27:27.572
1170
+ and S are sentence space right.
1171
+
1172
+ 0:27:27.572 --> 0:27:29.925
1173
+ That's how we can handle the translation.
1174
+
1175
+ 0:27:30.530 --> 0:27:38.824
1176
+ So we train a supervised system on this sentence
1177
+ bed, so we do inference and then build a reverse
1178
+
1179
+ 0:27:38.824 --> 0:27:39.924
1180
+ translation.
1181
+
1182
+ 0:27:42.442 --> 0:27:49.495
1183
+ Are both more concrete, so we have a false
1184
+ sentence right, then we chamber the translation,
1185
+
1186
+ 0:27:49.495 --> 0:27:55.091
1187
+ then we give the general translation as an
1188
+ input and try to predict the.
1189
+
1190
+ 0:27:58.378 --> 0:28:03.500
1191
+ This is how we would do in practice right,
1192
+ so not before the source encoder was connected
1193
+
1194
+ 0:28:03.500 --> 0:28:08.907
1195
+ to the source decoder, but now we interchanged
1196
+ connections, so the source encoder is connected
1197
+
1198
+ 0:28:08.907 --> 0:28:10.216
1199
+ to the target decoder.
1200
+
1201
+ 0:28:10.216 --> 0:28:13.290
1202
+ The target encoder is turned into the source
1203
+ decoder.
1204
+
1205
+ 0:28:13.974 --> 0:28:20.747
1206
+ And given s we get t-hat and given t we get
1207
+ s-hat, so this is the first time.
1208
+
1209
+ 0:28:21.661 --> 0:28:24.022
1210
+ On the second time step, what you're going
1211
+ to do is reverse.
1212
+
1213
+ 0:28:24.664 --> 0:28:32.625
1214
+ So as that is here, t hat is here, and given
1215
+ s hat we are trying to predict t, and given
1216
+
1217
+ 0:28:32.625 --> 0:28:34.503
1218
+ t hat we are trying.
1219
+
1220
+ 0:28:36.636 --> 0:28:39.386
1221
+ Is this clear you have any questions on?
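To make the two time steps of on-the-fly back-translation a bit more concrete, here is a small sketch; `src2tgt` and `tgt2src` with `.translate()` and `.train_step()` methods are hypothetical stand-ins for the two encoder-decoder directions, not a real API.

```python
def backtranslation_round(src_sentences, tgt_sentences, src2tgt, tgt2src):
    # First time step: run the current models in inference mode
    t_hat = [src2tgt.translate(s) for s in src_sentences]   # S -> T-hat
    s_hat = [tgt2src.translate(t) for t in tgt_sentences]   # T -> S-hat

    # Second time step: treat (synthetic input, original output) as a
    # sentence pair and train the reverse direction with cross entropy
    for t_syn, s_orig in zip(t_hat, src_sentences):
        tgt2src.train_step(source=t_syn, target=s_orig)      # given T-hat, predict S
    for s_syn, t_orig in zip(s_hat, tgt_sentences):
        src2tgt.train_step(source=s_syn, target=t_orig)      # given S-hat, predict T
```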
1222
+
1223
+ 0:28:45.405 --> 0:28:50.823
1224
+ Bit more mathematically, we try to play the
1225
+ class, give and take and so it's always the
1226
+
1227
+ 0:28:50.823 --> 0:28:53.963
1228
+ supervised NMP technique that we are trying
1229
+ to do.
1230
+
1231
+ 0:28:53.963 --> 0:28:59.689
1232
+ But you're trying to create this synthetic
1233
+ pass that kind of helpers to build an unsurprised
1234
+
1235
+ 0:28:59.689 --> 0:29:00.181
1236
+ system.
1237
+
1238
+ 0:29:02.362 --> 0:29:08.611
1239
+ Now also with maybe you can see here is that
1240
+ if the source encoded and targeted encoded
1241
+
1242
+ 0:29:08.611 --> 0:29:14.718
1243
+ the language independent, we can always shuffle
1244
+ the connections and the translations.
1245
+
1246
+ 0:29:14.718 --> 0:29:21.252
1247
+ That's why it was important to find a way
1248
+ to generate language independent representations.
1249
+
1250
+ 0:29:21.441 --> 0:29:26.476
1251
+ And the way we try to force this language
1252
+ independence is the gan step.
1253
+
1254
+ 0:29:27.627 --> 0:29:34.851
1255
+ So the third step kind of combines all of
1256
+ them is where we try to use gun to make the
1257
+
1258
+ 0:29:34.851 --> 0:29:37.959
1259
+ encoded output language independent.
1260
+
1261
+ 0:29:37.959 --> 0:29:42.831
1262
+ So here it's the same picture but from a different
1263
+ paper.
1264
+
1265
+ 0:29:42.831 --> 0:29:43.167
1266
+ So.
1267
+
1268
+ 0:29:43.343 --> 0:29:48.888
1269
+ We have X-rays, X-ray objects which is monolingual
1270
+ in data.
1271
+
1272
+ 0:29:48.888 --> 0:29:50.182
1273
+ We add noise.
1274
+
1275
+ 0:29:50.690 --> 0:29:54.736
1276
+ Then we encode it using the source and the
1277
+ target encoders right.
1278
+
1279
+ 0:29:54.736 --> 0:29:58.292
1280
+ Then we get the latent space Z source and
1281
+ Z target right.
1282
+
1283
+ 0:29:58.292 --> 0:30:03.503
1284
+ Then we decode and try to reconstruct the
1285
+ original one and this is the auto encoding
1286
+
1287
+ 0:30:03.503 --> 0:30:08.469
1288
+ loss which takes the X source which is the
1289
+ original one and then the translated.
1290
+
1291
+ 0:30:08.468 --> 0:30:09.834
1292
+ Predicted output.
1293
+
1294
+ 0:30:09.834 --> 0:30:16.740
1295
+ So hello, it always is the auto encoding step
1296
+ where the gun concern is in the between gang
1297
+
1298
+ 0:30:16.740 --> 0:30:24.102
1299
+ cord outputs, and here we have an discriminator
1300
+ which tries to predict which language the latent
1301
+
1302
+ 0:30:24.102 --> 0:30:25.241
1303
+ space is from.
1304
+
1305
+ 0:30:26.466 --> 0:30:33.782
1306
+ So given Z source it has to predict that the
1307
+ representation is from a language source and
1308
+
1309
+ 0:30:33.782 --> 0:30:39.961
1310
+ given Z target it has to predict the representation
1311
+ from a language target.
1312
+
1313
+ 0:30:40.520 --> 0:30:45.135
1314
+ And our headquarters are kind of teaching
1315
+ data right now, and then we have a separate
1316
+
1317
+ 0:30:45.135 --> 0:30:49.803
1318
+ network discriminator which tries to predict
1319
+ which language the Latin spaces are from.
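A minimal PyTorch-style sketch of this adversarial step, assuming `z_src` and `z_tgt` are pooled encoder outputs; the layer sizes and helper names are assumptions, not the exact setup from the paper.

```python
import torch
import torch.nn as nn

hidden = 512  # assumed encoder dimension
discriminator = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Linear(256, 2))
xent = nn.CrossEntropyLoss()

def adversarial_losses(z_src, z_tgt):
    """Discriminator guesses which language a latent vector came from;
    the encoders are updated with flipped labels so the guess fails."""
    src_lbl = torch.zeros(z_src.size(0), dtype=torch.long)  # 0 = source language
    tgt_lbl = torch.ones(z_tgt.size(0), dtype=torch.long)   # 1 = target language
    d_loss = xent(discriminator(z_src.detach()), src_lbl) + \
             xent(discriminator(z_tgt.detach()), tgt_lbl)
    # encoder loss: fool the discriminator into predicting the other language
    g_loss = xent(discriminator(z_src), tgt_lbl) + xent(discriminator(z_tgt), src_lbl)
    return d_loss, g_loss
```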
1320
+
1321
+ 0:30:53.393 --> 0:30:57.611
1322
+ And then this one is when we combined guns
1323
+ with the other ongoing step.
1324
+
1325
+ 0:30:57.611 --> 0:31:02.767
1326
+ Then we had an on the fly back translation
1327
+ step right, and so here what we're trying to
1328
+
1329
+ 0:31:02.767 --> 0:31:03.001
1330
+ do.
1331
+
1332
+ 0:31:03.863 --> 0:31:07.260
1333
+ Is the same, basically just exactly the same.
1334
+
1335
+ 0:31:07.260 --> 0:31:12.946
1336
+ But when we are doing the training, we are
1337
+ at the adversarial laws here, so.
1338
+
1339
+ 0:31:13.893 --> 0:31:20.762
1340
+ We take our X source, gender and intermediate
1341
+ translation, so why target and why source right?
1342
+
1343
+ 0:31:20.762 --> 0:31:27.342
1344
+ This is the previous time step, and then we
1345
+ have to encode the new sentences and basically
1346
+
1347
+ 0:31:27.342 --> 0:31:32.764
1348
+ make them language independent or train to
1349
+ make them language independent.
1350
+
1351
+ 0:31:33.974 --> 0:31:43.502
1352
+ And then the hope is that now if we do this
1353
+ using monolingual data alone we can just switch
1354
+
1355
+ 0:31:43.502 --> 0:31:47.852
1356
+ connections and then get our translation.
1357
+
1358
+ 0:31:47.852 --> 0:31:49.613
1359
+ So the scale of.
1360
+
1361
+ 0:31:54.574 --> 0:32:03.749
1362
+ And so as I said before, guns are quite good
1363
+ for vision right, so this is kind of like the
1364
+
1365
+ 0:32:03.749 --> 0:32:11.312
1366
+ cycle gun approach that you might have seen
1367
+ in any computer vision course.
1368
+
1369
+ 0:32:11.911 --> 0:32:19.055
1370
+ Somehow protect that place at least not as
1371
+ promising as for merchants, and so people.
1372
+
1373
+ 0:32:19.055 --> 0:32:23.706
1374
+ What they did is to enforce this language
1375
+ independence.
1376
+
1377
+ 0:32:25.045 --> 0:32:31.226
1378
+ They try to use a shared encoder instead of
1379
+ having these different encoders right, and
1380
+
1381
+ 0:32:31.226 --> 0:32:37.835
1382
+ so this is basically the same painting objectives
1383
+ as before, but what you're going to do now
1384
+
1385
+ 0:32:37.835 --> 0:32:43.874
1386
+ is learn cross language language and then use
1387
+ the single encoder for both languages.
1388
+
1389
+ 0:32:44.104 --> 0:32:49.795
1390
+ And this kind also forces them to be in the
1391
+ same space, and then you can choose whichever
1392
+
1393
+ 0:32:49.795 --> 0:32:50.934
1394
+ decoder you want.
1395
+
1396
+ 0:32:52.552 --> 0:32:58.047
1397
+ You can use guns or you can just use a shared
1398
+ encoder and type to build your unsupervised
1399
+
1400
+ 0:32:58.047 --> 0:32:58.779
1401
+ MTT system.
1402
+
1403
+ 0:33:08.488 --> 0:33:09.808
1404
+ These are now the.
1405
+
1406
+ 0:33:09.808 --> 0:33:15.991
1407
+ The enhancements that you can do on top of
1408
+ your unsavoizant system is one you can create
1409
+
1410
+ 0:33:15.991 --> 0:33:16.686
1411
+ a shared.
1412
+
1413
+ 0:33:18.098 --> 0:33:22.358
1414
+ On top of the shared encoder you can ask are
1415
+ your guns lost or whatever so there's a lot
1416
+
1417
+ 0:33:22.358 --> 0:33:22.550
1418
+ of.
1419
+
1420
+ 0:33:24.164 --> 0:33:29.726
1421
+ The other thing that is more relevant right
1422
+ now is that you can create parallel data by
1423
+
1424
+ 0:33:29.726 --> 0:33:35.478
1425
+ word to word translation right because you
1426
+ know how to do all supervised word translation.
1427
+
1428
+ 0:33:36.376 --> 0:33:40.548
1429
+ First step is to create parallel data, assuming
1430
+ that word translations are quite good.
1431
+
1432
+ 0:33:41.361 --> 0:33:47.162
1433
+ And then you claim a supervised and empty
1434
+ model on these more likely wrong model data,
1435
+
1436
+ 0:33:47.162 --> 0:33:50.163
1437
+ but somehow gives you a good starting point.
1438
+
1439
+ 0:33:50.163 --> 0:33:56.098
1440
+ So you build your supervised and empty system
1441
+ on the word translation data, and then you
1442
+
1443
+ 0:33:56.098 --> 0:33:59.966
1444
+ initialize it before you're doing unsupervised
1445
+ and empty.
1446
+
1447
+ 0:34:00.260 --> 0:34:05.810
1448
+ And the hope is that when you're doing the
1449
+ back pain installation, it's a good starting
1450
+
1451
+ 0:34:05.810 --> 0:34:11.234
1452
+ point, but it's one technique that you can
1453
+ do to to improve your anthropoids and the.
1454
+
1455
+ 0:34:17.097 --> 0:34:25.879
1456
+ In the previous case we had: The way we know
1457
+ when to stop was to see comedians on the gun
1458
+
1459
+ 0:34:25.879 --> 0:34:26.485
1460
+ training.
1461
+
1462
+ 0:34:26.485 --> 0:34:28.849
1463
+ Actually, all we want to do is when W.
1464
+
1465
+ 0:34:28.849 --> 0:34:32.062
1466
+ Comedians, which is quite easy to know when
1467
+ to stop.
1468
+
1469
+ 0:34:32.062 --> 0:34:37.517
1470
+ But in a realistic case, we don't have any
1471
+ parallel data right, so there's no validation.
1472
+
1473
+ 0:34:37.517 --> 0:34:42.002
1474
+ Or I mean, we might have test data in the
1475
+ end, but there's no validation.
1476
+
1477
+ 0:34:43.703 --> 0:34:48.826
1478
+ How will we tune our hyper parameters in this
1479
+ case because it's not really there's nothing
1480
+
1481
+ 0:34:48.826 --> 0:34:49.445
1482
+ for us to?
1483
+
1484
+ 0:34:50.130 --> 0:34:53.326
1485
+ Or the gold data in a sense like so.
1486
+
1487
+ 0:34:53.326 --> 0:35:01.187
1488
+ How do you think we can evaluate such systems
1489
+ or how can we tune hyper parameters in this?
1490
+
1491
+ 0:35:11.711 --> 0:35:17.089
1492
+ So what you're going to do is use the back
1493
+ translation technique.
1494
+
1495
+ 0:35:17.089 --> 0:35:24.340
1496
+ It's like a common technique where you have
1497
+ nothing okay that is to use back translation
1498
+
1499
+ 0:35:24.340 --> 0:35:26.947
1500
+ somehow and what you can do is.
1501
+
1502
+ 0:35:26.947 --> 0:35:31.673
1503
+ The main idea is validate on how good the
1504
+ reconstruction.
1505
+
1506
+ 0:35:32.152 --> 0:35:37.534
1507
+ So the idea is that if you have a good system
1508
+ then the intermediate translation is quite
1509
+
1510
+ 0:35:37.534 --> 0:35:39.287
1511
+ good and going back is easy.
1512
+
1513
+ 0:35:39.287 --> 0:35:44.669
1514
+ But if it's just noise that you generate in
1515
+ the forward step then it's really hard to go
1516
+
1517
+ 0:35:44.669 --> 0:35:46.967
1518
+ back, which is kind of the main idea.
1519
+
1520
+ 0:35:48.148 --> 0:35:53.706
1521
+ So the way it works is that we take a source
1522
+ sentence, we generate a translation in target
1523
+
1524
+ 0:35:53.706 --> 0:35:59.082
1525
+ language right, and then again can state the
1526
+ generated sentence and compare it with the
1527
+
1528
+ 0:35:59.082 --> 0:36:01.342
1529
+ original one, and if they're closer.
1530
+
1531
+ 0:36:01.841 --> 0:36:09.745
1532
+ It means that we have a good system, and if
1533
+ they are far this is kind of like an unsupervised
1534
+
1535
+ 0:36:09.745 --> 0:36:10.334
1536
+ grade.
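One way to turn this round-trip idea into a concrete validation score is sketched below; the translation models are hypothetical objects with a `.translate()` method, and BLEU via `sacrebleu` is just one possible choice of similarity measure.

```python
import sacrebleu  # assumes sacrebleu is installed

def round_trip_score(sentences, src2tgt, tgt2src):
    """Unsupervised validation: source -> target -> back to source,
    then compare the reconstruction against the original sentences."""
    forward = [src2tgt.translate(s) for s in sentences]
    back = [tgt2src.translate(t) for t in forward]
    return sacrebleu.corpus_bleu(back, [sentences]).score  # higher = better round trip
```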
1537
+
1538
+ 0:36:17.397 --> 0:36:21.863
1539
+ As far as the amount of data that you need.
1540
+
1541
+ 0:36:23.083 --> 0:36:27.995
1542
+ This was like the first initial resistance
1543
+ on on these systems is that you had.
1544
+
1545
+ 0:36:27.995 --> 0:36:32.108
1546
+ They wanted to do English and French and they
1547
+ had fifteen million.
1548
+
1549
+ 0:36:32.108 --> 0:36:38.003
1550
+ There was fifteen million more linguist sentences
1551
+ so it's quite a lot and they were able to get
1552
+
1553
+ 0:36:38.003 --> 0:36:40.581
1554
+ thirty two blue on these kinds of setups.
1555
+
1556
+ 0:36:41.721 --> 0:36:47.580
1557
+ But unsurprisingly if you have zero point
1558
+ one million pilot sentences you get the same
1559
+
1560
+ 0:36:47.580 --> 0:36:48.455
1561
+ performance.
1562
+
1563
+ 0:36:48.748 --> 0:36:50.357
1564
+ So it's a lot of training.
1565
+
1566
+ 0:36:50.357 --> 0:36:55.960
1567
+ It's a lot of monolingual data, but monolingual
1568
+ data is relatively easy to obtain is the fact
1569
+
1570
+ 0:36:55.960 --> 0:37:01.264
1571
+ that the training is also quite longer than
1572
+ the supervised system, but it's unsupervised
1573
+
1574
+ 0:37:01.264 --> 0:37:04.303
1575
+ so it's kind of the trade off that you are
1576
+ making.
1577
+
1578
+ 0:37:07.367 --> 0:37:13.101
1579
+ The other thing to note is that it's English
1580
+ and French, which is very close to our exemptions.
1581
+
1582
+ 0:37:13.101 --> 0:37:18.237
1583
+ Also, the monolingual data that they took
1584
+ are kind of from similar domains and so on.
1585
+
1586
+ 0:37:18.638 --> 0:37:27.564
1587
+ So that's why they're able to build such a
1588
+ good system, but you'll see later that it fails.
1589
+
1590
+ 0:37:36.256 --> 0:37:46.888
1591
+ Voice, and so mean what people usually do
1592
+ is first build a system right using whatever
1593
+
1594
+ 0:37:46.888 --> 0:37:48.110
1595
+ parallel.
1596
+
1597
+ 0:37:48.608 --> 0:37:55.864
1598
+ Then they use monolingual data and do back
1599
+ translation, so this is always being the standard
1600
+
1601
+ 0:37:55.864 --> 0:38:04.478
1602
+ way to to improve, and what people have seen
1603
+ is that: You don't even need zero point one
1604
+
1605
+ 0:38:04.478 --> 0:38:05.360
1606
+ million right.
1607
+
1608
+ 0:38:05.360 --> 0:38:10.706
1609
+ You just need like ten thousand or so on and
1610
+ then you do the monolingual back time station
1611
+
1612
+ 0:38:10.706 --> 0:38:12.175
1613
+ and you're still better.
1614
+
1615
+ 0:38:12.175 --> 0:38:13.291
1616
+ The answer is why.
1617
+
1618
+ 0:38:13.833 --> 0:38:19.534
1619
+ The question is it's really worth trying to
1620
+ to do this or maybe it's always better to find
1621
+
1622
+ 0:38:19.534 --> 0:38:20.787
1623
+ some parallel data.
1624
+
1625
+ 0:38:20.787 --> 0:38:26.113
1626
+ I'll expand a bit of money on getting few
1627
+ parallel data and then use it to start and
1628
+
1629
+ 0:38:26.113 --> 0:38:27.804
1630
+ find to build your system.
1631
+
1632
+ 0:38:27.804 --> 0:38:33.756
1633
+ So it was kind of the understanding that billing
1634
+ wool and spoiled systems are not that really.
1635
+
1636
+ 0:38:50.710 --> 0:38:54.347
1637
+ The thing is that with unlabeled data.
1638
+
1639
+ 0:38:57.297 --> 0:39:05.488
1640
+ Not in an obtaining signal, so when we are
1641
+ starting basically what we want to do is first
1642
+
1643
+ 0:39:05.488 --> 0:39:13.224
1644
+ get a good translation system and then use
1645
+ an unlabeled monolingual data to improve.
1646
+
1647
+ 0:39:13.613 --> 0:39:15.015
1648
+ But if you start from U.
1649
+
1650
+ 0:39:15.015 --> 0:39:15.183
1651
+ N.
1652
+
1653
+ 0:39:15.183 --> 0:39:20.396
1654
+ Empty our model might be really bad like it
1655
+ would be somewhere translating completely wrong.
1656
+
1657
+ 0:39:20.760 --> 0:39:26.721
1658
+ And then when you find your unlabeled data,
1659
+ it basically might be harming, or maybe the
1660
+
1661
+ 0:39:26.721 --> 0:39:28.685
1662
+ same as supervised applause.
1663
+
1664
+ 0:39:28.685 --> 0:39:35.322
1665
+ So the thing is, I hope, by fine tuning on
1666
+ labeled data as first is to get a good initialization.
1667
+
1668
+ 0:39:35.835 --> 0:39:38.404
1669
+ And then use the unsupervised techniques to
1670
+ get better.
1671
+
1672
+ 0:39:38.818 --> 0:39:42.385
1673
+ But if your starting point is really bad then
1674
+ it's not.
1675
+
1676
+ 0:39:45.185 --> 0:39:47.324
1677
+ Year so as we said before.
1678
+
1679
+ 0:39:47.324 --> 0:39:52.475
1680
+ This is kind of like the self supervised training
1681
+ usually works.
1682
+
1683
+ 0:39:52.475 --> 0:39:54.773
1684
+ First we have parallel data.
1685
+
1686
+ 0:39:56.456 --> 0:39:58.062
1687
+ Source language is X.
1688
+
1689
+ 0:39:58.062 --> 0:39:59.668
1690
+ Target language is Y.
1691
+
1692
+ 0:39:59.668 --> 0:40:06.018
1693
+ In the end we want a system that does X to
1694
+ Y, not Y to X, but first we want to train a
1695
+
1696
+ 0:40:06.018 --> 0:40:10.543
1697
+ backward model as it is Y to X, so target language
1698
+ to source.
1699
+
1700
+ 0:40:11.691 --> 0:40:17.353
1701
+ Then we take our moonlighting will target
1702
+ sentences, use our backward model to generate
1703
+
1704
+ 0:40:17.353 --> 0:40:21.471
1705
+ synthetic source, and then we join them with
1706
+ our original data.
1707
+
1708
+ 0:40:21.471 --> 0:40:27.583
1709
+ So now we have this noisy input, but always
1710
+ the gold output, which is kind of really important
1711
+
1712
+ 0:40:27.583 --> 0:40:29.513
1713
+ when you're doing backpaints.
1714
+
1715
+ 0:40:30.410 --> 0:40:36.992
1716
+ And then you can coordinate these big data
1717
+ and then you can train your X to Y cholesterol
1718
+
1719
+ 0:40:36.992 --> 0:40:44.159
1720
+ system and then you can always do this in multiple
1721
+ steps and usually three, four steps which kind
1722
+
1723
+ 0:40:44.159 --> 0:40:48.401
1724
+ of improves always and then finally get your
1725
+ best system.
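The iterated back-translation pipeline just described could be outlined roughly like this; `train_nmt` and `.translate()` are hypothetical helpers standing in for a normal supervised training run and decoding.

```python
def backtranslation_pipeline(parallel, mono_tgt, rounds=3):
    """parallel: list of (x, y) pairs, mono_tgt: monolingual target sentences."""
    # Step 1: train the backward model Y -> X on the available parallel data
    backward = train_nmt(src=[y for _, y in parallel], tgt=[x for x, _ in parallel])
    forward = None
    for _ in range(rounds):
        # Step 2: synthetic sources for monolingual targets (noisy X, gold Y)
        synthetic = [(backward.translate(y), y) for y in mono_tgt]
        # Step 3: train the forward model X -> Y on original + synthetic data
        data = parallel + synthetic
        forward = train_nmt(src=[x for x, _ in data], tgt=[y for _, y in data])
        # (in practice the backward model can be refreshed in later rounds too)
    return forward
```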
1726
+
1727
+ 0:40:49.029 --> 0:40:54.844
1728
+ The point that I'm trying to make is that
1729
+ although answers and MPs the scores that I've
1730
+
1731
+ 0:40:54.844 --> 0:41:00.659
1732
+ shown before were quite good, you probably
1733
+ can get the same performance with with fifty
1734
+
1735
+ 0:41:00.659 --> 0:41:06.474
1736
+ thousand sentences, and also the languages
1737
+ that they've shown are quite similar and the
1738
+
1739
+ 0:41:06.474 --> 0:41:08.654
1740
+ texts were from the same domain.
1741
+
1742
+ 0:41:14.354 --> 0:41:21.494
1743
+ So any questions on u n m t ok yeah.
1744
+
1745
+ 0:41:22.322 --> 0:41:28.982
1746
+ So after this fact that temperature was already
1747
+ better than than empty, what people have tried
1748
+
1749
+ 0:41:28.982 --> 0:41:34.660
1750
+ is to use this idea of multilinguality as you
1751
+ have seen in the previous lecture.
1752
+
1753
+ 0:41:34.660 --> 0:41:41.040
1754
+ The question is how can we do this knowledge
1755
+ transfer from high resource language to lower
1756
+
1757
+ 0:41:41.040 --> 0:41:42.232
1758
+ source language?
1759
+
1760
+ 0:41:44.484 --> 0:41:51.074
1761
+ One way to promote this language independent
1762
+ representations is to share the encoder and
1763
+
1764
+ 0:41:51.074 --> 0:41:57.960
1765
+ decoder for all languages, all their available
1766
+ languages, and that kind of hopefully enables
1767
+
1768
+ 0:41:57.960 --> 0:42:00.034
1769
+ the the knowledge transfer.
1770
+
1771
+ 0:42:03.323 --> 0:42:08.605
1772
+ When we're doing multilinguality, the two
1773
+ questions we need to to think of is how does
1774
+
1775
+ 0:42:08.605 --> 0:42:09.698
1776
+ the encoder know?
1777
+
1778
+ 0:42:09.698 --> 0:42:14.495
1779
+ How does the encoder encoder know which language
1780
+ that we're dealing with that?
1781
+
1782
+ 0:42:15.635 --> 0:42:20.715
1783
+ You already might have known the answer also,
1784
+ and the second question is how can we promote
1785
+
1786
+ 0:42:20.715 --> 0:42:24.139
1787
+ the encoder to generate language independent
1788
+ representations?
1789
+
1790
+ 0:42:25.045 --> 0:42:32.580
1791
+ By solving these two problems we can take
1792
+ help of high resource languages to do unsupervised
1793
+
1794
+ 0:42:32.580 --> 0:42:33.714
1795
+ translations.
1796
+
1797
+ 0:42:34.134 --> 0:42:40.997
1798
+ Typical example would be you want to do unsurpressed
1799
+ between English and Dutch right, but you are
1800
+
1801
+ 0:42:40.997 --> 0:42:47.369
1802
+ parallel data between English and German, so
1803
+ the question is can we use this parallel data
1804
+
1805
+ 0:42:47.369 --> 0:42:51.501
1806
+ to help building an unsurpressed betweenEnglish
1807
+ and Dutch?
1808
+
1809
+ 0:42:56.296 --> 0:43:01.240
1810
+ For the first one we try to take help of language
1811
+ embeddings for tokens, and this kind of is
1812
+
1813
+ 0:43:01.240 --> 0:43:05.758
1814
+ a straightforward way to know to tell them
1815
+ well which language they're dealing with.
1816
+
1817
+ 0:43:06.466 --> 0:43:11.993
1818
+ And for the second one we're going to look
1819
+ at some pre training objectives which are also
1820
+
1821
+ 0:43:11.993 --> 0:43:17.703
1822
+ kind of unsupervised so we need monolingual
1823
+ data mostly and this kind of helps us to promote
1824
+
1825
+ 0:43:17.703 --> 0:43:20.221
1826
+ the language independent representation.
1827
+
1828
+ 0:43:23.463 --> 0:43:29.954
1829
+ So the first three things more that we'll
1830
+ look at is excel, which is quite famous if
1831
+
1832
+ 0:43:29.954 --> 0:43:32.168
1833
+ you haven't heard of it yet.
1834
+
1835
+ 0:43:32.552 --> 0:43:40.577
1836
+ And: The way it works is that it's basically
1837
+ a transformer encoder right, so it's like the
1838
+
1839
+ 0:43:40.577 --> 0:43:42.391
1840
+ just the encoder module.
1841
+
1842
+ 0:43:42.391 --> 0:43:44.496
1843
+ No, there's no decoder here.
1844
+
1845
+ 0:43:44.884 --> 0:43:51.481
1846
+ And what we're trying to do is mask two tokens
1847
+ in a sequence and try to predict these mask
1848
+
1849
+ 0:43:51.481 --> 0:43:52.061
1850
+ tokens.
1851
+
1852
+ 0:43:52.061 --> 0:43:55.467
1853
+ So this is typically called masked language modeling.
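A small sketch of the masking step used in this kind of pretraining; the 15% rate is the commonly used value and the `[MASK]` string is an assumption about the vocabulary, while the encoder itself is left out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15):
    """Build (input, target) for masked language modeling: a fraction of
    tokens is replaced by [MASK] and only those positions get a loss."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # predict the original token here
        else:
            inputs.append(tok)
            targets.append(None)   # ignored by the loss
    return inputs, targets
```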
1854
+
1855
+ 0:43:55.996 --> 0:44:05.419
1856
+ Typical language modeling that you see is
1857
+ the Danish language modeling where you predict
1858
+
1859
+ 0:44:05.419 --> 0:44:08.278
1860
+ the next token in English.
1861
+
1862
+ 0:44:08.278 --> 0:44:11.136
1863
+ Then we have the position.
1864
+
1865
+ 0:44:11.871 --> 0:44:18.774
1866
+ Then we have the token embellings, and then
1867
+ here we have the mass token, and then we have
1868
+
1869
+ 0:44:18.774 --> 0:44:22.378
1870
+ the transformer encoder blocks to predict the.
1871
+
1872
+ 0:44:24.344 --> 0:44:30.552
1873
+ To do this for all languages using the same
1874
+ tang somewhere encoded and this kind of helps
1875
+
1876
+ 0:44:30.552 --> 0:44:36.760
1877
+ us to push the the sentence and bearings or
1878
+ the output of the encoded into a common space
1879
+
1880
+ 0:44:36.760 --> 0:44:37.726
1881
+ per multiple.
1882
+
1883
+ 0:44:42.782 --> 0:44:49.294
1884
+ So first we train an MLM on both source, both
1885
+ source and target language sites, and then
1886
+
1887
+ 0:44:49.294 --> 0:44:54.928
1888
+ we use it as a starting point for the encoded
1889
+ and decoded for a UNMP system.
1890
+
1891
+ 0:44:55.475 --> 0:45:03.175
1892
+ So we take a monolingual data, build a mass
1893
+ language model on both source and target languages,
1894
+
1895
+ 0:45:03.175 --> 0:45:07.346
1896
+ and then read it to be or initialize that in
1897
+ the U.
1898
+
1899
+ 0:45:07.346 --> 0:45:07.586
1900
+ N.
1901
+
1902
+ 0:45:07.586 --> 0:45:07.827
1903
+ P.
1904
+
1905
+ 0:45:07.827 --> 0:45:08.068
1906
+ C.
1907
+
1908
+ 0:45:09.009 --> 0:45:14.629
1909
+ Here we look at two languages, but you can
1910
+ also do it with one hundred languages once.
1911
+
1912
+ 0:45:14.629 --> 0:45:20.185
1913
+ So they're retain checkpoints that you can
1914
+ use, which are quite which have seen quite
1915
+
1916
+ 0:45:20.185 --> 0:45:21.671
1917
+ a lot of data and use.
1918
+
1919
+ 0:45:21.671 --> 0:45:24.449
1920
+ It always has a starting point for your U.
1921
+
1922
+ 0:45:24.449 --> 0:45:24.643
1923
+ N.
1924
+
1925
+ 0:45:24.643 --> 0:45:27.291
1926
+ MP system, which in practice works well.
1927
+
1928
+ 0:45:31.491 --> 0:45:36.759
1929
+ This detail is that since this is an encoder
1930
+ block only, and your U.
1931
+
1932
+ 0:45:36.759 --> 0:45:36.988
1933
+ N.
1934
+
1935
+ 0:45:36.988 --> 0:45:37.217
1936
+ M.
1937
+
1938
+ 0:45:37.217 --> 0:45:37.446
1939
+ T.
1940
+
1941
+ 0:45:37.446 --> 0:45:40.347
1942
+ System is encodered, decodered right.
1943
+
1944
+ 0:45:40.347 --> 0:45:47.524
1945
+ So there's this cross attention that's missing,
1946
+ but you can always branch like that randomly.
1947
+
1948
+ 0:45:47.524 --> 0:45:48.364
1949
+ It's fine.
1950
+
1951
+ 0:45:48.508 --> 0:45:53.077
1952
+ Not everything is initialized, but it's still
1953
+ decent.
1954
+
1955
+ 0:45:56.056 --> 0:46:02.141
1956
+ Then we have the other one is M by plane,
1957
+ and here you see that this kind of builds on
1958
+
1959
+ 0:46:02.141 --> 0:46:07.597
1960
+ the the unsupervised training objector, which
1961
+ is the realizing auto encoding.
1962
+
1963
+ 0:46:08.128 --> 0:46:14.337
1964
+ So what they do is they say that we don't
1965
+ even need to do the gun outback translation,
1966
+
1967
+ 0:46:14.337 --> 0:46:17.406
1968
+ but you can do it later, but pre training.
1969
+
1970
+ 0:46:17.406 --> 0:46:24.258
1971
+ We just do do doing doing doing water inputting
1972
+ on all different languages, and that also gives
1973
+
1974
+ 0:46:24.258 --> 0:46:32.660
1975
+ you: Out of the box good performance, so what
1976
+ we basically have here is the transformer encoded.
1977
+
1978
+ 0:46:34.334 --> 0:46:37.726
1979
+ You are trying to generate a reconstructed
1980
+ sequence.
1981
+
1982
+ 0:46:37.726 --> 0:46:38.942
1983
+ You need a tickle.
1984
+
1985
+ 0:46:39.899 --> 0:46:42.022
1986
+ So we gave an input sentence.
1987
+
1988
+ 0:46:42.022 --> 0:46:48.180
1989
+ We tried to predict the masked tokens from
1990
+ the or we tried to reconstruct the original
1991
+
1992
+ 0:46:48.180 --> 0:46:52.496
1993
+ sentence from the input segments, which was
1994
+ corrupted right.
1995
+
1996
+ 0:46:52.496 --> 0:46:57.167
1997
+ So this is the same denoting objective that
1998
+ you have seen before.
1999
+
2000
+ 0:46:58.418 --> 0:46:59.737
2001
+ This is for English.
2002
+
2003
+ 0:46:59.737 --> 0:47:04.195
2004
+ I think this is for Japanese and then once
2005
+ we do it for all languages.
2006
+
2007
+ 0:47:04.195 --> 0:47:09.596
2008
+ I mean they have this difference on twenty
2009
+ five, fifty or so on and then you can find
2010
+
2011
+ 0:47:09.596 --> 0:47:11.794
2012
+ you on your sentence and document.
2013
+
2014
+ 0:47:13.073 --> 0:47:20.454
2015
+ And so what they is this for the supervised
2016
+ techniques, but you can also use this as initializations
2017
+
2018
+ 0:47:20.454 --> 0:47:25.058
2019
+ for unsupervised buildup on that which also
2020
+ in practice works.
2021
+
2022
+ 0:47:30.790 --> 0:47:36.136
2023
+ Then we have these, so still now we kind of
2024
+ didn't see the the states benefit from the
2025
+
2026
+ 0:47:36.136 --> 0:47:38.840
2027
+ high resource language right, so as I said.
2028
+
2029
+ 0:47:38.878 --> 0:47:44.994
2030
+ Why you can use English as something for English
2031
+ to Dutch, and if you want a new Catalan, you
2032
+
2033
+ 0:47:44.994 --> 0:47:46.751
2034
+ can use English to French.
2035
+
2036
+ 0:47:48.408 --> 0:47:55.866
2037
+ One typical way to do this is to use favorite
2038
+ translation lights or you take the.
2039
+
2040
+ 0:47:55.795 --> 0:48:01.114
2041
+ So here it's finished two weeks so you take
2042
+ your time say from finish to English English
2043
+
2044
+ 0:48:01.114 --> 0:48:03.743
2045
+ two weeks and then you get the translation.
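Pivoting itself is simple to sketch; `fi2en` and `en2el` are hypothetical Finnish-to-English and English-to-Greek models with a `.translate()` method.

```python
def pivot_translate(sentence, fi2en, en2el):
    """Pivot translation: Finnish -> English -> Greek through
    the high-resource pivot language."""
    english = fi2en.translate(sentence)  # first leg into the pivot
    return en2el.translate(english)      # second leg to the actual target
```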
2046
+
2047
+ 0:48:04.344 --> 0:48:10.094
2048
+ What's important is that you have these different
2049
+ techniques and you can always think of which
2050
+
2051
+ 0:48:10.094 --> 0:48:12.333
2052
+ one to use given the data situation.
2053
+
2054
+ 0:48:12.333 --> 0:48:18.023
2055
+ So if it was like finish to Greek maybe it's
2056
+ pivotal better because you might get good finish
2057
+
2058
+ 0:48:18.023 --> 0:48:20.020
2059
+ to English and English to Greek.
2060
+
2061
+ 0:48:20.860 --> 0:48:23.255
2062
+ Sometimes it also depends on the language
2063
+ pair.
2064
+
2065
+ 0:48:23.255 --> 0:48:27.595
2066
+ There might be some information loss and so
2067
+ on, so there are quite a few variables you
2068
+
2069
+ 0:48:27.595 --> 0:48:30.039
2070
+ need to think of and decide which system to
2071
+ use.
2072
+
2073
+ 0:48:32.752 --> 0:48:39.654
2074
+ Then there's a zero shot, which probably also
2075
+ I've seen in the multilingual course, and how
2076
+
2077
+ 0:48:39.654 --> 0:48:45.505
2078
+ if you can improve the language independence
2079
+ then your zero shot gets better.
2080
+
2081
+ 0:48:45.505 --> 0:48:52.107
2082
+ So maybe if you use the multilingual models
2083
+ and do zero shot directly, it's quite good.
2084
+
2085
+ 0:48:53.093 --> 0:48:58.524
2086
+ Thought we have zero shots per word, and then
2087
+ we have the answer to voice translation where
2088
+
2089
+ 0:48:58.524 --> 0:49:00.059
2090
+ we can calculate between.
2091
+
2092
+ 0:49:00.600 --> 0:49:02.762
2093
+ Just when there is no battle today.
2094
+
2095
+ 0:49:06.686 --> 0:49:07.565
2096
+ Is to solve.
2097
+
2098
+ 0:49:07.565 --> 0:49:11.959
2099
+ So sometimes what we have seen so far is that
2100
+ we basically have.
2101
+
2102
+ 0:49:15.255 --> 0:49:16.754
2103
+ To do from looking at it.
2104
+
2105
+ 0:49:16.836 --> 0:49:19.307
2106
+ These two files alone you can create a dictionary.
2107
+
2108
+ 0:49:19.699 --> 0:49:26.773
2109
+ Can build an unsupervised entry system, not
2110
+ always, but if the domains are similar in the
2111
+
2112
+ 0:49:26.773 --> 0:49:28.895
2113
+ languages, that's similar.
2114
+
2115
+ 0:49:28.895 --> 0:49:36.283
2116
+ But if there are distant languages, then the
2117
+ unsupervised texting doesn't usually work really
2118
+
2119
+ 0:49:36.283 --> 0:49:36.755
2120
+ well.
2121
+
2122
+ 0:49:37.617 --> 0:49:40.297
2123
+ What um.
2124
+
2125
+ 0:49:40.720 --> 0:49:46.338
2126
+ Would be is that if you can get some paddle
2127
+ data from somewhere or do bitex mining that
2128
+
2129
+ 0:49:46.338 --> 0:49:51.892
2130
+ we have seen in the in the laser practicum
2131
+ then you can use that as to initialize your
2132
+
2133
+ 0:49:51.892 --> 0:49:57.829
2134
+ system and then try and accept a semi supervised
2135
+ energy system and that would be better than
2136
+
2137
+ 0:49:57.829 --> 0:50:00.063
2138
+ just building an unsupervised and.
2139
+
2140
+ 0:50:00.820 --> 0:50:06.546
2141
+ With that as the end.
2142
+
2143
+ 0:50:07.207 --> 0:50:08.797
2144
+ Quickly could be.
2145
+
2146
+ 0:50:16.236 --> 0:50:25.070
2147
+ In common, they can catch the worst because
2148
+ the thing about finding a language is: And
2149
+
2150
+ 0:50:25.070 --> 0:50:34.874
2151
+ there's another joy in playing these games,
2152
+ almost in the middle of a game, and she's a
2153
+
2154
+ 0:50:34.874 --> 0:50:40.111
2155
+ characteristic too, and she is a global waver.
2156
+
2157
+ 0:50:56.916 --> 0:51:03.798
2158
+ Next talk inside and this somehow gives them
2159
+ many abilities, not only translation but other
2160
+
2161
+ 0:51:03.798 --> 0:51:08.062
2162
+ than that there are quite a few things that
2163
+ they can do.
2164
+
2165
+ 0:51:10.590 --> 0:51:17.706
2166
+ But the translation in itself usually doesn't
2167
+ really work really well if you build a system
2168
+
2169
+ 0:51:17.706 --> 0:51:20.878
2170
+ from your specific system for your case.
2171
+
2172
+ 0:51:22.162 --> 0:51:27.924
2173
+ I would guess that it's usually better than
2174
+ the LLM, but you can always adapt the LLM to
2175
+
2176
+ 0:51:27.924 --> 0:51:31.355
2177
+ the task that you want, and then it could be
2178
+ better.
2179
+
2180
+ 0:51:32.152 --> 0:51:37.849
2181
+ A little amount of the box might not be the
2182
+ best choice for your task force.
2183
+
2184
+ 0:51:37.849 --> 0:51:44.138
2185
+ For me, I'm working on new air translation,
2186
+ so it's more about translating software.
2187
+
2188
+ 0:51:45.065 --> 0:51:50.451
2189
+ And it's quite often each domain as well,
2190
+ and if use the LLM out of the box, they're
2191
+
2192
+ 0:51:50.451 --> 0:51:53.937
2193
+ actually quite bad compared to the systems
2194
+ that built.
2195
+
2196
+ 0:51:54.414 --> 0:51:56.736
2197
+ But you can do these different techniques
2198
+ like prompting.
2199
+
2200
+ 0:51:57.437 --> 0:52:03.442
2201
+ This is what people usually do is heart prompting
2202
+ where they give similar translation pairs in
2203
+
2204
+ 0:52:03.442 --> 0:52:08.941
2205
+ the prompt and then ask it to translate and
2206
+ then that kind of improves the performance
2207
+
2208
+ 0:52:08.941 --> 0:52:09.383
2209
+ a lot.
2210
+
2211
+ 0:52:09.383 --> 0:52:15.135
2212
+ So there are different techniques that you
2213
+ can do to adapt your eye lens and then it might
2214
+
2215
+ 0:52:15.135 --> 0:52:16.399
2216
+ be better than the.
2217
+
2218
+ 0:52:16.376 --> 0:52:17.742
2219
+ Task a fixed system.
2220
+
2221
+ 0:52:18.418 --> 0:52:22.857
2222
+ But if you're looking for niche things, I
2223
+ don't think error limbs are that good.
2224
+
2225
+ 0:52:22.857 --> 0:52:26.309
2226
+ But if you want to do to do, let's say, unplugged
2227
+ translation.
2228
+
2229
+ 0:52:26.309 --> 0:52:30.036
2230
+ In this case you can never be sure that they
2231
+ haven't seen the data.
2232
+
2233
+ 0:52:30.036 --> 0:52:35.077
2234
+ First of all is that if you see the data in
2235
+ that language or not, and if they're panthetic,
2236
+
2237
+ 0:52:35.077 --> 0:52:36.831
2238
+ they probably did see the data.
2239
+
2240
+ 0:52:40.360 --> 0:53:00.276
2241
+ I feel like they have pretty good understanding
2242
+ of each million people.
2243
+
2244
+ 0:53:04.784 --> 0:53:09.059
2245
+ Depends on the language, but I'm pretty surprised
2246
+ that it works on a lotus language.
2247
+
2248
+ 0:53:09.059 --> 0:53:11.121
2249
+ I would expect it to work on German and.
2250
+
2251
+ 0:53:11.972 --> 0:53:13.633
2252
+ But if you take a lot of first language,.
2253
+
2254
+ 0:53:14.474 --> 0:53:20.973
2255
+ Don't think it works, and also there are quite
2256
+ a few papers where they've already showed that
2257
+
2258
+ 0:53:20.973 --> 0:53:27.610
2259
+ if you build a system yourself or build a typical
2260
+ way to build a system, it's quite better than
2261
+
2262
+ 0:53:27.610 --> 0:53:29.338
2263
+ the bit better than the.
2264
+
2265
+ 0:53:29.549 --> 0:53:34.883
2266
+ But you can always do things with limbs to
2267
+ get better, but then I'm probably.
2268
+
2269
+ 0:53:37.557 --> 0:53:39.539
2270
+ Anymore.
2271
+
2272
+ 0:53:41.421 --> 0:53:47.461
2273
+ So if not then we're going to end the lecture
2274
+ here and then on Thursday we're going to have
2275
+
2276
+ 0:53:47.461 --> 0:53:51.597
2277
+ documented empty which is also run by me so
2278
+ thanks for coming.
2279
+
demo_data/lectures/Lecture-15-11.07.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:62985057e3dfdb7c34a3ef8e74a9b52e9529b2a974ff62438c617e6d699b5a89
3
+ size 81272567
demo_data/lectures/Lecture-18-18.07.2023/English.vtt ADDED
@@ -0,0 +1,2732 @@
1
+ WEBVTT
2
+
3
+ 0:00:01.541 --> 0:00:06.926
4
+ Okay, so we'll come back to today's lecture.
5
+
6
+ 0:00:08.528 --> 0:00:23.334
7
+ What we want to talk about is speech translation,
8
+ so we'll have two lectures in this week about
9
+
10
+ 0:00:23.334 --> 0:00:26.589
11
+ speech translation.
12
+
13
+ 0:00:27.087 --> 0:00:36.456
14
+ And so in the last week we'll have some exercise
15
+ and repetition.
16
+
17
+ 0:00:36.456 --> 0:00:46.690
18
+ We want to look at what we need to do when
19
+ we want to translate speech.
20
+
21
+ 0:00:46.946 --> 0:00:55.675
22
+ So we want to address the specific challenges
23
+ that occur when we switch from translating
24
+
25
+ 0:00:55.675 --> 0:00:56.754
26
+ to speech.
27
+
28
+ 0:00:57.697 --> 0:01:13.303
29
+ Today we will look at the more general picture
30
+ out and build the systems.
31
+
32
+ 0:01:13.493 --> 0:01:23.645
33
+ And then secondly an end approach where we
34
+ are going to put in audio and generate.
35
+
36
+ 0:01:24.224 --> 0:01:41.439
37
+ Which are the main dominant systems which
38
+ are used in research and commercial systems.
39
+
40
+ 0:01:43.523 --> 0:01:56.879
41
+ More general, what is the general task of
42
+ speech translation that is shown here?
43
+
44
+ 0:01:56.879 --> 0:02:01.826
45
+ The idea is we have a speech.
46
+
47
+ 0:02:02.202 --> 0:02:12.838
48
+ Then we want to have a system which takes
49
+ this audio and then translates it into another
50
+
51
+ 0:02:12.838 --> 0:02:14.033
52
+ language.
53
+
54
+ 0:02:15.095 --> 0:02:20.694
55
+ Then it's no longer as clear the output modality.
56
+
57
+ 0:02:20.694 --> 0:02:33.153
58
+ In contrast, for humans we can typically have:
59
+ So you can either have more textual translation,
60
+
61
+ 0:02:33.153 --> 0:02:37.917
62
+ then you have subtitles, and the.
63
+
64
+ 0:02:38.538 --> 0:02:57.010
65
+ Are you want to have it also in audio like
66
+ it's done for human interpretation?
67
+
68
+ 0:02:57.417 --> 0:03:03.922
69
+ See there is not the one best solution, so
70
+ all of this one is always better.
71
+
72
+ 0:03:03.922 --> 0:03:09.413
73
+ It heavily depends on what is the use of what
74
+ the people prefer.
75
+
76
+ 0:03:09.929 --> 0:03:14.950
77
+ For example, you can think of if you know
78
+ a bit the source of language, but you're a
79
+
80
+ 0:03:14.950 --> 0:03:17.549
81
+ bit unsure and don't understand everything.
82
+
83
+ 0:03:17.549 --> 0:03:23.161
84
+ They may texture it out for this pattern because
85
+ you can direct your gear to what was said and
86
+
87
+ 0:03:23.161 --> 0:03:26.705
88
+ only if you're unsure you check down with your
89
+ translation.
90
+
91
+ 0:03:27.727 --> 0:03:33.511
92
+ Are another things that might be preferable
93
+ to have a complete spoken of.
94
+
95
+ 0:03:34.794 --> 0:03:48.727
96
+ So there are both ones for a long time in
97
+ automatic systems focused mainly on text output.
98
+
99
+ 0:03:48.727 --> 0:04:06.711
100
+ In most cases: But of course you can always
101
+ hand them to text to speech systems which generates
102
+
103
+ 0:04:06.711 --> 0:04:09.960
104
+ audio from that.
105
+
106
+ 0:04:12.772 --> 0:04:14.494
107
+ Why should we care about that?
108
+
109
+ 0:04:14.494 --> 0:04:15.771
110
+ Why should we do that?
111
+
112
+ 0:04:17.737 --> 0:04:24.141
113
+ There is the nice thing that yeah, with a
114
+ globalized world, we are able to now interact
115
+
116
+ 0:04:24.141 --> 0:04:25.888
117
+ with a lot more people.
118
+
119
+ 0:04:25.888 --> 0:04:29.235
120
+ You can do some conferences around the world.
121
+
122
+ 0:04:29.235 --> 0:04:31.564
123
+ We can travel around the world.
124
+
125
+ 0:04:31.671 --> 0:04:37.802
126
+ We can by Internet watch movies from all over
127
+ the world and watch TV from all over the world.
128
+
129
+ 0:04:38.618 --> 0:04:47.812
130
+ However, there is still this barrier that
131
+ is mainly to watch videos, either in English
132
+
133
+ 0:04:47.812 --> 0:04:49.715
134
+ or in a language.
135
+
136
+ 0:04:50.250 --> 0:05:00.622
137
+ So what is currently happening in order to
138
+ reach a large audience is that everybody.
139
+
140
+ 0:05:00.820 --> 0:05:07.300
141
+ So if we are going, for example, to a conferences,
142
+ these are international conferences.
143
+
144
+ 0:05:08.368 --> 0:05:22.412
145
+ However, everybody will then speak English
146
+ since that is some of the common language that
147
+
148
+ 0:05:22.412 --> 0:05:26.001
149
+ everybody understands.
150
+
151
+ 0:05:26.686 --> 0:05:32.929
152
+ So on the other hand, we cannot like have
153
+ human interpreters like they ever work.
154
+
155
+ 0:05:32.892 --> 0:05:37.797
156
+ You have that maybe in the European Parliament
157
+ or in important business meetings.
158
+
159
+ 0:05:38.078 --> 0:05:47.151
160
+ But this is relatively expensive, and so the
161
+ question is, can we enable communication in
162
+
163
+ 0:05:47.151 --> 0:05:53.675
164
+ your mother-in-law without having to have human
165
+ interpretation?
166
+
167
+ 0:05:54.134 --> 0:06:04.321
168
+ And there like speech translation can be helpful
169
+ in order to help you bridge this gap.
170
+
171
+ 0:06:06.726 --> 0:06:22.507
172
+ In this case, there are different scenarios
173
+ of how you can apply speech translation.
174
+
175
+ 0:06:22.422 --> 0:06:29.282
176
+ That's typically more interactive than we
177
+ are talking about text translation.
178
+
179
+ 0:06:29.282 --> 0:06:32.800
180
+ Text translation is most commonly used.
181
+
182
+ 0:06:33.153 --> 0:06:41.637
183
+ Course: Nowadays there's things like chat
184
+ and so on where it could also be interactive.
185
+
186
+ 0:06:42.082 --> 0:06:48.299
187
+ In contrast to speech translation, that is
188
+ less static, so there is different ways of
189
+
190
+ 0:06:48.299 --> 0:06:48.660
191
+ how.
192
+
193
+ 0:06:49.149 --> 0:07:00.544
194
+ The one scenario is what is called a translation
195
+ where you first get an input, then you translate
196
+
197
+ 0:07:00.544 --> 0:07:03.799
198
+ this fixed input, and then.
199
+
200
+ 0:07:04.944 --> 0:07:12.823
201
+ With me, which means you have always like
202
+ fixed, yeah fixed challenges which you need
203
+
204
+ 0:07:12.823 --> 0:07:14.105
205
+ to translate.
206
+
207
+ 0:07:14.274 --> 0:07:25.093
208
+ You don't need to like beat your mind what
209
+ are the boundaries where there's an end.
210
+
211
+ 0:07:25.405 --> 0:07:31.023
212
+ Also, there is no overlapping.
213
+
214
+ 0:07:31.023 --> 0:07:42.983
215
+ There is always a one-person sentence that
216
+ is getting translated.
217
+
218
+ 0:07:43.443 --> 0:07:51.181
219
+ Of course, this has a disadvantage that it
220
+ makes the conversation a lot longer because
221
+
222
+ 0:07:51.181 --> 0:07:55.184
223
+ you always have only speech and translation.
224
+
225
+ 0:07:57.077 --> 0:08:03.780
226
+ For example, if you would use that for a presentation
227
+ there would be yeah quite get quite long, if
228
+
229
+ 0:08:03.780 --> 0:08:09.738
230
+ I would just imagine you sitting here in the
231
+ lecture I would say three sentences that I
232
+
233
+ 0:08:09.738 --> 0:08:15.765
234
+ would wait for this interpreter to translate
235
+ it, then I would say the next two sentences
236
+
237
+ 0:08:15.765 --> 0:08:16.103
238
+ and.
239
+
240
+ 0:08:16.676 --> 0:08:28.170
241
+ That is why in these situations, for example,
242
+ if you have a direct conversation with a patient,
243
+
244
+ 0:08:28.170 --> 0:08:28.888
245
+ then.
246
+
247
+ 0:08:29.209 --> 0:08:32.733
248
+ But still there it's too big to be taking
249
+ them very long.
250
+
251
+ 0:08:33.473 --> 0:08:42.335
252
+ And that's why there's also the research on
253
+ simultaneous translation, where the idea is
254
+
255
+ 0:08:42.335 --> 0:08:43.644
256
+ in parallel.
257
+
258
+ 0:08:43.964 --> 0:08:46.179
259
+ That Is the Dining for Human.
260
+
261
+ 0:08:46.126 --> 0:08:52.429
262
+ Interpretation like if you think of things
263
+ like the European Parliament where they of
264
+
265
+ 0:08:52.429 --> 0:08:59.099
266
+ course not only speak always one sentence but
267
+ are just giving their speech and in parallel
268
+
269
+ 0:08:59.099 --> 0:09:04.157
270
+ human interpreters are translating the speech
271
+ into another language.
272
+
273
+ 0:09:04.985 --> 0:09:12.733
274
+ The same thing is interesting for automatic
275
+ speech translation where we in parallel generate
276
+
277
+ 0:09:12.733 --> 0:09:13.817
278
+ translation.
279
+
280
+ 0:09:15.415 --> 0:09:32.271
281
+ The challenges then, of course, are that we
282
+ need to segment our speech into somehow's chunks.
283
+
284
+ 0:09:32.152 --> 0:09:34.903
285
+ We just looked for the dots we saw.
286
+
287
+ 0:09:34.903 --> 0:09:38.648
288
+ There are some challenges that we have to
289
+ check.
290
+
291
+ 0:09:38.648 --> 0:09:41.017
292
+ The Doctor may not understand.
293
+
294
+ 0:09:41.201 --> 0:09:47.478
295
+ But in generally getting sentence boundary
296
+ sentences is not a really research question.
297
+
298
+ 0:09:47.647 --> 0:09:51.668
299
+ While in speech translation, this is not that
300
+ easy.
301
+
302
+ 0:09:51.952 --> 0:10:05.908
303
+ Either getting that in the audio is difficult
304
+ because it's not like we typically do breaks
305
+
306
+ 0:10:05.908 --> 0:10:09.742
307
+ when there's a sentence.
308
+
309
+ 0:10:10.150 --> 0:10:17.432
310
+ And even if you then see the transcript and
311
+ would have to add the punctuation, this is
312
+
313
+ 0:10:17.432 --> 0:10:18.101
314
+ not as.
315
+
316
+ 0:10:20.340 --> 0:10:25.942
317
+ Another question is how many speakers we have
318
+ here.
319
+
320
+ 0:10:25.942 --> 0:10:31.759
321
+ In presentations you have more like a single
322
+ speaker.
323
+
324
+ 0:10:31.931 --> 0:10:40.186
325
+ That is normally easier from the part of audio
326
+ processing, so in general in speech translation.
327
+
328
+ 0:10:40.460 --> 0:10:49.308
329
+ You can have different challenges and they
330
+ can be of different components.
331
+
332
+ 0:10:49.308 --> 0:10:57.132
333
+ In addition to translation, you have: And
334
+ if you're not going, for example, the magical
335
+
336
+ 0:10:57.132 --> 0:11:00.378
337
+ speaker, there are significantly additional
338
+ challenges.
339
+
340
+ 0:11:00.720 --> 0:11:10.313
341
+ So we as humans we are very good in filtering
342
+ out noises, or if two people speak in parallel
343
+
344
+ 0:11:10.313 --> 0:11:15.058
345
+ to like separate these two speakers and hear.
346
+
347
+ 0:11:15.495 --> 0:11:28.300
348
+ However, if you want to do that with automatic
349
+ systems that is very challenging so that you
350
+
351
+ 0:11:28.300 --> 0:11:33.172
352
+ can separate the speakers so that.
353
+
354
+ 0:11:33.453 --> 0:11:41.284
355
+ For the more of you have this multi-speaker
356
+ scenario, typically it's also less well prepared.
357
+
358
+ 0:11:41.721 --> 0:11:45.807
359
+ So you're getting very, we'll talk about the
360
+ spontaneous effects.
361
+
362
+ 0:11:46.186 --> 0:11:53.541
363
+ So people like will stop in the middle of
364
+ the sentence, they change their sentence, and
365
+
366
+ 0:11:53.541 --> 0:12:01.481
367
+ so on, and like filtering these disfluencies
368
+ out of the text and working with them is often
369
+
370
+ 0:12:01.481 --> 0:12:02.986
371
+ very challenging.
372
+
373
+ 0:12:05.565 --> 0:12:09.144
374
+ So these are all additional challenges when
375
+ you have multiples.
376
+
377
+ 0:12:10.330 --> 0:12:19.995
378
+ Then there's a question of an online or offline
379
+ system. Text translation we have treated
380
+
381
+ 0:12:19.995 --> 0:12:21.836
382
+ mainly as offline.
383
+
384
+ 0:12:21.962 --> 0:12:36.507
385
+ That means you can take the whole text and
386
+ you can translate it in a batch.
387
+
388
+ 0:12:37.337 --> 0:12:44.344
389
+ However, for speech translation there's also
390
+ several scenarios where this is the case.
391
+
392
+ 0:12:44.344 --> 0:12:51.513
393
+ For example, when you're translating a movie,
394
+ it's not only that you don't have to do it
395
+
396
+ 0:12:51.513 --> 0:12:54.735
397
+ live, but you can take the whole movie.
398
+
399
+ 0:12:55.215 --> 0:13:05.473
400
+ However, there is also a lot of situations
401
+ where you don't have this opportunity like
402
+
403
+ 0:13:05.473 --> 0:13:06.785
404
+ or sports.
405
+
406
+ 0:13:07.247 --> 0:13:13.963
407
+ And you don't want to, like, first record
408
+ a sports event and then show
409
+
410
+ 0:13:13.963 --> 0:13:19.117
411
+ the game three hours later then there is not
412
+ really any interest.
413
+
414
+ 0:13:19.399 --> 0:13:31.118
415
+ So you have to do it live, and so we have
416
+ the additional challenge of translating the
417
+
418
+ 0:13:31.118 --> 0:13:32.208
419
+ system.
420
+
421
+ 0:13:32.412 --> 0:13:42.108
422
+ There are still things on the one end of course.
423
+
424
+ 0:13:42.108 --> 0:13:49.627
425
+ It needs to be real time translation.
426
+
427
+ 0:13:49.869 --> 0:13:54.153
428
+ If it's taking longer, then you're getting more
429
+ and more delayed.
430
+
431
+ 0:13:55.495 --> 0:14:05.245
432
+ So it maybe seems simple, but there have been
433
+ research systems which run several times slower
434
+
435
+ 0:14:05.245 --> 0:14:07.628
436
+ than real time or so.
437
+
438
+ 0:14:07.628 --> 0:14:15.103
439
+ If you want to show what is possible with
440
+ the best current systems,.
441
+
442
+ 0:14:16.596 --> 0:14:18.477
443
+ But even that is not enough.
444
+
445
+ 0:14:18.918 --> 0:14:29.593
446
+ The other question: You can have a system
447
+ which is even like several times real time.
448
+
449
+ 0:14:29.509 --> 0:14:33.382
450
+ In less than one second, it might still be
451
+ not useful.
452
+
453
+ 0:14:33.382 --> 0:14:39.648
454
+ Then the question is like the latency, so
455
+ how much time has passed since you can produce
456
+
457
+ 0:14:39.648 --> 0:14:39.930
458
+ an.
459
+
460
+ 0:14:40.120 --> 0:14:45.814
461
+ It might be that on average you can process
462
+ it, but you still can't do it directly.
463
+
464
+ 0:14:45.814 --> 0:14:51.571
465
+ You need to do it after, or you need to have
466
+ the full context of thirty seconds before you
467
+
468
+ 0:14:51.571 --> 0:14:55.178
469
+ can output something, and then you have a large
470
+ latency.
471
+
472
+ 0:14:55.335 --> 0:15:05.871
473
+ So it can be that you process it as fast as it is produced,
474
+ but have to wait until the full context is available.
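To make this difference concrete, here is a small sketch separating the real-time factor (how fast the system runs on average) from the latency (how long the listener waits); the numbers and function names are illustrative assumptions, not values from the lecture.

```python
# Toy illustration of real-time factor vs. latency for a streaming
# speech translation system; all numbers below are made-up examples.

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Below 1.0 means the system keeps up with the audio on average."""
    return processing_seconds / audio_seconds

def average_latency(output_times: list[float], input_times: list[float]) -> float:
    """Average delay between when something was said and when its
    translation was emitted (both measured from the start of the stream)."""
    delays = [out - inp for out, inp in zip(output_times, input_times)]
    return sum(delays) / len(delays)

# A system can be much faster than real time ...
print(real_time_factor(processing_seconds=60.0, audio_seconds=300.0))   # 0.2
# ... and still feel slow if it waits for 30 seconds of context.
print(average_latency([31.0, 32.5, 34.0], [1.0, 2.5, 4.0]))             # 30.0
```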
475
+
476
+ 0:15:06.426 --> 0:15:13.772
477
+ So we'll look into that on Thursday how we
478
+ can then generate translations that are having
479
+
480
+ 0:15:13.772 --> 0:15:14.996
481
+ a low latency.
482
+
483
+ 0:15:15.155 --> 0:15:21.587
484
+ You can imagine, for example, in German that
485
+ it's maybe quite challenging since the verb
486
+
487
+ 0:15:21.587 --> 0:15:23.466
488
+ is often like at the end.
489
+
490
+ 0:15:23.466 --> 0:15:30.115
491
+ If you're using perfect tense, like "ich habe" and
492
+ so on, and then in English you have to directly
493
+
494
+ 0:15:30.115 --> 0:15:30.983
495
+ produce it.
496
+
497
+ 0:15:31.311 --> 0:15:38.757
498
+ So if you really want to have no context you
499
+ might need to wait until the end of the sentence.
500
+
501
+ 0:15:41.021 --> 0:15:45.920
502
+ Besides that, of course, offline and it gives
503
+ you more additional help.
504
+
505
+ 0:15:45.920 --> 0:15:52.044
506
+ I think last week you talked about context
507
+ based systems that typically have context from
508
+
509
+ 0:15:52.044 --> 0:15:55.583
510
+ maybe from the past but maybe also from the
511
+ future.
512
+
513
+ 0:15:55.595 --> 0:16:02.923
514
+ Then, of course, you cannot use anything from
515
+ the future in this case, but you can use it.
516
+
517
+ 0:16:07.407 --> 0:16:24.813
518
+ Finally, there is a thing about how you want
519
+ to present it to the audience in automatic
520
+
521
+ 0:16:24.813 --> 0:16:27.384
522
+ translation.
523
+
524
+ 0:16:27.507 --> 0:16:31.361
525
+ There is also the thing that you want to do.
526
+
527
+ 0:16:31.361 --> 0:16:35.300
528
+ All your outfits are running like the system.
529
+
530
+ 0:16:35.996 --> 0:16:36.990
531
+ Top of it.
532
+
533
+ 0:16:36.990 --> 0:16:44.314
534
+ Then they answered questions: How should it
535
+ be spoken so you can do things like.
536
+
537
+ 0:16:46.586 --> 0:16:52.507
538
+ Voice cloning so that it's like even the same
539
+ voice than the original speaker.
540
+
541
+ 0:16:53.994 --> 0:16:59.081
542
+ And if you do text or dubbing then there might
543
+ be additional constraints.
544
+
545
+ 0:16:59.081 --> 0:17:05.729
546
+ So if you think about subtitles: And they
547
+ should be readable, and we tend to speak
548
+
549
+ 0:17:05.729 --> 0:17:07.957
550
+ faster than you can maybe read.
551
+
552
+ 0:17:08.908 --> 0:17:14.239
553
+ So you might need to shorten your text.
554
+
555
+ 0:17:14.239 --> 0:17:20.235
556
+ People say that a subtitle can be two lines.
557
+
558
+ 0:17:20.235 --> 0:17:26.099
559
+ Each line can be this number of characters.
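As a small illustration of such constraints, the sketch below checks whether a candidate subtitle fits a two-line limit; the 42-characters-per-line value is a commonly used convention and only an assumption here, since the lecture does not give a concrete number.

```python
# Toy subtitle constraint check; MAX_CHARS_PER_LINE = 42 is an assumed
# typical value, not a number taken from the lecture.
MAX_LINES = 2
MAX_CHARS_PER_LINE = 42

def fits_subtitle(text: str) -> bool:
    lines = text.split("\n")
    return (len(lines) <= MAX_LINES and
            all(len(line) <= MAX_CHARS_PER_LINE for line in lines))

print(fits_subtitle("Das ist ein kurzer Untertitel."))   # True
print(fits_subtitle("x" * 100))                          # False: needs shortening
```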
560
+
561
+ 0:17:26.346 --> 0:17:31.753
562
+ So you cannot like if you have too long text,
563
+ we might need to shorten that to do that.
564
+
565
+ 0:17:32.052 --> 0:17:48.272
566
+ Similarly, if you think about dubbing, if
567
+ you want to produce dubbing voice, then the
568
+
569
+ 0:17:48.272 --> 0:17:50.158
570
+ original.
571
+
572
+ 0:17:51.691 --> 0:17:59.294
573
+ Here is another problem that we have different
574
+ settings like a more formal setting and let's
575
+
576
+ 0:17:59.294 --> 0:18:00.602
577
+ have different.
578
+
579
+ 0:18:00.860 --> 0:18:09.775
580
+ If you think about the United Nations maybe
581
+ you want more formal language and between friends
582
+
583
+ 0:18:09.775 --> 0:18:14.911
584
+ maybe less formal, and there are languages which
585
+ use.
586
+
587
+ 0:18:15.355 --> 0:18:21.867
588
+ That is sure that is an important research
589
+ question.
590
+
591
+ 0:18:21.867 --> 0:18:28.010
592
+ To do that would more think of it more generally.
593
+
594
+ 0:18:28.308 --> 0:18:32.902
595
+ That's important in text translation.
596
+
597
+ 0:18:32.902 --> 0:18:41.001
598
+ If you translate a letter to your boss, it
599
+ should sound different.
600
+
601
+ 0:18:42.202 --> 0:18:53.718
602
+ So there is a question of how you can do this
603
+ style work on how you can do that.
604
+
605
+ 0:18:53.718 --> 0:19:00.542
606
+ For example, if you can specify that you might.
607
+
608
+ 0:19:00.460 --> 0:19:10.954
609
+ So you can tax the center or generate an informal
610
+ style because, as you correctly said, this
611
+
612
+ 0:19:10.954 --> 0:19:16.709
613
+ is especially challenging again in the situations.
614
+
615
+ 0:19:16.856 --> 0:19:20.111
616
+ Of course, there are ways of like being formal
617
+ or less formal.
618
+
619
+ 0:19:20.500 --> 0:19:24.846
620
+ But it's not like as clear as you do it, for
621
+ example, in German where you have du and
622
+
623
+ 0:19:24.846 --> 0:19:24.994
624
+ Sie.
625
+
626
+ 0:19:25.165 --> 0:19:26.855
627
+ So there is no one-to-one mapping.
628
+
629
+ 0:19:27.287 --> 0:19:34.269
630
+ If you want to make that sure you can build
631
+ a system which generates different styles in
632
+
633
+ 0:19:34.269 --> 0:19:38.662
634
+ the output, so yeah that's definitely also
635
+ a challenge.
636
+
637
+ 0:19:38.662 --> 0:19:43.762
638
+ It just may be not mentioned here because
639
+ it's not specific now.
640
+
641
+ 0:19:44.524 --> 0:19:54.029
642
+ Generally, of course, these are all challenges
643
+ in how to customize and adapt systems to use
644
+
645
+ 0:19:54.029 --> 0:19:56.199
646
+ cases with specific.
647
+
648
+ 0:20:00.360 --> 0:20:11.020
649
+ Speech translation has been done for quite
650
+ a while and it's maybe not surprising it started
651
+
652
+ 0:20:11.020 --> 0:20:13.569
653
+ with more simple use.
654
+
655
+ 0:20:13.793 --> 0:20:24.557
656
+ So people first started to look into, for
657
+ example, limited to main translations.
658
+
659
+ 0:20:24.557 --> 0:20:33.726
660
+ Tourism was a typical application: if you're
661
+ going to a new city.
662
+
663
+ 0:20:34.834 --> 0:20:44.028
664
+ Then there are several open things of doing
665
+ open domain translation, especially people.
666
+
667
+ 0:20:44.204 --> 0:20:51.957
668
+ Like where there's a lot of data so you could
669
+ build systems which are more open to main,
670
+
671
+ 0:20:51.957 --> 0:20:55.790
672
+ but of course it's still a bit restrictive.
673
+
674
+ 0:20:55.790 --> 0:20:59.101
675
+ It's true in the European Parliament.
676
+
677
+ 0:20:59.101 --> 0:21:01.888
678
+ People talk about anything but.
679
+
680
+ 0:21:02.162 --> 0:21:04.820
681
+ And so it's not completely used for everything.
682
+
683
+ 0:21:05.165 --> 0:21:11.545
684
+ Nowadays we've seen this technology in a lot
685
+ of different situations; I guess you all
686
+
687
+ 0:21:11.731 --> 0:21:17.899
688
+ use it, so there are some basic technologies
689
+ where you can use them already.
690
+
691
+ 0:21:18.218 --> 0:21:33.599
692
+ There is still a lot of open questions going
693
+ from if you are going to really spontaneous
694
+
695
+ 0:21:33.599 --> 0:21:35.327
696
+ meetings.
697
+
698
+ 0:21:35.655 --> 0:21:41.437
699
+ Then these systems typically work good for
700
+ like some languages where we have a lot of
701
+
702
+ 0:21:41.437 --> 0:21:42.109
703
+ friendly.
704
+
705
+ 0:21:42.742 --> 0:21:48.475
706
+ But if we want to go for really low resource
707
+ data then things are often challenging.
708
+
709
+ 0:21:48.448 --> 0:22:02.294
710
+ Last week we had a workshop on spoken language
711
+ translation and there is a low-resource data
712
+
713
+ 0:22:02.294 --> 0:22:05.756
714
+ track which includes dialects
715
+
716
+ 0:22:05.986 --> 0:22:06.925
717
+ And so on.
718
+
719
+ 0:22:06.925 --> 0:22:14.699
720
+ All these languages can still then have significantly
721
+ lower performance than for a higher.
722
+
723
+ 0:22:17.057 --> 0:22:20.126
724
+ So how does this work?
725
+
726
+ 0:22:20.126 --> 0:22:31.614
727
+ If we want to do speech translation, there's
728
+ like three basic technology: So on the one
729
+
730
+ 0:22:31.614 --> 0:22:40.908
731
+ hand, it's automatic speech recognition where
732
+ automatic speech recognition normally transcribes
733
+
734
+ 0:22:40.908 --> 0:22:41.600
735
+ audio.
736
+
737
+ 0:22:42.822 --> 0:22:58.289
738
+ Then what we talked about here is machine
739
+ translation, which takes input and translates
740
+
741
+ 0:22:58.289 --> 0:23:01.276
742
+ into the target.
743
+
744
+ 0:23:02.642 --> 0:23:11.244
745
+ And the very simple model now, if you think
746
+ about it, is of course the similar combination.
747
+
748
+ 0:23:11.451 --> 0:23:14.740
749
+ We have solved all these parts separately.
750
+
751
+ 0:23:14.975 --> 0:23:31.470
752
+ We are working on all these problems there,
753
+ so if we want to do a speech transition, maybe.
754
+
755
+ 0:23:31.331 --> 0:23:35.058
756
+ Such problems we just put all these combinations
757
+ together.
758
+
759
+ 0:23:35.335 --> 0:23:45.130
760
+ And then you get what you have as a cascading
761
+ system, which first is so you take your audio.
762
+
763
+ 0:23:45.045 --> 0:23:59.288
764
+ To take this as input and generate the output,
765
+ and then you take this text output, put it
766
+
767
+ 0:23:59.288 --> 0:24:00.238
768
+ into.
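A minimal sketch of such a cascade is shown below, assuming the Hugging Face transformers pipelines with a Whisper ASR model and an OPUS-MT English-to-German model; these specific models are illustrative choices, not the systems discussed in the lecture.

```python
# Minimal cascaded speech translation sketch (assumed models: Whisper for
# ASR, OPUS-MT for English->German MT, via Hugging Face transformers).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
mt = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def cascaded_speech_translation(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]                   # step 1: audio -> source text
    return mt(transcript)[0]["translation_text"]           # step 2: source text -> target text

print(cascaded_speech_translation("lecture_snippet.wav"))  # hypothetical input file
```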
769
+
770
+ 0:24:00.640 --> 0:24:05.782
771
+ So in that way you have now.
772
+
773
+ 0:24:08.008 --> 0:24:18.483
774
+ Have now a solution for generating doing speech
775
+ translation for these types of systems, and
776
+
777
+ 0:24:18.483 --> 0:24:20.874
778
+ this type is called.
779
+
780
+ 0:24:21.681 --> 0:24:28.303
781
+ It is still often reaching state of the art,
782
+ however it has benefits and disadvantages.
783
+
784
+ 0:24:28.668 --> 0:24:41.709
785
+ So the one big benefit is we have independent
786
+ components and some of that is nice.
787
+
788
+ 0:24:41.709 --> 0:24:48.465
789
+ So if there are great ideas put into your.
790
+
791
+ 0:24:48.788 --> 0:24:57.172
792
+ And then some other times people develop a
793
+ new good way of how to improve.
794
+
795
+ 0:24:57.172 --> 0:25:00.972
796
+ You can also take this model and.
797
+
798
+ 0:25:01.381 --> 0:25:07.639
799
+ So you can leverage improvements from all
800
+ the different communities in order to adapt.
801
+
802
+ 0:25:08.288 --> 0:25:18.391
803
+ Furthermore, we would like to see, since all
804
+ of them is learning, that the biggest advantage
805
+
806
+ 0:25:18.391 --> 0:25:23.932
807
+ is that we have training data for each individual.
808
+
809
+ 0:25:24.164 --> 0:25:34.045
810
+ So there's a lot less training data where
811
+ you have the English audio, so it's easy to
812
+
813
+ 0:25:34.045 --> 0:25:34.849
814
+ train.
815
+
816
+ 0:25:36.636 --> 0:25:48.595
817
+ Now am a one that we will focus on when talking
818
+ about the cascaded approach is that often it.
819
+
820
+ 0:25:48.928 --> 0:25:58.049
821
+ So you need to adapt each component a bit
822
+ so that it's adapting to its input and.
823
+
824
+ 0:25:58.278 --> 0:26:07.840
825
+ So we'll focus there especially on how to
826
+ combine and, as said, the main focus is: So
827
+
828
+ 0:26:07.840 --> 0:26:18.589
829
+ if you would directly use an output that might
830
+ not work as perfect as you would,.
831
+
832
+ 0:26:18.918 --> 0:26:33.467
833
+ So a major challenge when building a cascade
834
+ of speech translation systems is how can we
835
+
836
+ 0:26:33.467 --> 0:26:38.862
837
+ adapt these systems and how can?
838
+
839
+ 0:26:41.681 --> 0:26:43.918
840
+ So why, why is this the kick?
841
+
842
+ 0:26:44.164 --> 0:26:49.183
843
+ So it would look quite nice.
844
+
845
+ 0:26:49.183 --> 0:26:54.722
846
+ It seems to be very reasonable.
847
+
848
+ 0:26:54.722 --> 0:26:58.356
849
+ You have some audio.
850
+
851
+ 0:26:58.356 --> 0:27:03.376
852
+ You put it into your system.
853
+
854
+ 0:27:04.965 --> 0:27:23.759
855
+ However, this is a bit wishful thinking
856
+ because if you speak what you speak is more.
857
+
858
+ 0:27:23.984 --> 0:27:29.513
859
+ And especially ASR output rarely has punctuation
860
+ in there, while the MT system.
861
+
862
+ 0:27:29.629 --> 0:27:43.247
863
+ They assume, of course, that it's a full sentence,
864
+ that you don't have there some.
865
+
866
+ 0:27:43.523 --> 0:27:55.087
867
+ So we see we want to get this bridge between
868
+ the output and the input, and we might need
869
+
870
+ 0:27:55.087 --> 0:27:56.646
871
+ additional.
872
+
873
+ 0:27:58.778 --> 0:28:05.287
874
+ And that is typically what is referred to
875
+ as a re-casing and re-punctuation system.
876
+
877
+ 0:28:05.445 --> 0:28:15.045
878
+ So the idea is that you might be good to have
879
+ something like an adapter here in between,
880
+
881
+ 0:28:15.045 --> 0:28:20.007
882
+ which really tries to adapt the speech input.
883
+
884
+ 0:28:20.260 --> 0:28:28.809
885
+ That can be at different levels, but it might
886
+ be even more rephrasing.
887
+
888
+ 0:28:29.569 --> 0:28:40.620
889
+ If you think of the sentence, if you have
890
+ false starts, then when speaking you sometimes
891
+
892
+ 0:28:40.620 --> 0:28:41.986
893
+ assume oh.
894
+
895
+ 0:28:41.901 --> 0:28:52.224
896
+ You restart it, then you might want to delete
897
+ that because if you read it you don't want
898
+
899
+ 0:28:52.224 --> 0:28:52.688
900
+ to.
901
+
902
+ 0:28:56.096 --> 0:28:57.911
903
+ Why is this yeah?
904
+
905
+ 0:28:57.911 --> 0:29:01.442
906
+ The case in punctuation important.
907
+
908
+ 0:29:02.622 --> 0:29:17.875
909
+ One important thing is directly for the challenge
910
+ is that when you speak it is just a continuous stream of
911
+
912
+ 0:29:17.875 --> 0:29:18.999
913
+ words.
914
+
915
+ 0:29:19.079 --> 0:29:27.422
916
+ When just speaking, punctuation marks
917
+ and so on are not there naturally.
918
+
919
+ 0:29:27.507 --> 0:29:30.281
920
+ However, they are of course important.
921
+
922
+ 0:29:30.410 --> 0:29:33.877
923
+ They are first of all very important for readability.
924
+
925
+ 0:29:34.174 --> 0:29:41.296
926
+ If you have once read a text without punctuation
927
+ marks, you need more time to process it.
928
+
929
+ 0:29:41.861 --> 0:29:47.375
930
+ They're sometimes even semantically important.
931
+
932
+ 0:29:47.375 --> 0:29:52.890
933
+ "Let's eat, grandpa" and "Let's eat grandpa" are a big difference.
934
+
935
+ 0:29:53.553 --> 0:30:00.089
936
+ And so this, of course, with humans as well,
937
+ it'd be easy to distinguish by again doing
938
+
939
+ 0:30:00.089 --> 0:30:01.426
940
+ it automatically.
941
+
942
+ 0:30:01.426 --> 0:30:06.180
943
+ It's more typically and finally, in our case,
944
+ if we want to do.
945
+
946
+ 0:30:06.386 --> 0:30:13.672
947
+ We are assuming normally sentence wise, so
948
+ we always enter out system which is like one
949
+
950
+ 0:30:13.672 --> 0:30:16.238
951
+ sentence by the next sentence.
952
+
953
+ 0:30:16.736 --> 0:30:26.058
954
+ If you want to do speech translation of a
955
+ continuous stream, then of course what are
956
+
957
+ 0:30:26.058 --> 0:30:26.716
958
+ your.
959
+
960
+ 0:30:28.168 --> 0:30:39.095
961
+ And the easiest and most straightforward situation
962
+ is, of course, if you have a continuously.
963
+
964
+ 0:30:39.239 --> 0:30:51.686
965
+ And if it generates your punctuation marks,
966
+ it's easy to separate your text into sentences.
967
+
968
+ 0:30:52.032 --> 0:31:09.157
969
+ So we can again reuse our system and thereby
970
+ have a normal MT system on this continuous.
971
+
972
+ 0:31:14.174 --> 0:31:21.708
973
+ These are a bit older numbers, but they show
974
+ you a bit also how important all that is.
975
+
976
+ 0:31:21.861 --> 0:31:31.719
977
+ So this was: the best is if you use the human
978
+ transcript, you get roughly a BLEU score of.
979
+
980
+ 0:31:32.112 --> 0:31:47.678
981
+ If you have, as it is, some automatic length-based
982
+ segmentation, then you get something like.
983
+
984
+ 0:31:47.907 --> 0:31:57.707
985
+ If you then use the segments correctly as
986
+ it's done from the reference, you gain one BLEU
987
+
988
+ 0:31:57.707 --> 0:32:01.010
989
+ point and another BLEU point.
990
+
991
+ 0:32:01.201 --> 0:32:08.085
992
+ So you see that in total you gain nearly
993
+ two BLEU points just by having the correct
994
+
995
+ 0:32:08.085 --> 0:32:09.144
996
+ segmentation.
997
+
998
+ 0:32:10.050 --> 0:32:21.178
999
+ This shows you that it's important to estimate
1000
+ as good a segmentation because even if you
1001
+
1002
+ 0:32:21.178 --> 0:32:25.629
1003
+ still have the same errors in your transcript.
1004
+
1005
+ 0:32:27.147 --> 0:32:35.718
1006
+ Is to be into this movement, which is also
1007
+ not as unusual as we do in translation.
1008
+
1009
+ 0:32:36.736 --> 0:32:40.495
1010
+ So this is done by looking at the reference.
1011
+
1012
+ 0:32:40.495 --> 0:32:48.097
1013
+ It should show you how much these scores are
1014
+ done to just analyze how important are these.
1015
+
1016
+ 0:32:48.097 --> 0:32:55.699
1017
+ So you take the ASR transcript and you look
1018
+ at the reference and it's only done for the.
1019
+
1020
+ 0:32:55.635 --> 0:33:01.720
1021
+ If we have optimal punctuations, if our model
1022
+ is as good and optimal, so as a reference we
1023
+
1024
+ 0:33:01.720 --> 0:33:15.602
1025
+ could: But of course this is not how we can
1026
+ do it in reality because we don't have access
1027
+
1028
+ 0:33:15.602 --> 0:33:16.990
1029
+ to that.
1030
+
1031
+ 0:33:17.657 --> 0:33:24.044
1032
+ Because one would invade you okay, why should
1033
+ we do that?
1034
+
1035
+ 0:33:24.044 --> 0:33:28.778
1036
+ If we have the optimal then it's possible.
1037
+
1038
+ 0:33:31.011 --> 0:33:40.060
1039
+ And yeah, that is why a typical system does
1040
+ not only yeah depend on if our key component.
1041
+
1042
+ 0:33:40.280 --> 0:33:56.468
1043
+ But in between you have this segmentation
1044
+ in there in order to have more input and.
1045
+
1046
+ 0:33:56.496 --> 0:34:01.595
1047
+ You can also prefer often this invariability
1048
+ over the average study.
1049
+
1050
+ 0:34:04.164 --> 0:34:19.708
1051
+ So the task of segmentation is to re-segment
1052
+ the text into what is called sentence like
1053
+
1054
+ 0:34:19.708 --> 0:34:24.300
1055
+ unit, so you also assign.
1056
+
1057
+ 0:34:24.444 --> 0:34:39.421
1058
+ That is more a traditional thing because for
1059
+ a long time case information was not provided.
1060
+
1061
+ 0:34:39.879 --> 0:34:50.355
1062
+ So there was any good ASR system which directly
1063
+ provides you with case information and this
1064
+
1065
+ 0:34:50.355 --> 0:34:52.746
1066
+ may not be any more.
1067
+
1068
+ 0:34:56.296 --> 0:35:12.060
1069
+ How that can be done is you can have three
1070
+ different approaches because that was some
1071
+
1072
+ 0:35:12.060 --> 0:35:16.459
1073
+ of the most common one.
1074
+
1075
+ 0:35:17.097 --> 0:35:23.579
1076
+ Course: That is not the only thing you can
1077
+ do.
1078
+
1079
+ 0:35:23.579 --> 0:35:30.888
1080
+ You can also try to train the data to generate
1081
+ that.
1082
+
1083
+ 0:35:31.891 --> 0:35:41.324
1084
+ On the other hand, that is of course more
1085
+ challenging.
1086
+
1087
+ 0:35:41.324 --> 0:35:47.498
1088
+ You need some type of segmentation.
1089
+
1090
+ 0:35:48.028 --> 0:35:59.382
1091
+ Mean, of course, you can easily remove and
1092
+ capture information from your data and then
1093
+
1094
+ 0:35:59.382 --> 0:36:05.515
1095
+ play a system which does non-case to non-case.
1096
+
1097
+ 0:36:05.945 --> 0:36:15.751
1098
+ You can also, of course, try to combine these
1099
+ two into one so that you directly translate
1100
+
1101
+ 0:36:15.751 --> 0:36:17.386
1102
+ from non-case.
1103
+
1104
+ 0:36:17.817 --> 0:36:24.722
1105
+ What is more happening by now is that you
1106
+ also try to provide these to that you provide.
1107
+
1108
+ 0:36:24.704 --> 0:36:35.267
1109
+ The ASR or the segmentation directly puts this
1110
+ information in there.
1111
+
1112
+ 0:36:35.267 --> 0:36:45.462
1113
+ The systems that combine the ASR and MT are:
1114
+ Yes, there is a valid rule.
1115
+
1116
+ 0:36:45.462 --> 0:36:51.187
1117
+ What we come later to today is that you do
1118
+ audio to text in the target language.
1119
+
1120
+ 0:36:51.187 --> 0:36:54.932
1121
+ That is what is referred to as an end to end
1122
+ system.
1123
+
1124
+ 0:36:54.932 --> 0:36:59.738
1125
+ So it's directly and this is still more often
1126
+ done for text output.
1127
+
1128
+ 0:36:59.738 --> 0:37:03.414
1129
+ But there is also end to end system which
1130
+ directly.
1131
+
1132
+ 0:37:03.683 --> 0:37:09.109
1133
+ There you have additional challenges by how
1134
+ to even measure if things are correct or not.
1135
+
1136
+ 0:37:09.089 --> 0:37:10.522
1137
+ Mean for text.
1138
+
1139
+ 0:37:10.522 --> 0:37:18.073
1140
+ You can mention, in other words, that for
1141
+ audio the audio signal is even more.
1142
+
1143
+ 0:37:18.318 --> 0:37:27.156
1144
+ That's why it's currently mostly speech to
1145
+ text, but that is one single system, but of
1146
+
1147
+ 0:37:27.156 --> 0:37:27.969
1148
+ course.
1149
+
1150
+ 0:37:32.492 --> 0:37:35.605
1151
+ Yeah, how can you do that?
1152
+
1153
+ 0:37:35.605 --> 0:37:45.075
1154
+ You can do adding these calculation information:
1155
+ Will look into three systems.
1156
+
1157
+ 0:37:45.075 --> 0:37:53.131
1158
+ You can do that as a sequence labeling problem
1159
+ or as a monolingual.
1160
+
1161
+ 0:37:54.534 --> 0:37:57.145
1162
+ Let's have a little bit of a series.
1163
+
1164
+ 0:37:57.145 --> 0:37:59.545
1165
+ This was some of the first ideas.
1166
+
1167
+ 0:37:59.545 --> 0:38:04.626
1168
+ There's the idea where you try to do it mainly
1169
+ based on language model.
1170
+
1171
+ 0:38:04.626 --> 0:38:11.471
1172
+ So how probable is that there is a punctuation
1173
+ that was done with like old style n-gram language
1174
+
1175
+ 0:38:11.471 --> 0:38:12.883
1176
+ models to visually.
1177
+
1178
+ 0:38:13.073 --> 0:38:24.687
1179
+ So you can, for example, if you have an n-gram
1180
+ language model to calculate the score of Hello,
1181
+
1182
+ 0:38:24.687 --> 0:38:25.787
1183
+ how are?
1184
+
1185
+ 0:38:25.725 --> 0:38:33.615
1186
+ And then you compare this probability and
1187
+ take the one which has the highest probability.
1188
+
1189
+ 0:38:33.615 --> 0:38:39.927
1190
+ You might have something like if you have
1191
+ very long pauses, you anyway.
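A toy version of this idea is sketched below: score each candidate punctuation with a language model and keep the best one, optionally forcing a break after a long pause. The scores are invented for illustration; a real system would use an actual n-gram model.

```python
# Toy language-model-based punctuation decision; the log-probabilities
# are invented, not produced by a real n-gram model.
import math

candidates = {
    "hello . how are you": math.log(0.040),
    "hello , how are you": math.log(0.010),
    "hello how are you":   math.log(0.002),
}

def choose_punctuation(scored: dict, pause_seconds: float) -> str:
    if pause_seconds > 1.0:                       # heuristic: long pause => sentence end
        periods = {c: s for c, s in scored.items() if " . " in c}
        if periods:
            return max(periods, key=periods.get)
    return max(scored, key=scored.get)            # otherwise highest LM score wins

print(choose_punctuation(candidates, pause_seconds=0.2))
print(choose_punctuation(candidates, pause_seconds=2.0))
```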
1192
+
1193
+ 0:38:40.340 --> 0:38:51.953
1194
+ So this is a very easy model, which only calculates
1195
+ some language model probabilities, and however
1196
+
1197
+ 0:38:51.953 --> 0:39:00.023
1198
+ the advantages of course are: And then, of
1199
+ course, in general, so what we will look into
1200
+
1201
+ 0:39:00.023 --> 0:39:06.249
1202
+ here is that maybe interesting is that most
1203
+ of the systems, also the advance, are really
1204
+
1205
+ 0:39:06.249 --> 0:39:08.698
1206
+ mainly focused purely on the text.
1207
+
1208
+ 0:39:09.289 --> 0:39:19.237
1209
+ If you think about how to insert punctuation
1210
+ marks, maybe your first idea would have been
1211
+
1212
+ 0:39:19.237 --> 0:39:22.553
1213
+ we can use pause information.
1214
+
1215
+ 0:39:23.964 --> 0:39:30.065
1216
+ But however interestingly most systems that
1217
+ use are really focusing on the text.
1218
+
1219
+ 0:39:31.151 --> 0:39:34.493
1220
+ There are several reasons.
1221
+
1222
+ 0:39:34.493 --> 0:39:44.147
1223
+ One is that it's easier to get training data
1224
+ so you only need pure text data.
1225
+
1226
+ 0:39:46.806 --> 0:40:03.221
1227
+ The next way you can do it is you can make
1228
+ it as a sequence labeling task or something like
1229
+
1230
+ 0:40:03.221 --> 0:40:04.328
1231
+ that.
1232
+
1233
+ 0:40:04.464 --> 0:40:11.734
1234
+ Then you have "how": there is nothing, "are": nothing,
1235
+ and after "you" there is a.
1236
+
1237
+ 0:40:11.651 --> 0:40:15.015
1238
+ Question mark.
1239
+
1240
+ 0:40:15.315 --> 0:40:31.443
1241
+ So you have the number of labels, the number
1242
+ of punctuation symbols you have for the basic
1243
+
1244
+ 0:40:31.443 --> 0:40:32.329
1245
+ one.
1246
+
1247
+ 0:40:32.892 --> 0:40:44.074
1248
+ Typically nowadays you would use something
1249
+ like BERT, and then you can train a system.
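A sketch of this sequence labeling view is given below, assuming a BERT-style token classification model from Hugging Face transformers; the model is untrained here, so the point is only the input/output setup, not the predictions.

```python
# Punctuation prediction as sequence labeling with a BERT-style encoder
# (untrained classification head, for illustration of the setup only).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["NONE", "PERIOD", "COMMA", "QUESTION"]          # one label per word

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS))

words = "hello how are you".split()
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)[0].tolist()     # one label id per subword

word_ids = enc.word_ids()
labels = [LABELS[pred[i]] for i, w in enumerate(word_ids)
          if w is not None and (i == 0 or word_ids[i - 1] != w)]
print(list(zip(words, labels)))                           # label predicted after each word
```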
1250
+
1251
+ 0:40:48.168 --> 0:40:59.259
1252
+ Any questions to that? Wouldn't it then probably
1253
+ be mostly no punctuation, you know, or not?
1254
+
1255
+ 0:41:00.480 --> 0:41:03.221
1256
+ Yeah, you have definitely a label imbalance.
1257
+
1258
+ 0:41:04.304 --> 0:41:12.405
1259
+ Think that works relatively well and haven't
1260
+ seen that.
1261
+
1262
+ 0:41:12.405 --> 0:41:21.085
1263
+ It's not a completely crazy imbalance, maybe twenty
1264
+ times more.
1265
+
1266
+ 0:41:21.561 --> 0:41:29.636
1267
+ It can and especially for the more rare things
1268
+ mean, the more rare things is question marks.
1269
+
1270
+ 0:41:30.670 --> 0:41:43.877
1271
+ At least for question marks you have typically
1272
+ very strong indicator words.
1273
+
1274
+ 0:41:47.627 --> 0:42:03.321
1275
+ And then what was done for quite a long time
1276
+ can we know how to do machine translation?
1277
+
1278
+ 0:42:04.504 --> 0:42:12.640
1279
+ So the idea is, can we just translate non
1280
+ punctuated English into punctuated English
1281
+
1282
+ 0:42:12.640 --> 0:42:14.650
1283
+ and do it correctly?
1284
+
1285
+ 0:42:15.855 --> 0:42:25.344
1286
+ So what you need is something like this type
1287
+ of data where the source doesn't have punctuation.
1288
+
1289
+ 0:42:25.845 --> 0:42:30.641
1290
+ Course: A year is already done.
1291
+
1292
+ 0:42:30.641 --> 0:42:36.486
1293
+ You have to make it a bit challenging.
1294
+
1295
+ 0:42:41.661 --> 0:42:44.550
1296
+ Yeah, that is true.
1297
+
1298
+ 0:42:44.550 --> 0:42:55.237
1299
+ If you think about the normal trained age,
1300
+ you have to do one thing more.
1301
+
1302
+ 0:42:55.237 --> 0:43:00.724
1303
+ Is it otherwise difficult to predict?
1304
+
1305
+ 0:43:05.745 --> 0:43:09.277
1306
+ Here it's already this already looks different
1307
+ than normal training data.
1308
+
1309
+ 0:43:09.277 --> 0:43:09.897
1310
+ What is the.
1311
+
1312
+ 0:43:10.350 --> 0:43:15.305
1313
+ People want to use this transcript of speech.
1314
+
1315
+ 0:43:15.305 --> 0:43:19.507
1316
+ We'll probably go to our text editors.
1317
+
1318
+ 0:43:19.419 --> 0:43:25.906
1319
+ Yes, that is all already quite too difficult.
1320
+
1321
+ 0:43:26.346 --> 0:43:33.528
1322
+ Mean, that's making things a lot better with
1323
+ the first and easiest thing is you have to
1324
+
1325
+ 0:43:33.528 --> 0:43:35.895
1326
+ randomly cut your sentences.
1327
+
1328
+ 0:43:35.895 --> 0:43:43.321
1329
+ So if you take just me normally we have one
1330
+ sentence per line and if you take this as your
1331
+
1332
+ 0:43:43.321 --> 0:43:44.545
1333
+ training data.
1334
+
1335
+ 0:43:44.924 --> 0:43:47.857
1336
+ And that is, of course, not very helpful.
1337
+
1338
+ 0:43:48.208 --> 0:44:01.169
1339
+ So in order to build the training corpus for
1340
+ doing punctuation you randomly cut your sentences
1341
+
1342
+ 0:44:01.169 --> 0:44:08.264
1343
+ and then you can remove all your punctuation
1344
+ marks.
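The data preparation just described can be sketched as follows; the segment length range is an arbitrary assumption.

```python
# Build (unpunctuated source, punctuated target) pairs by concatenating
# text and cutting it at random points, as described above.
import random
import re

def make_pairs(sentences, min_len=5, max_len=25, seed=0):
    rng = random.Random(seed)
    words = " ".join(sentences).split()
    pairs, i = [], 0
    while i < len(words):
        n = rng.randint(min_len, max_len)                  # random segment length
        target = " ".join(words[i:i + n])                  # keeps casing and punctuation
        source = re.sub(r"[.,?!]", "", target).lower()     # stripped, lowercased input
        pairs.append((source, target))
        i += n
    return pairs

demo = ["Hello, how are you?", "This is a test.", "We cut the text at random points."]
for src, tgt in make_pairs(demo):
    print(src, "->", tgt)
```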
1345
+
1346
+ 0:44:08.528 --> 0:44:21.598
1347
+ Because of course there is no longer to do
1348
+ when you have some random segments in your
1349
+
1350
+ 0:44:21.598 --> 0:44:22.814
1351
+ system.
1352
+
1353
+ 0:44:25.065 --> 0:44:37.984
1354
+ And then you can, for example, if you then
1355
+ have generated your punctuation marks before
1356
+
1357
+ 0:44:37.984 --> 0:44:41.067
1358
+ going to the system.
1359
+
1360
+ 0:44:41.221 --> 0:44:54.122
1361
+ And that is an important thing, which we like
1362
+ to see is more challenging for end systems.
1363
+
1364
+ 0:44:54.122 --> 0:45:00.143
1365
+ We can change the segmentation, so maybe.
1366
+
1367
+ 0:45:00.040 --> 0:45:06.417
1368
+ You can, then if you're combining these things
1369
+ you can change the segmentation here, so.
1370
+
1371
+ 0:45:06.406 --> 0:45:18.178
1372
+ While you have ten new ten segments in your,
1373
+ you might only have five ones in your anymore.
1374
+
1375
+ 0:45:18.178 --> 0:45:18.946
1376
+ Then.
1377
+
1378
+ 0:45:19.259 --> 0:45:33.172
1379
+ Which might be more useful or helpful in because
1380
+ you have to reorder things and so on.
1381
+
1382
+ 0:45:33.273 --> 0:45:43.994
1383
+ And if you think of the wrong segmentation
1384
+ then you cannot reorder things from the beginning
1385
+
1386
+ 0:45:43.994 --> 0:45:47.222
1387
+ to the end of the sentence.
1388
+
1389
+ 0:45:49.749 --> 0:45:58.006
1390
+ Okay, so much about segmentation do you have
1391
+ any more questions about that?
1392
+
1393
+ 0:46:02.522 --> 0:46:21.299
1394
+ Then there is one additional thing you can
1395
+ do, and that is when we refer to the idea.
1396
+
1397
+ 0:46:21.701 --> 0:46:29.356
1398
+ And when you get input there might be some
1399
+ errors in there, so it might not be perfect.
1400
+
1401
+ 0:46:29.889 --> 0:46:36.322
1402
+ So the question is, can we adapt to that?
1403
+
1404
+ 0:46:36.322 --> 0:46:45.358
1405
+ And can the system be improved by saying that
1406
+ it can some.
1407
+
1408
+ 0:46:45.265 --> 0:46:50.591
1409
+ So that is as aware that before there is a.
1410
+
1411
+ 0:46:50.490 --> 0:46:55.449
1412
+ Their arm might not be the best one.
1413
+
1414
+ 0:46:55.935 --> 0:47:01.961
1415
+ There are different ways of dealing with them.
1416
+
1417
+ 0:47:01.961 --> 0:47:08.116
1418
+ You can use not only the one-best list but an n-best list.
1419
+
1420
+ 0:47:08.408 --> 0:47:16.711
1421
+ So the idea is that you're not only telling
1422
+ the system this is the transcript, but here
1423
+
1424
+ 0:47:16.711 --> 0:47:18.692
1425
+ I'm not going to be.
1426
+
1427
+ 0:47:19.419 --> 0:47:30.748
1428
+ Or that you can try to make it more robust
1429
+ towards errors from an ASR system so that.
1430
+
1431
+ 0:47:32.612 --> 0:47:48.657
1432
+ Interestingly, hope I convinced
1433
+ you it might be a good idea to deal with ASR errors.
1434
+
1435
+ 0:47:48.868 --> 0:47:57.777
1436
+ The interesting thing is if you're looking
1437
+ into a lot of systems, this is often ignored,
1438
+
1439
+ 0:47:57.777 --> 0:48:04.784
1440
+ so they are not adapting their MT system to
1441
+ this type of ASR errors.
1442
+
1443
+ 0:48:05.345 --> 0:48:15.232
1444
+ So it's not really doing any handling of errors,
1445
+ and the interesting thing is often works as
1446
+
1447
+ 0:48:15.232 --> 0:48:15.884
1448
+ good.
1449
+
1450
+ 0:48:16.516 --> 0:48:23.836
1451
+ And one reason is, of course, one reason is
1452
+ if the ASR system does an error, it is due to
1453
+
1454
+ 0:48:23.836 --> 0:48:31.654
1455
+ a challenging situation, and then the error
1456
+ is really hard for the MT system to detect.
1457
+
1458
+ 0:48:31.931 --> 0:48:39.375
1459
+ If it would be easy for the system to detect
1460
+ the error you would integrate this information
1461
+
1462
+ 0:48:39.375 --> 0:48:45.404
1463
+ into: That is not always the case, but that
1464
+ of course makes it a bit challenging, and that's
1465
+
1466
+ 0:48:45.404 --> 0:48:49.762
1467
+ why there is a lot of systems where it's not
1468
+ explicitly handled how to deal with.
1469
+
1470
+ 0:48:52.912 --> 0:49:06.412
1471
+ But of course it might be good, so one thing
1472
+ is you can give him a best list and you can
1473
+
1474
+ 0:49:06.412 --> 0:49:09.901
1475
+ translate every entry.
1476
+
1477
+ 0:49:10.410 --> 0:49:17.705
1478
+ And then you have two scores, like the MT probability
1479
+ and the ASR probability.
1480
+
1481
+ 0:49:18.058 --> 0:49:25.695
1482
+ Combine them and then generate or output the
1483
+ output from what has the best combined.
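Put as a sketch, the combination could look like this; the hypotheses, scores, and translation table are invented for illustration. Note the risk discussed next: the MT score can pull the decision toward a hypothesis that is merely easier to translate.

```python
# Toy n-best combination: translate every ASR hypothesis and output the
# candidate with the best weighted sum of ASR and MT log-scores.
FAKE_MT = {  # hypothetical translations with made-up MT log-scores
    "this conference is being recorded": ("diese Konferenz wird aufgezeichnet", -1.2),
    "this conference is being regarded": ("diese Konferenz wird betrachtet", -0.9),
}

nbest = [  # (ASR hypothesis, ASR log-score)
    ("this conference is being recorded", -2.0),
    ("this conference is being regarded", -2.6),
]

def best_combined(hypotheses, asr_weight=1.0, mt_weight=1.0):
    scored = []
    for hypothesis, asr_score in hypotheses:
        translation, mt_score = FAKE_MT[hypothesis]
        scored.append((asr_weight * asr_score + mt_weight * mt_score, translation))
    return max(scored)[1]      # translation with the highest combined score

print(best_combined(nbest))
```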
1484
+
1485
+ 0:49:26.366 --> 0:49:29.891
1486
+ And then it might no longer be the best.
1487
+
1488
+ 0:49:29.891 --> 0:49:38.144
1489
+ It might be like we had in beam search, so this
1490
+ has the best score, but this has a better combined.
1491
+
1492
+ 0:49:39.059 --> 0:49:46.557
1493
+ The problem sometimes works, but the problem
1494
+ is that the MT system might then tend to
1495
+
1496
+ 0:49:46.557 --> 0:49:52.777
1497
+ just translate not the correct sentence but
1498
+ the one easier to translate.
1499
+
1500
+ 0:49:53.693 --> 0:50:03.639
1501
+ You can also generate a more compact representation
1502
+ of this n-best list by having this type of
1503
+
1504
+ 0:50:03.639 --> 0:50:04.467
1505
+ graphs.
1506
+
1507
+ 0:50:05.285 --> 0:50:22.952
1508
+ Lattices: So then you could like try to do
1509
+ a graph to text translation so you can translate.
1510
+
1511
+ 0:50:22.802 --> 0:50:26.582
1512
+ Where like all possibilities, by the way our
1513
+ systems are invented.
1514
+
1515
+ 0:50:26.906 --> 0:50:31.485
1516
+ So it can be like a hostage, a conference
1517
+ with some probabilities.
1518
+
1519
+ 0:50:31.591 --> 0:50:35.296
1520
+ So the highest probability is here.
1521
+
1522
+ 0:50:35.296 --> 0:50:41.984
1523
+ Conference is being recorded, but there are
1524
+ other possibilities.
1525
+
1526
+ 0:50:42.302 --> 0:50:53.054
1527
+ And you can take all of this information out
1528
+ there with your probabilities.
1529
+
1530
+ 0:50:59.980 --> 0:51:07.614
1531
+ But we'll see this type of error propagation
1532
+ that if you have an error that this might then
1533
+
1534
+ 0:51:07.614 --> 0:51:15.165
1535
+ propagate to MT errors, which is one of the main
1536
+ reasons why people looked into other ways of
1537
+
1538
+ 0:51:15.165 --> 0:51:17.240
1539
+ doing it and not having.
1540
+
1541
+ 0:51:19.219 --> 0:51:28.050
1542
+ By generally a cascaded combination, as we've
1543
+ seen it, it has several advantages: The biggest
1544
+
1545
+ 0:51:28.050 --> 0:51:42.674
1546
+ maybe is the data availability so we can train
1547
+ systems for the different components.
1548
+
1549
+ 0:51:42.822 --> 0:51:47.228
1550
+ So you can train your individual components
1551
+ on relatively large data sets.
1552
+
1553
+ 0:51:47.667 --> 0:51:58.207
1554
+ A modular system where you can improve each
1555
+ individual model and if there's new development
1556
+
1557
+ 0:51:58.207 --> 0:52:01.415
1558
+ and models you can improve.
1559
+
1560
+ 0:52:01.861 --> 0:52:11.280
1561
+ There are several advantages, but of course
1562
+ there are also some disadvantages: The most
1563
+
1564
+ 0:52:11.280 --> 0:52:19.522
1565
+ common thing is that there is what is referred
1566
+ to as error propagation.
1567
+
1568
+ 0:52:19.522 --> 0:52:28.222
1569
+ If the ASR output has an error, probably your MT output
1570
+ will then directly contain an error.
1571
+
1572
+ 0:52:28.868 --> 0:52:41.740
1573
+ Typically it's like if there's an error in
1574
+ the transcript, it's easier to like ignore
1575
+
1576
+ 0:52:41.740 --> 0:52:46.474
1577
+ than in the translated output.
1578
+
1579
+ 0:52:46.967 --> 0:52:49.785
1580
+ What do that mean?
1581
+
1582
+ 0:52:49.785 --> 0:53:01.209
1583
+ It's complicated, so if you have German, the
1584
+ ASR does an error, and instead.
1585
+
1586
+ 0:53:01.101 --> 0:53:05.976
1587
+ Then most probably you'll ignore it or you'll
1588
+ still know what it was said.
1589
+
1590
+ 0:53:05.976 --> 0:53:11.827
1591
+ Maybe you even don't notice because you'll
1592
+ fastly read over it and don't see that there's
1593
+
1594
+ 0:53:11.827 --> 0:53:12.997
1595
+ one letter wrong.
1596
+
1597
+ 0:53:13.673 --> 0:53:25.291
1598
+ However, if you translate this one in an English
1599
+ sentence about speeches, there's something
1600
+
1601
+ 0:53:25.291 --> 0:53:26.933
1602
+ about wines.
1603
+
1604
+ 0:53:27.367 --> 0:53:37.238
1605
+ So it's a lot easier typically to read over
1606
+ like errors in the transcript than reading over them in
1607
+
1608
+ 0:53:37.238 --> 0:53:38.569
1609
+ the speech.
1610
+
1611
+ 0:53:40.120 --> 0:53:45.863
1612
+ But there is additional challenges in in cascaded
1613
+ systems.
1614
+
1615
+ 0:53:46.066 --> 0:53:52.667
1616
+ So secondly we have seen that we optimize
1617
+ each component individually so you have a separate
1618
+
1619
+ 0:53:52.667 --> 0:53:59.055
1620
+ optimization and that doesn't mean that the
1621
+ overall performance is really the best at the
1622
+
1623
+ 0:53:59.055 --> 0:53:59.410
1624
+ end.
1625
+
1626
+ 0:53:59.899 --> 0:54:07.945
1627
+ And we have tried to do that by already saying
1628
+ yes.
1629
+
1630
+ 0:54:07.945 --> 0:54:17.692
1631
+ You need to adapt them a bit to work good
1632
+ together, but still.
1633
+
1634
+ 0:54:20.280 --> 0:54:24.185
1635
+ Secondly, like that, there's a computational
1636
+ complexity.
1637
+
1638
+ 0:54:24.185 --> 0:54:30.351
1639
+ You always need to run an ASR system and an
1640
+ MT system, and especially if you think about
1641
+
1642
+ 0:54:30.351 --> 0:54:32.886
1643
+ it, it should be fast and real time.
1644
+
1645
+ 0:54:32.886 --> 0:54:37.065
1646
+ It's challenging to always run two systems
1647
+ and not a single.
1648
+
1649
+ 0:54:38.038 --> 0:54:45.245
1650
+ And one final thing which you might have not
1651
+ directly thought of, but most of the world's
1652
+
1653
+ 0:54:45.245 --> 0:54:47.407
1654
+ languages do not have any.
1655
+
1656
+ 0:54:48.108 --> 0:55:01.942
1657
+ So if you have a language which doesn't have
1658
+ any script, then of course if you want to translate
1659
+
1660
+ 0:55:01.942 --> 0:55:05.507
1661
+ it you cannot first use.
1662
+
1663
+ 0:55:05.905 --> 0:55:13.705
1664
+ So in order to do this, the pressure was mentioned
1665
+ before ready.
1666
+
1667
+ 0:55:13.705 --> 0:55:24.264
1668
+ Build somehow a system which takes the audio
1669
+ and directly generates text in the target.
1670
+
1671
+ 0:55:26.006 --> 0:55:41.935
1672
+ And there is quite big opportunity for that
1673
+ because before that there was very different
1674
+
1675
+ 0:55:41.935 --> 0:55:44.082
1676
+ technology.
1677
+
1678
+ 0:55:44.644 --> 0:55:55.421
1679
+ However, since we are using neuromachine translation
1680
+ encoded decoder models, the interesting thing
1681
+
1682
+ 0:55:55.421 --> 0:56:00.429
1683
+ is that we are using very similar technology.
1684
+
1685
+ 0:56:00.360 --> 0:56:06.047
1686
+ It's like in both cases very similar architecture.
1687
+
1688
+ 0:56:06.047 --> 0:56:09.280
1689
+ The main difference is once.
1690
+
1691
+ 0:56:09.649 --> 0:56:17.143
1692
+ But generally how it's done is very similar,
1693
+ and therefore of course it might be put everything
1694
+
1695
+ 0:56:17.143 --> 0:56:22.140
1696
+ together, and that is what is referred to as
1697
+ end-to-end speech.
1698
+
1699
+ 0:56:22.502 --> 0:56:31.411
1700
+ So that means we're having one large neural
1701
+ network and decoded voice system, but we put
1702
+
1703
+ 0:56:31.411 --> 0:56:34.914
1704
+ an audio in one language and then.
1705
+
1706
+ 0:56:36.196 --> 0:56:43.106
1707
+ We can then have a system which directly does
1708
+ the full process.
1709
+
1710
+ 0:56:43.106 --> 0:56:46.454
1711
+ We don't have to care anymore.
1712
+
1713
+ 0:56:48.048 --> 0:57:02.615
1714
+ So if you think of it as before, so we have
1715
+ this decoder, and that's the two separate.
1716
+
1717
+ 0:57:02.615 --> 0:57:04.792
1718
+ We have the.
1719
+
1720
+ 0:57:05.085 --> 0:57:18.044
1721
+ And instead of going via the discrete text
1722
+ representation in the source language, we can
1723
+
1724
+ 0:57:18.044 --> 0:57:21.470
1725
+ go via the continuous.
1726
+
1727
+ 0:57:21.681 --> 0:57:26.027
1728
+ Of course, they hope it's by not doing this
1729
+ discrimination in between.
1730
+
1731
+ 0:57:26.146 --> 0:57:30.275
1732
+ We don't have a problem at doing errors.
1733
+
1734
+ 0:57:30.275 --> 0:57:32.793
1735
+ We can only cover later.
1736
+
1737
+ 0:57:32.772 --> 0:57:47.849
1738
+ But we can encode here the variability or
1739
+ so that we have and then only define the decision.
1740
+
1741
+ 0:57:51.711 --> 0:57:54.525
1742
+ And so.
1743
+
1744
+ 0:57:54.274 --> 0:58:02.253
1745
+ What we're doing is we're having very similar
1746
+ technique.
1747
+
1748
+ 0:58:02.253 --> 0:58:12.192
1749
+ We're still having the encoder-decoder model where
1750
+ we're coming from the main.
1751
+
1752
+ 0:58:12.552 --> 0:58:24.098
1753
+ Instead of getting discrete tokens in there
1754
+ as we have subwords, we always encoded that
1755
+
1756
+ 0:58:24.098 --> 0:58:26.197
1757
+ in one pattern.
1758
+
1759
+ 0:58:26.846 --> 0:58:42.505
1760
+ The problem is that this is in continuous,
1761
+ so we have to check how we can work with continuous
1762
+
1763
+ 0:58:42.505 --> 0:58:43.988
1764
+ signals.
1765
+
1766
+ 0:58:47.627 --> 0:58:55.166
1767
+ Mean, the first thing in your system is when
1768
+ you do your disc freeze and code it.
1769
+
1770
+ 0:59:02.402 --> 0:59:03.888
1771
+ A newer machine translation.
1772
+
1773
+ 0:59:03.888 --> 0:59:05.067
1774
+ You're getting a word.
1775
+
1776
+ 0:59:05.067 --> 0:59:06.297
1777
+ It's a one-hot encoding.
1778
+
1779
+ 0:59:21.421 --> 0:59:24.678
1780
+ The first layer of the machine translation.
1781
+
1782
+ 0:59:27.287 --> 0:59:36.147
1783
+ Yes, you do the word embedding, so then you
1784
+ have a continuous thing.
1785
+
1786
+ 0:59:36.147 --> 0:59:40.128
1787
+ So if you know get continuous.
1788
+
1789
+ 0:59:40.961 --> 0:59:46.316
1790
+ Deal with it the same way, so we'll see not
1791
+ a big of a challenge.
1792
+
1793
+ 0:59:46.316 --> 0:59:48.669
1794
+ What is more challenging is.
1795
+
1796
+ 0:59:49.349 --> 1:00:04.498
1797
+ So the audio signal is ten times longer or
1798
+ so, like more time steps you have.
1799
+
1800
+ 1:00:04.764 --> 1:00:10.332
1801
+ And so that is, of course, any challenge how
1802
+ we can deal with this type of long sequence.
1803
+
1804
+ 1:00:11.171 --> 1:00:13.055
1805
+ The advantage is a bit.
1806
+
1807
+ 1:00:13.055 --> 1:00:17.922
1808
+ The long sequence is only at the input and
1809
+ not at the output.
1810
+
1811
+ 1:00:17.922 --> 1:00:24.988
1812
+ So when you remember for the efficiency, for
1813
+ example, like a long sequence are especially
1814
+
1815
+ 1:00:24.988 --> 1:00:29.227
1816
+ challenging in the decoder, but also for the
1817
+ encoder.
1818
+
1819
+ 1:00:31.371 --> 1:00:33.595
1820
+ So how it is this?
1821
+
1822
+ 1:00:33.595 --> 1:00:40.617
1823
+ How can we process audio into an speech translation
1824
+ system?
1825
+
1826
+ 1:00:41.501 --> 1:00:51.856
1827
+ And you can follow mainly what is done in
1828
+ an ASR system, so you have the audio signal.
1829
+
1830
+ 1:00:52.172 --> 1:00:59.135
1831
+ Then you measure your amplitude at every time
1832
+ step.
1833
+
1834
+ 1:00:59.135 --> 1:01:04.358
1835
+ It's typically sampled at something like sixteen kilohertz.
1836
+
1837
+ 1:01:04.384 --> 1:01:13.893
1838
+ And then you're doing this, this windowing,
1839
+ so that you get a signal of a length twenty
1840
+
1841
+ 1:01:13.893 --> 1:01:22.430
1842
+ to thirty milliseconds, and you have all these windows
1843
+ so that you measure them.
1844
+
1845
+ 1:01:22.342 --> 1:01:32.260
1846
+ A simple gear, and then you look at these
1847
+ time signals of seconds.
1848
+
1849
+ 1:01:32.432 --> 1:01:36.920
1850
+ So in the end then it is ten milliseconds, every ten
1851
+ milliseconds.
1852
+
1853
+ 1:01:36.920 --> 1:01:39.735
1854
+ You have for every ten milliseconds.
1855
+
1856
+ 1:01:40.000 --> 1:01:48.309
1857
+ Some type of representation which type of
1858
+ representation you can generate from that,
1859
+
1860
+ 1:01:48.309 --> 1:01:49.286
1861
+ but that.
1862
+
1863
+ 1:01:49.649 --> 1:02:06.919
1864
+ So instead of having no letter or word, you
1865
+ have now representations for every 10 ms of your
1866
+
1867
+ 1:02:06.919 --> 1:02:08.437
1868
+ system.
1869
+
1870
+ 1:02:08.688 --> 1:02:13.372
1871
+ How we encode now your twenty-to-thirty-millisecond
1872
+ window here there is different ways.
1873
+
1874
+ 1:02:16.176 --> 1:02:31.891
1875
+ Was a traditional way of how people have done
1876
+ that from an audio signal what frequencies
1877
+
1878
+ 1:02:31.891 --> 1:02:34.010
1879
+ are in the.
1880
+
1881
+ 1:02:34.114 --> 1:02:44.143
1882
+ So to do that you can compute the mel-frequency
1883
+ cepstral coefficients, so you can use Fourier transformations.
1884
+
1885
+ 1:02:44.324 --> 1:02:47.031
1886
+ Which frequencies are there?
1887
+
1888
+ 1:02:47.031 --> 1:02:53.566
1889
+ You know that the letters are different by
1890
+ the different frequencies.
1891
+
1892
+ 1:02:53.813 --> 1:03:04.243
1893
+ And then if you're doing that, use the matte
1894
+ to covers for your window we have before.
1895
+
1896
+ 1:03:04.624 --> 1:03:14.550
1897
+ So for each of these windows: You will calculate
1898
+ what frequencies in there and then get features
1899
+
1900
+ 1:03:14.550 --> 1:03:20.059
1901
+ for this window and features for this window.
1902
+
1903
+ 1:03:19.980 --> 1:03:28.028
1904
+ These are the frequencies that occur there
1905
+ and that help you to model which letters are
1906
+
1907
+ 1:03:28.028 --> 1:03:28.760
1908
+ spoken.
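A typical implementation of this classic feature extraction, assuming the librosa library and common default values (16 kHz sampling, 25 ms windows, 10 ms shift, none of which are numbers from the lecture), would look roughly like this:

```python
# Classic MFCC feature extraction (assumed library: librosa).
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical input file
features = librosa.feature.mfcc(
    y=audio, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr),         # ~25 ms analysis window
    hop_length=int(0.010 * sr))    # one feature vector every ~10 ms
print(features.shape)              # (13, number_of_frames)
```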
1909
+
1910
+ 1:03:31.611 --> 1:03:43.544
1911
+ More recently, instead of doing the traditional
1912
+ signal processing, you can also replace that
1913
+
1914
+ 1:03:43.544 --> 1:03:45.853
1915
+ by deep learning.
1916
+
1917
+ 1:03:46.126 --> 1:03:56.406
1918
+ So that we are using a self-supervised approach
1919
+ from language model to generate features that
1920
+
1921
+ 1:03:56.406 --> 1:03:58.047
1922
+ describe what.
1923
+
1924
+ 1:03:58.358 --> 1:03:59.821
1925
+ So you have your.
1926
+
1927
+ 1:03:59.759 --> 1:04:07.392
1928
+ All your signal again, and then for each chunk
1929
+ you do your convolutional neural network to
1930
+
1931
+ 1:04:07.392 --> 1:04:07.811
1932
+ get.
1933
+
1934
+ 1:04:07.807 --> 1:04:23.699
1935
+ First representation here is a transformer
1936
+ network here, and in the end it's similar to
1937
+
1938
+ 1:04:23.699 --> 1:04:25.866
1939
+ a language model.
1940
+
1941
+ 1:04:25.705 --> 1:04:30.238
1942
+ And you try to predict what was masked
1943
+ here.
1944
+
1945
+ 1:04:30.670 --> 1:04:42.122
1946
+ So that is in a way similar that you also
1947
+ try to learn a good representation of all these
1948
+
1949
+ 1:04:42.122 --> 1:04:51.608
1950
+ audio signals by predicting: And then you don't
1951
+ do the signal processing base, but have this
1952
+
1953
+ 1:04:51.608 --> 1:04:52.717
1954
+ way to make.
1955
+
1956
+ 1:04:52.812 --> 1:04:59.430
1957
+ But in all the things that you have to remember
1958
+ what is most important for you, and to end
1959
+
1960
+ 1:04:59.430 --> 1:05:05.902
1961
+ system is, of course, that you in the end get
1962
+ for every minute ten milliseconds, you get
1963
+
1964
+ 1:05:05.902 --> 1:05:11.283
1965
+ a representation of this audio signal, which
1966
+ is again a vector, and that.
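The learned alternative can be sketched with a pretrained wav2vec 2.0 encoder from Hugging Face transformers (an assumed choice; any similar self-supervised speech model works the same way): it again turns raw audio into one vector per short time frame.

```python
# Self-supervised speech features instead of hand-crafted ones
# (assumed model: facebook/wav2vec2-base via Hugging Face transformers).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)                       # 1 second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, frames, hidden_size)
print(hidden.shape)                                 # roughly one vector per 20 ms frame
```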
1967
+
1968
+ 1:05:11.331 --> 1:05:15.365
1969
+ And then you can use your normal encoder to
1970
+ code your model to do this research.
1971
+
1972
+ 1:05:21.861 --> 1:05:32.694
1973
+ So that is all which directly has to be changed,
1974
+ and then you can build your first base.
1975
+
1976
+ 1:05:33.213 --> 1:05:37.167
1977
+ You do the audio processing.
1978
+
1979
+ 1:05:37.167 --> 1:05:49.166
1980
+ You of course need data which is like Audio
1981
+ and English and Text in German and then you
1982
+
1983
+ 1:05:49.166 --> 1:05:50.666
1984
+ can train.
1985
+
1986
+ 1:05:53.333 --> 1:05:57.854
1987
+ And interestingly, it works at the beginning.
1988
+
1989
+ 1:05:57.854 --> 1:06:03.261
1990
+ The systems were maybe a bit worse, but we
1991
+ saw really.
1992
+
1993
+ 1:06:03.964 --> 1:06:11.803
1994
+ This is like from the biggest workshop where
1995
+ people like compared different systems.
1996
+
1997
+ 1:06:11.751 --> 1:06:17.795
1998
+ Special challenge on comparing Cascaded to
1999
+ end to end systems and you see two thousand
2000
+
2001
+ 1:06:17.795 --> 1:06:18.767
2002
+ and eighteen.
2003
+
2004
+ 1:06:18.767 --> 1:06:25.089
2005
+ We had quite a huge gap between the Cascaded
2006
+ and end to end systems and then it got nearer
2007
+
2008
+ 1:06:25.089 --> 1:06:27.937
2009
+ and earlier in starting in two thousand.
2010
+
2011
+ 1:06:27.907 --> 1:06:33.619
2012
+ Twenty the performance was mainly the same,
2013
+ so there was no clear difference anymore.
2014
+
2015
+ 1:06:34.014 --> 1:06:42.774
2016
+ So this is, of course, raising a bit of hope
2017
+ saying if we better learn how to build these
2018
+
2019
+ 1:06:42.774 --> 1:06:47.544
2020
+ end-to-end systems, they might really perform better.
2021
+
2022
+ 1:06:49.549 --> 1:06:52.346
2023
+ However, a bit.
2024
+
2025
+ 1:06:52.452 --> 1:06:59.018
2026
+ This satisfying this is how this all continues,
2027
+ and this is not only in two thousand and twenty
2028
+
2029
+ 1:06:59.018 --> 1:07:04.216
2030
+ one, but even nowadays we can say there is
2031
+ no clear performance difference.
2032
+
2033
+ 1:07:04.216 --> 1:07:10.919
2034
+ It's not like the one model is better than
2035
+ the other, but we are seeing very similar performance.
2036
+
2037
+ 1:07:11.391 --> 1:07:19.413
2038
+ So the question is what is the difference?
2039
+
2040
+ 1:07:19.413 --> 1:07:29.115
2041
+ Of course, this can only be achieved by new
2042
+ tricks.
2043
+
2044
+ 1:07:30.570 --> 1:07:35.658
2045
+ Yes and no, that's what we will mainly look
2046
+ into now.
2047
+
2048
+ 1:07:35.658 --> 1:07:39.333
2049
+ How can we make use of other types of.
2050
+
2051
+ 1:07:39.359 --> 1:07:53.236
2052
+ In that case you can achieve some performance
2053
+ by using different types of training so you
2054
+
2055
+ 1:07:53.236 --> 1:07:55.549
2056
+ can also make.
2057
+
2058
+ 1:07:55.855 --> 1:08:04.961
2059
+ So if you are training or preparing the systems
2060
+ only on very small corpora where you have as
2061
+
2062
+ 1:08:04.961 --> 1:08:10.248
2063
+ much data than you have for the individual
2064
+ ones then.
2065
+
2066
+ 1:08:10.550 --> 1:08:22.288
2067
+ So that is the biggest challenge of an end
2068
+ system that you have small corpora and therefore.
2069
+
2070
+ 1:08:24.404 --> 1:08:30.479
2071
+ Of course, there is several advantages so
2072
+ you can give access to the audio information.
2073
+
2074
+ 1:08:30.750 --> 1:08:42.046
2075
+ So that's, for example, interesting if you
2076
+ think about it, you might not have modeled
2077
+
2078
+ 1:08:42.046 --> 1:08:45.198
2079
+ everything in the text.
2080
+
2081
+ 1:08:45.198 --> 1:08:50.321
2082
+ So remember when we talk about biases.
2083
+
2084
+ 1:08:50.230 --> 1:08:55.448
2085
+ Male or female, and that of course is not
2086
+ in the text any more, but in the audio signal
2087
+
2088
+ 1:08:55.448 --> 1:08:56.515
2089
+ it's still there.
2090
+
2091
+ 1:08:58.078 --> 1:09:03.108
2092
+ It also allows you to talk about that on Thursday
2093
+ when you talk about latency.
2094
+
2095
+ 1:09:03.108 --> 1:09:08.902
2096
+ You have a bit better chance if you do an
2097
+ end to end system to get a lower latency because
2098
+
2099
+ 1:09:08.902 --> 1:09:14.377
2100
+ you only have one system and you don't have
2101
+ two systems which might have to wait for.
2102
+
2103
+ 1:09:14.934 --> 1:09:20.046
2104
+ And having one system might be also a bit
2105
+ easier management.
2106
+
2107
+ 1:09:20.046 --> 1:09:23.146
2108
+ See that two systems work and so on.
2109
+
2110
+ 1:09:26.346 --> 1:09:41.149
2111
+ The biggest challenge of end systems is the
2112
+ data, so as you correctly pointed out, typically
2113
+
2114
+ 1:09:41.149 --> 1:09:42.741
2115
+ there is.
2116
+
2117
+ 1:09:43.123 --> 1:09:45.829
2118
+ There is some data for Ted.
2119
+
2120
+ 1:09:45.829 --> 1:09:47.472
2121
+ People did that.
2122
+
2123
+ 1:09:47.472 --> 1:09:52.789
2124
+ They took the English audio with all the translations.
2125
+
2126
+ 1:09:53.273 --> 1:10:02.423
2127
+ But in general there is a lot less, so we'll
2128
+ look into how you can use other data sources.
2129
+
2130
+ 1:10:05.305 --> 1:10:10.950
2131
+ And secondly, the second challenge is that
2132
+ we have to deal with audio.
2133
+
2134
+ 1:10:11.431 --> 1:10:22.163
2135
+ For example, in input length, and therefore
2136
+ it's also important to handle this in your
2137
+
2138
+ 1:10:22.163 --> 1:10:27.590
2139
+ network and maybe have dedicated solutions.
2140
+
2141
+ 1:10:31.831 --> 1:10:40.265
2142
+ So in general we have this challenge that
2143
+ we have a lot of text and translation and audio
2144
+
2145
+ 1:10:40.265 --> 1:10:43.076
2146
+ transcript data by quite few.
2147
+
2148
+ 1:10:43.643 --> 1:10:50.844
2149
+ So what can we do in one trick?
2150
+
2151
+ 1:10:50.844 --> 1:11:00.745
2152
+ You already know a bit from other research.
2153
+
2154
+ 1:11:02.302 --> 1:11:14.325
2155
+ Exactly so what you can do is you can, for
2156
+ example, use TTS: take a parallel corpus, generate
2157
+
2158
+ 1:11:14.325 --> 1:11:19.594
2159
+ audio of the source language, and then.
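As a sketch, this forward generation of synthetic training data looks as follows; synthesize() stands for any text-to-speech system and is only a placeholder here, not a specific API.

```python
# Synthetic data for end-to-end speech translation: run TTS over the
# source side of a parallel corpus and pair the audio with the real
# target text. synthesize() is a placeholder, not a specific TTS API.
def synthesize(text: str, out_path: str) -> str:
    """Stand-in for any TTS system; pretend it writes audio to out_path."""
    return out_path

parallel_corpus = [("How are you?", "Wie geht es dir?"),
                   ("This is a test.", "Das ist ein Test.")]

synthetic_data = []
for i, (src_text, tgt_text) in enumerate(parallel_corpus):
    wav_path = synthesize(src_text, f"synthetic_{i}.wav")   # synthetic source audio
    synthetic_data.append((wav_path, tgt_text))             # paired with real target text

print(synthetic_data)
```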
2160
+
2161
+ 1:11:21.341 --> 1:11:33.780
2162
+ There has been a bit motivated by what we
2163
+ have seen in back translation, which was very
2164
+
2165
+ 1:11:33.780 --> 1:11:35.476
2166
+ successful.
2167
+
2168
+ 1:11:38.758 --> 1:11:54.080
2169
+ However, it's a bit more challenging because
2170
+ it is often very different from real audio.
2171
+
2172
+ 1:11:54.314 --> 1:12:07.131
2173
+ So often if you build a system only trained
2174
+ on synthetic audio, but then generalizing to real audio data
2175
+
2176
+ 1:12:07.131 --> 1:12:10.335
2177
+ is quite challenging.
2178
+
2179
+ 1:12:10.910 --> 1:12:20.927
2180
+ And therefore here the synthetic data generation
2181
+ is significantly more challenging than when you generate text.
2182
+
2183
+ 1:12:20.981 --> 1:12:27.071
2184
+ Because if you read a text, it's maybe a bad
2185
+ translation.
2186
+
2187
+ 1:12:27.071 --> 1:12:33.161
2188
+ It's hard, but it's a real text or a text
2189
+ generated by.
2190
+
2191
+ 1:12:35.835 --> 1:12:42.885
2192
+ But it's a valid solution, and for example
2193
+ we use that also for our current systems.
2194
+
2195
+ 1:12:43.923 --> 1:12:53.336
2196
+ Of course you can also do a bit of forward
2197
+ translation; that is also done, so you take ASR data.
2198
+
2199
+ 1:12:53.773 --> 1:13:02.587
2200
+ But then the problem is that your reference
2201
+ is not always correct, and you remember when
2202
+
2203
+ 1:13:02.587 --> 1:13:08.727
2204
+ we talked about back translation, it's a bit
2205
+ of an advantage.
2206
+
2207
+ 1:13:09.229 --> 1:13:11.930
2208
+ But both can be done and both have been done.
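A minimal sketch of how these two directions of synthetic data creation can be wired up. Everything here is illustrative: the corpus formats and the tts / mt callables are assumptions, not part of any particular toolkit.

    from typing import Callable, List, Tuple

    Audio = List[float]  # an audio clip as raw samples (placeholder type)

    def synthetic_st_from_mt(parallel_corpus: List[Tuple[str, str]],
                             tts: Callable[[str], Audio]) -> List[Tuple[Audio, str]]:
        # TTS the source text; the human target translation stays the correct output.
        return [(tts(src), tgt) for src, tgt in parallel_corpus]

    def synthetic_st_from_asr(asr_corpus: List[Tuple[Audio, str]],
                              mt: Callable[[str], str]) -> List[Tuple[Audio, str]]:
        # Forward translation: machine-translate the transcript, so the reference
        # on the output side may now contain MT errors.
        return [(audio, mt(transcript)) for audio, transcript in asr_corpus]

In the first variant only the input side is synthetic, in the second the output side is, which is exactly the trade-off discussed above.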
2209
+
2210
+ 1:13:12.212 --> 1:13:20.277
2211
+ So you can think about this picture again.
2212
+
2213
+ 1:13:20.277 --> 1:13:30.217
2214
+ You can take this data and generate the audio
2215
+ to it.
2216
+
2217
+ 1:13:30.750 --> 1:13:37.938
2218
+ However, it is only synthetic of what can
2219
+ be used for the voice handling technology for:
2220
+
2221
+ 1:13:40.240 --> 1:13:47.153
2222
+ But you have not, I mean, yet you get text
2223
+ to speech, but the voice cloning would need
2224
+
2225
+ 1:13:47.153 --> 1:13:47.868
2226
+ a voice.
2227
+
2228
+ 1:13:47.868 --> 1:13:53.112
2229
+ You can use, of course, and then it's nothing
2230
+ else than a normal.
2231
+
2232
+ 1:13:54.594 --> 1:14:03.210
2233
+ But still think there are better than both,
2234
+ but there are some characteristics of that
2235
+
2236
+ 1:14:03.210 --> 1:14:05.784
2237
+ which is quite different.
2238
+
2239
+ 1:14:07.327 --> 1:14:09.341
2240
+ But yeah, it's getting better.
2241
+
2242
+ 1:14:09.341 --> 1:14:13.498
2243
+ That is definitely true, and then this might
2244
+ get more and more.
2245
+
2246
+ 1:14:16.596 --> 1:14:21.885
2247
+ Here make sure it's a good person and our
2248
+ own systems because we try to train and.
2249
+
2250
+ 1:14:21.881 --> 1:14:24.356
2251
+ And it's like a feedback loop.
2252
+
2253
+ 1:14:24.356 --> 1:14:28.668
2254
+ There's anything like the Dutch English model
2255
+ that's.
2256
+
2257
+ 1:14:28.648 --> 1:14:33.081
2258
+ Yeah, you of course need a decent amount of
2259
+ real data.
2260
+
2261
+ 1:14:33.081 --> 1:14:40.255
2262
+ But I mean, as I said, so there is always
2263
+ an advantage if you have this synthetic data
2264
+
2265
+ 1:14:40.255 --> 1:14:44.044
2266
+ only on the input side and not on the output side.
2267
+
2268
+ 1:14:44.464 --> 1:14:47.444
2269
+ That you at least always generate correct
2270
+ outcomes.
2271
+
2272
+ 1:14:48.688 --> 1:14:54.599
2273
+ That's different in a language case because
2274
+ they have input and the output and it's not
2275
+
2276
+ 1:14:54.599 --> 1:14:55.002
2277
+ like.
2278
+
2279
+ 1:14:58.618 --> 1:15:15.815
2280
+ The other idea is to integrate additional
2281
+ sources so you can have more model sharing.
2282
+
2283
+ 1:15:16.376 --> 1:15:23.301
2284
+ But you can use these components also in the
2285
+ system.
2286
+
2287
+ 1:15:23.301 --> 1:15:28.659
2288
+ Typically the text decoder and the text.
2289
+
2290
+ 1:15:29.169 --> 1:15:41.845
2291
+ And so the other way of leveraging this is to jointly
2292
+ train or somehow train all these tasks together.
2293
+
2294
+ 1:15:43.403 --> 1:15:54.467
2295
+ The first and easy thing to do is multi task
2296
+ training so the idea is you take these components
2297
+
2298
+ 1:15:54.467 --> 1:16:02.038
2299
+ and train these two components and train the
2300
+ speech translation.
2301
+
2302
+ 1:16:02.362 --> 1:16:13.086
2303
+ So then, for example, all your encoders used
2304
+ by the speech translation system can also gain
2305
+
2306
+ 1:16:13.086 --> 1:16:14.951
2307
+ from the large amounts of data for the other tasks.
2308
+
2309
+ 1:16:14.975 --> 1:16:24.048
2310
+ So everything can gain a bit of emphasis,
2311
+ but it can partly gain in there quite a bit.
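A small sketch of how such multi-task training is often combined into one objective, assuming the ASR and MT losses come from models that share the audio encoder and the text decoder with the speech translation model. The weights are made-up hyperparameters, not values from the lecture.

    import torch

    def multitask_loss(loss_st: torch.Tensor,    # audio encoder -> text decoder (main task)
                       loss_asr: torch.Tensor,   # audio encoder -> source text (shares the encoder)
                       loss_mt: torch.Tensor,    # text encoder  -> text decoder (shares the decoder)
                       w_asr: float = 0.3,
                       w_mt: float = 0.3) -> torch.Tensor:
        # One combined loss, so the shared components also see the large ASR and MT corpora.
        return loss_st + w_asr * loss_asr + w_mt * loss_mt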
2312
+
2313
+ 1:16:27.407 --> 1:16:39.920
2314
+ The other idea is to do it in a pre-training
2315
+ phase.
2316
+
2317
+ 1:16:40.080 --> 1:16:50.414
2318
+ And then you take the encoder and the text
2319
+ decoder and train your model on that.
2320
+
2321
+ 1:16:54.774 --> 1:17:04.895
2322
+ Finally, there is also what is referred to
2323
+ as knowledge distillation, so there you have
2324
+
2325
+ 1:17:04.895 --> 1:17:11.566
2326
+ to remember if you learn from a probability
2327
+ distribution.
2328
+
2329
+ 1:17:11.771 --> 1:17:24.371
2330
+ So what you can do then is you have your MT system
2331
+ and if you then have your audio and text input
2332
+
2333
+ 1:17:24.371 --> 1:17:26.759
2334
+ you can use it as a teacher.
2335
+
2336
+ 1:17:27.087 --> 1:17:32.699
2337
+ And then you get a richer signal, so that you'll
2338
+ not only know this is the word, but you have
2339
+
2340
+ 1:17:32.699 --> 1:17:33.456
2341
+ a complete.
2342
+
2343
+ 1:17:34.394 --> 1:17:41.979
2344
+ This is typically also done because, of
2345
+ course, if you have speech translation data, it is often the case
2346
+
2347
+ 1:17:41.979 --> 1:17:49.735
2348
+ that you don't only have source language audio
2349
+ and target language text, but then you also
2350
+
2351
+ 1:17:49.735 --> 1:17:52.377
2352
+ have the source language text.
2353
+
2354
+ 1:17:53.833 --> 1:18:00.996
2355
+ Get a good idea of the text editor and the
2356
+ artist design.
2357
+
2358
+ 1:18:00.996 --> 1:18:15.888
2359
+ Now have to be aligned so that: Otherwise
2360
+ they wouldn't be able to determine which degree
2361
+
2362
+ 1:18:15.888 --> 1:18:17.922
2363
+ they'd be.
2364
+
2365
+ 1:18:18.178 --> 1:18:25.603
2366
+ What you are doing in knowledge distillation
2367
+ is you run your MT system and then you get your probability
2368
+
2369
+ 1:18:25.603 --> 1:18:32.716
2370
+ distribution for all the words and you use
2371
+ that to train, and that is normally more helpful
2372
+
2373
+ 1:18:32.716 --> 1:18:34.592
2374
+ than only getting the reference back.
2375
+
2376
+ 1:18:35.915 --> 1:18:44.427
2377
+ You can, of course, use the same decoder to
2378
+ be even more similar.
2379
+
2380
+ 1:18:44.427 --> 1:18:49.729
2381
+ Otherwise you don't have exactly the same distributions.
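A minimal sketch of such a knowledge distillation loss, assuming the MT teacher and the speech translation student use the same target vocabulary (otherwise the distributions are not comparable). The temperature is an illustrative parameter.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits: torch.Tensor,   # (batch, seq, vocab) from the ST model
                          teacher_logits: torch.Tensor,   # (batch, seq, vocab) from the MT model
                          temperature: float = 1.0) -> torch.Tensor:
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        student_logp = F.log_softmax(student_logits / temperature, dim=-1)
        # Cross-entropy against the full teacher distribution instead of only
        # the single reference token: the richer signal described above.
        return -(teacher_probs * student_logp).sum(dim=-1).mean()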
2382
+
2383
+ 1:18:52.832 --> 1:19:03.515
2384
+ That is a good point, and generally
2385
+ in all these cases it's good to have more similar
2386
+
2387
+ 1:19:03.515 --> 1:19:05.331
2388
+ representations.
2389
+
2390
+ 1:19:05.331 --> 1:19:07.253
2391
+ You can transfer.
2392
+
2393
+ 1:19:07.607 --> 1:19:23.743
2394
+ If the representations you get from
2395
+ the audio encoder and the text encoder are
2396
+
2397
+ 1:19:23.743 --> 1:19:27.410
2398
+ more similar, then.
2399
+
2400
+ 1:19:30.130 --> 1:19:39.980
2401
+ So here you have your text encoder in the
2402
+ target language and you can train it on large
2403
+
2404
+ 1:19:39.980 --> 1:19:40.652
2405
+ data.
2406
+
2407
+ 1:19:41.341 --> 1:19:45.994
2408
+ But of course you want to benefit also for
2409
+ this task because that's what you're most interested in.
2410
+
2411
+ 1:19:46.846 --> 1:19:59.665
2412
+ Of course, the most benefit for this task
2413
+ is if these two representations you get are
2414
+
2415
+ 1:19:59.665 --> 1:20:01.728
2416
+ more similar.
2417
+
2418
+ 1:20:02.222 --> 1:20:10.583
2419
+ Therefore, it's interesting to look into how
2420
+ can we make these two representations as similar
2421
+
2422
+ 1:20:10.583 --> 1:20:20.929
2423
+ as possible: The hope is that in the end you can even
2424
+ do something like zero shot transfer, so while
2425
+
2426
+ 1:20:20.929 --> 1:20:25.950
2427
+ you only train on this one you can also deal with the other.
2428
+
2429
+ 1:20:30.830 --> 1:20:40.257
2430
+ So what you can do is you can look at these
2431
+ two representations.
2432
+
2433
+ 1:20:40.257 --> 1:20:42.867
2434
+ So once the text.
2435
+
2436
+ 1:20:43.003 --> 1:20:51.184
2437
+ And you can either put them into the text
2438
+ decoder to the encoder.
2439
+
2440
+ 1:20:51.184 --> 1:20:53.539
2441
+ We have seen both.
2442
+
2443
+ 1:20:53.539 --> 1:21:03.738
2444
+ You can think: If you want to build an A's
2445
+ and to insist on you can either take the audio
2446
+
2447
+ 1:21:03.738 --> 1:21:06.575
2448
+ encoder and see how deep.
2449
+
2450
+ 1:21:08.748 --> 1:21:21.915
2451
+ However, you have these two representations
2452
+ and you want to make them more similar.
2453
+
2454
+ 1:21:21.915 --> 1:21:23.640
2455
+ One thing.
2456
+
2457
+ 1:21:23.863 --> 1:21:32.797
2458
+ Here we have, like you said, for every ten
2459
+ milliseconds we have a representation.
2460
+
2461
+ 1:21:35.335 --> 1:21:46.085
2462
+ So what people may have done, for example,
2463
+ is to remove redundant information so you can:
2464
+
2465
+ 1:21:46.366 --> 1:21:56.403
2466
+ So you can use your system to get an alignment based
2467
+ on letter or words and then average over the
2468
+
2469
+ 1:21:56.403 --> 1:21:58.388
2470
+ words or letters.
2471
+
2472
+ 1:21:59.179 --> 1:22:07.965
2473
+ So that the number of representations from
2474
+ the encoder is the same as you would get from the text.
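A small sketch of this length matching, assuming some aligner (for example the CTC output of an ASR model) already provides a frame span per word or letter; the representation is then averaged inside each span.

    import torch

    def average_by_spans(frames: torch.Tensor,                 # (num_frames, dim) audio encoder states
                         spans: list[tuple[int, int]]) -> torch.Tensor:
        # One averaged vector per word/letter, so the audio sequence gets the
        # same length as the corresponding text sequence.
        return torch.stack([frames[s:e].mean(dim=0) for s, e in spans])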
2475
+
2476
+ 1:22:12.692 --> 1:22:20.919
2477
+ Okay, that much about data. Do you have any more questions
2478
+ first about that.
2479
+
2480
+ 1:22:27.207 --> 1:22:36.787
2481
+ Then we'll finish with the audio processing
2482
+ and highlight a bit why this is challenging,
2483
+
2484
+ 1:22:36.787 --> 1:22:52.891
2485
+ so here's an example: One test set here has one thousand eight
2486
+ hundred sentences, so there are words or characters.
2487
+
2488
+ 1:22:53.954 --> 1:22:59.336
2489
+ If you look how many audio features, so
2490
+ how many samples there are, it is like one point five
2491
+
2492
+ 1:22:59.336 --> 1:22:59.880
2493
+ million.
2494
+
2495
+ 1:23:00.200 --> 1:23:10.681
2496
+ So you have ten times more features than you
2497
+ have characters, and then again five times
2498
+
2499
+ 1:23:10.681 --> 1:23:11.413
2500
+ more.
2501
+
2502
+ 1:23:11.811 --> 1:23:23.934
2503
+ So you have a sequence length for the audio
2504
+ about fifty times as long as you have for words, and that is
2505
+
2506
+ 1:23:23.934 --> 1:23:25.788
2507
+ a challenge.
2508
+
2509
+ 1:23:26.086 --> 1:23:34.935
2510
+ So the question is what can you do to make
2511
+ the sequence a bit shorter and not have this problem?
2512
+
2513
+ 1:23:38.458 --> 1:23:48.466
2514
+ The one thing is you can try to reduce the
2515
+ dimensionality in your encoder.
2516
+
2517
+ 1:23:48.466 --> 1:23:50.814
2518
+ There are different ways.
2519
+
2520
+ 1:23:50.991 --> 1:24:04.302
2521
+ So, for example, you can just sum up always
2522
+ over some frames or you can do a concatenation.
2523
+
2524
+ 1:24:04.804 --> 1:24:12.045
2525
+ Or you do a linear projection, or you even take
2526
+ not every feature but only every fifth or something?
2527
+
2528
+ 1:24:12.492 --> 1:24:23.660
2529
+ So this way you can very easily reduce your
2530
+ number of features in there, and there has
2531
+
2532
+ 1:24:23.660 --> 1:24:25.713
2533
+ been different approaches.
2534
+
2535
+ 1:24:26.306 --> 1:24:38.310
2536
+ There's also what you can do with things like
2537
+ a convolutional layer.
2538
+
2539
+ 1:24:38.310 --> 1:24:43.877
2540
+ where you can skip over frames with a stride.
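The length-reduction options mentioned here can be sketched as follows; the feature size, strides and kernel sizes are illustrative choices, not values from the lecture.

    import torch
    import torch.nn as nn

    frames = torch.randn(1, 80, 1500)          # (batch, feature_dim, time), made-up sizes

    # Option 1: simply keep only every fifth frame.
    subsampled = frames[:, :, ::5]             # time 1500 -> 300

    # Option 2: strided convolutions that learn how to compress neighbouring frames.
    conv = nn.Sequential(
        nn.Conv1d(80, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2), nn.ReLU(),
    )
    compressed = conv(frames)                  # time 1500 -> 375, roughly a factor of four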
2541
+
2542
+ 1:24:47.327 --> 1:24:55.539
2543
+ And then, in addition to the audio, the other
2544
+ problem is higher variability.
2545
+
2546
+ 1:24:55.539 --> 1:25:04.957
2547
+ So if you have a text there is one way to write it, but there are
2548
+ very different ways of saying it: you can
2549
+
2550
+ 1:25:04.957 --> 1:25:09.867
2551
+ distinguish whether I say a sentence or you say it with your
2552
+ voice.
2553
+
2554
+ 1:25:10.510 --> 1:25:21.224
2555
+ That of course makes it more challenging because
2556
+ now you get different inputs and while they
2557
+
2558
+ 1:25:21.224 --> 1:25:22.837
2559
+ would be the same in text.
2560
+
2561
+ 1:25:23.263 --> 1:25:32.360
2562
+ So that makes especially for limited data
2563
+ things more challenging and you want to somehow
2564
+
2565
+ 1:25:32.360 --> 1:25:35.796
2566
+ learn that this is not important.
2567
+
2568
+ 1:25:36.076 --> 1:25:39.944
2569
+ So there is the idea again okay.
2570
+
2571
+ 1:25:39.944 --> 1:25:47.564
2572
+ Can we do some type of data augmentation
2573
+ to better deal with this?
2574
+
2575
+ 1:25:48.908 --> 1:25:55.735
2576
+ And again people can mainly use what has been
2577
+ done in ASR and try to do the same things.
2578
+
2579
+ 1:25:56.276 --> 1:26:02.937
2580
+ You can try to add a bit of noise and do speed
2581
+ perturbation so playing the audio like slower
2582
+
2583
+ 1:26:02.937 --> 1:26:08.563
2584
+ and a bit faster to get more samples then and
2585
+ you can train on all of them.
2586
+
2587
+ 1:26:08.563 --> 1:26:14.928
2588
+ What is very important and very successful
2589
+ recently is what is called SpecAugment.
2590
+
2591
+ 1:26:15.235 --> 1:26:25.882
2592
+ The idea is that you directly work on all
2593
+ your audio features and you can try to mask them
2594
+
2595
+ 1:26:25.882 --> 1:26:29.014
2596
+ and that gives you more.
2597
+
2598
+ 1:26:29.469 --> 1:26:41.717
2599
+ What do they mean with masking? So these are
2600
+ your audio features and then there are different options.
2601
+
2602
+ 1:26:41.962 --> 1:26:47.252
2603
+ You can do what is referred to as mask and
2604
+ a time masking.
2605
+
2606
+ 1:26:47.252 --> 1:26:50.480
2607
+ That means you just set some masks.
2608
+
2609
+ 1:26:50.730 --> 1:26:58.003
2610
+ And then you should still be able
2611
+ to deal with it because you can normally.
2612
+
2613
+ 1:26:57.937 --> 1:27:05.840
2614
+ Also with that you are getting more robust
2615
+ and you can handle that because then
2616
+
2617
+ 1:27:05.840 --> 1:27:10.877
2618
+ many signals which have different timing look
2619
+ more similar.
2620
+
2621
+ 1:27:11.931 --> 1:27:22.719
2622
+ You are not only doing that for time masking
2623
+ but also for frequency masking so that if you
2624
+
2625
+ 1:27:22.719 --> 1:27:30.188
2626
+ have here the frequency channels you mask a
2627
+ frequency channel.
2628
+
2629
+ 1:27:30.090 --> 1:27:33.089
2630
+ Thereby being able to better recognize these
2631
+ things.
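A minimal sketch of the time and frequency masking described here, with made-up mask widths rather than the settings from the SpecAugment paper.

    import torch

    def spec_augment(features: torch.Tensor,   # (time, freq) log-mel features
                     time_mask: int = 20,
                     freq_mask: int = 10) -> torch.Tensor:
        augmented = features.clone()
        t, f = augmented.shape
        t0 = torch.randint(0, max(1, t - time_mask), (1,)).item()
        f0 = torch.randint(0, max(1, f - freq_mask), (1,)).item()
        augmented[t0:t0 + time_mask, :] = 0.0   # time masking: hide a span of frames
        augmented[:, f0:f0 + freq_mask] = 0.0   # frequency masking: hide a band of channels
        return augmented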
2632
+
2633
+ 1:27:35.695 --> 1:27:43.698
2634
+ With this we have had an overview of the two main
2635
+ approaches for speech translation that is on
2636
+
2637
+ 1:27:43.698 --> 1:27:51.523
2638
+ the one hand cascaded speech translation and
2639
+ on the other hand we talked about end-to-end
2640
+
2641
+ 1:27:51.523 --> 1:27:53.302
2642
+ speech translation.
2643
+
2644
+ 1:27:53.273 --> 1:28:02.080
2645
+ It's like how to combine things and how they
2646
+ work together for end-to-end speech translation.
2647
+
2648
+ 1:28:02.362 --> 1:28:06.581
2649
+ Here we saw data challenges and a bit about long
2650
+ sequences.
2651
+
2652
+ 1:28:07.747 --> 1:28:09.304
2653
+ Do we have any more questions?
2654
+
2655
+ 1:28:11.451 --> 1:28:19.974
2656
+ Can you briefly describe the challenge in cascading
2657
+ from translation to text to speech because
2658
+
2659
+ 1:28:19.974 --> 1:28:22.315
2660
+ I thought the translation.
2661
+
2662
+ 1:28:25.745 --> 1:28:30.201
2663
+ Yes, so I mean that works, it's again the easiest
2664
+ thing.
2665
+
2666
+ 1:28:30.201 --> 1:28:33.021
2667
+ What of course is challenging?
2668
+
2669
+ 1:28:33.021 --> 1:28:40.751
2670
+ What can be challenging is how to make that
2671
+ more lively and like that pronunciation?
2672
+
2673
+ 1:28:40.680 --> 1:28:47.369
2674
+ And yeah, which things are put more important,
2675
+ how to put things like that into.
2676
+
2677
+ 1:28:47.627 --> 1:28:53.866
2678
+ In the normal text, otherwise it would sound
2679
+ very monotone.
2680
+
2681
+ 1:28:53.866 --> 1:28:57.401
2682
+ You want to add this information.
2683
+
2684
+ 1:28:58.498 --> 1:29:02.656
2685
+ That is maybe one thing to make it a bit more
2686
+ emotional.
2687
+
2688
+ 1:29:02.656 --> 1:29:04.917
2689
+ That is maybe one thing which.
2690
+
2691
+ 1:29:05.305 --> 1:29:13.448
2692
+ But you are right there and out of the box.
2693
+
2694
+ 1:29:13.448 --> 1:29:20.665
2695
+ If you have everything works decently.
2696
+
2697
+ 1:29:20.800 --> 1:29:30.507
2698
+ Still, especially if you have a very monotone
2699
+ voice, so I think these are quite some open challenges.
2700
+
2701
+ 1:29:30.750 --> 1:29:35.898
2702
+ Maybe another open challenge is that it's
2703
+ not so much for the end product, but for the
2704
+
2705
+ 1:29:35.898 --> 1:29:37.732
2706
+ development it is very important.
2707
+
2708
+ 1:29:37.732 --> 1:29:40.099
2709
+ It's very hard to evaluate the quality.
2710
+
2711
+ 1:29:40.740 --> 1:29:48.143
2712
+ There is no real way around it:
2713
+ most systems are currently evaluated by human
2714
+
2715
+ 1:29:48.143 --> 1:29:49.109
2716
+ evaluation.
2717
+
2718
+ 1:29:49.589 --> 1:29:54.474
2719
+ So you cannot try hundreds of things and run
2720
+ your BLEU score and get a score.
2721
+
2722
+ 1:29:54.975 --> 1:30:00.609
2723
+ So therefore it would be very important to have
2724
+ some type of evaluation metric and that is
2725
+
2726
+ 1:30:00.609 --> 1:30:01.825
2727
+ quite challenging.
2728
+
2729
+ 1:30:08.768 --> 1:30:15.550
2730
+ And thanks for listening, and we'll have the
2731
+ second part of speech translation on Thursday.
2732
+
demo_data/lectures/Lecture-18-18.07.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7158cf58687ceeb69cae55cb9786cecc77ea95e9afcc0b29251b8b9cfe54cdb5
3
+ size 125329284
demo_data/lectures/Lecture-19-21.07.2023/English.vtt ADDED
@@ -0,0 +1,2853 @@
1
+ WEBVTT
2
+
3
+ 0:00:01.121 --> 0:00:14.214
4
+ Okay, so welcome to today's lecture, on Tuesday
5
+ we started to talk about speech translation.
6
+
7
+ 0:00:14.634 --> 0:00:27.037
8
+ And hopefully you got an idea of the basic
9
+ ideas we have in speech translation, the two
10
+
11
+ 0:00:27.037 --> 0:00:29.464
12
+ major approaches.
13
+
14
+ 0:00:29.829 --> 0:00:41.459
15
+ And the other one is the end-to-end system where
16
+ we have one large system which is everything
17
+
18
+ 0:00:41.459 --> 0:00:42.796
19
+ together.
20
+
21
+ 0:00:43.643 --> 0:00:58.459
22
+ Until now we mainly focus on text output that
23
+ we'll see today, but you can extend these ideas
24
+
25
+ 0:00:58.459 --> 0:01:01.138
26
+ to other speech.
27
+
28
+ 0:01:01.441 --> 0:01:08.592
29
+ But since it's also like a machine translation
30
+ lecture, we of course mainly focus a bit on
31
+
32
+ 0:01:08.592 --> 0:01:10.768
33
+ the translation challenges.
34
+
35
+ 0:01:12.172 --> 0:01:25.045
36
+ And what is the main focus of today's lecture
37
+ is to look into why this is challenging for speech
38
+
39
+ 0:01:25.045 --> 0:01:26.845
40
+ translation.
41
+
42
+ 0:01:27.627 --> 0:01:33.901
43
+ So a bit more focus on what is now really
44
+ the difference to text translation and how we can address it.
45
+
46
+ 0:01:34.254 --> 0:01:39.703
47
+ We'll start with the segmentation
48
+ problem.
49
+
50
+ 0:01:39.703 --> 0:01:45.990
51
+ We had that already a bit, but especially
52
+ for end-to-end.
53
+
54
+ 0:01:46.386 --> 0:01:57.253
55
+ So the problem is that until now it was easy
56
+ to segment the input into sentences and then
57
+
58
+ 0:01:57.253 --> 0:02:01.842
59
+ translate each sentence individually.
60
+
61
+ 0:02:02.442 --> 0:02:17.561
62
+ When you're now translating audio, the challenge
63
+ is that you have just a sequence of audio input
64
+
65
+ 0:02:17.561 --> 0:02:20.055
66
+ and there's no given segmentation.
67
+
68
+ 0:02:21.401 --> 0:02:27.834
69
+ So you have this difference that your audio
70
+ is a continuous stream, but the text is typically
71
+
72
+ 0:02:27.834 --> 0:02:28.930
73
+ sentence based.
74
+
75
+ 0:02:28.930 --> 0:02:31.667
76
+ So how can you match this gap in there?
77
+
78
+ 0:02:31.667 --> 0:02:37.690
79
+ We'll see that is really essential, and if
80
+ you're not using a decent good system there,
81
+
82
+ 0:02:37.690 --> 0:02:41.249
83
+ then you can lose a lot of quality and performance.
84
+
85
+ 0:02:41.641 --> 0:02:44.267
86
+ That is what also meant before.
87
+
88
+ 0:02:44.267 --> 0:02:51.734
89
+ So if you have a more complex system out of
90
+ several units, it's really essential that they
91
+
92
+ 0:02:51.734 --> 0:02:56.658
93
+ all work together and it's very easy to lose
94
+ significantly.
95
+
96
+ 0:02:57.497 --> 0:03:13.029
97
+ The second challenge we'll talk about is disfluencies,
98
+ so the style of speaking is very different
99
+
100
+ 0:03:13.029 --> 0:03:14.773
101
+ from text.
102
+
103
+ 0:03:15.135 --> 0:03:24.727
104
+ So if you translate TED talks, those are normally
105
+ very good speakers.
106
+
107
+ 0:03:24.727 --> 0:03:30.149
108
+ They will give you a very fluent text.
109
+
110
+ 0:03:30.670 --> 0:03:36.692
111
+ When you want to translate a lecture, it might
112
+ be more difficult or rednested.
113
+
114
+ 0:03:37.097 --> 0:03:39.242
115
+ Mean people are not well that well.
116
+
117
+ 0:03:39.242 --> 0:03:42.281
118
+ They should be prepared in giving the lecture
119
+ and.
120
+
121
+ 0:03:42.362 --> 0:03:48.241
122
+ But it's not that I mean, typically a lecture
123
+ will have like rehearsal like five times before
124
+
125
+ 0:03:48.241 --> 0:03:52.682
126
+ he is giving this lecture, and then like will
127
+ it completely be fluent?
128
+
129
+ 0:03:52.682 --> 0:03:56.122
130
+ He might at some point notice all this is
131
+ not perfect.
132
+
133
+ 0:03:56.122 --> 0:04:00.062
134
+ I want to rephrase, and he'll have to think
135
+ during the lecture.
136
+
137
+ 0:04:00.300 --> 0:04:04.049
138
+ Might be also good that he's thinking, so
139
+ he's not going too fast and things like.
140
+
141
+ 0:04:05.305 --> 0:04:07.933
142
+ If you then go to the other extreme, it's
143
+ more meetings.
144
+
145
+ 0:04:08.208 --> 0:04:15.430
146
+ If you have a lively discussion, of course,
147
+ people will interrupt, they will restart, they
148
+
149
+ 0:04:15.430 --> 0:04:22.971
150
+ will think while they speak, and you know that
151
+ sometimes you tell people first think and speak
152
+
153
+ 0:04:22.971 --> 0:04:26.225
154
+ because they are changing their opinion.
155
+
156
+ 0:04:26.606 --> 0:04:31.346
157
+ So the question of how can you deal with this?
158
+
159
+ 0:04:31.346 --> 0:04:37.498
160
+ And there again it might be solutions for
161
+ that, or at least.
162
+
163
+ 0:04:39.759 --> 0:04:46.557
164
+ Then for the output we will look into simultaneous
165
+ translation that is at least not very important
166
+
167
+ 0:04:46.557 --> 0:04:47.175
168
+ in text.
169
+
170
+ 0:04:47.175 --> 0:04:53.699
171
+ There might be some cases but normally you
172
+ have all text available and then you're translating
173
+
174
+ 0:04:53.699 --> 0:04:54.042
175
+ and.
176
+
177
+ 0:04:54.394 --> 0:05:09.220
178
+ While for speech translation, since it's often
179
+ a life interaction, then of course it's important.
180
+
181
+ 0:05:09.149 --> 0:05:12.378
182
+ Otherwise it's hard to follow.
183
+
184
+ 0:05:12.378 --> 0:05:19.463
185
+ You see what said five minutes ago and the
186
+ slide is not as helpful.
187
+
188
+ 0:05:19.739 --> 0:05:35.627
189
+ You have to wait very long before you can
190
+ answer because you have to first wait for what
191
+
192
+ 0:05:35.627 --> 0:05:39.197
193
+ is happening there.
194
+
195
+ 0:05:40.660 --> 0:05:46.177
196
+ And finally, we can talk a bit about presentation.
197
+
198
+ 0:05:46.177 --> 0:05:54.722
199
+ For example, mentioned that if you're generating
200
+ subtitles, it's not possible.
201
+
202
+ 0:05:54.854 --> 0:06:01.110
203
+ So in professional subtitles there are clear
204
+ rules.
205
+
206
+ 0:06:01.110 --> 0:06:05.681
207
+ Subtitle has to be shown for seconds.
208
+
209
+ 0:06:05.681 --> 0:06:08.929
210
+ It's maximum of two lines.
211
+
212
+ 0:06:09.549 --> 0:06:13.156
213
+ Because otherwise it's getting too long, it's
214
+ not able to read it anymore, and so.
215
+
216
+ 0:06:13.613 --> 0:06:19.826
217
+ So if you want to achieve that, of course,
218
+ you might have to adjust and select what you
219
+
220
+ 0:06:19.826 --> 0:06:20.390
221
+ really.
222
+
223
+ 0:06:23.203 --> 0:06:28.393
224
+ The first date starts with the segmentation.
225
+
226
+ 0:06:28.393 --> 0:06:36.351
227
+ On the one end it's an issue while training,
228
+ on the other hand it's.
229
+
230
+ 0:06:38.678 --> 0:06:47.781
231
+ What is the problem so when we train it's
232
+ relatively easy to separate our data into sentence
233
+
234
+ 0:06:47.781 --> 0:06:48.466
235
+ level.
236
+
237
+ 0:06:48.808 --> 0:07:02.241
238
+ So if you have your example, you have the
239
+ audio and the text, then you typically know
240
+
241
+ 0:07:02.241 --> 0:07:07.083
242
+ that this sentence is aligned.
243
+
244
+ 0:07:07.627 --> 0:07:16.702
245
+ You can use these time information to cut
246
+ your audio and then you can train and then.
247
+
248
+ 0:07:18.018 --> 0:07:31.775
249
+ Because what we need for an end-to-end model
250
+ is to be an output chart, in this case an audio
251
+
252
+ 0:07:31.775 --> 0:07:32.822
253
+ chart.
254
+
255
+ 0:07:33.133 --> 0:07:38.551
256
+ And even if this is a long speech, it's easy
257
+ then since we have this time information to
258
+
259
+ 0:07:38.551 --> 0:07:39.159
260
+ separate.
261
+
262
+ 0:07:39.579 --> 0:07:43.866
263
+ But we are using therefore, of course, the
264
+ target side information.
265
+
266
+ 0:07:45.865 --> 0:07:47.949
267
+ The problem is now in runtime.
268
+
269
+ 0:07:47.949 --> 0:07:49.427
270
+ This is not possible.
271
+
272
+ 0:07:49.427 --> 0:07:55.341
273
+ Here we can do that based on the calculation
274
+ marks and the sentence segmentation on the
275
+
276
+ 0:07:55.341 --> 0:07:57.962
277
+ target side because that is splitting.
278
+
279
+ 0:07:57.962 --> 0:08:02.129
280
+ But during transcript, during translation
281
+ it is not possible.
282
+
283
+ 0:08:02.442 --> 0:08:10.288
284
+ Because there is just a long audio signal,
285
+ and of course if you have your test data to
286
+
287
+ 0:08:10.288 --> 0:08:15.193
288
+ split it into: That has been done for some
289
+ experience.
290
+
291
+ 0:08:15.193 --> 0:08:22.840
292
+ It's fine, but it's not a realistic scenario
293
+ because if you really apply it in real world,
294
+
295
+ 0:08:22.840 --> 0:08:25.949
296
+ we won't have a manual segmentation.
297
+
298
+ 0:08:26.266 --> 0:08:31.838
299
+ If a human has to do that then he can do the
300
+ translation so you want to have a full automatic
301
+
302
+ 0:08:31.838 --> 0:08:32.431
303
+ pipeline.
304
+
305
+ 0:08:32.993 --> 0:08:38.343
306
+ So the question is how can we deal with this
307
+ type of you know?
308
+
309
+ 0:09:09.309 --> 0:09:20.232
310
+ So the question is how can we deal with this
311
+ time of situation and how can we segment the
312
+
313
+ 0:09:20.232 --> 0:09:23.024
314
+ audio into some units?
315
+
316
+ 0:09:23.863 --> 0:09:32.495
317
+ And here is one further really big advantage
318
+ of a cascaded system: Because how is this done
319
+
320
+ 0:09:32.495 --> 0:09:34.259
321
+ in a cascade of systems?
322
+
323
+ 0:09:34.259 --> 0:09:38.494
324
+ We are splitting the audio with some features
325
+ we are doing.
326
+
327
+ 0:09:38.494 --> 0:09:42.094
328
+ We can use similar ones which we'll discuss
329
+ later.
330
+
331
+ 0:09:42.094 --> 0:09:43.929
332
+ Then we run the speech recognition.
333
+
334
+ 0:09:43.929 --> 0:09:48.799
335
+ We have the transcript and then we can do
336
+ what we talked last about.
337
+
338
+ 0:09:49.069 --> 0:10:02.260
339
+ So if you have this is an audio signal and
340
+ the training data it was good.
341
+
342
+ 0:10:02.822 --> 0:10:07.951
343
+ So here we have a big advantage.
344
+
345
+ 0:10:07.951 --> 0:10:16.809
346
+ We can use a different segmentation for the
347
+ ASR and for the MT.
348
+
349
+ 0:10:16.809 --> 0:10:21.316
350
+ Why is that a big advantage?
351
+
352
+ 0:10:23.303 --> 0:10:34.067
353
+ I would say for the MT task it is more important
354
+ because we can then do the sentence segmentation.
355
+
356
+ 0:10:34.955 --> 0:10:37.603
357
+ See and Yeah, We Can Do the Same Thing.
358
+
359
+ 0:10:37.717 --> 0:10:40.226
360
+ To save us, why is it not as important for
361
+ us?
362
+
363
+ 0:10:40.226 --> 0:10:40.814
364
+ Are maybe.
365
+
366
+ 0:10:43.363 --> 0:10:48.589
367
+ We don't need that much context.
368
+
369
+ 0:10:48.589 --> 0:11:01.099
370
+ We only try to restrict the word, but the
371
+ context to consider is mainly small.
372
+
373
+ 0:11:03.283 --> 0:11:11.419
374
+ I would agree that more context helps, but there
375
+ is one more important point:
376
+
377
+ 0:11:11.651 --> 0:11:16.764
378
+ The ASR task is monotone, so there's no reordering.
379
+
380
+ 0:11:16.764 --> 0:11:22.472
381
+ The second part of the signal is no reordering.
382
+
383
+ 0:11:22.472 --> 0:11:23.542
384
+ We have.
385
+
386
+ 0:11:23.683 --> 0:11:29.147
387
+ And of course if we are doing that we cannot
388
+ reorder across boundaries between segments.
389
+
390
+ 0:11:29.549 --> 0:11:37.491
391
+ It might be challenging if we split the words
392
+ so that it's not perfect for so that.
393
+
394
+ 0:11:37.637 --> 0:11:40.846
395
+ But we need to do quite long range reordering.
396
+
397
+ 0:11:40.846 --> 0:11:47.058
398
+ If you think about German, where the verb
399
+ has moved, and now the English verb is in one
400
+
401
+ 0:11:47.058 --> 0:11:50.198
402
+ part, but the end of the sentence is another.
403
+
404
+ 0:11:50.670 --> 0:11:59.427
405
+ And of course this advantage we have now here
406
+ that if we have a segment we have.
407
+
408
+ 0:12:01.441 --> 0:12:08.817
409
+ And that this segmentation is important.
410
+
411
+ 0:12:08.817 --> 0:12:15.294
412
+ Here are some motivations for that.
413
+
414
+ 0:12:15.675 --> 0:12:25.325
415
+ What you are doing is you are taking the reference
416
+ text and you are segmenting.
417
+
418
+ 0:12:26.326 --> 0:12:30.991
419
+ And then, of course, your segments are exactly
420
+ yeah cute.
421
+
422
+ 0:12:31.471 --> 0:12:42.980
423
+ If you're now using different segmentation
424
+ strategies, you're losing significantly in BLEU
425
+
426
+ 0:12:42.980 --> 0:12:44.004
427
+ points.
428
+
429
+ 0:12:44.004 --> 0:12:50.398
430
+ If the segmentation is bad, you have a lot
431
+ worse.
432
+
433
+ 0:12:52.312 --> 0:13:10.323
434
+ And interesting, here you ought to see how
435
+ it was a human, but people have in a competition.
436
+
437
+ 0:13:10.450 --> 0:13:22.996
438
+ You can see that by working on the segmentation
439
+ and using better segmentation you can improve
440
+
441
+ 0:13:22.996 --> 0:13:25.398
442
+ your performance.
443
+
444
+ 0:13:26.006 --> 0:13:29.932
445
+ So it's really essential.
446
+
447
+ 0:13:29.932 --> 0:13:41.712
448
+ One other interesting thing is if you're looking
449
+ into the difference between.
450
+
451
+ 0:13:42.082 --> 0:13:49.145
452
+ So it really seems to be more important to
453
+ have a good segmentation for our cascaded system.
454
+
455
+ 0:13:49.109 --> 0:13:56.248
456
+ For an end-to-end system because there you
457
+ can't re-segment while it is less important
458
+
459
+ 0:13:56.248 --> 0:13:58.157
460
+ for a cascaded system.
461
+
462
+ 0:13:58.157 --> 0:14:05.048
463
+ Of course, it's still important, but the difference
464
+ between the two segmentations.
465
+
466
+ 0:14:06.466 --> 0:14:18.391
467
+ It was a shared task some years ago like it's
468
+ just one system from different.
469
+
470
+ 0:14:22.122 --> 0:14:31.934
471
+ So the question is how can we deal with this
472
+ in speech translation and what people look
473
+
474
+ 0:14:31.934 --> 0:14:32.604
475
+ into?
476
+
477
+ 0:14:32.752 --> 0:14:48.360
478
+ Now we want to use different techniques to
479
+ split the audio signal into segments.
480
+
481
+ 0:14:48.848 --> 0:14:54.413
482
+ You have the disadvantage that you can't change
483
+ it.
484
+
485
+ 0:14:54.413 --> 0:15:00.407
486
+ Therefore, some of the quality might be more
487
+ important.
488
+
489
+ 0:15:00.660 --> 0:15:15.678
490
+ But in both cases, of course, the A's are
491
+ better if you have a good segmentation.
492
+
493
+ 0:15:17.197 --> 0:15:23.149
494
+ So any idea, how would you have this task
495
+ now split this audio?
496
+
497
+ 0:15:23.149 --> 0:15:26.219
498
+ What type of tool would you use?
499
+
500
+ 0:15:28.648 --> 0:15:41.513
501
+ You could use a neural network to segment it,
502
+ for instance supervised.
503
+
504
+ 0:15:41.962 --> 0:15:44.693
505
+ Yes, that's exactly already the better system.
506
+
507
+ 0:15:44.693 --> 0:15:50.390
508
+ So for long time people have done more simple
509
+ things because we'll come to that a bit challenging
510
+
511
+ 0:15:50.390 --> 0:15:52.250
512
+ as creating or having the data.
513
+
514
+ 0:15:53.193 --> 0:16:00.438
515
+ The first thing is you use some tool out of
516
+ the box like voice activity detection which
517
+
518
+ 0:16:00.438 --> 0:16:07.189
519
+ has been there as a whole research field so
520
+ people find when somebody's speaking.
521
+
522
+ 0:16:07.647 --> 0:16:14.952
523
+ And then you use that in this different threshold
524
+ you always have the ability that somebody's
525
+
526
+ 0:16:14.952 --> 0:16:16.273
527
+ speaking or not.
528
+
529
+ 0:16:17.217 --> 0:16:19.889
530
+ Then you split your signal.
531
+
532
+ 0:16:19.889 --> 0:16:26.762
533
+ It will not be perfect, but you transcribe
534
+ or translate each component.
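As a rough illustration of the out-of-the-box option, a very naive energy-based voice activity detection could look like this; real VAD tools use trained models, and the frame size and threshold here are made-up values.

    import numpy as np

    def voice_activity(samples: np.ndarray, sr: int = 16000,
                       frame_ms: int = 30, threshold: float = 0.02) -> np.ndarray:
        # One speech / non-speech flag per frame, based only on RMS energy.
        frame_len = sr * frame_ms // 1000
        n_frames = len(samples) // frame_len
        frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
        energy = np.sqrt((frames ** 2).mean(axis=1))
        return energy > threshold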
535
+
536
+ 0:16:28.508 --> 0:16:39.337
537
+ But as you see, a supervised classification
538
+ task is even better, and that is now the most
539
+
540
+ 0:16:39.337 --> 0:16:40.781
541
+ common use.
542
+
543
+ 0:16:41.441 --> 0:16:49.909
544
+ The supervisor is doing that as a supervisor
545
+ classification and then you'll try to use this
546
+
547
+ 0:16:49.909 --> 0:16:50.462
548
+ type.
549
+
550
+ 0:16:50.810 --> 0:16:53.217
551
+ We're going into a bit more detail on how
552
+ to do that.
553
+
554
+ 0:16:53.633 --> 0:17:01.354
555
+ So what you need to do first is, of course,
556
+ you have to have some labels whether this is
557
+
558
+ 0:17:01.354 --> 0:17:03.089
559
+ an end of sentence.
560
+
561
+ 0:17:03.363 --> 0:17:10.588
562
+ You do that by using the alignment between
563
+ the segments and the audio.
564
+
565
+ 0:17:10.588 --> 0:17:12.013
566
+ You have the.
567
+
568
+ 0:17:12.212 --> 0:17:15.365
569
+ The two people have not for each word, so
570
+ these tank steps.
571
+
572
+ 0:17:15.365 --> 0:17:16.889
573
+ This word is said this time.
574
+
575
+ 0:17:17.157 --> 0:17:27.935
576
+ This word is said by what you typically have
577
+ from this time to time to time.
578
+
579
+ 0:17:27.935 --> 0:17:34.654
580
+ We have the second segment, the second segment.
581
+
582
+ 0:17:35.195 --> 0:17:39.051
583
+ Which also used to trade for example your
584
+ advanced system and everything.
585
+
586
+ 0:17:41.661 --> 0:17:53.715
587
+ Based on that you can label each frame in
588
+ there so if you have a green or blue that is
589
+
590
+ 0:17:53.715 --> 0:17:57.455
591
+ our speech segment so you.
592
+
593
+ 0:17:58.618 --> 0:18:05.690
594
+ And these labels will then later help you,
595
+ but you extract exactly these types of.
596
+
597
+ 0:18:07.067 --> 0:18:08.917
598
+ There's one big challenge.
599
+
600
+ 0:18:08.917 --> 0:18:15.152
601
+ If you have two sentences which are directly
602
+ connected to each other, then if you're doing
603
+
604
+ 0:18:15.152 --> 0:18:18.715
605
+ this labeling, you would not have a break in
606
+ later.
607
+
608
+ 0:18:18.715 --> 0:18:23.512
609
+ If you tried to extract that, there should
610
+ be something great or not.
611
+
612
+ 0:18:23.943 --> 0:18:31.955
613
+ So what you typically do is in the last frame.
614
+
615
+ 0:18:31.955 --> 0:18:41.331
616
+ You mark as outside, although it's not really
617
+ outside.
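A small sketch of turning time-aligned sentences into per-frame training labels, including the trick of marking the last frame of each segment as outside so that adjacent sentences still produce a break. Frame length and input format are assumptions.

    def frame_labels(segments: list[tuple[float, float]],   # aligned (start, end) times in seconds
                     total_sec: float, frame_ms: int = 10) -> list[int]:
        n = int(total_sec * 1000 / frame_ms)
        labels = [0] * n                       # 0 = outside speech, 1 = inside speech
        for start, end in segments:
            a = int(start * 1000 / frame_ms)
            b = min(int(end * 1000 / frame_ms), n)
            for i in range(a, b):
                labels[i] = 1
            if a < b:
                labels[b - 1] = 0              # artificial boundary frame
        return labels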
618
+
619
+ 0:18:43.463 --> 0:18:46.882
620
+ Yes, I guess you could also do that in more
621
+ of a below check.
622
+
623
+ 0:18:46.882 --> 0:18:48.702
624
+ I mean, this is the most simple.
625
+
626
+ 0:18:48.702 --> 0:18:51.514
627
+ It's like inside outside, so it's related
628
+ to that.
629
+
630
+ 0:18:51.514 --> 0:18:54.988
631
+ Of course, you could have an extra startup
632
+ segment, and so on.
633
+
634
+ 0:18:54.988 --> 0:18:57.469
635
+ I guess this is just to make it more simple.
636
+
637
+ 0:18:57.469 --> 0:19:00.226
638
+ You only have two labels, not a street classroom.
639
+
640
+ 0:19:00.226 --> 0:19:02.377
641
+ But yeah, you could do similar things.
642
+
643
+ 0:19:12.432 --> 0:19:20.460
644
+ Has caused down the roads to problems because
645
+ it could be an important part of a segment
646
+
647
+ 0:19:20.460 --> 0:19:24.429
648
+ which has some meaning and we do something.
649
+
650
+ 0:19:24.429 --> 0:19:28.398
651
+ The good thing is frames are normally very.
652
+
653
+ 0:19:28.688 --> 0:19:37.586
654
+ Like some milliseconds, so normally if you
655
+ remove some milliseconds you can still understand
656
+
657
+ 0:19:37.586 --> 0:19:38.734
658
+ everything.
659
+
660
+ 0:19:38.918 --> 0:19:46.999
661
+ Mean the speech signal is very repetitive,
662
+ and so you have information a lot of times.
663
+
664
+ 0:19:47.387 --> 0:19:50.730
665
+ That's why we talked along there last time
666
+ they could try to shrink the steak and.
667
+
668
+ 0:19:51.031 --> 0:20:00.995
669
+ If you now have a short sequence where there
670
+ is like which would be removed and that's not
671
+
672
+ 0:20:00.995 --> 0:20:01.871
673
+ really.
674
+
675
+ 0:20:02.162 --> 0:20:06.585
676
+ Yeah, but it's not a full letter is missing.
677
+
678
+ 0:20:06.585 --> 0:20:11.009
679
+ It's like only the last ending of the vocal.
680
+
681
+ 0:20:11.751 --> 0:20:15.369
682
+ Think it doesn't really happen.
683
+
684
+ 0:20:15.369 --> 0:20:23.056
685
+ We have our audio signal and we have these
686
+ gags that are not above.
687
+
688
+ 0:20:23.883 --> 0:20:29.288
689
+ With this blue rectangulars the inside speech
690
+ segment and with the guess it's all set yes.
691
+
692
+ 0:20:29.669 --> 0:20:35.736
693
+ So then you have the full signal and you're
694
+ meaning now labeling your task as a blue or
695
+
696
+ 0:20:35.736 --> 0:20:36.977
697
+ white prediction.
698
+
699
+ 0:20:36.977 --> 0:20:39.252
700
+ So that is your prediction task.
701
+
702
+ 0:20:39.252 --> 0:20:44.973
703
+ You have the audio signal only and your prediction
704
+ task is like label one or zero.
705
+
706
+ 0:20:45.305 --> 0:20:55.585
707
+ Once you do that then based on this labeling
708
+ you can extract each segment again like each
709
+
710
+ 0:20:55.585 --> 0:20:58.212
711
+ consecutive blue area.
712
+
713
+ 0:20:58.798 --> 0:21:05.198
714
+ See then removed maybe the non-speaking part
715
+ already and duo speech translation only on
716
+
717
+ 0:21:05.198 --> 0:21:05.998
718
+ the parts.
719
+
720
+ 0:21:06.786 --> 0:21:19.768
721
+ Which is good because the training would have
722
+ done similarly.
723
+
724
+ 0:21:20.120 --> 0:21:26.842
725
+ So on the noise in between you never saw in
726
+ the training, so it's good to throw it away.
727
+
728
+ 0:21:29.649 --> 0:21:34.930
729
+ One challenge, of course, is now if you're
730
+ doing that, what is your input?
731
+
732
+ 0:21:34.930 --> 0:21:40.704
733
+ You cannot do the sequence labeling normally
734
+ on the whole talk, so it's too long.
735
+
736
+ 0:21:40.704 --> 0:21:46.759
737
+ So if you're doing this prediction of the
738
+ label, you also have a window for which you
739
+
740
+ 0:21:46.759 --> 0:21:48.238
741
+ do the segmentation.
742
+
743
+ 0:21:48.788 --> 0:21:54.515
744
+ And that's the bedline we have in the punctuation
745
+ prediction.
746
+
747
+ 0:21:54.515 --> 0:22:00.426
748
+ If we don't have good borders, random splits
749
+ are normally good.
750
+
751
+ 0:22:00.426 --> 0:22:03.936
752
+ So what we do now is split the audio.
753
+
754
+ 0:22:04.344 --> 0:22:09.134
755
+ So that would be our input, and then the part
756
+ three would be our labels.
757
+
758
+ 0:22:09.269 --> 0:22:15.606
759
+ This green would be the input and here we
760
+ want, for example, blue labels and then white.
761
+
762
+ 0:22:16.036 --> 0:22:20.360
763
+ Here only do labors and here at the beginning
764
+ why maybe at the end why.
765
+
766
+ 0:22:21.401 --> 0:22:28.924
767
+ So thereby you have now a fixed window always
768
+ for which you're doing than this task of predicting.
769
+
770
+ 0:22:33.954 --> 0:22:43.914
771
+ How you build your classifier that is based
772
+ again.
773
+
774
+ 0:22:43.914 --> 0:22:52.507
775
+ We had this wave to be mentioned last week.
776
+
777
+ 0:22:52.752 --> 0:23:00.599
778
+ So in training you use labels to say whether
779
+ it's in speech or outside speech.
780
+
781
+ 0:23:01.681 --> 0:23:17.740
782
+ Inference: You give them always the chance
783
+ and then predict whether this part like each
784
+
785
+ 0:23:17.740 --> 0:23:20.843
786
+ label is afraid.
787
+
788
+ 0:23:23.143 --> 0:23:29.511
789
+ Bit more complicated, so one challenge is
790
+ if you randomly split off cognition, losing
791
+
792
+ 0:23:29.511 --> 0:23:32.028
793
+ your context for the first brain.
794
+
795
+ 0:23:32.028 --> 0:23:38.692
796
+ It might be very hard to predict whether this
797
+ is now in or out of, and also for the last.
798
+
799
+ 0:23:39.980 --> 0:23:48.449
800
+ You often need a bit of context whether this
801
+ is audio or not, and at the beginning.
802
+
803
+ 0:23:49.249 --> 0:23:59.563
804
+ So what you do is you put the audio in twice.
805
+
806
+ 0:23:59.563 --> 0:24:08.532
807
+ You want to do it with splits and then.
808
+
809
+ 0:24:08.788 --> 0:24:15.996
810
+ It is shown you have shifted the two offsets,
811
+ so one is predicted with the other offset.
812
+
813
+ 0:24:16.416 --> 0:24:23.647
814
+ And then averaging the probabilities so that
815
+ at each time you have, at least for one of
816
+
817
+ 0:24:23.647 --> 0:24:25.127
818
+ the predictions,.
819
+
820
+ 0:24:25.265 --> 0:24:36.326
821
+ Because at the end of the second it might
822
+ be very hard to predict whether this is now
823
+
824
+ 0:24:36.326 --> 0:24:39.027
825
+ speech or nonspeech.
826
+
827
+ 0:24:39.939 --> 0:24:47.956
828
+ Think it is a high parameter, but you are
829
+ not optimizing it, so you just take two shifts.
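A rough sketch of this windowed prediction with two offsets; `model` is assumed to return one speech probability per frame of the chunk it is given, and the window size is illustrative.

    import numpy as np

    def speech_probabilities(features: np.ndarray, model, window: int = 2000) -> np.ndarray:
        # Two passes: one starting at frame 0, one shifted by half a window,
        # then averaged, so every frame is predicted at least once away from a border.
        n = len(features)
        passes = np.zeros((2, n))
        for k, offset in enumerate((0, window // 2)):
            edges = ([0] if offset else []) + list(range(offset, n, window)) + [n]
            for start, end in zip(edges[:-1], edges[1:]):
                passes[k, start:end] = model(features[start:end])
        return passes.mean(axis=0)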
830
+
831
+ 0:24:48.328 --> 0:24:54.636
832
+ Of course try a lot of different shifts and
833
+ so on.
834
+
835
+ 0:24:54.636 --> 0:24:59.707
836
+ The thing is it's mainly a problem here.
837
+
838
+ 0:24:59.707 --> 0:25:04.407
839
+ If you don't do two outsets you have.
840
+
841
+ 0:25:05.105 --> 0:25:14.761
842
+ You could get better by doing that, but would
843
+ be skeptical if it really matters, and also
844
+
845
+ 0:25:14.761 --> 0:25:18.946
846
+ have not seen any experience in doing.
847
+
848
+ 0:25:19.159 --> 0:25:27.629
849
+ Guess you're already good, you have maybe
850
+ some arrows in there and you're getting.
851
+
852
+ 0:25:31.191 --> 0:25:37.824
853
+ So with this you have your segmentation.
854
+
855
+ 0:25:37.824 --> 0:25:44.296
856
+ However, there is a problem in between.
857
+
858
+ 0:25:44.296 --> 0:25:49.150
859
+ Once the model is wrong then.
860
+
861
+ 0:25:49.789 --> 0:26:01.755
862
+ The normal thing would be the first thing
863
+ that you take some threshold and that you always
864
+
865
+ 0:26:01.755 --> 0:26:05.436
866
+ label everything in speech.
867
+
868
+ 0:26:06.006 --> 0:26:19.368
869
+ The problem is when you are just doing this
870
+ one threshold that you might have.
871
+
872
+ 0:26:19.339 --> 0:26:23.954
873
+ Those are the challenges.
874
+
875
+ 0:26:23.954 --> 0:26:31.232
876
+ Short segments mean you have no context.
877
+
878
+ 0:26:31.232 --> 0:26:35.492
879
+ The policy will be bad.
880
+
881
+ 0:26:37.077 --> 0:26:48.954
882
+ Therefore, people use this probabilistic divided
883
+ cocker algorithm, so the main idea is start
884
+
885
+ 0:26:48.954 --> 0:26:56.744
886
+ with the whole segment, and now you split the
887
+ whole segment.
888
+
889
+ 0:26:57.397 --> 0:27:09.842
890
+ Then you split there and then you continue
891
+ until each segment is smaller than the maximum
892
+
893
+ 0:27:09.842 --> 0:27:10.949
894
+ length.
895
+
896
+ 0:27:11.431 --> 0:27:23.161
897
+ But you can ignore some splits, and if you
898
+ split one segment into two parts you first
899
+
900
+ 0:27:23.161 --> 0:27:23.980
901
+ trim.
902
+
903
+ 0:27:24.064 --> 0:27:40.197
904
+ So normally it's not only one signal position,
905
+ it's a longer area of non-voice, so you try
906
+
907
+ 0:27:40.197 --> 0:27:43.921
908
+ to find this longer.
909
+
910
+ 0:27:43.943 --> 0:27:51.403
911
+ Now your large segment is split into two smaller
912
+ segments.
913
+
914
+ 0:27:51.403 --> 0:27:56.082
915
+ Now you are checking these segments.
916
+
917
+ 0:27:56.296 --> 0:28:04.683
918
+ So if they are very, very short, it might
919
+ be good not to spin at this point because you're
920
+
921
+ 0:28:04.683 --> 0:28:05.697
922
+ ending up.
923
+
924
+ 0:28:06.006 --> 0:28:09.631
925
+ And this way you continue all the time, and
926
+ then hopefully you'll have a good stretch.
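A minimal sketch of this divide-and-conquer splitting on top of the predicted speech probabilities; the maximum and minimum segment lengths are made-up thresholds.

    def split_segment(speech_probs, start: int, end: int,
                      max_len: int = 2000, min_len: int = 200) -> list[tuple[int, int]]:
        # Recursively cut at the least speech-like frame until every piece is
        # short enough, never creating pieces shorter than min_len.
        if end - start <= max_len or end - start <= 2 * min_len:
            return [(start, end)]
        candidates = range(start + min_len, end - min_len)
        cut = min(candidates, key=lambda i: speech_probs[i])
        return (split_segment(speech_probs, start, cut, max_len, min_len)
                + split_segment(speech_probs, cut, end, max_len, min_len))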
927
+
928
+ 0:28:10.090 --> 0:28:19.225
929
+ So, of course, there's one challenge with
930
+ this approach: if you think about it later,
931
+
932
+ 0:28:19.225 --> 0:28:20.606
933
+ low latency.
934
+
935
+ 0:28:25.405 --> 0:28:31.555
936
+ So in this case you have to have the full
937
+ audio available.
938
+
939
+ 0:28:32.132 --> 0:28:38.112
940
+ So you cannot continuously do that mean if
941
+ you would do it just always.
942
+
943
+ 0:28:38.112 --> 0:28:45.588
944
+ If the probability is higher you split but
945
+ in this case you try to find a global optimal.
946
+
947
+ 0:28:46.706 --> 0:28:49.134
948
+ A heuristic body.
949
+
950
+ 0:28:49.134 --> 0:28:58.170
951
+ You find a global solution for your whole
952
+ tar and not a local one.
953
+
954
+ 0:28:58.170 --> 0:29:02.216
955
+ Where's the system most sure?
956
+
957
+ 0:29:02.802 --> 0:29:12.467
958
+ So that's a bit of a challenge here, but the
959
+ advantage of course is that in the end you
960
+
961
+ 0:29:12.467 --> 0:29:14.444
962
+ have no segments.
963
+
964
+ 0:29:17.817 --> 0:29:23.716
965
+ Any more questions like this.
966
+
967
+ 0:29:23.716 --> 0:29:36.693
968
+ Then the next thing is we also need to evaluate
969
+ in this scenario.
970
+
971
+ 0:29:37.097 --> 0:29:44.349
972
+ So know machine translation is quite a long
973
+ way.
974
+
975
+ 0:29:44.349 --> 0:29:55.303
976
+ History now was the beginning of the semester,
977
+ but hope you can remember.
978
+
979
+ 0:29:55.675 --> 0:30:09.214
980
+ Might be with blue score, might be with comment
981
+ or similar, but you need to have.
982
+
983
+ 0:30:10.310 --> 0:30:22.335
984
+ But this assumes that you have this one-to-one
985
+ match, so you always have an output and machine
986
+
987
+ 0:30:22.335 --> 0:30:26.132
988
+ translation, which is nicely.
989
+
990
+ 0:30:26.506 --> 0:30:34.845
991
+ So then it might be that our output has four
992
+ segments, while our reference output has only
993
+
994
+ 0:30:34.845 --> 0:30:35.487
995
+ three.
996
+
997
+ 0:30:36.756 --> 0:30:40.649
998
+ And now is, of course, questionable like what
999
+ should we compare in our metric.
1000
+
1001
+ 0:30:44.704 --> 0:30:53.087
1002
+ So it's no longer directly possible to directly
1003
+ do that because what should you compare?
1004
+
1005
+ 0:30:53.413 --> 0:31:00.214
1006
+ Just have four segments there and three segments
1007
+ there, and of course it seems to be that.
1008
+
1009
+ 0:31:00.920 --> 0:31:06.373
1010
+ The first one it likes to the first one when
1011
+ you see I can't speak Spanish, but you're an
1012
+
1013
+ 0:31:06.373 --> 0:31:09.099
1014
+ audience of the guests who is already there.
1015
+
1016
+ 0:31:09.099 --> 0:31:14.491
1017
+ So even like just a woman, the blue comparing
1018
+ wouldn't work, so you need to do something
1019
+
1020
+ 0:31:14.491 --> 0:31:17.157
1021
+ about that to take this type of evaluation.
1022
+
1023
+ 0:31:19.019 --> 0:31:21.727
1024
+ Still any suggestions what you could do.
1025
+
1026
+ 0:31:25.925 --> 0:31:44.702
1027
+ How can you calculate a blue score because
1028
+ you don't have one you want to see?
1029
+
1030
+ 0:31:45.925 --> 0:31:49.365
1031
+ Here you put another layer which spies to
1032
+ add in the second.
1033
+
1034
+ 0:31:51.491 --> 0:31:56.979
1035
+ It's even not aligning only, but that's one
1036
+ solution, so you need to align and resign.
1037
+
1038
+ 0:31:57.177 --> 0:32:06.886
1039
+ Because even if you have no alignment so this
1040
+ to this and this to that you see that it's
1041
+
1042
+ 0:32:06.886 --> 0:32:12.341
1043
+ not good because the audio would compare to
1044
+ that.
1045
+
1046
+ 0:32:13.453 --> 0:32:16.967
1047
+ That we'll discuss is even one simpler solution.
1048
+
1049
+ 0:32:16.967 --> 0:32:19.119
1050
+ Yes, it's a simpler solution.
1051
+
1052
+ 0:32:19.119 --> 0:32:23.135
1053
+ It's called document based blue or something
1054
+ like that.
1055
+
1056
+ 0:32:23.135 --> 0:32:25.717
1057
+ So you just take the full document.
1058
+
1059
+ 0:32:26.566 --> 0:32:32.630
1060
+ For some matrix it's good and it's not clear
1061
+ how good it is to the other, but there might
1062
+
1063
+ 0:32:32.630 --> 0:32:32.900
1064
+ be.
1065
+
1066
+ 0:32:33.393 --> 0:32:36.454
1067
+ Think of more simple metrics like blue.
1068
+
1069
+ 0:32:36.454 --> 0:32:40.356
1070
+ Do you have any idea what could be a disadvantage?
1071
+
1072
+ 0:32:49.249 --> 0:32:56.616
1073
+ Blue is matching ingrams so you start with
1074
+ the original.
1075
+
1076
+ 0:32:56.616 --> 0:33:01.270
1077
+ You check how many ingrams in here.
1078
+
1079
+ 0:33:01.901 --> 0:33:11.233
1080
+ If you're not doing that on the full document,
1081
+ you can also match grams from year to year.
1082
+
1083
+ 0:33:11.751 --> 0:33:15.680
1084
+ So you can match things very far away.
1085
+
1086
+ 0:33:15.680 --> 0:33:21.321
1087
+ Start doing translation and you just randomly
1088
+ randomly.
1089
+
1090
+ 0:33:22.142 --> 0:33:27.938
1091
+ And that, of course, could be a bit of a disadvantage
1092
+ or like is a problem, and therefore people
1093
+
1094
+ 0:33:27.938 --> 0:33:29.910
1095
+ also look into the segmentation.
1096
+
1097
+ 0:33:29.910 --> 0:33:34.690
1098
+ But I've recently seen some things, so document
1099
+ levels tours are also normally.
1100
+
1101
+ 0:33:34.690 --> 0:33:39.949
1102
+ If you have a relatively high quality system
1103
+ or state of the art, then they also have a
1104
+
1105
+ 0:33:39.949 --> 0:33:41.801
1106
+ good correlation of the human.
1107
+
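As a hedged illustration of the document-level idea above: concatenate all hypothesis segments and all reference segments into one document each and score that once. The use of the sacrebleu package and the toy segment strings are assumptions for this sketch, not something prescribed in the lecture.

```python
# Sketch: document-level BLEU when hypothesis and reference segmentations differ.
# Assumes the `sacrebleu` package; the example strings are invented.
import sacrebleu

hyp_segments = ["hello , how are you", "i am fine thank you"]        # 2 MT segments
ref_segments = ["hello", "how are you ?", "i am fine , thank you"]   # 3 reference segments

# Segment-level scoring is ill-defined here (2 vs. 3 segments),
# so collapse each side into a single document and score once.
hyp_doc = " ".join(hyp_segments)
ref_doc = " ".join(ref_segments)

score = sacrebleu.corpus_bleu([hyp_doc], [[ref_doc]])
print(f"document-level BLEU: {score.score:.2f}")
```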
1108
+ 0:33:46.546 --> 0:33:59.241
1109
+ So how are we doing that so we are putting
1110
+ end of sentence boundaries in there and then.
1111
+
1112
+ 0:33:59.179 --> 0:34:07.486
1113
+ Alignment based on a Levenshtein distance,
1114
+ so an edit distance between our output and the
1115
+
1116
+ 0:34:07.486 --> 0:34:09.077
1117
+ reference output.
1118
+
1119
+ 0:34:09.449 --> 0:34:13.061
1120
+ And here is our boundary.
1121
+
1122
+ 0:34:13.061 --> 0:34:23.482
1123
+ We map the boundary based on the alignment,
1124
+ so in the Levenshtein alignment you only have.
1125
+
1126
+ 0:34:23.803 --> 0:34:36.036
1127
+ And then, like all the words that are before,
1128
+ it might be since there is not a random.
1129
+
1130
+ 0:34:36.336 --> 0:34:44.890
1131
+ Mean it should be, but it can happen things
1132
+ like that, and it's not clear where.
1133
+
1134
+ 0:34:44.965 --> 0:34:49.727
1135
+ At the break, however, they are typically
1136
+ not that bad because they are words which are
1137
+
1138
+ 0:34:49.727 --> 0:34:52.270
1139
+ not matching between reference and hypothesis.
1140
+
1141
+ 0:34:52.270 --> 0:34:56.870
1142
+ So normally it doesn't really matter that
1143
+ much because they are anyway not matching.
1144
+
1145
+ 0:34:57.657 --> 0:35:05.888
1146
+ And then you take the resegmented MT output and
1147
+ use that to calculate your metric.
1148
+
1149
+ 0:35:05.888 --> 0:35:12.575
1150
+ Then it's again a perfect alignment for which
1151
+ you can calculate.
1152
+
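A rough sketch of the resegmentation procedure described above, in the spirit of mwerSegmenter-style tools (this is not that tool): align the concatenated hypothesis words to the concatenated reference words with a word-level Levenshtein alignment, then project the reference sentence boundaries onto the hypothesis. All example data and function names are made up.

```python
# Sketch: re-segment an MT hypothesis to match the reference segmentation
# via a word-level Levenshtein alignment (made-up example data).

def align(hyp, ref):
    """Return, for every reference prefix length, the aligned hypothesis prefix length."""
    n, m = len(hyp), len(ref)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # delete a hypothesis word
                             dist[i][j - 1] + 1,          # insert a reference word
                             dist[i - 1][j - 1] + cost)   # match / substitute
    # Backtrace: remember which hypothesis prefix each reference prefix aligns to.
    i, j, ref2hyp = n, m, [0] * (m + 1)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if hyp[i - 1] == ref[j - 1] else 1):
            ref2hyp[j] = i; i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            ref2hyp[j] = i; j -= 1
    return ref2hyp

def resegment(hyp_words, ref_sentences):
    ref_words = [w for s in ref_sentences for w in s]
    ref2hyp = align(hyp_words, ref_words)
    out, start, pos = [], 0, 0
    for sent in ref_sentences:          # project each reference boundary onto the hypothesis
        pos += len(sent)
        end = ref2hyp[pos]
        out.append(" ".join(hyp_words[start:end]))
        start = end
    return out

ref_sents = [["i", "register", "for", "the", "conference"], ["it", "starts", "monday"]]
hyp = "i sign up for the conference it starts on monday".split()
print(resegment(hyp, ref_sents))
```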
1153
+ 0:35:14.714 --> 0:35:19.229
1154
+ Any idea you could do it the other way around.
1155
+
1156
+ 0:35:19.229 --> 0:35:23.359
1157
+ You could resegment your reference to the MT output.
1158
+
1159
+ 0:35:29.309 --> 0:35:30.368
1160
+ Which one would you select?
1161
+
1162
+ 0:35:34.214 --> 0:35:43.979
1163
+ I think segmenting the system output is much
1164
+ more natural because the reference sentence
1165
+
1166
+ 0:35:43.979 --> 0:35:46.474
1167
+ is the fixed solution.
1168
+
1169
+ 0:35:47.007 --> 0:35:52.947
1170
+ Yes, that's the right motivation if you do
1171
+ think about BLEU or so.
1172
+
1173
+ 0:35:52.947 --> 0:35:57.646
1174
+ Additionally important if you change your
1175
+ reference.
1176
+
1177
+ 0:35:57.857 --> 0:36:07.175
1178
+ You might have a different number of bigrams
1179
+ or trigrams because the sentences are different
1180
+
1181
+ 0:36:07.175 --> 0:36:08.067
1182
+ lengths.
1183
+
1184
+ 0:36:08.068 --> 0:36:15.347
1185
+ Here, with your five systems, you're always comparing
1186
+ them to the same reference, and you don't compare
1187
+
1188
+ 0:36:15.347 --> 0:36:16.455
1189
+ to different.
1190
+
1191
+ 0:36:16.736 --> 0:36:22.317
1192
+ They only differ by the segmentation, but
1193
+ still it could make some difference.
1194
+
1195
+ 0:36:25.645 --> 0:36:38.974
1196
+ Good, that's all about sentence segmentation,
1197
+ then a bit about disfluencies and what there
1198
+
1199
+ 0:36:38.974 --> 0:36:40.146
1200
+ really.
1201
+
1202
+ 0:36:42.182 --> 0:36:51.138
1203
+ So as said in daily life, you're not speaking
1204
+ like very nice full sentences every.
1205
+
1206
+ 0:36:51.471 --> 0:36:53.420
1207
+ We are speaking partial sentences.
1208
+
1209
+ 0:36:53.420 --> 0:36:54.448
1210
+ We do repetitions.
1211
+
1212
+ 0:36:54.834 --> 0:37:00.915
1213
+ It's especially if it's more interactive,
1214
+ so in meetings, phone calls and so on.
1215
+
1216
+ 0:37:00.915 --> 0:37:04.519
1217
+ If you have multiple speakers, they also break.
1218
+
1219
+ 0:37:04.724 --> 0:37:16.651
1220
+ Each other, and then if you keep them, they
1221
+ are harder to translate because most of your
1222
+
1223
+ 0:37:16.651 --> 0:37:17.991
1224
+ training.
1225
+
1226
+ 0:37:18.278 --> 0:37:30.449
1227
+ It's also very difficult to read, so we'll
1228
+ have some examples there to transcribe everything
1229
+
1230
+ 0:37:30.449 --> 0:37:32.543
1231
+ as it was said.
1232
+
1233
+ 0:37:33.473 --> 0:37:36.555
1234
+ What type of things are there?
1235
+
1236
+ 0:37:37.717 --> 0:37:42.942
1237
+ So you have all these filler words.
1238
+
1239
+ 0:37:42.942 --> 0:37:47.442
1240
+ These are very easy to remove.
1241
+
1242
+ 0:37:47.442 --> 0:37:52.957
1243
+ You can just use regular expressions.
1244
+
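A minimal sketch of that regular-expression step; the filler list is only an example and would differ per language.

```python
# Sketch: removing simple filler words ("uh", "uhm", ...) with a regular expression.
import re

FILLERS = r"\b(?:uh+|uhm+|um+|er+|ah+)\b"

def remove_fillers(text: str) -> str:
    text = re.sub(FILLERS, "", text, flags=re.IGNORECASE)
    return re.sub(r"\s{2,}", " ", text).strip()   # tidy up left-over spaces

print(remove_fillers("I uh would like uhm a ticket to Houston"))
# -> "I would like a ticket to Houston"
```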
1245
+ 0:37:53.433 --> 0:38:00.139
1246
+ Is getting more difficult with some other
1247
+ type of filler words.
1248
+
1249
+ 0:38:00.139 --> 0:38:03.387
1250
+ In German you have this or in.
1251
+
1252
+ 0:38:04.024 --> 0:38:08.473
1253
+ And these ones you cannot just remove by regular
1254
+ expression.
1255
+
1256
+ 0:38:08.473 --> 0:38:15.039
1257
+ You shouldn't remove every 'ja' from a text
1258
+ because it might be very important information
1259
+
1260
+ 0:38:15.039 --> 0:38:15.768
1261
+ for well.
1262
+
1263
+ 0:38:15.715 --> 0:38:19.995
1264
+ It may be not as important as you are, but
1265
+ still it might be very important.
1266
+
1267
+ 0:38:20.300 --> 0:38:24.215
1268
+ So just removing them is there already more
1269
+ difficult.
1270
+
1271
+ 0:38:26.586 --> 0:38:29.162
1272
+ Then you have these repetitions.
1273
+
1274
+ 0:38:29.162 --> 0:38:32.596
1275
+ You have something like mean saw him there.
1276
+
1277
+ 0:38:32.596 --> 0:38:33.611
1278
+ There was a.
1279
+
1280
+ 0:38:34.334 --> 0:38:41.001
1281
+ And while for the first one that might be
1282
+ very easy to remove because you just look for
1283
+
1284
+ 0:38:41.001 --> 0:38:47.821
1285
+ double, the thing is that the repetition might
1286
+ not be exactly the same, so there is there
1287
+
1288
+ 0:38:47.821 --> 0:38:48.199
1289
+ was.
1290
+
1291
+ 0:38:48.199 --> 0:38:54.109
1292
+ So there is already getting a bit more complicated,
1293
+ of course still possible.
1294
+
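A small sketch of the easy half of this: collapsing exact repeated word sequences. As said above, near-repetitions like "there is, there was" would already need fuzzier matching than this. The example data is invented.

```python
# Sketch: collapse exact repeated word sequences ("i saw i saw him") of length <= 3.
def collapse_repetitions(words, max_len=3):
    out, i = [], 0
    while i < len(words):
        skipped = False
        for n in range(max_len, 0, -1):
            if words[i:i + n] and words[i:i + n] == words[i + n:i + 2 * n]:
                i += n            # drop the first copy, keep scanning at the second
                skipped = True
                break
        if not skipped:
            out.append(words[i])
            i += 1
    return out

print(" ".join(collapse_repetitions("i saw i saw him there there was a".split())))
# -> "i saw him there was a"
```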
1295
+ 0:38:54.614 --> 0:39:01.929
1296
+ You can remove 'Denver' so the real sentence would
1297
+ be 'I'd like to have a ticket to Houston.'
1298
+
1299
+ 0:39:02.882 --> 0:39:13.327
1300
+ But there the detection, of course, is getting
1301
+ more challenging as you want to get rid of.
1302
+
1303
+ 0:39:13.893 --> 0:39:21.699
1304
+ You don't have the data, of course, which
1305
+ makes all the tasks harder, but you probably
1306
+
1307
+ 0:39:21.699 --> 0:39:22.507
1308
+ want to.
1309
+
1310
+ 0:39:22.507 --> 0:39:24.840
1311
+ That's really meaningful.
1312
+
1313
+ 0:39:24.840 --> 0:39:26.185
1314
+ Current isn't.
1315
+
1316
+ 0:39:26.185 --> 0:39:31.120
1317
+ That is now a really good point and it's really
1318
+ there.
1319
+
1320
+ 0:39:31.051 --> 0:39:34.785
1321
+ The thing about what is your final task?
1322
+
1323
+ 0:39:35.155 --> 0:39:45.526
1324
+ If you want to have a transcript reading it,
1325
+ I'm not sure if we have another example.
1326
+
1327
+ 0:39:45.845 --> 0:39:54.171
1328
+ So there it's nicer if you have a clean transfer
1329
+ and if you see subtitles in, they're also not
1330
+
1331
+ 0:39:54.171 --> 0:39:56.625
1332
+ having all the repetitions.
1333
+
1334
+ 0:39:56.625 --> 0:40:03.811
1335
+ It's the nice way to shorten but also getting
1336
+ the structure you cannot even make.
1337
+
1338
+ 0:40:04.064 --> 0:40:11.407
1339
+ So in this situation, of course, they might
1340
+ give you information.
1341
+
1342
+ 0:40:11.407 --> 0:40:14.745
1343
+ There is a lot of stuttering.
1344
+
1345
+ 0:40:15.015 --> 0:40:22.835
1346
+ So in this case agree it might be helpful
1347
+ in some way, but meaning reading all the disfluencies
1348
+
1349
+ 0:40:22.835 --> 0:40:25.198
1350
+ is getting really difficult.
1351
+
1352
+ 0:40:25.198 --> 0:40:28.049
1353
+ If you have the next one, we have.
1354
+
1355
+ 0:40:28.308 --> 0:40:31.630
1356
+ That's a very long text.
1357
+
1358
+ 0:40:31.630 --> 0:40:35.883
1359
+ You need a bit of time to pass.
1360
+
1361
+ 0:40:35.883 --> 0:40:39.472
1362
+ This one is not important.
1363
+
1364
+ 0:40:40.480 --> 0:40:48.461
1365
+ It might be nice if you can start reading
1366
+ from here.
1367
+
1368
+ 0:40:48.461 --> 0:40:52.074
1369
+ Let's have a look here.
1370
+
1371
+ 0:40:52.074 --> 0:40:54.785
1372
+ Try to read this.
1373
+
1374
+ 0:40:57.297 --> 0:41:02.725
1375
+ You can understand it, but think you need
1376
+ a bit of time to really understand what was.
1377
+
1378
+ 0:41:11.711 --> 0:41:21.480
1379
+ And now we have the same text, but you have
1380
+ highlighted in bold, and not only read the
1381
+
1382
+ 0:41:21.480 --> 0:41:22.154
1383
+ bold.
1384
+
1385
+ 0:41:23.984 --> 0:41:25.995
1386
+ And ignore everything which is not bold.
1387
+
1388
+ 0:41:30.250 --> 0:41:49.121
1389
+ Would assume it's easier to read just the
1390
+ book part more faster and more faster.
1391
+
1392
+ 0:41:50.750 --> 0:41:57.626
1393
+ Yeah, it might be, but I'm not sure we have
1394
+ a master thesis of that.
1395
+
1396
+ 0:41:57.626 --> 0:41:59.619
1397
+ If seen my videos,.
1398
+
1399
+ 0:42:00.000 --> 0:42:09.875
1400
+ Of the recordings, I also have it more likely
1401
+ that it's like a fluent speak and I'm not like
1402
+
1403
+ 0:42:09.875 --> 0:42:12.318
1404
+ doing the hesitations.
1405
+
1406
+ 0:42:12.652 --> 0:42:23.764
1407
+ Don't know if somebody else has looked into
1408
+ the Coursera video, but noticed that.
1409
+
1410
+ 0:42:25.005 --> 0:42:31.879
1411
+ For these videos spoke every minute, three
1412
+ times or something, and then people were there
1413
+
1414
+ 0:42:31.879 --> 0:42:35.011
1415
+ and cutting things and making hopefully.
1416
+
1417
+ 0:42:35.635 --> 0:42:42.445
1418
+ And therefore if you want to more achieve
1419
+ that, of course, no longer exactly what was
1420
+
1421
+ 0:42:42.445 --> 0:42:50.206
1422
+ happening, but if it more looks like a professional
1423
+ video, then you would have to do that and cut
1424
+
1425
+ 0:42:50.206 --> 0:42:50.998
1426
+ that out.
1427
+
1428
+ 0:42:50.998 --> 0:42:53.532
1429
+ But yeah, there are definitely.
1430
+
1431
+ 0:42:55.996 --> 0:42:59.008
1432
+ We're also going to do this thing again.
1433
+
1434
+ 0:42:59.008 --> 0:43:02.315
1435
+ First turn is like I'm going to have a very.
1436
+
1437
+ 0:43:02.422 --> 0:43:07.449
1438
+ Which in the end they start to slow down just
1439
+ without feeling as though they're.
1440
+
1441
+ 0:43:07.407 --> 0:43:10.212
1442
+ It's a good point for the next.
1443
+
1444
+ 0:43:10.212 --> 0:43:13.631
1445
+ There is not the one perfect solution.
1446
+
1447
+ 0:43:13.631 --> 0:43:20.732
1448
+ There's some work on disfluency removal,
1449
+ but of course there's also the issue that disfluency
1450
+
1451
+ 0:43:20.732 --> 0:43:27.394
1452
+ removal is not that easy, so do you just remove
1453
+ that's in order everywhere.
1454
+
1455
+ 0:43:27.607 --> 0:43:29.708
1456
+ But how much like cleaning do you do?
1457
+
1458
+ 0:43:29.708 --> 0:43:31.366
1459
+ It's more a continuous thing.
1460
+
1461
+ 0:43:31.811 --> 0:43:38.211
1462
+ Is it more really you only remove stuff or
1463
+ are you also into rephrasing and here is only
1464
+
1465
+ 0:43:38.211 --> 0:43:38.930
1466
+ removing?
1467
+
1468
+ 0:43:39.279 --> 0:43:41.664
1469
+ But maybe you want to rephrase it.
1470
+
1471
+ 0:43:41.664 --> 0:43:43.231
1472
+ That's hearing better.
1473
+
1474
+ 0:43:43.503 --> 0:43:49.185
1475
+ So then it's going into what people are doing
1476
+ in style transfer.
1477
+
1478
+ 0:43:49.185 --> 0:43:52.419
1479
+ We are going from a speech style to.
1480
+
1481
+ 0:43:52.872 --> 0:44:07.632
1482
+ So there is more continuum, and of course
1483
+ Airconditioner is not the perfect solution,
1484
+
1485
+ 0:44:07.632 --> 0:44:10.722
1486
+ but exactly what.
1487
+
1488
+ 0:44:15.615 --> 0:44:19.005
1489
+ Yeah, we're challenging.
1490
+
1491
+ 0:44:19.005 --> 0:44:30.258
1492
+ You have examples where the direct copy is
1493
+ not as hard or is not exactly the same.
1494
+
1495
+ 0:44:30.258 --> 0:44:35.410
1496
+ That is, of course, more challenging.
1497
+
1498
+ 0:44:41.861 --> 0:44:49.889
1499
+ If it's getting really mean why it's so challenging,
1500
+ if it's really spontaneous even for the speaker,
1501
+
1502
+ 0:44:49.889 --> 0:44:55.634
1503
+ you need maybe even the video to really get
1504
+ that and at least the audio.
1505
+
1506
+ 0:45:01.841 --> 0:45:06.025
1507
+ Yeah what it also depends on.
1508
+
1509
+ 0:45:06.626 --> 0:45:15.253
1510
+ The purpose, of course, and very important
1511
+ thing is the easiest tasks just to removing.
1512
+
1513
+ 0:45:15.675 --> 0:45:25.841
1514
+ Of course you have to be very careful because
1515
+ if you remove some of the not, it's normally
1516
+
1517
+ 0:45:25.841 --> 0:45:26.958
1518
+ not much.
1519
+
1520
+ 0:45:27.227 --> 0:45:33.176
1521
+ But if you remove too much, of course, that's
1522
+ very, very bad because you're losing important.
1523
+
1524
+ 0:45:33.653 --> 0:45:46.176
1525
+ And this might be even more challenging if
1526
+ you think about rarer and unseen works.
1527
+
1528
+ 0:45:46.226 --> 0:45:56.532
1529
+ So when doing this removal, it's important
1530
+ to be careful and normally more conservative.
1531
+
1532
+ 0:46:03.083 --> 0:46:15.096
1533
+ Of course, also you have to again see if you're
1534
+ doing that now in a two step approach, not
1535
+
1536
+ 0:46:15.096 --> 0:46:17.076
1537
+ an end to end.
1538
+
1539
+ 0:46:17.076 --> 0:46:20.772
1540
+ So first you need a remote.
1541
+
1542
+ 0:46:21.501 --> 0:46:30.230
1543
+ But you have to somehow sing it in the whole
1544
+ type line.
1545
+
1546
+ 0:46:30.230 --> 0:46:36.932
1547
+ If you learn text or remove disfluencies,.
1548
+
1549
+ 0:46:36.796 --> 0:46:44.070
1550
+ But it might be that the ASR system is outputting
1551
+ something else or that it's more of an ASR
1552
+
1553
+ 0:46:44.070 --> 0:46:44.623
1554
+ error.
1555
+
1556
+ 0:46:44.864 --> 0:46:46.756
1557
+ So um.
1558
+
1559
+ 0:46:46.506 --> 0:46:52.248
1560
+ Just for example, if you do it based on language
1561
+ modeling scores, it might be that you're just
1562
+
1563
+ 0:46:52.248 --> 0:46:57.568
1564
+ the language modeling score because the ASR has
1565
+ done some errors, so you really have to see
1566
+
1567
+ 0:46:57.568 --> 0:46:59.079
1568
+ the combination of that.
1569
+
1570
+ 0:46:59.419 --> 0:47:04.285
1571
+ And for example, we had like partial words.
1572
+
1573
+ 0:47:04.285 --> 0:47:06.496
1574
+ They are like some.
1575
+
1576
+ 0:47:06.496 --> 0:47:08.819
1577
+ We didn't have that.
1578
+
1579
+ 0:47:08.908 --> 0:47:18.248
1580
+ So these can be cases where you start
1581
+ in the middle of the word and then you switch
1582
+
1583
+ 0:47:18.248 --> 0:47:19.182
1584
+ because.
1585
+
1586
+ 0:47:19.499 --> 0:47:23.214
1587
+ And of course, in text in perfect transcript,
1588
+ that's very easy to recognize.
1589
+
1590
+ 0:47:23.214 --> 0:47:24.372
1591
+ That's not a real word.
1592
+
1593
+ 0:47:24.904 --> 0:47:37.198
1594
+ However, when you really do it with an ASR system,
1595
+ it will normally output some type of word because
1596
+
1597
+ 0:47:37.198 --> 0:47:40.747
1598
+ it can only output real words.
1599
+
1600
+ 0:47:50.050 --> 0:48:03.450
1601
+ Example: We should think so if you have this
1602
+ in the transcript it's easy to detect as a
1603
+
1604
+ 0:48:03.450 --> 0:48:05.277
1605
+ disfluency.
1606
+
1607
+ 0:48:05.986 --> 0:48:11.619
1608
+ And then, of course, it's more challenging
1609
+ in a real world example where you have.
1610
+
1611
+ 0:48:12.492 --> 0:48:29.840
1612
+ Now to the approaches one thing is to really
1613
+ put it in between, so after your ASR system.
1614
+
1615
+ 0:48:31.391 --> 0:48:45.139
1616
+ So what your task is like: you have this
1617
+ text as input and the output is this text.
1618
+
1619
+ 0:48:45.565 --> 0:48:49.605
1620
+ There is different formulations of that.
1621
+
1622
+ 0:48:49.605 --> 0:48:54.533
1623
+ You might not be able to do everything like
1624
+ that.
1625
+
1626
+ 0:48:55.195 --> 0:49:10.852
1627
+ Or do you also allow, for example, rephrasing
1628
+ for reordering so in text you might have the
1629
+
1630
+ 0:49:10.852 --> 0:49:13.605
1631
+ word correctly.
1632
+
1633
+ 0:49:13.513 --> 0:49:24.201
1634
+ But the easiest thing is you only do it more
1635
+ like removing, so some things can be removed.
1636
+
1637
+ 0:49:29.049 --> 0:49:34.508
1638
+ Any ideas how to do that this is output.
1639
+
1640
+ 0:49:34.508 --> 0:49:41.034
1641
+ You have training data so we have training
1642
+ data.
1643
+
1644
+ 0:49:47.507 --> 0:49:55.869
1645
+ To put in with the spoon you can eat it even
1646
+ after it is out, but after the machine has.
1647
+
1648
+ 0:50:00.000 --> 0:50:05.511
1649
+ Was wearing rocks, so you have not just the
1650
+ shoes you remove but wearing them as input,
1651
+
1652
+ 0:50:05.511 --> 0:50:07.578
1653
+ as disfluent text and as output.
1654
+
1655
+ 0:50:07.578 --> 0:50:09.207
1656
+ It should be fluent text.
1657
+
1658
+ 0:50:09.207 --> 0:50:15.219
1659
+ It can be before or after recycling as you
1660
+ said, but you have this type of task, so technically
1661
+
1662
+ 0:50:15.219 --> 0:50:20.042
1663
+ how would you address this type of task when
1664
+ you have to solve this type of.
1665
+
1666
+ 0:50:24.364 --> 0:50:26.181
1667
+ That's exactly so.
1668
+
1669
+ 0:50:26.181 --> 0:50:28.859
1670
+ That's one way of doing it.
1671
+
1672
+ 0:50:28.859 --> 0:50:33.068
1673
+ It's a translation task and you train your.
1674
+
1675
+ 0:50:33.913 --> 0:50:34.683
1676
+ Can do.
1677
+
1678
+ 0:50:34.683 --> 0:50:42.865
1679
+ Then, of course, the bit of the challenge
1680
+ is that you automatically allow rephrasing
1681
+
1682
+ 0:50:42.865 --> 0:50:43.539
1683
+ stuff.
1684
+
1685
+ 0:50:43.943 --> 0:50:52.240
1686
+ Which of the one end is good so you have more
1687
+ opportunities but it might be also a bad thing
1688
+
1689
+ 0:50:52.240 --> 0:50:58.307
1690
+ because if you have more opportunities you
1691
+ have more opportunities to make errors.
1692
+
1693
+ 0:51:01.041 --> 0:51:08.300
1694
+ If you want to prevent that, it can also do
1695
+ more simple labeling, so for each word you
1696
+
1697
+ 0:51:08.300 --> 0:51:10.693
1698
+ label whether it should be removed or not.
1699
+
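One hedged sketch of how such word-level keep/remove labels could be derived, assuming you have parallel disfluent/fluent sentence pairs: align the two sides with a longest-common-subsequence match and mark the words that survive. A real system would then train a sequence tagger on these labels; the data and function names here are illustrative only.

```python
# Sketch: derive keep(1)/remove(0) labels for each disfluent word by aligning
# the disfluent sentence with its fluent counterpart (illustrative data only).
from difflib import SequenceMatcher

def keep_remove_labels(disfluent, fluent):
    labels = [0] * len(disfluent)
    matcher = SequenceMatcher(a=disfluent, b=fluent, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            labels[block.a + k] = 1          # word survives in the fluent version
    return labels

disfluent = "i want a ticket to denver uh to houston".split()
fluent = "i want a ticket to houston".split()
print(list(zip(disfluent, keep_remove_labels(disfluent, fluent))))
# denver, uh and the repeated "to" get label 0 (remove); everything else gets 1 (keep)
```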
1700
+ 0:51:12.132 --> 0:51:17.658
1701
+ People have also looked into parsing.
1702
+
1703
+ 0:51:17.658 --> 0:51:29.097
1704
+ You remember maybe the parse trees from the beginning,
1705
+ like the structure, because the idea is.
1706
+
1707
+ 0:51:29.649 --> 0:51:45.779
1708
+ There's also more unsupervised approaches
1709
+ where you then phrase it as a style transfer
1710
+
1711
+ 0:51:45.779 --> 0:51:46.892
1712
+ task.
1713
+
1714
+ 0:51:50.310 --> 0:51:58.601
1715
+ At the last point since we have that yes,
1716
+ it has also been done in an end-to-end fashion
1717
+
1718
+ 0:51:58.601 --> 0:52:06.519
1719
+ so that it's really you have as input the audio
1720
+ signal and output you have than the.
1721
+
1722
+ 0:52:06.446 --> 0:52:10.750
1723
+ The text without disfluencies, so a clean,
1724
+ clear text.
1725
+
1726
+ 0:52:11.131 --> 0:52:19.069
1727
+ You model every single total, which of course
1728
+ has a big advantage.
1729
+
1730
+ 0:52:19.069 --> 0:52:25.704
1731
+ You can use these paralinguistic features,
1732
+ pauses, and.
1733
+
1734
+ 0:52:25.705 --> 0:52:34.091
1735
+ If you switch so you start something then
1736
+ oh it doesn't work continue differently so.
1737
+
1738
+ 0:52:34.374 --> 0:52:42.689
1739
+ So you can easily use in a fashion while in
1740
+ a cascade approach.
1741
+
1742
+ 0:52:42.689 --> 0:52:47.497
1743
+ As we saw there you have text input.
1744
+
1745
+ 0:52:49.990 --> 0:53:02.389
1746
+ But on the one end we have again, and in the
1747
+ more extreme case the problem before was endless.
1748
+
1749
+ 0:53:02.389 --> 0:53:06.957
1750
+ Of course there is even less data.
1751
+
1752
+ 0:53:11.611 --> 0:53:12.837
1753
+ Good.
1754
+
1755
+ 0:53:12.837 --> 0:53:30.814
1756
+ This is all about the input to a very more
1757
+ person, or maybe if you think about YouTube.
1758
+
1759
+ 0:53:32.752 --> 0:53:34.989
1760
+ Talk so this could use be very exciting.
1761
+
1762
+ 0:53:36.296 --> 0:53:42.016
1763
+ Is more viewed as style transferred.
1764
+
1765
+ 0:53:42.016 --> 0:53:53.147
1766
+ You can use ideas from machine translation
1767
+ where you have one language.
1768
+
1769
+ 0:53:53.713 --> 0:53:57.193
1770
+ So there is ways of trying to do this type
1771
+ of style transfer.
1772
+
1773
+ 0:53:57.637 --> 0:54:02.478
1774
+ Think is definitely also very promising to
1775
+ make it more and more fluent in a business.
1776
+
1777
+ 0:54:03.223 --> 0:54:17.974
1778
+ Because one major issue about all the previous
1779
+ ones is that you need training data and then
1780
+
1781
+ 0:54:17.974 --> 0:54:21.021
1782
+ you need training.
1783
+
1784
+ 0:54:21.381 --> 0:54:32.966
1785
+ So I mean, think that we are only really of
1786
+ data that we have for English.
1787
+
1788
+ 0:54:32.966 --> 0:54:39.453
1789
+ Maybe there is a very few data in German.
1790
+
1791
+ 0:54:42.382 --> 0:54:49.722
1792
+ Okay, then let's talk about low latency speech.
1793
+
1794
+ 0:54:50.270 --> 0:55:05.158
1795
+ So the idea is if we are doing life translation
1796
+ of a talker, so we want to start out.
1797
+
1798
+ 0:55:05.325 --> 0:55:23.010
1799
+ This is possible because there is typically
1800
+ some kind of monotony in many languages.
1801
+
1802
+ 0:55:24.504 --> 0:55:29.765
1803
+ And this is also what, for example, human
1804
+ interpreters are doing to have a really low
1805
+
1806
+ 0:55:29.765 --> 0:55:30.071
1807
+ lag.
1808
+
1809
+ 0:55:30.750 --> 0:55:34.393
1810
+ They are even going further.
1811
+
1812
+ 0:55:34.393 --> 0:55:40.926
1813
+ They guess what will be the ending of the
1814
+ sentence.
1815
+
1816
+ 0:55:41.421 --> 0:55:51.120
1817
+ Then they can already continue, although it's
1818
+ not sad it might be needed, but that is even
1819
+
1820
+ 0:55:51.120 --> 0:55:53.039
1821
+ more challenging.
1822
+
1823
+ 0:55:54.714 --> 0:55:58.014
1824
+ Why is it so difficult?
1825
+
1826
+ 0:55:58.014 --> 0:56:09.837
1827
+ There is this train of on the one end for
1828
+ a and you want to have more context because
1829
+
1830
+ 0:56:09.837 --> 0:56:14.511
1831
+ we learn if we have more context.
1832
+
1833
+ 0:56:15.015 --> 0:56:24.033
1834
+ And therefore to have more contacts you have
1835
+ to wait as long as possible.
1836
+
1837
+ 0:56:24.033 --> 0:56:27.689
1838
+ The best is to have the full.
1839
+
1840
+ 0:56:28.168 --> 0:56:35.244
1841
+ On the other hand, you want to have a low
1842
+ latency for the user to wait to generate as
1843
+
1844
+ 0:56:35.244 --> 0:56:35.737
1845
+ soon.
1846
+
1847
+ 0:56:36.356 --> 0:56:47.149
1848
+ So if you're doing no situation you have to
1849
+ find the best way to start in order to have
1850
+
1851
+ 0:56:47.149 --> 0:56:48.130
1852
+ a good.
1853
+
1854
+ 0:56:48.728 --> 0:56:52.296
1855
+ There's no longer the perfect solution.
1856
+
1857
+ 0:56:52.296 --> 0:56:56.845
1858
+ People will also evaluate what is the translation.
1859
+
1860
+ 0:56:57.657 --> 0:57:09.942
1861
+ While it's challenging in German to English,
1862
+ German has this very nice thing where the prefix
1863
+
1864
+ 0:57:09.942 --> 0:57:16.607
1865
+ of the word can be put at the end of the sentence.
1866
+
1867
+ 0:57:17.137 --> 0:57:24.201
1868
+ And you only know if the person registers
1869
+ or cancels his registration at the end of the sentence.
1870
+
1871
+ 0:57:24.985 --> 0:57:33.690
1872
+ So if you want to start the translation in
1873
+ English you need to know at this point is the.
1874
+
1875
+ 0:57:35.275 --> 0:57:39.993
1876
+ So you would have to wait until the end of
1877
+ the year.
1878
+
1879
+ 0:57:39.993 --> 0:57:42.931
1880
+ That's not really what you want.
1881
+
1882
+ 0:57:43.843 --> 0:57:45.795
1883
+ What happened.
1884
+
1885
+ 0:57:47.207 --> 0:58:12.550
1886
+ Other solutions of doing that are: Have been
1887
+ motivating like how we can do that subject
1888
+
1889
+ 0:58:12.550 --> 0:58:15.957
1890
+ verb object or subject object verb order.
1891
+
1892
+ 0:58:16.496 --> 0:58:24.582
1893
+ In German it's not always subject, but there
1894
+ are relative sentence where you have that,
1895
+
1896
+ 0:58:24.582 --> 0:58:25.777
1897
+ so it needs.
1898
+
1899
+ 0:58:28.808 --> 0:58:41.858
1900
+ How we can do that is, we'll look today into
1901
+ three ways of doing that.
1902
+
1903
+ 0:58:41.858 --> 0:58:46.269
1904
+ The one is to mitigate.
1905
+
1906
+ 0:58:46.766 --> 0:58:54.824
1907
+ And then the IVAR idea is to do retranslating,
1908
+ and there you can now use the text output.
1909
+
1910
+ 0:58:54.934 --> 0:59:02.302
1911
+ So the idea is you translate, and if you later
1912
+ notice it was wrong then you can retranslate
1913
+
1914
+ 0:59:02.302 --> 0:59:03.343
1915
+ and correct.
1916
+
1917
+ 0:59:03.803 --> 0:59:14.383
1918
+ Or you can do what is called streaming decoding,
1919
+ so you can decode incrementally.
1920
+
1921
+ 0:59:17.237 --> 0:59:30.382
1922
+ Let's start with the optimization, so if you
1923
+ have a sentence, it may reach a conference,
1924
+
1925
+ 0:59:30.382 --> 0:59:33.040
1926
+ and in this time.
1927
+
1928
+ 0:59:32.993 --> 0:59:39.592
1929
+ So you have a good translation quality while
1930
+ still having low latency.
1931
+
1932
+ 0:59:39.699 --> 0:59:50.513
1933
+ You have an extra model which does your segmentation
1934
+ before, but your aim is not to have a segmentation.
1935
+
1936
+ 0:59:50.470 --> 0:59:53.624
1937
+ But you can somehow measure in training data.
1938
+
1939
+ 0:59:53.624 --> 0:59:59.863
1940
+ If do these types of segment lengths, that's
1941
+ my latency and that's my translation quality,
1942
+
1943
+ 0:59:59.863 --> 1:00:02.811
1944
+ and then you can try to search a good way.
1945
+
1946
+ 1:00:03.443 --> 1:00:20.188
1947
+ If you're doing that one, it's an extra component,
1948
+ so you can use your system as it was.
1949
+
1950
+ 1:00:22.002 --> 1:00:28.373
1951
+ The other idea is to directly output the first
1952
+ high processes always, so always when you have
1953
+
1954
+ 1:00:28.373 --> 1:00:34.201
1955
+ text or audio we translate, and if we then
1956
+ have more context available we can update.
1957
+
1958
+ 1:00:35.015 --> 1:00:50.195
1959
+ So imagine before, if you get 'I register'
1960
+ and then the sentence is continued, then.
1961
+
1962
+ 1:00:50.670 --> 1:00:54.298
1963
+ So you change the output.
1964
+
1965
+ 1:00:54.298 --> 1:01:07.414
1966
+ Of course, that might be also leading to bad
1967
+ user experience if you always flicker and change
1968
+
1969
+ 1:01:07.414 --> 1:01:09.228
1970
+ your output.
1971
+
1972
+ 1:01:09.669 --> 1:01:15.329
1973
+ The bit like human interpreters also are able
1974
+ to correct, so they're doing a more long text.
1975
+
1976
+ 1:01:15.329 --> 1:01:20.867
1977
+ If they are guessing how to continue to say
1978
+ and then he's saying something different, they
1979
+
1980
+ 1:01:20.867 --> 1:01:22.510
1981
+ also have to correct them.
1982
+
1983
+ 1:01:22.510 --> 1:01:26.831
1984
+ So here, since it's not all you, we can even
1985
+ change what we have said.
1986
+
1987
+ 1:01:26.831 --> 1:01:29.630
1988
+ Yes, that's exactly what we have implemented.
1989
+
1990
+ 1:01:31.431 --> 1:01:49.217
1991
+ So how that works is, we are aware, and then
1992
+ we translate it, and if we get more input like
1993
+
1994
+ 1:01:49.217 --> 1:01:51.344
1995
+ you, then.
1996
+
1997
+ 1:01:51.711 --> 1:02:00.223
1998
+ And so we can always continue to do that and
1999
+ improve the transcript that we have.
2000
+
2001
+ 1:02:00.480 --> 1:02:07.729
2002
+ So in the end we have the lowest possible
2003
+ latency because we always output what is possible.
2004
+
2005
+ 1:02:07.729 --> 1:02:14.784
2006
+ On the other hand, introducing a bit of a
2007
+ new problem is: There's another challenge when
2008
+
2009
+ 1:02:14.784 --> 1:02:20.061
2010
+ we first used that this one was first used
2011
+ for old and that it worked fine.
2012
+
2013
+ 1:02:20.061 --> 1:02:21.380
2014
+ You switch to NMT.
2015
+
2016
+ 1:02:21.380 --> 1:02:25.615
2017
+ You saw one problem that is even generating
2018
+ more flickering.
2019
+
2020
+ 1:02:25.615 --> 1:02:28.878
2021
+ The problem is the normal machine translation.
2022
+
2023
+ 1:02:29.669 --> 1:02:35.414
2024
+ So implicitly it learned that the output always
2025
+ ends with a dot, and it's always a full sentence.
2026
+
2027
+ 1:02:36.696 --> 1:02:42.466
2028
+ And this was even more important somewhere
2029
+ in the model than really what is in the input.
2030
+
2031
+ 1:02:42.983 --> 1:02:55.910
2032
+ So if you give him a partial sentence, it
2033
+ will still generate a full sentence.
2034
+
2035
+ 1:02:55.910 --> 1:02:58.201
2036
+ So encourage.
2037
+
2038
+ 1:02:58.298 --> 1:03:05.821
2039
+ It's like trying to just continue it somehow
2040
+ to a full sentence and if it's doing better
2041
+
2042
+ 1:03:05.821 --> 1:03:10.555
2043
+ guessing stuff then you have to even have more
2044
+ changes.
2045
+
2046
+ 1:03:10.890 --> 1:03:23.944
2047
+ So here we have a train-test mismatch and that's
2048
+ maybe more a general important thing that the
2049
+
2050
+ 1:03:23.944 --> 1:03:28.910
2051
+ modem might learn a bit different.
2052
+
2053
+ 1:03:29.289 --> 1:03:32.636
2054
+ It's always ending with a dot, so you don't
2055
+ just guess something in general.
2056
+
2057
+ 1:03:33.053 --> 1:03:35.415
2058
+ So we have here a train-test mismatch.
2059
+
2060
+ 1:03:38.918 --> 1:03:41.248
2061
+ And if we have a train-test mismatch,
2062
+
2063
+ 1:03:41.248 --> 1:03:43.708
2064
+ What is the best way to address that?
2065
+
2066
+ 1:03:46.526 --> 1:03:51.934
2067
+ That's exactly the right, so we have to like
2068
+ train also on that.
2069
+
2070
+ 1:03:52.692 --> 1:03:55.503
2071
+ The problem is for partial sentences.
2072
+
2073
+ 1:03:55.503 --> 1:03:59.611
2074
+ There's not training data, so it's hard to
2075
+ find all our.
2076
+
2077
+ 1:04:00.580 --> 1:04:06.531
2078
+ However, it's quite easy to generate artificial
2079
+ partial sentences, at least for the source.
2080
+
2081
+ 1:04:06.926 --> 1:04:15.367
2082
+ So you just take, you take all the prefixes
2083
+ of the source data.
2084
+
2085
+ 1:04:17.017 --> 1:04:22.794
2086
+ On the problem of course, with a bit what
2087
+ do you know lying?
2088
+
2089
+ 1:04:22.794 --> 1:04:30.845
2090
+ If you have a sentence, I encourage all of
2091
+ what should be the right target for that.
2092
+
2093
+ 1:04:31.491 --> 1:04:45.381
2094
+ And the constraints on the one hand, it should
2095
+ be as long as possible, so you always have
2096
+
2097
+ 1:04:45.381 --> 1:04:47.541
2098
+ a long delay.
2099
+
2100
+ 1:04:47.687 --> 1:04:55.556
2101
+ On the other hand, it should be also a suspect
2102
+ of the previous ones, and it should be not
2103
+
2104
+ 1:04:55.556 --> 1:04:57.304
2105
+ too much inventing.
2106
+
2107
+ 1:04:58.758 --> 1:05:02.170
2108
+ A very easy solution works fine.
2109
+
2110
+ 1:05:02.170 --> 1:05:05.478
2111
+ You can just do a length-based split.
2112
+
2113
+ 1:05:05.478 --> 1:05:09.612
2114
+ You also take two thirds of the target.
2115
+
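A small sketch of this length-ratio heuristic for building artificial prefix training pairs: for a source prefix of k words, keep a proportional share of the target words. The German/English pair is an invented example, and a real system would work on subword units.

```python
# Sketch: build artificial (source prefix, target prefix) training pairs
# with a simple proportional length heuristic (invented example pair).
def prefix_pairs(src_words, tgt_words):
    pairs = []
    for k in range(1, len(src_words) + 1):
        ratio = k / len(src_words)
        t = round(ratio * len(tgt_words))          # proportional target length
        pairs.append((" ".join(src_words[:k]), " ".join(tgt_words[:t])))
    return pairs

src = "ich melde mich für die konferenz an".split()
tgt = "i register for the conference".split()
for s, t in prefix_pairs(src, tgt):
    print(f"{s!r:45} -> {t!r}")
```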
2116
+ 1:05:10.070 --> 1:05:19.626
2117
+ His learning then implicitly to guess a bit
2118
+ if you think about the beginning of example.
2119
+
2120
+ 1:05:20.000 --> 1:05:30.287
2121
+ This one, if you do two thirds or like half, in
2122
+ this case the target would be 'I register'.
2123
+
2124
+ 1:05:30.510 --> 1:05:39.289
2125
+ So you're doing a bit of implicit guessing,
2126
+ and if it's getting wrong you have rewriting,
2127
+
2128
+ 1:05:39.289 --> 1:05:43.581
2129
+ but you're doing a good amount of guessing.
2130
+
2131
+ 1:05:49.849 --> 1:05:53.950
2132
+ In addition, this would be like how it looks
2133
+ like if it was like.
2134
+
2135
+ 1:05:53.950 --> 1:05:58.300
2136
+ If it wasn't a housing game, then the target
2137
+ could be something like.
2138
+
2139
+ 1:05:58.979 --> 1:06:02.513
2140
+ One problem is that you just do that this
2141
+ way.
2142
+
2143
+ 1:06:02.513 --> 1:06:04.619
2144
+ It's most of your training.
2145
+
2146
+ 1:06:05.245 --> 1:06:11.983
2147
+ And in the end you're interested in the overall
2148
+ translation quality, so for full sentence.
2149
+
2150
+ 1:06:11.983 --> 1:06:19.017
2151
+ So if you train on that, it will mainly learn
2152
+ how to translate prefixes because ninety percent
2153
+
2154
+ 1:06:19.017 --> 1:06:21.535
2155
+ or more of your data is prefixed.
2156
+
2157
+ 1:06:22.202 --> 1:06:31.636
2158
+ That's why we'll see that it's better to do
2159
+ like a ratio.
2160
+
2161
+ 1:06:31.636 --> 1:06:39.281
2162
+ So half your training data are full sentences.
2163
+
2164
+ 1:06:39.759 --> 1:06:47.693
2165
+ Because if you're doing this, well, you see
2166
+ that for every word you get a prefix and only one full sentence.
2167
+
2168
+ 1:06:48.048 --> 1:06:52.252
2169
+ You also see that nicely here here are both.
2170
+
2171
+ 1:06:52.252 --> 1:06:56.549
2172
+ These are the BLEU scores and you see the baseline.
2173
+
2174
+ 1:06:58.518 --> 1:06:59.618
2175
+ Is this one?
2176
+
2177
+ 1:06:59.618 --> 1:07:03.343
2178
+ It has a good quality because it's trained.
2179
+
2180
+ 1:07:03.343 --> 1:07:11.385
2181
+ If you know, train with all the partial sentences
2182
+ is more focusing on how to translate partial
2183
+
2184
+ 1:07:11.385 --> 1:07:12.316
2185
+ sentences.
2186
+
2187
+ 1:07:12.752 --> 1:07:17.840
2188
+ Because all the partial sentences will at
2189
+ some point be removed, because at the end you
2190
+
2191
+ 1:07:17.840 --> 1:07:18.996
2192
+ translate the full.
2193
+
2194
+ 1:07:20.520 --> 1:07:24.079
2195
+ There's many tasks to read, but you have the
2196
+ same performances.
2197
+
2198
+ 1:07:24.504 --> 1:07:26.938
2199
+ On the other hand, you see here the other
2200
+ problem.
2201
+
2202
+ 1:07:26.938 --> 1:07:28.656
2203
+ This is how many words got updated.
2204
+
2205
+ 1:07:29.009 --> 1:07:31.579
2206
+ You want to have as few updates as possible.
2207
+
2208
+ 1:07:31.579 --> 1:07:34.891
2209
+ Updates need to remove things which are once
2210
+ being shown.
2211
+
2212
+ 1:07:35.255 --> 1:07:40.538
2213
+ This is quite high for the baseline.
2214
+
2215
+ 1:07:40.538 --> 1:07:50.533
2216
+ If you know the partials that are going down,
2217
+ they should be removed.
2218
+
2219
+ 1:07:51.151 --> 1:07:58.648
2220
+ And then for the multi-task setup you have a bit like
2221
+ the best of both.
2222
+
2223
+ 1:08:02.722 --> 1:08:05.296
2224
+ Any more questions to this type of.
2225
+
2226
+ 1:08:09.309 --> 1:08:20.760
2227
+ The last thing is that you want to do streaming decoding.
2228
+
2229
+ 1:08:21.541 --> 1:08:23.345
2230
+ Again, it's a bit implication.
2231
+
2232
+ 1:08:23.345 --> 1:08:25.323
2233
+ Scenario is what you really want.
2234
+
2235
+ 1:08:25.323 --> 1:08:30.211
2236
+ As you said, we sometimes use this updating,
2237
+ and for text output it'd be very nice.
2238
+
2239
+ 1:08:30.211 --> 1:08:35.273
2240
+ But imagine if you want to audio output, of
2241
+ course you can't change it anymore because
2242
+
2243
+ 1:08:35.273 --> 1:08:37.891
2244
+ on one side you cannot change what was said.
2245
+
2246
+ 1:08:37.891 --> 1:08:40.858
2247
+ So in this time you more need like a fixed
2248
+ output.
2249
+
2250
+ 1:08:41.121 --> 1:08:47.440
2251
+ And then this style of streaming decoding is interesting.
2252
+
2253
+ 1:08:47.440 --> 1:08:55.631
2254
+ Where you, for example, get sourced, the seagullins
2255
+ are so stoked in.
2256
+
2257
+ 1:08:55.631 --> 1:09:00.897
2258
+ Then you decide oh, now it's better to wait.
2259
+
2260
+ 1:09:01.041 --> 1:09:14.643
2261
+ So you somehow need to have this type of additional
2262
+ information.
2263
+
2264
+ 1:09:15.295 --> 1:09:23.074
2265
+ Here you have to decide: should I now output
2266
+ a token or should I wait for more input.
2267
+
2268
+ 1:09:26.546 --> 1:09:32.649
2269
+ So you have to do this additional labels like
2270
+ wait, wait, output, output, wait and so
2271
+
2272
+ 1:09:32.649 --> 1:09:32.920
2273
+ on.
2274
+
2275
+ 1:09:33.453 --> 1:09:38.481
2276
+ There are different ways of doing that.
2277
+
2278
+ 1:09:38.481 --> 1:09:45.771
2279
+ You can have an additional model that does
2280
+ this decision.
2281
+
2282
+ 1:09:46.166 --> 1:09:53.669
2283
+ And then have a higher quality or better to
2284
+ continue and then have a lower latency in this
2285
+
2286
+ 1:09:53.669 --> 1:09:54.576
2287
+ different.
2288
+
2289
+ 1:09:55.215 --> 1:09:59.241
2290
+ Surprisingly, a very easy task also works,
2291
+ sometimes quite good.
2292
+
2293
+ 1:10:03.043 --> 1:10:10.981
2294
+ And that is the so-called wait-k policy
2295
+ and the idea is there at least for text to
2296
+
2297
+ 1:10:10.981 --> 1:10:14.623
2298
+ text translation that is working well.
2299
+
2300
+ 1:10:14.623 --> 1:10:22.375
2301
+ It's like you wait for k words and then you
2302
+ always output one word for each new input word.
2303
+
2304
+ 1:10:22.682 --> 1:10:28.908
2305
+ So you wait k words at the beginning
2306
+ of the sentence, and every time a new word
2307
+
2308
+ 1:10:28.908 --> 1:10:29.981
2309
+ is coming in, you output one.
2310
+
2311
+ 1:10:31.091 --> 1:10:39.459
2312
+ So you have the same speed as the input,
2313
+ so you're not lagging more and more, but you
2314
+
2315
+ 1:10:39.459 --> 1:10:41.456
2316
+ have enough context.
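A schematic wait-k read/write loop, just to make the policy concrete; `translate_next_word` is a placeholder for whatever incremental decoder is used, not a real API.

```python
# Sketch: the wait-k policy as a reading/writing loop (k = 3 here).
# `translate_next_word` is a placeholder for an incremental MT decoder.
def translate_next_word(source_prefix, target_prefix):
    return f"tok{len(target_prefix) + 1}"          # dummy stand-in

def wait_k_decode(source_stream, k=3):
    source, target = [], []
    for word in source_stream:
        source.append(word)                        # READ one source word
        if len(source) >= k:
            target.append(translate_next_word(source, target))   # WRITE one word
    while len(target) < len(source):               # source exhausted: flush the rest
        target.append(translate_next_word(source, target))
    return target

print(wait_k_decode("das ist ein kleines beispiel".split(), k=3))
```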
2317
+
2318
+ 1:10:43.103 --> 1:10:49.283
2319
+ Of course this, for example for the 'anmelden' case,
2320
+ will not solve it perfectly but if you have
2321
+
2322
+ 1:10:49.283 --> 1:10:55.395
2323
+ a bit of local reordering inside your token
2324
+ that you can manage very well and then it's
2325
+
2326
+ 1:10:55.395 --> 1:10:57.687
2327
+ a very simple solution but it's.
2328
+
2329
+ 1:10:57.877 --> 1:11:00.481
2330
+ The other one was dynamic.
2331
+
2332
+ 1:11:00.481 --> 1:11:06.943
2333
+ Depending on the context you can decide how
2334
+ long you want to wait.
2335
+
2336
+ 1:11:07.687 --> 1:11:21.506
2337
+ It also only works if you have a similar amount
2338
+ of tokens, so if your target is very short
2339
+
2340
+ 1:11:21.506 --> 1:11:22.113
2341
+ of.
2342
+
2343
+ 1:11:22.722 --> 1:11:28.791
2344
+ That's why it's also more challenging for
2345
+ audio input because the speaking rate is changing
2346
+
2347
+ 1:11:28.791 --> 1:11:29.517
2348
+ and so on.
2349
+
2350
+ 1:11:29.517 --> 1:11:35.586
2351
+ You would have to do something like I'll output
2352
+ a word for every second of audio or something
2353
+
2354
+ 1:11:35.586 --> 1:11:35.981
2355
+ like.
2356
+
2357
+ 1:11:36.636 --> 1:11:45.459
2358
+ The problem is that the audio speaking speed
2359
+ is not fixed but varies quite a bit, and therefore.
2360
+
2361
+ 1:11:50.170 --> 1:11:58.278
2362
+ Therefore, what you can also do is you can
2363
+ use a similar solution than we had before with
2364
+
2365
+ 1:11:58.278 --> 1:11:59.809
2366
+ the retranslation.
2367
+
2368
+ 1:12:00.080 --> 1:12:02.904
2369
+ You remember we were re-decoded all the time.
2370
+
2371
+ 1:12:03.423 --> 1:12:12.253
2372
+ And you can do something similar in this case
2373
+ except that you add something in that you're
2374
+
2375
+ 1:12:12.253 --> 1:12:16.813
2376
+ saying: oh, if I re-decode, I'm not always.
2377
+
2378
+ 1:12:16.736 --> 1:12:22.065
2379
+ Can decode as I want, but you can do this
2380
+ target prefix decoding, so what you say is
2381
+
2382
+ 1:12:22.065 --> 1:12:23.883
2383
+ in your beam search.
2384
+
2385
+ 1:12:23.883 --> 1:12:26.829
2386
+ You can easily say generate a translation
2387
+ bus.
2388
+
2389
+ 1:12:27.007 --> 1:12:29.810
2390
+ The translation has to start with the prefix.
2391
+
2392
+ 1:12:31.251 --> 1:12:35.350
2393
+ How can you do that?
2394
+
2395
+ 1:12:39.839 --> 1:12:49.105
2396
+ In the decoder exactly you start, so if you
2397
+ do beam search you select always the most probable.
2398
+
2399
+ 1:12:49.349 --> 1:12:57.867
2400
+ And now you say: oh, I'm not selecting the
2401
+ most probable, but this forced one, so in
2402
+
2403
+ 1:12:57.867 --> 1:13:04.603
2404
+ the first step have to take this one, in the
2405
+ second start decoding.
2406
+
2407
+ 1:13:04.884 --> 1:13:09.387
2408
+ And then you're making sure that your second
2409
+ always starts with this prefix.
2410
+
2411
+ 1:13:10.350 --> 1:13:18.627
2412
+ And then you can use your immediate retranslation,
2413
+ but you're no longer changing the output.
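A framework-free sketch of that forced-prefix decoding: for the first steps the decoder is fed the committed words instead of its own argmax, then it continues freely. `next_token_distribution` and the toy vocabulary are stand-ins, not a real library interface; with beam search every hypothesis would be constrained the same way.

```python
# Sketch: greedy decoding with a forced target prefix.
# `next_token_distribution` is a placeholder for the real model.
import random

VOCAB = ["all", "models", "are", "wrong", "but", "useful", "</s>"]

def next_token_distribution(source, target_so_far):
    random.seed(len(target_so_far))                 # dummy, deterministic stand-in
    return {tok: random.random() for tok in VOCAB}

def decode_with_prefix(source, forced_prefix, max_len=10):
    target = []
    while len(target) < max_len:
        if len(target) < len(forced_prefix):
            token = forced_prefix[len(target)]      # forced: take the committed word
        else:
            probs = next_token_distribution(source, target)
            token = max(probs, key=probs.get)       # free: greedy argmax
        if token == "</s>":
            break
        target.append(token)
    return target

print(decode_with_prefix(["source", "audio"], forced_prefix=["all", "models"]))
```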
2414
+
2415
+ 1:13:19.099 --> 1:13:31.595
2416
+ How it works: so it may get a speech signal
2417
+ as input, and it is not outputting anything yet.
2418
+
2419
+ 1:13:32.212 --> 1:13:45.980
2420
+ So then if you got you get a translation maybe
2421
+ and then you decide yes output.
2422
+
2423
+ 1:13:46.766 --> 1:13:54.250
2424
+ And then you're translating as one as two
2425
+ as sweet as four, but now you say generate
2426
+
2427
+ 1:13:54.250 --> 1:13:55.483
2428
+ only outputs.
2429
+
2430
+ 1:13:55.935 --> 1:14:07.163
2431
+ And then you're translating and maybe you're
2432
+ deciding on and now a good translation.
2433
+
2434
+ 1:14:07.163 --> 1:14:08.880
2435
+ Then you're.
2436
+
2437
+ 1:14:09.749 --> 1:14:29.984
2438
+ Yes, but don't get to worry about what the
2439
+ effect is.
2440
+
2441
+ 1:14:30.050 --> 1:14:31.842
2442
+ We're generating your target text.
2443
+
2444
+ 1:14:32.892 --> 1:14:36.930
2445
+ But we're not always outputing the full target
2446
+ text now.
2447
+
2448
+ 1:14:36.930 --> 1:14:43.729
2449
+ What we are having is we have here some strategy
2450
+ to decide: Oh, is a system already sure enough
2451
+
2452
+ 1:14:43.729 --> 1:14:44.437
2453
+ about it?
2454
+
2455
+ 1:14:44.437 --> 1:14:49.395
2456
+ If it's sure enough and it has all the information,
2457
+ we can output it.
2458
+
2459
+ 1:14:49.395 --> 1:14:50.741
2460
+ And then the next.
2461
+
2462
+ 1:14:51.291 --> 1:14:55.931
2463
+ If we say here sometimes with better not to
2464
+ get output we won't output it already.
2465
+
2466
+ 1:14:57.777 --> 1:15:06.369
2467
+ And thereby the hope is that the model
2468
+ should not yet output 'I register' because it
2469
+
2470
+ 1:15:06.369 --> 1:15:10.568
2471
+ doesn't know yet if it's the case or not.
2472
+
2473
+ 1:15:13.193 --> 1:15:18.056
2474
+ So what we have to discuss is what is a good
2475
+ output strategy.
2476
+
2477
+ 1:15:18.658 --> 1:15:20.070
2478
+ So you could do.
2479
+
2480
+ 1:15:20.070 --> 1:15:23.806
2481
+ The output strategy could be something like.
2482
+
2483
+ 1:15:23.743 --> 1:15:39.871
2484
+ If you think of wait-k, this is an output
2485
+ strategy here that you always input.
2486
+
2487
+ 1:15:40.220 --> 1:15:44.990
2488
+ Good, and you can view wait-k in a similar
2489
+ way as.
2490
+
2491
+ 1:15:45.265 --> 1:15:55.194
2492
+ But now, of course, we can also look at other
2493
+ output strategies where it's more generic and
2494
+
2495
+ 1:15:55.194 --> 1:15:59.727
2496
+ it's deciding whether in some situations.
2497
+
2498
+ 1:16:01.121 --> 1:16:12.739
2499
+ And one thing that works quite well is referred
2500
+ to as local agreement, and that means you're
2501
+
2502
+ 1:16:12.739 --> 1:16:13.738
2503
+ always.
2504
+
2505
+ 1:16:14.234 --> 1:16:26.978
2506
+ Then you're looking what is the same thing
2507
+ between my current translation and the one
2508
+
2509
+ 1:16:26.978 --> 1:16:28.756
2510
+ did before.
2511
+
2512
+ 1:16:29.349 --> 1:16:31.201
2513
+ So let's do that again in six hours.
2514
+
2515
+ 1:16:31.891 --> 1:16:45.900
2516
+ So your input is a first audio segment and
2517
+ your target text is 'all model trains'.
2518
+
2519
+ 1:16:46.346 --> 1:16:53.231
2520
+ Then you're getting audio segments one and
2521
+ two, and this time the output is 'all models'.
2522
+
2523
+ 1:16:54.694 --> 1:17:08.407
2524
+ You see trains are different, but both of
2525
+ them agree that it's all so in those cases.
2526
+
2527
+ 1:17:09.209 --> 1:17:13.806
2528
+ So we can hopefully be a bit sure that it really
2529
+ starts with 'all'.
2530
+
2531
+ 1:17:15.155 --> 1:17:22.604
2532
+ So now we say we're output all, so at this
2533
+ time instead we'll output all, although before.
2534
+
2535
+ 1:17:23.543 --> 1:17:27.422
2536
+ We are getting one, two, three as input.
2537
+
2538
+ 1:17:27.422 --> 1:17:35.747
2539
+ This time we have a prefix, so now we are
2540
+ only allowing translations to start with all.
2541
+
2542
+ 1:17:35.747 --> 1:17:42.937
2543
+ We cannot change that anymore, so we now need
2544
+ to generate some translation.
2545
+
2546
+ 1:17:43.363 --> 1:17:46.323
2547
+ And then it can be that it's now 'all models
2548
+ are wrong'.
2549
+
2550
+ 1:17:47.927 --> 1:18:01.908
2551
+ Then we compare here and see this agrees on
2552
+ all models so we can output all models.
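A tiny sketch of this local agreement strategy: commit only the longest common prefix of the two most recent hypotheses, beyond what is already shown. The hypotheses mirror the 'all models are wrong' example above; in a full system the committed prefix would then also be forced in the next decoding pass, as in the target-prefix decoding discussed earlier.

```python
# Sketch: local agreement -- only output the prefix on which the last two
# (re-)translations agree (hypotheses are invented example data).
def agreed_prefix(hyp_a, hyp_b):
    out = []
    for a, b in zip(hyp_a, hyp_b):
        if a != b:
            break
        out.append(a)
    return out

committed = []
previous = None
for hyp in [["all", "model", "trains"],
            ["all", "models", "are"],
            ["all", "models", "are", "wrong"]]:
    if previous is not None:
        stable = agreed_prefix(previous, hyp)
        committed = stable if len(stable) > len(committed) else committed
        print("committed so far:", " ".join(committed))
    previous = hyp
# committed so far: all
# committed so far: all models are
```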
2553
+
2554
+ 1:18:02.882 --> 1:18:07.356
2555
+ So thereby we can dynamically decide: if the
2556
+ model is very unsure.
2557
+
2558
+ 1:18:07.356 --> 1:18:10.178
2559
+ It always outputs something different.
2560
+
2561
+ 1:18:11.231 --> 1:18:24.872
2562
+ Then it's, we'll wait longer, it's more for
2563
+ the same thing, and hope we don't need to wait.
2564
+
2565
+ 1:18:30.430 --> 1:18:40.238
2566
+ Is it clear again that the signal wouldn't
2567
+ be able to detect?
2568
+
2569
+ 1:18:43.203 --> 1:18:50.553
2570
+ The hope it is because if it's not sure of,
2571
+ of course, it in this kind would have to switch
2572
+
2573
+ 1:18:50.553 --> 1:18:51.671
2574
+ all the time.
2575
+
2576
+ 1:18:56.176 --> 1:19:01.375
2577
+ So if it would be the first step to register
2578
+ and the second time to cancel and they may
2579
+
2580
+ 1:19:01.375 --> 1:19:03.561
2581
+ register again, they wouldn't do it.
2582
+
2583
+ 1:19:03.561 --> 1:19:08.347
2584
+ Of course, it is very short because in register
2585
+ a long time, then it can't deal.
2586
+
2587
+ 1:19:08.568 --> 1:19:23.410
2588
+ That's why there's two parameters that you
2589
+ can use and which might be important, or how.
2590
+
2591
+ 1:19:23.763 --> 1:19:27.920
2592
+ So you do it like every one second, every
2593
+ five seconds or something like that.
2594
+
2595
+ 1:19:28.648 --> 1:19:37.695
2596
+ Put it more often as your latency will be
2597
+ because your weight is less long, but also
2598
+
2599
+ 1:19:37.695 --> 1:19:39.185
2600
+ you might do.
2601
+
2602
+ 1:19:40.400 --> 1:19:50.004
2603
+ So that is the one thing and the other thing
2604
+ is for words you might do everywhere, but if
2605
+
2606
+ 1:19:50.004 --> 1:19:52.779
2607
+ you think about audio it.
2608
+
2609
+ 1:19:53.493 --> 1:20:04.287
2610
+ And the other question you can do like the
2611
+ agreement, so the model is sure.
2612
+
2613
+ 1:20:04.287 --> 1:20:10.252
2614
+ If you say have to agree, then hopefully.
2615
+
2616
+ 1:20:10.650 --> 1:20:21.369
2617
+ What we saw is think there has been a really
2618
+ normally good performance and otherwise your
2619
+
2620
+ 1:20:21.369 --> 1:20:22.441
2621
+ latency.
2622
+
2623
+ 1:20:22.963 --> 1:20:42.085
2624
+ Okay, we'll just make more tests and we'll
2625
+ get the confidence.
2626
+
2627
+ 1:20:44.884 --> 1:20:47.596
2628
+ Have to completely agree with that.
2629
+
2630
+ 1:20:47.596 --> 1:20:53.018
2631
+ So when this was done, that was our first
2632
+ idea of using the confidence.
2633
+
2634
+ 1:20:53.018 --> 1:21:00.248
2635
+ The problem is that currently that's my assumption
2636
+ is that the modeling the model confidence is
2637
+
2638
+ 1:21:00.248 --> 1:21:03.939
2639
+ not that easy, and they are often overconfident.
2640
+
2641
+ 1:21:04.324 --> 1:21:17.121
2642
+ In the paper there is this type also where
2643
+ you try to use the confidence in some way to
2644
+
2645
+ 1:21:17.121 --> 1:21:20.465
2646
+ decide the confidence.
2647
+
2648
+ 1:21:21.701 --> 1:21:26.825
2649
+ But that gave worse results, and that's why
2650
+ we looked into that.
2651
+
2652
+ 1:21:27.087 --> 1:21:38.067
2653
+ So it's a very good idea think, but it seems
2654
+ not to at least how it was implemented.
2655
+
2656
+ 1:21:38.959 --> 1:21:55.670
2657
+ There is one way that maybe goes in more direction,
2658
+ which is very new.
2659
+
2660
+ 1:21:55.455 --> 1:22:02.743
2661
+ If this one, the last word is attending mainly
2662
+ to the end of the audio.
2663
+
2664
+ 1:22:02.942 --> 1:22:04.934
2665
+ You might you should not output it yet.
2666
+
2667
+ 1:22:05.485 --> 1:22:15.539
2668
+ Because they might think there is something
2669
+ more missing than you need to know, so they
2670
+
2671
+ 1:22:15.539 --> 1:22:24.678
2672
+ look at the attention and only output parts
2673
+ which look to not the audio signal.
2674
+
2675
+ 1:22:25.045 --> 1:22:40.175
2676
+ So there is, of course, a lot of ways how
2677
+ you can do it better or easier in some way.
2678
+
2679
+ 1:22:41.901 --> 1:22:53.388
2680
+ Instead tries to predict the next word with
2681
+ a large language model, and then for text translation
2682
+
2683
+ 1:22:53.388 --> 1:22:54.911
2684
+ you predict.
2685
+
2686
+ 1:22:55.215 --> 1:23:01.177
2687
+ Then you translate all of them and decide
2688
+ if there is a change so you can even earlier
2689
+
2690
+ 1:23:01.177 --> 1:23:02.410
2691
+ do your decision.
2692
+
2693
+ 1:23:02.362 --> 1:23:08.714
2694
+ The idea is that if we continue and then this
2695
+ will be to a change in the translation, then
2696
+
2697
+ 1:23:08.714 --> 1:23:10.320
2698
+ we should have opened.
2699
+
2700
+ 1:23:10.890 --> 1:23:18.302
2701
+ So it's more doing your estimate about possible
2702
+ continuations of the source instead of looking
2703
+
2704
+ 1:23:18.302 --> 1:23:19.317
2705
+ at previous.
2706
+
2707
+ 1:23:23.783 --> 1:23:31.388
2708
+ All that works is a bit here like one example.
2709
+
2710
+ 1:23:31.388 --> 1:23:39.641
2711
+ It has a legacy baselines and you are not
2712
+ putting.
2713
+
2714
+ 1:23:40.040 --> 1:23:47.041
2715
+ And you see in this case you have worse BLEU
2716
+ scores here.
2717
+
2718
+ 1:23:47.041 --> 1:23:51.670
2719
+ For equal one you have better latency.
2720
+
2721
+ 1:23:52.032 --> 1:24:01.123
2722
+ The how to and how does anybody have an idea
2723
+ of what could be challenging there or when?
2724
+
2725
+ 1:24:05.825 --> 1:24:20.132
2726
+ One problem of these models are hallucinations,
2727
+ and often very long has a negative impact on.
2728
+
2729
+ 1:24:24.884 --> 1:24:30.869
2730
+ If you don't remove the last four words but
2731
+ your model now starts to hallucinate and invent
2732
+
2733
+ 1:24:30.869 --> 1:24:37.438
2734
+ just a lot of new stuff then yeah you're removing
2735
+ the last four words of that but if it has invented
2736
+
2737
+ 1:24:37.438 --> 1:24:41.406
2738
+ ten words and you're still outputting six of
2739
+ these invented.
2740
+
2741
+ 1:24:41.982 --> 1:24:48.672
2742
+ Typically once it starts hallucination generating
2743
+ some output, it's quite long, so then it's
2744
+
2745
+ 1:24:48.672 --> 1:24:50.902
2746
+ no longer enough to just hold.
2747
+
2748
+ 1:24:51.511 --> 1:24:57.695
2749
+ And then, of course, a bit better if you compare
2750
+ to the previous ones.
2751
+
2752
+ 1:24:57.695 --> 1:25:01.528
2753
+ Their destinations are typically different.
2754
+
2755
+ 1:25:07.567 --> 1:25:25.939
2756
+ Yes, so we don't talk about the details, but
2757
+ for outputs, for presentations, there's different
2758
+
2759
+ 1:25:25.939 --> 1:25:27.100
2760
+ ways.
2761
+
2762
+ 1:25:27.347 --> 1:25:36.047
2763
+ So you want to have maximum two lines, maximum
2764
+ forty-two characters per line, and the reading
2765
+
2766
+ 1:25:36.047 --> 1:25:40.212
2767
+ speed is a maximum of twenty-one characters per second.
2768
+
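A tiny sketch that just checks these three constraints (two lines, 42 characters per line, 21 characters per second of display time) for one subtitle block; the function and the example are invented, not an existing tool.

```python
# Sketch: check the usual subtitling constraints for one subtitle block
# (max 2 lines, max 42 characters per line, max 21 characters per second).
def check_subtitle(lines, start_sec, end_sec,
                   max_lines=2, max_chars_per_line=42, max_cps=21):
    problems = []
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines > {max_lines}")
    for line in lines:
        if len(line) > max_chars_per_line:
            problems.append(f"line too long ({len(line)} chars): {line!r}")
    cps = sum(len(l) for l in lines) / max(end_sec - start_sec, 1e-6)
    if cps > max_cps:
        problems.append(f"reading speed {cps:.1f} chars/s > {max_cps}")
    return problems

print(check_subtitle(["We talked about disfluencies and",
                      "simultaneous translation today."], 0.0, 2.0))
```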
2769
+ 1:25:40.981 --> 1:25:43.513
2770
+ How to Do That We Can Skip.
2771
+
2772
+ 1:25:43.463 --> 1:25:46.804
2773
+ Then you can generate something like that.
2774
+
2775
+ 1:25:46.886 --> 1:25:53.250
2776
+ Another challenge is, of course, that you
2777
+ not only need to generate the translation,
2778
+
2779
+ 1:25:53.250 --> 1:25:59.614
2780
+ but for subtitling you also want to generate
2781
+ when to put breaks and what to display.
2782
+
2783
+ 1:25:59.619 --> 1:26:06.234
2784
+ Because it cannot be full sentences, as said
2785
+ here, if you have like maximum forty-two
2786
+
2787
+ 1:26:06.234 --> 1:26:10.443
2788
+ characters per line, that's not always a full
2789
+ sentence.
2790
+
2791
+ 1:26:10.443 --> 1:26:12.247
2792
+ So how can you make it?
2793
+
2794
+ 1:26:13.093 --> 1:26:16.253
2795
+ And then for speech there's not even a hint
2796
+ of wisdom.
2797
+
2798
+ 1:26:18.398 --> 1:26:27.711
2799
+ So what we have done today is yeah, we looked
2800
+ into maybe three challenges: We have this segmentation,
2801
+
2802
+ 1:26:27.711 --> 1:26:33.013
2803
+ which is a challenge both in evaluation and
2804
+ in the decoder.
2805
+
2806
+ 1:26:33.013 --> 1:26:40.613
2807
+ We talked about disfluencies and we talked
2808
+ about simultaneous translations and how to
2809
+
2810
+ 1:26:40.613 --> 1:26:42.911
2811
+ address these challenges.
2812
+
2813
+ 1:26:43.463 --> 1:26:45.507
2814
+ Any more questions.
2815
+
2816
+ 1:26:48.408 --> 1:26:52.578
2817
+ Good then new content.
2818
+
2819
+ 1:26:52.578 --> 1:26:58.198
2820
+ We are done for this semester.
2821
+
2822
+ 1:26:58.198 --> 1:27:04.905
2823
+ You can keep your knowledge in that.
2824
+
2825
+ 1:27:04.744 --> 1:27:09.405
2826
+ Repetition where we can try to repeat a bit
2827
+ what we've done all over the semester.
2828
+
2829
+ 1:27:10.010 --> 1:27:13.776
2830
+ Now prepare a bit of repetition to what think
2831
+ is important.
2832
+
2833
+ 1:27:14.634 --> 1:27:21.441
2834
+ But of course is also the chance for you to
2835
+ ask specific questions.
2836
+
2837
+ 1:27:21.441 --> 1:27:25.445
2838
+ It's not clear to me how things relate.
2839
+
2840
+ 1:27:25.745 --> 1:27:34.906
2841
+ So if you have any specific questions, please
2842
+ come to me or send me an email or so, then
2843
+
2844
+ 1:27:34.906 --> 1:27:36.038
2845
+ I'm happy.
2846
+
2847
+ 1:27:36.396 --> 1:27:46.665
2848
+ If should focus on it really in depth, it
2849
+ might be good not to come and send me an email
2850
+
2851
+ 1:27:46.665 --> 1:27:49.204
2852
+ on Wednesday evening.
2853
+
demo_data/lectures/Lecture-19-21.07.2023/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:627fd6a73ed6853821cd58c2fc9e938a7844998ed51c4163f2d0a4771dc5c156
3
+ size 130103518
demo_data/nips-2021/25957/metadata.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "title": "Shared Independent Component Analysis for Multi-Subject Neuroimaging"
3
+ }
demo_data/nips-2021/25957/transcript_whisper_large-v2.txt ADDED
@@ -0,0 +1,179 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Hi, I'm Hugo Richard, I'm a third year PhD student at Université Paris-Saclay.
2
+ I'm in the Inria Parietal team and my supervisor is Bertrand Thirion.
3
+ Today I'll talk about shared independent component analysis for multi-subject neuroimaging.
4
+ This is a joint work with Pierre Ablin, Alexandre Gramfort, Bertrand Thirion and Aapo Hyvärinen.
5
+ First let us consider two sources that are emitting a signal that is recorded by two
6
+ sensors.
7
+ This can be seen as a simplified model of magnetoencephalography where brain sources
8
+ are recorded by magnetometers.
9
+ Because propagation time can be neglected, the signal recorded by the sensors can be
10
+ seen as a linear mixture of the signal emitted by the sources.
11
+ S is a set of sources that are assumed to be independent.
12
+ X are the recordings and A describes how the sources are mixed to produce the recordings.
13
+ At first sight this model may seem ill-defined because if we permute two columns in A and
14
+ permute the corresponding sources in S, we'll get a new set of sources S' and a new mixing
15
+ matrix A' that describes X just as well as A and S.
16
+ And similarly if we scale the column of A by some constant, one column of A by some
17
+ constant and the corresponding source by the same constant, we'll also get an equivalent
18
+ description of X.
19
+ However, these scale and permutation indeterminacies are the only one if the sources contain at
20
+ most one Gaussian component.
21
+ Let us consider the more general problem where you have multiple subjects that are exposed
22
+ to the same stimuli.
23
+ We have two subjects, X1 and X2, and they have different mixing matrices, A1 and A2,
24
+ and different noise levels, N1 and N2.
25
+ The interpretation is that they have shared sources because they have shared connective
26
+ processes.
27
+ They have different mixing matrices because they have different spatial topography.
28
+ And they have different noises because we want to model inter-subject variability.
29
+ This model is called group ICA.
30
+ There are many methods to provide a solution for the group ICA problem.
31
+ A very popular one introduced by Calhoun in 2001 is to just stack the data of all subjects
32
+ feature-wise and then perform a PCA, a principal component analysis, on the stacked data.
33
+ And therefore you obtain reduced data and apply independent component analysis on the
34
+ reduced data to obtain a set of sources.
35
+ Another formulation is introduced by Varoquaux in 2010 and is called CanICA.
36
+ You just replace the principal component analysis with a multiset CCA, so a multiset canonical
37
+ correlation analysis, where you have to solve a generalized eigenvalue problem.
38
+ There are many different formulations of multiset CCA, but this one with a generalized eigenvalue
39
+ problem is the fastest to solve.
40
+ CanICA and ConcatICA have a lot of advantages.
41
+ First, they are very fast to fit.
42
+ And second, they are simple to implement.
43
+ These are the two reasons why they are so popular in neuroimaging.
44
+ However, they do not optimize the proper likelihood.
45
+ So therefore they do not benefit from advantages of such estimators such as asymptotic efficiency.
46
+ There are a lot of other related work that do optimize the proper likelihood.
47
+ I want to mention the independent vector analysis, which is a very powerful framework introduced
48
+ by Li in 2008.
49
+ So unified approach of Guo in 2008 that we will also mention and talk about later.
50
+ The approach of Shen in 2015 that also allows to perform dimension reduction.
51
+ And the multi-view ICA that was introduced by our team last year.
52
+ I want to quickly say that it's not obvious to design a likelihood-based approach that
53
+ is tractable.
54
+ And with this example of the Gaussian mixture noisy ICA by Bermond and Cardoso, we'll see
55
+ that standard approach leads to intractable algorithms.
56
+ The model we take here is the same as the group ICA, but we assume that the noise is
57
+ Gaussian with the same variance for all subjects.
58
+ We'll also assume that the sources follow a Gaussian mixture model.
59
+ And we further assume that the weights of the Gaussian mixtures are known.
60
+ We can solve such model via expectation maximization.
61
+ And if we write the E-step, we'll get a closed form that involves a large sum.
62
+ Because of this large size, this sum, and therefore the M algorithm is intractable whenever
63
+ Q and K are large.
64
+ Our contribution is shared ICA, what we call Shikha for short, where the data of subject
65
+ i are assumed as a linear mixture of noisy sources, and the noise here is not on the
66
+ sensor, but on the sources.
67
+ The noise is Gaussian with a variance that can be different for each subject and different
68
+ for each component.
69
+ S are assumed to be independent, but in contrast to almost all existing work, some components
70
+ can be Gaussian.
71
+ We have a few blanket assumptions.
72
+ We assume that the data are centered, that the mixing matrices are invertible, that the
73
+ sources have identical variance, and that the number of subjects is greater than 3.
74
+ We have two algorithms to solve the Shikha model.
75
+ We have ShikhaJ, that is a fast algorithm that is based on multiset CCA, and ShikhaML, a
76
+ maximum likelihood approach.
77
+ In Shikha, there are two ways to recover the parameters.
78
+ Either the source are non-Gaussian, in which case we can use classical ICA results to recover
79
+ the unmixing matrices.
80
+ When the components are Gaussian, then we need something else, and what we use here
81
+ is noise diversity.
82
+ When the noise is sufficiently diverse, then it's possible to recover the unmixing matrix
83
+ and the noise covariance up to a permutation and sign indeterminacy.
84
+ Note that the noise diversity in Gaussian components is also a necessary condition.
85
+ If it does not hold, then Shikha cannot be identified.
86
+ Let us now focus on this theorem that is at the core of the ShikhaJ algorithm.
87
+ Namely it shows that we can solve group ICA with multiset CCA.
88
+ So assume the data follows the Shikha model, and consider the multiset CCA framed as a
89
+ generalized eigenvalue problem.
90
+ This generalized eigenvalue problem relies on two matrices, C and D. So C is formed by
91
+ second-order statistics, and D is formed by the diagonal blocks in C.
92
+ And so if we solve this eigenvalue problem and take the first k leading eigenvectors,
93
+ we can recover the correct unmixing matrix from them, up to a permutation and a scaling.
94
+ And this can only be done if the k first eigenvalues are distinct.
95
+ Note that the distinct eigenvalue condition is also necessary.
96
+ If two eigenvalues are the same, then this adds an indeterminacy, and therefore
97
+ we cannot solve group ICA.
98
+ Note also that the condition that some eigenvalues need to be distinct is stronger than the noise
99
+ diversity condition we have in the identifiability theorem.
100
+ And therefore we can exhibit an example which is identifiable, but on which multiset CCA
101
+ will fail.
102
+ And I refer you to the paper for more details on this.
103
+ So in our theorem, in order to recover the correct unmixing matrix, we need to have access
104
+ to the second-order statistics.
105
+ However, in practice, we only have access to them, up to some sampling noise.
106
+ And because the mapping from matrices to eigenvectors is highly non-smooth, a small deviation in
107
+ the second-order statistics can lead to a high deviation of the recovered unmixing matrix.
108
+ Now to show this in practice, we take three subjects, two components, and noise covariance
109
+ matrices with two values, lambda1 and lambda2, that are separated by an eigengap epsilon.
110
+ And we compare the solution of multiset CCA on the true covariance matrices and on the
111
+ perturbed covariance matrix, where the perturbation scale is given by delta.
112
+ And for different values of epsilon, 10-4, 10-3, 10-2, 10-1, we show how the performance
113
+ of the algorithm, so the M-ary distance between the true unmixing matrix and the estimated
114
+ unmixing matrix, varies when the perturbation scale increases.
115
+ And we see that when the eigengap is very close, so 10-4, the violet curve, then even
116
+ with a very small perturbation, you can get to a very bad M-ary distance.
117
+ So the black dashed curve is a performance of chance.
118
+ Luckily, there is a large gap between the k-th eigenvalues and the k plus 1.
119
+ This means that in practice, the span of the p-leading eigenvectors is approximately preserved.
120
+ We can recover the true unmixing matrix from the unmixing matrix estimated by multiset
121
+ CCA, just by multiplying by a matrix Q.
122
+ And in order to estimate Q, we make use of the fact that the unmixed data should have
123
+ a diagonal covariance.
124
+ This leads us to a joint diagonalization problem that we can solve efficiently.
125
+ So if we take the experiments we've done on the previous slide, the results are still
126
+ shown here.
127
+ You can see the violet curves, and that is very sensitive to perturbation.
128
+ And so if we apply joint diagonalization, all these curves move, and they join the dashed
129
+ curve on the bottom.
130
+ And therefore, it's much better, because now the new curves that are represented by the
131
+ dashed line are less sensitive to perturbations.
132
+ So now we've obtained the correct unmixing matrix, but up to a scaling.
133
+ And so we need an additional step to find the correct scaling, and another one to find
134
+ the other parameter that is still unestimated, which are the noise covariance.
135
+ And luckily, it's very easy to find the noise covariance.
136
+ We can do this via an EM algorithm.
137
+ The E-step and the M-step are in closed form, and this yields a very fast algorithm.
138
+ But the Shikha-J is not a maximum likelihood estimator.
139
+ So now we will focus on Shikha-ML, which is our maximum likelihood estimator.
140
+ So I won't go too much into details on this, but we optimize this via an EM using a Gaussian
141
+ mixture assumption as a source.
142
+ We assume that the weights are known.
143
+ What I just want to showcase here is that the E-step of the algorithm, the one that
144
+ gives you the expectation of the sources given the data, and the variance of the sources
145
+ given the data, only involves the sum of size 2.
146
+ So previously we had a sum that had an exponential number of terms, and here we don't have that
147
+ anymore.
148
+ So the E-step is much faster than what we had before, and therefore the EM algorithm
149
+ here is tractable, whereas it was not the case before.
150
+ I first want to present our synthetic experiment where we generate data according to the Shikha-ML
151
+ and Shikha-J model.
152
+ In case A, we have only Gaussian components, but we have noise diversity, and therefore
153
+ methods that use noise diversity to recover the sources such as Shikha-ML and Shikha-J
154
+ perform best.
155
+ In the second case, we have only non-Gaussian components and no noise diversity, so methods
156
+ that use non-Gaussianity perform well such as CanICA, Shikha-ML, or MultiView-ICA.
157
+ And the last case, half of the components are Gaussian with noise diversity, and the
158
+ other half are non-Gaussian but without noise diversity.
159
+ And in this case, only Shikha-ML is able to correctly recover the sources.
160
+ MV-ICA doesn't do that, but it's not as good as Shikha-ML.
161
+ Let us now talk about our experiments on real data.
162
+ We have this reconstruction experiment on fMRI data where subjects are exposed to a
163
+ naturalistic stimuli such as movie watching.
164
+ We use 80% of the movie to learn the unmixing matrices of all subjects, and then on the
165
+ 20% left of the movie, we compute the common sources, and from these common sources computed
166
+ using 80% of the subject, we try to reconstruct the data of the 20% left of the subject.
167
+ We compute the R2 score within regions of interest between the reconstructed data and
168
+ the true data, and plot them as a function of the number of components used.
169
+ As we see, Shikha-ML outperforms all of the methods.
170
+ As a take-home message, Shikha is a powerful framework to extract shared sources.
171
+ Shikha-J is a fast approach to fit the model, but it only uses second-order information.
172
+ In contrast, Shikha-ML is a bit slower, but is able to use non-gaussianity in addition
173
+ to second-order information.
174
+ In practice, Shikha-ML yields the best results.
175
+ The methods we've introduced work on reduced data.
176
+ It would be interesting to know how to reduce the data so that they perform optimally.
177
+ Another way to improve our results would be to learn the density of the shared sources
178
+ in Shikha-ML instead of having them fixed.
179
+ Thanks for listening, and have a good day!
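
To make the noisy-source model and the multiset-CCA step described in this transcript concrete, here is a minimal sketch (not the authors' code; the sizes, Gaussian sources, and noise levels are illustrative assumptions, and the joint-diagonalization, scaling, and EM steps mentioned in the talk are omitted):

```python
# Sketch of the shared-ICA setup X_i = A_i (S + N_i) and the multiset-CCA step
# solved as a generalized eigenvalue problem C w = lambda D w.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
m, k, n = 4, 3, 10_000  # subjects, components, samples (illustrative sizes)

S = rng.standard_normal((k, n))                       # shared sources (Gaussian here,
                                                      # so recovery relies on noise diversity)
A = [rng.standard_normal((k, k)) for _ in range(m)]   # subject-specific mixing matrices
noise_std = rng.uniform(0.2, 1.5, size=(m, k))        # diverse noise per subject and component
X = [A[i] @ (S + noise_std[i][:, None] * rng.standard_normal((k, n))) for i in range(m)]

# C stacks all cross-covariances X_i X_j^T / n; D keeps only its diagonal blocks.
C = np.block([[Xi @ Xj.T / n for Xj in X] for Xi in X])
D = np.zeros_like(C)
for i in range(m):
    D[i * k:(i + 1) * k, i * k:(i + 1) * k] = C[i * k:(i + 1) * k, i * k:(i + 1) * k]

# Keep the k leading generalized eigenvectors; the i-th block of each one
# gives a row of subject i's unmixing matrix, up to permutation and scaling
# (assuming the k leading eigenvalues are distinct).
eigvals, eigvecs = eigh(C, D)
top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
W = [top[i * k:(i + 1) * k, :].T for i in range(m)]

print(np.round(W[0] @ A[0], 2))  # should look roughly like a scaled permutation matrix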
demo_data/nips-2021/25957/transcript_whisper_large-v2.vtt ADDED
@@ -0,0 +1,539 @@
1
+ WEBVTT
2
+
3
+ 00:00.000 --> 00:14.000
4
+ Hi, I'm Hugo Richard, I'm a third year PhD student at Université Paris-Saclay.
5
+
6
+ 00:14.000 --> 00:18.480
7
+ I'm in the Inria Parietal team and my supervisor is Bertrand Thirion.
8
+
9
+ 00:18.480 --> 00:24.600
10
+ Today I'll talk about shared independent component analysis for multi-subject neuroimaging.
11
+
12
+ 00:24.600 --> 00:31.400
13
+ This is a joint work with Pierre Ablin, Alexandre Gramfort, Bertrand Thirion and Aapo Hyvärinen.
14
+
15
+ 00:31.400 --> 00:36.360
16
+ First let us consider two sources that are emitting a signal that is recorded by two
17
+
18
+ 00:36.360 --> 00:37.360
19
+ sensors.
20
+
21
+ 00:37.360 --> 00:43.120
22
+ This can be seen as a simplified model of magnetoencephalography where brain sources
23
+
24
+ 00:43.120 --> 00:46.000
25
+ are recorded by magnetometers.
26
+
27
+ 00:46.000 --> 00:50.200
28
+ Because propagation time can be neglected, the signal recorded by the sensors can be
29
+
30
+ 00:50.200 --> 00:55.840
31
+ seen as a linear mixture of the signal emitted by the sources.
32
+
33
+ 00:55.840 --> 00:59.600
34
+ S is a set of sources that are assumed to be independent.
35
+
36
+ 00:59.600 --> 01:06.400
37
+ X are the recordings and A describes how the sources are mixed to produce the recordings.
38
+
39
+ 01:06.400 --> 01:12.120
40
+ At first sight this model may seem ill-defined because if we permute two columns in A and
41
+
42
+ 01:12.120 --> 01:19.600
43
+ permute the corresponding sources in S, we'll get a new set of sources S' and a new mixing
44
+
45
+ 01:19.600 --> 01:25.360
46
+ matrix A' that describes X just as well as A and S.
47
+
48
+ 01:25.360 --> 01:30.360
49
+ And similarly if we scale the column of A by some constant, one column of A by some
50
+
51
+ 01:30.360 --> 01:34.920
52
+ constant and the corresponding source by the same constant, we'll also get an equivalent
53
+
54
+ 01:34.920 --> 01:35.920
55
+ description of X.
56
+
57
+ 01:35.920 --> 01:44.840
58
+ However, these scale and permutation indeterminacies are the only one if the sources contain at
59
+
60
+ 01:44.840 --> 01:46.840
61
+ most one Gaussian component.
62
+
63
+ 01:46.840 --> 01:52.040
64
+ Let us consider the more general problem where you have multiple subjects that are exposed
65
+
66
+ 01:52.040 --> 01:54.560
67
+ to the same stimuli.
68
+
69
+ 01:54.560 --> 02:00.640
70
+ We have two subjects, X1 and X2, and they have different mixing matrices, A1 and A2,
71
+
72
+ 02:00.640 --> 02:04.560
73
+ and different noise levels, N1 and N2.
74
+
75
+ 02:04.560 --> 02:08.720
76
+ The interpretation is that they have shared sources because they have shared connective
77
+
78
+ 02:08.720 --> 02:09.720
79
+ processes.
80
+
81
+ 02:09.720 --> 02:15.120
82
+ They have different mixing matrices because they have different spatial topography.
83
+
84
+ 02:15.120 --> 02:20.600
85
+ And they have different noises because we want to model inter-subject variability.
86
+
87
+ 02:20.600 --> 02:22.480
88
+ This model is called group ICA.
89
+
90
+ 02:22.480 --> 02:27.840
91
+ There are many methods to provide a solution for the group ICA problem.
92
+
93
+ 02:27.840 --> 02:34.560
94
+ A very popular one introduced by Calhoun in 2001 is to just stack the data of all subjects
95
+
96
+ 02:34.560 --> 02:42.520
97
+ feature-wise and then perform a PCA, a principal component analysis, on the stacked data.
98
+
99
+ 02:42.520 --> 02:47.520
100
+ And therefore you obtain reduced data and apply independent component analysis on the
101
+
102
+ 02:47.520 --> 02:50.520
103
+ reduced data to obtain a set of sources.
104
+
105
+ 02:50.520 --> 02:55.960
106
+ Another formulation is introduced by Varoquaux in 2010 and is called CanICA.
107
+
108
+ 02:55.960 --> 03:01.320
109
+ You just replace the principal component analysis with a multiset CCA, so a multiset canonical
110
+
111
+ 03:01.320 --> 03:06.120
112
+ correlation analysis, where you have to solve a generalized eigenvalue problem.
113
+
114
+ 03:06.120 --> 03:12.800
115
+ There are many different formulations of multiset CCA, but this one with a generalized eigenvalue
116
+
117
+ 03:12.800 --> 03:15.560
118
+ problem is the fastest to solve.
119
+
120
+ 03:15.560 --> 03:17.840
121
+ CanICA and ConcatICA have a lot of advantages.
122
+
123
+ 03:17.840 --> 03:21.000
124
+ First, they are very fast to fit.
125
+
126
+ 03:21.000 --> 03:23.320
127
+ And second, they are simple to implement.
128
+
129
+ 03:23.320 --> 03:26.920
130
+ These are the two reasons why they are so popular in neuroimaging.
131
+
132
+ 03:26.920 --> 03:30.160
133
+ However, they do not optimize the proper likelihood.
134
+
135
+ 03:30.160 --> 03:35.680
136
+ So therefore they do not benefit from advantages of such estimators such as asymptotic efficiency.
137
+
138
+ 03:35.680 --> 03:41.480
139
+ There are a lot of other related work that do optimize the proper likelihood.
140
+
141
+ 03:41.480 --> 03:46.240
142
+ I want to mention the independent vector analysis, which is a very powerful framework introduced
143
+
144
+ 03:46.240 --> 03:48.760
145
+ by Li in 2008.
146
+
147
+ 03:48.760 --> 03:54.560
148
+ So unified approach of Guo in 2008 that we will also mention and talk about later.
149
+
150
+ 03:54.560 --> 04:01.040
151
+ The approach of Shen in 2015 that also allows to perform dimension reduction.
152
+
153
+ 04:01.040 --> 04:08.320
154
+ And the multi-view ICA that was introduced by our team last year.
155
+
156
+ 04:08.320 --> 04:15.200
157
+ I want to quickly say that it's not obvious to design a likelihood-based approach that
158
+
159
+ 04:15.200 --> 04:17.400
160
+ is tractable.
161
+
162
+ 04:17.400 --> 04:23.680
163
+ And with this example of the Gaussian mixture noisy ICA by Bermond and Cardoso, we'll see
164
+
165
+ 04:23.680 --> 04:31.400
166
+ that standard approach leads to intractable algorithms.
167
+
168
+ 04:31.400 --> 04:37.080
169
+ The model we take here is the same as the group ICA, but we assume that the noise is
170
+
171
+ 04:37.080 --> 04:40.120
172
+ Gaussian with the same variance for all subjects.
173
+
174
+ 04:40.120 --> 04:47.600
175
+ We'll also assume that the sources follow a Gaussian mixture model.
176
+
177
+ 04:47.600 --> 04:53.040
178
+ And we further assume that the weights of the Gaussian mixtures are known.
179
+
180
+ 04:53.040 --> 04:56.360
181
+ We can solve such model via expectation maximization.
182
+
183
+ 04:56.360 --> 05:01.400
184
+ And if we write the E-step, we'll get a closed form that involves a large sum.
185
+
186
+ 05:01.400 --> 05:09.040
187
+ Because of this large size, this sum, and therefore the M algorithm is intractable whenever
188
+
189
+ 05:09.040 --> 05:11.600
190
+ Q and K are large.
191
+
192
+ 05:11.600 --> 05:17.520
193
+ Our contribution is shared ICA, what we call Shikha for short, where the data of subject
194
+
195
+ 05:17.520 --> 05:23.080
196
+ i are assumed as a linear mixture of noisy sources, and the noise here is not on the
197
+
198
+ 05:23.080 --> 05:24.080
199
+ sensor, but on the sources.
200
+
201
+ 05:24.080 --> 05:30.000
202
+ The noise is Gaussian with a variance that can be different for each subject and different
203
+
204
+ 05:30.000 --> 05:31.000
205
+ for each component.
206
+
207
+ 05:31.000 --> 05:37.800
208
+ S are assumed to be independent, but in contrast to almost all existing work, some components
209
+
210
+ 05:37.800 --> 05:38.800
211
+ can be Gaussian.
212
+
213
+ 05:38.800 --> 05:41.600
214
+ We have a few blanket assumptions.
215
+
216
+ 05:41.600 --> 05:45.840
217
+ We assume that the data are centered, that the mixing matrices are invertible, that the
218
+
219
+ 05:45.840 --> 05:50.680
220
+ sources have identical variance, and that the number of subjects is greater than 3.
221
+
222
+ 05:50.680 --> 05:54.000
223
+ We have two algorithms to solve the Shikha model.
224
+
225
+ 05:54.000 --> 06:01.520
226
+ We have ShikhaJ, that is a fast algorithm that is based on multiset CCA, and ShikhaML, a
227
+
228
+ 06:01.520 --> 06:04.000
229
+ maximum likelihood approach.
230
+
231
+ 06:04.000 --> 06:07.600
232
+ In Shikha, there are two ways to recover the parameters.
233
+
234
+ 06:07.600 --> 06:12.880
235
+ Either the source are non-Gaussian, in which case we can use classical ICA results to recover
236
+
237
+ 06:12.880 --> 06:15.720
238
+ the unmixing matrices.
239
+
240
+ 06:15.720 --> 06:20.120
241
+ When the components are Gaussian, then we need something else, and what we use here
242
+
243
+ 06:20.120 --> 06:22.480
244
+ is noise diversity.
245
+
246
+ 06:22.480 --> 06:28.320
247
+ When the noise is sufficiently diverse, then it's possible to recover the unmixing matrix
248
+
249
+ 06:28.320 --> 06:34.120
250
+ and the noise covariance up to a permutation and sign indeterminacy.
251
+
252
+ 06:34.120 --> 06:38.240
253
+ Note that the noise diversity in Gaussian components is also a necessary condition.
254
+
255
+ 06:38.240 --> 06:42.680
256
+ If it does not hold, then Shikha cannot be identified.
257
+
258
+ 06:42.680 --> 06:48.520
259
+ Let us now focus on this theorem that is at the core of the ShikhaJ algorithm.
260
+
261
+ 06:48.520 --> 06:53.520
262
+ Namely it shows that we can solve group ICA with multiset CCA.
263
+
264
+ 06:53.520 --> 06:58.880
265
+ So assume the data follows the Shikha model, and consider the multiset CCA framed as a
266
+
267
+ 06:58.880 --> 07:00.920
268
+ generalized eigenvalue problem.
269
+
270
+ 07:00.920 --> 07:08.080
271
+ This generalized eigenvalue problem relies on two matrices, C and D. So C is formed by
272
+
273
+ 07:08.080 --> 07:13.560
274
+ second-order statistics, and D is formed by the diagonal blocks in C.
275
+
276
+ 07:13.560 --> 07:19.880
277
+ And so if we solve this eigenvalue problem and take the first k leading eigenvectors,
278
+
279
+ 07:19.880 --> 07:26.520
280
+ we can recover the correct unmixing matrix from them, up to a permutation and a scaling.
281
+
282
+ 07:26.520 --> 07:32.000
283
+ And this can only be done if the k first eigenvalues are distinct.
284
+
285
+ 07:32.000 --> 07:34.320
286
+ Note that the distinct eigenvalue condition is also necessary.
287
+
288
+ 07:34.320 --> 07:40.480
289
+ If two eigenvalues are the same, then this adds an indeterminacy, and therefore
290
+
291
+ 07:40.480 --> 07:42.280
292
+ we cannot solve group ICA.
293
+
294
+ 07:42.280 --> 07:48.640
295
+ Note also that the condition that some eigenvalues need to be distinct is stronger than the noise
296
+
297
+ 07:48.640 --> 07:54.080
298
+ diversity condition we have in the identifiability theorem.
299
+
300
+ 07:54.080 --> 07:59.360
301
+ And therefore we can exhibit an example which is identifiable, but on which multiset CCA
302
+
303
+ 07:59.360 --> 08:00.360
304
+ will fail.
305
+
306
+ 08:00.360 --> 08:04.800
307
+ And I refer you to the paper for more details on this.
308
+
309
+ 08:04.800 --> 08:10.160
310
+ So in our theorem, in order to recover the correct unmixing matrix, we need to have access
311
+
312
+ 08:10.160 --> 08:12.480
313
+ to the second-order statistics.
314
+
315
+ 08:12.480 --> 08:18.860
316
+ However, in practice, we only have access to them, up to some sampling noise.
317
+
318
+ 08:18.860 --> 08:24.520
319
+ And because the mapping from matrices to eigenvectors is highly non-smooth, a small deviation in
320
+
321
+ 08:24.520 --> 08:31.160
322
+ the second-order statistics can lead to a high deviation of the recovered unmixing matrix.
323
+
324
+ 08:31.160 --> 08:38.080
325
+ Now to show this in practice, we take three subjects, two components, and noise covariance
326
+
327
+ 08:38.080 --> 08:47.440
328
+ matrices with two values, lambda1 and lambda2, that are separated by an eigengap epsilon.
329
+
330
+ 08:47.440 --> 08:52.440
331
+ And we compare the solution of multiset CCA on the true covariance matrices and on the
332
+
333
+ 08:52.440 --> 08:59.520
334
+ perturbed covariance matrix, where the perturbation scale is given by delta.
335
+
336
+ 08:59.520 --> 09:07.240
337
+ And for different values of epsilon, 10-4, 10-3, 10-2, 10-1, we show how the performance
338
+
339
+ 09:07.240 --> 09:14.720
340
+ of the algorithm, so the M-ary distance between the true unmixing matrix and the estimated
341
+
342
+ 09:14.720 --> 09:20.880
343
+ unmixing matrix, varies when the perturbation scale increases.
344
+
345
+ 09:20.880 --> 09:26.600
346
+ And we see that when the eigengap is very close, so 10-4, the violet curve, then even
347
+
348
+ 09:26.600 --> 09:31.440
349
+ with a very small perturbation, you can get to a very bad M-ary distance.
350
+
351
+ 09:31.440 --> 09:35.720
352
+ So the black dashed curve is a performance of chance.
353
+
354
+ 09:35.720 --> 09:41.200
355
+ Luckily, there is a large gap between the k-th eigenvalues and the k plus 1.
356
+
357
+ 09:41.200 --> 09:46.120
358
+ This means that in practice, the span of the p-leading eigenvectors is approximately preserved.
359
+
360
+ 09:46.120 --> 09:53.600
361
+ We can recover the true unmixing matrix from the unmixing matrix estimated by multiset
362
+
363
+ 09:53.600 --> 09:56.520
364
+ CCA, just by multiplying by a matrix Q.
365
+
366
+ 09:56.520 --> 10:02.640
367
+ And in order to estimate Q, we make use of the fact that the unmixed data should have
368
+
369
+ 10:02.640 --> 10:03.640
370
+ a diagonal covariance.
371
+
372
+ 10:03.640 --> 10:09.680
373
+ This leads us to a joint diagonalization problem that we can solve efficiently.
374
+
375
+ 10:09.680 --> 10:14.480
376
+ So if we take the experiments we've done on the previous slide, the results are still
377
+
378
+ 10:14.480 --> 10:15.480
379
+ shown here.
380
+
381
+ 10:15.480 --> 10:21.640
382
+ You can see the violet curves, and that is very sensitive to perturbation.
383
+
384
+ 10:21.640 --> 10:29.360
385
+ And so if we apply joint diagonalization, all these curves move, and they join the dashed
386
+
387
+ 10:29.360 --> 10:30.360
388
+ curve on the bottom.
389
+
390
+ 10:30.360 --> 10:34.720
391
+ And therefore, it's much better, because now the new curves that are represented by the
392
+
393
+ 10:34.720 --> 10:42.920
394
+ dashed line are less sensitive to perturbations.
395
+
396
+ 10:42.920 --> 10:47.920
397
+ So now we've obtained the correct unmixing matrix, but up to a scaling.
398
+
399
+ 10:47.920 --> 10:55.040
400
+ And so we need an additional step to find the correct scaling, and another one to find
401
+
402
+ 10:55.040 --> 11:00.680
403
+ the other parameter that is still unestimated, which are the noise covariance.
404
+
405
+ 11:00.680 --> 11:04.000
406
+ And luckily, it's very easy to find the noise covariance.
407
+
408
+ 11:04.000 --> 11:06.280
409
+ We can do this via an EM algorithm.
410
+
411
+ 11:06.280 --> 11:11.920
412
+ The E-step and the M-step are in closed form, and this yields a very fast algorithm.
413
+
414
+ 11:11.920 --> 11:15.200
415
+ But the Shikha-J is not a maximum likelihood estimator.
416
+
417
+ 11:15.200 --> 11:22.600
418
+ So now we will focus on Shikha-ML, which is our maximum likelihood estimator.
419
+
420
+ 11:22.600 --> 11:31.240
421
+ So I won't go too much into details on this, but we optimize this via an EM using a Gaussian
422
+
423
+ 11:31.240 --> 11:33.480
424
+ mixture assumption as a source.
425
+
426
+ 11:33.480 --> 11:35.960
427
+ We assume that the weights are known.
428
+
429
+ 11:35.960 --> 11:41.480
430
+ What I just want to showcase here is that the E-step of the algorithm, the one that
431
+
432
+ 11:41.480 --> 11:46.000
433
+ gives you the expectation of the sources given the data, and the variance of the sources
434
+
435
+ 11:46.000 --> 11:50.760
436
+ given the data, only involves the sum of size 2.
437
+
438
+ 11:50.760 --> 11:57.320
439
+ So previously we had a sum that had an exponential number of terms, and here we don't have that
440
+
441
+ 11:57.320 --> 11:58.320
442
+ anymore.
443
+
444
+ 11:58.320 --> 12:02.920
445
+ So the E-step is much faster than what we had before, and therefore the EM algorithm
446
+
447
+ 12:02.920 --> 12:07.200
448
+ here is tractable, whereas it was not the case before.
449
+
450
+ 12:07.200 --> 12:11.440
451
+ I first want to present our synthetic experiment where we generate data according to the Shikha-ML
452
+
453
+ 12:11.440 --> 12:13.200
454
+ and Shikha-J model.
455
+
456
+ 12:13.200 --> 12:18.560
457
+ In case A, we have only Gaussian components, but we have noise diversity, and therefore
458
+
459
+ 12:18.560 --> 12:24.240
460
+ methods that use noise diversity to recover the sources such as Shikha-ML and Shikha-J
461
+
462
+ 12:24.240 --> 12:25.240
463
+ perform best.
464
+
465
+ 12:25.240 --> 12:34.000
466
+ In the second case, we have only non-Gaussian components and no noise diversity, so methods
467
+
468
+ 12:34.000 --> 12:41.520
469
+ that use non-Gaussianity perform well such as CanICA, Shikha-ML, or MultiView-ICA.
470
+
471
+ 12:41.520 --> 12:45.200
472
+ And the last case, half of the components are Gaussian with noise diversity, and the
473
+
474
+ 12:45.200 --> 12:49.000
475
+ other half are non-Gaussian but without noise diversity.
476
+
477
+ 12:49.000 --> 12:53.000
478
+ And in this case, only Shikha-ML is able to correctly recover the sources.
479
+
480
+ 12:53.000 --> 12:57.960
481
+ MV-ICA doesn't do that, but it's not as good as Shikha-ML.
482
+
483
+ 12:57.960 --> 13:00.400
484
+ Let us now talk about our experiments on real data.
485
+
486
+ 13:00.400 --> 13:05.080
487
+ We have this reconstruction experiment on fMRI data where subjects are exposed to a
488
+
489
+ 13:05.080 --> 13:07.920
490
+ naturalistic stimuli such as movie watching.
491
+
492
+ 13:07.920 --> 13:15.320
493
+ We use 80% of the movie to learn the unmixing matrices of all subjects, and then on the
494
+
495
+ 13:15.320 --> 13:22.320
496
+ 20% left of the movie, we compute the common sources, and from these common sources computed
497
+
498
+ 13:22.320 --> 13:28.800
499
+ using 80% of the subject, we try to reconstruct the data of the 20% left of the subject.
500
+
501
+ 13:28.800 --> 13:33.880
502
+ We compute the R2 score within regions of interest between the reconstructed data and
503
+
504
+ 13:33.880 --> 13:39.480
505
+ the true data, and plot them as a function of the number of components used.
506
+
507
+ 13:39.480 --> 13:43.000
508
+ As we see, Shikha-ML outperforms all of the methods.
509
+
510
+ 13:43.000 --> 13:47.400
511
+ As a take-home message, Shikha is a powerful framework to extract shared sources.
512
+
513
+ 13:47.400 --> 13:52.840
514
+ Shikha-J is a fast approach to fit the model, but it only uses second-order information.
515
+
516
+ 13:52.840 --> 13:58.800
517
+ In contrast, Shikha-ML is a bit slower, but is able to use non-gaussianity in addition
518
+
519
+ 13:58.800 --> 14:00.960
520
+ to second-order information.
521
+
522
+ 14:00.960 --> 14:03.840
523
+ In practice, Shikha-ML yields the best results.
524
+
525
+ 14:03.840 --> 14:05.960
526
+ The methods we've introduced work on reduced data.
527
+
528
+ 14:05.960 --> 14:11.160
529
+ It would be interesting to know how to reduce the data so that they perform optimally.
530
+
531
+ 14:11.160 --> 14:15.400
532
+ Another way to improve our results would be to learn the density of the shared sources
533
+
534
+ 14:15.400 --> 14:19.480
535
+ in Shikha-ML instead of having them fixed.
536
+
537
+ 14:19.480 --> 14:23.400
538
+ Thanks for listening, and have a good day!
539
+
demo_data/nips-2021/25957/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f0539c1b965a157ce62df522fef5ea03cdec6198f5995fefa04cfddf947861fd
3
+ size 93633719
demo_data/nips-2021/25958/metadata.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "title": "ParK: Sound and Efficient Kernel Ridge Regression by Feature Space Partitions"
3
+ }
demo_data/nips-2021/25958/transcript_whisper_large-v2.txt ADDED
@@ -0,0 +1,124 @@
1
+ Hello everyone, I'm Luigi Carratino, and this is a joint work with Stefano Vigogna,
2
+ Daniele Calandriello, and Lorenzo Rosasco.
3
+ The problem that we study in this work is a standard regression problem, where we want
4
+ to estimate an unknown function f star given n pairs of points, x's and y's, and then
5
+ given n pairs of points, x's and y's, where y's are noisy evaluations of the functions
6
+ f star on the input points axis.
7
+ A well-established method to learn nonlinear functions is kernel ridge regression.
8
+ The basic idea is to map the input points into a higher dimensional space, where linear
9
+ relationships can be learned that then translate in nonlinear ones in the input space.
10
+ To formalize this, we can think about solving a standard empirical risk minimization problem
11
+ regularized over a spatial function which is a reproducing kernel Hilbert space.
12
+ Numerically speaking, the solution of this type of problem boils down to solving a linear
13
+ system. Particularly, we can see here that the linear system is going to be Kc equal
14
+ y, where K is the kernel matrix evaluated in all the pairs of points of the training
15
+ sets, c are the weights that we aim to learn, and y's are the output points.
16
+ We know that this method is optimal from a statistical point of view, but a drawback
17
+ is that it suffers from computational scalability. In fact, in terms of time complexity, if we
18
+ have n training points and we want to solve the linear system directly, we'll have to
19
+ invert the matrix K, and this will cost us n cubed in time.
20
+ Multiple ways of accelerating this process have been proposed over time.
21
+ The first one is to solve the methods iteratively instead of inverting directly the matrix K.
22
+ This allows us to only have matrix vector multiplications, and so the overall cost of
23
+ an iterative method to solve this linear system is going to be Tn squared.
24
+ Another method is the one known as sketching, where we can see this as subsampling the linear
25
+ system, in particular subsampling columns of this linear system, where we can take m
26
+ columns of the linear system uniformly at random to get a smaller one, and the cost
27
+ of this will be m squared n.
28
+ Another method instead is splitting. This allows us to divide the main problem into
29
+ many, in this case Q, subproblems, each one that can be solved independently and so
30
+ potentially can be distributed. So we can have a cost which boils down to n over Q to
31
+ the power of 3.
32
+ Combinations of these methods have been proposed in the literature. In particular, if
33
+ we combine iterating and sketching, we can get a solver that can solve the problem in
34
+ a time complexity of Tmn.
35
+ If instead we combine sketching and splitting, we can get a solver that can be computed
36
+ in m squared times n over Q.
37
+ And in this work, we try to blend all these techniques to derive a new algorithm, which
38
+ we will call PARC, that can achieve a time complexity of Tm times n over Q to the power
39
+ of 2.
40
+ So as we just said, in this work, we propose a new large-scale kernel regression solver
41
+ that combines the computational benefits of iteration, sketching, and splitting.
42
+ Notice, though, that these are approximation techniques and they may come at the cost of
43
+ accuracy. But we are able to show that this new algorithm is able to preserve generalization
44
+ under suitable partitions.
45
+ Now also notice that instead of general splitting, we are going to need to focus on a
46
+ particular type, which is the partitions.
47
+ So we introduce a new principal partition scheme for kernel methods.
48
+ We now look at the difference between data splitting and space partitioning.
49
+ Given a set of points, the procedure of splitting takes groups of points at random and assign
50
+ them to different splits or clusters.
51
+ In this picture, for example, we divide the points in four splits.
52
+ Partitioning instead divides the space in different cells, and then the points are implicitly
53
+ assigned to a particular cluster based on which cell they belong to.
54
+ Notice that with the splitting methods, we don't consider local information while we
55
+ perform the splitting, but we do when we perform partitioning.
56
+ Now, from this picture, the concept of partitioning a space seems pretty straightforward.
57
+ However, when you start considering high dimensional feature space, subtle problems can
58
+ appear.
59
+ So first, as a recap, remember that there are two important spaces to consider in our
60
+ regression problem.
61
+ The input space X with its input space features and the kernel space H with its input space
62
+ features, and the kernel space H, which potentially has many more implicit features.
63
+ Traditionally, partition methods are applied directly to the input space.
64
+ For example, a classical approach is to select a subset of points as centroids and then
65
+ partition the space in cells by assigning each portion of the space to the closest centroid,
66
+ which is called a Voronoi partition.
67
+ Since we are in the input space, closest here is defined according to a simple Euclidean
68
+ distance.
69
+ However, remember that our target function and our whole regression does not happen
70
+ directly on the input data space, but rather on the data mapped in the feature space.
71
+ And after we apply our feature map to the data, the concept of closest and the partition
72
+ can radically change.
73
+ For example, here on the right, we choose a kernel space associated with a cosine similarity
74
+ and again plot how the centroids partition the input space, but this time we chose closest
75
+ according to the new cosine distance.
76
+ The resulting partition is very different from the Euclidean one as it captures the
77
+ non-linearity of the kernel function.
78
+ In the paper, we discuss how this difference can impact the regression and we identified
79
+ sufficient conditions that the partition should satisfy in order to guarantee good generalization
80
+ of the learning process.
81
+ Crucially, we will see that these guarantees depend not on how the input space is partitioned,
82
+ but rather how the feature space is partitioned.
83
+ As a consequence, for our PARC methods, we focus on choosing centroids solely using the
84
+ kernel version of the distance.
85
+ We are now ready to present in more detail how the PARC algorithm works.
86
+ First of all, PARC partitioned the feature space into Q Voronoi cells and the first thing
87
+ to do is to identify the centroids in the feature space that allows us to describe the
88
+ Voronoi cells.
89
+ Then inside each Voronoi cell, we learn a local estimator using an uniterated and sketched
90
+ version of kernel ridge regression.
91
+ And then at prediction time, when a new sample arrives, we can use the Q Voronoi feature
92
+ to identify the new sample.
93
+ We use the local estimator corresponding to the Voronoi cell to which the new points fall
94
+ on.
95
+ The generalization error of standard kernel ridge regression without partitioning can
96
+ be upper bounded by two terms, a bias term and a variance term.
97
+ In our work, we can show that also the generalization error of PARC can be upper bounded by a bias
98
+ term and a variance term.
99
+ But this time, these two terms are weighted and they are weighted by a certain quantity
100
+ that depends on an angle theta, which is the minimum angle between all the subspaces of
101
+ the partitions.
102
+ For example, when all the subspaces are orthogonal between each other, we recover the exact same
103
+ generalization error of standard kernel ridge regression.
104
+ But we are also able to show that for angles which are small enough, we are able to obtain
105
+ a generalization error which is of the same order of standard kernel ridge regression.
106
+ These theoretical results suggest us how to construct a good partition.
107
+ So in particular, PARC selects the Voronoi centroids greedily in order to promote orthogonality
108
+ between the Voronoi cells.
109
+ And in particular, we use the Schur complement to measure the orthogonality.
110
+ We also use the Schur complement to measure the orthogonality of the Voronoi centroids.
111
+ And in particular, we use the Schur complement to measure the orthogonality.
112
+ Given all these ingredients, we are now able to measure the computational complexity of
113
+ PARC, which has a time complexity that is the sum of two terms.
114
+ A first term, q squared n log n, which is the cost of computing the centroids with the
115
+ just mentioned procedure.
116
+ And a second term, q squared n log n, which is the cost of computing the most expensive
117
+ local estimator.
118
+ Empirically, we performed experiments on data set of millions and of billions of points,
119
+ and we compared with the currently fastest global kernel methods and with some other
120
+ splitting kernel methods.
121
+ We can see that PARC is the only method that manages to match the accuracy of the global
122
+ estimator.
123
+ Thank you all for your attention.
124
+ And thank you to the poster for all your questions and more details.
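
As a point of reference for the linear system discussed in this transcript, here is a minimal kernel ridge regression sketch (my own illustration, not the ParK implementation; the RBF kernel, synthetic data, and regularization value are assumptions, and the talk's "Kc = y" is written in the usual regularized form (K + n·lambda·I)c = y):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 500, 1e-3                                   # training size and regularization (illustrative)

X = rng.uniform(-3, 3, size=(n, 1))                  # input points x
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)   # noisy evaluations of f*

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Direct solve of (K + n*lam*I) c = y: O(n^3) time, which is exactly the cost
# that motivates the iteration / sketching / partitioning combinations above.
K = rbf_kernel(X, X)
c = np.linalg.solve(K + n * lam * np.eye(n), y)

X_test = np.linspace(-3, 3, 7)[:, None]
print(np.round(rbf_kernel(X_test, X) @ c, 2))        # predictions of the learned estimator
```

PARC itself would instead partition the feature space into Q Voronoi cells and fit an iterated, sketched estimator of this kind inside each cell; that machinery is not shown here.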
demo_data/nips-2021/25958/transcript_whisper_large-v2.vtt ADDED
@@ -0,0 +1,374 @@
1
+ WEBVTT
2
+
3
+ 00:00.000 --> 00:07.000
4
+ Hello everyone, I'm Luigi Carratino, and this is a joint work with Stefano Vigogna,
5
+
6
+ 00:07.000 --> 00:10.000
7
+ Daniele Calandriello, and Lorenzo Rosasco.
8
+
9
+ 00:10.000 --> 00:16.000
10
+ The problem that we study in this work is a standard regression problem, where we want
11
+
12
+ 00:16.000 --> 00:24.000
13
+ to estimate an unknown function f star given n pairs of points, x's and y's, and then
14
+
15
+ 00:24.000 --> 00:34.000
16
+ given n pairs of points, x's and y's, where y's are noisy evaluations of the function
17
+
18
+ 00:34.000 --> 00:38.000
19
+ f star on the input points x's.
20
+
21
+ 00:41.000 --> 00:46.000
22
+ A well-established method to learn nonlinear functions is kernel ridge regression.
23
+
24
+ 00:46.000 --> 00:53.000
25
+ The basic idea is to map the input points into a higher dimensional space, where linear
26
+
27
+ 00:53.000 --> 00:59.000
28
+ relationships can be learned that then translate in nonlinear ones in the input space.
29
+
30
+ 01:01.000 --> 01:07.000
31
+ To formalize this, we can think about solving a standard empirical risk minimization problem
32
+
33
+ 01:07.000 --> 01:12.000
34
+ regularized over a spatial function which is a reproducing kernel Hilbert space.
35
+
36
+ 01:14.000 --> 01:20.000
37
+ Numerically speaking, the solution of this type of problem boils down to solving a linear
38
+
39
+ 01:20.000 --> 01:26.000
40
+ system. Particularly, we can see here that the linear system is going to be Kc equal
41
+
42
+ 01:26.000 --> 01:33.000
43
+ y, where K is the kernel matrix evaluated in all the pairs of points of the training
44
+
45
+ 01:33.000 --> 01:39.000
46
+ sets, c are the weights that we aim to learn, and y's are the output points.
47
+
48
+ 01:40.000 --> 01:45.000
49
+ We know that this method is optimal from a statistical point of view, but a drawback
50
+
51
+ 01:45.000 --> 01:52.000
52
+ is that it suffers from computational scalability. In fact, in terms of time complexity, if we
53
+
54
+ 01:52.000 --> 01:57.000
55
+ have n training points and we want to solve the linear system directly, we'll have to
56
+
57
+ 01:57.000 --> 02:03.000
58
+ invert the matrix K, and this will cost us n cubed in time.
59
+
60
+ 02:06.000 --> 02:11.000
61
+ Multiple ways of accelerating this process have been proposed over time.
62
+
63
+ 02:11.000 --> 02:17.000
64
+ The first one is to solve the methods iteratively instead of inverting directly the matrix K.
65
+
66
+ 02:18.000 --> 02:25.000
67
+ This allows us to only have matrix vector multiplications, and so the overall cost of
68
+
69
+ 02:25.000 --> 02:30.000
70
+ an iterative method to solve this linear system is going to be Tn squared.
71
+
72
+ 02:31.000 --> 02:39.000
73
+ Another method is the one known as sketching, where we can see this as subsampling the linear
74
+
75
+ 02:39.000 --> 02:46.000
76
+ system, in particular subsampling columns of this linear system, where we can take m
77
+
78
+ 02:46.000 --> 02:52.000
79
+ columns of the linear system uniformly at random to get a smaller one, and the cost
80
+
81
+ 02:52.000 --> 02:55.000
82
+ of this will be m squared n.
83
+
84
+ 02:57.000 --> 03:04.000
85
+ Another method instead is splitting. This allows us to divide the main problem into
86
+
87
+ 03:04.000 --> 03:12.000
88
+ many, in this case Q, subproblems, each one that can be solved independently and so
89
+
90
+ 03:12.000 --> 03:20.000
91
+ potentially can be distributed. So we can have a cost which boils down to n over Q to
92
+
93
+ 03:20.000 --> 03:22.000
94
+ the power of 3.
95
+
96
+ 03:25.000 --> 03:30.000
97
+ Combinations of these methods have been proposed in the literature. In particular, if
98
+
99
+ 03:30.000 --> 03:35.000
100
+ we combine iterating and sketching, we can get a solver that can solve the problem in
101
+
102
+ 03:35.000 --> 03:38.000
103
+ a time complexity of Tmn.
104
+
105
+ 03:40.000 --> 03:47.000
106
+ If instead we combine sketching and splitting, we can get a solver that can be computed
107
+
108
+ 03:47.000 --> 03:51.000
109
+ in m squared times n over Q.
110
+
111
+ 03:51.000 --> 03:59.000
112
+ And in this work, we try to blend all these techniques to derive a new algorithm, which
113
+
114
+ 03:59.000 --> 04:09.000
115
+ we will call PARC, that can achieve a time complexity of Tm times n over Q to the power
116
+
117
+ 04:09.000 --> 04:10.000
118
+ of 2.
119
+
120
+ 04:12.000 --> 04:18.000
121
+ So as we just said, in this work, we propose a new large-scale kernel regression solver
122
+
123
+ 04:18.000 --> 04:22.000
124
+ that combines the computational benefits of iteration, sketching, and splitting.
125
+
126
+ 04:23.000 --> 04:27.000
127
+ Notice, though, that these are approximation techniques and they may come at the cost of
128
+
129
+ 04:27.000 --> 04:35.000
130
+ accuracy. But we are able to show that this new algorithm is able to preserve generalization
131
+
132
+ 04:35.000 --> 04:37.000
133
+ under suitable partitions.
134
+
135
+ 04:38.000 --> 04:44.000
136
+ Now also notice that instead of general splitting, we are going to need to focus on a
137
+
138
+ 04:44.000 --> 04:48.000
139
+ particular type, which is the partitions.
140
+
141
+ 04:48.000 --> 04:53.000
142
+ So we introduce a new principal partition scheme for kernel methods.
143
+
144
+ 04:56.000 --> 05:01.000
145
+ We now look at the difference between data splitting and space partitioning.
146
+
147
+ 05:01.000 --> 05:08.000
148
+ Given a set of points, the procedure of splitting takes groups of points at random and assign
149
+
150
+ 05:08.000 --> 05:10.000
151
+ them to different splits or clusters.
152
+
153
+ 05:10.000 --> 05:14.000
154
+ In this picture, for example, we divide the points in four splits.
155
+
156
+ 05:15.000 --> 05:21.000
157
+ Partitioning instead divides the space in different cells, and then the points are implicitly
158
+
159
+ 05:21.000 --> 05:25.000
160
+ assigned to a particular cluster based on which cell they belong to.
161
+
162
+ 05:27.000 --> 05:32.000
163
+ Notice that with the splitting methods, we don't consider local information while we
164
+
165
+ 05:32.000 --> 05:37.000
166
+ perform the splitting, but we do when we perform partitioning.
167
+
168
+ 05:37.000 --> 05:42.000
169
+ Now, from this picture, the concept of partitioning a space seems pretty straightforward.
170
+
171
+ 05:43.000 --> 05:48.000
172
+ However, when you start considering high dimensional feature space, subtle problems can
173
+
174
+ 05:48.000 --> 05:49.000
175
+ appear.
176
+
177
+ 05:50.000 --> 05:55.000
178
+ So first, as a recap, remember that there are two important spaces to consider in our
179
+
180
+ 05:55.000 --> 05:56.000
181
+ regression problem.
182
+
183
+ 05:57.000 --> 06:04.000
184
+ The input space X with its input space features and the kernel space H with its input space
185
+
186
+ 06:04.000 --> 06:10.000
187
+ features, and the kernel space H, which potentially has many more implicit features.
188
+
189
+ 06:13.000 --> 06:17.000
190
+ Traditionally, partition methods are applied directly to the input space.
191
+
192
+ 06:18.000 --> 06:24.000
193
+ For example, a classical approach is to select a subset of points as centroids and then
194
+
195
+ 06:24.000 --> 06:30.000
196
+ partition the space in cells by assigning each portion of the space to the closest centroid,
197
+
198
+ 06:30.000 --> 06:32.000
199
+ which is called a Voronoi partition.
200
+
201
+ 06:32.000 --> 06:38.000
202
+ Since we are in the input space, closest here is defined according to a simple Euclidean
203
+
204
+ 06:38.000 --> 06:39.000
205
+ distance.
206
+
207
+ 06:40.000 --> 06:45.000
208
+ However, remember that our target function and our whole regression does not happen
209
+
210
+ 06:45.000 --> 06:51.000
211
+ directly on the input data space, but rather on the data mapped in the feature space.
212
+
213
+ 06:52.000 --> 06:58.000
214
+ And after we apply our feature map to the data, the concept of closest and the partition
215
+
216
+ 06:58.000 --> 06:59.000
217
+ can radically change.
218
+
219
+ 06:59.000 --> 07:05.000
220
+ For example, here on the right, we choose a kernel space associated with a cosine similarity
221
+
222
+ 07:06.000 --> 07:12.000
223
+ and again plot how the centroids partition the input space, but this time we chose closest
224
+
225
+ 07:12.000 --> 07:14.000
226
+ according to the new cosine distance.
227
+
228
+ 07:15.000 --> 07:20.000
229
+ The resulting partition is very different from the Euclidean one as it captures the
230
+
231
+ 07:20.000 --> 07:22.000
232
+ non-linearity of the kernel function.
233
+
234
+ 07:22.000 --> 07:28.000
235
+ In the paper, we discuss how this difference can impact the regression and we identified
236
+
237
+ 07:28.000 --> 07:34.000
238
+ sufficient conditions that the partition should satisfy in order to guarantee good generalization
239
+
240
+ 07:34.000 --> 07:35.000
241
+ of the learning process.
242
+
243
+ 07:37.000 --> 07:43.000
244
+ Crucially, we will see that these guarantees depend not on how the input space is partitioned,
245
+
246
+ 07:43.000 --> 07:45.000
247
+ but rather how the feature space is partitioned.
248
+
249
+ 07:45.000 --> 07:51.000
250
+ As a consequence, for our PARC methods, we focus on choosing centroids solely using the
251
+
252
+ 07:51.000 --> 07:53.000
253
+ kernel version of the distance.
254
+
255
+ 07:57.000 --> 08:00.000
256
+ We are now ready to present in more detail how the PARC algorithm works.
257
+
258
+ 08:01.000 --> 08:07.000
259
+ First of all, PARC partitioned the feature space into Q Voronoi cells and the first thing
260
+
261
+ 08:07.000 --> 08:16.000
262
+ to do is to identify the centroids in the feature space that allows us to describe the
263
+
264
+ 08:16.000 --> 08:17.000
265
+ Voronoi cells.
266
+
267
+ 08:19.000 --> 08:25.000
268
+ Then inside each Voronoi cell, we learn a local estimator using an iterated and sketched
269
+
270
+ 08:25.000 --> 08:27.000
271
+ version of kernel ridge regression.
272
+
273
+ 08:30.000 --> 08:36.000
274
+ And then at prediction time, when a new sample arrives, we can use the Q Voronoi feature
275
+
276
+ 08:36.000 --> 08:38.000
277
+ to identify the new sample.
278
+
279
+ 08:40.000 --> 08:47.000
280
+ We use the local estimator corresponding to the Voronoi cell to which the new points fall
281
+
282
+ 08:47.000 --> 08:48.000
283
+ on.
284
+
285
+ 08:52.000 --> 08:57.000
286
+ The generalization error of standard kernel ridge regression without partitioning can
287
+
288
+ 08:57.000 --> 09:02.000
289
+ be upper bounded by two terms, a bias term and a variance term.
290
+
291
+ 09:02.000 --> 09:10.000
292
+ In our work, we can show that also the generalization error of PARC can be upper bounded by a bias
293
+
294
+ 09:10.000 --> 09:11.000
295
+ term and a variance term.
296
+
297
+ 09:11.000 --> 09:16.000
298
+ But this time, these two terms are weighted and they are weighted by a certain quantity
299
+
300
+ 09:16.000 --> 09:25.000
301
+ that depends on an angle theta, which is the minimum angle between all the subspaces of
302
+
303
+ 09:25.000 --> 09:26.000
304
+ the partitions.
305
+
306
+ 09:26.000 --> 09:33.000
307
+ For example, when all the subspaces are orthogonal between each other, we recover the exact same
308
+
309
+ 09:33.000 --> 09:36.000
310
+ generalization error of standard kernel ridge regression.
311
+
312
+ 09:38.000 --> 09:45.000
313
+ But we are also able to show that for angles which are small enough, we are able to obtain
314
+
315
+ 09:45.000 --> 09:50.000
316
+ a generalization error which is of the same order of standard kernel ridge regression.
317
+
318
+ 09:50.000 --> 09:54.000
319
+ These theoretical results suggest us how to construct a good partition.
320
+
321
+ 09:54.000 --> 10:00.000
322
+ So in particular, PARC selects the Voronoi centroids greedily in order to promote orthogonality
323
+
324
+ 10:00.000 --> 10:01.000
325
+ between the Voronoi cells.
326
+
327
+ 10:01.000 --> 10:06.000
328
+ And in particular, we use the Schur complement to measure the orthogonality.
329
+
330
+ 10:10.000 --> 10:16.000
331
+ We also use the Schur complement to measure the orthogonality of the Voronoi centroids.
332
+
333
+ 10:16.000 --> 10:20.000
334
+ And in particular, we use the Schur complement to measure the orthogonality.
335
+
336
+ 10:24.000 --> 10:28.000
337
+ Given all these ingredients, we are now able to measure the computational complexity of
338
+
339
+ 10:28.000 --> 10:32.000
340
+ PARC, which has a time complexity that is the sum of two terms.
341
+
342
+ 10:33.000 --> 10:40.000
343
+ A first term, q squared n log n, which is the cost of computing the centroids with the
344
+
345
+ 10:40.000 --> 10:41.000
346
+ just mentioned procedure.
347
+
348
+ 10:41.000 --> 10:46.000
349
+ And a second term, q squared n log n, which is the cost of computing the most expensive
350
+
351
+ 10:46.000 --> 10:47.000
352
+ local estimator.
353
+
354
+ 10:51.000 --> 10:57.000
355
+ Empirically, we performed experiments on data set of millions and of billions of points,
356
+
357
+ 10:57.000 --> 11:01.000
358
+ and we compared with the currently fastest global kernel methods and with some other
359
+
360
+ 11:01.000 --> 11:02.000
361
+ splitting kernel methods.
362
+
363
+ 11:03.000 --> 11:08.000
364
+ We can see that PARC is the only method that manages to match the accuracy of the global
365
+
366
+ 11:08.000 --> 11:11.000
367
+ estimator.
368
+
369
+ 11:11.000 --> 11:13.000
370
+ Thank you all for your attention.
371
+
372
+ 11:13.000 --> 11:40.000
373
+ And thank you to the poster for all your questions and more details.
374
+
demo_data/nips-2021/25958/video.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fefd926545331be9df0497e824634fa23129d26c9c9e7fdbe67c0382b98b4556
3
+ size 22931245
demo_data/nips-2021/25959/metadata.json ADDED
@@ -0,0 +1,3 @@
1
+ {
2
+ "title": "Adversarial Feature Desensitization"
3
+ }
demo_data/nips-2021/25959/transcript_whisper_large-v2.txt ADDED
@@ -0,0 +1,117 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ Hello, my name is Pouya Bashivan and I'm going to tell you about our paper titled
2
+ Adversarial Feature Desensitization. This is joint work with a number of wonderful collaborators
3
+ at Mila, University of Montreal and McGill University, including Reza Bayat, Adam Ibrahim,
4
+ Kartik Ahuja, Mojtaba Faramarzi, Touraj Laleh, Blake Richards and Irina Rish. A common assumption in
5
+ machine learning is that the train and test samples come from the same distribution.
6
+ While this is a reasonable assumption under most circumstances, it is intentionally violated in the
7
+ regime of adversarial attacks. Adversarial attacks are algorithms that search for slight input
8
+ perturbations that cause the input to be misclassified. In the case of white box attacks,
9
+ the model itself is transparent to the attacker and the attacker uses it to identify the possible
10
+ inputs that would lead to misclassifications. A famous example of this is the image of a panda
11
+ that when perturbed with imperceptible noise, alters the model's prediction from a panda to a
12
+ gibbon. As prior literature has shown, this is a common issue in almost all machine learning methods
13
+ and unless the classifier is specifically trained to be robust against these attacks,
14
+ the attacks could completely break down the classifier's performance.
15
+ This issue becomes even more critical when we consider the vast usage of these machine learning
16
+ systems in our societies. For example, the possible security concerns that arise in face
17
+ recognition systems prone to adversarial attacks or the safety in autonomous driving systems.
18
+ So what is an adversarial attack? To formally define the adversarial attacks, let's assume a
19
+ feature learning function f that projects inputs x to a latent space or feature space z
20
+ and a classifier that uses the latent code z to predict the correct class label y hat.
21
+ The perturbation function or the attack generates a perturbed sample x prime
22
+ within the epsilon neighborhood of the input x, which we're showing here as B of x and epsilon,
22
+ by maximizing the classification objective, the opposite of how we normally optimize the classifier's
23
+ parameters. Many methods have been proposed to defend the models against adversarial attacks.
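In symbols, the attack just described is usually written as a constrained maximization of the classification loss L (a standard formulation, consistent with the notation above):

```latex
\[
  x' \;=\; \arg\max_{\tilde{x} \,\in\, B(x,\varepsilon)} \;
  \mathcal{L}\bigl(c(f(\tilde{x})),\, y\bigr),
\]
% i.e. the attacker searches the epsilon-ball around x for the perturbation that most
% increases the loss -- the opposite direction of the usual training update.
```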
25
+ Two of these methods that have withstood the test of time so far are the adversarial training
26
+ by Madry et al., which proposes a defense method by solving a minimax optimization problem
27
+ that involves finding an adversarial input by maximizing the classification loss in the inner
28
+ loop, followed by training the classifier to minimize the classification loss on these adversarial inputs.
29
+ This procedure is graphically shown for two hypothetical classes in the diagram on this slide.
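The minimax problem referred to here is commonly written as:

```latex
\[
  \min_{\theta}\; \mathbb{E}_{(x,y)\sim D}
  \Bigl[\, \max_{x' \in B(x,\varepsilon)} \mathcal{L}\bigl(h_\theta(x'),\, y\bigr) \Bigr],
\]
% inner maximization: find the adversarial input; outer minimization: train on it.
```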
30
+ The adversarial training method essentially learns to separate the distributions of adversarial
31
+ examples belonging to different classes. The second method is the trades method by Zhang et al,
32
+ which proposes to push the decision boundary of the classifier away from the data.
33
+ Trades achieves this by introducing a regularization term to the original learning
34
+ objective for classification that penalizes the mismatch between the predicted label
35
+ for the clean and perturbed inputs. The diagram on the right side again graphically illustrates
36
+ this procedure, where now the defense method learns to separate the distributions of clean examples
37
+ belonging to different classes while minimizing the loss of the classifier.
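For reference, the TRADES objective described here is commonly written as a clean-loss term plus a boundary-pushing regularizer, with β trading off the two:

```latex
\[
  \min_{\theta}\; \mathbb{E}_{(x,y)}
  \Bigl[\, \mathcal{L}\bigl(h_\theta(x),\, y\bigr)
  \;+\; \beta \max_{x' \in B(x,\varepsilon)}
  \mathrm{KL}\bigl(h_\theta(x)\,\|\,h_\theta(x')\bigr) \Bigr],
\]
% the KL term penalizes any mismatch between the predictions on clean and perturbed inputs.
```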
38
+ The third method, by Wang et al, similarly pushes the decision boundary of the classifier away
39
+ from the adversarial inputs while minimizing the classification loss. Now, in domain adaptation,
40
+ we typically have a classifier trained on a source domain, but we want the classifier to also perform the same task on a related target
47
+ domain that we might not have enough data for, or for which the procedure for sampling new
48
+ data might be expensive. The domain adaptation theory proposed by Ben David et al answers the
49
+ question of under what conditions can we adapt a classifier trained on the source domain for use
50
+ in the target domain. Here we consider the original clean distributions as the source domain and the
51
+ distribution of adversarial images generated from those images as the target domain. Although here
52
+ the target domain continuously evolves because the adversarial examples are based on the current
53
+ state of the model at each time step. And similar to the domain adaptation theory, our goal here
54
+ is to learn how to perform well on both source and target domains, meaning the natural and
55
+ adversarial domains. Now before I tell you about our proposed method, let's dive a bit deeper into
56
+ what the domain adaptation theory from Ben David et al states. Similar to before, let's assume a
57
+ feature learning function f that projects inputs x to latent space or feature space z and the
58
+ classifier that predicts the correct label y, y hat, from those latent codes. Now consider natural
59
+ and adversarial examples as input domains dx and d' x and their induced feature distributions
60
+ which go through the f function as dz and d' z. Also consider epsilon z and epsilon' z
61
+ as the classification error over the domains dz and d' z, what we are going to refer to as the
62
+ clean accuracy and the adversarial accuracy. The domain adaptation theory now gives a bound
63
+ on the adversarial error in terms of the natural error and the distance between the two domains.
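Schematically, the bound referred to here has the familiar domain-adaptation form (λ collects the error of the best joint hypothesis and is treated as small):

```latex
\[
  \epsilon'_{Z}(h) \;\le\; \epsilon_{Z}(h)
  \;+\; \tfrac{1}{2}\, d_{\mathcal{H}\Delta\mathcal{H}}\bigl(D_Z,\, D'_Z\bigr)
  \;+\; \lambda .
\]
```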
64
+ Fortunately, from the prior work, we know that h delta h distance, which measures the distance
65
+ between two domains, can be estimated using the classifier trained to discriminate between the
66
+ two domains. Now our defense method called adversarial feature desensitization essentially
67
+ minimizes the bound on the adversarial error epsilon' z using a three-step procedure which
68
+ has some conceptual similarities with prior work on adversarial domain adaptation from Ganin et al.
69
+ For this, we first update the parameters theta and phi in the feature learning function f and
70
+ task classifier c to minimize the classification loss on the natural domain. This is shown with
71
+ green arrows and green boxes marked 1 on both the equation and on the diagram.
72
+ Secondly, we estimate the h delta h distance using an additional domain discriminator
73
+ network that predicts the domain identity from the latent code z. We update the domain
74
+ discriminator parameters psi to minimize the domain classification loss. And finally,
75
+ in the third step, we update the feature learning network parameters theta to maximize the domain
76
+ classification loss in an adversarial way. These two steps are marked with red arrows in the figure
77
+ and red boxes on the equation. Similar to previous two methods, adversarial training and trades that
78
+ I showed you, we here we can also graphically demonstrate this procedure. In our method AFD,
79
+ we learn to separate the classes from the distributions of clean examples while at the
80
+ same time we optimize a domain classifier that learns the boundary between the clean and adversarial
81
+ examples for each class. And finally, we push the adversarial examples to the opposite side of that
82
+ boundary. This procedure implicitly desensitizes the learned features to adversarial perturbations
83
+ and hence the name adversarial feature desensitization. We tested our method on four
84
+ data sets and compared them with a number of other baselines including with adversarial training and
85
+ trades. We made two versions of our method called AFDTCGAN that uses the adversarial losses from
86
+ Goodfellow et al and AFDWGAN that uses the Wasserstein losses from Arjovski and Goodtuner.
87
+ In the table, we evaluated all methods on several white box and black box attacks with
88
+ nominal strengths into each data set. Overall, our method AFD and especially AFDWGAN showed superior
89
+ performance against most attacks in most data sets. However, AFD was behind trades on several attacks
90
+ especially on CIFAR-100 and TinyImageNet data set that had more classes in it.
91
+ We also looked into different attack methods and attack strengths, which we controlled with the parameter
92
+ epsilon. The diagrams on the right show the robust accuracy for each defense method across
93
+ eight attack methods and various epsilon values for each of them. Overall, our results in these
94
+ diagrams showed that AFD's robustness generalizes better than the baselines across attacks and
95
+ across attack strengths. To quantify these differences, we also computed the area under
96
+ the curve for each method for each attack and summarized them in a table on the left.
97
+ As you can see, AFD's robust performance generalizes better to unseen and stronger attacks
98
+ compared to other baselines. If you remember from previous slides, the domain adaptation theory
99
+ predicted a bound on the adversarial error which can also be turned into a bound on the generalization
100
+ gap between natural and adversarial attacks. We empirically tested this prediction in our experiments
101
+ under two settings. Under the first setting, we varied the epsilon value for the PGDL-infinity
102
+ attack which was used during the training. And under the second setting, we used a diverse set of
103
+ attacks and various attack strengths for each of them.
104
+ And under both scenarios, we found that the domain discriminator, which was originally trained on a
105
+ particular attack and attack strength, in our case it was PGDL-infinity attack with a fixed epsilon
106
+ for each data set, could well predict the generalization gap to unseen attacks and
107
+ different attack magnitudes. This suggests that the adversarial training against a domain classifier
108
+ like that used in our proposed method could potentially lead to robust models with better
109
+ generalization capacity. Finally, while we showed that AFD generalizes well to most other attacks
110
+ and attack strengths, it occasionally was worse compared to other baselines, especially in data
111
+ sets with more classes like Tiny ImageNet. This could potentially be due to the difficulty of training
112
+ domain classifiers in these data sets and leaves much space for future work on
113
+ investigating the effect of domain classifiers on the robustness of feature learning functions.
114
+ Also, AFD required more backward computations compared to some of the other baselines
115
+ such as adversarial training, and as a result, its training time was on average about 31%
116
+ longer than adversarial training. We invite you to read our paper for more details and please
117
+ get in touch with us if you have any questions. Thanks for watching this video and we hope you enjoyed it.
demo_data/nips-2021/25959/transcript_whisper_large-v2.vtt ADDED
@@ -0,0 +1,353 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ WEBVTT
2
+
3
+ 00:00.000 --> 00:13.120
4
+ Hello, my name is Pouya Bashivan and I'm going to tell you about our paper titled
5
+
6
+ 00:13.120 --> 00:18.720
7
+ Adversarial Feature Desensitization. This is joint work with a number of wonderful collaborators
8
+
9
+ 00:18.720 --> 00:24.400
10
+ at Mila, University of Montreal and McGill University, including Reza Bayat, Adam Ibrahim,
11
+
12
+ 00:24.400 --> 00:32.160
13
+ Kartik Ahuja, Mojtaba Faramarzi, Touraj Laleh, Blake Richards and Irina Rish. A common assumption in
14
+
15
+ 00:32.160 --> 00:36.560
16
+ machine learning is that the train and test samples come from the same distribution.
17
+
18
+ 00:37.200 --> 00:42.960
19
+ While this is a reasonable assumption under most circumstances, it is intentionally violated in the
20
+
21
+ 00:42.960 --> 00:49.600
22
+ regime of adversarial attacks. Adversarial attacks are algorithms that search for slight input
23
+
24
+ 00:49.600 --> 00:55.600
25
+ perturbations that cause the input to be misclassified. In the case of white box attacks,
26
+
27
+ 00:55.600 --> 01:01.600
28
+ the model itself is transparent to the attacker and the attacker uses it to identify the possible
29
+
30
+ 01:01.600 --> 01:07.760
31
+ inputs that would lead to misclassifications. A famous example of this is the image of a panda
32
+
33
+ 01:07.760 --> 01:13.360
34
+ that when perturbed with imperceptible noise, alters the model's prediction from a panda to a
35
+
36
+ 01:13.360 --> 01:19.840
37
+ gibbon. As prior literature has shown, this is a common issue in almost all machine learning methods
38
+
39
+ 01:19.840 --> 01:25.280
40
+ and unless the classifier is specifically trained to be robust against these attacks,
41
+
42
+ 01:25.280 --> 01:28.720
43
+ the attacks could completely break down the classifier's performance.
44
+
45
+ 01:30.240 --> 01:35.600
46
+ This issue becomes even more critical when we consider the vast usage of these machine learning
47
+
48
+ 01:35.600 --> 01:41.040
49
+ systems in our societies. For example, the possible security concerns that rise in face
50
+
51
+ 01:41.040 --> 01:46.720
52
+ recognition systems prone to adversarial attacks or the safety in autonomous driving systems.
53
+
54
+ 01:48.080 --> 01:54.000
55
+ So what is an adversarial attack? To formally define the adversarial attacks, let's assume a
56
+
57
+ 01:54.000 --> 02:00.080
58
+ feature learning function f that projects inputs x to latent space with feature space z
59
+
60
+ 02:01.600 --> 02:08.720
61
+ and a classifier that uses the latent code z to predict the correct class label y hat.
62
+
63
+ 02:08.720 --> 02:14.480
64
+ The perturbation function or the attack generates a perturbed sample x prime
65
+
66
+ 02:14.480 --> 02:21.520
67
+ within the epsilon neighborhood of the input x, which we're showing here as b of x and epsilon.
68
+
69
+ 02:22.160 --> 02:28.880
70
+ By maximizing the classification objective, the opposite of how we normally optimize the classifier's
71
+
72
+ 02:28.880 --> 02:36.720
73
+ parameter. Many methods have been proposed to defend the models against adversarial attacks.
74
+
75
+ 02:36.720 --> 02:42.640
76
+ Two of these methods that have withstood the test of time so far are the adversarial training
77
+
78
+ 02:43.200 --> 02:50.160
79
+ by Madry et al., which proposes a defense method by solving a minimax optimization problem
80
+
81
+ 02:50.160 --> 02:56.000
82
+ that involves finding an adversarial input by maximizing the classification loss in the inner
83
+
84
+ 02:56.000 --> 03:03.840
85
+ loop followed by a classifier training to minimizing the classifier loss on these adversarial inputs.
86
+
87
+ 03:03.840 --> 03:09.920
88
+ This procedure is graphically shown for two hypothetical classes in the diagram on this slide.
89
+
90
+ 03:10.560 --> 03:15.440
91
+ The adversarial training method essentially learns to separate the distributions of adversarial
92
+
93
+ 03:15.440 --> 03:22.400
94
+ examples belonging to different classes. The second method is the trades method by Zhang et al,
95
+
96
+ 03:22.400 --> 03:27.440
97
+ which proposes to push the decision boundary of the classifier away from the data.
98
+
99
+ 03:27.440 --> 03:32.480
100
+ Trades achieves this by introducing a regularization term to the original learning
101
+
102
+ 03:32.480 --> 03:38.320
103
+ objective for classification that penalizes the mismatch between the predicted label
104
+
105
+ 03:38.320 --> 03:44.400
106
+ for the clean and perturbed inputs. The diagram on the right side again graphically illustrates
107
+
108
+ 03:44.400 --> 03:50.000
109
+ this procedure, where now the defense method learns to separate the distributions of clean examples
110
+
111
+ 03:50.000 --> 03:54.400
112
+ belonging to different classes while minimizing the loss of the classifier.
113
+
114
+ 03:54.400 --> 04:39.920
115
+ The third method, by Wang et al, similarly pushes the decision boundary of the classifier
116
+ away from the adversarial inputs while minimizing the classification loss.
117
+
118
+ 04:39.920 --> 04:52.160
119
+ Now, in domain adaptation, we typically have a classifier trained on a source
120
+ domain, but we want the classifier to also perform the same task on a related target
140
+
141
+ 04:52.160 --> 05:00.960
142
+ domain that we might not have enough data for or that the generating procedure for sampling
143
+
144
+ 05:00.960 --> 05:09.440
145
+ domain might be expensive. The domain adaptation theory proposed by Ben David et al answers the
146
+
147
+ 05:09.440 --> 05:15.840
148
+ question of under what conditions can we adapt a classifier trained on the source domain for use
149
+
150
+ 05:15.840 --> 05:23.920
151
+ in the target domain. Here we consider the original clean distributions as the source domain and the
152
+
153
+ 05:23.920 --> 05:31.280
154
+ distribution of adversarial images generated from those images as the target domain. Although here
155
+
156
+ 05:31.280 --> 05:38.240
157
+ the target domain continuously evolves because the adversarial examples are based on the current
158
+
159
+ 05:38.240 --> 05:46.000
160
+ state of the model at each time step. And similar to the domain adaptation theory, our goal here
161
+
162
+ 05:46.000 --> 05:52.960
163
+ is to learn how to perform well on both source and target domains, meaning the natural and
164
+
165
+ 05:52.960 --> 06:02.240
166
+ adversarial domains. Now before I tell you about our proposed method, let's dive a bit deeper into
167
+
168
+ 06:02.240 --> 06:08.960
169
+ what the domain adaptation theory from Ben David et al states. Similar to before, let's assume a
170
+
171
+ 06:08.960 --> 06:14.880
172
+ feature learning function f that projects inputs x to latent space or feature space z and the
173
+
174
+ 06:14.880 --> 06:23.040
175
+ classifier that predicts the correct label y, y hat, from those latent codes. Now consider natural
176
+
177
+ 06:23.040 --> 06:31.440
178
+ and adversarial examples as input domains dx and d' x and their induced feature distributions
179
+
180
+ 06:31.440 --> 06:42.560
181
+ which go through the f function as dz and d' z. Also consider epsilon z and epsilon' z
182
+
183
+ 06:42.560 --> 06:50.320
184
+ as the classification error over the domains dz and d' z, what we are going to refer to as the
185
+
186
+ 06:50.320 --> 06:58.880
187
+ clean accuracy and the adversarial accuracy. The domain adaptation theory now gives a bound
188
+
189
+ 06:58.880 --> 07:04.320
190
+ on the adversarial error in terms of the natural error and the distance between the two domains.
191
+
192
+ 07:05.120 --> 07:11.680
193
+ Fortunately, from the prior work, we know that h delta h distance, which measures the distance
194
+
195
+ 07:11.680 --> 07:17.440
196
+ between two domains, can be estimated using the classifier trained to discriminate between the
197
+
198
+ 07:17.440 --> 07:26.080
199
+ two domains. Now our defense method called adversarial feature desensitization essentially
200
+
201
+ 07:26.080 --> 07:34.720
202
+ minimizes the bound on the adversarial error epsilon' z using a three-step procedure which
203
+
204
+ 07:34.720 --> 07:40.560
205
+ has some conceptual similarities with prior work on adversarial domain adaptation from Ganin et al.
206
+
207
+ 07:42.240 --> 07:49.280
208
+ For this, we first update the parameters theta and phi in the feature learning function f and
209
+
210
+ 07:49.280 --> 07:56.320
211
+ task classifier c to minimize the classification loss on the natural domain. This is shown with
212
+
213
+ 07:56.320 --> 08:01.920
214
+ green arrows and green boxes marked 1 on both the equation and on the diagram.
215
+
216
+ 08:04.000 --> 08:10.400
217
+ Secondly, we estimate the h delta h distance using an additional domain discriminator
218
+
219
+ 08:10.960 --> 08:17.600
220
+ network that predicts the domain identity from the latent code z. We update the domain
221
+
222
+ 08:17.600 --> 08:24.720
223
+ discriminator parameters psi to minimize the domain classification loss. And finally,
224
+
225
+ 08:24.720 --> 08:31.680
226
+ in the third step, we update the feature learning network parameters theta to maximize the domain
227
+
228
+ 08:31.680 --> 08:39.600
229
+ classification loss in an adversarial way. These two steps are marked with red arrows in the figure
230
+
231
+ 08:39.600 --> 08:48.960
232
+ and red boxes on the equation. Similar to previous two methods, adversarial training and trades that
233
+
234
+ 08:48.960 --> 08:55.760
235
+ I showed you, we here we can also graphically demonstrate this procedure. In our method AFD,
236
+
237
+ 08:55.760 --> 09:01.040
238
+ we learn to separate the classes from the distributions of clean examples while at the
239
+
240
+ 09:01.040 --> 09:07.840
241
+ same time we optimize a domain classifier that learns the boundary between the clean and adversarial
242
+
243
+ 09:07.840 --> 09:14.560
244
+ examples for each class. And finally, we push the adversarial examples to the opposite side of that
245
+
246
+ 09:14.560 --> 09:22.400
247
+ boundary. This procedure implicitly desensitizes the learned features to adversarial perturbations
248
+
249
+ 09:22.400 --> 09:30.480
250
+ and hence the name adversarial feature desensitization. We tested our method on four
251
+
252
+ 09:30.480 --> 09:35.840
253
+ data sets and compared them with a number of other baselines including with adversarial training and
254
+
255
+ 09:35.840 --> 09:43.760
256
+ trades. We made two versions of our method called AFDTCGAN that uses the adversarial losses from
257
+
258
+ 09:43.760 --> 09:50.880
259
+ Goodfellow et al and AFDWGAN that uses the Wasserstein losses from Arjovski and Goodtuner.
260
+
261
+ 09:52.000 --> 09:57.840
262
+ In the table, we evaluated all methods on several white box and black box attacks with
263
+
264
+ 09:57.840 --> 10:07.360
265
+ nominal strengths into each data set. Overall, our method AFD and especially AFDWGAN showed superior
266
+
267
+ 10:07.360 --> 10:15.200
268
+ performance against most attacks in most data sets. However, AFD was behind trades on several attacks
269
+
270
+ 10:15.200 --> 10:20.720
271
+ especially on CIFAR-100 and TinyImageNet data set that had more classes in it.
272
+
273
+ 10:20.720 --> 10:26.080
274
+ We also looked in trust attack methods and attack strengths which we controlled with the parameter
275
+
276
+ 10:26.080 --> 10:32.800
277
+ epsilon. The diagrams on the right show the robust accuracy for each defense method across
278
+
279
+ 10:32.800 --> 10:41.200
280
+ eight attack methods and various epsilon values for each of them. Overall, our results in these
281
+
282
+ 10:41.200 --> 10:48.240
283
+ diagrams showed that AFD's robustness generalizes better than the baselines across attacks and
284
+
285
+ 10:48.240 --> 10:55.200
286
+ across attack strengths. To quantify these differences, we also computed the area under
287
+
288
+ 10:55.200 --> 11:00.000
289
+ the curve for each method for each attack and summarized them in a table on the left.
290
+
291
+ 11:00.880 --> 11:06.800
292
+ As you can see, AFD's robust performance generalizes better to unseen and stronger attacks
293
+
294
+ 11:06.800 --> 11:15.680
295
+ compared to other baselines. If you remember from previous slides, the domain adaptation theory
296
+
297
+ 11:15.680 --> 11:22.400
298
+ predicted a bound on the adversarial error which can also be turned into a bound on the generalization
299
+
300
+ 11:22.400 --> 11:30.320
301
+ gap between natural and adversarial attacks. We empirically tested this prediction in our experiments
302
+
303
+ 11:30.320 --> 11:37.600
304
+ under two settings. Under the first setting, we varied the epsilon value for the PGDL-infinity
305
+
306
+ 11:37.600 --> 11:45.600
307
+ attack which was used during the training. And under the second setting, we used a diverse set
308
+
309
+ 11:45.600 --> 11:51.120
310
+ of attacks and various attack strengths for each of them.
311
+
312
+ 11:52.000 --> 11:58.480
313
+ And under both scenarios, we found that the domain discriminator, which was originally trained on a
314
+
315
+ 11:58.480 --> 12:05.280
316
+ particular attack and attack strength, in our case it was PGDL-infinity attack with a fixed epsilon
317
+
318
+ 12:05.280 --> 12:10.960
319
+ for each data set, could well predict the generalization gap to unseen attacks and
320
+
321
+ 12:10.960 --> 12:18.000
322
+ different attack magnitudes. This suggests that the adversarial training against a domain classifier
323
+
324
+ 12:18.000 --> 12:24.000
325
+ like that used in our proposed method could potentially lead to robust models with better
326
+
327
+ 12:24.000 --> 12:33.520
328
+ generalization capacity. Finally, while we showed that AFD generalizes well to most other attacks
329
+
330
+ 12:33.520 --> 12:39.200
331
+ and attack strengths, it occasionally was worse compared to other baselines, especially in data
332
+
333
+ 12:39.200 --> 12:45.760
334
+ sets with more classes like Tiny ImageNet. This could potentially be due to the difficulty of training
335
+
336
+ 12:46.320 --> 12:51.680
337
+ domain classifiers in these data sets and leaves much space for future work on
338
+
339
+ 12:51.680 --> 12:57.120
340
+ investigating the effect of domain classifiers on the robustness of feature learning functions.
341
+
342
+ 12:58.080 --> 13:04.400
343
+ Also, AFD required more backward computations compared to some of the other baselines
344
+
345
+ 13:04.400 --> 13:11.120
346
+ such as adversarial training, and as a result, its training time was on average about 31%
347
+
348
+ 13:11.120 --> 13:17.680
349
+ longer than adversarial training. We invite you to read our paper for more details and please
350
+
351
+ 13:17.680 --> 13:34.720
352
+ get in touch with us if you have any questions. Thanks for watching this video and we hope you enjoyed it.
353
+
demo_data/nips-2021/25959/video.mp4 ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:76fac80c58c0fd077be83cb3d4b052aaf70c0128d8884b24f83a34a9f9c72fe3
3
+ size 86886949
demo_data/nips-2021/25963/metadata.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ {
2
+ "title": "Reusing Combinatorial Structure: Faster Iterative Projections over Submodular Base Polytopes"
3
+ }