Commit bc52030 · verified · committed by thomwolf (HF staff)
1 Parent(s): e56b973
assets/images/5D_nutshell_tp_sp.svg ADDED
assets/images/5d_nutshell_cp.svg ADDED
assets/images/5d_nutshell_ep.svg ADDED
dist/assets/images/5D_nutshell_tp_sp.svg ADDED
dist/assets/images/5d_nutshell_cp.svg ADDED
dist/assets/images/5d_nutshell_ep.svg ADDED
dist/index.html CHANGED
@@ -1510,7 +1510,8 @@
 
 <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined. However, TP has two important limitations: First, since its communication operations are part of the critical path of computation, it doesn't scale well beyond a certain point as communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more complex to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" /></div>
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
 <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
@@ -1520,7 +1521,9 @@
 
 <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" /></div>
+
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
 <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
@@ -1533,14 +1536,15 @@
 <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
 </ul>
 
+ <div class="l-page"><img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" /></div>
+
 <div class="note-box">
- <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
- </p>
- </div>
+ <p class="note-box-title">📝 Note</p>
+ <p class="note-box-content">
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
+ </p>
+ </div>
 
- <p>TODO: the text between the table and figueres is still a bit sparse.</p>
 
 <table>
 <thead>
@@ -2578,13 +2582,26 @@
 year={2025},
 }</pre>
 </d-appendix>
+ <script>
+ function toggleTOC() {
+ const content = document.querySelector('.toc-content');
+ const icon = document.querySelector('.toggle-icon');
+
+ content.classList.toggle('collapsed');
+ icon.classList.toggle('collapsed');
+ }
+ </script>
 
 <script>
 const article = document.querySelector('d-article');
 const toc = document.querySelector('d-contents');
 if (toc) {
 const headings = article.querySelectorAll('h2, h3, h4');
- let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ // let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ let ToC = `<nav role="navigation" class="l-text figcaption"><div class="toc-header" onclick="toggleTOC()">
+ <span class="toc-title">Table of Contents</span>
+ <span class="toggle-icon">▼</span>
+ </div><div class="toc-content">`;
 let prevLevel = 0;
 
 for (const el of headings) {
@@ -2606,7 +2623,7 @@
 }
 if (level === 0)
 ToC += '<div>' + link + '</div>';
- else
+ else if (level === 1)
 ToC += '<li>' + link + '</li>';
 }
 
@@ -2614,10 +2631,10 @@
 ToC += '</ul>'
 prevLevel--;
 }
- ToC += '</nav>';
+ ToC += '</div></nav>';
 toc.innerHTML = ToC;
 toc.setAttribute('prerendered', 'true');
- const toc_links = document.querySelectorAll('d-contents > nav a');
+ const toc_links = document.querySelectorAll('d-contents > nav div a');
 
 window.addEventListener('scroll', (_event) => {
 if (typeof (headings) != 'undefined' && headings != null && typeof (toc_links) != 'undefined' && toc_links != null) {
dist/style.css CHANGED
@@ -150,6 +150,7 @@ d-contents > nav a.active {
 @media (max-width: 1199px) {
 d-contents {
 display: none;
+ background: white;
 justify-self: start;
 align-self: start;
 padding-bottom: 0.5em;
@@ -160,7 +161,7 @@ d-contents > nav a.active {
 border-bottom-style: solid;
 border-bottom-color: rgba(0, 0, 0, 0.1);
 overflow-y: scroll;
- height: calc(100vh - 80px);
+ height: calc(100vh - 40px);
 scrollbar-width: none;
 z-index: -100;
 }
@@ -170,6 +171,31 @@ d-contents a:hover {
 border-bottom: none;
 }
 
+ toc-title {
+ font-weight: bold;
+ font-size: 1.2em;
+ color: #333;
+ }
+
+ toggle-icon {
+ transition: transform 0.3s;
+ }
+
+ toggle-icon.collapsed {
+ transform: rotate(-90deg);
+ }
+
+ .toc-content {
+ margin-top: 15px;
+ overflow: hidden;
+ max-height: 1000px;
+ transition: max-height 0.3s ease-out;
+ }
+
+ .toc-content.collapsed {
+ max-height: 0;
+ margin-top: 0;
+ }
 
 @media (min-width: 1200px) {
 d-article {
@@ -179,6 +205,7 @@ d-contents a:hover {
 
 d-contents {
 align-self: start;
+ background: white;
 grid-column-start: 1 !important;
 grid-column-end: 4 !important;
 grid-row: auto / span 6;
@@ -186,16 +213,17 @@ d-contents a:hover {
 margin-top: 0em;
 padding-right: 3em;
 padding-left: 2em;
- border-right: 1px solid rgba(0, 0, 0, 0.1);
+ /* border-right: 1px solid rgba(0, 0, 0, 0.1);
 border-right-width: 1px;
 border-right-style: solid;
- border-right-color: rgba(0, 0, 0, 0.1);
+ border-right-color: rgba(0, 0, 0, 0.1); */
 position: -webkit-sticky; /* For Safari */
 position: sticky;
 top: 10px; /* Adjust this value if needed */
- overflow-y: scroll;
- height: calc(100vh - 80px);
+ overflow-y: auto;
+ height: calc(100vh - 40px);
 scrollbar-width: none;
+ transition: max-height 0.3s ease-out;
 z-index: -100;
 }
 }
@@ -205,7 +233,7 @@ d-contents nav h3 {
 margin-bottom: 1em;
 }
 
- d-contents nav div {
+ d-contents nav div div {
 color: rgba(0, 0, 0, 0.8);
 font-weight: bold;
 }
src/index.html CHANGED
@@ -1510,7 +1510,8 @@
 
 <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined. However, TP has two important limitations: First, since its communication operations are part of the critical path of computation, it doesn't scale well beyond a certain point as communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more complex to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" /></div>
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
 <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
@@ -1520,7 +1521,9 @@
 
 <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
 
- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
+ <div class="l-page"><img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" /></div>
+
+ <!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
 <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
@@ -1533,14 +1536,15 @@
 <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
 </ul>
 
+ <div class="l-page"><img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" /></div>
+
 <div class="note-box">
- <p class="note-box-title">📝 Note</p>
- <p class="note-box-content">
- <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
- </p>
- </div>
+ <p class="note-box-title">📝 Note</p>
+ <p class="note-box-content">
+ <p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
+ </p>
+ </div>
 
- <p>TODO: the text between the table and figueres is still a bit sparse.</p>
 
 <table>
 <thead>
@@ -2578,13 +2582,26 @@
 year={2025},
 }</pre>
 </d-appendix>
+ <script>
+ function toggleTOC() {
+ const content = document.querySelector('.toc-content');
+ const icon = document.querySelector('.toggle-icon');
+
+ content.classList.toggle('collapsed');
+ icon.classList.toggle('collapsed');
+ }
+ </script>
 
 <script>
 const article = document.querySelector('d-article');
 const toc = document.querySelector('d-contents');
 if (toc) {
 const headings = article.querySelectorAll('h2, h3, h4');
- let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ // let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+ let ToC = `<nav role="navigation" class="l-text figcaption"><div class="toc-header" onclick="toggleTOC()">
+ <span class="toc-title">Table of Contents</span>
+ <span class="toggle-icon">▼</span>
+ </div><div class="toc-content">`;
 let prevLevel = 0;
 
 for (const el of headings) {
@@ -2606,7 +2623,7 @@
 }
 if (level === 0)
 ToC += '<div>' + link + '</div>';
- else
+ else if (level === 1)
 ToC += '<li>' + link + '</li>';
 }
 
@@ -2614,10 +2631,10 @@
 ToC += '</ul>'
 prevLevel--;
 }
- ToC += '</nav>';
+ ToC += '</div></nav>';
 toc.innerHTML = ToC;
 toc.setAttribute('prerendered', 'true');
- const toc_links = document.querySelectorAll('d-contents > nav a');
+ const toc_links = document.querySelectorAll('d-contents > nav div a');
 
 window.addEventListener('scroll', (_event) => {
 if (typeof (headings) != 'undefined' && headings != null && typeof (toc_links) != 'undefined' && toc_links != null) {
src/style.css CHANGED
@@ -150,6 +150,7 @@ d-contents > nav a.active {
 @media (max-width: 1199px) {
 d-contents {
 display: none;
+ background: white;
 justify-self: start;
 align-self: start;
 padding-bottom: 0.5em;
@@ -160,7 +161,7 @@ d-contents > nav a.active {
 border-bottom-style: solid;
 border-bottom-color: rgba(0, 0, 0, 0.1);
 overflow-y: scroll;
- height: calc(100vh - 80px);
+ height: calc(100vh - 40px);
 scrollbar-width: none;
 z-index: -100;
 }
@@ -170,6 +171,31 @@ d-contents a:hover {
 border-bottom: none;
 }
 
+ toc-title {
+ font-weight: bold;
+ font-size: 1.2em;
+ color: #333;
+ }
+
+ toggle-icon {
+ transition: transform 0.3s;
+ }
+
+ toggle-icon.collapsed {
+ transform: rotate(-90deg);
+ }
+
+ .toc-content {
+ margin-top: 15px;
+ overflow: hidden;
+ max-height: 1000px;
+ transition: max-height 0.3s ease-out;
+ }
+
+ .toc-content.collapsed {
+ max-height: 0;
+ margin-top: 0;
+ }
 
 @media (min-width: 1200px) {
 d-article {
@@ -179,6 +205,7 @@ d-contents a:hover {
 
 d-contents {
 align-self: start;
+ background: white;
 grid-column-start: 1 !important;
 grid-column-end: 4 !important;
 grid-row: auto / span 6;
@@ -186,16 +213,17 @@ d-contents a:hover {
 margin-top: 0em;
 padding-right: 3em;
 padding-left: 2em;
- border-right: 1px solid rgba(0, 0, 0, 0.1);
+ /* border-right: 1px solid rgba(0, 0, 0, 0.1);
 border-right-width: 1px;
 border-right-style: solid;
- border-right-color: rgba(0, 0, 0, 0.1);
+ border-right-color: rgba(0, 0, 0, 0.1); */
 position: -webkit-sticky; /* For Safari */
 position: sticky;
 top: 10px; /* Adjust this value if needed */
- overflow-y: scroll;
- height: calc(100vh - 80px);
+ overflow-y: auto;
+ height: calc(100vh - 40px);
 scrollbar-width: none;
+ transition: max-height 0.3s ease-out;
 z-index: -100;
 }
 }
@@ -205,7 +233,7 @@ d-contents nav h3 {
 margin-bottom: 1em;
 }
 
- d-contents nav div {
+ d-contents nav div div {
 color: rgba(0, 0, 0, 0.8);
 font-weight: bold;
 }