Space: harmdevries/transformer_inference
harmdevries committed on Nov 3, 2022

Commit 32aafee (1 parent: 52a11ab)

Update app.py

Files changed (1)

app.py CHANGED

@@ -34,7 +34,7 @@ TFLOPS = GPU_EFFICIENCY*TFLOPS
 # in ms
 def calc_exec_time(comp_flop, mem_bytes, include_overhead=True):
-  exec_time = comp_flop/TFLOPS + mem_bytes/GB_S
+  exec_time = max(comp_flop/TFLOPS, mem_bytes/GB_S)
   exec_time *= 1000
   if include_overhead:
     exec_time = max(exec_time, THREAD_OVERHEAD)
@@ -169,24 +169,21 @@ st.latex("A \in \mathbb{R}^{MxK}, B \in R^{KxN}, C \in \mathbb{R}^{MxN}")
 st.markdown('''
 To execute this operation on the GPU, we need to
 1. Read A, B from memory
-2. Perform math operations
+2. Perform matrix multiplication
 3. Write C to memory
 ''')
-st.latex('''
-For float16 operations (2 bytes), we can estimate the memory access time of A as follows:
-T_mem(A) = 2*M*K / BW_mem
-where BW_mem is the memory bandwidth of the GPU (e.g. 1935 GB/s for A100)
-''')
 st.markdown("For float16 operations (2 bytes), we can estimate the memory access time of A as follows:")
 st.latex("T_{mem}(A) = 2*M*K / BW_{mem}")
-st.markdown("where BW_mem is the memory bandwidth of the GPU (e.g. 1935 GB/s for A100)")
+st.markdown("where BW_mem is the memory bandwidth of the GPU (e.g. 1935 GB/s for an A100 GPU)")
+st.markdown("The total time on memory access is T_mem = T_mem(A) + T_mem(B) + T_mem(C)")
+st.markdown("We can estimate the compute time for the math operations as follows:")
+st.latex("T_{math}(A \cdot B) = 2*M*K*N / BW_{math}")
+st.markdown("where BW_math is the number of floating point operations per second (e.g. 312 TFLOPS for an A100 GPU)")
+st.markdown("If we assume we can *perfectly* overlap memory access with math operations, then the estimated execution time for the operation is:")
+st.latex("max(T_math, T_mem)")
 breakdown = st.checkbox("Show breakdown per operation")
 if breakdown:
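
To make the effect of this change concrete, here is a minimal standalone sketch of the estimate the new text describes. It is not code from app.py: the function name `matmul_time_ms` and its parameters are hypothetical, the bandwidth and throughput defaults are the A100 figures quoted in the diff, and it deliberately ignores the `GPU_EFFICIENCY` scaling and `THREAD_OVERHEAD` floor that app.py applies, since their values are not shown here. It reports both the old additive estimate and the new perfectly overlapped one.

```python
def matmul_time_ms(m, k, n, bw_mem_gb_s=1935.0, peak_tflops=312.0, bytes_per_elem=2):
    """Estimate the time of C = A @ B with A (m x k), B (k x n), C (m x n).

    Hypothetical helper for illustration only. Defaults are the A100
    figures quoted in the diff; float16 means 2 bytes per element.
    """
    # Memory traffic: read A and B, write C.
    mem_bytes = bytes_per_elem * (m * k + k * n + m * n)
    # One multiply and one add per inner-product term.
    flops = 2 * m * k * n

    t_mem = mem_bytes / (bw_mem_gb_s * 1e9)   # seconds
    t_math = flops / (peak_tflops * 1e12)     # seconds

    return {
        "t_mem_ms": 1000 * t_mem,
        "t_math_ms": 1000 * t_math,
        # New behaviour in this commit: memory access fully hidden behind math.
        "overlapped_ms": 1000 * max(t_mem, t_math),
        # Old behaviour: memory access and math run back to back.
        "serial_ms": 1000 * (t_mem + t_math),
    }


square = matmul_time_ms(4096, 4096, 4096)  # large square matmul
matvec = matmul_time_ms(1, 4096, 4096)     # single-row (matrix-vector) case

print("square:", square)
print("matvec:", matvec)
```

Under these assumptions the square case is limited by math throughput and the single-row case by memory bandwidth, so the overlapped estimate equals `t_math_ms` in the first and `t_mem_ms` in the second. The additive estimate is always at least as large as the overlapped one, which is the direction in which this commit shifts the reported times.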