I like to train large deep neural nets too | First Paper (AutoAgents: A Framework for Automatic Agent Generation) Accepted @ IJCAI 2024 | Role Model: Karpathy
Triton nanoGPT now has a custom cross-entropy loss kernel. Next: matmul, gradually overthrowing all major PyTorch ops :)
Simplified pseudocode for the parallel cross-entropy loss computation:
- Init program: get the pid, compute offsets, load targets.
- Init row_max and row_sum.
- Loop 1 (find the max logit): update row_max with the running max.
- Loop 2 (compute softmax and loss): accumulate row_sum, update the loss.
- Add log(row_sum) and store the loss.
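The steps above can be sketched in plain Python (a reference implementation of the same two-pass, numerically stable log-softmax math, not the actual Triton kernel — the function name and signature here are illustrative):

```python
import math

def cross_entropy_rows(logits, targets):
    """Per-row cross-entropy loss mirroring the two-pass kernel steps:
    pass 1 finds the row max, pass 2 accumulates the exp-sum, then
    loss = log(row_sum) - (logit_target - row_max)."""
    losses = []
    for row, tgt in zip(logits, targets):  # one "program" per row
        # Pass 1: find the max logit for numerical stability.
        row_max = max(row)
        # Pass 2: accumulate sum of exp(logit - row_max).
        row_sum = sum(math.exp(x - row_max) for x in row)
        # Combine: subtracting row_max in both places cancels exactly,
        # so this equals -log(softmax(row)[tgt]) without overflow.
        losses.append(math.log(row_sum) - (row[tgt] - row_max))
    return losses
```

Subtracting the row max before exponentiating is what the first loop buys you: exp never sees a large positive argument, so the kernel stays stable in fp32 even for big logits.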