What are the benefits of using the ReLU activation function over the classic sigmoid function in neural networks?
1. Vanishing gradient problem: the derivative of the sigmoid function peaks at 0.25 (at x = 0), so backpropagation through many sigmoid layers multiplies together factors of at most 0.25, and the gradient shrinks roughly geometrically with depth. In contrast, ReLU has a constant gradient of 1 for positive inputs, so it passes the gradient through unchanged (see the first sketch after this list).
2. Compute efficiency: ReLU is just max(0, x), a comparison and a select, which is cheaper to compute than the exponential required by the sigmoid function (see the timing sketch below).
3. Convergence speed: networks using ReLU are also reported to converge faster in training than the same networks with saturating activations; for example, Krizhevsky et al. (2012) reported a convolutional network with ReLUs reaching 25% training error on CIFAR-10 about six times faster than an equivalent network with tanh units (see the toy comparison below).
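
A minimal NumPy sketch of point 1 (the helper names `sigmoid_grad` and `relu_grad` are just illustrative): it checks that the sigmoid derivative tops out at 0.25 and shows how, ignoring the weight matrices, the per-layer derivative factors compound with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25 when x = 0

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs

xs = np.linspace(-5, 5, 1001)
print("max sigmoid'(x):", sigmoid_grad(xs).max())          # ~0.25
print("relu'(x) for x > 0:", relu_grad(xs[xs > 0]).min())  # 1.0

# Backprop multiplies one activation-derivative factor per layer, so even
# in the best case the sigmoid contribution decays like 0.25**depth,
# while ReLU contributes a factor of exactly 1 on the positive side.
for depth in (5, 10, 20):
    print(f"depth {depth:2d}: sigmoid best case {0.25**depth:.2e}, ReLU 1")
```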
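Point 2 can be checked directly with a rough micro-benchmark (exact numbers depend on your hardware and NumPy's vectorization, but max(0, x) consistently comes out cheaper than the exponential):

```python
import timeit
import numpy as np

x = np.random.randn(1_000_000)

# max(0, x) is a single elementwise comparison/select; sigmoid needs an exp.
t_relu = timeit.timeit(lambda: np.maximum(0.0, x), number=100)
t_sig  = timeit.timeit(lambda: 1.0 / (1.0 + np.exp(-x)), number=100)
print(f"ReLU:    {t_relu:.3f} s / 100 runs")
print(f"sigmoid: {t_sig:.3f} s / 100 runs")
```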
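And a toy side-by-side for point 3: training the same one-hidden-layer MLP on a small regression task with each activation. This is only a sketch under assumed hyperparameters (the learning rate, initialization, and sin target are arbitrary choices of mine), and on a problem this shallow the gap is far less dramatic than in the deep networks where it was originally reported.

```python
import numpy as np

def train(act, act_grad, steps=2000, lr=0.5):
    """Train a one-hidden-layer MLP on y = sin(2x); return final MSE."""
    rng = np.random.default_rng(0)          # same data/init for both runs
    X = rng.uniform(-2, 2, size=(256, 1))
    y = np.sin(2 * X)
    W1 = rng.normal(0, 0.5, size=(1, 32)); b1 = np.zeros(32)
    W2 = rng.normal(0, 0.5, size=(32, 1)); b2 = np.zeros(1)
    for _ in range(steps):
        z1 = X @ W1 + b1
        h = act(z1)
        err = h @ W2 + b2 - y
        # Backprop: the activation derivative scales the upstream gradient.
        dh = (err @ W2.T) * act_grad(z1)
        W2 -= lr * h.T @ err / len(X); b2 -= lr * err.mean(0)
        W1 -= lr * X.T @ dh / len(X); b1 -= lr * dh.mean(0)
    return float((err ** 2).mean())

relu   = lambda z: np.maximum(0.0, z)
relu_g = lambda z: (z > 0).astype(float)
sig    = lambda z: 1.0 / (1.0 + np.exp(-z))
sig_g  = lambda z: sig(z) * (1.0 - sig(z))

print("ReLU final MSE:   ", train(relu, relu_g))
print("sigmoid final MSE:", train(sig, sig_g))
```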