A portable OpenCL implementation of the lattice Boltzmann method targeting emerging many-core architectures is described. The main purpose of this work is to evaluate and compare the performance of this code on three mainstream hardware architectures available today, namely an Intel CPU, an Nvidia GPU, and the Intel Xeon Phi. Because of the similarities between OpenCL and CUDA, we chose to follow some of the strategies devised to implement efficient lattice Boltzmann solvers on Nvidia GPU, while remaining as generic as possible. Being fairly configurable, this program makes possible to ascertain the best options for each hardware platforms. The achieved performance is quite satisfactory for both the CPU and the GPU. For the Xeon Phi however, the results are below expectations. Nevertheless, comparison with data from the literature shows that on this architecture the code seems memory-bound.