What is Mali? What is Exynos?
“Mali” is the name of ARM’s own “standard” GPU cores that complement its standard CPU cores (Cortex). Many ARM CPU licensees integrate GPU cores from other vendors in their SoCs, e.g. Imagination, Vivante or Qualcomm’s Adreno, rather than the default Mali.
Mali Series 700 is the 3rd-generation “Midgard” core that complements ARM’s 64-bit ARMv8 Cortex A5x designs; it is thus used in the very latest phones/tablets and has been updated to support newer versions of technologies like OpenCL, OpenGL ES and DirectX.
“Exynos” is the name of Samsung’s line of SoCs used in Samsung’s own phones/tablets/TVs. Series 5 is the 5th-generation SoC, generally using ARM’s “big.LITTLE” architecture of “small” cores for low power and “big” cores for performance. The 5433 is the 1st 64-bit SoC from Samsung, supporting AArch64 (aka ARMv8) but running in “legacy” 32-bit ARMv7 mode.
In this article we test (GP)GPU graphics unit performance.
Hardware Specifications
We are comparing the internal GPUs of various modern phones and tablets that support GPGPU.
Graphics Processors | ARM Mali T760 | Qualcomm Adreno 420 | Qualcomm Adreno 330 | Qualcomm Adreno 320 | ARM Mali T628 | nVidia K1 | Comment |
Type / Micro-Arch | VLIW4 (Midgard 3rd gen) | VLIW5 | VLIW5 | VLIW5 | VLIW4 (Midgard 2nd gen) | Scalar (Kepler) | All except K1 are VLIW and thus work best with vectorised data; some compilers are very good at vectorising simple code (e.g. by processing multiple data items simultaneously), but the programmer can generally do a better job of extracting parallelism. |
Core Speed (MHz) estimated | 600 | 600 | 578 | 400 | 533 | ? | Core speeds are comparable, with the latest devices not pushing the clocks too high but instead improving the cores. |
OpenGL ES Support | 3.1 | 3.1 | 3.0 | 3.0 | 3.0 (should support 3.1) | 3.1 | Mali T7xx adds official support for OpenGL ES 3.1, just like the other modern GPU designs: Adreno 400 and K1. While Mali T6xx should also support 3.1, the drivers have not been updated for this “legacy” device. |
OpenCL Support | 1.2 (full) | 1.2 (full) | 1.1 | 1.1 | 1.1 (should support full) | Not for Android, supports CUDA | Mali T7xx adds support for OpenCL 1.2 and also for the “full profile”, just like Adreno 420; both support all the desktop features of OpenCL, thus any kernels developed for desktop/mobile GPUs can run pretty much unchanged. |
CU / SP Units | 8 / 256 | 4 / 128 | 4 / 128 | 4 / 64 | 8 / 64 | 1 / 192 | Mali T760 has 2x the CU of T628 but they should also be far more powerful. Adreno 420 only relies on more powerful CUs over the 330/320; nVidia uses only 1 SMX/CU but more SPs. |
Global Memory (MB) | 2048 of 3072 | 1400 of 3072 | 1400 of 3072 | 960 of 2048 | 1024 of 3072 | n/a | Modern phones with 3GB memory seem to allow about 50% to be allocated through OpenCL. Mali does generally seem to allow more, typically 66%. |
Largest Memory Block (MB) | 512 of 2048 | 347 of 1400 | 347 of 1400 | 227 of 960 | 694 of 1024 | n/a | The maximum block size seems to be about 25% of the allocatable global memory, although the older Mali T628’s driver allows considerably more (694 of 1024MB, about 68%). |
Constant Memory (kB) | 64 | 64 | 4 | 4 | 64 | n/a | Mali T600 was already fine here, with Adreno 400 needing to catch up to the rest. Previously, constant data would have had to be kept in normal global memory due to the small constant memory size. |
Shared Memory (kB) | 32 | 32 | 8 | 8 | 32 | n/a | Again Mali T600 was fine already – with Adreno 400 finally matching the rest. |
Max. Workgroup Size | 256 x 256 x 256 | 1024 x 1024 x 1024 | 512 x 512 x 512 | 256 x 256 x 256 | 256 x 256 x 256 | n/a | Surprisingly, the work-group size remains at 256 for Mali T700/T600, with Adreno 400 pushing all the way to 1024. That does not necessarily mean it is the optimum size. |
Cache (Reported) kB | 256 | 128 | 32 | 32 | n/a | n/a | Here Mali T760 overtakes them all with a 256kB L2 cache, 2x bigger than Adreno 400 and older Mali T600. |
FP16 / FP64 Support | Yes / Yes | Yes / No | Yes / No | Yes / No | No | No | Here we have the 1st mobile GPU with native FP64 support! If you have double-precision floating-point workloads then stop reading now and get a SoC with a Mali T700 series GPU. |
Byte/Integer Width | 16 / 4 | 1 / 1 | 1 / 1 | 1 / 1 | 16 / 4 | n/a | Adreno prefers non-vectorised integer data even though it is VLIW5; only Mali prefers vectorised data (vec4) similar to the old ATI/AMD pre-GCN hardware. At least all our vectorisations are not in vain 😉 |
Float/Double Width | 4 / 2 | 1 / n/a | 1 / n/a | 1 / n/a | 4 / n/a | n/a | As before, Adreno prefers non-vectorised while Mali vectorised data. As Mali T760 supports FP64, it also wants vectorised double floating-point data. |
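The percentages quoted in the two memory rows follow directly from the reported figures. A minimal Python sketch of that arithmetic, using only the values from the table above (the dictionary layout and function names are illustrative choices, not part of any benchmark tool):

```python
# Reported OpenCL memory figures in MB, copied from the table above:
# (allocatable global memory, total device memory, largest single block)
REPORTED_MB = {
    "Mali T760": (2048, 3072, 512),
    "Adreno 420": (1400, 3072, 347),
    "Adreno 330": (1400, 3072, 347),
    "Adreno 320": (960, 2048, 227),
    "Mali T628": (1024, 3072, 694),
}


def share(part_mb, whole_mb):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part_mb / whole_mb, 1)


def memory_shares(figures=REPORTED_MB):
    """Map each GPU to (global % of total, largest block % of global)."""
    return {
        gpu: (share(glob, total), share(block, glob))
        for gpu, (glob, total, block) in figures.items()
    }


if __name__ == "__main__":
    for gpu, (global_pct, block_pct) in memory_shares().items():
        print(f"{gpu}: global {global_pct}% of total, "
              f"largest block {block_pct}% of global")
```

On these figures Mali T760 exposes about two-thirds of total memory to OpenCL, while the Adreno parts expose a little under half.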
GPGPU Compute Performance
We are testing vectorised, crypto (including hash), financial and scientific GPGPU performance of the GPUs in OpenCL.
Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.
Environment: Android 5.x.x, latest updates (May 2015).
It seems our early enthusiasm over FP64 native support was quickly extinguished: while Mali T760 naturally does well in FP64 tests – it cannot beat its rival Adreno (420) in other tests.
In simple single-precision floating-point (FP32) workloads, Adreno is only about 10-20% faster; however, in complex workloads (Binomial, Monte-Carlo, FFT, GEMM) Adreno can be 2-7x faster, a huge lead. This seems to be down to shared memory accesses rather than the VLIW design needing highly vectorised kernels, which is what we are using.
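To illustrate what “highly vectorised” means here: a vectorised kernel has each work-item process a 4-wide vector (such as OpenCL’s float4) rather than a single scalar. The pure-Python sketch below only emulates that data layout on the CPU; the function names are hypothetical and this is not the benchmark’s kernel code.

```python
def pack_vec4(values, pad=0.0):
    """Group a flat list into 4-element tuples, padding the tail with `pad`."""
    padded = list(values) + [pad] * (-len(values) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]


def saxpy_scalar(a, xs, ys):
    """One 'work-item' per element: y = a*x + y."""
    return [a * x + y for x, y in zip(xs, ys)]


def saxpy_vec4(a, xs4, ys4):
    """One 'work-item' per 4-wide vector: the same maths on four lanes."""
    return [
        tuple(a * x + y for x, y in zip(xv, yv))
        for xv, yv in zip(xs4, ys4)
    ]


if __name__ == "__main__":
    xs = [float(i) for i in range(10)]
    ys = [1.0] * 10
    scalar = saxpy_scalar(2.0, xs, ys)
    vec = saxpy_vec4(2.0, pack_vec4(xs), pack_vec4(ys))
    # Flatten the vectors again and drop the padding lanes.
    flat = [lane for v in vec for lane in v][:len(xs)]
    print(len(scalar), "scalar work-items vs", len(vec), "vec4 work-items")
    print("results match:", flat == scalar)
```

The vectorised form needs roughly a quarter of the work-items for the same data, which is the shape of workload a VLIW4 design is expected to favour.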
Naturally in double-precision floating-point (FP64) workloads, Mali T760 flies – being 3-5x (times) faster, so if those are the kinds of workloads you require – it is the natural choice. However, such precision is uncommon on phones/tablets – even desktop/laptop GPGPUs have crippled FP64 performance.
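For readers wondering when FP64 matters at all, the sketch below shows rounding error building up in a long FP32 accumulation compared with FP64. It runs on the CPU and emulates FP32 by rounding through Python’s struct module; the function names and the choice of workload are illustrative assumptions, not one of the benchmark tests.

```python
import struct


def to_fp32(x):
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, count, fp32=False):
    """Add `step` to a running total `count` times, optionally in FP32."""
    total = 0.0
    if fp32:
        step = to_fp32(step)
    for _ in range(count):
        total = total + step
        if fp32:
            total = to_fp32(total)  # keep the running total in FP32
    return total


if __name__ == "__main__":
    exact = 10_000.0  # 0.1 added 100,000 times
    print("FP32 error:", abs(accumulate(0.1, 100_000, fp32=True) - exact))
    print("FP64 error:", abs(accumulate(0.1, 100_000) - exact))
```

Workloads that accumulate many small terms, as some financial and scientific kernels do, are where the extra precision pays off.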
In integer workloads, the two GPGPUs are competitive with a 3-5% difference either way.
The relatively small (256) workgroup size may also hamper performance, with Adreno 420 able to keep more (1024) threads in flight, although the shared memory size (32kB) is the same.
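The work-group limit matters when sizing a launch: in OpenCL 1.x the global work size must be a multiple of the local (work-group) size, so it is commonly rounded up, with the kernel skipping the padding items. A minimal sketch of that bookkeeping, using the 256 and 1024 limits reported above; the helper name and the problem size are illustrative assumptions.

```python
def launch_shape(n_items, local_size):
    """Return (padded global size, number of work-groups) for a 1-D launch."""
    if n_items <= 0 or local_size <= 0:
        raise ValueError("sizes must be positive")
    groups = -(-n_items // local_size)  # ceiling division
    return groups * local_size, groups


if __name__ == "__main__":
    n = 1_000_000  # hypothetical problem size
    for local in (256, 1024):
        global_size, groups = launch_shape(n, local)
        print(f"local={local}: global={global_size}, work-groups={groups}")
```

A larger limit means fewer, bigger work-groups for the same problem, but as noted in the table, the maximum is not necessarily the optimum.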
GPGPU Memory Performance
We are testing memory bandwidth performance of GPUs using OpenCL, including transfer (up/down) to/from system memory; we also measure the latencies of the various memory types (global, constant, shared, etc.) using different access patterns (in-page random access, sequential access, etc.).
Results Interpretation (Bandwidth): Higher values (MPix/s, MB/s, etc.) mean better performance.
Results Interpretation (Latency): Lower values (ns, clocks, etc.) mean better performance.
Environment: Android 5.x.x, latest updates (May 2015).
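Latency tests of this kind are commonly built as a “pointer chase”: each element stores the index of the next one to read, so every load depends on the previous one. The sketch below shows, under that assumption, how sequential and in-page random chains could be generated. It is not the benchmark’s actual implementation, and the page size and function names are illustrative.

```python
import random


def sequential_chain(n):
    """next[i] = i + 1, wrapping at the end: a purely sequential walk."""
    return [(i + 1) % n for i in range(n)]


def in_page_random_chain(n, page_items, seed=0):
    """Visit pages in order, but elements inside each page in random order."""
    rng = random.Random(seed)
    order = []
    for start in range(0, n, page_items):
        page = list(range(start, min(start + page_items, n)))
        rng.shuffle(page)
        order.extend(page)
    chain = [0] * n
    for pos, idx in enumerate(order):
        chain[idx] = order[(pos + 1) % n]  # link to the next visited element
    return chain


def walk(chain, start=0):
    """Follow the chain once around and return the indices visited."""
    visited, idx = [], start
    for _ in range(len(chain)):
        visited.append(idx)
        idx = chain[idx]
    return visited


if __name__ == "__main__":
    chain = in_page_random_chain(16, page_items=4)
    print(walk(chain))
```

A real test would time the walk on the device and divide by the number of steps; the chain construction is what distinguishes one access pattern from another.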
Memory testing seems to reveal Mali T760’s problem: its bandwidth is much lower than Adreno’s, while the latencies of its key memories (shared, constant) are far higher. It is actually a wonder that it performs so well, if the numbers are to be believed; but since Mali T628 scores similarly, there is no reason to doubt them.
Adreno 420 has 2x higher internal bandwidth and over 70% more upload/download bandwidth; since neither GPU supports HSA (and thus “zero copy”), its advantage grows with the size of the memory blocks transferred. Here, Qualcomm’s fully in-house SoC design (CPU, GPU, memory controller) pays dividends.
Mali T760’s global memory latency is lower, but neither constant nor (more crucially) shared memory seems to be treated differently, and thus both have latencies similar to global memory; common GPGPU optimisations are thus useless, and any complex algorithm making extensive use of shared memory will be greatly bogged down. ARM should rethink its approach for the new (T800) Mali series.
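Why the transfer gap grows with block size: without zero-copy, copy time scales roughly linearly with the bytes moved, so a fixed bandwidth ratio becomes a larger absolute saving on bigger buffers. A rough sketch, assuming a purely bandwidth-bound copy; the 1000 MB/s baseline is a made-up placeholder, and only the 1.7x ratio comes from the text above.

```python
def copy_time_ms(size_mb, bandwidth_mb_s):
    """Idealised copy time in milliseconds, ignoring per-call overhead."""
    return 1000.0 * size_mb / bandwidth_mb_s


def time_saved_ms(size_mb, base_mb_s, ratio=1.7):
    """Absolute time saved by a link `ratio` times faster than the baseline."""
    slow = copy_time_ms(size_mb, base_mb_s)
    fast = copy_time_ms(size_mb, base_mb_s * ratio)
    return slow - fast


if __name__ == "__main__":
    BASE_MB_S = 1000.0  # hypothetical baseline, NOT a measured figure
    for size in (1, 16, 256):
        print(f"{size} MB: saves {time_saved_ms(size, BASE_MB_S):.2f} ms")
```

The relative saving stays constant, but the absolute time saved per transfer grows in step with the buffer size.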
Video Shader Performance
We are testing vectorised shader compute performance of the GPUs in OpenGL.
Results Interpretation: Higher values (MPix/s, MB/s, etc.) mean better performance.
Environment: Android 5.x.x, latest updates (May 2015).
Using OpenGL ES allows the K1 to take part, but more importantly it shows that Mali’s OpenGL prowess is lacking: Adreno 420 is 4-5x faster, a big difference. FP16 support seems to make no difference, while FP64 support is missing in OpenGL ES, so Mali cannot play its ace card. ARM has some OpenGL driver optimisation to do.
SiSoftware Official Ranker Scores
Final Thoughts / Conclusions
Mali T760 is a big upgrade over the older T600 series, though a lot of the details have not changed. However, it is not enough to beat its rival, the Adreno 400 series, with native FP64 performance (and thus emulated FP128) being the only area where it shines. While its integer workload performance is competitive, floating-point performance in complex workloads (making extensive use of shared memory) is much lower. Even highly vectorised kernels that should help its VLIW design cannot close the gap.
It seems the SoC’s memory controller lets it down, and its non-dedicated shared and constant memory means high latencies slow it down. ARM should really implement dedicated shared memory in the next major version.
Elsewhere, its OpenGL shader compute performance is even further behind (about 1/4 of Adreno’s), with FP16 support not helping much and native FP64 support missing. This is a surprise considering its far more competitive OpenCL performance. Hopefully future drivers will address this, but considering that T600 performance has remained pretty much unchanged, we are not hopeful.
To see how the Exynos 5433 CPU fares, please see our Exynos 5433 CPU (Cortex A57+A53) performance article.