Change #273665
| Category | ffmpeg |
| Changed by | Shreesh Adiga <16567adigashreesh@gmail.com> |
| Changed at | Thu 02 Jul 2026 11:03:25 |
| Repository | https://git.ffmpeg.org/ffmpeg.git |
| Project | ffmpeg |
| Branch | master |
| Revision | 915bac7bdc89317b85dc19e9cf6e7aed99be2e68 |
Comments
avutil/crc: add aarch64 hybrid crc32 NEON PMULL+EOR SIMD implementation

Add a crc32 specialization for aarch64 that uses both PMULL and crc32 instructions to fold 192 bytes per loop iteration, performing 9x PMULL and 6 crc32 in one iteration, which gives higher performance for large inputs (>8 kB). The approach is based on the zlib-ng implementation, which is also described at https://github.com/corsix/fast-crc32.

For smaller buffers this method was observed to be slightly slower, so it is used only when the input size is >8192 bytes. Smaller inputs still use the 4x PMULL folding method, with scalar crc32 instructions processing the remainder of the input.

The figures below were measured on a MediaTek Dimensity 9400 Android device in a termux environment, using the normal checkasm run with seed 0, which picks a random buffer size up to a maximum of 16 kB, on the Cortex X925, A720 and X4 cores:

X925 Before:
crc_32_IEEE_LE_c:           12762.0
crc_32_IEEE_LE_crc:           667.5 (19.11x)
crc_32_IEEE_LE_pmull_eor3:    346.9 (26.30x)

X925 After:
crc_32_IEEE_LE_c:           12707.6
crc_32_IEEE_LE_crc:           665.2 (19.10x)
crc_32_IEEE_LE_pmull_eor3:    292.8 (41.90x)

A720 Before:
crc_32_IEEE_LE_c:           23059.1
crc_32_IEEE_LE_crc:          1220.7 (18.89x)
crc_32_IEEE_LE_pmull_eor3:   1198.9 (19.23x)

A720 After:
crc_32_IEEE_LE_c:           23293.3
crc_32_IEEE_LE_crc:          1209.1 (19.26x)
crc_32_IEEE_LE_pmull_eor3:   1150.4 (20.24x)

X4 Before:
crc_32_IEEE_LE_c:           12405.5
crc_32_IEEE_LE_crc:           664.5 (18.67x)
crc_32_IEEE_LE_pmull_eor3:    498.1 (24.90x)

X4 After:
crc_32_IEEE_LE_c:           12457.2
crc_32_IEEE_LE_crc:           665.5 (18.72x)
crc_32_IEEE_LE_pmull_eor3:    468.8 (26.57x)

The new code works well on a high-performance core such as the X925, where it gives about 20% better performance, while the other cores see only small gains.
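The hybrid method relies on a general property of CRCs: independent parts of a buffer can be checksummed separately and merged afterwards. The following is a minimal portable C sketch of that property only, not the code from this commit. The function names `crc32_update`, `crc32_shift` and `crc32_split` are hypothetical, and the merge step feeds zero bytes through a bitwise loop, whereas a SIMD implementation would presumably do the equivalent shift with PMULL and precomputed constants.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise reference update for the reflected CRC-32 polynomial 0xEDB88320
 * (the IEEE little-endian variant). The caller supplies the running state;
 * no initial or final XOR is applied here. */
static uint32_t crc32_update(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/* Advance a state over `len` zero bytes. This is the slow equivalent of
 * multiplying the state by x^(8*len) modulo the polynomial. */
static uint32_t crc32_shift(uint32_t crc, size_t len)
{
    static const uint8_t zero = 0;
    while (len--)
        crc = crc32_update(crc, &zero, 1);
    return crc;
}

/* Checksum the two halves of a buffer as independent streams and merge them.
 * The update is linear over GF(2), so
 *   update(s, B) == update(s, zeros(len(B))) ^ update(0, B). */
static uint32_t crc32_split(uint32_t crc, const uint8_t *buf, size_t len)
{
    size_t half = len / 2;
    uint32_t a = crc32_update(crc, buf, half);            /* first stream  */
    uint32_t b = crc32_update(0, buf + half, len - half); /* second stream */
    return crc32_shift(a, len - half) ^ b;
}
```

For any buffer, `crc32_split(s, buf, len)` should return the same state as `crc32_update(s, buf, len)`. Because the two streams have no data dependency on each other, they can be interleaved in one loop, which is the kind of instruction-level parallelism a hybrid PMULL plus crc32 loop can exploit.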
Testing with an input size of 160 kB, after modifying the checkasm crc test to use a 160 kB buffer and to always use the full capacity instead of a random size, gives the following results:

X925 Before:
crc_32_IEEE_LE_c:           210177.1
crc_32_IEEE_LE_crc:          10313.7 (20.35x)
crc_32_IEEE_LE_pmull_eor3:    6580.9 (31.83x)

X925 After:
crc_32_IEEE_LE_c:           210869.3
crc_32_IEEE_LE_crc:          10304.8 (20.36x)
crc_32_IEEE_LE_pmull_eor3:    3098.5 (68.05x)

A720 Before:
crc_32_IEEE_LE_c:           387502.5
crc_32_IEEE_LE_crc:          19196.7 (19.54x)
crc_32_IEEE_LE_pmull_eor3:   18717.1 (20.63x)

A720 After:
crc_32_IEEE_LE_c:           392090.8
crc_32_IEEE_LE_crc:          19795.1 (18.68x)
crc_32_IEEE_LE_pmull_eor3:   14971.4 (24.97x)

X4 Before:
crc_32_IEEE_LE_c:           196232.0
crc_32_IEEE_LE_crc:          10378.7 (18.68x)
crc_32_IEEE_LE_pmull_eor3:    7742.0 (25.29x)

X4 After:
crc_32_IEEE_LE_c:           199632.9
crc_32_IEEE_LE_crc:          10495.8 (18.32x)
crc_32_IEEE_LE_pmull_eor3:    5448.9 (24.69x)

This gives about 2x gains on X925, 25% on A720 and 40% on X4. In general the performance gain depends on the CPU core and the input size; this optimization benefits large inputs, especially on high-performance cores like X925 and Apple M series.
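The quoted gains can be checked from the crc_32_IEEE_LE_pmull_eor3 timings of the 160 kB run by dividing the "Before" value by the "After" value. The short C sketch below does that arithmetic; the struct and function names are made up for illustration, and the numbers are copied from the results above.

```c
#include <stdio.h>

/* crc_32_IEEE_LE_pmull_eor3 timings from the 160 kB run
 * (checkasm units, lower is faster). */
struct timing {
    const char *core;
    double before;
    double after;
};

static const struct timing pmull_eor3_160k[] = {
    { "X925",  6580.9,  3098.5 },
    { "A720", 18717.1, 14971.4 },
    { "X4",    7742.0,  5448.9 },
};

/* Speedup factor of the new code over the old code on one core. */
static double speedup(const struct timing *t)
{
    return t->before / t->after;
}

/* Print one line per core with its speedup factor. */
static void print_speedups(void)
{
    size_t n = sizeof(pmull_eor3_160k) / sizeof(pmull_eor3_160k[0]);
    for (size_t i = 0; i < n; i++)
        printf("%-5s %.2fx\n", pmull_eor3_160k[i].core,
               speedup(&pmull_eor3_160k[i]));
}
```

Calling `print_speedups()` from a `main` function gives factors of roughly 2.1x, 1.25x and 1.4x, in line with the "about 2x", "25%" and "40%" figures stated in the message.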
Changed files
- libavutil/aarch64/crc.S
- libavutil/aarch64/crc.h