Change #273665

Category	ffmpeg
Changed by	Shreesh Adiga <16567adigashreeshohnoyoudont@gmail.com>
Changed at	Thu 02 Jul 2026 11:03:25
Repository	https://git.ffmpeg.org/ffmpeg.git
Project	ffmpeg
Branch	master
Revision	915bac7bdc89317b85dc19e9cf6e7aed99be2e68
Comments

avutil/crc: add aarch64 hybrid crc32 NEON PMULL+EOR SIMD implementation
Add a crc32 specialization for aarch64 that uses both PMULL and crc32
instructions to fold 192 bytes per loop iteration (9x PMULL and 6x crc32),
giving higher performance for large inputs (>8 kB). This approach is based on
the zlib-ng implementation, which is also described at
https://github.com/corsix/fast-crc32.
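
For reference, the value that every code path must agree on can be computed
with a plain bitwise loop. The sketch below is a portable stand-in, not code
from this change: `crc32_ieee_le_ref` is a hypothetical name, and it assumes
the usual reflected IEEE polynomial 0xEDB88320, with the initial and final
inversion left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper, not part of libavutil.
 * Bitwise CRC-32 (IEEE 802.3) in reflected form, polynomial 0xEDB88320.
 * No initial or final inversion is applied here; the caller does that. */
static uint32_t crc32_ieee_le_ref(uint32_t crc, const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift out one bit; XOR in the polynomial if that bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc;
}
```

With the conventional inversions, `~crc32_ieee_le_ref(~0u, "123456789", 9)`
yields the standard check value 0xCBF43926, so a loop like this can serve as
an oracle when comparing an optimized path on random buffers.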

For smaller buffers it was observed to be slightly slower, so this logic is
used only for input sizes >8192 bytes. Smaller inputs keep using the
4x PMULL folding method, with scalar crc32 instructions processing the
remainder of the input.
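
The size-based selection described above can be sketched in portable C. This
is only an illustration: `crc_path`, `choose_crc_path`, `HYBRID_MIN_SIZE` and
`HYBRID_BLOCK` are hypothetical names that do not come from
libavutil/aarch64/crc.S. Only the 8192-byte cutoff and the 192-byte block size
are taken from the description. The byte budget in the static assertion is an
assumption about how 9 PMULL folds and 6 crc32 instructions add up to 192.

```c
#include <stddef.h>

/* All names below are hypothetical and for illustration only. */
enum crc_path {
    CRC_PATH_FOLD4,  /* 4x PMULL folding + scalar crc32 for the remainder */
    CRC_PATH_HYBRID  /* 192-byte hybrid PMULL + crc32 loop */
};

#define HYBRID_MIN_SIZE 8192 /* inputs strictly larger than this use the hybrid loop */
#define HYBRID_BLOCK     192 /* bytes folded per hybrid loop iteration */

/* Assumption: each of the 9 PMULL folds covers one 16-byte vector and each
 * of the 6 crc32 instructions consumes 8 bytes, which accounts for 192. */
_Static_assert(9 * 16 + 6 * 8 == HYBRID_BLOCK, "per-iteration byte budget");

/* Pick the code path for a buffer of len bytes. */
static enum crc_path choose_crc_path(size_t len)
{
    return len > HYBRID_MIN_SIZE ? CRC_PATH_HYBRID : CRC_PATH_FOLD4;
}
```

Under this sketch an input of exactly 8192 bytes still takes the 4x PMULL
path, and 8193 bytes is the first size that takes the hybrid loop.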

On a MediaTek Dimensity 9400 Android device in a Termux environment, running
the unmodified checkasm test with seed 0, which picks a random buffer size up
to a maximum of 16 kB, the following was observed on Cortex-X925, A720 and X4:
X925 Before:
  crc_32_IEEE_LE_c:                  12762.0
  crc_32_IEEE_LE_crc:                  667.5 (19.11x)
  crc_32_IEEE_LE_pmull_eor3:           346.9 (26.30x)
X925 After:
  crc_32_IEEE_LE_c:                  12707.6
  crc_32_IEEE_LE_crc:                  665.2 (19.10x)
  crc_32_IEEE_LE_pmull_eor3:           292.8 (41.90x)

A720 Before:
  crc_32_IEEE_LE_c:                  23059.1
  crc_32_IEEE_LE_crc:                 1220.7 (18.89x)
  crc_32_IEEE_LE_pmull_eor3:          1198.9 (19.23x)
A720 After:
  crc_32_IEEE_LE_c:                  23293.3
  crc_32_IEEE_LE_crc:                 1209.1 (19.26x)
  crc_32_IEEE_LE_pmull_eor3:          1150.4 (20.24x)

X4 Before:
  crc_32_IEEE_LE_c:                  12405.5
  crc_32_IEEE_LE_crc:                  664.5 (18.67x)
  crc_32_IEEE_LE_pmull_eor3:           498.1 (24.90x)
X4 After:
  crc_32_IEEE_LE_c:                  12457.2
  crc_32_IEEE_LE_crc:                  665.5 (18.72x)
  crc_32_IEEE_LE_pmull_eor3:           468.8 (26.57x)

So it works well on a high-performance core like the X925, giving about
20% better performance, with only small gains on the other cores.

Testing with an input size of 160 kB, after modifying the checkasm crc test to
increase the buffer size to 160 kB and to always use the full capacity instead
of a random size, gives the following observations:
X925 Before:
  crc_32_IEEE_LE_c:                 210177.1
  crc_32_IEEE_LE_crc:                10313.7 (20.35x)
  crc_32_IEEE_LE_pmull_eor3:          6580.9 (31.83x)
X925 After:
  crc_32_IEEE_LE_c:                 210869.3
  crc_32_IEEE_LE_crc:                10304.8 (20.36x)
  crc_32_IEEE_LE_pmull_eor3:          3098.5 (68.05x)

A720 Before:
  crc_32_IEEE_LE_c:                 387502.5
  crc_32_IEEE_LE_crc:                19196.7 (19.54x)
  crc_32_IEEE_LE_pmull_eor3:         18717.1 (20.63x)
A720 After:
  crc_32_IEEE_LE_c:                 392090.8
  crc_32_IEEE_LE_crc:                19795.1 (18.68x)
  crc_32_IEEE_LE_pmull_eor3:         14971.4 (24.97x)

X4 Before:
  crc_32_IEEE_LE_c:                 196232.0
  crc_32_IEEE_LE_crc:                10378.7 (18.68x)
  crc_32_IEEE_LE_pmull_eor3:          7742.0 (25.29x)
X4 After:
  crc_32_IEEE_LE_c:                 199632.9
  crc_32_IEEE_LE_crc:                10495.8 (18.32x)
  crc_32_IEEE_LE_pmull_eor3:          5448.9 (24.69x)

This gives about 2x gains on X925, 25% on A720 and 40% on X4.
In general the performance gain depends on the CPU core and input size;
this optimization benefits large inputs, especially on high-performance
cores like the X925 and Apple M series.
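
The 160 kB figures can be cross-checked directly from the pmull_eor3 timings
listed above. The helper below is a hypothetical sketch, not part of checkasm.
It assumes that a lower timing means faster code, so the gain is the "before"
timing divided by the "after" timing.

```c
#include <stdio.h>

/* Hypothetical helper: speedup factor from two checkasm timings.
 * Lower timings are faster, so the ratio is before / after. */
static double speedup(double before, double after)
{
    return before / after;
}

/* Print the pmull_eor3 speedups for the 160 kB run reported above. */
static void print_160k_speedups(void)
{
    printf("X925: %.2fx\n", speedup(6580.9, 3098.5));
    printf("A720: %.2fx\n", speedup(18717.1, 14971.4));
    printf("X4:   %.2fx\n", speedup(7742.0, 5448.9));
}
```

These ratios come out at roughly 2.12x, 1.25x and 1.42x, which matches the
"2x", "25%" and "40%" summary.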
Changed files

libavutil/aarch64/crc.S
libavutil/aarch64/crc.h