Change #265094

Category	ffmpeg
Changed by	Jun Zhao <barryjzhao@tencent.com>
Changed at	Tue 21 Apr 2026 09:50:49
Repository	https://git.ffmpeg.org/ffmpeg.git
Project	ffmpeg
Branch	master
Revision	75838b9c891768d3fb5ad066773c6838b942e6df

Comments

lavc/hevc: add aarch64 NEON for reference sample filtering

3-tap [1,2,1]>>2: the size-specialized entry points (8x8/16x16/32x32)
share one implementation body to reduce code size. The 3-tap kernel is
folded into uhadd + urhadd: uhadd gives floor((prev+next)/2), then
urhadd rounds that with curr to produce exactly
(prev + 2*curr + next + 2) >> 2 on 16 bytes in-place (no widen/narrow
needed). An overlap-last technique handles the tail and avoids partial
stores. The caller pads the input arrays by 16 bytes to guarantee a
safe over-read.
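
The uhadd + urhadd fold and the overlap-last tail can be modelled in
scalar C. This is a minimal sketch, not the code from hevcpred_neon.S:
every name below (uhadd_u8, urhadd_u8, filter3_folded, filter_block,
filter_overlap_last, BLOCK) is hypothetical, the two helpers only mimic
one 8-bit lane of the corresponding instruction, and src and dst are
assumed not to alias. fold_mismatches() compares the folded form with
the widened formula over all 2^24 byte triples.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 16 /* bytes per 128-bit NEON register */

/* One lane of UHADD: unsigned halving add, truncating. */
static uint8_t uhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* One lane of URHADD: unsigned halving add, rounding up. */
static uint8_t urhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1) >> 1);
}

/* Folded form: stays in 8 bits, no widening needed. */
static uint8_t filter3_folded(uint8_t prev, uint8_t curr, uint8_t next)
{
    return urhadd_u8(uhadd_u8(prev, next), curr);
}

/* Widened reference form of the rounded [1,2,1] filter. */
static uint8_t filter3_reference(uint8_t prev, uint8_t curr, uint8_t next)
{
    return (uint8_t)(((unsigned)prev + 2u * curr + next + 2u) >> 2);
}

/* Number of byte triples on which the two forms disagree. */
static unsigned long fold_mismatches(void)
{
    unsigned long bad = 0;
    for (unsigned p = 0; p < 256; p++)
        for (unsigned c = 0; c < 256; c++)
            for (unsigned n = 0; n < 256; n++)
                bad += filter3_folded((uint8_t)p, (uint8_t)c, (uint8_t)n) !=
                       filter3_reference((uint8_t)p, (uint8_t)c, (uint8_t)n);
    return bad;
}

/* Stand-in for one 16-byte vector step; reads src[-1] .. src[BLOCK]. */
static void filter_block(uint8_t *dst, const uint8_t *src)
{
    for (int i = 0; i < BLOCK; i++)
        dst[i] = filter3_folded(src[i - 1], src[i], src[i + 1]);
}

/* Filter n >= BLOCK outputs using full-block stores only. */
static void filter_overlap_last(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    assert(n >= BLOCK);
    for (; i + BLOCK <= n; i += BLOCK)
        filter_block(dst + i, src + i);
    if (i < n) /* tail: redo the last full block, overlapping the previous one */
        filter_block(dst + n - BLOCK, src + n - BLOCK);
}
```

The overlapping block recomputes a few outputs that were already
written. Because it reads only from src, the recomputed values are
identical, which is why this sketch requires separate source and
destination buffers.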

Strong smoothing (32x32): preloaded weight tables, interleaved
umull/umlal pairs (two 16-byte blocks at a time) to hide
rshrn-to-store latency, with paired st1 for 32-byte writes.
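
For orientation, the arithmetic behind strong smoothing can be sketched
in scalar C. The interpolation formula is the one from the HEVC
specification; the table layout and every name below (w_near, w_far,
init_weights, strong_direct, strong_table) are assumptions for
illustration, not the layout used in hevcpred_neon.S. The comments mark
which step each of umull, umlal and rshrn would correspond to in such a
model.

```c
#include <stdint.h>

/* Hypothetical weight tables: w_far[i] = i + 1, w_near[i] = 64 - (i + 1). */
static uint8_t w_near[64], w_far[64];

/* Fill the tables once before calling strong_table(). */
static void init_weights(void)
{
    for (int i = 0; i < 64; i++) {
        w_far[i]  = (uint8_t)(i + 1);
        w_near[i] = (uint8_t)(63 - i);
    }
}

/* Direct form: 63 samples interpolated between corner and last. */
static void strong_direct(uint8_t out[63], uint8_t corner, uint8_t last)
{
    for (int i = 0; i < 63; i++)
        out[i] = (uint8_t)(((64 - (i + 1)) * corner + (i + 1) * last + 32) >> 6);
}

/* Table-driven form: multiply, multiply-accumulate, rounding narrow.
 * The sum never exceeds 64 * 255, so a 16-bit accumulator is enough. */
static void strong_table(uint8_t out[64], uint8_t corner, uint8_t last)
{
    for (int i = 0; i < 64; i++) {
        uint16_t acc = (uint16_t)(w_near[i] * corner); /* like umull */
        acc = (uint16_t)(acc + w_far[i] * last);       /* like umlal */
        out[i] = (uint8_t)((acc + 32) >> 6);           /* like rshrn #6 */
    }
}
```

With 64-entry tables the final lane has weights 0 and 64, so it
reproduces last unchanged. That would let a vector version write all 64
bytes with full-width stores, though whether the commit does so is not
stated in the message.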

checkasm --bench --runs=15 (Apple M4, average of 3 trials):
  ref_filter_3tap_8x8_8_neon:    4.1x
  ref_filter_3tap_16x16_8_neon:  3.3x
  ref_filter_3tap_32x32_8_neon:  2.5x
  ref_filter_strong_8_neon:      1.9x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>

Changed files

libavcodec/aarch64/hevcpred_init_aarch64.c
libavcodec/aarch64/hevcpred_neon.S
libavcodec/hevc/pred_template.c