Change #265094
| Category | ffmpeg |
| Changed by | Jun Zhao <barryjzhao@tencent.com> |
| Changed at | Tue 21 Apr 2026 09:50:49 |
| Repository | https://git.ffmpeg.org/ffmpeg.git |
| Project | ffmpeg |
| Branch | master |
| Revision | 75838b9c891768d3fb5ad066773c6838b942e6df |
Comments
lavc/hevc: add aarch64 NEON for reference sample filtering

3-tap [1,2,1]>>2: shared implementation body across size-specialized entry points (8x8/16x16/32x32) to reduce code size. Fold the 3-tap kernel into uhadd + urhadd: uhadd gives floor((prev+next)/2), then urhadd rounds with curr to produce (prev + 2*curr + next + 2) >> 2 on 16 bytes in-place (no widen/narrow needed). Overlap-last technique for tail avoids partial stores. Caller pads input arrays by 16 bytes to guarantee safe over-read.

Strong smoothing (32x32): preloaded weight tables, interleaved umull/umlal pairs (two 16-byte blocks at a time) to hide rshrn-to-store latency, with paired st1 for 32-byte writes.

checkasm --bench --runs=15 (Apple M4, average of 3 trials):
  ref_filter_3tap_8x8_8_neon:   4.1x
  ref_filter_3tap_16x16_8_neon: 3.3x
  ref_filter_3tap_32x32_8_neon: 2.5x
  ref_filter_strong_8_neon:     1.9x

Signed-off-by: Jun Zhao <barryjzhao@tencent.com>
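
The uhadd + urhadd folding described in the commit message can be sanity-checked with a scalar model. The sketch below is not code from the patch: the helper names (uhadd_u8, urhadd_u8, filter_3tap_ref, filter_3tap_halving, count_3tap_mismatches) are hypothetical and chosen only for illustration. It assumes that each byte lane of UHADD computes (a + b) >> 1 and each byte lane of URHADD computes (a + b + 1) >> 1 without intermediate overflow, and compares the two-step result against the direct [1,2,1] formula for every 8-bit input triple.

```c
#include <stdint.h>

/* Scalar model of one byte lane of NEON UHADD: floor((a + b) / 2).
 * The sum is formed in int, so it cannot overflow 8 bits. */
uint8_t uhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b) >> 1);
}

/* Scalar model of one byte lane of NEON URHADD: (a + b + 1) >> 1. */
uint8_t urhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Direct 3-tap [1,2,1] filter with rounding, as stated in the commit. */
uint8_t filter_3tap_ref(uint8_t prev, uint8_t curr, uint8_t next)
{
    return (uint8_t)((prev + 2 * curr + next + 2) >> 2);
}

/* Folded form: halve prev+next first, then rounding-halve with curr.
 * Everything stays within 8 bits, so no widen/narrow step is needed. */
uint8_t filter_3tap_halving(uint8_t prev, uint8_t curr, uint8_t next)
{
    return urhadd_u8(uhadd_u8(prev, next), curr);
}

/* Exhaustively compare both forms over all 256^3 input triples and
 * return how many triples disagree. */
unsigned count_3tap_mismatches(void)
{
    unsigned mismatches = 0;
    for (int p = 0; p < 256; p++)
        for (int c = 0; c < 256; c++)
            for (int n = 0; n < 256; n++)
                if (filter_3tap_ref((uint8_t)p, (uint8_t)c, (uint8_t)n) !=
                    filter_3tap_halving((uint8_t)p, (uint8_t)c, (uint8_t)n))
                    mismatches++;
    return mismatches;
}
```

The two forms agree because the floor taken by the first halving is absorbed by the second: floor((floor(s/2) + curr + 1) / 2) equals floor((s + 2*curr + 2) / 4) for any non-negative integer s = prev + next. Calling count_3tap_mismatches() from a small test program is one way to confirm this under the lane semantics assumed above.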
Changed files
- libavcodec/aarch64/hevcpred_init_aarch64.c
- libavcodec/aarch64/hevcpred_neon.S
- libavcodec/hevc/pred_template.c