On Tuesday 02 January 2007 04:23, you wrote:

Well, never mind the question in my previous post:
http://lists.mplayerhq.hu/pipermail/ffmpeg-devel/2007-January/050356.html

I thought that this code could be improved by reducing the number of
instructions for loading and storing data (loading and storing values as
32-bit or even 64-bit). But it appears that the performance improvement is
minimal (only the loads can be optimized) and is not worth the additional
trouble.

So that patch (attached to my previous post) is final, and I think it is
ready for commit now :)

I verified it using a synthetic correctness/performance test program:
https://garage.maemo.org/plugins/scmsvn/viewcvs.php/trunk/libavcodec/tests/?root=mplayer

# ./test-unquantize
dct_unquantize_h263_helper_c
time=0.07111 usec per element, or 17.8 cycles (250MHz), 29.6 cycles (416MHz)
dct_unquantize_h263_helper_armv5te
time=0.03072 usec per element, or 7.7 cycles (250MHz), 12.8 cycles (416MHz)

It was tested on a Nokia 770, so the cycle estimate for 250MHz is the valid
one. This ARM optimized code is more than twice as fast as the code
generated by gcc 4.1.1 (-march=armv5te -mtune=arm926ej-s -O3
-fomit-frame-pointer).

I also tested decoding (using mplayer from svn) of the Doom trailer from:
http://www.divx.com/movies/detail.php?movieID=57&cID=1

Output of gprof before patch:

Each sample counts as 0.01 seconds.
     % cumulative     self              self    total
  time    seconds  seconds    calls  ms/call  ms/call  name
 10.81       9.73     9.73  5372100     0.00     0.00  mpeg4_decode_block
  9.07      17.90     8.17                             idct_col_put_armv5te
  7.41      24.57     6.67  1497822     0.00     0.00  dct_unquantize_h263_intra_c
  6.83      30.72     6.15  1228760     0.01     0.02  ff_mpeg4_decode_mb
  6.77      36.82     6.10   911294     0.01     0.01  put_pixels16_c
  6.32      42.51     5.69  1074435     0.01     0.02  MPV_motion
  5.92      47.84     5.33                             idct_col_add_armv5te
  5.76      53.03     5.19  1228760     0.00     0.03  MPV_decode_mb
  3.75      56.41     3.38  1497822     0.00     0.00  mpeg4_pred_ac
  3.43      59.50     3.09                             put_pixels8_arm
  2.65      61.89     2.39                             idct_row_armv5te

Output of gprof after patch:

Each sample counts as 0.01 seconds.
     % cumulative     self              self    total
  time    seconds  seconds    calls  ms/call  ms/call  name
 11.07       9.52     9.52  5372100     0.00     0.00  mpeg4_decode_block
 10.20      18.29     8.77                             idct_col_put_armv5te
  7.22      24.50     6.21   911294     0.01     0.01  put_pixels16_c
  6.71      30.27     5.77  1074435     0.01     0.02  MPV_motion
  6.64      35.98     5.71  1228760     0.00     0.02  ff_mpeg4_decode_mb
  6.35      41.44     5.46                             idct_col_add_armv5te
  5.83      46.45     5.01  1228760     0.00     0.03  MPV_decode_mb
  4.41      50.24     3.79  1497822     0.00     0.00  dct_unquantize_h263_intra_armv5te
  3.54      53.28     3.04  1497822     0.00     0.00  mpeg4_pred_ac
  3.48      56.27     2.99                             put_pixels8_arm
  3.09      58.93     2.66                             idct_row_armv5te

I also tested running mplayer with '-vo md5sum'; the results are identical.

It is a pity that ARMv5TE does not have SIMD instructions, as SIMD would
suit this code much better. Still, this ARM optimized function is a lot
faster than the gcc generated code and provides about a 3% overall
improvement on this video file.

PS. I have started a thread about ffmpeg optimizations for ARM on the
oesf.org forum:
http://www.oesf.org/forums/index.php?showtopic=22280

Maybe it will at least be possible to find some people willing to try
compiling and testing ffmpeg/mplayer on XScale processors :)