-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add autodetection for POWER7, POWER9 & POWER10 #647
Conversation
Read from `/proc/cpuinfo` as done for ARM. Fixes flame#501
Thanks @Flamefire. Two questions:
|
Yes, that's the reason for this: Without this patch there is no blis/build/detect/config/config_detect.c Line 72 in 9fea633
See #501 for complete output
According to https://www.ibm.com/docs/de/xl-c-and-cpp-linux/13.1.1?topic=features-macros-related-architecture-settings not:
This is a port of my PR amd#6 which I tested on 0.8.1 and 0.9.0 (although the later then fails due to #621 while 0.8.1 works) |
Awesome, merging. |
Details: - This attribution was intended to go in PR #647.
Details: - Fixed a harmless bug that would have allowed C++ headers into the list of header suffices specifically reserved for C99 headers. In practice, this would have had no substantive effect on anything since the core BLIS framework does not use C++ headers. - (cherry picked from commit bbaf29a) CREDITS file update. Details: - Thanks to Kihiro Bando for assisting with issue #644. - (cherry picked from commit a48e29d) Removed buggy cruft from power10 subconfig. Details: - Removed #defines for BLIS_BBN_s and BLIS_BBN_d from bli_kernel_defs_power10.h. These were inadvertently set in ae10d94 because the power10 subconfig was registering bb packm ukernels, but only for 6xk (power10 uses s8x16 and d8x8 ukernels) and only because the original author (probably) copy-pasted from power9 when getting started. That 6xk packm registration was effectively "dead code" prior to ae10d94, but was then mistaken as not-dead code during the ae10d94 refactor. These improper bb factors may have been causing bugs in power10 builds. Thanks to Nicholai Tukanov for helping remind me what the power10 subconfig was supposed to look like. - Removed extraneous microkernel preference registrations from power10 subconfig. Preferences for single and double complex gemm were being registered despite there being no complex gemm ukernels registered to go with them. Similarly, there were trsm preferences registered without any trsm ukernels registered (and BLIS doesn't actually use a preference for the trsm ukernel anyway). These extraneous registrations were almost surely not hurting anything, even if they were quite misleading. - (cherry picked from commit 5b29893) Disable modification of KC in the gemmsup kernels. (#648) This led to a ~50% performance reduction for certain gemm operations (but not others?). See #644 for example. - (cherry picked from commit 56de31b) Fixed out-of-bounds bug in sup s6x16m haswell kernel. Details: - Fixed another out-of-bounds read access bug in the haswell sup assembly kernels. This bug is similar to the one fixed in 17b0caa and affects bli_sgemmsup_rv_haswell_asm_6x2m(). Thanks to Madeesh Kannan for reporting this bug (and a suitable fix) in #635. - CREDITS file update. - (cherry picked from commit 4dde947) Add `#line` directives to flattened `blis.h`. (#643) Details: - Modified flatten-headers.py so that #line directives are inserted into the flattened blis.h file. This facilitates easier debugging when something is amiss in the flattened blis.h because the compiler will be able to refer to the line number within the original constituent header file (which is where the fix would go) rather than the line number within the flattened header (which is not as helpful). - (cherry picked from commit 6826c1c) Add autodetection for POWER7, POWER9 & POWER10 (#647) Read from `/proc/cpuinfo` as done for ARM. Fixes #501 - (cherry picked from commit af3a41e) Fixed out-of-bounds read in haswell gemmsup kernels. Details: - Fixed memory access bugs in the bli_sgemmsup_rv_haswell_asm_Mx2() kernels, where M = {1,2,3,4,5,6}. The bugs were caused by loading four single-precision elements of C, via instructions such as: vfmadd231ps(mem(rcx, 0*32), xmm3, xmm4) in situations where only two elements are guaranteed to exist. (These bugs may not have manifested in earlier tests due to the leading dimension alignment that BLIS employs by default.) The issue was fixed by replacing lines like the one above with: vmovsd(mem(rcx), xmm0) vfmadd231ps(xmm0, xmm3, xmm4) Thus, we use vmovsd to explicitly load only two elements of C into registers, and then operate on those values using register addressing. Thanks to Daniël de Kok for reporting these bugs in #635, and to Bhaskar Nallani for proposing the fix). - CREDITS file update. - (cherry picked from commit 17b0caa) Allow uniform max problem sizes in test/3/runme.sh. Details: - Tweaked test/3/runme.sh so that the test driver binaries for single- threaded (st), single-socket (1s), and dual-socket (2s) execution can be built using identical problem size ranges. Previously, this was not possible because runme.sh used the maximum problem size, which was embedded into the binary filename, to tell the three classes of binaries apart from one another. Now, runme.sh uses the binary suffix ("st", "1s", or "2s") to tell them apart. This required only a few changes to the logic, but it also required a change in format to the threading config strings themselves (replacing the max problem size with "st", "1s", or "2s"). Thanks to Jeff Diamond for inspiring this improvement. - Comment updates. - (cherry picked from commit cc260fd) Use BLIS_ENABLE_COMPLEX_RETURN_INTEL in blastest files (#636) Details: - Fixed a crash that occurs when either cblat1 or zblat1 are linked with a build of BLIS that was compiled with '--complex-return=intel'. This fix involved inserting preprocessor macro guards based on BLIS_ENABLE_COMPLEX_RETURN_INTEL into blastest/src/cblat1.c and blastest/src/zblat1.c to correctly handle situations where BLIS is compiled with Intel/f2c-style calling conventions for complex numbers. - Updated blastest/src/fortran/run-f2c.sh so that future executions will insert the aforementioned cpp macro conditional where appropriate. - (cherry picked from commit 9b1beec) Change complex_return='intel' for ifx. (#637) Details: - When checking the version string of the Fortran compiler for the purposes of determining a default return convention for complex domain values, grep for "IFORT" instead of "ifort" since that string is common to both the 'ifx' and 'ifort' binaries provided by Intel: $ ifx --version ifx (IFORT) 2022.1.0 20220316 Copyright (C) 1985-2022 Intel Corporation. All rights reserved. $ ifort --version ifort (IFORT) 2021.6.0 20220226 Copyright (C) 1985-2022 Intel Corporation. All rights reserved. - (cherry picked from commit 98d4678) Minor changes to .gitignore and LICENSE files. (#642) Details: - Macs create .DS_Store files in every directory visited. Updated .gitignore file so these files won't be reported as untracked by 'git status'. - Added Oracle Corporation to the LICENSE file. - Updated UT copyright on behalf of SHPC. - (cherry picked from commit ffde54c) Minor cleanups, comment updates to bli_gks.c. Details: - Removed a redundant registration of 'a64fx' subconfig in bli_gks_init(). - Reordered registration of 'armsve', 'a64fx', and 'firestorm' subconfigs. Thanks to Jeff Diamond for his input on this reordering. - Comment updates to bli_gks.c and arch_t enum in bli_type_defs.h. - (cherry picked from commit 7cba7ce)
- (cherry picked from commit c803b03) Fix auto-detection of firestorm (Apple M1). - (cherry picked from commit 2dd692b) Added Discord documentation (#677) Details: - Added a docs/Discord.md markdown document that walks the reader through creating a Discord account, obtaining the invite link, and using the link to join the BLIS Discord server. - Updated README.md to reference the new Discord.md document in multiple places, including via the official Discord logo (used with explicit permission from representatives at Discord Inc.). - (cherry picked from commit 88105db) Shuffled checked properties in bli_l3_check.c. (#676) Details: - Added certain checks for matrix structure to the level-3 operations' _check() functions, and slightly reorganized existing checks. - (cherry picked from commit 23f5b8d) CREDITS file update. Details: - This attribution was intended to go in PR #647. - (cherry picked from commit 9453e0f) Reinstate sanity check in bli_pool_finalize. (#671) Details: - Added a reinit argument to bli_pool_finalize(). This bool will signal whether or not the function is being called from bli_pool_reinit(). If it is not being called from _reinit(), we can safely check to confirm that .top_index == 0 (i.e., all blocks have been checked in). But if it *is* being called from _reinit(), then that check will be skipped since one of the predicted use cases for bli_pool_reinit() anticipates that some blocks are (probably) checked out when the pool_t is reinitialized. - Updated existing invocations of bli_pool_finalize() to pass in either FALSE (from bli_apool_free_block() or bli_pba_finalize_pools()) or TRUE (from bli_pool_reinit()) for the new reinit argument. - (cherry picked from commit 76a23bd) Fix some bugs in bli_pool.c (#670) Details: - Add a check for premature pool exhaustion when checking in blocks via bli_pool_checkin_block(). This detects "double-free" and other bad conditions that don't necessarily result in a segfault. - Make sure to copy all block pointers when growing the pool size. Previously, checked-out block pointers (which are guaranteed to be set to NULL) were not being copied, leading to the presence of uninitialized data. - (cherry picked from commit 63470b4) Add AddressSanitizer (-fsanitize=address) option. (#669) Details: - Added support for AddressSanitizer (ASan), a compiler-integrated memory error detector. The option (disabled by default) enables compiling and linking with the -fsanitize=address flag supported by clang, gcc, and probably others. This flag is employed during compilation of all BLIS source files *except* for optimized kernels, which are exempted because ASan usually requires an extra register, which violates the constraints for many gemm microkernels. - Minor whitespace, comment, ordering, and configure help text updates. - (cherry picked from commit 42d0e66) Add consistent NaN/Inf handling in sumsqv. (#668) Details: - Changed sumsqv implementation as follows: - If there is a NaN (either real or imaginary), then return a sum of NaN and unit scale. - Else, if there is an Inf (either real or imaginary), then return a sum of +Inf and unit scale. - Otherwise behave as normal. - (cherry picked from commit b861c71) Parameterized test/3 drivers via command line args. (#667) Details: - Rewrote the drivers in test/3, the Makefile, and the runme.sh script so that most of the important parameters, including parameter combo, datatype, storage combo, induced method, problem size range, dimension bindings, number of repeats, and alpha/beta values can be passed in via command line arguments. (Previously, most of these parameters were hard-coded into the driver source, except a few that were hard-coded into the Makefile.) If no argument is given for any particular option, it will be assigned a sane default. Either way, the values employed at runtime will be printed to stdout before the performance data in a section that is commented out with '%' characters (which is used by matlab and octave for comments), unless the -q option is given, in which case the driver will proceed quietly and output only performance data. Each driver also provides extensive help via the -h option, with the help text tailored for the operation in question (e.g. gemm, hemm, herk, etc.). In this help text, the driver reminds the user which implementation it was linked to (e.g. blis, openblas, vendor, eigen). Thanks to Jeff Diamond for suggesting this CLI-based reimagining of the test/3 drivers. - In the test/3 drivers: converted cpp macro string constants, as well as two string literals (for the opname and pc_str) used in each test driver, to global (or static) const char* strings, and replaced the use of strncpy() for storing the results of the command line argument parsing with pointer copies from the corresponding strings in argv. This works because the argv array is guaranteed by the C99 standard to persist throughout the life of the program. This new approach uses less storage and executes faster. Thanks to Minh Quan Ho for recommending this change. - Renamed the IMP_STR cpp macro that gets defined on the command line, via the test/3/Makefile, to IMPL_STR. - Updated runme.sh to set the problem size ranges for single-threaded and multithreaded execution independently from one another, as well as on a per-system basis. - Added a 'quiet' variable to runme.sh that can easily toggle quiet mode for the test drivers' output. - Very minor typecast fix in call to bli_getopt() in bli_utils.c. - In bli_getopt(), changed the nextchar variable from being a local static variable to a field of the getopt_t state struct. (Not sure why it was ever declared static to begin with.) - Other minor changes to bli_getopt() to accommodate the rewritten test drivers' command line parsing needs. - (cherry picked from commit ee81efc)
- (cherry picked from commit c803b03) Fix auto-detection of firestorm (Apple M1). - (cherry picked from commit 2dd692b) Added Discord documentation (#677) Details: - Added a docs/Discord.md markdown document that walks the reader through creating a Discord account, obtaining the invite link, and using the link to join the BLIS Discord server. - Updated README.md to reference the new Discord.md document in multiple places, including via the official Discord logo (used with explicit permission from representatives at Discord Inc.). - (cherry picked from commit 88105db) Shuffled checked properties in bli_l3_check.c. (#676) Details: - Added certain checks for matrix structure to the level-3 operations' _check() functions, and slightly reorganized existing checks. - (cherry picked from commit 23f5b8d) CREDITS file update. Details: - This attribution was intended to go in PR #647. - (cherry picked from commit 9453e0f) Reinstate sanity check in bli_pool_finalize. (#671) Details: - Added a reinit argument to bli_pool_finalize(). This bool will signal whether or not the function is being called from bli_pool_reinit(). If it is not being called from _reinit(), we can safely check to confirm that .top_index == 0 (i.e., all blocks have been checked in). But if it *is* being called from _reinit(), then that check will be skipped since one of the predicted use cases for bli_pool_reinit() anticipates that some blocks are (probably) checked out when the pool_t is reinitialized. - Updated existing invocations of bli_pool_finalize() to pass in either FALSE (from bli_apool_free_block() or bli_pba_finalize_pools()) or TRUE (from bli_pool_reinit()) for the new reinit argument. - (cherry picked from commit 76a23bd) Fix some bugs in bli_pool.c (#670) Details: - Add a check for premature pool exhaustion when checking in blocks via bli_pool_checkin_block(). This detects "double-free" and other bad conditions that don't necessarily result in a segfault. - Make sure to copy all block pointers when growing the pool size. Previously, checked-out block pointers (which are guaranteed to be set to NULL) were not being copied, leading to the presence of uninitialized data. - (cherry picked from commit 63470b4) Add AddressSanitizer (-fsanitize=address) option. (#669) Details: - Added support for AddressSanitizer (ASan), a compiler-integrated memory error detector. The option (disabled by default) enables compiling and linking with the -fsanitize=address flag supported by clang, gcc, and probably others. This flag is employed during compilation of all BLIS source files *except* for optimized kernels, which are exempted because ASan usually requires an extra register, which violates the constraints for many gemm microkernels. - Minor whitespace, comment, ordering, and configure help text updates. - (cherry picked from commit 42d0e66) Add consistent NaN/Inf handling in sumsqv. (#668) Details: - Changed sumsqv implementation as follows: - If there is a NaN (either real or imaginary), then return a sum of NaN and unit scale. - Else, if there is an Inf (either real or imaginary), then return a sum of +Inf and unit scale. - Otherwise behave as normal. - (cherry picked from commit b861c71) Parameterized test/3 drivers via command line args. (#667) Details: - Rewrote the drivers in test/3, the Makefile, and the runme.sh script so that most of the important parameters, including parameter combo, datatype, storage combo, induced method, problem size range, dimension bindings, number of repeats, and alpha/beta values can be passed in via command line arguments. (Previously, most of these parameters were hard-coded into the driver source, except a few that were hard-coded into the Makefile.) If no argument is given for any particular option, it will be assigned a sane default. Either way, the values employed at runtime will be printed to stdout before the performance data in a section that is commented out with '%' characters (which is used by matlab and octave for comments), unless the -q option is given, in which case the driver will proceed quietly and output only performance data. Each driver also provides extensive help via the -h option, with the help text tailored for the operation in question (e.g. gemm, hemm, herk, etc.). In this help text, the driver reminds the user which implementation it was linked to (e.g. blis, openblas, vendor, eigen). Thanks to Jeff Diamond for suggesting this CLI-based reimagining of the test/3 drivers. - In the test/3 drivers: converted cpp macro string constants, as well as two string literals (for the opname and pc_str) used in each test driver, to global (or static) const char* strings, and replaced the use of strncpy() for storing the results of the command line argument parsing with pointer copies from the corresponding strings in argv. This works because the argv array is guaranteed by the C99 standard to persist throughout the life of the program. This new approach uses less storage and executes faster. Thanks to Minh Quan Ho for recommending this change. - Renamed the IMP_STR cpp macro that gets defined on the command line, via the test/3/Makefile, to IMPL_STR. - Updated runme.sh to set the problem size ranges for single-threaded and multithreaded execution independently from one another, as well as on a per-system basis. - Added a 'quiet' variable to runme.sh that can easily toggle quiet mode for the test drivers' output. - Very minor typecast fix in call to bli_getopt() in bli_utils.c. - In bli_getopt(), changed the nextchar variable from being a local static variable to a field of the getopt_t state struct. (Not sure why it was ever declared static to begin with.) - Other minor changes to bli_getopt() to accommodate the rewritten test drivers' command line parsing needs. - (cherry picked from commit ee81efc) Allow test/3 drivers to use default ind_t method. (#804) Details: - Previously, the standalone performance drivers in test/3 were written under the assumption that the user would want to explicitly test either native execution *or* 1m. But because the accompanying runme.sh script defaults to passing "native" in for the -i command line option (which explicitly sets the induced method type), running the script without modification causes the test drivers to use slow reference microkernels on systems where native complex-domain microkernels are not registered -- which will yield poor performance for complex-domain level-3 operations. Furthermore, even if a user was aware of this, the test drivers did not support any single value for the -i option that would test BLIS using the library's default behavior -- that is, using 1m on systems where it is needed and native execution on systems that have native microkernels implemented and registered. - This commit addresses the aforementioned issue by supporting a new value for the -i option: "auto". The "auto" value causes the driver to avoid explicitly setting the induced method altogether, leaving BLIS's default behavior in place. This "auto" option is also now the default setting within the runme.sh script. Thanks to Leick Robinson for finding and reporting this issue. - Also added support for "nat" as a shorthand for "native", which the help text already (erroneously) claimed was supported. - (cherry picked from commit fd1a7e3) Use "-i auto" by default in test/3 drivers. Details: - Request default induced method behavior of BLIS via "-i auto" when running the standalone performance drivers in test/3 via the runme.sh script present in that directory. (Previously, the runme.sh script would use "-i native" by default.) This change was originally intended for fd1a7e3. - (cherry picked from commit cad5149)
Read from
/proc/cpuinfo
as done for ARM.Fixes #501