Code enhancements #1429

Open
wants to merge 31 commits into
base: develop
Changes from 30 commits
71f5680
Initial commit for testing
PJAvinash Oct 16, 2024
39f0cb0
Fix memory leak in checkOptions
PJAvinash Oct 17, 2024
87fe181
Fix memory leak in checkOption
PJAvinash Oct 17, 2024
e460625
x
PJAvinash Oct 17, 2024
7c21e4c
Delete cmake-3.28.2-linux-x86_64.sh
PJAvinash Oct 17, 2024
41026f4
gcn changes
PJAvinash Oct 17, 2024
6b34cef
Merge branch 'MEM_LEAK_FIXES' of https://github.com/PJAvinash/rccl in…
PJAvinash Oct 18, 2024
4e5776d
gcn memleak fixes
PJAvinash Oct 18, 2024
0537ffe
gcn leak fix
PJAvinash Oct 18, 2024
1c95f27
memory leak fixes for parseRome4P2H and ncclTopoAddGPU
PJAvinash Oct 22, 2024
be00579
Merge pull request #1 from PJAvinash/MEM_LEAK_FIXES
PJAvinash Oct 22, 2024
9f49cc4
Keeping only necessary file for fixes
PJAvinash Oct 22, 2024
c62d7bd
Merge branch 'ROCm:develop' into develop
PJAvinash Oct 25, 2024
e0ce29c
changing to GCN_ARCH_NAME_LEN
PJAvinash Oct 28, 2024
837e473
Added sanity check directory
PJAvinash Oct 29, 2024
4adb7b9
refactoring scripts
i-chaochen Oct 29, 2024
1d6eda0
Merge branch 'ROCm:develop' into develop
PJAvinash Oct 30, 2024
0face6b
Merge branch 'ROCm:develop' into develop
PJAvinash Oct 30, 2024
0224efd
Updated to sanity checks folder
i-chaochen Oct 31, 2024
e07278b
Initial fixes
PJAvinash Nov 8, 2024
a637d89
Merge branch 'ROCm:develop' into MEM_LEAK_TEST_TOOLS
PJAvinash Nov 8, 2024
d8a142d
Merge branch 'ROCm:develop' into develop
PJAvinash Nov 12, 2024
faa1840
changes in tools
PJAvinash Nov 12, 2024
d319267
Merge branch 'ROCm:develop' into develop
PJAvinash Nov 20, 2024
0461beb
Merge branch 'ROCm:develop' into MEM_LEAK_TEST_TOOLS
PJAvinash Nov 21, 2024
450765b
Merge pull request #2 from PJAvinash/MEM_LEAK_TEST_TOOLS
PJAvinash Nov 21, 2024
fae9c89
pointing RCCL lib build to debug version
PJAvinash Nov 21, 2024
949c0e3
Removed second pthread_detach
PJAvinash Nov 21, 2024
48d5928
Removing sanity checks
PJAvinash Nov 21, 2024
ac1c396
Keeping only code changes
PJAvinash Nov 21, 2024
8c08de7
Merge branch 'ROCm:develop' into CODE_ENHANCEMENTS
PJAvinash Jan 30, 2025
14 changes: 8 additions & 6 deletions src/transport/net_ib.cc
@@ -408,8 +408,8 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction) {
}

// Detect IB cards
-int nIbDevs;
-struct ibv_device** devices;
+int nIbDevs = 0;
+struct ibv_device** devices = NULL;

// Check if user defined which IB device:port to use
char* userIbEnv = getenv("NCCL_IB_HCA");
@@ -434,7 +434,12 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction) {
memset(&devAttr, 0, sizeof(devAttr));
if (ncclSuccess != wrap_ibv_query_device(context, &devAttr)) {
WARN("NET/IB : Unable to query device %s", devices[d]->name);
-if (ncclSuccess != wrap_ibv_close_device(context)) { ret = ncclInternalError; goto fail; }
+if (ncclSuccess != wrap_ibv_close_device(context))
+{
+if(ncclSuccess != wrap_ibv_free_device_list(devices)){WARN("NET/IB : Unable to free device list");}
+ret = ncclInternalError;
+goto fail;
+}
continue;
}
for (int port_num = 1; port_num <= devAttr.phys_port_cnt; port_num++) {
@@ -505,9 +510,6 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction) {
ncclIbMergedDevs[mergedDev].speed += ncclIbDevs[ncclNIbDevs].speed;
ncclNIbDevs++;
nPorts++;
-// [RCCL]
-pthread_detach(ncclIbAsyncThread);
-// [/RCCL]
Collaborator

Why is pthread_detach being removed? It was added to fix a memory leak.

Contributor Author

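pthread_detach() is called twice on the same thread, the first time at line 486. The second call's return value is never checked and the call has no useful effect: the thread is already detached, so the call can only fail. Note that pthread functions report failure by returning an error number (typically EINVAL in this case on glibc) rather than -1, and POSIX leaves the result of detaching an already-detached thread unspecified.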
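
To make the point about the duplicate call concrete, here is a minimal sketch of a detach-once guard. It is not code from this PR: `struct worker`, `worker_start`, `worker_detach_once`, and `worker_main` are hypothetical names used only for illustration. It assumes the thread is detached right after creation, records that fact, and checks the return value, which is an error number rather than -1.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical wrapper (not from RCCL): tracks whether the thread
 * has been started and whether it has already been detached. */
struct worker {
  pthread_t thread;
  bool started;
  bool detached;
};

/* Trivial thread body for the sketch. */
static void* worker_main(void* arg) {
  (void)arg;
  return NULL;
}

/* Detach at most once. pthread functions return 0 on success and an
 * error number on failure; they do not return -1 or set errno. */
static int worker_detach_once(struct worker* w) {
  if (!w->started || w->detached) return 0; /* nothing left to do */
  int rc = pthread_detach(w->thread);
  if (rc != 0) {
    fprintf(stderr, "pthread_detach failed: %s\n", strerror(rc));
    return rc;
  }
  w->detached = true;
  return 0;
}

/* Create the thread and detach it immediately, so its resources are
 * reclaimed when it exits without anyone calling pthread_join. */
static int worker_start(struct worker* w) {
  int rc = pthread_create(&w->thread, NULL, worker_main, NULL);
  if (rc != 0) return rc;
  w->started = true;
  return worker_detach_once(w);
}
```

With a guard like this, a later call to `worker_detach_once` on the same `struct worker` is a harmless no-op instead of a second `pthread_detach` on an already-detached thread.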

}
if (nPorts == 0 && ncclSuccess != wrap_ibv_close_device(context)) { ret = ncclInternalError; goto fail; }
}
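
The two changes in this file (initializing `devices` to NULL and freeing the device list on a failure branch) both concern cleanup of the device list. As a hedged illustration only, the sketch below shows one common alternative shape: a single cleanup label that frees the list on every path, which is safe precisely because the pointer starts out as NULL. It is not the approach taken in the PR. `fake_get_device_list`, `fake_free_device_list`, `init_devices`, and `g_live_lists` are hypothetical stand-ins, so the example does not depend on libibverbs.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a device-list allocate/free pair.
 * g_live_lists counts lists that have been allocated but not freed. */
static int g_live_lists = 0;

static char** fake_get_device_list(int* n) {
  char** list = calloc(3, sizeof(*list)); /* 2 entries + NULL terminator */
  if (list == NULL) return NULL;
  *n = 2;
  g_live_lists++;
  return list;
}

static void fake_free_device_list(char** list) {
  if (list == NULL) return; /* safe to call when nothing was allocated */
  free(list);
  g_live_lists--;
}

/* Returns 0 on success, -1 on failure. fail_on_dev simulates a
 * per-device failure at that index; pass -1 for no failure. */
static int init_devices(int fail_on_dev) {
  int ret = 0;
  int nDevs = 0;
  char** devices = NULL; /* NULL so the cleanup label is safe on every path */

  devices = fake_get_device_list(&nDevs);
  if (devices == NULL) { ret = -1; goto cleanup; }

  for (int d = 0; d < nDevs; d++) {
    if (d == fail_on_dev) { ret = -1; goto cleanup; } /* simulated failure */
  }

cleanup:
  fake_free_device_list(devices); /* runs on both success and failure */
  return ret;
}
```

The design choice here is that no failure branch needs its own free call, so adding a new early exit later cannot reintroduce the leak.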