NN clustering: VRAM memory leak fix + (u)int -> (u)int32_t by ChSonnabend · Pull Request #14272 · AliceO2Group/AliceO2

ChSonnabend · 2025-05-14T07:37:29Z

This PR fixes memory leaks and changes (u)int -> (u)int32_t in the GPU kernel code

github-actions · 2025-05-14T07:37:39Z

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass4
async-2023-pp-apass4
async-2024-pp-apass1
async-2022-pp-apass7
async-2024-pp-cpass0
async-2024-PbPb-apass1
async-2024-ppRef-apass1
async-2024-PbPb-apass2
async-2023-PbPb-apass5

Please consider the following formatting changes to AliceO2Group#14272

alibuild · 2025-05-14T10:44:59Z

Error while checking build/O2/fullCI_slc9 for b0e8923 at 2025-05-14 12:44:

## sw/BUILD/O2-latest/log
clang: error: unable to execute command: Segmentation fault (core dumped)
clang: error: llvm-spirv command failed due to signal (use -v to see invocation)
ninja: build stopped: subcommand failed.

Full log here.

alibuild · 2025-05-14T23:54:46Z

Error while checking build/O2/fullCI_slc9 for afbdade at 2025-05-15 01:54:

## sw/BUILD/O2-latest/log
/sw/SOURCES/O2/14272-slc9_x86-64/0/GPU/GPUTracking/TPCClusterFinder/GPUTPCNNClusterizerKernels.cxx:302:54: error: no matching function for call to 'o2::gpu::GPUTPCCFClusterizer::sortIntoBuckets(o2::gpu::GPUTPCClusterFinder&, o2::tpc::ClusterNative&, o2::gpu::tpccf::Row, uint32_t&, uint32_t*&, o2::tpc::ClusterNative*&, uint32_t&)'
ninja: build stopped: subcommand failed.

Full log here.

ChSonnabend · 2025-05-16T09:45:17Z

@drohr Can be merged from my side

davidrohr · 2025-05-16T11:00:14Z

Common/ML/include/ML/OrtInterface.h

  // ORT variables -> need to be hidden as pImpl
  struct OrtVariables;
-  OrtVariables* mPImplOrt;
+  std::shared_ptr<OrtVariables> mPImplOrt = nullptr;


Why do you use a shared_ptr and no unique_ptr?

Because I can't assign a nullptr in the header file since OrtVariables is not known there. It will throw an error for invalid size of incomplete type.

You have to declare the constructor and destructor in the header, but instanciate it in the cxx.
I.e. only OrtInterface() in the header, and
OrtInterface::OrtInterface() = default; in the cxx.
Then you can use unique_ptr.

Is there any reason why we need a unique_ptr? Especially for the OrtAllocator (which gets created only for one environment but used in multiple), shared_ptr would probably be preferable. Unique_ptr would also require a deleter, which will make the code more cumbersome...

davidrohr · 2025-05-16T14:17:01Z

But a unique_ptr is simpler than a shared_ptr. shared_ptr is basically a unique_ptr with refcounting. So if you don't need particularly a shared_ptr, you should use a unique_ptr. And what kind of deleter do you need?

ChSonnabend · 2025-05-16T14:31:06Z

But a unique_ptr is simpler than a shared_ptr. shared_ptr is basically a unique_ptr with refcounting. So if you don't need particularly a shared_ptr, you should use a unique_ptr. And what kind of deleter do you need? Kind Regards David Rohr Sent from my mobile. (Excuse the typos!)
…
On 16 May 2025 16:13:48 CEST, Christian Sonnabend @.> wrote: @ChSonnabend commented on this pull request. > @@ -113,7 +113,7 @@ class OrtModel private: // ORT variables -> need to be hidden as pImpl struct OrtVariables; - OrtVariables mPImplOrt; + std::shared_ptr mPImplOrt = nullptr; Is there any reason why we need a unique_ptr? Especially for the OrtAllocator (which gets created only for one environment but used in multiple), shared_ptr would probably be preferable. Unique_ptr would also require a deleter, which will make the code more cumbersome... -- Reply to this email directly or view it on GitHub: #14272 (comment) You are receiving this because your review was requested. Message ID: @.*>

An example.

OrtInterface.h:
'''
private:
// ORT variables -> need to be hidden as pImpl
struct OrtVariables;
struct OrtVariablesDeleter {
void operator()(OrtVariables* ptr) const;
};
std::unique_ptr<OrtVariables, OrtVariablesDeleter> mPImplOrt;
...
'''

OrtInterface.cxx:

'''
void OrtModel::OrtVariablesDeleter::operator()(OrtVariables* ptr) const {
delete ptr;
}

// General purpose
void OrtModel::initOptions(std::unordered_map<std::string, std::string> optionsMap)
{
mPImplOrt = std::unique_ptr<OrtVariables, OrtVariablesDeleter>(new OrtVariables());
...
}
'''

davidrohr · 2025-05-16T18:08:19Z

I would do it like in the patch below. You are basically using a shared_ptr as a unique_ptr, since your pointer is never really shared. The reason it works for you with a shared ptr is that for a shared ptr the constructor / destructor do not need sizeof(...), while for unique_ptr they do. Thus, with a unique_ptr the constructors must not be in the header. My patch moves them to the cxx.

And then, your setEnv / getEnv functions certainly cause memory corruption, irrespective of whether they use shared or unique ptr. With get(...), you are getting a raw C pointer, and then you are setting that to as the env of another OrtVariables struct, thus you have 2 unique_ptr (or in your case 2 shared_ptr, which do not know about each other), both owning the same object, which will lead to use after free memory corruption and to double delete.
It is not a problem right now, since from what I see in your code, you are never calling setEnv. But you must not use the unique_shared ptr in that way. If you use the get() semantic, you must never make something else own that pointer. If you want to get an owning pointer, you have to use the release() semantic.

diff --git a/Common/ML/include/ML/OrtInterface.h b/Common/ML/include/ML/OrtInterface.h
index 248542cebe..7224645425 100644
--- a/Common/ML/include/ML/OrtInterface.h
+++ b/Common/ML/include/ML/OrtInterface.h
@@ -45,14 +45,10 @@ class OrtModel
 
  public:
   // Constructors & destructors
-  OrtModel() = default;
-  OrtModel(std::unordered_map<std::string, std::string> optionsMap) { init(optionsMap); }
-  void init(std::unordered_map<std::string, std::string> optionsMap)
-  {
-    initOptions(optionsMap);
-    initEnvironment();
-  }
-  virtual ~OrtModel() = default;
+  OrtModel();
+  OrtModel(std::unordered_map<std::string, std::string> optionsMap);
+  void init(std::unordered_map<std::string, std::string> optionsMap);
+  virtual ~OrtModel();
 
   // General purpose
   void initOptions(std::unordered_map<std::string, std::string> optionsMap);
@@ -113,7 +109,7 @@ class OrtModel
  private:
   // ORT variables -> need to be hidden as pImpl
   struct OrtVariables;
-  std::shared_ptr<OrtVariables> mPImplOrt = nullptr;
+  std::unique_ptr<OrtVariables> mPImplOrt;
 
   // Input & Output specifications of the loaded network
   std::vector<const char*> mInputNamesChar, mOutputNamesChar;
diff --git a/Common/ML/src/OrtInterface.cxx b/Common/ML/src/OrtInterface.cxx
index 9d6da4c1ec..4af3b97aee 100644
--- a/Common/ML/src/OrtInterface.cxx
+++ b/Common/ML/src/OrtInterface.cxx
@@ -27,11 +27,20 @@ namespace o2
 namespace ml
 {
 
+OrtModel::OrtModel() = default;
+OrtModel::OrtModel(std::unordered_map<std::string, std::string> optionsMap) { init(optionsMap); }
+OrtModel::~OrtModel() = default;
+void OrtModel::init(std::unordered_map<std::string, std::string> optionsMap)
+{
+  initOptions(optionsMap);
+  initEnvironment();
+}
+
 struct OrtModel::OrtVariables { // The actual implementation is hidden in the .cxx file
   // ORT runtime objects
   Ort::RunOptions runOptions;
-  std::shared_ptr<Ort::Env> env = nullptr;
-  std::shared_ptr<Ort::Session> session = nullptr; ///< ONNX session
+  std::unique_ptr<Ort::Env> env;
+  std::unique_ptr<Ort::Session> session; ///< ONNX session
   Ort::SessionOptions sessionOptions;
   Ort::AllocatorWithDefaultOptions allocator;
   Ort::MemoryInfo memoryInfo = Ort::MemoryInfo("Cpu", OrtAllocatorType::OrtDeviceAllocator, 0, OrtMemType::OrtMemTypeDefault);
@@ -41,7 +50,7 @@ struct OrtModel::OrtVariables { // The actual implementation is hidden in the .c
 // General purpose
 void OrtModel::initOptions(std::unordered_map<std::string, std::string> optionsMap)
 {
-  mPImplOrt = std::make_shared<OrtVariables>();
+  mPImplOrt = std::make_unique<OrtVariables>();
 
   // Load from options map
   if (!optionsMap.contains("model-path")) {
@@ -101,7 +110,7 @@ void OrtModel::initOptions(std::unordered_map<std::string, std::string> optionsM
 
 void OrtModel::initEnvironment()
 {
-  mPImplOrt->env = std::make_shared<Ort::Env>(
+  mPImplOrt->env = std::make_unique<Ort::Env>(
     OrtLoggingLevel(mLoggingLevel),
     (mEnvName.empty() ? "ORT" : mEnvName.c_str()),
     // Integrate ORT logging into Fairlogger
@@ -129,7 +138,7 @@ void OrtModel::initSession()
   if (mAllocateDeviceMemory) {
     memoryOnDevice(mDeviceId);
   }
-  mPImplOrt->session = std::make_shared<Ort::Session>(*mPImplOrt->env, mModelPath.c_str(), mPImplOrt->sessionOptions);
+  mPImplOrt->session = std::make_unique<Ort::Session>(*mPImplOrt->env, mModelPath.c_str(), mPImplOrt->sessionOptions);
   mPImplOrt->ioBinding = std::make_unique<Ort::IoBinding>(*mPImplOrt->session);
 
   setIO();
@@ -166,7 +175,7 @@ void OrtModel::memoryOnDevice(int32_t deviceIndex)
 
 void OrtModel::resetSession()
 {
-  mPImplOrt->session = std::make_shared<Ort::Session>(*(mPImplOrt->env), mModelPath.c_str(), mPImplOrt->sessionOptions);
+  mPImplOrt->session = std::make_unique<Ort::Session>(*(mPImplOrt->env), mModelPath.c_str(), mPImplOrt->sessionOptions);
 }
 
 // Getters
@@ -252,7 +261,7 @@ void OrtModel::setIO()
 
 void OrtModel::setEnv(Ort::Env* env)
 {
-  mPImplOrt->env = std::shared_ptr<Ort::Env>(env);
+  mPImplOrt->env.reset(env);
 }
 
 // Inference

VRAM memory leak fix + (u)int -> (u)int32_t

4a7cd21

ChSonnabend requested a review from davidrohr as a code owner May 14, 2025 07:37

Please consider the following formatting changes

507b3d6

alibuild mentioned this pull request May 14, 2025

Please consider the following formatting changes to #14272 ChSonnabend/AliceO2#25

Merged

Merge pull request #25 from alibuild/alibot-cleanup-14272

b0e8923

Please consider the following formatting changes to AliceO2Group#14272

ChSonnabend changed the title ~~VRAM memory leak fix + (u)int -> (u)int32_t~~ NN clustering: VRAM memory leak fix + (u)int -> (u)int32_t May 14, 2025

ChSonnabend added 2 commits May 15, 2025 00:26

Fixing my own debug messages

cd6ceb5

Making shared pointer for releasing

afbdade

ChSonnabend requested a review from a team as a code owner May 14, 2025 22:28

Bug-fix

cefe787

davidrohr reviewed May 16, 2025

View reviewed changes

Adding Davids patch

9ff31cc

davidrohr merged commit 0c5140e into AliceO2Group:dev May 19, 2025
10 of 11 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NN clustering: VRAM memory leak fix + (u)int -> (u)int32_t#14272

NN clustering: VRAM memory leak fix + (u)int -> (u)int32_t#14272
davidrohr merged 7 commits intoAliceO2Group:devfrom
ChSonnabend:onnx_fixes_2

ChSonnabend commented May 14, 2025

Uh oh!

github-actions bot commented May 14, 2025

Uh oh!

alibuild commented May 14, 2025

Uh oh!

alibuild commented May 14, 2025

Uh oh!

ChSonnabend commented May 16, 2025

Uh oh!

davidrohr May 16, 2025

Uh oh!

ChSonnabend May 16, 2025

Uh oh!

davidrohr May 16, 2025

Uh oh!

ChSonnabend May 16, 2025

Uh oh!

davidrohr commented May 16, 2025 via email •

edited

Loading

Uh oh!

ChSonnabend commented May 16, 2025 •

edited

Loading

Uh oh!

davidrohr commented May 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

Conversation

ChSonnabend commented May 14, 2025

Uh oh!

github-actions bot commented May 14, 2025

Uh oh!

alibuild commented May 14, 2025

Uh oh!

alibuild commented May 14, 2025

Uh oh!

ChSonnabend commented May 16, 2025

Uh oh!

davidrohr May 16, 2025

Choose a reason for hiding this comment

Uh oh!

ChSonnabend May 16, 2025

Choose a reason for hiding this comment

Uh oh!

davidrohr May 16, 2025

Choose a reason for hiding this comment

Uh oh!

ChSonnabend May 16, 2025

Choose a reason for hiding this comment

Uh oh!

davidrohr commented May 16, 2025 via email • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ChSonnabend commented May 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davidrohr commented May 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

3 participants

davidrohr commented May 16, 2025 via email •

edited

Loading

ChSonnabend commented May 16, 2025 •

edited

Loading