Opened 7 years ago

Closed 5 years ago

#1959 closed enhancement (duplicate)

[icetray] multi-core IceTray 2020

Reported by: David Schultz
Owned by: Alex Olivas
Priority: normal
Milestone: Long-Term Future
Component: combo core
Keywords:
Cc: Alex Olivas, don la dieu

Description (last modified by don la dieu)

Proposal: make IceTray multi-core by 2020.

Main reasoning for full multi-core support:

Glideins are moving to multiple CPU cores all over the grid. We have more cores than we know what to do with, but not as much memory. Running one IceTray instance across multiple cores will probably reduce memory requirements.

Additionally, near the end of a glidein new jobs are not scheduled and more idle cores are available. It would be useful to dynamically allocate those CPUs to running jobs to make them finish faster. This might make the difference between job completion or failure due to walltime.

I note that CMS is already doing this, and other experiments are following.

Other tickets:

Attachments (1)

test_multi.py (3.5 KB) - added by David Schultz 7 years ago.
do multiprocess in python


Change History (12)

comment:1 Changed 7 years ago by Frederik Lauber

Wouldn't all the simulation and filtering also profit from this?
I was given to understand that, for example, CORSIKA simulation is by now limited by the available memory due to the photonics tables it needs (which means you need 3-4 GB of RAM for those alone). With multi-core support, could this data be shared between multiple trays, and hence lower the needed memory from 3.5 GB per core to 3.5 GB for N cores?

comment:2 Changed 7 years ago by David Schultz

The major benefit is shared state: the photonics tables, clsim primes, nugen cross sections, and so on. There might also be shared libraries and other code in memory that wouldn't need to be duplicated.

Note that production already runs each of these in separate jobs, so we don't need 3-4 GB of memory for every job.

comment:3 Changed 7 years ago by Jakob van Santen

A side note: we can get some of those benefits with interprocess memory sharing, e.g. with photospline 2 (https://github.com/cnweaver/photospline), without dealing with the GIL.

This is straightforward if we have a dedicated pilot job that creates the shared table and a corresponding job (a caboose?) that frees it when we vacate the node. A peer-sharing design, where tables are allocated on demand and freed when the last user exits, is trickier.
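
As a rough illustration only (not photospline's actual API): a minimal sketch of that pilot/caboose lifecycle, assuming POSIX shared memory through Python's multiprocessing.shared_memory (3.8+). The segment name, table file, and array shape are hypothetical.

    # Hypothetical pilot/caboose lifecycle for a node-wide shared table.
    # Assumes Python 3.8+; segment name and table file are placeholders.
    import numpy as np
    from multiprocessing import shared_memory

    SEGMENT = "photon_table"  # hypothetical well-known segment name

    def pilot():
        # Pilot: load the table once and publish it in shared memory.
        table = np.load("table.npy")  # placeholder table file
        shm = shared_memory.SharedMemory(name=SEGMENT, create=True,
                                         size=table.nbytes)
        np.ndarray(table.shape, dtype=table.dtype, buffer=shm.buf)[:] = table
        # Caveat: Python's resource tracker reclaims segments created by a
        # process when it exits, so a real pilot would have to stay alive
        # (or create the segment outside Python) for the node's lifetime.
        return shm

    def worker(shape, dtype):
        # Any job on the node: attach by name; no second copy in memory.
        # Shape and dtype must be agreed on out of band.
        shm = shared_memory.SharedMemory(name=SEGMENT)
        table = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
        result = table.sum()  # stand-in for spline evaluation
        shm.close()
        return result

    def caboose():
        # Caboose: free the segment when we vacate the node.
        shm = shared_memory.SharedMemory(name=SEGMENT)
        shm.close()
        shm.unlink()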

comment:4 Changed 7 years ago by don la dieu

  • Cc nega added
  • Description modified (diff)

comment:5 Changed 7 years ago by Jakob van Santen

As Claudio commented on Slack, there are a bunch of core assumptions in IceTray that are violated when you try to split frames across threads. These are probably worth writing down. Assuming that the parallelization model involves some driving I3Module (e.g. an I3Reader) distributing frames to N threads, the assumptions I can think of are:

  1. The main thread always holds the GIL. This is easy enough to fix by releasing the GIL in I3Tray::Execute() and protecting the (relatively few) re-entry points (PythonModule, various Python-subclassable bits of clsim and gulliver), at the cost of some overhead in the single-threaded case. Re-acquiring the GIL to evaluate conditions in I3ConditionalModule, however, could become a major bottleneck.
  2. The GIL will always be held when a Python object is destroyed. Some care may need to be taken to ensure that frame objects created in Python can be cleanly destroyed from a secondary thread.
  3. I3Module methods do not need to be re-entrant. There are enough I3Modules out there that abuse member variables that it's simpler to just instantiate a copy of the I3Module for each thread. This is one of the occasions when the factory pattern is actually useful, allowing I3Tray to secretly instantiate a module N times behind the scenes.
  4. Service methods do not need to be re-entrant. This is trickier than the I3Module case, because we broke the factory pattern when we allowed services to be added directly as pointers. gulliver is chock-full of explicitly stateful services and would need a major overhaul.
  5. PopFrame() yields all frames, in order. This is the biggie. Frame mixing and the I3PacketModule rely on I3Module::Process() seeing every single frame. Any attempt to distribute frames needs to drag along their dependencies as well, a problem that distribute seems to have solved in one way or another.

1-4 can be worked around by going to multiple processes, as Claudio does in distribute. 5 is universal.

In this presentation it appears that distribute solves all of these problems, but I would still like to understand whether all of its complexity is strictly necessary in this use case, where the server and clients are on the same node and could in principle communicate via shared memory. I also don't particularly like the interface, which appears to require the user to maintain two scripts. I would be happier if this could be wrapped up in e.g. an I3MultiTray that can have its parallel section defined in-line.
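
For what it's worth, a deliberately minimal toy (pure Python, no IceTray) of the driving-module model described above: one process feeds frames to N workers over queues on a single node. The dict frames and the per-frame "work" are stand-ins, and two things are left deliberately unsolved: results come back out of order, and no dependencies are dragged along with each frame, which is exactly the assumption-5 problem.

    # Toy frame distributor: dicts stand in for I3Frames, and the per-frame
    # work is a placeholder. Ordering and frame mixing are not handled here;
    # that is the hard part a real design has to address.
    import multiprocessing as mp

    def worker(inbox, outbox):
        for frame in iter(inbox.get, None):  # None is the shutdown sentinel
            frame["npulses"] = len(frame["pulses"])  # placeholder "module"
            outbox.put(frame)

    if __name__ == "__main__":
        inbox, outbox = mp.Queue(), mp.Queue()
        procs = [mp.Process(target=worker, args=(inbox, outbox))
                 for _ in range(4)]
        for p in procs:
            p.start()
        frames = [{"id": i, "pulses": [0.0] * i} for i in range(20)]
        for frame in frames:
            inbox.put(frame)
        for _ in procs:
            inbox.put(None)  # one sentinel per worker
        results = [outbox.get() for _ in frames]  # drain before joining
        for p in procs:
            p.join()
        # Frames come back in completion order, not input order.
        print([f["id"] for f in results])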

comment:6 Changed 7 years ago by David Schultz

Presentation slides about the end-of-glidein problem from the Spring 2017 collaboration meeting:
https://events.icecube.wisc.edu/getFile.py/access?contribId=25&sessionId=10&resId=0&materialId=slides&confId=83#page=21
(slides 21-22)

comment:7 Changed 7 years ago by David Schultz

During development on the new dataio.I3FrameSequence, I found a few places where I3Frame is thread-unsafe. Mixing is particularly problematic. We would probably need a mutex for any action that writes to the frame.
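
A toy of what that could look like (a plain dict standing in for I3Frame; nothing here is real IceTray API): every write path takes the same lock, since check-then-insert is not atomic.

    # Sketch of a mutex-guarded frame: all writers serialize on one lock.
    # The class and method names loosely mirror I3Frame but are hypothetical.
    import threading

    class LockedFrame:
        def __init__(self):
            self._store = {}
            self._lock = threading.Lock()

        def Put(self, key, value):
            # Check-then-insert must happen under the lock to be atomic.
            with self._lock:
                if key in self._store:
                    raise KeyError(f"frame already contains {key!r}")
                self._store[key] = value

        def Get(self, key):
            with self._lock:
                return self._store[key]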

As for (1), the GIL being a problem: I actually question how much it will be. For low-energy simulation it certainly will be. But for higher-energy simulation, where more time is spent analyzing a single frame, maybe it's not as much of a problem as we think?

If we do go the multiprocess route, I agree distribute is greatly overcomplicated for this use case. We should be able to use shared memory, and maybe just modify I3Tray itself to handle this. One thing to be careful of is where the fork happens: if we use multiprocessing in Python, the interpreter is happy; if we do it in C++, I'm not sure what happens to the Python part.

Changed 7 years ago by David Schultz

do multiprocess in python

comment:8 Changed 7 years ago by David Schultz

I played around with multiprocessing in Python. It seems the interpreter is happy with just a plain os.fork(), which is a direct C call. See the attached test code.

One fun thing is that the random number generator produces the same numbers in each process.
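
A minimal POSIX-only illustration of that, with the usual fix of reseeding in the child (os.urandom as an assumed entropy source). This matters for simulation: forked workers that don't reseed would generate identical event streams.

    # The child inherits the parent's RNG state at fork, so both draw the
    # same sequence until the child reseeds.
    import os
    import random

    random.seed(1337)  # parent seeds once before forking
    if os.fork() == 0:
        # Child: the first draws match the parent's exactly.
        print("child inherited:", [random.random() for _ in range(3)])
        random.seed(os.urandom(16))  # reseed to decorrelate
        print("child reseeded: ", [random.random() for _ in range(3)])
        os._exit(0)
    else:
        print("parent:         ", [random.random() for _ in range(3)])
        os.wait()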

comment:9 Changed 7 years ago by Nathan Whitehorn

Multithreading with boost::python is a nightmare, because the shared_ptr destructor does not know to acquire the GIL. As such, it is not safe from C++ to do anything that might cause a shared_ptr to go out of scope if there is any chance that the pointer in question has touched Python and so will call Py_DECREF() on destruction. This is a known bug in boost (I forget the number) and a tricky one to solve, since acquiring the GIL from the shared_ptr destructor can cause lock-order reversals and thus deadlocks. Since lots of things can cause shared_ptr reference counts to drop, this basically means any threading that is not very tightly contained in C++ code won't work.

I wrote some code years ago (i3mpi, in my sandbox) that does multi-process (potentially multi-machine) IceTray, and it works well.

comment:10 Changed 6 years ago by Alex Olivas

  • Owner changed from david.schultz to olivas
  • Status changed from new to assigned

comment:11 Changed 5 years ago by Alex Olivas

  • Resolution set to duplicate
  • Status changed from assigned to closed