Opened 8 years ago

Closed 6 years ago

Last modified 5 years ago

#1461 closed enhancement (fixed)

[clsim] CPU overheads and optimization

Reported by: David Schultz
Owned by: Kevin Meagher
Priority: major
Milestone:
Component: combo simulation
Keywords: clsim, ppc
Cc: Alex Olivas, Claudio Kopper, Jakob van Santen

Description

On heavily loaded machines (say, NPX nodes), clsim slows down quite a bit (2x) compared to the same machine running only a single clsim job. This seems to point to a CPU bottleneck: the host can't send work to and receive photons from the GPU fast enough.

While ppc also slows down a little (1.2x), it doesn't take nearly the same performance penalty. And it's still faster overall.

So, anyone want to optimize things more?

Change History (25)

comment:1 Changed 8 years ago by Alex Olivas

  • Owner set to claudio.kopper
  • Status changed from new to assigned

comment:2 Changed 8 years ago by David Schultz

Poke this ticket. The GTX 1080 benchmarks for CORSIKA (all types) show 100% CPU usage, resulting in no speedup over the GTX 980 (the expected speedup is about 1.8x).

comment:3 Changed 7 years ago by Gonzalo Merino

Comments from Jakob:

photon propagation in clsim is divided into a few distinct phases:

  1. MCTree manipulation: chopping muons into segments of constant energy
  2. Step generation: turning light-emitting particles into collections of Cherenkov track segments. Each segment carries the ID of the particle it came from.
  3. Photon propagation: steps are fed to the GPU, and photons on DOMs collected in the output queue. Each photon also carries a particle ID.
  4. Photon sorting: photons from the output queue are reassociated with the frames they belong to, and the IDs re-mapped to undo the "chopping" from step 1.
  5. PE generation: a separate, single-threaded I3Module sub-samples the photons according to the quantum efficiency and angular acceptance curves and produces photoelectrons, which also carry a particle ID; these IDs are looked up in the MCTree to verify that the corresponding particles exist.

Of these, only (3) runs multi-threaded on the GPU. I strongly suspect that a lot of the bottlenecks with faster and faster GPUs are associated with steps 4 and 5. The best way to test this hypothesis is probably to modify I3CLSimModule's harvester thread to simply continue once the output queue is copied over from the GPU, rather than actually doing any post-processing. If that shows a significant increase in utilization, it means that the bottleneck is in draining the GPU rather than feeding it.
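
To make the proposed test concrete, here is a toy Python sketch (nothing to do with the real clsim code; queue sizes and timings are made up) of how a no-op harvester exposes a drain-side bottleneck: if replacing the post-processing with a pass-through raises the simulated "GPU" busy fraction, the limit is in draining rather than feeding.

# Toy illustration (not clsim code): compare how busy a simulated "GPU" stays
# when the harvester does real post-processing versus a no-op.
import queue
import threading
import time

def run(postprocess):
    steps_q = queue.Queue(maxsize=4)
    photons_q = queue.Queue(maxsize=4)
    busy = idle = 0.0

    def feeder():
        for _ in range(50):
            steps_q.put(object())              # stand-in for a bunch of steps
        steps_q.put(None)

    def gpu():
        nonlocal busy, idle
        while True:
            t0 = time.monotonic()
            bunch = steps_q.get()              # stall: waiting for steps
            if bunch is None:
                idle += time.monotonic() - t0
                photons_q.put(None)
                return
            t1 = time.monotonic()
            time.sleep(0.01)                   # pretend kernel time
            t2 = time.monotonic()
            photons_q.put(object())            # stall: output queue is full
            t3 = time.monotonic()
            busy += t2 - t1
            idle += (t1 - t0) + (t3 - t2)

    def harvester():
        while photons_q.get() is not None:
            if postprocess:
                time.sleep(0.02)               # pretend photon sorting + PE generation

    threads = [threading.Thread(target=f) for f in (feeder, gpu, harvester)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return busy / (busy + idle)

print("with post-processing: %.0f%% utilization" % (100 * run(True)))
print("no-op harvester:      %.0f%% utilization" % (100 * run(False)))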

If that is in fact the case, there are a couple of ways forward. One would be to disable the ID tracking as I initially suggested, since no one who is not Marcel Zoll ever, ever needs it. That would eliminate a number of useless MCTree lookups, possibly speeding things up, but might also end up making the code significantly more complex. Another option would be to defer step 5 and move step 4 into an OpenCL kernel so that the down-sampling can be done in parallel. This opens up some additional possibilities for more clever optimizations, like coalescing PEs that are too close together to resolve, somewhat mitigating memory usage for high-energy events.
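
For illustration, the coalescing could be as simple as merging PEs on the same DOM that arrive closer together than the resolvable time window, summing their weights. A rough Python sketch (the 1 ns window and the (time, weight) record layout are made up):

def coalesce(pes, window_ns=1.0):
    # pes: list of (time_ns, weight) on a single DOM, in any order.
    merged = []
    for t, w in sorted(pes):
        if merged and t - merged[-1][0] < window_ns:
            merged[-1][1] += w                 # absorb into the previous PE
        else:
            merged.append([t, w])
    return [(t, w) for t, w in merged]

print(coalesce([(10.0, 1), (10.4, 1), (25.0, 1)]))   # -> [(10.0, 2), (25.0, 1)]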

comment:4 Changed 7 years ago by Gonzalo Merino

Comment from David:

Note also that if the draining steps could be made multithreaded, we could ask for more CPUs at most sites. That may be easier than putting more things in OpenCL.

comment:5 Changed 7 years ago by David Schultz

  • Cc jvansanten added
  • Milestone changed from IceSim 5.1 to IceSim 6

comment:6 Changed 7 years ago by David Schultz

Sadly, I'm seeing signs of this for dataset 20040. I'm not sure why, since that dataset should be higher-energy CORSIKA, but it's there even on the GTX 980: it spends about 1/3 to 1/2 of the time with an idle GPU and 120% CPU.

comment:7 Changed 7 years ago by Jakob van Santen

I had occasion to test some of the things suggested in comment:3 in the course of profiling GPUs on Nvidia's test cluster, and it turns out I was entirely wrong. Skipping photon->PE conversion and turning the photon-to-frame sorting step into a no-op have almost no effect on the GPU utilization; the bottlenecks are entirely in steps 1 and 2. For purposes of record-keeping, here are the times per photon from the first 1000 events of dataset 10068, file 99994, using simprod.segments.CLSim from trunk as of today:

Input gzipped
Speed: 0.796097 [37.34 ns/photon, 88.3% utilization] (ivb_k40.json)
Speed: 1.00436 [29.60 ns/photon, 83.8% utilization] (ivb_k80.json)
Speed: 1.75448 [16.95 ns/photon, 65.7% utilization] (ivb_m60.json)
Speed: 2.47966 [11.99 ns/photon, 49.9% utilization] (ivb_m40.json)
Speed: 3.20405 [9.28 ns/photon, 40.6% utilization] (hsw_p100.json)

Input zstd-compressed
Speed: 0.795444 [37.38 ns/photon, 87.1% utilization] (ivb_k40.json)
Speed: 1.00475 [29.59 ns/photon, 84.5% utilization] (ivb_k80.json)
Speed: 1.7603 [16.89 ns/photon, 65.9% utilization] (ivb_m60.json)
Speed: 2.50047 [11.89 ns/photon, 49.1% utilization] (ivb_m40.json)
Speed: 3.20322 [9.28 ns/photon, 41.5% utilization] (hsw_p100.json)

No hits, photons only
Speed: 0.795939 [37.35 ns/photon, 88.4% utilization] (ivb_k40.json)
Speed: 1.00453 [29.60 ns/photon, 84.6% utilization] (ivb_k80.json)
Speed: 1.76142 [16.88 ns/photon, 64.5% utilization] (ivb_m60.json)
Speed: 2.49923 [11.90 ns/photon, 49.8% utilization] (ivb_m40.json)
Speed: 3.20724 [9.27 ns/photon, 43.6% utilization] (hsw_p100.json)

Pre-sliced MCTree
Speed: 0.796312 [37.33 ns/photon, 78.1% utilization] (ivb_k40.json)
Speed: 1.00189 [29.67 ns/photon, 73.6% utilization] (ivb_k80.json)
Speed: 1.74554 [17.03 ns/photon, 53.1% utilization] (ivb_m60.json)
Speed: 2.49876 [11.90 ns/photon, 37.7% utilization] (ivb_m40.json)
Speed: 3.20559 [9.27 ns/photon, 31.3% utilization] (hsw_p100.json)
-> go back to slicing in-process

Short-cut AddPhotonsToFrames
Speed: 0.795747 [37.36 ns/photon, 92.6% utilization] (ivb_k40.json)
Speed: 1.00377 [29.62 ns/photon, 89.7% utilization] (ivb_k80.json)
Speed: 1.76144 [16.88 ns/photon, 73.0% utilization] (ivb_m60.json)
Speed: 2.48055 [11.99 ns/photon, 59.0% utilization] (ivb_m40.json)
Speed: 3.20622 [9.27 ns/photon, 49.8% utilization] (hsw_p100.json)

1 PeV -> 10 PeV buffer
Speed: 0.795259 [37.38 ns/photon, 92.5% utilization] (ivb_k40.json)
Speed: 1.00428 [29.60 ns/photon, 89.7% utilization] (ivb_k80.json)
Speed: 1.75182 [16.97 ns/photon, 73.5% utilization] (ivb_m60.json)
Speed: 2.50136 [11.89 ns/photon, 58.6% utilization] (ivb_m40.json)
Speed: 3.20379 [9.28 ns/photon, 50.6% utilization] (hsw_p100.json)

The filename in parentheses gives the CPU family and GPU model. None of the changes bring a major performance improvement; slicing the muons in a different process even slows things down due to serialization. In light of this, it indeed looks like multiprocessing is the way out, factoring the photon propagation core into a separate server process.

There's a fairly natural way to do this, too. I3CLSimStepToPhotonConverterOpenCL is already asynchronous, and with a few modifications could be moved to shared memory. It could be constructed at the top of the CLSim segment and passed to a collection of I3Trays running in forked processes. The first process to request work could initialize the converter (as it already does in the current implementation), and the master process could deal with tearing it down when done. A non-exhaustive list of other things that would need to be done:

  1. Each client (instance of I3CLSimModule) needs to get a unique ID from the server, and provide it when enqueuing steps. This is a small change, since I3CLSimModule already needs to provide an ID for the bunch of steps itself.
  2. I3CLSimStepToPhotonConverterOpenCL needs to maintain separate output queues for each client.
  3. Steps would need to be copied to shared memory before enqueuing.
  4. GetConversionResult() would need to return I3Photons in shared memory.

Breaking the 1:N mapping between feeder processes and GPUs also opens up other possibilities for organizing simulation in a more efficient way. For example, muon propagation could be done in the feeder process to avoid storing the oodles of energy losses on disk.
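
As a rough illustration of points 1 and 2 above, here is a toy Python sketch of the bookkeeping (the names and queue layout are invented for illustration, not the clsim API):

# Toy stand-in for the proposed server: one process owns the "GPU" and keeps a
# separate output queue per client, keyed by a client ID handed out up front.
import multiprocessing as mp

def server_loop(step_q, result_qs):
    # Drain the shared step queue and route results back by client ID.
    while True:
        item = step_q.get()
        if item is None:
            break
        client_id, bunch_id, steps = item
        photons = [s * 2.0 for s in steps]        # pretend propagation
        result_qs[client_id].put((bunch_id, photons))

if __name__ == "__main__":
    n_clients = 3
    step_q = mp.Queue()
    # Point 2: one output queue per client.
    result_qs = {cid: mp.Queue() for cid in range(n_clients)}
    server = mp.Process(target=server_loop, args=(step_q, result_qs))
    server.start()

    # Point 1: each feeder tags its step bunches with its client ID.
    # (Point 3 would place the steps themselves in shared memory instead of
    # pickling them through the queue.)
    for cid in range(n_clients):
        step_q.put((cid, 0, [1.0, 2.0, 3.0]))
    for cid in range(n_clients):
        print("client", cid, "got", result_qs[cid].get())

    step_q.put(None)
    server.join()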

comment:8 Changed 7 years ago by David Schultz

Just a thought, but would a clsim core module in a separate process from icetray solve some problems? You could then play fun games like connecting to it from multiple icetray instances.

comment:9 Changed 7 years ago by Jakob van Santen

Yes, the idea of the above proposal was to achieve 90% of the practical benefit of a full server-client model without a complete rewrite of CLSim, in particular the rather involved configuration steps. There would also be some serialization overhead if we didn't use shared memory. If there are people interested in putting in the work to write a CLSim server, though, I'm happy to defer.

comment:10 Changed 7 years ago by David Schultz

You could always use shared memory via memory-mapped files to cut out serialization between processes. Then you just need to make sure to almost never change the data structure format that you're passing around.
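
A minimal sketch of that idea in Python, with an invented fixed record layout (in practice the mapping would be file-backed or inherited across a fork so both processes see the same pages):

# Both sides agree on a fixed binary layout for a "step" record, so passing
# data is a byte copy rather than a pickle. The layout here is made up.
import mmap
import struct

STEP = struct.Struct("<3d 3d d I")   # position, direction, time, photon count
N_STEPS = 1024

# Writer side: lay the records out in the mapping.
buf = mmap.mmap(-1, STEP.size * N_STEPS)
STEP.pack_into(buf, 0, 0.0, 0.0, -400.0, 0.0, 0.0, 1.0, 12.5, 4096)

# Reader side: reinterpret the same bytes with no deserialization step.
x, y, z, dx, dy, dz, t, nphotons = STEP.unpack_from(buf, 0)
print(nphotons)

Any change to the record layout silently breaks the reader, which is exactly the "almost never change the data structure format" caveat above.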

comment:11 Changed 7 years ago by David Schultz

One thing I like about making the core a separate process is that you can develop a new core following the same API and swap it in once it is better (or if we have differently optimized code for a different accelerator).

comment:12 Changed 7 years ago by Jakob van Santen

You can already do that now. I3CLSimStepToPhotonConverterOpenCL is the only concrete instance of I3CLSimStepToPhotonConverter, whose interface is not all that complicated: http://code.icecube.wisc.edu/projects/icecube/browser/IceCube/projects/clsim/trunk/public/clsim/I3CLSimStepToPhotonConverter.h

The construction of the wavelength generators and biases from the medium properties, however, would have to be extracted from I3CLSimModuleHelper with a crowbar.
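
For illustration only, the plug-in idea in a simplified Python analogue; the real interface is the C++ header linked above, and these method names are just stand-ins modeled loosely on the operations discussed in this thread:

from abc import ABC, abstractmethod

class StepToPhotonConverter(ABC):
    @abstractmethod
    def enqueue_steps(self, steps, bunch_id):
        """Queue a bunch of steps for propagation."""

    @abstractmethod
    def get_conversion_result(self):
        """Block until a finished bunch is available and return it."""

class EchoConverter(StepToPhotonConverter):
    # Trivial stand-in backend; an OpenCL, CUDA, or remote-server backend
    # would implement the same two methods.
    def __init__(self):
        self._pending = []

    def enqueue_steps(self, steps, bunch_id):
        self._pending.append((bunch_id, list(steps)))

    def get_conversion_result(self):
        return self._pending.pop(0)

converter: StepToPhotonConverter = EchoConverter()
converter.enqueue_steps([1.0, 2.0], bunch_id=0)
print(converter.get_conversion_result())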

comment:13 Changed 7 years ago by Jakob van Santen

This is actually much simpler, and much dumber, than any of us thought.

David noted in a private conversation that ppc manages to keep the GPUs busier than clsim. That means that whatever the issue is, it's likely inside I3CLSimModule itself, since the preceding steps are largely the same. Because I3CLSimModule implements Process() instead of DAQ() for compatibility with ancient IceSim, it does not track or report its CPU usage. Since I3Tray only prints out usage for modules that took 10 seconds or more of CPU time, this makes it look like it's cheap. It's not, not at all.

After moving the CPU usage tracking to I3Module::Do(), it turns out that in my test case I3CLSimModule ran for 105 seconds (including setup), whereas photon propagation on the GPU (Tesla M40) took only 62 seconds, for an average utilization of around 50%. 78 of those 105 seconds were spent in DistToClosestDOM(), which has a host of problems. First, it makes a copy of the geometry (~120 kB) for every particle; using a reference instead takes the time down to 60 seconds. Second, it calculates sqrt(x*x+y*y+z*z) for every DOM; finding the smallest d-squared and taking a single sqrt of that reduces the time further to 42 seconds. The underlying problem, though, is that the entire geometry is being scanned at all. The envelope of 300 m spheres around all DOMs is not all that different from the convex hull of the detector shifted outwards by 300 m, and luckily we already have code to calculate that. Using ExtrudedPolygon to clip the input particles instead of the full geometry takes 0.27 seconds, making the utilization a more respectable 92%, in line with NuGen simulation.
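
For illustration, the squared-distance part of the fix looks roughly like this in Python (the geometry array is made up; the real code is C++):

# Find the minimum of d^2 over all DOM positions and take one sqrt at the end,
# instead of a sqrt per DOM. The geometry is passed by reference, never copied.
import numpy as np

rng = np.random.default_rng(0)
dom_positions = rng.uniform(-500.0, 500.0, size=(5160, 3))   # ~86 strings x 60 DOMs

def dist_to_closest_dom(pos, geometry):
    d2 = np.sum((geometry - pos) ** 2, axis=1)
    return np.sqrt(d2.min())

print(dist_to_closest_dom(np.array([0.0, 0.0, -350.0]), dom_positions))

The committed fix (r157923 below) goes one step further and replaces the per-DOM scan entirely with an ExtrudedPolygon convex-hull test, so the per-particle cost no longer scales with the number of DOMs.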

There's still a performance ceiling out there, but the above changes should get us back in business on current-generation hardware.

comment:14 Changed 7 years ago by Jakob van Santen

In 157923/IceCube:

Clip input particles to the convex hull of the DOMs rather than the
envelope of spheres around all DOMs. This was a major bottleneck in
keeping modern GPUs busy, and fixing it increases GPU utilization with
5x oversized CORSIKA on e.g. Tesla M40s by 60% (49%->78%). See #1461.

comment:15 Changed 7 years ago by David Schultz

Awesome! Now you just have to tackle LE CORSIKA (since I think this was HE CORSIKA you tested).

comment:16 Changed 7 years ago by Jakob van Santen

In 157950/IceCube:

Fix double buffering in TotalEnergyToProcess mode

Resetting maxNumParallelEvents_ and maxNumParallelEventsSecondFlush_
both to 1 after a flush effectively caused every other batch to send
only 2 events to the GPU while buffering further frames, leading to a
dramatic drop in efficiency. See #1461


Last edited 7 years ago by Jakob van Santen

comment:17 Changed 7 years ago by Jakob van Santen

In 157951/IceCube:

Pass TotalEnergyToProcess on to clsim

Without this change, any simprod sets that used TotalEnergyToProcess
instead of MaxParallelEvents got MaxParallelEvents set to 100
instead, where e.g. 1e5 would have been more appropriate for
low-energy CORSIKA. See #1461.

comment:18 Changed 7 years ago by Jakob van Santen

After r157950/IceCube and r157951/IceCube, low-energy CORSIKA (set 20007, first 500k events, 5x oversize, 20 PeV buffering) performs more acceptably:

Speed: 0.749887 [39.94 ns/photon, 98.0% utilization] (ivb_k40.json)
Speed: 1.00027 [29.94 ns/photon, 97.5% utilization] (ivb_k80.json)
Speed: 1.65275 [18.12 ns/photon, 93.9% utilization] (ivb_m60.json)
Speed: 2.32704 [12.87 ns/photon, 90.5% utilization] (ivb_m40.json)
Speed: 2.97476 [10.07 ns/photon, 90.9% utilization] (hsw_p100.json)
Speed: 3.81792 [7.85 ns/photon, 81.1% utilization] (hsw_p40.json)

There's still a significant drop-off with the Maxwell and Pascal models, but that is only an issue for oversized simulation. For oversize 1 (with 25 times more photons per input particle), utilization picks back up, e.g.:

Speed: 7.736255 [5.69 ns/photon, 97.3% utilization] (hsw_p40.json)
Last edited 7 years ago by Jakob van Santen

comment:19 Changed 7 years ago by David Schultz

That's still better than the 10-20% usage seen now on the 1080.

comment:20 Changed 7 years ago by Jakob van Santen

Even that gets better. Low-energy CORSIKA, oversize 5:

Speed: 4.586112 [9.59 ns/photon, 86.4% utilization] (gtx1080.json)

This runs at 136% CPU on rad-6 (Sandy Bridge @ 2.6 GHz), so there are still opportunities for internal improvements.

Oversize 1 is better again, with only 46% CPU:

Speed: 5.895750 [7.46 ns/photon, 97.7% utilization] (gtx1080_os1.json)

comment:21 Changed 7 years ago by Jakob van Santen

After fixing the bottleneck in I3CLSimModule itself, there does seem to be some truth to comment:3 after all. Here are the CPU times on the main thread (i.e. without wakeups of the clsim worker threads sprinkled randomly throughout) for modules in the chain using the example in comment:20:

                                                     trashcan:usr 0.11
        normalpes_makeCLSimHits_makePhotons_decomposeGeometry:usr 2.07
                        normalpes_makeCLSimHits_deletePhotons:usr 3.41
                        normalpes_cleanup_clsim_sliced_MCTree:usr 3.73
                                                    hitfilter:usr 8.82
                                                       reader:usr 13.18
  normalpes_makeCLSimHits_makeHitsFromPhotons_clsim_make_hits:usr 14.16
                normalpes_makeCLSimHits_makePhotons_chopMuons:usr 15.83
                                      normalpes_sanitize_taus:usr 15.87
                                       normalpes_removeSlices:usr 17.34
                                                       writer:usr 24.01
                    normalpes_makeCLSimHits_makePhotons_clsim:usr 45.44

These sum to 164 seconds, whereas the propagation kernel ran for a total of 172 seconds. 25% of the time in the main thread is spent copying and manipulating the I3MCTree (setting taus to Dark, slicing muons, de-slicing muons), a further 25% creating steps inside I3CLSimModule, and the remainder in I/O.

comment:22 Changed 7 years ago by Jakob van Santen

In 158056/IceCube:

Proof-of-concept multi-process I3CLSimStepToPhotonConverter facade

I3CLSimServer manages a collection of pre-initialized
I3CLSimStepToPhotonConverters, and I3CLSimClient presents an interface
similar to an individual I3CLSimStepToPhotonConverter. The client can be
used to send steps to a server running in a different process, possibly
on a different machine. This allows multiple I3Trays to share a set of
GPUs, potentially increasing throughput for workflows with expensive pre-
or post-processing steps. See #1461.

Hoisting the converter configuration out of I3CLSimModule is left as
an exercise.

comment:23 Changed 6 years ago by Alex Olivas

  • Owner changed from claudio.kopper to kjmeagher

comment:24 Changed 6 years ago by Alex Olivas

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:25 Changed 5 years ago by Alex Olivas

  • Milestone IceSim 6 deleted

