#1461 closed enhancement (fixed)
[clsim] cpu overheads and optimization
Reported by: | David Schultz | Owned by: | Kevin Meagher
---|---|---|---
Priority: | major | Milestone: |
Component: | combo simulation | Keywords: | clsim, ppc
Cc: | Alex Olivas, Claudio Kopper, Jakob van Santen | |
Description
On heavily loaded machines (say, NPX nodes), clsim slows down considerably (2x) compared to the same machine running only a single clsim job. This points to a CPU-side overhead: the host can't send work to, and receive results from, the GPU fast enough.
While ppc also slows down a little (1.2x), it doesn't take nearly the same performance penalty, and it's still faster overall.
So, anyone want to optimize things more?
Change History (25)
comment:1 Changed 8 years ago by Alex Olivas
- Owner set to claudio.kopper
- Status changed from new to assigned
comment:2 Changed 8 years ago by David Schultz
comment:3 Changed 7 years ago by Gonzalo Merino
Comments from Jakob:
photon propagation in clsim is divided into a few distinct phases:
1. MCTree manipulation: chopping muons into segments of constant energy.
2. Step generation: turning light-emitting particles into collections of Cherenkov track segments. Each segment carries the ID of the particle it came from.
3. Photon propagation: steps are fed to the GPU, and photons arriving at DOMs are collected in the output queue. Each photon also carries a particle ID.
4. Photon sorting: photons from the output queue are reassociated with the frames they belong to, and the IDs re-mapped to undo the "chopping" from step 1.
5. PE generation: a separate, single-threaded I3Module sub-samples the photons according to the quantum efficiency and angular acceptance curves and produces photoelectrons, which also carry a particle ID. These are looked up in the MCTree to ensure that they exist.
Of these, only step (3) runs multi-threaded on the GPU. I strongly suspect that a lot of the bottlenecks with faster and faster GPUs are associated with steps (4) and (5). The best way to test this hypothesis is probably to modify I3CLSimModule's harvester thread to simply continue once the output queue is copied over from the GPU, rather than actually doing any post-processing. If that shows a significant increase in utilization, it means that the bottleneck is in draining the GPU rather than feeding it.
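The proposed experiment could look roughly like this: a harvester that always drains the GPU output queue but, behind a flag, skips the photon-to-frame re-sorting. If utilization jumps with the flag set, the bottleneck is post-processing, not GPU feeding. All names here are illustrative, not clsim's actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical sketch: drain the GPU output queue unconditionally, but
// optionally skip the expensive re-sorting of photons into frames.
struct Photon { std::size_t frameId; };

std::size_t Harvest(std::queue<Photon>& gpuOutput,
                    std::vector<std::vector<Photon>>& frames,
                    bool skipPostProcessing) {
    std::size_t drained = 0;
    while (!gpuOutput.empty()) {
        Photon p = gpuOutput.front();
        gpuOutput.pop();
        ++drained;
        if (!skipPostProcessing)
            frames.at(p.frameId).push_back(p);  // the suspected bottleneck
    }
    return drained;
}
```

With `skipPostProcessing` set, the drain itself still runs, so any remaining GPU idle time must come from elsewhere.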
If that is in fact the case, there are a couple of ways forward. One would be to disable the ID tracking as I initially suggested, since no one who is not Marcel Zoll ever, ever needs it. That would eliminate a number of useless MCTree lookups, possibly speeding things up, but might also end up making the code significantly more complex. Another option would be to defer step 5 and move step 4 into an OpenCL kernel so that the down-sampling can be done in parallel. This opens up some additional possibilities for more clever optimizations, like coalescing PEs that are too close together to resolve, somewhat mitigating memory usage for high-energy events.
comment:4 Changed 7 years ago by Gonzalo Merino
Comment from David:
Note also that if the draining steps could be made multithreaded, we can ask for more cpus at most sites. That may be easier than putting more things in OpenCL.
comment:5 Changed 7 years ago by David Schultz
- Cc jvansanten added
- Milestone changed from IceSim 5.1 to IceSim 6
comment:6 Changed 7 years ago by David Schultz
Sadly, I'm seeing signs of this for dataset 20040. I'm not sure why, since that dataset should be higher energy corsika, but it's there even on the GTX 980. Spends about 1/3 to 1/2 of the time with an idle gpu and 120% cpu.
comment:7 Changed 7 years ago by Jakob van Santen
I had occasion to test some of the things suggested in comment:3 in the course of profiling GPUs on Nvidia's test cluster, and it turns out I was entirely wrong. Skipping photon->PE conversion and turning the photon-to-frame sorting step into a no-op have almost no effect on the GPU utilization; the bottlenecks are entirely in steps 1 and 2. For purposes of record-keeping, here are the times per photon from the first 1000 events of dataset 10068, file 99994, using simprod.segments.CLSim from trunk as of today:
Input gzipped
Speed: 0.796097 [37.34 ns/photon, 88.3% utilization] (ivb_k40.json)
Speed: 1.00436 [29.60 ns/photon, 83.8% utilization] (ivb_k80.json)
Speed: 1.75448 [16.95 ns/photon, 65.7% utilization] (ivb_m60.json)
Speed: 2.47966 [11.99 ns/photon, 49.9% utilization] (ivb_m40.json)
Speed: 3.20405 [9.28 ns/photon, 40.6% utilization] (hsw_p100.json)
Input zstd-compressed
Speed: 0.795444 [37.38 ns/photon, 87.1% utilization] (ivb_k40.json)
Speed: 1.00475 [29.59 ns/photon, 84.5% utilization] (ivb_k80.json)
Speed: 1.7603 [16.89 ns/photon, 65.9% utilization] (ivb_m60.json)
Speed: 2.50047 [11.89 ns/photon, 49.1% utilization] (ivb_m40.json)
Speed: 3.20322 [9.28 ns/photon, 41.5% utilization] (hsw_p100.json)
No hits, photons only
Speed: 0.795939 [37.35 ns/photon, 88.4% utilization] (ivb_k40.json)
Speed: 1.00453 [29.60 ns/photon, 84.6% utilization] (ivb_k80.json)
Speed: 1.76142 [16.88 ns/photon, 64.5% utilization] (ivb_m60.json)
Speed: 2.49923 [11.90 ns/photon, 49.8% utilization] (ivb_m40.json)
Speed: 3.20724 [9.27 ns/photon, 43.6% utilization] (hsw_p100.json)
Pre-sliced MCTree
Speed: 0.796312 [37.33 ns/photon, 78.1% utilization] (ivb_k40.json)
Speed: 1.00189 [29.67 ns/photon, 73.6% utilization] (ivb_k80.json)
Speed: 1.74554 [17.03 ns/photon, 53.1% utilization] (ivb_m60.json)
Speed: 2.49876 [11.90 ns/photon, 37.7% utilization] (ivb_m40.json)
Speed: 3.20559 [9.27 ns/photon, 31.3% utilization] (hsw_p100.json)
-> go back to slicing in-process
Short-cut AddPhotonsToFrames
Speed: 0.795747 [37.36 ns/photon, 92.6% utilization] (ivb_k40.json)
Speed: 1.00377 [29.62 ns/photon, 89.7% utilization] (ivb_k80.json)
Speed: 1.76144 [16.88 ns/photon, 73.0% utilization] (ivb_m60.json)
Speed: 2.48055 [11.99 ns/photon, 59.0% utilization] (ivb_m40.json)
Speed: 3.20622 [9.27 ns/photon, 49.8% utilization] (hsw_p100.json)
1 PeV -> 10 PeV buffer
Speed: 0.795259 [37.38 ns/photon, 92.5% utilization] (ivb_k40.json)
Speed: 1.00428 [29.60 ns/photon, 89.7% utilization] (ivb_k80.json)
Speed: 1.75182 [16.97 ns/photon, 73.5% utilization] (ivb_m60.json)
Speed: 2.50136 [11.89 ns/photon, 58.6% utilization] (ivb_m40.json)
Speed: 3.20379 [9.28 ns/photon, 50.6% utilization] (hsw_p100.json)
The filename in parentheses gives the CPU family and GPU model. None of the changes bring a major performance improvement; slicing the muons in a different process even slows things down due to serialization. In light of this, it indeed looks like multiprocessing is the way out, factoring the photon propagation core into a separate server process.
There's a fairly natural way to do this, too. I3CLSimStepToPhotonConverterOpenCL is already asynchronous, and with a few modifications could be moved to shared memory. It could be constructed at the top of the CLSim segment and passed to a collection of I3Trays running in forked processes. The first process to request work could initialize the converter (as it already does in the current implementation), and the master process could deal with tearing it down when done. A non-exhaustive list of other things that would need to be done:
- Each client (instance of I3CLSimModule) needs to get a unique ID from the server, and provide it when enqueuing steps. This is a small change, since I3CLSimModule already needs to provide an ID for the bunch of steps itself.
- I3CLSimStepToPhotonConverterOpenCL needs to maintain separate output queues for each client.
- Steps would need to be copied to shared memory before enqueuing.
- GetConversionResult() would need to return I3Photons in shared memory.
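The per-client bookkeeping in the list above could be sketched roughly as follows. All names are hypothetical (the real class is I3CLSimStepToPhotonConverterOpenCL), propagation is faked as one photon per step to keep the example runnable, and the shared-memory copies in items 3 and 4 are stubbed out:

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <queue>
#include <vector>

struct Step   { uint64_t particleId; };
struct Photon { uint64_t particleId; };

// Hypothetical sketch of a converter shared by several feeder processes.
class SharedConverter {
public:
    // Each client gets a unique ID from the server (list item 1).
    uint32_t RegisterClient() {
        std::lock_guard<std::mutex> lk(mutex_);
        uint32_t id = nextClient_++;
        queues_[id];  // create a separate output queue per client (item 2)
        return id;
    }

    // Clients provide their ID when enqueuing steps; in the real design
    // the steps would first be copied into shared memory (item 3).
    void EnqueueSteps(uint32_t client, const std::vector<Step>& steps) {
        std::lock_guard<std::mutex> lk(mutex_);
        for (const Step& s : steps)
            queues_.at(client).push(Photon{s.particleId});
    }

    // Results are drained from the caller's own queue; item 4 would return
    // them in shared memory rather than by value.
    std::vector<Photon> GetConversionResult(uint32_t client) {
        std::lock_guard<std::mutex> lk(mutex_);
        std::vector<Photon> out;
        std::queue<Photon>& q = queues_.at(client);
        while (!q.empty()) { out.push_back(q.front()); q.pop(); }
        return out;
    }

private:
    std::mutex mutex_;
    uint32_t nextClient_ = 0;
    std::map<uint32_t, std::queue<Photon>> queues_;
};
```

The point of the per-client queues is isolation: one feeder's results can never leak into another feeder's frames, even though both share one GPU context.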
Breaking the 1:N mapping between feeder processes and GPUs also opens up other possibilities for organizing simulation in a more efficient way. For example, muon propagation could be done in the feeder process to avoid storing the oodles of energy losses on disk.
comment:8 Changed 7 years ago by David Schultz
Just a thought, but would a clsim core module in a separate process from icetray solve some problems? You could then play fun games like connecting to it from multiple icetray instances.
comment:9 Changed 7 years ago by Jakob van Santen
Yes, the idea of the above proposal was to achieve 90% of the practical benefit of a full server-client model without a complete rewrite of CLSim, in particular the rather involved configuration steps. There would also be some serialization overhead if we didn't use shared memory. If there are people interested in putting in the work to write a CLSim server, though, I'm happy to defer.
comment:10 Changed 7 years ago by David Schultz
You could always use shared memory via memory mapped files to cut out serialization between processes. Then you just need to make sure to almost never change the data structure format that you're passing around.
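A minimal sketch of that idea, assuming POSIX: a fixed-layout struct placed in a shared anonymous mapping is written in place by a forked "feeder" process and read back by the parent with no serialization at all. As the comment warns, this only works if the structure layout almost never changes:

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical fixed-layout record exchanged through shared memory.
struct StepBuffer {
    int count;
    double x[8];
};

double shared_roundtrip() {
    // MAP_SHARED | MAP_ANONYMOUS: the pages stay shared across fork().
    void* mem = mmap(nullptr, sizeof(StepBuffer), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1.0;
    StepBuffer* buf = static_cast<StepBuffer*>(mem);
    buf->count = 0;

    pid_t pid = fork();
    if (pid == 0) {               // child: write "steps" in place
        buf->count = 3;
        for (int i = 0; i < 3; ++i) buf->x[i] = 1.5 * i;
        _exit(0);
    }
    waitpid(pid, nullptr, 0);     // parent: read them back, no decoding
    double last = (buf->count == 3) ? buf->x[2] : -1.0;
    munmap(buf, sizeof(StepBuffer));
    return last;
}
```

A file-backed mapping (memory-mapped file) works the same way and additionally lets unrelated processes attach by name.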
comment:11 Changed 7 years ago by David Schultz
One thing I like about making the core a separate process is, you can develop a new core following the same API, and swap it in once it is better (or if we have differently optimized code for a different accelerator).
comment:12 Changed 7 years ago by Jakob van Santen
You can already do that now. I3CLSimStepToPhotonConverterOpenCL is the only concrete instance of I3CLSimStepToPhotonConverter, whose interface is not all that complicated: http://code.icecube.wisc.edu/projects/icecube/browser/IceCube/projects/clsim/trunk/public/clsim/I3CLSimStepToPhotonConverter.h
The construction of the wavelength generators and biases from the medium properties is a thing that would have to be extracted from I3CLSimModuleHelper with a crow-bar, however.
comment:13 Changed 7 years ago by Jakob van Santen
This is actually much simpler, and much dumber, than any of us thought.
David noted in a private conversation that ppc manages to keep the GPUs busier than clsim. That means that whatever the issue is, it's likely inside I3CLSimModule itself, since the preceding steps are largely the same. Because I3CLSimModule implements Process() instead of DAQ() for compatibility with ancient IceSim, it does not track or report its CPU usage. Since I3Tray only prints out usage for modules that took 10 seconds or more of CPU time, this makes it look like it's cheap. It's not, not at all.
After moving the CPU usage tracking to I3Module::Do(), it turns out that in my test case I3CLSimModule ran for 105 seconds (including setup), whereas photon propagation on the GPU (Tesla M40) took only 62 seconds, for an average utilization of around 50%.
78 of those 105 seconds were spent in DistToClosestDOM(), which has a host of problems. First, it makes a copy of the geometry (~120 kB) for every particle. Using a reference instead takes the time down to 60 seconds. Second, it calculates sqrt(x*x+y*y+z*z) for every DOM. Finding the smallest d-squared and returning its sqrt instead reduces the time further to 42 seconds.
The underlying problem, however, is that the entire geometry is being used. The envelope of 300 m spheres around all DOMs is not all that different from the convex hull of the detector shifted outwards by 300 m, and luckily we already have code to calculate that. Using ExtrudedPolygon to clip the input particles instead of the full geometry takes 0.27 seconds, making the utilization a more respectable 92%, in line with NuGen simulation.
There's still a performance ceiling out there, but the above changes should get us back in business on current-generation hardware.
comment:14 Changed 7 years ago by Jakob van Santen
In 157923/IceCube:
comment:15 Changed 7 years ago by David Schultz
Awesome! Now you just have to tackle LE Corsika (since I think this was HE corsika you tested).
comment:16 Changed 7 years ago by Jakob van Santen
In 157950/IceCube:
comment:17 Changed 7 years ago by Jakob van Santen
In 157951/IceCube:
comment:18 Changed 7 years ago by Jakob van Santen
After r157950/IceCube and r157951/IceCube, low-energy CORSIKA (set 20007, first 500k events, 5x oversize, 20 PeV buffering) performs more acceptably:
Speed: 0.749887 [39.94 ns/photon, 98.0% utilization] (ivb_k40.json)
Speed: 1.00027 [29.94 ns/photon, 97.5% utilization] (ivb_k80.json)
Speed: 1.65275 [18.12 ns/photon, 93.9% utilization] (ivb_m60.json)
Speed: 2.32704 [12.87 ns/photon, 90.5% utilization] (ivb_m40.json)
Speed: 2.97476 [10.07 ns/photon, 90.9% utilization] (hsw_p100.json)
Speed: 3.81792 [7.85 ns/photon, 81.1% utilization] (hsw_p40.json)
There's still a significant drop-off with the Maxwell and Pascal models, but that is likely to be less significant with oversize 1 simulation.
comment:19 Changed 7 years ago by David Schultz
That's still better than the 10-20% usage seen now on the 1080.
comment:20 Changed 7 years ago by Jakob van Santen
Even that gets better. Low-energy CORSIKA, oversize 5:
Speed: 4.586112 [9.59 ns/photon, 86.4% utilization] (gtx1080.json)
This runs at 136% CPU on rad-6 (sandy bridge @ 2.6 GHz), so there are still opportunities for internal improvements.
Oversize 1 is better again, with only 46% CPU:
Speed: 5.895750 [7.46 ns/photon, 97.7% utilization] (gtx1080_os1.json)
comment:21 Changed 7 years ago by Jakob van Santen
After fixing the bottleneck in I3CLSimModule itself, there does seem to be some truth to comment:3 after all. Here are the CPU times on the main thread (i.e. without wakeups of the clsim worker threads sprinkled randomly throughout) for modules in the chain using the example in comment:20:
trashcan:usr 0.11
normalpes_makeCLSimHits_makePhotons_decomposeGeometry:usr 2.07
normalpes_makeCLSimHits_deletePhotons:usr 3.41
normalpes_cleanup_clsim_sliced_MCTree:usr 3.73
hitfilter:usr 8.82
reader:usr 13.18
normalpes_makeCLSimHits_makeHitsFromPhotons_clsim_make_hits:usr 14.16
normalpes_makeCLSimHits_makePhotons_chopMuons:usr 15.83
normalpes_sanitize_taus:usr 15.87
normalpes_removeSlices:usr 17.34
writer:usr 24.01
normalpes_makeCLSimHits_makePhotons_clsim:usr 45.44
These sum to 164 seconds, whereas the propagation kernel ran for a total of 172 seconds. 25% of the time in the main thread is spent copying and manipulating the I3MCTree (setting taus to Dark, slicing muons, de-slicing muons), a further 25% creating steps inside I3CLSimModule, and the remainder in I/O.
comment:22 Changed 7 years ago by Jakob van Santen
In 158056/IceCube:
comment:23 Changed 6 years ago by Alex Olivas
- Owner changed from claudio.kopper to kjmeagher
comment:24 Changed 6 years ago by Alex Olivas
- Resolution set to fixed
- Status changed from assigned to closed
comment:25 Changed 5 years ago by Alex Olivas
- Milestone IceSim 6 deleted
Poke this ticket. The gtx 1080 benchmarks for corsika (all types) show 100% cpu usage, resulting in no speedup over the gtx 980 (should be 1.8x).