BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Chicago
X-LIC-LOCATION:America/Chicago
BEGIN:DAYLIGHT
TZOFFSETFROM:-0600
TZOFFSETTO:-0500
TZNAME:CDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0500
TZOFFSETTO:-0600
TZNAME:CST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20181221T160725Z
LOCATION:D175
DTSTART;TZID=America/Chicago:20181111T105100
DTEND;TZID=America/Chicago:20181111T111500
UID:submissions.supercomputing.org_SC18_sess155_ws_waccpd104@linklings.com
SUMMARY:OpenMP Target Offloading: Splitting GPU Kernels, Pipelining Commun
 ication and Computation, and Selecting Better Grid Geometries
DESCRIPTION:Workshop\nAccelerators, Heterogeneous Systems, Parallel Progra
 mming Languages, Libraries, and Models, Workshop Reg Pass\n\nOpenMP Target
  Offloading: Splitting GPU Kernels, Pipelining Communication and Computati
 on, and Selecting Better Grid Geometries\n\nChikin, Gobran, Amaral\n\nThis
  paper presents three ideas for improving the execution of high-level
  parallel code on GPUs. The first addresses programs that include
  multiple parallel blocks within a single region of GPU code. A proposed
  compiler transformation can split such a region into multiple regions
  so that one kernel is launched for each parallel region. Advantages
  include the opportunity to tailor the grid geometry of each kernel to
  the parallel region that it executes and the elimination of the
  overheads imposed by a
  code-generation scheme meant to handle multiple nested parallel regions.
  The second is a code transformation that sets up a pipeline of kernel
  execution and asynchronous data transfer. This transformation enables
  the overlap of communication and computation. The intricate technical
  details required for this transformation are described. The third idea
  is that the selection of a grid geometry for the execution of a
  parallel region must balance GPU occupancy against the potential
  saturation of memory throughput in the GPU. Adding this parameter to
  the geometry selection heuristic can often yield better performance at
  lower occupancy levels.
URL:https://sc18.supercomputing.org/presentation/?id=ws_waccpd104&sess=ses
 s155
END:VEVENT
END:VCALENDAR