*************************DEADLINE EXTENDED*********************************
Note new deadline: December 24, 2004
***************************************************************************
First Workshop on System Management Tools for Large-Scale Parallel Systems
(Held in Conjunction with IPDPS 2005, Denver, Colorado, April 8, 2005)
***************************************************************************
We are entering a new era in computing in which the size and complexity of scientific and engineering simulations are growing at an unprecedented rate. To satisfy the needs of these applications, "extreme-scale" parallel systems are being designed and deployed. Although progress in hardware and architecture design has made it possible to build machines with tens of thousands of processors, the development of software tools for such systems still lags behind. To name just a few examples, new operating-system-level modifications are needed to use the massive computing and networking power efficiently. In addition, sophisticated fault-tolerance tools are greatly needed to minimize performance loss when faults occur. The scale of these systems also demands advanced power management tools. For both commodity supercomputing clusters and custom-designed supercomputers, system maintenance, reliability, fault isolation, prevention and control pose huge challenges. There is a great need for research that addresses not only the scale of these machines but also its implications for system performance and utilization.
This workshop is intended to bring together researchers and practitioners to begin identifying the new challenges imposed by this trend and to investigate efficient software tools that improve the performance, reliability and operation of large-scale parallel systems.
Topics of interest include, but are not limited to:
· Scalable operating system design
· Scalable resource management tools
· Efficient failure diagnosis, failure prediction and failure recovery tools
· Scalable job scheduling tools
· Scalable parallel check-pointing tools
· Self-healing and self-management tools
· Power management for large scale machines
· System bring-up and control tools
· Ease of system maintenance, services including system management experiences
· Performance, system utilization implications
Workshop Organizers:
Fabrizio Petrini, Los Alamos National Laboratory, New Mexico (fabrizio@lanl.gov)
Ramendra K. Sahoo, IBM TJ Watson Research Center, Yorktown Heights, NY (rsahoo@us.ibm.com)
Yanyong Zhang, Dept. of Electrical and Computer Engineering, Rutgers University (yyzhang@ece.rutgers.edu)
Program Committee:
Ricardo Bianchini (Rutgers Univ., CS) ricardob@cs.rutgers.edu
Henri Casanova (UCSD) casanova@cs.ucsd.edu
Dror Feitelson (Hebrew University) feiteldg@vuse.vanderbilt.edu
Rahul Garg (IBM India) grahul@in.ibm.com
Jose E. Moreira (IBM Research) jmoreira@us.ibm.com
Manish Parashar (Rutgers Univ., ECE) parashar@caip.rutgers.edu
Kyung Ryu (IBM Research) kryu@us.ibm.com
Anand Sivasubramaniam (Penn. State Univ.) anand@cse.psu.edu
Rajeev Thakur (Argonne National Lab.) thakur@mcs.anl.gov
Jeff Vetter (Oak Ridge National Lab.) vetterjs@ornl.gov
Andy Yoo (Lawrence Livermore National Lab.) ayoo@llnl.gov
Xiaodong Zhang (College of William & Mary) zhang@cs.wm.edu
Important Dates:
Submission Date: 12/24/2004 (Extended!)
Notification Date: 1/07/2005
Camera-Ready Date: 1/21/2005
CONTACT INFO:
------------
web  : http://www.ece.rutgers.edu/~yyzhang/ipdps-ws
email: yyzhang@ece.rutgers.edu
Informal proceedings will be handed out at the workshop and published along with other IPDPS 05 publications. For full paper submissions, we are also planning to publish formal proceedings as one issue of Springer-Verlag's Lecture Notes in Computer Science (LNCS) series.
Ramendra K. Sahoo
IBM TJ Watson Research Center
1101 Kitchawan Road, Yorktown Heights, NY 10598
phone: 914-945-2936, T/L 8-862-2936
email: rsahoo@us.ibm.com