We tackle the problem of predicting the performance of MapReduce applications designing accurate progress indicators, which keep programmers informed on the percentage of completed computation time during the execution of a job. This is especially important in pay-as-you-go cloud environments, where slow jobs can be aborted in order to avoid excessive costs. Performance predictions can also serve as a building block for several profile-guided optimizations. By assuming that the running time depends linearly on the input size, state-of-the-art techniques can be seriously harmed by data skewness, load unbalancing, and straggling tasks. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linear hypothesis assumption in a fully online way (i.e., without resorting to profile data collected from previous executions). NearestFit exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Fine-grained profiles required by our theoretical progress model are approximated through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment over the Amazon EC2 platform on a variety of benchmarks shows that its accuracy is very good, even when competitors incur non-negligible errors and wide prediction fluctuations.

On data skewness, stragglers, and MapReduce progress indicators / Coppa, Emilio; Finocchi, Irene. - Proceedings of the Sixth ACM Symposium on Cloud Computing, (2015), pp. 139-152. (ACM Symposium on Cloud Computing (SoCC), Hawaii, USA, 27 August through 30 August 2015). [10.1145/2806777.2806843].

On data skewness, stragglers, and MapReduce progress indicators

FINOCCHI, Irene
2015

Abstract

We tackle the problem of predicting the performance of MapReduce applications designing accurate progress indicators, which keep programmers informed on the percentage of completed computation time during the execution of a job. This is especially important in pay-as-you-go cloud environments, where slow jobs can be aborted in order to avoid excessive costs. Performance predictions can also serve as a building block for several profile-guided optimizations. By assuming that the running time depends linearly on the input size, state-of-the-art techniques can be seriously harmed by data skewness, load unbalancing, and straggling tasks. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linear hypothesis assumption in a fully online way (i.e., without resorting to profile data collected from previous executions). NearestFit exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Fine-grained profiles required by our theoretical progress model are approximated through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment over the Amazon EC2 platform on a variety of benchmarks shows that its accuracy is very good, even when competitors incur non-negligible errors and wide prediction fluctuations.
978-1-4503-3651-2
MapReduce; data skewness; hadoop; performance prediction; performance profiling; progress indicators
File in questo prodotto:
File Dimensione Formato  
ACM-SoCC15.pdf

Solo gestori archivio

Tipologia: Versione dell'editore
Licenza: DRM non definito
Dimensione 300.91 kB
Formato Adobe PDF
300.91 kB Adobe PDF   Visualizza/Apri
Pubblicazioni consigliate

Caricamento pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11385/192551
Citazioni
  • Scopus 25
  • ???jsp.display-item.citation.isi??? 16
social impact