Distributed Systems - MapReduce - Nguyen Quang Hung


Challenges?
– Applications face large-scale data (e.g. multi-terabyte datasets).
» High Energy Physics (HEP) and Astronomy.
» Earth climate weather forecasts.
» Gene databases.
» Index of all Internet web pages (in-house).
» etc.
– Easy programming for non-CS scientists (e.g. biologists)

  1. MapReduce Nguyen Quang Hung
  2. Outline  Challenges  Motivation  Ideas  Programming model  Implementation  Related works  References
  3. MapReduce  Motivation: large-scale data processing – Want to process huge datasets (>1 TB). – Want to parallelize across hundreds/thousands of CPUs. – Want to make this easy.
  4. MapReduce: programming model  Borrows from functional programming  Users implement an interface of two functions: map and reduce:  map (k1, v1) → list(k2, v2)  reduce (k2, list(v2)) → list(v2)
  5. reduce() function  After the map phase is over, all the intermediate values for a given output key are combined into a list  reduce() combines those intermediate values into one or more final values for that same output key  (in practice, usually only one final value per key)
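The two-function interface above is easiest to see with the canonical word-count example. The sketch below is plain Python for illustration only (the function names `map_fn`/`reduce_fn` are my own, not the Google MapReduce or Hadoop API):

```python
# map: (k1, v1) -> list((k2, v2))
# Here k1 is a document name and v1 its contents; each word
# is emitted as an intermediate (word, 1) pair.
def map_fn(doc_name, contents):
    return [(word, 1) for word in contents.split()]

# reduce: (k2, list(v2)) -> list(v2)
# All intermediate values for one key arrive as a list; word
# count collapses them to a single final value per key.
def reduce_fn(word, counts):
    return [sum(counts)]
```

For the input `"to be or not to be"`, `map_fn` emits six `(word, 1)` pairs; the framework then groups them by word, and `reduce_fn("to", [1, 1])` yields `[2]`.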
  6. MapReduce: execution flows
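The execution flow (map phase, shuffle/group-by-key, reduce phase) can be sketched as a single-machine driver. This is a hypothetical sketch of the dataflow only: the real systems split inputs into blocks and run map and reduce tasks in parallel on many workers, and `run_mapreduce` is an invented name, not an API from the paper.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply the user's map_fn to every (k1, v1) input record.
    intermediate = []
    for k1, v1 in inputs.items():
        intermediate.extend(map_fn(k1, v1))

    # Shuffle phase: group all intermediate values by key k2,
    # building the (k2, list(v2)) inputs that reduce expects.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)

    # Reduce phase: combine each key's value list into final values.
    return {k2: reduce_fn(k2, values) for k2, values in sorted(groups.items())}

# Usage: word count over one tiny "document".
result = run_mapreduce(
    {"a.txt": "to be or not to be"},
    lambda name, text: [(word, 1) for word in text.split()],
    lambda word, counts: [sum(counts)],
)
```

In a real deployment the shuffle step is the expensive part: intermediate pairs are partitioned by key across the network to the reduce workers.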
  7. Locality  Master program allocates tasks based on the location of data: tries to run map() tasks on the same machine as the physical file data, or at least on the same rack  map() task inputs are divided into 64 MB blocks: the same size as Google File System chunks
  8. Optimizations (1)  No reduce can start until map is complete: – A single slow disk controller can rate-limit the whole process  Master redundantly executes “slow-moving” map tasks; uses results of first copy to finish Why is it safe to redundantly execute map tasks? Wouldn’t this mess up the total computation?
  9. MapReduce: implementations  Google MapReduce: C/C++  Hadoop: Java  Phoenix: C/C++ multithread  Etc.
  10. Google MapReduce evaluation (2) Data transfer rates over time for different executions of the sort program, as J. Dean and S. Ghemawat show in their paper [1, page 9]
  11. Related works  Bulk Synchronous Programming [6]  MPI primitives [4]  Condor [5]  SAGA-MapReduce [8]  CGL-MapReduce [7]
  12. CGL-MapReduce  Components of CGL-MapReduce, extracted from [8]
  13. CGL-MapReduce: evaluation  HEP data analysis: execution time vs. the volume of data (fixed compute resources); Kmeans: total time against the number of data points (both axes in log scale), as J. Ekanayake, S. Pallickara, and G. Fox show in their paper [7]
  14. Hadoop vs. SAGA-MapReduce  As C. Miceli, M. Miceli, S. Jha, H. Kaiser, and A. Merzky show in [8]
  15. Conclusions  MapReduce has proven to be a useful abstraction  Simplifies large-scale computations on clusters of commodity PCs  The functional programming paradigm can be applied to large-scale applications  Focus on the problem, let the library deal with the messy details