Big Data Technologies Based on MapReduce and Hadoop
Big Data & 5Vs (3Vs + 2Vs)
- 3Vs
- Volume: the size of the data
- Velocity: the speed at which data is generated and processed
- Variety: the diversity of data (Structured, Semi-structured, Unstructured)
- + 2Vs (2010~)
- Veracity: the truthfulness of the data, including accuracy, uncertainty, noise, and errors
- Value: the value of the data
Hadoop
Hadoop: often expanded as "High-Availability Distributed Object-Oriented Platform" (a backronym; the project was originally named after a toy elephant)
- A Java-based open-source framework for distributed processing of large volumes of data
- Advantages: low system cost, and fast processing because the data is distributed and stored across multiple servers
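To make the idea of spreading data over several servers concrete, here is a minimal sketch in Python. It is not Hadoop code: the helper `place_blocks`, the node names, and the round-robin placement are assumptions made for illustration (HDFS itself uses a rack-aware placement policy). The 128 MB block size and replication factor of 3 mirror common HDFS defaults.

```python
from collections import defaultdict

def place_blocks(file_size, nodes, block_size=128 * 1024 * 1024, replication=3):
    """Split a file of file_size bytes into blocks and assign replicas to nodes.

    Hypothetical helper: simple round-robin placement, not the real HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    num_blocks = (file_size + block_size - 1) // block_size  # ceiling division
    placement = defaultdict(list)  # node -> ids of the blocks it stores
    for block_id in range(num_blocks):
        for r in range(replication):
            # Each replica of a block goes to a different node.
            node = nodes[(block_id + r) % len(nodes)]
            placement[node].append(block_id)
    return num_blocks, dict(placement)

num_blocks, placement = place_blocks(
    file_size=1024 * 1024 * 1024,  # a 1 GiB file
    nodes=["node1", "node2", "node3", "node4"],
)
print(num_blocks)
for node, blocks in sorted(placement.items()):
    print(node, blocks)
```

Because every node holds only a subset of the blocks, each one can work on its local blocks at the same time as the others, and the replicas allow the data to survive the loss of a node.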
Hadoop Ecosystem
Main Components
- HDFS: Hadoop Distributed File System - distributed data storage
- YARN: Hadoop's resource manager - resource management for the distributed cluster
- MapReduce2: by breaking up the JobTracker into a few different services, it avoids many of the scaling issues faced by MRv1
- Spark2: RDD-based computing framework (high-speed processing)
- Tez: a framework for YARN-based data processing applications in Hadoop
- Hive: SQL-like interface for querying data stored in Hadoop
- HBase: NoSQL DB on top of Hadoop
- Pig: script language to run MapReduce jobs on Hadoop
- Oozie: workflow scheduler to manage Hadoop jobs
- ZooKeeper: centralized coordination services (configuration, naming, synchronization) for a Hadoop cluster
- Storm: real-time stream analytics system
- Flume: collecting, aggregating, and moving unstructured data such as log data into Hadoop
- Kafka: Stream data processing platform. Distributed message broker
- Sqoop: RDBMS ↔ Hadoop; imports and exports structured data
- Solr: search indexing
- Zeppelin: web-based notebook for interactive data analysis
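
The MapReduce model that several of these components build on can be sketched in plain Python. This is only a single-process illustration of the map, shuffle, and reduce phases using word count; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this example and are not part of the Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    mapped = []
    for doc in documents:  # on a cluster, each split would go to a separate mapper
        mapped.extend(map_phase(doc))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

splits = ["Hadoop stores big data", "Hadoop processes big data in parallel"]
counts = word_count(splits)
print(counts)
```

On a real cluster the map and reduce calls would run on different nodes and the shuffle would move data over the network; here everything runs in one loop so that the data flow is easy to follow.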
References
- Fundamentals of Database Systems 7th Edition by Ramez Elmasri, Shamkant B. Navathe.