AWS Big Data Landscape – a complicated arrangement

There are many ways to construct a big data flow on AWS, depending on time, skills, budget, objectives and operational support.

The key architectural principle is simplicity; a second is cost control.

Types of Data:

  • Transactional – usually SQL, sometimes NoSQL
  • Unstructured (web, media, cloud) – S3, HDFS on Hadoop, EMR (Hadoop), Spark, Hive
  • Log files – Kinesis, Kafka (see the ingestion sketch after this list)
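
A minimal, hedged sketch of log ingestion into Kinesis with boto3; the stream name, region and record fields below are assumptions for illustration, not part of the notes above:

    import json
    import boto3

    # Assumed stream name; create it beforehand, e.g. with `aws kinesis create-stream`
    kinesis = boto3.client("kinesis", region_name="eu-west-1")

    log_event = {"path": "/checkout", "status": 200, "latency_ms": 42}

    # PartitionKey determines which shard receives the record
    kinesis.put_record(
        StreamName="web-logs",
        Data=json.dumps(log_event).encode("utf-8"),
        PartitionKey=log_event["path"],
    )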

Methods of using Big Data:

  • Batch
  • Streaming/real time, e.g. clickstreams, IoT
  • Machine Learning/Predictive

Delivery:

  • Virtual – e.g. Kafka running on EC2
  • Managed – EMR Hadoop cluster with Spark, Kafka, Kinesis
  • Serverless – Lambda, Glue, Athena (SQL analytics); see the Athena sketch after this list
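
To illustrate the serverless option, a sketch of running an Athena query from boto3; the database, table and S3 result location are assumptions:

    import time
    import boto3

    athena = boto3.client("athena", region_name="eu-west-1")

    # Assumed Glue database/table and result bucket
    query = athena.start_query_execution(
        QueryString="SELECT status, count(*) FROM web_logs GROUP BY status",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    query_id = query["QueryExecutionId"]

    # Poll until the query finishes, then fetch results
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        for row in rows:
            print([col.get("VarCharValue") for col in row["Data"]])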

Architecture Principles

  • Decouple the process flow: data – ingest – store – ETL/process – analyse
  • Right tool for the right job
  • Leverage managed and serverless services
  • Use event-journal design patterns – immutable data lakes, materialised views
  • Cost control
  • Connect Machine Learning models to application services
  • Decouple compute from storage (e.g. HBase with S3)
  • Tier data between hot and cold needs – HDFS (frequent) – S3 – Glacier (cold); see the lifecycle sketch after this list
  • Database choice depends on schemas, data types, access and retrieval patterns, OLTP vs. OLAP, transactional vs. analytical storage
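
For hot/cold tiering, a minimal sketch of an S3 lifecycle rule that moves objects down to Glacier over time; the bucket name, prefix and day thresholds are assumptions to be tuned to real access patterns:

    import boto3

    s3 = boto3.client("s3")

    # Assumed bucket and prefix; adjust the day thresholds to actual access needs
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-data-lake",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-logs-to-cold-storage",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "logs/"},
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
                        {"Days": 90, "StorageClass": "GLACIER"},      # cold tier
                    ],
                }
            ]
        },
    )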

ETL – produces a normalised view of different data sets and schemas, e.g. Glue analyses CSV, JSON and Parquet (see the sketch below).
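
A sketch of that normalisation step in plain PySpark, as it might run on EMR or inside a Glue job; the S3 paths and column names are assumptions for illustration:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

    # Read raw CSV from the data lake, inferring the schema (assumed path)
    raw = (
        spark.read.option("header", "true")
        .option("inferSchema", "true")
        .csv("s3://my-data-lake/raw/orders/")
    )

    # Light normalisation: consistent column name and a proper date type
    orders = (
        raw.withColumnRenamed("Order Date", "order_date")
           .withColumn("order_date", F.to_date("order_date"))
    )

    # Write columnar Parquet, partitioned for efficient Athena/Hive queries
    (
        orders.write.mode("overwrite")
              .partitionBy("order_date")
              .parquet("s3://my-data-lake/curated/orders/")
    )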
