Big Data
How to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Hive and Hue and work with Amazon DynamoDB, Amazon Redshift, Amazon QuickSight, Amazon Athena and Amazon Kinesis.
Data Ingestion and Transfer
- Amazon Kinesis Agent for Data Ingestion https://github.com/awslabs/amazon-kinesis-agent
- Apache Flume https://flume.apache.org/ can be installed and run on Amazon EC2 instances.
- You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3 across AWS accounts http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
- Apache Sqoop https://cwiki.apache.org/confluence/display/SQOOP/Home supports the transfer of data between Hadoop and structured data stores such as Amazon RDS.
- AWS IoT can collect and handle large quantities of data coming from a variety of sources https://aws.amazon.com/iot/ and makes it easy to use AWS services like AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB.
- AWS DataSync https://aws.amazon.com/datasync/ is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS).
- Amazon FSx for Lustre https://aws.amazon.com/fsx/lustre/ provides a high-performance file system optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
- AWS Glue DataBrew https://aws.amazon.com/glue/features/databrew/ is a visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
Big Data Streaming and Amazon Kinesis
- Overview of Amazon Kinesis Data Firehose https://aws.amazon.com/kinesis/data-firehose/
- Amazon Kinesis Data Analytics – SQL Functions https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-functions.html
- Using the Schema Discovery Feature on Streaming Data https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sch-dis.html
- Apache Spark Streaming enables high-throughput, fault-tolerant, and scalable processing of live data streams. It divides the incoming data streams into batches before sending them to the Spark engine for processing. http://spark.apache.org/streaming/
- Amazon Managed Streaming for Apache Kafka (Amazon MSK) https://aws.amazon.com/msk/ is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data.
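The micro-batch idea mentioned in the Spark Streaming entry above can be sketched in plain Python. This is only an illustration of the concept, not Spark's API: the names `micro_batches` and `word_counts` are hypothetical, and the sketch groups records by count, whereas Spark Streaming groups them by a time-based batch interval.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size records.

    Assumption: batches are cut by record count to keep the sketch
    deterministic; Spark Streaming cuts them by a time interval.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


def word_counts(batch: List[str]) -> Dict[str, int]:
    """Stand-in for the per-batch job an engine would run."""
    counts: Dict[str, int] = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical input records standing in for a live stream.
    events = ["a b", "b c", "c d", "d e", "e"]
    for index, batch in enumerate(micro_batches(events, batch_size=2)):
        print(index, word_counts(batch))
```

Each batch is processed independently, which is what lets a batch engine be reused for streaming data. For real workloads, see the Spark Streaming Programming Guide listed in the Spark section.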
Data Lake Concepts and Building a Serverless Data Lake
- What is a data lake? https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
- Building Data Lakes on AWS https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf AWS white paper.
- AWS Lake Formation https://aws.amazon.com/lake-formation/ is a service that makes it easy to set up a secure data lake in days.
- S3 Object Lifecycle Management http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
- How to set up cross-origin resource sharing (CORS) http://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-cors.html
- EMR File System (EMRFS) consistent view https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-consistent-view.html
- Quick Start Data Lake with SnapLogic https://aws.amazon.com/quickstart/architecture/data-lake-with-snaplogic/ builds a data lake environment on AWS in about 15 minutes by deploying SnapLogic components and AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift.
- AWS Lake Formation Workshop https://lakeformation.aworkshop.io/
Hadoop Frameworks (Hive, Presto, Pig etc.)
- About Amazon EMR Releases https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html Each release comprises different big-data applications, components, and features that you select to have Amazon EMR install and configure when you create a cluster.
- Apache Hive https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html
- Differences and Considerations for Hive on Amazon EMR https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
- Presto on Amazon EMR https://aws.amazon.com/emr/features/presto/
- Apache Pig https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-pig.html
- Piggy Bank https://cwiki.apache.org/confluence/display/PIG/PiggyBank is a place for Pig users to share their functions.
- Apache Spark on Amazon EMR https://aws.amazon.com/emr/features/spark/
- Apache MXNet https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-mxnet.html
- How do I restart a service in Amazon EMR? https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/
- Amazon EMR now supports a public EMR artifact repository for Maven builds https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-artifact-repository.html
Hadoop User Interfaces
- View Web Interfaces Hosted on Amazon EMR Clusters https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
- View On-Cluster Application User Interfaces https://docs.aws.amazon.com/emr/latest/ManagementGuide/on-cluster-app-UI.html
- Launching the Hue Web Interface https://docs.aws.amazon.com/emr/latest/ReleaseGuide/accessing-hue.html
- Apache Zeppelin https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zeppelin.html
- JupyterHub https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html allows you to host multiple instances of a single-user Jupyter notebook server.
Spark
- Spark Or Hadoop: Which Is The Best Big Data Framework? https://www.datasciencecentral.com/profiles/blogs/spark-or-hadoop-which-is-the-best-big-data-framework Blog post from Data Science Central.
- Apache Spark home page: https://spark.apache.org/
- Spark RDD Programming Guide http://spark.apache.org/docs/latest/rdd-programming-guide.html
- Spark Streaming Programming Guide https://spark.apache.org/docs/latest/streaming-programming-guide.html
- Use Apache Spark with Amazon SageMaker https://docs.aws.amazon.com/sagemaker/latest/dg/apache-spark.html
- Persistent Spark history server https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html
Management and Monitoring
- Best Practices for Amazon EMR white paper: https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf
- Monitor Metrics with CloudWatch https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
- Ganglia on Amazon EMR https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-ganglia.html and the Ganglia Monitoring System project site http://ganglia.info/
- Using Automatic Scaling in Amazon EMR https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html
- Upload Data to an Amazon ES Domain for Indexing https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-gsg-upload-data.html