AWS Big Data Links and Resources

26th March 2021 craigreadcloud

Big Data

How to use Amazon EMR to process data using the broad ecosystem of Hadoop tools like Hive and Hue and work with Amazon DynamoDB, Amazon Redshift, Amazon QuickSight, Amazon Athena and Amazon Kinesis.

AWS Marketplace for Big Data

Data Ingestion and Transfer

Amazon Kinesis Agent for Data Ingestion https://github.com/awslabs/amazon-kinesis-agent
Apache Flume https://flume.apache.org/ can be installed and run on Amazon EC2 instances.
S3DistCp copies data between Amazon S3 buckets or from HDFS to Amazon S3, including across AWS accounts http://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html
Apache Sqoop https://cwiki.apache.org/confluence/display/SQOOP/Home supports the transfer of data between Hadoop and structured data stores such as Amazon RDS.
AWS IoT can collect and handle large quantities of data coming from a variety of sources https://aws.amazon.com/iot/ and makes it easy to use AWS services like AWS Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning, and Amazon DynamoDB.
AWS DataSync https://aws.amazon.com/datasync/ is a data transfer service that makes it easy for you to automate moving data between on-premises storage and Amazon S3 or Amazon Elastic File System (Amazon EFS).
Amazon FSx for Lustre https://aws.amazon.com/fsx/lustre/ provides a high-performance file system optimized for fast processing of workloads such as machine learning, high performance computing (HPC), video processing, financial modeling, and electronic design automation (EDA).
AWS Glue DataBrew https://aws.amazon.com/glue/features/databrew/ is a visual data preparation tool for cleaning and normalizing data to prepare it for analytics and machine learning.
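
As a rough illustration of the S3DistCp entry above, the sketch below builds an Amazon EMR step definition that runs `s3-dist-cp` through `command-runner.jar`. The helper name `s3distcp_step`, the step name and the example paths are hypothetical; the dictionary follows the step shape accepted by the EMR `AddJobFlowSteps` API, and only the `--src` and `--dest` options are used. Treat it as a starting point under those assumptions, not a complete copy job.

```python
from typing import Dict, Any

# URI schemes this sketch accepts; S3DistCp itself may support more.
_ALLOWED_SCHEMES = ("s3://", "hdfs://")


def s3distcp_step(src: str, dest: str, name: str = "S3DistCp copy") -> Dict[str, Any]:
    """Build an EMR step dict that copies data from src to dest with s3-dist-cp.

    The helper name and default step name are illustrative assumptions.
    """
    for label, uri in (("src", src), ("dest", dest)):
        if not uri.startswith(_ALLOWED_SCHEMES):
            raise ValueError(f"{label} must start with s3:// or hdfs://, got {uri!r}")
    return {
        "Name": name,
        # Keep the cluster running if the copy fails (one possible choice).
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["s3-dist-cp", f"--src={src}", f"--dest={dest}"],
        },
    }
```

The resulting dictionary could be passed in the `Steps` list of a boto3 EMR client's `add_job_flow_steps` call for an existing cluster. A cross-account copy additionally needs bucket policies or roles that grant the cluster access, which this sketch does not cover.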

Big Data Streaming and Amazon Kinesis

Overview of Amazon Kinesis Data Firehose https://aws.amazon.com/kinesis/data-firehose/
Amazon Kinesis Data Analytics – SQL Functions https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/sql-reference-functions.html
Using the Schema Discovery Feature on Streaming Data https://docs.aws.amazon.com/kinesisanalytics/latest/dev/sch-dis.html
Apache Spark Streaming enables high-throughput, fault-tolerant, and scalable processing of live data streams. It divides the incoming data streams into batches before sending them to the Spark engine for processing. http://spark.apache.org/streaming/
Amazon Managed Streaming for Apache Kafka (Amazon MSK) https://aws.amazon.com/msk/ is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data.
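
The Spark Streaming entry above describes dividing a live stream into batches before processing. The plain-Python sketch below illustrates that micro-batching idea only; it does not use Spark's API. The names `micro_batches` and `word_count` and the `(timestamp, payload)` record shape are assumptions made for this example, and unlike a real streaming engine it skips intervals that contain no records.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# (timestamp in seconds, text payload); a hypothetical record shape.
Record = Tuple[float, str]


def micro_batches(records: Iterable[Record], interval: float) -> Iterator[List[str]]:
    """Group time-ordered records into batches of `interval` seconds each."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    current_window = None
    batch: List[str] = []
    for timestamp, payload in records:
        window = int(timestamp // interval)
        # A record from a later window closes the current batch.
        if current_window is not None and window != current_window:
            yield batch
            batch = []
        current_window = window
        batch.append(payload)
    if batch:
        yield batch  # flush the final, possibly partial, batch


def word_count(batch: List[str]) -> Dict[str, int]:
    """Stand-in for the per-batch job handed to the processing engine."""
    counts: Dict[str, int] = defaultdict(int)
    for line in batch:
        for word in line.split():
            counts[word] += 1
    return dict(counts)
```

Calling `word_count` on each batch yielded by `micro_batches(stream, 1.0)` mirrors the pattern in miniature: the stream is cut into one-second batches and each batch is processed as an ordinary, finite dataset.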

Data Lake Concepts and Building a Serverless Data Lake

What is a data lake? https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/
Building Data Lakes on AWS https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf AWS white paper.
AWS Lake Formation https://aws.amazon.com/lake-formation/ is a service that makes it easy to set up a secure data lake in days.
S3 Object Lifecycle Management http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
How to set up cross-origin resource sharing (CORS) http://docs.aws.amazon.com/apigateway/latest/developerguide/how-to-cors.html
EMR File System (EMRFS) consistent view https://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-consistent-view.html
Quick Start Data Lake with SnapLogic https://aws.amazon.com/quickstart/architecture/data-lake-with-snaplogic/ builds a data lake environment on AWS in about 15 minutes by deploying SnapLogic components and AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon Redshift.
AWS Lake Formation Workshop https://lakeformation.aworkshop.io/

Hadoop Frameworks (Hive, Presto, Pig etc.)

About Amazon EMR Releases https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-components.html Each release comprises different big-data applications, components, and features that you select to have Amazon EMR install and configure when you create a cluster.
Apache Hive https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive.html
Differences and Considerations for Hive on Amazon EMR https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-differences.html
Presto on Amazon EMR https://aws.amazon.com/emr/features/presto/
Apache Pig https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-pig.html
Piggy Bank https://cwiki.apache.org/confluence/display/PIG/PiggyBank is a place for Pig users to share their functions.
Apache Spark on Amazon EMR https://aws.amazon.com/emr/features/spark/
Apache MXNet https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-mxnet.html
How do I restart a service in Amazon EMR? https://aws.amazon.com/premiumsupport/knowledge-center/restart-service-emr/
Amazon EMR now supports a public EMR artifact repository for Maven builds https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-artifact-repository.html

Hadoop User Interfaces

View Web Interfaces Hosted on Amazon EMR Clusters https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
View On-Cluster Application User Interfaces https://docs.aws.amazon.com/emr/latest/ManagementGuide/on-cluster-app-UI.html
Launching the Hue Web Interface https://docs.aws.amazon.com/emr/latest/ReleaseGuide/accessing-hue.html
Apache Zeppelin https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-zeppelin.html
JupyterHub https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html allows you to host multiple instances of a single-user Jupyter notebook server.

Spark

Spark Or Hadoop: Which Is The Best Big Data Framework? https://www.datasciencecentral.com/profiles/blogs/spark-or-hadoop-which-is-the-best-big-data-framework Blog post from Data Science Central.
Apache Spark home page: https://spark.apache.org/
Spark RDD Programming Guide http://spark.apache.org/docs/latest/rdd-programming-guide.html
Spark Streaming Programming Guide https://spark.apache.org/docs/latest/streaming-programming-guide.html
Use Apache Spark with Amazon SageMaker https://docs.aws.amazon.com/sagemaker/latest/dg/apache-spark.html
Persistent Spark history server https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html

Management and Monitoring

Best Practices for Amazon EMR white paper: https://d0.awsstatic.com/whitepapers/aws-amazon-emr-best-practices.pdf
Monitor Metrics with CloudWatch https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
Ganglia on Amazon EMR https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-ganglia.html
Ganglia Monitoring System home page http://ganglia.info/
Using Automatic Scaling in Amazon EMR https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-automatic-scaling.html
Upload Data to an Amazon ES Domain for Indexing https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-gsg-upload-data.html
