
AWS End-to-End Project Best Practices You Can Follow

Analysis of YouTube data

Problem Statement: Prepare the data in a desired resultant format from which an analyst can derive insights, design and run ads accordingly, and ultimately increase business revenue.

Solution Architecture: Implement an AWS data pipeline that meets the above requirement using services such as Amazon S3, AWS Glue, AWS Lambda, and Amazon Athena.

Creating an AWS account:

Best practices after creating your AWS account

Protect your root account by enabling MFA (for example, with Google Authenticator) and by rotating keys and passwords periodically. Also, avoid sharing credential files.

Creating an IAM (Identity and Access Management) group and user:

Search for IAM in the AWS console search tab.

The next steps are intuitive; use the default values throughout.

After creating the user, you will be able to download a credentials .csv file. This file contains the sign-in credentials and access keys for the IAM user.

Now you can log in to your AWS account using the IAM user option, providing the username and password from the .csv file. You are now ready to implement the project with your AWS account!

In order to start working on your account from the command line, you need to download and install the AWS CLI.

https://aws.amazon.com/cli/

Use the above link to download the CLI for your OS and install it.

After the installation, type the command aws in the command prompt; you should see the available options listed. Then configure your AWS account with the command aws configure, which will prompt for the access key ID and secret access key from the downloaded .csv file.
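If you prefer to check the setup from code, a quick way to confirm the configured credentials work is an STS "who am I" call with boto3. This is a minimal sketch; it assumes boto3 is installed (for example, with pip install boto3).

```python
# Minimal sketch: verify that the credentials set up by `aws configure`
# are picked up correctly by calling STS GetCallerIdentity.
import boto3

sts = boto3.client("sts")
identity = sts.get_caller_identity()

print("Account :", identity["Account"])
print("User ARN:", identity["Arn"])
```

If this prints your account ID and the IAM user ARN, both the CLI and the SDK are ready to use.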

Getting the data from Kaggle:

Use the following link to get the dataset — https://www.kaggle.com/datasets/datasnaek/youtube-new

Steps to get the data to AWS S3 –

  1. Create a landing zone S3 bucket in the AWS account.

The following details need to be filled in –

Bucket name and AWS Region; then enable server-side encryption and click Create bucket.

2. Copy the dataset downloaded from Kaggle to the S3 bucket.

Upload the data downloaded from Kaggle to S3 by navigating to the newly created S3 bucket and clicking the Upload button (this can also be done using the CLI).

Note: Make sure to create a folder for each country/region and upload the corresponding dataset into that country folder under a main folder named raw_statistics. All the JSON files have to be uploaded into a separate folder named raw_statistics_ref, as shown below.
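For reference, the bucket creation and upload can also be scripted with boto3. This is a minimal sketch under assumptions: the bucket name, region, and local file paths are placeholders, and only one country's files are shown.

```python
# Minimal sketch: create the landing-zone bucket and upload the Kaggle files
# into the folder layout described above. Bucket name, region, and local file
# paths are placeholders -- replace them with your own values.
import boto3

REGION = "us-east-1"                     # assumed region
BUCKET = "my-youtube-landing-zone"       # placeholder bucket name

s3 = boto3.client("s3", region_name=REGION)

# Create the bucket (us-east-1 needs no LocationConstraint) and enable
# default server-side encryption (SSE-S3).
s3.create_bucket(Bucket=BUCKET)
s3.put_bucket_encryption(
    Bucket=BUCKET,
    ServerSideEncryptionConfiguration={
        "Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]
    },
)

# One CSV per country folder under raw_statistics, and the JSON reference
# files under raw_statistics_ref (only the Canada files are shown here).
s3.upload_file("CAvideos.csv", BUCKET, "raw_statistics/ca/CAvideos.csv")
s3.upload_file("CA_category_id.json", BUCKET, "raw_statistics_ref/CA_category_id.json")
```

The equivalent CLI commands (aws s3 mb and aws s3 cp) work just as well.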

[Glue crawlers are used to derive the metadata of data coming from different sources such as Redshift, S3, etc., and to build a data catalog that aids the processing and analysis steps.]

Steps to create the AWS Glue crawler –

  1. Search for the Glue service in the AWS console and select AWS Glue. On the left panel, click the Crawlers option.

2. Add Crawler -> Crawler info (give it a name) -> Crawler source type (default values) -> Data store (provide the S3 path where the raw JSON files are stored)

3. Create an IAM role to allow Glue to access the S3 data: select Roles -> Create role -> choose Glue under “Use cases for other AWS services” -> select “AmazonS3FullAccess” under Permissions policies -> give the role a name -> Create role

Additionally, you will have to attach another permission: Add permissions -> AWSGlueServiceRole

Choose the above role in the steps while adding the crawler; a programmatic sketch of creating it is shown below.
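If you want to script this role instead of clicking through the console, a minimal boto3 sketch looks like the following. The role name is a placeholder; the two policy ARNs are the AWS managed policies mentioned above.

```python
# Minimal sketch: create a role that Glue can assume and attach the
# AmazonS3FullAccess and AWSGlueServiceRole managed policies to it.
import json
import boto3

iam = boto3.client("iam")
ROLE_NAME = "glue-youtube-crawler-role"   # placeholder role name

# Trust policy that lets the Glue service assume this role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(RoleName=ROLE_NAME, AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn="arn:aws:iam::aws:policy/AmazonS3FullAccess")
iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole")
```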

4. Creating a database: select the Add database option to create a new database with default values -> Next -> Finish

5. Run the crawler: select the newly added crawler and choose the Run crawler option.

Note: This process accesses the data in S3, extracts the metadata, and builds a catalog out of it, creating a new table under the database as shown below.
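The database, crawler, and crawler run from steps 2, 4, and 5 can also be scripted. A minimal sketch, assuming the placeholder bucket and role names used earlier:

```python
# Minimal sketch: create a Glue database, point a crawler at the raw JSON
# prefix in the landing bucket, and start it. All names are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_database(DatabaseInput={"Name": "youtube_raw"})

glue.create_crawler(
    Name="youtube-raw-ref-crawler",
    Role="glue-youtube-crawler-role",   # role created in step 3
    DatabaseName="youtube_raw",
    Targets={"S3Targets": [{"Path": "s3://my-youtube-landing-zone/raw_statistics_ref/"}]},
)

glue.start_crawler(Name="youtube-raw-ref-crawler")
```

Once the crawler finishes, the new table appears under the database in the Glue Data Catalog.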

6. Viewing and querying the data: after running the crawler, select the newly created table in the database. Under Actions, choose View data; this redirects you to the Athena window (Athena is an ad hoc query service in AWS).

To view data in Athena, you need to provide an output location for query results. Follow these steps:

Settings -> Manage -> S3 location (create a new bucket for the Athena query results and copy its S3 path here)

7. Preprocess the data: to run queries in Athena, you need to perform some pre-cleaning of the data; for example, the nested JSON format has to be converted into a format that Athena can handle. This preprocessing can be done using AWS Lambda.

8. Creating a Lambda function to preprocess the data: select the Lambda service in the AWS console and create a new function. Name the function and choose Python 3 under Runtime.

Additionally, you will have to create a role that grants the Lambda function access to the S3 data.

You can access the Python code using the following link and paste it into the Lambda code editor window.

https://github.com/manasa45/code.git

After that, you need to add a few environment variables: Configuration -> Environment variables -> add the variables required by the code, using the names of the resources you created.
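The linked repository is the source of truth for the function; for orientation, here is a hedged sketch of the general approach such a preprocessing Lambda typically takes. The environment-variable names (s3_cleansed_layer, glue_catalog_db_name, glue_catalog_table_name, write_data_operation) are assumptions used for illustration; match them to whatever names you configured. It also assumes the AWS SDK for pandas (awswrangler) is available to the function, for example via a Lambda layer.

```python
# Hedged sketch of a preprocessing Lambda: read the nested JSON reference file
# named in the S3 put event, flatten its "items" array, and write the result
# as Parquet registered in the Glue catalog. Names are placeholders.
import json
import os
import urllib.parse

import awswrangler as wr
import boto3
import pandas as pd

# Environment variables configured on the function (names are assumptions).
OUTPUT_PATH = os.environ["s3_cleansed_layer"]
GLUE_DB = os.environ["glue_catalog_db_name"]
GLUE_TABLE = os.environ["glue_catalog_table_name"]
WRITE_MODE = os.environ.get("write_data_operation", "append")

s3 = boto3.client("s3")


def lambda_handler(event, context):
    # Bucket and object key from the S3 put event that triggered the function.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the raw JSON and flatten the nested "items" array into columns.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    df = pd.json_normalize(json.loads(body)["items"])

    # Write the cleaned data and create/update the table in the Glue catalog.
    return wr.s3.to_parquet(
        df=df,
        path=OUTPUT_PATH,
        dataset=True,
        database=GLUE_DB,
        table=GLUE_TABLE,
        mode=WRITE_MODE,
    )
```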

9. Testing the function: AWS Lambda provides a test feature. Test -> Configure test event -> choose the S3 Put event template -> provide your S3 bucket name in place of the example bucket and the uploaded file's S3 key as the value of the key parameter in the test event.

On running the function, a new table with the cleaned data will be created at the destination provided in the environment variables. Finally, you are ready to query the cleaned output table in Athena by selecting the View data action on the final table.
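Queries can also be submitted programmatically instead of through the console. A minimal sketch, assuming placeholder names for the cleaned database and table and for the Athena results bucket created earlier:

```python
# Minimal sketch: run an ad hoc Athena query against the cleaned table.
# Database, table, and results-bucket names are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM cleaned_statistics_ref LIMIT 10;",
    QueryExecutionContext={"Database": "youtube_cleaned"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```

You can poll the returned query execution id with get_query_execution to check when the results land in the output bucket.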
