Query Kaggle data with Apache Spark and Zeppelin on an EMR cluster

Nikita sharma
Oct 29, 2018

This is a three-post blog series on querying Kaggle data on an Amazon EMR cluster. I will use Apache Zeppelin for data exploration, with Apache Spark executing the queries behind the scenes.

Most of the complexity will be hidden from us, because Amazon EMR takes care of it.

Here are the 3 posts for our task:

  1. Part 1: How to copy Kaggle data to Amazon S3
  2. Part 2: How to create EMR cluster with Apache Spark and Apache Zeppelin
  3. Part 3: Query Kaggle data via Apache Zeppelin
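
As a rough preview of how the three parts fit together, here is a minimal Python sketch. It is not necessarily how the individual posts do it (they may use the AWS console or CLI instead). The bucket name, object key, instance types, release label, and the `kaggle_data` view name are all placeholder assumptions; the AWS calls use boto3 and assume credentials and the default EMR roles already exist.

```python
# Illustrative sketch only: every name below is a placeholder assumption.
BUCKET = "my-kaggle-bucket"    # hypothetical S3 bucket
KEY = "datasets/sample.csv"    # hypothetical object key


def s3_uri(bucket: str, key: str) -> str:
    """Build the S3 URI that Spark on EMR will read from."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def emr_cluster_request(name: str, release_label: str = "emr-5.17.0") -> dict:
    """Keyword arguments for boto3's EMR run_job_flow call (Part 2).

    The release label and instance settings are assumptions; adjust them.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Spark"}, {"Name": "Zeppelin"}],
        "Instances": {
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def upload_to_s3(local_path: str, bucket: str, key: str) -> None:
    """Part 1: copy a downloaded Kaggle file to S3."""
    import boto3  # imported lazily so the helpers above work without it
    boto3.client("s3").upload_file(local_path, bucket, key)


def create_cluster(request: dict) -> str:
    """Part 2: start the cluster and return its id."""
    import boto3
    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


def query_csv(spark, uri: str):
    """Part 3: run inside a Zeppelin %pyspark paragraph, where the
    `spark` session is already provided by the interpreter."""
    df = spark.read.csv(uri, header=True, inferSchema=True)
    df.createOrReplaceTempView("kaggle_data")
    return spark.sql("SELECT COUNT(*) AS row_count FROM kaggle_data")
```

In a Zeppelin notebook, something like `query_csv(spark, s3_uri(BUCKET, KEY)).show()` would then print the row count of the uploaded file.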

I have provided examples and a complete walkthrough of the steps involved in the task. I hope the posts are helpful.

Cheers

Originally published at confusedcoders.com on October 29, 2018.
