Problem Scenario 50 : You have been given below code snippet (calculating an average score), with intermediate output.
type ScoreCollector = (Int, Double)
type PersonScores = (String, (Int, Double))
val initialScores = Array(("Fred", 88.0), ("Fred", 95.0), ("Fred", 91.0), ("Wilma", 93.0), ("Wilma", 95.0), ("Wilma", 98.0))
val wilmaAndFredScores = sc.parallelize(initialScores).cache()
val scores = wilmaAndFredScores.combineByKey(createScoreCombiner, scoreCombiner, scoreMerger)
val averagingFunction = (personScore: PersonScores) => {
val (name, (numberScores, totalScore)) = personScore
(name, totalScore / numberScores)
}
val averageScores = scores.collectAsMap().map(averagingFunction)
Expected output: averageScores: scala.collection.Map[String,Double] = Map(Fred -> 91.33333333333333, Wilma -> 95.33333333333333)
Define all three required functions (createScoreCombiner, scoreCombiner, scoreMerger) that are passed as input to the combineByKey method, and help produce the required results.
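One possible definition of the three functions, using the ScoreCollector type declared above (a sketch, not the only valid implementation):
val createScoreCombiner = (score: Double) => (1, score) // first score seen for a key: (count, total)
val scoreCombiner = (collector: ScoreCollector, score: Double) => (collector._1 + 1, collector._2 + score) // fold a new score into a partition-local collector
val scoreMerger = (collector1: ScoreCollector, collector2: ScoreCollector) => (collector1._1 + collector2._1, collector1._2 + collector2._2) // merge collectors across partitions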
Problem Scenario 95 : You have to run your Spark application on yarn with each executor's maximum heap size set to 512MB, the number of processor cores to allocate on each executor set to 1, and your main application requiring three values as input arguments: V1 V2 V3.
Please replace XXX, YYY, ZZZ
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-memory 512m XXX YYY lib/hadoopexam.jar ZZZ
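A hedged completion (the flags are standard spark-submit options; XXX and YYY become the executor memory and core settings, ZZZ the application arguments):
./bin/spark-submit --class com.hadoopexam.MyTask --master yarn-cluster --num-executors 3 --driver-memory 512m \
--executor-memory 512m --executor-cores 1 lib/hadoopexam.jar V1 V2 V3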
Problem Scenario 27 : You need to implement a near-real-time solution for collecting information submitted in files with the below content.
Data
echo " IBM,100,20160104 " > > /tmp/spooldir/bb/.bb.txt
echo " IBM,103,20160105 " > > /tmp/spooldir/bb/.bb.txt
mv /tmp/spooldir/bb/.bb.txt /tmp/spooldir/bb/bb.txt
After a few minutes
echo " IBM,100.2,20160104 " > > /tmp/spooldir/dr/.dr.txt
echo " IBM,103.1,20160105 " > > /tmp/spooldir/dr/.dr.txt
mv /tmp/spooldir/dr/.dr.txt /tmp/spooldir/dr/dr.txt
Requirements:
You have been given the below directory location (if not available then create it): /tmp/spooldir. You have a financial subscription for getting stock prices from Bloomberg as well as
Reuters, and using ftp you download new files every hour from their respective ftp sites into the directories /tmp/spooldir/bb and /tmp/spooldir/dr respectively.
As soon as a file is committed in either directory, it needs to be available in hdfs in the /tmp/flume/finance location, in a single directory.
Write a flume configuration file named flume7.conf and use it to load the data into hdfs with the following additional properties.
1. Spool /tmp/spooldir/bb and /tmp/spooldir/dr
2. File prefix in hdfs should be events
3. File suffix should be .log
4. If a file is not committed and still in use, it should have _ as a prefix.
5. Data should be written as text to hdfs
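A possible flume7.conf sketch (the agent name and the file channel are assumptions; the source/sink types and property names are standard Flume options):
agent1.sources = src1 src2
agent1.channels = ch1
agent1.sinks = sink1
# spool both download directories into the same channel
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /tmp/spooldir/bb
agent1.sources.src1.channels = ch1
agent1.sources.src2.type = spooldir
agent1.sources.src2.spoolDir = /tmp/spooldir/dr
agent1.sources.src2.channels = ch1
agent1.channels.ch1.type = file
# write everything into a single hdfs directory as plain text
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume/finance
agent1.sinks.sink1.hdfs.filePrefix = events
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.inUsePrefix = _
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1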
Problem Scenario 1:
You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. Connect to the MySQL DB and check the content of the tables.
2. Copy "retail_db.categories" table to hdfs, without specifying the directory name.
3. Copy "retail_db.categories" table to hdfs, in a directory named "categories_target".
4. Copy "retail_db.categories" table to hdfs, in a warehouse directory named "categories_warehouse".
Problem Scenario 39 : You have been given two files
spark16/file1.txt
1,9,5
2,7,4
3,8,3
spark16/file2.txt
1,g,h
2,i,j
3,k,l
Load these two files as Spark RDDs and join them to produce the below results
(1,((9,5),(g,h)))
(2,((7,4),(i,j)))
(3,((8,3),(k,l)))
Also write a code snippet which will sum the second element of file1's values in the joined results (5+4+3).
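A minimal sketch, assuming the two files sit under spark16/ on hdfs:
val file1 = sc.textFile("spark16/file1.txt").map { line =>
val a = line.split(",")
(a(0).toInt, (a(1).toInt, a(2).toInt))
}
val file2 = sc.textFile("spark16/file2.txt").map { line =>
val a = line.split(",")
(a(0).toInt, (a(1), a(2)))
}
val joined = file1.join(file2) // (1,((9,5),(g,h))) ...
val total = joined.map { case (_, ((_, second), _)) => second }.sum() // 5 + 4 + 3 = 12.0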
Problem Scenario 93 : You have to run your Spark application locally with 8 threads, i.e. locally on 8 cores. Replace XXX with the correct value.
spark-submit --class com.hadoopexam.MyTask XXX \ --deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10
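One possible value for XXX, with the rest of the command as given in the question (local[8] runs Spark locally with 8 worker threads):
spark-submit --class com.hadoopexam.MyTask --master local[8] \
--deploy-mode cluster $SPARK_HOME/lib/hadoopexam.jar 10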
Problem Scenario 5 : You have been given following mysql database details.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. List all the tables using a sqoop command from retail_db.
2. Write a simple sqoop eval command to check whether you have permission to read the database tables or not.
3. Import all the tables as avro files in /user/hive/warehouse/retail_cca174.db
4. Import the departments table as a text file in /user/cloudera/departments.
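A hedged sketch of the four commands (all flags are standard sqoop options):
sqoop list-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera
sqoop eval --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--query "select count(1) from departments"
sqoop import-all-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--as-avrodatafile --warehouse-dir /user/hive/warehouse/retail_cca174.db
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --as-textfile --target-dir /user/cloudera/departments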
Problem Scenario 32 : You have been given three files as below.
spark3/sparkdir1/file1.txt
spark3/sparkdir2/file2.txt
spark3/sparkdir3/file3.txt
Each file contains some text.
spark3/sparkdir1/file1.txt
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework
spark3/sparkdir2/file2.txt
The core of Apache Hadoop consists of a storage part known as Hadoop Distributed File System (HDFS) and a processing part called MapReduce. Hadoop splits files into large blocks and distributes them across nodes in a cluster. To process data, Hadoop transfers packaged code for nodes to process in parallel based on the data that needs to be processed.
spark3/sparkdir3/file3.txt
This approach takes advantage of data locality, where nodes manipulate the data they have access to, to allow the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
Now write Spark code in scala which will load all these three files from hdfs and do the word count, filtering out the following words. The result should be sorted by word count in reverse order.
Filter words ( " a " , " the " , " an " , " as " , " a " , " with " , " this " , " these " , " is " , " are " , " in " , " for " , " to " , " and " , " T h e " , " of " )
Also please make sure you load all three files as a Single RDD (All three files must be loaded using single API call).
You have also been given the following codec
import org.apache.hadoop.io.compress.GzipCodec
Please use the above codec to compress the file while saving it in hdfs.
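A minimal sketch (the output path spark3/result is an assumption; the comma-separated paths load all three files with a single textFile call):
import org.apache.hadoop.io.compress.GzipCodec
val content = sc.textFile("spark3/sparkdir1/file1.txt,spark3/sparkdir2/file2.txt,spark3/sparkdir3/file3.txt")
val filterWords = Set("a", "the", "an", "as", "with", "this", "these", "is", "are", "in", "for", "to", "and", "The", "of")
val counts = content.flatMap(_.split(" "))
.filter(word => !filterWords.contains(word))
.map(word => (word, 1))
.reduceByKey(_ + _)
.map { case (word, count) => (count, word) } // swap so we can sort by count
.sortByKey(false) // reverse (descending) order
counts.saveAsTextFile("spark3/result", classOf[GzipCodec])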
Problem Scenario 24 : You have been given below comma separated employee information.
Data Set:
name,salary,sex,age
alok,100000,male,29
jatin,105000,male,32
yogesh,134000,male,39
ragini,112000,female,35
jyotsana,129000,female,39
valmiki,123000,male,29
Requirements:
Use the netcat service on port 44444, and send the above data line by line using nc. Please do the following activities.
1. Create a flume conf file using the fastest channel, which writes data into the hive warehouse directory, in a table called flumemaleemployee (create the hive table as well for the given data).
2. While importing, make sure only male employee data is stored.
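A possible configuration sketch (the agent name, hive warehouse path and filter regex are assumptions; the memory channel is the fastest channel, and a regex_filter interceptor drops the female records):
a1.sources = s1
a1.channels = c1
a1.sinks = k1
a1.sources.s1.type = netcat
a1.sources.s1.bind = localhost
a1.sources.s1.port = 44444
# drop events containing "female" so only male employee data reaches hdfs
a1.sources.s1.interceptors = i1
a1.sources.s1.interceptors.i1.type = regex_filter
a1.sources.s1.interceptors.i1.regex = female
a1.sources.s1.interceptors.i1.excludeEvents = true
a1.sources.s1.channels = c1
a1.channels.c1.type = memory
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/hive/warehouse/flumemaleemployee
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1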
Problem Scenario 91 : You have been given data in json format as below.
{ " first_name " : " Ankit " , " last_name " : " Jain " }
{ " first_name " : " Amir " , " last_name " : " Khan " }
{ " first_name " : " Rajesh " , " last_name " : " Khanna " }
{ " first_name " : " Priynka " , " last_name " : " Chopra " }
{ " first_name " : " Kareena " , " last_name " : " Kapoor " }
{ " first_name " : " Lokesh " , " last_name " : " Yadav " }
Do the following activities.
1. Create employee.json file locally.
2. Load this file onto hdfs.
3. Register this data as a temp table in Spark using Python.
4. Write a select query and print this data.
5. Now save this selected data back in json format.
Problem Scenario 70 : Write a Spark application using Python which reads a file "Content.txt" (on hdfs) with the following content, does the word count and saves the results in a directory called "problem85" (on hdfs).
Content.txt
Hello this is ABCTECH.com
This is XYZTECH.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
Problem Scenario 84 : In continuation of the previous question, please accomplish the following activities.
1. Select all the products which have the product code as null.
2. Select all the products whose name starts with Pen, with the results ordered by price in descending order.
3. Select all the products whose name starts with Pen, with the results ordered by price in descending order and quantity in ascending order.
4. Select the top 2 products by price.
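A hedged query sketch, assuming the products data used in the related scenario (productID, productCode, name, quantity, price, supplierid) is already registered as a temp table named products:
sqlContext.sql("select * from products where productCode is null").show()
sqlContext.sql("select * from products where name like 'Pen%' order by price desc").show()
sqlContext.sql("select * from products where name like 'Pen%' order by price desc, quantity asc").show()
sqlContext.sql("select * from products order by price desc limit 2").show()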
Problem Scenario 19 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Now accomplish following activities.
1. Import the departments table from mysql to hdfs as a textfile in the departments_text directory.
2. Import the departments table from mysql to hdfs as a sequencefile in the departments_sequence directory.
3. Import the departments table from mysql to hdfs as an avro file in the departments_avro directory.
4. Import the departments table from mysql to hdfs as a parquet file in the departments_parquet directory.
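A hedged sketch of the four imports (all flags are standard sqoop options):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --as-textfile --target-dir departments_text
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --as-sequencefile --target-dir departments_sequence
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --as-avrodatafile --target-dir departments_avro
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --as-parquetfile --target-dir departments_parquet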
Problem Scenario 94 : You have to run your Spark application on yarn with each executor having 20GB of memory and the number of executors set to 50. Please replace XXX, YYY, ZZZ.
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class com.hadoopexam.MyTask \
XXX \
--deploy-mode cluster \ # can be client for client mode
YYY \
ZZZ \
/path/to/hadoopexam.jar \
1000
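One possible completion (the flags are standard spark-submit options; XXX becomes the master, YYY the executor memory and ZZZ the executor count; the HADOOP_CONF_DIR value is an assumption):
export HADOOP_CONF_DIR=/etc/hadoop/conf
./bin/spark-submit \
--class com.hadoopexam.MyTask \
--master yarn \
--deploy-mode cluster \
--executor-memory 20G \
--num-executors 50 \
/path/to/hadoopexam.jar \
1000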
Problem Scenario 74 : You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.orders
table=retail_db.order_items
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of orders table : (order_id, order_date, order_customer_id, order_status)
Columns of order_items table : (order_item_id, order_item_order_id, order_item_product_id, order_item_quantity, order_item_subtotal, order_item_product_price)
Please accomplish following activities.
1. Copy " retaildb.orders " and " retaildb.orderjtems " table to hdfs in respective directory p89_orders and p89_order_items .
2. Join these data using orderjd in Spark and Python
3. Now fetch selected columns from joined data Orderld, Order date and amount collected on this order.
4. Calculate total order placed for each date, and produced the output sorted by date.
Problem Scenario 88 : You have been given below three files
product.csv (Create this file in hdfs)
productID,productCode,name,quantity,price,supplierid
1001,PEN,Pen Red,5000,1.23,501
1002,PEN,Pen Blue,8000,1.25,501
1003,PEN,Pen Black,2000,1.25,501
1004,PEC,Pencil 2B,10000,0.48,502
1005,PEC,Pencil 2H,8000,0.49,502
1006,PEC,Pencil HB,0,9999.99,502
2001,PEC,Pencil 3B,500,0.52,501
2002,PEC,Pencil 4B,200,0.62,501
2003,PEC,Pencil 5B,100,0.73,501
2004,PEC,Pencil 6B,500,0.47,502
supplier.csv
supplierid,name,phone
501,ABC Traders,88881111
502,XYZ Company,88882222
503,QQ Corp,88883333
products_suppliers.csv
productID,supplierID
2001,501
2002,501
2003,501
2004,502
2001,503
Now accomplish all the queries given in the solution.
1. It is possible that the same product can be supplied by multiple suppliers. Now find each product and its price according to each supplier.
2. Find all the supplier names who are supplying 'Pencil 3B'.
3. Find all the products which are supplied by ABC Traders.
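A hedged query sketch, assuming the three CSVs (with their header lines removed) have been registered as temp tables named products, suppliers and products_suppliers:
// 1. Each product with its price according to each supplier
sqlContext.sql("select p.name, p.price, s.name as supplier from products_suppliers ps join products p on ps.productID = p.productID join suppliers s on ps.supplierID = s.supplierid").show()
// 2. Suppliers who supply 'Pencil 3B'
sqlContext.sql("select s.name from products_suppliers ps join products p on ps.productID = p.productID join suppliers s on ps.supplierID = s.supplierid where p.name = 'Pencil 3B'").show()
// 3. Products supplied by ABC Traders
sqlContext.sql("select p.name from products_suppliers ps join products p on ps.productID = p.productID join suppliers s on ps.supplierID = s.supplierid where s.name = 'ABC Traders'").show()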
Problem Scenario 12 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following.
1. Create a table in retail_db with the following definition.
CREATE table departments_new (department_id int(11), department_name varchar(45), created_date TIMESTAMP DEFAULT NOW());
2. Now insert records from the departments table into departments_new.
3. Now import data from departments_new table to hdfs.
4. Insert the following 5 records in the departments_new table.
Insert into departments_new values(110, "Civil", null);
Insert into departments_new values(111, "Mechanical", null);
Insert into departments_new values(112, "Automobile", null);
Insert into departments_new values(113, "Pharma", null);
Insert into departments_new values(114, "Social Engineering", null);
5. Now do the incremental import based on the created_date column.
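A hedged sketch of the two sqoop imports (the target directory is an assumption, and the --last-value shown must be replaced with the timestamp of the first import; the flags are standard sqoop options):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments_new --target-dir /user/cloudera/departments_new
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments_new --target-dir /user/cloudera/departments_new --append \
--incremental lastmodified --check-column created_date --last-value "2016-01-01 00:00:00"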
Problem Scenario 7 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following.
1. Import the departments table using your custom boundary query, which imports departments between 1 and 25.
2. Also make sure each table's data is partitioned into 2 files, e.g. part-00000, part-00001.
3. Also make sure you have imported only two columns from the table, namely department_id and department_name.
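A hedged sketch (the target directory and the exact boundary query are assumptions; --boundary-query, --columns and --num-mappers are standard sqoop options):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --target-dir /user/cloudera/departments_boundary \
--boundary-query "select 1, 25 from departments limit 1" \
--columns department_id,department_name --num-mappers 2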
Problem Scenario 9 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following.
1. Import the departments table into a directory.
2. Again import the departments table into the same directory (the directory already exists, hence it should not override but append the results).
3. Also make sure your result fields are terminated by '|' and lines terminated by '\n'.
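A hedged sketch of the two imports (the target directory is an assumption; --append and the terminator flags are standard sqoop options):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --target-dir /user/cloudera/departments
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --target-dir /user/cloudera/departments --append \
--fields-terminated-by '|' --lines-terminated-by '\n'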
Problem Scenario 46 : You have been given the below list in scala (name, sex, cost) for each work done.
List(("Deeapak", "male", 4000), ("Deepak", "male", 2000), ("Deepika", "female", 2000), ("Deepak", "female", 2000), ("Deepak", "male", 1000), ("Neeta", "female", 2000))
Now write a Spark program to load this list as an RDD and do the sum of cost for each combination of name and sex (as the key).
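A minimal sketch: key each record by (name, sex) and sum the cost.
val work = List(("Deeapak", "male", 4000), ("Deepak", "male", 2000), ("Deepika", "female", 2000), ("Deepak", "female", 2000), ("Deepak", "male", 1000), ("Neeta", "female", 2000))
val totals = sc.parallelize(work)
.map { case (name, sex, cost) => ((name, sex), cost) }
.reduceByKey(_ + _)
totals.collect().foreach(println)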
Problem Scenario 64 : You have been given below code snippet.
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee"), 3)
val d = c.keyBy(_.length)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, (Option[String], String))] = Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)), (3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)), (3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)), (3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))
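One operation that produces this output is a right outer join of b with d, which yields (key, (Option of b's value, d's value)) for every key in d:
val operation1 = b.rightOuterJoin(d)
operation1.collect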
Problem Scenario 6 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Compression Codec : org.apache.hadoop.io.compress.SnappyCodec
Please accomplish following.
1. Import the entire database such that it can be used as hive tables; it must be created in the default schema.
2. Also make sure each table's data is partitioned into 3 files, e.g. part-00000, part-00001, part-00002.
3. Store all the generated Java files in a directory called java_output for further evaluation.
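A hedged sketch of the command (all flags are standard sqoop options):
sqoop import-all-tables --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--hive-import --hive-overwrite --create-hive-table \
--compress --compression-codec org.apache.hadoop.io.compress.SnappyCodec \
--num-mappers 3 --outdir java_output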
Problem Scenario 90 : You have been given below two files
course.txt
id,course
1,Hadoop
2,Spark
3,HBase
fee.txt
id,fee
2,3900
3,4200
4,2900
Accomplish the following activities.
1. Select all the courses and their fees, whether the fee is listed or not.
2. Select all the available fees and their respective courses; if the course does not exist, still list the fee.
3. Select all the courses and their fees, whether the fee is listed or not; however, ignore records having the fee as null.
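A minimal sketch using RDD joins, assuming course.txt and fee.txt are on hdfs with the header lines shown above:
val course = sc.textFile("course.txt").filter(!_.startsWith("id")).map(_.split(",")).map(a => (a(0).toInt, a(1)))
val fee = sc.textFile("fee.txt").filter(!_.startsWith("id")).map(_.split(",")).map(a => (a(0).toInt, a(1).toInt))
course.leftOuterJoin(fee).collect // 1. all courses, fee may be None
course.rightOuterJoin(fee).collect // 2. all fees, course may be None
course.join(fee).collect // 3. only courses that have a fee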
Problem Scenario 4: You have been given MySQL DB with following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
Import the single table categories (subset of data) into a hive managed table, where category_id is between 1 and 22.
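A hedged sketch (the hive table name and mapper count are assumptions; --where and --hive-import are standard sqoop options):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table categories --where "category_id between 1 and 22" \
--hive-import --hive-table categories_subset --num-mappers 1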
Problem Scenario 15 : You have been given following mysql database details as well as other info.
user=retail_dba
password=cloudera
database=retail_db
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. In the mysql departments table please insert the following record: Insert into departments values(9999, '"Data Science"');
2. Now there is a downstream system which will process dumps of this file. However, the system is designed in such a way that it can process files only if the fields are enclosed in single quotes ('), the field separator is a dash (-), and lines are terminated by a colon (:).
3. If the data itself contains a double quote ("), it should be escaped by \.
4. Please import the departments table into a directory called departments_enclosedby, and the file should be processable by the downstream system.
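A hedged sketch of the import (the enclosing, escaping and terminator flags are standard sqoop options):
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username retail_dba --password cloudera \
--table departments --target-dir departments_enclosedby \
--enclosed-by \' --escaped-by \\ \
--fields-terminated-by '-' --lines-terminated-by ':'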