Archive for the 'data integration' Category

Working with the Kafka Consumer and Producer Steps in Kettle

The other day a partner asked how to work with the Kafka Marketplace plugins for Kettle contributed by Ruckus Wireless.  I decided to kick the tires and get the steps up and running.

I started off by downloading Kafka, which you can find here:

http://kafka.apache.org/

I downloaded version 0.9 of Kafka.  I happened to have a Cloudera Quickstart VM running (CDH 5.4), so I figured I’d run through the quick start of Kafka from that VM.  I had no trouble starting up Kafka and sending and receiving basic messages via the console consumer and producer.  Getting started with Kafka is very simple!  Now on to Kettle.
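If you want to follow along, the quick start boils down to roughly these commands, run from the Kafka download directory on the VM (on a Cloudera Quickstart VM, ZooKeeper may already be running on port 2181, in which case you can skip the first command):

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning

The “test” topic created here is the one I reuse in Kettle below.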

Within Spoon (version 6.0), I installed the Kafka Marketplace plugins.  After restarting, I created a very simple transformation: an “Apache Kafka Consumer” step on the canvas followed by a “Write to Log” step.  Can’t get much simpler than that!

In the Kafka Consumer dialog, I specified the topic name as “test” to match what I did during the Kafka Quick Start.  I then set the “zookeeper.connect” property to the ZooKeeper instance running on my Cloudera VM, “192.168.56.102:2181”.  Finally, I specified the “group.id” as “kettle-group”.
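In other words, the only settings I touched in the dialog boil down to these three (the IP address is specific to my VM, so substitute your own ZooKeeper host and port):

topic = test
zookeeper.connect = 192.168.56.102:2181
group.id = kettle-group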




Now that I had things wired up, I figured it was time to run!  A couple of questions came to mind at this point.  Which message does the consumer group start reading from in the Kafka topic?  How long does the step run before exiting?  We’ll get to those answers in a few minutes.  First, let’s run it and see what happens…

BOOM!

2015/12/23 11:58:18 - Apache Kafka Consumer.0 - ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : Error initializing step [Apache Kafka Consumer]
2015/12/23 11:58:18 - Apache Kafka Consumer.0 - ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : java.lang.NoClassDefFoundError: kafka/consumer/ConsumerConfig

Fun with Java Classes.  I’m not exactly sure why Kettle can’t find the Kafka class here.  I quickly resolved this by placing all the plugin lib jar files in Spoon’s main classpath:

cp plugins/pentaho-kafka-consumer/lib/* lib

Note that this was a hammer of a solution.  I renamed all the copied jar files to start with “kafka” so that I could quickly undo the change if necessary.  I’ve also created the following issue over on GitHub; maybe there’s a better approach to fixing this that I haven’t thought of yet.

https://github.com/RuckusWirelessIL/pentaho-kafka-consumer/issues/11
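For the record, the copy-and-rename can be done in one go, roughly like this when run from the data-integration directory (the kafka- prefix is just my own marker so the borrowed jars are easy to spot and remove later):

for jar in plugins/pentaho-kafka-consumer/lib/*.jar; do
  cp "$jar" "lib/kafka-$(basename "$jar")"   # copy each plugin jar into Spoon's lib with a recognizable prefix
done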

Once I restarted Spoon, I re-ran the transformation and … got no results from Kafka.  I tried a bunch of different configurations and sent additional messages to Kafka, but no luck.  So I did what any developer would do and checked out the latest source code.

git clone https://github.com/RuckusWirelessIL/pentaho-kafka-consumer.git

From there I ran “mvn package” and got a fresh build.  I replaced plugins/steps/pentaho-kafka-consumer with the new target/pentaho-kafka-consumer-TRUNK-SNAPSHOT.zip.  After running the transformation and seeing a similar NoClassDefFoundError, I repeated my earlier workaround with the new plugin jars, copying them into the main classpath.
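For reference, the rebuild-and-replace sequence was along these lines ($SPOON_DIR here is just a placeholder for wherever your data-integration folder lives, and I’m assuming the zip unpacks into its own pentaho-kafka-consumer folder):

cd pentaho-kafka-consumer
mvn package                                               # produces target/pentaho-kafka-consumer-TRUNK-SNAPSHOT.zip
rm -rf "$SPOON_DIR/plugins/steps/pentaho-kafka-consumer"  # remove the marketplace copy of the plugin
unzip target/pentaho-kafka-consumer-TRUNK-SNAPSHOT.zip -d "$SPOON_DIR/plugins/steps/"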

Another thing I ran into was on the Kafka configuration side.  Kafka was advertising the hostname of my VM, which my host OS couldn’t resolve.  I fixed this by setting advertised.host.name in config/server.properties to the public IP address of the VM.
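In my case that meant this one line in config/server.properties (use whatever IP your host machine can actually reach), followed by a broker restart for it to take effect:

advertised.host.name=192.168.56.102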

After restarting Spoon, I successfully read in the messages from Kafka!  Note that at this time the step can’t reset the message offset for a given consumer group, so the only way to re-read messages is to change the “group.id”.  This is a feature that Ruckus is considering adding; it would also be a great way to contribute to the open source plugin!

After getting the Consumer working, I went ahead and tried out the Producer.  The Producer step expects binary data to feed a topic, so all I had to do was feed in binary data, specify the topic name (I used “test” again), and point “metadata.broker.list” at the correct IP and port, and it worked like a charm!  Note that I didn’t have to rebuild the producer plugin the way I did the consumer, but without the consumer jars in the lib folder the producer wouldn’t function either.
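Putting that together, the producer step configuration was roughly the following, where 9092 is Kafka’s default broker port (adjust it if your broker listens elsewhere):

topic = test
metadata.broker.list = 192.168.56.102:9092

A quick way to confirm that the messages are landing is to watch the topic with the console consumer on the VM:

bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning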



So how might you use Kettle and Kafka together?  Kafka is becoming the de facto big data message queue and can be used in combination with Spark and other Hadoop technologies for data ingestion and streaming.  Kettle can populate a Kafka topic via the Apache Kafka Producer step, or consume messages from a topic via the Apache Kafka Consumer step for downstream processing.  Ruckus Wireless, the company that contributed the steps, uses Pentaho Data Integration to ingest data into Vertica and then visualizes the data with Pentaho Business Analytics.  You can learn more about the Ruckus Wireless use case here:

http://www.pentaho.com/customers/ruckus-wireless

Here are links to the github locations for the plugins:

https://github.com/RuckusWirelessIL/pentaho-kafka-consumer/ 

https://github.com/RuckusWirelessIL/pentaho-kafka-producer/

Pentaho Data Integration 4 Cookbook: Get your swiss army knife out

This weekend I had the pleasure of reading Maria Roldan and Adrian Pulvirenti’s Pentaho Data Integration 4 Cookbook, published by Packt Publishing.  I was one of the reviewers for Maria’s first Packt book, Pentaho 3.2 Data Integration: Beginner’s Guide, as well as a Packt author myself, so when I was asked if I’d be willing to write about the most recent addition to the Pentaho collection of books, I happily obliged.

I highly recommend this book to anyone out there looking to learn more about PDI.  The book has many great recipes for specific situations, but throughout it you also pick up many important Swiss-army-knife skills that will aid you in your daily use of Pentaho Data Integration.  The book covers everything from dealing with unstructured text files to working with fuzzy logic.  As a Java developer, I especially appreciate the many uses of the User Defined Java Class step for the more advanced scenarios.  The book also introduces the many uses of Pentaho Data Integration within Pentaho’s BI Suite, allowing BI developers to create a flow of information from a transformation to a report or dashboard.

Chapter 6, Understanding Data Flows, may be the most important chapter in this book.  Managing the merging and splitting of data within a transformation requires key insights that this book covers in detail.  Having this information will allow you to take your transformation building skills to the next level.

Thanks to Maria and Adrian for the wonderful piece of work!  The copy I received will reside in the bullpen at Pentaho’s headquarters here in Orlando; I’m sure many of the engineers here will use and learn from it!  Now don’t waste any more time, get your own copy today!