Thursday, September 26, 2013

A first look at the new Pentaho BA Server 5.0 CE

I've been playing with the new Pentaho BA Server 5.0 for quite a while to see what's next, and now, I think, it's the right time to start talking about it. The first thing that stands out is that Pentaho did a big refactoring for this release. There are a lot of changes, from a brand new GUI to a completely new and clean source tree with fewer and better organized projects than the previous release.

What's changed from a developer point of view

The first big set of changes, visible only to developers like me who like to look into the code to learn or to customize something, is summarized below:

  1. The project moved to GitHub at this URL. This is a good choice to improve collaboration between people, and collaboration is one of the key points of the OSS philosophy. Almost every project around Pentaho is hosted on this platform (Kettle is not there yet, but it seems to be the only one).
  2. The project organization in the source code changed radically, trying to simplify the structure of the codebase (and in my opinion they got it right). You can find a description of the new project structure and the role of each of the sub-projects in this article on the Pentaho wiki. 
  3. Then, as a Java architect, another interesting change: it seems they moved to IntelliJ IDEA as their development platform of choice, judging by the presence of IntelliJ metadata files in the source tree. I'm an IntelliJ user myself (after having used NetBeans and Eclipse), so this is a switch I liked a lot. This page gives a detailed description of how to pull the platform project directly into IntelliJ and start working on it.
Get the platform and start playing with it

What's the best way to get the new release in your hands and start trying it out? There are two ways: a) get it from the Pentaho CI build system, or b) build it yourself. As a developer, the second choice is the more fun and interesting one if you, like me, want to understand how things change from time to time and want to be sure that everything builds correctly. Moreover, the ability to prepare your own system package gives you a high degree of customization; for that you may find CBF from Webdetails useful.

In this case, if you want to build it yourself, first you have to pull the project locally onto your computer from the pentaho-platform repository on GitHub. Then the quick and dirty way to build the Pentaho BA Server is to use the development Ant buildfile (dev_build.xml) provided in the source tree and make a new complete development build. If interested, you can list all of the available targets of this Ant buildfile with its help target (see the sequence below).
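
Just to fix ideas, the whole sequence could look like the one below; the repository URL is the pentaho-platform project on GitHub and the rest is simply a sketch of what I ran:

$ git clone https://github.com/pentaho/pentaho-platform.git
$ cd pentaho-platform
$ ant -f dev_build.xml help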



but I decided to follow the easy path and use the simplest command available to reach my goal. To do this, go to the platform sources' root directory (we will refer to it as <pentaho-platform-root>) and give the following command

 ant -f dev_build.xml dev-rebuild -Dsolution.dir.server=bin/stage/biserver-ce/pentaho-solutions -Dproject.revision=CE-SNAPSHOT package  

The system will be busy for a while building your new release; depending on your hardware it will take 10 to 15 minutes to complete. The result is:

  1. a packaged version of the Pentaho BA Server in the <pentaho-platform-root>/assembly/dist directory
  2. an unpackaged version of the same in the <pentaho-platform-root>/assembly/bin/stage/biserver-ce directory that you can use immediately for development and testing purposes.

That said, you can use the unpackaged version to easily and quickly look at the new features of the BA server.

What's changed from a user point of view

As soon as the platform opens in the browser you can see the new look. The new login screen looks really clean and minimal; I really like it! Now, remember that our old friendly user joe is gone: if you want to log in as an administrator you need the user Admin with the password... ah yes, of course, password (too difficult to remember!). In any case you can always get help by clicking on the Login as Evaluator link to get the demo users and passwords.



As soon as you get logged in you discover a totally new user environment that I really like. The new environment is based on a concept called Perspective.


A Pentaho perspective is an area that refers to a specific user context and collects all the objects and actions related to that particular context. You can change your perspective by clicking on the drop-down menu located on the upper left side of the screen, immediately below the main menu.


We have five perspectives in our Pentaho environment (the last one is visible only to administrators):

  • Home: it's the perspective you land in as soon as you log into Pentaho (see the screenshot two pictures above). It contains panels for Recent files and Favorite files, to help you reach your most frequently used reports or visualizations more easily. Then you have a big blue button called Manage Datasources: this is the new way to define datasources. Starting from this release, this is the only place where you can deal with datasources; this function is no longer in the administration part of the system as it was in previous releases. Clicking that button opens the Manage Datasources dialog box, which lets you define new datasources and either edit or delete existing ones.

  • Browse files: it gives you the ability to access the solution through the solution explorer. As you can see, here is another interesting new feature: we now have a differentiation between a public and a private part of the solution. As soon as a user is created, a new home directory is created for that user; that directory and its content are visible only to that particular user, and only a user with the Admin role has access to all of the users' home directories. All of the home directories sit under a parent folder called Home. There's also a public part of the solution, identified by a root folder called Public: that part of the solution is shared with every user in the system, depending on the sharing level decided by the administrator or by the content owner. There's no longer a contextual menu on each item of the solution (either folders or files) as in the previous release, but you can reach the contextual actions through the menu on the right of the file explorer.

  • Open files: as soon as you open an analysis view, a report, a visualization or anything else, the content appears in this perspective. You can have multiple contents opened at the same time through a tabbed interface, and you can easily switch back to the Browse files perspective, or any other, using the drop-down menu we illustrated a while ago.

  • Schedule: the old My Workspace view is definitely dead. With this new release we have this perspective, where the user can check the status of any scheduled content and open the content of any terminated execution.

  • Administration: another important change is about the administration of the Pentaho BA Server. The old administration console is definitely dead (I like this so much!) and we have a new perspective, accessible only to the Admin role and called Administration, that lets you do your administrative tasks. 



Conclusion


So what's next? I think they did a great job reorganizing almost everything in a new and clean way; I really like it, thanks a lot guys. I would have really appreciated a responsive design for this new GUI, so that it could be used painlessly on any device, but it isn't responsive: this is a point, in my opinion, where they could invest some time, and there's room to make things much better. In any case we must acknowledge a good move towards a more intuitive and clean web interface, so thank you Pentaho for this good job.




Tuesday, July 16, 2013

My first book about Pentaho is out!

Today is a great day for me and I'm very proud of it. It is the realization of a dream I have had for a long time: the chance to write a technical book.

In February 2013, a guy from Packt Publishing wrote me an email saying that he had read my blog and liked it, and so he decided to ask me about writing a book about Pentaho Data Integration. The book is about Kitchen and how to use PDI's command line tools efficiently to help people in their day to day operations with Pentaho Data Integration. It is a practical book and it seemed relatively easy to write. At first I was surprised and excited (a lot, I must admit); it seemed impossible to me that someone somewhere in the world had noticed my blog and decided to trust me. I was really proud and honored, so I decided to test myself and accepted.

Now, after 4 months of work, that book is finally out and you can order it!



I just want to thank Packt Publishing for this opportunity, and especially all the people from Packt who helped me in this adventure by reviewing my work and giving me useful suggestions on how to improve it as much as possible, novice writer that I am. I also want to thank my technical reviewer Joel Latino for the precious help he gave in reviewing the book and suggesting things to improve it.

Monday, June 10, 2013

Print Pentaho Reports from PDI efficiently and without pain!

These days I'm involved in a project where we need to print reports directly from our PDI processes, so what better option than the Pentaho Reporting Output step to do that? The idea was to find a way to avoid writing the JDBC connection information in the definition of our Pentaho report connections. To do this, we decided to go for report connections defined using JNDI datasources.

Pentaho Reporting and JNDI database connections

PRD, like any of the tools in the Pentaho family, supports creating database connections using JNDI datasources. This is a cool feature! Creating Pentaho report database connections using JNDI datasources is a good approach for the following reasons:

  • We avoid writing connection information in the report, making it more easily adaptable to changes in the system environment.
  • We are able to use a common server-side datasource, greatly simplifying the deployment of our reports to the Pentaho BI server.
  • Usually a datasource is directly related to the concept of database connection pooling, which helps us to use server-side resources efficiently.

Typically a JNDI datasource is created by an application server or a servlet engine using an appropriate set of connection parameters that differ depending on the particular database we are going to connect to. As soon as the server-side middleware creates the datasource, it is registered under a name decided at design time by the developer, and under that name it is made available to any application that needs to use it. It is always a good rule of thumb to use a datasource, when we have one, to let our application connect to our database. The Pentaho BI Server has a section in the Pentaho Administration Console (PAC) that lets you create all the datasource connections you need. Pentaho Reporting, and any other tool of the Pentaho suite, makes extensive use of the JNDI datasources defined in PAC to connect to all the SQL databases. But now, as developers using PRD (Pentaho Report Designer) to develop our reports, the problem is really simple: how can we develop our reports using Pentaho Reporting JDBC connections based on JNDI datasources if we don't have any datasource created on our client (remember that the datasource lives on the server)?
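
To make the idea of "connecting through a datasource" concrete, this is roughly what an application running inside the server does to get a connection. It's a minimal sketch of mine: the class name is arbitrary and the exact JNDI name (with or without a prefix like java:comp/env/) depends on the container.

import java.sql.Connection;
import javax.naming.InitialContext;
import javax.sql.DataSource;

public class DatasourceLookupSketch {
    public static void main(String[] args) throws Exception {
        // Look up the datasource by the name the container published it under
        InitialContext ctx = new InitialContext();
        DataSource ds = (DataSource) ctx.lookup("SampleData");

        // The pool hands out a connection; the code never sees the JDBC URL or credentials
        try (Connection con = ds.getConnection()) {
            System.out.println("Connected through " + con.getMetaData().getURL());
        }
    }
}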

A first approach to using JNDI datasources in PRD

The first idea, unfortunately not a good option but the first one that comes to mind, could be as follows:

  • Develop and test the report by defining Pentaho Reporting JDBC connections based on native JDBC connections. For this particular Pentaho Reporting JDBC connection type, the connection information is saved into the Pentaho report metadata definition, making it very difficult to maintain and eventually change.
  • As soon as the report is ready to be deployed to the Pentaho BI Server, edit the Pentaho Reporting JDBC connection definition and change it to use a JNDI JDBC datasource. Basically, this kind of JDBC connection only requires you to type the name of a datasource that is made available on the server you are going to deploy to. Using a JNDI JDBC datasource has the major plus of keeping the connection information in a single place: the datasource definition in the application server or servlet engine configuration. So if something changes in our connection information, the impact on our big set of reports is really low. That's a good thing.

This approach, at first, seems like a good idea, but what's wrong with it? For sure it's a bit convoluted, but the worst thing is that you can forget to make the suggested change right before deploying to your Pentaho BI Server. I would forget, for sure...  So what can we do to make things as easy as possible?

Use SimpleJNDI to emulate datasources on the client

Pentaho Reporting, like any other Pentaho tool, integrates a very cool library called SimpleJNDI. This library, as stated on the library website, "is intended to solve two problems. The first is that of finding a container independent way of opening a database connection, the second is to find a good way of specifying application configurations". It is entirely library based, so no server instances are started, and it sits upon Java .properties files, XML files or Windows-style .ini files, so it is easy to use and simple to understand. The files may be either on the file system or in the classpath.

In the case of PRD we have a file called default.properties, typically located in the directory <user_home>/.pentaho/simple-jndi. That file contains the datasource definitions. Each datasource is defined using a set of 5 entries, and each entry is a key-value pair where the key has the following syntax

<datasource name>/<attribute>

That said let's have a look at the example below:

SampleData/type=javax.sql.DataSource
SampleData/driver=org.hsqldb.jdbcDriver
SampleData/user=pentaho_user
SampleData/password=password
SampleData/url=jdbc:hsqldb:mem:SampleData

This example represents the definition of a datasource named SampleData; it has 5 attributes as detailed below:

  • type: it represents the type of JNDI resource we're going to define. Basically it is set to the Java interface that represents a datasource.
  • driver: the class name of the database JDBC driver; it changes depending on the database we're going to connect to (see the example after this list).
  • user: the username used for the connection.
  • password: the user's password used for the connection.
  • url: the JDBC connection URL used to open the connection to the database. Again, this value depends on the database we're going to connect to.
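
For example, a hypothetical entry for a PostgreSQL database would look like the following; the datasource name, credentials and database name are placeholders of mine, while the driver class and the URL format are the PostgreSQL-specific parts:

FoodmartDS/type=javax.sql.DataSource
FoodmartDS/driver=org.postgresql.Driver
FoodmartDS/user=foodmart_user
FoodmartDS/password=secret
FoodmartDS/url=jdbc:postgresql://localhost:5432/foodmart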

As soon as the Reporting Engine fills the report template with data, and therefore needs a database connection, the library creates the datasource and the resulting connection is made available to Pentaho Report Designer.

The way to go when developing Pentaho reports using JNDI datasources 

That said, to properly define a Pentaho JDBC connection that uses JNDI JDBC datasources, the way to go is detailed below:

  1. Open the default.properties file located under <user_home>/.pentaho/simple-jndi.
  2. Copy and paste the definition of an already defined datasource to simplify the new datasource definition. 
  3. Change the name of the datasource in the copied rows to the name of the new datasource we're going to define. Remember to give the datasource the same name as the datasource defined on your Pentaho BI Server that the report will use once deployed.
  4. Change the values of the attributes and update them with the connection parameters for the new datasource.
  5. Save and close the file.
  6. Open Pentaho Report Designer and start a new report. 
  7. Define a new Pentaho JDBC datasource connection and give it a proper name.
  8. Add a new JDBC connection in the Pentaho JDBC datasource configuration.
  9. Give the connection a name, then select the database connection type and JNDI as the access type.
  10. As soon as the JNDI access type is selected, set the value of the JNDI name field to the name of the datasource you just configured in the default.properties file (as detailed in points 1 to 5 above).
  11. Press the Test button to verify that the connection through the newly defined datasource works. 
  12. Press OK to close the database connection definition dialog.


At this point your JNDI connection is defined and uses a local fake datasource that has the same name as the datasource on the server, so you're not required to remember to make any change before deploying the report: everything is already done. Cool!

Print Pentaho reports from a PDI transformation

PDI (Pentaho Data Integration) has an output step called Pentaho Reporting Output that you can use to print a Pentaho report from inside a transformation. Basically it prints your report given the complete template filename, the path to the output directory, the report output format (PDF, Excel, HTML, etc.) and, optionally, the parameters.


So, that said, it seems very simple for a PDI developer to print a Pentaho report, and obviously it is: take data from your sources, manipulate them in your transformation and start printing your report using this cool output step. But there are some things anyone must know to make this work properly in every case. Because PDI doesn't inject connection information into Pentaho Reporting, a good way to keep the Pentaho Reporting JDBC connection details out of the report metadata is to use JNDI JDBC datasources, so that all that stuff is externally defined and easily maintainable. Again, the simple-jndi library comes to our aid in dealing with this.

PDI doesn't inject a database connection into our report

PDI doesn't inject a database connection into the report, so the report uses the connection information defined in its Pentaho Reporting datasource connections. At this point, the best option we have to externalize the report database connection information is to use the JNDI capability of Pentaho Reporting through a locally defined datasource connection (as detailed above). In this case the default.properties file containing the Pentaho Reporting datasource connection information works as a new external configuration file of our PDI process, and you can distribute it with your process configuration files. As soon as your PDI transformation starts the print output, Pentaho Reporting uses the simple-jndi library to create the JNDI JDBC datasource connection and makes that connection available to your report.

When we talked about using simple-jndi to support the development of reports using JNDI datasources in Pentaho Report Designer, I said that PRD looks for the file in a specific location on your filesystem. My assumption at this point was that, as usual, even when we use this mechanism to print a report from inside PDI, the reporting subsystem knows where the default.properties file is located. But, unfortunately, I was wrong: Pentaho Reporting is unable to locate our default.properties file, so the JNDI datasource definitions referenced by our report cannot be resolved. Let's analyze the standard use-case: Pentaho Report Designer (the tool) gets the location of that file through its own configuration, and when we deploy the report to the Pentaho BI Server the JNDI name of the datasource is resolved by the server. But in our new use-case, PDI printing a Pentaho report that uses JNDI definitions, things are totally different. So how can we deal with this and have our report printed from the PDI step without pain?

Report with JNDI JDBC datasource connections printed by PDI: what happens?

As soon as the print is started by PDI, the Pentaho Reporting subsystem resolves the path to default.properties relative to the directory from which the script that starts the PDI process (either Spoon, Kitchen or Pan) was launched. Let me try to explain. If your ETL process files are located in the /tmp/foo/etlprocess directory and you start your ETL process with Kitchen from that directory, with a relative path to the job file as in this example

$ cd /tmp/foo/etlprocess
$ kitchen.sh -file:./my_print_job.kjb

PDI looks for a simple-jndi directory inside the working directory /tmp/foo/etlprocess, so it looks for it in /tmp/foo/etlprocess/simple-jndi. But what happens if you start Kitchen from a totally different directory, say /tmp/foo1, to launch your job my_print_job.kjb located in /tmp/foo/etlprocess?
$ cd /tmp/foo1
$ kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb

In this case, PDI looks for a simple-jndi directory inside /tmp/foo1, so it looks for it in /tmp/foo1/simple-jndi. Because you cannot know how your final user will start your job, this is a complete mess! But don't be afraid, there's a solution. The best idea to solve this is:

  • Have a configuration directory local to your ETL process file, that already contains other configuration items for your process and that you distribute from within your package.
  • Put your default.properties file inside that directory
  • Have a way to specify to PDI where that simple-jndi configuration file is located.

This elegantly solves your problem.

How to specify simple-jndi information to have your report fully working

There is a way to give the simple-jndi library all the information needed to elegantly solve our issue, by specifying a set of Java system properties passed through an environment variable. The source of information for this is the documentation included in the binary download that you can get from here.

Basically, to fix our issue we need to redefine the standard value of the PENTAHO_DI_JAVA_OPTIONS environment variable to this value:

PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"

As you can see we have the standard memory settings plus three new parameters:

  • -Djava.naming.factory.initial: this first parameter sets the fully qualified name of the initial context factory for SimpleContexts; this is an internal simple-jndi class.
  • -Dorg.osjava.sj.root: this parameter sets the complete path to the directory that will contain the default.properties file with our datasource definitions. Following our previous example, we're telling simple-jndi to look for that file in /tmp/foo/etlprocess/config, where config is our process configuration directory.
  • -Dorg.osjava.sj.delimiter: this third parameter sets the delimiter used to separate elements in a lookup value. This allows the code to get closer to pretending to be another JNDI implementation, such as DNS or LDAP. In our case we need to use the / character.


Feel free to choose which mechanism to use to set this environment variable: you can either put the variable definition in your user profile file, or you can make a wrapper script that sets it and then calls the PDI scripts.
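
As a sketch of the second option, a small wrapper script could look like the one below; the path to kitchen.sh and the configuration directory are assumptions taken from the example above, so adapt them to your installation:

#!/bin/sh
# Set the simple-jndi options before starting Kitchen
PENTAHO_DI_JAVA_OPTIONS="-Xmx512m -XX:MaxPermSize=256m -XX:-UseGCOverheadLimit -Djava.naming.factory.initial=org.osjava.sj.SimpleContextFactory -Dorg.osjava.sj.root=/tmp/foo/etlprocess/config -Dorg.osjava.sj.delimiter=/"
export PENTAHO_DI_JAVA_OPTIONS

# Start the job through Kitchen; the current working directory no longer matters
/opt/pentaho/data-integration/kitchen.sh -file:/tmp/foo/etlprocess/my_print_job.kjb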

That's all!

So for now that's all! You have all the information required to have your JNDI based reports fully working even when launched through PDI. This is a good way to fully separate connection information from the report and keep it together with your ETL configuration files. This will give you a clean distribution for your customer and, of course, the more easily and cleanly things are organized, the happier they will be!

Stay tuned for the next post that will come later on: I have interesting tips to share with you about using a Java profiler to investigate memory usage problems in your ETL process, so that you quickly get to the right solution. Have fun and see you later.


Monday, April 22, 2013

Pretty print XML and JSON files with PDI (2/2)

Today I'm back with the second part of my recipe about how to pretty print XML and JSON files with PDI. You can find the first part here.

To pretty print any JSON stream, a good starting point is the Gson library. It is a nice component that lets you:

  1. serialize a set of Java objects into a JSON stream, or 
  2. convert a JSON stream into an equivalent set of Java objects.
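
Before looking at the PDI side, here is a minimal standalone sketch of the pretty printing pattern we're going to use; the input string is just an arbitrary example of mine:

import com.google.gson.Gson;
import com.google.gson.GsonBuilder;
import com.google.gson.JsonElement;
import com.google.gson.JsonParser;

public class JsonPrettyPrintSketch {
    public static void main(String[] args) {
        String uglyJson = "{\"name\":\"john\",\"orders\":[{\"id\":1},{\"id\":2}]}";

        // Parse the incoming stream into Gson's object model
        JsonElement je = new JsonParser().parse(uglyJson);

        // Serialize it back with pretty printing enabled
        Gson gson = new GsonBuilder().setPrettyPrinting().create();
        System.out.println(gson.toJson(je));
    }
}
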
The setup

To prepare PDI to run this example you must:

  1. Download the Gson library from the following link. In my case I downloaded version 2.2.3, but the same steps apply to other versions of the library.
  2. Unzip the file in a temporary directory.
  3. Copy the gson-2.2.3.jar file to the <PDI_HOME>/libext directory.
  4. Restart PDI. 

The how-to

First of all, I made an example to obtain an ugly JSON sample stream to format. To do this I built a new transformation by reusing the input files of the sample multilayer XML file transformation to obtain a simple JSON stream. The interesting part is at the very end of this transformation: again you have a User Defined Java Class step that contains all the code that does the dirty job for you.

1:  import com.google.gson.Gson;  
2:  import com.google.gson.GsonBuilder;  
3:  import com.google.gson.JsonParser;  
4:  import com.google.gson.JsonElement;  
5:  String jsonOutputField;  
6:  String jsonPPField;  
7:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
8:  {  
9:    // First, get a row from the default input hop  
10:       //  
11:       Object[] r = getRow();  
12:    // If the row object is null, we are done processing.  
13:       //  
14:       if (r == null) {  
15:            setOutputDone();  
16:            return false;  
17:       }  
18:       // Let's look up parameters only once for performance reason.  
19:       //  
20:       if (first) {  
21:            jsonOutputField = getParameter("JSONOUTPUT_FIELD");  
22:            jsonPPField = getParameter("JSONPP_FIELD");  
23:         first=false;  
24:       }  
25:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
26:    // enough to handle any new fields you are creating in this step.  
27:       //  
28:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
29:       logBasic("Input row size: " + r.length);  
30:       logBasic("Output row size: " + data.outputRowMeta.size());  
31:       String jsonOutput = get(Fields.In, jsonOutputField).getString(r);  
32:       Gson gson = new GsonBuilder().setPrettyPrinting().create();  
33:       JsonParser jp = new JsonParser();  
34:       JsonElement je = jp.parse(jsonOutput);  
35:       String jsonpp = gson.toJson(je);  
36:       // Set the value in the output field  
37:       //    
38:       get(Fields.Out, jsonPPField).setValue(outputRow, jsonpp);  
39:    // putRow will send the row on to the default output hop.  
40:       //  
41:    putRow(data.outputRowMeta, outputRow);  
42:       return true;  
43:  }  

This time the interesting code is between lines 32 and 35:

  1. The Gson object is created with pretty printing enabled (line 32).
  2. The incoming JSON stream is read and parsed (lines 31 and 33-34).
  3. A new, pretty printed JSON stream is built and used to fill the jsonpp output field (lines 35 and 38).

Next, the pretty printed JSON stream is saved to a .js file, and that's all.

You can download the sample transformation from this link. I hope you enjoyed this two-part article and that it will be useful for you.

Friday, April 19, 2013

Pretty print XML and JSON files with PDI (1/2)

These days I got into some stupid trouble with PDI. A customer asked me to build a process that produces an XML file as a result; the result I got was horrible from a presentation standpoint: the file wasn't properly pretty printed and that made it very difficult to read. You will encounter this problem anytime you produce XML files as output, and the same thing also happens anytime you want JSON output. So this time I decided to take care of it and find a quick and simple solution to beautify my output; this is what I'm going to talk about in this post.

Pretty Printing XML files How-To

So what to do in case I need to pretty print XML files? Well, that's very simple. PDI ships with dom4j, a very well known library that lets you easily work with XML, XPath and XSLT. That library has an object in its API, OutputFormat, which represents the format configuration used to write the XML output; there you find a static method called createPrettyPrint(), and this is all we need to reach our goal very effectively. But how can we apply all of this to the XML that we're going to produce in our transformation? It's very easy to do.
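
Before wiring it into a transformation, here is a minimal standalone sketch of the dom4j pattern; the sample XML string is just an arbitrary example of mine:

import java.io.StringWriter;
import org.dom4j.Document;
import org.dom4j.DocumentHelper;
import org.dom4j.io.OutputFormat;
import org.dom4j.io.XMLWriter;

public class XmlPrettyPrintSketch {
    public static void main(String[] args) throws Exception {
        String uglyXml = "<orders><order id=\"1\"><item>apples</item></order></orders>";

        // Parse the raw XML and write it back through the pretty-print format
        Document document = DocumentHelper.parseText(uglyXml);
        StringWriter sw = new StringWriter();
        XMLWriter writer = new XMLWriter(sw, OutputFormat.createPrettyPrint());
        writer.write(document);

        System.out.println(sw.toString());
    }
}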

Let's start by considering one of the samples shipped with PDI, the multilayer XML files sample. You can find it in the samples/transformations directory in your PDI home directory. Take the original file and make a copy so that we can start working on it.

Now, between the Text Output step and the 5th XML Join step (named "XML Join Step 5"), I added two new steps, a Select values step and a User Defined Java Class step, as you can see in the detail shown in the figure below.

While the Select values step selects only the xmlOutput field, which contains the raw XML that needs to be beautified, the User Defined Java Class step is the place where the dirty job is done. To properly configure the User Defined Java Class step, which I renamed "Pretty Print Out XML", we need to:


  1. Under the Fields tab below, define a new field named xmlpp, of type String, that will contain the pretty printed XML stream.
  2. Select the Parameters tab. 
  3. Define a new parameter XMLOUTPUT_FIELD whose value is set to xmlOutput. 
  4. Define a new parameter XMLPP_FIELD whose value is set to xmlpp.
  5. Copy and paste the code below into the code area of the step properties dialog.

1:  import java.io.StringWriter;  
2:  import org.dom4j.io.OutputFormat;  
3:  import org.dom4j.Document;  
4:  import org.dom4j.DocumentHelper;  
5:  import org.dom4j.io.XMLWriter;  
6:  String xmlOutputField;  
7:  String xmlPPField;  
8:  public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException  
9:  {  
10:    // First, get a row from the default input hop  
11:       //  
12:       Object[] r = getRow();  
13:    // If the row object is null, we are done processing.  
14:       //  
15:       if (r == null) {  
16:            setOutputDone();  
17:            return false;  
18:       }  
19:       // Let's look up parameters only once for performance reason.  
20:       //  
21:       if (first) {  
22:            xmlOutputField = getParameter("XMLOUTPUT_FIELD");  
23:            xmlPPField = getParameter("XMLPP_FIELD");  
24:         first=false;  
25:       }  
26:    // It is always safest to call createOutputRow() to ensure that your output row's Object[] is large  
27:    // enough to handle any new fields you are creating in this step.  
28:       //  
29:    Object[] outputRow = createOutputRow(r, data.outputRowMeta.size());  
30:       logBasic("Input row size: " + r.length);  
31:       logBasic("Output row size: " + data.outputRowMeta.size());  
32:       String xmlOutput = get(Fields.In, xmlOutputField).getString(r);  
33:    StringWriter sw;  
34:    try {  
35:      OutputFormat format = OutputFormat.createPrettyPrint();  
36:      Document document = DocumentHelper.parseText(xmlOutput);  
37:      sw = new StringWriter();  
38:      XMLWriter writer = new XMLWriter(sw, format);  
39:      writer.write(document);  
40:    }  
41:    catch (Exception e) {  
42:      throw new RuntimeException("Error pretty printing xml:\n" + xmlOutputField, e);  
43:    }  
44:    String xmlpp = sw.toString();  
45:       // Set the value in the output field  
46:       //  
47:       get(Fields.Out, xmlPPField).setValue(outputRow, xmlpp);  
48:    // putRow will send the row on to the default output hop.  
49:       //  
50:    putRow(data.outputRowMeta, outputRow);  
51:       return true;  
52:  }  

The code we're interested in is between lines 32 and 44.


  1. The incoming XML is read from the xmlOutput row field and put into the xmlOutput variable (line 32).
  2. We create an OutputFormat object with pretty printing enabled by using the OutputFormat.createPrettyPrint() function (see line 35).
  3. We parse the XML document and then write it to a StringWriter (see lines 36-39).
  4. We put the obtained pretty printed XML into the xmlpp variable and then into the new field created in the output row (see lines 44-47).

The resulting XML stream is then saved to disk by means of the Text Output step.

We tested this by pretty printing XML files quite significant in terms of size, with good performance, so I strongly suggest you use this method. We're also planning to build a custom step that does this in a smoother way; it will be released to the SeraSoft public GitHub repository soon. 

You can download the sample by clicking here. For now we're done; next time I'll show you how to beautify a JSON file. Stay tuned!