Right Outer Join

18 July 2014

Hierarchical JSON from Oracle

Filed under: Master Data Management, Oracle — mdahlman @ 14:03

Background

Semarchy manages master data hierarchies (corporate structures, product group hierarchies, employee management trees, etc.) easily with out-of-the-box functionality. By this I mean it can validate the data, match up different sources, enrich the data from external systems, manage an audit log of changes, and so forth. It's all great stuff. But on a recent project I wanted to display hierarchical data using an intuitive visual interface. A plethora of visualization libraries exist, and I was leaning toward D3 since it appears to be one of the most polished, most used, and most actively developed at the moment.

Problem

The D3 example I wanted to use is designed to accept data in JSON format. My data is in Oracle, and Oracle doesn't provide a simple way to generate complex JSON output.

Likely Solutions

A few people pointed me to plsql-utils, aka Alexandria, as the best starting point. It's a really useful Oracle resource, and I spent some time investigating this idea. Morten Braten's 2010 article about it is excellent. But in the end I didn't find it to be the right tool for this problem. It made it very easy to take a result set and convert it to valid JSON where each row becomes a JSON record, but that is just tabular data as JSON, not the nested structure I needed to feed into the D3 engine. I have no doubt that I could write a stored procedure that loops through my data to build a more appropriate hierarchical structure and then uses plsql-utils to convert it to JSON, but the level of effort required seemed high.

I found several references to PL/JSON. This project shows potential, but it doesn't appear to be actively developed (as of mid-2014).

There's an interesting answer at the greatest of all answer sites from Mike Bostock, the author of D3(!). That example is focused on converting comma-separated values (CSV) data to JSON. The concepts could be applied here, but my data, though tabular, is not actually CSV. I would prefer to use the D3 sample with the smallest number of changes possible, so I would much rather return the data to D3 already JSON-ified if I can.

Then I found Lucas Jellema's 2011 article about generating a JSON string directly from a query. This presented a more intuitive approach for me. He uses a common table expression (CTE) to create a sub-select that gathers the hierarchical information, along with the LISTAGG analytic function to present it well. Clever. In the end I didn't actually use LISTAGG, and I didn't really use a CTE. (OK, my sample query below has a CTE… but it could be changed into a standard subquery with trivial effort.)
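For reference, here is a minimal sketch of that LISTAGG idea, one level deep only, using the same GD_EMPLOYEE table and columns as my query below (LISTAGG requires Oracle 11gR2 or later). It builds a JSON array of each manager's direct reports, but it does not recurse down the hierarchy, which is exactly why I ended up going a different way:

SELECT mgr.FIRST_NAME || ' ' || mgr.LAST_NAME     as MANAGER_NAME
      /* aggregate each manager's direct reports into a JSON array string */
      ,'[' || LISTAGG('{"name":"' || emp.FIRST_NAME || ' ' || emp.LAST_NAME || '"}', ',')
                WITHIN GROUP (ORDER BY emp.EMPLOYEE_NUMBER)
           || ']'                                  as CHILDREN_JSON
FROM GD_EMPLOYEE mgr
JOIN GD_EMPLOYEE emp ON emp.F_MANAGER = mgr.EMPLOYEE_NUMBER
GROUP BY mgr.EMPLOYEE_NUMBER, mgr.FIRST_NAME, mgr.LAST_NAME;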

My Solution

In the end I decided to use Oracle's inherent ability to handle hierarchical information (mainly the CONNECT BY syntax) and then convert it to JSON with the addition of simple string logic. The key concepts needed in this conversion are:

  • The CONNECT BY query can return the data in a specified logical order.
  • By knowing if the next record is at a higher, lower, or equal level in the hierarchy, we can generate JSON brackets correctly.
  • We can determine whether the next record is at a higher, lower, or equal level by using analytic windowing functions such as LAG and LEAD.

Here's the commented SQL used to return the data:

WITH connect_by_query as (
  SELECT 
     ROWNUM                               as rnum
    ,FIRST_NAME || ' ' || LAST_NAME       as FULL_NAME
    ,LEVEL                                as Lvl
  FROM GD_EMPLOYEE emp1
  START WITH EMPLOYEE_NUMBER = 100
  CONNECT BY PRIOR EMPLOYEE_NUMBER = F_MANAGER
  ORDER SIBLINGS BY EMPLOYEE_NUMBER
)
select 
  CASE 
    /* the top dog gets a left curly brace to start things off */
    WHEN Lvl = 1 THEN '{'
    /* when the previous record is one level shallower, this record is its first child: start a "children" array */
    WHEN Lvl - LAG(Lvl) OVER (order by rnum) = 1 THEN ',"children" : [{' 
    ELSE ',{' 
  END 
  || ' "name" : "' || FULL_NAME || '" '
  /* when the next record is at the same level or shallower, close this object,
     then close one "children" array and its parent object per level we step back up */
  || CASE WHEN LEAD(Lvl, 1, 1) OVER (order by rnum) - Lvl <= 0 
     THEN '}' || rpad( ' ', 1+ (-2 * (LEAD(Lvl, 1, 1) OVER (order by rnum) - Lvl)), ']}' )
     ELSE NULL 
  END as JSON_SNIPPET
from connect_by_query
order by rnum;

Here's an example of the data returned (the indentation was added afterwards for readability, but the content is exactly what the query returned):

{
  "name": "Steven King",
  "children": [{
    "name": "Neena Kochhar",
    "children": [{
      "name": "Nancy Greenberg",
      "children": [{
        "name": "Daniel Faviet"
      }, {
        "name": "John Chen"
      }, {
        "name": "Ismael Sciarra"
      }, {
        "name": "Jose Manuel Urman"
      }, {
        "name": "Luis Popp"
      }]
    }]
  }, {
    "name": "Lex De Haan",
    "children": [{
      "name": "Alexander Hunold",
      "children": [{
        "name": "Bruce Ernst"
      }, {
        "name": "David Austin"
      }, {
        "name": "Valli Pataballa"
      }, {
        "name": "Diana Lorentz"
      }]
    }]
  }, {
    "name": "Den Raphaely",
    "children": [{
      "name": "Alexander Khoo"
    }, {
      "name": "Shelli Baida"
    }, {
      "name": "Sigal Tobias"
    }, {
      "name": "Guy Himuro"
    }, {
      "name": "Karen Colmenares"
    }]
  }]
}

With the data in that form, it was easy to implement this D3 sample inside Semarchy Convergence for MDM:

Employee Hierarchy in Semarchy MDM

The left side shows the standard tree view. Practical.
The right side shows the D3 tree visualization. Awesome.
(And practical in different ways.)

 

Here is the SQL (creates, inserts, and the complete select statement) to try it yourself:

Oracle select query to generate JSON data

 

8 July 2014

MDM in the Cloud (on Amazon AWS Marketplace)

Semarchy MDM on AWS Marketplace


Semarchy shows off its 5 star reviews as the most popular MDM solution on Amazon’s AWS Marketplace

MDM in the Cloud

One of the biggest impediments to Master Data Management (MDM) projects is that they can be hard to get started. An enterprise has lots of people and lots of groups who all stand to benefit from improved data quality, structured data governance, and systematic master data management. But the very fact that so many people stand to gain from it is also a reason why it’s slow to start. Gathering requirements and opinions from everyone takes time.

One of the best ways to get quick agreement about the scope of the first iteration of an MDM project is to build a quick proof-of-concept or proof-of-value prototype. And one of the quickest ways to get started on an MDM prototype is to use software that's completely pre-installed and pre-configured. This leads to better alignment about what will be possible in an MDM project, which makes the project more likely to succeed.

The cloud is a natural fit for this.

Amazon's AWS Marketplace provides an environment where it's easy to find software that's relevant to your needs and get it launched instantly without any up-front costs. When I worked at Jaspersoft I invested quite a bit of time into getting a pre-configured JasperReports Server instance available and into making it easy for people to use Business Intelligence (BI) in the cloud. It was a natural fit, especially for anyone who already had data in Amazon RDS or Redshift. The time we invested in that paid off nicely as customers flocked to it. Sales are way up; the reviews are great; and it should serve as a model and an inspiration to other vendors considering cloud offerings.

Semarchy in the Cloud

While business intelligence offerings in the cloud are legion, traditional Master Data Management vendors have been much too slow to embrace the cloud. The industry has taken baby steps. For example, Informatica purchased Data Scout and sells this as their SaaS MDM Salesforce.com plug-in solution. It’s a great utility for salesforce.com, but I don’t put it into the same class as enterprise MDM. Other SaaS MDM solutions are similar.

At Semarchy I see the cloud as an excellent vehicle for putting enterprise MDM into the hands of more users. You can have a fully functional MDM server running in an Amazon Virtual Private Cloud (VPC) in less than an hour. It's accessible only to people from your company, and it's ready for you to model your master data management requirements and to start fuzzy-matching and de-duplicating your data.

I expect other vendors to follow eventually. The net result will be improved solutions available to data management professionals everywhere. I’m pleased that Semarchy is leading the way.

 

5 December 2013

Copy files between s3 buckets

Filed under: AWS, Linux — mdahlman @ 15:06

The problem

I needed to copy files between Amazon AWS S3 buckets. This should be easy. Right?

To be clear, I wanted the equivalent of this:

cp s3://sourceBucket/file_prefix* s3://targetBucket/

The solution (short version)

No, it’s not easy.

Or rather, in the end it turned out to be pretty easy; but it was entirely unintuitive.

s3cmd cp --recursive --exclude=* --include=file_prefix* s3://sourceBucket/ s3://targetBucket/

The explanation (long version)

Get s3cmd

The best command line utility for working with S3 is s3cmd. You can get it from s3tools.org. If you're on Amazon Linux (or CentOS, RHEL, etc.), this is the easiest way to install it:

# Note the absence of s3tools.repo in your list of repositories like this:
ls /etc/yum.repos.d/
# Put s3tools.repo in your list of repositories like this:
sudo wget http://s3tools.org/repo/RHEL_6/s3tools.repo -O /etc/yum.repos.d/s3tools.repo
# Confirm that you did it correctly:
ls /etc/yum.repos.d/

# Install s3cmd:
sudo yum install s3cmd

# Configure s3cmd:
s3cmd --configure

False start 1

s3cmd has a copy command, “cp”. Try that:

# This should do the trick:
s3cmd cp s3://sourceBucket/file_prefix* s3://targetBucket/

One file copies successfully… but then it crashes:

File s3://sourceBucket/file_prefix_name1.txt copied to s3://targetBucket/file_prefix_name1.txt

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
  Please report the following lines to:
   s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Problem: KeyError: 'dest_name'
S3cmd:   1.0.0

Traceback (most recent call last):
  File "/usr/bin/s3cmd", line 2006, in 
    main()
  File "/usr/bin/s3cmd", line 1950, in main
    cmd_func(args)
  File "/usr/bin/s3cmd", line 614, in cmd_cp
    subcmd_cp_mv(args, s3.object_copy, "copy", "File %(src)s copied to %(dst)s")
  File "/usr/bin/s3cmd", line 604, in subcmd_cp_mv
    dst_uri = S3Uri(item['dest_name'])
KeyError: 'dest_name'

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
    Please report the above lines to:
   s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Argh!! This stackoverflow answer confirms that s3cmd cp cannot handle this. (It is wrong, but for a long time I believed it.)

False start 2

This stackoverflow answer suggests “sync” as the command to use.

It is correct. But sync is not the same as copy, so it has bad side effects if what you really want is to copy files. For example, sync will remove files in the target folder (to keep things in sync, duh), so syncing from source1 and source2 into a single target will cause grief. For copying all files from one location to another it's great. But I wanted to copy files, and I did not want any of the side effects of sync.

Bad alternatives

You can write your own script using boto and Python, or muck around with awk, getting lists of files and copying them one by one. In principle these will work, but yuck.

You could download the files from S3 and then put them back up into the intended target bucket. This is a terrible solution. It will succeed… but what a waste of time and bandwidth. What makes it so tempting is that s3cmd works exactly like you want it to work with "get" and "put".

s3cmd put /localDirectory/file_prefix* s3://targetBucket/

If “put” is so easy, why is “cp” so hard?

Enlightenment

I studied the s3cmd options over and over. Eventually I realized that "cp" has more flexibility than it first appears, if you look deep enough.

  • --recursive
    In my mind, my requirement is clearly not recursive. I simply want multiple files. But recursive in this context just tells s3cmd cp to handle multiple files. Great.
  • --exclude
    It's an odd way to think of the problem. Begin by recursively selecting all files. Next, exclude all files. Wait, what?
  • --include
    Now we're talking. Indicate the file prefix (or suffix, or whatever pattern) that you want to include. (A quick way to double-check the pattern is shown after this list.)
  • s3://sourceBucket/  s3://targetBucket/
    This part is intuitive enough. Though technically it seems to violate the documented example from s3cmd help, which indicates that a source object must be specified:
    s3cmd cp s3://BUCKET1/OBJECT1 s3://BUCKET2[/OBJECT2]
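Putting it together, here is a sketch of how I would sanity-check the pattern before running the real copy (same placeholder bucket names and prefix as above; quoting the patterns simply keeps the shell from ever trying to expand them):

# Preview which objects the prefix actually matches; s3cmd ls treats the URI as a key prefix:
s3cmd ls s3://sourceBucket/file_prefix

# Then run the copy with the patterns quoted:
s3cmd cp --recursive --exclude="*" --include="file_prefix*" s3://sourceBucket/ s3://targetBucket/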

I posted a brief version of my answer to the most elegant of technical websites. You should vote it up. But that didn’t seem like the best place to elaborate on the answer as I’ve done here.

Postscript

Amazon offers a command line interface tool to do the same thing: the AWS Command Line Interface (CLI). I swear that I looked extensively and repeatedly for exactly this, saying, "I just can't believe that Amazon wouldn't have this by now." Well, they do. I have no idea why I could not find it, but I'm mentioning it here for my own future reference and for anyone else who is using s3cmd as an alternative to the Amazon utility they couldn't find.

I have no idea if the Amazon CLI is [ better | worse | different ] than s3cmd in any interesting way regarding S3. (It’s certainly different in the respect that it interacts with many other AWS services besides S3.) If I ever need to compare them, then I’ll write it up.
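For anyone who lands on the AWS CLI instead, its s3 cp command takes the same exclude-everything-then-include-the-prefix approach; a rough equivalent of the s3cmd command above looks like this:

# AWS CLI equivalent: filters are applied in order, so exclude everything, then re-include the prefix.
aws s3 cp s3://sourceBucket/ s3://targetBucket/ --recursive --exclude "*" --include "file_prefix*"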

 

20 November 2013

Citi doesn’t get it

Filed under: JasperReports — mdahlman @ 22:12

I received an email today with this quote:

Once you register a purchase online with Citi Price Rewind, we will search our database of online merchants for a lower price for 30 days after the purchase date. If we find a price that is at least $25 less than what you paid, you can be eligible for a refund of the difference, up to $250 per item.

This is an email from someone who deeply “doesn’t get it”. Allow me to elaborate.

“Once you register a purchase…” Citi already knows all of my Citi purchases. All of them. They bill me for them, so they have to know them. But they still make me register a purchase. This is a waste of time. It’s silly.

“… search our database of online merchants …” If I buy a $300 mixer at Target, they know. If a million other people buy the same mixer, they know. But they don’t consider these purchases. They only look at online purchases. This is intentionally incomplete.

“… a price that is at least $25 less than what you paid …” They’ll keep it to themselves if I could have gotten the same item for $23 cheaper somewhere else? This is petty and mean. If I could have saved a nickel somewhere else, then they should tell me.

“… you can be eligible for a refund …” Sweet! I’ll get a refund! Oh wait… I’ll be eligible for a refund. What?

Desired situation:
I use my Citi card for purchases. Citicard looks out for me; if they find the item cheaper then they refund me the difference.

Actual situation:
I use my Citi card for purchases. If I think to manually go to “Price Rewind” then…

  • Read 1500 words explaining the fine print of what’s covered.
  • Then provide the details about it:
    How much did it cost?
    When did you buy it?
    Where did you buy it?
    This should surely be a skit on SNL. They want me to tell them the cost of something I just bought using their card? I should tell them the date and location of the transaction that they already know? This is cynical and stupid.
  • Having selected an item and consulted a lawyer … I then …
    wait 30 days.
    Then I receive my refund.
  • Oops, no. I don't receive a refund. I receive an email indicating that I'm eligible for a refund. I'm then invited to upload a scanned copy of my receipt. I assume you all file all of your receipts for all purchases by date for future reference and uploading to cynical credit card offers. I do. I photocopy all of them and cross-file them by date, merchant, product line, color, and average specific gravity of the products. Who doesn't?

I appreciate the warm regards from Jud Linville. But his email inspires me to use other credit cards instead of my Citi card.

If they want to inspire me to use their credit card, then they should do something for me. It should require no effort from me. They should call within an hour of my purchase to say, “Item ABC is available for $X cheaper at store XYZ which is within 5 miles of your purchase.” They wouldn’t have to refund me a cent. But they would let me know that I could return my item and buy it cheaper somewhere else. That would be putting big data to a practical use which helps me instead of giving me useless, legalistic, nearly-impossible-to-use delayed benefits.

16 September 2013

Listen on port 80

Filed under: Linux — mdahlman @ 11:44

Problem

I have an application server running on port 8080. I want it to listen on port 80. In my case it was Tomcat, but this applies to any application server.

I know this is a somewhat common problem. I get lots of Google hits on it. But I have found the answers to be surprisingly non-great. They often assume a set of knowledge that doesn't match my own. They [probably] tell me everything I need to know, but they tell me a lot more as well; that's not better, because it makes it hard to find what's really important. This iptables answer on serverfault.com was really quite good, but it offers a little too much detail without firm enough guidance about what the best and simplest solution is. I want just one perfect answer if I can find it.

Answer

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8080

It’s that easy. Now your app server can continue to run on port 8080. If port 8080 is open to the outside world then you’re free to connect directly to it… but you can also connect on the traditional port 80.

But… you'll lose the change if your machine reboots, so there's one more step. On Amazon Linux I used the following; it should be fine on CentOS, RHEL, etc.

sudo service iptables save

On Ubuntu I found it easiest to persist the change like this:

sudo apt-get install iptables-persistent
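Either way, it's worth knowing how to confirm the rule is in place and how to undo it later; here's a quick sketch using the same iptables rule as above:

# List the NAT PREROUTING rules, numbered, to confirm the redirect exists:
sudo iptables -t nat -L PREROUTING -n --line-numbers

# Remove it by its number from that listing...
sudo iptables -t nat -D PREROUTING 1

# ...or by repeating the original rule with -D (delete) instead of -A (append):
sudo iptables -t nat -D PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8080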

Alternative Answers

There are, of course, an infinite number of alternatives. I'm more interested in having one easy-to-understand solution than in having lots of alternatives. But sometimes it's useful to consider the alternatives explicitly… even if it's only to mock and ridicule them afterwards.

Run your app server on port 80. I declare this to be a bad solution. But hey, maybe you’ve got a valid use case for this. We tracked down how to do it in the past. I found it to be difficult (grabbing those ports below 1024 is intended to be tough), and I found it to have bad side effects (some things broke on upgrades). The side effects were surely our own fault… but the ‘iptables’ solution above is much less prone to side effects. And running your application server as root in order to access port 80 opens security issues as well.

Run a web server on port 80 in front of the application server and route requests to the application server as appropriate. This is a fine solution. In fact, it’s vastly better in a bunch of ways. I have used it myself several times. It’s just overkill for many needs. Administering httpd isn’t so difficult… but it’s harder than not administering httpd.

Edit the file /etc/sysconfig/iptables manually. Yuck. Sure… you could… but why? The command ‘iptables’ exists to make your life easier. Let it.
