Right Outer Join

8 July 2014

MDM in the Cloud (on Amazon AWS Marketplace)

Semarchy shows off its 5 star reviews as the most popular MDM solution on Amazon’s AWS Marketplace

MDM in the Cloud

One of the biggest impediments to Master Data Management (MDM) projects is that they can be hard to get started. An enterprise has lots of people and lots of groups who all stand to benefit from improved data quality, structured data governance, and systematic master data management. But the very fact that so many people stand to gain from it is also a reason why it’s slow to start. Gathering requirements and opinions from everyone takes time.

One of the best ways to get quick agreement on the scope of the first iteration of an MDM project is to generate a quick proof-of-concept or proof-of-value prototype. And one of the quickest ways to get started on an MDM prototype is to use software that's completely pre-installed and pre-configured. This can lead to better alignment about what will be possible in an MDM project, making the project more likely to succeed.

The cloud is a natural fit for this.

Amazon’s AWS Marketplace provides an environment where it’s easy to find software that’s relevant to your needs and get it launched instantly without any up-front costs. When I worked at Jaspersoft I invested quite a bit of time in making a pre-configured JasperReports Server instance available and in making it easy for people to use Business Intelligence (BI) in the cloud. It was a natural fit, especially for anyone who already had data in Amazon RDS or Redshift. The time we invested in that paid off nicely as customers flocked to it. Sales are way up; the reviews are great; and it should serve as a model and an inspiration to other vendors considering cloud offerings.

Semarchy in the Cloud

While business intelligence offerings in the cloud are legion, traditional Master Data Management vendors have been much too slow to embrace the cloud. The industry has taken baby steps. For example, Informatica purchased Data Scout and sells this as their SaaS MDM Salesforce.com plug-in solution. It’s a great utility for salesforce.com, but I don’t put it into the same class as enterprise MDM. Other SaaS MDM solutions are similar.

At Semarchy I see the cloud as an excellent vehicle for putting enterprise MDM into the hands of more users. You can have a fully functional MDM server running in an Amazon Virtual Private Cloud (VPC) in less than an hour. It’s accessible only to people from your company, and it’s ready for you to model your master data management requirements and to start fuzzy-matching and de-duplicating your data.

I expect other vendors to follow eventually. The net result will be improved solutions available to data management professionals everywhere. I’m pleased that Semarchy is leading the way.


5 December 2013

Copy files between s3 buckets

Filed under: AWS, Linux — mdahlman @ 15:06

The problem

I needed to copy files between Amazon AWS S3 buckets. This should be easy. Right?

To be clear, I wanted the equivalent of this:

cp s3://sourceBucket/file_prefix* s3://targetBucket/

The solution (short version)

No, it’s not easy.

Or rather, in the end it turned out to be pretty easy; but it was entirely unintuitive.

s3cmd cp --recursive --exclude=* --include=file_prefix* s3://sourceBucket/ s3://targetBucket/

The explanation (long version)

Get s3cmd

The best command line utility for working with S3 is s3cmd. You can get it from s3tools.org. If you’re on Amazon Linux (or CentOS or RHEL, etc.), then this is the easiest way to install it.

# Note that s3tools.repo is absent from your list of repositories:
ls /etc/yum.repos.d/
# Download s3tools.repo into your list of repositories:
sudo wget http://s3tools.org/repo/RHEL_6/s3tools.repo -O /etc/yum.repos.d/s3tools.repo
# Confirm that it's there now:
ls /etc/yum.repos.d/

# Install s3cmd:
sudo yum install s3cmd

# Configure s3cmd:
s3cmd --configure
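For reference, s3cmd --configure prompts for your AWS credentials and writes them to ~/.s3cfg. The important fields look roughly like this (placeholder values, not real keys):

```ini
[default]
access_key = AKIAEXAMPLEKEY
secret_key = exampleSecretKeyValue
use_https = True
```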

False start 1

s3cmd has a copy command, “cp”. Try that:

# This should do the trick:
s3cmd cp s3://sourceBucket/file_prefix* s3://targetBucket/

One file copies successfully… but then it crashes:

File s3://sourceBucket/file_prefix_name1.txt copied to s3://targetBucket/file_prefix_name1.txt

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
  Please report the following lines to:
   s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Problem: KeyError: 'dest_name'
S3cmd:   1.0.0

Traceback (most recent call last):
  File "/usr/bin/s3cmd", line 2006, in <module>
    main()
  File "/usr/bin/s3cmd", line 1950, in main
    cmd_func(args)
  File "/usr/bin/s3cmd", line 614, in cmd_cp
    subcmd_cp_mv(args, s3.object_copy, "copy", "File %(src)s copied to %(dst)s")
  File "/usr/bin/s3cmd", line 604, in subcmd_cp_mv
    dst_uri = S3Uri(item['dest_name'])
KeyError: 'dest_name'

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
    Please report the above lines to:
   s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Argh!! This stackoverflow answer confirms that s3cmd cp cannot handle this. (That answer is wrong, but for a long time I believed it.)

False start 2

This stackoverflow answer suggests “sync” as the command to use.

It is correct. But sync is not the same as copy, so it has bad side effects if what you really want to achieve is copying files. For example, sync can remove files from the target folder (to keep things in sync, duh). So syncing from source1 and source2 into a single target will cause grief. For copying all files from one location to another it’s great, but I wanted to copy only some files, and I did not want any of the side effects of sync.
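If you do reach for sync anyway, the --dry-run flag is worth knowing about; it shows what sync would transfer (or delete) without actually doing it. A sketch, using the same placeholder bucket names as above:

```shell
# Preview what sync would do before letting it loose on the target:
s3cmd sync --dry-run s3://sourceBucket/ s3://targetBucket/
```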

Bad alternatives

You can write your own script using boto and python or muck around with awk and getting lists of files to copy one-by-one. In principle these will work, but yuck.

You could download the files from s3 then put them back up into the intended target bucket. This is a terrible solution. It will succeed… but what a waste of time and bandwidth. What makes it so tempting is that s3cmd works exactly like you want it to work with “get” and “put”.

s3cmd put /localDirectory/file_prefix* s3://targetBucket/

If “put” is so easy, why is “cp” so hard?

Enlightenment

I studied the s3cmd options over and over. Eventually I realized that “cp” has more flexibility than it first appears.

  • --recursive
    In my mind, my requirement is clearly not recursive. I simply want multiple files. But recursive in this context just tells s3cmd cp to handle multiple files. Great.
  • --exclude
    It’s an odd way to think of the problem. Begin by recursively selecting all files. Next, exclude all files. Wait, what?
  • --include
    Now we’re talking. Indicate the file prefix (or suffix or whatever pattern) that you want to include.
  • s3://sourceBucket/  s3://targetBucket/
    This part is intuitive enough. Though technically it seems to violate the documented usage from s3cmd help, which indicates that a source object must be specified:
    s3cmd cp s3://BUCKET1/OBJECT1 s3://BUCKET2[/OBJECT2]

I posted a brief version of my answer to the most elegant of technical websites. You should vote it up. But that didn’t seem like the best place to elaborate on the answer as I’ve done here.

Postscript

Amazon offers a command line interface (CLI) tool that does the same thing: the AWS Command Line Interface. I swear that I looked extensively and repeatedly for exactly this, saying, “I just can’t believe that Amazon wouldn’t have this by now.” Well, they do. I have no idea why I could not find it, but I’m mentioning it here for my own future reference and for anyone else who is using s3cmd as an alternative to the Amazon utility that they couldn’t find.

I have no idea if the Amazon CLI is [ better | worse | different ] than s3cmd in any interesting way regarding S3. (It’s certainly different in the respect that it interacts with many other AWS services besides S3.) If I ever need to compare them, then I’ll write it up.
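For completeness, the AWS CLI command that corresponds to the s3cmd solution above uses the same exclude-everything-then-include-a-pattern trick (bucket names are placeholders, as before):

```shell
# Placeholder bucket names; mirrors the s3cmd cp solution above.
aws s3 cp s3://sourceBucket/ s3://targetBucket/ --recursive --exclude "*" --include "file_prefix*"
```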

 

6 August 2013

Oracle SQL*Loader on Amazon Linux

Filed under: Linux, Oracle — mdahlman @ 14:22

SQL*Loader on Amazon Linux

I’m using Oracle on Amazon RDS. I want to load some data into it from an EC2 instance. SQL*Loader (sqlldr) is a reasonable way to load data into Oracle. Amazon agrees. But for someone who’s a little rusty with Oracle installation procedures, it’s a bit harder to get the SQL*Loader client installed than I had hoped.

Find the Oracle Client

I didn’t think this section would need to exist. But it was harder to find than one might expect.

First, don’t be fooled into thinking Oracle Database Instant Client will be instantly useful. Or even eventually useful. It doesn’t have SQL*Loader.

11g is listed under 12c

Second, don’t be fooled into thinking Oracle’s download page for Oracle 12c includes downloads for Oracle 12c. Well… it does… but it also includes the downloads for Oracle 11g. Go figure.

Third, don’t be fooled into thinking the lack of links to anything labeled “client” is a problem. Just follow the link “See All” to get to the client downloads. Of course. It’s even explained in the improbably punctuated note below the links, “- See All, page contains unzip instructions plus Database Client, Gateways, Grid Infrastructure, more”.

The “See All” link corresponding to “Oracle Database 11g Release 2 Client (11.2.0.1.0) for Linux x86-64” got me to the correct spot:

Get the Oracle Client

With the link in hand it’s trivially easy to download the installer. Not so fast. It’s possible to download the installer to my laptop and then upload it to my EC2 instance. But that’s slow, and it’s a terrible waste of bandwidth. I want to download it directly onto the EC2 instance.

The problem is that the download page requires me to accept the license terms before the download link will work, but the EC2 instance has no GUI in which to easily do this.

A naive attempt like this will fail:

wget http://www.oracle.com/correct/download/link.zip

The issue is addressed in a blog on My Oracle Support, and I’m optimistic that the solution there works as indicated. But it was old, seemingly brittle (it failed based on locale), and it felt strangely unofficial. I don’t need to automate the process, so a quick manual process is much easier to understand.

  1. Login and accept the license agreement. (This is done while browsing from your local machine. This step is genuinely easy! Hooray!)
  2. Get the relevant cookie. This was harder than I expected. Chrome and Firefox store their cookies in a SQLite database. Various browser extensions and database clients allow you to get at them. But I found the Chrome extension Cookie.txt export to be the simplest way to get the info. Just click the button that the extension creates and copy the complete contents of the popup.
  3. Save the cookie information. On the Amazon EC2 instance create a new file called cookies.txt. Paste in the text copied in the previous step. (Details are left as an exercise for the reader. Use vi or cat or whatever. If you get stuck here feel free to post a comment.)
  4. Run wget using the new cookie file:
wget -x --load-cookies cookies.txt -O linux.x64_11gR2_client.zip http://download.oracle.com/otn/linux/oracle11g/R2/linux.x64_11gR2_client.zip
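For reference, the cookies.txt that wget expects is in the classic Netscape cookie format: a comment header plus one tab-separated line per cookie (domain, subdomain flag, path, secure flag, expiry, name, value). A sketch with made-up values — the real cookie names and values must come from the export in step 2:

```shell
# Build a sample cookies.txt in Netscape format.
# The cookie name and value below are placeholders, not real Oracle cookies.
printf '# Netscape HTTP Cookie File\n' > cookies.txt
printf '.oracle.com\tTRUE\t/\tFALSE\t2147483647\texample_cookie\texample_value\n' >> cookies.txt
cat cookies.txt
```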

Run the Oracle Client Installation

This final step sounds trivial… but once again I realized I needed a few sub-steps. I’m using Amazon Linux, which is decidedly un-GUI. I had forgotten that the Oracle Client doesn’t have a simple interactive text version. It’s all-or-nothing: silent install or GUI install.

Install x11:

sudo yum install xorg-x11-xauth
exit

Then log back in. But… don’t forget to use the -X option to enable X11 forwarding. I’m on a Mac, so this part works easily. On Windows you can do the equivalent with PuTTY, but you’ll need to look up the details.

ssh -X -i mykey.pem ec2-user@ec2-123-456-246-579.us-west-1.compute.amazonaws.com

Test that x11 will work as intended:

sudo yum install xclock
xclock

If xclock pops up, then the Oracle Client installation should be good as well:

./client/runInstaller
Oracle Client Installer

And finally, don’t forget to choose the Administrator installation type. After all, the whole point was to get SQL*Loader, and that’s the only option where it’s included.
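As an aside, if you’d rather avoid X11 entirely, the all-or-nothing silent install is the other route. A sketch, with an assumed response file path — the exact variable names should be verified against the client_install.rsp template that ships in the client/response directory of the download:

```shell
# Hypothetical silent install; edit the response file first so the
# install type is Administrator (the one that includes SQL*Loader).
./client/runInstaller -silent -responseFile /home/ec2-user/client_install.rsp
```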

Bonus Appendix

Once you have SQL*Plus installed, you’ll want rlwrap installed too. It allows you to hit the up arrow to get your command history. SQL*Plus is miserable without it. The Amazon Linux repositories do not have rlwrap, but EPEL does. So here’s how to install it with a single line:

sudo yum -y install rlwrap --enablerepo=epel

Here’s a good way to transparently launch SQL*Plus with rlwrap, giving you access to your command history.

# Add this line to .bashrc for both ec2-user and oracle:
alias sqlplus="rlwrap sqlplus"

Error Appendix

The first time I tried to run the install I got this error:

ubuntu@ip-10-48-138-63:~/wget_test/download.oracle.com/otn/linux/oracle11g/R2/client$ ./runInstaller
Starting Oracle Universal Installer...
...
>>> Ignoring required pre-requisite failures. Continuing...
Preparing to launch Oracle Universal Installer from /tmp/OraInstall2013-08-06_07-26-42PM. Please wait ...ubuntu@ip-10-48-138-63:~/wget_test/download.oracle.com/otn/linux/oracle11g/R2/client$ Exception in thread "main" java.lang.NoClassDefFoundError
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:164)

This is clearly a Java problem. It clearly has nothing to do with X11. Except… well… it does. Installing xterm or x11 (that is, installing xorg-x11-xauth as indicated above) solved it for me.
