Right Outer Join

5 December 2013

Copy files between s3 buckets

Filed under: AWS, Linux — mdahlman @ 15:06

The problem

I needed to copy files between Amazon AWS S3 buckets. This should be easy. Right?

To be clear, I wanted the equivalent of this:

cp s3://sourceBucket/file_prefix* s3://targetBucket/

The solution (short version)

No, it’s not easy.

Or rather, in the end it turned out to be pretty easy; but it was entirely unintuitive.

s3cmd cp --recursive --exclude=* --include=file_prefix* s3://sourceBucket/ s3://targetBucket/

The explanation (long version)

Get s3cmd

The best command line utility for working with S3 is s3cmd. You can get it from s3tools.org. If you’re on Amazon Linux (or CentOS or RHEL, etc) then this is the easiest way to install it.

# Note that s3tools.repo is not yet in your list of repositories:
ls /etc/yum.repos.d/
# Put s3tools.repo in your list of repositories like this:
sudo wget http://s3tools.org/repo/RHEL_6/s3tools.repo -O /etc/yum.repos.d/s3tools.repo
# Confirm that you did it correctly:
ls /etc/yum.repos.d/

# Install s3cmd:
sudo yum install s3cmd

# Configure s3cmd:
s3cmd --configure

False start 1

s3cmd has a copy command, “cp”. Try that:

# This should do the trick:
s3cmd cp s3://sourceBucket/file_prefix* s3://targetBucket/

One file copies successfully… but then it crashes:

File s3://sourceBucket/file_prefix_name1.txt copied to s3://targetBucket/file_prefix_name1.txt

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
  Please report the following lines to:
   s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Problem: KeyError: 'dest_name'
S3cmd:   1.0.0

Traceback (most recent call last):
  File "/usr/bin/s3cmd", line 2006, in 
    main()
  File "/usr/bin/s3cmd", line 1950, in main
    cmd_func(args)
  File "/usr/bin/s3cmd", line 614, in cmd_cp
    subcmd_cp_mv(args, s3.object_copy, "copy", "File %(src)s copied to %(dst)s")
  File "/usr/bin/s3cmd", line 604, in subcmd_cp_mv
    dst_uri = S3Uri(item['dest_name'])
KeyError: 'dest_name'

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    An unexpected error has occurred.
    Please report the above lines to:
   s3tools-bugs@lists.sourceforge.net
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Argh!! This stackoverflow answer confirms that s3cmd cp cannot handle this. (It is wrong, but for a long time I believed it.)

False start 2

This stackoverflow answer suggests “sync” as the command to use.

It is correct. But sync is not the same as copy, so it has bad side effects if what you really want to achieve is copying files. For example, sync will remove files in the target folder (to keep things in sync, duh), so syncing from source1 and source2 into a single target will cause grief. For copying all files from one location to another it’s great; but I only wanted to copy some files, and I did not want any of sync’s side effects.
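
For reference, the suggested invocation is along these lines (a sketch of what that answer proposes; I ended up not using it):

s3cmd sync s3://sourceBucket/ s3://targetBucket/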

Bad alternatives

You can write your own script using boto and python or muck around with awk and getting lists of files to copy one-by-one. In principle these will work, but yuck.
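
For the curious, the one-file-at-a-time variant is roughly this shell sketch. It assumes the object URI is the fourth column of s3cmd ls output and names each destination object explicitly; I never actually ran it, so treat it as an illustration of the “yuck” rather than a recipe:

# List the bucket, keep the URI column, filter on the prefix, copy one object at a time.
s3cmd ls s3://sourceBucket/ \
  | awk '{print $4}' \
  | grep '^s3://sourceBucket/file_prefix' \
  | while read -r uri; do
      s3cmd cp "$uri" "s3://targetBucket/$(basename "$uri")"
    done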

You could download the files from s3 then put them back up into the intended target bucket. This is a terrible solution. It will succeed… but what a waste of time and bandwidth. What makes it so tempting is that s3cmd works exactly like you want it to work with “get” and “put”.

s3cmd put /localDirectory/file_prefix* s3://targetBucket/
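
For the record, “get” is less symmetric than it looks: the shell expands the local wildcard for put, but it cannot expand a remote wildcard, so pulling the subset back down would presumably need the same recursive/include trick (I have not verified this):

s3cmd get --recursive --exclude='*' --include='file_prefix*' s3://sourceBucket/ /localDirectory/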

If “put” is so easy, why is “cp” so hard?

Enlightenment

I studied the s3cmd options over and over. Eventually I realized that “cp” has more flexibility if you look deep enough.

  • --recursive
    In my mind, my requirement is clearly not recursive. I simply want multiple files. But recursive in this context just tells s3cmd cp to handle multiple files. Great.
  • --exclude
    It’s an odd way to think of the problem. Begin by recursively selecting all files. Next, exclude all files. Wait, what?
  • --include
    Now we’re talking. Indicate the file prefix (or suffix or whatever pattern) that you want to include. (A variation with multiple patterns is sketched just after this list.)
  • s3://sourceBucket/  s3://targetBucket/
    This part is intuitive enough. Though technically it seems to violate the documented example from s3cmd help which indicates that a source object must be specified:
    s3cmd cp s3://BUCKET1/OBJECT1 s3://BUCKET2[/OBJECT2]
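
Once the exclude-then-include model clicks, it seems like it should extend naturally. For instance, this variation (untested on my part; other_prefix is just a made-up second pattern) ought to grab two different prefixes in one pass:

s3cmd cp --recursive --exclude='*' --include='file_prefix*' --include='other_prefix*' s3://sourceBucket/ s3://targetBucket/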

I posted a brief version of my answer to the most elegant of technical websites. You should vote it up. But that didn’t seem like the best place to elaborate on the answer as I’ve done here.

Postscript

Amazon offers a command line interface (CLI) tool to do the same thing: the AWS Command Line Interface. I swear that I looked extensively and repeatedly for exactly this, saying, “I just can’t believe that Amazon wouldn’t have this by now.” Well, they do. I have no idea why I could not find it, but I’m mentioning it here for my own future reference and for anyone else who is using s3cmd as an alternative to the Amazon utility that they couldn’t find.

I have no idea if the Amazon CLI is [ better | worse | different ] than s3cmd in any interesting way regarding S3. (It’s certainly different in the respect that it interacts with many other AWS services besides S3.) If I ever need to compare them, then I’ll write it up.
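
For the record, the equivalent AWS CLI invocation appears to have much the same shape (I have not compared the two beyond this):

aws s3 cp s3://sourceBucket/ s3://targetBucket/ --recursive --exclude "*" --include "file_prefix*"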

 


16 September 2013

Listen on port 80

Filed under: Linux — mdahlman @ 11:44

Problem

I have an application server running on port 8080. I want it to listen on port 80. In my case it was Tomcat, but this applies to any application server.

I know this is a somewhat common problem. I get lots of Google hits on it. But I have found that the answers are surprisingly non-great. They often assume a set of knowledge that doesn’t match my own. They [probably] tell me everything I need to know, but they tell me a lot more as well. This is not better; it’s hard to find what’s really important. This iptables answer on serverfault.com was really quite good. But it offers a little too much detail without offering firm enough guidance about what the best and simplest solution is. I want just one perfect answer if I can find it.

Answer

sudo iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8080

It’s that easy. Now your app server can continue to run on port 8080. If port 8080 is open to the outside world then you’re free to connect directly to it… but you can also connect on the traditional port 80.
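
One caveat worth knowing: PREROUTING only sees packets arriving from outside the machine, so a quick curl against localhost will bypass the redirect. If you want locally generated requests to port 80 redirected as well, a rule along these lines should do it (I have not needed it myself, so treat it as a sketch):

sudo iptables -t nat -A OUTPUT -o lo -p tcp --dport 80 -j REDIRECT --to-port 8080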

But… you’ll lose the change if your machine reboots. So there’s one more step. On Amazon Linux I used the following. It should be fine on CentOS, RHEL, etc.

sudo service iptables save

On Ubuntu I found it easiest to persist the change like this:

sudo apt-get install iptables-persistent
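
As I recall, the package asks during installation whether to save the rules that are currently loaded. If you change the rules later, re-running its configuration should let you save them again (an assumption on my part; check your version):

sudo dpkg-reconfigure iptables-persistent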

Alternative Answers

There are, of course, an infinite number of alternatives. I’m more interested in having one easy-to-understand solution than having lots of alternatives. But sometimes it’s useful to consider the alternatives explicitly… even if it’s only to mock and ridicule them afterwards.

Run your app server on port 80. I declare this to be a bad solution. But hey, maybe you’ve got a valid use case for this. We tracked down how to do it in the past. I found it to be difficult (grabbing those ports below 1024 is intended to be tough), and I found it to have bad side effects (some things broke on upgrades). The side effects were surely our own fault… but the ‘iptables’ solution above is much less prone to side effects. And running your application server as root in order to access port 80 opens security issues as well.
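
For completeness, the usual Linux trick for letting a non-root process bind a port below 1024 is the capabilities route; something like the following illustrates the idea, though it is not necessarily what we did back then, and applying it to the JVM has its own wrinkles:

# Let the real java binary bind low ports without running as root (resolve symlinks first).
sudo setcap 'cap_net_bind_service=+ep' "$(readlink -f "$(which java)")"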

Run a web server on port 80 in front of the application server and route requests to the application server as appropriate. This is a fine solution. In fact, it’s vastly better in a bunch of ways. I have used it myself several times. It’s just overkill for many needs. Administering httpd isn’t so difficult… but it’s harder than not administering httpd.
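
For reference, the heart of that setup with Apache httpd is only a few lines; this sketch assumes a RHEL-style layout with mod_proxy and mod_proxy_http already loaded, and the filename is just illustrative:

# Drop in a minimal reverse-proxy config and restart httpd.
cat <<'EOF' | sudo tee /etc/httpd/conf.d/tomcat-proxy.conf
<VirtualHost *:80>
    ProxyPass        / http://localhost:8080/
    ProxyPassReverse / http://localhost:8080/
</VirtualHost>
EOF
sudo service httpd restart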

Edit the file /etc/sysconfig/iptables manually. Yuck. Sure… you could… but why? The command ‘iptables’ exists to make your life easier. Let it.
