A complete guide to backing up and restoring big tables in DynamoDB
In short, the company I work for depends on real-time data and has no tolerance for high latency. The lower the latency, the better the results we can provide to our customers. After Amazon launched the Frankfurt region (eu-central-1), we decided to move from Ireland (eu-west-1) to Frankfurt, which has much lower latency than Ireland when you ping both regions from Istanbul.
I was looking for a practical and easy way to move our tables, but after trying several options (like Data Pipeline) and getting tired of their complexity, I turned to open-source projects to do the job.
After some research I found this awesome little batch tool that works with boto: https://github.com/bchew/dynamodump
I gave it a shot and it worked nearly perfectly. You can run the tool from your local machine or from an EC2 instance, depending on your connectivity and data size, of course. I had issues with tables that hold a large number of items, but there is a workaround for that: you need to fine-tune the script to work with your dataset.
The default configuration is as follows (inside dynamodump.py):
JSON_INDENT = 2
AWS_SLEEP_INTERVAL = 10 # seconds
LOCAL_SLEEP_INTERVAL = 1 # seconds
MAX_BATCH_WRITE = 25 # DynamoDB limit
SCHEMA_FILE = "schema.json"
DATA_DIR = "data"
MAX_RETRY = 6
LOCAL_REGION = "local"
LOG_LEVEL = "INFO"
DATA_DUMP = "dump"
RESTORE_WRITE_CAPACITY = 25
THREAD_START_DELAY = 1 # seconds
CURRENT_WORKING_DIR = os.getcwd()
DEFAULT_PREFIX_SEPARATOR = "-"
These are the values I needed to change:
JSON_INDENT = 4
LOCAL_SLEEP_INTERVAL = 3
MAX_BATCH_WRITE = 10
MAX_RETRY = 10
RESTORE_WRITE_CAPACITY = 10
The reason I changed these is that the script starts throwing errors when it tries to consume more write capacity units per second than the table has provisioned.
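To make the throttling behaviour more concrete, here is a minimal sketch of a batch write that retries whatever DynamoDB hands back as unprocessed, waiting longer after each attempt. This is not dynamodump's own code. The function name `write_with_backoff` and its parameters are made up for illustration, and `client` is assumed to be a boto3 DynamoDB client.

```python
import time


def write_with_backoff(client, table_name, items, max_retry=10,
                       base_sleep=1.0, sleep=time.sleep):
    """Write one batch (at most 25 items) and retry the rejected ones.

    `client` is assumed to be boto3.client("dynamodb"); `items` must already
    be in DynamoDB's attribute-value format, e.g. {"id": {"S": "1"}}.
    Returns the number of batch_write_item calls that were needed.
    """
    pending = {table_name: [{"PutRequest": {"Item": item}} for item in items]}
    for attempt in range(max_retry + 1):
        if attempt:
            # Wait 1s, 2s, 4s, ... before each retry.
            sleep(base_sleep * 2 ** (attempt - 1))
        response = client.batch_write_item(RequestItems=pending)
        # Requests that did not go through come back as UnprocessedItems.
        pending = response.get("UnprocessedItems") or {}
        if not pending:
            return attempt + 1
    raise RuntimeError("items still unprocessed after %d retries" % max_retry)
```

Raising `max_retry` and the sleep interval, as in the changed config values, gives a slow table more time to absorb each batch before the writer gives up.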
There are two workarounds here. The first is the fine-tuning above; with this config it takes around 30 seconds to insert 14,000 items. But if you have more than 1,000,000 items and you want the restore operation to complete in less than 5 minutes, you need to temporarily scale up the write capacity units of each table. After the restore succeeds, you can scale the write capacity units back down to their normal values.
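As a rough sketch of the second workaround, the snippet below estimates how many write capacity units a restore needs and then changes a table's write capacity through boto3. It assumes the table uses provisioned capacity mode and relies on the rule that one write capacity unit covers one write per second of an item up to 1 KB. The helper names `required_wcu` and `set_write_capacity` are hypothetical, and `client` is assumed to be a boto3 DynamoDB client.

```python
import math


def required_wcu(item_count, seconds, item_size_kb=1.0):
    """Estimate the write capacity units needed to insert item_count items
    within the given number of seconds (items are rounded up to whole KB)."""
    units_per_item = max(1, math.ceil(item_size_kb))
    return math.ceil(item_count * units_per_item / seconds)


def set_write_capacity(client, table_name, write_units):
    """Change only the write capacity and keep the current read capacity.

    Returns the previous write capacity so it can be restored afterwards.
    """
    table = client.describe_table(TableName=table_name)["Table"]
    current = table["ProvisionedThroughput"]
    client.update_table(
        TableName=table_name,
        ProvisionedThroughput={
            "ReadCapacityUnits": current["ReadCapacityUnits"],
            "WriteCapacityUnits": write_units,
        },
    )
    # Block until the table reports ACTIVE again after the update.
    client.get_waiter("table_exists").wait(TableName=table_name)
    return current["WriteCapacityUnits"]


# Possible usage (needs boto3 and valid AWS credentials):
#   import boto3
#   client = boto3.client("dynamodb", region_name="eu-west-2")
#   old = set_write_capacity(client, "TableName", required_wcu(1_000_000, 300))
#   ... run the restore ...
#   set_write_capacity(client, "TableName", old)
```

The estimate is a lower bound: it ignores retries and any indexes on the table, so it is worth leaving some headroom.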
Back it up!
Easy. First, you need to create a user with an AWS access key (in IAM) that has full access to DynamoDB. There is an AWS managed policy called AmazonDynamoDBFullAccess, and you can use it for the entire operation.
To back up every table in the Ireland region:
python dynamodump.py -m backup -r eu-west-1 -s "*" --accessKey AWS_ACCESS_KEY --secretKey AWS_SECRET_KEY
To back up a single table in the Frankfurt region (replace TableName with your table name):
python dynamodump.py -m backup -r eu-central-1 -s TableName --accessKey AWS_ACCESS_KEY --secretKey AWS_SECRET_KEY
Restore by any means necessary.
One thing you should be careful about is creating the table schema (an empty table with its structure) before you start inserting data; otherwise the script might fail during the sequential operations.
To restore every table you have to the London (eu-west-2) region:
python dynamodump.py -m restore -r eu-west-2 -s "*" --accessKey AWS_ACCESS_KEY --secretKey AWS_SECRET_KEY --schemaOnly
python dynamodump.py -m restore -r eu-west-2 -s "*" --accessKey AWS_ACCESS_KEY --secretKey AWS_SECRET_KEY --dataOnly
Or, if you’d like to restore a single table (replace TableName with your table name):
python dynamodump.py -m restore -r eu-west-2 -s TableName --accessKey AWS_ACCESS_KEY --secretKey AWS_SECRET_KEY --schemaOnly
python dynamodump.py -m restore -r eu-west-2 -s TableName --accessKey AWS_ACCESS_KEY --secretKey AWS_SECRET_KEY --dataOnly
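If you would rather not type the two restore steps by hand, a small wrapper can run them in the right order and stop as soon as one fails. This is only a sketch: the function names are made up, it assumes dynamodump.py sits in the current directory, and it uses only the flags shown in the commands above.

```python
import subprocess
import sys


def restore_commands(region, table, access_key, secret_key):
    """Build the schema-only and data-only restore commands, in that order."""
    base = [
        sys.executable, "dynamodump.py",
        "-m", "restore",
        "-r", region,
        "-s", table,
        "--accessKey", access_key,
        "--secretKey", secret_key,
    ]
    return [base + ["--schemaOnly"], base + ["--dataOnly"]]


def restore(region, table, access_key, secret_key):
    """Run both steps; check=True raises if a step exits with an error,
    so the data step never runs after a failed schema step."""
    for command in restore_commands(region, table, access_key, secret_key):
        subprocess.run(command, check=True)


# Possible usage:
#   restore("eu-west-2", "*", "AWS_ACCESS_KEY", "AWS_SECRET_KEY")
```

Because the arguments are passed as a list and no shell is involved, the `*` reaches the script as-is without any quoting.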
The data you backed up is stored in a folder called dump, in the directory where the dynamodump.py script is located. Table schema information is kept in schema.json, and the data for each table is stored in partitions (like 0001.json, 0002.json, etc.) if the table has a large number of items.
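Before restoring, it can be handy to check what a backup actually contains. The sketch below counts the items per table in the dump folder. It assumes a layout of dump/<TableName>/schema.json plus dump/<TableName>/data/0001.json and so on, with each data file being a JSON object that has an "Items" list. That layout is an assumption here, so check it against your own dump first; the function name `summarize_dump` is made up.

```python
import json
from pathlib import Path


def summarize_dump(dump_dir="dump"):
    """Return {table_name: item_count} for every table folder in dump_dir.

    Assumed layout: <dump_dir>/<table>/schema.json and
    <dump_dir>/<table>/data/*.json, each data file holding {"Items": [...]}.
    """
    summary = {}
    for table_dir in sorted(Path(dump_dir).iterdir()):
        # Skip anything that does not look like a table backup.
        if not (table_dir / "schema.json").is_file():
            continue
        count = 0
        for part in sorted((table_dir / "data").glob("*.json")):
            with part.open() as handle:
                count += len(json.load(handle).get("Items", []))
        summary[table_dir.name] = count
    return summary


# Possible usage:
#   for table, count in summarize_dump("dump").items():
#       print(table, count)
```

Comparing these counts with the item counts of the source tables is a cheap sanity check before you point the restore at a new region.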
The operation will take anywhere from 30 seconds to 30 minutes, depending on the amount of data in the tables. If you are only taking backups, you can do it on the fly, but if you are planning to migrate to another region like us, you should consider scheduling downtime during off-hours. To avoid downtime you may use DynamoDB Streams, but this will require a lot of research time. There are several GitHub projects built on this as well.
That basically sums everything up.