Forklift: Moving Big Databases Around in Ruby

data ruby taskrabbit 
2016-04-22T22:45:50.164Z - Originally posted at https://tech.taskrabbit.com/blog/2013/04/12/forklift/
↞ See all posts



What?

Forklift is a ruby gem that can help you collect, augment, and save copies of your mySQL databases. This is often called an "ETL" tool as the steps in this process mirror the actions of "Extracting the data," "Transforming the data," and finally "Loading the data" into its final place.

With Forklift, you create a Plan which describes how to manipulate your data. The process for this involves (at least) three databases:

  • Live Set
  • Working Database
  • Final Database

The "Live Set" is first loaded into the "Working Set" to create a copy of your production data we can manipulate without fear of breaking replication. Then, any transformations/manipulations are run on the data in the working set. This might include normalizing or cleaning up data which was great for production but hard for analysts to use. Finally, when all of your transformations are complete, that data is loaded into the final database.

Forklift is appropriate to use by itself or integrated within a larger project. Forklift aims to be as fast as can be by using native mySQL copy commands and eschewing all ORMs and other RAM hogs.

Features

  • Can extract data from both local and remote databases
  • Can perform integrity checks on your source data to determine if this run of Forklift should be executed
  • Can run each Extract either each run or at a frequency
  • Can run each Transform either each run or at a frequency
  • Data kept in the woking database after each run to be used on subsequent transformations
  • Only ETL’d tables will be copied into the final database, leaving other tables untouched
  • Emails sent on errors

What does TaskRabbit use this for?

At TaskRabbit, the website you see at www.taskrabbit.com is actually made up of many smaller rails applications. When analyzing our site, we need to collect all of this data into one place so we can easily join across it.

We replicate all of our databases into one server in our office, and then use Forklift to extract the data we want into a common place. This gives us the option to both look at live data and to have a more accessible transformed set which we create on a rolling basis. Our "Forklift Loop" also git-pulls to check for any new transformations before each run.

Example Annotated Plan

In Forklift, you build a plan. You can add any action to the plan in any order before you run it. You can have 0 or many actions of each type.

1require 'rubygems' 2require 'bundler' 3Bundler.require(:default) 4require 'forklift/forklift' # Be sure to have installed the gem! 5 6######### 7# SETUP # 8######### 9 10forklift = Forklift::Plan.new({ 11 12 :local_connection => { 13 :host => "localhost", 14 :username => "root", 15 :password => nil, 16 }, 17 18 :remote_connections => [ 19 { 20 :name => "remote_connection_a", 21 :host => "192.168.0.0", 22 :username => "XXX", 23 :password => "XXX", 24 }, 25 { 26 :name => "remote_connection_b", 27 :host => "192.168.0.1", 28 :username => "XXX", 29 :password => "XXX", 30 }, 31 ], 32 33 :final_database => "FINAL", 34 :working_database => "WORKING", 35 36 :do_dump? => true, 37 :dump_file => "/data/backups/dump-#{Time.new}.sql.gz", 38 39 :do_email? => true, 40 :email_to => ['XXX'], 41 :email_options => { 42 :via => :smtp, 43 :via_options => { 44 :address => 'smtp.gmail.com', 45 :port => '587', 46 :enable_starttls_auto => true, 47 :user_name => "XXX", 48 :password => "XXX", 49 :authentication => :plain, 50 } 51 } 52 53}) 54 55########## 56# CHECKS # 57########## 58 59forklift.check_local_source({ 60 :name => 'CHECK_FOR_NEW_DATA', 61 :database => 'test', 62 :query => 'select (select max(created_at) from new_table) > (select date_sub(NOW(), interval 1 day))', 63 :expected => '1' 64}) 65 66forklift.check_remote_source({ 67 :connection_name => "remote_connection_b", 68 :name => 'ANOTHER_CHECK', 69 :database => 'stuff', 70 :query => 'select count(1) from people', 71 :expected => '100' 72}) 73 74########### 75# EXTRACT # 76########### 77 78forklift.import_local_database({ 79 :database => "database_1", 80 :prefix => false, 81 :frequency => 24 * 60 * 60, 82}) 83 84forklift.import_local_database({ 85 :database => "database_2", 86 :prefix => false, 87 :only => ['table_1', 'table_2'], 88}) 89 90forklift.import_remote_database({ 91 :connection_name => 'remote_connection_a', 92 :database => "database_3", 93 :prefix => true, 94 :skip => ['schema_migrations'] 95}) 96 97############# 98# TRANSFORM # 99############# 100 101transformation_base = File.dirname(__FILE__) + "/transformations" 102 103forklift.transform_sql({ 104 :file => "#{transformation_base}/calendars/create_calendars.sql", 105 :frequency => 24 * 60 * 60, 106}) 107 108forklift.transform_ruby({ 109 :file => "#{transformation_base}/test/test.rb", 110}) 111 112####### 113# RUN # 114####### 115 116forklift.run 117

Workflow

1def run 2 lock_pidfile # Ensure that only one instance of Forklift is running 3 rebuild_working_database # Ensure that the working database exists 4 ensure_forklift_data_table # Ensure that the metadata table for forklift exists (used for frequency calculations) 5 6 run_checks # Preform any data integrity checks 7 run_extractions # Extact data from the life databases into the working database 8 run_transformations # Perform any transformations 9 run_load # Load the manipulated data into the final database 10 11 save_dump # mySQLdump the new final database for safe keeping 12 send_email # Email folks the status of this forklift 13 unlock_pidfile # Clean up the pidfile so I can run next time 14end 15

Transformations

Forklift allows you to create both Ruby transformations and SQL transformations

1class Test 2 def transform(connection, database, logger) 3 logger.log "Running on DB: #{database}" 4 logger.log "Counting users..." 5 connection.q("USE `#{database}`") 6 users_count = connection.q("count(1) as 'users_count' from `users`") 7 logger.log("There were #{users_count} users") 8 end 9end

Ruby Transformations

  • SQL Transformations are kept in a file ending in .rb
  • Ruby Transformations should define a class which matches the name of the file (IE: class MyTransformation would be in a file called my_transformation.rb
  • logger.log(message) is the best way to log but logger.debug is also available
  • database is a string containing the name of the working database
  • connection is an instance of Forklift::Connection and connection.connection is a raw mysql2 connection
  • Classes need to define a transform(connection, database, logger) IE:

SQL Transformations

  • SQL Transformations are kept in a file ending in .sql
  • You can have many SQL statements per file
  • SQL will be executed linearly as it is written in the file

SQL Transformations can be used to generate new tables like this as well

Defaults

The defaults for a new Forklift::Plan are:

1 1 { 2 2 :project_root => Dir.pwd, 3 3 :lock_with_pid? => true, 4 4 5 5 :final_database => {}, 6 6 :local_database => {}, 7 7 :forklift_data_table => '_forklift', 8 8 9 9 :verbose? => true, 1010 1111 :do_checks? => true, 1212 :do_extract? => true, 1313 :do_transform? => true, 1414 :do_load? => true, 1515 :do_email? => false, 1616 :do_dump? => false, 1717 }

Methods

Test

1 1 forklift.check_local_source({ 2 2 :name => STRING, # A name for the test 3 3 :database => STRING, # The Database to test 4 4 :query => STRING, # The Query to Run. Needs to return only 1 row with 1 value 5 5 :expected => STRING # The response to compare against 6 6 }) 7 7 8 8 forklift.check_remote_source({ 9 9 :connection_name => STRING, # The name of the remote_connection 1010 :name => STRING, # A name for the test 1111 :database => STRING, # The Database to test 1212 :query => STRING, # The Query to Run. Needs to return only 1 row with 1 value 1313 :expected => STRING # The response to compare against 1414 })

Extract

1 1 forklift.import_local_database({ 2 2 :database => STRING, # The Database to Extract 3 3 :prefix => BOOLEAN, # Should we prefix the names of all tables in this database when imported wight the database? 4 4 :frequency => INTEGER (seconds), # How often should we import this database? 5 5 :skip => ARRAY OR STRINGS # A list of tables to ignore and not import 6 6 :only => ARRAY OR STRINGS # A list of tables to ignore and not import (use :only or :skip, not both) 7 7 }) 8 8 9 9 forklift.import_remote_database({ 1010 :connection_name => STRING, # The name of the remote_connection 1111 :database => STRING, # The Database to Extract 1212 :prefix => BOOLEAN, # Should we prefix the names of all tables in this database when imported wight the database? 1313 :frequency => INTEGER (seconds), # How often should we import this database? 1414 :skip => ARRAY OR STRINGS # A list of tables to ignore and not import 1515 :only => ARRAY OR STRINGS # A list of tables to ignore and not import (use :only or :skip, not both) 1616 }) 17

Transform

11 forklift.transform_sql({ 22 :file => STRING, # The transformation file to run 33 :frequency => INTEGER (seconds), # How often should we run this transformation? 44 }) 55 66 forklift.transform_ruby({ 77 :file => STRING, # The transformation file to run 88 :frequency => INTEGER (seconds), # How often should we run this transformation? 99 })

Debug

You can launch forklift in "debug mode" with — debug (we check ARGV[" — debug"] and ARGV["-debug"]). In debug mode the following will happen: — verbose = true — no SQL will be run (extract, load) — no transforms will be run — no email will be sent — no mySQL dumps will be created

Options & Notes

  • email_options is a hash consumed by the Pony mail gem
  • Forklift’s logger is Lumberjack with a wrapper to also echo the log lines to stdout and save them to an array to be accessed later by the email system.
  • The connections hash will be passed directly to a mysql2 connection. Follow the link to see all the available options.

Limitations

  • mySQL only (the mysql2 gem specifically)

Forklift is available now. Enjoy!

taskrabbit/forklift

Hi, I'm Evan

I write about Technology, Software, and Startups. I use my Product Management, Software Engineering, and Leadership skills to build teams that create world-class digital products.

Get in touch