User Defined Load Libraries in Vertica 6

Vertica 6 introduces the concept of UDLs, User Defined Load libraries. These libraries are C++ routines created with the Vertica API allowing you to extend the functionality of the COPY (bulk loading) statement. Traditionally, the COPY statement loads data from text files or character streams. UDLs allow you to substitute an external function call for a stream name or file location, then load the results directly into a table. This sort of functionality is useful when you want to access complex data sources such as…

  • Non-Vertica databases
  • SOAP and RESTful Web Services
  • Hadoop/MapReduce processes

Although UDLs open up a limitless world of data sources for you, as a database developer and architect like myself, you are likely looking sideways at the fact that this is all done through C++, which is not a common skill set in the database world these days. In fact, you may have trouble finding anyone in your IT organization who is skilled in C++! It is much more likely that you will either possess or have access to skills in a modern scripting language such as perl, python, or my favorite… ruby!

Vertica provides some excellent examples of extensions created using their C++ API on their public github repository…

https://github.com/vertica/Vertica-Extension-Packages

All of these libraries are free and open for you to download and use.

Within this repository is the shell_load_package UDL, developed by Adam Neering, an engineer at Vertica. This package allows you to call a shell command and return the stdout stream to your COPY statement. This command can be anything that outputs to stdout, which opens you up to using any programming or scripting language that you find to be appropriate!

I am going to step you through an example which demonstrates loading a table directly from the output of a ruby script which generates a simple, silly, sample data set!

First you need to download and install the shell_load_package extension. You will require the following in order to perform the steps:

  • git installed on the server
  • sudo privileges for dbadmin user
  • vsql must be in your path

git clone https://github.com/vertica/Vertica-Extension-Packages.git  
cd Vertica-Extension-Packages/shell\_load\_package  
sudo make  
sudo make install

In the second step of the install I was prompted for the database password several times, but everything worked just fine.

Now that the extension library is built on the server and linked into the database, we will create a simple ruby script named generate_silly_data.rb which outputs a sample data set for the number of rows passed to the script as an argument.


You will need to make sure that this script is owned by dbadmin user and is executable:

#!/usr/bin/ruby
begin
  #number of rows to generate is argument passed to script
  num_rows=ARGV[0].to_i

  #iterate through number of rows
  (1..num_rows).each do |pk|
    #generate a random string 8 characters long
    random_text=''
    (1..8).each {|x| random_text+=(rand(26)+65).chr}
    
    #output the string and pk value, pipe delimited  
    puts "#{pk}|#{random_text}"
  end
rescue=>e
  #in case of any error, output nothing  
  puts ''
end

Make it executable:

chown dbadmin /path_to/generate_silly_data.rb
chmod +x /path_to/generate_silly_data.rb

At this point you can connect using vsql (or your preferred SQL client) and execute the following statements. Be sure to use the correct path to the ruby script file, as well as the correct node name for the database server that you are using.


dbadmin=> create table public.silly\_data(pk integer,textfield varchar(8));
CREATE TABLE
dbadmin=> copy public.silly\_data with source ExternalSource(cmd='ruby/path\_to/generate\_silly\_data.rb 10', nodes='v\_dbname\_node0001');

Rows Loaded  
-------------  
10
(1 row)

dbadmin=> select * from public.silly\_data order by pk;

pk  | textfield 
----+-----------
 1  | MTYKKRHU  
 2  | TXYPWDAW  
 3  | FQBAAINP  
 4  | CGCBFSIH  
 5  | JPILSYVH  
 6  | PSJDBYUB  
 7  | JIEFTPNA  
 8  | DBFOUJNM  
 9  | MUJHILJT  
 10 | OFUAQNVC

(10 rows)

dbadmin=> drop table public.silly\_data;  
DROP TABLE

Note that this example generated and loaded 10 rows because 10 was passed as a parameter to the script.

You are not required to write scripts in order to use these extensions, as you can call any shell command you want, including pipes and redirects. If you already have skills with a command line SQL client such as SQLPlus or HenPlus, you can create an almost entirely SQL based ETL process!

I think that Vertica is on the right track UDLs, and the UDx framework in general, and I hope to see more community extensions popping up over the next year or so!


Comments

comments powered by Disqus