Extracting Government Spending Data via Talend and Ruby into CouchDB

Note: To make it easy for anyone to get access to this data I chatted with the good folks at Amazon to expose all the data (both couchdb files and csv) as a public data set! I will also share an ami that you can spin up and use to start looking at this data almost immediately. Look for a post on that soon.

I love working with lots of data! So with 13 digit stimulus numbers being bandied about, I started looking into what it would take to analyze the government spending numbers. The data would make a great addition to our test data sets used to stress test our hosting environment and evaluate new technologies.

Since the data for the stimulus has not been released, I decided to start with getting the USASpending.gov data, which is the data in the federal procurement data system (fpds) that tracks all the contracts and associated transactions. You can query the data on the website, but if you wanted to analyze across all the fields you are restricted to the interface on the website.

I must give kudos to the team behind USASpending.gov, it does have a lot of capabilities and for straightforward reports, you can generate them fairly easily on the website. On the other hand if you are a data junkie like me, and you want to REALLY analyze the data, then you have no choice but to download the data in csv or xml format.

The site offers downloads in different levels of detail:

  • Summary: General information such as the totals, Top-N etc.
  • Low: List of contractors
  • Medium: Contractor profiles
  • High: List of transactions
  • Extensive: Detailed info on each transaction
  • Complete: All information

I wanted to get the Complete level of detail, but the site (rightly) restricts the download to 500 rows per call and only via xml. They also ask that you dont repeatedly call the api to get all the rows, putting significant strain on their servers.

I figured the least invasive way would be to get a comma separated list of transactions, then use that to pull individual contracts using the Federal ID number for the contract (PIID). I did this in two steps, first I broke the query criteria up by government agency and year and automated the download of the csv's by building the following Talend flow:



The flow takes a list of agencys and fiscal years and downloads a csv file for each combination. With this download complete, I tried to extract the year and PIID, so I could download the individual contracts. Unfortunately on analyzing the data in here I found that there is a big data issue. There is a text field - Contract Description, which has all sorts of characters, including double quotes and commas, that completely futzes with the csv extract format. To compensate I created a flow that using the year column as a filter, uploads good data to an oracle database, and the bad rows to a reject file. I had the data in the reject file fixed. At this point I had a table with the fiscal year and piids.



Now how to extract the complete level of detail (in xml) ? To extract the data, I could use Talend to download the individual files, but we are talking millions of documents. Where should I store this data ? I could generate massive xml documents and store it in an oracle database or on the filesystem. After some deliberation, I decided to use couchdb as the storage mechanism for a few reasons:

  1. It can store the data in its original xml format
  2. It can store the data in json
  3. I can query and filter the data once loaded
  4. Easier to share
  5. Um.. it would be the uber geeky cool thing to do (ok, that was the only reason)


I wrote a ruby program to download the xml files, iterating thru the list of PIIDs, convert the file to json (using CobravsMongoose) and load it into couchdb. I also didn't like the badgerfish convention json that CvM generated, so I wrote another transformation script to flatten out the json. Each document contains the original xml as an attribute.

At the end of spinning up a few instances on aws, I had the scripts cranking, downloading 4 years of data in parallel. If anybody from USAspending is reading this - I apologize if I had an impact on your machines. I am hoping by filtering on a specific PIID and year, I wasn't putting a huge amount of load on the server for each call - though I made millions of calls ;)

Some observations:
1. CouchDB did not like parallel bulk loads, so I loaded each year on its own instance, and then consolidated the databases later
2. It makes sense to create the views in advance

I am now the proud caretaker of 10 years of spending data on an EBS volume (132 GBs - probably will be half that if I drop the xml):


and 7.9 M contract documents that include the original xml formatted record:



Next Steps

  • VALIDATE the data
  • Setup full-text search using couchdb-lucene
  • Repeat this process with the stimulus data as soon as it is available
  • Stress test our hosted BI environment, using this data and cubes in Essbase
  • Test visualization and Reporting using Tableau and other reporting toolsets
  • Load data into various bigdata tools like Vertica, InfoBright and SimpleDB
  • Transform this data to be more document centric (for eg. convert all flag fields into a single tag attribute array)
  • Figure out how to get the incremental updates into the database

I am sure this data is useful to other people as well to:

  1. Analyze Govt Spending Data
  2. Have a huge data set in couchdb to play with, test etc.
  3. Create a wiki style website around the contract data
  4. Help identify any data issues

To that end, I am looking to distribute this data to the public via Amazon's public data sets.

Comments

Hey Matt, I will do

Hey Matt, I will do something around the id once I figure out what the unique key for each document is. I tried attaching the xml, but it didnt work for me in a bulk upload. Thanks for your suggestions! Cheers R

Post new comment

The content of this field is kept private and will not be shown publicly.
CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
about us | careers | support
Copyright © Full 360 | All Rights Reserved 2008 | Legal | Privacy