Monday, November 24, 2014

A *NIX Use Case

Gist of this post with nicer formatting: https://gist.github.com/phette23/a71248765c0f0cfeddd7


Almost immediately after declaring a hiatus seems like a great time for a blog post.
Inspired by nina de jesus and Ruth Tillman's libtech level up project, here's something on the value of command-line text processing. Some of these common UNIX tools that have been around since practically the 1980s are great for the sort of data wrangling that many librarians find themselves doing, whether their responsibilities lie with systems, the web, metadata, or other areas. But the command prompt has a learning curve and if you already use text editor tools to accomplish some tasks, it might be tough to see why you should invest in learning. Here's one case I've found.
Scenario: our digital repository needs to maintain several vocabularies of faculty who teach in different departments. That information is, of course, within a siloed vendor product that has no viable APIs. I'm only able to export CSVs that looks like this:
"Namerer, Name","username" "Othernamerer, Othername", "anotherusername"
But to import them into our repository I need to clean up the data a little and put it into a slightly different format:
"Namerer, Name","facultyID","username" "Othernamerer, Othername","facultyID","anotherusername"
This single-line shell script is all I need:
#!/usr/bin/env bash

cat $1 | sort | uniq | sed -e '/"STANDBY",""/d' -e 's|, Staff"|"|' -e 's|, "|"|' -e 's|","|","facultyID","|'
Let's walk through the script. To make it, I put the above text in a file, named it something like "fac-csv.sh", and made it executable by running chmod +x fac-csv.sh. I won't go into permssions but chmod +x, and the paragraph below, aren't even strictly necessary, since one can type bash fac-csv.sh to run the script anyways.
#!/usr/bin/env bash tells the operating system what program to execute the script with. A lot of scripts list a path direct to the program, e.g. #!/usr/bin/python (for a python script) or #!/bin/sh (for a shell script). Using #!/usr/bin/env is just a bit more portable across systems; the env command looks in the *env*ironment for a given program, searching several possible locations, so if someone on a different system (one where the shell is in, say, /usr/bin/local/bash) executes the script it'll still work.
cat $1 prints out the full text file I want to operate on (a CSV, in this case) so I can start piping it through the processing steps. On the command line, I run this script like fac-csv.sh filename.csv and filename.csv becomes $1 (the first positional parameter) inside the script.
The pipes ("|") separating each command chain them together, making the input of one command the output of the last. This is perhaps the most powerful part of UNIX since it means almost arbitrarily complex operations can be composed of smaller ones.
sort takes the CSV, which might be in any order, and sorts the lines alphabetically.
uniq takes duplicate adjacent lines and removes them, thus only *uniq*ue lines are left. This step wouldn't work without the sort prior.
sed stands for stream editor, it takes the text passed to it and performs a series of edits, each edit is specified with an -e flag. We've already deduplicated the file, sed cleans it up. sed has a lot of edit types but I'm only using two; delete line and substitute.
'/"STANDBY",""/d' is a delete line command, which looks like /pattern/d. So here I'm saying "delete all lines that match the pattern "STANDBY","" since "STANDBY" is an artifact of our data system and not a faculty name we need to be recording.
The substitute commands look like: 1) the letter "s", 2) a delimiter (I've used "|" but other common choices include colons or forward slashes, in general you just want a separator that won't appear in your pattern since that complicates things), 3) a pattern to substitute, and 4) want to substitute for the pattern.
's|, Staff"|"|' finds , Staff" and deletes the comma-space-Staff part (note the quotation mark is retained).
's|, "|"|' finds , " and deletes the comma-space, leaving the quotation mark again. This and the step above clean up entries like "Sname, Gname, Staff","sgname, " => "Sname, Gname","sgname"
's|","|","facultyID","|' adds in a second "facultyID" value in each CSV row, which our repository needs for reasons.
In the end I've: deduplicated the export, deleted useless lines, and cleaned up messy lines. I find occaisions to run this script or a slight modification of it weekly. Doing the same steps in a text editor would be far more time-consuming and error prone (since I might forget one, not do them in right order, etc.).
Maybe this came out Greek, if so I apologize. It took me a long time to learn about all these steps, in particular sed has caused me much trouble. But now I'm able to write these quick, one-line scripts that automate what would've been several steps in a text editor.

No comments:

Post a Comment