Monday, November 24, 2014

A *NIX Use Case

Gist of this post with nicer formatting: https://gist.github.com/phette23/a71248765c0f0cfeddd7


Almost immediately after declaring a hiatus seems like a great time for a blog post.
Inspired by nina de jesus and Ruth Tillman's libtech level up project, here's something on the value of command-line text processing. Some of these common UNIX tools have been around since the 1970s, and they're great for the sort of data wrangling that many librarians find themselves doing, whether their responsibilities lie with systems, the web, metadata, or other areas. But the command prompt has a learning curve, and if you already use text editor tools to accomplish some tasks, it might be tough to see why you should invest in learning it. Here's one case I've found.
Scenario: our digital repository needs to maintain several vocabularies of faculty who teach in different departments. That information is, of course, within a siloed vendor product that has no viable APIs. I'm only able to export CSVs that look like this:
"Namerer, Name","username"
"Othernamerer, Othername", "anotherusername"
But to import them into our repository I need to clean up the data a little and put it into a slightly different format:
"Namerer, Name","facultyID","username"
"Othernamerer, Othername","facultyID","anotherusername"
This single-line shell script is all I need:
#!/usr/bin/env bash

cat $1 | sort | uniq | sed -e '/"STANDBY",""/d' -e 's|, Staff"|"|' -e 's|, "|"|' -e 's|","|","facultyID","|'
Let's walk through the script. To make it, I put the above text in a file named something like "fac-csv.sh" and made it executable by running chmod +x fac-csv.sh. I won't go into permissions, but chmod +x, and the paragraph below, aren't even strictly necessary, since one can type bash fac-csv.sh to run the script anyway.
#!/usr/bin/env bash tells the operating system what program to execute the script with. A lot of scripts list a path directly to the program, e.g. #!/usr/bin/python (for a Python script) or #!/bin/sh (for a shell script). Using #!/usr/bin/env is just a bit more portable across systems; the env command looks in the *env*ironment for a given program, searching several possible locations, so if someone on a different system (one where the shell is in, say, /usr/local/bin/bash) executes the script it'll still work.
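You can see env's resolution at work by invoking it directly; this one-liner runs whichever bash env finds first on the PATH:

```shell
# /usr/bin/env looks up "bash" on the PATH & runs it with the -c command
/usr/bin/env bash -c 'echo "running under bash"'
# prints: running under bash
```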
cat $1 prints out the full text file I want to operate on (a CSV, in this case) so I can start piping it through the processing steps. On the command line, I run this script like fac-csv.sh filename.csv and filename.csv becomes $1 (the first positional parameter) inside the script.
The pipes ("|") separating each command chain them together, making the input of one command the output of the last. This is perhaps the most powerful part of UNIX since it means almost arbitrarily complex operations can be composed of smaller ones.
sort takes the CSV, which might be in any order, and sorts the lines alphabetically.
uniq takes duplicate adjacent lines and removes them, thus only *uniq*ue lines are left. This step wouldn't work without the sort prior.
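To see sort & uniq cooperating, here's a toy run with made-up rows; note the duplicate only disappears because sorting made the two copies adjacent:

```shell
# three rows, one duplicated; sort groups the copies, uniq drops the repeat
printf '"b","2"\n"a","1"\n"b","2"\n' | sort | uniq
# prints:
# "a","1"
# "b","2"
```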
sed stands for *s*tream *ed*itor: it takes the text passed to it and performs a series of edits, each specified with an -e flag. We've already deduplicated the file; sed cleans it up. sed has a lot of edit types but I'm only using two: delete line and substitute.
'/"STANDBY",""/d' is a delete line command, which looks like /pattern/d. So here I'm deleting all lines that match the pattern "STANDBY","", since "STANDBY" is an artifact of our data system and not a faculty name we need to be recording.
The substitute commands look like: 1) the letter "s", 2) a delimiter (I've used "|" but other common choices include colons or forward slashes; in general you just want a separator that won't appear in your pattern, since escaping complicates things), 3) a pattern to match, and 4) the text to substitute for the pattern.
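Here's a toy run showing both edit types on made-up input, before we get to the real commands (the patterns here are hypothetical, not the ones from the script):

```shell
# -e '/drop/d' deletes any line containing "drop";
# -e 's|keep|kept|' substitutes "kept" for the first "keep" on each line
printf 'keep me\ndrop me\n' | sed -e '/drop/d' -e 's|keep|kept|'
# prints: kept me
```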
's|, Staff"|"|' finds , Staff" and deletes the comma-space-Staff part (note the quotation mark is retained).
's|, "|"|' finds , " and deletes the comma-space, leaving the quotation mark again. This and the step above clean up entries like "Sname, Gname, Staff","sgname, " => "Sname, Gname","sgname"
's|","|","facultyID","|' adds in a second "facultyID" value in each CSV row, which our repository needs for reasons.
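Putting the sed edits together, here's the messy example row from above run through the same chain:

```shell
# the same four sed edits from the script, applied to one messy row
echo '"Sname, Gname, Staff","sgname, "' \
  | sed -e '/"STANDBY",""/d' -e 's|, Staff"|"|' -e 's|, "|"|' -e 's|","|","facultyID","|'
# prints: "Sname, Gname","facultyID","sgname"
```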
In the end I've: deduplicated the export, deleted useless lines, and cleaned up messy lines. I find occasions to run this script, or a slight modification of it, weekly. Doing the same steps in a text editor would be far more time-consuming and error-prone (since I might forget one, not do them in the right order, etc.).
Maybe this all reads like Greek; if so, I apologize. It took me a long time to learn these steps, and sed in particular has caused me much trouble. But now I'm able to write these quick, one-line scripts that automate what would've been several steps in a text editor.

Saturday, November 22, 2014

Hit the Pause Button

Just an FYI that this blog is going to go dormant for a while as I'm trying to be better about focusing my responsibilities. I'm a little overwhelmed at the moment, as the last post may have indicated, and cutting back my personal blog makes sense given what else I'm doing.

I'll still be around the interwebs though. Twitter, Tech Connect, and GitHub are good places to find me.

Sunday, September 21, 2014

Better to Burn Out than to Fade Away

Extra-professional obligations of mine:

  • I edit a column for the RUSQ journal, "Accidental Technologist". I'm proud of the columns I've published, but I've only written a couple myself. I identify topics & authors, read drafts, & provide feedback four times a year.
  • I write (quasi-)monthly blog posts for ACRL Tech Connect. Again, I'm proud of my posts. I also provide feedback for my excellent co-authors who mostly tolerate my nagging.
  • I'm on the LITA Forum Coordinating Committee. It's in Albuquerque this year & it's going to be great! Seriously. I'm excited about the keynotes & Forum has proven to be a great event to meet like-minded library technology folks.
  • I'm on the Code4Lib 2015 Keynotes Committee. We're still accepting nominations for keynote speakers!
  • I want to organize more Code4Lib NorCal meetups, which is the most neglected item on this list. If you're a C4L NorCal person, I promise you'll be seeing messages from me soon.
  • I'm juggling dozens of open source projects on GitHub, most of which suffer from benign neglect & could use some code & love. I just cannot help myself from jumping into new projects even when I clearly cannot commit enough. WikipeDPLA is my focal point at the moment but I've created about a half-dozen repos since publishing that & maybe I should just do one project at once.
To reiterate: these are all outside of my librarian position & while I do spend the occasional hour or two on them at work, for the most part I complete tasks outside of my 9-to-5. I can't get tenure; I just can't say "no". & I'm undoubtedly privileged; these are extra-professional commitments that aid my status in the profession, whereas others have extra-professional commitments oriented elsewhere. They can't put those in tenure dossiers, as unfair as that is.

But how? How can I continue? I find value in all of these bullet points, so how do I decide to say "no" to any of them? I know others are faced with similar struggles & I'm asking for advice. How do you do it all? There are so many people in libraryland who seem to be in a similar situation; I could name names, but I'd leave someone out. I don't know how they do so much in such finite time.




Let's all take a breather. No one work for the next week. Let us catch up instead.

Sunday, August 31, 2014

Switching to Fish Shell

I started using Fish as my primary shell a few months ago. While I like Bash, the promise of a more modern shell intrigued me. I spend entirely too much time on the command line. My affinity for Bash has less to do with its features as a language or shell than with the UNIX philosophy of many small programs which play nicely together.

Fish jokingly bills itself as "a command line shell for the 90s". It isn't revolutionizing what a shell does, rather it starts from a strong design document to provide a better experience. If you're unclear on the difference between a shell, terminal emulator, Bash, & command line interface, try Bryan J. Brown's description on his blog.

What's Good with Fish

Why would I switch to Fish? Immediately after trying it out, a few advantages were apparent. I didn't even have to consult help documentation.

Discovery is where Fish shines. I discovered new, useful programs on Mac OS simply by tabbing through available completions. Fish's completion is incredibly smart & detailed; it knows files, commands, variables, & flags. Bash does this too, but Fish is far superior & comes with a huge collection of completions for common programs. Its main advantage is that it'll show options, so the completion is exploratory, whereas in other shells the completion is just a convenience for people who already know what they're looking for. Fish shows the definition of a particular flag, function, or program—as well as the current value of variables—instead of merely showing that they exist.

Many of the tools I use have dozens of flags. I love them, but I can't memorize every flag for every one. Take Ack for example. I usually just add a flag for the programming language I'm searching (e.g. --js) & the string I'm looking for. But the other day I wanted to see the number of matches in each file of the large set I was searching. Now, I know ack can do this, but I didn't know what flag(s) I needed. Typically, I'd need to open up ack's man page, search through it, close it, & then run the command. With Fish, I typed a couple dashes, then tab to see all its completions, spotted --count right away, & ran the command without leaving my current context.

Another nice advantage of Fish's completion: it learns from previously typed commands. So even if there are no custom-built completions for a particular program, Fish learns how you use it & develops completions over time.

Fish also has colors! Nice ones! They pop more than I'm used to. What's more, the shell provides convenient abstractions for changing colors. The set_color command lets you use natural language like "red" rather than the crazy-looking echo -e "\033[1;33m" (yes, this is actually how you change colors in Bash). set_color is handy, but Fish also has added features like prompt_pwd, which is great for shortening the working directory for inclusion in a prompt.
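For comparison, here's a sketch of the raw escape-code approach in Bash, using standard ANSI sequences (1;33 is bold yellow; the message text is made up):

```shell
# $'…' quoting lets Bash interpret \033 (the escape character) literally;
# printing the sequence switches the terminal's color until reset
yellow=$'\033[1;33m'
reset=$'\033[0m'
printf '%swarning: something is yellow%s\n' "$yellow" "$reset"
```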

If you don't want to spend hours configuring a custom prompt, Fish comes with a couple dozen nice ones built-in. You can run **fish_config** to open a configuration interface in a web browser which gives you copy-pastable prompt code. This config feature makes it super quick to get started without a ton of research & looking up replacement tokens. Every shell should have such a feature.

Scripting in Fish is far more straightforward, as the shell's language is minimal & clean. It looks Ruby-esque & favors natural language everywhere over strange, punctuated incantations. Because it's a smaller & more rational language, learning the basics of Fish scripting is quicker than with other shells.

Fish also has wonderful error messages, perhaps the best of any programming language I've dealt with. That may not seem valuable but it helps immensely with learning the shell, especially when transitioning from Bash. Fish will not only point to the erroneous character, but will note common mistakes & try to guess what you missed. For instance, in Bash a subshell is launched with $(…) whereas Fish uses (); the $ in Fish means one & only one thing, that a variable is being used. So when you use a $ in the wrong context, it says so. An example:

> echo $(whoami)
fish: Did you mean (COMMAND)? In fish, the '$' character is only used for accessing variables. To learn more about command substitution in fish, type 'help expand-command-substitution'.
echo $(whoami)
     ^

Fish is half written in its own scripting language, so it's easy to see how some features work & extend them. I noticed that there weren't any completions for Node & NPM, so I added them myself by aping existing ones. Exposing so much of the shell's core functionality makes it customizable & approachable.

Annoyances

In a way, Fish is the perfect shell for someone just getting started at the command line because of its brilliant completions, easy (no code!) configurability, & sane scripting language. Unfortunately, for me, it's not quite perfect because I'm already used to Bash's quirky parts & rely on numerous packages, settings, & scripts that assume a more common (read: Bash) environment.

Example: z. Z is a vital utility for me; it allows me to quickly jump between my current location & places I've been previously. Z's API is simple: "z [string]", where "string" matches somewhere in the path of the place you want to go. So if I'm destroying system settings in "/Library/Application Support" & then need to go to my Doge Decimal project, I type "z doge" & am transported to "/Users/phette23/code/dogedc". But Z is a shell script; it's written in Bash. Luckily I found a port for Fish, but for a while I was trying really hacky solutions (including proxying Z through Bash every time I ran it). Other tools, like nvm, pose this same problem.

To be fair, various incompatibilities aren't Fish's fault. They can only be solved by popularity, such that when someone writes a script they think "I need this to work in all the popular shells: Bash, Zsh, & Fish". Sublime Text proved to be the biggest compatibility pain. Sublime uses os.environ['PATH'] to find the user's path, & this path is used in all kinds of plug-ins. I use several linting plugins, such as SublimeLinter-JSHint, which rely on JSHint being in your path. But Fish separates path locations with a space & not a colon; Sublime consequently misreads the whole PATH string, breaking almost every plugin I've installed.
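To make the mismatch concrete, here's a sketch with hypothetical directories: the space-separated list Fish reports versus the colon-separated string POSIX tools expect.

```shell
# Fish exposes PATH as space-separated entries; most tools expect colons
fish_style="/usr/bin /usr/local/bin"   # hypothetical entries
posix_style="$(echo "$fish_style" | tr ' ' ':')"
echo "$posix_style"
# prints: /usr/bin:/usr/local/bin
```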

I found a way around…and it was to default back to Bash. I ran chsh -s /usr/local/bin/bash to switch my default shell back to Bash, so when Sublime runs os.environ['PATH'] it comes back with a predictable, colon-separated path. But then, because I actually want to use Fish, I had to edit all my terminal emulator profiles (I use iTerm2) such that, instead of running as login shells that would default to Bash, they execute the /usr/local/bin/fish command. A surmountable problem, but it took me weeks to identify what was wrong & how to fix it.

In general, Fish users will run into more compatibility problems with all sorts of tools that assume a Bash or strict POSIX environment. As I said, much of this isn't Fish's fault, but it is worth noting that the shell doesn't strive for 100% POSIX compliance. In a way, this is necessary; Fish conflicts with POSIX only where a substantial benefit in usability is at stake. That's great, but it also causes headaches that can't be easily fixed since backwards compatibility is broken.

While Fish breaks with some POSIX traditions, in other places it doesn't go far enough. It relies heavily on double-underscored internal functions; anywhere there's a naming convention like this, there are scoping problems. It's not clear to me why all shell scripting languages lack true objects; everything ends up in the global scope. While Fish has nice arrays, certainly better than Bash, it still lacks data structures that aid in organization. A hash/dict/associative array type is badly needed. I think this might be a place where Windows PowerShell improves upon POSIX shells, though I haven't used PS enough to truly know.

There are also things I genuinely like about Bash. I like its || & && logical operators, which behave slightly differently from the natural language or & and of Fish. I like some of Bash's crazy-looking expansions, like !! (references the last command), which are weird & hard to remember but handy at times.

My main struggles with Fish revolve around output redirection, which it seems to be more stringent about. I still haven't found a nice way to quietly test if a command exists (which occurs all throughout my dotfiles, since I try not to assume a particular software setup). In Bash, this was simple with command -v $PROGRAM. But command is a shell built-in, not an external program, & so it differs in Fish. Fish doesn't replicate the "-v" flag; it only uses "command" as a way to bypass aliases. I've worked around it with a two-line solution: PROGRAM --version >/dev/null; if test $status…. This runs the program, silencing all output, & then checks the exit status (which would be non-zero, signifying an error, if the command didn't exist). It works, but it's slower & more verbose.
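For reference, the Bash idiom I'm describing looks like this, using ls as a stand-in program that's sure to exist:

```shell
# command -v prints the path (or definition) of a command if it exists,
# exiting non-zero otherwise; redirecting output keeps the check quiet
program=ls
if command -v "$program" >/dev/null 2>&1; then
  echo "$program exists"
else
  echo "$program not found"
fi
# prints: ls exists
```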

There's more than you ever wanted to know about my transition to Fish shell. I'm guessing that switching shells isn't something people consider very often. Those who use the command line rarely probably don't think it's worth the trouble (or don't even know/care that it's possible), while those who rely on the command line necessarily build up lots of dependence on a specific environment. Despite all that, I'd strongly recommend Fish to anyone and I thoroughly enjoy using it every day. The pains are, oddly enough, lesser for inexperienced shell users, while the benefits are greater thanks largely to how sane and helpful Fish is designed to be.

Thursday, August 7, 2014

How Not To Do User Testing

  • Perform tests only after a final product has already been rolled out
  • Use your tests to reify assumptions already built into the product
  • Test once and then never again because hey, you’re finished
  • Refuse to accept the validity of any given test until a statistically representative sample of your user populace has been obtained (it’ll never happen)
  • Never change your testing tasks and procedures, even the ones that prove to be deeply flawed, poorly worded, or uninformative
  • Ask users for their opinions rather than observing what they actually do. “Do you like this background gradient?” is a particularly apt question.
  • Conversely, test only tasks you think are important without gauging what users think is important
  • Collect personal information and video recordings during tests with no plans for how to secure the data or when to delete it
  • Simply refuse to do user testing