GEEKERY  
ADVENTURE  
CONTEMPLATION  

20151027

Black box variational inference for gammas

Whoa, this has been a long blogging hiatus for me.  I have no excuses other than I've been enjoying life and working hard.  So not excuses, reasons.

I return with a super light-and-fluffy post to share a guide to black box variational inference for gamma-distributed latent variables.  BBVI is very powerful, but I was having trouble applying it to gamma variables, so I asked Rajesh (its creator) for some tips.  I wrote the guide to try out his tricks on a very simple model and share them with other folks that might be having similar issues.  Have fun, y'all.

20150204

Allison's Law: "The mess has to go somewhere"

When I was growing up, we had a standard of cleanliness in our house called "daddy-clean." My brother and I were asked regularly to clean our rooms, like most American children, but when we were done, mom would always ask: Is it daddy-clean?  This usually resulted in a second round of cleaning to make sure everything was out of sight.

There was a flaw to this paradigm, however, which was that daddy-clean only applied to things that were visible.  Thus, I learned the art of shoving everything under my bed, which had a convenient bedskirt to hide everything.  Toys, clothes, paper; everything went underneath.  When under-the-bed got full, the closet was my second choice.  Eventually my parents found out about this, due to an abundance of random objects poking out, but they allowed me my secret messes so long as they didn't get in the way of finding important things, which they occasionally did.

Nowadays my messes look a little different.  In addition to paper, I have more abstract things like source code.  And my experience growing up has taught me: the mess has to go somewhere.  Most of the time, this is just a trade-off between time and different aspects of cleanliness of an end product, but it applies in so many cases.

Consider the process of creating a user interface.  The mess can go into the source code; everything hacked together in an ugly mess underneath.  The mess could also go into the UI itself: bad design with beautifully easy implementation.  Or, the mess could be absorbed with lots of time to have pretty code and sleek presentation.

Or consider a different piece of software, like an operating system.  The mess could go into the kernel, into the user experience, or passed on to developers for that platform.  Or, again, the mess can be absorbed by lots of time and effort.

In my experience, the mess of the very pretty Mac OS is passed on to developers.  D3, with its steep learning curve and beautiful graphics, also passes the mess to programmers.  Easy-to-use and powerful libraries like ggplot2 for R probably put the mess in some combination of the under-the-hood code and time.

I've also been thinking about this in terms of (machine learning) model development.  Usually elegant models require an intense amount of time to polish into their perfected forms.

It's not always the right choice to absorb mess with time; sometimes a project isn't worth doing exceptionally cleanly.  I think it is always worth it, however, to consider where your mess will be going in order to make a measured choice.

20140722

Daft for probabilistic graphical models

probabilistic graphical model rendered with Daft
Daft is a Python package used to render graphical models. Its renders are indeed lovely (see right), but the pipeline leaves something to be desired, and there's still a lot of functionality missing.

To try it out, I decided to draw one of the simplest PGMs possible: N points drawn from a distribution with mean μ.  It was frustrating to enter coordinates to place the nodes and plate boundaries. It would be preferable to specify which nodes the plates should surround, just as the edges specify which nodes they connect.  It would also be nice to not specify coordinates at all for the nodes, and instead have the system determine placement (but still allow manual override).

There are no options to control the alignment or scale of plate labels, and the concept of specifying an origin was a little strange, even if it makes sense.  The aspect ratio of the graphical model should be fit to the contents, and you should be able to set margins; the only time we should specify a size is when rendering.

While it seems promising, the learning curve is too steep for me.  I've entrenched myself in Inkscape, where it's easy for me to center things quickly.  Churning out the variant below took me about two minutes, whereas the Daft variant took closer to ten, and it still needs work.  That said, Daft does match fonts better with LaTeX documents.  I could see it being powerful once you know how to handle its quirks.

probabilistic graphical model hand-drawn with Inkscape

20140624

blogiversary!

Today, it's been six years since I started blogging.  To celebrate, I decided to do some text analysis of the 455 posts I've published here, prior to this one.  In curating the corpus, I learned that I write words like totally and amazing far too much.  Moving past my bad mannerisms, there's some fun stuff to see.

I ran the topic model LDA with 50 topics.  It captured the things I like to do: gardening, cooking, and travel. (I'm showing the top 10 terms associated with each topic, and top 5 documents.)

topic 008 chocolate butter egg cup add cream sugar mixture potato lime
two tarts
potato shallot souffle
Nearly Rotten Apples
chocolate festival!
chocolate cake for two

topic 037 seeds plants garden seed tomatoes tomato plant garlic planted plot
starting my heirloom garden
the hard way
So it begins...
frost vs. freeze
bring out yer dead

topic 048 car trip beach drive night friends road visited nwc park
East Coast Australia
up for air: a beautiful, but messy, life
Adventures in Israel, the Epic Saga, Chapter IV - By Day and by Night
concussion!
Come, come, ye students!


It also found some things that I geek out about: software design, books, and teaching.

topic 009 computer password system name history users physical person book month
accounts - what's the point?
designing everyday things and computer interactions
What should computers be able to do?
unplug
retina displays and serif fonts

topic 012 books book digital library kindle true screen already order libraries
paper and pixels
the Birth Order Book
fiction or nonfiction?
minimally problematic
Kindle review

topic 046 kids science school computer does mean taught put true teach
incorporating computer science into K-12 curriculums
welcome to the system
sorting concept game
switching places
the things we don't clean (little moment of compulsion #5)


And, unsurprisingly, it found the things about which I blather extensively: gender and sexuality, religion, mormon feminism, and morality in general.

topic 039 gender school roles boys girls children grad changed turn transgender
don't compete with the boys
transgenderism
redefining ambition
a blast from the past
gender identity in young children

topic 034 god atonement christ believe belief post faith negative self comfortable
Answering the Temple Recommend Interview Questions
inner light
just on belief (a follow up)
The Atonement
knowledge vs. belief

topic 035 women church priesthood mother holy gender father roles ghost heavenly
General Conference Sentence Generator
teaching young women
Boys and Girls and God
seeing change, or fruit and dirt
The Holy Ghost and Heavenly Mother

topic 047 marriage morality laws society child believe different parents moral gay
on the mercuriality of moral caliber in our beloved republic
forgiving vs. condoning
morality in a governed society, emotional premises, and same-sex marriage
on belief and expressing ideas
can't touch this


Because we have the time aspect, I was tempted to run Sean Gerrish's dynamic topics + influence model to see how topics shifted over time and what posts were prescient of change, but I was too lazy.

We can still, however, track page views over time (Blogger messes up the x-axis labels; it really starts at June 2008) and the number of posts over time.



Other tidbits:
  • my most popular post is The Holy Ghost and Heavenly Mother
  • my cs webpage refers the most traffic
  • I have 145 unpublished drafts, ranging from short notes to fully-fledged posts. Some of these I'm still working on, but others I've decided not to publish, but don't want to delete.
  • To date, I've earned $4.13 via Amazon ads.  More on my ad policy here.

20130326

you don't understand

Ugh, that title sounds like some awful teenager.  Luckily, there are no teenagers in this post.

Today I had the opportunity to listen to a guest lecture by the famous machine learning theorist Vladimir Vapnik.  Since he lives locally, I've heard him talk about the same topic three times now: in this class, at the annual NYAS Machine Learning Symposium, and at a general Princeton CS lecture.  (These are more-or-less the slides he used today.)

Warning: this next paragraph is geeky; skip it if you aren't interested.
The theory he presents is interesting, as are the results; he proposes that information other than input and result can be used in training a machine learning algorithm.  The idea is that some description of how we get from input to output, even if the description isn't enough to reproduce the result exactly, helps us learn.   He gives an awesome example of labeling OCR digits with essentially poetry, describing the personalities of the writers in flowery, adjective-heavy text; each digit in the training set had some text written exclusively for it. He shows that providing that text when training the algorithm (in addition to the input pixels and labeled outputs, of course) results in better digit recognition than providing the standard training data alone.  Permuting the text associations got rid of the improvement. Crazy stuff.

During the lecture, he said to the class several times "you don't understand."  It wasn't a question, nor did he always attempt to re-explain, perhaps deeming us incapable of understanding those particular points at all.  I've often found that the most brilliant people have a hard time explaining themselves so that everyone can understand--they just can't understand not understanding, and so can't see the path people need to follow in order to obtain understanding.

It seems like Vapnik has reached a point in his life where he is comfortable with people not understanding him; he's a very well-established individual and is possibly entitled to that luxury.  At this point, it's on us to try and understand him, instead of the usual more balanced responsibilities of teacher and student both needing to do their best to teach and understand, respectively.

That isn't to say that Vapnik isn't a good lecturer; he's fairly clear and entertaining, but there are some details that could use more illumination.  Perhaps I'm not being fair, though, since everything is in contrast to the usual lecturer for the course, Rob Schapire, who is possibly the best lecturer I've ever encountered.  I also contrast it to my own teaching, where I've been thinking hard about how to explain simple computer science concepts like objects or static methods to students that have never seen the material or anything like it ever before.  It's a lot of fun, but it's also exhausting to some extent.

Anyway, I find it funny that I felt the need to write a commentary about the teaching style of the lecturer whose talk was entitled Learning with Teacher: Learning using Hidden Information.  Maybe there was something hidden in there...

20110117

not fast enough

Tonight I was working on processing data about a topic model on some 360 thousand documents from ASCII files into a database.  Bweh, what a task.  I started with a pretty naïve approach, and after getting it to work piecewise, I set it to run on the whole shebang.  After watching numbers fly by for a few minutes, I crunched some numbers and figured that it would take about 10 days to finish--not okay.  This was a piece of code that was pretty tailored to my task and would likely only ever be seen by me, and even then, only run this once; I didn't really want to sink a lot of time into it, but I'd like it to finish in, say, under 24 hours.

First was the problem of finding links to the documents.  Nature has this entire set online, but finding a link given a document id (doi) wasn't a find-and-replace task.  Take a look at this document on The Rockefeller Foundation (doi: 10.1038/147811a0), for instance.  Part of the doi is in the link, but there's also a volume number and one other number prefixed by n that makes the link unique.  And I didn't have those numbers, so I was querying Nature for the doi and finding the pdf link on the search page (such as this).  Good for a handful of links, but not 360k times.  Turns out, I was able to find those numbers buried in my data, but it was pretty obscure.  Plug-n-play on the link took me down to 5 days.

Next, there were two types of data files about the documents: ones with a doi and an abstract per line, and ones with all sorts of other info, including topic model data and also the doi, again one line per document.  The files themselves weren't one-to-one (there being 8 of one kind and 25 of the other), but they were one-to-one when it came to the relevant document lines: each doi had one line in each of the two file sets.  However, the organization wasn't intuitive and I wasn't about to pry through them by hand.  Instead, I noted that the matches occurred in groups, though again, not in any way that made logical sense that I could hard-code.  So instead of looking through all the other files for a match, I just had it stash the last-used file and look in that one first each time.  If it wasn't there, it looked in the others and updated the last-used file accordingly.  That took the expected run time down to under 24 hours, and I declared myself done for the night.
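The last-used-file trick can be sketched as a small lookup wrapper.  This isn't my original script; `find_in_file` is a hypothetical helper standing in for whatever scans one metadata file for a doi's line:

```python
# Sketch of the "stash the last-used file" optimization: try the file
# that matched last time first, since matching dois occur in groups.

def make_lookup(filenames, find_in_file):
    """Return a lookup(doi) that caches which file matched last."""
    last_used = [filenames[0]]  # mutable cell holding the last hit

    def lookup(doi):
        # Fast path: the file that matched the previous doi.
        line = find_in_file(last_used[0], doi)
        if line is not None:
            return line
        # Slow path: scan the remaining files, then update the stash.
        for name in filenames:
            if name == last_used[0]:
                continue
            line = find_in_file(name, doi)
            if line is not None:
                last_used[0] = name
                return line
        return None

    return lookup
```

Because consecutive dois tend to live in the same file, most calls never hit the slow path, which is where the 5-days-to-under-24-hours speedup comes from.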

I guess the lesson learned was that even for simple pieces of code that are for private use and only to be used once, it's important to take the time to do things right instead of just the easiest/stupidest way possible.