Digging into Ruby Symbols

by Steve Yegge

Lots of people have been discussing symbols in Ruby, and seem to have converged on the explanation that symbols should be used whenever you're referring to a name (i.e. an identifier or keyword, essentially), even if you're talking about a hypothetical name that doesn't really exist in actual code yet.

I think this is the correct idiomatic usage, and it's a pretty good way to explain symbols. But I also think it's going to feel a bit hollow or contrived to someone coming to Ruby from a background in (say) JavaScript, Python, or even Java. If I were them, I'd be thinking: "Um, OK. Intent, intent, intent. Got it. But... isn't a program-source identifier a fairly abstract notion to reify as a first-class object type, especially going so far as to give it a special syntax? And did I just use the word 'reify'? Geez."

I mean, Ruby symbols are right up there with numbers, strings, regexps and the like as first-class lexical entities. I'm guessing that this feels like a really odd decision to a lot of programmers. They might be comfortable with the "intent" explanation (which, incidentally, is similar to why I tell people I like tuples in Python so much -- they help me express tuple-ish intent better than a list). Comfortable, sure, but they're probably not wholly satisfied. It still smells a little fishy.

Am I right?

I'd like to offer my own humble take on Ruby symbols, in the hope that it'll clear things up a teeny bit more. Nothing I'm going to say in any way negates what folks have concluded already, which is that symbols are best viewed as representing names in program code, not as "lightweight strings".
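
To make the "names, not lightweight strings" idea a bit more concrete, here's a minimal sketch of how Ruby's own reflection calls traffic in symbols. The tiny BigMeanGiant class is purely illustrative, and the sketch assumes Ruby 1.9 or later, where instance_methods returns symbols rather than strings.

```ruby
# Illustrative class; any class with a method would do.
class BigMeanGiant
  def fee() "FEE" end
end

g = BigMeanGiant.new

# Ask the class for the names of its methods: they come back as symbols.
puts BigMeanGiant.instance_methods(false).inspect

# Refer to a method by name without calling it directly.
puts g.respond_to?(:fee)
puts g.send(:fee)

# A symbol is one object per name; equal strings are separate objects.
puts :fee.equal?(:fee)
puts "fee".dup.equal?("fee".dup)
```

In each call the symbol stands for the method's name, not for a piece of text the program happens to be carrying around.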

Metaprogramming crash-course

Symbols as first-class objects are an idea that's usually associated with Lisp. I don't want to force you to learn any Lisp, and I won't show you any Lisp today. But hopefully I can give you the flavor of how symbols are used in Lisp by describing a "hole" in Ruby that I hope will be fixed someday.

As a toy example, let's take a look at the following Ruby code, which dynamically creates four methods and attaches them to an empty holder class, using eval.

#!/usr/bin/env ruby
# define a blank class as a holder for some methods
class BigMeanGiant
end

# Now add some silly-ish methods, using a flavor of eval.
# They're going to be instance methods, because it's as if
# we defined them inline inside the class definition above.
# When invoked, the giant yells the name of the method.

%w(fee fi fo fum).each do |name|
  BigMeanGiant.class_eval <<-EOS
    def #{name}() 
      puts 'Giant says:  #{name.upcase}!'
    end
  EOS
end

# invoke the methods, just for fun
begin
  g = BigMeanGiant.new
  g.fee
  g.fi
  g.fo
  g.fum
end

When you run this little program, it obligingly prints:

Giant says:  FEE!
Giant says:  FI!
Giant says:  FO!
Giant says:  FUM!

This program is roughly the "hello, world" of metaprogramming in Ruby. We've written some code that generates code on the fly: in our case, four nearly identical methods on BigMeanGiant called 'fee', 'fi', 'fo', and 'fum'. It's almost the same as if we'd written the code like this instead:

#!/usr/bin/env ruby

class BigMeanGiant
  def fee() puts "Giant says:  FEE!" end
  def fi()  puts "Giant says:  FI!"  end
  def fo()  puts "Giant says:  FO!"  end
  def fum() puts "Giant says:  FUM!" end
end

# invoke the methods, just for fun
begin
  g = BigMeanGiant.new
  g.fee
  g.fi
  g.fo
  g.fum
end

Running this version of the program has the same output.

What did we do that for?

Although this isn't meant to be a lesson in metaprogramming, let's make sure we're all on the same page here. The second version is clearer, right? Why would you ever do the first version?

You almost certainly wouldn't do it in an example this small, but the DRY principle tells us to avoid duplicating code. You can only get so far with function abstraction. Without metaprogramming, you can't really compress the BigMeanGiant class much. You might factor out some of the repetition with a helper function:

class BigMeanGiant
  def say(msg) puts "Giant says:  #{msg}!" end
  def fee() say "FEE" end
  def fi()  say "FI" end
  def fo()  say "FO" end
  def fum() say "FUM" end
end

But it's not much of a savings, because you still have to write all the stubs. Imagine you're writing an HTMLOutputter class, with one method for every HTML tag -- you'll have to write a few dozen stubs, which is more than just annoying. It's also probably more error-prone, since you'll have so much code it'll be harder to spot missed tags, duplicated tags, incorrect method bodies, and so on. And if you have to go back and change them all in some minor way, your refactoring editor may or may not be able to help, depending on what change you have in mind.

In short, having lots of similar-looking code is a Bad Thing.
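
Here's a rough sketch of what the HTMLOutputter idea might look like with generated methods. The class name comes from the paragraph above; the TAGS list and the method bodies are assumptions made up for illustration, and the output does no escaping.

```ruby
# Hypothetical sketch of the HTMLOutputter idea: one generated method per tag.
class HTMLOutputter
  # Illustrative subset; a real list would run to a few dozen tags.
  TAGS = %w(div em strong h1 h2 li)

  TAGS.each do |tag|
    # Interpolation fills in the tag name before class_eval sees the string.
    class_eval <<-EOS
      def #{tag}(text)
        "<#{tag}>" + text + "</#{tag}>"
      end
    EOS
  end
end

out = HTMLOutputter.new
puts out.div("Hello, " + out.em("world"))
```

Adding a tag means adding one word to the list, and changing every method body means editing one template instead of dozens of stubs.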

To solve problems like this in Java, you either have to build elaborate and inevitably awkward dispatching infrastructure, or you have to use external code generators, then hack your build system to know how to generate and then use the generated code.

This, incidentally, is why you so often see generated code in large Java projects -- it's because Java offers no language-level ways to deal with problems like this. And of course, this is only one type of problem that's solved elegantly with metaprogramming; there are many other classes of problem that are equally difficult to implement cleanly in Java.

OK, we're all on the same page now, right? Generating code on the fly can lead to cleaner, more maintainable code, assuming you use taste and good judgement and blah blah blah. You get the idea.

The example explained

Continuing with my quest to get us all on the same page, let me make sure you understand the code in the first example. The relevant part is this blob right here:

%w(fee fi fo fum).each do |name|
  BigMeanGiant.class_eval <<-EOS
    def #{name}() 
      puts 'Giant says:  #{name.upcase}!'
    end
  EOS
end

This weird-looking snippet, interpreted in English, is saying:

  1. Make me a list of the strings "fee", "fi", "fo", and "fum".
  2. For each one of those strings:
    • substitute it into another string below, containing a Ruby method definition
      • The first time, use it as the method name.
      • The second time, use it (uppercased) as what the Giant says.
    • Then call class_eval to turn it into a real method on the BigMeanGiant class.

Make sense? We're constructing method definitions in a loop, as strings, then passing them to the Ruby interpreter to attach them to a class. It's not all that different from putting the code in a Ruby source file, then invoking the interpreter; it's just that we're controlling the process ourselves at runtime.
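
The two stages are easier to see if you pull them apart. This is just the loop body from above, unrolled for a single name, with the intermediate string held in a variable (source is a name I picked for the sketch):

```ruby
class BigMeanGiant
end

name = "fum"

# Stage 1: plain string interpolation. No code exists yet, just text.
source = <<-EOS
  def #{name}()
    puts 'Giant says:  #{name.upcase}!'
  end
EOS

puts source.class   # it's an ordinary String
puts source         # the text the interpreter is about to receive

# Stage 2: hand the finished text to the interpreter.
BigMeanGiant.class_eval(source)
BigMeanGiant.new.fum
```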

The argument to class_eval is a string. The string contains code. Before class_eval gets hold of it, it's Pinocchio, wanting to be a Real Boy. class_eval is the fairy that sends him off to Pleasure Island to be ridiculed and learn valuable lessons, or whatever the interpreter does in its Big Black Box.

So far, so good. eval seems like a useful thing to have in your language, if you use it with caution.

Trouble in Paradise

So let's say there's a bug in my generated methods. Maybe the giant isn't saying anything, or he's saying the wrong thing. Let's say I'm having trouble figuring out the bug by staring at my code-string, which is really just a template. It's not real code until the interpreter finishes evaluating it and attaching it to the BigMeanGiant class.

So I fire up the debugger, and step through the code, and immediately notice a few things:

  1. The call to class_eval is atomic. The debugger just steps right over it.
  2. Calls to the generated methods are also atomic.
  3. I have no way of printing out the generated code.

In other words, your metaprogramming-generated code isn't "first class" in the same way your normal source code is. It's not visible to the debugger, and it's not available to other tools either. (For instance, rdoc lets you include the source code in the generated documentation, but I don't think there's any easy way to have it know about your eval-generated code.)

There are some games you can play that might make some of these things achievable. For instance, you might be able to override class_eval to store the original source code (after the template substitution) in the class somewhere, and then provide an API for getting at it for your favorite debugger. But to the best of my knowledge, it's not something that's supported "out of the box" in Ruby, and it means that working with generated code is harder than it really needs to be.
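
As a sketch of that kind of game: rather than overriding class_eval itself, this version adds a separately named helper that keeps a copy of each code string before evaluating it. KeepsGeneratedSource, class_eval_keeping_source, and generated_sources are all hypothetical names invented here, not anything Ruby provides.

```ruby
# Hypothetical helper, not part of Ruby: evaluates a code string like
# class_eval does, but keeps a copy of the text first.
module KeepsGeneratedSource
  # Every code string evaluated through the helper, in order.
  def generated_sources
    @generated_sources ||= []
  end

  def class_eval_keeping_source(source)
    generated_sources << source
    class_eval(source)
  end
end

class BigMeanGiant
  extend KeepsGeneratedSource
end

%w(fee fi fo fum).each do |name|
  BigMeanGiant.class_eval_keeping_source <<-EOS
    def #{name}()
      puts 'Giant says:  #{name.upcase}!'
    end
  EOS
end

BigMeanGiant.new.fee
puts BigMeanGiant.generated_sources.first
```

This only gets you the text back. It does nothing to teach a debugger or a documentation tool about the generated methods.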

Even if I'm completely mistaken here, and someone comments with a way to print out a generated method's source code (which would be pretty nifty), the whole experience still falls remarkably short of the metaprogramming facilities in Lisp.

To clarify, let's peer more closely into the lifecycle of that generated code. There are some distinct activities that rush right by us in Ruby, things we might actually want some control over.

We really will make our way to symbols soon, promise.

Constructing the code string

We start with a string, which the first example has in a "here doc" -- one of Ruby's genuine Perl-isms that you're free to view with suspicion. Python's syntax would be a triple-quoted string, which I think is nicer, but what's done is done. Here's the string again:

    def #{name}() 
      puts 'Giant says:  #{name.upcase}!'
    end

It could just as easily have been a normal, double-quoted string, even a one-liner:

  "def #{name}() puts 'Giant says: #{name.upcase}!' end"

However, because dynamically-generated code is notoriously tricky to debug, most of the time you'll want to format code in template strings as clearly as possible.

I'm calling it a template because Ruby strings can contain inline expressions, delimited with #{}. In Java you'd use string concatenation, e.g.:

"Giant says: " + getThingGiantSays() + "!"

Python has the printf-like % operator, and other languages have their own approaches. The Ruby way is probably more readable if the substituted expressions are short; using something like sprintf (which Ruby also has) will be better if there are long expressions. Basically you want to do whatever makes the code template look as much as possible like the code it's going to turn into.
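
For comparison, here are the two Ruby styles side by side on the one-liner from above. Kernel#format is the same thing as sprintf; the variable names are just for the sketch.

```ruby
name = "fum"

# Inline substitution: the template reads like the code it will become.
interpolated = "def #{name}() puts 'Giant says: #{name.upcase}!' end"

# printf-style: the expressions move out of the template and into the
# argument list, which helps when they get long.
formatted = format("def %s() puts 'Giant says: %s!' end", name, name.upcase)

puts interpolated
puts interpolated == formatted
```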

Here's Secret Observation #1: in Lisp, your code template isn't a string. It's a data structure that represents the tokenized and partially-parsed code. If Ruby had this feature, the BigMeanGiant example might look something like this:

%w(fee fi fo fum).each do |name|
  BigMeanGiant.class_eval START_CODE_TEMPLATE
    def #{name}() 
      puts 'Giant says:  #{name.upcase}!'
    end
  END_CODE_TEMPLATE
end

I put those big START/END tokens there in an attempt to make it clear that what's inside them is NOT a real boy; it's Pinocchio, and it will take some major Good Fairy work to make it real code.

But notice that the code inside the template is actually syntax-highlighted properly. When it was all inside a string (heredoc, double-quoted, or otherwise -- it's still just a string), it was all highlighted in light blue, which is what my editor tells me Strings should look like. My editor was nice enough to highlight the substitution expressions in brown, but you still need to realize they're substituted before the final string is used as an argument to class_eval. But inside the CODE_TEMPLATE, we know it's going to be code, so we can invoke the syntax-highlighter on it. Helps you see what's going on more clearly. And auto-indenting, tagging, and other IDE functions will work on it. Muuuuuch nicer than code in a string, wouldn't you agree?

Imagine that you could pass around one of those CODE_TEMPLATE doohickeys as an object, one that actually represented the Pinocchio-code in a way that let you traverse it and modify it before passing it off to eval. That seems like it could come in quite handy, and in fact it does. For one thing, it makes it far easier to do meta-metaprogramming, where you're writing code that generates those code templates. But at a perhaps more mundane level, it makes it possible to create new syntactic constructs in the Ruby language.

At this point, some people will cringe and shudder and proclaim: "Evil! What you just said is Pure Evil!" Lots of programmers, maybe even most of them, are so irrationally afraid of new syntax that they'd rather leaf through hundreds of pages of similar-looking object-oriented calls than accept one new syntactic construct. I blogged about this once, in an article called Language Trickery and EJB. That article actually managed to convince a bunch of hardcore Java programmers that new syntax might actually be a useful tool. Maybe it'll convince you too. If not, well, feel free to skip to the next section.

It would actually take me too far afield to go through a detailed example of how adding a new syntactic control-flow construct to Ruby could turn into a huge benefit for your project. Imagine, though, that Ruby didn't have here-docs, and that you were practically drooling with jealousy over Python's triple-quoted strings. If you're a Java programmer, and you're not drooling purely out of habit, then you should definitely drool over multi-line strings. It boggles the mind that they didn't include it as a language feature, and in Java we wind up doing zillions of manual concatenations to produce long strings (which usually by then look nothing like the thing they're trying to represent.) Ah, me.

If Ruby didn't have here-docs, but Ruby had those CODE_TEMPLATE thingies and one system hook that allowed you to control the evaluation of those templates, then you could implement here-docs pretty easily. Because the code-to-be is represented as a data structure, allowing you to quickly and easily filter out the #{}-substitution elements, you could simply evaluate whatever's inside those elements, and not evaluate anything else in the template. That's all they really do. And of course (much) more sophisticated syntactic constructs are also possible, if you put in more work.

That's the kind of thing Lisp programmers do for breakfast before going and writing their application code. And the funny thing is, it could really be super easy in Ruby -- maybe even easier than in Lisp. It's just that Ruby doesn't support it today.

If the hairs are all standing up on the back of your neck, and you're just recovering from shock and trying to think of the dirtiest word you could possibly call me, well, take a few deep breaths, nice and slow. It just means I'm a dog person and you're a cat person, or something like that. Let's not bite each other. Many people (notably Paul Graham in "On Lisp") have spent lots of effort explaining how this kind of programming has to be treated with MUCH more deference and caution than ordinary API programming. Language extensions and minilanguages can be extremely powerful and useful -- imagine where we'd be without regular expressions, for instance -- but they also require tons more care, documentation, and thought than defining an ordinary function.

You're already sort of doing this kind of "language extension" programming every time you call eval -- for that matter, you're doing it whenever you invoke a separate code generator, or open up a class and add stuff to it, or use a tool like yacc or ANTLR. We're completely surrounded by languages, large and small, all the way down to the minilanguage you use for ordering coffee at Starbucks. It'd be hard to get along without them.

Evaluation

Once you have that code template as an actual object, as opposed to a string that you need to parse yourself, then you could do all sorts of things with it. For one thing, you could pretty-print it. It's effectively in parse-tree format, so all you'd need to do is decide the rules for line breaks and spacing between various token types. For another thing, you could tell your debugger about it, which would allow you to inspect and step through generated code. And evaluation -- the creation of actual code from your template -- would no longer be the black box that it is in Ruby today (and in Python, Perl and JavaScript, for that matter.) More control means more opportunities to remove DRY violations, and do so in a way that has strong(er) long-term maintainability characteristics. I mean, you have to admit, not being able to inspect or step through your generated code makes maintenance a bit of a tricky proposition.
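
You can fake a little of this in today's Ruby to get the flavor. The sketch below is a toy, not a proposal: the template is a nested array, the node shapes (:def and :call) are invented for the example, and render is a deliberately tiny pretty-printer that only understands those two shapes.

```ruby
class BigMeanGiant
end

# Toy code template: symbols stand for names, strings for string literals.
# The node shapes (:def, :call) are invented for this sketch.
def giant_template(name)
  [:def, name.to_sym, [[:call, :puts, "Giant says:  #{name.upcase}!"]]]
end

# A very small pretty-printer: turn a template back into Ruby source text.
def render(node)
  case node
  when Symbol then node.to_s      # a name is emitted bare
  when String then node.inspect   # a string literal is emitted quoted
  when Array
    kind, *rest = node
    case kind
    when :def
      name, body = rest
      lines = body.map { |stmt| "  " + render(stmt) + "\n" }
      "def #{render(name)}()\n" + lines.join + "end"
    when :call
      name, *args = rest
      "#{render(name)}(" + args.map { |arg| render(arg) }.join(", ") + ")"
    else
      raise ArgumentError, "unknown node kind: #{kind.inspect}"
    end
  else
    raise ArgumentError, "unknown node: #{node.inspect}"
  end
end

template = giant_template("fee")
puts template.flatten.grep(Symbol).inspect   # every symbol in the template
puts render(template)                        # inspect it before evaluating

BigMeanGiant.class_eval(render(template))
BigMeanGiant.new.fee
```

Because the template is ordinary data until the last two lines, you can walk it, print it, or rewrite it with plain array operations before anything gets evaluated.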

(Note: see the important correction Jim Weirich made in the comments section. --steve)

Symbols at last

Those nonexistent code templates I've been referring to -- that is, objects (collections, really) that represent snippets of code to be evaluated -- they're really just syntax trees representing your source code. They're similar to the output you'd get from any parser, including generated parsers from tools like ANTLR. Or maybe a more familiar example is the XML DOM -- an object-tree representation of the parsed XML file. You have to admit, working with a DOM is a lot more convenient than working with a string containing raw XML. It's a huge difference, and it's a feature Lisp has that Ruby mostly lacks, at least today. A set of features, really: it's a rich programming domain.

In a system with first-class syntax trees represented as language entities, in a way that allows you to interact with the lexer, parser, and evaluator (i.e. different components of the Ruby interpreter), symbols make a whole lot more sense. A symbol is literally an object that represents a name in the code tree. If you had a code template snippet representing this code:

  def fum() say "FUM" end

Then your syntax tree would contain a Symbol object for each token in the code, with two exceptions. The string "FUM" would be a String, because it's just a string and not a source identifier or keyword. The parens in the arg list are the other exception, but that's another long story that we don't have time for today.
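
For a taste of what such a tree looks like, later Ruby versions (1.9 onward) ship a parser library called Ripper in the standard library. It is not the hypothetical code-template feature described here, and its tree doesn't use a Symbol for every token: node types are symbols, while identifier text comes back as strings. A minimal sketch:

```ruby
require 'ripper'
require 'pp'

# Parse the snippet into nested arrays (an s-expression).
tree = Ripper.sexp('def fum() say "FUM" end')
pp tree

# Node types such as :program and :def are symbols.
puts tree.flatten.grep(Symbol).uniq.inspect

# Identifier and string text appear as String leaves.
puts tree.flatten.grep(String).inspect
```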

So Ruby's symbols are really a placeholder for grand things to come. Ruby is already a very powerful, capable language, but it has some weaknesses in its ability to process Ruby code at runtime. Your only real tool today is eval (which comes in several flavors in Ruby, but that's irrelevant to our discussion), and it's a big black box. Once your code template is handed over to the Good Fairy, crossing that magical line between your program and the Ruby interpreter, you've lost it, and what you get back is effectively an opaque binary blob wrapped in a thin Method (or UnboundMethod, etc.) class that doesn't remember much about its original symbolic representation.
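
Here's a small sketch of what that thin wrapper does remember. It assumes Ruby 1.9 or later, where Method#name returns a symbol.

```ruby
class BigMeanGiant
end

BigMeanGiant.class_eval <<-EOS
  def fum()
    puts 'Giant says:  FUM!'
  end
EOS

m = BigMeanGiant.new.method(:fum)

puts m.class           # the wrapper class
puts m.name.inspect    # the name survives, as a symbol
puts m.arity           # so does the argument count
puts m.owner           # and the class it lives on

# Unbinding gives the UnboundMethod flavor of the same wrapper.
puts m.unbind.class

# What core Ruby doesn't hand back is the text the method was built from.
# (This prints false unless an add-on such as the method_source gem is loaded.)
puts m.respond_to?(:source)
```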

Well, that went on way too long. Was it helpful?


9 Comments

DougHolton
2005-12-28 20:52:39
Very nice. If I can add a point or question, maybe ruby needs :X instead of just X for symbols because it has no compile-time vs. run-time distinction. Everything is done at run-time, but for metaprogramming it still helps to "intercept" code after it is parsed but before it is evaluated. In static languages, intercepting is easier because parsing and running the code happen at completely different times. For example a macro like:
#define _N(x) x
would replace _N(x) with x at compile time, before x is ever evaluated.


In the language boo (http://boo.codehaus.org/), which is statically typed and supports a couple of kinds of macros (http://boo.codehaus.org/Syntactic+Macros), you can say for example:
attr_accessor X, Y
instead of
attr_accessor :X, :Y


because attr_accessor would be a compile-time macro (not a method called at runtime like in ruby). Thus any identifier passed to the macro is a type of AST node, such as a "symbol" (referenceexpression) or some other type of ast node (string literal expression, methodinvocationexpression, etc.). All the AST nodes are of course arranged in a tree like DOM, and you can manipulate it as you will at compile time.


The syntax you mentioned for code templates (sort of wysiwyg macros), is similar to that proposed for boo: http://jira.codehaus.org/browse/BOO-95


See nemerle for a language with even more powerful macro support: http://www.nemerle.org/
and http://nemerle.org/Macros


Anyway, back to the point, when or if ruby gets the ability to represent code as AST, :X can be used to refer to reference expression node with the name "X", whereas X would refer to whatever some X variable stands for at the time the code template is processed.

Jim Weirich
2005-12-29 04:36:44
If you add the "magic" incantation to your eval line:


  BigMeanGiant.class_eval <<-EOS, __FILE__, __LINE__+1


Then you will be able to step into each of the fee/fie/foo methods in the debugger. However, you won't be able to examine the value of the name variable in the debugger.


However, if you change the definition of the fee/fie/foo methods to the following, you will be able to both step into the methods and query the value of the name variable at run time.

class BigMeanGiant
  %w(fee fi fo fum).each do |name|
    define_method(name) {
      puts name
    }
  end
end


You still can't get the source back (but then you can't do that with regular defined methods either). If you are interested in examining the Ruby AST at run time (and even modifying it), check out Ryan Davis's ParseTree library.


Christian Neukirchen
2005-12-29 05:34:36
ri define_method

Dae San Hwang
2005-12-29 07:22:20
Thank you for the enlightenment! I think I am finally beginning to understand what the lisp macro is all about! =)

Steve Yegge
2005-12-29 15:31:43
Jim, that's awesome. Thanks. I knew about the __LINE__ trick, but for some reason only thought it worked in stack traces, not in the debugger. Silly me.


I knew about define_method, and it's very cool. I probably should have included a snippet about it. As you know, though, define_method doesn't quite address what I was getting at, which was synthesizing new code trees on the fly (or turning existing code into an AST and processing it). In particular, it would be ideal to be able to manipulate the token stream and/or AST at lex time (for syntax), parse time (for code rewrites), compile time (for performance), and run time (for metaprogramming, e.g. doing multi-method dispatch or whatever).


Ruby offers enough building blocks to do all of these things to some extent. I'm definitely going to have to check out Ryan Davis's parse tree library; it sounds great.


In the spirit of full disclosure, I think even Lisp has trouble recovering the initial source code from compiled functions, unless you deliberately take steps to store the pre-compiled source somewhere during the compilation. It doesn't happen by default. But it's easy to dream about a language environment where you can pretty much go from symbolic source to symbolic data structures to code, and back, with lots of transparency and control at different steps.


Anyway, thanks for the tips! You've instantly made me a more effective Ruby programmer by showing me a way to step through eval-generated code. I have to go check out that library now.

David Koontz
2005-12-30 12:46:10
I agree with Dae San Hwang, I had heard lots of talk about Lisp macros but never saw a good example of how their power surpasses what is available in Ruby. Thank you Steve!

Jim Weirich
2006-01-02 10:46:57
Steve: [...] doesn't quite address what I was getting at, which was synthesizing new code trees on the fly (or turning existing code into an AST and processing it). In particular, it would be ideal to be able to manipulate the token stream and/or AST at lex time (for syntax), parse time (for code rewrites), compile time (for performance), and run time (for metaprogramming, e.g. doing multi-method dispatch or whatever).


I agree define_method doesn't do the AST thing.


Are you aware of Ryan Davis' ParseTree project? It will pull out the AST of an existing Ruby method or class (with some restrictions), and allow you to manipulate it for whatever purposes. Combine ParseTree with the Ruby2Ruby project to rewrite Ruby code, or combine it with Ruby2C to generate C code from your Ruby ASTs. The ZenOptimize project adds a JITer that will dynamically convert (simple) Ruby code to C at runtime.


References:


ParseTree: http://rubyforge.org/projects/parsetree/
ZenOptimize: http://blog.zenspider.com/archives/2005/04/ruby_go_zoom_zo.html


Great posts, BTW, I really enjoyed them.


-- Jim

Steve Yegge
2006-01-05 19:11:53
I checked out the parse tree package. Pretty snifty. I find it funny that he produces an s-expression format in Ruby. I'm not sure if that makes it easier to *process* from Ruby, since I haven't tried it out yet, but it's certainly convenient to read.

Patrick
2006-02-22 04:21:16
Can't you just print the string being evaluated to stderr while debugging? Don't know Ruby, so I might be mistaken, but I had the same trouble in lotusscript, while creating huge macros on the fly. It's not perfect, but at least you can see what's going to be processed.