NubyGems: Hash Initialization

by Gregory Brown

Introduction



Over the next few weeks, I will be posting short blog entries about things I tripped over when I was first learning Ruby. They will be minimal little blurbs that will hopefully help new users either understand a trap they've fallen into or avoid it altogether.

I hope the RubyGems folks forgive me for the terrible pun, but these are meant to be little gems of information for those new to Ruby. If you are an experienced Rubyist, these posts will probably bore you, so be warned!

The Hash Initialization Problem



One fairly common need in programming is to set a default value for keys to map to in a hash. For example, if you are dealing with a hash of a bunch of numbers, you might just want a key that isn't there to map to zero.

There are two different ways you can use the hash constructor to do this. One is to just pass the default in as a parameter, like Hash.new(0), and the other is to use a block, such as:
Hash.new { |h,k| h[k] = 0 }

Now, in this simple example, these two pieces of code do the same thing from a user's perspective. With either hash assigned to a, you get results like this:


>> a[:foo]
=> 0
>> a[:foo] += 1
=> 1
>> a[:foo]
=> 1
>> a[:bar]
=> 0
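
As a slightly more realistic use of the same behavior, here is a minimal sketch of tallying words with a default of zero. The helper name count_words and the sample words are made up for illustration.

```ruby
# Count how often each word occurs, treating unseen words as 0.
# `count_words` is a hypothetical helper used only for this sketch.
def count_words(words)
  counts = Hash.new(0)                     # parameter form: missing keys read as 0
  words.each { |word| counts[word] += 1 }  # += reads the default, then assigns
  counts
end

counts = count_words(%w[apple pear apple fig apple])
puts counts["apple"]      # prints 3
puts counts["kiwi"]       # prints 0
puts counts.key?("kiwi")  # prints false (reading the default stores nothing)
```

Note that the += line assigns a new number to the key each time, so the shared default itself is never modified.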


Now, as a lazy coder who didn't have a strong grasp of blocks when I first started using Ruby, I much preferred the parameter form. However, there is a subtle difference between the two that can be very problematic if you aren't careful.

The important thing to know is that the parameter form returns the exact same object as the default for every missing key. The example above used a Fixnum, which is an immediate value, so this didn't matter. If you use something like a string, though, it's not so simple. Take a look at the two chunks of code below, and note the difference.


irb(main):001:0> a = Hash.new("")
=> {}
irb(main):002:0> a[:foo]
=> ""
irb(main):003:0> a[:foo] << "bar"
=> "bar"
irb(main):004:0> a[:foo]
=> "bar"
irb(main):005:0> a[:train]
=> "bar"



irb(main):006:0> a = Hash.new { |h,k| h[k] = "" }
=> {}
irb(main):007:0> a[:foo]
=> ""
irb(main):008:0> a[:foo] << "bar"
=> "bar"
irb(main):009:0> a[:foo]
=> "bar"
irb(main):010:0> a[:train]
=> ""


See? The first bit of code shares a single string object among all missing keys, whereas the second bit creates a new string for each key that is not yet mapped to a value. Though you might find the first behavior useful at times, the second is usually what you want, and this is the rule of thumb I go by to keep from getting snagged:

If I am using an immediate value for my default, I tend to use the parameter method. Otherwise, I tend to use the block form, especially when dealing with any type of collection or string.
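
To see why collections in particular call for the block form, here is a small sketch that groups words by first letter both ways. The helper names and sample words are made up for illustration.

```ruby
# Block form: each missing key gets its own Array, because the block
# runs (and builds a fresh Array) for every key that is not yet set.
def group_by_initial(words)
  groups = Hash.new { |hash, key| hash[key] = [] }
  words.each { |word| groups[word[0]] << word }
  groups
end

# Parameter form: every << modifies the one shared default Array,
# and nothing is ever stored in the hash itself.
def group_by_initial_shared(words)
  groups = Hash.new([])
  words.each { |word| groups[word[0]] << word }
  groups
end

good = group_by_initial(%w[apple avocado banana])
bad  = group_by_initial_shared(%w[apple avocado banana])

puts good["a"].length   # prints 2
puts good.size          # prints 2
puts bad.size           # prints 0
puts bad["z"].length    # prints 3 (every word landed in the shared default)
```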

Anyway, I hope this helps people understand what the two different initializers do and prevents some gotchas. Happy Hacking!

15 Comments

Haris Skiadas
2006-04-12 22:52:13
I wasn't even aware of the block initializer! Very useful. And more versatile! You could presumably have the value h[k] be something much more interesting and dependent on h and k. For instance this:


Hash.new { |h,k| @count = 0 unless @count; @count += 1; h[k] = @count}


will assign an ever-increasing number to each new hash key.

Per Melin
2006-04-12 23:22:35
It's important to note that the block form actually assigns a value to every unassigned key that you look at. So depending on how you use it (e.g. you loop through the hash looking for keys with a value) you could end up with a much bigger hash.

Neil Salter
2006-04-13 01:44:36
I really enjoyed this blog post, thanks.


It's the first time I've come across 'immediate' values in ruby, and it prompted me to do some digging. Immediate values are a subtle concept that could perhaps be covered in another nuby gem?

Ray
2006-04-13 02:32:41
Good tip. It made me think about the .default method, which internally must be doing the same assignment as passing in a parameter.



irb(main):001:0> a = Hash.new
=> {}
irb(main):002:0> a.default = ""
=> ""
irb(main):003:0> a[:foo]
=> ""
irb(main):004:0> a[:foo] << "bar"
=> "bar"
irb(main):005:0> a[:foo]
=> "bar"
irb(main):006:0> a[:train]
=> "bar"



Steven Bristol
2006-04-13 02:59:54
Great post. GREAT idea for a series of topics. Please write more like it.

Greg
2006-04-13 08:02:31

Per Melin writes:


It's important to note that the block form actually assigns a value to every unassigned key that you look at. So depending on how you use it (e.g. you loop through the hash looking for keys with a value) you could end up with a much bigger hash.


Yes, the block form *will* create a new object for each key which is not already assigned a value.


So, if you are using it in a way that might accidentally create some cruft, you might consider using Hash#has_key? before attempting to use your key.
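
A small sketch of what Per describes, and of two ways around it. Hash#key? is the same method as Hash#has_key?, and Hash#fetch with a fallback argument does not run the default block. The helper name new_string_table is just for illustration.

```ruby
# Returns a hash whose block stores "" for every missing key that is read.
# `new_string_table` is a hypothetical helper used only for this sketch.
def new_string_table
  Hash.new { |hash, key| hash[key] = "" }
end

table = new_string_table
table[:foo] << "bar"

# A plain read of a missing key stores an entry as a side effect.
table[:train]
puts table.size               # prints 2

# Checking first, or using fetch with a fallback, leaves the hash alone.
puts table.key?(:plane)       # prints false
puts table.fetch(:plane, "")  # prints an empty line
puts table.size               # prints 2
```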


Greg
2006-04-13 08:06:08

Ray writes:


Good tip. It made me think about the .default method, which internally must be doing the same assignment as passing in a parameter.


Yup, and here's confirmation of that:



irb(main):003:0> a = Hash.new
=> {}
irb(main):004:0> a.default = ""
=> ""
irb(main):005:0> a[:foo].object_id
=> 22448284
irb(main):006:0> a[:bar].object_id
=> 22448284

Greg
2006-04-13 08:10:04

Neil Salter writes:


It's the first time I've come across 'immediate' values in ruby, and it prompted me to do some digging. Immediate values are a subtle concept that could perhaps be covered in another nuby gem?


Sure... let me just try to remember some weird hang-up I had with them. I'm trying to base this series on problems I ran into, because I find that when I screw something up, I tend to learn the most.

JEG2
2006-04-13 10:55:13
A wonderfully informative post!

Brian
2006-04-14 03:53:38
Good post. Thank you for a nice nugget of knowledge.


I don't know what immediate value means either (hint, hint) but is it fair to say the parameter approach is appropriate when the default is immutable (or at least treated as immutable by programmer discipline and unit tests)?


Brian

Gregory Brown
2006-04-16 19:24:43
Brian writes:

is it fair to say the parameter approach is appropriate when the default is immutable (or at least treated as immutable by programmer discipline and unit tests)?


Yes, that sounds reasonable. Realistically, if you think that there is a possibility that the object might change, and you consider that change A Bad Thing, it is probably a better idea to use the block approach.
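
One way to back up the "treated as immutable" discipline is to freeze the default, so that an accidental in-place change raises an error instead of silently altering what every missing key returns. This is only a sketch, and the helper name is made up for illustration.

```ruby
# Build a hash whose shared default string is frozen.
# `new_frozen_default_hash` is a hypothetical helper for this sketch.
def new_frozen_default_hash
  Hash.new("".freeze)
end

labels = new_frozen_default_hash

begin
  labels[:foo] << "bar"      # tries to modify the shared default in place
rescue StandardError => e    # FrozenError on recent Rubies, RuntimeError on older ones
  puts "refused: #{e.class}"
end

labels[:foo] += "bar"        # builds a new String and assigns it, which is fine
puts labels[:foo]            # prints bar
puts labels[:train].empty?   # prints true (the default is untouched)
```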

Peter Cooper
2006-05-27 06:44:17
I think you might have missed something users may still trip up on. There's no difference if you assign directly with = rather than using <<. Assignment stores a new string object as the value, so the first example then behaves like the second one. That is:


x = Hash.new("")
x[:foo]          # => ""
x[:foo] = 'bar'  # stores a new string under :foo; the default is untouched
x[:bar]          # => ""


Whereas << merely pushes text onto the end of the default.

Gregory Brown
2006-05-27 08:32:43
Actually, I doubt most people would be surprised by the fact that the default value gets replaced by whatever is assigned.


It's the fact that you can modify that default value that can trip you up. If you are only doing direct assignment, the single default object approach would probably perform better.

Dan Vanderkam
2006-11-21 00:40:16
I've never understood why "h" is the mandatory first parameter in the block form of Hash.new. This type of initialization would make far more sense:


h = Hash.new { "" }


Honestly, how often does your Hash block not end with "h[k] = foo"? What's more insidious though is that the above form of hash construction half-works. Here's an IRB session:



irb(main):001:0> h = Hash.new { [] }
=> {}
irb(main):002:0> h[0]
=> []
irb(main):003:0> h[0] << "blah"
=> ["blah"]
irb(main):004:0> h[0]
=> []


This caused me no end of confusion before I found out what the correct form was. I'd like to see how Hash.new is implemented. I'm surprised that h[0] returned "[]" in this case and not "nil".

Greg
2006-11-21 07:50:02
Dan, I think you are misunderstanding how the block works. For the code you suggested

h = Hash.new { [] }


what is actually happening is the block is being called without anything being assigned every time you access an unset key.


Ruby actually calls the block and passes in the hash and the key when it encounters a key which does not have a value assigned to it. Since your block doesn't do anything with them, the lookup just returns the value of the block. And since that value is never stored, the block runs again on every access, so you get a new Array object each time.


I'm not sure if it'd be nice or not if the block's return value were auto-assigned to the Hash. I think the block form works okay as it is, honestly.


It did take me a few seconds to see why your example was failing, though.
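
Here is a side-by-side sketch of the two blocks, in case it helps to see the difference in one place. The helper names are made up for illustration.

```ruby
# A block that returns a new Array but never stores it.
def unstored_lists
  Hash.new { [] }
end

# A block that stores the Array it builds under the missing key.
def stored_lists
  Hash.new { |hash, key| hash[key] = [] }
end

unstored = unstored_lists
unstored[0] << "blah"      # appends to a temporary Array that is then dropped
puts unstored[0].empty?    # prints true
puts unstored.key?(0)      # prints false

stored = stored_lists
stored[0] << "blah"        # appends to the Array that was just stored
puts stored[0].first       # prints blah
puts stored.key?(0)        # prints true
```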