13 Hustle: heaps and lists

13.1 Inductive data🔗

So far all of the data we have considered can fit in a single machine word (64-bits). Well, integers can’t, but we truncated them and only consider, by fiat, those integers that fit into a register.

In the Hustle language, we will add two inductively defined data types, boxes and pairs, which will require us to relax this restriction.

Boxes are like unary pairs, they simply hold a value, which can be projected out. Pairs hold two values which each can be projected out.

To see how values are now inductively defined notice that if you have a value v, you can make another value with (box v). Similarly, if v1 and v2 are values, then so is (cons v1 v2). This suggests the following recursive type definition for values:

; type Value =
; | Integer
; | Boolean
; | Character
; | Eof
; | Void
; | '()
; | (cons Value Value)
; | (box Value)

The new operations include constructors (box e) and (cons e0 e1) and projections (unbox e), (car e), and (cdr e). We’ll also include predicates for identifying boxes and pairs: (box? e) and (cons? e).

These features will operate like their Racket counterparts:

Examples
> (unbox (box 7))
make[1]: Entering directory '/home/runner/work/cmsc430.github.io/cmsc430.github.io/langs/hustle'
make[1]: Leaving directory '/home/runner/work/cmsc430.github.io/cmsc430.github.io/langs/hustle'
7
> (car (cons 3 4))
3
> (cdr (cons 3 4))
4
> (box? (box 7))
#t
> (cons? (cons 3 4))
#t
> (box? (cons 3 4))
#f
> (cons? (box 7))
#f

Usually boxes are mutable data structures, like OCaml’s ref type, but we will examine this aspect later. For now, we treat boxes as immutable data structures.

We will also add support for writing pair and box literals using the same quote notation that Racket uses.

These features will operate like their Racket counterparts:

Examples
> (unbox '#&7)
7
> (car '(3 . 4))
3
> (cdr '(3 . 4))
4
> (box? '#&7)
#t
> (cons? '(3 . 4))
#t
> (box? '(3 . 4))
#f
> (cons? '#&7)
#f

13.2 Empty lists can be all and end all🔗

While we’ve introduced pairs, you may wonder what about lists? Just as in Racket, lists can be represented by idiomatic uses of cons: a non-empty list is a pair whose car is an element and whose cdr is the rest of the list. What’s left? We need a representation of the empty list!

In Racket, and in our languages, we write this value as '(). There’s nothing particularly special about the empty list value, we just need another distinguished value to designate it.

Using cons and '() in a structured way we can form proper list, among other useful data structures.

We use the following AST data type for Hustle:

hustle/ast.rkt
#lang racket
;; type Expr = ... | (Lit Datum)
;; type Datum = ... | (cons Datum Datum) | (box Datum) | '()
;; type Op1 = ... | 'box | 'car | 'cdr | 'unbox | 'box? | 'cons?
;; type Op2 = ... | 'cons

13.3 Parsing🔗

Mostly the parser updates for Hustle are uninteresting. The only slight twist is the addition of compound literal datums.

It’s worth observing a few things about how quote works in Racket. First, some datums are self-quoting, i.e. we can write them with or without quoting and they mean the same thing:

Examples
> 5
5
> '5
5

All of the datums consider prior to Hustle have been self-quoting: booleans, integers, and characters.

Of the new datums, boxes are self-quoting, but pairs and the empty list are not.

Examples
> #&7
'#&7
> '#&7
'#&7
> ()
#%app: missing procedure expression;
probably originally (), which is an illegal empty
application
in: (#%app)
> '()
'()
> (1 . 2)
eval:406:0: #%app: bad syntax
in: (#%app 1 . 2)
> '(1 . 2)
'(1 . 2)

The reason for this is that unquoted list datums would be confused with expression forms without the quote, so its required, however for the other datums, there’s no possible confusion and the quote is inferred. Note also that once inside a self-quoting datum, it’s unambiguous that we’re talking about literal data and not expressions that need to be evaluated, so you can have empty lists and pairs:

Examples
> #&()
'#&()
> #&(1 . 2)
'#&(1 . 2)

This gives rise to two notions of datums that our parser uses, with (mutually defined) predicates for each:

hustle/parse.rkt
;; Any -> Boolean
(define (self-quoting-datum? x)
  (or (exact-integer? x)
      (boolean? x)
      (char? x)
      (and (box? x) (datum? (unbox x)))))

;; Any -> Boolean
(define (datum? x)
  (or (self-quoting-datum? x)
      (empty? x)
      (and (cons? x) (datum? (car x)) (datum? (cdr x)))))

Now when the parser encounters something that is a self-quoting datum, it can parse it as a Lit. But for datums that are quoted, it will need to recognize the quote form, so anything that has the s-expression shape 'd will also get parsed as a Lit.

Things can get a little confusing here so let’s look at some examples:

Examples
> (parse 5)
'#s(Lit 5)
> (parse '5)
'#s(Lit 5)

Here, both examples are really the same. When we write '5, that reads it as 5, so this is really the same example and corresponds to an input program that just contains the number 5 and we are calling parse with an argument of 5.

If the input program contained a quoted 5, then it would be '5, which we would represent as an s-expression as ''5. Note that this reads as ''5, i.e. a two-element list with the symbol 'quote as the first element and the number 5 as the second. So when writing examples where the input program itself uses quote we will see this kind of double quotation, and we are calling parse with a two-element list as the argument:

FIXME: langs needs to be update to parse this correctly.

Examples
> (parse ''5)
bad syntax ''5

This is saying that the input program was '5. Notice that it gets parsed the same as 5 by our parser.

If we were to parse the empty list, this should be considered a parse error because it’s like writing () in Racket; it’s not a valid expression form:

Examples
> (parse '())
parse error '()

However, if the empty list is quoted, i.e. ''(), then we are talking about the expression '(), so this gets parsed as (Lit '()):

Examples
> (parse ''())
'#s(Lit ())

It works similarly for pairs:

FIXME: langs needs to be update to parse second example correctly.

Examples
> (parse '(1 . 2))
parse error '(1 . 2)
> (parse ''(1 . 2))
parse error '(1 . 2)

While these examples can be a bit confusing at first, implementing this behavior is pretty simple. If the input is a self-quoting-datum?, then we parse it as a Lit containing that datum. If the the input is a two-element list of the form (list 'quote d) and d is a datum?, the we parse it as a Lit containing d.

Note that if the examples are confusing, the parser actually explains what’s going on in Racket. Somewhere down in the code that implements read is something equivalent to what we’ve done here in parse for handling self-quoting and explicitly quoted datums. Also note that after the parsing phase, self-quoting and quoted datums are unified as Lits and we no longer need to be concerned with any distinctions that existed in the concrete syntax.

The only other changes to the parser are that we’ve added some new unary and binary primitive names that the parser now recognizes for things like cons, car, cons?, etc.

hustle/parse.rkt
  #lang racket
  (provide parse parse-closed)
  (require "ast.rkt")

  ;; s:S-Expr -> e:ClosedExpr
  ;; Parse s into (a potentially open) expr e
  (define (parse s)
    (match (parse/acc s '() '())
      [(list _ e) e]))

  ;; s:S-Expr -> e:ClosedExpr
  ;; Parse s into closed expr e; signal an error when e is open
  (define (parse-closed s)
    (match (parse/acc s '() '())
      [(list '() e) e]
      [(list fvs e) (error "unbound identifiers" fvs)]))

  ;; s:S-Expr bvs:[Listof Id] fvs:[Listof Id]
  ;;   -> (list fvs-e:[Listof Id] e:Expr)
  ;; Parse s into expr e and list of free variables fvs-e,
  ;; assuming variables in bvs are bound and fvs are free.
  (define (parse/acc s bvs fvs)
    (define (rec s bvs fvs)
      (match s
        [(and 'eof (? (not-in bvs)))
         (list fvs (Eof))]
        [(? datum?)
         (list fvs (Lit s))]
        [(list (and 'quote (? (not-in bvs))) (list))
         (list fvs (Lit '()))]
        [(? symbol?)
         (list (if (memq s bvs) fvs (cons s fvs)) (Var s))]
        [(list-rest (? symbol? (? (not-in bvs) k)) sr)
         (match k
           ['let
            (match sr
              [(list (list (list (? symbol? x) s1)) s2)
               (match (parse/acc s1 bvs fvs)
                 [(list fvs e1)
                  (match (parse/acc s2 (cons x bvs) fvs)
                    [(list fvs e2)
                     (list fvs (Let x e1 e2))])])]
              [_ (error "let: bad syntax" s)])]
           [_
            (match (parse-es/acc sr bvs fvs)
              [(list fvs es)
               (list fvs
                     (match (cons k es)
                       [(list (? op0? o)) (Prim0 o)]
                       [(list (? op1? o) e1) (Prim1 o e1)]
                       [(list (? op2? o) e1 e2) (Prim2 o e1 e2)]
                       [(list 'begin e1 e2) (Begin e1 e2)]
                       [(list 'if e1 e2 e3) (If e1 e2 e3)]
                       [_ (error "bad syntax" s)]))])])]
        [_ (error "parse error" s)]))
    (rec s bvs fvs))

  ;; s:S-Expr bvs:[Listof Id] fvs:[Listof Id]
  ;;   -> (list fvs-e:[Listof Id] es:[Listof Expr])
  ;; Parse s into a list of expr es and list of free variables fvs-e,
  ;; assuming variables in bvs are bound and fvs are free.
  (define (parse-es/acc s bvs fvs)
    (match s
      ['() (list fvs '())]
      [(cons s ss)
       (match (parse/acc s bvs fvs)
         [(list fvs e)
          (match (parse-es/acc ss bvs fvs)
            [(list fvs es)
             (list fvs (cons e es))])])]
      [_ (error "parse error")]))

  ;; xs:[Listof Any] -> p:(x:Any -> Boolean)
  ;; Produce a predicate p for things not in xs
  (define (not-in xs)
    (λ (x) (not (memq x xs))))

  ;; Any -> Boolean
  (define (datum? x)
    (or (exact-integer? x)
        (boolean? x)
        (char? x)))

  ;; Any -> Boolean
  (define (op0? x)
    (memq x '(read-byte peek-byte void)))

  (define (op1? x)
    (memq x '(add1 sub1 zero?
                   char? integer->char char->integer
                   write-byte eof-object?
                   box unbox empty? cons? box? car cdr)))

  (define (op2? x)
    (memq x '(+ - < = eq? cons)))

13.4 Meaning of Hustle programs, implicitly🔗

To extend our interpreter, we can follow the same pattern we’ve been following so far. We have new kinds of values such as pairs, boxes, and the empty list, so we have to think about how to represent them, but the natural thing to do is to represent them with the corresponding kind of value from Racket. Just as we represent Hustle booleans with Racket booleans, Hustle integers with Racket integers, and so on, we can also represent Hustle pairs with Racket pairs. We can represent Hustle boxes with Racket boxes. We can represent Hustle’s empty list with Racket’s empty list.

Under this choice of representation, there’s very little to do in the interpreter. We only need to update the interpretation of primitives to account for our new primitives such as cons, car, etc. And how should these primitives be interpreted? Using their Racket counterparts of course!

hustle/interp-prim.rkt
  #lang racket
  (provide interp-prim0 interp-prim1 interp-prim2)

  ;; Op0 -> Value
  (define (interp-prim0 op)
    (match op
      ['read-byte (read-byte)]
      ['peek-byte (peek-byte)]
      ['void      (void)]))

  ;; Op1 Value -> Answer
  (define (interp-prim1 op v)
    (match (list op v)
      [(list 'add1 (? integer?))            (add1 v)]
      [(list 'sub1 (? integer?))            (sub1 v)]
      [(list 'zero? (? integer?))           (zero? v)]
      [(list 'char? v)                      (char? v)]
      [(list 'integer->char (? codepoint?)) (integer->char v)]
      [(list 'char->integer (? char?))      (char->integer v)]
      [(list 'write-byte    (? byte?))      (write-byte v)]
      [(list 'eof-object? v)                (eof-object? v)]
      [(list 'box v)                        (box v)]
      [(list 'unbox (? box?))               (unbox v)]
      [(list 'car (? pair?))                (car v)]
      [(list 'cdr (? pair?))                (cdr v)]
      [(list 'empty? v)                     (empty? v)]
      [(list 'cons? v)                      (cons? v)]
      [_ 'err]))

  ;; Op2 Value Value -> Answer
  (define (interp-prim2 op v1 v2)
    (match (list op v1 v2)
      [(list '+ (? integer?) (? integer?)) (+ v1 v2)]
      [(list '- (? integer?) (? integer?)) (- v1 v2)]
      [(list '< (? integer?) (? integer?)) (< v1 v2)]
      [(list '= (? integer?) (? integer?)) (= v1 v2)]
      [(list 'eq? v1 v2)                    (eq? v1 v2)]
      [(list 'cons v1 v2)                   (cons v1 v2)]
      [_ 'err]))

  ;; Any -> Boolean
  (define (codepoint? v)
    (and (integer? v)
         (or (<= 0 v 55295)
             (<= 57344 v 1114111))))

FIXME

We can try it out:

Examples
> (interp (parse '(cons 1 2)))
'(1 . 2)
> (interp (parse '(car (cons 1 2))))
1
> (interp (parse '(cdr (cons 1 2))))
2
> (interp (parse '(car '(1 . 2))))
parse error '(1 . 2)
> (interp (parse '(cdr '(1 . 2))))
parse error '(1 . 2)
> (interp (parse '(let ((x (cons 1 2)))
(+ (car x) (cdr x)))))
3

Now while this is a perfectly good specification, this interpreter doesn’t really shed light on how constructing inductive data works because it simply uses the mechanism of the defining language to construct it. Inductively defined data is easy to model in this interpreter because we can rely on the mechanisms provided for constructing inductively defined data at the meta-level of Racket.

The real trickiness comes when we want to model such data in an impoverished setting that doesn’t have such things, which of course is the case in assembly.

The main challenge is that a value such as (box v) has a value inside it. Pairs are even worse: (cons v0 v1) has two values inside it. If each value is represented with 64 bits, it would seem a pair takes at a minimum 128-bits to represent (plus we need some bits to indicate this value is a pair). What’s worse, those v0 and v1 may themselves be pairs or boxes. The great power of inductive data is that an arbitrarily large piece of data can be constructed. But it would seem impossible to represent each piece of data with a fixed set of bits.

The solution is to allocate such data in memory, which can in principle be arbitrarily large, and use a pointer to refer to the place in memory that contains the data.

Before tackling the compiler, let’s look at an alternative version of the interpreter that makes explicit a representation of memory and is able to interpret programs that construct and manipulate inductive data without itself relying on those mechanisms.

13.5 Meaning of Hustle programs, explicitly🔗

Let’s develop an alternative interpreter that describes constructing inductive data without itself constructing inductive data.

The key here is to describe explicitly the mechanisms of memory allocation and dereference. Abstractly, memory can be thought of as association between memory addresses and values stored in those addresses. As programs run, there is a current state of the memory, which can be used to look up values (i.e. dereference memory) or to extend by making a new association between an available address and a value (i.e. allocating memory).

The representation of values changes to represent inductive data through pointers to memory:

; type Value* =
; | Integer
; | Boolean
; | Character
; | Eof
; | Void
; | '()
; | (box-ptr Address)
; | (cons-ptr Address)
(struct box-ptr (i))
(struct cons-ptr (i))

; type Address = Natural

Here we have two kinds of pointer values, box pointers and cons pointers. A box value is represented by an address (some natural number) and a tag, the box-ptr constructor, which indicates that the address should be interpreted as the contents of a box. A cons is represented by an address tagged with cons-ptr, indicating that the memory contains a pair of values.

To model memory, we use a list of Value* values. When memory is allocated, new elements are placed at the front of the list. To model memory locations, use the distance from the element to the end of the list, this way addresses don’t change as memory is allocated.

For example, suppose we have allocated memory to hold four values '(97 98 99 100). The address of 100 is 0; the address of 99 is 1; etc. When a new value is allocated, say, '(96 97 98 99), the address of 99 is still 0, and so on. The newly allocated value 96 is at address 4. In this way, memory grows toward higher addresses and the next address to allocate is given by the size of the heap currently in use.

; type Heap = (Listof Value*)

When a program is intepreted, it results in a Value* paired together with a Heap that gives meaning to the addresses in the value, or an error:

; type Answer* = (cons Heap Value*) | 'err

So for example, to represent a box (box 99) we could have a box value, i.e. a tagged pointer that points to memory containing 99:

(cons (list 99) (box-ptr 0))

The value at list index 0 is 99 and the box value points to that element of the heap and indicates it is the contents of a box.

It’s possible that other memory was used in computing this result, so we might end up with an answer like:

(cons (list 97 98 99) (box-ptr 0))

Or:

(cons (list 97 98 99 100 101) (box-ptr 2))

Both of which really mean the same value: (box 99).

A pair contains two values, so a cons-ptr should point to the start of elements that comprise the pair. For example, this answer represents a pair (cons 100 99):

(cons (list 99 100) (cons-ptr 0))

Note that the car of this pair is at address 0 and the cdr is at address 1.

Note that we could have other things residing in memory, but so long as the address points to same values as before, these answers mean the same thing:

(cons (list 97 98 99 100 101) (cons-ptr 1))

In fact, we can reconstruct a Value from a Value* and Heap:

hustle/unload.rkt
  #lang racket
  (provide unload unload-value)
  (require "heap.rkt")

  ;; Answer* -> Answer
  (define (unload a)
    (match a
      ['err 'err]
      [(cons h v) (unload-value v h)]))

  ;; Value* Heap -> Value
  (define (unload-value v h)
    (match v
      [(? integer?)    v]
      [(? boolean?)    v]
      [(? char?)       v]
      [(? eof-object?) v]
      [(? void?)       v]
      ['()             '()]
      [(box-ptr a)
       (box (unload-value (heap-ref h a) h))]
      [(cons-ptr a)
       (cons (unload-value (heap-ref h a) h)
             (unload-value (heap-ref h (add1 a)) h))]))

Which relies on our interface for heaps:

hustle/heap.rkt
  #lang racket
  (provide alloc-box alloc-cons heap-ref heap-set
           (struct-out box-ptr)
           (struct-out cons-ptr))

  (struct box-ptr (i))
  (struct cons-ptr (i))

  ;; Value* Heap -> Answer*
  (define (alloc-box v h)
    (cons (cons v h)
          (box-ptr (length h))))

  ;; Value* Value* Heap -> Answer*
  (define (alloc-cons v1 v2 h)
    (cons (cons v2 (cons v1 h))
          (cons-ptr (length h))))

  ;; Heap Address -> Value*
  (define (heap-ref h a)
    (list-ref h (- (length h) (add1 a))))

  ;; Heap Address Value* -> Heap
  (define (heap-set h i v)
    (list-set h (- (length h) i 1) v))

  ;; Heap Address Natural -> [Listof Value*]
  (define (heap-ref/n h a n)
    (define (loop n vs)
      (match n
        [0 vs]
        [_ (loop (sub1 n)
                 (cons (heap-ref h (+ a n)) vs))]))
    (loop n '()))

Try it out:

Examples
> (unload-value (box-ptr 0) (list 99))
'#&99
> (unload-value (cons-ptr 0) (list 99 100))
'(100 . 99)
> (unload-value (cons-ptr 1) (list 97 98 99 100 101))
'(100 . 99)

What about nested pairs like (cons 1 (cons 2 (cons 3 '())))? Well, we already have all the pieces we need to represent values like these.

Examples
> (unload-value (cons-ptr 0)
(list '() 3 (cons-ptr 4) 2 (cons-ptr 2) 1))
'(1 2 3)

Notice that this list could laid out in many different ways, but when viewed through the lens of unload-value, they represent the same list:

Examples
> (unload-value (cons-ptr 4)
(list (cons-ptr 2) 1 (cons-ptr 0) 2 '() 3))
'(1 2 3)

The idea of the interpreter that explicitly models memory will be thread through a heap that is used to represent the memory allocated by the program. Operations that manipulate or create boxes and pairs will have to be updated to work with this new representation.

So for example, the cons operation should allocate two new memory locations and produce a tagged pointer to the address of the first one. The car operation should dereference the memory pointed to by the given cons-ptr value.

Examples
> (unload (alloc-cons 100 99 '()))
'(100 . 99)
> (unload (alloc-box 99 '()))
'#&99

Much of the work is handled in the new interp-prims-heap module:

hustle/interp-prims-heap.rkt
  #lang racket
  (provide interp-prim0 interp-prim1 interp-prim2)
  (require "heap.rkt")

  ;; Op0 Heap -> Answer*
  (define (interp-prim0 op h)
    (match op
      ['read-byte (cons h (read-byte))]
      ['peek-byte (cons h (peek-byte))]
      ['void      (cons h (void))]))

  ;; Op1 Value* Heap -> Answer*
  (define (interp-prim1 p v h)
    (match (list p v)
      [(list 'add1 (? integer? i))          (cons h (add1 i))]
      [(list 'sub1 (? integer? i))          (cons h (sub1 i))]
      [(list 'zero? (? integer? i))         (cons h (zero? i))]
      [(list 'char? v)                      (cons h (char? v))]
      [(list 'char->integer (? char?))      (cons h (char->integer v))]
      [(list 'integer->char (? codepoint?)) (cons h (integer->char v))]
      [(list 'eof-object? v)                (cons h (eof-object? v))]
      [(list 'write-byte (? byte?))         (cons h (write-byte v))]
      [(list 'box v)                        (alloc-box v h)]
      [(list 'unbox (box-ptr i))            (cons h (heap-ref h i))]
      [(list 'car   (cons-ptr i))           (cons h (heap-ref h i))]
      [(list 'cdr   (cons-ptr i))           (cons h (heap-ref h (add1 i)))]
      [(list 'empty? v)                     (cons h (empty? v))]
      [_                                    'err]))

  ;; Op2 Value* Value* Heap -> Answer*
  (define (interp-prim2 p v1 v2 h)
    (match (list p v1 v2)
      [(list '+ (? integer? i1) (? integer? i2)) (cons h (+ i1 i2))]
      [(list '- (? integer? i1) (? integer? i2)) (cons h (- i1 i2))]
      [(list '< (? integer? i1) (? integer? i2)) (cons h (< i1 i2))]
      [(list '= (? integer? i1) (? integer? i2)) (cons h (= i1 i2))]
      [(list 'eq? v1 v2)
       (match (list v1 v2)
         [(list (cons-ptr a1) (cons-ptr a2)) (cons h (= a1 a2))]
         [(list (box-ptr a1)  (box-ptr a2))  (cons h (= a1 a2))]
         [_                                  (cons h (eqv? v1 v2))])]
      [(list 'cons v1 v2) (alloc-cons v1 v2 h)]
      [_ 'err]))

  ;; Any -> Boolean
  (define (codepoint? v)
    (and (integer? v)
         (or (<= 0 v 55295)
             (<= 57344 v 1114111))))

Examples
> (unload (interp-prim1 'box 99 '()))
'#&99
> (unload (interp-prim2 'cons 100 99 '()))
'(100 . 99)
> (unload
    (match (interp-prim1 'box 99 '())
      [(cons h v)
       (interp-prim1 'unbox v h)]))
99

Finally, we can write the overall interpreter, which threads a heap throughout the interpretation of a program in interp-env-heap. The top-level interp function, which is intended to be equivalent to the original interp function that modelled memory implicitly, calls interp-env-heap with an initially empty heap and the unloads the final answer from the result:

hustle/interp-heap.rkt
  #lang racket
  (provide interp interp-env-heap)
  (require "env.rkt")
  (require "unload.rkt")
  (require "interp-prims-heap.rkt")
  (require "ast.rkt")

  ;; type Answer* =
  ;; | (cons Heap Value*)
  ;; | 'err

  ;; type Value* =
  ;; | Integer
  ;; | Boolean
  ;; | Character
  ;; | Eof
  ;; | Void
  ;; | '()
  ;; | (box-ptr  Address)
  ;; | (cons-ptr Address)


  ;; type Address = Natural

  ;; type Heap = (Listof Value*)
  ;; type REnv = (Listof (List Id Value*))

  ;; Expr -> Answer
  (define (interp e)
    (unload (interp-env-heap e '() '())))

  ;; Expr REnv Heap -> Answer*
  (define (interp-env-heap e r h)
    (match e
      [(Lit d)  (cons h d)]
      [(Eof)    (cons h eof)]
      [(Var x)  (cons h (lookup r x))]
      [(Prim0 p) (interp-prim0 p h)]
      [(Prim1 p e)
       (match (interp-env-heap e r h)
         ['err 'err]
         [(cons h v)
          (interp-prim1 p v h)])]
      [(Prim2 p e1 e2)
       (match (interp-env-heap e1 r h)
         ['err 'err]
         [(cons h v1)
          (match (interp-env-heap e2 r h)
            ['err 'err]
            [(cons h v2)
             (interp-prim2 p v1 v2 h)])])]
      [(If p e1 e2)
       (match (interp-env-heap p r h)
         ['err 'err]
         [(cons h v)
          (if v
              (interp-env-heap e1 r h)
              (interp-env-heap e2 r h))])]
      [(Begin e1 e2)
       (match (interp-env-heap e1 r h)
         ['err 'err]
         [(cons h _) (interp-env-heap e2 r h)])]
      [(Let x e1 e2)
       (match (interp-env-heap e1 r h)
         ['err 'err]
         [(cons h v)
          (interp-env-heap e2 (ext r x v) h)])]))

13.6 Representing Hustle values🔗

The first thing do is make another distinction in the kind of values in our language. Up until now, each value could be represented in a register. We now call such values immediate values.

We introduce a new category of values which are pointer values. We will (for now) have two types of pointer values: boxes and pairs.

So we now have a kind of hierarchy of values:

- values

+ pointers (non-zero in last 3 bits)

* boxes

* pairs

+ immediates (zero in last three bits)

* integers

* characters

* booleans

* ...

We will represent this hierarchy by shifting all the immediates over 3 bits and using the lower 3 bits to tag things as either being immediate (tagged #b000) or a box or pair. To recover an immediate value, we just shift back to the right 3 bits.

The pointer types will be tagged in the lowest three bits. A box value is tagged #b001 and a pair is tagged #b010. The remaining 61 bits will hold a pointer, i.e. an integer denoting an address in memory.

The idea is that the values contained within a box or pair will be located in memory at this address. If the pointer is a box pointer, reading 64 bits from that location in memory will produce the boxed value. If the pointer is a pair pointer, reading the first 64 bits from that location in memory will produce one of the value in the pair and reading the next 64 bits will produce the other. In other words, constructors allocate and initialize memory. Projections dereference memory.

The representation of pointers will follow a slightly different scheme than that used for immediates. Let’s first talk a bit about memory and addresses.

A memory location is represented (of course, it’s all we have!) as a number. The number refers to some address in memory. On an x86 machine, memory is byte-addressable, which means each address refers to a 1-byte (8-bit) segment of memory. If you have an address and you add 1 to it, you are refering to memory starting 8-bits from the original address.

We will make a simplifying assumption and always store things in memory in multiples of 64-bit chunks. So to go from one memory address to the next word of memory, we need to add 8 (1-byte times 8 = 64 bits) to the address.

What is 8 in binary? #b1000

What’s nice about this is that if we start from a memory location that is “word-aligned,” i.e. it ends in #b000, then every 64-bit index also ends in #b000.

What this means is that every address we’d like to represent has #b000 in its least signficant bits. We can therefore freely uses these three bits to tag the type of the pointer without needing to shift the address around. If we have a box pointer, we can simply zero out the box type tag to obtain the address of the boxes content. Likewise with pairs.

We use a register, 'rbx, to hold the address of the next free memory location in memory. To allocate memory, we simply increment the content of 'rbx by a multiple of 8. To initialize the memory, we just write into the memory at that location. To construct a pair or box value, we just tag the unused bits of the address.

The generated code will have to coordinate with the run-time system to initialize 'rbx appropriately, which we discuss in A Run-Time for Hustle.

So for example the following creates a box containing the value 7:

(seq (Mov 'rax (value->bits 7))
     (Mov (Offset 'rbx 0) 'rax) ; write '7' into address held by rbx
     (Mov 'rax 'rbx)            ; copy pointer into return register
     (Or 'rax type-box)         ; tag pointer as a box
     (Add 'rbx 8))              ; advance rbx one word

If 'rax holds a box value, we can “unbox” it by erasing the box tag, leaving just the address of the box contents, then dereferencing the memory:

(seq (Xor 'rax type-box) ; erase the box tag
(Mov 'rax (Offset 'rax 0))) ; load memory into rax

Pairs are similar, only they are represented as tagged pointers to two words of memory. Suppose we want to make (cons 3 4):

(seq (Mov 'rax (value->bits 4))
     (Mov (Offset 'rbx 0) 'rax) ; write '4' into address held by rbx
     (Mov 'rax (value->bits 3))
     (Mov (Offset 'rbx 8) 'rax) ; write '3' into word after address held by rbx
     (Mov 'rax rbx)             ; copy pointer into return register
     (Or 'rax type-cons)        ; tag pointer as a pair
     (Add 'rbx 16))             ; advance rbx 2 words

This code writes two words of memory and leaves a tagged pointer in 'rax. It’s worth noting that we chose to write the cdr of the pair into the first word of memory and the car into the second. This may seem like a strange choice, but how we lay out the memory is in some sense an arbitrary choice, so long as all our pair operations respect this layout. We could have just as easily done the car first and cdr second. The reason for laying out pairs as we did will make things slightly more convenient when implementing the cons primitive as we’ll see later.

If 'rax holds a pair value, we can project out the elements by erasing the pair tag, leaving just the address of the pair contents, then dereferencing either the first or second word of memory:

(seq (Xor 'rax type-cons)         ; erase the pair tag
     (Mov 'rax (Offset 'rax 8))   ; load car into rax
     (Mov 'rax (Offset 'rax 0)))  ; or... load cdr into rax

From here, writing the compiler for box, unbox, cons, car, and cdr is just a matter of putting together pieces we’ve already seen such as evaluating multiple subexpressions and type tag checking before doing projections.

13.7 A Compiler for Hustle🔗

The compiler for Hustle is essentially the same as for Fraud, although now with support for the new primitives: box, unbox, box?, cons, car, car, cdr, cons?, and empty?:

hustle/compile-ops.rkt
  #lang racket
  (provide compile-op0 compile-op1 compile-op2 pad-stack)
  (require "ast.rkt")
  (require "types.rkt")
  (require "assert.rkt")
  (require a86/ast a86/registers)

  ;; Op0 -> Asm
  (define (compile-op0 p)
    (match p
      ['void      (seq (Mov rax (value->bits (void))))]
      ['read-byte (seq pad-stack (Call 'read_byte) unpad-stack)]
      ['peek-byte (seq pad-stack (Call 'peek_byte) unpad-stack)]))

  ;; Op1 -> Asm
  (define (compile-op1 p)
    (match p
      ['add1
       (seq (assert-integer rax)
            (Add rax (value->bits 1)))]
      ['sub1
       (seq (assert-integer rax)
            (Sub rax (value->bits 1)))]
      ['zero?
       (seq (assert-integer rax)
            (Cmp rax 0)
            if-equal)]
      ['char?
       (seq (And rax mask-char)
            (Cmp rax type-char)
            if-equal)]
      ['char->integer
       (seq (assert-char rax)
            (Sar rax char-shift)
            (Sal rax int-shift))]
      ['integer->char
       (seq (assert-codepoint rax)
            (Sar rax int-shift)
            (Sal rax char-shift)
            (Xor rax type-char))]
      ['eof-object?
       (seq (Cmp rax (value->bits eof))
            if-equal)]
      ['write-byte
       (seq (assert-byte rax)
            pad-stack
            (Mov rdi rax)
            (Call 'write_byte)
            unpad-stack)]
      ['box
       (seq (Mov (Mem rbx) rax) ; memory write
            (Mov rax rbx)            ; put box in rax
            (Xor rax type-box)       ; tag as a box
            (Add rbx 8))]
      ['unbox
       (seq (assert-box rax)
            (Mov rax (Mem (- type-box) rax)))]
      ['car
       (seq (assert-cons rax)
            (Mov rax (Mem (- 8 type-cons) rax)))]
      ['cdr
       (seq (assert-cons rax)
            (Mov rax (Mem (- type-cons) rax)))]

      ['empty? (seq (Cmp rax (value->bits '())) if-equal)]
      ['cons? (type-pred ptr-mask type-cons)]
      ['box?  (type-pred ptr-mask type-box)]))


  ;; Op2 -> Asm
  (define (compile-op2 p)
    (match p
      ['+
       (seq (Pop r8)
            (assert-integer r8)
            (assert-integer rax)
            (Add rax r8))]
      ['-
       (seq (Pop r8)
            (assert-integer r8)
            (assert-integer rax)
            (Sub r8 rax)
            (Mov rax r8))]
      ['<
       (seq (Pop r8)
            (assert-integer r8)
            (assert-integer rax)
            (Cmp r8 rax)
            if-lt)]
      ['=
       (seq (Pop r8)
            (assert-integer r8)
            (assert-integer rax)
            (Cmp r8 rax)
            if-equal)]
      ['cons
       (seq (Mov (Mem rbx) rax)
            (Pop rax)
            (Mov (Mem 8 rbx) rax)
            (Mov rax rbx)
            (Xor rax type-cons)
            (Add rbx 16))]
      ['eq?
       (seq (Pop r8)
            (Cmp rax r8)
            if-equal)]))

  (define (type-pred mask type)
    (seq (And rax mask)
         (Cmp rax type)
         if-equal))

  ;; Asm
  ;; set rax to #t or #f if comparison flag is equal
  (define if-equal
    (seq (Mov rax (value->bits #f))
         (Mov r9  (value->bits #t))
         (Cmove rax r9)))

  ;; Asm
  ;; set rax to #t or #f if comparison flag is less than
  (define if-lt
    (seq (Mov rax (value->bits #f))
         (Mov r9  (value->bits #t))
         (Cmovl rax r9)))


  ;; Asm
  ;; Dynamically pad the stack to be aligned for a call
  (define pad-stack
    (seq (Mov r15 rsp)
         (And r15 #b1000)
         (Sub rsp r15)))

  ;; Asm
  ;; Undo the stack alignment after a call
  (define unpad-stack
    (seq (Add rsp r15)))

We can now confirm that the compiler generates code similar to what we wrote by hand above:

Examples
> (define (show e c)
(compile-e (parse e) c))
> (show '(box 7) '())
(list
(Mov 'rax 112)
(Mov (Mem #f #f 'rbx #f #f) 'rax)
(Mov 'rax 'rbx)
(Xor 'rax 1)
(Add 'rbx 8))

This moves the encoding of 7 into 'rax, then writes it into the memory address pointed to by 'rbx, i.e. the next free memory location. That address is then moved to 'rax and tagged as a box, which is the result of the expression. The final step is to increment 'rbx by 8 to advance the free memory pointer since one word of memory is now used.

Suppose we have a box value bound to variable x, then this code will unbox the value:

Examples
> (show '(unbox x) '(x))
(list
(Mov 'rax (Mem #f 0 'rsp #f #f))
(Push 'rax)
(And 'rax 7)
(Cmp 'rax 1)
(Pop 'rax)
(Jne 'err)
(Mov 'rax (Mem #f -1 'rax #f #f)))

This loads x from the stack into 'rax, then does tag checking to make sure it’s a box pointer, after which it erases the tag to reveal the address and loads that memory address into 'rax, thereby retrieving the value in the box.

The way that cons, car, and cdr work are essentially the same, except that pairs hold two values instead of one:

Examples
> (show '(cons 7 5) '())
(list
(Mov 'rax 112)
(Push 'rax)
(Mov 'rax 80)
(Mov (Mem #f #f 'rbx #f #f) 'rax)
(Pop 'rax)
(Mov (Mem #f 8 'rbx #f #f) 'rax)
(Mov 'rax 'rbx)
(Xor 'rax 2)
(Add 'rbx 16))
> (show '(car x) '(x))
(list
(Mov 'rax (Mem #f 0 'rsp #f #f))
(Push 'rax)
(And 'rax 7)
(Cmp 'rax 2)
(Pop 'rax)
(Jne 'err)
(Mov 'rax (Mem #f 6 'rax #f #f)))
> (show '(cdr x) '(x))
(list
(Mov 'rax (Mem #f 0 'rsp #f #f))
(Push 'rax)
(And 'rax 7)
(Cmp 'rax 2)
(Pop 'rax)
(Jne 'err)
(Mov 'rax (Mem #f -2 'rax #f #f)))

We can now see why we chose to layout pairs with the cdr first and car second. Since cons is a binary operation, the expression which produces the car value will be evaluated first and pushed on the stack. Then the expression that produces the cdr value will execute with its result sitting in rax. So at this point it’s easiest to write out the cdr since it’s already sitting in a register. Once we do that, we can pop the car value into 'rax and write that. Hence our choice for the layout.

13.8 A Run-Time for Hustle🔗

First, we extend our runtime system’s view of values to include pointers and use C struct to represent them:

values.h

#ifndef VALUES_H
#define VALUES_H

#include <stdint.h>

/* any abstract value */
typedef int64_t val_t;

typedef enum type_t {
  T_INVALID = -1,
  /* immediates */
  T_INT,
  T_BOOL,
  T_CHAR,
  T_EOF,  
  T_VOID,
  T_EMPTY,
  /* pointers */
  T_BOX,
  T_CONS, 
} type_t;

typedef uint32_t val_char_t;
typedef struct val_box_t {
  val_t val;
} val_box_t;
typedef struct val_cons_t {
  val_t snd;
  val_t fst;
} val_cons_t;

/* return the type of x */
type_t val_typeof(val_t x);

/**
 * Wrap/unwrap values
 *
 * The behavior of unwrap functions are undefined on type mismatch.
 */
int64_t val_unwrap_int(val_t x);
val_t val_wrap_int(int64_t i);
val_t val_wrap_byte(unsigned char b);

int val_unwrap_bool(val_t x);
val_t val_wrap_bool(int b);

val_char_t val_unwrap_char(val_t x);
val_t val_wrap_char(val_char_t b);

val_t val_wrap_eof();

val_t val_wrap_void();

val_box_t* val_unwrap_box(val_t x);
val_t val_wrap_box(val_box_t* b);

val_cons_t* val_unwrap_cons(val_t x);
val_t val_wrap_cons(val_cons_t* c);

#endif

The implementation of val_typeof is extended to handle pointer types:

values.c

#include "types.h"
#include "values.h"

type_t val_typeof(val_t x)
{
  switch (x & ptr_type_mask) {
  case box_type_tag:
    return T_BOX;
  case cons_type_tag:
    return T_CONS;
  }

  if ((int_type_mask & x) == int_type_tag)
    return T_INT;
  if ((char_type_mask & x) == char_type_tag)
    return T_CHAR;

  switch (x) {
  case val_true:
  case val_false:
    return T_BOOL;
  case val_eof:
    return T_EOF;
  case val_void:
    return T_VOID;
  case val_empty:
    return T_EMPTY;
  }

  return T_INVALID;
}

int64_t val_unwrap_int(val_t x)
{
  return x >> int_shift;
}
val_t val_wrap_int(int64_t i)
{
  return (i << int_shift) | int_type_tag;
}
val_t val_wrap_byte(unsigned char b)
{
  return (b << int_shift) | int_type_tag;
}

int val_unwrap_bool(val_t x)
{
  return x == val_true;
}
val_t val_wrap_bool(int b)
{
  return b ? val_true : val_false;
}

val_char_t val_unwrap_char(val_t x)
{
  return (val_char_t)(x >> char_shift);
}
val_t val_wrap_char(val_char_t c)
{
  return (((val_t)c) << char_shift) | char_type_tag;
}

val_t val_wrap_eof(void)
{
  return val_eof;
}

val_t val_wrap_void(void)
{
  return val_void;
}

val_box_t* val_unwrap_box(val_t x)
{
  return (val_box_t *)(x ^ box_type_tag);
}
val_t val_wrap_box(val_box_t* b)
{
  return ((val_t)b) | box_type_tag;
}

val_cons_t* val_unwrap_cons(val_t x)
{
  return (val_cons_t *)(x ^ cons_type_tag);
}
val_t val_wrap_cons(val_cons_t *c)
{
  return ((val_t)c) | cons_type_tag;
}

The rest of the run-time system for Hustle is more involved for two main reasons:

The first is that the compiler relies on a pointer to free memory residing in 'rbx. The run-time system will be responsible for allocating this memory and initializing the 'rdi register. To allocate memory, it uses malloc. It passes the pointer returned by malloc to the entry function. The protocol for calling functions in C says that the first argument will be passed in the 'rdi register. Since malloc produces 16-byte aligned addresses on 64-bit machines, 'rdi is initialized with an address that ends in #b000, satisfying our assumption about addresses.

Once the runtime system has provided the heap address in 'rdi, it becomes our responsibility to keep track of that value. Because 'rdi is used to pass arguments to C functions, we can’t keep our heap pointer in 'rdi and expect it to be saved. This leaves us with two options:

We can ensure that we save 'rdi somewhere safe whenever we might call a C function
We can move the value away from 'rdi as soon as possible and never have to worry about 'rdi being clobbered during a call to a C function (as long as we pick a good place!)

We’ve decided to use the second option, which leaves the choice of where to move the value once we receive it from the runtime system. As usual, we will consult the System V Calling Convention, which tells us that 'rbx is a callee save register, which means that any C function we might call is responsible for ensuring that the value in the register is saved and restored. In other words: we, the caller, don’t have to worry about it! Because of this we’re going to use 'rbx to store our heap pointer. You can see that we do this in the compiler with (Mov 'rbx 'rdi) as part of our entry code.

main.c

#include <stdio.h>
#include <stdlib.h>
#include "values.h"
#include "print.h"
#include "runtime.h"

FILE* in;
FILE* out;
void (*error_handler)();
val_t *heap;

void error_exit()
{
  printf("err\n");
  exit(1);
}

void raise_error()
{
  return error_handler();
}

int main(int argc, char** argv)
{
  in = stdin;
  out = stdout;
  error_handler = &error_exit;
  heap = malloc(8 * heap_size);

  val_t result;

  result = entry(heap);

  print_result(result);
  if (val_typeof(result) != T_VOID)
    putchar('\n');

  free(heap);
  return 0;
}

The second complication comes from printing. Now that values include inductively defined data, the printer must recursively traverse these values to print them. It also must account for the wrinkle of how the printing of proper and improper lists is different:

print.c

#include <stdio.h>
#include <inttypes.h>
#include "values.h"

void print_char(val_char_t);
void print_codepoint(val_char_t);
void print_cons(val_cons_t *);
void print_result_interior(val_t);
int utf8_encode_char(val_char_t, char *);

void print_result(val_t x)
{
  switch (val_typeof(x)) {
  case T_INT:
    printf("%" PRId64, val_unwrap_int(x));
    break;
  case T_BOOL:
    printf(val_unwrap_bool(x) ? "#t" : "#f");
    break;
  case T_CHAR:
    print_char(val_unwrap_char(x));
    break;
  case T_EOF:
    printf("#<eof>");
    break;
  case T_VOID:
    break;
  case T_EMPTY:
  case T_BOX:
  case T_CONS:
    printf("'");
    print_result_interior(x);
    break;
  case T_INVALID:
    printf("internal error");
  }
}

void print_result_interior(val_t x)
{
  switch (val_typeof(x)) {
  case T_EMPTY:
    printf("()");
    break;
  case T_BOX:
    printf("#&");
    print_result_interior(val_unwrap_box(x)->val);
    break;
  case T_CONS:
    printf("(");
    print_cons(val_unwrap_cons(x));
    printf(")");
    break;
  default:
    print_result(x);
  }
}

void print_cons(val_cons_t *cons)
{
  print_result_interior(cons->fst);

  switch (val_typeof(cons->snd)) {
  case T_EMPTY:
    // nothing
    break;
  case T_CONS:
    printf(" ");
    print_cons(val_unwrap_cons(cons->snd));
    break;
  default:
    printf(" . ");
    print_result_interior(cons->snd);
    break;
  }
}

void print_char(val_char_t c)
{
  printf("#\\");
  switch (c) {
  case 0:
    printf("nul"); break;
  case 8:
    printf("backspace"); break;
  case 9:
    printf("tab"); break;
  case 10:
    printf("newline"); break;
  case 11:
    printf("vtab"); break;
  case 12:
    printf("page"); break;
  case 13:
    printf("return"); break;
  case 32:
    printf("space"); break;
  case 127:
    printf("rubout"); break;
  default:
    print_codepoint(c);
  }
}

void print_codepoint(val_char_t c)
{
  char buffer[5] = {0};
  utf8_encode_char(c, buffer);
  printf("%s", buffer);
}

int utf8_encode_char(val_char_t c, char *buffer)
{
  // Output to buffer using UTF-8 encoding of codepoint
  // https://en.wikipedia.org/wiki/UTF-8
  if (c < 128) {
    buffer[0] = (char) c;
    return 1;
  } else if (c < 2048) {
    buffer[0] =  (char)(c >> 6)       | 192;
    buffer[1] = ((char)       c & 63) | 128;
    return 2;
  } else if (c < 65536) {
    buffer[0] =  (char)(c >> 12)      | 224;
    buffer[1] = ((char)(c >> 6) & 63) | 128;
    buffer[2] = ((char)       c & 63) | 128;
    return 3;
  } else {
    buffer[0] =  (char)(c >> 18)       | 240;
    buffer[1] = ((char)(c >> 12) & 63) | 128;
    buffer[2] = ((char)(c >>  6) & 63) | 128;
    buffer[3] = ((char)        c & 63) | 128;
    return 4;
  }
}

13.9 Correctness🔗

The statement of correctness for the Hustle compiler is the same as the previous one:

Compiler Correctness: For all e ∈ ClosedExpr, i, o ∈ String, and v ∈ Value, if (interp/io e i) equals (cons v o), then (exec/io e i) equals (cons v o).

	Syllabus
	Texts
	Schedule
	Notes
	Assignments
	Midterms
	Project
	Software
	Colophon

1	What is a Compiler?
2	From OCaml to Racket
3	a86: a Little Assembly Language
4	a86 Reference
5	Abscond: a language of numbers
6	Blackmail: incrementing and decrementing
7	Con: branching with conditionals
8	Dupe: a duplicity of types
9	Dodger: addressing a lack of character
10	Evildoer: change the world a couple nibbles at a time
11	Extort: when errors exist
12	Fraud: local binding, variables, and binary operations
13	Hustle: heaps and lists
14	Hoax: vectors and strings
15	Iniquity: function definitions and calls
16	Jig: jumping to tail calls
17	Knock: pattern matching
18	Loot: lambda the ultimate
19	Mug: symbols and interned string literals
20	Mountebank: quote and compound static data
21	Neerdowell: structures
22	Outlaw: self-hosting

13.1	Inductive data
13.2	Empty lists can be all and end all
13.3	Parsing
13.4	Meaning of Hustle programs, implicitly
13.5	Meaning of Hustle programs, explicitly
13.6	Representing Hustle values
13.7	A Compiler for Hustle
13.8	A Run-Time for Hustle
13.9	Correctness