Word Macro Stuff

Faced with over 100 Word VBA macros and a request for information about which of them call a certain web service, how do you get that information without going insane? With a lot of Python, some Windows black magic, and a couple of open Microsoft “standards”.

Recently a team came to me to ask if it was possible to figure out who called one of our old applications. The application is a horrible pain for the team maintaining it, and they’d like to shut it down. Not least because it’s basically impossible to build. They had traced down a lot of applications already, but they were stuck on a folder brimming with Word VBA macros. From experience, they knew some of them called this application, but which?

$ find <macrodir> -name "*.dotm" | wc -l
<some number>

Mass analysis of embedded VBA code poses a rather interesting problem. Since the code is embedded within the Word file in a binary format, I can’t simply grep through the source. Plenty of tools exist that let you (to varying degrees) extract the text from Word documents, but in this case we want the embedded VBA source code.
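Before exporting anything, it can help to know which documents carry a VBA project at all. A macro-enabled Word file is a ZIP container, and the project sits in the word/vbaProject.bin member. The following is a minimal sketch, assuming the files are in the newer ZIP-based format; the helper name is made up for illustration.

```python
import zipfile

VBA_PART = "word/vbaProject.bin"


def vba_project_size(document):
    """Return the uncompressed size of the VBA part, or None if there is none.

    `document` may be a path or an open binary file object.
    """
    with zipfile.ZipFile(document) as archive:
        try:
            return archive.getinfo(VBA_PART).file_size
        except KeyError:
            # The archive has no embedded VBA project.
            return None
```

Older binary .dot/.doc files are not ZIP archives, so zipfile would raise BadZipFile for them; those would need separate handling.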

The VBA editor embedded in every Office application contains a feature that lets the user export the code. Fortunately, Windows contains an IPC mechanism called OLE Automation which exposes that functionality. I like the pywin32 module for talking OLE from Python.

The simple procedure then is to 1) start Word by creating a new Word COM object, 2) open the document, and 3) access the Visual Basic Project of the open document to export the code.

python export.py

import win32com.client
import click
from pathlib import Path

extensions = {
    1: ".bas",   # Module
    2: ".cls",   # Class Module
    3: ".frm",   # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls", # Excel objects codebehind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}

@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    word = win32com.client.Dispatch("Word.Application")
    # COM calls need plain strings, not Path objects
    document = word.Documents.Open(str(wbPath.resolve()))
    project = document.VBProject

    print(f"Exporting references")
    references = project.References
    with open(outdir / "references.txt", "w") as f:
        for reference in references:
            if reference.Type == 0:
                continue

            referencePath = Path(reference.FullPath)
            f.write(f"{referencePath}\n")

    print(f"Exporting code")
    components = project.VBComponents
    for component in components:
        if component.Type in skip:
            continue

        extension = extensions[component.Type]
        filename = f"{component.Name}{extension}"
        outpath = outdir / filename
        print(f"-> {filename}")
        component.Export(str(outpath.resolve()))

    document.Close()
    word.Quit()

if __name__ == '__main__':
    main()

Now I should be able to simply run this command for every document to export their Visual Basic files.

$ ./export.py document.dotm
<error>

Ooh, and so began my adventure into the depths of Office. It turns out you can’t simply access the Visual Basic Project. Due to (in hindsight) pretty obvious security considerations, this functionality has been disabled by default for quite a while. That’s easy enough to solve though: I just changed the setting. Let’s try that command again.

$ ./export.py document.dotm
<different error about denied>

As it turns out, the person who made this VBA code didn’t want just anybody to mess around with it. The Office VBA format contains support for password protecting the VBA project such that you can’t open it without knowing the password. Being a developer I could probably get the password from someone, but I had nowhere to use it. The Visual Basic Project object model doesn’t contain any way to actually unlock the project, even with the password. I’d have to go through every document and manually unlock them first.

Or I would have, if there weren’t a pretty simple way to circumvent the Office VBA password protection. At this point I didn’t really understand how this worked, but I suspected it was simply corrupting the file in such a way that Office could not determine it was password protected. It’s pretty interesting that the VBA project is not encrypted within the file.

So I needed to extend the script so that it would 1) unzip word/vbaProject.bin, 2) search for DPB and replace it with DPx, 3) place word/vbaProject.bin back into the file, and 4) continue extraction as normal.

python export.py

import win32com.client
import click
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile
import re

extensions = {
    1: ".bas",   # Module
    2: ".cls",   # Class Module
    3: ".frm",   # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls", # Excel objects codebehind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}

@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    with TemporaryDirectory() as temp:
        vbaProjPath = Path("word/vbaProject.bin")
        try:
            with ZipFile(wbPath) as archive:
                archive.extract("word/vbaProject.bin", temp)
        except KeyError:
            exit(1)

        vba_temp = temp / vbaProjPath

        with open(vba_temp, "r+b") as f:
            content = f.read()
            content = re.sub(b"DPB=", b"DPx=", content)
            f.seek(0)
            f.truncate()
            f.write(content)

        # TemporaryDirectory yields a str, so wrap it before joining paths
        temp_workbook = Path(temp) / wbPath.name
        with ZipFile(wbPath, "r") as zin, \
                ZipFile(temp_workbook, "w") as zout:
            for item in zin.infolist():
                if item.filename == "word/vbaProject.bin":
                    continue
                zout.writestr(item.filename, zin.read(item.filename))
            zout.write(vba_temp, "word/vbaProject.bin")

        word = win32com.client.Dispatch("Word.Application")
        # COM calls need plain strings, not Path objects
        document = word.Documents.Open(str(temp_workbook.resolve()))
        project = document.VBProject

        print(f"Exporting references")
        references = project.References
        with open(outdir / "references.txt", "w") as f:
            for reference in references:
                if reference.Type == 0:
                    continue

                referencePath = Path(reference.FullPath)
                f.write(f"{referencePath}\n")

        print(f"Exporting code")
        components = project.VBComponents
        for component in components:
            if component.Type in skip:
                continue

            extension = extensions[component.Type]
            filename = f"{component.Name}{extension}"
            outpath = outdir / filename
            print(f"-> {filename}")
            component.Export(str(outpath.resolve()))

        document.Close()
        word.Quit()

if __name__ == '__main__':
    main()

Now, finally, we can extract the VB code.

$ ./export.py document.dotm
<some other error about being corrupted>

No, once again I was defeated by odd edge cases. Word will not open a corrupted file if it’s being opened through the COM interface. It won’t even try to repair it. Luckily for me, Microsoft actually opened up the spec for the Office VBA format (MS-OVBA as Microsoft calls it¹), so I should be able to reconstruct the file perfectly.

The mysterious DPB field is described on page 25, section 2.3.1.16, of the document. It’s where Office stores the hash of the password, which it validates against what the user enters. By renaming the field to DPx I was corrupting the file so that Word could not find the password, and it was forced to assume the project wasn’t password protected as it reconstructed it.

Through the manual I found that the DPB field makes reference to something called Data Encryption to encode the password. The encryption scheme ties the value of this field to the values of the ID, CMG, and GC fields through the ProjKey. By decoding the current value first it was possible to recover the ProjKey, which I could then use to encode the NULL password before substituting it in for the old DPB value.
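The script only ever decodes the key, but the decoding direction can be sketched in full as the inverse of the encoder. This is a minimal sketch, not the script's code: the function names are made up, the ProjKey 0x5A is a hypothetical value, and the filler byte 7 is arbitrary, as in the script.

```python
def encrypt(proj_key, data, seed=0x00):
    """Encode `data` with the Data Encryption scheme (format version 2)."""
    version_enc = seed ^ 2
    key_enc = seed ^ proj_key
    out = [seed, version_enc, key_enc]
    unenc1, enc1, enc2 = proj_key, key_enc, version_enc

    ignored = [7] * ((seed & 6) // 2)               # filler bytes, value is arbitrary
    length = list(len(data).to_bytes(4, "little"))  # 4-byte little-endian length
    for b in ignored + length + list(data):
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        out.append(b_enc)
        unenc1, enc1, enc2 = b, b_enc, enc1
    return bytes(out)


def decrypt(blob):
    """Invert encrypt(): return (version, proj_key, data)."""
    seed, version_enc, key_enc = blob[0], blob[1], blob[2]
    proj_key = seed ^ key_enc
    unenc1, enc1, enc2 = proj_key, key_enc, version_enc

    plain = []
    for b_enc in blob[3:]:
        b = b_enc ^ ((enc2 + unenc1) & 0xFF)
        plain.append(b)
        unenc1, enc1, enc2 = b, b_enc, enc1

    skip = (seed & 6) // 2                          # number of filler bytes to drop
    length = int.from_bytes(bytes(plain[skip:skip + 4]), "little")
    start = skip + 4
    return seed ^ version_enc, proj_key, bytes(plain[start:start + length])


# Hypothetical ProjKey; the real one is recovered from the existing DPB value.
blob = encrypt(0x5A, b"\x00", seed=0x06)
print(blob.hex().upper())
print(decrypt(blob))
```

Because every encoded byte depends on the two previous encoded bytes and the previous plain byte, the decoder has to walk the value from the start, which is why the seed and key come first.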

python export.py

import win32com.client
import click
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile
import re

extensions = {
    1: ".bas",   # Module
    2: ".cls",   # Class Module
    3: ".frm",   # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls", # Excel objects codebehind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}

# Partial implementation of decryption of the _Data Encryption_ scheme
def read_key(val):
    seed = val[0]
    pkey_enc = val[2]
    return seed ^ pkey_enc

# Implementation of the _Data Encryption_ scheme
def enc(pkey, data):
    seed = 0x00
    pkey_enc = seed ^ pkey
    ver_enc = seed ^ 2

    unenc1 = pkey
    enc1 = pkey_enc
    enc2 = ver_enc

    ign_len = int((seed & 6) / 2)
    ign = []
    for i in range(0, ign_len):
        tmp = 7
        b_enc = tmp ^ ((enc2 + unenc1) & 0xFF)
        ign.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = tmp

    data_len = len(data)
    data_len_enc = []
    for i in range(0, 4):
        b = (data_len >> (i * 8)) & 0xFF
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_len_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    data_enc = []
    for b in data:
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    return bytes([seed, ver_enc, pkey_enc, *ign, *data_len_enc, *data_enc])


@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    with TemporaryDirectory() as temp:
        vbaProjPath = Path("word/vbaProject.bin")
        try:
            with ZipFile(wbPath) as archive:
                archive.extract("word/vbaProject.bin", temp)
        except KeyError:
            exit(1)

        vba_temp = temp / vbaProjPath

        with open(vba_temp, "r+b") as f:
            content = f.read()

            orig_len = len(content)
            match = re.search(b"DPB=\"([A-Z0-9]+)\"", content)
            dpb_enc = bytes.fromhex(match.group(1).decode("ascii"))
            pkey = read_key(dpb_enc)

            dpb = enc(pkey, [0]).hex().upper().encode("ascii")

            content = re.sub(match.group(0), b"DPB=\"" + dpb + b"\"", content)
            # Pad with carriage returns so the file keeps its original length
            content += b"\r" * (orig_len - len(content))

            f.seek(0)
            f.truncate()
            f.write(content)

        # TemporaryDirectory yields a str, so wrap it before joining paths
        temp_workbook = Path(temp) / wbPath.name
        with ZipFile(wbPath, "r") as zin, \
                ZipFile(temp_workbook, "w") as zout:
            for item in zin.infolist():
                if item.filename == "word/vbaProject.bin":
                    continue
                zout.writestr(item.filename, zin.read(item.filename))
            zout.write(vba_temp, "word/vbaProject.bin")

        word = win32com.client.Dispatch("Word.Application")
        # COM calls need plain strings, not Path objects
        document = word.Documents.Open(str(temp_workbook.resolve()))
        project = document.VBProject

        print(f"Exporting references")
        references = project.References
        with open(outdir / "references.txt", "w") as f:
            for reference in references:
                if reference.Type == 0:
                    if reference.Name in defaultRefs:
                        continue
                    continue

                referencePath = Path(reference.FullPath)
                f.write(f"{referencePath}\n")

        print(f"Exporting code")
        components = project.VBComponents
        for component in components:
            if component.Type in skip:
                continue

            extension = extensions[component.Type]
            filename = f"{component.Name}{extension}"
            outpath = outdir / filename
            print(f"-> {filename}")
            component.Export(str(outpath.resolve()))

        document.Close()
        word.Quit()

if __name__ == '__main__':
    main()

$ ./export.py document.dotm
$ ls
module.vba

Awesome, with it working for one file I assumed it would work for all my files. That was a bold assumption. After exporting quite a few files I found one that wouldn’t work. It was crashing the part of the code that extracted the DPB field.

$ ./export.py otherdocument.dotm
<crash about DPB not being found>

If you read the beginning of the MS-OVBA specification, you find that the file isn’t a simple text file; it’s a Compound File Binary (MS-CFB in Microsoft language²). The CFB specification contains this helpful diagram:

[Figure: file layout diagram from the MS-CFB specification]

A CFB is a FAT-like filesystem in a single file. This explains why the DPB field wasn’t found. It was broken up by an unrelated sector. To correctly and reliably replace the DPB value I had to 1) recreate the sector stream, 2) replace the field in that, and 3) place that stream back in the file. According to the Office VBA specification, the stream we want is the PROJECT stream, located in the root directory³. Before I could do anything then, I would first have to locate that stream.
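To see why a plain search can miss the field, here is a toy illustration of a FAT-style sector chain. The 8-byte sectors, the three-entry FAT, and the function names are made up for the example; real CFB files use 512- or 4096-byte sectors and the FAT is read from the file itself.

```python
ENDOFCHAIN = 0xFFFFFFFE


def follow_chain(fat, start):
    """Return the sector numbers of a stream, in order, by walking the FAT."""
    chain, seen = [], set()
    cursor = start
    while cursor != ENDOFCHAIN:
        if cursor in seen:
            raise ValueError("FAT chain loops back on itself")
        seen.add(cursor)
        chain.append(cursor)
        cursor = fat[cursor]
    return chain


def read_stream(image, fat, start, sector_size, size):
    """Concatenate a stream's sectors; sector n starts at (n + 1) * sector_size."""
    parts = [
        image[(n + 1) * sector_size:(n + 2) * sector_size]
        for n in follow_chain(fat, start)
    ]
    return b"".join(parts)[:size]


# Toy image: one header "sector", then three data sectors of 8 bytes each.
# The stream we want lives in sectors 0 and 2; sector 1 belongs to something else.
image = bytes(8) + b'...DPB="' + b"UNRELATE" + b'0002"...'
fat = [2, ENDOFCHAIN, ENDOFCHAIN]

print(b'DPB="0002"' in image)                          # searched in the raw file
print(b'DPB="0002"' in read_stream(image, fat, 0, 8, 16))  # searched in the stream
```

The raw search fails because the value straddles the unrelated sector, while the reassembled stream contains it in one piece.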

Locating the specific stream involved parsing the header of the CFB, finding the directory stream, searching that for the stream, and finally reading the stream out of the sector chain. It turned out that the PROJECT stream is actually so small that it’s contained within the ministream. I won’t go through the whole process here, and I did cut some corners to make the parsing simpler.
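As an illustration of the first of those steps, the header fields can be pulled out with the struct module. This is a minimal sketch using the same byte offsets as the script below; the class and function names are made up, and it ignores the DIFAT entirely.

```python
import struct
from typing import NamedTuple

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


class CfbHeader(NamedTuple):
    major_version: int
    sector_size: int
    mini_sector_size: int
    fat_sectors: int
    first_dir_sector: int
    mini_cutoff: int
    first_minifat_sector: int


def parse_header(header):
    """Parse the fields needed to find the directory and the mini FAT."""
    if header[:8] != CFB_MAGIC:
        raise ValueError("not a CFB file")
    # Four consecutive little-endian 16-bit fields starting at offset 26.
    major, byte_order, sector_shift, mini_shift = struct.unpack_from("<4H", header, 26)
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte order mark")
    fat_sectors, first_dir = struct.unpack_from("<2I", header, 44)
    mini_cutoff, first_minifat = struct.unpack_from("<2I", header, 56)
    return CfbHeader(
        major,
        1 << sector_shift,       # sizes are stored as powers of two
        1 << mini_shift,
        fat_sectors,
        first_dir,
        mini_cutoff,
        first_minifat,
    )
```

Streams smaller than mini_cutoff are stored in the ministream, which is why the PROJECT stream has to be read through the mini FAT rather than the regular FAT.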

More accurately accessing the CFB had the additional positive effect of allowing me to write the correct size back into the directory entry. Previously I had been padding the stream with Carriage Returns which worked, but wasn’t strictly speaking compliant.
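The write-back described above boils down to two small operations: fill the blocks the stream already owns, and record the real length in the directory entry. The sketch below assumes the image is held in a bytearray and that the size field sits 120 bytes into the 128-byte entry, as in the script; the helper names are hypothetical.

```python
import struct

STREAM_SIZE_OFFSET = 120  # 8-byte size field inside a 128-byte directory entry


def pad_to_blocks(stream, n_blocks, block_size):
    """Zero-pad `stream` so it exactly fills the blocks it already owns."""
    capacity = n_blocks * block_size
    if len(stream) > capacity:
        raise ValueError("stream no longer fits; new sectors would be needed")
    return stream + b"\x00" * (capacity - len(stream))


def patch_stream_size(image, entry_offset, new_size):
    """Write the true stream length into the directory entry, in place."""
    struct.pack_into("<Q", image, entry_offset + STREAM_SIZE_OFFSET, new_size)
```

Readers stop at the recorded size, so the zero padding after it is never interpreted as part of the PROJECT text.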

python export.py

import win32com.client
import click
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile
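import re
from collections import namedtuple

# One parsed CFB directory entry; `loc` is the entry's byte offset in the file.
direntry = namedtuple(
    "direntry",
    ["loc", "name", "left", "right", "child", "type", "first", "size"],
)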

extensions = {
    1: ".bas",   # Module
    2: ".cls",   # Class Module
    3: ".frm",   # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls", # Excel objects codebehind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}

def _oddStringCompar(n1, n2):
    lenDif = len(n1) - len(n2)
    if lenDif != 0:
        return lenDif

    n1Bin = n1.upper().encode("UTF-16")
    n2Bin = n2.upper().encode("UTF-16")

    for (b1, b2) in zip(n1Bin, n2Bin):
        if b1 != b2:
            return b1 - b2
    return 0

def _findSub(dirs, storageid, name):
    storage = dirs[storageid]
    assert(storage.type == 0x01 or storage.type == 0x05)
    cursor = storage.child
    while cursor != "NO":
        node = dirs[cursor]
        compar = _oddStringCompar(node.name, name)
        if compar > 0:
            cursor = node.left
        elif compar < 0:
            cursor = node.right
        else:
            return cursor
    return None

def extract_project_stream(f):
    header = f.read(512)
    if header[0:8] != b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1":
        print("Not a CFB file")
        exit(1)

    if header[28:30] != b"\xFE\xFF":
        print("Endianness field is wrong")
        exit(1)

    major = int.from_bytes(header[26:28], byteorder='little')
    print(f"Major version is {major}")
    sectorSize = pow(2, int.from_bytes(header[30:32], byteorder='little'))
    print(f"Sectors are {sectorSize} bytes")
    if sectorSize < 512:
        # The sector size is less than the header size. That doesn't make
        # much sense
        print("Sector size is too small")
        exit(1)
    miniSectorSize = pow(2, int.from_bytes(header[32:34], byteorder='little'))
    print(f"Minisectors are {miniSectorSize} bytes")

    fatSectors = int.from_bytes(header[44:48], byteorder='little')
    print(f"There are {fatSectors} FAT sectors")
    if fatSectors > 109:
        # FAT sectors would start to spill into the overflow DIFAT table.
        # I don't want to deal with that.
        print("Too many FAT sectors")
        exit(1)

    dirStreamStart = int.from_bytes(header[48:52], byteorder='little')
    print(f"The Directory Entry Stream starts at {dirStreamStart}")

    miniCutoff = int.from_bytes(header[56:60], byteorder='little')
    print(f"Everything smaller than {miniCutoff} bytes goes in the ministream")

    miniFatStart = int.from_bytes(header[60:64], byteorder='little')
    print(f"The minifat table starts at {miniFatStart}")

    miniFatSectors = int.from_bytes(header[64:68], byteorder='little')
    print(f"There are {miniFatSectors} MiniFAT sectors")

    difat = []
    for i in range(0, fatSectors):
        entry = int.from_bytes(header[76 + (i*4):80 + (i*4)], byteorder="little")
        difat.append(entry)

    # Seek to the first actual sector
    f.seek(sectorSize)

    fat = []
    for fatSector in difat:
        f.seek(sectorSize * (fatSector+1))
        sector = f.read(sectorSize)
        for i in range(0, sectorSize // 4):
            entry = int.from_bytes(sector[(i*4):4 + (i*4)], byteorder="little")

            if entry == 0xFFFFFFFC:
                entry = "DIFAT"
            elif entry == 0xFFFFFFFD:
                entry = "FAT"
            elif entry == 0xFFFFFFFE:
                entry = "END"
            elif entry == 0xFFFFFFFF:
                entry = "FREE"

            fat.append(entry)

    dirs = []
    cursor = dirStreamStart
    while cursor != "END":
        f.seek(sectorSize * (cursor+1))
        sector = f.read(sectorSize)

        for i in range(0, sectorSize // 128):
            loc = sectorSize*(cursor+1) + (i*128)
            nameLen = int.from_bytes(sector[64 + (i*128):66 + (i*128)], byteorder="little")
            # Remove the null byte
            if nameLen > 0: nameLen -= 2
            name = sector[(i*128):nameLen + (i*128)].decode("UTF-16")
            type_ = int.from_bytes(sector[66 + (i*128):67 + (i*128)], byteorder="little")
            def sid(id_):
                if id_ == 0xFFFFFFFF:
                    return "NO"
                return id_
            left =  sid(int.from_bytes(sector[68 + (i*128):72 + (i*128)], byteorder="little"))
            right = sid(int.from_bytes(sector[72 + (i*128):76 + (i*128)], byteorder="little"))
            child = sid(int.from_bytes(sector[76 + (i*128):80 + (i*128)], byteorder="little"))

            first = sid(int.from_bytes(sector[116 + (i*128):120 + (i*128)], byteorder="little"))
            # The size is a plain 64-bit integer, not a stream ID
            size = int.from_bytes(sector[120 + (i*128):128 + (i*128)], byteorder="little")
            if major == 3:
                size = size & 0xFFFFFFFF
            entry = direntry(loc, name, left, right, child, type_, first, size)
            dirs.append(entry)

        cursor = fat[cursor]
    # The root directory entry (the first one) contains the start of the
    # mini stream
    miniStreamStart = dirs[0].first

    projectDirId = _findSub(dirs, 0, "PROJECT")
    if projectDirId is None:
        print("No PROJECT stream")
        exit(1)
    projectDir = dirs[projectDirId]
    if projectDir.type != 2:
        print("Couldn't find the PROJECT stream")
        exit(1)
    if projectDir.size >= miniCutoff:
        print(f"The PROJECT stream is not located in the ministream ({projectDir.size}/{miniCutoff})")
        exit(1)

    minifat = []
    cursor = miniFatStart
    while cursor != "END":
        f.seek(sectorSize * (cursor+1))
        sector = f.read(sectorSize)

        for i in range(0, sectorSize // 4):
            entry = int.from_bytes(sector[(i*4):4 + (i*4)], byteorder="little")

            if entry == 0xFFFFFFFE:
                entry = "END"
            elif entry == 0xFFFFFFFF:
                entry = "FREE"

            minifat.append(entry)

        cursor = fat[cursor]


    projectValue = b""
    blocks = []
    cursor = projectDir.first
    while cursor != "END":
        cursorSector = miniStreamStart
        for i in range(0, cursor // (sectorSize // miniSectorSize)):
            cursorSector = fat[cursorSector]
        f.seek(sectorSize * (cursorSector+1))
        sector = f.read(sectorSize)

        start = (cursor * miniSectorSize) % sectorSize
        blocks.append(((cursorSector+1) * sectorSize) + start)
        projectValue += sector[start:start+miniSectorSize]

        cursor = minifat[cursor]
    projectValue = projectValue[:projectDir.size]

    return ((blocks, miniSectorSize, projectDir.loc + 120), projectValue)

# Partial implementation of decryption of the _Data Encryption_ scheme
def read_key(val):
    seed = val[0]
    pkey_enc = val[2]
    return seed ^ pkey_enc

# Implementation of the _Data Encryption_ scheme
def enc(pkey, data):
    seed = 0x00
    pkey_enc = seed ^ pkey
    ver_enc = seed ^ 2

    unenc1 = pkey
    enc1 = pkey_enc
    enc2 = ver_enc

    ign_len = int((seed & 6) / 2)
    ign = []
    for i in range(0, ign_len):
        tmp = 7
        b_enc = tmp ^ ((enc2 + unenc1) & 0xFF)
        ign.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = tmp

    data_len = len(data)
    data_len_enc = []
    for i in range(0, 4):
        b = (data_len >> (i * 8)) & 0xFF
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_len_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    data_enc = []
    for b in data:
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    return bytes([seed, ver_enc, pkey_enc, *ign, *data_len_enc, *data_enc])


@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    with TemporaryDirectory() as temp:
        vbaProjPath = Path("word/vbaProject.bin")
        try:
            with ZipFile(wbPath) as archive:
                archive.extract("word/vbaProject.bin", temp)
        except KeyError:
            exit(1)

        vba_temp = temp / vbaProjPath

        with open(vba_temp, "r+b") as f:
            (d, stream) = extract_project_stream(f)

            # Work on the reassembled PROJECT stream, not the raw file
            match = re.search(b"DPB=\"([A-Z0-9]+)\"", stream)
            dpb_enc = bytes.fromhex(match.group(1).decode("ascii"))
            pkey = read_key(dpb_enc)

            dpb = enc(pkey, [0]).hex().upper().encode("ascii")

            stream = re.sub(match.group(0), b"DPB=\"" + dpb + b"\"", stream)

            (blocks, blockSize, sizeLoc) = d
            padding = (len(blocks) * blockSize) - len(stream)
            if padding < 0:
                print("The new stream is too large to fit in the blocks")
                print("Inserting this stream would require allocating new space")
                print(f"which is not implemented yet. ({padding})")
                exit(1)
            stream = stream + b"\x00" * padding
            assert len(stream) == len(blocks) * blockSize

            print("Writing the changes back to the file")
            f.seek(sizeLoc)
            f.write((len(stream) - padding).to_bytes(8, "little"))
            for block in blocks:
                f.seek(block)
                # Write one minisector at a time instead of assuming 64 bytes
                f.write(stream[0:blockSize])
                stream = stream[blockSize:]

        # TemporaryDirectory yields a str, so wrap it before joining paths
        temp_workbook = Path(temp) / wbPath.name
        with ZipFile(wbPath, "r") as zin, \
                ZipFile(temp_workbook, "w") as zout:
            for item in zin.infolist():
                if item.filename == "word/vbaProject.bin":
                    continue
                zout.writestr(item.filename, zin.read(item.filename))
            zout.write(vba_temp, "word/vbaProject.bin")

        word = win32com.client.Dispatch("Word.Application")
        # COM calls need plain strings, not Path objects
        document = word.Documents.Open(str(temp_workbook.resolve()))
        project = document.VBProject

        print(f"Exporting references")
        references = project.References
        with open(outdir / "references.txt", "w") as f:
            for reference in references:
                if reference.Type == 0:
                    if reference.Name in defaultRefs:
                        continue
                    continue

                referencePath = Path(reference.FullPath)
                f.write(f"{referencePath}\n")

        print(f"Exporting code")
        components = project.VBComponents
        for component in components:
            if component.Type in skip:
                continue

            extension = extensions[component.Type]
            filename = f"{component.Name}{extension}"
            outpath = outdir / filename
            print(f"-> {filename}")
            component.Export(str(outpath.resolve()))

        document.Close()
        word.Quit()

if __name__ == '__main__':
    main()

Finally I was able to strip the password from the Office file and export the objects without any manual intervention.

$ ./export.py document.dotm
<woo!>

From here I could use my standard text toolchain to analyze those projects.
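For completeness, the same search can be done in Python. This is a minimal sketch under a few assumptions that are not in the script: each document was exported into its own subdirectory, the exported files decode as Latin-1, and the service can be recognised by a substring such as the hypothetical host name legacy-service.example.

```python
from pathlib import Path

EXPORT_SUFFIXES = {".bas", ".cls", ".frm"}


def find_callers(export_root, needle):
    """Return (relative path, line number, line) for every line mentioning `needle`."""
    root = Path(export_root)
    needle = needle.lower()
    hits = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in EXPORT_SUFFIXES:
            continue
        # Latin-1 never fails to decode, which is enough for a substring search.
        text = path.read_text(encoding="latin-1")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((path.relative_to(root).as_posix(), lineno, line.strip()))
    return hits
```

Calling find_callers("exports", "legacy-service.example") would then list every module that mentions the service, grouped by the document directory it was exported from.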

¹ If you want to follow along, I’ll be referring to version 9.1 of the document.
² Here I’ll be referring to version 11.0.
³ Directories are called objects in CFB.