Word Macro Stuff
Faced with over 100 Word VBA macros and a request for information about which of them call a certain web service, how do you get that information without going insane? With a lot of Python, some Windows black magic, and a couple of open Microsoft “standards”.
Recently a team came to me to ask if it was possible to figure out who called one of our old applications. The application is a horrible pain for the team maintaining it, and they’d like to shut it down. Not least because it’s basically impossible to build. They had traced down a lot of applications already, but they were stuck on a folder brimming with Word VBA macros. From experience, they knew some of them called this application, but which?
$ find <macrodir> -name "*.dotm" | wc -l
<some number>
Mass analysis of embedded VBA code poses a rather interesting problem. Since the code is embedded within the Word file in a binary format, I can’t simply grep through the source. Plenty of tools exist that let you (to varying degrees) extract the text from Word documents, but in this case we want the embedded VBA source code.
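To make "embedded in a binary format" concrete: a .dotm file is a zip container, and the VBA project sits inside it as the binary part word/vbaProject.bin. The following is a minimal sketch, using only the standard library, that checks whether a container carries that part. The helper name has_vba_project and the in-memory stand-in archive are illustrative assumptions, not part of the export script.

```python
import io
from zipfile import ZipFile

VBA_PART = "word/vbaProject.bin"  # part that holds the VBA project in Word files


def has_vba_project(source):
    """Return True if the zip container holds a VBA project part."""
    with ZipFile(source) as archive:
        return VBA_PART in archive.namelist()


# Build a tiny stand-in container in memory (not a real Word file).
with_macros = io.BytesIO()
with ZipFile(with_macros, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
    archive.writestr(VBA_PART, b"\xd0\xcf\x11\xe0 binary blob")

without_macros = io.BytesIO()
with ZipFile(without_macros, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")

print(has_vba_project(with_macros))
print(has_vba_project(without_macros))
```

A check like this is a cheap way to filter out templates that contain no macros at all before starting Word for each file.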
The VBA editor embedded in every Office application contains a feature that lets the user export the code. Fortunately, Windows contains an IPC mechanism called OLE Automation which exposes that functionality to scripts. I like the pywin32 module for talking OLE from Python.
The simple procedure then is to 1) start Word by creating a new Word COM object, 2) open the document, and 3) access the Visual Basic Project of the open document to export the code.
import win32com.client
import click
from pathlib import Path

# VBComponent.Type -> file extension used for the export
extensions = {
    1: ".bas",  # Module
    2: ".cls",  # Class Module
    3: ".frm",  # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls",  # Document objects code-behind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}


@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    word = win32com.client.Dispatch("Word.Application")
    # COM calls need plain strings, not Path objects
    document = word.Documents.Open(str(wbPath.resolve()))
    project = document.VBProject

    print("Exporting references")
    references = project.References
    with open(outdir / "references.txt", "w") as f:
        for reference in references:
            # Type 0 is a type library reference; skip those
            if reference.Type == 0:
                continue
            referencePath = Path(reference.FullPath)
            f.write(f"{referencePath}\n")

    print("Exporting code")
    components = project.VBComponents
    for component in components:
        if component.Type in skip:
            continue
        extension = extensions[component.Type]
        filename = f"{component.Name}{extension}"
        outpath = outdir / filename
        print(f"-> {filename}")
        component.Export(str(outpath.resolve()))

    document.Close()
    word.Quit()


if __name__ == '__main__':
    main()
Now I should be able to simply run this command for every document to export its Visual Basic files.
$ ./export.py document.dotm
<error>
Ooh, and so began my adventure into the depths of Office. It turns out you can’t simply access the Visual Basic Project. Due to (in hindsight) pretty obvious security considerations, this functionality has been disabled by default for quite a while. That’s easy enough to solve though: I just enabled the “Trust access to the VBA project object model” setting in Word’s Trust Center. Let’s try that command again.
$ ./export.py document.dotm
<different error about denied>
As it turns out, the person who made this VBA code didn’t want just anybody to mess around with it. The Office VBA format contains support for password protecting the VBA project such that you can’t open it without knowing the password. Being a developer I could probably get the password from someone, but I had nowhere to use it. The Visual Basic Project object model doesn’t contain any way to actually unlock the project, even with the password. I’d have to go through every document and manually unlock them first.
Or I would have, if it weren’t for a pretty simple way to circumvent the Office VBA password protection. At this point I didn’t really understand how this worked, but I suspected it was simply corrupting the file in such a way that Office could not determine it was password protected. It’s pretty interesting that the VBA project is not encrypted within the file.
So I needed to extend the script such that it would 1) extract word/vbaProject.bin from the document (which is a zip archive), 2) search for DPB and replace it with DPx, 3) place word/vbaProject.bin back into the file, and 4) continue extraction like normal.
import win32com.client
import click
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile
import re

# VBComponent.Type -> file extension used for the export
extensions = {
    1: ".bas",  # Module
    2: ".cls",  # Class Module
    3: ".frm",  # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls",  # Document objects code-behind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}


@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    with TemporaryDirectory() as temp:
        # TemporaryDirectory yields a str; wrap it so "/" works below
        temp = Path(temp)
        vbaProjPath = Path("word/vbaProject.bin")
        try:
            with ZipFile(wbPath) as archive:
                archive.extract("word/vbaProject.bin", temp)
        except KeyError:
            # The document has no VBA project
            exit(1)

        # Rename the DPB field so Office can no longer find the password
        vba_temp = temp / vbaProjPath
        with open(vba_temp, "r+b") as f:
            content = f.read()
            content = re.sub(b"DPB=", b"DPx=", content)
            f.seek(0)
            f.truncate()
            f.write(content)

        # Copy the document, swapping in the patched vbaProject.bin
        temp_workbook = temp / wbPath.name
        with ZipFile(wbPath, "r") as zin, \
                ZipFile(temp_workbook, "w") as zout:
            for item in zin.infolist():
                if item.filename == "word/vbaProject.bin":
                    continue
                zout.writestr(item.filename, zin.read(item.filename))
            zout.write(vba_temp, "word/vbaProject.bin")

        # Word has to be done with the copy before the directory is removed
        word = win32com.client.Dispatch("Word.Application")
        document = word.Documents.Open(str(temp_workbook.resolve()))
        project = document.VBProject

        print("Exporting references")
        references = project.References
        with open(outdir / "references.txt", "w") as f:
            for reference in references:
                if reference.Type == 0:
                    continue
                referencePath = Path(reference.FullPath)
                f.write(f"{referencePath}\n")

        print("Exporting code")
        components = project.VBComponents
        for component in components:
            if component.Type in skip:
                continue
            extension = extensions[component.Type]
            filename = f"{component.Name}{extension}"
            outpath = outdir / filename
            print(f"-> {filename}")
            component.Export(str(outpath.resolve()))

        document.Close()
        word.Quit()


if __name__ == '__main__':
    main()
Now, finally, we can extract the VB code.
$ ./export.py document.dotm
<some other error about being corrupted>
No, once again I was defeated by odd edge cases. Word will not open a corrupted file if it’s being opened through the COM interface. It won’t even try to repair it. Luckily for me, Microsoft actually opened up the spec for the Office VBA format (MS-OVBA, as Microsoft calls it [1]), so I should be able to reconstruct the file perfectly.
The mysterious DPB field is described on page 25, section 2.3.1.16, of the document. It’s where Office stores the hash of the password, which it validates against what the user enters. By changing the name of the field to DPx I was corrupting the file such that Word was unable to find the password, and it was forced to assume it wasn’t password protected as it reconstructed it.
Through the manual I found that the DPB field makes reference to something called Data Encryption to encode the password. The encryption scheme ties the value of this field to the values of the ID, CMG, and GC fields through the ProjKey. By decoding the current value first it was possible to recover the ProjKey, which I could then use to encode the NULL password before substituting it in for the old DPB value.
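To see why decoding the current value yields the ProjKey, it helps to look at both directions of the scheme side by side. The sketch below is a compact restatement of the encoder described above, plus a decoder that undoes it. The names encrypt and decrypt, the seed parameter, and the project key 0x3C are assumptions chosen for illustration; a real key comes out of the existing DPB value.

```python
def encrypt(proj_key, data, seed=0x00):
    """Encode data with the Data Encryption scheme (version 2)."""
    out = [seed, seed ^ 2, seed ^ proj_key]  # seed, version, project key
    unenc1, enc1, enc2 = proj_key, out[2], out[1]
    ignored = [7] * ((seed & 6) >> 1)  # filler bytes; their value is arbitrary
    length = list(len(data).to_bytes(4, "little"))
    for b in ignored + length + list(data):
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        out.append(b_enc)
        enc2, enc1, unenc1 = enc1, b_enc, b
    return bytes(out)


def decrypt(blob):
    """Return (project key, data) for a value produced by encrypt()."""
    seed, ver_enc, key_enc = blob[0], blob[1], blob[2]
    proj_key = seed ^ key_enc
    unenc1, enc1, enc2 = proj_key, key_enc, ver_enc
    plain = []
    for b_enc in blob[3:]:
        b = b_enc ^ ((enc2 + unenc1) & 0xFF)
        plain.append(b)
        enc2, enc1, unenc1 = enc1, b_enc, b
    skip = (seed & 6) >> 1  # number of filler bytes to drop
    length = int.from_bytes(bytes(plain[skip:skip + 4]), "little")
    return proj_key, bytes(plain[skip + 4:skip + 4 + length])


# A single zero byte stands for "no password".
blob = encrypt(0x3C, b"\x00", seed=0x5A)
print(blob.hex().upper())
print(decrypt(blob))
```

Only the first and third bytes are needed to recover the key, which is why the script can get away with a partial decoder.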
import win32com.client
import click
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile
import re

# VBComponent.Type -> file extension used for the export
extensions = {
    1: ".bas",  # Module
    2: ".cls",  # Class Module
    3: ".frm",  # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls",  # Document objects code-behind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}


# Partial implementation of decryption of the _Data Encryption_ scheme
def read_key(val):
    seed = val[0]
    pkey_enc = val[2]
    return seed ^ pkey_enc


# Implementation of the _Data Encryption_ scheme
def enc(pkey, data):
    seed = 0x00
    pkey_enc = seed ^ pkey
    ver_enc = seed ^ 2

    unenc1 = pkey
    enc1 = pkey_enc
    enc2 = ver_enc

    # Filler bytes; with a seed of 0 there are none
    ign_len = (seed & 6) // 2
    ign = []
    for i in range(0, ign_len):
        tmp = 7
        b_enc = tmp ^ ((enc2 + unenc1) & 0xFF)
        ign.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = tmp

    # Four-byte little-endian length of the data
    data_len = len(data)
    data_len_enc = []
    for i in range(0, 4):
        b = (data_len >> (i * 8)) & 0xFF
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_len_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    data_enc = []
    for b in data:
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    return bytes([seed, ver_enc, pkey_enc, *ign, *data_len_enc, *data_enc])


@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    with TemporaryDirectory() as temp:
        # TemporaryDirectory yields a str; wrap it so "/" works below
        temp = Path(temp)
        vbaProjPath = Path("word/vbaProject.bin")
        try:
            with ZipFile(wbPath) as archive:
                archive.extract("word/vbaProject.bin", temp)
        except KeyError:
            # The document has no VBA project
            exit(1)

        vba_temp = temp / vbaProjPath
        with open(vba_temp, "r+b") as f:
            content = f.read()
            orig_len = len(content)

            # Recover the project key and re-encode an empty password
            match = re.search(b"DPB=\"([A-Z0-9]+)\"", content)
            dpb_enc = bytes.fromhex(match.group(1).decode("ascii"))
            pkey = read_key(dpb_enc)
            dpb = enc(pkey, [0]).hex().upper().encode("ascii")
            content = content.replace(match.group(0), b"DPB=\"" + dpb + b"\"")
            # Pad with carriage returns to keep the original length
            content += b"\r" * (orig_len - len(content))

            f.seek(0)
            f.truncate()
            f.write(content)

        # Copy the document, swapping in the patched vbaProject.bin
        temp_workbook = temp / wbPath.name
        with ZipFile(wbPath, "r") as zin, \
                ZipFile(temp_workbook, "w") as zout:
            for item in zin.infolist():
                if item.filename == "word/vbaProject.bin":
                    continue
                zout.writestr(item.filename, zin.read(item.filename))
            zout.write(vba_temp, "word/vbaProject.bin")

        word = win32com.client.Dispatch("Word.Application")
        document = word.Documents.Open(str(temp_workbook.resolve()))
        project = document.VBProject

        print("Exporting references")
        references = project.References
        with open(outdir / "references.txt", "w") as f:
            for reference in references:
                if reference.Type == 0:
                    if reference.Name in defaultRefs:
                        continue
                    continue
                referencePath = Path(reference.FullPath)
                f.write(f"{referencePath}\n")

        print("Exporting code")
        components = project.VBComponents
        for component in components:
            if component.Type in skip:
                continue
            extension = extensions[component.Type]
            filename = f"{component.Name}{extension}"
            outpath = outdir / filename
            print(f"-> {filename}")
            component.Export(str(outpath.resolve()))

        document.Close()
        word.Quit()


if __name__ == '__main__':
    main()
$ ./export.py document.dotm
$ ls
module.vba
Awesome, with it working for one file I assumed it would work for all my files. That was a bold assumption. After exporting quite a few files I found one that wouldn’t work. It was crashing the part of the code that extracted the DPB field.
$ ./export.py otherdocument.dotm
<crash about DPB not being found>
If you read the beginning of the MS-OVBA specification, you find that the file isn’t a simple text file; it’s a Compound File Binary (MS-CFB in Microsoft language [2]). The CFB specification contains a helpful diagram of the layout.
A CFB is a FAT-like filesystem in a single file. This explains why the DPB field wasn’t found: it was broken up by an unrelated sector. To correctly and reliably replace the DPB value I had to 1) recreate the sector stream, 2) replace the field in that, and 3) place that stream back in the file. According to the Office VBA specification, the stream we want is the PROJECT stream, located in the root directory [3]. Before I could do anything, then, I would first have to locate that stream.
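The failure mode is easy to reproduce on a toy scale. The sketch below stores a small stream in 16-byte sectors with an unrelated sector in between, then shows that the DPB pattern only matches once the sectors are stitched together in chain order. The sector size, the sample field values, and the dictionary standing in for the allocation table are all made up for illustration; real files use the layout from the CFB specification.

```python
import re

SECTOR = 16  # toy sector size; real CFB sectors are far larger
END = -1     # toy end-of-chain marker

# A small stream that needs three sectors; DPB straddles the first boundary.
payload = b'ID="x"\r\nDPB="0102030405"\r\nGC="y"\r\n'.ljust(3 * SECTOR, b"\x00")
pieces = [payload[i:i + SECTOR] for i in range(0, len(payload), SECTOR)]
other = b"#" * SECTOR  # a sector belonging to some other stream

# On-disk order: piece 0, the unrelated sector, then pieces 1 and 2.
disk = pieces[0] + other + pieces[1] + pieces[2]
# Allocation table: entry i names the sector that follows sector i.
fat = {0: 2, 1: END, 2: 3, 3: END}


def read_chain(disk, fat, start):
    """Concatenate the sectors of one stream by following its chain."""
    out = b""
    sector = start
    while sector != END:
        out += disk[sector * SECTOR:(sector + 1) * SECTOR]
        sector = fat[sector]
    return out


pattern = re.compile(rb'DPB="([A-Z0-9]+)"')
print(pattern.search(disk))                               # raw bytes: no match
print(pattern.search(read_chain(disk, fat, 0)).group(1))  # stitched stream
```

Whether a given file triggers this depends purely on where the field happens to fall relative to a sector boundary, which is why most files exported fine.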
Locating the specific stream involved parsing the header of the CFB, finding the directory stream, searching that for the stream, and finally reading the stream out of the sector chain. It turned out that the PROJECT stream is actually so small that it’s contained within the ministream. I won’t go through the whole process here, and I did cut some corners to make the parsing simpler.
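As a taste of the first step, here is a minimal sketch of reading the fixed part of a CFB header with the struct module instead of slicing bytes by hand. The field names in Header are my own labels, and the header is synthetic, built in memory with version 3 values, rather than taken from a real vbaProject.bin.

```python
import struct
from collections import namedtuple

MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"

# First 76 bytes of the header; the 109 DIFAT entries follow.
HEADER = struct.Struct("<8s16sHHHHH6xIIIIIIIII")
Header = namedtuple(
    "Header",
    "signature clsid minor major byte_order sector_shift mini_shift "
    "dir_sectors fat_sectors dir_start txn_signature mini_cutoff "
    "minifat_start minifat_sectors difat_start difat_sectors",
)


def parse_header(raw):
    """Parse and sanity-check the fixed-size fields of a CFB header."""
    header = Header._make(HEADER.unpack_from(raw))
    if header.signature != MAGIC:
        raise ValueError("not a CFB file")
    if header.byte_order != 0xFFFE:
        raise ValueError("unexpected byte order mark")
    return header


# Synthetic version 3 header: 512-byte sectors, 64-byte mini sectors.
fixed = HEADER.pack(MAGIC, bytes(16), 0x003E, 3, 0xFFFE, 9, 6,
                    0, 1, 1, 0, 4096, 2, 1, 0xFFFFFFFE, 0)
difat = struct.pack("<109I", 0, *([0xFFFFFFFF] * 108))
raw = fixed + difat

header = parse_header(raw)
print(1 << header.sector_shift, 1 << header.mini_shift, header.mini_cutoff)
```

The offsets implied by the format string line up with the manual slices used in the full script (for example, the directory start at byte 48), so the two approaches are interchangeable.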
More accurately accessing the CFB had the additional positive effect of allowing me to write the correct size back into the directory entry. Previously I had been padding the stream with Carriage Returns which worked, but wasn’t strictly speaking compliant.
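The write-back step can be sketched in isolation. Given the file offsets of the blocks a stream occupies and the offset of its size field, the sketch below stores the true length and fills the unused tail of the last block with zero bytes. The function name write_stream and the toy 24-byte "file" are assumptions made for the example.

```python
import io


def write_stream(f, blocks, block_size, size_offset, stream):
    """Write stream into its existing blocks and record its real size."""
    capacity = len(blocks) * block_size
    if len(stream) > capacity:
        raise ValueError("stream no longer fits in its existing blocks")
    # The directory entry keeps the unpadded length as 8 little-endian bytes.
    f.seek(size_offset)
    f.write(len(stream).to_bytes(8, "little"))
    # Blocks may be scattered, so each chunk goes to its own offset.
    padded = stream.ljust(capacity, b"\x00")
    for i, offset in enumerate(blocks):
        f.seek(offset)
        f.write(padded[i * block_size:(i + 1) * block_size])


# Toy file: size field at offset 0, two 4-byte blocks at offsets 16 and 8.
toy = io.BytesIO(bytes(24))
write_stream(toy, blocks=[16, 8], block_size=4, size_offset=0, stream=b"abcde")
print(toy.getvalue())
```

Because the new DPB value is shorter than a real password hash, the stream only ever shrinks here, so no new blocks need to be allocated.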
import win32com.client
import click
from collections import namedtuple
from pathlib import Path
from tempfile import TemporaryDirectory
from zipfile import ZipFile
import re

# VBComponent.Type -> file extension used for the export
extensions = {
    1: ".bas",  # Module
    2: ".cls",  # Class Module
    3: ".frm",  # Form. Technically also .frx, but we don't get to choose that.
    100: ".cls",  # Document objects code-behind
}

skip = {
    100,
}

defaultRefs = {
    "VBA",
    "Excel",
    "stdole",
    "Office",
    "MSForms",
    "Outlook",
    "SHDocVw",
}

# One parsed 128-byte directory entry; loc is its offset in the file
direntry = namedtuple(
    "direntry",
    ["loc", "name", "left", "right", "child", "type", "first", "size"],
)


def _oddStringCompar(n1, n2):
    # Names are ordered by length first, then by upper-cased code point
    lenDif = len(n1) - len(n2)
    if lenDif != 0:
        return lenDif
    for (c1, c2) in zip(n1.upper(), n2.upper()):
        if c1 != c2:
            return ord(c1) - ord(c2)
    return 0


def _findSub(dirs, storageid, name):
    # Walk the binary search tree of children below a storage entry
    storage = dirs[storageid]
    assert storage.type == 0x01 or storage.type == 0x05
    cursor = storage.child
    while cursor != "NO":
        node = dirs[cursor]
        compar = _oddStringCompar(node.name, name)
        if compar > 0:
            cursor = node.left
        elif compar < 0:
            cursor = node.right
        else:
            return cursor
    return None


def extract_project_stream(f):
    header = f.read(512)
    if header[0:8] != b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1":
        print("Not a CFB file")
        exit(1)
    if header[28:30] != b"\xFE\xFF":
        print("Endianness field is wrong")
        exit(1)
    major = int.from_bytes(header[26:28], byteorder='little')
    print(f"Major version is {major}")
    sectorSize = pow(2, int.from_bytes(header[30:32], byteorder='little'))
    print(f"Sectors are {sectorSize} bytes")
    if sectorSize < 512:
        # The sector size is less than the header size. That doesn't make
        # much sense
        print("Sector size is too small")
        exit(1)
    miniSectorSize = pow(2, int.from_bytes(header[32:34], byteorder='little'))
    print(f"Minisectors are {miniSectorSize} bytes")
    fatSectors = int.from_bytes(header[44:48], byteorder='little')
    print(f"There are {fatSectors} FAT sectors")
    if fatSectors > 109:
        # FAT sectors would start to spill into the overflow DIFAT table.
        # I don't want to deal with that.
        print("Too many FAT sectors")
        exit(1)
    dirStreamStart = int.from_bytes(header[48:52], byteorder='little')
    print(f"The Directory Entry Stream starts at {dirStreamStart}")
    miniCutoff = int.from_bytes(header[56:60], byteorder='little')
    print(f"Everything smaller than {miniCutoff} bytes goes in the ministream")
    miniFatStart = int.from_bytes(header[60:64], byteorder='little')
    print(f"The minifat table starts at {miniFatStart}")
    miniFatSectors = int.from_bytes(header[64:68], byteorder='little')
    print(f"There are {miniFatSectors} MiniFAT sectors")

    difat = []
    for i in range(0, fatSectors):
        entry = int.from_bytes(header[76 + (i*4):80 + (i*4)], byteorder="little")
        difat.append(entry)

    # Seek to the first actual sector
    f.seek(sectorSize)

    # Sector n lives at offset sectorSize * (n + 1); the header takes slot 0
    fat = []
    for fatSector in difat:
        f.seek(sectorSize * (fatSector+1))
        sector = f.read(sectorSize)
        for i in range(0, sectorSize // 4):
            entry = int.from_bytes(sector[(i*4):4 + (i*4)], byteorder="little")
            if entry == 0xFFFFFFFC:
                entry = "DIFAT"
            elif entry == 0xFFFFFFFD:
                entry = "FAT"
            elif entry == 0xFFFFFFFE:
                entry = "END"
            elif entry == 0xFFFFFFFF:
                entry = "FREE"
            fat.append(entry)

    def sid(id_):
        # 0xFFFFFFFF marks "no stream"
        if id_ == 0xFFFFFFFF:
            return "NO"
        return id_

    dirs = []
    cursor = dirStreamStart
    while cursor != "END":
        f.seek(sectorSize * (cursor+1))
        sector = f.read(sectorSize)
        for i in range(0, sectorSize // 128):
            loc = sectorSize*(cursor+1) + (i*128)
            nameLen = int.from_bytes(sector[64 + (i*128):66 + (i*128)], byteorder="little")
            # Remove the null terminator
            if nameLen > 0:
                nameLen -= 2
            name = sector[(i*128):nameLen + (i*128)].decode("UTF-16-LE")
            type_ = int.from_bytes(sector[66 + (i*128):67 + (i*128)], byteorder="little")
            left = sid(int.from_bytes(sector[68 + (i*128):72 + (i*128)], byteorder="little"))
            right = sid(int.from_bytes(sector[72 + (i*128):76 + (i*128)], byteorder="little"))
            child = sid(int.from_bytes(sector[76 + (i*128):80 + (i*128)], byteorder="little"))
            first = sid(int.from_bytes(sector[116 + (i*128):120 + (i*128)], byteorder="little"))
            size = int.from_bytes(sector[120 + (i*128):128 + (i*128)], byteorder="little")
            if major == 3:
                size = size & 0xFFFFFFFF
            entry = direntry(loc, name, left, right, child, type_, first, size)
            dirs.append(entry)
        cursor = fat[cursor]

    # The root directory entry (the first one) contains the start of the
    # mini stream
    miniStreamStart = dirs[0].first

    projectDirId = _findSub(dirs, 0, "PROJECT")
    if projectDirId is None:
        print("No PROJECT stream")
        exit(1)
    projectDir = dirs[projectDirId]
    if projectDir.type != 2:
        print("Couldn't find the PROJECT stream")
        exit(1)
    if projectDir.size >= miniCutoff:
        print(f"The PROJECT stream is not located in the ministream ({projectDir.size}/{miniCutoff})")
        exit(1)

    minifat = []
    cursor = miniFatStart
    while cursor != "END":
        f.seek(sectorSize * (cursor+1))
        sector = f.read(sectorSize)
        for i in range(0, sectorSize // 4):
            entry = int.from_bytes(sector[(i*4):4 + (i*4)], byteorder="little")
            if entry == 0xFFFFFFFE:
                entry = "END"
            elif entry == 0xFFFFFFFF:
                entry = "FREE"
            minifat.append(entry)
        cursor = fat[cursor]

    # Follow the mini sector chain, remembering where each block sits
    # in the file so it can be overwritten later
    projectValue = b""
    blocks = []
    cursor = projectDir.first
    while cursor != "END":
        cursorSector = miniStreamStart
        for i in range(0, cursor // (sectorSize // miniSectorSize)):
            cursorSector = fat[cursorSector]
        f.seek(sectorSize * (cursorSector+1))
        sector = f.read(sectorSize)
        start = (cursor * miniSectorSize) % sectorSize
        blocks.append(((cursorSector+1) * sectorSize) + start)
        projectValue += sector[start:start+miniSectorSize]
        cursor = minifat[cursor]
    projectValue = projectValue[:projectDir.size]

    # loc + 120 is the offset of the size field in the directory entry
    return ((blocks, miniSectorSize, projectDir.loc + 120), projectValue)


# Partial implementation of decryption of the _Data Encryption_ scheme
def read_key(val):
    seed = val[0]
    pkey_enc = val[2]
    return seed ^ pkey_enc


# Implementation of the _Data Encryption_ scheme
def enc(pkey, data):
    seed = 0x00
    pkey_enc = seed ^ pkey
    ver_enc = seed ^ 2

    unenc1 = pkey
    enc1 = pkey_enc
    enc2 = ver_enc

    ign_len = (seed & 6) // 2
    ign = []
    for i in range(0, ign_len):
        tmp = 7
        b_enc = tmp ^ ((enc2 + unenc1) & 0xFF)
        ign.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = tmp

    data_len = len(data)
    data_len_enc = []
    for i in range(0, 4):
        b = (data_len >> (i * 8)) & 0xFF
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_len_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    data_enc = []
    for b in data:
        b_enc = b ^ ((enc2 + unenc1) & 0xFF)
        data_enc.append(b_enc)
        enc2 = enc1
        enc1 = b_enc
        unenc1 = b

    return bytes([seed, ver_enc, pkey_enc, *ign, *data_len_enc, *data_enc])


@click.command()
@click.argument("workbook", nargs=1, type=click.Path(exists=True))
def main(workbook):
    wbPath = Path(workbook)
    outdir = Path(".")

    with TemporaryDirectory() as temp:
        # TemporaryDirectory yields a str; wrap it so "/" works below
        temp = Path(temp)
        vbaProjPath = Path("word/vbaProject.bin")
        try:
            with ZipFile(wbPath) as archive:
                archive.extract("word/vbaProject.bin", temp)
        except KeyError:
            # The document has no VBA project
            exit(1)

        vba_temp = temp / vbaProjPath
        with open(vba_temp, "r+b") as f:
            (d, stream) = extract_project_stream(f)

            # Recover the project key and re-encode an empty password
            match = re.search(b"DPB=\"([A-Z0-9]+)\"", stream)
            dpb_enc = bytes.fromhex(match.group(1).decode("ascii"))
            pkey = read_key(dpb_enc)
            dpb = enc(pkey, [0]).hex().upper().encode("ascii")
            stream = stream.replace(match.group(0), b"DPB=\"" + dpb + b"\"")

            (blocks, blockSize, sizeLoc) = d
            padding = (len(blocks) * blockSize) - len(stream)
            if padding < 0:
                print("The new stream is too large to fit in the blocks")
                print("Inserting this stream would require allocating new space")
                print(f"which is not implemented yet. ({padding})")
                exit(1)
            stream = stream + b"\x00" * padding
            assert len(stream) == len(blocks) * blockSize

            print("Writing the changes back to the file")
            # The directory entry gets the real, unpadded size
            f.seek(sizeLoc)
            f.write((len(stream) - padding).to_bytes(8, "little"))
            for block in blocks:
                f.seek(block)
                f.write(stream[0:blockSize])
                stream = stream[blockSize:]

        # Copy the document, swapping in the patched vbaProject.bin
        temp_workbook = temp / wbPath.name
        with ZipFile(wbPath, "r") as zin, \
                ZipFile(temp_workbook, "w") as zout:
            for item in zin.infolist():
                if item.filename == "word/vbaProject.bin":
                    continue
                zout.writestr(item.filename, zin.read(item.filename))
            zout.write(vba_temp, "word/vbaProject.bin")

        word = win32com.client.Dispatch("Word.Application")
        document = word.Documents.Open(str(temp_workbook.resolve()))
        project = document.VBProject

        print("Exporting references")
        references = project.References
        with open(outdir / "references.txt", "w") as f:
            for reference in references:
                if reference.Type == 0:
                    if reference.Name in defaultRefs:
                        continue
                    continue
                referencePath = Path(reference.FullPath)
                f.write(f"{referencePath}\n")

        print("Exporting code")
        components = project.VBComponents
        for component in components:
            if component.Type in skip:
                continue
            extension = extensions[component.Type]
            filename = f"{component.Name}{extension}"
            outpath = outdir / filename
            print(f"-> {filename}")
            component.Export(str(outpath.resolve()))

        document.Close()
        word.Quit()


if __name__ == '__main__':
    main()
Finally I was able to strip the password from the Office file and export the objects without any manual intervention.
$ ./export.py document.dotm
<woo!>
From here I could use my standard text toolchain to analyze those projects.
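For completeness, the last step does not even need an external tool. The sketch below walks a folder of exported modules and reports every line that mentions a given string, which answers the original question of which macros call the service. The function name find_callers, the folder layout, and the cp1252 encoding are assumptions; the encoding in particular depends on the code page of the machine that did the export.

```python
from pathlib import Path

SOURCE_SUFFIXES = {".bas", ".cls", ".frm"}  # extensions the export produces


def find_callers(export_root, needle):
    """Return (path, line number, line) for every line mentioning needle."""
    needle = needle.lower()
    hits = []
    for path in sorted(Path(export_root).rglob("*")):
        if path.suffix.lower() not in SOURCE_SUFFIXES:
            continue
        # Assumed encoding; undecodable bytes are replaced, not fatal.
        text = path.read_text(encoding="cp1252", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                hits.append((path, lineno, line.strip()))
    return hits
```

Calling find_callers with the export folder and the host name of the old application yields one tuple per matching line, which is enough to build the list of documents the team asked for.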