How to create CFG graph with ghidra scripting

Introduction

For some project I had to work with Ghidra. I had to 3 create control flow graph for a given function in a given program. One for the assembly code, one for the the raw pcode and finally one for the refined pcode. What a suprise it was when I realised 2 things:

First, nobody seems to did this on the internet or at least no one wrote some exhaustive stuff about it.
Second, Ghidra API is completely doomed.

So I decided to let my method here in case someone need it and find this by the purest luck.

Set up

Python
Ghidra

STOP !! Okay so the scripting system of ghidra is still in python 2 which is completely cursed. Obvisouly there were some crazy nerd to tinker some homemade weird stuff to make it work with Python 3. Let me introduce you Ghidrathon a ghidra extension working with JEP so provide a full support of python3. As the installation is completely cursed someone gave me a dockerfile dcontainer with everything pre-installed. So let add this to the list:

Docker container
Jep
Ghidrathon

Let’s code

For the this step I base a lot of my work on 2 things:

this dark repo of a good souls that decided to bless our mind with his knowledge
The sacred book the gods, THE HOLLY DOCUMENTATION !!!
Since the ghidra API for constructing graph is kind of cursed I used networkx to create graphs.

The nexts paragraphs will describe how I did for each graph.

Assembly and raw pcode

Why both at the same time ? Because raw pcode is at the same abstraction layer than assembly so we can construct both graph at the same time. First create an nx DiGraph: assembly_graph = nx.DiGraph()

Construct nodes and edges

We will need the BasicBlockModel object from ghidra API. Here a block is a series of instructions until the control flow diverge. This is obvisouly very usefull.

We can get the list of the block for a given function with the following code:

blockModel = BasicBlockModel(currentProgram)
blocks = blockModel.getCodeBlocksContaining(function.getBody(), monitor)

This is giving us an iterator. We can get through it to get all blocks, retrieve the starting address and all destinations blocks for each blocks and then all starting address of each destinations block. We get something like that:

while(blocks.hasNext()):
    bb = blocks.next()
    # Add vertex for block entry point
    source_vertex = bb.getFirstStartAddress()
    formated_source = format_address(source_vertex)
    
    assembly_graph.add_node(formated_source, label=str_instruction_block)
    dest = bb.getDestinations(monitor)

    while(dest.hasNext()):
        dbb = dest.next()
        dest_vertex = format_address(dbb.getDestinationAddress())

        assembly_graph.add_edge(formated_source, dest_vertex)

Sidenote

Ghidra API return string address with all leading zero. I've written the format_address() function you see called above to format the address, removing leading 0s and adding the '0x' prefix.

Add assembly instructions

Now we want to add the corresponding assembly instructions for each node. We will first get all the instructions of the function using the starting address and the ending address:

source_vertex = bb.getFirstStartAddress()
formated_source = format_address(source_vertex)
endpoint = bb.getMaxAddress()

# get all instructions in the block
instruction_block = [instr for instr in instructions 
                    if instr.getAddress() < endpoint 
                    and instr.getAddress() >= source_vertex]
        
# get the string of the instructions 
str_instruction_block = [formated_source] + 
[f"{format_address(instr getAddress())}: {instr} " 
for instr in instruction_block]                                                           

str_instruction_block = "\n".join(str_instruction_block)

So we can now twink a little bit our “add node” line to add attributes to our nodes:
assembly_graph.add_node(formated_source label=str_instruction_block)

Oh shit here we go again (for raw pcode)

From the API we can easily get the pcode for a single instucions, but we need them for all instruction in each block. So I wrote a little function that given a starting address and a block get all the raw pcode for a block:

def get_raw_pcode(startAdress, instruction_block):

    str_raw_pcode = [startAdress]
    for instr in instruction_block:    
        str_raw_pcode.append(str(instr))
        raw_pcode = instr.getPcode()
        for entry in raw_pcode:
            str_raw_pcode.append(f"  {entry}")

    return "\n".join(str_raw_pcode)

We can add a little call in our while loop to get all the pcode for each block and adding all of this in another graph. If you are a smart little reader you will ask to me with a little annoying voice “But Voxa why our we creating another graph ? Can we just add another attributes in the same graph ?”

Our final code looks like this:

def construct_low_lvl_graph(function, currentProgram, monitor):

    # get basic block
    blockModel = BasicBlockModel(currentProgram)
    blocks = blockModel.getCodeBlocksContaining(function.getBody(), monitor)

    # get instructions
    listing = currentProgram.getListing()
    addrSet = function.getBody()
    instructions = get_instruction(listing, addrSet)

    # construct graph
    assembly_graph = nx.DiGraph()
    pcode_graph = nx.DiGraph()

    # iterate of basic blocks
    while(blocks.hasNext()):
        bb = blocks.next()
        # Add vertex for block entry point
        source_vertex = bb.getFirstStartAddress()
        formated_source = format_address(source_vertex)
        endpoint = bb.getMaxAddress()

        # get all instructions in the block
        instruction_block = [instr for instr in instructions 
                             if instr.getAddress() < endpoint 
                             and instr.getAddress() >= source_vertex]
        
        # get the string of the instructions 
        str_instruction_block = [formated_source] + [f"{format_address(instr.getAddress())}: {instr}" 
                                                           for instr in instruction_block]                                                           
        str_instruction_block = "\n".join(str_instruction_block)

        # same for raw pcode
        str_raw_pcode = get_raw_pcode(formated_source, instruction_block)

        assembly_graph.add_node(formated_source, label=str_instruction_block)
        pcode_graph.add_node(formated_source, label=str_raw_pcode)

        dest = bb.getDestinations(monitor)

        # iterate over destination to create edges
        while(dest.hasNext()):
            dbb = dest.next()
            dest_vertex = format_address(dbb.getDestinationAddress())

            assembly_graph.add_edge(formated_source, dest_vertex)
            pcode_graph.add_edge(formated_source, dest_vertex)

    return assembly_graph, pcode_graph

refined pcode

Refined pcode is different. It is linked to the C source code. It more high level than the raw pcode. That’s why we are using another function relying on a different object. We can get these object through the getBasicBlock() function which does not return BasicBlock at all but an array of PcodeBlockBasic (Did I already say that this API was the evil itself ?). We need some preparation steps.\

To be honnest I didn’t understood fully, I guess you retrieve the C function called (high function) and then get the refined Pcode for the all function.

From the “high function” we can retrieve all the PcodeBlockBasic that behave in a similar way, but not really, as BasicBlocks.
For some esoteric reason, this object implement a method to retrieve an out block (understand a block to which the current block points) at a given index i, which means there is a kind of list that contains all these out block, and a function that return the size of these list BUT NO FUNCTION TO GET THE LIST DIRECTLY.

Deep breath, anyway, with our start address and our blocks, we can construct by ourselve the list of out blocks and the edges. Reconstructing the instruction flow is trivial here so I won’t go through this.

The code is as follow:

def construct_high_lvl_graph(function, monitor, program):
    high_function = get_high_function(function, monitor, program)
    refined_pcode_blocks = high_function.getBasicBlocks()

    graph = nx.DiGraph()

    for block in refined_pcode_blocks:
        start = block.getStart()
        formated_start = format_address(start)

        out_blocks = []
        if block != refined_pcode_blocks[-1]:
            for i in range(block.getOutSize()):
                out_blocks.append(block.getOut(i))
        
        ops = list(block.getIterator())

        str_ops = [formated_start] + [str(op) for op in ops]
        str_ops =  "\n".join(str_ops)

        graph.add_node(formated_start, label=str_ops)
        
        for out_block in out_blocks:
            graph.add_edge(formated_start, format_address(out_block.getStart()))

    return graph

Draw the graphs

We have constructed our 2 graphs in an abstract way. However, a graph is never as satisying as when it is drawn.

At the begining I tried to use matplotlib and the basic draw() function of networkx. It was painfull to set with the thousands of parameters that matplotlib function could take, and the time needed to polish the thing nicely was way above my life expectancy.

This were I discovered to_agraph(), this not relying on matplotlib but on pygraphviz.
You pass a graph to the function, it became a graphviz graph then you can draw it the format you want, with the shape you want, with the layout you want, in the file format you want, with the shape you want, with the layout you want, with the form…. I am on a loop, right ?

And the best thing ? Everything is plotted nicely wihtout the need to setup a figure pixel perfect with the boxes you made for your nodes and everything with a -0.003 scale factor (Sounds like there is a story, isn’t it ?)

The only problem is: graphviz only accept a small subset of attributes for the nodes, and the attribute name you used in your networkx graph must match with the graphviz attribute name. This is why we need 2 graphs for the raw pcode and the assembly, because you cannot have two times the “label” attribute in the same graph.

Sidenote

why specifically 'label' ? Because it's the attribute graphviz is using to draw stuff inside a node.

Here our nice final rendering function:

def render(graph, filename: str) -> None:
    agraph: AGraph = nx.nx_agraph.to_agraph(graph)
    agraph.node_attr.update(shape='box')

    agraph.draw(filename, prog='dot')

Conclusion

This is niche but I hope it will be usefull for some people.