Python Code Obfuscation

taking the snake to the dark side

Always two there are, no more, no less:
an apprentice Nicolas Szlifierski [Quarkslab, Telecom Bretagne]
and a master Serge Guelton [Quarkslab, Telecom Bretagne]

`/me`

Serge « sans paille » Guelton

$ whoami
sguelton

R&D engineer at Quarkslab on compilation for security
Associate researcher at Télécom Bretagne

Python bytecode is easy to reverse

$ echo "print('hello world')" > hello.py
$ python -m py_compile hello.py
$ pycdc hello.pyc
# Source Generated with Decompyle++
# File: hello.pyc (Python 2.7)

print 'hello world'

With an optimization flag?

$ printf "a = 1\nif a: print(a + 2)" > dce.py
$ python -O -m py_compile dce.py
$ pycdc dce.pyo
# Source Generated with Decompyle++
# File: dce.pyo (Python 2.7)

a = 1
if a:
    print a + 2

CPython performs close to zero optimization...
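A quick way to check this claim without a decompiler is to compile the same snippet and look at what survives in the code object. This is a minimal sketch written for a recent Python 3 (the slides use 2.7); the variable names are only for illustration.

```python
import dis

# Same program as dce.py above: the branch on `a` is trivially true.
source = "a = 1\nif a: print(a + 2)"

# optimize=2 is the in-process equivalent of running with -OO.
code = compile(source, "dce.py", "exec", optimize=2)

# Names are kept verbatim and the conditional jump is still emitted,
# so the original structure is easy to rebuild.
opnames = [ins.opname for ins in dis.get_instructions(code)]
has_branch = any("JUMP" in name for name in opnames)

print(code.co_names)
print(has_branch)
```

Even at the highest optimization level, the test on `a` is not folded away and every identifier is still there in clear text.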

Hardening Solutions

Source code modification
Bytecode modification
Interpreter modification
any other idea? come and talk with me ;-)

Eenie, meenie, minie, moe... oh, why not all of them?

Source Code Modification

Python semantics make it hard to perform source-to-source transformation because of late binding and polymorphism:

for i in range(10):
    s += hex(i)

Nothing is as it seems...

range = lambda *args: args

Monkey ~~Island~~patching anyone?

__builtin__.hex, __builtin__.oct =  __builtin__.oct, __builtin__.hex
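The effect of such a swap can be sketched as follows. This version targets Python 3, where the module is called `builtins` instead of `__builtin__`; the `digits` function is a made-up example, not code from the talk.

```python
import builtins

def digits():
    # `range` and `hex` are looked up by name at every call,
    # so their meaning is only known at run time.
    s = ""
    for i in range(10, 13):
        s += hex(i)
    return s

before = digits()

# Swap the two builtins, as on the slide, then restore them afterwards.
original_hex, original_oct = builtins.hex, builtins.oct
builtins.hex, builtins.oct = builtins.oct, builtins.hex
try:
    after = digits()
finally:
    builtins.hex, builtins.oct = original_hex, original_oct

print(before)
print(after)
```

The source of `digits` never changes, yet its result does, which is why a source-to-source tool cannot safely assume what `hex` or `range` refers to.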

Obfuscate Control Flow

No variable lookup for control flow statements! That's an obfuscation opportunity:

Turn statements into functions that update a memory dict
Chain these functions using composition
Turn a function definition into a lambda definition!
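The three steps above can be sketched on a two-statement function. This is a hand-written illustration, not the output of the actual tool; `assign`, `compose` and the `'$'` result slot are hypothetical helpers (the `'$'` convention is borrowed from the challenge on the next slide).

```python
from functools import reduce

# Original function, before obfuscation.
def double_next(x):
    y = x + 1
    return y * 2

# A statement `name = expr` becomes a function that updates a memory dict
# and returns it, so that statements can be chained.
assign = lambda name, expr: lambda mem: (mem.__setitem__(name, expr(mem)), mem)[-1]

# Sequencing of statements becomes function composition.
compose = lambda *steps: lambda mem: reduce(lambda m, step: step(m), steps, mem)

# The function definition itself becomes a lambda over the memory dict.
obfuscated = lambda x: compose(
    assign('y', lambda _: _['x'] + 1),
    assign('$', lambda _: _['y'] * 2),
)({'x': x})['$']

print(double_next(3), obfuscated(3))
```

Inlining `assign` and `compose` by hand gives the kind of nested-lambda expression shown in the challenge.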

From the HITB challenge

(lambda g, c, d: (lambda _: (_.__setitem__('$', ''.join([(_['chr'] if ('chr'
in _) else chr)((_['_'] if ('_' in _) else _)) for _['_'] in (_['s'] if ('s'
in _) else s)[::(-1)]])), _)[-1])( (lambda _: (lambda f, _: f(f, _))((lambda
__,_: ((lambda _: __(__, _))((lambda _: (_.__setitem__('i', ((_['i'] if ('i'
in _) else i) + 1)),_)[(-1)])((lambda _: (_.__setitem__('s',((_['s'] if ('s'
in _) else s) + [((_['l'] if ('l' in _) else l)[(_['i'] if ('i' in _) else i
)] ^ (_['c'] if ('c' in _) else c))])), _)[-1])(_))) if (((_['g'] if ('g' in
_) else g) % 4) and ((_['i'] if ('i' in _) else i)< (_['len'] if ('len' in _
) else len)((_['l'] if ('l' in _) else l)))) else _)), _) ) ( (lambda _: (_.
__setitem__('!', []), _.__setitem__('s', _['!']), _)[(-1)] ) ((lambda _: (_.
__setitem__('!', ((_['d'] if ('d' in _) else d) ^ (_['d'] if ('d' in _) else
d))), _.__setitem__('i', _['!']), _)[(-1)])((lambda _: (_.__setitem__('!', [
(_['j'] if ('j' in _) else j) for  _[ 'i'] in (_['zip'] if ('zip' in _) else
zip)((_['l0'] if ('l0' in _) else l0), (_['l1'] if ('l1' in _) else l1)) for
_['j'] in (_['i'] if ('i' in _) else i)]), _.__setitem__('l', _['!']), _)[-1
])((lambda _: (_.__setitem__('!', [1373, 1281, 1288, 1373, 1290, 1294, 1375,
1371,1289, 1281, 1280, 1293, 1289, 1280, 1373, 1294, 1289, 1280, 1372, 1288,
1375,1375, 1289, 1373, 1290, 1281, 1294, 1302, 1372, 1355, 1366, 1372, 1302,
1360, 1368, 1354, 1364, 1370, 1371, 1365, 1362, 1368, 1352, 1374, 1365, 1302
]), _.__setitem__('l1',_['!']), _)[-1])((lambda _: (_.__setitem__('!',[1375,
1368, 1294, 1293, 1373, 1295, 1290, 1373, 1290, 1293, 1280, 1368, 1368,1294,
1293, 1368, 1372, 1292, 1290, 1291, 1371, 1375, 1280, 1372, 1281, 1293,1373,
1371, 1354, 1370, 1356, 1354, 1355, 1370, 1357, 1357, 1302, 1366, 1303,1368,
1354, 1355, 1356, 1303, 1366, 1371]), _.__setitem__('l0', _['!']), _)[(-1)])
                ({ 'g': g, 'c': c, 'd': d, '$': None})))))))['$'])

Interested? Give it a try at http://blog.quarkslab.com

Bytecode Modification

Many opportunities there!

Use a different bytecode mapping (Dropbox does this!)
Introduce new opcodes
Use non-standard opcode sequences

Opcode Shuffling

Modify the interpreter so that:

>>> import dis
>>> print dis.opmap['BINARY_ADD']
23

Turns into

>>> import dis
>>> print dis.opmap['BINARY_ADD']
62

and so on for bytecode generation etc

Constraints

Respect opcode arity
Some opcode values must respect some constraints
- Contiguous opcodes
- Constant step between opcodes
see python/Include/opcode.h

Shuffle opcodes by group for custom interpreter generation!
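A per-group shuffle can be sketched like this. It is a toy model only: `OPMAP` is a small hand-picked subset of the Python 2.7 opcode table, `shuffle_by_group` is a hypothetical helper, and only the arity constraint (opcodes at or above `HAVE_ARGUMENT` take an argument) is modelled, not the contiguity or constant-step constraints.

```python
import random

HAVE_ARGUMENT = 90  # Python 2.7: opcodes >= 90 take an argument

# Small subset of the Python 2.7 opcode table, for illustration only.
OPMAP = {
    "POP_TOP": 1, "ROT_TWO": 2, "ROT_THREE": 3,
    "BINARY_MODULO": 22, "BINARY_ADD": 23, "BINARY_SUBTRACT": 24,
    "RETURN_VALUE": 83,
    "LOAD_CONST": 100, "LOAD_FAST": 124, "STORE_FAST": 125,
}

def shuffle_by_group(opmap, seed):
    """Permute opcode values separately among no-arg and with-arg opcodes."""
    rng = random.Random(seed)
    shuffled = {}
    for has_arg in (False, True):
        names = sorted(n for n, v in opmap.items()
                       if (v >= HAVE_ARGUMENT) == has_arg)
        values = [opmap[n] for n in names]
        rng.shuffle(values)
        shuffled.update(zip(names, values))
    return shuffled

new_opmap = shuffle_by_group(OPMAP, seed=2014)
print(new_opmap["BINARY_ADD"])
```

Each seed yields a different mapping, so every generated interpreter can speak its own bytecode dialect while the argument-decoding logic stays valid.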

New Opcode Generation

An opcode is stored in a char but only ~112 are used!

Create aliasing between opcodes (easy)
Create new opcodes that behave like an opcode sequence (more interesting)

Collect frequently used opcode sequences
Turn them into a single opcode (CISC anyone?)
With an extension to handle opcodes with more than two arguments

Frequently Used Opcodes

Recursively walk through a .pyc and build the histogram, using marshal.loads and inspect.iscode
Pick frequently used opcode sequences
Perform substitution (beware of jumps and exceptions!) [.pyc→.pyc]
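The histogram step can be sketched as below, assuming a recent Python 3 and `dis.get_instructions`. To stay self-contained it compiles a source string instead of reading a file; for a real .pyc, the code object would come from `marshal.loads` applied to the file contents after the header. `iter_code` and `pair_histogram` are names made up for this sketch.

```python
import dis
import inspect
from collections import Counter

def iter_code(code):
    """Yield a code object and, recursively, every code object it embeds."""
    yield code
    for const in code.co_consts:
        if inspect.iscode(const):
            yield from iter_code(const)

def pair_histogram(code):
    """Count how often each pair of consecutive opcodes occurs."""
    hist = Counter()
    for co in iter_code(code):
        names = [ins.opname for ins in dis.get_instructions(co)]
        hist.update(zip(names, names[1:]))
    return hist

source = "def f(a, b):\n    return a + b\n\nx = f(1, 2)\n"
module_code = compile(source, "<demo>", "exec")
hist = pair_histogram(module_code)

# Most frequent pairs are the best candidates for a fused opcode.
for pair, count in hist.most_common(3):
    print(pair, count)
```

Counting pairs per code object avoids fusing opcodes across function boundaries, but it does not yet account for jump targets landing in the middle of a pair.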

For instance:

LOAD_FAST                0
LOAD_CONST               n

Turns into:

LOAD_FAST_LOAD_CONST     0
ANY_OPCODE_WITH_ARG      n

dis reaaaaly dislikes this one :-)

Unusual Opcode Sequence

Decompilers make assumptions on bytecode sequences (some think decompiling ~= pattern matching)

LOAD_FAST 0
LOAD_FAST 1
BUILD_MAP 0
ROT_THREE
BINARY_ADD
ROT_TWO
POP_TOP

Is equivalent to

LOAD_FAST 0
LOAD_FAST 1
BINARY_ADD

This makes uncompyle crash! But not pycdc...
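The equivalence of the two sequences can be checked with a toy stack machine. This is only a model of the five opcodes involved, following their documented stack effects; `run` is a hypothetical helper, not part of CPython.

```python
def run(program, local_vars):
    """Execute a list of (opname, *args) tuples and return the final stack."""
    stack = []
    for op, *args in program:
        if op == "LOAD_FAST":
            stack.append(local_vars[args[0]])
        elif op == "BUILD_MAP":
            stack.append({})
        elif op == "ROT_THREE":
            # Move the top item down to third position.
            top = stack.pop()
            stack.insert(-2, top)
        elif op == "ROT_TWO":
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "POP_TOP":
            stack.pop()
    return stack

obfuscated = [("LOAD_FAST", 0), ("LOAD_FAST", 1), ("BUILD_MAP", 0),
              ("ROT_THREE",), ("BINARY_ADD",), ("ROT_TWO",), ("POP_TOP",)]
plain = [("LOAD_FAST", 0), ("LOAD_FAST", 1), ("BINARY_ADD",)]

print(run(obfuscated, [2, 3]), run(plain, [2, 3]))
```

The empty dict is pushed, rotated under the operands, rotated back on top after the addition and then dropped, so it never influences the result.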

Constants Encryption

>>> def foo(): return "hack.lu"
>>> import dis
>>> dis.dis(foo)
1           0 LOAD_CONST               1 ('hack.lu')
            3 RETURN_VALUE

Strings are loaded using LOAD_CONST, so...

Encrypt every string constant
Hook into LOAD_CONST to perform on-the-fly decryption

proof of concept... rot13... shame
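The encryption half can be sketched in pure Python 3 (3.8+ for `code.replace`). The decryption half lives inside the modified interpreter's `LOAD_CONST` and is only simulated here by decoding the returned value; `rot13` and `encrypt_constants` are hypothetical helpers.

```python
import codecs
import types

def rot13(text):
    # rot13 is its own inverse, which keeps the sketch short.
    return codecs.encode(text, "rot_13")

def encrypt_constants(func):
    """Return a copy of func whose string constants are rot13-encoded."""
    code = func.__code__
    new_consts = tuple(rot13(c) if isinstance(c, str) else c
                       for c in code.co_consts)
    new_code = code.replace(co_consts=new_consts)
    return types.FunctionType(new_code, func.__globals__, func.__name__)

def foo():
    return "hack.lu"

encrypted_foo = encrypt_constants(foo)

# A stock interpreter loads the constant as stored, i.e. still encrypted.
ciphertext = encrypted_foo()
# What the hooked LOAD_CONST would hand back to the program.
plaintext = rot13(ciphertext)

print(ciphertext, plaintext)
```

With this in place, dumping the constants of the shipped bytecode only shows ciphertext; the strength of the scheme is of course that of the cipher, and rot13 has none.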

Diving into CPython

Wanna write self modifying code?

Each function embeds its bytecode as a string :-)

But strings are immutable in Python :-(

Unless you modify them in a native module ;-)

Self Modifying Code

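#include <Python.h>
#include <frameobject.h>  /* PyFrameObject */
#include <opcode.h>       /* INPLACE_MODULO */

static PyObject* this_function_modifies_its_caller() {
  PyThreadState *tstate = PyThreadState_GET();
  if (NULL != tstate && NULL != tstate->frame) {
    /* Frame of the Python function that called us */
    PyFrameObject *frame = tstate->frame;

    /* Offset of the instruction being executed in the caller (the call itself) */
    int instr = frame->f_lasti;
    /* Sidestep string immutability: write directly into the bytecode buffer */
    unsigned char* bytes = (unsigned char*)PyString_AS_STRING(frame->f_code->co_code);
    /* Assumes the binary operation to patch sits 10 bytes after the call */
    bytes[instr + 10] = INPLACE_MODULO;
  }
  Py_INCREF(Py_None);
  return Py_None;
}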

Get parent frame
Get function's code
Replace « next opcode » by a modulo

Call this before a binary operation to turn it into a modulo

Extra Stuff

Change MAGIC number to a random value
Disable introspection on code object
Disable dump[s] from the marshal module
Disable bytecode recompilation upon change
Basically, make the interpreter less dynamic while still OK for a given application
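To see why the first item matters, note that every .pyc starts with the interpreter's magic number, which standard tools check before going any further. A minimal sketch for Python 3, using a throwaway file in a temporary directory:

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "hello.py")
    pyc = os.path.join(tmp, "hello.pyc")
    with open(src, "w") as f:
        f.write("print('hello world')\n")
    py_compile.compile(src, cfile=pyc, doraise=True)
    # The first four bytes of the file are the magic number.
    with open(pyc, "rb") as f:
        header_magic = f.read(4)

magic_matches = header_magic == importlib.util.MAGIC_NUMBER
print(header_magic, magic_matches)
```

An interpreter built with a different magic value writes files that a stock interpreter, and tools keyed on known magic values, refuse to load until someone patches the header back.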

Bonus Points

Use a Python packer (e.g. pyinstaller) to bundle your Python application and the modified Python interpreter in a single binary
Use a Python compiler (e.g. numba, shedskin, pythran) to turn some functions/modules into native code
Use a C obfuscator to obfuscate the obfuscating part of the interpreter (see you at the lightning talk!)

How To...

$ ../configure --help | grep enable
[...]
  --disable-marshal       hide marshal functions
  --disable-codeobject    hide codeobject functions
  --disable-recompilation disable recompilation of .pyc file when .py file is
  --enable-cipher-str     enable string litteral ciphering
  --enable-shuffle-opcode enable opcodes shuffling
  --enable-gen-opcode     enable generation of new opcodes

Don't expect good engineering there though :-$

THE END

THE AUTHORS: Nicolas Szlifierski and Serge Guelton
THE REPO: https://github.com/quarkslab/cpython
branch obfuscated/2.7