Python Code Obfuscation

taking the snake to the dark side

Always two there are, no more, no less:
an apprentice Nicolas Szlifierski [Quarkslab, Telecom Bretagne]
and a master Serge Guelton [Quarkslab, Telecom Bretagne]

/me

Serge « sans paille » Guelton

$ whoami
sguelton
  • R&D engineer at Quarkslab on compilation for security
  • Associate researcher at Télécom Bretagne

Python bytecode is easy to reverse

$ echo "print('hello world')" > hello.py
$ python -m py_compile hello.py
$ pycdc hello.pyc
# Source Generated with Decompyle++
# File: hello.pyc (Python 2.7)

print 'hello world'

With an optimization flag?

$ printf "a = 1\nif a: print(a + 2)" > dce.py
$ python -O -m py_compile dce.py
$ pycdc dce.pyo
# Source Generated with Decompyle++
# File: dce.pyo (Python 2.7)

a = 1
if a:
    print a + 2
CPython performs close to zero optimization: even with -O, the always-true test is not removed.
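
To see why decompilers have such an easy time, here is a minimal sketch, written for a current Python 3 interpreter rather than the Python 2.7 sessions shown on the slides. It round-trips a code object through marshal, the serialization used inside .pyc files, and lists what survives. The file name is only a placeholder.

```python
import dis
import marshal

source = "print('hello world')"
code = compile(source, "hello.py", "exec")

# A .pyc file is essentially a small header followed by marshal.dumps(code).
blob = marshal.dumps(code)
recovered = marshal.loads(blob)

# Names and string constants are stored in clear text in the code object.
print(recovered.co_names)
print(recovered.co_consts)

# The instruction stream itself is just as readable.
dis.dis(recovered)
```

Everything a decompiler needs (names, constants, instruction stream) is available without any guessing.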

Hardening Solutions

  • Source code modification
  • Bytecode modification
  • Interpreter modification
  • Any other idea? Come and talk to me ;-)

Eenie, meenie, minie, moe... oh, why not all of them?

Source Code Modification

Python's semantics make source-to-source transformation hard because of late binding and polymorphism:

s = ""
for i in range(10):
    s += hex(i)

Nothing is as it seems...

range = lambda *args: args

Monkey (Island) patching, anyone?

import __builtin__
__builtin__.hex, __builtin__.oct = __builtin__.oct, __builtin__.hex
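
The effect of late binding can be checked with a short sketch. It is written for Python 3, where the module is called builtins instead of __builtin__; the function name hexes is only illustrative. The same function returns different results before and after the swap, so a source-to-source tool cannot assume it knows what hex means.

```python
import builtins  # named __builtin__ in Python 2


def hexes():
    s = ""
    for i in range(3):
        s += hex(i)  # 'hex' is looked up at call time, not at definition time
    return s


before = hexes()

# Swap two builtins, as on the slide, then restore them.
builtins.hex, builtins.oct = builtins.oct, builtins.hex
try:
    after = hexes()
finally:
    builtins.hex, builtins.oct = builtins.oct, builtins.hex

print(before)
print(after)
```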

Obfuscate Control Flow

No variable lookup for control flow statements! That's an obfuscation opportunity:

  1. Turn statements into functions that update a memory dict
  2. Chain these functions using composition
  3. Turn a function definition into a lambda definition!
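
The three steps can be sketched on a toy function. This is a hand-written illustration and not the output of the authors' tool; the helper name store and the variable names are assumptions.

```python
# Original function, for reference:
#
# def f(x):
#     y = x + 1
#     z = y * 2
#     return z

# Helper (hypothetical name): store a value in the memory dict, return the dict.
store = lambda env, key, value: (env.__setitem__(key, value), env)[-1]

# Step 1: each statement becomes a function from memory dict to memory dict.
stmt1 = lambda env: store(env, 'y', env['x'] + 1)
stmt2 = lambda env: store(env, 'z', env['y'] * 2)

# Steps 2 and 3: chain by composition and wrap everything in a single lambda.
f = lambda x: stmt2(stmt1({'x': x}))['z']

print(f(3))
```

Inlining store, stmt1 and stmt2 into f, and renaming env to _, gives expressions in the style of the challenge that follows.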

From the HITB challenge

(lambda g, c, d: (lambda _: (_.__setitem__('$', ''.join([(_['chr'] if ('chr'
in _) else chr)((_['_'] if ('_' in _) else _)) for _['_'] in (_['s'] if ('s'
in _) else s)[::(-1)]])), _)[-1])( (lambda _: (lambda f, _: f(f, _))((lambda
__,_: ((lambda _: __(__, _))((lambda _: (_.__setitem__('i', ((_['i'] if ('i'
in _) else i) + 1)),_)[(-1)])((lambda _: (_.__setitem__('s',((_['s'] if ('s'
in _) else s) + [((_['l'] if ('l' in _) else l)[(_['i'] if ('i' in _) else i
)] ^ (_['c'] if ('c' in _) else c))])), _)[-1])(_))) if (((_['g'] if ('g' in
_) else g) % 4) and ((_['i'] if ('i' in _) else i)< (_['len'] if ('len' in _
) else len)((_['l'] if ('l' in _) else l)))) else _)), _) ) ( (lambda _: (_.
__setitem__('!', []), _.__setitem__('s', _['!']), _)[(-1)] ) ((lambda _: (_.
__setitem__('!', ((_['d'] if ('d' in _) else d) ^ (_['d'] if ('d' in _) else
d))), _.__setitem__('i', _['!']), _)[(-1)])((lambda _: (_.__setitem__('!', [
(_['j'] if ('j' in _) else j) for  _[ 'i'] in (_['zip'] if ('zip' in _) else
zip)((_['l0'] if ('l0' in _) else l0), (_['l1'] if ('l1' in _) else l1)) for
_['j'] in (_['i'] if ('i' in _) else i)]), _.__setitem__('l', _['!']), _)[-1
])((lambda _: (_.__setitem__('!', [1373, 1281, 1288, 1373, 1290, 1294, 1375,
1371,1289, 1281, 1280, 1293, 1289, 1280, 1373, 1294, 1289, 1280, 1372, 1288,
1375,1375, 1289, 1373, 1290, 1281, 1294, 1302, 1372, 1355, 1366, 1372, 1302,
1360, 1368, 1354, 1364, 1370, 1371, 1365, 1362, 1368, 1352, 1374, 1365, 1302
]), _.__setitem__('l1',_['!']), _)[-1])((lambda _: (_.__setitem__('!',[1375,
1368, 1294, 1293, 1373, 1295, 1290, 1373, 1290, 1293, 1280, 1368, 1368,1294,
1293, 1368, 1372, 1292, 1290, 1291, 1371, 1375, 1280, 1372, 1281, 1293,1373,
1371, 1354, 1370, 1356, 1354, 1355, 1370, 1357, 1357, 1302, 1366, 1303,1368,
1354, 1355, 1356, 1303, 1366, 1371]), _.__setitem__('l0', _['!']), _)[(-1)])
                ({ 'g': g, 'c': c, 'd': d, '$': None})))))))['$'])

Interested? Give it a try at http://blog.quarkslab.com

Bytecode Modification

Many opportunities there!

  • Use a different bytecode mapping (Dropbox does this!)
  • Introduce new opcodes
  • Use non-standard opcode sequences

Opcode Shuffling

Modify the interpreter so that:

>>> import dis
>>> print dis.opmap['BINARY_ADD']
23

Turns into

>>> import dis
>>> print dis.opmap['BINARY_ADD']
62

and likewise for bytecode generation, etc.

Constraints

  • Respect opcode arity
  • Some opcode values must satisfy constraints:
    • Contiguous opcodes
    • Constant step between opcodes
    see python/Include/opcode.h

Shuffle opcodes within groups to generate a custom interpreter!
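
A minimal sketch of group-wise shuffling, assuming only the arity constraint: opcodes below HAVE_ARGUMENT take no argument, the others take one. The table is a small excerpt of the CPython 2.7 opcode values, and the function name shuffle_opcodes is hypothetical. The contiguity and constant-step constraints listed above are not handled here.

```python
import random

# Small excerpt of the CPython 2.7 opcode table (see Include/opcode.h).
OPMAP = {
    'POP_TOP': 1, 'ROT_TWO': 2, 'ROT_THREE': 3,
    'BINARY_MODULO': 22, 'BINARY_ADD': 23, 'RETURN_VALUE': 83,
    'STORE_NAME': 90, 'LOAD_CONST': 100, 'LOAD_NAME': 101, 'LOAD_FAST': 124,
}
HAVE_ARGUMENT = 90  # opcodes >= HAVE_ARGUMENT take an argument


def shuffle_opcodes(opmap, seed):
    """Return a new name -> value mapping, shuffled within each arity group."""
    rng = random.Random(seed)
    shuffled = {}
    for has_arg in (False, True):
        names = sorted(n for n, v in opmap.items()
                       if (v >= HAVE_ARGUMENT) == has_arg)
        values = [opmap[n] for n in names]
        rng.shuffle(values)  # permute the values inside this group only
        shuffled.update(zip(names, values))
    return shuffled


new_map = shuffle_opcodes(OPMAP, seed=42)
for name in sorted(new_map):
    print(name, OPMAP[name], '->', new_map[name])
```

The resulting mapping would then have to be applied consistently to the compiler, the interpreter loop and the dis tables of the custom build.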

New Opcode Generation

An opcode is stored in a char, but only ~112 of the 256 possible values are used!

  • Create aliasing between opcodes (easy)
  • Create new opcodes that behave like an opcode sequence (more interesting)
  1. Collect frequently used opcode sequences
  2. Turn them into a single opcode (CISC anyone?)
  3. With an extension to handle opcodes with more than two arguments

Frequently Used Opcodes

  1. Recursively walk through a .pyc and build the histogram, using marshal.loads and inspect.iscode
  2. Pick frequently used opcode sequences
  3. Perform substitution (beware of jumps and exceptions!) [.pyc → .pyc]
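
The first two steps can be sketched as follows, for Python 3 and for pairs of consecutive opcodes only. The function name opcode_pairs is an assumption. For brevity the code object comes from compile(); for a real .pyc it would come from marshal.loads applied to the file contents after the header.

```python
import dis
import inspect
from collections import Counter


def opcode_pairs(code, hist=None):
    """Count pairs of consecutive opcodes in 'code' and in nested code objects."""
    hist = Counter() if hist is None else hist
    names = [ins.opname for ins in dis.get_instructions(code)]
    hist.update(zip(names, names[1:]))
    # Functions, classes and lambdas are stored as constants of their parent.
    for const in code.co_consts:
        if inspect.iscode(const):
            opcode_pairs(const, hist)
    return hist


source = "def f(a):\n    return a + 1\n\ndef g(b):\n    return b + 2\n"
hist = opcode_pairs(compile(source, "<sample>", "exec"))
for pair, count in hist.most_common(3):
    print(count, pair)
```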

For instance:

LOAD_FAST                0
LOAD_CONST               n

Turns into:

LOAD_FAST_LOAD_CONST     0
ANY_OPCODE_WITH_ARG      n

dis reaaaaly dislikes this one :-)
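
The substitution shown above can be sketched on a symbolic instruction list of (opcode name, argument) pairs. This toy version, with the hypothetical function name fuse, keeps the instruction count unchanged. It does not deal with jump targets landing in the middle of a fused pair, which a real implementation must handle.

```python
def fuse(instructions):
    """Fuse each LOAD_FAST / LOAD_CONST pair into the two-slot form shown above."""
    out = []
    i = 0
    while i < len(instructions):
        op, arg = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op == 'LOAD_FAST' and nxt is not None and nxt[0] == 'LOAD_CONST':
            out.append(('LOAD_FAST_LOAD_CONST', arg))
            # Second slot: presumably only its argument matters to the interpreter.
            out.append(('ANY_OPCODE_WITH_ARG', nxt[1]))
            i += 2
        else:
            out.append((op, arg))
            i += 1
    return out


program = [('LOAD_FAST', 0), ('LOAD_CONST', 1),
           ('BINARY_ADD', None), ('RETURN_VALUE', None)]
print(fuse(program))
```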

Unusual Opcode Sequence

Decompilers make assumptions about bytecode sequences (some think decompiling ~= pattern matching)

LOAD_FAST 0
LOAD_FAST 1
BUILD_MAP 0
ROT_THREE
BINARY_ADD
ROT_TWO
POP_TOP

Is equivalent to

LOAD_FAST 0
LOAD_FAST 1
BINARY_ADD

This makes uncompyle crash! But not pycdc...
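
The equivalence can be checked with a toy stack machine. This is a sketch covering only the six opcodes involved, with semantics taken from the CPython 2.7 dis documentation; the function name run is an assumption.

```python
def run(instructions, local_vars):
    """Interpret a tiny subset of CPython 2.7 stack opcodes."""
    stack = []
    for op, arg in instructions:
        if op == 'LOAD_FAST':
            stack.append(local_vars[arg])
        elif op == 'BUILD_MAP':
            stack.append({})
        elif op == 'ROT_TWO':
            stack[-1], stack[-2] = stack[-2], stack[-1]
        elif op == 'ROT_THREE':
            top = stack.pop()
            stack.insert(-2, top)  # top of stack moves down to third position
        elif op == 'BINARY_ADD':
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == 'POP_TOP':
            stack.pop()
        else:
            raise ValueError('unsupported opcode: %s' % op)
    return stack


obfuscated = [('LOAD_FAST', 0), ('LOAD_FAST', 1), ('BUILD_MAP', 0),
              ('ROT_THREE', None), ('BINARY_ADD', None),
              ('ROT_TWO', None), ('POP_TOP', None)]
plain = [('LOAD_FAST', 0), ('LOAD_FAST', 1), ('BINARY_ADD', None)]

print(run(obfuscated, [20, 22]), run(plain, [20, 22]))
```

The dummy dict is pushed, rotated out of the way and finally popped, so both sequences leave the same single value on the stack.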

Constants Encryption


>>> def foo(): return "hack.lu"
>>> import dis
>>> dis.dis(foo)
  1           0 LOAD_CONST               1 ('hack.lu')
              3 RETURN_VALUE


Strings are loaded using LOAD_CONST, so...

  1. Encrypt every string constant
  2. Hook into LOAD_CONST to perform on-the-fly decryption



proof of concept... rot13... shame
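
The encryption half can be sketched in pure Python 3 (3.8 or later, for code.replace). The function name encrypt_consts is an assumption. The decryption half lives in the modified interpreter's LOAD_CONST and cannot be reproduced from Python, so the last line only mimics it.

```python
import codecs
import types


def encrypt_consts(code):
    """Return a copy of 'code' whose string constants are rot13-encoded."""
    new_consts = []
    for const in code.co_consts:
        if isinstance(const, str):
            const = codecs.encode(const, 'rot13')
        elif isinstance(const, types.CodeType):
            const = encrypt_consts(const)  # nested functions, classes...
        new_consts.append(const)
    return code.replace(co_consts=tuple(new_consts))  # Python 3.8+


def foo():
    return "hack.lu"


encrypted = encrypt_consts(foo.__code__)

# A stock interpreter has no decrypting LOAD_CONST, so it returns the ciphertext.
foo_encrypted = types.FunctionType(encrypted, {})
ciphertext = foo_encrypted()
print(ciphertext)

# What a hooked LOAD_CONST would do on the fly:
print(codecs.decode(ciphertext, 'rot13'))
```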

Diving into CPython

Wanna write self modifying code?

Each function embeds its bytecode as a string :-)

But strings are immutable in Python :-(

Unless you modify them in a native module ;-)
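
The immutability is easy to observe from pure Python. A small sketch, run here under Python 3 where the bytecode is a bytes object rather than a str:

```python
def foo(a, b):
    return a + b


bytecode = foo.__code__.co_code  # 'str' in Python 2, 'bytes' in Python 3
print(type(bytecode).__name__, len(bytecode))

error = None
try:
    bytecode[0] = 0  # pure Python cannot patch the bytecode in place
except TypeError as exc:
    error = exc
    print('refused:', exc)
```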

Self Modifying Code

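#include <Python.h>
#include <frameobject.h>  /* PyFrameObject */
#include <opcode.h>       /* INPLACE_MODULO */

/* Python 2.7 only: patches the bytecode of the calling Python function. */
static PyObject* this_function_modifies_its_caller() {
  PyThreadState *tstate = PyThreadState_GET();
  if (NULL != tstate && NULL != tstate->frame) {
    /* a C function has no frame of its own: this is the caller's frame */
    PyFrameObject *frame = tstate->frame;

    /* offset of the instruction being executed in the caller (the call itself) */
    int instr = frame->f_lasti;
    /* cast away the immutability of the bytecode string */
    unsigned char* bytes = (unsigned char*)PyString_AS_STRING(frame->f_code->co_code);
    /* the fixed offset assumes a known layout of the caller's bytecode */
    bytes[instr + 10] = INPLACE_MODULO;
  }
  Py_INCREF(Py_None);
  return Py_None;
}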
  1. Get parent frame
  2. Get function's code
  3. Replace « next opcode » by a modulo

Call this before a binary operation to turn it into a modulo

Extra Stuff

  • Change MAGIC number to a random value
  • Disable introspection on code object
  • Disable dump[s] from the marshal module
  • Disable bytecode recompilation upon change
  • Basically, make the interpreter less dynamic while still usable for a given application
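
As an illustration of the first item, the magic number is simply the first four bytes of a compiled file. A sketch for a stock Python 3, with placeholder file names:

```python
import importlib.util
import os
import py_compile
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'hello.py')
pyc = os.path.join(workdir, 'hello.pyc')
with open(src, 'w') as handle:
    handle.write("print('hello world')\n")

# Compile to an explicit .pyc path instead of the __pycache__ directory.
py_compile.compile(src, cfile=pyc, doraise=True)

with open(pyc, 'rb') as handle:
    magic = handle.read(4)

# The header starts with the interpreter's own magic number.
print(magic, importlib.util.MAGIC_NUMBER)
```

An interpreter built with a different value would reject files compiled by a stock interpreter, and tools that rely on the magic number to identify the Python version would presumably no longer recognize its files.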

Bonus Points

  • Use a Python packer (e.g. pyinstaller) to bundle your Python application and the modified Python interpreter in a single binary
  • Use a Python compiler (e.g. numba, shedskin, pythran) to turn some functions/modules into native code
  • Use a C obfuscator to obfuscate the obfuscating part of the interpreter (see you at the lightning talk!)

How To...

$ ../configure --help | grep enable
[...]
  --disable-marshal       hide marshal functions
  --disable-codeobject    hide codeobject functions
  --disable-recompilation disable recompilation of .pyc file when .py file is
  --enable-cipher-str     enable string litteral ciphering
  --enable-shuffle-opcode enable opcodes shuffling
  --enable-gen-opcode     enable generation of new opcodes
Don't expect good engineering there though :-$

THE END

THE AUTHORS
Nicolas Szlifierski and Serge Guelton
THE REPO
https://github.com/quarkslab/cpython
branch obfuscated/2.7