Breaking Python's PRNG with a few values and no bruteforce

Introduction

Python’s random module utilizes the Mersenne Twister pseudorandom number generator (PRNG), specifically MT19937. It’s a well-known fact that this PRNG is not cryptographically secure, as with access to a sufficient number of outputs from this PRNG (specifically 624 in total) it becomes possible to predict subsequent outputs. On the same subject, Ambionics Security demonstrated that it is possible to recover the Mersenne Twister seed knowing only two outputs of PHP’s mt_rand() function, without any bruteforce [Amb].

Given that both Python and PHP employ the same PRNG, a natural question arises: can we apply this discovery to Python’s PRNG as well? Initially, a quick examination of Python’s implementation reveals a significantly different seeding mechanism, raising doubts about this possiblity. However, upon thorough investigation, we discovered an alternative method to leverage their findings and adapt them to Python’s PRNG.

In this post, our focus will be on Python’s random implementation, drawing comparisons to PHP’s approach. We’ll cover the similarities and differences, particularly in how seeding is performed. Furthermore, we’ll demonstrate that even with a small number of outputs (as few as 6 for a 32-bit seed, akin to PHP), it’s feasible to deduce Python’s original seed.

Our work has been done on Python 3.13 but should be applicable to earlier versions of Python 3 as well.

Similarities with PHP

As stated in the introduction, our goal is to leverage the findings of [Amb] and try to apply them to Python as both languages use the same underlying PRNG. For this reason, this first part will be very similar to their previous work as they share a common foundation. We advise the curious reader to take a look at their work first.

MT19937’s random number generation process can be summarized as such :

  1. Generate an initial state array from the seed
  2. Update the state to get a new internal state
  3. When a number needs to be generated, return a tempered state value
  4. When all the state values have been used, go back to step 2

As our ultimate goal is to find the seed from the knowledge of some ouputs, we will start by analyzing how numbers are generated from the internal state.

Number generation

The number generation is done in the genrand_uint32 function, which also performs the state update, but it will be analyzed separately.

Generating a number simply consists in tempering a state value and returning it, as can be seen in the code below.

/* generates a random number on [0,0xffffffff]-interval */
static uint32_t
genrand_uint32(RandomObject *self)
{
    uint32_t y;
    //...
    uint32_t *mt;

    mt = self->state;
    if (self->index >= N) {
        // state update
    }

    y = mt[self->index++];
    y ^= (y >> 11);
    y ^= (y << 7) & 0x9d2c5680U;
    y ^= (y << 15) & 0xefc60000U;
    y ^= (y >> 18);
    return y;
}

The tempering step is reversible, meaning it is possible to recover the internal state value from the generated output.

This can be done like this in Python.

def unshiftRight(x, shift):
    res = x
    for i in range(32):
        res = x ^ res >> shift
    return res

def unshiftLeft(x, shift, mask):
    res = x
    for i in range(32):
        res = x ^ (res << shift & mask)
    return res

def untemper(v):
    v = unshiftRight(v, 18)
    v = unshiftLeft(v, 15, 0xefc60000)
    v = unshiftLeft(v, 7, 0x9d2c5680)
    v = unshiftRight(v, 11)
    return v

This is why it is possible to predict the next outputs of the PRNG with the knowledge of 624 consecutive 32-bits outputs, as this is equivalent to knowing the entire internal state, making it possible to clone the PRNG.

proof of concept
import random

random.seed(1234)

# You don't see some of the outputs
for _ in range(1234):
    random.getrandbits(32)

# you capture 624 consecutive outputs
state = [untemper(random.getrandbits(32)) for _ in range(624)]

print("Normal run :")

print(random.getrandbits(32))
print(random.random())
print(random.randbytes(4).hex())
print(random.randrange(1, 100000))

print("\nPredicted run :")

# set RNG state from observed ouputs
random.setstate((3, tuple(state + [624]), None))

print(random.getrandbits(32))
print(random.random())
print(random.randbytes(4).hex())
print(random.randrange(1, 100000))

Let’s name the current state S and the initial state I.

Knowing one of the first 624 outputs is equivalent to knowing a value of S.

We’ll only consider knowledge of some of the first 624 outputs, so only a single state update separates S from I.

Next we’ll look at the state update mechanism which transforms I into S.

Updating the internal state

The state update is also done in the genrand_uint32 function and is performed every 624 outputs, or immediately on the first number generation, after seeding.

#define N 624
#define M 397
#define MATRIX_A 0x9908b0dfU    /* constant vector a */
#define UPPER_MASK 0x80000000U  /* most significant w-r bits */
#define LOWER_MASK 0x7fffffffU  /* least significant r bits */
/* generates a random number on [0,0xffffffff]-interval */
static uint32_t
genrand_uint32(RandomObject *self)
{
    uint32_t y;
    static const uint32_t mag01[2] = {0x0U, MATRIX_A};
    /* mag01[x] = x * MATRIX_A  for x=0,1 */
    uint32_t *mt;

    mt = self->state;
    if (self->index >= N) { /* generate N words at one time */
        int kk;

        for (kk=0;kk<N-M;kk++) {
            y = (mt[kk]&UPPER_MASK)|(mt[kk+1]&LOWER_MASK);
            mt[kk] = mt[kk+M] ^ (y >> 1) ^ mag01[y & 0x1U];
        }
        for (;kk<N-1;kk++) {
            y = (mt[kk]&UPPER_MASK)|(mt[kk+1]&LOWER_MASK);
            mt[kk] = mt[kk+(M-N)] ^ (y >> 1) ^ mag01[y & 0x1U];
        }
        y = (mt[N-1]&UPPER_MASK)|(mt[0]&LOWER_MASK);
        mt[N-1] = mt[M-1] ^ (y >> 1) ^ mag01[y & 0x1U];

        self->index = 0;
    }

    // tempering
}

To help us model the state update mechanism, we define the function f(x, y) as:

$$ f(x, y) = \begin{cases} ((x \& \text{0x80000000}) \mid (y \& \text{0x7fffffff})) \gg 1 & y\&1 = 0 \\ (((x \& \text{0x80000000}) \mid (y \& \text{0x7fffffff})) \gg 1) \oplus \text{0x9908b0df} & y\&1 = 1 \end{cases} $$

Depending on the LSB of y, an additionnal XOR with a constant is performed.

Using f, the state update mechanism can be represented as the succession of the following operations, in this order:

$$ \begin{align} S_i &= I_{i+397} \oplus f(I_i, I_{i+1}) & \forall i \in \left[0, 226 \right] \\ S_j &= S_{j-227} \oplus f(I_j, I_{j+1}) & \forall j \in \left[227, 622 \right] \\ S_{623} &= S_{396} \oplus f(I_{623}, S_0) \end{align} $$

From $(2)$ we obtain:

$$ \begin{aligned} S_{227} &= S_0 \oplus f(I_{227}, I_{228}) \\ S_{227} \oplus S_0 &= f(I_{227}, I_{228}) \end{aligned} $$

Because f(x, y) is mostly dependent on y, we can gain a lot of information about $I_{228}$. In fact, we can recover the MSB of $I_{227}$ and the 31 lower bits of $I_{228}$. This is already greatly explained in [Amb], the only difference being that the conditional XOR in f depends on the LSB of $I_{228}$ in Python and not $I_{227}$ as in their version of PHP (it’s like in Python for PHP7+).

Here is the Python code to do so, given two internal states $S_i$ and $S_{i-227}$:

def invertStep(si, si227):
    # S[i] ^ S[i-227] == (((I[i] & 0x80000000) | (I[i+1] & 0x7FFFFFFF)) >> 1) ^ (0x9908b0df if I[i+1] & 1 else 0)
    X = si ^ si227
    # we know the LSB of I[i+1] because MSB of 0x9908b0df is set, we can see if the XOR has been applied
    mti1 = (X & 0x80000000) >> 31
    if mti1:
        X ^= 0x9908b0df
    # undo shift right
    X <<= 1
    # now recover MSB of state I[i]
    mti = X & 0x80000000
    # recover the rest of state I[i+1]
    mti1 += X & 0x7FFFFFFF
    return mti, mti1
proof of concept
import random

I = random.getstate()[1]
# this will force a state update
random.random()
S = random.getstate()[1]

for i in range(227, 240):
    Ii, Ii1 = invertStep(S[i], S[i-227])
    print(f"{Ii} == {I[i]&0x80000000}")
    print(f"{Ii1} == {I[i+1]&0x7FFFFFFF}")

In the case of PHP, this would be sufficient to recover the 32-bit seed value used to initialize the PRNG. As described in [Amb], each value of the initial state is computed from the previous one in an invertible manner. Recovering a single one is thus enough to rewind to $I_0$, the seed.

This is where our work differs, as in Python the seeding procedure is very different.

Python’s seeding procedure

To understand Python’s seeding mechanism it is necessary to start at the highest level.

class Random(_random.Random):
    def seed(self, a=None, version=2):
        """Initialize internal state from a seed.

        The only supported seed types are None, int, float,
        str, bytes, and bytearray.

        None or no argument seeds from current time or from an operating
        system specific randomness source if available.

        If *a* is an int, all bits are used.

        For version 2 (the default), all of the bits are used if *a* is a str,
        bytes, or bytearray.  For version 1 (provided for reproducing random
        sequences from older versions of Python), the algorithm for str and
        bytes generates a narrower range of seeds.

        """

        if version == 1 and isinstance(a, (str, bytes)):
            a = a.decode('latin-1') if isinstance(a, bytes) else a
            x = ord(a[0]) << 7 if a else 0
            for c in map(ord, a):
                x = ((1000003 * x) ^ c) & 0xFFFFFFFFFFFFFFFF
            x ^= len(a)
            a = -2 if x == -1 else x

        elif version == 2 and isinstance(a, (str, bytes, bytearray)):
            global _sha512
            if _sha512 is None:
                try:
                    # hashlib is pretty heavy to load, try lean internal
                    # module first
                    from _sha2 import sha512 as _sha512
                except ImportError:
                    # fallback to official implementation
                    from hashlib import sha512 as _sha512

            if isinstance(a, str):
                a = a.encode()
            a = int.from_bytes(a + _sha512(a).digest())

        elif not isinstance(a, (type(None), int, float, str, bytes, bytearray)):
            raise TypeError('The only supported seed types are:\n'
                            'None, int, float, str, bytes, and bytearray.')

        super().seed(a)
        self.gauss_next = None

This is the implementation of the random.seed function, responsible for manually seeding the generator. This function is also called when importing the random module to initialize the PRNG automatically, so that it is not mandatory to seed it manually (which is not recommended anyways).

In the automatic case, the argument a (the seed) is None. But this fonction allows seeding with different types of Python objects, representing either numbers or bytes.

There are actually two versions of the seeding procedure when using bytes. Version 1 is only used for compatibility with older Python versions and will not be further discussed. It basically converts a str or bytes object in a 64-bit int.

With version 2, when calling the seeding function with an object of type str, bytes or bytearray the object is concatenated with its SHA-512 hash represented as raw bytes. The result is interpreted as a big-endian integer value.

Example:

import random
random.seed('my seed')
print(random.randbytes(32).hex())

# SHA-512('my seed') => c046d02141256723ba893a44f1a54538aba62218287bc56767433568ca7548571fe1760ba924e2f8b1e9e8887814c985f3827c68fdb09e639f313ec1c83476be
# hex('my seed') => 6d792073656564

equivalent_seed = 0x6d792073656564c046d02141256723ba893a44f1a54538aba62218287bc56767433568ca7548571fe1760ba924e2f8b1e9e8887814c985f3827c68fdb09e639f313ec1c83476be
random.seed(equivalent_seed)
print(random.randbytes(32).hex())

When seeding with int, float or None types, the seed is passed to the internal seeding procedure directly.

Internal seeding procedure

This part is handled by the random_seed function of _randommodule.c.

In the case of seeding with None (the default case), the following piece of code is executed:

if (arg == NULL || arg == Py_None) {
    if (random_seed_urandom(self) < 0) {
        PyErr_Clear();

        /* Reading system entropy failed, fall back on the worst entropy:
            use the current time and process identifier. */
        if (random_seed_time_pid(self) < 0) {
            return -1;
        }
    }
    return 0;
}

The PRNG is preferably seeded using the operating system’s cryptographically secure PRNG (CSPRNG). If this fails, the seeding procedure falls back to using a weaker source of entropy, the current time and the PID.

Seeding using the system CSPRNG

The system specific seeding procedure is handled by the following function:

static int
random_seed_urandom(RandomObject *self)
{
    uint32_t key[N];

    if (_PyOS_URandomNonblock(key, sizeof(key)) < 0) {
        return -1;
    }
    init_by_array(self, key, Py_ARRAY_LENGTH(key));
    return 0;
}

Details on the entropy sources are given in the following extract of the docstring of the pyurandom function :

Used sources of entropy ordered by preference, preferred source first:

  • BCryptGenRandom() on Windows
  • getrandom() function (ex: Linux and Solaris): call py_getrandom()
  • getentropy() function (ex: OpenBSD): call py_getentropy()
  • /dev/urandom device

The system CSPRNG is used to generate 624 32-bit integers, which is enough to fill the entire state of the MT PRNG.

The rest of the initialisation is performed by the init_by_array function, which will be covered in a separate section.

Seeding using the time + PID

When no better source of entropy is available, the seeding mechanism resorts to using the current time and PID :

static int
random_seed_time_pid(RandomObject *self)
{
    PyTime_t now;
    if (PyTime_Time(&now) < 0) {
        return -1;
    }

    uint32_t key[5];
    key[0] = (uint32_t)(now & 0xffffffffU);
    key[1] = (uint32_t)(now >> 32);

#if defined(MS_WINDOWS) && !defined(MS_WINDOWS_DESKTOP) && !defined(MS_WINDOWS_SYSTEM)
    key[2] = (uint32_t)GetCurrentProcessId();
#elif defined(HAVE_GETPID)
    key[2] = (uint32_t)getpid();
#else
    key[2] = 0;
#endif

    if (PyTime_Monotonic(&now) < 0) {
        return -1;
    }
    key[3] = (uint32_t)(now & 0xffffffffU);
    key[4] = (uint32_t)(now >> 32);

    init_by_array(self, key, Py_ARRAY_LENGTH(key));
    return 0;
}

This process produces four 32-bit integers, which are once again used to initialize the internal state of the MT PRNG, using the init_by_array function.

The important part here is the size of the produced array. Four elements only, meaning at most 128 bits of entropy in the ideal case. However, because the current time and PID can be easily guessed, this is far from the truth.

In this work we are not interested in recovering the seed by bruteforce, so we will ignore this possibility and only focus on the fact that only 4 elements are used.

Seeding using a number

This procedure covers every other possible seeding cases as seen previously.

The following code is responsible for converting the number in an array of 32-bit integers :

/* This algorithm relies on the number being unsigned.
 * So: if the arg is a PyLong, use its absolute value.
 * Otherwise use its hash value, cast to unsigned.
 */
if (PyLong_CheckExact(arg)) {
    n = PyNumber_Absolute(arg);
} else if (PyLong_Check(arg)) {
    /* Calling int.__abs__() prevents calling arg.__abs__(), which might
        return an invalid value. See issue #31478. */
    _randomstate *state = _randomstate_type(Py_TYPE(self));
    n = PyObject_CallOneArg(state->Long___abs__, arg);
}
else {
    Py_hash_t hash = PyObject_Hash(arg);
    if (hash == -1)
        goto Done;
    n = PyLong_FromSize_t((size_t)hash);
}
if (n == NULL)
    goto Done;

/* Now split n into 32-bit chunks, from the right. */
bits = _PyLong_NumBits(n);
if (bits == (size_t)-1 && PyErr_Occurred())
    goto Done;

/* Figure out how many 32-bit chunks this gives us. */
keyused = bits == 0 ? 1 : (bits - 1) / 32 + 1;

/* Convert seed to byte sequence. */
key = (uint32_t *)PyMem_Malloc((size_t)4 * keyused);
if (key == NULL) {
    PyErr_NoMemory();
    goto Done;
}
res = _PyLong_AsByteArray((PyLongObject *)n,
                            (unsigned char *)key, keyused * 4,
                            PY_LITTLE_ENDIAN,
                            0, /* unsigned */
                            1); /* with exceptions */
if (res == -1) {
    goto Done;
}

init_by_array(self, key, keyused);

To summerize, if the seed is a positive integer (of any length), it is split in chunks of 32 bits to create the key array. If the integer is negative, it’s absolute value is used instead. If it’s a floating number, it’s hash value is used instead.

Examples :

random.seed(-0x909192939495969798)
# key = [0x95969798, 0x91929394, 0x90]

random.seed(0x909192939495969798)
# key = [0x95969798, 0x91929394, 0x90]

random.seed(871628728198271972.8621)
# key = [3494417408, 202941877]

random.seed(hash(871628728198271972.8621))
# key = [3494417408, 202941877]

random.seed("my seed")
# key = [3358881470, 2670804673, 4256210531, 4085415016, 2014628229, 2984896648, 2837766904, 534869515, 3396683863, 1732457832, 679200103, 2879791640, 4054140216, 3129555524, 1092970275, 3225866273, 1936024932, 7174432]

The init_by_array function is then called like previously.

Now that we have covered how the key array is constructed from different seed types, it is time to see how the MT initial state I is computed from this variable length array. From this point on, we will refer to this key array as K and its length as k.

Constructing the initial state

The init_by_array function is given below :

/* initialize by an array with array-length */
/* init_key is the array for initializing keys */
/* key_length is its length */
static void
init_by_array(RandomObject *self, uint32_t init_key[], size_t key_length)
{
    size_t i, j, k, l;       /* was signed in the original code. RDH 12/16/2002 */
    uint32_t *mt;

    mt = self->state;
    init_genrand(self, 19650218U);
    i=1; j=0;
    k = (N>key_length ? N : key_length);

    for (; k; k--) {
        mt[i] = (mt[i] ^ ((mt[i-1] ^ (mt[i-1] >> 30)) * 1664525U))
                 + init_key[j] + (uint32_t)j; /* non linear */
        i++; j++;
        if (i>=N) { mt[0] = mt[N-1]; i=1; }
        if (j>=key_length) j=0;
    }

    for (k=N-1; k; k--) {
        mt[i] = (mt[i] ^ ((mt[i-1] ^ (mt[i-1] >> 30)) * 1566083941U))
                 - (uint32_t)i; /* non linear */
        i++;
        if (i>=N) { mt[0] = mt[N-1]; i=1; }
    }

    mt[0] = 0x80000000U; /* MSB is 1; assuring non-zero initial array */
}

It can be broken down in four parts :

  1. Initialization of the initial state I by calling init_genrand with a constant value.
  2. Addition of the elements of K to the previous state, repeating K if necessary.
  3. Mixing I.
  4. Setting $I_0$ to 0x80000000.

For comparison, in PHP, only the first part is performed, with a 32-bit seed. This is the standard seeding procedure of MT19937. However here, a constant seed of 19650218 is used, because the real seeding takes place in step 2.

For the sake of completion, here is the init_genrand function :

init_genrand
/* initializes mt[N] with a seed */
static void
init_genrand(RandomObject *self, uint32_t s)
{
    int mti;
    uint32_t *mt;

    mt = self->state;
    mt[0]= s;
    for (mti=1; mti<N; mti++) {
        mt[mti] =
        (1812433253U * (mt[mti-1] ^ (mt[mti-1] >> 30)) + mti);
        /* See Knuth TAOCP Vol2. 3rd Ed. P.106 for multiplier. */
        /* In the previous versions, MSBs of the seed affect   */
        /* only MSBs of the array mt[].                                */
        /* 2002/01/09 modified by Makoto Matsumoto                     */
    }
    self->index = mti;
    return;
}

Because it is called with a constant seed, we won’t explain how it works and simply consider the state has an initial precomputed value named C. If you are interested in understanding how this function can be inverted, we advise you to read [Amb].

To help us model the operations performed in the second part of init_by_array, we define A = 1664525 and the function g(x) as:

$$ g(x) = x \oplus (x \gg 30) $$

Then the following operations are executed, in this order:

$$ \begin{align} J_i &= (C_i \oplus g(J_{i-1})A) + K_j + j & \forall i \in \left[1, 623 \right], j \equiv i-1 \mod k \\ J_0 &= J_{623} \\ J_1 &= (J_1 \oplus g(J_0)A) + K_j + j & j \equiv 623 \mod k \end{align} $$

All additions are performed on 32-bit integers and $J_1$ is in fact updated twice.

Here J refers to the intermediate initial state after the addition of the elements of K.

We apply the same principle to model the rest of init_by_array. We define B = 1566083941.

Then the following operations are executed, in this order:

$$ \begin{align} I_i &= (J_i \oplus g(I_{i-1})B) - i & \forall i \in \left[2, 623 \right] \\ I_0 &= I_{623} \\ I_1 &= (I_1 \oplus g(I_0)B) - 1 \\ I_0 &= \text{0x80000000} \end{align} $$

The iteration of $(7)$ starts with i = 2 because i = 1 was used in $(6)$.

In the next section we will see how we can leverage this and combine it with the work of [Amb] to recover K, which is equivalent to finding the seed.

Seed recovery from few outputs

From $(4)$ we see that with the knowledge of two consecutive values $J_i$ and $J_{i-1}$ of the intermediate initial state, we can write:

$$ \begin{align} K_j + j &= J_i - (C_i \oplus g(J_{i-1})A) & j \equiv i-1 \mod k \end{align} $$

This only works $\forall i \in \left[3, 623 \right]$ because of $(5)$ and $(6)$.

Assuming we know the length k of K, we can recover the index j and thus have the real value of $K_j$. This assumption is totally reasonable as we have seen that k only depends on the type or length of the seed. Both of which can be easily guessed if unknown.

It is also important to note that we assumed $k <= 624$ in this work, which is usually the case.

Here is the Python code to do that :

def init_genrand(seed):
        MT = [0] * 624
        MT[0] = seed & 0xffffffff
        for i in range(1, 623+1): # loop over each element
            MT[i] = ((0x6c078965 * (MT[i-1] ^ (MT[i-1] >> 30))) + i) & 0xffffffff
        return MT

def recover_kj_from_Ji(ji, ji1, i):
    # ji => J[i]
    # ji1 => J[i-1]
    const = init_genrand(19650218)
    key = ji - (const[i] ^ ((ji1 ^ (ji1 >> 30))*1664525))
    key &= 0xffffffff
    # return K[j] + j
    return key

We have worked our way from the intermediate initial state J to the seed K, now we must find a way to J from the initial state I.

From $(7)$ we see that with the knowledge of two consecutive values $I_i$ and $I_{i-1}$ of the initial state, we can write:

$$ \begin{align} J_i &= (I_i + i) \oplus g(I_{i-1})B \end{align} $$

This only works $\forall i \in \left[3, 623 \right]$ because of $(9)$ and $(10)$.

Here is the Python code to do that :

def recover_Ji_from_Ii(Ii, Ii1, i):
    # Ii => I[i]
    # Ii1 => I[i-1]
    ji = (Ii + i) ^ ((Ii1 ^ (Ii1 >> 30)) * 1566083941)
    ji &= 0xffffffff
    # return J[i]
    return ji

We demonstrated that two consecutive values of I are enough to recover a single value of J. We also showed that two consecutive values of J are enough to recover a value of K. It is thus possible to recover a value of K with only three consecutive values of I.

Diagram

Combining our previous Python functions, this gives the following:

def recover_Kj_from_Ii(Ii, Ii1, Ii2, i):
    # Ii => I[i]
    # Ii1 => I[i-1]
    # Ii2 => I[i-2]
    # Ji => J[i]
    # Ji1 => J[i-1]
    Ji = recover_Ji_from_Ii(Ii, Ii1, i)
    Ji1 = recover_Ji_from_Ii(Ii1, Ii2, i-1)
    return recover_kj_from_Ji(Ji, Ji1, i)

Due to the constraints discussed earlier, this is only possible $\forall i \in \left[4, 623 \right]$.

proof of concept
import random

random.seed(12345)
# K = [12345]
# k = 1

I = random.getstate()[1]

for i in range(4, 624):
    print(i, recover_Kj_from_Ii(I[i], I[i-1], I[i-2], i))

We have seen previously, based on [Amb], that knowing a pair of values ($S_i$, $S_{i+227}$) of the current state is enough to recover the 31 lower bits of $I_{i+228}$ and the MSB of $I_{i+227}$.

Knowing another pair ($S_{i+1}$, $S_{i+228}$) would reveal the 31 lower bits of $I_{i+229}$ and the MSB of $I_{i+228}$. By combining the information gathered from both pairs, we obtain the exact value of $I_{i+228}$.

Add another pair ($S_{i+2}$, $S_{i+229}$) and now we know $I_{i+228}$, $I_{i+229}$ and the 31 lower bits of $I_{i+230}$. Three consecutive values of the initial state, allowing to recover two potential values for $K_j$, depending on the MSB of $I_{i+230}$.

The MSB of $I_{i+230}$ only affects the MSB of $K_j$.

If Python’s PRNG is manually seeded with a 32-bit integer (like it’s the case for PHP) the seed can be recovered from only 6 of the first 624 outputs.

proof of concept
import random

for i in range(10):
    random.seed(i)
    # k = 1
    # K = [i]

    S = [untemper(random.getrandbits(32)) for _ in range(624)]

    I_227_, I_228 = invertStep(S[0], S[227])
    I_228_, I_229 = invertStep(S[1], S[228])
    I_229_, I_230 = invertStep(S[2], S[229])

    I_228 += I_228_
    I_229 += I_229_

    # two possibilities for I_230
    seed1 = recover_Kj_from_Ii(I_230, I_229, I_228, 230)
    seed2 = recover_Kj_from_Ii(I_230+0x80000000, I_229, I_228, 230)
    # only the MSB differs
    print(hex(seed1), hex(seed2))

It is easy to see that this process can be extended to recover the seed even if k > 1, by adding new pairs. Adding just a single pair would allow to recover a 64-bit seed, as four consecutive values of I would be known.

Diagram

As only the MSB of $I_i$ will be missing regardless of how many pairs are used, only $K_J$ will have two possible values. Meaning for any seed length, there will only ever be two possibilities to try out.

The minimal number of outputs required to recover the seed in each scenario is given below :

$$ \begin{array}{cccc} \hline \text{Seed Type} & \text{Additional Info} & k & \text{Number of required outputs} \\ \hline \text{Int} & 32\text{-bit} & 1 & 6 \\ \text{Int} & 64\text{-bit} & 2 & 8 \\ \text{Int} & n\text{-bit} & \frac{n+31}{32} & 2k + 4^\text{*}\\ \text{Float} & & 2 & 8 \\ \text{Bytes} & \text{Version 1} & 2 & 8 \\ \text{Bytes} & \text{Version 2}, n\text{-chars} & \frac{n+3}{4} + 16 & 2(k-16) + 4^\text{*}\\ \text{None} & \text{Using time + PID} & 4 & 12 \\ \text{None} & \text{Using system random} & 624 & 624 \\ \hline \end{array} $$

$^\text{*}$ At least for “small” values of k as after adding a lot of pairs more relations can be made between them, lowering the total number of required ones. For instance, for k = 624, only 624 outputs (the full state) are required. We will briefly discuss this special case in the next section.

From a full state to the seed

Because the state update operation is invertible, knowing a full state (624 outputs) is enough to rewind to the initial state I from any state S.

The following Python code, based on equations $(1)$, $(2)$ and $(3)$, implements the rewind procedure.

def rewindState(state):
    prev = [0]*624
    # copy to not modify input array
    s = state[:]
    I, I0 = invertStep(s[623], s[396])
    prev[623] += I
    # update state 0
    # this does nothing when working with a known full state, but is important we rewinding more than 1 time
    s[0] = (s[0]&0x80000000) + I0
    for i in range(227, 623):
        I, I1 = invertStep(s[i], s[i-227])
        prev[i] += I
        prev[i+1] += I1
    for i in range(227):
        I, I1 = invertStep(s[i], prev[i+397])
        prev[i] += I
        prev[i+1] += I1
    # The LSBs of prev[0] do not matter, they are 0 here
    return prev

From $(1)$, we see that only the MSB of $I_0$ is used during the state update. The other bits are never used, so they do not impact the next state. For this reason, it is not possible to recover the exact value of those bits. The rewinded state will thus be different on the first value only.

proof of concept
import random

I = list(random.getstate()[1][:-1])

S1 = [untemper(random.getrandbits(32)) for _ in range(624)]
S2 = [untemper(random.getrandbits(32)) for _ in range(624)]
S3 = [untemper(random.getrandbits(32)) for _ in range(624)]

# rewind once
I_ = rewindState(S1)
S2_ = rewindState(S3)
S1_ = rewindState(S2)

print(I_ == I)
print(S1_[1:] == S1[1:])
print(S2_[1:] == S2[1:])

# rewind multiple times
I_ = rewindState(rewindState(rewindState(S3)))
print(I_ == I)
print(I_[:5])
print(I[:5])

With the knowledge of the initial state I, it is possible to recover all values of K, effectively recovering the seed in the case k = 624 (the default case).

The following Python code does just that.

def seedArrayFromState(s, subtractIndices=True):
    s_ = [0]*624
    for i in range(623, 2, -1):
        s_[i] = recover_Ji_from_Ii(s[i], s[i-1], i)
    s_[0]=s_[623]
    s_[1]=recover_Ji_from_Ii(s[1], s[623],  1)
    s_[2]=recover_Ji_from_Ii(s[2], s_[1], 2)
    seed = [0]*624
    for i in range(623, 2, -1):
        seed[i-1] = recover_kj_from_Ji(s_[i], s_[i-1], i)
    # system overdefined for seed[0,1,623]
    seed[0] = 0
    # thus s1 = (const[1] ^ ((const[0] ^ (const[0] >> 30))*1664525))
    s1_old = ((2194844435 ^ ((19650218 ^ (19650218 >> 30))*1664525))) & 0xffffffff
    seed[1] = recover_kj_from_Ji(s_[2], s1_old, 2)
    seed[623] = (s_[1] - (s1_old ^ ((s_[0] ^ (s_[0] >> 30))*1664525))) & 0xffffffff
    # subtract the j indices
    if subtractIndices:
        seed = [(2**32+e-i)%2**32 for i,e in enumerate(seed)]
    return seed

From the way the init_by_array function performs the seeding operation, multiple seeds can produce the same initial state I. Here we chose to fix $K_0 = 0$.

proof of concept
import random

def seedArrayToInt(s):
    seed = 0
    for e in s[::-1]:
        seed += e
        seed <<= 32
    return seed >> 32

# random is not manually seeded so it uses 624 random values

S = [untemper(random.getrandbits(32)) for _ in range(624)]
I = rewindState(S)

print("Normal run :")

print(random.getrandbits(32))
print(random.random())
print(random.randbytes(4).hex())
print(random.randrange(1, 100000))

print("\nReseeded run :")

seed_array = seedArrayFromState(I)
seed = seedArrayToInt(seed_array)

# the recovered seed is very big. Too big to be printed in decimal
# print(hex(seed))
random.seed(seed)

S_ = [untemper(random.getrandbits(32)) for _ in range(624)]

assert(S_ == S)

print(random.getrandbits(32))
print(random.random())
print(random.randbytes(4).hex())
print(random.randrange(1, 100000))

Here we assumed that the known 624 consecutive outputs are either directly generated after the seeding operation or are separated from it by a multiple of 624 other outputs. If it was not the case, rewinding would yield a misaligned initial state, which can be realigned, but one value would still be irrecoverable. We did not explore this case further.

Conclusion

We demonstrated that given only a few ouputs it is possible to recover Python’s PRNG seed value, without using any bruteforce. We have also shown that the amount of outputs needed depends on the type of seed used and can range from as few as 6, to all the way to 624 for the default case. In most cases, less than a dozen are needed.

Every function, code snippets and proof of concepts can be found on our GitHub.

This work further improves on the already established attacks against Python’s PRNG by offering a new and improved way to predict future and unveil past outputs. As Python allows to seed the PRNG using bytes, recovering the seed might disclose sensitive data, if it is used as seeding material.