Chapter 5. Obfuscation Theory

Let’s reiterate what you hope to achieve when you create an obfuscated program. You would like to give a user an executable program but prevent him from understanding some part of its construction. For example, you may want to prevent the user from extracting secret keys you have embedded in the program, replacing the keys, understanding the algorithms, extracting specific functionality out of the program, or adding some functionality to the program.

In the previous chapter, you saw a large number of obfuscation techniques. We argued that these obfuscating transformations are useful, saying that the confusion they add to the program increases the complexity of analyzing and understanding it, using the static and dynamic program analysis techniques from Chapter 3. The premise of that chapter was that an attacker has a limited type and number of tools available to him. In our discussions, it was sufficient to show that an obfuscation technique increased the complexity of a program to a point where the attacker’s tools are either too slow, too imprecise, or simply unable to compute the reverse transformation.

For example, consider the Perl program shown in Listing 5.1. At first glance, this appears to be a grotesque ASCII art kitten rather than a legitimate Perl program, let alone one that does something useful. For instance, it does not appear to contain any alphanumerics. A closer examination, however, begins to reveal the tricks the author used. For example, once you note that Perl variable names can use non-alphanumerics and begin with sigils ($, @, and &), the kitten begins to resolve into merely a series of assignments to weirdly named variables. The author uses logical and and or operations to construct alphanumerics out of strings that contain only non-alphanumeric characters. Finally, in the shaded section, he uses Perl’s magic quote characters to call Perl again to decrypt the rest of the program and then eval it.[1]

Problem 5.1

Q1:

Hidden inside the obfuscated program shown in Listing 5.1 is the name of the pirate cat on the cover of this book. Can you work out what his name is?

This is an example of increasing the size of a program to hide its functionality. If you assume an attacker is not able to perform accurate string analysis to recognize the string being built, then this may be an effective obfuscation technique.

Unfortunately, in reality you cannot restrict attackers to a particular set of tools or type of attack for very long. In the above example, an attacker may simply run the program and, using a debugger, step through the code until he arrives at the eval instruction. He can then replace it with a print statement to determine what the original program looked like. In other examples, which you saw in Section 4.3.4, we used opaque predicates and code duplication to increase the size of programs. An attacker can use a weakness in the way we generate the opaque predicates to distinguish them from the predicates that occur naturally in the program.

Simply padding out a program with bogus code that’s really no more than nops doesn’t hinder the attacker for very long. To be useful, the increased complexity must be difficult to reverse, and this requirement is not well captured by how difficult it is to perform a particular static or dynamic program analysis. The metrics we have been using fail to capture how long it will take for an attacker to decipher an obfuscated program, or indeed what it means to have deciphered an obfuscated program.

What you would really like to be able to do is prevent attackers from reversing your obfuscation, irrespective of what tools they have. In order to determine the theoretical limits of obfuscation, you need much more rigorous models for obfuscation and more stringent criteria for assessing the success of an obfuscation technique. Here, you will discover a surprising finding: Whether obfuscation is feasible or not will change depending on the formalization you choose.

For example, under one model suggested by Barak et al. [34], we require that an obfuscated application does not leak any information to an attacker except what is deducible from the program output. Under this model, Barak et al. showed that obfuscation of general programs (and indeed of even large classes of simpler programs) is impossible.

Listing 5.1. Hello World program written in Perl. The code can be de-obfuscated by replacing the magic quotes (') in the shaded section with a print statement.


Does this result make the search for good obfuscating transformations futile? We don’t think so. As you will see, the result of Barak et al. is a very strong and important one. It shows that programs exist that cannot be obfuscated. But it does not necessarily prevent any specific program from being obfuscated! For example, it may still be possible to obfuscate a simple Hello World program, a particular encryption algorithm, or even a specific virtual machine. In this sense, the question of strong obfuscation for particular programs remains an open question. What is more, even if obfuscation in general is impossible, strong obfuscation of a single virtual machine would be sufficient to provide many of the characteristics you require from obfuscation. A given program could be successfully protected by compiling it for this one obfuscated virtual machine.

There have also been some positive formal results under this strict model showing that obfuscation is possible, albeit for a very restricted set of functions [369]. We will examine these in detail.

In the rest of this chapter, we will describe some of the more promising results of what is and what is not possible for theoretically strong obfuscation. We will start in Section 5.1 by revisiting the definition of obfuscation and refining it in terms of the assets being protected. In light of this new definition, in Section 5.2 we will review Turing’s Halting Problem and see how the impossibility of deciding the equivalence of two programs suggests that some type of obfuscation may be possible. In Section 5.3 you’ll see examples of how certain assets in certain types of programs can, in fact, be securely hidden using obfuscation. For example, we will show you how to obfuscate access control, regular expressions, and databases. We will then turn our attention to encryption and cover a scheme for working with encrypted values. We will try to extend our efforts to cover the obfuscation of encryption algorithms; however, we will find that it’s difficult to obfuscate encryption schemes in a provably secure way.

Finally, having successfully protected such a large number of assets, in Section 5.4 we will try to develop a provably secure general-purpose obfuscator and discover that this is not possible. In particular, you will see how to create a program that cannot be obfuscated. In light of the impossibility of constructing a general-purpose obfuscator, in Section 5.5 we will explore possible workarounds. We will conclude the chapter with Section 5.6, in which we discuss the future direction of provably secure obfuscation.

Definitions

A problem that arises when you want to discuss the limits of what is possible and what is impossible is that capturing your intent formally becomes extremely important. As you will see, the limits of obfuscation depend heavily on the assets you are interested in protecting and how much protection you consider sufficient.

Let’s start by reiterating your intent. Your intent is to protect some property of a program by transforming the program into a new one in which that property is harder to deduce. Unfortunately, while you may intuitively understand what this means, when such an intent is expressed in prose it is difficult to reason about. For example, you need to state formally what you mean by “property of a program” and by “harder to deduce.” In formalizing your intent, however, you must make many simplifying assumptions. As you read the remainder of this chapter, we encourage you to question the simplifying assumptions that are necessary and to consider how well they capture your intent—a proof that some obfuscation technique is secure will not be very useful unless the definition of obfuscation agrees with your original intent.

Unlike the previous chapter, here you also have the challenge of defining a very general formal model so that the conclusions you reach remain applicable no matter what new techniques for program analysis are discovered.

Our first requirement is one of correctness. An obfuscated program must behave the same as the original unobfuscated one. This is the same requirement we had on obfuscated programs in Chapter 4.

Definition 5.1 (Correct)

Let I be an input to a program P, drawn from a set of inputs ℐ. An obfuscating transformation T of a program P is correct if and only if:

∀I ∈ ℐ : T(P)(I) = P(I)

In other words, an obfuscating transformation is correct if an obfuscated program still computes the same function as the original program. What about error states? Unlike the previous chapter, from here on we will assume that an error is simply another possible output of P, and we require that an obfuscated program maintain this error.

The second requirement is one of usefulness. An obfuscation must hide something. Let’s define an asset as a feature of a program that you want to hide:

Definition 5.2 (Asset)

An asset, asset(P, ℐ), is a derivable property of a program P and its set of inputs ℐ, such that asset(P, ℐ) = 1.

What kinds of assets are you usually interested in hiding? You may be interested in protecting data such as a secret key embedded in your program, or data encoded by your program, or you may want to hide some functionality such as a regular expression, an algorithm, or some other combination of properties.

For example, if you wished to hide the true length and number of variables in your program, then you would define asset(P, ℐ) = (length(P), count(variables(P))). Sometimes you may only want to prevent an attacker from being able to answer “yes” or “no” questions about your program. For example, you may want the attacker to be unable to determine whether there is a variable called foo in your program. In this case, asset(P, ℐ) is a predicate such as asset(P, ℐ) = foo ∈ variables(P). You may also have more complex assets, such as the fact that P never outputs a particular value x: asset(P, ℐ) = ∀I ∈ ℐ : P(I) ≠ x.

Finally, we can define an obfuscating transformation as follows:

Definition 5.3 (Obfuscating Transformation)

Let P be a program over an input set ℐ and let asset(P, ℐ) be the asset that you want to protect. Let m(P, asset(P, ℐ)) ∈ [0, 1] be a metric that measures the difficulty of computing asset(P, ℐ). Let T(P) be a semantics-preserving program transformation. Then T(P) is an obfuscating transformation of P if

m(T(P), asset(P, ℐ)) > m(P, asset(P, ℐ)).

How do you calculate the value of the metric m, the difficulty of computing asset(P, ℐ)? This, of course, varies depending on the particular asset, and it is not possible to suggest a general method of measuring it. However, you can think of many different approximations of the metric. Suppose you have a program that implements a particular algorithm. The number of instructions in this program or the program’s running time may be good approximations of the metric m—an obfuscated program is often both larger and slower than an unobfuscated one, and it is harder to extract properties from it as a result.

What you really want is for the obfuscating transformation to be strong and difficult to reverse. This is captured in this definition:

Definition 5.4 (Strong Obfuscating Transformation)

An obfuscation T of P is strong if, for some small ε:

m(P, asset(P, ℐ)) / m(T(P), asset(P, ℐ)) < ε.

In other words, a strong obfuscation is one where the cost of computing the property asset(P, ℐ) of an obfuscated program is much larger than the cost of computing the same property on the original program. For example, if m(P, asset(P, ℐ)) = 0.05 on the original program but m(T(P), asset(P, ℐ)) = 0.9 after obfuscation, the ratio is roughly 0.06, comfortably below a threshold such as ε = 0.1.

Can you successfully achieve one or more of these goals? If so, how? And if not, why not? We will now turn our attention to these questions.

Provably Secure Obfuscation: Possible or Impossible?

In the previous chapter, you saw pragmatic reasons why obfuscation might be possible. Even the analysis of simple, everyday programs has proven to be difficult in practice. Programmers working in companies with a large amount of legacy code devote an immense amount of effort to documenting code carefully so that future programmers do not become mired in it. Many software engineering publications deal with measuring, controlling, and reducing the practical complexity of programs so that the time and cost of software development can be reduced. Even in banking applications and other applications where correctness is critical and there are procedures in place to ensure that good development practices are followed, the complexity of software has grown to an extent that makes managing and updating it very difficult.

In light of this practical difficulty, it would seem that the task of the obfuscator must be relatively straightforward. If even well-designed modular systems are difficult to understand in practice, then a deliberately obfuscated program will certainly be more difficult to understand. In the last chapter, we constructed practical obfuscating transformations by taking advantage of constructs that software engineering research has shown require a large amount of effort to understand.

In this section, you will see there are also good theoretical reasons to suspect that provably secure obfuscation is possible. In particular, you will see two results that, taken together, suggest that there are some properties of programs that are computationally infeasible for an adversary to figure out.

First, we’ll revisit the oldest and most fundamental result on the limits of what can be computed by analyzing computer programs. Alan Turing [350] showed that computing whether a program halts—even if the program is unobfuscated—is impossible for some programs. As you will see, as a direct consequence of this result, there are other properties of some programs that also cannot be computed. In terms of our definition of obfuscation, if the asset you are trying to protect is whether or not a program halts, these programs are naturally obfuscated.

In practice, we do not construct obfuscated programs from scratch. Instead we start from a regular program and use an obfuscator to transform the program into an obfuscated one. We will show how this observation leads to Andrew Appel’s conclusion that, in this case, finding the asset we are hiding is not uncomputable for an adversary (as Turing’s result above might suggest), but can be done in nondeterministic polynomial time.

Turing’s Halting Problem

To prove his result, Turing first assumed that there exists a Boolean method HALTS that can determine whether every program halts. HALTS accepts as input a string, P, representing the program to be tested, and an input for P. HALTS may perform arbitrarily complex analysis of P. For example, it may perform static and dynamic analysis—it may even run the program P for some time. However, for it to be well defined, HALTS itself must be a method that halts. In other words, HALTS cannot simply run the program P and wait for it to halt.

Turing then went on to use this method to create a new program whose termination he was able to show could not be decided correctly by HALTS. Since he had started by assuming HALTS was able to determine the termination of all programs, he arrived at a contradiction and so concluded that his original assumption—that such a HALTS method exists—must be false.

In Listing 5.2, we show such a program, COUNTEREXAMPLE, that uses the code of HALTS. The COUNTEREXAMPLE program is very similar in operation to the HALTS method. It also accepts a string representation of a program and its input. If the HALTS method determines that a program P on input x does not halt, then COUNTEREXAMPLE itself halts. On the other hand, if HALTS determines that the program P does halt, then COUNTEREXAMPLE does not halt. While the usefulness of such a program is not immediately obvious, it is clear that you could easily write such a program provided you have the source code for HALTS.

The final step in the proof involves running COUNTEREXAMPLE with itself as input. Here, you run into the contradiction mentioned above. If HALTS determines that COUNTEREXAMPLE does not halt, then COUNTEREXAMPLE halts. On the other hand, if HALTS determines that COUNTEREXAMPLE does halt, then COUNTEREXAMPLE does not halt. In either case, HALTS gives an incorrect result, and you must conclude that the assumption that HALTS exists must be false.

What this shows is that there exist programs for which you cannot determine whether they halt or not. It does not imply that you cannot determine whether all programs halt or not. We know, for example, that any program that contains no loops or function calls is certainly going to terminate. Nor does Turing’s proof imply that it is futile or useless to use analysis and heuristics to determine whether common, everyday programs halt. Some compilers, for example, warn if the program you are authoring contains infinite loops or unreachable statements. They are not able to detect every such instance but are nevertheless able to detect some interesting cases.

The uncomputability of HALTING implies other uncomputable results. For example, it proves EQUIVALENCE is also uncomputable. The EQUIVALENCE problem is to prove that two programs compute the same function. Of course, for some pairs of programs you can solve this problem trivially. For example, given two programs that take no input and print either “Hello World” or “Goodbye,” it’s simple to show that they are different. To prove that the problem is uncomputable, you need to construct a pair of programs that cannot be distinguished. You do this by reducing the HALTING problem to the EQUIVALENCE problem. Specifically, you construct a pair of programs such that, if you were able to distinguish one from the other, you would be able to solve HALTING. But, of course, we have already shown this to be uncomputable. In Listing 5.3, we show one such pair of programs.

Listing 5.2. Halting program detector.

public class COUNTEREXAMPLE {
   static HALTS detector = new HALTS();

   static void dontHalt() {
      while (true) {}
   }

   static void halt() {
      System.exit(0);
   }

   public static void main(String args[]) {
      String program = args[0];
      String input = args[1];
      if (detector.halts(program,input))
         dontHalt();
      else
         halt();
   }
}

In Listing 5.3 (a), we have the original COUNTEREXAMPLE. However, it has an explicit call to test its own termination. In Listing 5.3 (b), we have a second program that is a simple loop that does not halt. An algorithm that is able to distinguish these two programs from each other can also tell that COUNTEREXAMPLE is different from a program that never halts. In other words, it is able to determine that COUNTEREXAMPLE halts. This is something we know is uncomputable, and hence distinguishing these two programs is also uncomputable.

Listing 5.3. Indistinguishable programs.

(a)

public class COUNTEREXAMPLE {
   static HALTS detector = new HALTS();

   static void dontHalt() {
      while (true) {}
   }

   static void halt() {
      System.exit(0);
   }

   public static void main(String args[]) {
      String program = "COUNTEREXAMPLE";
      if (detector.halts(program,
                     "" /* no input */))
         dontHalt();
      else
         halt();
   }
}

(b)

public class INFINITE {
   static void dontHalt() {
      while (true) {}
   }

   public static void main(String args[]) {
      dontHalt();
   }
}

What does Turing’s result have to do with obfuscation? Consider a program that contains a secret private key. This private key is the asset you are trying to protect, in the sense of Definition 5.2. The attacker knows that a key is embedded in the program and wants to know its value. An alternative way to state the attacker’s goal is to say that he has a set of programs, each of which contains a different private key—the input programs—and he wants to identify which program in the set is equivalent to the given one. That is, the attacker needs to solve the EQUIVALENCE problem—something we know in the general case to be uncomputable.

At first glance, what Turing’s result and the uncomputability of EQUIVALENCE seem to be suggesting is that there exist assets for which the attack cannot succeed. In the remainder of this chapter, you will see why this conclusion is premature and misleading, and we will consider the effect of limiting our ambition to protecting particular assets, those embedded in specific subsets of programs, against attackers of restricted capabilities.

Algorithm REAA: De-obfuscating Programs

In the remainder of this chapter we will use the annotation below to describe the goal of each proposed algorithm. Asset refers to the asset that the algorithm seeks to protect, while Input Programs refers to the restricted class from which the program to be obfuscated is chosen. Attacker Limits refers to the constraints we place on the resources and analysis tools available to an attacker. To be as general as possible, in almost all cases we will place no restrictions on the attacker. In this section our goal is to prevent an attacker from being able to distinguish our program from any given set of programs once the program has been obfuscated.

 

Goal

Asset: A unique program from a set
Attacker Limits: None
Input Programs: A set of programs

 

While obfuscation is a program transformation that increases the complexity of a program, de-obfuscation is another program transformation that involves removing this added complexity by analyzing the program. You can informally think of this as the attempt to get your original program back. There are simple transformations that are irreversible, such as removing comments and randomizing variable names, but that hardly hinder an attacker. You can use the formal definition of obfuscation to work around these shortcomings by defining your intent in terms of assets. For example, you can define de-obfuscation in terms of assets as follows:

Definition 5.5 (De-obfuscating Transformation)

Given an obfuscated program P′ and an obfuscating transformation T, a transformation T′(P′, T) is a de-obfuscating transformation of P′ if

m(T′(P′, T), asset(P, ℐ)) < m(P′, asset(P, ℐ)).

In other words, given an obfuscated program and the obfuscation algorithm, de-obfuscation is the attempt to recover the asset from an obfuscated program according to the metric, m. Where did we go wrong when we concluded the uncomputability of EQUIVALENCE indicates that obfuscation might be possible? It was by forgetting about the additional argument that a de-obfuscator has access to. In 2002, Andrew Appel [20] noted that since a de-obfuscator has access to the obfuscating algorithm, the problem is no longer uncomputable. You can see the attacker’s de-obfuscation method in Algorithm 5.1. He need only repeatedly guess an original program, obfuscate it, and see if this results in the same program as the one he is trying to de-obfuscate. Of course, generating, obfuscating, and testing programs in this way may well take a long time, but the process is no longer uncomputable!

To make this more concrete, let’s consider one of the obfuscations you saw in the previous chapter—expanding the size of a program by inserting dead code into it. In order to decide if this transformation is, in fact, an obfuscating one, you first need to select an asset and adopt a metric for measuring a program’s complexity. Here we’re going to adopt the length of the program as the measure of complexity. While it is often the case that longer programs are harder to understand, this may not always be a good metric. For example, here’s a program maxLenA that computes the length of the longest string in an array of strings:

public static int maxLenA(String[] word) {
   word = Arrays.toString(word)
            .replaceAll("(\\w)", "a")
            .replaceAll("[^,a]","")
            .split(",");
   Arrays.sort(word);
   return word[word.length-1].length();
}

And here’s another program maxLenB that computes the same function as maxLenA:

public static int maxLenB(String[] word) {
   int maxSoFar = 0;
   for (int i=0; i < word.length; i++) {
      if (word[i].length() > maxSoFar)
         maxSoFar = word[i].length();
   }
   return maxSoFar;
}

While maxLenA is shorter than maxLenB, it is also considerably harder to understand. Nevertheless, for purposes of demonstration, this simplistic metric is sufficient:

m(P, asset(P, ℐ)) = length(P)

To reverse such an obfuscation would require a de-obfuscator to recognize dead code by completely analyzing the control flow of a program. However, this is uncomputable. To de-obfuscate in nondeterministic polynomial time according to Appel’s algorithm, however, you do not need to perform any static analysis at all. You instead generate all valid programs shorter than length(Obfuscated Program) and apply the dead-code-insertion obfuscation to each until you find one that yields the obfuscated program.

Algorithm 5.1. Overview of algorithm REAA. P is the obfuscated program and F is the obfuscating transformation the obfuscator uses.

DE-OBFUSCATE(P,F):

  1. Guess a source program S.

  2. Guess a key k.

  3. Obfuscate S using transformation F, i.e., P′ ← F(S, k).

  4. if P′ ≠ P repeat from 1.

  5. Output S.
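
To make the search loop concrete, here is a minimal Java sketch of REAA. The Obfuscator interface and the candidate enumerations are assumptions made for illustration; they are not part of Appel’s formulation. In the dead-code example above, candidateSources would enumerate all valid programs shorter than the obfuscated one.

// A brute-force de-obfuscator in the style of REAA.
interface Obfuscator {
   String obfuscate(String source, long key);      // F(S, k)
}

class REAA {
   static String deobfuscate(String p, Obfuscator f,
                             Iterable<String> candidateSources,
                             Iterable<Long> candidateKeys) {
      for (String s : candidateSources)            // 1. guess a source S
         for (long k : candidateKeys)              // 2. guess a key k
            if (f.obfuscate(s, k).equals(p))       // 3-4. P' <- F(S,k); P' = P?
               return s;                           // 5. output S
      return null;                                 // no candidate matched
   }
}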

Provably Secure Obfuscation: It’s Possible (Sometimes)!

In the next section we’ll show you how the most general form of obfuscation is impossible. However, that doesn’t mean that all types of obfuscation are impossible! In this section, we will construct examples of provably secure obfuscation, and we’ll do so by limiting ourselves to certain assets. In Section 5.3.1 we will begin with the simplest case by obfuscating a single input–output relation of a program. We’ll then extend this result to obfuscating multiple input–output relations, and then we’ll use these primitives to obfuscate access control programs and regular expressions.

In Section 5.3.2 you’ll see how it’s possible to obfuscate the relations in a database. In Section 5.3.3, we’ll show you how to perform operations on encrypted values in a provably secure fashion. On a related note, in Section 5.3.4 we describe whitebox cryptography and its use to obfuscate DES encryption. Unlike other algorithms in this chapter, whitebox DES is not provably secure; however, we include a discussion of its workings to show one direction being explored to obfuscate encryption.

In all these algorithms you are severely restricted as to the assets you protect and the operations you can perform on them. When computing with encrypted data, for example, the OBFPP algorithm can only perform additions and multiplications. These restrictions are what we have to live with for the benefit of having provably secure obfuscation.

Algorithm OBFLBS: Obfuscating with Point Functions

 

Goal

Asset: Private data
Attacker Limits: None
Input Programs: Point functions

 

An interesting asset you may wish to hide in a program is a binary check. These are the functions that underlie access control systems that use password protection. For example, to log into a Unix system, a login program must first check that the password a user enters matches the password stored in the system:

boolean isValidPassword(String password) {
   if (password.equals("yellowblue"))
      return true;
   else
      return false;
}

Of course, in practice you would never use such a program because the password is embedded in the clear and anyone with access to the program could reverse engineer it. To prevent access to the login program from revealing the password of users, passwords cannot be stored in cleartext—they must be obfuscated.

Algorithm OBFLBS [233] is the generalization of the commonly used practice of using hashing to securely obfuscate password checking. Instead of embedding a user’s password, only the hash of the password is stored. It is then compared with the hash of an entered password:

boolean isValidPassword_PF(String password) {
   if (sha1(password).equals("642fb031ad5e946766bc9a25f35dc7c2"))
      return true;
   else
      return false;
}

Password functions are in the class of functions called point-functions—they are false everywhere except at a specific point. In the case of password functions, they are false everywhere except where the password is correct.

The password checking function is now obfuscated, because computing the pre-image of a hashing function is a difficult cryptographic problem. Whereas in the previous chapter we made use of the difficulty of program analysis to create obfuscated programs, we now make use of cryptographic problems. If users choose passwords from a large set and an attacker given isValidPassword_PF is able to deduce the user’s password, then he must be able to invert SHA-1. In other words, the complexity of cracking this obfuscation is at least as hard as cracking SHA-1.
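
The listings above assume a sha1 helper that returns a hex-encoded digest (note that a real SHA-1 digest is 40 hex digits, so the 32-digit constant above is only a stand-in). Here is a minimal sketch of such a helper, built on java.security.MessageDigest:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Util {
   // Hex-encoded SHA-1 digest of a string: 40 hex digits in total.
   static String sha1(String s) {
      try {
         MessageDigest md = MessageDigest.getInstance("SHA-1");
         byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
         StringBuilder hex = new StringBuilder();
         for (byte b : digest)
            hex.append(String.format("%02x", b));  // two hex digits per byte
         return hex.toString();
      } catch (NoSuchAlgorithmException e) {
         throw new RuntimeException(e);  // SHA-1 is mandatory on every JVM
      }
   }
}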

Hashing functions have been widely studied in cryptography, and building an obfuscation that we can show to be at least as hard to break as hash functions themselves is significant progress. Nevertheless, recently Wang [363] showed weaknesses in the MD5 hashing function that may allow an attacker to eventually deduce the secret.

What kind of function would you need instead of SHA-1 if you wanted an obfuscation that you could prove was uncrackable? You would need to be able to prove that the function is one way. In other words, given the output of the function, you should not be able to predict the input. Unfortunately, no known function has been proven to have this property. If you had a universally accessible function and a good source of randomness, you could devise such a function:

class RandomHash {
   Hashtable<String,Integer> hashes = new Hashtable<String,Integer>();
   Random random = new Random();   // a good source of randomness

   Integer randomHash(String value) {
      if (hashes.containsKey(value))
         return hashes.get(value);

      Integer newhash = random.nextInt(Integer.MAX_VALUE);
      hashes.put(value, newhash);
      return newhash;
   }
}

In the randomHash function, the first time the hash of an object is requested, a random number from a large set is returned. The return value is also stored. Thereafter, whenever that object hash is requested, the original value is returned.

Many banks use a similar system when creating debit cards. Blank cards with preprinted ID numbers are sent to different branches. When a customer requests a card, one is selected arbitrarily. However, once it is selected, the bank stores the relationship between the customer’s account and the card ID number. In this way, if the card is lost or stolen, a malicious attacker who finds the card will not be able to identify the customer simply from the ID number on the card.

It’s important to note that unless the mapping of a customer’s account to a card ID number is kept secret by the bank, the security of the scheme is compromised. Similarly, if an attacker were able to access memory and view the entire hashes table in RandomHash (as opposed to probing for individual entries), the security of the obfuscation would be lost. This is especially unfortunate because every program that wishes to use the random hashing function needs access to the same instance of the program.

You can visualize such programs as existing on a remote computer in such a way that they can only be probed for output by one particular input at a time. Mathematically, such programs are called random oracles and we will revisit them in the next section.

You can compose finitely many point functions to create multi-point functions that have the property of being false everywhere except at a given set of points. The scheme for obfuscating point functions can be trivially extended to apply to multi-point functions. Here, the original function responds to a fixed set of passwords that are revealed by examining the program:

boolean isValidPassword(String password) {
   if (password.equals("yellowblue"))
      return true;
   else if (password.equals("redgreen"))
      return true;
   else if (password.equals("turquoise"))
      return true;
   else
      return false;
}

And here we’ve obfuscated the function and the passwords are hidden:

boolean isValidPassword_PF(String password) {
   if (sha1(password).equals("ce5a522e817efcac33decd1b667e28c277b5b5f0"))
      return true;
   else if (sha1(password).equals("dca12bfecff189a98a274d7aad43aa7b4ee4eed2"))
      return true;
   else if (sha1(password).equals("095978404280e6c43dcde9db7903f7c72aac109c"))
      return true;
   else
      return false;
}

While it is certainly interesting to be able to obfuscate point functions, with the extension to multi-point functions you can use this style of obfuscation as a basic primitive to achieve strong obfuscation of other functions. For example, Lynn et al. [233] give a series of novel methods to obfuscate finite state automata. As you will see now, this makes it possible to obfuscate hierarchical access control and regular expressions.

Problem 5.2

Q1:

Note that the obfuscated form of the password-checking function reveals the number of passwords recognized by the system. How can you prevent an attacker from being able to deduce the exact number of passwords? Can you prevent the attacker from being able to determine the maximum number of passwords recognized?

Obfuscating access control

We will describe Lynn’s construction in terms of building a program to enforce an access control policy. An access control policy is more general than simple password access to a file; it expresses a set of properties of files and users who are allowed to access them.

The need to obfuscate such access control often arises in practice because the program that controls access may itself be running on a machine that is not completely trusted. For example, you may want to give access to specific sub-trees of your file system to particular users. In other words, you wish to create a file system that gives a user with access to a node in the tree access to all descendants of that node as well.

Such a system could be implemented by a program that queried the user for her credentials as she traversed the file system, and revealed the secrets required for her to access particular directories and files. To prevent the secrets embedded in the program from being revealed, the program must be obfuscated.

The problem (and indeed the solution) is similar to a more contrived scenario that often appears in fantasy fiction. A hero sets out on a quest and is forced to find his way through a maze by answering tricky riddles at each fork in the road. The answer to a riddle at each fork (in addition to revealing his virtues), makes it possible to solve future riddles and reveals which fork the hero should take next. Even when the hero is accompanied by a wise wizard, the wizard frustrates the audience by waiting to reveal the secret that the hero needs in order to solve each riddle until the very last moment. This plot device works to keep us in suspense by not revealing too early the entire maze or the route the hero should take.

You can model both of these scenarios as a graph traversal problem, as shown in Figure 5.1. Each edge in the graph is controlled by a password, and each node in the graph stores (in addition to the secret of interest and an ordered set of neighbors) a long random string called the key. Here are the data structures you need to set up the problem in this way:

class Graph {
  Node[] nodeList;
  Edge[] edgeList;
}
class Node {
  Node[] neighbors;
  String secret;
  Key key;
}
class Edge {
  Node source;
  Node sink;
  String password;
}
class Key {            // String is final in Java and cannot be extended,
  String value;        // so Key wraps a string instead
}

Figure 5.1. Obfuscating access control. In (a) we show a maze a hero must traverse by answering a riddle at each junction. A correct answer reveals the correct junction to take. In (b) we show an equivalent file system that a user wishes to traverse. At each node, the user’s credentials are used to determine if he is allowed access.

Akin to the fantasy hero on his quest, to traverse this graph from a starting node u to its ith neighbor v, the user must furnish the key for node u and the password for the ith outgoing edge from u. The correct key proves to the node that the user is allowed to access that node, and the correct password indicates that the user is allowed to traverse the edge. On traversing the edge, the user is given the next node and its key:

Tuple<Node,Key> nextNode (Node current, Key key,
                          int whichEdge, String edgePassword) {
   if (current.getKey().equals(key)) {
      Node next = current.getNthNeighbor(whichEdge);
      if (current.getNthEdge(whichEdge).getPassword().equals(edgePassword))
         return new Tuple<Node,Key>(next, next.getKey());
   }
   return null;   // wrong key or password
}

So far, all you have done is turn the original global access control problem into a sequence of looking up local data at each node in a path. How does this help you obfuscate the result? Note that the traversal at each local node can be expressed by two multi-point functions—one that tests the key of the current node and another that tests the password to get to the next node. Both of these functions are point functions, which you already know how to obfuscate using hash functions.
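
As an illustration, here is one hedged sketch of the obfuscated local check, storing only digests of the node key and edge passwords and reusing the sha1 helper sketched earlier. The field and method names are our assumptions; this is not the exact construction of Lynn et al. [233]:

class ObfNode {
   String keyHash;          // sha1 of this node's key
   String[] edgePassHash;   // sha1 of each outgoing edge's password
   ObfNode[] neighbors;     // ordered neighbor list

   // Returns the next node only if both point-function checks succeed.
   ObfNode nextNode(String key, int whichEdge, String edgePassword) {
      if (Util.sha1(key).equals(keyHash)
            && Util.sha1(edgePassword).equals(edgePassHash[whichEdge]))
         return neighbors[whichEdge];
      return null;          // wrong key or password: no progress
   }
}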

How does a user acquire the first secret before she can begin traversing the graph? You simply make the first secret in the first node a well-known one. For example, you can choose the empty string as the key for the first node. Of course, to traverse past the first node will require the user to know the passwords associated with the outgoing edges from the first node.

Obfuscating Regular Expressions

A regular expression is a pattern that describes a set of strings recognizable by a finite automaton. The pattern consists of concatenation (ab matches the character a followed by the character b), alternation (a |b can match either a or b), and quantification (a* matches a sequence of zero or more instances of a).

You have probably used a regular expression when you wanted to succinctly express a large set of strings without explicitly enumerating every element of the set. In fact, regular expressions can be used to denote infinite sets that you simply would not be able to enumerate.

Here is the finite automaton that corresponds to the regular expression ^b(a|i|u)d$:

[Figure: finite automaton for ^b(a|i|u)d$]

This regular expression recognizes any string that begins with the letter "b", followed by either the letter "a", "i", or "u", and terminated by the letter "d". In other words, it recognizes the following strings: {"bad", "bid", "bud"}.

Consider the slightly more complicated regular expression ^b(an)+d?$. Here’s its corresponding automaton:

[Figure: finite automaton for ^b(an)+d?$]

This regular expression recognizes all strings that start with the letter "b", followed by one or more occurrences of "an", and possibly a final "d". In other words, it describes the infinite set of strings {"ban", "band", "banan", "banand", ...}.

Can you obfuscate the set of strings recognizable by a regular expression? In other words, can you produce a program that will recognize the same set of strings as a given regular expression but not reveal this set?

If your regular expression recognized only a finite number of strings (for example, ^b(a|i|u)d$), the answer is clearly yes! This is because you can enumerate all strings you wish to recognize and then apply the principles in OBFLBS to obfuscate the multi-point function that results. For example, here’s an obfuscated Java program that recognizes the finite regular expression ^b(a|i|u)d$:

boolean matches(String test) {
   if (sha1(test).equals("e9b396d2dddffdb373bf2c6ad073696aa25b4f68"))
      return true;
   if (sha1(test).equals("5cbc865525cc90e273dfcd36e478803fdc604d11"))
      return true;
   if (sha1(test).equals("5c9ea593f4a6291130e817b564c7861d7c6a1154"))
      return true;
   return false;
}

This program is constructed by enumerating the complete set of strings recognized by a regular expression (in our example, "bad", "bid" and "bud") and by embedding in the program a one-way hash of each string. To reverse engineer the language, an attacker would have to reverse the one-way hash.

Problem 5.3

Q1:

A regular language is usually defined over a given alphabet. An attacker can determine the language recognized by querying the obfuscated regular expression with randomly generated strings over the alphabet. However, this is very slow.

What combination of tricks prevents an attacker who tests each character of the alphabet at each node from determining the specific set of characters being matched at that node?

Unfortunately, as the size of the language recognized by a regular expression grows to infinity, obfuscating regular expressions with such an approach becomes infeasible.

As you saw, the language recognized by a regular expression can be represented as a finite automaton. By combining multi-point functions and a finite automaton representation of regular expressions, you can construct provably obfuscated regular expressions much like those we devised for an obfuscated access control system. The key idea is to replace the labels on the finite state machine with a keyed one-way hash. You also need to hide the structure of the finite state automaton. A simple way to do this is to introduce new paths between every pair of nodes with a keyed hash that no string maps to, like this:

[Figure: the automaton above, with keyed-hash labels and additional ⊥-labeled arcs between previously unconnected states]

To build such an automaton, the labels on the transition arcs are first replaced with keyed hashes. The new transition arcs labeled ⊥ connect previously unconnected states using labels that nothing hashes to. In addition, new states can be added to the finite state automaton whose incoming transitions are never taken. These additional arcs and states hide the true structure of the automaton.
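
Here is a small sketch of how the resulting automaton could be evaluated, again reusing the sha1 helper. Each transition is stored as a map from sha1(stateKey + inputCharacter) to the next state’s key; this representation is our own simplification of the idea, not the construction from [233]:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ObfAutomaton {
   Map<String,String> delta =               // sha1(stateKey + c) -> next key
      new HashMap<String,String>();
   Set<String> accepting =                  // sha1 of each accepting state's key
      new HashSet<String>();
   String startKey;                         // key of the start state

   boolean matches(String input) {
      String key = startKey;
      for (char c : input.toCharArray()) {
         key = delta.get(Util.sha1(key + c));  // follow the keyed arc, if any
         if (key == null)
            return false;                      // no arc: the string is rejected
      }
      return accepting.contains(Util.sha1(key));
   }
}

The camouflage arcs labeled ⊥ sit in delta under digests that no real (state key, character) pair produces, so they are never followed, yet an attacker inspecting the table cannot tell them apart from genuine transitions.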

Algorithm OBFNS: Obfuscating Databases

 

Goal

Asset: Private data
Attacker Limits: None
Input Programs: Arithmetic and relational operations on data

 

You may want to extend your newfound ability to obfuscate multi-point functions to obfuscate arbitrary databases. Algorithm OBFNS [267] (see Algorithm 5.2 for details) gives a method for extending obfuscated multi-point functions to perform many of the functions you want from a database. The algorithm starts by building a database that is an obfuscated lookup function. For example, to access a single record in a database you would use a point function that takes as input the field you are querying and a database, and outputs the particular record. Algorithm OBFNS takes advantage of point-function obfuscation by encrypting the data attributes with a key derived from a hash of the query attributes. For example, here’s a phone book containing an association between names and phone numbers:

Algorithm 5.2. Overview of algorithm OBFNS.

OBFUSCATEDATABASE(DB):

  1. Create a new obfuscated database DB′

  2. For each record consisting of (key, value) in DB:

    1. Generate two random numbers r1 and r2.

    2. Create an obfuscated key by generating a keyed hash of key with key r1.

    3. Create an obfuscated value by computing the xor of value and the keyed hash of key with key r2.

    4. Store the obfuscated key and value in DB′.

Name       Phone
Alice      555-1000
Bob        555-3124
Charles    555-1516
David      555-9986
Einstein   555-7764

To look up a phone number for Alice, it is possible to search the database to find an entry with a name field of "Alice" and to return the corresponding phone field. Here’s the same phone book database obfuscated with Algorithm OBFNS:

[Table: the same phone book in obfuscated form; each row <name, number> is replaced by <hash(concat(r1, name)), hash(concat(r2, name)) ⊕ number, r1, r2> for fresh random values r1 and r2]

For each entry <name, number> in the original phone book, a new entry is constructed in the obfuscated phone book:

<hash(concat(r1, name)), hash(concat(r2, name)) ⊕ number, r1, r2>.

The values r1 and r2 are random numbers that are stored in two new columns of the database. For the purposes of clarity, we have not computed the hash of the values shown. In practice, you would use any common non-invertible hash function such as those used in cryptography. To look up Alice in the obfuscated database, you would search through the database for records in which the first value is equal to hash(concat(r1, "Alice")). Once you have found such a record, you would use the corresponding value of r2 to compute the key, hash(concat(r2, name)). Finally, you retrieve the value of the phone number by xoring the obfuscated phone number with this key.

As you can see from the implementation in Listing 5.4, retrieving values from an obfuscated database is considerably more computationally expensive than performing the same action on an unobfuscated one. However, as a result of such a construction, the only computationally feasible way to retrieve the values from the database is to provide the corresponding query attributes. In the example above, you cannot easily retrieve the phone number of a person unless you know her name. In particular, it can be very computationally expensive to retrieve all records from a database.

Of course, the computational cost depends on the size and distribution of possible values in each field. In the example above, cracking the database may in practice not be difficult if the attacker has a regular telephone directory with a list of common names.

Problem 5.4

Q1:

Design and implement an obfuscated credit card database. The database should contain the name, password, and credit card numbers of customers. It should only be possible to retrieve a credit card number from the database given both the username and the corresponding password. What security problems would be faced if such an obfuscated database was employed?

Algorithm OBFPP: Homomorphic Encryption

 

Goal

Asset: Private data
Attacker Limits: None
Input Programs: Addition and multiplication operations on data

 

Listing 5.4. An obfuscated database.

public class ObfDB {
   static Vector<Record> db = new Vector<Record>();

   public static void put(String key,String value) {
      long rnd1 = Util.getRandomLong();
      long rnd2 = Util.getRandomLong();
      String obf_key = Util.sha1(rnd1 + key);
      String obf_value = Util.xor(value, Util.sha1(rnd2 + key));
      db.add(new Record(obf_key, obf_value, rnd1, rnd2));
   }

   public static String get(String key) {
      for (Enumeration<Record> e = db.elements(); e.hasMoreElements() ;) {
         Record r = e.nextElement();
         String obf_key = Util.sha1("" + r.r1 + key);
         if (r.name.equals(obf_key))
            return Util.xor(r.phone, Util.sha1("" + r.r2 + key));
      }
      return null;
   }

   public static void main(String args[]) {
      ObfDB o = new ObfDB();
      String[] phoneNames =   {"Alice", "Bob", "Charles" ,
                               "David", "Einstein" };
      String[] phoneNumbers = {"555-1000", "555-3124", "555-1516",
                               "555-9986", "555-7764"};
      for (int i=0; i < phoneNames.length; i++)
         put (phoneNames[i], phoneNumbers[i]);

      for (int i=0; i < phoneNames.length; i++) {
         System.out.print (phoneNames[i] + " " );
         System.out.println (get(phoneNames[i]));
      }
   }
}

class Record {
   public Record(String name,String phone,long r1,long r2) {
      this.name = name;
      this.phone = phone;
      this.r1 = r1;
      this.r2 = r2;
   }

   String name;
   String phone;
   long r1;
   long r2;
}

In practice, it is not sufficient to be able to merely store and extract obfuscated data from a database on a hostile host. More often you would like to be able to perform a computation with the values you extract and store the results back into the database. Can such an operation be performed on a hostile host?

For example, suppose you have a database of values that are encrypted. Are there operations you can perform on this database of values without first decrypting the values? Is it possible to find the product of all the values in the database? The average? Is it possible to compute the maximum value? Are there limits to the computations that can be performed?

The answers to these questions come from a branch of cryptography called homomorphic encryption. Homomorphic encryption is a type of encryption where you perform an operation on a value by performing a (possibly different) operation on its encryption. Many existing public key encryption schemes, such as RSA, already exhibit such a property for some operations.

You will recall that in RSA, Alice computes public and private keys as follows. She selects two large prime numbers p and q and computes n = pq. She also computes the totient of n, φ(n) = (p − 1)(q − 1). Alice chooses an integer e such that gcd(e, φ(n)) = 1. Finally, she selects an integer d that is the inverse of e modulo φ(n). Now Alice’s public key is (n, e) and her private key is d.

To send Alice an encrypted number m, you compute E(m) = m^e mod n. Alice can recover the message by computing E(m)^d mod n, which by Euler’s theorem is equal to m. Of course, if an attacker is able to factor n, he will be able to deduce d and decrypt the message as well. The security of the scheme relies on the fact that factoring products of large primes is computationally difficult.

Now let’s assume Alice encrypts two new numbers x and y with her public key to get encrypted values E(x) and E(y). Given access only to her public key, can you perform any operations on x and y by manipulating only E(x) and E(y)? Surprisingly, the answer is yes!

If you multiply E(x) and E(y), you get E(x) × E(y) = (x^e × y^e) mod n = (x × y)^e mod n = E(x × y mod n). This is simply the product x × y, encrypted under Alice’s public key! In other words, if you have a scenario where you need a remote host to multiply numbers without the host learning the specific numbers involved, you can encrypt the numbers using RSA, give them to the remote host to multiply, and finally decrypt the result to determine the product. Such an ability, while interesting, is not powerful enough to be useful. In particular, it is unclear how you can extend the scheme to allow addition and division.
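
Here is a small demonstration of this multiplicative property using java.math.BigInteger. The textbook-sized primes are for illustration only; real keys use primes of hundreds of digits:

import java.math.BigInteger;

class RsaHomomorphism {
   public static void main(String[] args) {
      BigInteger p = BigInteger.valueOf(61), q = BigInteger.valueOf(53);
      BigInteger n = p.multiply(q);                            // n = 3233
      BigInteger phi = p.subtract(BigInteger.ONE)
                        .multiply(q.subtract(BigInteger.ONE)); // phi(n) = 3120
      BigInteger e = BigInteger.valueOf(17);                   // gcd(e, phi(n)) = 1
      BigInteger d = e.modInverse(phi);                        // private key

      BigInteger x = BigInteger.valueOf(12), y = BigInteger.valueOf(34);
      BigInteger exy = x.modPow(e, n)                          // E(x)
                        .multiply(y.modPow(e, n))              // times E(y)
                        .mod(n);                               // = E(x*y)
      System.out.println(exy.modPow(d, n));                    // prints 408 = 12 * 34
   }
}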

Algorithm OBFPP [279] (see Algorithm 5.3) is another asymmetric encryption scheme, similar to RSA. A number z is said to be an nth residue modulo n^2 if there exists a number y such that z = y^n mod n^2. Computing nth residuosity is thought to be computationally hard, and this hardness is the basis of the Paillier encryption scheme.

Given a public key of (n, g) and a message m, the OBFPP scheme has several attractive homomorphic properties:

  1. D(E(m1) × E(m2) mod n^2) = m1 + m2 mod n.

  2. D(E(m1)^k mod n^2) = k·m1 mod n.

  3. D(E(m1) × g^m2 mod n^2) = m1 + m2 mod n.

  4. D(E(m1)^m2 mod n^2) = D(E(m2)^m1 mod n^2) = m1·m2 mod n.

This means that several types of addition and multiplication operations useful for secret voting protocols, watermarking, and secret sharing schemes can be carried out on the encrypted values. A second thing that makes OBFPP particularly useful is that encryption is randomized via the r parameter (see Algorithm 5.3). In other words, even if the same message m was encrypted twice using the same public key (n, g), the results would be probabilistically different.

Suppose you wish to set up a voting system for three candidates, Alice, Bob, and Charles. A voter casts a single vote for their chosen candidate by casting an n-tuple consisting of a one for the candidate of their choice and n − 1 zeros for the remaining candidates. At the end of the election, the election officials tally up the votes and declare the candidate with the most votes the winner. Have a look at this example:

 

           Alice   Bob   Charles
Voter 1      0      0       1
Voter 2      1      0       0
Voter 3      0      1       0
Voter 4      1      0       0
Total        2      1       1

Using OBFPP, you could design a system that increases voter privacy. Votes are submitted by the voter, encrypted using the voting system’s public key votepub, and are never directly decrypted by the system, as shown here:

 

           Alice           Bob             Charles
Voter 1    E(0, votepub)   E(0, votepub)   E(1, votepub)
Voter 2    E(1, votepub)   E(0, votepub)   E(0, votepub)
Voter 3    E(0, votepub)   E(1, votepub)   E(0, votepub)
Voter 4    E(1, votepub)   E(0, votepub)   E(0, votepub)
Total      Alicetotal      Bobtotal        Charlestotal

Each voter encrypts their vote using a public key before casting their ballot. The randomized encryption scheme reduces the likelihood of two votes encrypting to the same encrypted value and thus revealing that two voters are voting for the same candidate.

The voting officials hold the private key, votepriv. They compute the total votes in their encrypted form by taking advantage of Property (1) above: all votes for each candidate are multiplied together in encrypted form, which yields an encryption of their sum, and only the totals are decrypted to determine the winner. Once decrypted, a further check is performed to ensure that the total number of votes cast is equal to the number of voters.

Problem 5.5

Q1:

In the voting scheme described above, there is nothing that prevents a voter from casting malformed votes, such as multiple votes for the same candidate and negative votes for a competing candidate. How would a voting system address such problems?

Problem 5.6

Q1:

What prevents corrupt election officials from decrypting an individual voter’s vote and thus destroying their privacy? What changes could you make to the scheme to reduce this likelihood? For example, can you devise a scheme that would require a collusion of a large number of election officials to subvert a voter’s privacy?

Algorithm 5.3. Overview of algorithm OBFPP. m is the message to be encrypted and c is the encrypted message.

GENERATEKEY():

ENCRYPT(m,(n, g)):

  1. Choose two large random prime numbers, p and q.

  2. Compute n = pq and λ = lcm(p 1, q – 1).

  3. Select a random integer g where Overview of algorithm OBFPP. m is the message to be encrypted and c is the encrypted message..

  4. Check that n divides the order of g by checking the existence of the following modular multiplicative inverse:

    μ = (L(gλ mod n2))–1 mod n

    where function L is defined as Overview of algorithm OBFPP. m is the message to be encrypted and c is the encrypted message..

  5. The public encryption key is (n, g) and the private decryption key is (λ, μ).

  1. Let Overview of algorithm OBFPP. m is the message to be encrypted and c is the encrypted message. be the message.

  2. Select a random number Overview of algorithm OBFPP. m is the message to be encrypted and c is the encrypted message..

  3. Compute the ciphertext c as

    c = gm · r n mod n2.

DECRYPT(c,(λ, μ)):

  1. Let Overview of algorithm OBFPP. m is the message to be encrypted and c is the encrypted message. be the ciphertext.

  2. Compute the message m as

    m = L(cλ mod n2) · μ mod n.
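To make the scheme concrete, here is a minimal Java sketch of GENERATEKEY, ENCRYPT, and DECRYPT. It is ours, not part of OBFPP as published: the key size is far too small for real use, there is no input validation, and it uses the common simplification g = n + 1, which satisfies the order condition of step 3. The main method checks Property (1) by tallying two encrypted votes without ever decrypting them individually.

import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierSketch {
   static final SecureRandom rnd = new SecureRandom();
   BigInteger n, nsq, g, lambda, mu;

   PaillierSketch(int bits) {                        // GENERATEKEY
      BigInteger p = BigInteger.probablePrime(bits, rnd);
      BigInteger q = BigInteger.probablePrime(bits, rnd);
      n = p.multiply(q);
      nsq = n.multiply(n);
      BigInteger p1 = p.subtract(BigInteger.ONE);
      BigInteger q1 = q.subtract(BigInteger.ONE);
      lambda = p1.multiply(q1).divide(p1.gcd(q1));   // lcm(p-1, q-1)
      g = n.add(BigInteger.ONE);                     // a standard valid choice of g
      mu = L(g.modPow(lambda, nsq)).modInverse(n);
   }

   BigInteger L(BigInteger u) { return u.subtract(BigInteger.ONE).divide(n); }

   BigInteger encrypt(BigInteger m) {                // c = g^m * r^n mod n^2
      BigInteger r;
      do { r = new BigInteger(n.bitLength(), rnd); }
      while (r.signum() == 0 || r.compareTo(n) >= 0
             || !r.gcd(n).equals(BigInteger.ONE));
      return g.modPow(m, nsq).multiply(r.modPow(n, nsq)).mod(nsq);
   }

   BigInteger decrypt(BigInteger c) {                // m = L(c^lambda mod n^2) * mu mod n
      return L(c.modPow(lambda, nsq)).multiply(mu).mod(n);
   }

   public static void main(String[] args) {
      PaillierSketch keys = new PaillierSketch(256);
      BigInteger v1 = keys.encrypt(BigInteger.ONE);  // one vote for Alice
      BigInteger v2 = keys.encrypt(BigInteger.ONE);  // another vote for Alice
      BigInteger tally = v1.multiply(v2).mod(keys.nsq);
      System.out.println(keys.decrypt(tally));       // prints 2, by Property (1)
   }
}

Note that encrypting BigInteger.ONE twice yields two different ciphertexts, which is exactly the randomization the voting scheme depends on.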

Algorithm OBFCEJO: Whitebox DES

 

Goal

Asset: Private data

 

Attacker Limits: Limited cryptanalysis

 

Input Programs: DES Encryption Programs

 

While being able to manipulate encrypted values is extremely useful, an even more common need is to surreptitiously decrypt data. For example, you may have a DRM application, such as a media player, whose video content you want to keep encrypted until the moment it is displayed on the screen. This means that the decryption key must be embedded in the player, where it is vulnerable to attacks using static analysis, memory inspection, debuggers, and other tools. This scenario often arises in the case of set-top boxes used by TV stations. The TV company, after verifying that the user has paid for the content, sends out an encrypted video stream that is decoded using a secret key stored in the set-top box.

The set-top boxes themselves can help slow down an attacker. They use physical security to prevent the attacker from opening the device and tampering with it. Unfortunately, physical set-top boxes are expensive to build and distribute. They can also be very expensive to update once a key has been compromised.

Can obfuscation allow you to keep a key hidden purely in software while allowing you to perform a decryption? The study of this and related problems is called whitebox cryptography. The goal is to implement cryptographic primitives in such a way that even complete access to the implementation and execution does not reveal the cryptographic key. Here is an outline of how a whitebox cryptographic scheme would work:

WHITEBOXCRYPTOGRAPHY:

  1. Pick a cryptographic scheme with encryptor Ek(m) and decryptor Dk(m), where k is a symmetric encryption/decryption key and m is the message.

  2. Select a secret key: skey.

  3. Create a whiteboxed decryptor Dwb(m) by folding skey into D, i.e., let Dwb(m) = Dskey(m).

  4. Obfuscate Dwb(m) so that skey is securely hidden and inseparable from the decryptor code.

Notice how Dwb(m) has only one parameter while Dk(m) has two! Essentially, Dwb(m) is a version of Dk(m) that has been specialized by providing the key, a constant bitstring.
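The specialization step is easy to see in code. The sketch below is ours and uses a toy XOR cipher rather than DES or AES, purely to illustrate folding skey into D. Nothing here hides the key yet; that is the job of the obfuscation in step 4.

public class WhiteboxToy {
   // Dk(m): a generic decryptor taking both the key and the ciphertext.
   // The "cipher" is XOR with a repeating key, chosen only for brevity.
   static byte[] decrypt(byte[] key, byte[] c) {
      byte[] m = new byte[c.length];
      for (int i = 0; i < c.length; i++)
         m[i] = (byte)(c[i] ^ key[i % key.length]);
      return m;
   }

   // Dwb(m): the same routine specialized by a constant key. A whitebox
   // obfuscator would now transform this method (and its constant) so the
   // key can no longer be read out or separated from the code.
   static final byte[] SKEY = { 0x13, 0x37, 0x42, 0x24 };  // hypothetical skey
   static byte[] decryptWb(byte[] c) {
      return decrypt(SKEY, c);
   }
}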

Unfortunately no provably secure whitebox encryption schemes have been invented. The most promising techniques target existing encryption schemes and take advantage of particular features of the chosen scheme to hide the key. For example, Chow et al. [65,66] proposed an obfuscation technique to perform AES and DES decryption by merging the key with the permutations that are performed and the substitution tables that are used. There is no intrinsically difficult computational problem that underlies this technique. Instead it tries to take advantage of the difficulty of analyzing nonlinear transformations. Nevertheless, in the absence of a reduction to a known difficult problem it is unlikely that they will resist a concerted cryptanalysis. Indeed in practice, all proposed whitebox DES and AES implementations have been broken.

Whitebox cryptographic schemes that have been proposed in the literature have been based on obfuscations of well-known symmetric algorithms such as DES and AES. It is interesting to note that in practice, however, companies that sell whitebox cryptographic solutions tend to roll their own proprietary algorithms. While this is anathema in “normal” cryptography, it makes perfect sense in the whitebox scenario. The strength of a whitebox algorithm lies not in the strength of the underlying cryptographic primitives on which it is based, but rather in how well it hides the key. Therefore, it is essential to select, or invent, decryption algorithms that are easy to whitebox, i.e., that are amenable to obfuscation.

With this in mind, we will not delve into the complete details of the OBFCEJO algorithm but rather describe some of the techniques that were used to obfuscate whitebox DES, with the hope that they inspire your own ideas and show you pitfalls to avoid.

Traditional DES

Before looking at whitebox DES, let’s remember some of the details of traditional DES. The Data Encryption Standard (DES) was proposed by IBM in response to a solicitation for proposals by a U.S. standards body, the National Bureau of Standards (now called National Institute of Standards and Technology). What was sought by the NBS was a cipher for use by the United States government. DES was adopted in 1977 and has been in use with various improvements since that time. While it was superseded by Advanced Encryption Standard (AES) in 2002, a stronger variation of DES called Triple-DES (3DES) continues to be used widely.

DES is a block cipher system. It takes a fixed number of plaintext bits (64 bits) and transforms them using permutations and substitutions into ciphertext bits. The algorithm uses a 64-bit key to customize this transformation and performs this set of permutations and substitutions 16 times. The result is that each 64-bit block is permuted in one of 2^64 different ways.

The first step in performing DES is to encode the key. Suppose your key K is 5368776574686121 in hexadecimal. In binary, this gives you:

K = 0101001101101000011101110110010101110100011010000110000100100001

The DES algorithm then permutes the key according to this permutation table:

57  49  41  33  25  17   9
 1  58  50  42  34  26  18
10   2  59  51  43  35  27
19  11   3  60  52  44  36
63  55  47  39  31  23  15
 7  62  54  46  38  30  22
14   6  61  53  45  37  29
21  13   5  28  20  12   4

In other words, the 57th bit of the key becomes the 1st bit of the permuted key, the 49th bit becomes the 2nd bit, the 41st bit becomes the 3rd bit, and so on. Notice that there are only 56 entries in the table. As a result of the permutation, 8 bits of the original key (the parity bits) are dropped by the DES algorithm, giving an effective 56-bit key.

Thus, from the original key K, you get a 56-bit permuted K′:

K′ = 00000000011111111111111000010000010100011100001000100101

As shown in Figure 5.2, the plaintext to encrypt is divided into blocks of 64 bits each. DES first permutes each 64-bit block according to a special table, then divides the result into a left (L0) and a right (R0) half, each 32 bits long. The new left half (L1) is simply the old right half (R0). The right half (R0) is scrambled under a round key: a subkey derived from K′ (whose halves are rotated left by one or two bits, depending on the round) is combined with R0 through expansion, substitution, and permutation. An exclusive-or is then performed between this result and the left half (L0) to produce the new right half (R1). After 16 rounds, the resulting left and right halves go through a final permutation to produce the ciphertext.


Figure 5.2. DES transforms 64-bit blocks using an initial permutation, 16 “rounds” and a final permutation. In each round, an alternate half of the input is scrambled with a portion of the key before being recombined with the remaining half.
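In code, each round reduces to two assignments: L1 = R0 and R1 = L0 ⊕ f(R0, k). The sketch below is ours, not DES itself: f is a placeholder for DES's expansion, S-box substitution, and permutation of the half-block, and each half is modeled as a 32-bit int.

public class FeistelRound {
   // A stand-in for the DES f function; real DES expands the half-block,
   // mixes in the round key, runs it through the S-boxes, and permutes it.
   static int f(int halfBlock, int roundKey) {
      return Integer.rotateLeft(halfBlock ^ roundKey, 3);
   }

   // One round: the new left half is the old right half; the new right
   // half is the old left half XORed with the scrambled right half.
   static int[] round(int left, int right, int roundKey) {
      return new int[] { right, left ^ f(right, roundKey) };
   }

   // Decryption applies the rounds with the keys in reverse order; each
   // round undoes itself because XOR is self-inverse.
   static int[] invRound(int newLeft, int newRight, int roundKey) {
      return new int[] { newRight ^ f(newLeft, roundKey), newLeft };
   }
}

Whatever f computes, the round is invertible; this is the property that lets DES use the same structure for both encryption and decryption.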

Obfuscating DES

As you can see, DES consists of substitutions, permutations, and xor operations. Individually these operations are very simple and difficult to obfuscate directly. The strategy proposed by Chow et al. [65,66] involves transforming these operations into a randomized, key-dependent network of lookup tables.

A lookup table, like an oracle, is simply a function that is defined by explicitly stating all its input–output values. Since any finite function (including the linear and non-linear transformations used by DES) can be expressed as a lookup table, they are ideal for obscuring the differences between substitution, permutation, and xor operations. For example, have a look at the following two lookup tables that define two functions, A and B. A is a fixed substitution operation on inputs of two bits and B is an xor function:

[Table: the lookup tables defining the substitution A and the xor B.]

We’ll also be using the C table, which represents the negation function.

In a whitebox scenario, an adversary can not only control input, he can also execute parts of a program in isolation and investigate individual operations as they are being performed. To prevent an attacker from identifying individual operations, you can instead precompute them by constructing their composition. For example, say that you want to perform the operations A, B, and then C on your data. In other words, you want to construct a new operation D = C ∘ B ∘ A. Composing the tables for A, B, and C gives you a new lookup table for D:

[Table: the composed lookup table for D = C ∘ B ∘ A.]

The table is built by successively applying the lookup transformations B and C to the lookup table A.

If the number of operations composed in this way is large (for example, an entire round of DES) and only the composed lookup table is embedded in an application, an attacker is no longer able to separate out the individual steps of the computation. This constrains his ability to derive information about the key.
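Precomputing a composition is mechanical. The sketch below is ours, with made-up 2-bit tables (the concrete entries in the original figure are not reproduced here); B xors with a fixed constant so that all three operations are unary tables of the same shape.

public class TableComposition {
   static final int[] A = { 2, 0, 3, 1 };   // a fixed substitution
   static final int[] B = { 1, 0, 3, 2 };   // xor with the constant 01
   static final int[] C = { 3, 2, 1, 0 };   // negation (complement)

   // compose(f, g) returns the table for g o f (apply f, then g).
   static int[] compose(int[] f, int[] g) {
      int[] h = new int[f.length];
      for (int x = 0; x < f.length; x++)
         h[x] = g[f[x]];
      return h;
   }

   public static void main(String[] args) {
      int[] D = compose(compose(A, B), C);  // D = C o B o A, one table
      for (int x = 0; x < 4; x++)
         System.out.println(x + " -> " + D[x] + " (= " + C[B[A[x]]] + ")");
   }
}

Only D ships with the program; the intermediate values A[x] and B[A[x]], which an attacker could otherwise observe, are never computed at run time.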

Unfortunately, lookup tables grow rapidly in size. To encode a function that takes n bits as input and outputs m bits, you need 2^n × m bits of storage for the lookup table. For DES, which encodes 64-bit blocks at a time, this is clearly infeasible. Instead of using a single large lookup table, you can use smaller blocks of input and maintain a network of lookup tables.

Chow et al. also use partial evaluation to mingle the key with the rest of the algorithm. Once a key is chosen, those parts of the algorithm that use the key can be precomputed specifically for the key. In the case of DES, some operations (for example the fixed permutations) are independent of the key. In this case, the corresponding lookup tables in the network are also fixed. On the other hand, the substitution operation contains some information from the key and thus those lookup tables will vary from one instance to another. Remember that, unlike an oracle, the entries in the lookup table are available to the adversary. By looking at these entries, the adversary can distinguish the small number of lookup tables that are affected by the key from those that are not and focus his further analysis on just these tables.

Suppose that as in the example above, the lookup table B contains some information about the key. An adversary who is able to directly access B can gain some information about the key. What you need is some transformation that randomizes the tables and obscures the distinctions between key-dependent and key-independent lookup tables.

One way to achieve this is by applying a random transformation on the output of a lookup table and the corresponding inverse transformation on the input to the next lookup table in the network. These are called internal encodings. For example, if we take the A, B, and C tables above, we can replace them as follows:

A′ = r1 ∘ A,   B′ = r2 ∘ B ∘ r1^−1,   C′ = C ∘ r2^−1

where r1 and r2 are random lookup tables with inverses r1^−1 and r2^−1, respectively.

Note that even though C′ ∘ B′ ∘ A′ = C ∘ B ∘ A, no one lookup table (A′, B′, or C′) leaks information about the key. An adversary would need to analyze all three lookup tables in order to deduce the same amount of information about the key as is available from B. As the number of tables grows, so does the amount of work that an adversary would need to do.
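Continuing the sketch from above (our own table entries again; r1 and r2 stand in for randomly chosen bijections on 2-bit values), internal encodings randomize each table while leaving the overall composition unchanged:

public class InternalEncodings {
   // compose(f, g) returns the table for g o f (apply f, then g).
   static int[] compose(int[] f, int[] g) {
      int[] h = new int[f.length];
      for (int x = 0; x < f.length; x++)
         h[x] = g[f[x]];
      return h;
   }

   // The inverse table of a bijection r.
   static int[] invert(int[] r) {
      int[] inv = new int[r.length];
      for (int x = 0; x < r.length; x++)
         inv[r[x]] = x;
      return inv;
   }

   public static void main(String[] args) {
      int[] A = { 2, 0, 3, 1 }, B = { 1, 0, 3, 2 }, C = { 3, 2, 1, 0 };
      int[] r1 = { 3, 0, 2, 1 }, r2 = { 1, 3, 0, 2 };   // "random" bijections

      int[] Ap = compose(A, r1);                        // A' = r1 o A
      int[] Bp = compose(compose(invert(r1), B), r2);   // B' = r2 o B o r1^-1
      int[] Cp = compose(invert(r2), C);                // C' = C o r2^-1

      for (int x = 0; x < 4; x++)                       // the encodings cancel
         System.out.println(Cp[Bp[Ap[x]]] + " == " + C[B[A[x]]]);
   }
}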

One problem that remains is that the inputs in the first lookup table and the outputs of the final lookup table (A′ and C′ above) are untransformed and are potentially more vulnerable to an adversary. There is also a more subtle vulnerability. Presumably the attacker wishes to extract the key not because of any intrinsic value it may have itself but because he wishes to reuse it. If all he wants to do is to re-encrypt or decrypt data with the hidden key, he doesn’t actually need to extract it—he can simply reuse the entire whitebox implementation by embedding it in his own application!

To strengthen whitebox DES against both of these types of vulnerability, Chow et al. recommend applying external encodings. External encodings are very similar to internal encodings—where a random transformation is applied on the input or output. The difference is that the corresponding inverse transformation is not applied as part of the encryption but instead is applied elsewhere in the program.

For example, suppose you have a movie player with an embedded key with which it decrypts and plays DES-encrypted movies. Let F be a random lookup table with an inverse F^−1. To use an external encoding, you could apply F to the data as part of the module that reads data from a file and apply F^−1 to the input of the first lookup table. The movie file itself is not encrypted with DES but rather with an encoded version of it—and as a result, an attacker could not simply extract out the decryption engine, embed it in his own application, and use it to decrypt movie files.

Problem 5.7

Q1:

Whitebox DES is not provably secure. List the attacks that are possible given the scheme described. How would you address these weaknesses to make the scheme more robust to attack?

Provably Secure Obfuscation: It’s Impossible (Sometimes)!

Given the success you have seen so far, you may think it is possible to grow the asset that is being protected to cover all properties of all programs. Unfortunately, this turns out not to be possible. In fact, in the most general sense, obfuscation is impossible!

What does it mean to say “obfuscation is impossible?” In the last section, you saw that you can obfuscate point-functions, databases, and encryption functions. Aren’t we contradicting ourselves if we say that obfuscation is both “provably possible” and “provably impossible”?

The answer is no—all this means is that programs exist that can be obfuscated and other programs exist that cannot be. In the same way, the general uncomputability of HALTING does not prevent us from coming up with classes of programs for which you can compute whether the program halts.

Consider a blackbox program, which is one for which you have no access to its internals. A strongly obfuscated program should reveal no more information than a blackbox program. In practice of course, compared to blackbox programs, obfuscated programs leak some information. Given a blackbox program, all you can observe is its I/O behavior (and perhaps how long it takes to run). In contrast, given an obfuscated program, you can observe its memory accesses, the instructions that are being executed, procedure calls, and other clues that you may intuitively suspect could lead an adversary to eventually “crack” the obfuscation and force the program to reveal its secrets.

In this section, we will show that this intuition is, in fact, correct and that hiding all properties of all programs is unachievable. To show that this is true, it is sufficient to construct just one program containing a secret property that obfuscation is unable to hide. This is exactly what we will do. First we will define what a blackbox is and how it relates to our definition of obfuscation. Next we will construct a property that cannot be obfuscated and a program that exhibits this property, and thus we will show that not all properties of all programs can be obfuscated. Finally, we will discuss what features of obfuscation allowed us to construct such an unobfuscatable program.

A General Obfuscator

We have devised our definition of obfuscation in terms of assets. A general obfuscator is one that is able to hide all assets in a program. A generally obfuscated program is in some sense the most obfuscated version of a program that could be devised. It is one that cannot be analyzed to reveal any property of the original program except the relationship between input and output. How is such a perfectly obfuscated program (if it existed) different from a blackbox? The only difference is that you have source-code access to an obfuscated program. The only query you can make on a blackbox, on the other hand, is to compute its output on a finite number of inputs. Such access to a program is called oracle access.

Listing 5.5. Different types of access to a program. In (a), you have source-code access to the original program, M. In (b), you have source-code access to the obfuscated program, ObfM. In (c), you have only oracle access to M.

public class M {
   public void run(int curr) {
      int steps = 0;
      while (curr != 1) {
         if (curr % 2 == 0)
            curr = curr / 2;
         else
            curr = 3 * curr + 1;
         steps++;
      }
      System.out.println(steps);
   }
}
public class ObfM {
   public void run(int curr) {
      int steps = 0;
      while (curr != 1) {
         curr = (curr % 2) * (3 * curr + 1)
              + ((curr + 1) % 2) * (curr / 2);
         steps++;
      }
      System.out.println(steps);
   }
}

(a)

(b)

> askoracle M 1
0
> askoracle M 10
6
 

(c)

 

The word oracle may lead you to imagine uncommonly powerful access to a function, but in the sense that we are using it here oracle access is much less powerful than source-code access. Oracle access to a program reveals no information about the internal structure of a program. In a sense, oracle access is rather similar to accessing a Web service. You have limited access to the computer that a Web service is running on—it may even be on the other side of the world. You are connected to it via a network, and you send it input using a specially constructed query. The remote computer computes an answer and returns it to you, and the only information you learn is the answer (and perhaps the time taken to compute it).

It is clear that every property that you can compute, given oracle access to the program, you can also compute given source-code access—all you need to do is compile and run the program. On the other hand, there are properties of some programs that you can compute given source-code access that you cannot compute given just oracle access. For example, given oracle access to most commonly written programs, it is impossible to determine the length of the program in bytes or the number of variables used, but it is trivial to compute these properties given the source code of the program. As Listing 5.5 illustrates, there is a fundamental difference between source-code access and oracle access. In Listing 5.5 (a) and (b), you have access to the algorithm being executed (and may recognize the Collatz problem). You also are able to tell what the size of the input set is, and you are in a position to both modify and run the program. The source is a compressed summary of all I/O relations of the function it computes, presented in a form that can be used and analyzed by other programs. In Listing 5.5 (c), running a compiled program exemplifies accessing an oracle. You cannot tell from the interaction what all the inputs accepted by the program are, and the only I/O relations you learn are the ones that result from your executions. What is more, you are only able to execute finitely many such queries and thus are not in a position to determine all the properties of the program.

Let’s come up with some definitions of what is computable given source code or oracle-only access to a program:

Definition 5.6 (Source-code computable probability)

Given a program P′ and a source-code analyzer A for some property of the program, let the notation

Pr[A(P′) = 1]

be the probability that A is able to compute that property of P′.

Definition 5.7 (Oracle-access computable probability)

Given a program P′, with running time time(P′), and an oracle-access analyzer (a simulator) Sim for some property of the program, let the notation

Pr[Sim^P′(1^time(P′)) = 1]

represent the probability that Sim is able to compute that P′ has that property.

You are now able to state what properties of a blackbox you desire from an obfuscated program:

Definition 5.8 (Virtual blackbox)

Given an obfuscator O, a source-code analyzer A, and an oracle-access analyzer Sim, O is a virtual blackbox if, for all properties of an obfuscated program P′ = O(P), the source-code computable probability is approximately the same as the oracle-access computable probability, that is,

Pr[A(P′) = 1] ≈ Pr[Sim^P′(1^time(P′)) = 1].

There is another restriction that you must place on obfuscated programs in order to reason about them in this section. Specifically, you need to restrict how much bigger or slower an obfuscated program can be compared to the original. Let’s call such programs small and efficient, defined as follows:

Definition 5.9 (Small)

O is small if for all programs P, O(P) is at most polynomially larger than P.

Definition 5.10 (Efficient)

O is efficient if for all programs P, O(P) is at most polynomially slower than P.

By requiring that there is at most only a polynomial increase in size of a program and at most only a polynomial slowdown in its speed, you eliminate the degenerate case where an obfuscated program consists solely of an exhaustive list of input and output pairs. In most programs, such a list would be exponentially larger than the program itself.

Thus in this section, when we say that a program is obfuscated, we mean the following:

Definition 5.11 (Obfuscated)

A program O(P) is an obfuscated version of P if:

  • O is correct;

  • O is small;

  • O is efficient; and

  • O is a virtual blackbox.

The virtual blackbox property states that having access to an obfuscated program’s source code doesn’t give the attacker an advantage over having access to its input–output relation.

For example, given the program in Listing 5.5 (a), you can obfuscate the length, comments, and number of variables of the original program by transforming it into the program in Listing 5.5 (b). The program is still correct, small, and efficient, and at least the length, comments, and number-of-variables properties have been obfuscated. However, according to Definition 5.11, Listing 5.5 (b) is not obfuscated, because there are many other properties, such as control flow and variable names, that are preserved.

Listing 5.6. A small self-reproducing program. This program fails to be unobfuscatable because oracle access to the program is equivalent to source-code access.

class Self {
  public static void main(String[] args) {
    char qq=34,q=39;
    String payload="+qq+payload;System.out.println(payload+qq+';payload='+qq+payload.replace(q,qq));}}";
    payload="class Self{public static void main(String[] args){char qq=34,q=39;String payload="+qq+payload;
    System.out.println(payload+qq+";payload="+qq+payload.replace(q,qq));
  }
}

Now that we have a good definition of obfuscation, we will show the following:

Theorem 5.1: (Impossibility of obfuscation)

Let 𝒫 be the set of all programs. Given any obfuscating transformation O, there exists a P ∈ 𝒫 such that O(P) is not obfuscated according to Definition 5.11.

We will know that we have succeeded in proving this theorem (1) if we can construct a program that has a property that is always evident from source-code access to any obfuscated version, and (2) if there is a negligible probability that this property can be deduced from just oracle access.

Obfuscating Learnable Functions

One way you might try to build an unobfuscatable program is to make the program so simple that it has only a single output. For example, you may wish to use the Hello World program. The intuition here is to build programs that are so trivial that they simply contain no structure rich enough to obfuscate.

The problem with this function is that its triviality, ironically, makes it simpler to obfuscate, given the definition. Simple functions are learnable with just a small number of oracle queries—in the case of Hello World, a table of input–output pairs consisting of a single entry mapping an input of the empty string to the output Hello World. As a result, all differences between oracle access and source-code access disappear. In other words, simple learnable programs like Hello World can be trivially rewritten as a single lookup table that is at most polynomially larger than the original. This lookup table reveals no more than oracle access to the program does and thus is a valid obfuscation of the program. What you instead require is a property that has a negligible likelihood of being revealed by oracle access.

One idea that you may want to try is using the self-reproducing program shown in Listing 5.6. The idea here is that no matter how much an obfuscator obfuscates this program, when the program is run, it will still produce the original unobfuscated source code as output. This is guaranteed by the correctness requirement of an obfuscator. You may think that you have succeeded, because now that you have the original program, you can use it to compute any properties you like.

Unfortunately, your self-reproducing program is not a suitable candidate either, because it lacks the virtual blackbox property. Oracle access to a self-reproducing program reveals the entire source and thus is equivalent to source-code access!

Proving that Obfuscation Is Impossible

 

Goal

Asset: All properties of a program

 

Attacker Limits: None

 

Input Programs: All possible programs

 

You need your program to reveal its secret only when you have information that is available to any obfuscated version of a program but that is unavailable with just blackbox access. As we discussed earlier, the crucial difference between an obfuscated program and a blackbox program is that you have the source code for an obfuscated program. The source code is something that can be passed to another program and manipulated. You have no such access to a blackbox program—all you can do is query it. You cannot pass it to other programs or look inside it. Barak et al. rely on this difference by making their candidate counterexample crucially rely on its own source code:

This unobfuscatable program is very similar to the self-reproducing program in Listing 5.6. However, instead of printing out its own source, it tries to recognize its own source. Listing 5.7 shows a candidate counterexample, Secret. This program operates in two modes: point mode and spy mode. In point mode, the program behaves exactly like the point functions you saw in Section 5.3.1. If the input to the program in point mode is the secret key value, the program prints out 1; otherwise, it prints out 0. In spy mode, the program takes as input a program, which we will call Candidate, and tests whether Candidate behaves like Secret. If it does, Secret prints out the secret. As with point functions, oracle access to Secret gives you only a small probability of guessing the key.

Listing 5.7. An unobfuscatable program containing a secret S.

public class Secret {
   static final int POINT_MODE = 0;
   static final int SPY_MODE = 1;
   static final int S = 8544;

   public static int pointFunction(int x) {
      if (x == S)
         return 1;
      else
         return 0;
   }

   public static boolean behavesLikeMe(Program p) {
      int testPoint = S;          // test the secret point first
      int tests = 1000;
      do {
         int answer = p.run(testPoint);
         if (answer == Program.RAN_TOO_LONG)
            return false;
         if (answer != pointFunction(testPoint))
            return false;
         testPoint = (int)(Math.random() * Integer.MAX_VALUE);
         tests--;
      } while (tests > 0);
      return true;
   }

   // Program wraps a candidate program; getMode, getProgram, and
   // getInput parse the command line. Their definitions are omitted.
   public static void main(String args[]) {
      int mode  = getMode(args);
      Program p = getProgram(args);
      int input = getInput(args);
      switch (mode) {
         case POINT_MODE:
            System.out.println(pointFunction(input));
            break;
         case SPY_MODE:
            if (behavesLikeMe(p))
               System.out.println("The secret is " + S);
            else
               System.out.println("No secret for you.");
            break;
      }
   }
}

You may wonder how Secret can test if Candidate behaves like itself, given that earlier in this chapter you saw that EQUIVALENCE is uncomputable. Fortunately, Secret can determine with a high degree of probability that Candidate is the same program by testing it at the secret input and a large number of other random points. If the programs Secret and Candidate agree at these points, there is a high degree of probability that they compute the same function. Furthermore, since the running time of Secret is known to Secret, and by our definition of obfuscation, any obfuscated program is at most polynomially slower than the original program, behavesLikeMe can terminate any candidate that runs too long and will correctly return false.

To summarize, you now have a program, Secret, that when passed itself as input in spy mode, will output the secret. That is, if you have source-code access to Secret, no matter how obfuscated, you will be able to learn the secret. However, when an attacker is given only oracle access to Secret (in other words, when the attacker is unable to look at the source code of Secret, or to pass it as input to another program), he cannot learn much about the secret.

This contradicts the virtual blackbox property of the definition of obfuscation and thus proves that the program Secret cannot be obfuscated according to Definition 5.11. Since we constructed Secret without reference to any particular obfuscation algorithm, the existence of Secret proves Theorem 5.1.

Discussion

One of the most interesting aspects of the proof of impossibility of obfuscation is to understand why the attacker is not able to use the oracle to get the information that source-code access would give him. Remember that if you are able to deduce the secret simply from oracle access to Secret, then according to Definition 5.11 the program was indeed obfuscated and Theorem 5.1 (Impossibility of Obfuscation) does not hold.

Why can’t an attacker write a program that behaves like Secret? From our discussion of point functions, you already know that it is infeasible for an attacker to exhaustively test the oracle for the secret, and unlikely for an attacker to guess the secret. But why can’t the attacker write a program that has access to the oracle? The program could simply relay all its input to the oracle and output the oracle’s response. Since it is a program and not an oracle, it could now be passed to other programs, and if it was passed to Secret in spy mode, it would successfully trick Secret into revealing the secret S.

An attacker cannot do this because the candidate program he generates cannot be one that itself uses oracles. Specifically, an oracle is not allowed to make a call to itself. This may seem like a trick—and it is a trick, but one that is necessary to make the unobfuscatable program possible. However, it is not an especially unusual one. In an earlier section, when we explored Turing’s proof of HALTING, we insisted that COUNTEREXAMPLE was entirely self-contained—it was not allowed to make calls to other programs. In fact, if we relaxed the definition to allow such calls, Turing’s proof of uncomputability would no longer hold. The same is true for the unobfuscatable program.

In software engineering terms, if an oracle is like a remote Web service, then prohibiting a program from using an oracle amounts to preventing (or detecting) a candidate program’s attempt to communicate over the network. If Secret detects such an attempt, it can refuse to reveal its secret.

There are a few other interesting observations we can make about the means by which Theorem 5.1 (Impossibility of Obfuscation) was proven. First, no static analysis of Candidate was required—the obfuscation was cracked by taking advantage of the fact that Secret is designed to leak its secret when shown a program that behaves like itself. How similar are the programs you usually want to obfuscate to Secret? What does the unobfuscatable program imply about programs that do not attempt to conspire against the obfuscator? Finally, you now have proof that there exist programs that can be obfuscated and others that cannot be. Which class do useful programs belong to? Can we tweak the definition of obfuscation so that all useful programs are obfuscatable?

Provably Secure Obfuscation: Can It Be Saved?

You may wonder why we are so keen on provable obfuscation. After all, if the obfuscation techniques we have introduced so far are already sufficiently powerful in making reverse engineering of programs prohibitively expensive, is this not enough? Was this not what we had argued for earlier? In other words, what additional advantage does provable obfuscation give us that makes it worth pursuing?

General provable obfuscation, if it existed, would allow us to solve many important problems in cryptography. For example, in voting protocols you need to count the number of votes cast while maintaining the anonymity of the individual ballots. The OBFPP scheme we described is a specific example of the more general technique of homomorphic encryption. Homomorphic encryption schemes can help protect the privacy of a vote. However, if the only operations possible on an encrypted vote are addition and multiplication, the number of additional checks possible is limited. For example, the lack of support for relational testing prevents voting schemes based on OBFPP from being able to test the validity of each vote cast.

It may certainly be possible to come up with an encryption scheme that supports a particular set of operations, but it is time-consuming to create and test each new encryption system. Moreover, encryption schemes that support other functions, such as division, have also been difficult to develop. However, if a provably secure obfuscation scheme were developed, these problems would vanish. To see that, have a look at this program:

long homomorphicEncryption(long x,long y) {
   long privateKey = 0x1234...1234;
   long decryptX = decrypt(privateKey,x);
   long decryptY = decrypt(privateKey,y);
   long result = f(decryptX, decryptY);
   return encrypt(privateKey,result);
}

If we had provably secure obfuscation, a program such as this one could be obfuscated and used as the homomorphic operator. The program uses an embedded private key to decrypt both of its arguments, applies an arbitrary operation, f, to them, and then uses the embedded private key to re-encrypt and return the result. The program itself is obfuscated, and the security of the obfuscation would guarantee the secrecy of the key.

Another use of provably secure obfuscation could be to develop new types of public key encryption schemes. In fact, with a provably secure obfuscator, you would be in a position to turn any private key encryption scheme into a public key encryption scheme. Given the difficulty of developing good public key encryption techniques, this would be hugely advantageous. To develop a new public key encryption scheme, you would embed your private key in a program like this:

long publicEncrypt(long x) {
   long privateKey = 0x1234...1234;
   return privateEncrypt(privateKey,x);
}

You would then obfuscate this program and release it as your public key. To encrypt a message to send to you, anyone could run the program with the message as argument. To decrypt the message, you would use your private key and decryption routine. The security of the obfuscation would ensure that your embedded private key would remain secure.

Overcoming Impossibility

Given the value of provable obfuscation and the proof that the general case is unsolvable, what future directions are possible and promising? The proof suggests that there exist programs that cannot be obfuscated. However, it does not suggest that any specific program is not obfuscatable. One promising possibility is to find a way to restrict the class of programs you are interested in obfuscating so as to exclude Secret.

For example, you have already seen that point functions can indeed be obfuscated. What would be the most useful program one could try to obfuscate? It would be a program capable of providing the most general functionality while remaining outside the domain of the proof. For example, you could choose to obfuscate a particular limited virtual machine. There is nothing directly in the proof you saw that suggests this is impossible. Once you have such an obfuscated virtual machine, your task is “simply” to deliver to this virtual machine the program you wish to execute in a form that it can execute that is nevertheless resilient to analysis. However, since the output of this virtual machine would be usable by an attacker, it is likely that the original impossibility proof could be adapted to show that the obfuscation of even particular virtual machines is impossible.

A more promising approach is to find an alternate definition of obfuscation that remains useful but prevents Theorem 5.1 (Impossibility of Obfuscation) from being applicable. In the remainder of this section, we will explore both of these directions for rescuing provable obfuscation. In Section 5.5.2 you will see secure multi-party computation, which splits a program up so that each piece is executed by a separate party, in such a way that no one party gains complete access to the asset you are trying to protect. In Section 5.5.3 we will explore another approach, which transforms an obfuscated program in such a way that it encrypts its output before returning it.

Both of these transformations turn the non-interactive definition of obfuscation into an interactive one that requires a server and a client to be present in order to execute an obfuscated program.

Definitions Revisited: Make Obfuscation Interactive

The definitions of obfuscation you saw in Section 5.1 implied that once a program P has been obfuscated, it executes without further interaction with a server. Defined in this way, general provably secure obfuscation is impossible.

However, if you allow an obfuscated program to distribute its computation so that some of it is not accessible to an attacker, this amounts to providing a blackbox to an obfuscated program where it can securely perform computation and store data. This is rather like remote tamperproofing, which you will learn more about in Section 7.5. Of course, it is then trivially possible for an obfuscated program to move all of its computation to the blackbox and be completely secure.

The interesting and open problem is determining the smallest amount of data and computation that needs to be performed by a blackbox in order to allow general-purpose obfuscators to exist. The solution to this problem may come from an area called secure multi-party computation (SMC). The aim of SMC is to enable clients to carry out a distributed computation on private data without leaking this data to an adversary.

For obfuscation, the interesting special case is where the number of parties involved is two—the party who wishes to provide the securely obfuscated program and the party who wishes to run it. In this case, the asset to be protected is the private information on which the two parties perform the distributed computation—the obfuscator provides the asset and relies on SMC to ensure that the adversary is able to run the program without gaining access to this asset.

Cheapskate Problem

Let us look at an example using SMC. Suppose you and your friends wish to work out the average amount of pocket money you have, but you do not wish to tell anyone how much pocket money you have. Can you come up with a protocol that allows you to compute the average without revealing your allowance? In fact, you can! Have a look at this figure:

[Figure: the participants stand in a circle; a masked running total is whispered from one to the next.]

The algorithm starts by making all the participants stand in a circle. The first person, Alice, begins by thinking of a random number N, adds her allowance x to it, and whispers the sum to Bob. Bob takes that number, adds his own allowance to it, and whispers the result to the next person in the circle, Carol. She, in turn, adds her own allowance, and so on. When the total finally returns to Alice, she subtracts the random number N that she began with and announces the total to the group. Every member of the group can then divide the total by the number of people in the group and know the average allowance.
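A single-process simulation of the protocol might look like the sketch below. The allowances are hypothetical, and in a real run each addition would happen at a different participant, with only the running total passing between them:

public class CheapskateSum {
   public static void main(String[] args) {
      int[] allowance = { 7, 3, 9, 5 };              // private to each friend
      java.util.Random rnd = new java.util.Random();

      int N = rnd.nextInt(1000000);                  // Alice's random mask
      int running = N + allowance[0];                // Alice whispers this to Bob
      for (int i = 1; i < allowance.length; i++)
         running += allowance[i];                    // each friend adds theirs

      int total = running - N;                       // Alice removes the mask
      System.out.println("average = " + (double) total / allowance.length);
   }
}

Because every value whispered along the circle has the random N folded in, no single participant can recover anyone else's allowance from what they hear.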

It is instructive to briefly examine how this protocol works. No one member at any time has sufficient information to decode the “running total.” What attacks are possible in this scenario? What would happen if two people in the group were to collude? For example, if Bob and David colluded by sharing their totals, they could compute Carol’s allowance. If the two colluding parties are separated by a larger number of people, they may not be able to tell what each person between them had as an allowance, but they would obtain a much more accurate approximation of the combined allowance of the people between them than can be determined from the average.

A second attack that is possible is that any participant could lie about their allowance. While it is always the case in multi-party computations that a participant can lie about his input, in this problem it is particularly advantageous for a participant to lie. If he is the only party to lie, at the end of the round he would be able to calculate the true total, while every other person in the circle would have an incorrect value.

Can we do better than this? Can these problems be resolved? Can the protocol be extended to calculate functions other than the average?

Millionaire Problem

 

Goal

Asset: Private data

 

Attacker Limits: None

 

Input Programs: All possible programs

 

Suppose you and a friend wish to calculate which of the two of you is the richer but you do not want to reveal this information to each other. One way of managing this would be for each person to reveal their wealth to a trusted third party, who in turn would tell you who is richer. In such a scheme, no one friend would have an advantage, nor would they be able to gain much information about the other person’s wealth—unless the trusted third party was compromised!

A secure multi-party computation allows you to perform these and other more complex types of computations without using a trusted third party. For example, suppose Alice and Bob are the two millionaires. Yao [379a] shows how they could work out which one of them is richer as follows:

  1. Let Alice’s wealth, a, and Bob’s wealth, b, be integers (in millions) between 1 and 5.

  2. Bob picks a random N-bit integer, x, encrypts it using Alice’s public key, C = EAlicepub(x), and sends Alice C − b + 1.

  3. Alice uses her private key to generate <Y1, Y2, Y3, ..., Y5> such that Yn = DAlicepriv(C − b + n). She can do this even though she does not know b, because Bob sent her C − b + 1.

  4. Alice generates a random prime p of length N/2 bits.

  5. Alice computes <Z1, Z2, Z3, ..., Z5> such that Zn = Yn mod p.

  6. Alice transmits p to Bob and sends him 5 numbers <Q1, Q2, ..., Q5>, where Qn = Zn if n ≤ a, and Qn = Zn + 1 otherwise.

  7. Bob computes x mod p and compares it to the bth number that Alice sent him. If they are equal, then Alice is at least as rich as Bob; otherwise, Bob is richer.
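The following Java sketch simulates one run of the protocol. It is ours, with hypothetical wealths, with textbook RSA standing in for Alice's key pair (the protocol does not mandate a particular public-key scheme), and with none of the integrity checks a real implementation would need:

import java.math.BigInteger;
import java.security.SecureRandom;

public class MillionaireSketch {
   public static void main(String[] args) {
      SecureRandom rnd = new SecureRandom();
      int a = 4, b = 2;                 // wealths in millions, both in 1..5
      int N = 64;                       // bit length of Bob's random x

      // Alice's key pair: textbook RSA, far too small for real use.
      BigInteger p = BigInteger.probablePrime(256, rnd);
      BigInteger q = BigInteger.probablePrime(256, rnd);
      BigInteger n = p.multiply(q), e = BigInteger.valueOf(65537);
      BigInteger d = e.modInverse(p.subtract(BigInteger.ONE)
                      .multiply(q.subtract(BigInteger.ONE)));

      // Step 2: Bob encrypts a random x and sends C - b + 1.
      BigInteger x = new BigInteger(N, rnd);
      BigInteger C = x.modPow(e, n);
      BigInteger msg = C.subtract(BigInteger.valueOf(b - 1));

      // Steps 3-6: Alice decrypts msg, msg+1, ..., msg+4, reduces each
      // modulo a random prime pr, and bumps the entries above her wealth.
      BigInteger pr = BigInteger.probablePrime(N / 2, rnd);
      BigInteger[] Q = new BigInteger[5];
      for (int u = 1; u <= 5; u++) {
         BigInteger Yu = msg.add(BigInteger.valueOf(u - 1)).modPow(d, n);
         BigInteger Zu = Yu.mod(pr);
         Q[u - 1] = (u <= a) ? Zu : Zu.add(BigInteger.ONE);
      }

      // Step 7: Bob inspects the b-th number.
      boolean aliceAtLeastAsRich = Q[b - 1].equals(x.mod(pr));
      System.out.println(aliceAtLeastAsRich ? "Alice is at least as rich."
                                            : "Bob is richer.");
   }
}

Note that only the bth entry carries information Bob can check; the other entries look random to him, and Alice never learns b at all.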

Problem 5.8

Q1:

What weaknesses in this protocol might be used to compromise its integrity? Can you devise another protocol that is resistant to these problems?

While this solution is specific for the millionaire problem, it can be generalized to a two-party SMC problem as follows. Let Alice have an input x = <x1, ..., xs> and Bob have an input y = <y1, ..., yr >. They wish to compute a publicly known function f (x, y) without revealing information about their inputs that cannot be inferred from the output of the entire computation. Surprisingly, cryptographers have devised general protocols that are capable of computing arbitrary functions f between two or more parties.

Definition Revisited: Make Obfuscation Non-Semantics Preserving

Earlier in this chapter, we considered a definition of obfuscation based on indistinguishability and equivalence. According to that definition, a program P is obfuscated if an attacker is unable to distinguish it from every other program P′ without running it. Alternatively, you could define obfuscation as the ability to compute some function (rather than merely a predicate) over the original program. While these are good candidates for the definition of obfuscation, they are all stronger than Definition 5.11 and thus are still susceptible to the uncomputability of obfuscation.

For a weaker definition of obfuscation, you need a definition that does not rely on oracles but rather on I/O access to the program itself. For example, you could define a virtual blackbox as follows:

Definition 5.12 (Virtual blackbox (Alternate))

Given an obfuscated program P′, a source-code analyzer A, and an I/O-access analyzer R, P′ is a virtual blackbox if, for each property of the program, the probability that R computes that property given I/O access to P′ is approximately the same as the source-code computable probability Pr[A(P′) = 1].

Under this definition, Theorem 5.1 does not apply, since it is possible to construct a Candidate program that is able to fool Secret into revealing S using nothing more than I/O access.

Another possibility is to remove the requirement that an obfuscated function maintain correctness—that the obfuscating transformation be semantics-preserving. At first sight, this may appear to be the least desirable axiom to lose—after all, your entire intent in using obfuscation is to protect a specific program, presumably because you are interested in the output of this program.

To prove that the program Secret could not be obfuscated, we did not analyze it but rather created an input—a program called Candidate—that had the same input–output behavior and thus would trick Secret into revealing S.

If the obfuscation algorithm were allowed to change the output of the program (for example, by preventing the output from leaking information to the attacker), then creating a program like Candidate would not be possible. For example, we could prevent the program from producing any output at all (which is not very useful), or we could encrypt or otherwise obscure the output.

Such a transformed program would not be useful in the same way that obfuscated programs are. In particular, to avoid the problems that Candidate runs into, extracting useful information from the output of such a transformed program cannot happen on the client. However, such transformations are ideal for remote execution.

Algorithm 5.4. Executing a remote procedure call between a server and a client.

SERVER RPE_SEND(P, I):

  1. Encode P and I into P′ and I′, respectively, such that they are suitable for transmission over the network for the given client.

  2. Send (P′, I′) to the client and wait for a response O.

  3. Decode O to retrieve P(I).

CLIENT RPE_RECEIVE(P′, I′):

  1. Decode P′ and I′ into P″ and I″, respectively.

  2. Execute P″ with input I″, i.e., let O ← P″(I″).

  3. Encode the result O and send it back to the server.

Remote Procedure Call

A Remote Procedure Call (RPC) mechanism is a technique for implementing a client-server model of distributed computing. It allows one machine to cause a subroutine on a remote machine to be executed and to receive the result of that execution. To execute an RPC, a server constructs a message containing the name of a specific procedure and serialized parameters and sends it to the client. The client deserializes the parameters, executes the requested procedure, serializes the result, and returns it to the server.

RPC mechanisms are useful in cases where the set of behaviors you may want from a remote service is small and, as a result, can be created as stored procedures on a remote server. They have the advantage that the computation is done at a remote location that an attacker is unable to debug or distort. If there is a computation that you wish to keep secret from an attacker, you could store it remotely on a server and provide access to it only via an RPC call.

For example, decrypting a message using a fixed key is ideally suited for RPC. The limitation of RPC is the need to have identified ahead of time the set of procedures you will require (and the keys you will use). What you really need is something Herzberg et al. [161] call remote program execution (RPE). We show an RPE protocol in Algorithm 5.4. It works similarly to a remote procedure call, except that, in addition to encoding a program for transmission, it encodes the program for execution on the target platform. The way to think about this is as an RPC where the target procedure is in fact a virtual machine. The argument to that remote procedure is a program encoded for that virtual machine and its return value is the output of the program.

Here’s what this would look like:

[Figure: remote program execution. The server encodes P with M1; the client decodes it with U1, runs it on the virtual machine X with input I, encodes the output with M2, and the server decodes the result with U2.]

Here, the server uses transformer M1 to encode program P for transmission and sends it to the client. The client uses transformer U1 to decode the message and the virtual machine X executes P with input I. The client encodes the output using another encoder M2 and sends it to the server where it gets decoded using transformer U2.

At first glance you have not really gained anything. While it is true that the actual computation is happening on a protected server via the RPC, the fact that you need to encode the entire program for the virtual machine on the server means an attacker can intercept that program and statically analyze it to extract its assets.

Moreover, in real-world programs, the portions of a program that you want to protect are large and computationally expensive. Calls to the section of code you want to protect are deeply interwoven with the rest of the program. Moving these sections of the program to a remote server, irrespective of how it is encoded, would result in an unacceptable slowdown in the execution of your program.

What you would ideally like is to reverse the roles of server and client. You need the majority of the computation to be carried out on the client and to make the output of that computation be incomprehensible to the end user at the client. Then the client could request the server to perform the small amount of computation required to decrypt the output. Not only does this scheme not require extensive computation on the server, but encrypting the output has the advantage of sidestepping the impossibility-of-obfuscation result.

Whitebox Remote Program Execution

Herzberg et al. [161] have suggested a construction by which remote program execution can be used to achieve this model of provably secure obfuscated execution. In their model, the server takes a program P and, instead of merely encoding it for transmission, encodes and encrypts it for a particular virtual machine running on the client.

The client is an obfuscated program that consists of three parts: a decryption routine with an embedded key, a virtual machine, and a re-encryption routine with another embedded key. Here’s what that would look like:

[Figure: whitebox remote program execution. The client’s decryptor U1, virtual machine X, and re-encryptor M2 must be obfuscated together.]

The server encodes a program P for transmission using transformer M1, which encrypts it with a key k1. The server sends the resulting package to the client, which first decodes it using the transformer U1, which has k1 embedded in it. The client then executes P using the virtual machine X with input I. The client encrypts the result using another encoder, M2, which has key k2 embedded within it, and returns it to the server, where transformer U2 decrypts it and returns it to the end user. U1, X, and M2 on the client need to be obfuscated together in order to make them inseparable. The means to do so remains an open problem.

There are a couple of features worth noting in this proposal. First, decryption and encryption keys are built into the client. As such, they are vulnerable to a reverse-engineering attack. Of course, you already know how some types of encryption and decryption can be carried out using Algorithm OBFCEJO, for example. Unfortunately, this is not sufficient, because if an attacker can separate the individual steps, he can gain access to the decrypted program in memory. Instead, a method of obfuscation that makes the three steps shown inseparable is required.

Second, the security of the construction relies on the encrypted output not leaking any information. If any information leaks to an attacker, he will be able to construct a Candidate that communicates via that leaky channel and thus construct a program that is unobfuscatable. One example of such a channel is execution time. For example, suppose an attacker devises a program Candidate that executes indefinitely when given a specific input I. No matter how this program was obfuscated before being passed to the client, when executed and given input I, it would be recognizable as the attacker’s program. To close this channel, the virtual machine must only execute a given program for a fixed amount of time before returning its output.
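One way to close the timing channel is to give the client's virtual machine a fixed step budget. The sketch below is ours, for a hypothetical toy instruction set, but it shows the principle: any program, however obfuscated, either finishes within the budget or is cut off with the same observable result.

public class BudgetedVm {
   static final int RAN_TOO_LONG = -1;

   // Toy instructions {op, arg}: 0 = add arg to acc, 1 = jump to arg if
   // acc != 0, 2 = halt and return acc. The budget caps total steps.
   static int run(int[][] prog, int input, int budget) {
      int acc = input, pc = 0;
      for (int steps = 0; steps < budget; steps++) {
         int[] ins = prog[pc];
         switch (ins[0]) {
            case 0: acc += ins[1]; pc++; break;
            case 1: pc = (acc != 0) ? ins[1] : pc + 1; break;
            case 2: return acc;
         }
      }
      return RAN_TOO_LONG;   // budget exhausted: timing reveals nothing
   }

   public static void main(String[] args) {
      int[][] countdown = { {0, -1}, {1, 0}, {2, 0} }; // loop until acc == 0
      System.out.println(run(countdown, 5, 100));      // prints 0
      System.out.println(run(countdown, 5, 3));        // prints -1 (cut off)
   }
}

The same idea appeared in behavesLikeMe earlier: Secret terminates any Candidate that exceeds the polynomial bound that Definition 5.10 guarantees.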

The first limitation of secure multi-party computation is speed. The gain in privacy is attained with a large data overhead, and many of the protocols are extremely slow. As a result, only very simple arithmetic and relational computations are practical. Second, many secure multi-party computation algorithms are resilient only to passive attacks, in which participants follow the protocol but try to learn what they can from what they observe. This means, for example, that the protocol gives correct results only if the participants follow the protocol correctly. Some proposed algorithms require all participants to act honestly, while others work correctly if a majority of the participants are honest. For example, in the Cheapskate Problem, any one of the friends is able to secretly influence the total that is computed. They are able to derive information from the computed average while preventing other participants from doing so, and as a result they gain an unfair advantage. Other protocols defend against this possibility by allowing members to detect this type of fraud and stop the computation before their information has leaked.

To address these difficulties, many secure multi-party protocols are devised not to perform arbitrary computations but rather highly specific ones using optimized and efficient primitives. Nevertheless, secure multi-party computation can be used to build secure protocols for auctions, voting, private information retrieval, and other types of sensitive distributed computations, and in essence can allow you to build a slow, Turing complete, virtual machine.

Discussion

In this chapter you saw recent work on the theory and limits of obfuscation. We showed how fundamental problems in computer science such as the halting problem and the equivalence problem provide bounds on what kind of obfuscation is possible. You saw examples of provably secure obfuscation of point-functions used to obfuscate access control and regular expressions and the construction of an obfuscated database.

We also discussed encryption algorithms and how to perform some operations on encrypted values using properties of the underlying encryption scheme. We were not able to extend this scheme to securely hide the encryption or decryption itself. However, we showed you the whitebox approaches that currently are being used to try to securely obfuscate cryptographic algorithms like DES.

All of these examples were building up to the famous Theorem 5.1, which states that, under a very strict definition of obfuscation, obfuscation is impossible, or, more precisely, that programs exist that cannot be obfuscated.

You have also seen how contrived the counterexample required to prove this impossibility result was. This does not mean the theorem is inconsequential or unimportant. It helps us identify where the core difficulty in carrying out obfuscation lies.

Using this understanding, we were able to explore two workarounds for provable obfuscation that are applicable to a smaller class of programs: secure multi-party computation, where security is gained by distributing the computation among several parties, and whitebox remote program execution.

Whether or not obfuscation is possible appears to depend strongly on how you define obfuscation, the properties you want to protect, and the class of programs you are considering. There are many open questions in each of these workarounds. For example, are there interesting commonly used applications that cannot be executed using this design for obfuscated execution? To avoid the tyranny of the impossibility of obfuscation, you need to prevent the program you are obfuscating from being able to signal a single bit of information to a potential attacker. As a result, your encoded program on the client must be difficult to distinguish from any other encoded program in the class you are interested in. This is true for its execution time, memory usage, and any other factor that could be used by the program as a side channel. As you grow the class of programs you would like your obfuscator to obfuscate, it becomes increasingly difficult to prevent a program from being able to send such a signal.



[1] A program to obfuscate your own Perl programs in this style is available from this book’s Web site.
