Friday, December 24, 2010

Java 32 vs. 64 bit

I thought it would be a good idea to install a 64 bit version of Java on my new PC with its 64 bit architecture. But it turned out not to be worth the effort.

I have the following:

$ java -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b07)
Java HotSpot(TM) 64-Bit Server VM (build 17.0-b17, mixed mode)

$ java32 -version
java version "1.6.0_21"
Java(TM) SE Runtime Environment (build 1.6.0_21-b07)
Java HotSpot(TM) Client VM (build 17.0-b17, mixed mode, sharing)

I run tests on a huge Frege source code file that looks like this:

package Stress where
x = (((( ... ))))  // actually 1 million ( and 1 million )

Running the compiler with java32 -Xmx500m, I get

lexical analysis took 7.333s, 2000007 tokens, 272740 tokens/s
parser (including lexical analysis) took 38.364s, 2000007 tokens (52132 tokens/s)

The 64 bit Java, however, dies with an OutOfMemoryError under the same setting. With -Xmx1g it reports:

lexical analysis took 8.818s, 2000007 tokens, 226809 tokens/s
parser (including lexical analysis) took 37.910s, 2000007 tokens (52756 tokens/s)

The speedup - if any - is barely significant. Only with 2 gigabytes does it seem to get better:

... lexical analysis took 6.33s, 2000007 tokens, 315956 tokens/s
parser (including lexical analysis) took 34.709s, 2000007 tokens (57622 tokens/s)

A 10% speedup, but only when given 4 times the memory! And, by the way, it is seldom the case that one has to compile a 100,000-line Frege source (assuming 20 tokens per line). Even the parser (which is generated from a yacc grammar) has only some 230,000 tokens on 35,000 lines.
On smaller files, the 32 bit client version is unbeatable; the 64 bit server version is actually 2 to 3 times slower.

Monday, December 13, 2010

More thoughts about the abstract JVM

Having read some papers on purity and nullness checking for Java methods, I must conclude that things are not that easy.
Vicious sub-typing stands in the way of sound reasoning about Java methods! Consider:

class A {
  B b;
  T a1() { return this.b.m1(); }
  T a2() { return this.b.m2(); }
}
class B {
  T m1() { /* pure code */ }
  T m2() { return new T(...); }
}
class C extends B {
  T m1() { System.out.print("ha"); return super.m1(); }     // side effect added
  T m2() { if (something) return null; return super.m2(); } // may return null
}

Even if we knew that we had a value v with runtime type A, we could not answer the questions "Is the call v.a1() free of side effects?" and "May the call v.a2() return null?", because at runtime v.b could well be a C. Note that C might not even exist at the time of our check.

Except for very simple methods, this seems to mean that we can detect neither purity, null-behaviour, nor the runtime type of return values unless the involved methods or classes are declared final or static.
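To illustrate the one sound case - a minimal sketch with an invented name: because the class is final, no subclass can ever override its method, so analysing this byte code alone suffices.

    // D is final: no override can ever make m impure or null-returning
    final class D {
        String m() { return "d"; }  // provably pure, provably non-null
    }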

It will be next to impossible to provide a simple workaround for more complex existing code. Perhaps one could automatically "finalize" an existing class somehow .... lots of things to think about.

Saturday, December 11, 2010

The abstract JVM

What is the sign of the value one gets by multiplying 87457 by -37239? This is immediately clear: we need not actually perform the multiplication. Instead, we know that "plus times minus yields minus". We abstract away from the concrete numbers and reformulate the question in another algebra with only 4 possible sentences (plus times plus, plus times minus, minus times plus and minus times minus). In other words, we perform an abstract interpretation.

What is the sign of 23-187? Again, we need not compute the numeric result to answer the question. We know that subtracting a number from a smaller number yields a negative result. For some reason it is (sometimes, but consider 498374635243648 and 498374635246348) easier for our brain to compare numbers than to actually subtract them. This is especially interesting when one knows that contemporary computers think differently: for them, subtracting numbers is easy. Hence, if one needs to compare two numbers, one subtracts the second from the first, and when the sign is negative, the first number was smaller.
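Both questions can be answered inside a tiny algebra. Here is a minimal Java sketch of it (all names invented for illustration; zero is deliberately abstracted away):

    enum Sign {
        PLUS, MINUS;

        // "plus times minus yields minus" etc., without ever multiplying
        Sign times(Sign other) {
            return this == other ? PLUS : MINUS;
        }

        // sign of a non-zero number
        static Sign of(long n) {
            return n < 0 ? MINUS : PLUS;
        }
    }
    // Sign.of(87457).times(Sign.of(-37239)) == Sign.MINUS, and
    // Sign.of(23 - 187) == Sign.MINUS tells us that 23 < 187.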

A similar method of abstract interpretation could be employed in our quest for knowledge about certain properties of native functions (i.e. Java methods). The problems, solution strategies and (epistemo)logical basics have been detailed in previous posts. Here is a summary:

  1. Native functions cause problems through null values, exceptions and side effects if and when these are not correctly declared.
  2. The compiler should do its best to verify the truth of the assertions made by the programmer.
  3. Often it will only be possible to prove that an assertion is false, and there are cases where even this is impossible. In the case of null values we must insist on a positive proof that the function cannot return null.

I envision an abstract JVM that can interpret the byte code of a Java method and infer evidence about the method's behavior that is interesting for us.
There are quite a few tools that do similar things, though abstract interpretation is mostly called static analysis there. For example FindBugs, a tool to find bugs in Java programs, or Soot, a Java optimization framework. The latter even has a (completely undocumented) purity check. Unfortunately, it insists on "whole program analysis", i.e. one must name a class with a main method. (I can imagine that this is because of the problem with interface methods - one can say nothing at all about an interface method per se. Yet, in a whole-program analysis it is perhaps possible to enumerate all classes that may possibly occur, and then one also has the set of all classes that implement a given interface and can inspect the concrete methods.)

The FindBugs tool uses the Byte Code Engineering Library (BCEL), and I think this also makes a good choice for our purposes.

Our abstract JVM would operate on an abstract machine state that contains abstract local variables and an abstract stack. As in the multiplication example above, we of course abstract from concrete values. The only things of interest for us are the types of references, their visibility, and whether it is possible that they are null. The abstract interpretation of a method starts out with an empty stack and the set of argument local variables, including this, with their declared types. All arguments are visible (of course) and not null (anticipating the ability to rule out null values anywhere in the Frege realm).
The method might start with an aload_0 instruction. In a real JVM this pushes the value of this on top of the stack. In our abstract JVM we push a copy of abstract local variable 0 on the stack, which is the triple (type of this, visible, not null). The next instruction could be a getfield #1. Here our abstract machine must look up the type of field #1 and replace the top of the stack with the triple (type of field, visible, possibly null). And so on. Eventually there will be some return instruction, and if it is areturn and we have a possible null value on the stack, we note this. We also know when, in the course of interpretation, there was a putfield instruction with a visible target object on the stack.
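To make this concrete, here is a minimal sketch of such an abstract machine state in Java. All names are invented, and a real implementation would be driven by decoded byte code (e.g. via BCEL) rather than by hand:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // the abstract value: the triple (type, visible, possibly null)
    final class AbstractValue {
        final String type;
        final boolean visible;
        final boolean maybeNull;
        AbstractValue(String type, boolean visible, boolean maybeNull) {
            this.type = type; this.visible = visible; this.maybeNull = maybeNull;
        }
    }

    final class AbstractFrame {
        final AbstractValue[] locals;                  // abstract local variables
        final Deque<AbstractValue> stack = new ArrayDeque<>();

        AbstractFrame(AbstractValue[] args) { this.locals = args; }

        // aload_n: push a copy of abstract local variable n
        void aload(int n) { stack.push(locals[n]); }

        // getfield: pop the target, push (type of field, visible, possibly null)
        void getfield(String fieldType) {
            stack.pop();
            stack.push(new AbstractValue(fieldType, true, true));
        }

        // areturn: does a possibly-null value escape to the caller?
        boolean areturnMayBeNull() { return stack.peek().maybeNull; }
    }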
By the way, we also see the return type, and when it is some U extends X, we cannot accept declarations that say it is just an X. For we must insist on knowing the most specific type of all objects that are injected into Frege, lest our abstract interpretation be worthless. Consider: the byte code of X.foo might be pure while the byte code of U.foo is not. No "inheritance", no "subtyping" there, sorry, no way to give us an X for a U. It's not my fault that the Java language pays no attention to such matters except for checked exceptions. In a proper OO language it should be impossible to override a method that returns t objects with one that returns t objects or null; in general, the overriding method must not have more freedom than the overridden one.
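A minimal illustration with invented bodies: Java happily accepts this covariant override, yet an analysis of X.foo tells us nothing about what a U handed to us as an X will do.

    class X {
        X foo() { return this; }                         // pure, never null
    }
    class U extends X {
        @Override
        U foo() { System.out.print("!"); return this; }  // impure covariant override
    }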

Back to the abstract JVM: I think this is also how byte code verifiers work: they keep track of the types of stack values and check - among other things - whether method calls are valid. This gives me some hope that the abstract JVM could be reasonably fast. In contrast, the tools/frameworks mentioned above seem to need enormous amounts of memory and CPU time. We'll see how it turns out.

Of course, our interpretation being abstract means that we will have false positives. Consider:
public X foo() { return false ? null : this; }
It doesn't matter that in reality this function never returns null. Unless the Java compiler optimizes the conditional expression away so that we can't see it in the byte code, our abstract machine will think that the method may indeed return null.
However, if the null-tracking turns out to be trustworthy, why not employ this knowledge in cases like
public X foo(Y bar) {
    if (bar == null) this.x = 42;
    return this;
}
Under the assumption that bar is not null, the method could then be classified as pure.
I am really curious how frequently the abstract JVM will assure us that a method is null-safe.

It is clear that the correctness of our results depends on the level of "abstractness" we employ. The problem leads naturally to a program that starts out with a very high level of abstraction but can be developed further and further to take more and more knowledge into account. One example would be to feed back what we know about possible null values, so that we sometimes do not need to follow an ifnull byte code. Or take the getfield example from above: isn't it a bit too defensive to assume that a field is possibly null? What if the field is final and all constructors of the class initialize it with a non-null value?
The nice thing is that every less coarse abstraction potentially has a very positive feedback effect. For assume that on a certain abstraction level we concluded that some method deep in the Java runtime system can return null. This could be the reason for alleging that many innocent functions, which use the former one in the calculation of their result, also return null. Thus, on a slightly finer abstraction level, we could prove null-safety for many functions that do not themselves exhibit the detail that is now taken into account.
The downside is, of course, that the computing resources needed for abstract interpretation will increase with every decrease in abstractness, perhaps so much so that one must finally decide to abstract certain hard-to-compute details away again and leave it at that. Experience will show.

The next task is to think about a sound "safety signature", a data structure in which we can record the behavior of a method once it has been computed (the alternative being to interpret anew on each repeated call). The problem is that the behavior can depend on the method's arguments. Take the example above: foo is pure under the assumption that its argument is not null. If it is null, then this is modified. But this matters only if this is visible, i.e. a method that does new X(...).foo(null) is innocent.
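At its coarsest, such a signature might record a few booleans. A hedged first sketch, with invented field names, that ignores the argument-dependence for the moment:

    // A first stab at a "safety signature"; a real design would also record
    // per-argument assumptions (e.g. that foo above is pure only if its
    // argument is not null).
    final class SafetySignature {
        final boolean mayReturnNull;   // could an areturn deliver null?
        final boolean mutatesThis;     // putfield on this (matters only if visible)
        final boolean mutatesGlobals;  // writes static fields: always impure

        SafetySignature(boolean mayReturnNull, boolean mutatesThis,
                        boolean mutatesGlobals) {
            this.mayReturnNull = mayReturnNull;
            this.mutatesThis = mutatesThis;
            this.mutatesGlobals = mutatesGlobals;
        }
    }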

Epistemological problems of purity

It would be nice if purity (of Java functions) were a property that could be proved easily. But this is not so. On the contrary, purity is a concept about what a function may not do, and it is in general and in principle impossible to prove for a Java function. Just think of a Java function that calls an interface method on some field of this - this amounts to the question whether all possible implementations of this method are pure, including the implementations that will be written next week.
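To make that concrete, a minimal example with invented names:

    interface Task { int step(); }

    class Runner {
        Task task;  // runtime type unknown at analysis time
        // pure only if every implementation of step() ever written is pure -
        // including the one somebody writes next week
        int run() { return task.step(); }
    }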

In the general case, we cannot prove that a function has no side effects, throws no exceptions, or never returns null. That is, we cannot confirm that the native declaration written by the Frege programmer is valid. However, in certain cases we can show that it must be invalid, or at least that it is most probably invalid. In addition, we should exploit the (rare?) cases where a positive proof can indeed confirm what the programmer asserted.

Side effects and Input/Output

We can show for many functions that they are not pure. We will regard any of the following as proof that a function is not pure, that is, that it produces or consumes side effects:
  • invocation of a native (JNI) method, which will most likely perform input/output. In fact, as the JVM itself has no I/O instructions, this is the only way to perform input/output. Certainly, there might be pure functions that are written in C or even assembler for performance reasons, but I think we can ignore them as a first approximation. In any case, the number of such functions will be sufficiently small.
  • altering fields in objects visible to the caller. An object is visible to the caller if it is this, or another argument of the function, or if it is referenced in a field of a visible object.
  • altering static fields or objects visible through such a field.
  • reading non-final static fields, or non-final objects visible through final fields.
Of course, not only the actions of the function itself but also those of called functions, and of functions called from these, and so on, must be considered. Note that it is allowed to create objects, alter them and even return them. For example, this one should be okay:
    public int[] initarray(int s, int v) {
        int[] a = new int[s];
        while (--s >= 0) a[s] = v;
        return a;
    }
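By contrast, a hedged counterexample with invented names: the same kind of mutation becomes impure as soon as the mutated object is visible to the caller.

    class Counter { int n; }

    public int bump(Counter c) {
        c.n += 1;   // putfield on an object the caller can see: impure
        return c.n;
    }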
Exceptions

Exceptions are not a big problem. The Java compiler does not allow calling methods that throw checked exceptions unless the call is guarded by a try-catch statement (or the exception is declared by the caller). Hence, if the Frege programmer untruthfully asserts that a Java function does not throw checked exceptions, the Frege compiler will simply produce a Java source that will be rejected by the Java compiler.
It is no problem to check whether a method throws checked exceptions, so we can just as well do that in the Frege compiler.
Regarding unchecked exceptions, one could argue that there must be a reason they are unchecked: if they are not caught, the program will terminate due to a severe error that couldn't have been handled anyway. However, it should be quite easy to check whether a function could possibly throw an exception and warn the programmer, who might be unaware of this fact. Or there could be a pedantic mode that flags this as an error.
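To recall the mechanism the Java compiler enforces - a small sketch, not generated code:

    import java.io.FileReader;
    import java.io.IOException;

    class Guarded {
        // without the try-catch (or a throws clause), javac rejects this call,
        // because FileReader's constructor and read() declare IOException
        static int firstChar(String path) {
            try {
                return new FileReader(path).read();
            } catch (IOException e) {
                return -1;
            }
        }
    }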

Null references

The Frege type system cannot protect the programmer from committing errors that are equivalent to dereferencing a null pointer in Java. Yet it can detect such cases and will emit warnings. A warning-free Frege program will thus never encounter a pattern match failure.
Internally, null references simply do not exist and cannot be created through Frege code. But a native method whose declaration is counterfactual in this regard could return null and hence compromise safety. Correct declaration of native methods that return objects is thus absolutely critical.
I am in favor of being very conservative here: if it is not absolutely evident that the method can never return null, a declaration with a non-Maybe return type will be rejected.

What if we get it wrong?

A wrong assertion about (unchecked) exceptions has the least impact. The program may terminate unexpectedly, but then ultimately by design. Nevertheless, as said before, it would be nice to let the user of such a method know at compilation time.

Funny things may happen if one manages to declare a function with side effects as pure. The compiler may apply all optimizations that are appropriate for really pure functions as it sees fit. In most cases it is just not predictable what will happen. Consider the following:

    native print System.out.print :: String -> ()
    main _ = fst (let ha = print "ho" in (ha,ha))

The output might be "hohoho", "hoho", "ho" or nothing at all!
There will also be harmless cases such as:

    native getenv System.getenv :: String -> Maybe String

If one assumes an operating system where the process environment is read-only, this function is actually pure.
Anyway, it is generally a bad idea to assert purity where there is none: the semantics of the resulting program are undefined.

The worst thing is the possible smuggling of null values into the realm of the otherwise null-value-free Frege code. As always, the dereferencing of the null may happen hours after it has been stored in some data structure, and in totally unrelated parts of the program. It is cold comfort that in Frege one needs to look only at suspicious native functions to find the error, while in a normal Java program the error could be literally anywhere.

Friday, December 10, 2010

Possible solutions for problems with native functions

In my last post I described the problems that occur in the presence of native functions whose alleged properties are not checked by the compiler.

I can think of the following approaches to deal with those problems:
1. Do not allow native functions at all - clearly unacceptable.
2. All native functions must have a return type like IO (Exception (Maybe t)). This would be safe, but very troublesome: most, if not all, Frege code would be in the I/O monad.
3. Like 2., but maintain a database of pure Java methods so that the compiler knows them and allows less restrictive return types. A nightmare.
4. Refuse to accept native declarations that are provably wrong.
5. Refuse to accept native declarations with potential problems, but provide a way - not necessarily an easy one! - to force the compiler to accept them.
6. Warn the programmer about potential problems, but trust him anyway and accept the native declaration.
7. Trust the programmer no matter what.

Scenarios 1 and 3 are out of the question for obvious reasons.

Scenario 2 is the safest of all; nothing can go wrong. However, its adoption would make Frege almost unusable. One could think, however, of a "scenario 2 light" where functions defined in the Frege base libraries would be exempt from the harsh rule. But I do not like that either, because I know that there are many useful Java functions out there; it's nonsensical to make their use clumsy.

Scenario 4 is definitely the way to go. Often one can read from the type alone that a function cannot be pure. For example, a function with no arguments is either absolutely trivial or it cannot be pure: unless it computes the same result all the time, it uses global state or does input. One example is System.nanoTime().
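The argument in code form - a trivial sketch:

    class NotPure {
        static void demo() {
            long t1 = System.nanoTime();
            long t2 = System.nanoTime();
            // almost always t1 != t2: the result cannot be a function of
            // the (empty) argument list alone, so nanoTime reads global state
            System.out.println(t1 != t2);
        }
    }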

Likewise, a void function is either absolutely trivial, like

    public void doNothingUseful() { return; }

or it exists to encapsulate side effects like setting a field in an object or printing a newline on the screen. In either case, it is wise to demand that such a function live in the IO monad.
A side note: when talking about such matters, it must be understood that, due to the architecture of Java and the JVM, it will never be possible to prevent abuse or to guarantee safe execution. For example, one can always deliberately fool the type system by replacing or modifying class files. By "fool the type system" I mean that run time errors become possible that the type system promised to prevent - such as calling a method with the wrong number of arguments or with arguments of the wrong type.
But while this is especially easy in Java (yes, there is a reason why there are dozens of Java "build tools" out there!), it is nothing that could not be done to other type-safe languages.
Unfortunately, not all impure functions are that easy to spot. I will devote another post to the question of how one could extend the knowledge about a Java function's behaviour a bit.
Scenarios 5 and 6 also assume that such knowledge exists.

Scenario 7 is what is currently done in Frege 2. I have already explained why this is unacceptable.

Unsafe native functions in Frege 2

Integration of Frege and Java is easy: one just declares a native function, for example

    native indexOf :: String -> Char -> Int

This makes an expression like

    indexOf "Frege" 'g'

well typed, and the compiler generates code like

    ("Frege".indexOf('g'))

This is all very nice, but there are some potential problems with Java code:

• it can return null
• it can throw exceptions
• it can have side effects, including, but not limited to, performing input/output

The problem is not that Java code can do such things; it is rather that one must declare them correctly to avoid compromising the safety of Frege code. Frege can handle all 3 cases, and combinations thereof, very well:

• if the Java method returns null, one declares the return type Maybe t instead of just t
• if the Java method throws exceptions, one declares the return type Exception t instead of just t
• if the Java method is not pure, one declares the return type IO t instead of just t

The compiler then generates appropriate wrapper functions or call sequences.
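For the Maybe case, a hedged sketch of what such a generated wrapper might amount to. TMaybe and all other names are invented for this illustration, not the real compiler output:

    // stand-in Maybe type, invented for this sketch
    abstract class TMaybe<A> {
        static <A> TMaybe<A> just(A a)  { return new Just<A>(a); }
        static <A> TMaybe<A> nothing()  { return new Nothing<A>(); }
        static final class Just<A> extends TMaybe<A> {
            final A value;
            Just(A value) { this.value = value; }
        }
        static final class Nothing<A> extends TMaybe<A> { }
    }

    class Wrappers {
        // e.g. for: native getenv System.getenv :: String -> Maybe String
        static TMaybe<String> getenv(String name) {
            String r = System.getenv(name);   // may be null
            if (r == null) return TMaybe.nothing();
            return TMaybe.just(r);
        }
    }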

In the presence of unrestricted native (i.e. Java) functions, whether the Frege code will be type safe or not depends on the good will and good knowledge of the programmer. It is (currently) quite easy to write a small program that tries to cast an Int to a Bool in Frege without getting even a warning from the compiler. This is clearly unacceptable.


Saturday, November 20, 2010

Frege 3.0

It's been a long time since I last posted, but I have not been idle since then, though some private affairs diverted my attention from programming.

I've come to the conclusion that the original idea of how Frege programs should work is not viable. Until now, any Java function (or method, if you like that better) could simply be made available in Frege through a native definition. If that method had side effects, the Frege programmer would take care. Or wouldn't he? And there was the do ... for ... construct that was transformed into nested case statements whose patterns were all strict. And, of course, as in the German proverb "Ist der Ruf erst ruiniert, lebt sich's gänzlich ungeniert." (Once reputation is ruined, one can live completely unembarrassed.), why not have mutable references?

This works well in small programs. But the compiler is a mess: imperative and functional parts are mixed like spaghetti in a bowl. Need access to the symbol table somewhere in the middle? No problem, we have a global value for that. Of course, it's not just a Symtab, it's a Var Symtab, a mutable reference cell to a symtab, because sometimes we need to change it, you know. And so on.
One day I contemplated what it would take to compile two source files in parallel, or to use the scanner in a Java GUI for syntax highlighting. It's just impossible as things stand.

There was but one way out of this calamity: monads. I had ignored monads previously because I felt I didn't understand them. But nowadays I think I do understand them, and, in principle, it's really no big deal. The IO monad, for instance, is just a way to write imperative code and, at the same time, separate that code from pure code with the help of the type system. The type system tells you whether your code may have side effects. That's great.

So I decided:
1. Every native function must have an IO return type by default. (If one is really, really sure, one can override this.)
2. No more mutable types in Frege, except arrays; but of course the mutating functions are in the IO monad.
3. The do-notation is for monadic code, like in Haskell.
4. Get rid of the anachronistic and superfluous while, break and continue.
While I was at it, some more changes came to mind:
1. Abandon record types. They had been somewhat clumsy in practice; it turned out that most of them lived as single arguments of some constructor, adding one more indirection. Of course, they had also become mutable in the meantime. Instead, constructor fields can now be named, and there is all sorts of fancy syntax to non-destructively update or change a value with fields. This turns out to be very nice.
2. Change the way instance method dispatch works. Currently, to use an instance function of some unknown type a, one must provide a value of type a, even when the function does not use this value. Then, at runtime, the correct function is looked up in a Map<String, Lambda> that is known to the type. This used to work most of the time, but for monads, where return has the type forall a m.Monad m => a -> m a, it essentially means that one cannot write many functions (sequence, for instance) that should work on any monad. Moreover, the concept will not work at all for multi-parameter classes (they will come in version 3.1, hopefully). Thus, we will switch to dictionary passing: every class constraint will correspond to a dictionary that must be passed by the caller. Of course, the compiler will arrange this; see the sketch after this list.
3. Implement usable higher order polymorphism; this means we need a forall keyword.
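To illustrate the dictionary-passing idea from point 2 in Java terms - a deliberately loose sketch, since Java lacks higher-kinded types, and all names are invented:

    import java.util.function.Function;

    // the dictionary for a "Monad m" constraint;
    // Object stands in for the monadic value "m a"
    interface MonadDict {
        Object pure(Object a);                               // return
        Object bind(Object ma, Function<Object, Object> f);  // >>=
    }

    class Dispatch {
        // the caller passes the dictionary explicitly; no witness value of
        // type a is needed just to find the right `return`
        static Object pureTwice(MonadDict m, Object a) {
            return m.bind(m.pure(a), x -> m.pure(x));
        }
    }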