Rants about Java and other internet technologies by Sam Pullara

Using JAX-RS with Protocol Buffers for high-performance REST APIs

One of the great things about the JAX-RS specification is that it is very extensible and adding new providers for different mime-types is very easy. One of the interesting binary protocols out there is Google Protocol Buffers. They are designed for high-performance systems and drastically reduce the amount of over-the-wire data and also the amount of CPU spent serializing and deserializing that data. There are other similar frameworks out there including Fast Infoset and Thrift. Extending JAX-RS to support those protocols is nearly identical so all of the ideas I’ll talk about are generally valid for those frameworks as well. The one limitation that we will table for now is that JAX-RS only works over HTTP and will not work for raw socket protocols and the high-performance aspect of protobufs is somewhat reduced by our dependency on the HTTP envelope. My assumption is that you have done your homework and know that message passing is your overriding bottleneck.

The first thing you will need to do to get started is to download and build Protocol Buffers. You can get the latest stable release from here. All the example code you will find in this blog post was developed against protobuf-2.0.3 and the JAX-RS 1.0 specification (using jersey-1.0.1) though I don’t expect the API to change very much going forward. Once you have protoc in your path you are ready to create your first JAX-RS / protobuf project.

The dependencies you will need to create the application are actually quite small. I use Maven (and IntelliJ 8.0) to do my development so that is how I’ll describe what you need. For running the application you’ll need these installed:

    <dependency>
      <groupId>com.sun.jersey</groupId>
      <artifactId>jersey-server</artifactId>
      <version>1.0.1</version>
    </dependency>
    <dependency>
      <groupId>com.sun.grizzly</groupId>
      <artifactId>grizzly-servlet-webserver</artifactId>
      <version>1.8.6.3</version>
    </dependency>
    <dependency>
      <groupId>com.google.protobuf</groupId>
      <artifactId>protobuf-java</artifactId>
      <version>2.0.3</version>
    </dependency>

Then to execute the tests that we will create to verify that things are working as expected you’ll need two additional test-time only dependencies:

    <dependency>
      <groupId>com.sun.jersey</groupId>
      <artifactId>jersey-client</artifactId>
      <version>1.0.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.5</version>
      <scope>test</scope>
    </dependency>

Not a huge set of dependencies on the surface but Maven does hide a lot of the complexity underneath — total is about 15 jars (mostly grizzly). The next step is to create a Protocol Buffer using their definition language. Instead of making one up myself, I’ll just use the one from their example, addressbook.proto:

package tutorial;

option java_package = "com.sampullara.jaxrsprotobuf.tutorial";
option java_outer_classname = "AddressBookProtos";

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}

message AddressBook {
  repeated Person person = 1;
}

A fairly simple data description, but it does touch on a lot of the features of Protocol Buffers including embedded messages, enums, repeated fields and their type system. Now let’s define a simple service that we want to get working using the extension SPI of JAX-RS. This service will have two methods: a GET method that returns a new instance of a Person, and a POST method that just reflects what is passed to it back to the caller unmodified. That will also let us do some round-trip testing. Here is the proposed service:

package com.sampullara.jaxrsprotobuf.tutorial;

import javax.ws.rs.*;

@Path("/person")
public class AddressBookService {
    @GET
    @Produces("application/x-protobuf")
    public AddressBookProtos.Person getPerson() {
        return AddressBookProtos.Person.newBuilder()
                .setId(1)
                .setName("Sam")
                .setEmail("sam@sampullara.com")
                .addPhone(AddressBookProtos.Person.PhoneNumber.newBuilder()
                        .setNumber("415-555-1212")
                        .setType(AddressBookProtos.Person.PhoneType.MOBILE)
                        .build())
                .build();
    }

    @POST
    @Consumes("application/x-protobuf")
    @Produces("application/x-protobuf")
    public AddressBookProtos.Person reflect(AddressBookProtos.Person person) {
        return person;
    }
}

For each of these methods we’ve restricted them to either consuming or producing content of type application/x-protobuf. When JAX-RS sees a request whose content matches that type, or a caller that accepts that type, these will be valid endpoints to satisfy those requests. Out of the box, Jersey includes readers and writers for a variety of types including form data, XML and JSON. It also provides a way to register new mime-type readers and writers with a very simple set of annotations on classes that implement either MessageBodyReader or MessageBodyWriter. The class that implements reading is very straightforward: first JAX-RS calls you back to see if you can read something, then it calls you to actually read it, passing you the stream of data. Here is the implementation:

    @Provider
    @Consumes("application/x-protobuf")
    public static class ProtobufMessageBodyReader implements MessageBodyReader<Message> {
        // Claim any type that is a Protocol Buffers Message
        public boolean isReadable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
            return Message.class.isAssignableFrom(type);
        }

        public Message readFrom(Class<Message> type, Type genericType, Annotation[] annotations,
                    MediaType mediaType, MultivaluedMap<String, String> httpHeaders,
                    InputStream entityStream) throws IOException, WebApplicationException {
            try {
                // Generated messages expose a static newBuilder(); look it up on the concrete type
                Method newBuilder = type.getMethod("newBuilder");
                // newBuilder is static, so there is no receiver object to pass
                GeneratedMessage.Builder builder = (GeneratedMessage.Builder) newBuilder.invoke(null);
                return builder.mergeFrom(entityStream).build();
            } catch (Exception e) {
                throw new WebApplicationException(e);
            }
        }
    }

This class either needs to be under a package that is registered to be scanned when the application starts, or it can be explicitly registered by extending Application. You’ll see in our Main method later that we use the former strategy. You’ll note that in order for us to instantiate a new Protocol Buffer builder we need to use reflection on the type that JAX-RS is expecting. I’ve convinced myself that’s the best way to do it, but please comment if you can think of a better way. If there were additional configuration information you needed to pass to the reader, you could annotate the methods with that information and receive it here in the annotations array.
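To see the reflective lookup in isolation, here is a minimal sketch that runs without protobuf on the classpath. The Greeting class and its Builder are hypothetical stand-ins for a generated message; only the getMethod/invoke calls are the point.

```java
import java.lang.reflect.Method;

public class StaticFactoryReflection {
    /** Hypothetical stand-in for a generated message class; not part of protobuf. */
    public static class Greeting {
        private final String text;

        private Greeting(String text) { this.text = text; }

        public String getText() { return text; }

        /** Static factory, shaped like the newBuilder() on generated messages. */
        public static Builder newBuilder() { return new Builder(); }

        public static class Builder {
            private String text = "";

            public Builder setText(String text) { this.text = text; return this; }

            public Greeting build() { return new Greeting(text); }
        }
    }

    /** Finds the public static no-arg newBuilder() on the given class and calls it. */
    public static Object newBuilderFor(Class<?> type) {
        try {
            Method newBuilder = type.getMethod("newBuilder");
            // The receiver argument is ignored for static methods, so null is passed.
            return newBuilder.invoke(null);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No usable newBuilder() on " + type, e);
        }
    }

    public static void main(String[] args) {
        Greeting.Builder builder = (Greeting.Builder) newBuilderFor(Greeting.class);
        System.out.println(builder.setText("hello").build().getText());
    }
}
```

The cast is the weak spot: the compiler cannot check that the class really has a compatible builder, so a mismatch only shows up at runtime.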

The writer is a bit more complicated because in addition to the isWriteable and writeTo methods you have to be able to return the size that you are going to write. I was hoping for a quick way to get the serialized size of an object without writing it, but I didn’t find one, so instead I actually do the write in getSize and temporarily store the result in a weak map. In the future I’d like to see streaming better supported. Here is how I implemented the writer:

    @Provider
    @Produces("application/x-protobuf")
    public static class ProtobufMessageBodyWriter implements MessageBodyWriter<Message> {
        public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
            return Message.class.isAssignableFrom(type);
        }

        // Provider instances are typically shared across requests, so the map must tolerate concurrent access
        private final Map<Object, byte[]> buffer =
                Collections.synchronizedMap(new WeakHashMap<Object, byte[]>());

        public long getSize(Message m, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
            // Serialize once here and hold on to the bytes until writeTo is called
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try {
                m.writeTo(baos);
            } catch (IOException e) {
                return -1;
            }
            byte[] bytes = baos.toByteArray();
            buffer.put(m, bytes);
            return bytes.length;
        }

        public void writeTo(Message m, Class<?> type, Type genericType, Annotation[] annotations,
                    MediaType mediaType, MultivaluedMap<String, Object> httpHeaders,
                    OutputStream entityStream) throws IOException, WebApplicationException {
            byte[] bytes = buffer.remove(m);
            if (bytes == null) {
                // No buffered copy (getSize was skipped or failed), so serialize directly
                m.writeTo(entityStream);
            } else {
                entityStream.write(bytes);
            }
        }
    }
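One possible way around the temporary map: current protobuf-java documents a getSerializedSize() method on messages, and if the version you build against exposes it, getSize can ask the message for its length and writeTo can stream directly. The sketch below shows only the shape of that idea. SizedMessage is a hypothetical stand-in interface so the example runs without protobuf, and whether 2.0.3 behaves the same way is an assumption worth checking.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

public class SizeThenStream {
    /** Hypothetical stand-in for the two message methods this sketch relies on. */
    public interface SizedMessage {
        int getSerializedSize();

        void writeTo(OutputStream out) throws IOException;
    }

    /** Mirrors the writer's getSize: ask the message, do not serialize. */
    public static long getSize(SizedMessage m) {
        return m.getSerializedSize();
    }

    /** Mirrors the writer's writeTo: stream straight to the entity stream. */
    public static void writeTo(SizedMessage m, OutputStream entityStream) throws IOException {
        m.writeTo(entityStream);
    }

    /** Toy message backed by a byte array, for demonstration only. */
    public static SizedMessage of(final byte[] payload) {
        return new SizedMessage() {
            public int getSerializedSize() { return payload.length; }

            public void writeTo(OutputStream out) throws IOException { out.write(payload); }
        };
    }

    /** Convenience for the demo: collect what writeTo produces. */
    public static byte[] toBytes(SizedMessage m) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeTo(m, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        SizedMessage m = of(new byte[] {1, 2, 3});
        // The announced size and the bytes actually written should agree
        System.out.println(getSize(m) == toBytes(m).length);
    }
}
```

With this shape there is no state shared between the two calls, so nothing needs to be cached or synchronized.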

I’d love to get around the non-streaming limitation in this integration so if you have any ideas, send them my way. Now we also need to generate the code from the Protocol Buffer definition file. I again use Maven to do that with this additional stanza:

      <plugin>
        <artifactId>maven-antrun-plugin</artifactId>
        <executions>
          <execution>
            <id>generate-sources</id>
            <phase>generate-sources</phase>
            <configuration>
              <tasks>
                <mkdir dir='target/generated-sources' />
                <exec executable='protoc'>
                  <arg value='--java_out=target/generated-sources' />
                  <arg value='src/main/resources/addressbook.proto' />
                </exec>
              </tasks>
              <sourceRoot>target/generated-sources</sourceRoot>
            </configuration>
            <goals>
              <goal>run</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

That should now be enough to build the service itself along with the message readers and writers. The last thing to do on the production side is to show how you would deploy this using the Grizzly container:

public class Main {
    public static final URI BASE_URI = UriBuilder.fromUri("http://localhost/").port(9998).build();

    public static void main(String[] args) throws IOException {
        System.out.println("Starting grizzly...");
        URI uri = BASE_URI;
        SelectorThread threadSelector = createServer(uri);
        System.out.println(String.format("Try out %sperson\nHit enter to stop it...", uri));
        System.in.read();
        threadSelector.stopEndpoint();
    }

    public static SelectorThread createServer(URI uri) throws IOException {
        Map<String, String> initParams = new HashMap<String, String>();
        initParams.put("com.sun.jersey.config.property.packages", "com.sampullara");
        return GrizzlyWebContainerFactory.create(uri, initParams);
    }
}

Jersey+Grizzly makes it very easy to instantiate a new servlet container at a particular URI and immediately access the REST services that you have deployed. For testing, it is nice to be able to bring up an actual environment so easily. In our tests we are also going to make use of the REST client that is included with Jersey so that you can see the serialization on both sides of the wire. In order to get the server up and running during the test we need to implement setUp() and tearDown():

    private SelectorThread threadSelector;
    private WebResource r;

    @Override
    protected void setUp() throws Exception {
        super.setUp();

        //start the Grizzly web container and create the client
        threadSelector = Main.createServer(Main.BASE_URI);

        ClientConfig cc = new DefaultClientConfig();
        cc.getClasses().add(ProtobufProviders.ProtobufMessageBodyReader.class);
        cc.getClasses().add(ProtobufProviders.ProtobufMessageBodyWriter.class);
        Client c = Client.create(cc);
        r = c.resource(Main.BASE_URI);
    }

    @Override
    protected void tearDown() throws Exception {
        super.tearDown();
        threadSelector.stopEndpoint();
    }


The client doesn’t have the special class scanning capability so we directly register our providers with the client and point it at the same URI that the server is running on. Being able to control those in your tests makes integration tests far easier as you don’t have to worry about mismatched configurations. The first tests we will run will be using the Jersey client:

    public void testUsingJerseyClient() {
        WebResource wr = r.path("person");
        AddressBookProtos.Person p = wr.get(AddressBookProtos.Person.class);
        assertEquals("Sam", p.getName());

        AddressBookProtos.Person p2 = wr.type("application/x-protobuf").post(AddressBookProtos.Person.class, p);
        assertEquals(p, p2);
    }

Notice how you can build up a web resource incrementally adding additional constraints or paths to it until ultimately you call one of the HTTP methods on that resource. We also see that using that client API we get typed access to the REST server. Slightly more complicated is another test using direct HTTP connections:

    public void testUsingURLConnection() throws IOException {
        AddressBookProtos.Person person;
        {
            URL url = new URL("http://localhost:9998/person");
            URLConnection urlc = url.openConnection();
            urlc.setDoInput(true);
            urlc.setRequestProperty("Accept", "application/x-protobuf");
            person = AddressBookProtos.Person.newBuilder().mergeFrom(urlc.getInputStream()).build();
            assertEquals("Sam", person.getName());
        }
        {
            URL url = new URL("http://localhost:9998/person");
            HttpURLConnection urlc = (HttpURLConnection) url.openConnection();
            urlc.setDoInput(true);
            urlc.setDoOutput(true);
            urlc.setRequestMethod("POST");
            urlc.setRequestProperty("Accept", "application/x-protobuf");
            urlc.setRequestProperty("Content-Type", "application/x-protobuf");
            person.writeTo(urlc.getOutputStream());
            AddressBookProtos.Person person2 = AddressBookProtos.Person.newBuilder().mergeFrom(urlc.getInputStream()).build();
            assertEquals(person, person2);
        }
    }

This code looks more like what a non-Java client might do to access your REST service and deserialize the information using their Protocol Buffers. In fact, why don’t we try this with some Python 2.5 code:

import urllib
import addressbook_pb2

f = urllib.urlopen("http://localhost:9998/person")
person = addressbook_pb2.Person()
person.ParseFromString(f.read())
print person.name

Works great and outputs “Sam” as expected. Very fast but still interoperable between multiple languages in a type-safe way. Once Thrift is further along I will likely make the same sort of interoperability possible.

For those that just want to open up the final product and see how it all works, here is a link to download it. You’ll also note that I actually use graven under the covers to do my builds as Maven’s XML is a little too verbose for me.

Build your own mail analyzer for Mac Mail.app

You’ve probably read about things like Xoopit and Xobni for analyzing both online mail and your Outlook mail. As it turns out, Apple has done something great in this regard that I think has been mostly overlooked. Mail.app stores all of the meta-data for your email in a file called ~/Library/Mail/Envelope Index. You might wonder what the format of this file is… well, it is a SQLite3 database. The contents are pretty easy to see; go to the terminal and type:

macpro:~ sam$ sqlite3 ~/Library/Mail/Envelope\ Index
SQLite version 3.6.3
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite>

Everything about your mailboxes is stored within this database, and the structure of the database is normalized so it’s very easy to navigate. The tables of most interest for mail analysis are:

sqlite> .tables
addresses              mailboxes              todo_notes
alarms                 messages               todos
associations           properties             todos_deleted_log
attachments            recipients             todos_server_snapshot
calendars              subjects
feeds                  threads

Fortunately, accessing a SQLite database is quite easy from just about any language that you decide to use. I’m just going to do all the queries in straight sqlite3 rather than a language, but they could be embedded in your application. First things first, copy your Envelope Index to another directory:

macpro:tmp sam$ cp ~/Library/Mail/Envelope\ Index .

Now you can use that database without worrying about messing up the locking or corrupting data while Mail.app is using it. Since we might as well do an example that is interesting rather than merely educational, how about we answer the question: “Who are the coworkers with whom I collaborated?” This is going to be a multi-query process to extract the information — there may be more efficient ways to do it — but think of this as instructive rather than prescriptive. First I need to limit the query to only those mailboxes which contain work email:

DROP TABLE IF EXISTS coworkermailboxes;
CREATE TABLE coworkermailboxes(id);
CREATE INDEX coworkermailboxes_index ON coworkermailboxes(id);
INSERT INTO coworkermailboxes SELECT rowid FROM mailboxes
WHERE
url like 'imap://samp@snv-webmail.corp.yahoo.com/%' OR
url like 'imap://sam@mail.sampullara.com/Yahoo%20Inc%20Archive';

That gives us a table with several mailboxes that I have in Mail.app including Sent Messages. You should peruse your own list of mailboxes to ensure that you are grabbing all the correct information. For me, I also had to search my archives. Now I am going to take a series of steps to get to the final output by iteratively processing successive tables of information. The first table is a list of those people with whom you have both sent and received email directly (they were the sender and you were a receiver, and you were the sender and they were a receiver):

DROP TABLE IF EXISTS coworkers;
CREATE TABLE coworkers(id);
CREATE INDEX coworkers_index ON coworkers(id);
INSERT INTO coworkers SELECT a.rowid FROM addresses a, messages m, recipients r
WHERE
m.sender = a.rowid AND
m.mailbox IN (SELECT id FROM coworkermailboxes) AND
r.message_id = m.rowid AND r.address_id = 4
INTERSECT
SELECT a.rowid FROM addresses a, messages m, recipients r
WHERE
m.sender = 4 AND
m.mailbox IN (SELECT id FROM coworkermailboxes) AND
r.message_id = m.rowid AND r.address_id = a.rowid
;

Note that I have directly inserted my own addresses rowid (4) into this query, as the receiver on the one hand and the sender on the other. The next step is to count the actual number of emails you have received from each of those on the list:

DROP TABLE IF EXISTS coworkers2;
CREATE TABLE coworkers2(id, recv);
CREATE INDEX coworkers2_index ON coworkers2(id);
SELECT 'Get the received mail';
INSERT INTO coworkers2 SELECT w.id, COUNT(*) FROM messages m, recipients r, coworkers w
WHERE m.sender = w.id AND
m.mailbox IN (SELECT id FROM coworkermailboxes) AND
r.message_id = m.rowid AND r.address_id = 4
GROUP BY w.id ORDER BY COUNT(*)
;

Finally, we count the number of emails you sent to each of them and also derive a ratio of sent/received so we can judge how collaborative the exchanges have been:

DROP TABLE IF EXISTS coworkers3;
CREATE TABLE coworkers3(id, sent float, recv float, ratio float);
CREATE INDEX coworkers3_index ON coworkers3(id);
SELECT 'Get the sent mail';
INSERT INTO coworkers3 SELECT w.id, COUNT(*), w.recv, COUNT(*)*1.0/w.recv FROM messages m, recipients r, coworkers2 w
WHERE
m.sender = 4 AND
m.mailbox IN (SELECT id FROM coworkermailboxes) AND
r.message_id = m.rowid AND r.address_id = w.id
GROUP BY w.id ORDER BY COUNT(*)
;

You will now have a table named coworkers3 that can be mined for information about your level of correspondence with them. For example, here is a way to find relatively equal sends and receives:

SELECT a.comment FROM addresses a, coworkers3 w
WHERE
a.rowid = w.id AND
ratio >= .5 AND
ratio <=2 AND
sent > 10
ORDER BY sent
LIMIT 20;

When I do this I see the people that either I use to find information or that use me to find information. Every interaction is usually a request and then a response. On the other hand, this query will find those that typically made announcements out to the groups that I also worked with:

SELECT a.comment FROM addresses a, coworkers3 w
WHERE
a.rowid = w.id AND
ratio <= 1 AND
sent > 10
ORDER BY ratio
LIMIT 20;
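To make the two filters concrete outside of SQL, here is a small Java sketch of the same predicates over the sent and recv columns of coworkers3. The class and method names are hypothetical, and the counts in main are made up for illustration rather than taken from any mailbox.

```java
public class CorrespondenceFilters {
    /** Same arithmetic as the coworkers3 insert: COUNT(*)*1.0/recv. */
    public static double ratio(long sent, long recv) {
        if (recv <= 0) {
            throw new IllegalArgumentException("recv must be positive");
        }
        return sent * 1.0 / recv;
    }

    /** First query: ratio between .5 and 2 inclusive, and sent > 10. */
    public static boolean isBalanced(long sent, long recv) {
        double r = ratio(sent, recv);
        return r >= 0.5 && r <= 2.0 && sent > 10;
    }

    /** Second query: ratio <= 1 and sent > 10. */
    public static boolean isLowRatio(long sent, long recv) {
        return ratio(sent, recv) <= 1.0 && sent > 10;
    }

    public static void main(String[] args) {
        // Made-up counts, purely for illustration
        System.out.println(isBalanced(20, 25)); // ratio 0.8
        System.out.println(isLowRatio(12, 40)); // ratio 0.3
        System.out.println(isBalanced(12, 40)); // same row against the first filter
    }
}
```

Note that the two filters overlap: any row with a ratio between .5 and 1 satisfies both, so the queries give two views of the data rather than a partition of it.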

And so on. Adding more filters on top of this you could easily derive your team at work for a particular time period and other insights. With the wealth of information contained in this meta-data store you could figure out all kinds of things:

  • Who sent you an email that you didn’t reply to yet?
  • Who do you respond to the most quickly?
  • Who responds to you most quickly?
  • What are your and your coworkers’ approximate working hours?
  • What groups of CCs could be made into aliases?

There really is no limit to how far the analysis could go. Ideally, it would be possible to set up a dashboard in Mail.app that lets you slice the data in a far more precise way than smart folders currently allow. Maybe they should come out with super-sql-smart folders!

Using JAX-RS (Jersey) to build a JPA/JAXB-backed JSON REST API

Building applications for deployment to the web has evolved over the last several years to be focused on dynamic behavior, separation of model/view/controller, and simplified but scalable configuration and deployment.  From a performance, tools and library perspective I’m still highly biased to development in Java over more up-and-coming languages.  However, much has been learned in the Java community from the better frameworks like Rails and those lessons should not be ignored.

I’ve been looking for a while though to find that perfect combination of frameworks and libraries that would give me the expressive power that I want for building web applications.  There have been many contenders from JRuby on Rails, to Grails, to Seam and even just writing everything myself.  Ultimately, I believe in the DRY principle (like Rails), though I don’t think many frameworks go far enough when dealing with the database.  When you are building a web application it is rare that you are going to change what database you are using.  In fact, the majority of your scaling architecture is likely highly dependent on how you store your data.  This is why I prefer an application framework that allows me to start with the database and construct my application’s data object model from it.

So what are my acceptance criteria for this über-framework?

  • Great object-relational mapping tool that works well with MySQL + PostgreSQL
  • Excellent support for consuming and producing XML and JSON that integrates well with the data objects that the ORM tool uses
  • Supports writing MVC applications naturally
  • Support for building REST APIs with arbitrary URL mapping to service parameters
  • High straight-line performance with the ability to scale up servers
  • Great defaults that make configuration mostly unnecessary with simple deployment
  • State-of-the-art IDE support. I don’t like to type anymore nor memorize APIs.
  • Suitable for quick prototyping and production applications
  • Support for templating views of any output type (HTML, XML, etc)
  • Easy to unit and integration test
  • Open source

Certainly a high bar, but I think I have finally found one that is a very strong contender. Amazingly, it is even coming out of the JSR standards process with a nice layer of open source on top of it.  JSR-311 was started to develop an API for providing support for RESTful Web Services in the Java Platform.  Not only does it do that nicely, but it also has the right hooks for simple dependency injection, is orthogonal to JPA (my favorite ORM), supports both XML and JSON natively, and, except in unusual circumstances, is very DRY.

Because it is in Java and works well with JPA it satisfies a large number of my requirements before we even look at what it offers.  Another aspect of it that didn’t make the above list is that the production quality reference implementation is available as a couple of dependencies in Maven making it very easy to work with.  It also works well deployed within lightweight containers like Grizzly, heavier ones like Tomcat and Glassfish, and the REST APIs it creates can even be directly tested without any container at all. There are some things that Jersey supports that are non-standard that I think are excellent additions to the framework and should likely make it into future versions including support for templating (like JSP and Freemarker) that help it satisfy my requirements.

To give you an example of how terse the API can be, here is the simplest example that includes deployment as an operating web service:

public class Main {
    @Path("/helloworld")
    public static class HelloWorldResource {
        @GET
        @Produces("text/plain")
        public String getClichedMessage() {
            return "Hello World";
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> initParams = new HashMap<String, String>();
        initParams.put("com.sun.jersey.config.property.packages", "com.sun.jersey.samples.helloworld");

        System.out.println("Starting grizzly...");
        URI uri = UriBuilder.fromUri("http://localhost/").port(9998).build();
        SelectorThread threadSelector = GrizzlyWebContainerFactory.create(uri, initParams);
        System.out.println(String.format("Try out %shelloworld\nHit enter to stop it...", uri));
        System.in.read();
        threadSelector.stopEndpoint();
    }
}

The @Path annotation lets you use URI path templates to specify the matching paths and path parameters to your REST service. You can produce any set of content-types and content negotiation will be done for you based on the incoming request. Exceptions can be mapped directly to error responses. Query, Matrix, Path, Header and Cookie parameters are all supported and automatically injected based on annotations. Here is a more sophisticated example from an application I am writing:

    @GET
    @Produces("application/json")
    @Path("/network/{id: [0-9]+}/{nid}")