
Chapter 13. Programming for the Web

When you think about the Web, you probably think of applications—web browsers, web servers—and the many kinds of content that those applications move around the network. But it's important to note that standards and protocols, not the applications themselves, have enabled the Web's growth. Since the earliest days of the Internet, there have been ways to move files from here to there, and document formats that were just as powerful as HTML, but there was not a unifying model for how to identify, retrieve, and display information nor was there a universal way for applications to interact with that data over the network. Since the web explosion began, HTML has reigned supreme as a common format for documents, and most developers have at least some familiarity with it. In this chapter, we're going to talk a bit about its cousin, HTTP, the protocol that handles communications between web clients and servers, and URLs, which provide a standard for naming and addressing objects on the Web. Java provides a very simple API for working with URLs to address objects on the Web. We'll discuss how to write web clients that can interact with the servers using the HTTP GET and POST methods. In Chapter 14, we'll take a look at servlets, simple Java programs that run on web servers and implement the other side of these conversations.

13.1 Uniform Resource Locators (URLs)

A URL points to an object on the Internet.[1] It's a text string that identifies an item, tells you where to find it, and specifies a method for communicating with it or retrieving it from its source. A URL can refer to any kind of information source. It might point to static data, such as a file on a local filesystem, a web server, or an FTP archive; or it can point to a more dynamic object such as a news article on a news spool or a record in a database. URLs can even refer to less tangible resources such as Telnet sessions and mailing addresses.

A URL is usually presented as a string of text, like an address. Since there are many different ways to locate an item on the Net, and different mediums and transports require different kinds of information, there are different formats for different kinds of URLs. The most common form has four components: a network host or server, the name of the item, its location on that host, and a protocol by which the host should communicate:

protocol://hostname/path/item-name

protocol (also called the "scheme") is an identifier such as http, ftp, or gopher; hostname is an Internet hostname; and the path and item components form a unique path that identifies the object on that host. Variants of this form allow extra information to be packed into the URL, specifying, for example, port numbers for the communications protocol and fragment identifiers that reference parts inside the object.

We sometimes speak of a URL that is relative to another URL, called a base URL. In that case we are using the base URL as a starting point and supplying additional information. For example, the base URL might point to a directory on a web server; a relative URL might name a particular file in that directory.
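As a minimal sketch of this idea in Java, the two-argument constructor of java.net.URL (the class introduced in the next section) resolves a relative specification against a base URL. The class name RelativeUrlDemo and the helper resolve() are illustrative only, and newer JDKs mark the URL constructors as deprecated, so expect a compiler warning there.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RelativeUrlDemo {
    // Resolve a relative specification against a base URL string.
    static URL resolve(String base, String relative)
            throws MalformedURLException {
        URL baseUrl = new URL(base);
        return new URL(baseUrl, relative);
    }

    public static void main(String[] args) throws MalformedURLException {
        // The base names a directory (note the trailing slash);
        // the relative URL names a file in that directory.
        System.out.println(
            resolve("http://foo.bar.com/documents/", "homepage.html"));
        // prints http://foo.bar.com/documents/homepage.html
    }
}
```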

13.2 The URL Class

Bringing this down to a more concrete level is the Java URL class. The URL class represents a URL address and provides a simple API for accessing web resources, such as documents and applications on servers. It uses an extensible set of protocol and content handlers to perform the necessary communication and even data conversion. With the URL class, an application can open a connection to a server on the network and retrieve content with just a few lines of code. As new types of servers and new formats for content evolve, additional URL handlers can be supplied to retrieve and interpret the data without modifying your applications.

A URL is represented by an instance of the java.net.URL class. A URL object manages all the component information within a URL string and provides methods for retrieving the object it identifies. We can construct a URL object from a URL specification string or from its component parts:

try {  
    URL aDoc =
      new URL( "http://foo.bar.com/documents/homepage.html" );
    // the file component needs its leading slash to name the same resource
    URL sameDoc =
      new URL("http","foo.bar.com","/documents/homepage.html");
}   
catch ( MalformedURLException e ) { }

These two URL objects point to the same network resource, the homepage.html document on the server foo.bar.com. Whether the resource actually exists and is available isn't known until we try to access it. When initially constructed, the URL object contains only data about the object's location and how to access it. No connection to the server has been made. We can examine the URL's components with the getProtocol(), getHost(), and getFile() methods. We can also compare it to another URL with the sameFile() method (which has an unfortunate name for something which may not point to a file). sameFile() determines whether two URLs point to the same resource. It can be fooled, but sameFile() does more than compare the URLs for equality; it takes into account the possibility that one server may have several names, and other factors. (It doesn't go as far as to fetch the resources and compare them, however.)
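The accessor methods can be tried without touching the network. The following sketch, with an illustrative class name UrlParts and a made-up port and fragment added to the chapter's example URL, prints the components that the URL object has parsed out of the string.

```java
import java.net.MalformedURLException;
import java.net.URL;

class UrlParts {
    // Summarize the components of a URL; no connection is opened here.
    static String describe(URL url) {
        return "protocol=" + url.getProtocol()
            + " host=" + url.getHost()
            + " port=" + url.getPort()   // -1 if the URL names no port
            + " file=" + url.getFile()
            + " ref=" + url.getRef();    // null if there is no fragment
    }

    public static void main(String[] args) throws MalformedURLException {
        URL url = new URL(
            "http://foo.bar.com:8080/documents/homepage.html#intro");
        System.out.println(describe(url));
    }
}
```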

When a URL is created, its specification is parsed to identify just the protocol component. If the protocol doesn't make sense, or if Java can't find a protocol handler for it, the URL constructor throws a MalformedURLException. A protocol handler is a Java class that implements the communications protocol for accessing the URL resource. For example, given an http URL, Java prepares to use the HTTP protocol handler to retrieve documents from the specified server.
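A small sketch of this behavior follows. The class name ProtocolCheck, the helper canConstruct(), and the scheme nosuchprotocol are all made up for illustration. Note that MalformedURLException can also signal other problems with the string, so a false result does not by itself prove that a handler is missing.

```java
import java.net.MalformedURLException;
import java.net.URL;

class ProtocolCheck {
    // Returns true if a URL object can be constructed from the string.
    static boolean canConstruct(String spec) {
        try {
            new URL(spec);   // parses the protocol and looks up a handler
            return true;
        } catch (MalformedURLException e) {
            return false;    // e.g., no handler for the protocol
        }
    }

    public static void main(String[] args) {
        System.out.println(canConstruct("http://foo.bar.com/"));
        System.out.println(canConstruct("nosuchprotocol://foo.bar.com/"));
    }
}
```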

13.2.1 Stream Data

The lowest level and most general way to get data back from a URL is to ask for an InputStream from the URL by calling openStream(). Getting the data as a stream may also be useful if you want to receive continuous updates from a dynamic information source. The drawback is that you have to parse the contents of the byte stream yourself. Not all types of URLs support the openStream() method because not all types of URLs refer to concrete data; you'll get an UnknownServiceException if the URL doesn't.

The following code prints the contents of an HTML file:

try {  
    URL url = new URL("http://server/index.html");  
 
    BufferedReader bin = new BufferedReader ( 
        new InputStreamReader( url.openStream( ) )); 
 
    String line; 
    while ( (line = bin.readLine( )) != null )   
        System.out.println( line ); 
} catch (Exception e) { }

We ask for an InputStream with openStream() and wrap it in a BufferedReader to read the lines of text. Because we specify the http protocol in the URL, we enlist the services of an HTTP protocol handler. As we'll discuss later, that raises some questions about what kinds of handlers are available. This example partially works around those issues because no content handler (only the protocol handler) is involved; we read the data and interpret the content ourselves, by simply printing it.
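On current versions of Java, the same loop might be written so that the stream is always closed and the character encoding is explicit. This is only a sketch: the class name UrlLines is illustrative, and it assumes the content is UTF-8 text, which the server may or may not actually send.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class UrlLines {
    // Read all lines from a URL, closing the stream when done.
    static List<String> readLines(URL url) throws IOException {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(),
                                      StandardCharsets.UTF_8))) {
            List<String> lines = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass any URL with a stream-capable handler, e.g. an http: or file: URL.
        for (String line : readLines(new URL(args[0]))) {
            System.out.println(line);
        }
    }
}
```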

One note about applets. In the applet environment, you typically have additional security restrictions that limit the URLs with which you may communicate. To be sure that you can access the specified URL and use the correct protocol handler, you should construct URLs relative to the base URL that identifies the applet's codebase (the location of the applet code). This ensures that any data you load comes via the same protocol and from the same server as your applet itself. For example:

new URL( getCodeBase( ), "foo/bar.gif" );

Alternatively, if you are just trying to get data files or media associated with an applet, there is a more general way; see the discussion of getResource() in Chapter 11.

13.2.2 Getting the Content as an Object

As we said previously, reading content from a stream is the most general mechanism for accessing data over the Web. openStream() leaves the parsing of data up to you. The URL class supports a more sophisticated, pluggable, content-handling mechanism that we'll discuss now, but be aware that this is not widely used because of lack of standardization and limitations in how you can deploy new handlers. Consider this section to be mainly for educational purposes.

When Java knows the type of content being retrieved from a URL, and a proper content handler is available (installed), you can retrieve the item the URL addresses as a native Java object by calling the URL's getContent() method. In this mode of operation, getContent() initiates a connection to the host, fetches the data for you, determines the Multipurpose Internet Mail Extensions (MIME) type of the contents, and invokes a content handler to turn the bytes into a Java object. (It acts just as if you had read a serialized Java object, as in Chapter 12.) MIME is the standard developed to facilitate multimedia email, but it has become widely used as a general way to specify how to treat data. Java uses MIME to help it pick the right content handler. This sounds good, but generally requires that you supply the correct handlers with your application or install them in the Java runtime environment. Unfortunately, there is not a standard way to do this. (The HotJava web browser provides a mechanism for adding new handlers, but it is not a widely deployed browser, so that doesn't help us much in practical terms.)

For example, given the URL http://foo.bar.com/index.html, a call to getContent() uses the HTTP protocol handler to retrieve data and an HTML content handler to turn the data into an appropriate document object. A URL that points to a plain-text file might use a text-content handler that returns a String object. Similarly, a GIF file might be turned into an ImageProducer object using a GIF content handler. If we access the GIF file using an FTP URL, Java uses the same content handler but uses the FTP protocol handler to receive the data.

getContent() returns the output of the content handler but leaves us wondering what kind of object we got. Since the content handler has to be able to return anything, the return type of getContent() is Object. In a moment, we'll describe how we could ask the protocol handler about the object's MIME type, which it discovered. Based on this, and whatever other knowledge we have about the kind of object we are expecting, we can cast the Object to its appropriate, more specific type. For example, if we expect a String, we'll cast the result of getContent() to a String:

try  { 
    String content = (String)myURL.getContent( );  
} catch ( ClassCastException e ) { ... }

Various kinds of errors can occur when trying to retrieve the data. For example, getContent() can throw an IOException if there is a communications error. Other kinds of errors can occur at the application level: some knowledge of how the application-specific content and protocol handlers deal with errors is necessary. One problem that could arise is that a content handler for the data's MIME type wouldn't be available. In this case, getContent() invokes a special "unknown type" handler that returns the data as a raw InputStream. A sophisticated application might interpret this behavior and try to decide what to do with the data on its own.
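One way an application might cope with this uncertainty is to test the type of the returned object before casting it, rather than relying on a ClassCastException. The sketch below is only illustrative: the class name ContentInspector and the helper describe() are made up, and which object a given handler actually returns depends on the Java implementation. In real use, the argument would come from a call such as myURL.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

class ContentInspector {
    // Decide what to do with whatever object getContent() handed back.
    static String describe(Object content) {
        if (content == null) {
            return "no content";
        }
        if (content instanceof String) {
            return "text: " + content;
        }
        if (content instanceof InputStream) {
            // Typical fallback when no content handler matched the type
            return "raw stream";
        }
        return "other object: " + content.getClass().getName();
    }

    public static void main(String[] args) {
        // Stand-ins for the results of getContent()
        System.out.println(describe("hello"));
        System.out.println(describe(new ByteArrayInputStream(new byte[0])));
    }
}
```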

In some situations, we may also need knowledge of the protocol handler. For example, consider a URL that refers to a nonexistent file on an HTTP server. When requested, the server returns the familiar "404 Not Found" message. To deal with protocol-specific operations like this, we may need to talk to the protocol handler, which we'll discuss next.

The openStream() and getContent() methods both implicitly create the connection to the remote URL object. When the connection is set up, the protocol handler is consulted to create a URLConnection object. The URLConnection manages the protocol-specific communications. We can get a URLConnection for our URL with the openConnection() method. One of the things we can do with the URLConnection is ask for the object's content type. For example:

URLConnection connection = myURL.openConnection( ); 
String mimeType = connection.getContentType( ); 
... 
Object contents = myURL.getContent( );

We can also get protocol-specific information. Different protocols provide different types of URLConnection objects. The HttpURLConnection object, for instance, can interpret the "404 Not Found" message and tell us about the problem. We'll talk more about the HttpURLConnection later in this chapter.
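As a rough sketch of what that looks like, the following code checks whether the connection is an HttpURLConnection and, if so, asks it for the HTTP response code. The class name HttpStatusCheck and the helpers classify() and probe() are illustrative names, not part of the Java API; the constants HTTP_OK and HTTP_NOT_FOUND are defined by HttpURLConnection.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLConnection;

class HttpStatusCheck {
    // Turn an HTTP response code into a short description.
    static String classify(int code) {
        if (code == HttpURLConnection.HTTP_OK) return "ok";
        if (code == HttpURLConnection.HTTP_NOT_FOUND) return "not found";
        if (code >= 400) return "error " + code;
        return "other " + code;
    }

    // Ask the protocol-specific connection what the server said.
    static String probe(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        if (!(connection instanceof HttpURLConnection)) {
            return "not an HTTP URL";
        }
        HttpURLConnection http = (HttpURLConnection) connection;
        try {
            return classify(http.getResponseCode()); // connects here
        } finally {
            http.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(probe(new URL(args[0])));
    }
}
```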

13.3 Handlers in Practice

The content- and protocol-handler mechanisms we've described are very flexible; to handle new types of URLs, you need only add the appropriate handler classes. One interesting application of this would be Java-based web browsers that could handle new and specialized kinds of URLs by downloading them over the Net. The idea for this was touted since the earliest days of Java. Unfortunately, it has never come to fruition. There is no API for dynamically downloading new content and protocol handlers. In fact, there is no standard API for determining what content and protocol handlers exist on a given platform. Although content and protocol handlers are part of the Java API and an intrinsic part of the mechanism for working with URLs, specific content and protocol handlers aren't defined. The standard Java classes don't, for example, include content handlers for HTML, GIF, MPEG, or other common data types. Sun's SDK and all of the other Java environments do come with these kinds of handlers, but these are installed on an application-level basis and not documented.

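There are two real issues here: there is no standard set of handlers that you can count on finding on every platform, and there is no standard way to install new ones.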

The HotJava web browser supports the content and protocol handler mechanism, and you can install handlers locally (as for all Java applications), but other web browsers such as Netscape and Internet Explorer do not directly support handlers at all. You can install them for use in your own (perhaps intranet-based) applets but you cannot use them to extend the capabilities of the browser. Netscape and Internet Explorer are currently classic monolithic applications: knowledge about certain kinds of objects, like HTML and GIF files, is built in. These browsers can be extended via a plug-in mechanism, which is a much less fine-grained and powerful approach than Java's handler mechanism. If you're writing applets for use in Netscape or Internet Explorer now, about all you can do is use the openStream() method to get a raw input stream from which to read data.

13.3.1 Other Handler Frameworks

The idea of dynamically downloadable handlers could also be applied to other kinds of handler-like components. For example, the Java XML community is fond of referring to XML as a way to apply semantics to documents and to Java as a portable way to supply the behavior that goes along with those semantics. It's possible that an XML viewer could be built with downloadable handlers for displaying XML tags.

The JavaBeans APIs also touch upon this subject with the Java Activation Framework. The JAF provides a way to detect the type of a stream of data and "encapsulate access to it" in a JavaBean. If this sounds suspiciously like the content handler's job, it is. Unfortunately, it looks like these APIs will not be merged and, outside of the JavaMail API, the JAF has not been widely used.

13.3.2 Writing Content and Protocol Handlers

Although content and protocol handlers are used fairly extensively in Java, they have not been leveraged very much by developers for their own applications. We discussed some of the reasons for this earlier. But, if you're adventurous and want to try leveraging content and protocol handlers in your own applications, you can find all the information you'll need in Appendix A, which covers creating and installing your own handlers.

13.3.3 Talking to Web Applications

Web browsers are the universal clients for web applications. They retrieve documents for display and serve as a user interface, primarily through the use of HTML forms and links. In the remainder of this chapter, we will show how to write client-side Java code that uses HTTP through the URL class to work with web applications directly. There are many reasons an application (or applet) might want to communicate in this way. For example, compatibility with another browser-based application might be important, or you might need to gain access to a server through a firewall where direct socket connections (and hence RMI) are not available. HTTP has become the lingua franca of the Net and despite its limitations (or more likely because of its simplicity), it has rapidly become one of the most widely supported protocols in the world. As for using Java on the client side, all the other reasons you would write a client GUI application (as opposed to a pure Web/HTML-based application) also present themselves. A client-side GUI can do sophisticated presentation and validation while, with the techniques presented here, still use web-enabled services over the network.

The primary task we discuss here is sending data to the server, specifically HTML form-encoded data. In a web browser, the name/value pairs of HTML form fields are encoded in a special format and sent to the server using one of two methods. The first method, using the HTTP command GET, encodes the user's input into the URL and requests the corresponding document. The server recognizes that the first part of the URL refers to a program and invokes it, passing along the information encoded in the URL as a parameter. The second method uses the HTTP command POST to ask the server to accept the encoded data and pass it to a web application as a stream. In Java, we can create a URL that refers to a server-side program and send it data using either the GET or POST methods. (In Chapter 14 we'll see how to build web applications that implement the other side of this conversation.)

13.3.4 Using the GET Method

Using the GET method of encoding data in a URL is pretty easy. All we have to do is create a URL pointing to a server program and use a simple convention to tack on the encoded name/value pairs that make up our data. For example, the following code snippet opens a URL to a CGI program called login.cgi on the server myhost and passes it two name/value pairs. It then prints whatever text the CGI sends back:

URL url = new URL( 
    // this string should be URL-encoded as well
    "http://myhost/cgi-bin/login.cgi?Name=Pat&Password=foobar");
  
BufferedReader bin = new BufferedReader ( 
  new InputStreamReader( url.openStream( ) ));
  
String line; 
while ( (line = bin.readLine( )) != null )
    System.out.println( line );

To form the new URL, we start with the URL of login.cgi; we add a question mark (?), which marks the beginning of the form data, followed by the first name/value pair. We can add as many pairs as we want, separated by ampersand (&) characters. The rest of our code simply opens the stream and reads back the response from the server. Remember that creating a URL doesn't actually open the connection. In this case, the URL connection was made implicitly when we called openStream(). Although we are assuming here that our server sends back text, it could send anything. (In theory, of course, we could use the getContentType() method of the URLConnection to check the MIME type of any returned data and try to retrieve the data as an object using getContent() as well.)

It's important to point out that we have skipped a step here. This example works because our name/value pairs happen to be simple text. If any "non-printable" or special characters (including ? or &) are in the pairs, they have to be encoded first. The java.net.URLEncoder class provides a utility for encoding the data. We'll show how to use it in the next example.
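To make the encoding step concrete, here is a minimal sketch that builds the query portion of a GET URL from name/value pairs. The class name QueryBuilder, the helper buildQuery(), and the field names and values are made up for illustration; the two-argument form of URLEncoder.encode() used here takes the name of the character encoding, and UTF-8 is an assumption about what the server expects.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryBuilder {
    // Encode each name and value, joining the pairs with ampersands.
    static String buildQuery(Map<String, String> fields)
            throws UnsupportedEncodingException {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(URLEncoder.encode(e.getKey(), "UTF-8"));
            sb.append('=');
            sb.append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args)
            throws UnsupportedEncodingException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("Name", "Pat & Chris");   // contains a space and an &
        fields.put("Note", "50% off?");      // contains % and ?
        System.out.println(
            "http://myhost/cgi-bin/login.cgi?" + buildQuery(fields));
        // query part: Name=Pat+%26+Chris&Note=50%25+off%3F
    }
}
```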

Another important thing to note is that although this example sends a password field, you should never do so using this simplistic approach. All of the data we're sending goes in clear text across the network (it is not encrypted). And in this case, the password field would appear anywhere the URL is printed as well (e.g., server logs and bookmarks). We'll talk about secure web communications later in this chapter and when we discuss writing web applications using servlets in Chapter 14.

13.3.5 Using the POST Method

Next we'll create a small application that acts like an HTML form. It gathers data from two text fields—name and password—and posts the data to a specified URL using the HTTP POST method. Here we are writing a Swing-based client application that works with a server-side web-based application, just like a web browser.

Here's the code:

//file: Post.java
import java.net.*;
import java.io.*;
import java.awt.*;
import java.awt.event.*;
import javax.swing.*;
  
public class Post extends JPanel implements ActionListener {
  JTextField nameField, passwordField;
  String postURL;
  
  GridBagConstraints constraints = new GridBagConstraints( );
  void addGB( Component component, int x, int y ) {
    constraints.gridx = x;  constraints.gridy = y;
    add ( component, constraints );
  }
  
  public Post( String postURL ) {
    this.postURL = postURL;
    JButton postButton = new JButton("Post");
    postButton.addActionListener( this );
    setLayout( new GridBagLayout( ) );
    addGB( new JLabel("Name:"), 0,0 );
    addGB( nameField = new JTextField(20), 1,0 );
    addGB( new JLabel("Password:"), 0,1 );
    addGB( passwordField = new JPasswordField(20),1,1 );
    constraints.gridwidth = 2;
    addGB( postButton, 0,2 );
  }
  
  public void actionPerformed(ActionEvent e) {
    postData( );
  }
  
  protected void postData( ) {
    StringBuffer sb = new StringBuffer( );
    sb.append( URLEncoder.encode("Name") + "=" );
    sb.append( URLEncoder.encode(nameField.getText( )) );
    sb.append( "&" + URLEncoder.encode("Password") + "=" );
    sb.append( URLEncoder.encode(passwordField.getText( )) );
    String formData = sb.toString( );
  
    try {
      URL url = new URL( postURL );
      HttpURLConnection urlcon = 
          (HttpURLConnection) url.openConnection( );
      urlcon.setRequestMethod("POST");
      urlcon.setRequestProperty("Content-type", 
          "application/x-www-form-urlencoded");
      urlcon.setDoOutput(true);
      urlcon.setDoInput(true);
      PrintWriter pout = new PrintWriter( new OutputStreamWriter(
          urlcon.getOutputStream( ), "8859_1"), true );
      pout.print( formData );
      pout.flush( );
  
      // read results...
      if ( urlcon.getResponseCode( ) == HttpURLConnection.HTTP_OK )
        System.out.println("Posted ok!");
      else {
        System.out.println("Bad post...");
        return;
      }
      //InputStream in = urlcon.getInputStream( );
      // ...
  
    } catch (MalformedURLException e) {
      System.out.println(e);     // bad postURL
    } catch (IOException e2) {
      System.out.println(e2);    // I/O error
    }
  }
  
  public static void main( String [] args ) {
    JFrame frame = new JFrame("SimplePost");
    frame.getContentPane( ).add( new Post( args[0] ), "Center" );
    frame.pack( );
    frame.setVisible(true);
  }
}

When you run this application, you must specify the URL of the server program on the command line. For example:

% java Post http://www.myserver.example/cgi-bin/login.cgi

The beginning of the application creates the form; there's nothing here that won't be obvious after you've read Chapter 15 through Chapter 17. All the magic happens in the protected postData() method. First we create a StringBuffer and load it with name/value pairs, separated by ampersands. (We don't need the initial question mark when we're using the POST method because we're not appending to a URL string.) Each pair is first encoded using the static URLEncoder.encode() method. We run the name fields through the encoder as well as the value fields, even though we know that they contain no special characters.

Next we set up the connection to the server program. In our previous example, we didn't have to do anything special to send the data because the request was made simply by opening the URL. Here, we have to carry some of the weight of talking to the remote web server. Fortunately, the HttpURLConnection object does most of the work for us; we just have to tell it that we want to do a POST to the URL and the type of data we are sending. We ask for the URLConnection object using the URL's openConnection() method. We know that we are using the HTTP protocol, so we should be able to cast it safely to an HttpURLConnection type, which has the support we need.

Next we use setRequestMethod() to tell the connection we want to do a POST operation. We also use setRequestProperty() to set the "Content-Type" field of our HTTP request to the appropriate type—in this case, the proper MIME type for encoded form data. (This is necessary to tell the server what kind of data we're sending.) Finally, we use the setDoOutput() and setDoInput() methods to tell the connection that we want to both send and receive stream data. The URL connection infers from this combination that we are going to do a POST operation and expects a response. Next we get an output stream from the connection with getOutputStream() and create a PrintWriter so we can easily write our encoded data.

After we post the data, our application calls getResponseCode() to see whether the HTTP response code from the server indicates the POST was successful. Other response codes (defined as constants in HttpURLConnection) indicate various failures. At the end of our example, we indicate where we could have read back the text of the response. For this application, we'll assume that simply knowing the post was successful is sufficient.

Although form-encoded data (as indicated by the MIME type we specified for the Content-Type field) is the most common, other types of communications are possible. We could have used the input and output streams to exchange arbitrary data types with the server program. The POST operation accepts nonform data as well; the server application simply has to know how to handle it. One final note: if you are writing an application that needs to decode form data, you can use the java.net.URLDecoder to undo the operation of the URLEncoder. If you use the Servlet API, this happens automatically, as you'll see in Chapter 14.
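For illustration, decoding a form-encoded string by hand might look like the following sketch. The class name FormDecoder and the helper parse() are made up, UTF-8 is an assumed encoding, and a repeated field name simply overwrites the earlier value here, which a real application may not want.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class FormDecoder {
    // Split form data on & and =, then decode each name and value.
    static Map<String, String> parse(String formData)
            throws UnsupportedEncodingException {
        Map<String, String> fields = new LinkedHashMap<>();
        if (formData.isEmpty()) {
            return fields;
        }
        for (String pair : formData.split("&")) {
            int eq = pair.indexOf('=');
            String name = (eq < 0) ? pair : pair.substring(0, eq);
            String value = (eq < 0) ? "" : pair.substring(eq + 1);
            fields.put(URLDecoder.decode(name, "UTF-8"),
                       URLDecoder.decode(value, "UTF-8"));
        }
        return fields;
    }

    public static void main(String[] args)
            throws UnsupportedEncodingException {
        System.out.println(parse("Name=Pat+%26+Chris&Note=50%25+off%3F"));
    }
}
```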

13.3.6 The HttpURLConnection

Other information about the response is available from the HttpURLConnection as well. We could use getContentType() and getContentEncoding() to determine the MIME type and encoding of the response. We could also interrogate the HTTP response headers using getHeaderField(). (HTTP response headers are metadata name/value pairs carried with the response.) There are also convenience methods to fetch integer- and date-formatted header fields: getHeaderFieldInt() and getHeaderFieldDate(), which return an int and a long type, respectively. The content length and last modification date are also provided through getContentLength() and getLastModified().
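A short sketch of these accessors follows. Because they are inherited from URLConnection, the helper accepts any connection, so it can be tried against a file: URL as well as an HTTP server. The class name ResponseInfo and the helper summarize() are illustrative, and values a server does not supply come back as null, -1, or 0.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Date;

class ResponseInfo {
    // Collect a few pieces of response metadata into one line.
    // Calling these getters connects to the resource if necessary.
    static String summarize(URLConnection con) {
        return "type=" + con.getContentType()
            + " encoding=" + con.getContentEncoding()
            + " length=" + con.getContentLength()
            + " modified=" + new Date(con.getLastModified());
    }

    public static void main(String[] args) throws IOException {
        URLConnection con = new URL(args[0]).openConnection();
        System.out.println(summarize(con));
        // Any single header can also be fetched by name:
        System.out.println(con.getHeaderField("Content-Type"));
    }
}
```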

13.3.7 SSL and Secure Web Communications

The previous examples sent a field called Password to the server. However, standard HTTP doesn't provide encryption to hide our data. Fortunately, adding security for GET and POST operations like this is easy (trivial in fact, for the developer). Where available you simply have to use a secure form of the HTTP protocol—HTTPS:

https://www.myserver.example/cgi-bin/login.cgi

HTTPS is a version of the standard HTTP protocol run over SSL (Secure Sockets Layer), which uses public-key encryption techniques to encrypt the data sent. Most web browsers and servers currently come with built-in support for HTTPS (or raw SSL sockets). Therefore, if your web server supports HTTPS, you can use a browser to send and receive secure data simply by specifying the https protocol in your URLs. There is a lot more to know in general about SSL and related aspects of security such as authenticating whom you are actually talking to. But as far as basic data encryption goes, this is all you have to do. It is not something your code has to deal with directly. As of Java 1.4, the standard distribution from Sun is shipped with SSL and HTTPS support. Applets written using the Java Plug-in also have access to the HTTPS protocol handler. We'll discuss writing secure web applications in more detail in Chapter 14.
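To see that nothing else in the client code needs to change, the sketch below opens connections for an http and an https URL and checks which kind of connection object the protocol handler supplies. The class name HttpsCheck and the helper usesHttps() are illustrative; the check assumes a standard Java runtime that ships the HTTPS handler, in which case the connection is a javax.net.ssl.HttpsURLConnection.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import javax.net.ssl.HttpsURLConnection;

class HttpsCheck {
    // Report whether the handler for this URL will use HTTPS.
    static boolean usesHttps(URL url) throws IOException {
        // openConnection() only creates the object; nothing is sent yet.
        URLConnection con = url.openConnection();
        return con instanceof HttpsURLConnection;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(usesHttps(
            new URL("https://www.myserver.example/cgi-bin/login.cgi")));
        System.out.println(usesHttps(
            new URL("http://www.myserver.example/cgi-bin/login.cgi")));
    }
}
```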

13.3.8 URLs, URNs, and URIs

Earlier we talked about URLs and distinguished them from the concept of URNs or Uniform Resource Names. Whereas a URL points to a specific location on the Net and specifies a protocol or scheme for accessing its contents, a URN is simply a globally unique name. A URL is analogous to giving someone your phone number. But a URN is more like giving them your social security number. Your phone number may change, but your social security number uniquely identifies you forever.

While it's possible that some mechanism might be able to look at a given URN and resolve it to a location (a URL), it is not necessarily so. URNs are intended only to be permanent, unique, abstract identifiers for an item whereas a URL is a mechanism you can use to get in touch with a resource right now. You can use a phone number to contact me today, but you can use my social security number to uniquely identify me anytime.

An example of a URN is http://www.w3.org/1999/XSL/Transform, which is the identifier for a version of the Extensible Stylesheet Language, standardized by the W3C. Now, it happens that this is also a URL (you can go to that address and find information about the standard), but that is for convenience only. This URN's primary mission is to uniquely label the version of the programming language in a way that never changes.

Collectively, URLs and URNs are called Uniform Resource Identifiers or URIs. A URI is simply a URL or URN. So, we can talk about URLs and URNs as kinds of URIs. The reason for this abstraction is that URLs and URNs, by definition, have some things in common. All URIs are supposed to be human-readable and "transcribable" (it should be possible to write them on the back of a napkin). They always have a hierarchical structure, and they are always unique. Both URLs and URNs also share some common syntax, which is described by the URI RFC-2396.

Java 1.4 introduced the java.net.URI class to formalize these distinctions. Prior to that, there was only the URL class in Java. The difference between the URI and URL classes is that the URI class does not try to parse the contents of the identifier and apply any "meaning." The URL class immediately attempts to parse the scheme portion of the URL and locate a protocol handler, whereas the URI class doesn't interpret its content. It serves only to allow us to work with the identifier as structured text, according to the general rules of URI syntax. With the URI class, you can construct the string, resolve relative paths, and perform equality or comparison operations, but no hostname or protocol resolution is done.
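A brief sketch of working with identifiers as structured text is shown below. The class name UriDemo, the helper resolve(), and the identifier urn:example:item-42 are made up for illustration. Note that the URI class accepts the urn scheme even though no protocol handler exists for it, which the URL class would not.

```java
import java.net.URI;

class UriDemo {
    // Resolve a relative reference against a base, purely as text.
    static URI resolve(String base, String relative) {
        return URI.create(base).resolve(relative);
    }

    public static void main(String[] args) {
        // No handler lookup happens, so any well-formed scheme is fine.
        URI urn = URI.create("urn:example:item-42");
        System.out.println(urn.getScheme());             // urn
        System.out.println(urn.getSchemeSpecificPart()); // example:item-42

        // Relative paths are resolved without contacting any host.
        System.out.println(
            resolve("http://foo.bar.com/documents/", "../images/logo.gif"));
        // prints http://foo.bar.com/images/logo.gif
    }
}
```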

[1]  The term URL was coined by the Uniform Resource Identifier (URI) working group of the IETF to distinguish URLs from the more general notion of Uniform Resource Names or URNs (see RFC-2396). Look for Section 13.3.8 later in this chapter.
