Exercise5BasicHTTPServer

sethnielson edited this page Feb 18, 2019 · 4 revisions

Exercise 5: Basic HTTP Server

Assigned 2/13/2019
Due 2/20/2019
Points 25

Overview

This is your first exercise in the class that is not building directly off the escape room. BUT, your escape room program (or programs) will still be VERY helpful. In fact, you could still use them as a template.

But we'll come back to that. First, the goal of this lab is to build a very simple web server! Modern web servers have significant capabilities, but they are actually built on a simple protocol. We will only be implementing the most basic subset, but you will be able to use a browser to test your functionality!

Your web server needs to accept incoming data. Now, it will almost always be the case that an incoming HTTP request will come in a single data transmission. But just in case, you will need to make sure you have the full request before acting on it. How will you know if you have the full HTTP request?

HTTP is a text-based, line-oriented protocol. That means that each line carries one piece of data, and a blank line marks the end of the HTTP request:

line 1 \r\n
line 2 \r\n
...
line n \r\n
\r\n

So, in your data_received method, you should keep buffering data until the buffer contains \r\n\r\n. If it doesn't yet, you need to end your processing and wait for a subsequent data_received call.

def data_received(self, data):
    self.buffer += data
    if not self.has_full_packet(self.buffer):
        return
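The snippet above calls has_full_packet without showing it. Here is a minimal sketch of what such a helper might look like, assuming the buffer is a bytes object; it is written as a plain function for brevity, but it could just as well be a method on your protocol class.

```python
HEADER_END = b"\r\n\r\n"  # the blank line that terminates an HTTP request


def has_full_packet(buffer: bytes) -> bool:
    """Return True once the buffer holds at least one complete request."""
    return HEADER_END in buffer
```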

Now, once you have the entire request, you can start processing. The first line of an HTTP request is the critical part. It contains the "method", which may be GET, POST, or one of a few others. You only need to deal with GET. A typical GET line might look like this:

GET /docs/index.html HTTP/1.1

This line has three parts: the method (GET), the URI parameter (/docs/index.html), and the version string (HTTP/1.1). Your HTTP server should verify that the method is GET. Conceptually the version does not change how you respond, although the Detailed Specification section asks you to check that it is 1.0 or 1.1. The URI string indicates what data you need to send back.
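As a rough illustration, the request line can be split on spaces into its three parts. The function name parse_request_line is hypothetical, and the sketch assumes the line has already been decoded to a str with the trailing \r\n removed.

```python
def parse_request_line(line: str):
    """Split a request line into (method, uri, version); None if malformed."""
    parts = line.split(" ")
    if len(parts) != 3:
        return None  # not exactly three space-separated fields
    method, uri, version = parts
    return method, uri, version
```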

Your server should be configurable with a "document root", a directory that represents the root directory for the server. Suppose that you set your document root as /home/user/netsec/www_server/root. Then, if your URI is /somefile.html you would look for /home/user/netsec/www_server/root/somefile.html. If the URI was /a/b/c/somefile.html, you would look for /home/user/netsec/www_server/root/a/b/c/somefile.html. If the file is found, you will send it back in an HTTP response. Otherwise, you'll send a 404 Not Found response instead. More on this in a minute.
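One way to do this translation is sketched below. The name resolve_path is hypothetical; the only real calls are os.path.join and os.path.normpath from the standard library.

```python
import os


def resolve_path(document_root: str, uri: str) -> str:
    """Map a request URI onto a filesystem path under document_root."""
    # Strip the leading slash; otherwise os.path.join would treat the URI
    # as an absolute path and throw away document_root.
    relative = uri.lstrip("/")
    return os.path.normpath(os.path.join(document_root, relative))
```

Note that a URI containing ".." could resolve to a path outside the document root. The assignment does not ask for it, but checking that the result still lies under the document root is a sensible precaution.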

What about the lines that come after? For now, you can ignore them. But you should record them and print them out. Each line will be a key-value pair separated by a colon. For example:

Host: www.nowhere123.com
Accept: image/gif, image/jpeg, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)

Although it will not be graded, you really should create a dictionary with these key-value pair mappings and do a little snooping around. What kind of data is your browser sending anyway?
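A small sketch of building that dictionary follows. It assumes the header lines have already been decoded to str and split on \r\n; parse_headers is a hypothetical name.

```python
def parse_headers(lines):
    """Build a dict from 'Key: value' header lines."""
    headers = {}
    for line in lines:
        # Split on the first colon only; values may themselves contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a key-value pair
        headers[key.strip()] = value.strip()
    return headers
```

Printing the result for a request from your own browser is an easy way to do the snooping suggested above.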

Once again, you'll know when the request is finished by a blank line.

HTTP responses, the type of messages you send back, are a mix of a line-oriented protocol along with binary data of a fixed length. Here's an example.

HTTP/1.1 200 OK
Date: Sun, 18 Oct 2009 08:56:53 GMT
Server: Apache/2.2.14 (Win32)
Last-Modified: Sat, 20 Nov 2004 07:16:26 GMT
Content-Length: 44
Connection: close
Content-Type: text/html

Your first line needs to read HTTP/1.1 followed by one of two codes:

  • 200 OK (if the URI was found)
  • 404 Not Found (if the URI was NOT found)

After this first line, you need to report associated key-values. If the file is found, you need to report the following:

  • Date - the current time as an ascii string
  • Server - this is the type of server. For yours, report "NetSec Prototype Server 1.0"
  • Last-Modified - this is the last time the file was modified. You can simply report the current time.
  • Content-Length - this is the length of the file
  • Connection - the value will always be close, which indicates to the client (browser) that you will not keep the connection open
  • Content-Type - this is the type of the requested file, often known as the MIME type. In real life, it could be text/plain, text/html, application/octet-stream and many others. We will only use HTML files for testing, so you can hard-code this to text/html.

If the file is not found, your response only needs the Date, Server, and Content-Length fields. Normally, even a 404 error returns an HTML page with more detailed error information that can be processed by the browser. But for simplicity, you need not return any such page and your Content-Length can be zero.

If the file is found, once you've sent the header, there should be a blank line followed by the actual binary data of the file. It's important that you get Content-Length right, otherwise the browser won't know how much data to receive.
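To make the header-then-body layout concrete, here is one possible sketch of assembling a 200 OK response. The function name build_ok_response is hypothetical, and using email.utils.formatdate to produce the date string is an assumption on my part; any ASCII time string satisfies the description above.

```python
from email.utils import formatdate

SERVER_NAME = "NetSec Prototype Server 1.0"


def build_ok_response(body: bytes) -> bytes:
    """Assemble a 200 OK response: header lines, a blank line, then the body."""
    now = formatdate(usegmt=True)  # RFC-style date string ending in "GMT"
    header_lines = [
        "HTTP/1.1 200 OK",
        "Date: " + now,
        "Server: " + SERVER_NAME,
        "Last-Modified: " + now,  # the current time is acceptable here
        "Content-Length: " + str(len(body)),  # length in bytes, not characters
        "Connection: close",
        "Content-Type: text/html",
    ]
    header = "\r\n".join(header_lines) + "\r\n\r\n"
    return header.encode("ascii") + body
```

Reading the file in binary mode (open(path, "rb")) and passing the bytes straight in keeps Content-Length equal to the number of bytes actually sent.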

Detailed Specification

As with the original escape room program, we will use an autograder. You need to name your http server class: ExampleHttpServer. This class must inherit from asyncio.Protocol and it must correctly use data_received to receive http requests and use transport.write to send http responses.

Moreover, this class must have a constructor that takes the document root as the input. So, your class should look something like this:

class ExampleHttpServer(asyncio.Protocol):
    def __init__(self, document_root):
        # initialization
        self.document_root = document_root

    def data_received(self, data):
        # process incoming http requests
        pass

And, of course, you will probably want to override connection_made to save a copy of the transport.

Don't forget to update your factory for this protocol to pass in the document root. Here's an example:

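import asyncio, sys
document_root = sys.argv[1] # first command line parameter
loop = asyncio.get_event_loop()
# create_server is a coroutine, so it has to be run before the server exists
loop.run_until_complete(loop.create_server(lambda: ExampleHttpServer(document_root), HOST, PORT))
loop.run_forever()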

As we discussed in class, data_received needs to recognize where one request ends and the next begins. It should know what to do if:

  • it does not receive the full request in a single data_received call
  • it receives more than one request in a single data_received call
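Both cases can be handled by the same buffering loop. The sketch below is one possible approach, with split_requests as a hypothetical helper: it peels off every complete request and returns whatever is left over, so the leftover can be stored back in self.buffer until more data arrives.

```python
def split_requests(buffer: bytes):
    """Return (complete_requests, leftover) for the bytes received so far."""
    requests = []
    while b"\r\n\r\n" in buffer:
        # Everything before the blank line is one request; keep the rest.
        request, _, buffer = buffer.partition(b"\r\n\r\n")
        requests.append(request)
    return requests, buffer
```

With a partial request the list is empty and the leftover is the whole buffer; with two requests back to back the list has two entries.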

As for the requests themselves, you are not required to do anything with the headers for full credit. All you need to do is:

  • Ensure that the request-line (the first line) is properly formatted as [method] [uri] HTTP/[version]
  • Check that the HTTP method is GET
  • Check that the version is 1.0 or 1.1
  • Translate the uri to the correct location under document root

Although in real HTTP you would return different error codes for improper formatting, you will return the "404 not found" for errors as well. If version isn't 1.0 or 1.1, return 404 not found. If the method isn't GET, return 404 not found. If you can't find the file, return 404 not found.
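Because every failure maps to the same 404 response, the checks can collapse into a single yes/no decision. A minimal sketch, with request_is_valid as a hypothetical name and the request line assumed to be a decoded str without its trailing \r\n:

```python
def request_is_valid(request_line: str) -> bool:
    """Return True only for a well-formed GET request with version 1.0 or 1.1."""
    parts = request_line.split(" ")
    if len(parts) != 3:
        return False  # malformed: not exactly [method] [uri] [version]
    method, _uri, version = parts
    if method != "GET":
        return False  # wrong method
    if version not in ("HTTP/1.0", "HTTP/1.1"):
        return False  # wrong version
    return True
```

If this returns False, or the file lookup fails afterwards, the server sends the 404 response.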

If you find the file, your response must look like this:

HTTP/1.1 200 OK\r\n
Date: [the current time as an ascii string]\r\n
Server: NetSec Prototype Server 1.0\r\n
Last-Modified: [just use the current time]\r\n
Content-Length: [content-length of the file]\r\n
Connection: close\r\n
Content-Type: text/html\r\n
\r\n
[Content-Length bytes of binary data]

If you do not find the file, or there is an error, your response must look like this:

HTTP/1.1 404 Not Found\r\n
Date: [the current time as an ascii string]\r\n
Server: NetSec Prototype Server 1.0\r\n
Content-Length: 0\r\n
\r\n
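For reference, the 404 layout above could be produced by something like the following. The name build_not_found_response is hypothetical, and email.utils.formatdate is just one convenient way to get an ASCII time string.

```python
from email.utils import formatdate


def build_not_found_response() -> bytes:
    """Assemble the bodiless 404 response in the layout shown above."""
    header_lines = [
        "HTTP/1.1 404 Not Found",
        "Date: " + formatdate(usegmt=True),
        "Server: NetSec Prototype Server 1.0",
        "Content-Length: 0",
    ]
    # The final blank line ends the header; no body follows.
    return ("\r\n".join(header_lines) + "\r\n\r\n").encode("ascii")
```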

Extra Credit

There will be a lot of extra credit for this lab. Here are the really easy ones (2 points each)

  • More MIME Types - Use Python's mimetypes module to return the correct type for different types of files based on extension
  • index.html - If a URI is a directory, check if an index.html file is present. If so, return that instead of not found
  • Last-Modified - Use the os.stat function to return the actual last modified time of the file in the Last-Modified response header
  • Error 404 page - Return a generated 404 error HTML page. You will need to include Content-Type and Content-Length headers with appropriate values
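Two of these can be sketched with the standard library alone. The helper names below are hypothetical, and falling back to application/octet-stream for unknown extensions is an assumption rather than a requirement of the assignment.

```python
import mimetypes
import os
from email.utils import formatdate


def content_type_for(path: str) -> str:
    """Guess the MIME type from the file extension."""
    guessed, _encoding = mimetypes.guess_type(path)
    # Fall back to a generic binary type when the extension is unknown.
    return guessed or "application/octet-stream"


def last_modified_for(path: str) -> str:
    """Format the file's modification time as an HTTP-style date string."""
    return formatdate(os.stat(path).st_mtime, usegmt=True)
```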

The harder one: cookies. Worth a full 25 points (!) Implement the Escape Room in HTML following these requirements:

  • If the URI is /cgi/escape_room, it will connect to a "virtual" page... you will generate the HTML yourself
  • If the request comes without a Cookie header set, start a new escape room, create a cookie, and use the Set-Cookie header in the response
  • If the request comes in with a cookie, it should continue the same game.
  • The command and the parameters will be part of a GET query string: /cgi/escape_room?cmd+word1+word2+...+wordn

The HTTP response should be 404 not found for any error, and otherwise an HTML page containing the game's response string. The generated page must also have a form in which the user enters the next command. You will have to look up how a form translates input into a GET query string.
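As a starting point, the query string in the format given above can be pulled apart with urllib.parse. This is only a sketch: parse_escape_room_uri is a hypothetical name, and a real browser form typically submits name=value pairs, so the exact string your form produces may differ from this layout.

```python
from urllib.parse import unquote, urlsplit


def parse_escape_room_uri(uri: str):
    """Return the command words from /cgi/escape_room?cmd+word1+...+wordn."""
    parts = urlsplit(uri)
    if parts.path != "/cgi/escape_room":
        return None  # not the virtual escape room page
    # Words are joined with '+'; decode any %XX escapes inside each word.
    return [unquote(word) for word in parts.query.split("+") if word]
```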

We will test this one manually, so let us know by email if you get it done and would like to be graded. This should be enough detail to figure out the rest, but alert me if something is wrong or incomplete.

Grading

You need to have the following file in github:

<your_repository_root>/src/exercises/ex5/example_http_server.py

Points will be awarded as follows:

  • 10 points for correctly returning an HTML file from the document root directory
  • 5 points for correctly returning an HTML file from subdirectories of the document root
  • 2 points for a 404 error for not finding a file
  • 2 points for a 404 error for requesting a directory (or returning an index.html, if present, and if doing the extra credit)
  • 2 points for a 404 error for the wrong method (i.e., not GET)
  • 2 points for a 404 error for a malformed request
  • 2 points for a 404 error for a request with the wrong version

The extra credit will be awarded as described in the extra credit section. If you think of additional HTTP features that you would like to implement, send me an email and maybe I'll give you extra credit for it anyway.