CS 43 — Lab 2: A Concurrent Web Server

Due: Thursday, February 24 @ 11:59 PM

1. Overview

Having built a web client, for this lab we’ll look at the other end of the HTTP protocol — the web server. As real web clients (e.g., browsers like Firefox) send requests to your server, you’ll be finding the requested files and serving them back to the clients.

1.1. Goals

Implement the server side of (non-persistent) HTTP over a TCP connection.
Apply socket system calls (bind, listen, and accept) on a server process to interact with clients.
Use threading to serve multiple concurrent clients.
More practice with sockets, send, and recv.

1.2. Handy References

RFC 1945: HTTP 1.0 Specification. Sections 4, 5, and 6 are probably the most helpful.
Manual pages for bind, listen, and accept.
Manual pages for pthread_create and pthread_detach.

1.3. Lab Recordings

Week 1

Week 2

2. Requirements

Your server program, lab2, will receive two arguments:

the port number it should listen on for incoming connections, and
the directory out of which it will serve files (typically called the document root).

For example:

./lab2 8080 test_documents

This command will tell your web server to listen for connections on port 8080 and serve files out of the test_documents directory. That is, the test_documents directory is considered / when responding to requests. If you’re asked for /index.html, you should respond with the file that resides in test_documents/index.html. If you’re asked for /dir1/dir2/file.ext, you should respond with the file test_documents/dir1/dir2/file.ext.

On most UNIX systems, only users with administrative (root) privileges are allowed to bind to ports below 1024. Users without such privileges often test web services on ports 8080 or 8000 because they sound "close" to port 80.

When connecting your web browser to your lab2 server, you’ll need to explicitly specify the port number in the URL with a colon (:) after the host, like:

http://localhost:8080/index.html

or equivalently,

http://127.0.0.1:8080/index.html

You may find the chdir system call helpful when dealing with file paths. It will change your process’s "working directory", and making your working directory the document root will help in locating files within it.

2.1. Workflow of Your Program

Roughly, your server should follow this sequence:

Read the arguments, find your document root, bind to the specified port, and begin listening for incoming connections.
Accept a connection, and:
1. week 1: hand the socket off to a function that handles the remaining steps.
2. week 2: pass the socket to a new thread for concurrent processing.
Receive and parse a request from the client.
Look for the path that was requested, starting from your document root (the second argument to your program). One of four things should happen (you might want to make each of these cases a separate function!):
1. If the path exists and it’s a regular file, formulate a response (with the Content-Type header set) and send it back to the client.
2. If the path exists and it’s a directory that contains an index.html file, respond with that file.
3. week 2: If the path exists and it’s a directory that does NOT contain an index.html file, respond with a directory listing.
4. If the path does not exist, respond with a 404 code with a basic HTML error page. The 404 HTML page can be static and very simple — it just needs to be enough for a user to see a 404 message in a real browser.
Close the connection and continue serving other clients.
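As a rough sketch of how the four cases above might be told apart, the following uses stat on a path that is already relative to the document root. The enum, its constants, and the function name classify are hypothetical names chosen for illustration; they are not part of the starter code.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical labels for the four cases in the workflow. */
enum target { TARGET_FILE, TARGET_INDEX, TARGET_LISTING, TARGET_MISSING };

/* Hypothetical helper: decide which case applies to path, which is
 * assumed to be relative to the current working directory. */
enum target classify(const char *path) {
    struct stat st;
    if (stat(path, &st) < 0) {
        return TARGET_MISSING;          /* case 4: respond with a 404 */
    }
    if (S_ISREG(st.st_mode)) {
        return TARGET_FILE;             /* case 1: regular file */
    }
    if (S_ISDIR(st.st_mode)) {
        char index_path[4096];
        int n = snprintf(index_path, sizeof(index_path), "%s/index.html", path);
        if (n > 0 && (size_t)n < sizeof(index_path) &&
            stat(index_path, &st) == 0 && S_ISREG(st.st_mode)) {
            return TARGET_INDEX;        /* case 2: serve index.html */
        }
        return TARGET_LISTING;          /* case 3: directory listing */
    }
    return TARGET_MISSING;              /* other file types: treated as missing here */
}
```

Treating special files (devices, sockets, and so on) as missing is a simplification made for this sketch, not a requirement of the lab.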

2.2. Server Behavior Expectations

For full credit:

Your server should send byte-for-byte identical copies of files to clients. Use wget or curl to fetch files and md5sum or diff to compare the fetched file with the original. I will do this when grading!
A variety of file formats should display properly in a real web browser (e.g., firefox), including both text and binary formats. You’ll need to return the proper HTTP Content-Type header in your response. You don’t need to handle every content type that exists, but you should at least be able to handle files with .html, .txt, .jpeg, .jpg, .gif, .png, .pdf, and .ico extensions. You may assume that the file extension is correct (e.g., I’m not going to name a PDF file with a .txt suffix).
If asked for a file that does not exist, you should respond with a 404 error code with a readable error page, just like a web server would. It doesn’t need to be fancy, but it should contain some basic HTML so that the browser renders something and makes the error clear.
Some clients may be slow to complete a connection or send a request. Your server should be able to serve multiple clients concurrently, not just back-to-back. For this lab, use multithreading with pthreads to handle concurrent connections. (We’ll try an alternative to threads, event-based concurrency, in a future lab assignment.)
If the path requested by the client is a directory, you should handle the request as if it was for the file index.html inside that directory, if such a file exists. Hint: use the stat system call to determine if a path is a directory or a file. Using the S_ISDIR macro on the st_mode field of the stat struct will help you to identify directories.
The web server should respond with a list of files when the user requests a directory that does not contain an index.html file. You can read the contents of a directory using the opendir and readdir calls. Together they behave like an iterator. That is, you can open a DIR * with opendir and then continue calling readdir, which returns info for one file, on that DIR * until it returns NULL. Note that there should be no additional files created on the server’s disk to respond to the request. The response should mimic the result of running the following command (this is the Python 2 module; the Python 3 equivalent is python3 -m http.server):
```
python -m SimpleHTTPServer
```
Your program should generate no warnings from valgrind. If valgrind ever tells you something is wrong DON’T IGNORE IT! Fix it before moving on.

2.3. Assumptions

You may assume that file suffixes correctly correspond to their type (e.g., if a file ends in ".pdf" that it really is a PDF file).
You may assume that requests sent to your server are at most 4 KB in length.
You may assume that if the user requests a path that is a directory, the path will end in a trailing /. When generating the list of files in a directory, make sure your server also sends back URLs that end in / for directories. This is for the benefit of your browser, which keeps track of its current location based on the absence or presence of slashes.
You may assume that you will only receive GET requests from clients.
If you receive an HTTP/1.1 request, you should respond back with an HTTP/1.0 response.

You should NOT assume anything about the size of the file that a client requests. Rather than trying to read the entire file into memory at once, you can read a chunk of the file (e.g., 4096 bytes) and then send just that chunk (in a loop!) before reading the next chunk.
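A minimal sketch of the chunked approach is shown below. The helper names send_all and send_file are hypothetical; the inner loop is there because send may transmit fewer bytes than requested, which is one reason the chunk has to be sent in a loop.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: keep calling send until all len bytes are sent.
 * Returns 0 on success, -1 on error. */
int send_all(int sock, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(sock, buf + sent, len - sent, 0);
        if (n < 0) {
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: stream an already-open file to the socket one
 * 4096-byte chunk at a time, so memory use does not depend on file size. */
int send_file(int sock, FILE *fp) {
    char chunk[4096];
    size_t n;
    while ((n = fread(chunk, 1, sizeof(chunk), fp)) > 0) {
        if (send_all(sock, chunk, n) < 0) {
            return -1;
        }
    }
    return ferror(fp) ? -1 : 0;   /* distinguish a read error from end of file */
}
```

Because the loop works on byte counts rather than strings, it behaves the same for text and binary files.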

2.4. Checkpoint

To be on track, by the start of the next lab session, you should have finished:

Your server can accept client connections and hand them to a function for further processing.
Your processing function can:
- receive a full request from the client, using the presence of a double CRLF to determine that it has received the full request.
- parse the request and extract the requested path.
- generate a response (both header and body) for requested regular files and directories that contain an index.html file.
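One possible shape for the first two checkpoint items is sketched here. The names recv_request, parse_path, and MAX_REQUEST are hypothetical, and the sketch leans on the stated assumptions that requests are at most 4 KB and are always GET requests.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

#define MAX_REQUEST 4096  /* requests are assumed to be at most 4 KB */

/* Hypothetical helper: read until "\r\n\r\n" arrives. buf must hold
 * MAX_REQUEST + 1 bytes. Returns the request length, or -1 on error,
 * early close, or a request that never completes within the limit. */
ssize_t recv_request(int sock, char *buf) {
    size_t total = 0;
    while (total < MAX_REQUEST) {
        ssize_t n = recv(sock, buf + total, MAX_REQUEST - total, 0);
        if (n <= 0) {
            return -1;
        }
        total += (size_t)n;
        buf[total] = '\0';   /* headers are ASCII, so strstr is usable */
        if (strstr(buf, "\r\n\r\n") != NULL) {
            return (ssize_t)total;
        }
    }
    return -1;
}

/* Hypothetical helper: copy the path out of "GET <path> HTTP/x.y".
 * Returns 0 on success, -1 if the request line is malformed or the
 * path does not fit in the caller's buffer. */
int parse_path(const char *request, char *path, size_t path_size) {
    if (strncmp(request, "GET ", 4) != 0) {
        return -1;
    }
    const char *start = request + 4;
    const char *end = strchr(start, ' ');
    if (end == NULL || (size_t)(end - start) >= path_size) {
        return -1;
    }
    memcpy(path, start, (size_t)(end - start));
    path[end - start] = '\0';
    return 0;
}
```

For the request line GET /dir1/file.txt HTTP/1.0, parse_path would store /dir1/file.txt in path.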

A good stretch goal is:

Sending back a simple, static HTML document for 404 errors (requested file not found).
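For the stretch goal, a static 404 response could be assembled roughly as follows. The function name build_404 and the exact HTML are illustrative assumptions, and the Content-Length header is an optional extra that the lab does not require.

```c
#include <stdio.h>

/* Hypothetical helper: format a complete 404 response into out.
 * Returns the response length, or -1 if out is too small. */
int build_404(char *out, size_t out_size) {
    static const char body[] =
        "<html><head><title>404 Not Found</title></head>"
        "<body><h1>404 Not Found</h1></body></html>";
    int n = snprintf(out, out_size,
                     "HTTP/1.0 404 Not Found\r\n"
                     "Content-Type: text/html\r\n"
                     "Content-Length: %zu\r\n"   /* optional; an assumption of this sketch */
                     "\r\n"                      /* blank line ends the headers */
                     "%s",
                     sizeof(body) - 1, body);
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return n;
}
```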

It’s fine to defer until next week:

Handling multiple clients concurrently, with threading.
Producing directory listings for directories that do not contain an index.html file.

3. Examples and Testing

You should test your server in two ways:

Using a real web browser like firefox, request files and ensure that they render properly. Note: browsers are very forgiving in what they receive and will do their best to render properly, even when they aren’t given correct data.
To verify correctness, you should use a tool like wget to request and save copies of files from your server. You can then use tools like diff and md5sum, which we used to verify correctness in lab 1.

4. Tips & FAQ

Use HTTP version 1.0 — version 1.1 can get a lot more complicated. The subset of the HTTP 1.0 protocol you’ll need to implement for this assignment is quite small, but you may find the full protocol specification to be helpful.
All HTTP headers are ASCII string characters, so you can use the str family of functions to manipulate them safely.
Always, always, always check the return value of any system calls you make!

4.1. File types

When setting the Content-Type header, use the following file suffix to content type mappings:

html: text/html
txt: text/plain
jpeg: image/jpeg
jpg: image/jpg
gif: image/gif
png: image/png
pdf: application/pdf
ico: image/x-icon

It’s fine to hard-code knowledge of these specific types into your server.
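One way to hard-code the mappings above is a small lookup table keyed on the text after the last dot. The function name content_type_for is hypothetical, and the application/octet-stream fallback for unknown suffixes is an assumption, since the lab does not say what to send in that case.

```c
#include <string.h>

/* Hypothetical helper: map a path's suffix to a Content-Type string
 * using the table from this section. */
const char *content_type_for(const char *path) {
    static const struct { const char *ext; const char *type; } types[] = {
        { "html", "text/html" },
        { "txt",  "text/plain" },
        { "jpeg", "image/jpeg" },
        { "jpg",  "image/jpg" },
        { "gif",  "image/gif" },
        { "png",  "image/png" },
        { "pdf",  "application/pdf" },
        { "ico",  "image/x-icon" },
    };
    const char *dot = strrchr(path, '.');
    const char *slash = strrchr(path, '/');
    /* No dot, or the last dot is part of a directory name: no suffix. */
    if (dot == NULL || (slash != NULL && dot < slash)) {
        return "application/octet-stream";   /* assumed fallback */
    }
    for (size_t i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
        if (strcmp(dot + 1, types[i].ext) == 0) {
            return types[i].type;
        }
    }
    return "application/octet-stream";       /* assumed fallback */
}
```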

4.2. File paths

chdir: use this function to change your server process’s "current working directory" to test_documents. You probably want to do this at the very beginning of your program so that all paths can be relative to the document root.
stat: use this system call to determine if a path is a directory or a file. Allocate a variable of type struct stat and pass the address of the struct to stat (along with the path string). On success, the stat call will fill in the struct, and you can access the fields:
- Use the macro S_ISDIR() and pass in the st_mode field of your struct stat variable. S_ISDIR() will return true (non-zero) if the path is a directory or false (zero) otherwise.
- You don’t need to worry about the other fields of struct stat or the other file-type macros (S_ISCHR, S_ISBLK, etc.).
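The sketch below shows one way these calls might fit together. The helper names enter_docroot, relative_path, and is_directory are hypothetical; the idea is that once the working directory is the document root, a request path only needs its leading slash removed.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: make the document root the working directory.
 * Returns 0 on success, -1 on failure (same convention as chdir). */
int enter_docroot(const char *docroot) {
    return chdir(docroot);
}

/* Hypothetical helper: turn a request path such as "/dir/a.txt" into a
 * path relative to the working directory ("dir/a.txt"); "/" becomes ".". */
const char *relative_path(const char *request_path) {
    while (*request_path == '/') {
        request_path++;
    }
    return (*request_path == '\0') ? "." : request_path;
}

/* Hypothetical helper: returns 1 for a directory, 0 for anything else
 * that exists, and -1 if stat fails (for example, no such path). */
int is_directory(const char *path) {
    struct stat st;
    if (stat(path, &st) < 0) {
        return -1;
    }
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

With ./lab2 8080 test_documents, calling enter_docroot on the second argument once at startup would let a request for /index.html be opened simply as index.html.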

4.3. String Parsing and File I/O

Many of the tools you used in lab 1 for manipulating strings will also be helpful in lab 2.
If you need to copy a specific number of bytes from one buffer to another, and you’re not 100% sure that the data will be entirely text, use memcpy rather than strncpy. The latter terminates early if it finds a null terminator (\0), whereas memcpy will always copy the requested number of bytes.
Similar to lab 1, you will likely find fopen to be helpful for opening files. This time, use a mode of "r", since you’ll only be reading files. Afterward, you can read the contents with fread. Don’t forget to fclose when done.

5. Other Reference Material

5.1. Socket Programming

The server side of socket programming has a few more system calls than a client. Use man bind, man listen, and man accept to read through each of these functions. Look through your starter code on github, and follow along with the description of each of the system calls.

socket(): Like the client side, first create a socket. This time, we name it server_sock since it’s going to serve a special purpose. Use server_sock only to accept new connections. Never use server_sock with calls to send or recv.
setsockopt(): The default behavior of TCP (implemented by the OS) is that if you bind to a port and terminate your program, the OS makes you wait for a minute before anyone else can bind to that port again. Setting the SO_REUSEADDR socket option disables the waiting, which makes rapid debugging easier.
bind(): Associate a socket with the IP address and port on which it should listen for incoming connections. A machine can have more than one network interface or IP address, usually if it connects to two different networks. Assign the INADDR_ANY macro to the sockaddr's address to serve content on all the server’s IP interfaces.
listen(): After binding to an address and port, use listen to begin allowing client connections. This function essentially opens the socket for business. The backlog parameter defines how many clients are allowed to wait in a queue for your server to accept them.
while(1): A server is always on: enter an infinite loop, where the main body of the work is going to happen. We declare a second sock integer that will eventually represent a new client connection.
accept(): Finally, call accept to connect to a new client. You pass the server socket as a parameter to accept. On success, it returns a new socket that represents your connection to the new client. Use that newly returned socket to communicate with the client via send and recv.
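The sequence of calls described above might look roughly like this. It is a sketch and not the starter code: make_server_sock and serve_forever are hypothetical names, and the per-client handler is passed in as a function pointer purely to keep the example self-contained.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket listening on port on all
 * interfaces. Returns the socket descriptor, or -1 on error. */
int make_server_sock(unsigned short port, int backlog) {
    int server_sock = socket(AF_INET, SOCK_STREAM, 0);
    if (server_sock < 0) {
        return -1;
    }
    int yes = 1;  /* allow rebinding to the port right after a restart */
    if (setsockopt(server_sock, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(yes)) < 0) {
        close(server_sock);
        return -1;
    }
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(server_sock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(server_sock, backlog) < 0) {
        close(server_sock);
        return -1;
    }
    return server_sock;
}

/* Hypothetical week 1 loop: accept clients one at a time and hand each
 * new socket to handle_client. server_sock is never used for send/recv. */
void serve_forever(int server_sock, void (*handle_client)(int)) {
    while (1) {
        int sock = accept(server_sock, NULL, NULL);
        if (sock < 0) {
            continue;   /* this accept failed; keep serving other clients */
        }
        handle_client(sock);
        close(sock);
    }
}
```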

5.2. Threading

Some clients may be slow to complete a connection or send a request. To prevent all other clients waiting on one slow client, your server should be able to serve multiple clients concurrently, not just back-to-back. For this part of the lab, we’ll use multithreading with pthreads to handle concurrent connections.

Use pthread_create and pthread_detach after calling accept for each new client.
Unlike many of your prior experiences with threading (e.g., parallel GOL in CS 31), the threads in this assignment don’t need to coordinate their actions. This makes the threading relatively easy, and it’s something that can be added on after the main serving functionality is implemented. When starting out, organize your code such that it calls a function on any newly-accepted client sockets, and let that function do all the work for that connection. This will make adding pthread support quite simple!
In your starter code you should see a thread_detach_example.c. This is very similar to what you will be implementing. The example program takes the number of threads as an input argument, and then it creates and detaches each thread. Each thread independently runs thread_function. The example passes one argument to each thread, an integer pointer. In your server, this will point to the socket descriptor (integer) for a newly-accepted client.
- Inside of thread_function, you just have to cast the argument back from a generic void * pointer to an integer pointer. Then, you can dereference that pointer to get the value, after which you can free it. This is the main complexity in this part of the lab: wrangling pointers!
Finally, we have a call to the pthread_detach function. This basically says: I am creating a thread, it is going to go do something in the background, and I don’t need the thread to return a result; it should just exit once it’s done executing. Therefore the return value of our thread_function is NULL to satisfy a void * return value. By detaching a thread, we are telling the OS to just clean it up once it’s done executing our thread_function, without the need for calling pthread_join.
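The pointer handling described above could be sketched as follows. This is not the contents of thread_detach_example.c; the helper name spawn_client_thread is hypothetical, and the request handling itself is left as a placeholder comment.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Runs in the new thread. arg is a heap-allocated int holding the
 * client socket descriptor. */
void *thread_function(void *arg) {
    int *sock_ptr = (int *)arg;   /* cast back from the generic pointer */
    int sock = *sock_ptr;         /* copy the value out ... */
    free(sock_ptr);               /* ... then release the heap copy */

    /* Placeholder: receive the request and send the response on sock. */

    close(sock);
    return NULL;                  /* nothing to return from a detached thread */
}

/* Hypothetical helper: start a detached thread for one accepted client.
 * Returns 0 on success, -1 on failure. */
int spawn_client_thread(int sock) {
    int *sock_ptr = malloc(sizeof(int));
    if (sock_ptr == NULL) {
        return -1;
    }
    *sock_ptr = sock;
    pthread_t tid;
    if (pthread_create(&tid, NULL, thread_function, sock_ptr) != 0) {
        free(sock_ptr);
        return -1;
    }
    pthread_detach(tid);          /* no pthread_join needed later */
    return 0;
}
```

Giving each thread its own heap-allocated copy of the descriptor avoids a race in which the main loop overwrites a shared variable with the next accepted socket before the thread has read it.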

5.3. Providing a directory listing

Your web server should respond with a list of files when the user requests a directory that does not contain an index.html file.
- Similar to opening a file with fopen and reading from a file with fread, you can read the contents of a directory using the opendir, readdir and closedir calls.
- That is, if you have a valid directory path, you can pass it to opendir and store the result in a (DIR * ) pointer. Just like a file pointer, every time you open a directory, you should close the directory with closedir.
- Next, you can keep calling readdir, which returns info for one file, on that (DIR * ) pointer until it returns NULL. See man readdir for details. DO NOT attempt to free the struct dirent pointer that readdir returns — the man page makes it very clear that you should not attempt to free that pointer!
- You can follow the following html format to create your directory listing (substitute /path with the actual path):
  <html> Directory listing for: /path/ <br/> <ul> <li><a href="your_dir_listing_with_slash/">"dir_name"</a></li> .... </ul> </html>
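The opendir, readdir, and closedir pattern above might be combined with the HTML format like this. The function name build_listing and its buffer-based interface are assumptions made for illustration; the listing is built entirely in memory, so no files are created on disk. Note that this sketch also lists the . and .. entries that readdir reports.

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: write an HTML listing of dir_path into out.
 * url_path is the requested path (ending in '/') shown in the heading.
 * Returns the length written, or -1 on error or if out is too small. */
int build_listing(const char *dir_path, const char *url_path,
                  char *out, size_t out_size) {
    DIR *dir = opendir(dir_path);
    if (dir == NULL) {
        return -1;
    }
    int ok = 1;
    size_t len = 0;
    int n = snprintf(out, out_size,
                     "<html> Directory listing for: %s <br/> <ul>", url_path);
    if (n < 0 || (size_t)n >= out_size) ok = 0; else len = (size_t)n;

    struct dirent *entry;   /* owned by readdir: never free this pointer */
    while (ok && (entry = readdir(dir)) != NULL) {
        char full[4096];
        struct stat st;
        snprintf(full, sizeof(full), "%s/%s", dir_path, entry->d_name);
        /* Directories get a trailing slash in both the link and its label. */
        const char *slash =
            (stat(full, &st) == 0 && S_ISDIR(st.st_mode)) ? "/" : "";
        n = snprintf(out + len, out_size - len,
                     "<li><a href=\"%s%s\">%s%s</a></li>",
                     entry->d_name, slash, entry->d_name, slash);
        if (n < 0 || (size_t)n >= out_size - len) ok = 0; else len += (size_t)n;
    }
    closedir(dir);          /* every opendir needs a matching closedir */

    if (ok) {
        n = snprintf(out + len, out_size - len, " </ul> </html>");
        if (n < 0 || (size_t)n >= out_size - len) ok = 0; else len += (size_t)n;
    }
    return ok ? (int)len : -1;
}
```

The links are relative, which relies on the assumption that a requested directory path ends in a trailing /.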

6. Submitting

Please remove any excessive debugging output prior to submitting.

To submit your code, commit your changes locally using git add and git commit. Then run git push while in your lab directory.