CS 43 — Lab 1: A Basic Web Client

Due: Thursday, February 10 @ 11:59 PM

1. Overview

Please familiarize yourself with the course’s partnership expectations before starting the lab.

In this lab we’ll write our first networking application — a barebones web client. A web client communicates with a web server, and they both "speak" HTTP, the Hypertext Transfer Protocol.

HTTP uses a client-server model of communication, in which the client (your lab code) initiates communication and an always-on server passively waits for requests and responds to them. The web client and web server communicate using HTTP requests and responses. We’ll look into the HTTP protocol format in a lot more detail in class on Tuesday.

1.1. Goals

Use git to clone a repository full of starter code.
Apply top-down design to write a web client.
Practice with C networking basics: sockets, send(), recv(), and DNS (name-to-IP resolution).
Manipulate HTTP headers with C string functions.

1.2. Handy References

Departmental git resources.
RFC 1945: HTTP 1.0 Specification. Sections 4, 5, and 6 are probably the most helpful.
Manual pages for socket, connect, send, and recv.
Some example files for testing.
Refreshers on pointers and strings.

1.3. Lab Recordings

Week 1

Week 2

2. Requirements

We’ll write a command-line program named lab1 that takes a URL as its only parameter. It should retrieve the file specified in the URL and store it in the local directory (the one lab1 was run from) under the file name given in the URL. If the URL does not end in a file name, your program should automatically name the file index.html.

For example:

# This should create a local file named 'pride_and_prejudice.txt' containing lots of text.
 $ ./lab1 http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt

2.1. Workflow of Your Program

The high-level tasks your lab1 web client needs to perform are:

Given a URL of the form http://host/path, construct an HTTP request to send to the web server.
1. isolate the host and the file portions using string functions
2. look up the server’s hostname via DNS to get its IP address so that you can address data to it
3. create a socket and connect to the server’s IP address, on port 80, the port used for HTTP
4. generate an HTTP 1.0 request string for the specified file path
5. use the send system call to send the request over the network
Receive and interpret the server’s response.
1. use the recv system call to receive the server’s response, in full
2. inspect the HTTP response code received from the web server
3. if there are no errors, open a file for writing (named according to the name of the file in the URL argument) and save the body of the HTTP response to the file
4. if there are errors, report them and exit
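
As a rough illustration of steps 2 and 3 of the request phase (DNS lookup, then socket creation and connect), here is a minimal sketch. The helper name connect_to_host is hypothetical and is not part of the starter code. The sketch assumes IPv4 and TCP, and the provided getaddrinfo.c file may organize the lookup differently.

```c
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the starter code): resolve `host` and
 * connect to it on `port` (for example "80"). Returns a connected socket
 * descriptor, or -1 on failure. */
int connect_to_host(const char *host, const char *port) {
    struct addrinfo hints, *results, *p;
    int sock = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;       /* assumption: IPv4 only */
    hints.ai_socktype = SOCK_STREAM; /* TCP */

    int rc = getaddrinfo(host, port, &hints, &results);
    if (rc != 0) {
        /* getaddrinfo reports errors through its return value, not errno */
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }

    /* Try each returned address until one connects. */
    for (p = results; p != NULL; p = p->ai_next) {
        sock = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
        if (sock < 0) {
            perror("socket");
            continue;
        }
        if (connect(sock, p->ai_addr, p->ai_addrlen) == 0) {
            break; /* connected */
        }
        perror("connect");
        close(sock);
        sock = -1;
    }

    freeaddrinfo(results);
    return sock;
}
```

A call such as connect_to_host("demo.cs.swarthmore.edu", "80") would return a descriptor you can pass to send() and recv(). The same call works when the host is already a dotted IP address.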

2.2. Client Behavior Expectations

For full credit:

Your client should faithfully download and save byte-for-byte identical copies of the files it’s asked to retrieve. It should work for both text (e.g., html) and binary files (e.g., images). See: Examples and Testing below.
Your client should report any errors or unexpected responses it encounters. If you get any HTTP response code other than 200, simply report the code you received and terminate.
Your client should name the files it saves according to the name of the file in the URL argument. That is, everything after the final / in the URL should be considered the file name to use when storing the file locally. If there’s nothing after the final /, use the name index.html.
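
One possible way to apply the naming rule is sketched below. The helper name filename_from_url is hypothetical, and the sketch assumes the URL starts with http:// as described in the assumptions for this lab. It skips past the scheme first so that the slashes in http:// are never mistaken for the final / of the path.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy the local file name for `url` into `out`.
 * Uses everything after the final '/', or "index.html" if nothing follows.
 * Assumes `url` begins with "http://". */
void filename_from_url(const char *url, char *out, size_t out_size) {
    const char *after_scheme = url + strlen("http://");
    const char *last_slash = strrchr(after_scheme, '/');

    if (last_slash == NULL || last_slash[1] == '\0') {
        /* No path at all, or the path ends in '/'. */
        snprintf(out, out_size, "index.html");
    } else {
        snprintf(out, out_size, "%s", last_slash + 1);
    }
}
```

With this sketch, the example URL from the Requirements section yields pride_and_prejudice.txt, while http://demo.cs.swarthmore.edu/ and http://demo.cs.swarthmore.edu both yield index.html.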

2.3. Assumptions

You may assume that the URL will be no more than 100 characters long and that it will be of the form http://host/path, where:
- The host portion will be an IP address (e.g., 130.58.68.26) or a hostname (e.g., demo.cs.swarthmore.edu). See Other Reference Material and the provided getaddrinfo.c file for more info about DNS.
- The path may or may not be an empty string, may or may not contain multiple slashes (for subdirectories), and may or may not contain a file name. If no path is given, your client should request: /. The server will send you back an index.html file, if it has one.
You may assume that the files you’ll be retrieving are no larger than one megabyte. This means you can statically declare storage space for the server’s response, which makes life a bit easier.
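
Under these assumptions (at most 100 characters, always of the form http://host/path), splitting the URL can stay simple. The sketch below is one way to do it, not the required design. The names split_url and MAX_URL are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAX_URL 101 /* 100 characters plus the null terminator */

/* Hypothetical helper: split "http://host/path" into host and path.
 * Both output buffers must hold at least MAX_URL bytes.
 * Returns 0 on success, -1 if the URL does not start with "http://". */
int split_url(const char *url, char *host, char *path) {
    const char *prefix = "http://";
    size_t prefix_len = strlen(prefix);

    if (strncmp(url, prefix, prefix_len) != 0) {
        return -1;
    }

    const char *rest = url + prefix_len;   /* "host/path" or just "host" */
    const char *slash = strchr(rest, '/'); /* first '/' starts the path */

    if (slash == NULL) {
        /* No path given: request "/" */
        snprintf(host, MAX_URL, "%s", rest);
        snprintf(path, MAX_URL, "/");
    } else {
        /* Copy only the characters before the slash into host. */
        snprintf(host, MAX_URL, "%.*s", (int)(slash - rest), rest);
        snprintf(path, MAX_URL, "%s", slash);
    }
    return 0;
}
```

Because the path keeps its leading /, it can be placed directly into the request line, and a URL with no path falls back to / as described above.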

2.4. Checkpoint

To be on track, by the start of the next lab session, you should have finished:

Extract the host and path portions of the input URL
Construct the HTTP request by filling in the provided template
Extract just the file name from the path

A good stretch goal is:

Start trying to send the request to the server and maybe print the response, but it’s ok if this part isn’t solid yet

It’s fine to defer until next week:

Making send/recv calls robust
Parsing response headers
Saving the output file

3. Examples and Testing

To test your program, you’ll want to ensure that the files it’s saving are identical to the originals.

For a quick check, you can open the file in a browser, and it should appear like the original website. Note that appearance alone does NOT guarantee that the file is byte-for-byte identical.

An easy and more precise way to check that the files are correct is to use wget, which downloads files much like your lab program, to retrieve a correct copy of the file. Run your lab code on a URL first, then run wget on the same URL — it’ll store a (correct) copy of the file of the same name with .1 appended to the end of it.

For text files, you can use diff -u to see if the files are identical:

diff -u index.html index.html.1

If the files are identical you will see no output. If the files are not identical, diff will show you the lines that differ. Examining the differences may help you to narrow down what’s going wrong while debugging.

For all files (text and binary), you can use something like md5sum to generate a hash of each file, then compare the hash of wget's copy with the hash of yours. If the hashes differ, so do the files. For binary files (images, PDFs), you will need a hash comparison like this to make sure the files are identical.

$ md5sum index.html index.html.1
937e1d7af5e5cc0ce63694cdd2969233  index.html
937e1d7af5e5cc0ce63694cdd2969233  index.html.1

Here, the hash is 937e1d7af5e5cc0ce63694cdd2969233, and it matches for both files.

4. Tips & FAQ

Use HTTP version 1.0 — version 1.1 can get a lot more complicated. The subset of the HTTP 1.0 protocol you’ll need to implement for this assignment is quite small, but you may find the full protocol specification to be helpful.
All HTTP headers are ASCII text, so you can safely use the str family of functions to manipulate them.
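
Because the status line is plain ASCII, a function like sscanf can pull out the response code. The sketch below assumes the response begins with a status line such as HTTP/1.0 200 OK and that the buffer has a zero byte somewhere after the received data (for example, because you reserved one extra byte and set it to 0). The name parse_status_code is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: extract the numeric status code from the start of
 * a response, e.g. "HTTP/1.0 200 OK\r\n...". Returns the code, or -1 if
 * the status line does not have the expected shape.
 * Assumes `response` is null-terminated. */
int parse_status_code(const char *response) {
    int major, minor, code;
    if (sscanf(response, "HTTP/%d.%d %d", &major, &minor, &code) != 3) {
        return -1;
    }
    return code;
}
```

Anything other than 200 would then be reported before the program exits.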

Do NOT use strlen, or any other string functions, on the body of the response. The response body is not necessarily a string. In some cases (e.g., html responses) it will be, but in other cases (e.g., image files) it won’t be. Remember that the C string functions look for, and typically terminate when they find, the null terminator character. A null terminator is nothing more than a byte whose value is zero (0). Such bytes are LIKELY TO BE PRESENT in binary response data. If you call strlen() on binary data and it finds a 0, it will stop and return the WRONG ANSWER to you.
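
To make the problem concrete, here is a tiny sketch using made-up bytes (not taken from any real file). The names sample, sample_len, and naive_length are purely illustrative.

```c
#include <stddef.h>
#include <string.h>

/* Five bytes of made-up binary data with a zero byte in the middle,
 * followed by a final zero so that calling strlen() is well-defined. */
static const char sample[] = { (char)0x89, 'P', 0x00, 'N', 'G', 0x00 };

/* Number of real data bytes (everything except the final zero): 5. */
static const size_t sample_len = sizeof(sample) - 1;

/* What strlen() reports: it stops at the first zero byte, so this
 * returns 2 even though there are 5 bytes of data. */
size_t naive_length(void) {
    return strlen(sample);
}
```

Writing only naive_length() bytes to a file would silently drop the last three bytes of the data.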

"But, if I can’t call strlen() on the response, how will I know how much data I received?"

The recv() function’s return value will tell you how many bytes you received every time you call it. Likewise, the send() function will tell you how many bytes you successfully transmitted.

You should ALWAYS check the return values of these functions because the answer may not be what you expect. That is, even if you tell recv() to get 1000 bytes, the call may return with fewer bytes, and the only way you’ll know is to check the return value. Likewise, you may tell send() to transmit 1000 bytes, but it may only have room to buffer fewer bytes. You can’t just assume that all 1000 bytes were sent! Instead, check the return value of send() to see if (or which) bytes need to be resent.

For this lab assignment, your life will be easier if you call send() and recv() each in exactly one place (inside a loop). Use send() in a loop to send the entire request and recv() in a loop to read the entire response. If recv() returns 0, it means you’ve reached the end of the data.
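
The two loops might look something like the sketch below. The helper names send_all and recv_all are hypothetical, and the sketch assumes a blocking, already-connected socket. Your own structure may differ.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: keep calling send() until all `len` bytes of `buf`
 * have been handed to the OS. Returns 0 on success, -1 on error. */
int send_all(int sock, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        /* Resume from the first byte that has not been sent yet. */
        ssize_t n = send(sock, buf + sent, len - sent, 0);
        if (n < 0) {
            perror("send");
            return -1;
        }
        sent += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: keep calling recv() until the other side closes
 * the connection (recv() returns 0) or `cap` bytes have been stored.
 * Returns the total number of bytes received, or -1 on error. */
ssize_t recv_all(int sock, char *buf, size_t cap) {
    size_t total = 0;
    while (total < cap) {
        ssize_t n = recv(sock, buf + total, cap - total, 0);
        if (n < 0) {
            perror("recv");
            return -1;
        }
        if (n == 0) {
            break; /* end of data */
        }
        total += (size_t)n;
    }
    return (ssize_t)total;
}
```

The value returned by recv_all is the response length to use from then on, in place of strlen.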

4.1. String Manipulation

Spend some time thinking about how to do the string manipulation. It does not need to be complex — refer back to your lab 0 code for inspiration.
Good functions to use for handling filenames and text include: snprintf, sscanf, strstr, and strchr. You can learn more about these and other useful functions (e.g., send and recv) by reading their man pages. For example, try man snprintf on the command line.
The newlines, which signal the end of a message in many protocols, are represented in HTTP as \r\n, not just \n.
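
As an illustration of snprintf together with \r\n line endings, the sketch below builds a minimal HTTP 1.0 GET request. The helper name build_request is hypothetical, and the exact headers are an assumption: the template provided with the starter code may include different ones. The request ends with an empty line (a bare \r\n), which tells the server the headers are finished.

```c
#include <stdio.h>

/* Hypothetical helper: write an HTTP 1.0 GET request for `path` on `host`
 * into `out`. Returns the request length in bytes, or -1 if it did not
 * fit. The Host header is an assumption; the provided template may use
 * different headers. */
int build_request(char *out, size_t out_size, const char *host, const char *path) {
    int n = snprintf(out, out_size,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"
                     "\r\n",
                     path, host);
    if (n < 0 || (size_t)n >= out_size) {
        return -1; /* encoding error or output truncated */
    }
    return n;
}
```

The return value doubles as the number of bytes to pass to send(), so there is no need to measure the request again afterward.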

4.2. Writing Output

The fopen, fwrite, and fclose functions may be useful for writing output files.
Make sure you do not save the HTTP headers from the web server’s response as part of the file’s contents.
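
One way to skip the headers is to look for the blank line (\r\n\r\n) that separates them from the body and write only what follows it. The sketch below uses a length-bounded search with memcmp and hypothetical helper names (find_body_offset, save_body). It assumes the whole response is already in memory along with its length in bytes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the offset of the first body byte in a
 * response of `len` bytes, i.e. just past the "\r\n\r\n" that ends the
 * headers. Returns -1 if no such marker is found. */
long find_body_offset(const char *resp, size_t len) {
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(resp + i, "\r\n\r\n", 4) == 0) {
            return (long)(i + 4);
        }
    }
    return -1;
}

/* Hypothetical helper: write only the body bytes to `filename`.
 * Returns 0 on success, -1 on failure. */
int save_body(const char *filename, const char *resp, size_t len) {
    long start = find_body_offset(resp, len);
    if (start < 0) {
        fprintf(stderr, "no end-of-headers marker found\n");
        return -1;
    }
    FILE *f = fopen(filename, "wb"); /* "b": body may be binary */
    if (f == NULL) {
        perror("fopen");
        return -1;
    }
    size_t body_len = len - (size_t)start;
    /* fwrite takes an explicit byte count, so zero bytes are preserved. */
    size_t written = fwrite(resp + start, 1, body_len, f);
    if (fclose(f) != 0 || written != body_len) {
        return -1;
    }
    return 0;
}
```

The body length comes from subtraction on the received byte count, never from strlen, so binary files are written in full.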

4.3. General C Programming

Good systems programming involves:
1. writing a small bit of code
2. testing it
3. adding brief comments
4. repeating
Test early and often, and don’t write new code until you’ve ironed out any problems with your existing code!
Run valgrind as you go, rather than waiting until the end. It will help you identify problems sooner!
If a system call fails, the perror() function will typically tell you why, in a nice, human-readable way. Take advantage of it, and don’t assume why a system call might be failing!

5. Other Reference Material

This lab comes early in the semester, when we haven’t seen much course content yet. I’ve put together some brief reference material to help bootstrap you on some of the tools we’ll be using for this lab. I do NOT expect that you will finish this lab as an expert on these topics, just that you’ll have enough background to finish the lab. We’ll cover the details in class soon enough.

6. Submitting

Please remove any excessive debugging output prior to submitting.

To submit your code, commit your changes locally using git add and git commit. Then run git push while in your lab directory.