CS43 Lab 1: A Basic Web Client

Due: Tuesday, September 10, 11:59 PM

1. Handy References:

Departmental git resources.
RFC 1945: HTTP 1.0 Specification. Sections 4, 5, and 6 are probably the most helpful.
Manual pages for socket, send, and recv.
Some example files for testing.
C Pointers and memory layout, Strings, C-Arrays, Text files

2. Lab 1 Goals

Use git to clone a repository full of starter code.
Apply top-down design to write a web-client.
Practice with C networking basics: sockets, send(), receive() and DNS (name-to-IP resolution).
Manipulate HTTP headers with C string functions.

3. Overview

In this lab we will write our first networking application — a barebones web client. A web client corresponds with a web server, and they both "speak" HTTP, the Hypertext Transfer Protocol.

HTTP uses a client-server model of communication: in which the client initiates communication, and a server that is always-on, passively waits and responds. The web client and web server correspond using HTTP queries and responses. One way to think of HTTP, is as a document retrieval system over the web. We will look into the HTTP protocol format in a lot more detail in class tomorrow.

4. Lab Requirements

We will write a command-line program called lab1 that takes a URL as its only parameter, retrieves the indicated file, and stores it in the local directory with the appropriate filename. If the URL does not end in a filename, your program should automatically name the file index.html.

For example:

# This should create a local file named 'pride_and_prejudice.txt' containing lots of text.
 $ ./lab1 http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt

4.1. Workflow of your program

Given a URL of the form http://host/path, construct a HTTP query to send to the web-server.
1. separate the host from the file portions using string parsing
2. look up the hostname via DNS to get its IP address necessary to route the packet
3. create a socket and connect to the IP address above, on port 80, the port used for HTTP.
4. generate an HTTP 1.0 request for the file
5. use the send() system-call to send the request over the network
Parse the HTTP response headers received from the web-server and either:
1. faithfully download byte-for-byte copies of both text (e.g., html) and binary files (e.g., images) using the recv() system call.
2. report any errors or unexpected responses the web-client encounters.
Finally, if there are no errors, save the file according to the name of the file in the URL argument.

4.2. Try wget

Running ./lab1 should have the same functionality as wget. Try the following on the command line to get a sense of how wget works (your code does not need to print extra information that wget does).

# This should create a local file named 'pride_and_prejudice.txt' containing lots of text.
 $ wget http://demo.cs.swarthmore.edu/example/pride_and_prejudice.txt

# This should create a local file named 'index.html' containing the demo server's home page contents.
 $ wget http://demo.cs.swarthmore.edu

# This should return a 404 not found error
$ wget http://demo.cs.swarthmore.edu/example/hi.txt

# This should create a local image file named 'fiona.jpg' containing a cute cat picture.
 $ wget http://demo.cs.swarthmore.edu/example/fiona.jpg

5. Try TELNET!

To see what the HTTP message format looks like, let’s generate a HTTP request from the command-line using telnet.

Wait, you can just look at the HTTP message structure? Yes! HTTP is an all-text protocol (it’s really old), meaning we can actually read the request and response fields. In later labs, we will see application-layer protocols that are all binary for which we will need other tools to parse the header fields.

Telnet is an application layer protocol that was used for remote login to access network servers (this has been replaced today by SSH). Try the following in your terminal window:

telnet demo.cs.swarthmore.edu 80

GET / HTTP/1.0
Host: demo.cs.swarthmore.edu

(press return twice after the last line)

Request message breakdown:

demo.cs.swarthmore.edu: hostname to connect
80: the port number that serves as the identifier for an HTTP message so that when the client receives this message
GET: HTTP "verb" to request data.
/: path being requested
HTTP/1.0: HTTP version being "spoken" by the client.
Host:hostname: Required header field
\r\n\r\n: Two CRLF s (carriage-returns) to indicate end of the header

Response message:

HTTP/1.0 200 OK
Vary: Accept-Encoding
Content-Type: text/html
Accept-Ranges: bytes
ETag: "316912886"
Last-Modified: Wed, 04 Jan 2017 17:47:31 GMT
Content-Length: 1062
Connection: close
Date: Tue, 03 Sep 2019 14:06:05 GMT
Server: lighttpd/1.4.35

<body of file>

HTTP/1.0: HTTP version spoken by webserver
200 OK: File exists, being transferred.
Optional Header fields
\r\n\r\n: Two CRLF s (two carriage-returns) to indicate the end of the header.
Body of message follows.

Record your responses to your TELNET queries on the week1-lab sheet on github

6. Socket Programming

As we saw yesterday, a protocol defines both the message + header format and transfer procedure. We saw that like a human protocol (initial Hello), the network protocol must first establish a connection, before we start sending and receiving data. The application layer HTTP message, is encapsulated in a transport layer protocol, TCP. TCP is used as the transport protocol of choice for in-order, reliable delivery. To start a TCP connection, we associate the client with a socket. You can think of a socket like a mailbox to send and receive mail.

Socket Programming: the Socket interface sits between the Application layer and the transport layer. We associate the client and and server with a socket to communicate across the network. The routers in-between do not "speak" the transport layer or application layer protocols. This is analogous to not having the postal mail system look inside your mail package!

Figure 1. The Socket interface sits between the Application layer and the transport layer. We associate the client and and server with a socket to communicate across the network. The routers in between do not "speak" the transport layer or application layer protocols. This is analogous to not having the postal mail system look inside your mail package!

7. Socket system-calls

Once we establish a socket on the client side, the following system-calls are used to send, receive data and eventually close the socket.

socket(): create a new communication endpoint
connect(): actively attempt to establish a connection
send(): send some data over a connection
recv(): receive some data over a connection
close(): close the connection.

Work through the system calls given in the starter code and fill in the weekly lab sheet on github

8. String Parsing and Name-to-IP resolution

As stated in Lab requirements your lab1 program takes in a URL as its only parameter, retrieves the indicated file, and stores it in the local directory with the appropriate filename.

If the URL does not end in a filename, your program should automatically name the file index.html.
You may assume that the URL will be no more than 100 characters long and that it will be of the form http://host/path.
The path may or may not be an empty string, and may or may not contain multiple slashes (for subdirectories), and may or may not contain a file name.
If no path is given, your client should request: /. The server will send you back an index.html file, if it has one.

You may assume that the files you’ll be retrieving are no larger than one megabyte. This means you can statically declare storage space for the response, which makes life a bit easier.

8.1. Resolving a hostname to an IP address

The host portion may be an IP address or a hostname like demo.cs.swarthmore.edu. Socket programming requires an IP address for communication (e.g., 130.58.68.137), so when given hostnames, you’ll need to query the domain name system (DNS) to find their corresponding IP address.
To look up an IP address for a given host name, use getaddrinfo() (man getaddrinfo on the command line will give you the details), or consult the getaddrinfo.c example. We will cover DNS in much more detail later in the course, so for now, you can treat it like a black box that magically converts hostnames to IP addresses.

9. Miscellaneous hints and background information

Good systems programming involves:

writing a small bit of code
testing
brief comments
repeat step 1

Students who follow this have a much higher rate of succeeding in the course!

All HTTP headers are ASCII string characters, so you can use the str family of functions to manipulate them safely.

Do NOT use strlen, or any other string functions, on the body of the response. The response body is not necessarily a string. In some cases (e.g., html responses) it will be, but in other cases (e.g., image files) it won’t be. Remember that the C string functions look for, and typically terminate when they find, the null terminator character. A null terminator is nothing more than a byte whose value is zero (0). Such bytes are LIKELY TO BE PRESENT in binary response data. If you call strlen() on binary data and it finds a 0, it will stop and return the WRONG ANSWER to you.

"But, if I can’t call strlen() on the response, how will I know how much data I received?"

The recv() function’s return value will tell you how many bytes you received every time you call it. Likewise, the send() function will tell you how many bytes you successfully transmitted.

You should ALWAYS check the return values of these functions because the answer may not be what you expect. That is, even if you tell recv() to get 1000 bytes, the call may return with fewer bytes, and the only way you’ll know is to check the return value. Likewise, you may tell send() to transmit 1000 bytes, but it may only have room to buffer fewer bytes. You can’t just assume that all 1000 bytes were sent! Instead, check the return value of send() to see if (or which) bytes need to be resent.
For this lab assignment, your life will be easier if you call send() and recv() each in exactly one place (inside a loop). Use send() in a loop to send the entire request and recv() in a loop to read the entire response. If recv() returns 0, it means you’ve reached the end of the data.

Use HTTP version 1.0 — version 1.1 can get a lot more complicated. The subset of the HTTP 1.0 protocol you’ll need to implement for this assignment is quite small, but you may find the full protocol specification to be helpful.
Section 2.2 in the book should also be helpful. Your book talks about the "request line" and "header lines" for an HTTP request. You will only need to use the request line and the host line of the header.
Good functions to use for handling filenames and text include: snprintf, sscanf, strstr, and strchr . You can learn more about these and other useful functions (e.g., send and recv ) by reading their man pages. For example, try man snprintf on the command line.
Newlines, which signal the end of a message in many protocols, are represented in HTTP as \r\n, not just \n.
You will need to remove the HTTP headers from the web server’s response before saving the data to a file.
Spend some time thinking about how to do the string manipulation. It does not need to be complex. The complete program, including comments, error handling etc. can be written in about 100-150 leisurely lines.
The fopen , fwrite , and fclose functions may be useful for writing the output file.

10. Testing

To test your program, you’ll want to ensure that the files it’s saving are identical to the originals.
For a quick check, you can open the file in a browser, and it should appear like the original website. Note that appearance alone does NOT guarantee that the file is byte-for-byte identical.
An easy and more precise way to check that the files are correct is to use wget, which downloads files much like your lab program, to retrieve a correct copy of the file. You will need to use md5sum to make sure binary files (images, pdfs) are identical. You can then compare wget 's file with yours. Example below:
```
$ md5sum index.html index.html.1
937e1d7af5e5cc0ce63694cdd2969233  index.html
937e1d7af5e5cc0ce63694cdd2969233  index.html.1
```
For text files, you can use diff to see if the files are identical. For all files (text and binary) you can use something like md5sum to generate a hash of the two files. If the hashes differ, so do the files. If the files are identical you will see no output:
```
$ diff index.html index.html.1
$
```
If the files are not identical, diff will show you the lines that differ. For e.g.,
```
diff index.html index.html.1
21a22,23
> </body>
> </html>
>
```
You should run valgrind, as you add to your code base, to make sure your code has no memory leaks.

11. Grading Rubric

Total: 3 points

1/2 point for completing weekly-lab questions and lab review.
1 point for transferring text files correctly - MD5sums must match.
1/2 point for transferring binary image files correctly - MD5sums must match.
1/2 point for correctly identifying and reporting error messages.
1/4 point for naming files correctly.
1/4 point for no valgrind errors.

12. Submitting

Please remove any debugging output prior to submitting.

To submit your code, simply commit your changes locally using git add and git commit. Then run git push while in your lab directory.