031.3 Lesson 1
| Certificate: | Web Development Essentials |
|---|---|
| Version: | 1.0 |
| Topic: | 031 Software Development and Web Technologies |
| Objective: | 031.3 HTTP Basics |
| Lesson: | 1 of 1 |
Introduction
The HyperText Transfer Protocol (HTTP) defines how a client asks the server for a specific resource. Its working principle is quite simple: the client creates a request message identifying the resource it needs and forwards that message to the server via the network. In turn, the HTTP server evaluates where to extract the requested resource and sends a response message back to the client. The reply message contains details about the requested resource, followed by the resource itself.
More specifically, HTTP is the set of rules that define how the client application should format request messages that will be sent to the server. The server then follows HTTP rules to interpret the request and format reply messages. In addition to requesting or transferring requested content, HTTP messages contain extra information about the client and server involved, about the content itself, and even about its unavailability. If a resource cannot be sent, a code in the response explains the reason for the unavailability and, if possible, indicates where the resource was moved.
The part of the message that defines the resource details and other context information is called the header of the message. The part following the header, which contains the content of the corresponding resource, is called the payload of the message. Both request messages and response messages can have a payload, but in most cases, only the response message has one.
The Client’s Request
The first stage of an HTTP data exchange between the client and the server is initiated by the client, when it writes a request message to the server. Take, for example, a common browser task: to load an HTML page from a server hosting a website, such as https://learning.lpi.org/en/. The address, or URL, provides several pieces of relevant information. Three pieces of information appear in this particular example:
- The protocol: HyperText Transfer Protocol Secure (https), an encrypted version of HTTP.
- The web host’s network name (learning.lpi.org).
- The location of the requested resource on the server (the /en/ directory, which in this case is the English version of the home page).
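The three parts listed above can be pulled out of the example URL programmatically. The following is a minimal sketch using Python's standard `urllib.parse` module; the function name `describe_url` and the dictionary keys are illustrative choices, not part of the lesson.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Split a URL into the three parts discussed above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,   # e.g. "https"
        "host": parts.hostname,     # e.g. "learning.lpi.org"
        "path": parts.path or "/",  # fall back to "/" when no path is given
    }


info = describe_url("https://learning.lpi.org/en/")
print(info)
```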
Note
A Uniform Resource Locator (URL) is an address that points to a resource on the Internet. This resource is usually a file that can be copied from a remote server, but URLs can also indicate dynamically generated content and data streams.
How the Client Handles the URL
Before contacting the server, the client needs to convert learning.lpi.org to its corresponding IP address. The client uses another Internet service, the Domain Name System (DNS), to request the IP address of a host name from one or more predefined DNS servers (DNS servers are usually defined automatically by the Internet Service Provider, ISP).
With the server’s IP address, the client tries to connect to the HTTP or HTTPS port. Network ports are identification numbers defined by the Transmission Control Protocol (TCP) to multiplex and identify distinct communication channels between a client and a server. By default, HTTP servers receive requests on TCP ports 80 (HTTP) and 443 (HTTPS).
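As a rough illustration of these two steps, the sketch below picks the default port from the URL scheme and asks the system resolver for the host's IP addresses. The helper names `host_and_port` and `resolve` are hypothetical. `resolve` needs network access and its result depends on the DNS servers in use, so it is only defined here, not called.

```python
import socket
from urllib.parse import urlsplit

# Default TCP ports mentioned in the text.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_and_port(url):
    """Return the host name and the TCP port the client should connect to."""
    parts = urlsplit(url)
    # An explicit port in the URL (host:8080) overrides the default.
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port


def resolve(host, port):
    """Ask the system resolver (which normally uses DNS) for IP addresses."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry ends with a socket address whose first item is the IP.
    return sorted({info[4][0] for info in infos})


print(host_and_port("https://learning.lpi.org/en/"))
```

Calling `resolve(*host_and_port("https://learning.lpi.org/en/"))` on a connected machine would return the addresses the client can try to connect to.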
Note
There are other protocols used by web applications to implement client/server communication. For audio and video calls, for example, it is more appropriate to use WebSockets, a lower level protocol that is more efficient than HTTP for transferring data streams in both directions.
The format of the request message that the client sends to the server is the same in HTTP and HTTPS. HTTPS is already more widely used than HTTP, because all data exchanges between client and server are encrypted, which is an indispensable feature to promote privacy and security on public networks. The encrypted connection is established between client and server even before any HTTP message is exchanged, using the Transport Layer Security (TLS) cryptographic protocol. By doing this, all HTTPS communication is encapsulated by TLS. Once decrypted, the request or response transmitted over HTTPS is no different from a request or response made exclusively over HTTP.
The third element of our URL, /en/, will be interpreted by the server as the location or path for the resource being requested. If the path is not provided in the URL, the default location / will be used. The simplest implementation of an HTTP server associates paths in URLs with files on the file system where the server is running, but this is just one of the many options available on more sophisticated HTTP servers.
The Request Message
HTTP operates through a connection already established between client and server, usually implemented in TCP and encrypted with TLS. In fact, once a connection meeting the requirements imposed by the server is ready, an HTTP request typed by hand in plain text could generate the response from the server. In practice, however, programmers rarely need to implement routines to compose HTTP messages, as most programming languages provide mechanisms that automate the making of the HTTP message. In the case of the example URL, https://learning.lpi.org/en/, the simplest possible request message would have the following content:

    GET /en/ HTTP/1.1
    Host: learning.lpi.org
    User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0
    Accept: text/html

The first word of the first line identifies the HTTP method. It defines which operation the client wants to perform on the server. The GET method informs the server that the client requests the resource that follows it: /en/. Both client and server may support more than one version of the HTTP protocol, so the version to be adopted in the data exchange is also provided in the first line: HTTP/1.1.
Note
A more recent version of the HTTP protocol is HTTP/2. Among other differences, messages written in HTTP/2 are encoded in a binary structure, whereas messages written in HTTP/1.1 are sent in plain text. This change optimizes data transmission rates, but the content of the messages is basically the same.
The header can contain more lines after the first one to contextualize and help identify the request to the server. The Host header field, for example, may appear redundant, because the server’s host has obviously been identified by the client in order to establish the connection and it’s reasonable to assume that the server knows its own identity. Nonetheless, it’s important to state the expected host name in the request header, because it is common practice to use the same HTTP server to host more than one website. (In this scenario, each specific host is called a virtual host.) Therefore, the Host field is used by the HTTP server to identify which website the request refers to.
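A server hosting several websites could pick the right one roughly as follows. This is a simplified sketch, not how any particular server implements virtual hosts. The `VIRTUAL_HOSTS` table, the `www.example.org` entry, and the function names are hypothetical.

```python
from typing import Optional

# Hypothetical table mapping each virtual host to its own directory.
VIRTUAL_HOSTS = {
    "learning.lpi.org": "/var/www/learning.lpi.org",
    "www.example.org": "/var/www/www.example.org",
}


def header_value(request_text: str, name: str) -> Optional[str]:
    """Return the value of a header field, or None if it is absent."""
    lines = request_text.split("\r\n")
    for line in lines[1:]:          # skip the request line
        if line == "":              # a blank line ends the header
            break
        field, _, value = line.partition(":")
        if field.strip().lower() == name.lower():
            return value.strip()
    return None


def select_site(request_text: str) -> Optional[str]:
    """Choose the site directory based on the Host header field."""
    host = header_value(request_text, "Host")
    if host is None:
        return None
    hostname = host.split(":", 1)[0].lower()  # drop an optional ":port"
    return VIRTUAL_HOSTS.get(hostname)


request = "GET /en/ HTTP/1.1\r\nHost: learning.lpi.org\r\n\r\n"
print(select_site(request))
```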
The User-Agent header field contains details about the client program making the request. This field can be used by the server to adapt the response to the needs of a specific client, but it is more often used to produce statistics about the clients using the server.
The Accept field is of more immediate value, because it informs the server about the formats the client accepts for the requested resource. If the client is indifferent about the resource format, the Accept field can specify */* as the format.
There are many other header fields that can be used in an HTTP message, but the fields shown in the example are enough to request a resource from the server.
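To make the layout of the message concrete, here is a small sketch that assembles an HTTP/1.1 request header as bytes. In HTTP/1.1 each line ends with a carriage return and line feed, and an empty line marks the end of the header. The helper name `build_request` is an illustrative choice, and the shortened `User-Agent` value is made up.

```python
def build_request(method, path, headers, version="HTTP/1.1"):
    """Compose a request header: request line, header fields, blank line."""
    lines = [f"{method} {path} {version}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # CRLF after every line, plus one empty line to close the header.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


message = build_request(
    "GET",
    "/en/",
    {
        "Host": "learning.lpi.org",
        "User-Agent": "example-client/0.1",  # hypothetical client name
        "Accept": "text/html",
    },
)
print(message.decode("ascii"))
```

In practice a library such as Python's `http.client` or `urllib.request` would compose this message automatically, as the text notes.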
In addition to the fields in the request header, the client can include other complementary data in the HTTP request that will be sent to the server. If this data consists only of simple text parameters, in the format name=value, they can be added to the path of the GET method. The parameters are embedded in the path after a question mark and are separated by ampersand (&) characters:

    GET /cgi-bin/receive.cgi?name=LPI&email=info@lpi.org HTTP/1.1

In this example, /cgi-bin/receive.cgi is the path to the script on the server that will process and possibly use the parameters name and email, obtained from the request path. The string that corresponds to the fields, in the format name=LPI&email=info@lpi.org, is called the query string and is supplied to the receive.cgi script by the HTTP server that receives the request.
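A script receiving this request could separate the path from the query string and decode the parameters along these lines. This is a sketch using Python's standard `urllib.parse` module; the function name `query_parameters` is an assumption for illustration.

```python
from urllib.parse import parse_qsl, urlsplit


def query_parameters(request_target):
    """Return the path and a dict of the name=value pairs that follow '?'."""
    parts = urlsplit(request_target)
    # parse_qsl splits on "&" and "=" and decodes percent-escapes.
    return parts.path, dict(parse_qsl(parts.query))


path, params = query_parameters(
    "/cgi-bin/receive.cgi?name=LPI&email=info@lpi.org"
)
print(path, params)
```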
When the data is made up of more than short text fields, it’s more appropriate to send it in the payload of the message. In this case, the HTTP POST method must be used so that the server receives and processes the message’s payload, according to the specifications indicated in the request header. When the POST method is used, the request header must provide the size of the payload that will be sent next and how the body is formatted:

    POST /cgi-bin/receive.cgi HTTP/1.1
    Host: learning.lpi.org
    Content-Length: 1503
    Content-Type: multipart/form-data; boundary=------------------------405f7edfd646a37d

The Content-Length field indicates the size in bytes of the payload and the Content-Type field indicates its format. The multipart/form-data format is commonly used by traditional HTML forms that use the POST method, especially forms that upload files; forms that do not specify an encoding send their fields as application/x-www-form-urlencoded instead. In the multipart/form-data format, each field inserted in the request’s payload is separated by the code indicated by the boundary keyword. The POST method should be used only when appropriate, as it uses a slightly larger amount of data than an equivalent request made with the GET method. Because the GET method sends the parameters directly in the request’s message header, no message body has to be transmitted after the header, which can slightly reduce the latency of the exchange.
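The relationship between the payload and the Content-Length field can be sketched as follows. To keep the example short it uses the simpler application/x-www-form-urlencoded format instead of multipart/form-data. The helper name `build_post` is hypothetical.

```python
from urllib.parse import urlencode


def build_post(path, host, fields):
    """Compose a POST request whose payload carries the form fields."""
    # urlencode joins name=value pairs with "&" and escapes special
    # characters ("@" becomes "%40").
    body = urlencode(fields).encode("ascii")
    header_lines = [
        f"POST {path} HTTP/1.1",
        f"Host: {host}",
        f"Content-Length: {len(body)}",  # payload size in bytes
        "Content-Type: application/x-www-form-urlencoded",
    ]
    header = ("\r\n".join(header_lines) + "\r\n\r\n").encode("ascii")
    return header + body


message = build_post(
    "/cgi-bin/receive.cgi",
    "learning.lpi.org",
    {"name": "LPI", "email": "info@lpi.org"},
)
print(message.decode("ascii"))
```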
The Response Header
After the HTTP server receives the request message header, the server returns a response message back to the client. An HTML file request typically has a response header like this:

    HTTP/1.1 200 OK
    Accept-Ranges: bytes
    Content-Length: 18170
    Content-Type: text/html
    Date: Mon, 05 Apr 2021 13:44:25 GMT
    Etag: "606adcd4-46fa"
    Last-Modified: Mon, 05 Apr 2021 09:48:04 GMT
    Server: nginx/1.17.10

The first line provides the version of the HTTP protocol used in the response message, which must correspond to the version used in the request header. Then, still in the first line, the status code of the response appears, indicating how the server interpreted and generated the response for the request.
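A client could take the response header apart roughly as shown below. This is a simplified sketch that ignores details a real client must handle, such as repeated header fields. The function name `parse_response_header` is an illustrative choice.

```python
def parse_response_header(text):
    """Split a response header into version, status code, reason, fields."""
    lines = text.split("\r\n")
    # First line: protocol version, status code, optional description.
    parts = lines[0].split(" ", 2)
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    fields = {}
    for line in lines[1:]:
        if not line:                # blank line: end of the header
            break
        name, _, value = line.partition(":")  # split at the first colon
        fields[name.strip().lower()] = value.strip()
    return version, code, reason, fields


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Length: 18170\r\n"
    "Content-Type: text/html\r\n"
    "Date: Mon, 05 Apr 2021 13:44:25 GMT\r\n"
    "\r\n"
)
version, code, reason, fields = parse_response_header(sample)
print(version, code, reason, fields["content-type"])
```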
The status code is a three-digit number, where the left-most digit defines the response class. There are five classes of status codes, numbered from 1 to 5, each indicating a type of action taken by the server:
- 1xx (Informational): The request was received, continuing the process.
- 2xx (Successful): The request was successfully received, understood, and accepted.
- 3xx (Redirection): Further action needs to be taken in order to complete the request.
- 4xx (Client Error): The request contains bad syntax or cannot be fulfilled.
- 5xx (Server Error): The server failed to fulfill an apparently valid request.
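Because the class is determined by the left-most digit alone, a client can classify any status code with an integer division. A minimal sketch, with `status_class` as an illustrative name:

```python
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Return the class name for a three-digit HTTP status code."""
    if not 100 <= code <= 599:
        raise ValueError(f"not a valid HTTP status code: {code}")
    # Integer division by 100 keeps only the left-most digit.
    return STATUS_CLASSES[code // 100]


for code in (200, 301, 404, 502):
    print(code, status_class(code))
```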
The second and third digits are used to indicate additional details. Code 200, for example, indicates that the request could be answered without any problems. As shown in the example, a brief text description following the response code (OK) can also be provided. Some specific codes are of particular interest to ensure that the HTTP client can access the resource in adverse situations or to help to identify the reason for failure in the event of an unsuccessful request:
- 301 Moved Permanently: The target resource has been assigned a new permanent URL, provided by the Location header field in the response.
- 302 Found: The target resource resides temporarily under a different URL.
- 401 Unauthorized: The request has not been applied because it lacks valid authentication credentials for the target resource.
- 403 Forbidden: The Forbidden response indicates that, although the request is valid, the server is configured not to provide the resource.
- 404 Not Found: The origin server did not find a current representation for the target resource or is not willing to disclose that one exists.
- 500 Internal Server Error: The server encountered an unexpected condition that prevented it from fulfilling the request.
- 502 Bad Gateway: The server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request.
Although they indicate that it was not possible to fulfill the request, status codes 4xx and 5xx at least indicate that the HTTP server is running and is capable of receiving requests. The 4xx codes require an action to be taken on the client, because its URL or credentials are wrong. In contrast, 5xx codes indicate something wrong on the server side. Therefore, in the context of web applications, these two classes of status codes indicate that the source of the error lies in the application itself, either client or server, not in the underlying infrastructure.
Static and Dynamic Content
HTTP servers use two basic mechanisms to fulfill the content requested by the client. The first mechanism provides static content: that is, the path indicated in the request message corresponds to a file on the server’s local file system. The second mechanism provides dynamic content: that is, the HTTP server forwards the request to another program, probably a script, to build the response from different sources, such as databases and other files.
Although there are different HTTP servers, they all use the same HTTP communication protocol and adopt more or less the same conventions. An application that does not have a specific need can be implemented with any traditional server, such as Apache or NGINX. Both are capable of generating dynamic content and providing static content, but there are subtle differences in the configuration of each.
The location of static files to be served up, for example, is defined in different ways in Apache and NGINX. The convention is to keep these files in a specific directory for this purpose, having a name associated with the host, for example /var/www/learning.lpi.org/. In Apache, this path is defined by the configuration directive DocumentRoot /var/www/learning.lpi.org, in a section that defines a virtual host. In NGINX, the directive used is root /var/www/learning.lpi.org in a server section of the configuration file.
Whichever server you choose, the files at /var/www/learning.lpi.org/ will be served via HTTP in almost the same way. Some fields in the response header and their contents may vary between the two servers, but fields like Content-Type must be present in the response header and must be consistent across any server.
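The mapping from a request path to a file under the configured directory can be sketched as below. This is an illustration of the idea only, not the algorithm used by Apache or NGINX. The function name `map_path` and the choice of `index.html` for directory requests are assumptions.

```python
import posixpath


def map_path(document_root, url_path, index_name="index.html"):
    """Translate a request path into a file path below document_root."""
    # Ignore the query string, if any.
    path_only = url_path.split("?", 1)[0]
    # Resolve "." and ".." segments so the result cannot climb above "/".
    normalized = posixpath.normpath("/" + path_only.lstrip("/"))
    # A path ending in "/" names a directory: serve its index file.
    if path_only.endswith("/"):
        normalized = posixpath.join(normalized, index_name)
    return posixpath.join(document_root, normalized.lstrip("/"))


root = "/var/www/learning.lpi.org"
print(map_path(root, "/en/"))
print(map_path(root, "/css/style.css"))  # hypothetical file name
```

Normalizing the path before joining it to the document root keeps a request such as `/../etc/passwd` from reaching files outside the site directory.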
Caching
HTTP was designed to work on any type of Internet connection, fast or slow. Furthermore, most HTTP exchanges have to traverse many network nodes due to the distributed architecture of the Internet. As a result, it is important to adopt some content caching strategy to avoid the redundant transfer of previously downloaded content. HTTP transfers can work with two basic types of cache: shared and private.
A shared cache is used by more than a single client. For example, a large content provider might use caches on geographically distributed servers, so that clients get the data from their nearest server. Once a client has made a request and its response has been stored in a shared cache, other clients making that same request in that same area will receive the cached response.
A private cache is created by the client itself for its exclusive use. It is the type of caching the web browser does for images, CSS files, JavaScript, or the HTML document itself, so they don’t need to be downloaded again if requested in the near future.
Note
Not every HTTP response should be cached. A request using the POST method, for example, implies a response associated exclusively with that particular request, so its response content should not be reused. By default, only responses to requests made using the GET method are cached. Furthermore, only responses with conclusive status codes such as 200 (OK), 206 (Partial Content), 301 (Moved Permanently), and 404 (Not Found) are suitable for caching.
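The default rules stated in this note can be written as a simple check. The sketch below encodes only those two rules (the method and the listed status codes); real caches take further information from the headers into account. The names are illustrative.

```python
# Rules taken from the note above; the status list there is given as
# examples ("such as"), so this set is not exhaustive.
CACHEABLE_METHODS = {"GET"}
CACHEABLE_STATUS = {200, 206, 301, 404}


def is_cacheable(method, status):
    """Apply the default rules: GET requests with a conclusive status."""
    return method.upper() in CACHEABLE_METHODS and status in CACHEABLE_STATUS


print(is_cacheable("GET", 200))    # a successful page load
print(is_cacheable("POST", 200))   # a submitted form
```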
Both the shared and private cache strategies use HTTP headers to control how the downloaded content should be cached. For the private cache, the client consults the response header and verifies whether the content in the local cache still corresponds to the current remote content. If it does, the client waives the transfer of the response payload and uses the local version.
The validity of the cached resource can be assessed in several ways. The server can provide an expiration date in the response header for the first request, so that the client discards the cached resource at the end of the term and requests it again to obtain the updated version. However, the server is not always able to determine the expiration date of a resource, so it is common to use the ETag response header field to identify the version of the resource, for example Etag: "606adcd4-46fa".
To verify whether a cached resource needs updating, the client requests only its response header from the server. If the ETag field matches the one in the locally stored version, the client reuses the cached content. Otherwise, the updated content of the resource is downloaded from the server.
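The comparison described here amounts to a few lines of code. In this sketch both headers are represented as dictionaries with lower-case field names, which is an assumption made for the example. The second ETag value is invented to simulate a changed resource.

```python
def can_reuse_cached(cached_header, current_header):
    """Return True when the cached copy has the same ETag as the server's."""
    cached = cached_header.get("etag")
    current = current_header.get("etag")
    # Without an ETag on either side nothing can be confirmed.
    return cached is not None and cached == current


stored = {"etag": '"606adcd4-46fa"'}
unchanged = {"etag": '"606adcd4-46fa"'}
changed = {"etag": '"606b0000-4700"'}  # hypothetical newer version

print(can_reuse_cached(stored, unchanged))
print(can_reuse_cached(stored, changed))
```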
HTTP Sessions
In a conventional website or web application, the features that handle session control are based on HTTP headers. The server cannot assume, for example, that all requests coming from the same IP address are from the same client. The most traditional method that allows the server to associate different requests with a single client is the use of cookies, identification tags that are given to the client by the server and that are provided in the HTTP header.
Cookies allow the server to preserve information about a specific client, even if the person running the client does not identify himself or herself explicitly. With cookies, it is possible to implement sessions where logins, shopping carts, preferences, etc., are preserved in between different requests made to the same server that provided them. Cookies are also used to track user browsing, so it is important to ask for consent before sending them.
The server sets the cookie in the response header using the Set-Cookie field. The field value is a name=value pair chosen to represent some attribute associated with a specific client. The server can, for example, create an identification number for a client that requests a resource for the first time and pass it on to the client in the response header:

    HTTP/1.1 200 OK
    Accept-Ranges: bytes
    Set-Cookie: client_id=62b5b719-fcbf

If the client allows the use of cookies, new requests to this same server have the cookie field in the header:

    GET /en/ HTTP/1.1
    Host: learning.lpi.org
    Cookie: client_id=62b5b719-fcbf

With this identification number, the server can retrieve specific definitions for the client and generate a customized response. It is also possible to use more than one Set-Cookie field to deliver different cookies to the same client. In this way, more than one definition can be preserved on the client side.
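The client side of this exchange can be sketched with Python's standard `http.cookies` module: the values of the Set-Cookie fields are stored, then sent back in a single Cookie field. The helper names and the second cookie (`lang=en`) are hypothetical, and a real client would also honor cookie attributes such as expiration.

```python
from http.cookies import SimpleCookie


def cookies_from_response(set_cookie_values):
    """Collect name/value pairs from one or more Set-Cookie field values."""
    jar = {}
    for field_value in set_cookie_values:
        cookie = SimpleCookie()
        cookie.load(field_value)      # parses "name=value; attributes"
        for name, morsel in cookie.items():
            jar[name] = morsel.value
    return jar


def cookie_header(jar):
    """Build the Cookie field the client sends with later requests."""
    pairs = "; ".join(f"{name}={value}" for name, value in jar.items())
    return f"Cookie: {pairs}"


jar = cookies_from_response(["client_id=62b5b719-fcbf", "lang=en"])
print(cookie_header(jar))
```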
Cookies raise both privacy issues and potential security holes, because there is a possibility that they can be transferred to another client, who will be identified by the server as the original client. Cookies used to preserve sessions can give access to sensitive information from the original client. Therefore, it’s very important for clients to adopt local protection mechanisms to prevent their cookies from being extracted and reused without authorization.
Guided Exercises
- What HTTP method does the following request message use?

    POST /cgi-bin/receive.cgi HTTP/1.1
    Host: learning.lpi.org
    User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0
    Accept: */*
    Content-Length: 27
    Content-Type: application/x-www-form-urlencoded

- When an HTTP server hosts many websites, how is it able to identify which one a request is for?
- What parameter is provided by the query string of the URL https://www.google.com/search?q=LPI?
- Why is the following HTTP request not suitable for caching?

    POST /cgi-bin/receive.cgi HTTP/1.1
    Host: learning.lpi.org
    User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0
    Accept: */*
    Content-Length: 27
    Content-Type: application/x-www-form-urlencoded

Explorational Exercises
- How could you use the web browser to monitor the requests and responses made by an HTML page?
- HTTP servers that provide static content usually map the requested path to a file in the server’s filesystem. What happens when the path in the request points to a directory?
- The contents of files sent over HTTPS are protected by encryption, so they cannot be read by computers between the client and the server. Despite this, can these computers in the middle identify which resource the client has requested from the server?
Summary
This lesson covers the basics of HTTP, the main protocol used by client applications to request resources from web servers. The lesson goes through the following concepts:
- Request messages, header fields, and methods.
- Response status codes.
- How HTTP servers generate responses.
- HTTP features useful for caching and session management.
Answers to Guided Exercises
- What HTTP method does the following request message use?

    POST /cgi-bin/receive.cgi HTTP/1.1
    Host: learning.lpi.org
    User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0
    Accept: */*
    Content-Length: 27
    Content-Type: application/x-www-form-urlencoded

  The POST method.
- When an HTTP server hosts many websites, how is it able to identify which one a request is for?
  The Host field in the request header provides the targeted website.
- What parameter is provided by the query string of the URL https://www.google.com/search?q=LPI?
  The parameter named q with a value of LPI.
- Why is the following HTTP request not suitable for caching?

    POST /cgi-bin/receive.cgi HTTP/1.1
    Host: learning.lpi.org
    User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0
    Accept: */*
    Content-Length: 27
    Content-Type: application/x-www-form-urlencoded

  Because requests made with the POST method imply a write operation on the server, they should not be cached.
Answers to Explorational Exercises
- How could you use the web browser to monitor the requests and responses made by an HTML page?
  All popular browsers offer development tools that, among other things, can show all network transactions that have been carried out by the current page.
- HTTP servers that provide static content usually map the requested path to a file in the server’s filesystem. What happens when the path in the request points to a directory?
  It depends on how the server is configured. By default, most HTTP servers look for a file named index.html (or another predefined name) in that same directory and send it as the response. If the file isn’t there, the server issues a 404 Not Found response.
- The contents of files sent over HTTPS are protected by encryption, so they cannot be read by computers between the client and the server. Despite this, can these computers in the middle identify which resource the client has requested from the server?
  No, because the request and response HTTP headers themselves are also encrypted by TLS.