3.10 Preventing Cross-Site Scripting
3.10.1 Problem
You are developing a
web-based application, and you want to ensure that an attacker cannot
exploit it in an effort to steal information from the browsers of
other people visiting the same site.
3.10.2 Solution
When you are generating HTML that must contain external input, be
sure to escape that input so that if it contains embedded HTML tags,
the tags are not treated as HTML by the browser.
3.10.3 Discussion
Cross-site scripting attacks
(often called CSS, but more frequently
XSS in an effort to avoid confusion with
cascading style sheets) are a general class of attacks with a common
root cause: insufficient input validation. The goal of many
cross-site scripting attacks is to steal information (usually the
contents of some specific cookie) from unsuspecting users. Other
times, the goal is to get an unsuspecting user to launch an attack on
himself. These attacks are especially a problem for sites that store
sensitive information, such as login data or session IDs, in cookies.
Cookie theft
could allow an attacker to hijack a session or glean other
information that is intended to be private.
Consider, for example, a web-based message board, where many
different people visit the site to read the messages that other
people have posted, and to post messages themselves. When someone
posts a new message to the board, if the message board software does
not properly validate the input, the message could contain
malicious HTML that, when viewed by other
people, performs some unexpected action. Usually an attacker will
attempt to embed some JavaScript code that steals cookies, or
something similar.
Often, an attacker has to go to greater lengths to exploit a
cross-site scripting vulnerability; the example described above is
simplistic. An attacker can exploit any page that will include
unescaped user input, but usually the attacker has to trick the user
into displaying that page somehow. Attackers use many methods to
accomplish this goal, such as fake pages that look like part of the
site from which the attacker wishes to steal cookies, or embedded
links in innocent-looking email messages.
It is not generally a good idea to allow users to embed HTML in any
input accepted from them, but many sites allow simple tags in some
input, such as those that enable bold or italics on text. Disallowing
HTML altogether is the right solution in most cases, and it is the
only solution that will guarantee that cross-site scripting will be
prevented. Other common attempts at a solution, such as checking the
referrer header for all requests (the referrer header is easily
forged), do not work.
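To disallow HTML in user input, you can do one of the following:
Refuse to accept anything that looks as if it may be HTML.
Escape the special characters that enable a browser to interpret data as HTML.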
Attempting to recognize HTML and refuse it can be error-prone, unless
you only look for the use of the greater-than (>) and less-than
(<) symbols. Trying to match tags that will not be allowed (i.e.,
a blacklist) is not a good idea because it is difficult to do, and
future revisions of HTML are likely to introduce new tags. Instead,
if you are going to allow some tags to pass through, you should take
the whitelist approach and only allow tags that you know are safe.
JavaScript code injection does not require a <script> tag; many
other tags can contain JavaScript code as well. For example, most
tags support attributes such as "onclick" and "onmouseover" that can
contain JavaScript code.
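To make the point about attributes concrete, here is a minimal sketch of the kind of blacklist test that falls short. The function name naive_has_script_tag() is hypothetical and is not part of this recipe; it only looks for the text "<script", ignoring case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical blacklist check (for illustration only): returns 1 if the
 * text contains "<script" in any letter case, 0 otherwise.
 */
int naive_has_script_tag(const char *s) {
  static const char needle[] = "<script";
  size_t i;

  assert(s != NULL);
  for (;  *s;  s++) {
    /* Compare the needle against the text starting at s.  The comparison
     * stops at the terminating NUL, so it never reads past the string. */
    for (i = 0;  needle[i] && tolower((unsigned char)s[i]) == needle[i];  i++)
      ;
    if (!needle[i]) return 1;
  }
  return 0;
}
```

A string such as <b onmouseover="alert(1)">hi</b> contains no <script> tag, so this check reports nothing even though the attribute carries JavaScript. That is why the approach taken in this recipe only lets whitelisted tags through when they have no attributes at all.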
The following spc_escape_html() function will replace occurrences of
special HTML characters with their escape sequences. For example,
input that contains something like "<script>" will be replaced with
"&lt;script&gt;", which no browser should ever interpret as HTML.
Our function will escape most HTML tags, but it will also allow some
through. Those that it allows through are contained in a whitelist,
and it will only allow them if the tags are used without any
attributes. In addition, the a (anchor) tag will
be allowed with a heavily restricted href
attribute. The attribute must begin with
"http://", and it must be the only
attribute. The character set allowed in the
attribute's value is also heavily restricted, which
means that not all necessarily valid URLs will successfully make it
through. In particular, if the URL contains
"#",
"?", or
"&", which are certainly valid
and all have special meaning, the tag will not be allowed.
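The restriction on the href value can be sketched on its own. The following is a simplified illustration, not the check used by the recipe: the name is_restricted_http_url() is hypothetical, it works on a bare URL rather than on a whole tag, and it ignores the quoting and whitespace handling that a real tag parser needs.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if url begins with "http://" (in any
 * letter case) and otherwise contains only letters, digits, '.', '/',
 * '-', and '_'.  Returns 0 for anything else.
 */
int is_restricted_http_url(const char *url) {
  static const char prefix[] = "http://";
  size_t i;

  assert(url != NULL);
  /* A short string mismatches at its NUL, so this never reads past it. */
  for (i = 0;  prefix[i];  i++)
    if (tolower((unsigned char)url[i]) != prefix[i]) return 0;

  for (url += i;  *url;  url++) {
    if (isalnum((unsigned char)*url)) continue;
    if (*url == '.' || *url == '/' || *url == '-' || *url == '_') continue;
    return 0;  /* '#', '?', '&', ':', and everything else is refused */
  }
  return 1;
}
```

With this sketch, "http://example.com/a_b-c" is accepted, while "http://example.com/?q=1" and "javascript:alert(1)" are both refused. Rejecting some valid URLs is the price of keeping the accepted set small enough to reason about.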
If you do not want to allow any HTML through at all, you can simply
remove the call to spc_allow_tag() in
spc_escape_html(), and force all possible HTML to
be properly escaped. In many cases, this will actually be the
behavior that you'll want.
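For reference, the escape-everything behavior might look like the following self-contained sketch. The names entity_for() and escape_all_html() are hypothetical and are not part of this recipe; the sketch assumes the same four special characters and the same malloc()/free() ownership convention.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the escape sequence for an HTML-special
 * character, or NULL if the character can be copied unchanged.
 */
const char *entity_for(char ch) {
  switch (ch) {
    case '<':  return "&lt;";
    case '>':  return "&gt;";
    case '&':  return "&amp;";
    case '"':  return "&quot;";
    default:   return NULL;
  }
}

/* Hypothetical variant that lets no tags through.  Returns a string
 * allocated with malloc() that the caller must free(), or NULL if memory
 * cannot be allocated.
 */
char *escape_all_html(const char *input) {
  const char *c, *ent;
  char       *output, *ptr;
  size_t     outputlen = 0;

  assert(input != NULL);
  /* First pass: compute the exact length of the escaped output. */
  for (c = input;  *c;  c++) {
    ent = entity_for(*c);
    outputlen += ent ? strlen(ent) : 1;
  }
  if (!(output = ptr = (char *)malloc(outputlen + 1))) return NULL;

  /* Second pass: copy, substituting escape sequences as needed. */
  for (c = input;  *c;  c++) {
    if ((ent = entity_for(*c)) != NULL) {
      memcpy(ptr, ent, strlen(ent));
      ptr += strlen(ent);
    } else {
      *ptr++ = *c;
    }
  }
  *ptr = '\0';
  return output;
}
```

Because no tag is ever copied through, this variant needs no tag parsing at all, which leaves much less room for error.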
spc_escape_html() will return a C-style string
dynamically allocated with malloc(), which the
caller is responsible for deallocating with
free(). If memory cannot be allocated, the return
will be NULL. It also expects a C-style string
containing the text to filter as its only argument.
#include <stdlib.h>
#include <string.h>
#include <strings.h>  /* strncasecmp() */
#include <ctype.h>

/* These are HTML tags that do not take arguments.  We special-case the <a> tag
 * since it takes an argument.  We will allow the tag as-is, or we will allow a
 * closing tag (e.g., </p>).  Additionally, we process tags in a
 * case-insensitive way.  Only letters and numbers are allowed in tags we can
 * allow.  Note that we do a linear search of the tags.  A binary search is
 * more efficient (log n time instead of linear), but more complex to
 * implement.  The efficiency hit shouldn't matter in practice.
 */
static const char *allowed_formatters[] = {
  "b", "big", "blink", "i", "s", "small", "strike", "sub", "sup", "tt", "u",
  "abbr", "acronym", "cite", "code", "del", "dfn", "em", "ins", "kbd", "samp",
  "strong", "var", "dir", "li", "dl", "dd", "dt", "menu", "ol", "ul", "hr",
  "br", "p", "h1", "h2", "h3", "h4", "h5", "h6", "center", "bdo", "blockquote",
  "nobr", "plaintext", "pre", "q", "spacer",
  /* include "a" here so that </a> will work */
  "a"
};

/* The cast keeps isspace() well defined for characters with the high bit set */
#define SKIP_WHITESPACE(p) while (isspace((unsigned char)*(p))) (p)++

static int spc_is_valid_link(const char *input) {
  static const char *href = "href";
  static const char *http = "http://";
  int               quoted_string = 0, seen_whitespace = 0;

  if (!isspace((unsigned char)*input)) return 0;
  SKIP_WHITESPACE(input);
  if (strncasecmp(href, input, strlen(href))) return 0;
  input += strlen(href);
  SKIP_WHITESPACE(input);
  if (*input++ != '=') return 0;
  SKIP_WHITESPACE(input);
  if (*input == '"') {
    quoted_string = 1;
    input++;
  }
  if (strncasecmp(http, input, strlen(http))) return 0;
  for (input += strlen(http);  *input && *input != '>';  input++) {
    switch (*input) {
      case '.': case '/': case '-': case '_':
        break;
      case '"':
        if (!quoted_string) return 0;
        input++;  /* step past the closing quote before looking for '>' */
        SKIP_WHITESPACE(input);
        if (*input != '>') return 0;
        return 1;
      default:
        if (isspace((unsigned char)*input)) {
          if (seen_whitespace && !quoted_string) return 0;
          SKIP_WHITESPACE(input);
          seen_whitespace = 1;
          input--;  /* the loop increment returns to the first non-space */
          break;
        }
        if (!isalnum((unsigned char)*input)) return 0;
        break;
    }
  }
  return (*input && !quoted_string);
}

static int spc_allow_tag(const char *input) {
  size_t     i, len;
  const char *tmp;

  /* Only a bare "a" is treated as an anchor.  Checking the next character
   * keeps other tags that begin with 'a' (abbr, acronym) out of this branch.
   */
  if ((*input == 'a' || *input == 'A') && !isalnum((unsigned char)input[1]))
    return spc_is_valid_link(input + 1);
  if (*input == '/') {
    input++;
    SKIP_WHITESPACE(input);
  }
  /* Divide by the element size to get the number of entries in the array */
  for (i = 0;  i < sizeof(allowed_formatters) / sizeof(allowed_formatters[0]);  i++) {
    len = strlen(allowed_formatters[i]);
    if (strncasecmp(allowed_formatters[i], input, len)) continue;
    /* The name must be followed directly by '>' (whitespace is tolerated) */
    tmp = input + len;
    SKIP_WHITESPACE(tmp);
    if (*tmp == '>') return 1;
  }
  return 0;
}

/* Note: This interface expects a C-style NUL-terminated string. */
char *spc_escape_html(const char *input) {
  char       *output, *ptr;
  size_t     outputlen = 0;
  const char *c;

  /* This is a worst-case length calculation */
  for (c = input;  *c;  c++) {
    switch (*c) {
      case '<':  outputlen += 4; break; /* &lt; */
      case '>':  outputlen += 4; break; /* &gt; */
      case '&':  outputlen += 5; break; /* &amp; */
      case '"':  outputlen += 6; break; /* &quot; */
      default:   outputlen += 1; break;
    }
  }

  if (!(output = ptr = (char *)malloc(outputlen + 1))) return NULL;
  for (c = input;  *c;  c++) {
    switch (*c) {
      case '<':
        if (!spc_allow_tag(c + 1)) {
          *ptr++ = '&';  *ptr++ = 'l';  *ptr++ = 't';  *ptr++ = ';';
          break;
        } else {
          /* Copy the allowed tag through unchanged, up to its closing '>' */
          do {
            *ptr++ = *c;
          } while (*++c != '>');
          *ptr++ = '>';
          break;
        }
      case '>':
        *ptr++ = '&';  *ptr++ = 'g';  *ptr++ = 't';  *ptr++ = ';';
        break;
      case '&':
        *ptr++ = '&';  *ptr++ = 'a';  *ptr++ = 'm';  *ptr++ = 'p';
        *ptr++ = ';';
        break;
      case '"':
        *ptr++ = '&';  *ptr++ = 'q';  *ptr++ = 'u';  *ptr++ = 'o';
        *ptr++ = 't';  *ptr++ = ';';
        break;
      default:
        *ptr++ = *c;
        break;
    }
  }
  *ptr = 0;
  return output;
}