CS 161 Notes Lec 7-8

Course website: https://su21.cs161.org/
Course textbook: https://textbook.cs161.org/

Overview

Lecture 7: Cryptographic Hashes and MACs

Hashing
1. Definition
2. Security: one-way, second preimage resistant, collision resistant
3. Examples
4. Length extension attacks
5. Application: lowest-hash scheme
6. Do hashes provide integrity?
MACs
Authenticated encryption

Lecture 8: PRNGs and Diffie-Hellman Key Exchange

PRNGs
Stream ciphers
Diffie-Hellman Key Exchange

Hashing

Recall the cryptography roadmap

	Symmetric-key	Asymmetric-key
Confidentiality	1. One-time pads 2. Block ciphers with chaining modes (e.g. AES-CBC)	1. RSA encryption 2. ElGamal encryption
Integrity, Authentication	1. MACs (e.g. HMAC)	1. Digital signatures (e.g. RSA signatures)

Lecture 5-6 considered the block with symmetric-key and confidentiality. In lecture 7, the block with symmetric-key and integrity, authentication will be considered.

But before that, hashing will be introduced briefly. Since hashing algorithms(like linear hashing, double hashing…) are usually covered in courses like Algorithms and Data Structures, so here we won’t discuss how hashing algorithms should be performed, but mainly focus on how should they be used.

Definition

In my understanding, hashing can be seen as a black box
Define a hash function H(M):
- Input: Arbitrary length message M
- Output: Fixed length, n-bit hash
- Sometimes written as {0, 1}* → {0, 1}n
It has following properties:
- Correctness: Deterministic
  - Hashing the same input always produces the same output
- Efficiency: Efficient to compute
- Security: One-way-ness (“preimage resistance”)
  - Given an output y, it’s infeasible to find an input x s.t. H(x) = y.
  - Note that there’re definitely many inputs that will result in the same output, but the one-way-ness property claims that you can’t find any of them.
- Security: Second preimage resistance
  - Given an input x, it is infeasible to find another input x’ ≠ x such that H(x) = H(x’)
- Security: Collision-resistance
  - It’s infeasible to find a pair of x, x’ s.t. x ≠ x’ and H(x) = H(x’).
  - Though there’re definitely gonna be a lot of inputs that collide since input has arbitrary length will output is fixed length, but the time it takes to find a pair of such inputs is so long that makes hashing collision-resistant.
- Unpredictability: No predictable patterns for how changing the input affects the output
  - Changing 1 bit in the input causes the output to be completely different
  - Note: Not part of the theoretical definition of hash, but useful in practice

Note that in the security properties above, second preimage resistance is harder to break than collision-resistance, since the former one has specified an input x for you to collide while the later one lets you to find a pair of inputs all by yourself. In terms of numbers, it takes 2^(n/2) tries on average to find a pair to break collision-resistance, while to break second preimage resistance, it takes 2^(n-1) tries on average.

Examples

MD5
- Output: 128 bits
- Security: Completely broken

A GIF that displays its own MD5 hash

SHA-1
- Output: 160 bits
- Security: Completely broken in 2017
- Was known to be weak before 2017, but still used sometimes
SHA-2
- Output: 256, 384, or 512 bits (sometimes labeled SHA-256, SHA-384, SHA-512)
- Not broken so far, but some variants are vulnerable to a length extension attack
- Current standard
SHA-3 (Keccak, why this name?)
- Output: 256, 384, or 512 bits
- Current standard

Length extension attacks

Given H(x) and the length of x, but not x, an attacker can create H(x

m) for any m of the attacker’s choosing.

Note: This doesn’t violate any property of hash functions but is undesirable in some circumstances.
SHA-256 (256-bit version of SHA-2) is vulnerable to this attack.
SHA-3 is not vulnerable to this attack.

Application: lowest-hash scheme

Just a fun scenario where hashing technique can be applied :)

Scenario
- An attacker has stolen 150 million (150,000,000) records from a website.
- The attacker wants to prove to us that they didn’t steal fewer records and exaggerate the number.
- The attacker doesn’t want to send all 150M records to us.
- How can we be sure the attacker isn’t lying?
Idea: Use cryptographic hashes
- Ask the attacker to hash all 150M records and send us the 10 records with the lowest hash value.
- Hashes are unpredictable, so the attacker is basically choosing 10 random records.
How do we know the attacker didn’t cheat?
- Check the hashes of the 10 records returned, and verify the records.
- We should then hash all the records we got, and check how those 10 lowest records distribute in our hash pool to judge whether the attacker really got approximately that many records.

Do hashes provide integrity?

It depends on your threat model :)

Imagine a scenario that users are trying to download and install firefox.
How can Mozilla make sure that users can get the real file instead of a malicious file?
Mozilla can provide a hash code as checksum for users to check, users can hash the file they downloaded and verify if it matches the hash code released by Mozilla.
Here the threat model is that attackers can’t change the checksum released by Mozilla ;) If that’s the case, hashes here do provide integrity.
Imagine another scenario where Alice and Bob are communicating through an insecure channel.
To make sure Mallory can’t modify the message, Alice includes a hash tag of the message.
The plan was that Bob can verify the message with the hash tag just like checksum.
Well, this method is insecure ;)
Mallory can actually change the message together with the hash!
So in this scenario hash doesn’t provide integrity.

MACs

MACs stands for Message Authentication Codes.

Our motivation of studying such topic is to provide Alice and Bob’s insecure channel with integrity. Remember lecture 7’s main topic is this highlighted block in cryptography roadmap below.

	Symmetric-key	Asymmetric-key
Confidentiality	1. One-time pads 2. Block ciphers with chaining modes (e.g. AES-CBC)	1. RSA encryption 2. ElGamal encryption
Integrity, Authentication	1. MACs (e.g. HMAC)	1. Digital signatures (e.g. RSA signatures)

How does MAC work?

Alice wants to send M to Bob, but doesn’t want Mallory to tamper with it
Alice sends M and T = MAC(K, M) to Bob
Bob receives M and T
Bob computes MAC(K, M) and checks that it matches T
If the MACs match, Bob is confident the message has not been tampered with (integrity)

Definition

Two parts:
- KeyGen() → K: Generate a key K
- MAC(K, M) → T: Generate a tag T for the message M using key K
  - Inputs: A secret key and an arbitrary-length message
  - Output: A fixed-length tag on the message
Properties
- Correctness: Determinism
  - Note: Some more complicated MAC schemes have an additional Verify(K, M, T) function that don’t require determinism, but this is out of scope for CS161
- Efficiency: Computing a MAC should be efficient
- Security: EU-CPA(defined below)

Security: unforgeability

EU-CPA

EU-CPA stands for Existentially Unforgeable under Chosen-Plaintext Attack.

EU-CPA means that without the key, an attacker can’t create a valid tag on a message.
- Mallory cannot generate MAC(K, M’) without K
- Mallory cannot find any M’ ≠ M such that MAC(K, M’) = MAC(K, M)
In terms of the game
- Mallory can let Alice do MAC(K, M) for any K she provides.
- But EU-CPA ensures that Mallory will never be able to generate a tag MAC(K, M’) on her own for a never-seen message M’.

So, how do we implement a MAC?

We can use hashes!

An example of MAC: NMAC

KeyGen():
- Output two random, n-bit keys K_1 and _K_2, where _n is the length of the hash output
NMAC(K_1, _K_2, _M):
- Output H(_K_1 H (_K_2 M))
NMAC is secure if the two keys are different (even in just one bit)
- Provably secure if the underlying hash function is secure
Intuition: Using two hashes prevents a length extension attack
Otherwise, an attacker who sees a tag for M could generate a tag for M M’ (that’s the definition of length extension attack)

Another Example: HMAC

Issues with NMAC:
- Recall: NMAC(K_1, _K_2, _M) = H(_K_1 H (_K_2 M))
- We need two different keys
- NMAC requires the keys to be the same length as the hash output (n bits, that’s so annoying)
- Can we use NMAC to design a scheme that uses one arbitrary-length key?
HMAC(K, M):
- Compute K’ as a version of K that is the length of the hash output
  - If K is too short, pad K with 0’s to make it n bits
  - If K is too long, hash it so it’s n bits
- Output H((K’ ⊕ opad) H((K’ ⊕ ipad) M))
- opad (outer pad) is the hard-coded byte 0x5c repeated until it’s the same length as K’
- ipad (inner pad) is the hard-coded byte 0x36 repeated until it’s the same length as K’
- As long as opad and ipad are different, you’ll get two different keys
- For paranoia, the designers chose two very different bit patterns, even though they theoretically need only differ in one bit :)
- HMAC is a hash function, so it has the properties of the underlying hash too
- You can’t verify a tag T if you don’t have K

Do MACs provide integrity?

Do MACs provide integrity?
- Yes. An attacker cannot tamper with the message without being detected
Do MACs provide authenticity?
- It depends on your threat model
- If a message has a valid MAC, you can be sure it came from someone with the secret key, but you can’t narrow it down to one person
- If only two people have the secret key, MACs provide authenticity: it has a valid MAC, and it’s not from me, so it must be from the other person
Do MACs provide confidentiality?
- MACs are deterministic ⇒ No IND-CPA security (Once again, no randomness is the main issue)
- MACs in general have no confidentiality guarantees; they can leak information about the message
  - HMAC doesn’t leak information about the message, but it’s still deterministic, so it’s not IND-CPA secure

Authenticated encryption

Definition

A scheme that simultaneously guarantees confidentiality and integrity (and authenticity, depending on your threat model) on a message.

Two ways of achieving authenticated encryption:

Combine schemes that provide confidentiality with schemes that provide integrity(combine the previous techniques in cryptography roadmap)
Use a scheme that is designed to provide confidentiality and integrity(build your own scheme)

Key reuse

Consider the first method for authenticated encryption mentioned above: Combining schemes that provide confidentiality with schemes that provide integrity
Key reuse: Using the same key in two different algorithms (e.g. AES-CBC and HMAC)
- Note: Using the same key for multiple uses of one algorithm (e.g. computing HMACs on different messages with the same key) is not key reuse
Reusing keys can cause the underlying algorithms to interfere with each other and affect security guarantees
- Example: If you use a block-cipher-based MAC algorithm and a block cipher chaining mode, the underlying block ciphers may no longer be secure
- Thinking about these attacks is hard
Simplest solution: Do not reuse keys! One key per algorithm.

MAC-then-encrypt or encrypt-then-MAC?

Let’s design an authenticated encryption scheme.

What do we have so far?
- An IND-CPA encryption scheme (e.g. AES-CBC): Enc(K, M) and Dec(K, M)
- An unforgeable MAC scheme (e.g. HMAC): MAC(K, M)
MAC-then-encrypt
- First compute MAC(K_2, _M)
- Then encrypt the message and the MAC together: Enc(k1, M MAC(K_2, _M))
Encrypt-then-MAC
- First compute Enc(K_1, _M)
- Then MAC the ciphertext: MAC(K_2, Enc(_K_1, _M))
- Both Enc(K_1, _M) and MAC(K_2, Enc(_K_1, _M)) are required when communicating
Which is better?
- In theory, both are IND-CPA and EU-CPA secure if applied properly
- MAC-then-encrypt has a flaw: You don’t know if tampering has occurred until after decrypting
  - Attacker can supply arbitrary tampered input, and you always have to decrypt it (it can lead to side channels related with the tiny difference in execution time)
  - Passing attacker-chosen input through the decryption function can cause side-channel leaks
Always use encrypt-then-MAC because it’s more robust to mistakes

AEAD encryption modes

Second method for authenticated encryption: Use a scheme that is designed to provide confidentiality, integrity, and authenticity
Authenticated encryption with additional data (AEAD): An algorithm that provides both confidentiality and integrity over the plaintext and integrity over additional data
- Additional data is usually context (e.g. memory address), so you can’t change the context without breaking the MAC
Great if used correctly: No more worrying about MAC-then-encrypt
- If you use AEAD incorrectly, you lose both confidentiality and integrity/authentication :(
- Example of correct usage: Using a crypto library with AEAD

Very fast mode of operation
- Fully parallel encryption
- Galois multiplication isn’t parallelizable, but it’s very fast

PNRGs

PNRG stands for Pseudorandom Number Generator.

From previous materials, we draw a conclusion that randomness is extremely important in the field cryptography, so how do we exactly generate random numbers?

Entropy

We first introduce “entropy”.
In cryptography, “random” usually means “random and unpredictable”.
Entropy is a measure of uncertainty
- A measure of uncertainty.
- Higher entropy = more unpredictable outcomes = desirable in cryptography.
- For more information about entropy, check Shannon entropy.

Before introducing PRNGs, let’s first consider methods to generate true randomness.

True randomness

Can be generated by an unpredictable circuit on CPU(Supported by Intel chips, such as RDRAND in Intel x86).
Can be generated by measuring human activities(e.g. the time when keyboard is pressed).
Usually we need to add lots of entropy sources together to get a good source, since even true randomness can contain bias.
True randomness are often too expansive to generate.

PRNGs

An algorithm that uses a little bit of true randomness to generate a lot of random-looking output.
Also called deterministic random bit generators (DRBGs).
How to use
1. Generate some expensive true randomness (e.g. noisy circuit on your CPU).
2. Use the true randomness as input to the PRNG.
3. Generate random-looking numbers quickly and cheaply with the PRNG.
PRNGs are deterministic: Output is generated according to a set algorithm
- However, for an attacker who can’t see the internal state, the output is computationally indistinguishable from true randomness.
A PRNG has three functions:
- PRNG.Seed(entropy): Initializes the internal state using the entropy
  - Input: Some truly random bits
- PRNG.Reseed(entropy): Updates the internal state using the existing state and the entropy(that’s basically adding randomness)
  - Input: Some truly random bits
- PRNG.Generate(n): Generate n pseudorandom bits
  - Input: A number n
  - Output: n pseudorandom bits
  - Updates the internal state as needed
  - Some PRNGs also support adding entropy here
Properties
- Correctness: Deterministic
- Efficiency: Efficient to generate pseudorandom bits
- Security: Indistinguishability from random
  - Game: Present an attacker with a truly random sequence and a sequence outputted from a secure PRNG, the attacker should not be able to determine which is which with probability > 1/2
  - Equivalence: An attacker cannot predict future outputs of the PRNG
A PRNG should be seeded with all available sources of entropy
- More sources of entropy are usually better
Rollback resistance:
- If the attacker learns the internal PRNG state, they cannot learn anything about previous states or outputs
- Game: An attacker knows the current internal state of the PRNG and is given a sequence of truly random bits and a sequence of previous output from the PRNG, the attacker cannot determine which is which with probability > 1/2
- Rollback resistance is not required in a secure PRNG, but it is a useful property

HMAC-DRBG

Remember DRBG stands for deterministic random bit generators, another form of saying PRNG.

Idea: HMAC looks like a random function, with an output, you can’t deduce its input, which is similar to rollback resistance.

Algorithms for HMAC-DRBG:

Seed(s):
  // initial state
  K = 0
  V = 0
  
  // When updating internal state with provided entropy
  K = HMAC(K, V || 0x00 || s)
  V = HMAC(K, V)
  
  K = HMAC(K, V || 0x01 || s)
  V = HMAC(K, V)

  ...

Generate(n):
  output = ''
  while len(output) < n do
    V = HMAC(K, V)
    output = output || V
  end while

  // updating internal state without extra entropy
  K = HMAC(K, V || 0x00)
  V = HMAC(K, V)

return output[0:n]

HMAC-DRBG is secure and rollback-resistant as long as the HMAC it’s based on is secure.

Stream ciphers

Remember one-time pads? It’s perfectly IND-CPA secure but requires a random n-bit key each time. So, can we implement a symmetric encryption algorithm that uses PRNG to generate keys that work like one-time pads?

It’s fairly straight forward

Note that IV should be updated for each message!

In fact, AES-CTR introduced earlier is a type of stream cipher

Diffie-Hellman Key Exchange

Remember in previous encryption schemes, we assumed that Alice and Bob can somehow share a random key for each round of communication. So, how can they actually realize this?

Certainly it’s not done by magic, but by math ;)

Discrete log problem

First we need to know that given ga mod p, it is computationally hard to find a.
No need to know why this is the truth, leave the proof to mathematics :)

Based on discrete log problem, we have the Diffie-Hellman problem, which states that given ga mod p and gb_mod _p, it is computationally hard to find gab mod p.

With those in mind, here’s a graph illustrating how Diffie-Hellman key exchange works,

In the end all Eve can know are values of g, p, ga ,gb . But she can never deduce gab base on these values! In fact, even Alice and Bob themselves don’t need to know each other’s key, they only need to calculate the final key given by gab mod p.

DHE

One way of using Diffie-Hellman is to use it ephemerally, called DHE.
That’s basically using Diffie-Hellman in a short-term and temporary way.
The motivation is that you don’t want Eve to be able to steal a, b from Alice and Bob after a long time of communication, and then decrypt the message which she has been storing for a long time.
The idea is simple, Alice and Bob just delete a, b and K right after they used them!

ECDH

ECDH stands for Elliptic-Curve Diffie-Hellman (ECDH)
Notice: The discrete-log problem seems hard because exponentiating integers in modular arithmetic “wraps around”
- Diffie-Hellman can be generalized to any mathematical group that has this cyclic property
- Discrete-log uses the “multiplicative group of integers mod p under generator g”
Elliptic curves: A type of mathematical curve
- Big idea: Repeatedly adding a point to itself on a curve is another cyclic group
Elliptic-curve Diffie-Hellman: A variation of Diffie-Hellman that uses elliptic curves instead of modular arithmetic
- Based on the elliptic curve discrete log problem, the analog of the discrete log problem
- Benefit of ECDH: The underlying problem is harder to solve, so we can use smaller keys (3072-bit DHE is about as secure as 384-bit ECHDE)

Note that there’s one more thing left for our communication! By the time Alice and Bob use Diffie-Hellman exchange to initialize their public keys, they have no way to secure integrity and authenticity!

So they can be attacked by Mallory in the following way

How to deal with such MITM(Man in the Middle) adversary will be introduced in the future lectures.

Written on May 10, 2023

-->