# Understanding Public-Key Cryptography for Beginner

This article is a basic intro to Public-key Cryptography related issues, written for programers and Linux users. This provide a conceptual introduction to understand Public-key cryptography, and associated protocols such as {HTTPS, SSH, SSL, TLS} and the bewildering meaning of acronyms such as {SHA-1, MD5, RSA, DES, Blowfish, AES, RC5}.

### Why is Public-Key Cryptography So Confusing?

Public-key Cryptography (PKC) as used in tech industry (For example, SSH, HTTPS, certificates, digital signature) is very confusing.

First, is the math concepts involved. You basically have to be a mathematician to understand how it works in detail. (it involves advanced number theory, elliptic curve theory (algebraic geometry).)

Besides the math, it gets extremely complex. Because:

- Practical use of PKC also involves symmetric-key encryption.
- Practical use of PKC involves many network protocols at many levels, including: {HTTPS, SSH, SSL, TLS, …}.
- There are many different algorithms used for the encryption as parts of PKC, and they change over the years. (For example, RSA, DES, Triple DES, Blowfish, AES, RC5)
- The question of trusting public keys. Thus we have Certificates, then Certificate Authority (CA). (which in turn, is the question of trusting Certificate Authority. Because CA is centralized, thus spawn a distributed system “Web of Trust”.)
- Public key can expire, or revoked (For example, stolen computer). So, this means, there needs to be protocols for checking certificate safety. (For example, Certificate Revocation List (CRL), Online Certificate Status Protocol (OCSC)).
- Practical PKC also involves “message digest” (aka fingerprint, hash key. (and these terms also have other meanings in computer science or cryptography.)). A message needs to be condensed into a fixed-length string, called “message digest”, for purposes of identification, checksum. And there are many algorithm for this, and they change over the years. (For example, MD5, SHA-1, …)
- The question of signatures. Digital signatures is a way to ensure that somebody really wrote something. The issue is totally SEPARATE from sending secret messages, yet, intimately tied.

Is it confusing enough?

Luckily, as a programer or sys admin, we don't need to understand the math to use it. We can still understand the concept of public-key encryption for sending secret messages.

Here's a very basic intro to get you started.

## Ciphers, Plaintext, Ciphertext

- A method to encrypt a mesage is called a Cipher.
- Message before encryption is called “plaintext”
- Message after encryption is called “ciphertext”.

There are many ciphers.

### Cipher Involves a “Key”

To encrypt a message, almost always involves a “key”. For example, you can shift letter by 2, such as {a → c, b → d, c → e, …}. Then, the “key” is the map of letter transformation (or, just the phrase “shift alphabet to the right by 2”). Or, you can create a customized map to transform the letters, for example, {a → 3, b → u, c → h, …}. This map would be the “key”. To create a ciphertext, you just apply this transformation. To get plaintext, just use the reverse map.

### Most Basic Cipher: Substitution Ciphers

The cipher described above belongs to a class called Substitution Ciphers. Substitution Ciphers are ancient. One popular example of substitution cipher is called
ROT13. It just shift alphabet by 13 places {a → n, b → o, c → p, …}. If you use Emacs, you can call the command `rot13-region`

to convert selected text into cipher text using the rot13 cipher. This is useful for example, when you post a quiz, and you can include the answer by rot13 the answer, to prevent accidental reading of the answer.

Try to decipher this:

GEL GB QRPVCURE GUVF OL EBG 13

(start emacs, paste into a empty file, select the text, then type 【`Alt`+`x`】 then `rot13-region`

)

### Symmetric-Key Ciphers

Ciphers are classified into many types. The most common type of cipher use the same key to encrypt and decrypt. Such cipher is called Symmetric-key ciphers. For example, substitution ciphers are symmetric-key ciphers, because, given the key used (a map of letter transformation), you are able to both encrypt and decrypt a plaintext/ciphertext.

Symmetric-key ciphers have a major practical problem. To transmit a message, both sender and receiver must already know the key. So, there's a chicken-and-egg problem. You want to send a secret message, but first you need to send another message (the key) secretly. For example, on the internet, when you send credit card numbers to a site, and if you are using symmetric-key cipher, both your machine and the remote site must know the key first.

### Public-key Cryptography (PKC): RSA and Others

In about 1973, mathematicians started to understand a system that eventually became the Public-key cryptography (PKC). Its popularity began in 1990s.

PKC solves the problem of requiring first sending the receiver a secret key.

Here's the basics of PKC:

- Each user has a paired keys: public key and private key (think of them as just 2 numbers). These 2 keys are mathematically related. (that is, the pair of numbers have certain properties that connect them.)
- Public key can be used to encrypt a message, and that encrypted message can be decrypted by the paired private key. (and vice versa: Private key can be used to encrypt a message, and that encrypted message can be decrypted by the paired Public key.)

The way to use this is that, each person have a key pair. One called public, and the other private. The public key is published. Only the private key is kept in secret, known just to himself.

Suppose, Alice and Bob want to talk in secret. Alice encrypt the message using Bob's public key. When Bob receives it, he can just decrypt it using his secret private key.

There are several public-key ciphers. The most popular one is RSA.

On Linux terminal, you can type `ssh-keygen -t rsa`

to generate a public/private key pair (for the RSA cipher). By default, they are stored at `~/.ssh/id_rsa.pub`

and `~/.ssh/id_rsa`

. Here's a example of what the keys look like:

## The Problem of Trusting Keys

This public-key cryptography solved the problem of requiring the receiver to know a secret key. However, along with it comes new problems.

How do you know, that Bob's public key is actually from Bob? A bad guy can publish a key and claim he's Bob. If Alice encrypt a message using a wrong person's public key, then that wrong person will be able to read the message. So, here you have the problem of trusting keys.

### Certificate and Certificate Authority

To solve this trusting public key problem, is invented Public key certificate and Certificate authority (CA). Basically, a organization is trusted as the authority, and it publish “certificates”.

The Certificate contains the public key and a person/org's ID, and the issuer, and other info. So, for example, when you send credit card number to Amazon.com, you first seek the public key in a certificate from Amazon that is issued by a Certificate Authority that you trust.

You can view the certificates that are bundled with browsers.

In Google Chrome browser, it's under Settings, Show Advanced, HTTPS/SSL

In Firefox, it's under Options/Preference, Advanced tab, Encryption. In Internet Explorer 9, it's under Options, Content tab.

Web Browser bundle certificates of big companies such as {Google, Yahoo, Skype, Microsoft's live.com, …}, and certs from global CAs. You can add more or remove some.

The most popular global CA (For example, for websites, online banking) are {VeriSign, Thawte}, both now owned by Symantec. (Thawte was founded by Mark Shuttleworth in 1995, before he started Ubuntu Linux in 2004) Other popular ones are Comodo and GoDaddy. These companies do business of issuing certs. To have a cert issued by them, you have to pay money.

For Firefox, you can see their cert inclusion policy and a list at 〔 Mozilla CA Certificate Store At http://www.mozilla.org/projects/security/certs/ 〕

### Public-key Infrastructure

There is, another problem. Certificates Authorities are basically big companies. How do you know they are not corrupted?

Here's a example how big the problem can be. Google engineers in 2017-02 published a post titled “Intent to Deprecate and Remove: Trust in existing Symantec-issued Certificates”, claiming misissue. Symantec is one of the biggest Certificates Authority.

Overall, several systems are invented to solve the problem of trusting Certificates Authority.

- One solution is Web of trust, which is decentralized trust model, which is more or less a approval-rating system of certs by all users. This is used by GPG software. [see GPG Tutorial].
- Another solution is a community-driven Certificate Authority, such as CAcert.org «CAcert Inc. is an incorporated non-profit association registered in New South Wales (Australia) since July 2003 which runs CAcert.org.»

Both “web of trust” and CAcert.org are not ideal. They are only used in Linux communities mostly among individuals. (they are not used by online merchants, or banks and financial services, governments, or any serious business.)

Overall, a system for the key-trust-problem is called the Public-key infrastructure (PKI).

### Key Revocation and Expiration

There is, more complexity. Suppose Alice's computer got stolen, and thus her secret key is exposed. Now, her public key can't be trusted anymore. If you encrypt a message using that public key, then her computer thief can read it. So, there's a need for key revocation.

When you lost a key, you should be able to report it, and the CA or PKI should be able to know it, and “cancel” your public key that is now bad. There's also a need for key expiration. Thus, is created Certificate revocation list and also Online Certificate Status Protocol.

Now, when you visit website, sometimes you get a error message saying that the Certificate can't be trusted, or that the Cert expired, or that the revocation list can't be verified. Now you know what it means.

## Public-Key Cryptography and Symmetric-Key Cryptography; Mixed Use

PKC is great. However, it's slow to use PKC to encrypt messages. So, to make it practical, all HTTPS, SSH, etc systems that use PKC actually use hybrid of public-key and symmetric-key cipher. Following is a basic description of how it works.

First, use PKC system to encrypt a message that is a (randomly generated) key for a symmetric-key cipher. Then, once this is done, all exchange of messages use symmetric-key cipher to encrypt and decrypt. For example, to send credit card to Amazon.com, first you get the public key of Amazon (by getting its certificate issued by a trusted CA. (which is usually bundled with the browser.)). Then, use PKC to encrypt a randomly generated key of some symmetric-key cipher method. This new, temporary, randomly generated key, is now known to both you and Amazon. For the rest of the session, symmetric-key encryption is used to communicate.

## Symmetric-key Ciphers: AES, Blowfish, DES, Triple DES, Serpent, Twofish, …

So now, you see that Symmetric-key Ciphers are quite important. They are also much simpler, and faster. Vast majority of ciphers, from ancient to modern ones, are actually of the symmetric-key type. Symmetric-key ciphers are usually called Symmetric-key algorithm.

Over the years, many symmetric-key ciphers came and went, because computer gets faster, and unbreakable cipher 10 years ago becomes easy to break. So, over the years we have {AES, Blowfish, DES, Triple DES, Serpent, Twofish} and lots others.

DES was the standard around 1990s. Since, Triple DES replaced it, then, AES is the most popular today. Here's some basic description of symmetric-key ciphers.

In cryptography, a block cipher is a deterministic algorithm operating on fixed-length groups of bits, called blocks, with an unvarying transformation that is specified by a symmetric key. Block ciphers are important elementary components in the design of many cryptographic protocols, and are widely used to implement encryption of bulk data.

Common block ciphers: AES, Blowfish, DES, Triple DES, Serpent, Twofish.

DES is the archetypal block cipher — an algorithm that takes a fixed-length string of plaintext bits and transforms it through a series of complicated operations into another ciphertext bitstring of the same length. In the case of DES, the block size is 64 bits. DES also uses a key to customize the transformation, so that decryption can supposedly only be performed by those who know the particular key used to encrypt. The key ostensibly consists of 64 bits; however, only 56 of these are actually used by the algorithm. Eight bits are used solely for checking parity, and are thereafter discarded. Hence the effective key length is 56 bits, and it is always quoted as such. Every 8th bit of the selected key is discarded, that is, positions 8, 16, 24, 32, 40, 48, 56, 64 are removed from the 64 bit key leaving behind only the 56 bit key.

Like other block ciphers, DES by itself is not a secure means of encryption but must instead be used in a mode of operation.

Triple DES (aka TDEA): basically applying DES 3 times, in sequence, using 3 different keys.

AES is based on a design principle known as a substitution-permutation network, and is fast in both software and hardware.[6] Unlike its predecessor DES, AES does not use a Feistel network. AES is a variant of Rijndael which has a fixed block size of 128 bits, and a key size of 128, 192, or 256 bits. By contrast, the Rijndael specification per se is specified with block and key sizes that may be any multiple of 32 bits, both with a minimum of 128 and a maximum of 256 bits.

AES has been adopted by the U.S. government and is now used worldwide. It supersedes the Data Encryption Standard (DES),[5] which was published in 1977. The algorithm described by AES is a symmetric-key algorithm.

In cryptography, RC5 is a block cipher notable for its simplicity. Designed by Ronald Rivest in 1994, RC stands for “Rivest Cipher”, or alternatively, “Ron's Code” (compare RC2 and RC4). The Advanced Encryption Standard (AES) candidate RC6 was based on RC5.

## Hash Function, Cryptographic Hash Function, Fingerprint, Message Digest

In encryption, we turn a message into a scrambled form. In computing and computing security, we also need to “condense” a message into a short string, and this string can serve for ID purposes. This turning a message into a short string is called hashing, and the result is called Hash Value, Fingerprint, or Message Digest. This is often needed as part of cryptographic protocols.

Hash function is the most basic idea. Basically, it turns a chunk of data (i.e. a large string of arbitrary size) into a short string of fixed length. Ideally, any 2 different input data should produce different strings. Strickly speaking, this is impossible. It is possible only if the length of the result is equal to the length of biggest possible input.

When 2 different chunk of data get turned into identical short string by a hash function, it's called a “collision”, and this is not desirable. Different type of hash functions are designed for different purposes, with different degrees of negligible chance of collision.

Here's a excerpt from Wikipedia on Hash function:

A hash function is any algorithm or subroutine that maps large data sets of variable length, called keys, to smaller data sets of a fixed length. For example, a person's name, having a variable length, could be hashed to a single integer. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.

Hash functions are related to (and often confused with) checksums, check digits, fingerprints, randomization functions, error correcting codes, and cryptographic hash functions. Although these concepts overlap to some extent, each has its own uses and requirements and is designed and optimized differently.

Fingerprint (computing), is basically a hash function with more stringent requirement of no collision.

Cryptographic hash function, is a hash function with highest requirement of no collision.

A cryptographic hash function is a hash function; that is, an algorithm that takes an arbitrary block of data and returns a fixed-size bit string, the (cryptographic) hash value, such that an (accidental or intentional) change to the data will (with very high probability) change the hash value. The data to be encoded are often called the “message,” and the hash value is sometimes called the message digest or simply digest.

The ideal cryptographic hash function has four main or significant properties:

- it is easy to compute the hash value for any given message
- it is infeasible to generate a message that has a given hash
- it is infeasible to modify a message without changing the hash
- it is infeasible to find two different messages with the same hash

The most commonly used hash functions are: MD5, SHA-1. MD5 is popular in 1990s and is today obsolete. SHA-1 also reached end of its life today. (git uses SHA-1 still as file ID. See: 〔 How would git handle a SHA-1 collision on a blob? At http://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob 〕 [see git Tutorial]).

The MD5 Message-Digest Algorithm is a widely used cryptographic hash function that produces a 128-bit (16-byte) hash value. Specified in RFC 1321, MD5 has been utilized in a wide variety of security applications, and is also commonly used to check data integrity. MD5 was designed by Ron Rivest in 1991 to replace an earlier hash function, MD4. An MD5 hash is typically expressed as a hexadecimal number, 32 digits long.

However, it has since been shown that MD5 is not collision resistant; as such, MD5 is not suitable for applications like SSL certificates or digital signatures that rely on this property. In 1996, a flaw was found with the design of MD5, and while it was not a clearly fatal weakness, cryptographers began recommending the use of other algorithms, such as SHA-1 — which has since been found to be vulnerable as well. In 2004, more serious flaws were discovered in MD5, making further use of the algorithm for security purposes questionable — specifically, a group of researchers described how to create a pair of files that share the same MD5 checksum. Further advances were made in breaking MD5 in 2005, 2006, and 2007. In December 2008, a group of researchers used this technique to fake SSL certificate validity, and US-CERT now says that MD5 “should be considered cryptographically broken and unsuitable for further use”, and most U.S. government applications now require the SHA-2 family of hash functions.

Emacs has MD5 algorithm builtin. For example, try this emacs lisp code `(md5 "some big text file content …")`

. [see Emacs: How to Evaluate Emacs Lisp Code]

In cryptography, SHA-1 is a cryptographic hash function designed by the United States National Security Agency and published by the United States NIST as a U.S. Federal Information Processing Standard. SHA stands for “secure hash algorithm”. The four SHA algorithms are structured differently and are distinguished as SHA-0, SHA-1, SHA-2, and SHA-3. SHA-1 is very similar to SHA-0, but corrects an error in the original SHA hash specification that led to significant weaknesses. The SHA-0 algorithm was not adopted by many applications. SHA-2 on the other hand significantly differs from the SHA-1 hash function.

SHA-1 is the most widely used of the existing SHA hash functions, and is employed in several widely used applications and protocols.

To get a idea of what cryptographic hash functions are there, and their security status, see: Comparison of cryptographic hash functions and Hash function security summary.

## Summary

- symmetric-key ciphers have a chicken-and-egg problem. Both party must first know a agreed-upon key before they can talk in private. This works fine between two spies but doesn't work for sending credit card on internet.
- public-key cryptography solved this. Each party has a pair of connected keys, public and private. You use someone's public key to encrypt a message for him to read. He uses his private key to decrypt it.
- public-key cryptography has key-trusting problem. You can't know someone's public key is really him. Thus created Certificate Authority concept.
- Certificate Authority has a problem too, because it relies on a centralized organization. The organization can be corrupt. Thus we have some decentralized system, but isn't ideal neither. A system that tries to solve the whole problem public-key trust issue is is called public-key infrastructure.
- public-key encryption is slow. In practice, web uses PKC to communicate a key and use symmetric-key encryption for communication of content, because it's much faster.
- Symmetric-key ciphers come and go, because computers get faster, or some of the ciphers are discovered to be easily broken. (Typical solution is to increase key length.) (PKC cipher also come and go, but slower.)
- {Hash, message digest, finger-print} are related idea. They condense a message into a short string as ID. These are used for different purposes, to identify or authenticate messages or any text such as certificates. Algorithm for these also changes over time.

What about HTTPS, SSH, SSL, TLS? These are networking protocols using public-key cryptography. To understand them, you must first understand TCP/IP protocols. See: TCP/IP Tutorial for Beginner.