Encryption

What is encryption?

Encryption is the process of converting a piece of plaintext into a ciphertext using a key.

Plaintext - the original, unencrypted message. This is the data that's being encrypted.

This is some readable plaintext. It's a really big secret so I had better be sure to encrypt it properly.

Ciphertext - the encrypted version of the plaintext, produced once encryption is complete.

Once plaintext is encrypted into ciphertext, it looks like this:

U2FsdGVkX1+H7BSzYgumzI4SfcqHpp9KxqAPsPTZ1TU13gC6dEBYnP2Q5q0r7wRRR1WxGMvsYFVGJlV6/atZQfC6XiaiMZUafJyhCvf/h52gzR7qv2o+G76XBaAItir+ZrcqDaCkLvKtWbEGkS44LsDVBU4lEqnTrA==

This ciphertext is no longer human readable text, but given the right key, it can be converted back into the original plaintext.

Key - this data is used to encrypt, and sometimes decrypt, a message (some methods of encryption use a different key for encrypting than for decrypting). For our purposes "key" and "password" are used interchangeably.

Demonstration

Let's make an encryption algorithm. For real use cases it's not a good idea to build your own algorithm, but for a demonstration it's OK.

We'll start with some assumptions about our input plaintext. For simplicity, we'll limit it to lowercase letters (a-z); spaces are left as they are.

For the password (or key), you pick any number, and each character gets incremented by that amount.

So, if we have the plaintext: here is my secret message

And for the key we'll pick: 13

Our first letter, h, becomes u:

here is my secret message <-- plaintext
ifsf jt nz tfdsfu nfttbhf
jgtg ku oa ugetgv oguucig
khuh lv pb vhfuhw phvvdjh
livi mw qc wigvix qiwweki
mjwj nx rd xjhwjy rjxxflj
nkxk oy se ykixkz skyygmk
olyl pz tf zljyla tlzzhnl
pmzm qa ug amkzmb umaaiom
qnan rb vh bnlanc vnbbjpn
robo sc wi combod wocckqo
spcp td xj dpncpe xpddlrp
tqdq ue yk eqodqf yqeemsq
urer vf zl frperg zrffntr <-- ciphertext

Our final encrypted message would look like this:

urer vf zl frperg zrffntr

To decrypt this, someone would need to know the ciphertext and the key used to encrypt it: 13.

So is our message safe? No, not really. This is known as a Caesar cipher, which was used by Julius Caesar over 2000 years ago, and it has some major weaknesses. The version we did, called ROT13, has a unique property: the same operation both encrypts and decrypts. Because the Latin alphabet has 26 characters, shifting by 13 characters twice takes you back to the original message.
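The shift described above can be sketched in a few lines of Python. This is only an illustration, not something to use for real secrets. The function name caesar_shift is made up for this example, and it assumes lowercase a-z input with spaces passed through unchanged, as in the table above.

```python
import string

ALPHABET = string.ascii_lowercase  # the 26 letters a-z


def caesar_shift(text: str, key: int) -> str:
    """Shift every lowercase letter forward by `key` places, wrapping after z."""
    result = []
    for ch in text:
        if ch in ALPHABET:
            new_index = (ALPHABET.index(ch) + key) % len(ALPHABET)
            result.append(ALPHABET[new_index])
        else:
            result.append(ch)  # spaces (and anything else) are left untouched
    return "".join(result)


plaintext = "here is my secret message"
ciphertext = caesar_shift(plaintext, 13)
print(ciphertext)                    # urer vf zl frperg zrffntr
print(caesar_shift(ciphertext, 13))  # here is my secret message
```

With any key other than 13, decrypting needs the opposite shift, for example caesar_shift(ciphertext, -key).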

You can try encrypting and decrypting your own messages at rot13.com

Cracking 'Caesar's Cipher'

So why is our algorithm not secure? It's very vulnerable to an attack called frequency analysis. Letters appear in roughly the same ratios in any reasonably long text written in the same language.

I've taken the first chapter of Sherlock Holmes by Arthur Conan Doyle, and counted the number of times each letter appears, then ordered them.

Here are the letters (e appears the most, and z the least):

etoaisnhrdlumcwyfgpbvkxjqz

I took the first paragraph of chapter 2, put it through our algorithm, and then counted the frequency of each letter. Remember this is encrypted, so r isn't the most frequently used letter, but whatever has been encrypted as r is.

rgvnufbaeqypjzslhtoicxdk

Because we know how to decrypt our algorithm, let's take a peek at what these letters are when shifted 13 characters back to the original letters.

etiahsonrdlcwmfyugbvpkqx

Here are the letters from Chapter 1 and our encrypted paragraph, side by side:

etoaisnhrdlumcwyfgpbvkxjqz <--Ch. 1
etiahsonrdlcwmfyugbvpkqx__ <-- Ch. 2 Paragraph

It's not identical, but it's close. If someone didn't know how to decrypt the text, or that the key was 13, they could just replace each letter in the text with the corresponding letter in the "true" letter frequency order. So r becomes e, etc.

rgvnufbaeqypjzslhtoicxdk__ <- replace each one of these letters
etoaisnhrdlumcwyfgpbvkxjqz <- with the corresponding letter here
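The rank-for-rank replacement above can be sketched in Python. The two letter orders are copied from this article; the function names frequency_order and crack are invented for the example, and the encrypted string is simply the ROT13 form of the opening words of the Chapter 2 paragraph.

```python
from collections import Counter

# Letter order from the Chapter 1 count above, most frequent first.
REFERENCE_ORDER = "etoaisnhrdlumcwyfgpbvkxjqz"


def frequency_order(text: str) -> str:
    """Return the letters that occur in `text`, most frequent first."""
    counts = Counter(ch for ch in text if ch.isalpha())
    return "".join(letter for letter, _ in counts.most_common())


def crack(ciphertext: str, cipher_order: str,
          reference_order: str = REFERENCE_ORDER) -> str:
    """Replace the n-th most common cipher letter with the n-th reference letter."""
    mapping = dict(zip(cipher_order, reference_order))
    return "".join(mapping.get(ch, ch) for ch in ciphertext)


# Frequency order of the whole encrypted paragraph, as counted above.
cipher_order = "rgvnufbaeqypjzslhtoicxdk"
encrypted = ("ng guerr bpybpx cerpvfryl v jnf ng onxre fgerrg ohg ubyzrf "
             "unq abg lrg erghearq gur ynaqynql")
print(crack(encrypted, cipher_order))
```

In practice cipher_order would come from frequency_order(full_ciphertext); it is hard-coded here because only the first sentence of the paragraph is reproduced in this article. The printed result is a first-pass guess, still containing wrong letters that need fixing by hand.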

Let's see the original text next to our "cracked" text.

at three oclock precisely i was at baker street but holmes had not yet returned the landlady....
at tiree nulnuk vreuosely o mas at paker street pft inlces iad hnt yet retfrhed tie lahdlady...

It's not quite perfect, but there are some obvious changes we could make:

  1. h -> n
    lahdlady is clearly landlady, so we know h should really be n
  2. o -> i
    The at in the beginning is correct, so we know o mas should be i mas, rather than a mas
  3. m -> w
    We can guess that m in mas should be a w

I made these substitutions, and some other obvious ones that appeared after making the above substitutions, and within about 6 substitutions we get to:

at three oclock precisely i was at baker street but holmes had not yet returned the landlady...
at three oflofk vrefisely i was at baker street but holces had not yet returned the landlady...

OK, so we missed oclock, precisely, and holmes, but if these were the plans of some enemy, we would still have most of the information we need to thwart their attack. And the longer the text gets, the more likely it is to align with the "true" letter frequency, and the more words we have to help find obvious fixes for any errors.

Hashing

What is hashing?

Hashing is another part of the world of cryptography, but it's different from encryption. With encryption, the important part was that the data was preserved. With a hash, we can't get back the information we put in, but the hash can be used to verify that a later input is the same as the original.

Hash Function - instructions used for a hashing operation.

Digest - the output from the hash function.

Encryption vs Hashing

Encryption

Encryption is just one direction of a cyclical process; the other is decryption.

PLAINTEXT  -> ENCRYPTION + KEY -> CIPHERTEXT
CIPHERTEXT -> DECRYPTION + KEY -> PLAINTEXT

Hashing

With hashing it's a one-way operation, and the hash is the final result.

INPUT -> HASH FUNCTION -> DIGEST

Examples (SHA-256 hashing algorithm)

  • Input: password
    SHA-256 Hash: 5E884898DA28047151D0E56F8DC6292773603D0D6AABBDD62A11EF721D1542D8
  • Input: A
    SHA-256 Hash: 559AEAD08264D5795D3909718CDD05ABD49572E84FE55590EEF31A88A08FDFFD
  • Input: Hello, world! (repeated 10k times)
    SHA-256 Hash: CE9FE3447A34D159CBF59C8B01688AFEF4EDAFD32D5A2DB20EC4F002C8C43BDC
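Digests like these can be produced with Python's standard hashlib module. The helper name sha256_hex is made up for this sketch. For the third input, the code assumes the repeats are joined with no separator, which may not be how the value above was produced, so that digest is printed rather than compared.

```python
import hashlib


def sha256_hex(data: str) -> str:
    """Return the SHA-256 digest of `data` as uppercase hex."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest().upper()


for value in ["password", "A", "Hello, world!" * 10_000]:
    digest = sha256_hex(value)
    # Input length varies wildly; the digest is always 64 hex characters.
    print(len(value), len(digest), digest)
```

Whatever the input size, the digest has the same length, which is the fixed-output property that hash functions are known for.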

Real-world usage

Say we want to store usernames and passwords in a database to use for user sign up/sign in. You sign up with username: bob and password: password. We could just store bob/password in the database and quite easily use it to sign you in. But what if the database gets hacked or leaked? Your username and password, which you likely use in other places too, are now out there, in plaintext.

But this is taking an unneeded risk. I don't really need to know your actual password to achieve my goal of authenticating you. I don't care that you typed password; what I really need to know is "did you send the same thing this time that you sent when you signed up?" For that, we can replace the plaintext password with the hash of the password.

  Storage method                   Value from user   User value conversion   Comparison             Value from database
  Password stored as plain text    password          none                    password == password   password
  Password stored as hash digest   password          hash function           5e884898 == 5e884898   5e884898
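Here is a rough sketch of the second row: the server keeps only the digest and compares digests at sign-in. The users dictionary stands in for a database, and sign_up and sign_in are hypothetical names. Plain SHA-256 is used to match the digest shown above; real systems typically also add a per-user salt and use a deliberately slow password hash.

```python
import hashlib

users = {}  # hypothetical stand-in for the database: username -> digest


def digest(password: str) -> str:
    """Hash a password with SHA-256 and return lowercase hex."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def sign_up(username: str, password: str) -> None:
    """Store only the digest, never the password itself."""
    users[username] = digest(password)


def sign_in(username: str, password: str) -> bool:
    """Hash the submitted value and compare it with the stored digest."""
    stored = users.get(username)
    return stored is not None and stored == digest(password)


sign_up("bob", "password")
print(users["bob"][:8])              # 5e884898
print(sign_in("bob", "password"))    # True
print(sign_in("bob", "passw0rd"))    # False
```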

Demonstration

Here's a very flawed, but easy to understand, hashing algorithm.
Step 1: password -> numbers
  • Letters are replaced with their position in the alphabet: a -> 1, b -> 2, c -> 3, d -> 4, e -> 5, ...
  • Numbers stay as numbers: ..., 2 -> 2, 3 -> 3, 4 -> 4, ...
  • Non-alphanumeric characters are replaced with "0": ..., / -> 0, : -> 0, ! -> 0, ( -> 0, ...

Step 2: Calculate the hash
a2c4e6            <- the raw input
123456            <- input converted to numbers
12 34 56          <- split numbers into groups of two
12 + 34 + 56      <- sum the numbers

102               <- if length < 5:
10200             <- pad from the right with `0` until 5 digits
10200             <- final hash digest

10234567          <- if length > 5:
10234567          <- cut digits from the left until 5 digits
34567             <- final hash digest

We are cutting/padding the result because hashes tend to be a fixed output length. Whether you input a ".", or the entire dictionary, you'll get back values of equal length, but different contents.

Here's the process of hashing the password: hello

he l l o
85121215        <- replacing 'h' with '8', 'e' with '5', etc.
85 12 12 15     <- split the number into 2-digit numbers
124             <- sum the numbers
12400           <- pad with "0" until 5 numbers long

So the hash (or digest) of hello, for this hashing algorithm, is 12400. Any time hello is given as an input to this function, the digest will always be 12400.

And for the password: password.1.2.3.password.4.5.6.

 pa s s w o rd.1.2.3. pa s s w o rd.4.5.6.
161191923151840102030161191923151840405060
16 11 91 92 31 51 84 01 02 03 01 61 19 19 23 15 18 40 40 50 60
728
72800
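The two steps can be written out as a small Python function. This is a sketch of the toy algorithm described above, with made-up names (to_digits, toy_hash) and one assumption the text doesn't cover: uppercase letters are treated like lowercase ones.

```python
def to_digits(text: str) -> str:
    """Step 1: turn every character into digits."""
    out = []
    for ch in text.lower():  # assumption: uppercase is treated like lowercase
        if "a" <= ch <= "z":
            out.append(str(ord(ch) - ord("a") + 1))  # a -> 1 ... z -> 26
        elif ch in "0123456789":
            out.append(ch)                           # numbers stay as numbers
        else:
            out.append("0")                          # everything else -> 0
    return "".join(out)


def toy_hash(text: str) -> str:
    """Step 2: sum the 2-digit groups, then pad or cut to exactly 5 digits."""
    digits = to_digits(text)
    groups = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    total = str(sum(int(group) for group in groups))
    if len(total) < 5:
        return total.ljust(5, "0")  # pad from the right with 0
    return total[-5:]               # cut digits from the left


print(toy_hash("a2c4e6"))                          # 10200
print(toy_hash("hello"))                           # 12400
print(toy_hash("password.1.2.3.password.4.5.6."))  # 72800
```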

Can we take 12400 and use it to get back to the input, hello? No. The only way to determine the original input is by calculating the hash of every possible combination we can think of (well, at least for a more secure hashing algorithm than the one above). This is called a "brute force" attempt, and it would take a long time, depending on the input length and the hashing algorithm. But if we had hello and 12400 (and knowledge of the hashing algorithm), could we quickly tell if hello is the password that hashed to 12400? Yes. So what should we store in the database, hello or 12400? Definitely the digest.
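To make "brute force" concrete, here is a sketch that tries every lowercase input up to a given length against the toy hash. The toy hash is repeated in compact form so the example runs on its own; brute_force is a made-up name, and the search is limited to two letters only to keep it quick.

```python
from itertools import product
import string


def toy_hash(text: str) -> str:
    """Compact version of the toy hash described above."""
    digits = "".join(
        str(ord(ch) - ord("a") + 1) if "a" <= ch <= "z"
        else ch if ch in "0123456789"
        else "0"
        for ch in text.lower()
    )
    total = str(sum(int(digits[i:i + 2]) for i in range(0, len(digits), 2)))
    return total.ljust(5, "0") if len(total) < 5 else total[-5:]


def brute_force(target: str, max_len: int) -> list:
    """Hash every lowercase string up to `max_len` letters; keep the matches."""
    matches = []
    for length in range(1, max_len + 1):
        for combo in product(string.ascii_lowercase, repeat=length):
            candidate = "".join(combo)
            if toy_hash(candidate) == target:
                matches.append(candidate)
    return matches


target = toy_hash("hi")
print(target)                  # 89000
print(brute_force(target, 2))  # every 1-2 letter input with that digest
```

The result contains more than one input (hr hashes to the same digest as hi), which is one reason this toy algorithm is flawed. The number of candidates grows by a factor of 26 with every extra letter, which is why brute force gets slow as inputs get longer.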

Client v Server

Before we go on, it's useful to understand all the factors at play, and which person/computer/server has access to what data. The requests we'll be talking about happen between the client and the server.

Client - the service requester. When you visit Wikipedia, the client isn't you; it's usually your web browser or a mobile/desktop app.

Server - the computer that is providing a resource or service. When you visit a Wikipedia link, you send a request that gets routed to one of Wikipedia's many servers, which responds with the information you requested (or maybe some error message).

Code / Algorithms represents the code and encryption/decryption tools that TLWSD sends to your browser.

KEY is passed outside of the channel where the encrypted text is sent. KEY is never sent to nor seen by the server.

Encryption (Client)

This is the initiator of the message.

  • Starts with PLAINTEXT, KEY, and CODE / ALGORITHMS
  • Creates CIPHERTEXT and sends it to the Server

Storage (Server)

This is a TLWSD server.

  • Starts with NOTHING
  • Receives CIPHERTEXT from Encryption Client
  • Sends CIPHERTEXT to Decryption Client

Decryption (Client)

The end user trying to receive a message.

  • Starts with KEY and CODE / ALGORITHMS
  • Requests CIPHERTEXT from Server
  • Creates PLAINTEXT from CIPHERTEXT and KEY by using CODE / ALGORITHMS

Encryption + Hashing

Why do we need both?

Ok, we now have two tools, encryption and hashing. But we haven't really discussed the problem we're trying to solve. The problem is, I want to take a message from you, and deliver it to your friend. I want to provide a seamless experience where your friend will know if they got the password wrong, so they can try again. They'll also know if they got it right, and then see your original message in plaintext. However, I don't want at any time, even for a moment, to have your plaintext, your password, or any other thing that is easily derivable into the plaintext or password.

Encryption gets us most of the way there. I won't know what your plaintext or password is, but your friend won't get the certainty that they've correctly decrypted the message.

Let's look at how we can use encryption + hashing to solve this. In trying to simplify things, I've realized our encryption algorithm requires a number, while our hashing algorithm can take more complex inputs. Since neither of us will ever use these algorithms to actually encrypt anything, we can make the arbitrary rule that the alphabet position of the first letter of the password/key is used for our encryption algorithm.

Demonstration

We need a message to encrypt. We also need a password.

plaintext: we need a message to encrypt (lower case, no punctuation for simplicity)

password: wealsoneedapassword. w is the 23rd letter, so we'll use 23 to encrypt

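First we'll encrypt the message with the password (or "key"). Note that in this run spaces are shifted too, and the alphabet wraps through digits as well as letters: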

we need a message to encrypt
xfaoffeabanfttbhfaupafodszqu
........20 more rows........
h9v099zvwv19ddw79ve v90ycjae
i8w 889wxw08eex68wfaw8 zdkbf
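The rows above can be reproduced with a short Python sketch. The 37-character alphabet (a-z, then the digits 9 down to 0, then a space) is not stated in the article; it is an assumption inferred from the rows shown. The function names are made up for the example.

```python
# Assumption: this ordering is inferred from the rows above, not documented.
ALPHABET = "abcdefghijklmnopqrstuvwxyz9876543210 "


def key_from_password(password: str) -> int:
    """Alphabet position of the first letter of the password (a = 1)."""
    return ord(password[0]) - ord("a") + 1


def shift_text(text: str, shift: int) -> str:
    """Shift every character by `shift` places, wrapping around the alphabet."""
    size = len(ALPHABET)
    return "".join(ALPHABET[(ALPHABET.index(ch) + shift) % size] for ch in text)


def encrypt(plaintext: str, password: str) -> str:
    return shift_text(plaintext, key_from_password(password))


def decrypt(ciphertext: str, password: str) -> str:
    return shift_text(ciphertext, -key_from_password(password))


password = "wealsoneedapassword"
ciphertext = encrypt("we need a message to encrypt", password)
print(key_from_password(password))    # 23
print(ciphertext)                     # i8w 889wxw08eex68wfaw8 zdkbf
print(decrypt(ciphertext, password))  # we need a message to encrypt
```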

Now here's the trick that enables me to "know" if your friend has the right or wrong password, without ever knowing it.

We're going to take the ciphertext and the password, and take the hash of them together.

Why does that solve the problem? Because the decryption code will have access to the ciphertext, and when your friend guesses a password, if their password is the same as yours, then the hash will be the same too.

hash(ciphertext + password) => somehash

hash(ciphertext + password_guess) => ?? <- if this is anything but somehash, it's the wrong password. If it is somehash, then your friend got the password correct.

Let's hash: i8w 889wxw08eex68wfaw8 zdkbf + wealsoneedapassword

i8w 889wxw08eex68wfaw8 zdkbfwealsoneedapassword                        <- ciphertext + password
9823088923242308552468236123802641126235112191514554116119192315184    <- converted to numbers
[98, 23, 8, 89, 23, 24, 23, 8, 55, 24, 68, 23, 61, 23, 80, 26, 41,     <- split into 2 digit numbers
12, 62, 35, 11, 21, 91, 51, 45, 54, 11, 61, 19, 19, 23, 15, 18, 4]
1249                                                                   <- summed
12490                                                                  <- 0 padded, we have our hash!

Finally our problem is solved with just two pieces of information:

ciphertext: i8w 889wxw08eex68wfaw8 zdkbf

hash: 12490
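The password check described above can be sketched as follows, reusing the toy hash in compact form so the example runs on its own. make_tag and check_guess are hypothetical names for the two sides of the exchange.

```python
def toy_hash(text: str) -> str:
    """Compact version of the toy hash from the earlier demonstration."""
    digits = "".join(
        str(ord(ch) - ord("a") + 1) if "a" <= ch <= "z"
        else ch if ch in "0123456789"
        else "0"
        for ch in text.lower()
    )
    total = str(sum(int(digits[i:i + 2]) for i in range(0, len(digits), 2)))
    return total.ljust(5, "0") if len(total) < 5 else total[-5:]


def make_tag(ciphertext: str, password: str) -> str:
    """Sender side: hash(ciphertext + password), stored next to the ciphertext."""
    return toy_hash(ciphertext + password)


def check_guess(ciphertext: str, tag: str, guess: str) -> bool:
    """Receiver side: recompute the hash with the guess and compare."""
    return toy_hash(ciphertext + guess) == tag


ciphertext = "i8w 889wxw08eex68wfaw8 zdkbf"
tag = make_tag(ciphertext, "wealsoneedapassword")

print(check_guess(ciphertext, tag, "wealsoneedapassword"))  # True
print(check_guess(ciphertext, tag, "hello"))                # False
```

Only the ciphertext and the tag need to be stored or transmitted. Because the toy hash has only 5 digits, some wrong guesses will collide with the right one; a real hash function makes that practically impossible.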

And with this pattern and these two pieces of data, we get some important benefits:

  • We transport only encrypted/obscured data. Nothing in the ciphertext or hash digest can be used to derive the password or plaintext.
  • The decryption code on your friend's computer doesn't need to have access to the plaintext or password/key in order to determine if a password guess is correct.
  • When your friend sees the unencrypted plaintext, they can be confident they got the password/key correct.
  • When your friend sees the unencrypted plaintext, they can be confident they are reading the same message that was encrypted.

That last one is important and worth an extra bit of explanation.

Why hash(ciphertext + password)?

Why can't we just hash the password hash(password)?

Originally the goal was to be able to give a "Wrong password" message to the user trying to read the message. If we just hashed the password, and not the ciphertext + password, we would achieve this goal. So why include the ciphertext? We want to ensure not just that the password is correct but, more importantly, that the decrypted plaintext message is the same as the plaintext that was originally encrypted.

If we only hash the password, we also leak information. Or at least make it easier for an attacker. Below are 5 messages that were sent; 3 of them are from you to a friend, using the same password.

Using ciphertext.hash(ciphertext + password)
  • 811C19.39
  • 3F0F0B.B0
  • 01311E.23
  • 7714D5.91
  • 548087.86
Using ciphertext.hash(password)
  • 811C19.2C
  • 3F0F0B.16
  • 01311E.2C
  • 7714D5.E7
  • 548087.2C

It becomes much easier for a malicious actor to know which messages are related, or have the same password. But worse than that, let's say this message 811C19.2C is your bank account number, which your friend is supposed to send $1m to. If a malicious actor got hold of this message and changed it to 911C19.2C, just changing the 8 to a 9 would change the decrypted output, but your friend would have no idea. They would send that $1m to "223" instead of "123".

What would happen in the hash(ciphertext + password) case? Because the 39 digest is hash("811C19" + password), even with the right password, the ciphertext is wrong, so the hash digest will never match. The message would essentially be "broken". Even the right password would fail like a wrong password. That's a bit annoying, but it's by design. What's more annoying: seeing an "incorrect password" message when you know you put the right password in, or sending $1m to the wrong bank account? This is designed to only show the plaintext when 1. the password is correct AND 2. the message has not been altered.

This is a basic explanation of Message Authentication Codes(MAC). You can read more about MACs here.
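For comparison, Python's standard library includes a widely used MAC construction, HMAC. The sketch below swaps the article's hash(ciphertext + password) for HMAC-SHA256 keyed with the password; it illustrates the same idea and is not a description of what TLWSD runs. make_tag and verify are made-up names.

```python
import hashlib
import hmac


def make_tag(ciphertext: bytes, password: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the ciphertext, keyed by the password."""
    return hmac.new(password, ciphertext, hashlib.sha256).hexdigest()


def verify(ciphertext: bytes, password: bytes, tag: str) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(ciphertext, password), tag)


password = b"wealsoneedapassword"
tag = make_tag(b"811C19", password)

print(verify(b"811C19", password, tag))   # True: right password, untouched message
print(verify(b"911C19", password, tag))   # False: the ciphertext was altered
print(verify(b"811C19", b"guess", tag))   # False: wrong password
```

An altered message and a wrong password fail in exactly the same way, which is the "broken message" behaviour described above.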

TLWSD vs Others

Others

Isn't there a simpler method?

There may be simpler methods, but they lose some or all of the benefits of doing the encryption and decryption on the client. Let's look at how some other sites do "encrypted messages", and why they're flawed.

The method I've seen on some other sites is to send the plaintext and the password to the server. The data that gets sent from your computer to the server might look something like this:

example.com/?utf8=%E2%9C%93&authenticity_token=4n%2BMwYx4iMcggjmRiaiF%2BKUYbrW8otsUMybeduiXB0M%3D&message%5Bbody%5D=This+is+my+message&message%5Bpassword%5D=This+is+my+password&message%5Bterms_of_service%5D=0&message%5Bterms_of_service%5D=1&commit=SAVE+THIS+MESSAGE

This URL is currently encoded. Encoding is different from encryption. Encoding allows us to have characters that are reserved for special use, but that can also be used by the user. For example, / is a reserved character in a URL. It has a special meaning, but if I want to submit some text that contains a / character, I can: the browser will encode it to %2F. The goal here is not to distort or hide data, just to separate what is user data from what is used internally as a control character.

We want to decode this URL and see the data that's being sent. To do that we need something that can decode the encoding and parse the query string for us. A query string is just the part of the URL that contains the parameter data. It could be what you entered on a form, or the language/timezone/etc. you have set.

Read more about percent encoding.

Here's the URL query string decoded and parsed:

{
  "authenticity_token": "4n+MwYx4iMcggjmRiaiF+KUYbrW8otsUMybeduiXB0M=",
  "commit": "SAVE THIS MESSAGE",
  "message[body]": "This is my message",
  "message[password]": "This is my password",
  "message[terms_of_service]": "1",
  "utf8": "✓"
}

The authenticity_token param might look similar to an encrypted message, but it's just a token the server uses to check that the form being submitted was one it generated (protection against cross-site request forgery). It isn't encrypted user data.
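The decoding and parsing shown above can be reproduced with Python's standard urllib.parse module. This is one way to do it, not necessarily the tool used for this article. The URL is the example one from above.

```python
from urllib.parse import parse_qsl

URL = (
    "example.com/?utf8=%E2%9C%93"
    "&authenticity_token=4n%2BMwYx4iMcggjmRiaiF%2BKUYbrW8otsUMybeduiXB0M%3D"
    "&message%5Bbody%5D=This+is+my+message"
    "&message%5Bpassword%5D=This+is+my+password"
    "&message%5Bterms_of_service%5D=0"
    "&message%5Bterms_of_service%5D=1"
    "&commit=SAVE+THIS+MESSAGE"
)

# Everything after the first "?" is the query string.
_, _, query = URL.partition("?")

# parse_qsl percent-decodes names and values and turns "+" into a space.
# Building a dict keeps the last value when a name is repeated.
params = dict(parse_qsl(query))

for name in sorted(params):
    print(f"{name}: {params[name]}")
```

No key or password is needed for any of this, which is the difference between encoding and encryption.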

The parts here that should alarm you are message[body] and message[password]: both the message and the password are sent to the server in plaintext. It's possible that the message gets encrypted on the server, and the password gets hashed on the server. But how do we know? Can we trust the person who created the site? Maybe, but even trustworthy people make mistakes. And what about all the steps in between you and the server, do you trust AT&T? Comcast? The only way to make sure your data is secure is to never let it leave your computer unless it's encrypted.

The end user's plaintext password guesses also get sent to the server, where they're compared against the original password (or a hash of that password). If a guess is correct, the plaintext is sent to the end user.

Here's the parsed query string from a password guess:

{
  "authenticity_token": "gPNLW/31XYeI3MMJvQztSCASg8m1K/s0Ot1OEcnFSEM=",
  "commit": "RETRIEVE MESSAGE",
  "password": "testing",
  "utf8": "✓"
}

And when I get the password correct, the response is a bunch of HTML, but buried in there is, of course, the plaintext message:

...
<pre id="retrieved-message">This is my message</pre>
...

Hopefully it's clear why this is not secure. Two main reasons:

  1. The text(plain or cipher) and password are being sent over the same channel, at the same time. This is like locking your front door, and tying the key to the doorknob.
  2. The plaintext and password are leaving your domain(the client), and being sent to another server. Assume that any information that is sent from your computer, to the server, could be seen by anyone.

TLWSD

What do we send to the server?

I've shown the data that other sites send to the server, so let's look at ours. Here is the data:

csrfmiddlewaretoken=bzah36gpB9pwrj3VACqZdfGYWS7xYdbvZXcpjkKsDwIIjBlfVLjPNIYpmvMMd8N6&msg_text=eyJpdiI6IjAxTjJCeVNIUEZ4OXFMK2hYTURUalE9PSIsInYiOjEsIml0ZXIiOjEwMDAsImtzIjoxMjgsInRzIjo2NCwibW9kZSI6ImdjbSIsImFkYXRhIjoiIiwiY2lwaGVyIjoiYWVzIiwic2FsdCI6IlY3T0plNjFaU25jPSIsImN0IjoiWnBGOUp1dnMvQ0JJRnpTTWZaajNycTk5WXp2V21hV3lwdTQ9In0%3D&access_count_remaining=1&max_view_time=0&max_view_time_units=seconds&ttl=0&ttl_units=hours&desc_text=&has_password=true&password_hint=

Here it is parsed:

  {
    "access_count_remaining": "1",
    "csrfmiddlewaretoken": "bzah36gpB9pwrj3VACqZdfGYWS7xYdbvZXcpjkKsDwIIjBlfVLjPNIYpmvMMd8N6",
    "desc_text": "",
    "has_password": "true",
    "max_view_time": "0",
    "max_view_time_units": "seconds",
    "msg_text": "eyJpdiI6IjAxTjJCeVNIUEZ4OXFMK2hYTURUalE9PSIsInYiOjEsIml0ZXIiOjEwMDAsImtzIjoxMjgsInRzIjo2NCwibW9kZSI6ImdjbSIsImFkYXRhIjoiIiwiY2lwaGVyIjoiYWVzIiwic2FsdCI6IlY3T0plNjFaU25jPSIsImN0IjoiWnBGOUp1dnMvQ0JJRnpTTWZaajNycTk5WXp2V21hV3lwdTQ9In0=",
    "password_hint": "",
    "ttl": "0",
    "ttl_units": "hours"
  }

The csrfmiddlewaretoken is similar to the authenticity_token above. It just ensures that the form being submitted was one that was generated by our site.

All the other fields are options you can add to your link. I used the same message and password as I did in the previous example. msg_text is where we see what the message looks like when it's being sent to the server.
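The msg_text value looks opaque, but it appears to be Base64-encoded JSON rather than the message itself. The sketch below decodes it to show what the server actually receives. Reading the fields as encryption settings plus a ciphertext (ct) is an interpretation of the field names, not something stated here.

```python
import base64
import json

MSG_TEXT = "eyJpdiI6IjAxTjJCeVNIUEZ4OXFMK2hYTURUalE9PSIsInYiOjEsIml0ZXIiOjEwMDAsImtzIjoxMjgsInRzIjo2NCwibW9kZSI6ImdjbSIsImFkYXRhIjoiIiwiY2lwaGVyIjoiYWVzIiwic2FsdCI6IlY3T0plNjFaU25jPSIsImN0IjoiWnBGOUp1dnMvQ0JJRnpTTWZaajNycTk5WXp2V21hV3lwdTQ9In0="

# Base64 is an encoding, like percent encoding: anyone can reverse it.
envelope = json.loads(base64.b64decode(MSG_TEXT))

for name, value in envelope.items():
    print(f"{name}: {value}")

# The original text "This is my message" does not appear anywhere in it.
print("This is my message" in json.dumps(envelope))  # False
```

Unlike the other site's request, decoding this payload gives settings and scrambled bytes, and neither the message nor the password.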

We also include a has_password parameter. Why? Because we only send msg_text to the server and never have or send a password parameter, a password-encrypted message and a plaintext message look the same to the server. The only place this really matters is when the end user opens the link: do we show them the msg_text as is, or do we prompt them for a password and use that to decrypt the msg_text? (This still happens client-side; nothing leaves the browser while the user attempts to decrypt the message.)

HTTPS

Doesn't HTTPS take care of all this?

You've probably seen the padlock on the left of the address bar that looks like this:

[Image: HTTPS padlock]

Or maybe you've seen something like this:

[Image: expired HTTPS padlock]

Or more likely you've gotten this when trying to visit a website:

[Image: expired HTTPS warning]

These are all related to HTTPS or Hypertext Transfer Protocol Secure. HTTPS is a secure way of transmitting data between a client and server. HTTPS provides a few protections:

  • It encrypts the data you send to the server, and the data the server sends to you. This stops anyone who may be listening on a public network from seeing the data you're sending (well, at least the plaintext data; they can see the encrypted data, but that's useless to them).
  • It also protects against man-in-the-middle attacks, where someone/some machine sits between you and the website you want to visit and mimics the actual website, meanwhile having access to the data you're sending.

Read more about HTTPS and MITM Attacks

Sure, HTTPS is encryption between you and the server, but HTTPS is not the same thing as the client-side encryption we're talking about.

It's encryption between your computer and the server, so the server still receives your plaintext. You still have to trust that the server you're sending that plaintext to is trustworthy, capable, flawless, superhuman, incapable of mistakes, etc. And I can assure you they're not all of those things.

Isn't Server Side/HTTPS good enough?

I'm not sending nuclear launch codes or anything, just a Netflix password

Maybe you're a generally trusting person, and you think "Hey, not-tlwsd.com has cooler fonts than tlwsd.com, so I'll just use them, I trust them not to share my data". That's fine, I'm not saying other people aren't to be trusted, or that we're more trustworthy. The purpose of client-side encryption is that you don't have to trust anyone that doesn't have your secret key. You don't have to trust that they'll encrypt your message. You don't have to trust that their code is watertight and bug free, or that their database could never accidentally get leaked or hacked. The point is, even if the website doesn't do what they said they'd do, or their database does get hacked, you don't need to worry. If you never sent sensitive data in a readable form, it doesn't really matter who sees the ciphertext.*

* The type of encryption used at TLWSD (AES-GCM 256-bit) is infeasible to crack. This article does a good job explaining how long it would take. But that doesn't mean once it's encrypted, you're off the hook. If you use password1234 as your password, it's not going to take trillions of trillions ... of trillions of years to crack, it'll take seconds. Likewise, if you use the password hint feature for your TLWSD link, and put Password is ThisIsSoSecure:)101? it'll take seconds.

Here's the way I see it: either what you're sending is sensitive, or it's not. If you want to tell your friend the name of that TV show you were talking about, then e-mail it, yell it over a loudspeaker in front of their house, or print it on a million sheets of paper and drop them out of a plane over their office (maybe stick to email). But if it's at all sensitive, whether it's a Netflix password, your credit card number, nuclear launch codes, or your jeans size, then use the most secure method you can, with the lowest number of people with access to the data except for you and the person you're sending it to (if that number is 1 or more, it should be 0).