Encryption is the process of converting a piece of plaintext into a ciphertext using a key.
Plaintext - the original, unencrypted message. This is the data that's being encrypted.
This is some readable plaintext. It's a really big secret so I had better be sure to encrypt it properly.
Ciphertext - the encrypted version of the plaintext after it's completed encryption.
Once plaintext is encrypted into ciphertext, it looks like this:
U2FsdGVkX1+H7BSzYgumzI4SfcqHpp9KxqAPsPTZ1TU13gC6dEBYnP2Q5q0r7wRRR1WxGMvsYFVGJlV6/atZQfC6XiaiMZUafJyhCvf/h52gzR7qv2o+G76XBaAItir+ZrcqDaCkLvKtWbEGkS44LsDVBU4lEqnTrA==
This ciphertext is no longer human readable text, but given the right key, it can be converted back into the original plaintext.
Key - the data used to encrypt, and sometimes decrypt, a message (some methods of encryption use a different key for encrypting than for decrypting). For our purposes "key" and "password" are used interchangeably.
Let's make an encryption algorithm. For real use cases it's not a good idea to build your own algorithm, but for a demonstration it's ok.
We'll start with some assumptions about our input plaintext. For simplicity, we'll limit it to lowercase letters (a-z).
For the password (or key), you pick any number, and each character gets incremented by that amount, wrapping around from z back to a.
So, if we have the plaintext: here is my secret message
And for the key we'll pick: 13
Our first letter, h, becomes u.
here is my secret message <-- plaintext
ifsf jt nz tfdsfu nfttbhf
jgtg ku oa ugetgv oguucig
khuh lv pb vhfuhw phvvdjh
livi mw qc wigvix qiwweki
mjwj nx rd xjhwjy rjxxflj
nkxk oy se ykixkz skyygmk
olyl pz tf zljyla tlzzhnl
pmzm qa ug amkzmb umaaiom
qnan rb vh bnlanc vnbbjpn
robo sc wi combod wocckqo
spcp td xj dpncpe xpddlrp
tqdq ue yk eqodqf yqeemsq
urer vf zl frperg zrffntr <-- ciphertext
Our final encrypted message would look like this:
urer vf zl frperg zrffntr
To decrypt this, someone would need to know the ciphertext and the key used to encrypt it: 13.
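The shifting described above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the function name `caesar_shift` is made up for this example, and passing spaces through unchanged is an assumption based on the example message.

```python
import string

ALPHABET = string.ascii_lowercase  # "abcdefghijklmnopqrstuvwxyz"


def caesar_shift(text: str, key: int) -> str:
    """Shift every lowercase letter by `key` places, wrapping past z."""
    shifted = []
    for ch in text:
        if ch in ALPHABET:
            shifted.append(ALPHABET[(ALPHABET.index(ch) + key) % 26])
        else:
            shifted.append(ch)  # assumption: spaces and other characters pass through
    return "".join(shifted)


ciphertext = caesar_shift("here is my secret message", 13)
print(ciphertext)                     # urer vf zl frperg zrffntr
print(caesar_shift(ciphertext, -13))  # here is my secret message
```

Decrypting is the same operation with the key negated, which is why anyone who holds both the ciphertext and the key can read the message.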
So is our message safe? No, not really. This is known as a Caesar Cipher, which was used by Julius Caesar over 2000 years ago, and it has some major weaknesses. The version we did, called ROT13, has a unique property: the same operation both encrypts and decrypts. Because the Latin alphabet has 26 characters, shifting by 13 characters twice takes you back to the original message.
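The "shift twice and you are back where you started" property is easy to check. The sketch below uses the `rot_13` codec that ships with Python's standard library, rather than hand-written shifting code.

```python
import codecs

message = "here is my secret message"

once = codecs.encode(message, "rot_13")   # shift by 13
twice = codecs.encode(once, "rot_13")     # shift by 13 again: 26 in total

print(once)              # urer vf zl frperg zrffntr
print(twice == message)  # True
```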
So why is our algorithm not secure? It's very vulnerable to an attack called frequency analysis. Letters appear in roughly the same ratios in any reasonably long English text.
I've taken the first chapter of Sherlock Holmes by Arthur Conan Doyle, and counted the number of times each letter appears, then ordered them.
Here are the letters (e appears the most, and z the least):
etoaisnhrdlumcwyfgpbvkxjqz
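Counting letters like this takes only a few lines. The sketch below is one way to do it; the function name `letters_by_frequency` and the alphabetical tie-breaking are assumptions for illustration, since the article doesn't say how ties were handled.

```python
from collections import Counter


def letters_by_frequency(text: str) -> str:
    """Return the letters a-z that occur in `text`, most frequent first."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return "".join(ordered)


print(letters_by_frequency("hello world"))  # lodehrw
```

Run over a whole chapter instead of two words, the result is an ordering like the one above.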
I took the first paragraph of chapter 2, put it through our algorithm, and then counted the frequency of each letter. Remember this is encrypted, so r isn't the most frequently used letter, but whatever has been encrypted as r is.
rgvnufbaeqypjzslhtoicxdk
Because we know how to decrypt our algorithm, let's take a peek at what these letters are when shifted 13 characters back to the original letters.
etiahsonrdlcwmfyugbvpkqx
Here are the letters from Chapter 1, and our encrypted paragraph, side by side:
etoaisnhrdlumcwyfgpbvkxjqz <-- Ch. 1
etiahsonrdlcwmfyugbvpkqx__ <-- Ch. 2 Paragraph
It's not identical, but it's close. If someone didn't know how to decrypt the text, or that the key was 13, they could just replace each letter in the text with the corresponding letter in the "true" letter frequency order. So r becomes e, etc.
rgvnufbaeqypjzslhtoicxdk__ <- replace each one of these letters
etoaisnhrdlumcwyfgpbvkxjqz <- with the corresponding letter here
Let's see the original text next to our "cracked" text.
at three oclock precisely i was at baker street but holmes had not yet returned the landlady....
at tiree nulnuk vreuosely o mas at paker street pft inlces iad hnt yet retfrhed tie lahdlady...
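The letter-for-letter replacement can be sketched with a translation table built from the two frequency orders quoted in the article. The ROT13 call only produces a stand-in for the intercepted ciphertext; the attacker's part is the `translate` step, which never uses the key.

```python
import codecs

# Frequency orders quoted in the article.
REFERENCE_ORDER = "etoaisnhrdlumcwyfgpbvkxjqz"   # chapter 1, plaintext
CIPHER_ORDER = "rgvnufbaeqypjzslhtoicxdk"        # chapter 2 paragraph, encrypted

# zip() stops at the shorter string, so only the 24 letters seen in the
# ciphertext get a guess.
guess_table = str.maketrans(dict(zip(CIPHER_ORDER, REFERENCE_ORDER)))

original = ("at three oclock precisely i was at baker street "
            "but holmes had not yet returned the landlady")
ciphertext = codecs.encode(original, "rot_13")  # stand-in for the intercepted text

cracked = ciphertext.translate(guess_table)
print(cracked)
# at tiree nulnuk vreuosely o mas at paker street pft inlces iad hnt yet retfrhed tie lahdlady
```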
It's not quite perfect, but there are some obvious changes we could make:
h -> n: lahdlady is clearly landlady, so we know h should really be n
o -> i: at in the beginning is correct, so we know o mas should be i mas, rather than a mas
m -> w: m in mas should be a w
I made these substitutions, and some other obvious ones that appeared after making the above substitutions, and within about 6 substitutions we get to:
at three oclock precisely i was at baker street but holmes had not yet returned the landlady...
at three oflofk vrefisely i was at baker street but holces had not yet returned the landlady...
Ok, so we missed oclock, precisely, and holmes, but if these were the plans of some enemy, we would still have most of the information we might need to thwart their attack. And as the text gets longer, it's more likely to align with the "true" letter frequency, as well as giving us lots of words we can use to find obvious fixes to any errors.
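Here is a sketch of how manual corrections like these could be applied in one pass. Only the first three replacements come from the article. The rest are a hypothetical list that happens to reproduce the improved text, since the full set of substitutions isn't given.

```python
cracked = ("at tiree nulnuk vreuosely o mas at paker street "
           "pft inlces iad hnt yet retfrhed tie lahdlady")

# Hypothetical correction list: the first three come from the article,
# the rest are guesses that reproduce the article's improved text.
corrections = {
    "h": "n",  # lahdlady -> landlady
    "o": "i",  # o mas -> i mas
    "m": "w",  # mas -> was
    "i": "h",  # tie -> the
    "n": "o",  # hnt -> not
    "p": "b",  # paker -> baker
    "f": "u",  # pft -> but
    "u": "f",  # f and u swap places
}

# translate() applies every replacement in one pass, so "h" -> "n" and
# "n" -> "o" do not chain into "h" -> "o".
improved = cracked.translate(str.maketrans(corrections))
print(improved)
# at three oflofk vrefisely i was at baker street but holces had not yet returned the landlady
```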
Hashing is another part of the world of cryptography, but it's different from encryption. With encryption the important part was that the data was preserved. With a hash we can't get back the information that we put in, but the hash can be used to verify that the inputted information is the same.
Hash Function - instructions used for a hashing operation.
Digest - the output from the hash function.
Encryption is just one direction of a cyclical process; the other is decryption.
PLAINTEXT <-> CIPHERTEXT
With hashing it's a one-way operation, and the hash is the final result.
INPUT -> DIGEST
5E884898DA28047151D0E56F8DC6292773603D0D6AABBDD62A11EF721D1542D8
559AEAD08264D5795D3909718CDD05ABD49572E84FE55590EEF31A88A08FDFFD
CE9FE3447A34D159CBF59C8B01688AFEF4EDAFD32D5A2DB20EC4F002C8C43BDC
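The digests above are 64 hexadecimal characters long, which is the shape of SHA-256 output, and the first one matches the SHA-256 digest of the string password. The article doesn't name the hash function, so treat SHA-256 here as an assumption. A minimal sketch using Python's standard `hashlib`:

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of `text` as uppercase hexadecimal."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest().upper()


print(sha256_hex("password"))
# 5E884898DA28047151D0E56F8DC6292773603D0D6AABBDD62A11EF721D1542D8

# Very different input sizes still give digests of the same length.
print(len(sha256_hex(".")), len(sha256_hex("a" * 100_000)))  # 64 64
```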
Say we want to store usernames and passwords in a database to use for user sign up/sign in. You sign up with username: bob and password: password. We could just store bob/password in the database and quite easily use it to sign you in. But what if the database gets hacked or leaked? Your username and password, which you likely use other places too, are now out there, in plaintext.
But this is taking unneeded risks. I don't really need to know your actual password to achieve my goal of authenticating you. I don't care that you type password; what I really need to know is "did you send the same thing this time that you sent when you signed up?" For that we can replace the plaintext password with the hash of the password.
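As a rough sketch of that idea, the example below keeps only a digest for each user and compares digests at sign in. Everything here is hypothetical: the `users` dictionary stands in for a database, and salted PBKDF2 from the standard library is swapped in for "the hash of the password", which is not something the article specifies.

```python
import hashlib
import hmac
import os

users = {}  # hypothetical in-memory "database": username -> (salt, digest)


def _digest(password: str, salt: bytes) -> bytes:
    # PBKDF2 is a standard-library stand-in for "the hash of the password".
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def sign_up(username: str, password: str) -> None:
    salt = os.urandom(16)
    users[username] = (salt, _digest(password, salt))  # the password itself is never stored


def sign_in(username: str, password: str) -> bool:
    if username not in users:
        return False
    salt, stored = users[username]
    # "Did you send the same thing this time that you sent when you signed up?"
    return hmac.compare_digest(stored, _digest(password, salt))


sign_up("bob", "password")
print(sign_in("bob", "password"))   # True
print(sign_in("bob", "passw0rd"))   # False
```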
Letters are replaced with their numeric position in the alphabet
Numbers stay as numbers
Non alphanumeric characters are replaced with "0"
a2c4e6 <- the raw input
123456 <- input converted to numbers
12 34 56 <- split numbers into groups of two
12 + 34 + 56 <- sum the numbers
102 <- if length < 5:
10200 <- pad from the right with `0` until 5 digits
10200 <- final hash digest
10234567 <- if length > 5:
10234567 <- cut digits from the left until 5 digits
34567 <- final hash digest
We are cutting/padding the result because hashes tend to be a fixed output length. Whether you input a ".", or the entire dictionary, you'll get back values of equal length, but different contents.
Here's the process of hashing the password: hello
he l l o
85121215 <- replacing 'h' with '8', 'e' with '5', etc.
85 12 12 15 <- split the number into 2-digit numbers
124 <- sum the numbers
12400 <- pad with "0" until 5 numbers long
So the hash (or digest) of hello, for this hashing algorithm, is 12400. Any time hello is given as an input to this function, the digest will always be 12400.
And for the password: password.1.2.3.password.4.5.6.
pa s s w o rd.1.2.3. pa s s w o rd.4.5.6.
161191923151840102030161191923151840405060
16 11 91 92 31 51 84 01 02 03 01 61 19 19 23 15 18 40 40 50 60
728
72800
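The rules and worked examples above translate fairly directly into code. This is a sketch of the toy algorithm for demonstration only. The name `toy_hash` is made up, and lowercasing the input first is an assumption, since the article only uses lowercase letters.

```python
def toy_hash(text: str) -> str:
    """Toy digest from the rules above. For demonstration only."""
    digits = ""
    for ch in text.lower():  # assumption: uppercase is treated like lowercase
        if "a" <= ch <= "z":
            digits += str(ord(ch) - ord("a") + 1)  # a -> 1 ... z -> 26
        elif ch in "0123456789":
            digits += ch                           # numbers stay as numbers
        else:
            digits += "0"                          # everything else becomes 0

    # Split into groups of two from the left and sum them.
    total = sum(int(digits[i:i + 2]) for i in range(0, len(digits), 2))

    result = str(total)
    if len(result) < 5:
        return result.ljust(5, "0")  # pad from the right
    return result[-5:]               # cut digits from the left


print(toy_hash("a2c4e6"))                          # 10200
print(toy_hash("hello"))                           # 12400
print(toy_hash("password.1.2.3.password.4.5.6."))  # 72800
```

Because everything is squeezed into a five-digit sum, many different inputs share a digest, which is one reason this toy is not suitable for real use.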
Can we take 12400 and use it to get back to the input, hello? No, the only way to determine the original input is by calculating the hash of every possible combination we can think of (well, at least for a more secure hashing algorithm than the one above). This is called a "brute force" attempt, and would take a long time, depending on the input length and the hashing algorithm. But if we had hello and 12400 (and knowledge of the hashing algorithm), could we quickly tell if hello is the password that hashed to 12400? Yes. So what should we store in the database? hello or 12400? Definitely the "digest".
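To make the asymmetry concrete, here is a small sketch that uses SHA-256 from the standard library (an assumed stand-in for a more secure hash than the toy one). Checking one guess takes a single hash, while recovering an unknown input means trying candidates one by one. The helper names and the lowercase-only search space are illustrative choices.

```python
import hashlib
import string
from itertools import product


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def brute_force(target: str, max_length: int):
    """Hash every lowercase string up to `max_length` until one matches."""
    for length in range(1, max_length + 1):
        for letters in product(string.ascii_lowercase, repeat=length):
            candidate = "".join(letters)
            if digest(candidate) == target:
                return candidate
    return None


stored = digest("cab")             # what the database would hold
print(digest("cab") == stored)     # checking one guess is a single hash: True
print(brute_force(stored, 3))      # recovering the input means trying them all: cab
print(26 ** 3, "vs", 26 ** 10)     # candidates of length 3 vs length 10
```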
Before we go on, it's useful to understand all the factors at play, and which person/computer/server has access to what data. The requests we'll be talking about happen between the client and the server.
Code / Algorithms represents the code and encryption/decryption tools that TLWSD sends to your browser.
This is the initiator of the message
This is a TLWSD server.
The end user trying to receive a message.
Ok, we now have two tools, encryption and hashing. But we haven't really discussed the problem we're trying to solve. The problem is, I want to take a message from you, and deliver it to your friend. I want to provide a seamless experience where your friend will know if they got the password wrong, so they can try again. They'll also know if they got it right, and then see your original message in plaintext. However, I don't want at any time, even for a moment, to have your plaintext, your password, or any other thing that is easily derivable into the plaintext or password.
Encryption gets us most of the way there. I won't know what your plaintext or password is, but your friend won't get the certainty that they've correctly decrypted the message.
Let's look at how we can use encryption + hashing to solve this. In trying to simplify things, I've realized our encryption algorithm requires a number and our hashing algorithm can take more complex inputs. Since neither of us will ever use these algorithms to actually encrypt anything, we can make the arbitrary rule that the alphabet position of the first letter of the password/key is used for our encryption algorithm.
We need a message to encrypt. We also need a password.
plaintext: we need a message to encrypt (lower case, no punctuation for simplicity)
password: wealsoneedapassword. w is the 23rd letter, so we'll use 23 to encrypt
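The arbitrary "first letter of the password" rule is a one-liner. The function name below is hypothetical, and rejecting passwords that don't start with a letter is an assumption, because the article doesn't cover that case.

```python
def key_from_password(password: str) -> int:
    """Alphabet position of the first letter: a -> 1, b -> 2, ... z -> 26."""
    first = password[0].lower()
    if not "a" <= first <= "z":
        raise ValueError("this toy rule expects the password to start with a letter")
    return ord(first) - ord("a") + 1


print(key_from_password("wealsoneedapassword"))  # 23
```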
First we'll encrypt the message with the password (or "key")
we need a message to encrypt
xfaoffeabanfttbhfaupafodszqu
........20 more rows........
h9v099zvwv19ddw79ve v90ycjae
i8w 889wxw08eex68wfaw8 zdkbf
Now here's the trick that enables me to "know" if your friend has the right or wrong password, without ever knowing it.
We're going to take the ciphertext and the password, and take the hash of them together.
Why does that solve the problem? Because the decryption code will have access to the ciphertext, and when your friend guesses a password, if their password is the same as yours, then the hash will be the same too.
hash(ciphertext + password) => somehash
hash(ciphertext + password_guess) => ?? <- if this is anything but somehash, it's the wrong password. If it is somehash, then your friend got the password correct.
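The same check can be written as a minimal sketch. SHA-256 is an assumed stand-in for the hash function here, so the digest will not match the toy digest used in this walkthrough. The function names are made up for illustration.

```python
import hashlib


def check_value(ciphertext: str, password: str) -> str:
    """hash(ciphertext + password); SHA-256 is an assumed stand-in for the hash."""
    return hashlib.sha256((ciphertext + password).encode("utf-8")).hexdigest()


ciphertext = "i8w 889wxw08eex68wfaw8 zdkbf"
somehash = check_value(ciphertext, "wealsoneedapassword")  # stored next to the ciphertext


def is_right_password(guess: str) -> bool:
    return check_value(ciphertext, guess) == somehash


print(is_right_password("wealsoneedapassword"))  # True
print(is_right_password("notthepassword"))       # False
```

Only the ciphertext and `somehash` need to be stored; the password and the plaintext never appear in what is saved.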
Let's hash: i8w 889wxw08eex68wfaw8 zdkbf + wealsoneedapassword
i8w 889wxw08eex68wfaw8 zdkbfwealsoneedapassword <- ciphertext + password
23701657421213719211952201707145180162201015250321721220228111921144116119192315184 <- converted to numbers
[23, 70, 16, 57, 42, 12, 13, 71, 92, 11, 95, 22, 1, 70, 71, 45, 18, 1, 62, 20,
10, 15, 25, 3, 21, 72, 12, 20, 22, 81, 11, 92, 11, 44, 11, 61, 19, 19, 23, 15, 18, 4] <- split into 2 digit numbers
1421 <- summed
"14210" <- 0 padded, we have our hash!
Finally our problem is solved with just two pieces of information:
ciphertext: i8w 889wxw08eex68wfaw8 zdkbf
hash: 14210
And with this pattern and these two pieces of data, we get some important benefits:
That last one is important and worth an extra bit of explanation.
hash(ciphertext + password)? hash(password)?
Originally the goal was to be able to give a "Wrong password" message to the user trying to read the message. If we just hashed the password, and not the ciphertext + password, we would achieve this goal. So why include the ciphertext? We want to ensure not just that the password is correct, but more importantly, that the decrypted plaintext message is the same as the plaintext that was originally encrypted.
If we only hash the password, we also leak information. Or at least make it easier for an attacker. Below are 5 messages that were sent, 3 of them are from you to a friend, using the same password.
ciphertext.hash(ciphertext + password)
811C19.39
3F0F0B.B0
01311E.23
7714D5.91
548087.86
ciphertext.hash(password)
811C19.2C
3F0F0B.16
01311E.2C
7714D5.E7
548087.2C
It becomes much easier for a malicious actor to know which messages are related, or have the same password. But worse than that, let's say this message 811C19.2C is your bank account number which your friend is supposed to send $1m to. If a malicious actor got a hold of this message, and changed it to 911C19.2C, just changing the 8 to a 9 will change the decrypted output, but your friend will have no idea. They will send that $1m to "223" instead of "123". What would happen in the hash(ciphertext + password) case? Because the 39 digest is hash("811C19" + password), even with the right password, the ciphertext is wrong, so the hash digest will never match. The message would essentially be "broken". Even the right password would fail like a wrong password. That's a bit annoying, but it's by design. What's more annoying, seeing an "incorrect password" message when you know you put the right password in, or sending $1m to the wrong bank account? This is designed to only show the plaintext when 1. the password is correct AND 2. the message has not been altered.
This is a basic explanation of Message Authentication Codes(MAC). You can read more about MACs here.
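For comparison, here is a sketch of the same tamper check using a standard MAC construction, HMAC-SHA256 from Python's `hmac` module. This is a swapped-in technique, not the toy hash from this article, and the function names are made up. The ciphertext values are the ones from the example above.

```python
import hashlib
import hmac


def make_tag(ciphertext: str, password: str) -> str:
    # HMAC-SHA256 keyed with the password; a stand-in for hash(ciphertext + password).
    return hmac.new(password.encode("utf-8"), ciphertext.encode("utf-8"),
                    hashlib.sha256).hexdigest()


def verify(ciphertext: str, password: str, tag: str) -> bool:
    return hmac.compare_digest(make_tag(ciphertext, password), tag)


password = "wealsoneedapassword"
tag = make_tag("811C19", password)

print(verify("811C19", password, tag))         # True: right password, untouched message
print(verify("911C19", password, tag))         # False: right password, altered message
print(verify("811C19", "wrongpassword", tag))  # False: wrong password
```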
There may be simpler methods, but they lose some or all of the benefits of doing the encryption and decryption on the client. Let's look at how some other sites do "encrypted messages", and why it's flawed.
The method I've seen on some other sites is to send the plaintext and the password to the server. The data that gets sent from your computer to the server might look something like this:
example.com/?utf8=%E2%9C%93&authenticity_token=4n%2BMwYx4iMcggjmRiaiF%2BKUYbrW8otsUMybeduiXB0M%3D&message%5Bbody%5D=This+is+my+message&message%5Bpassword%5D=This+is+my+password&message%5Bterms_of_service%5D=0&message%5Bterms_of_service%5D=1&commit=SAVE+THIS+MESSAGE
This URL is currently encoded. Encoding is different from encryption. Encoding allows us to have characters that are reserved for special use, but that can also be used by the user. For example / is a reserved character in a URL. It has a special meaning, but if I want to submit some text that contains a / character, I can: the browser will encode it to %2F. The goal here is not to distort or hide data, just to separate what is user data from what is used internally as a control character.
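A quick way to see this encoding in action is with `urllib.parse` from Python's standard library. The sample text is made up. Note that anyone can reverse the encoding without a key, which is exactly why it isn't encryption.

```python
from urllib.parse import quote, unquote

user_text = "either/or"

encoded = quote(user_text, safe="")  # safe="" so that "/" is encoded as well
print(encoded)                       # either%2For
print(unquote(encoded))              # either/or
```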
We want to decode this URL and see the data that's being sent. To do that we need something that can decode the encoding, and parse the query string for us. A query string is just the part of the URL that contains the parameter data. It could be what you entered on a form, or the language/timezone/etc. you have set.
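One way to do the decoding and parsing, sketched with Python's standard `urllib.parse` (other tools would work just as well):

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "example.com/?utf8=%E2%9C%93"
    "&authenticity_token=4n%2BMwYx4iMcggjmRiaiF%2BKUYbrW8otsUMybeduiXB0M%3D"
    "&message%5Bbody%5D=This+is+my+message"
    "&message%5Bpassword%5D=This+is+my+password"
    "&message%5Bterms_of_service%5D=0&message%5Bterms_of_service%5D=1"
    "&commit=SAVE+THIS+MESSAGE"
)

query = urlsplit(url).query   # everything after the "?"
params = parse_qs(query)      # decodes %XX escapes and "+", groups repeated keys

for name, values in sorted(params.items()):
    print(name, "=", values)
```

`parse_qs` returns a list of values per key, so the repeated `message[terms_of_service]` key comes back as both "0" and "1". Parsers that keep a single value per key usually keep the last one.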
Here's the URL query string decoded and parsed:
"authenticity_token": "4n+MwYx4iMcggjmRiaiF+KUYbrW8otsUMybeduiXB0M=", "commit": "SAVE THIS MESSAGE", "message[body]": "This is my message", "message[password]": "This is my password", "message[terms_of_service]": "1", "utf8": "✓"
The authenticity_token param might look similar to an encrypted message, but it's just a token the site generates so it can confirm that the form submission came from one of its own pages. It is not an encrypted copy of your message.
The parts here that should alarm you are message[body] and message[password]: both are sent to the server in plaintext. It's possible that the message gets encrypted on the server, and the password gets hashed on the server. But how do we know? Can we trust the person who created the site? Maybe, but even trustworthy people make mistakes. And what about all the steps in between you and the server, do you trust AT&T? Comcast? The only way to make sure your data is secure is to never let it leave your computer unless it's encrypted.
The end user's plaintext password guesses also get sent to the server, where they're compared against the original password (or a hash of that password). If a guess is correct, the plaintext is sent to the end user.
Here's the parsed query string from a password guess:
{ "authenticity_token": "gPNLW/31XYeI3MMJvQztSCASg8m1K/s0Ot1OEcnFSEM=", "commit": "RETRIEVE MESSAGE", "password": "testing", "utf8": "✓" }
And when I get the password correct, the response is a bunch of HTML, but buried in there is, of course, the plaintext message
... <pre id="retrieved-message">This is my message</pre> ...
Hopefully it's clear why this is not secure. Two main reasons:
I've shown the data that other sites send to the server, so let's look at ours. Here is the data:
csrfmiddlewaretoken=bzah36gpB9pwrj3VACqZdfGYWS7xYdbvZXcpjkKsDwIIjBlfVLjPNIYpmvMMd8N6&msg_text=eyJpdiI6IjAxTjJCeVNIUEZ4OXFMK2hYTURUalE9PSIsInYiOjEsIml0ZXIiOjEwMDAsImtzIjoxMjgsInRzIjo2NCwibW9kZSI6ImdjbSIsImFkYXRhIjoiIiwiY2lwaGVyIjoiYWVzIiwic2FsdCI6IlY3T0plNjFaU25jPSIsImN0IjoiWnBGOUp1dnMvQ0JJRnpTTWZaajNycTk5WXp2V21hV3lwdTQ9In0%3D&access_count_remaining=1&max_view_time=0&max_view_time_units=seconds&ttl=0&ttl_units=hours&desc_text=&has_password=true&password_hint=
Here it is parsed:
{ "access_count_remaining": "1", "csrfmiddlewaretoken": "bzah36gpB9pwrj3VACqZdfGYWS7xYdbvZXcpjkKsDwIIjBlfVLjPNIYpmvMMd8N6", "desc_text": "", "has_password": "true", "max_view_time": "0", "max_view_time_units": "seconds", "msg_text": "eyJpdiI6IjAxTjJCeVNIUEZ4OXFMK2hYTURUalE9PSIsInYiOjEsIml0ZXIiOjEwMDAsImtzIjoxMjgsInRzIjo2NCwibW9kZSI6ImdjbSIsImFkYXRhIjoiIiwiY2lwaGVyIjoiYWVzIiwic2FsdCI6IlY3T0plNjFaU25jPSIsImN0IjoiWnBGOUp1dnMvQ0JJRnpTTWZaajNycTk5WXp2V21hV3lwdTQ9In0=", "password_hint": "", "ttl": "0", "ttl_units": "hours" }
The csrfmiddlewaretoken is similar to the authenticity_token above. It just ensures that the form being submitted was one that was generated by our site.
All the other fields are options you can add to your link. I used the same message and password as I did in the previous example. msg_text is where we see what the message looks like when it's being sent to the server.
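The msg_text value is Base64 text, and decoding it (encoding again, not encryption) shows a small JSON envelope. The sketch below decodes the value from the parsed data above. Reading ct as the ciphertext, and iv and salt as encryption parameters, is my interpretation of the field names, not something the article spells out.

```python
import base64
import json

msg_text = (
    "eyJpdiI6IjAxTjJCeVNIUEZ4OXFMK2hYTURUalE9PSIsInYiOjEsIml0ZXIiOjEwMDAsImtzIjoxMjgs"
    "InRzIjo2NCwibW9kZSI6ImdjbSIsImFkYXRhIjoiIiwiY2lwaGVyIjoiYWVzIiwic2FsdCI6IlY3T0pl"
    "NjFaU25jPSIsImN0IjoiWnBGOUp1dnMvQ0JJRnpTTWZaajNycTk5WXp2V21hV3lwdTQ9In0="
)

fields = json.loads(base64.b64decode(msg_text))

print(fields["cipher"], fields["mode"])  # aes gcm
print(sorted(fields))                    # the field names in the envelope
```

Nothing in the envelope is the message or the password in readable form. The only message-derived field is `ct`, which still has to be decrypted with the password.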
We also include a has_password parameter. Why? Because we're only sending msg_text to the server, and don't have or send a password parameter, a password-encrypted message and a plaintext message all look the same to the server. The only place this really matters is when the end user opens the link. Do we show them the msg_text as is, or do we prompt them for a password, and use that to decrypt the msg_text? (This still happens client-side; nothing is leaving the browser while the user attempts to decrypt the message.)
You've probably seen the padlock on the left of the address bar that looks like this:
Or maybe you've seen something like this:
Or more likely you've gotten this when trying to visit a website:
These are all related to HTTPS or Hypertext Transfer Protocol Secure. HTTPS is a secure way of transmitting data between a client and server. HTTPS provides a few protections:
man-in-the-middle attacks, where someone/some machine sits between you and the website you want to visit, and mimics the actual website, meanwhile having access to the data you're sending.
Read more about HTTPS and MITM Attacks
Sure, HTTPS is encryption between you and the server, but HTTPS is not the same thing as the client-side encryption we're talking about.
It's encryption between your computer and the server. You still have to trust that the server you're sending that plaintext to is trustworthy, capable, flawless, superhuman, incapable of mistakes, etc. And I can assure you they're not all of those things.
Maybe you're a generally trusting person, and you think "Hey, not-tlwsd.com has cooler fonts than tlwsd.com, so I'll just use them, I trust them not to share my data". That's fine, I'm not saying other people aren't to be trusted, or that we're more trustworthy. The purpose of client-side encryption is that you don't have to trust anyone that doesn't have your secret key. You don't have to trust that they'll encrypt your message. You don't have to trust that their code is watertight and bug free, or that their database could never accidentally get leaked or hacked. The point is, even if the website doesn't do what they said they'll do, or their database does get hacked, you don't need to worry. If you never sent sensitive data in a readable form, it doesn't really matter who sees the ciphertext.*
* The type of encryption used at TLWSD (AES-GCM 256-bit) is infeasible to crack. This article does a good job explaining how long it would take. But that doesn't mean once it's encrypted, you're off the hook. If you use password1234 as your password, it's not going to take trillions of trillions ... of trillions of years to crack, it'll take seconds. Likewise, if you use the password hint feature for your TLWSD link, and put Password is ThisIsSoSecure:)101? it'll take seconds.
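A back-of-the-envelope calculation shows the gap between guessing a key and guessing a weak password. The guessing speed and the size of the password list below are assumptions picked for illustration, not measurements.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
GUESSES_PER_SECOND = 10 ** 12   # assumption: a trillion guesses per second

key_space = 2 ** 256            # every possible 256-bit key
weak_list = 10 ** 7             # assumption: size of a common-password list

years_for_keys = key_space / GUESSES_PER_SECOND / SECONDS_PER_YEAR
seconds_for_list = weak_list / GUESSES_PER_SECOND

print(f"all 256-bit keys: about {years_for_keys:.1e} years")
print(f"common-password list: about {seconds_for_list:.0e} seconds")
```

The strength of the cipher only helps if the password is not something an attacker would try early.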
Here's the way I see it, either what you're sending is sensitive, or it's not. If you want to tell your friend the name of that TV show you were talking about, then e-mail it, yell it over loudspeaker in front of their house, or print it on a million sheets of paper and drop them out of a plane over their office(maybe stick to email). But if it's at all sensitive, whether it's a Netflix password, your credit card number, nuclear launch codes, or your jeans size, then use the most secure method you can, with the lowest number of people with access to the data except for you, and the person you're sending it to(if that number is 1 or more, it should be 0).