Adobe break-in
Adobe recently suffered a break-in in which intruders got hold of Adobe users’ data, including email addresses, encrypted passwords and password hints.
Adobe acknowledged the break-in (note that Adobe’s acknowledgement page carries no date or timestamp, at least not as of Nov-17; it only mentions ‘recently’).
The posting on the Sophos blog by Paul Ducklin provides a very interesting overview of the cryptographic blunders made by Adobe. In this post I’ll focus on the content of the file, not the different cryptographic weaknesses.
Data dump file
It didn’t take long before a data dump, purporting to be the hacked customer database, was available for download on the Internet. The file was a 4 GB .tar.gz archive called Base_users_adobe.com.tar.gz. It extracted to a 9 GB file called cred with 153004874 lines. These are the md5sums of the files:
    e3eda0284c82aaf7a043a579a23a09ce  Base_users_adobe.com.tar.gz
    020aaacc56de7a654be224870fb2b516  cred
I have no reason to believe this file was crafted, but I also have no proof that it is the “real” database dump. Regardless, the data in the file is useful enough to run some statistics on.
If you think your account might be affected or is in the data dump file, use the online verification tool from LastPass.
Structure of the dump file
The structure of the file is already described in the post on the Sophos blog. Basically every record contains six fields, separated by a pipe sign (|). The data was always embedded between hyphens (-).
- 1 : the userID
- 2 : a blank field, only containing the hyphens
- 3 : the email address
- 4 : password data
- 5 : the password hint
- 6 : a blank field, only containing the hyphens
Some unusual aspects were found in the file:
- The third field, containing the email address, did not contain a valid email address in some cases. It contained what seemed to be only a username (without the @domain part). Maybe these were older entries where usernames were not equal to the email address? See further down this post for the statistics on valid and non-valid email addresses;
- Not all the records had a password hint;
- The first userID was 103238704, the last userID was 209850522;
- The last line of the data file contained the string “152989508 rows selected.”.
Processing the file
I didn’t think it made sense to process all the records; a sample of the data should also return reasonably valid results. I created a PHP script that read every line of the file and processed only every 250th line.
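    $rec_processed = 0;
    $skip_row = 250;
    while (($buffer = fgets($handle, 4096)) !== false) {
        // Only lines 0, 250, 500, ... are processed; all others are skipped.
        if ($rec_processed % $skip_row) {
            $rec_processed++;
            continue;
        }
        // ... process $buffer as described in the next steps ...
        $rec_processed++;
    }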
Processing a line consisted of exploding the data into an array, using the pipe and the hyphen (|-) as the split pattern.
$line = explode("|-", $buffer);
I then used a combination of substr and again explode to put the values into different fields.
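The substr and explode step is not shown in this post, so what follows is only a minimal sketch of how it could look. The function name parse_record, the made-up sample layout in the comment and the choice to strip a single trailing hyphen are assumptions of this sketch, not the original script. The array keys reuse column names from the database table described further down.

```php
<?php
// Hypothetical helper: split one dump line into named fields.
// Assumed layout (illustrative, not copied from the dump):
//   userid-|--|-email-|-passworddata-|-hint|--
function parse_record(string $buffer): ?array
{
    $line = explode("|-", rtrim($buffer, "\r\n"));
    if (count($line) !== 6) {
        // Not a six-field record, e.g. the trailing "rows selected." line
        // or a hint that itself contains "|-".
        return null;
    }

    // Strip exactly one trailing hyphen left over from the "-|-" separator.
    $strip = fn(string $s): string => substr($s, -1) === "-" ? substr($s, 0, -1) : $s;

    $email = $strip($line[2]);
    $split_email = explode("@", $email);

    return [
        'userid'     => $strip($line[0]),
        'email'      => $email,
        'hash'       => $strip($line[3]),
        'resetpw'    => $line[4],
        'validemail' => count($split_email) === 2 && $split_email[1] !== '',
        'fulldomain' => $split_email[1] ?? '',
    ];
}

print_r(parse_record("103238704-|--|-user@example.com-|-c2FtcGxl-|-usual p/word|--\n"));
```

Returning null for anything that is not a six-field record keeps the summary line at the end of the file out of the statistics.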
I used a couple of booleans to flag specific domain types.
    if (strpos($split_email[1], "mil.") !== false) $mail_mil = true;
    if (strpos($split_email[1], "gov.") !== false) $mail_gov = true;
    if (strpos($split_email[1], "fgov.") !== false) $mail_gov = true;
    if (strpos($split_email[1], "fed.") !== false) $mail_gov = true;
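Note that strpos matches these strings anywhere in the domain part, that "fgov." already contains "gov.", and that the trailing dot means a domain ending in the label itself is not matched. As a hedged alternative (not what the script above does, and it would give different counts than the ones reported below), a check on whole dot-separated labels could look like this. The function name domain_has_label and the example domains are hypothetical.

```php
<?php
// Hypothetical alternative to the strpos() checks: compare whole
// dot-separated labels of the domain instead of substrings.
function domain_has_label(string $domain, array $labels): bool
{
    $parts = explode(".", strtolower($domain));
    return count(array_intersect($parts, $labels)) > 0;
}

// Compare the substring check from the post with the label check
// on a few made-up domains.
$examples = ["mil.example.be", "famil.example.com", "example.mil"];
foreach ($examples as $d) {
    $substring = strpos($d, "mil.") !== false;
    $label = domain_has_label($d, ["mil"]);
    printf("%-18s substring=%s label=%s\n", $d, var_export($substring, true), var_export($label, true));
}
```

The two approaches agree on the first domain and disagree on the other two, so which one to use depends on whether partial matches and final labels should count.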
In the last step I stored all the variables in a database table with the structure below. Most of these fields are self-explanatory. The table is not optimized for speed or storage, but it was perfect for this analysis.
    CREATE TABLE IF NOT EXISTS adobe (
      id int(11) NOT NULL AUTO_INCREMENT,
      email varchar(100) NOT NULL,
      tld varchar(10) NOT NULL,
      domain varchar(50) NOT NULL,
      fulldomain varchar(250) NOT NULL,
      lasttwo varchar(100) NOT NULL,
      resetpw varchar(255) NOT NULL,
      resetpw_len int(11) NOT NULL,
      resetpw_data tinyint(1) NOT NULL,
      resetpw_numeric tinyint(1) NOT NULL,
      resetpw_count int(11) NOT NULL,
      mail_mil tinyint(1) NOT NULL,
      mail_gov tinyint(1) NOT NULL,
      userid varchar(15) NOT NULL,
      hash varchar(255) NOT NULL,
      validemail tinyint(1) NOT NULL,
      -- An AUTO_INCREMENT column must be defined as a key.
      PRIMARY KEY (id)
    );
The whole processing and storing in a database took a while and resulted in 609524 records and a table of 156.8 MB.
Statistics
Remember that the statistics below are all based on the sampled data (1/250). For the actual numbers you’ll have to multiply by a factor of 250, at most.
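As a small worked example of that scaling, the sketch below multiplies a sample count by the sampling interval. The function name estimate_full_count is hypothetical. The result is only an upper-bound estimate. For the full sample it lands close to, but not exactly on, the 152989508 rows reported on the last line of the dump.

```php
<?php
// Upper-bound estimate for the full dump from a 1/250 sample count.
function estimate_full_count(int $sample_count, int $skip_row = 250): int
{
    return $sample_count * $skip_row;
}

echo estimate_full_count(609524), "\n"; // 152381000 records in total
echo estimate_full_count(2057), "\n";   // 514250 non-valid email addresses
```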
Valid and non valid email address
Out of a total of 609524 records, 2057 had a non-valid email address and 607467 had a valid email address.
Valid email | % | Non-valid email | % |
---|---|---|---|
607467 | 99.66% | 2057 | 0.34% |
Records with password reset data
There were 172076 records with password reset data, while 437448 records had an empty password reset field. Note that, of the 172076 records with password reset data, 1456 did not have a valid email address.
With password reset | % | Without password reset | % |
---|---|---|---|
172076 | 28.23% | 437448 | 71.77% |
Password reset length
The maximum password reset length used was 50 (used in 93 records). Strangely enough there were also 2339 records with a password reset length of 1. Most password reset data had a length of approx. 10 or less. Below is a sample of records (uid, email, password reset data) with password reset length of 1.
    10487xxxx | @live.co.uk | !
    10858xxxx | @yahoo.com.tw | 0
    10866xxxx | @hotmail.com | c
    10867xxxx | @aon.at | s
This is a sample of records with password reset length of 12.
    onesty-brokerpark.de | wie Apple ID
    hotmail.com | usual p/word
    casema.nl | voornaam+100
    vlaspand.be | 2 x deurcode
    dds.nl | Lelijk woord
    gmail.com | myself ph no
    alliancemediagroup.net | name, usual#
Password reset length | Occurrences | % |
---|---|---|
1 to 10 | 135506 | 22.23% |
11 to 20 | 31054 | 5.09% |
21 to 30 | 4453 | 0.73% |
31 to 40 | 773 | 0.12% |
41 to 50 | 290 | 0.04% |
0 or no data | 437448 | 71.77% |
Password reset number of words
I also counted the number of words used in the password reset data. A word is anything that PHP’s str_word_count considers a word (see PHP.net).
Most password reset data consisted of one single word (109768 records). The longest password reset data was 14 words long.
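To make the word definition concrete, here is a short sketch that runs str_word_count in its default mode on a few of the hints quoted earlier in this post. For plain ASCII input, it counts runs of letters (plus ' and -), so digits and other symbols act as separators and a purely numeric hint counts as zero words. That is consistent with the table below, where the "0 words or no data" row (450861) is larger than the 437448 records without any password reset data: the difference of 13413 records presumably had data that contained no letters.

```php
<?php
// str_word_count() in its default mode returns the number of words found.
// Digits and symbols such as "+" or "/" are not part of a word.
$samples = ["wie Apple ID", "2 x deurcode", "voornaam+100", "111111"];
foreach ($samples as $hint) {
    printf("%-14s => %d\n", $hint, str_word_count($hint));
}
```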
Password reset number of words | Occurrences | % |
---|---|---|
14 words | 1 | – |
13 words | 3 | – |
12 words | 10 | – |
11 words | 18 | – |
10 words | 30 | – |
9 words | 79 | 0.01% |
8 words | 141 | 0.02% |
7 words | 335 | 0.05% |
6 words | 772 | 0.13% |
5 words | 1917 | 0.31% |
4 words | 4739 | 0.78% |
3 words | 12024 | 1.97% |
2 words | 28826 | 4.73% |
1 word | 109768 | 18.01% |
0 words or no data | 450861 | 73.97% |
Out of curiosity, I checked the password reset data for ‘non-polite’ words, for the presence of ‘adobe’, ‘linux’ or ‘microsoft’, and for references to a loved one. The low numbers seem to indicate that people mind their language when choosing password reset data …
Type | Occurrences |
---|---|
f**k y*u | 22 |
adobe | 761 |
d*mn | 11 |
sh*t | 57 |
honey | 61 |
love | 1310 |
microsoft | 6 |
linux | 14 |
All numeric data in password reset
In total, 5324 records had password reset data that consisted of only a numeric value. A couple of samples are below.
    yahoo.com | 111111
    ymail.com | 041117
    live.com | 987654321
    naver.com | 1626
    telenet.be | 695
    hotmail.com | 20858
    hotmail.com | 2410560724105607
All numeric | % | Mix numeric and non numeric | % |
---|---|---|---|
5324 | 0.87% | 604200 | 99.13% |
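The post does not show how a record was flagged as all numeric. One plausible way, shown here purely as an assumption, is PHP’s ctype_digit, which is true only for non-empty strings made up of the digits 0-9. The function name is_all_numeric is hypothetical.

```php
<?php
// Hypothetical check; the original script's test for resetpw_numeric
// is not shown in the post.
function is_all_numeric(string $resetpw): bool
{
    // ctype_digit() returns false for the empty string and for anything
    // containing a sign, a space, a decimal point or a letter.
    return ctype_digit($resetpw);
}

foreach (["041117", "2410560724105607", "voornaam+100", ""] as $hint) {
    printf("%-18s => %s\n", "'" . $hint . "'", var_export(is_all_numeric($hint), true));
}
```

A looser test such as is_numeric would also accept values like "1e5", so the choice of function affects the count.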
Military or government email addresses
There were 383 email addresses with ‘mil’ in the address and 515 with ‘gov’, ‘fgov’ or ‘fed’ in the address.
Type | Occurrences |
---|---|
‘mil.’ | 383 |
‘gov.’, ‘fgov.’, ‘fed.’ | 515 |
Top TLDs / Country
The majority of the email addresses are in the .com TLD. This is not unusual considering the popularity of the major email providers gmail.com, hotmail.com and yahoo.com. If we leave .com, .net, .org and .edu out of the list, most accounts are based in Germany (.de), France (.fr), the United Kingdom (.uk) and Japan (.jp). The percentages in the table below are relative to the 607467 records with a valid email address.
TLD | Occurrences | % |
---|---|---|
.com | 408609 | 67.26% |
.net | 31091 | 5.12% |
.de | 21232 | 3.50% |
.fr | 16568 | 2.73% |
.uk | 14161 | 2.33% |
.jp | 12883 | 2.12% |
.it | 9663 | 1.59% |
.ru | 9136 | 1.50% |
.edu | 7841 | 1.29% |
.br | 6755 | 1.11% |
.ca | 5426 | 0.89% |
.au | 4614 | 0.76% |
.nl | 4355 | 0.72% |
.es | 4191 | 0.69% |
.org | 4020 | 0.66% |
.pl | 3858 | 0.64% |
… | … | … |
.be | 1611 | 0.27% |
.eu | 425 | 0.07% |
Top domains
There are two ways to look at the ‘top domains’: either take only the part just before the TLD, without taking the TLD into account, or take everything after the “@”. I made both comparisons.
Regardless of the approach, the top three are always hotmail, gmail and yahoo.
Domain | Occurrences | % |
---|---|---|
hotmail | 142336 | 23.43% |
gmail | 96312 | 15.85% |
yahoo | 79609 | 13.11% |
co | 24369 | 4.01% |
com | 20788 | 3.42% |
aol | 14010 | 2.31% |
live | 9769 | 1.61% |
gmx | 6266 | 1.03% |
5845 | 0.96% | |
msn | 5677 | 0.93% |
other | 202486 | 33.33% |
Domain | Occurrences | % |
---|---|---|
hotmail.com | 129690 | 21.35% |
gmail.com | 95922 | 15.79% |
yahoo.com | 70996 | 11.69% |
aol | 13836 | 2.28% |
hotmail.fr | 6143 | 1.01% |
msn.com | 5628 | 0.93% |
hotmail.co.uk | 5620 | 0.93% |
mail.ru | 4994 | 0.82% |
web.de | 4893 | 0.81% |
live.com | 4889 | 0.80% |
other | 264856 | 43.60% |
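The two views above can be derived from the part of the address after the “@”. The sketch below is an assumption about how that split might work, not the original script. The function name split_domain is hypothetical and the array keys reuse the tld, domain and fulldomain column names from the table definition. If the script works along these lines, it would also explain why ‘co’ and ‘com’ show up as domains in the first table: for an address at hotmail.co.uk or yahoo.com.tw, the label just before the TLD is ‘co’ or ‘com’.

```php
<?php
// Hypothetical helper: derive the TLD, the label just before it, and the
// full domain from the part of an email address after the "@".
function split_domain(string $fulldomain): array
{
    $labels = explode(".", $fulldomain);
    $n = count($labels);
    return [
        'tld'        => $labels[$n - 1],
        'domain'     => $n >= 2 ? $labels[$n - 2] : '',
        'fulldomain' => $fulldomain,
    ];
}

print_r(split_domain("gmail.com"));
print_r(split_domain("hotmail.co.uk"));
```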
I was also interested in the number of .be (Belgium) and .eu (Europe) domains that were affected. Below is an overview. In total there were 1611 .be records and 425 .eu records.
.be domains | Occurrences | % |
---|---|---|
skynet.be | 336 | 20.86% |
telenet.be | 315 | 19.55% |
pandora.be | 122 | 7.57% |
live.be | 119 | 7.39% |
hotmail.be | 58 | 3.60% |
scarlet.be | 43 | 2.67% |
… | … | … |
fgov.be | 3 | 0.19% |
other | 618 | 38.36% |
.eu domains | Occurrences | % |
---|---|---|
onet.eu | 87 | 20.47% |
interia.eu | 47 | 11.06% |
… | … | … |
ec.europa.eu | 3 | 0.71% |
other | 288 | 67.76% |