Cleaning and Analysing Data Breach Data

Note: I wrote this back in mid 2017. I don’t know why I forgot to move it out of draft, so here it is now.

I wanted to see how often particular domains (the right-hand sides of email addresses) occurred in recent data breaches. Of course, not every breach is freely available on the web, but it didn’t take long (about 2 minutes of Googling) to find both the LinkedIn and Adobe breach contents available as torrents.

I decided to use SQL Server to run queries on this data, as I already had the dev version running on my PC. These data sets are pretty massive though, so I needed to clean up the data to make querying it practical. Since I was searching for particular email domains, the domains needed to be isolated into their own column so they could be matched quickly. Without separating them out, I’d have to run a string-matching query on the end of the email address (e.g. WHERE Email LIKE '%@foo.com'), which is disgustingly slow on datasets of this size (it takes about 3 minutes to scan across every row). So I wrote a couple of little console apps to iterate over the data and make it BULK INSERT friendly for SQL Server.
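As a rough illustration of the difference (the table and column names here are hypothetical; the real schema comes later), the two query shapes look something like this:

-- Slow: a leading-wildcard LIKE can't use an index, so SQL Server has to scan every row
SELECT COUNT(*) FROM BreachData WHERE Email LIKE '%@foo.com';

-- Fast: an equality match on a separate, indexed domain column can seek straight to the matching rows
SELECT COUNT(*) FROM BreachData WHERE EmailDomain = 'foo.com';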

Disclaimer: I’m not cracking the hashes found in these data sets, nor am I going to publish the data to the public. If you ask me for a copy of the data, I will say no. If you want to find out whether you were in any data breaches, go to Have I Been Pwned, as it contains far more data sets than I have looked at (and some of them aren’t publicly available).

LinkedIn

The LinkedIn data breach consists of a bundle of .txt files containing the output from some database dumps. There’s also a massive SQL file that I assume is the generation script for the tables, but I didn’t bother looking too closely into it.

The format of the data seems to be:

{email}:{hash}

Some emails are missing (just containing the string null), and some hashes are missing (just containing the string xxx). Some files also contain an ID number instead of an email address, and there are also entries that are stored as {id} : {hash} -> {email}, so we have a lot of cleaning up to do.
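To make that concrete, here are some hypothetical lines in each of the shapes I ran into (made-up addresses, IDs and truncated hashes, not real entries from the dump):

someuser@example.com:aaaa1111bbbb2222...
null:aaaa1111bbbb2222...
someuser@example.com:xxx
12345678:aaaa1111bbbb2222...
12345678 : aaaa1111bbbb2222... -> someuser@example.com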

The Code

I actually started writing this in PowerShell, but it ended up requiring .NET libraries to iterate through the files efficiently anyway, so I figured I might as well whip up a quick C# console app to do the job instead. Debugging in Visual Studio also made it a lot easier to examine the data being processed.

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        int counter = 0;
        string line;
        string rootDirectory = @"C:\Users\tim\Downloads\LinkedIn\";

        int goodCounter = 0;
        int badCounter = 0;

        foreach (var file in Directory.GetFiles(rootDirectory, "*.txt", SearchOption.AllDirectories))
        {
            using (var inputFile = new StreamReader(file))
            using (var outputFile = new StreamWriter(rootDirectory + "importable.log", true)) // append = true, and a different extension so the loop doesn't read it
            using (var logFile = new StreamWriter(rootDirectory + "log.log", true))
            {
                while ((line = inputFile.ReadLine()) != null)
                {
                    counter++;
                    if (line.Contains("@"))
                    {
                        if (line.Contains("->"))
                        {
                            // {id} : {hash} -> {email} entries: the email sits after the arrow
                            var foo = line.Replace("->", ";").Split(';');
                            try
                            {
                                var emailParts = foo[1].Split('@');

                                var sb = new StringBuilder();
                                sb.Append(emailParts[0]); // email user
                                sb.Append(":");
                                sb.Append(emailParts[1]); // email domain
                                outputFile.WriteLine(sb.ToString());
                                goodCounter++;
                            }
                            catch (IndexOutOfRangeException)
                            {
                                badCounter++;
                                logFile.WriteLine("Bad Entry: " + counter + ", " + line);
                            }
                        }
                        else
                        {
                            // {email}:{hash} entries: the email sits before the first colon
                            var splitLine = line.Split(':');
                            if (splitLine.Length > 1)
                            {
                                try
                                {
                                    var emailParts = splitLine[0].Split('@');

                                    var sb = new StringBuilder();
                                    sb.Append(emailParts[0]); // email user
                                    sb.Append(":");
                                    sb.Append(emailParts[1]); // email domain
                                    outputFile.WriteLine(sb.ToString());
                                    goodCounter++;
                                }
                                catch (IndexOutOfRangeException)
                                {
                                    badCounter++;
                                    logFile.WriteLine("Bad Entry: " + counter + ", " + line);
                                }
                            }
                            else
                            {
                                badCounter++;
                                logFile.WriteLine(line);
                            }
                        }
                    }
                    else
                    {
                        badCounter++;
                        logFile.WriteLine(line);
                    }

                    if (goodCounter != 0 && goodCounter % 1000000 == 0) { Console.WriteLine("Good Entries: " + goodCounter); }
                    if (badCounter != 0 && badCounter % 1000000 == 0) { Console.WriteLine("Bad Entries: " + badCounter); }
                }
            }
        }
        Console.WriteLine("Final Good Entry Count: " + goodCounter);
        Console.WriteLine("Final Bad Entry Count: " + badCounter);
        Console.ReadLine();
    }
}

I made it write out to console on every millionth entry found, so I got a sense of how fast it was running and how much was left to process.

The final count after execution was 82,209,229 entries with email addresses, and 168,548,828 entries that either had no email address or were badly formatted. Interestingly, I expected around 164,611,595 cleaned records, which is the figure Troy Hunt reports on Have I Been Pwned, but mine was about half of that. He does allow searching on usernames in addition to email addresses, so that could account for the extra data, or there might be more data inside the SQL script that I ignored. I’m not too fussed though.

The SQL

Now to load it into the DB:

USE DataBreaches
GO

DROP TABLE LinkedIn
GO

CREATE TABLE LinkedIn (EmailUser VARCHAR(512), EmailDomain VARCHAR(512))
GO

CREATE CLUSTERED INDEX [EmailUserIdx] ON [dbo].[LinkedIn] ([EmailUser] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE NONCLUSTERED INDEX [EmailDomainIdx] ON [dbo].[LinkedIn] ([EmailDomain] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

PRINT 'First File'
GO

BULK INSERT LinkedIn FROM 'C:\Users\tim\Downloads\LinkedIn\importable.log' WITH ( ROWTERMINATOR = '\n', FIELDTERMINATOR = ':')
GO

I found that creating the indexes first and then inserting the data was the fastest approach, and it took about 30 mins to load it all in.
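Once the import finished, a quick sanity check (just a sketch; the row count should line up with the good-entry total from the console app) looks like:

SELECT COUNT(*) AS ImportedRows FROM [dbo].[LinkedIn];
SELECT TOP 10 EmailUser, EmailDomain FROM [dbo].[LinkedIn];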

Adobe

The Adobe breach data was presented just as a single file, containing entries in the format:

{number}-|--|-{email}-|-{hash}-|-{securityQuestionAnswer}|--
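A hypothetical line in that shape (made-up values, not a real entry) would look something like:

12345678-|--|-someuser@example.com-|-Tm90QVJlYWxIYXNo=-|-first pet's name|--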

The Code

This one was a little less complicated, as the data was more consistent. To get the data I wanted out (the email address), I simply replaced the annoying delimiter (‘-|-’) with one that’s easier to manage (‘;’), which made the line simpler to split. Then I split the email on the ‘@’ symbol, giving me the username and domain that I needed.

using System;
using System.IO;

class Program
{
    static void Main(string[] args)
    {
        int counter = 0;
        string line;
        string rootDirectory = @"C:\Users\tim\Downloads\users.tar\";

        using (var inputFile = new StreamReader(rootDirectory + "cred"))
        using (var outputFile = new StreamWriter(rootDirectory + "importable.txt"))
        using (var logFile = new StreamWriter(rootDirectory + "log.txt"))
        {
            while ((line = inputFile.ReadLine()) != null)
            {
                // Swap the '-|-' delimiter for ';' so the whole line can be split in one go
                var splitLine = line.Replace("-|-", ";").Split(';');
                if (splitLine.Length > 1)
                {
                    try
                    {
                        var emailParts = splitLine[2].Split('@'); // the email is the third field
                        var toWrite = emailParts[0] + ":" + emailParts[1];
                        outputFile.WriteLine(toWrite);
                        counter++;
                        if (counter % 1000000 == 0)
                        {
                            Console.WriteLine(counter); // printing out every million to show where the program is at
                        }
                    }
                    catch (IndexOutOfRangeException)
                    {
                        logFile.WriteLine("Bad Entry: " + counter);
                    }
                }
            }
        }
        Console.WriteLine("Final Count: " + counter);
        Console.ReadLine();
    }
}

For this program, I got a final count of 152,472,407 records in a file about 6GB in size. There were 517,128 rows that couldn’t be parsed properly, which were omitted from the importable.txt file. I checked a couple of these out, and they look like usernames instead of email addresses. I’m happy to ignore them as they’re not relevant to what I’m trying to search for.

I wrote this one before the LinkedIn one, so I might’ve saved a little time by building the output with a StringBuilder as in the LinkedIn app, but as it’s a once-off run it’s not really worth optimising now. It still only took about 25 mins to run.

The SQL

Again, we use BULK INSERT to bring the contents of that text file into a table:

USE DataBreaches
GO

DROP TABLE Adobe
GO

CREATE TABLE Adobe (EmailUser VARCHAR(512), EmailDomain VARCHAR(512))
GO

CREATE CLUSTERED INDEX [EmailUserIdx] ON [dbo].[Adobe] ([EmailUser] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

CREATE NONCLUSTERED INDEX [EmailDomainIdx] ON [dbo].[Adobe] ([EmailDomain] ASC) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO

PRINT 'First File'
GO

BULK INSERT Adobe FROM 'C:\Users\tim\Downloads\users.tar\importable.txt' WITH ( ROWTERMINATOR = '\n', FIELDTERMINATOR = ':')
GO

And again this insert took about 30 mins to execute. It’d probably be faster inserting into a dedicated SQL Server machine with proper storage, rather than the dev instance on my laptop, but the execution time was reasonable enough that I could get 150 million records cleaned up and inserted into an indexed DB in about an hour.

The Data

So now to run some queries:

41
SELECT COUNT(*) FROM [dbo].[LinkedIn] WHERE EmailDomain = 'gmail.com';
SELECT COUNT(*) FROM [dbo].[LinkedIn] WHERE EmailDomain = 'hotmail.com';
SELECT COUNT(*) FROM [dbo].[LinkedIn] WHERE EmailDomain = 'microsoft.com';
SELECT COUNT(*) FROM [dbo].[LinkedIn] WHERE EmailDomain = 'apple.com';
SELECT COUNT(*) FROM [dbo].[LinkedIn] WHERE EmailDomain = 'google.com';

SELECT COUNT(*) FROM [dbo].[Adobe] WHERE EmailDomain = 'gmail.com';
SELECT COUNT(*) FROM [dbo].[Adobe] WHERE EmailDomain = 'hotmail.com';
SELECT COUNT(*) FROM [dbo].[Adobe] WHERE EmailDomain = 'microsoft.com';
SELECT COUNT(*) FROM [dbo].[Adobe] WHERE EmailDomain = 'apple.com';
SELECT COUNT(*) FROM [dbo].[Adobe] WHERE EmailDomain = 'google.com';

The counts that came back:

EmailDomain       LinkedIn      Adobe
gmail.com         17,017,515    24,035,770
hotmail.com       13,727,483    32,571,129
microsoft.com         18,684        19,867
apple.com              2,502         3,987
google.com             6,448        23,651

I’m not sure what other interesting information could be gleaned from these datasets, but it was a fun exercise in processing large text files and cleaning them up with code for fast querying.
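If I do come back to it, one obvious follow-up (just a sketch, untested) would be ranking the domains rather than checking specific ones, which the EmailDomain index should handle reasonably well:

SELECT TOP 20 EmailDomain, COUNT(*) AS Addresses
FROM [dbo].[LinkedIn]
GROUP BY EmailDomain
ORDER BY COUNT(*) DESC;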