May 26, 2009
Data Proliferation, Attacking the Monster We’ve Created
Within our homes, small and medium business settings, and enterprise environments
we use data. We manipulate it, we report on it, we use it to create more data, we
may ship it off site, we bring it in, and we send it out. We need all of it to do
our jobs, but are we keeping track of where we are placing it?
Are we concerned that, some of the times we send it out, we do so unknowingly
or accidentally?
Do we ever stop and think about what the exposure to the company is if we continue
to create these stores of data that could expose us to legal proceedings, loss of
business, or worse?
Data proliferation is occurring everywhere around us. Surprisingly, the majority of it
is legitimate use of data, and a good portion of that data is sensitive and protected
under any number of regulations (PCI, HIPAA, California SB 1386, etc.). So how do we
understand data proliferation, and what can we do to manage it?
Understanding data proliferation is often a mix of psychology, computer science,
and business process re-engineering. It is frustrating at times, but hopefully I
can put you on a path that will, if nothing else, help you get your hands around
your own data. Dealing with data proliferation is a continual fight that can be summed
up as follows:
1. Identify Possible Locations
2. Discover Sensitive Information
3. Identify Business Process
4. Re-Engineer or Remove Process
5. Identify Third Party Locations
6. Repeat the Process
Each one of these steps will be expanded upon. It is often easy for us to pick out
the obvious locations where data is, or should be, stored. That isn’t our problem.
The problem is all of the areas where data is unexpectedly stored, so we will turn
to a tool to see what the damage is. For this I suggest using an open source tool;
I recommend Spider if you are in a Windows environment. It does the job of identifying
the obvious credit card numbers, social security numbers, etc. without the cost of an
enterprise data loss prevention solution.
There will be false positives, but at this point in the game we are just trying to
home in on our data stores, and there is not yet a cost justification for acquiring
a more powerful solution.
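To make the false positive problem concrete, here is a minimal sketch of how a scanner might separate likely card numbers from arbitrary digit strings. It is not how Spider works internally; the pattern, the function names, and the choice of the Luhn checksum as a filter are all assumptions made for illustration.

```python
import re

# Assumed candidate pattern: 13 to 16 digits, optionally separated by
# single spaces or dashes, not embedded in a longer run of digits.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,15}(?!\d)")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return digit strings in text that look like card numbers and pass Luhn."""
    hits = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            hits.append(digits)
    return hits
```

A checksum filter like this removes many random 16-digit strings (order numbers, tracking IDs) but not all of them, so every hit still needs a human to look at it.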
First, identify all areas within the business that may use or work with sensitive data.
The immediate areas are typically Human Resources, Accounting, Internal Audit, all
corporate file servers, and backups. This is by no means an exhaustive list of
possible locations, but these areas, in my experience, are typically the largest stores
of sensitive and protected information.
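Once you have a list of candidate locations, a rough hit count per location helps decide where to look first. The sketch below is one way to get that count, under several assumptions: the locations are reachable as directory paths, only plain-text files are read (Excel workbooks and other binary formats would need a real parser), and the SSN-like pattern and the function name are hypothetical.

```python
import os
import re
from collections import Counter

# Assumed pattern for SSN-like strings in the form 123-45-6789.
SSN_PATTERN = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")
# Assumption: only these plain-text extensions are scanned.
TEXT_EXTENSIONS = {".txt", ".csv", ".log"}


def count_hits_by_location(roots):
    """Map each root directory to the number of SSN-like strings found under it."""
    hits = Counter()
    for root in roots:
        hits[root] = 0
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() not in TEXT_EXTENSIONS:
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "r", errors="ignore") as handle:
                        hits[root] += len(SSN_PATTERN.findall(handle.read()))
                except OSError:
                    continue  # unreadable file: skip it rather than abort the scan
    return hits
```

Calling `count_hits_by_location` with the paths of, say, the HR and Accounting shares returns a counter that can be sorted with `most_common()` to rank the locations.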
A typical scenario is that we use a tool like Spider within our network and we identify
a user workstation with multiple Excel spreadsheets containing what appears to be
sensitive data. We talk to the user who has these files, and we find out that they
use these Excel files to facilitate and monitor the charge back process. At the end
of the quarter they go back and show the charge back percentages, and they use the
files for profit analysis and any number of other legitimate business needs.
So we now have an idea of why, but what happens to the data when it is compiled into
these reports? Are there subsystems involved?
Are there applications that are handling this data?
Where are the backups located for these systems?
Are there additional file shares that are used to share these Excel files?
Are there any laptops that are used to perform these business processes?
Now the scope of our search has expanded from finding these Excel files to maybe two
or three other departments, several other file servers, maybe some offline systems
like laptops, and of course the backups for these systems. At this point we need to
stop the bleeding of data and the exposure. So we have to dig deep into each process
that is occurring within the charge back department to understand how they are using
the data and how much of it they actually need. Is there a secure way to share this
information? Is a credit card number needed for profit and loss statements? Often
we will find that only a portion of the jobs associated with the data gathering
require all subsets of the data.
Could we provide secure jobs that ship these reports to the charge back department
with sanitized data? Could a system be put in place to present full credit card numbers
only when needed? At this point we merge the psychological and technological parts
of our minds to understand the business process and develop the solution. Often
we find these data stores and discover that the business process behind them doesn’t
need the data; people are just doing what has always been done.
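As an illustration of what "sanitized data" could mean in practice, here is a minimal sketch that masks everything but the last four digits of any card-like number before a report leaves the system. The pattern, the function name, and the choice to keep four digits are assumptions for the example, not a statement of what any regulation requires.

```python
import re

# Assumed candidate pattern: 13 to 16 digits, optionally separated by
# single spaces or dashes, not embedded in a longer run of digits.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,15}(?!\d)")


def mask_card_numbers(text: str, keep_last: int = 4) -> str:
    """Replace card-like numbers with asterisks, keeping the last few digits."""

    def _mask(match):
        digits = re.sub(r"[ -]", "", match.group())
        visible_from = len(digits) - keep_last
        return "*" * visible_from + digits[visible_from:]

    return CARD_CANDIDATE.sub(_mask, text)
```

For example, `mask_card_numbers("Chargeback 4111 1111 1111 1111 approved")` returns `"Chargeback ************1111 approved"`, which is usually enough for a percentage or profit report.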
Ask people the question: do you absolutely need this data to perform your job?
If they do, get the documentation and regulations that require it. Then implement the
appropriate controls on those systems to meet or exceed the regulatory requirements
affecting that data. Often, once more stringent controls are put in place, people
realize that maybe they don’t need all of the data to run their reports and there
may be a better way to do it.
Following the loop through, we now have to perform all of these steps for each
business process that the sensitive data could have spread to. At this point the
task of understanding our data proliferation dilemma could be overwhelming.
A more advanced solution could be considered, and in that realm there are two that
should be evaluated: McAfee’s Network DLP Discover and Symantec’s Vontu DLP solution.
Both of these solutions are maintained by leaders in the security space and should be
placed on the score sheet for any company looking to acquire a DLP discovery solution.