Recently, we were pentesting a data mining and analytics company. The amount of data they handle is phenomenal, and they are planning to move to Big Data. They invited me to write a blog on the state of the art in Big Data security concerns and challenges, and I happily accepted.
Key Insights on Existing Big Data Architecture
Big data is fundamentally different from traditional relational databases in terms of requirements and architecture. Big data is often characterized by the 3Vs: Volume, Velocity and Variety of data. Some of the fundamental differences in Big Data architecture are as follows:
- Distributed Architecture: Big Data architecture is highly distributed, on the scale of thousands of data and processing nodes. Data is horizontally partitioned, replicated and distributed among the available data nodes. As a result, Big Data architecture is generally highly resilient and fault tolerant.
- Real-Time, Stream and Continuous Computations: Performing computations in real time and continuously is the next trend in Big Data, beyond the batch processing model supported by Hadoop.
- Ad-hoc Queries: Big Data enables knowledge workers to create and execute data-analysis queries on the fly.
- Parallel and Powerful Programming Model: The computations performed in Big Data are much more complex, highly parallel and computationally intensive than traditional SQL/PLSQL queries. For example, Hadoop uses the MapReduce framework to perform computations on data processing nodes. MapReduce programs are typically written in Java.
- Move the Code, Not the Data: In Big Data, it is cheaper to move the code to where the data resides than to move the data to the code.
- Non-Relational Data: Departing sharply from traditional relational databases, the data stored in Big Data systems is non-relational. The main advantage of non-relational data is that it can accommodate a large volume and variety of data.
- Auto-tiering: In Big Data, the hottest data blocks are moved to higher-performance media, while the coldest data is sent to lower-cost, high-capacity drives. As a result, it is extremely difficult to know precisely where a given piece of data is located among the available data nodes.
- Variety of Input Data Sources: Big Data requires collecting data from many sources such as logs, endpoint devices, social media, etc.
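The MapReduce programming model mentioned above can be sketched in a few lines. This is a minimal, single-process Python illustration of the map, shuffle and reduce phases for a word count; a real Hadoop job would be written in Java against the Hadoop MapReduce API, with each phase running in parallel across the cluster:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for each word in an input line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts emitted for a single word."""
    return key, sum(values)

def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

On a real cluster, map and reduce tasks are scheduled on the nodes that already hold the relevant data blocks, which is exactly the "move the code, not the data" principle.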
Finally, there is no silver bullet in Big Data in terms of data model. Hadoop is already outdated and unsuitable for many Big Data problems. Some of the emerging Big Data solutions are the following:
- For Real-time analytics: Cloudscale, Storm
- For graph computation: Giraph and Pregel (examples of graph computations are shortest paths, degrees of separation, etc.)
- For low-latency queries over very large data sets: Dremel, and so on.
(Read more: APT Secrets that Vendors Don't Tell)
Top 5 Big Data Vulnerability Classes
1. Insecure Computation
There are many ways an insecure program can create serious security problems for a Big Data solution, including:
- An insecure program can access sensitive data such as personal profiles, ages, credit card numbers, etc.
- An insecure program can corrupt the data, leading to incorrect results.
- An insecure program can mount a denial-of-service attack against your Big Data solution, leading to financial loss.
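One simple mitigation is to allow-list submitted job code by cryptographic hash, so that only reviewed programs ever run on the cluster. The sketch below assumes jobs arrive as code artifacts; `APPROVED_JOBS`, `approve_job` and `submit_job` are illustrative names, not part of any real framework:

```python
import hashlib

# Hypothetical allow-list of SHA-256 digests of reviewed job artifacts.
APPROVED_JOBS = set()

def digest(artifact: bytes) -> str:
    return hashlib.sha256(artifact).hexdigest()

def approve_job(artifact: bytes) -> None:
    """Called after code review: record the artifact's digest."""
    APPROVED_JOBS.add(digest(artifact))

def submit_job(artifact: bytes) -> bool:
    """Refuse to run any job whose code has not been reviewed."""
    if digest(artifact) not in APPROVED_JOBS:
        return False  # reject: unreviewed (possibly insecure) computation
    # ... hand the artifact to the cluster scheduler here ...
    return True
```

This does not make the reviewed code safe by itself, but it blocks arbitrary, unvetted computation from reaching the data nodes.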
2. End-Point Input Validation/Filtering
Big Data collects data from a variety of sources. There are two fundamental challenges in the data collection process:
- Input Validation: How can we trust data? What kind of data is untrusted? What are untrusted data sources?
- Data Filtering: Filter rogue or malicious data.
The sheer volume of data collected in Big Data systems makes it difficult to validate and filter data on the fly.
The behavioral aspect of the data poses additional challenges for input validation and filtering. Traditional signature-based data filtering may not solve the problem completely. For example, a rogue or malicious data source can insert large amounts of well-formed but incorrect data into the system to skew prediction results.
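As a sketch of the two steps above, the following hypothetical collector first validates each record against a minimal schema and then filters implausible readings. Note that, as the paragraph warns, range and schema checks cannot catch data that is well-formed but deliberately misleading:

```python
def validate(record: dict) -> bool:
    """Input validation: enforce a minimal schema on incoming records."""
    return (
        isinstance(record.get("source"), str)
        and isinstance(record.get("value"), (int, float))
    )

def filter_outliers(records: list, low: float, high: float) -> list:
    """Data filtering: drop readings outside a plausible range."""
    return [r for r in records if low <= r["value"] <= high]

def collect(raw: list, low: float = 0.0, high: float = 1000.0) -> list:
    """Collection pipeline: validate first, then filter what remains."""
    return filter_outliers([r for r in raw if validate(r)], low, high)
```

Real deployments would add per-source trust scores and statistical anomaly detection, but the same validate-then-filter shape applies.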
(Read more: Technology/Solution Guide for Single Sign-On)
3. Granular Access Control
Existing Big Data solutions are designed for performance and scalability, with almost no security in mind. Traditional relational databases have fairly comprehensive access control at the user, table, row and even cell level. However, several fundamental challenges prevent Big Data solutions from providing comparable access control:
- Big Data security is still an area of ongoing research.
- The non-relational nature of the data breaks the traditional paradigm of table-, row- or cell-level access control. Current NoSQL databases depend on third-party solutions or application middleware to provide access control.
- Ad-hoc queries pose an additional challenge with respect to access control. In a relational database, end users submit well-defined SQL queries against which permissions can be checked; the on-the-fly analytical queries of knowledge workers are far harder to constrain.
- Access control is often disabled by default.
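Because current NoSQL stores often delegate access control to application middleware, a field-level check might be layered on top of a key-value store roughly as follows. The roles, ACL and store contents here are purely illustrative:

```python
# Hypothetical per-role ACL: which fields of a document each role may read.
ACL = {
    "analyst": {"age", "region"},
    "admin": {"age", "region", "name", "card_number"},
}

# Stand-in for a NoSQL key-value store holding documents.
STORE = {
    "user:1": {"name": "Alice", "age": 30, "region": "EU", "card_number": "4111-xxxx"},
}

def read(role: str, key: str) -> dict:
    """Middleware read path: return only the fields the role may see."""
    allowed = ACL.get(role, set())
    doc = STORE.get(key, {})
    return {field: value for field, value in doc.items() if field in allowed}
```

The weakness of this design is exactly the one the bullet list names: the check lives outside the database, so any path that bypasses the middleware bypasses the ACL.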
4. Insecure Data Storage and Communication
There are multiple challenges related to data storage and communication in Big Data:
- Data is stored on many distributed data nodes. Authentication, authorization and encryption of data are a challenge at each node.
- Auto-tiering: automatic partitioning and movement of data can land sensitive data on lower-cost, less secure media.
- Real-time analytics and continuous computation require low query latency, so encryption and decryption add unwelcome performance overhead.
- Secure communication among nodes, middleware and end users is another area of concern.
- The transaction logs of a Big Data system are themselves big data, and should be protected just like the data.
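For the last point, even where full encryption is too costly, transaction logs can at least be made tamper-evident. A minimal sketch using an HMAC-SHA256 tag per log record follows; how the shared key is distributed and rotated is out of scope and simply assumed:

```python
import hashlib
import hmac

def sign_record(key: bytes, record: bytes) -> bytes:
    """Append an HMAC-SHA256 tag so later modification of the record is detectable."""
    tag = hmac.new(key, record, hashlib.sha256).hexdigest().encode()
    return record + b"|" + tag

def verify_record(key: bytes, signed: bytes) -> bool:
    """Recompute the tag over the record and compare in constant time."""
    record, _, tag = signed.rpartition(b"|")
    expected = hmac.new(key, record, hashlib.sha256).hexdigest().encode()
    return hmac.compare_digest(tag, expected)
```

This detects tampering but does not hide the log contents; confidential fields would still need encryption on top.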
5. Privacy Preserving Data Mining and Analytics
Monetization of Big Data generally involves data mining and analytics. However, there are many security concerns pertaining to monetizing and sharing Big Data analytics, including invasion of privacy, invasive marketing, and unintentional disclosure of sensitive information, all of which must be addressed.
For example, AOL released anonymized search logs for academic purposes, but users were easily identified from their searches. Netflix faced a similar problem when users in its anonymized data set were identified by correlating their Netflix movie ratings with IMDB ratings.
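The AOL and Netflix incidents are failures of naive anonymization. One commonly cited, stronger (though still not sufficient on its own) criterion is k-anonymity: every combination of quasi-identifiers in a released data set must be shared by at least k records. A minimal check, with illustrative field names, might look like:

```python
from collections import Counter

def is_k_anonymous(records: list, quasi_identifiers: list, k: int) -> bool:
    """True if every quasi-identifier combination appears in at least k records."""
    combos = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in combos.values())
```

The Netflix case shows why this alone is not enough: ratings joined against an outside data set (IMDB) acted as quasi-identifiers the publisher never considered.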