As the influence of Big Data continues to permeate the tech world, it is becoming increasingly important to understand the intricacies of the field. To help you get to grips with some of the fundamental concepts and terms, we have compiled a brief guide to the definitions we feel are most important. (If you are new to Big Data, you might also want to check out our overview of industry-favourite Apache Hadoop.)
Algorithm: A set of rules given to an AI, neural network, or other machine to help it learn on its own; classification, clustering, recommendation, and regression are four of the most popular types.
Apache Hadoop: An open-source framework for storing and processing large data sets across clusters of machines, using the MapReduce processing model.
Apache Spark: An open-source big data processing engine that can run on top of Apache Hadoop or Mesos, or in the cloud.
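To make the idea concrete, here is a minimal word-count sketch using Spark's Python API (this assumes Spark and the pyspark package are installed; the input path is a placeholder):

```python
# A minimal PySpark word count, assuming the pyspark package is installed
# and "input.txt" (a placeholder path) exists locally.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

counts = (
    spark.sparkContext.textFile("input.txt")   # read lines as an RDD
    .flatMap(lambda line: line.split())        # split lines into words
    .map(lambda word: (word, 1))               # pair each word with a count of 1
    .reduceByKey(lambda a, b: a + b)           # sum counts per word across the cluster
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```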
Artificial intelligence: The development of computer systems able to perform tasks normally requiring human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages.
Blob storage: A Microsoft Azure service that stores unstructured data in the cloud as a blob or an object.
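As an illustration, here is a hedged sketch using the azure-storage-blob Python SDK; the connection string, container name, file name, and blob name below are all placeholders:

```python
# Upload a local file as a blob, assuming the azure-storage-blob package is
# installed; the connection string and names below are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<your-connection-string>")
blob = service.get_blob_client(container="my-container", blob="report.csv")

with open("report.csv", "rb") as data:   # a placeholder local file
    blob.upload_blob(data, overwrite=True)
```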
Cluster: A subset of data that share particular characteristics. A cluster can also refer to several machines that work together to solve a single problem.
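For the first sense of the word, here is a small clustering sketch using scikit-learn's k-means (an assumed dependency): points that share similar characteristics end up in the same cluster.

```python
# Illustrative k-means clustering with scikit-learn: nearby points are
# grouped into the same cluster.
from sklearn.cluster import KMeans
import numpy as np

points = np.array([[1.0, 1.2], [0.9, 1.1], [8.0, 8.3], [7.8, 8.1]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_)   # e.g. [0 0 1 1] -- two groups of similar points
```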
Cassandra: A popular open-source database management system from Apache, designed to handle large volumes of data across distributed servers.
Data Flow Management: The specialised process of ingesting raw device data while managing the flow of thousands of producers and consumers. Once ingested, the data passes through a series of preparation steps before further business processing; these include data enrichment, in-stream analysis, aggregation, splitting, schema translation, and format conversion.
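The sketch below is a toy illustration of those preparation steps in plain Python; it is not any particular product's API, and the field names are made up:

```python
# A toy data-flow pipeline: raw device readings are enriched, converted,
# and aggregated before further business processing.
raw_readings = [
    {"device": "sensor-1", "temp_c": 21.4},
    {"device": "sensor-2", "temp_c": 19.8},
]

def enrich(record):
    # data enrichment: add context downstream consumers need (hypothetical lookup)
    record["site"] = "factory-a"
    return record

def convert(record):
    # format conversion: Celsius to Fahrenheit
    record["temp_f"] = record["temp_c"] * 9 / 5 + 32
    return record

prepared = [convert(enrich(r)) for r in raw_readings]
average_f = sum(r["temp_f"] for r in prepared) / len(prepared)   # simple aggregation
print(prepared, average_f)
```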
Dark Data: This refers to all the data that is gathered by enterprises but not used for any meaningful purpose. It is ‘dark’ and may never be analysed. It may take the form of things like social network feeds, call centre logs, meeting notes, and other miscellaneous items. It is estimated that between 60% and 90% of all collected data is ‘dark’, but the exact figure is unknown.
Data Lake: A storage repository that holds raw data in its native format.
Data Mining: Finding meaningful patterns and deriving insights from large data sets using sophisticated pattern-recognition techniques. To derive these patterns, data scientists use statistics, machine learning algorithms, and artificial intelligence.
Data Scientist: Someone who can make sense of big data by extracting raw data and working with it to come up with useful insights and analytics. Data scientists need to be skilled in statistics, computer science, and storytelling, and to have a good understanding of the business context they work in.
Data warehouse: A repository for enterprise-wide data, held in a structured format after being cleaned and integrated with other sources. Data warehouses are typically (though not exclusively) used for conventional structured data.
Device Layer: The entire range of sensors, actuators, smartphones, gateways, and industrial equipment that send data streams corresponding to their environment and performance characteristics.
Distributed File System: A data storage system that spreads large volumes of data across multiple storage devices, since big data is usually too large to store on a single machine. This helps to decrease the cost and complexity of storing large amounts of data.
Machine Learning: A method of designing systems that can learn, adjust, and improve based on the data fed to them. Using statistical and predictive algorithms, these systems continually zero in on “correct” behaviour and insights, and they keep improving as more data flows through them.
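A minimal supervised-learning sketch with scikit-learn (an assumed dependency) shows the idea: the model learns a rule from a handful of labelled examples and can then score unseen data.

```python
# Toy supervised learning: predict exam outcomes from hours studied.
from sklearn.linear_model import LogisticRegression

hours_studied = [[1], [2], [3], [8], [9], [10]]   # toy feature
passed_exam   = [0, 0, 0, 1, 1, 1]                # toy labels

model = LogisticRegression().fit(hours_studied, passed_exam)
print(model.predict([[7]]))   # predicted label for an unseen example
```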
MapReduce: A data processing model that filters and sorts data in the Map stage, then performs a function on that data and returns an output in the Reduce stage. MapReduce is a core component of Apache Hadoop.
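The following pure-Python sketch mimics the two stages for a word count; it illustrates the model rather than Hadoop's actual Java API:

```python
# Map emits key/value pairs; Reduce combines the values for each key.
from collections import defaultdict

documents = ["big data is big", "data is valuable"]

# Map: emit (word, 1) for every word
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)   # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```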
Munging: The process of manually converting or mapping data from one raw form into another format for more convenient consumption.
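For example, a small munging step with pandas (an assumed dependency) that maps raw strings into tidier, typed columns:

```python
# Clean up column names, whitespace, and currency strings for easier analysis.
import pandas as pd

raw = pd.DataFrame({
    "Customer Name": [" alice ", "BOB"],
    "order total": ["£10.50", "£3.20"],
})

clean = raw.rename(columns={"Customer Name": "customer_name",
                            "order total": "order_total"})
clean["customer_name"] = clean["customer_name"].str.strip().str.title()
clean["order_total"] = clean["order_total"].str.replace("£", "", regex=False).astype(float)
print(clean)
```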
Normalizing: The process of organising data into tables so that the results of using the database are always unambiguous and as intended.
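A small sketch using Python's built-in sqlite3 module shows the idea: repeated customer details are split into their own table and referenced by key, so each fact is stored only once.

```python
# Normalised layout: customers live in one table, orders reference them by id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Alice', 'Leeds');
    INSERT INTO orders VALUES (101, 1, 25.00), (102, 1, 40.00);
""")

# Joining the tables recovers the denormalised view unambiguously.
for row in conn.execute(
        "SELECT c.name, o.amount FROM orders o JOIN customers c ON c.id = o.customer_id"):
    print(row)
```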
NoSQL: This stands for Not Only SQL (Structured Query Language). NoSQL refers to database management systems designed to handle large volumes of data that do not fit a fixed structure or ‘schema’ (as relational SQL databases require). NoSQL databases are often well-suited to big data systems because of their flexibility in the types of data they can accommodate.
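As a hedged illustration using the pymongo driver (this assumes a MongoDB server running locally; the database and collection names are placeholders), two documents with different fields can live in the same collection because no fixed schema is enforced:

```python
# Schemaless storage: records in one collection need not share the same fields.
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["demo"]["events"]

events.insert_one({"type": "click", "page": "/home"})
events.insert_one({"type": "purchase", "amount": 19.99, "currency": "GBP"})

print(events.find_one({"type": "purchase"}))
```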
Predictive Analytics: Using collected data to forecast, with associated probabilities, what might happen in the coming weeks, months, or years.
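A toy forecasting sketch with NumPy makes the point: fit a straight-line trend to past monthly sales (made-up figures) and extrapolate one month ahead. Real predictive analytics would also quantify the uncertainty of the forecast.

```python
# Fit a least-squares linear trend and project it one period forward.
import numpy as np

months = np.array([1, 2, 3, 4, 5, 6])
sales  = np.array([100, 110, 123, 129, 141, 150])

slope, intercept = np.polyfit(months, sales, 1)   # linear trend coefficients
print(slope * 7 + intercept)                      # forecast for month 7
```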
R: An open-source programming language primarily used for statistical computing, data visualisation, and predictive analytics.
Stream Processing: The real-time processing of data. The data is processed continuously, concurrently, and record-by-record.
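A pure-Python sketch of the idea: each record is handled as it arrives, updating a running aggregate rather than waiting for a complete batch.

```python
# Record-by-record processing with a running average.
def temperature_stream():
    # stands in for an unbounded source such as a message queue
    for reading in [21.0, 21.5, 22.1, 20.9]:
        yield reading

count, running_total = 0, 0.0
for value in temperature_stream():
    count += 1
    running_total += value
    print(f"reading {count}: value={value}, running average={running_total / count:.2f}")
```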
Structured Data: Anything that can be put into relational databases and organised in such a way that it relates to other data via tables.
Unstructured Data: Data that either does not have a pre-defined data model or is not organised in a pre-defined manner, such as email messages, social media posts, and recorded human speech.
Visualisation: The process of analysing data and expressing it in a readable, graphical format, such as a chart or graph.
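For instance, a minimal sketch with matplotlib (an assumed dependency) turns a small table of made-up figures into a bar chart:

```python
# Plot quarterly revenue as a simple bar chart.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 128, 160]

plt.bar(quarters, revenue)
plt.title("Revenue by quarter")
plt.ylabel("Revenue (£k)")
plt.show()
```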