- Extended Relational
- Non-Relational
- Hybrid
- Consistency
- Batch (slow/cold) vs. real-time streaming (fast/hot) data processing and paths
- Slow/cold path - batch processing
- Batch Processing / Analysis
- Historical Lookup
- Auditing
- Fast/hot path - real-time processing
- Real-time Analytics / Machine Learning Analysis
- Real-time Reporting
- Notifications
- Slow/cold path - batch processing
- Embedded models or interfaces
- API or RPC or REST
- Deployed trained models (offline learning) vs. online learning
- Latency (near real time)
- Reliability and fault tolerance
- Availability
- Scalability/Volume handling
- Performance/speed
- Goals and implementation - Oracle
- Analyze and transform data in real-time
- Optimize data structures for intended use
- Use parallel processing
- Increase hardware and memory
- Database configuration and operations
- Dedicate hardware sandboxes
- Analyze data at rest, in-place
- Goals and implementation - Oracle
- Throughput
- Extensibility
- Security
- Cost/financial
- Data quality
- Skills availability
- Backup and recovery
- Locations and placement
- Privacy and sensitive data
- Disaster recovery
- Schema on read vs schema on write
- Bringing the analytical capabilities to the data, VS
- Bringing the data to the analytical capabilities through staging, extracting, transforming and loading
- Maturity Considerations - Oracle
- Reference architecture
- Development patterns
- Operational processes
- Governance structures and polices
- Governance
- Govern data quality
- Operations, Infrastructure, and DevOps
- Monitoring
- Security and privacy
- Authentication
- Authorization
- Accounting
- Data protection
- Compliance
- Data Aquisition, Ingestion, and Integration
- Messaging and message queues
- ETL/ELT
- Change data capture
- FTP
- API/ODBC
- Replication
- Bulk movement
- Virtualization
- Analytics types and options on ingestion - Oracle
- Sensor-based real-time events
- Near real-time transaction events
- Real-time analytics
- Near real time analytics
- No immediate analytics
- Data Processing
- Batch and stream processing/computing (velocity)
- Massive scaling and processing of multiple concurrent input streams
- Parallel computing platform
- Clusters or grids
- Massively parallel processing (MPP)
- High performance computing (HPC)
- Options - Oracle
- Leave it at the point of capture
- Add minor transformations
- ETL data to analytical platform
- Export data to desktops
- Fast data - Oracle
- Streams
- Events
- Actions
- Batch and stream processing/computing (velocity)
- Data Access
- Querying
- Real-time analytics
- BI analytics
- MapReduce analytics
- Data Modeling and Structure
- Star schema
- Snowflake schema
- Data Analysis, data mining, discovery, simulation, and optimization
- Advanced analytics and modeling
- Text and natural language analytics
- Video and voice analytics
- Geospatial analytics
- Data visualization
- Data mining
- Where to do analysis - Oracle
- At ingest – real time evaluation
- In a raw data reservoir
- In a discovery lab
- In a data warehouse/mart
- In BI reporting tools
- In the public cloud
- On premises
- Data sets
- Data science
- Data discovery
- In-place analytics
- Faceted analytics
- SQL analytics
- Data Storage and Management
- Data lake
- Data warehouse (volume), aka enterprise information store
- Centralized, integrated data store
- Powers BI analytics, reporting, and drives actionable insights
- Responsible for integrating data
- Structured, prepared, and stored data optimized for
- Analytical applications and decision support
- Querying and reporting
- Data mining
- In-database analytics
- Operational analytics
- MPP engine
- 'Deep analytical appliance' - IBM
- Operational data store (ODS)
- Database Systems and DBMS
- Relational (RDBMS)
- NoSQL
- Real-time analytics and insights
- NewSQL
- Hybrid
- Data marts
- Data warehouse extracted data subsets oriented to specific business lines, departments or analytical applications
- Can be a 'live' data mart
- File systems (Non-distributed)
- Distributed file systems (e.g., HDFS) and Hadoop (volume and variety)
- Real-time and MapReduce analytics and insights
- Deep analysis of petabytes of structured and unstructured data
- In-memory
- Data factory
- Data Reservoir
- Dedicated and ad-hoc
- Discovery labs
- Sandboxes
- Data lifecycle management
- Rule-based Data and Policy Tracking
- Data compression
- Data archiving
- Deployment Choice
- On-premise, aka traditional IT
- In-cloud
- Public cloud
- Private cloud
- Appliance
- Managed services
- Presentation, Analytics, and Applications (visibility)
- Browser/web
- Mobile
- Desktop
- Dashboards
- Reports
- Notifications and messaging
- Scorecards
- Charts and graphics
- Visualization and discovery
- Search
- Alerting
- EPM and BI applications
- Recommendations
- Data ingestion
- Optimize the process of loading data in the data store to support time-sensitive analytic goals.
- Search and survey
- Secure federated navigation and discovery across all enterprise content.
- Data transformation
- Convert data values from source system and format to destination system and format.
- Analytics
- Discover and communicate meaningful patterns in data.
- Actionable decisions
- Make repeatable, real-time decisions about organizational policies and business rules.
- Discover and explore
- Discover, navigate, and visualize vast amounts of structured and unstructured information across many enterprise systems and data repositories.
- Reporting, dashboards, and visualizations
- Provide reports, analysis, dashboards, and scorecards to help support the way that people think and work.
- Provisioning
- Deploy and orchestrate on-premises and off-premises components of a big data ecosystem.
- Monitoring and service management
- Conduct end-to-end monitoring of services in the data center and the underlying infrastructure.
- Security and trust
- Detect, prevent, and otherwise address system breaches in the big data ecosystem.
- Collaborate and share
- Analytics-as-a-service
- Consumes both data at rest and in motion
- Applies analytical algorithms
- Provides
- Dashboards
- Reports
- Visualizations
- Insights
- Predictive modeling
- Abstracts away all complexity of data collection, storage, and cleansing
- Data-as-a-service
- Data-at-rest-service
- Data-in-motion-service
- NoSQL tools (Hive, Pig, BigSQL, ...)
- EMR clusters (Hadoop, Cassandra, MongoDB, ...) and Traditional DW
- Big data file system (HDFS, CFS, GPFS, S3, ...)
- Infrastructure & Appliances (Baremetal or IaaS) and object storage
- Designed to be a flexible and a “just-in-time” architecture development approach
- Key Steps
- Establish Business Context and Scope
- Establish an Architecture Vision
- Assess the Current State
- Establish Future State and Economic Model
- Develop a Strategic Roadmap
- Establish Governance over the Architecture
- Staging
- Temporary storage
- Used for cleaning, integration and transformation routines
- Data management
- Long-time managed storage
- Clean and integrated data
- Sandboxing
- Temporary data stores
- Used by people, groups, and departments
- Experimentation with data, processing, and analysis techniques
- Application optimized storage
- Example usage = data mart
- Archive and raw data archive
- Raw, processed, and transformed data
- Volume - Scale of data
- Variety - Different forms of data
- Velocity - Analysis of streaming data
- Veracity - Overall quality and correctness of the data
- Garbage in, garbage out
- Assess the truthfulness and accuracy of the data as well as identify missing or incomplete information
- Visibility/Visualization
- Value
- Variability
- Structured
- Transactions
- Master and reference
- Unstructured
- Text
- Image
- Video
- Audio
- Social
- Semi-structured
- Machine generated
- Data storage (databases)
- Sensors
- Events
- Parquet
- RFID tags
- Instore WiFi logs
- Machine Logs
- Application
- Events
- Server
- CDRs
- Clickstream
- Text, including documents, emails, scanned documents, records, ...
- Social networks
- Public web
- Geo-location/geospatial
- Feeds
- Machine generated
- Clickstream
- Software
- Media
- Images
- Video
- Audio
- Business applications
- OLTP - Online transaction processing
- ERP - Enterprise resource planning
- CRM - Customer relationship management
- SCM - Supply chain management
- HR
- Product/Project management
- Online chat
- Merchant listings
- DMP - Data management platform (advertising/marketing)
- CDR - Call detail records
- Surveys, questionnaires, binary questions, and sentiment
- Billing data
- Product catalog
- Network data
- Subscriber data
- Staffing
- Inventory
- POS and transactional
- eCommerce transactions
- Biometrics
- Mobile devices
- Weather data
- Traffic pattern data
- Mobile devices
- Surveillance
Big Data Architecture Patterns
- Polyglot
- Lambda
- Kappa
- IOT-A
- Message Queue/Stream Processing (MQ/SP) block
- Buffer data
- Processing speed
- Throughput handling of downstream components
- Micro-batching can increase ingestion rate into downstream components
- Process and filter data
- Cleaning and removal
- Stream processing
- Continuous queries
- Aggregates
- Counts
- Real-time machine learning/AI
- Output
- Real-time
- Ingest data into downstream blocks (DB and/or DFS)
- Example technologies
- Buffer data
- Database (DB) block
- Distributed File System (DFS) block
- Message Queue/Stream Processing (MQ/SP) block
- Connected devices
- Support for a variety of workload types
- Stream processing
- Low-latency data queries
- SLAs (depends on application)
- Recovery Point Objective
- Recovery Time Objective
- Availability
- Latency
- Disaster recovery
- Security
- Privacy
- ACLs
- Data encryption and masking
- Data management
- Data modeling and schemas
- Device metadata
- Data streams
- Composed of data records flowing through the system
- Processing and output
- Native and processed raw data support
- Input data typically time-series
- Real-time stream
- Interactive querying
- Output generated in batches
- Data transformations
- Aggregations and computation
- Data integration and enrichment
- Native and processed raw data support
- Data movement and storage
- One or more data stores
- Leverage big data approach
- Scale out techniques and storage on commodity hardware
- Historical data/references (volume)
- Schema-on-read (e.g., data lake)
- Community defined interfaces
- Many different data formats and non-relational sensor data (variety)
- High rate data generation and handling via data streams in IoT context (velocity)
- Scale out techniques and storage on commodity hardware
- Analytics
- APIs/SDKs
- Applications and presentation
Big Data and IoT Tech Stacks
- SACK
- Spark - Digest
- Akka - Ingest
- Cassandra
- Kafka
- SMACK
- Spark
- Mesos
- Akka
- Cassandra
- Kafka
- Built on the shared nothing principle
- Each node is independent and self-sufficient
- Ability to store any and all data types relatively cheap
- Ability to process any and all data quickly and relatively cheap
- Vast community, ecosystem, and pluggable architecture
- Scalable, flexible, computational model
- Batch
- Process batches of data on regular time intervals, e.g., hourly, daily, overnight, etc.
- Aka, MapReduce on Hadoop
- Real-time
- Monitor and react in real time
- Key-value data stores, such as NoSQL, allow for high performance, index-based retrieval - Oracle
- Real-time MapReduce and processing (e.g., Spark)
- Streaming
- Stream ingestion
- Near Real-Time (NRT) Event Processing with External Context
- NRT Event Partitioned Processing
- Complex Topology for Aggregations or ML
- Interactive/ad-hoc querying
- Data analysts reviewing data
- Online
- Search
- In-memory
Coming soon...
- Lambda Architecture
- AWS Architecture Center
- AWS Big Data Partner Solutions
- GCP Architecture
- Introduction to big data classification and architecture
- An Enterprise Architect’s Guide to Big Data
- BIG DATA REFERENCE ARCHITECTURE
- Getting Started with Big Data Architecture
- BIG DATA: Architectures and Technologies
- Big Data Architecture
- Big Data Analytics Architecture
- What is Streaming Data?
- Align Big Data with Specific Business Goals
- Ease Skills Shortage with Standards and Governance
- Optimize Knowledge Transfer with a Center of Excellence
- Top Payoff is Aligning Unstructured with Structured Data
- Plan Your Discovery Lab for Performance
- Align with the Cloud Operating Model
- Accommodate All Forms of Data
- Consistent Information and Object Model
- Integrated Analysis
- Insight to Action
- Organizational Structures & Awareness
- Stewardship
- Policy
- Value Creation
- Data Risk Management & Compliance
- Information Security & Privacy
- Data Architecture
- Data Quality Management
- Classification & Metadata
- Information Lifecycle Management
- Audit Information, Logging & Reporting
Courtesty of Guru99