DataScience_Examples

All about DataScience, DataEngineering and ComputerScience

View the Project on GitHub datainsightat/DataScience_Examples

Batch Data Pipeline

A batch pipeline processes a bounded amount of data and then exits.

EL, ELT

Quality Considerations

Quality Operations BigQuery

Filter to identify and isolate invalid data

Filter invalid data
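
A minimal sketch of this idea in plain Python, not BigQuery itself. The row fields (`order_id`, `quantity`) and the rule that a quantity must be a positive integer are hypothetical. Invalid rows are kept in a separate list rather than dropped, so they can be inspected later. In SQL the same split would usually be a condition in a `WHERE` clause.

```python
from typing import Iterable

Row = dict  # hypothetical row shape: {"order_id": str, "quantity": int or None}


def is_valid(row: Row) -> bool:
    """Hypothetical validity rule: quantity must be a positive integer."""
    quantity = row.get("quantity")
    return (
        isinstance(quantity, int)
        and not isinstance(quantity, bool)  # bool is a subclass of int
        and quantity > 0
    )


def split_rows(rows: Iterable[Row]) -> tuple[list[Row], list[Row]]:
    """Return (valid, invalid) so invalid rows are isolated, not lost."""
    valid, invalid = [], []
    for row in rows:
        (valid if is_valid(row) else invalid).append(row)
    return valid, invalid


rows = [
    {"order_id": "a1", "quantity": 3},
    {"order_id": "a2", "quantity": -1},
    {"order_id": "a3", "quantity": None},
]
valid, invalid = split_rows(rows)
```

Keeping the invalid rows makes it possible to report on them or repair them upstream instead of silently losing data.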

Duplicates
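
A small Python sketch of duplicate detection, assuming a hypothetical business key of `(customer_id, order_date)`. It counts rows per key, which mirrors the usual SQL pattern of `GROUP BY` with `HAVING COUNT(*) > 1`, and then keeps the first row for each key.

```python
from collections import Counter

# Hypothetical rows; the business key is assumed to be (customer_id, order_date).
rows = [
    {"customer_id": 1, "order_date": "2021-05-01", "amount": 10.0},
    {"customer_id": 2, "order_date": "2021-05-01", "amount": 7.5},
    {"customer_id": 1, "order_date": "2021-05-01", "amount": 10.0},
]


def key_of(row):
    """Build the key that decides whether two rows count as duplicates."""
    return (row["customer_id"], row["order_date"])


# Count rows per key and keep the keys that occur more than once.
counts = Counter(key_of(row) for row in rows)
duplicate_keys = {key for key, n in counts.items() if n > 1}

# Keep the first row per key, preserving input order.
seen = set()
deduplicated = []
for row in rows:
    if key_of(row) not in seen:
        seen.add(key_of(row))
        deduplicated.append(row)
```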


Accuracy


Completeness


Missing
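
A sketch of handling missing values in plain Python, where `None` stands in for a SQL `NULL`. The columns and the fill policy are assumptions made for illustration: a missing city is labelled explicitly, while a missing measurement is left as is rather than invented.

```python
# Hypothetical rows where None stands for a missing (NULL) value.
rows = [
    {"city": "Vienna", "temperature": 21.5},
    {"city": None, "temperature": 19.0},
    {"city": "Graz", "temperature": None},
]

columns = ["city", "temperature"]

# Count missing values per column to see how large the problem is.
missing_counts = {
    column: sum(1 for row in rows if row.get(column) is None)
    for column in columns
}

# Assumed policy for this sketch: label a missing city explicitly,
# but leave a missing measurement as None rather than invent a value.
cleaned = [
    {**row, "city": row["city"] if row["city"] is not None else "unknown"}
    for row in rows
]
```

Counting first and filling second keeps the size of the gap visible before any default value hides it.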


Uniform
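
One reading of uniformity is that all values in a column mean the same thing, for example that they share a unit. The sketch below follows that interpretation and converts hypothetical length readings to metres. The supported units and the sample readings are assumptions.

```python
# Conversion factors to metres; the set of supported units is an assumption.
TO_METRES = {"m": 1.0, "km": 1000.0, "cm": 0.01}


def to_metres(value: float, unit: str) -> float:
    """Convert a length to metres, failing loudly on an unknown unit."""
    try:
        return value * TO_METRES[unit.strip().lower()]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None


# Hypothetical readings recorded in mixed units and mixed letter case.
readings = [(1.2, "km"), (350.0, "m"), (80.0, "CM")]
uniform = [to_metres(value, unit) for value, unit in readings]
```

Raising on an unknown unit, instead of passing the value through, avoids mixing unconverted numbers into a column that is supposed to be uniform.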


ETL

When a transformation cannot be expressed in SQL, use Dataflow as the ETL tool and land the data in BigQuery.
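
The order of the three ETL steps can be sketched in plain Python. This is not Dataflow code: the input format, the field names, and the list that stands in for the warehouse table are all hypothetical. The point is only that the transformation runs before the data is loaded.

```python
import csv
import io
import json

# Hypothetical raw input: a CSV export whose "payload" column holds JSON.
RAW = 'user_id,payload\n1,"{""events"": [""click"", ""view""]}"\n2,"{""events"": []}"\n'


def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(record):
    """Transform: decode the nested JSON and derive an event count."""
    events = json.loads(record["payload"])["events"]
    return {"user_id": int(record["user_id"]), "event_count": len(events)}


def load(records, sink):
    """Load: append to a list standing in for the warehouse table."""
    sink.extend(records)


table = []
load([transform(r) for r in extract(RAW)], table)
```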

Architecture

ETL Tools

Quality Data Operations

Look Beyond Dataflow and BigQuery

| Issue | Solution |
|---|---|
| Latency | Dataflow to Bigtable |
| Spark | Dataproc |
| Visual | Cloud Data Fusion |

Dataproc


Cloud Data Fusion


Data Catalog

Metadata as a service.

Labels

Streaming Data Processing

Streaming

| Bounded Data (Batch) | Unbounded Data (Stream) |
|---|---|
| Finite data set | Infinite data set |
| Complete | Never complete |
| Time of element is disregarded | Time of element is significant |
| At rest | In motion |
| Durable storage | Temporary storage |
| Data integration (10 sec - 10 min) | Data decisions (100 ms - 10 sec) |
| Data warehouse real-time | Real-time recommendations |
| | Fraud detection |
| | Gaming events |
| | Finance back office |
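
The difference between the two kinds of data can be sketched in Python. A bounded set can be summed in one pass, while an unbounded stream has no end, so results are emitted per window of event time. The 10-second window, the `(event_time_seconds, value)` event shape, and the assumption that events arrive in event-time order are all simplifications for illustration. Late or out-of-order data is not handled.

```python
from typing import Iterable, Iterator

WINDOW_SECONDS = 10  # assumed window size for this sketch


def batch_total(values: list[int]) -> int:
    """Bounded data: the whole set is available, so one pass gives a final answer."""
    return sum(values)


def windowed_totals(events: Iterable[tuple[int, int]]) -> Iterator[tuple[int, int]]:
    """Unbounded data: group (event_time_seconds, value) pairs into fixed windows.

    Assumes events arrive in event-time order; a window is emitted once an
    event from a later window shows up.
    """
    current_window = None
    total = 0
    for event_time, value in events:
        window_start = event_time - event_time % WINDOW_SECONDS
        if current_window is not None and window_start != current_window:
            yield current_window, total
            total = 0
        current_window = window_start
        total += value
    if current_window is not None:
        yield current_window, total  # only reached if the stream ends


# A short finite list stands in for the stream so the sketch terminates.
events = [(1, 5), (4, 2), (12, 7), (19, 1), (25, 3)]
results = list(windowed_totals(events))
```

Because `windowed_totals` is a generator, it can yield a result for each closed window without ever seeing the end of its input, which is why the element's timestamp matters for streams and not for batches.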

3Vs

Products

Pipeline