DataScience_Examples

All about DataScience, DataEngineering and ComputerScience

View the Project on GitHub datainsightat/DataScience_Examples

Data Engineer Exam

Data Engineer Exam Guide
Google Cloud Documentation
Medium Blog
Practice Exam

Resources

Storage and Database Documentation

Topic Description Link
Disks   https://cloud.google.com/compute/docs/disks/ https://cloud.google.com/bigtable/docs/choosing-ssd-hdd
Cloud Storage World-wide storage and retrieval of any amount of data at any time https://cloud.google.com/storage/docs/
Cloud Memorystore Fully managed in-memory data store service. https://cloud.google.com/memorystore/docs/redis/
Cloud SQL MySQL and PostgreSQL database service https://cloud.google.com/sql/docs/
Datastore NoSQL document and database service https://cloud.google.com/datastore/docs/
Firestore Store mobile and web app data at global scale https://cloud.google.com/firestore/docs
Firebase Realtime Database Store and sync data in real time https://firebase.google.com/docs/database/
Cloud BigTable NoSQL wide-column database service. https://cloud.google.com/bigtable/docs/
Cloud Spanner Mission-critical, scalable, relational database service https://cloud.google.com/spanner/docs/

Data Analytics

Topic Description Link
BigQuery A fully managed, highly scalable data warehouse with built-in ML https://cloud.google.com/bigquery/docs/
Dataproc Managed Spark and Hadoop service https://cloud.google.com/dataproc/docs/
Dataflow Real-time batch and stream data processing https://cloud.google.com/dataflow/docs/
Datalab Explore, analyze and visualize large datasets https://cloud.google.com/datalab/docs/
Dataprep by Trifacta Cloud data service to explore, clean and prepare data for analysis https://cloud.google.com/dataprep/docs/
Pub/Sub Ingest event streams from anywhere at any scale https://cloud.google.com/pubsub/docs/
Google Data Studio Tell great data stories to support better business decisions https://marketingplatform.google.com/about/data-studio/
Cloud Composer A fully managed workflow orchestration service built on Apache Airflow https://cloud.google.com/composer/docs/

Machine Learning Documentation

Topic Description Link
AI Platform Build superior models and deploy them into production https://cloud.google.com/ml-engine/docs
Cloud TPU Train and run ML models faster than ever https://cloud.google.com/tpu/docs/
AutoML Easily train high-quality, custom ML models https://cloud.google.com/automl/docs/
Cloud Natural Language API Derive insights from unstructured text  
Speech-to-Text Speech-to-text conversion powered by ML  
Cloud Translation Dynamically translate between languages  
Text-to-Speech Text-to-speech conversion powered by ML  
Dialogflow Enterprise Edition Create conversational experiences across devices and platforms  
Cloud Vision Derive insight from images powered by ML  
Video Intelligence Extract metadata from videos  

Infrastructure Documentation

Topic Description Link
Google Cloud’s operations suite (Stackdriver) Monitoring and management for services, containers, applications and infrastructure  
Cloud Monitoring Monitoring for applications on Google Cloud and AWS  
Cloud Logging Logging for applications on Google Cloud and AWS  
Error Reporting Identifies and helps you understand application errors  
Cloud Trace Find performance bottlenecks in production  
Cloud Debugger Investigate code behaviour in production  
Cloud Profiler Continuous CPU and heap profiling to improve performance and reduce costs  
Transparent Service Level Indicators Monitor Google Cloud services and their effects on your workloads  
Cloud Deployment Manager Manage cloud resources with simple templates  
Cloud Console Google Cloud’s integrated management console  
Cloud Shell Command-line management from any browser  

Data Processing Systems

Designing and building

Data Processing Anatomy

A view of data engineering on Google Cloud

Data Engineering

Storage and Databases

Processing

Data Processing Services

Combines storage and compute.

Services

Data Abstractions

Service Data Abstraction Compute Abstraction
Dataproc, Spark RDD DAG
BigQuery Table Query
Dataflow PCollection Pipeline

Artificial Intelligence

AI

Pre- and Postprocessing Services

PrePost Processing

Infrastructure Services

Design Flexible Data Representations

Storage Type Stored in
Cloud Storage Object Bucket
Datastore Property Entity > Kind
Cloud SQL Values Rows and Columns > Table > Database
Cloud Spanner Values Rows and Columns > Tables > Database

Data in Files and Data in Transit

Data Types

Standard SQL Data Types

Data type Value
string variable-length (unicode) character
int64 64-bit integer
float64 Double-precision decimal values
bool true or false
array ordered list of zero or more elements
struct container of ordered fields
timestamp represents an absolute point in time

BigQuery Datasets, Tables and Jobs

BigQuery is Columnar

Columnar Storage

Spark hides Complexity in RDDs

RDDs hide complexity and allow Spark to make decisions on your behalf. They manage location, partitioning, replication, recovery, pipelining …

RDD

Dataflow

PCollections

Batch and stream processing

Batch and Stream

Bounded vs. unbounded data. Dataflow uses windows to process streaming data. PCollections are not compatible with RDDs.
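
The idea of cutting an unbounded stream into windows can be illustrated with a minimal sketch in plain Python. This is not the Dataflow or Beam API; the function name assign_fixed_windows and the sample events are hypothetical, and only fixed (tumbling) windows based on event time are shown.

```python
from collections import defaultdict


def assign_fixed_windows(events, window_size):
    """Group (timestamp, value) events into fixed windows of window_size seconds."""
    windows = defaultdict(list)
    for timestamp, value in events:
        # Every event belongs to the window that starts at the last multiple
        # of window_size before (or at) its event time.
        window_start = (timestamp // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)


# Hypothetical events: (event time in seconds, value). Arrival order is not sorted.
events = [(3, "a"), (61, "b"), (59, "c"), (125, "d")]
windowed = assign_fixed_windows(events, window_size=60)
```

Each window is a bounded chunk of the stream, so it can be aggregated like batch data even though the stream itself never ends.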

Tensorflow

Open-source library for machine learning.

Tensor

Design Data Pipelines

Dataproc

Spark

Spark uses a DAG to process data. It executes commands only when told to do so: "lazy evaluation" (the opposite is "eager execution").
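
Lazy evaluation can be sketched in plain Python with a generator. This is only an analogy and not Spark code; lazy_map, square and the calls list are hypothetical names used for illustration.

```python
def lazy_map(fn, items):
    """Return a generator: fn is not called until the result is consumed."""
    return (fn(x) for x in items)


calls = []


def square(x):
    calls.append(x)  # record that work actually happened
    return x * x


pipeline = lazy_map(square, [1, 2, 3])  # "transformation": only describes the work
calls_before_action = list(calls)       # snapshot taken before any result is requested
result = list(pipeline)                 # "action": triggers the computation
```

Because nothing runs until a result is requested, the engine sees the whole chain of steps first and can plan how to execute it.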

Dataproc can augment BigQuery

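# Assumption: the garbled call in the source was pandas-gbq's read_gbq.
import pandas_gbq

projectId = "<your-project-id>"  # replace with your own project ID

sql = """
  select
    n.year,
    n.month,
    n.day,
    n.weight_pounds
  from
    `bigquery-public-data.samples.natality` as n
  order by
    n.year
  limit 50"""

print("Running query ...")
# Run the query in BigQuery and load the result into a pandas DataFrame.
data = pandas_gbq.read_gbq(sql, project_id=projectId)
data[:5]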

Extract data from BigQuery using Dataproc and let Spark do the analysis.

Open Source Software

Open Source

Initialization Actions

Cluster Properties

Dataflow

Pipelines

Dataflow Pipeline

Operations

Operation Action
ParDo Allows for parallel processing
Map 1:1 relationship between input and output in Python
FlatMap Non 1:1 relationships
.apply(ParDo) Java for Map and FlatMap
GroupBy Shuffle
GroupByKey Explicit Shuffle
Combine Aggregate values

Pipelines are often organized in Map and Reduce sequences.
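
A Map and Reduce sequence built from the operations in the table above can be sketched in plain Python. This is not the Beam SDK; it only mirrors what each step does on a small, hypothetical list of lines (a word count).

```python
from collections import defaultdict

lines = ["to be or not", "to be"]

# FlatMap: one input line yields zero or more words.
words = [word for line in lines for word in line.split()]

# Map: exactly one output element per input element.
pairs = [(word, 1) for word in words]

# GroupByKey: the shuffle step, collecting all values that share a key.
grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)

# Combine: aggregate the values of each key.
counts = {key: sum(values) for key, values in grouped.items()}
```

In a distributed runner the Map and FlatMap steps run in parallel on separate workers, while the grouping step is the one that moves data between them.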

Templates

Separation of work and better resource allocation.

BigQuery

Solutions

Separating compute and storage enables serverless execution.

Design Data Processing Infrastructure

Data Ingestion

Ingestion

Load data into BigQuery

Load data

PubSub

Pub/Sub holds messages up to 7 days.

PubSub

Ingest Processing Analysis
Pub/Sub Dataflow BigQuery

Exam Guide Review

Storage

Selecting the appropriate storage technologies

Be familiar with the common use cases and qualities of the different storage options. Each storage system or database is optimized for different things - some are best at automatically updating the data for transactions. Some are optimized for speed of data retrieval but not for updates or changes. Some are very fast and inexpensive for simple retrieval but slow for complex queries.

Pipelines

Designing data pipelines

An important element in designing a data processing pipeline is selecting the appropriate service or collection of services.
AI Platform Notebooks, Google Data Studio and BigQuery all have interactive interfaces. Do you know when to use each?

Processing Infrastructure

Designing a data processing solution

Pub/Sub and Dataflow together provide once, in-order processing of possibly delayed or repeated streaming data.
Be familiar with the common assemblies of services and how they are often used together: Dataflow, Dataproc, BigQuery, Cloud Storage and Pub/Sub.

Migration

Migrating data warehousing and data processing

Technologically, Dataproc is superior to open-source Hadoop, and Dataflow is superior to Dataproc. However, this does not mean that the most advanced technology is always the best solution. You need to consider the business requirements. The client might want to first migrate from the data center to the cloud and validate that everything is working. Only once they are confident in that solution should they consider improving and modernizing.

Building and Operationalizing Data Processing Systems

Storage Options

Building Data Processing Systems

Flowchart

ACID - Consistency

BASE - Availability

In Cloud Datastore, just two APIs provide a strongly consistent view for reading entity values and indexes: lookup by key and ancestor query.

Storage Options

Cloud Storage

Cluster node options

Access

Features

Best Practices

Cloud SQL

Familiar

Not Supported

Flexible pricing

Connect from anywhere

Fast

Google Security

Cloud Bigtable

Properties

Important Features

Cloud Spanner

Properties

Important features

Cloud Spanner

Datastore

Properties

Important features

Building and maintaining Pipelines

Apache Beam

Size: Autoscaling and rebalancing handle variable volumes of data and growth.
Scalability and fault tolerance: On-demand distribution of processing scales with fault tolerance.
Programming model: Efficient pipelines + efficient execution.
Unbounded data: Windowing, triggering, incremental processing and out-of-order data are addressed in the streaming model.

Dataflow Windowing for streams

Windowing

Side Inputs in Dataflow

Side Inputs

Building a Streaming Pipeline

Exam

Scaling Beyond BigQuery

BigQuery: Easy, inexpensive; latency in the order of seconds; 100k rows/second streaming.
Cloud Bigtable: Low latency, high throughput; 100k QPS at 6 ms latency in a 10-node cluster.

Analyze Data and enable Machine Learning

Analyze Data

Pretrained Models

Exam

The three modes of the natural language API are: Sentiment, Entity and Syntax

Notebooks, Datalab

Exam

Cloud ML

Exam

Exam

Machine Learning

Exam

Step Tool
Collect data Logging API, Pub/Sub
Organize data BigQuery, Dataflow, ML Preprocessing SDK
Create Model Tensorflow
Train, Deploy Cloud ML

Tensorflow

High-performance library for numerical computation. TensorFlow programs are written, for example, in Python and build a directed graph (DG) that is evaluated lazily (it can also be run in eager mode).

Exam

Mathematical information is transported from node to node.

Exam

TF Methods

https://www.tensorflow.org/api_docs/python/tf/keras/losses

Method Description
tf.layers A layer is a class implementing common neural network operations, such as convolution, batch norm, etc. These operations require managing variables, losses and updates, as well as applying TensorFlow ops to input tensors
tf.losses Loss Function
tf.metrics General Metrics about the TF performance

Unstructured Data

Exam

Task Solution
Real-time insight into supply chain operations. Which partner is causing issues? Human
Drive product decisions. How do people really use feature x? Human
Did error rates decrease after the bug fix was applied? Easy counting problems > Big Data
Which stores are experiencing long delays in payment processing? Easy counting problems > Big Data
Are programmers checking in low-quality code? Harder counting problems > ML
Which stores are experiencing a lack of parking space? Harder counting problems > ML

Exam

Supervised Learning

Labels

Exam

Regression, Classification

Regression and Classification models are supervised ML methods.

Exam

Structured data is a great source for machine learning models, because it is already labeled.

Regression problems predict continuous values, like prices, whereas classification problems predict categorical values, like colours.

Measure Loss

Mean Squared Error (MSE)

Exam

Root Mean Squared Error (RMSE)

The square root of MSE. The measure is in the same unit as the predicted value and is therefore easier to interpret.
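
As a small worked example, MSE and RMSE can be computed in plain Python. The sample values y_true and y_pred are made up for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of MSE, expressed in the same unit as the labels."""
    return math.sqrt(mse(y_true, y_pred))


# Hypothetical labels and predictions, e.g. prices.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

mse_value = mse(y_true, y_pred)    # (1 + 0 + 4) / 3
rmse_value = rmse(y_true, y_pred)  # sqrt of the value above
```

If the labels are prices in dollars, MSE is in squared dollars while RMSE is back in dollars, which is why RMSE is usually the one reported.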

XEntropy

Cross-entropy (Xentropy) as the loss is a hint that the problem is a classification problem.
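
A minimal sketch of binary cross-entropy in plain Python shows why it suits classification: it scores predicted probabilities against 0/1 labels. The function name, the clipping constant eps and the sample values are assumptions made for illustration.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Average negative log-likelihood of 0/1 labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical predictions for two examples with labels 1 and 0.
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
```

Confident wrong predictions are penalized much more heavily than confident correct ones, so the loss grows quickly as the predicted probability moves away from the true class.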

Gradient Descent

Exam

Turn ML problem into search problem.

Recompute Error after Batch of examples
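
Recomputing the error after each batch can be sketched as mini-batch gradient descent in plain Python. This is a toy example under stated assumptions: a one-parameter model y = w * x, MSE as the loss, and hypothetical values for learning_rate, batch_size and epochs.

```python
def train(xs, ys, learning_rate=0.05, batch_size=2, epochs=200):
    """Fit y = w * x by mini-batch gradient descent on the MSE loss."""
    w = 0.0
    for _ in range(epochs):
        for start in range(0, len(xs), batch_size):
            batch_x = xs[start:start + batch_size]
            batch_y = ys[start:start + batch_size]
            # Gradient of the batch MSE with respect to w.
            grad = sum(2 * (w * x - y) * x
                       for x, y in zip(batch_x, batch_y)) / len(batch_x)
            # Step against the gradient; the error is re-evaluated on the next batch.
            w -= learning_rate * grad
    return w


# Hypothetical data generated from y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = train(xs, ys)
```

The search for the best w is driven only by the gradient of the loss on each batch, which is what turns the ML problem into a search problem.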

Exam

Training and Validating

When is a Model sufficiently good?

Exam

Training vs Evaluation Data

Exam

Validation Techniques
Data Validation
Scarce Independent Test Data, Cross Validate

Cross Validation
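
Cross validation can be sketched as a k-fold split in plain Python: every example is used for validation exactly once and for training in the other folds. The function name k_fold_splits and the sizes used are hypothetical.

```python
def k_fold_splits(n_examples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_examples))
    fold_size, remainder = divmod(n_examples, k)
    start = 0
    for fold in range(k):
        # The first `remainder` folds get one extra example.
        end = start + fold_size + (1 if fold < remainder else 0)
        validation = indices[start:end]
        training = indices[:start] + indices[end:]
        yield training, validation
        start = end


splits = list(k_fold_splits(n_examples=6, k=3))
```

A model would be trained once per fold and the validation scores averaged, which makes better use of scarce data than holding out a single fixed test set.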

Modeling Business Processes for Analysis and Optimization

Confusion Matrix

Exam

Exam

Build, Buy or Modify > Business Priorities

Exam

AutoML > Use an existing ML Model and tailor it to your specific needs.

Build Effective ML

Big Data > Feature Engineering > Model Architectures

Make ML Pipeline Robust

Feature Engineering

Good features bring human insight to a problem

Choosing good features

Feature Engineering Process

Other Important Concepts

Exam

Performance

Schema

Exam

Performance (De-Normalize) vs Efficiency (Normalize).

Nested Schemas

Exam

Key Elements of Performance

Query Explanation Map

Exam

Partitioning

Time-partitioned tables are a cost-effective way to manage data. Be careful when selecting training data from time-partitioned tables (shuffle data among time slots).

Order of Operations can influence shuffling overhead

Exam

Windowing

Use windowing to manage streaming data.

Performance BigTable

Cloud BigTable separates processing and storage.

Exam

Row key Design

Exam

Datasets get shuffled among different tablets, which enables parallel processing.

Growing a BigTable Cluster

Pricing

Storage: Amount of data in table; ingest rate of streaming data; automatic discount for old data.
Processing: On-demand or flat-rate; on-demand is based on the amount of data processed; 1 TB/month free; you have to opt in to run high-compute queries.
Free: Loading; exporting; queries on metadata; cached queries; queries with errors.

Designing for Security and Compliance

Privacy, authorization and authentication. Identity and access management. Intrusion detection, attack mitigation, resilience and recovery. Granularity of control (table, row, column, service).

Identity and Access

Cloud IAM Resource Hierarchy

Exam

Folders

Additional grouping mechanism and isolation boundaries between projects

Folders allow delegation of administration rights

Encryption

Default Encryption: Data is automatically encrypted before being written to disk. Each encryption key is itself encrypted with a set of root keys.
Customer-Managed Encryption Keys (CMEK): The Google-generated data encryption key (DEK) is still used. Allows you to create, use, and revoke the key encryption key (KEK). Uses Cloud Key Management Service (Cloud KMS).
Customer-Supplied Encryption Keys (CSEK): Keep keys on premises, and use them to encrypt your cloud services. Google can't recover them. Disk encryption on VMs and Cloud Storage encryption. Keys are never stored on disk unencrypted. You provide your key at each operation, and Google purges it from its servers when each operation completes.
Client-Side Encryption: Data is encrypted before it is sent to the cloud. Your keys; your tools. Google doesn't know whether your data is encrypted before it's uploaded. No way to recover keys. If you lose your keys, remember to delete the objects!

Cloud Security

Exam

Performing Quality Control

Service-specific monitoring is available (like TensorBoard). Assessing, troubleshooting and improving data representations and data processing infrastructure are distributed through all the technologies. Advocating policies and publishing data and reports are not just technical skills.

Monitoring

TensorBoard

Exam

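The Estimator API comes with a function, train_and_evaluate, that handles distributed training and evaluation

import tensorflow as tf  # the tf.estimator API is not available in recent TensorFlow releases

# output_dir and feature_cols are assumed to be defined elsewhere.
estimator = tf.estimator.LinearRegressor(
                         model_dir=output_dir,
                         feature_columns=feature_cols)
                         
...

# train_spec and eval_spec are assumed to be built in the elided part above.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)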

Visualize TF.

Ensure Reliability

Service produces consistent outputs and works as expected. Available vs Durable (Data loss).

Distributing for Scale may improve reliability

Exam

Data Visualization and Reporting Tools

Google Data Studio

Exam