All about Data Science, Data Engineering and Computer Science
Data Engineer Exam Guide
Google Cloud Documentation
Medium Blog
Practice Exam
Topic | Description | Link |
---|---|---|
Disks | Persistent disk options; choosing between SSD and HDD for Bigtable | https://cloud.google.com/compute/docs/disks/ https://cloud.google.com/bigtable/docs/choosing-ssd-hdd |
Cloud Storage | World-wide storage and retrieval of any amount of data at any time | https://cloud.google.com/storage/docs/ |
Cloud Memorystore | Fully managed in-memory data store service | https://cloud.google.com/memorystore/docs/redis/ |
Cloud SQL | MySQL and PostgreSQL database service | https://cloud.google.com/sql/docs/ |
Datastore | NoSQL document database service | https://cloud.google.com/datastore/docs/ |
Firestore | Store mobile and web app data at global scale | https://cloud.google.com/firestore/docs |
Firebase Realtime Database | Store and sync data in real time | https://firebase.google.com/docs/database/ |
Cloud Bigtable | NoSQL wide-column database service | https://cloud.google.com/bigtable/docs/ |
Cloud Spanner | Mission-critical, scalable, relational database service | https://cloud.google.com/spanner/docs/ |
Topic | Description | Link |
---|---|---|
BigQuery | A fully managed, highly scalable data warehouse with built-in ML | https://cloud.google.com/bigquery/docs/ |
Dataproc | Managed Spark and Hadoop service | https://cloud.google.com/dataproc/docs/ |
Dataflow | Real-time batch and stream data processing | https://cloud.google.com/dataflow/docs/ |
Datalab | Explore, analyze and visualize large datasets | https://cloud.google.com/datalab/docs/ |
Dataprep by Trifacta | Cloud data service to explore, clean and prepare data for analysis | https://cloud.google.com/dataprep/docs/ |
Pub/Sub | Ingest event streams from anywhere at any scale | https://cloud.google.com/pubsub/docs/ |
Google Data Studio | Tell great data stories to support better business decisions | https://marketingplatform.google.com/about/data-studio/ |
Cloud Composer | A fully managed workflow orchestration service built on Apache Airflow | https://cloud.google.com/composer/docs/ |
Topic | Description | Link |
---|---|---|
AI Platform | Build superior models and deploy them into production | https://cloud.google.com/ml-engine/docs |
Cloud TPU | Train and run ML models faster than ever | https://cloud.google.com/tpu/docs/ |
AutoML | Easily train high-quality, custom ML models | https://cloud.google.com/automl/docs/ |
Cloud Natural Language API | Derive insights from unstructured text | |
Speech-to-Text | Speech-to-text conversion powered by ML | |
Cloud Translation | Dynamically translate between languages | |
Text-to-Speech | Text-to-speech conversion powered by ML | |
Dialogflow Enterprise Edition | Create conversational experiences across devices and platforms | |
Cloud Vision | Derive insight from images powered by ML | |
Video Intelligence | Extract metadata from videos | |
Topic | Description | Link |
---|---|---|
Google Cloud’s operations suite (Stackdriver) | Monitoring and management for services, containers, applications and infrastructure | |
Cloud Monitoring | Monitoring for applications on Google Cloud and AWS | |
Cloud Logging | Logging for applications on Google Cloud and AWS | |
Error Reporting | Identifies and helps you understand application errors | |
Cloud Trace | Find performance bottlenecks in production | |
Cloud Debugger | Investigate code behaviour in production | |
Cloud Profiler | Continuous CPU and heap profiling to improve performance and reduce costs | |
Transparent Service Level Indicators | Monitor Google Cloud services and their effects on your workloads | |
Cloud Deployment Manager | Manage cloud resources with simple templates | |
Cloud Console | Google Cloud’s integrated management console | |
Cloud Shell | Command-line management from any browser | |
These services combine storage and compute.
Service | Data Abstraction | Compute Abstraction |
---|---|---|
Dataproc, Spark | RDD | DAG |
BigQuery | Table | Query |
Dataflow | PCollection | Pipeline |
Storage | Type | Stored in |
---|---|---|
Cloud Storage | Object | Bucket |
Datastore | Property | Entity > Kind |
Cloud SQL | Values | Rows and Columns > Table > Database |
Cloud Spanner | Values | Rows and Columns > Tables > Database |
Data type | Value |
---|---|
string | Variable-length (Unicode) character data |
int64 | 64-bit integer |
float64 | Double-precision floating-point values |
bool | true or false |
array | Ordered list of zero or more elements |
struct | Container of ordered fields |
timestamp | Represents an absolute point in time |
RDDs hide complexity and allow Spark to make decisions on your behalf. Spark manages location, partitioning, replication, recovery, pipelining and more.
Bounded vs. unbounded data: Dataflow uses windows to process streaming data. PCollections are not compatible with RDDs.
Open-source code for machine learning.
Spark uses a DAG to process data. It executes commands only when told to do so ("lazy evaluation", the opposite of "eager execution").
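Lazy evaluation can be illustrated without a cluster using plain Python generators. This is only an analogy of Spark's behaviour, not the Spark API: the "transformations" are recorded but nothing runs until an "action" forces execution.

```python
# Analogy for Spark's lazy evaluation using Python generators (not the Spark API).
data = range(1, 6)

# "Transformations" are declared lazily - no element is processed here.
doubled = (x * 2 for x in data)
filtered = (x for x in doubled if x > 4)

# Only an "action" (here: list()) forces the whole chain to execute.
result = list(filtered)
print(result)  # [6, 8, 10]
```

In Spark the same idea lets the engine see the whole DAG before executing, so it can optimize and pipeline the work.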
```python
import pandas_gbq

project_id = "<your-project-id>"

sql = """
select
  n.year,
  n.month,
  n.day,
  n.weight_pounds
from
  `bigquery-public-data.samples.natality` as n
order by
  n.year
limit 50
"""

print("Running query ...")
data = pandas_gbq.read_gbq(sql, project_id=project_id)
data[:5]
```
Extract data from BigQuery using Dataproc and let Spark do the analysis.
Operation | Action |
---|---|
ParDo | Allows for parallel processing |
Map | 1:1 relationship between input and output in Python |
FlatMap | Non 1:1 relationships |
.apply(ParDo) | Java for Map and FlatMap |
GroupBy | Shuffle |
GroupByKey | Explicit Shuffle |
Combine | Aggregate values |
Pipelines are often organized in Map and Reduce sequences.
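The semantics of the operations above can be sketched in plain Python (this is not the Beam SDK itself, just an illustration of Map, FlatMap, GroupByKey and Combine on a toy word count):

```python
# Plain-Python sketch of Beam-style operations (illustrative, not the Beam SDK).
lines = ["a b", "c", "a c"]

# Map: 1:1 - each input element produces exactly one output element.
lengths = [len(line) for line in lines]              # [3, 1, 3]

# FlatMap: non-1:1 - each input element may produce zero or more outputs.
words = [w for line in lines for w in line.split()]  # ['a', 'b', 'c', 'a', 'c']

# GroupByKey: explicit shuffle - collect values per key.
grouped = {}
for w in words:
    grouped.setdefault(w, []).append(1)

# Combine: aggregate the grouped values (here, a word count).
counts = {k: sum(v) for k, v in grouped.items()}     # {'a': 2, 'b': 1, 'c': 2}
```

In a real pipeline the shuffle is distributed across workers, which is why GroupByKey is the expensive step.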
Separation of work enables better resource allocation.
Separate compute and storage enables serverless execution.
Pub/Sub holds messages up to 7 days.
Ingest | Processing | Analysis |
---|---|---|
Pub/Sub | Dataflow | BigQuery |
Selecting the appropriate storage technologies
Be familiar with the common use cases and qualities of the different storage options. Each storage system or database is optimized for different things - some are best at automatically updating the data for transactions. Some are optimized for speed of data retrieval but not for updates or changes. Some are very fast and inexpensive for simple retrieval but slow for complex queries.
Designing data pipelines
An important element in designing the data processing pipeline starts with selecting the appropriate service or collection of services.
AI Platform Notebooks, Google Data Studio and BigQuery all have interactive interfaces. Do you know when to use each?
Designing a data processing solution
Pub/Sub and Dataflow together provide exactly-once, in-order processing of possibly delayed or repeated streaming data.
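Conceptually, what Dataflow adds on top of Pub/Sub's at-least-once delivery is deduplication and event-time reordering. A hypothetical sketch (message ids, timestamps and values are invented for illustration):

```python
# Sketch of dedup + reordering of Pub/Sub-style messages (illustrative only).
messages = [
    {"id": "m2", "ts": 2, "value": "b"},
    {"id": "m1", "ts": 1, "value": "a"},
    {"id": "m2", "ts": 2, "value": "b"},  # Pub/Sub may redeliver a message
]

seen = set()
deduped = []
for msg in messages:
    if msg["id"] not in seen:   # drop repeated deliveries by message id
        seen.add(msg["id"])
        deduped.append(msg)

# Restore event-time order using the message timestamp.
ordered = sorted(deduped, key=lambda m: m["ts"])
print([m["value"] for m in ordered])  # ['a', 'b']
```

Dataflow does this for you within windows, which is why the pair is the standard streaming ingest pattern.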
Be familiar with the common assemblies of services and how they are often used together: Dataflow, Dataproc, BigQuery, Cloud Storage and Pub/Sub.
Migrating data warehousing and data processing
Technologically, Dataproc is superior to open-source Hadoop, and Dataflow is superior to Dataproc. However, the most advanced technology is not always the best solution; you need to consider the business requirements. The client might want to first migrate from the data center to the cloud, make sure everything is working (validate it), and only once they are confident with that solution consider improving and modernizing.
ACID - Consistency
BASE - Availability
In Cloud Datastore, only two APIs provide a strongly consistent view for reading entity values and indexes: lookup by key and ancestor queries.
Cluster node options
Access
Features
Best Practices
Familiar
Not Supported
Flexible pricing
Connect from anywhere
Fast
Google Security
Size | Scalability and Fault-tolerance | Programming Model | Unbounded Data |
---|---|---|---|
Autoscaling and rebalancing handle variable volumes of data and growth | On-demand provisioning and distribution of processing scale with fault tolerance | Efficient pipelines + efficient execution | Windowing, triggering, incremental processing and out-of-order data are addressed in the streaming model. |
BigQuery | Cloud Bigtable |
---|---|
Easy, inexpensive | Low latency, high throughput |
Latency in the order of seconds | 100k QPS at 6 ms latency in a 10-node cluster |
100k rows/second streaming | |
The three modes of the Natural Language API are: Sentiment, Entity and Syntax analysis.
Step | Tool |
---|---|
Collect data | Logging API, Pub/Sub |
Organize data | BigQuery, Dataflow, ML Preprocessing SDK |
Create Model | Tensorflow |
Train, Deploy | Cloud ML |
High-performance library for numerical computation. TensorFlow code is written, for example, in Python using DAGs (Directed Acyclic Graphs) => lazy evaluation (can also be run in eager mode).
Mathematical information is transported from node to node.
https://www.tensorflow.org/api_docs/python/tf/keras/losses
Method | Description |
---|---|
tf.layers | A layer is a class implementing common neural network operations, such as convolution, batch norm, etc. These operations require managing variables, losses and updates, as well as applying TensorFlow ops to input tensors |
tf.losses | Loss Function |
tf.metrics | General Metrics about the TF performance |
Task | Solution |
---|---|
Real-time insight into supply chain operations. Which partner is causing issues? | Human |
Drive product decisions. How do people really use feature x? | Human |
Did error rates decrease after the bug fix was applied? | Easy counting problems > Big Data |
Which stores are experiencing long delays in payment processing? | Easy counting problems > Big Data |
Are programmers checking in low-quality code? | Harder counting problems > ML |
Which stores are experiencing a lack of parking space? | Harder counting problems > ML |
Labels
Regression and Classification models are supervised ML methods.
Structured data is a great source for machine learning models, because it is already labeled.
Regression problems predict continuous values, like prices, whereas classification problems predict categorical values, like colours.
RMSE: The square root of MSE. The measure is in the units of the target and is therefore easier to interpret.
Cross-entropy (Xentropy) is a hint for a classification problem.
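Both metrics are a few lines of arithmetic. A minimal sketch with invented toy values (not from any real model), showing RMSE for a regression and binary cross-entropy for a classifier:

```python
import math

# Toy regression targets and predictions (illustrative values).
actual = [3.0, 5.0, 2.0]
predicted = [2.5, 5.0, 4.0]

# RMSE: square root of the mean squared error, in the units of the target.
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
rmse = math.sqrt(mse)

# Binary cross-entropy: penalizes confident wrong predictions heavily.
labels = [1, 0]          # true classes
probs = [0.9, 0.2]       # predicted probability of class 1
xent = -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
            for y, p in zip(labels, probs)) / len(labels)
```

Note the asymmetry: RMSE grows smoothly with the error, while cross-entropy diverges as a wrong prediction approaches certainty.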
Turn the ML problem into a search problem.
Data | Validation |
---|---|
Scarce | Independent test data, cross-validation |
Cross Validation
AutoML > Use an existing ML Model and tailor it to your specific needs.
Big Data > Feature Engineering > Model Architectures
Good features bring human insight to a problem.
Performance (De-Normalize) vs Efficiency (Normalize).
Time-partitioned tables are a cost-effective way to manage data. Be careful when selecting training data from time-partitioned tables (shuffle data among time slots).
Use windowing to manage streaming data.
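A fixed (tumbling) window simply assigns each event to a time bucket by its timestamp. A minimal sketch, assuming a 60-second window size and invented events:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed fixed-window size

# (event_time_seconds, value) pairs, possibly arriving out of order.
events = [(5, 1), (70, 2), (30, 3), (130, 4)]

windows = defaultdict(list)
for ts, value in events:
    # Assign each event to the window containing its timestamp.
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(value)

# Aggregate per window, e.g. a sum.
totals = {start: sum(vals) for start, vals in sorted(windows.items())}
print(totals)  # {0: 4, 60: 2, 120: 4}
```

Dataflow's windowing adds triggers and watermarks on top of this idea to decide when a window's result can be emitted.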
Cloud Bigtable separates processing and storage.
Datasets get shuffled among different tablets, which enables parallel processing.
Storage | Processing | Free |
---|---|---|
Amount of data in table | On-demand or Flat-rate | Loading |
Ingest rate of streaming data | On-demand based on amount of data processed | Exporting |
Automatic discount for old data | 1 TB/month free | Queries on metadata |
| Opt-in to run high-compute queries | Cached queries |
| | Queries with errors |
Privacy, authorization and authentication. Identity and access management. Intrusion detection, attack mitigation, resilience and recovery. Granularity of control (table, row, column, service).
Additional grouping mechanism and isolation boundaries between projects.
Folders allow delegation of administration rights
Default Encryption | Customer-Managed Encryption Keys (CMEK) | Customer-Supplied Encryption Keys (CSEK) | Client-Side Encryption |
---|---|---|---|
Data is automatically encrypted before being written to disk | A Google-generated data encryption key (DEK) is still used | Keep keys on premises, and use them to encrypt your cloud services | Data is encrypted before it is sent to the cloud |
Each encryption key is itself encrypted with a set of root keys | Allows you to create, use, and revoke the key encryption key (KEK) | Google can't recover them | Your keys; your tools |
Uses Cloud Key Management Service (Cloud KMS) | Disk encryption on VMs; Cloud Storage encryption | Keys are never stored on disk unencrypted | Google doesn't know whether your data is encrypted before it's uploaded |
| | You provide your key at each operation, and Google purges it from its servers when each operation completes | |
| | If you lose your keys, remember to delete the objects! | |
Service-specific monitoring is available (like TensorBoard). Assessing, troubleshooting and improving data representations and data processing infrastructure are tasks distributed across all the technologies. Advocating policies and publishing data and reports are not just technical skills.
Estimator comes with a method that handles distributed training and evaluation.
```python
estimator = tf.estimator.LinearRegressor(
    model_dir=output_dir,
    feature_columns=feature_cols)
...
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```
Visualize TensorFlow.
The service produces consistent outputs and works as expected. Available vs. durable (data loss).
Google Data Studio