DataScience_Examples

All about DataSince, DataEngineering and ComputerScience

View the Project on GitHub datainsightat/DataScience_Examples

BigQuery

Big Query

Computation time of 50 minutes down to 13 seconds!

Big Query combines data storage and the SQL engine.

Big Query Components

You can load data direct into the SQL engine without uploading it into die Bigquery datastorage.

Dataprep

UI to explore dataset.

Big Query Components

Security

Viewer, Editor, Owner

STRUCTs and ARRAYs

Structs are nested collections of columns. Arrays allow to split data cells in multiple records.

Big Query Components

GIS

GeoVIS

Big Query Components

Datasets

Datasets

Machine Learning

ML features BigQuery:

Distribution of ML models

Distribution ML Models

Types of ML models

Types ML Models

Linear Classifier (Logistic Regression)

Linear Classifier

DNN Classifier

Deep Neural Network

XGBoost Classifier

XGBoost

Linear Regression

Linear Regression

DNN Regression

DNN Regression

XGBoost Regression

XGBoost Regression

Train TF, Predict BigQuery

TF Model

Recommendation, Matrix Factorization

Matrix Model

Clustering

KNN Clustering

ML in BigQuery

ML Process in BigQuery:

Types ML Models

ML.WEIGHTS

Show how usefull an item is to predict value.

Overview

Types ML Models

Example

Workflow

Extract Data

select
  url, title
from
  `bigquery-public-data.hacker_news.stories`
where
   langth(title) > 10
   and length(url) > 0
limit 10;

Create Model

create or replace model
  advdata.txtclass
  options(model_type='logistic_reg', input_label_cols=['source'])
  as
  
  with
    extract as (
      ...
    ),
    ds as (
      ...
    )

Evaluate Model

select
  *
from
  ml.evaluate(model adcdata.txtclass)

Predict

select
  *
from
  ml.predict(model advdata.txtclass, (
    select
      'government' as word 1,
      'shutdown' as word 2,
      'leaves' as word 3,
      'workers' as word 4,
      'reeling' as word 5)

Partitioning and Clustering

Overview

create table mydataset.myclusteredtable
  (
  c1 numeric,
  userId string,
  c3 string,
  enventDate timestamp,
  c5 geography
  )
partition by date(eventDate)
cluster by userId
options
  (
  partition_expiration_days=3,
  description="cluster"
  )
as select * from maydataset.myothertable

Partitioning

Partitioning

Partitioning Time

Create Partitioning Time

Partition Filter

Clustering

Cluster

Create Cluster

Automatic Reclustering

Streaming

Streaming data is charged transaction.

Streaming

$ export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"
$ pip install google-cloud-bigquery

from google.cloud import bigquery
client = bigquery.Client(project='PROJECT_ID')

dataset_ref = bigquery_client.dataset('my_dataset_id')
table_ref = dataset_ref.table('my_dataset_id')
table = bigquery_client.get_table(table_ref)

to_to_insert = [
  (u'customer 1', 5),
  (u'customer 2', 17)
]

errors = bigquery_client.insert_rows(table, rows_to_insert)

Data Studio

Data Studio

BI Engine

Bi Engine

Analytic Window Functions

Rank

with
  longest_trips as (
    select
      start_station_id,
      duration,
      rank() over(partition by start_station_id order by duration desc) as nth_longest
    from
      `bigquery-public-data`.london_bicycles.cycle_hire
    )
  
  select
    start_station_id,
    array_agg (
      duration
      order by
        nth_longest
      limit
        3
      ) as durations
  from
    longest trips
  groub by
    start_station_id

GIS Functions

GIS

Performance

Pipeline

Performance

Intermediate Tables

Intermediate Tables

gcp > BigQuery > Execution Details

Aproximate Functions

Prcing

Active Storage

Long-term Storage

Table or partition not edited 90+ days

BigQuery Slots

On-Demand Pricing

Flat-Rate Pricing

Flex Slots

Fair Scheduler

BigQuery Slots

Fair Scheduler

Fair Scheduler

You can set hierarchical reservation.