DataScience_Examples

All about DataSince, DataEngineering and ComputerScience

View the Project on GitHub datainsightat/DataScience_Examples

PostgreSQL

Setup

Get and run Docker Image

$ docker pull postgres
$ docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d -p 5432:5432 postgres

Connect to PostgreSQL container

attribute value
Host localhost
Username postgres
Port 5432
Password mysecretpassword

Database Design

dbdiagram.io

db design

Code

Table users {
  id SERIAL [pk, increment]
  created_at TIMESTAMP
  updated_at TIMESTAMP
  username VARCHAR(30)
  bio VARCHAR(400)
  avatar VARCHAR(200)
  phone VARCHAR(25)
  email VARCHAR(40)
  password VARCHAR(50)
  status VARCHAR(15)
}

...

Datatypes

Numbers

Datatype Description
smalling -32,768 to +32,767
integerr -2,147,583,648 to +2,147,583,647
bigint -9.22 10^18 to +9.22 10^18
smallserial +1 to +32,767
serial +1 to +2,147,583,647
bigserial +1 to +9.22 10^18
decimalr 131,072 digits before decimal point, 16,383 after
numeric 131,072 digits before decimal point, 16,383 after
real -1 10^37 to +1 10^37 with a least 6 place precision
double precision -1 10^307 to +1 10^308 with a least 15 place precision
float -same as real or double

Creat table using these datatypes like so:

create table products (
    id serial primary key,
    name varchar(50).
    department varchar(50),
    price integer,
    weight integer
)

Characters

Datatype Description
char(5) Store come characters, length will always be 5, even if PG has to insert spaces
varchar Store any length of string
varchar(40) Store a string up to 40 characters
text Store any length of string

Boolean

Value Boolean representation
true, yes, on 1, t, y TRUE
false, no, off, 0, f, n FALSE
null NULL
select ('yes'::BOOLEAN);
-> TRUE

Date and Time

Date Value
1980-11-20 20 November 1980
Nov-20-1980 20 November 1980
20-Nov-1980 20 November 1980
1980-November-20 20 November 1980
November 20, 1980 20 November 1980
select ('NOV-20-1980')::DATE);
select ('NOV 20 1980')::DATE);
...
Time Value
01:23 AM 01:23, no time zone
05:23 PM 17:23, no time zone
20:34 20:34, no time zone
select ('01:23 AM')::TIME);
select ('20:34')::TIME);
...
Time with Timezone Value
01:23 AM EST 01:23-05:00
05:23 PM PST 17:23-08:00
20:34 UTC 20:34+00:00
select ('01:23 AM EST')::TIME WITH TIME ZONE);
select ('20:34 UTC')::TIME WITH TIME ZONE);
...
Timestamp with Timezone Value
Nov-20-1980 01:23 AM PST 1980-11-20 01:23:00-07
select ('Nov-20-1980 01:23 AM PST')::TIMESTAMP WITH TIME ZONE);

Intervals

Interval Description
1 day, 1 D 1 day
1 D 1 H 1 M 1 S 1 day 1 hour 1 minute 1 second
select ('1 D 20 H 30 M 45 S')::INTERVAL);

select ('1 D 20 H 30 M 45 S')::INTERVAL) - ('1 D'::INTERVAL);
-> 20:30:45

select ('NOV-20-1980 1:23 AM EST')::TIMESTAMP WITH TIME ZONE)
-
('1 D'::INTERVAL);
-> 1980-11-18 23:23:00-07

select ('NOV-20-1980 1:23 AM EST'::TIMESTAMP WITH TIME ZONE)
-
('NOV-20-1980 11:43 AM EST'::TIMESTAMP WITH TIME ZONE);
-> { "hours": -10, "minutes": -20 }

Basic SQL Commands

CREATE TABLE > Create Table

create table cities (
    name varchar(50),
    country varchar(50),
    population integer,
    area integer
);

INSERT INTO > Insert Data into Table

insert into cities (name,country,population,area)
values
    ('Delhi','India',28125000,2240),
    ('Shanghai','China',22125000,4015),
    ('Sao Paulo','Brazil',20935000,3043);

SELECT > Select Data from Table

select * from cities;
select name,country from cities;

Calculated Columns

Operator Function
+ Add
- Subtract
* Multiply
/ Divice
^ Exponent
|/ Square Root
@ Absolute Value
% Remainder
select
  name,
  population / area as population_denstity
from
  cities;

String Operators

Operator Function
|| Concatenate
concat() Concatenate
lower() Lowercase
upper() Uppercase
length() Number of characters
SELECT
    name || ', ' || country,
    concat(name,', ',country),
    lower(name),
    upper(country),
    length(name)
FROM
    cities;

DISTINCT > Unique Values

select
    distinct
    country
from
    cities

GREATER/LEAST > Biggest, Smallest values

select
    name,
    weight,
    greatest(30,2*weight),
    least(30,2*weight)
from
    products;

CASE > If then else Statement

select
    name,
    price,
    case
        when price > 600 then 'high'
        when price > 300 then 'medium'
        else 'cheap'
    end

WHERE > Filter Table

Operator Function
= Equal
> Greater
< Smaller
>= Greater or Equal
<= Smaller or Equal
<> Not Equal
!= Not Equal
between Between two values
in Value in list
not in Value not in list
is null value is NULL
is not null value is not null
SELECT
    name,
    area
FROM
    cities
WHERE
    area between 2000 and 5000
    and name in ('Delhi','Shanghai');

UPDATE > Update rows

UPDATE
    cities
set
    population = 39505000
WHERE
    name = 'Shanghai';

DELETE > Delete rows

DELETE FROM
    cities
WHERE
    name = 'Tokyo';

Tables

Primary Keys and Foreign Keys

create Table users (
    id serial primary key,
    username varchar(50)
);

insert INTO
    users (username)
VALUES
    ('monahan93'),
    ('pferrer'),
    ('si930nis'),
    ('99stroman');

select * from users;

|id|username| |-|-| |1|monahan93| |2|pferrer| |3|si930nis| |4|99stroman|

create Table photos (
    id serial primary key,
    url varchar(200),
    user_id integer REFERENCES users(id) on delete cascade
);

insert INTO
    photos (url,user_id)
VALUES
    ('http://img1.jpg',4),
    ('http://img2.jpg',4),
    ('http://img3.jpg',1),
    ('http://img4.jpg',2),
    ('http://img01.jp2',NULL); 

select * from photos;

|id|url|user_id| |-|-|-| |1|http://img1.jpg|4| |2|http://img2.jpg|4| |3|http://img3.jpg|1| |4|http://img4.jpg|2|

The ‘SERIAL’ keyword lets the index column increase by 1, each time a new record is inserted.

create Table comments (
    id serial primary key,
    photo_id integer REFERENCES photos(id) on delete cascade,
    user_id integer REFERENCES photos(id) on delete cascade,
    contents varchar(240)
);

insert INTO
    comments (contents,photo_id,user_id)
VALUES
    ('Quo velit iusto ducimus quos a incidunt nesciunt facilis.', 2, 1),
    ('Non est totam.', 2, 1),
    ('Fuga et iste beatae.', 3, 2),
    ('Molestias tempore est.', 1, 2);

select * from comments;

|id|photo_id|user_id|contents| |-|-|-|-| |1|2|1|Quo velit iusto ducimus quos a incidunt nesciunt facilis.| |2|2|1|Non est totam.| |3|3|2|Fuga et iste beatae.| |4|1|2|Molestias tempore est.|

Validation

NOT NULL

The value must not be null

create table products (
    is serial primary key,
    name varchar(50),
    department varchar(50),
    price integer not null,
    weight integer
);

Add validation to existing column. If the column contains already null values, these values need to be replaced by actual values.

update
    products
set
    price = 999
where
    price is null;

alter table
    products
alter column
    price
set not null;

DEFAULT

Set a default value for unspecified columns.

create table products (
    is serial primary key,
    name varchar(50),
    department varchar(50),
    price integer default 999,
    weight integer
);

alter table
    products
alter column
    price
set default 999;

UNIQUE

Ensure that column values are unique.

create table products (
    is serial primary key,
    name varchar(50) unique,
    department varchar(50),
    price integer default 999,
    weight integer
);

You cannot add the constraint to existing tables, if the column contans mulitple entries with the same value.

alter table
    products
alter column
    name
add unique (name);

alter table
    products
drop constraint products_name_key;
Multi column Uniqueness
create table products (
    is serial primary key,
    name varchar(50),
    department varchar(50),
    price integer default 999,
    weight integer,
    unique(name,department)
);

alter table
    products
add unique (name, department);

Validation Check

create table products (
    is serial primary key,
    name varchar(50),
    department varchar(50),
    price integer check (price > 0),
    weight integer
);

Existing rows must satisfy the restriction.

alter table
    products
add check (price > 0);

Views

Create a fake table that has rows from other tables. These rows can be exact rows as they exist on another table, or a computed value. You can referance a view in any place where we would normally reference a table. View don’t create a new table or move any data around.

create view tags as (
  select
    id,
    created_at,
    user_id,
    post_id,
    'photo_tag' as type
  from
    photo_tags

  union all

  select
    id,
    created_at,
    user_id,
    post_id,
    'caption_tag' as type
  from
    caption_tags
);

select * from tags;

Materialized Views

Query that gets executed only at very specific times, gut the results are saved an can be referenced without rerunning the query.

create materialized view weekly_likes as (
  select
    date_trunc('week', coalesce(posts.created_at, comments.created_at)) as week,
    count(posts.id) as num_likes_for_posts,
    count(comments.id) as num_likes_for_comments
  from
    likes
  left join
    posts on
      posts.id = likes.post_id
  left join
    comments on
      comments.id = likes.comment_id
  group by
    week
  order by
    week
) with data;

select * from weekly_likes;

refresh materialized view weekly_likes;

Transactions

Every time you access the PG database, you create a connection to the database. Each connection can access all the data from the database. Start a transaction using the ‘BEGIN’ keyword. The transaction isolates a connection from other connections. Just this connection can access the table that is uses inside the transaction. Changes within the transaction are just shown in the connection. Other connections do no see the changes until the transaction is closed using the ‘COMMIT’ keyword. If a connection crashes, PG undoes any changes within the transaction.

Keyword Action
BEGIN Start transaction
COMMIT End connection. Merge changes
ROLLBACK End transaction. Undo changes
create table accounts (
  id serial primary_key,
  name varchar(20) not null,
  balance integer not null
);

insert into
  accounts (name, balance)
values
  ('Gia',100),
  ('Alyson',100);

begin; -- Start Transaction.

update
  accounts
set
  balance = balance - 50
where name = 'Gia';

commit; -- End Transaction. Merge changes to database.
--rollback; -- End Transaction. Remove changes.

Joins

For each comment, show the contents of the comment and the username of the user who wrote the comment:

SELECT
  a.contents,
  b.username
FROM
  comments a
JOIN
  users b ON
    a.user_id = b.id;

joins

Search for all users that commented on their own photos.

SELECT
  url, contents, username
FROM
  comments a
JOIN
  photos b on
    b.id = a.photo_id
JOIN
  users c ON
    c.id = a.user_id
    and a.user_id = b.user_id;

Aggregation

Function Description
COUNT() Number of values
SUM() Sum of a group of numbers
AVG() Average of a group of numbers
MIN() Minimum
MAX() Maximum

GROUP BY > Groups rows by a unique set of values

SELECT
  user_id,
  count(*)
FROM
  photos
GROUP BY
 user_id;

HAVING > Filter the set of groups

HAVING filters grouped values

SELECT
  b.username,
  count(*)
FROM
  photos a
LEFT JOIN
  users b ON
    b.id = a.user_id
GROUP BY
  b.username
HAVING
  count(*) >= 4;

Sorting

ORDER BY > Sort columns

select
    *
from
    products
order by
    price,
    weight desc;

OFFSET > Skip the first n lines

select
    *
from
    users
offset
    10;

LIMIT > Get specific number of records

select
    *
from
    users
limit
    5;

Union and Intersections

‘UNION’ removes identical entries. Use the ‘UNION ALL’ keyword wo keep identical records.

Keyword Function
UNION Join together the reults of two queries and remove duplicate rows
UNION ALL Join together results of two queries
INTERSECT Find the rows common in the results of two queries. Remove duplicates
INTERSECT ALL Find the rows common in the results of two queries
EXCEPT Find the rows that are present in first query but not second query. Remove duplicates
EXCEPT ALL Find the rows that are present in first query butg not second query
(
    select
        *
    from
        products
    order by
        price desc
    limit
        4
)

union

(
    select
        *
    from
        products
    order by
        price / weight desc
    limit
        4
);

Subqueries

The shape of data matters in the use of subqueries

Statement shape
select * from orders Many rows, many columns
select id from orders Many rows, one column
select count(*) from orders Single value

Subqueries in SELECT Statements

Any subquery must result in a single value

select
    name,
    price,
    (select max(price) from products) price_max,
    (select price from products where id = 3) price_id_3
from
    products
where
    price > 867;

Subqueries in FROM Statements

Any subquery compatible to the outer query. Apply alias to subquery.

select
    avg(p.order_count)
from (
    select
        user_id,
        count(*) as order_count
    from
        orders
    group by
        user_id
) as p;

Subqueries in JOIN Clauses

Any subquery that is compatible to the on statement

select
    u.first_name
from
    users as u
join (
    select
        user_id
    from
        orders
    where
        product_id = 3
) as o on
    u.id = o.user_id;

Subqueries in WHERE Statements

Operator Datastructure
>,<,>=,<=,=,<>,!= Single Value
IN, NOT IN Single Column
(>,<,>=,<=,=,<>) ALL/SOME/ANY Single Column
select
    id
from
    orders
where
    product_id in (
        select
            id
        from
            products
        where
            price / weight > 5
);
SELECT
    name, price
FROM
    products
WHERE
    price > (
        SELECT
            max(price)
        FROM
            products
        WHERE
            department = 'toys'
    );
Operator Description
ALL All items in list
SOME/ANY At least one item in list
select
    name,
    department,
    price
from
    products
where
    price > some (
        select
            price
        from
            products
        where
            department = 'Industrial'
);

Correlated Subqueries

Get the most expensive product for each department

select
    name,
    deparment,
    price
from
    products p1
where
    p1.price = (
        select
            max(price)
        from
            products p2
        where
            p2.department = p1.department
);
select
    p1.name
    (
        select
            count(*)
        from
            orders as o1
        where
            o1.product_id = p1.id
    ) as num_orders
from
    products as p1

Common Table Expressions (CTEs)

with
  tags as (
    select
      user_id,
      created_at
    from
      caption_tags

    union all

    select
      user_id,
      created_at
    from
      photo_tags
  )

select
  username,
  tags.created_at
from
  users
join
  tags on
    tags.user_is = users.id
where
  tags.created_at < '2020-01-07';

Recursive CTEs

Very different from simple CTEs. They are useful anytime you have a tree or graph-type data structure. Every recursive CTE has a union inside it.

with
  recursive countdown(val) as (

    select -- working table
      3 as val

    union

    select -- results table
      val - 1
    from
      countdown
    where
      val > 1
  )

select
  *
from
  countdown;

Recursive CTE

with
  recursive suggestions(leader_id,follower_id,depth) as (
    select
      leader_id,
      follow_id,
      1 as depth
    from
      followers
    where
      follower_id = 1000 -- 'me'

    union

    select
      followers.leader_id,
      followers.follower_id,
      depth + 1
    from
      followers
    join
      suggestions on
        suggestions.leader_id = followers.follower_id
    where
      depth < 3
  )

select distinct
  users.id,
  users.username
from
  suggestions
join
  users on
    users.id = suggestions.leader_id
where
  depth > 1
limit
  30;

Performance

Minimize the amount of data loaded from heap files to memory. A ‘full table scan’ requires postgresql to load all heap files into memory.

Internals

Term Description
Heap, Heap File File that contains all data of a table
Tuple, Item Individual row of a table
Block, Page A heap file is divided into many ‘blocks’
select oid, datname from pg_database; -- id databases

select * from pq_class; -- id objects

show data_directory; -- folder where data is stored

select * from pq_stats where tablename = 'users'; -- statistics about columns of the table

Index

An index tells the query which heap file(s) to load in order to retrieve the information.

create index on users (username);

Query Tuning

SQL Query > Parser > Rewrite > Planner > Execute

EXPLAIN, EXPLAN ANALYZE > Display info about query

Explan builds the query and provides information about it, but does not run it. Explan Analyze actually runs the query.

EXPLAIN ANALYZE SELECT
    username,
    contents
FROM
    users
JOIN
    comments ON
        comments.user_id = users.id
WHERE
    username = 'Alyson14';

Analyze the result from the inner most rows to the outer rows. Costs flow upwards.

Hash Join  (cost=8.17..19.95 rows=3 width=576) (actual time=0.003..0.004 rows=0 loops=1)
  Hash Cond: (comments.user_id = users.id)
  ->  Seq Scan on comments  (cost=0.00..11.40 rows=140 width=502) (actual time=0.002..0.002 rows=0 loops=1)
  ->  Hash  (cost=8.16..8.16 rows=1 width=82) (never executed)
        ->  Index Scan using users_username_idx on users  (cost=0.14..8.16 rows=1 width=82) (never executed)
              Index Cond: ((username)::text = 'Alyson14'::text)

Planning Time: 0.136 ms
Execution Time: 0.038 ms
Step Description
-> Query node. Access, or process data
Hash How the node is generating data
cost Amount of processing power, time, required for the step
rows How many rows will be produced
width average number of bytes or each row

Cost Calculation

Costs are shown in two numbers. The first is the cost for the first row, the second ist the cost for all other rows.

cost=8.17..19.95

Cost =
  (# pages read sequentially) * seq_page_cost
  + (# pages read at random) * random_page_cost
  + (# rows scanned) * cpu_tuple_cost
  + (# index entries scanned) * cpu_index_tuple_cost
  + (# times function/operator evaluated) * cpu_operator_cost

Cost Factors

Factor Cost
seq_page_cost 1.0
random_page_cost 4.0
cpu_tuple_cost 0.01
cpu_index_tuple_cost 0.005
cpu_operator_cost 0.0025

Migrations

Schema Migrations

MigrationE

Schema Migration File

A schema migration file contains two sections: up and down:

--up
alter table
  comments
rename column
  contents
to
  body;

--down
alter table
  comments
rename column
  body
to
  comments;

Example Create Table

--up
create table comments (
  id serial primary key,
  created_at timestamp,
  updated_at timestamp,
  contents varchar(240),
);

--down
drop table comments;

Data Migration

Schema Migration > Data Migration > Schema Migration

1) Add column 2) Deploy new Version of API that fills both old column and new column 3) Copy data to new column 4) Deploy new version of API to write data just to new column 5) Drop old column </a>

Demo Webserver

MigrationE

Data Isolation with Schemas

Schemas are like folders to organize things in a database. Every database gets a default schmea called ‘public’. Each schema can have its own seperate copy of a table.

create schema test;

create table test.users (
  id serial primary key,
  username varchar
);

insert into test.users (username)
values
  ('Alice'),
  ('Bob');

Controlling Schema Access with Search Paths

’'’sql show search_path;

seach_path = “$user”, public; search_path = “sg”, test;

set search_path to test, public – Access schema “test” first, before defaulting to “public” ‘’’