Database Topic Overview

title: Database topic overview
author: Alexander Patrakov

</slideshow>

SQL and NoSQL ⌘

SQL = Structured Query Language
- Standard used by many databases
- We have a course: https://www.nobleprog.com/sql-fundamentals-training
  - Can be delivered on any traditional database
  - You practiced it for MySQL = you know 50% of PostgreSQL
NoSQL = Not Only SQL
- Collective term for non-traditional databases
- Every NoSQL database is different

Let's try some SQL ⌘

SQL databases contain tables
- How to create them and alter their structure?
  - DDL = Data Definition Language
- How to Create, Read, Update, Delete rows?
  - DML = Data Modification Language
Learn statements: CREATE TABLE, INSERT, SELECT, UPDATE, DELETE
Try them on MySQL and PostgreSQL
- Or maybe online in sql.js?

For those who want to self-study SQL ⌘

Use this online tool or another one

CREATE TABLE dept (id INTEGER PRIMARY KEY, dname VARCHAR(255) NOT NULL, loc VARCHAR(255) NOT NULL);
INSERT INTO dept VALUES(10,'ACCOUNTING','NEW YORK');
INSERT INTO dept VALUES(20,'RESEARCH','LONDON');
INSERT INTO dept VALUES(30,'SALES','PARIS');
INSERT INTO dept VALUES(40,'OPERATIONS','BERLIN');
CREATE TABLE salgrade(id INTEGER PRIMARY KEY, losal DECIMAL(10,2) NOT NULL, hisal DECIMAL(10,2) NOT NULL);
INSERT INTO salgrade VALUES(1,700,1200);
INSERT INTO salgrade VALUES(2,1201,1400);
INSERT INTO salgrade VALUES(3,1401,2000);
INSERT INTO salgrade VALUES(4,2001,3000);
INSERT INTO salgrade VALUES(5,3001,9999);
CREATE TABLE emp (id INTEGER PRIMARY KEY, ename VARCHAR(255) NOT NULL,
  job VARCHAR(255) NOT NULL, mgr INTEGER REFERENCES emp(id),
  hiredate DATE NOT NULL, sal DECIMAL(10,2) NOT NULL,
  comm DECIMAL(10,2), deptno INTEGER NOT NULL REFERENCES dept(id));
INSERT INTO emp VALUES(7839,'BUSH','PRESIDENT',NULL,'1981-11-17',5000,NULL,10);
INSERT INTO emp VALUES(7698,'BLAIR','MANAGER',7839,'1981-05-01',2850,NULL,30);
INSERT INTO emp VALUES(7566,'PUTIN','MANAGER',7839,'1981-04-02',2975,NULL,20);
INSERT INTO emp VALUES(7902,'TOOSK','ANALYST',7566,'1981-12-03',3000,NULL,20);
INSERT INTO emp VALUES(7369,'THATCHER','CLERK',7902,'1980-12-17',800,NULL,20);
INSERT INTO emp VALUES(7499,'BAROSSO','SALESMAN',7698,'1981-02-20',1600,300,30);
INSERT INTO emp VALUES(7521,'WALTON','SALESMAN',7698,'1981-02-22',1250,500,30);
INSERT INTO emp VALUES(7654,'CHIRACK','SALESMAN',7698,'1981-09-28',1250,1400,30);
INSERT INTO emp VALUES(7782,'MERKEL','MANAGER',7839,'1981-06-09',2450,NULL,10);
INSERT INTO emp VALUES(7788,'CARNEGIE','ANALYST',7566,'1982-12-09',3000,NULL,20);
INSERT INTO emp VALUES(7844,'GATES','SALESMAN',7698,'1981-09-08',1500,0,30);
INSERT INTO emp VALUES(7876,'FORD','CLERK',7788,'1983-01-12',1100,NULL,20);
INSERT INTO emp VALUES(7900,'BUFFETT','CLERK',7698,'1981-12-03',950,NULL,30);
INSERT INTO emp VALUES(7934,'ELISON','CLERK',7782,'1982-01-23',1300,NULL,10);

Click "Expand" to see the queries for populating the initial structure and content
Follow SQL Programming slides and exercises

Two major applications ⌘

Web applications
- Queries usually touch only a tiny fraction of data
- Performance is needed
- Achieved by a good schema, properly placed indexes, and well-written queries
- SQL is usually simple
Data warehousing, ad-hoc analytics, reporting
- Real-time performance is not needed
- Queries sometimes go through the entire database for aggregation
- More advanced SQL features are used

Available SQL-based databases ⌘

MySQL and its variants (MariaDB, MySQL Cluster, Percona Server)
PostgreSQL
SQLite
- No separate training, but mentioned as a subtopic for Android and iOS courses
MS SQL Server
MS Access
Oracle
Firebird
Specialized, less common: Impala, SequoiaDB, Trafodion

Developer topics ⌘

How to issue SQL statements from various programming languages
Non-standard SQL extensions
- For some databases - how to write such extensions
Query optimization
Internals

ORMs ⌘

There is a conceptual mismatch between software written in object-oriented style and relational databases
- There is a need to convert between object fields/properties and database columns
  - Trivial for strings and integers, non-trivial for arrays and references to other objects
  - One should be able to create a graph of object references and persist it to the database
- Solved (to some degree) by Object-Relational Managers (ORMs)
Available ORMs:
- Java: Hibernate
- PHP: Doctrine (taught as a part of Symfony web framework, deserves more)
- Python:
  - Web framework neutral: SQLAlchemy, Peewee (both have no trainings, can be added to a Flask training)
  - Web framework specific: Django ORM, Web2py DAL (both taught as a part of the corresponding web frameworks)
- C#: ADO.NET (in particular, LINQ to SQL)
Reality: some web developers don't know SQL, because ORM provides good-enough facilities for accessing the database

Sysadmin topics ⌘

Implicit assumptions:
- Sysadmins already know the basics of SQL
  - May or may not be true in the real world
- Sysadmins only need the basics of SQL + sysadmin-only SQL statements
Installation & configuration
Data safety against power loss
Typical errors and how to fix them
Access control
Replication
Performance monitoring and tuning

Typical problems during courses ⌘

Licensing
- Some companies prohibit the use of evaluation copies for training purposes - please check every time!
- Evaluation versions may be too restricted in functionality
- For demonstrating replication, multiple VMs per participant = multiple Windows licenses
  - Only NobleProg Poland has that, and we can also try Amazon
Insufficient server resources
- Huge problem for Introduction to SQL Server 2012 Integration Services (SSIS)
Missing prerequisites
- Problem for MS SQL Server 2016
Finding trainers that know the internals
- I suspect that PostgreSQL Administration and Development was never delivered in full
  - Still present on some frontends, please evaluate and maybe delete

What about NoSQL? ⌘

Each NoSQL database is unique
- The assumption that the admin already knows the basics no longer holds
- The admin is still not interested how to interact with this thing from Java or Python
  - That (and time) is in fact the only difference between Cassandra for Developers and Cassandra Administration
  - For Redis, the outline is the same for sysadmins and developers
    - Of course delivered differently - using redis-cli (for admins) or Python or C# or ... (for developers)
Common points do exist
- No joins, no complex queries, result: more performance
- Usually no ACID - replaced with eventual consistency, also for performance
- Probably not good for ad-hoc querying by analysts

Why NoSQL? ⌘

High availability
- You have to replicate data
- Distributed transactions (needed for ACID) are hard but solved problem
  - PAXOS, Raft
  - So no actual need to go NoSQL just because of that
Big data
- If your data doesn't fit on a single server, you have to shard it
- Joining sharded data is fundamentally hard
Performance
- For simple queries, traditional relational databases spend more time in query parser and planner than actually executing the plan!
- Solution: limit operations to very simple ones, which don't require planning
Convenience
- Each NoSQL database was created in order to solve a particular problem
- If this is the same as your problem, you win
- E.g. graph databases are a win if you have many N:N relationships - queries are easier to write than in SQL
It's trendy
- Mostly applies to MongoDB + Node.js, see also "MEAN stack"
- "I know nothing about storing data at all, and Mongo looks simple, so I don't have to understand anything" (source)
  - Not a valid reason, see this blog post from 2013

NewSQL movement ⌘

SQL is a powerful query language
For database authors, using it means that users already know 50% of their database
There is no direct conflict between SQL and distributed systems
- Yes, joins are fundamentally slow, but users know, and they need joins, so...
See a case study from Timescale DB
- "SQL has become the narrow waist for data analysis"

Topics to cover in NoSQL courses ⌘

Common:
- Data model
- Query language
- Use cases
Developer:
- Language bindings
- Consistency guarantees
- Performance tips related to specific ways of writing queries
Admin:
- Replication and sharding
- Performance tuning without rewriting the queries

Available NoSQL databases ⌘

People sometimes classify them into column-oriented stores, document stores, key-value stores, and so on
- Still there is a lot of variability inside each class
People sometimes attempt to classify them as CA, CP, or AP
- Reference to Brewer's CAP theorem
- This classification should be viewed as secondary
  - The primary classification should be according to their purpose
Cassandra (good for time-series data, for analytics and aggregation, has SQL-like syntax without joins)
Redis (data structure server, very low level, all data in memory, very fast, often used for caches and message queues)
HBase (built on top of Hadoop, so good for Big Data)
MongoDB (stores and searches JSON documents)
Solr and Elasticsearch (full text search oriented)
Missing trainings:
- ~~A good overview of the ecosystem (but is this possible to sell? possible at all?)~~
  - This outline is close but with strong emphasis on Big Data, and still never delivered
- Tarantool (similar to Redis but better replication and scripting, used in Mail.ru)
- Neo4j (graph database, good for representing connections between arbitrary objects)

Visual guide to NoSQL systems ⌘

Source

Error creating thumbnail: File missing

Classification is disputed

Key-Value stores ⌘

Very simple operations
- SET key value
- GET key
- DELETE key
Canonical example: memcached
A bit more complicated example: Redis
- Positioned as a data structure server, not key-value store
- Supports sharding and master-slave replication
Fast because of simple operations

Cassandra ⌘

Data model:
- Tables (formerly called "column families"), like in relational databases
- There is schema, but it's cheap to add new columns
- Querying is possible only by keys, not by arbitrary fields
- No joins
  - May be possible to emulate, inefficiently, client-side or with MapReduce jobs
  - See also Spark
Consistency model:
- Distributed system, high availability across multiple data centers, no single point of failure
- No ACID
- Lightweight transactions
- Last Write Wins
Internals:
- Log-structured merge tree, memtables and SSTables, tombstones, immutable disk files
Use cases:
- Storing and aggregating big data
- Easy to add nodes to the cluster
- Performs best when data is never deleted
  - Case study: keeping message history of Discord - why they migrated from MongoDB
- See also this overview

MongoDB ⌘

Stores JSON documents
- Has a JavaScript query API
- Integrates well with Node.js
No schema
- So migrations are not 100% necessary
- It's common to store old data in old format, and reinterpret in the application on the fly
Strongly consistent by default
Scales horizontally
- but selects Consistency over Availability

Trainer feedback on our Cassandra courses ⌘

Trainer: Renhart Gittens
https://www.nobleprog.co.uk/cassandra-developers-course
- No feedback
https://www.nobleprog.co.uk/cassandra-administration-course
- This is basically the developers course with the Datamodeling labs and Cassandra drivers section omitted to fit it into 2 days. I suppose more emphasis would be placed on the admin components over the 2 days.
https://www.nobleprog.co.uk/training/fundamentals-cassandra-db
- Should be updated to refer to Cassandra 3.0. This 3-day course is probably superseded by the cassandra-administration-course which drops the reference to Cassandra 2.0 but the extra day gives more opportunity to explore Cassandra internals
https://www.nobleprog.co.uk/apache-cassandra-2x-core-internals-course
- This course is dated - refers to version 2.x and the use of rackspace and installing into the cloud? While Apache Cassandra 2.2 is still supported its end of life cannot be very far away.

Other troubles with Cassandra courses ⌘

Only one trouble: system requirements
- Needs 4 GB of RAM by default
  - Downscaling is possible, but the trainer is against it
    - Would harm the course content by making it impossible to illustrate various database decisions
- The server in USA has only 32 GB of RAM, so we can't accept more than 6-7 delegates

Database Topic Overview

Contents

SQL and NoSQL ⌘

Let's try some SQL ⌘

For those who want to self-study SQL ⌘

More SQL and DBA topics ⌘

Two major applications ⌘

Available SQL-based databases ⌘

Developer topics ⌘

ORMs ⌘

Sysadmin topics ⌘

Typical problems during courses ⌘

What about NoSQL? ⌘

Why NoSQL? ⌘

NewSQL movement ⌘

Topics to cover in NoSQL courses ⌘

Available NoSQL databases ⌘

Visual guide to NoSQL systems ⌘

Key-Value stores ⌘

Cassandra ⌘

MongoDB ⌘

Trainer feedback on our Cassandra courses ⌘

Other troubles with Cassandra courses ⌘

Navigation menu

Personal tools

Namespaces

Variants

Views

Search

Opportunities

Navigation

Tools