Database Topic Overview

From Training Material
Jump to navigation Jump to search

<slideshow style="nobleprog" headingmark="⌘" incmark="…" scaled="true" font="Trebuchet MS" >

title
Database topic overview
author
Alexander Patrakov

</slideshow>

SQL and NoSQL ⌘

  • SQL = Structured Query Language
  • NoSQL = Not Only SQL
    • Collective term for non-traditional databases
    • Every NoSQL database is different

Let's try some SQL ⌘

  • SQL databases contain tables
    • How to create them and alter their structure?
      • DDL = Data Definition Language
    • How to Create, Read, Update, Delete rows?
      • DML = Data Modification Language
  • Learn statements: CREATE TABLE, INSERT, SELECT, UPDATE, DELETE
  • Try them on MySQL and PostgreSQL

For those who want to self-study SQL ⌘

CREATE TABLE dept (id INTEGER PRIMARY KEY, dname VARCHAR(255) NOT NULL, loc VARCHAR(255) NOT NULL);
INSERT INTO dept VALUES(10,'ACCOUNTING','NEW YORK');
INSERT INTO dept VALUES(20,'RESEARCH','LONDON');
INSERT INTO dept VALUES(30,'SALES','PARIS');
INSERT INTO dept VALUES(40,'OPERATIONS','BERLIN');
CREATE TABLE salgrade(id INTEGER PRIMARY KEY, losal DECIMAL(10,2) NOT NULL, hisal DECIMAL(10,2) NOT NULL);
INSERT INTO salgrade VALUES(1,700,1200);
INSERT INTO salgrade VALUES(2,1201,1400);
INSERT INTO salgrade VALUES(3,1401,2000);
INSERT INTO salgrade VALUES(4,2001,3000);
INSERT INTO salgrade VALUES(5,3001,9999);
CREATE TABLE emp (id INTEGER PRIMARY KEY, ename VARCHAR(255) NOT NULL,
  job VARCHAR(255) NOT NULL, mgr INTEGER REFERENCES emp(id),
  hiredate DATE NOT NULL, sal DECIMAL(10,2) NOT NULL,
  comm DECIMAL(10,2), deptno INTEGER NOT NULL REFERENCES dept(id));
INSERT INTO emp VALUES(7839,'BUSH','PRESIDENT',NULL,'1981-11-17',5000,NULL,10);
INSERT INTO emp VALUES(7698,'BLAIR','MANAGER',7839,'1981-05-01',2850,NULL,30);
INSERT INTO emp VALUES(7566,'PUTIN','MANAGER',7839,'1981-04-02',2975,NULL,20);
INSERT INTO emp VALUES(7902,'TOOSK','ANALYST',7566,'1981-12-03',3000,NULL,20);
INSERT INTO emp VALUES(7369,'THATCHER','CLERK',7902,'1980-12-17',800,NULL,20);
INSERT INTO emp VALUES(7499,'BAROSSO','SALESMAN',7698,'1981-02-20',1600,300,30);
INSERT INTO emp VALUES(7521,'WALTON','SALESMAN',7698,'1981-02-22',1250,500,30);
INSERT INTO emp VALUES(7654,'CHIRACK','SALESMAN',7698,'1981-09-28',1250,1400,30);
INSERT INTO emp VALUES(7782,'MERKEL','MANAGER',7839,'1981-06-09',2450,NULL,10);
INSERT INTO emp VALUES(7788,'CARNEGIE','ANALYST',7566,'1982-12-09',3000,NULL,20);
INSERT INTO emp VALUES(7844,'GATES','SALESMAN',7698,'1981-09-08',1500,0,30);
INSERT INTO emp VALUES(7876,'FORD','CLERK',7788,'1983-01-12',1100,NULL,20);
INSERT INTO emp VALUES(7900,'BUFFETT','CLERK',7698,'1981-12-03',950,NULL,30);
INSERT INTO emp VALUES(7934,'ELISON','CLERK',7782,'1982-01-23',1300,NULL,10);
  • Click "Expand" to see the queries for populating the initial structure and content
  • Follow SQL Programming slides and exercises

More SQL and DBA topics ⌘

  • Relational algebra
    • How to express 1:0-1, 1:N, or N:M relations using FOREIGN KEYs
    • How to combine matching rows using a JOIN
  • Normal forms and integrity
    • Taught only as a part of Oracle courses, but really belong to all SQL courses
  • Indexing
  • Grouping and aggregation
  • Concurrency and transactions
  • Security

Two major applications ⌘

  • Web applications
    • Queries usually touch only a tiny fraction of data
    • Performance is needed
    • Achieved by a good schema, properly placed indexes, and well-written queries
    • SQL is usually simple
  • Data warehousing, ad-hoc analytics, reporting
    • Real-time performance is not needed
    • Queries sometimes go through the entire database for aggregation
    • More advanced SQL features are used

Available SQL-based databases ⌘

  • MySQL and its variants (MariaDB, MySQL Cluster, Percona Server)
  • PostgreSQL
  • SQLite
    • No separate training, but mentioned as a subtopic for Android and iOS courses
  • MS SQL Server
  • MS Access
  • Oracle
  • Firebird
  • Specialized, less common: Impala, SequoiaDB, Trafodion

Developer topics ⌘

  • How to issue SQL statements from various programming languages
  • Non-standard SQL extensions
    • For some databases - how to write such extensions
  • Query optimization
  • Internals

ORMs ⌘

  • There is a conceptual mismatch between software written in object-oriented style and relational databases
    • There is a need to convert between object fields/properties and database columns
      • Trivial for strings and integers, non-trivial for arrays and references to other objects
      • One should be able to create a graph of object references and persist it to the database
    • Solved (to some degree) by Object-Relational Managers (ORMs)
  • Available ORMs:
    • Java: Hibernate
    • PHP: Doctrine (taught as a part of Symfony web framework, deserves more)
    • Python:
      • Web framework neutral: SQLAlchemy, Peewee (both have no trainings, can be added to a Flask training)
      • Web framework specific: Django ORM, Web2py DAL (both taught as a part of the corresponding web frameworks)
    • C#: ADO.NET (in particular, LINQ to SQL)
  • Reality: some web developers don't know SQL, because ORM provides good-enough facilities for accessing the database

Sysadmin topics ⌘

  • Implicit assumptions:
    • Sysadmins already know the basics of SQL
      • May or may not be true in the real world
    • Sysadmins only need the basics of SQL + sysadmin-only SQL statements
  • Installation & configuration
  • Data safety against power loss
  • Typical errors and how to fix them
  • Access control
  • Replication
  • Performance monitoring and tuning

Typical problems during courses ⌘

  • Licensing
    • Some companies prohibit the use of evaluation copies for training purposes - please check every time!
    • Evaluation versions may be too restricted in functionality
    • For demonstrating replication, multiple VMs per participant = multiple Windows licenses
      • Only NobleProg Poland has that, and we can also try Amazon
  • Insufficient server resources
  • Missing prerequisites
  • Finding trainers that know the internals

What about NoSQL? ⌘

  • Each NoSQL database is unique
    • The assumption that the admin already knows the basics no longer holds
    • The admin is still not interested how to interact with this thing from Java or Python
      • That (and time) is in fact the only difference between Cassandra for Developers and Cassandra Administration
      • For Redis, the outline is the same for sysadmins and developers
        • Of course delivered differently - using redis-cli (for admins) or Python or C# or ... (for developers)
  • Common points do exist
    • No joins, no complex queries, result: more performance
    • Usually no ACID - replaced with eventual consistency, also for performance
    • Probably not good for ad-hoc querying by analysts

Why NoSQL? ⌘

  • High availability
    • You have to replicate data
    • Distributed transactions (needed for ACID) are hard but solved problem
      • PAXOS, Raft
      • So no actual need to go NoSQL just because of that
  • Big data
    • If your data doesn't fit on a single server, you have to shard it
    • Joining sharded data is fundamentally hard
  • Performance
    • For simple queries, traditional relational databases spend more time in query parser and planner than actually executing the plan!
    • Solution: limit operations to very simple ones, which don't require planning
  • Convenience
    • Each NoSQL database was created in order to solve a particular problem
    • If this is the same as your problem, you win
    • E.g. graph databases are a win if you have many N:N relationships - queries are easier to write than in SQL
  • It's trendy
    • Mostly applies to MongoDB + Node.js, see also "MEAN stack"
    • "I know nothing about storing data at all, and Mongo looks simple, so I don't have to understand anything" (source)

NewSQL movement ⌘

  • SQL is a powerful query language
  • For database authors, using it means that users already know 50% of their database
  • There is no direct conflict between SQL and distributed systems
    • Yes, joins are fundamentally slow, but users know, and they need joins, so...
  • See a case study from Timescale DB
    • "SQL has become the narrow waist for data analysis"

Topics to cover in NoSQL courses ⌘

  • Common:
    • Data model
    • Query language
    • Use cases
  • Developer:
    • Language bindings
    • Consistency guarantees
    • Performance tips related to specific ways of writing queries
  • Admin:
    • Replication and sharding
    • Performance tuning without rewriting the queries

Available NoSQL databases ⌘

  • People sometimes classify them into column-oriented stores, document stores, key-value stores, and so on
    • Still there is a lot of variability inside each class
  • People sometimes attempt to classify them as CA, CP, or AP
    • Reference to Brewer's CAP theorem
    • This classification should be viewed as secondary
      • The primary classification should be according to their purpose
  • Cassandra (good for time-series data, for analytics and aggregation, has SQL-like syntax without joins)
  • Redis (data structure server, very low level, all data in memory, very fast, often used for caches and message queues)
  • HBase (built on top of Hadoop, so good for Big Data)
  • MongoDB (stores and searches JSON documents)
  • Solr and Elasticsearch (full text search oriented)
  • Missing trainings:
    • A good overview of the ecosystem (but is this possible to sell? possible at all?)
      • This outline is close but with strong emphasis on Big Data, and still never delivered
    • Tarantool (similar to Redis but better replication and scripting, used in Mail.ru)
    • Neo4j (graph database, good for representing connections between arbitrary objects)

Visual guide to NoSQL systems ⌘

Source

Classification is disputed

Key-Value stores ⌘

  • Very simple operations
    • SET key value
    • GET key
    • DELETE key
  • Canonical example: memcached
  • A bit more complicated example: Redis
    • Positioned as a data structure server, not key-value store
    • Supports sharding and master-slave replication
  • Fast because of simple operations

Cassandra ⌘

  • Data model:
    • Tables (formerly called "column families"), like in relational databases
    • There is schema, but it's cheap to add new columns
    • Querying is possible only by keys, not by arbitrary fields
    • No joins
      • May be possible to emulate, inefficiently, client-side or with MapReduce jobs
      • See also Spark
  • Consistency model:
    • Distributed system, high availability across multiple data centers, no single point of failure
    • No ACID
    • Lightweight transactions
    • Last Write Wins
  • Internals:
  • Use cases:
    • Storing and aggregating big data
    • Easy to add nodes to the cluster
    • Performs best when data is never deleted
      • Case study: keeping message history of Discord - why they migrated from MongoDB
    • See also this overview

MongoDB ⌘

  • Stores JSON documents
    • Has a JavaScript query API
    • Integrates well with Node.js
  • No schema
    • So migrations are not 100% necessary
    • It's common to store old data in old format, and reinterpret in the application on the fly
  • Strongly consistent by default
  • Scales horizontally
    • but selects Consistency over Availability

Trainer feedback on our Cassandra courses ⌘

Other troubles with Cassandra courses ⌘

  • Only one trouble: system requirements
    • Needs 4 GB of RAM by default
      • Downscaling is possible, but the trainer is against it
        • Would harm the course content by making it impossible to illustrate various database decisions
    • The server in USA has only 32 GB of RAM, so we can't accept more than 6-7 delegates