A database is a collection of structured data stored in a computer system, and it can be hosted on premises or in the cloud. Because databases are designed to make data easy to access, the resources compiled here cover everything you need to know, from database management systems to database languages.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.

PostgreSQL has been in development and use for over 35 years. Throughout those years, the project that started as an open-source relational database for transactional workloads has turned into one of the most reliable and comprehensive database solutions for a variety of use cases. The depth and breadth of PostgreSQL's capabilities have become so rich that you might hear "just use Postgres" if you ask for advice about database options for your next application.

What can explain PostgreSQL's meteoric rise in popularity? Why have over 90,000 developers ranked PostgreSQL as the #1 database on StackOverflow? Why has DB-Engines recognized PostgreSQL as the DBMS of 2023? In short, it's reliable and enterprise-ready, it's expandable by design, and it's a true open-source database that is being developed and stewarded by the community. So let's take a quick tour through some of PostgreSQL's notable capabilities to see what the database offers to application developers.

Starting PostgreSQL in Minutes

For those getting started with PostgreSQL, let's first see how to start the database on your laptop and generate a sample dataset within a few minutes.

Launching PostgreSQL in Docker

The fastest way to get started is by launching a database container in Docker:

Shell
mkdir ~/postgresql_data/
docker run --name postgresql \
    -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=password \
    -p 5432:5432 \
    -v ~/postgresql_data/:/var/lib/postgresql/data -d ankane/pgvector:latest

This command starts the postgresql container using the postgresql_data directory as a volume for the database's data, logs, and configuration. Once started, the database will listen for incoming connections on port 5432. The container uses the latest version of PostgreSQL with the pgvector extension (ankane/pgvector:latest), which is used at the end of this guide. Note that you can always replace ankane/pgvector:latest with postgres:latest if pgvector is not necessary.

Once started, you can connect to the database with the psql tool that is shipped with PostgreSQL:

Shell
docker container exec -it postgresql psql -U postgres

The psql prompt should welcome you as follows:

Shell
psql (15.4 (Debian 15.4-2.pgdg120+1))
Type "help" for help.

postgres=#

Generating Mock Data

It's not a problem if you don't have a dataset handy for your first experiments with PostgreSQL. The database comes with built-in capabilities that let you generate mock data of various complexity. Imagine that we're working with 100 temperature sensors deployed across 10 cities in the United States. Let's see how we can generate a sample dataset without leaving the boundaries of our previously opened psql session.

First, create the sensor table with a few essential columns:

SQL
CREATE TABLE sensor (
    id int PRIMARY KEY,
    name text,
    city text
);

And then use the generate_series function of PostgreSQL to generate records for 100 sensors, placing them randomly across 10 cities in the United States:

SQL
INSERT INTO sensor (id, name, city)
SELECT
    gs.id, -- Setting the sensor ID
    'sensor_' || gs.id, -- Generating a unique name for the sensor
    (ARRAY[
        'New York', 'Los Angeles', 'Chicago', 'Miami', 'Boston',
        'Philadelphia', 'Seattle', 'San Francisco', 'Dallas', 'Atlanta'
    ])[floor(random() * 10) + 1] -- Selecting a random city for the sensor
FROM generate_series(1, 100) AS gs(id); -- Generating IDs for 100 sensors

Lastly, go ahead and take a look at a subset of the generated data:

SQL
select * from sensor order by id limit 5;

The output should be as follows:

Shell
 id |   name   |     city
----+----------+--------------
  1 | sensor_1 | New York
  2 | sensor_2 | Philadelphia
  3 | sensor_3 | Dallas
  4 | sensor_4 | Boston
  5 | sensor_5 | New York
(5 rows)
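If you'd rather work from application code than from psql, the sketch below shows one way to connect to the containerized database from Python and read back the generated rows. It is a minimal sketch that assumes the psycopg 3 driver is installed (pip install psycopg) and reuses the credentials from the docker run command above; adjust the connection string for your own setup.

Python
import psycopg

# Connection settings mirror the docker run command shown earlier;
# change them if your container uses different credentials or ports.
CONN_INFO = "host=localhost port=5432 dbname=postgres user=postgres password=password"

with psycopg.connect(CONN_INFO) as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT id, name, city FROM sensor ORDER BY id LIMIT 5;")
        for sensor_id, name, city in cur.fetchall():
            print(sensor_id, name, city)

The same connection settings can be reused for the application-side snippets later in this guide; only the SQL changes.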
Exploring the Depth and Breadth of PostgreSQL's Capabilities

By definition, PostgreSQL supports all the capabilities you expect from a relational SQL database, including standard DML and DDL statements, ACID transactions, foreign keys, indexes, materialized views, and more. At the same time, the core PostgreSQL capabilities go far beyond what developers usually expect from a relational database. Let's take a look at a few of such capabilities in action.

Modern SQL

Modern SQL is a category of contemporary SQL capabilities that allow you to solve various tasks in SQL in a concise and efficient manner. Markus Winand might have coined the category after launching a dedicated resource on the topic. PostgreSQL supports several modern SQL capabilities, including common table expressions (CTEs), recursive queries, and window functions. Let's take a look at CTEs and window functions in action.

Assume that our sensors continuously send information about the current temperature, and we'd like to monitor the highest temperature reported by every sensor. First, let's create the sensor_measurement table to keep track of the reported temperatures:

SQL
CREATE TABLE sensor_measurement (
    id SERIAL PRIMARY KEY,
    temperature NUMERIC,
    sensor_id INT,
    time TIMESTAMP,
    FOREIGN KEY (sensor_id) REFERENCES sensor(id)
);

Next, use the generate_series function one more time to generate 1000 measurements (10 measurements for each sensor from the sensor table):

SQL
INSERT INTO sensor_measurement (temperature, sensor_id, time)
SELECT
    round((random() * 100)::numeric, 2) AS temperature, -- Random temperature between 0 and 100 degrees Fahrenheit
    s.id AS sensor_id,
    NOW() - (INTERVAL '1 day' * round((random() * 30)::numeric)) AS time -- Random timestamp within the last 30 days
FROM sensor s, generate_series(1, 10) AS gs; -- Inserts 10 measurements for each sensor

Finally, let's find the highest temperature reported by each sensor:

SQL
WITH RankedTemperatures AS (
    SELECT
        sensor_id,
        temperature,
        RANK() OVER (PARTITION BY sensor_id ORDER BY temperature DESC) AS temperature_rank
    FROM sensor_measurement
)
SELECT sensor_id, temperature
FROM RankedTemperatures
WHERE temperature_rank = 1
ORDER BY sensor_id;

The output should be as follows:

Shell
 sensor_id | temperature
-----------+-------------
         1 |       59.87
         2 |       85.76
         3 |       94.99
         4 |       90.09
         5 |       99.40
...truncated

The query uses the RANK() window function to slice and rank the data by a sensor ID (PARTITION BY sensor_id). The ranking is calculated as part of the WITH RankedTemperatures AS common table expression. Then, the query returns the result for temperature_rank = 1, which stores the highest temperature reported by a sensor.
JSON

Even though PostgreSQL is a well-known SQL database, it has supported JSON as a first-class citizen for over 12 years. PostgreSQL comes with specialized data types, operators, and index methods that make it seamless to use the database for workloads typical of document databases. Continuing our example with the sensors, assume that we decided to store each sensor's technical specification as a JSON object in the sensor table and query that data directly using PostgreSQL's JSON capabilities.

First, alter the sensor table by adding the spec column of the JSONB type:

SQL
ALTER TABLE sensor ADD COLUMN spec JSONB;

Next, generate the technical specification for each sensor with the jsonb_build_object function:

SQL
UPDATE sensor
SET spec = jsonb_build_object(
    'size', jsonb_build_object(
        'width', (random() * 10 + 10)::int,  -- Width between 10 and 20 inches
        'height', (random() * 10 + 10)::int, -- Height between 10 and 20 inches
        'depth', (random() * 10 + 10)::int   -- Depth between 10 and 20 inches
    ),
    'weight', (random() * 10 + 1)::numeric(4,2), -- Weight between 1 and 11 pounds
    'max_temperature_range', jsonb_build_object(
        'min', (random() * 10)::int,        -- Min temperature between 0 and 10 F
        'max', (random() * 100 + 100)::int  -- Max temperature between 100 and 200 F
    )
);

Lastly, query the JSON objects directly to find the sensors with a weight over five pounds and a maximum temperature range greater than 150°F:

SQL
SELECT id, name, city, spec
FROM sensor
WHERE (spec->>'weight')::numeric > 5
  AND (spec->'max_temperature_range'->>'max')::numeric > 150;

The output should be as follows:

Shell
 id |   name   |  city   | spec
----+----------+---------+------------------------------------------------------------------------------------------------------
  3 | sensor_3 | Seattle | {"size": {"depth": 10, "width": 19, "height": 16}, "weight": 8.01, "max_temperature_range": {"max": 161, "min": 9}}
  6 | sensor_6 | Boston  | {"size": {"depth": 15, "width": 15, "height": 16}, "weight": 5.86, "max_temperature_range": {"max": 157, "min": 4}}
...truncated

As you can see, the query uses the specialized ->> and -> operators to extract JSON fields at different levels of the JSON structure and then filter the data. Check out the PostgreSQL documentation to learn more about the database's JSON-related capabilities.
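For completeness, here is a minimal sketch of running the same kind of JSONB filter from application code, again assuming the psycopg 3 driver and the connection settings used earlier. The weight and temperature thresholds are passed as query parameters, and jsonb values are deserialized into regular Python dictionaries, so nested fields can be read directly.

Python
import psycopg

CONN_INFO = "host=localhost port=5432 dbname=postgres user=postgres password=password"

with psycopg.connect(CONN_INFO) as conn:
    rows = conn.execute(
        """
        SELECT id, name, spec
        FROM sensor
        WHERE (spec->>'weight')::numeric > %s
          AND (spec->'max_temperature_range'->>'max')::numeric > %s
        """,
        (5, 150),
    ).fetchall()

    for sensor_id, name, spec in rows:
        # spec arrives as a Python dict, so nested JSON fields are directly accessible
        print(sensor_id, name, spec["weight"], spec["max_temperature_range"]["max"])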
Full-Text Search

As a SQL database, PostgreSQL comes with rich capabilities for querying and working with text data. On top of that, the database supports advanced full-text search capabilities that let you analyze large or complex text documents. While traditional text search looks for exact matches of words, full-text search considers linguistic components such as stems, synonyms, and ranking by relevance to provide you with the most applicable search results.

Suppose our temperature sensors occasionally fail for various reasons. The maintenance team tracks these incidents in the sensor_failure table and uses the full-text search capabilities to easily discover similar failures, root causes, and solutions.

First, let's create the sensor_failure table:

SQL
CREATE TABLE sensor_failure (
    id SERIAL PRIMARY KEY,
    sensor_id INT,
    failure_description TEXT,
    FOREIGN KEY (sensor_id) REFERENCES sensor(id)
);

Next, preload a few sample failure reports into the table:

SQL
INSERT INTO sensor_failure (sensor_id, failure_description) VALUES
(1, 'The sensor failed to report data due to a network connectivity issue. After investigation, it was found that a faulty router caused the problem.'),
(2, 'Temperature readings were inconsistent and showed significant fluctuations. The root cause was identified as a loose connection in the wiring.'),
(3, 'The sensor stopped functioning after exposure to extreme weather conditions. It was determined that water ingress damaged the internal components.'),
(4, 'Power supply interruption caused the sensor to reboot multiple times. The issue was traced back to a malfunctioning power adapter.'),
(5, 'Calibration drift resulted in inaccurate humidity measurements. The cause was a worn-out calibration sensor that needed replacement.'),
(6, 'The sensor experienced intermittent data loss due to network issues. A faulty router was identified as the root cause.'),
(7, 'Sensor experienced hardware failure due to a short circuit. It was found that a nearby lightning strike caused the electrical surge.'),
(8, 'Corrosion on the sensor contacts caused intermittent data loss. The root cause was prolonged exposure to a high-humidity environment.'),
(9, 'The sensor failed to communicate with the server due to network problems. A faulty router was discovered to be the issue.'),
(10, 'Firmware update failed, rendering the sensor unresponsive. The failure was due to an incomplete download of the update file.');

Then, assuming we want to perform a full-text search on the incidents, PostgreSQL needs to preprocess the raw text data. The descriptions of the failures need to be parsed into tokens, such as words and phrases, and then converted to more meaningful units of text called lexemes. Add the failure_lexemes column to the table, asking PostgreSQL to use the English language rules for tokenizing and normalizing the text. Also, create a GIN index for the lexemes to expedite the search:

SQL
ALTER TABLE sensor_failure
ADD COLUMN failure_lexemes tsvector
GENERATED ALWAYS AS (to_tsvector('english', failure_description)) STORED;

CREATE INDEX failure_lexemes_idx ON sensor_failure USING GIN (failure_lexemes);

Finally, use the @@ operator to search for incidents related to the "network issue router":

SQL
SELECT sensor_id, failure_description
FROM sensor_failure
WHERE failure_lexemes @@ to_tsquery('english', 'network & issue & router');

The output should be as follows:

Shell
 sensor_id | failure_description
-----------+--------------------------------------------------------------------------------------------------------------------------------------------------
         1 | The sensor failed to report data due to a network connectivity issue. After investigation, it was found that a faulty router caused the problem.
         6 | The sensor experienced intermittent data loss due to network issues. A faulty router was identified as the root cause.
         9 | The sensor failed to communicate with the server due to network problems. A faulty router was discovered to be the issue.
(3 rows)

The search phrase gets normalized to 'network' & 'issue' & 'router'. The order of the words doesn't matter as long as the words have matches in the document.
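In an application, the search string usually arrives as raw user input rather than in the 'a & b & c' form that to_tsquery expects. One option, sketched below under the same psycopg assumptions as the earlier snippets, is to let PostgreSQL's websearch_to_tsquery function turn the raw text into a tsquery.

Python
import psycopg

CONN_INFO = "host=localhost port=5432 dbname=postgres user=postgres password=password"

def find_similar_failures(conn, search_text: str):
    # websearch_to_tsquery accepts a plain search phrase (e.g., "network issue router")
    # and builds the tsquery for us, so users don't need to know the '&' syntax.
    return conn.execute(
        """
        SELECT sensor_id, failure_description
        FROM sensor_failure
        WHERE failure_lexemes @@ websearch_to_tsquery('english', %s)
        """,
        (search_text,),
    ).fetchall()

with psycopg.connect(CONN_INFO) as conn:
    for sensor_id, description in find_similar_failures(conn, "network issue router"):
        print(sensor_id, description)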
Tapping Into PostgreSQL Extensions

While the core PostgreSQL capabilities, such as JSON and full-text search support, make this relational database truly unique, it's the ecosystem of extensions and broader derived solutions that make PostgreSQL one of a kind. Extensibility and pluggability were foundational principles for PostgreSQL. Today, you can find hundreds of extensions and solutions that open up new use cases for PostgreSQL. For instance, there are extensions that let PostgreSQL handle time series and geospatial data easily, function as a message queue, or scale horizontally while tolerating various types of outages.

How about generative AI (GenAI)? PostgreSQL has extensions for apps leveraging large language models (LLMs) and other machine learning models. Let's look deeper at pgvector, the foundational extension for GenAI apps using PostgreSQL.

PostgreSQL for AI

pgvector is the extension that turns PostgreSQL into a vector database. It adds a new data type, operators, and index types to work with vectorized data (embeddings) in the database. This extension is also used as a foundation for other extensions, such as pg_vectorize, pgvectorscale, and PostgresML, which bring additional capabilities for GenAI workloads.

Suppose we want to leverage AI for our sensor incident reports. For instance, if someone asks a question like, "What are the most recent network issues caused by a faulty router?," we want PostgreSQL to easily find the answer.

First, enable the pgvector extension and add the failure_vector column with 3 dimensions:

SQL
CREATE EXTENSION vector;

ALTER TABLE sensor_failure ADD COLUMN failure_vector vector(3);

Next, let's assign random vectors for all incident failures:

SQL
UPDATE sensor_failure
SET failure_vector = ARRAY[
    round((random())::numeric, 2),
    round((random())::numeric, 2),
    round((random())::numeric, 2)
]::vector;

Then, let's set closer vectors for the incidents mentioning 'network & issue & router':

SQL
UPDATE sensor_failure
SET failure_vector = ARRAY[
    0.9 + round((random() * 0.02 - 0.01)::numeric, 2),
    0.8 + round((random() * 0.02 - 0.01)::numeric, 2),
    0.7 + round((random() * 0.02 - 0.01)::numeric, 2)
]::vector
WHERE failure_lexemes @@ to_tsquery('english', 'network & issue & router');

Finally, assuming that the vector for the question "What are the most recent network issues caused by a faulty router?" is [0.9, 0.8, 0.7], we can perform the vector similarity search as follows:

SQL
SELECT sensor_id, failure_description, failure_vector,
       1 - (failure_vector <=> '[0.9, 0.8, 0.7]') AS cosine_similarity
FROM sensor_failure
WHERE 1 - (failure_vector <=> '[0.9, 0.8, 0.7]') > 0.90
ORDER BY cosine_similarity DESC
LIMIT 3;

The output should be as follows:

Shell
 sensor_id | failure_description                                                                                                                               | failure_vector  | cosine_similarity
-----------+---------------------------------------------------------------------------------------------------------------------------------------------------+-----------------+--------------------
         9 | The sensor failed to communicate with the server due to network problems. A faulty router was discovered to be the issue.                        | [0.9,0.8,0.7]   | 1
         1 | The sensor failed to report data due to a network connectivity issue. After investigation, it was found that a faulty router caused the problem. | [0.91,0.81,0.7] | 0.9999870975983235
         6 | The sensor experienced intermittent data loss due to network issues. A faulty router was identified as the root cause.                           | [0.9,0.81,0.69] | 0.9999486655364522

In real life, you'll use an embedding model from providers like OpenAI, Meta, Google, or others to generate embeddings for sensor incident reports and user search prompts. Consequently, the failure_vector column will have hundreds or thousands of dimensions. Check out this step-by-step guide to see how to build GenAI apps using pgvector with an LLM.
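As a rough sketch of how the application side could tie these pieces together, the snippet below assumes a hypothetical embed() helper that calls whichever embedding model you choose and returns a list of floats matching the dimensionality of the failure_vector column; here it simply returns the 3-dimensional toy vector from the example above. The query vector is serialized into pgvector's '[x, y, z]' text form and passed as a parameter.

Python
import psycopg

CONN_INFO = "host=localhost port=5432 dbname=postgres user=postgres password=password"

def embed(question: str) -> list[float]:
    # Placeholder: call your embedding model here (an OpenAI model, a local
    # sentence-transformers model, etc.). The toy value below matches the
    # 3-dimensional failure_vector column used in this guide's example.
    return [0.9, 0.8, 0.7]

def similar_failures(conn, question: str, limit: int = 3):
    # pgvector accepts the '[x, y, z]' text representation, so the Python list
    # is serialized into that form and cast to the vector type inside the query.
    vector_literal = "[" + ", ".join(str(x) for x in embed(question)) + "]"
    return conn.execute(
        """
        SELECT sensor_id, failure_description,
               1 - (failure_vector <=> %s::vector) AS cosine_similarity
        FROM sensor_failure
        ORDER BY cosine_similarity DESC
        LIMIT %s
        """,
        (vector_literal, limit),
    ).fetchall()

with psycopg.connect(CONN_INFO) as conn:
    for row in similar_failures(conn, "What are the most recent network issues caused by a faulty router?"):
        print(row)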
Conclusion

"Just use Postgres" has become a motto of the PostgreSQL community for a good reason. Over the years, the database has gained a broad set of core and extended capabilities that allow it to be used for a variety of use cases far beyond transactional workloads. However, don't be misguided by the motto. The community is not trying to say that PostgreSQL is a Swiss army knife or the only database you need. Instead, "just use Postgres" is a suggestion to check if PostgreSQL can meet the new demands coming from your applications. If it can, then you'll benefit from running a single database in production; if not, you can add a specialized database solution to your stack. The choice is yours!

This is an excerpt from DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.
Moving data from one place to another is conceptually simple. You simply read from one datasource and write to another. However, doing that consistently and safely is another story. There are a variety of mistakes you can make if you overlook important details. We recently discussed the top reasons so many organizations are currently seeking DynamoDB alternatives. Beyond costs (the most frequently mentioned factor), aspects such as throttling, hard limits, and vendor lock-in are frequently cited as motivation for a switch. But what does a migration from DynamoDB to another database look like? Should you dual-write? Are there any available tools to assist you with that? What are the typical do’s and don’ts? In other words, how do you move out from DynamoDB? In this post, let’s start with an overview of how database migrations work, cover specific and important characteristics related to DynamoDB migrations, and then discuss some of the strategies employed to integrate with and migrate data seamlessly to other databases. How Database Migrations Work Most database migrations follow a strict set of steps to get the job done. First, you start capturing all changes made to the source database. This guarantees that any data modifications (or deltas) can be replayed later. Second, you simply copy data over. You read from the source database and write to the destination one. A variation is to export a source database backup and simply side-load it into the destination database. Past the initial data load, the target database will contain most of the records from the source database, except the ones that have changed during the period of time it took for you to complete the previous step. Naturally, the next step is to simply replay all deltas generated by your source database to the destination one. Once that completes, both databases will be fully in sync, and that’s when you may switch your application over. To Dual-Write or Not? If you are familiar with Cassandra migrations, then you have probably been introduced to the recommendation of simply “dual-writing” to get the job done. That is, you would proxy every writer mutation from your source database to also apply the same records to your target database. Unfortunately, not every database implements the concept of allowing a writer to retrieve or manipulate the timestamp of a record like the CQL protocol allows. This prevents you from implementing dual-writes in the application while back-filling the target database with historical data. If you attempt to do that, you will likely end up with an inconsistent migration, where some target Items may not reflect their latest state in your source database. Wait… Does it mean that dual-writing in a migration from DynamoDB is just wrong? Of course not! Consider that your DynamoDB table expires records (TTL) every 24 hours. In that case, it doesn’t make sense to back-fill your database: simply dual-write and, past the TTL period, switch your readers over. If your TTL is longer (say a year), then waiting for it to expire won’t be the most efficient way to move your data over. Back-Filling Historical Data Whether or not you need to back-fill historical data primarily depends on your use case. Yet, we can easily reason around the fact that it typically is a mandatory step in most migrations. There are 3 main ways for you to back-fill historical data from DynamoDB: ETL ETL (extract-transform-load) is essentially what a tool like Apache Spark does. 
It starts with a Table Scan and reads a single page worth of results. The results are then used to infer your source table's schema. Next, it spawns readers to consume from your DynamoDB table as well as writer workers to ingest the retrieved data into the destination database. This approach is great for carrying out simple migrations and also lets you transform (the T in the ETL part) your data as you go. However, it is unfortunately prone to some problems. For example: Schema inference: DynamoDB tables are schemaless, so it's difficult to infer a schema. All table attributes (other than your hash and sort keys) might not be present on the first page of the initial scan. Plus, a given Item might not project all the attributes present within another Item. Cost: Since extracting data requires a DynamoDB table full scan, it will inevitably consume RCUs. This will ultimately drive up migration costs, and it can also introduce an upstream impact to your application if DynamoDB runs out of capacity. Time: The time it takes to migrate the data is proportional to your data set size. This means that if your migration takes longer than 24 hours, you may be unable to directly replay from DynamoDB Streams after, given that this is the period of time that AWS guarantees the availability of its events. Table Scan A table scan, as the name implies, involves retrieving all records from your source DynamoDB table – and only later loading them to your destination database. Unlike the previous ETL approach where both the “Extract” and “Load” pieces are coupled and data gets written as you go, here each step is carried out in a phased way. The good news is that this method is extremely simple to wrap your head around. You run a single command. Once it completes, you’ve got all your data! For example: $ aws dynamodb scan --table-name source > output.json You’ll then end up with a single JSON file containing all existing Items within your source table, which you may then simply iterate through and write to your destination. Unless you are planning to transform your data, you shouldn’t need to worry about the schema (since you already know beforehand that all Key Attributes are present). This method works very well for small to medium-sized tables, but – as with the previous ETL method – it may take considerable time to scan larger tables. And that’s not accounting for the time it will take you to parse it and later load it to the destination. S3 Data Export If you have a large dataset or are concerned with RCU consumption and the impact on live traffic, you might rely on exporting DynamoDB data to Amazon S3. This allows you to easily dump your tables’ entire contents without impacting your DynamoDB table performance. In addition, you can request incremental exports later, in case the back-filling process takes longer than 24 hours. To request a full DynamoDB export to S3, simply run: $ aws dynamodb export-table-to-point-in-time --table-arn arn:aws:dynamodb:REGION:ACCOUNT:table/TABLE_NAME --s3-bucket BUCKET_NAME --s3-prefix PREFIX_NAME --export-format DYNAMODB_JSON The export will then run in the background (assuming the specified S3 bucket exists). 
To check for its completion, run: Plain Text $ aws dynamodb list-exports --table-arn arn:aws:dynamodb:REGION:ACCOUNT:table/source { "ExportSummaries": [ { "ExportArn": "arn:aws:dynamodb:REGION:ACCOUNT:table/TABLE_NAME/export/01706834224965-34599c2a", "ExportStatus": "COMPLETED", "ExportType": "FULL_EXPORT" } ] } Once the process is complete, your source table’s data will be available within the S3 bucket/prefix specified earlier. Inside it, you will find a directory named AWSDynamoDB, under a structure that resembles something like this: Plain Text $ tree AWSDynamoDB/ AWSDynamoDB/ └── 01706834981181-a5d17203 ├── _started ├── data │ ├── 325ukhrlsi7a3lva2hsjsl2bky.json.gz │ ├── 4i4ri4vq2u2vzcwnvdks4ze6ti.json.gz │ ├── aeqr5obfpay27eyb2fnwjayjr4.json.gz │ ├── d7bjx4nl4mywjdldiiqanmh3va.json.gz │ ├── dlxgixwzwi6qdmogrxvztxzfiy.json.gz │ ├── fuukigkeyi6argd27j25mieigm.json.gz │ ├── ja6tteiw3qy7vew4xa2mi6goqa.json.gz │ ├── jirrxupyje47nldxw7da52gnva.json.gz │ ├── jpsxsqb5tyynlehyo6bvqvpfki.json.gz │ ├── mvc3siwzxa7b3jmkxzrif6ohwu.json.gz │ ├── mzpb4kukfa5xfjvl2lselzf4e4.json.gz │ ├── qs4ria6s5m5x3mhv7xraecfydy.json.gz │ ├── u4uno3q3ly3mpmszbnwtzbpaqu.json.gz │ ├── uv5hh5bl4465lbqii2rvygwnq4.json.gz │ ├── vocd5hpbvmzmhhxz446dqsgvja.json.gz │ └── ysowqicdbyzr5mzys7myma3eu4.json.gz ├── manifest-files.json ├── manifest-files.md5 ├── manifest-summary.json └── manifest-summary.md5 2 directories, 21 files So how do you restore from these files? Well… you need to use the DynamoDB Low-level API. Thankfully, you don’t need to dig through its details since AWS provides the LoadS3toDynamoDB sample code as a way to get started. Simply override the DynamoDB connection with the writer logic of your target database, and off you go! Streaming DynamoDB Changes Whether or not you require back-filling data, chances are you want to capture events from DynamoDB to ensure both will get in sync with each other. DynamoDB Streams can be used to capture changes performed in your source DynamoDB table. But how do you consume from its events? DynamoDB Streams Kinesis Adapter AWS provides the DynamoDB Streams Kinesis Adapter to allow you to process events from DynamoDB Streams via the Amazon Kinesis Client Library (such as the kinesis-asl module in Apache Spark). Beyond the historical data migration, simply stream events from DynamoDB to your target database. After that, both datastores should be in sync. Although this approach may introduce a steep learning curve, it is by far the most flexible one. It even lets you consume events from outside the AWS ecosystem (which may be particularly important if you’re switching to a different provider). For more details on this approach, AWS provides a walkthrough on how to consume events from a source DynamoDB table to a destination one. AWS Lambda Lambda functions are simple to get started with, handle all checkpointing logic on their own, and seamlessly integrate with the AWS ecosystem. With this approach, you simply encapsulate your application logic inside a Lambda function. That lets you write events to your destination database without having to deal with the Kinesis API logic, such as check-pointing or the number of shards in a stream. When taking this route, you can load the captured events directly into your target database. Or, if the 24-hour retention limit is a concern, you can simply stream and retain these records in another service, such as Amazon SQS, and replay them later. The latter approach is well beyond the scope of this article. 
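To make the Lambda route a bit more concrete, here is a minimal, illustrative handler sketch in Python. The event shape follows the standard DynamoDB Streams record format, and the write_to_target and delete_from_target helpers are placeholders for whatever client your destination database uses; the sketch also assumes the stream is configured to include new images.

Python
from boto3.dynamodb.types import TypeDeserializer

deserializer = TypeDeserializer()

def to_plain_dict(image):
    # Stream records arrive in DynamoDB JSON ({"S": "..."}, {"N": "..."}, etc.);
    # TypeDeserializer converts them into plain Python values.
    return {key: deserializer.deserialize(value) for key, value in image.items()}

def handler(event, context):
    for record in event["Records"]:
        event_name = record["eventName"]  # INSERT, MODIFY, or REMOVE
        if event_name in ("INSERT", "MODIFY"):
            item = to_plain_dict(record["dynamodb"]["NewImage"])
            write_to_target(item)
        elif event_name == "REMOVE":
            keys = to_plain_dict(record["dynamodb"]["Keys"])
            delete_from_target(keys)

def write_to_target(item):
    # Placeholder: replace with your destination database's upsert call.
    print("upsert", item)

def delete_from_target(keys):
    # Placeholder: replace with your destination database's delete call.
    print("delete", keys)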
For examples of how to get started with Lambda functions, see the AWS documentation. Final Remarks Migrating from one database to another requires careful planning and a thorough understanding of all steps involved during the process. Further complicating the matter, there’s a variety of different ways to accomplish a migration, and each variation brings its own set of trade-offs and benefits. This article provided an in-depth look at how a migration from DynamoDB works, and how it differs from other databases. We also discussed different ways to back-fill historical data and stream changes to another database. Finally, we ran through an end-to-end migration, leveraging AWS tools you probably already know. At this point, you should have all the tools and tactics required to carry out a migration on your own.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures. Modern database practices enhance performance, scalability, and flexibility while ensuring data integrity, consistency, and security. Some key practices include leveraging distributed databases for scalability and reliability, using cloud databases for on-demand scalability and maintenance, and implementing NoSQL databases for handling unstructured data. Additionally, data lakes store vast amounts of raw data for advanced analytics, and in-memory databases speed up data retrieval by storing data in main memory. The advent of artificial intelligence (AI) is rapidly transforming database development and maintenance by automating complex tasks, enhancing efficiency, and ensuring system robustness. This article explores how AI can revolutionize development and maintenance through automation, best practices, and AI technology integration. The article also addresses the data foundation for real-time AI applications, offering insights into database selection and architecture patterns to ensure low latency, resiliency, and high-performance systems. How Generative AI Enables Database Development and Maintenance Tasks Using generative AI (GenAI) for database development can significantly enhance productivity and accuracy by automating key tasks, such as schema design, query generation, and data cleaning. It can generate optimized database structures, assist in writing and optimizing complex queries, and ensure high-quality data with minimal manual intervention. Additionally, AI can monitor performance and suggest tuning adjustments, making database development and maintenance more efficient. Generative AI and Database Development Let's review how GenAI can assist some key database development tasks: Requirement analysis. The components that need additions and modifications for each database change request are documented. Utilizing the document, GenAI can help identify conflicts between change requirements, which will help in efficient planning for implementing change requests across dev, QA, and prod environments. Database design. GenAI can help develop the database design blueprint based on the best practices for normalization, denormalization, or one big table design. The design phase is critical and establishing a robust design based on best practices can prevent costly redesigns in the future. Schema creation and management. GenAI can generate optimized database schemas based on initial requirements, ensuring best practices are followed based on normalization levels and partition and index requirements, thus reducing design time. Packages, procedures, and functions creation. GenAI can help optimize the packages, procedures, and functions based on the volume of data that is processed, idempotency, and data caching requirements. Query writing and optimization. GenAI can assist in writing and optimizing complex SQL queries, reducing errors, and improving execution speed by analyzing data structures based on data access costs and available metadata. Data cleaning and transformation. GenAI can identify and correct anomalies, ensuring high-quality data with minimal manual intervention from database developers. Generative AI and Database Maintenance Database maintenance to ensure efficiency and security is crucial to a database administrator's (DBA) role. 
Here are some ways that GenAI can assist critical database maintenance tasks: Backup and recovery. AI can automate back-up schedules, monitor back-up processes, and predict potential failures. GenAI can generate scripts for recovery scenarios and simulate recovery processes to test their effectiveness. Performance tuning. AI can analyze query performance data, suggest optimizations, and generate indexing strategies based on access paths and cost optimizations. It can also predict query performance issues based on historical data and recommend configuration changes. Security management. AI can identify security vulnerabilities, suggest best practices for permissions and encryption, generate audit reports, monitor unusual activities, and create alerts for potential security breaches. Database monitoring and troubleshooting. AI can provide real-time monitoring, anomaly detection, and predictive analytics. It can also generate detailed diagnostic reports and recommend corrective actions. Patch management and upgrades. AI can recommend optimal patching schedules, generate patch impact analysis reports, and automate patch testing in a sandbox environment before applying them to production. Enterprise RAG for Database Development Retrieval augmented generation (RAG) helps in schema design, query optimization, data modeling, indexing strategies, performance tuning, security practices, and back-up and recovery plans. RAG improves efficiency and effectiveness by retrieving best practices and generating customized, context-aware recommendations and automated solutions. Implementing RAG involves: Building a knowledge base Developing retrieval mechanisms Integrating generation models Establishing a feedback loop To ensure efficient, scalable, and maintainable database systems, RAG aids in avoiding mistakes by recommending proper schema normalization, balanced indexing, efficient transaction management, and externalized configurations. RAG Pipeline When a user query or prompt is input into the RAG system, it first interprets the query to understand what information is being sought. Based on the query, the system searches a vast database or document store for relevant information. This is typically accomplished using vector embeddings, where both the query and the documents are converted into vectors in a high-dimensional space, and similarity measures are used to retrieve the most relevant documents. The retrieved information, along with the original query, is fed into a language model. This model uses both the input query and the context provided by the retrieved documents to generate a more informed, accurate, and relevant response or output. Figure 1. Simple RAG pipeline Vector Databases for RAG Vector databases are tailored for high-dimensional vector operations, making them perfect for similarity searches in AI applications. Non-vector databases, however, manage transactional data and complex queries across structured, semi-structured, and unstructured data formats. The table below outlines the key differences between vector and non-vector databases: Table 1. Vector databases vs. 
non-vector databases

Feature | Vector Databases | Non-Vector Databases
Primary use case | Similarity search, machine learning, AI | Transactional data, structured queries
Data structure | High-dimensional vectors | Structured data (tables), semi-structured data (JSON), unstructured data (documents)
Indexing | Specialized indexes for vector data | Traditional indexes (B-tree, hash)
Storage | Vector embeddings | Rows, documents, key-value pairs
Query types | k-NN (k-nearest neighbors), similarity search | CRUD operations, complex queries (joins, aggregations)
Performance optimization | Optimized for high-dimensional vector operations | Optimized for read/write operations and complex queries
Data retrieval | Nearest neighbor search, approximate nearest neighbor (ANN) search | SQL queries, NoSQL queries

When taking the vector database route, choosing a suitable vector database involves evaluating: data compatibility, performance, scalability, integration capabilities, operational considerations, cost, security, features, community support, and vendor stability. By carefully assessing these aspects, one can select a vector database that meets the application's requirements and supports its growth and performance objectives. Vector Databases for RAG Several vector databases in the industry are commonly used for RAG, each offering unique features to support efficient vector storage, retrieval, and integration with AI workflows: Qdrant and Chroma are powerful vector databases designed to handle high-dimensional vector data, which is essential for modern AI and machine learning tasks. Milvus, an open-source and highly scalable database, supports various vector index types and is used for video/image retrieval and large-scale recommendation systems. Faiss, a library for efficient similarity search, is widely used for large-scale similarity search and AI inference due to its high efficiency and support for various indexing methods. These databases are chosen based on specific use cases, performance requirements, and ecosystem compatibility. Vector Embeddings Vector embeddings can be created for diverse content types, such as data architecture blueprints, database documents, podcasts on vector database selection, and videos on database best practices for use in RAG. A unified, searchable knowledge base can be constructed by converting these varied forms of information into high-dimensional vector representations. This enables efficient and context-aware retrieval of relevant information across different media formats, enhancing the ability to provide precise recommendations, generate optimized solutions, and support comprehensive decision-making processes in database development and maintenance. Figure 2. Vector embeddings Vector Search and Retrieval Vector search and retrieval in RAG involve converting diverse data types (e.g., text, images, audio) into high-dimensional vector embeddings using machine learning models. These embeddings are indexed using techniques like hierarchical navigable small world (HNSW) or ANN to enable efficient similarity searches. When a query is made, it is also converted into a vector embedding and compared against the indexed vectors using distance metrics, such as cosine similarity or Euclidean distance, to retrieve the most relevant data. This retrieved information is then used to augment the generation process, providing context and improving the relevance and accuracy of the generated output. 
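To ground the retrieval step, here is a small, self-contained sketch of the core idea: documents and the query are represented as vectors, and cosine similarity determines which documents are passed to the language model as context. It uses plain NumPy and toy vectors in place of a real embedding model and vector index; the document names and values are purely illustrative.

Python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity = dot product divided by the product of the vector norms
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" standing in for vectors produced by a real embedding model
documents = {
    "schema design best practices": np.array([0.9, 0.1, 0.3]),
    "indexing strategies for OLTP": np.array([0.2, 0.8, 0.5]),
    "backup and recovery runbook": np.array([0.1, 0.3, 0.9]),
}

query_vector = np.array([0.85, 0.15, 0.35])  # embedding of the user's question

# Rank documents by similarity and keep the top matches as context for the LLM
ranked = sorted(
    documents.items(),
    key=lambda item: cosine_similarity(query_vector, item[1]),
    reverse=True,
)
for name, _vector in ranked[:2]:
    print(name)

A production RAG pipeline replaces the dictionary with a vector database and its ANN index, but the ranking logic it performs is conceptually the same.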
Vector search and retrieval are highly effective for applications such as semantic search, where queries are matched to similar content, and recommendation systems, where user preferences are compared to similar items to suggest relevant options. They are also used in content generation, where the most appropriate information is retrieved to enhance the accuracy and context of the generated output. LLMOps for AI-Powered Database Development Large language model operations (LLMOps) for AI-powered database development leverages foundational and fine-tuned models, effective prompt management, and model observability to optimize performance and ensure reliability. These practices enhance the accuracy and efficiency of AI applications, making them well suited for diverse, domain-specific, and robust database development and maintenance tasks. Foundational Models and Fine-Tuned Models Leveraging large, pre-trained GenAI models offers a solid base for developing specialized applications because of their training on diverse datasets. Domain adaptation involves additional training of these foundational models on domain-specific data, increasing their relevance and accuracy in fields such as finance and healthcare. A small language model is designed for computational efficiency, featuring fewer parameters and a smaller architecture compared to large language models (LLMs). Small language models aim to balance performance with resource usage, making them ideal for applications with limited computational power or memory. Fine-tuning these smaller models on specific datasets enhances their performance for particular tasks while maintaining computational efficiency and keeping them up to date. Custom deployment of fine-tuned small language models ensures they operate effectively within existing infrastructure and meet specific business needs. Prompt Management Effective prompt management is crucial for optimizing the performance of LLMs. This includes using various prompt types like zero-shot, single-shot, few-shot, and many-shot and learning to customize responses based on the examples provided. Prompts should be clear, concise, relevant, and specific to enhance output quality. Advanced techniques such as recursive prompts and explicit constraints help ensure consistency and accuracy. Methods like chain of thought (COT) prompts, sentiment directives, and directional stimulus prompting (DSP) guide the model toward more nuanced and context-aware responses. Prompt templating standardizes the approach, ensuring reliable and coherent results across tasks. Template creation involves designing prompts tailored to different analytical tasks, while version control manages updates systematically using tools like Codeberg. Continuous testing and refining of prompt templates further improve the quality and relevance of generated outputs. Model Observability Model observability ensures models function optimally through real-time monitoring, anomaly detection, performance optimization, and proactive maintenance. By enhancing debugging, ensuring transparency, and enabling continuous improvement, model observability improves AI systems' reliability, efficiency, and accountability, reducing operational risks and increasing trust in AI-driven applications. It encompasses synchronous and asynchronous methods to ensure the models function as intended and deliver reliable outputs. 
Generative AI-Enabled Synchronous Observability and AI-Enabled Asynchronous Data Observability Using AI for synchronous and asynchronous data observability in database development and maintenance enhances real-time and historical monitoring capabilities. Synchronous observability provides real-time insights and alerts on database metrics, enabling immediate detection and response to anomalies. Asynchronous observability leverages AI to analyze historical data, identify long-term trends, and predict potential issues, thus facilitating proactive maintenance and deep diagnostics. Together, these approaches ensure robust performance, reliability, and efficiency in database operations. Figure 3. LLMOps for model observability and database development Conclusion Integrating AI into database development and maintenance drives efficiency, accuracy, and scalability by automating tasks and enhancing productivity. In particular: Enterprise RAG, supported by vector databases and LLMOps, further optimizes database management through best practices. Data observability ensures comprehensive monitoring, enabling proactive and real-time responsiveness. Establishing a robust data foundation is crucial for real-time AI applications, ensuring systems meet real-time demands effectively. Integrating generative AI into data architectures and database selections, analytics layer building, data cataloging, data fabric, and data mesh development will increase automation and optimization, leading to more efficient and accurate data analytics. The benefits of leveraging AI in database development and maintenance will allow organizations to continuously improve performance and their database's reliability, thus increasing value and stance in the industry. Additional resources: Getting Started With Vector Databases by Miguel Garcia, DZone Refcard Getting Started With Large Language Models by Tuhin Chattopadhyay, DZone Refcard This is an excerpt from DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.
How do you approach data processing? What aspects are worth special consideration? Find the difference between relational vs non-relational databases to make informed decisions and learn how to choose a database regarding your project needs. What Is a Relational vs Non-Relational Database? That’s obviously the first question to address when choosing a database for your project. Knowing the difference between relational vs non-relational databases helps to be more specific with your requirements and leverage the right solutions. Being in use for decades, databases have gone through lots of changes and advancements. But at the same time, most representatives can be referred to as one or another type. Every team commonly faces the choice between a non-relational and relational database. Let’s cover the major characteristics of each solution to make more informed decisions. And, of course, we’ll start the comparison of relational vs non-relational databases with definitions. Relational databases are used to store data in a structured table-based manner. All the data remains easily accessible, linked, and related to support relations. Non-relational databases work in a completely different way to store semi-structured data. They don’t apply a rigid structure, thus introducing more dynamic schemas for unstructured data processing. Explained as simply as possible, databases are diversified by data structures. Relational solutions focus on predefined schemas to define and manipulate data. In comparison, non-relational ones are known for better flexibility as they can process any type of data without modifying the architecture. The distinct characteristic of a relational database is that it always stores data in tables using rows and columns. Therefore, it supports a straightforward and intuitive way of displaying data. At the same time, it allows teams to form relations based on specific entities. Most relational databases use Structured Query Language; thus, they are often called SQL databases. Non-relational databases are believed to appear as a viable alternative as not all the data can be stored in tabular format. This type embraces all the database types that can’t follow the relational structure and traditional SQL syntax. It doesn’t mean they don’t apply SQL language. What’s more, most of them use both SQL and UnQL (Unstructured Query Language). Therefore this type can also be referred to as NoSQL (not only SQL) databases. If SQL databases fall under the table-based category, NoSQL databases can be divided into several categories. The most common types of NoSQL databases include: Document databases collect, process, and retrieve data as JSON-like documents. Key-value stores arrange data in a key-value format where keys serve as unique identifiers. Graph databases are single-purpose platforms to create and manipulate graphs where data is presented in the form of nodes, edges, and properties. Wide-column stores organize data into flexible columns to be spread across database nodes and multiple servers. It supports varying the column format regardless of the row in the same table. Regarding differences between relational vs non-relational databases, teams have gained the opportunity to find reasonable solutions to their needs. Today’s businesses collect and process a huge amount of data, including dealing with complex queries. Well-outlined project requirements establish the foundation for making informed decisions. 
The main idea is that they need to choose a database that can query data efficiently and support instant outcomes. If the project leverages structured data and follows ACID compliance, relational databases are a good choice. If the data remains unstructured and doesn’t fit the predefined criteria, it’s better to choose a non-relational database. So let’s proceed with other essential details that become decisive for the final choice. Relational vs Non-Relational Database Pros and Cons Discussing the difference between relational and non-relational databases, we’d like to draw attention to the main advantages and disadvantages of these database types. It greatly helps teams to make a choice and select a database compatible with set requirements. The main idea is that it allows them to do comprehensive research and remain business-specific. The database selection might be difficult at first sight but considering more details aims to simplify the final decision. So let’s go with the mentioned types of databases to find their pros and cons. Advantages of Relational Databases ACID Compliance ACID properties differentiate a relational database and bring it to the dominant market position. It embraces all the necessary standards to guarantee the reliability of transactions within a database. Simplicity Due to the predefined schema and simple structure, the relational database is quite a straightforward solution. It doesn’t require lots of architectural efforts as the team uses structured query language. Data Accuracy Compared to other database types, the accuracy of data is higher for relational databases. It focuses on preventing data redundancy as there is no repeated or duplicated information. Security The table-based model makes it easier to restrict access to confidential data and reduces the chances of errors significantly. Disadvantages of Relational Databases Scalability Being vertically scalable, the relational database has a distinct disadvantage: low scalability. Strict consistency requirements restrict horizontal scaling, whereas vertical scaling comes with certain limits and greatly depends on supported hardware. Flexibility Rigid schemas and constraints could become pros and cons at the same time. Though it’s easy to interpret the data and identify the relationships, it remains complex to implement changes to the data structure. Relational databases aren’t suitable for huge or unstructured data. Performance The relational database performance is tightly dependent on the amount of data, the complexity of tables, and their number. Any increase in these areas leads to a time increase in performing queries. Advantages of Non-Relational Databases Horizontal Scaling Handling large datasets became easier with the introduction of non-relational databases. Moreover, horizontal scaling allows a team to accommodate, manage, and store more data while maintaining lower costs. Flexibility With the flexible data schema and non-rigid structure, non-relational databases can combine, process, and store any type of data. It becomes a distinct feature that differentiates it from a relational database that handles only structured data. Non-relational databases apply dynamic schemas for unstructured data. Fast Queries If relational databases can be used for complex queries, queries in non-relational databases remain faster. The main advantage is that it embraces the way to store the data initially optimized for queries. Besides, queries don’t require joins typical for relational database types. 
Easier Maintenance Non-relational databases are simpler and faster to set up and maintain. Some of them allow developers to map the data structure similar to programming languages. Thus it supports faster development time and fewer errors. Disadvantages of Non-Relational Databases Data Integrity Maintaining data integrity greatly depends on building relationships between data elements. Lack of integrity methods in non-relational databases could reduce overall data reliability, accuracy, and completeness. It becomes the developers’ responsibility to complete accurate and error-free data transferring from one stage to another. Consistency Focusing on scalability and performance, the non-relational database opts for consistency issues. It has no required mechanisms to prevent data redundancy and relies on eventual consistency. Thus they aren’t that efficient for handling large amounts of data. Moreover, when database categories vary, achieving all the use cases with one database is hard. Data Analysis In the light of comparing relational vs non-relational databases, the second ones have fewer facilities for data analysis. Besides, it usually requires programming expertise to handle the analysis, even for the simplest query. Also, many of them lack integration with popular BI tools. When To Use Relational vs Non-Relational Databases In the light of comparing relational vs non-relational databases, it’s important to address the common use cases. Learning the good market practices and the experience of others can provide some additional insights on how to choose a database for your project. Obviously, one or another category often suits certain needs and requirements better. The team’s task remains to learn details, referring to the smallest details. At the same time, you won’t find a strict distinction on use cases. Different types of databases were successfully implemented for various types of projects. It’s worth saying that knowing the relational vs non-relational database pros and cons is a must-have there. The informed choice can be supported via the detailed analysis of project specifications and solution availability. So let’s check on some useful advice on where to use relational vs non-relational databases. Use Cases of a Relational Database Highly Structured Data A stable data structure becomes necessary unless the project entails constant changes. It’s a great option to leverage strict, planned, predictable schemas to handle data distributed across different tables. Besides, it increases access to more tools for testing and analyzing data. The organized and specific nature enables easier manipulation and data querying. Secure and Consistent Environment When security and consistency are top priorities, teams need to make the right decisions. Relational databases have become a reasonable solution here. ACID principles support all the necessary functionality to handle data due to the latest compliance regulations. This type is often a choice for healthcare, fintech, enterprises, etc. Support Wide support availability is explained by the amount of time on the market. It’s often faster to find the team with the required expertise, as most relational databases follow similar principles. Also, they are more efficient for integrating data from other systems and using additional tools. The team has more product choices when utilizing these types of databases, including business intelligence tools. 
Use Cases of a Non-Relational Database Large Amounts of Unstructured Data One of the main reasons to apply a non-relational database is that not all data can fit into plain tables. For example, the project needs an efficient tool to accommodate various types of data like videos, articles, or social media content. Much of that data remains unstructured, and a non-relational database also supports horizontal scalability. It helps to cover diversity and bring proper changes if required. Flexible Development Environment Fast accumulation rates are explained by the ability to collect data quickly and easily without its predefinition. The data often remains not restricted to certain formats and can be processed later. For many teams, a non-relational database is a great option, especially when the project requirements aren’t completely clear or they plan on continuous changes or updates. Timing Priorities The fast development environment enables faster and easier product delivery. Less methodical approaches eliminate much of the upfront planning, preparation, and design of non-relational databases. Teams can proceed with immediate development instead. It commonly suits the needs of MVP or some urgent product releases. Thanks to the many different database types on the market, there is always a suitable approach to fulfill project needs. Of course, the database selection varies from project to project. Moreover, some teams find it efficient to combine several databases to cover all the use cases. Popular Databases: The Current Market State The question of how to choose a database can’t be fully addressed without checking the market availability. It’s a fact that database selection is also impacted by the market state and popularity of certain databases. Besides, the successful experience of others can become a good practice to follow. As long as the team defines project specifications, they are ready to proceed with learning more details on available databases on the market. Keeping up with market tendencies allows them to stay up-to-date and increase the efficiency of leveraged solutions. The fast growth of the market has brought a great variety of databases to adopt. At present, the number of available databases has reached more than 300. So, in the same way we can diversify databases by types or functionalities, it’s common practice to rank them by popularity. As we proceed with comparing relational vs non-relational databases, it’s worth saying that representatives of both database types have gained strong positions. Based on the latest Stack Overflow Developer Survey results, let’s look at the most popular databases. Popular Relational Databases MySQL MySQL is one of the most known relational databases. Released back in 1995, it has gained considerable popularity due to its functionality and the approaches it uses. The open-source database has great support and is compatible with most libraries and frameworks. It is suitable for delivering cross-platform solutions, and even though mostly used for SQL queries, it also has NoSQL support if required. PostgreSQL PostgreSQL is another powerful open-source object-relational database first released in 1996. One of its distinctive characteristics is its object-relational model, which extends the traditional rows-and-columns model with objects and user-defined types. PostgreSQL is highly extensible; thus, it suits the needs of large software solutions. There’s no need to recompile the database as developers can write the code in various programming languages. 
SQLite
SQLite is also a relational database management system, released in 2000. Its distinctive difference is that it is a serverless, embedded database: the engine runs inside the application process rather than behind a separate database server, which often makes it fast because there is no client-server round trip. It has bindings to many programming languages and is used in a variety of solutions, including IoT and embedded systems.

Microsoft SQL Server
Microsoft SQL Server is a well-known relational database management system introduced by Microsoft in 1989. It has been greatly improved over the years with many features such as customization, in-memory analytics, and integrations. It also supports different development tools and cloud services; for a long time it ran only on Windows-based servers, although recent versions also run on Linux and in containers.

Popular Non-Relational Databases
MongoDB
MongoDB is classified as a non-relational solution, specifically a document-oriented database, released in 2009. It enables storing different types of data because it uses JSON-like documents. For some workloads it can be faster than relational databases, since data is stored in the shape the application consumes and no joins across normalized tables are required. Data usually remains loosely structured, and MongoDB is suitable for handling massive data sets.

Redis
Redis is a popular in-memory data store that is also used as a key-value database, introduced in 2009. This open-source non-relational solution is built around in-memory data structures and supports extensibility and clustering. It allows teams to store large data sets without a complex structure. Redis is often combined with other data stores, acting as a caching layer.

DynamoDB
DynamoDB is a non-relational database introduced by Amazon in 2012. It is a fully managed cloud service focused on key-value and document data models. High scalability and performance remain the main advantages of choosing this database, as it enables running high-performance apps at any scale.

Thanks to their functionality and first-mover advantage, relational solutions still hold a considerable share of the market. At the same time, the arrival of new contenders pushes every vendor to strengthen existing approaches and keep advancing new solutions.

How To Choose a Database: Relational vs Non-Relational Databases
Gathering all the vital details on the different types of databases is necessary for making a good choice. With well-defined project requirements, the team looks for a database that corresponds to their needs and supports the solution's efficiency. The important thing is that both database types are viable options, and awareness of the major differences greatly helps with the selection.

| Databases | Relational | Non-relational |
|---|---|---|
| Language | Structured Query Language (SQL) | Structured Query Language (SQL), Unstructured Query Language (UnQL) |
| Data schema | Predefined schemas | Dynamic schemas |
| Database categories | Table-based | Document, key-value, graph, and wide-column stores |
| Scalability | Vertical scalability | Horizontal scalability |
| Performance | Low | High |
| Security | High | Less secure |
| Complex queries | Used | Not used |
| Base properties | ACID (atomicity, consistency, isolation, durability) transactions supported | Follows the CAP (consistency, availability, partition tolerance) theorem |
| Online processing | Used for OLTP | Used for OLAP |
| Hierarchical data storage | Not suitable | Best suited |
| Usage | Better for multi-row transactions | Better for unstructured data like documents or JSON |

There isn't a bad choice; it's more about which option meets the requirements better and delivers more value.
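To make the schema row of the table above concrete, here is a minimal, hypothetical sketch of the same customer record modeled twice: once with a predefined relational schema (using JPA annotations) and once as a schema-less document (using a MongoDB-style BSON document). Class and field names are illustrative only.

Java
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

// Relational style: the shape of every row is fixed up front by the mapped schema.
@Entity
public class Customer {
    @Id
    private Long id;
    @Column(nullable = false)
    private String name;
    @Column
    private String email;
    // getters and setters omitted for brevity
}

import org.bson.Document;

// Document style: each record carries its own structure, and fields can differ per document.
public class CustomerDocumentExample {
    public static Document customerDocument() {
        return new Document("_id", 1L)
                .append("name", "Jane Doe")
                .append("email", "jane@example.com")
                // nested, optional data that never required a schema migration
                .append("loyalty", new Document("tier", "gold").append("points", 1250));
    }
}

In the relational version, adding a loyalty program means a schema migration; in the document version, new fields can simply start appearing on new records.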
Considering the aspects mentioned above, we've also decided to focus on the key factors in how to choose a database.

Data Schema
The main difference between non-relational and relational databases remains the data schemas they apply. While relational solutions use predefined schemas and deal with structured data, non-relational ones apply flexible schemas to process unstructured data in various ways. It's important to remember that this factor often explains many of the other differences that come up during database selection.

Data Structure
Structure determines how data is located and accessed. If the team chooses a relational architecture, they proceed with a table-based structure in which the tabular format links and relates records based on common data. Non-relational solutions come in several structures, including key-value, document, graph, and wide-column stores; in other words, they bring alternatives for structuring data that is hard to handle in relational databases.

Scaling
Database selection can also be impacted by how a relational vs non-relational database scales. A relational database scales vertically, meaning that increased load has to be handled on a single, more powerful server. Non-relational solutions have proven more efficient here, as horizontal scaling allows adding more servers and thus handling higher traffic.

Security
It has always been crucial to use well-protected and highly secured solutions. ACID compliance makes relational databases more secure and makes it easier to restrict access to confidential data. Non-relational types of databases are considered less secure, though they are known for great performance and scalability.

Analytics Capabilities
Relational databases are considered more efficient for data analysis and reporting. Most BI tools won't let you query non-relational databases but work great with structured data. Of course, it is important to check each database's current functionality, as many keep introducing new capabilities.

Integration
Another aspect to consider when choosing a relational vs non-relational database is the opportunity to integrate it with other tools and services. Teams always have to check its compatibility with the other tech solutions applied to the project. Integration requirements keep growing in order to support consistency across all business solutions.

Support Consideration
Finally, pay attention to how each candidate is supported, which involves both ongoing database development and its popularity on the market. A lack of support tends to end in unexpected results and, often, failures. Make sure to choose databases that have gained good market share, have strong community support, and meet the project needs.

Obviously, the database selection varies from project to project, but the main thing is that it should correspond to the outlined needs. There won't be a bad choice, as every project can be addressed from different perspectives; the main idea is to choose a database that brings efficiency and meets the project-specific requirements.

Conclusion
An excellent way to compare relational vs non-relational databases is a comprehensive analysis of their core aspects, main pros and cons, and typical use cases. Considering all the details gathered in this article, we can conclude that relational databases are a good choice when teams look for dynamic queries, high security, and cross-platform support.
If scalability, performance, and flexibility remain the main priorities, it is better to opt for non-relational databases.
Managing database connection strings securely for any microservice is critical; we often secure the username and password using environment variables but never factor in masking or hiding the database hostname. For reader and writer database instances, some organizations mandate that the hostname not be disclosed in the configuration, but instead be passed as an environment variable at runtime when the application starts. This article discusses configuring the hostname through environment variables in the properties file.

Database Configurations Through Environment Variables
We would typically configure the default connection string for Spring microservices in the below manner, with the database username and password getting passed as environment variables:

Properties files
server.port=8081
server.servlet.context-path=/api/e-sign/v1
spring.esign.datasource.jdbc-url=jdbc:mysql://localhost:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false
spring.esign.datasource.username=${DB_USER_NAME}
spring.esign.datasource.password=${DB_USER_PASSWORD}
spring.esign.datasource.driver-class-name=com.mysql.cj.jdbc.Driver
spring.esign.datasource.minimumIdle=5
spring.esign.datasource.maxLifetime=120000

If our microservice connects to a secure database with limited access and the database administrator or the infrastructure team does not want the database hostname to appear in the configuration, then we have an issue. Typically, the production database hostname would look something like this:

Properties files
spring.esign.datasource.jdbc-url=jdbc:mysql://prod-db.fabrikam.com:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false
spring.esign.datasource.username=${DB_USER_NAME}
spring.esign.datasource.password=${DB_USER_PASSWORD}

Using @Configuration Class
In this case, the administrator or the cloud infrastructure team wants us to provide the hostname as an environment variable at runtime when the container starts. One of the options is to build and concatenate the connection string in the configuration class as below:

Java
@Configuration
public class DatabaseConfig {

    private final Environment environment;

    public DatabaseConfig(Environment environment) {
        this.environment = environment;
    }

    @Bean
    public DataSource databaseDataSource() {
        String hostForDatabase = environment.getProperty("ESIGN_DB_HOST", "localhost:3306");
        String dbUserName = environment.getProperty("DB_USER_NAME", "user-name");
        String dbUserPassword = environment.getProperty("DB_USER_PASSWORD", "user-password");
        String url = String.format("jdbc:mysql://%s/e-sign?allowPublicKeyRetrieval=true&useSSL=false", hostForDatabase);
        DriverManagerDataSource dataSource = new DriverManagerDataSource();
        dataSource.setDriverClassName("com.mysql.cj.jdbc.Driver");
        dataSource.setUrl(url);
        dataSource.setUsername(dbUserName); // Replace with your actual username
        dataSource.setPassword(dbUserPassword); // Replace with your actual password
        return dataSource;
    }
}

The above approach would work, but we would rather use the application.properties approach, which is easy to use and quite flexible. The properties file allows you to collate all configurations in a centralized manner, making it easier to update and manage. It also improves readability by separating configuration from code, and the DevOps team can update the environment variable values without making code changes.

Environment Variable for Database Hostname
Commonly, we use environment variables for the database username and password and reference them with the corresponding placeholder expressions ${} in the application properties file.
Properties files
spring.esign.datasource.username=${DB_USER_NAME}
spring.esign.datasource.password=${DB_USER_PASSWORD}

However, for the database URL, we need to use the environment variable only for the hostname and not for the whole connection string, as each microservice's connection string would have different parameters. To address this, Spring allows you to place the placeholder expression within the connection string as shown below; this gives flexibility and the ability to stick with the application.properties approach instead of doing it through the database configuration class.

Properties files
spring.esign.datasource.jdbc-url=jdbc:mysql://${ESIGN_DB_HOST}:3306/e-sign?allowPublicKeyRetrieval=true&useSSL=false

Once we have decided on the above approach, and if we need to troubleshoot any issue for whatever reason in lower environments, we can use the ApplicationListener interface to see the resolved URL:

Java
@Component
public class ApplicationReadyLogger implements ApplicationListener<ApplicationReadyEvent> {

    private final Environment environment;

    public ApplicationReadyLogger(Environment environment) {
        this.environment = environment;
    }

    @Override
    public void onApplicationEvent(ApplicationReadyEvent event) {
        String jdbcUrl = environment.getProperty("spring.esign.datasource.jdbc-url");
        System.out.println("Resolved JDBC URL: " + jdbcUrl);
    }
}

If there is an issue with the hostname configuration, it will show as an error when the application starts. However, after the application has started, thanks to the above ApplicationReadyLogger implementation, we can see the resolved database URL in the application logs. Please note that we should not do this in production environments where the infrastructure team wants to maintain secrecy around the database writer hostname. Using the above steps, we can configure the database hostname as an environment variable in the connection string inside the application.properties file.

Conclusion
Using environment variables for database hostnames to connect to data-sensitive databases can enhance security and flexibility and give the cloud infrastructure and DevOps teams more power. Using the placeholder expressions ensures that our configuration remains clear and maintainable.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures. Real-time streaming architectures are designed to ingest, process, and analyze data as it arrives continuously, enabling near real-time decision making and insights. They need to have low latency, handle high-throughput data volumes, and be fault tolerant in the event of failures. Some of the challenges in this area include: Ingestion – ingesting from a wide variety of data sources, formats, and structures at high throughput, even during bursts of high-volume data streams Processing – ensuring exactly-once processing semantics while handling complexities like stateful computations, out-of-order events, and late data arrivals in a scalable and fault-tolerant manner Real-time analytics – achieving low-latency query responses over fresh data that is continuously being ingested and processed from streaming sources, without compromising data completeness or consistency It's hard for a single technology component to be capable of fulfilling all the requirements. That's why real-time streaming architectures are composed of multiple specialized tools that work together. Introduction to Apache Kafka, Flink, and Pinot Let's dive into an overview of Apache Kafka, Flink, and Pinot — the core technologies that power real-time streaming systems. Apache Kafka Apache Kafka is a distributed streaming platform that acts as a central nervous system for real-time data pipelines. At its core, Kafka is built around a publish-subscribe architecture, where producers send records to topics, and consumers subscribe to these topics to process the records. Key components of Kafka's architecture include: Brokers are servers that store data and serve clients. Topics are categories to which records are sent. Partitions are divisions of topics for parallel processing and load balancing. Consumer groups enable multiple consumers to coordinate and process records efficiently. An ideal choice for real-time data processing and event streaming across various industries, Kafka's key features include: High throughput Low latency Fault tolerance Durability Horizontal scalability Apache Flink Apache Flink is an open-source stream processing framework designed to perform stateful computations over unbounded and bounded data streams. Its architecture revolves around a distributed streaming dataflow engine that ensures efficient and fault-tolerant execution of applications. Key features of Flink include: Support for both stream and batch processing Fault tolerance through state snapshots and recovery Event time processing Advanced windowing capabilities Flink integrates with a wide variety of data sources and sinks — sources are the input data streams that Flink processes, while sinks are the destinations where Flink outputs the processed data. Supported Flink sources include message brokers like Apache Kafka, distributed file systems such as HDFS and S3, databases, and other streaming data systems. Similarly, Flink can output data to a wide range of sinks, including relational databases, NoSQL databases, and data lakes. Apache Pinot Apache Pinot is a real-time distributed online analytical processing (OLAP) data store designed for low-latency analytics on large-scale data streams. Pinot's architecture is built to efficiently handle both batch and streaming data, providing instant query responses. 
Pinot excels at serving analytical queries over rapidly changing data ingested from streaming sources like Kafka. It supports a variety of data formats, including JSON, Avro, and Parquet, and provides SQL-like query capabilities through its distributed query engine. Pinot's star-tree index supports fast aggregations, efficient filtering, high-dimensional data, and compression. Integrating Apache Kafka, Flink, and Pinot Here is a high-level overview of how Kafka, Flink, and Pinot work together for real-time insights, complex event processing, and low-latency analytical queries on streaming data: Kafka acts as a distributed streaming platform, ingesting data from various sources in real time. It provides a durable, fault-tolerant, and scalable message queue for streaming data. Flink consumes data streams from Kafka topics. It performs real-time stream processing, transformations, and computations on the incoming data. Flink's powerful stream processing capabilities allow for complex operations like windowed aggregations, stateful computations, and event-time-based processing. The processed data from Flink is then loaded into Pinot. Pinot ingests the data streams, builds real-time and offline datasets, and creates indexes for low-latency analytical queries. It supports a SQL-like query interface and can serve high-throughput and low-latency queries on the real-time and historical data. Figure 1. Kafka, Flink, and Pinot as part of a real-time streaming architecture Let's break this down and dive into the individual components. Kafka Ingestion Kafka offers several methods to ingest data, each with its own advantages. Using the Kafka producer client is the most basic approach. It provides a simple and efficient way to publish records to Kafka topics from various data sources. Developers can leverage the producer client by integrating it into their applications in most programming languages (Java, Python, etc.), supported by the Kafka client library. The producer client handles various tasks, including load balancing by distributing messages across partitions. This ensures message durability by awaiting acknowledgments from Kafka brokers and manages retries for failed send attempts. By leveraging configurations like compression, batch size, and linger time, the Kafka producer client can be optimized for high throughput and low latency, making it an efficient and reliable tool for real-time data ingestion into Kafka. Other options include: Kafka Connect is a scalable and reliable data streaming tool with built-in features like offset management, data transformation, and fault tolerance. It can read data into Kafka with source connectors and write data from Kafka to external systems using sink connectors. Debezium is popular for data ingestion into Kafka with source connectors to capture database changes (inserts, updates, deletes). It publishes changes to Kafka topics for real-time database updates. The Kafka ecosystem also has a rich set of third-party tools for data ingestion. Kafka-Flink Integration Flink provides a Kafka connector that allows it to consume and produce data streams to and from Kafka topics. The connector is a part of the Flink distribution and provides fault tolerance along with exactly-once semantics. The connector consists of two components: KafkaSource allows Flink to consume data streams from one or more Kafka topics. KafkaSink allows Flink to produce data streams to one or more Kafka topics. 
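Before looking at the Flink connector in code, it may help to see the plain Kafka producer client described in the ingestion section above. The following is a minimal sketch using the standard Kafka Java client and reusing the ad-events topic from the examples below; the broker address, key, and payload are illustrative, and the batching, linger, compression, and idempotence settings are the ones called out earlier.

Java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class AdEventProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Throughput and latency tuning: batching, linger, and compression
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 32 * 1024);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        // Durability and exactly-once groundwork: idempotent writes with acks from all replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("ad-events-topic", "ad-42", "{\"event\":\"click\",\"adId\":42}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // the client has already retried failed sends internally
                } else {
                    System.out.printf("Written to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any batched records
    }
}

Kafka Connect or Debezium would replace this hand-written producer when the data source is an external system or a database changelog.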
Here's an example of how to create a KafkaSource in Flink's DataStream API: Java KafkaSource<String> source = KafkaSource.<String>builder() .setBootstrapServers(brokers) .setTopics("ad-events-topic") .setGroupId("ad-events-app") .setStartingOffsets(OffsetsInitializer.earliest()) .setValueOnlyDeserializer(new SimpleStringSchema()) .build(); DataStream<String> stream = env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source"); Note that FlinkKafkaConsumer, based on the legacy SourceFunction API, has been marked as deprecated and removed. The newer data-source-based API, including KafkaSource, provides greater control over aspects like watermark generation, bounded streams (batch processing), and the handling of dynamic Kafka topic partitions. Flink-Pinot Integration There are a couple options for integrating Flink with Pinot to write processed data into Pinot tables. Option 1: Flink to Kafka to Pinot This is a two-step process where you first write data from Flink to Kafka using the KafkaSink component of the Flink Kafka connector. Here is an example: Java DataStream<String> stream = <existing stream>; KafkaSink<String> sink = KafkaSink.<String>builder() .setBootstrapServers(brokers) .setRecordSerializer(KafkaRecordSerializationSchema.builder() .setTopic("ad-events-topic") .setValueSerializationSchema(new SimpleStringSchema()) .build() ) .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE) .build(); stream.sinkTo(sink); As part of the second step, on the Pinot side, you would configure the real-time ingestion support for Kafka that Pinot supports out of the box, which would ingest the data into the Pinot table(s) in real time. This approach decouples Flink and Pinot, allowing you to scale them independently and potentially leverage other Kafka-based systems or applications in your architecture. Option 2: Flink to Pinot (Direct) The other option is to use the Flink SinkFunction that comes as part of the Pinot distribution. This approach simplifies the integration by having a streaming (or batch) Flink application directly write into a designated Pinot database. This method simplifies the pipeline as it eliminates the need for intermediary steps or additional components. It ensures that the processed data is readily available in Pinot for low-latency query and analytics. Best Practices and Considerations Although there are a lot of factors to consider when using Kafka, Flink, and Pinot for real-time streaming solutions, here are some of the common ones. Exactly-Once Semantics Exactly-once semantics guarantee that each record is processed once (and only once), even in the presence of failures or out-of-order delivery. Achieving this behavior requires coordination across the components involved in the streaming pipeline. Use Kafka's idempotence settings to guarantee messages are delivered only once. This includes enabling the enable.idempotence setting on the producer and using the appropriate isolation level on the consumer. Flink's checkpoints and offset tracking ensure that only processed data is persisted, allowing for consistent recovery from failures. Finally, Pinot's upsert functionality and unique record identifiers eliminate duplicates during ingestion, maintaining data integrity in the analytical datasets. Kafka-Pinot Direct Integration vs. Using Flink The choice between integrating Kafka and Pinot directly or using Flink as an intermediate layer depends on your stream processing needs. 
If your requirements involve minimal stream processing, simple data transformations, or lower operational complexity, you can directly integrate Kafka with Pinot using its built-in support for consuming data from Kafka topics and ingesting it into real-time tables. Additionally, you can perform simple transformations or filtering within Pinot during ingestion, eliminating the need for a dedicated stream processing engine. However, if your use case demands complex stream processing operations, such as windowed aggregations, stateful computations, event-time-based processing, or ingestion from multiple data sources, it is recommended to use Flink as an intermediate layer. Flink offers powerful streaming APIs and operators for handling complex scenarios, provides reusable processing logic across applications, and can perform complex extract-transform-load (ETL) operations on streaming data before ingesting it into Pinot. Introducing Flink as an intermediate stream processing layer can be beneficial in scenarios with intricate streaming requirements, but it also adds operational complexity. Scalability and Performance Handling massive data volumes and ensuring real-time responsiveness requires careful consideration of scalability and performance across the entire pipeline. Two of the most discussed aspects include: You can leverage the inherent horizontal scalability of all three components. Add more Kafka brokers to handle data ingestion volumes, have multiple Flink application instances to parallelize processing tasks, and scale out Pinot server nodes to distribute query execution. You can utilize Kafka partitioning effectively by partitioning data based on frequently used query filters to improve query performance in Pinot. Partitioning also benefits Flink's parallel processing by distributing data evenly across worker nodes. Common Use Cases You may be using a solution built on top of a real-time streaming architecture without even realizing it! This section covers a few examples. Real-Time Advertising Modern advertising platforms need to do more than just serve ads — they must handle complex processes like ad auctions, bidding, and real-time decision making. A notable example is Uber's UberEats application, where the ad events processing system had to publish results with minimal latency while ensuring no data loss or duplication. To meet these demands, Uber built a system using Kafka, Flink, and Pinot to process ad event streams in real time. The system relied on Flink jobs communicating via Kafka topics, with end-user data being stored in Pinot (and Apache Hive). Accuracy was maintained through a combination of exactly-once semantics provided by Kafka and Flink, upsert capabilities in Pinot, and unique record identifiers for deduplication and idempotency. User-Facing Analytics User-facing analytics have very strict requirements when it comes to latency and throughput. LinkedIn has extensively adopted Pinot for powering various real-time analytics use cases across the company. Pinot serves as the back end for several user-facing product features, including "Who Viewed My Profile." Pinot enables low-latency queries on massive datasets, allowing LinkedIn to provide highly personalized and up-to-date experiences to its members. 
In addition to user-facing applications, Pinot is also utilized for internal analytics at LinkedIn and powers various internal dashboards and monitoring tools, enabling teams to gain real-time insights into platform performance, user engagement, and other operational metrics. Fraud Detection For fraud detection and risk management scenarios, Kafka can ingest real-time data streams related to transaction data, user activities, and device information. Flink's pipeline can apply techniques like pattern detection, anomaly detection, rule-based fraud detection, and data enrichment. Flink's stateful processing capabilities enable maintaining and updating user- or transaction-level states as data flows through the pipeline. The processed data, including flagged fraudulent activities or risk scores, is then forwarded to Pinot. Risk management teams and fraud analysts can execute ad hoc queries or build interactive dashboards on top of the real-time data in Pinot. This enables identifying high-risk users or transactions, analyzing patterns and trends in fraudulent activities, monitoring real-time fraud metrics and KPIs, and investigating historical data for specific users or transactions flagged as potentially fraudulent. Conclusion Kafka's distributed streaming platform enables high-throughput data ingestion, while Flink's stream processing capabilities allow for complex transformations and stateful computations. Finally, Pinot's real-time OLAP data store facilitates low-latency analytical queries, making the combined solution ideal for use cases requiring real-time decision making and insights. While individual components like Kafka, Flink, and Pinot are very powerful, managing them at scale across cloud and on-premises deployments can be operationally complex. Managed streaming platforms reduce operational overhead and abstract away much of the low-level cluster provisioning, configuration, monitoring, and other operational tasks. They allow resources to be elastically provisioned up or down based on changing workload demands. These platforms also offer integrated tooling for critical functions like monitoring, debugging, and testing streaming applications across all components. To learn more, refer to the official documentation and examples for Apache Kafka, Apache Flink, and Apache Pinot. The communities around these projects also have a wealth of resources, including books, tutorials, and tech talks covering real-world use cases and best practices. Additional resources: Apache Kafka Patterns and Anti-Patterns by Abhishek Gupta, DZone Refcard Apache Kafka Essentials by Sudip Sengupta, DZone Refcard This is an excerpt from DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.Read the Free Report
In today's tech environment, there is a frequent requirement to synchronize applications. This need often arises during technology upgrades, where the goal is to transition a database and its processes from an outdated legacy system to a newer technology. In such scenarios, it's typically required to allow both applications to coexist for a period of time. Sometimes both applications, together with their own databases, must be maintained as masters because dismantling the processes that depend on the legacy one is not viable. Consequently, specific solutions for keeping the two master databases aligned are essential, ensuring that operations on one database are mirrored on the other, and vice versa. In this article, we discuss a real case we dealt with, abstracting away from several technical details and focusing on the decisions that shaped the structure of our solution.

The Scenario
The scenario we dealt with concerned a technology migration of an application upon which almost all the processes of the company depend. One of the main business constraints was that the old application would not be decommissioned at the end of the development, but would continue to coexist with the new one for a long time, allowing for a progressive migration of all the processes to the new version. The consequence was that both databases would become masters and would need to be kept aligned. Here is a list of the main technical constraints that shaped our decision:
- The two databases handle the same dataset but with different schemas: for example, a customer on one database is represented using a different number of tables and columns compared to the other.
- There is no CDC (Change Data Capture) product available for getting the databases in sync.
- The legacy application can synchronize itself only via asynchronous messages.
- If one of the two applications goes down, the other one must still be available.

We approached the solution by making the following decisions:
- We decided to use bi-directional asynchronous message communication, managed at the application level, for exchanging data between the two masters, and to implement the same synchronizing algorithm on both sides.
- Each master publishes an alignment event that carries the whole set of data aligned with the last modification.
- We exploit a vector clock algorithm for processing the events on both sides.

Asynchronous Communication and Common Algorithm
Two Kafka queues have been used for exchanging messages in both directions. The Avro schema has been kept identical on both queues, so the events are also identical in format. This decision allowed us to create an abstraction layer shared by the two masters that is independent of the underlying technologies and depends only on the alignment algorithm and the shared data model used for the events. The main advantages we wanted to focus on are:
- Keeping the alignment module separated from the implementation of the two masters, so its design can be addressed separately from them.
- Allowing each master to work without depending on the other: if one master stops working, the other can continue.
- Relying on an algorithm for everything means depending not on a specific technology but only on the algorithm's implementation, which can be tested with dedicated test suites. In the long run, this results in a stable solution with little susceptibility to errors.

The price to pay is the replication of the algorithm on both applications.
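To give a feel for what such an alignment event might carry, here is a minimal, hypothetical sketch in Java; the real contract in our case was an Avro schema shared by both Kafka topics, so the field names below are purely illustrative.

Java
import java.time.Instant;
import java.util.Map;

// Hypothetical shape of an alignment event; the actual contract was an Avro schema
// shared by both Kafka queues, so these fields are only illustrative.
public record AlignmentEvent(
        String entityId,              // business key of the record being aligned
        Map<String, Object> payload,  // the whole set of data after the last modification
        long clockA,                  // logical clock of master A for this record
        long clockB,                  // logical clock of master B for this record
        Instant modifiedAt            // event timestamp, used only to break ties on conflicts
) { }

The two clock fields are what the vector clock algorithm described in the next section compares, while the timestamp is used only as a tie-breaker for conflicts.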
Establishing Order Among Messages
A pivotal requirement in aligning databases is a mechanism that enables the ordering of messages irrespective of the system in which they were generated. This ordering mechanism is crucial for maintaining the integrity and consistency of data across distributed environments. Two types of ordering exist: total and partial. Total ordering allows for the sequential arrangement of all generated messages, offering a comprehensive view of events across the system. On the other hand, partial ordering facilitates the sequential arrangement of only a subset of messages, providing flexibility in how events are correlated. We evaluated different solutions for achieving order among messages.

Server Clock
Utilizing the server's clock as a basis for ordering can be straightforward but raises questions about which server's clock to use. Each application has its own infrastructure and components. Which components serve as the reference for the clocks? How do you keep them synchronized? In cases of misalignment, determining the course of action becomes crucial, and the ordering can be compromised.

A Dedicated Centralized Logical Clock
A centralized logical clock presents an alternative by providing a singular reference point for time across the system. However, this centralization can introduce bottlenecks and points of failure, making it less ideal for highly distributed or scalable systems.

Distributed Logical Clock
Distributed logical clocks, such as vector clocks, offer a solution that allows for both total and partial ordering without relying on a single point of failure. This approach enables each part of the system to maintain its own clock, with mechanisms in place to update these clocks based on the arrival of new messages or data changes. Vector clocks are particularly suitable for managing the complexities of distributed systems, offering a way to resolve conflicts and synchronize data effectively.

Vector Clocks: How They Work
For each record of the database, each system keeps its own internal logical clock together with the clock of the other database received from the alignment queue. In the following diagram, they are represented by the columns Clock A and Clock B. In the example, Master A modifies a record and increases the value of its own Clock A. Master B receives the record and compares the two clocks. Clock B is 0 and therefore equal, whereas Clock A has been increased; thus, Master B accepts the message and overwrites its own record, aligning it with that of Master A. Next, Master B performs a similar modification on the same record, increasing its own Clock B. Master A will receive the message and, since Clock A is the same, it can accept the message and align the record. There is the possibility of a conflict when a modification is performed concurrently on the same record in both systems. In this particular case, both systems receive an alignment message in which their own clock is lower than the one stored locally at that moment. Although this scenario could be considered rare, we need to define how to resolve a conflict. There could be different solutions: for example, we could decide that in case of conflict one of the two masters always wins, which means it is "more master" than the other. Or, as we decided, use timestamps to define the "last" record, as sketched in the example below.
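To make the comparison rule concrete, here is a minimal sketch of the decision logic described above, including the timestamp tie-break for concurrent updates. Class, field, and method names are hypothetical; in the real system this logic sits in the shared alignment module and updates the stored clocks atomically together with the record.

Java
import java.time.Instant;

public class ClockReconciler {

    public enum Decision { APPLY, IGNORE, CONFLICT_APPLY, CONFLICT_KEEP_LOCAL }

    // State kept alongside every local record: the last clock seen from the other
    // master, our own clock, and the timestamp of the last modification.
    public record RecordClocks(long remoteClock, long localClock, Instant modifiedAt) { }

    // Clocks carried by an incoming alignment event from the other master.
    public record EventClocks(long senderClock, long receiverClockSeenBySender, Instant modifiedAt) { }

    public static Decision decide(RecordClocks stored, EventClocks event) {
        boolean senderAdvanced = event.senderClock() > stored.remoteClock();
        if (!senderAdvanced) {
            return Decision.IGNORE; // stale or duplicate event: this change is already known
        }
        if (event.receiverClockSeenBySender() == stored.localClock()) {
            return Decision.APPLY; // the sender saw our latest version, so its update simply wins
        }
        // Concurrent modification: both masters changed the record without seeing each other.
        // Tie-break on the event timestamp, as described above.
        return event.modifiedAt().isAfter(stored.modifiedAt())
                ? Decision.CONFLICT_APPLY
                : Decision.CONFLICT_KEEP_LOCAL;
    }
}

From Master B's point of view, senderClock is the Clock A carried by the event and localClock is its own Clock B; the roles are simply swapped on the other side, which is what allows both masters to share the same implementation.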
We are aware that using timestamps to define ordering can be very problematic, but the probability of a conflict (i.e., an update on the same data occurring on both systems within a short period of time) was considered very low (under 0.1%). In this scenario, the event timestamp must also be sent in the alignment message.

Conclusions
In this article, we reported our experience in keeping two databases built on different technologies aligned by using an application-level solution. The core of the solution is the use of asynchronous communication together with a solid algorithm that guarantees determinism in the alignment. Such a solution works, even though it requires effort to modify the databases and all the write queries so that the vector clocks are managed atomically, and it also requires duplicating the algorithm on both sides.
Polymorphism, a fundamental concept in object-oriented programming, allows objects of different types to be treated as instances of a common superclass. This flexibility is essential for creating systems that can be easily extended and maintained. While traditional SQL databases in combination with Jakarta Persistence (JPA) can handle polymorphic data, NoSQL databases offer distinct advantages. Unlike SQL databases, which require strict schema definitions, NoSQL databases take a schema-less approach, inherently supporting dynamic and flexible data structures. This flexibility becomes especially appealing when integrated with Jakarta NoSQL, a tool that provides robust support for defining and managing polymorphic fields through custom converters. In many enterprise applications, there is a common need to manage different types of data objects. For example, an e-commerce platform may handle various payment methods such as credit cards, digital wallets, and bank transfers, each with specific attributes. Similarly, asset management systems in large corporations deal with different types of assets like real estate, machinery, and intellectual property, each with unique properties. Healthcare systems must accommodate various data types, from personal information to medical records and test results. NoSQL databases with polymorphic fields can store and manage these diverse data types cohesively, and their schema-less nature makes it easier to adapt to changing requirements than with relational databases. This tutorial will show how to use Jakarta NoSQL to manage polymorphic fields using custom converters. We will include sample code for integrating REST services using Helidon and Oracle NoSQL. This example will demonstrate the practical application of polymorphism in efficiently managing various data types in a schema-less NoSQL database environment.

Using Polymorphism With NoSQL and Jakarta NoSQL
This tutorial will explore the NoSQL and schema-less capabilities in the Java world using Oracle NoSQL, Helidon, and REST APIs. We will create a Machine entity with an engine field that is converted to JSON. Thanks to the natural flexibility of Oracle NoSQL and its schema-less design, this approach works seamlessly. The first step is creating the Helidon project, which you can do with the Helidon Starter. After creating a MicroProfile-compliant project, the next step is to include the Oracle NoSQL driver. In the properties file, since we will run locally, we need two properties: one to define the host connection and the second to define the database name:

Properties files
jnosql.document.database=machines
jnosql.oracle.nosql.host=http://localhost:8080

Also, update the port to use 8181:

Properties files
server.port=8181

The next step is to configure and run Oracle NoSQL. To make it easier, we will use Docker, but it is important to note that Oracle NoSQL also supports cloud deployment on Oracle infrastructure:

Shell docker run -d --name oracle-instance -p 8080:8080 ghcr.io/oracle/nosql:latest-ce

By using Docker, we simplify the setup process and ensure that our Oracle NoSQL instance is running in a controlled environment. This setup provides a practical approach for development and testing purposes, highlighting the flexibility of deploying Oracle NoSQL in different environments, including cloud infrastructure.
After setting up the configuration and database, the next step involves defining the Entity and creating the Converter implementation. In this example, we will demonstrate the seamless integration of Jakarta NoSQL and Jakarta JSON-B, showing how different Jakarta specifications can work together effectively. The initial step is to define the Machine entity, which incorporates a polymorphic Engine field.

Java
@Entity
@JsonbVisibility(FieldAccessStrategy.class)
public class Machine {

    @Id
    private String id;

    @Column
    @Convert(EngineConverter.class)
    private Engine engine;

    @Column
    private String manufacturer;

    @Column
    private int year;

    // Getters and setters
}

Next, we define the Engine class, specifying the types and implementations using JsonbTypeInfo to handle polymorphism:

Java
@JsonbTypeInfo(
        key = "type",
        value = {
                @JsonbSubtype(alias = "gas", type = GasEngine.class),
                @JsonbSubtype(alias = "electric", type = ElectricEngine.class)
        }
)
@JsonbVisibility(FieldAccessStrategy.class)
public abstract class Engine {
    // Common engine attributes
    // Getters and setters
}

The representation produced by the engine converter may vary by provider; it can be stored as a String, a Map<String, Object>, BSON, etc. With the entities in place, the next step is to create the bridge between the application and the database. Eclipse JNoSQL integrates two specifications, Jakarta NoSQL and Jakarta Data, using a single interface to achieve this, and annotations handle the necessary steps. First, we define the MachineRepository interface:

Java
@Repository
public interface MachineRepository extends BasicRepository<Machine, String> {

    @Query("from Machine where engine.type = :type")
    List<Machine> findByType(@Param("type") String type);
}

This repository interface allows us to search elements directly within the JSON data without any issues, and you can extend it further without creating a rigid structure. Next, we define the controller to expose this resource:

Java
@Path("/machines")
@ApplicationScoped
public class MachineResource {

    private static final Logger LOGGER = Logger.getLogger(MachineResource.class.getName());

    public static final Order<Machine> ORDER_MANUFACTURER = Order.by(Sort.asc("manufacturer"));

    private final MachineRepository repository;

    @Inject
    public MachineResource(@Database(DatabaseType.DOCUMENT) MachineRepository repository) {
        this.repository = repository;
    }

    @GET
    public List<Machine> getMachines(@QueryParam("page") @DefaultValue("1") int page,
                                     @QueryParam("page_size") @DefaultValue("10") int pageSize) {
        LOGGER.info("Get machines from page " + page + " with page size " + pageSize);
        Page<Machine> machines = this.repository.findAll(PageRequest.ofPage(page).size(pageSize), ORDER_MANUFACTURER);
        return machines.content();
    }

    @GET
    @Path("gas")
    public List<Machine> getGasMachines() {
        return this.repository.findByType("gas");
    }

    @GET
    @Path("electric")
    public List<Machine> getElectricMachines() {
        return this.repository.findByType("electric");
    }

    @GET
    @Path("{id}")
    public Machine get(@PathParam("id") String id) {
        LOGGER.info("Get machine by id " + id);
        return this.repository.findById(id)
                .orElseThrow(() -> new WebApplicationException("Machine not found with id: " + id, Response.Status.NOT_FOUND));
    }

    @PUT
    public void save(Machine machine) {
        LOGGER.info("Saving a machine " + machine);
        this.repository.save(machine);
    }
}

This controller exposes endpoints for managing machines, including getting machines by type and pagination.
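Two pieces referenced above but not shown are the concrete Engine subclasses behind the gas and electric aliases and the EngineConverter itself. Here is a minimal sketch of what they could look like, assuming the Jakarta NoSQL AttributeConverter contract and a provider that accepts the engine as a JSON string; the horsepower field mirrors the sample payloads used in the curl commands below, and getters, setters, and visibility annotations are omitted for brevity.

Java
import jakarta.json.bind.Jsonb;
import jakarta.json.bind.JsonbBuilder;
import jakarta.nosql.AttributeConverter;

// Concrete Engine implementations referenced by the @JsonbSubtype aliases above.
class GasEngine extends Engine {
    private int horsepower;
    // getters and setters omitted for brevity
}

class ElectricEngine extends Engine {
    private int horsepower;
    // getters and setters omitted for brevity
}

// One possible shape of the converter: serialize the polymorphic Engine to a JSON string
// with JSON-B, which writes the "type" discriminator declared via @JsonbTypeInfo.
class EngineConverter implements AttributeConverter<Engine, String> {

    private static final Jsonb JSONB = JsonbBuilder.create();

    @Override
    public String convertToDatabaseColumn(Engine engine) {
        return engine == null ? null : JSONB.toJson(engine);
    }

    @Override
    public Engine convertToEntityAttribute(String json) {
        return json == null ? null : JSONB.fromJson(json, Engine.class);
    }
}

Because JSON-B writes the type discriminator declared by @JsonbTypeInfo, the stored JSON keeps enough information to rebuild the correct subclass when the document is read back.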
Finally, execute the application and insert data using the following curl commands: Shell curl --location --request PUT 'http://localhost:8181/machines' \ --header 'Content-Type: application/json' \ --data '{ "id": "1", "model": "Thunderbolt V8", "engine": { "type": "gas", "horsepower": 450 }, "manufacturer": "Mustang", "year": 2021, "weight": 1600.0 }' curl --location --request PUT 'http://localhost:8181/machines' \ --header 'Content-Type: application/json' \ --data '{ "id": "2", "model": "Eagle Eye EV", "engine": { "type": "electric", "horsepower": 300 }, "manufacturer": "Tesla", "year": 2022, "weight": 1400.0 }' curl --location --request PUT 'http://localhost:8181/machines' \ --header 'Content-Type: application/json' \ --data '{ "id": "3", "model": "Road Runner GT", "engine": { "type": "gas", "horsepower": 400 }, "manufacturer": "Chevrolet", "year": 2020, "weight": 1700.0 }' curl --location --request PUT 'http://localhost:8181/machines' \ --header 'Content-Type: application/json' \ --data '{ "id": "4", "model": "Solaris X", "engine": { "type": "electric", "horsepower": 350 }, "manufacturer": "Nissan", "year": 2023, "weight": 1350.0 }' curl --location --request PUT 'http://localhost:8181/machines' \ --header 'Content-Type: application/json' \ --data '{ "id": "5", "model": "Fusion Hybrid 2024", "engine": { "type": "electric", "horsepower": 320 }, "manufacturer": "Toyota", "year": 2024, "weight": 1450.0 }' This setup allows you to search by type within the JSON, and pagination is easily implemented. For more on pagination techniques, refer to this article. With some data inserted, you can explore and understand how the search works: Get machines with pagination Get all machines Get gas machines Final Thoughts Integrating polymorphism with NoSQL databases using Jakarta NoSQL and Jakarta JSON-B offers flexibility and efficiency in managing diverse data types. By leveraging NoSQL’s schema-less nature, this approach simplifies development and enhances application adaptability. For the complete example and source code, visit soujava/helidon-oracle-json-types.
Vector databases allow for efficient data storage and retrieval by storing records as points, or vectors, instead of traditional rows and columns. Two popular vector database options are the pgVector extension for PostgreSQL and Amazon OpenSearch Service. This article compares the specifications, strengths, limitations, capabilities, and use cases of pgVector and OpenSearch to help inform decision making when selecting the option best suited to various needs.

Introduction
The rapid advancements in artificial intelligence (AI) and machine learning (ML) have necessitated the development of specialized databases that can efficiently store and retrieve high-dimensional data. Vector databases have emerged as a critical component in this landscape, enabling applications such as recommendation systems, image search, and natural language processing. This article compares two prominent vector database solutions, the pgVector extension for PostgreSQL and Amazon OpenSearch Service, a comparison relevant to technical professionals, database administrators, and AI and ML practitioners.

Technical Background
Vector databases store data as vectors, enabling efficient similarity searches and other vector operations. pgVector extends PostgreSQL's capabilities to handle vectors, while OpenSearch provides a comprehensive solution for storing and indexing vectors and metadata, supporting scalable AI applications.

Problem Statement
Choosing the proper vector database requires understanding the specific requirements, performance characteristics, and integration capabilities of the available options. This article provides a practical and detailed comparison to assist in making an informed decision.

Methodology or Approach
This analysis reviews current practices, case studies, and theoretical models to compare pgVector and OpenSearch comprehensively, highlighting the critical differences in technical specifications, performance, and use cases.

pgVector Extension for PostgreSQL
pgVector is an open-source extension for PostgreSQL that enables storing and querying high-dimensional vectors. It supports various distance calculations and provides functionality for exact and approximate nearest-neighbor searches. Key features include:
- Vector storage: supports vectors with up to 16,000 dimensions
- Indexing: supports indexing of vector data using IVFFlat for up to 2,000 dimensions
- Integration: seamlessly integrates with PostgreSQL, leveraging its ACID compliance and other features

Amazon OpenSearch Service
OpenSearch is an open-source, all-in-one vector database that supports flexible and scalable AI applications. Key features include:
- Scalability: handles large volumes of data with distributed computing capabilities
- Indexing: supports various indexing methods, including HNSW and IVFFlat
- Advanced features: provides full-text search, security, and anomaly detection
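To ground the pgVector side of the comparison before diving into the numbers, here is a minimal sketch of creating a vector column and running a nearest-neighbor query through plain JDBC. The connection details, table name, and three-dimensional toy vectors are illustrative only; the sketch assumes the PostgreSQL JDBC driver is on the classpath and the pgvector extension is available in the target database.

Java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PgVectorQuickstart {

    public static void main(String[] args) throws Exception {
        // Connection details are illustrative; any PostgreSQL instance with pgvector installed will do.
        String url = "jdbc:postgresql://localhost:5432/postgres";
        try (Connection conn = DriverManager.getConnection(url, "postgres", "password");
             Statement stmt = conn.createStatement()) {

            stmt.execute("CREATE EXTENSION IF NOT EXISTS vector");
            stmt.execute("CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))");
            stmt.execute("INSERT INTO items (embedding) VALUES ('[1,2,3]'), ('[4,5,6]'), ('[7,8,9]')");

            // Nearest-neighbor search using the L2 distance operator (<->)
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, embedding <-> ?::vector AS distance FROM items ORDER BY distance LIMIT 2")) {
                ps.setString(1, "[2,3,4]");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.printf("id=%d distance=%.3f%n", rs.getLong("id"), rs.getDouble("distance"));
                    }
                }
            }
        }
    }
}

The <-> operator computes L2 distance; pgvector also provides <#> for negative inner product and <=> for cosine distance, matching the distance metrics listed in the comparison below.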
Comparative Analysis
Technical Specifications

| Capability | pgVector (PostgreSQL extension) | Amazon OpenSearch |
|---|---|---|
| Max vector dimensions | Up to 16,000 | Up to 16,000 (varies by indexing method) |
| Distance metrics | L2, inner product, cosine | L1, L2, inner product, cosine, L-infinity |
| Database type | Relational | NoSQL |
| Performance | Optimized for vector operations | Variable; may not match pgVector for intensive vector operations |
| Memory utilization | High control over memory settings | Limited granularity |
| CPU utilization | More efficient | Higher CPU utilization |
| Fault tolerance and recovery | PostgreSQL mechanisms | Automated backups and recovery |
| Security | PostgreSQL features | Advanced security features |
| Distributed computing capabilities | Limited | Built for distributed computing |
| GPU acceleration | Supported via libraries | Supported by FAISS and NMSLIB |
| Cost | Free (open-source PostgreSQL extension) | AWS infrastructure costs |
| Integration with other tools | PostgreSQL extensions and tools | AWS services and tools |

Performance
pgVector is designed to optimize vector operations, offering several tuning options for performance improvement. In contrast, OpenSearch's performance can vary, particularly with complex queries or large data volumes.

Strengths and Limitations
pgVector Strengths
- Open source and free
- Seamless integration with PostgreSQL
- Efficient handling of high-dimensional vectors
- Detailed tuning options for performance optimization

pgVector Limitations
- Requires knowledge of PostgreSQL and SQL
- Limited to vector indexing
- Scalability depends on the PostgreSQL setup

OpenSearch Strengths
- Highly scalable with distributed computing
- Versatile data type support
- Advanced features, including full-text search and security
- Integration with AWS services

OpenSearch Limitations
- Steeper learning curve
- Variable performance for high-dimensional vectors
- Higher latency for complex queries

Use Cases
pgVector Use Cases
- E-commerce: recommendation systems and similarity searches
- Healthcare: semantic search for medical records and genomics research
- Finance: anomaly detection and fraud detection
- Biotechnology and genomics: handling complex genetic data
- Multimedia analysis: similarity search for images, videos, and audio files

OpenSearch Use Cases
- Marketing: customer behavior analysis
- Cybersecurity: anomaly detection in network events
- Supply chain management: inventory management
- Healthcare: patient data analysis and predictive modeling
- Telecommunications: network performance monitoring
- Retail: recommendation engines and inventory management
- Semantic search: contextually relevant search results
- Multimedia analysis: reverse image search and video recommendation systems
- Audio search: music recommendation systems and audio-based content discovery
- Geospatial search: optimized routing and property suggestions

Conclusion: Future Trends and Developments
The field of vector databases is rapidly evolving, driven by the increasing demand for efficient storage and retrieval of high-dimensional data in AI and ML applications. Future developments may include improved scalability, enhanced performance, and new features to support advanced use cases. Understanding these trends can help you make informed decisions and plan for the future.
Editor's Note: The following is an article written for and published in DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures. Time series data has become an essential part of data collection in various fields due to its ability to capture trends, patterns, and anomalies. Through continuous or periodic observation, organizations are able to track how key metrics are changing over time. This simple abstraction powers a broad range of use cases. The widespread adoption of time series data stems from its versatility and applicability across numerous domains. For example:
- Financial institutions analyze market trends and predict future movements.
- IoT devices continuously generate time-stamped data to monitor the telemetry of everything from industrial equipment to home appliances.
- IT infrastructure relies on temporal data to track system performance, detect issues, and ensure optimal operation.

As the volume and velocity of time series data have surged, traditional databases have struggled to keep pace with the unique demands placed by such workloads. This has led to the development of specialized databases, known as time series databases (TSDBs). TSDBs are purpose built to handle the specific needs of ingesting, storing, and querying temporal data.

Core Features and Advantages of Time Series Databases
TSDBs combine efficient data ingestion and storage capabilities with optimized querying and analytics to manage large volumes of real-time data.

Data Ingestion and Storage
TSDBs utilize a number of optimizations to ensure scalable and performant loading of high-volume data. Several of these optimizations stand out as key differentiators:

Table 1. Ingestion and storage optimizations

| Feature | Description | Expected Impact |
|---|---|---|
| Advanced compression | Columnar compression techniques such as delta, dictionary, run-length, and LZ array-based encodings | Dramatically reduces the amount of data that needs to be stored on disk and, consequently, scanned at query time |
| Data aggregation and downsampling | Creation of summaries over specified intervals | Reduces data volumes without a significant loss in information |
| High-volume write optimization | A suite of features such as append-only logs, parallel ingestion, and an asynchronous write path | Ensures that there are no bottlenecks in the write path and that data can continuously arrive and be processed, with these features working together |

Optimized Querying and Analytics
To ensure fast data retrieval at query time, several optimizations are essential. These include specialized time-based indexing, time-based sharding/partitioning, and precomputed aggregates. These techniques take advantage of the time-based, sequential nature of the data to minimize the amount of data scanned and reduce the computation required during queries. An overview of these techniques is highlighted below.

Indexing
Various indexing strategies are employed across TSDBs to optimize data retrieval. Some TSDBs use an adapted form of the inverted index, which allows for rapid indexing into relevant time series by mapping metrics or series names to their locations within the dataset. Others implement hierarchical structures, such as trees, to efficiently index time ranges, enabling quick access to specific time intervals. Additionally, some TSDBs utilize hash-based indexing to distribute data evenly and ensure fast lookups, while others may employ bitmap indexing for compact storage and swift access. These diverse strategies enhance the performance of TSDBs, making them capable of handling large volumes of time-stamped data with speed and precision.
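As a toy illustration of the tree-based, time-range indexing idea described above, the following sketch keeps an in-memory map from segment start times to segment identifiers and answers range lookups by pruning everything outside the queried window. All names are hypothetical, and real TSDBs use far more sophisticated, persistent structures; the point is only to show how time ordering keeps range queries from scanning irrelevant data.

Java
import java.time.Instant;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class TimeRangeIndex {

    // Maps the start timestamp of each stored segment to its identifier.
    private final NavigableMap<Instant, String> segmentsByStartTime = new TreeMap<>();

    public void registerSegment(Instant segmentStart, String segmentId) {
        segmentsByStartTime.put(segmentStart, segmentId);
    }

    // Returns only the segments whose time range can overlap [from, to),
    // so data outside the query window is never scanned.
    public List<String> segmentsFor(Instant from, Instant to) {
        Instant firstRelevant = segmentsByStartTime.floorKey(from); // segment already open at 'from'
        Instant lowerBound = firstRelevant != null ? firstRelevant : from;
        return List.copyOf(segmentsByStartTime.subMap(lowerBound, true, to, false).values());
    }
}

A query for the last hour, for example, touches only the handful of segments whose start times fall at or just before that window.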
Partitioning
Partitioning consists of separating logical units of time into separate structures so that they can be accessed independently.

Figure 1. Data partitioning to reduce data scan volume

Pre-Computed Aggregates
A simplified version of pre-computation is shown below. In practice, advanced statistical structures (e.g., sketches) may be used so that more complex calculations (e.g., percentiles) can be performed over the segments.

Figure 2. Visualizing pre-computation of aggregates

Scalability and Performance
Several tactics and features ensure TSDBs remain reliable and performant as data velocity and volume increase. These are summarized in the table below:

Table 2. Scalability tactics and features

| Feature | Description | Expected Impact |
|---|---|---|
| Distributed architecture | Provides seamless horizontal scaling | Allows for transparently increasing the amount of processing power available to both producing and consuming applications |
| Partitioning and sharding | Allows data to be isolated to distributed processing units | Ensures that both write and read workloads can fully utilize the distributed cluster |
| Automated data management | Enables data to move through different tiers of storage automatically based on its temporal relevance | Guarantees that the most frequently used data is automatically stored in the fastest access path, while less used data has retention policies automatically applied |

Time Series Databases vs. Time Series in OLAP Engines
Due to the ubiquity of time series data within businesses, many databases have co-opted the features of TSDBs in order to provide at least some baseline of the capabilities that a specialized TSDB would offer, and in some cases this may satisfy the use cases of a particular organization. However, outlined below are some key considerations and differentiating features to evaluate when choosing whether an existing OLAP store or a time-series-optimized platform best fits a given problem.

Key Considerations
An organization's specific requirements will drive which approach makes the most sense. Understanding the three topics below will provide the necessary context for an organization to determine whether bringing in a TSDB can provide a high return on investment.

Data Volume and Ingestion Velocity
TSDBs are designed to handle large volumes of continuously arriving data, and they may be a better fit in cases where loading volumes are high and the business requires low latency from event generation to insight.

Typical Query Patterns
It is important to consider whether the typical queries fetch specific time ranges of data, aggregate over time ranges, perform real-time analytics, or frequently downsample. If they do, the benefits of a TSDB will be worth introducing a new data framework into the ecosystem.

Existing Infrastructure and Process
When considering introducing a TSDB into an analytic environment, it is worthwhile to first survey the existing tooling, since many query engines now support a subset of temporal features. Determine where any functionality gaps exist within the existing toolset and use that as a starting point for assessing whether a specialized back end such as a TSDB is a good fit.

Differentiating Features
There are many differences in implementation, and the specific feature differences will vary depending on the platforms being considered.
Time Series Databases vs. Time Series in OLAP Engines

Due to the ubiquity of time series data within businesses, many databases have co-opted features of TSDBs in order to provide at least some baseline of the capabilities that a specialized TSDB would offer. In some cases, this may satisfy the needs of a particular organization. However, outlined below are some key considerations and differentiating features to evaluate when choosing whether an existing OLAP store or a time-series-optimized platform best fits a given problem.

Key Considerations

An organization's specific requirements will drive which approach makes the most sense. Understanding the three topics below will provide the necessary context for determining whether bringing in a TSDB can deliver a high return on investment.

Data Volume and Ingestion Velocity

TSDBs are designed to handle large volumes of continuously arriving data, and they may be a better fit where loading volumes are high and the business requires low latency from event generation to insight.

Typical Query Patterns

It is important to consider whether the typical queries fetch specific time ranges of data, aggregate over time ranges, perform real-time analytics, or frequently downsample. If they do, the benefits of a TSDB will be worth introducing a new data framework into the ecosystem.

Existing Infrastructure and Process

When considering introducing a TSDB into an analytic environment, it is worthwhile to first survey the existing tooling, since many query engines now support a subset of temporal features. Determine where functionality gaps exist within the existing toolset and use that as a starting point for assessing whether a specialized back end such as a TSDB is a good fit.

Differentiating Features

There are many differences in implementation, and the specific feature differences will vary depending on the platforms being considered. Generally, however, two feature sets are emphasized broadly in TSDBs: time-based indexing and data management constructs. This emphasis stems from the fact that both feature sets are tightly coupled with time-based abstractions. Use of a TSDB will be most successful when these features can be fully leveraged.

Time-Based Indexing

Efficient data access is achieved through constructs that leverage the sequential nature of time series data, allowing for fast retrieval while maintaining low ingest latency. This critical feature allows TSDBs to excel in use cases where traditional databases struggle to scale effectively.

Data Management Constructs

Time-based retention policies, efficient compression, and downsampling simplify the administration of large datasets by reducing the manual work required to manage time series data. These specialized primitives are purposefully designed for managing and analyzing time series data, and they include functionality that traditional databases typically lack.

Use Cases of Time Series Databases in Analytics

Time series data is used across virtually all industries. Furthermore, emerging trends such as edge computing are putting the power of real-time time series analytics as close to the source of data generation as possible, reducing the time to insight and removing the need for continuous connectivity to centralized platforms. This opens up a host of applications that were previously difficult or impossible to implement. A few curated use cases are described below to demonstrate the value that can be derived from effectively leveraging temporal data.

Telemetry Analysis and Anomaly Detection

One of the most common use cases for TSDBs is observation of and analytics on real-time metrics. These metrics come from a variety of sources; a few of the most prominent are described below, followed by a short query sketch.

IT and Infrastructure Monitoring

TSDBs enable real-time monitoring of servers, networks, and application performance, allowing for immediate detection of and response to issues. This real-time capability supports performance optimization by identifying bottlenecks, determining capacity needs, and detecting security intrusions. Additionally, TSDBs enhance alerting by identifying anomalous patterns and breaches of predefined thresholds, proactively informing staff before potential problems escalate. They also support custom dashboards and visualizations for quick and effective data interpretation, making them an invaluable tool for modern IT operations.

IoT and Sensor Data

TSDBs are vital for telemetry analysis and anomaly detection in IoT and sensor data applications, particularly when aligned with edge computing. They efficiently handle the large volumes of temporal data generated by IoT devices and sensors, enabling real-time monitoring and analysis at the edge of the network. This proximity allows for immediate detection of anomalies, such as irregular patterns or deviations from expected behavior, which is crucial for maintaining the health and performance of IoT systems. By processing data locally, TSDBs reduce latency and bandwidth usage, enhancing the responsiveness and reliability of IoT operations.

Smart Cities and Utilities

Extreme weather and the need for rapid response have driven growth in the use of temporal data within city and utility infrastructures, where quickly detecting deviations from normal operations can have a significant impact. TSDBs enable this through their ability to ingest large volumes of data quickly and to provide highly performant real-time analytics natively. For instance, real-time detection can mean the difference between high winds causing live-wire breakages that increase fire risk and an automated shutdown that significantly reduces those risks. Furthermore, better information about energy generation and demand can be used to improve the efficiency of such systems by ensuring that supply and demand are appropriately matched. This is particularly important during periods of heavy strain on the energy grid, such as unusual heat or cold, when effective operation can save lives.
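As a rough sketch of the threshold-based anomaly detection described in these monitoring scenarios, the query below flags readings that deviate sharply from each sensor's recent rolling average. It reuses the hypothetical readings table from the earlier indexing example, and the one-hour window and 20% deviation threshold are arbitrary illustrative choices.

SQL
-- Compare each reading against its sensor's rolling average over the prior hour
WITH recent AS (
    SELECT sensor_id,
           recorded_at,
           temperature,
           avg(temperature) OVER (
               PARTITION BY sensor_id
               ORDER BY recorded_at
               RANGE BETWEEN interval '1 hour' PRECEDING AND CURRENT ROW
           ) AS rolling_avg
    FROM readings
    WHERE recorded_at >= now() - interval '1 day'
)
-- Flag readings that deviate from the rolling average by more than 20%
SELECT sensor_id, recorded_at, temperature, rolling_avg
FROM recent
WHERE abs(temperature - rolling_avg) > 0.2 * abs(rolling_avg);

In a TSDB, a check like this is usually evaluated continuously against incoming data and wired into the alerting subsystem rather than run as an ad hoc query.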
Trend Analysis

The usefulness of TSDBs is not limited to real-time analytics; they are also used for long-term trend analysis and often provide the most value when identifying real-time deviations from longer-term trends. The optimizations mentioned above, such as pre-computation and partitioning, allow TSDBs to maintain high performance even as data volumes grow dramatically; a brief downsampling sketch follows the use cases below.

Financial Analytics

In the realm of financial analytics, TSDBs are indispensable for trend analysis. Analysts can identify patterns and trends over time, helping to forecast market movements and inform investment strategies. The ability to process and analyze this data in real time allows for timely decision making, reducing the risk of losses and capitalizing on market opportunities. Additionally, TSDBs support the integration of various data sources, providing a comprehensive view of financial markets and enhancing the accuracy of trend analysis.

Healthcare and Biometric Data

Medical devices and wearables generate vast amounts of time-stamped data, including heart rates, glucose levels, and activity patterns. TSDBs facilitate the storage and real-time analysis of this data, allowing healthcare providers to monitor patients continuously and promptly detect deviations from normal health parameters. Trend analysis using TSDBs can also help in predicting the onset of diseases, monitoring the effectiveness of treatments, and tailoring personalized healthcare plans. This proactive approach not only improves patient outcomes but also enhances the efficiency of healthcare delivery.

Industrial Predictive Maintenance

Industries deploy numerous sensors on equipment to monitor parameters such as vibration, temperature, and pressure. By collecting and analyzing this time-stamped data, TSDBs enable the identification of patterns that indicate potential equipment failures. This trend analysis allows maintenance teams to predict when machinery is likely to fail and schedule timely maintenance, thereby preventing costly unplanned downtime. Moreover, TSDBs support the optimization of maintenance schedules based on actual usage and performance data, enhancing overall operational efficiency and extending the lifespan of industrial equipment.
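To illustrate the downsampling that underpins this kind of long-term trend analysis, here is a minimal sketch that rolls the hypothetical metrics table from the earlier partitioning example into daily summaries and compares recent days with the same days a year earlier. The table, column names, and intervals are illustrative assumptions.

SQL
-- Downsample raw readings into one row per sensor per day
CREATE TABLE metrics_daily AS
SELECT sensor_id,
       date_trunc('day', recorded_at) AS day,
       avg(value) AS avg_value,
       min(value) AS min_value,
       max(value) AS max_value
FROM metrics
GROUP BY sensor_id, date_trunc('day', recorded_at);

-- Year-over-year comparison runs against the compact daily rollup, not the raw data
SELECT cur.sensor_id,
       cur.day,
       cur.avg_value                  AS current_avg,
       prev.avg_value                 AS last_year_avg,
       cur.avg_value - prev.avg_value AS change
FROM metrics_daily cur
JOIN metrics_daily prev
  ON prev.sensor_id = cur.sensor_id
 AND prev.day = cur.day - interval '1 year'
WHERE cur.day >= now() - interval '30 days';

In a TSDB, this downsampling would typically be expressed as a built-in, continuously running policy rather than a one-off statement.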
Conclusion

Time series databases offer tools that simplify working with temporal data, enabling businesses to improve operational efficiency, predict failures, and enhance security. The expanding capabilities of TSDBs highlight the value of real-time analytics and edge processing. Features like time-based partitioning, fast ingestion, and automated data retention, now also found in traditional databases, encourage TSDB adoption by allowing proofs of concept on existing infrastructure. Such experiments demonstrate where investing in TSDBs can yield significant benefits, pushing the boundaries of temporal data management and optimizing analytics ecosystems.

Integration with machine learning and AI for advanced analytics, enhanced scalability, and the adoption of cloud-native solutions for flexibility are driving forces behind future adoption. TSDBs will support edge computing and IoT for real-time processing, strengthen security and compliance, and improve data retention management. Interoperability with other tools and support for open standards will create a cohesive data ecosystem, while real-time analytics and advanced visualization tools will enhance data interpretation and decision making. Together, these factors will ensure that TSDBs remain an essential piece of data infrastructure for years to come.

This is an excerpt from DZone's 2024 Trend Report, Database Systems: Modernization for Data-Driven Architectures.