Lockfile (*.wal) is not cleaned up right after connection.close #10002

Open

StefanJannCH opened this issue Dec 15, 2023 · 17 comments

Comments

@StefanJannCH

What happens?

When using the Java API (duckdb_jdbc 0.9.2) on Windows 11, the lock file (*.wal) is not cleaned up by a connection.close call. This leads to an error if another connection is created (java.sql.SQLException: IO Error: Cannot open file ...).

To Reproduce

Open a connection, then close it. The WAL file is still there. Open a connection again from another thread: error!
Sometimes, when the GC is run one or more times, the WAL file goes away.

OS:

Windows 11

DuckDB Version:

0.9.2

DuckDB Client:

Java

Full Name:

Stefan Jann

Affiliation:

none

Have you tried this on the latest main branch?

I have not tested with any build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@szarnyasg
Collaborator

@StefanJannCH thanks for raising this issue. Please

  1. provide a reproducible example
  2. check that all objects related to the database (e.g., ResultSet, PreparedStatement) are closed. The database is only cleaned up when every such object has been closed.


This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Mar 18, 2024

github-actions bot commented Jul 2, 2024

This issue was closed because it has been stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Jul 2, 2024
@AyushSingal22

AyushSingal22 commented Nov 19, 2024

I am facing the same issue. I am creating a cron job that creates a new DB.
The databases are being created, but the .wal [lock] file is not removed, which makes the database inaccessible to other processes.
I am sharing the code.
dailyWorker.js

const cron = require("node-cron");
const uuid = require('uuid');
const {deleteFile , fileExists , makeDir, copyFile} = require("./utils");


let lastDb = "";
const DATA_TEMP_STORE_PATH = "datastore/temp/";
const DATA_STORE_PATH = "datastore/store/";

function getDataFilePath(name) {
    return DATA_TEMP_STORE_PATH + name;
}

makeDir(DATA_STORE_PATH);
makeDir(DATA_TEMP_STORE_PATH);

cron.schedule("*/10 * * * * *", async function() {
    const creator = require("./dbCreator");
    console.log("Running db creator: " + new Date());
    let dbName = uuid.v4();
    let dbFile = getDataFilePath(dbName)
    await creator(dbFile);
    await copyFile(dbFile, DATA_STORE_PATH + dbName);
    lastDb = dbFile;
});

dbCreator.js

const creator = async (dbName) => {
    const duckdb = require("duckdb");
    const db = new duckdb.Database(dbName);

    async function f(val) {
        const con = db.connect();
        con.run('CREATE TABLE a (i INTEGER)');
        const stmt = con.prepare('INSERT INTO a VALUES (?)');
        for (let i = 0; i < 10000000; i++) {
            stmt.run(val);
        }
        stmt.finalize();
        con.close();
    };
    await f(1);
    db.close();
}
module.exports = creator;

@freakynit

freakynit commented Dec 26, 2024

+1...

This happens during a quick sequence of operations involving dropping a table and re-creating it with the same name. It makes the database completely inaccessible.

I'm using it from Node.js: @duckdb/node-api.

Edit:

Same issue with the WAL file not getting cleaned up.
OS: Apple M1 Sonoma 14.5
Node: 20.11.1
Driver: 1.1.3-alpha.7

@carlopi carlopi reopened this Dec 26, 2024
@carlopi
Contributor

carlopi commented Dec 26, 2024

This looks to be connected to node-api, possibly a misuse of the C API; we'll take a look. Unsure if this is best placed here or in https://github.com/duckdb/duckdb-node-neo

@freakynit

@carlopi I'll tag this issue on the duckdb-node repo too.

Thanks..

@carlopi
Contributor

carlopi commented Dec 26, 2024

@freakynit: could you possibly share an end-to-end reproduction?

@freakynit

@carlopi Will try to get reproducible code and share..

@freakynit

@carlopi Unable to reproduce with test code... will share relevant log if I encounter the issue again.

@carlopi
Contributor

carlopi commented Dec 26, 2024

To get started, even a more complete description of what's happening would help in diagnosing this.

Note that this might be something that needs fixing in the code itself, by improving error messages to provide a way out, or by documenting the expected usage of the API better.

@freakynit

freakynit commented Dec 26, 2024

@carlopi Got it.. let me share relevant log lines and existing code (will expand in a further comment):

iteration: 1
Going to drop following tables [ 'table_2' ]
dropping table: table_2
dropped table: table_2
Tables dropped. Loading data and getting count...
Going to drop following tables [ 'table_1' ]
dropping table: table_1
dropped table: table_1
Tables dropped. Loading data and getting count...
iteration: 2
Going to drop following tables [ 'table_2' ]
dropping table: table_2
dropped table: table_2
Tables dropped. Loading data and getting count...
Error while using connection: [Error: FATAL Error: Failed to create checkpoint because of error: Could not remove file "./db.bin.wal": No such file or directory]
[Error: FATAL Error: Failed to create checkpoint because of error: Could not remove file "./db.bin.wal": No such file or directory]

Code:

async function run6(connection) {
    const dataFilePaths = [
        '/Users/nitinbansal/Downloads/mix data/generated.csv',
        '/Users/nitinbansal/Downloads/hackernews_10k_dec 2024.csv'
    ];
    const tableNames = ['table_1', 'table_2'];

    let counts = 0;

    for(let i = 0; i < 3; i++) {
        console.log(`iteration: ${i+1}`);

        for(let j = 0; j < dataFilePaths.length; j++) {
            await useConnection(DB_FILE_PATH, async (connection) => {
                await dropTables(connection, true);

                console.log(`Tables dropped. Loading data and getting count...`);

                let result = await loadFile(connection, dataFilePaths[j], tableNames[j], 3);
                let totalRecords = await getRecordCount(connection, tableNames[j]);

                counts = totalRecords;
            });
        }
    }
}

@freakynit

freakynit commented Dec 26, 2024

Code for methods used above (almost full code):

async function initConnection(dbFilePath) {
    const instance = await DuckDBInstance.create(dbFilePath);
    return await instance.connect();
}


async function useConnection(dbFilePath, callback) {
    let connection = null;
    try {
        connection = await initConnection(dbFilePath);
        return await callback(connection); // Return the result of the callback
    } catch (error) {
        console.error("Error while using connection:", error);
        throw error;
    } finally {
        connection = null; // todo: prod: use close() if available
    }
}


async function dropTables(connection, dropAllTables, tablesToDrop) {
    if(dropAllTables) {
        const query = `SELECT table_name FROM information_schema.tables WHERE table_schema = 'main'`;
        const result = await runAndGetResults(connection, query, -1, true);
        tablesToDrop = result.rows.flat();
    }

    if (tablesToDrop.length === 0) {
        console.log('No tables to drop.');
        return;
    }

    console.log(`Going to drop following tables`, tablesToDrop);

    for(const tableName of tablesToDrop) {
        console.log(`dropping table: ${tableName}`);

        try {
            await runAndGetResults(connection, `DROP TABLE IF EXISTS ${tableName}`, -1, false);
            console.log(`dropped table: ${tableName}`);
        } catch (err) {
            console.error(`error dropping table ${tableName}:`, err);
        }
    }
}


async function runAndGetResults(connection, query, maxRowsToFetch = 2048, jsonParseEachRow = false, returnIterator = false) {
    const result = await connection.run(query);
    ...
    ...
    return {
        header: columnNames,
        schema: schema,
        rows: collectedRows
    };
}


async function loadFile(connection, filepath, tableName, sampleRowsToGet = 3) {
    // Runs `CREATE TABLE ${tableName} AS SELECT * FROM read_csv_auto('${filepath}')`
    ...
    ...
    return {
        "table": tableName,
        "filepath": filepath,
        "schema": queryResult.schema,
        "sampleRows": queryResult.rows
    }
}


async function getRecordCount(connection, tableName) {
    const query = `select count(1) as total_count from ${tableName}`;
    const result = await runAndGetResults(connection, query, -1, true, false);

    if(result.rows.length > 0) {
        return result.rows[0][0];
    } else {
        return 0;
    }
}

cc: @carlopi

@jraymakers
Contributor

Looks like you're creating a new DuckDBInstance for each operation. I believe it is a known problem that having multiple DuckDB instances in the same process writing to the same database files can cause corruption.

Currently, @duckdb/node-api doesn't provide a way to explicitly clean up an instance. It gets automatically cleaned up a short amount of time after the reference is released, but creating many instances in quick succession, and attaching them to the same database files, could result in multiple instances attached to, and writing to, the same database files at the same time.

My recommendation is to only have a single DuckDBInstance. Use the ATTACH and DETACH commands to load and unload database files.
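
A minimal sketch of that single-instance pattern, using the DuckDBInstance.create / connect / run calls already shown in this thread (the ':memory:' path, file name, and alias below are placeholders):

const { DuckDBInstance } = require('@duckdb/node-api');

async function main() {
    // One instance and one connection for the lifetime of the process.
    const instance = await DuckDBInstance.create(':memory:');
    const connection = await instance.connect();

    // Load a database file, work with it, then unload it,
    // instead of creating a new DuckDBInstance per operation.
    await connection.run(`ATTACH 'datastore/store/daily.db' AS daily`);
    await connection.run(`CREATE TABLE daily.a (i INTEGER)`);
    await connection.run(`INSERT INTO daily.a VALUES (1)`);
    await connection.run(`DETACH daily`);
}

main();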

@carlopi carlopi removed the stale label Dec 26, 2024
@freakynit

@jraymakers Thanks.. one question about using the same connection:

If I repeatedly select large amounts of data (like millions of rows) using chunked streaming (not the full result, but partial, 2048 rows at a time), can that lead to OOM?

I'll be consuming only the first 2048 rows from the result set, leaving the rest unconsumed. I can't use LIMIT for this case, since I need to run the query on the full dataset and only show the first 2048 rows. Or are further rows actually fetched only on the next chunk call?

@jraymakers
Contributor

How much memory a query uses depends on the details. But if you're just scanning a data source, and streaming the results, that should minimize the memory usage.

Note that I released @duckdb/node-api@1.1.3-alpha.8 a couple of days ago, which implements true streaming. For your use case, you should be sure to use that; otherwise the results could get materialized eagerly in the background. This PR demonstrates some of the different patterns and compares some aspects of their performance.
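
For reference, a rough sketch of the "consume only the first N rows of a streamed result" pattern; the names connection.stream(), result.fetchChunk(), chunk.rowCount, and chunk.getRows() are my assumptions about the alpha streaming API, so check the release notes and the linked PR for the exact methods:

async function firstRows(connection, query, maxRows = 2048) {
    // Assumed streaming entry point: the result is not materialized eagerly.
    const result = await connection.stream(query);
    const rows = [];
    while (rows.length < maxRows) {
        const chunk = await result.fetchChunk(); // assumed chunk-at-a-time fetch
        if (!chunk || chunk.rowCount === 0) {
            break;                               // end of stream
        }
        rows.push(...chunk.getRows());           // assumed row accessor
    }
    // Remaining chunks of the stream are simply left unconsumed.
    return rows.slice(0, maxRows);
}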

@freakynit

@jraymakers Thanks a lot for the clarification. I'll surely check out the new release and the associated PR.

Also, I did a simple streaming load test to gauge its impact on RAM... whether it keeps increasing, or moves up and down as memory gets reclaimed. It seems to work fine as of now. The memory doesn't keep piling up.

Here's test code:

async function run6(connection) {
    const dataFilePaths = [
        '/Users/nitinbansal/Downloads/mix data/generated.csv',
        '/Users/nitinbansal/Downloads/hackernews_10k_dec 2024.csv'
    ];
    const tableNames = ['table_1', 'table_2'];

    let counts = 0;

    await useConnection(DB_FILE_PATH, async (connection) => {
        await dropTables(connection, true);

        console.log(`Tables dropped. Loading data and getting count...`);

        let result = await loadFile(connection, dataFilePaths[0], tableNames[0], 3);
        result = await loadFile(connection, dataFilePaths[1], tableNames[1], 3);

        for(let k = 0; k < 200; k++) {
            console.log(`iteration: ${k+1}`);

            result = await runAndGetResults(connection, `select * from table_1`, 2048, false, false);
            console.log(`selected row count ${result?.rows?.length}`)

            result = await runAndGetResults(connection, `select * from table_2`, 2048, false, false);
            console.log(`selected row count ${result?.rows?.length}`)
        }
    });
}

The DB file size is roughly 1GB.

Thanks again..
