Update and insert data into a table (upsert)
Sometimes you have a pandas.DataFrame
that represents an existing table in the database that already contains data.
If you want to update the existing records of the table with data from the pandas.DataFrame
and also insert new rows
(that exist in the pandas.DataFrame
but not in the table) you should use the
DatabaseManager.upsert_table()
(update and insert) method. The method can update
and insert new rows or only update existing rows. It works by creating and executing an UPDATE statement followed by an optional
INSERT statement derived from the structure of the pandas.DataFrame
. Let’s look at some examples of using
DatabaseManager.upsert_table()
with the Customer table of Runescape.db.
# upsert_table.py
from datetime import datetime
import pandas as pd
import pandemy
db = pandemy.SQLiteDb(file='Runescape.db', must_exist=True)
query = """SELECT * FROM Customer ORDER BY CustomerId ASC"""
dtypes = {
'CustomerName': 'string',
'Residence': 'string',
'IsAdventurer': 'boolean'
}
with db.engine.connect() as conn:
df = db.load_table(
sql=query,
conn=conn,
index_col='CustomerId',
dtypes=dtypes,
parse_dates=['BirthDate']
)
print(f'Customer table original:\n\n{df}')
# Change some data
df.loc[1, ['BirthDate', 'Residence']] = [datetime(1891, 7, 15), 'Falador']
df.loc[4, 'IsAdventurer'] = True
# Add new data
df.loc[9, :] = ['Prince Ali', datetime(1969, 6, 20), 'Al Kharid', False]
df.loc[10, :] = ['Mosol Rei', datetime(1983, 4, 30), 'Shilo Village', False]
with db.engine.begin() as conn:
# Update and insert the new data
db.upsert_table(
df=df,
table='Customer',
conn=conn,
where_cols=['CustomerName'],
upsert_index_cols=False,
update_only=False,
datetime_cols_dtype='str',
datetime_format=r'%Y-%m-%d'
)
# Load the data back
df_upsert = db.load_table(
sql=query,
conn=conn,
index_col='CustomerId',
dtypes=dtypes,
parse_dates=['BirthDate']
)
print(f'\n\nCustomer table after upsert:\n\n{df_upsert}')
$ python upsert_table.py
Customer table original:
CustomerName BirthDate Residence IsAdventurer
CustomerId
1 Zezima 1990-07-14 Yanille True
2 Dr Harlow 1970-01-14 Varrock False
3 Baraek 1968-12-13 Varrock False
4 Gypsy Aris 1996-03-24 Varrock False
5 Not a Bot 2006-05-31 Catherby True
6 Max Pure 2007-08-20 Port Sarim True
Customer table after upsert:
CustomerName BirthDate Residence IsAdventurer
CustomerId
1 Zezima 1891-07-15 Falador True
2 Dr Harlow 1970-01-14 Varrock False
3 Baraek 1968-12-13 Varrock False
4 Gypsy Aris 1996-03-24 Varrock True
5 Not a Bot 2006-05-31 Catherby True
6 Max Pure 2007-08-20 Port Sarim True
7 Prince Ali 1969-06-20 Al Kharid False
8 Mosol Rei 1983-04-30 Shilo Village False
The Customer table is loaded from the database into a pandas.DataFrame
(df
). The data is modified and two new rows
are added to df
. DatabaseManager.upsert_table()
is called with the updated
version of df
. The where_cols
parameter is set to the CustomerName column which means that rows from df
with a CustomerName that already exists in the Customer table will be updated. Rows that do not have a matching
CustomerName will be inserted instead. The BirthDate column is inserted as formatted strings (YYYY-MM-DD) by the
parameters datetime_cols_dtype='str'
and datetime_format=r'%Y-%m-%d'
. Setting the parameter update_only=True
would have
only updated existing rows and not inserted any new rows.
Note
The Primary key column CustomerId is not included in database statements sent to the database in the example above.
This is due to the parameter upsert_index_cols=False
, which is also the default behavior.
The values of CustomerId in df
of the two new rows (9 and 10) differ from the ones inserted into the database
(7 and 8). The CustomerId column is defined as an INTEGER data type in the database and if it is not supplied
in the INSERT statement SQLite will autoincrement the value by one from the previous row.
It is useful to exclude the Primary key from the upsert if it is generated by the database.
Using the dry_run parameter
If you want to look at the SQL statements sent to the database you can set the parameter dry_run=True
.
This will return the SQL statements that would have been executed on the database for every row in df
.
Nothing gets executed on the database. This is useful to verify that you have set the parameters correct to make
the statements do what you expect. If update_only=True
the returned INSERT statement will be None
.
The next example illustrates using the dry_run
parameter with the Customer table from before.
# upsert_table_dry_run.py
from datetime import datetime
import pandas as pd
import pandemy
db = pandemy.SQLiteDb(file='Runescape.db', must_exist=True)
query = """SELECT * FROM Customer ORDER BY CustomerId ASC"""
dtypes = {
'CustomerName': 'string',
'Residence': 'string',
'IsAdventurer': 'boolean'
}
with db.engine.connect() as conn:
df = db.load_table(
sql=query,
conn=conn,
index_col='CustomerId',
dtypes=dtypes,
parse_dates=['BirthDate']
)
print(f'Customer table:\n\n{df}')
with db.engine.begin() as conn:
# Get the UPDATE and INSERT statements
update_stmt, insert_stmt = db.upsert_table(
df=df,
table='Customer',
conn=conn,
where_cols=['CustomerName', 'BirthDate'],
upsert_cols=['Residence', 'CustomerName'],
upsert_index_cols=True,
update_only=False,
dry_run=True
)
print(f'\n\nUPDATE statement:\n\n{update_stmt}')
print(f'INSERT statement:\n\n{insert_stmt}')
$ python upsert_table_dry_run.py
Customer table:
CustomerName BirthDate Residence IsAdventurer
CustomerId
1 Zezima 1891-07-15 Falador True
2 Dr Harlow 1970-01-14 Varrock False
3 Baraek 1968-12-13 Varrock False
4 Gypsy Aris 1996-03-24 Varrock True
5 Not a Bot 2006-05-31 Catherby True
6 Max Pure 2007-08-20 Port Sarim True
7 Prince Ali 1969-06-20 Al Kharid False
8 Mosol Rei 1983-04-30 Shilo Village False
UPDATE statement:
UPDATE Customer
SET
Residence = :Residence,
CustomerId = :CustomerId
WHERE
CustomerName = :CustomerName AND
BirthDate = :BirthDate
INSERT statement:
INSERT INTO Customer (
Residence,
CustomerName,
CustomerId
)
SELECT
:Residence,
:CustomerName,
:CustomerId
WHERE
NOT EXISTS (
SELECT
1
FROM Customer
WHERE
CustomerName = :CustomerName AND
BirthDate = :BirthDate
)
Here we use the columns CustomerName and BirthDate in the WHERE clause and specify that the index column (CustomerId)
should also be updated. If the index is a pandas.MultiIndex
all levels are included if upsert_index_cols=True
.
A list of level names can be used to only select desired levels of the index to the upsert. By specifying the upsert_cols
parameter a subset of the columns of df
can be selected for the upsert, in this case the columns Residence and CustomerName.
Since CustomerName is also supplied to the where_cols
parameter it is excluded from the columns to update,
because that would otherwise result in an update to the same value.