Merge a DataFrame into a table
The DatabaseManager.merge_df()
method lets you take a pandas.DataFrame
and update existing and insert new rows into a table based on some criteria. It executes a combined UPDATE and INSERT statement
– a MERGE statement. The MERGE statement is executed once for every row in the pandas.DataFrame
. The method and its API
is very similar to DatabaseManager.upsert_table()
.
See the section Update and insert data into a table (upsert)
in the SQLite User guide.
An example of using DatabaseManager.merge_df()
is shown below with the Item table
from Runescape.db. Setting the parameter dry_run=True
will return the generated MERGE statement as a string instead of
executing it on the database. To avoid having to connect to a real Oracle database server the begin()
method of the DatabaseManager.engine
is mocked.
# merge_df.py
from datetime import datetime
from unittest.mock import MagicMock
import pandas as pd
import pandemy
db_sqlite = pandemy.SQLiteDb(file='Runescape.db', must_exist=True)
query = """SELECT * FROM Item ORDER BY ItemId ASC"""
dtypes = {
'ItemId': 'int8',
'ItemName': 'string',
'MemberOnly': 'boolean',
'Description': 'string'
}
with db_sqlite.engine.connect() as conn:
df = db_sqlite.load_table(
sql=query,
conn=conn,
index_col=['ItemId'],
dtypes=dtypes,
)
print(f'Item table:\n\n{df}')
# Change some data
df.loc[1, 'Description'] = 'This pot is not empty!'
df.loc[12, ['ItemName', 'MemberOnly']] = ['Dragon Super Scimitar', False]
# Add new data
df.loc[17, :] = ['Coal', False, 'Hmm a non-renewable energy source!']
df.loc[18, :] = ['Redberry pie', False, 'Looks tasty.']
db_oracle = pandemy.OracleDb(
username='Fred_the_Farmer',
password='Penguins-sheep-are-not',
host='fred.farmer.rs',
port=1234,
service_name='woollysheep'
)
db_oracle.engine.begin = MagicMock() # Mock the begin method
print(f'\n\nRows to be updated or inserted:\n\n{df.loc[[1, 12, 17, 18], :]}')
with db_oracle.engine.begin() as conn:
merge_stmt = db_oracle.merge_df(
df=df,
table='Item',
conn=conn,
on_cols=['ItemName'],
merge_cols='all',
merge_index_cols=False,
dry_run=True
)
print(f'\n\nMERGE statement:\n\n{merge_stmt}')
$ python merge_df.py
Item table:
ItemName MemberOnly Description
ItemId
1 Pot False This pot is empty.
2 Jug False This jug is empty.
3 Shears False For shearing sheep.
4 Bucket False Its a wooden bucket.
5 Bowl False Useful for mixing things.
6 Cake tin False Useful for baking things.
7 Tinderbox False Useful for lighting a fire.
8 Chisel False Good for detailed Crafting.
9 Hammer False Good for hitting things.
10 Newcomer map False Issued to all new citizens of Gielinor.
11 Unstrung symbol False It needs a string so I can wear it.
12 Dragon Scimitar True A vicious, curved sword.
13 Amulet of glory True A very powerful dragonstone amulet.
14 Ranarr seed True A ranarr seed - plant in a herb patch.
15 Swordfish False Id better be careful eating this!
16 Red dragonhide Body True Made from 100% real dragonhide.
Rows to be updated or inserted:
ItemName MemberOnly Description
ItemId
1 Pot False This pot is not empty!
12 Dragon Super Scimitar False A vicious, curved sword.
17 Coal False Hmm a non-renewable energy source!
18 Redberry pie False Looks tasty.
MERGE statement:
MERGE INTO Item t
USING (
SELECT
:ItemName AS ItemName,
:MemberOnly AS MemberOnly,
:Description AS Description
FROM DUAL
) s
ON (
t.ItemName = s.ItemName
)
WHEN MATCHED THEN
UPDATE
SET
t.MemberOnly = s.MemberOnly,
t.Description = s.Description
WHEN NOT MATCHED THEN
INSERT (
t.ItemName,
t.MemberOnly,
t.Description
)
VALUES (
s.ItemName,
s.MemberOnly,
s.Description
)
The Item table is loaded from the database into a pandas.DataFrame
(df
). Two rows are modified and two new rows are
added. DatabaseManager.merge_df()
is called with df
and dry_run=True
. The parameter
merge_cols='all'
includes all columns from df
in the MERGE statement, which is the default. It also accepts a list of column
names to only include a subset of the columns. merge_index_cols=False
excludes the index column from the statement, which is also
the default.
In the returned MERGE statement the t
(target) alias refers to the Item table in the database and the s
(source) alias
to df
. When a value of the ItemName column in the database matches a value from the ItemName column of df
an UPDATE
statement is executed to update the MemberOnly and Description columns. The ItemName column does not get updated since it is part
of the ON clause and would mean an update to the same value. When there is no match the values of the columns ItemName, MemberOnly
and Description are inserted into their respective column counterparts of the Item table.
Oracle supports adding a WHERE clause to the UPDATE clause of the WHEN MATCHED THEN part. This can be controlled with
the parameter omit_update_where_clause
, which defaults to True
. If set to False
the columns to update will not
be updated if their values from df
are the same as in the database. If at least one value differs the update will be executed.
The next example illustrates this and also uses two columns in the ON clause (ItemId and ItemName). This time df
is loaded
with a pandas.MultiIndex
. Setting merge_index_cols=True
includes all column levels of the pandas.MultiIndex
in the MERGE statement. You can also supply a list of column level names to only include the desired index levels.
merge_df_omit_update_where_clause.py
# merge_df_omit_update_where_clause.py
from datetime import datetime
from unittest.mock import MagicMock
import pandas as pd
import pandemy
db_sqlite = pandemy.SQLiteDb(file='Runescape.db', must_exist=True)
query = """SELECT * FROM Item ORDER BY ItemId ASC"""
dtypes = {
'ItemId': 'int8',
'ItemName': 'string',
'MemberOnly': 'boolean',
'Description': 'string'
}
with db_sqlite.engine.connect() as conn:
df = db_sqlite.load_table(
sql=query,
conn=conn,
index_col=['ItemId', 'ItemName'],
dtypes=dtypes,
)
db_oracle = pandemy.OracleDb(
username='Fred_the_Farmer',
password='Penguins-sheep-are-not',
host='fred.farmer.rs',
port=1234,
service_name='woollysheep'
)
db_oracle.engine.begin = MagicMock() # Mock the begin method
with db_oracle.engine.begin() as conn:
merge_stmt = db_oracle.merge_df(
df=df,
table='Item',
conn=conn,
on_cols=['ItemId', 'ItemName'],
merge_cols='all',
merge_index_cols=True,
omit_update_where_clause=False,
dry_run=True
)
print(f'MERGE statement:\n\n{merge_stmt}')
$ python merge_df_omit_update_where_clause.py
MERGE statement:
MERGE INTO Item t
USING (
SELECT
:MemberOnly AS MemberOnly,
:Description AS Description,
:ItemId AS ItemId,
:ItemName AS ItemName
FROM DUAL
) s
ON (
t.ItemId = s.ItemId AND
t.ItemName = s.ItemName
)
WHEN MATCHED THEN
UPDATE
SET
t.MemberOnly = s.MemberOnly,
t.Description = s.Description
WHERE
t.MemberOnly <> s.MemberOnly OR
t.Description <> s.Description
WHEN NOT MATCHED THEN
INSERT (
t.MemberOnly,
t.Description,
t.ItemId,
t.ItemName
)
VALUES (
s.MemberOnly,
s.Description,
s.ItemId,
s.ItemName
)
Tip
If the MERGE includes a lot of columns and the statement needs to be tailored to suit a specific use case it is error prone to
type all column names into the statement by hand. Using dry_run=True
is useful to extract a template of the MERGE
statement that can be further manually edited and parametrized. An empty pandas.DataFrame
representing the table the MERGE
acts on is enough to get a template statement. The final statement can then be added to a SQLContainer
and executed with the DatabaseManager.execute()
method.
See also
Diving into Oracle MERGE Statement : A short and informative tutorial.