Hive Metastore Utils¶
About Hive Metastore¶
The Hive Metastore is a database with metadata for Hive tables.
To configure SparklySession
to work with an external Hive Metastore, you need to set the hive.metastore.uris
option.
You can do this via a hive-site.xml
file in the Spark config directory ($SPARK_HOME/conf/hive-site.xml):
<property>
<name>hive.metastore.uris</name>
<value>thrift://<n.n.n.n>:9083</value>
<description>IP address (or fully-qualified domain name) and port of the metastore host</description>
</property>
or set it dynamically via SparklySession
options:
from sparkly import SparklySession

class MySession(SparklySession):
    options = {
        'hive.metastore.uris': 'thrift://<n.n.n.n>:9083',
    }
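For illustration, the option value above can be assembled by a small helper; build_metastore_options is a hypothetical name for this sketch, not part of the sparkly API:

```python
# Hypothetical helper (not sparkly API): builds the metastore option shown
# above. hive.metastore.uris expects a thrift:// URI pointing at the
# metastore host and port.
def build_metastore_options(host, port=9083):
    return {'hive.metastore.uris': f'thrift://{host}:{port}'}

# The result is what you would put into a SparklySession subclass's
# `options` dict.
options = build_metastore_options('10.0.0.5')
```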
Tables management¶
Why: you need to check if tables exist, rename them, drop them, or even overwrite existing aliases in your catalog.
from sparkly import SparklySession
spark = SparklySession()
assert spark.catalog_ext.has_table('my_table') in {True, False}
spark.catalog_ext.rename_table('my_table', 'my_new_table')
spark.catalog_ext.create_table('my_new_table', path='s3://my/parquet/data', source='parquet', mode='overwrite')
spark.catalog_ext.drop_table('my_new_table')
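The expected semantics of the calls above can be sketched against an in-memory stand-in; ToyCatalog is a toy dict-backed class for illustration only, not Spark's catalog or sparkly's implementation:

```python
class ToyCatalog:
    # Toy stand-in mimicking the catalog_ext semantics used above.
    def __init__(self):
        self._tables = {}

    def has_table(self, name):
        return name in self._tables

    def create_table(self, name, path, source='parquet', mode='error'):
        # Without mode='overwrite', creating over an existing name is an error.
        if name in self._tables and mode != 'overwrite':
            raise ValueError(f'table {name} already exists')
        self._tables[name] = {'path': path, 'source': source}

    def rename_table(self, old, new):
        self._tables[new] = self._tables.pop(old)

    def drop_table(self, name):
        del self._tables[name]

catalog = ToyCatalog()
catalog.create_table('my_table', path='s3://my/parquet/data')
catalog.rename_table('my_table', 'my_new_table')
assert catalog.has_table('my_new_table') and not catalog.has_table('my_table')
catalog.drop_table('my_new_table')
```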
Table properties management¶
Why: sometimes you want to assign custom attributes to your table, e.g. creation time, last update, purpose, or data source. The only way to interact with table properties in Spark is via raw SQL queries. We implemented a more convenient interface to make your code cleaner.
from sparkly import SparklySession
spark = SparklySession()
spark.catalog_ext.set_table_property('my_table', 'foo', 'bar')
assert spark.catalog_ext.get_table_property('my_table', 'foo') == 'bar'
assert spark.catalog_ext.get_table_properties('my_table') == {'foo': 'bar'}
Note: properties are stored as strings. If you need other types, consider using a serialisation format, e.g. JSON.
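A minimal sketch of the JSON round-trip suggested above; encode_property and decode_property are hypothetical helper names, not sparkly API, and the set/get calls they feed are the ones shown in this section:

```python
import json

# Hypothetical helpers (not sparkly API): table properties only hold strings,
# so structured values are serialised to JSON before storing.
def encode_property(value):
    return json.dumps(value)

def decode_property(raw):
    return json.loads(raw)

meta = {'created_at': 1700000000, 'columns': ['id', 'name']}
raw = encode_property(meta)           # pass this to set_table_property(...)
assert decode_property(raw) == meta   # decode what get_table_property(...) returns
```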
Using non-default database¶
Why: to split your warehouse into logical groups (for example, by system components). All catalog_ext.* methods accept fully-qualified table names of the form <db-name>.<table-name> and operate on them properly.
from time import time
from sparkly import SparklySession
spark = SparklySession()
if spark.catalog_ext.has_database('my_database'):
    spark.catalog_ext.rename_table(
        'my_database.my_badly_named_table',
        'new_shiny_name',
    )
    spark.catalog_ext.set_table_property(
        'my_database.new_shiny_name',
        'last_update_at',
        time(),
    )
Note: be careful with 'USE' statements like spark.sql('USE my_database') — the current database is stateful, and it may lead to weird errors if later code assumes the correct current database is set. Prefer fully-qualified table names.
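The fully-qualified-name pattern can be sketched as a tiny helper; qualified is a hypothetical name introduced here for illustration, not sparkly API:

```python
# Hypothetical helper (not sparkly API): build '<db>.<table>' names explicitly
# instead of relying on a stateful 'USE <db>' statement.
def qualified(table, db=None):
    return f'{db}.{table}' if db else table

# Names built this way work with any catalog_ext.* method.
assert qualified('events', 'my_database') == 'my_database.events'
assert qualified('events') == 'events'
```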