We ran into a couple of gotchas during the update that weren’t covered by the excellent migration document that the Airflow team put together.
This isn’t a critique; there is a lot about Airflow 2.0 that we are very excited about. I just wanted to share my notes in case anybody else runs into similar issues.
We didn’t run all of the bridge release steps and, in retrospect, I imagine that would probably have made the process smoother. We assumed that any necessary database migrations would also be included in the 2.0 release. I would recommend that others follow all of the steps, just to be safe. If this is indeed what went wrong, the Airflow team may want to emphasize the criticality of this step more strongly.
Because the invocation of the KubernetesPodOperator changed, our earlier decision to implement a dynamic DAG generator turned out to be a big win: we had a single interface to update instead of hundreds of individual jobs. This approach has worked really well, and we will keep using it.
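To illustrate the pattern (this is a hypothetical sketch, not our actual generator — the `JOBS` mapping and `build_operator_kwargs` helper are made-up names): every job funnels through one adapter function, so when the operator's signature changes, only that one function has to change.

```python
# Sketch of the dynamic-DAG-generator pattern. JOBS and
# build_operator_kwargs are illustrative names, not our real code.

JOBS = {
    # job_name -> settings; in practice this would come from external config
    "extract_orders": {"image": "etl/orders:latest", "schedule": "@daily"},
    "extract_users": {"image": "etl/users:latest", "schedule": "@hourly"},
}

def build_operator_kwargs(job_name, settings):
    """The single place that maps our job settings to KubernetesPodOperator
    arguments. When Airflow 2.0 changed the operator's invocation, only this
    mapping needed updating -- not hundreds of DAG files."""
    return {
        "task_id": job_name,
        "name": job_name.replace("_", "-"),  # pod names can't contain "_"
        "image": settings["image"],
        "namespace": "airflow",  # assumption: a fixed k8s namespace
        "get_logs": True,
    }

def generate_dags():
    """Yield (dag_id, operator_kwargs) pairs. In a real generator, each pair
    would become a DAG wrapping a KubernetesPodOperator, registered in
    globals() so Airflow's DAG parser picks it up."""
    for job_name, settings in JOBS.items():
        yield job_name, build_operator_kwargs(job_name, settings)
```

The design choice that pays off here is indirection: DAG definitions never touch the operator directly, so operator API churn is absorbed in one file.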
As a side note, we’ve found that migrating config files during the past two major Airflow updates has been really painful because we have a lot of modified settings that we want to keep. We need a better way to do this, perhaps using environment variables to override the config file.
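Airflow does support this: any option in airflow.cfg can be overridden with an environment variable named `AIRFLOW__{SECTION}__{KEY}` (double underscores), so customizations can live in the deployment environment and survive config-file replacements. The specific values below are just illustrative:

```shell
# Overrides [core] parallelism in airflow.cfg
export AIRFLOW__CORE__PARALLELISM=64
# Overrides [webserver] workers in airflow.cfg
export AIRFLOW__WEBSERVER__WORKERS=4
```

With this approach, a new major version can ship its own default airflow.cfg and our overrides still apply.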
Problem #1: Our DAGs wouldn’t load into the serialized DAG table, even after running the database migration.
The most significant issue was with the “airflow db upgrade” command. Despite it reporting success, DAGs were not loading into the serialized DAG table and couldn’t be viewed in the UI.
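One way to confirm the symptom is to query the metadata DB directly. Airflow 2.0 stores serialized DAGs in the `serialized_dag` table; assuming a Postgres metadata DB, a query like this should return one row per DAG once serialization is working:

```sql
-- If DAG serialization is healthy, this lists every DAG the UI can render.
SELECT dag_id, last_updated
FROM serialized_dag
ORDER BY last_updated DESC;
```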
After this, we found and added a couple of missing config settings, but that didn’t resolve the problem.
Fortunately, our implementation makes it trivial to tear down and rebuild the Airflow metadata DB, so that’s what we did. Unfortunately, both “airflow db reset” and “airflow db init” raised fatal errors.
We ended up having to create a connection to the database and manually drop the offending tables/objects before “airflow db init” would work correctly.
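The cleanup looked roughly like the following; the placeholder name is deliberate, since the exact objects that block “airflow db init” will vary by installation, and this should only be done against a metadata DB you are prepared to rebuild from scratch:

```sql
-- Sketch only: <offending_table> is a placeholder, not the actual object
-- we dropped. CASCADE also removes dependent constraints and indexes,
-- which is what allowed "airflow db init" to recreate everything cleanly.
DROP TABLE IF EXISTS <offending_table> CASCADE;
```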
This resolved the DAG serialization problem.
Problem #2: Our DEV and QA environments have been sharing a metadata DB, which no longer works with DAG serialization.
While the deployment worked great in DEV, all sorts of weird scheduling and queuing problems started once we migrated the QA environment. Jobs simply hang, with tasks stuck in the queued or scheduled state; they never run.
The obvious solution seemed to be setting the sql_alchemy_schema property in airflow.cfg so that each environment gets its own namespace, but this didn’t work for us. The property doesn’t appear to be fully supported across Airflow’s deployment and migration scripts, causing both “airflow db init” and “airflow db reset” to fail. This is really frustrating!
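For reference, this is roughly what we attempted (connection string redacted; in Airflow 2.0 both options live under [core]):

```ini
[core]
sql_alchemy_conn = postgresql+psycopg2://airflow:***@db-host/airflow
# Intended to isolate each environment in its own Postgres schema,
# but the init/reset commands did not honor it consistently for us.
sql_alchemy_schema = qa
```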
Our solution was to spin up a separate database instance for the QA environment.
Airflow 2.0 seems to be up and running now. I’ll update this if we encounter any other problems as we migrate our other environments.