Upgrading to Airflow 2.0
We ran into a few gotchas during the upgrade that weren't covered by the excellent migration document that the Airflow team put together.
This isn't a critique; there is a lot about Airflow 2.0 that we are very excited about. I just wanted to share my notes in case anybody else runs into similar issues.
We didn't run all of the bridge release steps, and in retrospect I suspect that doing so would have made the process smoother. We assumed that any necessary database migrations would be included in the 2.0 release itself. I would recommend others follow all of the steps, just to be safe. If skipping them is indeed what went wrong, the Airflow team may want to emphasize the criticality of this step more strongly.
The invocation of the Kubernetes Pod Operator changed in 2.0, so our earlier decision to implement a dynamic DAG generator turned out to be a big win: we only had to update a single mapping to the operator instead of hundreds of individual jobs. This approach has worked really well and we will keep using it.
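To make that concrete, here's a rough sketch of the pattern. The factory function and the shape of our job config are just illustrative (they're ours, not an Airflow API), but the import move out of contrib and into the cncf.kubernetes provider package is a real 2.0 change:

```python
# Sketch of the dynamic-DAG pattern: every job definition flows through one
# factory, so the 1.10 -> 2.0 operator changes only had to be handled here.

# Airflow 2.0: the operator moved out of contrib into the cncf.kubernetes provider.
# Old (1.10.x): from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator


def make_pod_task(dag, job):
    """Map one entry from our job config (a plain dict, hypothetical shape) onto the operator."""
    return KubernetesPodOperator(
        dag=dag,
        task_id=job["name"],
        name=job["name"],
        namespace=job.get("namespace", "airflow"),
        image=job["image"],
        arguments=job.get("arguments", []),
        get_logs=True,
        is_delete_operator_pod=True,
    )
```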
As a side note, we've found that migrating our config file across the past two major Airflow upgrades has been really painful, because we have a lot of modified settings that we want to keep. We need to come up with a better way to do this, maybe using environment variables to override the config file.
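For what it's worth, Airflow already reads AIRFLOW__{SECTION}__{KEY} environment variables ahead of airflow.cfg, which is the direction we're leaning. The snippet below just illustrates the naming convention with placeholder values; in practice these would be set in the container or host environment before Airflow starts, not in Python:

```python
import os

# Per-deployment overrides can live in the environment instead of a hand-merged
# airflow.cfg. These keys exist in Airflow 2.0; the values are placeholders.
os.environ["AIRFLOW__CORE__PARALLELISM"] = "64"
os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = "postgresql+psycopg2://airflow:***@db-host/airflow"
os.environ["AIRFLOW__WEBSERVER__BASE_URL"] = "https://airflow.example.com"
```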
Problem #1: Our DAGs wouldn’t load into the serialized DAG table, even after running the database migration.
The most significant gotcha was an issue around the “airflow db upgrade” command. Despite the command reporting success, DAGs were not loading into the serialized DAG table and couldn’t be viewed in the UI.
Digging in, we found a couple of config settings that were missing and added them, but that didn’t resolve the problem.
Fortunately, our implementation makes it trivial to tear down and rebuild the Airflow metadata DB, so that’s what we did. Unfortunately, both “airflow db reset” and “airflow db init” raised fatal errors.
We ended up having to create a connection to the database and manually drop the offending tables/objects before “airflow db init” would work correctly.
This resolved the DAG serialization problem.
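For anyone who hits the same wall, the manual cleanup looked roughly like the sketch below. The connection string and table names are placeholders, and it assumes a Postgres metadata DB; which objects actually conflict will depend on the state the failed migration left behind:

```python
# Rough sketch of manually dropping conflicting objects so "airflow db init" can
# rebuild the metadata DB from scratch. Placeholder connection string and table names.
from sqlalchemy import create_engine, text

# Same connection string Airflow uses (sql_alchemy_conn in airflow.cfg).
engine = create_engine("postgresql+psycopg2://airflow:***@db-host/airflow")

with engine.begin() as conn:
    for table in ["serialized_dag", "rendered_task_instance_fields"]:  # placeholder names
        conn.execute(text(f"DROP TABLE IF EXISTS {table} CASCADE"))
```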
Problem #2: Our DEV and QA environments have been sharing a metadata DB, which no longer works with DAG serialization.
While the deployment worked great in DEV, all sorts of weird scheduling and queuing problems started happening once we migrated the QA environment. Jobs simply hang, with tasks stuck in a queued or scheduled state, and they never run.
The obvious solution seemed to be setting the sql_alchemy_schema property in airflow.cfg so that each environment gets its own schema, but this didn’t work for us. The property doesn’t appear to be fully supported across Airflow’s deployment and migration scripts, which causes both “airflow db init” and “airflow db reset” to fail. This is really frustrating!
Our solution was to spin up a separate database instance for the QA environment.
Problem #3: The Airflow scheduler stops scheduling tasks after an indeterminate period of time but continues emitting a heartbeat.
This one is puzzling because it didn’t appear right away, and nothing in our logs points to an obvious cause. Jobs run correctly for a while and then hang, with tasks stuck indefinitely in a queued state.
If I restart the container Airflow runs in, or run killall airflow, the scheduler restarts and resumes scheduling correctly, until it falls into the same holding pattern again after some period of time.
One option is to set up a cron job that runs killall airflow every couple of hours, so the scheduler gets restarted before it can wedge itself. This is obviously not an ideal solution, but it may be viable as a stop-gap until a better fix is identified, especially for anyone dealing with this in a production environment.
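Here's a rough sketch of what that stop-gap could look like as a Python equivalent of the killall airflow cron job. It assumes cron runs it on the same host or inside the same container as the scheduler, and that a supervisor (systemd, a Docker restart policy, etc.) brings Airflow back up afterwards; details are illustrative, not something Airflow ships:

```python
#!/usr/bin/env python3
# Stop-gap: terminate Airflow processes so the scheduler gets restarted by its
# supervisor, roughly what `killall airflow` does. Intended to be run from cron.
import os
import signal
import subprocess

# Find PIDs whose command line mentions "airflow", excluding this script itself.
result = subprocess.run(["pgrep", "-f", "airflow"], capture_output=True, text=True)
for pid in result.stdout.split():
    if int(pid) == os.getpid():
        continue
    try:
        os.kill(int(pid), signal.SIGTERM)
    except ProcessLookupError:
        pass  # process already exited between pgrep and the kill
```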
I’ll update this story as we make progress on this. We won’t be upgrading our production environment to Airflow 2.0+ until our lower environments are running in a consistent, reliable way.