Airflow Cron Management UI

1. Comprehensive Guide: Setting Up Apache Airflow on Ubuntu 24.04

This guide provides a full, step-by-step process for installing and configuring a stable Apache Airflow instance on an Ubuntu 24.04 AWS EC2 server. It uses a locally installed PostgreSQL database for the metadata backend and includes common troubleshooting solutions and best practices for a robust setup.

2. Prerequisites

  • An AWS account.
  • A terminal or SSH client to connect to the EC2 instance.
  • An SSH key pair configured in your AWS account.

3. Step 1: Launch and Configure AWS EC2 Instance

  1. Launch Instance: Navigate to the AWS EC2 console and launch a new instance.
  2. AMI: Select Ubuntu Server 24.04 LTS.
  3. Instance Type: Choose an instance with at least 4 GB of RAM. A t3.medium or larger is highly recommended.
  4. Key Pair: Assign your SSH key pair to the instance.
  5. Security Group: Create a new security group with the following inbound rules:
    • SSH (port 22): Source set to “My IP” for secure shell access.
    • Custom TCP (port 8080): Source set to “My IP” to access the Airflow Web UI.
Connect via SSH:
ssh -i /path/to/your-key.pem ubuntu@<your-ec2-public-ip>

4. Step 2: Prepare the Ubuntu System

Update system packages and install Python & PostgreSQL dependencies:
# Update package lists and upgrade existing packages
sudo apt update && sudo apt upgrade -y

# Install Python tools and PostgreSQL client libraries
sudo apt install -y python3-pip python3.12-venv postgresql-client libpq-dev

5. Step 3: Install and Configure PostgreSQL

5.1 Install PostgreSQL Server

sudo apt install -y postgresql postgresql-contrib

5.2 Create Airflow Database and User

Open a psql shell as the postgres user:
sudo -i -u postgres psql
Inside the psql shell, run:
-- Create a dedicated user for Airflow
CREATE USER airflow WITH PASSWORD 'strong_password';

-- Create the database
CREATE DATABASE airflow;

-- Set the new user as owner
ALTER DATABASE airflow OWNER TO airflow;

-- Exit psql
\q

6. Step 4: Set Up and Install Apache Airflow

6.1 Set AIRFLOW_HOME Environment Variable

echo "export AIRFLOW_HOME=~/airflow" >> ~/.bashrc
source ~/.bashrc

6.2 Create and Activate Python Virtual Environment

python3 -m venv ~/airflow_venv
source ~/airflow_venv/bin/activate

6.3 Install Airflow with Constraints

pip install "apache-airflow[postgres]==2.9.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.9.2/constraints-3.12.txt"
The constraints file pins every transitive dependency to a combination tested with this Airflow release, preventing dependency conflicts.
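The constraints URL encodes both the Airflow version and the Python minor version, and the two must match your environment. A small sketch of how the URL is assembled (the helper name is illustrative, not part of any Airflow API):

```python
# Illustrative helper: builds the constraints URL used in the pip command
# above from an Airflow version and a Python "major.minor" version string.
def constraints_url(airflow_version: str, python_version: str) -> str:
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )

print(constraints_url("2.9.2", "3.12"))
```

If you later upgrade Python or Airflow, regenerate the URL accordingly rather than reusing the old pin set.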

7. Step 5: Configure and Initialize Airflow

7.1 Run db migrate to generate airflow.cfg

airflow db migrate
(airflow db migrate supersedes the deprecated airflow db upgrade as of Airflow 2.7.)

7.2 Edit airflow.cfg

nano ~/airflow/airflow.cfg
Make these changes:
# In the [core] section: change the executor to allow parallel task execution
executor = LocalExecutor

# In the [database] section: update the database connection string
sql_alchemy_conn = postgresql+psycopg2://airflow:strong_password@localhost/airflow

# In the [core] section (optional): disable the example DAGs
load_examples = False
Save and exit.
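The connection string follows SQLAlchemy's URI format. One common pitfall: a password containing characters such as @, : or / must be URL-encoded or the URI will not parse. A small illustrative helper (the function name is an assumption, not an Airflow API):

```python
from urllib.parse import quote_plus

# Illustrative helper: assembles the sql_alchemy_conn value. Passwords with
# reserved URI characters (@, :, /) must be percent-encoded.
def airflow_db_uri(user: str, password: str, host: str, db: str) -> str:
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}/{db}"

print(airflow_db_uri("airflow", "strong_password", "localhost", "airflow"))
```

The printed value matches the sql_alchemy_conn line above; a password like "p@ss" would be emitted as "p%40ss".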

7.3 Run db migrate again

airflow db migrate
This second run applies the schema migrations to the PostgreSQL database configured above.

8. Step 6: Create Admin User and Start Services

8.1 Create Admin User

airflow users create \
  --username admin \
  --role Admin \
  --firstname Your \
  --lastname Name \
  --email you@example.com
You will be prompted to set a password (or supply one non-interactively with --password).

8.2 Start Webserver and Scheduler

# Start the webserver on port 8080
airflow webserver --port 8080 -D

# Start the scheduler
airflow scheduler -D

9. Step 7: Access the Airflow UI

Open your browser:
http://<your-ec2-public-ip>:8080
Log in with the admin credentials.

10. Optional Production Setup: Using systemd

10.1 Create Webserver Service

sudo nano /etc/systemd/system/airflow-webserver.service
[Unit]
Description=Airflow Webserver
After=network.target postgresql.service
Requires=postgresql.service

[Service]
Environment="AIRFLOW_HOME=/home/ubuntu/airflow"
User=ubuntu
Group=ubuntu
Type=simple
ExecStart=/home/ubuntu/airflow_venv/bin/airflow webserver --port 8080
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target

10.2 Create Scheduler Service

sudo nano /etc/systemd/system/airflow-scheduler.service
[Unit]
Description=Airflow Scheduler
After=network.target postgresql.service
Requires=postgresql.service

[Service]
Environment="AIRFLOW_HOME=/home/ubuntu/airflow"
User=ubuntu
Group=ubuntu
Type=simple
ExecStart=/home/ubuntu/airflow_venv/bin/airflow scheduler
Restart=on-failure
RestartSec=10s

[Install]
WantedBy=multi-user.target

10.3 Enable and Start Services

sudo systemctl daemon-reload
sudo systemctl start airflow-webserver
sudo systemctl start airflow-scheduler
sudo systemctl enable airflow-webserver
sudo systemctl enable airflow-scheduler
sudo systemctl status airflow-webserver
sudo systemctl status airflow-scheduler

11. Creating Your First DAGs

  • Place DAG files in ~/airflow/dags/.
  • The scheduler rescans ~/airflow/dags/ on a fixed interval (5 minutes by default), so new DAGs appear on their own; to pick up changes immediately, restart the services:
sudo systemctl restart airflow-webserver airflow-scheduler
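Before restarting, a quick syntax check on a DAG file can save a round trip to the UI. A minimal sketch using the stdlib ast module (the helper name is illustrative; note this catches syntax errors only, not missing imports, which still surface under "Import Errors" in the UI):

```python
import ast

# Illustrative pre-flight check: parse a DAG file's source to catch
# syntax errors before the scheduler tries to import it.
def dag_source_parses(source: str) -> bool:
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False

# Usage sketch:
# dag_source_parses(open("/home/ubuntu/airflow/dags/my_first_cron_dag.py").read())
```

Running `python3 /path/to/dag.py` directly goes one step further and catches import errors too.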

Example 1: Simple Cron Job DAG

File: ~/airflow/dags/my_first_cron_dag.py
from __future__ import annotations
import pendulum
from airflow.decorators import dag, task

@dag(
    dag_id="my_first_cron_dag",
    start_date=pendulum.datetime(2025, 9, 30, tz="UTC"),
    schedule="*/5 * * * *",
    catchup=False,
    tags=["example", "cron"],
)
def my_simple_cron_workflow():
    @task
    def print_a_message():
        current_time = pendulum.now("UTC")
        print(f"Hello from Airflow! The time is now: {current_time}")

    print_a_message()

my_simple_cron_workflow()
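The schedule string is a standard five-field cron expression (minute, hour, day, month, weekday); "*/5 * * * *" fires every five minutes. As a toy illustration of those semantics (this is not Airflow's scheduler, and it supports only "*", "*/n", and plain numbers):

```python
from datetime import datetime

# Toy cron matcher: checks whether a timestamp matches a 5-field cron
# expression. Fields are minute, hour, day-of-month, month, day-of-week.
def cron_matches(expr: str, dt: datetime) -> bool:
    fields = expr.split()
    values = [dt.minute, dt.hour, dt.day, dt.month, dt.isoweekday() % 7]
    for field, value in zip(fields, values):
        if field == "*":
            continue  # wildcard matches anything
        if field.startswith("*/"):
            if value % int(field[2:]) != 0:  # step values: every n units
                return False
        elif int(field) != value:  # literal value must match exactly
            return False
    return True

print(cron_matches("*/5 * * * *", datetime(2025, 9, 30, 12, 5)))  # True
print(cron_matches("*/5 * * * *", datetime(2025, 9, 30, 12, 3)))  # False
```

Note that with catchup=False, Airflow skips any schedule intervals that passed while the DAG was inactive instead of backfilling them.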

Example 2: DAG for Debugging Failures

File: ~/airflow/dags/chaotic_testing_dag.py
from __future__ import annotations
import pendulum
from datetime import timedelta
from airflow.decorators import dag, task
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

@dag(
    dag_id="chaotic_testing_dag",
    start_date=pendulum.datetime(2025, 9, 30, tz="UTC"),
    schedule="*/10 * * * *",
    catchup=False,
    tags=["example", "testing", "failure"],
)
def chaotic_workflow():
    start = EmptyOperator(task_id="start")
    
    task_that_fails_on_purpose = BashOperator(
        task_id="task_that_fails_on_purpose",
        bash_command='echo "This task will fail with exit code 1."; exit 1',
    )
    
    task_that_will_not_run = BashOperator(
        task_id="task_that_will_not_run",
        bash_command='echo "This task will not run."',
    )
    
    @task
    def task_with_a_python_exception():
        raise ValueError("This is a deliberate Python error!")
    
    task_that_will_timeout = BashOperator(
        task_id="task_that_will_timeout",
        bash_command='echo "This task will time out."; sleep 60',
        execution_timeout=timedelta(seconds=15),
    )
    
    end = EmptyOperator(task_id="end")
    
    # Call the @task-decorated function once: each call creates a separate
    # task instance, so calling it in every dependency line would duplicate it.
    python_exception = task_with_a_python_exception()

    start >> [task_that_fails_on_purpose, python_exception, task_that_will_timeout]
    task_that_fails_on_purpose >> task_that_will_not_run >> end
    python_exception >> end
    task_that_will_timeout >> end

chaotic_workflow()
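The failure modes this DAG exercises map directly onto plain subprocess behavior: BashOperator marks a task failed when the command exits nonzero, and execution_timeout behaves much like a subprocess timeout. A sketch of both in plain Python (assuming a POSIX sh is available; Airflow itself raises AirflowTaskTimeout rather than TimeoutExpired):

```python
import subprocess

# A nonzero exit code is what makes a BashOperator task fail.
result = subprocess.run(["sh", "-c", "echo 'failing'; exit 1"])
print(result.returncode)  # 1

# execution_timeout kills the task if it runs too long, much like
# subprocess's timeout= argument kills the child process here.
try:
    subprocess.run(["sh", "-c", "sleep 60"], timeout=1)
except subprocess.TimeoutExpired:
    print("timed out")
```

In the Airflow UI, the first case shows as a failed task with the exit code in its log, and the second as a failure raised partway through the task's execution.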

12. Troubleshooting Guide

Common Issues

  1. DAG not appearing in UI
    • Restart Airflow services:
      sudo systemctl restart airflow-webserver airflow-scheduler
      
    • Check for “Import Errors” in UI.
  2. systemd status shows Active: activating (auto-restart)
    • Cause: Wrong ExecStart path or User in service file.
    • Solution: Correct paths, run sudo systemctl daemon-reload.
  3. Scheduler does not appear to be running
    • Cause: Orphaned process on port 8793.
    • Solution:
      lsof -i :8793
      kill -9 <PID>
      
  4. PostgreSQL permission denied for schema public
    • Ensure airflow user owns database:
      ALTER DATABASE airflow OWNER TO airflow;
      
  5. Health check URL
    • Endpoint: /health
    • Returns 200 OK with JSON confirming metadatabase and scheduler health. Useful for AWS Target Group.
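As a sketch, a monitoring script could parse that JSON and require every component to report healthy. The exact payload shape below is an assumption based on Airflow 2.x; verify it against your own instance's /health response:

```python
import json

# Sketch of a health-check parser: the /health payload maps component
# names (e.g. "metadatabase", "scheduler") to objects with a "status" key.
def is_healthy(payload: str) -> bool:
    data = json.loads(payload)
    return all(
        component.get("status") == "healthy"
        for component in data.values()
    )

sample = '{"metadatabase": {"status": "healthy"}, "scheduler": {"status": "healthy"}}'
print(is_healthy(sample))  # True
```

An AWS Target Group health check needs only the 200 status code, but a script like this can alert on a scheduler that is up yet unhealthy.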

