quest_ga4_bigquery_queries.exe
_
×

The 10 Essential BigQuery Queries to Analyze Your GA4 Data

Ten copy-paste BigQuery queries for GA4, organized by business question, each with the schema trap to avoid and a near-zero cost tip. The 2026 cookbook.

bigquery ga4 sql reference

The GA4 BigQuery export gives you every raw event: unsampled, with no cardinality limits. It is the data the interface never quite shows you. The hard part is not the GA4 BigQuery queries themselves, it is the schema. Nested arrays, no native session_id column, timestamps in microseconds, and four different places where the traffic source can live. Most analysts give up at the first query that returns a wrong result without throwing an error.

This cookbook gives you ten copy-paste queries, organized by real business question rather than by SQL function. For each one: the query, the schema trap that silently breaks the count, and the trick to keep the cost near zero. If you have not yet enabled the export or grasped the table structure, start with the pillar guide, working with the GA4 BigQuery export: it lays the groundwork this cookbook assumes.

Three schema traps to know before any query

Before the first query, internalize these three facts. They explain 90% of the gaps between BigQuery and the GA4 interface.

There is no session_id. GA4 stores no unique session identifier. You rebuild it by concatenating user_pseudo_id with the ga_session_id event parameter. On its own, ga_session_id repeats across users and inflates your counts.

Timestamps are in microseconds. event_timestamp is not in seconds or milliseconds. To get a readable date, wrap it in TIMESTAMP_MICROS(event_timestamp). Skip this and you get dates in 1970 or nonsensical duration math.

Traffic source lives in four places. This is the trap that costs the most in wrong analyses. Depending on the question, you read a different field.

LocationScopeWhen to use it
traffic_sourceUser, frozen at first visitInitial acquisition (the channel that first brought the user in)
collected_traffic_sourceEvent, values collected with the eventRaw source at the exact moment of the event
session_traffic_source_last_clickSession, last non-direct clickAligning with the session reports in the GA4 interface
event_params (source, medium, campaign)Event, legacy implementationsOlder accounts, avoid on anything new

The rule of thumb: traffic_source for the channel that starts the journey, session_traffic_source_last_click for the channel that closes it. Confusing the two is the single most common attribution mistake on BigQuery.

In every query below, replace project.dataset with your real identifier and adjust the _TABLE_SUFFIX date window.

1. Active, new and returning users

The first question of any report: how many people, and how many are new.

SELECT
  COUNT(DISTINCT user_pseudo_id) AS active_users,
  COUNT(DISTINCT IF(event_name = 'first_visit', user_pseudo_id, NULL)) AS new_users,
  COUNT(DISTINCT user_pseudo_id)
    - COUNT(DISTINCT IF(event_name = 'first_visit', user_pseudo_id, NULL)) AS returning_users
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626';
-- active_users: 18,240 | new_users: 11,905 | returning_users: 6,335

The trap. A new user is defined by a first_visit event inside the window, not by their first row in the table. And user_pseudo_id is a device or browser identifier, not a person: the same visitor on two devices counts twice, and a cleared cookie creates a new user.

The cost. Always filter on _TABLE_SUFFIX and select only the columns you need. Here, BigQuery scans only user_pseudo_id and event_name, not the whole table.

2. Counting sessions correctly

With no native session_id, a session has to be rebuilt. This is the only reliable way.

SELECT
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')
  )) AS sessions
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626';
-- sessions: 24,117

The trap. Never count ga_session_id on its own: that number (a session-start timestamp) repeats from one user to the next. The valid session key is always user_pseudo_id concatenated with ga_session_id.

The cost. The UNNEST(event_params) inside a correlated subquery stays cheap because you only scan one nested column. Reuse this pattern everywhere instead of flattening the whole table.

3. Most viewed pages and engagement

The content classic, with an engagement measure next to raw volume.

SELECT
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page,
  COUNTIF(event_name = 'page_view') AS views,
  ROUND(SUM(
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec')
  ) / 1000, 0) AS engagement_seconds
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626'
GROUP BY page
ORDER BY views DESC
LIMIT 20;

The trap. page_location holds the full URL, UTM parameters included. The same page then shows up across dozens of rows. Normalize with REGEXP_REPLACE or SPLIT(page, '?')[OFFSET(0)] if you want to aggregate by path.

The cost. Two subqueries on event_params, each on a single nested column. The LIMIT does not reduce the scan: the _TABLE_SUFFIX is what does.

4. The channel that starts versus the channel that closes

The attribution question that separates experts from copy-paste. Here, last non-direct click at session level.

SELECT
  session_traffic_source_last_click.manual_campaign.source AS source,
  session_traffic_source_last_click.manual_campaign.medium AS medium,
  session_traffic_source_last_click.cross_channel_campaign.default_channel_group AS channel,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')
  )) AS sessions
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626'
GROUP BY 1, 2, 3
ORDER BY sessions DESC;

The trap. The session_traffic_source_last_click field has only existed in the export since July 2024: on earlier dates it will be empty. For the channel that starts the user (first touch), read traffic_source.source and traffic_source.medium instead, frozen at first visit. The two answer different questions, so do not mix them in the same column.

The cost. These fields are native columns (not arrays), so they are very cheap to read. No UNNEST is needed for the session source.

5. E-commerce funnel and drop-off

From product view to purchase, in one query.

SELECT
  COUNTIF(event_name = 'view_item') AS view_item,
  COUNTIF(event_name = 'add_to_cart') AS add_to_cart,
  COUNTIF(event_name = 'begin_checkout') AS begin_checkout,
  COUNTIF(event_name = 'purchase') AS purchase,
  ROUND(SAFE_DIVIDE(
    COUNTIF(event_name = 'purchase'),
    COUNTIF(event_name = 'view_item')
  ) * 100, 1) AS view_to_purchase_pct
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626';
-- view_item: 52,300 | add_to_cart: 14,800 | begin_checkout: 6,020 | purchase: 3,110

The trap. These COUNTIF count events, not unique sessions or users. One visitor who views ten products inflates view_item. For a strict sequential funnel (each step counted only if the previous one happened), you have to count distinct sessions per step, step by step.

The cost. A single pass over event_name. This is one of the cheapest queries in the cookbook.

6. Revenue by source and product category

The commercial detail lives in the items array, which has to be unnested.

SELECT
  session_traffic_source_last_click.manual_campaign.source AS source,
  item.item_category AS category,
  ROUND(SUM(item.item_revenue), 2) AS revenue,
  SUM(item.quantity) AS quantity
FROM `project.dataset.events_*`,
  UNNEST(items) AS item
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626'
  AND event_name = 'purchase'
GROUP BY 1, 2
ORDER BY revenue DESC;

The trap. items is an array: the UNNEST creates one row per product. So never sum ecommerce.purchase_revenue (event level) after unnesting items, or you multiply revenue by the number of items in the cart. For revenue per product, use item.item_revenue; for revenue per order, stay at event level with no UNNEST.

The cost. The UNNEST(items) stays local to each row, and the event_name = 'purchase' filter already cuts the volume processed sharply.

7. Retention and simple cohorts

How many users come back N days after their first visit.

WITH first_visit AS (
  SELECT
    user_pseudo_id,
    MIN(PARSE_DATE('%Y%m%d', event_date)) AS cohort
  FROM `project.dataset.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260501' AND '20260626'
  GROUP BY 1
),
activity AS (
  SELECT DISTINCT
    user_pseudo_id,
    PARSE_DATE('%Y%m%d', event_date) AS day
  FROM `project.dataset.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20260501' AND '20260626'
)
SELECT
  f.cohort,
  DATE_DIFF(a.day, f.cohort, DAY) AS day_n,
  COUNT(DISTINCT a.user_pseudo_id) AS users
FROM first_visit f
JOIN activity a USING (user_pseudo_id)
GROUP BY 1, 2
ORDER BY 1, 2;

The trap. The cohort is based on the first activity observed in the queried window, not the true first visit if your history is truncated. Widen the first_visit CTE window or retention will be overstated. And a user_pseudo_id that changes (cleared cookies) reappears as a new user, which dilutes the curves.

The cost. Two scans of the same table range. If you run it often, materialize the daily activity table once, then query that.

8. Path and event sequence before conversion

The actual path taken inside converting sessions.

SELECT
  user_pseudo_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id,
  STRING_AGG(event_name, ' > ' ORDER BY event_timestamp) AS path
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626'
GROUP BY 1, 2
HAVING path LIKE '%purchase%'
LIMIT 100;
-- example: session_start > page_view > view_item > add_to_cart > purchase

The trap. Ordering relies on event_timestamp (microseconds), which is reliable, except when two events share the exact same timestamp, where the order becomes undefined. In that case, break the tie with event_bundle_sequence_id. Note too that STRING_AGG truncates very long paths beyond the string size limit.

The cost. The GROUP BY on two keys and the LIMIT 100 do not prevent a full scan of the window: keep the date range tight while exploring.

9. Key events and conversion rate by device and country

The conversion rate, segmented, where the gaps hide.

SELECT
  device.category AS device,
  geo.country AS country,
  COUNTIF(event_name = 'purchase') AS conversions,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')
  )) AS sessions,
  ROUND(SAFE_DIVIDE(
    COUNTIF(event_name = 'purchase'),
    COUNT(DISTINCT CONCAT(
      user_pseudo_id,
      (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')
    ))
  ) * 100, 2) AS conversion_rate_pct
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626'
GROUP BY 1, 2
ORDER BY conversions DESC;

The trap. In GA4, “conversions” are now called key events. The export does not flag a “this is a key event” column: replace 'purchase' with the names of your own key events. device.category and geo.country are native columns, readable without UNNEST.

The cost. The subqueries on ga_session_id are repeated: factor them into a CTE if the query runs often, for readability more than for the scan.

10. Quality control: duplicates and unattributed traffic

The query nobody runs, and the one that saves every analysis after it.

SELECT
  event_name,
  COUNT(*) AS occurrences,
  COUNT(*) - COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    event_name,
    CAST(event_timestamp AS STRING),
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')
  )) AS estimated_duplicates
FROM `project.dataset.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260601' AND '20260626'
GROUP BY event_name
ORDER BY estimated_duplicates DESC;

The trap. Duplicates (same user, same event, same timestamp) often reveal a double-firing tag. As for (not set) values and unassigned traffic, they come from parameters missing at collection time, not from BigQuery. To fix them at the source, cross-check with the GA4 audit and its 11 configuration mistakes and the 7 data layer mistakes that skew your data.

The cost. A single aggregated pass per event_name: cheap, and worth scheduling as a weekly check.

Keeping the cost near zero

BigQuery bills per byte scanned, not per query. Four habits are enough to cut the bill tenfold.

Always filter on _TABLE_SUFFIX so you scan only the events_* tables for the period you need. Never run SELECT *: every column you skip is money saved, and the GA4 schema is wide. Preview the scanned volume before running (the BigQuery editor’s estimator shows it at no charge). Finally, for recurring queries, build views or intermediate tables (a sessions table, a daily activity table) instead of re-aggregating the raw data every time. The pillar guide covers how to build a reusable sessions view.

Reconciling with the interface: why the numbers differ

BigQuery and the GA4 interface will almost never give exactly the same number, and that is normal. The interface applies sampling on large explorations, confidentiality thresholds, and its own modeling, while BigQuery hands you the raw data. A gap of a few percent is expected; a massive gap signals a definition problem (session, key event) on your side.

One cause of discrepancy is growing in 2026: Consent Mode. Since its tightening, a share of non-consented traffic is no longer fully exported to BigQuery, which truncates the volumes. Before concluding it is a bug, check the consent impact with our analysis of Consent Mode v2 in GA4.

Going further

You now have a reliable SQL toolkit that answers real questions without falling into the classic counting mistakes. The logical next step: wire these queries into an automated dashboard with Looker Studio and BigQuery, so you never run them by hand again.

And if you let an AI generate your SQL, this cookbook becomes your safety net: keep these reference queries to verify and correct whatever the assistant produces. That is exactly the method described in Claude Code for the data analyst: querying GA4 and BigQuery. The machine writes fast; you are the one who knows where the schema traps are.