---
name: conjoint-design
description: "Design conjoint experiments: attributes, power, AMCE/AMIE estimation."
argument-hint: "[describe your design question or paste attribute table]"
---

# Conjoint Design Expert

## Instructions

> Worked example (attribute table → power calculation → PAP tier assignment): see `reference/example.md`.

### 1. Attribute Architecture
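- **Orthogonality:** Randomize each attribute independently of every other attribute so that the causal effect of each component can be estimated separately. Any departure from full independence (nesting or level restrictions, below) should be deliberate and documented.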
- **Randomization of Order:** Order attributes randomly at the *respondent level* (not the task level) to prevent "primacy" or "recency" effects while avoiding the cognitive overload of finding information in different orders across tasks (Stantcheva 2023). A specific logical flow may override this if theoretically required.
- **D-Optimal Designs:** Consider D-optimal or constrained randomization schemes rather than pure randomization. D-optimal designs choose the sets of administered conditions that maximize statistical power and may be preferable when the number of possible attribute combinations is large relative to the sample size (Auspurg & Hinz 2015; Stantcheva 2023).
- **Attribute Density:** Monitor for respondent fatigue. Stefanelli and Lukac (2020) cite evidence that conjoint results remain stable with up to 10 attributes; Bansak et al. (2018) find that response quality does not degrade with up to 30 *tasks* on MTurk, and Bansak et al. (2021, "Beyond the Breaking Point") extend this to the *attribute* dimension, reporting stability at many attributes. These are the canonical sources for the task- and attribute-count claims respectively. Still evaluate whether the complexity of the levels increases cognitive load beyond the attribute count alone.
- **Nested/Constrained Randomization:** Not all attributes need to be fully crossed. When ecological validity demands it, certain attribute levels can be linked or nested within other attributes (e.g., origin countries nested within policy domain). This is acceptable when: (a) the nesting is theoretically justified, (b) the primary attributes of interest remain fully independently randomized, and (c) the analyst acknowledges that nested attributes cannot be cleanly separated from their parent attribute. See Auspurg & Hinz (2015) on restricted randomization in factorial surveys.
- **Attribute-Level Restrictions:** Implausible combinations can be excluded when they would confuse respondents or produce artifactual responses, but this is a judgment call, not a mandate. Eye-tracking evidence from Bansak & Jenke (2025) shows that odd (incongruent or nonsensical) attribute combinations have minimal, inconsistent effects on respondent attention, search, and choice, so decisions to include or exclude them should be driven primarily by statistical, substantive, and theoretical considerations (e.g., whether the randomization distribution should reflect a real-world target distribution per De la Cuesta, Egami & Imai 2022) rather than by assumed cognitive-burden concerns. Document all restrictions in the pre-analysis plan.
- **Medium-Level Specificity:** Attribute levels should be concrete enough to be meaningful but not so specific that they introduce unintended confounds. Describe treatments at a "medium level of specificity" -- "fully described but not overly described" (Sniderman 2018). Avoid vague descriptions (e.g., "a policy that helps the economy") and overly narrow ones (e.g., "a $2.3B infrastructure bill for Route 95 in Pennsylvania").
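
The randomization rules above can be sketched in a few lines. This is a minimal illustration in Python, not a survey-platform implementation: the attribute table, the example restriction, and the helper names (`is_allowed`, `draw_profile`, `build_tasks`) are all hypothetical and chosen only to show respondent-level attribute ordering combined with independent level draws and one documented restriction.

```python
import random

# Hypothetical attribute table (names and levels are illustrative only).
ATTRIBUTES = {
    "education": ["none", "high school", "college"],
    "occupation": ["janitor", "nurse", "doctor"],
    "language": ["fluent", "broken", "none"],
}


def is_allowed(profile):
    """Example restriction (an assumption): no doctors without a college degree."""
    return not (profile["occupation"] == "doctor" and profile["education"] != "college")


def draw_profile(rng):
    """Draw every attribute level independently and uniformly; redraw if restricted."""
    while True:
        profile = {name: rng.choice(levels) for name, levels in ATTRIBUTES.items()}
        if is_allowed(profile):
            return profile


def build_tasks(respondent_id, n_tasks=5, seed=0):
    """Return paired-profile tasks with one attribute order fixed per respondent."""
    rng = random.Random(f"{seed}-{respondent_id}")
    order = list(ATTRIBUTES)
    rng.shuffle(order)  # shuffled once per respondent, then reused for every task
    tasks = []
    for _ in range(n_tasks):
        pair = [draw_profile(rng) for _ in range(2)]
        # Store each profile as (attribute, level) rows in the respondent's order.
        tasks.append([[(name, p[name]) for name in order] for p in pair])
    return order, tasks


order, tasks = build_tasks(respondent_id=17)
print(order)
print(tasks[0])
```

Redrawing restricted profiles means the restricted attributes are no longer independent of each other, which is why the restriction belongs in the pre-analysis plan and has to be accounted for at the estimation stage.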

### 2. Statistical Power and Error Logic
- **Effective N (N_eff):** Calculate sample size based on (Respondents $\times$ Tasks $\times$ Profiles). Throughout this section, N_eff refers to this effective number of profile evaluations. However, respondents and tasks are *not interchangeable* -- adding respondents improves precision more than adding tasks per respondent due to within-respondent correlation. When in doubt, prioritize more respondents over more tasks (Stefanelli and Lukac 2020).
- **Closed-Form Formula:** The standard error of an AMCE is approximately $\text{SE} \approx \sqrt{\text{Var}(Y) \times L / N_{\text{eff}}}$, where $L$ is the number of levels for the attribute and $N_{\text{eff}}$ is as defined above (Schuessler and Freitag 2020). This provides a quick diagnostic for whether precision is adequate.
- **Interaction Power:** Estimating interaction effects requires approximately *twice* the sample size needed for main AMCEs, in the canonical balanced two-level-by-two-level case; the exact multiplier scales with the number of levels on each interacting attribute. The standard error of an interaction coefficient is approximately $\sqrt{2}$ times the SE of the corresponding main effect in that balanced case (Schuessler and Freitag 2020). Budget accordingly when interaction hypotheses are confirmatory.
- **Empirical AMCE Benchmarks:** Typical AMCEs in published conjoint studies range from 0.02 to 0.10 on the probability scale (i.e., 2 to 10 percentage-point changes in selection probability), with a median around 0.05 (Stefanelli and Lukac 2020). Very large AMCEs (> 0.15) are rare. Use these benchmarks when setting the smallest effect size of interest (SESOI) if no prior data are available.
- **Minimum Detectable Effect (MDE):** Set the MDE based on the attribute with the highest number of levels, as that attribute's levels will be the most difficult to estimate precisely. Report whether the MDE falls within the range of plausible AMCEs given prior literature.
- **Type S and Type M Errors:** When power is low, beware of "Type S" (Sign) errors (getting the direction wrong) and "Type M" (Magnitude) errors (exaggerating the effect size). At 50% power for a true effect of d = 0.5, the probability of a Type S error is approximately 1/18, and the expected Type M error (exaggeration ratio) is approximately 1.5 (Lakens 2025, citing Gelman and Carlin 2014).
- **Low-N_eff Danger Zone (rule of thumb):** Designs with fewer than ~3,000 effective profile evaluations are at high risk of being underpowered for detecting typical AMCE magnitudes (0.02–0.05). This threshold is a pragmatic rule of thumb derived from plugging median-AMCE benchmarks into the Schuessler and Freitag (2020) formula, not a research finding; adjust based on the expected AMCE magnitude, number of levels, and design. Below this threshold, conduct an explicit sensitivity analysis showing what effects *can* be detected.
- **Levels-Power Tradeoff:** Each additional level for an attribute reduces the effective number of observations per level. As an illustration, going from 4 to 5 levels reduces per-level N by about 20%, with a corresponding precision loss (Schuessler and Freitag 2020). Only add levels when each is theoretically necessary.
- **Multiple Testing:** Conjoint designs estimate many AMCEs simultaneously, which inflates the family-wise false-positive rate. Even under the sharp null of no effects, a typical conjoint pipeline produces at least one significant AMCE in more than 90% of experimental trials (Liu and Shiraito 2023); this mirrors the broader "garden of forking paths" problem (Gelman and Loken 2014) and the undisclosed-flexibility findings of Simmons, Nelson, and Simonsohn (2011), which motivate pre-specification and correction. Pre-specify a correction method: Bonferroni (most conservative, guards against false positives), Benjamini-Hochberg (controls FDR, most lenient), or adaptive shrinkage (balanced; preferred in Liu and Shiraito's simulations). Report both corrected and uncorrected results for confirmatory hypotheses.
- **Cohen's d Warning:** Do not use Cohen's d benchmarks (small = 0.2, medium = 0.5, large = 0.8) to calibrate conjoint power analyses. AMCEs are measured in percentage-point changes in choice probability, not in standard deviation units. Translating between the two requires knowing Var(Y), which depends on the choice task structure.
- **Tools:** Match the tool to the task:
  - *Power analysis:* the `cjpowR` R package (Freitag 2021) or the associated Shiny app for simulation-based power analysis. These allow specification of the number of attributes, levels, tasks, and profiles, and return power curves for main effects and interactions.
  - *Declare-diagnose-redesign workflow:* the `DeclareDesign` framework (Blair, Cooper, Coppock, and Humphreys 2019), which couples the closed-form formula with design-based simulation across estimands, diagnosands, and assignment schemes.
  - *Interaction analysis:* `FindIt` (Egami and Imai 2019).
  - *Heterogeneity detection:* `cjbart` (Robinson and Duch 2024) or the Bayesian mixture-of-regularized-regressions approach in Goplerud, Imai, and Pashley (2025).
  - *Lexicographic preference ranking:* `cjRank` (Dill, Howlett, and Müller-Crepon 2024).
  - *Assumption-free tests of whether a factor matters at all:* `CRTConjoint` (Ham, Imai, and Janson 2024).
  - *Adaptive focal/context deployment:* the Docker container at github.com/dmolitor/adaptive-infra, with replication scaffolding at github.com/jennahgosciak/adaptive_conjoint (Gosciak, Molitor, and Lundberg 2026); standard survey platforms (Qualtrics) do not support continuous Thompson-sampling updates.
- **Compromise Power Analysis:** When the respondent pool is fixed (e.g., hard-to-reach populations, elite samples), use a compromise power analysis that balances Type I and Type II error rates. An alpha > 0.05 may be defensible when it minimizes the combined error rate (Lakens 2025).
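
As a quick diagnostic, the closed-form formula above can be evaluated directly. The sketch below is illustrative Python rather than a substitute for `cjpowR` or `DeclareDesign`. Several pieces are assumptions not taken from the cited sources: `var_y=0.25` is the worst-case variance of a binary choice outcome, the MDE uses the conventional two-sided normal-approximation multiplier, and the function names and the example design are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def amce_se(n_respondents, n_tasks, n_profiles, n_levels, var_y=0.25):
    """Approximate AMCE standard error: sqrt(Var(Y) * L / N_eff)."""
    n_eff = n_respondents * n_tasks * n_profiles
    return sqrt(var_y * n_levels / n_eff)


def mde(se, alpha=0.05, power=0.80):
    """Smallest effect detectable with the given power in a two-sided z-test."""
    return (_Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)) * se


def required_n_eff(target_amce, n_levels, var_y=0.25, alpha=0.05, power=0.80):
    """Invert the two formulas above to get the N_eff needed for a target AMCE."""
    multiplier = _Z.inv_cdf(1 - alpha / 2) + _Z.inv_cdf(power)
    return ceil(var_y * n_levels * (multiplier / target_amce) ** 2)


# Hypothetical design: 1,000 respondents x 5 tasks x 2 profiles, 4-level attribute.
se_main = amce_se(1000, 5, 2, n_levels=4)
print(f"SE (main effect)   : {se_main:.4f}")
print(f"MDE at 80% power   : {mde(se_main):.4f}")
print(f"SE (interaction)   : {sqrt(2) * se_main:.4f}")  # sqrt(2) rule from the text
print(f"N_eff for AMCE=0.05: {required_n_eff(0.05, n_levels=4)}")
```

Under these assumptions, a four-level attribute needs a little over 3,000 effective profile evaluations to detect the median AMCE of 0.05 at 80% power, which is where the rule-of-thumb threshold above comes from. The calculation treats profile evaluations as independent, so it is optimistic whenever within-respondent correlation is non-trivial.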

### 3. Treatment Validation and Realism
- **Identify the DGP Before Attributes:** Before specifying attributes, articulate the data-generating process implied by the theory: which component of the compound profile bears the causal effect of interest, and under what assumptions about how respondents bundle or separate information. This aligns with the broader project principle of DGP identification and with the "Why → If-Then" funnel developed in the `hypothesis-building` skill.
- **Experimental vs. Mundane Realism:** Distinguish between *experimental realism* (does the task engage respondents psychologically?) and *mundane realism* (does the task resemble real-world decisions?). Mundane realism is neither necessary nor sufficient for validity -- what matters is that the treatment creates the intended psychological state (Druckman 2022, citing Aronson and Carlsmith 1968). For conjoints, tabular displays may lack mundane realism but achieve experimental realism if respondents attend to and process the information.
- **Attention and Salience as Generalizability Levers:** Even when internal validity is secured, survey-experimental effects may amplify or reverse real-world effects because the survey environment compresses the consideration set (attention) and distorts the relative weights on attributes (salience). Fu and Li (2024) formalize these two mechanisms: consideration-set compression tends to inflate AMCE magnitudes, and context-dependent salience can flip effect signs. Audit a conjoint against both: does the table force attention to attributes respondents would ignore in the real-world analog, and are the attributes displayed with relative salience that mirrors the target environment?
- **Information Availability, Access, and Processing:** Validate that respondents (1) have *access* to the attribute information (can they see it?), (2) *attend* to it (do they read it?), and (3) *process* it as intended (do they interpret it the way the researcher assumes?). Attention checks and comprehension probes address conditions 1-2; pilot studies and cognitive interviews address condition 3 (Druckman 2022).
- **Names-as-Cues Warning:** When attributes include proper names, cultural referents, or country names, these may carry unintended associations beyond the dimension of interest. Pilot-test whether respondents associate additional meanings with the selected names or labels.
- **Pretreatment Mock Vignette:** Consider presenting respondents with a non-experimental practice vignette before the conjoint block to familiarize them with the task format. This reduces learning effects across early tasks.
- **Repeated Task for IRR Estimation:** At the design stage, plan to repeat the first conjoint task at the end of the block with left/right profile order reversed. This provides a direct estimate of intra-respondent reliability (IRR) for measurement error correction (Clayton et al. 2023). Clayton et al. report ~75% intra-respondent agreement on identical tasks across eight replicated conjoint studies and find that respondents generally do not detect the repetition at rates that bias the IRR estimate. The repeated task costs one additional task per block but enables bias-corrected AMCEs and marginal means via the `projoint` R package. This is a design decision: it cannot be retrofitted after data collection.
- **Assumption-Free Factor Tests:** Complement the standard Hainmueller, Hopkins, and Yamamoto (2014) diagnostics with the conditional randomization test in `CRTConjoint` (Ham, Imai, and Janson 2024), which provides an assumption-free test of whether a factor of interest matters in *any* way given the other factors, and tests for profile-order, carryover, and fatigue effects. This is especially valuable when AMCE-based confidence intervals are narrow and contain zero — a narrow AMCE CI implies a weak *marginal* effect, not necessarily a weak *total* causal effect.
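
To make the repeated-task logic concrete, the sketch below computes IRR from first-task and repeated-task choices and shows how a reliability estimate could feed an attenuation correction. It is illustrative Python under a stated assumption: choices are flipped independently with probability `tau` on each administration of a task. That simple swap-error model is an assumption of this sketch and is not taken from this document; `projoint` is the reference implementation for the actual estimator. All function names are hypothetical.

```python
from math import sqrt


def intra_respondent_reliability(choices):
    """Share of respondents who pick the same profile on the repeated task.

    `choices` holds (first, repeat) pairs coded "left"/"right". Because the
    repeated task swaps profile positions, picking the same profile means
    picking the opposite side.
    """
    flipped = {"left": "right", "right": "left"}
    agreements = [flipped[first] == repeat for first, repeat in choices]
    return sum(agreements) / len(agreements)


def swap_error(irr):
    """Solve IRR = (1 - tau)^2 + tau^2 for the per-task swap-error rate tau."""
    if irr <= 0.5:
        raise ValueError("IRR must exceed 0.5 for the correction to be defined")
    return (1 - sqrt(2 * irr - 1)) / 2


def corrected_amce(amce, tau):
    """Undo the attenuation factor (1 - 2 * tau) implied by the swap-error model."""
    return amce / (1 - 2 * tau)


# Toy data: three of four respondents chose the same profile both times.
toy = [("left", "right"), ("left", "left"), ("right", "left"), ("right", "left")]
irr = intra_respondent_reliability(toy)
tau = swap_error(irr)
print(irr, round(tau, 4), round(corrected_amce(0.05, tau), 4))
```

The point of the sketch is the direction of the bias: under this model, random swapping shrinks observed AMCEs toward zero, so an IRR near 75% implies that uncorrected estimates understate the underlying effects.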

### 4. Estimating Effects
- **Reference Categories:** Clearly identify the "baseline" or "reference" level for every attribute.
- **SESOI in AMCE Terms:** For every confirmatory hypothesis, state the smallest meaningful AMCE -- the smallest percentage-point change in choice probability that would be theoretically or practically significant. If 2 percentage points is considered trivially small, state this as the lower bound. Justify the SESOI based on prior conjoint studies, theoretical significance, or policy relevance (Lakens 2025).
- **Average Marginal Component Effects (AMCE):** Frame results as the average change in the probability of being selected when an attribute changes from the reference level to the level of interest. Note that the AMCE "critically relies upon the distribution of the other attributes" used for averaging (De la Cuesta, Egami, and Imai 2022) -- this is typically the uniform distribution imposed by the randomization, which may not match real-world attribute distributions. When the target of inference is a specific real-world or counterfactual distribution, use design-based weighting (e.g., via `factorEx`) rather than the default uniform AMCE. Define the estimand explicitly — unit-specific quantity, target population, and aggregation — before selecting an estimator, per the estimand-first framework of Lundberg, Johnson, and Stewart (2021).
- **AMCE Interpretation Guard-Rail (no majoritarian claims):** The AMCE does *not* identify the share of respondents who prefer a given feature. It aggregates both the *direction* and the *intensity* of preferences, so statements of the form "voters prefer X" or "Americans prefer Y" are **not supported** by a statistically significant AMCE unless additional (strong) assumptions hold, e.g., uncorrelated direction and intensity (Abramson, Koçak, and Magazinnik 2022). Ganter (2023) sharpens the same point: AMCE identifies a population-level effect on *choice probability*, not a parameter of the underlying preference distribution, and conflating the two has produced widespread interpretive overreach. Valid interpretations include: change in expected vote share over the experimenter-defined contest distribution, mapping to a Borda-rule winner, or an average of individual ideal points. If the substantive target is a majoritarian or electoral claim, consider the bounding method and structural interpretations in Abramson et al. (2022), Ganter (2023), and the outcome-variable/estimand contingency analyses in Treger (2025).
- **Marginal Means (MMs):** In addition to AMCEs, report marginal means -- the model-predicted probability that a profile is selected when a given attribute level is shown, averaged over all other attributes. MMs provide absolute levels of support rather than relative differences, and are particularly useful for subgroup comparisons (Leeper, Hobolt, and Tilley 2020).
- **Interaction Effects via AMIE:** Do *not* test for attribute-by-attribute interactions by adding product terms to a dummy-coded AMCE regression -- such interaction coefficients are **baseline-dependent artifacts** that change when the reference category changes (Egami and Imai 2019). Instead, use the Average Marginal Interaction Effect (AMIE), defined as the additional effect of an attribute combination beyond the sum of its separate AMEs: AMIE(a, b) = ACE(a, b) − AME(a) − AME(b). The AMIE is invariant to the choice of baseline condition and is the only interaction estimand interpretable across coding schemes. Estimate AMIEs using penalized ANOVA via the `FindIt` R package (`CausalANOVA()`), which simultaneously handles the high-dimensionality problem (even a modest conjoint with 5 attributes and 4 levels each generates 100+ interaction parameters) through regularization that shrinks weak interactions toward zero and collapses adjacent levels with similar effects (Egami and Imai 2019). Note: the AMIE framework applies to interactions between *randomized conjoint attributes*, not to interactions between attributes and non-randomized respondent characteristics (subgroup moderators).
- **Heterogeneity Detection via BART or Mixture Models:** To detect treatment effect heterogeneity across respondents without pre-specifying moderators, two complementary approaches are available.
  - *BART (`cjbart`):* Robinson and Duch (2024) fit a probit BART model to estimate Individual-level Marginal Component Effects (IMCEs) -- each respondent's predicted effect for each attribute level. The method introduces a three-level estimand hierarchy: OMCE (observation-level) → IMCE (individual-level, averaged across tasks) → AMCE (population-level, averaged across respondents). The IMCE distribution is the primary heterogeneity diagnostic: a tight, normal distribution centered on the AMCE suggests homogeneous effects; multimodal, skewed, or widely dispersed distributions (especially spanning both sides of zero) indicate substantive heterogeneity. Use `het_vimp()` to identify which respondent covariates most strongly partition the IMCE distribution via random forest variable importance scores.
  - *Bayesian mixture of regularized regressions:* Goplerud, Imai, and Pashley (2025) propose a finite mixture of regularized regressions that groups respondents exhibiting similar treatment-effect patterns and directly models cluster membership with covariates. This is preferable when interpretable clusters of respondents and their moderator-driven membership are of scientific interest, and when hierarchical interaction structure should be enforced.
  - Heterogeneity detection with either tool is **exploratory by default** and should be labeled as such unless moderators and directions were pre-registered (Robinson and Duch 2024).
- **Profile-Context Heterogeneity (the GML estimand):** BART, FactorHet, and CRT all reason about heterogeneity *across respondents*. Gosciak, Molitor, and Lundberg (2026) introduce a complementary estimand for heterogeneity *across the profile context space*: $\theta(\vec{x}) = \Pr(\text{choose focal} = A \mid \text{context} = \vec{x})$ and the two summary parameters $\theta(\vec{x}_{Max}), \theta(\vec{x}_{Min})$ that bound the range a focal-attribute effect can take across all combinations of the other attributes. The estimand reframes the conjoint by designating one attribute as *focal* (the causal quantity of interest) and the rest as *context* (a vector $\vec{x}$ that conditions the focal effect). Use this when the theoretical claim is configural — "the effect of X depends on the full bundle of who the immigrant/candidate/job applicant is" — rather than additive. Two-way AMIEs and the CRT can detect that heterogeneity exists but cannot localize *where* it lives; $\theta(\vec{x}_{Max})$ and $\theta(\vec{x}_{Min})$ pin down the cells that maximize and minimize the focal effect, with all other attributes held jointly fixed at $\vec{x}$. Even without an adaptive design, the estimand is computable on any standard conjoint by reading off conditional cell means, at the cost of thin cells in rare combinations of context attributes (see §5 Adaptive Randomization variant for the data-collection design that targets this estimand directly).
- **Nested Marginal Means for Lexicographic Preferences:** When the theory predicts that respondents may have categorical (lexicographic) preferences -- always vetoing profiles with a given attribute level regardless of other attributes -- standard AMCEs and marginal means can be misleading because co-occurrence rates across task pairs artificially inflate or deflate estimates. Use nested marginal means (Dill, Howlett, and Müller-Crepon 2024): (1) identify the attribute with the highest predictive power (Rank 1), (2) compute marginal means for remaining attributes restricted to tasks where the Rank 1 attribute does not vary between profiles, (3) iterate to rank remaining attributes. This reveals whether attributes that appear unimportant are genuinely so or are simply dominated by a higher-priority veto attribute. Implemented in the `cjRank` R package.
- **Equivalence Testing for Null Predictions:** When the hypothesis predicts that an attribute has *no meaningful effect* (e.g., a manipulation check attribute), use the TOST equivalence testing procedure with a pre-specified equivalence range in raw AMCE units rather than simply reporting a non-significant p-value (Lakens 2025).
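
The TOST procedure in the last bullet reduces to two one-sided tests against the equivalence bounds. The sketch below is a minimal Python illustration that assumes the AMCE estimate is approximately normal with a known (e.g., respondent-clustered) standard error; the function name `tost_amce` and the numbers in the example are hypothetical.

```python
from statistics import NormalDist


def tost_amce(estimate, se, low, high, alpha=0.05):
    """Two one-sided z-tests for equivalence within [low, high], in raw AMCE units.

    Returns the TOST p-value (the larger of the two one-sided p-values) and
    whether equivalence is declared at level alpha.
    """
    z = NormalDist()
    p_lower = 1 - z.cdf((estimate - low) / se)  # H0: true AMCE <= low
    p_upper = z.cdf((estimate - high) / se)     # H0: true AMCE >= high
    p_tost = max(p_lower, p_upper)
    return p_tost, p_tost < alpha


# Hypothetical manipulation-check attribute with a pre-specified +/- 2 pp range.
p_value, equivalent = tost_amce(estimate=0.005, se=0.008, low=-0.02, high=0.02)
print(round(p_value, 4), equivalent)

# Same estimate with a larger SE: non-significant, but not equivalent either.
p_value_wide, equivalent_wide = tost_amce(estimate=0.005, se=0.015, low=-0.02, high=0.02)
print(round(p_value_wide, 4), equivalent_wide)
```

The second call illustrates why a non-significant p-value is not evidence of absence: the imprecise estimate fails both the conventional test and the equivalence test, so it is simply uninformative.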

### 5. Design Variants
- **Paired-Choice vs. Rating:** Conjoint tasks can use forced binary choice (select one of two profiles) or rating scales (rate each profile independently). Forced choice produces a binary DV suitable for LPM estimation of AMCEs. Ratings provide continuous DVs that capture intensity. Best practice: use forced choice as the primary DV, with ratings as a secondary DV or robustness check.
- **Factorial Vignette Design:** An alternative to tabular conjoint displays. Attributes are assembled into short paragraph-form vignettes rather than presented as attribute tables. This enhances realism and respondent engagement while preserving experimental control (Auspurg & Hinz 2015). Vignettes also reduce social desirability bias because "characteristics can be smoothly hidden in stories," whereas conjoint tables make attributes highly salient (Stantcheva 2023). Note that the long-standing claim that *standard* conjoints attenuate SDB is supported but bounded: Horiuchi, Markovich, and Yamamoto (2022) test this directly and find conjoints reduce SDB on some sensitive attributes but not on others, so the SDB-mitigation argument should not be treated as automatic. Appropriate when: profiles describe complex scenarios (e.g., policy decisions) rather than simple object comparisons.
- **External Validity Evidence:** Hainmueller et al. (2015) validated conjoint and vignette designs against a real Swiss referendum on citizenship. Paired-choice designs predicted real-world voting behavior better than rating designs (Stantcheva 2023). This supports using forced choice as the primary DV, though the evidence is specific to that comparison.
- **Forced-Choice Caveat in Electoral Conjoints:** The forced-choice recommendation is not universal. Visconti and Yang (2024) show that electoral conjoints using forced choice implicitly assume no non-voters and no protest/blank votes, producing misclassification error and external-validity bias even with unbiased AMCE estimators. When respondents would plausibly abstain or cast a protest vote in the real-world analog, consider an *unforced-choice* design that offers "abstain" and "blank/null" options, or at minimum report ratings-based effects alongside forced-choice AMCEs. The abstention/non-choice option also connects to Miller and Ziegler (2024) on preferential abstention in conjoints.
- **Between-Subjects to Conjoint Conversion:** When a between-subjects vignette experiment suffers from confounds (outcomes not held constant across conditions), consider converting to a conjoint. Benefits: (a) holds outcomes constant through independent randomization, (b) specifies concrete content rather than vague descriptions, (c) gains statistical power through repeated within-respondent measurements, (d) enables estimation of interaction effects.
- **When NOT to Use a Conjoint:** Conjoints are not always the best design. Avoid conjoints when: (a) the research question concerns a single treatment dimension (a simple vignette experiment suffices), (b) the attributes cannot be independently randomized without producing incoherent profiles, or (c) the theoretical mechanism operates through holistic impression formation rather than attribute-by-attribute evaluation (Sniderman 2018).
- **Adaptive Randomization (focal/context conjoint):** Gosciak, Molitor, and Lundberg (2026) propose an adaptive variant for studies whose argument is *about one attribute*. The design is the first to port adaptive randomization (Thompson 1933; Offer-Westort, Coppock, and Green 2021; Kasy and Sautmann 2021) into conjoint analysis specifically. Inference under adaptivity has its own machinery (Hadad et al. 2021), but GML's three-phase structure sidesteps most of it by reserving a fixed-allocation validation phase for inference. Pick a focal attribute ex ante (e.g., education, race, co-ethnicity); treat all other attributes as a context vector $\vec{x}$; within a profile pair, hold $\vec{x}$ identical across both profiles so they differ only on the focal attribute. Use Thompson sampling across the discrete context space to allocate respondents toward the contexts that look most likely to maximize and minimize the focal effect. The design proceeds in three phases: (1) a *warm-up* with uniform allocation across all contexts, providing initial posterior estimates; (2) an *adaptive* phase using Thompson sampling to concentrate sample on candidate $\vec{x}_{Max}$ and $\vec{x}_{Min}$ (parallel or sequential searches; sequential is more efficient); (3) a *validation* phase with fixed allocation at the two selected contexts to defeat the winner's curse. Three practical constraints distinguish this design from a standard conjoint:
  - **One pair per respondent.** Because both profiles in a pair share the same context and differ only on the focal attribute, a respondent who saw multiple tasks would see through the design. The result is no within-respondent leverage: estimating the same surface that a standard conjoint would recover with 1,000 respondents requires roughly an order of magnitude more (GML's immigration replication used 10,000). This is in tension with Bansak et al. (2018) on choice-task counts and is the central cost of the design.
  - **Two signals per context value.** Each context value (e.g., "Europe") needs two signals (e.g., "Germany" and "Poland") so the two profiles in a pair are not literally identical. Signals are randomly permuted within pairs. Choose signal pairs that are substantively similar and that the literature does not flag as carrying differential social-desirability load; the estimand $\theta(\vec{x})$ remains identified even if signals have effects, but a low-effect signal pair makes the design easier to defend.
  - **Low-dimensional context space.** The adaptive search needs the context space to be enumerable (8-32 cells in practice). For a 7-attribute conjoint with 4 levels each, this means coarsening hard before fielding. The coarsening is a substantive design choice reviewers will challenge; document the theoretical basis. Pooling extensions (Bayesian hierarchical models on $\theta(\vec{x})$) are flagged as future work in GML.
  Use this design when: (a) the theoretical claim is configural about one focal attribute, (b) the context space can be honestly coarsened to 8-32 cells, (c) the budget supports ~5,000+ respondents, and (d) the study is fielded in one country on one platform. Avoid this design when the argument is parallel-comparative across attributes, multi-country, or budget-constrained; in those cases, retain the standard design and recover the GML estimand observationally from cell means with appropriate caveats about thin cells. Replication code: github.com/jennahgosciak/adaptive_conjoint; deployment infrastructure: github.com/dmolitor/adaptive-infra.
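
The three-phase logic of the adaptive design can be simulated in a few lines. The sketch below is a toy Beta-Bernoulli Thompson sampler in Python and is not the authors' implementation (see the repositories above for that). Everything in it is an assumption made for illustration: the uniform Beta(1, 1) priors, the phase sizes, the four-cell context space, the made-up choice probabilities in `theta`, and the function name `thompson_search`. It runs the max and min searches as two separate (parallel) simulations.

```python
import random


def thompson_search(true_theta, n_warmup=200, n_adaptive=2000,
                    n_validation=500, find="max", seed=0):
    """Simulate warm-up, adaptive, and validation phases for one search."""
    rng = random.Random(seed)
    k = len(true_theta)
    successes = [1] * k  # Beta(1, 1) prior on each cell's theta
    failures = [1] * k

    def respond(cell):
        # One simulated respondent: True if the focal-A profile is chosen.
        return rng.random() < true_theta[cell]

    def update(cell):
        if respond(cell):
            successes[cell] += 1
        else:
            failures[cell] += 1

    pick = max if find == "max" else min

    # Phase 1 (warm-up): cycle through cells for uniform allocation.
    for i in range(n_warmup):
        update(i % k)

    # Phase 2 (adaptive): draw from each posterior, field the most extreme draw.
    for _ in range(n_adaptive):
        draws = [rng.betavariate(successes[c], failures[c]) for c in range(k)]
        update(pick(range(k), key=draws.__getitem__))

    post_mean = [successes[c] / (successes[c] + failures[c]) for c in range(k)]
    selected = pick(range(k), key=post_mean.__getitem__)
    n_per_cell = [successes[c] + failures[c] - 2 for c in range(k)]

    # Phase 3 (validation): fresh respondents at the selected cell only.
    validation = sum(respond(selected) for _ in range(n_validation)) / n_validation
    return selected, n_per_cell, validation


# Hypothetical 4-cell context space with made-up choice probabilities.
theta = [0.20, 0.50, 0.55, 0.80]
cell_max, alloc_max, est_max = thompson_search(theta, find="max")
cell_min, alloc_min, est_min = thompson_search(theta, find="min", seed=1)
print(cell_max, alloc_max, round(est_max, 3))
print(cell_min, alloc_min, round(est_min, 3))
```

The validation estimate is an unweighted mean over respondents who played no part in selecting the cell, which is what protects it from the winner's curse; the adaptive-phase posterior means are used only to choose where to validate.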

### 6. PAP Flexibility for Conjoints
- **Flexible Pre-Analysis Plans:** Conjoint experiments involve many analytical choices (which attributes to interact, how to subset marginal means, which respondent-level moderators to test). The pre-analysis plan should distinguish clearly between: (a) *locked* specifications (the primary hypothesis tests, which cannot be changed), (b) *conditional* specifications (analyses that will be conducted if certain conditions are met, e.g., "if the main effect is significant, test the following interactions"), and (c) *exploratory* analyses (clearly labeled as hypothesis-generating). See the `pre-registration-writing` skill for registry selection, locked/conditional/exploratory tier templates, and contingency-tree conventions that complement the conjoint-specific guidance here.
- **Cross-Category Comparison:** When comparing effects of qualitatively different attributes (e.g., comparing the AMCE of "gender" to the AMCE of "education"), note that such comparisons are only meaningful if the levels span comparable ranges of the underlying dimension. AMCEs depend on the choice of levels -- changing which levels are included changes the AMCE.
- **Reporting Standards Integration:** Conjoint PAPs and methods sections should comply with APSA Experimental Section, JARS, and DA-RT requirements. Use the `methods-reporting` skill to audit the 45-item checklist (attribute list, randomization scheme, restrictions, sample flow, estimator, SEs, weights, and replication materials) before submission; the conjoint-specific items in this skill slot into that broader reporting scaffold rather than substituting for it.

### 7. Regression Models for Conjoint Data
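- **Baseline AMCE Model:** $Y_{itj} = \alpha_i + \beta \, \text{Attributes}_{itj} + \varepsilon_{itj}$, where $\text{Attributes}_{itj}$ is the vector of attribute-level indicators (reference levels omitted) for profile $j$ in task $t$ shown to respondent $i$. Include respondent fixed effects ($\alpha_i$) and cluster SEs by respondent. The $\beta$ coefficients are AMCEs.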
- **Interaction Models (AMIE):** Use `FindIt::CausalANOVA()` with `nway=2` (or `nway=3`) to estimate AMIEs via penalized ANOVA with weighted zero-sum constraints. Use `cv.CausalANOVA()` to select the regularization cost parameter (1-SE rule). Decompose specific combination effects with `AMIE()`. Present as AMIE matrices across factor-level combinations (Egami and Imai 2019). Do not use conventional dummy-coded product terms.
- **Cross-Group Comparison:** Estimate per-group AMCEs (e.g., per country) separately, then pool with group $\times$ attribute interactions for formal heterogeneity tests.
- **Secondary DV Replication:** Replicate all models using the continuous rating outcome as a robustness check.
- **Heterogeneity Models (BART / Mixture):** Use `cjbart::cjbart()` to fit a probit BART model, then `IMCE()` to extract individual-level effects with credible intervals. Report the AMCE alongside the IMCE standard deviation as the primary heterogeneity diagnostic. Use `het_vimp()` to identify moderator covariates. Works reliably with 500+ respondents (Robinson and Duch 2024). When interpretable respondent clusters and their moderator-driven membership are the primary target, fit the Bayesian mixture-of-regularized-regressions model in Goplerud, Imai, and Pashley (2025) as an alternative or complement.
- **Profile-Context Effect Surface (GML estimand):** When the substantive question is "where in the profile space is the focal attribute's effect largest or smallest," compute $\theta(\vec{x})$ as a cell-mean over each unique value of the context vector $\vec{x}$, with $\theta(\vec{x}) = \Pr(\text{supported} \mid \text{focal} = A, \vec{x}) - \Pr(\text{supported} \mid \text{focal} = B, \vec{x})$ in non-adaptive data, or the choice probability $\Pr(Y(\vec{x}, S, A) = A)$ in an adaptive design (Gosciak, Molitor, and Lundberg 2026). Report $\theta(\vec{x}_{Max})$ and $\theta(\vec{x}_{Min})$ as the bounds of the focal-attribute effect's range across context. With non-adaptive data, use respondent-clustered SEs and acknowledge that thin cells in rare contexts will produce wide intervals; with an adaptive design, the validation-phase estimates are unweighted means and asymptotically normal. Two-way AMIEs and the CRT (above) detect heterogeneity; this estimand localizes it. The three are complementary: AMIE gives shape (low-order interactions), CRT gives a yes/no on any heterogeneity, and $\theta(\vec{x}_{Max}), \theta(\vec{x}_{Min})$ give the range and the cells. Pre-register the focal attribute and the binary coarsening of context attributes if used as a confirmatory analysis.
- **Assumption-Free Factor Testing (CRT):** To test whether a factor matters in *any* way, including through interactions that the AMCE averages over, use the conditional randomization test in `CRTConjoint` (Ham, Imai, and Janson 2024). The same package tests the profile-order, carryover, and fatigue assumptions that underlie standard AMCE estimation.
- **Multiple Testing Correction:** Apply a pre-specified multiple-testing correction across the full family of AMCEs being tested. Report Bonferroni-corrected, Benjamini-Hochberg-corrected (FDR), or adaptive-shrinkage-corrected estimates per Liu and Shiraito (2023), along with uncorrected estimates for transparency.
- **Sensitivity Checks:** Beyond secondary DV replication, consider: specification curves across alternative reference categories, models with and without respondent fixed effects, subgroup analyses by task number to assess fatigue effects, and a "stress test" replication with escalated attribute levels when initial findings suggest ceiling or floor effects (Dill, Howlett, and Müller-Crepon 2024). Treger (2025) also recommends reporting how estimates shift under alternative outcome variables and estimands (AMCE, MM, rating-based) to characterize estimand contingency.
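
The cell-mean version of the profile-context effect surface described above can be sketched in a few lines. This is a minimal illustration for non-adaptive data only, not the authors' implementation: the row layout `(context, focal_level, supported)`, the function names, and the toy attribute levels are all hypothetical, and the sketch returns point estimates without the respondent-clustered SEs the analysis requires.

```python
from collections import defaultdict


def effect_surface(rows, level_a, level_b):
    """Return {context: theta}, where theta is the support rate under focal
    level_a minus the support rate under level_b within each context cell.

    rows: iterable of (context, focal_level, supported), with context a
    hashable tuple of (possibly coarsened) context-attribute values and
    supported coded 0/1. This layout is an assumption for illustration.
    """
    # counts[context][focal_level] = [number supported, number shown]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for context, focal, supported in rows:
        cell = counts[context][focal]
        cell[0] += supported
        cell[1] += 1

    surface = {}
    for context, by_level in counts.items():
        # A context cell is only usable if both focal levels were shown in it.
        if level_a not in by_level or level_b not in by_level:
            continue
        rate_a = by_level[level_a][0] / by_level[level_a][1]
        rate_b = by_level[level_b][0] / by_level[level_b][1]
        surface[context] = rate_a - rate_b
    return surface


def effect_range(surface):
    """Return ((x_max, theta_max), (x_min, theta_min)) over the surface."""
    x_max = max(surface, key=surface.get)
    x_min = min(surface, key=surface.get)
    return (x_max, surface[x_max]), (x_min, surface[x_min])


# Toy data with hypothetical context attributes (status, cost).
c1 = ("incumbent", "high_cost")
c2 = ("challenger", "low_cost")
c3 = ("challenger", "high_cost")  # only focal level A observed, so skipped
toy_rows = [
    (c1, "A", 1), (c1, "A", 1), (c1, "A", 1), (c1, "A", 0),
    (c1, "B", 0), (c1, "B", 0), (c1, "B", 1), (c1, "B", 0),
    (c2, "A", 1), (c2, "A", 0),
    (c2, "B", 1), (c2, "B", 0),
    (c3, "A", 1),
]
toy_surface = effect_surface(toy_rows, "A", "B")
toy_max, toy_min = effect_range(toy_surface)
```

In real data the cells that drive $\theta(\vec{x}_{Max})$ and $\theta(\vec{x}_{Min})$ are often the thinnest ones, so it is worth reporting the per-cell counts alongside the range.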

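For the multiple-testing correction above, the two standard adjustments can be written without any dependencies. This is a minimal sketch of Bonferroni and Benjamini-Hochberg adjusted p-values only; it does not implement the adaptive-shrinkage approach in Liu and Shiraito (2023), and the function names and example p-values are illustrative assumptions.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the family size, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in input order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for a family of four AMCEs.
example_p = [0.01, 0.04, 0.03, 0.20]
example_bonf = bonferroni(example_p)
example_bh = benjamini_hochberg(example_p)
```

The family passed to either function should be the one fixed in the pre-analysis plan, with the uncorrected p-values reported next to the adjusted ones.
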
## Quality Checks
- [ ] **Independence:** Is the randomization of attribute levels truly independent?
- [ ] **Power:** Was the sample size calculated using the level with the smallest predicted effect? Was the MDE checked against empirical AMCE benchmarks?
- [ ] **Respondent Priority:** Was the power analysis based on adding respondents rather than tasks when possible?
- [ ] **Interaction Budget:** If interaction hypotheses are confirmatory, was the sample size doubled relative to the main-effect requirement?
- [ ] **Baseline:** Are the reference categories for all attributes explicitly stated?
- [ ] **SESOI:** Is the smallest meaningful AMCE stated and justified for each confirmatory hypothesis?
- [ ] **Ecological Validity:** Is the handling of implausible ("odd") attribute combinations justified on statistical, theoretical, or substantive grounds (e.g., target profile distribution per De la Cuesta, Egami & Imai 2022) rather than on unsupported cognitive-burden concerns (Bansak & Jenke 2025)?
- [ ] **Treatment Validation:** Were attribute levels pilot-tested for information access, attention, and intended interpretation?
- [ ] **Nesting Documented:** If any attributes are nested or linked, is this documented and justified?
- [ ] **Interaction Pre-Specified:** Are theoretically motivated interactions specified in the pre-analysis plan? Are they estimated using AMIEs (not baseline-dependent product terms) (Egami and Imai 2019)?
- [ ] **Multiple DVs:** Is the primary DV (forced choice or rating) clearly identified, with secondary DVs labeled as robustness checks?
- [ ] **Vignette Assembly:** If using factorial vignettes, do the assembled paragraphs read coherently for all attribute combinations?
- [ ] **PAP Tiers:** Does the pre-analysis plan distinguish locked, conditional, and exploratory specifications?
- [ ] **SDB Trade-Off:** Has the choice between vignette format (lower social desirability bias, SDB) and conjoint table format (higher attribute salience) been justified (Stantcheva 2023)?
- [ ] **Randomization Level:** Is attribute order randomized at the respondent level (not the task level) to reduce cognitive overload (Stantcheva 2023)?
- [ ] **AMCE Distribution:** Is the marginalizing distribution for AMCE interpretation documented, and its correspondence to real-world attribute distributions discussed?
- [ ] **Heterogeneity Plan:** If respondent-level heterogeneity is of interest, is the detection method (BART/cjbart or pre-specified subgroups) stated, and are results labeled confirmatory or exploratory (Robinson and Duch 2024)?
- [ ] **Lexicographic Check:** If the theory predicts categorical/veto preferences, are nested marginal means used to test for lexicographic ordering (Dill, Howlett, and Müller-Crepon 2024)?
- [ ] **IRR/Repeated Task:** Is a repeated task included at design stage for intra-respondent reliability estimation and measurement error correction (Clayton et al. 2023)?
- [ ] **Multiple Testing:** Is a multiple-testing correction (Bonferroni, Benjamini-Hochberg, or adaptive shrinkage) pre-specified for the family of AMCEs being tested (Liu and Shiraito 2023)?
- [ ] **No Majoritarian Claims:** Are AMCE results framed to *avoid* majoritarian or electoral-majority claims (e.g., "voters prefer X"), per Abramson, Koçak, and Magazinnik (2022)? If such claims are central, are the bounding methods or structural interpretations from that paper applied?
- [ ] **Forced-Choice Fit:** For electoral or abstention-relevant designs, has the forced-choice vs. unforced-choice trade-off been considered (Visconti and Yang 2024; Miller and Ziegler 2024)?
- [ ] **Assumption-Free Factor Test:** Has the `CRTConjoint` (or equivalent) conditional randomization test been planned or reported for the factors of interest, along with its tests of the profile-order, carryover, and fatigue assumptions (Ham, Imai, and Janson 2024)?
- [ ] **Configural vs Additive Argument:** Is the paper's central claim configural ("the effect of X depends on the full bundle") or additive ("the effect of X is large on average")? If configural, are the estimands $\theta(\vec{x}_{Max})$ and $\theta(\vec{x}_{Min})$ reported in addition to AMCE/AMIE, or is an adaptive focal/context design considered (Gosciak, Molitor, and Lundberg 2026)?
- [ ] **Adaptive Design Trade-Offs:** If an adaptive randomization design is selected, are the one-pair-per-respondent constraint, the two-signal context construction, and the coarsening of the context space documented and justified (Gosciak, Molitor, and Lundberg 2026)?
