Significance tests are most useful for well-controlled experiments, which are usually designed, often with great care, to make the null hypothesis essentially true when the experimental manipulation has no effect. For population studies involving correlations of measured variables, it is arguable that the null hypothesis is always false, so significance testing is only of descriptive value in these cases. Other useful descriptions include means, standard errors, confidence intervals, or effect sizes in meaningful units (e.g., "the conditions differed by 1.5 points on the 7-point response scale"). Please limit the number of descriptors you use, especially in the middle of sentences.
For JDM, the convention is that "significant" means that the p-level is .05 or less. Note that "not significant" is not the same as "no effect".
Report exact p-levels, even for non-significant results. (See, for example, Wilkinson et al., 1999, American Psychologist, 54, 594-604.) Exact p-levels have descriptive value, even when we use .05 as the cutoff for claiming that a result is significant. (If all we know is that p<.05, then in fact our expectation of the unreported true p-level will decrease as sample size increases.)
P-levels alone are not sufficient to establish the reliability of a result. When power is lower (e.g., from smaller sample size), the probability of a significant result given a true effect is lower, but the probability of a significant result given no effect is still the p-level used as the cutoff. Thus, the probability of a true effect given a significant result (at a given p-level) is lower (assuming that the prior probability of a true effect is independent of power).
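The argument about power can be made concrete with Bayes' rule. The following is a minimal sketch in R; the prior probability of .5 and the two power values are assumptions chosen only for illustration, and the function name is hypothetical.

```r
# Probability that an effect is real, given a significant result (Bayes' rule).
# prior: assumed prior probability of a true effect (illustrative value)
# power: P(significant | true effect); alpha: P(significant | no effect)
prob_true_given_sig <- function(prior, power, alpha = 0.05) {
  (power * prior) / (power * prior + alpha * (1 - prior))
}

high_power <- prob_true_given_sig(prior = 0.5, power = 0.8)
low_power  <- prob_true_given_sig(prior = 0.5, power = 0.2)
print(round(c(high_power = high_power, low_power = low_power), 3))
```

With the same cutoff and the same prior, the significant result from the low-power study is less likely to reflect a true effect.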
More generally, p-values often result from the garden of forking paths, and often this is apparent from the paper itself, e.g., when the main result depends on an analysis other than the most obvious one.
You should also look for reasons why your model is inappropriate, such as quadratic trends in the data. Note, however, that a linear model may still be more powerful than an ordinal one, even if its assumptions are not quite met. For example, in the developmental study just described, you might find a ceiling effect, so that ages 13 and 17 do not differ. This would violate the assumption of linearity, but you may still have no reason to think that the true function is an inverted U, and a simple linear regression may still be the most powerful test.
If you have a one-tailed hypothesis, such as "increase with age" rather than "increase or decrease", then you lose power by doing a two-tailed test. It is no more "conservative" to do a two-tailed test for a one-tailed hypothesis than to treat ages as names when they are really numbers. If you want to be conservative, use a lower p-level to call something significant. Most experimental hypotheses in JDM are one-tailed. For example, when we do a de-biasing experiment, the hypothesis of interest is that the bias is reduced. If it increases, our hypothesis is just wrong, just as if the bias failed to change at all.
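As a rough illustration of the power difference, here is a sketch in R using the `alternative` argument of `t.test`. The bias scores are made up for the example, and the direction of the hypothesis (bias is lower after de-biasing) is assumed to have been specified before looking at the data.

```r
# Hypothetical bias scores in a control and a de-biasing condition
control  <- c(6, 6, 7, 8, 8)
debiased <- c(4, 5, 6, 6, 7)

# Two-tailed test: "the conditions differ in either direction"
two_tailed <- t.test(control, debiased, alternative = "two.sided")$p.value

# One-tailed test: "bias is higher in the control condition"
one_tailed <- t.test(control, debiased, alternative = "greater")$p.value

print(c(two_tailed = two_tailed, one_tailed = one_tailed))
```

When the observed difference is in the predicted direction, the one-tailed p-level is half the two-tailed one, so a result can pass the .05 cutoff with one test and not the other.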
But a split does not always lead to a higher p-value. By chance, a median split, or some other split, will occasionally lead to a lower p-value (as in the article just mentioned). It is thus possible to get a significant result with a median split that would not be significant with the most powerful test. Choices in data analysis are inevitable, and this fact allows "p-hacking", that is, trying out different sets of choices until one yields the desired result. Such a procedure, sometimes easy to rationalize in hindsight, subverts the test of significance. The requirement that researchers report what could be argued to be the single most powerful test limits the possible range of such snooping, and reduces readers' suspicion, justified or not.
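Both points (a split usually costs power, yet occasionally yields the lower p-value) can be checked by simulation. This is only a sketch in R under assumed conditions: normal data, a true slope of 0.3, and 40 observations per simulated study, none of which come from the article mentioned.

```r
set.seed(1)
n_sims <- 2000   # number of simulated studies (assumption)
n <- 40          # observations per study (assumption)

sig_continuous <- logical(n_sims)
sig_split <- logical(n_sims)
split_wins <- logical(n_sims)

for (i in seq_len(n_sims)) {
  x <- rnorm(n)
  y <- 0.3 * x + rnorm(n)                 # true linear effect of x on y
  p_cont <- cor.test(x, y)$p.value        # test using x as a number
  high <- x > median(x)                   # median split of x
  p_split <- t.test(y[high], y[!high])$p.value
  sig_continuous[i] <- p_cont < 0.05
  sig_split[i] <- p_split < 0.05
  split_wins[i] <- p_split < p_cont       # split gave the lower p-value
}

power_continuous <- mean(sig_continuous)
power_split <- mean(sig_split)
share_split_lower <- mean(split_wins)
print(c(power_continuous = power_continuous,
        power_split = power_split,
        share_split_lower = share_split_lower))
```

The last number is the proportion of simulated studies in which the median split happened to give the lower p-value, which is the opening that p-hacking exploits.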
The same argument applies to other sorts of categorization.
Outliers are a different matter. Sometimes you must do something with them. It depends on where you think they come from. Sometimes (as discussed in the next paragraph) they go away with an appropriate transformation of the data. Sometimes they result from obvious typos, which you can correct before you do any other analyses. Often it makes sense to winsorize them (that is, set them equal to the closest non-outlier value) or, as a last resort, to use a rank-order test. Sometimes they are just nonsense, and it is best to omit them. Whatever you do, say what you did. And if possible, say whether it matters or not.
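One way to set outliers equal to the closest non-outlier value is sketched below in R. The function name and the cutoffs that define an outlier are assumptions for this example; how to define an outlier is a substantive decision that the code does not make for you.

```r
# Replace values outside [lower, upper] with the closest value inside the bounds.
# The bounds are supplied by the analyst (hypothetical cutoffs here).
winsorize_to_nearest <- function(x, lower, upper) {
  inside <- x[x >= lower & x <= upper]
  x[x < lower] <- min(inside)
  x[x > upper] <- max(inside)
  x
}

rt <- c(420, 455, 470, 510, 530, 9800)   # hypothetical reaction times in ms
rt_w <- winsorize_to_nearest(rt, lower = 100, upper = 2000)
print(rt_w)
```

Running the main analysis on both `rt` and `rt_w` is a simple way to report whether the treatment of outliers matters.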
Transformations of data are reasonable and perhaps under-used. For example, a log transform is often appropriate for measures that are bounded at zero such as willingness to pay or reaction time. (For willingness to pay, it makes sense to add 1 so that the minimum after the transformation is zero rather than minus infinity.) Of course, many other transforms are sensible. Sometimes a sensible transformation can eliminate the need to remove or trim outliers.
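A small sketch in R of the log transform with 1 added, using made-up willingness-to-pay responses. The standardized distance of the largest response from the mean is used here only as a rough indicator of how extreme it is.

```r
wtp <- c(0, 1, 5, 20, 100, 5000)   # hypothetical willingness-to-pay responses
log_wtp <- log(wtp + 1)            # minimum is 0 rather than minus infinity

# How far the largest response lies from the mean, in standard deviations
extremeness <- function(v) (max(v) - mean(v)) / sd(v)
z_raw <- extremeness(wtp)
z_log <- extremeness(log_wtp)
print(round(c(z_raw = z_raw, z_log = z_log), 2))
```

After the transformation the largest response is less extreme relative to the others, which is the sense in which a transformation can reduce the need to trim.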
The two facts that one effect is significant and another is not do not together imply that the effects are different. This problem also arises with comparisons of correlations; differences of correlations must be tested.
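For two correlations from independent samples, one common test of the difference uses Fisher's z transformation. The sketch below, in R, uses made-up correlations and sample sizes, and the function names are hypothetical. It does not apply to correlations computed on the same subjects, which require a different test.

```r
# Two-tailed p-level for a single correlation r with n observations
p_single <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Fisher z test for the difference between two independent correlations
p_difference <- function(r1, n1, r2, n2) {
  z_diff <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  2 * pnorm(-abs(z_diff))
}

p1 <- p_single(0.30, 50)                 # significant on its own
p2 <- p_single(0.20, 50)                 # not significant on its own
p_diff <- p_difference(0.30, 50, 0.20, 50)
print(round(c(p1 = p1, p2 = p2, p_diff = p_diff), 3))
```

Here one correlation is significant and the other is not, yet the test of their difference is far from significant.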
Some interactions are difficult to interpret because of ceiling effects, floor effects, or, more generally, scaling effects. Apparent interactions could be removable by a reasonable transformation. Most dependent variables are more sensitive to any manipulation in some parts of their range than in other parts. Measures are generally less sensitive near their limits (floor or ceiling), but this is not the only possibility. If we transform the measure so that it is equally sensitive everywhere, an interaction might disappear. This problem cannot account for cross-over interactions and some others.
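A minimal numerical sketch in R of a removable interaction, using hypothetical cell means for a 2 x 2 design in which each factor multiplies the response. The log transform is assumed here to be a reasonable rescaling for the measure.

```r
# Hypothetical cell means: factor A doubles the response, factor B triples it
cell_means <- matrix(c(10, 20, 30, 60), nrow = 2,
                     dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))

# Interaction contrast: effect of A at b2 minus effect of A at b1
interaction_contrast <- function(m) (m[2, 2] - m[1, 2]) - (m[2, 1] - m[1, 1])

raw_interaction <- interaction_contrast(cell_means)
log_interaction <- interaction_contrast(log(cell_means))
print(c(raw = raw_interaction, log = log_interaction))
```

On the raw scale the effect of A is three times as large at b2 as at b1; on the log scale the two effects are equal, so the interaction vanishes. No monotone transformation could do this if the effect of A reversed sign across levels of B.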
If you report interactions and main effects from the same regression model (or Type 3 ANOVA), be careful about the effect of interaction terms on other estimates. Estimates of lower-order interactions and main effects may differ as a function of: 1) whether higher-order interactions are included; and 2) how variables are coded. See this paper.
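The dependence on coding can be seen in a tiny example. This sketch in R uses four hypothetical cell means (one observation per cell, so the models are fitted only to show the coefficients, not to test anything).

```r
d <- data.frame(
  a = c(0, 1, 0, 1),
  b = c(0, 0, 1, 1),
  y = c(10, 20, 30, 60)   # hypothetical cell means
)

# 0/1 coding with the interaction: "a" is the effect of a when b = 0
dummy <- coef(lm(y ~ a * b, data = d))

# Centered (-0.5/+0.5) coding: "a_c" is the effect of a averaged over b
d$a_c <- d$a - 0.5
d$b_c <- d$b - 0.5
centered <- coef(lm(y ~ a_c * b_c, data = d))

# No interaction term: "a" is again the average effect in this balanced design
additive <- coef(lm(y ~ a + b, data = d))

print(c(dummy = dummy[["a"]], centered = centered[["a_c"]],
        additive = additive[["a"]]))
```

The same data yield a coefficient for the first factor of 10 or 20 depending on the coding and on whether the interaction is in the model, so the report should say which was used.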
In statistical control, we usually regress Y on X and M, and we seek to show that the coefficient for X is still significant when M is included in the model. We want to conclude that M does not explain the correlation between X and Y. Statistical control often yields misleading results. The problem is that M is usually intended as a measure of some underlying variable M*, which is the true variable whose effect we want to remove. If we want to remove the full variance due to M*, we must measure it perfectly, without error. Any error can be expected to reduce the coefficient for M in the model, thus increasing the coefficient for X. To take an extreme example, suppose M* is "cognitive ability" and M is "head circumference". Although we can measure M with great accuracy (and reliability), it does not correlate very highly with M*, so we do not remove much of the variance in M* by including M in the model. The validity of M as a measure of M* is low.1 (We can think of "validity" as the correlation between M and M*.) Statistical control can be useful when we measure M* without error, e.g., when it is gender or age, or when the X coefficient is not reduced at all by the inclusion of M in the model, and when M is reasonably valid.2
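The effect of a low-validity control variable can be sketched by simulation in R. Everything here is assumed for illustration: X and Y are related only through M*, all variables are normal, and the available measure M is M* plus a large amount of noise.

```r
set.seed(1)
n <- 5000
m_star <- rnorm(n)                     # the true variable M* we want to control
x <- m_star + rnorm(n)                 # X depends on M*
y <- m_star + rnorm(n)                 # Y depends on M* only, not on X
m_noisy <- m_star + rnorm(n, sd = 2)   # M: a low-validity measure of M*

b_uncontrolled <- coef(lm(y ~ x))[["x"]]
b_true_control <- coef(lm(y ~ x + m_star))[["x"]]
b_noisy_control <- coef(lm(y ~ x + m_noisy))[["x"]]
print(round(c(uncontrolled = b_uncontrolled,
              control_m_star = b_true_control,
              control_m = b_noisy_control), 3))
```

Controlling for M* itself removes the X coefficient, as it should, but controlling for the noisy M leaves most of it in place, which could be mistaken for evidence that M* does not explain the relation.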
Mediation tests can be informative. But such tests can show spurious mediation when M has no causal effect on Y but some other (possibly un-measured) variable Z correlates positively with both M and Y, or when X has no effect on M but Z correlates with both X and M. Sometimes these spurious effects are implausible or even impossible. If X is an experimental manipulation, for example, then Z cannot affect X. True causal mediation can be missed when (for example) Z correlates in opposite directions with M and Y, or when M is measured poorly.
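The first kind of spurious mediation can be sketched by simulation in R. The causal structure and effect sizes are assumptions made up for the example: X is a randomized manipulation that affects both M and Y directly, M has no effect on Y, and an unmeasured Z affects both M and Y.

```r
set.seed(2)
n <- 5000
x <- rbinom(n, 1, 0.5)         # randomized manipulation
z <- rnorm(n)                  # unmeasured variable
m <- 0.5 * x + z + rnorm(n)    # X and Z both affect M
y <- 0.5 * x + z + rnorm(n)    # X and Z affect Y; M does not

a <- coef(lm(m ~ x))[["x"]]        # path from X to M
b <- coef(lm(y ~ x + m))[["m"]]    # apparent path from M to Y, given X
indirect <- a * b                  # apparent mediated effect
print(round(c(a = a, b = b, indirect = indirect), 3))
```

The apparent indirect effect is clearly positive even though, by construction, none of the effect of X on Y passes through M.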
2 This problem was noted by Daniel Kahneman in 1965, although his argument applies only to reliability, and the problem also exists for validity, as in the case of head circumference. A broader analysis, with some possible solutions, is here, although the solutions proposed are limited when validity is an issue. Another general statement, with extensions, is here. When validity is not an issue (e.g., when a test consists of problems of a certain type and the variable of interest is "ability to solve that type of problem"), a tolerable solution is to "disattenuate" a regression model starting with a raw correlation matrix M (dependent variable and all predictors) and then correcting all correlations using a reliability measure (such as omega; alpha might over-correct) for each variable, as follows in R code:
    M <- M/sqrt(R %*% t(R))  # correct all correlations; R is a vector of reliabilities in the same order
    diag(M) <- 1  # set the diagonal of the corrected matrix to 1
    Predictors <- M[-1,-1]  # remove the dependent variable (DV), here the first
    DVcors <- M[1,-1]  # disattenuated correlations of the DV with each predictor, a vector
    CorrectedCoefficients <- solve(Predictors) %*% DVcors  # invert the matrix and multiply

(Thanks to Andrew Meyer for this solution and code.)
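To see what the correction does, here is a usage sketch in R that wraps the same steps in two helper functions and applies them to a made-up correlation matrix. The function names, the correlations, and the reliabilities (perfect for the dependent variable and X, .64 for the control variable) are all assumptions for illustration.

```r
# Correct every correlation for unreliability, then restore the unit diagonal
disattenuate <- function(cors, reliabilities) {
  corrected <- cors / sqrt(reliabilities %*% t(reliabilities))
  diag(corrected) <- 1
  corrected
}

# Standardized regression coefficients; the DV is the first row and column
std_coefs <- function(cors) {
  drop(solve(cors[-1, -1]) %*% cors[1, -1])
}

vars <- c("y", "x", "m")
raw <- matrix(c(1.0, 0.3, 0.4,
                0.3, 1.0, 0.5,
                0.4, 0.5, 1.0), nrow = 3, dimnames = list(vars, vars))
rel <- c(1, 1, 0.64)   # hypothetical reliabilities, same order as vars

raw_coefs <- std_coefs(raw)
corrected_coefs <- std_coefs(disattenuate(raw, rel))
print(round(rbind(raw = raw_coefs, corrected = corrected_coefs), 3))
```

In this example the coefficient for x is positive before the correction and close to zero after it, because the unreliable control variable had been removing too little variance.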
3 For example, suppose x1=[-1,-2,-3,-4,-3], x2=[1,4,9,16,25], x3=[1,4,9,16,16], and y=[0,2,6,12,20]. Suppose the simple correlations with y are the most theoretically relevant results: y correlates negatively with x1 but positively with x2 and x3. But, if you regress y on x1 and x2, or on x1 and x3, or on x1, x2 and x3, then the coefficient for x1 is positive, despite the negative correlation between y and x1. The coefficient for x3 is positive for these three regressions, as it should be. However, if you regress y on x2 and x3, the x3 coefficient is slightly negative. Problems like these can result from nonlinearity, and from correlations among some predictors. See this article for a more general discussion of benefits as well as costs of such effects.
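The example in this footnote can be checked directly. The following R sketch uses the values given above and fits the regressions with `lm`; only the variable names for the stored results are introduced here.

```r
x1 <- c(-1, -2, -3, -4, -3)
x2 <- c(1, 4, 9, 16, 25)
x3 <- c(1, 4, 9, 16, 16)
y  <- c(0, 2, 6, 12, 20)

# Simple correlations of y with each predictor
simple_cors <- c(x1 = cor(y, x1), x2 = cor(y, x2), x3 = cor(y, x3))

# Coefficient for x1 when x2 is also in the model
b_x1_with_x2 <- coef(lm(y ~ x1 + x2))[["x1"]]

# Coefficient for x3 when x2 is also in the model
b_x3_with_x2 <- coef(lm(y ~ x2 + x3))[["x3"]]

print(round(simple_cors, 3))
print(round(c(b_x1_with_x2 = b_x1_with_x2, b_x3_with_x2 = b_x3_with_x2), 3))
```

The sign of the x1 coefficient is opposite to the sign of its simple correlation with y, and the x3 coefficient turns slightly negative once x2 is included.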