This methodological study was approved by the Research Ethics Committee of the Federal University of Rio Grande do Sul, and by the Municipal Department of Health of the city of Porto Alegre (protocol number: CAAE 19651113.5.0000.5338). All parents and/or guardians provided written consent for the participation of their children in the study, and all adolescents signed an assent form prior to enrollment.

### Participants

Fifth- to ninth-grade students of both sexes, aged between 10 and 17 years, attending three public schools in the city of Porto Alegre (RS, Brazil), were eligible for enrollment. A total of 713 students agreed to participate and were recruited. Of these, 10 were excluded based on teacher report of intellectual disability. Thus, the final sample included 703 (98.6 %) adolescents, of whom 380 (54 %) were girls. Mean age was 13 years (± standard deviation, SD, 1.58 years). Race was self-reported as white (*n* = 308; 43.8 %), brown (*n* = 194; 27.6 %), or black (*n* = 173; 24.3 %).

### Instrument

The questionnaires were administered during school hours, in the presence of two members of the research team who had been previously trained in the use of these instruments.

The QBO is a self-report instrument composed of 23 items about bullying (bully scale) and 23 items about victimization (victim scale). Each item describes a different behavior, and the respondent is asked to determine the frequency with which this behavior occurred over the past month. For instance: “Dei socos, pontapés ou empurrões/I hit, kicked or pushed someone” (bully scale); “Me deram socos, pontapés ou empurrões/I was hit, kicked or pushed” (victim scale).

Participants choose a response to each of the 23 items from a four-category Likert scale that reflects the frequency of behaviors: (1) “Nunca/Never”, (2) “Uma ou duas vezes no mês/Once or twice a month”, (3) “Cerca de uma vez por semana/Around once a week”, and (4) “Várias vezes por semana/Several times a week” (Olweus, 1996; Fischer et al., 2010). Because each item offers more than two ordered response categories, the QBO is said to be polytomously scored.

### Data analysis procedures

Polytomous item response theory (IRT) was used to determine QBO validity. A discriminating parameter is calculated for each item. This parameter reflects the influence of each item on the latent variable: the higher the discriminating parameter, the higher the relevance of the item for the proposed measurement. A severity parameter is also determined for each response category, reflecting the degree to which the behavior is enacted: a student whose level on the latent variable lies above an item's severity parameters is more likely to choose the highest score for that item on the Likert scale (Andrade et al., 2000). Two IRT models were tested to determine which was best suited to assess the ability of each item to identify victims and bullies: the graded response model (GRM) and the generalized partial credit model (GPCM), described by Samejima (1969) and Muraki (1992), respectively.

#### Graded response model

The GRM was developed by Samejima (1969) and deals with polytomous categories arranged in ascending order. This model estimates the probability that an individual will select a given response or a higher-level one in each item on the scale based on the formula

$$ P_{i,k}^{+}\left(\theta_j\right) = \frac{1}{1+e^{-1.702 \ast a_i\left(\theta_j-b_{i,k}\right)}} $$

Where: *i* represents a given item in the questionnaire; *j* refers to the subject under assessment; *k* designates an item response category; *n* is the number of subjects in the sample; $m_i$ is the number of response categories of item *i*; $a_i$ is the discriminating parameter of item *i*; and $b_{i,k}$ is the severity parameter of response category *k* in item *i* (Andrade et al., 2000).
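The cumulative probability above can be illustrated with a minimal Python sketch. It is not the code used in the study, and the parameter values are hypothetical. The step from cumulative to category probabilities (the difference between adjacent cumulative curves) is the usual GRM construction and is not spelled out in the text above.

```python
import math

D = 1.702  # scaling constant that appears in the formula


def grm_cumulative(theta, a, b):
    """Probability of responding in category k or higher, given the
    latent level theta, discrimination a and severity b of category k."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def grm_category_probs(theta, a, thresholds):
    """Probability of each response category for one item.

    thresholds holds the ascending severity parameters of the category
    boundaries; a four-category item therefore has three of them.
    """
    # The lowest category is always reached (probability 1) and there is
    # nothing above the highest category (probability 0).
    cum = [1.0] + [grm_cumulative(theta, a, b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]


# Hypothetical item: discrimination 1.2 and three boundary severities.
probs = grm_category_probs(theta=0.5, a=1.2, thresholds=[-0.5, 0.8, 1.9])
for label, p in zip(["Never", "Once or twice a month",
                     "Around once a week", "Several times a week"], probs):
    print(f"{label}: {p:.3f}")
```

When the latent level equals a boundary's severity parameter, the cumulative probability for that boundary is exactly 0.5, which is one way to read the severity values on the latent scale.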

#### Generalized partial credit model

The GPCM, developed by Muraki (1992), is used to determine the discriminating power for each set of adjacent choices in a Likert scale. The GPCM is said to be “generalized” because it does not assume that all items have uniform discriminating power. The formula used for the GPCM is

$$ P_{i,k}\left(\theta_j\right) = \frac{\exp\left[\sum_{u=0}^{k} 1.702 \ast a_i\left(\theta_j-b_{i,u}\right)\right]}{\sum_{u=0}^{m_i} \exp\left[\sum_{v=0}^{u} 1.702 \ast a_i\left(\theta_j-b_{i,v}\right)\right]} $$

Where: *i* represents a given item in the questionnaire; *j* refers to the subject under assessment; *k* designates an item response category; *n* is the number of subjects in the sample; $m_i$ is the number of response categories of item *i*; $a_i$ is the discriminating parameter of item *i*; and $b_{i,k}$ is the severity parameter of response category *k* for item *i* (Andrade et al., 2000).
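As an illustration only, the GPCM formula can be evaluated with the short Python sketch below. The parameter values are hypothetical, and the list `b` is assumed to be indexed from 0 as in the sums above, so it holds one entry per response category. Subtracting the largest exponent before exponentiating is a numerical-stability choice of this sketch, not part of the model.

```python
import math

D = 1.702  # scaling constant that appears in the formula


def gpcm_category_probs(theta, a, b):
    """Category probabilities for one item under the GPCM.

    b[u] is the severity parameter b_{i,u}, with u starting at 0.
    The u = 0 term appears in every exponent, so its value cancels
    between numerator and denominator.
    """
    # z[k] is the inner sum over u = 0..k of D * a * (theta - b[u]).
    z = []
    running = 0.0
    for b_u in b:
        running += D * a * (theta - b_u)
        z.append(running)

    # Shift by the maximum so that exp() cannot overflow.
    z_max = max(z)
    numerators = [math.exp(v - z_max) for v in z]
    denominator = sum(numerators)
    return [num / denominator for num in numerators]


# Hypothetical four-category item.
probs = gpcm_category_probs(theta=0.3, a=1.1, b=[0.0, -0.7, 0.4, 1.5])
print([round(p, 3) for p in probs])
```

Because the exponents accumulate across adjacent categories, each severity parameter marks the latent level at which two neighbouring categories are equally likely.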

### Statistical analysis

The construct validity of the QBO was established through IRT analysis using the GRM and GPCM, both of which deal with polytomous variables. The bully and victim scales of the questionnaire were independently analyzed.

The GPCM was run in three variations: a) constant discriminating power equal to 1; b) constant discriminating power not equal to 1; and c) variable discriminating power across all items. The GRM was run in two variations: d) constant discriminating power across items; and e) variable discriminating power across items (Andrade et al., 2000).

The best model was selected based on the comparison of the area under the curve generated by each model, where the size of the area reflects the amount of information included in the calculations. The curves with the largest area correspond to the best models. The intersections of item characteristic curves (ICC) were then analyzed to verify whether any categories should be removed from the model. Any categories with response probabilities below those of the other categories were excluded.
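The area-based comparison can be sketched as follows, under the assumption that the curve in question is the test information curve (the sum of the item information curves). The sketch uses the GRM, hypothetical item parameters, an arbitrary integration range of -6 to 6 on the latent scale, and a plain trapezoidal rule; none of these choices is taken from the study.

```python
import math

D = 1.702  # scaling constant used in the model formulas


def _cumulative(theta, a, b):
    """GRM probability of responding in a given category or higher."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))


def grm_item_information(theta, a, thresholds):
    """Item information at theta: sum over categories of P'(theta)^2 / P(theta)."""
    cum = [1.0] + [_cumulative(theta, a, b) for b in thresholds] + [0.0]
    # Derivative of each cumulative curve with respect to theta
    # (zero for the fixed end points 1 and 0).
    slope = [D * a * c * (1.0 - c) for c in cum]
    info = 0.0
    for k in range(len(cum) - 1):
        p = cum[k] - cum[k + 1]
        dp = slope[k] - slope[k + 1]
        if p > 0.0:  # skip categories whose probability underflows to zero
            info += dp * dp / p
    return info


def information_area(items, lo=-6.0, hi=6.0, steps=1200):
    """Trapezoidal area under the test information curve.

    items is a list of (a, thresholds) pairs, one per item.
    """
    h = (hi - lo) / steps
    values = []
    for i in range(steps + 1):
        theta = lo + i * h
        values.append(sum(grm_item_information(theta, a, bs) for a, bs in items))
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


# Two hypothetical parameterisations of the same three-item scale.
constant_a = [(1.0, [-0.5, 0.8, 1.9])] * 3
variable_a = [(0.8, [-0.5, 0.8, 1.9]), (1.3, [-0.2, 0.9, 2.1]), (1.7, [0.1, 1.0, 2.3])]
print(information_area(constant_a), information_area(variable_a))
```

Under this reading, the parameterisation with the larger printed area would be the one carrying more information over the chosen range.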

To facilitate the interpretation of victim and bully scores, which are normally distributed with a mean of zero and a standard deviation of one, scores were multiplied by the standard deviation of total scores and added to the mean total score on the scale (Pasquali & Primi, 2003). The unidimensionality of the scales (that is, the ability of scale items to measure the aspect they propose to measure, being a victim or a bully) was verified through factor analysis and parallel analysis. The reliability of each scale was estimated using Cronbach’s alpha. A Cronbach’s alpha coefficient above 0.70 indicates a satisfactory level of reliability (Pilatti et al., 2010).
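The score transformation and the reliability coefficient can be illustrated with a minimal Python sketch. The data are hypothetical, the function names are introduced here for illustration, and the use of the sample (n - 1) standard deviation and variance is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev, variance


def rescale_scores(theta_scores, total_scores):
    """Map standardized scores (mean 0, SD 1) onto the total-score metric:
    rescaled = mean(total) + SD(total) * theta."""
    m = mean(total_scores)
    s = stdev(total_scores)  # assumption: sample SD
    return [m + s * t for t in theta_scores]


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_variances) / total_variance)


# Hypothetical responses of five students to four items scored 1 to 4.
responses = [
    [1, 1, 2, 1],
    [2, 2, 2, 3],
    [1, 2, 1, 1],
    [4, 3, 4, 4],
    [3, 3, 2, 3],
]
totals = [sum(row) for row in responses]
alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}, satisfactory = {alpha > 0.70}")
print(rescale_scores([-1.0, 0.0, 1.0], totals))
```

A standardized score of zero maps onto the mean total score, and each unit of the standardized scale corresponds to one standard deviation of the total scores.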

Data analysis was performed using the *R* statistical software package (Gentleman & Ihaka, 2015), and IRT analysis was performed using the *ltm* package (Rizopoulos, 2006).