- 1011Matrix.gvcf.gz: all SNPs and indels called at the population level (.gvcf format).
- 1011GWASMatrix.tar.gz: the matrix used for GWAS, which contains all biallelic positions known for 1,000 isolates or more with MAF > 5%, as well as CNVs (encoded 0/1/2 for absence/0.5–1 copy/multiple copies) (.bed, .bim and .fam formats).
- 1011DistanceMatrixBasedOnSNPs.tab.gz: for each pair of strains, the value is the percentage of non-identical bases, based on SNPs. Heterozygous differences were given half the weight of homozygous differences.
- 1011DistanceMatrixBasedOnORFs.tab.gz: for each pair of strains, the value is the number of ORFs that are present in only one of the two isolates.
- 1011Assemblies.tar.gz: de novo assemblies of the 1,011 isolates (.fasta format).
- allReferenceGenesWithSNPsAndIndelsInferred.tar.gz: sequences of the genes found in the reference genome, in which SNPs and indels have been automatically inferred for each isolate.
- allORFs_pangenome.fasta.gz: sequences of the 7,796 pangenomic ORFs (.fasta format).
- genesMatrix_PresenceAbsence.tab.gz: presence/absence pattern of pangenomic ORFs for each isolate, in which the presence of an ORF is marked as 1 and its absence as 0.
- genesMatrix_CopyNumber.tab.gz: estimated copy number for each pangenomic ORF, per isolate. Values are given for the haploid genome, so non-integer values can occur (different copy numbers on homologous chromosomes).
- genesMatrix_Frameshift.tab.gz: for each isolate, the presence or absence (indicated by 1 or 0, respectively) of a homozygous frameshift is reported for each gene, based on the number of bases affected by indels.
- gene_dNdS.tab.gz: for each gene, mean and median dN/dS values computed with PAML.
- geneTree.concat.4Dsites.fa.gz: FASTA alignment of the 160,415 4-fold degenerate sites used to estimate the out-of-China timing.
- phenoMatrix_35ConditionsNormalizedByYPD.tab.gz: growth ratio between 35 stress conditions and standard YPD medium at 30 °C, for 971 isolates.
- 1011LossOfFunction.xls.gz: loss of function predicted for all genes in the 1011 dataset (nonsense mutations + SIFT predictions).
- 1011proteome.tar.gz: proteome of the 1,011 isolates, organized by gene. This dataset was created by inferring SNPs only, and takes into account the presence/absence of the gene. In case of heterozygosity, one of the possible alleles was chosen at random.
- 1011CDS_withAmbiguityResidues.tar.gz: the CDS of the 1,011 isolates, organized by gene. This dataset was created by inferring SNPs only, and takes into account the presence/absence of the gene. Ambiguity residues were inferred at heterozygous positions.
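
The two distance matrices listed above can be illustrated with a minimal Python sketch. It is not the pipeline used to build the files. The function names, the genotype encoding (alternate-allele counts 0/1/2, with `None` for a missing call), and the exact half-weighting rule (a heterozygous versus homozygous difference counts as 0.5, two different homozygous calls count as 1) are assumptions made for illustration only.

```python
from typing import Optional, Sequence


def snp_distance(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Percentage of non-identical SNP positions between two isolates.

    Assumed encoding: alternate-allele count per site (0, 1 or 2),
    None where the genotype is missing. Sites missing in either
    isolate are skipped.
    """
    compared = 0
    weighted_diff = 0.0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue
        compared += 1
        # |0 - 2| / 2 = 1.0 for a homozygous difference,
        # |1 - 0| / 2 = 0.5 for a heterozygous difference (assumed rule).
        weighted_diff += abs(ga - gb) / 2
    if compared == 0:
        raise ValueError("no position genotyped in both isolates")
    return 100.0 * weighted_diff / compared


def orf_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of ORFs present (1) in exactly one of the two isolates."""
    return sum(1 for pa, pb in zip(a, b) if pa != pb)


if __name__ == "__main__":
    # Toy genotypes, not taken from the dataset.
    print(snp_distance([0, 2, 1, 0], [0, 0, 0, None]))
    print(orf_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

With real data, the presence/absence vectors would come from the rows or columns of genesMatrix_PresenceAbsence.tab.gz; the layout of that file should be checked before relying on this sketch.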