[SystemSafety] How much public Ada source code is there?

Derek M Jones derek at knosof.co.uk
Tue Jun 4 17:48:35 CEST 2024


All,

Ada source code is present in version 2 of the Stack,
a public source code repo designed for training LLMs:
https://huggingface.co/datasets/bigcode/the-stack-v2

Technical details are here:
https://arxiv.org/abs/2402.19173

The amount of Ada source is:

language        : "Ada"
num_files       : 183,890
dedup_num_files :  92,104
train_num_files :  89,221
size_bytes      : 2.03e+10
dedup_size_bytes: 8.25e+08
train_size_bytes: 6.14e+08
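
For a quick sense of scale, a few ratios can be derived from the
figures above. This is only a sketch: the numbers are copied from the
listing, and the names stats and summarize are made up for
illustration.

```python
# Figures copied from the listing above.
stats = {
    "num_files": 183_890,
    "dedup_num_files": 92_104,
    "train_num_files": 89_221,
    "size_bytes": 2.03e10,
    "dedup_size_bytes": 8.25e8,
    "train_size_bytes": 6.14e8,
}


def summarize(s):
    """Return surviving fractions and mean file sizes (bytes) per stage."""
    return {
        # Share of files / bytes that survive deduplication.
        "dedup_file_fraction": s["dedup_num_files"] / s["num_files"],
        "dedup_byte_fraction": s["dedup_size_bytes"] / s["size_bytes"],
        # Mean file size at each stage.
        "mean_bytes_all": s["size_bytes"] / s["num_files"],
        "mean_bytes_dedup": s["dedup_size_bytes"] / s["dedup_num_files"],
        "mean_bytes_train": s["train_size_bytes"] / s["train_num_files"],
    }


summary = summarize(stats)
for name, value in summary.items():
    print(f"{name:22s} {value:,.3f}")
```

Roughly half of the files survive deduplication, but only about 4% of
the bytes: the mean file size drops from about 110 kB to about 9 kB.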

num_files is the number of unique source files.
dedup_num_files is the number remaining after further deduplication,
e.g., ignoring differences in whitespace and blank lines.
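
To illustrate what ignoring whitespace and blank lines can mean in
practice, here is a minimal sketch. It is not the procedure used to
build the Stack (see the paper for that); the functions normalize and
dedup, and the sample file names, are hypothetical.

```python
import hashlib


def normalize(source: str) -> str:
    """Collapse runs of whitespace within each line and drop blank lines.

    Crude: whitespace inside string literals is collapsed as well.
    """
    lines = (" ".join(line.split()) for line in source.splitlines())
    return "\n".join(line for line in lines if line)


def dedup(files: dict) -> list:
    """Keep the first file name seen for each distinct normalized content."""
    seen = {}
    for name, text in files.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        seen.setdefault(digest, name)
    return sorted(seen.values())


# Hypothetical sample: b.adb differs from a.adb only in layout.
files = {
    "a.adb": "procedure Hello is\nbegin\n   null;\nend Hello;\n",
    "b.adb": "procedure Hello is\n\nbegin\n\tnull;\nend Hello;\n",
    "c.adb": "procedure Bye is\nbegin\n   null;\nend Bye;\n",
}
unique = dedup(files)
print(unique)
```

Here a.adb and b.adb count as one file, so two of the three survive.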

Do these numbers look like they are representative
of the total amount of publicly available Ada source?

Is there some huge Ada repository someplace that looks
like it might not have been included?

-- 
Derek M. Jones           Evidence-based software engineering
blog:https://shape-of-code.com

