Exploring the Snowmass 2021 LOIs graphically
Some basic statistics:
A word vector is built for each document, and the similarity between every pair of documents is then computed from those vectors. The more words two documents have in common, the higher their similarity. Identical documents have a similarity of one. Computing the full pairwise similarity matrix for all documents (and ignoring the diagonal, which compares each document with itself) gives the following:
Everything to the right of the duplicates line is classified as identical in the lists below, and everything to the right of the updates line is classified as an update.
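The word-vector comparison and the two cut lines described above can be sketched roughly as follows. This is a minimal illustration, not the code used to produce the plot: it assumes plain word counts and cosine similarity, and the cut values `DUPLICATE_CUT` and `UPDATE_CUT`, along with all function names, are hypothetical placeholders.

```python
import math
import re
from collections import Counter

# Hypothetical cut values, chosen only for illustration. The real cuts are
# wherever the "duplicates" and "updates" lines sit in the plot.
DUPLICATE_CUT = 0.99
UPDATE_CUT = 0.90


def word_vector(text):
    """Build a simple bag-of-words vector: lowercase word -> count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity of two count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_pair(similarity, duplicate_cut=DUPLICATE_CUT, update_cut=UPDATE_CUT):
    """Label a pair by which cut line its similarity falls to the right of."""
    if similarity >= duplicate_cut:
        return "duplicate"
    if similarity >= update_cut:
        return "update"
    return "distinct"


def compare_documents(texts):
    """Compare every pair of documents once (i < j), skipping the diagonal."""
    vectors = [word_vector(t) for t in texts]
    results = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine_similarity(vectors[i], vectors[j])
            results.append((i, j, sim, classify_pair(sim)))
    return results
```

Calling `compare_documents` on a list of document texts returns one `(i, j, similarity, label)` tuple per pair. Only the upper triangle of the matrix is visited because the similarity is symmetric and the diagonal is always one.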
These LOIs are duplicates of each other.
These LOIs might be updates of each other.