Data Cleaning

For some of this work we need to clean or aggragate the data. People entered a lot of details - which is amazing - but we don’t want to look at per-person responses as much as in aggregate. This page describes what we did to clean up the survey responses.

This notebook loads in the raw data, and writes out a new file that can be loaded “at will” with “clean” data by all the other notebooks.

import pandas as pd
import numpy as np
from typing import List
import re

raw_survey = pd.read_csv('/workspaces/snowmass-attendance/survey.csv')

Time

Convert the time the person filled out the survey into something that is a time object.

time_col = pd.to_datetime(raw_survey["Timestamp"])
survey = pd.DataFrame(time_col).rename(columns={"Timestamp": "time"})
survey["time"] = pd.to_datetime(raw_survey["Timestamp"])

Employed By

Here is a list of what people answered for how they were employed. People entered a lot of different things - below is how we mapped everyone. Some of the choices were… guesses. Hopefully they do not affect things too much.

c_employed_by_name = "I'm employed by"
','.join(list((str(i) for i in raw_survey[c_employed_by_name].unique())))
'nan,US National Lab,US University,Italian National Institute,International university ,TRIUMF,Perfer not to answer,International University,Non-US institution,European university,EU Lab,Spanish Research Council (CSIC),INFN (Italy),European Lab,research institute in Eastern Europe,Italian University,European lab,CERN,German University,Other country,University in Canada,Research laboratory in Germany.,UK University,European National lab,retired,INFN,Istituto Nazionale di Fisica Nucleare (INFN), Italy,Lund University,University,overseas university,retired faculty,employeed by national lab, grad student at a US university,non US,Canadian research institute,Università degli Studi di Padova,Canada University,Research institute,Canadian university (and, btw, "perfer" is not a word ...),Foreign Lab,Small Liberal Arts  College,Ukrainian university (Main Astronomical Observatory of NAS Ukraine),foreign lab,non-US lab,Foreign University,a European national insitution,Korean University ,EU university,EU Research center,European institution,Non-US university (University of ICeland),University outside US,non-US University,The University of Torino,Non US University (Colombia),UK university,INFN, Italy,Industry,University of Granada (Spain),University of Medellin,CSIC (Spain),Indian University,National Lab outside US,Brazilian University,German research institute,European Institution,US College,UT Austin,NASA,Joint Italian and US university,University in Europe ,European University,Department of Atomic Energy, India,Retired from US University,Canadian university,Swiss University,Small business,kIAS,Sultan Qaboos University (Oman),non-US university,Uk institution,Cairo university, Egypt,Canadian University ,US DOE-funded laboratory (not formally a US National Lab),US Liberal-Arts College (if "university" was meant to encompass this, feel free to log my response as such, but liberal-arts-college faculty probably would not use that terminology),Japanese University,International university,Business,EU National Research Institute,Saha Institute of Nuclear Physics , kolkata, India,Research institution in Canada,former employee of US Nat Lab,CNRS,German National Lab,Institute of High Energy Physics, Chinese Academy of Sciences,Italian Institution,INFN - Italy,Foreign university,FEI University at SBC São Paulo, Brazil,Swiss and French pensioner would be on CERN mission,South Korea University,actually, I am retired,Non-US University,Dutch university,French Lab, previously working in the US,Retired,EU University,EU institution,Overseas university,A foreign university full time and a US University as a part time ,emeritus,both lab and university,US National Lab (ret\'d),Employed abroad in a National Lab,Canadian University,I am a self-funded PhD researcher so not employed.,University&Laboratory'
employted_by_t = {
 'US National Lab': 'US National Lab',
 'US University': 'US University',
 'Italian National Institute' : "non-EU",
 'International university ' : "non-EU",
 'TRIUMF': "Canada",
 'Perfer not to answer': "Unknown",
 'International University': "EU",
 'Non-US institution': "EU",
 'European university': "EU",
 'EU Lab': "EU",
 'Spanish Research Council (CSIC)': "EU",
 'INFN (Italy)': "EU",
 'European Lab': "EU",
 'research institute in Eastern Europe': "EU",
 'Italian University': "EU",
 'European lab': "EU",
 'CERN': "EU",
 'German University': "EU",
 'Other country': "non-EU",
 'University in Canada': "Canada",
 'Research laboratory in Germany.': "EU",
 'UK University': "EU",
 'European National lab': "EU",
 'retired': "Unknown",
 'INFN': "EU",
 'Istituto Nazionale di Fisica Nucleare (INFN), Italy': "EU",
 'Lund University': "EU",
 'University': "EU",
 'overseas university': "EU",
 'retired faculty': "Unknown",
 'employeed by national lab, grad student at a US university': "US National Lab",
 'non US': "EU",
 'Canadian research institute': "Canada",
 'Università degli Studi di Padova': "EU",
 'Canada University': "Canada",
 'Research institute': "EU",
 'Canadian university (and, btw, "perfer" is not a word ...)': "Canada",
 'Foreign Lab': "EU",
 'Small Liberal Arts  College': "US University",
 'Ukrainian university (Main Astronomical Observatory of NAS Ukraine)': "non-EU",
 'foreign lab': "EU",
 'non-US lab': "EU",
 'Foreign University': "EU",
 'a European national insitution': "EU",
 'Korean University ': "non-EU",
 'EU university': "EU",
 'EU Research center': "EU",
 'European institution': "EU",
 'Non-US university (University of ICeland)': "non-EU",
 'University outside US': "EU",
 'non-US University': "EU",
 'The University of Torino': "EU",
 'Non US University (Colombia)': "non-EU",
 'UK university': "EU",
 'INFN, Italy': "EU",
 'Industry': "Industry",
 'University of Granada (Spain)': "EU",
 'University of Medellin': "EU",
 'CSIC (Spain)': "EU",
 'Indian University': "non-EU",
 'National Lab outside US': "EU",
 'Brazilian University': "non-EU",
 'German research institute': "EU",
 'European Institution': "EU",
 'US College': "US University",
 'UT Austin': "US University",
 'NASA': "US National Lab",
 'Joint Italian and US university': "EU",
 'University in Europe ': "EU",
 'European University': "EU",
 'Department of Atomic Energy, India': "non-EU",
 'Retired from US University': "US University",
 'Canadian university': "Canada",
 'Swiss University': "EU",
 'Small business': "Industry",
 'kIAS': "non-EU",
 'Sultan Qaboos University (Oman)': "non-EU",
 'non-US university': "EU",
 'Uk institution': "EU",
 'Cairo university, Egypt': "non-EU",
 'Canadian University ': "Canada",
 'US DOE-funded laboratory (not formally a US National Lab)': "US National Lab",
 'US Liberal-Arts College (if "university" was meant to encompass this, feel free to log my response as such, but liberal-arts-college faculty probably would not use that terminology)': "US University",
 'Japanese University': "non-EU",
 'International university': "International",
 'Business': "Industry",
 'EU National Research Institute': "EU",
 'Saha Institute of Nuclear Physics , kolkata, India': "non-EU",
 'Research institution in Canada': "Canada",
 'former employee of US Nat Lab': "US National Lab",
 'CNRS': "EU",
 'German National Lab': "EU",
 'Institute of High Energy Physics, Chinese Academy of Sciences': "non-EU",
 'Italian Institution': "EU",
 'INFN - Italy': "EU",
 'Foreign university': "EU",
 'FEI University at SBC São Paulo, Brazil': "non-EU",
 'Swiss and French pensioner would be on CERN mission': "EU",
 'South Korea University': "non-EU",
 'actually, I am retired': "US University",
 'Non-US University': "EU",
 'Dutch university': "EU",
 'French Lab, previously working in the US': "EU",
 'Retired': "US University",
 'EU University': "EU",
 'EU institution': "EU",
 'Overseas university': "EU",
 'A foreign university full time and a US University as a part time ': "US University",
 'emeritus': "US University",
 'both lab and university': "US National Lab",
 "US National Lab (ret'd)": "US National Lab",
 'Employed abroad in a National Lab': "EU",
 'Canadian University': "Canada",
 'I am a self-funded PhD researcher so not employed.': "Industry",
 'University&Laboratory': "US University"
}
survey["employed_by"] = raw_survey[c_employed_by_name].transform(lambda i: employted_by_t[i] if i == i else 'Unknown')
','.join(survey.employed_by.unique())
'Unknown,US National Lab,US University,non-EU,Canada,EU,Industry,International'

Rank

Normalize the rank of the person that replied

c_rank = "I am a"
','.join(list((str(i) for i in raw_survey[c_rank].unique())))
'Post-doc,nan,Professor or Lab Scientist (tenured),Graduate Student,Professor or Lab Scientist (untenured),Research Scientist,Perfer not to answer,Lab fellow,Engineer, Staff, or other non-physicist role,PHD student,Lab management,retired engineer/physicist,retired professor ,Lab Director - Japan,Retired lab scientist,Division Director,retired professor,Retired,Honorary Professor,Emeritus,Scientist Emeritus (Fermilab),Guest Scientist, retired. '
rank_t = {
 'Post-doc': "Post-doc",
 'Professor or Lab Scientist (tenured)': 'Professor or Lab Scientist (tenured)',
 'Graduate Student': 'Graduate Student',
 'Professor or Lab Scientist (untenured)': 'Professor or Lab Scientist (untenured)',
 'Research Scientist': 'Research Scientist',
 'Perfer not to answer': 'Unknown',
 'Lab fellow': 'Professor or Lab Scientist (untenured)',
 'Engineer, Staff, or other non-physicist role': 'Engineer, Staff, or other non-physicist role',
 'PHD student': 'Graduate Student',
 'Lab management': 'Professor or Lab Scientist (tenured)',
 'retired engineer/physicist': 'Professor or Lab Scientist (tenured)',
 'retired professor ': 'Professor or Lab Scientist (tenured)',
 'Lab Director - Japan': 'Professor or Lab Scientist (tenured)',
 'Retired lab scientist': 'Professor or Lab Scientist (tenured)',
 'Division Director': 'Professor or Lab Scientist (tenured)',
 'retired professor': 'Professor or Lab Scientist (tenured)',
 'Retired': 'Professor or Lab Scientist (tenured)',
 'Honorary Professor': 'Professor or Lab Scientist (tenured)',
 'Emeritus': 'Professor or Lab Scientist (tenured)',
 'Scientist Emeritus (Fermilab)': 'Professor or Lab Scientist (tenured)',
 'Guest Scientist, retired. ': 'Professor or Lab Scientist (tenured)'
}
survey["job_rank"] = raw_survey[c_rank].transform(lambda i: rank_t[i] if i == i else 'Unknown')
','.join(survey.job_rank.unique())
'Post-doc,Unknown,Professor or Lab Scientist (tenured),Graduate Student,Professor or Lab Scientist (untenured),Research Scientist,Engineer, Staff, or other non-physicist role'

Attending

Would you attend the workshop. While we designed the survey with no options, someone slipped a Yes in there, which we need to map back to the other values.

c_attend = "Setting aside the COVID, funding, etc issues for the moment, would you attend the Snowmass Summer Workshop in person?"
raw_survey[c_attend].unique()
array(['Yes', 'Would attend', 'Might attend', 'Would not attend'],
      dtype=object)
attend_t = {
    'Yes': 'Would attend', 
    'Would attend': 'Would attend', 
    'Might attend': 'Might attend',
    'Would not attend': 'Would not attend'
}
survey["attend"] = raw_survey[c_attend].transform(lambda i: attend_t[i])

Attendance Feedback

The text feedback has to be done as a giant set of text strings, so no cleaning here.

c_attendance_feedback = "If you'd like to say a few words to expand on the answer, please do!"
survey["attend_feedback"] = raw_survey[c_attendance_feedback]
c_worried_covid = "Please indicate how worried you are about each issue that might prevent you from attending: [COVID]"
c_worried_funding = "Please indicate how worried you are about each issue that might prevent you from attending: [Funding]"
c_worried_length = "Please indicate how worried you are about each issue that might prevent you from attending: [Length of the Workshop]"
c_worried_accessibility = "Please indicate how worried you are about each issue that might prevent you from attending: [Accessibility/Accommodation]"
c_worried_responsibilities = "Please indicate how worried you are about each issue that might prevent you from attending: [Competing Responsibilities]"
raw_survey[c_worried_covid].unique()
array(['Very worried', 'Might prevent me from attending',
       'Likely will prevent me from attending',
       'Could attend, but might limit participation', 'Not a concern'],
      dtype=object)
worried_t = {
    'Very worried': 1,
    'Likely will prevent me from attending': 1,
    'Might prevent me from attending': 2,
    'Could attend, but might limit participation': 3,
    'A little worried': 3,
    'Not a concern': 4,
    'No Worried': 4,
    'Unknown': 5,
}
survey['worried_covid'] = raw_survey[c_worried_covid].transform(lambda i: worried_t[i])
survey['worried_funding'] = raw_survey[c_worried_funding].transform(lambda i: worried_t[i])
survey['worried_length'] = raw_survey[c_worried_length].transform(lambda i: worried_t[i])
survey['worried_accessibility'] = raw_survey[c_worried_accessibility].transform(lambda i: worried_t[i])
survey['worried_responsibilities'] = raw_survey[c_worried_responsibilities].transform(lambda i: worried_t[i] if isinstance(i, str) else 5)
c_other_issues = "If there are other issues that concern you that are not listed above, please indicate here!"
survey['attend_other_concerns'] = raw_survey[c_other_issues]

Travel

c_travel = "If I attend Snowmass, I will be traveling"
survey['travel'] = raw_survey[c_travel].transform(lambda v: v if isinstance(v, str) else "Unknown")

Management

c_management="Are you part of the Snowmass Management Team (frontier group or sub-group convener, etc.)?"
raw_survey[c_management].unique()
survey['management'] = raw_survey[c_management].transform(lambda v: v if isinstance(v, str) else "Unknown")

Dinner

No need to convert this - there were no optional values!

c_dinner = "If you attended the conference, would you also attend the dinner?"
raw_survey[c_dinner].unique()
array(['No', 'Yes', 'Maybe'], dtype=object)
survey["dinner"] = raw_survey[c_dinner]

Frontier

This is a bit difficult b.c each item is comma separated list, which is not very amenable to manipulation. We will convert to using the abreviations and also to converting it to separate columns.

c_frontier = "I primarily am interested in sessions associated with:"
frontier_mapping = {
    'Neutrino Frontier': 'NF',
    'Energy Frontier': 'EF',
    'Rare Processes and Precision Measurements': 'RF',
    'Cosmic Frontier': 'CF',
    'Theory Frontier': 'TF',
    'Accelerator Frontier': 'AF',
    'Instrumentation Frontier': 'IF',
    'Computational Frontier': "CompF",
    'Underground Facilities': 'UF',
    'Community Engagement': 'CommF'
}

survey["frontier_all"] = raw_survey[c_frontier].transform(lambda v: np.array([frontier_mapping[f.strip()] for f in v.split(',')]) if isinstance(v, str) else [])
for f in frontier_mapping.values():
    survey[f'frontier_{f}'] = survey["frontier_all"].apply(lambda v: f in v)

COVID and Attending

c_covid = "The points below are labeled on the case/day plot above. Please indicate the largest USA new case-load/day rate for which you'd be willing to attend Snowmass in person."
covid_t = {
    'Point A': 'A',
    'Point B': 'B',
    'Point C': 'C',
    'Point D': 'D',
    'Point E': 'E',
    'Poing E': 'E',
    'Not com': 'At No Level',
    'My inst': 'Travel Restrictions'
}
survey['covid_comfort'] = raw_survey[c_covid].transform(lambda v: covid_t[v[0:7]])
c_covid_comments = "There is no way to get everyone's full perspective with a multiple choice question like that. Please feel free to add a sentence or so if you wish!"
survey['covid_comments'] = raw_survey[c_covid_comments]

Accomodation

In the end the below cleaning couldn’t be used - the survey response format makes it impossible to tell between comma’s in the answer, and separate answers. Something to know for the next time we run a survey! The code is left here because it does provide some good example code for future efforts.

c_hearing = "Is there any type of hearing-related accommodation that would increase the accessibility of the Snowmass Summer Study for you? (Please note that captioning can be useful for both Deaf and hard of hearing people and for people who do not have these identities but who find it helpful to be able to see as well as hear speakers' words.)"
survey['acc_hearing'] = raw_survey[c_hearing]

survey['acc_hearing_ar'] = raw_survey[c_hearing].transform(lambda v: np.array([f.strip() for f in v.split(',')]) if isinstance(v, str) else [])
m = re.compile(r', (?=(?:"[^"]*?(?: [^"]*)*))|, (?=[^",]+(?:,|$))')
def splitter(v: str) -> List[str]:
    return [f.strip('"') for f in m.split(v)]    
c_other_acc = "What other accommodations would increase the accessibility of the Snowmass Summer Study for you?"
survey["acc_other"] = raw_survey[c_other_acc]

survey['acc_other_ar'] = raw_survey[c_other_acc].transform(lambda v: np.array(splitter(v)) if isinstance(v, str) else [])
survey["acc_other"]
0                                                    NaN
1                                                    NaN
2                                                    NaN
3                                                    NaN
4      Seating available at all events, breaks, meals...
                             ...                        
723                                                  NaN
724                                                  NaN
725                                                  NaN
726                                                  NaN
727                                                  NaN
Name: acc_other, Length: 728, dtype: object
survey['acc_other_ar'].loc[4]
array(['Seating available at all events', 'breaks', 'meals',
       'discussion areas'], dtype='<U31')
c_acc_comments = "Do you have any additional details you'd like to share or any other comments on how Snowmass Summer Study accessibility could be improved for you?"
survey['acc_comments'] = raw_survey[c_acc_comments]

Writing out the file

As a WARNING: Never check this in!

survey.to_parquet('/workspaces/snowmass-attendance/clean_survey.parquet')