2021-05-27
Manual annotations of 'Contextual Information Level' and 'Context Description Quality' of VIST-VAL dataset, as described in the paper "Automated Generation of Storytelling Vocabulary from Photographs for Use in AAC"
FILES
'all-annotations': includes annotations of all 1946 photos from VIST-VAL that have five associated narrative sentences. Annotations were made by the first author.
'external-annotations': annotations of 514 photos made by a person unfamiliar with the study, used to calculate the interrater reliability score.
DATASET ENTRIES
Each key in the JSON file corresponds to one photo in the VIST-VAL dataset. Each photo entry has the following attributes:
photo_id: the original photo id in VIST-VAL.
azureCaption: the caption generated automatically by the adopted machine learning technique.
photo_quality: a score between 0 and 3 based on the number of contextual categories (environment, people/object, activity) it clearly depicts (0 when ambiguous). It is the sum of "photo_quality_location", "photo_quality_subject", and "photo_quality_activity".
photo_quality_location: a 0/1 score indicating whether the location of the photographed scene is clearly depicted.
photo_quality_subject: a 0/1 score indicating whether the subject (person or object) of the photographed scene is clearly depicted.
photo_quality_activity: a 0/1 score indicating whether the activity present in the photographed scene is clearly depicted.
azureCaption_quality: a score between 0 and 3 given to the generated azureCaption, according to these rules: 0) not generated or completely unrelated; 1) misses most important elements, OR contains most of the important elements plus a few unrelated elements; 2) contains most of the important elements, OR all important elements plus a few unrelated elements; 3) contains all important elements in the photo and no unrelated elements.
groundTruthSIS: the set of five narrative sentences from VIST-VAL associated with the photo_id.
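
The entry structure above can be sketched in Python. This is a minimal illustration, not part of the dataset release: the field values below are invented placeholders, and the exact file name and extension of 'all-annotations' is an assumption (here taken to be a JSON file).

```python
import json

# Hypothetical example entry following the schema described above.
# A real entry would instead be read from the released file, e.g.:
#   with open("all-annotations.json") as f:      # file name assumed
#       data = json.load(f)
#   entry = data[some_photo_key]
entry = {
    "photo_id": "12345",                                   # placeholder id
    "azureCaption": "a group of people standing outside",  # placeholder caption
    "photo_quality": 2,
    "photo_quality_location": 1,
    "photo_quality_subject": 1,
    "photo_quality_activity": 0,
    "azureCaption_quality": 2,
    "groundTruthSIS": ["s1", "s2", "s3", "s4", "s5"],      # five narrative sentences
}

# photo_quality is defined as the sum of the three 0/1 contextual sub-scores
assert entry["photo_quality"] == (
    entry["photo_quality_location"]
    + entry["photo_quality_subject"]
    + entry["photo_quality_activity"]
)
```

The assertion mirrors the definition of photo_quality given above and can serve as a simple consistency check when iterating over the annotations.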