Construction sites are dynamic environments whose varied activities and equipment sounds carry information essential for identifying equipment, understanding work processes, and assessing site conditions. However, recognizing equipment actions from audio data faces challenges such as dependence on manual recording, the difficulty of collecting high-quality datasets, and background noise. This paper introduces an automated framework, aided by computer vision algorithms, for generating an audio dataset of construction equipment from online sources. The framework uses computer vision to identify relevant visual content and audio classification models to filter out irrelevant content, ensuring high-quality data. Through the proposed framework, an audio dataset was generated with annotations covering equipment types and actions. Performance evaluation with classification models yielded F-scores ranging from 61% to 91% at the equipment level and from 52% to 87% at the action level. The framework offers an effective approach to creating audio datasets, supporting advances in audio-based activity recognition and contributing to improved safety and productivity on real-world construction sites.
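The two-stage filtering idea described above — a vision-based relevance check on the visual content followed by an audio-based check to discard irrelevant sound — can be sketched as follows. This is a minimal illustration under stated assumptions: the `Clip` record, the label sets, and the predicate functions are hypothetical stand-ins, not the authors' actual detector or audio classifier.

```python
# Minimal sketch of two-stage (vision + audio) filtering for candidate
# clips gathered from online sources. All names and labels here are
# illustrative placeholders, not the paper's models or taxonomy.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    clip_id: str
    visual_label: str  # e.g., label from an object detector run on sampled frames
    audio_label: str   # e.g., label from an audio classification model


def filter_clips(
    clips: List[Clip],
    is_relevant_visual: Callable[[str], bool],
    is_relevant_audio: Callable[[str], bool],
) -> List[Clip]:
    """Keep only clips that pass both the vision and the audio filter."""
    return [
        c
        for c in clips
        if is_relevant_visual(c.visual_label) and is_relevant_audio(c.audio_label)
    ]


# Hypothetical label sets, for illustration only.
EQUIPMENT = {"excavator", "dozer", "dump_truck"}
IRRELEVANT_AUDIO = {"speech", "music"}

clips = [
    Clip("clip_a", "excavator", "engine"),      # passes both filters
    Clip("clip_b", "excavator", "music"),       # irrelevant audio -> dropped
    Clip("clip_c", "pedestrian", "engine"),     # irrelevant visuals -> dropped
]
kept = filter_clips(
    clips,
    is_relevant_visual=lambda lab: lab in EQUIPMENT,
    is_relevant_audio=lambda lab: lab not in IRRELEVANT_AUDIO,
)
print([c.clip_id for c in kept])  # → ['clip_a']
```

In the actual framework, the two predicates would be replaced by trained models scoring video frames and audio segments; the point of the sketch is only that both modalities must agree before a clip enters the dataset.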