Google Cloud Storage File Management Pattern¶
The Google Cloud Storage File Management Pattern describes a backend workflow for generating, managing, and serving large files asynchronously. This pattern is commonly used in reporting systems where file generation (e.g., CSV or PDF) is time-consuming and needs to be decoupled from the user's HTTP request^[001-TODO__code.md].
Architecture Overview¶
The system relies on Google Cloud Storage (GCS) as the central repository for files, using a "temporary to permanent" storage strategy^[001-TODO__code.md].
- Request & Record Creation: A user requests a report. The system creates a database record with a status of `RUNNING` and verifies that a duplicate request (based on an MD5 hash of the parameters) has not been made recently^[001-TODO__code.md].
- Temporary Storage: As data is processed, individual "chunks" or pages are uploaded to a temporary directory in GCS (e.g., `/doc/report/pdf/{id}-temp/`)^[001-TODO__code.md].
- Combination: Once all chunks are uploaded, a message queue consumer (RabbitMQ) triggers a process that combines the temporary files into a single document^[001-TODO__code.md].
- Final Storage & Cleanup: The combined file is saved to a permanent path (e.g., `/doc/report/pdf/{id}/{id}.pdf`), and the temporary directory is deleted^[001-TODO__code.md].
- Download: The user downloads the final file via a signed URL or a direct byte stream from the permanent path^[001-TODO__code.md].
Core Workflow Stages¶
1. File Upload and Storage¶
The system treats files as byte arrays. The FileManageService handles interactions with GCS^[001-TODO__code.md].
- Uploading: Files are created using `storage.create(BlobInfo, byte[])`. The service defines prefixes based on document type (e.g., `/doc/report/csv/` or `/doc/report/pdf/`)^[001-TODO__code.md].
- Reading: Files are retrieved using `storage.readAllBytes(BlobId)`^[001-TODO__code.md].
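The upload/read flow above can be sketched as follows. The real service calls `storage.create(BlobInfo, byte[])` and `storage.readAllBytes(BlobId)` against GCS; here a `HashMap` stands in for the bucket so the path logic is runnable, and the class and method names (`InMemoryFileStore`, `upload`, `read`) are illustrative assumptions, not names from the source.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the FileManageService byte-array upload/read pattern.
// A HashMap stands in for the GCS bucket; the real service would call
// storage.create(BlobInfo, bytes) and storage.readAllBytes(BlobId).
class InMemoryFileStore {
    private final Map<String, byte[]> bucket = new HashMap<>();

    // Uploads bytes under a type-specific prefix, e.g. /doc/report/csv/
    String upload(String typePrefix, String fileName, byte[] content) {
        String path = typePrefix + fileName;
        bucket.put(path, content);   // stand-in for storage.create(...)
        return path;
    }

    // Reads the stored bytes back, like storage.readAllBytes(BlobId)
    byte[] read(String path) {
        return bucket.get(path);
    }
}
```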
2. Asynchronous Document Generation¶
Document generation is handled by specific services (e.g., CSVDocumentServiceImpl) that implement ReportDocumentService^[001-TODO__code.md].
- Chunking: Large datasets are split into pages. Each page is generated individually and uploaded to a temporary GCS directory named using the record ID (e.g., `.../{id}-temp/0.csv`).
- Check: The system compares the expected `totalPage` count against the actual number of files in the temporary GCS directory before proceeding^[001-TODO__code.md].
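The chunking-and-check step can be sketched like this. An in-memory map stands in for the bucket, and counting keys under the temp prefix mirrors listing blobs with `storage.list(bucket, BlobListOption.prefix(...))`; the names `ChunkTracker`, `uploadPage`, and `isComplete` are illustrative assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the chunking check: pages are uploaded to a temp directory
// named after the record ID, then the expected totalPage count is compared
// with the number of blobs actually present under that prefix.
class ChunkTracker {
    private final Map<String, byte[]> bucket = new TreeMap<>(); // stand-in for GCS

    static String tempDir(long id) {
        return "/doc/report/csv/" + id + "-temp/";  // e.g. .../105-temp/
    }

    void uploadPage(long id, int page, byte[] content) {
        bucket.put(tempDir(id) + page + ".csv", content);
    }

    // Counts blobs under the temp prefix, like listing with a prefix filter
    boolean isComplete(long id, int totalPage) {
        String prefix = tempDir(id);
        long actual = bucket.keySet().stream()
                .filter(k -> k.startsWith(prefix))
                .count();
        return actual == totalPage;
    }
}
```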
3. File Combination Pattern¶
When a report generation job sends a message indicating all pages are ready, the system combines the temporary files^[001-TODO__code.md].
- Listing: The system lists all blobs in the temporary directory using `storage.list(bucket, BlobListOption.prefix(directory))`^[001-TODO__code.md].
- Merging: It downloads all temporary files, merges the content (e.g., concatenating CSV rows and removing the duplicate header from each subsequent chunk), and uploads the result to the final permanent path^[001-TODO__code.md].
- Cleanup: After a successful merge, the temporary directory and its contents are deleted via `storage.delete(blobIds)`^[001-TODO__code.md].
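The CSV merge described above (keep the header only from the first chunk) can be sketched as a pure function; chunk ordering is assumed to follow page number, and the `CsvCombiner.combine` name is illustrative.

```java
import java.util.List;

// Sketch of the combination step: concatenate CSV chunks, keeping the
// header row only from the first chunk and dropping it from the rest.
class CsvCombiner {
    static String combine(List<String> chunks) {
        StringBuilder merged = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            String chunk = chunks.get(i);
            if (i == 0) {
                merged.append(chunk);             // first chunk keeps its header
            } else {
                int nl = chunk.indexOf('\n');     // drop the duplicate header line
                merged.append(nl >= 0 ? chunk.substring(nl + 1) : "");
            }
        }
        return merged.toString();
    }
}
```

In the real flow the merged string would then be uploaded to the permanent path and the temp blobs deleted via `storage.delete(blobIds)`.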
4. Download Handling¶
The download endpoint retrieves the final file path from the database (ensuring the status is `SUCCESS`) and streams the bytes from GCS to the client^[001-TODO__code.md].
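The status guard on the download path can be sketched as below; the `DownloadGuard` name and the exact status strings used for the failure case are illustrative assumptions.

```java
// Sketch of the download guard: bytes are only returned when the record's
// status is SUCCESS; any other status (e.g. RUNNING) is rejected. In GCS
// the byte retrieval would be storage.readAllBytes(BlobId).
class DownloadGuard {
    static byte[] download(String status, byte[] storedBytes) {
        if (!"SUCCESS".equals(status)) {
            throw new IllegalStateException("report not ready: " + status);
        }
        return storedBytes;
    }
}
```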
Key Implementations¶
GCS Path Management¶
The system uses a structured path format to organize files^[001-TODO__code.md]:
* Prefix: A global prefix (e.g., `gcpUtil.getPrefixFolder()`) is prepended to all paths.
* Document Paths: `/{prefix}/{type_path}/{id}/{id}{suffix}` (e.g., `/doc/report/pdf/105/105.pdf`).
* Temporary Paths: `/{prefix}/{type_path}/{id}-temp/`.
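The path conventions above can be expressed as two small builders; a plain `prefix` argument stands in for `gcpUtil.getPrefixFolder()`, and the `GcsPaths` class name is an illustrative assumption.

```java
// Sketch of the structured GCS path format:
//   final document:  /{prefix}/{type_path}/{id}/{id}{suffix}
//   temp directory:  /{prefix}/{type_path}/{id}-temp/
class GcsPaths {
    static String documentPath(String prefix, String typePath, long id, String suffix) {
        return prefix + "/" + typePath + "/" + id + "/" + id + suffix;
    }

    static String tempPath(String prefix, String typePath, long id) {
        return prefix + "/" + typePath + "/" + id + "-temp/";
    }
}
```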
Concurrency and Idempotency¶
- Locking: Database status checks (`isRunningStatus`) prevent processing of invalid or completed records^[001-TODO__code.md].
- Deduplication: A `duplicateQry` method checks a hash of the search parameters against requests within a specific time window to prevent redundant report generation^[001-TODO__code.md].
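The deduplication key can be sketched as follows: the search parameters are normalized into a canonical string (sorted by key, via `TreeMap`) and hashed with MD5, so identical requests within the time window map to the same key. The normalization format and the `RequestHasher.md5Key` name are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

// Sketch of the dedup key: sorted key=value pairs hashed with MD5, so the
// same parameters always yield the same 32-character hex key.
class RequestHasher {
    static String md5Key(TreeMap<String, String> params) {
        StringBuilder canonical = new StringBuilder();
        params.forEach((k, v) -> canonical.append(k).append('=').append(v).append('&'));
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }
}
```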
Related Concepts¶
- [[Asynchronous Messaging]]
- [[Cloud Storage]]
- [[Microservices Pattern]]
- [[CSV Generation]]
Sources¶
^[001-TODO__code.md]