Google Cloud Storage File Management Pattern¶
The Google Cloud Storage File Management Pattern describes a backend workflow for generating, managing, and serving large files asynchronously. This pattern is commonly used in reporting systems where file generation (e.g., CSV or PDF) is time-consuming and needs to be decoupled from the user's HTTP request^[001-TODO__code.md].
Architecture Overview¶
The system relies on Google Cloud Storage (GCS) as the central repository for files, using a "temporary to permanent" storage strategy^[001-TODO__code.md].
- Request & Record Creation: A user requests a report. The system creates a database record with a status of `RUNNING` and verifies that a duplicate request (based on an MD5 hash of the parameters) has not been made recently^[001-TODO__code.md].
- Temporary Storage: As data is processed, individual "chunks" or pages are uploaded to a temporary directory in GCS (e.g., `/doc/report/pdf/{id}-temp/`)^[001-TODO__code.md].
- Combination: Once all chunks are uploaded, a message queue consumer (RabbitMQ) triggers a process that combines the temporary files into a single document^[001-TODO__code.md].
- Final Storage & Cleanup: The combined file is saved to a permanent path (e.g., `/doc/report/pdf/{id}/{id}.pdf`), and the temporary directory is deleted^[001-TODO__code.md].
- Download: The user downloads the final file via a signed URL or a direct byte stream from the permanent path^[001-TODO__code.md].
Core Workflow Stages¶
1. File Upload and Storage¶
The system treats files as byte arrays. The FileManageService handles interactions with GCS^[001-TODO__code.md].
- Uploading: Files are created using `storage.create(BlobInfo, byte[])`. The service defines prefixes based on document type (e.g., `/doc/report/csv/` or `/doc/report/pdf/`)^[001-TODO__code.md].
- Reading: Files are retrieved using `storage.readAllBytes(BlobId)`^[001-TODO__code.md].
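The upload/read flow above can be sketched as follows. The real service calls `storage.create(BlobInfo, byte[])` and `storage.readAllBytes(BlobId)` against GCS; here a `HashMap` stands in for the bucket so the path logic is runnable, and the class and method names (`InMemoryFileStore`, `upload`, `read`) are illustrative assumptions, not names from the source.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the FileManageService byte-array upload/read pattern.
// A HashMap stands in for the GCS bucket; the real service would call
// storage.create(BlobInfo, bytes) and storage.readAllBytes(BlobId).
class InMemoryFileStore {
    private final Map<String, byte[]> bucket = new HashMap<>();

    // Uploads bytes under a type-specific prefix, e.g. /doc/report/csv/
    String upload(String typePrefix, String fileName, byte[] content) {
        String path = typePrefix + fileName;
        bucket.put(path, content);   // stand-in for storage.create(...)
        return path;
    }

    // Reads the stored bytes back, like storage.readAllBytes(BlobId)
    byte[] read(String path) {
        return bucket.get(path);
    }
}
```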
2. Asynchronous Document Generation¶
Document generation is handled by specific services (e.g., CSVDocumentServiceImpl) that implement ReportDocumentService^[001-TODO__code.md].
- Chunking: Large datasets are split into pages. Each page is generated individually and uploaded to a temporary GCS directory named using the record ID (e.g., `.../{id}-temp/0.csv`).
- Check: The system compares the expected `totalPage` count against the actual number of files in the temporary GCS directory before proceeding^[001-TODO__code.md].
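The chunking-and-check step can be sketched like this. An in-memory map stands in for the bucket, and counting keys under the temp prefix mirrors listing blobs with `storage.list(bucket, BlobListOption.prefix(...))`; the names `ChunkTracker`, `uploadPage`, and `isComplete` are illustrative assumptions.

```java
import java.util.Map;
import java.util.TreeMap;

// Sketch of the chunking check: pages are uploaded to a temp directory
// named after the record ID, then the expected totalPage count is compared
// with the number of blobs actually present under that prefix.
class ChunkTracker {
    private final Map<String, byte[]> bucket = new TreeMap<>(); // stand-in for GCS

    static String tempDir(long id) {
        return "/doc/report/csv/" + id + "-temp/";  // e.g. .../105-temp/
    }

    void uploadPage(long id, int page, byte[] content) {
        bucket.put(tempDir(id) + page + ".csv", content);
    }

    // Counts blobs under the temp prefix, like listing with a prefix filter
    boolean isComplete(long id, int totalPage) {
        String prefix = tempDir(id);
        long actual = bucket.keySet().stream()
                .filter(k -> k.startsWith(prefix))
                .count();
        return actual == totalPage;
    }
}
```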
3. File Combination Pattern¶
When a report generation job sends a message indicating all pages are ready, the system combines the temporary files^[001-TODO__code.md].
- Listing: The system lists all blobs in the temporary directory using `storage.list(bucket, BlobListOption.prefix(directory))`^[001-TODO__code.md].
- Merging: It downloads all temporary files, merges the content (e.g., concatenating CSV rows and removing the duplicate header from each subsequent chunk), and uploads the result to the final permanent path^[001-TODO__code.md].
- Cleanup: After a successful merge, the temporary directory and its contents are deleted via `storage.delete(blobIds)`^[001-TODO__code.md].
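The CSV merge described above (keep the header only from the first chunk) can be sketched as a pure function; chunk ordering is assumed to follow page number, and the `CsvCombiner.combine` name is illustrative.

```java
import java.util.List;

// Sketch of the combination step: concatenate CSV chunks, keeping the
// header row only from the first chunk and dropping it from the rest.
class CsvCombiner {
    static String combine(List<String> chunks) {
        StringBuilder merged = new StringBuilder();
        for (int i = 0; i < chunks.size(); i++) {
            String chunk = chunks.get(i);
            if (i == 0) {
                merged.append(chunk);             // first chunk keeps its header
            } else {
                int nl = chunk.indexOf('\n');     // drop the duplicate header line
                merged.append(nl >= 0 ? chunk.substring(nl + 1) : "");
            }
        }
        return merged.toString();
    }
}
```

In the real flow the merged string would then be uploaded to the permanent path and the temp blobs deleted via `storage.delete(blobIds)`.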
4. Download Handling¶
The download endpoint retrieves the final file path from the database (ensuring the status is `SUCCESS`) and streams the bytes from GCS to the client^[001-TODO__code.md].
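The status guard on the download path can be sketched as below; the `DownloadGuard` name and the exact status strings used for the failure case are illustrative assumptions.

```java
// Sketch of the download guard: bytes are only returned when the record's
// status is SUCCESS; any other status (e.g. RUNNING) is rejected. In GCS
// the byte retrieval would be storage.readAllBytes(BlobId).
class DownloadGuard {
    static byte[] download(String status, byte[] storedBytes) {
        if (!"SUCCESS".equals(status)) {
            throw new IllegalStateException("report not ready: " + status);
        }
        return storedBytes;
    }
}
```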
Key Implementations¶
GCS Path Management¶
The system uses a structured path format to organize files^[001-TODO__code.md]:
* Prefix: A global prefix (e.g., `gcpUtil.getPrefixFolder()`) is prepended to all paths.
* Document Paths: `/{prefix}/{type_path}/{id}/{id}{suffix}` (e.g., `/doc/report/pdf/105/105.pdf`).
* Temporary Paths: `/{prefix}/{type_path}/{id}-temp/`.
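The path conventions above can be expressed as two small builders; a plain `prefix` argument stands in for `gcpUtil.getPrefixFolder()`, and the `GcsPaths` class name is an illustrative assumption.

```java
// Sketch of the structured GCS path format:
//   final document:  /{prefix}/{type_path}/{id}/{id}{suffix}
//   temp directory:  /{prefix}/{type_path}/{id}-temp/
class GcsPaths {
    static String documentPath(String prefix, String typePath, long id, String suffix) {
        return prefix + "/" + typePath + "/" + id + "/" + id + suffix;
    }

    static String tempPath(String prefix, String typePath, long id) {
        return prefix + "/" + typePath + "/" + id + "-temp/";
    }
}
```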
Concurrency and Idempotency¶
- Locking: Database status checks (`isRunningStatus`) prevent processing of invalid or completed records^[001-TODO__code.md].
- Deduplication: A `duplicateQry` method checks a hash of the search parameters against requests within a specific time window to prevent redundant report generation^[001-TODO__code.md].
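The deduplication key can be sketched as follows: the search parameters are normalized into a canonical string (sorted by key, via `TreeMap`) and hashed with MD5, so identical requests within the time window map to the same key. The normalization format and the `RequestHasher.md5Key` name are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.TreeMap;

// Sketch of the dedup key: sorted key=value pairs hashed with MD5, so the
// same parameters always yield the same 32-character hex key.
class RequestHasher {
    static String md5Key(TreeMap<String, String> params) {
        StringBuilder canonical = new StringBuilder();
        params.forEach((k, v) -> canonical.append(k).append('=').append(v).append('&'));
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(canonical.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available on the JVM
        }
    }
}
```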
Related Concepts¶
- [[Asynchronous Messaging]]
- [[Cloud Storage]]
- [[Microservices Pattern]]
- [[CSV Generation]]
Sources¶
^[001-TODO__code.md]