I've been trying to get Document AI batch submissions to work, but I'm having some difficulty. I started with a RawDocument for single-file submission, assuming I could iterate over my dataset (27k images), but opted for batch processing as it seemed the more appropriate technique.
When I run the code I see the error: "Unable to process all documents". The first few lines of debug information are:
O:17:"Google\Rpc\Status":5:{ s:7:"*Code";i:3;s:10:"*Message";s:32:"Unable to process all documents."; s:26:"Google\Rpc\Statusdetails"; O:38:"Google\Protobuf\Internal\RepeatedField":4:{ s:49:"Google\Protobuf\Internal\RepeatedFieldcontainer";a:0:{}s:44:"Google\Protobuf\Internal\RepeatedFieldtype";i:11;s:45:"Google\Protobuf\Internal\RepeatedFieldklass ";s:19:"Google\Protobuf\Any";s:52:"Google\Protobuf\Internal\RepeatedFieldlegacy_klass";s:19:"Google\Protobuf\Any";}s:38:"Google\Protobuf\ Internal\Messagedesc";O:35:"Google\Protobuf\Internal\Descriptor":13:{s:46:"Google\Protobuf\Internal\Descriptorfull_name";s:17:"google.rpc.Status";s: 42:"Google\Protobuf\Internal\Descriptorfield";a:3:{i:1;O:40:"Google\Protobuf\Internal\FieldDescriptor":14:{s:46:"Google\Protobuf\Internal\FieldDescriptorname ";s:4:"code";```
The support page for this error states the cause:

gcsUriPrefix and gcsOutputConfig.gcsUri parameters need to start with gs:// and end with a trailing slash character (/). Check the configuration of the bucket URI.
I'm not using gcsUriPrefix (should I be? My bucket holds more files than the max batch limit), but my gcsOutputConfig.gcsUri is within those limits. The file list I provide gives the file names (pointing to the right bucket), so there should be no trailing slash.
```php
function filesFromBucket( $directoryPrefix ) {
    // NOT recursive, does not search the structure
    // see https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix
    $gcsDocumentList = [];

    $bucketName = 'my-input-bucket';
    $storage = new StorageClient();
    $bucket = $storage->bucket($bucketName);
    $options = ['prefix' => $directoryPrefix];

    foreach ($bucket->objects($options) as $object) {
        $doc = new GcsDocument();
        $doc->setGcsUri('gs://'.$object->name());
        $doc->setMimeType($object->info()['contentType']);
        array_push($gcsDocumentList, $doc);
    }

    $gcsDocuments = new GcsDocuments();
    $gcsDocuments->setDocuments($gcsDocumentList);

    return $gcsDocuments;
}

function batchJob ( ) {
    $inputConfig = new BatchDocumentsInputConfig(
        ['gcs_documents' => filesFromBucket('the-bucket-path/')]
    );

    // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentOutputConfig
    // nb: all uri paths must end with / or an error will be generated.
    $outputConfig = new DocumentOutputConfig(
        ['gcs_output_config' => new GcsOutputConfig(['gcs_uri' => 'gs://my-output-bucket/'])]
    );

    // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentProcessorServiceClient
    $documentProcessorServiceClient = new DocumentProcessorServiceClient();
    try {
        // derived from the prediction endpoint
        $name = 'projects/######/locations/us/processors/#######';
        $operationResponse = $documentProcessorServiceClient->batchProcessDocuments(
            $name,
            ['inputDocuments' => $inputConfig, 'documentOutputConfig' => $outputConfig]
        );
        $operationResponse->pollUntilComplete();
        if ($operationResponse->operationSucceeded()) {
            $result = $operationResponse->getResult();
            printf('<br>result: %s<br>', serialize($result));
            // doSomethingWith($result)
        } else {
            $error = $operationResponse->getError();
            printf('<br>error: %s<br>', serialize($error));
            // handleError($error)
        }
    } finally {
        $documentProcessorServiceClient->close();
    }
}
```
Typically, the error "Unable to process all documents" is caused by incorrect syntax for the input files or the output bucket: a malformed path may still be a "valid" Cloud Storage path, just not to the file you expected. (Thank you for checking the error message page first!)

If you want to provide a specific list of documents to process, you do not have to use `gcsUriPrefix`. However, based on your code, you are adding every file in the GCS directory to the `BatchDocumentsInputConfig.gcs_documents` field, so it would make more sense to send the prefix in `BatchDocumentsInputConfig.gcs_uri_prefix` instead of a list of individual files, as sketched below.

Note: there is a maximum number of files (1,000) that can be sent in a single batch request, and specific processors have their own page limits; see https://cloud.google.com/document-ai/quotas#content_limits
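A minimal sketch of that input config, assuming the bucket and directory names from your code (the `use` statements reflect the google/cloud-document-ai V1 surface; adjust them to your installed version):

```php
use Google\Cloud\DocumentAi\V1\BatchDocumentsInputConfig;
use Google\Cloud\DocumentAi\V1\GcsPrefix;

// Point the batch request at everything under a directory prefix.
// Note: the URI includes the bucket name and ends with a slash.
$inputConfig = new BatchDocumentsInputConfig([
    'gcs_prefix' => new GcsPrefix([
        'gcs_uri_prefix' => 'gs://my-input-bucket/the-bucket-path/'
    ])
]);
```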
You can try splitting the documents into multiple batch requests to avoid hitting these limits. The Document AI Toolbox Python SDK has a built-in function for this purpose (https://github.com/googleapis/python-documentai-toolbox/blob/ba354d8af85cbea0ad0cd2501e041f21e9e5d765/google/cloud/documentai_toolbox/utilities/gcs_utilities.py#L213), but you could reimplement it in PHP for your use case.
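A rough PHP sketch of that splitting logic, reusing the client, processor name, and output config objects from your batchJob() (the helper name and signature here are made up for illustration):

```php
// Hypothetical helper: split a large list of GcsDocument objects into
// chunks that respect the 1,000-files-per-request limit, submitting one
// batchProcessDocuments() call per chunk.
function submitInBatches($client, $name, array $gcsDocumentList, $outputConfig, $batchSize = 1000)
{
    $operations = [];
    foreach (array_chunk($gcsDocumentList, $batchSize) as $chunk) {
        $gcsDocuments = new GcsDocuments();
        $gcsDocuments->setDocuments($chunk);
        $inputConfig = new BatchDocumentsInputConfig(['gcs_documents' => $gcsDocuments]);
        // Each call returns a long-running operation; the caller can
        // pollUntilComplete() on each one and collect the results.
        $operations[] = $client->batchProcessDocuments(
            $name,
            ['inputDocuments' => $inputConfig, 'documentOutputConfig' => $outputConfig]
        );
    }
    return $operations;
}
```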
This turns out to be an ID-10-T bug with clear PEBKAC overtones.
$object->name() does not return the bucket name as part of the path.
Changing

```php
$doc->setGcsUri('gs://'.$object->name());
```

to

```php
$doc->setGcsUri('gs://'.$bucketName.'/'.$object->name());
```

solves the problem.
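For context, the fixed loop inside filesFromBucket() then reads:

```php
foreach ($bucket->objects($options) as $object) {
    $doc = new GcsDocument();
    // $object->name() is relative to the bucket, so the bucket name
    // must be prepended to form a full gs:// URI.
    $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name());
    $doc->setMimeType($object->info()['contentType']);
    array_push($gcsDocumentList, $doc);
}
```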