Question 1

What is Julia's Data?

Accepted Answer

A curated collection of Brazilian Portuguese medical audio datasets built for ASR, TTS, and conversational AI teams, with ground-truth text, recording provenance, and structured delivery metadata.

Question 2

What does each record contain?

Accepted Answer

Each record has five components: raw medical notes read aloud, a long-form text (roughly 3,000 to 5,000 words), question-and-answer pairs, a medical terminology lexicon, and a two-speaker dialog. All five come with ground-truth transcriptions.

Question 3

What format is the data in?

Accepted Answer

Each delivery includes dataset-wide JSONL manifests, per-record review folders, original WebM/Opus segment files, WAV derivatives, checksums, schema documentation, recording-session metadata, provenance, and rights metadata.
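
As a rough illustration of how a JSONL manifest like this could be read, here is a minimal Python sketch. It assumes only that a manifest holds one JSON object per line. The function name `read_manifest` and the `record_id` field in the usage note are hypothetical and are not taken from the delivered schema documentation, which remains the authority on actual field names.

```python
import json
from pathlib import Path
from typing import Iterator


def read_manifest(path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a JSONL manifest.

    Assumption: the manifest is UTF-8 text with one JSON object per line.
    """
    with path.open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between entries
            try:
                yield json.loads(line)
            except json.JSONDecodeError as error:
                # Report where the manifest is malformed instead of failing silently.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
```

A caller could then iterate with something like `for entry in read_manifest(Path("manifest.jsonl")): print(entry.get("record_id"))`, where both the file name and the `record_id` key are placeholders.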

Question 4

Who records the audio?

Accepted Answer

Native Brazilian Portuguese speakers record in a controlled home-office setup. The speaker pool includes a medical professional and additional approved speakers for dialog coverage.

Question 5

Is the dataset commercially ready?

Accepted Answer

Yes. Every delivery includes source-to-audio provenance, speaker identity and consent status, recording-session metadata, file-level technical metadata, and rights fields designed for buyer review.

Question 6

How do you handle rights, consent, and PHI risk?

Accepted Answer

We track speaker consent and the rights chain at the record level. Medical source material is reviewed, with de-identification and release status recorded in dedicated fields, before it becomes recordable content.

Question 7

What quality checks happen before delivery?

Accepted Answer

Each record is reviewed for transcript alignment, speaker labeling, metadata completeness, and file integrity. Delivered files include checksums and structured manifests for ingestion validation.
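
On the buyer side, ingestion validation against checksums might look like the sketch below. It is written under stated assumptions: that checksums are SHA-256 hex digests and that each manifest entry carries a relative `path` and a `sha256` field. Both field names, the hash algorithm, and the function names are hypothetical, so adjust them to whatever the delivered schema documentation specifies.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_integrity_problems(root: Path, entries: list[dict]) -> list[str]:
    """Compare each entry's expected checksum with the file on disk.

    Assumption: entries use hypothetical "path" and "sha256" keys.
    Returns a list of human-readable problems; an empty list means all files matched.
    """
    problems = []
    for entry in entries:
        relative_path = entry.get("path")
        expected = entry.get("sha256")
        if not relative_path or not expected:
            problems.append(f"incomplete entry: {entry!r}")
            continue
        target = root / relative_path
        if not target.is_file():
            problems.append(f"missing file: {relative_path}")
        elif sha256_of(target) != expected.lower():
            problems.append(f"checksum mismatch: {relative_path}")
    return problems
```

Collecting every problem rather than stopping at the first one makes it easier to report a complete list back to the supplier in a single pass.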

Question 8

How do I get access?

Accepted Answer

An early-access batch is available now. Reach out at julia@juliasdata.com to review sample materials, packaging options, and licensing.

Julia's Data — Brazilian Portuguese Medical Audio Datasets